NIXIESEARCH:
Running Lucene over S3
And why we are building our own serverless search engine
Haystack EU 2024 | Roman Grebennikov
whoami
- PhD in CS, quant trading, credit scoring
- Findify: e-commerce search, personalization
- Delivery Hero: food search, LLMs
- Opensource: Metarank, lightgbm4j, flink-scala-api
Conference-reality mismatch
Curse of stateful apps
- Depend on per-node storage
- Up/down-scaling requires a rebalance
- The state lives inside: good luck if it gets borked
Blessing of stateless apps
- Managed by cloud providers (and not by you)
- Easier to modify (data + metadata)
Stateless search?
- Lucene world: NRTSearch, OpenSearch
- Non-Lucene: quickwit, turbopuffer
Main point: decouple search and storage
open-source?
Nixiesearch
- Started as a POC for Lucene over S3
- Went further: RAG, local inference, hybrid search
Lucene and S3: 2007
Lucene Directory
IO abstraction for data access:
- lawful good: MMapDirectory, ByteBuffersDirectory
- chaotic evil: JDBCDirectory
// Simplified excerpt of the real class
package org.apache.lucene.store;

import java.io.Closeable;
import java.io.IOException;

public abstract class Directory implements Closeable {
  public abstract String[] listAll() throws IOException;
  public abstract void deleteFile(String name) throws IOException;
  public abstract long fileLength(String name) throws IOException;
  public abstract void rename(String source, String dest) throws IOException;
  public abstract IndexOutput createOutput(String name, IOContext context) throws IOException;
  public abstract IndexInput openInput(String name, IOContext context) throws IOException;
  public abstract void close() throws IOException;
}
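For orientation, a minimal sketch of driving this API directly: the classes (FSDirectory, IndexInput, IOContext) are real Lucene, but the index path is illustrative and assumed to already contain an index.

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class DirectoryDemo {
  public static void main(String[] args) throws IOException {
    // Every file operation goes through the Directory abstraction, never the filesystem directly.
    try (Directory dir = FSDirectory.open(Path.of("/var/lib/index"))) {
      String[] files = dir.listAll();
      for (String file : files) {
        System.out.println(file + " -> " + dir.fileLength(file) + " bytes");
      }
      // Reads happen through IndexInput handles obtained from the Directory.
      if (files.length > 0 && dir.fileLength(files[0]) > 0) {
        try (IndexInput in = dir.openInput(files[0], IOContext.DEFAULT)) {
          System.out.println(files[0] + ", first byte: " + in.readByte());
        }
      }
    }
  }
}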
Lucene S3 Directory
S3: with byte-range GETs, it works like a remote block store!
Nixie v0.0.1: S3Directory
Own read-only S3Directory, with block caching (sketch after the list below)
- pro: no local storage, nodes are stateless
- pro: easy up/down scaling; the state is just files on S3
- con: LATENCY
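The core trick, as a minimal sketch: translate reads into block-aligned ranged GETs and keep recently used blocks in an in-memory LRU cache. The class name, block size and cache size below are illustrative (AWS SDK v2 assumed), not the actual Nixiesearch implementation.

import java.util.LinkedHashMap;
import java.util.Map;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class S3BlockReader {
  private static final int BLOCK_SIZE = 1 << 20; // 1 MiB blocks
  private final S3Client s3 = S3Client.create();
  private final String bucket;
  private final String key;
  // Tiny LRU cache: block index -> block bytes.
  private final Map<Long, byte[]> cache = new LinkedHashMap<>(16, 0.75f, true) {
    @Override protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
      return size() > 64; // keep at most 64 blocks (~64 MiB)
    }
  };

  public S3BlockReader(String bucket, String key) {
    this.bucket = bucket;
    this.key = key;
  }

  /** Read one byte at an absolute offset, fetching its block from S3 on a cache miss. */
  public byte readByte(long offset) {
    long block = offset / BLOCK_SIZE;
    byte[] data = cache.computeIfAbsent(block, b -> fetchBlock(b));
    return data[(int) (offset % BLOCK_SIZE)];
  }

  private byte[] fetchBlock(long block) {
    long start = block * BLOCK_SIZE;
    long end = start + BLOCK_SIZE - 1; // Range is inclusive; S3 clamps it past EOF
    GetObjectRequest req = GetObjectRequest.builder()
        .bucket(bucket).key(key)
        .range("bytes=" + start + "-" + end)
        .build();
    return s3.getObjectAsBytes(req).asByteArray();
  }
}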
S3 vs S3-Express latency
S3 Express: low-latency single-AZ S3
first-byte latency: 20ms vs 5ms
last-byte latency: const_delay + transfer
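Plugging the slide's first-byte numbers into that model: assuming ~90 MB/s of single-stream throughput (an assumption, not a measurement), a 1 MiB read costs roughly 31 ms on S3 vs 16 ms on S3 Express.

public class LatencyModel {
  // last_byte ~= first_byte + size / throughput (the model from the slide)
  static double lastByteMs(double firstByteMs, long bytes, double mibPerSec) {
    return firstByteMs + bytes / (mibPerSec * 1024 * 1024) * 1000.0;
  }

  public static void main(String[] args) {
    long block = 1 << 20; // a 1 MiB read
    System.out.printf("S3:         %.1f ms%n", lastByteMs(20, block, 90)); // ~31 ms
    System.out.printf("S3 Express: %.1f ms%n", lastByteMs(5, block, 90));  // ~16 ms
  }
}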
Experiment setup
- MSMARCO dataset: 10k, 100k, 1M documents
- HNSW search over e5-base-v2 (768 dims), 1 segment
- Default Lucene HNSW settings (M=16, efC=100) - sketch below
- 10k docs: the first request reads 30% of the index
- 1M docs: 3 SECOND LATENCY?
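A minimal sketch of that setup with plain Lucene: default HNSW codec settings (M=16, beamWidth=100), one force-merged segment, 768-dim vectors, then a single kNN query. The random vectors, field name and path are placeholders for the real e5-base-v2 embeddings of MSMARCO.

import java.nio.file.Path;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class HnswDemo {
  public static void main(String[] args) throws Exception {
    Random rnd = new Random(42);
    try (Directory dir = FSDirectory.open(Path.of("/tmp/hnsw-demo"))) {
      // Default codec = HNSW with M=16 and beamWidth (efConstruction) = 100.
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        for (int i = 0; i < 10_000; i++) {
          Document doc = new Document();
          doc.add(new KnnFloatVectorField("emb", randomVector(rnd, 768),
              VectorSimilarityFunction.COSINE));
          writer.addDocument(doc);
        }
        writer.forceMerge(1); // single segment, as in the experiment
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs top = searcher.search(
            new KnnFloatVectorQuery("emb", randomVector(rnd, 768), 10), 10);
        System.out.println("hits: " + top.totalHits);
      }
    }
  }

  static float[] randomVector(Random rnd, int dims) {
    float[] v = new float[dims];
    for (int i = 0; i < dims; i++) v[i] = rnd.nextFloat();
    return v;
  }
}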
HNSW for dummies
Each probe = a sequential random read = +5ms
N-th request latency
1M documents, bs=8192:
Not getting better after request #32 😭
Lucene v10 I/O concurrency
apache/lucene#13179, TL;DR:
- Sequential I/O is slow, so let's make it concurrent
- IndexInput.prefetch: a hint about upcoming reads (sketch below)
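A minimal sketch of the API, assuming Lucene 9.12+/10 where IndexInput#prefetch is available; the index path and file choice are illustrative. prefetch is only an advisory hint, so the read path itself stays unchanged.

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class PrefetchDemo {
  public static void main(String[] args) throws IOException {
    try (Directory dir = FSDirectory.open(Path.of("/tmp/hnsw-demo"));
         IndexInput in = dir.openInput("_0.vec", IOContext.DEFAULT)) {
      long half = in.length() / 2;
      // Announce both upcoming reads so the Directory can fetch them concurrently...
      in.prefetch(0, 1);
      in.prefetch(half, 1);
      // ...then read as usual: semantics are identical, only the I/O may overlap.
      in.seek(0);
      byte first = in.readByte();
      in.seek(half);
      byte middle = in.readByte();
      System.out.println(first + " / " + middle);
    }
  }
}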
S3: bad latency, great concurrency
- ~1 GB/s of aggregate throughput: a 10GB index can be fetched in 10 seconds
Serverless dilemma
- No cold start: the whole index must be synced to local disk first
- Fast startup: reads go straight to S3, so search latency is high
Nixiesearch: segment replication
How far does stateless go?
- Stateless index: on S3 ✅
- Immutable configuration 🤔

Immutable config? 🤔
Regular backend app config change:
- Commit config to git
- PR review, CI/CD blue-green deploy
Immutable config? 🤔
Mapping change for an index:
- Send an HTTP POST request to the prod cluster
- The earth shakes, the lights go out
- You hear sirens
Config management in Nixie
inference:
  embedding:
    text:
      provider: onnx
      model: nixiesearch/e5-small-v2-onnx
schema:
  helloworld:
    fields:
      title:
        type: text
        search:
          type: semantic
          model: text
      price:
        type: int # can also be float/long/double
        filter: true
        facet: true
        sort: true
Search is not special
- No way to change the config at runtime
- Index schema and system settings are just config
When things go bad
- Index and config compatible = healthcheck OK
- int -> string type change = you need to reindex ☢️
index = directory on S3
Push vs pull indexing

- Push: documents are sent to the search cluster itself, which controls backpressure
- Pull: a separate indexer service pulls documents offline
Offline reindex
The Bad, The Ugly

- No sharding support (yet)
No sharding, really?

- 1M-doc MSMARCO index: 3GB
- 1B/1T-doc indexes: 95% of the time it's logs/APM/traces
Why local inference?
- Latency: CPU ONNX e5-base-v2 inference takes ~5ms
- Privacy: data never leaves your perimeter
Optional: openai, cohere, mxb, google providers
Text in, text out, 100% local

- Embedding and LLM inference are local
- ONNX and GGUF models; GPU for indexing
docker run --gpus=all nixiesearch/nixiesearch:0.3.3-amd64-gpu index
Future: reranking support
Single retrieval pipeline:
- Cross-encoder: rerank the top-N candidates
- retrieve -> rerank -> summarize (sketch below)
- Local ONNX, optional external providers
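A minimal sketch of that flow; crossEncoderScore is a stub standing in for a local ONNX cross-encoder, not an existing Nixiesearch API.

import java.util.Comparator;
import java.util.List;

public class RerankSketch {
  record Hit(String docId, String text, float retrievalScore) {}
  record Scored(Hit hit, float score) {}

  static float crossEncoderScore(String query, String passage) {
    // Placeholder: a real implementation would feed the (query, passage) pair
    // through a cross-encoder model and return its relevance score.
    return passage.toLowerCase().contains(query.toLowerCase()) ? 1.0f : 0.0f;
  }

  /** Re-score the top-N retrieval candidates with the cross-encoder and re-sort. */
  static List<Scored> rerank(String query, List<Hit> topN) {
    return topN.stream()
        .map(h -> new Scored(h, crossEncoderScore(query, h.text())))
        .sorted(Comparator.comparingDouble((Scored s) -> s.score()).reversed())
        .toList();
  }
}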
Future: ingestion pipeline
Common high-level tasks, automated:
- Transform a field before indexing
- Split to chunks, summarize
- Contextual embeddings for RAG
- Automated category detection
- type: image
  model: openclip
- type: text
  from: ".title | summarize(gemma2, prompt='...')"
EC2 g4dn.xlarge with a T4 GPU = ~$300/month
Future: domain adaptation
Different search engines, same embedding, same results
- Fine-tuning is dead: you need training data
- You only have docs: can you generate queries and labels from them?
Future: domain adaptation
Fine-tuning on LLM-generated synthetic data
- For each document, generate a query: that's a positive pair
- For each query + positive, mine negatives
- Fine-tune the embedding model
Expect a bumpy ride
- Some features might be missing (like sorting)
- Docs are imperfect - but they exist!
- There will be breaking changes
Links
