NIXIESEARCH:

Running Lucene over S3

And why we are building our own serverless search engine

Haystack EU 2024 | Roman Grebennikov

whoami

🔎

  • PhD in CS, quant trading, credit scoring
  • Findify: e-commerce search, personalization
  • Delivery Hero: food search, LLMs
  • Open-source: Metarank, lightgbm4j, flink-scala-api

Conference-reality mismatch

Curse of stateful apps

  • Depends on per-node storage
  • Up/Down-scaling requires rebalance
  • "state inside" - good luck if borked

Blessing of stateless apps

  • Managed by cloud providers (and not by you)
  • Easier to modify (data + metadata)

Stateless search?

  • Lucene world: NRTSearch, OpenSearch
  • Non-Lucene: quickwit, turbopuffer

Main point: decouple search and storage

Industry goes stateless

  • Uber: Lucene: Uber's Search Platform Version Upgrade [1]
  • Doordash: Introducing DoorDash's in-house search engine [2]
  • Amazon: E-Commerce search at scale on Apache Lucene [3]

[1]: https://www.uber.com/en-NL/blog/lucene-version-upgrade/
[2]: https://careers.doordash.com/blog/introducing-doordashs-in-house-search-engine/
[3]: https://www.youtube.com/watch?v=EkkzSLstSAE

open-source?

Nixiesearch

  • Started as a POC for Lucene over S3
  • Went further: RAG, local inference, hybrid search

Lucene and S3: 2007

Lucene Directory

IO abstraction for data access:

  • lawful good: MMapDirectory, ByteBuffersDirectory
  • chaotic evil: JDBCDirectory

				// Simplified: the real Lucene Directory has a few more methods
				public abstract class Directory implements Closeable {
					public abstract String[] listAll() throws IOException;
					public abstract void deleteFile(String name) throws IOException;
					public abstract long fileLength(String name) throws IOException;
					public abstract void rename(String source, String dest) throws IOException;
					public abstract IndexOutput createOutput(String name, IOContext context) throws IOException;
					public abstract IndexInput openInput(String name, IOContext context) throws IOException;
					public abstract void close() throws IOException;
				}
				

Lucene S3 Directory

S3 - a remote block store!

Nixie v0.0.1: S3Directory

Own read-only S3Directory, with block caching

  • pro: no local storage, stateless nodes
  • pro: easy up/down-scaling, state is just files on S3
  • con: LATENCY
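
The read path of a block-cached S3Directory can be sketched like this (a hypothetical helper, not the actual Nixiesearch code): every read maps onto fixed-size blocks, and each missing block costs one S3 ranged GET.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Sketch of a block cache for a read-only S3 Directory (hypothetical code).
// A read at (offset, length) touches blocks [offset/BS .. (offset+length-1)/BS];
// each missing block is fetched once via an S3 ranged GET ("Range: bytes=start-end").
public class S3BlockCache {
    static final int BLOCK_SIZE = 8192; // bs=8192, as in the experiment below

    private final Map<Long, byte[]> cache = new HashMap<>();
    private final BiFunction<Long, Long, byte[]> rangedGet; // (start, endInclusive) -> bytes

    public S3BlockCache(BiFunction<Long, Long, byte[]> rangedGet) {
        this.rangedGet = rangedGet;
    }

    static long blockIndex(long offset) { return offset / BLOCK_SIZE; }

    byte[] block(long index) {
        // one ranged GET per cache miss
        return cache.computeIfAbsent(index,
            i -> rangedGet.apply(i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE - 1));
    }

    // Copy `len` bytes starting at `offset` into `dst`, block by block.
    public void readBytes(long offset, byte[] dst, int len) {
        int copied = 0;
        while (copied < len) {
            byte[] b = block(blockIndex(offset + copied));
            int inBlock = (int) ((offset + copied) % BLOCK_SIZE);
            int n = Math.min(len - copied, BLOCK_SIZE - inBlock);
            System.arraycopy(b, inBlock, dst, copied, n);
            copied += n;
        }
    }
}
```

In a real IndexInput the fetch would go through the AWS SDK and the cache would be bounded (LRU); the key point is that every cache miss is one round trip to S3, which is exactly what the latency numbers below measure.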

S3 vs S3-Express latency

S3 Express: low-latency single-AZ S3

first-byte latency: 20ms vs 5ms

S3 vs S3-Express latency

last-byte latency: const_delay + transfer
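
Plugging the slide's first-byte numbers into this model shows why the constant delay dominates for block-sized reads (the ~100 MB/s per-connection throughput is an assumption for illustration):

```java
// Last-byte latency model: const_delay + transfer.
// Worked for a single 8 KB block; throughput assumed ~100 MB/s per connection.
public class LatencyModel {
    public static double lastByteMs(double firstByteMs, double bytes, double mbPerSec) {
        return firstByteMs + bytes / (mbPerSec * 1e6) * 1000.0;
    }

    public static void main(String[] args) {
        double s3 = lastByteMs(20, 8192, 100);        // transfer adds only ~0.08 ms
        double s3express = lastByteMs(5, 8192, 100);
        // For small blocks the transfer term is negligible: first-byte latency dominates.
        System.out.printf("S3: %.2f ms, S3 Express: %.2f ms%n", s3, s3express);
    }
}
```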

Experiment setup

  • MSMARCO dataset, 10k, 100k, 1m documents
  • HNSW search over e5-base-v2 (768 dims), 1 segment
  • Default Lucene HNSW settings (M=16, efC=100)
  • 10k docs: the first request reads 30% of the index
  • 1M docs: 3 SECOND LATENCY?

HNSW for dummies

Each probe = sequential random read = +5ms

N-th request latency

1M documents, bs=8192:

Not getting better after request #32 😭

Lucene v10 I/O concurrency

LUCENE-13179 TLDR

  • Sequential IO is slow, let's make it concurrent
  • IndexInput.prefetch - hint for future reads

S3: 👎 latency, 👍 concurrency

  • 1 GB/s. Can fetch 10GB index in 10 seconds
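
A back-of-the-envelope model of why concurrent I/O helps (illustrative only: assumes ~5 ms per round trip as above, and that bandwidth is not the bottleneck):

```java
public class PrefetchMath {
    // Total latency of N random reads at a given concurrency depth,
    // assuming each round trip costs `rttMs` and transfers overlap perfectly.
    public static double totalMs(int reads, int depth, double rttMs) {
        return Math.ceil((double) reads / depth) * rttMs;
    }

    public static void main(String[] args) {
        // ~600 probes (3s / 5ms) done sequentially vs. with 32-way concurrency:
        System.out.println(totalMs(600, 1, 5));  // 3000 ms: the 3-second latency above
        System.out.println(totalMs(600, 32, 5)); // 95 ms
    }
}
```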

Serverless dilemma

  • No cold start: must sync the whole index first
  • Fast startup: reads go to S3, high search latency

Nixiesearch: segment replication
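
Because Lucene segment files are immutable, segment replication reduces to a file-set diff between S3 and the local replica; a minimal sketch (hypothetical code, not Nixiesearch itself):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Sketch of segment replication: Lucene segment files never change in place,
// so syncing a replica = download files we don't have, delete files no longer listed.
public class SegmentSync {
    public record Plan(List<String> download, List<String> delete) {}

    public static Plan plan(Set<String> remote, Set<String> local) {
        List<String> download = new ArrayList<>();
        for (String f : remote) if (!local.contains(f)) download.add(f);
        List<String> delete = new ArrayList<>();
        for (String f : local) if (!remote.contains(f)) delete.add(f);
        Collections.sort(download);
        Collections.sort(delete);
        return new Plan(download, delete);
    }
}
```

After a merge on the indexer, new segment files appear on S3 and old ones disappear; a searcher picks up the new commit point by running exactly this diff.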

How far does stateless go?

  • Stateless index: on S3 ✅
  • Immutable configuration 🤔

Immutable config? 🤔

Regular backend app config change:

  • Commit config to git
  • PR review, CI/CD blue-green deploy

Immutable config? 🤔

Mapping change for an index:

  • Send HTTP POST request to prod cluster
  • The earth shakes, the lights go out
  • You hear sirens

Config management in Nixie


inference:
  embedding:
    text:
      provider: onnx
      model: nixiesearch/e5-small-v2-onnx

schema:
  helloworld:
    fields:
      title:
        type: text
        search:
          type: semantic
          model: text
      price:
        type: int     # can also be float/long/double
        filter: true
        facet: true
        sort: true
				

Search is not special

  • No way to change the runtime config
  • Index schema, system settings = just conf

When things go bad

  • Index + config compatible = healthcheck OK
  • int -> string = need to reindex ☢️
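
The healthcheck idea can be sketched as a schema diff (hypothetical helper; field types are plain strings here for brevity):

```java
import java.util.Map;

// Sketch of the compatibility healthcheck: an index built with one schema
// is only served if the running config's schema is compatible with it.
public class SchemaCheck {
    public static boolean compatible(Map<String, String> indexSchema,
                                     Map<String, String> configSchema) {
        for (var e : configSchema.entrySet()) {
            String indexed = indexSchema.get(e.getKey());
            if (indexed != null && !indexed.equals(e.getValue())) {
                return false; // type changed (e.g. int -> string): reindex needed
            }
        }
        return true; // new fields are fine: they are simply empty for old docs
    }
}
```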

index = directory on S3

Push vs pull indexing

  • Push: same cluster, control backpressure
  • Pull: separate service, offline

Offline reindex

demo

The Bad, The Ugly

  • No sharding support (yet)

No sharding, really?

  • 1M docs MSMARCO index: 3GB
  • 1B/1T-doc indexes: 95% of the time it's logs/APM/traces

Why local inference?

  1. Latency: CPU ONNX e5-base-v2 inference is 5ms
  2. Privacy: data is not leaving your perimeter

Optional: openai, cohere, mxb, google providers

Text in, text out, 100% local

  • Embedding and LLM inference are local
  • ONNX, GGUF - GPU for indexing 🤔

					docker run --gpus=all nixiesearch/nixiesearch:0.3.3-amd64-gpu index
				

Future: reranking support

Single retrieval pipeline:

  • Cross-encoder: rerank top-N candidates
  • retrieve -> rerank -> summarize
  • Local ONNX, optional external providers
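
The rerank step of that pipeline, sketched with the cross-encoder stubbed out as a (query, doc) → score function (in Nixiesearch this would be a local ONNX model):

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.BiFunction;

// Sketch of retrieve -> rerank: score only the top-N retrieval candidates
// with a cross-encoder, then re-sort them by that score.
public class Reranker {
    public static List<String> rerank(String query, List<String> candidates, int topN,
                                      BiFunction<String, String, Double> crossEncoder) {
        return candidates.stream()
            .limit(topN) // cross-encoders are slow: only rerank top-N candidates
            .sorted(Comparator.comparingDouble(
                (String doc) -> crossEncoder.apply(query, doc)).reversed())
            .toList();
    }
}
```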

Future: ingestion pipeline

Common high-level tasks, automated:

  • Transform a field before indexing
  • Split to chunks, summarize
  • Contextual embeddings for RAG
  • Automated category detection

					- type: image
					  model: openclip

					- type: text
					  from: ".title | summarize(gemma2, prompt='...')"
				

EC2 g4dn.xlarge with a T4 GPU ≈ $300/month

Future: domain adaptation

Different search engines, same embedding, same results

  • Fine-tuning is a dead end: it needs training data you don't have
  • You have only docs: can you generate queries and labels?

Future: domain adaptation

Fine-tuning on LLM-generated synthetic data

  • For each document generate a query: it's a positive
  • For query+positive, mine negatives
  • Fine-tune the embedding model
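
The negative-mining step can be sketched as: retrieve with the current model, then keep the highest-ranked non-positive hits (hypothetical helper):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of hard-negative mining: given the ranked retrieval results for a
// generated query, the top non-positive documents become hard negatives.
public class NegativeMiner {
    public static List<String> mineNegatives(String positive, List<String> ranked, int k) {
        List<String> negatives = new ArrayList<>();
        for (String doc : ranked) {
            if (!doc.equals(positive)) negatives.add(doc);
            if (negatives.size() == k) break;
        }
        return negatives; // (query, positive, negatives) triplets feed fine-tuning
    }
}
```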

Expect a bumpy ride

  • Some features might be missing (like sorting)
  • Docs are imperfect - but they exist!
  • There will be breaking changes

Links