HOW NOT TO EVALUATE

your RAG

Berlin Buzzwords 2025 | Roman Grebennikov

whoami

  • PhD in CS, quant trading, credit scoring
  • Findify: e-commerce search, personalization
  • Delivery Hero: food search, LLMs
  • Opensource: Metarank, lightgbm4j, flink-scala-api

Nixiesearch

  • TLDR: Lucene search engine on top of S3
  • Expected: facets, filters, autocomplete, RRF
  • ML: embedding inference, cross-encoder ranking
  • RAG: LLM inference via llamacpp

Perks of being open-source


  • they: hey nice project!
  • you: thanks! ❤️
  • they: we use it for RAG in a bank with 3M customers
  • they: and got some issues, can you help?
  • you: hmm 🤔🤔🤔

The agenda

  • intro: RAG as a support agent for support agents
  • data: chunking and context length
  • search+generation: the R and G in RAG
  • tools: RAGAS, and why you should perhaps build your own

Support agent for support agents

80% of all questions are covered by the FAQ

  • 40%: answer immediately as you know the answer
  • 40%: answer after checking FAQ/docs
  • 20%: answer after internal discussion

Support agent for support agents

  • Still human: nobody likes chatting with ChatGPT
  • Faster onboarding: just read the docs

RAG

  • Dialogue: summarize to a query
  • Retrieve: search for top-N relevant docs
  • Summarize: answer the last question in the dialogue
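
A minimal sketch of that loop, assuming any OpenAI-compatible endpoint (the URL, model name and search() are placeholders, not Nixiesearch APIs):

from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def complete(prompt: str) -> str:
    resp = llm.chat.completions.create(
        model="gemma2-27b",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0)
    return resp.choices[0].message.content

def rag_answer(dialogue: list[str], search) -> str:
    # 1. Dialogue: summarize the chat history into one search query
    query = complete("Summarize this dialogue into one search query:\n"
                     + "\n".join(dialogue))
    # 2. Retrieve: top-N relevant chunks via your search client
    chunks = search(query, limit=5)
    # 3. Summarize: answer the last question, grounded in the chunks
    return complete("Context:\n" + "\n".join(chunks)
                    + "\n\nQuestion: " + dialogue[-1])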

Getting real

  • FinTech: air-gapped, you can't just use the OpenAI API
  • Languages: CIS region, Uzbek/Kazakh
  • Knowledge base: what knowledge base?

wait, is RAG a solved thing in 2025?

Iteration #1

  • LLM: convert all docs to Markdown
  • LangChain: chunk, then embed with multilingual-e5-small
  • Qwen2.5 for summarization

CTO@k: it works but sucks

  • Relevant docs missing, irrelevant found
  • Wrong query (or summary) generated

CTO@k: it works but sucks

  • R: Relevant docs missing, irrelevant found
  • G: Wrong query (or summary) generated

decompose > evaluate > improve

RAG as a system

preprocessing + retrieval + generation = RAG

Vibe-evaluating RAG

TLDR: evaluate each RAG step separately

Corpus preprocessing

Corpus preprocessing

  • Local files: docx, pdf, txt, napkin scans
  • Convert everything to Markdown
  • ~1k docs [docling, markitdown]
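
A minimal conversion loop with markitdown (docling works similarly); paths here are illustrative:

from pathlib import Path
from markitdown import MarkItDown

# convert every docx/pdf/txt under docs/ into one .md file per doc
md = MarkItDown()
out = Path("corpus")
out.mkdir(exist_ok=True)
for src in Path("docs").rglob("*"):
    if src.suffix.lower() in {".docx", ".pdf", ".txt"}:
        (out / f"{src.stem}.md").write_text(md.convert(str(src)).text_content)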

Evaluating chunking

  • chunking = the way you create the corpus
  • cannot label a constantly changing corpus 🙁

"vibe evaluating"

Chunking: large vs small

  • Large: embed complete documents
  • Small: split docs to chunks

Problems of large chunking

Bad UX: no time to check, context too large

Problems of small chunking

Lost context due to over-chunking

Anthropic's contextual chunking

Anthropic's contextual chunking

  • LLM inference per chunk is expensive 🙁
  • Yes but we have markdown titles 😃

GPU poor contextual chunking
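
No per-chunk LLM calls needed: the Markdown headings already carry the context. A minimal sketch that prepends the section title to every chunk (further splitting long sections into 1-2 paragraph pieces is omitted for brevity):

import re

def contextual_chunks(markdown: str) -> list[str]:
    # split on markdown headings, prepend each heading to its section text
    chunks, title, body = [], "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            if body:
                chunks.append(title + "\n" + "\n".join(body).strip())
            title, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    if body:
        chunks.append(title + "\n" + "\n".join(body).strip())
    return chunks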

Chunking TLDR

  • Contextual: title + 1-2 paragraphs

Immutable index


$> nixiesearch index file -c config.yml --url s3://bucket/source --index hello

$> aws s3 ls s3://bucket/index/hello
-rw-r--r-- 1 shutty shutty       512 May 22 12:55 _0.cfe
-rw-r--r-- 1 shutty shutty 123547444 May 22 12:55 _0.cfs
-rw-r--r-- 1 shutty shutty       322 May 22 12:55 _0.si
-rw-r--r-- 1 shutty shutty      1610 May 22 12:55 index.json
-rw-r--r-- 1 shutty shutty       160 May 22 12:55 segments_1
-rw-r--r-- 1 shutty shutty         0 May 22 12:48 write.lock

$> nixiesearch search -c config.yml

R in RAG

R in RAG

The most relevant context

  • Context window is limited (but it's growing)
  • Fill the context with relevant chunks

Which model to choose?

Not all embeddings are OK for non-English!

  • No labels, no queries, unusual languages
  • No baseline to compare against
  • No public eval corpus 🙁

MIRACL dataset

  • TLDR: Wikipedia queries + docs + labels, 18 languages
  • No Kazakh/Uzbek split, but what if we machine-translate? 🤔
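
A sketch of how such a split could be machine-translated with an LLM (model and prompt are illustrative, not what m-MARCO used):

from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY

def translate(text: str, lang: str = "Kazakh") -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Translate into {lang}. Output only the translation.\n\n{text}"}],
        temperature=0.0)
    return resp.choices[0].message.content.strip()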

Machine-translated datasets?

m-MARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset [2021]

Are MT/LLM models of 2025 better? 🤔

XCOMET, en->xx

  • Google Translate: great on high-resource languages
  • GPT4o: strong on low-resource languages

MTEB eval on MT data

  • Bigger = better on low-resource languages
  • Wait, LLM as an embedding model?

LLM embeddings and ONNX


from sentence_transformers import (
    SentenceTransformer,
    export_dynamic_quantized_onnx_model,
    export_optimized_onnx_model,
)

model = SentenceTransformer("Qwen/Qwen3-Embedding-8B", backend="onnx")

# raw ONNX export
model.save_pretrained("export_dir")

# O3 graph optimization - DOES NOT WORK for this model
export_optimized_onnx_model(model, "O3", "export_dir")

# QInt8 dynamic quantization for AVX512-VNNI CPUs
export_dynamic_quantized_onnx_model(model, "avx512_vnni", "export_dir")

ONNX + Nixiesearch


inference:
  embedding:
    # our embedding model
    bge-gemma:
      provider: onnx
      model: file://dir/model_qint8.onnx

schema:
  # index schema
  knowledgebase:
    fields:
      content:
        type: text
        search:
          semantic:
            model: bge-gemma

  • just worked: CLS pooling, even qint8 quantization

okay, retrieval = Quepid time?

  • Perfect world: get queries, label results
  • Reality: "who's going to label results?"

RAGAS approach: LLM-as-a-judge

Context precision+recall

No NDCG: position does not matter for the LLM

Context relevance

Context retrieval metrics overview

  • Absolute numbers: not much benefit
  • Relative performance: trackable with automated evals!
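
A sketch of the 0.1-style RAGAS API for these two metrics (the judge defaults to OpenAI, so an API key, or your own llm=, is assumed):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# toy single-row eval set; columns are the ones RAGAS expects
ds = Dataset.from_dict({
    "question":     ["how do I reset my card PIN?"],
    "contexts":     [["To reset a PIN: open the app, Cards -> Reset PIN."]],
    "ground_truth": ["Open the app, go to Cards and choose Reset PIN."],
})
print(evaluate(ds, metrics=[context_precision, context_recall]))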

G in RAG

  • What open LLM to use?
  • Balance: precision, hardware, latency

Evaluating generation

  • Comprehension: how well does the LLM understand the language?
  • Generation: how good is the generated response?

FB Belebele benchmark

  • 122 languages: passage, question, 4 answers
  • 900 questions per language

Task example


Passage:
The atom can be considered to be one of the fundamental building blocks of 
all matter. It's a very complex entity which consists, according to a simplified 
Bohr model, of a central nucleus orbited by electrons, somewhat similar to 
planets orbiting the sun - see Figure 1.1. The nucleus consists of two 
particles - neutrons and protons. Protons have a positive electric charge 
while neutrons have no charge. The electrons have a negative electric charge.
	
Query: The particles that orbit the nucleus have which type of charge?
	
A: Positive charge
B: No charge
C: Negative charge
D: Positive and negative charge
  • TLDR: NLU evaluation, but multilingual
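
A sketch of scoring an LLM on Belebele (config and field names as published on the Hugging Face hub; ask() is your LLM call returning one letter):

from datasets import load_dataset

ds = load_dataset("facebook/belebele", "kaz_Cyrl", split="test")  # FLORES-200 codes

def accuracy(ask) -> float:
    hits = 0
    for row in ds:
        options = "\n".join(f"{letter}: {row[f'mc_answer{i}']}"
                            for i, letter in enumerate("ABCD", start=1))
        prompt = (f"{row['flores_passage']}\n\n{row['question']}\n"
                  f"{options}\nAnswer with a single letter.")
        if ask(prompt).strip() == "ABCD"[int(row["correct_answer_num"]) - 1]:
            hits += 1
    return hits / len(ds)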

Results

TLDR: Gemma2 (and 3!) is one of the best open LLMs

Benchmarking generation

[1]: UpTrain: Decoding Perplexity and its significance in LLMs

Perplexity

  • Gemma2-27B: balance of comprehension+generation
  • Later upgrade to Gemma3-27B
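
Perplexity is exp of the mean per-token negative log-likelihood, cheap to compute with any causal LM; a minimal sketch (checkpoint shown only as an example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-2-27b-it"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()           # lower = less surprised model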

Uncharted territory

LLM-as-a-judge 🤔🤔🤔 metrics that do exist, but we had never tried:

  • Answer accuracy: compare prediction and ground truth
  • Answer groundedness: is the answer based on the context?

RAGAS: answer accuracy

RAGAS: answer groundedness

RAGAS vs client.completion

  • ended up with a custom prompt collection
  • bad prompt example:

    is document '{doc}' relevant to query '{query}'?

  • good prompt example:

    classify '{doc}' relevance to query '{query}':
    score 0: doc has a different topic and doesn't answer
             the question asked in the query.
    score 1: doc has the same or a similar topic, but doesn't
             answer the query.
    score 2: doc exactly answers the question in the query.
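
Wiring the good prompt into a plain completion client is a few lines; a sketch against any OpenAI-compatible endpoint (URL and model name are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

JUDGE = """classify '{doc}' relevance to query '{query}':
score 0: doc has a different topic and doesn't answer the query.
score 1: doc has the same or a similar topic, but doesn't answer the query.
score 2: doc exactly answers the question in the query.
Reply with the score digit only."""

def judge(doc: str, query: str) -> int:
    resp = client.chat.completions.create(
        model="gemma2-27b",
        messages=[{"role": "user",
                   "content": JUDGE.format(doc=doc, query=query)}],
        temperature=0.0)
    return int(resp.choices[0].message.content.strip())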

Implementation

  • on-prem: 1-2x GPU, 4U server
  • perf: 100 tps, 500ms time to first token

Nixiesearch, llamacpp, GGUF


inference:
  completion:
    # LLM inference here!
    gemma2-27b:
      provider: llamacpp
      model: file://dir/model_Q4_K_0.GGUF
  embedding:
    bge-gemma:
      provider: onnx
      model: file://dir/model_qint8.onnx

schema:
  knowledgebase:
    fields:
      content:
        type: text
        search:
          semantic:
            model: bge-gemma

Things we learned

  • English >> High-resource >> Low-resource languages
  • Decompose: complex system = many simple parts
  • LLM-as-a-judge: when you have no choice

Links