HOW NOT TO EVALUATE

your RAG

Berlin Buzzwords 2025 | Roman Grebennikov

whoami

  • PhD in CS, quant trading, credit scoring
  • Findify: e-commerce search, personalization
  • Delivery Hero: food search, LLMs
  • Opensource: Metarank, lightgbm4j, flink-scala-api

Nixiesearch

  • TLDR: Lucene search engine on top of S3
  • Expected: facets, filters, autocomplete, RRF
  • ML: embedding inference, cross-encoder ranking
  • RAG: LLM inference via llamacpp

Perks of being open-source


  • they: hey nice project!
  • you: thanks! ❤️
  • they: we use it for RAG in a bank with 3M customers
  • they: and got some issues, can you help?
  • you: hmm 🤔🤔🤔

The agenda

  • intro: RAG as a support agent for support agents
  • data: chunking and context length
  • search+generation: the R and G in RAG
  • tools: RAGAS, and why you should perhaps build your own

Support agent for support agents

80% of all questions are covered by the FAQ

  • 40%: answer immediately as you know the answer
  • 40%: answer after checking FAQ/docs
  • 20%: answer after internal discussion

Support agent for support agents

  • Still human: nobody likes chatting with ChatGPT
  • Faster onboarding: just read the docs

RAG

  • Dialogue: summarize to a query
  • Retrieve: search for top-N relevant docs
  • Summarize: answer the last question in the dialogue
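
A minimal sketch of that loop, assuming any OpenAI-compatible endpoint (the URL, model name and search() are placeholders, not Nixiesearch APIs):

from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def complete(prompt: str) -> str:
    resp = llm.chat.completions.create(
        model="gemma2-27b",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0)
    return resp.choices[0].message.content

def rag_answer(dialogue: list[str], search) -> str:
    # 1. Dialogue: summarize the chat history into one search query
    query = complete("Summarize this dialogue into one search query:\n"
                     + "\n".join(dialogue))
    # 2. Retrieve: top-N relevant chunks via your search client
    chunks = search(query, limit=5)
    # 3. Summarize: answer the last question, grounded in the chunks
    return complete("Context:\n" + "\n".join(chunks)
                    + "\n\nQuestion: " + dialogue[-1])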

Getting real

  • FinTech: air-gapped, you can't just use the OpenAI API
  • Languages: CIS region, Uzbek/Kazakh
  • Knowledge base: what knowledge base?

wait, is RAG a solved thing in 2025?

Iteration #1

  • LLM: convert all docs to Markdown
  • LangChain: chunk, then embed with multilingual-e5-small
  • Qwen2.5 for summarization

CTO@k: it works but sucks

  • Relevant docs missing, irrelevant found
  • Wrong query (or summary) generated

CTO@k: it works but sucks

  • R: Relevant docs missing, irrelevant found
  • G: Wrong query (or summary) generated

decompose > evaluate > improve

RAG as a system

preprocessing + retrieval + generation = RAG

Vibe-evaluating RAG

TLDR: evaluate each RAG step separately

Corpus preprocessing

Corpus preprocessing

  • Local files: docx, pdf, txt, napkin scans
  • Convert everything to Markdown
  • ~1k docs [docling, markitdown]
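
A minimal conversion loop with markitdown (docling works similarly); paths here are illustrative:

from pathlib import Path
from markitdown import MarkItDown

# convert every docx/pdf/txt under docs/ into one .md file per doc
md = MarkItDown()
out = Path("corpus")
out.mkdir(exist_ok=True)
for src in Path("docs").rglob("*"):
    if src.suffix.lower() in {".docx", ".pdf", ".txt"}:
        (out / f"{src.stem}.md").write_text(md.convert(str(src)).text_content)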

Evaluating chunking

  • chunking = the way you create the corpus
  • cannot label a constantly changing corpus 🙁

"vibe evaluating"

Chunking: large vs small

  • Large: embed complete documents
  • Small: split docs to chunks

Problems of large chunking

Bad UX: no time to check, context too large

Problems of small chunking

Lost context due to over-chunking

Anthropic's contextual chunking

Anthropic's contextual chunking

  • LLM inference per chunk is expensive 🙁
  • Yes but we have markdown titles 😃

GPU poor contextual chunking
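
No per-chunk LLM calls needed: the Markdown headings already carry the context. A minimal sketch that prepends the section title to every chunk (further splitting long sections into 1-2 paragraph pieces is omitted for brevity):

import re

def contextual_chunks(markdown: str) -> list[str]:
    # split on markdown headings, prepend each heading to its section text
    chunks, title, body = [], "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            if body:
                chunks.append(title + "\n" + "\n".join(body).strip())
            title, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    if body:
        chunks.append(title + "\n" + "\n".join(body).strip())
    return chunks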

Chunking TLDR

  • Contextual: title + 1-2 paragraphs

Immutable index


$> nixiesearch index file -c config.yml --url s3://bucket/source --index hello

$> aws s3 ls s3://bucket/index/hello
-rw-r--r-- 1 shutty shutty       512 May 22 12:55 _0.cfe
-rw-r--r-- 1 shutty shutty 123547444 May 22 12:55 _0.cfs
-rw-r--r-- 1 shutty shutty       322 May 22 12:55 _0.si
-rw-r--r-- 1 shutty shutty      1610 May 22 12:55 index.json
-rw-r--r-- 1 shutty shutty       160 May 22 12:55 segments_1
-rw-r--r-- 1 shutty shutty         0 May 22 12:48 write.lock

$> nixiesearch search -c config.yml

R in RAG

R in RAG

The most relevant context

  • Context window is limited (but it's growing)
  • Fill the context with relevant chunks

Which model to choose?

Not all embeddings are OK for non-English!

  • No labels, no queries, unusual languages
  • No baseline to compare against
  • No public eval corpus 🙁

MIRACL dataset

  • TLDR: Wikipedia queries + docs + labels, 18 languages
  • No Kazakh/Uzbek split, but what if we machine-translate? 🤔
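
A sketch of how such a split could be machine-translated with an LLM (model and prompt are illustrative, not what m-MARCO used):

from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY

def translate(text: str, lang: str = "Kazakh") -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Translate into {lang}. Output only the translation.\n\n{text}"}],
        temperature=0.0)
    return resp.choices[0].message.content.strip()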

Machine-translated datasets?

m-MARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset [2021]

Are MT/LLM models of 2025 better? 🤔

XCOMET, en->xx

  • Google Translate: great on high-resource languages
  • GPT4o: strong on low-resource languages

MTEB eval on MT data

  • Bigger = better on low-resource languages
  • Wait, LLM as an embedding model?

LLM embeddings and ONNX


from sentence_transformers import (
    SentenceTransformer,
    export_dynamic_quantized_onnx_model,
    export_optimized_onnx_model,
)

model = SentenceTransformer("Qwen/Qwen3-Embedding-8B", backend="onnx")

# raw ONNX export
model.save_pretrained("export_dir")

# O3 graph optimization - DOES NOT WORK for this model
export_optimized_onnx_model(model, "O3", "export_dir")

# QInt8 dynamic quantization for AVX512-VNNI CPUs
export_dynamic_quantized_onnx_model(model, "avx512_vnni", "export_dir")

ONNX + Nixiesearch


inference:
  embedding:
    # our embedding model
    bge-gemma:
      provider: onnx
      model: file://dir/model_qint8.onnx

schema:
  # index schema
  knowledgebase:
    fields:
      content:
        type: text
        search:
          semantic:
            model: bge-gemma

  • just worked: CLS pooling, even qint8 quantization

okay, retrieval = Quepid time?

  • Perfect world: get queries, label results
  • Reality: "who's going to label results?"

RAGAS approach: LLM-as-a-judge

Context precision+recall

No NDCG: position does not matter for the LLM

Context relevance

Context retrieval metrics overview

  • Absolute numbers: not much benefit
  • Relative performance: trackable with automated evals!
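
A sketch of the 0.1-style RAGAS API for these two metrics (the judge defaults to OpenAI, so an API key, or your own llm=, is assumed):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# toy single-row eval set; columns are the ones RAGAS expects
ds = Dataset.from_dict({
    "question":     ["how do I reset my card PIN?"],
    "contexts":     [["To reset a PIN: open the app, Cards -> Reset PIN."]],
    "ground_truth": ["Open the app, go to Cards and choose Reset PIN."],
})
print(evaluate(ds, metrics=[context_precision, context_recall]))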

G in RAG

  • What open LLM to use?
  • Balance: precision, hardware, latency

Evaluating generation

  • Comprehension: how well does the LLM understand the language?
  • Generation: how good is the generated response?

FB Belebele benchmark

  • 122 languages: passage, question, 4 answers
  • 900 questions per language

Task example


Passage:
The atom can be considered to be one of the fundamental building blocks of 
all matter. It's a very complex entity which consists, according to a simplified 
Bohr model, of a central nucleus orbited by electrons, somewhat similar to 
planets orbiting the sun - see Figure 1.1. The nucleus consists of two 
particles - neutrons and protons. Protons have a positive electric charge 
while neutrons have no charge. The electrons have a negative electric charge.
	
Query: The particles that orbit the nucleus have which type of charge?
	
A: Positive charge
B: No charge
C: Negative charge
D: Positive and negative charge
  • TLDR: NLU evaluation, but multilingual
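
A sketch of scoring an LLM on Belebele (config and field names as published on the Hugging Face hub; ask() is your LLM call returning one letter):

from datasets import load_dataset

ds = load_dataset("facebook/belebele", "kaz_Cyrl", split="test")  # FLORES-200 codes

def accuracy(ask) -> float:
    hits = 0
    for row in ds:
        options = "\n".join(f"{letter}: {row[f'mc_answer{i}']}"
                            for i, letter in enumerate("ABCD", start=1))
        prompt = (f"{row['flores_passage']}\n\n{row['question']}\n"
                  f"{options}\nAnswer with a single letter.")
        if ask(prompt).strip() == "ABCD"[int(row["correct_answer_num"]) - 1]:
            hits += 1
    return hits / len(ds)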

Results

TLDR: Gemma2 (and 3!) is one of the best open LLMs

Benchmarking generation

[1]: UpTrain: Decoding Perplexity and its significance in LLMs

Perplexity

  • Gemma2-27B: balance of comprehension+generation
  • Later upgrade to Gemma3-27B
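
Perplexity is exp of the mean per-token negative log-likelihood, cheap to compute with any causal LM; a minimal sketch (checkpoint shown only as an example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-2-27b-it"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()           # lower = less surprised model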

Uncharted territory

LLM-as-a-judge 🤔🤔🤔 metrics that do exist, but we had never tried:

  • Answer accuracy: compare prediction and ground truth
  • Answer groundedness: is the answer based on the context?

RAGAS: answer accuracy

RAGAS: answer groundedness

RAGAS vs client.completion

  • ended up with a custom prompt collection
  • bad prompt example:

    is document '{doc}' relevant to query '{query}'?

  • good prompt example:

    classify '{doc}' relevance to query '{query}':
    score 0: doc has a different topic and doesn't answer
             the question asked in the query.
    score 1: doc has the same or a similar topic, but doesn't
             answer the query.
    score 2: doc exactly answers the question in the query.
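
Wiring the good prompt into a plain completion client is a few lines; a sketch against any OpenAI-compatible endpoint (URL and model name are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

JUDGE = """classify '{doc}' relevance to query '{query}':
score 0: doc has a different topic and doesn't answer the query.
score 1: doc has the same or a similar topic, but doesn't answer the query.
score 2: doc exactly answers the question in the query.
Reply with the score digit only."""

def judge(doc: str, query: str) -> int:
    resp = client.chat.completions.create(
        model="gemma2-27b",
        messages=[{"role": "user",
                   "content": JUDGE.format(doc=doc, query=query)}],
        temperature=0.0)
    return int(resp.choices[0].message.content.strip())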

Implementation

  • on-prem: 1-2x GPU, 4U server
  • perf: 100 tps, 500ms time to first token

Nixiesearch, llamacpp, GGUF


inference:
  completion:
    # LLM inference here!
    gemma2-27b:
      provider: llamacpp
      model: file://dir/model_Q4_K_0.GGUF
  embedding:
    bge-gemma:
      provider: onnx
      model: file://dir/model_qint8.onnx

schema:
  knowledgebase:
    fields:
      content:
        type: text
        search:
          semantic:
            model: bge-gemma

Things we learned

  • English >> High-resource >> Low-resource languages
  • Decompose: complex system = many simple parts
  • LLM-as-a-judge: when you have no choice

Links