HOW NOT TO EVALUATE
your RAG
Berlin Buzzwords 2025 | Roman Grebennikov
whoami

- PhD in CS, quant trading, credit scoring
- Findify: e-commerce search, personalization
- Delivery Hero: food search, LLMs
- Opensource: Metarank, lightgbm4j, flink-scala-api
Nixiesearch
- TLDR: Lucene search engine on top of S3
- Expected: facets, filters, autocomplete, RRF
- ML: embedding inference, cross-encoder ranking
- RAG: LLM inference via llamacpp
Perks of being open-source
- they: hey nice project!
- you: thanks! ❤️
- they: we use it for RAG in a bank with 3M customers
- they: and got some issues, can you help?
- you: hmm 🤔🤔🤔
The agenda
- intro: RAG as a support agent for support agents
- data: chunking and context length
- search+generation: the R and G in RAG
- tools: RAGAS, and why you should perhaps build your own
Support agent for support agents
80% of all questions are covered by FAQ
- 40%: answer immediately as you know the answer
- 40%: answer after checking FAQ/docs
- 20%: answer after internal discussion
Support agent for support agents
- Still human: nobody likes chatting with ChatGPT
- Faster onboarding: just read the docs
RAG
- Dialogue: summarize to a query
- Retrieve: search for top-N relevant docs
- Summarize: answer the last question in the dialogue
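A minimal sketch of this three-step loop, assuming any OpenAI-compatible endpoint; the base URL, model name, and the search() helper below are placeholders, not the production setup:

from openai import OpenAI

# any OpenAI-compatible server (e.g. a local llama.cpp endpoint); placeholder URL
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def answer(dialogue: list[str], search) -> str:
    # 1. Dialogue: compress the conversation into a standalone search query
    query = client.chat.completions.create(
        model="local-llm",
        messages=[{"role": "user", "content":
            "Rewrite the last question as a standalone search query:\n" + "\n".join(dialogue)}],
    ).choices[0].message.content
    # 2. Retrieve: top-N relevant chunks from the knowledge base
    chunks = search(query, top_n=5)
    # 3. Summarize: answer the last question grounded in the retrieved chunks
    prompt = "Context:\n" + "\n---\n".join(chunks) + "\n\nAnswer the last question:\n" + dialogue[-1]
    return client.chat.completions.create(
        model="local-llm",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content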
Getting real
- FinTech: airgapped, you can't just use OpenAI API
- Languages: CIS region, Uzbek/Kazakh
- Knowledge base: what knowledge base?
wait, is RAG a solved thing in 2025?
Iteration #1
- LLM: convert all docs to Markdown
- LangChain: chunk, embed with multilingual-e5-small
- Qwen2.5 for summarization
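A rough sketch of this first iteration; the chunk sizes and splitter choice are illustrative, not the exact production settings:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# split the Markdown corpus into fixed-size chunks (sizes are illustrative)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(open("doc.md").read())

# E5-family models expect the "passage: " / "query: " prefixes
model = SentenceTransformer("intfloat/multilingual-e5-small")
embeddings = model.encode(["passage: " + c for c in chunks], normalize_embeddings=True)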
CTO@k: it works but sucks
- Relevant docs missing, irrelevant found
- Wrong query (or summary) generated
CTO@k: it works but sucks
- R: Relevant docs missing, irrelevant found
- G: Wrong query (or summary) generated
decompose > evaluate > improve
RAG as a system
preprocessing + retrieval + generation = RAG
Vibe-evaluating RAG
TLDR: evaluate each RAG step separately
Corpus preprocessing
Corpus preprocessing
- Local files: docx, pdf, txt, napkin scans
- Convert everything to Markdown
- ~1k docs [docling, markitdown]
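A sketch of the conversion step with markitdown (docling works similarly); the directory layout is an assumption:

from pathlib import Path
from markitdown import MarkItDown

converter = MarkItDown()
Path("markdown").mkdir(exist_ok=True)
for path in Path("corpus").rglob("*"):
    if path.suffix.lower() in {".docx", ".pdf", ".txt"}:
        # convert each source document and keep only the Markdown text
        result = converter.convert(str(path))
        Path("markdown", path.stem + ".md").write_text(result.text_content)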
Evaluating chunking

- chunking = the way you create the corpus
- you cannot label a constantly changing corpus 🙁
"vibe evaluating"
Chunking: large vs small
- Large: embed complete documents
- Small: split docs to chunks
Problems of large chunking
Bad UX: no time to check, context too large
Problems of small chunking
Lost context due to over-chunking
Anthropic's contextual chunking
Anthropic's contextual chunking
- LLM inference per chunk is expensive 🙁
- Yes but we have markdown titles 😃
GPU-poor contextual chunking
Chunking TLDR
- Contextual: title + 1-2 paragraphs
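A sketch of the GPU-poor variant: instead of an LLM call per chunk (Anthropic-style), prepend the current Markdown heading to each chunk; the heading parsing here is deliberately simplified:

def contextual_chunks(markdown: str, max_chars: int = 1000) -> list[str]:
    # prepend the latest Markdown heading to every chunk instead of LLM-generated context
    chunks, heading, buf = [], "", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if buf:
                chunks.append(heading + "\n" + "\n".join(buf))
                buf = []
            heading = line.lstrip("# ").strip()
        else:
            buf.append(line)
            if sum(len(l) for l in buf) > max_chars:
                chunks.append(heading + "\n" + "\n".join(buf))
                buf = []
    if buf:
        chunks.append(heading + "\n" + "\n".join(buf))
    return chunks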
Immutable index
$> nixiesearch index file -c config.yml --url s3://bucket/source --index hello
$> aws s3 ls s3://bucket/index/hello
-rw-r--r-- 1 shutty shutty 512 May 22 12:55 _0.cfe
-rw-r--r-- 1 shutty shutty 123547444 May 22 12:55 _0.cfs
-rw-r--r-- 1 shutty shutty 322 May 22 12:55 _0.si
-rw-r--r-- 1 shutty shutty 1610 May 22 12:55 index.json
-rw-r--r-- 1 shutty shutty 160 May 22 12:55 segments_1
-rw-r--r-- 1 shutty shutty 0 May 22 12:48 write.lock
$> nixiesearch search -c config.yml
R in RAG
R in RAG
The most relevant context
- Context window is limited (but it's growing)
- Fill the context with relevant chunks
Which model to choose?
Not all embeddings are OK for non-English!
- No labels, no queries, unusual languages
- No baseline to compare against
- No public eval corpus 🙁
MIRACL dataset
- TLDR: wikipedia query + docs + labels, 18 languages
- No Kazakh/Uzbek split, but what if we machine-translate? 🤔
Machine-translated datasets?
mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset [2021]
Are MT/LLM models of 2025 better? 🤔
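A sketch of the machine-translation route: push query/passage pairs from an existing eval set through an LLM; the endpoint, model name, and target language are placeholders:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def translate(text: str, target_lang: str = "Kazakh") -> str:
    # ask the LLM for a plain translation, nothing else
    prompt = f"Translate the following text to {target_lang}. Return only the translation.\n\n{text}"
    resp = client.chat.completions.create(
        model="local-llm",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

# translate query/passage pairs from an existing dataset (e.g. a MIRACL split)
pairs = [("what is an atom", "The atom is a basic unit of matter ...")]
translated = [(translate(q), translate(d)) for q, d in pairs]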
XCOMET, en->xx
- Google Translate: great on high-resource languages
- GPT4o: strong on low-resource languages
MTEB eval on MT data
- Bigger = better on low-resource languages
- Wait, LLM as an embedding model?
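One way such a comparison can be run on the machine-translated pairs is a plain recall@10 sweep over candidate models; the dataset variables and the E5-style prefixes below are assumptions:

import numpy as np
from sentence_transformers import SentenceTransformer

def recall_at_10(model_name: str, queries: list[str], docs: list[str]) -> float:
    # one relevant doc per query, aligned by index
    model = SentenceTransformer(model_name)
    q = model.encode(["query: " + x for x in queries], normalize_embeddings=True)
    d = model.encode(["passage: " + x for x in docs], normalize_embeddings=True)
    scores = q @ d.T                            # cosine similarity (vectors are normalized)
    top10 = np.argsort(-scores, axis=1)[:, :10]
    return float(np.mean([i in row for i, row in enumerate(top10)]))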
LLM embeddings and ONNX
from sentence_transformers import (
    SentenceTransformer,
    export_optimized_onnx_model,
    export_dynamic_quantized_onnx_model,
)

model = SentenceTransformer("Qwen/Qwen3-Embedding-8B", backend="onnx")
# raw ONNX export of the model
model.save_pretrained("export_dir")
# O3 graph optimization - DOES NOT WORK
export_optimized_onnx_model(model, "O3", "export_dir")
# dynamic QInt8 quantization for AVX512-VNNI CPUs
export_dynamic_quantized_onnx_model(model, "avx512_vnni", "export_dir")
ONNX + Nixiesearch
inference:
  embedding:
    # our embedding model
    bge-gemma:
      provider: onnx
      model: file://dir/model_qint8.onnx
schema:
  # index schema
  knowledgebase:
    fields:
      content:
        type: text
        search:
          semantic:
            model: bge-gemma
- just worked, CLS pooling, even qint8 quantization
okay, retrieval = Quepid time?
- Perfect world: get queries, label results
- Reality: "who's going to label results?"
RAGAS approach: LLM-as-a-judge
Context precision+recall
No NDCG: position does not matter for LLM
Context relevance
Context retrieval metrics overview
- Absolute numbers: not much benefit
- Relative performance: with automated evals!
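A sketch of the RAGAS context metrics, assuming the classic ragas v0.1-style API with a judge LLM configured via environment variables; the one-row dataset and its column names follow the ragas docs:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# a toy one-row eval set; the judge LLM is picked up from the environment
data = Dataset.from_dict({
    "question": ["How do I reset my card PIN?"],
    "contexts": [["To reset your PIN, open the mobile app and ..."]],
    "answer": ["Open the mobile app and follow the PIN reset flow."],
    "ground_truth": ["The PIN can be reset in the mobile app settings."],
})
print(evaluate(data, metrics=[context_precision, context_recall]))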
G in RAG
- What open LLM to use?
- Balance: precision, hardware, latency
Evaluating generation
- Comprehension: how well can the LLM understand the language?
- Generation: how good is the generated response?
FB Belebele benchmark
- 122 languages: passage, question, 4 answers
- 400 triplets per language
Task example
Passage:
The atom can be considered to be one of the fundamental building blocks of
all matter. It's a very complex entity which consists, according to a simplified
Bohr model, of a central nucleus orbited by electrons, somewhat similar to
planets orbiting the sun - see Figure 1.1. The nucleus consists of two
particles - neutrons and protons. Protons have a positive electric charge
while neutrons have no charge. The electrons have a negative electric charge.
Query: The particles that orbit the nucleus have which type of charge?
A: Positive charge
B: No charge
C: Negative charge
D: Positive and negative charge
- TLDR: NLU evaluation, but multilingual
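A sketch of scoring a local LLM on one Belebele split as 4-way multiple choice; the HF config and field names should be checked against the dataset card, and the endpoint/model are placeholders:

from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
ds = load_dataset("facebook/belebele", "kaz_Cyrl", split="test")

correct = 0
for row in ds:
    prompt = (
        f"{row['flores_passage']}\n\nQuestion: {row['question']}\n"
        f"A: {row['mc_answer1']}\nB: {row['mc_answer2']}\n"
        f"C: {row['mc_answer3']}\nD: {row['mc_answer4']}\n"
        "Answer with a single letter."
    )
    reply = client.chat.completions.create(
        model="local-llm", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content.strip()
    correct += reply[:1].upper() == "ABCD"[int(row["correct_answer_num"]) - 1]

print(correct / len(ds))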
Results
TLDR: Gemma2 (and 3!) is one of the best open LLMs
Perplexity
- Gemma2-27B: balance of comprehension+generation
- Later upgrade to Gemma3-27B
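Perplexity itself is cheap to measure: it is the exponential of the mean token-level cross-entropy on target-language text. A sketch with a smaller stand-in checkpoint (a 27B model loads the same way):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-2-2b"   # smaller stand-in; model choice is illustrative
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

text = "A sample paragraph in the target language ..."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss    # mean cross-entropy over tokens
print(torch.exp(loss).item())             # perplexity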
Uncharted territory
LLM-as-a-judge 🤔🤔🤔: metrics which do exist, but we had never tried them:
- Answer accuracy: compare prediction and ground truth
- Answer groundedness: is the answer based on the context?
RAGAS: answer accuracy
RAGAS: answer groundedness
RAGAS vs client.completion
- ended up with a custom prompt collection
- bad prompt example:
  is document '{doc}' relevant to query '{query}'?
- good prompt example:
  classify '{doc}' relevance to query '{query}':
  score 0: doc has a different topic and does not answer the question asked in the query.
  score 1: doc has the same or a similar topic, but does not answer the query.
  score 2: doc exactly answers the question in the query.
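A sketch of running that grading prompt straight through an OpenAI-compatible client instead of RAGAS; the endpoint and model name are placeholders:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

GRADE_PROMPT = """classify '{doc}' relevance to query '{query}':
score 0: doc has a different topic and does not answer the question asked in the query.
score 1: doc has the same or a similar topic, but does not answer the query.
score 2: doc exactly answers the question in the query.
Reply with the score only."""

def grade(doc: str, query: str) -> int:
    # single judge call, deterministic decoding, parse the leading digit
    reply = client.chat.completions.create(
        model="local-llm",
        messages=[{"role": "user", "content": GRADE_PROMPT.format(doc=doc, query=query)}],
        temperature=0.0,
    ).choices[0].message.content
    return int(reply.strip()[0])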
Implementation
- on-prem: 1-2x GPU, 4U server
- perf: 100 tps, 500ms time to first token
Nixiesearch, llamacpp, GGUF
inference:
  completion:
    # LLM inference here!
    gemma2-27b:
      provider: llamacpp
      model: file://dir/model_Q4_K_0.GGUF
  embedding:
    bge-gemma:
      provider: onnx
      model: file://dir/model_qint8.onnx
schema:
  knowledgebase:
    fields:
      content:
        type: text
        search:
          semantic:
            model: bge-gemma
Things we learned
- English >> High-resource >> Low-resource languages
- Decompose: complex system = many simple parts
- LLM-as-a-judge: when you have no choice
Links