In my previous blog, I discussed the complete indexing part: how to extract data from multiple file sources, chunking, embedding, and the vector store/DB. This is a very important step, because it prepares the knowledge base that injects our company-specific, non-confidential information into the 'retrieval' step of RAG when processing a user query. Only once the knowledge base is ready can we move into the actual RAG implementation, which is what this blog covers. If you are planning to implement the complete RAG pipeline, I recommend reading the blog below first.
https://arunsdatasphere.blogspot.com/2026/04/ai-blog17-rag-preparing-knowledge-base.html
RAG (Retrieval Augmented Generation)
RAG is a technique where an AI model first retrieves relevant information from an external knowledge source (like a vector database) and then uses it to generate more accurate, context-aware responses.
Look at the image above to understand the order of components in RAG. Below is the sequence of components to follow while developing a RAG pipeline:
- Query Reformulation
- Query Expansion
- Intent Retrieval/Validation
- Pre Filter
- Post Filter
- Hybrid Search
- Semantic/Vector search
- BM25 (Keyword search)
- Semantic + BM25 (Hybrid search) - RRF
- Re-ranking (very important - mandatory technique)
- LLM Score based re-ranking
- Pair wise re-ranking
- List wise re-ranking
- Query aware re-ranking
- Hybrid re-ranking
- Evaluation metrics
- Precision
- Recall
- NDCG
- Faithfulness
- Answer relevancy
- Context precision
- Context recall
We will discuss every technique mentioned in the list above. Together, these steps make up Retrieval + Augmentation + Generation.
Let us assume, below is the user query :
"How to cancel my order and what is the refund policy for electronic items ?" When we submit this query to a RAG system, it enters the Retrieval step. But the system should not immediately embed the query and check its similarity against the chunks in the vector DB.
Internally, we need to perform the following steps first:
- Query reformulation
- Query expansion
- Intent validation
Query Reformulation
Assume my question is just "AI". How am I supposed to interpret this question? If I ask an AI just 'AI', is there any meaning? The correct search strings might be 'What is AI?' or 'What are the applications of AI?', and then we would get the correct response. For a moment, ignore AI and RAG entirely: even if you ask a person simply 'AI', do you think the other person will understand what you are talking about? No, right? This is when Query Reformulation is required. As part of Query Reformulation, we validate whether the query is well-formed. If it is not, we reformulate it so the AI understands what the user is actually asking about.
Example :
The intended question is "How to cancel my order ?". If a user instead types "Order Cancel", that is a vague query. Whenever users ask these kinds of questions, we need to articulate the query in a meaningful way. This is called Query Reformulation.
Query Expansion
Assume:
- User-1 is using the application for the first time
- User-2 is using it for the 50th time
- User-3 is using it for the 1000th time
Each of them will phrase the same need very differently. Query Expansion handles this by generating multiple related queries from the single user query, so retrieval covers all the likely phrasings.
Intent Validation
Let us assume you created this bot/agent for an e-commerce application, but a user asks a question about healthcare. The intent of the bot is not related to healthcare, right? Here we need to validate the query: if the intent is appropriate, we take the user query to the next step; otherwise, we inform the user about the actual purpose of the bot/agent, with something like "This bot is mainly meant for e-commerce queries!" (some response that lets the user understand what the bot is about).
To make it simple: in the retrieval step, we need to enable the above 3 techniques. If we skip them, we will get irrelevant output.
We can handle these in 2 ways:
- Either we maintain metadata while creating the knowledge base; using that metadata we can take care of Query Reformulation, Query Expansion & Intent Validation
- Otherwise, we take the help of an LLM
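To make the LLM route concrete, here is a minimal sketch. Everything in it is illustrative: `build_query_understanding_prompt`, `understand_query`, and the stubbed `fake_llm` are hypothetical names I've chosen, and a real pipeline would send the prompt to an actual chat model (e.g. ChatOpenAI, as in the implementation later in this post) instead of the stub.

```python
import json

def build_query_understanding_prompt(query: str, domain: str) -> str:
    """One prompt that asks the LLM to reformulate, expand, and
    validate intent in a single call (fewer tokens than three calls)."""
    return (
        f"You are the query-understanding step of a {domain} assistant.\n"
        f"User query: {query!r}\n"
        "1. Reformulate it into a clear, complete question.\n"
        "2. Generate 3 related queries (query expansion).\n"
        "3. Classify intent as IN_DOMAIN or OUT_OF_DOMAIN.\n"
        "Return JSON with keys: reformulated, expansions, intent."
    )

def understand_query(query: str, domain: str, ask_llm) -> dict:
    # ask_llm is injected so this sketch runs without an API key;
    # in the real pipeline it would wrap a chat-model call
    return json.loads(ask_llm(build_query_understanding_prompt(query, domain)))

# Stubbed LLM response for the vague query "Order Cancel":
fake_llm = lambda prompt: json.dumps({
    "reformulated": "How do I cancel my order?",
    "expansions": ["order cancellation steps", "cancel order online",
                   "cancel order refund"],
    "intent": "IN_DOMAIN",
})
result = understand_query("Order Cancel", "e-commerce", fake_llm)
print(result["reformulated"])
```

A single combined prompt is a design choice to save tokens; you can equally run reformulation, expansion, and intent validation as three separate calls.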
Pre Filter
Let us say the user's intent is to understand the 'refund policy', and he entered a query to get this information. Do we need to search the entire data, or only refund-policy-related information? It is smarter to search only the 'refund policy' related information, right? This is called Pre Filter.
Post Filter
This comes after the search is completed. We may still have to filter the context returned by the (pre-filtered) search. This is called Post Filter.
One line summary :
- Pre-Filtering narrows what you search
- Post-Filtering fixes what you found.
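The one-line summary can be sketched in plain Python. The chunk data, categories, and similarity scores below are made up for illustration; in a real pipeline the pre-filter would typically be a metadata filter pushed down to the vector DB, and the scores would come from the similarity search.

```python
chunks = [
    {"text": "Electronics can be refunded within 10 days.", "category": "refund_policy"},
    {"text": "Orders ship within 2 business days.", "category": "shipping"},
    {"text": "Refunds go back to the original payment method.", "category": "refund_policy"},
]

def pre_filter(chunks, category):
    # Narrows WHAT we search: keep only chunks whose metadata matches,
    # so the similarity search runs over fewer vectors (lower latency/cost).
    return [c for c in chunks if c["category"] == category]

def post_filter(scored_results, min_score=0.5):
    # Fixes what we FOUND: drop weak matches after the search completes.
    return [c for c, score in scored_results if score >= min_score]

search_space = pre_filter(chunks, "refund_policy")   # 2 of the 3 chunks
# pretend the vector search returned these (chunk, similarity) pairs:
scored = [(search_space[0], 0.91), (search_space[1], 0.34)]
final_context = post_filter(scored)                  # only the strong match survives
print([c["text"] for c in final_context])
```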
Hybrid search
Hybrid search is a combination of Semantic/Vector search + Keyword search.
In healthcare, we should not assume anything; we need to give exact results with the correct keywords. We need to use keyword search in such cases for exact results.
One more example: a user is looking for the 'Refund Policy', specifically the 'Electronics Refund Policy'. In this situation, semantic search can find general 'Refund Policy' content, but for the exact phrase 'Electronics Refund Policy', keyword search works better.
Hybrid would be something like:
- 70% Semantic search & 30% Keyword search
- 50% Semantic search & 50% Keyword search
- 30% Semantic search & 70% Keyword search
It all depends on the use case.
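One way to implement such a weighted split is sketched below, under the assumption that we min-max normalize both score sets first so BM25 scores and cosine similarities become comparable. All names and numbers here are illustrative.

```python
def normalize(scores):
    # Min-max normalize so BM25 scores and cosine similarities are comparable
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def hybrid_scores(semantic, bm25, alpha=0.7):
    """alpha = weight for semantic search; (1 - alpha) goes to keyword/BM25."""
    sem, kw = normalize(semantic), normalize(bm25)
    docs = set(sem) | set(kw)
    return {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
            for d in docs}

semantic = {"doc1": 0.82, "doc2": 0.60, "doc3": 0.31}  # cosine similarities (made up)
bm25 = {"doc1": 2.1, "doc2": 7.4, "doc3": 0.9}         # BM25 scores (made up)
print(hybrid_scores(semantic, bm25, alpha=0.7))        # 70% semantic / 30% keyword
```

Note how the winner depends on the weighting: with alpha=0.7 the semantically strong doc1 ranks first, while with alpha=0.3 the keyword-strong doc2 takes over, which is exactly why the split is a use-case decision.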
Assume we got the top 10 results from Hybrid search. The next question is whether these top 10 results are arranged by relevance. Based on the similarity score, they should be arranged from the highest to the lowest score. This happens during Re-Ranking.
Note, all the above steps happen during the Retrieval step. Once they are done, the relevant context is combined with the user query (Augmentation) and submitted to the LLM, which produces the answer (Generation).
Modern RAG Flow :
- User Query
- Query understanding
- Reformulation
- Expansion
- Intent validation(if irrelevant - inform user)
- Pre-filtering(Optional but common)
- Metadata filters(category, tenant, language etc. - search only required data)
- Hybrid search
- Semantic/vector search(Pinecone etc.)
- Keyword/BM25 search
- Post-Filtering( if required - fix what you found)
- Re-ranking
- Improves top-k quality significantly
- Context selection/compression
- Remove redundancy
- Fit within token limits (often missed in production flows; otherwise you hit the context window limit)
- Prompt construction
- Combine User query + retrieved context
- Instructions / system prompt
- LLM Generation
- Post Processing
- Format output
- Guardrails (hallucination checks, safety)
- Evaluation & Feedback loop
- Logging
- Metrics(Precision, Recall, Faithfulness)
- Continuous improvement
Retrieval Strategies
This is the foundation layer of RAG quality. If retrieval is weak, no reranking or generation can fully fix it.
Let's see 3 core Retrieval strategies:
- Query Formulation
- Transforming the user's raw query into a better search query
- User query - reformulated query - retrieval
- Query Expansion
- Intent Validation
The main problem is that users don't speak database language. They generally write queries in natural language. We need to interpret it.
Example:
Imagine asking a librarian for "Books about things going wrong in chips" when the catalog is indexed under "Semiconductor failure mechanisms". The librarian reformulates your question before searching. That is Query Formulation. An LLM can take care of the reformulation.
Types of query formulation:
- Semantic rewriting
- Keyword enrichment
- Domain normalization
- Clarification-based reformulation
It bridges the gap between user language and document language.
2) Query Expansion
Generating multiple related queries from user query.
- User query - N queries - Retrieval - Merge results
Example: "What causes diabetes ?"
But the documents might use terms like:
- Blood sugar disorder
- Insulin resistance
- Glucose imbalance
So we expand the query into variants such as:
- Causes of diabetes
- Insulin resistance explanation
- Blood sugar disorder causes
The disadvantage is cost, as we use an LLM for this feature. More tokens, more money.
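A minimal sketch of query expansion, with the LLM call stubbed out so the example stays runnable. `expand_query` and `fake_llm` are hypothetical names; in practice `ask_llm` would wrap a real chat-model call, which is where the extra token cost comes from.

```python
def expand_query(query, ask_llm, n=3):
    """Ask the LLM for n paraphrases; keep the original query too,
    so an exact match is never lost."""
    prompt = (
        f"Generate {n} alternative search queries for: {query!r}. "
        "Use domain synonyms. Return one query per line."
    )
    expansions = [q.strip() for q in ask_llm(prompt).splitlines() if q.strip()]
    return [query] + expansions[:n]

# Stubbed LLM so the sketch runs without an API key:
fake_llm = lambda prompt: (
    "causes of insulin resistance\n"
    "blood sugar disorder causes\n"
    "glucose imbalance triggers"
)
print(expand_query("What causes diabetes ?", fake_llm))
```

Each expanded query is then retrieved independently and the result lists are merged (for example with RRF, covered below in this post).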
3) Intent Validation
Check whether the query is:
- Relevant
- Valid
- Safe
- In-domain
Example :
A user asks "What is the capital of France ?" when our domain is medical.
Full retrieval pipeline:
User Query - Intent Validation - Query formulation - Query Expansion - Vector/Hybrid retrieval - Context - LLM - Answer
Implementation of Retrieval Strategies :
Pre/ Post Filtering
Pre-Filtering:
Suppose the user is looking for information about the "sick leave policy" and the query is "How many sick leaves are allowed per quarter ?". Clearly this information belongs to the HR policy about sick leaves. But if you search the entire data (all vectors) in the vector DB, latency suffers and the search takes a long time. Hence we apply a filter before the search, routing it to only the HR-policy-related chunks in the vector DB. On top of this, if your vector DB is cloud hosted, scanning everything will cost a fortune. We need to keep these considerations in mind while designing the RAG pipeline. This is called Pre-Filtering.
Always apply constraints like:
- Domain
- Category
- Time
- Source
- Meta Data
Post-Filtering:
- Even after retrieval, Top-k results ≠ Fully relevant
- Some results are partially relevant, noisy, misleading
Implementing Pre-Post Filtering :
Output :
Vector store, Chroma DB schema :
Let's recap what we have learnt so far.
We have discussed the following retrieval strategies:
- Intent validation
- Query reformulation
- Query expansion
- Pre-Filter
- Post-Filter
- Search strategies
- Semantic search(vector)
- Keyword search(BM25 - Best Match version 25)
- Hybrid search (Semantic + BM25) - RRF
- Re-Ranking
Hybrid search strategy
Hybrid search strategy is one of the most important ideas in modern RAG systems. Hybrid search is where "retrieval" becomes more powerful. It is the combination of Keyword search(BM25) & Semantic search(Embeddings).
Pipeline view:
User query - BM25 (keyword search) & Semantic search (vector), run in parallel - Fusion (RRF)
Problem with BM25:
- BM25 relies on exact words
- Ex: "Heart attack causes" but document says "myocardial infarction reasons"
- BM25 fails (no exact match)
- Semantic search understands meaning but can miss exact keywords
- Ex: "Python list append syntax" but semantic search may return "How to modify arrays in programming" - this is too generic
RRF (Reciprocal Rank Fusion)
RRF combines rankings from multiple retrievers.
- Key idea is, instead of combining scores, combine ranks
- Formula: Score = 1/(k + rank), summed over each rank list the document appears in, where
- rank = position of the document in that list
- k = smoothing constant, usually 60; it dampens the difference between adjacent ranks (note: it is NOT a 60/40 weightage split between semantic search and BM25)
- Example with two rank lists:
- BM25 list: A - 1st, B - 2nd, C - 3rd
- Semantic list: B - 1st, A - 2nd, C - 3rd
- After fusion:
- A - good in both lists
- B - strong in one
- C - moderate in both
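The RRF formula applied to exactly these two rank lists can be sketched as below. One honest observation: with these particular lists, A and B end up with identical fused scores because their ranks mirror each other (1st+2nd vs 2nd+1st), while C, consistently 3rd in both, lands last; RRF rewards consistent placement across retrievers.

```python
def rrf(rank_lists, k=60):
    # fused score(doc) = sum over lists of 1 / (k + rank), rank starting at 1
    scores = {}
    for ranks in rank_lists:
        for rank, doc in enumerate(ranks, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranks = ["A", "B", "C"]      # BM25: A 1st, B 2nd, C 3rd
semantic_ranks = ["B", "A", "C"]  # Semantic: B 1st, A 2nd, C 3rd
fused = rrf([bm25_ranks, semantic_ranks])
for doc, score in fused:
    print(doc, round(score, 5))
```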
Full retrieval stack :
Full pipeline :
Output :
Explanation :
- Observe that we are using an LLM for the retrieval techniques
- BM25, which is a keyword search, takes the query as input and returns the top 3 results based on score
- It splits the query and computes a score for each word/token
- It sorts the documents by score and stores them in a variable called ranked
- It returns the top-k documents
- Semantic retrieval
- It also accepts the user query and top_k as input
- It computes the embedding of the query
- Results are fetched from Chroma DB
- Hybrid retrieval gets sub-queries from expand_query() and adds them to a variable called queries
- The original user query is also appended to the same variable
- An empty list all_rank_lists is created to store the search results
- A loop runs over the number of queries
- It loops through BM25 and semantic search until all queries have been searched
- The search results are appended to all_rank_lists
- These rank lists are passed into RRF
- The top 5 results are returned
Finally, we ask some queries as below.
Important point: if you observe the above code carefully, especially the BM25 search, we are not using the vector DB for this type of search.
BM25 doesn't require a vector DB. It is a standalone retrieval technique based on an inverted index, while vector DBs enable semantic retrieval. In production RAG systems, both are often combined (hybrid search) to balance precision and recall.
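A from-scratch BM25 sketch makes the "no vector DB needed" point concrete. This is a minimal, illustrative implementation (simple whitespace tokenization, standard k1/b defaults), not production code; the sample documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25: inverted-index style scoring, no vectors involved."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in tokenized) / N
    # document frequency: in how many docs each term appears
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # BM25 only rewards exact term matches
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores

docs = [
    "refund policy for electronic items",
    "how to cancel an order",
    "shipping times for electronics",
]
scores = bm25_scores("refund policy", docs)
print(scores)
```

Notice that only the document containing the exact terms "refund" and "policy" gets a non-zero score, which is precisely BM25's weakness with synonyms and why we pair it with semantic search.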
Re-ranking strategies
Re-ranking in a RAG pipeline is where you take the initially retrieved documents and reorder them using a more accurate model, so that the most relevant context comes first.
Where it sits in RAG ?
User Query
↓
Retriever (BM25 / Vector / Hybrid) → gets top N (e.g., 20)
↓
Reranker (cross-encoder / LLM scoring) → reorders those 20
↓
Top-K selection (e.g., 5)
↓
LLM (final answer generation)
How reranking works ?
Instead of scoring documents independently, a reranker:
- Looks at (query, document) pair together
- Assigns a relevance score
- Example: the retriever might rank Doc A higher, but the re-ranker correctly boosts Doc B to the top
Without re-ranking :
- LLM gets noisy context
- Hallucinations increase
- Answer quality drops
With re-ranking :
- Better context precision
- Lower token waste
- More accurate responses
When to use what kind of LLM based re-ranking?
- Simple system - LLM Score Based
- Small dataset - Pair wise
- Production RAG - List wise
- Ambiguous queries - Query-aware
- Enterprise system - Hybrid (LLM Based)
Implementation of re-ranking :
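The full implementation is not reproduced here, but LLM score-based re-ranking (the "simple system" option above) can be sketched as follows. `score_fn` is a stand-in for a real LLM call (e.g. a prompt like "Score 0-1 how relevant this document is to the query; return only a number"); the stub scorer below exists only so the sketch runs offline, and all names are my own.

```python
def llm_score_rerank(query, docs, score_fn, top_k=5):
    """LLM score-based re-ranking: score each (query, document) pair,
    then sort by relevance score, descending."""
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]

# Stub scorer: token overlap stands in for the LLM's judgement.
def fake_scorer(query, doc):
    overlap = set(query.lower().split()) & set(doc.lower().split())
    return len(overlap) / max(len(query.split()), 1)

docs = ["refund policy for electronics", "order tracking page", "refund timelines"]
print(llm_score_rerank("electronics refund policy", docs, fake_scorer, top_k=2))
```

The pluggable `score_fn` design also makes the reranker easy to unit-test and to swap (LLM scoring, cross-encoder, or a hybrid) without touching the pipeline.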
Important points to remember :
- We have seen the entire RAG pipeline, i.e. Retrieval + Augmentation + Generation
- We discussed all the steps involved in Retrieval, Augmentation, and Generation
- A common myth is that RAG is complete at this point, but NO
- We should evaluate it with proper metrics to confirm that we are getting the best results
- Most AI engineers fail to explain this part
- Let us see what it is.
Evaluation Metrics are classified into two categories
- Retrieval Metrics
- Precision@K
- Recall@K
- MRR (Mean Reciprocal Rank)
- NDCG@K
- Context Relevance
- Generation Metrics
- Faithfulness
- Answer Relevancy
- Groundedness
As production-grade AI engineers, we should be able to explain all of these techniques. Remember: retrieval metrics evaluate the quality of the Retrieval step, while generation metrics evaluate the Generation step.
There are frameworks like RAGAS, TruLens etc. that compute these same metrics, but it is better to learn the hard way by implementing them ourselves. Let's see how these metrics work.
Retrieval Metrics
- We apply the below techniques after retrieval (post re-ranking)
1) Precision@K
What is K ? If the end user asks for the top 3 / 5 / 10 results, that number is K (this K value applies after re-ranking).
User question : What is the eligibility criteria for a home loan ?
Result :
The formula for precision is as below:
Precision@K = Relevant documents in top K / K; here Precision@5 = 3 / 5 = 0.6
Means, 60% of the retrieved documents are useful.
Industry standards of Precision@K :
- 0.8 - 1.0 - Excellent
- 0.6 - 0.8 - Acceptable
- < 0.6 - Poor (the system is performing very poorly)
Then what is the improvement technique ?
- Apply Pre-Filters (metadata filters, product-type filters etc.)
- Improve Re-Ranking
- Reduce the K value: instead of 5, go with 3 (this is controlled by the end user)
2) Recall@K
User question : What is the eligibility criteria for a home loan ?
Ground truth (the total number of relevant documents is 5), taken from the Golden Data Set:
- Salary
- Credit score
- Age
- Employment Type
- Existing loans
But our system retrieved only 3 relevant documents, while the ground truth has 5.
Formula for Recall@K = Relevant documents in top K / Total no. of relevant documents = 3 / 5 = 0.6
Means, 60% of all the relevant documents were actually retrieved (recall measures coverage of the relevant set, not how useful the retrieved list is).
Industry standards of Recall@K :
- 0.8 - 1.0 - Excellent
- 0.6 - 0.8 - Acceptable
- < 0.6 - Poor (the system is missing too many relevant documents)
Then what is the improvement technique ?
- Increase K value
- Use Hybrid search (BM25 + Semantic)
- Add Query Expansion
3) Mean Reciprocal Rank (MRR)
User question : What is EMI ?
Result :
Formula for the Reciprocal Rank of one query = 1 / Rank of the first relevant result = 1 / 2 = 0.5. MRR is the mean of this value across all evaluation queries.
An MRR of 0.5 means the first relevant result sits, on average, around rank 2.
Interpretation of MRR: the correct answer is being retrieved, but not immediately (ideally it should be at Rank-1).
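Since MRR is a mean over queries, a tiny worked example helps (the per-query ranks here are made up):

```python
def mrr(first_relevant_ranks):
    """first_relevant_ranks: per-query rank of the first relevant hit,
    or None if nothing relevant was retrieved for that query."""
    reciprocal = [1.0 / r if r else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)

# Query 1: first relevant hit at rank 2; query 2: rank 1; query 3: no relevant hit
print(mrr([2, 1, None]))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```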
Industry standards of MRR :
- > 0.8 - Correct answer is usually at rank-1
- 0.5 - 0.8 - OK
- < 0.5 - Poor Ranking
Then what is the improvement technique ?
- Improve reranking strategy
- Try multiple reranking mechanisms and see output
- Tune your embedding model
- We use embeddings in 2 places
- 1st while creating the knowledge base
- 2nd while processing the user query
- Try using one model at both places (e.g. OpenAI's small embedding model) and observe the results
- Try another model at both places and compare with the first model's results
- Keep experimenting until you get good results, then fix on that model
4) NDCG@K
User question : Best way to reduce home loan interest ?
Result :
Step1 : Calculate DCG@K
Step2 : Calculate IDCG@K
Step3 : Calculate NDCG
NDCG@K = DCG@K / IDCG@K = 3.76 / 4.76 = 0.79
Interpretation : a score of 0.79 means the ranking is decent but not optimal
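The 3.76 / 4.76 numbers above are consistent with assuming graded relevance [2, 2, 1] for the documents in retrieved order, and an ideal ordering [3, 2, 1] (i.e. a grade-3 document exists but was not ranked first). Under that assumption, the three steps compute as:

```python
import math

def dcg(grades):
    # DCG = sum of grade_i / log2(i + 1), with positions i starting at 1
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))

retrieved_grades = [2, 2, 1]  # assumed relevance grades, retrieved order
ideal_grades = [3, 2, 1]      # assumed best possible ordering

dcg_k = dcg(retrieved_grades)       # step 1: DCG@K  ~ 3.76
idcg_k = dcg(ideal_grades)          # step 2: IDCG@K ~ 4.76
print(round(dcg_k / idcg_k, 2))     # step 3: NDCG@K = 0.79
```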
Industry standards of NDCG :
- > 0.9 - Near perfect ranking
- 0.7 - 0.9 - Good ranking
- < 0.7 - Ranking problem
How to improve this quality ?
- Improve reranking strategy
- Improve relevance labelling
- Write an appropriate prompt to attach a relevance label to each document after reranking
5) Context Relevancy
User question : How to improve credit score ?
Result :
Formula for context relevancy = Relevant chunks / Total number of chunks = 3 / 5 = 60%
Interpretation : 40% of noise in retrieved context
Industry standards of Context Relevancy:
- > 0.8 - Clean context
- 0.6 - 0.8 - Some noise is associated
- < 0.6 - Noisy retrieval
How to improve the quality of context relevancy ?
- Select best chunking strategy
- Add semantic / pre filters (in meta data)
- Use appropriate reranking strategy
Important points about evaluation techniques in RAG :
- The above 5 evaluation techniques relate to the retrieval process in RAG
- We should calculate these metrics after retrieval (post re-ranking) to confirm that we built a good RAG system; it also helps to show the metrics to the client
- If we are not clear on the above 5 techniques, then we are not building a production-grade RAG; it will be just a toy project
Generation Metrics
- Faithfulness
- Answer Relevancy
- Groundedness
1) Faithfulness
User question : How to improve my credit score ?
The context we got as part of Retrieval + Augmentation is :
- Pay EMI on time
- Reduce credit card utilization
LLM Response :
Pay EMI's on time, reduce credit card utilization and invest in gold
'Pay EMIs on time' and 'reduce credit card utilization' are from the context, BUT 'invest in gold' is generated by the LLM.
Formula for Faithfulness = Supported claims / Total claims = 2 / 3 = 0.67
Means, 33% of the answer is hallucinated. If we present 33% hallucinated content to a customer, they won't be happy.
Benchmarks :
- > 0.9 - Very safe
- 0.7 - 0.9 - Minor Issues
- < 0.7 - hallucinated data
How to prevent hallucination :
- Write strict prompting as below
- You answer only from context
- DO NOT generate hallucinated answers
- Reduce the temperature value
- Go towards more deterministic answers (temperature < 0.5)
2) Answer Relevancy
User question : How to improve my credit score ?
Answer : Credit score is calculated using your financial history
Formula for Answer Relevancy = Similarity (User Query, LLM response)
Benchmarks :
- > 0.85 - Strong Alignment
- 0.6 - 0.85 - Partial
- < 0.6 - Wrong Answer
How to improve it ?
- Improve Query Reformulation
- Query intent fast-fail (understand the intent; only if it is meaningful, move to the next step)
- Improve the prompt instructions according to the user query
3) Groundedness
User question : How to improve my credit score ?
Context :
- Pay EMIs on time
- Reduce the utilization of credit cards
LLM Response :
Pay EMIs, reduce the utilization of credit cards, avoid loans & invest in stocks
Formula for Groundedness = Supported claims / Total claims = 2 / 4 = 0.5
Interpretation : 50% of the answer ('avoid loans' and 'invest in stocks') is not supported by the context
Benchmarks :
- > 0.9 - Fully grounded
- 0.7 - 0.9 - Mostly grounded
- < 0.7 - Unsafe
How to increase Groundedness :
- Force context only answers in prompt
- Add Retrieval citations (Citations will be produced by LLM)
Implementation :
import os
import re
import json
import numpy as np
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from sentence_transformers import SentenceTransformer
import chromadb

load_dotenv()
assert os.getenv("OPENAI_API_KEY")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

client = chromadb.Client()
collection = client.create_collection("rag_eval")

documents = [
    "Diabetes affects blood sugar levels.",
    "Hypertension increases heart disease risk.",
    "Diversification reduces investment risk.",
    "Neural networks are used in deep learning."
]
collection.add(
    documents=documents,
    embeddings=[embedding_model.encode(d).tolist() for d in documents],
    ids=[str(i) for i in range(len(documents))]
)

# GOLDEN DATASET
golden_data = [
    {
        "query": "What is diabetes?",
        "relevant_docs": ["Diabetes affects blood sugar levels."],
        "answer": "Diabetes affects blood sugar levels."
    },
    {
        "query": "How to reduce investment risk?",
        "relevant_docs": ["Diversification reduces investment risk."],
        "answer": "Diversification reduces investment risk."
    }
]

def retrieve(query, k=2):
    emb = embedding_model.encode(query).tolist()
    results = collection.query(query_embeddings=[emb], n_results=k)
    return results["documents"][0]

# ---------- Retrieval metrics ----------

def precision_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    rel = sum(1 for doc in retrieved_k if doc in relevant)
    return rel / k

def recall_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    rel = sum(1 for doc in retrieved_k if doc in relevant)
    return rel / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    dcg = 0
    for i, doc in enumerate(retrieved[:k]):
        if doc in relevant:
            dcg += 1 / np.log2(i + 2)
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0

# ---------- LLM-judged metrics ----------

def extract_score(text):
    try:
        # Extract the first float number from the LLM response
        match = re.search(r"\d*\.?\d+", text)
        if match:
            return float(match.group())
    except Exception:
        pass
    return 0.0  # fallback

def faithfulness(query, answer, context):
    prompt = f"""
    Score from 0 to 1. Return ONLY a number.
    Context: {context}
    Answer: {answer}
    """
    response = llm.invoke(prompt).content
    return extract_score(response)

def answer_relevancy(query, answer):
    prompt = f"""
    Score from 0 to 1. Return ONLY a number.
    Query: {query}
    Answer: {answer}
    """
    response = llm.invoke(prompt).content
    return extract_score(response)

def context_precision(query, context):
    prompt = f"""
    Score from 0 to 1. Return ONLY a number.
    Query: {query}
    Context: {context}
    """
    response = llm.invoke(prompt).content
    return extract_score(response)

def context_recall(query, context, golden_answer):
    prompt = f"""
    Score from 0 to 1. Return ONLY a number.
    Context: {context}
    Expected Answer: {golden_answer}
    """
    response = llm.invoke(prompt).content
    return extract_score(response)

def generate_answer(query, context):
    prompt = f"""
    Answer using context only.
    Context: {context}
    Query: {query}
    """
    return llm.invoke(prompt).content

def evaluate_rag(golden_data):
    results = []
    for item in golden_data:
        query = item["query"]
        relevant_docs = item["relevant_docs"]
        golden_answer = item["answer"]

        retrieved_docs = retrieve(query, k=2)
        context = "\n".join(retrieved_docs)
        answer = generate_answer(query, context)

        # Retrieval metrics
        p = precision_at_k(retrieved_docs, relevant_docs, k=2)
        r = recall_at_k(retrieved_docs, relevant_docs, k=2)
        ndcg = ndcg_at_k(retrieved_docs, relevant_docs, k=2)

        # Generation metrics
        faith = faithfulness(query, answer, context)
        ans_rel = answer_relevancy(query, answer)

        # Context metrics
        ctx_p = context_precision(query, context)
        ctx_r = context_recall(query, context, golden_answer)

        results.append({
            "query": query,
            "precision@k": p,
            "recall@k": r,
            "ndcg@k": ndcg,
            "faithfulness": faith,
            "answer_relevancy": ans_rel,
            "context_precision": ctx_p,
            "context_recall": ctx_r
        })
    return results

if __name__ == "__main__":
    results = evaluate_rag(golden_data)
    print(json.dumps(results, indent=2))
Output :
[
{
"query": "What is diabetes?",
"precision@k": 0.5,
"recall@k": 1.0,
"ndcg@k": 1.0,
"faithfulness": 0.8,
"answer_relevancy": 0.8,
"context_precision": 0.3,
"context_recall": 0.5
},
{
"query": "How to reduce investment risk?",
"precision@k": 0.5,
"recall@k": 1.0,
"ndcg@k": 1.0,
"faithfulness": 1.0,
"answer_relevancy": 0.8,
"context_precision": 0.8,
"context_recall": 1.0
}
]
Conclusion :
- That's all about RAG and RAG metrics
- I will see you guys in my next MCP blog !
- Automated Frameworks
- RAGAS
- Documentation for RAGAS is available at : https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/
- TruLens
- Documentation for TruLens is available at : https://www.trulens.org/
LLM Fine Tuning :
- PEFT - Parameter-Efficient Fine-Tuning
- LoRA
- QLoRA
- FFT - Full Fine-Tuning
Note that Agent fine-tuning, RAG tuning, and LLM fine-tuning are different topics. We discussed RAG tuning above; LLM fine-tuning is yet to be discussed.
Please find some advanced reranking techniques below :
Thank you for reading this blog !
Arun Mathe