In my previous blog, I discussed the complete indexing part: how to extract data from multiple file sources, chunking, embedding, and the vector store/DB. This is a very important step, because it prepares the knowledge base that injects our company-specific, non-confidential information into the 'retrieval' step of RAG when processing a user query. Only once the knowledge base is ready can we move into the actual RAG implementation, which is what this blog covers. If you are planning to implement the complete RAG pipeline, I recommend reading the blog below first.
https://arunsdatasphere.blogspot.com/2026/04/ai-blog17-rag-preparing-knowledge-base.html
RAG (Retrieval Augmented Generation)
RAG is a technique where an AI model first retrieves relevant information from an external knowledge source (like a vector database) and then uses it to generate more accurate, context-aware responses.
Look at the image above to understand the order of components in RAG. Below is the sequence of components to follow while developing a RAG pipeline:
- Query Reformulation
- Query Expansion
- Intent Retrieval/Validation
- Pre Filter
- Post Filter
- Hybrid Search
- Semantic/Vector search
- BM25 (Keyword search)
- Semantic + BM25 (Hybrid search) - RRF
- Re-ranking (very important - mandatory technique)
- LLM Score based re-ranking
- Pair wise re-ranking
- List wise re-ranking
- Query aware re-ranking
- Hybrid re-ranking
- Evaluation metrics
- Precision
- Recall
- NDCG
- Faithfulness
- Answer relevancy
- Context precision
- Context recall
We will discuss every technique mentioned in the list above. Together, these steps make up Retrieval + Augmentation + Generation.
Let us assume, below is the user query :
"How to cancel my order and what is the refund policy for electronic items ?" When we submit this query to a RAG system, it enters the Retrieval step. But the system should not immediately embed the query and check its similarity against the chunks in the vector DB.
Internally, we need to perform the following steps first:
- Query reformulation
- Query expansion
- Intent validation
Query Reformulation
Assume my question is just "AI". How am I supposed to interpret this question? If I ask an AI just 'AI', is there any meaning? The correct search strings might be 'What is AI?' or 'What are the applications of AI?', and then we would get the correct response. For a moment, ignore AI and RAG entirely: even if you ask a person simply 'AI', do you think the other person will understand what you are talking about? No, right? This is when Query Reformulation is required. As part of Query Reformulation, we validate whether the query is well-formed. If it is not, we reformulate it so the AI understands what the user is actually asking about.
Example :
The intended question is "How to cancel my order ?". If a user instead types "Order Cancel", that is a vague query. Whenever users ask these kinds of questions, we need to articulate the query in a meaningful way. This is called Query Reformulation.
Query Expansion
Assume:
- User-1 is using the application for the first time
- User-2 is using it for the 50th time
- User-3 is using it for the 1000th time
Each of them will phrase the same need very differently. Query Expansion handles this by generating multiple related queries from the single user query, so retrieval covers all the likely phrasings.
Intent Validation
Let us assume you created this bot/agent for an e-commerce application, but a user asks a question about healthcare. The intent of the bot is not related to healthcare, right? Here we need to validate the query: if the intent is appropriate, we take the user query to the next step; otherwise, we inform the user about the actual purpose of the bot/agent, with something like "This bot is mainly meant for e-commerce queries!" (some response that lets the user understand what the bot is about).
To make it simple: in the retrieval step, we need to enable the above 3 techniques. If we skip them, we will get irrelevant output.
We can handle these in 2 ways:
- Either we maintain metadata while creating the knowledge base; using that metadata we can take care of Query Reformulation, Query Expansion & Intent Validation
- Otherwise, we take the help of an LLM
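To make the LLM route concrete, here is a minimal sketch. Everything in it is illustrative: `build_query_understanding_prompt`, `understand_query`, and the stubbed `fake_llm` are hypothetical names I've chosen, and a real pipeline would send the prompt to an actual chat model (e.g. ChatOpenAI, as in the implementation later in this post) instead of the stub.

```python
import json

def build_query_understanding_prompt(query: str, domain: str) -> str:
    """One prompt that asks the LLM to reformulate, expand, and
    validate intent in a single call (fewer tokens than three calls)."""
    return (
        f"You are the query-understanding step of a {domain} assistant.\n"
        f"User query: {query!r}\n"
        "1. Reformulate it into a clear, complete question.\n"
        "2. Generate 3 related queries (query expansion).\n"
        "3. Classify intent as IN_DOMAIN or OUT_OF_DOMAIN.\n"
        "Return JSON with keys: reformulated, expansions, intent."
    )

def understand_query(query: str, domain: str, ask_llm) -> dict:
    # ask_llm is injected so this sketch runs without an API key;
    # in the real pipeline it would wrap a chat-model call
    return json.loads(ask_llm(build_query_understanding_prompt(query, domain)))

# Stubbed LLM response for the vague query "Order Cancel":
fake_llm = lambda prompt: json.dumps({
    "reformulated": "How do I cancel my order?",
    "expansions": ["order cancellation steps", "cancel order online",
                   "cancel order refund"],
    "intent": "IN_DOMAIN",
})
result = understand_query("Order Cancel", "e-commerce", fake_llm)
print(result["reformulated"])
```

A single combined prompt is a design choice to save tokens; you can equally run reformulation, expansion, and intent validation as three separate calls.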
Pre Filter
Let us say the user's intent is to understand the 'refund policy', and he entered a query to get this information. Do we need to search the entire data, or only refund-policy-related information? It is smarter to search only the 'refund policy' related information, right? This is called Pre Filter.
Post Filter
This comes after the search is completed. We may still have to filter the context returned by the (pre-filtered) search. This is called Post Filter.
One line summary :
- Pre-Filtering narrows what you search
- Post-Filtering fixes what you found.
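The one-line summary can be sketched in plain Python. The chunk data, categories, and similarity scores below are made up for illustration; in a real pipeline the pre-filter would typically be a metadata filter pushed down to the vector DB, and the scores would come from the similarity search.

```python
chunks = [
    {"text": "Electronics can be refunded within 10 days.", "category": "refund_policy"},
    {"text": "Orders ship within 2 business days.", "category": "shipping"},
    {"text": "Refunds go back to the original payment method.", "category": "refund_policy"},
]

def pre_filter(chunks, category):
    # Narrows WHAT we search: keep only chunks whose metadata matches,
    # so the similarity search runs over fewer vectors (lower latency/cost).
    return [c for c in chunks if c["category"] == category]

def post_filter(scored_results, min_score=0.5):
    # Fixes what we FOUND: drop weak matches after the search completes.
    return [c for c, score in scored_results if score >= min_score]

search_space = pre_filter(chunks, "refund_policy")   # 2 of the 3 chunks
# pretend the vector search returned these (chunk, similarity) pairs:
scored = [(search_space[0], 0.91), (search_space[1], 0.34)]
final_context = post_filter(scored)                  # only the strong match survives
print([c["text"] for c in final_context])
```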
Hybrid search
Hybrid search is a combination of Semantic/Vector search + Keyword search.
In healthcare, we should not assume anything; we need to give exact results with the correct keywords. We need to use keyword search in such cases for exact results.
One more example: a user is looking for the 'Refund Policy', specifically the 'Electronics Refund Policy'. In this situation, semantic search can find general 'Refund Policy' content, but for the exact phrase 'Electronics Refund Policy', keyword search works better.
Hybrid would be something like:
- 70% Semantic search & 30% Keyword search
- 50% Semantic search & 50% Keyword search
- 30% Semantic search & 70% Keyword search
It all depends on the use case.
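One way to implement such a weighted split is sketched below, under the assumption that we min-max normalize both score sets first so BM25 scores and cosine similarities become comparable. All names and numbers here are illustrative.

```python
def normalize(scores):
    # Min-max normalize so BM25 scores and cosine similarities are comparable
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def hybrid_scores(semantic, bm25, alpha=0.7):
    """alpha = weight for semantic search; (1 - alpha) goes to keyword/BM25."""
    sem, kw = normalize(semantic), normalize(bm25)
    docs = set(sem) | set(kw)
    return {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
            for d in docs}

semantic = {"doc1": 0.82, "doc2": 0.60, "doc3": 0.31}  # cosine similarities (made up)
bm25 = {"doc1": 2.1, "doc2": 7.4, "doc3": 0.9}         # BM25 scores (made up)
print(hybrid_scores(semantic, bm25, alpha=0.7))        # 70% semantic / 30% keyword
```

Note how the winner depends on the weighting: with alpha=0.7 the semantically strong doc1 ranks first, while with alpha=0.3 the keyword-strong doc2 takes over, which is exactly why the split is a use-case decision.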
Assume we got the top 10 results from Hybrid search. The next question is whether these top 10 results are arranged by relevance. Based on the similarity score, they should be arranged from the highest to the lowest score. This happens during Re-Ranking.
Note, all the above steps happen during the Retrieval step. Once they are done, the relevant context is combined with the user query (Augmentation) and submitted to the LLM, which produces the answer (Generation).
Modern RAG Flow :
- User Query
- Query understanding
- Reformulation
- Expansion
- Intent validation(if irrelevant - inform user)
- Pre-filtering(Optional but common)
- Metadata filters(category, tenant, language etc. - search only required data)
- Hybrid search
- Semantic/vector search(Pinecone etc.)
- Keyword/BM25 search
- Post-Filtering( if required - fix what you found)
- Re-ranking
- Improves top-k quality significantly
- Context selection/compression
- Remove redundancy
- Fit within token limits (often missed in production flows; otherwise you hit the context window limit)
- Prompt construction
- Combine User query + retrieved context
- Instructions / system prompt
- LLM Generation
- Post Processing
- Format output
- Guardrails (hallucination checks, safety)
- Evaluation & Feedback loop
- Logging
- Metrics(Precision, Recall, Faithfulness)
- Continuous improvement
Retrieval Strategies
This is the foundation layer of RAG quality. If retrieval is weak, no reranking or generation can fully fix it.
Let's see 3 core Retrieval strategies:
- Query Formulation
- Transforming the user's raw query into a better search query
- User query - reformulated query - retrieval
- Query Expansion
- Intent Validation
The main problem is that users don't speak database language. They generally write queries in natural language. We need to interpret it.
Example:
Imagine asking a librarian for "Books about things going wrong in chips" when the catalog is indexed under "Semiconductor failure mechanisms". The librarian reformulates your question before searching. That is Query Formulation. An LLM can take care of the reformulation.
Types of query formulation:
- Semantic rewriting
- Keyword enrichment
- Domain normalization
- Clarification-based reformulation
It bridges the gap between user language and document language.
2) Query Expansion
Generating multiple related queries from user query.
- User query - N queries - Retrieval - Merge results
Example: "What causes diabetes ?"
But the documents might use terms like:
- Blood sugar disorder
- Insulin resistance
- Glucose imbalance
So we expand the query into variants such as:
- Causes of diabetes
- Insulin resistance explanation
- Blood sugar disorder causes
The disadvantage is cost, as we use an LLM for this feature. More tokens, more money.
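A minimal sketch of query expansion, with the LLM call stubbed out so the example stays runnable. `expand_query` and `fake_llm` are hypothetical names; in practice `ask_llm` would wrap a real chat-model call, which is where the extra token cost comes from.

```python
def expand_query(query, ask_llm, n=3):
    """Ask the LLM for n paraphrases; keep the original query too,
    so an exact match is never lost."""
    prompt = (
        f"Generate {n} alternative search queries for: {query!r}. "
        "Use domain synonyms. Return one query per line."
    )
    expansions = [q.strip() for q in ask_llm(prompt).splitlines() if q.strip()]
    return [query] + expansions[:n]

# Stubbed LLM so the sketch runs without an API key:
fake_llm = lambda prompt: (
    "causes of insulin resistance\n"
    "blood sugar disorder causes\n"
    "glucose imbalance triggers"
)
print(expand_query("What causes diabetes ?", fake_llm))
```

Each expanded query is then retrieved independently and the result lists are merged (for example with RRF, covered below in this post).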
3) Intent Validation
Check whether the query is:
- Relevant
- Valid
- Safe
- In-domain
Example :
A user asks "What is the capital of France ?" when our domain is medical.
Full retrieval pipeline:
User Query - Intent Validation - Query formulation - Query Expansion - Vector/Hybrid retrieval - Context - LLM - Answer
Implementation of Retrieval Strategies :
Pre/ Post Filtering
Pre-Filtering:
Suppose the user is looking for information about the "sick leave policy" and the query is "How many sick leaves are allowed per quarter ?". Clearly this information belongs to the HR policy about sick leaves. But if you search the entire data (all vectors) in the vector DB, latency suffers and the search takes a long time. Hence we apply a filter before the search, routing it to only the HR-policy-related chunks in the vector DB. On top of this, if your vector DB is cloud hosted, scanning everything will cost a fortune. We need to keep these considerations in mind while designing the RAG pipeline. This is called Pre-Filtering.
Always apply constraints like:
- Domain
- Category
- Time
- Source
- Meta Data
Post-Filtering:
- Even after retrieval, Top-k results ≠ Fully relevant
- Some results are partially relevant, noisy, misleading
Implementing Pre-Post Filtering :
Output :
Vector store, Chroma DB schema :
Let's recap what we have learnt so far.
We have discussed the following retrieval strategies:
- Intent validation
- Query reformulation
- Query expansion
- Pre-Filter
- Post-Filter
- Search strategies
- Semantic search(vector)
- Keyword search(BM25 - Best Match version 25)
- Hybrid search (Semantic + BM25) - RRF
- Re-Ranking
Hybrid search strategy
Hybrid search strategy is one of the most important ideas in modern RAG systems. Hybrid search is where "retrieval" becomes more powerful. It is the combination of Keyword search(BM25) & Semantic search(Embeddings).
Pipeline view:
User query - BM25 (keyword search) & Semantic search (vector), run in parallel - Fusion (RRF)
Problem with BM25:
- BM25 relies on exact words
- Ex: "Heart attack causes" but document says "myocardial infarction reasons"
- BM25 fails (no exact match)
- Semantic search understands meaning but can miss exact keywords
- Ex: "Python list append syntax" but semantic search may return "How to modify arrays in programming" - this is too generic
RRF (Reciprocal Rank Fusion)
RRF combines rankings from multiple retrievers.
- Key idea is, instead of combining scores, combine ranks
- Formula: Score = 1/(k + rank), summed over each rank list the document appears in, where
- rank = position of the document in that list
- k = smoothing constant, usually 60; it dampens the difference between adjacent ranks (note: it is NOT a 60/40 weightage split between semantic search and BM25)
- Example with two rank lists:
- BM25 list: A - 1st, B - 2nd, C - 3rd
- Semantic list: B - 1st, A - 2nd, C - 3rd
- After fusion:
- A - good in both lists
- B - strong in one
- C - moderate in both
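The RRF formula applied to exactly these two rank lists can be sketched as below. One honest observation: with these particular lists, A and B end up with identical fused scores because their ranks mirror each other (1st+2nd vs 2nd+1st), while C, consistently 3rd in both, lands last; RRF rewards consistent placement across retrievers.

```python
def rrf(rank_lists, k=60):
    # fused score(doc) = sum over lists of 1 / (k + rank), rank starting at 1
    scores = {}
    for ranks in rank_lists:
        for rank, doc in enumerate(ranks, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranks = ["A", "B", "C"]      # BM25: A 1st, B 2nd, C 3rd
semantic_ranks = ["B", "A", "C"]  # Semantic: B 1st, A 2nd, C 3rd
fused = rrf([bm25_ranks, semantic_ranks])
for doc, score in fused:
    print(doc, round(score, 5))
```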
Full retrieval stack :
Full pipeline :
Output :
Explanation :
- Observe that we are using an LLM for the retrieval techniques
- BM25, which is a keyword search, takes the query as input and returns the top 3 results based on score
- It splits the query and computes a score for each word/token
- It sorts the documents by score and stores them in a variable called ranked
- It returns the top-k documents
- Semantic retrieval
- It also accepts the user query and top_k as input
- It computes the embedding of the query
- Results are fetched from Chroma DB
- Hybrid retrieval gets sub-queries from expand_query() and adds them to a variable called queries
- The original user query is also appended to the same variable
- An empty list all_rank_lists is created to store the search results
- A loop runs over the number of queries
- It loops through BM25 and semantic search until all queries have been searched
- The search results are appended to all_rank_lists
- These rank lists are passed into RRF
- The top 5 results are returned
Finally, we ask some queries as below.
Important point: if you observe the above code carefully, especially the BM25 search, we are not using the vector DB for this type of search.
BM25 doesn't require a vector DB. It is a standalone retrieval technique based on an inverted index, while vector DBs enable semantic retrieval. In production RAG systems, both are often combined (hybrid search) to balance precision and recall.
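A from-scratch BM25 sketch makes the "no vector DB needed" point concrete. This is a minimal, illustrative implementation (simple whitespace tokenization, standard k1/b defaults), not production code; the sample documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25: inverted-index style scoring, no vectors involved."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in tokenized) / N
    # document frequency: in how many docs each term appears
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # BM25 only rewards exact term matches
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores

docs = [
    "refund policy for electronic items",
    "how to cancel an order",
    "shipping times for electronics",
]
scores = bm25_scores("refund policy", docs)
print(scores)
```

Notice that only the document containing the exact terms "refund" and "policy" gets a non-zero score, which is precisely BM25's weakness with synonyms and why we pair it with semantic search.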
Re-ranking strategies
Re-ranking in a RAG pipeline is where you take the initially retrieved documents and reorder them using a more accurate model, so that the most relevant context comes first.
Where it sits in RAG ?
User Query
↓
Retriever (BM25 / Vector / Hybrid) → gets top N (e.g., 20)
↓
Reranker (cross-encoder / LLM scoring) → reorders those 20
↓
Top-K selection (e.g., 5)
↓
LLM (final answer generation)
How reranking works ?
Instead of scoring documents independently, a reranker:
- Looks at (query, document) pair together
- Assigns a relevance score
- Example: the retriever might rank Doc A higher, but the re-ranker correctly boosts Doc B to the top
Without re-ranking :
- LLM gets noisy context
- Hallucinations increase
- Answer quality drops
With re-ranking :
- Better context precision
- Lower token waste
- More accurate responses
When to use what kind of LLM based re-ranking?
- Simple system - LLM Score Based
- Small dataset - Pair wise
- Production RAG - List wise
- Ambiguous queries - Query-aware
- Enterprise system - Hybrid (LLM Based)
Implementation of re-ranking :
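The full implementation is not reproduced here, but LLM score-based re-ranking (the "simple system" option above) can be sketched as follows. `score_fn` is a stand-in for a real LLM call (e.g. a prompt like "Score 0-1 how relevant this document is to the query; return only a number"); the stub scorer below exists only so the sketch runs offline, and all names are my own.

```python
def llm_score_rerank(query, docs, score_fn, top_k=5):
    """LLM score-based re-ranking: score each (query, document) pair,
    then sort by relevance score, descending."""
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]

# Stub scorer: token overlap stands in for the LLM's judgement.
def fake_scorer(query, doc):
    overlap = set(query.lower().split()) & set(doc.lower().split())
    return len(overlap) / max(len(query.split()), 1)

docs = ["refund policy for electronics", "order tracking page", "refund timelines"]
print(llm_score_rerank("electronics refund policy", docs, fake_scorer, top_k=2))
```

The pluggable `score_fn` design also makes the reranker easy to unit-test and to swap (LLM scoring, cross-encoder, or a hybrid) without touching the pipeline.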
Important points to remember :
- We have seen the entire RAG pipeline, i.e. Retrieval + Augmentation + Generation
- We discussed all the steps involved in Retrieval, Augmentation, and Generation
- A common myth is that RAG is complete at this point, but NO
- We should evaluate it with proper metrics to confirm that we are getting the best results
- Most AI engineers fail to explain this part
- Let us see what it is.
Evaluation Metrics are classified into two categories
- Retrieval Metrics
- Precision@K
- Recall@K
- MRR (Mean Reciprocal Rank)
- NDCG@K
- Context Relevance
- Generation Metrics
- Faithfulness
- Answer Relevancy
- Groundedness
As production-grade AI engineers, we should be able to explain all of these techniques. Remember: retrieval metrics evaluate the quality of the Retrieval step, while generation metrics evaluate the Generation step.
There are frameworks like RAGAS, TruLens etc. that compute these same metrics, but it is better to learn the hard way by implementing them ourselves. Let's see how these metrics work.
Retrieval Metrics
- We apply the below techniques after retrieval (post re-ranking)
1) Precision@K
What is K ? If the end user asks for the top 3 / 5 / 10 results, that number is K (this K value applies after re-ranking).
User question : What is the eligibility criteria for a home loan ?
Result :
The formula for precision is as below:
Precision@K = Relevant documents in top K / K; here Precision@5 = 3 / 5 = 0.6
Means, 60% of the retrieved documents are useful.
Industry standards of Precision@K :
- 0.8 - 1.0 - Excellent
- 0.6 - 0.8 - Acceptable
- < 0.6 - Poor (the system is performing very poorly)
Then what is the improvement technique ?
- Apply Pre-Filters (metadata filters, product-type filters etc.)
- Improve Re-Ranking
- Reduce the K value: instead of 5, go with 3 (this is controlled by the end user)
2) Recall@K
User question : What is the eligibility criteria for a home loan ?
Ground truth (the total number of relevant documents is 5), taken from the Golden Data Set:
- Salary
- Credit score
- Age
- Employment Type
- Existing loans
But our system retrieved only 3 relevant documents, while the ground truth has 5.
Formula for Recall@K = Relevant documents in top K / Total no. of relevant documents = 3 / 5 = 0.6
Means, 60% of all the relevant documents were actually retrieved (recall measures coverage of the relevant set, not how useful the retrieved list is).
Industry standards of Recall@K :
- 0.8 - 1.0 - Excellent
- 0.6 - 0.8 - Acceptable
- < 0.6 - Poor (the system is missing too many relevant documents)
Then what is the improvement technique ?
- Increase K value
- Use Hybrid search (BM25 + Semantic)
- Add Query Expansion
3) Mean Reciprocal Rank (MRR)
User question : What is EMI ?
Result :
Formula for the Reciprocal Rank of one query = 1 / Rank of the first relevant result = 1 / 2 = 0.5. MRR is the mean of this value across all evaluation queries.
An MRR of 0.5 means the first relevant result sits, on average, around rank 2.
Interpretation of MRR: the correct answer is being retrieved, but not immediately (ideally it should be at Rank-1).
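Since MRR is a mean over queries, a tiny worked example helps (the per-query ranks here are made up):

```python
def mrr(first_relevant_ranks):
    """first_relevant_ranks: per-query rank of the first relevant hit,
    or None if nothing relevant was retrieved for that query."""
    reciprocal = [1.0 / r if r else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)

# Query 1: first relevant hit at rank 2; query 2: rank 1; query 3: no relevant hit
print(mrr([2, 1, None]))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```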
Industry standards of MRR :
- > 0.8 - Correct answer is usually at rank-1
- 0.5 - 0.8 - OK
- < 0.5 - Poor Ranking
Then what is the improvement technique ?
- Improve reranking strategy
- Try multiple reranking mechanisms and see output
- Tune your embedding model
- We use embeddings in 2 places
- 1st while creating the knowledge base
- 2nd while processing the user query
- Try using one model at both places (e.g. OpenAI's small embedding model) and observe the results
- Try another model at both places and compare with the first model's results
- Keep experimenting until you get good results, then fix on that model
4) NDCG@K
User question : Best way to reduce home loan interest ?
Result :
Step1 : Calculate DCG@K
Step2 : Calculate IDCG@K
Step3 : Calculate NDCG
NDCG@K = DCG@K / IDCG@K = 3.76 / 4.76 = 0.79
Interpretation : a score of 0.79 means the ranking is decent but not optimal
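The 3.76 / 4.76 numbers above are consistent with assuming graded relevance [2, 2, 1] for the documents in retrieved order, and an ideal ordering [3, 2, 1] (i.e. a grade-3 document exists but was not ranked first). Under that assumption, the three steps compute as:

```python
import math

def dcg(grades):
    # DCG = sum of grade_i / log2(i + 1), with positions i starting at 1
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))

retrieved_grades = [2, 2, 1]  # assumed relevance grades, retrieved order
ideal_grades = [3, 2, 1]      # assumed best possible ordering

dcg_k = dcg(retrieved_grades)       # step 1: DCG@K  ~ 3.76
idcg_k = dcg(ideal_grades)          # step 2: IDCG@K ~ 4.76
print(round(dcg_k / idcg_k, 2))     # step 3: NDCG@K = 0.79
```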
Industry standards of NDCG :
- > 0.9 - Near perfect ranking
- 0.7 - 0.9 - Good ranking
- < 0.7 - Ranking problem
How to improve this quality ?
- Improve reranking strategy
- Improve relevance labelling
- Write an appropriate prompt to attach a relevance label to each document after reranking
5) Context Relevancy
User question : How to improve credit score ?
Result :
Formula for context relevancy = Relevant chunks / Total number of chunks = 3 / 5 = 60%
Interpretation : 40% of noise in retrieved context
Industry standards of Context Relevancy:
- > 0.8 - Clean context
- 0.6 - 0.8 - Some noise is associated
- < 0.6 - Noisy retrieval
How to improve the quality of context relevancy ?
- Select best chunking strategy
- Add semantic / pre filters (in meta data)
- Use appropriate reranking strategy
Important points about evaluation techniques in RAG :
- The above 5 evaluation techniques relate to the retrieval process in RAG
- We should calculate these metrics after retrieval (post re-ranking) to confirm that we built a good RAG system; it also helps to show the metrics to the client
- If we are not clear on the above 5 techniques, then we are not building a production-grade RAG; it will be just a toy project
Generation Metrics
- Faithfulness
- Answer Relevancy
- Groundedness
1) Faithfulness
User question : How to improve my credit score ?
The context we got as part of Retrieval + Augmentation is :
- Pay EMI on time
- Reduce credit card utilization
LLM Response :
Pay EMI's on time, reduce credit card utilization and invest in gold
'Pay EMIs on time' and 'reduce credit card utilization' are from the context, BUT 'invest in gold' is generated by the LLM.
Formula for Faithfulness = Supported claims / Total claims = 2 / 3 = 0.67
Means, 33% of the answer is hallucinated. If we present 33% hallucinated content to a customer, they won't be happy.
Benchmarks :
- > 0.9 - Very safe
- 0.7 - 0.9 - Minor Issues
- < 0.7 - hallucinated data
How to prevent hallucination :
- Write strict prompting as below
- You answer only from context
- DO NOT generate hallucinated answers
- Reduce the temperature value
- Go towards more deterministic answers (temperature < 0.5)
2) Answer Relevancy
User question : How to improve my credit score ?
Answer : Credit score is calculated using your financial history
Formula for Answer Relevancy = Similarity (User Query, LLM response)
Benchmarks :
- > 0.85 - Strong Alignment
- 0.6 - 0.85 - Partial
- < 0.6 - Wrong Answer
How to improve it ?
- Improve Query Reformulation
- Query intent fast-fail (understand the intent; only if it is meaningful, move to the next step)
- Improve the prompt instructions according to the user query
3) Groundedness
User question : How to improve my credit score ?
Context :
- Pay EMIs on time
- Reduce the utilization of credit cards
LLM Response :
Pay EMIs, reduce the utilization of credit cards, avoid loans & invest in stocks
Formula for Groundedness = Supported claims / Total claims = 2 / 4 = 0.5
Interpretation : 50% of the answer ('avoid loans' and 'invest in stocks') is not supported by the context
Benchmarks :
- > 0.9 - Fully grounded
- 0.7 - 0.9 - Mostly grounded
- < 0.7 - Unsafe
How to increase Groundedness :
- Force context only answers in prompt
- Add Retrieval citations (Citations will be produced by LLM)
Implementation :
import os
import re
import json
import numpy as np
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from sentence_transformers import SentenceTransformer
import chromadb

load_dotenv()
assert os.getenv("OPENAI_API_KEY")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

client = chromadb.Client()
collection = client.create_collection("rag_eval")

documents = [
    "Diabetes affects blood sugar levels.",
    "Hypertension increases heart disease risk.",
    "Diversification reduces investment risk.",
    "Neural networks are used in deep learning."
]
collection.add(
    documents=documents,
    embeddings=[embedding_model.encode(d).tolist() for d in documents],
    ids=[str(i) for i in range(len(documents))]
)

# GOLDEN DATASET
golden_data = [
    {
        "query": "What is diabetes?",
        "relevant_docs": ["Diabetes affects blood sugar levels."],
        "answer": "Diabetes affects blood sugar levels."
    },
    {
        "query": "How to reduce investment risk?",
        "relevant_docs": ["Diversification reduces investment risk."],
        "answer": "Diversification reduces investment risk."
    }
]

def retrieve(query, k=2):
    emb = embedding_model.encode(query).tolist()
    results = collection.query(query_embeddings=[emb], n_results=k)
    return results["documents"][0]

# ---------- Retrieval metrics ----------

def precision_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    rel = sum(1 for doc in retrieved_k if doc in relevant)
    return rel / k

def recall_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    rel = sum(1 for doc in retrieved_k if doc in relevant)
    return rel / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    dcg = 0
    for i, doc in enumerate(retrieved[:k]):
        if doc in relevant:
            dcg += 1 / np.log2(i + 2)
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0

# ---------- LLM-judged metrics ----------

def extract_score(text):
    try:
        # Extract the first float number from the LLM response
        match = re.search(r"\d*\.?\d+", text)
        if match:
            return float(match.group())
    except Exception:
        pass
    return 0.0  # fallback

def faithfulness(query, answer, context):
    prompt = f"""
    Score from 0 to 1. Return ONLY a number.
    Context: {context}
    Answer: {answer}
    """
    response = llm.invoke(prompt).content
    return extract_score(response)

def answer_relevancy(query, answer):
    prompt = f"""
    Score from 0 to 1. Return ONLY a number.
    Query: {query}
    Answer: {answer}
    """
    response = llm.invoke(prompt).content
    return extract_score(response)

def context_precision(query, context):
    prompt = f"""
    Score from 0 to 1. Return ONLY a number.
    Query: {query}
    Context: {context}
    """
    response = llm.invoke(prompt).content
    return extract_score(response)

def context_recall(query, context, golden_answer):
    prompt = f"""
    Score from 0 to 1. Return ONLY a number.
    Context: {context}
    Expected Answer: {golden_answer}
    """
    response = llm.invoke(prompt).content
    return extract_score(response)

def generate_answer(query, context):
    prompt = f"""
    Answer using context only.
    Context: {context}
    Query: {query}
    """
    return llm.invoke(prompt).content

def evaluate_rag(golden_data):
    results = []
    for item in golden_data:
        query = item["query"]
        relevant_docs = item["relevant_docs"]
        golden_answer = item["answer"]

        retrieved_docs = retrieve(query, k=2)
        context = "\n".join(retrieved_docs)
        answer = generate_answer(query, context)

        # Retrieval metrics
        p = precision_at_k(retrieved_docs, relevant_docs, k=2)
        r = recall_at_k(retrieved_docs, relevant_docs, k=2)
        ndcg = ndcg_at_k(retrieved_docs, relevant_docs, k=2)

        # Generation metrics
        faith = faithfulness(query, answer, context)
        ans_rel = answer_relevancy(query, answer)

        # Context metrics
        ctx_p = context_precision(query, context)
        ctx_r = context_recall(query, context, golden_answer)

        results.append({
            "query": query,
            "precision@k": p,
            "recall@k": r,
            "ndcg@k": ndcg,
            "faithfulness": faith,
            "answer_relevancy": ans_rel,
            "context_precision": ctx_p,
            "context_recall": ctx_r
        })
    return results

if __name__ == "__main__":
    results = evaluate_rag(golden_data)
    print(json.dumps(results, indent=2))
Output :
[
{
"query": "What is diabetes?",
"precision@k": 0.5,
"recall@k": 1.0,
"ndcg@k": 1.0,
"faithfulness": 0.8,
"answer_relevancy": 0.8,
"context_precision": 0.3,
"context_recall": 0.5
},
{
"query": "How to reduce investment risk?",
"precision@k": 0.5,
"recall@k": 1.0,
"ndcg@k": 1.0,
"faithfulness": 1.0,
"answer_relevancy": 0.8,
"context_precision": 0.8,
"context_recall": 1.0
}
]
Conclusion :
- That's all about RAG and RAG metrics
- I will see you guys in my next MCP blog !
- Automated Frameworks
- RAGAS
- Documentation for RAGAS is available at : https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/
- TruLens
- Documentation for TruLens is available at : https://www.trulens.org/
LLM Fine Tuning :
- PEFT - Parameter-Efficient Fine-Tuning
- LoRA
- QLoRA
- FFT - Full Fine-Tuning
Note that Agent fine-tuning, RAG tuning, and LLM fine-tuning are different topics. We discussed RAG tuning above; LLM fine-tuning is yet to be discussed.
Please find some advanced reranking techniques below :
Thank you for reading this blog !
Arun Mathe