(AI Blog#17) RAG - Preparing Knowledge Base - Data Extraction, Chunking, Embedding, Vector Store/DB
RAG (Retrieval Augmented Generation) is a technique to make AI models (LLMs) more accurate, up to date and context aware by combining two things:
- Retrieval (fetching relevant data)
- Generation (creating a response using an LLM)
Why do we need it? On their own, LLMs:
- Have fixed knowledge (based on training data)
- Can hallucinate (make up answers)
- Don't know your private/company data
RAG Pipeline :
Please refer to the image below; it covers all the topics we discuss in this blog and the next one.
Understand that LLMs are pre-trained models: they extract data from the internet via various sources and train on it. To make this clear, if you ask an LLM a question like "What is the capital of Andhra Pradesh?" it will say "Amaravati", but if you ask "What is our company's sick leave policy?" it will be confused, because it doesn't know which company you are referring to! It doesn't have access to our company's database. This is where RAG comes into the picture: to inject our project-specific data in a safe way.
The first and foremost pre-step before building a RAG system is preparing a Knowledge Base. It could be a PDF file, a web page, a relational database or a file system - any data source that is proprietary to our organization. We have to place this data into a database called a Vector DB. This is a 4-step process, called Indexing:
- Data Extraction
- Data Chunking
- Data Embedding
- Store Embedded data into Vector DB
This means our organization's proprietary data will be stored in a system called a Vector DB. This process is called indexing, i.e. storing our data/knowledge base.
Note: It is extremely important to note that this is NOT the retrieval phase of RAG. This is indexing, which prepares our knowledge base, and it must happen even before we start building RAG.
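The four indexing steps above can be sketched as a minimal pipeline. Everything here is illustrative: the toy `embed` function (a deterministic hash-based stand-in for a real embedding model) and the in-memory list standing in for a Vector DB are assumptions, not a real implementation.

```python
import hashlib

def extract(source: str) -> str:
    # Step 1: data extraction (here the "source" is already plain text)
    return source

def chunk(text: str, size: int = 40) -> list[str]:
    # Step 2: fixed-size chunking (real systems use smarter strategies)
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text: str, dims: int = 8) -> list[float]:
    # Step 3: toy embedding - a real system calls an embedding model here
    digest = hashlib.sha256(chunk_text.encode()).digest()
    return [b / 255 for b in digest[:dims]]

vector_db: list[dict] = []  # Step 4: in-memory stand-in for a Vector DB

def index(source: str) -> None:
    for c in chunk(extract(source)):
        vector_db.append({"text": c, "embedding": embed(c)})

index("Our company's sick leave policy allows 12 paid sick days per year.")
print(len(vector_db), "chunks indexed")
```

In a real system, each step is a library call (an extractor, a chunker, an embedding API, a Vector DB client), but the shape of the pipeline stays the same.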
What is RAG ?
Once you have built this knowledge base, suppose a user has a question: "What are the benefits of renewable energy?" Assume the answer to this question is our company's proprietary data, available in our knowledge base. Once the RAG system receives this question from the user, the RETRIEVAL step starts. It reads the user query, converts it into an EMBEDDING (here chunking is an optional step, as the user query is usually much shorter than the context window), and then compares this embedded query with the embeddings of the chunks available in the Vector DB using similarity search. That is, the embeddings already stored in the Vector DB as part of preparing the knowledge base are retrieved based on their similarity score.
We have different ways to do a similarity search, but the most common in production is Cosine similarity, with the formula:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

Where:
- A, B are two vectors (in our case, the user query and a matched embedding in the Vector DB)
- A · B is the dot product of the two vectors
- ||A||, ||B|| are their magnitudes (Euclidean norms)
Final result :
If the cosine similarity score of two given vectors A, B is ~0.99, the two vectors are almost identical, and that chunk will be returned from the Vector DB. Similarly, we can retrieve the top-n most similar chunks from the Vector DB, based on the user's requirement.
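As a sketch, the cosine similarity formula above can be computed in plain Python (no external libraries; the example vectors are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # (A . B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.21, -0.45, 0.88, 0.13]  # embedded user query (made up)
chunk_vec = [0.20, -0.40, 0.90, 0.10]  # embedding stored in the Vector DB (made up)
print(round(cosine_similarity(query_vec, chunk_vec), 4))  # close to 1.0 -> near match
```

In production the Vector DB performs this comparison internally, typically against millions of stored vectors at once.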
I hope it is now clear how a similarity match can be found using Cosine similarity search in real time. This entire procedure is called the RETRIEVAL mechanism in RAG.
Also, we need to understand the concept of the Context Window to understand why we do chunking. The image below shows the context window sizes of the GPT models from OpenAI.
Context Window : A context window is the maximum amount of text (measured in tokens - each sub-word in the context is a token) that an AI model can read, remember and use at one time while generating a response. It is the model's short-term memory limit. The image above shows the context window size of each GPT model. It has evolved from 2k tokens in GPT-3 to 1M tokens in GPT-4.1, enabling modern LLMs to process entire documents, code bases and long conversations efficiently.
That's why, once we extract the data in step 1 (data extraction, as part of preparing the knowledge base), we chunk the data, convert the chunks into embeddings, and then store those embedded chunks in the Vector DB.
The output of the RETRIEVAL step is the set of corresponding embeddings from the knowledge base, collected based on similarity score. If the user asks for the top 3 chunks, then the results with the top 3 similarity scores are retrieved from the Vector DB.
Augmentation
The output of the RETRIEVAL step + the user query is what we call Augmentation. We use prompt techniques for it. Please refer to the diagram at the start of this blog.
Generation
Generation seeks the help of the LLM. The output of the retrieval step + the user query is submitted to the LLM, and the LLM articulates the final response. This is called Generation.
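As a sketch, augmentation is just assembling the retrieved chunks and the user query into one prompt before the generation call. The chunk texts and the prompt template here are made-up examples, and the final LLM call is left as a comment stub since it depends on your provider:

```python
def augment(user_query: str, retrieved_chunks: list[str]) -> str:
    # Augmentation: retrieved context + user query, packed into one prompt
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )

chunks = [
    "Solar and wind power produce no direct emissions.",
    "Renewable energy reduces dependence on imported fuels.",
]
prompt = augment("What are the benefits of renewable energy?", chunks)
print(prompt)

# Generation: send `prompt` to your LLM of choice, e.g.
# response = llm.generate(prompt)   # stub - the call is provider-specific
```

Numbering the chunks ([1], [2], ...) is a common convention that also lets the LLM cite which chunk it used.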
Important points to remember :
- In real projects, most engineers run into issues because they do not handle the Indexing part properly
- We need to understand the correct format of our input files and use the corresponding extractors to extract the text; otherwise we end up creating an incorrect knowledge base with stale data or no data
- For example, not all .pdf files are true PDFs - they might be screenshots wrapped up as a .pdf file, and if you use a plain PDF extractor on such files, you will definitely run into issues
- The Indexing part is the most complicated part; it needs to be handled cautiously and validated once done
- If you manage to handle this part right, then the rest of RAG will be comfortable
Let us start exploring the Indexing part, which is a pre-step for RAG.
Indexing
1) Data extraction & processing
We are going to deal with all of the file formats below:
- PDF
- Scanned images
- docx
- pptx
- html
- xlsx
- JSON
PDF Files : We can use the libraries below to parse PDF files
- PyPDF2
- PDFPlumber - Stable library for production
- PyMuPDF
Implementation of data extraction using PyPDF2 :
- We are referring to a file called "financial_report_2024.pdf" in a folder 'Data' inside the same directory where our code is located
- We use the Path class from pathlib to get the path
- We create a function, open the PDF file as 'f' and read each page using the PdfReader class
- Once a page is read, we add its content to a string variable `text` and return it
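A minimal sketch of the extraction just described, assuming PyPDF2 is installed (`pip install PyPDF2`) and that a `Data/financial_report_2024.pdf` file exists next to the script; the third-party import is kept inside the function so the module loads even without the library:

```python
from pathlib import Path

def extract_text_pypdf2(pdf_path: Path) -> str:
    from PyPDF2 import PdfReader  # third-party: pip install PyPDF2

    text = ""
    with open(pdf_path, "rb") as f:
        reader = PdfReader(f)
        for page in reader.pages:               # read page by page
            text += (page.extract_text() or "") + "\n"
    return text

# Usage (assuming the file exists):
# print(extract_text_pypdf2(Path("Data") / "financial_report_2024.pdf"))
```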
Implementation of data extraction using PDFPlumber :
Almost the same code, but using PDFPlumber (instead of PyPDF2) - and see how accurately it reads the text from the PDF file. It is far better than PyPDF2, hence it is the standard library we use in real projects for extracting data from PDF files.
It prints tables as tables and normal text as text, which is more powerful: PDFPlumber retains the exact structure of the source document.
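A sketch of the PDFPlumber version, under the same assumptions (library installed via `pip install pdfplumber`, file path is illustrative); tables are pulled out separately from the running text:

```python
from pathlib import Path

def extract_with_pdfplumber(pdf_path: Path) -> tuple[str, list]:
    import pdfplumber  # third-party: pip install pdfplumber

    text, tables = "", []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text += (page.extract_text() or "") + "\n"
            tables.extend(page.extract_tables())  # each table is a list of rows
    return text, tables

# Usage (illustrative path):
# text, tables = extract_with_pdfplumber(Path("Data") / "financial_report_2024.pdf")
```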
Output :
Observe the difference below between PyPDF2 vs PDFPlumber :
PDFPlumber properly articulates the content of the PDF. Hence it is recommended to use PDFPlumber in real projects.
Another example : we are still using PDFPlumber, but with one additional parameter, use_text_flow=True.
- The rest of the program is the same as in the first example. If we enable this property, PDFPlumber reads the PDF based on the flow of the text.
- If we don't enable this property, it reads the PDF row by row blindly.
Try disabling the property yourself and you will see it blindly read row by row.
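The only change from the previous sketch is the keyword argument on extract_text (same library and illustrative-path assumptions as before):

```python
from pathlib import Path

def extract_text_flow(pdf_path: Path) -> str:
    import pdfplumber  # third-party: pip install pdfplumber

    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # use_text_flow=True: follow the logical text flow
            # instead of reading the page row by row
            text += (page.extract_text(use_text_flow=True) or "") + "\n"
    return text
```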
Also note that all of these implementations are available in frameworks like LangChain as well : https://docs.langchain.com/oss/javascript/integrations/document_loaders
But the reason for learning all these classes is that we should understand their implementation so that we can work independently if our client is not using LangChain. This helps us write our own logic.
Note :
So far, we have seen how to extract data from PDF files using PyPDF2 & PDFPlumber. Let's see how to extract data from scanned images.
What are scanned images ?
- When you take a photo on your mobile and need data to be extracted from that photo
- When you have photos or screenshots of some document converted into a .pdf file. This is not an actual PDF file, is it? We will see how to extract data in such cases.
The popular technique for handling scanned images is OCR (Optical Character Recognition). It reads text from scanned documents & images and extracts it into editable, searchable text.
Tesseract :
Tesseract is one of the most popular packages for reading and extracting text from scanned images, as mentioned above. It is meant for POCs and is not recommended for production. For enterprise-level or production applications, the cloud-based services below are available.
Cloud-based OCR APIs (recommended for production) :
- Google :
- Google cloud vision API - to handle images from scanned documents
- Google document AI - to handle text from scanned documents
- Microsoft :
- Azure computer vision OCR - to handle images from scanned documents
- Azure Form recognizer - to handle text from scanned documents
- AWS
- Amazon Textract - it will handle both images, text from scanned documents
Note :
For Tesseract to work on your local machine, please download the .exe from the following GitHub location : https://github.com/UB-Mannheim/tesseract/wiki
Also, download the zip folder from the following location and place it in any folder on your computer : https://github.com/oschwartz10612/poppler-windows/releases/
Implementation of Tesseract in local machine :
Output :
- Carefully observe the paths for the input scanned file, Tesseract and Poppler in the code.
- Observe the image in the input file
- Note this will only extract text from scanned images; if you want to understand the image content itself, you need to convert it into a vector representation using models like CLIP or other similar methods
- We are using the pytesseract library to extract text from the image in the above code
- config="--oem 3 --psm 6"
- oem is the OCR engine mode; it has multiple modes and we selected 3 (the default, which picks the best available engine)
- mode 1 uses the LSTM engine only, mode 2 combines the legacy and LSTM engines
- psm is the page segmentation mode
- it helps the Tesseract library understand the layout of the image
- it has different options: 0, 3, 6, 7, 8, 10 and more
- we selected 6; it assumes a single uniform block of text in the image
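A sketch of the OCR flow described above. The Windows paths for the Tesseract binary and the Poppler bin folder are placeholders for wherever you installed them; libraries assumed: pytesseract and pdf2image (plus the Tesseract and Poppler binaries themselves):

```python
from pathlib import Path

def ocr_scanned_pdf(pdf_path: Path, tesseract_exe: str, poppler_bin: str) -> str:
    import pytesseract                       # third-party: pip install pytesseract
    from pdf2image import convert_from_path  # third-party: pip install pdf2image

    # Point pytesseract at the installed Tesseract binary
    pytesseract.pytesseract.tesseract_cmd = tesseract_exe

    text = ""
    # Poppler renders each PDF page to an image
    for image in convert_from_path(str(pdf_path), poppler_path=poppler_bin):
        # --oem 3: default engine mode; --psm 6: single uniform block of text
        text += pytesseract.image_to_string(image, config="--oem 3 --psm 6") + "\n"
    return text

# Usage (paths are placeholders for your own installation):
# ocr_scanned_pdf(Path("Data/scanned.pdf"),
#                 r"C:\Program Files\Tesseract-OCR\tesseract.exe",
#                 r"C:\poppler\Library\bin")
```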
Implementation of PyMuPDF with Tesseract :
This is another way of extracting text from a complex PDF which is not an actual PDF but a wrapped .pdf file full of screenshots.
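A sketch of that approach, assuming PyMuPDF (`pip install pymupdf`), Pillow and pytesseract are installed: each page is rendered to an image with PyMuPDF and then passed through Tesseract:

```python
import io
from pathlib import Path

def ocr_with_pymupdf(pdf_path: Path) -> str:
    import fitz               # third-party: pip install pymupdf
    import pytesseract        # third-party: pip install pytesseract
    from PIL import Image     # third-party: pip install pillow

    text = ""
    doc = fitz.open(pdf_path)
    for page in doc:
        # Render the page to a PNG at 300 DPI, then OCR it
        pix = page.get_pixmap(dpi=300)
        image = Image.open(io.BytesIO(pix.tobytes("png")))
        text += pytesseract.image_to_string(image) + "\n"
    doc.close()
    return text
```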
Handling PPT files :
In case our data is in the form of PPT files, how do we handle it? Let's see:
The code extracts the text slide by slide. We are using a class called Presentation from the pptx library (python-pptx).
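A sketch of the slide-by-slide extraction, assuming python-pptx is installed (`pip install python-pptx`); only shapes that carry text frames are read, so pictures and charts are skipped:

```python
from pathlib import Path

def extract_pptx_text(pptx_path: Path) -> str:
    from pptx import Presentation  # third-party: pip install python-pptx

    text = ""
    prs = Presentation(pptx_path)
    for i, slide in enumerate(prs.slides, start=1):   # slide by slide
        text += f"--- Slide {i} ---\n"
        for shape in slide.shapes:
            if shape.has_text_frame:                  # skip pictures, charts, etc.
                text += shape.text_frame.text + "\n"
    return text
```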
Output :
Handling PDF files with multiple types of tables :
We have a powerful library called tabula (tabula-py). Using this library, we can extract tables from complex PDF files that contain multiple types of tables.
Implementation :
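A sketch of the tabula usage, assuming tabula-py is installed (`pip install tabula-py`; note it also requires Java on the machine) and the file path is illustrative:

```python
from pathlib import Path

def extract_pdf_tables(pdf_path: Path) -> list:
    import tabula  # third-party: pip install tabula-py (requires Java)

    # Returns one pandas DataFrame per table found across all pages
    return tabula.read_pdf(str(pdf_path), pages="all", multiple_tables=True)

# Usage (illustrative path):
# for df in extract_pdf_tables(Path("Data/report_with_tables.pdf")):
#     print(df.head())
```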
Output :
But always remember: if you want to reduce cost, you need to write your own logic.
Handling xlsx files :
- Approach-1
- Using Pandas
- Approach-2
- Using openpyxl library, load_workbook()
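Both approaches above can be sketched as follows, assuming pandas and openpyxl are installed (`pip install pandas openpyxl`) and the workbook path is illustrative:

```python
from pathlib import Path

def extract_xlsx_pandas(xlsx_path: Path) -> str:
    import pandas as pd  # third-party: pip install pandas openpyxl

    text = ""
    # sheet_name=None loads every sheet as a {name: DataFrame} dict
    for name, df in pd.read_excel(xlsx_path, sheet_name=None).items():
        text += f"--- Sheet: {name} ---\n{df.to_string(index=False)}\n"
    return text

def extract_xlsx_openpyxl(xlsx_path: Path) -> str:
    from openpyxl import load_workbook  # third-party: pip install openpyxl

    text = ""
    wb = load_workbook(xlsx_path, data_only=True)  # data_only: values, not formulas
    for ws in wb.worksheets:
        text += f"--- Sheet: {ws.title} ---\n"
        for row in ws.iter_rows(values_only=True):
            text += "\t".join("" if c is None else str(c) for c in row) + "\n"
    return text
```

pandas is convenient for tabular sheets; openpyxl gives cell-level control when the layout is irregular.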
Output :
Handling docx files :
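A sketch using python-docx (`pip install python-docx`); headings are detected via paragraph style names so they can be treated separately from body text (the bracket-marker format is just an illustrative choice):

```python
from pathlib import Path

def extract_docx_text(docx_path: Path) -> str:
    from docx import Document  # third-party: pip install python-docx

    text = ""
    doc = Document(docx_path)
    for para in doc.paragraphs:
        style = para.style.name               # e.g. "Heading 1", "Normal"
        if style.startswith("Heading"):
            # Mark headings/subheadings so chunking can use them later
            text += f"\n[{style}] {para.text}\n"
        elif para.text.strip():
            text += para.text + "\n"
    return text
```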
Observe that we handled headings, subheadings, paragraphs etc. separately in the above code.
Output :
Handling HTML files :
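A sketch using BeautifulSoup (`pip install beautifulsoup4`); script and style tags are dropped since they carry no readable content, and the file path is illustrative:

```python
from pathlib import Path

def extract_html_text(html_path: Path) -> str:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content tags
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)
```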
Output :
Note : We need to use the BeautifulSoup class from the bs4 library to handle HTML files.
Handling JSON files :
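JSON can be handled with the standard library alone. A common sketch is to flatten nested JSON into "path: value" lines so each fact becomes plain text ready for chunking (the sample document is made up):

```python
import json

def flatten_json(data, prefix: str = "") -> list[str]:
    """Flatten nested JSON into 'path: value' lines ready for chunking."""
    lines: list[str] = []
    if isinstance(data, dict):
        for key, value in data.items():
            lines += flatten_json(value, f"{prefix}{key}.")
    elif isinstance(data, list):
        for i, value in enumerate(data):
            lines += flatten_json(value, f"{prefix}{i}.")
    else:
        lines.append(f"{prefix.rstrip('.')}: {data}")
    return lines

doc = json.loads('{"policy": {"sick_leave_days": 12, "carry_over": false}}')
print("\n".join(flatten_json(doc)))
```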
Output :
Important Information
- Google recently introduced a parser called Layout Parser (part of Document AI)
- It hides the logic: we can upload a file in any format and it will extract the data for us
- We don't know what's happening inside; we just need to purchase their API key, and it does the work for us
- Explore - https://docs.cloud.google.com/document-ai/docs/layout-parse-quickstart
We can simply use the Layout Parser above, but our data leaves our environment, so we cannot fully secure it.
- If we need to process confidential data, we need to mask that data first and then extract it - in the guardrails blog, we will see how to mask data
- We need to implement evaluation techniques for all the above file-handling mechanisms - that is, we need to compare the data before and after extraction and it must match
- In the image above, during indexing - if we implement the HNSW indexing technique with a p99 latency target, then we need to enforce that latency budget in all the steps: data extraction, chunking, embedding and indexing itself. This is a very important point to understand. We will see what the Flat, IVF-PQ & HNSW indexing techniques are in the later part of this blog.
Conclusion for data extraction as part of Indexing :
So far, we have seen how to extract data from multiple source files: PDF, screenshots in PDF format, docx, PPT, xlsx, HTML, JSON etc. Now that we have clarity on how to extract data from different source types, let's look at the next part of indexing, i.e. chunking.
Chunking
Chunking is the process of splitting large documents into smaller, semantically meaningful units to enable efficient embedding, accurate retrieval, and better context injection in RAG systems.
Instead of embedding a full document, say a 100+ page PDF, you split it into smaller sections such as:
- Paragraphs
- Sections
- Sliding windows of text (200-500 tokens etc.)
Why do we need chunking ?
Embedding models (like OpenAI embeddings) have input limits; you can't embed very large documents directly. Chunking ensures each piece fits within the context window of AI models. Without chunking, data gets truncated or the request fails.
- RAG works by retrieving relevant chunks, not entire document
- Chunking ensures better semantic matching
As shown in the image above, there are 8 chunking techniques:
1) Fixed-Size chunking
- Split text into equal-sized chunks of characters
- Disadvantage: we might lose the context if we break in the middle of a sentence
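A sketch of fixed-size chunking in plain Python (the sample text, chunk size and overlap are made-up values); note how a boundary can cut a word mid-way, which is exactly the disadvantage mentioned above:

```python
def fixed_size_chunks(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into equal-sized character chunks with optional overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks("AI is a powerful system for retrieval.", size=20, overlap=5)
print(chunks)  # note how chunk boundaries can cut words mid-way
```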
2) Recursive character text splitting
- Recursively split using separators (newlines, sentences, words)
- We retain the entire context, and we can use this in production
Simply put, we chunk based on the separators mentioned in the code. The separators could be anything, depending on the content of the text.
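A simplified sketch of recursive splitting (in the spirit of LangChain's RecursiveCharacterTextSplitter, but not its exact algorithm): try the coarsest separator first, and recurse to finer separators only for pieces that are still too big. The sample text and size limit are made up:

```python
def recursive_split(text, max_size, separators=("\n\n", "\n", ". ", " ")):
    """Simplified recursive character splitter (illustrative, not LangChain's exact logic)."""
    if len(text) <= max_size:
        return [text.strip()] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard character split
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    chunks, current = [], ""
    for i, part in enumerate(parts):
        piece = part + (sep if i < len(parts) - 1 else "")
        if len(current) + len(piece) <= max_size:
            current += piece
            continue
        if current.strip():
            chunks.append(current.strip())   # flush the chunk built so far
        current = ""
        if len(piece) > max_size:
            # Piece itself too big -> recurse with the next, finer separator
            chunks.extend(recursive_split(piece, max_size, rest))
        else:
            current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks

text = "RAG has two phases. Indexing prepares the knowledge base. Retrieval finds relevant chunks."
print(recursive_split(text, max_size=40))
```

Because it prefers natural boundaries (paragraphs, then lines, then sentences, then words), each chunk tends to stay semantically whole.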
3) Semantic chunking
- Split text into meaningful sections based on topics/headers
- This is another recommended chunking technique for production
- Disadvantage: the chunks are not all the same size
- If we deal with millions of documents, it is hard to identify the headers in every document
- Another way of handling this is by using an LLM - it takes care of chunking based on headers even when we deal with millions of documents, but if we use an LLM, we need to pay for tokens
- In case all the documents have the same structure, Semantic chunking is the best choice for production
4) Sentence-based chunking
- Group a fixed number of sentences per chunk, with optional overlap
- Note the overlap is optional here
5) Token-based chunking
- Split text into chunks based on the number of tokens (using the model's tokenizer)
- Note the overlap is optional here
6) Sliding-window chunking
- Use a window of fixed size that slides by a step, creating overlapping chunks
- Overlap is not optional here; it is mandatory in this technique
7) Table-aware chunking
- Keep tables as separate chunks and the rest of the content as separate chunks
8) Parent-Child chunking
- Create large parent chunks (sections) and small child chunks inside them
- Observe that for each section, it creates a parent chunk and multiple child chunks
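A sketch of parent-child chunking in plain Python (the section dict, id format and child size are made-up choices): the parent keeps the full section, while each small child carries a link back to its parent so retrieval on a child can return the wider parent context:

```python
def parent_child_chunks(sections: dict[str, str], child_size: int = 40) -> list[dict]:
    """Parent = whole section; children = small pieces that point back to it."""
    records = []
    for title, body in sections.items():
        parent_id = f"parent::{title}"
        records.append({"id": parent_id, "level": "parent", "text": body})
        for i in range(0, len(body), child_size):
            records.append({
                "id": f"{parent_id}::child{i // child_size}",
                "level": "child",
                "parent": parent_id,          # child keeps a link to its parent
                "text": body[i:i + child_size],
            })
    return records

recs = parent_child_chunks(
    {"Leave Policy": "Employees get 12 paid sick days per year. Unused days do not carry over."}
)
print([r["id"] for r in recs])
```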
The production-recommended strategies among these are:
- Recursive character text splitting
- Semantic chunking
- Parent-Child chunking
We are done with chunking strategies. Out of all 8 techniques, Semantic, Recursive character text splitting and Parent-Child are the production-recommended strategies. We can use LLM calls for chunking as well, by writing a proper prompt for the above strategies, but remember that involves cost.
Remaining topics in this blog :
- Embeddings & Cost
- Vector Store Vs Vector DB
- Indexing Mechanism
- Flat Indexing
- IVF-PQ Indexing
- HNSW Indexing
- Meta Data Filtering
Embeddings
Embeddings are numerical vector representations of text that capture the meaning and context of content. When you split documents into chunks during indexing, each chunk is converted into a list of numbers(vector), so that machines can understand the semantic meaning of text.
Embeddings might look like : [0.21, -0.45, 0.88, 0.13, ...] - usually hundreds to thousands of dimensions
Real Example :
Document chunk: "AWS S3 provides object storage"
User asks: "Where can I store files in AWS?"
Even though the words differ, embeddings place them near each other in vector space.
Very Important point to remember :
Please note that the cost factor starts from this point, because we need to call an embedding model to convert chunks into embeddings.
Important concept to remember regarding how we store embeddings in a Vector store/DB :
- Consider a statement - "AI is powerful system"
- When an LLM processes this line, it converts each character/word/sub-word into a token and assigns a token ID based on the vocabulary of the model
- Those token IDs are then converted into corresponding embeddings (after going through the training process as part of the neural network)
- For each token ID, there is an embedding vector of 'n' dimensions, where 'n' depends on the model
- For text-embedding-3-small - 1536 dimensions
- For text-embedding-3-large - 3072 dimensions
- These dimensions relate to the whole input text, not the token embedding size inside GPT LLM layers. GPT models don't reveal their internal embedding structure.
- Up to here, we were talking about the dimension of an embedding vector per token
- But while storing these embedding vectors in the Vector store/DB, we store embeddings per chunk (NOT per word). These are called aggregated or text-level semantic embeddings
Providers like OpenAI, Claude and Google Gemini decide these text-level semantic embeddings.
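A sketch of converting chunks into per-chunk embeddings with the OpenAI Python SDK (v1.x), assuming `pip install openai` and an OPENAI_API_KEY in the environment; the model name matches the dimensions discussed above:

```python
def embed_chunks(chunks: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    from openai import OpenAI  # third-party: pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=chunks)
    # One vector per chunk; 1536 dimensions for text-embedding-3-small
    return [item.embedding for item in response.data]

# Usage (requires an API key; each call costs tokens):
# vectors = embed_chunks(["AI is powerful system"])
# print(len(vectors[0]))  # 1536 for text-embedding-3-small
```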
Available Embeddings :
- OpenAI Embeddings - closed source and recommended for production
- Hugging Face Embeddings
- Open-source embeddings
Let us consider what happens if the model we are using is not trained properly on some domain, say Banking. Then it won't convert users' data into embeddings properly, which results in hallucinated or incorrect answers. An experienced AI developer realizes this during the validation process of data extraction; a less experienced AI engineer might miss it and assume the problem lies with the model. This is an important point to understand.
That's the reason people have already started developing domain-specific models. If time permits, look at the white papers below:
- https://arxiv.org/pdf/2409.18511v3
- https://huggingface.co/blog/nvidia/domain-specific-embedding-finetune
Implementation of Embeddings :
Output :
Note :
- The code above has docstrings with a proper explanation of both the design and the business logic
- Understand the models we are using, the patterns we use in the design, and also Python topics like data classes, regular expressions etc.
- The code is long, but if you go through it line by line, definition by definition, it is easy
- I recommend that instead of understanding the logic from a Python perspective, you try to understand what we are going to get from this logic at the end - then it will make more sense.
Vector Store Vs Vector DB
A Vector store or Vector DB is a specialized system that stores embeddings and retrieves semantically similar data using nearest-neighbor search.
We should have a clear idea on when to use Vector store vs Vector DB.
- If the total no. of vectors is < 1000, we can use a Vector store
- If the total no. of vectors is > 1000, go for a Vector DB
Vector Store
- Stores only vectors
- Fast similarity search
- Simple + lightweight
- Limited functionality
Vector DB
- Stores vectors + metadata + text
- Search will be fast due to metadata
- Filtering + Hybrid search
- Hybrid means Similarity + keyword search
- Scalable + full featured
- RAG + applications
Try to understand the example below to see the power of adding metadata to a Vector DB.
We are using a Vector DB where we can store metadata alongside the vectors. Assume we have stored vectors for 2 different vendors, vendor-1 and vendor-2. Now, if we need to search only vendor-2-related vectors in the Vector DB, we can simply apply a filter on vendor-2 and search the data related to ONLY vendor-2. This facility isn't available if we use a plain vector store in production. This is the power of adding metadata to a Vector DB.
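As a toy sketch of that idea (no real Vector DB here; the 3-dimensional vectors and vendor records are made up), the metadata filter runs first and the similarity search only scores what survives:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# In-memory stand-in for a Vector DB: vector + metadata + text per record
records = [
    {"vector": [0.9, 0.1, 0.0], "metadata": {"vendor": "vendor-1"}, "text": "vendor-1 pricing"},
    {"vector": [0.8, 0.2, 0.1], "metadata": {"vendor": "vendor-2"}, "text": "vendor-2 pricing"},
    {"vector": [0.1, 0.9, 0.2], "metadata": {"vendor": "vendor-2"}, "text": "vendor-2 SLA terms"},
]

def search(query_vec, vendor, top_k=1):
    # Metadata filter first, similarity search only on what survives
    candidates = [r for r in records if r["metadata"]["vendor"] == vendor]
    candidates.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return [r["text"] for r in candidates[:top_k]]

print(search([1.0, 0.0, 0.0], vendor="vendor-2"))
```

Real Vector DBs expose the same idea as a filter clause on the query, evaluated alongside the nearest-neighbor search.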
Hope you are now confident about when to use a Vector store vs a Vector DB.
Note : Carefully observe above images to understand the deciding factors for a Vector DB.
Indexing
Indexing in a vector DB means organizing embeddings into a search structure so that nearest-neighbor retrieval is much faster than scanning every vector.
We have 3 types of indexing as below:
- Flat Indexing
- IVF - PQ
- HNSW
I will add indexing information by tomorrow EOD.
Thank you for reading this blog !
Arun Mathe