We covered RNNs and LSTMs in my previous blog; those are foundational concepts for LLMs. In today's blog, we will start looking at the actual LLM concepts like Transformers, LLMs, Agents etc.
LLMs vs Agents
An LLM (Large Language Model) is a neural network trained to predict the next word and produce meaningful output in context. An Agent is a system that uses an LLM + tools + memory + decision logic to achieve a goal. An Agent is not just a model - it's a system architecture.
LLM
- An LLM takes input from the user and produces an output, but it may produce a different output on every run. This is called non-deterministic output.
- Once the conversation is over, it won't remember anything. This is called state-less.
- That's why we can't directly build an application on a raw LLM: it may produce different outputs each time, it isn't designed for a specific application, and it behaves inconsistently.
Agents
- An Agent uses an LLM, Tools, Memory, Observability & Evaluation
- Tools are external APIs, RAG, RDBMS etc.
- Memory stores the conversation state (state-full)
- Observability is also called Tracing/Logging
- Evaluation asks: how accurate is my response?
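The pieces above can be sketched as a minimal agent loop. Everything here is hypothetical: `fake_llm` stands in for a real model, and `revenue_lookup` is a stub tool, not a real API.

```python
# Minimal agent sketch: LLM + tool + memory + decision logic.
# fake_llm and revenue_lookup are illustrative stand-ins only.

def fake_llm(prompt, memory):
    """Pretend LLM: decides whether a tool is needed."""
    if "revenue" in prompt.lower():
        return {"action": "call_tool", "tool": "revenue_lookup"}
    return {"action": "answer", "text": "Answered from model knowledge."}

def revenue_lookup(query):
    """Stub tool, e.g. a RAG pipeline or database query."""
    return "Q3 revenue for XYZ: $1.2M (from internal data)"

TOOLS = {"revenue_lookup": revenue_lookup}

def run_agent(prompt, memory):
    memory.append(("user", prompt))               # state-full: remember turns
    decision = fake_llm(prompt, memory)           # the LLM acts as a gateway
    if decision["action"] == "call_tool":
        answer = TOOLS[decision["tool"]](prompt)  # route the request to a tool
    else:
        answer = decision["text"]
    memory.append(("agent", answer))
    return answer

memory = []
print(run_agent("What is the quarterly revenue of XYZ?", memory))
```

The key design point: the model itself never changes - the loop around it (routing, memory, tools) is what makes it an Agent.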
We can add the below tools to an Agent and improve its strength:
- For example, suppose we ask an LLM about the quarterly revenue of a particular company, XYZ.
- Revenue is confidential information, and we can't feed this type of data to an LLM directly. We need to use tools like RAG (Retrieval-Augmented Generation). It is like adding additional power to the LLM.
- The LLM decides to redirect this request to the RAG tool; the LLM acts like a gateway.
- These tools add additional capabilities to the LLM.
- To maintain the state of the previous conversation, we need memory to store it.
- In case the next user asks the same question, the system can hit the cache instead of processing that request again and sending it to RAG.
- Once we deploy an LLM into production, we may want to observe its behaviour.
- For that we need a mechanism called Observability.
- It keeps track of logging/tracing across the underlying components.
- Finally, we need to evaluate the response, correct?
- For this reason, we add a few evaluation metrics and graphs to evaluate the response.
- A feedback mechanism is also part of evaluation, used to improve the accuracy of the model.
- We can integrate the below tools to keep track of evaluation metrics:
- RAGAS
- Eval
- Open lens
- TruLens
- LangSmith
- Opik
- OpenTelemetry
- Watch Dog
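One of the ideas above - hitting a cache for repeated questions instead of re-running the RAG pipeline - can be sketched in a few lines. `answer_with_rag` is a hypothetical stand-in for the expensive retrieval call.

```python
# Sketch of a response cache in front of a RAG pipeline.
# answer_with_rag is a hypothetical stand-in for the expensive call.

cache = {}

def answer_with_rag(question):
    # Imagine this does retrieval + LLM generation; it is slow and costly.
    return f"RAG answer for: {question}"

def answer(question):
    key = question.strip().lower()   # normalise so repeated questions match
    if key in cache:
        return cache[key]            # cache hit: skip the RAG pipeline
    result = answer_with_rag(question)
    cache[key] = result
    return result
```

A real system would also expire cache entries and handle paraphrased questions (e.g. via embedding similarity), but the principle is the same.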
Transformer architecture:
We will talk about all the internal components of the below Transformer architecture over the next 2-3 blogs.
It mainly contains 2 parts :
- Encoder (left hand side)
- Decoder (right hand side)
The main purpose of this architecture, when it was introduced in 2017, was language translation - for example, taking French as input and producing English as output. Other models were later introduced into the market using this architecture as a reference, as shown in the below image.
The above tree consists of 3 branches:
- Encoder-only
- Encoder-Decoder
- Decoder-only
The above tree contains multiple open and closed source models. Filled boxes are open source models and open boxes are closed source models (which require a subscription). We can utilise open source models for our POCs, but for production purposes we need to use closed source models by paying for them, as we can't expose our internal data to the internet/LLMs.
What is a Decoder Model? It is meant for next-word prediction.
When you ask a question to ChatGPT, the output is generated token by token: based on the 1st word the 2nd word is predicted, and the 3rd word is generated based on the first two. This is how a decoder model works - it predicts the next token.
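The next-token loop can be illustrated with a toy bigram model. Real decoder models like GPT use deep neural networks, but the autoregressive generation loop shown here is the same idea; the tiny corpus is made up for illustration.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: counts which word follows which in a tiny corpus.
corpus = "the cat sat on the mat the cat ran on the road".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(start, n_words):
    """Autoregressive loop: each word is predicted from the one before it."""
    words = [start]
    for _ in range(n_words):
        candidates = bigrams[words[-1]].most_common(1)  # most likely next word
        if not candidates:
            break
        words.append(candidates[0][0])
    return " ".join(words)

print(generate("the", 4))
```

A GPT model does exactly this loop, except the "most likely next word" comes from a Transformer conditioned on the entire context, not just the previous word.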
What is an Encoder Model? It is meant to fill in missing words in the middle of a sentence.
GPT models and their history of usage:
Decoder model :
- GPT-1 (Open source model)
- GPT-2 (Open source model)
- GPT-3 (Closed source model)
- GPT-4 etc. (Closed source model)
- GPT-5.2 (Closed source model)
Popular models :
- Encoder Model (BERT - Bidirectional Encoder Representations from Transformers)
- Mainly used to fill in missing words/tokens
- To fill in a blank, it should be aware of both the previous and the next words - that's why it is bidirectional
- It inherits from the Transformer architecture
- Decoder Model (GPT - OpenAI)
- Generative Pretrained Transformer
- It also inherits from the Transformer architecture
- But it is mainly used to predict the next word/token
- ChatGPT is an application; internally it uses GPT models
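The bidirectional "fill the blank" idea can also be sketched with a toy model that looks at both the previous and the next word. Real BERT uses attention over the whole sentence; this count-based version, with a made-up corpus, only shows the principle.

```python
from collections import Counter, defaultdict

# Toy "fill the blank" model: predicts a word from BOTH its neighbours,
# mimicking the bidirectional idea behind BERT (real BERT uses attention).
corpus = "the cat sat on the mat the dog sat on the rug".split()

context = defaultdict(Counter)
for left, word, right in zip(corpus, corpus[1:], corpus[2:]):
    context[(left, right)][word] += 1

def fill_blank(left, right):
    """Return the most likely word between `left` and `right`, if any seen."""
    candidates = context[(left, right)]
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

print(fill_blank("cat", "on"))
```

Contrast this with the decoder sketch earlier: the decoder only looked leftward (previous word), while this model needs both sides - which is exactly why BERT is called bidirectional.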
Why are LLMs called LARGE Language Models?
- Large
- Language Models
For example, Llama 3.3 was trained with 70 billion parameters; see the below link for the source: https://www.llama.com/models/llama-3/
Look at the GPT parameter sizes in the below image. They are LARGE, aren't they?
- GPT-3 : 175 Billion parameters (weights & biases)
- GPT-4 : Approximately 1 Trillion parameters (weights & biases)
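A rough back-of-envelope calculation shows why these sizes matter. Assuming 2 bytes per parameter (fp16 precision - an assumption, since actual deployments vary), just storing the weights takes:

```python
# Back-of-envelope memory estimate for model weights.
# Assumes 2 bytes per parameter (fp16); real deployments vary.

def memory_gb(n_params, bytes_per_param=2):
    """Approximate memory in GB needed just to hold the weights."""
    return n_params * bytes_per_param / 1e9

print(round(memory_gb(175e9)))  # GPT-3 scale: ~350 GB of weights
print(round(memory_gb(1e12)))   # ~1 trillion params: ~2000 GB of weights
```

And that is only inference; training additionally needs gradients, optimizer state, and activations, multiplying the memory several times over.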
Just imagine the neural network size and scale! Hence training is very expensive. We won't create LLMs within an organization; we will use pre-trained models and implement our AI Agents on top of them.
LRM (Large Reasoning Models) - an advancement over LLMs
A model optimized specifically for structured reasoning, multi-step logic, and problem solving.
LRMs are typically :
- Fine-tuned versions of LLMs
- Trained using reasoning datasets
- Enhanced with reinforcement learning
- Built to think step-by-step
Examples of LRMs:
- OpenAI's o1
- DeepSeek's DeepSeek-R1
- ChatGPT 5.2 (current version)
- Gemini 1.5 Pro
What are LRMs good at?
- Math problems
- Logical reasoning
- Coding problems
- Planning
- Multi step decision making
- Agent-like workflows
Fine-Tuned Model = Pre-Trained Model + Our Own Data
So, in real-world projects, we will create fine-tuned models.
Note: In the end, we are going to build an application powered by a pre-trained ML model.
Ex : https://www.harvey.ai/
Context Window:
A context window is the maximum amount of text (in tokens) that a model can consider at one time while generating a response.
Think of it as the model's short-term working memory.
What does that mean practically ?
When you send a prompt to an LLM like ChatGPT-5.2 or Gemini, the model processes:
[ All previous conversation tokens ]
+ [ Your new prompt tokens ]
+ [ Tokens it generates as output ]
All of that must fit inside the context window limit. If the limit is exceeded, the oldest tokens get truncated and the model forgets the earlier parts of the conversation.
- GPT-5.2 accepts 400,000 tokens as its context window
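The truncation behaviour can be sketched in a few lines. Real models count subword tokens rather than words, so this word-level version is only an illustration.

```python
# Sketch of context-window truncation: when history + new input exceed the
# limit, the oldest tokens are dropped. Real models count subword tokens.

def fit_context(history_tokens, new_tokens, limit):
    combined = history_tokens + new_tokens
    if len(combined) > limit:
        combined = combined[-limit:]   # keep only the most recent tokens
    return combined

ctx = fit_context(["a", "b", "c", "d"], ["e", "f"], limit=4)
print(ctx)  # the oldest tokens "a" and "b" were forgotten
```

This is why very long conversations with a chatbot can "forget" what was said at the start: those tokens fell out of the window.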
- Zero-shot
- You give the model a task without any examples
- One-shot
- You provide one example before the real question
- Few-shot
- You provide multiple examples before the real question
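The three prompting styles above differ only in how many examples precede the question, so one prompt builder covers all of them. The task wording and examples here are illustrative, not from any real API.

```python
# Prompt builder for zero-, one-, and few-shot prompting.
# The task text and examples are made up for illustration.

def build_prompt(task, examples, question):
    parts = [task]
    for inp, out in examples:            # zero-shot: examples is empty
        parts.append(f"Input: {inp}\nOutput: {out}")
    parts.append(f"Input: {question}\nOutput:")
    return "\n\n".join(parts)

task = "Classify the sentiment as positive or negative."
zero_shot = build_prompt(task, [], "I loved it")
few_shot = build_prompt(
    task,
    [("Great movie", "positive"), ("Terrible plot", "negative")],
    "I loved it",
)
print(few_shot)
```

The prompt ends with a bare `Output:` so the model's next-token prediction naturally completes the answer - the same decoder behaviour described earlier.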
- Implementing LLM architecture and data preparation process
- Pretraining an LLM to create a foundation model
- Fine-tuning the foundation model to become a personal assistant or a text classifier
We are going to discuss each individual item mentioned in the above Transformer architecture.
Thank you for reading this blog !
Arun Mathe