
(AI Blog#9) Basics of Large Language Models (LLMs)

We covered RNNs and LSTMs in my previous blog; those are the foundational concepts for LLMs. In today's blog, we will start looking at the core LLM concepts: Transformers, LLMs, Agents, etc.


LLMs vs Agents

An LLM (Large Language Model) is a neural network trained to predict the next word and produce meaningful, context-aware output. An Agent is a system that uses an LLM + tools + memory + decision logic to achieve some goal. An Agent is not just a model - it is a system architecture.


LLM

  • An LLM takes input from the user and produces an output, but every run it may produce a different output. This is called non-deterministic output.
  • Once the conversation is over, it won't remember anything. This is called stateless.
  • That's the reason we can't build an application directly on a bare LLM: it isn't designed for any specific application, so it may produce different outputs each time and behave inconsistently.
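As a toy illustration of this non-determinism (the vocabulary and probabilities below are invented, not from any real model): an LLM samples its next token from a probability distribution, so two runs over the same input can diverge.

```python
import random

# Hypothetical next-token probabilities for some prompt.
probs = {"revenue": 0.5, "profit": 0.3, "growth": 0.2}

def sample_next_token(probs, rng):
    # Pick one token at random, weighted by its probability -
    # this sampling step is what makes LLM output non-deterministic.
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Two runs with different random states may pick different tokens,
# even though the "model" (the probability table) is identical.
run1 = sample_next_token(probs, random.Random(1))
run2 = sample_next_token(probs, random.Random(7))
print(run1, run2)
```

Production systems reduce (but rarely eliminate) this variance by lowering the sampling temperature or fixing a random seed.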


Agents 

  • An Agent uses an LLM together with Tools, Memory, Observability & Evaluation:
    • Tools are nothing but external APIs, RAG, RDBMS, etc.
    • A memory store remembers the conversation state (stateful)
    • Observability is also called tracing/logging
    • Evaluation asks: how accurate is my response?


We can add the tools below to an Agent and improve its strength:

  • For example, suppose we ask an LLM about the quarterly revenue of a particular company XYZ.
    • Note that revenue is confidential information, and we can't feed this type of data to the LLM directly. We need to use tools like RAG (Retrieval-Augmented Generation). It is like adding additional power to the LLM.
    • The LLM will decide to redirect this request to the RAG tool; the LLM acts like a gateway.
    • These tools add additional capabilities to the LLM.
  • If we want to maintain the state of the previous conversation, we need memory to store it.
    • In case the next user asks the same question, the LLM will hit the cache instead of processing the request again and sending it to RAG, etc.
  • Once we have deployed an LLM into production, we may want to observe its behaviour.
    • For that we need a mechanism called Observability.
    • It keeps track of logging/tracing for the underlying components.
  • Finally, we need to evaluate the response, correct?
    • For this reason, we add a few evaluation metrics and graphs to evaluate the response.
    • A feedback mechanism is also part of evaluation, to improve the accuracy of the model.
    • We can integrate the tools below to track evaluation metrics:
      • RAGAS
      • Eval
      • Open lens
      • TruLens
As discussed above, we add these components to an LLM to create an entity called an AGENT. We will see how Agents work in LangGraph. Also, to provide security between the LLM and RAG, we use something called Guardrails; they help avoid exposing confidential data to the LLM.
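The components above can be sketched as a minimal agent loop. Everything here is hypothetical stand-in code (`fake_llm` and `rag_tool` are not real APIs); it only shows how the LLM acts as a gateway to tools while memory provides a cache.

```python
def fake_llm(prompt):
    # Stand-in for a real LLM call; decides whether a tool is needed.
    if "revenue" in prompt:
        return ("use_tool", "rag")
    return ("answer", "Here is a direct answer.")

def rag_tool(query):
    # Stand-in for a RAG lookup over private documents.
    return f"[retrieved context for: {query}]"

class Agent:
    def __init__(self):
        self.tools = {"rag": rag_tool}
        self.memory = {}            # caches previous answers (stateful)

    def run(self, query):
        if query in self.memory:    # cache hit: skip LLM + RAG entirely
            return self.memory[query]
        action, payload = fake_llm(query)
        if action == "use_tool":
            answer = self.tools[payload](query)
        else:
            answer = payload
        self.memory[query] = answer  # remember state for next time
        return answer

agent = Agent()
print(agent.run("quarterly revenue of XYZ"))  # routed to the RAG tool
print(agent.run("quarterly revenue of XYZ"))  # served from memory cache
```

A real Agent would add observability (tracing each step) and evaluation hooks around this loop; frameworks like LangGraph formalize exactly this structure.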

The third-party tools below are used to trace our application:
  • LangSmith
  • Opik
  • OpenTelemetry
  • Watch Dog


Transformer architecture : 

We will talk about all the internal components of the Transformer architecture below. We are going to cover it over the next 2-3 blogs.

It mainly contains 2 parts :

  • Encoder (left hand side)
  • Decoder (right hand side) 

The main purpose of this architecture when it was introduced in ~2017 was language translation: give it French as input and expect English as output. Later models took this architecture as a reference and were introduced into the market, as shown in the image below.


The tree above consists of 3 branches :

  1. Encoder-only
  2. Encoder-Decoder
  3. Decoder-only

The tree contains multiple open- and closed-source models. Filled boxes are open-source models, and open (unfilled) boxes are closed-source models (which require a subscription). We can use open-source models for our POCs, but for production we need to use closed-source models by paying for them, as we can't expose our internal data to the internet/LLMs.


What is a Decoder model ? Meant for next-word prediction.

In a production-grade system, we need to use CLOSED-SOURCE models by paying some money. When you ask ChatGPT a question, while generating the output, the 2nd word is predicted based on the 1st word, and similarly the 3rd word is generated based on the first 2 words. This is the way a decoder model works: it predicts the next token.
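The next-token loop can be sketched with a toy "model" - here just a hard-coded bigram table, purely for illustration - to show how each generated word becomes the input for predicting the next one.

```python
# Toy autoregressive decoder loop. The bigram table stands in for a real
# model: given the last token, it "predicts" the next one.
bigram = {
    "<start>": "the",
    "the": "model",
    "model": "predicts",
    "predicts": "tokens",
}

def generate(max_tokens=4):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        nxt = bigram.get(tokens[-1])
        if nxt is None:          # no known continuation: stop
            break
        tokens.append(nxt)       # the new token conditions the next step
    return tokens[1:]

print(generate())  # ['the', 'model', 'predicts', 'tokens']
```

A real decoder conditions on the entire sequence so far (not just the last token) and outputs a probability distribution rather than a single fixed word, but the loop structure is the same.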

What is an Encoder model ? Meant to fill in missing words in the middle of a sentence.
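A toy sketch of the encoder idea (the lookup table below stands in for a real model such as BERT): to fill a blank, the "model" must look at the words on both sides of it, which is exactly the bidirectional context an encoder uses.

```python
# Hypothetical fill-in-the-blank table keyed on BOTH neighbours of the
# masked word - a stand-in for bidirectional attention in a real encoder.
fill_table = {
    ("capital", "France"): "of",
    ("is", "sunny"): "very",
}

def fill_mask(left, right):
    # "the capital [MASK] France" -> left="capital", right="France"
    return fill_table.get((left, right), "<unk>")

print(fill_mask("capital", "France"))  # of
```

Note that neither neighbour alone is enough to pick the right word; that is why encoder models like BERT are trained to read context in both directions.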

GPT models and their history of usage :

Decoder models (open source) :

  • GPT-1
  • GPT-2

Decoder models (closed source) :

  • GPT-3
  • GPT-4, etc.
  • GPT-5.2


Popular models : 

  • Encoder model (BERT - Bidirectional Encoder Representations from Transformers)
    • Mainly used to fill in missing words/tokens
    • To fill a blank, it must be aware of the previous & next words - that's why it is Bidirectional
    • It inherits from the Transformer architecture
  • Decoder model (GPT - OpenAI)
    • Generative Pre-trained Transformer
    • It also inherits from the Transformer architecture
    • But it is mainly used to predict the next word/token
    • ChatGPT is an application; internally it uses GPT models


Why are LLMs called LARGE Language Models ?

  • Large
  • Language Models

LLMs are nothing but neural networks; they contain weights and biases. Here the word LARGE represents the large number of model parameters.

For example, Llama 3.3 was trained with 70 billion parameters; look at the link below for the source: https://www.llama.com/models/llama-3/

Look at the GPT parameter sizes in the image below. They are LARGE, aren't they?


  • GPT-3 : 175 Billion parameters (weights & biases)
  • GPT-4 : Approximately 1 Trillion parameters (weights & biases)

Just imagine the neural network size and scale! Training is therefore very expensive. We won't create LLMs within the organization; we will use pre-trained models and implement our AI Agents on top of them.
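A quick back-of-the-envelope calculation (my own arithmetic, assuming 2 bytes per parameter for fp16 weights) shows why these models are expensive even to store, let alone train:

```python
# Memory needed just to STORE model weights. 2 bytes/parameter assumes
# fp16; training needs several times more (gradients, optimizer state,
# activations), so this is a lower bound.
def weight_memory_gb(num_params, bytes_per_param=2):
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9))    # Llama 3.3 70B  -> 140.0 GB
print(weight_memory_gb(175e9))   # GPT-3 175B     -> 350.0 GB
```

That is far beyond a single consumer GPU, which is why inference at this scale is sharded across many accelerators.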


LRMs (Large Reasoning Models) - an advancement over LLMs

A model optimized specifically for structured reasoning, multi-step logic, and problem solving.

LRMs are typically :

  • Fine-tuned versions of LLMs
  • Trained using reasoning datasets
  • Enhanced with reinforcement learning
  • Built to think step-by-step

Examples :

  • OpenAI's o1
  • DeepSeek's DeepSeek-R1
  • ChatGPT 5.2 (current version)
  • Gemini 1.5 Pro

What are LRMs good at ?

  • Math problems
  • Logical reasoning
  • Coding problems
  • Planning
  • Multi step decision making
  • Agent-like workflows

The core capability: generating intermediate reasoning steps before the final answer.

LLM vs LRM :


Fine Tuning



Fine-Tuned Model = Pre-Trained Model + Our Own Data

So, in real projects, we will create fine-tuned models.


Note : in the end, we are going to create an application powered by a pre-trained ML model.

Ex : https://www.harvey.ai/ 


Context Window :

A context window is the maximum amount of text (in tokens) that a model can keep in view at one time while generating a response.

Think of it as the model's short-term working memory.


What does that mean practically ?

When you send a prompt to an LLM like ChatGPT-5.2 or Gemini, the model processes :

[ All previous conversation tokens ]

+ [ Your new prompt tokens ]

+ [ Tokens it generates as output ]

All of that must fit inside the context window limit. If it exceeds the limit, older tokens get truncated. The model forgets earlier parts.

  • GPT-5.2 accepts 400,000 tokens as its context window

This is very important while designing applications. As shown in the diagram below, if our LLM supports a 1024-token context window and a user uploads a PDF of 10,000 tokens, the LLM won't allow this action.

Alternatively, we can divide this PDF into chunks/batches, say 10 chunks of 1,000 tokens each, but we have to compromise on processing time. We are going to face issues like this while designing real applications, so we have to design our applications to handle such situations.
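The chunking idea can be sketched in a few lines (the chunk size and document length are illustrative; real pipelines usually also overlap adjacent chunks so sentences aren't cut mid-thought):

```python
# Split a long token sequence into chunks that each fit a small
# context window.
def chunk_tokens(tokens, window=1000):
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

doc = list(range(10_000))          # pretend this is a 10,000-token PDF
chunks = chunk_tokens(doc, window=1000)
print(len(chunks))                 # 10 chunks
print(len(chunks[0]))              # 1000 tokens each
```

Each chunk is then sent to the LLM separately (e.g., summarize each, then summarize the summaries), which is where the extra processing time comes from.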

In case we don't want to compromise on time, we can add a router, which will cost us some extra bucks. The router will route the request to a high-end LLM that can accept all the tokens as a single request from the user.



Prompt engineering techniques :
  • Zero-shot
    • You give the model a task without any examples
  • One-shot
    • You provide one example before the real question
  • Few-shot
    • You provide multiple examples before the real question
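To make the three techniques concrete, here are illustrative prompt strings for a made-up sentiment-labelling task (the wording and examples are my own, not from any model's documentation):

```python
# Zero-shot: the task alone, no worked examples.
zero_shot = "Classify the sentiment of: 'I loved this phone.'"

# One-shot: a single worked example before the real question.
one_shot = (
    "Review: 'Terrible battery life.' -> Negative\n"
    "Review: 'I loved this phone.' -> "
)

# Few-shot: several worked examples stacked before the real question.
few_shot = (
    "Review: 'Terrible battery life.' -> Negative\n"
    "Review: 'Amazing camera!' -> Positive\n"
    "Review: 'It broke in a week.' -> Negative\n"
    "Review: 'I loved this phone.' -> "
)

print(few_shot.count("->"))  # 4 arrows: 3 examples + the real question
```

Note that all the extra examples consume context-window tokens, so few-shot prompting trades accuracy against the limits discussed above.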



Three main stages of coding an LLM :
  • Implementing the LLM architecture and the data preparation process
  • Pretraining the LLM to create a foundation model
  • Fine-tuning the foundation model to become a personal assistant or text classifier


We are going to talk about the above 3 stages of implementing an LLM in the next few blogs. We will see how to code it entirely on our local machine.


We are also going to discuss each individual component of the Transformer architecture above.


That's all for this blog. 

The next blog, i.e. AI Blog#10, will be about Tokenization, Input & Position Embeddings, etc.

Thank you for reading this blog !

Arun Mathe
