
(AI Blog#9) Basics of Large Language Models (LLMs)

We covered RNNs and LSTMs in my previous blog; those are the foundational concepts for LLMs. In today's blog, we will start looking at the core LLM concepts: Transformers, LLMs, Agents, etc.


LLMs vs Agents

An LLM (Large Language Model) is a neural network trained to predict the next word and produce meaningful, context-aware output. An Agent is a system that uses an LLM + tools + memory + decision logic to achieve some goal. An Agent is not just a model - it is a system architecture.


LLM

  • An LLM takes input from the user and produces an output, but every run it may produce a different output. This is called non-deterministic output.
  • Once the conversation is over, it won't remember anything. This is called stateless.
  • That's the reason we can't build an application directly on a bare LLM: it isn't designed for any specific application, so it may produce different outputs each time and behave inconsistently.
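As a toy illustration of this non-determinism (the vocabulary and probabilities below are invented, not from any real model): an LLM samples its next token from a probability distribution, so two runs over the same input can diverge.

```python
import random

# Hypothetical next-token probabilities for some prompt.
probs = {"revenue": 0.5, "profit": 0.3, "growth": 0.2}

def sample_next_token(probs, rng):
    # Pick one token at random, weighted by its probability -
    # this sampling step is what makes LLM output non-deterministic.
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Two runs with different random states may pick different tokens,
# even though the "model" (the probability table) is identical.
run1 = sample_next_token(probs, random.Random(1))
run2 = sample_next_token(probs, random.Random(7))
print(run1, run2)
```

Production systems reduce (but rarely eliminate) this variance by lowering the sampling temperature or fixing a random seed.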


Agents 

  • An Agent uses an LLM together with Tools, Memory, Observability & Evaluation:
    • Tools are nothing but external APIs, RAG, RDBMS, etc.
    • A memory store remembers the conversation state (stateful)
    • Observability is also called tracing/logging
    • Evaluation asks: how accurate is my response?


We can add the tools below to an Agent and improve its strength:

  • For example, suppose we ask an LLM about the quarterly revenue of a particular company XYZ.
    • Note that revenue is confidential information, and we can't feed this type of data to the LLM directly. We need to use tools like RAG (Retrieval-Augmented Generation). It is like adding additional power to the LLM.
    • The LLM will decide to redirect this request to the RAG tool; the LLM acts like a gateway.
    • These tools add additional capabilities to the LLM.
  • If we want to maintain the state of the previous conversation, we need memory to store it.
    • In case the next user asks the same question, the LLM will hit the cache instead of processing the request again and sending it to RAG, etc.
  • Once we have deployed an LLM into production, we may want to observe its behaviour.
    • For that we need a mechanism called Observability.
    • It keeps track of logging/tracing for the underlying components.
  • Finally, we need to evaluate the response, correct?
    • For this reason, we add a few evaluation metrics and graphs to evaluate the response.
    • A feedback mechanism is also part of evaluation, to improve the accuracy of the model.
    • We can integrate the tools below to track evaluation metrics:
      • RAGAS
      • Eval
      • Open lens
      • TruLens
As discussed above, we add these components to an LLM to create an entity called an AGENT. We will see how Agents work in LangGraph. Also, to provide security between the LLM and RAG, we use something called Guardrails; they help avoid exposing confidential data to the LLM.
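The components above can be sketched as a minimal agent loop. Everything here is hypothetical stand-in code (`fake_llm` and `rag_tool` are not real APIs); it only shows how the LLM acts as a gateway to tools while memory provides a cache.

```python
def fake_llm(prompt):
    # Stand-in for a real LLM call; decides whether a tool is needed.
    if "revenue" in prompt:
        return ("use_tool", "rag")
    return ("answer", "Here is a direct answer.")

def rag_tool(query):
    # Stand-in for a RAG lookup over private documents.
    return f"[retrieved context for: {query}]"

class Agent:
    def __init__(self):
        self.tools = {"rag": rag_tool}
        self.memory = {}            # caches previous answers (stateful)

    def run(self, query):
        if query in self.memory:    # cache hit: skip LLM + RAG entirely
            return self.memory[query]
        action, payload = fake_llm(query)
        if action == "use_tool":
            answer = self.tools[payload](query)
        else:
            answer = payload
        self.memory[query] = answer  # remember state for next time
        return answer

agent = Agent()
print(agent.run("quarterly revenue of XYZ"))  # routed to the RAG tool
print(agent.run("quarterly revenue of XYZ"))  # served from memory cache
```

A real Agent would add observability (tracing each step) and evaluation hooks around this loop; frameworks like LangGraph formalize exactly this structure.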

The third-party tools below are used to trace our application:
  • LangSmith
  • Opik
  • OpenTelemetry
  • Watch Dog


Transformer architecture : 

We will talk about all the internal components of the Transformer architecture below. We are going to cover it over the next 2-3 blogs.

It mainly contains 2 parts :

  • Encoder (left hand side)
  • Decoder (right hand side) 

The main purpose of this architecture when it was introduced in ~2017 was language translation: give it French as input and expect English as output. Later models took this architecture as a reference and were introduced into the market, as shown in the image below.


The tree above consists of 3 branches :

  1. Encoder-only
  2. Encoder-Decoder
  3. Decoder-only

The tree contains multiple open- and closed-source models. Filled boxes are open-source models, and open (unfilled) boxes are closed-source models (which require a subscription). We can use open-source models for our POCs, but for production we need to use closed-source models by paying for them, as we can't expose our internal data to the internet/LLMs.


What is a Decoder model ? Meant for next-word prediction.

In a production-grade system, we need to use CLOSED-SOURCE models by paying some money. When you ask ChatGPT a question, while generating the output, the 2nd word is predicted based on the 1st word, and similarly the 3rd word is generated based on the first 2 words. This is the way a decoder model works: it predicts the next token.
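The next-token loop can be sketched with a toy "model" - here just a hard-coded bigram table, purely for illustration - to show how each generated word becomes the input for predicting the next one.

```python
# Toy autoregressive decoder loop. The bigram table stands in for a real
# model: given the last token, it "predicts" the next one.
bigram = {
    "<start>": "the",
    "the": "model",
    "model": "predicts",
    "predicts": "tokens",
}

def generate(max_tokens=4):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        nxt = bigram.get(tokens[-1])
        if nxt is None:          # no known continuation: stop
            break
        tokens.append(nxt)       # the new token conditions the next step
    return tokens[1:]

print(generate())  # ['the', 'model', 'predicts', 'tokens']
```

A real decoder conditions on the entire sequence so far (not just the last token) and outputs a probability distribution rather than a single fixed word, but the loop structure is the same.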

What is an Encoder model ? Meant to fill in missing words in the middle of a sentence.
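A toy sketch of the encoder idea (the lookup table below stands in for a real model such as BERT): to fill a blank, the "model" must look at the words on both sides of it, which is exactly the bidirectional context an encoder uses.

```python
# Hypothetical fill-in-the-blank table keyed on BOTH neighbours of the
# masked word - a stand-in for bidirectional attention in a real encoder.
fill_table = {
    ("capital", "France"): "of",
    ("is", "sunny"): "very",
}

def fill_mask(left, right):
    # "the capital [MASK] France" -> left="capital", right="France"
    return fill_table.get((left, right), "<unk>")

print(fill_mask("capital", "France"))  # of
```

Note that neither neighbour alone is enough to pick the right word; that is why encoder models like BERT are trained to read context in both directions.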

GPT models and their history of usage :

Decoder models (open source) :

  • GPT-1
  • GPT-2

Decoder models (closed source) :

  • GPT-3
  • GPT-4, etc.
  • GPT-5.2


Popular models : 

  • Encoder model (BERT - Bidirectional Encoder Representations from Transformers)
    • Mainly used to fill in missing words/tokens
    • To fill a blank, it must be aware of the previous & next words - that's why it is Bidirectional
    • It inherits from the Transformer architecture
  • Decoder model (GPT - OpenAI)
    • Generative Pre-trained Transformer
    • It also inherits from the Transformer architecture
    • But it is mainly used to predict the next word/token
    • ChatGPT is an application; internally it uses GPT models


Why are LLMs called LARGE Language Models ?

  • Large
  • Language Models

LLMs are nothing but neural networks; they contain weights and biases. Here the word LARGE represents the large number of model parameters.

For example, Llama 3.3 was trained with 70 billion parameters; look at the link below for the source: https://www.llama.com/models/llama-3/

Look at the GPT parameter sizes in the image below. They are LARGE, aren't they?


  • GPT-3 : 175 Billion parameters (weights & biases)
  • GPT-4 : Approximately 1 Trillion parameters (weights & biases)

Just imagine the neural network size and scale! Training is therefore very expensive. We won't create LLMs within the organization; we will use pre-trained models and implement our AI Agents on top of them.
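A quick back-of-the-envelope calculation (my own arithmetic, assuming 2 bytes per parameter for fp16 weights) shows why these models are expensive even to store, let alone train:

```python
# Memory needed just to STORE model weights. 2 bytes/parameter assumes
# fp16; training needs several times more (gradients, optimizer state,
# activations), so this is a lower bound.
def weight_memory_gb(num_params, bytes_per_param=2):
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9))    # Llama 3.3 70B  -> 140.0 GB
print(weight_memory_gb(175e9))   # GPT-3 175B     -> 350.0 GB
```

That is far beyond a single consumer GPU, which is why inference at this scale is sharded across many accelerators.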


LRMs (Large Reasoning Models) - an advancement over LLMs

A model optimized specifically for structured reasoning, multi-step logic, and problem solving.

LRMs are typically :

  • Fine-tuned versions of LLMs
  • Trained using reasoning datasets
  • Enhanced with reinforcement learning
  • Built to think step-by-step

Examples :

  • OpenAI's o1
  • DeepSeek's DeepSeek-R1
  • ChatGPT 5.2 (current version)
  • Gemini 1.5 Pro

What are LRMs good at ?

  • Math problems
  • Logical reasoning
  • Coding problems
  • Planning
  • Multi step decision making
  • Agent-like workflows

The core capability: generating intermediate reasoning steps before the final answer.

LLM vs LRM :


Fine Tuning



Fine-Tuned Model = Pre-Trained Model + Our Own Data

So, in real projects, we will create fine-tuned models.


Note : in the end, we are going to create an application powered by a pre-trained ML model.

Ex : https://www.harvey.ai/ 


Context Window :

A context window is the maximum amount of text (in tokens) that a model can keep in view at one time while generating a response.

Think of it as the model's short-term working memory.


What does that mean practically ?

When you send a prompt to an LLM like ChatGPT-5.2 or Gemini, the model processes :

[ All previous conversation tokens ]

+ [ Your new prompt tokens ]

+ [ Tokens it generates as output ]

All of that must fit inside the context window limit. If it exceeds the limit, older tokens get truncated. The model forgets earlier parts.

  • GPT-5.2 accepts 400,000 tokens as its context window

This is very important while designing applications. As shown in the diagram below, if our LLM supports a 1024-token context window and a user uploads a PDF of 10,000 tokens, the LLM won't allow this action.

Alternatively, we can divide this PDF into chunks/batches, say 10 chunks of 1,000 tokens each, but we have to compromise on processing time. We are going to face issues like this while designing real applications, so we have to design our applications to handle such situations.
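The chunking idea can be sketched in a few lines (the chunk size and document length are illustrative; real pipelines usually also overlap adjacent chunks so sentences aren't cut mid-thought):

```python
# Split a long token sequence into chunks that each fit a small
# context window.
def chunk_tokens(tokens, window=1000):
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

doc = list(range(10_000))          # pretend this is a 10,000-token PDF
chunks = chunk_tokens(doc, window=1000)
print(len(chunks))                 # 10 chunks
print(len(chunks[0]))              # 1000 tokens each
```

Each chunk is then sent to the LLM separately (e.g., summarize each, then summarize the summaries), which is where the extra processing time comes from.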

In case we don't want to compromise on time, we can add a router, which will cost us some extra bucks. The router will route the request to a high-end LLM that can accept all the tokens as a single request from the user.



Prompt engineering techniques :
  • Zero-shot
    • You give the model a task without any examples
  • One-shot
    • You provide one example before the real question
  • Few-shot
    • You provide multiple examples before the real question
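To make the three techniques concrete, here are illustrative prompt strings for a made-up sentiment-labelling task (the wording and examples are my own, not from any model's documentation):

```python
# Zero-shot: the task alone, no worked examples.
zero_shot = "Classify the sentiment of: 'I loved this phone.'"

# One-shot: a single worked example before the real question.
one_shot = (
    "Review: 'Terrible battery life.' -> Negative\n"
    "Review: 'I loved this phone.' -> "
)

# Few-shot: several worked examples stacked before the real question.
few_shot = (
    "Review: 'Terrible battery life.' -> Negative\n"
    "Review: 'Amazing camera!' -> Positive\n"
    "Review: 'It broke in a week.' -> Negative\n"
    "Review: 'I loved this phone.' -> "
)

print(few_shot.count("->"))  # 4 arrows: 3 examples + the real question
```

Note that all the extra examples consume context-window tokens, so few-shot prompting trades accuracy against the limits discussed above.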



Three main stages of coding an LLM :
  • Implementing the LLM architecture and the data preparation process
  • Pretraining the LLM to create a foundation model
  • Fine-tuning the foundation model to become a personal assistant or text classifier


We are going to talk about the above 3 stages of implementing an LLM in the next few blogs. We will see how to code it entirely on our local machine.


We are also going to discuss each individual component of the Transformer architecture above.


That's all for this blog. 

The next blog, i.e. AI Blog#10, will be about Tokenization, Input & Position Embeddings, etc.

Thank you for reading this blog !

Arun Mathe
