Coding an LLM involves three main stages: implementing the data sampling and understanding the basic mechanisms, pre-training the LLM on unlabelled data to obtain a foundation model for further fine-tuning, and fine-tuning the pre-trained LLM to create a classifier or a personal assistant/chat model.
As part of implementing an end-to-end LLM, we need to implement the above 3 stages, which involve the steps below :
- Data preparation & Sampling(This blog covers this concept)
- Attention mechanism
- LLM architecture
- Pre-training
- Training Loop
- Model Evaluation
- Load Pre-training weights
- Fine-tuning (to create a classification model)
- Fine-tuning (to create a personal assistant or chat model)
We are going to discuss the first step, i.e. Data Preparation & Sampling, in this blog.
Data Preparation & Sampling :
Data preparation & sampling produces the input to the LLM.
It includes below steps :
- Tokenization
- Input Output pairs
- Token Embeddings
- Positional Encodings (Positional Embeddings)
- Input Embeddings
We will cover all the above topics in detail in this blog.
Flow of above steps :
As shown in the above image :
- Input Data will be converted into tokens by using a technique called Tokenization
- Once tokens are ready, we will convert them into Token Embeddings
- Once Token Embeddings are ready, we define the position of each token using Positional Encoding
- Combining Token Embeddings & Positional Encodings, we will create Input Embeddings
- Once Input Embeddings are ready, we will input them to GPT model
Explanation for above image :
- Splitting each word as a separate entity : 'This', 'is', 'an', 'example'
- Then converting each word into a token ID (a number from the vocabulary), as below
- This (123), is (456), an (789), example (104)
- This is called Tokenization
In the Tokenizer class, we have 2 methods :
- encode()
- It splits the given sentence into individual words
- Ex : "This is an example" into 'This', 'is', 'an', 'example'
- Then assigns each word its token ID from the vocabulary
- decode()
- The input to decode() is a list of token IDs; it converts those token IDs back to words and assembles the sentence at the end
Let's see the code below for splitting words from a sentence.
It is just an example to get some idea of how sentences are split into words as part of Data Preparation & Sampling.
You can access/download code from following GitHub location : https://github.com/amathe1/LLMs/blob/main/Tokenization_in_LLM.ipynb
Also, I took raw_text from the following GitHub repo (which I have used as input) : https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/the-verdict.txt
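The splitting step can be sketched as below. This is a minimal, self-contained example on a short sample sentence; the exact regex in the notebook may differ slightly, and the full version reads the-verdict.txt instead.

```python
import re

# A short sample sentence standing in for raw_text from the-verdict.txt
text = "Hello, world. Is this-- a test?"

# Split on punctuation, double-dashes, and whitespace, keeping the delimiters
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Drop the empty strings and bare whitespace left over from the split
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)
```

Note that the punctuation marks themselves survive as separate tokens, which is why they appear in the vocabulary later.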
Now, the input data is ready for next step, Tokenization.
Step 2 : Create a vocabulary - data must be in sorted order & unique
Please see the code below :
Observe that after the sentence split in step 1, we got 4690 words, but a vocabulary must be unique. Hence, after sorting and removing duplicates, the vocabulary size is 1130.
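The vocabulary construction can be sketched like this (a toy token list is used here for illustration; with the full split of the-verdict.txt, all_tokens would contain the 1130 unique entries):

```python
# Toy stand-in for the 4690 words produced by the splitting step
preprocessed = ['This', 'is', 'an', 'example', ',', 'is', 'This', '.']

# A vocabulary must be unique and sorted: deduplicate with set(), then sort
all_tokens = sorted(set(preprocessed))
vocab = {token: integer for integer, token in enumerate(all_tokens)}

for item in vocab.items():
    print(item)
```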
# Iterating & printing entire vocabulary by adding an index using enumerate()
Output :
('!', 0) ('"', 1) ("'", 2) ('(', 3) (')', 4) (',', 5) ('--', 6) ('.', 7) (':', 8) (';', 9) ('?', 10) ('A', 11) ('Ah', 12) ('Among', 13) ('And', 14) ('Are', 15) ('Arrt', 16) ('As', 17) ('At', 18) ('Be', 19) ('Begin', 20) ('Burlington', 21) ('But', 22) ('By', 23) ('Carlo', 24) ('Chicago', 25) ('Claude', 26) ('Come', 27) ('Croft', 28) ('Destroyed', 29) ('Devonshire', 30) ('Don', 31) ('Dubarry', 32) ('Emperors', 33) ('Florence', 34) ('For', 35) ('Gallery', 36) ('Gideon', 37) ('Gisburn', 38) ('Gisburns', 39) ('Grafton', 40) ('Greek', 41) ('Grindle', 42) ('Grindles', 43) ('HAD', 44) ('Had', 45) ('Hang', 46) ('Has', 47) ('He', 48) ('Her', 49) ('Hermia', 50)
Now, our vocabulary consists of words and their corresponding token IDs. In the above 2 steps, we had the encode() and decode() functionalities separately. The class below has both of these functionalities in one place.
Note :
re.sub(r'\s+([,.?!"()\'])', r'\1', text) identifies one or more whitespace characters before the listed punctuation marks and replaces the match with just the punctuation mark, without the whitespace. This is called whitespace normalization before punctuation.
Why we need this: text = " ".join([self.int_to_str[i] for i in ids]) adds a space between every string it joins. Hence we delete that space wherever it is not needed, while retaining the punctuation. Hope it makes sense.
Syntax :
re.sub(pattern, replacement, string)
Example : "Hello , world ! How are you ?" ==> "Hello, world! How are you?"
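Putting it together, a tokenizer class combining encode() and decode() might look like the following sketch. The tiny vocab here is purely for demonstration; the notebook version builds its vocabulary from the-verdict.txt.

```python
import re

class SimpleTokenizerV1:
    """Word-level tokenizer; raises KeyError for words outside the vocabulary."""
    def __init__(self, vocab):
        self.str_to_int = vocab                             # word -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # token ID -> word

    def encode(self, text):
        # Split on punctuation and whitespace, keeping the delimiters
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Whitespace normalization: remove the space join() put before punctuation
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Tiny vocabulary just for demonstration
vocab = {token: i for i, token in enumerate(sorted({'Hello', ',', 'world', '!'}))}
tokenizer = SimpleTokenizerV1(vocab)
ids = tokenizer.encode("Hello, world!")
print(ids)
print(tokenizer.decode(ids))
```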
In practice, we won't implement this tokenization ourselves; LLMs ship with tokenizers already trained for them. But we should be aware of it and how it works inside ML models.
Also, just FYI, here we used some text to create a vocabulary, but in practice every model has its own vocabulary. For example, the vocabulary of the GPT-2 model can be seen at https://huggingface.co/openai-community/gpt2/raw/main/vocab.json with 50,257 tokens. Newer GPT models use much larger vocabularies, reportedly in the hundreds of thousands of tokens!
Important : So, whatever we type as input to an ML model is looked up entry by entry in its vocabulary, and tokens are assigned accordingly.
# Created instance of above class
The above piece of code tests whether our tokenizer is working or not. We took some text from verdict.txt and passed it to encode() to see if it prints the corresponding token IDs. Yes, it worked! Again, when we pass the same token IDs to decode(), it returns the string. This is how it works inside an ML model.
Let's apply it to a new text sample that is not contained in the vocabulary. We will get a KeyError, as shown in the image below. To handle such cases, we add two special tokens to the vocabulary :
- <|unk|> - unknown word
- <|endoftext|> - end of text
Please see the image above; we added both special tokens to the vocabulary and printed them for confirmation. Observe that the vocabulary size is now 1132! (it was 1130 before)
Now, let's see the code below, where we :
- Replace unknown words with the <|unk|> token
- Remove spaces before the specified punctuation marks
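A sketch of that V2 behaviour, assuming the same splitting regex as before (a tiny hypothetical vocab again; the notebook builds the real 1132-entry one):

```python
import re

class SimpleTokenizerV2:
    """Like V1, but maps out-of-vocabulary words to <|unk|> instead of failing."""
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        # Replace words missing from the vocabulary with the <|unk|> token
        tokens = [t if t in self.str_to_int else "<|unk|>" for t in tokens]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Tiny vocabulary extended with the two special tokens
vocab = {token: i for i, token in
         enumerate(sorted({'Hello', ',', 'tea', '?'}) + ["<|endoftext|>", "<|unk|>"])}
tokenizer = SimpleTokenizerV2(vocab)
result = tokenizer.decode(tokenizer.encode("Hello, do you like tea?"))
print(result)
```

Words like 'do', 'you', 'like' are not in the toy vocab, so they come back as <|unk|> rather than raising a KeyError.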
Other models introduce additional special tokens like BOS (beginning of sequence), EOS (end of sequence), PAD (padding), etc.
Tokenization Algorithms :
- Word Based
- Character Based
- Subword Based
1) Word Based Tokenization Algorithm :
- Example : This is an Example
- After converting it into tokens : ["This", "is", "an", "Example"]
- This is the English language, right? How many unique English words are there in the world? Around 200,000. (We need to create a vocabulary, and that's the reason for asking this question.)
- Imagine the vocabulary size if English alone has around 200,000 words.
- Remember, GPT-2 (released in 2019) has a vocabulary of only about 50k tokens.
- Do you think English had only 50k words? No, right? That means they implemented some other logic for tokenization instead of word-based tokenization. This is the first problem:
- If we went with word-based tokenization, the vocabulary size would be too large.
- Play, Plays, Playing, Played (in the word-based tokenization algorithm, each word is converted into its own token, right?) But what is the root word for this list of words? That's PLAY.
- Play(Root Word)
- Plays
- Playing
- Played
- In this case, word-based tokenization assigns a unique token to each of the above words even though they carry similar meanings. It doesn't recognize the root word. This is the second problem : we are missing the similarity/semantic relation here.
To summarize, word-based tokenization has two problems:
- Huge vocabulary size
- Missing similarity: the semantic relation between words is not captured; each word gets a separate token
Hence, the character-based tokenization algorithm was introduced.
2) Character Based Tokenization Algorithm :
- Example : This is an Example
- After converting it into tokens : ['T', 'h', 'i', 's', 'i', 's', 'a', 'n', 'E', 'x', 'a', 'm', 'p', 'l', 'e']
- Now, as we create tokens per character, the vocabulary size is reduced a lot
- But we still have the problem of not carrying any semantic meaning between similar words
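A quick sketch of character-based tokenization and the resulting tiny vocabulary:

```python
text = "This is an Example"

# Character-based tokenization: every character (including spaces) is a token
char_tokens = list(text)
print(char_tokens)

# The vocabulary is just the set of unique characters -- tiny compared to word level
vocab = {ch: i for i, ch in enumerate(sorted(set(char_tokens)))}
print("Vocabulary size:", len(vocab))
```

The trade-off: the vocabulary shrinks to a handful of characters, but sequences get much longer and individual tokens carry almost no meaning.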
Hence, the subword-based tokenization algorithm was introduced.
3) Subword Based Tokenization Algorithm :
- This is a hybrid of the word-based & character-based tokenization algorithms.
BPE (Byte Pair Encoding) was introduced in 1994 (by Philip Gage). Back then it was mainly used for data compression; compression algorithms were very popular in those days.
Let's understand how the BPE algorithm works, and then we will see how it is adapted for subword tokenization.
BPE follows two rules :
- Rule-1 : DO NOT split frequently used words into smaller subwords
- Rule-2 : Split the rare words into smaller meaningful subwords
Important statement : Most common pair of consecutive bytes of data is replaced with a byte that doesn't occur in the data.
Lets say this is original data : "aaabdaaabac"
- Counting overlapping pairs, 'aa' is the most common pair of consecutive bytes (1st & 2nd 'a' form one pair, 2nd & 3rd 'a' form another, and so on)
- aaabdaaabac (we replace each occurrence of 'aa' with z)
- zabdzabac (now 'ab' is most common pair of consecutive bytes - replace them with y)
- zydzyac (now 'zy' is most common pair of consecutive bytes - replace them with x)
- xdxac
The initial data was "aaabdaaabac"; after applying BPE, it reduced to 'xdxac'. Compression achieved!
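The compression walkthrough above can be sketched in a few lines, using Counter to confirm which adjacent pair is most frequent and then applying the same substitutions:

```python
from collections import Counter

data = "aaabdaaabac"

# Count every pair of adjacent characters; 'aa' is the most frequent (4 times)
pair_counts = Counter(a + b for a, b in zip(data, data[1:]))
print(pair_counts.most_common(1))

# Apply the replacements from the walkthrough above
step1 = data.replace("aa", "z")   # zabdzabac
step2 = step1.replace("ab", "y")  # zydzyac
step3 = step2.replace("zy", "x")  # xdxac
print(step1, step2, step3)
```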
Now, lets assume, below dataset of words : {word : no_of_times_repeated}
{"old" : 7, "older" : 3, "finest" : 9, "lowest" : 9 }
Preprocessing technique : to mark the end of each word, let's append '</w>', which means end of word.
Dataset = {"old</w>" : 7, "older</w>" : 3, "finest</w>" : 9, "lowest</w>" : 9 }
(We will get to know why we are adding </w> by the end of this example.)
==> Let's split the words into characters and count their frequency.
Frequency Table (character : count, computed from the dataset above) :
o : 19, l : 19, d : 10, e : 21, r : 3, f : 9, i : 9, n : 9, s : 18, t : 18, w : 9, </w> : 28
Now, apply BPE to the above dataset, keeping the 2 rules (Rule-1, Rule-2) in mind.
Understand: we need to merge the most common pair in the above dataset to form a new symbol, and whenever we merge a pair, we subtract those characters' counts from the frequency table above. Let's see how we do it.
- "es" is the top common pair, assume after compression "es" is X repeated 13 times
- "est" is the next common pair, assume "est" is Y repeated 13 times
- "est</w>" is the next common pair, repeated 13 times
Most possible subwords at the end :
This is called Subword Tokenization, which is the practical tokenization technique used in current models.
Remember we added </w> at the end of each word; that's because :
- Assume 2 words : estimate, highest
- If we don't add </w>, then during merging we don't know whether 'est' is a prefix (as in 'estimate') or a suffix (as in 'highest')
- That's the reason we added </w> during pre-processing : to segregate prefixes from suffixes
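The pair-counting step on the word dataset can be sketched like this; it confirms that 'es' (along with 'st' and 't</w>') is a top pair with frequency 9 + 9 = 18:

```python
from collections import Counter

# The toy dataset from above, with </w> marking the end of each word
dataset = {"old</w>": 7, "older</w>": 3, "finest</w>": 9, "lowest</w>": 9}

def to_symbols(word):
    # Split into single characters plus the trailing </w> marker
    return list(word[:-len("</w>")]) + ["</w>"]

# Count every adjacent symbol pair, weighted by word frequency
pair_counts = Counter()
for word, freq in dataset.items():
    symbols = to_symbols(word)
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

print(pair_counts.most_common(3))
```

A full BPE trainer would repeatedly merge the top pair into a new symbol and recount, but this single pass shows where the 'es' -> 'est' -> 'est</w>' merges come from.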
We have a Python package called tiktoken, which we can use for tokenization in LLMs; it internally uses the BPE algorithm.
! pip3 install tiktoken
import importlib.metadata
import tiktoken
print("tiktoken version:", importlib.metadata.version("tiktoken"))

# Instantiate the GPT-2 BPE tokenizer used in the examples below
tokenizer = tiktoken.get_encoding("gpt2")
# Testing test data
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
# Decoding back to actual data
strings = tokenizer.decode(integers)
print(strings)
# One more example
integers = tokenizer.encode("Akwirw ier") print(integers) strings = tokenizer.decode(integers) print(strings)
Using tiktoken library :
! pip install tiktoken
import tiktoken

# Text to encode and decode
text = "The lion roams in the jungle"

# ─────────────────────────────────────────────────────────────────────────
# 1. GPT-2 Encoding/Decoding
#    Using the "gpt2" encoding
# ─────────────────────────────────────────────────────────────────────────
tokenizer_gpt2 = tiktoken.get_encoding("gpt2")  # We need to specify the model

# Encode: text -> list of token IDs
token_ids_gpt2 = tokenizer_gpt2.encode(text)

# Decode: list of token IDs -> original text (just to verify correctness)
decoded_text_gpt2 = tokenizer_gpt2.decode(token_ids_gpt2)

# We can also get each token string by decoding the IDs one by one
tokens_gpt2 = [tokenizer_gpt2.decode([tid]) for tid in token_ids_gpt2]

print("=== GPT-2 Encoding ===")
print("Original Text: ", text)
print("Token IDs:     ", token_ids_gpt2)
print("Tokens:        ", tokens_gpt2)
print("Decoded Text:  ", decoded_text_gpt2)
print()
Output :
=== GPT-2 Encoding ===
Original Text:  The lion roams in the jungle
Token IDs:      [464, 18744, 686, 4105, 287, 262, 20712]
Tokens:         ['The', ' lion', ' ro', 'ams', ' in', ' the', ' jungle']
Decoded Text:   The lion roams in the jungle
This is called TOKENIZATION.
Guys, we don't even use all these techniques directly, but we should understand these internal concepts before stepping into the material coming in future blogs. We simply call the LLM, and it internally does all of the calculations discussed here. It is good to have deep knowledge of the internals before stepping into the actual work. So, do not panic at the moment. Save it for later 😁
Please find the code from following GitHub location, it has more test cases with other models as well : https://github.com/amathe1/LLMs/blob/main/Tokenization_in_LLM.ipynb
- Input, Output pairs
- Data Sampling with sliding window
From the above data, the target variable is Sal; Age and Exp are the independent variables. The expectation is that after training, if we give the model a person's age & experience, it will predict that person's salary.
For example, "Deep Learning is Powerful" is the input. If we give this data to a Decoder model, it should predict the next word.
- The data must be converted into a language that a neural network can understand.
- Data Sampling is required to handle the Context Window (***)
- Sliding Window is required to retain meaningful context (***)
x: [290, 4920, 2241, 287]
y: [4920, 2241, 287, 257]
Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:
for i in range(1, context_size+1):
context = enc_sample[:i]
desired = enc_sample[i]
print(context, "---->", desired)
Output :
[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257
Everything left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.
For illustration purposes, let's repeat the previous code but convert the token IDs into text:
for i in range(1, context_size+1):
context = enc_sample[:i]
desired = enc_sample[i]
print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
and ----> established
and established ----> himself
and established himself ----> in
and established himself in ----> a
We've now created the input-target pairs that we can put to use for LLM training in the upcoming blogs.
This is called input-output pairs. The LLM expects the data to be in the form of input-output pairs.
See the images above for the concepts of sliding window/stride size (1 word) and context window size (4 words). Check how the inputs and targets are segregated.
Step 1: Tokenize the entire text
Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length
Step 3: Return the total number of rows in the dataset
Step 4: Return a single row from the dataset
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
The GPTDatasetV1 class above is based on the PyTorch Dataset class.
It defines how individual rows are fetched from the dataset.
Each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor.
The target_chunk tensor contains the corresponding targets.
I recommend reading on to see what the data returned from this dataset looks like when we combine the dataset with a PyTorch DataLoader -- this will bring additional intuition and clarity.
The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:
Step 1: Initialize the tokenizer
Step 2: Create dataset
Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training
Step 4: The number of CPU processes to use for preprocessing
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader
Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4.
This will develop an intuition of how the GPTDatasetV1 class and the create_dataloader_v1 function work together:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
Convert dataloader into a Python iterator to fetch the next entry via Python's built-in next() function
import torch
print("PyTorch version:", torch.__version__)

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
Output :
PyTorch version: 2.6.0+cu124 [tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.
Since the max_length is set to 4, each of the two tensors contains 4 token IDs.
Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.
To illustrate the meaning of stride=1, let's fetch another batch from this dataset:
second_batch = next(data_iter)
print(second_batch)
# Output :
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch.
For example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input.
The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach
Batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes. If you have previous experience with deep learning, you may know that small batch sizes require less memory during training but lead to more noisy model updates.
Just like in regular deep learning, the batch size is a trade-off and hyperparameter to experiment with when training LLMs.
Before we move on to the two final sections of this chapter that are focused on creating the embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
# Output :
Inputs:
tensor([[ 40, 367, 2885, 1464],
[ 1807, 3619, 402, 271],
[10899, 2138, 257, 7026],
[15632, 438, 2016, 257],
[ 922, 5891, 1576, 438],
[ 568, 340, 373, 645],
[ 1049, 5975, 284, 502],
[ 284, 3285, 326, 11]])
Targets:
tensor([[ 367, 2885, 1464, 1807],
[ 3619, 402, 271, 10899],
[ 2138, 257, 7026, 15632],
[ 438, 2016, 257, 922],
[ 5891, 1576, 438, 568],
[ 340, 373, 645, 1049],
[ 5975, 284, 502, 284],
[ 3285, 326, 11, 287]])
Note that we increased the stride to 4. This utilizes the dataset fully (we don't skip a single word) while also avoiding any overlap between the batches, since more overlap could lead to increased overfitting.
We have seen the below topics so far :
- Data Sampling
- Context window, Sliding window/Stride
- Input, Output pairs
- Token Embedding
- Positional Embedding/Encoding
- Input Embeddings
Understanding Embedding Concept (Very Important to understand, read carefully!)
Context refresh! The main idea of a decoder model is to predict the next word.
But after generating input-output pairs, data sampling, etc., the data is still in the form of tokens, isn't it? And it is not carrying any meaning.
Consider below example of data : Cat, Book, Tablet, Kitten, Dog, Puppy
The idea we have to understand is: though there is a semantic relationship within the data (e.g., Dog vs Puppy above), the model is not carrying that information yet, as the words are still in token format.
How to carry semantic relation between data into ML model/LLM ?
The answer is embeddings, popularized by techniques like word2vec.
One more example : Dog and Cat come under animals, AND Apple and Banana come under fruits.
As humans we know their categories, but how does a computer know? The answer to this question is attributes, or input features.
- Dog has a tail, it barks
- Cat also has a tail, has 4 legs, it makes sounds, etc.
- Banana is edible, etc.
Using the above vector of values, we can understand the similarity of inputs by looking at the attributes. For example, Dog & Cat have similar features: if you compare the vector values of the same attributes, they will be close in value.
- Wherever the Cat value is high, the Dog value is also high
- Wherever the Cat value is low, the Dog value is also low
- Vice versa for Apple & Banana as well
This means tokens of similar data are close to each other; hence, the vector captures semantic meaning. Converting our tokens into vectors is exactly what Embedding means.
Main question : HOW ARE THESE VECTOR VALUES INITIALIZED?
Answer : BY A NEURAL NETWORK
Let's revise neural networks (NN) simply :
- A NN has input, hidden, and output layers, and each neuron is connected to every neuron in the next layer.
- These connections are called weights. Initially they are random values.
- As we move through the iterations, repeating forward propagation, loss calculation, back propagation, and weight adjustment, all these random weight values converge to a state where they accurately model the input data. Right?
Very important statement :
These embedding weight vectors are random values at first. This initialization serves as the starting point of the LLM's learning process. Later, we will optimize these weights as part of LLM training.
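A minimal sketch of that starting point with PyTorch's nn.Embedding (toy sizes for illustration; the layer's weight matrix is the randomly initialized lookup table described above):

```python
import torch

torch.manual_seed(123)

vocab_size = 6   # toy vocabulary
output_dim = 3   # each token becomes a 3-dimensional vector

# The embedding layer is a lookup table of trainable weight vectors,
# initialized with random values -- the starting point described above
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

# Looking up token IDs simply returns the corresponding rows of the weight matrix
token_embeddings = embedding_layer(torch.tensor([2, 3, 5, 1]))
print(token_embeddings.shape)
```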
I will be adding Positional Embedding/Encoding to this same blog as well! The concept is similar, but positional embeddings carry the position of each word/token (instead of the semantic relation). Finally, to create the input embeddings, the token embeddings and positional embeddings are added together to create a common representation of the input data, which is then passed to our GPT model!
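As a quick preview, here is a sketch (assuming GPT-2-like dimensions) of how token embeddings and positional embeddings are added to form the input embeddings:

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim, context_length = 50257, 256, 4

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# A dummy batch of 8 sequences with 4 token IDs each (as from the dataloader above)
token_ids = torch.randint(0, vocab_size, (8, context_length))
token_embeddings = token_embedding_layer(token_ids)                 # (8, 4, 256)

# One positional vector per position 0..3, shared across the whole batch
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # (4, 256)

# Broadcasting adds the position vectors to every sequence in the batch
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
```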
That's all about Tokenization! Let me add the Positional & Input Embedding details to this blog in a day or two.
Thank you for reading this blog !
Arun Mathe