Coding an LLM involves three main stages: implementing the data sampling and understanding the basic mechanisms, pre-training the LLM on unlabelled data to obtain a foundation model for further fine-tuning, and fine-tuning the pre-trained LLM to create a classifier, personal assistant, or chat model.
As part of implementing an end-to-end LLM, we need to implement the above 3 stages, which involve the below steps :
- Data preparation & Sampling(This blog covers this concept)
- Attention mechanism
- LLM architecture
- Pre-training
- Training Loop
- Model Evaluation
- Load pre-trained weights
- Fine-tuning (to create a classification model)
- Fine-tuning (to create a personal assistant or chat model)
We are going to discuss the first step, i.e. Data Preparation & Sampling, in this blog.
Data Preparation & Sampling :
Data Preparation & Sampling produces the input to the LLM.
It includes below steps :
- Tokenization
- Input Output pairs
- Token Embeddings
- Positional Encodings (Positional Embeddings)
- Input Embeddings
We will cover all the above topics in detail in this blog.
Flow of above steps :
As shown in the above image :
- Input data will be converted into tokens using a technique called Tokenization
- Once tokens are ready, we will convert them into Token Embeddings
- Once Token Embeddings are ready, we define the position of each token using Positional Encoding
- Combining Token Embeddings & Positional Encodings, we create Input Embeddings
- Once Input Embeddings are ready, we feed them to the GPT model
Explanation for above image :
- Splitting each word as a separate entity : 'This', 'is', 'an', 'example'
- Then convert each word into a token ID (an index into the vocabulary), as below
- This (123), is (456), an (789), example (104)
- This is called Tokenization
In the Tokenizer class, we have 2 methods :
- encode()
- It will split the given sentence into individual words
- Ex : "This is an example" into 'This', 'is', 'an', 'example'
- Then assign a token ID to each word based on the vocabulary
- decode()
- The input to the decode() method is token IDs; it converts those token IDs back to words and joins them into a sentence at the end
Let's see the below code for splitting a sentence into words.
It is just an example to get some idea of how sentences are split into words as part of Data Preparation & Sampling.
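Since the full notebook is linked below, here is a minimal sketch of that splitting step, assuming a simple regex that treats punctuation and whitespace as boundaries:

```python
import re

text = "Hello, world. This, is a test."

# Split on punctuation, double dashes, and whitespace, keeping them as separate items
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Drop empty strings and bare-whitespace entries
result = [item.strip() for item in result if item.strip()]
print(result)
# → ['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
```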
You can access/download code from following GitHub location : https://github.com/amathe1/LLMs/blob/main/Tokenization_in_LLM.ipynb
Also, I have taken raw_text from the following GitHub repo : https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/the-verdict.txt (which I have used as input)
Now, the input data is ready for next step, Tokenization.
Step 2 : Create a vocabulary - data must be sorted & unique
Please see below code :
Observe that after the sentence split in step 1, we got 4690 words, but a vocabulary should contain only unique entries. Hence, after sorting and removing duplicates, the vocabulary size is 1130.
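A minimal sketch of the vocabulary-building step; the small `preprocessed` list here stands in for the full 4690-item word list from step 1:

```python
# `preprocessed` stands in for the full token list from step 1
preprocessed = ['This', 'is', 'an', 'example', ',', 'is', 'an', 'example', '.']

# Sort and de-duplicate, then assign each unique token an integer ID
all_words = sorted(set(preprocessed))
vocab = {token: integer for integer, token in enumerate(all_words)}

print(len(vocab))   # number of unique tokens
for item in vocab.items():
    print(item)
```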
# Iterating & printing entire vocabulary by adding an index using enumerate()
Output :
('!', 0) ('"', 1) ("'", 2) ('(', 3) (')', 4) (',', 5) ('--', 6) ('.', 7) (':', 8) (';', 9) ('?', 10) ('A', 11) ('Ah', 12) ('Among', 13) ('And', 14) ('Are', 15) ('Arrt', 16) ('As', 17) ('At', 18) ('Be', 19) ('Begin', 20) ('Burlington', 21) ('But', 22) ('By', 23) ('Carlo', 24) ('Chicago', 25) ('Claude', 26) ('Come', 27) ('Croft', 28) ('Destroyed', 29) ('Devonshire', 30) ('Don', 31) ('Dubarry', 32) ('Emperors', 33) ('Florence', 34) ('For', 35) ('Gallery', 36) ('Gideon', 37) ('Gisburn', 38) ('Gisburns', 39) ('Grafton', 40) ('Greek', 41) ('Grindle', 42) ('Grindles', 43) ('HAD', 44) ('Had', 45) ('Hang', 46) ('Has', 47) ('He', 48) ('Her', 49) ('Hermia', 50)
Now our vocabulary consists of words and their corresponding token IDs. In the above 2 steps, we had the encode() and decode() functionalities separately. The below class has both of these functionalities in one place.
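A sketch of such a tokenizer class with both methods in one place (the class name here is illustrative; see the linked notebook for the actual code):

```python
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token string -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # token ID -> token string

    def encode(self, text):
        # Split text into words/punctuation, then map each to its token ID
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        return [self.str_to_int[s] for s in preprocessed]

    def decode(self, ids):
        # Map IDs back to strings, join with spaces,
        # then remove the extra space before punctuation
        text = " ".join([self.int_to_str[i] for i in ids])
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)
```

For example, with `vocab = {'Hello': 0, ',': 1, 'world': 2, '!': 3}`, encoding "Hello, world!" gives `[0, 1, 2, 3]`, and decoding those IDs returns "Hello, world!" with punctuation re-attached.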
Note:
re.sub(r'\s+([,.?!"()\'])', r'\1', text) identifies one or more whitespace characters before the listed punctuation marks and replaces them with the punctuation alone, removing the whitespace. This is called whitespace normalization before punctuation.
Why we need this : text = " ".join([self.int_to_str[i] for i in ids]) adds a space between every string it joins. Hence we delete that space wherever it is not needed, while retaining the punctuation. Hope it makes sense.
Syntax :
re.sub(pattern, replacement, string)
Example : "Hello , world ! How are you ?" ==> "Hello, world! How are you?"
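A quick check of that normalization on the example above:

```python
import re

s = 'Hello , world ! How are you ?'
# Remove any whitespace that appears right before the listed punctuation marks
normalized = re.sub(r'\s+([,.?!"()\'])', r'\1', s)
print(normalized)
# → Hello, world! How are you?
```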
In practice, we won't implement this tokenization ourselves; LLMs already ship with trained tokenizers. But we should be aware of it and how it works in ML models.
Also, just FYI, here we used some text to create a vocabulary, but in practice every model has its own vocabulary. For example, the vocabulary for the GPT-2 model can be seen at https://huggingface.co/openai-community/gpt2/raw/main/vocab.json with 50,257 tokens. Newer GPT models use far larger vocabularies, on the order of a few hundred thousand tokens!
Important : whatever we type/input to a model, the model looks each piece up in its vocabulary and assigns tokens accordingly.
# Created instance of above class
The above piece of code tests whether our vocabulary is working or not. We took some text from the-verdict.txt and passed it to encode() to see if it prints the corresponding token IDs. Yes, it worked! Again, when we pass the same token IDs to decode(), it returns the string. This is how it works in an ML model.
Let's apply it to a new text sample that is not contained in the vocabulary. We will get a KeyError as shown in the below image. To handle this, we add two special tokens :
- <|unk|> - unknown word
- <|endoftext|> - end of text
Please see the above image; we added both of the above tokens to the vocabulary and printed them for confirmation. Observe that the vocabulary size is 1132 now! (it was 1130 before)
Now, let's see the below code where we :
- Replace unknown words with the <|unk|> token
- Remove spaces before the specified punctuations
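Since the full code is in the notebook, here is a sketch of the <|unk|> handling on a toy 4-word vocabulary (the vocabulary and sentence are illustrative):

```python
import re

# Toy vocabulary, extended with the two special tokens
all_tokens = sorted(set(['This', 'is', 'an', 'example']))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}

def encode(text, vocab):
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    # Any word missing from the vocabulary is replaced by <|unk|>
    preprocessed = [item if item in vocab else "<|unk|>" for item in preprocessed]
    return [vocab[s] for s in preprocessed]

print(vocab)
print(encode("This is an unknownword", vocab))
```

The word "unknownword" is not in the vocabulary, so it is mapped to the ID of <|unk|> instead of raising a KeyError.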
Other models introduced additional tokens like BOS (beginning of sequence), EOS (end of sequence), PAD (padding), etc.
Tokenization Algorithms :
- Word Based
- Character Based
- Sub-word Based
1) Word Based Tokenization Algorithm :
- Example : This is an Example
- After converting it into tokens : ["This", "is", "an", "Example"]
- This is the English language, right ? How many unique English words are there in the world ? Roughly 2 lakh (200,000). (We need to create a vocabulary, and that's the reason for asking this question.)
- Imagine the vocabulary size if English alone has 2 lakh words!
- Remember, GPT-2, which was released in 2019, has only about 50k tokens.
- Do you think English had only 50k words in 2019 ? No, right ? That means they implemented some other logic for tokenization instead of word-based tokenization. This is the first problem.
- If we go with word-based tokenization, the vocabulary size would be too large.
- Play, Plays, Playing, Played (in the word-based tokenization algorithm, each word will be converted into one token, right ?) But what is the root word for this list of words ? That's PLAY.
- Play(Root Word)
- Plays
- Playing
- Played
- In this case, word-based tokenization will assign different tokens to all the above words even though they carry similar meanings. It doesn't recognize the root word. This is the second problem : we are missing the similarity/semantic relation here.
- Huge vocabulary size
- Missing similarity, not capturing semantic meaning of words. Each word has a separate token.
Hence, the character-based tokenization algorithm was introduced.
2) Character Based Tokenization Algorithm :
- Example : This is an Example
- After converting it into tokens : ['T', 'h', 'i', 's', 'i', 's', 'a', 'n', 'E', 'x', 'a', 'm', 'p', 'l', 'e'] (each character becomes a token; spaces omitted here)
- Now, as we create tokens based on character, size of vocabulary reduced a lot
- But individual characters carry no meaning at all, and sequences become very long, so we still don't capture the semantic meaning of words.
Hence, the sub-word based tokenization algorithm was introduced.
3) Subword Based Tokenization Algorithm :
- This is a hybrid of the word-based & character-based tokenization algorithms.
BPE (Byte Pair Encoding) was introduced in 1994 by Philip Gage. In those days it was mainly used for data compression, and compression algorithms were very popular then.
Let's understand how the BPE algorithm works! Then we will see how it is adapted for subword tokenization.
BPE follows two rules :
- Rule-1 : DO NOT split frequently used words into smaller subwords
- Rule-2 : Split the rare words into smaller meaningful subwords
Important statement : Most common pair of consecutive bytes of data is replaced with a byte that doesn't occur in the data.
Lets say this is original data : "aaabdaaabac"
- aa is one pair (1st a, 2nd a)
- aa is another pair (2nd a, 3rd a)
- aaabdaaabac (we will replace the most common pair 'aa' with z)
- zabdzabac (now 'ab' is most common pair of consecutive bytes - replace them with y)
- zydzyac (now 'zy' is most common pair of consecutive bytes - replace them with x)
- xdxac
The initial data was "aaabdaaabac"; after applying BPE, it reduced to 'xdxac'. Compression achieved!
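The replacement steps above can be scripted directly (the replacement symbols z, y, x are simply bytes that don't occur in the data):

```python
data = "aaabdaaabac"

# Apply the three merges walked through above, printing each intermediate result
for pair, symbol in [("aa", "z"), ("ab", "y"), ("zy", "x")]:
    data = data.replace(pair, symbol)
    print(f"replace {pair!r} with {symbol!r}: {data}")
# final value of data: 'xdxac'
```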
Now, lets assume, below dataset of words : {word : no_of_times_repeated}
{"old" : 7, "older" : 3, "finest" : 9, "lowest" : 9 }
Preprocessing technique : to mark the end of each word, let's add '</w>', which means end of word.
Dataset = {"old</w>" : 7, "older</w>" : 3, "finest</w>" : 9, "lowest</w>" : 9 }
(We will get to know why we are adding </w> by the end of this example)
==> Lets split words into characters and count their frequency.
Frequency Table :
Now, apply BPE to the above dataset, customized with the above 2 rules (Rule-1, Rule-2).
Understand : we need to merge the most common pair from the above dataset to form a new symbol, and when we form that merged symbol, we subtract those characters from the above frequency table. Let's see how we do it.
- "es" is the top common pair (9 occurrences in "finest" + 9 in "lowest"); assume after merging, "es" becomes a new symbol X, repeated 18 times
- "est" is the next most common pair; assume "est" becomes Y, repeated 18 times
- "est</w>" is the next most common pair, repeated 18 times
Most possible subwords at the end :
This is called Subword Tokenization! Which is a feasible tokenization technique in current models.
Remember we added </w> at the end of each word; that's because :
- Assume 2 words : estimate, highest
- If we don't add </w>, then during merging we don't know whether 'est' is a prefix (as in "estimate") or a suffix (as in "highest")
- That's the reason we added </w> during pre-processing : to distinguish prefixes from suffixes
We have a Python package called tiktoken, which we can use for tokenization in LLMs; it internally uses the BPE algorithm.
! pip3 install tiktoken
import importlib.metadata
import tiktoken
print("tiktoken version:", importlib.metadata.version("tiktoken"))
# Testing with sample text
tokenizer = tiktoken.get_encoding("gpt2")

text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    " of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
# Decoding back to actual data
strings = tokenizer.decode(integers)
print(strings)
# One more example
integers = tokenizer.encode("Akwirw ier")
print(integers)
strings = tokenizer.decode(integers)
print(strings)
Using tiktoken library :
! pip install tiktoken

import tiktoken

# Text to encode and decode
text = "The lion roams in the jungle"

# 1. GPT-2 encoding/decoding, using the "gpt2" encoding
tokenizer_gpt2 = tiktoken.get_encoding("gpt2")  # we need to specify the model

# Encode: text -> list of token IDs
token_ids_gpt2 = tokenizer_gpt2.encode(text)

# Decode: list of token IDs -> original text (just to verify correctness)
decoded_text_gpt2 = tokenizer_gpt2.decode(token_ids_gpt2)

# We can also get each token string by decoding the IDs one by one
tokens_gpt2 = [tokenizer_gpt2.decode([tid]) for tid in token_ids_gpt2]

print("=== GPT-2 Encoding ===")
print("Original Text: ", text)
print("Token IDs:     ", token_ids_gpt2)
print("Tokens:        ", tokens_gpt2)
print("Decoded Text:  ", decoded_text_gpt2)
Output :
=== GPT-2 Encoding ===
Original Text:  The lion roams in the jungle
Token IDs:      [464, 18744, 686, 4105, 287, 262, 20712]
Tokens:         ['The', ' lion', ' ro', 'ams', ' in', ' the', ' jungle']
Decoded Text:   The lion roams in the jungle
This is called TOKENIZATION.
Guys, we won't even use all these techniques directly, but we should understand these internal concepts before stepping into the topics coming in future blogs. We simply call an LLM and it internally does all of these calculations. It is good to have deep knowledge of the internals before stepping into the actual stuff. So, do not panic at the moment. Save it for later 😁
Please find the code from following GitHub location, it has more test cases with other models as well : https://github.com/amathe1/LLMs/blob/main/Tokenization_in_LLM.ipynb
- Input, Output pairs
- Data Sampling with sliding window
From the above data, the target variable is Sal; Age and Exp are independent variables. The expectation is that after training, if we give the age & experience of a person, the model will predict the salary of that person.
For example, "Deep Learning is Powerful" is the input. If we give this data to a Decoder model, it should predict the next word.
- We should convert the data into a form that a neural network can understand.
- Data Sampling is required to handle the context window (***)
- The sliding window is required to retain meaningful context (***)
x: [290, 4920, 2241, 287] y: [4920, 2241, 287, 257]
Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:
for i in range(1, context_size+1):
context = enc_sample[:i]
desired = enc_sample[i]
print(context, "---->", desired)
Output :[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257
Everything left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.
For illustration purposes, let's repeat the previous code but convert the token IDs into text:
for i in range(1, context_size+1):
context = enc_sample[:i]
desired = enc_sample[i]
print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
and ----> established
and established ----> himself
and established himself ----> in
and established himself in ----> a
We've now created the input-target pairs that we can put to use for LLM training in upcoming blogs.
This is called input-output pairs. The LLM expects the data to be in the form of input-output pairs.
See the above images for the concepts of sliding window size (1 word) and context window size (4 words), and check how inputs and targets are segregated.
Step 1: Tokenize the entire text
Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length
Step 3: Return the total number of rows in the dataset
Step 4: Return a single row from the dataset
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
The GPTDatasetV1 class in listing 2.5 is based on the PyTorch Dataset class.
It defines how individual rows are fetched from the dataset.
Each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor.
The target_chunk tensor contains the corresponding targets.
I recommend reading on to see what the data returned from this dataset looks like when we combine it with a PyTorch DataLoader -- this will bring additional intuition and clarity.
The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:
Step 1: Initialize the tokenizer
Step 2: Create dataset
Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training
Step 4: The number of CPU processes to use for preprocessing
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader
Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4,
This will develop an intuition of how the GPTDatasetV1 class and the create_dataloader_v1 function work together:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
Convert dataloader into a Python iterator to fetch the next entry via Python's built-in next() function
import torch
print("PyTorch version:", torch.__version__)

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
Output :
PyTorch version: 2.6.0+cu124
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.
Since the max_length is set to 4, each of the two tensors contains 4 token IDs.
Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.
To illustrate the meaning of stride=1, let's fetch another batch from this dataset:
second_batch = next(data_iter)
print(second_batch)
# Output :
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch.
For example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input.
The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach
Batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes. If you have previous experience with deep learning, you may know that small batch sizes require less memory during training but lead to more noisy model updates.
Just like in regular deep learning, the batch size is a trade-off and hyperparameter to experiment with when training LLMs.
Before we move on to the two final sections of this chapter that are focused on creating the embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
# Output :
Inputs:
tensor([[ 40, 367, 2885, 1464],
[ 1807, 3619, 402, 271],
[10899, 2138, 257, 7026],
[15632, 438, 2016, 257],
[ 922, 5891, 1576, 438],
[ 568, 340, 373, 645],
[ 1049, 5975, 284, 502],
[ 284, 3285, 326, 11]])
Targets:
tensor([[ 367, 2885, 1464, 1807],
[ 3619, 402, 271, 10899],
[ 2138, 257, 7026, 15632],
[ 438, 2016, 257, 922],
[ 5891, 1576, 438, 568],
[ 340, 373, 645, 1049],
[ 5975, 284, 502, 284],
[ 3285, 326, 11, 287]])
Note that we increase the stride to 4. This is to utilize the dataset fully (we don't skip a single word) while also avoiding any overlap between the batches, since more overlap could lead to increased overfitting.

We have seen the below topics so far :
- Data Sampling
- Context window, Sliding window/Stride
- Input, Output pairs
- Token Embedding
- Positional Embedding/Encoding
- Input Embeddings
Understanding Embedding Concept (Very Important to understand, read carefully!)
Context refresh! The main idea of a decoder model is to predict the next word.
But after generating input-output pairs, data sampling, etc., the data is still in the form of tokens, isn't it ? And a token ID by itself does not carry any meaning.
Consider below example of data : Cat, Book, Tablet, Kitten, Dog, Puppy
The idea we have to understand is : though there are semantic relationships in the data (Ex : Dog vs. Puppy in the above data), the model is not carrying that information yet, as the words are still plain token IDs.
How to carry semantic relation between data into ML model/LLM ?
The answer is word embeddings (popularized by word2vec).
One more example : Dog and Cat come under animals, while Apple and Banana come under fruits.
As humans we know their categories, but how does a computer know ? The answer to this question is attributes, or input features.
- A dog has a tail, it barks
- A cat also has a tail, has 4 legs, it makes sounds, etc.
- A banana is edible, etc.
Using the above vectors of values, we can understand the similarity of inputs by looking at attributes. For example, dog & cat have similar features : if you compare the vector values for the same attributes, they will be almost similar in value.
- Wherever the cat value is high, the dog value is also high
- Wherever the cat value is low, the dog value is also low
- Likewise for apple and banana
This means vectors of similar items are close to each other; hence, the vector captures semantic meaning. Converting our tokens into such vectors is nothing but Embedding.
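A tiny illustration of "similar items have similar vectors", using made-up attribute values and cosine similarity:

```python
import math

# Hypothetical 3-dimensional attribute vectors (values are made up for illustration)
vectors = {
    "dog":    [0.9, 0.8, 0.1],
    "cat":    [0.85, 0.75, 0.1],
    "apple":  [0.05, 0.0, 0.9],
    "banana": [0.1, 0.0, 0.95],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, 0.0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["dog"], vectors["cat"]))       # near 1.0: similar
print(cosine(vectors["dog"], vectors["banana"]))    # much smaller: dissimilar
```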
Main question : HOW WILL THESE VECTOR VALUES BE INITIALIZED ?
Answer : initially they are random values inside a NEURAL NETWORK.
Let's revise NNs to understand this concept :
- A NN has input, hidden, and output layers, and each neuron is connected to every neuron in the next layer.
- These connections have weights. Initially they are random values.
- As we iterate, repeating forward propagation, loss calculation, backward propagation, and weight adjustment, all these random weight values get adjusted and converge to a state where they accurately represent the input data. Correct ?
Very important statement :
These embedding weight vectors are random values at first. This initialization serves as the starting point of the LLM's learning process. Later we will optimize these weights as part of LLM training.
Let's revise whatever we discussed in this blog until now :
As part of Data Preprocessing -
- Input of LLM is a collection of sentences.
- Ex : "This is an example"
- Once we provide this data to a LLM, especially GPT-2 model, next step is :
- It is going to split the words using some Tokenization technique
- If it is a GPT-2 model, internally it uses the BPE sub-word tokenization algorithm
- Other tokenization techniques are word-based and character-based tokenization
- We can use the tiktoken Python package to replicate GPT-2's sub-word tokenization
- tiktoken provides 2 main methods :
- encode() - converts text into token IDs
- decode() - converts token IDs back into text
- After tokenization, the next step is input-output pairs; as we are using NNs, the expected input of a neural network is x, y (input, output) pairs
- Then Data Sampling & Sliding window concept start
- To fit into the context window size, we need to divide the data into chunks
- To retain the context across chunks, we use overlapping windows
- Programmatically, the sliding window is nothing but the stride
- Token Embedding
- Positional Embedding
- Input Embeddings (= Token Embeddings + Positional Embeddings)
- Finally, Input Embeddings is the input to GPT model
Token Embeddings :
Consider below words and respective tokens :
- Dog - 1234
- Cat - 2345
- Apple - 3456
- Banana - 5678
Initially, these vector values are random and are fed to the NN. As training progresses, accurate weights/vector values are learned. In the above image, x1, x2, x3, x4 are input values to the NN, and observe that each input value has a corresponding vector. Once LLM pretraining is completed, these vectors capture the semantic relation of x1, x2, x3, x4 and how close they are to one another.
Note :
These embedding weight vectors are random values at first. This initialization serves as the starting point of the LLM's learning process. Later we will optimize these weights as part of LLM training.
Vocabulary size of GPT-2 : 50257
All the token ID's are part of GPT-2 vocabulary(assuming we are using GPT-2 model).
(**) GPT-2 token embedding sizes : 768 (GPT-2 Small), 1024 (Medium), 1280 (Large), 1600 (XL). This means every token is represented in a 768-dimensional space for the GPT-2 Small model.
Tokens :
0 - [768 dimensional vector]
123 - [768 dimensional vector]
As shown in the above image, if we use GPT-2 small model, then each token will be represented in a 768 dimensional space.
Similarly :
- if we use GPT-2 Medium model, each token will be represented in a 1024 dimensional space
- if we use GPT-2 Large model, each token will be represented in a 1280 dimensional space
- if we use GPT-2 XL model, each token will be represented in a 1600 dimensional space
The vocabulary size is 50257 & each embedding vector is represented in 768 dimensions. A neural network with that structure is shown in the above image. Initially the weights are random values, and they are adjusted to accurate weights/vectors as training progresses.
The size of the matrix would be 50257 * 768 ≈ 38.6 million values (very large!). Hence LLMs are complex neural network architectures.
In the above example, assume the vocabulary size is 6 (6 unique words) & output_dim is 3, meaning each embedding vector is represented in a 3-dimensional space (for simplicity).
Then NN will be as below :
Token embeddings will be created as shown in the above example.
Code :

import torch

vocab_size = 6      # 6 unique words in our toy vocabulary
output_dim = 3      # each embedding vector has 3 dimensions

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

The above code produces the weight matrix, which is nothing but the token embeddings. Initially these are random numbers, and they will be optimized as training progresses.
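To see the lookup side of the embedding layer, a self-contained sketch (the token IDs are arbitrary examples within the toy vocabulary of 6):

```python
import torch

vocab_size, output_dim = 6, 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# The embedding layer is just a lookup table: row i of the weight matrix
# is the embedding vector for token ID i
input_ids = torch.tensor([2, 3, 5, 1])
token_vectors = embedding_layer(input_ids)
print(token_vectors.shape)   # torch.Size([4, 3])
```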
That's all about Token Embeddings. Lets move to next topic i.e. Positional Encodings.
Positional Encodings :
We have seen Token Embeddings till now, which will help us to carry the semantic meaning of the input data into the model.
Now, read below 2 sentences :
- The cat sat on the mat
- On the mat the cat sat
For both of the above sentences, the token embeddings will be the same, as the words are the same, but the positions of those words are different.
If we don't consider the positions, it will be difficult to predict the next word (as the model is meant to predict the next word). Hence, position matters a lot.
How to implement Positional Embedding/Encoding :
- Consider the vocabulary size of GPT-2, i.e. 50257, and for simplicity consider the number of dimensions as 256 (instead of 768, the GPT-2 Small dimension), and build a NN as below.
- context_length : 4
- batch_size : 8
In the above image, the 1st matrix is related to the Dataset/DataLoader, where we segregated data for parallel processing, and the 2nd matrix holds the embedding vectors for each token in the vocabulary, represented in 256 dimensions (just for simplicity; in reality it is 768-dimensional for GPT-2 Small, as discussed before).
Also note that for all the tokens we segregated for data processing in matrix 1, the respective embedding vectors are available in the 2nd matrix, as shown in the above image.
Input Embeddings = Token Embedding + Positional Encoding
Example :
Consider Token embedding for one token, [0.1, 1.2, 1.8, 2.5] and corresponding positional encoding value [0.3, 0.4, 0.5, 0.6]
Input Embedding = [0.1, 1.2, 1.8, 2.5] + [0.3, 0.4, 0.5, 0.6] = [0.4, 1.6, 2.3, 3.1]
This is what we fed to GPT model.
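The whole combination can be sketched in PyTorch under the setup discussed above (vocab 50257, 256 dimensions, context length 4, batch of 8); the random token IDs stand in for a real dataloader batch:

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim, context_length = 50257, 256, 4

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# A batch of 8 sequences of 4 token IDs each (stand-in for a dataloader batch)
inputs = torch.randint(0, vocab_size, (8, context_length))

token_embeddings = token_embedding_layer(inputs)                    # (8, 4, 256)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # (4, 256)

# Broadcasting adds the same positional vector to every sequence in the batch
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)   # torch.Size([8, 4, 256])
```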
Note : the sinusoidal positional encoding values from the original Transformer paper are calculated only once and are not adjusted during training; only the token embedding values are adjusted. (GPT-2 actually uses learned positional embeddings, which are also optimized during training, but the fixed sinusoidal scheme is what we walk through below.)
Positional Encoding formula :
As per the Transformer architecture paper "Attention Is All You Need", the following 2 formulas are used :
- PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Assume each word is one single token : I [0.1, 0.2, 0.1], Like [0.3, 0.22, 1.2], AI [0.11, 0.2, 0.4]
How Positional Embedding will be created from above Token Embeddings ?
I [0.1, 0.2, 0.1]
Like [0.3, 0.22, 1.2]
AI [0.11, 0.2, 0.4]
Consider index positions as 0, 1, 2 for above Token Embeddings.
Position : pos = 0 (I), 1 (Like), 2 (AI); d_model = number of dimensions of the representation (3 in this example); i indexes the dimensions, e.g. for I [0.1, 0.2, 0.1] the dimension indices are 0, 1, 2. Note that whether we use sin or cos depends on the dimension index, not on the word's position.
For the 1st value (dimension index 0) of I's positional vector : even index, hence use the sin formula.
Now, calculate the 2nd value (dimension index 1) in the positional vector : odd index, hence use the cos formula.
For I, the positional vector so far is [0, 1, _ ] (since sin(0) = 0 and cos(0) = 1).
Now, calculate the 3rd value (dimension index 2) in the positional vector : even index, hence use the sin formula.
Sample process for 2nd word, Like :
Sample process for 3rd word, AI:
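The sinusoidal formulas can be computed directly for the 3-word, 3-dimension example above (note the sin/cos choice depends on the dimension index, not the word's position):

```python
import math

d_model = 3   # dimension of the toy embeddings above

def positional_encoding(pos, d_model):
    pe = []
    for k in range(d_model):
        # The exponent uses 2*floor(k/2), per PE(pos, 2i) and PE(pos, 2i+1)
        angle = pos / (10000 ** ((k - k % 2) / d_model))
        # Even dimension index -> sin, odd dimension index -> cos
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

for pos, word in enumerate(["I", "Like", "AI"]):
    print(word, [round(v, 4) for v in positional_encoding(pos, d_model)])
# For "I" (pos 0) this gives [0.0, 1.0, 0.0]
```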
Note :
- Sinusoidal positional encodings are computed only once
- Token Embeddings + Positional Embeddings = Input Embeddings
This was for one single word, I. The same steps are repeated for Like and AI in the input sentence. The model's training loop then runs over all tokens, optimizing the learnable weights (including the token embeddings) toward a minimum of the loss. This is all happening during LLM training.
Important points in this blog :
- Tokenization
- The model won't understand a raw sentence; we need to split it into words/characters/sub-words
- Split into tokens and assign token IDs to them
- Sub-word based algorithms are used in GPT models; they internally use the BPE (Byte Pair Encoding) algorithm, which we walked through above
- Token Embeddings
- We need to convert token ID's into token embeddings to carry the semantic meaning of data
- This internally uses a neural network : token embeddings start as random weights, and as training progresses these vector values are adjusted to accurate values
- For each token ID, an embedding vector is created
- The dimension size of the GPT-2 Small model is 768 and the vocabulary size is 50257
- Context Window
- The maximum number of tokens that can be passed to the model in a single context
- Positional Encoding
- To maintain the order of the tokens in the sentence
- Input Embeddings
- Input Embeddings = Token Embeddings + Positional Embeddings
Thank you for reading this blog !
Arun Mathe