Coding an LLM involves three main stages: implementing the data sampling and understanding the basic mechanisms, pre-training the LLM on unlabelled data to obtain a foundation model for further fine-tuning, and fine-tuning the pre-trained LLM to create a classifier or a personal assistant/chat model.
As part of implementing an end-to-end LLM, we need to implement the above 3 stages, which involve the steps below :
- Data preparation & Sampling(This blog covers this concept)
- Attention mechanism
- LLM architecture
- Pre-training
- Training Loop
- Model Evaluation
- Load Pre-training weights
- Fine-tuning (to create a classification model)
- Fine-tuning (to create a personal assistant or chat model)
We are going to discuss the first step, i.e. Data Preparation & Sampling, in this blog.
Data Preparation & Sampling :
Data preparation & sampling produces the input to the LLM.
It includes below steps :
- Tokenization
- Input Output pairs
- Token Embeddings
- Positional Encodings (Positional Embeddings)
- Input Embeddings
We will cover all the above topics in detail in this blog.
Flow of above steps :
As shown in the above image :
- Input Data will be converted into tokens by using a technique called Tokenization
- Once tokens are ready, we will convert them into Token Embeddings
- Once Token Embeddings are ready, we define the position of each token using Positional Encoding
- Combining Token Embeddings & Positional Encodings, we will create Input Embeddings
- Once Input Embeddings are ready, we will input them to GPT model
Explanation for above image :
- Splitting each word as a separate entity : 'This', 'is', 'an', 'example'
- Then converting each word into a token ID (a number from the vocabulary), as below
- This (123), is (456), an (789), example (104)
- This is called Tokenization
In the Tokenizer class, we have 2 methods :
- encode()
- It splits the given sentence into individual words
- Ex : "This is an example" into 'This', 'is', 'an', 'example'
- Then assigns each word its token ID from the vocabulary
- decode()
- The input to decode() is a list of token IDs; it converts those token IDs back to words and assembles the sentence at the end
Let's see the code below for splitting words from a sentence.
It is just an example to get some idea of how sentences are split into words as part of Data Preparation & Sampling.
You can access/download code from following GitHub location : https://github.com/amathe1/LLMs/blob/main/Tokenization_in_LLM.ipynb
Also, I took raw_text from the following GitHub repo (which I have used as input) : https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/the-verdict.txt
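The splitting step can be sketched as below. This is a minimal, self-contained example on a short sample sentence; the exact regex in the notebook may differ slightly, and the full version reads the-verdict.txt instead.

```python
import re

# A short sample sentence standing in for raw_text from the-verdict.txt
text = "Hello, world. Is this-- a test?"

# Split on punctuation, double-dashes, and whitespace, keeping the delimiters
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Drop the empty strings and bare whitespace left over from the split
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)
```

Note that the punctuation marks themselves survive as separate tokens, which is why they appear in the vocabulary later.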
Now, the input data is ready for next step, Tokenization.
Step 2 : Create a vocabulary - data must be in sorted order & unique
Please see the code below :
Observe that after the sentence split in step 1, we got 4690 words, but a vocabulary must be unique. Hence, after sorting and removing duplicates, the vocabulary size is 1130.
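The vocabulary construction can be sketched like this (a toy token list is used here for illustration; with the full split of the-verdict.txt, all_tokens would contain the 1130 unique entries):

```python
# Toy stand-in for the 4690 words produced by the splitting step
preprocessed = ['This', 'is', 'an', 'example', ',', 'is', 'This', '.']

# A vocabulary must be unique and sorted: deduplicate with set(), then sort
all_tokens = sorted(set(preprocessed))
vocab = {token: integer for integer, token in enumerate(all_tokens)}

for item in vocab.items():
    print(item)
```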
# Iterating & printing entire vocabulary by adding an index using enumerate()
Output :
('!', 0) ('"', 1) ("'", 2) ('(', 3) (')', 4) (',', 5) ('--', 6) ('.', 7) (':', 8) (';', 9) ('?', 10) ('A', 11) ('Ah', 12) ('Among', 13) ('And', 14) ('Are', 15) ('Arrt', 16) ('As', 17) ('At', 18) ('Be', 19) ('Begin', 20) ('Burlington', 21) ('But', 22) ('By', 23) ('Carlo', 24) ('Chicago', 25) ('Claude', 26) ('Come', 27) ('Croft', 28) ('Destroyed', 29) ('Devonshire', 30) ('Don', 31) ('Dubarry', 32) ('Emperors', 33) ('Florence', 34) ('For', 35) ('Gallery', 36) ('Gideon', 37) ('Gisburn', 38) ('Gisburns', 39) ('Grafton', 40) ('Greek', 41) ('Grindle', 42) ('Grindles', 43) ('HAD', 44) ('Had', 45) ('Hang', 46) ('Has', 47) ('He', 48) ('Her', 49) ('Hermia', 50)
Now, our vocabulary consists of words and their corresponding token IDs. In the above 2 steps, we had the encode() and decode() functionalities separately. The class below has both of these functionalities in one place.
Note :
re.sub(r'\s+([,.?!"()\'])', r'\1', text) identifies one or more whitespace characters before the listed punctuation marks and replaces the match with just the punctuation mark, without the whitespace. This is called whitespace normalization before punctuation.
Why we need this: text = " ".join([self.int_to_str[i] for i in ids]) adds a space between every string it joins. Hence we delete that space wherever it is not needed, while retaining the punctuation. Hope it makes sense.
Syntax :
re.sub(pattern, replacement, string)
Example : "Hello , world ! How are you ?" ==> "Hello, world! How are you?"
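Putting it together, a tokenizer class combining encode() and decode() might look like the following sketch. The tiny vocab here is purely for demonstration; the notebook version builds its vocabulary from the-verdict.txt.

```python
import re

class SimpleTokenizerV1:
    """Word-level tokenizer; raises KeyError for words outside the vocabulary."""
    def __init__(self, vocab):
        self.str_to_int = vocab                             # word -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # token ID -> word

    def encode(self, text):
        # Split on punctuation and whitespace, keeping the delimiters
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Whitespace normalization: remove the space join() put before punctuation
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Tiny vocabulary just for demonstration
vocab = {token: i for i, token in enumerate(sorted({'Hello', ',', 'world', '!'}))}
tokenizer = SimpleTokenizerV1(vocab)
ids = tokenizer.encode("Hello, world!")
print(ids)
print(tokenizer.decode(ids))
```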
In practice, we won't implement this tokenization ourselves; LLMs ship with tokenizers already trained for them. But we should be aware of it and how it works inside ML models.
Also, just FYI, here we used some text to create a vocabulary, but in practice every model has its own vocabulary. For example, the vocabulary of the GPT-2 model can be seen at https://huggingface.co/openai-community/gpt2/raw/main/vocab.json with 50,257 tokens. Newer GPT models use much larger vocabularies, reportedly in the hundreds of thousands of tokens!
Important : So, whatever we type as input to an ML model is looked up entry by entry in its vocabulary, and tokens are assigned accordingly.
# Created instance of above class
The above piece of code tests whether our tokenizer is working or not. We took some text from verdict.txt and passed it to encode() to see if it prints the corresponding token IDs. Yes, it worked! Again, when we pass the same token IDs to decode(), it returns the string. This is how it works inside an ML model.
Let's apply it to a new text sample that is not contained in the vocabulary. We will get a KeyError, as shown in the image below. To handle such cases, we add two special tokens to the vocabulary :
- <|unk|> - unknown word
- <|endoftext|> - end of text
Please see the image above; we added both special tokens to the vocabulary and printed them for confirmation. Observe that the vocabulary size is now 1132! (it was 1130 before)
Now, let's see the code below, where we :
- Replace unknown words with the <|unk|> token
- Remove spaces before the specified punctuation marks
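A sketch of that V2 behaviour, assuming the same splitting regex as before (a tiny hypothetical vocab again; the notebook builds the real 1132-entry one):

```python
import re

class SimpleTokenizerV2:
    """Like V1, but maps out-of-vocabulary words to <|unk|> instead of failing."""
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        # Replace words missing from the vocabulary with the <|unk|> token
        tokens = [t if t in self.str_to_int else "<|unk|>" for t in tokens]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Tiny vocabulary extended with the two special tokens
vocab = {token: i for i, token in
         enumerate(sorted({'Hello', ',', 'tea', '?'}) + ["<|endoftext|>", "<|unk|>"])}
tokenizer = SimpleTokenizerV2(vocab)
result = tokenizer.decode(tokenizer.encode("Hello, do you like tea?"))
print(result)
```

Words like 'do', 'you', 'like' are not in the toy vocab, so they come back as <|unk|> rather than raising a KeyError.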
Other models introduce additional special tokens like BOS (beginning of sequence), EOS (end of sequence), PAD (padding), etc.
Tokenization Algorithms :
- Word Based
- Character Based
- Subword Based
1) Word Based Tokenization Algorithm :
- Example : This is an Example
- After converting it into tokens : ["This", "is", "an", "Example"]
- This is the English language, right? How many unique English words are there in the world? Around 200,000. (We need to create a vocabulary, and that's the reason for asking this question.)
- Imagine the vocabulary size if English alone has around 200,000 words.
- Remember, GPT-2 (released in 2019) has a vocabulary of only about 50k tokens.
- Do you think English had only 50k words? No, right? That means they implemented some other logic for tokenization instead of word-based tokenization. This is the first problem:
- If we went with word-based tokenization, the vocabulary size would be too large.
- Play, Plays, Playing, Played (in the word-based tokenization algorithm, each word is converted into its own token, right?) But what is the root word for this list of words? That's PLAY.
- Play(Root Word)
- Plays
- Playing
- Played
- In this case, word-based tokenization assigns a unique token to each of the above words even though they carry similar meanings. It doesn't recognize the root word. This is the second problem : we are missing the similarity/semantic relation here.
To summarize, word-based tokenization has two problems:
- Huge vocabulary size
- Missing similarity: the semantic relation between words is not captured; each word gets a separate token
Hence, the character-based tokenization algorithm was introduced.
2) Character Based Tokenization Algorithm :
- Example : This is an Example
- After converting it into tokens : ['T', 'h', 'i', 's', 'i', 's', 'a', 'n', 'E', 'x', 'a', 'm', 'p', 'l', 'e']
- Now, as we create tokens per character, the vocabulary size is reduced a lot
- But we still have the problem of not carrying any semantic meaning between similar words
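A quick sketch of character-based tokenization and the resulting tiny vocabulary:

```python
text = "This is an Example"

# Character-based tokenization: every character (including spaces) is a token
char_tokens = list(text)
print(char_tokens)

# The vocabulary is just the set of unique characters -- tiny compared to word level
vocab = {ch: i for i, ch in enumerate(sorted(set(char_tokens)))}
print("Vocabulary size:", len(vocab))
```

The trade-off: the vocabulary shrinks to a handful of characters, but sequences get much longer and individual tokens carry almost no meaning.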
Hence, the subword-based tokenization algorithm was introduced.
3) Subword Based Tokenization Algorithm :
- This is a hybrid of the word-based & character-based tokenization algorithms.
BPE (Byte Pair Encoding) was introduced in 1994 (by Philip Gage). Back then it was mainly used for data compression; compression algorithms were very popular in those days.
Let's understand how the BPE algorithm works, and then we will see how it is adapted for subword tokenization.
BPE follows two rules :
- Rule-1 : DO NOT split frequently used words into smaller subwords
- Rule-2 : Split the rare words into smaller meaningful subwords
Important statement : Most common pair of consecutive bytes of data is replaced with a byte that doesn't occur in the data.
Lets say this is original data : "aaabdaaabac"
- Counting overlapping pairs, 'aa' is the most common pair of consecutive bytes (1st & 2nd 'a' form one pair, 2nd & 3rd 'a' form another, and so on)
- aaabdaaabac (we replace each occurrence of 'aa' with z)
- zabdzabac (now 'ab' is most common pair of consecutive bytes - replace them with y)
- zydzyac (now 'zy' is most common pair of consecutive bytes - replace them with x)
- xdxac
The initial data was "aaabdaaabac"; after applying BPE, it reduced to 'xdxac'. Compression achieved!
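The compression walkthrough above can be sketched in a few lines, using Counter to confirm which adjacent pair is most frequent and then applying the same substitutions:

```python
from collections import Counter

data = "aaabdaaabac"

# Count every pair of adjacent characters; 'aa' is the most frequent (4 times)
pair_counts = Counter(a + b for a, b in zip(data, data[1:]))
print(pair_counts.most_common(1))

# Apply the replacements from the walkthrough above
step1 = data.replace("aa", "z")   # zabdzabac
step2 = step1.replace("ab", "y")  # zydzyac
step3 = step2.replace("zy", "x")  # xdxac
print(step1, step2, step3)
```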
Now, lets assume, below dataset of words : {word : no_of_times_repeated}
{"old" : 7, "older" : 3, "finest" : 9, "lowest" : 9 }
Preprocessing technique : to mark the end of each word, let's append '</w>', which means end of word.
Dataset = {"old</w>" : 7, "older</w>" : 3, "finest</w>" : 9, "lowest</w>" : 9 }
(We will get to know why we are adding </w> by the end of this example.)
==> Let's split the words into characters and count their frequency.
Frequency Table (character : count, computed from the dataset above) :
o : 19, l : 19, d : 10, e : 21, r : 3, f : 9, i : 9, n : 9, s : 18, t : 18, w : 9, </w> : 28
Now, apply BPE to the above dataset, keeping the 2 rules (Rule-1, Rule-2) in mind.
Understand: we need to merge the most common pair in the above dataset to form a new symbol, and whenever we merge a pair, we subtract those characters' counts from the frequency table above. Let's see how we do it.
- "es" is the top common pair, assume after compression "es" is X repeated 13 times
- "est" is the next common pair, assume "est" is Y repeated 13 times
- "est</w>" is the next common pair, repeated 13 times
Most possible subwords at the end :
This is called Subword Tokenization, which is the practical tokenization technique used in current models.
Remember we added </w> at the end of each word; that's because :
- Assume 2 words : estimate, highest
- If we don't add </w>, then during merging we don't know whether 'est' is a prefix (as in 'estimate') or a suffix (as in 'highest')
- That's the reason we added </w> during pre-processing : to segregate prefixes from suffixes
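The pair-counting step on the word dataset can be sketched like this; it confirms that 'es' (along with 'st' and 't</w>') is a top pair with frequency 9 + 9 = 18:

```python
from collections import Counter

# The toy dataset from above, with </w> marking the end of each word
dataset = {"old</w>": 7, "older</w>": 3, "finest</w>": 9, "lowest</w>": 9}

def to_symbols(word):
    # Split into single characters plus the trailing </w> marker
    return list(word[:-len("</w>")]) + ["</w>"]

# Count every adjacent symbol pair, weighted by word frequency
pair_counts = Counter()
for word, freq in dataset.items():
    symbols = to_symbols(word)
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

print(pair_counts.most_common(3))
```

A full BPE trainer would repeatedly merge the top pair into a new symbol and recount, but this single pass shows where the 'es' -> 'est' -> 'est</w>' merges come from.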
We have a Python package called tiktoken, which we can use for tokenization in LLMs; it internally uses the BPE algorithm.
! pip3 install tiktoken
import importlib.metadata
import tiktoken
print("tiktoken version:", importlib.metadata.version("tiktoken"))

# Instantiate the GPT-2 BPE tokenizer used in the examples below
tokenizer = tiktoken.get_encoding("gpt2")
# Testing test data
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
# Decoding back to actual data
strings = tokenizer.decode(integers)
print(strings)
# One more example
integers = tokenizer.encode("Akwirw ier") print(integers) strings = tokenizer.decode(integers) print(strings)
Using tiktoken library :
! pip install tiktoken
import tiktoken

# Text to encode and decode
text = "The lion roams in the jungle"

# ─────────────────────────────────────────────────────────────────────────
# 1. GPT-2 Encoding/Decoding
#    Using the "gpt2" encoding
# ─────────────────────────────────────────────────────────────────────────
tokenizer_gpt2 = tiktoken.get_encoding("gpt2")  # We need to specify the model

# Encode: text -> list of token IDs
token_ids_gpt2 = tokenizer_gpt2.encode(text)

# Decode: list of token IDs -> original text (just to verify correctness)
decoded_text_gpt2 = tokenizer_gpt2.decode(token_ids_gpt2)

# We can also get each token string by decoding the IDs one by one
tokens_gpt2 = [tokenizer_gpt2.decode([tid]) for tid in token_ids_gpt2]

print("=== GPT-2 Encoding ===")
print("Original Text: ", text)
print("Token IDs:     ", token_ids_gpt2)
print("Tokens:        ", tokens_gpt2)
print("Decoded Text:  ", decoded_text_gpt2)
print()
Output :
=== GPT-2 Encoding ===
Original Text:  The lion roams in the jungle
Token IDs:      [464, 18744, 686, 4105, 287, 262, 20712]
Tokens:         ['The', ' lion', ' ro', 'ams', ' in', ' the', ' jungle']
Decoded Text:   The lion roams in the jungle
This is called TOKENIZATION.
Guys, we don't even use all these techniques directly, but we should understand these internal concepts before stepping into the material coming in future blogs. We simply call the LLM, and it internally does all of the calculations discussed here. It is good to have deep knowledge of the internals before stepping into the actual work. So, do not panic at the moment. Save it for later 😁
Please find the code from following GitHub location, it has more test cases with other models as well : https://github.com/amathe1/LLMs/blob/main/Tokenization_in_LLM.ipynb
- Input, Output pairs
- Data Sampling with sliding window
From the above data, the target variable is Sal; Age and Exp are the independent variables. The expectation is that after training, if we give the model a person's age & experience, it will predict that person's salary.
For example, "Deep Learning is Powerful" is the input. If we give this data to a Decoder model, it should predict the next word.
- The data must be converted into a language that a neural network can understand.
- Data Sampling is required to handle the Context Window (***)
- Sliding Window is required to retain meaningful context (***)
x: [290, 4920, 2241, 287]
y: [4920, 2241, 287, 257]
Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:
for i in range(1, context_size+1):
context = enc_sample[:i]
desired = enc_sample[i]
print(context, "---->", desired)
Output :
[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257
Everything left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.
For illustration purposes, let's repeat the previous code but convert the token IDs into text:
for i in range(1, context_size+1):
context = enc_sample[:i]
desired = enc_sample[i]
print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
and ----> established
and established ----> himself
and established himself ----> in
and established himself in ----> a
We've now created the input-target pairs that we can put to use for LLM training in the upcoming blogs.
This is called input-output pairs. The LLM expects the data to be in the form of input-output pairs.
See the images above for the concepts of sliding window/stride size (1 word) and context window size (4 words). Check how the inputs and targets are segregated.
Step 1: Tokenize the entire text
Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length
Step 3: Return the total number of rows in the dataset
Step 4: Return a single row from the dataset
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
The GPTDatasetV1 class above is based on the PyTorch Dataset class.
It defines how individual rows are fetched from the dataset.
Each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor.
The target_chunk tensor contains the corresponding targets.
I recommend reading on to see what the data returned from this dataset looks like when we combine the dataset with a PyTorch DataLoader -- this will bring additional intuition and clarity.
The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:
Step 1: Initialize the tokenizer
Step 2: Create dataset
Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training
Step 4: The number of CPU processes to use for preprocessing
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader
Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4.
This will develop an intuition of how the GPTDatasetV1 class and the create_dataloader_v1 function work together:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
Convert dataloader into a Python iterator to fetch the next entry via Python's built-in next() function
import torch
print("PyTorch version:", torch.__version__)

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
Output :
PyTorch version: 2.6.0+cu124 [tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.
Since the max_length is set to 4, each of the two tensors contains 4 token IDs.
Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.
To illustrate the meaning of stride=1, let's fetch another batch from this dataset:
second_batch = next(data_iter)
print(second_batch)
# Output :
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch.
For example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input.
The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach
Batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes. If you have previous experience with deep learning, you may know that small batch sizes require less memory during training but lead to more noisy model updates.
Just like in regular deep learning, the batch size is a trade-off and hyperparameter to experiment with when training LLMs.
Before we move on to the two final sections of this chapter that are focused on creating the embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
# Output :
Inputs:
tensor([[ 40, 367, 2885, 1464],
[ 1807, 3619, 402, 271],
[10899, 2138, 257, 7026],
[15632, 438, 2016, 257],
[ 922, 5891, 1576, 438],
[ 568, 340, 373, 645],
[ 1049, 5975, 284, 502],
[ 284, 3285, 326, 11]])
Targets:
tensor([[ 367, 2885, 1464, 1807],
[ 3619, 402, 271, 10899],
[ 2138, 257, 7026, 15632],
[ 438, 2016, 257, 922],
[ 5891, 1576, 438, 568],
[ 340, 373, 645, 1049],
[ 5975, 284, 502, 284],
[ 3285, 326, 11, 287]])
Note that we increased the stride to 4. This utilizes the dataset fully (we don't skip a single word) while also avoiding any overlap between the batches, since more overlap could lead to increased overfitting.
We have seen the below topics so far :
- Data Sampling
- Context window, Sliding window/Stride
- Input, Output pairs
- Token Embedding
- Positional Embedding/Encoding
- Input Embeddings
Understanding Embedding Concept (Very Important to understand, read carefully!)
Context refresh! The main idea of a decoder model is to predict the next word.
But after generating input-output pairs, data sampling, etc., the data is still in the form of tokens, isn't it? And it is not carrying any meaning.
Consider below example of data : Cat, Book, Tablet, Kitten, Dog, Puppy
The idea we have to understand is: though there is a semantic relationship within the data (e.g., Dog vs Puppy above), the model is not carrying that information yet, as the words are still in token format.
How to carry semantic relation between data into ML model/LLM ?
The answer is embeddings, popularized by techniques like word2vec.
One more example : Dog and Cat come under animals, AND Apple and Banana come under fruits.
As humans we know their categories, but how does a computer know? The answer to this question is attributes, or input features.
- Dog has a tail, it barks
- Cat also has a tail, has 4 legs, it makes sounds, etc.
- Banana is edible, etc.
Using the above vector of values, we can understand the similarity of inputs by looking at the attributes. For example, Dog & Cat have similar features: if you compare the vector values of the same attributes, they will be close in value.
- Wherever the Cat value is high, the Dog value is also high
- Wherever the Cat value is low, the Dog value is also low
- Vice versa for Apple & Banana as well
This means tokens of similar data are close to each other; hence, the vector captures semantic meaning. Converting our tokens into vectors is exactly what Embedding means.
Main question : HOW ARE THESE VECTOR VALUES INITIALIZED?
Answer : BY A NEURAL NETWORK
Let's revise neural networks (NN) simply :
- A NN has input, hidden, and output layers, and each neuron is connected to every neuron in the next layer.
- These connections are called weights. Initially they are random values.
- As we move through the iterations, repeating forward propagation, loss calculation, back propagation, and weight adjustment, all these random weight values converge to a state where they accurately model the input data. Right?
Very important statement :
These embedding weight vectors are random values at first. This initialization serves as the starting point of the LLM's learning process. Later, we will optimize these weights as part of LLM training.
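A minimal sketch of that starting point with PyTorch's nn.Embedding (toy sizes for illustration; the layer's weight matrix is the randomly initialized lookup table described above):

```python
import torch

torch.manual_seed(123)

vocab_size = 6   # toy vocabulary
output_dim = 3   # each token becomes a 3-dimensional vector

# The embedding layer is a lookup table of trainable weight vectors,
# initialized with random values -- the starting point described above
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

# Looking up token IDs simply returns the corresponding rows of the weight matrix
token_embeddings = embedding_layer(torch.tensor([2, 3, 5, 1]))
print(token_embeddings.shape)
```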
I will be adding Positional Embedding/Encoding to this same blog as well! The concept is similar, but positional embeddings carry the position of each word/token (instead of the semantic relation). Finally, to create the input embeddings, the token embeddings and positional embeddings are added together to create a common representation of the input data, which is then passed to our GPT model!
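As a quick preview, here is a sketch (assuming GPT-2-like dimensions) of how token embeddings and positional embeddings are added to form the input embeddings:

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim, context_length = 50257, 256, 4

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# A dummy batch of 8 sequences with 4 token IDs each (as from the dataloader above)
token_ids = torch.randint(0, vocab_size, (8, context_length))
token_embeddings = token_embedding_layer(token_ids)                 # (8, 4, 256)

# One positional vector per position 0..3, shared across the whole batch
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # (4, 256)

# Broadcasting adds the position vectors to every sequence in the batch
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)
```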
That's all about Tokenization! Let me add the Positional & Input Embedding details to this blog in a day or two.
Thank you for reading this blog !
Arun Mathe