
(AI Blog #10) LLM: Data Preparation & Sampling - Tokenization & Embedding Explained

Coding an LLM involves three main stages: implementing the data sampling and understanding the basic mechanism; pre-training the LLM on unlabelled data to obtain a foundation model for further fine-tuning; and fine-tuning the pre-trained LLM to create a classifier, a personal assistant, or a chat model.

To implement an end-to-end LLM, we need to implement the above three stages, which involve the following steps:

  1. Data Preparation & Sampling (covered in this blog)
  2. Attention mechanism
  3. LLM architecture
  4. Pre-training
  5. Training Loop
  6. Model Evaluation
  7. Load Pre-training weights
  8. Fine-tuning (to create a classification model)
  9. Fine-tuning (to create a personal assistant or chat model)

In this blog, we discuss the first step: Data Preparation & Sampling.



Data Preparation & Sampling :

Data preparation & sampling produces the input to the LLM.

It includes the following steps:

  • Tokenization
  • Input Output pairs
  • Token Embeddings
  • Positional Encodings (Positional Embeddings)
  • Input Embeddings

We will cover all the above topics in detail in this blog.


Flow of above steps :


As shown in the above image :

  • Input data is converted into tokens using a technique called Tokenization
  • Once the tokens are ready, we convert them into Token Embeddings
  • Once the Token Embeddings are ready, we define the position of each token using Positional Encoding
  • Combining the Token Embeddings & Positional Encodings, we create the Input Embeddings
  • Once the Input Embeddings are ready, we feed them to the GPT model

The image below shows how text is converted into Token Embeddings:

Note: Assume for now that a token means a word. Once we learn how tokenization works, we will see that a token could be a word, a character, or a sub-word. We will see more details about this.

Explanation for above image :

Step 1: Read the input data & split it into words and special characters (excluding whitespace)
  • Splitting each word out as a separate entity: 'This', 'is', 'an', 'example'
    • Then convert each word into a token ID (some arbitrary numbers for now), as below
    • This (123), is (456), an (789), example (104)
    • This is called Tokenization

In the Tokenizer class, we have 2 methods:

  • encode()
    • Splits the given sentence into words
    • Ex: "This is an example" into 'This', 'is', 'an', 'example'
    • Then assigns a number (token ID) to each word
  • decode()
    • Takes token IDs as input, converts those token IDs back into words, and joins them into a sentence at the end


Let's see the code below for splitting words out of a sentence:

It is just an example to give some idea of how sentences are split into words as part of Data Preparation & Sampling.
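Since the notebook code is linked below, here is a minimal sketch of the splitting step (the sample sentence is illustrative, not from the notebook):

```python
import re

# Split on punctuation, '--', and whitespace, keeping the delimiters
# thanks to the capture group in the pattern.
text = "Hello, world. This, is a test."

tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Drop empty strings and pure-whitespace entries left over from the split
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)
```

This prints ['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.'] - each word and each punctuation mark becomes its own entry, while whitespace is discarded.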




You can access/download code from following GitHub location : https://github.com/amathe1/LLMs/blob/main/Tokenization_in_LLM.ipynb

Also, I got the raw_text from the following GitHub repo: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/the-verdict.txt (which I have used as input).

Now, the input data is ready for next step, Tokenization.

Step 2: Create a vocabulary - the data must be sorted and unique

Please see the code below.


Observe that after the sentence split in Step 1 we got 4690 words, but a vocabulary must contain unique entries. Hence, after sorting and removing duplicates, the vocabulary size is 1130.

# set() will remove duplicates & sorted will sort the data
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

vocab = {token:integer for integer,token in enumerate(all_words)}


# Iterating & printing the vocabulary by adding an index using enumerate()

# Printing only the first 51 entries
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

Output :

('!', 0) ('"', 1) ("'", 2) ('(', 3) (')', 4) (',', 5) ('--', 6) ('.', 7) (':', 8) (';', 9) ('?', 10) ('A', 11) ('Ah', 12) ('Among', 13) ('And', 14) ('Are', 15) ('Arrt', 16) ('As', 17) ('At', 18) ('Be', 19) ('Begin', 20) ('Burlington', 21) ('But', 22) ('By', 23) ('Carlo', 24) ('Chicago', 25) ('Claude', 26) ('Come', 27) ('Croft', 28) ('Destroyed', 29) ('Devonshire', 30) ('Don', 31) ('Dubarry', 32) ('Emperors', 33) ('Florence', 34) ('For', 35) ('Gallery', 36) ('Gideon', 37) ('Gisburn', 38) ('Gisburns', 39) ('Grafton', 40) ('Greek', 41) ('Grindle', 42) ('Grindles', 43) ('HAD', 44) ('Had', 45) ('Hang', 46) ('Has', 47) ('He', 48) ('Her', 49) ('Hermia', 50)

Now, our vocabulary consists of words and their corresponding token IDs. In the above two steps, we had the encode and decode functionalities separately. The class below has both of these functionalities in one place.

import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

Note:

re.sub(r'\s+([,.?!"()\'])', r'\1', text) finds one or more whitespace characters before a punctuation mark and replaces them with just the punctuation mark, without the whitespace. This is called whitespace normalization before punctuation.

Why we need this: text = " ".join([self.int_to_str[i] for i in ids]) adds a space between every string it joins. Hence we delete that space wherever it is not needed, while retaining the punctuation. Hope it makes sense.

Syntax : 

re.sub(pattern, replacement, string)

Example : "Hello , world ! How are you ?" ==> "Hello, world! How are you?"
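The example above can be checked with a couple of runnable lines (the sample string is the one from the example):

```python
import re

# One or more whitespace characters before a punctuation mark are replaced
# with just the punctuation mark itself (captured as group 1).
joined = "Hello , world ! How are you ?"
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(cleaned)  # Hello, world! How are you?
```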

In practice, we won't implement this tokenization ourselves; LLMs already come with a trained tokenizer. But we should be aware of it and how it works in ML models.

Also, just FYI: here we used some text to create a vocabulary, but in practice every model has its own vocabulary. For example, the vocabulary of the GPT-2 model can be seen at https://huggingface.co/openai-community/gpt2/raw/main/vocab.json, with 50,257 tokens. Similarly, GPT-5.2, the current version of GPT, has a vocabulary of 400,000 tokens!

Important: whatever we type as input to an ML model is looked up, piece by piece, in its vocabulary, and token IDs are assigned accordingly.


# Create an instance of the above class

tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)


The piece of code above tests whether our tokenizer is working. We took a short passage from the-verdict.txt and passed it to encode() to see if it prints the corresponding token IDs. Yes, it worked! Again, if we pass the same token IDs to decode(), it returns the original string. This is how it works inside an ML model.


Let's apply it to a new text sample that is not contained in the vocabulary.

text = "Hello, do you like tea?"
print(tokenizer.encode(text))

We will get a KeyError, as shown in the image below.


If a word is not available in the vocabulary, it should be replaced with a special token. Let's see what that means:
  • <|unk|> - unknown word
  • <|endoftext|> - end of text

Please see the image above: we added both of the above tokens to the vocabulary and printed them for confirmation. Observe that the vocabulary size is 1132 now! (It was 1130 before.)

Now, let's see the code below, where we:

  • Replace unknown words with <|unk|> tokens
  • Replace spaces before the specified punctuation marks


Context refresh!
We are walking through the problems faced while implementing LLMs (GPT models). In the end, we will see how the latest models work by fixing all of the issues we are talking about. So keep in mind that these issues are the path to a strong foundation for understanding LLMs.

Other models introduced additional special tokens like BOS (beginning of sequence), EOS (end of sequence), PAD (padding), etc.


Tokenization Algorithms : 

  • Word Based 
  • Character Based
  • Subword Based


1) Word Based Tokenization Algorithm :

  • Example : This is an Example
  • After converting it into tokens : ["This", "is", "an", "Example"]
  • This is the English language, right? How many unique English words are there? About 2 lakh (200,000). (We need to create a vocabulary, which is why this question matters.)
  • Imagine the vocabulary size if English alone has about 2 lakh words.
  • Remember, GPT-2 (released in 2019) has a vocabulary of about 50k tokens.
  • Does English have only 50k words? No, right? That means they implemented some logic other than word-based tokenization. This is the first problem.
  • If we went with word-based tokenization, the vocabulary size would be too large.
Also, let's consider the words below:
  • Play, Plays, Playing, Played (in word-based tokenization, each word is converted into its own token, right?) But what is the root word of all of these? It's PLAY.
  • Play (root word)
    • Plays
    • Playing
    • Played
  • In this case, word-based tokenization assigns a unique token to each of the above words even though they carry similar meanings. It doesn't recognize the root word. This is the second problem: we are missing the similarity/semantic relation here.
Problems with the word-based algorithm:
  • Huge vocabulary size
  • Missing similarity: the semantic relations between words are not captured, since each word gets a separate token.

Hence, character-based tokenization was introduced.


2) Character Based Tokenization Algorithm :

  • Example : This is an Example
  • After converting it into tokens: ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', 'n', ' ', 'E', 'x', 'a', 'm', 'p', 'l', 'e']
  • Now, since tokens are created per character, the vocabulary size shrinks a lot
  • But we still have the problem of not capturing any semantic meaning between similar words.
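In Python, character-based tokenization is essentially a one-liner (a sketch with the example sentence above):

```python
text = "This is an Example"

tokens = list(text)          # every character (including spaces) is a token
vocab = sorted(set(tokens))  # the vocabulary is just the distinct characters

print(len(tokens))  # 18 tokens
print(len(vocab))   # only 13 unique symbols in the vocabulary
```

Even for this short sentence, 18 tokens map onto only 13 vocabulary entries, which shows why the character-based vocabulary stays so small.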

Hence, subword-based tokenization was introduced.



3) Subword Based Tokenization Algorithm :

  • This is a hybrid model from Word based & Character based tokenization algorithms.

Note :
GPT-2, 3, 4, and 5 use BPE (Byte Pair Encoding), and BPE internally uses the subword-based tokenization algorithm.


BPE (Byte Pair Encoding) was introduced in 1994 and was mainly used for data compression in those days, when compression algorithms were very popular.

Let's understand how the BPE algorithm works, and then see how it is adapted for subword tokenization.

BPE follows two rules :

  • Rule-1 : DO NOT split frequently used words into smaller subwords
  • Rule-2 : Split the rare words into smaller meaningful subwords

Important statement : Most common pair of consecutive bytes of data is replaced with a byte that doesn't occur in the data.


Lets say this is original data : "aaabdaaabac"

  • aa is one pair (1st a, 2nd a)
  • aa is another pair (2nd a, 3rd a)
As per the statement above, we need to replace those common pairs with a new byte.
  • aaabdaaabac (we replace each pair 'aa' with z)
  • zabdzabac (now 'ab' is the most common pair of consecutive bytes - replace it with y)
  • zydzyac (now 'zy' is the most common pair of consecutive bytes - replace it with x)
  • xdxac

The initial data was "aaabdaaabac"; after applying BPE, it is reduced to "xdxac". Compression achieved!
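The compression walk-through above can be sketched in a few lines of Python (a toy illustration of byte-pair merging, not the tokenizer-grade BPE; when two pairs tie in frequency, this sketch may merge them in a different order than the walk-through, but it arrives at the same 5-byte result):

```python
from collections import Counter

def most_common_pair(data):
    # Count every pair of consecutive symbols and return the most frequent one
    pairs = Counter(zip(data, data[1:]))
    return max(pairs, key=pairs.get)

def merge(data, pair, new_symbol):
    # Replace non-overlapping occurrences of `pair` with `new_symbol`
    out, i = [], 0
    while i < len(data):
        if i < len(data) - 1 and (data[i], data[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out

data = list("aaabdaaabac")
for new_symbol in "zyx":  # replacement bytes that don't occur in the data
    data = merge(data, most_common_pair(data), new_symbol)

print("".join(data))  # xdxac
```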

Now, let's assume the dataset of words below: {word : no_of_times_repeated}

{"old" : 7, "older" : 3, "finest" : 9, "lowest" : 4}

Preprocessing technique: to mark the end of each word, we append '</w>', which means end of word.

Dataset = {"old</w>" : 7, "older</w>" : 3, "finest</w>" : 9, "lowest</w>" : 4}

(We will see why we add </w> by the end of this example.)

==> Let's split the words into characters and count their frequencies.


Frequency Table :


Now, apply BPE to the above dataset, customized with the two rules (Rule-1, Rule-2).

Understand: we merge the most common pair in the dataset to form a new symbol, and when we form it, we subtract those characters' counts from the frequency table. Let's see how we do it.

  • "es" is the top common pair; assume that after merging, "es" becomes X, repeated 13 times
  • "est" (X + "t") is the next common pair; assume "est" becomes Y, repeated 13 times
  • "est</w>" is the next common pair, also repeated 13 times
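The pair counting behind this can be sketched as follows (using a frequency of 4 for "lowest", which is what makes the "es" count come out to 13):

```python
from collections import Counter

# Toy sketch: count consecutive-symbol pairs, weighted by word frequency.
# '</w>' is treated as a single end-of-word symbol.
dataset = {"old</w>": 7, "older</w>": 3, "finest</w>": 9, "lowest</w>": 4}

pair_counts = Counter()
for word, freq in dataset.items():
    symbols = list(word.replace("</w>", "")) + ["</w>"]
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

# The most frequent pairs are ('e','s'), ('s','t') and ('t','</w>'), 13 each
print(pair_counts.most_common(3))
```

Those three pairs are exactly the ones that get merged into "es", "est", and "est</w>" in the walk-through above.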



Most possible subwords at the end :


This is called Subword Tokenization! It is the tokenization technique used in current models.


Remember we added </w> at the end of word, that's because :

  • Assume two words: estimate, highest
  • If we don't add </w>, then during merging we can't tell whether 'est' occurs as a prefix (estimate) or a suffix (highest)
  • That's why we added </w> during pre-processing: to distinguish prefix from suffix


We have a Python package called tiktoken, which we can use for tokenization in LLMs; it internally uses the BPE algorithm.

! pip3 install tiktoken


import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

# Testing with some sample data

tokenizer = tiktoken.get_encoding("gpt2")

text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

# Decoding back to the actual data

strings = tokenizer.decode(integers)
print(strings)


# One more example

integers = tokenizer.encode("Akwirw ier")
print(integers)
strings = tokenizer.decode(integers)
print(strings)


Using the tiktoken library:

! pip install tiktoken

import tiktoken

# Text to encode and decode
text = "The lion roams in the jungle"

# 1. GPT-2 Encoding/Decoding, using the "gpt2" encoding
tokenizer_gpt2 = tiktoken.get_encoding("gpt2")  # We need to specify the model

# Encode: text -> list of token IDs
token_ids_gpt2 = tokenizer_gpt2.encode(text)

# Decode: list of token IDs -> original text (just to verify correctness)
decoded_text_gpt2 = tokenizer_gpt2.decode(token_ids_gpt2)

# We can also get each token string by decoding the IDs one by one
tokens_gpt2 = [tokenizer_gpt2.decode([tid]) for tid in token_ids_gpt2]

print("=== GPT-2 Encoding ===")
print("Original Text: ", text)
print("Token IDs:     ", token_ids_gpt2)
print("Tokens:        ", tokens_gpt2)
print("Decoded Text:  ", decoded_text_gpt2)
print()

Output :

=== GPT-2 Encoding ===
Original Text:  The lion roams in the jungle
Token IDs:      [464, 18744, 686, 4105, 287, 262, 20712]
Tokens:         ['The', ' lion', ' ro', 'ams', ' in', ' the', ' jungle']
Decoded Text:   The lion roams in the jungle


This is called TOKENIZATION. 

We don't use these techniques directly, but we should understand these internal concepts before stepping into the material in future blogs. We simply call an LLM and it does all of these calculations internally. It is good to have deep knowledge of the internals before moving to the actual stuff, so do not panic at the moment. Save it for later 😁


Please find the code from following GitHub location, it has more test cases with other models as well : https://github.com/amathe1/LLMs/blob/main/Tokenization_in_LLM.ipynb





Data Sampling with Sliding window & Input, Output pairs
  • Input, Output pairs
  • Data Sampling with sliding window

From the data above, the target variable is Sal; Age and Exp are the independent variables. The expectation is that after training, if we give the model a person's age and experience, it will predict that person's salary.

For example, "Deep Learning is Powerful" is the input. If we give this data to a decoder model, it should predict the next word.
  • We should convert the data into a form that a neural network can understand.

LLMs are nothing but neural networks. If our data is sequential text like the above, we should prepare it as input-output pairs. This is how textual data must be prepared, and it happens after Tokenization.

If you don't prepare the data in this format, you can't train the NN model.



Data Sampling & Sliding window :

Imagine I have a sentence containing tokens numbered 0 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 9999.
Now, what happens if I give this data to a GPT model at once? This is where the context window comes in: the maximum number of tokens a GPT model can accept. Assume my GPT model allows only 1000 tokens, but we are giving it 9999 tokens at once. This is where Data Sampling comes in: dividing the data into CHUNKS.

0 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 1000
1001 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2000
2001 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 3000
3001 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 4000
4001 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 5000
5001 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 6000
6001 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 7000
7001 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 8000
8001 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 9000
9001 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 9999

Another important concept to keep in mind is the Sliding Window. Consider this carefully: suppose the actual meaning of the context continues up to token 1005, but as per the equal-sized "chunks" shown above, the first chunk ends at token 1000. Then we lose the context, right? Because we need the next 5 tokens as well to form a meaningful context.

Now, let's see the technique to preserve meaningful context: instead of dividing the chunks with no overlap, divide them as shown below. We keep some overlap between chunks to maintain context.

0 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 1000
800_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2000
1800 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 3000
2800 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 4000
3800 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 5000
4800 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 6000
5800 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 7000
6800 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 8000
7800 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 9000
8800 _ _ _ _ _ _ _ _ _ _ _ _ _ _ 9999
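The overlapping chunking above can be sketched as follows (chunk size 1000 with a 200-token overlap, so each chunk starts 800 tokens after the previous one; all numbers are illustrative):

```python
# Stand-in for a 10,000-token input: tokens numbered 0 .. 9999
token_ids = list(range(10000))
chunk_size = 1000
overlap = 200
stride = chunk_size - overlap  # each chunk starts 800 tokens after the last

chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), stride)]

print(len(chunks))                  # number of overlapping chunks
print(chunks[0][0], chunks[0][-1])  # first chunk covers 0 .. 999
print(chunks[1][0], chunks[1][-1])  # second chunk covers 800 .. 1799
```

Each chunk shares its first 200 tokens with the end of the previous chunk, which is what preserves context across chunk boundaries.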


The image below shows context 1, context 2, and an overlap:


Hence :
  • Data Sampling is required to handle the Context Window (***)
  • Sliding Window is required to retain meaningful context (***)
 
(***) For the latest GPT model, GPT-5.2, the context window is 400,000 tokens, while in our example the input is only 9999 tokens. The main thing to understand is that even though such an input could go in at once, we would risk overfitting: there would be no learning, just memorization. If the model has to find the hidden patterns in the given data, WE HAVE TO CHUNK AND SLIDE.

We have to chunk the data and apply the sliding-window technique, even if your input fits within the context window.

Note, Data Sampling will happen AFTER tokenization but BEFORE embeddings. 


In this section we implement a data loader that fetches the input-target pairs using a sliding window approach. 

To get started, we will first tokenize the whole The Verdict short story we worked with earlier using the BPE tokenizer introduced in the previous section:

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

o/p : 5145

Executing the code above will return 5145, the total number of tokens in the training set, after applying the BPE tokenizer.

Next, we remove the first 50 tokens from the dataset for demonstration purposes as it results in a slightly more interesting text passage in the next steps:

enc_sample = enc_text[50:]


The context size determines how many tokens are included in the input

# The context_size of 4 means that the model is trained to look at a sequence
# of 4 tokens to predict the next token in the sequence.
# The input x is the first 4 tokens [1, 2, 3, 4], and the target y is the
# next 4 tokens [2, 3, 4, 5].
context_size = 4  # length of the input

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y: {y}")

Output :
x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

Output :
[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257

Everything left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.

For illustration purposes, let's repeat the previous code but convert the token IDs into text:

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a

We've now created the input-target pairs that we can use for LLM training in upcoming chapters.


These are the input-output pairs; an LLM expects its training data in this form.




See the images above for the concepts of sliding-window size (1 token) and context-window size (4 tokens), and check how the inputs and targets are segregated.

Programmatically, the sliding window is called the Stride.

In the above example, context window size is 4 and stride is 1.


There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.


In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.


Implementing a Data Loader :

For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes.

Step 1: Tokenize the entire text

Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length

Step 3: Return the total number of rows in the dataset

Step 4: Return a single row from the dataset


import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]



The GPTDatasetV1 class in listing 2.5 is based on the PyTorch Dataset class.

It defines how individual rows are fetched from the dataset.

Each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor.

The target_chunk tensor contains the corresponding targets.

I recommend reading on to see what the data returned from this dataset looks like when we combine the dataset with a PyTorch DataLoader -- this will bring additional intuition and clarity.




The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:


Step 1: Initialize the tokenizer

Step 2: Create dataset

Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training

Step 4: The number of CPU processes to use for preprocessing


def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4.

This will develop an intuition of how the GPTDatasetV1 class and the create_dataloader_v1 function work together:


with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()




Convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function:


import torch

print("PyTorch version:", torch.__version__)

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)


Output :

PyTorch version: 2.6.0+cu124
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]



The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.

Since the max_length is set to 4, each of the two tensors contains 4 token IDs.

Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.


To illustrate the meaning of stride=1, let's fetch another batch from this dataset:


second_batch = next(data_iter)
print(second_batch)

# Output :

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]



If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch.

For example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input.

The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach


Batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes. If you have previous experience with deep learning, you may know that small batch sizes require less memory during training but lead to more noisy model updates.

Just like in regular deep learning, the batch size is a trade-off and hyperparameter to experiment with when training LLMs.


Before we move on to the two final sections of this chapter that are focused on creating the embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1:


dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)


# Output :

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])

Note that we increased the stride to 4. This utilizes the dataset fully (we don't skip a single word) while also avoiding any overlap between the batches, since more overlap could lead to increased overfitting.



We have seen below topics so far :

  • Data Sampling
  • Context window, Sliding window/Stride
  • Input, Output pairs
Pending topics in this first part of implementing an LLM are:
  • Token Embedding
  • Positional Embedding/Encoding
  • Input Embeddings


Understanding Embedding Concept (Very Important to understand, read carefully!)

Context refresh! Main idea of decoder model is to predict next word. 

But after generating input-output pairs, data sampling, etc., the data is still in the form of tokens, isn't it? And it is not carrying any meaning.

Consider below example of data : Cat, Book, Tablet, Kitten, Dog, Puppy

The idea to understand: though we have semantic relationships in the data (e.g., Dog vs. Puppy above), the model is not carrying that information yet, as the words are still in token format.


How do we carry the semantic relations in the data into an ML model/LLM?

The answer is word2vec (word embeddings).


One more example: Dog and Cat come under animals, AND Apple and Banana come under fruits.

As humans, we know their categories, but how does a computer know? The answer to this question is attributes, or input features.

  • A dog has a tail; it barks
  • A cat also has a tail, has 4 legs, makes sounds, etc.
  • A banana is edible, etc.
See the image below for more categories, to get some idea about attributes:

And these are vector of values(see below image) :

Using the above vectors of values, we can understand the similarity of inputs by looking at their attributes. For example, Dog and Cat have similar features: if you compare the vector values for the same attributes, they are almost similar in value.

Just look at the has_a_tail value of Dog & Cat: 35, 32 (almost similar, isn't it?). But for the same attribute, look at the values of Apple and Banana: they are not even close to those (though Apple's and Banana's values are similar to each other, as they are both fruits). This is how we carry the semantic relations of the input data into our ML model.

  • Wherever the Cat value is high, the Dog value is also high
  • Wherever the Cat value is low, the Dog value is also low
  • And vice versa for Apple & Banana as well

This means tokens of similar data end up close to each other; the vectors capture semantic meaning. Converting our tokens into such vectors is nothing but Embedding.



Main question: HOW ARE THESE VECTOR VALUES INITIALIZED?

Answer: by a NEURAL NETWORK.


Let's quickly revise neural networks:

  • A NN has input, hidden, and output layers, and each neuron is connected to every neuron in the next layer.
  • These connections are called weights. Initially, they are random values.
  • As we move through iterations, by repeating forward propagation, loss calculation, back propagation, and weight adjustment, all of these random weights are adjusted until they reach a state where they accurately model the input data. Right?

Similarly, during the embedding process, random values are assigned to these vectors and passed through a NN; as the NN iterates, the vector values are adjusted until they represent the semantic meaning of the input data for the corresponding attributes. Re-read this as many times as it takes until the embedding process clicks; it is worth understanding this concept.


Very important statement : 

These embedding weight vectors are random values at first. This initialization serves as the starting point of the LLM's learning process. Later, we will optimize these weights as part of LLM training.
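To make this concrete, here is a tiny pure-Python sketch of an embedding layer as a lookup table of randomly initialized vectors (the sizes are illustrative; a real model would use something like torch.nn.Embedding and train these values rather than leave them random):

```python
import random

random.seed(42)

# Illustrative sizes: a 6-token vocabulary, 3-dimensional vectors
vocab_size, embed_dim = 6, 3

# One randomly initialized vector per token id (the "embedding table");
# training would later adjust these values
embedding_table = [[random.uniform(-1, 1) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    # Embedding is a pure lookup: token id -> its vector
    return [embedding_table[t] for t in token_ids]

vectors = embed([0, 4, 2])
print(len(vectors), len(vectors[0]))  # 3 tokens, each mapped to a 3-dim vector
```

The key point: there is no computation in the lookup itself; the "learning" is entirely in how training nudges the table's values.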


I will be adding Positional Embedding/Encoding to this same blog! The concept is similar, but positional embeddings carry the position of each word/token (instead of the semantic relation). Finally, to form the Input Embeddings, the Token Embeddings and Positional Embeddings are added together to create a common representation of the input data, which is then passed to our GPT model!


That's all about Tokenization! I will add Positional & Input Embeddings to this blog in a day or two.



Thank you for reading this blog !

Arun Mathe
