(AI Blog #10) LLM: Data Preparation & Sampling - Tokenization & Embedding Explained

Coding an LLM involves three main stages:

1. Implement the data sampling pipeline and understand the basic mechanisms.
2. Pre-train the LLM on unlabelled data to obtain a foundation model for further fine-tuning.
3. Fine-tune the pre-trained LLM to create a classifier, or a personal assistant (chat) model.

Implementing an end-to-end LLM across these three stages involves the following steps:

1. Data preparation & sampling (covered in this blog)
2. Attention mechanism
3. LLM architecture
4. Pre-training
5. Training loop
6. Model evaluation
7. Loading pre-trained weights
8. Fine-tuning to create a classification model
9. Fine-tuning to create a personal assistant or chat model

This blog discusses the first step: data preparation & sampling.

Data Preparation & Sampling:

Data preparation & sampling produces the input to the LLM. It includes the following steps:

1. Tokenization
2. Input-output pairs
3. Token embeddings
4. Positional encodings (positional embeddings)
5. Input embeddings

...
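The data preparation steps above can be sketched end to end. This is a minimal, illustrative NumPy sketch: a naive whitespace tokenizer stands in for a real BPE tokenizer (e.g. tiktoken), and random matrices stand in for trainable embedding layers; the corpus, context length, and embedding dimension are arbitrary choices for the example.

```python
import numpy as np

# Toy corpus; a real LLM would run BPE tokenization over a large dataset.
text = "the quick brown fox jumps over the lazy dog"

# Step 1: Tokenization -- naive whitespace split plus a word-to-id vocabulary.
words = text.split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
token_ids = [vocab[w] for w in words]

# Step 2: Input-output pairs via a sliding window.
# The target sequence is the input shifted one position to the right.
context_len = 4
inputs, targets = [], []
for i in range(len(token_ids) - context_len):
    inputs.append(token_ids[i : i + context_len])
    targets.append(token_ids[i + 1 : i + context_len + 1])
inputs, targets = np.array(inputs), np.array(targets)

# Step 3: Token embeddings -- a lookup table indexed by token id
# (randomly initialized here; trainable in a real model).
embed_dim = 8
rng = np.random.default_rng(0)
token_emb_table = rng.normal(size=(len(vocab), embed_dim))
token_embeddings = token_emb_table[inputs]   # shape: (batch, context_len, embed_dim)

# Step 4: Positional embeddings -- one vector per position in the context window.
pos_emb_table = rng.normal(size=(context_len, embed_dim))

# Step 5: Input embeddings = token embeddings + positional embeddings
# (positional table broadcasts across the batch dimension).
input_embeddings = token_embeddings + pos_emb_table

print(inputs.shape, input_embeddings.shape)
```

The sum in the last step is what actually feeds the attention layers: without the positional term, the model would see the same representation for a token regardless of where it appears in the context window.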