LLM fine-tuning is all about taking a pre-trained LLM and training it further on your own domain- or task-specific data so it becomes specialized for your use case.
In fine-tuning, the model may not have seen this kind of complex data during pre-training, so we tune it on that data. We are not changing the model here: we take the same model, which currently gives lower accuracy on our task, and train it further on this complex data to improve its accuracy.
Generally, prompt engineering relies on the LLM's existing knowledge, and we use RAG to get better accuracy on our project-specific data. If the problem goes beyond what these can solve, we end up fine-tuning the model itself. Domains such as drug discovery and oil and gas use model fine-tuning because their data is very rare.
What if model fine-tuning is still not enough? Then we obviously have to develop a new ML model.
Sometimes, for some use cases, a combination of fine-tuning + prompt engineering + RAG + agentic AI orchestration helps achieve the expected accuracy.
Fine Tune Decision Framework :
Please observe the above example carefully to understand when to go for fine-tuning.
So far, we have talked about what fine-tuning is and when to go for it. Let us dive deeper into it.
Understand what happens when we update model weights :
Model weights shift from general knowledge towards task expertise.
How does it work?
- Feed an instruction into the model (an example from the training dataset)
- Model generates a response (token by token)
- Compare generated tokens with desired response.
- Compute cross-entropy loss and backpropagate
- Update weights to make the desired response more likely
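The steps above can be sketched with a toy example. This is a minimal illustration and not a real LLM training loop: it assumes a vocabulary of only four tokens and treats the logits for a single position as the trainable weights. The helper names (`softmax`, `cross_entropy`, `sgd_step`) and the learning rate are made up for this sketch.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Loss is small when the desired token gets high probability."""
    return -math.log(softmax(logits)[target])

def sgd_step(logits, target, lr=0.5):
    """One gradient step. For softmax + cross-entropy the gradient
    with respect to the logits is (probabilities - one_hot(target))."""
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

# Toy setup (assumption): 4-token vocabulary, desired token is index 2.
logits = [0.0, 0.0, 0.0, 0.0]
target = 2

loss_before = cross_entropy(logits, target)
for _ in range(10):
    logits = sgd_step(logits, target)  # update "weights"
loss_after = cross_entropy(logits, target)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

After a few updates the loss drops and the desired token becomes the most likely one. Real fine-tuning does the same thing, but across every token position and across millions or billions of weights.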
Fine-Tuning Techniques :
PEFT - Parameter Efficient Fine Tuning - How many parameters do we actually need to update ?
- Instead of changing everything, we will change certain things in the model to get expected accuracy. This is the basis for PEFT.
- Small GPUs are enough for fine-tuning at this basic level, even on a local laptop
As we can see in the above image, full fine-tuning may not always be feasible for small and mid-range companies. Even a small model with 7B parameters needs about 112 GB of GPU memory, so just think about the latest frontier models such as Claude, which are believed to be far larger (their exact sizes are not published). Hence we ended up with the techniques below.
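Here is one way to arrive at the 112 GB figure. It is a rough sketch that assumes mixed-precision training with the Adam optimizer, which costs about 16 bytes per parameter: 2 bytes for the weights, 2 for the gradients, and 12 for the optimizer's full-precision copy and its two moment estimates. This breakdown is my assumption, not something stated in the image, and it ignores activation memory. The function name is made up for this example.

```python
def full_finetune_memory_gb(num_params,
                            bytes_weights=2,     # fp16/bf16 weights
                            bytes_grads=2,       # fp16/bf16 gradients
                            bytes_optimizer=12): # fp32 master copy + 2 Adam moments
    """Rough GPU memory needed for full fine-tuning, excluding activations."""
    bytes_per_param = bytes_weights + bytes_grads + bytes_optimizer
    return num_params * bytes_per_param / 1e9

memory_7b = full_finetune_memory_gb(7_000_000_000)
print(memory_7b)  # 112.0
```

The estimate grows linearly with the parameter count, which is why updating every weight quickly becomes impractical on modest hardware.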
LoRA: Low-Rank Adaptation
The Key Insight - Weight updates during fine-tuning tend to be LOW-RANK, so we can decompose them into smaller matrices.
LoRA is a parameter-efficient fine-tuning technique that freezes the base LLM weights and trains only LOW-RANK adapter matrices to reduce training cost and GPU memory.
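To make the idea concrete, here is a small sketch in plain Python. It assumes the usual LoRA formulation where the frozen weight `W` (size d x k) is adapted as `W + (alpha / r) * B @ A`, with `B` of size d x r and `A` of size r x k. The layer size 4096 x 4096, the rank 8, and the helper names are illustrative assumptions and do not come from any specific model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w, b, a, alpha, r):
    """Frozen weight plus the scaled low-rank update: W + (alpha / r) * B @ A."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_counts(d, k, r):
    """Trainable parameters: full update (d*k) vs LoRA adapters (r*(d+k))."""
    return d * k, r * (d + k)

# Tiny 3x3 example with rank r = 1.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight
a = [[1.0, 2.0, 3.0]]                                    # r x k, trainable
b_init = [[0.0], [0.0], [0.0]]                           # d x r, starts at zero
b_trained = [[1.0], [0.0], [0.0]]                        # pretend result of training

at_start = lora_effective_weight(w, b_init, a, alpha=2, r=1)
after = lora_effective_weight(w, b_trained, a, alpha=2, r=1)

full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)  # 16777216 65536 256
```

Because `B` starts at zero, the adapted model behaves exactly like the base model before training begins. For the assumed 4096 x 4096 layer, the adapters hold 256 times fewer trainable parameters than a full update would.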
QLoRA: Quantization + LoRA
QLoRA extends LoRA by combining low-rank adapters with quantized (typically 4-bit) base model weights, enabling efficient fine-tuning of LLMs on LOW-memory GPUs.
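The sketch below shows the quantization half of the idea. Note the simplification: it uses plain symmetric 4-bit rounding with one scale per block, which is my stand-in for illustration. QLoRA itself uses a more refined 4-bit data type (NF4), so treat this as a rough picture and not the actual algorithm. The function names and sample weights are made up.

```python
def quantize_4bit(block):
    """Map floats to integer codes in [-7, 7] using one scale per block."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(x / scale))) for x in block]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats for use in the forward pass."""
    return [c * scale for c in codes]

def weight_memory_gb(num_params, bits):
    """Memory needed just to store the weights at a given bit width."""
    return num_params * bits / 8 / 1e9

weights = [0.7, -0.3, 0.12, 0.0]        # a tiny block of frozen base weights
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)     # close to, but not exactly, the originals

print(weight_memory_gb(7_000_000_000, 16))  # 14.0 GB in 16-bit
print(weight_memory_gb(7_000_000_000, 4))   # 3.5 GB in 4-bit
```

The frozen base weights are stored in the compact 4-bit form and dequantized on the fly, while only the small LoRA adapter matrices are trained in higher precision.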
Thank you for reading this blog !
Arun Mathe