We are going to discuss Normalization and Optimizers, which are used across AI: in LLMs, Agentic AI frameworks, and more.
As we discussed in our previous blogs (links to the previous blogs are at the end of this blog), when gradients become very small during training, the new weights stay almost identical to the previous weights, a problem called Vanishing Gradient. Similarly, when gradients become very large, the weight updates blow up, and we land in the Exploding Gradient issue.
Using Normalization, we are going to prevent both the Vanishing and Exploding Gradient problems.
This blog's agenda :
- Normalization
- Batch Normalization (Useful in Neural Network)
- Layer Normalization (Useful in LLM's)
- Weights Initialization
- Xavier
- He
- EWMA (Exponential Weighted Moving Average)
- We will use it in Optimizers like Momentum, NAG, Adagrad, RMSProp, Adam
Normalization
Normalization in Neural Networks and Deep Learning simply means rescaling values so that training becomes stable, faster, and more reliable.
Why is Normalization needed ?
Imagine training a model where
- One input/feature ranges from 0-1
- Another input/feature ranges from 0-10,00,000
In such a scenario, Gradient Descent behaves like a person walking on uneven terrain:
- Small scale features --> Tiny steps --> Slow learning
- Large scale features --> Huge steps --> Overshooting
Normalization flattens the terrain, so optimization becomes smooth.
What exactly gets Normalized ?
Normalization can happen at three levels
- Data/Input Normalization
- Network Level Normalization
- Weight Normalization (Less common)
Let us consider the below Neural Network, which takes Age and Salary as inputs and has 3 neurons in the hidden layer. We would like to predict the Salary of a person with Age 32. Note that this is a Regression problem.
When we directly feed these values into the NN, it has no intelligence to treat the first input as Age and the second as Salary. Isn't it ? How will it know that 42 is an Age and 3,50,000 is a Salary ? It just sees numbers. Observe the difference between those numbers; it is huge (3,50,000 − 42). When we calculate the derivatives with respect to Age and Salary, the gap between the two would be very large, and those derivatives would be unstable. Simplified :
- d/dx(Age) = 0.002
- d/dx(Salary) = 10000.20
Observe carefully : if a derivative value is too small, it leads to the Vanishing Gradient problem, and if it is too large, it leads to the Exploding Gradient problem. We have to avoid both extremes, meaning values should be neither too small nor too large. Note that we landed in these issues because of our input values. Correct ?
For this reason, we need to Normalize our values/inputs/features. Normalization simply means bringing all inputs into roughly the same range of values (not exactly the same, but close) so that every neuron in the NN sees comparably scaled numbers.
Techniques in Normalization :
- Data Level Normalization
- Network Level Normalization
Data Level Normalization
This technique is applied before training. We first normalize the data, and only then use that data for training, so the network never sees the raw, unscaled values.
Again, we have 2 types in it
- Min Max Scaling
- Range[0, 1]
- Formula : x_scaled = (x - x_min) / (x_max - x_min)
Where x = Original data value
x_min = minimum value in the feature
x_max = maximum value in the feature
x_scaled = normalized value range[0, 1]
Example : If dataset = [1, 2, 3, 400, 10]
Let's calculate for the value 2 : x_scaled = (2 − 1) / (400 − 1) = 1/399 ≈ 0.0025
Note : if the scaled value is too small, we can land in the Vanishing Gradient problem. Hence we should not use Min-Max Scaling if the data has outliers; here the outlier 400 squashes all the other values toward 0.
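The min-max computation above can be sketched in a few lines of plain Python (the function name `min_max_scale` is my own, for illustration):

```python
# Min-Max scaling: rescale each value into [0, 1].
def min_max_scale(values):
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

data = [1, 2, 3, 400, 10]
scaled = min_max_scale(data)
# The value 2 maps to (2 - 1) / (400 - 1) = 1/399 ≈ 0.0025;
# everything except the outlier 400 gets squashed near 0.
```

Running this on the example dataset makes the outlier problem visible: four of the five scaled values end up below 0.03.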
- Standardization (or) Z-Score
- Range : unbounded, but values are centered at 0 with unit variance (most fall roughly within [−3, 3])
- Formula : x_scaled = (x − μ) / σ
μ(mu) = Mean of the feature values
σ(sigma) = Standard deviation of the feature values
Example :
If dataset = [2, 4, 6, 8]
μ = (2+4+6+8)/4 = 5
σ = √[ (1 / n) Σ (xᵢ − μ)² ], where n is no of values, μ is mean (as calculated above)
so, σ = √[ ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4 ] = √[ (9 + 1 + 1 + 9) / 4 ] = √5 = 2.236
After normalization, the expectation is mean = 0 and variance = 1.
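The z-score steps above can be sketched as (illustrative function name of my own):

```python
import math

# Z-score standardization: subtract the mean, divide by the standard deviation.
def standardize(values):
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return [(x - mu) / sigma for x in values]

z = standardize([2, 4, 6, 8])
# mu = 5, sigma = sqrt(5) ≈ 2.236, so 2 maps to (2 - 5) / 2.236 ≈ -1.34
```

After the call, the output list has mean 0 and variance 1, exactly as expected above.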
Network Level Normalization
This technique is applied during training. We have 2 types :
- Batch Normalization
- Each layer's output becomes the input to the next layer, and the expectation is that training stays stable. Hence we use Batch Normalization between layers during training
- Observe the below image to understand how Batch Normalization works
- Here as well, mean = 0 & variance = 1
- It happens in every layer, between one layer and the next
- Flow : Input → Batch Normalization → then apply the activation function
- Batch Normalization normalizes the activations of each layer during training; mathematically it resembles the z-score, but computed dynamically per mini-batch
- Layer Normalization
- We will discuss it when we get to LLMs
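The Batch Normalization computation for a single neuron over one mini-batch can be sketched like this (a simplified illustration; `gamma`, `beta`, and `eps` are the usual learnable scale, learnable shift, and numerical-stability parameters, shown here with default values):

```python
import math

# Batch normalization over one mini-batch of activations for one neuron.
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    n = len(batch)
    mu = sum(batch) / n
    var = sum((z - mu) ** 2 for z in batch) / n
    # Normalize to mean 0 / variance 1, then scale and shift.
    return [gamma * (z - mu) / math.sqrt(var + eps) + beta for z in batch]

normalized = batch_norm([10.0, 20.0, 30.0, 40.0])
# normalized now has mean ≈ 0 and variance ≈ 1
```

In a real framework the batch statistics are computed per feature across the whole mini-batch, and running averages are kept for inference; this sketch shows only the core math.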
So far, we have normalized the input data. Now, let's look at weight initialization techniques.
Weight Initialization Techniques
We need to properly initialize weights; otherwise we will again end up with Vanishing/Exploding Gradient issues.
Generally, W_new = W_old − η · ( ∂L / ∂W_old )
and Z = w1x1+w2x2+...+b1
Techniques in weight initialization :
- Xavier initialization
- Choose weights so the signal stays at the same strength from layer to layer
- Whenever we use the Tanh or Sigmoid activation function, the recommended weight initialization technique is "Xavier" initialization
Rules :
- How many inputs come in (fan-in)
- How many outputs go out (fan-out)
- Pick weights in a balanced way
Assume we have 4 inputs x = [1, 1, 1, 1], and we initially assign random weights w = [2, 2, 2, 2].
Linear transformation : Z = 1·2 + 1·2 + 1·2 + 1·2 = 8 (w1x1 + w2x2 + w3x3 + w4x4, ignoring bias). Then apply the ReLU activation function, max(0, Z), which gives 8. If we repeat this process over several layers, this value grows large and leads to the exploding gradient problem.
Now, Xavier proposed : if the value is exploding, set Var(w) = 1/n, where n is the number of inputs to the layer.
Just remember : if we apply Xavier initialization, weights stay balanced across all layers, the output is controlled, and training is stable.
- He initialization
- Whenever we use the ReLU activation function, the recommended weight initialization technique is "He" initialization
- Var(w) = 2/n
- Weight scale = √(2/n)
- Basically, ReLU zeroes out roughly half of the activations; after passing through multiple NN layers, the signal keeps shrinking and can become 0 or close to 0. He initialization compensates for this by doubling the variance.
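Both initialization rules can be sketched as follows (function names are my own; weights are drawn from a normal distribution whose variance follows each rule):

```python
import math
import random

# Xavier initialization: Var(w) = 1/n_in, suited to Tanh/Sigmoid layers.
def xavier_init(n_in, n_out, seed=0):
    rng = random.Random(seed)
    std = math.sqrt(1.0 / n_in)
    return [[rng.gauss(0, std) for _ in range(n_in)] for _ in range(n_out)]

# He initialization: Var(w) = 2/n_in, suited to ReLU layers.
def he_init(n_in, n_out, seed=0):
    rng = random.Random(seed)
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0, std) for _ in range(n_in)] for _ in range(n_out)]
```

With a large fan-in (say 1000 inputs), the empirical variance of the sampled weights lands close to 1/n for Xavier and 2/n for He, which is exactly the "balanced signal" property described above.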
We need to understand Optimizers now, and before that, let's see what EWMA is. These optimizers act like boosters for gradient descent.
EWMA (Exponential Weighted Moving Average)
- More preference to new gradients and less preference to old gradients
- Let's say we are predicting tomorrow's temperature. In scenarios like this, we give more weight to recent temperature readings than to the temperature from 10 days back. Isn't it ? EWMA works well with time-series data.
- This concept will be helpful in optimizers
- It smoothens the zig-zag nature of gradient descent on its way to the Global Minimum. It reduces training time and helps reach the Global Minimum quickly.
Formula : vₜ = β · vₜ₋₁ + (1 − β) · xₜ
where vₜ is the EWMA at time t, xₜ is the current value, and β is the smoothing factor (a hyperparameter, typically around 0.9)
We will use it in Optimizers like Momentum, NAG, Adagrad, RMSProp and Adam.
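A minimal EWMA sketch following the formula above (illustrative function name):

```python
# EWMA: v_t = beta * v_{t-1} + (1 - beta) * x_t
# beta close to 1 -> smoother curve, more weight to history.
def ewma(values, beta=0.9):
    v = 0.0
    out = []
    for x in values:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return out

temps = [30, 32, 31, 29, 35]
smoothed = ewma(temps)
# First value: 0.9*0 + 0.1*30 = 3.0; second: 0.9*3.0 + 0.1*32 = 5.9
```

Note that starting from v = 0 biases the early values toward zero; this is exactly the bias that Adam later fixes with its bias-correction step.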
Optimizers
An Optimizer is an algorithm that updates the model's weights and biases so that the neural network learns from data.
Mathematical view : W_new = W_old − η · ( ∂L / ∂W_old )
Where η = learning rate,
∂L / ∂W_old = derivative of Loss with respect to old weight
The Optimizer decides how exactly this update is done.
Optimizers are classified based on :
- Speed/Velocity : Momentum, NAG
- Learning rate : Adagrad, RMSProp
There is another, more advanced optimizer called Adam, which controls both speed and learning rate; it is the recommended choice for production systems.
Momentum : Instead of using only the current gradient, use a moving average of past gradients.
Formula for GD (Gradient Descent) is, W_new = W_old − η · ( ∂L / ∂W_old )
The first problem with GD is its zig-zag nature, which makes it hard to reach the Global Minimum in the expected time. It simply takes too long to reach the goal.
The second problem with GD is that it doesn't remember old gradient information once an iteration completes. Why do we need that history ? Based on past gradients, the optimizer can adjust its direction while travelling towards the Global Minimum. If you don't know where you came from, you may step back to the same place again, which is a clear problem.
Now, if we inject EWMA (Exponential Weighted Moving Average) into GD, it remembers the older gradients as well, which helps the model converge.
For example, consider time-series data of daily temperatures : EWMA gives more preference to the latest values and less to older ones, BUT it still carries a memory of all the previous values.
Time series data : t1, t2, t3, t4, ......t10, t11
Formula : vₜ = β · vₜ₋₁ + ∂L/∂Wₜ, then W_new = W_old − η · vₜ (the velocity vₜ is an EWMA of past gradients)
But Momentum can overshoot, since the accumulated velocity carries it past the minimum, and it takes a lot of oscillations to settle.
- GD most of the time lands in a Local Minimum
- Momentum has more oscillations due to its speed, but it reaches the Global Minimum
- Drawbacks of Momentum :
- Overshooting
- Lots of oscillations
- Time-consuming
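A toy sketch of the Momentum update on the 1-D loss L(w) = w², whose gradient is 2w (the learning rate, β, and step count are illustrative):

```python
# Momentum: the velocity v accumulates past gradients, so the
# optimizer keeps moving in a consistent direction.
def momentum_descent(w, lr=0.1, beta=0.9, steps=100):
    v = 0.0
    for _ in range(steps):
        grad = 2 * w            # dL/dw for L = w^2
        v = beta * v + grad     # velocity remembers past gradients
        w = w - lr * v
    return w

w_final = momentum_descent(5.0)
# w overshoots past 0 a few times, then oscillates down toward the minimum
```

Printing w at every step shows the overshoot-and-oscillate behavior described above before it settles near 0.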
NAG (Nesterov Accelerated Gradient) : NAG is an improved version of Momentum-based GD. It has a look-ahead point, as shown in the below image, which helps tame the speed of Momentum.
- Instead of computing the gradient at the current position, NAG looks ahead to where Momentum is about to take you, and then corrects that step.
- Observe the difference in steps between Momentum and NAG : Momentum crosses the Global Minimum at speed and oscillates around it, while NAG controls that speed. NAG still oscillates, but noticeably less than Momentum.
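The look-ahead idea can be sketched on the same toy loss L(w) = w² (parameter values are illustrative):

```python
# NAG: evaluate the gradient at the look-ahead point (where momentum
# is about to take us), then use it to correct the step.
def nag_descent(w, lr=0.1, beta=0.9, steps=100):
    v = 0.0
    for _ in range(steps):
        lookahead = w - lr * beta * v   # peek ahead along the velocity
        grad = 2 * lookahead            # gradient at the look-ahead point
        v = beta * v + grad
        w = w - lr * v
    return w

w_final = nag_descent(5.0)
# Converges toward 0 with milder oscillations than plain Momentum
```

Because the gradient is measured slightly ahead, the correction kicks in before the ball has fully overshot, which is why the oscillations are smaller.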
Now lets see the Optimizers used to control the Learning rate.
- Adagrad
- RMSProp
Adagrad (Adaptive Gradient) : Generally the learning rate η is constant across the NN, but Adagrad adapts it per parameter based on the gradient history.
- It starts with an initial learning rate
- Going forward, it slows the learning rate down
In the below formula, the learning rate is small, e.g. 0.1 / 0.01 / 0.001.
Formula for GD (Gradient Descent) is, W_new = W_old − η · ( ∂L / ∂W_old )
Adagrad takes care of 2 things :
- It remembers how big the past gradients were
- It reduces the learning rate η for weights that have already moved a lot
Update : W_new = W_old − (η / √(Gₜ + ε)) · (∂L / ∂W_old), where Gₜ is the sum of squared past gradients
- For big gradients
- Small learning rate
- For small gradients
- Big learning rate
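A toy Adagrad sketch on L(w) = w², dividing the learning rate by the root of the accumulated squared gradients (parameter values are illustrative):

```python
import math

# Adagrad: the accumulator g_accum remembers ALL past squared gradients,
# so the effective learning rate only ever shrinks.
def adagrad_descent(w, lr=0.5, steps=100, eps=1e-8):
    g_accum = 0.0
    for _ in range(steps):
        grad = 2 * w                    # dL/dw for L = w^2
        g_accum += grad ** 2
        w = w - lr * grad / (math.sqrt(g_accum) + eps)
    return w

w_final = adagrad_descent(5.0)
# w decreases monotonically toward 0, but ever more slowly
```

The monotone shrinking of the step size is exactly Adagrad's weakness: on long runs the learning rate can decay so far that progress stalls, which motivates RMSProp.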
RMSProp (Root Mean Square Propagation) :
- Adagrad remembers all past gradients forever → learning rate keeps shrinking.
- RMSProp remembers recent gradients only → learning rate stays useful.
- DON'T remember all past gradients, only recent ones. So the effective learning rate goes DOWN when recent gradients are big, and UP when they are small.
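The same toy example with RMSProp: the only change from Adagrad is that the squared-gradient memory becomes an EWMA, so old gradients fade (parameter values are illustrative):

```python
import math

# RMSProp: s is an EWMA of squared gradients, so old gradients fade
# and the effective learning rate stays useful.
def rmsprop_descent(w, lr=0.1, beta=0.9, steps=100, eps=1e-8):
    s = 0.0
    for _ in range(steps):
        grad = 2 * w                          # dL/dw for L = w^2
        s = beta * s + (1 - beta) * grad ** 2
        w = w - lr * grad / (math.sqrt(s) + eps)
    return w

w_final = rmsprop_descent(5.0)
# Reaches the neighborhood of 0 quickly and stays there
```

Compared to the Adagrad sketch, the step size here does not decay to nothing, which is why RMSProp keeps making progress on long training runs.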
ADAM (Adaptive Moment Estimation)
Adam is a hybrid of Momentum and RMSProp, with a bias correction on top.
Adam = Momentum (mₜ) + RMSProp (vₜ) + Bias Correction
Gradient at step t : gₜ = ∇θ L(θₜ)
Momentum : mₜ = β₁ · mₜ₋₁ + (1 − β₁) · gₜ
RMSProp : vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ²
Bias correction : m̂ₜ = mₜ / (1 − β₁ᵗ), v̂ₜ = vₜ / (1 − β₂ᵗ)
Update : θₜ₊₁ = θₜ − η · m̂ₜ / (√v̂ₜ + ε)
In most places, Adam is the recommended optimizer, especially in production.
Look at the silver and green balls : the silver ball (Adagrad) remembers all the previous gradients as well, hence the longer shadow, BUT the green ball (RMSProp) remembers only recent gradients, hence the shorter shadow.
Adam also oscillates a little, because it internally uses Momentum, but it doesn't overshoot much : the RMSProp component keeps the step size in check.
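Putting the Adam formulas together on the toy loss L(w) = w² (the learning rate, betas, and step count are illustrative defaults):

```python
import math

# Adam step: momentum (m) + RMSProp (v) + bias correction,
# exactly as in the formulas above.
def adam_descent(w, lr=0.1, beta1=0.9, beta2=0.999, steps=200, eps=1e-8):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * w                            # dL/dw for L = w^2
        m = beta1 * m + (1 - beta1) * g      # momentum term
        v = beta2 * v + (1 - beta2) * g ** 2 # RMSProp term
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_final = adam_descent(5.0)
# Converges close to the minimum at 0 with only small oscillations
```

The bias correction matters early on: since m and v both start at 0, the raw averages underestimate the true gradient statistics for the first few steps, and dividing by (1 − βᵗ) compensates for that.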
Point to carry across Optimizers : Adam reaches the Global Minimum first!
All the blogs written on Neural Networks & Deep Learning are listed below :
1) (AI Blog#1) Deep Learning and Neural Networks : https://arunsdatasphere.blogspot.com/2026/01/deep-learning-and-neural-networks.html
2) (AI Blog#2) Building a Real-World Neural Network: A Practical Use Case Explained : https://arunsdatasphere.blogspot.com/2026/01/building-real-world-neural-network.html
3) (AI Blog#3) Deep Learning Foundations - Activation & Loss Functions, Gradient Descent algorithms & Optimization techniques : https://arunsdatasphere.blogspot.com/2026/01/ai-blog3-deep-learning-foundations.html
4) (AI Blog#4) : Normalization and Optimizers in a Neural Network : https://arunsdatasphere.blogspot.com/2026/02/ai-blog4-normalization-and-optimizers.html
Thank you for reading this blog !
Arun Mathe