
(AI Blog#4) : Normalization and Optimizers in a Neural Network

We are going to discuss Normalization & Optimizers, which are used across AI; we will use them in LLMs, Agentic AI frameworks, etc.

As we discussed in our previous blogs (links to the previous blogs are at the end of this blog), while calculating gradients, when the new weights are almost identical to the previous weights, we land in a problem called Vanishing Gradient. Similarly, when the current weights change far too much compared to the previous weights, we land in the Exploding Gradient issue.

Using Normalization, we are going to prevent the Vanishing & Exploding Gradient problems.


This blog's agenda :

  • Normalization
    • Batch Normalization (useful in Neural Networks)
    • Layer Normalization (useful in LLMs)
  • Weights Initialization
    • Xavier
    • He
  • EWMA (Exponential Weighted Moving Average)
    • We will use it in Optimizers like Momentum, NAG, Adagrad, RMSProp, Adam


Normalization 
        Normalization in Neural Networks and Deep Learning simply means rescaling values so training becomes stable, faster and more reliable. 

Why Normalization is needed ?
Imagine training a model where 
  • One input/feature ranges from 0-1
  • Another input/feature ranges from 0-10,00,000
In those scenarios, Gradient Descent behaves like a person walking on uneven terrain with:
  • Small scale features --> Tiny steps --> Slow learning
  • Large scale features --> Huge steps --> Overshooting
Normalization flattens the terrain, so optimization becomes smooth.


What exactly gets Normalized ?
Normalization can happen at three levels
  • Data/Input Normalization
  • Network Level Normalization
  • Weight Normalization (Less common)

Let us consider the Neural Network below, which takes Age and Salary as inputs and has 3 neurons in the hidden layer. We would like to predict the Salary of a person with Age 42. Note that this is a Regression problem.

When we feed these values directly into the NN, it has no intelligence to treat the first input as Age and the second as Salary. How will it know that 42 is an Age and 3,50,000 is a Salary? It just sees numbers. Observe the difference in scale between those numbers; it is huge (3,50,000 vs 42). When we calculate the derivatives with respect to Age and Salary, the gap between the two will also be huge, and those derivatives become unstable. Simplified:
  • d/dx(Age) = 0.002
  • d/dx(Salary) = 10000.20
Observe carefully: if a derivative is too small it leads to the Vanishing Gradient problem, and if it is too large it leads to the Exploding Gradient problem. We have to balance between both, meaning values should be neither too small nor too large. Note that we landed in these issues purely because of our input values.

Because of this, we need to normalize our values/inputs/features. Normalization is nothing but bringing all inputs into roughly the same range of values (not exactly the same, but comparable) for all the neurons in the NN.


Techniques in Normalization :
  • Data Level Normalization
  • Network Level Normalization

Data Level Normalization 
        This technique is applied before training. We first apply normalization to the data and then use that data for training. When we say before training, conceptually/programmatically it must happen before backward propagation, as the actual learning happens during backward propagation.

Again, we have 2 types in it
  • Min Max Scaling
    • Range[0, 1]
    • Formula :  x_scaled = (x - x_min) / (x_max - x_min)
Where x = Original data value
            x_min = minimum value in the feature
            x_max = maximum value in the feature
            x_scaled = normalized value range[0, 1]

Example : If dataset = [1, 2, 3, 400, 10]

Let's calculate for value 2: x_scaled = (2 − 1) / (400 − 1) = 1/399 ≈ 0.0025
Note : if the scaled value is too small, we land in the Vanishing Gradient problem. Hence we should not use Min-Max scaling if the data has outliers (here, the outlier 400 squashes every other value toward 0).
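As a quick illustration, here is a minimal Min-Max scaling sketch in plain Python (the function name is mine, not from any library):

```python
# Minimal sketch of Min-Max scaling: rescale values into [0, 1].
def min_max_scale(values):
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

data = [1, 2, 3, 400, 10]
scaled = min_max_scale(data)
# The outlier 400 squashes everything else toward 0:
# value 2 maps to (2 - 1) / (400 - 1) = 1/399 ≈ 0.0025
```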

  • Standardization (or) Z-Score
    • Values are centered around 0 with unit variance (most values fall roughly within [-3, 3]; the output is not strictly bounded)
    • Formula : x_scaled = (x − μ) / σ
Where μ(mu)      = Mean(average) of feature values
            σ(sigma) = Standard deviation of feature value 

Example :
If dataset = [2, 4, 6, 8]

μ = (2+4+6+8)/4 = 5

σ = √[ (1 / n) Σ (xᵢ − μ)² ], where n is no of values, μ is mean (as calculated above)

so, σ = √[ ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4 ] = √[ (9 + 1 + 1 + 9) / 4 ] =  √5 = 2.236

After normalization, expectation would be mean = 0, variance = 1
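The worked example above can be verified with a small sketch (pure Python, population standard deviation as in the formula):

```python
import math

# Minimal sketch of Z-score standardization: x_scaled = (x - mu) / sigma
def standardize(values):
    n = len(values)
    mu = sum(values) / n                                       # mean
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)  # std deviation
    return [(x - mu) / sigma for x in values]

data = [2, 4, 6, 8]  # mu = 5, sigma = sqrt(5) ≈ 2.236
z = standardize(data)
# After standardization: mean ≈ 0, variance ≈ 1
```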



Network Level Normalization
        This technique is applied during training. We have 2 types in it:
  • Batch Normalization
    • Each layer's output becomes the input to the next layer, and the expectation is that training stays stable. Hence we apply Batch Normalization between layers during training
    • Observe the image below to understand how Batch Normalization works
    • Here as well, the target is mean = 0 & variance = 1
    • It happens at every layer, between one layer and the next
    • Input ------------ Batch Normalization ------------------ then apply the activation function (Af)
    • Batch Normalization normalizes the activations of each layer during training; mathematically it resembles the z-score, but computed dynamically per mini-batch

  • Layer Normalization
    • We will discuss it when we cover LLMs
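To make the Batch Normalization step concrete, here is a minimal NumPy sketch of the per-batch computation (gamma, beta, and eps follow the usual conventions; this is an illustration, not a full trainable layer):

```python
import numpy as np

# Minimal sketch of Batch Normalization over one mini-batch of pre-activations.
# gamma/beta are the learnable scale and shift; eps avoids division by zero.
def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    mu = z.mean(axis=0)                    # per-feature mean over the batch
    var = z.var(axis=0)                    # per-feature variance over the batch
    z_hat = (z - mu) / np.sqrt(var + eps)  # now mean ≈ 0, variance ≈ 1
    return gamma * z_hat + beta            # scale and shift

batch = np.array([[1.0, 200.0],
                  [3.0, 400.0],
                  [5.0, 600.0]])
out = batch_norm(batch)
```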

So far, we have tried to normalize the input data. Now, let's look at weight initialization techniques.

Weight Initialization Techniques
            We need to properly initialize weights, otherwise we will again end up with Vanishing/Exploding Gradient issues.

Generally, W_new = W_old − η · ( ∂L / ∂W_old )

and Z = w1x1+w2x2+...+b1 

Techniques in weight initialization :
  • Xavier initialization
    • Choose weights so the signal keeps roughly the same strength from layer to layer
    • Whenever we use activation functions like Tanh or Sigmoid, the recommended weight initialization technique is "Xavier" initialization
Rules :
  • How many connections come in (fan-in)
  • How many connections go out (fan-out)
  • Pick weights in a balanced way
Assume we have 4 inputs x = [1, 1, 1, 1], and we initially assign random weights w = [2, 2, 2, 2].

Linear transformation: Z = 1·2 + 1·2 + 1·2 + 1·2 = 8 (w1x1 + w2x2 + w3x3 + w4x4, ignoring bias). Applying the ReLU activation, max(0, Z), gives 8. If we repeat this process across several layers, this value keeps growing, and that leads to Exploding Gradient.

Now, Xavier proposed: if the output keeps exploding, choose weights with var(w) = 1/n, where n is the number of inputs to the layer.

Just remember, if we apply Xavier initialization, weights will be balanced across all layers and output will be controlled, and training will be stable.


  • He initialization
    • Whenever we use the ReLU activation function, the recommended weight initialization technique is "He" initialization
    • Var(w) = 2/n
    • Weight scale (standard deviation) = √(2/n)
    • Basically, when we use ReLU, roughly half the outputs get zeroed out each layer; recursively, after passing through multiple NN layers, there is a chance the signal shrinks to 0 or close to 0. He initialization compensates for this.
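A minimal NumPy sketch of both initializations (drawing from a normal distribution; the function names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Xavier: Var(w) = 1/n_in -> std = sqrt(1/n_in); suited to tanh/sigmoid
def xavier_init(n_in, n_out):
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

# He: Var(w) = 2/n_in -> std = sqrt(2/n_in); suited to ReLU
def he_init(n_in, n_out):
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

w_tanh = xavier_init(1000, 100)  # std ≈ 0.0316
w_relu = he_init(1000, 100)      # std ≈ 0.0447
```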

We need to understand Optimizers now; before that, let's see what EWMA is. These optimizers act like boosters for Gradient Descent.


EWMA (Exponential Weighted Moving Average)
  • More weight to recent values, less weight to old ones
  • Say we are predicting tomorrow's temperature; in a scenario like this, we care more about today's conditions than the temperature from 10 days back. Isn't it? It works naturally with time-series data.
  • This concept will be helpful in optimizers
  • It smoothens the zig-zag path of the NN on its way to the global minimum, which reduces training time and helps reach the global minimum quickly.

Formula : v_t = β · v_{t-1} + (1 − β) · x_t

where v_t is the EWMA at time t, x_t is the current value, and β is the smoothing factor (a hyperparameter)

We will use it in Optimizers like Momentum, NAG, Adagrad, RMSProp and Adam.
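A minimal sketch of the formula on a toy temperature series (this simple version starts from v_0 = 0, so early values are biased low; Adam's bias correction, covered later, fixes exactly that):

```python
# Minimal EWMA sketch: v_t = beta * v_{t-1} + (1 - beta) * x_t
def ewma(series, beta=0.9):
    v, out = 0.0, []
    for x in series:
        v = beta * v + (1 - beta) * x  # recent values get weight (1 - beta)
        out.append(v)
    return out

temps = [30, 32, 31, 35, 33]  # toy daily temperatures
smoothed = ewma(temps)
# First value: 0.9*0 + 0.1*30 = 3.0; second: 0.9*3.0 + 0.1*32 = 5.9
```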




Optimizers 
            
            An Optimizer is an algorithm that updates the model's weights and biases so that the neural network learns from data.

Mathematical view : W_new = W_old − η · ( ∂L / ∂W_old )
Where η = learning rate, 
∂L / ∂W_old  = derivative of Loss with respect to old weight

The Optimizer decides how exactly this update is done.

Optimizers are classified based on
  • Speed/Velocity
  • Learning rate

Under the
  • Speed/Velocity category
    • optimizers like Momentum and NAG were introduced
  • Learning rate category
    • optimizers like Adagrad and RMSProp were introduced
There is also an advanced optimizer called Adam, which controls both speed and learning rate; this is the one recommended for production use.


Momentum : Instead of using only the current gradient, use a moving average of past gradients.       

Formula for GD (Gradient Descent) is, W_new = W_old − η · ( ∂L / ∂W_old )

The first problem with GD is its zig-zag path: it is hard to reach the global minimum in the expected time, and it simply takes too long to reach the end goal.

The second problem with GD is that it doesn't remember/store old gradient information once an iteration of the NN completes. Why do we need that memory? Because based on past updates, the optimizer should adjust its direction while travelling towards the global minimum. If you don't know where you started, you may retrace the same steps again, which is a clear problem.

Now, if we inject EWMA (Exponential Weighted Moving Average) into GD, it will remember older gradients as well, which helps improve the model's updates.

For example, take the time-series data below of daily temperatures: EWMA gives more weight to the latest values and less to older ones, BUT it still carries a memory of all the previous values.

Time series data : t1, t2, t3, t4, ......t10, t11

Formula :
v_t = β · v_{t-1} + (1 − β) · (∂L/∂W_t), then W_new = W_old − η · v_t

As shown in the images above, velocity builds up at each step, which helps reach the global minimum faster.

But it can overshoot, because the momentum builds velocity, and it takes a lot of oscillations to settle.
  • GD often gets stuck in a local minimum
  • Momentum has more oscillations due to its speed, but it reaches the global minimum

Problems with Momentum :
  • Overshooting
  • Lots of oscillations
  • Takes time to settle
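A minimal sketch of Momentum on a toy 1-D loss L(w) = w², whose gradient is 2w and whose minimum sits at w = 0 (hyperparameters are illustrative):

```python
# Momentum: keep an EWMA of gradients (velocity) and step along it.
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad  # velocity = EWMA of gradients
    w = w - lr * v                    # update uses velocity, not the raw grad
    return w, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)  # gradient of L(w) = w^2
# w has moved (with some oscillation) toward the minimum at 0
```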


NAG (Nesterov Accelerated Gradient) : NAG is an improved version of Momentum-based GD. It has a look-ahead point, as shown in the image below, which helps rein in the speed of Momentum.
  • Instead of computing the gradient at the current position, NAG looks ahead to where Momentum is about to take you, and then corrects that step.
  • Observe the difference in steps between Momentum and NAG: Momentum is still at full speed at the global minimum (it crosses it and oscillates), while NAG controls that speed. Remember, NAG still oscillates, just less than Momentum.
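The look-ahead idea in a minimal sketch, on the same toy loss L(w) = w²; the key change from Momentum is where the gradient is evaluated:

```python
# NAG: evaluate the gradient at the look-ahead point w - beta*v, not at w.
def nag_step(w, v, grad_fn, lr=0.1, beta=0.9):
    lookahead = w - beta * v                # where momentum is about to take us
    v = beta * v + lr * grad_fn(lookahead)  # correct the step using that peek
    return w - v, v

w, v = 5.0, 0.0
for _ in range(100):
    w, v = nag_step(w, v, grad_fn=lambda x: 2 * x)  # gradient of w^2
```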


    
Now let's see the optimizers used to control the learning rate:
  • Adagrad
  • RMSProp

Adagrad (Adaptive Gradient) : Generally the learning rate η is constant across the NN, but Adagrad adapts it based on the weights and biases.
  • It starts with an initial learning rate
  • Going forward, it gradually slows the learning rate down

In the formula below, the learning rate starts small, e.g. 0.1/0.01/0.001

Formula for GD (Gradient Descent) is, W_new = W_old − η · ( ∂L / ∂W_old )

Adagrad takes care of 2 things :
  • Remembers how big past gradients are
  • Reduces learning rate, η for weights that already moved a lot
  • For Big Gradients
    • Small learning rate
  • For Small Gradients
    • Big learning rate
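A minimal Adagrad sketch on the same toy loss L(w) = w²; note how the cache only ever grows, so the effective step η/√cache only ever shrinks:

```python
# Adagrad: divide the learning rate by the root of the summed squared grads.
def adagrad_step(w, cache, grad, lr=0.5, eps=1e-8):
    cache = cache + grad ** 2                 # remembers ALL past gradients
    w = w - lr * grad / (cache ** 0.5 + eps)  # effective lr shrinks forever
    return w, cache

w, cache = 5.0, 0.0
for _ in range(500):
    w, cache = adagrad_step(w, cache, grad=2 * w)  # gradient of w^2
```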

RMSProp (Root Mean Square Propagation) : 
  • Adagrad remembers all past gradients forever → learning rate keeps shrinking.
  • RMSProp remembers recent gradients only → learning rate stays useful.
  • Don't remember all past gradients, only the recent ones. So the learning rate goes DOWN when gradients are big, and it goes UP when gradients are small.
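A minimal RMSProp sketch; compared with Adagrad, the only change is that the cache is an EWMA of squared gradients instead of a running sum:

```python
# RMSProp: EWMA of squared gradients -> the effective lr can recover.
def rmsprop_step(w, cache, grad, lr=0.1, beta=0.9, eps=1e-8):
    cache = beta * cache + (1 - beta) * grad ** 2  # recent grads dominate
    w = w - lr * grad / (cache ** 0.5 + eps)
    return w, cache

w, cache = 5.0, 0.0
for _ in range(100):
    w, cache = rmsprop_step(w, cache, grad=2 * w)  # toy loss L(w) = w^2
```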

ADAM (Adaptive Moment Estimation) 

Adam is a hybrid that combines the ideas behind Momentum and RMSProp, building on the lessons of NAG and Adagrad.

Adam = Momentum (mₜ) + RMSProp (vₜ) + Bias Correction

Gradient at step t, gₜ = ∇θ L(θₜ)

Momentum, mₜ = β₁ · mₜ₋₁ + (1 − β₁) · gₜ

RMSProp, vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ²

In most places, Adam is the recommended choice, especially in Production.
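Putting the formulas above together, here is a minimal Adam sketch on the toy loss L(w) = w² (the defaults β₁ = 0.9, β₂ = 0.999 are the usual ones):

```python
# Adam = Momentum (m) + RMSProp (v) + bias correction. t starts at 1.
def adam_step(w, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # EWMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # EWMA of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (v_hat ** 0.5 + eps), m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 301):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)  # gradient of w^2
```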


Look at the silver and green balls in the image above. The silver ball (Adagrad) remembers all previous gradients as well, hence the longer trail, BUT the green ball (RMSProp) remembers only recent gradients, hence the shorter trail.

Adam also oscillates a little, because it internally uses Momentum, but it does not overshoot badly; the RMSProp part keeps the speed in check.

Point to carry across Optimizers :
ADAM will reach the global minimum first!



All the blogs written on Neural Networks & Deep Learning are listed below :

1) (AI Blog#1) Deep Learning and Neural Networks :  https://arunsdatasphere.blogspot.com/2026/01/deep-learning-and-neural-networks.html

2) (AI Blog#2) Building a Real-World Neural Network: A Practical Use Case Explained :  https://arunsdatasphere.blogspot.com/2026/01/building-real-world-neural-network.html

3) (AI Blog#3) Deep Learning Foundations - Activation & Loss Functions, Gradient Descent algorithms & Optimization techniques : https://arunsdatasphere.blogspot.com/2026/01/ai-blog3-deep-learning-foundations.html

4) (AI Blog#4) : Normalization and Optimizers in a Neural Network :   https://arunsdatasphere.blogspot.com/2026/02/ai-blog4-normalization-and-optimizers.html


Thank you for reading this blog !
Arun Mathe
