
(AI Blog#4) : Normalization and Optimizers in a Neural Network

We are going to discuss Normalization & Optimizers, which are used across AI; we will use them in LLMs, Agentic AI frameworks, etc.

As we discussed in our previous blogs (links to the previous blogs are at the end of this blog), while calculating gradients, when the new weights are almost identical to the previous weights, we land in a problem called Vanishing Gradient. Similarly, when the current weights change far too much compared to the previous weights, we land in the Exploding Gradient issue.

Using Normalization, we are going to prevent the Vanishing & Exploding Gradient problems.


This blog's agenda :

  • Normalization
    • Batch Normalization (useful in Neural Networks)
    • Layer Normalization (useful in LLMs)
  • Weights Initialization
    • Xavier
    • He
  • EWMA (Exponential Weighted Moving Average)
    • We will use it in Optimizers like Momentum, NAG, Adagrad, RMSProp, Adam


Normalization 
        Normalization in Neural Networks and Deep Learning simply means rescaling values so training becomes stable, faster and more reliable. 

Why Normalization is needed ?
Imagine training a model where 
  • One input/feature ranges from 0-1
  • Another input/feature ranges from 0-10,00,000
In those scenarios, Gradient Descent behaves like a person walking on uneven terrain with:
  • Small scale features --> Tiny steps --> Slow learning
  • Large scale features --> Huge steps --> Overshooting
Normalization flattens the terrain, so optimization becomes smooth.


What exactly gets Normalized ?
Normalization can happen at three levels
  • Data/Input Normalization
  • Network Level Normalization
  • Weight Normalization (Less common)

Let us consider the Neural Network below, which takes Age and Salary as inputs and has 3 neurons in the hidden layer. We would like to predict the Salary of a person with Age 42. Note that this is a Regression problem.

When we feed these values directly into the NN, it has no intelligence to treat the first input as Age and the second as Salary. How will it know that 42 is an Age and 3,50,000 is a Salary? It just sees numbers. Observe the difference in scale between those numbers; it is huge (3,50,000 vs 42). When we calculate the derivatives with respect to Age and Salary, the gap between the two will also be huge, and those derivatives become unstable. Simplified:
  • d/dx(Age) = 0.002
  • d/dx(Salary) = 10000.20
Observe carefully: if a derivative is too small it leads to the Vanishing Gradient problem, and if it is too large it leads to the Exploding Gradient problem. We have to balance between both, meaning values should be neither too small nor too large. Note that we landed in these issues purely because of our input values.

Because of this, we need to normalize our values/inputs/features. Normalization is nothing but bringing all inputs into roughly the same range of values (not exactly the same, but comparable) for all the neurons in the NN.


Techniques in Normalization :
  • Data Level Normalization
  • Network Level Normalization

Data Level Normalization 
        This technique is applied before training. We first apply normalization to the data and then use that data for training. When we say before training, conceptually/programmatically it must happen before backward propagation, as the actual learning happens during backward propagation.

Again, we have 2 types in it
  • Min Max Scaling
    • Range[0, 1]
    • Formula :  x_scaled = (x - x_min) / (x_max - x_min)
Where x = Original data value
            x_min = minimum value in the feature
            x_max = maximum value in the feature
            x_scaled = normalized value range[0, 1]

Example : If dataset = [1, 2, 3, 400, 10]

Let's calculate for value 2: x_scaled = (2 − 1) / (400 − 1) = 1/399 ≈ 0.0025
Note : if the scaled value is too small, we land in the Vanishing Gradient problem. Hence we should not use Min-Max scaling if the data has outliers (here, the outlier 400 squashes every other value toward 0).
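As a quick illustration, here is a minimal Min-Max scaling sketch in plain Python (the function name is mine, not from any library):

```python
# Minimal sketch of Min-Max scaling: rescale values into [0, 1].
def min_max_scale(values):
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

data = [1, 2, 3, 400, 10]
scaled = min_max_scale(data)
# The outlier 400 squashes everything else toward 0:
# value 2 maps to (2 - 1) / (400 - 1) = 1/399 ≈ 0.0025
```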

  • Standardization (or) Z-Score
    • Values are centered around 0 with unit variance (most values fall roughly within [-3, 3]; the output is not strictly bounded)
    • Formula : x_scaled = (x − μ) / σ
Where μ(mu)      = Mean(average) of feature values
            σ(sigma) = Standard deviation of feature value 

Example :
If dataset = [2, 4, 6, 8]

μ = (2+4+6+8)/4 = 5

σ = √[ (1 / n) Σ (xᵢ − μ)² ], where n is no of values, μ is mean (as calculated above)

so, σ = √[ ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4 ] = √[ (9 + 1 + 1 + 9) / 4 ] =  √5 = 2.236

After normalization, expectation would be mean = 0, variance = 1
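The worked example above can be verified with a small sketch (pure Python, population standard deviation as in the formula):

```python
import math

# Minimal sketch of Z-score standardization: x_scaled = (x - mu) / sigma
def standardize(values):
    n = len(values)
    mu = sum(values) / n                                       # mean
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)  # std deviation
    return [(x - mu) / sigma for x in values]

data = [2, 4, 6, 8]  # mu = 5, sigma = sqrt(5) ≈ 2.236
z = standardize(data)
# After standardization: mean ≈ 0, variance ≈ 1
```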



Network Level Normalization
        This technique is applied during training. We have 2 types in it:
  • Batch Normalization
    • Each layer's output becomes the input to the next layer, and the expectation is that training stays stable. Hence we apply Batch Normalization between layers during training
    • Observe the image below to understand how Batch Normalization works
    • Here as well, the target is mean = 0 & variance = 1
    • It happens at every layer, between one layer and the next
    • Input ------------ Batch Normalization ------------------ then apply the activation function (Af)
    • Batch Normalization normalizes the activations of each layer during training; mathematically it resembles the z-score, but computed dynamically per mini-batch

  • Layer Normalization
    • We will discuss it when we cover LLMs
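To make the Batch Normalization step concrete, here is a minimal NumPy sketch of the per-batch computation (gamma, beta, and eps follow the usual conventions; this is an illustration, not a full trainable layer):

```python
import numpy as np

# Minimal sketch of Batch Normalization over one mini-batch of pre-activations.
# gamma/beta are the learnable scale and shift; eps avoids division by zero.
def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    mu = z.mean(axis=0)                    # per-feature mean over the batch
    var = z.var(axis=0)                    # per-feature variance over the batch
    z_hat = (z - mu) / np.sqrt(var + eps)  # now mean ≈ 0, variance ≈ 1
    return gamma * z_hat + beta            # scale and shift

batch = np.array([[1.0, 200.0],
                  [3.0, 400.0],
                  [5.0, 600.0]])
out = batch_norm(batch)
```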

So far, we have tried to normalize the input data. Now, let's look at weight initialization techniques.

Weight Initialization Techniques
            We need to properly initialize weights, otherwise we will again end up with Vanishing/Exploding Gradient issues.

Generally, W_new = W_old − η · ( ∂L / ∂W_old )

and Z = w1x1+w2x2+...+b1 

Techniques in weight initialization :
  • Xavier initialization
    • Choose weights so the signal keeps roughly the same strength from layer to layer
    • Whenever we use activation functions like Tanh or Sigmoid, the recommended weight initialization technique is "Xavier" initialization
Rules :
  • How many connections come in (fan-in)
  • How many connections go out (fan-out)
  • Pick weights in a balanced way
Assume we have 4 inputs x = [1, 1, 1, 1], and we initially assign random weights w = [2, 2, 2, 2].

Linear transformation: Z = 1·2 + 1·2 + 1·2 + 1·2 = 8 (w1x1 + w2x2 + w3x3 + w4x4, ignoring bias). Applying the ReLU activation, max(0, Z), gives 8. If we repeat this process across several layers, this value keeps growing, and that leads to Exploding Gradient.

Now, Xavier proposed: if the output keeps exploding, choose weights with var(w) = 1/n, where n is the number of inputs to the layer.

Just remember, if we apply Xavier initialization, weights will be balanced across all layers and output will be controlled, and training will be stable.


  • He initialization
    • Whenever we use the ReLU activation function, the recommended weight initialization technique is "He" initialization
    • Var(w) = 2/n
    • Weight scale (standard deviation) = √(2/n)
    • Basically, when we use ReLU, roughly half the outputs get zeroed out each layer; recursively, after passing through multiple NN layers, there is a chance the signal shrinks to 0 or close to 0. He initialization compensates for this.
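A minimal NumPy sketch of both initializations (drawing from a normal distribution; the function names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Xavier: Var(w) = 1/n_in -> std = sqrt(1/n_in); suited to tanh/sigmoid
def xavier_init(n_in, n_out):
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

# He: Var(w) = 2/n_in -> std = sqrt(2/n_in); suited to ReLU
def he_init(n_in, n_out):
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

w_tanh = xavier_init(1000, 100)  # std ≈ 0.0316
w_relu = he_init(1000, 100)      # std ≈ 0.0447
```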

We need to understand Optimizers now; before that, let's see what EWMA is. These optimizers act like boosters for Gradient Descent.


EWMA (Exponential Weighted Moving Average)
  • More weight to recent values, less weight to old ones
  • Say we are predicting tomorrow's temperature; in a scenario like this, we care more about today's conditions than the temperature from 10 days back. Isn't it? It works naturally with time-series data.
  • This concept will be helpful in optimizers
  • It smoothens the zig-zag path of the NN on its way to the global minimum, which reduces training time and helps reach the global minimum quickly.

Formula : v_t = β · v_{t-1} + (1 − β) · x_t

where v_t is the EWMA at time t, x_t is the current value, and β is the smoothing factor (a hyperparameter)

We will use it in Optimizers like Momentum, NAG, Adagrad, RMSProp and Adam.
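A minimal sketch of the formula on a toy temperature series (this simple version starts from v_0 = 0, so early values are biased low; Adam's bias correction, covered later, fixes exactly that):

```python
# Minimal EWMA sketch: v_t = beta * v_{t-1} + (1 - beta) * x_t
def ewma(series, beta=0.9):
    v, out = 0.0, []
    for x in series:
        v = beta * v + (1 - beta) * x  # recent values get weight (1 - beta)
        out.append(v)
    return out

temps = [30, 32, 31, 35, 33]  # toy daily temperatures
smoothed = ewma(temps)
# First value: 0.9*0 + 0.1*30 = 3.0; second: 0.9*3.0 + 0.1*32 = 5.9
```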




Optimizers 
            
            An Optimizer is an algorithm that updates the model's weights and biases so that the neural network learns from data.

Mathematical view : W_new = W_old − η · ( ∂L / ∂W_old )
Where η = learning rate, 
∂L / ∂W_old  = derivative of Loss with respect to old weight

The Optimizer decides how exactly this update is done.

Optimizers are classified based on
  • Speed/Velocity
  • Learning rate

Under the
  • Speed/Velocity category
    • optimizers like Momentum and NAG were introduced
  • Learning rate category
    • optimizers like Adagrad and RMSProp were introduced
There is also an advanced optimizer called Adam, which controls both speed and learning rate; this is the one recommended for production use.


Momentum : Instead of using only the current gradient, use a moving average of past gradients.       

Formula for GD (Gradient Descent) is, W_new = W_old − η · ( ∂L / ∂W_old )

The first problem with GD is its zig-zag path: it is hard to reach the global minimum in the expected time, and it simply takes too long to reach the end goal.

The second problem with GD is that it doesn't remember/store old gradient information once an iteration of the NN completes. Why do we need that memory? Because based on past updates, the optimizer should adjust its direction while travelling towards the global minimum. If you don't know where you started, you may retrace the same steps again, which is a clear problem.

Now, if we inject EWMA (Exponential Weighted Moving Average) into GD, it will remember older gradients as well, which helps improve the model's updates.

For example, take the time-series data below of daily temperatures: EWMA gives more weight to the latest values and less to older ones, BUT it still carries a memory of all the previous values.

Time series data : t1, t2, t3, t4, ......t10, t11

Formula :
v_t = β · v_{t-1} + (1 − β) · (∂L/∂W_t), then W_new = W_old − η · v_t

As shown in the images above, velocity builds up at each step, which helps reach the global minimum faster.

But it can overshoot, because the momentum builds velocity, and it takes a lot of oscillations to settle.
  • GD often gets stuck in a local minimum
  • Momentum has more oscillations due to its speed, but it reaches the global minimum

Problems with Momentum :
  • Overshooting
  • Lots of oscillations
  • Takes time to settle
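A minimal sketch of Momentum on a toy 1-D loss L(w) = w², whose gradient is 2w and whose minimum sits at w = 0 (hyperparameters are illustrative):

```python
# Momentum: keep an EWMA of gradients (velocity) and step along it.
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad  # velocity = EWMA of gradients
    w = w - lr * v                    # update uses velocity, not the raw grad
    return w, v

w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)  # gradient of L(w) = w^2
# w has moved (with some oscillation) toward the minimum at 0
```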


NAG (Nesterov Accelerated Gradient) : NAG is an improved version of Momentum-based GD. It has a look-ahead point, as shown in the image below, which helps rein in the speed of Momentum.
  • Instead of computing the gradient at the current position, NAG looks ahead to where Momentum is about to take you, and then corrects that step.
  • Observe the difference in steps between Momentum and NAG: Momentum is still at full speed at the global minimum (it crosses it and oscillates), while NAG controls that speed. Remember, NAG still oscillates, just less than Momentum.
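The look-ahead idea in a minimal sketch, on the same toy loss L(w) = w²; the key change from Momentum is where the gradient is evaluated:

```python
# NAG: evaluate the gradient at the look-ahead point w - beta*v, not at w.
def nag_step(w, v, grad_fn, lr=0.1, beta=0.9):
    lookahead = w - beta * v                # where momentum is about to take us
    v = beta * v + lr * grad_fn(lookahead)  # correct the step using that peek
    return w - v, v

w, v = 5.0, 0.0
for _ in range(100):
    w, v = nag_step(w, v, grad_fn=lambda x: 2 * x)  # gradient of w^2
```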


    
Now let's see the optimizers used to control the learning rate:
  • Adagrad
  • RMSProp

Adagrad (Adaptive Gradient) : Generally the learning rate η is constant across the NN, but Adagrad adapts it based on the weights and biases.
  • It starts with an initial learning rate
  • Going forward, it gradually slows the learning rate down

In the formula below, the learning rate starts small, e.g. 0.1/0.01/0.001

Formula for GD (Gradient Descent) is, W_new = W_old − η · ( ∂L / ∂W_old )

Adagrad takes care of 2 things :
  • Remembers how big past gradients are
  • Reduces learning rate, η for weights that already moved a lot
  • For Big Gradients
    • Small learning rate
  • For Small Gradients
    • Big learning rate
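A minimal Adagrad sketch on the same toy loss L(w) = w²; note how the cache only ever grows, so the effective step η/√cache only ever shrinks:

```python
# Adagrad: divide the learning rate by the root of the summed squared grads.
def adagrad_step(w, cache, grad, lr=0.5, eps=1e-8):
    cache = cache + grad ** 2                 # remembers ALL past gradients
    w = w - lr * grad / (cache ** 0.5 + eps)  # effective lr shrinks forever
    return w, cache

w, cache = 5.0, 0.0
for _ in range(500):
    w, cache = adagrad_step(w, cache, grad=2 * w)  # gradient of w^2
```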

RMSProp (Root Mean Square Propagation) : 
  • Adagrad remembers all past gradients forever → learning rate keeps shrinking.
  • RMSProp remembers recent gradients only → learning rate stays useful.
  • Don't remember all past gradients, only the recent ones. So the learning rate goes DOWN when gradients are big, and it goes UP when gradients are small.
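A minimal RMSProp sketch; compared with Adagrad, the only change is that the cache is an EWMA of squared gradients instead of a running sum:

```python
# RMSProp: EWMA of squared gradients -> the effective lr can recover.
def rmsprop_step(w, cache, grad, lr=0.1, beta=0.9, eps=1e-8):
    cache = beta * cache + (1 - beta) * grad ** 2  # recent grads dominate
    w = w - lr * grad / (cache ** 0.5 + eps)
    return w, cache

w, cache = 5.0, 0.0
for _ in range(100):
    w, cache = rmsprop_step(w, cache, grad=2 * w)  # toy loss L(w) = w^2
```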

ADAM (Adaptive Moment Estimation) 

Adam is a hybrid that combines the ideas behind Momentum and RMSProp, building on the lessons of NAG and Adagrad.

Adam = Momentum (mₜ) + RMSProp (vₜ) + Bias Correction

Gradient at step t, gₜ = ∇θ L(θₜ)

Momentum, mₜ = β₁ · mₜ₋₁ + (1 − β₁) · gₜ

RMSProp, vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ²

In most places, Adam is the recommended choice, especially in Production.
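Putting the formulas above together, here is a minimal Adam sketch on the toy loss L(w) = w² (the defaults β₁ = 0.9, β₂ = 0.999 are the usual ones):

```python
# Adam = Momentum (m) + RMSProp (v) + bias correction. t starts at 1.
def adam_step(w, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # EWMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # EWMA of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (v_hat ** 0.5 + eps), m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 301):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)  # gradient of w^2
```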


Look at the silver and green balls in the image above. The silver ball (Adagrad) remembers all previous gradients as well, hence the longer trail, BUT the green ball (RMSProp) remembers only recent gradients, hence the shorter trail.

Adam also oscillates a little, because it internally uses Momentum, but it does not overshoot badly; the RMSProp part keeps the speed in check.

Point to carry across Optimizers :
ADAM will reach the global minimum first!



All the blogs written on Neural Networks & Deep Learning are listed below :

1) (AI Blog#1) Deep Learning and Neural Networks :  https://arunsdatasphere.blogspot.com/2026/01/deep-learning-and-neural-networks.html

2) (AI Blog#2) Building a Real-World Neural Network: A Practical Use Case Explained :  https://arunsdatasphere.blogspot.com/2026/01/building-real-world-neural-network.html

3) (AI Blog#3) Deep Learning Foundations - Activation & Loss Functions, Gradient Descent algorithms & Optimization techniques : https://arunsdatasphere.blogspot.com/2026/01/ai-blog3-deep-learning-foundations.html

4) (AI Blog#4) : Normalization and Optimizers in a Neural Network :   https://arunsdatasphere.blogspot.com/2026/02/ai-blog4-normalization-and-optimizers.html


Thank you for reading this blog !
Arun Mathe
