
(AI Blog#3) Deep Learning Foundations - Activation & Loss Functions, Gradient Descent algorithms & Optimization techniques

Deep knowledge of the fundamentals is extremely important while designing a machine learning model; otherwise we will end up creating ML models that are of no use. We need a clear understanding of certain techniques to confidently build an ML model, train it using "training data", finalize the model and deploy it to production. So far, in blogs #1 and #2, we have seen the fundamentals of Deep Learning and Neural Networks: the architecture of a neural network, its internal layers and components, etc. 

Providing the links to Blogs #1 and #2 below for quick reference.

Deep Learning & Neural Networks : https://arunsdatasphere.blogspot.com/2026/01/deep-learning-and-neural-networks.html

Building a real world neural network: A practical usecase explained : https://arunsdatasphere.blogspot.com/2026/01/building-real-world-neural-network.html

Now let's dive into the below concepts/criteria, which will help you gain confidence in building your ML model:

  • Activation Functions (Forward Propagation)
    • ReLu, Leaky ReLu, Parametric ReLu
    • Sigmoid
    • Tanh
    • ELU(Exponential Linear Unit), SeLu(Scaled Exponential Linear Unit) - comparatively newer and less commonly used in production.
    • GELU(we will see this in LLM's; it is mostly used in LLMs/Transformers rather than in plain feed-forward networks)
  • Loss Functions
    • Regression
      • MSE
      • MAE
    • Classification
      • BCE
      • CCE
      • SCCE
  • Backward propagation
    • Gradient Descent (GD)
    • Batch Gradient Descent (BGD)
    • Stochastic Gradient Descent (SGD)
    • Mini Batch Gradient Descent (MBGD)
    • Optimizers (Momentum, Adagrad, RMS prop, Adam)
  • Overfitting & Underfitting
  • Vanishing & Exploding Gradients
  • Optimisation Techniques 


Activation Functions 

                In real time, we are going to deal with complex data which is non-linear (for example, lists and tuples are linear data structures, while trees, graphs etc. are non-linear), and that complex data will have some hidden patterns. In Deep Learning/Neural Networks we deal with such complex data, and these neural networks are meant for non-linear data. We need some non-linear functions in neural networks to add non-linearity to the model, and such functions are called Activation Functions. 

The main usage of an activation function is to help the model identify the hidden patterns in the given input data.

Different types of activation functions:

  • Sigmoid 
  • Tanh
  • ReLu
    • Leaky ReLu
    • Parametric ReLu
    • ELU & SeLu
  • GeLu (mostly used in LLMs/Transformers, rarely in plain feed-forward networks)
All the above activation functions are non-linear. They work on non-linear data. 

Each activation function should satisfy certain criteria which we should consider while designing a machine learning model; otherwise you will end up creating a model which is not strong enough in terms of performance and expectation:
  • It should be non-linear activation function
  • It should be differentiable; let's understand this point as below
    • In neural networks, we have 4 steps: forward propagation, loss calculation, backward propagation, and adjusting weights and biases
      • Linear transformation, z = w1x1+w2x2+....+wnxn
      • Activation function = A(z); if this is not differentiable, then there is no learning at all
    • During backward propagation, we need to calculate gradients (nothing but derivatives in maths)
    • While doing that, we need to apply derivatives to the activation function as well; for this reason the activation function we use must be differentiable
  • It should be computationally inexpensive; let's understand this point as below
    • For example, GPT-3.x used around ~175 billion parameters (nothing but weights & biases); think about the complexity of that model if the activation function formula were complex instead of simple
    • Hence the activation function must be computationally inexpensive
  • It should be zero-centered 
    • It has to consider both positive & negative values while building the ML model
    • It should treat the input data in a balanced way
  • It should be non-saturating
    • A non-saturating activation function is one whose output keeps changing as the input changes, instead of getting stuck at a fixed value.
    • That means the neuron continues to respond when input changes
    • A non-saturating activation function allows the neuron’s output and gradient to keep changing with the input, preventing vanishing gradients and enabling effective learning during back propagation.


Please try to understand the below program :

import torch

x = torch.tensor(0.0, requires_grad=True) # Enabled back propagation

y1 = torch.sigmoid(x) # Using Sigmoid activation function
y2 = torch.tanh(x) # Using Tanh activation function

print(y1)
# tensor(0.5000, grad_fn=<SigmoidBackward0>)
print(y2)
# tensor(0., grad_fn=<TanhBackward0>)

"""
Whatever be the value of x, during backpropagation, range of values for :

Sigmoid is [0, 1]
Tanh is [-1, 1]

Points to remember :
1. Understand that if we use Sigmoid or Tanh as activation functions in the hidden layers of a neural network then outputs are bounded between a range
as shown above.
2. Hence they are not recommended to use in the hidden layers of neural network.
3. Incase if the x value is 0.0 for tanh, output is 0 (as shown in above program)
Loss = W_old - learning rate(dL/dW_old)
= W_old - (0.1)(0) ; Hence dL/dw_old is '0' when we use tanh for
                input value '0'
= W_old - 0
= W_old

4. Hence new value = old value, hence no learning in ML model.
"""



Note that we apply activation functions at both the hidden layers and the output layer of a neural network. We will see later in the blog which activation functions to use where. Generally, we don't use Sigmoid or Tanh at the hidden layers for the above reason, but we do use Sigmoid at the output layer for binary classification, and other activation functions like Softmax at the output layer for multi-class classification.

Just keep in mind the above 5 points; they are the criteria to decide the activation function in our ML models. 


Overfitting Vs Underfitting 

  • Let's assume we have 1000 records of input data; we have to divide them into "Train Data" & "Test Data"
    • Assume : 70% is Train Data, 30% is Test Data (we use 70% of the input data to train our ML model and 30% to test it)
    • This means our ML model is going to learn hidden & complex patterns from the "Train Data"
    • "Test Data" is unseen data
    • Based on its training, your model is going to evaluate the patterns in the "Test Data"



Overfitting  - Model performs well on "Train Data" but not well on  "Test Data"

Underfitting - Model won't perform well on either "Train" or "Test" data. Basically, it didn't train well on the complex "Train" data.

  • From the above diagram:
    • The straight line represents underfitting: the model is too simple and couldn't capture the true relationship in the input data; it didn't learn enough
    • The smooth curve is an example of a good fit: it ignores a bit of noise but captures the overall trend (the real pattern), with the right level of complexity and low training error
    • The very wavy curve represents overfitting: it connects almost all the data points, is too complex and learns too much noise; it memorised instead of understanding.



Note : Overfitting memorises noise, underfitting ignores patterns, but a good fit captures the patterns.

How to avoid Overfitting ?

  • We need to use techniques like the below (which we are going to discuss further in this blog)
    • Dropout layer
    • L1 & L2 regularization
    • Early stopping 
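As a quick illustration of early stopping, here is a minimal plain-Python sketch; the validation-loss values below are made up purely for this example:

```python
# Minimal early-stopping sketch: stop training when the validation loss
# hasn't improved for `patience` consecutive epochs.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.51, 0.52, 0.53]  # made-up values

patience = 2
best_loss = float("inf")
bad_epochs = 0
stopped_at = None

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss = loss        # improvement: remember it, reset the counter
        bad_epochs = 0
    else:
        bad_epochs += 1         # no improvement this epoch
        if bad_epochs >= patience:
            stopped_at = epoch  # stop before the model starts memorizing noise
            break

print(stopped_at, best_loss)    # stops at epoch 5, best loss 0.50
```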
The below diagram represents the underfitting scenario :
  • No need to go until testing; we can identify the underfitting scenario during training itself, as the loss stays almost the same across all iterations 


Vanishing Gradient & Exploding Gradient
                        If the gradient values are too small, then the old and new weights are almost identical (during back propagation), with no learning. This is called the Vanishing Gradient problem. If the gradient values are too large, then we end up with the Exploding Gradient problem.

  • As per the below diagram
    • We have an input layer with input variables x1, x2
    • A hidden layer with 3 neurons h1, h2, h3
    • An output layer with one output neuron Zf
    • Consider w1_11 as the weight of the connection from x1 to h1
    • Output of 
      • h1 is O11
      • h2 is O12
      • h3 is O13
  • During backward propagation, let's assume that we need to find the derivative of the loss with respect to w1_11
    • Then the path of the loss calculation would be as mentioned in the below image.
    • Loss --> Y^ --> Zf --> O11 --> w1_11 (dL/dw1_11)


  • As per the above backward propagation order, 
    • the derivative of Loss with respect to w1_11 is nothing but
    • the chain rule: dL/dw1_11 = (dL/dY^)*(dY^/dZf)*(dZf/dO11)*(dO11/dw1_11)
    • Assume the following values for the above formula
      • dL/dw1_11 = (0.0001)*(0.000001)*(0.001)*(0.004) = 4E-16 (almost 0)
    • w1_11(new) = w1_11(old) - (learning rate) * (dL/dw1_11)
    • Assume the learning rate is 0.1 and w1_11(old) = 0.8
    • w1_11(new) = 0.8 - (0.1)*(4E-16) ≈ 0.8
  • Observe carefully that w1_11(new) is almost equal to w1_11(old), which is 0.8
    • if both are equal, is there any learning ?
  • This means that if the derivative values are too small (as mentioned above), then the old and new weights of the same connection are almost equal
    • w1_11(new) ≈ w1_11(old)
    • This is called the Vanishing Gradient problem
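The same arithmetic can be checked in plain Python, using the assumed derivative values from above:

```python
# Reproducing the chain-rule arithmetic: the four partial derivatives are
# the assumed values from the example above.
grads = [0.0001, 0.000001, 0.001, 0.004]

dL_dw = 1.0
for g in grads:
    dL_dw *= g                 # chain rule: multiply the partials together

learning_rate = 0.1
w_old = 0.8
w_new = w_old - learning_rate * dL_dw

print(dL_dw)                           # about 4e-16, practically zero
print(abs(w_new - w_old) < 1e-12)      # True: the update is too small to matter
```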

Note : Hence we need to select the activation function very carefully; otherwise we will end up with the Vanishing Gradient problem.

Exploding gradient 

  • Let's say
    • W1_11(old) = 0.8
    • W1_11(new) = 8689.89
  • Look at the difference between the old and new weights of the same connection
  • This is called the Exploding Gradient problem
Now, if we have to create a good ML model, the gradients must stay in between: neither vanishing nor exploding. There is no fixed criterion; we have to interpret based on the data. We just have to make sure that our ML model doesn't lead to either the Overfitting/Underfitting issue or the Vanishing/Exploding gradient issue.

So far, we have discussed the criteria for selecting an activation function, the Overfitting/Underfitting issue & the Vanishing/Exploding gradient issue.

Now, let's go to the actual activation functions.


Activation Functions

  • Sigmoid
    • It comes from the machine learning algorithm called Logistic Regression; it converts any real number into a probability (like tossing a coin, which gives heads or tails)
    • It is a curved line, hence satisfying non-linearity 
    • Any output value
      • > 0.5 will be considered as 1
      • < 0.5 will be considered as 0
    • Hence the range of Sigmoid is (0, 1)


Programmatically showing the range of Sigmoid, i.e. (0, 1):

import torch

z = torch.tensor(10.0, requires_grad=True)
y = torch.sigmoid(z)
print(y) # tensor(1.0000, grad_fn=<SigmoidBackward0>)

z5 = torch.tensor(-10.0, requires_grad=True)
y3 = torch.sigmoid(z5)
print(y3) # tensor(4.5398e-05, grad_fn=<SigmoidBackward0>)


The formula of Sigmoid is σ(z) = 1/(1 + e^(-z)).

Similarly, the derivative of Sigmoid can be represented as below:


Note : Sigmoid is differentiable (Hence satisfying 2nd criteria of activation functions)

dσ(z)/dz = σ(z) · (1 − σ(z))

Where σ is the representation of Sigmoid.
  • Sigmoid is computationally expensive due to the exponential in its formula; calculating the exponent takes a good amount of time.
  • Sigmoid is not zero-centered. Its range is (0, 1). 
  • Sigmoid is saturating 
    • If the linear transformation Z = 10, then σ(10) ≈ 1 (hence the range of Sigmoid is (0, 1))
    • Derivative: dσ(z)/dz = σ(z) · (1 − σ(z)) ≈ 1 · (1 − 1) = 1 · 0 = 0 (leads to the vanishing gradient issue)
    • If the derivative is 0, then it leads to vanishing gradients



  • Tanh
    • Tanh is better than the Sigmoid activation function
    • The formula for Tanh(z) is tanh(z) = (e^z − e^(-z)) / (e^z + e^(-z))
  • Guys, no need to remember all these formulas; programmatically, torch.tanh(z) will do it for us.
  • The graph of Tanh is

  • Pros & cons while satisfying the criteria of an activation function
    • It is non-linear (it's not a straight line)
    • It is differentiable; the formula for the derivative of Tanh(z) is d/dz tanh(z) = 1 − tanh²(z)
    • Tanh is computationally more expensive than Sigmoid because of the extra exponential terms
    • Tanh is zero-centered. It considers both +ve & -ve values, as its range is (-1, 1).
    • Tanh is saturating, so it can lead to the vanishing gradient problem
      • Assume z = 0
        • tanh(0) = 0
        • d/dz tanh(z) at z = 0 is 1 − tanh²(0) = 1 − 0 = 1
      • Assume z = 6
        • tanh(6) ≈ 1
        • d/dz tanh(z) at z = 6 is 1 − tanh²(6) ≈ 1 − 1 = 0
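The saturation of Tanh can be checked in plain Python with the math module:

```python
import math

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)**2."""
    return 1.0 - math.tanh(z) ** 2

print(math.tanh(0.0))         # 0.0
print(tanh_grad(0.0))         # 1.0 (healthy gradient near zero)
print(tanh_grad(6.0) < 1e-4)  # True (saturated: the gradient has almost vanished)
```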

  • ReLu 
    • Rectified Linear Unit is the full form of ReLu
    • Formula: ReLu(z) = max(0, z), where the output range is [0, ∞)
      • if z <= 0, then ReLu(z) is 0 (a -ve value is simply replaced with 0)
      • if z > 0, then ReLu(z) is z (a +ve value simply stays as z)
      • This applies in Forward Propagation
    • The derivative of ReLu, d/dz ReLu(z), is
      • 0 if z <= 0 
      • 1 if z > 0
      • This applies in Backward Propagation
    • Graph of the ReLu activation function
      • See that for all -ve values it is 0, touching the x-axis
    • Criteria of an activation function ?
      • It is non-linear
      • max(0, z) is differentiable wherever z ≠ 0; hence ReLu is partially differentiable
        • at z = 0, it is not differentiable 
      • Computationally inexpensive, as the formula is simple, i.e. max(0, z)
      • It is not zero-centered (it never outputs -ve values)
      • ReLu is non-saturating for z > 0

Finally, is this activation function recommended for the hidden layers of a neural network or not? Yes, because it satisfies most of the criteria of activation functions. It is a good fit for hidden layers.
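A minimal plain-Python sketch of ReLu's forward pass and its derivative:

```python
def relu(z):
    """ReLU forward pass: max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """ReLU derivative used in back propagation (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

print(relu(-3.0), relu(5.0))            # 0.0 5.0 -> negative inputs are clipped
print(relu_grad(-3.0), relu_grad(5.0))  # 0.0 1.0 -> no gradient flows for z <= 0
```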

Now, let's understand a problem called the DYING ReLu problem :

  • ReLu has a problem called the Dying ReLu problem

  • We maintain the same activation function across all neurons in the hidden layers
  • In case ReLu(z) receives a -ve value, the output will be ZERO, and whenever a neuron's output gets stuck at ZERO (its gradient is 0, so its weights stop updating), that neuron is called a DEAD neuron and this situation is called the Dying ReLu problem.
    • Reasons for this problem:
      • High -ve bias ; z = w1x1+w2x2+b (if b is a large -ve value, then z is -ve) and ReLu will return 0
      • High learning rate
        • W_new = W_old - (learning rate) * (dL/dW_old)
        • if the learning rate is too high, the update can push W_new to a large -ve value, making z -ve for every input; the neuron again outputs 0 and stops learning, leading to Dying ReLu
To make it simple, remember that we land in the Dying ReLu problem in the below 2 situations:

  • When the bias is a large negative number
  • When the learning rate is too high
To overcome this problem, another flavour was introduced, called Leaky ReLu.

Leaky ReLu :
  • Leaky ReLu(z) is, where a = 0.01 (researchers finalized this value after experiments)
    • z if z > 0
    • az if z <= 0 (this is the change: we are just multiplying z by "a", whose value is 0.01)
  • Please observe the difference between ReLu and Leaky ReLu in the below diagram
  • Just to avoid '0', we hardcode the slope as 0.01
  • The derivative, d/dz(Leaky ReLu(z)), is
    • 1 if z > 0
    • a if z <= 0 (we keep the neuron alive to avoid the Dying ReLu problem)

Now observe the properties of Leaky ReLu, and whether it satisfies the criteria of activation functions or not :
  • It is non-linear (as we can see from the above graph)
  • It is differentiable (except at z = 0)
  • It is computationally inexpensive
  • It is zero-centered: a*z means (0.01)*z, and z can be any number in (-∞, +∞), so outputs can be both -ve and +ve
  • It is partially non-saturating, not fully non-saturating  
Based on the discussion so far, we can recommend Leaky ReLu as an activation function in hidden layers. But remember, there is still a problem :) 

  • Remember that the value a = 0.01 is fixed; it is not a trainable parameter. This is a static value, but it should ideally be dynamic, shouldn't it? This is the problem with Leaky ReLu.

So one more flavour was introduced, called Parametric ReLu.


Parametric ReLu :
  • Observe the below graph to spot the clear difference between ReLu vs Leaky ReLu vs Parametric ReLu
  • See that Parametric ReLu adjusts the value of 'a' based on the situation, which helps learning
  • PReLu(z) is 
    • z if z > 0
    • az if z <= 0, but 'a' is not constant; it is a learnable parameter. Your model learns it based on the situation, and this is completely taken care of by the ML model. Understand that the control is with your ML model: nothing manual, strictly no hardcoding.

  • It is non-linear
  • It is differentiable
  • It is computationally inexpensive
  • It is zero-centered
  • It is non-saturating

Conclusion :
ReLU completely blocks negative inputs, Leaky ReLU allows a small fixed gradient for negative values, and Parametric ReLU learns the optimal negative slope during training, improving gradient flow and representation capacity.
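The conclusion above can be sketched in plain Python; the PReLU slope value 0.25 below is just an illustrative assumption, since in practice the model learns it during training:

```python
def relu(z):
    # negative inputs are blocked entirely (can lead to dead neurons)
    return max(0.0, z)

def leaky_relu(z, a=0.01):
    # small fixed slope for negative inputs keeps the neuron alive
    return z if z > 0 else a * z

def prelu(z, a):
    # same shape as Leaky ReLU, but `a` is a learnable parameter;
    # here we just pass a made-up value in to show the effect
    return z if z > 0 else a * z

z = -2.0
print(relu(z))           # 0.0   -> blocked
print(leaky_relu(z))     # -0.02 -> fixed small leak
print(prelu(z, a=0.25))  # -0.5  -> slope the model would have learned
```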

How do we decide on an activation function for real-world problems ?
  • We start with ReLu and analyze the output
  • Repeat the process with Leaky ReLu, analyze the output
  • Repeat the process with Parametric ReLu, analyze the output
and then decide on the type of activation function. This is the manual way. We can also decide on hyperparameters based on the test runs (the automatic way of deciding hyperparameters) using a Python package called Optuna. Optuna is an open-source Python library used for Hyperparameter Optimization in Machine Learning and Deep Learning.

With this, we have seen all the required activation functions that we can use during Forward Propagation. Now, let's see the functions that need to be used during loss calculation.


Loss Functions :

We have 2 types of problems in ML:
  • Regression (in regression we have following loss functions)
    • MSE (Mean Squared Error)
    • MAE (Mean Absolute Error)
    • Huber Loss
  • Classification (in classification we have following loss functions)
    • Binary Cross Entropy
    • Categorical Cross Entropy
    • Sparse Categorical Cross Entropy (SCCE)
We have different sets of loss functions for Regression and Classification. Let's see them one by one:




Loss functions for Regression problems:

1) MSE (Mean Squared Error) 
  • Please see the below image for the formula
  • Also consider the below input data to understand MSE
    • i (record index)
    • y (actual value)
    • ŷ (predicted value)
    • e (y − ŷ)
    • e² 
  • We need to understand the difference between Loss vs Average Loss
    • If we consider one record at a time, it is called Loss
    • If we consider multiple records and calculate the loss, that's the Average Loss
    • In the first step, 
      • Loss: MSE = (y − ŷ)²
      • Average Loss: MSE = (1/n) ∑(i=1 to n) (yᵢ − ŷᵢ)² (also known as the Cost Function)
  • Advantages 
    • Simple
    • Same unit
  • Disadvantages
    • Outliers are a problem, inflating the Average Loss
    • Squaring (y − ŷ)² magnifies large errors
Outliers are nothing but abnormalities in the given data; we can simply remove them from the input data and then apply MSE.
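A minimal plain-Python sketch of the Average Loss (MSE) with made-up numbers, showing how a single outlier inflates it:

```python
def mse(y_true, y_pred):
    """Average Loss (cost function): mean of squared errors over all records."""
    n = len(y_true)
    return sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / n

y_true = [3.0, 5.0, 4.0]            # made-up actual values
y_pred = [2.5, 5.5, 4.0]            # made-up predictions
print(mse(y_true, y_pred))          # small, well-behaved loss

# One outlier record inflates the average dramatically because of the squaring
y_true_outlier = [3.0, 5.0, 40.0]
print(mse(y_true_outlier, y_pred))  # loss explodes due to the single outlier
```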

2) MAE (Mean Absolute Error) 
  • MAE was introduced to overcome the above 2 problems with
    • Squares
    • Outliers
  • Formulas:  
    • Loss: MAE = |y − ŷ|
    • Average Loss: MAE = (1/n) ∑(i=1 to n) |yᵢ − ŷᵢ|
  • Advantages
    • Simple and same unit (errors are not squared)
    • It handles outliers well
  • Disadvantages
    • It is not fully differentiable: f(x) = |x| is not differentiable at x = 0

3) Huber Loss 
  • Formula is
    • ½ (y − ŷ)² , if  |y − ŷ| ≤ λ
    • λ (|y − ŷ| − ½ λ), if |y − ŷ| > λ
    • where λ is a hyperparameter: a value we tune (e.g. with Optuna) rather than a weight the model learns, chosen to fit the input scenario
  • Let's assume λ = 5 and errors e1 = 2, e2 = 2, e3 = -20
    • For e1 = 2 
      • |e1| <= λ (2 <= 5) ; condition satisfied, hence using the first formula
      • ½ (y − ŷ)² = ½ e1² = ½ (2)² = ½ (4) = 2 (this is the Loss)
    • For e2 = 2, same value, i.e. 2
    • For e3 = -20
      • |y − ŷ| > λ, so we use λ (|y − ŷ| − ½ λ)
      • |-20| > 5, so λ (|y − ŷ| − ½ λ)
          • = 5(|-20| - ½·5) = 5(20 - 2.5) = 5(17.5) = 87.5
    • So, we have 3 losses: 2, 2, 87.5
    • Average loss = (2 + 2 + 87.5)/3 = 30.5
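The worked example above can be reproduced in plain Python:

```python
def huber_loss(error, lam=5.0):
    """Huber loss for a single error e = y - y_hat, with threshold lam."""
    if abs(error) <= lam:
        return 0.5 * error ** 2            # quadratic region (behaves like MSE)
    return lam * (abs(error) - 0.5 * lam)  # linear region (robust to outliers)

errors = [2.0, 2.0, -20.0]               # the example errors from above
losses = [huber_loss(e) for e in errors]
print(losses)                            # [2.0, 2.0, 87.5]
print(sum(losses) / len(losses))         # 30.5
```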

Which loss function to choose ?
  • If we already handled outliers during data preprocessing, go with MSE
  • If your data contains outliers, then MAE; but MAE has the problem of not being fully differentiable, so go with Huber Loss
  • If you don't have any outliers in your data, go for MSE 
Simply put: use MSE if outliers are absent or already handled, otherwise go with Huber Loss.



Loss functions for Classification problems:

1) Binary Cross Entropy (BCE)
  • It is mainly used when our output is a 2-class classification
  • In the output layer, which activation function do you recommend ?
    • Sigmoid (check the Sigmoid section for the reasons)
  • Formula for the Cost/Loss function
  • It is differentiable 
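A minimal plain-Python sketch of the BCE formula, −[y·log(p) + (1−y)·log(1−p)], with made-up example values:

```python
import math

def bce(y, p, eps=1e-12):
    """Binary Cross Entropy for one record.
    y is the true label (0 or 1), p is the sigmoid output in (0, 1)."""
    p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confident correct prediction -> tiny loss
print(bce(1, 0.99))
# Confident wrong prediction -> large loss
print(bce(1, 0.01))
```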

2) Categorical Cross Entropy (CCE)

  • The activation function at the output layer should be "Softmax"
  • It is for multi-class classification (more than 2 classes)
  • Formula for the Cost/Loss function



  • In real time, most problems do not use this loss function
  • It requires One Hot Encoding of the categorical target variable.
  • One Hot Encoding adds a new column per classification value and assigns 1's and 0's to those records, which makes computation very complex. This is the reason most problems are not using Categorical Cross Entropy. IT IS NOT RECOMMENDED IN REAL TIME. It is a burden on the infrastructure.


3) Sparse Categorical Cross Entropy (SCCE)
  • To overcome the above issue, SCCE was introduced.
  • Instead of adding all categories as new columns, it keeps only one column holding the integer class index for each record. Hence it is recommended in real time, especially when you are dealing with multi-class problems. 
  • Formula for the Cost function (no need to memorize it, the program will take care of it; it's just for our understanding)


Note : When you call the loss function, it internally takes care of One Hot Encoding or the integer class index, depending on which function you call, and then applies the loss.
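A minimal plain-Python sketch showing that CCE with a one-hot target and SCCE with an integer class index compute the same loss (the softmax probabilities below are made up):

```python
import math

def cce(one_hot, probs):
    """Categorical Cross Entropy with a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def scce(class_index, probs):
    """Sparse version: the target is just the integer class index,
    so no one-hot columns need to be materialized."""
    return -math.log(probs[class_index])

probs = [0.1, 0.7, 0.2]        # made-up softmax output for 3 classes
print(cce([0, 1, 0], probs))   # same value...
print(scce(1, probs))          # ...without building the one-hot vector
```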



Local Minima Vs Global Minima :
  • Both Global Minima & Local Minima are related to loss functions only
  • Global Minima
    • The lowest possible value of the loss function across the entire data set
    • This value is the absolute best for the model
    • Point 3 in the below image is the Global Minima
  • Local Minima 
    • Consider 100 iterations of our model as shown in the below image
    • During iteration 4, if we stop the model thinking that it has reached the minimum loss, then it is a trap. This is nothing but a Local Minima
    • We could land in this trap and simply stop the model at iteration L4. But look at the values at L55, L68, L79: the loss reduced a lot compared with L4.
    • We have to stop the iterations at L79, which is at the Global Minima.


More image showing Local & Global Minima :



The number of iterations to run the model is also a hyperparameter. We can decide on it based on the output from the Optuna library. 


**** Note *****
We have completed below 2 main steps in Neural Networks:
  • Activation Functions
  • Loss Functions
Now we are going to discuss Backward Propagation.



Backward Propagation : 
            Backward propagation is the mechanism a neural network uses to learn from its mistakes. It tells the network how much each weight contributed to the error, so those weights can be corrected.
  • Internally, during backward propagation, an algorithm runs which is called Gradient Descent

Algorithms used in backward propagation:
  • Gradient Descent (GD)
  • Batch Gradient Descent (BGD)
  • Stochastic Gradient Descent (SGD)
  • Mini Batch Gradient Descent (MBGD)
  • Optimizers (Momentum, Adagrad, RMS prop, Adam)

Gradient Descent (GD) Algorithm : 'Gradient' means we are going to calculate derivatives. 'Descent' means landing slowly, by adjusting the required parameters. We are going to minimize the loss by adjusting weights and biases. Not immediately, but gradually, it keeps touching down until the loss is minimal. This is the main goal of the Gradient Descent algorithm.

The maths behind gradient descent is W_new = W_old − η · ( ∂L / ∂W_old )

where W_new is the new weight, W_old is the old weight, η is the learning rate and ( ∂L / ∂W_old ) is the derivative of the loss with respect to the old weight. 

This is applicable for biases as well: B_new = B_old − η · ( ∂L / ∂B_old )
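The update rule above can be seen in action on a toy one-dimensional loss (a minimal sketch; the loss function and values are made up for illustration):

```python
# Gradient Descent on a toy loss L(w) = (w - 3)**2, whose minimum is at w = 3.
# Its derivative is dL/dw = 2 * (w - 3).
w = 0.0            # W_old: arbitrary starting weight
lr = 0.1           # learning rate (eta)

for step in range(100):
    grad = 2 * (w - 3)    # dL/dW_old
    w = w - lr * grad     # W_new = W_old - eta * (dL/dW_old)

print(round(w, 4))        # 3.0 -> converged to the minimum
```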


As per the above image, we have a set of input data x1, x2 with the actual value/output y.
  • For record 1, x1=80,  x2=8 and y=3
  • For record 2, x1=60, x2=9 and y=5
Consider a single hidden layer with 2 neurons h1, h2 and an output layer with one output neuron. After that, we have to calculate Ŷ & the loss L.

Now, the weights of the connections are represented as W₁₁¹, W₁₂¹, W₂₁¹, W₂₂¹, W₁₁², W₂₁²

During forward propagation, we have 2 functions: linear transformation & activation function. If we calculate them for both h1, h2, then
  • h1 ---> Z₁ | Af = O₁₁
  • h2 ---> Z₂ | Af = O₁₂
And the output Ŷ is O₂₁.

Based on above diagram, Linear transformation equation for 
Y^ = W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁ (assuming b from o/p layer as b₂₁)


Assume the biases of h1, h2 are b₁₁, b₁₂. Please see the above image for clarity. Also, assuming there are no outliers in the data, the recommended loss function for a regression problem is MSE.

MSE(Mean Squared Error) = (Y − Ŷ)²

Now, lets see Backward Propagation.

Loss ---> Ŷ 

Now let's understand what parameters we need to calculate Ŷ. If you carefully build the below graph, then you can calculate everything required for back propagation based on it. 


Let's start executing the backward propagation algorithm. 

From the above graph, x1, x2 are fixed values, but all weights and biases will change. 

So, the below parameters are changing (as per the above image):
  • W₁₁², W₂₁², b₂₁ ===> Step1 calculation
  • W₁₁¹, W₂₁¹, b₁₁ ===> Step2 calculation
  • W₁₂¹, W₂₂¹, b₁₂ ===> Step3 calculation
Also 
  • O₁₁ = Af(W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁)
  • O₁₂ = Af(W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂)


Step1 calculation :  Apply the chain rule

dL/dW₁₁²  = (dL/dŶ) * (dŶ/dW₁₁² )
dL/dW₂₁² = (dL/dŶ) * (dŶ/dW₂₁² )
dL/db₂₁ = (dL/dŶ) * (dŶ/db₂₁ )


Now, lets see how to calculate :

dL/dW₁₁²  = (dL/dŶ) * (dŶ/dW₁₁² )
                   
We need to calculate (dL/dŶ) and (dŶ/dW₁₁² )

  • Note : as we applied MSE as the loss function, we need to use Loss = (Y − Ŷ)²
Keep in mind : we need the power (chain) rule; consider Y a constant, and note that we are differentiating the loss with respect to Ŷ, so d/dŶ(Ŷ) is 1
d/dx (1 − x)²
= 2(1 − x) · d/dx(1 − x)
= 2(1 − x) · (−1)

Then dL/dŶ = d/dŶ(Loss)
                     = d/dŶ  (Y − Ŷ)²
                     =  2((Y − Ŷ)) * d/dŶ(Y − Ŷ)
                     = 2((Y − Ŷ)) * (d/dŶ(Y) - d/dŶ(Ŷ))
                     = 2((Y − Ŷ)) * (0 - 1)
                     = 2((Y − Ŷ)) * (- 1)
                     = -2(Y − Ŷ)

Now 2nd part of equation, dŶ/dW₁₁² .

dŶ/dW₁₁² = d/dW₁₁²(Ŷ) but Ŷ = W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁
                  = d/dW₁₁² (W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁)
                  = O₁₁

Hence dL/dW₁₁²  = (dL/dŶ) * (dŶ/dW₁₁² ) = -2(Y − Ŷ) * O₁₁


Now solving the next equation, dL/dW₂₁² = (dL/dŶ) * (dŶ/dW₂₁²); we already know the value of (dL/dŶ), so just calculate (dŶ/dW₂₁²).

dŶ/dW₂₁²  = d/dW₂₁² (W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁) = O₁₂

So, dL/dW₂₁² = (dL/dŶ) * (dŶ/dW₂₁² ) = -2(Y − Ŷ) O₁₂


Now solving the 3rd equation, dL/db₂₁ = (dL/dŶ) * (dŶ/db₂₁); we already know the value of (dL/dŶ), so just calculate (dŶ/db₂₁).

dŶ/db₂₁ = d/db₂₁(W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁) = 1

So, dL/db₂₁ = (dL/dŶ) * (dŶ/db₂₁ ) = -2(Y − Ŷ) * 1 = -2(Y − Ŷ) 
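The Step1 results can be sanity-checked numerically in plain Python. This is a minimal sketch with made-up values, treating the activation as identity so that O₁₁ and O₁₂ are fixed numbers:

```python
# Numeric check of Step1: dL/dW11_2 = -2 * (Y - Y_hat) * O11.
# All values below are made up for illustration.
O11, O12 = 0.6, 0.4
W11_2, W21_2, b21 = 0.5, -0.3, 0.1
Y = 1.0

def y_hat(w11_2):
    # linear transformation of the output layer
    return w11_2 * O11 + W21_2 * O12 + b21

def loss(w11_2):
    # MSE for one record
    return (Y - y_hat(w11_2)) ** 2

# Analytic gradient from the chain rule derived above
analytic = -2 * (Y - y_hat(W11_2)) * O11

# Finite-difference approximation of the same derivative
h = 1e-6
numeric = (loss(W11_2 + h) - loss(W11_2 - h)) / (2 * h)

print(abs(analytic - numeric) < 1e-6)   # True: the chain rule checks out
```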





Step2 calculation : 
  • W₁₁¹, W₂₁¹, b₁₁ ===> Step2 calculation
Note :
  • O₁₁ = Af(W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁)
  • O₁₂ = Af(W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂)
  • (For simplicity, the activation Af is treated as identity below, so O₁₁ is just the linear transformation; with a real activation, its derivative would also multiply into the chain.)
dL/dW₁₁¹ = (dL/dŶ) * (dŶ/dO₁₁) * (dO₁₁/dW₁₁¹) = -2(Y − Ŷ) * W₁₁² * x1 
dL/dW₂₁¹ =  (dL/dŶ) * (dŶ/dO₁₁) * (dO₁₁/dW₂₁¹) = -2(Y − Ŷ) * W₁₁² * x2  
dL/db₁₁ = (dL/dŶ) * (dŶ/dO₁₁) * (dO₁₁/db₁₁) = -2(Y − Ŷ) * W₁₁²  * 1 = -2(Y − Ŷ) * W₁₁²

dŶ/dO₁₁ = d/dO₁₁ (W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁) = W₁₁²

dO₁₁/dW₁₁¹ = d/dW₁₁¹ (O₁₁) = d/dW₁₁¹ (W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁) = x1
dO₁₁/dW₂₁¹ = d/dW₂₁¹ (O₁₁) = d/dW₂₁¹ (W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁) = x2

dO₁₁/db₁₁ =  d/db₁₁ (O₁₁) = d/db₁₁ (W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁) = 1




Step3 calculation :
  • W₁₂¹,  W₂₂¹, b₁₂ ===> Step3 calculation
Note :
  • O₁₁ = Af(W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁)
  • O₁₂ = Af(W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂)
dL/dW₁₂¹ = (dL/dŶ) * (dŶ/dO₁₂) * (dO₁₂/dW₁₂¹) = -2(Y − Ŷ) * W₂₁²  * x1
dL/dW₂₂¹ =  (dL/dŶ) * (dŶ/dO₁₂) * (dO₁₂/dW₂₂¹) = -2(Y − Ŷ) * W₂₁²  * x2
dL/db₁₂ =  (dL/dŶ) * (dŶ/dO₁₂) * (dO₁₂/db₁₂ ) = -2(Y − Ŷ) * W₂₁²  * 1 = -2(Y − Ŷ) * W₂₁² 

Common term :
dŶ/dO₁₂ = d/dO₁₂ (Ŷ) = d/dO₁₂(W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁) = W₂₁² 

Individual terms :
dO₁₂/dW₁₂¹ = d/dW₁₂¹ (W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂) = x1
dO₁₂/dW₂₂¹ = d/dW₂₂¹ (W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂) = x2
dO₁₂/db₁₂  =  d/db₁₂  (W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂) =  1


That's how we calculate all the parameters, simply by looking at the above image/graph. 😁
This looks complex, but if you refer to the above image & graph and go step by step, it becomes easy.


Problem with the Gradient Descent algorithm : it might end up at a local minimum. If you observe the below image, we have 2 valleys, but our Gradient Descent algorithm ended up at the local minimum (the mint-colored one on your right-hand side). This is the issue with plain Gradient Descent: it is the first version of the algorithm. It's like still using ChatGPT-1 when the latest version is available.



Next versions of GD algorithm are :
  • Batch Gradient Descent (BGD)
  • Stochastic Gradient Descent (SGD)
  • Mini Batch Gradient Descent (MBGD)
Let's see what the above algorithms are and how they help prevent the local minima issue.

  • Batch Gradient Descent (BGD)
    • It takes the entire data set (all 'n' records in a single shot)
    • Initialize random weights & biases
    • Linear transformation
    • Apply the activation function
    • Calculate the loss for all data points (Average Loss)
    • Update weights & biases once for the entire data set (for all the data points)
    • Disadvantages
      • More memory usage, as we need to fit the entire data set into memory for the calculation
      • It takes more time to process
  • Stochastic Gradient Descent (SGD)
    • Instead of the entire data set, it considers one randomly chosen sample (one data point at a time)
    • Initialize random weights & biases
    • Linear transformation
    • Apply the activation function
    • Calculate the loss for that sample data point
    • Update weights & biases for that one data point
    • Disadvantages
      • On one hand this is good, in terms of speed
      • But on the other hand, we can't judge the entire data set based on the result of one data point
      • Overfitting can happen, due to more iterations and weight & bias adjustments
  • Mini Batch Gradient Descent (MBGD)
    • Assume we have 100 records; MBGD will split the entire data set into 4 batches with 25 records in each batch
      • Batch1(25), Batch2(25), Batch3(25), Batch4(25)
    • At a time, it considers ONLY one batch, i.e. Batch1 
      • Initialize random weights & biases
      • Linear transformation
      • Apply the activation function
      • Calculate the loss for ONLY Batch1
      • As per that loss, adjust weights & biases
      • Note that the batch size is also a hyperparameter; Optuna will help here
        • The usual recommendation for batch size is 16, 32, 64, 128, 256 etc. This comes from research. In case we are not clear on how to choose a batch size, please use Optuna from Python. It will give us the best hyperparameters to design the model. More about this in future blogs ✌
    • Pros & cons
      • It is a good approach compared with BGD & SGD
      • A kind of hybrid approach, in between BGD & SGD
      • More parallelism 
      • Slower than SGD per update, because it considers more samples
In terms of
  • speed: SGD > MBGD > BGD
  • memory utilization: SGD < MBGD < BGD
When we design an application/model it should be both fast and memory efficient, so the preferred approach is MBGD. In real time, in most cases, people use MBGD (Mini Batch Gradient Descent).
  • "+" is minimal loss
  • MBGD seems to be moderate algorithm for Gradient Descent
  • Observe that BGD doesn't have much learning
  • Observe SGD have too many iterations, with more noise instead of actual learning

Observe this analogy; it will help you decide which GD algorithm to select:
  • Assume you are walking down from the tip of a mountain
  • BGD (stable, but time-consuming because every decision is slow and needs full analysis)
    • it scans the entire mountain, then takes one perfect step in the correct direction
  • SGD (very fast, but with a lot of zig-zag; it keeps changing direction)
    • you look at only one nearby rock, guess the direction and step immediately
  • MBGD (takes a decent amount of time to decide, then takes the right step)
    • you look at a set of neighbourhood rocks, then you decide the step
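The three variants above can be sketched in a few lines of plain Python. The helper `gd_epoch` and the toy data are illustrative (not from any library); `batch_size` selects the variant: the whole data set gives BGD, 1 gives SGD, and anything in between gives MBGD.

```python
import random

random.seed(42)  # for reproducibility of the shuffles

# Hypothetical helper: one epoch of gradient descent on y = w*x with
# squared loss. batch_size = len(data) -> BGD, 1 -> SGD, in between -> MBGD.
def gd_epoch(data, w, lr, batch_size):
    random.shuffle(data)                      # SGD/MBGD pick samples randomly
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average gradient of 1/2*(w*x - y)^2 over the batch: (w*x - y)*x
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                        # one weight update per batch
    return w

# Toy data generated from y = 3x, so the best weight is w = 3
data = [(x, 3 * x) for x in range(1, 11)]

w_bgd  = gd_epoch(list(data), w=0.0, lr=0.01, batch_size=len(data))  # 1 update per epoch
w_sgd  = gd_epoch(list(data), w=0.0, lr=0.01, batch_size=1)          # 10 updates per epoch
w_mbgd = gd_epoch(list(data), w=0.0, lr=0.01, batch_size=5)          # 2 updates per epoch
```

After one epoch on this toy problem, SGD has taken the most update steps and BGD the fewest, with MBGD sitting in between, which mirrors the speed vs. stability trade-off described above.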

The core problem with the Gradient Descent algorithm is Local Minima. To avoid this problem, some boosters are added to it, which are called Optimizers. They add more power to the Gradient Descent variants above.
  • BGD + Optimizers
  • SGD + Optimizers
  • MBGD + Optimizers
Popular optimizers are Momentum, NAG, Adagrad, RMSprop and Adam.
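As a small taste of what optimizers do: the classic Momentum update keeps a running "velocity" of past gradients, so a step can roll through flat regions and shallow local minima instead of stalling. A minimal sketch (illustrative values, not a library API):

```python
# Minimal sketch of the Momentum update rule (illustrative, not a library API):
#   v = beta * v + grad      (accumulate a running "velocity" of past gradients)
#   w = w - lr * v           (step along the velocity, not the raw gradient)
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad
    w = w - lr * v
    return w, v

# Minimise the toy objective f(w) = w**2, whose gradient is 2*w
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
# w has now rolled down close to the minimum at 0
```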



How to prevent Overfitting in Deep Learning/Neural Networks ?

First, let's understand the concept of Overfitting properly....
  • Whenever we start building an ML model, we split the entire data into
    • Train Data (loss is low, accuracy is high)
    • Test Data (loss is very high, accuracy is low)
  • Our ML model learns hidden patterns from the "Train Data" and works perfectly on it. Loss will be low and accuracy will be very high on "Train Data".
  • But if I run the same model on "Test Data", the loss will be very high and the accuracy will be very low. This is called Overfitting.

The image above represents the Overfitting scenario.

    
  • In real time, we should build ML models like either the 1st or 2nd case from the image above; the 3rd case represents the exact Overfitting scenario.
  • Most real-time models will fit the 2nd case, but the 1st one is pretty much perfect.

The model below classifies well; it is a classification problem, for example "whether an email is spam or not spam". It segregates the data well into its categories.

From the below image:
  • A represents Underfitting: the model is not trained well and does not identify the hidden patterns
  • B represents a good/balanced model: it ignores the noise and matches the patterns well
  • C represents Overfitting

                                                    Underfitting Vs Balanced Vs Overfitting 


To prevent Overfitting, we have the following well-proven techniques:
  • Drop Out Layer
  • Early Stopping
  • L2 Regularization
We can use one technique or a combination of them: for example, use "Drop Out Layer", analyze the results, then add "Early Stopping", analyze again, follow with "L2 Regularization", and compare which works best.

Drop Out Layer :

  • Ideally, all the neurons are active across the Neural Network
  • When all neurons are active, all of them memorize things in each iteration, there is too much noise, and this leads to OVERFITTING.
  • Dropout randomly switches off some nodes in each iteration (say 20% of the neurons)
    • During the 1st iteration one random 20% of neurons is switched off; during the 2nd iteration a different random 20% is switched off (the neurons dropped in the 1st iteration are active again in the 2nd). In every iteration, 20% of neurons are off and the remaining 80% are active.
    • Because a different subset is active in each iteration, the model covers different patterns, which helps it generalize
  • The image above shows a drop out layer applied to a hidden layer
  • The model automatically identifies which neurons to switch off and turns them back on once the iteration is done; we just need to specify the Drop Out Rate.
    • if drop out rate = 0.5 then 50% of the neurons are switched off
    • this happens during both forward propagation and backward propagation
    • the output from a switched-off neuron is '0', so during backward propagation that neuron is skipped and it does not contribute to the Vanishing Gradient problem.
  • The Drop Out Layer is applied only during "Training", not during "Testing" or in "Production". DO NOT APPLY THE DROP OUT LAYER while testing the data.
  • In testing, all neurons are active.

Programmatically, one line takes care of everything (PyTorch example):
nn.Dropout(0.3)  ## nn is torch.nn; 0.3 means 30% of neurons will be switched off during training
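Under the hood, a dropout layer behaves roughly like the pure-Python sketch below. This assumes the common "inverted dropout" convention, where the surviving activations are scaled by 1/(1 - rate) so the expected activation stays the same; the helper itself is illustrative, not a library function.

```python
import random

# Illustrative inverted-dropout sketch: each activation is zeroed with
# probability `rate`; survivors are scaled by 1/(1-rate) so the expected
# value of the output matches the input. At test time it is a no-op.
def dropout(activations, rate, training=True):
    if not training:                 # testing/production: all neurons active
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0
            for a in activations]

random.seed(0)
out = dropout([1.0] * 1000, rate=0.3)
zeroed = sum(1 for a in out if a == 0.0)   # roughly 30% of the 1000 neurons
```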


The image above tells us that the error rate is lower with a drop out layer.



Early Stopping :
  • This is another technique to prevent Overfitting.
  • We stop "Training" the model when it stops improving on the "Validation" data (held-out data), even if the Train loss is still improving.
    • Example (7 iterations):
      • Train Loss = 0.989    Test Loss = 0.899
      • Train Loss = 0.976    Test Loss = 0.897
      • Train Loss = 0.932    Test Loss = 0.896
      • Train Loss = 0.854    Test Loss = 0.742
      • Train Loss = 0.767    Test Loss = 0.736 (Patience window starts)
      • Train Loss = 0.632    Test Loss = 0.735999
      • Train Loss = 0.542    Test Loss = 0.735999
    • There is a parameter called Patience: once the validation loss stops improving for that many iterations, the model stops training here, to avoid Overfitting.
    • Patience is the Hyper parameter here
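The stopping logic can be sketched as a patience counter over the validation losses. The helper below is illustrative, and `min_delta` is an assumed threshold for what counts as a "real" improvement:

```python
# Illustrative early-stopping sketch (not a library API): stop once the
# validation loss has failed to improve by min_delta for `patience` epochs.
def early_stop_epoch(val_losses, patience=2, min_delta=1e-4):
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:      # meaningful improvement: reset counter
            best = loss
            bad_epochs = 0
        else:                            # no real improvement this epoch
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch             # stop training here
    return len(val_losses)

# Validation losses from the 7-iteration example above: improvement stalls
losses = [0.899, 0.897, 0.896, 0.742, 0.736, 0.735999, 0.735999]
stopped_at = early_stop_epoch(losses, patience=2)   # stops at epoch 7
```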

Some important notations :
  • TP - True Positive : Model predicted YES, actual is also YES
  • TN - True Negative : Model predicted NO, actual is also NO
  • FP - False Positive : Model predicted YES, actual is also NO
  • FN - False Negative : Model predicted NO, actual is also YES
Based on above notations, we will calculate accuracy.

Accuracy = (TP + TN)/(TP+TN+FP+FN)

For example, TP =70, TN = 20, FP = 5, FN = 5, then Accuracy = (70+20)/(70+20+5+5) = 0.9 (90%)
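The formula above translates directly to code:

```python
# Accuracy from the confusion-matrix counts: (TP + TN) / (TP + TN + FP + FN)
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

acc = accuracy(tp=70, tn=20, fp=5, fn=5)   # the worked example: 90/100 = 0.9
```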



L2 Regularization :

  • This is a lengthy and slightly confusing topic, the last technique we cover to prevent Overfitting
  • Let's see how it avoids Overfitting
  • Regularization means controlling something with some rules. Isn't it ?
  • We punish the model if the weights become too large
    • So the model is forced to
      • Keep weights small and simple
      • Avoid memorizing
      • Generalize more
  • The Loss above could be a regression loss or a classification loss; see the loss function formulas earlier in this blog
  • The second part of the function (everything except the loss) is the PENALTY TERM, which controls the weights
    • λ is a hyper parameter and should be balanced (otherwise we end up with either Overfitting or Underfitting)
    • if λ = 0 then Cost Function = Loss Function (so it shouldn't be 0)
We know that W_new = W_old − η · ( ∂L / ∂W_old ) ======> Equation1

With L2 regularization the loss becomes L′ = L + (λ / 2) Σ Wᵢ²

∂L′/∂W_old = ∂/∂W_old ( L + (λ / 2) Σ Wᵢ² )
           = ∂L/∂W_old + (λ / 2) · 2 W_old
           = ∂L/∂W_old + λ · W_old

Substituting this gradient into Equation1:

W_new = W_old − η · ( ∂L′ / ∂W_old )
      = W_old − η · ( ∂L/∂W_old + λ · W_old )
      = W_old − η · λ · W_old − η · ∂L/∂W_old
      = W_old (1 − η λ) − η · ∂L/∂W_old  ======> Equation2

Please note that (1 − η λ) is the Weight Decay factor: it penalizes the larger weights and keeps them under control, which prevents Overfitting.

Let us consider Ŷ = WX

Training point: X = 1, Y = 0, and the weight is randomly initialized to W = 2.

Loss function (regression): L = 1/2 (Ŷ − Y)²

Since Ŷ = W·X with X = 1 and Y = 0, we get Ŷ = W, so

Data Loss = 1/2 (Ŷ − Y)² = 1/2 (W − 0)² = 1/2 W²

With W = 2 (randomly initialized), dL/dW = d/dW (1/2 W²) = W = 2

Plain gradient descent update, assuming η = 0.1:
W_new = W_old − η · ( ∂L / ∂W_old )
      = 2 − (0.1) · (2)
      = 2 − 0.2
      = 1.8 (see, the weight value reduced)

Now include the L2 term via Equation2, taking λ = 0.1 as an example:
W_new = W_old (1 − η λ) − η · ∂L/∂W_old
      = 2 · (1 − 0.01) − 0.2
      = 1.98 − 0.2
      = 1.78 (even smaller than the plain update)

This is how weights are shrunk by L2 regularization.
    So, using L2 regularization, we penalize the larger weights via "Weight Decay". When the weights stay balanced, the Overfitting problem is automatically reduced.
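The single-step arithmetic of Equation2 can be double-checked in a few lines; η = 0.1 as before, and λ = 0.1 here is just an example value:

```python
# Numeric check of the L2 update (Equation2): W_new = W*(1 - lr*lam) - lr*dL/dW
# Toy setup: X = 1, Y = 0, so L = 1/2 * W**2 and dL/dW = W.
W, lr, lam = 2.0, 0.1, 0.1          # lam = 0.1 is just an example value

grad = W                             # dL/dW for this loss
plain_update = W - lr * grad                     # vanilla GD step
l2_update    = W * (1 - lr * lam) - lr * grad    # step with weight decay
```

The L2 step always lands on a smaller weight than the plain step, which is exactly the "weight decay" effect described above.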


    Learning Rate ( η ):
    • This is a Hyper parameter used while adjusting weights and biases
    • Based on experiments, η = 0.1, 0.01, 0.001, 0.0001 are good values to choose from
    • If η is too low, like 0.0000001, the model takes lots of tiny steps, so training takes too long and is too slow. If η is too high, like 10, it takes big jumps, overshoots and zig-zags around the minimum instead of settling into it. Hence the learning rate shouldn't be too large or too small; it must be moderate, which is why most research recommends the values above.
    • First, manually train your model with a few different small learning rates and compare them; if needed, use OPTUNA at the end to come up with a balanced learning rate for your model.
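The effect of η can be demonstrated on the toy objective f(w) = w², whose gradient is 2w (an illustrative sketch, not tied to any framework):

```python
# Sketch: minimising f(w) = w**2 (gradient 2*w) with different learning rates.
def run_gd(lr, steps=50, w=5.0):
    for _ in range(steps):
        w = w - lr * 2 * w          # standard GD update on this toy objective
    return w

w_good = run_gd(lr=0.1)        # shrinks steadily toward the minimum at 0
w_tiny = run_gd(lr=1e-7)       # barely moves: training would take forever
w_huge = run_gd(lr=1.5)        # overshoots every step: |w| blows up
```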

    When you call an Optimizer, you have to pass the value for the learning rate, η.

    We will see Optimizers in detail in the next blog. In real time we use Optimizers, i.e. we take the help of these boosters.


    Thank you for reading this blog !
    Arun Mathe
