
(AI Blog#3) Deep Learning Foundations - Activation & Loss Functions, Gradient Descent algorithms & Optimization techniques

Deep knowledge of the fundamentals is extremely important while designing a machine learning model; otherwise we will end up creating ML models that are of no use. We need a clear understanding of certain techniques to confidently build an ML model, train it using "training data", finalize the model and deploy it to production. So far, in blogs #1 and #2, we have seen the fundamentals of Deep Learning and Neural Networks: the architecture of a neural network, its internal layers and components, etc. 

Providing the links to Blogs #1 and #2 below for quick reference.

Deep Learning & Neural Networks : https://arunsdatasphere.blogspot.com/2026/01/deep-learning-and-neural-networks.html

Building a real world neural network: A practical usecase explained : https://arunsdatasphere.blogspot.com/2026/01/building-real-world-neural-network.html

Now let's dive into the below concepts/criteria, which will help you gain confidence in building your ML model:

  • Activation Functions (Forward Propagation)
    • ReLu, Leaky ReLu, Parametric ReLu
    • Sigmoid
    • Tanh
    • ELU(Exponential Linear Unit), SeLu(Scaled Exponential Linear Unit) - comparatively newer and less commonly used in production.
    • GELU(we will see this in LLM's; it is mostly used in LLMs/Transformers rather than in plain feed-forward networks)
  • Loss Functions
    • Regression
      • MSE
      • MAE
    • Classification
      • BCE
      • CCE
      • SCCE
  • Backward propagation
    • Gradient Descent (GD)
    • Batch Gradient Descent (BGD)
    • Stochastic Gradient Descent (SGD)
    • Mini Batch Gradient Descent (MBGD)
    • Optimizers (Momentum, Adagrad, RMS prop, Adam)
  • Overfitting & Underfitting
  • Vanishing & Exploding Gradients
  • Optimisation Techniques 


Activation Functions 

                In real time, we are going to deal with complex data which is non-linear (for example, lists and tuples are linear data structures, while trees, graphs etc. are non-linear), and that complex data will have some hidden patterns. In Deep Learning/Neural Networks we deal with such complex data, and these neural networks are meant for non-linear data. We need some non-linear functions in neural networks to add non-linearity to the model, and such functions are called Activation Functions. 

The main usage of an activation function is to help the model identify the hidden patterns in the given input data.

Different types of activation functions:

  • Sigmoid 
  • Tanh
  • ReLu
    • Leaky ReLu
    • Parametric ReLu
    • ELU & SeLu
  • GeLu (mostly used in LLMs/Transformers, rarely in plain feed-forward networks)
All the above activation functions are non-linear. They work on non-linear data. 

Each activation function should satisfy certain criteria which we should consider while designing a machine learning model; otherwise you will end up creating a model which is not strong enough in terms of performance and expectation:
  • It should be non-linear activation function
  • It should be differentiable; let's understand this point as below
    • In neural networks, we have 4 steps: forward propagation, loss calculation, backward propagation, and adjusting weights and biases
      • Linear transformation, z = w1x1+w2x2+....+wnxn
      • Activation function = A(z); if this is not differentiable, then there is no learning at all
    • During backward propagation, we need to calculate gradients (nothing but derivatives in maths)
    • While doing that, we need to apply derivatives to the activation function as well; for this reason the activation function we use must be differentiable
  • It should be computationally inexpensive; let's understand this point as below
    • For example, GPT-3.x used around ~175 billion parameters (nothing but weights & biases); think about the complexity of that model if the activation function formula were complex instead of simple
    • Hence the activation function must be computationally inexpensive
  • It should be zero-centered 
    • It has to consider both positive & negative values while building the ML model
    • It should treat the input data in a balanced way
  • It should be non-saturating
    • A non-saturating activation function is one whose output keeps changing as the input changes, instead of getting stuck at a fixed value.
    • That means the neuron continues to respond when input changes
    • A non-saturating activation function allows the neuron’s output and gradient to keep changing with the input, preventing vanishing gradients and enabling effective learning during back propagation.


Please try to understand the below program :

import torch

x = torch.tensor(0.0, requires_grad=True) # Enabled back propagation

y1 = torch.sigmoid(x) # Using Sigmoid activation function
y2 = torch.tanh(x) # Using Tanh activation function

print(y1)
# tensor(0.5000, grad_fn=<SigmoidBackward0>)
print(y2)
# tensor(0., grad_fn=<TanhBackward0>)

"""
Whatever be the value of x, during backpropagation, range of values for :

Sigmoid is [0, 1]
Tanh is [-1, 1]

Points to remember :
1. Understand that if we use Sigmoid or Tanh as activation functions in the hidden layers of a neural network then outputs are bounded between a range
as shown above.
2. Hence they are not recommended to use in the hidden layers of neural network.
3. Incase if the x value is 0.0 for tanh, output is 0 (as shown in above program)
Loss = W_old - learning rate(dL/dW_old)
= W_old - (0.1)(0) ; Hence dL/dw_old is '0' when we use tanh for
                input value '0'
= W_old - 0
= W_old

4. Hence new value = old value, hence no learning in ML model.
"""



Note that we apply activation functions at both the hidden layers and the output layer of a neural network. We will see later in the blog which activation functions to use where. Generally, we don't use Sigmoid or Tanh at the hidden layers for the above reason, but we do use Sigmoid at the output layer for binary classification, and other activation functions like Softmax at the output layer for multi-class classification.

Just keep in mind the above 5 points; they are the criteria to decide the activation function in our ML models. 


Overfitting Vs Underfitting 

  • Let's assume we have 1000 records of input data; we have to divide them into "Train Data" & "Test Data"
    • Assume : 70% is Train Data, 30% is Test Data (we use 70% of the input data to train our ML model and 30% to test it)
    • This means our ML model is going to learn hidden & complex patterns from the "Train Data"
    • "Test Data" is unseen data
    • Based on its training, your model is going to evaluate the patterns in the "Test Data"



Overfitting  - Model performs well on "Train Data" but not well on  "Test Data"

Underfitting - Model won't perform well on either "Train" or "Test" data. Basically, it didn't train well on the complex "Train" data.

  • From the above diagram:
    • The straight line represents underfitting: the model is too simple and couldn't capture the true relationship in the input data; it didn't learn enough
    • The smooth curve is an example of a good fit: it ignores a bit of noise but captures the overall trend (the real pattern), with the right level of complexity and low training error
    • The very wavy curve represents overfitting: it connects almost all the data points, is too complex and learns too much noise; it memorised instead of understanding.



Note : Overfitting memorises noise, underfitting ignores patterns, but a good fit captures the patterns.

How to avoid Overfitting ?

  • We need to use techniques like the below (which we are going to discuss further in this blog)
    • Dropout layer
    • L1 & L2 regularization
    • Early stopping 
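As a quick illustration of early stopping, here is a minimal plain-Python sketch; the validation-loss values below are made up purely for this example:

```python
# Minimal early-stopping sketch: stop training when the validation loss
# hasn't improved for `patience` consecutive epochs.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.51, 0.52, 0.53]  # made-up values

patience = 2
best_loss = float("inf")
bad_epochs = 0
stopped_at = None

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss = loss        # improvement: remember it, reset the counter
        bad_epochs = 0
    else:
        bad_epochs += 1         # no improvement this epoch
        if bad_epochs >= patience:
            stopped_at = epoch  # stop before the model starts memorizing noise
            break

print(stopped_at, best_loss)    # stops at epoch 5, best loss 0.50
```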
The below diagram represents the underfitting scenario :
  • No need to go until testing; we can identify the underfitting scenario during training itself, as the loss stays almost the same across all iterations 


Vanishing Gradient & Exploding Gradient
                        If the gradient values are too small, then the old and new weights are almost identical (during back propagation), with no learning. This is called the Vanishing Gradient problem. If the gradient values are too large, then we end up with the Exploding Gradient problem.

  • As per the below diagram
    • We have an input layer with input variables x1, x2
    • A hidden layer with 3 neurons h1, h2, h3
    • An output layer with one output neuron Zf
    • Consider w1_11 as the weight of the connection from x1 to h1
    • Output of 
      • h1 is O11
      • h2 is O12
      • h3 is O13
  • During backward propagation, let's assume that we need to find the derivative of the loss with respect to w1_11
    • Then the path of the loss calculation would be as mentioned in the below image.
    • Loss --> Y^ --> Zf --> O11 --> w1_11 (dL/dw1_11)


  • As per the above backward propagation order, 
    • the derivative of Loss with respect to w1_11 is nothing but
    • the chain rule: dL/dw1_11 = (dL/dY^)*(dY^/dZf)*(dZf/dO11)*(dO11/dw1_11)
    • Assume the following values for the above formula
      • dL/dw1_11 = (0.0001)*(0.000001)*(0.001)*(0.004) = 4E-16 (almost 0)
    • w1_11(new) = w1_11(old) - (learning rate) * (dL/dw1_11)
    • Assume the learning rate is 0.1 and w1_11(old) = 0.8
    • w1_11(new) = 0.8 - (0.1)*(4E-16) ≈ 0.8
  • Observe carefully that w1_11(new) is almost equal to w1_11(old), which is 0.8
    • if both are equal, is there any learning ?
  • This means that if the derivative values are too small (as mentioned above), then the old and new weights of the same connection are almost equal
    • w1_11(new) ≈ w1_11(old)
    • This is called the Vanishing Gradient problem
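The same arithmetic can be checked in plain Python, using the assumed derivative values from above:

```python
# Reproducing the chain-rule arithmetic: the four partial derivatives are
# the assumed values from the example above.
grads = [0.0001, 0.000001, 0.001, 0.004]

dL_dw = 1.0
for g in grads:
    dL_dw *= g                 # chain rule: multiply the partials together

learning_rate = 0.1
w_old = 0.8
w_new = w_old - learning_rate * dL_dw

print(dL_dw)                           # about 4e-16, practically zero
print(abs(w_new - w_old) < 1e-12)      # True: the update is too small to matter
```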

Note : Hence we need to select the activation function very carefully; otherwise we will end up with the Vanishing Gradient problem.

Exploding gradient 

  • Let's say
    • W1_11(old) = 0.8
    • W1_11(new) = 8689.89
  • Look at the difference between the old and new weights of the same connection
  • This is called the Exploding Gradient problem
Now, if we have to create a good ML model, the gradients must stay in between: neither vanishing nor exploding. There is no fixed criterion; we have to interpret based on the data. We just have to make sure that our ML model doesn't lead to either the Overfitting/Underfitting issue or the Vanishing/Exploding gradient issue.

So far, we have discussed the criteria for selecting an activation function, the Overfitting/Underfitting issue & the Vanishing/Exploding gradient issue.

Now, let's go to the actual activation functions.


Activation Functions

  • Sigmoid
    • It comes from the machine learning algorithm called Logistic Regression; it converts any real number into a probability (like tossing a coin, which gives heads or tails)
    • It is a curved line, hence satisfying non-linearity 
    • Any output value
      • > 0.5 will be considered as 1
      • < 0.5 will be considered as 0
    • Hence the range of Sigmoid is (0, 1)


Programmatically showing the range of Sigmoid, i.e. (0, 1):

import torch

z = torch.tensor(10.0, requires_grad=True)
y = torch.sigmoid(z)
print(y) # tensor(1.0000, grad_fn=<SigmoidBackward0>)

z5 = torch.tensor(-10.0, requires_grad=True)
y3 = torch.sigmoid(z5)
print(y3) # tensor(4.5398e-05, grad_fn=<SigmoidBackward0>)


The formula of Sigmoid is σ(z) = 1/(1 + e^(-z)).

Similarly, the derivative of Sigmoid can be represented as below:


Note : Sigmoid is differentiable (Hence satisfying 2nd criteria of activation functions)

dσ(z)/dz = σ(z) · (1 − σ(z))

Where σ is the representation of Sigmoid.
  • Sigmoid is computationally expensive due to the exponential in its formula; calculating the exponent takes a good amount of time.
  • Sigmoid is not zero-centered. Its range is (0, 1). 
  • Sigmoid is saturating 
    • If the linear transformation Z = 10, then σ(10) ≈ 1 (hence the range of Sigmoid is (0, 1))
    • Derivative: dσ(z)/dz = σ(z) · (1 − σ(z)) ≈ 1 · (1 − 1) = 1 · 0 = 0 (leads to the vanishing gradient issue)
    • If the derivative is 0, then it leads to vanishing gradients



  • Tanh
    • Tanh is better than the Sigmoid activation function
    • The formula for Tanh(z) is tanh(z) = (e^z − e^(-z)) / (e^z + e^(-z))
  • Guys, no need to remember all these formulas; programmatically, torch.tanh(z) will do it for us.
  • The graph of Tanh is

  • Pros & cons while satisfying the criteria of an activation function
    • It is non-linear (it's not a straight line)
    • It is differentiable; the formula for the derivative of Tanh(z) is d/dz tanh(z) = 1 − tanh²(z)
    • Tanh is computationally more expensive than Sigmoid because of the extra exponential terms
    • Tanh is zero-centered. It considers both +ve & -ve values, as its range is (-1, 1).
    • Tanh is saturating, so it can lead to the vanishing gradient problem
      • Assume z = 0
        • tanh(0) = 0
        • d/dz tanh(z) at z = 0 is 1 − tanh²(0) = 1 − 0 = 1
      • Assume z = 6
        • tanh(6) ≈ 1
        • d/dz tanh(z) at z = 6 is 1 − tanh²(6) ≈ 1 − 1 = 0
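The saturation of Tanh can be checked in plain Python with the math module:

```python
import math

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)**2."""
    return 1.0 - math.tanh(z) ** 2

print(math.tanh(0.0))         # 0.0
print(tanh_grad(0.0))         # 1.0 (healthy gradient near zero)
print(tanh_grad(6.0) < 1e-4)  # True (saturated: the gradient has almost vanished)
```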

  • ReLu 
    • Rectified Linear Unit is the full form of ReLu
    • Formula: ReLu(z) = max(0, z), where the output range is [0, ∞)
      • if z <= 0, then ReLu(z) is 0 (a -ve value is simply replaced with 0)
      • if z > 0, then ReLu(z) is z (a +ve value simply stays as z)
      • This applies in Forward Propagation
    • The derivative of ReLu, d/dz ReLu(z), is
      • 0 if z <= 0 
      • 1 if z > 0
      • This applies in Backward Propagation
    • Graph of the ReLu activation function
      • See that for all -ve values it is 0, touching the x-axis
    • Criteria of an activation function ?
      • It is non-linear
      • max(0, z) is differentiable wherever z ≠ 0; hence ReLu is partially differentiable
        • at z = 0, it is not differentiable 
      • Computationally inexpensive, as the formula is simple, i.e. max(0, z)
      • It is not zero-centered (it never outputs -ve values)
      • ReLu is non-saturating for z > 0

Finally, is this activation function recommended for the hidden layers of a neural network or not? Yes, because it satisfies most of the criteria of activation functions. It is a good fit for hidden layers.
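A minimal plain-Python sketch of ReLu's forward pass and its derivative:

```python
def relu(z):
    """ReLU forward pass: max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """ReLU derivative used in back propagation (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

print(relu(-3.0), relu(5.0))            # 0.0 5.0 -> negative inputs are clipped
print(relu_grad(-3.0), relu_grad(5.0))  # 0.0 1.0 -> no gradient flows for z <= 0
```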

Now, let's understand a problem called the DYING ReLu problem :

  • ReLu has a problem called the Dying ReLu problem

  • We maintain the same activation function across all neurons in the hidden layers
  • In case ReLu(z) receives a -ve value, the output will be ZERO, and whenever a neuron's output gets stuck at ZERO (its gradient is 0, so its weights stop updating), that neuron is called a DEAD neuron and this situation is called the Dying ReLu problem.
    • Reasons for this problem:
      • High -ve bias ; z = w1x1+w2x2+b (if b is a large -ve value, then z is -ve) and ReLu will return 0
      • High learning rate
        • W_new = W_old - (learning rate) * (dL/dW_old)
        • if the learning rate is too high, the update can push W_new to a large -ve value, making z -ve for every input; the neuron again outputs 0 and stops learning, leading to Dying ReLu
To make it simple, remember that we land in the Dying ReLu problem in the below 2 situations:

  • When the bias is a large negative number
  • When the learning rate is too high
To overcome this problem, another flavour was introduced, called Leaky ReLu.

Leaky ReLu :
  • Leaky ReLu(z) is, where a = 0.01 (researchers finalized this value after experiments)
    • z if z > 0
    • az if z <= 0 (this is the change: we are just multiplying z by "a", whose value is 0.01)
  • Please observe the difference between ReLu and Leaky ReLu in the below diagram
  • Just to avoid '0', we hardcode the slope as 0.01
  • The derivative, d/dz(Leaky ReLu(z)), is
    • 1 if z > 0
    • a if z <= 0 (we keep the neuron alive to avoid the Dying ReLu problem)

Now observe the properties of Leaky ReLu, and whether it satisfies the criteria of activation functions or not :
  • It is non-linear (as we can see from the above graph)
  • It is differentiable (except at z = 0)
  • It is computationally inexpensive
  • It is zero-centered: a*z means (0.01)*z, and z can be any number in (-∞, +∞), so outputs can be both -ve and +ve
  • It is partially non-saturating, not fully non-saturating  
Based on the discussion so far, we can recommend Leaky ReLu as an activation function in hidden layers. But remember, there is still a problem :) 

  • Remember that the value a = 0.01 is fixed; it is not a trainable parameter. This is a static value, but it should ideally be dynamic, shouldn't it? This is the problem with Leaky ReLu.

So one more flavour was introduced, called Parametric ReLu.


Parametric ReLu :
  • Observe the below graph to spot the clear difference between ReLu vs Leaky ReLu vs Parametric ReLu
  • See that Parametric ReLu adjusts the value of 'a' based on the situation, which helps learning
  • PReLu(z) is 
    • z if z > 0
    • az if z <= 0, but 'a' is not constant; it is a learnable parameter. Your model learns it based on the situation, and this is completely taken care of by the ML model. Understand that the control is with your ML model: nothing manual, strictly no hardcoding.

  • It is non-linear
  • It is differentiable
  • It is computationally inexpensive
  • It is zero-centered
  • It is non-saturating

Conclusion :
ReLU completely blocks negative inputs, Leaky ReLU allows a small fixed gradient for negative values, and Parametric ReLU learns the optimal negative slope during training, improving gradient flow and representation capacity.
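The conclusion above can be sketched in plain Python; the PReLU slope value 0.25 below is just an illustrative assumption, since in practice the model learns it during training:

```python
def relu(z):
    # negative inputs are blocked entirely (can lead to dead neurons)
    return max(0.0, z)

def leaky_relu(z, a=0.01):
    # small fixed slope for negative inputs keeps the neuron alive
    return z if z > 0 else a * z

def prelu(z, a):
    # same shape as Leaky ReLU, but `a` is a learnable parameter;
    # here we just pass a made-up value in to show the effect
    return z if z > 0 else a * z

z = -2.0
print(relu(z))           # 0.0   -> blocked
print(leaky_relu(z))     # -0.02 -> fixed small leak
print(prelu(z, a=0.25))  # -0.5  -> slope the model would have learned
```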

How do we decide on an activation function for real-world problems ?
  • We start with ReLu and analyze the output
  • Repeat the process with Leaky ReLu, analyze the output
  • Repeat the process with Parametric ReLu, analyze the output
and then decide on the type of activation function. This is the manual way. We can also decide on hyperparameters based on the test runs (the automatic way of deciding hyperparameters) using a Python package called Optuna. Optuna is an open-source Python library used for Hyperparameter Optimization in Machine Learning and Deep Learning.

With this, we have seen all the required activation functions that we can use during Forward Propagation. Now, let's see the functions that need to be used during loss calculation.


Loss Functions :

We have 2 types of problems in ML:
  • Regression (in regression we have following loss functions)
    • MSE (Mean Squared Error)
    • MAE (Mean Absolute Error)
    • Huber Loss
  • Classification (in classification we have following loss functions)
    • Binary Cross Entropy
    • Categorical Cross Entropy
    • Sparse Categorical Cross Entropy (SCCE)
We have different sets of loss functions for Regression and Classification. Let's see them one by one:




Loss functions for Regression problems:

1) MSE (Mean Squared Error) 
  • Please see the below image for the formula
  • Also consider the below input data to understand MSE
    • i (record index)
    • y (actual value)
    • ŷ (predicted value)
    • e (y − ŷ)
    • e² 
  • We need to understand the difference between Loss vs Average Loss
    • If we consider one record at a time, it is called Loss
    • If we consider multiple records and calculate the loss, that's the Average Loss
    • In the first step, 
      • Loss: MSE = (y − ŷ)²
      • Average Loss: MSE = (1/n) ∑(i=1 to n) (yᵢ − ŷᵢ)² (also known as the Cost Function)
  • Advantages 
    • Simple
    • Same unit
  • Disadvantages
    • Outliers are a problem, inflating the Average Loss
    • Squaring (y − ŷ)² magnifies large errors
Outliers are nothing but abnormalities in the given data; we can simply remove them from the input data and then apply MSE.
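A minimal plain-Python sketch of the Average Loss (MSE) with made-up numbers, showing how a single outlier inflates it:

```python
def mse(y_true, y_pred):
    """Average Loss (cost function): mean of squared errors over all records."""
    n = len(y_true)
    return sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / n

y_true = [3.0, 5.0, 4.0]            # made-up actual values
y_pred = [2.5, 5.5, 4.0]            # made-up predictions
print(mse(y_true, y_pred))          # small, well-behaved loss

# One outlier record inflates the average dramatically because of the squaring
y_true_outlier = [3.0, 5.0, 40.0]
print(mse(y_true_outlier, y_pred))  # loss explodes due to the single outlier
```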

2) MAE (Mean Absolute Error) 
  • MAE was introduced to overcome the above 2 problems with
    • Squares
    • Outliers
  • Formulas:  
    • Loss: MAE = |y − ŷ|
    • Average Loss: MAE = (1/n) ∑(i=1 to n) |yᵢ − ŷᵢ|
  • Advantages
    • Simple and same unit (errors are not squared)
    • It handles outliers well
  • Disadvantages
    • It is not fully differentiable: f(x) = |x| is not differentiable at x = 0

3) Huber Loss 
  • Formula is
    • ½ (y − ŷ)² , if  |y − ŷ| ≤ λ
    • λ (|y − ŷ| − ½ λ), if |y − ŷ| > λ
    • where λ is a hyperparameter: a value we tune (e.g. with Optuna) rather than a weight the model learns, chosen to fit the input scenario
  • Let's assume λ = 5 and errors e1 = 2, e2 = 2, e3 = -20
    • For e1 = 2 
      • |e1| <= λ (2 <= 5) ; condition satisfied, hence using the first formula
      • ½ (y − ŷ)² = ½ e1² = ½ (2)² = ½ (4) = 2 (this is the Loss)
    • For e2 = 2, same value, i.e. 2
    • For e3 = -20
      • |y − ŷ| > λ, so we use λ (|y − ŷ| − ½ λ)
      • |-20| > 5, so λ (|y − ŷ| − ½ λ)
          • = 5(|-20| - ½·5) = 5(20 - 2.5) = 5(17.5) = 87.5
    • So, we have 3 losses: 2, 2, 87.5
    • Average loss = (2 + 2 + 87.5)/3 = 30.5
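The worked example above can be reproduced in plain Python:

```python
def huber_loss(error, lam=5.0):
    """Huber loss for a single error e = y - y_hat, with threshold lam."""
    if abs(error) <= lam:
        return 0.5 * error ** 2            # quadratic region (behaves like MSE)
    return lam * (abs(error) - 0.5 * lam)  # linear region (robust to outliers)

errors = [2.0, 2.0, -20.0]               # the example errors from above
losses = [huber_loss(e) for e in errors]
print(losses)                            # [2.0, 2.0, 87.5]
print(sum(losses) / len(losses))         # 30.5
```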

Which loss function to choose ?
  • If we already handled outliers during data preprocessing, go with MSE
  • If your data contains outliers, then MAE; but MAE has the problem of not being fully differentiable, so go with Huber Loss
  • If you don't have any outliers in your data, go for MSE 
Simply put: use MSE if outliers are absent or already handled, otherwise go with Huber Loss.



Loss functions for Classification problems:

1) Binary Cross Entropy (BCE)
  • It is mainly used when our output is a 2-class classification
  • In the output layer, which activation function do you recommend ?
    • Sigmoid (check the Sigmoid section for the reasons)
  • Formula for the Cost/Loss function
  • It is differentiable 
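A minimal plain-Python sketch of the BCE formula, −[y·log(p) + (1−y)·log(1−p)], with made-up example values:

```python
import math

def bce(y, p, eps=1e-12):
    """Binary Cross Entropy for one record.
    y is the true label (0 or 1), p is the sigmoid output in (0, 1)."""
    p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confident correct prediction -> tiny loss
print(bce(1, 0.99))
# Confident wrong prediction -> large loss
print(bce(1, 0.01))
```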

2) Categorical Cross Entropy (CCE)

  • The activation function at the output layer should be "Softmax"
  • It is for multi-class classification (more than 2 classes)
  • Formula for the Cost/Loss function



  • In real time, most problems do not use this loss function
  • It requires One Hot Encoding of the categorical target variable.
  • One Hot Encoding adds a new column per classification value and assigns 1's and 0's to those records, which makes computation very complex. This is the reason most problems are not using Categorical Cross Entropy. IT IS NOT RECOMMENDED IN REAL TIME. It is a burden on the infrastructure.


3) Sparse Categorical Cross Entropy (SCCE)
  • To overcome the above issue, SCCE was introduced.
  • Instead of adding all categories as new columns, it keeps only one column holding the integer class index for each record. Hence it is recommended in real time, especially when you are dealing with multi-class problems. 
  • Formula for the Cost function (no need to memorize it, the program will take care of it; it's just for our understanding)


Note : When you call the loss function, it internally takes care of One Hot Encoding or the integer class index, depending on which function you call, and then applies the loss.
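A minimal plain-Python sketch showing that CCE with a one-hot target and SCCE with an integer class index compute the same loss (the softmax probabilities below are made up):

```python
import math

def cce(one_hot, probs):
    """Categorical Cross Entropy with a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def scce(class_index, probs):
    """Sparse version: the target is just the integer class index,
    so no one-hot columns need to be materialized."""
    return -math.log(probs[class_index])

probs = [0.1, 0.7, 0.2]        # made-up softmax output for 3 classes
print(cce([0, 1, 0], probs))   # same value...
print(scce(1, probs))          # ...without building the one-hot vector
```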



Local Minima Vs Global Minima :
  • Both Global Minima & Local Minima are related to loss functions only
  • Global Minima
    • The lowest possible value of the loss function across the entire data set
    • This value is the absolute best for the model
    • Point 3 in the below image is the Global Minima
  • Local Minima 
    • Consider 100 iterations of our model as shown in the below image
    • During iteration 4, if we stop the model thinking that it has reached the minimum loss, then it is a trap. This is nothing but a Local Minima
    • We could land in this trap and simply stop the model at iteration L4. But look at the values at L55, L68, L79: the loss reduced a lot compared with L4.
    • We have to stop the iterations at L79, which is at the Global Minima.


More image showing Local & Global Minima :



The number of iterations to run the model is also a hyperparameter. We can decide on it based on the output from the Optuna library. 


**** Note *****
We have completed below 2 main steps in Neural Networks:
  • Activation Functions
  • Loss Functions
Now we are going to discuss Backward Propagation.



Backward Propagation : 
            Backward propagation is the mechanism a neural network uses to learn from its mistakes. It tells the network how much each weight contributed to the error, so those weights can be corrected.
  • Internally, during backward propagation, an algorithm runs which is called Gradient Descent

Algorithms used in backward propagation:
  • Gradient Descent (GD)
  • Batch Gradient Descent (BGD)
  • Stochastic Gradient Descent (SGD)
  • Mini Batch Gradient Descent (MBGD)
  • Optimizers (Momentum, Adagrad, RMS prop, Adam)

Gradient Descent (GD) Algorithm : 'Gradient' means we are going to calculate derivatives. 'Descent' means landing slowly, by adjusting the required parameters. We are going to minimize the loss by adjusting weights and biases. Not immediately, but gradually, it keeps touching down until the loss is minimal. This is the main goal of the Gradient Descent algorithm.

The maths behind gradient descent is W_new = W_old − η · ( ∂L / ∂W_old )

where W_new is the new weight, W_old is the old weight, η is the learning rate and ( ∂L / ∂W_old ) is the derivative of the loss with respect to the old weight. 

This is applicable for biases as well: B_new = B_old − η · ( ∂L / ∂B_old )
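The update rule above can be seen in action on a toy one-dimensional loss (a minimal sketch; the loss function and values are made up for illustration):

```python
# Gradient Descent on a toy loss L(w) = (w - 3)**2, whose minimum is at w = 3.
# Its derivative is dL/dw = 2 * (w - 3).
w = 0.0            # W_old: arbitrary starting weight
lr = 0.1           # learning rate (eta)

for step in range(100):
    grad = 2 * (w - 3)    # dL/dW_old
    w = w - lr * grad     # W_new = W_old - eta * (dL/dW_old)

print(round(w, 4))        # 3.0 -> converged to the minimum
```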


As per the above image, we have a set of input data x1, x2 with the actual value/output y.
  • For record 1, x1=80,  x2=8 and y=3
  • For record 2, x1=60, x2=9 and y=5
Consider a single hidden layer with 2 neurons h1, h2 and an output layer with one output neuron. After that, we have to calculate Ŷ & the loss L.

Now, the weights of the connections are represented as W₁₁¹, W₁₂¹, W₂₁¹, W₂₂¹, W₁₁², W₂₁²

During forward propagation, we have 2 functions: linear transformation & activation function. If we calculate them for both h1, h2, then
  • h1 ---> Z₁ | Af = O₁₁
  • h2 ---> Z₂ | Af = O₁₂
And the output Ŷ is O₂₁.

Based on above diagram, Linear transformation equation for 
Y^ = W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁ (assuming b from o/p layer as b₂₁)


Assume the biases of h1, h2 are b₁₁, b₁₂. Please see the above image for clarity. Also, assuming there are no outliers in the data, the recommended loss function for a regression problem is MSE.

MSE(Mean Squared Error) = (Y − Ŷ)²

Now, lets see Backward Propagation.

Loss ---> Ŷ 

Now let's understand what parameters we need to calculate Ŷ. If you carefully build the below graph, then you can calculate everything required for back propagation based on it. 


Let's start executing the backward propagation algorithm. 

From the above graph, x1, x2 are fixed values, but all weights and biases will change. 

So, the below parameters are changing (as per the above image):
  • W₁₁², W₂₁², b₂₁ ===> Step1 calculation
  • W₁₁¹, W₂₁¹, b₁₁ ===> Step2 calculation
  • W₁₂¹, W₂₂¹, b₁₂ ===> Step3 calculation
Also 
  • O₁₁ = Af(W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁)
  • O₁₂ = Af(W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂)


Step1 calculation :  Apply the chain rule

dL/dW₁₁²  = (dL/dŶ) * (dŶ/dW₁₁² )
dL/dW₂₁² = (dL/dŶ) * (dŶ/dW₂₁² )
dL/db₂₁ = (dL/dŶ) * (dŶ/db₂₁ )


Now, lets see how to calculate :

dL/dW₁₁²  = (dL/dŶ) * (dŶ/dW₁₁² )
                   
We need to calculate (dL/dŶ) and (dŶ/dW₁₁² )

  • Note : as we applied MSE as the loss function, we need to use Loss = (Y − Ŷ)²
Keep in mind : we need the power (chain) rule; consider Y a constant, and note that we are differentiating the loss with respect to Ŷ, so d/dŶ(Ŷ) is 1
d/dx (1 − x)²
= 2(1 − x) · d/dx(1 − x)
= 2(1 − x) · (−1)

Then dL/dŶ = d/dŶ(Loss)
                     = d/dŶ  (Y − Ŷ)²
                     =  2((Y − Ŷ)) * d/dŶ(Y − Ŷ)
                     = 2((Y − Ŷ)) * (d/dŶ(Y) - d/dŶ(Ŷ))
                     = 2((Y − Ŷ)) * (0 - 1)
                     = 2((Y − Ŷ)) * (- 1)
                     = -2(Y − Ŷ)

Now 2nd part of equation, dŶ/dW₁₁² .

dŶ/dW₁₁² = d/dW₁₁²(Ŷ) but Ŷ = W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁
                  = d/dW₁₁² (W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁)
                  = O₁₁

Hence dL/dW₁₁²  = (dL/dŶ) * (dŶ/dW₁₁² ) = -2(Y − Ŷ) * O₁₁


Now solving the next equation, dL/dW₂₁² = (dL/dŶ) * (dŶ/dW₂₁²); we already know the value of (dL/dŶ), so just calculate (dŶ/dW₂₁²).

dŶ/dW₂₁²  = d/dW₂₁² (W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁) = O₁₂

So, dL/dW₂₁² = (dL/dŶ) * (dŶ/dW₂₁² ) = -2(Y − Ŷ) O₁₂


Now solving the 3rd equation, dL/db₂₁ = (dL/dŶ) * (dŶ/db₂₁); we already know the value of (dL/dŶ), so just calculate (dŶ/db₂₁).

dŶ/db₂₁ = d/db₂₁(W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁) = 1

So, dL/db₂₁ = (dL/dŶ) * (dŶ/db₂₁ ) = -2(Y − Ŷ) * 1 = -2(Y − Ŷ) 
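The Step1 results can be sanity-checked numerically in plain Python. This is a minimal sketch with made-up values, treating the activation as identity so that O₁₁ and O₁₂ are fixed numbers:

```python
# Numeric check of Step1: dL/dW11_2 = -2 * (Y - Y_hat) * O11.
# All values below are made up for illustration.
O11, O12 = 0.6, 0.4
W11_2, W21_2, b21 = 0.5, -0.3, 0.1
Y = 1.0

def y_hat(w11_2):
    # linear transformation of the output layer
    return w11_2 * O11 + W21_2 * O12 + b21

def loss(w11_2):
    # MSE for one record
    return (Y - y_hat(w11_2)) ** 2

# Analytic gradient from the chain rule derived above
analytic = -2 * (Y - y_hat(W11_2)) * O11

# Finite-difference approximation of the same derivative
h = 1e-6
numeric = (loss(W11_2 + h) - loss(W11_2 - h)) / (2 * h)

print(abs(analytic - numeric) < 1e-6)   # True: the chain rule checks out
```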





Step2 calculation : 
  • W₁₁¹, W₂₁¹, b₁₁ ===> Step2 calculation
Note :
  • O₁₁ = Af(W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁)
  • O₁₂ = Af(W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂)
  • (For simplicity, the activation Af is treated as identity below, so O₁₁ is just the linear transformation; with a real activation, its derivative would also multiply into the chain.)
dL/dW₁₁¹ = (dL/dŶ) * (dŶ/dO₁₁) * (dO₁₁/dW₁₁¹) = -2(Y − Ŷ) * W₁₁² * x1 
dL/dW₂₁¹ =  (dL/dŶ) * (dŶ/dO₁₁) * (dO₁₁/dW₂₁¹) = -2(Y − Ŷ) * W₁₁² * x2  
dL/db₁₁ = (dL/dŶ) * (dŶ/dO₁₁) * (dO₁₁/db₁₁) = -2(Y − Ŷ) * W₁₁²  * 1 = -2(Y − Ŷ) * W₁₁²

dŶ/dO₁₁ = d/dO₁₁ (W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁) = W₁₁²

dO₁₁/dW₁₁¹ = d/dW₁₁¹ (O₁₁) = d/dW₁₁¹ (W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁) = x1
dO₁₁/dW₂₁¹ = d/dW₂₁¹ (O₁₁) = d/dW₂₁¹ (W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁) = x2

dO₁₁/db₁₁ =  d/db₁₁ (O₁₁) = d/db₁₁ (W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁) = 1




Step3 calculation :
  • W₁₂¹,  W₂₂¹, b₁₂ ===> Step3 calculation
Note :
  • O₁₁ = Af(W₁₁¹ x1 +  W₂₁¹ x2 +  b₁₁)
  • O₁₂ = Af(W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂)
dL/dW₁₂¹ = (dL/dŶ) * (dŶ/dO₁₂) * (dO₁₂/dW₁₂¹) = -2(Y − Ŷ) * W₂₁²  * x1
dL/dW₂₂¹ =  (dL/dŶ) * (dŶ/dO₁₂) * (dO₁₂/dW₂₂¹) = -2(Y − Ŷ) * W₂₁²  * x2
dL/db₁₂ =  (dL/dŶ) * (dŶ/dO₁₂) * (dO₁₂/db₁₂ ) = -2(Y − Ŷ) * W₂₁²  * 1 = -2(Y − Ŷ) * W₂₁² 

Common term :
dŶ/dO₁₂ = d/dO₁₂ (Ŷ) = d/dO₁₂(W₁₁² * O₁₁ + W₂₁² * O₁₂ + b₂₁) = W₂₁² 

Individual terms :
dO₁₂/dW₁₂¹ = d/dW₁₂¹ (W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂) = x1
dO₁₂/dW₂₂¹ = d/dW₂₂¹ (W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂) = x2
dO₁₂/db₁₂  =  d/db₁₂  (W₁₂¹ x1 + W₂₂¹ x2 +  b₁₂) =  1


That's how we calculate all the parameters, simply by looking at the above image/graph. 😁
This looks complex, but if you refer to the above image & graph and go step by step, it becomes easy.


Problem with the Gradient Descent algorithm : it might end up at a local minimum. If you observe the below image, we have 2 valleys, but our Gradient Descent algorithm ended up at the local minimum (the mint-colored one on your right-hand side). This is the issue with plain Gradient Descent: it is the first version of the algorithm. It's like still using ChatGPT-1 when the latest version is available.



Next versions of GD algorithm are :
  • Batch Gradient Descent (BGD)
  • Stochastic Gradient Descent (SGD)
  • Mini Batch Gradient Descent (MBGD)
Let's see what the above algorithms are and how they help prevent the local minima issue.

  • Batch Gradient Descent (BGD)
    • It takes the entire data set (all 'n' records in a single shot)
    • Initialize random weights & biases
    • Linear transformation
    • Apply the activation function
    • Calculate the loss for all data points (Average Loss)
    • Update weights & biases once for the entire data set (for all the data points)
    • Disadvantages
      • More memory usage, as we need to fit the entire data set into memory for the calculation
      • It takes more time to process
  • Stochastic Gradient Descent (SGD)
    • Instead of the entire data set, it considers one randomly chosen sample (one data point at a time)
    • Initialize random weights & biases
    • Linear transformation
    • Apply the activation function
    • Calculate the loss for that sample data point
    • Update weights & biases for that one data point
    • Disadvantages
      • On one hand this is good, in terms of speed
      • But on the other hand, we can't judge the entire data set based on the result of one data point
      • Overfitting can happen, due to more iterations and weight & bias adjustments
  • Mini Batch Gradient Descent (MBGD)
    • Assume we have 100 records; MBGD will split the entire data set into 4 batches with 25 records in each batch
      • Batch1(25), Batch2(25), Batch3(25), Batch4(25)
    • At a time, it considers ONLY one batch, i.e. Batch1 
      • Initialize random weights & biases
      • Linear transformation
      • Apply the activation function
      • Calculate the loss for ONLY Batch1
      • As per that loss, adjust weights & biases
      • Note that the batch size is also a hyperparameter; Optuna will help here
        • The usual recommendation for batch size is 16, 32, 64, 128, 256 etc. This comes from research. In case we are not clear on how to choose a batch size, please use Optuna from Python. It will give us the best hyperparameters to design the model. More about this in future blogs ✌
    • Pros & cons
      • It is a good approach compared with BGD & SGD
      • A kind of hybrid approach, in between BGD & SGD
      • More parallelism 
      • Slower than SGD per update, because it considers more samples
In terms of
  • speed: SGD > MBGD > BGD
  • memory utilization: SGD < MBGD < BGD
When we design an application/model it should be both fast and memory efficient, so the preferred approach is MBGD. In real time, in most cases, people use MBGD (Mini Batch Gradient Descent).
  • "+" is minimal loss
  • MBGD seems to be moderate algorithm for Gradient Descent
  • Observe that BGD doesn't have much learning
  • Observe SGD have too many iterations, with more noise instead of actual learning

Observe this analogy; it will help you decide which GD algorithm to select:
  • Assume you are walking down from the tip of a mountain
  • BGD (stable, but time-consuming because every decision is slow and needs full analysis)
    • it scans the entire mountain, then takes one perfect step in the correct direction
  • SGD (very fast, but with a lot of zig-zag; it keeps changing direction)
    • you look at only one nearby rock, guess the direction and step immediately
  • MBGD (takes a decent amount of time to decide, then takes the right step)
    • you look at a set of neighbourhood rocks, then you decide the step
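The three variants above can be sketched in a few lines of plain Python. The helper `gd_epoch` and the toy data are illustrative (not from any library); `batch_size` selects the variant: the whole data set gives BGD, 1 gives SGD, and anything in between gives MBGD.

```python
import random

random.seed(42)  # for reproducibility of the shuffles

# Hypothetical helper: one epoch of gradient descent on y = w*x with
# squared loss. batch_size = len(data) -> BGD, 1 -> SGD, in between -> MBGD.
def gd_epoch(data, w, lr, batch_size):
    random.shuffle(data)                      # SGD/MBGD pick samples randomly
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average gradient of 1/2*(w*x - y)^2 over the batch: (w*x - y)*x
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                        # one weight update per batch
    return w

# Toy data generated from y = 3x, so the best weight is w = 3
data = [(x, 3 * x) for x in range(1, 11)]

w_bgd  = gd_epoch(list(data), w=0.0, lr=0.01, batch_size=len(data))  # 1 update per epoch
w_sgd  = gd_epoch(list(data), w=0.0, lr=0.01, batch_size=1)          # 10 updates per epoch
w_mbgd = gd_epoch(list(data), w=0.0, lr=0.01, batch_size=5)          # 2 updates per epoch
```

After one epoch on this toy problem, SGD has taken the most update steps and BGD the fewest, with MBGD sitting in between, which mirrors the speed vs. stability trade-off described above.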

The core problem with the Gradient Descent algorithm is Local Minima. To avoid this problem, some boosters are added to it, which are called Optimizers. They add more power to the Gradient Descent variants above.
  • BGD + Optimizers
  • SGD + Optimizers
  • MBGD + Optimizers
Popular optimizers are Momentum, NAG, Adagrad, RMSprop and Adam.
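As a small taste of what optimizers do: the classic Momentum update keeps a running "velocity" of past gradients, so a step can roll through flat regions and shallow local minima instead of stalling. A minimal sketch (illustrative values, not a library API):

```python
# Minimal sketch of the Momentum update rule (illustrative, not a library API):
#   v = beta * v + grad      (accumulate a running "velocity" of past gradients)
#   w = w - lr * v           (step along the velocity, not the raw gradient)
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad
    w = w - lr * v
    return w, v

# Minimise the toy objective f(w) = w**2, whose gradient is 2*w
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
# w has now rolled down close to the minimum at 0
```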



How to prevent Overfitting in Deep Learning/Neural Networks ?

First, let's understand the concept of Overfitting properly....
  • Whenever we start building an ML model, we split the entire data into
    • Train Data (loss is low, accuracy is high)
    • Test Data (loss is very high, accuracy is low)
  • Our ML model learns hidden patterns from the "Train Data" and works perfectly on it. Loss will be low and accuracy will be very high on "Train Data".
  • But if I run the same model on "Test Data", the loss will be very high and the accuracy will be very low. This is called Overfitting.

The image above represents the Overfitting scenario.

    
  • In real time, we should build ML models like either the 1st or 2nd case from the image above; the 3rd case represents the exact Overfitting scenario.
  • Most real-time models will fit the 2nd case, but the 1st one is pretty much perfect.

The model below classifies well; it is a classification problem, for example "whether an email is spam or not spam". It segregates the data well into its categories.

From the below image:
  • A represents Underfitting: the model is not trained well and does not identify the hidden patterns
  • B represents a good/balanced model: it ignores the noise and matches the patterns well
  • C represents Overfitting

                                                    Underfitting Vs Balanced Vs Overfitting 


To prevent Overfitting, we have the following well-proven techniques:
  • Drop Out Layer
  • Early Stopping
  • L2 Regularization
We can use one technique or a combination of them: for example, use "Drop Out Layer", analyze the results, then add "Early Stopping", analyze again, follow with "L2 Regularization", and compare which works best.

Drop Out Layer :

  • Ideally, all the neurons are active across the Neural Network
  • When all neurons are active, all of them memorize things in each iteration, there is too much noise, and this leads to OVERFITTING.
  • Dropout randomly switches off some nodes in each iteration (say 20% of the neurons)
    • During the 1st iteration one random 20% of neurons is switched off; during the 2nd iteration a different random 20% is switched off (the neurons dropped in the 1st iteration are active again in the 2nd). In every iteration, 20% of neurons are off and the remaining 80% are active.
    • Because a different subset is active in each iteration, the model covers different patterns, which helps it generalize
  • The image above shows a drop out layer applied to a hidden layer
  • The model automatically identifies which neurons to switch off and turns them back on once the iteration is done; we just need to specify the Drop Out Rate.
    • if drop out rate = 0.5 then 50% of the neurons are switched off
    • this happens during both forward propagation and backward propagation
    • the output from a switched-off neuron is '0', so during backward propagation that neuron is skipped and it does not contribute to the Vanishing Gradient problem.
  • The Drop Out Layer is applied only during "Training", not during "Testing" or in "Production". DO NOT APPLY THE DROP OUT LAYER while testing the data.
  • In testing, all neurons are active.

Programmatically, one line takes care of everything (PyTorch example):
nn.Dropout(0.3)  ## nn is torch.nn; 0.3 means 30% of neurons will be switched off during training
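Under the hood, a dropout layer behaves roughly like the pure-Python sketch below. This assumes the common "inverted dropout" convention, where the surviving activations are scaled by 1/(1 - rate) so the expected activation stays the same; the helper itself is illustrative, not a library function.

```python
import random

# Illustrative inverted-dropout sketch: each activation is zeroed with
# probability `rate`; survivors are scaled by 1/(1-rate) so the expected
# value of the output matches the input. At test time it is a no-op.
def dropout(activations, rate, training=True):
    if not training:                 # testing/production: all neurons active
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0
            for a in activations]

random.seed(0)
out = dropout([1.0] * 1000, rate=0.3)
zeroed = sum(1 for a in out if a == 0.0)   # roughly 30% of the 1000 neurons
```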


The image above tells us that the error rate is lower with a drop out layer.



Early Stopping :
  • This is another technique to prevent Overfitting.
  • We stop "Training" the model when it stops improving on the "Validation" data (held-out data), even if the Train loss is still improving.
    • Example (7 iterations):
      • Train Loss = 0.989    Test Loss = 0.899
      • Train Loss = 0.976    Test Loss = 0.897
      • Train Loss = 0.932    Test Loss = 0.896
      • Train Loss = 0.854    Test Loss = 0.742
      • Train Loss = 0.767    Test Loss = 0.736 (Patience window starts)
      • Train Loss = 0.632    Test Loss = 0.735999
      • Train Loss = 0.542    Test Loss = 0.735999
    • There is a parameter called Patience: once the validation loss stops improving for that many iterations, the model stops training here, to avoid Overfitting.
    • Patience is the Hyper parameter here
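The stopping logic can be sketched as a patience counter over the validation losses. The helper below is illustrative, and `min_delta` is an assumed threshold for what counts as a "real" improvement:

```python
# Illustrative early-stopping sketch (not a library API): stop once the
# validation loss has failed to improve by min_delta for `patience` epochs.
def early_stop_epoch(val_losses, patience=2, min_delta=1e-4):
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:      # meaningful improvement: reset counter
            best = loss
            bad_epochs = 0
        else:                            # no real improvement this epoch
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch             # stop training here
    return len(val_losses)

# Validation losses from the 7-iteration example above: improvement stalls
losses = [0.899, 0.897, 0.896, 0.742, 0.736, 0.735999, 0.735999]
stopped_at = early_stop_epoch(losses, patience=2)   # stops at epoch 7
```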

Some important notations :
  • TP - True Positive : Model predicted YES, actual is also YES
  • TN - True Negative : Model predicted NO, actual is also NO
  • FP - False Positive : Model predicted YES, actual is also NO
  • FN - False Negative : Model predicted NO, actual is also YES
Based on above notations, we will calculate accuracy.

Accuracy = (TP + TN)/(TP+TN+FP+FN)

For example, TP =70, TN = 20, FP = 5, FN = 5, then Accuracy = (70+20)/(70+20+5+5) = 0.9 (90%)
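The formula above translates directly to code:

```python
# Accuracy from the confusion-matrix counts: (TP + TN) / (TP + TN + FP + FN)
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

acc = accuracy(tp=70, tn=20, fp=5, fn=5)   # the worked example: 90/100 = 0.9
```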



L2 Regularization :

  • This is a lengthy and slightly confusing topic, the last technique we cover to prevent Overfitting
  • Let's see how it avoids Overfitting
  • Regularization means controlling something with some rules. Isn't it ?
  • We punish the model if the weights become too large
    • So the model is forced to
      • Keep weights small and simple
      • Avoid memorizing
      • Generalize more
  • The Loss above could be a regression loss or a classification loss; see the loss function formulas earlier in this blog
  • The second part of the function (everything except the loss) is the PENALTY TERM, which controls the weights
    • λ is a hyper parameter and should be balanced (otherwise we end up with either Overfitting or Underfitting)
    • if λ = 0 then Cost Function = Loss Function (so it shouldn't be 0)
We know that W_new = W_old − η · ( ∂L / ∂W_old ) ======> Equation1

With L2 regularization the loss becomes L′ = L + (λ / 2) Σ Wᵢ²

∂L′/∂W_old = ∂/∂W_old ( L + (λ / 2) Σ Wᵢ² )
           = ∂L/∂W_old + (λ / 2) · 2 W_old
           = ∂L/∂W_old + λ · W_old

Substituting this gradient into Equation1:

W_new = W_old − η · ( ∂L′ / ∂W_old )
      = W_old − η · ( ∂L/∂W_old + λ · W_old )
      = W_old − η · λ · W_old − η · ∂L/∂W_old
      = W_old (1 − η λ) − η · ∂L/∂W_old  ======> Equation2

Please note that (1 − η λ) is the Weight Decay factor: it penalizes the larger weights and keeps them under control, which prevents Overfitting.

Let us consider Ŷ = WX

Training point: X = 1, Y = 0, and the weight is randomly initialized to W = 2.

Loss function (regression): L = 1/2 (Ŷ − Y)²

Since Ŷ = W·X with X = 1 and Y = 0, we get Ŷ = W, so

Data Loss = 1/2 (Ŷ − Y)² = 1/2 (W − 0)² = 1/2 W²

With W = 2 (randomly initialized), dL/dW = d/dW (1/2 W²) = W = 2

Plain gradient descent update, assuming η = 0.1:
W_new = W_old − η · ( ∂L / ∂W_old )
      = 2 − (0.1) · (2)
      = 2 − 0.2
      = 1.8 (see, the weight value reduced)

Now include the L2 term via Equation2, taking λ = 0.1 as an example:
W_new = W_old (1 − η λ) − η · ∂L/∂W_old
      = 2 · (1 − 0.01) − 0.2
      = 1.98 − 0.2
      = 1.78 (even smaller than the plain update)

This is how weights are shrunk by L2 regularization.
    So, using L2 regularization, we penalize the larger weights via "Weight Decay". When the weights stay balanced, the Overfitting problem is automatically reduced.
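The single-step arithmetic of Equation2 can be double-checked in a few lines; η = 0.1 as before, and λ = 0.1 here is just an example value:

```python
# Numeric check of the L2 update (Equation2): W_new = W*(1 - lr*lam) - lr*dL/dW
# Toy setup: X = 1, Y = 0, so L = 1/2 * W**2 and dL/dW = W.
W, lr, lam = 2.0, 0.1, 0.1          # lam = 0.1 is just an example value

grad = W                             # dL/dW for this loss
plain_update = W - lr * grad                     # vanilla GD step
l2_update    = W * (1 - lr * lam) - lr * grad    # step with weight decay
```

The L2 step always lands on a smaller weight than the plain step, which is exactly the "weight decay" effect described above.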


    Learning Rate ( η ):
    • This is a Hyper parameter used while adjusting weights and biases
    • Based on experiments, η = 0.1, 0.01, 0.001, 0.0001 are good values to choose from
    • If η is too low, like 0.0000001, the model takes lots of tiny steps, so training takes too long and is too slow. If η is too high, like 10, it takes big jumps, overshoots and zig-zags around the minimum instead of settling into it. Hence the learning rate shouldn't be too large or too small; it must be moderate, which is why most research recommends the values above.
    • First, manually train your model with a few different small learning rates and compare them; if needed, use OPTUNA at the end to come up with a balanced learning rate for your model.
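The effect of η can be demonstrated on the toy objective f(w) = w², whose gradient is 2w (an illustrative sketch, not tied to any framework):

```python
# Sketch: minimising f(w) = w**2 (gradient 2*w) with different learning rates.
def run_gd(lr, steps=50, w=5.0):
    for _ in range(steps):
        w = w - lr * 2 * w          # standard GD update on this toy objective
    return w

w_good = run_gd(lr=0.1)        # shrinks steadily toward the minimum at 0
w_tiny = run_gd(lr=1e-7)       # barely moves: training would take forever
w_huge = run_gd(lr=1.5)        # overshoots every step: |w| blows up
```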

    When you call an Optimizer, you have to pass the value for the learning rate, η.

    We will see Optimizers in detail in the next blog. In real time we use Optimizers, i.e. we take the help of these boosters.


    Thank you for reading this blog !
    Arun Mathe
