(AI Blog#3) Deep Learning Foundations - Activation & Loss Functions, Gradient Descent algorithms & Optimization techniques
Deep knowledge of the underlying techniques is essential when designing a machine learning model; without it, we end up creating ML models that are of little use. We need a clear understanding of certain techniques to confidently build an ML model, train it on "training data", finalize it, and deploy it to production. So far, in blogs #1 and #2, we have covered the fundamentals of Deep Learning and Neural Networks: the architecture of a neural network, its internal layers and components, and so on.
Links to Blogs #1 and #2 are below for quick reference.
Deep Learning & Neural Networks : https://arunsdatasphere.blogspot.com/2026/01/deep-learning-and-neural-networks.html
Building a real world neural network: A practical usecase explained : https://arunsdatasphere.blogspot.com/2026/01/building-real-world-neural-network.html
Now let's dive into the concepts below to help build confidence in designing your ML model:
- Activation Functions (Forward Propagation)
- ReLu, Leaky ReLu, Parametric ReLu
- Sigmoid
- Tanh
- ELU (Exponential Linear Unit), SELU (Scaled Exponential Linear Unit) - less commonly used in everyday production models
- GELU (we will see this in LLMs; it is widely used in Transformer models)
- Loss Functions
- Regression
- MSE
- MAE
- Classification
- BCE
- CCE
- SCCE
- Backward propagation
- Gradient Descent (GD)
- Batch Gradient Descent (BGD)
- Stochastic Gradient Descent (SGD)
- Mini Batch Gradient Descent (MBGD)
- Optimizers (Momentum, Adagrad, RMS prop, Adam)
- Overfitting & Underfitting
- Vanishing & Exploding Gradients
- Optimisation Techniques
Activation Functions
In the real world, we deal with complex data whose relationships are non-linear (as a loose analogy: a list or tuple is a linear structure, while trees and graphs are non-linear), and such complex data contains hidden patterns. Deep learning/neural networks are meant for exactly this kind of non-linear data, so we need non-linear functions inside the network to add non-linearity to the model. Such functions are called Activation Functions.
The main purpose of an activation function is to help the model capture the hidden patterns in the given input data.
Different types of activation functions:
- Sigmoid
- Tanh
- ReLu
- Leaky ReLu
- Parametric ReLu
- ELU & SeLu
- GELU (used in LLMs and Transformer models)
Criteria of a good activation function:
- It should be a non-linear function
- It should be differentiable; let's understand this point as below
- In neural networks, training has 4 steps: forward propagation, loss calculation, backward propagation, and adjusting weights and biases
- Linear transformation: z = w1x1 + w2x2 + ... + wnxn
- Activation output = A(z); if A is not differentiable, then there is no learning at all
- During backward propagation, we need to calculate gradients (which are nothing but derivatives in maths)
- While doing so, we need to take the derivative of the activation function as well; for this reason, the activation function we use must be differentiable
- It should be computationally inexpensive; let's understand this point as below
- For example, GPT-3.x used around ~175 billion parameters (nothing but weights & biases); think about the computational cost of such a model if the activation function formula were complex instead of simple
- Hence an activation function must be computationally inexpensive
- It should be zero-centered
- Its output should cover both positive and negative values, so weight updates stay balanced during training
- This keeps the treatment of the input data balanced
- It should be non-saturating
- A non-saturating activation function is one whose output keeps changing as the input changes, instead of getting stuck at a fixed value.
- That means the neuron continues to respond when input changes
- A non-saturating activation function allows the neuron’s output and gradient to keep changing with the input, preventing vanishing gradients and enabling effective learning during back propagation.
Note that we apply activation functions at both the hidden layers and the output layer of a neural network. We will see later in this blog which activation functions to use where. Generally, we don't use Sigmoid or Tanh at hidden layers because of the reasons above, but we do use Sigmoid at the output layer for binary classification, and other activation functions like Softmax at the output layer for multi-class classification.
Just keep in mind the 5 points we have been discussing above, which are the criteria for deciding the activation function in our ML models.
Overfitting Vs Underfitting
- Let's assume we have 1000 records of input data; we have to divide them into "Train Data" & "Test Data"
- Assume 70% is Train Data and 30% is Test Data (meaning we use 70% of the input data to train our ML model and 30% to test it)
- This means our ML model is going to learn the hidden & complex patterns from the "Train Data"
- "Test Data" is unseen data
- Based on its training, the model is then evaluated on how well it handles the patterns in the "Test Data"
Overfitting - Model performs well on "Train Data" but not well on "Test Data"
Underfitting - Model won't perform well on both "Train" & "Test" Data. Basically, it didn't train well on complex "Train" data.
- From the above diagram:
- The straight line represents underfitting: the model is too simple and couldn't capture the true relationship in the input data; it didn't learn enough
- The smooth curve is an example of a good fit: it ignores a bit of noise but captures the overall trend (the real pattern), with the right level of complexity and a low training error
- The very wavy curve represents overfitting: it connects almost all the data points, is too complex, and learns too much noise; it memorised instead of understanding
Note: Overfitting memorises noise, underfitting ignores patterns, but a good fit captures the patterns.
How to avoid Overfitting ?
- We need to use techniques like (which we are going to discuss further in this blog)
- Dropout layer
- L1 & L2 regularization
- Early stopping
- No need to wait until testing: we can identify an underfitting scenario during training itself, as the loss stays almost the same across iterations
Vanishing Gradient
- As per the below diagram:
- We have a input layer with input variables as x1, x2
- Hidden layer with 3 neuron's as h1, h2, h3
- Output layer with one output neuron as Zf
- Consider w1_11 is the weight of connection from x1 to h1
- Output of
- h1 is O11
- h2 is O12
- h3 is O13
- During backward propagation, let's assume that we need to find out the derivative of the loss with respect to w1_11
- Then path of loss calculation would be as mentioned in the below image.
- Loss --> Y^ --> Zf --> O11 --> w1_11 (dL/dw1_11)
- As per the above backward propagation order,
- the derivative of the Loss with respect to w1_11 is obtained by the
- chain rule: dL/dw1_11 = (dL/dY^)*(dY^/dZf)*(dZf/dO11)*(dO11/dw1_11)
- Assume following values for above formula
- dL/dw1_11 = (0.0001)*(0.000001)*(0.001)*(0.004) = 4E-16 (almost 0)
- w1_11(new) = w1_11(old) - (learning rate) * (dL/dw1_11)
- Assume the learning rate is 0.1 and w1_11(old) = 0.8
- w1_11(new) = 0.8 - (0.1)*(4E-16) ≈ 0.8
- Observe carefully that w1_11(new) is almost exactly equal to w1_11(old), which is 0.8
- If both are (almost) equal, is there any learning?
- In other words, if the derivative values are too small (as above), then the old and new weights of the same connection are almost equal
- w1_11(new) ≈ w1_11(old)
- This is called Vanishing gradient
Note: Hence we need to select the activation function very carefully; otherwise we will end up with the vanishing gradient problem.
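The chain-rule arithmetic above can be verified in a few lines of plain Python; the gradient values below are the assumed illustrative numbers from the walkthrough, not values computed from a real network:

```python
# Assumed example gradients along the path Loss -> Y^ -> Zf -> O11 -> w1_11
dL_dYhat  = 0.0001
dYhat_dZf = 0.000001
dZf_dO11  = 0.001
dO11_dw   = 0.004

# Chain rule: dL/dw1_11 is the product of the partials along the path
dL_dw1_11 = dL_dYhat * dYhat_dZf * dZf_dO11 * dO11_dw
print(dL_dw1_11)  # ~4e-16, effectively zero

# Weight update rule: w_new = w_old - learning_rate * gradient
lr = 0.1
w_old = 0.8
w_new = w_old - lr * dL_dw1_11
print(w_new)  # ~0.8: the weight barely moves, so there is no real learning
```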
Exploding gradient
- Lets say
- W1_11(old) = 0.8
- W1_11(new) = 8689.89
- Look at the huge difference between the old and new weights of the same connection
- This is called Exploding gradient
Activation Functions
- Sigmoid
- It comes from the machine learning algorithm called Logistic Regression; it converts any real number into a probability (like a coin flip leading to heads or tails)
- It is a curved line, hence satisfying non linearity
- Any output value
- > 0.5 is considered as 1
- < 0.5 is considered as 0
- Hence the range of Sigmoid is [0, 1]
Formula of Sigmoid: σ(z) = 1 / (1 + e^(-z))
Similarly, the derivative of Sigmoid can be represented as:
dσ(z)/dz = σ(z) · (1 − σ(z))
Note: Sigmoid is differentiable (hence satisfying the 2nd criterion of activation functions)
- Sigmoid is computationally expensive due to the exponential in its formula; computing the exponent takes a good amount of time.
- Sigmoid is not zero-centered. Its range is [0, 1].
- Sigmoid is saturating
- If the linear transformation gives Z = 10, then σ(10) ≈ 1 (hence the range of Sigmoid is [0, 1])
- Derivative: dσ(z)/dz = σ(z) · (1 − σ(z)) ≈ 1 · (1 − 1) = 0 (leads to the vanishing gradient issue)
- If the derivative is 0, it leads to vanishing gradients
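To make the saturation point concrete, here is a minimal plain-Python sketch of Sigmoid and its derivative (in practice a framework function such as torch.sigmoid would be used instead):

```python
import math

def sigmoid(z):
    """Sigmoid: squashes any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0))              # 0.5, the decision threshold
print(sigmoid(10))             # ~0.99995, saturating towards 1
print(sigmoid_derivative(0))   # 0.25, the largest the gradient ever gets
print(sigmoid_derivative(10))  # ~4.5e-05, the gradient has almost vanished
```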
- Tanh
- Tanh is better than Sigmoid activation function
- Formula for Tanh(z): tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
- Guys, no need to remember all these formulas; programmatically, torch.tanh(z) will do it for us.
- Graph for Tanh is
- Pros & cons against the criteria of activation functions
- It is Non-linear (it's not a straight line)
- It is differentiable
- Formula for the derivative of Tanh(z): d/dz tanh(z) = 1 − tanh²(z)
- Tanh is computationally a bit more expensive than Sigmoid because of the additional exponential terms
- Tanh is zero centered. It is considering both +ve & -ve values as range is [-1, 1].
- Tanh is saturating, which leads to the vanishing gradient problem
- Assume Z = 0:
- tanh(0) = 0
- d/dz tanh(0) = 1 − tanh²(0) = 1 − 0 = 1
- Assume Z = 6:
- tanh(6) ≈ 1
- d/dz tanh(6) = 1 − tanh²(6) ≈ 1 − 1 = 0
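The two cases above can be checked numerically with the standard library (torch.tanh would give the same values in a real model):

```python
import math

def tanh_derivative(z):
    """d/dz tanh(z) = 1 - tanh(z)**2."""
    return 1.0 - math.tanh(z) ** 2

print(math.tanh(0))        # 0.0: tanh is zero-centered
print(tanh_derivative(0))  # 1.0: the gradient is strongest at z = 0
print(math.tanh(6))        # ~0.99999, saturated
print(tanh_derivative(6))  # ~2.5e-05, the gradient has almost vanished
```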
- ReLu
- ReLU stands for Rectified Linear Unit
- Formula: ReLU(z) = max(0, z), where the output range is [0, ∞)
- if z <= 0, then ReLU(z) is 0 (negative values are simply replaced with 0)
- if z > 0, then ReLU(z) is z (positive values simply pass through as z)
- This applies in Forward Propagation
- Derivative of ReLu, d/dx(ReLu(z)) is
- if z<= 0 then value is 0
- if z > 0 then value is 1
- This applies in Backward Propagation
- Graph of the ReLU activation function
- See, for all negative values it is 0, sitting on the x-axis
- Does it satisfy the criteria of an activation function?
- It is non-linear
- max(0, z) is differentiable wherever z ≠ 0; hence ReLU is only partially differentiable
- at z = 0, it is not differentiable
- Computationally inexpensive, as the formula is simple, i.e. max(0, z)
- It is not zero-centered (it never outputs negative values)
- ReLU is non-saturating for z > 0
Finally, is this activation function recommended for the hidden layers of a neural network? Yes, because it satisfies most of the criteria of activation functions. It is a good fit for hidden layers.
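The forward and backward rules above fit in a few lines; a minimal sketch (using 0 as the derivative at exactly z = 0 is a common framework convention, not a mathematical fact):

```python
def relu(z):
    """ReLU(z) = max(0, z): negatives become 0, positives pass through."""
    return max(0.0, z)

def relu_derivative(z):
    """Derivative used during backward propagation: 0 for z <= 0, 1 for z > 0.
    (At exactly z = 0 ReLU is not differentiable; 0 is a convention.)"""
    return 1.0 if z > 0 else 0.0

print(relu(-3.2))             # 0.0
print(relu(5.0))              # 5.0
print(relu_derivative(-3.2))  # 0.0: the source of dead neurons
print(relu_derivative(5.0))   # 1.0: no saturation on the positive side
```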
Now, lets understand a problem called DYING ReLu problem :
- We have a problem called Dying ReLu problem in ReLu
- We generally maintain the same activation function across all neurons in the hidden layers
- If the input to ReLU is negative, the output is ZERO; whenever a neuron's output gets stuck at ZERO (its gradient is also zero, so its weights stop updating), that neuron is called a DEAD neuron, and this situation is called the Dying ReLU problem.
- Reasons for this problem:
- High negative bias: Z = w1x1 + w2x2 + b (if b is a large negative value, then z is negative) and ReLU will return 0
- High learning rate
- W_new = W_old - (learning rate) * (dL/dW_old)
- if the learning rate is too high, W_new can swing to an extreme value, the neuron's output becomes 0 again, and it stops learning, which again leads to Dying ReLU
- In short: Dying ReLU happens when the bias is a large negative number, or when the learning rate is too high
Leaky ReLU
- Leaky ReLU(z) is defined as below, where a = 0.01 (researchers finalized this value after experiments)
- z if z > 0
- az if z <= 0 (this is the change: we just multiply z by "a", whose value is 0.01)
- Please observe the difference between ReLu and Leaky ReLu from below diagram
- Just to avoid an output of exactly 0, we hardcode the slope as 0.01
- Derivative is, d/dx( Leaky ReLu(z)) is
- 1 if z > 0
- a if z<= 0 (we are keeping the neuron alive to avoid dying ReLu problem)
- It is non-linear (as we can see from above graph)
- It is differentiable
- It is computationally inexpensive
- It is (roughly) zero-centered: a*z means (0.01)*(z), and z can be any number in (−∞, +∞), so the outputs cover both negative and positive values
- It is partially non-saturated, not fully non-saturated
- Remember, the value a = 0.01 is fixed; it is not a trainable parameter. It is a static value, but ideally it should be dynamic, shouldn't it? This is the problem with Leaky ReLU.
Parametric ReLU
- Observe the below graph to spot the clear difference between ReLU vs Leaky ReLU vs Parametric ReLU
- See, Parametric ReLU adjusts the value of 'a' based on the situation, which helps the model learn better
- PReLU(z) is
- z if z > 0
- az if z <= 0, but 'a' is not constant; it is a learnable parameter. The model learns its best value based on the situation, and this is completely taken care of by the ML model. Understand that control is with your ML model: nothing manual here, strictly no hardcoding.
- It is non linear
- It is differentiable
- It is computationally inexpensive
- It is zero centered
- It is non saturated
- We will start with ReLu, analyze the output
- Repeat the process with Leaky ReLu, analyze the output
- Repeat the process with Parametric ReLu, analyze the output
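To see the difference concretely, here is a small sketch: Leaky ReLU with its fixed a = 0.01, plus a hand-varied 'a' to mimic what Parametric ReLU would learn (in PyTorch the real learnable version is torch.nn.PReLU):

```python
def leaky_relu(z, a=0.01):
    """Leaky ReLU: negative inputs are scaled by a small slope a (0.01 by default)."""
    return z if z > 0 else a * z

# Parametric ReLU uses the same formula, but 'a' is a learnable parameter
# updated during training instead of a hard-coded 0.01. Here we simply
# vary 'a' by hand to mimic a learned slope.
print(leaky_relu(-10))         # -0.1: small but non-zero, the neuron stays alive
print(leaky_relu(-10, a=0.2))  # -2.0: what a PReLU-style learned slope might give
print(leaky_relu(3))           # 3: the positive side is identical to ReLU
```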
Loss Functions
- Regression (in regression we have the following loss functions)
- MSE (Mean Squared Error)
- MAE (Mean Absolute Error)
- Huber Loss
- Classification (in classification we have following loss functions)
- Binary Cross Entropy
- Categorical Cross Entropy
- Sparse Categorical Cross Entropy (SCCE)
MSE (Mean Squared Error)
- Please see the below image for the formula
- Also consider the below input data to understand MSE
- i (input)
- y (actual value)
- y^ (predicted value)
- e (y - y^)
- e^2
- We need to understand the difference between Loss vs Average Loss
- If we consider one record at a time, it is called Loss
- If we consider multiple records and calculate the loss over all of them, that's the Average Loss
- As a first step,
- MSE = (y^ − y)² is the Loss
- MSE = (1/n) Σ(i=1 to n) (y^ − y)² is the Average Loss (also known as the Cost Function)
- Advantages
- Simple
- Differentiable everywhere (smooth gradient)
- Dis-advantages
- Outliers are a problem: squaring (y^ − y)² blows up large errors, inflating the Average Loss
- MAE was introduced to overcome the above 2 problems with MSE:
- Squares
- Outliers
- Formula for
- Loss: MAE = |y − y^|
- Average Loss: MAE = (1/n) Σ(i=1 to n) |y − y^|
- Advantages are
- Simple, and in the same unit as the target (errors are not squared)
- It is robust to outliers
- Disadvantages are
- It is not fully differentiable: f(x) = |x| has no derivative at x = 0
Huber Loss
- Formula is
- ½ (y − ŷ)² , if |y − ŷ| ≤ λ
- λ (|y − ŷ| − ½ λ), if |y − ŷ| > λ
- where λ is a hyperparameter: a setting we choose and tune (rather than a weight the model learns) to get the best behaviour for the input scenario
- Let's assume λ = 5 and the errors are e1 = 2, e2 = 2, e3 = −20
- For e1 =2
- e1 <= λ (2 <= 5) ; condition satisfied, hence using first formula
- ½ (y − ŷ)² = ½ e1² = ½ (2)² = ½ (4) = 2 (This is the Loss)
- For e2 = 2, the loss is the same, i.e. 2
- For e3 = -20
- if |y − ŷ| > λ, then use λ( |y − ŷ| − ½ λ)
- if |-20| > 5, then λ( |y − ŷ| − ½ λ)
- = 5(|-20| - ½ 5) = 5(20 - 2.5) = 5(17.5) = 87.5
- So, we have 3 losses, those are 2, 2, 87.5
- Average loss = (2 + 2 + 87.5)/3 = 30.5
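The worked example above can be checked with a small sketch of the Huber formula (λ = 5 and the errors e1, e2, e3 come from the walkthrough, passed here as y − y^):

```python
def huber_loss(y, y_hat, lam=5.0):
    """Huber loss: quadratic for small errors, linear for large ones (outliers)."""
    e = abs(y - y_hat)
    if e <= lam:
        return 0.5 * e ** 2
    return lam * (e - 0.5 * lam)

# Errors e1 = 2, e2 = 2, e3 = -20 from the walkthrough
losses = [huber_loss(e, 0.0) for e in (2, 2, -20)]
print(losses)                     # [2.0, 2.0, 87.5]
print(sum(losses) / len(losses))  # 30.5, the average loss
```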
- If you handled outliers during data preprocessing itself, or your data has no outliers at all, then go with MSE
- If your data contains outliers, then MAE; but MAE has the problem of not being fully differentiable, so in that case go with Huber Loss
1) Binary Cross Entropy (BCE)
- It is mainly used when the output is a 2-class (binary) classification
- For the output neuron, which activation function is recommended in the output layer?
- Sigmoid (check the Sigmoid section for the reasons)
- Formula for Cost/Loss Function
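As a quick illustration of the BCE formula for a single sample (the probabilities 0.9 and 0.1 below are made-up Sigmoid outputs, just for the arithmetic):

```python
import math

def binary_cross_entropy(y, y_hat):
    """BCE for one sample: -(y*log(y_hat) + (1-y)*log(1-y_hat)),
    where y is the true label (0 or 1) and y_hat is the Sigmoid output."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(binary_cross_entropy(1, 0.9))  # ~0.105: confident and correct, low loss
print(binary_cross_entropy(1, 0.1))  # ~2.303: confident and wrong, high loss
```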
2) Categorical Cross Entropy (CCE)
- Activation function should be "Softmax"
- It is a multi class classification (more than 2 classes)
- Formula for the Cost/Loss function
- In real-world problems, this loss function is mostly not used
- It relies on One Hot Encoding for the categorical variables
- One Hot Encoding adds a new column per classification value and assigns 1s and 0s to the records, which makes computation very complex when there are many classes. This is the reason most real-world problems avoid Categorical Cross Entropy. IT IS NOT RECOMMENDED IN REAL TIME. It is a burden on the infrastructure.
3) Sparse Categorical Cross Entropy (SCCE)
- To overcome the above issue, SCCE was introduced.
- Instead of adding all categories as new columns, this approach keeps a single column and assigns an integer index to each category. Hence it is recommended in real-world work, especially when dealing with multi-class problems.
- Formula for the Cost function (no need to memorize it; the program will take care of it, it is just for our understanding)
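A tiny sketch of the idea: SCCE takes the true class as a plain integer index instead of a one-hot vector (the 3-class probabilities below are made-up Softmax outputs):

```python
import math

def scce_loss(class_index, probs):
    """Sparse categorical cross entropy for one sample: the true class is a
    plain integer index (no one-hot columns), and the loss is -log of the
    probability the model assigned to that class."""
    return -math.log(probs[class_index])

probs = [0.1, 0.2, 0.7]     # made-up Softmax output for a 3-class problem
print(scce_loss(2, probs))  # ~0.357; CCE would need the same label one-hot
                            # encoded as [0, 0, 1] to get the same number
```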
Global Minima & Local Minima
- Both Global Minima & Local Minima relate to Loss Functions
- Global Minima
- Lowest possible value of the loss function across entire data set
- This value is absolute best for the model
- Point 3 from below image is Global Minima
- Local Minima
- Consider 100 iterations of our model as shown in the below image
- During iteration 4, if we stop the model thinking this is the minimum loss, it is a trap. This is nothing but a Local Minimum
- We can land in this trap and simply stop the model at iteration L4. But look at the values at L55, L68, L79: the loss reduces a lot compared with L4.
- We should stop the iterations at L79, which is the Global Minimum.
- Activation Functions
- Loss Functions
- Internally, during backward propagation, an algorithm runs which is called Gradient Descent
- Gradient Descent (GD)
- Batch Gradient Descent (BGD)
- Stochastic Gradient Descent (SGD)
- Mini Batch Gradient Descent (MBGD)
- Optimizers (Momentum, Adagrad, RMS prop, Adam)
- For record1, x1=80, x2=8 and y=3
- For record2, x1=60, x2=9 and y=5
- h1 ---> Z₁ | Af = O₁₁
- h2 ---> Z₂ | Af = O₁₂
Assume the biases of h1, h2 are b₁₁, b₁₂ (please see the above image for clarity). Also, assuming there are no outliers in the data, the recommended loss function for this regression problem is MSE.
- W₁₁², W₂₁², b₂₁ ===> Step1 calculation
- W₁₁¹, W₂₁¹, b₁₁ ===> Step2 calculation
- W₁₂¹, W₂₂¹, b₁₂ ===> Step3 calculation
- O₁₁ = Af(W₁₁¹ x1 + W₂₁¹ x2 + b₁₁)
- O₁₂ = Af(W₁₂¹ x1 + W₂₂¹ x2 + b₁₂)
- Note: as we applied MSE as the loss function, we need to use Loss = (Y − Ŷ)²
- Batch Gradient Descent (BGD)
- Stochastic Gradient Descent (SGD)
- Mini Batch Gradient Descent (MBGD)
- Batch Gradient Descent (BGD)
- It takes the entire data set (all 'n' records in a single shot)
- Initialize random weights & biases
- Linear transformation
- Applying Activation function
- Calculating the loss for all data points (Average Loss)
- Update weights & Biases for this entire data set (for all the data points)
- Disadvantages
- More memory usage, as we need to fit the entire data set into memory for the calculation
- It takes more time to process
- Stochastic Gradient Descent (SGD)
- Instead of the entire data set, it considers one randomly chosen sample (one data point at a time)
- Initialize random weights & biases
- Linear transformation
- Applying Activation function
- Calculating the loss for the sample data point
- Updating weights & biases for one data point
- Trade-offs
- On one hand this is good, in terms of speed per update
- On the other hand, we can't judge the entire data set by keeping the result of one data point in mind, so updates are noisy
- Overfitting can happen, due to the many iterations and weight & bias adjustments
- Mini Batch Gradient Descent (MBGD)
- Assume we have 100 records; MBGD will split the entire data set into 4 batches with 25 records in each batch
- Batch1(25), Batch2(25), Batch3(25), Batch4(25)
- At once, it will Consider ONLY one batch i.e. Batch1
- Initialize random weights & biases
- Linear transformation
- Applying Activation function
- Calculating the loss for ONLY Batch1
- As per above loss, adjust weights & biases
- Note: the batch size is also a hyperparameter; Optuna can help here
- The generally recommended batch sizes are 16, 32, 64, 128, 256, etc. (this guidance comes from research). In case we are not clear on how to choose a batch size, use Optuna from Python; it will suggest the best hyperparameters for designing the model. More about this in future blogs ✌
- Trade-offs
- It is a good approach compared with BGD & SGD
- A kind of hybrid approach between BGD & SGD
- More parallelism
- Slower than SGD per update, because it considers more samples at once
- Speed: SGD > MBGD > BGD
- Memory utilization: SGD < MBGD < BGD
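The difference between the three variants is really just how many records feed each weight update; a minimal sketch of the bookkeeping, using the 100-record, 25-per-batch example from above:

```python
# How many weight/bias updates happen in one pass (epoch) over the data.
# The batch size is the only knob that separates BGD, SGD and MBGD.
n_records = 100

def updates_per_epoch(batch_size):
    """Ceiling division: number of batches, i.e. updates, per epoch."""
    return (n_records + batch_size - 1) // batch_size

print(updates_per_epoch(n_records))  # BGD : 1 update, the whole data set at once
print(updates_per_epoch(1))          # SGD : 100 updates, one record at a time
print(updates_per_epoch(25))         # MBGD: 4 updates, one per 25-record batch
```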
- "+" is minimal loss
- MBGD seems to be moderate algorithm for Gradient Descent
- Observe that BGD doesn't have much learning
- Observe SGD have too many iterations, with more noise instead of actual learning
- Assume you are walking down from a tip of mountain
- BGD (stable, but time-consuming because each decision is slow and needs full analysis)
- it is going to scan the entire mountain, then take one perfect step in the correct direction
- SGD (very fast, but with a lot of zig-zag movement; because of this it keeps changing direction)
- you look at only one nearby rock, guess the direction and step immediately
- MBGD (taking a decent amount of time to decide, then taking the right step)
- you look at a set of neighbouring rocks, then you decide your step
- BGD + Optimizers
- SGD + Optimizers
- MBGD + Optimizers
- Understand that whenever we start building an ML model, we split the entire data into
- Train Data (on which loss is low and accuracy is high)
- Test Data (on which, for an overfit model, loss is very high and accuracy is low)
- Our ML model learns hidden patterns from the "Train Data" and works perfectly on it: loss is low and accuracy is very high on "Train Data".
- But if we run the same model on "Test Data" and the loss is very high and the accuracy very low, that is called Overfitting.
- In the real world, we have to build ML models like either the 1st or 2nd case from the above image; the 3rd case represents an exact Overfitting scenario.
- Most real-world models fit the 2nd case, but the 1st one is pretty much perfect.
- A represents Underfit, it is not trained well, not identifying hidden patterns
- B represents a good/balanced/perfect model, ignoring noise and perfectly matching patterns
- C represents Overfit
- Drop Out Layer
- Early Stopping
- L2 Regularization
Drop Out Layer
- Ideally, all the neurons are active across the neural network
- If all neurons are always active, they all memorize things in each iteration, there is too much noise, and this leads to OVERFITTING.
- Randomly, in each iteration, the dropout layer switches off some nodes (let's say 20% of the neurons are switched off)
- During the 1st iteration, a random 20% of neurons are switched off; during the 2nd iteration, a different random 20% (note that the neurons switched off during the 1st iteration are active again in the second). In every iteration, only 20% of neurons are off in total and the remaining 80% stay active.
- We covered different patterns in each iteration, which will be helping our ML model
- Above image shows drop out layer in hidden layer
- The model automatically identifies which neurons to switch off, and it takes care of turning those neurons back on once the iteration is done; we just need to specify the Drop Out Rate.
- if the drop out rate = 0.5, then 50% of the neurons are switched off
- this happens during both forward propagation and backward propagation
- the output from a switched-off neuron is 0, so during backward propagation that neuron is simply not considered for the update in that iteration
- The Drop Out Layer applies only during "Training", not during "Testing" or "Production". DO NOT INCLUDE the Drop Out Layer while testing the data.
- In testing, all neurons are active.
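A minimal sketch of the dropout idea in plain Python (this uses the common "inverted dropout" scaling so the expected output stays the same during training; frameworks like PyTorch's nn.Dropout handle this for us):

```python
import random

def dropout(outputs, rate=0.2, training=True):
    """Inverted dropout: during training, each neuron's output is zeroed with
    probability `rate`, and survivors are scaled by 1/(1-rate) so the expected
    output stays the same. At test time, all neurons stay active."""
    if not training:
        return outputs  # no dropout during testing/production
    keep = 1.0 - rate
    return [o / keep if random.random() < keep else 0.0 for o in outputs]

random.seed(0)
layer_out = [0.5, 1.2, -0.3, 0.8, 2.0]
print(dropout(layer_out, rate=0.2))                  # some outputs zeroed at random
print(dropout(layer_out, rate=0.2, training=False))  # unchanged at test time
```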
Early Stopping
- This is another technique to prevent Overfitting.
- We stop "Training" the model when it stops improving on the "Validation" data (test data here), even if the train loss is still improving.
- Example: (7 iterations)
- Train Loss = 0.989, Test Loss = 0.899
- Train Loss = 0.976, Test Loss = 0.897
- Train Loss = 0.932, Test Loss = 0.896
- Train Loss = 0.854, Test Loss = 0.742
- Train Loss = 0.767, Test Loss = 0.736 (Patience starts counting here)
- Train Loss = 0.632, Test Loss = 0.735999
- Train Loss = 0.542, Test Loss = 0.735999
- There is a parameter called Patience; in this scenario our model will stop here, to avoid Overfitting.
- Patience is the hyperparameter here
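The 7-iteration example above can be expressed as a small early-stopping sketch (the min_delta threshold is an assumption: improvements smaller than it do not reset the patience counter, which is how common implementations behave):

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=1e-3):
    """Return the 0-based epoch at which training should stop: when the
    validation loss has not meaningfully improved (by more than min_delta)
    for `patience` consecutive epochs."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:  # counts as a real improvement
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # training ran to the end

# Test losses from the 7-iteration example above
val = [0.899, 0.897, 0.896, 0.742, 0.736, 0.735999, 0.735999]
print(early_stopping_epoch(val))  # stops once patience runs out at the end
```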
- TP - True Positive : Model predicted YES, actual is also YES
- TN - True Negative : Model predicted NO, actual is also NO
- FP - False Positive : Model predicted YES, actual is also NO
- FN - False Negative : Model predicted NO, actual is also YES
L2 Regularization
- This is a lengthy and somewhat confusing topic, and the last technique here to prevent Overfitting
- Let's see how it avoids Overfitting
- Regularization means controlling something with some rules, isn't it?
- We punish the model if the weights become too large
- So model forced to
- Keep weights small and simple
- Avoid memorizing
- More generalization
- The Loss above could be a regression loss or a classification loss; see the loss function formulas earlier
- The second part of the function, excluding the loss, is the PENALTY TERM, which controls the weights
- where λ is a hyperparameter; it should be balanced (or else the model will end up either Overfitting or Underfitting)
- if λ=0 then Cost Function = Loss Function (so it shouldn't be 0)
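A tiny sketch of the cost function with the L2 penalty term (λ = 0.01 and the weight values are illustrative only; in PyTorch this penalty is usually applied via the optimizer's weight_decay argument rather than written by hand):

```python
def l2_cost(loss, weights, lam=0.01):
    """Cost = Loss + λ * Σ w²: the penalty term punishes large weights.
    λ = 0.01 here is just an illustrative value."""
    penalty = lam * sum(w ** 2 for w in weights)
    return loss + penalty

small_weights = [0.1, -0.2, 0.05]
large_weights = [3.0, -4.0, 5.0]
print(l2_cost(0.5, small_weights))  # ~0.5005: tiny penalty, barely punished
print(l2_cost(0.5, large_weights))  # 1.0: large weights double the cost
```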
Learning Rate (η)
- This is a hyperparameter, used while adjusting weights and biases
- As per experiments, η = 0.1, 0.01, 0.001, 0.0001 are good values to choose
- If η is too low, like 0.0000001, our ML model takes lots of tiny steps; training takes too long and is too slow. If η is 10, it takes big jumps; it may get near the Global Minimum but in a zig-zag fashion. Hence the learning rate shouldn't be too large or too small; it must be moderate, which is why most research recommends the above values.
- First, manually train your model with different small learning rates, compare them, and at the end, if needed, use OPTUNA to come up with a balanced learning rate for your model.
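The too-small vs too-large behaviour of η can be seen on a toy problem; the sketch below runs gradient descent on f(w) = w², whose global minimum is at w = 0 (the function, starting point, and step count are illustrative assumptions):

```python
def final_weight(lr, start=10.0, steps=20):
    """Run gradient descent on the toy loss f(w) = w**2 (gradient 2w),
    starting from w = 10; the global minimum is at w = 0."""
    w = start
    for _ in range(steps):
        w = w - lr * 2 * w  # standard update: w_new = w_old - lr * gradient
    return w

print(final_weight(0.1))        # close to 0: a moderate rate converges smoothly
print(final_weight(0.0000001))  # still ~10: far too slow, barely any progress
print(final_weight(1.1))        # huge magnitude: too large, the steps diverge
```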