
(AI Blog#2) Building a Real-World Neural Network: A Practical Use Case Explained

This blog paints a clear picture of what happens inside a Neural Network (NN). But before going through NNs, we need some knowledge of a few basic concepts in Calculus (Maths) & the architecture of a Neural Network.

Note : 
I recommend reading the following blog (link mentioned below) first, and then start reading this blog.
Let's start with Derivatives.

Derivatives : 
                Derivatives are a core concept of calculus (maths). They answer one question : "How fast is something changing?" 

Why do derivatives appear in Machine Learning ?
  • Machine Learning uses Math as its foundation.
  • In ML, derivatives help answer : If I slightly change an ML model parameter, how will the error change ? (ignore what "error" means in this context; just understand that if we change some value/parameter in an ML model, how does it impact something else ?)
  • This is exactly what we need to 'train' ML models 
  • We use this in a concept called "Backpropagation", which we are going to discuss in this blog
Steps involved  :
  1. Model makes a prediction
  2. Calculate loss (error)
  3. Use derivatives to find :
    1. Which weight caused more error ?
    2. How much should each weight change ?
Simple example :
  • Think of standing on a hill ⛰️ and you want to reach the lowest point:
    • Derivative tells:
      • Which direction is downhill
      • How steep it is
  • ML does the same
    • Hill = loss function
    • Lowest point = Best model
    • Derivative =  Direction to update weights
Note : Don't worry if you don't yet understand anything except what a derivative is ! This entire blog talks about it. I am 100% sure that by the end of this blog each and every doubt will be clarified. This concept is a bit complex and needs multiple rounds of reading to digest.
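The hill analogy can be sketched in a few lines of plain Python (a toy example, not PyTorch): we repeatedly step opposite to the derivative of f(x) = x² and end up near its lowest point, x = 0.

```python
# Toy gradient descent on f(x) = x**2, whose derivative is f'(x) = 2x.
# The derivative tells us the downhill direction and the steepness.
def f_prime(x):
    return 2 * x

x = 5.0       # starting point on the "hill"
lr = 0.1      # learning rate: size of each downhill step
for _ in range(50):
    x = x - lr * f_prime(x)   # move opposite to the slope

print(x)      # very close to 0.0, the lowest point of the hill
```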


Formulas we need to learn for Derivatives :

  • d/dx(xⁿ) = n·xⁿ⁻¹
    • This is called Power rule
    • Examples :
      • d/dx(x⁵) = 5x⁴ 
      • d/dx(x³) = 3x²
      • d/dx(x²) = 2x 
      • d/dx(x)  = 1  
      • d/dx(x⁻²) = -2x⁻³
      • d/dx(√x) = (1/2)x⁻¹ᐟ² = 1/(2√x)
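A quick way to convince yourself of the power rule is a numerical check (a small sketch; `numerical_derivative` is just a helper name I made up):

```python
# Check d/dx(x**5) = 5*x**4 at x = 2 with a central finite difference.
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
approx = numerical_derivative(lambda t: t**5, x)
exact = 5 * x**4          # power rule
print(approx, exact)      # both are approximately 80.0
```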

  • d/dx(uv) = u · dv/dx + v · du/dx
    • This is called the Product rule
    • Example :
      • Consider u = x² and v = x³ and apply the formula
      • d/dx (x² · x³) = x²(3x²) + x³(2x) = 3x⁴ + 2x⁴ = 5x⁴
      • Check : x² · x³ = x⁵, and the power rule gives d/dx(x⁵) = 5x⁴ as well
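The product rule can be sanity-checked numerically at a sample point (a sketch; the helper function name is my own):

```python
# Product rule check: d/dx(x**2 * x**3) should equal 5*x**4.
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5
approx = numerical_derivative(lambda t: (t**2) * (t**3), x)
exact = 5 * x**4          # result from the product rule above
print(approx, exact)      # both are approximately 25.3125
```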

  • d/dx(g(x)ⁿ) = n·g(x)ⁿ⁻¹·g'(x)
    • This is called Chain rule
    • Example :
      •  d/dx(3x² + 5)⁴ = 4(3x² + 5)³ · d/dx(3x² + 5) = 4(3x² + 5)³ · 6x = 24x(3x² + 5)³
      • We might be tempted to apply "d/dx(xⁿ) = n·xⁿ⁻¹" BUT 
        • this example is a function raised to a power, g(x)ⁿ, instead of a simple x. Isn't it ?
        • hence we need to apply the chain rule and calculate both n·g(x)ⁿ⁻¹ & g'(x)
        • We need to be clear on when to apply the Power rule & the Chain rule
          • The Power rule comes into the picture when the base is just x, not a function
          • The Chain rule comes into the picture when the base is a function like g(x) or f(x)
          • Let's say we have d/dx(3x⁴ + 2x² − 7). The chain rule is not needed here. This is a polynomial (a sum of terms), hence we differentiate term by term and apply the power rule to each term as below.
            • d/dx(3x⁴ + 2x² − 7) = d/dx(3x⁴) + d/dx(2x²) − d/dx(7) = 12x³ + 4x − 0 = 12x³ + 4x
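The chain-rule example can also be checked numerically at a sample point (again a sketch with a made-up helper name):

```python
# Chain rule check: d/dx(3x**2 + 5)**4 should equal 24x(3x**2 + 5)**3.
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
approx = numerical_derivative(lambda t: (3 * t**2 + 5) ** 4, x)
exact = 24 * x * (3 * x**2 + 5) ** 3   # 24 * 1 * 8**3 = 12288
print(approx, exact)
```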

  • d²f/dx² = d/dx ( df/dx )
    • It means taking the derivative of a derivative
    • Example :
      • Function : f(x) = 4x³
      • First derivative : d/dx (4x³) = 12x²
      • Second derivative : d/dx (12x²) = 24x
      • Therefore : d²/dx² (4x³) = 24x 

  • ∂²f / ∂x ∂y 
    • First, ∂f/∂x → differentiate w.r.t x (keep y constant)
    • Then ∂/∂y ( ∂f/∂x ) → differentiate the result w.r.t y (keep x constant)
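A concrete example (my own choice of f) makes the mixed partial less abstract: for f(x, y) = x²y, ∂f/∂x = 2xy, and differentiating that w.r.t y gives ∂²f/∂x∂y = 2x. A numerical check:

```python
# Mixed partial ∂²f/∂x∂y for f(x, y) = x**2 * y.
# Analytically: ∂f/∂x = 2xy (treat y as constant), then ∂/∂y gives 2x.
def f(x, y):
    return x**2 * y

def d2f_dx_dy(x, y, h=1e-4):
    dfdx = lambda yy: (f(x + h, yy) - f(x - h, yy)) / (2 * h)  # ∂f/∂x at fixed y
    return (dfdx(y + h) - dfdx(y - h)) / (2 * h)               # then w.r.t. y

x, y = 3.0, 7.0
mixed = d2f_dx_dy(x, y)
print(mixed, 2 * x)   # both are approximately 6.0
```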
I believe, you are clear about derivatives and above formulas. Incase if you still not confident on using above formulas then I recommend you to practice more examples online to have some command on these concepts. 

Now, let's understand what PyTorch is, and one of its features called "Autograd". 

PyTorch :
  • PyTorch is an open-source deep learning framework developed by Meta that is used to build, train and deploy neural networks easily
  • It is a Python library that helps you write deep learning models
  • It provides
    • Tensors(like NumPy but faster + GPU support)
    • Automatic differentiation
    • Neural Network building blocks
    • Tools to train models
Note : I am preparing the corresponding Python code in Google Colab notebooks, to showcase :
  • How to enable GPU for PyTorch
  • How to import torch module
  • How to create a Tensor
  • How to enable Autograd etc
I will attach the programs here in some time.

Autograd :
  • We have seen derivatives and their usage in Neural Networks, right ? That's conceptual.
  • Autograd helps us achieve the same functionality programmatically using the PyTorch module
  • It is PyTorch's automatic differentiation engine
  • It automatically calculates gradients of tensors during backpropagation.
  • In simple words, Autograd tracks all operations on tensors and computes derivatives automatically.
  • We will see what gradient, tensor and backpropagation mean in this blog
We will use this Autograd functionality in LLMs as well, during a concept called Backward propagation. 
  • In our previous blog https://arunsdatasphere.blogspot.com/2026/01/deep-learning-and-neural-networks.html
    • We have learnt what is a weight and bias of a neuron in Neural Network
    • Using this backward propagation, we will adjust the values of weights and biases
    • Initially the values of weights and biases are random; after backward propagation these values are adjusted, and we do this using the Autograd functionality
    • To understand the Autograd functionality, we have to know derivatives. That's the reason I explained derivatives first and then started Autograd.
Look at this equation : 
  • Y = 2x + 3
    • In PyTorch, Autograd automatically calculates "Gradients" (i.e., derivatives)
    • The mathematical representation of "Gradients" is derivatives
    • Where exactly are we going to calculate "Gradients" ? In the "Training Phase" of the ML model
    • Autograd is nothing but an automatic differentiation engine; internally it creates a computation graph (like the DAG in Spark)
    • What is meant by a computation graph and how does it work ? :
      • Assume x = 2
      • Y = 2x + 3
        • Now x is multiplied with 2 
          • Step1 : 2 * 2 (2x)
          • Step2 : 4 added with 3 (2x + 3)
          • Step3 : 7
          • Y value is 7
        • Internally, Autograd will create a computation graph that includes the above steps. We call it a graph; it is a step-by-step process
Let's see the same in code :

Example 1 :

import torch

# Enabling autograd using flag requires_grad=True
x = torch.tensor(2.0, requires_grad=True)

# 2x +3
y = 2*x + 3

print(y)
# Output
# tensor(7., grad_fn=<AddBackward0>)

# It starts backward propagation and goes until the leaf node, which is the starting point of the computational graph
y.backward()

# print x.grad to see the gradient
# Please note, unless you execute y.backward(), backward propagation won't start
print(x.grad)
# Output : tensor(2.) because derivative of (2x + 3) is 2

Example 2 : We need to apply the chain rule here, as it is a function raised to a power, g(x)ⁿ, i.e. d/dx(2x+3)²

import torch

x1 = torch.tensor(3.0, requires_grad=True)
y = (2*x1 + 3)**2
print(y) # tensor(81., grad_fn=<PowBackward0>)
y.backward() # backward propagation initiated
print(x1.grad) # Printing gradient of x1 : tensor(36.)
x1.grad.zero_()
# if we don't call x1.grad.zero_() then
# gradients will be accumulated each time you run backward()

"""
Example:
Find d/dx(2x+3)²

We apply the formula:

d/dx[g(x)ⁿ] = n[g(x)]ⁿ⁻¹ · g'(x)

Step 1: Identify inner function
g(x) = 2x + 3

Step 2: Identify power
n = 2

Step 3: Apply formula

d/dx(2x+3)²
= 2(2x+3)²⁻¹ · d/dx(2x+3)

= 2(2x+3)¹ · 2

Step 4: Simplify

= 4(2x+3)

Final Answer:
d/dx(2x+3)² = 4(2x+3)

"""

One quick question : 
  • Is this backward propagation happening at neuron level or layer level or entire Neural Network ?
    • backward propagation updates all weights & biases across every layer of the neural network, not just the output layer
    • See below image for visual clarity

What is a Regression Vs Classification problem ?
  • Let us consider that we have 2 input values x1, x2 and a target value y
    • If our target contains continuous (numeric) data, then it is a Regression problem
    • If our target consists of categorical data, then it is a Classification problem
  • Examples :
    • Predicting the salaries of employees is a Regression problem (because salaries are continuous values; they can vary without limit)
    • Predicting an email is spam or not is a Classification problem(because it has only 2 categories, spam or not spam)

Let us solve a regression problem to understand the entire flow of a Neural Network.

Problem statement : Predict house prices (in $1000) based on below 2 features
  • Features are input values :
    • x1 = House Size
    • x2 = No. of Bedrooms
  • Output
    • y = House Price (the output, which is continuous data, hence regression)

Now, let's remember this generic formula :
  • Wᵏᵢⱼ where 
    • k = index of the hidden layer we are hitting
    • i = input neuron
    • j = hidden neuron
  • remember this formula to notate weight values for all the above connections 
  • See below image, assigned weights for all connections as per above formula

Understanding Non-linearity and activation function :
  • We are going to deal with complex neural networks, which are represented with curved lines in n-dimensional space; this is called non-linearity.
  • For non-linearity, we have different activation functions available like
    • ReLu - most popular
    • Sigmoid - binary classification
    • Tanh - centered data
    • Softmax - multiclass output
  • In simple words :
    • After computing Z = w1x1 + w2x2 + b (as per above neural network diagram)
    • We apply activation function, a = f(Z), where f() is activation function to add non-linearity 
    • Which helps model to learn complex relationships
  • We use ReLu (Rectified linear unit)
    • ReLu(x) = max(0, x)
    • Means if x <= 0 then ReLu is 0 else if x > 0 then ReLu is x
  • Real world analogy
    • For electrical switch :
      • if INPUT is <= 0 ; OFF
      • if INPUT is > 0 ; ON
  • ReLU (Rectified Linear Unit) outputs zero for negative inputs and passes positive values as-is, making it fast, efficient, and widely used in deep learning.
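ReLU and its derivative are two of the shortest functions in deep learning; this small sketch is all we need for the backpropagation section later:

```python
# ReLU and its derivative (the derivative is used during backpropagation).
def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

print(relu(-2.0), relu(1.85))                        # 0.0 1.85
print(relu_derivative(-2.0), relu_derivative(0.25))  # 0.0 1.0
```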
Understanding Loss function :
  • A loss function measures how wrong the model's prediction is compared with the actual answer
    • Loss = Error
  • Just as there are multiple activation functions, there are multiple loss functions
    • We use MSE (Mean Squared Error)
      • For a single record, MSE = (ypred - ytrue)^2 ; where
        • ypred is the predicted value of Y
        • ytrue is the actual value of Y
      • we square the difference to avoid negative values (and larger errors are penalized more heavily)
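In code, the per-record squared-error loss is one line. Plugging in the numbers worked out later in this blog (prediction 1.585 against the true price 250, both in $1000s) shows how large the initial loss is:

```python
# Per-record squared-error loss, Loss = (y_pred - y_true)**2.
def squared_error(y_pred, y_true):
    return (y_pred - y_true) ** 2

print(round(squared_error(1.585, 250), 2))   # 61710.01 -- a huge initial loss
```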

Lets take sample data : 
  • This problem will help us understand how to approach any Neural Network problem
  • Lets take 2 records of sample data and solve this problem using NN
  • At the end of model, we will calculate predicted value which is Y^ (we call as Y hat)
  Record    X1 (Size of House)    X2 (No. of Bedrooms)    Y (Price)    Y^ (?)
  R1        1.5                   3                       250          ?
  R2        2.0                   4                       320          ?

Next step is, whenever we are going to execute a Regression or Classification problem, we have to execute some steps.
  • Forward propagation (involves 2 steps as below)
    • Linear transformation : use the formula Z = w1x1 + w2x2 + b
    • And then apply the activation function
      • We use the ReLu activation function, ReLu(Z)
    • The output of Forward propagation is the predicted value, Y^ (Y hat)
  • Calculate Loss 
    • The Loss depends on whether the problem is Regression or Classification
    • We are dealing with Regression, and we discussed the MSE type of loss above
    • Loss = (ypred - ytrue)^2
  • Backward propagation
    • Internally we are calculating the gradient here (derivatives as discussed above)
  • Adjust the weights and biases by using below formulas
    • New Weight : W_new = W_old − η · ( ∂L / ∂W_old )
    • New Bias : B_new = B_old − η · ( ∂L / ∂B_old )
    • Where η is the learning rate and ∂L/∂W_old is the derivative of the Loss with respect to the old weight
Now, let's apply the above steps to the sample data we considered. We take the first record (R1) for this example. Note, we are not using R2 here.

Step1 : Forward propagation 
  • We need to apply linear transformation, which is Z = w1x1+w2x2+b
    • Understand what data we have at this point, and what we need
    • We have input values for record, R1 (x1 = 1.5, x2 = 3)
    • Now we need to define weights & biases as random values to start this process
    • Below are weights & biases for input layer to hidden layer
      • Below are random weights (randomly taken, could be any values)
        • [W11  W12] = [0.5   0.3]
        • [W21  W22] = [-0.2  0.4]
      • Biases (these are input to 1st hidden layer)
        • [b1] = [0.1]
        • [b2] = [0.2]
    • Below are weights & biases for hidden layer to output layer
      • Weights are
        • [W1   W2] = [0.7   0.6]
      • Biases are
        • b3 = 0.3
    • Now observe all the random values like input values, weights and biases are ready
    • Neural Network looks as below at this stage

    • Lets calculate the forward propagation
      • x1 = 1.5, x2 = 3
    • Lets calculate Z = w1x1 + w2x2 + b
      • But we have 2 hidden nodes right ? Hence we need to calculate 2 equations
      • For h1, Z1 = W11*x1 + W21*x2 + b1 = (0.5 * 1.5) + (-0.2 * 3) + 0.1 = 0.25
      • For h2, Z2 = W12 * x1 + W22 * x2 + b2 = (0.3 * 1.5) + (0.4 * 3) + 0.2 =  1.85
    • Now, we have to apply activation function for both Z1, Z2 (adding non linearity)
      • We are using ReLu activation function where ReLu(Z) = max(0, Z)
      • ReLu of :
        • Z1 is ReLu(0.25) = 0.25
        • Z2, ReLu(1.85) = 1.85
    • Understand that we have completed the (input layer → hidden layer) portion now, and we have to repeat the same process for (hidden layer → output layer)
      • Y^(Y hat) = W1*h1 + W2*h2 + b3 (we have only one neuron at o/p layer)
      • W1 & W2 represents 2nd random weights that we considered above, 
      • [W1   W2] = [0.7   0.6] & h1, h2 are ReLu(Z1), ReLu(Z2) i.e [0.25, 1.85]
      • Y^(Y hat) = (0.7 * 0.25) + (0.6 * 1.85) + 0.3 = 1.585
      • Y^ = 1.585
    • In the problem statement, we stated that the house price is in $1000s, hence multiplying Y^ by 1000 gives the price in dollars
    • Y^ = 1.585 * 1000 = $1,585 
      • Note that for record R1, the actual price of the house is 250 (in $1000s), i.e., $250,000
    • We completed step1 now
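The whole forward pass above fits in a few lines of plain Python (same weights and inputs as the worked example):

```python
# Forward pass for record R1 with the random weights chosen above.
x1, x2 = 1.5, 3.0

# input -> hidden weights and biases
W11, W12, W21, W22 = 0.5, 0.3, -0.2, 0.4
b1, b2 = 0.1, 0.2
# hidden -> output weights and bias
W1, W2, b3 = 0.7, 0.6, 0.3

relu = lambda z: max(0.0, z)

z1 = W11 * x1 + W21 * x2 + b1       # 0.25
z2 = W12 * x1 + W22 * x2 + b2       # 1.85
h1, h2 = relu(z1), relu(z2)
y_hat = W1 * h1 + W2 * h2 + b3      # 1.585
print(round(z1, 2), round(z2, 2), round(y_hat, 3))
```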

Step2 : Calculate Loss 
  • Loss = (Y^ - Y)**2 = (1.585 - 250)**2 = 61,710.0122
    • You may wonder whether we should have used Y^ = 1585 here. No; that was only for our understanding. The loss must compare Y^ and Y in the same units ($1000s), and Y = 250 is already in $1000s. If we used Y^ = 1585 (dollars), we would have to multiply 250 by 1000 as well.
  • This is the actual gap between predicted price and actual price !


Step3 : Backward propagation
  • This step is very important
    • Initially, we considered random values for weights, biases for all connections and neurons and calculated Loss
    • Now we have to find the derivative of the Loss with respect to each and every weight and bias in the entire neural network
    • This is how we minimize the loss
    • This is a very tedious process, and this is what happens in the core of a Neural Network with hundreds or thousands of neurons across multiple hidden layers
    • This is why we need GPUs and TPUs for AI/ML : huge numbers of mathematical calculations happen in parallel
    • PLEASE TRY TO UNDERSTAND THIS ENTIRE PROCESS CAREFULLY TO HAVE A SOUND KNOWLEDGE OF NEURAL NETWORKS.


  • Now we need to find the derivatives in backward propagation
  • As per above diagram, backward propagation start from Loss and adjust each value accordingly using derivatives in reverse order
  • Now lets calculate the gradients/derivatives from Loss --> Y^ (Y hat)
    • Nothing but finding the derivative of Loss based on the derivative of Y^ 
  • Formula for Loss calculation is, Loss = (Y^ - Y)**2 
    • Derivative of Loss with respect to Y^ : d(Loss)/dY^
    • d(Loss)/dY^ = d/dY^((Y^ - Y)²)   (replaced Loss with the Loss formula above)
    • We need to apply the chain rule, d/dx(g(x)ⁿ) = n·g(x)ⁿ⁻¹·g'(x)
    • Hence d/dY^((Y^ - Y)²) = 2(Y^ - Y) · d/dY^(Y^ - Y) = 2(Y^ - Y) · 1 = 2(Y^ - Y)
      • we are differentiating w.r.t Y^, hence d/dY^(Y^ - Y) = 1 - 0 = 1
    • Now the derivative of Loss w.r.t Y^ = 2(Y^ - Y) = 2(1.585 - 250) = -496.83
      • as we calculated, Y^ = 1.585 & Y = 250 
      • The gradient from Loss to Y^ is -496.83 (this is dL/dY^)
  • Now lets calculate the gradients/derivatives for output layer
    • We need to calculate 3 things
      • dL/dw1
      • dL/dw2
      • dL/db3
    • But Loss is not directly connected to W1, W2, b3, right ? There is an intermediate variable, y^. Hence the paths are :
      • Loss --> y^ --> W1
      • Loss --> y^ --> W2
      • Loss --> y^ --> b3
    • Hence chain rule applies here as follows : (as we calculated above dL/dy^ = -496.83)
      • dL/dw1 = dL/dy^ * dy^/dw1 
      • dL/dw2 = dL/dy^ * dy^/dw2
      • dL/db3 = dL/dy^ * dy^/db3
    • So we need to calculate 
      • dy^/dw1
      • dy^/dw2
      • dy^/db3
    • But during Forward propagation, y^ = w1h1+w2h2+b3
        • dy^/dw1 = 0.25 (see below steps)
          • dy^/dw1 = d/dw1(w1h1 + w2h2 + b3)
          • = d/dw1(w1h1) + d/dw1(w2h2) + d/dw1(b3)
          • = d/dw1(w1h1)   (the other terms don't contain w1, so their derivatives are 0)
          • = h1, and h1 is 0.25, hence dy^/dw1 = 0.25
        • dy^/dw2 = 1.85 (see below steps)
          • dy^/dw2 = d/dw2(w1h1 + w2h2 + b3)
          • = d/dw2(w1h1) + d/dw2(w2h2) + d/dw2(b3)
          • = d/dw2(w2h2)
          • = h2, and h2 is 1.85
        • dy^/db3 = 1
          • dy^/db3 = d/db3(w1h1 + w2h2 + b3)
          • = d/db3(b3)
          • = 1
    • Finally :
      • dL/dw1 = dL/dy^ * dy^/dw1 = -496.83 * 0.25 = -124.21
      • dL/dw2 = dL/dy^ * dy^/dw2 = -496.83 * 1.85 = -919.14
      • dL/db3 = dL/dy^ * dy^/db3 = -496.83 * 1 = -496.83
  • Until now, we calculated gradients until hidden layer and we need to calculate derivatives for rest of the weights, biases in the neural network
  • Lets calculate for h1, h2
    • dL/dh1 = dL/dy^ * dy^/dh1 = -496.83 * 0.7 = -347.78
    • dL/dh2 = dL/dy^ * dy^/dh2 = -496.83 * w2 = -496.83 * 0.6 = -298.10
    • Note dL/dy^ = -496.83
      • dy^/dh1= d/dh1(y^)
      • =d/dh1(w1h1+w2h2+b3)
      • = d/dh1(w1h1)   (ignore the other terms; they don't contain h1, so their derivatives are zero)
      • =w1
      • =0.7
    • as per NN, path is
      • Loss --> y^ --> h1
      • Loss --> y^ --> h2
  • Remember, we applied ReLu for h1, h2 during forward propagation. Hence we need to find out derivatives for ReLu as well.
    • if Z > 0  : derivative  = 1
    • if Z <= 0 : derivative = 0
    • dL/dz = dL/dh * dh/dz
    • and we have to calculate it for both z1, z2
    • Note : 
      • z1 = 0.25, which is > 0, so the ReLu derivative is 1
      • z2 = 1.85, which is > 0, so the ReLu derivative is 1
    • Now 
      • dL/dz1 = dL/dh1 * dh1/dz1 = -347.78 * 1 = -347.78
      • dL/dz2 = dL/dh2 * dh2/dz2 = -298.10 * 1 = -298.10
      • We already know the values of dL/dh1 & dL/dh2, i.e. (-347.78 & -298.10)
      • We need to calculate :
        • dh1/dz1 = ReLu'(z1) = 1 (since z1 = 0.25 > 0)
        • dh2/dz2 = ReLu'(z2) = 1 (since z2 = 1.85 > 0)
  • Final step
    • Remember generic formula, Z = w1x1 + w2x2 + b
    • AND
      • z1 = w11x1+w21x2+b1
      • z2 = w12x1+w22x2+b2
    • For w11
      • dL/dw11 = dL/dz1 * dz1/dw11 (we calculated dL/dz1 = -347.78)
      • Need to calculate dz1/dw11 = d/dw11(z1)=d/dw11(w11x1+w21x2+b1)=x1=1.5
      • Now dL/dw11 = dL/dz1 * dz1/dw11 = -347.78 * 1.5 = -521.67
    • For w21
      • dL/dw21 = dL/dz1 * dz1/dw21 (we calculated dL/dz1 = -347.78)
      • Need to calculate dz1/dw21=d/dw21(z1)=d/dw21(w11x1+w21x2+b1)= x2= 3
      • Now dL/dw21 = dL/dz1 * dz1/dw21 = -347.78 * 3 = -1043.34
    • For b1
      • dL/db1 = dL/dz1 * dz1/db1 (but dL/dz1 is -347.78)
      • Need to calculate dz1/db1 = d/db1(z1)=d/db1(w11x1+w21x2+b1)=1
      • Now dL/db1 = dL/dz1 * dz1/db1 = -347.78 * 1 = -347.78
    • We completed all values for z1, need to calculate z2
    • For W12
      • dL/dw12 = dL/dz2 * dz2/dw12 (we calculated dL/dz2 = -298.10)
      • dz2/dw12 = d/dw12(z2) = d/dw12(w12x1+w22x2+b2) = x1 = 1.5
      • Now dL/dw12 = dL/dz2 * dz2/dw12 = -298.10 * 1.5 = -447.15
    • For W22
      • dz2/dw22 = d/dw22(z2) = d/dw22(w12x1+w22x2+b2) = x2 = 3
      • Now dL/dw22 = dL/dz2 * dz2/dw22 = -298.10 * 3 = -894.30
    • For b2
      • dL/db2 = dL/dz2 * dz2/db2= (-298.10 * dz2/db2)=(-298.10 * 1)=-298.10
      • dz2/db2 =d/db2(z2)=d/db2(w12x1+w22x2+b2)=1
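All of the hand-computed gradients above can be reproduced in a few lines of plain Python (same numbers, up to small rounding differences):

```python
# Backward pass for record R1, using the forward-pass values from above.
x1, x2 = 1.5, 3.0
h1, h2 = 0.25, 1.85        # ReLU outputs from the forward pass
W1, W2 = 0.7, 0.6          # hidden -> output weights
y_hat, y_true = 1.585, 250.0

dL_dyhat = 2 * (y_hat - y_true)        # -496.83
# output layer
dL_dW1, dL_dW2, dL_db3 = dL_dyhat * h1, dL_dyhat * h2, dL_dyhat
# into the hidden layer (ReLU derivative is 1 since z1, z2 > 0)
dL_dz1, dL_dz2 = dL_dyhat * W1, dL_dyhat * W2
# input-layer weights and biases
dL_dW11, dL_dW21, dL_db1 = dL_dz1 * x1, dL_dz1 * x2, dL_dz1
dL_dW12, dL_dW22, dL_db2 = dL_dz2 * x1, dL_dz2 * x2, dL_dz2

print(round(dL_dW1, 2), round(dL_dW11, 2), round(dL_dW22, 2))
# -124.21 -521.67 -894.29
```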

Step4 : Adjust weights & Biases
  • W_new = W_old − η · ( ∂L / ∂W_old )
  • Lets assume η = 0.001
  • Lets calculate W1_new
    • W1_new = 0.7 - (0.001) * dL/dw1 = 0.7 - (0.001) * (-124.21) = 0.8242
    • The old W1 value was 0.7, and the weight is adjusted to 0.8242
    • This is how weights and biases will be adjusted in neural network
  • Similarly, we have to calculate W2_new, W11_new, W12_new, W21_new, W22_new, b1_new, b2_new and b3_new (note that h1 and h2 are activations, not parameters; they are recomputed in the next forward pass, not updated)
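The update rule in code, using the gradient dL/dW1 = −124.21 computed above and η = 0.001:

```python
# Gradient-descent update: W_new = W_old - lr * dL/dW_old
lr = 0.001                 # learning rate (eta)
W1_old, dL_dW1 = 0.7, -124.21
W1_new = W1_old - lr * dL_dW1
print(round(W1_new, 4))    # 0.8242
```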

This entire process is just one iteration of the neural network; these iterations continue for n number of times, and the old values are adjusted after every iteration. Programmatically, this is why we call x1.grad.zero_() between iterations when using tensors via PyTorch: it clears the accumulated gradients before the next backward pass.

Kindly note that this entire process is repeated until y^ (the predicted output) is very close to y (the original output). Not necessarily equal, but we want to predict a value that is close to the actual value, isn't it ?

Consider a 2D graph, representing Loss in y-axis and number of iterations in x-axis, as the number of iterations increase, loss will start decreasing and we have to iterate until loss is bare minimal. That's the expectation. Anyways, we don't do all this manually BUT THIS IS WHAT WILL HAPPEN IN NEURAL NETWORK.
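Putting all four steps in a loop shows exactly that picture. This is a sketch with the same starting weights as above; note I use a smaller learning rate (0.0001, my own choice) and no input normalization, so it is only for illustration, but the loss visibly shrinks over the first iterations:

```python
# Repeated forward pass -> loss -> backward pass -> update for record R1.
x1, x2, y_true = 1.5, 3.0, 250.0
W11, W12, W21, W22, b1, b2 = 0.5, 0.3, -0.2, 0.4, 0.1, 0.2
W1, W2, b3 = 0.7, 0.6, 0.3
lr = 0.0001

relu = lambda z: max(0.0, z)
drelu = lambda z: 1.0 if z > 0 else 0.0

losses = []
for step in range(5):
    # Step 1: forward propagation
    z1 = W11 * x1 + W21 * x2 + b1
    z2 = W12 * x1 + W22 * x2 + b2
    h1, h2 = relu(z1), relu(z2)
    y_hat = W1 * h1 + W2 * h2 + b3
    # Step 2: loss
    losses.append((y_hat - y_true) ** 2)
    # Step 3: backward propagation (chain rule, exactly as derived above)
    d_yhat = 2 * (y_hat - y_true)
    dW1, dW2, db3 = d_yhat * h1, d_yhat * h2, d_yhat
    dz1 = d_yhat * W1 * drelu(z1)
    dz2 = d_yhat * W2 * drelu(z2)
    # Step 4: update weights and biases
    W1 -= lr * dW1; W2 -= lr * dW2; b3 -= lr * db3
    W11 -= lr * dz1 * x1; W21 -= lr * dz1 * x2; b1 -= lr * dz1
    W12 -= lr * dz2 * x1; W22 -= lr * dz2 * x2; b2 -= lr * dz2

print([round(l, 1) for l in losses])   # the loss decreases every step
```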


How does ML model decides to stop this continuous loop ?
  • First way : The manual way is defining the epochs; let's say if epochs=50, then this loop will stop after 50 iterations
  • Second way : Automatic stopping, e.g., early stopping based on validation loss. A separate library called Optuna (which integrates with PyTorch) can also stop/prune unpromising training runs automatically

BTW, a simplified way of the algorithm mentioned in this current blog is written in the below blog:
https://arunsdatasphere.blogspot.com/2026/01/ai-blog3-deep-learning-foundations.html

Please read Gradient Descent section, especially check images and hand written graphs.


That's all for this blog! Have a good day.


Thanks,
Arun Mathe
Email ID : arunkumar.mathe@gmail.com
