
(AI Blog#6) : Iris Data Set - Identify type of flower using Neural Networks

Let's understand how to identify the type of a flower from given input data using Neural Networks.

Below are the 3 types (classes) of iris flowers:

  • Iris setosa
  • Iris versicolor
  • Iris virginica

Each flower has two distinct parts called the petal and the sepal, and their measurements differ across the three species. Once training is done, our ML model should predict the flower type from these measurements when we provide unseen data (during the testing phase).

The feature values that we are considering are:

  • Petal length 
  • Petal width
  • Sepal length 
  • Sepal width
Problem statement :
Based on the above features, the ML model should predict which of the above three flowers a given sample belongs to.

Before digging into building/programming the above neural network, we need to understand one important Python library called sklearn.

Scikit-learn : Scikit-learn is one of the most popular machine learning libraries in Python. It is built for classical Machine Learning, not for Deep Learning. In this use case we are using a ready-made dataset from this library, the Iris dataset.

We use Scikit-learn when we need to :
  • Train ML models quickly (see the short sketch after this list)
  • Perform data preprocessing
  • Do model evaluation & selection
  • Apply ML to tabular data (CSV, DataFrames)
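
As a quick illustration of that workflow, here is a minimal sketch (not part of the use case below) that trains and evaluates scikit-learn's own LogisticRegression on the Iris data in a few lines; the model choice here is just for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data, split it, train a classical ML model and evaluate it
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=42)

clf = LogisticRegression(max_iter=200)            # simple classical classifier
clf.fit(X_tr, y_tr)                               # train
print(accuracy_score(y_te, clf.predict(X_te)))    # evaluate on held-out data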

Program :

import torch
import torch.nn as nn
import torch.optim as optim

from sklearn.datasets import load_iris # bring the load_iris function from the datasets module inside the sklearn library (it loads the Iris dataset from Scikit-learn)
from sklearn.model_selection import train_test_split # existing utility in sklearn used to split data into train and validation/test sets
from sklearn.preprocessing import StandardScaler # StandardScaler applies Z-score normalization (standardization) to the data

import matplotlib.pyplot as plt # importing matplotlib for plotting output graph



#Step1 : Load Dataset
data = load_iris()
X = data.data # load the values of key 'data' from dict 'data'
y = data.target # load the values of key 'target' from dict 'data'

"""
Please see the below keys and the data it contains

{'data': array([[5.1, 3.5, 1.4, 0.2], [Sepal_length, Sepal_width, Petal_length, Petal_width]


'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Here 0 means setosa, 1 means versicolor, 2 means virginica

This is the beauty of using Scikit-learn; in real-world projects, however, we usually need to preprocess the raw data ourselves.

"""

data # display the loaded Bunch object (dict-like) to inspect its contents
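
# Optional check (not in the original notebook): the column order and label mapping
# described above can be read directly from the Bunch object returned by load_iris()
print(data.feature_names) # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(data.target_names)  # ['setosa' 'versicolor' 'virginica']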


#Step2 : Train/Validation split : Splitting the input data into training and validation sets using train_test_split(); test_size specifies the fraction of data held out for validation


X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.4, random_state=42
)

print(X.shape)
print(y.shape)
print(X_train.shape)
print(X_val.shape)
print(y_train.shape)
print(y_val.shape)


#Step3 : Normalize

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # For training data we use fit_transform(): it learns the mean & standard deviation from the training set and applies the scaling
X_val = scaler.transform(X_val) # For validation data we use transform(): it reuses the mean & standard deviation learned from the training set instead of recomputing them, since this data is treated as unseen
# type(X_train)
# type(X_val)

# Converting the input data into tensors: the output of normalization is a NumPy array, but the PyTorch network expects tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
X_val = torch.tensor(X_val, dtype=torch.float32)

y_train = torch.tensor(y_train, dtype=torch.long)
y_val = torch.tensor(y_val, dtype=torch.long)
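
# For intuition: StandardScaler applies z = (x - mean) / std column-wise, using the mean and
# std learned from the training split only. Optional sanity check (not in the original notebook)
# that reproduces the scaling by hand via scaler.mean_ and scaler.scale_:
import numpy as np

raw_row  = scaler.inverse_transform(X_val[:1].numpy())   # one validation row back on the original scale
manual_z = (raw_row - scaler.mean_) / scaler.scale_      # z = (x - mean) / std, applied by hand
print(np.allclose(manual_z, X_val[:1].numpy()))          # expected: True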



#Part 1 : Overfitting model
class OverfitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 3)
        )

    def forward(self, x):
        return self.net(x)


# Training function
def train_model(model, optimizer, epochs=200):
    criterion = nn.CrossEntropyLoss()

    train_losses, val_losses = [], []

    for epoch in range(epochs):
        model.train()
        pred = model(X_train)
        loss = criterion(pred, y_train)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step() # Adjusting weights & biases

        model.eval() # Switching to evaluation mode for validation
        with torch.no_grad(): # No gradients are needed while computing the validation loss
            val_pred = model(X_val)
            val_loss = criterion(val_pred, y_val)

        train_losses.append(loss.item()) # Training loss
        val_losses.append(val_loss.item()) # Validation loss

    return train_losses, val_losses




# Train

model1 = OverfitModel()
opt1 = optim.Adam(model1.parameters(), lr=0.01)

train1, val1 = train_model(model1, opt1)



# Visualize overfitting
plt.plot(train1,label="Train Loss")
plt.plot(val1,label="Val Loss")
plt.legend()
plt.title("Overfitting Model")
plt.show()


# Regularized model
# Apply Dropout, L2 regularization (weight decay) and Early stopping techniques


class RegularizedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(64, 3)
        )

    def forward(self, x):
        return self.net(x)


# Early stopping Training
def train_with_early_stopping(model, optimizer, patience=15):
    criterion = nn.CrossEntropyLoss()

    best_loss = float('inf')
    counter = 0

    train_losses, val_losses = [], []

    for epoch in range(300):
        model.train()
        pred = model(X_train)
        loss = criterion(pred, y_train)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_pred = model(X_val)
            val_loss = criterion(val_pred, y_val)

        train_losses.append(loss.item())
        val_losses.append(val_loss.item())

        if val_loss < best_loss: # Validation loss improved: remember it and reset the patience counter
            best_loss = val_loss
            counter = 0
        else:
            counter += 1 # No improvement this epoch

        if counter >= patience: # Stop once validation loss has not improved for `patience` epochs
            print("Early stopping triggered")
            break

    return train_losses, val_losses


# L2 regularization via weight_decay in the Adam optimizer

model2 = RegularizedModel()
opt2 = optim.Adam(model2.parameters(), lr=0.01, weight_decay=1e-3)

train2, val2 = train_with_early_stopping(model2, opt2)


# Visualize after fix
plt.plot(train2,label="Train Loss")
plt.plot(val2,label="Val Loss")
plt.legend()
plt.title("Regularized Model")
plt.show()
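
# Using the trained model on unseen data (a minimal sketch, not part of the original notebook;
# the measurement values below are hypothetical, chosen only to illustrate the call sequence)
new_sample = [[5.9, 3.0, 5.1, 1.8]] # [sepal_length, sepal_width, petal_length, petal_width] in cm

new_sample = scaler.transform(new_sample)                  # reuse the scaling fitted on the training data
new_sample = torch.tensor(new_sample, dtype=torch.float32) # convert to tensor for the network

model2.eval()
with torch.no_grad():
    logits = model2(new_sample)                            # raw scores for the 3 classes
    predicted_class = torch.argmax(logits, dim=1).item()

print(data.target_names[predicted_class]) # prints one of 'setosa', 'versicolor', 'virginica'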




Feel free to download the ML code from the following GitHub location : https://github.com/amathe1/GenAI-AgenticAI-Hub/blob/main/iris_data_set.ipynb


Thank you for reading this blog !

Arun Mathe
