
(AI Blog#7) Importance of Dataset, Dataloader and Optuna Package

While designing a neural network, we can use Dataset and DataLoader for efficient memory management. Think of them as the backbone for feeding data to a machine learning model. We will also use the Optuna package, a hyperparameter optimization library that works well with PyTorch, to tune the hyperparameters.

Agenda of this blog :

  • How Dataset and DataLoader make a model memory efficient
  • How to use the Optuna package to select good hyperparameters

Hyper parameters in a Neural Network :

  • Learning rate
  • Batch size
  • No. of epochs
  • Drop out ratio
  • No. of hidden layers
  • No. of hidden units
  • Normalization - Batch norm etc.
  • Optimizers - Adam, Momentum, NAG, RMS Prop etc.
Generally, we try the above hyperparameters manually with different reasonable values, test the model, and settle on the values that suit it best. This manual approach takes time, because we have to explore many combinations of values before we land on the right parameters.

The Optuna package makes this process much easier. We will see how in this blog, with a proper programmatic explanation.


Dataset : 

A dataset is the collection of data that you use to train, validate, and test a neural network. A Dataset holds the raw data and serves it one record at a time (we will see exactly what that means later in this blog).

A Dataset knows :
  • How many items exist, i.e. what data it has
  • How to fetch one record by its index
  • How to apply pre-processing/transforms to that record
These points are what qualify something as a Dataset. A Dataset never returns batches, never shuffles the data, and never decides the training order.


Data Loader :

A DataLoader takes data from a Dataset and decides the order in which it reaches the model.
  • It groups data into mini batches
  • It shuffles and loads the data efficiently, optionally using multiple workers

Note that Dataset and DataLoader are classes in the PyTorch library. We create the Dataset first and then wrap it in a DataLoader. We will use both in neural networks and also in LLMs.


Consider a small sample table with columns index, age, monthly spend amount, and churn (whether the customer will stay with our application or move to some other application). Let's use it as an analogy to understand these concepts more deeply:

  • Using the index numbers 0, 1, 2, 3, 4 we can fetch the corresponding full record from the sample data
  • There are 5 samples in total
  • if index(3) : Age=40, MonthlySpend=5000, Churn=1
  • Storing an index is cheaper than storing the entire record, isn't it? Once the index is stored, we can fetch the record through it, which is memory efficient
  • The first step of the DataLoader is to create an index list
    • index = [0, 1, 2, 3, 4]
    • where 0 represents the 1st row, 1 represents the 2nd row, and so on
  • Optionally, the DataLoader will shuffle
    • Shuffling helps model accuracy, because the model cannot latch on to the order of the data
    • For example, if shuffle = True then the shuffled index might be [2, 4, 1, 3, 0]
    • Note that we are not shuffling the data itself, just the index
  • The DataLoader then creates mini batches, but how?
    • if Batch size = 2 then the DataLoader will create batches as below
      • Batch 1 = [2, 4] ; [2] --> (22, 1800), 1 ; [4] --> (35, 4200), 1
      • Batch 2 = [1, 3]
      • Batch 3 = [0]
    • that's how the data is organized by the DataLoader (see the short sketch after this list)
    • Remember mini batch gradient descent during backpropagation? Something similar is happening here in the DataLoader
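
To make this concrete, here is a minimal sketch using PyTorch's built-in TensorDataset. Rows 2-4 match the records quoted above; the values for rows 0 and 1 are placeholders, since the point is only to watch batches of indices being formed and shuffled.

import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy version of the sample table: (age, monthly spend) -> churn
# Rows 0 and 1 hold placeholder values, rows 2-4 are the records quoted above
features = torch.tensor([[30., 1000.],
                         [28., 2500.],
                         [22., 1800.],
                         [40., 5000.],
                         [35., 4200.]])
churn = torch.tensor([0., 0., 1., 1., 1.])

dataset = TensorDataset(features, churn)                 # knows its length and how to fetch one record
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterating yields mini batches of 2, 2 and 1 samples in a shuffled order,
# e.g. indices [2, 4], then [1, 3], then [0]
for batch_features, batch_churn in loader:
    print(batch_features, batch_churn)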

Very Important point :
 
  • In the code, when we declare the Dataset, the data is not loaded yet.
  • No samples have been read at this point (right after declaring the Dataset)
  • So when is the data actually loaded? Only when we iterate over the DataLoader :
for batch in dataloader:
    train(batch)

This is called Lazy Evaluation. This is why Dataset & DataLoader are memory efficient. 

Guys, this is a very important point to remember, and we use it across GenAI and Agentic AI.
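
To see this lazy behavior for yourself, here is a tiny sketch (the LoggingDataset class is hypothetical, purely for illustration). Nothing is printed while the Dataset and DataLoader are being created; the "reading sample" messages only appear once we start iterating.

import torch
from torch.utils.data import Dataset, DataLoader

class LoggingDataset(Dataset):

  def __init__(self, n):
    self.n = n

  def __len__(self):
    return self.n

  def __getitem__(self, idx):
    print(f"reading sample {idx}")             # runs only while iterating the DataLoader
    return torch.tensor([float(idx)]), torch.tensor(0.)

dataset = LoggingDataset(4)                    # nothing printed: no sample read yet
loader = DataLoader(dataset, batch_size=2)     # still nothing printed

for batch_features, batch_labels in loader:    # now the "reading sample ..." lines appear
    pass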




Problem statement : We are trying to predict whether a person may be diagnosed with cancer or not, based on certain measurements. Please check the code below; I have printed the first 5 rows of data to give you an idea. There are 30 input columns, and the output is a binary label indicating whether the person is diagnosed with cancer or not.

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder   # LabelEncoder converts string labels into numerical values the model can work with


# Creating a dataframe by reading the input data from the CSV file below
df = pd.read_csv('/content/data.csv')
df.head() # shows the first 5 rows

# 569 rows and 33 columns
df.shape

# Removing columns that don't add any meaning to the model:
# 'id' is just an identifier and 'Unnamed: 32' is an empty column in this CSV
df.drop(columns=['id', 'Unnamed: 32'], inplace=True)

df.head()


# Rows : all ; Columns : all from the 2nd column onwards (skipping the 1st column 'diagnosis', which is the target)
df.iloc[:, 1:]

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0], test_size=0.2)


y_test


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


X_train
y_train

# Applying LabelEncoder to convert the character labels in the target column into numbers
# (StandardScaler was only applied to the numeric feature columns; it would throw an error on non-numerical values, which is why the target is encoded separately)
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train) # fit_transform() learns the label-to-integer mapping from the training labels
y_test = encoder.transform(y_test) # transform() reuses that mapping on the test labels (unseen data)


y_train


# Converting the NumPy arrays into tensors that the PyTorch model can work with
X_train_tensor = torch.from_numpy(X_train.astype(np.float32))
X_test_tensor = torch.from_numpy(X_test.astype(np.float32))
y_train_tensor = torch.from_numpy(y_train.astype(np.float32))
y_test_tensor = torch.from_numpy(y_test.astype(np.float32))

X_train_tensor.shape

y_train_tensor.shape


from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):

  # __init__() stores the data the Dataset will serve
  # features are the input variables, labels are the target values (the two classes in the diagnosis column)
  def __init__(self, features, labels):

    self.features = features
    self.labels = labels

  # __len__() returns the total number of samples.
  def __len__(self):

    return len(self.features)

  # __getitem__(idx) returns the (features, label) pair at the given index.
  def __getitem__(self, idx):

    return self.features[idx], self.labels[idx]



# Creating instance for train, test dataset
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)


# sample at index 10 (the 11th record)
train_dataset[10]


# Creating DataLoader instances
# batch_size=32; shuffle the training data (there is no need to shuffle the test data)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)



import torch.nn as nn


class MySimpleNN(nn.Module):

  def __init__(self, num_features):

    super().__init__()
    self.linear = nn.Linear(num_features, 1)
    self.sigmoid = nn.Sigmoid()

  def forward(self, features):

    out = self.linear(features)
    out = self.sigmoid(out)

    return out



learning_rate = 0.1
epochs = 25


# create model
model = MySimpleNN(X_train_tensor.shape[1])

# define optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# define loss function
loss_function = nn.BCELoss()



# define loop
for epoch in range(epochs):

  for batch_features, batch_labels in train_loader:

    # forward pass
    y_pred = model(batch_features)

    # calculate the loss
    loss = loss_function(y_pred, batch_labels.view(-1,1))

    # clear gradients
    optimizer.zero_grad()

    # backward pass
    loss.backward()

    # parameters update
    optimizer.step()

  # print loss in each epoch
  print(f'Epoch: {epoch + 1}, Loss: {loss.item()}')



# Model evaluation using test_loader
model.eval()  # Set the model to evaluation mode
accuracy_list = []

with torch.no_grad():
    for batch_features, batch_labels in test_loader:
        # Forward pass
        y_pred = model(batch_features)
        y_pred = (y_pred > 0.8).float()  # Convert probabilities to binary predictions (0.8 is a stricter cut-off than the usual 0.5)

        # Calculate accuracy for the current batch
        batch_accuracy = (y_pred.view(-1) == batch_labels).float().mean().item()
        accuracy_list.append(batch_accuracy)

# Calculate overall accuracy
overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Accuracy: {overall_accuracy:.4f}')


Please feel free to download code from :
https://github.com/amathe1/GenAI-AgenticAI-Hub/blob/main/Pytorch_Dataset_%26_Dataloader_Classes.ipynb



Optuna :

Optuna is a hyperparameter optimization library. It is a separate package that integrates smoothly with PyTorch (and other frameworks).

The hyperparameters listed at the start of this blog (learning rate, batch size, number of epochs, dropout ratio, number of hidden layers, number of hidden units, normalization, optimizer) are all candidates for tuning. Hidden units are simply the hidden neurons within a single hidden layer. If you describe a search space for these parameters and an objective to optimize, Optuna will search for the hyperparameter values that work best for your model and data.

Make sure to install Optuna in your Google Colab as below :
  • !pip install optuna
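
Here is a minimal sketch of how an Optuna study can be set up, assuming the tensors (X_train_tensor, y_train_tensor, X_test_tensor, y_test_tensor) from the code above are already in memory. The search space here (lr, hidden_units, dropout, optimizer) is a simplified illustration; the actual notebook linked below tunes a richer set of parameters.

import optuna
import torch
import torch.nn as nn

def objective(trial):

  # Search space: Optuna suggests a value for each hyperparameter in every trial
  lr = trial.suggest_float('lr', 1e-4, 1e-1, log=True)
  hidden_units = trial.suggest_int('hidden_units', 16, 256)
  dropout = trial.suggest_float('dropout', 0.1, 0.5)
  optimizer_name = trial.suggest_categorical('optimizer', ['Adam', 'SGD', 'RMSprop'])

  # Small model built from the suggested values
  model = nn.Sequential(
      nn.Linear(X_train_tensor.shape[1], hidden_units),
      nn.ReLU(),
      nn.Dropout(dropout),
      nn.Linear(hidden_units, 1),
      nn.Sigmoid()
  )
  optimizer = getattr(torch.optim, optimizer_name)(model.parameters(), lr=lr)
  loss_function = nn.BCELoss()

  # Short full-batch training loop so each trial stays cheap
  for epoch in range(25):
    optimizer.zero_grad()
    loss = loss_function(model(X_train_tensor), y_train_tensor.view(-1, 1))
    loss.backward()
    optimizer.step()

  # Objective value returned to Optuna: loss on the held-out test tensors
  model.eval()
  with torch.no_grad():
    return loss_function(model(X_test_tensor), y_test_tensor.view(-1, 1)).item()

study = optuna.create_study(direction='minimize')   # lower loss is better
study.optimize(objective, n_trials=30)

print('Best value:', study.best_trial.value)
print('Best hyperparameters:', study.best_trial.params)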



Please download code from following GitHub location : https://github.com/amathe1/GenAI-AgenticAI-Hub/blob/main/Optuna.ipynb

Points to note :
  • We have used Dataset and DataLoader in this Optuna code as well
  • Pay close attention to every line to understand how the different hyperparameters are passed into the Optuna code, so that it can search the space and come up with the best hyperparameters for our input data and model.
  • Finally, once the program is done, it prints the best hyperparameters; we then build the ML model with those values (see the short snippet after this list) and deploy it in production.
  • After deployment, we need to closely monitor the incoming data and the model's performance, because the model was trained on a particular set of data, and if that data changes, the model's performance changes too.
  • In that case, we re-run the Optuna code on the updated data, obtain a new set of hyperparameters that best suit our model and data, enhance the model, and redeploy it in production.
  • This is a recurring process that keeps our model stable and accurate over time.
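
As a rough illustration of that "build the model from the best hyperparameters" step, assuming the same simplified search space as the sketch above:

best = study.best_trial.params   # e.g. {'lr': ..., 'hidden_units': ..., 'dropout': ..., 'optimizer': ...}

final_model = nn.Sequential(
    nn.Linear(X_train_tensor.shape[1], best['hidden_units']),
    nn.ReLU(),
    nn.Dropout(best['dropout']),
    nn.Linear(best['hidden_units'], 1),
    nn.Sigmoid()
)
final_optimizer = getattr(torch.optim, best['optimizer'])(final_model.parameters(), lr=best['lr'])
# ... retrain final_model on the full training data with these settings, then evaluate and deploy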

Output from Optuna code :

Best trial: Value: 0.020338236577420805 Hyperparameters: {'hidden_layers': 4, 'hidden_units': 226, 'dropout': 0.30018694421569886, 'batch_norm': True, 'activation_fn': <class 'torch.nn.modules.activation.ReLU'>, 'optimizer': 'Adam'}


Thank you for reading this blog !

Arun Mathe
