
(AI Blog#7) Importance of Dataset, Dataloader and Optuna Package

While designing a neural network, we can use Dataset and DataLoader for efficient memory management. Think of them as the backbone for feeding data to a machine learning model. We will also use the Optuna package, a hyperparameter optimization library that works well with PyTorch, to tune the hyperparameters.

Agenda of this blog :

  • How Dataset and DataLoader make a model memory efficient
  • How to use the Optuna package to select good hyperparameters

Hyper parameters in a Neural Network :

  • Learning rate
  • Batch size
  • No. of epochs
  • Drop out ratio
  • No. of hidden layers
  • No. of hidden units
  • Normalization - Batch norm etc.
  • Optimizers - Adam, Momentum, NAG, RMS Prop etc.
Generally, we try the above hyperparameters manually with different reasonable values, test the model, and settle on the values that suit it best. This manual approach takes time, because we have to explore many combinations of values before we land on the right parameters.

The Optuna package makes this process much easier. We will see how in this blog, with a proper programmatic explanation.


Dataset : 

A dataset is the collection of data that you use to train, validate, and test a neural network. A Dataset holds the raw data and serves it one record at a time (we will see exactly what that means later in this blog).

A Dataset knows :
  • How many items exist, i.e. what data it has
  • How to fetch one record by its index
  • How to apply pre-processing/transforms to that record
These points are what qualify something as a Dataset. A Dataset never returns batches, never shuffles the data, and never decides the training order.


Data Loader :

A DataLoader takes data from a Dataset and decides the order in which it reaches the model.
  • It groups data into mini batches
  • It shuffles and loads the data efficiently, optionally using multiple workers

Note that Dataset and DataLoader are classes in the PyTorch library. We create the Dataset first and then wrap it in a DataLoader. We will use both in neural networks and also in LLMs.


Consider a small sample table with columns index, age, monthly spend amount, and churn (whether the customer will stay with our application or move to some other application). Let's use it as an analogy to understand these concepts more deeply:

  • Using the index numbers 0, 1, 2, 3, 4 we can fetch the corresponding full record from the sample data
  • There are 5 samples in total
  • if index(3) : Age=40, MonthlySpend=5000, Churn=1
  • Storing an index is cheaper than storing the entire record, isn't it? Once the index is stored, we can fetch the record through it, which is memory efficient
  • The first step of the DataLoader is to create an index list
    • index = [0, 1, 2, 3, 4]
    • where 0 represents the 1st row, 1 represents the 2nd row, and so on
  • Optionally, the DataLoader will shuffle
    • Shuffling helps model accuracy, because the model cannot latch on to the order of the data
    • For example, if shuffle = True then the shuffled index might be [2, 4, 1, 3, 0]
    • Note that we are not shuffling the data itself, just the index
  • The DataLoader then creates mini batches, but how?
    • if Batch size = 2 then the DataLoader will create batches as below
      • Batch 1 = [2, 4] ; [2] --> (22, 1800), 1 ; [4] --> (35, 4200), 1
      • Batch 2 = [1, 3]
      • Batch 3 = [0]
    • that's how the data is organized by the DataLoader (see the short sketch after this list)
    • Remember mini batch gradient descent during backpropagation? Something similar is happening here in the DataLoader
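
To make this concrete, here is a minimal sketch using PyTorch's built-in TensorDataset. Rows 2-4 match the records quoted above; the values for rows 0 and 1 are placeholders, since the point is only to watch batches of indices being formed and shuffled.

import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy version of the sample table: (age, monthly spend) -> churn
# Rows 0 and 1 hold placeholder values, rows 2-4 are the records quoted above
features = torch.tensor([[30., 1000.],
                         [28., 2500.],
                         [22., 1800.],
                         [40., 5000.],
                         [35., 4200.]])
churn = torch.tensor([0., 0., 1., 1., 1.])

dataset = TensorDataset(features, churn)                 # knows its length and how to fetch one record
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterating yields mini batches of 2, 2 and 1 samples in a shuffled order,
# e.g. indices [2, 4], then [1, 3], then [0]
for batch_features, batch_churn in loader:
    print(batch_features, batch_churn)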

Very Important point :
 
  • In the code, when we declare the Dataset, the data is not loaded yet.
  • No samples have been read at this point (right after declaring the Dataset)
  • So when is the data actually loaded? Only when we iterate over the DataLoader :
for batch in dataloader:
    train(batch)

This is called Lazy Evaluation. This is why Dataset & DataLoader are memory efficient. 

Guys, this is a very important point to remember, and we use it across GenAI and Agentic AI.
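
To see this lazy behavior for yourself, here is a tiny sketch (the LoggingDataset class is hypothetical, purely for illustration). Nothing is printed while the Dataset and DataLoader are being created; the "reading sample" messages only appear once we start iterating.

import torch
from torch.utils.data import Dataset, DataLoader

class LoggingDataset(Dataset):

  def __init__(self, n):
    self.n = n

  def __len__(self):
    return self.n

  def __getitem__(self, idx):
    print(f"reading sample {idx}")             # runs only while iterating the DataLoader
    return torch.tensor([float(idx)]), torch.tensor(0.)

dataset = LoggingDataset(4)                    # nothing printed: no sample read yet
loader = DataLoader(dataset, batch_size=2)     # still nothing printed

for batch_features, batch_labels in loader:    # now the "reading sample ..." lines appear
    pass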




Problem statement : We are trying to predict whether a person may be diagnosed with cancer or not, based on certain measurements. Please check the code below; I have printed the first 5 rows of data to give you an idea. There are 30 input columns, and the output is a binary label indicating whether the person is diagnosed with cancer or not.

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder   # LabelEncoder converts string labels into numerical values the model can work with


# Creating a dataframe by reading the input data from the CSV file below
df = pd.read_csv('/content/data.csv')
df.head() # shows the first 5 rows

# 569 rows and 33 columns
df.shape

# Removing columns that don't add any meaning to the model:
# 'id' is just an identifier and 'Unnamed: 32' is an empty column in this CSV
df.drop(columns=['id', 'Unnamed: 32'], inplace=True)

df.head()


# Rows : all ; Columns : all from the 2nd column onwards (skipping the 1st column 'diagnosis', which is the target)
df.iloc[:, 1:]

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0], test_size=0.2)


y_test


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


X_train
y_train

# Applying LabelEncoder to convert the character labels in the target column into numbers
# (StandardScaler was only applied to the numeric feature columns; it would throw an error on non-numerical values, which is why the target is encoded separately)
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train) # fit_transform() learns the label-to-integer mapping from the training labels
y_test = encoder.transform(y_test) # transform() reuses that mapping on the test labels (unseen data)


y_train


# Converting the NumPy arrays into tensors that the PyTorch model can work with
X_train_tensor = torch.from_numpy(X_train.astype(np.float32))
X_test_tensor = torch.from_numpy(X_test.astype(np.float32))
y_train_tensor = torch.from_numpy(y_train.astype(np.float32))
y_test_tensor = torch.from_numpy(y_test.astype(np.float32))

X_train_tensor.shape

y_train_tensor.shape


from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):

  # __init__() stores the data the Dataset will serve
  # features are the input variables, labels are the target values (the two classes in the diagnosis column)
  def __init__(self, features, labels):

    self.features = features
    self.labels = labels

  # __len__() returns the total number of samples.
  def __len__(self):

    return len(self.features)

  # __getitem__(idx) returns the (features, label) pair at the given index.
  def __getitem__(self, idx):

    return self.features[idx], self.labels[idx]



# Creating instance for train, test dataset
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)


# sample at index 10 (the 11th record)
train_dataset[10]


# Creating DataLoader instances
# batch_size=32; shuffle the training data (there is no need to shuffle the test data)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)



import torch.nn as nn


class MySimpleNN(nn.Module):

  def __init__(self, num_features):

    super().__init__()
    self.linear = nn.Linear(num_features, 1)
    self.sigmoid = nn.Sigmoid()

  def forward(self, features):

    out = self.linear(features)
    out = self.sigmoid(out)

    return out



learning_rate = 0.1
epochs = 25


# create model
model = MySimpleNN(X_train_tensor.shape[1])

# define optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# define loss function
loss_function = nn.BCELoss()



# define loop
for epoch in range(epochs):

  for batch_features, batch_labels in train_loader:

    # forward pass
    y_pred = model(batch_features)

    # calculate the loss
    loss = loss_function(y_pred, batch_labels.view(-1,1))

    # clear gradients
    optimizer.zero_grad()

    # backward pass
    loss.backward()

    # parameters update
    optimizer.step()

  # print loss in each epoch
  print(f'Epoch: {epoch + 1}, Loss: {loss.item()}')



# Model evaluation using test_loader
model.eval()  # Set the model to evaluation mode
accuracy_list = []

with torch.no_grad():
    for batch_features, batch_labels in test_loader:
        # Forward pass
        y_pred = model(batch_features)
        y_pred = (y_pred > 0.8).float()  # Convert probabilities to binary predictions (0.8 is a stricter cut-off than the usual 0.5)

        # Calculate accuracy for the current batch
        batch_accuracy = (y_pred.view(-1) == batch_labels).float().mean().item()
        accuracy_list.append(batch_accuracy)

# Calculate overall accuracy
overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Accuracy: {overall_accuracy:.4f}')


Please feel free to download code from :
https://github.com/amathe1/GenAI-AgenticAI-Hub/blob/main/Pytorch_Dataset_%26_Dataloader_Classes.ipynb



Optuna :

Optuna is a hyperparameter optimization library. It is a separate package that integrates smoothly with PyTorch (and other frameworks).

The hyperparameters listed at the start of this blog (learning rate, batch size, number of epochs, dropout ratio, number of hidden layers, number of hidden units, normalization, optimizer) are all candidates for tuning. Hidden units are simply the hidden neurons within a single hidden layer. If you describe a search space for these parameters and an objective to optimize, Optuna will search for the hyperparameter values that work best for your model and data.

Make sure to install Optuna in your Google Colab as below :
  • !pip install optuna
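
Here is a minimal sketch of how an Optuna study can be set up, assuming the tensors (X_train_tensor, y_train_tensor, X_test_tensor, y_test_tensor) from the code above are already in memory. The search space here (lr, hidden_units, dropout, optimizer) is a simplified illustration; the actual notebook linked below tunes a richer set of parameters.

import optuna
import torch
import torch.nn as nn

def objective(trial):

  # Search space: Optuna suggests a value for each hyperparameter in every trial
  lr = trial.suggest_float('lr', 1e-4, 1e-1, log=True)
  hidden_units = trial.suggest_int('hidden_units', 16, 256)
  dropout = trial.suggest_float('dropout', 0.1, 0.5)
  optimizer_name = trial.suggest_categorical('optimizer', ['Adam', 'SGD', 'RMSprop'])

  # Small model built from the suggested values
  model = nn.Sequential(
      nn.Linear(X_train_tensor.shape[1], hidden_units),
      nn.ReLU(),
      nn.Dropout(dropout),
      nn.Linear(hidden_units, 1),
      nn.Sigmoid()
  )
  optimizer = getattr(torch.optim, optimizer_name)(model.parameters(), lr=lr)
  loss_function = nn.BCELoss()

  # Short full-batch training loop so each trial stays cheap
  for epoch in range(25):
    optimizer.zero_grad()
    loss = loss_function(model(X_train_tensor), y_train_tensor.view(-1, 1))
    loss.backward()
    optimizer.step()

  # Objective value returned to Optuna: loss on the held-out test tensors
  model.eval()
  with torch.no_grad():
    return loss_function(model(X_test_tensor), y_test_tensor.view(-1, 1)).item()

study = optuna.create_study(direction='minimize')   # lower loss is better
study.optimize(objective, n_trials=30)

print('Best value:', study.best_trial.value)
print('Best hyperparameters:', study.best_trial.params)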



Please download code from following GitHub location : https://github.com/amathe1/GenAI-AgenticAI-Hub/blob/main/Optuna.ipynb

Points to note :
  • We have used Dataset and DataLoader in this Optuna code as well
  • Pay close attention to every line to understand how the different hyperparameters are passed into the Optuna code, so that it can search the space and come up with the best hyperparameters for our input data and model.
  • Finally, once the program is done, it prints the best hyperparameters; we then build the ML model with those values (see the short snippet after this list) and deploy it in production.
  • After deployment, we need to closely monitor the incoming data and the model's performance, because the model was trained on a particular set of data, and if that data changes, the model's performance changes too.
  • In that case, we re-run the Optuna code on the updated data, obtain a new set of hyperparameters that best suit our model and data, enhance the model, and redeploy it in production.
  • This is a recurring process that keeps our model stable and accurate over time.
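
As a rough illustration of that "build the model from the best hyperparameters" step, assuming the same simplified search space as the sketch above:

best = study.best_trial.params   # e.g. {'lr': ..., 'hidden_units': ..., 'dropout': ..., 'optimizer': ...}

final_model = nn.Sequential(
    nn.Linear(X_train_tensor.shape[1], best['hidden_units']),
    nn.ReLU(),
    nn.Dropout(best['dropout']),
    nn.Linear(best['hidden_units'], 1),
    nn.Sigmoid()
)
final_optimizer = getattr(torch.optim, best['optimizer'])(final_model.parameters(), lr=best['lr'])
# ... retrain final_model on the full training data with these settings, then evaluate and deploy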

Output from Optuna code :

Best trial: Value: 0.020338236577420805 Hyperparameters: {'hidden_layers': 4, 'hidden_units': 226, 'dropout': 0.30018694421569886, 'batch_norm': True, 'activation_fn': <class 'torch.nn.modules.activation.ReLU'>, 'optimizer': 'Adam'}


Thank you for reading this blog !

Arun Mathe
