While designing a neural network, we can use Dataset & DataLoader for efficient memory management. Think of them as the backbone for feeding data to a machine learning model. We will also use the Optuna package, a hyperparameter optimization library that integrates well with PyTorch (it is a separate package, not part of PyTorch itself), to tune the hyperparameters.
Agenda of this blog :
- How Dataset and DataLoader make a model memory efficient
- How to use the Optuna package to select good hyperparameters
Hyperparameters in a Neural Network :
- Learning rate
- Batch size
- No. of epochs
- Dropout ratio
- No. of hidden layers
- No. of hidden units
- Normalization - Batch norm etc.
- Optimizers - Adam, Momentum, NAG, RMSProp, etc.
Generally, we try the above hyperparameters manually with different reasonable values, test the model each time, and finalize the parameters that best suit our model. But the manual way takes time, since we need to explore many combinations of values before arriving at the right ones.
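To see why this gets tedious, here is a minimal sketch of a manual grid search over just two hyperparameters; train_and_evaluate is a hypothetical helper standing in for a full train-plus-validate run :
import itertools

# Hypothetical helper : a real version would train the model with these
# settings and return validation accuracy (placeholder body shown here).
def train_and_evaluate(lr, batch_size):
    return 0.0  # placeholder

best_acc, best_params = 0.0, None
for lr, batch_size in itertools.product([0.1, 0.01, 0.001], [16, 32, 64]):
    acc = train_and_evaluate(lr, batch_size)  # 9 full training runs already
    if acc > best_acc:
        best_acc, best_params = acc, {'lr': lr, 'batch_size': batch_size}
print(best_params)
With more hyperparameters, the number of combinations explodes, which is exactly the pain Optuna removes.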
By using the Optuna package, we will make this process easy. We will see how to do that in this blog with a proper programmatic explanation.
Dataset :
A dataset is the collection of data that you use to train, validate, and test a neural network. A Dataset works with raw data, one record at a time (we will see what that means later in this blog).
A Dataset knows :
- How many items exist, i.e., what data it has
- How to fetch one record by index
- How to apply pre-processing/transforms to that record
These are the qualification factors of a Dataset. A Dataset never returns batches, never shuffles the data, and never decides the training order.
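As a toy sketch of those three responsibilities (assuming the records are just a plain Python list; the real version for our data comes later in this blog) :
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, records):
        self.records = records        # knows what data it has
    def __len__(self):
        return len(self.records)      # knows how many items exist
    def __getitem__(self, idx):
        record = self.records[idx]    # fetches one record by index
        return record                 # apply any pre-processing/transform here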
Data Loader :
A DataLoader accepts data from a Dataset. It decides the order in which data flows into the model.
- It groups data into mini-batches
- It shuffles and loads the data efficiently using multiple workers
Note that Dataset & DataLoader are classes in the PyTorch library. The Dataset is created first, followed by the DataLoader. We will use these in neural networks and also in LLMs.
Consider the sample data below, with columns index, age, monthly spend amount, and churn (i.e., whether he/she will stay with our application or move to some other application), and let's walk through an analogy to understand these concepts deeply :
- Based on the index numbers 0, 1, 2, 3, 4, we can get the entire record from the sample data
- Total samples are 5
- if index(3) : Age=40, MonthlySpend=5000, Churn=1
- Storing an index is easier than storing the entire record, isn't it ? Once the index is stored, we can fetch the record using that index, which is memory efficient
- The first step of the DataLoader is to create an index list
- index = [0, 1, 2, 3, 4]
- where 0 represents the 1st row, 1 represents the 2nd row, and so on
- Optionally, the DataLoader will shuffle the data
- Shuffling helps the model generalize better
- For example, if shuffle = True then the shuffled index = [2, 4, 1, 3, 0]
- Note we are not shuffling the data itself, just the indices
- The DataLoader will create mini-batches, but how ?
- if batch size = 2 then the DataLoader will create batches as below
- Batch 1 = [2, 4] ; [2] --> (22, 1800), 1 ; [4] --> (35, 4200), 1
- Batch 2 = [1, 3]
- Batch 3 = [0]
- that's how data is organized in a DataLoader
- Remember Mini-Batch Gradient Descent during backpropagation ? Something similar is happening here in the DataLoader
- In the code, when we declare a Dataset, the data is not loaded yet
- No samples are read at this point (after declaring the Dataset)
- So when does it actually load ? Only when we iterate over the DataLoader :
for batch in dataloader:
    train(batch)
This is called Lazy Evaluation. This is why Dataset & DataLoader are memory efficient.
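To make the index-shuffle-batch walkthrough above concrete, here is a plain-Python sketch of what the DataLoader conceptually does with the indices (the real implementation is more sophisticated, but the idea is the same) :
import random

indices = list(range(5))   # [0, 1, 2, 3, 4], one index per sample
random.shuffle(indices)    # e.g. [2, 4, 1, 3, 0] when shuffle=True
batch_size = 2
batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
print(batches)             # e.g. [[2, 4], [1, 3], [0]]
# Only when a batch is consumed does the DataLoader call dataset[idx]
# for each index in it - that is the lazy part.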
Guys, this is a very important point to remember, and we will use it across GenAI and Agentic AI.
Problem statement : We are trying to predict whether a person may be diagnosed with cancer or not based on certain criteria. Please check the code; I have printed the first 5 rows of data to give you some idea. There are 30 input columns, and the output is a binary label (Y/N) indicating whether the person is diagnosed with cancer or not.
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder # LabelEncoder converts categorical labels into numerical values which the system can understand
# Creating a dataframe to read input data from the CSV file below
df = pd.read_csv('/content/data.csv')
df.head() # it will show the first 5 rows
# 569 rows and 33 columns
df.shape
# removing unnecessary columns from the input data, which do not add any meaning to the model
# ('Unnamed: 32' is an empty padding column in this CSV; with 33 columns in total
# it must go too, leaving 30 features plus the diagnosis label)
df.drop(columns=['id', 'Unnamed: 32'], inplace=True)
df.head()
# Rows : all ; Columns : all except the 1st column "diagnosis", which is the target
df.iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0], test_size=0.2)
y_test
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train
y_train
# Applying LabelEncoder to convert the character labels in the diagnosis column into 0/1
# Note : this is why StandardScaler was applied only to the feature columns; it would throw an error on non-numerical values
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train) # fit_transform() learns the label classes from the train data
y_test = encoder.transform(y_test) # transform() reuses the mapping learned on train data, since test data is unseen
y_train
# Casting the NumPy arrays into tensors, which the PyTorch model can understand
X_train_tensor = torch.from_numpy(X_train.astype(np.float32))
X_test_tensor = torch.from_numpy(X_test.astype(np.float32))
y_train_tensor = torch.from_numpy(y_train.astype(np.float32))
y_test_tensor = torch.from_numpy(y_test.astype(np.float32))
X_train_tensor.shape
y_train_tensor.shape
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    # __init__() tells how the data should be stored.
    # features are input variables, labels are target variables (2 classes in the diagnosis column)
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    # __len__() returns the total number of samples.
    def __len__(self):
        return len(self.features)

    # __getitem__(idx) returns the (data, label) pair at the given index.
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]
# Creating instance for train, test dataset
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)
# fetching the record at index 10
train_dataset[10]
# Creating the DataLoader instances
# batch_size=32 ; shuffle the training data (there is no need to shuffle the test data)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
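As a quick sanity check (a small sketch; the exact sizes assume the 30 feature columns and batch_size=32 used above), we can peek at one batch to see what the DataLoader yields :
# Grab a single batch from the train loader
batch_features, batch_labels = next(iter(train_loader))
print(batch_features.shape) # expected : torch.Size([32, 30])
print(batch_labels.shape)   # expected : torch.Size([32])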
import torch.nn as nn
class MySimpleNN(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.linear = nn.Linear(num_features, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, features):
        out = self.linear(features)
        out = self.sigmoid(out)
        return out
learning_rate = 0.1
epochs = 25
# create model
model = MySimpleNN(X_train_tensor.shape[1])
# define optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# define loss function
loss_function = nn.BCELoss()
# define loop
for epoch in range(epochs):
    for batch_features, batch_labels in train_loader:
        # forward pass
        y_pred = model(batch_features)
        # calculate loss
        loss = loss_function(y_pred, batch_labels.view(-1, 1))
        # clear gradients
        optimizer.zero_grad()
        # backward pass
        loss.backward()
        # update parameters
        optimizer.step()
    # print the loss (of the last batch) in each epoch
    print(f'Epoch: {epoch + 1}, Loss: {loss.item()}')
# Model evaluation using test_loader
model.eval() # Set the model to evaluation mode
accuracy_list = []
with torch.no_grad():
    for batch_features, batch_labels in test_loader:
        # Forward pass
        y_pred = model(batch_features)
        # Convert probabilities to binary predictions (0.8 is a deliberately strict threshold; 0.5 is the usual default)
        y_pred = (y_pred > 0.8).float()
        # Calculate accuracy for the current batch
        batch_accuracy = (y_pred.view(-1) == batch_labels).float().mean().item()
        accuracy_list.append(batch_accuracy)
# Overall accuracy as the mean of the per-batch accuracies
overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Accuracy: {overall_accuracy:.4f}')
Please feel free to download code from :
https://github.com/amathe1/GenAI-AgenticAI-Hub/blob/main/Pytorch_Dataset_%26_Dataloader_Classes.ipynb
Optuna :
Optuna is a hyperparameter optimization library that works seamlessly with PyTorch (it is a standalone package, not part of PyTorch itself).
The tunable hyperparameters are the ones listed at the start of this blog : learning rate, batch size, number of epochs, dropout ratio, number of hidden layers, number of hidden units, normalization, and the optimizer. Hidden units are nothing but the hidden neurons in a single hidden layer. If you give Optuna your data, model, and search space, it will suggest hyperparameter values that best suit your model.
Make sure to install Optuna in your Google Colab as below :
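Optuna does not come pre-installed in Colab, so install it with pip first :
!pip install optuna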
Please download the code from the following GitHub location : https://github.com/amathe1/GenAI-AgenticAI-Hub/blob/main/Optuna.ipynb
Points to note :
- We have used Dataset and DataLoader in this Optuna code as well
- Pay close attention to each and every line to understand how we pass the different hyperparameters into the Optuna code, so that it can process them and come up with the best hyperparameters for our input data and model (see the sketch after this list)
- Finally, once the program is done, it will print the best hyperparameters; we then build the ML model based on them and deploy it to production
- Also, once deployed, we need to closely monitor the incoming data and the performance of our model, because the model was trained on a particular set of data, and if that data changes, the performance of our model changes as well
- In such a case, we need to re-run the Optuna code with the updated data, come up with a new set of hyperparameters best suited to our model + data, enhance the model, and redeploy it to production
- This is a recurring process to keep our model stable and accurate over time
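As a rough illustration (not the exact notebook code; it reuses the train_loader and X_train_tensor defined earlier in this blog and assumes a simplified search space, while the real notebook also tunes batch norm, the activation function, and the optimizer, as the output below shows), a typical Optuna objective looks like this :
import optuna
import torch
import torch.nn as nn

def objective(trial):
    # Ask Optuna to suggest a value for each hyperparameter from its range
    hidden_layers = trial.suggest_int('hidden_layers', 1, 5)
    hidden_units = trial.suggest_int('hidden_units', 16, 256)
    dropout = trial.suggest_float('dropout', 0.1, 0.5)
    lr = trial.suggest_float('lr', 1e-4, 1e-1, log=True)

    # Build a model from the suggested values
    layers, in_features = [], X_train_tensor.shape[1]
    for _ in range(hidden_layers):
        layers += [nn.Linear(in_features, hidden_units), nn.ReLU(), nn.Dropout(dropout)]
        in_features = hidden_units
    layers += [nn.Linear(in_features, 1), nn.Sigmoid()]
    model = nn.Sequential(*layers)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_function = nn.BCELoss()

    # Short training loop, kept small for illustration
    for epoch in range(10):
        for batch_features, batch_labels in train_loader:
            y_pred = model(batch_features)
            loss = loss_function(y_pred, batch_labels.view(-1, 1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return loss.item()  # Optuna will minimize this value across trials

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)
print('Best trial:')
print('Value:', study.best_trial.value)
print('Hyperparameters:', study.best_trial.params)
Each call to objective() is one trial; Optuna's sampler uses the results of earlier trials to pick more promising values for later ones, which is what makes it faster than a blind grid search.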
Output from Optuna code :
Best trial:
Value: 0.020338236577420805
Hyperparameters: {'hidden_layers': 4, 'hidden_units': 226, 'dropout': 0.30018694421569886, 'batch_norm': True, 'activation_fn': <class 'torch.nn.modules.activation.ReLU'>, 'optimizer': 'Adam'}
Thank you for reading this blog !
Arun Mathe