Batch run example code
Here we will go through an example where we modify some existing code to use Optuna. You can find the code here.
Installation
The code relies on anaconda. Create an environment by running
$ conda env create -f environment.yml
Once this is done, activate the environment by running
$ conda activate hpo_workshop
Now we need to make the code in the archive available. After extracting the files, go to the directory you extracted them to and run
$ pip install -e .
This installs the partorch package into the active conda environment and makes it available everywhere in that environment.
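To verify that the installation worked, you can try importing the package from outside the source directory (this assumes the package name is partorch, as above):
$ python -c "import partorch; print(partorch.__file__)"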
Optimizing hyperparameters
The example implements a simple Long Short-Term Memory (LSTM) network on a binary sequence prediction dataset. The sequences are molecules encoded in the SMILES format, and the prediction target is whether or not they are likely to penetrate the blood-brain barrier.
The network is defined in the file hpo_workshop/rnn.py and implemented in PyTorch. In this example we’re mainly interested in the initialization of the network:
# Imports used by this excerpt from hpo_workshop/rnn.py
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence
from torch.optim import AdamW


class RNNPredictor(nn.Module):
    def __init__(self, *, tokenizer, device, embedding_dim, d_model, num_layers,
                 bidirectional, dropout, learning_rate, weight_decay) -> None:
        super().__init__()
        self.tokenizer = tokenizer
        self.device = device
        self.embedding = nn.Embedding(tokenizer.get_num_embeddings(),
                                      embedding_dim=embedding_dim,
                                      padding_idx=0)
        self.recurrent_layers = nn.LSTM(input_size=embedding_dim,
                                        hidden_size=d_model,
                                        num_layers=num_layers,
                                        bidirectional=bidirectional,
                                        dropout=dropout,
                                        )
        self.num_directions = 1
        if bidirectional:
            self.num_directions = 2
        self.output_layers = nn.Sequential(nn.Dropout(dropout),
                                           nn.Linear(self.num_directions*d_model, d_model),
                                           nn.ReLU(),
                                           nn.Dropout(dropout),
                                           nn.Linear(d_model, 1))
        self.loss_fn = nn.BCEWithLogitsLoss()
        self.optimizer = AdamW(self.parameters(), lr=learning_rate, weight_decay=weight_decay)
        self.to(self.device)

    def forward(self, sequence_batch, lengths):
        embedded_sequences = self.embedding(sequence_batch)
        packed_sequence = pack_padded_sequence(embedded_sequences, lengths, enforce_sorted=False)
        output, (h_n, c_n) = self.recurrent_layers(packed_sequence)
        final_state = h_n[-1]
        if self.num_directions == 2:
            # Concatenate the last layer's final forward and backward hidden states
            final_state = torch.cat((h_n[-2], h_n[-1]), dim=-1)
        logits = self.output_layers(final_state)
        return logits

    def loss_on_batch(self, batch):
        sequence_batch, lengths, labels = batch
        logit_prediction = self(sequence_batch.to(self.device), lengths)
        loss = self.loss_fn(logit_prediction.squeeze(), labels.to(self.device))
        return loss

    def train_batch(self, batch):
        self.train()
        self.optimizer.zero_grad()
        loss = self.loss_on_batch(batch)
        loss.backward()
        self.optimizer.step()
        return loss.item()

    def eval_batch(self, batch):
        self.eval()
        with torch.no_grad():
            loss = self.loss_on_batch(batch)
            return loss.item()

    def eval_and_predict_batch(self, batch):
        self.eval()
        with torch.no_grad():
            sequence_batch, lengths, labels = batch
            logit_prediction = self(sequence_batch.to(self.device), lengths)
            loss = self.loss_fn(logit_prediction.squeeze(), labels.to(self.device))
            prob_predictions = torch.sigmoid(logit_prediction)
            return loss.item(), labels.cpu().numpy(), prob_predictions.cpu().numpy()
As you can see, we’re taking the hyperparameters of the network as keyword arguments to the __init__ method. Our goal is to find good settings for these hyperparameters.
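For illustration, a model with one particular hyperparameter setting could be constructed by hand like this (a minimal sketch: the tokenizer is assumed to come from the workshop's data loading code, and the values themselves are arbitrary):

model = RNNPredictor(tokenizer=tokenizer,  # assumed to be built by the workshop's data code
                     device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
                     embedding_dim=128,
                     d_model=128,
                     num_layers=3,
                     bidirectional=True,
                     dropout=0.2,
                     learning_rate=1e-3,
                     weight_decay=1e-4)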
Manual hyperparameter optimization
First we start with the basic “Grad Student Descent” example in scripts/basic_neural_network.py. The important part is the training loop:
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=args.random_seed)

for visible_index, heldout_indices in skf.split(smiles_list, labels):
    tb_writer = SummaryWriter('basic_runs')
    visible_labels = [labels[i] for i in visible_index]
    train_indices, dev_indices = train_test_split(visible_index, stratify=visible_labels,
                                                  shuffle=True, test_size=0.2,
                                                  random_state=args.random_seed)
    train_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=train_indices,
                                      tokenizer=tokenizer, batch_size=batch_size,
                                      num_workers=num_workers, shuffle=True)
    dev_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=dev_indices,
                                    tokenizer=tokenizer, batch_size=batch_size, num_workers=num_workers)
    heldout_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=heldout_indices,
                                        tokenizer=tokenizer, batch_size=batch_size, num_workers=num_workers)

    model_kwargs = dict(tokenizer=tokenizer, device=device)
    model_hparams = dict(embedding_dim=128,
                         d_model=128,
                         num_layers=3,
                         bidirectional=True,
                         dropout=0.2,
                         learning_rate=0.001,
                         weight_decay=0.0001)

    heldout_roc_auc = train(train_dataloader=train_dataloader, dev_dataloader=dev_dataloader,
                            test_dataloader=heldout_dataloader, writer=tb_writer,
                            max_epochs=max_epochs, model_class=RNNPredictor, model_args=tuple(),
                            model_kwargs=model_kwargs, model_hparams=model_hparams)
    tb_writer.close()
Here we are manually setting the hyperparameters and then training our models with them. Using TensorBoard we can track how well they perform. These runs are stored in basic_runs, so you need to run TensorBoard like:
$ tensorboard --logdir=basic_runs
Try running the experiment a couple of times while changing the hyperparameters and see if you can get better results.
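If you are following along, the script can be launched directly from the repository root (assuming it needs no additional command-line arguments beyond its defaults):
$ python scripts/basic_neural_network.py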
Hyperparameter optimization with Optuna
We’ll now take a look at how we can easily extend the above example using Optuna. We will replace the manual setting of hyperparameters with a loop that automatically searches the hyperparameter space. We need to import optuna and create a new study object for our hyperparameter optimization. We will perform a separate study for each fold of our cross-validation.
import optuna

skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=args.random_seed)

for visible_index, heldout_indices in skf.split(smiles_list, labels):
    tb_writer = SummaryWriter('basic_runs')
    visible_labels = [labels[i] for i in visible_index]
    train_indices, dev_indices = train_test_split(visible_index, stratify=visible_labels,
                                                  shuffle=True, test_size=0.2,
                                                  random_state=args.random_seed)
    train_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=train_indices,
                                      tokenizer=tokenizer, batch_size=batch_size,
                                      num_workers=num_workers, shuffle=True)
    dev_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=dev_indices,
                                    tokenizer=tokenizer, batch_size=batch_size, num_workers=num_workers)
    heldout_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=heldout_indices,
                                        tokenizer=tokenizer, batch_size=batch_size, num_workers=num_workers)

    model_kwargs = dict(tokenizer=tokenizer, device=device)
    model_hparams = dict(embedding_dim=128,
                         d_model=128,
                         num_layers=3,
                         bidirectional=True,
                         dropout=0.2,
                         learning_rate=0.001,
                         weight_decay=0.0001)

    heldout_roc_auc = train(train_dataloader=train_dataloader, dev_dataloader=dev_dataloader,
                            test_dataloader=heldout_dataloader, writer=tb_writer,
                            max_epochs=max_epochs, model_class=RNNPredictor, model_args=tuple(),
                            model_kwargs=model_kwargs, model_hparams=model_hparams)

    study = optuna.create_study(direction='maximize')
    tb_writer.close()
We’ve now created a new study with the objective of maximizing an objective function. There are two interfaces for running Optuna’s optimization: the study.ask()/study.tell() interface and the study.optimize() interface. We looked at how to use study.ask()/study.tell() in the notebook before and will now use study.optimize() instead.
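As a reminder, the ask/tell interface makes the trial loop explicit: you ask the study for a trial, evaluate it yourself, and tell the study the result. A minimal sketch on a toy objective (not part of the workshop code) looks like this:

import optuna

study = optuna.create_study(direction='maximize')
for _ in range(20):
    trial = study.ask()                    # the study proposes a new trial
    x = trial.suggest_float('x', -10, 10)  # sample a value for the parameter 'x'
    study.tell(trial, -(x - 2) ** 2)       # report the objective value back
print(study.best_params)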
study.optimize() takes the function to optimize as input, and we’ll define it inline in our optimization loop so that it can refer to the datasets we’ve set up.
import optuna

skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=args.random_seed)

for visible_index, heldout_indices in skf.split(smiles_list, labels):
    tb_writer = SummaryWriter('basic_runs')
    visible_labels = [labels[i] for i in visible_index]
    train_indices, dev_indices = train_test_split(visible_index, stratify=visible_labels,
                                                  shuffle=True, test_size=0.2,
                                                  random_state=args.random_seed)
    train_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=train_indices,
                                      tokenizer=tokenizer, batch_size=batch_size,
                                      num_workers=num_workers, shuffle=True)
    dev_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=dev_indices,
                                    tokenizer=tokenizer, batch_size=batch_size, num_workers=num_workers)
    heldout_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=heldout_indices,
                                        tokenizer=tokenizer, batch_size=batch_size, num_workers=num_workers)

    model_kwargs = dict(tokenizer=tokenizer, device=device)

    def optimization_function(trial: optuna.Trial):
        model_hparams = dict(embedding_dim=128,
                             d_model=128,
                             num_layers=3,
                             bidirectional=True,
                             dropout=0.2,
                             learning_rate=0.001,
                             weight_decay=0.0001)
        heldout_roc_auc = train(train_dataloader=train_dataloader, dev_dataloader=dev_dataloader,
                                test_dataloader=heldout_dataloader, writer=tb_writer,
                                max_epochs=max_epochs, model_class=RNNPredictor, model_args=tuple(),
                                model_kwargs=model_kwargs, model_hparams=model_hparams)
        return heldout_roc_auc

    study = optuna.create_study(direction='maximize')
    study.optimize(optimization_function, n_trials=20)
    tb_writer.close()
We’ve now set up the study infrastructure, but we’re still not actually performing any search. The optimization_function we defined takes an optuna.Trial object as its argument, and this is our interface to the actual hyperparameter search procedure. We extend our optimization_function so that the values for the hyperparameters are given by the trial:
import optuna

skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=args.random_seed)

for visible_index, heldout_indices in skf.split(smiles_list, labels):
    tb_writer = SummaryWriter('basic_runs')
    visible_labels = [labels[i] for i in visible_index]
    train_indices, dev_indices = train_test_split(visible_index, stratify=visible_labels,
                                                  shuffle=True, test_size=0.2,
                                                  random_state=args.random_seed)
    train_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=train_indices,
                                      tokenizer=tokenizer, batch_size=batch_size,
                                      num_workers=num_workers, shuffle=True)
    dev_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=dev_indices,
                                    tokenizer=tokenizer, batch_size=batch_size, num_workers=num_workers)
    heldout_dataloader = get_dataloader(smiles_list=smiles_list, labels=labels, indices=heldout_indices,
                                        tokenizer=tokenizer, batch_size=batch_size, num_workers=num_workers)

    model_kwargs = dict(tokenizer=tokenizer, device=device)

    def optimization_function(trial: optuna.Trial):
        model_hparams = dict(embedding_dim=trial.suggest_categorical('embedding_dim', [8, 16, 32, 64, 128]),
                             d_model=trial.suggest_categorical('d_model', [8, 16, 32, 64, 128, 256, 512, 1024]),
                             num_layers=trial.suggest_int('num_layers', 1, 5),
                             bidirectional=trial.suggest_categorical('bidirectional', [True, False]),
                             dropout=trial.suggest_float('dropout', 0, 1),
                             learning_rate=trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True),
                             weight_decay=trial.suggest_float('weight_decay', 1e-6, 1e-2, log=True))
        heldout_roc_auc = train(train_dataloader=train_dataloader, dev_dataloader=dev_dataloader,
                                test_dataloader=heldout_dataloader, writer=tb_writer,
                                max_epochs=max_epochs, model_class=RNNPredictor, model_args=tuple(),
                                model_kwargs=model_kwargs, model_hparams=model_hparams)
        return heldout_roc_auc

    study = optuna.create_study(direction='maximize')
    study.optimize(optimization_function, n_trials=20)
    tb_writer.close()
Here we are using most of Optuna’s variable types. We’re using the suggest_categorical method to sample from a set of arbitrary Python objects. We could have used suggest_int for the embedding_dim and d_model hyperparameters, but by supplying an explicit list of values (powers of two) we’re able to focus on specific orders of magnitude instead.
For the learning_rate and weight_decay parameters, we want to explore the values geometrically, so we pass the argument log=True. This samples the values from a log-transformed space instead, so that, for example, we are roughly as likely to sample values in the range \([10^{-4},10^{-3}]\) as in the range \([10^{-3},10^{-2}]\). If we don’t do this, our sampling will be skewed towards larger values.
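To see this effect in isolation, here is a small hypothetical snippet (not part of the workshop code) that samples both variants with Optuna's random sampler and counts how many draws fall below \(10^{-3}\):

import optuna

# Compare uniform vs. log-uniform sampling over [1e-5, 1e-2].
def objective(trial):
    trial.suggest_float('lr_uniform', 1e-5, 1e-2)
    trial.suggest_float('lr_log', 1e-5, 1e-2, log=True)
    return 0.0  # dummy objective; we only care about the sampled values

study = optuna.create_study(sampler=optuna.samplers.RandomSampler(seed=0))
study.optimize(objective, n_trials=1000)

uniform_below = sum(t.params['lr_uniform'] < 1e-3 for t in study.trials)
log_below = sum(t.params['lr_log'] < 1e-3 for t in study.trials)
print(uniform_below)  # roughly 10% of trials: uniform sampling rarely visits small values
print(log_below)      # roughly two thirds of trials: log sampling covers two of the three decades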
We have now changed our basic version of the training to automatically search for hyperparameters using Optuna.
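Once a study has finished for a fold, the best trial can be inspected directly on the study object. A brief sketch using the standard Optuna attributes (the constructor call is commented out since it needs that fold's tokenizer and device):

print(study.best_value)   # the best held-out ROC AUC seen during the search
print(study.best_params)  # a dict such as {'embedding_dim': 64, 'd_model': 256, ...}

# The best setting can be plugged straight back into the model:
# model = RNNPredictor(tokenizer=tokenizer, device=device, **study.best_params)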