Training¶
mlcg provides tools to train its models in the scripts folder, together with example input files such as examples/train_schnet.yaml. The training is defined using the pytorch-lightning package and, in particular, its CLI utilities.
Extensions for using PyTorch Lightning¶
- class mlcg.pl.DataModule(dataset, log_dir, val_ratio=0.1, test_ratio=0.1, splits=None, batch_size=512, inference_batch_size=64, num_workers=1, loading_stride=1, save_local_copy=False, pin_memory=True)[source]¶
PL interface to train with datasets defined in mlcg.datasets.
- Parameters:
  - dataset (InMemoryDataset) – a dataset from mlcg.datasets (or one following the API of torch_geometric.data.InMemoryDataset)
  - log_dir (str) – where to store the data that might be produced during training
  - val_ratio (float) – fraction of the dataset used for validation
  - test_ratio (float) – fraction of the dataset used for testing
  - splits (str) – filename of a file containing the indices for training, validation, and testing. It should be compatible with np.load and contain the fields ‘idx_train’, ‘idx_val’, and ‘idx_test’. If None, the dataset is split randomly using val_ratio and test_ratio.
  - batch_size (int) – number of structures to include in each training batch
  - inference_batch_size (int) – number of structures to include in each validation/test batch
  - num_workers (int) – number of CPU workers used for loading the dataset (see here for more details)
  - loading_stride (int) – stride used to subselect the dataset; useful for debugging purposes
  - save_local_copy (bool) – saves the input dataset in log_dir
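For illustration, here is a minimal sketch of constructing the data module directly in Python; my_dataset is a placeholder for any dataset from mlcg.datasets (or an object following the torch_geometric.data.InMemoryDataset API), and in practice these arguments are usually supplied through the training configuration file rather than in code:

    from mlcg.pl import DataModule

    # `my_dataset` is a placeholder: any dataset from mlcg.datasets, or one
    # following the torch_geometric.data.InMemoryDataset API.
    datamodule = DataModule(
        dataset=my_dataset,
        log_dir="./logs",
        val_ratio=0.1,
        test_ratio=0.1,
        batch_size=512,
        inference_batch_size=64,
        num_workers=4,
    )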
- setup(stage=None)[source]¶
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
  - stage – either 'fit', 'validate', 'test', or 'predict'
Example:
    class LitModel(...):
        def __init__(self):
            self.l1 = None

        def prepare_data(self):
            download_data()
            tokenize()

            # don't do this
            self.something = else

        def setup(self, stage):
            data = load_data(...)
            self.l1 = nn.Linear(28, data.num_classes)
- test_dataloader()[source]¶
An iterable or collection of iterables specifying test samples.
For more information about multiple dataloaders, see this section.
For data processing use the following pattern:
  - download in prepare_data()
  - process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
This dataloader is used by:
  - test()
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a test dataset and a
test_step(), you don’t need to implement this method.
- train_dataloader()[source]¶
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
  - download in prepare_data()
  - process and split in setup()
However, the above are only necessary for distributed processing.
Warning
do not assign state in prepare_data
This dataloader is used by:
  - fit()
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- val_dataloader()[source]¶
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
This dataloader is used by:
  - fit()
  - validate()
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a validation dataset and a
validation_step(), you don’t need to implement this method.
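These hooks can also be exercised outside of a Trainer, following the generic LightningDataModule pattern; the sketch below is illustrative only (my_dataset is again a placeholder, and the exact behaviour of setup() depends on the mlcg implementation):

    from mlcg.pl import DataModule

    datamodule = DataModule(dataset=my_dataset, log_dir="./logs")
    datamodule.prepare_data()          # standard LightningDataModule hook
    datamodule.setup(stage="fit")      # prepares the train/validation splits
    train_loader = datamodule.train_dataloader()
    val_loader = datamodule.val_dataloader()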
- class mlcg.pl.PLModel(model, loss)[source]¶
PL interface to train with models defined in mlcg.nn.
- Parameters:
  - model (Module) – instance of a model class from mlcg.nn
  - loss (Loss) – instance of mlcg.nn.Loss
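As a hedged sketch (names and constructor arguments are illustrative, not taken from the mlcg documentation), wrapping a network and a loss in the Lightning interface looks like:

    from mlcg.pl import PLModel
    from mlcg.nn import Loss

    # `my_network` is a placeholder for an instance of a model class from
    # mlcg.nn; the Loss arguments are omitted because they depend on the
    # targets being fitted.
    pl_model = PLModel(
        model=my_network,
        loss=Loss(...),
    )

In practice the model and loss are typically declared in the YAML configuration consumed by the LightningCLI described below.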
- on_train_epoch_end()[source]¶
Called in the training loop at the very end of the epoch.
To access all batch outputs at the end of the epoch, you can cache step outputs as an attribute of the LightningModule and access them in this hook:

    class MyLightningModule(L.LightningModule):
        def __init__(self):
            super().__init__()
            self.training_step_outputs = []

        def training_step(self):
            loss = ...
            self.training_step_outputs.append(loss)
            return loss

        def on_train_epoch_end(self):
            # do something with all training_step outputs, for example:
            epoch_mean = torch.stack(self.training_step_outputs).mean()
            self.log("training_epoch_mean", epoch_mean)
            # free up the memory
            self.training_step_outputs.clear()
- test_step(data, batch_idx, dataloader_idx=0)[source]¶
Operates on a single batch of data from the test set. In this step you’d normally generate examples or calculate anything of interest such as accuracy.
- Parameters:
  - batch – The output of your data iterable, normally a DataLoader.
  - batch_idx – The index of this batch.
  - dataloader_idx – The index of the dataloader that produced this batch (only if multiple dataloaders are used).
- Return type:
  Tuple[Tensor, int]
- Returns:
  - Tensor – the loss tensor
  - dict – a dictionary; can include any keys, but must include the key 'loss'
  - None – skip to the next batch
    # if you have one test dataloader:
    def test_step(self, batch, batch_idx): ...

    # if you have multiple test dataloaders:
    def test_step(self, batch, batch_idx, dataloader_idx=0): ...
Examples:
    # CASE 1: A single test dataset
    def test_step(self, batch, batch_idx):
        x, y = batch

        # implement your own
        out = self(x)
        loss = self.loss(out, y)

        # log 6 example images
        # or generated text... or whatever
        sample_imgs = x[:6]
        grid = torchvision.utils.make_grid(sample_imgs)
        self.logger.experiment.add_image('example_images', grid, 0)

        # calculate acc
        labels_hat = torch.argmax(out, dim=1)
        test_acc = torch.sum(y == labels_hat).item() / (len(y) * 1.0)

        # log the outputs!
        self.log_dict({'test_loss': loss, 'test_acc': test_acc})
If you pass in multiple test dataloaders,
test_step() will have an additional argument. We recommend setting the default value of 0 so that you can quickly switch between single and multiple dataloaders.

    # CASE 2: multiple test dataloaders
    def test_step(self, batch, batch_idx, dataloader_idx=0):
        # dataloader_idx tells you which dataset this is.
        ...
Note
If you don’t need to test you don’t need to implement this method.
Note
When the
test_step() is called, the model has been put in eval mode and PyTorch gradients have been disabled. At the end of the test epoch, the model goes back to training mode and gradients are enabled.
- training_step(data, batch_idx)[source]¶
Here you compute and return the training loss and some additional metrics for e.g. the progress bar or logger.
- Parameters:
  - batch – The output of your data iterable, normally a DataLoader.
  - batch_idx (int) – The index of this batch.
  - dataloader_idx – The index of the dataloader that produced this batch (only if multiple dataloaders are used).
- Return type:
  Tensor
- Returns:
  - Tensor – the loss tensor
  - dict – a dictionary which can include any keys, but must include the key 'loss' in the case of automatic optimization
  - None – in automatic optimization, this will skip to the next batch (but is not supported for multi-GPU, TPU, or DeepSpeed). For manual optimization, this has no special meaning, as returning the loss is not required.
In this step you’d normally do the forward pass and calculate the loss for a batch. You can also do fancier things like multiple forward passes or something model specific.
Example:
    def training_step(self, batch, batch_idx):
        x, y, z = batch
        out = self.encoder(x)
        loss = self.loss(out, x)
        return loss
To use multiple optimizers, you can switch to ‘manual optimization’ and control their stepping:
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False


    # Multiple optimizers (e.g.: GANs)
    def training_step(self, batch, batch_idx):
        opt1, opt2 = self.optimizers()

        # do training_step with encoder
        ...
        opt1.step()

        # do training_step with decoder
        ...
        opt2.step()
Note
When
accumulate_grad_batches > 1, the loss returned here will be automatically normalized by accumulate_grad_batches internally.
- class mlcg.pl.LightningCLI(model_class=None, datamodule_class=None, save_config_callback=<class 'pytorch_lightning.cli.SaveConfigCallback'>, save_config_kwargs=None, trainer_class=<class 'pytorch_lightning.trainer.trainer.Trainer'>, trainer_defaults=None, seed_everything_default=True, parser_kwargs=None, subclass_mode_model=False, subclass_mode_data=False, args=None, run=True, auto_configure_optimizers=True)[source]¶
Command line interface for training a model with PyTorch Lightning.
It adds a few functionalities to pytorch_lightning.utilities.cli.LightningCLI:
- registers the torch optimizers and lr_schedulers so that they can be specified in the configuration file. Note that only a single (optimizer, lr_scheduler) pair can be specified this way; more complex patterns should be implemented in the pytorch_lightning model definition (a child of pytorch_lightning.LightningModule). See the doc for more details.
- links manually some arguments related to the definition of the work directory: if the default_root_dir argument of pytorch_lightning.Trainer is set and the save_dir / log_dir / dirpath argument of the loggers / data / callbacks is set to default_root_dir, then they will be set to default_root_dir / default_root_dir/data / default_root_dir/ckpt, respectively.
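As an illustration only (the bundled training scripts may differ), a script built on this class typically follows the standard LightningCLI pattern, with mlcg.pl.PLModel and mlcg.pl.DataModule as the model and data classes; the optimizer and lr_scheduler are then selected through the YAML configuration passed on the command line:

    from mlcg.pl import DataModule, LightningCLI, PLModel

    if __name__ == "__main__":
        # Parses the command line / YAML configuration, instantiates the
        # Trainer, the model and the data module, and runs the requested
        # subcommand (fit, validate, test, ...).
        cli = LightningCLI(
            model_class=PLModel,
            datamodule_class=DataModule,
        )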
Scripts¶
Scripts that use LightningCLI come with many convenient built-in functionalities, such as a detailed help message:
python scripts/mlcg-train.py --help
python scripts/mlcg-train_h5.py --help
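and, assuming the standard LightningCLI subcommands (the exact interface of the bundled scripts may differ), a training run can be launched by pointing a script at a configuration file, for example
python scripts/mlcg-train.py fit --config examples/train_schnet.yaml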