Training

mlcg provides tools to train its models in the scripts folder, together with example input files such as examples/train_schnet.yaml. Training is defined using the pytorch-lightning package, in particular its CLI utilities.
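For example, assuming the training script is run with the default LightningCLI subcommands, a run can typically be launched by pointing it at a configuration file (the fit subcommand and the --config option are standard pytorch-lightning CLI features):

python scripts/mlcg-train.py fit --config examples/train_schnet.yaml

The available scripts and their options are described in the Scripts section below.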

Extensions for using PyTorch Lightning

class mlcg.pl.DataModule(dataset, log_dir, val_ratio=0.1, test_ratio=0.1, splits=None, batch_size=512, inference_batch_size=64, num_workers=1, loading_stride=1, save_local_copy=False, pin_memory=True)[source]

PL interface to train with datasets defined in mlcg.datasets.

Parameters:
  • dataset (InMemoryDataset) – a dataset from mlcg.datasets (or following the API of torch_geometric.data.InMemoryDataset)

  • log_dir (str) – where to store the data that might be produced during training.

  • val_ratio (float) – fraction of the dataset used for validation

  • test_ratio (float) – fraction of the dataset used for testing

  • splits (str) – path to a file containing the indices used for training, validation, and testing. It should be loadable with np.load and contain the fields ‘idx_train’, ‘idx_val’, and ‘idx_test’ (see the sketch after this list). If None, the dataset is split randomly using val_ratio and test_ratio.

  • batch_size (int) – number of structures to include in each training batch.

  • inference_batch_size (int) – number of structures to include in each validation/test batch.

  • num_workers (int) – number of CPU workers used for loading the dataset (see the torch DataLoader documentation for more details).

  • loading_stride (int) – stride used to subselect the dataset. Useful parameter for debugging purposes.

  • save_local_copy (bool) – if True, save a copy of the input dataset in log_dir.

  • pin_memory (bool) – whether to use pinned (page-locked) memory when loading the data.
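As a sketch of the splits file format described above, the index arrays can be written with numpy and the resulting file passed through the splits argument; the file name and dataset size below are placeholders:

import numpy as np

# placeholder dataset size; in practice use len(dataset)
n_frames = 10000
idx = np.random.permutation(n_frames)

# the field names must match those expected by DataModule
np.savez(
    "splits.npz",
    idx_train=idx[:8000],
    idx_val=idx[8000:9000],
    idx_test=idx[9000:],
)

The file would then be passed as splits="splits.npz" when constructing the DataModule.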

prepare_data()[source]

Download, preprocess dataset, etc.

setup(stage=None)[source]

Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.

Parameters:

stage – either 'fit', 'validate', 'test', or 'predict'

Example:

class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this: state assigned here is not shared across processes
        self.something = compute_something()

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)

test_dataloader()[source]

An iterable or collection of iterables specifying test samples.

For more information about multiple dataloaders, see the pytorch-lightning documentation.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

train_dataloader()[source]

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see the pytorch-lightning documentation.

The dataloader you return will not be reloaded unless you set the reload_dataloaders_every_n_epochs argument of pytorch_lightning.Trainer to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

val_dataloader()[source]

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see the pytorch-lightning documentation.

The dataloader you return will not be reloaded unless you set the reload_dataloaders_every_n_epochs argument of pytorch_lightning.Trainer to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

class mlcg.pl.PLModel(model, loss)[source]

PL interface to train with models defined in mlcg.nn.

Parameters:
  • model (Module) – instance of a model class from mlcg.nn.

  • loss (Loss) – instance of mlcg.nn.Loss.
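As a minimal sketch of how these pieces fit together outside of the CLI, a PLModel and a DataModule can be handed to a standard pytorch_lightning Trainer. Here my_dataset, my_model, and my_loss are placeholders for objects built from mlcg.datasets and mlcg.nn:

import pytorch_lightning as pl
from mlcg.pl import DataModule, PLModel

# my_dataset: a dataset from mlcg.datasets (placeholder)
datamodule = DataModule(dataset=my_dataset, log_dir="./logs", batch_size=512)

# my_model: a model instance from mlcg.nn; my_loss: an instance of mlcg.nn.Loss (placeholders)
pl_model = PLModel(model=my_model, loss=my_loss)

trainer = pl.Trainer(max_epochs=100, default_root_dir="./logs")
trainer.fit(pl_model, datamodule=datamodule)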

on_train_epoch_end()[source]

Called in the training loop at the very end of the epoch.

To access all batch outputs at the end of the epoch, you can cache step outputs as an attribute of the LightningModule and access them in this hook:

import torch
import lightning as L  # with older installs: import pytorch_lightning as L


class MyLightningModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.training_step_outputs = []

    def training_step(self, batch, batch_idx):
        loss = ...
        self.training_step_outputs.append(loss)
        return loss

    def on_train_epoch_end(self):
        # do something with all training_step outputs, for example:
        epoch_mean = torch.stack(self.training_step_outputs).mean()
        self.log("training_epoch_mean", epoch_mean)
        # free up the memory
        self.training_step_outputs.clear()

on_train_epoch_start()[source]

Called in the training loop at the very beginning of the epoch.

on_validation_epoch_end()[source]

Called in the validation loop at the very end of the epoch.

test_step(data, batch_idx, dataloader_idx=0)[source]

Operates on a single batch of data from the test set. In this step you’d normally generate examples or calculate anything of interest such as accuracy.

Parameters:
  • batch – The output of your data iterable, normally a DataLoader.

  • batch_idx – The index of this batch.

  • dataloader_idx – The index of the dataloader that produced this batch. (only if multiple dataloaders used)

Return type:

Tuple[Tensor, int]

Returns:

  • Tensor - The loss tensor

  • dict - A dictionary. Can include any keys, but must include the key 'loss'.

  • None - Skip to the next batch.

# if you have one test dataloader:
def test_step(self, batch, batch_idx): ...


# if you have multiple test dataloaders:
def test_step(self, batch, batch_idx, dataloader_idx=0): ...

Examples:

# CASE 1: A single test dataset
def test_step(self, batch, batch_idx):
    x, y = batch

    # implement your own
    out = self(x)
    loss = self.loss(out, y)

    # log 6 example images
    # or generated text... or whatever
    sample_imgs = x[:6]
    grid = torchvision.utils.make_grid(sample_imgs)
    self.logger.experiment.add_image('example_images', grid, 0)

    # calculate acc
    labels_hat = torch.argmax(out, dim=1)
    test_acc = torch.sum(y == labels_hat).item() / (len(y) * 1.0)

    # log the outputs!
    self.log_dict({'test_loss': loss, 'test_acc': test_acc})

If you pass in multiple test dataloaders, test_step() will have an additional argument. We recommend setting the default value of 0 so that you can quickly switch between single and multiple dataloaders.

# CASE 2: multiple test dataloaders
def test_step(self, batch, batch_idx, dataloader_idx=0):
    # dataloader_idx tells you which dataset this is.
    ...

Note

If you don’t need to test you don’t need to implement this method.

Note

When the test_step() is called, the model has been put in eval mode and PyTorch gradients have been disabled. At the end of the test epoch, the model goes back to training mode and gradients are enabled.

training_step(data, batch_idx)[source]

Here you compute and return the training loss and some additional metrics for e.g. the progress bar or logger.

Parameters:
  • batch – The output of your data iterable, normally a DataLoader.

  • batch_idx (int) – The index of this batch.

  • dataloader_idx – The index of the dataloader that produced this batch. (only if multiple dataloaders used)

Return type:

Tensor

Returns:

  • Tensor - The loss tensor

  • dict - A dictionary which can include any keys, but must include the key 'loss' in the case of automatic optimization.

  • None - In automatic optimization, this will skip to the next batch (but is not supported for multi-GPU, TPU, or DeepSpeed). For manual optimization, this has no special meaning, as returning the loss is not required.

In this step you’d normally do the forward pass and calculate the loss for a batch. You can also do fancier things like multiple forward passes or something model specific.

Example:

def training_step(self, batch, batch_idx):
    x, y, z = batch
    out = self.encoder(x)
    loss = self.loss(out, x)
    return loss

To use multiple optimizers, you can switch to ‘manual optimization’ and control their stepping:

def __init__(self):
    super().__init__()
    self.automatic_optimization = False


# Multiple optimizers (e.g.: GANs)
def training_step(self, batch, batch_idx):
    opt1, opt2 = self.optimizers()

    # do training_step with encoder
    ...
    opt1.step()
    # do training_step with decoder
    ...
    opt2.step()

Note

When accumulate_grad_batches > 1, the loss returned here will be automatically normalized by accumulate_grad_batches internally.

validation_step(data, batch_idx, dataloader_idx=0)[source]

In the multi-metaset scenario, the separate validation losses (named dataloader_idx_?) are ordered alphabetically ascending with respect to the Metaset names.

Return type:

Tuple[Tensor, int]

class mlcg.pl.LightningCLI(model_class=None, datamodule_class=None, save_config_callback=<class 'pytorch_lightning.cli.SaveConfigCallback'>, save_config_kwargs=None, trainer_class=<class 'pytorch_lightning.trainer.trainer.Trainer'>, trainer_defaults=None, seed_everything_default=True, parser_kwargs=None, subclass_mode_model=False, subclass_mode_data=False, args=None, run=True, auto_configure_optimizers=True)[source]

Command line interface for training a model with pytorch lightning.

It adds a few functionalities to pytorch_lightning.utilities.cli.LightningCLI.

  • registers torch optimizers and lr_schedulers so that they can be specified in the configuration file. Note that only a single (optimizer, lr_scheduler) pair can be specified this way; more complex patterns should be implemented in the pytorch_lightning model definition (a child of pytorch_lightning.LightningModule). See the pytorch-lightning documentation for more details.

  • manually links some arguments related to the definition of the working directory. If the default_root_dir argument of pytorch_lightning.Trainer is set and the save_dir / log_dir / dirpath arguments of the loggers / data / callbacks are set to default_root_dir, then they will be set to default_root_dir / default_root_dir/data / default_root_dir/ckpt, respectively.

parse_arguments(parser, args)[source]

Parses the command line arguments and stores them in self.config.

Return type:

None

Scripts

Scripts that use LightningCLI come with many convenient built-in functionalities, such as a detailed help message:

python scripts/mlcg-train.py --help
python scripts/mlcg-train_h5.py --help
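
A minimal training script built on these classes might look like the following sketch; the actual scripts in the scripts folder are more elaborate and may differ in detail:

# hypothetical minimal training script using the extended CLI
from mlcg.pl import DataModule, LightningCLI, PLModel

if __name__ == "__main__":
    cli = LightningCLI(PLModel, DataModule)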