4) Package training data

Command:

mlcg-tk-package_training_data package_training_data --config configuration_files/trpcage_packaging.yaml

Once all training data have been produced, they must be packaged in a form that can be passed to the MLCG library for model training. In this step, CG coordinates, delta forces, and embeddings are loaded for all provided sample names in a raw dataset and saved as an HDF5 file. In the same step, molecules are split into training and validation sets, and this split is saved in a partition file, which also stores the batch sizes to use for training and any striding that should be applied.
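
To make the role of the partition file concrete, the following is a minimal sketch of a train/validation split over molecule names. It is not the toolkit's implementation: the function name `make_partition`, the dictionary keys (`train`, `val`, `batch_size`, `molecules`, `stride`), and the default values are all assumptions chosen for illustration, and the real partition file layout may differ.

```python
import random


def make_partition(sample_names, val_fraction=0.2, batch_size=256, stride=1, seed=0):
    """Split sample names into training and validation sets.

    The returned layout is hypothetical and only illustrates the kind of
    information a partition file holds (split, batch size, stride).
    """
    names = sorted(sample_names)  # sort first so the split is reproducible
    rng = random.Random(seed)
    rng.shuffle(names)

    # Keep at least one validation molecule when there is more than one sample.
    n_val = max(1, round(len(names) * val_fraction)) if len(names) > 1 else 0
    val, train = names[:n_val], names[n_val:]

    def entry(subset):
        return {
            "batch_size": batch_size,
            "molecules": {name: {"stride": stride} for name in subset},
        }

    return {"train": entry(train), "val": entry(val)}


# Hypothetical molecule names, for illustration only.
partition = make_partition(["mol_a", "mol_b", "mol_c", "mol_d", "mol_e"])
```

Fixing the seed keeps the split identical between runs, which matters when several models are compared on the same validation molecules.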

If multiple datasets are used to train a model, they can be merged into combined HDF5 and partition files using the following:

Command:

mlcg-tk-package_training_data combine_datasets --dataset_names '[dataset_1, dataset_2, etc]' --save_dir /path/to/saved/files/ --force_tag tag

The optional --force_tag argument specifies the label given to the produced delta forces and to the saved packaged data.

4.1*) Add decoys to training data

It is possible to add so-called decoys to the training data. Decoys are distorted (unphysical) configurations with a zero delta-force label, which give the network examples of configurations for which it should rely on the prior.
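
The idea can be sketched as follows. This is an illustration only, not the code used by mlcg-tk-add_decoys: it assumes that the distortion is Gaussian noise added to each coordinate, and the function name `make_decoys` and the parameters `noise_scale`, `stride`, and `seed` are hypothetical.

```python
import random


def make_decoys(coords, noise_scale, stride=1, seed=0):
    """Build decoy frames and their zero delta-force labels.

    coords: list of frames, each frame a list of [x, y, z] bead positions.
    Every `stride`-th frame is perturbed with Gaussian noise (an assumed
    form of distortion) and paired with an all-zero force array.
    """
    rng = random.Random(seed)
    decoy_coords, zero_forces = [], []
    for frame in coords[::stride]:
        # Distort each coordinate independently.
        decoy_coords.append(
            [[x + rng.gauss(0.0, noise_scale) for x in bead] for bead in frame]
        )
        # Zero delta force: the model should add nothing on top of the prior.
        zero_forces.append([[0.0, 0.0, 0.0] for _ in frame])
    return decoy_coords, zero_forces


# Four identical two-bead frames as toy input.
frames = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]] for _ in range(4)]
decoys, forces = make_decoys(frames, noise_scale=0.5, stride=2)
```

With `stride=2`, the four input frames yield two decoy frames, each labelled with zero delta forces for every bead.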

The script mlcg-tk-add_decoys adds decoys to a previously constructed HDF5 dataset. It can be used in the following way to append decoys to the existing HDF5 dataset (alternatively, the config file can be modified so that the script copies the original HDF5 dataset before adding the decoys):

Command:

mlcg-tk-add_decoys add_decoy --config configuration_files/trpcage_decoys_dataset.yaml

The same script can also be used to add these decoys to an existing partition file in order to incorporate them during training:

Command:

mlcg-tk-add_decoys update_partition_file --config configuration_files/trpcage_decoys_partition.yaml

Note that an arbitrary number of decoys with different noise levels and strides can be appended to a dataset; only the decoys present in the partition file are actually used for training.
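
The relationship between the dataset and the partition file can be pictured with a small sketch. All entry names and frame counts below are hypothetical, and the helper `frames_used_for_training` is not part of the toolkit; the sketch only illustrates that entries stored in the dataset but absent from the partition are ignored.

```python
# Hypothetical dataset contents: entry name -> number of stored frames.
dataset_entries = {
    "trpcage": 1000,
    "trpcage_decoy_noise_0.1": 500,
    "trpcage_decoy_noise_0.5": 500,
}

# Hypothetical training partition that references only one decoy set.
train_partition = ["trpcage", "trpcage_decoy_noise_0.1"]


def frames_used_for_training(dataset_entries, partition_names):
    """Return the dataset entries that the partition actually selects."""
    missing = [name for name in partition_names if name not in dataset_entries]
    if missing:
        raise KeyError(f"partition refers to entries missing from the dataset: {missing}")
    return {name: dataset_entries[name] for name in partition_names}


used = frames_used_for_training(dataset_entries, train_partition)
# Entries present in the dataset but not selected by the partition.
ignored = sorted(set(dataset_entries) - set(used))
```

Here the decoy set with the larger noise level stays in the dataset but plays no part in training, because the partition does not list it.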