Datasets
mlcg.datasets contains a template InMemoryDataset for CLN025, a 10-residue miniprotein that shows prototypical folding and unfolding behavior. The ChignolinDataset class illustrates how a general dataset can be downloaded, unpacked/organized, transformed/processed, and collated for training models. Here, users can find more information on implementing custom datasets.
Chignolin Dataset
Dataset of 3744 individual all-atom trajectories simulated with [ACEMD] using the adaptive sampling strategy described in [AdaptiveStrategy]. All trajectories were run at 350 K with [CHARM22Star] in a cubic box of (40 Å)³ containing 1881 TIP3P water molecules and two Na⁺ ions, using a Langevin integrator with an integration timestep of 4 fs, a damping constant of 0.1 ps⁻¹, heavy-hydrogen constraints, a PME cutoff of 9 Å, and a PME mesh grid of 1 Å. The total aggregate simulation time is 187.2 µs.
- class mlcg.datasets.chignolin.ChignolinDataset(root, transform=None, pre_transform=None, pre_filter=None, mapping='slice_aggregate', terminal_embeds=False, max_num_files=None)[source]
Dataset for training a CG model of the chignolin protein following a Cα CG mapping using the all-atom data from [CGnet].
This Dataset produces delta forces for model training, in which the CG prior forces (harmonic bonds and angles and a repulsive term) have been subtracted from the full CG forces.
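A minimal usage sketch, assuming the package is installed; the root directory below is hypothetical, and the raw data are downloaded and processed on first use:

from mlcg.datasets.chignolin import ChignolinDataset

# Downloads, unpacks, and processes the data if not already present
dataset = ChignolinDataset(root="./chignolin_data")  # hypothetical path
print(len(dataset))
sample = dataset[0]  # AtomicData with CG positions, delta forces, embeddings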
Alanine Dipeptide Dataset
Dataset of a single one-million-step trajectory of alanine dipeptide in explicit water. The trajectory was simulated with a Langevin scheme in [ACEMD] at 300 K using the [AMBER_ff_99SB_ILDN] force field. The cubic simulation box had an edge length of 2.3222 nm, an integration timestep of 2 fs was used, the solvent was composed of 651 [TIP3P] water molecules, electrostatics were computed every two steps using the PME method with a real-space cutoff of 0.9 nm and a grid spacing of 0.1 nm, and all bonds between heavy atoms and hydrogens were constrained.
- class mlcg.datasets.alanine_dipeptide.AlanineDataset(root, stride=1, transform=None, pre_transform=None, pre_filter=None)[source]
Dataset for training a CG model of alanine dipeptide following a Cα + Cβ CG mapping
Alanine Dipeptide CG structure:

          CB(3)
            |
   N(1) - CA(2) - C(4)
   /                 \
C(0)                 N(5)
This Dataset produces delta forces for model training, in which the CG prior forces (harmonic bonds and angles) have been subtracted from the full CG forces.
- The class works as follows:
If the raw data (coordinates, forces, and pdb file) for alanine dipeptide does not exist, the files will automatically be downloaded and processed
If the raw data exists, “root/processed/” will be created and the raw dataset will be processed
- If the raw data and processed data exist, the class will load the processed dataset containing:
data : AtomicData object containing all the CG positions, forces, embeddings
slices : Slices of the AtomicData object
prior_models : Prior models (HarmonicBonds, HarmonicAngles) of the dataset with x_0 and sigma
topologies : Object of the Topology class containing all topology information of the CG molecule, including neighbor lists for bond and angle priors
- Inputs:
- root :
Location for AlanineDataset to be downloaded and processed
- stride :
Stride length over dataset
- Default priors:
HarmonicBonds, HarmonicAngles
- Optional priors:
- Repulsion
If the repulsion prior is used, a custom neighbor list is created for it in which all pairs of beads not interacting through bonds or angles are included in the interaction set
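A minimal usage sketch of the class described above, assuming the package is installed; the root path is hypothetical, and the raw data are fetched automatically on first use:

from mlcg.datasets.alanine_dipeptide import AlanineDataset

# Downloads and processes the raw coordinates/forces/pdb if needed
dataset = AlanineDataset(root="./ala2_data", stride=10)  # keep every 10th frame
frame = dataset[0]  # AtomicData with CG positions, delta forces, embeddings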
Custom H5 Dataset
Users may assemble their own curated dataset using the HDF5 format. This allows for the possibility of training on multiple types of molecules or on data from different system conditions.
Introduction
The HDF5 format simplifies dataset management for mlcg when training/validation involves multiple molecules of vastly different lengths and when parallelization is used. Its main features are:
The internal structure mimics the hierarchy of the dataset itself, so we don't have to replicate it on the filesystem.
We don't have to actively open all files in each process.
We can transparently load only the necessary parts of the dataset into memory.
This module contains the Python data structures for handling the HDF5 file and its content, i.e., the coordinates, forces and embedding vectors for multiple CG molecules. An example HDF5 file structure and the corresponding class types after loading:
/ (HDF-group, => ``H5Dataset._h5_root``)
├── OPEP (HDF-group =(part, according to "partition_options")=> ``Metaset`` in a ``Partition``)
│ ├── opep_0000 (HDF-group, => ``MolData``)
│ │ ├── attrs (HDF-attributes of the molecule "/OPEP/opep_0000")
│ │ │ ├── cg_embeds (a 1-d numpy.int array)
│ │ │ ├── N_frames (int, number of frames = size of cg_coords on axis 0)
│ │ │ ... (optional, e.g., cg_exc_pairs, cg_top, cg_pdb, etc. that correspond to the molecule)
│ │ ├── cg_coords (HDF-dataset of the molecule "/OPEP/opep_0000", 3-d numpy.float32 array)
│ │ └── cg(_delta)_forces (HDF-dataset of the molecule "/OPEP/opep_0000", 3-d numpy.float32 array)
│ ... (other molecules in "/OPEP")
├── CATH (HDF-group ="partition_options"=> ``Metaset`` in a ``Partition``)
...
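For reference, a file with this layout can be assembled with h5py. A minimal sketch, assuming hypothetical array sizes and the group/dataset/attribute names shown above:

import h5py
import numpy as np

n_frames, n_atoms = 100, 50  # hypothetical sizes

with h5py.File("custom_dataset.h5", "w") as f:
    opep = f.create_group("OPEP")            # one group per Metaset
    mol = opep.create_group("opep_0000")     # one group per molecule (MolData)
    mol.attrs["cg_embeds"] = np.ones(n_atoms, dtype=np.int64)  # per-atom types
    mol.attrs["N_frames"] = n_frames
    mol.create_dataset(
        "cg_coords", data=np.zeros((n_frames, n_atoms, 3), dtype=np.float32)
    )
    mol.create_dataset(
        "cg_delta_forces", data=np.zeros((n_frames, n_atoms, 3), dtype=np.float32)
    )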
Basic bricks: MolData and Metaset
Data structure MolData is the basic brick of the dataset.
It holds embeds, coords and forces of a certain number of frames for a single molecule.
.sub_sample: method for performing indexing on the frames.
- class mlcg.datasets.MolData(name, embeds, coords, forces, weights=None, exclusion_pairs=None)[source]
Data-holder for coordinates, forces and embedding vector of a molecule, e.g., opep_0000.
- Parameters:
name (str) – Name of the molecule
embeds (ndarray) – Type embeddings for each atom, of shape (n_atoms,)
coords (ndarray) – Cartesian coordinates of the molecule, of shape (n_frames, n_atoms, 3)
forces (ndarray) – Cartesian forces of the molecule, of shape (n_frames, n_atoms, 3)
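A minimal construction sketch with in-memory arrays; the sizes are hypothetical, and whether .sub_sample returns a new object or modifies the frames in place is an assumption not specified here:

import numpy as np
from mlcg.datasets import MolData

n_frames, n_atoms = 1000, 5  # hypothetical sizes
mol = MolData(
    name="opep_0000",
    embeds=np.ones(n_atoms, dtype=np.int64),                    # (n_atoms,)
    coords=np.zeros((n_frames, n_atoms, 3), dtype=np.float32),  # per-frame positions
    forces=np.zeros((n_frames, n_atoms, 3), dtype=np.float32),  # per-frame forces
)
mol.sub_sample(np.arange(0, n_frames, 10))  # index every 10th frame (semantics assumed)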
Data structure Metaset
holds multiple molecules and joins their frame indices together.
One can access frames of the underlying MolData objects with a continuous index.
The idea is that the molecules in a Metaset
have similar numbers of CG beads, such that each sample requires similar processing time when passing through a neural network.
When this rule is enforced, it allows automatic balancing of batch composition during training and validation.
.create_from_hdf5_group: initiates a Metaset by loading data from a given HDF group. The parameters mol_list, detailed_indices, hdf_key_mapping, stride and parallel control which subset is loaded into the Metaset (see the description of H5Dataset for details).
.trim_down_to: drops frames from the underlying MolData objects to obtain a subset with the desired number of frames. The indices of the frames to be dropped are controlled by a random number generator; when the parameter random_seed, the MolData order and the numbers of frames are the same, the dropped frames are reproducible.
len() (__len__): returns the total number of samples.
[] (__getitem__): infers the molecule and retrieves the corresponding frame according to the given unified index. The return value is an AtomicData object.
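A hedged sketch of this workflow; the import path and the keyword names passed to create_from_hdf5_group are assumptions based on the prose above, not a documented signature:

import h5py
from mlcg.datasets import Metaset  # import path assumed

hdf_key_mapping = {
    "embeds": "attrs:cg_embeds",
    "coords": "cg_coords",
    "forces": "cg_delta_forces",
}
with h5py.File("custom_dataset.h5", "r") as f:
    # keyword names below are assumptions
    metaset = Metaset.create_from_hdf5_group(
        f["OPEP"], hdf_key_mapping=hdf_key_mapping, stride=1
    )
    print(len(metaset))  # total number of frames across molecules
    sample = metaset[0]  # AtomicData for unified index 0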
Train test split: Partition
Data structure Partition
can hold multiple Metasets for training and validation purposes.
Its main function is to automatically adjust (subsample) the Metaset(s) and the underlying MolData to form a balanced data source, from which a certain number of batches can be drawn. Each batch will contain a pre-given number of samples from each Metaset.
One or several Partitions can be created to load parts of the same HDF5 file into memory.
The exact content of a Partition object is controlled by the partition_options parameter used to initialize an H5Dataset.
Full Dataset: H5Dataset
Data structure H5Dataset
opens an HDF5 file and establishes the partitions.
partition_options
describes which partitions to create and what content to load into them (a detailed description and examples follow).
loading_options
mainly deals with the HDF key mapping (i.e., which datasets/attributes correspond to the coordinates, forces and embeddings) as well as, optionally, the information for correctly splitting the dataset in parallel training/validation.
An example “partition_options” (as a Python Mapping, e.g., dict):
{
"train": {
"metasets": {
"OPEP": {
"molecules": [
"opep_0000",
"opep_0001",
...
],
"stride": 1, # optional, default 1
"detailed_indices": {
# optional, providing the indices of frames to work with (before striding and splitting for parallel processes).
# optional,
"opep_0000":
val_ratio: 0.1
test_ratio: 0.1
seed: 12345
"opep_0001":
val_ratio: 0.1
test_ratio: 0.1
seed: 12345
"filename": ./splits
# If detailed_indices are not provided for a given molecule, then it is equivalent to np.arange(N_frames)
"opep_0000": [1, 3, 5, 7, 9, ...],
},
},
"CATH": {
"molecules": [
"cath_1b43A02",
...
],
"stride": 1, # optional
"detailed_indices": {}, # optional
}
},
"batch_size": {
# each batch will contain 24 samples from
# The two metasets will be trimmed down according to this ratio
# so optimally it should be proportional to the ratio of numbers of frames in the metasets.
"OPEP": 24,
"CATH": 6,
},
"subsample_random_seed": 42, # optional, default 42. Random seed for trimming down the frames.
"max_epoch_samples": None, # optional, default None. For setting a desired dataset size.
# Works by subsampling the dataset after it is loaded with the given stride.
},
"val": { # similar
}
}
An example “loading_options” (as a Python Mapping, e.g., dict):
{
"hdf_key_mapping": {
# keys for reading CG data from HDF5 groups
"embeds": "attrs:cg_embeds",
"coords": "cg_coords",
"forces": "cg_delta_forces",
},
"parallel": { # optional, default rank=0 and world_size=1 (single process).
# For splitting the dataset evenly and loading only the necessary parts into each process in a parallelized train/val setup
"rank": 0, # which process is this
"world_size": 1, # how many processes are there
}
}
- class mlcg.datasets.H5Dataset(h5_file_path, partition_options, loading_options, subsample_using_weights=False, exclude_listed_pairs=False)[source]
The top-level class for handling multiple datasets contained in a HDF5 file.
- Parameters:
h5_file_path (str) – Path to the HDF5 file containing the dataset(s)
partition_options (Dict) – Dictionary of partition names mapping to collections of metaset information
loading_options (Dict) – Dictionary of options specifying how HDF5 attrs/datasets are named within HDF5 groups
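Putting the pieces together, a minimal sketch of opening a dataset; the file path and molecule names are hypothetical, and the dicts follow the examples above:

from mlcg.datasets import H5Dataset

partition_options = {
    "train": {
        "metasets": {"OPEP": {"molecules": ["opep_0000", "opep_0001"]}},
        "batch_size": {"OPEP": 24},
    },
    "val": {
        "metasets": {"OPEP": {"molecules": ["opep_0002"]}},
        "batch_size": {"OPEP": 24},
    },
}
loading_options = {
    "hdf_key_mapping": {
        "embeds": "attrs:cg_embeds",
        "coords": "cg_coords",
        "forces": "cg_delta_forces",
    },
}
dataset = H5Dataset("custom_dataset.h5", partition_options, loading_options)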
Loading into PyTorch: H5PartitionDataLoader
Class H5PartitionDataLoader
mimics the PyTorch data loader and can generate training/validation batches with a fixed proportion of samples from the several Metasets in a Partition.
The proportion and batch size are defined when the partition is initialized.
When the molecules in each Metaset have similar embedding-vector lengths, processing the output batches requires a more or less fixed amount of GPU VRAM, which keeps memory usage predictable during training and validation.
Note: 1. Usually in a train-val split, each molecule goes to either the train or the test partition.
2. In some special cases (e.g., non-transferable training) one molecule can be part of both partitions. In this situation, detailed_indices can be set to assign the corresponding frames to the desired partitions; in addition, one can pass detailed_indices as a dictionary to split frames between training and test (see the partition_options example above).
- class mlcg.datasets.H5PartitionDataLoader(data_partition, collater_fn=<torch_geometric.loader.dataloader.Collater object>, pin_memory=False, subsample_using_weights=False)[source]
Load batches from one or multiple Metasets in a Partition. When multiple Metasets are present, the order of the data loaders is alphabetically ascending with respect to the Metaset names.
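A hedged usage sketch; how a Partition is retrieved from an H5Dataset is not documented in this section, so dataset.partitions["train"] is an assumption:

from mlcg.datasets import H5PartitionDataLoader

train_partition = dataset.partitions["train"]  # accessor name assumed
loader = H5PartitionDataLoader(train_partition)
for batch in loader:
    # each batch combines a fixed number of samples from each Metaset,
    # e.g. 24 from "OPEP" and 6 from "CATH" in the example above
    pass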