H5 Dataset

mlcg.datasets.h5_dataset provides utilities so that users can assemble their own curated dataset using an H5 format. This allows for the possibility of training on multiple types of molecules or on data from different system conditions.

Python classes for processing data stored in HDF5 format.

Description

The HDF5 format benefits dataset management for mlcg-tools when training/validation involves multiple molecules of vastly different lengths and when parallelization is used. The main features are:

  1. The internal structure mimics the hierarchy of the dataset itself, so we don’t have to replicate it on the filesystem.

  2. We don’t have to actively open all files in the process.

  3. We can transparently load only the necessary parts of the dataset into memory.

This file contains the Python data structures for handling the HDF5 file and its contents, i.e., the coordinates, forces and embedding vectors for multiple CG molecules. An example HDF5 file structure and the corresponding class types after loading:

/ (HDF-group, => `H5Dataset._h5_root`)
├── OPEP (HDF-group, =(partitioned according to "partition_options")=> `MetaSet` in a `Partition`)
│   ├── opep_0000 (HDF-group, => `MolData`)
│   │   ├── attrs (HDF-attributes of the molecule "/OPEP/opep_0000")
│   │   │   ├── cg_embeds (a 1-d numpy.int array)
│   │   │   ├── N_frames (int, number of frames = size of cg_coords on axis 0)
│   │   │   ... (optional, e.g., cg_top, cg_pdb, etc. that correspond to the molecule)
│   │   ├── cg_coords (HDF-dataset of the molecule "/OPEP/opep_0000", 3-d numpy.float32 array)
│   │   └── cg(_delta)_forces (HDF-dataset of the molecule "/OPEP/opep_0000", 3-d numpy.float32 array)
│   ... (other molecules in "/OPEP")
├── CATH (HDF-group, =(partitioned according to "partition_options")=> `MetaSet` in a `Partition`)
...
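For concreteness, here is a minimal sketch (not part of mlcg) of how such a file could be assembled with h5py; the group, dataset and attribute names follow the tree above, and the arrays are random placeholders standing in for real CG data:

import h5py
import numpy as np

# Minimal sketch: build an HDF5 file with the layout shown above.
n_frames, n_beads = 1000, 50
with h5py.File("dataset.h5", "w") as f:
    opep = f.create_group("OPEP")
    mol = opep.create_group("opep_0000")
    # per-molecule attributes: embedding vector and frame count
    mol.attrs["cg_embeds"] = np.random.randint(1, 10, size=n_beads)
    mol.attrs["N_frames"] = n_frames
    # per-molecule datasets: coordinates and (delta) forces
    mol.create_dataset(
        "cg_coords",
        data=np.random.randn(n_frames, n_beads, 3).astype(np.float32),
    )
    mol.create_dataset(
        "cg_delta_forces",
        data=np.random.randn(n_frames, n_beads, 3).astype(np.float32),
    )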

Basic bricks: MolData and MetaSet

Data structure MolData is the basic brick of the dataset. It holds the embeds, coords and forces of a certain number of frames for a single molecule. Its .sub_sample method performs indexing on the frames.

Data structure MetaSet holds multiple molecules and joins their frame indices together. One can access frames of the underlying MolData objects with a continuous index. The idea is that the molecules in a MetaSet have similar numbers of CG beads, such that each sample requires similar processing time when passing through a neural network. When this rule is enforced, it allows automatic balancing of the batch composition during training and validation.

  • .create_from_hdf5_group: initiate a MetaSet by loading data from a given HDF group. mol_list, detailed_indices, hdf_key_mapping, stride and parallel control which subset is loaded into the MetaSet (for details, see the description of “H5Dataset”).

  • .trim_down_to: drop frames of the underlying MolData objects to obtain a subset with the desired number of frames. The indices of the frames to be dropped are controlled by a random number generator. When the parameter “random_seed” as well as the MolData order and numbers of frames are the same, the dropped frames are reproducible.

  • len() (__len__): return the total number of samples.

  • [] (__getitem__): infer the molecule and get the corresponding frame according to the given unified index. The return value is packaged as an AtomicData object (see the sketch below).
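Putting these pieces together, a minimal usage sketch, assuming a file with the layout above and the default HDF key names:

import h5py
from mlcg.datasets.h5_dataset import MetaSet

# Minimal sketch: load two molecules from the "OPEP" group into a MetaSet.
with h5py.File("dataset.h5", "r") as f:
    metaset = MetaSet.create_from_hdf5_group(
        f["OPEP"],
        mol_list=["opep_0000", "opep_0001"],
        stride=10,  # keep every 10th frame
    )
    print(len(metaset))  # total number of frames across both molecules
    sample = metaset[0]  # an AtomicData object for the first unified index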

Train/validation split: Partition

Data structure Partition can hold multiple MetaSets for training and validation purposes. Its main function is to automatically adjust (subsample) the MetaSet(s) and the underlying MolData to form a balanced data source, from which a certain number of batches can be drawn. Each batch will contain a pre-given number of samples from each MetaSet. One or several Partitions can be created to load parts of the same HDF5 file into memory. The exact content inside a Partition object is controlled by partition_options, a parameter for initializing an H5Dataset.

Full Dataset: H5Dataset

Data structure H5Dataset opens an HDF5 file and establishes the partitions. partition_options describes which partitions to create and what content to load into them. (A detailed description and examples follow.) loading_options mainly deals with the HDF key mapping (which datasets/attributes correspond to the coordinates, forces and embeddings) as well as, optionally, the information for correctly splitting the dataset in a parallel training/validation setup.

An example “partition_options” (as a Python mapping, e.g., a dict):

{
        "train": {
                "metasets": {
                        "OPEP": {
                                "molecules": [
                                        "opep_0000",
                                        "opep_0001",
                                        ...
                                ],
                                "stride": 1, # optional, default 1
                                # optional; selects the frames to work with for each molecule
                                # (applied before striding and before the split across parallel processes).
                                "detailed_indices": {
                                        # Form 1: an explicit list of frame indices.
                                        "opep_0000": [1, 3, 5, 7, 9, ...],
                                        # Form 2: a ratio-based splitting specification with the
                                        # mandatory keys val_ratio, test_ratio and seed
                                        # (see `training_validation_splitting` in the detailed docs).
                                        "opep_0001": {
                                                "val_ratio": 0.1,
                                                "test_ratio": 0.1,
                                                "seed": 12345,
                                        },
                                        # optional; write the generated split indices to this path.
                                        "filename": "./splits",
                                        # Molecules without an entry here use all their frames,
                                        # which is equivalent to np.arange(N_frames).
                                },
                        },
                        "CATH": {
                                "molecules": [
                                        "cath_1b43A02",
                                        ...
                                ],
                                "stride": 1, # optional
                                "detailed_indices": {}, # optional
                        }
                },
                "batch_size": {
                        # Each batch will contain 24 samples from "OPEP" and 6 from "CATH".
                        # The two metasets will be trimmed down according to this ratio,
                        # so optimally it should be proportional to the ratio of the
                        # numbers of frames in the metasets.
                        "OPEP": 24,
                        "CATH": 6,
                },
                "subsample_random_seed": 42, # optional, default 42. Random seed for trimming down the frames.
                "max_epoch_samples": None, # optional, default None. For setting a desired dataset size;
                # works by subsampling the dataset after it is loaded with the given stride.
        },
        "val": { # similar
        }
}

An example “loading_options” (as a Python mapping, e.g., a dict):

{
        "hdf_key_mapping": {
                # keys for reading CG data from HDF5 groups
                "embeds": "attrs:cg_embeds",
                "coords": "cg_coords",
                "forces": "cg_delta_forces",
        },
        "parallel": { # optional, default rank=0 and world_size=1 (single process).
                # For split the dataset evenly and load only the necessary parts to each process in a parallelized train/val setup
                "rank": 0, # which process is this
                "world_size": 1, # how many processes are there
        }
}
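Assuming the two mappings above are bound to the variables partition_options and loading_options, constructing the full dataset is then a single call; a minimal sketch:

from mlcg.datasets.h5_dataset import H5Dataset

# Minimal sketch: open the HDF5 file and build the "train" and "val"
# partitions described by partition_options; loading_options supplies the
# HDF key mapping and (optionally) the parallel setup.
dataset = H5Dataset(
    "dataset.h5",
    partition_options=partition_options,
    loading_options=loading_options,
)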

Loading into PyTorch: H5PartitionDataLoader

Class H5PartitionDataLoader mimics the PyTorch data loader and can generate training/validation batches with a fixed proportion of samples from several MetaSets in a Partition. The proportion and batch size are defined when the partition is initialized. When the molecules in each MetaSet have similar embedding vector lengths, processing the output batches will require a more or less fixed amount of VRAM on the GPU, which helps keep memory usage predictable during training.

Note

Usually in a train-val split, each molecule goes to either the training or the validation partition. In some special cases (e.g., non-transferable training) one molecule can be part of both partitions. In this situation, “detailed_indices” can be set to assign the corresponding frames to the desired partitions, as in the sketch below.
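For illustration, a hypothetical partition_options fragment that sends disjoint frame ranges of the same molecule to the two partitions (the frame counts are placeholders):

# Hypothetical example: frames 0..7999 for training, 8000..9999 for validation.
partition_options = {
    "train": {
        "metasets": {
            "OPEP": {
                "molecules": ["opep_0000"],
                "detailed_indices": {"opep_0000": list(range(0, 8000))},
            },
        },
        "batch_size": {"OPEP": 24},
    },
    "val": {
        "metasets": {
            "OPEP": {
                "molecules": ["opep_0000"],
                "detailed_indices": {"opep_0000": list(range(8000, 10000))},
            },
        },
        "batch_size": {"OPEP": 24},
    },
}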

Detailed Docs

class mlcg.datasets.h5_dataset.MolData(name, embeds, coords, forces, weights=None, exclusion_pairs=None)[source]

Data-holder for coordinates, forces and embedding vector of a molecule, e.g., cath_1b43A02.

Parameters:
  • name (str) – Name of the molecule

  • embeds (ndarray) – Type embeddings for each atom, of shape (n_atoms,)

  • coords (ndarray) – Cartesian coordinates of the molecule, of shape (n_frames, n_atoms, 3)

  • forces (ndarray) – Cartesian forces of the molecule, of shape (n_frames, n_atoms, 3)

class mlcg.datasets.h5_dataset.MetaSet(name)[source]

Set of MolData instances for molecules with similar characteristics

Parameters:

name – Name of the metaset

static create_from_hdf5_group(hdf5_group, mol_list, detailed_indices=None, stride=1, hdf_key_mapping={'coords': 'cg_coords', 'embeds': 'attrs:cg_embeds', 'forces': 'cg_delta_forces', 'weights': 'subsampling_weights'}, parallel={'rank': 0, 'world_size': 1}, subsample_using_weights=False, exclude_listed_pairs=False)[source]

Initiate a MetaSet by loading data from a given HDF group. mol_list, detailed_indices, hdf_key_mapping, stride and parallel control which subset is loaded into the MetaSet (for details, see the description of “H5Dataset”).

static grab_n_frames(hdf5_group, mol_name)[source]

Returns the number of frames for the given molecule name

static retrieve_hdf(hdf_grp, hdf_key)[source]

Unified HDF retriever for attributes and datasets.
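As a short illustrative sketch of the “attrs:” convention used in hdf_key_mapping (with an open h5py file handle f as in the earlier examples):

# Keys with an "attrs:" prefix are read from the HDF5 attributes of the
# group; plain keys are read from its datasets.
grp = f["OPEP/opep_0000"]
embeds = MetaSet.retrieve_hdf(grp, "attrs:cg_embeds")  # the cg_embeds attribute
coords = MetaSet.retrieve_hdf(grp, "cg_coords")        # the cg_coords dataset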

trim_down_to(target_n_samples, random_seed=42, verbose=True)[source]

Trim all datasets randomly to reach the target number of samples. The MolData attributes of the MetaSet are permanently subsampled by this method.

class mlcg.datasets.h5_dataset.Partition(name)[source]

Contain several metasets for a certain purpose, e.g., training.

Parameters:

name (str) – name of the partition

sampling_setup(batch_sizes, max_epoch_samples=None, random_seed=42, verbose=True)[source]

Calculate the number of batches available for an epoch according to the batch size for each metaset and optionally the maximum number of samples in an epoch. The molecular dataset will be trimmed accordingly.

class mlcg.datasets.h5_dataset.H5Dataset(h5_file_path, partition_options, loading_options, subsample_using_weights=False, exclude_listed_pairs=False)[source]

The top-level class for handling multiple datasets contained in a HDF5 file.

Parameters:
  • h5_file_path (str) – Path to the hdf5 file containing the dataset(s)

  • partition_options (Dict) – Dictionary of partition names mapping to collections of metaset information

  • loading_options (Dict) – Dictionary of options specifying how hdf5 attrs/datasets are named within hdf5 groups

training_validation_splitting(input_detailed_indices, part_name, metaset_name, mol_list)[source]

Option to split the molecules in a metaset frame by frame into training or validation sets.

Parameters:
  • input_detailed_indices – dictionary read in from a YAML file describing how the data should be split; must contain the 3 primary keys [val_ratio, test_ratio, seed]. As an additional option, the split is written out to filename if it is in dict.keys().

  • part_name – which partition is currently being examined

  • metaset_name – global name describing the class of molecules

  • mol_list – names of the molecules

Outputs:
  self._detailed_indices[part_name][metaset_name] – the indices for the queried metaset and partition

class mlcg.datasets.h5_dataset.H5SimpleDataset(h5_file_path, stride=1, detailed_indices=None, metaset_name=None, mol_list=None, hdf_key_mapping={'coords': 'cg_coords', 'embeds': 'attrs:cg_embeds', 'forces': 'cg_delta_forces'}, parallel={'rank': 0, 'world_size': 1}, subsample_using_weights=False, exclude_listed_pairs=False)[source]

The top-level class for handling a single dataset contained in an HDF5 file. It loads only one single type of molecule (i.e., one MetaSet) and does not support partition splits. Use .get_dataloader to obtain a dataloader for PyTorch training, etc.

Parameters:
  • h5_file_path (str) – Path to the hdf5 file containing the dataset(s)

  • stride (default 1) – Stride for loading the frames

  • detailed_indices (default None) – Set this to manually define which frames are included for each molecule.

  • metaset_name (default None) – the name of the h5 group containing the molecule data. If kept None and the given file consists of only one metaset, this parameter will be inferred from the file.

  • mol_list (default None) – A list of the molecules to be loaded. When kept None, all molecules will be loaded.

  • hdf_key_mapping (default loading the delta forces) – Key mapping for reading the data from the h5 file.

  • parallel (default for single process) – For DDP parallelism. For details, see the description at the top of this file.

get_dataloader(batch_size, collater_fn=<torch_geometric.loader.dataloader.Collater object>, shuffle=True, pin_memory=False)[source]

Parameters:
  • batch_size – Size of the batches to draw from the metaset

  • collater_fn, shuffle, pin_memory – See the PyTorch documentation for dataloader options.
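A minimal usage sketch, assuming an HDF5 file that contains a single metaset:

from mlcg.datasets.h5_dataset import H5SimpleDataset

# Minimal sketch: load a single-metaset file and iterate over batches.
dataset = H5SimpleDataset("dataset.h5", stride=10)
loader = dataset.get_dataloader(batch_size=128, shuffle=True)
for batch in loader:
    ...  # each batch is a collated AtomicData object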

class mlcg.datasets.h5_dataset.H5PartitionDataLoader(data_partition, collater_fn=<torch_geometric.loader.dataloader.Collater object>, pin_memory=False, subsample_using_weights=False)[source]

Load batches from one or multiple MetaSets in a Partition. When there are multiple MetaSets, the order of the data loaders is alphabetically ascending with respect to the MetaSet names.
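A usage sketch; note that how a Partition is retrieved from an H5Dataset is not documented here, so the partitions accessor below is hypothetical:

from mlcg.datasets.h5_dataset import H5PartitionDataLoader

# "partitions" is a hypothetical accessor; adjust to however H5Dataset
# actually exposes its Partition objects.
train_partition = dataset.partitions["train"]
loader = H5PartitionDataLoader(train_partition)
for batch in loader:
    ...  # each batch mixes samples from all metasets in the fixed proportion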

class mlcg.datasets.h5_dataset.H5MetasetDataLoader(metaset, batch_size, collater_fn=<torch_geometric.loader.dataloader.Collater object>, shuffle=True, pin_memory=False)[source]

Load batches from one MetaSet. For kwargs/options, see https://pytorch.org/docs/stable/data.html?highlight=dataset#torch.utils.data.Dataset

Parameters:
  • metaset (Dataset) – Dataset object for a single set of molecules

  • batch_size (int) – Size of the batches to draw from the metaset