add_decoys.py

mlcg_tk.scripts.add_decoys.add_decoy(h5_files, datasets, mol_name_prefix, scale, stride=50, append=True)[source]

Adds decoy molecules with Gaussian noise to the specified HDF5 datasets and optionally combines the files into a single HDF5 file.

This function processes multiple HDF5 files. For each file, it generates decoy molecules by adding Gaussian noise to the coordinates of the molecules in the specified dataset. The decoys are stored as separate molecules in the same dataset, either appended to the existing HDF5 file or appended to a copy of it. Each decoy molecule is named using the provided prefix.

After decoys are added to all specified files, the function can optionally combine the provided h5 files into a single combined HDF5 file, using external links to the original files. The name of the combined file can either be provided or generated automatically based on the input files.

Return type:

None

Arguments:

h5_files: List[str]

A list of paths to the HDF5 files where decoy molecules should be added. Each file should contain exactly one dataset, named by the corresponding entry of the datasets argument.

datasets: List[str]

A list of dataset names within each corresponding HDF5 file. These datasets should contain molecules where decoys will be added.

mol_name_prefix: str

The prefix to be added to the name of each decoy molecule. Each decoy molecule will be named by appending its original molecule name to this prefix.

scale: float

The standard deviation of the Gaussian noise to be added to the coordinates of the beads.

stride: Optional[int], default=50

The stride value to be used when selecting frames from the original molecules. For example, a stride of 50 will select every 50th frame from the original molecule dataset. If not specified, defaults to 50.

append: Optional[bool], default=True

If True, decoy molecules will be added to the datasets in the existing HDF5 files. If False, each HDF5 file will be copied first and the decoy molecules will be appended to the copy.

Notes:

  • The scale value controls the level of noise added to the decoy molecules. A larger value will result in more distorted configurations.

  • The stride argument reduces the number of frames included in each decoy molecule by selecting a subset at the specified interval. The higher the stride, the fewer decoy frames are present in the dataset.
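The interplay of scale and stride described in the notes above can be illustrated with a minimal NumPy sketch. The helper make_noisy_frames, the array shape, and the seed argument are assumptions made for illustration; they are not part of mlcg_tk.

```python
import numpy as np


def make_noisy_frames(coords, scale, stride=50, seed=None):
    """Hypothetical helper: stride the frames, then add zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    strided = coords[::stride]  # keep every `stride`-th frame
    noise = rng.normal(loc=0.0, scale=scale, size=strided.shape)
    return (strided + noise).astype(np.float32)


# Assumed layout: 1000 frames of a 10-bead molecule, 3 coordinates per bead.
coords = np.zeros((1000, 10, 3), dtype=np.float64)

decoys = make_noisy_frames(coords, scale=0.5, stride=50, seed=0)
print(decoys.shape)  # (20, 10, 3): one decoy frame per 50 source frames

# A larger scale gives more strongly distorted configurations.
mild = make_noisy_frames(coords, scale=0.1, stride=50, seed=0)
strong = make_noisy_frames(coords, scale=1.0, stride=50, seed=0)
print(mild.std() < strong.std())  # True
```

Doubling the stride halves the number of decoy frames, while scale only changes how far each retained frame is displaced.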

mlcg_tk.scripts.add_decoys.add_noise_decoy_molecule(source_molecule, location, scale, name, coords_name='cg_coords', forces_name='cg_delta_forces', stride=50, copy_metadata=True)[source]

Create a zero-force decoy molecule from a real molecule in an h5.

Function written by Aleksander Durumeric.

The molecule is created by extracting coordinates from an existing molecule, adding Gaussian noise to them, and then storing them (along with zero forces) under a new molecule entry. Metadata attrs are added if copy_metadata is set.

Return type:

Group

Arguments:

source_molecule:

hdf5.Group that contains coordinate entries under coords_name (“cg_coords” by default). These coordinates are strided by stride and then noised to create the decoy molecule coordinates.

location:

hdf5.Group under which a new group will be created for the added molecule. The new molecule is itself a group with name name.

scale:

Standard deviation of the added 0-mean Gaussian noise.

name:

Name of hdf5.Group added. See location.

coords_name:

Name of dataset that holds the coordinates. Used to query source_molecule and to name the corresponding entry in the new molecule.

forces_name:

Name of dataset that holds the forces. Only used to name the corresponding entry in the new molecule.

stride:

Amount by which to stride the source data when creating the new molecule. For example, 50 would create a noise molecule with approximately 1/50th of the frames.

copy_metadata:

If True, the attrs under META_CG_EMBEDS_KEY are copied to the created molecule and META_CG_NFRAMES_KEY is used to store the number of frames. This should typically be set to True.

Notes:

This method saves coordinates and forces as float32, no matter what precision they originate as.
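The steps described for this function (stride, add noise, store zero forces, save as float32) can be sketched with h5py and NumPy. This is an illustrative re-implementation under stated assumptions, not the mlcg_tk source: the name sketch_noise_decoy, the seed argument, and the group layout are hypothetical, and the metadata copy is omitted because the values of META_CG_EMBEDS_KEY and META_CG_NFRAMES_KEY are not given here.

```python
import h5py
import numpy as np


def sketch_noise_decoy(source_molecule, location, scale, name,
                       coords_name="cg_coords",
                       forces_name="cg_delta_forces",
                       stride=50, seed=None):
    """Illustrative sketch only; metadata copying is left out."""
    rng = np.random.default_rng(seed)
    # Read every `stride`-th frame of the source coordinates.
    coords = source_molecule[coords_name][::stride]
    noised = coords + rng.normal(0.0, scale, size=coords.shape)
    # The decoy is a new group under `location`, stored as float32.
    decoy = location.create_group(name)
    decoy.create_dataset(coords_name, data=noised.astype(np.float32))
    decoy.create_dataset(forces_name,
                         data=np.zeros(coords.shape, dtype=np.float32))
    return decoy


# In-memory HDF5 file (core driver without backing store): nothing hits disk.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    dataset = f.create_group("my_dataset")  # assumed dataset name
    mol = dataset.create_group("mol1")
    frames = np.random.default_rng(0).normal(size=(200, 5, 3))
    mol.create_dataset("cg_coords", data=frames)

    decoy = sketch_noise_decoy(mol, dataset, scale=0.5,
                               name="DECOY_mol1", stride=50, seed=1)
    decoy_shape = decoy["cg_coords"].shape
    decoy_dtype = decoy["cg_coords"].dtype
    forces_all_zero = not decoy["cg_delta_forces"][:].any()

print(decoy_shape, decoy_dtype, forces_all_zero)
```

With 200 source frames and a stride of 50, the decoy holds 4 frames, and both of its datasets are float32 even though the source coordinates are float64.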

mlcg_tk.scripts.add_decoys.update_partition_file(partition_file, decoy_h5_files, mol_name_prefixes, partition_name)[source]

Updates a partition file by adding new molecules with specified prefixes to the “molecules” list of existing datasets.

This function loads a YAML partition file, updates it by appending molecule names with specified prefixes, and then saves the updated partition back to a new YAML file. It is typically used in scenarios where new decoy molecules, generated with a prefix (such as a decoy identifier), need to be added to the partition file.

Return type:

None

Arguments:

partition_file: str

Path to the original partition YAML file that contains metadata about the datasets and molecules. The function will read this file and modify the “molecules” list within the “metasets” section of the partition.

decoy_h5_files: List[str]

A list of paths to HDF5 files that contain the decoy molecules. Each file corresponds to a dataset in the partition file. The number and ordering of files should match the number and ordering of datasets in the partition file.

mol_name_prefixes: List[str]

A list of prefixes to be added to the names of existing molecules in the partition file. Each prefix will be prepended to the names of the molecules in the partition’s “molecules” list, essentially creating new “decoy” molecules with the prefixed names.

partition_name: str

The path where the updated partition YAML file will be saved. This will overwrite the existing file at that location.

Notes:

  • The function assumes that the partition YAML file follows a specific structure, where molecule names are listed under the “train” section in the “metasets” -> “molecules” key.

  • The function does not add decoy molecules to the validation dataset provided in the partition file.

Example:

If the partition file contains the following:

```yaml
train:
  metasets:
    my_dataset:
      molecules:
        - mol1
        - mol2
```

and mol_name_prefixes = ["DECOY_5", "DECOY_05"], the updated partition file will contain:

```yaml
train:
  metasets:
    my_dataset:
      molecules:
        - mol1
        - mol2
        - DECOY_5_mol1
        - DECOY_5_mol2
        - DECOY_05_mol1
        - DECOY_05_mol2
```
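The transformation in this example can be reproduced on a plain dictionary. The sketch below is not the library implementation: add_prefixed_molecules is a hypothetical helper, the underscore separator is inferred from the names in the example, and the YAML loading and saving and the decoy_h5_files argument are left out.

```python
import copy


def add_prefixed_molecules(partition, prefixes, sep="_"):
    """Hypothetical helper: extend each train metaset with prefixed names."""
    updated = copy.deepcopy(partition)  # leave the input untouched
    for metaset in updated["train"]["metasets"].values():
        originals = list(metaset["molecules"])  # snapshot before extending
        for prefix in prefixes:
            metaset["molecules"].extend(
                f"{prefix}{sep}{mol}" for mol in originals
            )
    return updated


partition = {
    "train": {"metasets": {"my_dataset": {"molecules": ["mol1", "mol2"]}}}
}
updated = add_prefixed_molecules(partition, ["DECOY_5", "DECOY_05"])
print(updated["train"]["metasets"]["my_dataset"]["molecules"])
# ['mol1', 'mol2', 'DECOY_5_mol1', 'DECOY_5_mol2', 'DECOY_05_mol1', 'DECOY_05_mol2']
```

Only the "train" section is touched, which matches the note that decoys are not added to the validation dataset.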