Fixed datasets

mlcg.datasets contains a template InMemoryDataset for CLN025, a 10-amino-acid miniprotein that shows prototypical folding and unfolding behavior. The ChignolinDataset class illustrates how a general dataset can be downloaded, unpacked and organized, transformed and processed, and collated for training models. Here, users can find more information on implementing custom datasets.

Alanine Dipeptide Dataset

Dataset of a single 1M-step trajectory of alanine dipeptide in explicit water. The trajectory was simulated with a Langevin scheme in [ACEMD] at 300 K using the [AMBER_ff_99SB_ILDN] force field. The cubic simulation box was 2.3222 cubic nm, the integration timestep was 2 fs, and the solvent consisted of 651 [TIP3P] water molecules. Electrostatics were computed every two steps using the PME method with a real-space cutoff of 9 nm and a grid spacing of 0.1 nm, and all bonds between heavy and hydrogen atoms were constrained.

class mlcg.datasets.alanine_dipeptide.AlanineDataset(root, stride=1, transform=None, pre_transform=None, pre_filter=None)[source]

Dataset for training a CG model of the alanine dipeptide protein following a Cα + 1 Cβ CG mapping.

Alanine Dipeptide CG structure:

        CB(3)
          |
  N(1) - CA(2) - C(4)
 /                  \
C(0)                 N(5)

This Dataset produces delta forces for model training, in which the CG prior forces (harmonic bonds and angles) have been subtracted from the full CG forces.
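As a rough illustration of the delta-force idea, the sketch below subtracts the force of a single harmonic bond prior from the CG forces of two beads. It is not the mlcg implementation: the function names, the energy form 0.5 * k * (r - x0)**2, and all numbers are assumptions chosen for illustration only.

```python
import math

def harmonic_bond_forces(pos_i, pos_j, k, x0):
    """Forces on beads i and j from U = 0.5 * k * (r - x0)**2 (assumed form)."""
    diff = [a - b for a, b in zip(pos_i, pos_j)]
    r = math.sqrt(sum(d * d for d in diff))
    # -dU/dr, projected onto the vector pointing from bead j to bead i
    scale = -k * (r - x0) / r
    f_i = [scale * d for d in diff]
    f_j = [-f for f in f_i]
    return f_i, f_j

def delta_forces(cg_forces, prior_forces):
    """Subtract prior forces from the full CG forces, bead by bead."""
    return [
        [f - p for f, p in zip(f_bead, p_bead)]
        for f_bead, p_bead in zip(cg_forces, prior_forces)
    ]

# Hypothetical two-bead frame in arbitrary, consistent units
positions = [[0.0, 0.0, 0.0], [1.6, 0.0, 0.0]]
cg_forces = [[5.0, 0.5, 0.0], [-5.0, -0.5, 0.0]]

f0, f1 = harmonic_bond_forces(positions[0], positions[1], k=100.0, x0=1.5)
deltas = delta_forces(cg_forces, [f0, f1])
```

A model trained on such delta forces only has to learn what the priors do not already capture. At simulation time the prior forces are added back to the model output.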

The class works as follows:
  • If the raw data (coordinates, forces, and PDB file) for alanine dipeptide do not exist, the files are automatically downloaded and processed

  • If the raw data exist but the processed data do not, “root/processed/” is created and the raw dataset is processed

  • If both the raw data and the processed data exist, the class loads the processed dataset containing:
    • data : AtomicData object containing all the CG positions, forces, embeddings

    • slices : Slices of the AtomicData object

    • prior_models : Prior models (HarmonicBonds, HarmonicAngles) of the dataset with x_0 and sigma

    • topologies : Object of Topology class containing all topology information of the _CG_ molecule, including neighbor lists for bond and angle priors
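The three cases above amount to a check on which files are already present. The following sketch mimics that decision with plain file-existence tests. It assumes the usual InMemoryDataset layout of "root/raw/" and "root/processed/". The helper next_step and the file names are hypothetical and stand in for whatever raw_file_names and processed_file_names actually return.

```python
from pathlib import Path
import tempfile

def next_step(root, raw_names, processed_names):
    """Return which stage a dataset rooted at `root` would need to run next."""
    root = Path(root)
    raw_ok = all((root / "raw" / n).exists() for n in raw_names)
    processed_ok = all((root / "processed" / n).exists() for n in processed_names)
    if not raw_ok:
        return "download_and_process"
    if not processed_ok:
        return "process"
    return "load"

# Hypothetical file names, for illustration only
raw_names = ["coords_forces.npz", "structure.pdb"]
processed_names = ["dataset.pt"]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    stages = [next_step(root, raw_names, processed_names)]

    # Simulate a finished download
    (root / "raw").mkdir()
    for name in raw_names:
        (root / "raw" / name).touch()
    stages.append(next_step(root, raw_names, processed_names))

    # Simulate finished processing
    (root / "processed").mkdir()
    for name in processed_names:
        (root / "processed" / name).touch()
    stages.append(next_step(root, raw_names, processed_names))
```

In this sketch, deleting the processed directory is enough to force reprocessing without downloading the raw files again.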

Inputs:
root :

Location for AlanineDataset to be downloaded and processed

stride :

Stride length over dataset
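A minimal sketch of what a stride does to a trajectory, assuming it keeps every stride-th frame starting from the first one. The helper subsample is hypothetical and is not part of mlcg.

```python
def subsample(frames, stride=1):
    """Keep every `stride`-th frame, starting from the first."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return frames[::stride]

frames = list(range(10))            # stand-in for 10 trajectory frames
kept = subsample(frames, stride=4)  # frames 0, 4 and 8 remain
```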

Default priors:
  • HarmonicBonds, HarmonicAngles

Optional priors:
  • Repulsion
    • If repulsion prior is used, a custom neighbor list is created for the repulsion prior where all pairs of beads not interacting through bonds and angles are included in the interaction set

download()[source]

Downloads the dataset to the self.raw_dir folder.

kB = 0.0019872041

Boltzmann constant in kcal/mol/K

static load_original_topology(pdb_file)[source]

Method to load the original topology

Parameters:

pdb_file (str) – Path of all-atom PDB file

Returns:

Topology class object for all-atom molecule

Return type:

topology

static make_baseline_models(data, beta, priors_cls)[source]

Method to make all baseline models

Parameters:
  • data (AtomicData) – AtomicData object of entire trajectory

  • beta (float) – 1/(k_B * T)

  • priors_cls (List[_Prior]) – List of prior classes

Returns:

Dictionary of prior models fitted with the right harmonic restraint values

Return type:

baseline_models
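To make the role of beta and the fitted parameters concrete, the sketch below computes beta from the class constants and estimates x_0, sigma, and a force constant from samples of a single bond length. The fitting convention (a Gaussian distribution with variance 1 / (beta * k)) is an assumption, not a description of how mlcg fits its priors. The sample values are invented.

```python
import statistics

KB = 0.0019872041  # Boltzmann constant in kcal/mol/K (the kB attribute)

def beta_from_temperature(temperature, kb=KB):
    """Inverse thermal energy beta = 1 / (k_B * T), in mol/kcal."""
    return 1.0 / (kb * temperature)

def fit_harmonic(samples, beta):
    """Estimate x_0, sigma and a force constant for one bond or angle.

    Assumes the samples follow exp(-beta * 0.5 * k * (x - x_0)**2),
    so that the variance equals 1 / (beta * k).
    """
    x0 = statistics.fmean(samples)
    sigma = statistics.pstdev(samples, mu=x0)
    k = 1.0 / (beta * sigma ** 2)
    return x0, sigma, k

beta = beta_from_temperature(300)  # temperature attribute of AlanineDataset
bond_lengths = [1.50, 1.54, 1.52, 1.56, 1.53]  # hypothetical samples
x0, sigma, k = fit_harmonic(bond_lengths, beta)
```

Under this convention a narrow distribution (small sigma) yields a stiff restraint, and a higher temperature yields a stiffer fit for the same observed spread.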

static make_cg_topology(topology, cg_mapping={('ACE', 'C'): ('CA_C', 0, 12), ('ALA', 'C'): ('C_A', 4, 12), ('ALA', 'CA'): ('CA_A', 2, 12), ('ALA', 'CB'): ('CB_A', 3, 12), ('ALA', 'N'): ('N_A', 1, 14), ('NME', 'N'): ('N_N', 5, 14)}, special_terminal=False)[source]

Method to make the Topology class object of the CG molecule. Creates custom bonds and angles to make a non-linear CG molecule.

Parameters:
  • topology (Topology) – All-atom topology

  • cg_mapping (optional) – Dictionary containing CG mapping. The default is AL_CG_MAP.

  • special_terminal (optional) – True if termini beads are to be treated separately. The default is False.

Returns:

Topology class object for CG molecule

Return type:

cg_topo
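The default mapping in the signature above is keyed by (residue name, atom name). The sketch below shows how such a dictionary can pick the retained atoms out of an all-atom listing. The helper select_cg_beads and the atom listing are hypothetical. Reading the second tuple entry as the bead number from the CG structure diagram, and the third as a rounded atomic mass (12 for C, 14 for N), is an assumption.

```python
# Default mapping as listed in the signature:
# (residue name, atom name) -> (bead name, integer label, third field)
AL_CG_MAP = {
    ("ACE", "C"): ("CA_C", 0, 12),
    ("ALA", "N"): ("N_A", 1, 14),
    ("ALA", "CA"): ("CA_A", 2, 12),
    ("ALA", "CB"): ("CB_A", 3, 12),
    ("ALA", "C"): ("C_A", 4, 12),
    ("NME", "N"): ("N_N", 5, 14),
}

def select_cg_beads(atoms, cg_mapping):
    """Return (all-atom index, bead name, label) for each retained atom."""
    beads = []
    for atom_index, (resname, atomname) in enumerate(atoms):
        entry = cg_mapping.get((resname, atomname))
        if entry is not None:
            bead_name, label, _ = entry
            beads.append((atom_index, bead_name, label))
    # Order beads by their integer label
    return sorted(beads, key=lambda bead: bead[2])

# Hypothetical, truncated all-atom listing of (residue name, atom name)
atoms = [
    ("ACE", "CH3"), ("ACE", "C"), ("ACE", "O"),
    ("ALA", "N"), ("ALA", "H"), ("ALA", "CA"), ("ALA", "CB"),
    ("ALA", "C"), ("ALA", "O"),
    ("NME", "N"), ("NME", "CH3"),
]
beads = select_cg_beads(atoms, AL_CG_MAP)
```

Atoms whose (residue name, atom name) pair is missing from the mapping are simply dropped, which is how six beads remain in this example.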

make_data_slices_prior(coords_forces_file, topology, cg_mapping, cg_topo)[source]

Method to make collated AtomicData object, slices, and baseline models

Parameters:
  • coords_forces_file (str) – npz file containing the forces and coordinates from the all-atom simulation

  • topology (Topology) – Topology of all-atom model

  • cg_mapping (dict) – Dictionary containing CG mapping.

  • cg_topo (Topology) – Topology of CG model

Return type:

Tuple[AtomicData, dict, ModuleDict]

Returns:

  • data_list_coll – Collated AtomicData object

  • slices – Dictionary containing slices of the AtomicData object

  • baseline_models – Module dictionary containing fitted prior models

make_priors(priors_cls, cg_topo)[source]

Method to make prior neighbor lists

Parameters:
  • priors_cls (List[_Prior]) – List of prior classes

  • cg_topo (Topology)

Returns:

Dictionary containing all prior neighbor lists

Return type:

prior_nls

process(cg_mapping={('ACE', 'C'): ('CA_C', 0, 12), ('ALA', 'C'): ('C_A', 4, 12), ('ALA', 'CA'): ('CA_A', 2, 12), ('ALA', 'CB'): ('CB_A', 3, 12), ('ALA', 'N'): ('N_A', 1, 14), ('NME', 'N'): ('N_N', 5, 14)})[source]

Method for processing the raw data. This is where all processing function calls take place. All outputs are stored in the relevant processed files.

Parameters:

cg_mapping (optional) – CG mapping dictionary. The default is AL_CG_MAP.

property processed_file_names

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

property raw_file_names

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

static repulsion_nls(name, cg_topo)[source]

Method for generating neighbor list for Repulsion prior - all interactions not included in bond or angle interactions

Parameters:
  • name (str) – Name of prior class

  • cg_topo (Topology) – Topology object of CG molecule

Returns:

Dictionary containing repulsion neighbor list

Return type:

dict
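The rule "all pairs not included in bond or angle interactions" can be sketched in a few lines. The bonds and angles below are read off the CG structure diagram for alanine dipeptide. The function repulsion_pairs is a hypothetical stand-in and does not reproduce the dictionary format returned by repulsion_nls.

```python
from itertools import combinations

# Bonds and angles read off the CG structure diagram (bead indices 0 to 5)
bonds = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5)]
angles = [(0, 1, 2), (1, 2, 3), (1, 2, 4), (3, 2, 4), (2, 4, 5)]

def repulsion_pairs(n_beads, bonds, angles):
    """All bead pairs that share neither a bond nor an angle."""
    excluded = {tuple(sorted(b)) for b in bonds}
    # The two end beads of an angle already interact through that angle
    excluded |= {tuple(sorted((a[0], a[2]))) for a in angles}
    return [p for p in combinations(range(n_beads), 2) if p not in excluded]

pairs = repulsion_pairs(6, bonds, angles)
```

Of the 15 possible bead pairs, 5 are bonded and 5 are the end points of an angle, which leaves 5 pairs for the repulsion prior in this sketch.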

save_dataset(pickle_name)[source]

Method for saving dataset given pickle name

Parameters:

pickle_name (str) – Name of pickle to store the dataset in

temperature = 300

Temperature used to generate the underlying all-atom data in [K]

Chignolin Dataset

Dataset of 3744 individual all-atom trajectories simulated with [ACEMD] using the adaptive sampling strategy described in [AdaptiveStrategy]. All trajectories were simulated at 350 K with [CHARM22Star] in a cubic box of (40 Å)³ with 1881 TIP3P water molecules and two Na⁺ ions, using a Langevin integrator with an integration timestep of 4 fs, a damping constant of 0.1 ps⁻¹, heavy-hydrogen constraints, a PME cutoff of 9 Å, and a PME mesh grid of 1 Å. The total aggregate simulation time is 187.2 µs.
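For orientation, the figures above imply a mean trajectory length. The short calculation below assumes all 3744 trajectories are equally long, which the description does not state explicitly.

```python
n_trajectories = 3744
aggregate_ns = 187_200   # 187.2 microseconds expressed in nanoseconds
timestep_fs = 4

# Mean length per trajectory, assuming equally long trajectories
ns_per_trajectory = aggregate_ns / n_trajectories

# 1 ns = 1,000,000 fs, so this is the number of integration steps
steps_per_trajectory = ns_per_trajectory * 1_000_000 / timestep_fs
```

Under that assumption each trajectory covers 50 ns, or 12.5 million integration steps at 4 fs.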

class mlcg.datasets.chignolin.ChignolinDataset(root, transform=None, pre_transform=None, pre_filter=None, mapping='slice_aggregate', terminal_embeds=False, max_num_files=None)[source]

Dataset for training a CG model of the chignolin protein following a Cα CG mapping using the all-atom data from [CGnet].

This Dataset produces delta forces for model training, in which the CG prior forces (harmonic bonds and angles and a repulsive term) have been subtracted from the full CG forces.

download()[source]

Downloads the dataset to the self.raw_dir folder.

kB = 0.0019872041

Boltzmann constant in kcal/mol/K

process()[source]

Processes the dataset to the self.processed_dir folder.

property processed_file_names

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

property raw_file_names

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

temperature = 350

Temperature used to generate the underlying all-atom data in [K]