Fixed datasets¶
mlcg.datasets contains a template InMemoryDataset for CLN025, a 10-amino-acid miniprotein that shows prototypical folding and unfolding behavior. The ChignolinDataset class illustrates how a general dataset can be downloaded, unpacked and organized, transformed and processed, and collated for training models. Here, users can find more information on implementing custom datasets.
Alanine Dipeptide Dataset¶
Dataset of a single 1M step trajectory of alanine dipeptide in explicit water. The trajectory is simulated using a Langevin scheme with [ACEMD] at 300K through the [AMBER_ff_99SB_ILDN] force field. The cubic simulation box was 2.3222 cubic nm, an integration timestep of 2 fs was used, and the solvent was composed of 651 [TIP3P] water molecules. Electrostatics were computed every two steps using the PME method with a real-space cutoff of 9 nm and a grid spacing of 0.1 nm, and all bonds between heavy and hydrogen atoms were constrained.
- class mlcg.datasets.alanine_dipeptide.AlanineDataset(root, stride=1, transform=None, pre_transform=None, pre_filter=None)[source]¶
Dataset for training a CG model of alanine dipeptide following a Cα + 1 Cβ CG mapping.
Alanine Dipeptide CG structure:
            CB(3)
              |
    N(1) - CA(2) - C(4)
    /                 \
  C(0)               N(5)
This Dataset produces delta forces for model training, in which the CG prior forces (harmonic bonds and angles) have been subtracted from the full CG forces.
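The delta-force idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not mlcg's implementation: the function names, the harmonic form U = 0.5 * k * (r - r0)**2, and all numbers below are hypothetical.

```python
import math

def harmonic_bond_forces(pos_i, pos_j, k, r0):
    """Forces on beads i and j from U = 0.5 * k * (r - r0)**2 (assumed form)."""
    diff = [a - b for a, b in zip(pos_i, pos_j)]
    r = math.sqrt(sum(d * d for d in diff))
    # -dU/dr, projected onto the unit vector pointing from j to i
    magnitude = -k * (r - r0)
    f_i = [magnitude * d / r for d in diff]
    f_j = [-f for f in f_i]
    return f_i, f_j

def delta_forces(cg_forces, prior_forces):
    """Subtract the prior forces from the full CG forces, bead by bead."""
    return [
        [f - p for f, p in zip(full_bead, prior_bead)]
        for full_bead, prior_bead in zip(cg_forces, prior_forces)
    ]

# Hypothetical two-bead frame (made-up positions and forces)
positions = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
cg_forces = [[1.5, 0.0, 0.0], [-1.5, 0.0, 0.0]]

prior = list(harmonic_bond_forces(positions[0], positions[1], k=10.0, r0=1.5))
residual = delta_forces(cg_forces, prior)  # what a model would be trained on
```

With several priors, their forces would be summed per bead before the subtraction.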
- The class works as follows:
If the raw data (coordinates, forces, and pdb file) for alanine dipeptide does not exist, the files will automatically be downloaded and processed
If the raw data exists, “root/processed/” will be created and the raw dataset will be processed
- If the raw data and processed data exist, the class will load the processed dataset containing:
data : AtomicData object containing all the CG positions, forces, embeddings
slices : Slices of the AtomicData object
prior_models : Prior models (HarmonicBonds, HarmonicAngles) of the dataset with x_0 and sigma
topologies : Object of Topology class containing all topology information of the CG molecule, including neighbor lists for bond and angle priors
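The three cases above amount to a simple check on which files are present. The sketch below mirrors that decision with the standard library only. The file names and the `next_step` helper are hypothetical, and the `raw/` and `processed/` subfolders are assumed to sit directly under `root`; the real class relies on its `raw_file_names` and `processed_file_names` properties.

```python
import tempfile
from pathlib import Path

RAW_FILES = ["coords_forces.npz", "structure.pdb"]  # hypothetical names
PROCESSED_FILES = ["dataset.pt"]                    # hypothetical name

def next_step(root, raw_names=RAW_FILES, processed_names=PROCESSED_FILES):
    """Return which action the dataset class would take for this root."""
    root = Path(root)
    raw_ok = all((root / "raw" / n).exists() for n in raw_names)
    processed_ok = all((root / "processed" / n).exists() for n in processed_names)
    if not raw_ok:
        return "download_and_process"
    if not processed_ok:
        return "process"
    return "load"

def touch_all(folder, names):
    """Create empty placeholder files so the checks above can find them."""
    folder.mkdir(parents=True, exist_ok=True)
    for name in names:
        (folder / name).touch()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    states = [next_step(root)]                  # nothing on disk yet
    touch_all(root / "raw", RAW_FILES)
    states.append(next_step(root))              # raw data only
    touch_all(root / "processed", PROCESSED_FILES)
    states.append(next_step(root))              # raw and processed data
```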
- Inputs:
- root :
Location for AlanineDataset to be downloaded and processed
- stride :
Stride length over dataset
- Default priors:
HarmonicBonds, HarmonicAngles
- Optional priors:
- Repulsion
If repulsion prior is used, a custom neighbor list is created for the repulsion prior where all pairs of beads not interacting through bonds and angles are included in the interaction set
- kB = 0.0019872041¶
Boltzmann constant in kcal/mol/K
- static load_original_topology(pdb_file)[source]¶
Method to load the original topology
- Parameters:
pdb_file (str) – Path of all-atom PDB file
- Returns:
Topology class object for all-atom molecule
- Return type:
topology
- static make_baseline_models(data, beta, priors_cls)[source]¶
Method to make all baseline models
- Parameters:
data (AtomicData) – AtomicData object of entire trajectory
beta (float) – 1/(k_B * T)
priors_cls (List[_Prior]) – List of prior classes
- Returns:
Dictionary of prior models fitted with the right harmonic restraint values
- Return type:
baseline_models
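As a rough illustration of the quantities involved, the sketch below computes beta from the class attributes `kB` and `temperature` documented on this page, then fits one harmonic term by Boltzmann inversion. The fitting rule (k = 1 / (beta * sigma**2) for U = 0.5 * k * (x - x_0)**2), the `fit_harmonic` helper, and the sample bond lengths are assumptions made for illustration; mlcg may define its harmonic form and fitting procedure differently.

```python
import statistics

KB = 0.0019872041   # Boltzmann constant in kcal/mol/K (value from this page)
TEMPERATURE = 300   # K (value from this page)

beta = 1.0 / (KB * TEMPERATURE)  # in mol/kcal

def fit_harmonic(samples, beta):
    """Assumed fit: match a Gaussian p(x) ~ exp(-beta * 0.5 * k * (x - x_0)**2)."""
    x_0 = statistics.fmean(samples)      # equilibrium value
    sigma = statistics.pstdev(samples)   # spread of the sampled feature
    k = 1.0 / (beta * sigma ** 2)        # stiffness from Boltzmann inversion
    return x_0, sigma, k

# Hypothetical bond-length samples for a single CG bond
bond_lengths = [1.50, 1.52, 1.48, 1.51, 1.49]
x_0, sigma, k = fit_harmonic(bond_lengths, beta)
```

A narrower sampled distribution gives a stiffer restraint, since k scales as 1 / sigma**2.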
- static make_cg_topology(topology, cg_mapping={('ACE', 'C'): ('CA_C', 0, 12), ('ALA', 'C'): ('C_A', 4, 12), ('ALA', 'CA'): ('CA_A', 2, 12), ('ALA', 'CB'): ('CB_A', 3, 12), ('ALA', 'N'): ('N_A', 1, 14), ('NME', 'N'): ('N_N', 5, 14)}, special_terminal=False)[source]¶
Method to make Topology class object of CG molecule, creates custom bonds and angles to make a non-linear CG molecule
- Parameters:
topology (Topology) – All-atom topology
cg_mapping ((optional)) – Dictionary containing CG mapping. The default is AL_CG_MAP.
special_terminal ((optional)) – True if termini beads are to be treated separately. The default is False.
- Returns:
Topology class object for CG molecule
- Return type:
cg_topo
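The default mapping in the signature above keys each CG bead on a (residue name, atom name) pair. The sketch below applies that dictionary to a simplified atom list to pick out the six beads. The atom records and the `select_cg_beads` helper are hypothetical, and reading the third tuple entry as a mass (12 for carbon, 14 for nitrogen) is an inference, not something this page states.

```python
# Default mapping copied from the make_cg_topology signature:
# (residue name, atom name) -> (bead type name, bead index, third entry)
AL_CG_MAP = {
    ("ACE", "C"): ("CA_C", 0, 12),
    ("ALA", "N"): ("N_A", 1, 14),
    ("ALA", "CA"): ("CA_A", 2, 12),
    ("ALA", "CB"): ("CB_A", 3, 12),
    ("ALA", "C"): ("C_A", 4, 12),
    ("NME", "N"): ("N_N", 5, 14),
}

# Hypothetical, simplified all-atom records: (atom index, residue, atom name).
# Hydrogens are omitted for brevity.
all_atoms = [
    (0, "ACE", "CH3"), (1, "ACE", "C"), (2, "ACE", "O"),
    (3, "ALA", "N"), (4, "ALA", "CA"), (5, "ALA", "CB"),
    (6, "ALA", "C"), (7, "ALA", "O"), (8, "NME", "N"), (9, "NME", "CH3"),
]

def select_cg_beads(atoms, cg_mapping):
    """Keep only mapped atoms; return (bead index, bead name, atom index)."""
    beads = []
    for atom_idx, resname, atom_name in atoms:
        entry = cg_mapping.get((resname, atom_name))
        if entry is not None:
            bead_name, bead_idx, _third = entry
            beads.append((bead_idx, bead_name, atom_idx))
    return sorted(beads)  # ordered by CG bead index

cg_beads = select_cg_beads(all_atoms, AL_CG_MAP)
```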
- make_data_slices_prior(coords_forces_file, topology, cg_mapping, cg_topo)[source]¶
Method to make collated AtomicData object, slices, and baseline models
- Parameters:
- Return type:
Tuple[AtomicData, dict, ModuleDict]
- Returns:
data_list_coll – Collated AtomicData object
slices – Dictionary containing slices of the AtomicData object
baseline_models – Module dictionary containing fitted prior models
- make_priors(priors_cls, cg_topo)[source]¶
Method to make prior neighbor lists
- Parameters:
priors_cls (List[_Prior]) – List of prior classes
cg_topo (Topology)
- Returns:
Dictionary containing all prior neighbor lists
- Return type:
prior_nls
- process(cg_mapping={('ACE', 'C'): ('CA_C', 0, 12), ('ALA', 'C'): ('C_A', 4, 12), ('ALA', 'CA'): ('CA_A', 2, 12), ('ALA', 'CB'): ('CB_A', 3, 12), ('ALA', 'N'): ('N_A', 1, 14), ('NME', 'N'): ('N_N', 5, 14)})[source]¶
Method for processing the raw data. This is where all processing function calls take place. All outputs are stored in the relevant processed files.
- Parameters:
cg_mapping ((optional)) – CG mapping dictionary. The default is AL_CG_MAP.
- property processed_file_names¶
The name of the files in the self.processed_dir folder that must be present in order to skip processing.
- property raw_file_names¶
The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
- static repulsion_nls(name, cg_topo)[source]¶
Method for generating neighbor list for Repulsion prior - all interactions not included in bond or angle interactions
- Parameters:
name (str) – Name of prior class
cg_topo (Topology) – Topology object of CG molecule
- Returns:
Dictionary containing repulsion neighbor list
- Return type:
dict
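The pair selection described here can be sketched without mlcg. The bond list below is read off the CG structure diagram for alanine dipeptide, and the sketch assumes that every two bonds sharing a bead define an angle; the custom bonds and angles built by the real class may differ. The helper names are hypothetical.

```python
from itertools import combinations

N_BEADS = 6
# Bonds read off the CG structure diagram (bead indices)
BONDS = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5)]

def angles_from_bonds(bonds):
    """Assumption: any two bonds sharing a bead form an angle (a, center, b)."""
    neighbors = {}
    for i, j in bonds:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    angles = []
    for center, nbrs in neighbors.items():
        for a, b in combinations(sorted(nbrs), 2):
            angles.append((a, center, b))
    return angles

def repulsion_pairs(n_beads, bonds, angles):
    """All bead pairs that share neither a bond nor an angle."""
    excluded = {tuple(sorted(bond)) for bond in bonds}
    excluded |= {tuple(sorted((a, c))) for a, _, c in angles}
    return [p for p in combinations(range(n_beads), 2) if p not in excluded]

ANGLES = angles_from_bonds(BONDS)
pairs = repulsion_pairs(N_BEADS, BONDS, ANGLES)
```

Under these assumptions, 10 of the 15 possible bead pairs are covered by a bond or an angle, leaving 5 pairs for the repulsion term.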
- save_dataset(pickle_name)[source]¶
Method for saving dataset given pickle name
- Parameters:
pickle_name (str) – Name of pickle to store the dataset in
- temperature = 300¶
Temperature used to generate the underlying all-atom data in [K]
Chignolin Dataset¶
Dataset of 3744 individual all-atom trajectories simulated with [ACEMD] using the adaptive sampling strategy described in [AdaptiveStrategy]. All trajectories were simulated at 350K with [CHARM22Star] in a cubic box of 40 angstroms³ with 1881 TIP3P water molecules and two Na⁺ ions, using a Langevin integrator with an integration timestep of 4 fs, a damping constant of 0.1 ps⁻¹, heavy-hydrogen constraints, a PME cutoff of 9 angstroms, and a PME mesh grid of 1 angstrom. The total aggregate simulation time is 187.2 µs.
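As a quick sanity check on these figures, the aggregate time can be converted into an average trajectory length and step count. The description does not say that all trajectories have the same length, so the results are means only.

```python
N_TRAJECTORIES = 3744
AGGREGATE_US = 187.2   # total simulation time in microseconds
TIMESTEP_FS = 4        # integration timestep in femtoseconds

# Mean trajectory length in nanoseconds (1 us = 1000 ns)
mean_length_ns = AGGREGATE_US * 1000 / N_TRAJECTORIES

# Mean number of integration steps per trajectory (1 ns = 1e6 fs)
mean_steps = mean_length_ns * 1e6 / TIMESTEP_FS
```

This gives about 50 ns, or roughly 12.5 million integration steps, per trajectory on average.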
- class mlcg.datasets.chignolin.ChignolinDataset(root, transform=None, pre_transform=None, pre_filter=None, mapping='slice_aggregate', terminal_embeds=False, max_num_files=None)[source]¶
Dataset for training a CG model of the chignolin protein following a Cα CG mapping using the all-atom data from [CGnet].
This Dataset produces delta forces for model training, in which the CG prior forces (harmonic bonds and angles and a repulsive term) have been subtracted from the full CG forces.
- kB = 0.0019872041¶
Boltzmann constant in kcal/mol/K
- property processed_file_names¶
The name of the files in the self.processed_dir folder that must be present in order to skip processing.
- property raw_file_names¶
The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
- temperature = 350¶
Temperature used to generate the underlying all-atom data in [K]