raw_dataset module

class mlcg_tk.input_generator.raw_dataset.CGDataBatch(cg_coords, cg_forces, cg_embeds, cg_prior_nls, batch_size, stride, weights=None, concat_forces=False)[source]

Bases: object

Splits input CG data into batches for further memory-efficient processing

cg_coords

Coarse grained coordinates

cg_forces

Coarse grained forces

cg_embeds

Atom embeddings

cg_prior_nls

Dictionary of prior neighbour list

batch_size

Number of frames to use in each batch

stride

Integer by which to stride frames

concat_forces

Boolean indicating whether forces should be added to batch
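
How batch_size and stride interact can be pictured with a minimal sketch. It is not the CGDataBatch implementation: the helper name frame_batches is hypothetical, and it assumes that frames are strided first and then grouped, with a possibly shorter final batch.

```python
from typing import Iterator, List


def frame_batches(n_frames: int, batch_size: int, stride: int) -> Iterator[List[int]]:
    """Yield lists of frame indices, one list per batch.

    Assumption (not taken from mlcg_tk): frames are strided first, then the
    strided frames are grouped into chunks of at most `batch_size`.
    """
    if batch_size < 1 or stride < 1:
        raise ValueError("batch_size and stride must be positive integers")
    strided = list(range(0, n_frames, stride))
    for start in range(0, len(strided), batch_size):
        yield strided[start:start + batch_size]


batches = list(frame_batches(n_frames=10, batch_size=2, stride=3))
print(batches)
```

Yielding index lists rather than array slices keeps the sketch free of any array library; the same indices could be used to slice coordinates, forces, and weights consistently.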

class mlcg_tk.input_generator.raw_dataset.RawDataset(dataset_name, names, tag, n_batches=1, collection_cls=<class 'mlcg_tk.input_generator.raw_dataset.SampleCollection'>)[source]

Bases: object

Generates a list of data samples for a specified dataset

dataset_name

Name given to dataset

names

List of sample names

tag

Label given to all output files produced from dataset

dataset

List of SampleCollection objects for all samples in dataset

class mlcg_tk.input_generator.raw_dataset.SampleCollection(name, tag, n_batches=1)[source]

Bases: object

Input generation object for loading, manipulating, and saving training data samples.

name

String associated with atomistic trajectory output.

tag

String to identify dataset in output files.

pdb_fn

File location of atomistic structure to be used for topology.

add_terminal_embeddings(N_term='N', C_term='C')[source]

Adds separate embeddings for the terminal atoms (these do not need to be defined in the original embedding_dict).

Parameters:
  • N_term (Optional[str]) – Atom of N-terminus to which N_term embedding will be assigned.

  • C_term (Optional[str]) – Atom of C-terminus to which C_term embedding will be assigned.

Either or both of N_term and C_term can be None; in this case only one (or no) terminal embedding(s) will be assigned.
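
The effect of the two arguments, including the None case, can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: the flat bead tuples, the helper name add_terminal_types, and the rule that new embedding types are numbered after the largest existing one are not taken from mlcg_tk.

```python
from typing import List, Optional, Tuple

# Each CG bead is (residue_index, atom_name, embedding_type); this flat
# layout is a hypothetical stand-in, not the mlcg_tk data structure.
Bead = Tuple[int, str, int]


def add_terminal_types(
    beads: List[Bead],
    n_term: Optional[str] = "N",
    c_term: Optional[str] = "C",
) -> List[Bead]:
    """Give the chosen terminal atoms embedding types of their own.

    Assumption: new types are numbered after the largest existing type.
    Passing None for either argument leaves that terminus unchanged.
    """
    if not beads:
        return []
    first_res = min(res for res, _, _ in beads)
    last_res = max(res for res, _, _ in beads)
    next_type = max(emb for _, _, emb in beads) + 1

    n_type = c_type = None
    if n_term is not None:
        n_type, next_type = next_type, next_type + 1
    if c_term is not None:
        c_type = next_type

    out = []
    for res, atom, emb in beads:
        if n_type is not None and res == first_res and atom == n_term:
            emb = n_type
        elif c_type is not None and res == last_res and atom == c_term:
            emb = c_type
        out.append((res, atom, emb))
    return out


beads = [(0, "N", 1), (0, "CA", 2), (0, "C", 3),
         (1, "N", 1), (1, "CA", 2), (1, "C", 3)]
print(add_terminal_types(beads))
print(add_terminal_types(beads, c_term=None))
```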

apply_cg_mapping(cg_atoms, embedding_function, embedding_dict, skip_residues=None)[source]

Applies mapping function to atomistic topology to obtain CG representation.

Parameters:
  • cg_atoms (List[str]) – List of atom names to preserve in CG representation.

  • embedding_function (str) – Name of function (should be defined in embedding_maps) to apply CG mapping.

  • embedding_dict (str) – Name of dictionary (should be defined in embedding_maps) to define embeddings of CG beads.

  • skip_residues (Optional) – List of residue names to skip (can be used to skip terminal caps, for example). Currently, can only be used to skip all residues with given name.
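
As a rough picture of what the parameters above control, the sketch below selects CG beads from a toy atom list. It is not the mlcg_tk code path: the atom tuples, toy_embedding, and select_cg_beads are hypothetical, and the function and dictionary are passed as objects here, whereas apply_cg_mapping takes their names as strings.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Atom = Tuple[str, str]  # (residue_name, atom_name); hypothetical layout


def toy_embedding(atom: Atom, embedding_dict: Dict[str, int]) -> int:
    """Hypothetical embedding function: CA beads are typed by residue
    name, all other beads by atom name."""
    res_name, atom_name = atom
    key = res_name if atom_name == "CA" else atom_name
    return embedding_dict[key]


def select_cg_beads(
    atoms: Sequence[Atom],
    cg_atoms: Sequence[str],
    embedding_function: Callable[[Atom, Dict[str, int]], int],
    embedding_dict: Dict[str, int],
    skip_residues: Optional[Sequence[str]] = None,
) -> List[Tuple[int, int]]:
    """Return (atom_index, embedding_type) for every retained bead."""
    skip = set(skip_residues or [])
    kept = []
    for index, (res_name, atom_name) in enumerate(atoms):
        # Skipping acts on the residue name, so every residue with that
        # name is dropped.
        if res_name in skip or atom_name not in cg_atoms:
            continue
        kept.append((index, embedding_function((res_name, atom_name), embedding_dict)))
    return kept


atoms = [("ACE", "C"), ("ALA", "N"), ("ALA", "CA"), ("ALA", "HA"),
         ("ALA", "C"), ("GLY", "N"), ("GLY", "CA"), ("GLY", "C")]
embedding_dict = {"ALA": 1, "GLY": 2, "N": 3, "C": 4}
cg_beads = select_cg_beads(atoms, ["N", "CA", "C"], toy_embedding,
                           embedding_dict, skip_residues=["ACE"])
print(cg_beads)
```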

get_prior_nls(prior_builders, save_nls=True, **kwargs)[source]

Creates neighbourlists for all prior terms specified in prior_builders.

Parameters:
  • prior_builders (List[PriorBuilder]) –

    List of PriorBuilder objects and their corresponding parameters. Input config file must minimally contain the following information for each builder:

    class_path: class specifying PriorBuilder object implemented in prior_gen.py
    init_args:
      name: string specifying type as one of ‘bonds’, ‘angles’, ‘dihedrals’, ‘non_bonded’
      nl_builder: name of class implemented in prior_nls.py which will be used to collect atom groups associated with the prior term.

  • save_nls (bool) – If true, will save an output of the molecule’s neighbourlist.

  • kwargs

    save_dir:

    If save_nls = True, the neighbourlist will be saved to this directory.

    prior_tag:

    String identifying the specific combination of prior terms.

Return type:

Dictionary of prior terms with specific index mapping for the given molecule.

Example

To build neighbour lists for a system with priors for bonds, angles, nonbonded pairs, and phi and psi dihedral angles:

  • class_path: input_generator.Bonds
    init_args:
      name: bonds
      separate_termini: true
      nl_builder: input_generator.StandardBonds

  • class_path: input_generator.Angles
    init_args:
      name: angles
      separate_termini: true
      nl_builder: input_generator.StandardAngles

  • class_path: input_generator.NonBonded
    init_args:
      name: non_bonded
      min_pair: 6
      res_exclusion: 1
      separate_termini: false
      nl_builder: input_generator.Non_Bonded

  • class_path: input_generator.Dihedrals
    init_args:
      name: phi
      nl_builder: input_generator.Phi

  • class_path: input_generator.Dihedrals
    init_args:
      name: psi
      nl_builder: input_generator.Psi
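
The minimal keys named for each builder (class_path, plus name and nl_builder under init_args) can be checked before a configuration is used. The following sketch does so on plain dictionaries mirroring the example; missing_keys is a hypothetical helper and not part of mlcg_tk.

```python
from typing import Any, Dict, List

REQUIRED_INIT_ARGS = ("name", "nl_builder")


def missing_keys(builder_cfg: Dict[str, Any]) -> List[str]:
    """Return the minimally required keys that are absent from one entry."""
    missing = [] if "class_path" in builder_cfg else ["class_path"]
    init_args = builder_cfg.get("init_args")
    if not isinstance(init_args, dict):
        return missing + ["init_args"]
    missing += [f"init_args.{k}" for k in REQUIRED_INIT_ARGS if k not in init_args]
    return missing


prior_builders = [
    {"class_path": "input_generator.Bonds",
     "init_args": {"name": "bonds", "separate_termini": True,
                   "nl_builder": "input_generator.StandardBonds"}},
    {"class_path": "input_generator.Dihedrals",
     "init_args": {"name": "phi"}},  # nl_builder deliberately omitted
]

for cfg in prior_builders:
    print(cfg.get("class_path"), missing_keys(cfg))
```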

has_delta_forces_output(training_data_dir, force_tag='', mol_num_batches=1)[source]

Returns True if cg data exists for this SampleCollection

Used to skip processing of molecules where all frames have been removed by cis conformation filtering

Parameters:
  • training_data_dir (str) – Location of saved cg data

  • force_tag (str) – String identifying the produced delta forces

  • mol_num_batches (int) – Number of batches in which the molecule is supposed to be saved

Return type:

bool

Returns:

  • True if cg output for the sample corresponding to force_tag is present in training_data_dir

  • False otherwise

has_saved_cg_output(save_dir, prior_tag='')[source]

Returns True if cg data exists for this SampleCollection

Used to skip processing of molecules where all frames have been removed by cis conformation filtering

Parameters:
  • save_dir (str) – Location of saved cg data

  • prior_tag (str) – String identifying the specific combination of prior terms

Return type:

bool

Returns:

  • True if cg output for the sample corresponding to prior_tag is present in save_dir

  • False otherwise
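
The skip pattern this check enables can be sketched without the library. The file suffixes below are hypothetical (the names mlcg_tk actually writes may differ), and has_all_files is a stand-in for the method, not its implementation.

```python
import os
import tempfile
from typing import Iterable, List


def has_all_files(save_dir: str, prefix: str, suffixes: Iterable[str]) -> bool:
    """True only if every expected output file is present in save_dir."""
    return all(os.path.isfile(os.path.join(save_dir, prefix + s)) for s in suffixes)


# Hypothetical file suffixes, chosen only for this illustration.
SUFFIXES = ["_cg_coords.npy", "_cg_forces.npy"]

with tempfile.TemporaryDirectory() as save_dir:
    # Pretend only sample "mol_a" survived filtering and was saved.
    for suffix in SUFFIXES:
        open(os.path.join(save_dir, "mol_a" + suffix), "w").close()

    # Samples without saved output are skipped in later processing.
    to_process: List[str] = [
        name for name in ["mol_a", "mol_b"]
        if has_all_files(save_dir, name, SUFFIXES)
    ]

print(to_process)
```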

load_all_batches_training_inputs(training_data_dir, force_tag='', mol_num_batches=1, stride=1)[source]

load_cg_force_map(save_dir)[source]

Helper function to load a previously saved force map for the molecule in the sample

Parameters:
  • save_dir (str) – Path to the directory where the force map was saved in the first batch of the molecule in the sample

Return type:

ndarray

Returns:

  • force_map (np.ndarray) – Force map corresponding to the molecule of this SampleCollection

load_cg_output(save_dir, prior_tag='')[source]

Loads all cg data produced by save_cg_output and get_prior_nls

Parameters:
  • save_dir (str) – Location of saved cg data

  • prior_tag (str) – String identifying the specific combination of prior terms

Return type:

Tuple

Returns:

  • Tuple of np.ndarrays containing coarse grained coordinates, forces, embeddings, structure, and prior neighbour list

load_cg_output_into_batches(save_dir, prior_tag, batch_size, stride, weights_template_fn)[source]

Loads saved CG data and splits these into batches for further processing

Parameters:
  • save_dir (str) – Location of saved cg data

  • prior_tag (str) – String identifying the specific combination of prior terms

  • batch_size (int) – Number of frames to use in each batch

  • stride (int) – Integer by which to stride frames

Return type:

Loaded CG data split into list of batches

load_training_inputs(training_data_dir, force_tag='', stride=1)[source]

Loads all cg data produced by save_cg_output and get_prior_nls

Parameters:
  • training_data_dir (str) – Location of saved cg data including delta forces

  • force_tag (str) – String identifying the produced delta forces

Return type:

Tuple of np.ndarrays containing coarse grained coordinates, delta forces, and embeddings

process_coords_forces(coords, forces, topology, mapping='slice_aggregate', filter_cis=False, force_stride=100, batch_size=None, atoms_batch_size=None)[source]

Maps coordinates and forces to CG resolution

Parameters:
  • coords ([n_frames, n_atoms, 3]) – Atomistic coordinates

  • forces ([n_frames, n_atoms, 3]) – Atomistic forces

  • topology (Topology) – mdtraj topology to load atomistic coordinates (used for cis-omega angles filtering)

  • mapping (str) – Mapping scheme to be used, must be either ‘slice_aggregate’ or ‘slice_optimize’.

  • filter_cis (bool) – If True, frames containing a cis-omega angle will be filtered out

  • force_stride (int) – Striding to use for force projection results

  • batch_size (Optional[int]) – Batching the coords and forces projection to CG

  • atoms_batch_size (Optional[int]) – Batch size for processing atoms when inferring constrained atoms

Return type:

Tuple of np.ndarrays for coarse grained coordinates and forces
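
Mapping to CG resolution can be viewed as applying a linear map to each frame, with separate maps for coordinates and forces. The sketch below shows that idea on plain lists. It is a generic illustration: apply_linear_map and both matrices are hypothetical and are not claimed to reproduce what ‘slice_aggregate’ or ‘slice_optimize’ compute.

```python
from typing import List, Sequence

Vec3 = List[float]


def apply_linear_map(
    cg_map: Sequence[Sequence[float]],
    frame: Sequence[Sequence[float]],
) -> List[Vec3]:
    """Return one CG frame: bead i is sum_j cg_map[i][j] * frame[j]."""
    return [
        [sum(w * atom[d] for w, atom in zip(row, frame)) for d in range(3)]
        for row in cg_map
    ]


# Three atoms mapped to two beads (illustrative values only).
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 1.0, 0.0]]
forces = [[0.5, 0.0, 0.0], [-0.25, 1.0, 0.0], [0.0, 0.0, 2.0]]

# Assumed coordinate map: keep atoms 0 and 2 unchanged.
coord_map = [[1, 0, 0], [0, 0, 1]]
# Assumed force map: bead 0 also collects the force on atom 1.
force_map = [[1, 1, 0], [0, 0, 1]]

cg_coords = apply_linear_map(coord_map, coords)
cg_forces = apply_linear_map(force_map, forces)
print(cg_coords)
print(cg_forces)
```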

save_cg_output(save_dir, save_coord_force=True, save_cg_maps=True, cg_coords=None, cg_forces=None)[source]

Saves processed CG data.

Parameters:
  • save_dir (str) – Path of directory to which output will be saved.

  • save_coord_force (bool) – Whether coordinates and forces should also be saved.

  • cg_coords (Optional[ndarray]) – CG coordinates; if None, will check whether these are saved as an object attribute.

  • cg_forces (Optional[ndarray]) – CG forces; if None, will check whether these are saved as an object attribute.

class mlcg_tk.input_generator.raw_dataset.SimInput(dataset_name, tag, pdb_fns, collection_cls=<class 'mlcg_tk.input_generator.raw_dataset.SampleCollection'>)[source]

Bases: object

Generates a list of samples from pdb structures to be used in simulation

dataset_name

Name given to dataset

tag

Label given to all output files produced from dataset

pdb_fns

List of pdb filenames from which samples will be generated

dataset

List of SampleCollection objects for all structures

mlcg_tk.input_generator.raw_dataset.get_strides(n_structure, batch_size)[source]

Helper function to stride batched data