gen_input_data.py
- mlcg_tk.scripts.gen_input_data.build_neighborlists(dataset_name, names, sample_loader, tag, pdb_template_fn, save_dir, cg_atoms, embedding_map, embedding_func, skip_residues, prior_tag, prior_builders, raw_data_dir=None, cg_mapping_strategy=None, stride=1, force_stride=100, filter_cis=False, batch_size=None, mol_num_batches=1, atoms_batch_size=None, collection_cls=<class 'mlcg_tk.input_generator.raw_dataset.SampleCollection'>)
Generates neighbour lists for all samples in the dataset using prior term information
- Parameters:
dataset_name (str) – Name given to specific dataset
names (List[str]) – List of sample names
sample_loader (DatasetLoader) – Loader object defined for specific dataset
tag (str) – Label given to all output files produced from dataset
pdb_template_fn (str) – Template file location of atomistic structure to be used for topology
save_dir (str) – Path to directory in which output will be saved
cg_atoms (List[str]) – List of atom names to preserve in coarse-grained resolution
embedding_map (CGEmbeddingMap) – Mapping object
embedding_func (Callable) – Function which will be used to apply CG mapping
skip_residues (List[str]) – List of residues to skip, can be None
prior_tag (str) – String identifying the specific combination of prior terms
prior_builders (List[PriorBuilder]) – List of PriorBuilder objects and their corresponding parameters
stride (int) – Unused in this function; present so that the same .yaml config can be used for both process_raw_dataset and build_neighborlists
force_stride (int) – Unused in this function; present so that the same .yaml config can be used for both process_raw_dataset and build_neighborlists
filter_cis (bool) – Unused in this function; present so that the same .yaml config can be used for both process_raw_dataset and build_neighborlists
batch_size (int) – Unused in this function; present so that the same .yaml config can be used for both process_raw_dataset and build_neighborlists
mol_num_batches (int) – Unused in this function; present so that the same .yaml config can be used for both process_raw_dataset and build_neighborlists
atoms_batch_size (int) – Unused in this function; present so that the same .yaml config can be used for both process_raw_dataset and build_neighborlists
collection_cls (Type[SampleCollection]) – Class type for sample collection
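The reason several parameters are accepted but ignored can be illustrated with a small sketch. The two functions below are hypothetical stand-ins that mirror only part of the documented signatures (they are not the mlcg_tk implementations), and the config values are made up for illustration. How the real scripts parse the .yaml file is not shown here.

```python
# Hypothetical stand-ins mirroring a subset of the documented signatures.
# The real functions live in mlcg_tk.scripts.gen_input_data.

def process_raw_dataset(dataset_name, names, tag, save_dir, stride=1,
                        force_stride=100, filter_cis=False, batch_size=None,
                        mol_num_batches=1, atoms_batch_size=None):
    # Stand-in: report some of the options this step would consume.
    return {"step": "process_raw_dataset", "stride": stride,
            "filter_cis": filter_cis}


def build_neighborlists(dataset_name, names, tag, save_dir, prior_tag=None,
                        stride=1, force_stride=100, filter_cis=False,
                        batch_size=None, mol_num_batches=1,
                        atoms_batch_size=None):
    # Stand-in: stride, force_stride, filter_cis, batch_size, etc. are
    # accepted so the call does not fail, but they are ignored.
    return {"step": "build_neighborlists", "prior_tag": prior_tag}


# One shared set of options, as it might be loaded from a single .yaml file.
# All values here are illustrative assumptions.
shared_config = {
    "dataset_name": "example_dataset",
    "names": ["sample_a", "sample_b"],
    "tag": "example_tag",
    "save_dir": "./output",
    "stride": 10,
    "filter_cis": True,
}

# Both calls accept the same keyword set without raising a TypeError.
processed = process_raw_dataset(**shared_config)
neighbors = build_neighborlists(prior_tag="example_prior", **shared_config)
```

Without the pass-through parameters, the second call would raise a TypeError for the unexpected stride and filter_cis keywords, and a separate config would be needed for each step.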
- mlcg_tk.scripts.gen_input_data.process_raw_dataset(dataset_name, names, sample_loader, raw_data_dir, tag, pdb_template_fn, save_dir, cg_atoms, embedding_map, embedding_func, skip_residues, cg_mapping_strategy, stride=1, force_stride=100, filter_cis=False, batch_size=None, mol_num_batches=1, atoms_batch_size=None, collection_cls=<class 'mlcg_tk.input_generator.raw_dataset.SampleCollection'>)
Applies coarse-grained mapping to coordinates and forces using input sample topology and specified mapping strategies
- Parameters:
dataset_name (str) – Name given to specific dataset
names (List[str]) – List of sample names
sample_loader (DatasetLoader) – Loader object defined for specific dataset
raw_data_dir (str) – Path to coordinate and force files
tag (str) – Label given to all output files produced from dataset
pdb_template_fn (str) – Template file location of atomistic structure to be used for topology
save_dir (str) – Path to directory in which output will be saved
cg_atoms (List[str]) – List of atom names to preserve in coarse-grained resolution
embedding_map (CGEmbeddingMap) – Mapping object
embedding_func (Callable) – Function which will be used to apply CG mapping
skip_residues (List[str]) – List of residues to skip, can be None
cg_mapping_strategy (str) – Strategy to use for coordinate and force mappings; currently only “slice_aggregate” and “slice_optimize” are implemented
stride (int) – Interval by which to stride loaded data
force_stride (int) – Stride for inferring the force maps in aggforce
filter_cis (bool) – If True, frames with cis configurations are filtered out of the dataset
batch_size (int) – Optional batch size for mapping AA data to CG in batches, to avoid memory overhead on large AA datasets
mol_num_batches (int) – If greater than 1, each molecule's data is saved in the specified number of batches, which are treated as different samples
atoms_batch_size (int, optional) – Optional batch size for processing atoms in large molecules (default: None). If specified, constraints among atoms for coordinate and force mappings (as defined by cg_mapping_strategy) will be computed in batches of this size. To significantly improve computational efficiency, it is assumed that structures have ordered residues. If atoms_batch_size exceeds the total number of atoms in the molecule, all atoms will be processed at once (default behavior).
collection_cls (Type[SampleCollection]) – Class type for sample collection
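The striding and batching options above can be pictured with a minimal pure-Python sketch. The helper names (stride_frames, split_into_batches, atom_chunks) are hypothetical and do not exist in mlcg_tk. The contiguous, near-equal split used for mol_num_batches is an assumption about how batches might be formed, not the library's confirmed behavior. Only the rule that an atoms_batch_size at or above the atom count means a single pass is taken from the parameter description.

```python
def stride_frames(frames, stride=1):
    """Keep every `stride`-th frame, starting from the first."""
    return frames[::stride]


def split_into_batches(items, num_batches=1):
    """Split `items` into `num_batches` contiguous, near-equal parts.

    Assumption: contiguous splitting is used here only for illustration.
    """
    base, extra = divmod(len(items), num_batches)
    batches, start = [], 0
    for i in range(num_batches):
        # The first `extra` batches take one additional item.
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


def atom_chunks(n_atoms, atoms_batch_size=None):
    """Return (start, stop) index ranges that cover all atoms."""
    # No batch size, or a batch size covering every atom: one single pass.
    if atoms_batch_size is None or atoms_batch_size >= n_atoms:
        return [(0, n_atoms)]
    return [(start, min(start + atoms_batch_size, n_atoms))
            for start in range(0, n_atoms, atoms_batch_size)]


frames = list(range(20))                    # 20 illustrative frame indices
strided = stride_frames(frames, stride=5)   # frames 0, 5, 10, 15
batches = split_into_batches(strided, num_batches=2)
chunks = atom_chunks(10, atoms_batch_size=4)
whole = atom_chunks(10, atoms_batch_size=50)
```

In this sketch, each entry of batches would be saved and treated as a separate sample, while chunks only controls how many atoms are handled at once within a single molecule.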