gen_input_data.py
- mlcg_tk.scripts.gen_input_data.build_neighborlists(dataset_name, names, sample_loader, tag, pdb_template_fn, save_dir, cg_atoms, embedding_map, embedding_func, skip_residues, prior_tag, prior_builders, raw_data_dir=None, cg_mapping_strategy=None, stride=1, force_stride=100, filter_cis=False, batch_size=None, mol_num_batches=1, atoms_batch_size=None, collection_cls=<class 'mlcg_tk.input_generator.raw_dataset.SampleCollection'>)
Generates neighbour lists for all samples in dataset using prior term information
- Parameters:
  - dataset_name (str) – Name given to the specific dataset
  - names (List[str]) – List of sample names
  - sample_loader (DatasetLoader) – Loader object defined for the specific dataset
  - tag (str) – Label given to all output files produced from the dataset
  - pdb_template_fn (str) – Location of the atomistic structure template file used for the topology
  - save_dir (str) – Path to the directory in which output will be saved
  - cg_atoms (List[str]) – List of atom names to preserve in the coarse-grained resolution
  - embedding_map (CGEmbeddingMap) – Mapping object
  - embedding_func (Callable) – Function used to apply the CG mapping
  - skip_residues (List[str]) – List of residues to skip; can be None
  - prior_tag (str) – String identifying the specific combination of prior terms
  - prior_builders (List[PriorBuilder]) – List of PriorBuilder objects and their corresponding parameters
  - stride (int) – Unused in this function; present so that the same .yaml config can be used for process_raw_dataset and build_neighborlists
  - force_stride (int) – Unused in this function; present so that the same .yaml config can be used for process_raw_dataset and build_neighborlists
  - filter_cis (bool) – Unused in this function; present so that the same .yaml config can be used for process_raw_dataset and build_neighborlists
  - batch_size (Optional[int]) – Unused in this function; present so that the same .yaml config can be used for process_raw_dataset and build_neighborlists
  - mol_num_batches (Optional[int]) – Unused in this function; present so that the same .yaml config can be used for process_raw_dataset and build_neighborlists
  - atoms_batch_size (Optional[int]) – Unused in this function; present so that the same .yaml config can be used for process_raw_dataset and build_neighborlists
  - collection_cls (Type[SampleCollection]) – Class type for the sample collection
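The pass-through parameters above exist so that one configuration can drive both entry points. The pattern can be sketched roughly as follows. The two functions here are hypothetical stand-ins with shortened signatures, not the real mlcg_tk functions, and the config keys and values are illustrative assumptions.

```python
from typing import Any, Dict, List, Optional


def process_raw_dataset_stub(
    dataset_name: str,
    names: List[str],
    tag: str,
    stride: int = 1,
    batch_size: Optional[int] = None,
) -> str:
    """Hypothetical stand-in for the mapping step; it uses `stride`."""
    return f"{dataset_name}/{tag}: mapped {len(names)} samples with stride {stride}"


def build_neighborlists_stub(
    dataset_name: str,
    names: List[str],
    tag: str,
    prior_tag: str,
    stride: int = 1,                   # accepted but ignored
    batch_size: Optional[int] = None,  # accepted but ignored
) -> str:
    """Hypothetical stand-in for the neighbour-list step."""
    return f"{dataset_name}/{tag}: neighbour lists for {len(names)} samples ({prior_tag})"


# One shared set of keyword arguments, as might be loaded from a .yaml file.
shared_config: Dict[str, Any] = {
    "dataset_name": "demo",
    "names": ["sample_a", "sample_b"],
    "tag": "v1",
    "stride": 10,
    "batch_size": None,
}

# The same config feeds both steps; the second step only adds prior settings.
mapped = process_raw_dataset_stub(**shared_config)
neighborlists = build_neighborlists_stub(**shared_config, prior_tag="bonds_angles")

print(mapped)
print(neighborlists)
```

Because the second function accepts the keys it does not need, the caller never has to filter the config before passing it on.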
- mlcg_tk.scripts.gen_input_data.process_raw_dataset(dataset_name, names, sample_loader, raw_data_dir, tag, pdb_template_fn, save_dir, cg_atoms, embedding_map, embedding_func, skip_residues, cg_mapping_strategy, stride=1, force_stride=100, filter_cis=False, batch_size=None, mol_num_batches=1, atoms_batch_size=None, collection_cls=<class 'mlcg_tk.input_generator.raw_dataset.SampleCollection'>)
Applies coarse-grained mapping to coordinates and forces using input sample topology and specified mapping strategies
- Parameters:
  - dataset_name (str) – Name given to the specific dataset
  - names (List[str]) – List of sample names
  - sample_loader (DatasetLoader) – Loader object defined for the specific dataset
  - raw_data_dir (str) – Path to the coordinate and force files
  - tag (str) – Label given to all output files produced from the dataset
  - pdb_template_fn (str) – Location of the atomistic structure template file used for the topology
  - save_dir (str) – Path to the directory in which output will be saved
  - cg_atoms (List[str]) – List of atom names to preserve in the coarse-grained resolution
  - embedding_map (CGEmbeddingMap) – Mapping object
  - embedding_func (Callable) – Function used to apply the CG mapping
  - skip_residues (List[str]) – List of residues to skip; can be None
  - cg_mapping_strategy (str) – Strategy to use for the coordinate and force mappings; currently only “slice_aggregate” and “slice_optimize” are implemented
  - stride (int) – Interval by which to stride the loaded data
  - force_stride (int) – Stride used when inferring the force maps in aggforce
  - filter_cis (Optional[bool]) – If True, frames with cis-configurations are filtered out of the dataset
  - batch_size (Optional[int]) – Optional batch size for mapping AA data to CG in batches, to avoid memory overhead on large AA datasets
  - mol_num_batches (Optional[int]) – If greater than 1, each molecule's data is saved in the specified number of batches, which are then treated as different samples
  - atoms_batch_size (Optional[int]) – Optional batch size for processing atoms in large molecules (default: None). If specified, constraints among atoms for the coordinate and force mappings (as defined by cg_mapping_strategy) are computed in batches of this size. To significantly improve computational efficiency, structures are assumed to have ordered residues. If atoms_batch_size exceeds the total number of atoms in the molecule, all atoms are processed at once (default behavior).
  - collection_cls (Type[SampleCollection]) – Class type for the sample collection
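To make the striding and batching parameters concrete, the following is a minimal pure-Python sketch of what stride, mol_num_batches and atoms_batch_size could mean for index bookkeeping. It is not the mlcg_tk implementation: the helper names are hypothetical, and the choice of contiguous, near-equal chunks is an assumption made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def stride_frames(frames: Sequence[int], stride: int = 1) -> List[int]:
    """Keep every `stride`-th frame, starting from the first."""
    return list(frames[::stride])


def split_into_batches(frames: Sequence[int], num_batches: int = 1) -> List[List[int]]:
    """Split frames into `num_batches` contiguous, near-equal chunks (assumed scheme)."""
    base, extra = divmod(len(frames), num_batches)
    batches: List[List[int]] = []
    start = 0
    for i in range(num_batches):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        batches.append(list(frames[start:start + size]))
        start += size
    return batches


def atom_batches(n_atoms: int, atoms_batch_size: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return (start, stop) index ranges that together cover all atoms."""
    if atoms_batch_size is None or atoms_batch_size >= n_atoms:
        # No batch size, or one larger than the molecule: a single batch.
        return [(0, n_atoms)]
    return [
        (start, min(start + atoms_batch_size, n_atoms))
        for start in range(0, n_atoms, atoms_batch_size)
    ]


frames = list(range(100))                   # 100 frame indices
strided = stride_frames(frames, stride=10)  # 10 frames remain
batches = split_into_batches(strided, num_batches=3)

print([len(b) for b in batches])            # [4, 3, 3]
print(atom_batches(10, atoms_batch_size=4))   # [(0, 4), (4, 8), (8, 10)]
print(atom_batches(10, atoms_batch_size=50))  # [(0, 10)]
```

The last call mirrors the documented fallback: a batch size larger than the molecule yields a single batch containing every atom.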