1) Loading and processing all-atom simulation data

Commands:

mlcg-tk-gen_input_data process_raw_dataset --config configuration_files/trpcage.yaml

mlcg-tk-gen_input_data build_neighborlists --config configuration_files/trpcage.yaml --config configuration_files/trpcage_priors.yaml

This procedure loops over all of the sample names specified by the names option. For each sample, it loads the atomistic coordinates, forces, and structures and maps them to a lower resolution specified in the input file (this allows various resolutions and CG embeddings to be used). Then, using the PriorBuilders listed in prior_builders, the script generates a neighbor list for each molecule, provided that the prior builders are implemented in prior_gen.py and their specific neighbor list builders are implemented in mlcg_tk.input_generator.prior_nls.

Note

Keep in mind that the priors are assumed to be in [kcal/mol] at the fitting stage, so raw forces should be converted to [kcal/mol/angstrom].
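
As a minimal sketch of such a conversion, assume the raw forces were saved in kJ/mol/nm. This is only an assumption for illustration, so check which units your simulation engine actually writes. The helper name below is hypothetical and not part of mlcg-tk.

```python
# Hypothetical helper, not part of mlcg-tk.
# Assumes the raw forces are in kJ/mol/nm; adapt the factors to your engine's units.
KJ_PER_KCAL = 4.184       # 1 kcal = 4.184 kJ (thermochemical calorie)
ANGSTROM_PER_NM = 10.0    # 1 nm = 10 angstrom


def kj_mol_nm_to_kcal_mol_angstrom(force):
    """Convert one force component from kJ/mol/nm to kcal/mol/angstrom."""
    return force / KJ_PER_KCAL / ANGSTROM_PER_NM


# One frame with two atoms, forces given as (x, y, z) components in kJ/mol/nm.
raw_forces = [[41.84, 0.0, -83.68], [418.4, 4.184, 0.0]]
converted = [
    [kj_mol_nm_to_kcal_mol_angstrom(f) for f in atom] for atom in raw_forces
]
```

The same factor applies to every component, so for array data it can be applied as a single scalar multiplication.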

Attention

When dealing with large molecules, memory problems can arise from handling the multi-dimensional time series. We provide several fixes for these problems, which are described in the following subsections.

Matrix multiplication batching for large molecules

If your program is killed after the all-atom data has loaded successfully (the tqdm bar finishes) but before process_raw_dataset saves the CG output, try setting batch_size in your trpcage.yaml file. This batches the matrix multiplication that maps the atomistic coordinates/forces to the CG resolution, which is the most memory-consuming part of the coarse-graining at this stage.
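
The idea behind batching a linear mapping can be sketched in plain Python as follows. This is an illustration only and not mlcg-tk's implementation: the function names are hypothetical, and it assumes that batches are taken along the frame axis.

```python
# Illustrative sketch only; mlcg-tk's internal implementation may differ.
# Assumption: batches are taken along the frame axis.

def map_frame(cg_map, frame):
    """Apply a linear CG map (n_cg x n_atoms) to one frame (n_atoms x 3)."""
    return [
        [sum(w * atom[d] for w, atom in zip(row, frame)) for d in range(3)]
        for row in cg_map
    ]


def map_in_batches(cg_map, frames, batch_size):
    """Map the frames in consecutive slices of at most batch_size frames."""
    mapped = []
    for start in range(0, len(frames), batch_size):
        batch = frames[start:start + batch_size]
        mapped.extend(map_frame(cg_map, frame) for frame in batch)
    return mapped


# Toy system: 3 atoms mapped to 2 beads
# (bead 0 averages atoms 0 and 1, bead 1 keeps atom 2).
cg_map = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
frames = [
    [[float(t), 0.0, 0.0], [float(t) + 2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    for t in range(5)
]

batched = map_in_batches(cg_map, frames, batch_size=2)
unbatched = map_in_batches(cg_map, frames, batch_size=len(frames))
```

Because the map is applied frame by frame, the batched and unbatched results are identical; only the amount of data handled at once changes.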

Batch processing for large molecules

If the dataset loads into memory successfully (the tqdm bar completes), but the program fails before saving the CG output, consider setting atoms_batch_size in your trpcage.yaml file.

This optional parameter specifies the batch size for processing atoms in large molecules. When set, constraints among atoms for coordinate and force mappings will be computed in batches of this size to reduce memory usage. To improve computational efficiency, it is assumed that the molecular structures have ordered residues.

If atoms_batch_size is larger than the total number of atoms in the molecule, all atoms will be processed at once (the default behavior).
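
The fallback described above can be pictured with a small sketch that splits the atom indices into consecutive ranges. The function name is hypothetical, and this is not how mlcg-tk necessarily implements the batching.

```python
# Illustrative sketch only; the function name is hypothetical and not part of mlcg-tk.

def atom_batches(n_atoms, atoms_batch_size=None):
    """Return (start, stop) index ranges covering all atoms in order."""
    if atoms_batch_size is None or atoms_batch_size >= n_atoms:
        # No batching requested, or one batch would cover the whole molecule.
        return [(0, n_atoms)]
    return [
        (start, min(start + atoms_batch_size, n_atoms))
        for start in range(0, n_atoms, atoms_batch_size)
    ]


small = atom_batches(n_atoms=10, atoms_batch_size=4)   # three ranges
whole = atom_batches(n_atoms=10, atoms_batch_size=50)  # a single range
```

Contiguous index ranges only keep related atoms together if residues are stored in order, which is consistent with the ordering assumption stated above.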

Batch processing for large datasets

If your dataset is too big to be loaded into memory at once (the tqdm bar doesn’t finish before the program fails), you can set mol_num_batches in your trpcage.yaml file as well as in your trpcage_stats.yaml, trpcage_delta_forces.yaml and trpcage_packaging.yaml files. This splits the trajectories in your dataset into mol_num_batches chunks that are treated as separate molecules for the coarse-graining and statistics computing stages (see step 2). The statistics of the different batches are accumulated automatically, so only one prior object is produced in the end.
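
Why per-batch statistics can be merged into a single result can be sketched with histogram counts, which simply add up across chunks. The helper names and the contiguous splitting scheme below are assumptions for illustration and are not taken from mlcg-tk.

```python
# Illustrative sketch only; helper names are hypothetical and not part of mlcg-tk.

def split_into_batches(items, num_batches):
    """Split a list into num_batches contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), num_batches)
    chunks, start = [], 0
    for i in range(num_batches):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def histogram(values, bin_edges):
    """Count values falling in each half-open bin [edge_i, edge_{i+1})."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Toy "trajectories": each is a list of bond lengths (arbitrary values).
trajectories = [[3.7, 3.8, 3.9], [3.8, 3.8], [3.6, 3.9, 4.0], [3.7], [3.85, 3.95]]
bin_edges = [3.5, 3.75, 4.0, 4.25]

# Accumulate counts chunk by chunk, holding only one chunk's values at a time.
total = [0] * (len(bin_edges) - 1)
for chunk in split_into_batches(trajectories, num_batches=3):
    values = [v for traj in chunk for v in traj]
    total = [a + b for a, b in zip(total, histogram(values, bin_edges))]

# Reference: the histogram computed over the whole dataset at once.
full = histogram([v for traj in trajectories for v in traj], bin_edges)
```

In this sketch `total` and `full` hold the same counts, which is the property that lets batch-wise statistics yield a single prior fit.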

Note

When mol_num_batches is set, the force map is computed only on the first batch and reused for all subsequent batches, which ensures consistency when optimized force maps are used.