package_training_data.py
- mlcg_tk.scripts.package_training_data.combine_datasets(dataset_names, save_dir, force_tag, save_h5=True, save_partition=True)
Computes structural features and accumulates statistics on dataset samples
- Parameters:
dataset_names (List[str]) – List of dataset names to combine
save_dir (str) – Path to directory from which datasets will be loaded and to which output will be saved
force_tag (str) – Label given to produced delta forces and saved packaged data
save_h5 (bool) – Whether to save dataset h5 file(s)
save_partition (bool) – Whether to save dataset partition file(s)
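A call to combine_datasets might be assembled as in the sketch below. Only the function path and parameter names come from the signature above. The dataset names, directory, and tag are hypothetical placeholders, and the sketch assumes the named datasets were already produced under save_dir. The import sits inside a helper so that the snippet can be read and loaded without mlcg_tk installed.

```python
# Keyword arguments mirroring the documented signature.
# All values are hypothetical placeholders, not defaults of the library.
combine_kwargs = {
    "dataset_names": ["dataset_a", "dataset_b"],  # assumed dataset names
    "save_dir": "./packaged_data",                # assumed directory
    "force_tag": "example_force_tag",             # assumed tag
    "save_h5": True,
    "save_partition": True,
}


def run_combine(kwargs=combine_kwargs):
    """Import combine_datasets lazily and call it with the given arguments."""
    # Imported here so that defining this helper does not require mlcg_tk.
    from mlcg_tk.scripts.package_training_data import combine_datasets

    return combine_datasets(**kwargs)
```

Calling run_combine() performs the actual packaging, so it should only be invoked once the input datasets exist under save_dir.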
- mlcg_tk.scripts.package_training_data.package_training_data(dataset_name, names, dataset_tag, force_tag, training_data_dir, save_dir, save_h5=True, save_partition=True, single_protein=False, batch_size=256, stride=1, train_size=0.8, train_mols=None, val_mols=None, random_state=None, mol_num_batches=1, keep_batches=False)
Computes structural features and accumulates statistics on dataset samples
- Parameters:
dataset_name (str) – Name given to specific dataset
dataset_tag (str) – Label given to all output files produced from dataset
names (List[str]) – List of sample names
force_tag (str) – Label given to produced delta forces and saved packaged data
training_data_dir (str) – Path to directory from which input will be loaded
save_dir (str) – Path to directory to which output will be saved
save_h5 (bool) – Whether to save dataset h5 file(s)
save_partition (bool) – Whether to save dataset partition file(s)
single_protein (bool) – Whether the produced partition file should be for a single-molecule model. Ignored if save_partition is False
batch_size (int) – Number of samples of dataset to include in each training batch
stride (int) – Integer by which to stride frames
train_size (Union[float, int]) – Either the proportion (if float) or the number (if int) of molecules to place in the training set. If None, lists of training and validation molecules should be supplied (see train_mols and val_mols)
train_mols (Optional[List]) – Molecules to be used for training set
val_mols (Optional[List]) – Molecules to be used for validation set
random_state (Optional[str]) – Controls shuffling applied to the data before applying the split
mol_num_batches (int) – If greater than 1, each molecule's data is loaded from the specified number of batches, which were treated as different samples
keep_batches (bool) – If set to True, batches will be put as individual molecules in the h5 dataset and the partition file will be built accordingly. Otherwise, if batches exist, they will be accumulated into one single molecule.
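The interplay of train_size, train_mols, val_mols, and random_state can be illustrated with a minimal sketch. This is not the implementation used by package_training_data. The helper split_molecules is hypothetical, the rounding rule for a float train_size is an assumption, and the seed is passed to Python's random module, which accepts integer or string seeds.

```python
import random
from typing import List, Optional, Sequence, Tuple, Union


def split_molecules(
    names: Sequence[str],
    train_size: Optional[Union[float, int]] = 0.8,
    train_mols: Optional[List[str]] = None,
    val_mols: Optional[List[str]] = None,
    random_state: Optional[Union[int, str]] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split molecule names into train and validation lists."""
    if train_size is None:
        # Without a train_size, both lists must be given explicitly.
        if train_mols is None or val_mols is None:
            raise ValueError("Supply train_mols and val_mols when train_size is None")
        return list(train_mols), list(val_mols)

    shuffled = list(names)
    # A seeded shuffle makes the split reproducible for a given random_state.
    random.Random(random_state).shuffle(shuffled)

    if isinstance(train_size, float):
        # Float: proportion of molecules (rounding is an assumption of this sketch).
        n_train = int(round(train_size * len(shuffled)))
    else:
        # Int: absolute number of molecules.
        n_train = int(train_size)

    if not 0 <= n_train <= len(shuffled):
        raise ValueError("train_size is out of range for the number of molecules")
    return shuffled[:n_train], shuffled[n_train:]


molecules = ["mol_a", "mol_b", "mol_c", "mol_d", "mol_e"]  # placeholder names
train, val = split_molecules(molecules, train_size=0.8, random_state=42)
```

With five placeholder molecules and train_size=0.8, the sketch places four molecules in the training list and one in the validation list. Passing train_size=3 instead selects exactly three training molecules.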