package_training_data.py

mlcg_tk.scripts.package_training_data.combine_datasets(dataset_names, save_dir, force_tag, save_h5=True, save_partition=True)[source]

Combines the listed datasets into a single packaged dataset, optionally saving the combined h5 and partition files

Parameters:
  • dataset_names (List[str]) – List of dataset names to combine

  • save_dir (str) – Path to directory from which datasets will be loaded and to which output will be saved

  • force_tag (str) – Label given to produced delta forces and saved packaged data

  • save_h5 (bool) – Whether to save dataset h5 file(s)

  • save_partition (bool) – Whether to save dataset partition file(s)

mlcg_tk.scripts.package_training_data.package_training_data(dataset_name, names, dataset_tag, force_tag, training_data_dir, save_dir, save_h5=True, save_partition=True, single_protein=False, batch_size=256, stride=1, train_size=0.8, train_mols=None, val_mols=None, random_state=None, mol_num_batches=1, keep_batches=False)[source]

Computes structural features and accumulates statistics on dataset samples

Parameters:
  • dataset_name (str) – Name given to specific dataset

  • dataset_tag (str) – Label given to all output files produced from dataset

  • names (List[str]) – List of sample names

  • force_tag (str) – Label given to produced delta forces and saved packaged data

  • training_data_dir (str) – Path to directory from which input will be loaded

  • save_dir (str) – Path to directory to which output will be saved

  • save_h5 (bool) – Whether to save dataset h5 file(s)

  • save_partition (bool) – Whether to save dataset partition file(s)

  • single_protein (bool) – Whether the produced partition file should be for a single-molecule model. Ignored if save_partition is False

  • batch_size (int) – Number of samples of dataset to include in each training batch

  • stride (int) – Integer by which to stride frames

  • train_size (Union[float, int]) – Either the proportion (if float) or the number (if int) of molecules to place in the training set. If None, explicit lists must be supplied via train_mols and val_mols

  • train_mols (Optional[List]) – Molecules to be used for training set

  • val_mols (Optional[List]) – Molecules to be used for validation set

  • random_state (Optional[str]) – Controls shuffling applied to the data before applying the split

  • mol_num_batches (int) – If greater than 1, each molecule's data is loaded from the specified number of batches, which were treated as different samples

  • keep_batches (bool) – If set to True, batches will be put as individual molecules in the h5 dataset and the partition file will be built accordingly. Otherwise, if batches exist, they will be accumulated into one single molecule.
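
The interaction between train_size, train_mols, val_mols and random_state described above can be illustrated with a small standalone sketch. This is not the mlcg_tk implementation: the helper name split_molecules, the rounding rule for float proportions, and the use of Python's random module for shuffling are all assumptions made for illustration only.

```python
import random
from typing import List, Optional, Sequence, Tuple, Union


def split_molecules(
    names: Sequence[str],
    train_size: Union[float, int, None] = 0.8,
    train_mols: Optional[List[str]] = None,
    val_mols: Optional[List[str]] = None,
    random_state: Union[int, str, None] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split molecule names into train/validation lists."""
    if train_size is None:
        # No size given: the caller must provide both lists explicitly.
        if train_mols is None or val_mols is None:
            raise ValueError(
                "train_mols and val_mols are required when train_size is None"
            )
        return list(train_mols), list(val_mols)

    if isinstance(train_size, float):
        # Float: interpreted as a proportion of the molecules (assumed rounding).
        n_train = int(round(train_size * len(names)))
    else:
        # Int: interpreted as an absolute number of molecules.
        n_train = int(train_size)

    if not 0 <= n_train <= len(names):
        raise ValueError("train_size is out of range for the given names")

    # Shuffle a copy with a seeded generator so the split is reproducible.
    shuffled = list(names)
    random.Random(random_state).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

With five molecule names and train_size=0.8, this sketch places four molecules in the training list and one in the validation list; passing the same random_state twice yields the same split.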