package_training_data.py
- mlcg_tk.scripts.package_training_data.combine_datasets(dataset_names, save_dir, force_tag, save_h5=True, save_partition=True)
Computes structural features and accumulates statistics on dataset samples
- Parameters:
  - dataset_names (List[str]) – List of dataset names to combine
  - save_dir (str) – Path to the directory from which datasets will be loaded and to which output will be saved
  - force_tag (Optional[str]) – Label given to produced delta forces and saved packaged data
  - save_h5 (Optional[bool]) – Whether to save dataset h5 file(s)
  - save_partition (Optional[bool]) – Whether to save dataset partition file(s)
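As a rough illustration of the signature above, a call could be assembled as follows. This is only a sketch: the dataset names, directory, and tag are placeholder values invented for the example, `run_combine` is a hypothetical wrapper, and the call itself assumes that mlcg_tk is installed and that the listed datasets have already been packaged under `save_dir`.

```python
from typing import Any, Dict

# Placeholder values, chosen for illustration only (not taken from mlcg_tk).
combine_kwargs: Dict[str, Any] = {
    "dataset_names": ["dataset_a", "dataset_b"],  # datasets to combine
    "save_dir": "./packaged_data",  # datasets are loaded from and saved to this directory
    "force_tag": "example_tag",  # label for delta forces and packaged data
    "save_h5": True,  # write the h5 file(s)
    "save_partition": True,  # write the partition file(s)
}


def run_combine() -> None:
    """Hypothetical wrapper: call combine_datasets with the arguments above."""
    # Imported lazily so the sketch can be read or imported without mlcg_tk.
    from mlcg_tk.scripts.package_training_data import combine_datasets

    combine_datasets(**combine_kwargs)
```

Passing the arguments by keyword keeps the call readable and independent of the positional order in the signature.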
- mlcg_tk.scripts.package_training_data.package_training_data(dataset_name, names, dataset_tag, force_tag, training_data_dir, save_dir, save_h5=True, save_partition=True, single_protein=False, batch_size=256, stride=1, train_size=0.8, train_mols=None, val_mols=None, random_state=None, mol_num_batches=1, keep_batches=False)
Computes structural features and accumulates statistics on dataset samples
- Parameters:
  - dataset_name (str) – Name given to the specific dataset
  - dataset_tag (str) – Label given to all output files produced from the dataset
  - names (List[str]) – List of sample names
  - force_tag (str) – Label given to produced delta forces and saved packaged data
  - training_data_dir (str) – Path to the directory from which input will be loaded
  - save_dir (str) – Path to the directory to which output will be saved
  - save_h5 (Optional[bool]) – Whether to save dataset h5 file(s)
  - save_partition (Optional[bool]) – Whether to save dataset partition file(s)
  - single_protein (Optional[bool]) – Whether the produced partition file should be for a single-molecule model. Ignored if save_partition is False.
  - batch_size (int) – Number of dataset samples to include in each training batch
  - stride (int) – Integer by which to stride frames
  - train_size (Union[float, int, None]) – Either the proportion (if float) or the number (if int) of molecules in the training data. If None, lists should be supplied for the training and validation samples.
  - train_mols (Optional[List[str]]) – Molecules to be used for the training set
  - val_mols (Optional[List[str]]) – Molecules to be used for the validation set
  - random_state (Optional[str]) – Controls the shuffling applied to the data before applying the split
  - mol_num_batches (Optional[int]) – If greater than 1, each molecule's data will be loaded from the specified number of batches, which will be treated as different samples
  - keep_batches (Optional[bool]) – If set to True, batches will be put as individual molecules in the h5 dataset and the partition file will be built accordingly. Otherwise, if batches exist, they will be accumulated into one single molecule.
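The interaction between train_size, train_mols, val_mols, and random_state can be read as the split logic sketched below. This is a plain-Python illustration of the documented semantics, not the mlcg_tk implementation: the helper `split_molecules` is hypothetical, rounding the float proportion down is an assumption, and random_state is treated here as an integer seed even though the parameter list types it as Optional[str].

```python
import random
from typing import List, Optional, Tuple, Union


def split_molecules(
    names: List[str],
    train_size: Union[float, int, None] = 0.8,
    train_mols: Optional[List[str]] = None,
    val_mols: Optional[List[str]] = None,
    random_state: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper illustrating the documented split semantics."""
    if train_size is None:
        # Without a size, explicit training and validation lists are required.
        if train_mols is None or val_mols is None:
            raise ValueError(
                "train_mols and val_mols are required when train_size is None"
            )
        return list(train_mols), list(val_mols)

    # Shuffle a copy so the caller's list is left untouched.
    shuffled = list(names)
    random.Random(random_state).shuffle(shuffled)

    if isinstance(train_size, float):
        # Float: proportion of molecules (rounding down is an assumption).
        n_train = int(train_size * len(shuffled))
    else:
        # Int: absolute number of molecules.
        n_train = train_size

    return shuffled[:n_train], shuffled[n_train:]


# Placeholder molecule names, used for illustration only.
names = ["mol_a", "mol_b", "mol_c", "mol_d", "mol_e"]
train, val = split_molecules(names, train_size=0.8, random_state=0)
```

With five molecules and train_size=0.8, four land in the training set and one in the validation set. Reusing the same seed reproduces the same split.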