Ensemble
Introduction
About Quantity
A Quantity wraps a numpy.ndarray and a unit (defined in proteka.dataset.unit_quantity).
Assigning a Quantity to an Ensemble, either during initialization or as an attribute via dot (.) notation:
- If the input is a plain numpy.ndarray, the unit is assumed to be “dimensionless”
- If the input is a Quantity, the input unit will be stored
Retrieving a saved Quantity:
- Accessing as an attribute (via dot (.)): returns a numpy.ndarray with the value of the Quantity in the stored unit
- Via index bracket ([]): returns the stored Quantity object, allowing flexible automated unit conversion
List stored quantities: .list_quantities()
Special cases are “builtin quantities”, whose stored units are dictated by the unit_system (also used instead of the default “dimensionless” during assignment):
- “coords” (ATOMIC_VECTOR): [L]
- “time” (per-frame SCALAR): [T]
- “forces” (ATOMIC_VECTOR): [E]/[L]
- “velocities” (ATOMIC_VECTOR): [L]/[T]
- “cell_lengths” (BOX_QUANTITIES): [L]
- “cell_angles” (BOX_QUANTITIES): degree
In addition, the above quantities are tied to the molecular system via their shape:
each per-frame quantity must have the same number of frames as self.coords,
and each _ATOMIC_VECTOR_ must also match the number of atoms indicated by
self.top.
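The assignment and retrieval rules above can be mimicked with a small toy model. This is only a sketch: `ToyQuantity` and `ToyEnsemble` are hypothetical stand-ins written for illustration, not proteka classes, they use plain lists instead of numpy arrays, and the unit strings in `BUILTIN_UNITS` are assumed spellings for the default unit system.

```python
class ToyQuantity:
    """Hypothetical stand-in for a Quantity: raw values plus a unit string."""

    def __init__(self, values, unit="dimensionless"):
        self.values = values
        self.unit = unit


class ToyEnsemble:
    """Hypothetical container mimicking the documented dot/bracket rules."""

    # Assumed units that a "nm-g/mol-ps-kJ/mol" unit system dictates
    # for two of the builtin quantities.
    BUILTIN_UNITS = {"coords": "nm", "time": "ps"}

    def __init__(self):
        # Bypass our own __setattr__ so the storage dict is a normal attribute.
        object.__setattr__(self, "_quantities", {})

    def __setattr__(self, key, value):
        if not isinstance(value, ToyQuantity):
            # Plain input: builtin unit if the key is builtin, else dimensionless.
            unit = self.BUILTIN_UNITS.get(key, "dimensionless")
            value = ToyQuantity(value, unit)
        self._quantities[key] = value

    def __getattr__(self, key):
        # Only reached when normal lookup fails, i.e. for stored quantities.
        try:
            return self._quantities[key].values
        except KeyError:
            raise AttributeError(key) from None

    def __getitem__(self, key):
        # Bracket access hands back the full object, unit included.
        return self._quantities[key]

    def list_quantities(self):
        return list(self._quantities)


ens = ToyEnsemble()
ens.weights = [0.2, 0.8]                        # plain values -> "dimensionless"
ens.time = [0.0, 1.0]                           # builtin key -> unit from the unit system
ens.energy = ToyQuantity([1.5, 2.5], "kJ/mol")  # Quantity -> its own unit is kept

print(ens.energy)            # raw values only
print(ens["energy"].unit)    # unit is available via bracket access
print(ens.list_quantities())
```

The split between the two access paths keeps everyday numerical code free of unit handling while still leaving the unit-aware object one bracket away.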
Trajectories:
An Ensemble stores the information about which of its samples come from
which trajectory.
Trajectories are sequential. Therefore, samples from different trajectories are
expected to be non-overlapping slices.
Trajectory information should be provided either during Ensemble initialization
or afterwards with the .register_trjs method.
Properties
- .n_trjs (int): number of trajectories
- .n_frames_per_trj (Dict[str, int]): dictionary of the number of frames in each trajectory
- .trajectory_slices or .trjs or .trajectories (Dict[str, slice]): Python slices for slicing Ensemble quantities according to the .trjs records
- .trj_indices (Dict[str, np.ndarray]): indices for different trajectories according to the .trjs records
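How the trajectory properties relate to one another can be sketched with plain Python slices. The trajectory names and frame counts below are made up for illustration, and lists stand in for numpy arrays; the snippet does not call proteka itself.

```python
# Hypothetical example: 10 frames in total, split into two trajectories.
n_frames = 10
trjs = {"run_a": slice(0, 6), "run_b": slice(6, 10)}

# Expand each slice into explicit frame indices.
trj_indices = {
    name: list(range(*sl.indices(n_frames))) for name, sl in trjs.items()
}

n_trjs = len(trjs)
n_frames_per_trj = {name: len(idx) for name, idx in trj_indices.items()}

# Sequential trajectories must not share any frame.
all_indices = [i for idx in trj_indices.values() for i in idx]
assert len(all_indices) == len(set(all_indices))

# A per-frame quantity is split by applying the slice directly.
time = [float(i) for i in range(n_frames)]
time_b = time[trjs["run_b"]]

print(n_frames_per_trj)
print(time_b)
```

Keeping slices rather than index lists as the primary record makes the non-overlap expectation cheap to check and lets every per-frame quantity be split with the same object.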
mdtraj interface:
- .get_mdtraj_trjs() (-> Dict[str, mdtraj.Trajectory]): pack an Ensemble’s top and coords (and unitcell + simulation times, if available) into a dictionary of mdtraj.Trajectory for analyses according to self.trjs
- .get_all_in_one_mdtraj_trj(): pack all coords into one Trajectory object (maybe not suitable for kinetic analyses, such as TICA and MSM!)
Implementations
- class proteka.dataset.Ensemble(name, top, coords, quantities=None, metadata=None, trajectory_slices=None, unit_system='nm-g/mol-ps-kJ/mol')
An Ensemble is an in-memory data structure consisting of sample coordinates and other quantities for a certain system at a certain thermodynamic state. The samples usually correspond to a Boltzmann distribution.
An Ensemble must have name, top (molecular topology) and coords (3D coordinates). In addition, a unit_system has to be provided as a pre-defined UnitSystem object, as a JSON-serialized UnitSystem, or as a string in the format “[L]-[M]-[T]-[E(nergy)]” to specify the units used internally; the default is “nm-g/mol-ps-kJ/mol”.
- Parameters:
name (str) – a human-readable name of the system. Not necessarily corresponding to the HDF5 group name
top (mdtraj.Topology) – the molecular topology of the system
coords (Quantity or numpy.ndarray) – 3D coordinates with shape (n_frames, n_atoms, 3) and dimension [L]
quantities (Dict[str, np.ndarray | Quantity], optional) – Example key and value pairs for builtin quantities:
- forces: (n_frames, n_atoms, 3) _ATOMIC_VECTOR_ with dimension [E]/[L].
- velocities: (n_frames, n_atoms, 3) _ATOMIC_VECTOR_ with dimension [L]/[T].
- time: (n_frames,) _per-frame_ scalar indicating the elapsed simulation time with dimension [T].
- weights: (n_frames,) _per-frame_ scalar indicating the Boltzmann weight of each frame.
metadata (dict, optional) – Metadata to be saved, e.g., simulation temperature, forcefield information, saving time stride, etc, by default None
trajectory_slices (Dict[str, slice], optional) – a dictionary mapping each trajectory name to its range, expressed as a Python slice object (as in [start:stop:stride] indexing), by default None
unit_system (str | UnitSystem object, optional) – In format “[L]-[M]-[T]-[E(nergy)]” for units of builtin quantities, by default “nm-g/mol-ps-kJ/mol”; alternatively, you can provide an existing UnitSystem or a JSON-serialized such object
- Raises:
ValueError – When input coords does not correspond to the input top.
Notes
As an alternative to the default __init__ method, an Ensemble can also be created from an mdtraj.Trajectory object (see from_mdtraj_trj).
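To make the “[L]-[M]-[T]-[E(nergy)]” string format concrete, the sketch below splits such a string into its four base units and combines them according to the builtin-quantity dimensions listed in the introduction. Both helper functions are hypothetical and written only for illustration; in particular, the way compound units are spelled (for example “kJ/mol/nm”) is an assumption and may differ from what UnitSystem produces.

```python
def parse_unit_system(spec):
    """Split an "[L]-[M]-[T]-[E]" string into base units (illustrative helper)."""
    parts = spec.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 units separated by '-', got {spec!r}")
    return dict(zip(("L", "M", "T", "E"), parts))


def builtin_units(base):
    """Combine base units following the documented builtin dimensions."""
    return {
        "coords": base["L"],                           # [L]
        "time": base["T"],                             # [T]
        "forces": f"{base['E']}/{base['L']}",          # [E]/[L]
        "velocities": f"{base['L']}/{base['T']}",      # [L]/[T]
        "cell_lengths": base["L"],                     # [L]
        "cell_angles": "degree",                       # fixed, not unit-system dependent
    }


base = parse_unit_system("nm-g/mol-ps-kJ/mol")
units = builtin_units(base)

print(base)
print(units["forces"], units["velocities"])
```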
- classmethod from_hdf5(h5grp, unit_system='nm-g/mol-ps-kJ/mol', offset=None, stride=None, no_offset_stride_quantities=None)
Create an instance from the content of HDF5 Group h5grp (h5py.Group). When the given unit_system differs from the stored record, units will be converted when reading each Quantity into memory. The offset and stride inputs allow partial loading of the non-scalar datasets with indexing [offset::stride], unless the dataset’s name is contained in no_offset_stride_quantities.
- Parameters:
h5grp (h5py.Group) – The group should contain all necessary information to initialize an Ensemble. Notably, the following Datasets are required:
- top: serialized topology
- coords (with Attribute “unit”)
All Datasets should come with an Attribute unit for the physical unit used in storage.
unit_system (str, optional) – Should have the format “[L]-[M]-[T]-[E(nergy)]” for units of builtin quantities (see class docstring), by default “nm-g/mol-ps-kJ/mol”
offset (None | int, optional) – The offset for loading from the HDF5 file. Default is None.
stride (None | int, optional) – The stride for loading from the HDF5 file. Default is None.
no_offset_stride_quantities (List[str], optional) – The names of entries, for which no offset or stride should be considered during loading, by default None
- Returns:
An instance containing all compatible content from HDF5 Group h5grp.
- Return type:
Ensemble
- Raises:
ValueError – When the Dataset corresponding to top, coords or other fields does not exist or has invalid format.
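The effect of offset, stride and no_offset_stride_quantities in from_hdf5 can be illustrated without an HDF5 file. The sketch below applies the documented [offset::stride] indexing to plain lists; `partial_load` is a hypothetical helper and the dataset names are made up, so this shows the indexing rule only, not the loader itself.

```python
def partial_load(datasets, offset=None, stride=None, no_offset_stride_quantities=None):
    """Apply [offset::stride] to every dataset except the excluded names."""
    skip = set(no_offset_stride_quantities or [])
    return {
        name: data if name in skip else data[offset::stride]
        for name, data in datasets.items()
    }


# Stand-ins for stored datasets: 10 frames of "time", plus a non-per-frame entry.
stored = {
    "time": list(range(10)),
    "reference_values": [1.0, 2.0, 3.0],
}

loaded = partial_load(
    stored,
    offset=2,
    stride=3,
    no_offset_stride_quantities=["reference_values"],
)
full = partial_load(stored)  # None/None keeps every frame

print(loaded["time"])              # frames 2, 5, 8
print(loaded["reference_values"])  # untouched
```

Excluding entries by name matters for data whose first axis is not the frame axis, since striding such an array would silently discard values.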
- static from_mdtraj_trj(name, trj, quantities=None, metadata=None, trajectory_slices=None, unit_system='nm-g/mol-ps-kJ/mol')
Create an Ensemble instance from mdtraj.Trajectory.
- Parameters:
name (str) – a human-readable name of the system. Not necessarily corresponding to the HDF5 group name
trj (mdtraj.Trajectory) – A trajectory, whose topology and coordinates (and when applicable also the unit cell information) will be stored in the created Ensemble.
quantities (Dict[str, np.ndarray | Quantity], optional) – Example key and value pairs for builtin quantities:
- forces: (n_frames, n_atoms, 3) _ATOMIC_VECTOR_ with dimension [E]/[L].
- velocities: (n_frames, n_atoms, 3) _ATOMIC_VECTOR_ with dimension [L]/[T].
- time: (n_frames,) _per-frame_ scalar indicating the elapsed simulation time with dimension [T].
- weights: (n_frames,) _per-frame_ scalar indicating the Boltzmann weight of each frame.
metadata (dict, optional) – Metadata to be saved, e.g., simulation temperature, forcefield information, saving time stride, etc, by default None
trajectory_slices (Dict[str, slice], optional) – a dictionary mapping each trajectory name to its range, expressed as a Python slice object (as in [start:stop:stride] indexing), by default None
unit_system (str | UnitSystem object, optional) – In format “[L]-[M]-[T]-[E(nergy)]” for units of builtin quantities, by default “nm-g/mol-ps-kJ/mol”; alternatively, you can provide an existing UnitSystem or a JSON-serialized such object
- Returns:
An instance containing all information from the mdtraj object together with the provided quantities, metadata, trajectory_slices and unit_system.
- Return type:
Ensemble
- Raises:
ValueError –
- get_all_in_one_mdtraj_trj()
Pack all coords into one Trajectory object (maybe not suitable for kinetic analyses, such as TICA and MSM!)
- Returns:
A trajectory, whose topology (and unitcell dimensions, if applicable) come from the Ensemble object and coordinates from coords concatenated
- Return type:
mdtraj.Trajectory
- get_mdtraj_trjs()
Pack this Ensemble’s top and coords (and unitcell + simulation times, if available) into a dictionary of Trajectory for analyses, according to the slicing given by self.trjs.
- Returns:
A dictionary containing all mdtraj Trajectories implied by self.trjs.
- Return type:
Dict[str, mdtraj.Trajectory]
- get_quantity(key)
Retrieve a Quantity in the Ensemble.
- Parameters:
key (str) – The name of the Quantity
- Returns:
The Quantity under the name key in the Ensemble
- Return type:
Quantity
- Raises:
KeyError – When key does not correspond to a Quantity existing in the current Ensemble.
- get_unit(key)
Retrieve the builtin unit for a Quantity under name key in the Ensemble, or alternatively the preset unit of a builtin Quantity. If neither is the case, return None.
- Parameters:
key (str) – Name of the Quantity
- Returns:
The builtin unit of the Quantity under name key
- Return type:
str | None
- list_quantities()
List the names of the quantities stored in the Ensemble.
- Returns:
Quantity names
- Return type:
List[str]
- property n_atoms
Return the number of atoms of the molecule defined in top.
- Returns:
Number of atoms contained in the molecule.
- Return type:
int
- property n_frames
Return the number of frames.
- Returns:
Number of frames contained in this Ensemble.
- Return type:
int
- property n_frames_per_trj
Number of frames contained by each trajectory.
- Return type:
Dict[str, int]
- property n_trjs
Number of trajectories in the Ensemble.
- Return type:
int
- register_trjs(dict_of_slices)
The input slices will be used to indicate which slice of the Ensemble samples (and other quantities) correspond to which trajectory.
- Parameters:
dict_of_slices (Mapping[str, slice]) – Name and range of a trajectory.
- set_quantity(key, quant)
Store quant (Quantity | numpy.ndarray) under name key (str). When quant is a plain numpy.ndarray, the unit is assumed according to .unit_system if the key is one of the PRESET_BUILTIN_QUANTITIES (or self.unit_system.builtin_quantities in case of a customized unit_system at initialization), or dimensionless otherwise. When key is one of the PRESET_BUILTIN_QUANTITIES (or self.unit_system.builtin_quantities), the unit and shape of quant need to be compatible.
- Parameters:
key (str) – The name/key under which to store the quantity.
quant (numpy.ndarray | Quantity) – The quantity to be saved. When the input is a raw numpy array, the unit is assumed to be either the builtin unit (when it exists) or “dimensionless”.
- property top
Return the topology of the molecular system.
- Return type:
mdtraj.Topology
- property topology
Return the topology of the molecular system. Alias to .top.
- Return type:
mdtraj.Topology
- property trajectories
Get the slices corresponding to each trajectory. These Python slice objects can be used to retrieve the correct portion for the corresponding trajectory for each _per-frame_ quantity (e.g., coords, forces, …) via the bracket ([]) operator. Alias to .trajectory_slices.
- Returns:
Key-slice pairs for each registered trajectory
- Return type:
Dict[str, slice]
- property trajectory_indices
Indices of frames belonging to each Trajectory.
- Returns:
Key-(array of int) pairs for each registered trajectory
- Return type:
Dict[str, numpy.ndarray]
- property trajectory_slices
Get the slices corresponding to each trajectory. These Python slice objects can be used to retrieve the correct portion for the corresponding trajectory for each _per-frame_ quantity (e.g., coords, forces, …) via the bracket ([]) operator.
- Returns:
Key-slice pairs for each registered trajectory
- Return type:
Dict[str, slice]
- property trjs
Get the slices corresponding to each trajectory. These Python slice objects can be used to retrieve the correct portion for the corresponding trajectory for each _per-frame_ quantity (e.g., coords, forces, …) via the bracket ([]) operator. Alias to .trajectory_slices.
- Returns:
Key-slice pairs for each registered trajectory
- Return type:
Dict[str, slice]
- property unit_system
Return the unit system used by the Ensemble.
- Returns:
The unit system containing units for the basic dimensions “[X]” (X = L, M, T, E) and units for builtin quantities
- Return type:
UnitSystem