Ensemble

Introduction

About Quantity

A Quantity wraps a numpy.ndarray and a unit (defined in proteka.dataset.unit_quantity). A Quantity can be assigned to an Ensemble either during initialization or as an attribute via dot (.) notation:

  • If the input is a plain numpy.ndarray, the unit is assumed to be “dimensionless”

  • If the input is a Quantity, the input unit will be stored

Retrieving a stored Quantity:

  • Accessing as an attribute (via dot (.)): returns a numpy.ndarray holding the value of the Quantity in its stored unit

  • Via index bracket ([]): returns the stored Quantity object, allowing flexible automated unit conversion
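The difference between dot and bracket access can be illustrated with a small stand-in. The sketch below does not use proteka; ToyQuantity and ToyEnsemble are hypothetical classes that mimic only the assignment and retrieval rules listed above, with plain lists in place of numpy arrays.

```python
class ToyQuantity:
    """Hypothetical stand-in for Quantity: a value plus a unit string."""

    def __init__(self, value, unit="dimensionless"):
        self.value = value
        self.unit = unit


class ToyEnsemble:
    """Hypothetical stand-in mimicking only the access rules described above."""

    def __init__(self):
        # Bypass our own __setattr__ when creating the internal storage.
        object.__setattr__(self, "_quantities", {})

    def __setattr__(self, name, value):
        # Plain values are wrapped and treated as dimensionless;
        # ToyQuantity inputs keep the unit they came with.
        if not isinstance(value, ToyQuantity):
            value = ToyQuantity(value)
        self._quantities[name] = value

    def __getattr__(self, name):
        # Only called when normal lookup fails; never resolve private names here.
        if name.startswith("_"):
            raise AttributeError(name)
        # Dot access: only the raw value, expressed in the stored unit.
        try:
            return self._quantities[name].value
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, name):
        # Bracket access: the full object, unit included.
        return self._quantities[name]

    def list_quantities(self):
        return list(self._quantities)


ens = ToyEnsemble()
ens.weights = [0.2, 0.8]                               # plain values
ens.energies = ToyQuantity([1.0, 2.5], unit="kJ/mol")  # value with a unit
```

Here ens.energies yields the bare values, while ens["energies"] yields the object that still carries its unit, which is what makes later unit conversion possible.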

List stored quantities: .list_quantities()

  • Special cases are “builtin quantities”, whose stored units are dictated by the unit_system (also used instead of the default “dimensionless” during assignment):

  • “coords” (ATOMIC_VECTOR): [L]

  • “time” (per-frame SCALAR): [T]

  • “forces” (ATOMIC_VECTOR): [E]/[L]

  • “velocities” (ATOMIC_VECTOR): [L]/[T]

  • “cell_lengths” (BOX_QUANTITIES): [L]

  • “cell_angles” (BOX_QUANTITIES): degree

In addition, the above quantities are tied to the molecular system through their shape: each per-frame quantity must have the same number of frames as self.coords, and each _ATOMIC_VECTOR_ must correspond to the number of atoms given by self.top.
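To make the dimension table concrete, the sketch below derives a unit string for each builtin quantity from a unit-system string in the “[L]-[M]-[T]-[E(nergy)]” format. The function name and the way composite units are spelled (for example "kJ/mol/nm") are assumptions made for illustration; proteka's UnitSystem may represent them differently.

```python
def builtin_units(unit_system="nm-g/mol-ps-kJ/mol"):
    """Map builtin quantity names to unit strings (illustrative only)."""
    parts = unit_system.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '-'-separated units, got {unit_system!r}")
    # The mass unit is parsed but not needed by any quantity in the list above.
    length, mass, time, energy = parts
    return {
        "coords": length,                   # [L]
        "time": time,                       # [T]
        "forces": f"{energy}/{length}",     # [E]/[L]
        "velocities": f"{length}/{time}",   # [L]/[T]
        "cell_lengths": length,             # [L]
        "cell_angles": "degree",            # fixed, independent of the unit system
    }


units = builtin_units()
```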

Trajectories:

An Ensemble stores which of its samples come from which trajectory. Trajectories are sequential, so samples from different trajectories are expected to be non-overlapping slices. Trajectory information should be stored either during Ensemble initialization or afterwards with the .register_trjs method.

Properties

  • .n_trjs (int): number of trajectories

  • .n_frames_per_trj (Dict[str, int]): dictionary of number of frames in each trajectory

  • .trajectory_slices or .trjs or .trajectories (Dict[str, slice]): Python slices for slicing Ensemble quantities according to the .trjs records

  • .trj_indices (Dict[str, np.ndarray]): indices for different trajectories according to the .trjs records
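The relationship between the slice records and the derived properties can be sketched in plain Python. The helpers below are hypothetical and work on a dictionary of slices and a frame count; they return lists where the Ensemble is documented to return numpy arrays.

```python
def frames_per_trj(trjs, n_frames):
    """Number of frames selected by each slice, given the total frame count."""
    return {name: len(range(*s.indices(n_frames))) for name, s in trjs.items()}


def trj_indices(trjs, n_frames):
    """Explicit frame indices selected by each slice."""
    return {name: list(range(*s.indices(n_frames))) for name, s in trjs.items()}


n_frames = 10
# Two sequential, non-overlapping trajectories covering all ten frames.
trjs = {"trj_0": slice(0, 6), "trj_1": slice(6, None)}

counts = frames_per_trj(trjs, n_frames)
indices = trj_indices(trjs, n_frames)

# Any per-frame sequence can be cut with the same slice objects.
per_frame_values = list(range(n_frames))
second_trj_values = per_frame_values[trjs["trj_1"]]
```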

mdtraj interface:

  • .get_mdtraj_trjs() (-> Dict[str, mdtraj.Trajectory]): pack an Ensemble’s top and coords (and unitcell + simulation times, if available) into a dictionary of mdtraj.Trajectory for analyses according to self.trjs

  • .get_all_in_one_mdtraj_trj(): pack all coords into one Trajectory object (maybe not suitable for kinetic analyses, such as TICA and MSM!)

Implementations

class proteka.dataset.Ensemble(name, top, coords, quantities=None, metadata=None, trajectory_slices=None, unit_system='nm-g/mol-ps-kJ/mol')

An Ensemble is an in-memory data structure consisting of sample coordinates and other quantities for a certain system at a certain thermodynamic state. The samples usually correspond to a Boltzmann distribution.

An Ensemble must have name, top (molecular topology) and coords (3D coordinates). In addition, a unit_system has to be provided, either as a pre-defined UnitSystem object, a serialized JSON version of a UnitSystem, or a string in the format “[L]-[M]-[T]-[E(nergy)]”, to specify the units used internally. The default is “nm-g/mol-ps-kJ/mol”.

Parameters:
  • name (str) – a human-readable name of the system. Not necessarily corresponding to the HDF5 group name

  • top (mdtraj.Topology) – the molecular topology of the system

  • coords (Quantity or numpy.ndarray) – 3D coordinates with shape (n_frames, n_atoms, 3) and dimension [L]

  • quantities (Dict[str, np.ndarray | Quantity], optional) – Example key and value pairs for builtin quantities:

    • forces: (n_frames, n_atoms, 3) _ATOMIC_VECTOR_.

    • velocities: (n_frames, n_atoms, 3) _ATOMIC_VECTOR_ with dimension [L]/[T].

    • time: (n_frames,) _per-frame_ scalar indicating the elapsed simulation time with dimension [T].

    • weights: (n_frames,) _per-frame_ scalar indicating the Boltzmann weight of each frame.

  • metadata (dict, optional) – Metadata to be saved, e.g., simulation temperature, forcefield information, saving time stride, etc, by default None

  • trajectory_slices (Dict[str, slice], optional) – a dictionary mapping each trajectory name to its range, expressed as a Python slice object (similar to using [start:stop:stride] for indexing), by default None

  • unit_system (str | UnitSystem object, optional) – In format “[L]-[M]-[T]-[E(nergy)]” for units of builtin quantities, by default “nm-g/mol-ps-kJ/mol”; alternatively, you can provide an existing UnitSystem or a JSON-serialized version of one

Raises:

ValueError – When input coords does not correspond to the input top.
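The correspondence between coords and top, and the shape rule for per-frame quantities, can be illustrated with bare shape tuples. The sketch below is an assumption about what such a check might look like; the helper name is hypothetical and proteka's own validation may differ in detail.

```python
def check_shapes(coords_shape, n_atoms, per_frame_shapes=None):
    """Hypothetical shape check; returns the number of frames on success."""
    coords_shape = tuple(coords_shape)
    # coords must be (n_frames, n_atoms, 3), with n_atoms taken from the topology.
    if len(coords_shape) != 3 or coords_shape[1:] != (n_atoms, 3):
        raise ValueError(
            f"coords shape {coords_shape} does not match a topology with {n_atoms} atoms"
        )
    n_frames = coords_shape[0]
    # Every per-frame quantity must share the leading n_frames dimension.
    for name, shape in (per_frame_shapes or {}).items():
        if len(shape) == 0 or shape[0] != n_frames:
            raise ValueError(f"{name!r} has shape {shape}, expected {n_frames} frames")
    return n_frames


n_frames = check_shapes(
    (100, 22, 3),
    n_atoms=22,
    per_frame_shapes={"time": (100,), "forces": (100, 22, 3)},
)
```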

Notes

As an alternative to the default __init__ method, an Ensemble can also be created from an mdtraj.Trajectory object (see from_mdtraj_trj).

classmethod from_hdf5(h5grp, unit_system='nm-g/mol-ps-kJ/mol', offset=None, stride=None, no_offset_stride_quantities=None)

Create an instance from the content of HDF5 Group h5grp (h5py.Group). When the given unit_system differs from the stored record, units are converted when reading each Quantity into memory. offset and stride can be set to allow partial loading of the non-scalar datasets with indexing [offset::stride], unless the dataset’s name is contained in no_offset_stride_quantities.

Parameters:
  • h5grp (h5py.Group) – The group should contain all necessary information to initialize an Ensemble. Notably, the following Datasets are required:

    • top: serialized topology

    • coords (with Attribute “unit”)

    All Datasets should come with an Attribute unit for the physical unit used in storage.

  • unit_system (str, optional) – Should have the format “[L]-[M]-[T]-[E(nergy)]” for units of builtin quantities (see class docstring), by default “nm-g/mol-ps-kJ/mol”

  • offset (None | int, optional) – The offset for loading from the HDF5 file. Default is None.

  • stride (None | int, optional) – The stride for loading from the HDF5 file. Default is None.

  • no_offset_stride_quantities (List[str], optional) – The names of entries, for which no offset or stride should be considered during loading, by default None

Returns:

An instance containing all compatible content from HDF5 Group h5grp.

Return type:

Ensemble

Raises:

ValueError – When the Dataset corresponding to top, coords or other fields does not exist or has invalid format.
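The effect of offset, stride and no_offset_stride_quantities can be sketched with ordinary Python sequences. The function below is a hypothetical illustration of the [offset::stride] indexing rule only; it does not read HDF5, does not model the special treatment of scalar datasets, and is not proteka's loader.

```python
def partial_load(datasets, offset=None, stride=None, no_offset_stride_quantities=None):
    """Apply [offset::stride] to every entry except the excluded names."""
    excluded = set(no_offset_stride_quantities or [])
    loaded = {}
    for name, data in datasets.items():
        if name in excluded:
            # Excluded entries are kept in full.
            loaded[name] = data
        else:
            # None for offset or stride behaves like an omitted slice field.
            loaded[name] = data[offset::stride]
    return loaded


datasets = {
    "coords": list(range(10)),                    # stand-in for 10 frames
    "time": [0.5 * i for i in range(10)],
    "reference": ["a", "b", "c"],                 # not a per-frame entry
}

subset = partial_load(
    datasets, offset=2, stride=3, no_offset_stride_quantities=["reference"]
)
everything = partial_load(datasets)
```

With offset=2 and stride=3, frames 2, 5 and 8 are kept, while the excluded entry is returned unchanged.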

static from_mdtraj_trj(name, trj, quantities=None, metadata=None, trajectory_slices=None, unit_system='nm-g/mol-ps-kJ/mol')

Create an Ensemble instance from mdtraj.Trajectory.

Parameters:
  • name (str) – a human-readable name of the system. Not necessarily corresponding to the HDF5 group name

  • trj (mdtraj.Trajectory) – A trajectory, whose topology and coordinates (and when applicable also the unit cell information) will be stored in the created Ensemble.

  • quantities (Dict[str, np.ndarray | Quantity], optional) – Example key and value pairs for builtin quantities:

    • forces: (n_frames, n_atoms, 3) _ATOMIC_VECTOR_.

    • velocities: (n_frames, n_atoms, 3) _ATOMIC_VECTOR_ with dimension [L]/[T].

    • time: (n_frames,) _per-frame_ scalar indicating the elapsed simulation time with dimension [T].

    • weights: (n_frames,) _per-frame_ scalar indicating the Boltzmann weight of each frame.

  • metadata (dict, optional) – Metadata to be saved, e.g., simulation temperature, forcefield information, saving time stride, etc, by default None

  • trajectory_slices (Dict[str, slice], optional) – a dictionary mapping each trajectory name to its range, expressed as a Python slice object (similar to using [start:stop:stride] for indexing), by default None

  • unit_system (str | UnitSystem object, optional) – In format “[L]-[M]-[T]-[E(nergy)]” for units of builtin quantities, by default “nm-g/mol-ps-kJ/mol”; alternatively, you can provide an existing UnitSystem or a JSON-serialized version of one

Returns:

An instance containing all information from the mdtraj object together with the provided quantities, metadata, trajectory_slices and unit_system.

Return type:

Ensemble

Raises:

ValueError

get_all_in_one_mdtraj_trj()

Pack all coords into one Trajectory object (maybe not suitable for kinetic analyses, such as TICA and MSM!)

Returns:

A trajectory, whose topology (and unitcell dimensions, if applicable) come from the Ensemble object and coordinates from coords concatenated

Return type:

mdtraj.Trajectory

get_mdtraj_trjs()

Pack this Ensemble’s top and coords (and unitcell + simulation times, if available) into a dictionary of Trajectory for analyses, according to the slicing given by self.trjs.

Returns:

A dictionary containing all mdtraj Trajectories implied by self.trjs.

Return type:

Dict[str, mdtraj.Trajectory]

get_quantity(key)

Retrieve a Quantity in the Ensemble.

Parameters:

key (str) – The name of the Quantity

Returns:

The Quantity under the name key in the Ensemble

Return type:

Quantity

Raises:

KeyError – When key does not correspond to a Quantity existing in the current Ensemble.

get_unit(key)

Retrieve the builtin unit for a Quantity under name key in the Ensemble, or alternatively the preset unit of a builtin Quantity. If neither is the case, return None.

Parameters:

key (str) – Name of the Quantity

Returns:

The builtin unit of the Quantity under name key

Return type:

str | None
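One way to read the lookup order described for get_unit is sketched below, with plain dictionaries standing in for the Ensemble's stored quantities and for the preset builtin units. The function and argument names are hypothetical, and the exact precedence inside proteka is an assumption here.

```python
def lookup_unit(key, stored_units, preset_builtin_units):
    """Return the unit for key, or None if it is neither stored nor builtin."""
    # 1) A quantity already stored in the Ensemble reports its own unit.
    if key in stored_units:
        return stored_units[key]
    # 2) Otherwise fall back to the preset unit of a builtin quantity, if any.
    return preset_builtin_units.get(key)


stored_units = {"coords": "nm", "weights": "dimensionless"}
preset_builtin_units = {"coords": "nm", "time": "ps", "forces": "kJ/mol/nm"}

stored_hit = lookup_unit("weights", stored_units, preset_builtin_units)
builtin_hit = lookup_unit("forces", stored_units, preset_builtin_units)
miss = lookup_unit("dihedrals", stored_units, preset_builtin_units)
```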

list_quantities()

List the name of quantities stored in the Ensemble.

Returns:

Quantity names

Return type:

List[str]

property n_atoms

Return the number of atoms of the molecule defined in top.

Returns:

Number of atoms contained in the molecule.

Return type:

int

property n_frames

Return the number of frames.

Returns:

Number of frames contained in this Ensemble.

Return type:

int

property n_frames_per_trj

Number of frames contained by each trajectory.

Return type:

Dict[str, int]

property n_trjs

Number of trajectories in the Ensemble.

Return type:

int

register_trjs(dict_of_slices)

The input slices will be used to indicate which slice of the Ensemble samples (and other quantities) correspond to which trajectory.

Parameters:

dict_of_slices (Mapping[str, slice]) – Name and range of a trajectory.

set_quantity(key, quant)

Store quant (Quantity | numpy.ndarray) under name key (str). When quant is a plain numpy.ndarray, the unit is assumed according to .unit_system if the key is one of the PRESET_BUILTIN_QUANTITIES (or self.unit_system.builtin_quantities in case of a customized unit_system at initialization), and dimensionless otherwise.

When key is one of the PRESET_BUILTIN_QUANTITIES (or self.unit_system.builtin_quantities), the unit and shape of quant need to be compatible.

Parameters:
  • key (str) – The name/key under which to store the quantity.

  • quant (numpy.ndarray | Quantity) – The quantity to be saved. When the input is a raw numpy array, the unit is assumed to be either the builtin unit (when one exists) or “dimensionless”.

property top

Return the topology of the molecular system.

Return type:

mdtraj.Topology

property topology

Return the topology of the molecular system. Alias to .top.

Return type:

mdtraj.Topology

property trajectories

Get the slices corresponding to each trajectory. These Python slice objects can be used to retrieve the correct portion for the corresponding trajectory for each _per-frame_ quantity (e.g., coords, forces, …) via the bracket ([]) operator. Alias to .trajectory_slices.

Returns:

Key-slice pairs for each registered trajectory

Return type:

Dict[str, slice]

property trajectory_indices

Indices of frames belonging to each Trajectory.

Returns:

Key-(list of int) pairs for each registered trajectory

Return type:

Dict[str, numpy.ndarray]

property trajectory_slices

Get the slices corresponding to each trajectory. These Python slice objects can be used to retrieve the correct portion for the corresponding trajectory for each _per-frame_ quantity (e.g., coords, forces, …) via the bracket ([]) operator.

Returns:

Key-slice pairs for each registered trajectory

Return type:

Dict[str, slice]

property trjs

Get the slices corresponding to each trajectory. These Python slice objects can be used to retrieve the correct portion for the corresponding trajectory for each _per-frame_ quantity (e.g., coords, forces, …) via the bracket ([]) operator. Alias to .trajectory_slices.

Returns:

Key-slice pairs for each registered trajectory

Return type:

Dict[str, slice]

property unit_system

Return the unit system used by the Ensemble.

Returns:

The unit system containing units for basic dimension “[X]” (X = L, M, T, E) and units for builtin quantities

Return type:

UnitSystem