Analyze¶

Analysis tools and module for YANK simulations. Provides programmatic and automatic “best practices” integration to determine free energy and other observables.

Fully extensible to support new samplers and observables.

yank.analyze.generate_phase_name(current_name, name_list)[source]¶

Provide a regular way to generate unique human-readable names from base names.

Given a base name and a list of existing names, a number will be appended to the base name until a unique string is generated.

Parameters:

current_name : string

The base name you wish to ensure is unique. Numbers will be appended to this string until a unique string not in the name_list is provided

name_list : iterable of strings

The current_name, and its modifiers, are compared against this list until a unique string is found

Returns:

name : string

Unique string derived from the current_name that is not in name_list. If the parameter current_name is not already in the name_list, then current_name is returned unmodified.

yank.analyze.get_analyzer(file_base_path)[source]¶

Utility function to convert storage file to a Reporter and Analyzer by reading the data on file

For now this is mostly placeholder functions since there is only the implemented ReplicaExchangeAnalyzer, but creates the API for the user to work with.

Parameters:

file_base_path : string

Complete path to the storage file with filename and extension.

Returns:

analyzer : instance of implemented YankPhaseAnalyzer

Analyzer for the specific phase.

yank.analyze.get_decorrelation_time(timeseries_to_analyze)[source]¶

Compute the decorrelation times given a timeseries.

See the pymbar.timeseries.statisticalInefficiency for full documentation

yank.analyze.get_equilibration_data(timeseries_to_analyze)[source]¶

Compute equilibration method given a timeseries

See the pymbar.timeseries.detectEquilibration function for full documentation

yank.analyze.get_equilibration_data_per_sample(timeseries_to_analyze, fast=True, nskip=1)[source]¶

Compute the correlation time and n_effective per sample.

This is exactly what pymbar.timeseries.detectEquilibration does, but returns the per sample data

See the pymbar.timeseries.detectEquilibration function for full documentation

yank.analyze.remove_unequilibrated_data(data, number_equilibrated, axis)[source]¶

Remove the number_equilibrated samples from a dataset

Discards number_equilibrated number of indices from given axis

Parameters:

data : np.array-like of any dimension length

This is the data which will be paired down

number_equilibrated : int

Number of indices that will be removed from the given axis, i.e. axis will be shorter by number_equilibrated

axis : int

Axis index along which to remove samples from. This supports negative indexing as well

Returns:

equilibrated_data : ndarray

Data with the number_equilibrated number of indices removed from the beginning along axis

yank.analyze.subsample_data_along_axis(data, subsample_rate, axis)[source]¶

Generate a decorrelated version of a given input data and subsample_rate along a single axis.

Parameters:

data : np.array-like of any dimension length

subsample_rate : float or int

Rate at which to draw samples. A sample is considered decorrelated after every ceil(subsample_rate) of indices along data and the specified axis

axis : int

axis along which to apply the subsampling

Returns:

subsampled_data : ndarray of same number of dimensions as data

Data will be subsampled along the given axis

class yank.analyze.YankPhaseAnalyzer(reporter, name=None, reference_states=(0, -1), analysis_kwargs=None)[source]¶

Analyzer for a single phase of a YANK simulation.

Uses the reporter from the simulation to determine the location of all variables.

To compute a specific observable in an implementation of this class, add it to the ObservableRegistry and then implement a get_X where X is the name of the observable you want to compute. See the ObservablesRegistry for information about formatting the observables.

Analyzer works in units of kT unless specifically stated otherwise. To convert back to a unit set, just multiply by the .kT property.

Parameters:

reporter : Reporter instance

Reporter from Repex which ties to the simulation data on disk.

name : str, Optional

Unique name you want to assign this phase, this is the name that will appear in MultiPhaseAnalyzer‘s. If not set, it will be given the arbitrary name “phase#” where # is an integer, chosen in order that it is assigned to the MultiPhaseAnalyzer.

reference_states : tuple of ints, length 2, Optional, Default: (0,-1)

Integers i and j of the state that is used for reference in observables, “O”. These values are only used when reporting single numbers or combining observables through MultiPhaseAnalyzer (since the number of states between phases can be different). Calls to functions such as get_free_energy in a single Phase results in the O being returned for all states.

For O completely defined by the state itself (i.e. no differences between states, e.g. Temperature), only O[i] is used

For O where differences between states are required (e.g. Free Energy): O[i,j] = O[j] - O[i]

For O defined by the phase as a whole, the reference states are not needed.

analysis_kwargs : None or dict, optional

Dictionary of extra keyword arguments to pass into the analysis tool, typically MBAR. For instance, the initial guess of relative free energies to give to MBAR would be something like: {'initial_f_k':[0,1,2,3]}

Attributes

`name`	User-readable string name of the phase
`observables`	List of observables that the instanced analyzer can compute/fetch.
`mbar`	MBAR object tied to this phase
`reference_states`	Tuple of reference states `i` and `j` for `MultiPhaseAnalyzer` instances
`kT`	Quantity of boltzmann constant times temperature of the phase in units of energy per mol
`reporter`	Sampler Reporter tied to this object.

name¶: User-readable string name of the phase

observables¶

List of observables that the instanced analyzer can compute/fetch.

This list is automatically compiled upon class initialization based on the functions implemented in the subclass

mbar¶: MBAR object tied to this phase

reference_states¶: Tuple of reference states i and j for MultiPhaseAnalyzer instances

kT¶

Quantity of boltzmann constant times temperature of the phase in units of energy per mol

Allows conversion between dimensionless energy and unit bearing energy

reporter¶: Sampler Reporter tied to this object.

analyze_phase(*args, **kwargs)[source]¶

Auto-analysis function for the phase

Function which broadly handles “auto-analysis” for those that do not wish to call all the methods on their own. This should be have like the old “analyze” function from versions of YANK pre-1.0.

Returns a dictionary of analysis objects

get_states_energies()[source]¶

Extract the deconvoluted energies from a phase.

Energies from this are NOT decorrelated.

Returns:

sampled_energy_matrix : numpy.ndarray of shape K,K,N

Deconvoluted energy of sampled states evaluated at other sampled states.

Has shape (K,K,N) = (number of sampled states, number of sampled states, number of iterations)

Indexed by [k,l,n] where an energy drawn from sampled state [k] is evaluated in sampled state [l] at iteration [n]

unsampled_energy_matrix : numpy.ndarray of shape K,L,N

Has shape (K, L, N) = (number of sampled states, number of UN-sampled states, number of iterations)

Indexed by [k,l,n] where an energy drawn from sampled state [k] is evaluated in un-sampled state [l] at iteration [n]

static get_timeseries(passed_timeseries)[source]¶

Generate the timeseries that is generated for this phase

Returns:

generated_timeseries : 1-D iterable

timeseries which can be fed into get_decorrelation_time to get the decorrelation

class yank.analyze.ReplicaExchangeAnalyzer(reporter, name=None, reference_states=(0, -1), analysis_kwargs=None)[source]¶

The ReplicaExchangeAnalyzer is the analyzer for a simulation generated from a Replica Exchange sampler simulation, implemented as an instance of the YankPhaseAnalyzer.

See also

YankPhaseAnalyzer

generate_mixing_statistics(number_equilibrated=None)[source]¶

Generate the Transition state matrix and sorted eigenvalues

Parameters:

number_equilibrated : int, optional, default=None

If specified, only samples number_equilibrated:end will be used in analysis If not specified, automatically retrieves the number from equilibration data or generates it from the internal energy.

Returns:

mixing_stats : np.array of shape [nstates, nstates]

Transition matrix estimate

mu : np.array

Eigenvalues of the Transition matrix sorted in descending order

show_mixing_statistics(cutoff=0.05, number_equilibrated=None)[source]¶

Print summary of mixing statistics. Passes information off to generate_mixing_statistics then prints it out to the logger

Parameters:

cutoff : float, optional, default=0.05

Only transition probabilities above ‘cutoff’ will be printed

number_equilibrated : int, optional, default=None

If specified, only samples number_equilibrated:end will be used in analysis If not specified, it uses the internally held statistics best

get_states_energies()[source]¶

Extract and decorrelate energies from the ncfile to gather energies common data for other functions

Returns:

energy_matrix : ndarray of shape (K,K,N)

Potential energy matrix of the sampled states Energy is from each sample n, drawn from state (first k), and evaluated at every sampled state (second k)

unsampled_energy_matrix : ndarray of shape (K,L,N)

Potential energy matrix of the unsampled states Energy from each sample n, drawn from sampled state k, evaluated at unsampled state l If no unsampled states were drawn, this will be shape (K,0,N)

static get_timeseries(passed_timeseries)[source]¶

Compute the timeseries of a simulation from the Replica Exchange simulation. This is the sum of energies for each sample from the state it was drawn from.

Parameters:

passed_timeseries : ndarray of shape (K,L,N), indexed by k,l,n

K is the total number of sampled states

L is the total states we want MBAR to analyze

N is the total number of samples

The kth sample was drawn from state k at iteration n, the nth configuration of kth state is evaluated in thermodynamic state l

Returns:

u_n : ndarray of shape (N,)

Timeseries to compute decorrelation and equilibration data from.

get_free_energy()[source]¶

Compute the free energy and error in free energy from the MBAR object

Output shape changes based on if there are unsampled states detected in the sampler

Returns:

DeltaF_ij : ndarray of floats, shape (K,K) or (K+2, K+2)

Difference in free energy from each state relative to each other state

dDeltaF_ij : ndarray of floats, shape (K,K) or (K+2, K+2)

Error in the difference in free energy from each state relative to each other state

get_enthalpy()[source]¶

Compute the difference in enthalpy and error in that estimate from the MBAR object

Output shape changes based on if there are unsampled states detected in the sampler

Returns:

DeltaH_ij : ndarray of floats, shape (K,K) or (K+2, K+2)

Difference in enthalpy from each state relative to each other state

dDeltaH_ij : ndarray of floats, shape (K,K) or (K+2, K+2)

Error in the difference in enthalpy from each state relative to each other state

get_entropy()[source]¶

Compute the difference in entropy and error in that estimate from the MBAR object

Output shape changes based on if there are unsampled states detected in the sampler

Returns:

DeltaH_ij : ndarray of floats, shape (K,K) or (K+2, K+2)

Difference in enthalpy from each state relative to each other state

dDeltaH_ij : ndarray of floats, shape (K,K) or (K+2, K+2)

Error in the difference in enthalpy from each state relative to each other state

get_standard_state_correction()[source]¶

Compute the standard state correction free energy associated with the Phase.

This usually is just a stored variable, but it may need other calculations.

Returns:

standard_state_correction : float

Free energy contribution from the standard_state_correction

kT¶

Quantity of boltzmann constant times temperature of the phase in units of energy per mol

Allows conversion between dimensionless energy and unit bearing energy

mbar¶: MBAR object tied to this phase

name¶: User-readable string name of the phase

observables¶

List of observables that the instanced analyzer can compute/fetch.

This list is automatically compiled upon class initialization based on the functions implemented in the subclass

reference_states¶: Tuple of reference states i and j for MultiPhaseAnalyzer instances

reporter¶: Sampler Reporter tied to this object.

class yank.analyze.MultiPhaseAnalyzer(phases)[source]¶

Multiple Phase Analyzer creator, not to be directly called itself, but instead called by adding or subtracting different implemented YankPhaseAnalyzer or other MultiPhaseAnalyzers‘s. The individual Phases of the MultiPhaseAnalyzer are only references to existing Phase objects, not copies. All YankPhaseAnalyzer and MultiPhaseAnalyzer classes support + and - operations.

The observables of this phase are determined through inspection of all the passed in phases and only observables which are shared can be computed. For example:

PhaseA has .get_free_energy and .get_entropy

PhaseB has .get_free_energy and .get_enthalpy,

PhaseAB = PhaseA + PhaseB will only have a .get_free_energy method

Because each Phase may have a different number of states, the reference_states property of each phase determines which states from each phase to read the data from.

For observables defined by two states, the i’th and j’th reference states are used:

If we define PhaseAB = PhaseA - PhaseB

Then PhaseAB.get_free_energy() is roughly equivalent to doing the following:

A_i, A_j = PhaseA.reference_states

B_i, B_j = PhaseB.reference_states

PhaseA.get_free_energy()[A_i, A_j] - PhaseB.get_free_energy()[B_i, B_j]

The above is not exact since get_free_energy returns an error estimate as well

For observables defined by a single state, only the i’th reference state is used

Given PhaseAB = PhaseA + PhaseB, PhaseAB.get_temperature() is equivalent to:

A_i = PhaseA.reference_states[0]

B_i = PhaseB.reference_states[0]

PhaseA.get_temperature()[A_i] + PhaseB.get_temperature()[B_i]

For observables defined entirely by the phase, no reference states are needed.

Given PhaseAB = PhaseA + PhaseB, PhaseAB.get_standard_state_correction() gives:

PhaseA.get_standard_state_correction() + PhaseB.get_standard_state_correction()

This class is public to see its API.

Parameters:

phases : dict

has keys “phases”, “names”, and “signs”

Attributes

`observables`	List of observables this `MultiPhaseAnalyzer` can generate
`phases`	List of implemented `YankPhaseAnalyzer`‘s objects this `MultiPhaseAnalyzer` is tied to
`names`	Unique list of string names identifying this phase.
`signs`	List of signs that are used by the `MultiPhaseAnalyzer` to

observables¶: List of observables this MultiPhaseAnalyzer can generate

phases¶: List of implemented YankPhaseAnalyzer‘s objects this MultiPhaseAnalyzer is tied to

names¶

Unique list of string names identifying this phase. If this MultiPhaseAnalyzer is combined with another, its possible that new names will be generated unique to that MultiPhaseAnalyzer, but will still reference the same phase.

When in doubt, use MultiPhaseAnalyzer.phases() to get the actual phase objects.

signs¶: List of signs that are used by the MultiPhaseAnalyzer to

yank.analyze.analyze_directory(source_directory)[source]¶

Analyze contents of store files to compute free energy differences.

This function is needed to preserve the old auto-analysis style of YANK. What it exactly does can be refined when more analyzers and simulations are made available. For now this function exposes the API.

Parameters:

source_directory : string

The location of the simulation storage files.

yank.analyze.extract_u_n(ncfile)[source]¶

Extract timeseries of u_n = - log q(X_n) from store file

where q(X_n) = pi_{k=1}^K u_{s_{nk}}(x_{nk})

with X_n = [x_{n1}, ..., x_{nK}] is the current collection of replica configurations s_{nk} is the current state of replica k at iteration n u_k(x) is the kth reduced potential

TODO: Figure out a way to remove this function

Parameters:

ncfile : netCDF4.Dataset

Open NetCDF file to analyze

Returns:

u_n : numpy array of numpy.float64

u_n[n] is -log q(X_n)

yank.analyze.extract_trajectory(nc_path, nc_checkpoint_file=None, state_index=None, replica_index=None, start_frame=0, end_frame=-1, skip_frame=1, keep_solvent=True, discard_equilibration=False, image_molecules=False)[source]¶

Extract phase trajectory from the NetCDF4 file.

Parameters:

nc_path : str

Path to the primary nc_file storing the analysis options

nc_checkpoint_file : str or None, Optional

File name of the checkpoint file housing the main trajectory Used if the checkpoint file is differently named from the default one chosen by the nc_path file. Default: None

state_index : int, optional

The index of the alchemical state for which to extract the trajectory. One and only one between state_index and replica_index must be not None (default is None).

replica_index : int, optional

The index of the replica for which to extract the trajectory. One and only one between state_index and replica_index must be not None (default is None).

start_frame : int, optional

Index of the first frame to include in the trajectory (default is 0).

end_frame : int, optional

Index of the last frame to include in the trajectory. If negative, will count from the end (default is -1).

skip_frame : int, optional

Extract one frame every skip_frame (default is 1).

keep_solvent : bool, optional

If False, solvent molecules are ignored (default is True).

discard_equilibration : bool, optional

If True, initial equilibration frames are discarded (see the method pymbar.timeseries.detectEquilibration() for details, default is False).

Returns:

trajectory: mdtraj.Trajectory

The trajectory extracted from the netcdf file.