Analyze

YANK Specific analysis tools for YANK simulations from the yank.yank.AlchemicalPhase classes

Extends classes from the MultiStateAnalyzer package to include the

class yank.analyze.ExperimentAnalyzer(store_directory, **analyzer_kwargs)[source]

Semi-automated YANK Experiment analysis with serializable data.

This class is designed to replace the older analyze_directory functions by providing a common analysis data interface which other classes and methods can draw on. This is designed to semi-automate the combination of multi-phase data

Each of the main methods fetches the data from each phase and returns them as a dictionary to the user. The total dump of data to serialized YAML files can also be done.

Each function documents what its output data structure and entries surrounded by curly braces ({ }) indicate variables which change per experiment, often the data.

Output dictionary is of the form:

yank_version: {YANK Version}
phase_names: {Name of each phase, depends on simulation type}
general: {See :func:`get_general_simulation_data`}
equilibration: {See :func:`get_equilibration_data`}
mixing: {See :func:`get_mixing_data`}
free_energy: {See :func:`get_experiment_free_energy_data`}
Parameters:
store_directory : string

Location where the analysis.yaml file is and where the NetCDF files are

**analyzer_kwargs

Keyword arguments to pass to the analyzer class. Quantities can be passed as strings.

Examples

Start with an experiment (Running from the yank.experiment.ExperimentBuilder example)

>>> import textwrap
>>> import openmmtools as mmtools
>>> import yank.utils
>>> import yank.experiment.ExperimentBuilder as ExperimentBuilder
>>> setup_dir = yank.utils.get_data_filename(os.path.join('..', 'examples',
...                                          'p-xylene-implicit', 'input'))
>>> pxylene_path = os.path.join(setup_dir, 'p-xylene.mol2')
>>> lysozyme_path = os.path.join(setup_dir, '181L-pdbfixer.pdb')
>>> with mmtools.utils.temporary_directory() as tmp_dir:
...     yaml_content = '''
...     ---
...     options:
...       default_number_of_iterations: 1
...       output_dir: {}
...     molecules:
...       T4lysozyme:
...         filepath: {}
...       p-xylene:
...         filepath: {}
...         antechamber:
...           charge_method: bcc
...     solvents:
...       vacuum:
...         nonbonded_method: NoCutoff
...     systems:
...         my_system:
...             receptor: T4lysozyme
...             ligand: p-xylene
...             solvent: vacuum
...             leap:
...               parameters: [leaprc.gaff, leaprc.ff14SB]
...     protocols:
...       absolute-binding:
...         complex:
...           alchemical_path:
...             lambda_electrostatics: [1.0, 0.9, 0.8, 0.6, 0.4, 0.2, 0.0]
...             lambda_sterics: [1.0, 0.9, 0.8, 0.6, 0.4, 0.2, 0.0]
...         solvent:
...           alchemical_path:
...             lambda_electrostatics: [1.0, 0.8, 0.6, 0.3, 0.0]
...             lambda_sterics: [1.0, 0.8, 0.6, 0.3, 0.0]
...     experiments:
...       system: my_system
...       protocol: absolute-binding
...     '''.format(tmp_dir, lysozyme_path, pxylene_path)
>>> yaml_builder = ExperimentBuilder(textwrap.dedent(yaml_content))
>>> yaml_builder.run_experiments()

Now analyze the experiment

>>> import os
>>> exp_analyzer = ExperimentAnalyzer(os.path.join(tmp_dir, 'experiment'))
>>> analysis_data = exp_analyzer.auto_analyze()
Attributes:
use_full_trajectory : bool. Analyze with subsampled or complete trajectory
nphases : int. Number of phases detected
phases_names : list of phase names. Used as keys on all attributes below
signs : dict of str. Sign assigned to each phase
analyzers : dict of YankPhaseAnalyzer
iterations : dict of int. Number of maximum iterations in each phase
u_ns : dict of np.ndarray. Timeseries of each phase
nequils : dict of int. Number of equilibrium iterations in each phase
g_ts : dict of int. Subsample rate past nequils in each phase
Neff_maxs : dict of int. Number of effective samples in each phase
get_general_simulation_data()[source]

General purpose simulation data on number of iterations, number of states, and number of atoms. This just prints out this data in a regular, formatted pattern.

Output is of the form:

{for phase_name in phase_names}
    iterations : {int}
    natoms : {int}
    nreplicas : {int}
    nstates : {int}
Returns:
general_data : dict

General simulation data by phase.

get_equilibration_data(discard_from_start=1)[source]

Create the equilibration scatter plots showing the trend lines, correlation time, and number of effective samples

Output is of the form:

{for phase_name in phase_names}
    discarded_from_start : {int}
    effective_samples : {float}
    subsample_rate : {float}
    iterations_considered : {1D np.ndarray of int}
    subsample_rate_by_iterations_considered : {1D np.ndarray of float}
    effective_samples_by_iterations_considered : {1D np.ndarray of float}
    count_total_equilibration_samples : {int}
    count_decorrelated_samples : {int}
    count_correlated_samples : {int}
    percent_total_equilibration_samples : {float}
    percent_decorrelated_samples : {float}
    percent_correlated_samples : {float}
Returns:
equilibration_data : dict

Dictionary with the equilibration data

get_mixing_data()[source]

Get state diffusion mixing arrays

Output is of the form:

{for phase_name in phase_names}
    transitions : {[nstates, nstates] np.ndarray of float}
    eigenvalues : {[nstates] np.ndarray of float}
    stat_inefficiency : {float}
Returns:
mixing_data : dict

Dictionary of mixing data

get_experiment_free_energy_data()[source]

Get the free Yank Experiment free energy, broken down by phase and total experiment

Output is of the form:

{for phase_name in phase_names}
    sign : {str of either '+' or '-'}
    kT : {units.quantity}
    free_energy_diff : {float (has units of kT)}
    free_energy_diff_error : {float (has units of kT)}
    free_energy_diff_standard_state_correction : {float (has units of kT)}
    enthalpy_diff : {float (has units of kT)}
    enthalpy_diff_error : {float (has units of kT)}
free_energy_diff : {float (has units of kT)}
free_energy_diff_error : {float (has units of kT)}
free_energy_diff_unit : {units.quantity compatible with energy/mole. Corrected for different phase kT}
free_energy_diff_error_unit : {units.quantity compatible with energy/mole. Corrected for different phase kT}
enthalpy_diff : {float (has units of kT)}
enthalpy_diff_error : {float (has units of kT)}
enthalpy_diff_unit : {units.quantity compatible with energy/mole. Corrected for different phase kT}
enthalpy_diff_error_unit : {units.quantity compatible with energy/mole. Corrected for different phase kT}
Returns:
free_energy_data : dict

Dictionary of free energy data

auto_analyze()[source]

Run the analysis

Output is of the form:

yank_version: {YANK Version}
phase_names: {Name of each phase, depends on simulation type}
general:
    {for phase_name in phase_names}
        iterations : {int}
        natoms : {int}
        nreplicas : {int}
        nstates : {int}
equilibration:
    {for phase_name in phase_names}
        discarded_from_start : {int}
        effective_samples : {float}
        subsample_rate : {float}
        iterations_considered : {1D np.ndarray of int}
        subsample_rate_by_iterations_considered : {1D np.ndarray of float}
        effective_samples_by_iterations_considered : {1D np.ndarray of float}
        count_total_equilibration_samples : {int}
        count_decorrelated_samples : {int}
        count_correlated_samples : {int}
        percent_total_equilibration_samples : {float}
        percent_decorrelated_samples : {float}
        percent_correlated_samples : {float}
mixing:
    {for phase_name in phase_names}
        transitions : {[nstates, nstates] np.ndarray of float}
        eigenvalues : {[nstates] np.ndarray of float}
        stat_inefficiency : {float}
free_energy:
    {for phase_name in phase_names}
        sign : {str of either '+' or '-'}
        kT : {units.quantity}
        free_energy_diff : {float (has units of kT)}
        free_energy_diff_error : {float (has units of kT)}
        free_energy_diff_standard_state_correction : {float (has units of kT)}
        enthalpy_diff : {float (has units of kT)}
        enthalpy_diff_error : {float (has units of kT)}
    free_energy_diff : {float (has units of kT)}
    free_energy_diff_error : {float (has units of kT)}
    free_energy_diff_unit : {units.quantity compatible with energy/mole. Corrected for different phase kT}
    free_energy_diff_error_unit : {units.quantity compatible with energy/mole. Corrected for different phase kT}
    enthalpy_diff : {float (has units of kT)}
    enthalpy_diff_error : {float (has units of kT)}
    enthalpy_diff_unit : {units.quantity compatible with energy/mole. Corrected for different phase kT}
    enthalpy_diff_error_unit : {units.quantity compatible with energy/mole. Corrected for different phase kT}
Returns:
serialized_data : dict

Dictionary of all the auto-analysis calls organized by section headers. See each of the functions to see each of the sub-dictionary structures

dump_serial_data(path)[source]

Dump the serialized data to YAML file

Parameters:
path : str

File name to dump the data to

yank.analyze.get_analyzer(file_base_path, **analyzer_kwargs)[source]

Utility function to convert storage file to a Reporter and Analyzer by reading the data on file

For now this is mostly placeholder functions since there is only the implemented ReplicaExchangeAnalyzer, but creates the API for the user to work with.

Parameters:
file_base_path : string

Complete path to the storage file with filename and extension.

**analyzer_kwargs

Keyword arguments to pass to the analyzer.

Returns:
analyzer : instance of implemented Yank*Analyzer

Analyzer for the specific phase.

yank.analyze.analyze_directory(source_directory, **analyzer_kwargs)[source]

Analyze contents of store files to compute free energy differences.

This function is needed to preserve the old auto-analysis style of YANK. What it exactly does can be refined when more analyzers and simulations are made available. For now this function exposes the API.

Parameters:
source_directory : string

The location of the simulation storage files.

**analyzer_kwargs

Keyword arguments to pass to the analyzer.

Returns:
analysis_data : dict

Dictionary containing all the automatic analysis data

yank.analyze.print_analysis_data(analysis_data, header=None)[source]

Helper function of printing the analysis data payload from a ExperimentAnalyzer.auto_analyze() call

Parameters:
analysis_data : dict

Output from ExperimentAnalyzer.auto_analyze()

header : str, Optional

Optional header string to print before the formatted helper, useful if you plan to make this call multiple times but want to divide the outputs.

class yank.analyze.MultiExperimentAnalyzer(script, **builder_kwargs)[source]

Automatic Analysis tool for Multiple YANK Experiments from YAML file

This class takes in a YAML file, infers the experiments from expansion of all combinatorial options, then does the automatic analysis run from ExperimentAnalyzer.auto_analyze() yielding a final dictionary output.

Parameters:
script : str or dict

Full path to the YAML file which made the YANK experiments. OR The loaded yaml content of said script.

builder_kwargs

Additional keyword arguments which normally are passed to the ExperimentBuilder constructor. The experiments are not setup or built, only the output structure is referenced.

run_all_analysis(serialize_data=True, serial_data_path=None, **analyzer_kwargs)[source]

Run all the automatic analysis through the ExperimentAnalyzer.auto_analyze()

Parameters:
serialize_data : bool, Default: True

Choose whether or not to serialize the data

serial_data_path: str, Optional

Name of the serial data file. If not specified, name will be {YAML file name}_analysis.pkl`

analyzer_kwargs

Additional keywords which will be fed into the YankMultiStateSamplerAnalyzer for each phase of each experiment.

Returns:
serial_output : dict

Dictionary of each experiment’s output of format {exp_name: ExperimentAnalyzer.auto_analyze() for exp_name in ExperimentBuilder’s Experiments} The sub-dictionary of each key can be seen in ExperimentAnalyzer.auto_analyze() docstring

yank.analyze.extract_u_n(ncfile)[source]

Extract timeseries of u_n = - log q(X_n) from store file

where q(X_n) = pi_{k=1}^K u_{s_{nk}}(x_{nk})

with X_n = [x_{n1}, ..., x_{nK}] is the current collection of replica configurations s_{nk} is the current state of replica k at iteration n u_k(x) is the kth reduced potential

TODO: Figure out a way to remove this function

Parameters:
ncfile : netCDF4.Dataset

Open NetCDF file to analyze

Returns:
u_n : numpy array of numpy.float64

u_n[n] is -log q(X_n)

yank.analyze.extract_trajectory(nc_path, nc_checkpoint_file=None, state_index=None, replica_index=None, start_frame=0, end_frame=-1, skip_frame=1, keep_solvent=True, discard_equilibration=False, image_molecules=False)[source]

Extract phase trajectory from the NetCDF4 file.

Parameters:
nc_path : str

Path to the primary nc_file storing the analysis options

nc_checkpoint_file : str or None, Optional

File name of the checkpoint file housing the main trajectory Used if the checkpoint file is differently named from the default one chosen by the nc_path file. Default: None

state_index : int, optional

The index of the alchemical state for which to extract the trajectory. One and only one between state_index and replica_index must be not None (default is None).

replica_index : int, optional

The index of the replica for which to extract the trajectory. One and only one between state_index and replica_index must be not None (default is None).

start_frame : int, optional

Index of the first frame to include in the trajectory (default is 0).

end_frame : int, optional

Index of the last frame to include in the trajectory. If negative, will count from the end (default is -1).

skip_frame : int, optional

Extract one frame every skip_frame (default is 1).

keep_solvent : bool, optional

If False, solvent molecules are ignored (default is True).

discard_equilibration : bool, optional

If True, initial equilibration frames are discarded (see the method pymbar.timeseries.detectEquilibration() for details, default is False).

Returns:
trajectory: mdtraj.Trajectory

The trajectory extracted from the netcdf file.