Analysis

The YANK analysis tools apply modern analysis techniques to help you better understand the final free energy output. In theory, we could just print out “Free energy of Binding = XX.XXX +- Y.YYY kcal/mol,” but that alone tells you nothing about the quality of your free energy estimate. We provide guides and built-in analysis to help you understand the numbers YANK produces.

The analysis framework offers three ways to analyze your data, each described under Analysis Usage: automatic analysis, visual analysis, and programmatic analysis.

Analysis Theory

We run YANK simulations through multiple data processing tools to improve the estimate of the free energy and other thermodynamic properties. We document them in the Algorithms page, but summarize them here and link to their detailed explanation in the appropriate page. When running automatically, these steps are run in the order listed to yield the final free energy estimate. The analysis procedure is carried out for each phase of the YANK simulation.

Equilibration and Decorrelation

We first discard non-equilibrated samples and then subsample to remove correlation effects between the remaining samples. For replica exchange, the timeseries used for this analysis is the sum of the potential energies from each sample, evaluated in the state it was drawn from. The timeseries is run through the detectEquilibration routine of the timeseries module of PyMBAR to determine the correlation times. The decorrelation rate is analyzed at each point in the timeseries by assuming all samples before that point are part of the equilibration time; the number of remaining samples is then computed by subsampling at the decorrelation rate. The point in the timeseries which preserves the maximum number of samples is treated as the sample at which equilibration is complete, and the remaining subsampled data are passed on to MBAR for free energy estimation.
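
The procedure above can be sketched in plain Python. This is a simplified illustration of the idea, not PyMBAR's implementation: the function names, the `stride` parameter, and the rule of truncating the autocorrelation sum at the first non-positive value are assumptions made for this example.

```python
import random

def statistical_inefficiency(series):
    """Estimate g = 1 + 2 * (sum of normalized autocorrelations), at least 1."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    variance = sum(d * d for d in deviations) / n
    if variance == 0.0:
        return 1.0  # constant series: treat samples as uncorrelated
    g = 1.0
    for lag in range(1, n - 1):
        c = sum(deviations[i] * deviations[i + lag] for i in range(n - lag))
        c /= (n - lag) * variance
        if c <= 0.0:
            break  # assumption: stop once noise dominates the estimate
        g += 2.0 * c * (1.0 - lag / n)
    return max(g, 1.0)

def detect_equilibration(series, stride=1):
    """Return (t0, g, n_eff) for the start point that keeps the most samples."""
    n = len(series)
    best = (0, 1.0, 0.0)
    for t0 in range(0, n - 1, stride):
        # Treat everything before t0 as equilibration time.
        g = statistical_inefficiency(series[t0:])
        n_eff = (n - t0) / g  # samples left after subsampling every g steps
        if n_eff > best[2]:
            best = (t0, g, n_eff)
    return best

# Synthetic timeseries: a decaying transient followed by stationary noise.
random.seed(0)
transient = [20.0 * (1 - i / 100) + random.gauss(0, 1) for i in range(100)]
series = transient + [random.gauss(0, 1) for _ in range(400)]
t0, g, n_eff = detect_equilibration(series, stride=10)
print("discard the first", t0, "samples; g =", round(g, 2))
```

Keeping the transient inflates the correlation time so much that very few effective samples survive, which is why the maximum is found only after the transient has been discarded.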

Replica Mixing Convergence

Converged free energy estimates require sufficient mixing of replicas, which indicates good phase space overlap. Replica exchange simulations enhance sampling by allowing configurations to swap into new states, reducing energetic or steric barriers. However, if no or few swaps occur, then the replica exchange provides little additional benefit over single simulations. Furthermore, there is a good chance you have low phase space overlap between the states the replicas can sample, which may not only yield an incorrect free energy estimate, but also underestimate the error in the free energy.

Note

Terminology Note: The “Replica” is the time-continuous trajectory of particles and box vectors. A (thermodynamic) state is the collection of forces and rules which dictate how particles interact with each other and the surrounding bulk environment, including temperature, pressure, and Hamiltonian. In YANK, Hamiltonian Replica Exchange is carried out by proposing that replicas swap their states, so each replica has a time-discontinuous state associated with it as well.

YANK provides visual and quantitative guides for how well the replicas mixed. The first guide is the transition state matrix, which measures how frequently states swapped replicas with each other. In automatic or programmatic mode, the transition state matrix is shown as a table indexed by i, j where each entry shows the number of times state i (row) swapped with state j (column). In visual mode, a heat map shows the same data, where darker shading indicates more frequent exchange between states. Note that if a state does not exchange replicas, it is counted as “exchanging with itself.” You want each state to exchange with at least one other state.

The subdominant eigenvalue is computed as a quantitative metric of how well mixed the replicas are. It is the second-largest eigenvalue of the transition state matrix. This quantity provides an estimate of how decomposable the transition state matrix is and of how many iterations it would take for one state to swap to all replicas. The lower this number is, the better the mixing.
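
As a rough illustration of this metric, the sketch below builds a transition matrix from a record of which state each replica occupied at each iteration and extracts the subdominant eigenvalue. It assumes NumPy is available and that every state is visited at least once. The function name, the input layout, the symmetrization, and the normalization of counts into fractions are choices made for this example and are not taken from YANK's code.

```python
import numpy as np

def mixing_statistics(replica_states, n_states):
    """replica_states[iteration][replica] holds that replica's state index.

    Returns the row-normalized transition matrix and its subdominant
    (second-largest) eigenvalue.
    """
    counts = np.zeros((n_states, n_states))
    for before, after in zip(replica_states[:-1], replica_states[1:]):
        for i, j in zip(before, after):
            counts[i, j] += 1  # a replica moved from state i to state j
    # Assumption: forward and backward moves are pooled (symmetrized)
    # before normalizing each row into fractions.
    symmetric = counts + counts.T
    transition = symmetric / symmetric.sum(axis=1, keepdims=True)
    eigenvalues = np.sort(np.linalg.eigvals(transition).real)[::-1]
    return transition, eigenvalues[1]

# Two replicas that swap states on every other iteration.
mixed = [[0, 1], [0, 1], [1, 0], [1, 0]] * 2 + [[0, 1]]
# Two replicas that never swap.
stuck = [[0, 1]] * 9

matrix_mixed, mu_mixed = mixing_statistics(mixed, n_states=2)
matrix_stuck, mu_stuck = mixing_statistics(stuck, n_states=2)
print("mixed:", mu_mixed, "stuck:", mu_stuck)
```

In the stuck case the matrix is the identity, so it decomposes into independent blocks and the subdominant eigenvalue equals 1. In the mixed case every row is uniform and the subdominant eigenvalue drops to 0.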

The visual analyze mode provides an additional qualitative guide for how well individual replicas sampled each state. The state each replica is in is plotted as a function of time to see if any particular replicas are getting stuck in a single state.
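
The same check can be made without a plot. The following minimal sketch flags replicas that never leave their starting state. The input layout (one list of state indices per iteration) and the function names are hypothetical and chosen only for this example.

```python
def states_visited(replica_states):
    """replica_states[iteration][replica] holds that replica's state index.

    Returns one set of visited states per replica.
    """
    n_replicas = len(replica_states[0])
    visited = [set() for _ in range(n_replicas)]
    for iteration in replica_states:
        for replica, state in enumerate(iteration):
            visited[replica].add(state)
    return visited

def stuck_replicas(replica_states):
    """Indices of replicas that stayed in a single state the whole time."""
    visited = states_visited(replica_states)
    return [replica for replica, states in enumerate(visited) if len(states) == 1]

trajectory = [
    [0, 1, 2],
    [1, 0, 2],
    [0, 1, 2],
    [1, 0, 2],
]
stuck = stuck_replicas(trajectory)  # replica 2 never leaves state 2
print("stuck replicas:", stuck)
```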

Free Energy Estimation

The free energy difference and its error are estimated through the Multistate Bennett Acceptance Ratio (MBAR) implemented in PyMBAR. The equilibrated and decorrelated data are analyzed by MBAR with default values for the initial guess. If there are unsampled states, such as when anisotropic-dispersion corrections are accounted for, then the free energy differences to those states are also estimated using one-sided exponential reweighting. The final free energy difference of scientific interest, such as binding or hydration, is obtained by adding together the free energies of the individual phases, along with any standard state corrections.
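
Two of these steps are simple enough to sketch directly. The first function below is the textbook one-sided exponential reweighting estimator in units of kT. The second adds per-phase contributions and combines their errors in quadrature, which assumes the phases are statistically independent. Both function names and the input conventions are assumptions for this example, and the sketch does not reproduce YANK's or PyMBAR's code.

```python
import math

def exp_free_energy(reduced_work):
    """One-sided exponential reweighting estimate, in units of kT.

    reduced_work[n] = u_target(x_n) - u_sampled(x_n) for configurations x_n
    drawn from the sampled state.
    """
    n = len(reduced_work)
    shift = min(reduced_work)  # log-sum-exp shift for numerical stability
    total = sum(math.exp(-(w - shift)) for w in reduced_work)
    return shift - math.log(total / n)

def combine_phases(contributions):
    """Sum (free_energy, standard_error) pairs that are already signed.

    Assumption: errors are independent, so they add in quadrature. A
    standard state correction can be passed as (value, 0.0).
    """
    total = sum(free_energy for free_energy, _ in contributions)
    error = math.sqrt(sum(err * err for _, err in contributions))
    return total, error

delta_f = exp_free_energy([1.0, 3.0])
total, error = combine_phases([(3.0, 0.3), (-1.0, 0.4)])
print("EXP estimate:", delta_f, "combined:", total, "+-", error)
```

The exponential average is dominated by the samples with the lowest reduced work, so the estimate always lies at or below the mean of the inputs.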

Analysis Usage

Automatic Analysis

YANK looks at a simulation output directory and makes automatic choices based on best practices and procedures. All the steps taken here are outlined above, in the order they are run. You can use the yank analyze facility even while a simulation is still running.

$ yank analyze --store={experiments}

Replace {experiments} with the output directory for your simulation. This minimal form of the command has two limitations: it only outputs to the terminal, and it can only target a single directory. Both can be addressed with a more complex invocation of the command:

$ yank analyze --yaml={YAML file that was run with yank script}

This form, with the -y YAML or --yaml=YAML flag, tells the analysis to look inside the YAML file which was read to run the experiment(s) you want to analyze. In this form, the experiment paths and names are determined from the YAML file, and the command can analyze any number of combinatorial experiments found within the file. The -y flag is mutually exclusive with -s. This form still outputs only to the terminal, so we need another flag to save the data to disk.

$ yank analyze -y YAML -e {serial Pickle file}

The -e SERIAL (alternately --serial=SERIAL) flag tells yank analyze to save the analysis data to the SERIAL file in Pickle format. This saves not only the final free energy, but also all the intermediate values. See the docs for the ExperimentAnalyzer class for how the output data is formatted and what values you can extract. The -e option can be combined with either the -s or the -y flag.
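
Reading the serialized file back takes only the standard library. The sketch below loads a Pickle file and lists its nested keys with the type of each value. The `example` dictionary and its keys are hypothetical stand-ins so that the code runs on its own; they are not the real YANK output layout, which is described in the ExperimentAnalyzer docs.

```python
import os
import pickle
import tempfile

def load_analysis(path):
    """Load a serialized analysis file. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

def describe(data, prefix=""):
    """List nested dictionary keys with the type of each leaf value."""
    lines = []
    if isinstance(data, dict):
        for key, value in data.items():
            lines.extend(describe(value, prefix + "/" + str(key)))
    else:
        lines.append(prefix + ": " + type(data).__name__)
    return lines

# Stand-in data with HYPOTHETICAL keys, written to a temporary file.
example = {"free_energy": {"value": -7.5, "error": 0.2}}
path = os.path.join(tempfile.mkdtemp(), "output.pkl")
with open(path, "wb") as handle:
    pickle.dump(example, handle)

loaded = load_analysis(path)
summary = describe(loaded)
print("\n".join(summary))
```

Because Pickle restores objects by importing their classes, loading a real output file most likely requires the packages that define the stored types (such as NumPy and the unit-handling library) to be importable in the Python environment doing the loading.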

Note

We know Pickle has problems when importing across Python versions and are working on a solution to carry the data in a more universal way. The data are composed of simple numbers, NumPy arrays, and SimTK Quantities for unit handling, so we are exploring options to represent them in a transferable way which does not require intimate knowledge of what unit system YANK outputs in.

Visual Analysis

YANK can create Jupyter Notebooks that present the analysis of your simulations visually, rather than as walls of text and numbers. These notebooks behave similarly to the automatic analysis in that they follow the same set of data processing and analysis methods, but they render graphical representations of the same data.

Note

Rendering these notebooks requires both the jupyter and matplotlib packages and their dependencies. These are not required to run YANK itself, and will not be installed by default if you installed YANK through conda, pip, or setup.py. You can still create the notebook without these packages.

To generate the notebook, use the following command:

$ yank analyze report --store={experiments} --output={mynotebook.ipynb} {--format ipynb}

Replace {experiments} with the output directory for your simulation and {mynotebook.ipynb} with a filename to save the notebook as. The file type is no longer inferred from the filename extension; it is set through the --format flag instead.

The --format flag is optional. The valid options are ipynb (Jupyter Notebook, the default), pdf, or html. YANK will generate, render, and export the notebook to the corresponding file type. Note that additional packages or external programs may be required to use this feature (e.g. pdf requires a xelatex binary to be on the current system path).

The visual notebook system also supports multi-experiment analysis, but this changes the meaning of the --output | -o flag. To use it, replace the -s or --store flag with -y or --yaml. Here is an example:

$ yank analyze report -y {a_file.yaml} -o {a_directory} {--format ipynb}

-y {target} and --yaml={target} are the same, as are -o {target} and --output={target}. Note that the -o flag now targets a directory, because multiple output files are generated. This is the main reason for the addition of the --format flag. If {a_directory} is not a directory, an error is thrown.

Parallel Analysis

Each invocation of the command line analysis now supports parallel analysis through MPI. Much in the same way that you can run parallel simulations, you can run parallel analysis. This is most helpful with the --yaml flag, where multiple experiments are being analyzed. The parallelization is naive: each experiment runs on its own MPI process, and the individual experiments themselves are not parallelized.

You only need to wrap the analysis command in an MPI call, like so:

$ mpiexec yank analyze -y a_file.yaml -e output.pkl
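
To make the consequence of this naive scheme concrete, here is a small sketch of one way whole experiments could be dealt out to processes. The round-robin rule and the function name are assumptions for illustration only; how YANK actually schedules experiments is not described here.

```python
def assign_experiments(experiments, n_processes):
    """Deal whole experiments out to processes in round-robin order.

    Hypothetical helper: each experiment is handled by exactly one process.
    """
    return [experiments[rank::n_processes] for rank in range(n_processes)]

plan = assign_experiments(["exp_a", "exp_b", "exp_c", "exp_d", "exp_e"], 2)
for rank, assigned in enumerate(plan):
    print("process", rank, "analyzes", assigned)

# With more processes than experiments, the extra processes have no work.
idle_plan = assign_experiments(["exp_a"], 3)
```

Under any scheme of this kind, launching more processes than there are experiments brings no further speedup, because a single experiment is never split across processes.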

Programmatic Analysis

The analyze module’s full API provides extensible, granular access to the analysis suite. This is helpful if you want to add new analysis methods, manipulate the data yourself, or integrate the analysis tools into your own code. Simply import yank.analyze into your code and use the API as you see fit. Should you find that you want your changes permanently added to YANK, feel free to open a pull request on GitHub to start the conversation and consideration!