Analysis¶
The YANK analysis tools apply modern analysis techniques to help you better understand the final free energy output. In theory, we could just print out “Free energy of Binding = XX.XXX +- Y.YYY kcal/mol,” but that alone tells you nothing about the quality of your free energy estimate. We provide guides and built-in analysis to help you understand the numbers YANK reports.
The analysis framework comes in three forms to help you analyze your data: automatic, visual, and programmatic. Each is described under Analysis Usage.
Analysis Theory¶
We run YANK simulations through multiple data processing tools to improve the estimate of the free energy and other thermodynamic properties. We document them on the Algorithms page and summarize them here. When analysis runs automatically, these steps are applied in the order given to yield the final free energy estimate. The analysis procedure is carried out for each phase of the YANK simulation.
Equilibration and Decorrelation¶
We first discard non-equilibrated samples and then subsample the remainder to remove correlations between samples.
For replica exchange, the timeseries used for this analysis is the sum of the potential energies of the samples,
each evaluated in the state it was drawn from. The timeseries is run through the detectEquilibration routine of the
timeseries module of PyMBAR to determine the equilibration and correlation times. Each point in the timeseries is
treated in turn as the end of equilibration: the decorrelation rate of the samples after that point is estimated, and
the number of samples that would remain after subsampling at that rate is computed. The point that preserves the
maximum number of samples is taken as the point at which equilibration is complete, and the remaining subsampled
data are passed on to MBAR for free energy estimation.
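The selection rule described above can be sketched in plain Python. This is a simplified illustration, not PyMBAR's implementation: the function names, the truncation of the autocorrelation sum at the first non-positive value, and the synthetic energies are all assumptions made for the example.

```python
import random


def statistical_inefficiency(series):
    """Estimate g = 1 + 2 * sum_t C(t) * (1 - t/n), where C is the normalized autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    variance = sum(d * d for d in deviations) / n
    if variance == 0.0:
        return 1.0  # a constant series carries no correlation information
    g = 1.0
    for lag in range(1, n - 1):
        # Normalized autocorrelation at this lag.
        c = sum(deviations[i] * deviations[i + lag] for i in range(n - lag))
        c /= (n - lag) * variance
        if c <= 0.0:
            break  # stop at the first non-positive value to limit noise
        g += 2.0 * c * (1.0 - lag / n)
    return max(g, 1.0)


def detect_equilibration(series, stride=1):
    """Return (t0, g, n_effective) for the start point that keeps the most uncorrelated samples."""
    best = (0, 1.0, 0.0)
    for t0 in range(0, len(series) - 1, stride):
        g = statistical_inefficiency(series[t0:])
        n_effective = (len(series) - t0) / g
        if n_effective > best[2]:
            best = (t0, g, n_effective)
    return best


# Synthetic "energies" (made up): a decaying transient followed by stationary noise.
random.seed(0)
energies = [10.0 * (0.9 ** t) + random.gauss(0.0, 1.0) for t in range(500)]
t0, g, n_effective = detect_equilibration(energies, stride=5)
print(f"discard first {t0} samples, g = {g:.2f}, ~{n_effective:.0f} uncorrelated samples")
```

Discarding the transient shortens the series but lowers the correlation time, so the number of effectively uncorrelated samples peaks at some start point after the beginning of the series.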
Replica Mixing Convergence¶
Converged free energy estimates require sufficient mixing of replicas, which indicates good phase space overlap. Replica exchange simulations enhance sampling by allowing configurations to swap into new states, which helps them get past energetic or steric barriers. However, if few or no swaps occur, replica exchange provides little additional benefit over single simulations. Furthermore, there is a good chance that the states the replicas can sample have low phase space overlap, which may not only yield an incorrect free energy estimate but also underestimate its error.
Note
Terminology Note: A “replica” is the time-continuous trajectory of particles and box vectors. A (thermodynamic) state is the collection of forces and rules which dictate how particles interact with each other and the surrounding bulk environment, including temperature, pressure, and Hamiltonian. In YANK, Hamiltonian Replica Exchange is carried out by proposing that replicas swap their states, so each replica also has a time-discontinuous state associated with it.
YANK provides visual and quantitative guides for how well the replicas mixed. The first guide is the transition
state matrix, which measures how frequently states swapped replicas with each other. In automatic or programmatic
mode, the transition state matrix is shown as a table indexed by i, j where each entry shows the number of times
state i (row) swapped with state j (column). In visual mode, a heat map shows the same data, with darker shading
indicating more frequent exchange between states. Note that if a state does not exchange replicas, it is counted as
“exchanging with itself.” You want each state to exchange with at least one other state.
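As a rough illustration of how such a table can be built, the sketch below counts state changes between consecutive iterations. The `replica_states[iteration][replica]` layout, the function names, and the symmetrize-then-normalize step are assumptions for this example, not a description of YANK's internal data structures.

```python
def count_state_transitions(replica_states, n_states):
    """Count i -> j state changes of every replica between consecutive iterations."""
    counts = [[0] * n_states for _ in range(n_states)]
    for before, after in zip(replica_states, replica_states[1:]):
        for state_i, state_j in zip(before, after):
            counts[state_i][state_j] += 1  # i == j means no exchange happened
    return counts


def to_transition_matrix(counts):
    """Symmetrize the counts and normalize each row so that it sums to 1."""
    n = len(counts)
    symmetric = [[0.5 * (counts[i][j] + counts[j][i]) for j in range(n)] for i in range(n)]
    matrix = []
    for row in symmetric:
        total = sum(row)
        matrix.append([x / total if total else 0.0 for x in row])
    return matrix


# Made-up example: 3 replicas, 3 states, 4 iterations.
replica_states = [
    [0, 1, 2],
    [1, 0, 2],
    [1, 2, 0],
    [0, 2, 1],
]
counts = count_state_transitions(replica_states, n_states=3)
matrix = to_transition_matrix(counts)
for row in counts:
    print(row)
```

Diagonal entries count iterations in which a state stayed on the same replica, which corresponds to “exchanging with itself.”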
The subdominant eigenvalue is computed as a quantitative metric of how well mixed the replicas are. It is the second-largest eigenvalue of the transition state matrix. This quantity provides an estimate of how decomposable the transition state matrix is and of how many iterations it would take for one state to swap to all replicas. The lower this number, the better the mixing.
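A minimal sketch of this metric, assuming NumPy is available and that the transition state matrix has been normalized so each row sums to 1. The function names and the two example matrices are invented for illustration, and the timescale 1 / (1 - eigenvalue) is a common rough estimate rather than a quantity taken from this page.

```python
import numpy as np


def subdominant_eigenvalue(transition_matrix):
    """Return the second-largest eigenvalue of a row-stochastic matrix."""
    eigenvalues = np.linalg.eigvals(np.asarray(transition_matrix, dtype=float))
    # Sort by real part, largest first; the largest is 1 for a stochastic matrix.
    ordered = np.sort(eigenvalues.real)[::-1]
    return float(ordered[1])


def mixing_timescale(eigenvalue):
    """Rough number of iterations needed to mix (assumed estimate: 1 / (1 - eigenvalue))."""
    return float("inf") if eigenvalue >= 1.0 else 1.0 / (1.0 - eigenvalue)


well_mixed = [[0.5, 0.5], [0.5, 0.5]]        # states swap freely
poorly_mixed = [[0.99, 0.01], [0.01, 0.99]]  # states almost never swap

for name, matrix in [("well mixed", well_mixed), ("poorly mixed", poorly_mixed)]:
    value = subdominant_eigenvalue(matrix)
    print(f"{name}: eigenvalue = {value:.3f}, timescale ~ {mixing_timescale(value):.1f} iterations")
```

An eigenvalue near 0 means the states decorrelate within about one iteration, while a value near 1 means the matrix is close to decomposing into blocks that rarely exchange.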
The visual analyze mode provides an additional qualitative guide for how well individual replicas sampled each state. The state each replica is in is plotted as a function of time to see if any particular replicas are getting stuck in a single state.
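The same check can be approximated without a plot by listing which states each replica visited. This is a hypothetical helper, not part of YANK; the `replica_states[iteration][replica]` layout and the example data are assumptions.

```python
def states_visited(replica_states):
    """Map each replica index to the sorted list of states it occupied."""
    n_replicas = len(replica_states[0])
    return {
        replica: sorted({iteration[replica] for iteration in replica_states})
        for replica in range(n_replicas)
    }


def stuck_replicas(replica_states):
    """Return the replicas that never left their initial state."""
    visited = states_visited(replica_states)
    return [replica for replica, states in visited.items() if len(states) == 1]


# Made-up example: replicas 0 and 1 trade states, replica 2 never moves.
replica_states = [
    [0, 1, 2],
    [1, 0, 2],
    [0, 1, 2],
    [1, 0, 2],
]
print(states_visited(replica_states))
print("stuck:", stuck_replicas(replica_states))
```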
Free Energy Estimation¶
The free energy difference and its error are estimated with the Multistate Bennett Acceptance Ratio (MBAR) as implemented in PyMBAR. The equilibrated and decorrelated data are analyzed by MBAR with default values for the initial guess. If there are unsampled states, such as when anisotropic dispersion corrections are accounted for, the free energy differences to those states are estimated using one-sided exponential reweighting. The final free energy difference of scientific interest, such as binding or hydration, is obtained by adding together the free energies of the individual phases, along with any standard state corrections.
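Two of the steps above are simple enough to sketch directly: one-sided exponential reweighting and the final sum over phases. The function names, the sign convention for each phase, the quadrature error propagation (which assumes the phases are statistically independent), and all numbers below are assumptions for illustration, not YANK output.

```python
import math


def exp_free_energy(reduced_energy_differences):
    """One-sided exponential averaging: dF = -ln <exp(-du)>, in units of kT."""
    n = len(reduced_energy_differences)
    # Shift by the largest exponent (log-sum-exp) for numerical stability.
    largest = max(-du for du in reduced_energy_differences)
    total = sum(math.exp(-du - largest) for du in reduced_energy_differences)
    return -(largest + math.log(total / n))


def combine_phases(phases, standard_state_correction=0.0):
    """Sum signed phase free energies and add their uncertainties in quadrature."""
    total = standard_state_correction
    variance = 0.0
    for sign, free_energy, uncertainty in phases:
        total += sign * free_energy
        variance += uncertainty ** 2
    return total, math.sqrt(variance)


# du = u_unsampled(x) - u_sampled(x) for each retained sample (made-up values).
print(exp_free_energy([2.0, 2.0, 2.0]))

# (sign, free energy, uncertainty) per phase, in kT (made-up values).
phases = [(+1, -10.0, 0.3), (-1, -4.0, 0.4)]
total, error = combine_phases(phases, standard_state_correction=1.0)
print(f"{total:.3f} +- {error:.3f} kT")
```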
Analysis Usage¶
Automatic Analysis¶
YANK looks at a simulation output directory and makes automatic choices based on best practices and procedures. The
steps taken are those outlined above, applied in that order. You can use the yank analyze facility even while YANK is
running.
$ yank analyze --store={experiments}
Replace {experiments} with the output directory for your simulation. This minimal form of the command has two
limitations: it only outputs to the terminal, and it can only target a single directory. Both can be lifted with
additional flags:
$ yank analyze --yaml={Some YAML file which ran with ``yank script``}
The -y YAML or --yaml=YAML flag tells the analysis to look inside the YAML file that was used to run the
experiment(s) you want to analyze. In this form, the experiment paths and names are determined from the YAML file,
and any number of combinatorial experiments found within the file can be analyzed. The -y flag is mutually exclusive
with -s. This form still only outputs to the terminal, so we need another flag to save the data to disk.
$ yank analyze -y YAML -e {serial Pickle file}
The -e SERIAL (alternatively --serial=SERIAL) flag tells yank analyze to save the analysis data to the SERIAL file in
Pickle format. This saves not only the final free energy, but also all intermediate values. See the docs for the
ExperimentAnalyzer class for how the output data is formatted and what values you can extract. The -e option can be
combined with either the -s or the -y flag.
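The saved file can be read back with Python's standard pickle module. The sketch below makes no assumption about the layout of the data beyond it possibly being nested dictionaries; the helper names are hypothetical, and the ExperimentAnalyzer docs remain the reference for the actual contents.

```python
import pickle


def load_analysis(path):
    """Load the object that was serialized with the -e / --serial flag."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(data, indent=0):
    """Return lines listing nested dictionary keys and the type of each leaf value."""
    lines = []
    if isinstance(data, dict):
        for key, value in data.items():
            lines.append(" " * indent + str(key))
            lines.extend(summarize(value, indent + 2))
    else:
        lines.append(" " * indent + type(data).__name__)
    return lines
```

For a file written with -e output.pkl, calling print("\n".join(summarize(load_analysis("output.pkl")))) prints an outline of what was stored. Unpickling needs the packages that define the stored types to be importable, and pickle files should only be loaded from sources you trust.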
Note
We know Pickle has problems when importing across Python versions and are working on a solution to carry the data in a more universal way. The data are composed of simple numbers, NumPy arrays, and SimTK Quantities for unit handling, so we are exploring options to represent them in a transferable way which does not require intimate knowledge of what unit system YANK outputs in.
Visual Analysis¶
YANK can create Jupyter Notebooks that analyze your simulations and show you more than just walls of text and numbers. These notebooks behave similarly to the automatic analysis in that they follow the same set of data processing and analysis methods, but they render graphical representations of the same data.
Note
Rendering these notebooks requires both the jupyter and matplotlib packages and their dependencies. These
are not required to run YANK itself, and will not be installed by default if you installed YANK through conda,
pip, or setup.py. You can still create the notebook without these packages.
To generate the notebook, use the following command:
$ yank analyze report --store={experiments} --output={mynotebook.ipynb} {--format ipynb}
Replace {experiments} with the output directory for your simulation and {mynotebook.ipynb} with the filename to save
the notebook as. The file type is no longer inferred from the filename extension; it is set through the optional
--format flag. The valid options are ipynb (Jupyter Notebook, the default), pdf, and html. YANK will generate, render,
and export the notebook to the corresponding file type. Note that additional packages or external programs may be
required to use this feature (e.g. pdf requires a xelatex binary to be on the current system path).
The visual notebook system also supports multi-experiment analysis, which changes the meaning of the --output | -o
flag. To use it, replace the -s or --store flag with -y or --yaml. Here is an example:
$ yank analyze report -y {a_file.yaml} -o {a_directory} {--format ipynb}
-y {target} and --yaml={target} are equivalent, as are -o {target} and --output={target}. Note that the -o flag now
targets a directory, because multiple output files are generated. This is the main reason for the addition of the
--format flag. If {a_directory} is not a directory, an error is thrown.
Parallel Analysis¶
Each invocation of the command line analysis now supports parallel analysis through MPI. Much in the same way
that you can run parallel simulations, you can run parallel analysis. This is most helpful with the --yaml flag,
where multiple experiments are being analyzed. The parallelization is naive: each experiment is analyzed by a single
MPI process, and the individual experiments themselves are not parallelized.
You only need to wrap the analysis command in an MPI call, like so:
$ mpiexec yank analyze -y a_file.yaml -e output.pkl
Programmatic Analysis¶
The full analyze module’s API provides extensible, granular access to the analysis suite.
This is helpful if you want to add new analysis methods, manipulate the data yourself, or integrate the analysis tools
into your own code. Simply import yank.analyze into your code and use the API as you see fit. Should you find that
you want your changes permanently added to YANK, feel free to open a pull request on GitHub to start the
conversation!