Let’s stop being sloppy about uncertainty

Let’s draw a line. Across the calendar, I mean. Let’s all pledge that from today on we’re going to give an honest accounting of the uncertainty in our data. I mean ‘honest’ in the sense that if someone tried to reproduce our data in the future, their confidence interval and ours would overlap.
There are a few conceptual issues to address up front. Let’s set up our discussion in terms of some variable q which we measure in a molecular dynamics (MD) simulation at successive configurations: q_1, q_2, q_3, and so on, up to q_M. Regardless of the length of our simulation, we can calculate the average of all the values, \overline{q} = \displaystyle\frac{1}{M}\sum_{i=1}^{M} q_i. We can also calculate the standard deviation σ of these values in the usual way as the square root of the variance. Both of these quantities will approach their “true” values (based on the simulation protocol) with enough sampling – with large enough M.

The first key point to get straight is one that you probably know already: the difference between the natural scale of variation of the variable q (see blue curve below, whose width is characterized by the standard deviation σ) and the uncertainty in the estimate of \overline{q} (see red curve below). The standard deviation describes the range of values one expects to see in the list of q_i values, but the variability in the estimate of the mean can be much smaller – if there are many independent measurements.
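To see the distinction numerically, here is a minimal Python sketch using hypothetical, uncorrelated toy data (real MD values are correlated, which is the whole problem discussed below):

    import numpy as np

    rng = np.random.default_rng(0)
    q = rng.normal(loc=2.0, scale=1.0, size=10000)   # 10,000 independent toy "measurements" of q

    sigma = q.std(ddof=1)             # natural spread of q itself (blue curve): ~1.0
    sem = sigma / np.sqrt(len(q))     # width of the distribution of the mean (red curve): ~0.01
    print(sigma, sem)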

The uncertainty in \overline{q} is best framed in terms of a confidence interval. Understanding confidence intervals, which characterize statistical uncertainty, is essential and can be accomplished with a simple “thought experiment.” Imagine you repeat your simulation of length M many times, each time generating an average \overline{q} (a draw from the red distribution in the figure). If you consider the distribution of these \overline{q} values, there is a 90% chance that the average from any future simulation (performed in an identical fashion, aside from stochasticity) will fall between the 5th and 95th percentiles of the \overline{q} distribution – regardless of whether all M values are independent. This range of percentiles is the 90% confidence interval – even if we don’t know how to obtain these percentiles in practice.
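The thought experiment can be mimicked directly in code. The sketch below again uses uncorrelated toy data and a made-up “protocol”; the point is only the logic of collecting many repeat means and reading off their 5th and 95th percentiles:

    import numpy as np

    rng = np.random.default_rng(1)
    M = 500                                  # length of each hypothetical "simulation"
    means = [rng.normal(2.0, 1.0, size=M).mean() for _ in range(2000)]   # 2000 repeats
    lo, hi = np.percentile(means, [5, 95])   # 90% of future repeat means land in [lo, hi]
    print(lo, hi)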

The confidence interval or another estimate of statistical uncertainty in principle can be smaller, even much smaller, than the standard deviation if the number of independent samples among the M values is extremely large. That is, if a simulation is very long compared to system timescales, the estimate of an average from a single simulation should be very close to the “true” mean, based on infinite simulation, and the difference can be very small compared to σ. So, with more data, the red curve in the figure above would get narrower, as would the corresponding confidence interval.

As you may know, a confidence interval can also be estimated from the standard error of the mean (an estimator of the more official “standard uncertainty” noted in the VIM reference), but this assumes the distribution of the mean estimate is approximately Gaussian (and hence symmetric) and requires knowing the number of independent samples in your data. In MD simulation, only a small fraction of the frames generated may be independent, and quantifying that fraction is a challenging if not impossible task, as described below.
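For completeness, here is what the Gaussian-based recipe looks like in code – a minimal sketch that simply assumes you have already obtained an effective number of independent samples n_eff (which, as just noted, is the hard part):

    import numpy as np

    def gaussian_90ci(q, n_eff):
        """Approximate 90% confidence interval for the mean of q, assuming the
        mean estimate is roughly Gaussian and n_eff samples are independent."""
        qbar = q.mean()
        sem = q.std(ddof=1) / np.sqrt(n_eff)            # standard error of the mean
        return qbar - 1.645 * sem, qbar + 1.645 * sem   # z(0.95) ~ 1.645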

Error bars really only make sense when you’re sure multiple independent samples have been gathered. Without independent samples, we have no way to understand the statistical uncertainty in our measurements.

In molecular simulation, we would want to know how many of the ‘zillions’ of trajectory frames are truly independent. Generally, this can be estimated by examining the autocorrelation time for the observable of interest (or implicitly via block-averaging) … with the risk that subtle correlations with slower processes may be masked by such an analysis. The transient “equilibration” time needed to relax the system initially also must be accounted for.
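One common (if crude) recipe estimates the integrated autocorrelation time of the observable and divides the production time by it. A simplified sketch is below; real analyses typically use block averaging or more careful windowing of the correlation sum, and slow hidden processes can still fool it:

    import numpy as np

    def integrated_autocorr_time(q, max_lag=None):
        """Crude integrated autocorrelation time (in frames) of a 1D time series q,
        truncating the sum at the first non-positive correlation value."""
        q = np.asarray(q, dtype=float) - np.mean(q)
        n = len(q)
        max_lag = max_lag or n // 10
        c0 = np.dot(q, q) / n
        tau = 1.0
        for t in range(1, max_lag):
            ct = np.dot(q[:-t], q[t:]) / (n - t)
            if ct <= 0:
                break
            tau += 2.0 * ct / c0
        return tau

    # after discarding the initial equilibration frames:
    # n_eff = n_production_frames / integrated_autocorr_time(q_production)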

To examine sample size using a global correlation time (aka “decorrelation” time), Ed Lyman and I proposed examining trajectory frames at increasingly large time intervals. The data at each interval were compared with the statistics expected for independent samples, in order to infer the smallest interval yielding quasi-independent behavior. This provides an estimate of the effective sample size for the whole system – i.e., the total simulation time divided by the minimum interval required for independence. The analysis can also show if the minimum interval is close to the overall simulation time, which would rule out good sampling.

How much of this is really relevant for the types of systems that usually are simulated?
If we’re honest, we’ll admit that for simulations of large and complex systems, we may never reach the timescales necessary to generate independent samples. What then? How do we even think about error in such cases?

To make things concrete, let’s assume we’ve run a one microsecond simulation of a system with a slow timescale of 10 μs. In this case, all of the trajectory frames are correlated. Of course, there’s nothing to stop us from calculating the standard deviation σ of any observable based on the trajectory, but we can expect that even the value of σ is highly biased – probably too small because we have not sampled important parts of the configuration space that require 10 μs to see. Certainly the standard error/uncertainty would be meaningless because there is not even one representative (independent) sample due to the short sampling time.

Well, there’s no getting around inadequate data, but statistical mechanics does have something to teach us here. We can view our single too-short trajectory, as in our prior thought experiment, as one sample of a distribution of 1 μs trajectories, all started from the same configuration. This is precisely the “initialized trajectory ensemble” discussed in an earlier post, which must obey a non-equilibrium Fokker-Planck equation. If we follow the ensemble of time-evolving trajectories, we can get the non-equilibrium configuration space distribution (ensemble) at any time point. This distribution will slowly relax toward equilibrium under typical conditions – or possibly to a non-equilibrium steady state under other constraints.

That conceptual picture is very powerful, but is it useful? Maybe or maybe not. If you just have the single too-short trajectory – sorry, you’re out of luck. However, if you have several too-short trajectories generated independently, then you can at least estimate the uncertainty in your observable of interest given the time elapsed and the initial state or initial distribution – call this an ensemble transient average. See the sketch below. For example, you could calculate the average of a quantity in each trajectory based on the interval from 95 – 100 ns. Then the multiple independent trajectories would enable you to estimate uncertainty in this explicitly transient quantity. It’s not an equilibrium estimate, but hey, it’s a genuine statistical mechanics observable that could also be measured by another group. Such an observable also might be compared among related systems (e.g., different ligands interacting with the same protein) in a statistically meaningful way.
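As a concrete sketch of such an ensemble transient average (function and variable names here are hypothetical, and the 95 – 100 ns window is assumed to map onto known frame indices):

    import numpy as np

    def transient_average(trajs, first_frame, last_frame):
        """Mean and approximate 90% CI half-width of an observable averaged over a
        fixed time window, using several independently initialized trajectories.
        trajs: list of 1D arrays of the observable, one per independent trajectory."""
        window_means = np.array([t[first_frame:last_frame].mean() for t in trajs])
        n = len(window_means)
        sem = window_means.std(ddof=1) / np.sqrt(n)   # trajectories are independent
        return window_means.mean(), 1.645 * sem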

If you want to learn more about uncertainty analysis – and figure out how to make meaningful error bars for your own data – consider looking at a joint effort I’m part of to ‘codify’ best practices in assessing and reporting uncertainty. This article will be submitted to a brand new journal, the “Living Journal of Computational Molecular Science,” a non-profit, community-run, low-cost, open-access venue for updateable articles oriented toward education and best practices. (Full disclosure: I helped to found the journal.) Feel free to look at the article and make suggestions directly on the article’s GitHub page.

Further Reading

Chodera JD. “A simple method for automated equilibration detection in molecular simulations,” Journal of Chemical Theory and Computation. 2016; 12(4):1799–1805.

Grossfield A, Patrone PN, Roe DR, Schultz AJ, Siderius DW, Zuckerman DM. “Best Practices for Quantification of Uncertainty and Sampling Quality in Molecular Simulations” (unpublished).

Grossfield A, Zuckerman DM. “Quantifying uncertainty and sampling quality in biomolecular simulations,” Annu Rep Comput Chem. 2009; 5:23–48.

JCGM. JCGM 200: “International vocabulary of metrology – Basic and general concepts and associated terms (VIM).” Joint Committee for Guides in Metrology; 2012.

Lyman E, Zuckerman DM. “On the Structural Convergence of Biomolecular Simulations by Determination of the Effective Sample Size,” J. Phys. Chem. B. 2007; 111(44):12876–12882.

Yang W, Bitetti-Putzer R, Karplus M. “Free energy simulations: Use of reverse cumulative averaging to determine the equilibrated region and the time required for convergence,” Journal of Chemical Physics. 2004; 120(6):2618–2628.

Zuckerman DM. Statistical Physics of Biomolecules: An Introduction. CRC Press; 2010.

What I have against (most) PMF calculations

Such a beautiful thing, the PMF. The potential of mean force is a ‘free energy landscape’ – the energy-like function whose Boltzmann factor exp[ -PMF(x) / kT ] gives the relative probability* of any value of a coordinate (or coordinate set) x, obtained by integrating out (averaging over) all other coordinates. For example, x could be the angle between two domains in a protein or the distance of a ligand from a binding site.
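In symbols, with p(x) the equilibrium probability density of x obtained by integrating out all other coordinates (and with the Jacobian caveat of the footnote below implicit in how x is defined):

PMF(x) = -kT \ln p(x) + \text{const}, \qquad p(x) \propto \displaystyle\int d\mathbf{r}\; e^{-U(\mathbf{r})/kT}\, \delta\big(x - x(\mathbf{r})\big)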

The PMF’s basis in statistical mechanics is clear. When visualized, its basins and barriers cry out “Mechanism!” and kinetics are often inferred from the heights of these features.

Yet aside from the probability part of the preceding paragraph, the rest is largely speculative and subjective … and that’s assuming the PMF is well-sampled, which I highly doubt in most biomolecular cases of interest.

Let’s deal with each of the issues in turn: mechanism, kinetics, and sampling.

[* Note that a precise interpretation of the PMF requires knowledge of the Jacobian for the chosen coordinate(s): some intervals (x, x+dx) may contain more Cartesian configuration space than others.]

Mechanism and the PMF

To my knowledge, David Chandler and coworkers were the first to highlight the dangers of the PMF. They described energy landscapes – potential energy functions – whose mechanisms could not be captured by obvious coordinate choices for PMF calculations (thus motivating path-sampling techniques). Expanding on those ideas, consider the two-dimensional landscapes below.

Although some of the landscapes perhaps could be described by some (tortuous) one-dimensional coordinate, in other instances that just wouldn’t be possible. Think how much worse the situation would have to be in a system like a protein with thousands of degrees of freedom.

Now take one more step on the road to skepticism and consider how PMFs tend to be constructed in computational studies. Because the calculations are so expensive, we tend to choose one or two coordinates in advance for study. These coordinate choices represent our pre-conceived ideas about how a system might function.

Based on the features of the PMF landscape, we build a story about mechanism … the dwells are here and the transition states there. And yet we know that all basins are not alike – some may be stabilized by energy and others favored by entropy. The heights of barriers, their widths, and likewise the dimensions of basins all are influenced by the particular coordinate(s) we chose in advance, as illustrated in the two-dimensional examples above.

So I would say the quantitative analysis of a PMF is inherently misleading, but the “stories” we build are arguably even more dangerous. Because our minds are naturally drawn to stories, such narratives are easy to retain. This wouldn’t be a problem if the stories were not biased by our subjective coordinate choices and perhaps also by inadequate sampling.

Is there a good way to think about mechanism? I believe the starting point has to be a trajectory ensemble of continuous unbiased transition events. Trajectories tell us the full sequence of events from an initial to a final state. And the ensemble reports on the diversity of mechanisms, something a PMF could never do. It’s true that trajectory ensembles are difficult to obtain, but at least they offer a potentially unbiased way to describe mechanism.

As a final perspective on mechanism, think about the goal of discovery. We use large amounts of computing resources and we would like to be able to discover something we did not know before. But if we only look at coordinates we presumed from the start to be important, we are severely impairing our ability to discover novel phenomena.

Perhaps it’s worth a moment to consider that our subjective coordinate choices typically are based on a further assumption – that we have good or complete knowledge of our system’s biological function. Is this really true?

Ideally, although it’s a challenge, we would perform unbiased analyses of path/trajectory ensembles to discover important coordinates. Significant work has already been done on automated discovery of coordinates, and I think such methodologies should play an increasingly important role in the future.

Does a PMF predict kinetics?

We’ve all done it – looked at a PMF (free energy landscape), estimated a barrier height, and tried to guess a rate constant for a process. If we haven’t tried to guess an absolute rate, we’ve at least said to ourselves something like, “Well, that barrier is higher than the other, so one process is slower than the other.”

But in principle, the PMF may not yield any information about kinetics at all! Look again at the two-dimensional landscapes above. Would you trust a one-dimensional PMF of any of these to give reliable rates? So why trust a projection from 1,000+ dimensions to one or two, as is usually done for biomolecular systems?

Again, I think part of the reason we over-interpret PMFs is because free energy landscapes speak to us like stories. We can’t resist the narrative that arises in the stat-mech part of our brains, even when we know better.

To drive home the point further, note that it’s even possible to construct a system whose PMF is exactly constant and yet whose dynamics in x are far from simple diffusion. One way would be a potential consisting of ‘anti-parallel’ valleys as in landscape (c) above, with the well depths and widths tuned so that the probability at each x value, integrated over y, is exactly the same.

Another constant-in-x PMF, based on a non-trivial potential energy in x and y, results from a double-well potential in x modulated by a harmonic y component whose width varies with x:
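The equation itself appears as a figure in the original post; one potential with the stated properties (an illustrative reconstruction, not necessarily the exact function used there) is

U(x, y) = U_{dw}(x) + \tfrac{1}{2}\, k(x)\, y^2 , \qquad k(x) = k_0\, e^{-2 U_{dw}(x)/kT} ,

where U_{dw}(x) is any double-well function of x (e.g., E_b (x^2 - 1)^2) and k_0 sets the valley width at the minima.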

You can check this yields a constant PMF by integrating over y and multiplying the result by the Boltzmann factor of the energy minimum at any x value. Below is a sample trajectory of x values (y not shown) simulated with overdamped Langevin (a.k.a. Brownian) dynamics.
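For the curious, here is a minimal Python sketch of how such a trajectory could be generated with simple Euler–Maruyama Brownian dynamics on the illustrative potential above (all parameter values are arbitrary choices for demonstration, not those behind the original figure):

    import numpy as np

    kT, Eb, k0 = 1.0, 2.5, 50.0          # arbitrary illustrative parameters
    gamma, dt, nsteps = 1.0, 1e-4, 1_000_000
    rng = np.random.default_rng(0)

    def forces(x, y):
        udw = Eb * (x**2 - 1.0)**2                 # double well in x
        dudw_dx = 4.0 * Eb * x * (x**2 - 1.0)
        k = k0 * np.exp(-2.0 * udw / kT)           # x-dependent harmonic width in y
        dk_dx = -2.0 * k * dudw_dx / kT
        return -(dudw_dx + 0.5 * dk_dx * y**2), -k * y   # (-dU/dx, -dU/dy)

    x, y = 1.0, 0.0
    noise = np.sqrt(2.0 * kT * dt / gamma)
    xtraj = np.empty(nsteps)
    for i in range(nsteps):
        fx, fy = forces(x, y)
        x += dt * fx / gamma + noise * rng.normal()
        y += dt * fy / gamma + noise * rng.normal()
        xtraj[i] = x
    # a histogram of xtraj (over long enough runs) is approximately flat in x,
    # even though the trajectory shows hopping rather than simple diffusion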

Clearly the behavior is not simple diffusion in x, even though the PMF(x) is constant. (Histogramming a lot of these trajectories into x bins numerically confirms the PMF is flat.) And that’s no surprise, from a theory standpoint. The PMF generally does not govern dynamics, except in rare cases where an ideal reaction coordinate has been used … and suitable effective dynamics have been specifically derived for the selected coordinate.

Are biomolecular PMFs well sampled?

Very sophisticated simulation and analysis techniques are used to calculate PMFs. But should we trust the results? To be clear, I’m not questioning the rigor behind the methods, but rather the likelihood that the PMFs are well-sampled.

Consider the popular analysis approach WHAM (weighted histogram analysis method). WHAM seeks to provide the best possible PMF given the data available from the different simulation windows. This is a very different goal from assessing whether sufficient sampling has been performed. For better and for worse, WHAM almost always provides a smooth estimate of the free energy profile. Our minds tend to confuse smoothness with reliable, well-sampled data. But the two things are completely independent.

Let’s be more concrete. How many ns are typically used in a single window of a WHAM protein calculation? Often the answer seems to be 10 ns or less. Is that really enough time to sample even a constrained protein process? When careful studies are done on small peptides, they suggest tens of ns are needed for good sampling of these tiny systems!

Now consider a thought experiment. If indeed every window of a WHAM-like PMF calculation is well sampled, then we have an equilibrium ensemble for that window. And if the PMF is accurate, we also know the relative free energy of that coordinate value. We therefore can generate an overall equilibrium ensemble by combining all the window ensembles, weighting each by the Boltzmann factor of the window-specific PMF. Would you trust this equilibrium ensemble? Further, an equilibrium ensemble can be projected onto any coordinate to generate a new PMF. Would you trust that new PMF?
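In code, the thought experiment might look like the deliberately naive sketch below (all names are hypothetical; a real recombination of overlapping, biased windows would itself require WHAM or MBAR, which is part of the point):

    import numpy as np

    def combine_windows(window_values, window_pmf, kT=1.0, bins=50):
        """Pool per-window equilibrium samples of some new coordinate, weighting each
        window's configurations by the Boltzmann factor of that window's PMF value,
        then project the pooled ensemble onto the new coordinate as a crude new PMF.
        window_values: list of arrays (new-coordinate values, one array per window)
        window_pmf: PMF value assigned to each window (same length as window_values)"""
        vals = np.concatenate(window_values)
        wts = np.concatenate([np.full(len(v), np.exp(-f / kT) / len(v))
                              for v, f in zip(window_values, window_pmf)])
        hist, edges = np.histogram(vals, bins=bins, weights=wts, density=True)
        new_pmf = -kT * np.log(np.where(hist > 0, hist, np.nan))
        return edges, new_pmf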


To sum up: Be careful! I am dubious that protein PMFs are well-sampled. But even when a PMF is exactly accurate, it reveals a landscape based on a subjectively chosen coordinate set. Physical scientists are adept at building a story from a landscape, but any given landscape is likely to be deceptive both in terms of mechanism and kinetics. Perhaps it’s time for us to move on from over-reliance on the PMF.


I very much appreciate comments on a draft given by Alan Grossfield.

Further reading

Dellago, C.; Bolhuis, P. G.; Csajka, F. S. & Chandler, D., Transition path sampling and the calculation of rate constants, J. Chem. Phys., 1998, 108, 1964–1977.

Grossfield, A. & Zuckerman, D. M., Quantifying uncertainty and sampling quality in biomolecular simulations, Annu Rep Comput Chem, 2009, 5, 23–48.

Kumar, S.; Rosenberg, J. M.; Bouzida, D.; Swendsen, R. H. & Kollman, P. A., Multidimensional free-energy calculations using the weighted histogram analysis method, J. Comput. Chem., 1995, 16, 1339–1350.

Lyman, E. & Zuckerman, D. M., On the Structural Convergence of Biomolecular Simulations by Determination of the Effective Sample Size, J. Phys. Chem. B, 2007, 111, 12876–12882.

McGibbon, R. T.; Husic, B. E. & Pande, V. S., Identification of simple reaction coordinates from complex dynamics, J. Chem. Phys., 2017, 146, 044109.

Zuckerman, D. M., Statistical Physics of Biomolecules: An Introduction, CRC Press, 2010.

Biology for quants, again. Required reading, Part 2.

“Proteins don’t know biology” is one of those things I’m overly fond of saying. Fortunately, it’s true, and it gives quantitative folks a foot in the door of the magical world of biology. And it’s not only proteins that are ignorant of their role in the life of a cell; the same goes for DNA, RNA, lipids, etc. None of these molecules knows anything. They can only follow physical laws.

Is this just a physicist’s arrogance along the lines of, “Chemistry is just a bunch of special cases, uninteresting consequences of quantum mechanics”? I hope not. To the contrary, you should try to see that cells employ basic physics, but of a different type than what we learned (most of us, anyway) in our physical sciences curricula. This cell biophysics is fascinating, not highly mathematical, and offers a way of understanding numerous phenomena in the cell, which are all ‘special cases’ … but special cases of what?
