Some quick guidance for analyzing molecular dynamics (MD) or Markov-chain Monte Carlo (MC) data in hard-to-sample systems – e.g., biomolecules. I can summarize the advice this way: Ask not how to compute error bars. Ask first whether error bars are even appropriate. A meaningless error bar is more dangerous (to you and the community) than no error bar at all. This guidance is essentially abstracted from our recent Best Practices paper, and I hope it will set in context some of the theory discussed in an earlier post.
I realized that I owe you something. In a prior post, I invoked some Bayesian ideas to contrast with boostrapping analysis of high-variance data. (More precisely, it was high log-variance data for which there was a problem, as described in our preprint.) But the Bayesian discussion in my earlier post was pretty quick. Although there are a number of good, brief introductions to Bayesian statistics, many get quite technical.
Here, I’d like to introduce Bayesian thinking in absolutely the simplest way possible. We want to understand the point of it, and get a better grip on those mysterious priors.
I don’t about you but I grew up on equilibrium statistical mechanics. The beauty of a partition function, an ensemble, the ability to understand thermodynamic principles from microscopic rules. I love that stuff.
But what if we want to understand biology? Is a partition function really the most important object? This Fall, I’m going to lecture on biophysics for an assortment of biology and biomedical engineering students for just a few weeks; and for the first time in my teaching career, I’m planning to omit a partition-function based description of molecular behavior. I’m just not convinced it’s important enough for an abbreviated set of lectures.
I want to talk again today about the essential topic of analyzing statistical uncertainty – i.e., making error bars – but I want to frame the discussion in terms of a larger theme: our community’s often insufficiently critical adoption of elegant and sophisticated ideas. I discussed this issue a bit previously in the context of PMF calculations. To save you the trouble of reading on, the technical problem to be addressed is statistical uncertainty for high-variance data with small(ish) sample sizes.
Let’s draw a line. Across the calendar, I mean. Let’s all pledge that from today on we’re going to give honest accounting of the uncertainty in our data. I mean ‘honest’ in the sense that if someone tried to reproduce our data in the future, their confidence interval and ours would overlap.
There are a few conceptual issues to address up front. Let’s set up our discussion in terms of some variable which we measure in a molecular dynamics (MD) simulation at successive configurations: , , , and so on. Regardless of the length of our simulation, we can measure the average of all the values . We can also calculate the standard deviation σ of these values in the usual way as the square root of the variance. Both of these quantities will approach their “true” values (based on the simulation protocol) with enough sampling – with large enough .
Such a beautiful thing, the PMF. The potential of mean force is a ‘free energy landscape’ – the energy-like-function whose Boltzmann factor exp[ -PMF(x) / kT ] gives the relative probability* for any coordinate (or coordinate set) x by integrating out (averaging over) all other coordinates. For example, x could be the angle between two domains in a protein or the distance of a ligand from a binding site.
The PMF’s basis in statistical mechanics is clear. When visualized, its basins and barriers cry out “Mechanism!’’ and kinetics are often inferred from the heights of these features.
Yet aside from the probability part of the preceding paragraph, the rest is largely speculative and subjective … and that’s assuming the PMF is well-sampled, which I highly doubt in most biomolecular cases of interest.
“Proteins don’t know biology” is one of those things I’m overly fond of saying. Fortunately, it’s true, and it gives quantitative folks a foot in the door of the magical world of biology. And it’s not only proteins that are ignorant of their role in the life of a cell, the same goes for DNA, RNA, lipids, etc. None of these molecules knows anything. They can only follow physical laws.
Is this just a physicist’s arrogance along the lines of, “Chemistry is just a bunch of special cases, uninteresting consequences of quantum mechanics”? I hope not. To the contrary, you should try to see that cells employ basic physics, but of a different type than what we learned (most of us, anyway) in our physical sciences curricula. This cell biophysics is fascinating, not highly mathematical, and offers a way of understanding numerous phenomena in the cell, which are all ‘special cases’ … but special cases of what?
You’re a quantitative person and you want to learn biology. My friend, you are in a difficult situation. If you really want to learn how biology works in a big-picture sense, as opposed to cutting yourself a very narrow slice of the great biological pie, then you have a challenging road ahead of you. Fortunately, many have walked it before you, and I want to give you some advice based on my own experiences. I should say at the outset that my own learning has focused mostly on the cell-biology part of the pie – not physiology, zoology, ecology, … and so my comments here refer to learning cell biology.
The scary thing is that I have been at this for almost 20 years (very part-time admittedly) and I would never dare to call myself a cell biologist. But I think it’s fair to say that by now I have a decent sense of what I know and what I don’t know. I will never be able to draw out the Krebs cycle, but I have a qualitative sense of its purpose and importance, as well as of general principles of cycles and catalyzed reactions in biochemistry. Not that impressive, I know, but I’m proud of it anyway.
Basic strategies, timescales, and limitations
Key biomolecular events – such as conformational changes, folding, and binding – that are challenging to study using straightforward simulation may be amenable to study using “path sampling” methods. But there are a few things you should think about before getting started on path sampling. There are fairly generic features and limitations that govern all the path sampling methods I’m aware of.
Path sampling refers to a large family of methods that, rather than having the goal of generating an ensemble of system configurations, attempt to generate an ensemble of dynamical trajectories. Here we are talking about trajectory ensembles that are precisely defined in statistical mechanics. As we have noted in another post, there are different kinds of trajectory ensembles – most importantly, the equilibrium ensemble, non-equilibrium steady states, and the initialized ensemble which will relax to steady state. Typically, one wants to generate trajectories exhibiting events of interest – e.g., binding, folding, conformational change.
Q: What is a trajectory?
A trajectory is the time-ordered sequence of system configurations which occur as all the coordinates evolve in time following some rules – hopefully rules embodying reasonable physical dynamics, such as Newton’s laws or constant-temperature molecular dynamics.
Q: What is a trajectory ensemble?
It’s a set of independent trajectories that together characterize a particular condition such as equilibrium or a non-equilibrium steady state. That is, the trajectories do not interact in any way, but statistically they describe some condition because of how they have been initiated – and when they are observed relative to their initialization … see below.