Some quick guidance for analyzing molecular dynamics (MD) or Markov-chain Monte Carlo (MC) data in hard-to-sample systems – e.g., biomolecules. I can summarize the advice this way: Ask not how to compute error bars. Ask first whether error bars are even appropriate. A meaningless error bar is more dangerous (to you and the community) than no error bar at all. This guidance is essentially abstracted from our recent Best Practices paper, and I hope it will set in context some of the theory discussed in an earlier post.
I realized that I owe you something. In a prior post, I invoked some Bayesian ideas to contrast with bootstrapping analysis of high-variance data. (More precisely, it was high log-variance data for which there was a problem, as described in our preprint.) But the Bayesian discussion in my earlier post was pretty quick. Although there are a number of good, brief introductions to Bayesian statistics, many get quite technical.
Here, I’d like to introduce Bayesian thinking in absolutely the simplest way possible. We want to understand the point of it, and get a better grip on those mysterious priors.
I don’t know about you, but I grew up on equilibrium statistical mechanics. The beauty of a partition function, an ensemble, the ability to understand thermodynamic principles from microscopic rules. I love that stuff.
But what if we want to understand biology? Is a partition function really the most important object? This fall, I’m going to lecture on biophysics for an assortment of biology and biomedical engineering students for just a few weeks, and for the first time in my teaching career, I’m planning to omit a partition-function-based description of molecular behavior. I’m just not convinced it’s important enough for an abbreviated set of lectures.
I want to talk again today about the essential topic of analyzing statistical uncertainty – i.e., making error bars – but I want to frame the discussion in terms of a larger theme: our community’s often insufficiently critical adoption of elegant and sophisticated ideas. I discussed this issue a bit previously in the context of PMF calculations. To save you the trouble of reading on, the technical problem to be addressed is statistical uncertainty for high-variance data with small(ish) sample sizes.