Rules of Thumb for Sampling Assessment

Some quick guidance for analyzing molecular dynamics (MD) or Markov-chain Monte Carlo (MC) data in hard-to-sample systems – e.g., biomolecules. I can summarize the advice this way: Ask not how to compute error bars. Ask first whether error bars are even appropriate. A meaningless error bar is more dangerous (to you and the community) than no error bar at all. This guidance is essentially abstracted from our recent Best Practices paper, and I hope it will set in context some of the theory discussed in an earlier post.

Rule Zero – Plan your study with explicit awareness of sampling needs. Has anyone ever convincingly sampled a system as complex as the one you’re choosing to study? Do you know the timescales associated with your system and is there hope to access them … multiple times?

Rule One – Assume your data is not well sampled until you’re convinced to the contrary. There are good qualitative tests that can rule out good sampling: use them. If there is evidence that an important state has been visited only once, or if you see continuing drift in any observable that is supposed to be in a steady state, you’re not well sampled. Every important steady-state observable should be fluctuating about a mean after a transient “equilibration/burn-in” period. For multiple trajectories, also compare the distributions from each.
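The two red flags in Rule One (a state visited only once, and drift in a supposedly steady observable) can be checked with a few lines of NumPy. What follows is only a rough sketch under my own assumptions: the function names `count_visits` and `half_drift` are made up for illustration, the trajectory is assumed to be already discretized into integer state labels, and the half-versus-half comparison is just one crude way to look for drift. Neither check can prove good sampling; they can only flag bad sampling.

```python
import numpy as np

def count_visits(state_traj):
    """Count separate visits to each state in a discrete state trajectory.

    A "visit" is a contiguous run of frames with the same state label.
    """
    states = np.asarray(state_traj)
    # A new visit starts at frame 0 and wherever the label changes.
    starts = np.concatenate(([True], states[1:] != states[:-1]))
    labels, counts = np.unique(states[starts], return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))

def half_drift(series, burn_in=0):
    """Second-half mean minus first-half mean, in units of the overall
    standard deviation, after discarding `burn_in` frames.

    Assumes the series is not constant (nonzero standard deviation).
    """
    x = np.asarray(series, dtype=float)[burn_in:]
    half = len(x) // 2
    return (x[half:].mean() - x[:half].mean()) / x.std()
```

For example, `count_visits([0, 0, 1, 1, 0, 2, 2, 2])` returns `{0: 2, 1: 1, 2: 1}`, which says states 1 and 2 were each entered only once. A `half_drift` value far from zero suggests the observable is still drifting; I am deliberately not proposing a numerical cutoff, since any threshold would be arbitrary.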

Rule Two – Do not cherry-pick data. If some of your data makes a nice story for any reason while other data does not, excluding the disagreeable data means biasing your results. That’s not good science.

Rule Three – Be extremely cautious when assessing data from an enhanced sampling approach. Be cynical at first and assume that the only thing your fancy method does is smooth otherwise poor data. At a minimum, perform multiple completely independent runs to gauge variance. If your results depend on starting configuration(s), then you have not sampled well.
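As a minimal sketch of gauging variance from multiple completely independent runs: suppose each run yields one estimate of the same observable (a free energy difference, say). The spread across runs can then be summarized as below. The function name `across_run_summary` is hypothetical, and the standard error is only meaningful if the runs really are independent, so treat it as a diagnostic rather than a certified error bar.

```python
import numpy as np

def across_run_summary(run_estimates):
    """Mean, sample standard deviation, and standard error of the mean
    for one observable estimated separately in each independent run."""
    est = np.asarray(run_estimates, dtype=float)
    n = est.size
    if n < 2:
        raise ValueError("need at least two independent runs to gauge variance")
    mean = est.mean()
    std = est.std(ddof=1)      # sample standard deviation across runs
    sem = std / np.sqrt(n)     # standard error, assuming independent runs
    return mean, std, sem
```

If runs started from different configurations give estimates whose spread is large compared with the effect you care about, that is the dependence on starting configuration that Rule Three warns about.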

What are some examples of good sampling? Most obviously I can point to the MD protein folding study by Shaw and coworkers, where we see multiple folding and unfolding events; this is what good sampling looks like in a single trajectory. In more modest systems, we carefully analyzed MC-based peptide equilibrium sampling.
