Can’t live with them, can’t live without them. That just about sums up my relationship with Markov state models aka MSMs. They have known limitations, but they sure can be useful (sometimes) and also misleading (sometimes).

I wrote a few years ago about some of the limitations of MSMs, and the aim of this post is to expand on one point from that post, namely, the tendency of MSMs to simply recapitulate the distribution of the raw counts data when one or a small number of trajectories are used to train the model. That is, as first pointed out by Scalco and Caflisch, when analyzing a trajectory that revisits the same basins, you won’t learn anything from the MSM because it essentially spits back the quantities you would get from simple analysis, e.g., by counting populations.

I wish I were writing this only to criticize MSMs, but the fact is that our new RiteWeight algorithm inherits this same flaw of MSMs. More on RiteWeight below.

A trajectory exploring free energy basins.
Figure: Analysis scheme for a single long trajectory. If a single long trajectory (black) samples a small number of free energy basins (gray contours), then the trajectory may frequently “cross” itself (e.g., green square) in the discrete space of MSM states (not shown for clarity). The present discussion analyzes the longest stretch of trajectory spanning such a crossing event, which excludes the “dangling ends” (dashed black). A counts-based MSM built from the long trajectory (solid black) excluding the dangling ends exactly recapitulates the counts-based populations.

I want to use a simple example to do a very straightforward calculation to highlight the issue. Consider a single long trajectory that arguably does a good job of exploring a multi-basin conformation space (see figure) in the sense that transitions among basins are sampled multiple times. In this scenario, when the space is discretized into states for MSM analysis, we expect that multiple “crossing” events occur, i.e., the trajectory visits MSM states multiple times with gaps between such visits.

We analyze the longest stretch of the trajectory spanning the earliest and latest visits to a particular MSM state as shown in the figure. This construction removes the “dangling ends” before and after the crossing event.

We can define a MSM with M states starting from the transition counts matrix with elements C_{ij} for lag \tau, where C_{ij} denotes the number of times the trajectory is observed in state j having been in state i one lag time earlier. The total count C_i in state i (less one for the last time point, or the first) is given by summing over the counts either out of, or into, state i:

(1)   \begin{equation*} C_i = \sum_{j=1}^M C_{ij} = \sum_{j=1}^M C_{ji} \end{equation*}

where the sum includes the case j=i representing self transitions, i.e., sequential observations in state i. The expression (1) is exact for all states i because the initial and final states are the same by construction.
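As a quick sanity check on (1), here is a minimal Python sketch. It rests on assumptions that are mine rather than part of the derivation: the trajectory is a synthetic sequence of random state indices, the lag is taken to be one frame (i.e., the trajectory is already strided at \tau), and the helper names `count_matrix` and `trim_to_closed_loop` are hypothetical.

```python
import random

def count_matrix(traj, n_states):
    """C[i][j] = number of times state j is observed one frame after state i."""
    C = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(traj[:-1], traj[1:]):
        C[a][b] += 1
    return C

def trim_to_closed_loop(traj):
    """Keep the longest stretch that starts and ends in the same state,
    i.e., drop the dangling ends before and after that stretch."""
    first, last = {}, {}
    for t, s in enumerate(traj):
        first.setdefault(s, t)  # earliest visit to state s
        last[s] = t             # latest visit to state s
    s_best = max(first, key=lambda s: last[s] - first[s])
    return traj[first[s_best]: last[s_best] + 1]

random.seed(0)
M = 4
raw = [random.randrange(M) for _ in range(200)]  # synthetic, illustrative only
loop = trim_to_closed_loop(raw)

C = count_matrix(loop, M)
row_sums = [sum(C[i]) for i in range(M)]                        # counts out of i
col_sums = [sum(C[i][j] for i in range(M)) for j in range(M)]   # counts into i
print(loop[0] == loop[-1], row_sums == col_sums)  # True True
```

The row sums count every frame except the last and the column sums count every frame except the first, so the two agree whenever the first and last frames are in the same state.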

We now consider the MSM transition matrix with elements representing the transition probability from i to j given by

(2)   \begin{equation*}     T_{ij} = \frac{C_{ij}}{C_i} \end{equation*}

which is the maximum likelihood estimate for the matrix elements. We also define the total counts C = \sum_{i=1}^M C_i.

The solution for the stationary probabilities (e.g., for equilibrium) is surprising.
The stationary probabilities \pi_i resulting from the transition matrix (2) are exactly proportional to the counts observed in the trajectory itself, i.e., \pi_i = C_i / C. We can confirm this by applying the transition matrix to the stationary vector:

(3)   \begin{equation*}     \sum_{i=1}^M \pi_i \, T_{ij}      = \sum_{i=1}^M \frac{C_i}{C} \frac{C_{ij}}{C_i}     = \frac{1}{C} \sum_{i=1}^M C_{ij} = \frac{C_j}{C} = \pi_j \end{equation*}

where we used (1), which builds in the closed-loop assumption.
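The identity (3) can also be checked numerically. The sketch below uses exact rational arithmetic so that equality is not blurred by floating-point error. The two short trajectories are made up for illustration, and the function name `counts_vs_msm` is hypothetical; the lag is again assumed to be one frame.

```python
from fractions import Fraction

def counts_vs_msm(traj, n_states):
    """Return (pi, pi_T): the count fractions C_i / C and the result of
    applying the counts-based transition matrix T to them."""
    C = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(traj[:-1], traj[1:]):
        C[a][b] += 1
    C_i = [sum(row) for row in C]   # outgoing counts per state
    C_tot = sum(C_i)
    # Equation (2): T_ij = C_ij / C_i (assumes every state has C_i > 0)
    T = [[Fraction(C[i][j], C_i[i]) for j in range(n_states)]
         for i in range(n_states)]
    pi = [Fraction(c, C_tot) for c in C_i]
    # Left-hand side of equation (3): sum_i pi_i T_ij
    pi_T = [sum(pi[i] * T[i][j] for i in range(n_states))
            for j in range(n_states)]
    return pi, pi_T

# Closed loop: first and last frames are both in state 0.
closed = [0, 0, 1, 2, 2, 1, 0, 1, 1, 2, 0]
pi, pi_T = counts_vs_msm(closed, 3)
print(pi == pi_T)      # True

# Same data with one dangling frame in front: starts in 2, ends in 0.
dangling = [2] + closed
pi_d, pi_T_d = counts_vs_msm(dangling, 3)
print(pi_d == pi_T_d)  # False
```

For the closed loop the count fractions are exactly stationary; with a single dangling frame the row and column sums of the counts matrix no longer match, and the identity fails.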

The innocuous-looking finding (3) is worth careful consideration. After all, the central idea of a MSM is to provide a principled “model” of the raw trajectory data which is not evident from the data itself. If the MSM simply recapitulates the counts, then the model is not providing clear value and may obscure problems in sampling. Note that (3) is independent of the trajectory length and assumes only that the first and last time points occur in the same state: thus, good sampling is not assumed or built into the analysis.

The single-trajectory analysis just presented is hardly the full story of MSM analysis of finite data. Analogous conclusions can be derived for multiple trajectories; however, sets of multiple short trajectories, where dangling ends represent a dominant information source, cannot be understood from the counts-based approach, as described by Scalco and Caflisch. Also, more sophisticated approaches to computing stationary probabilities from counts have been developed, notably to include a detailed balance constraint; however, empirical data suggests this constraint fails to shift MSM stationary populations away from proportionality to counts.

The bottom line is this: the simplest MSM transition matrix, where matrix elements are derived directly from transition counts, illustrates a key limitation. When input trajectories are long enough to “cross”, it may be difficult for a MSM to yield a stationary solution substantially different from the simple cluster counts of the input data, implying significant “initial state bias.”

I promised to explain what all this has to do with RiteWeight, the randomized iterative reweighting approach. Because RiteWeight iterates over stationary solutions to MSMs (each with a different set of states), if those MSMs can only recapitulate the raw counts in the trajectory data, then RiteWeight suffers from the same problem. RiteWeight only works when the MSM stationary solutions differ from the counts (assuming there is not already exhaustive sampling). It seems that the sweet spot for RiteWeight is reweighting multiple shorter trajectories, as illustrated in some of our recent work adjusting protein ensembles from AI tools.

I want to thank Alex Dickson and Ed Lyman for pointing me to some of these issues a few years back.