Comprehensive benchmarking of Markov chain Monte Carlo methods for dynamical systems

Background In quantitative biology, mathematical models are used to describe and analyze biological processes. The parameters of these models are usually unknown and need to be estimated from experimental data using statistical methods. In particular, Markov chain Monte Carlo (MCMC) methods have become increasingly popular as they allow for a rigorous analysis of parameter and prediction uncertainties without requiring parameter identifiability or the removal of non-identifiable parameters. A broad spectrum of MCMC algorithms has been proposed, including single- and multi-chain approaches. However, selecting and tuning sampling algorithms suited for a given problem remains challenging, and a comprehensive comparison of different methods is so far not available. Results We present the results of a thorough benchmarking of state-of-the-art single- and multi-chain sampling methods, including Adaptive Metropolis, Delayed Rejection Adaptive Metropolis, the Metropolis-adjusted Langevin algorithm, Parallel Tempering and Parallel Hierarchical Sampling. Different initialization and adaptation schemes are considered. To ensure a comprehensive and fair comparison, we consider problems with a range of features such as bifurcations, periodic orbits, multistability of steady-state solutions and chaotic regimes. These problem properties give rise to various posterior distributions, including uni- and multi-modal distributions and non-normally distributed mode tails. For an objective comparison, we developed a pipeline for the semi-automatic comparison of sampling results. Conclusion The comparison of MCMC algorithms, initialization and adaptation schemes revealed that, overall, multi-chain algorithms perform better than single-chain algorithms. In some cases this performance can be further increased by a preceding multi-start local optimization scheme.
These results can inform the selection of sampling methods, and the benchmark collection can serve for the evaluation of new algorithms. Furthermore, our results confirm the need to assess the exploration quality of MCMC chains before applying the commonly used quality measure of effective sample size, in order to prevent false analysis conclusions. Electronic supplementary material The online version of this article (doi:10.1186/s12918-017-0433-1) contains supplementary material, which is available to authorized users.


Initialization Scheme Details
To initialize the sampling algorithms we use samples from the prior (RND initialization) or results of a multi-start local optimization run (MS initialization). An illustration of the initialization is provided in Supplement Figure S1.
RND Initialization: In this initialization scheme we draw 100 parameter points from our uniform priors. For single-chain algorithms, these points are distributed to the 100 runs per scenario. For multi-chain algorithms, we take different (but overlapping) sequences of L points from the 100 available points and distribute these onto the L chains of each run. Each scenario has a different random seed.
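The distribution of the 100 prior draws onto runs and chains can be sketched as follows. This is a minimal NumPy sketch, not the paper's MATLAB implementation: the function name, the uniform prior on [0, 1]^dim, and the wrap-around construction of the overlapping L-point sequences are illustrative assumptions, since the text does not specify how the overlapping sequences are formed.

```python
import numpy as np

def rnd_initialization(n_points=100, n_runs=100, L=5, dim=3, seed=0):
    """Draw points from a uniform prior on [0, 1]^dim and distribute them:
    single-chain runs get one point each; multi-chain runs get an
    overlapping window of L consecutive points (wrapping around)."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(0.0, 1.0, size=(n_points, dim))

    # Single-chain: the i-th run starts at the i-th prior draw.
    single_chain_starts = points  # shape (n_runs, dim), n_runs == n_points

    # Multi-chain: run r uses points r, r+1, ..., r+L-1 (mod n_points),
    # so consecutive runs share L-1 of their L starting points.
    multi_chain_starts = [
        points[np.arange(r, r + L) % n_points] for r in range(n_runs)
    ]
    return single_chain_starts, multi_chain_starts

single, multi = rnd_initialization()
```

Each multi-chain run thus receives L distinct starting points, and different runs cover overlapping but non-identical subsets of the 100 draws.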
MS Initialization: For each run we draw 1000 parameter vectors according to a Latin hypercube scheme and use them as initial values for a multi-start local optimization (as presented in [1]). The local optimization for each starting point is performed using the MATLAB built-in routine fmincon with the 'interior-point' algorithm.
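A multi-start local optimization of this kind can be sketched in Python as follows. This is a hedged stand-in for the paper's setup: SciPy's LatinHypercube sampler and the L-BFGS-B local optimizer replace MATLAB's Latin hypercube sampling and fmincon, and the bimodal toy objective is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import qmc

def multistart_optimization(neg_log_post, lb, ub, n_starts=50, seed=0):
    """Latin-hypercube multi-start local optimization.
    Returns (objective value, optimum) pairs sorted best-first;
    the first entry is the MAP estimate (lowest negative log-posterior)."""
    dim = len(lb)
    sampler = qmc.LatinHypercube(d=dim, seed=seed)
    starts = qmc.scale(sampler.random(n_starts), lb, ub)

    results = []
    for x0 in starts:
        res = minimize(neg_log_post, x0, method="L-BFGS-B",
                       bounds=list(zip(lb, ub)))
        results.append((res.fun, res.x))
    results.sort(key=lambda t: t[0])
    return results

# Toy bimodal negative log-posterior: two basins, the deeper one at x = 2.
f = lambda x: min((x[0] - 2.0) ** 2, 0.5 + (x[0] + 2.0) ** 2)
best = multistart_optimization(f, lb=[-5.0], ub=[5.0], n_starts=50)
```

With enough starts, several local searches land in each basin, so the sorted result list exposes both the MAP and the secondary optimum; this frequency information is what the weighting heuristic below relies on.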
For single-chain methods, the maximum a posteriori (MAP) estimate obtained by the multi-start local optimization is employed as the initial parameter vector.
For multi-chain methods, we filter the optimization results based on their posterior difference to the MAP and only take into account the best u ≤ 1000 optimization results {θ^(j), j = 1, ..., u}. To determine a reasonable number u, the parameter vectors are first sorted descending by their posterior values, yielding a sequence θ^(1), θ^(2), ..., θ^(1000) such that p(D|θ^(1))p(θ^(1)) ≥ p(D|θ^(2))p(θ^(2)) ≥ ... ≥ p(D|θ^(1000))p(θ^(1000)). For this sequence we define the log-posterior ratios

R_i = 2( log(p(D|θ^(1))p(θ^(1))) − log(p(D|θ^(i))p(θ^(i))) ),  i = 1, ..., 1000.

From this sequence the first u members are selected such that for all i = 1, ..., u the ratio R_i is smaller than the quantile of the χ² distribution with one degree of freedom at an α-level of 0.001. This cutoff is motivated by the likelihood ratio test. For the selected optimization results {θ^(j), j = 1, ..., u} we define weights w_j calculated from the log-posterior values, and these weights are used to draw the initial values for the chains of a multi-chain MCMC run from {θ^(j)}. This heuristic takes into account both the height and the width of the modes of the posterior distribution; the width is encoded in the frequency with which a certain local optimum is recovered.
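The χ²-based filtering and weighting step can be sketched as follows. Note one loud assumption: the exact weight formula is truncated in the text above, so weights proportional to the posterior heights (computed stably from log-posterior differences) are used here as a plausible stand-in, not as the paper's definitive choice.

```python
import numpy as np
from scipy.stats import chi2

def filter_and_weight(log_posteriors, alpha=0.001):
    """Keep the local optima whose log-posterior ratio to the best one stays
    below the chi-squared cutoff (likelihood-ratio-test motivation) and
    return them with normalized weights for drawing chain starting points."""
    order = np.argsort(log_posteriors)[::-1]      # indices, best first
    lp = np.asarray(log_posteriors)[order]        # sorted log-posteriors
    R = 2.0 * (lp[0] - lp)                        # R_i >= 0, R_1 = 0
    cutoff = chi2.ppf(1.0 - alpha, df=1)          # ~10.83 for alpha = 0.001
    keep = order[R < cutoff]                      # the u retained optima
    w = np.exp(lp[R < cutoff] - lp[0])            # stable: subtract max first
    return keep, w / w.sum()

lp = np.array([-10.0, -3.0, -3.5, -30.0])
idx, w = filter_and_weight(lp)
# Chain starts would then be drawn via rng.choice(idx, size=L, p=w).
```

In this toy example the optima at log-posterior −3.0 and −3.5 survive the cutoff (R = 0 and 1), while −10.0 and −30.0 (R = 14 and 54) are discarded.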

Automated Burn-In Calculation via Sequential Geweke Test
The different sampling methods yield burn-in phases of different lengths. To determine the length of the burn-in phase automatically, we employ the Geweke test multiple times. The Geweke test was originally used to compare the sample means μ̂_{0−10%} and μ̂_{50−100%} of the first 10% and the last 50% of a chain [2], while accounting for the respective spectral variance approximations σ̂²_{0−10%} and σ̂²_{50−100%} of the interval sample means. The comparison is based on the z-score

z = (μ̂_{0−10%} − μ̂_{50−100%}) / √(σ̂²_{0−10%} + σ̂²_{50−100%}),

which is used to test the hypothesis that the sample means differ against the null hypothesis that the sample means μ̂_{0−10%} and μ̂_{50−100%} are equal. We consider an absolute z-score greater than z₀ = 2 as significant. If the z-score is smaller than 2, the null hypothesis is not rejected, suggesting that the burn-in phase is over. For automatic detection of the burn-in length, we split the raw chain into 40 equally sized segments and perform the Geweke test on sub-chains composed of these segments. Initially, the Geweke test is performed on the chain composed of all segments 1, ..., 40. If the resulting z-score is too high, we test the sub-chain composed of segments 2, ..., 40. If the z-score is still too high, we test segments 3, ..., 40, and so on, until we have tested all sub-chains or the z-score falls below the threshold.
We adjust the significance level z₀ for each subsequent test with the Bonferroni-Holm correction [3]. Thus, we sort the obtained z-scores in descending order (as they are inversely related to the p-values, which would be sorted in ascending order) and adjust the threshold of each test to z̃₀ = z₀/k, where k ∈ {1, ..., 40} is the sorted index of the test.
Once the z-score of a sub-chain falls below the adjusted significance level z₀/k, all iterations i ≤ n_BI preceding that sub-chain are discarded. As all MCMC chains are multi-dimensional with n_θ > 1 and the Geweke test is a univariate test, we always test all parameter dimensions individually and then use the worst case, i.e., the largest burn-in found across all dimensions.
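The sequential procedure above can be sketched for a single parameter dimension as follows. Two loud assumptions: this is a Python stand-in for the paper's MATLAB implementation, and the spectral variance approximation is replaced by a naive variance-of-the-mean estimate (Geweke's original test uses a spectral density estimate at frequency zero).

```python
import numpy as np

def spectral_var(x):
    """Crude variance-of-the-mean estimate; a faithful implementation
    would use a spectral density estimate at frequency zero."""
    return np.var(x, ddof=1) / len(x)

def geweke_z(chain):
    """Geweke z-score comparing the first 10% and the last 50% of a chain."""
    n = len(chain)
    a, b = chain[: n // 10], chain[n // 2 :]
    return abs(a.mean() - b.mean()) / np.sqrt(spectral_var(a) + spectral_var(b))

def burn_in(chain, n_segments=40, z0=2.0):
    """Sequential Geweke test with Bonferroni-Holm-style adjustment:
    drop leading segments until the remaining sub-chain's z-score falls
    below z0/k, where k is the rank of the z-score in descending order.
    Returns the number of burn-in iterations to discard."""
    seg = len(chain) // n_segments
    zs = np.array([geweke_z(chain[k * seg :]) for k in range(n_segments)])
    ranks = np.empty(n_segments, dtype=int)
    ranks[np.argsort(zs)[::-1]] = np.arange(1, n_segments + 1)  # largest z -> k = 1
    for k in range(n_segments):
        if zs[k] < z0 / ranks[k]:
            return k * seg
    return len(chain)  # no sub-chain passed: discard everything
```

For a multi-dimensional chain one would apply `burn_in` to each parameter column and take the maximum, as described above.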

Obtaining Groups of Similar MCMC Chains
To obtain groups of similar chains for the same benchmark problem in a robust fashion, we compare all pairs of runs using a Gelman-Rubin-Brooks test [4] and a Geweke test.
The Gelman-Rubin-Brooks test compares the variance-covariance matrix W of θ within a set of chains with the variance-covariance matrix B of θ between the set of chains [4]. Here θ^(jt) denotes the parameter vector in chain j at iteration t. We define the variance-covariance matrix within the chains as

W = 1/(m(n−1)) Σ_{j=1}^{m} Σ_{t=1}^{n} (θ^(jt) − θ̄^(j·))(θ^(jt) − θ̄^(j·))^T

and the variance-covariance matrix between the chains as

B/n = 1/(m−1) Σ_{j=1}^{m} (θ̄^(j·) − θ̄^(··))(θ̄^(j·) − θ̄^(··))^T

with chain means θ̄^(j·) = (1/n) Σ_{t=1}^{n} θ^(jt) and overall mean θ̄^(··) = (1/m) Σ_{j=1}^{m} θ̄^(j·). Here n is the number of iterations and m the number of chains being compared (here always m = 2). Any rotationally invariant distance measure between W and B/n can be used to determine whether the chains are sufficiently similar [4]. Brooks and Gelman proposed the distance measure

R = (n−1)/n + ((m+1)/m) λ,

where λ is the largest eigenvalue of W^{−1}B/n. This multivariate measure is an upper bound for its univariate counterpart (see [4]).
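The multivariate potential scale reduction factor defined above can be computed directly from the two formulas. This is a minimal NumPy sketch of the Brooks-Gelman statistic, not the paper's code; chains are assumed to be stacked into an array of shape (m, n, d).

```python
import numpy as np

def brooks_gelman_R(chains):
    """Multivariate potential scale reduction factor of Brooks & Gelman.
    `chains` has shape (m, n, d): m chains, n iterations, d parameters."""
    m, n, d = chains.shape
    chain_means = chains.mean(axis=1)          # theta-bar^(j.), shape (m, d)
    grand_mean = chain_means.mean(axis=0)      # theta-bar^(..), shape (d,)

    # Within-chain covariance W, pooled over all chains.
    W = np.zeros((d, d))
    for j in range(m):
        dev = chains[j] - chain_means[j]
        W += dev.T @ dev
    W /= m * (n - 1)

    # Between-chain covariance B/n.
    dev = chain_means - grand_mean
    B_over_n = dev.T @ dev / (m - 1)

    # R = (n-1)/n + ((m+1)/m) * lambda_max(W^{-1} B/n).
    lam = np.linalg.eigvals(np.linalg.solve(W, B_over_n)).real.max()
    return (n - 1) / n + (m + 1) / m * lam
```

For two identical chains, B vanishes and R reduces to (n−1)/n, slightly below 1; well-separated chains inflate λ and hence R, failing the R < R₀ = 1.05 criterion used below.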
We found this similarity test to be a good choice when combined with a Geweke test between the chains (note that this is a different Geweke test than the one used in Supplement Section 2). This Geweke test is applied to the last parts of both chains after burn-in removal, with the longer chain further shortened so that both chains have an equal number of iterations remaining. The Gelman-Rubin-Brooks test compares the covariances within and across chains, while the Geweke test compares the chain means. The conservativeness of both approaches can be controlled via the thresholds R₀ and z₀ against which R and z are compared; the tests are passed if R < R₀ and z < z₀. We set R₀ = 1.05 and z₀ = 0.05, which is fairly conservative, making it more likely to overestimate the number of groups than to underestimate it. If both tests are passed, we consider the two chains sufficiently similar to be assigned to the same group. Each time two runs are classified as similar, they are linked into a group, and existing groups are merged accordingly.
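The linking-and-merging of runs into groups is essentially a connected-components computation over the pairwise similarity relation. A minimal union-find sketch, with the pairwise test abstracted into a caller-supplied predicate (the predicate name and interface are illustrative assumptions):

```python
def group_chains(n_runs, similar):
    """Group runs into clusters: `similar(i, j)` should return True when
    runs i and j pass both the Gelman-Rubin-Brooks and the Geweke test.
    Similar runs are linked and their groups merged (union-find)."""
    parent = list(range(n_runs))

    def find(i):
        # Follow parent pointers to the group representative, with
        # path halving to keep the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_runs):
        for j in range(i + 1, n_runs):
            if similar(i, j):
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n_runs):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Because merging is transitive, two runs can end up in the same group even if their direct pairwise test was never passed, which matches the link-and-merge behaviour described above.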

Memory Usage
We tested the memory usage of our implementation of the sampling methods by applying them to the largest considered benchmark problem (M4) with 10^4 iterations. We performed 10 independent runs: 5 with MS initialization and 5 with RND initialization. Figure S2 shows the memory consumption of the sampling process. To generate these results, the undocumented profile -memory functionality of MATLAB 2016b was used on a desktop computer. Since MALA is more sensitive to poor initialization than the other methods, the RND-initialized runs fail to start; thus, in that case the computation time is very short.

Resolving Non-Identifiability Leads to Higher Sampling Efficiency
For (M1a), an analysis of the analytical solution for the output trajectory y(t) reveals that the parameters β and δ are interchangeable. This information can be exploited by constructing more efficient samplers, e.g. by using a random permutation sampler, or by post-processing the results [5, 6]. In this study, we considered post-processing: instead of applying the analysis pipeline to the raw chains, we first process the chains by switching β and δ so that β > δ holds. The resulting increases in EQ and in ESS per second are shown in Figure S3. The sampling performance improves significantly for almost all algorithms, with the single-chain algorithms benefiting the most. This highlights the importance of exploiting additional information about a sampling problem. Unfortunately, such additional information is often not accessible in practice.
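The post-processing step is a sample-wise column swap. A minimal NumPy sketch (the function name and the column-index convention are illustrative; the chain is assumed to be an array with one row per sample and the interchangeable parameters in columns i and j):

```python
import numpy as np

def enforce_order(chain, i=0, j=1):
    """Post-process a chain whose parameters in columns `i` (beta) and
    `j` (delta) are interchangeable: swap the two values sample-wise so
    that chain[:, i] > chain[:, j] holds for every sample, collapsing the
    two symmetric posterior modes onto a single one."""
    out = chain.copy()
    swap = out[:, i] < out[:, j]           # samples violating beta > delta
    # Both right-hand sides read from the untouched original `chain`,
    # so the two assignments do not interfere with each other.
    out[swap, i], out[swap, j] = chain[swap, j], chain[swap, i]
    return out
```

Applied to the raw chains before the analysis pipeline, this turns the symmetric bimodal posterior into a unimodal one, which is what drives the EQ and ESS-per-second improvements reported in Figure S3.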

Detailed Results of each Benchmark Problem
Figures S4-S10 illustrate the performance of the considered sampling approaches for benchmark problems (M1a) to (M6). Both the exploration quality and the sampling efficiency (effective sample size per second) are indicated.

(M1b)
Figure S5 Exploration quality (top) and sampling efficiency (bottom) for problem (M1b). A complete white bar in the upper plot indicates that less than 5% of the runs of the respective algorithm explored the posterior distribution. In this case, the sampling efficiency is not evaluated; otherwise, the distribution of corrected ESS across well-exploring runs is illustrated.

(M2)
Figure S6 Exploration quality (top) and sampling efficiency (bottom) for problem (M2). A complete white bar in the upper plot indicates that less than 5% of the runs of the respective algorithm explored the posterior distribution. In this case, the sampling efficiency is not evaluated; otherwise, the distribution of corrected ESS across well-exploring runs is illustrated.

(M3)
Figure S7 Exploration quality (top) and sampling efficiency (bottom) for problem (M3). A complete white bar in the upper plot indicates that less than 5% of the runs of the respective algorithm explored the posterior distribution. In this case, the sampling efficiency is not evaluated; otherwise, the distribution of corrected ESS across well-exploring runs is illustrated.

(M4)
Figure S8 Exploration quality (top) and sampling efficiency (bottom) for problem (M4). A complete white bar in the upper plot indicates that less than 5% of the runs of the respective algorithm explored the posterior distribution. In this case, the sampling efficiency is not evaluated; otherwise, the distribution of corrected ESS across well-exploring runs is illustrated.

(M5)
Figure S9 Exploration quality (top) and sampling efficiency (bottom) for problem (M5). A complete white bar in the upper plot indicates that less than 5% of the runs of the respective algorithm explored the posterior distribution. In this case, the sampling efficiency is not evaluated; otherwise, the distribution of corrected ESS across well-exploring runs is illustrated. In this benchmark problem, none of the algorithms achieved an EQ > 0.

(M6)
Figure S10 Exploration quality (top) and sampling efficiency (bottom) for problem (M6). A complete white bar in the upper plot indicates that less than 5% of the runs of the respective algorithm explored the posterior distribution. In this case, the sampling efficiency is not evaluated; otherwise, the distribution of corrected ESS across well-exploring runs is illustrated. In this benchmark problem, none of the algorithms achieved an EQ > 0.