BCM: toolkit for Bayesian analysis of Computational Models using samplers

Background Computational models in biology are characterized by a large degree of uncertainty. This uncertainty can be analyzed with Bayesian statistics, however, the sampling algorithms that are frequently used for calculating Bayesian statistical estimates are computationally demanding, and each algorithm has unique advantages and disadvantages. It is typically unclear, before starting an analysis, which algorithm will perform well on a given computational model. Results We present BCM, a toolkit for the Bayesian analysis of Computational Models using samplers. It provides efficient, multithreaded implementations of eleven algorithms for sampling from posterior probability distributions and for calculating marginal likelihoods. BCM includes tools to simplify the process of model specification and scripts for visualizing the results. The flexible architecture allows it to be used on diverse types of biological computational models. In an example inference task using a model of the cell cycle based on ordinary differential equations, BCM is significantly more efficient than existing software packages, allowing more challenging inference problems to be solved. Conclusions BCM represents an efficient one-stop-shop for computational modelers wishing to use sampler-based Bayesian statistics. Electronic supplementary material The online version of this article (doi:10.1186/s12918-016-0339-3) contains supplementary material, which is available to authorized users.


Background
There is an increasing interest in using Bayesian statistics for the analysis of computational models in biology [1][2][3][4]. With Bayesian statistics, the unknown parameters of a computational model are assigned a probability distribution describing their uncertainty. This distribution can be updated from prior information to give the posterior probability distribution, using Bayes' theorem: where Θ represents the parameters, X the measurement data and ℳ the computational model. Furthermore, the marginal likelihood, or evidence, can be used to discriminate between different computational models. It can be calculated by marginalizing the parameters: Typically, neither the posterior probability nor the marginal likelihood can be calculated directly, but sampling algorithms can be used to estimate them [5][6][7][8][9][10][11][12][13][14][15][16]. These sampling algorithms are computationally demanding, especially when the number of parameters is large and when the computational model is expensive to simulate. Typical models in systems biology indeed carry many parameters and are expensive to simulate [17]. Additionally, the posterior probability distributions arising from such models are usually complex, containing multiple modes and ridges that are difficult to traverse [18]. Bayesian analysis of such systems biology models thus requires the use of advanced sampling algorithms. Since these sampling algorithms each have unique characteristics and can be more or less suitable for a particular task, it would be beneficial to have various algorithm easily available.
BCM, a toolkit for the Bayesian analysis of Computational Models using samplers, provides efficient, multithreaded implementations of eleven algorithms for calculating posterior probabilities and marginal likelihoods.
The BCM toolkit focuses on computational models that involve simulations or extensive calculations. Examples of such computational models are systems of ordinary differential equations describing biochemical reactions; or steady-state signaling networks, where the activity levels may be calculated in diverse ways. These computational models are in contrast to statistical models that can be specified in the BUGS or Stan languages. For such statistical models, excellent software packages already exist [19,20]. For the computational models that are targeted by BCM, several alternative software packages also exist [5,[21][22][23]. However, each of these packages implements only a single type of sampling algorithm and most of them focus on one particular type of computational model. In contrast, with BCM the user can choose from eleven sampling algorithms and the plugin architecture allows diverse types of models. Thus, BCM represents a one-stopshop for Bayesian analysis of systems biology models, where the user has a high chance of finding a suitable algorithm for the analysis of the user-defined model.

Implementation
BCM consists of three components: an inference tool, a model parsing tool and an R script for further analysis and visualization (see Fig. 1a).
The inference tool (mdlinf ) is the main component of BCM. It uses a specified sampling algorithm to generate samples from the posterior probability distribution and to calculate a marginal likelihood estimate. Error bounds for the marginal likelihood estimate are also provided, which are calculated directly from the samples using a method suitable for the particular algorithm used to calculate the marginal likelihood. As input, the inference tool requires three parts: a configuration file, an XML file specifying the prior, and a dynamic library that evaluates the likelihood function. For constructing the dynamic library that evaluates the likelihood function, BCM provides cross-platform boilerplate code, such that custom model simulation code can be easily adapted for use with BCM. Alternatively, the model parsing tool can be used as described further below.
The inference tool implements three classes of sampling algorithms: Markov chain Monte Carlo (MCMC) [6,7], sequential Monte Carlo (SMC) [8] and nested sampling [9]. For each class of sampling algorithms, BCM includes several options for proposal distributions, as well as extensions that can increase the sampling efficiency when dealing with complex inference problems, giving a total of eleven different sampling algorithms (Table 1). Fig. 1 Overview of BCM. a The inference tool is the main component of BCM, providing three classes of algorithms for generating samples from posterior probability distributions and calculating estimates of the marginal likelihood. The parsing tool can optionally be used to generate the prior and likelihood files from a model description file and data. b Excerpt of a model description file. The model parsing tool can parse this file, load the relevant data, and output C++ source code for a dynamic library that evaluates the likelihood function. In this example, the "Simulate()" function still has to be implemented by the user with a desired simulation method Care has been taken to create efficient, multithreaded implementations of each algorithm. Firstly, the inference tool has been written in C++ and performance bottlenecks have been profiled and optimized. Secondly, each algorithm has been parallelized with a multithreading strategy suitable for that algorithm: for MCMC, multiple chains are distributed across threads, for SMC, particles are distributed in batches across threads, and for nested sampling, a batch of samples is generated at each iteration by all threads which are then re-used in subsequent nested sampling iterations.
The model parsing tool (mdlparse) is the second component of BCM. It can be used to generate the prior and likelihood files for the inference tool. The parsing tool reads a model description file that specifies the model, comprising the prior, likelihood and data references, and it outputs C++ source code for a dynamic library that evaluates the prior and likelihood function with the relevant data. This C++ code can then be used as a basis for further modification; or it can be directly compiled into a dynamic library. The input model description file uses a custom format with an easy-to-read syntax. An excerpt of a model description file is shown in Fig. 1b. The use of the model parsing tool is optional and it is meant as an aid in model specification rather than as a comprehensive tool capable of fully specifying all types of models.
Finally, a script is provided to load the output of the inference tool into R for further analysis and for visualization of the results. This script can be used to display kernel density estimates of the posterior probability distribution of the sampled variables, as well as to make plots for visual posterior predictive checking; examples of both of these are shown in Figs. 3 and 4. Basic functionality for convergence diagnostics is included as well, including autocorrelation functions and trace plots. Functions for conversion of the results to CODA objects [24] and to ggmcmc objects [25], two R packages for MCMC convergence diagnostics and output analysis, are also provided.

Analytically tractable example
To showcase BCM, and to explore how each class of algorithms deals with increasing dimensionality and complex distributions, we first analyzed a problem which is analytically tractable: the Gaussian shells problem described in [5,26]. While this example is not directly relevant for systems biology, its likelihood function is multimodal and ridge-shaped, resembling the likelihoods often encountered in systems biology models. The likelihood function for this Gaussian shells problem is given by where r = 2, w = 0.1, and Θ and c i are n-dimensional vectors. Θ is the vector of variables which are to be sampled and c i are two constant vectors describing the centers of the two peaks and are assigned the values c 1,x = 3.5, c 2,x = −3.5 and 0 in the other dimensions. This likelihood function is then composed of two narrow, well-separated, ring-shaped peaks (Fig. 2a), which is a challenging sampling problem.
We tested three sampling algorithms on this problem, one from each class of sampling algorithms: feedbackoptimized parallel-tempered Markov chain Monte Carlo (FOPTMC) [12], sequential Monte Carlo (SMC) [8] with the automated temperature schedule selection of [15] but without using Approximate Bayesian Computation, and MultiNest [5].
As shown in Table 2, all three algorithms give the correct estimate for the marginal likelihood within the error bounds. When the number of dimensions is 10 or fewer, MultiNest is extremely efficient: it requires the fewest likelihood evaluations while achieving the tightest error bound. When the number of dimensions is increased beyond 10 however, MultiNest becomes very inefficient. At this point the exponential scaling of the algorithm becomes apparent. In the higher-dimensional setting, the SMC algorithm deals with this problem most efficiently. FOPTMC is least efficient: it requires the largest number of likelihood evaluations and has the largest error bound. FOPTMC can still effectively explore the posterior distribution (as shown in Fig. 2b), however, the temperature schedule of the parallel chains in FOPTMC is optimized for exploration of the posterior rather than for estimation of the marginal likelihood and as a result there is an increasingly large error in the marginal likelihood estimate at higher dimensionality. Parallel tempering [10] Adaptive proposals [11] Feedback-optimized temperatures [12] Thermodynamic integration [13] Automated parameter blocking [14] Sequential Monte Carlo [8] MCMC proposals [8] Kernel density estimate proposals [8] Automated temperature schedule [15] Nested sampling [9] MCMC proposals [9] Ellipsoid proposals [16] MultiNest [5] Kinetic ordinary differential equation model Having explored the behavior of several sampling algorithms in an analytically tractable example, we now illustrate the use of BCM for analyzing biological computational models. As an example of this, we investigated the inference of the parameters of a model based on a system of ordinary differential equations (ODEs). The 6-variable cell cycle model of Tyson [27] was used, as downloaded from BioModels [17]. A graphical representation of this model is shown in Fig. 3a.
To recreate a typical setting in biology, data was generated from the model at six time points for two observables with three replicates (see Additional file 1). Then BCM was used to infer all 16 parameters of the model (10 kinetic parameters and 6 initial conditions) from these 36 data points. The priors for the kinetic parameters were set to a uniform distribution spanning an order of magnitude on either side of the parameter values that were used to generate the data, and the priors for the initial conditions were set to a uniform distribution between 0 and 1 (see blue curves in Fig. 3c). The likelihood function was set equal to the one that generated the data, that is, a normal distribution with standard deviation 0.05. The following algorithms were used: FOPTMC feedback-optimized parallel-tempered Markov Chain Monte Carlo [12], SMC automated-temperature sequential Monte Carlo but without ABC approximation [15], and MultiNest [5].  Despite the small size of the model, this inference problem is challenging. Firstly, the ODE system is stiff, and even with the use of an implicit ODE solver it is costly to simulate. Secondly, there are multiple distinct ways in which the model can fit the data, leading to sub-optimal modes in the posterior distribution. Thus, a sampler must be able to escape these local optima, and it must be able to converge to the correct posterior distribution with a limited number of likelihood evaluations due to the computational cost of the simulations.
Four sampling algorithms were tested on this problem: SMC, MultiNest, FOPTMC (now extended with automated parameter blocking [14]), and additionally nested sampling with MCMC proposals (Nested-MCMC) was added as an alternative nested sampling strategy. In this inference task, FOPTMC with automated parameter blocking was most efficient, requiring 14 h to generate 2000 samples from the posterior. SMC required 19 h, while Nested-MCMC required 30 h and MultiNest had to be discontinued as the acceptance rate quickly dropped to essentially zero. The tests were performed using 16 threads on an Intel Xeon E5-2680 processor.
The Bayesian estimates of the parameters and the trajectories of the species can be used to study the uncertainty in the model. Figure 3b shows the posterior distribution of the two observables, as well as of two inferred species for which no observable data was generated, as estimated by FOPTMC. We can see that the data are sufficient to constrain the trajectories of the observed species. For the unobserved species phosphorylated cyclin, the overall trajectory can also be inferred. Nevertheless, for this unobserved species, the second peak is more variablehere the data is insufficient to constrain the precise magnitude of the peak. For the other unobserved species, unphosphorylated cyclin, we see that there is greater uncertainty. The posterior distribution indicates only that the average levels are low, but the precise levels cannot be inferred from these data. Figure 3c shows the marginal posterior probability distributions of the parameters. It can be seen that for all parameters, the values used to generate the data fall within areas of non-zero probability of the posterior. In most cases the data-generation values also have maximum posterior probability, but interestingly this is not true for all parameters, such as for the activation and deactivation of Cdc2. Furthermore, some parameters are not identifiable, for example the rates of phosphorylation and desphosphorylation of Cdc2 cannot be determined from the data. In general, such lack of identifiability could be for structural reasons, that is, the parameters cannot be inferred in theory given the observed species, due to a redundant parameterization. Alternatively, the parameters may be identifiable in theory, but the data may provide insufficient information to constrain the parameters in practice.
Overall, the Bayesian estimates provide useful measures of the uncertainty in parameter values, model fit and model predictions.

Comparison with existing software packages
There are several software packages which can perform Bayesian inference of the parameters of ODE-based models: BioBayes [21], ABC-SysBio [22], SYSBIONS [23] and Stan [20]. BioBayes uses parallel-tempered Markov Chain Monte Carlo, ABC-SysBio uses sequential Monte Carlo sampling in combination with Approximate Bayesian Computation, SYSBIONS uses nested sampling, and Stan uses Hamiltonian Monte Carlo and the No-U-Turn sampler (NUTS).
To compare BCM with these software packages, a simplified version of the previous inference problem was used. Instead of inferring all 16 parameters, the initial conditions and 4 of the 10 kinetic parameters were fixed to the values used to generate the data, leaving 6 parameters to be inferred. Figure 4a shows the marginal posterior probability distributions of the simplified problem, as estimated by BCM using FOPTMC (see Additional file 2: Figure S1 for the posteriors estimated by each algorithm/software package). The other software packages were optimized for this problem as much as possible to give a fair comparison (see Additional file 1). Figure 4b shows the time required to generate 1000 samples from the posterior with each software package and algorithm, using eight threads on an Intel Xeon E5-2680 processor. It is clear that BCM is significantly faster than the other software packages. In particular the MultiNest algorithm in BCM is extremely efficient in this low-dimensional setting, requiring only 75 s. The other algorithms in BCM required between 25 and 50 min, except for ellipsoidal nested sampling which required three hours. From the other software packages, only SYSBIONS and Stan were able to solve this inference problem in a reasonable amount of time. SYS-BIONS required five hours using Nested-MCMC, which is approximately six times longer than BCM with the same algorithm. For Stan, using the NUTS algorithm, the sampling with a chain does not always converge as the NUTS algorithm does not have a means to escape sub-optimal modes. This problem was addressed by starting eight separate chains in parallel, in which case most of the chains were sampling the correct, optimal mode. In this case, Stan required approximately six hours to generate the requested samples. BioBayes was able to reach apparent convergence in 4.5 days. For ABC-SysBio, and SYSBIONS using ellipsoidal sampling, the samplers did not reach convergence in 7 days (see Additional file 1).

Conclusion
The BCM toolkit provides efficient, multithreaded implementations of eleven sampling algorithms for generating posterior samples and calculating marginal likelihoods. Additional tools are included which facilitate the process of specifying models and visualizing the sampling output. This toolkit can be used for analyzing the uncertainty in the parameters and the predictions of computational models using Bayesian statistics.
The examples show that it depends on the problem which sampling algorithm will perform well. In the Gaussian shells example, where the focus was on marginal likelihood estimation, MultiNest performed best in a low-dimensional setting, and in the medium-to high dimensional setting sequential Monte Carlo was most efficient. In the cell cycle example, where the focus was on parameter inference, parallel-tempered Markov chain Monte Carlo was more efficient than sequential Monte Carlo. There are various aspects of the posterior probability distribution which affect the performance of the different algorithms; for example the number of modes, how well the shapes of the modes are approximated by the proposal distributions, and the location and volume of the posterior modes with respect to the prior. These features of the posterior probability distribution will typically not be known for the problem of interest before starting the analysis, and it is then unclear which algorithm might be most suitable. The availability of various algorithms in BCM will therefore be useful in the Bayesian analysis of diverse models.
In the second example, we have shown that BCM can be used to infer the parameters of an ODE-based model of the cell cycle. BCM is significantly more efficient in this task than existing software packages. This increase in efficiency was possible due to the parallelization of the sampling algorithms in combination with the use of optimized C++ as programming language. Due to the higher efficiency, BCM allows the analysis of larger or more challenging computational models than was previously feasible. In previous cases where Bayesian analysis of complex biological computational models was done, such as in [3,4,28], sampling algorithms were newly implemented for each project. The availability of BCM as an efficient, reusable software package can help in streamlining such projects in the future.