 Methodology article
 Open Access
 Published:
Optimal experiment design for model selection in biochemical networks
BMC Systems Biology volume 8, Article number: 20 (2014)
Abstract
Background
Mathematical modeling is often used to formalize hypotheses on how a biochemical network operates by discriminating between competing models. Bayesian model selection offers a way to determine the amount of evidence that data provides to support one model over the other while favoring simple models. In practice, the amount of experimental data is often insufficient to make a clear distinction between competing models. Often one would like to perform a new experiment which would discriminate between competing hypotheses.
Results
We developed a novel method to perform Optimal Experiment Design to predict which experiments would most effectively allow model selection. A Bayesian approach is applied to infer model parameter distributions. These distributions are sampled and used to simulate from multivariate predictive densities. The method is based on a kNearest Neighbor estimate of the Jensen Shannon divergence between the multivariate predictive densities of competing models.
Conclusions
We show that the method successfully uses predictive differences to enable model selection by applying it to several test cases. Because the design criterion is based on predictive distributions, which can be computed for a wide range of model quantities, the approach is very flexible. The method reveals specific combinations of experiments which improve discriminability even in cases where data is scarce. The proposed approach can be used in conjunction with existing Bayesian methodologies where (approximate) posteriors have been determined, making use of relations that exist within the inferred posteriors.
Background
Developing computational models of biochemical networks is complicated by the complexity of their interaction mechanisms [1–8]. Typically, hypotheses on how the system operates are formalized in the form of computational models [9–12]. These models are subsequently calibrated to experimental data using inferential techniques [13–19]. Despite the steady increase in data availability originating from new quantitative experimental techniques, the modeler is often faced with the problem that several different model topologies can describe the measured data to an acceptable degree [20–22]. The uncertainty associated with the predictions hinders the investigator when trying to make a clear distinction between competing models. In such cases, additional data is required. Optimal Experiment Design (OED) methods can be used to determine which experiments would be most useful [23]. These methods typically involve specifying an optimality criterion or design aim and finding the experiment that most effectively attains this goal while considering the current parameter uncertainty. Existing methods of OED for model selection are usually based on assuming an uncertainty distribution around best parameter estimates [24, 25] or model linearization [26]. Due to the nonlinearity of the model and the nonGaussian shape of the parameter distribution, these methods are rarely appropriate for Systems Biology models [27] (See Figure 1 for an example of the effect of model linearization, and how it can skew predictive distributions in cases of large parameter uncertainty). In this work, we employ a Bayesian approach using the Posterior Predictive Distribution (PPD) which directly reflects the prediction uncertainty and accounts for both model nonlinearity and nonGaussianity of the parameter distribution. PPDs are defined as distributions of new observations conditioned on the observed data. Samples from the PPD can be obtained by drawing from the posterior parameter probability distribution and simulating predictions for each parameter set. By simulating a sample from the PPDs for all experimentally accessible moieties and fluxes, differences between models can be explored [28].
Previously, predictive distributions have been used to perform experiment design targeted at reducing the uncertainty of specific predictions [29–31]. In the field of machine learning, optimal experiment design based on informationtheoretic considerations is typically referred to as active learning [32]. In the neurosciences, the Bayesian inversion and selection of nonlinear states space models is known as dynamic causal modelling (DCM). Although DCM is dominated by variational (approximate) Bayesian model inversion  the basic problems and ensuing model selection issues are identical to the issues considered in this work. In DCM, the issue of optimising experimental design focuses on the LaplaceChernoff risk for model selection and its relationship with classical design optimality criteria. Daunizeau et al. (2011) consider the problem of detecting feedback connections in neuronal networks and how this depends upon the duration of design stimulation [33]. We will consider a similar problem in biochemical networks  in terms of identifying molecular interactions and when to sample data. We present a method to use samples from simulated predictive distributions for selecting experiments useful for model selection. Considering the increased use of Bayesian inference in the field [14, 34–39], this approach is particularly timely since it enables investigators to extract additional information from their inferences.
In a Bayesian setting, model selection is typically based on the Bayes factor, which measures the amount of evidence the data provides for one model over another [40, 41]. For every pair of models, a Bayes factor can be computed, defined as the ratio of their integrated likelihoods. One advantage of the Bayes factor is that it automatically penalizes unnecessary model complexity in light of the experimental data. It therefore reduces the risk of unwarranted model rejections. This penalization occurs because more parameters or unnecessarily wide priors lead to a lower weighting of the high likelihood region. This is illustrated in Figure 2.
What the Bayesian selection methodology does not provide, however, is a means to determine which experiment would optimally increase the separation between models. Determining which measurements to perform in order to optimally increase the Bayes factor in favor of the correct model is a difficult task. We propose a method which allows ranking combinations of new experiments according to their efficacy at increasing the Bayes factors which point to the correct model. Predictions whose distributions do not overlap between competing models are good measurement candidates [42, 43]. Often distributions for a single prediction show a large degree of overlap, hampering a decisive outcome. Fortunately, PPDs also contain information on how model predictions are related to each other. The relations between the different prediction uncertainties depend on both the data and the model. Differences in these interprediction relations between competing models can be probed and used (see Figure 3). We quantify these differences in predictive distributions by means of the Jensen Shannon divergence (JSD) [44].
There are many design parameters that one could optimize. In this paper, we focus on a simple example: namely, which system variable should be measured and at which time point. We argue that by measuring those time points at which the models show the largest difference in their predictive distributions, large improvements in the Bayes factors can be obtained. By applying the methodology on an analytically tractable model, we show that the JSD is nearly monotonically related to the predicted change in Bayes factor. Subsequently, the Jensen Shannon divergence is computed between predictions of a nonlinear biochemical network. Since each model implies different relations between the predictive distributions, certain combinations of predictions lead to more discriminability than others. The method serves as a good predictor for effective experiments when compared to the obtained Bayes factors after the measurements have been performed. The approach can be used to design multiple experiments simultaneously, revealing benefits that arise from combinations of experiments.
Methods
Consider biochemical networks that can be modeled using a system of ordinary differential equations. These models comprise of equations $f\left(\overrightarrow{x}\right(t),\overrightarrow{u}(t),\overrightarrow{p})$ which contain parameters $\overrightarrow{p}$ (constant in time), inputs $\overrightarrow{u}\left(t\right)$ and state variables $\overrightarrow{x}\left(t\right)$. Given a set of parameters, inputs, and initial conditions $\overrightarrow{x}\left(0\right)$, these equations can be simulated. Measurements $\overrightarrow{y}\left(t\right)=g\left(\overrightarrow{x}\right(t),\overrightarrow{q},\overrightarrow{r})$ are performed on a subset and/or a combination of the total number of state variables in the model. Measurements are hampered by measurement noise $\overrightarrow{\epsilon}$, while many techniques used in biology necessitate the use of scaling and offset parameters $\overrightarrow{r}$[45]. The vector $\overrightarrow{\theta}$, defined as $\overrightarrow{\theta}=\{\overrightarrow{p},\overrightarrow{r},{\overrightarrow{x}}_{0}\}$, lists all the required quantities to simulate the model. The parameters $\overrightarrow{q}$ determine the experimental design and could include differences in when various responses are measured or the mapping from hidden model states $\overrightarrow{x}$ to observed responses $\overrightarrow{y}$. We will refer to these as ‘design parameters’ that are, crucially, distinguished from model parameters $\overrightarrow{\theta}$. Design parameters are under experimental control and determine the experimental design. In what follows, we try to optimise the discriminability of models in terms of Bayesian model comparison by optimizing an objective function with respect to $\overrightarrow{q}$. In the examples, we will consider $\overrightarrow{q}$ as the timing of extra observations.
To perform inference and experiment design, an error model is required. Considering R time series of length N _{1}, N _{2} … N _{ R }, hampered by independent noise, one obtains the equation:
where M _{ i } indicates a model and y ^{D} the observed data. The parameters are given by $\overrightarrow{\theta}$, while ${y}_{k}^{D}\left({t}_{j}\right)$ indicates the value of a data point of state variable k at time j, respectively.
Predictive Distributions
Posterior Predictive Distributions (PPDs) are defined as distributions of new observations conditioned on the observed data. They correspond to the predicted distribution of future experiments, considering the current model assumptions and uncertainty. To obtain a sample from these predictive distributions, we propagate the uncertainty from the data to the predictions. By specifying prior distributions on the parameters and applying Bayes rule, it is possible to define a posterior distribution over the parameters. The posterior parameter probability distribution $p\phantom{\rule{1pt}{0ex}}\left(\overrightarrow{\theta}\right\phantom{\rule{2pt}{0ex}}{y}^{D})$ is given by normalizing the likelihood multiplied with the prior to a unit area:
where $p\left({y}^{D}\right\phantom{\rule{2pt}{0ex}}\overrightarrow{\theta})$ is the distribution of the observed data given parameters $\overrightarrow{\theta}$. The parameter prior distributions $p\left(\overrightarrow{\theta}\right)$ typically reflect the prior uncertainty associated with the parameters. To sample from the posterior parameter distribution, one needs to verify that the posterior distribution is proper. This can be checked by profiling the different parameters and determining whether the likelihood times the prior does not flatten out [28, 46]. After checking whether the posterior distribution of parameters is proper, a sample from this distribution can be obtained using Markov Chain Monte Carlo (MCMC) [22, 28]. MCMC can generate samples from probability distributions whose probability densities are known up to a normalizing factor [47] (see Additional file 1: section S1). A sample of the posterior parameter distribution reflects the uncertainty associated with the parameter values and can subsequently be used to simulate different predictions. The predictive distribution can now be sampled by simulating the model for each of the samples in the posterior parameter distribution and adding noise generated by the associated error model. The latter is required as future observations will also be affected by noise.
Model selection
In a Bayesian setting, model selection is often performed using the Bayes factor [40, 48, 49]. This pivotal quantity in Bayesian model selection expresses the change of relative belief in both models after observing experimental data. By applying Bayes rule to the problem of assigning model probabilities, one obtains:
where P(My ^{D}) represents the probability of model M given observed data y ^{D}, while P(M) and P(y ^{D}) are the prior probabilities of the model and data, respectively. Rather than explicitly computing the model probability, one usually considers ratios of model probabilities, allowing direct comparison between different models. As the prior model probability can be specified a priori (equal if no preference is given), the only quantity that still requires evaluation is P(y ^{D} M), which can be obtained by integrating the likelihood function over the parameters:
The Bayes factor is actually the ratio of these integrated (also named marginal or marginalized) likelihoods and is defined as:
where M _{1} and M _{2} refer to the different models under consideration. Unnecessarily complex models are implicitly penalized due to the fact that these result in a lower weighting of the high likelihood region, which results in a lower value for the integrated likelihood. This is illustrated in Figure 2. This means that maximizing the model evidence corresponds to maintaining an accurate explanation for the data while minimizing complexity [50].
Bounds can be defined where the Bayes factor value becomes decisive for one model over the other. Typically, a ratio of 100:1 is considered decisive in terms of model selection [40, 51]. In dynamic causal modelling, variational methods predominate, usually under the Laplace assumption. This assumes that the posterior density has a Gaussian shape, which greatly simplifies both the integration problems and numerics. Note that assuming a Gaussian posterior over the parameters does not necessarily mean that the posterior predictive distribution over the data is Gaussian (see Figure 1). Computing the required marginal likelihoods is challenging for nonlinear problems where such asymptotic approximations to the posterior distribution are not appropriate. Here one must resort to more advanced methods such as thermodynamic integration (see Additional file 1: section S2) [52] or annealed importance sampling [40]. Though the Bayes factor is a useful method of model selection, determining what to measure in order to improve the Bayes factor in favor of the correct model is a nontrivial problem. As such, it provides a means to perform model selection, but not the optimal selection of data features.
Experiment design
The approach is based on selecting measurements which provide the largest discriminatory power between competing models in terms of their predictive distributions. The design parameter in the proposed methodology corresponds to the choice and timing of a new observation. In other words, we want to determine which observable should be measured when in order to maximize the divergence between the posterior predictive densities, thereby maximally informing our model comparison. This divergence is quantified by means of the Jensen Shannon divergence (JSD) as it provides a measure of the dissimilarity between the probability density functions of competing models. The JSD is defined as the averaged Kullback Leibler divergence D _{ KL } between probability distributions and their mixture:
Here K represents the number of probability densities, p(M _{ i }) the (prior) probability of model M _{ i } and p(y M _{ i }) the Posterior Predictive Distribution. Additionally, this metric is monotonically related to an upper and lower bound of the classification error rate in clustering problems [33, 53] and is bounded between 0 and 1. In the case where the model that generated the data is in the set of competing models, it is analogous to the mutual information between a new measurement (or sample) coming from a mixture of the candidate models and a model classifier (see Additional file 1: section S3). In this study, we opted for comparing models in a pairwise fashion (K=2). This allows us to determine which models are distinguished by a new experiment. Mutual information has been considered before in the context of experimental design for constraining predictions or parameters of interest [29], but not in the setting of model selection. Though appealing for its properties, estimating the Jensen Shannon divergence for one or more experiments requires integration over the predictive densities, since:
Here P and Q are random variables with p and q their associated densities. Considering that only a sample of the PPDs is available, it is required to obtain a density estimate suitable for integration. Density estimation can be approached in two ways: by Kernel Density Estimation (KDE), or by kNearest Neighbor (kNN) density estimation. In Kernel Density Estimation (KDE), an estimate of the density is made by centering normalized kernels on each sample and computing weighted averages. This results in a density estimate with which computations can be performed. The kernels typically contain a bandwidth parameter which is estimated by means of cross validation [54, 55].
For well behaved low dimensional distributions, KDE often performs well. Considering the strongly nonlinear nature of both the parameter and predictive distributions, a Gaussian kernel with constant covariance is not appropriate. As the dimensionality and nonuniformity of the problem increases, more and more weights in the KDE become small and estimation accuracy is negatively affected [56]. Additionally, choosing an appropriate bandwidth by means of crossvalidation is a computationally expensive procedure to perform for each experimental candidate.
With kNearest Neighbor (kNN) density estimation, density is estimated by computing the volume required to include the k nearest neighbors of the current sample [55–57]:
In this equation ${\rho}_{k}\left(\overrightarrow{\theta}\right)$ represents the distance to the k ^{th} nearest neighbor, d the number of dimensions and v _{ d } the volume of the unit ball in ${\mathbb{R}}^{d}$. Furthermore, N denotes the number of included samples and v _{ d } is given by:
where Γ corresponds to the Gamma function. The advantage of using the kNN estimate is that this estimator adapts to the local sampling density, adjusting its volume where sampling is sparse. Note, however, that, similar to the KDE, this estimator also suffers from a loss of accuracy when estimating high dimensional densities. Fortunately, the number of experiments designed simultaneously, and therefore the dimensionality of the density, is typically low. Consider ${\overrightarrow{y}}_{j}^{{M}_{i}}$, a vector of predictions simulated with model M _{ i } and parameter set ${\overrightarrow{\theta}}_{j}$, where each element of the vector corresponds to a different model prediction. A model prediction is defined as a quantity which can be computed by supplying model M _{ i } with parameter set ${\overrightarrow{\theta}}_{j}$ (e.g., a predicted value at a certain time point, a difference between predictions or an area under some predicted curve). For OED purposes, these should be quantities that could potentially be measured. The set of these predicted values coming from model M _{ i } shall be referred to as ${\Omega}_{{M}_{i}}$. Inserting these quantities, the kNN estimate of the JSD becomes:
with Q _{ a,b } given by
Here d denotes the number of elements in ${\overrightarrow{y}}_{j}^{{M}_{i}}$ (the number of predictions included), and ${r}_{k}\phantom{\rule{2pt}{0ex}}\left({x}_{i},{\Omega}^{{M}_{j}}\right)$ corresponds to the Euclidean distance to the k ^{th} nearest neighbor of x _{ i } in ${\Omega}_{{M}_{j}}$. Note that the backslash indicates excluding an element from the set. Using this equation, the JSD can straightforwardly be computed for all possible combinations of experiments and used to rank according to how well these experiments would discriminate between the models. A larger value for the JSD indicates an informative experiment. The last step involves sampling several combinations of measurements and determining the set of experiments which have the greatest JSD. The proposed methodology is depicted in Figure 4.
Testing the method: numerical experiments
To demonstrate the method, a series of simulation studies are performed. Since we know which model generated the data, it is possible to compare to the Bayes factor pointing to the true model. After generating an initial data set using the true model, PPDs are sampled for each of the competing models. As the design variable, we consider the timing of a new measurement. Hence, we look at differences between the predictive distributions belonging to the different models at different timepoints. We use a sample of simulated observables at specific timepoints to compute JSD estimates between the different models. We thereby test whether the JSD estimate can be used to compare different potential experiments. The new experimental data is subsequently included and the JSD compared to the change in Bayes factor in favor of the correct model. Note that this new Bayes factor depends on the experimental outcome and that this approach results in a distribution of predicted Bayes factors. A large change in Bayes factor indicates a useful experiment.
Analytic models
The method is applied to a number of linear regression models. Linear regression models are models of the form:
where $\overrightarrow{\theta}$ represents a parameter vector and B constitutes a design matrix with basis functions B _{ i }(t). Given that σ, the standard deviation of the Gaussian observation error ε, is known and the prior distribution over the parameters is a Gaussian with standard deviation ξ, the mean and covariance matrix of the posterior distribution can be computed analytically (see Additional file 1: section S4). Using linear models avoids the difficult numerical integration commonly required to compute the Bayes factor and makes it possible to perform overarching Monte Carlo studies on how these Bayes factors adjust upon including new experimental data. The analytical expressions make it possible to compare the JSD to distributions of the actual Bayes factors for model selection.
Nonlinear biochemical networks
To further test the methodology, a series of artificial models based on motifs often observed in signaling systems [58, 59] were implemented (see Additional file 1: section S5 for model equations). Artificial data was simulated for M _{1}. Subsequently, inference was performed for all four competing topologies. The difference between each of the models was the origin and point of action of the feedback mechanism (see Figure 5).
Each of the artificial models was able to describe the measured data to an acceptable degree. We used a Gamma distribution with α=1 and β=3 for the prior distributions of the parameters. This prior is relatively noninformative (allowing a large range of parameter values), while not being so vague that the simplest model is always preferred (Lindley’s paradox [40]). Data was obtained using M _{1}. Observables were Bp, of which three replicates were measured, and Dp, of which two replicates were measured. These were measured at t=[ 0,2,5,10,20,40,60,100]. All replicates were simulated by adding Gaussian white noise with a standard deviation of 0.03. The parameter values corresponding to the true system were obtained by running Monte Carlo simulations until a visible overshoot above the noise level was observed. Parameter inference was performed using population MCMC with the noise σ as an inferred parameter. As design variables we consider both the choice of which observable(s) to measure and the time point(s) of the measurement.
Computational implementation
All algorithms were implemented in Matlab (Natick, MA). Numerical integration of the differential equations was performed with compiled MEX files using numerical integrators from the SUNDIALS CVode package (Lawrence Livermore National Laboratory, Livermore, CA). Absolute and relative tolerances were set to 10^{−8} and 10^{−9} respectively. MCMC was performed using a population MCMC approach using N _{ T }=40 chains with a temperature schedule given by ${T}_{n}={\left(\frac{{N}_{T}}{n}\right)}^{4}$[52]. This also permitted computation of the Bayes factors between the nonlinear models by means of thermodynamic integration. The Gaussian proposal distribution for the MCMC was based on an approximation to the Hessian computed using a Jacobian obtained by simulating the sensitivity equations. After convergence, the chain was thinned to 10000 samples. Since the number of experiments designed simultaneously (and therefore the dimensionality of the prediction vectors) was reasonably small (N _{ samples }>>2^{k}), the kNN search was performed using kd trees [60]. The figures in this paper were determined using k=10.
Results and discussion
Analytic models
A series of experiments were performed using linear regression models. To demonstrate the method, consider the following four competing models, where M _{3} is used to generate the data:
The presence of sine waves in M _{3} and M _{4} elicits particularly noticeable patterns in the optimal experiment design matrices. D equidistantly sampled time points were generated as data (including Gaussian additive noise σ). To make sure that the model selection was unsuccessful a priori, these were sampled in a region where the models roughly predict the same behavior. Initially, the Bayes factors were log10(B _{31})=2.0439 (decisive), log10(B _{32})=−0.0554 (pointing to the wrong model) and log10(B _{34})=0.4658 (‘not worth more than a bare mention’ [51]). PPDs were generated for each of the models and used to compute credible predictive intervals that enclose 95% of the predictive density. Bayes factors were computed for each of the models. Since the aim of the design is to successfully select between the models after performing new experiments, the change in Bayes factor in favor of the true underlying model was computed. Since the experimental outcome is not known a priori, a distribution of Bayes factors is predicted:
The expectation is taken with respect to new realizations of the data ${y}_{n}^{D}$. Note that such an overarching estimation would be computationally intractable for a nonlinear model. New experiments can be simulated in two ways. Either by using the correct model with the true parameter values and adding measurement noise or by taking samples from the posterior predictive distribution of the correct model (Δ B ab B). In practice these ‘true’ parameter values are not known and the predictive distribution of the measurement based on the posterior samples provides the best obtainable estimate given the current parameter uncertainty. The change in Bayes factor (in favor of the correct model) was compared to the Jensen Shannon divergence between the competing models. Large predicted changes indicate that the experiment would result in a successful selection. As for the JSD, a large value indicates a large distance between the joint predictive distributions, marking the measurement as useful. See Figure 6 for an example of the analysis results. As shown in the different panels, different models parameterized on the same data, result in different posterior predictive distributions (dotted lines in the top row). When comparing model 1 and 3 (the true model), we can see that differences in predictions occur mostly beyond the time range previously measured. Whereas model 1 predicts a straight line, the true underlying system deviates from a single line. Consider two new measurements. From the differences in PPDs, it is clear that measuring beyond the region where data is available would lead to a decisive choice against model 1 as corroborated by the large Bayes factor updates shown in the left plot on the middle row (B _{13}). It can also be seen that the PPDs differ more for negative time than positive time. Therefore the area which is decisive is larger for negative time. The JSDs follow this same pattern. The PPD for model 2 is less different from the PPD the true model would have generated. For the simulations coming from model 2, we can see that the value for positive and negative time is correlated. For model 3, these prediction values are negatively correlated. Consequently, performing one measurement for negative time and one for positive time would lead to the largest discriminatory power.
The JSD agrees well with the actual Bayes factor updates as shown in the third row of Figure 6. Interestingly, all the designs based on the JSD result in designs that would effectively discriminate between the true model and its competitors without having to specify a true model a priori. Subsequently, a Monte Carlo study was performed where a large number of random models were generated and compared. Plotting the relationship between the updated Bayes factors upon a new experiment and the corresponding JSD typically reveals a monotonic relationship that underlines its usefulness as a design criterion (see Figure 7 for two typical examples). These analyses were performed for a large number of randomly generated linear models. The Spearman correlation coefficient between the JSD and the expected Bayes factor averaged at 0.91 for the experiments we performed (see Additional file 1: section S6).
Nonlinear models
In Figure 8, we show the different predictions after performing model inference. Two sets of PPDs were simulated for two experimental conditions. These sets mimic two different concentrations of a signaling molecule, and have been implemented by setting the stimulus u to either 1 or 2. To test the effect of measuring multiple predictions, divergence estimates were computed for a large number of different combinations of two measurements. These results are shown in Figure 9. Each subplot corresponds to a different model comparison. The axis of each subplot is divided into ten sections corresponding to different predictions. Within each section, the axis represents time. The color value indicates the JSD, where a large value indicates a lot of separation and therefore a good measurement. Note the bright squares corresponding to the concentration of BpCp in each of the models. These high efficacies are not surprising considering that the PPDs show large differences between the models for these concentrations (See Figure 8). Also noticeable is that many of the experiments on the same predictions reveal dark diagonals within each tile. Measuring the same concentration twice typically adds fewer predictive constraints than measuring at two different time points, unless the second measurement is performed using a different concentration of signaling molecule (note how the diagonal lights up on the combination of measuring BpCp in condition 1 and 2 when selecting between model 3 and 4). Interestingly, measuring certain states during the overshoot is highly effective (Bp and Dp for any comparison), while the overshoot is less informative for others (Bp under stimulus 1 and 2 for discriminating between M _{1} and M _{2}). All in all, the information contained in such a matrix is very valuable when it comes to selecting from a small list of experiments. For example, we can also see that considering the current predictive distributions, model 2 and 4 can barely be distinguished. This implies that in order to actually distinguish between these two, a different experiment is required. Such a new experiment could, again, be evaluated by generating a new competing set of PPDs.
To test the results, in silico experiments have been performed by simulating new data from the true model and determining the Bayes factor change upon including this data. Bayes factors were estimated using thermodynamic integration (see Additional file 1: section S2). The calculation of each set of four marginal likelihoods took about 6 days of wallclock time on an Intel i7 CPU (2.93 GHz) with MATLAB R2010a. To validate the method, experiments are selected where differences between models are expected. The following experiments were performed:

1. Steady state Cp and BpCp concentration

2. Bp and Dp during the peak in the second condition (u=2)

3. Steady state Cp
Experiment 1 should differentiate between M _{1} and M _{3} (D _{13}≈0.49), but not between M _{1} and M _{2} or M _{1} and M _{4}. Experiment 2 should give discriminatory power for all models (D _{12}≈0.48,D _{13}≈0.54,D _{14}≈0.61). Experiment 3 should not provide any additional discriminatory power at all. The results of these analyses are shown in Table 1. As predicted, experiment 1 leads only to an increase in discriminatory power between model M _{1} and M _{3}. Experiment 2 improves the discriminatory power between all the models, while experiment 3 even reveals a decrease in discriminatory power for model 1 and 2. Noteworthy is also the large variance observed for experiment 3, which is likely related to the large variance in the steady state predictions of Cp. Again, the predictions based on the JSD are well in line with the Bayes factors obtained.
Conclusions
This paper describes a method applicable to performing experiment design with the aim of differentiating between various hypotheses. We show by means of a simulation study on analytically tractable models that the JSD is approximately monotonically related to the expected change in Bayes factor in favor of the model that generated the data (considering the current uncertainty in its parameters). This monotonic relation is useful, because it implies that the JSD can be used as a predictor of the change in Bayes factor. The applicability to nonlinear models of biochemical reaction networks was demonstrated by applying it to models based on motifs previously observed in signaling networks [58, 59]. Experiments were designed for distinguishing between different feedback mechanisms.
Though forecasting a predictive distribution of Bayes factors has been suggested [61], the implicit penalization of model complexity could have adverse consequences. The experiment design could suggest a measurement where the probability densities of two models overlap. When this happens, both models can describe the data equally well, which leads to an implicit penalization of the more complex model (since it allows for more varied predictions due to its added freedom). This penalization can then be followed by subsequent selection (of the simpler model). Though a decisive selection occurs, such an experiment would not provide additional insight however. In [61], this is mitigated by determining the evidence in favor of a more complex model. Moreover, computing the predictive distributions of Bayes factors required for this approach is computationally intractable for nonlinear models that are not nested. By focusing on differences in predictive distributions, both these problems are mitigated, making it is possible to pinpoint where the different models predict different behavior. Aside from their usefulness in model selection, such predictive differences could also be attributed to the different mechanisms present in the different models. This allows for followup studies to investigate whether these are either artificial or true system behavior.
A complicating factor in this method is the computational burden. The largest challenge to overcome is to obtain a sample from the posterior parameter distribution. Running MCMC on high dimensional problems can be difficult. Fortunately, recent advances in both MCMC [19, 62] as well as approximate sampling techniques [39, 48, 63, 64] allow sampling parameter distributions of increasingly complex models [14, 34–38]. The bottleneck in computing the JSD resides in searching for the k ^{th} nearest neighbor. A subproblem which occurs in many different problems and for which computationally faster solutions exist [65, 66]. An attractive aspect of this methodology is that it is possible to design multiple experiments at once. However, the density estimates typically become less accurate as the number of designed experiments increases (see Additional file 1: section S8). Therefore, we recommend starting with a low number of experiments (two or three) and gradually adding experiments while the JSD is low. Density estimation can also be problematic when the predictions vary greatly in their dispersion. When considering nonnegative quantities such as concentrations, logtransforming the predictions may alleviate problems. Finally, the number of potential combinations of experiments increases exponentially with the number of experiments designed. It is clear that this rapidly becomes infeasible for large numbers of experiments. However, it is not necessary to fill the entire experimental matrix and techniques such as Sequential Monte Carlo sampling could be considered as an alternative to more effectively probe this space. We revert the reader to Additional file 1: section S7 for a proof of principle implementation of such a sampler.
One additional point of debate is the weighting of each of the models in the mixture distribution used to compute the JSD. It could be argued that it would be more sensible to weight models according to their model probabilities by determining the integrated likelihoods of the data that is already available. The reason for not doing this is twofold. Firstly, the computational burden this adds to the experimental design procedure is significant. More importantly however, the implicit weighting in favor of parsimony could strongly affect the design by removing models which are considered unnecessarily complex at this stage of the analysis. When designing new experiments, the aim is to obtain measurements that allow for optimal discrimination between the predictive distributions under the different hypotheses. Optimal discrimination makes it sensible to consider the models equally probable a priori.
The method has several advantages that are particularly useful for modeling biochemical networks. Because the method is based on sampling from the posterior parameter probability distribution, it is particularly suitable when insufficient data is available to consider Gaussian parameter probability distributions or model linearisations. Additionally, it allows incorporation of prior knowledge in the form of prior parameter probability distributions. This is useful when the available data contains insufficient constraints to result in a well defined posterior parameter distribution. Because the design criterion is based on predictive distributions and such distributions can be computed for a wide range of model quantities, the approach is very flexible. In biochemical research, in vivo measurements are often difficult to perform and practical limitations of the various measurement technologies play an important role. In many cases measurements on separate components cannot be performed and measurements result in indirect derived quantities. Fortunately, in the current framework such measurements can be used directly since distributions of such experiments can be predicted.
Moreover, the impact of specific combinations of experiments can be assessed by including them in the design simultaneously which reveals specific combination of measurements that are particularly useful. This way, informative experiments can be distinguished from noninformative ones and the experimental efforts can be targeted to discriminate between competing hypotheses.
Availability
Source code is available at: http://cbio.bmt.tue.nl/sysbio/software/pua.html.
References
 1.
Tiemann C, Vanlier J, Hilbers P, van Riel N: Parameter adaptations during phenotype transitions in progressive diseases. BMC Syst Biol. 2011, 5: 17410.1186/175205095174.
 2.
van Riel NA, Tiemann CA, Vanlier J, Hilbers PA: Applications of analysis of dynamic adaptations in parameter trajectories. Interface Focus. 2013, 3 (2): 2012008410.1098/rsfs.2012.0084.
 3.
Schmitz J, Van Riel N, Nicolay K, Hilbers P, Jeneson J: Silencing of glycolysis in muscle: experimental observation and numerical analysis. Exp Physiol. 2010, 95 (2): 380397. 10.1113/expphysiol.2009.049841.
 4.
Schilling M, Maiwald T, Hengl S, Winter D, Kreutz C, Kolch W, Lehmann W, Timmer J, Klingmüller U: Theoretical and experimental analysis links isoformspecific ERK signalling to cell fate decisions. Mol Syst Biol. 2009, 5: 334
 5.
Borisov N, Aksamitiene E, Kiyatkin A, Legewie S, Berkhout J, Maiwald T, Kaimachnikov N, Timmer J, Hoek J, Kholodenko B: Systemslevel interactions between insulin–EGF networks amplify mitogenic signaling. Mol Syst Biol. 2009, 5: 256
 6.
Cedersund G, Roll J, Ulfhielm E, Danielsson A, Tidefelt H, Strålfors P: Modelbased hypothesis testing of key mechanisms in initial phase of insulin signaling. PLoS Comput Biol. 2008, 4 (6): 799806.
 7.
Koschorreck M, Gilles E: Mathematical modeling and analysis of insulin clearance in vivo. BMC Syst Biol. 2008, 2: 4310.1186/17520509243.
 8.
Schoeberl B, EichlerJonsson C, Gilles E, Müller G: Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors. Nat Biotechnol. 2002, 20 (4): 370375. 10.1038/nbt0402370.
 9.
Jeneson J, Westerhoff H, Kushmerick M: A metabolic control analysis of kinetic controls in ATP free energy metabolism in contracting skeletal muscle. Am J PhysiolCell Physiol. 2000, 279 (3): C813C832.
 10.
Wu F, Jeneson J, Beard D: Oxidative ATP synthesis in skeletal muscle is controlled by substrate feedback. Am J PhysiolCell Physiol. 2007, 292: C115C124.
 11.
Groenendaal W, Schmidt K, von Basum G, van Riel N, Hilbers P: Modeling glucose and water dynamics in human skin. Diabetes Technol Ther. 2008, 10 (4): 283293. 10.1089/dia.2007.0290.
 12.
Vanlier J, Wu F, Qi F, Vinnakota K, Han Y, Dash R, Yang F, Beard D: BISEN: Biochemical simulation environment. Bioinformatics. 2009, 25 (6): 836837. 10.1093/bioinformatics/btp069.
 13.
Vanlier J, Tiemann C, Hilbers P, van Riel N: Parameter uncertainty in biochemical models described by ordinary differential equations. Math Biosci. 2013, 246 (2): 305314. 10.1016/j.mbs.2013.03.006.
 14.
Klinke D: An empirical Bayesian approach for modelbased inference of cellular signaling networks. BMC Bioinformatics. 2009, 10: 37110.1186/1471210510371.
 15.
Taylor H, Barnes C, Huvet M, Bugeon L, Thorne T, Lamb J, Dallman M, Stumpf M: Calibrating spatiotemporal models of leukocyte dynamics against in vivo liveimaging data using approximate Bayesian computation. Integr Biol. 2012, 4 (3): 335345. 10.1039/c2ib00175f.
 16.
Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmüller U, Timmer J: Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics. 2009, 25 (15): 192310.1093/bioinformatics/btp358.
 17.
Hasenauer J, Waldherr S, Wagner K, Allgower F: Parameter identification, experimental design and model falsification for biological network models using semidefinite programming. Syst Biol IET. 2010, 4 (2): 119130. 10.1049/ietsyb.2009.0030.
 18.
Brännmark C, Palmér R, Glad S, Cedersund G, Strålfors P: Mass and information feedbacks through receptor endocytosis govern insulin signaling as revealed using a parameterfree modeling framework. J Biol Chem. 2010, 285 (26): 2017110.1074/jbc.M110.106849.
 19.
Girolami M, Calderhead B: Riemann manifold langevin and hamiltonian monte carlo methods. J R Stat Soci: Series B (Stat Methodol). 2011, 73 (2): 123214. 10.1111/j.14679868.2010.00765.x.
 20.
Cedersund G, Roll J: Systems biology: model based evaluation and comparison of potential explanations for given biological data. FEBS J. 2009, 276 (4): 903922. 10.1111/j.17424658.2008.06845.x.
 21.
Müller T, Faller D, Timmer J, Swameye I, Sandra O, Klingmüller U: Tests for cycling in a signalling pathway. J R Stat Soc: Series C (Appl Stat). 2004, 53 (4): 557568. 10.1111/j.14679876.2004.05148.x.
 22.
Calderhead B, Girolami M: Statistical analysis of nonlinear dynamical systems using differential geometric sampling methods. Interface Focus. 2011, 1 (6): 821835. 10.1098/rsfs.2011.0051.
 23.
Tegnér J, Compte A, Auffray C, An G, Cedersund G, Clermont G, Gutkin B, Oltvai Z, Stephan K, Thomas R: Computational disease modeling–fact or fiction?. BMC Syst Biol. 2009, 3: 5610.1186/17520509356.
 24.
Skanda D, Lebiedz D: An optimal experimental design approach to model discrimination in dynamic biochemical systems. Bioinformatics. 2010, 26 (7): 93910.1093/bioinformatics/btq074.
 25.
Steiert B, Raue A, Timmer J, Kreutz C: Experimental Design for Parameter Estimation of Gene Regulatory Networks. PloS one. 2012, 7 (7): e4005210.1371/journal.pone.0040052.
 26.
Casey F, Baird D, Feng Q, Gutenkunst R, Waterfall J, Myers C, Brown K, Cerione R, Sethna J: Optimal experimental design in an epidermal growth factor receptor signalling and downregulation model. Syst Biol IET. 2007, 1 (3): 190202. 10.1049/ietsyb:20060065.
 27.
Kreutz C, Raue A, Timmer J: Likelihood based observability analysis and confidence intervals for predictions of dynamic models. BMC Syst Biol. 2012, 6: 12010.1186/175205096120.
 28.
Vanlier J, Tiemann C, Hilbers P, van Riel N: An integrated strategy for prediction uncertainty analysis. Bioinformatics. 2012, 28 (8): 11301135. 10.1093/bioinformatics/bts088.
 29.
Liepe J, Filippi S, Komorowski M, Stumpf MP: Maximizing the Information Content of Experiments in Systems Biology. PLOS Comput Biol. 2013, 9: e100288810.1371/journal.pcbi.1002888.
 30.
Vanlier J, Tiemann C, Hilbers P, van Riel N: A Bayesian approach to targeted experiment design. Bioinformatics. 2012, 28 (8): 11361142. 10.1093/bioinformatics/bts092.
 31.
King RD, Whelan KE, Jones FM, Reiser PG, Bryant CH, Muggleton SH, Kell DB, Oliver SG: Functional genomic hypothesis generation and experimentation by a robot scientist. Nature. 2004, 427 (6971): 247252. 10.1038/nature02236.
 32.
MacKay DJ: Informationbased objective functions for active data selection. Neural Comput. 1992, 4 (4): 590604. 10.1162/neco.1992.4.4.590.
 33.
Daunizeau J, Preuschoff K, Friston K, Stephan K: Optimizing experimental design for comparing models of brain function. PLoS Comput Biol. 2011, 7 (11): e100228010.1371/journal.pcbi.1002280.
 34.
Hug S, Raue A, Hasenauer J, Bachmann J, Klingmüller U, Timmer J, Theis F: Highdimensional Bayesian parameter estimation: case study for a model of JAK2/STAT5 signaling. Math Biosci. 2013, 246 (2): 293304. 10.1016/j.mbs.2013.04.002.
 35.
Finley S, Gupta D, Cheng N, Klinke D: Inferring relevant control mechanisms for interleukin12 signaling in naïve CD4+; T cells. Immunol Cell Biol. 2010, 89: 100110.
 36.
Konukoglu E, Relan J, Cilingir U, Menze B, Chinchapatnam P, Jadidi A, Cochet H, Hocini M, Delingette H, Jaïs P: Efficient probabilistic model personalization integrating uncertainty on data and parameters: Application to eikonaldiffusion models in cardiac electrophysiology. Prog Biophys Mol Biol. 2011, 107: 134146. 10.1016/j.pbiomolbio.2011.07.002.
 37.
Xu T, Vyshemirsky V, Gormand A, von Kriegsheim A, Girolami M, Baillie G, Ketley D, Dunlop A, Milligan G, Houslay M: Inferring signaling pathway topologies from multiple perturbation measurements of specific biochemical species. Sci Signal. 2010, 3 (113): ra20
 38.
Kalita MK, Sargsyan K, Tian B, PaulucciHolthauzen A, Najm HN, Debusschere BJ, Brasier AR: Sources of celltocell variability in canonical nuclear factorκ B (NFκ B) signaling pathway inferred from single cell dynamic images. J Biol Chem. 2011, 286 (43): 3774137757. 10.1074/jbc.M111.280925.
 39.
Toni T, Stumpf M: Simulationbased model selection for dynamical systems in systems and population biology. Bioinformatics. 2010, 26: 104110. 10.1093/bioinformatics/btp619.
 40.
Vyshemirsky V, Girolami M: Bayesian ranking of biochemical system models. Bioinformatics. 2008, 24 (6): 833839. 10.1093/bioinformatics/btm607.
 41.
Schmidl D, Hug S, Li WB, Greiter MB, Theis FJ: Bayesian model selection validates a biokinetic model for zirconium processing in humans. BMC Syst Biol. 2012, 6: 9510.1186/17520509695.
 42.
Mélykúti B, August E, Papachristodoulou A, ElSamad H: Discriminating between rival biochemical network models: three approaches to optimal experiment design. BMC Syst Biol. 2010, 4: 3810.1186/17520509438.
 43.
Flassig R, Sundmacher K: Optimal design of stimulus experiments for robust discrimination of biochemical reaction networks. Bioinformatics. 2012, 28 (23): 30893096. 10.1093/bioinformatics/bts585.
 44.
Endres DM, Schindelin JE: A new metric for probability distributions. Inf Theory IEEE Trans. 2003, 49 (7): 18581860. 10.1109/TIT.2003.813506.
 45.
Kreutz C, Rodriguez M, Maiwald T, Seidl M, Blum H, Mohr L, Timmer J: An error model for protein quantification. Bioinformatics. 2007, 23 (20): 274710.1093/bioinformatics/btm397.
 46.
Raue A, Kreutz C, Theis F, Timmer J: Joining forces of Bayesian and frequentist methodology: A study for inference in the presence of nonidentifiability. Phil Trans Roy Soc A. 2012, 371 (1984): 2011054410.1098/rsta.2011.0544.
 47.
Geyer C: Practical markov chain monte carlo. Stat Sci. 1992, 7 (4): 473483. 10.1214/ss/1177011137.
 48.
Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf M: Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface. 2009, 6 (31): 187202. 10.1098/rsif.2008.0172.
 49.
Burnham KP, Anderson DR: Model selection and multimodel inference: a practical informationtheoretic approach. 2002, New York: Springer
 50.
Penny WD, Stephan K, Mechelli A, Friston K: Comparing dynamic causal models. NeuroImage. 2004, 22 (3): 11571172. 10.1016/j.neuroimage.2004.03.026.
 51.
Good IJ: Weight of evidence: a brief survey. Bayesian Stat. 1985, 2: 249269.
 52.
Calderhead B, Girolami M: Estimating Bayes factors via thermodynamic integration and population MCMC. Comput Stat Data Anal. 2009, 53 (12): 40284045. 10.1016/j.csda.2009.07.025.
 53.
Lin J: Divergence measures based on the Shannon entropy. Inf Theory IEEE Trans. 1991, 37: 145151. 10.1109/18.61115.
 54.
Härdle W, Werwatz A, Müller M, Sperlich S: Introduction. Nonparametric Semiparametric Models. 2004, New York: Springer
 55.
Kraskov A, Stögbauer H, Grassberger P: Estimating mutual information. Phys Rev E. 2004, 69 (6): 066138
 56.
Budka M, Gabrys B, Musial K: On accuracy of PDF divergence estimators and their applicability to representative data sampling. Entropy. 2011, 13 (7): 12291266.
 57.
Boltz S, Debreuve E, Barlaud M: Highdimensional statistical distance for regionofinterest tracking: Application to combining a soft geometric constraint with radiometry. IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07. 2007, Minneapolis, Minnesota, USA: IEEE Computer Society, 18.
 58.
Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 1999, 27: 2934. 10.1093/nar/27.1.29.
 59.
Bevan P: Insulin signalling. J Cell Sci. 2001, 114 (8): 14291430.
 60.
Friedman JH, Bentley JL, Finkel RA: An algorithm for finding best matches in logarithmic expected time. ACM Trans Math Softw (TOMS). 1977, 3 (3): 209226. 10.1145/355744.355745.
 61.
Trotta R: Forecasting the Bayes factor of a future observation. Mon Notices R Astronomical Soc. 2007, 378 (3): 819824. 10.1111/j.13652966.2007.11861.x.
 62.
Calderhead B, Girolami M: Statistical analysis of nonlinear dynamical systems using differential geometric sampling methods. Interface Focus. 2011, 1 (6): 821835. 10.1098/rsfs.2011.0051.
 63.
Calderhead B, Girolami M, Lawrence N: Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. Adv Neural Inf Process Syst. 2009, 21: 217224.
 64.
Liepe J, Barnes C, Cule E, Erguler K, Kirk P, Toni T, Stumpf M: ABCSysBio approximate Bayesian computation in Python with GPU support. Bioinformatics. 2010, 26 (14): 179710.1093/bioinformatics/btq278.
 65.
Arefin A, Riveros C, Berretta R, Moscato P: GPUFSkNN: A Software Tool for Fast and Scalable kNN Computation Using GPUs. PloS one. 2012, 7 (8): e4400010.1371/journal.pone.0044000.
 66.
Garcia V, Debreuve E, Barlaud M: Fast k nearest neighbor search using GPU. CVPR Workshop on Computer Vision on GPU. 2008, Anchorage, Alaska, USA: IEEE Computer Society
Acknowledgements
Funding
This work was funded by the Netherlands Consortium for Systems Biology (NCSB). The authors would like to thank H. W. H. van Roekel, R. M. Foster and C. Çölmekçi Öncü for helpful discussions.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JV developed the method, performed the analyses, and wrote the paper. CT, PH and NR contributed to the computational analysis, interpretation of the results and revised the paper. All authors read and approved the final manuscript.
Electronic supplementary material
Supplementary Information.
Additional file 1: S1. Additional information regarding the MCMC Sampler. S2. Thermodynamic integration. S3. Mutual information. S4. Analytical expressions for the linear models. S5. Nonlinear model equations. S6. Spearman correlation JSD and Bayes factor. S7. Sampling bigger design matrices. S8. Choosing the size parameter for the density estimation. (PDF 5 MB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Vanlier, J., Tiemann, C.A., Hilbers, P.A. et al. Optimal experiment design for model selection in biochemical networks. BMC Syst Biol 8, 20 (2014). https://doi.org/10.1186/17520509820
Received:
Accepted:
Published:
Keywords
 Model selection
 Inference
 Bayes factor
 Uncertainty