A model *M* describes the deterministic dynamic behaviour of variables *z*(*t*) = (*z*_{1}(*t*), ..., *z*_{K}(*t*))^{T} by a system of differential equations *dz*_{k}/*dt* = *f*_{k}(*z*(*t*), *t*, *θ*), using parameters *θ*. The parameters may include initial values. Measurements *y*_{ki} of each *z*_{k} are taken at time points *t*_{i}, *i* = 1, ..., *N*. For notational simplicity, we assume all variables are measured at all time points, but the following discussion is easily extended to more general cases. An additive Gaussian measurement error *ϵ* ~ *N*(0, *σ*^{2}) affects the observations of the variables. For simplicity we assume here that the error distributions of the variables are independent and constant over time. They are characterised by a variance *σ*^{2}, which we consider to be part of the parameters.
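To make this setup concrete, here is a minimal sketch (assuming NumPy; the one-variable decay model *dz*/*dt* = -*θz*, the hand-rolled Runge-Kutta solver, and the function names are illustrative, not from the paper) that produces a model trajectory and evaluates the Gaussian log-likelihood of the observations around it:

```python
import numpy as np

def simulate(theta, z0, tgrid):
    """Solve the toy ODE dz/dt = -theta * z on tgrid with a classical
    4th-order Runge-Kutta scheme (illustrative stand-in for f_k)."""
    f = lambda z: -theta * z
    z = np.empty(len(tgrid))
    z[0] = z0
    for i in range(1, len(tgrid)):
        h = tgrid[i] - tgrid[i - 1]
        k1 = f(z[i - 1])
        k2 = f(z[i - 1] + h / 2 * k1)
        k3 = f(z[i - 1] + h / 2 * k2)
        k4 = f(z[i - 1] + h * k3)
        z[i] = z[i - 1] + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

def log_likelihood(y, theta, z0, sigma2, tgrid):
    """Gaussian log-likelihood of observations y_i ~ N(z(t_i), sigma2)."""
    zhat = simulate(theta, z0, tgrid)
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (y - zhat) ** 2 / sigma2)
```

Extending this to *K* coupled equations only changes `simulate`; the likelihood remains a sum over all variables and time points.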

The probability of the data *Y* given some estimates $\hat{\theta}$ of the parameters is

$$P(Y \mid \hat{\theta}, M) = \prod_{k=1}^{K} \prod_{i=1}^{N} p_N(y_{ki} \mid \hat{y}_{ki}, \sigma^2), \qquad (1)$$

where $p_N(y \mid \hat{y}, \sigma^2)$ is a Gaussian with mean $\hat{y}$ and variance $\sigma^2$. Estimates $\hat{y}_{ki}$ are obtained as $\hat{y}_{ki} = \hat{z}_k(t_i)$, where the $\hat{z}_k$ are the solutions of the differential equation system $dz_k/dt = f_k(z(t), t, \hat{\theta})$.

Model comparison methods involve two steps: first the model parameters need to be inferred from the available data, and then the adequacy of each calibrated model needs to be assessed. The classical and Bayesian approaches to each step are described next.

### Bayesian inference

In Bayesian statistics our knowledge about model parameters, conditional on the observed data, is summarised by probability distributions: parameters are treated as random variables whose uncertainty these distributions quantify. The relationship between the data and the parameters is described by Bayes' theorem,

$$P(\theta \mid Y, M) = \frac{P(Y \mid \theta, M)\, P(\theta \mid M)}{P(Y \mid M)}. \qquad (2)$$

The posterior distribution of the parameters *θ*, given the data *Y*, is proportional to the likelihood *P*(*Y* | *θ*, *M*) of the parameters times the parameter prior *P*(*θ* | *M*), normalised by the marginal likelihood or evidence *P*(*Y* | *M*) for model *M*. Samples from the posterior distributions of model parameters are routinely obtained by Markov Chain Monte Carlo (MCMC) methods [16]. Descriptions of the posterior distribution in terms of mean, median or variance are easily obtained from such samples. In the following, estimates of the 95% credible interval (CI), for example, are obtained by the cutoffs for the lowest 2.5% and highest 2.5% of samples.
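Such percentile-based credible intervals can be read off an MCMC sample directly; a minimal sketch (assuming NumPy, function name illustrative):

```python
import numpy as np

def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval from posterior samples: cut off
    the lowest and highest (1 - level)/2 fractions of the sample."""
    alpha = (1.0 - level) / 2.0
    lo, hi = np.percentile(samples, [100 * alpha, 100 * (1 - alpha)])
    return lo, hi
```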

The denominator in Equation 2, the model evidence, is constant during the calibration step for one particular model and thus can be ignored. However, it becomes our quantity of interest in model comparison. Computation of the model evidence requires solving the integral

$$p(Y \mid M) = \int p(Y \mid \theta, M)\, p(\theta \mid M)\, d\theta,$$

which is analytically intractable for most examples discussed in this paper. However, if a sample of *S* points *θ*_{1}, ..., *θ*_{S} from the posterior distribution *p*(*θ* | *Y*, *M*) is available (for example, from an MCMC simulation), then the model evidence can be estimated by Gelfand and Dey's [17] reciprocal importance sampler, which is defined as

$$\frac{1}{\hat{p}(Y \mid M)} = \frac{1}{S} \sum_{s=1}^{S} \frac{h(\theta_s \mid M)}{p(Y \mid \theta_s, M)\, p(\theta_s \mid M)}, \qquad (3)$$

where *h*(*θ* | *M*) is an arbitrary probability density function over the parameters *θ*. The choice of a suitable function *h*(*θ*_{s} | *M*) is crucial. If it is set to the prior we obtain the harmonic mean estimator, which can perform very poorly due to its high variance. The variance problem is mitigated when a distribution is chosen that is close to the posterior distribution. Stability of the estimator is, for example, achieved by setting *h*(*θ* | *M*) to a multivariate Gaussian fitted to the sampled points *θ*_{s}, or to a multivariate *t*-distribution, as used here. The reciprocal importance sampler was shown to perform well [18] when compared to other model comparison methods such as reversible jump MCMC or simple information criteria like the BIC.
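A minimal sketch of the reciprocal importance sampler (assuming NumPy; for brevity *h* is here a multivariate Gaussian fitted to the samples, whereas the paper uses a multivariate *t*-distribution, and all names are illustrative):

```python
import numpy as np

def gelfand_dey_log_evidence(theta_s, log_lik, log_prior):
    """Reciprocal importance sampling estimate of log p(Y|M).
    theta_s: (S, d) posterior samples; log_lik, log_prior: arrays with
    log p(Y|theta_s, M) and log p(theta_s|M) evaluated at those samples."""
    S, d = theta_s.shape
    # Fit a multivariate Gaussian h to the sample (moment matching).
    mu = theta_s.mean(axis=0)
    cov = np.cov(theta_s, rowvar=False).reshape(d, d)
    icov = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = theta_s - mu
    log_h = -0.5 * (d * np.log(2 * np.pi) + logdet
                    + np.einsum('ij,jk,ik->i', diff, icov, diff))
    # 1/p(Y|M) ~= (1/S) * sum_s h(theta_s) / [p(Y|theta_s) p(theta_s)],
    # averaged in log space with the log-sum-exp trick for stability.
    log_terms = log_h - log_lik - log_prior
    m = log_terms.max()
    log_inv_evidence = m + np.log(np.mean(np.exp(log_terms - m)))
    return -log_inv_evidence
```

On a conjugate toy problem with a known analytic evidence, this estimate agrees closely with the exact value when *h* matches the posterior well.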

As we show in the simulations below, the reciprocal importance sampler is suitable for the comparatively simple models which we investigate in this study. For more complex models, particularly those with many modes, more sophisticated model comparison algorithms might be required (see, for example, [14]). Such algorithms are harder to implement and run. The strategy we suggest here is to test via simulations whether a particular type of model is amenable to simple model selection procedures based on MCMC samples, and to develop more advanced methods only if this is not the case.

### Bayesian model comparison

Given a particular candidate model *M*_{i}, its posterior probability is given by

$$p(M_i \mid Y) = \frac{p(Y \mid M_i)\, p(M_i)}{\sum_j p(Y \mid M_j)\, p(M_j)}, \qquad (4)$$

where *p*(*M*_{i}) is the prior probability of the model, and *p*(*Y* | *M*_{i}) is the model evidence, which we estimate here by equation 3.
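Posterior model probabilities are best computed in log space, since evidences span many orders of magnitude; a small sketch (assuming NumPy, names illustrative) normalising log evidences with the log-sum-exp trick:

```python
import numpy as np

def model_posteriors(log_evidence, log_prior=None):
    """Posterior model probabilities p(M_i|Y), proportional to
    p(Y|M_i) * p(M_i), computed stably from log evidences."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if log_prior is None:            # uniform prior over models
        log_prior = np.zeros_like(log_evidence)
    w = log_evidence + log_prior
    w = w - w.max()                  # log-sum-exp shift for stability
    p = np.exp(w)
    return p / p.sum()
```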

According to Occam's principle, simpler models are preferred over complex models if they explain the data equally well. If the unknown parameters *θ* are integrated out as in equation 4, the posterior model probability incorporates a balance between complexity and fit. Since *P*(*Y* | *M*) embodies Occam's principle, it will be the key quantity for model comparison. One way to see that *P*(*Y* | *M*) can be used to choose the model with the better predictive performance is as follows. The model likelihood *P*(*Y* | *M*) essentially captures the sequential predictive power of the model over past incremental data sets, since

$$P(Y \mid M) = P(Y_1 \mid M)\, P(Y_2 \mid Y_1, M) \cdots P(Y_n \mid Y_1, \ldots, Y_{n-1}, M),$$

that is, it captures how well each part *Y*_{i} of the data *Y* = (*Y*_{1}, ..., *Y*_{n}) is predicted when the earlier parts *Y*_{1}, ..., *Y*_{i-1} are used to calibrate the model. This makes complex models which overfit less likely. An alternative, though related, measure that penalises models that overfit is the Deviance Information Criterion (DIC) [19], which measures the predictive power on unseen data. It relies more strongly on assumptions about the distribution of future data, but assesses the fully calibrated model, not only versions calibrated on partial data. To define the DIC we first set the deviance as

$$D(y, \theta, M) = -2 \log p(y \mid \theta, M).$$

For large sample sizes, the model minimizing the deviance is the model with the highest posterior probability (see, for example, [20]). The predictive deviance for future data can be approximated by

$$\mathrm{DIC} = 2\,\bar{D}(y, M) - D(y, \bar{\theta}, M)$$

for samples *θ*_{s} from the posterior (for example, from an MCMC simulation), where $\bar{D}(y, M) = \frac{1}{S} \sum_{s} D(y, \theta_s, M)$ is the average deviance and $\bar{\theta} = \frac{1}{S} \sum_{s} \theta_s$ is the vector of posterior parameter means. The lower the DIC, the better the model. However, we find that the DIC calculations are often unstable, sometimes producing completely unrealistic values, and should therefore be treated with caution.

Note that the complexity of the model is not easily captured by the number of estimated parameters or degrees of freedom, as in the well-known Akaike information criterion (AIC) or even the BIC and related information criteria. If correlated parameters or informative priors are used, for example, the number of *effective* degrees of freedom, pD, is reduced. It can be estimated by (see [19])

$$p_D = \bar{D}(y, M) - D(y, \bar{\theta}, M).$$
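Given the log-likelihood evaluated at the posterior samples and at the posterior-mean parameters, both DIC and pD follow in a few lines; a minimal sketch (assuming NumPy, names illustrative):

```python
import numpy as np

def dic(log_lik_samples, log_lik_at_mean):
    """DIC and effective degrees of freedom pD from MCMC output.
    log_lik_samples: log p(y|theta_s, M) at posterior samples theta_s;
    log_lik_at_mean: log p(y|theta_bar, M) at the posterior-mean parameters."""
    D_bar = np.mean(-2.0 * log_lik_samples)  # average deviance
    D_hat = -2.0 * log_lik_at_mean           # deviance at posterior mean
    pD = D_bar - D_hat                       # effective degrees of freedom
    return D_hat + 2.0 * pD, pD              # DIC = D_hat + 2*pD = 2*D_bar - D_hat
```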

Once the probability of the model is known, we can select the most probable model from a set of competing models using the Bayes factor *BF* = *p*(*Y* | *M*_{i})/*p*(*Y* | *M*_{j}). That is, the Bayes factor measures the extent by which the data increase the odds of *M*_{i} over *M*_{j}. Standard cutoffs for interpreting the significance of BFs, like [21], then allow interpretation of the result. Basically, a BF above 1 provides weak, above 3 substantial, above 10 decisive, and above 100 overwhelming evidence for model *M*_{i} over *M*_{j}.
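A small helper (standard-library Python; function name and labels follow the cutoffs quoted above and are otherwise illustrative) that computes a Bayes factor from log evidences and maps it onto those categories:

```python
import math

def bayes_factor_category(log_ev_i, log_ev_j):
    """Bayes factor BF = p(Y|M_i)/p(Y|M_j) from log evidences, with the
    cutoffs quoted in the text: >1 weak, >3 substantial, >10 decisive,
    >100 overwhelming evidence for M_i over M_j."""
    bf = math.exp(log_ev_i - log_ev_j)
    for cut, label in [(100, "overwhelming"), (10, "decisive"),
                       (3, "substantial"), (1, "weak")]:
        if bf > cut:
            return bf, label
    return bf, "none"
```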

The effective degrees of freedom measure how many parameters, and by how much, are constrained by the data. On one hand, each parameter contributes close to one degree if the width of its posterior is small compared to the width of its prior. On the other hand, a parameter contributes very little to the overall effective degrees of freedom if it is not well constrained by the data and the width of its posterior hardly differs from the width of its prior. Consequently, neither the Bayes factor nor the effective degrees of freedom can eliminate or penalise spurious parameters which are ill determined by the data. It might be dubious to invoke Occam's principle a second time (after it has already been incorporated in the model evidence) to decide between such models; only additional data should be used for final clarification. For purely pragmatic reasons of convenience one might still accept the model with formally fewer parameters even though it has the same model evidence and the same effective degrees of freedom as a more complex one.

### Frequentist inference

Parameter estimation by maximising the likelihood of *θ* in equation 1 is equivalent to least squares (LS) optimisation, minimising the sum of squared errors

$$\mathrm{SSE}(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \left( y_{ki} - \hat{y}_{ki}(\theta) \right)^2.$$

An unbiased estimator $\hat{\sigma}^2$ for the noise variance is

$$\hat{\sigma}^2 = \frac{\mathrm{SSE}(\hat{\theta})}{KN - p}.$$

Confidence intervals for the *p* estimated parameters are obtained from the covariance matrix of $\hat{\theta}$. This matrix is approximated by the inverse of the Hessian matrix $H(\hat{\theta})$, that is, the inverse of the matrix of second derivatives of the optimised function, evaluated at $\hat{\theta}$. 95% confidence intervals for the *j*-th parameter (with *j* = 1, ..., *p*) are obtained as $\hat{\theta}_j \pm 1.96\,\mathrm{SE}_j$, where each standard error $\mathrm{SE}_j$ is the square root of the *j*-th diagonal element of the covariance matrix.
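A sketch of these Wald-type intervals (assuming NumPy; the central finite-difference Hessian and all names are illustrative, not the paper's implementation), approximating the covariance of the estimate by the inverse Hessian of the negative log-likelihood:

```python
import numpy as np

def covariance_from_hessian(neg_log_lik, theta_hat, eps=1e-4):
    """Approximate cov(theta_hat) as the inverse of the Hessian of the
    negative log-likelihood, using central finite differences."""
    p = len(theta_hat)
    H = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            tpp = theta_hat.copy(); tpp[j] += eps; tpp[k] += eps
            tpm = theta_hat.copy(); tpm[j] += eps; tpm[k] -= eps
            tmp = theta_hat.copy(); tmp[j] -= eps; tmp[k] += eps
            tmm = theta_hat.copy(); tmm[j] -= eps; tmm[k] -= eps
            H[j, k] = (neg_log_lik(tpp) - neg_log_lik(tpm)
                       - neg_log_lik(tmp) + neg_log_lik(tmm)) / (4 * eps * eps)
    return np.linalg.inv(H)

def confidence_intervals(neg_log_lik, theta_hat, z=1.96):
    """95% Wald intervals theta_hat[j] +/- z * SE_j."""
    cov = covariance_from_hessian(neg_log_lik, theta_hat)
    se = np.sqrt(np.diag(cov))
    return np.column_stack([theta_hat - z * se, theta_hat + z * se])
```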

### Model selection as hypothesis testing

A commonly used method for frequentist model comparison is the likelihood ratio test (LRT), in which two nested models with different numbers of parameters are compared. According to the null hypothesis, a simple model *M*_{s} (with *p*_{s} parameters) is correct, and thus the additional parameters in the more complex model *M*_{c} (with *p*_{c} parameters) are unnecessary. A *p*-value is obtained as the tail probability of a *χ*^{2} distribution with *p*_{c} - *p*_{s} degrees of freedom of the statistic

$$\lambda = 2 \left( \log p(Y \mid \hat{\theta}_c, M_c) - \log p(Y \mid \hat{\theta}_s, M_s) \right).$$

Applicability of the LRT is limited by the requirements that the models be nested and that the parameters be fully identifiable. Akaike's Information Criterion (AIC) allows ranking models even when they are non-nested. It is defined by

$$\mathrm{AIC}_i = -2 \log p(Y \mid \theta_{\mathrm{ML}}, M_i) + 2 p_i,$$

where *p*_{i} is the number of parameters in model *i* and *θ*_{ML} is the value of *θ* that maximises the likelihood in equation 1. Standard tables exist for assessing the significance of AIC values [22].
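Both criteria can be written down in a few lines (standard-library Python, names illustrative; the 3.84 critical value is the standard 5% chi-squared cutoff for one degree of freedom):

```python
import math

def aic(max_log_lik, n_params):
    """AIC_i = -2 log p(Y|theta_ML, M_i) + 2 p_i."""
    return -2.0 * max_log_lik + 2.0 * n_params

def lrt_statistic(max_log_lik_simple, max_log_lik_complex):
    """LRT statistic 2*(log L_c - log L_s); under the null hypothesis it
    follows a chi-squared distribution with p_c - p_s degrees of freedom
    (critical value 3.84 for one degree of freedom at the 5% level)."""
    return 2.0 * (max_log_lik_complex - max_log_lik_simple)
```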

### Implementation

MCMC methods are generic approaches to obtain samples from posterior distributions without the need to calculate the model evidence in equation 2 (for details see [16]). In the following we use MCSim, an MCMC simulator for differential equations developed mainly for application to pharmacokinetic models [23], for the estimation of posterior distributions and model probabilities. Essentially, at each step of the MCMC simulation, a set *θ** of new parameters is chosen from a distribution centered on the current parameters *θ* (a multivariate Gaussian in MCSim). The differential equation is solved using the new parameters *θ**. In the case of a symmetric proposal distribution, *θ** is accepted and *θ* set to *θ** with probability

$$\alpha = \min\left(1, \frac{p(\theta^* \mid Y, M)}{p(\theta \mid Y, M)}\right) = \min\left(1, \frac{p(Y \mid \theta^*, M)\, p(\theta^* \mid M)}{p(Y \mid \theta, M)\, p(\theta \mid M)}\right).$$
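A minimal random-walk Metropolis sketch for a one-dimensional parameter (standard-library Python; MCSim itself uses a multivariate Gaussian proposal, so this is purely illustrative and all names are assumptions):

```python
import math
import random

def metropolis(log_post, theta0, n_steps, step=1.0, seed=1):
    """Random-walk Metropolis: propose theta* ~ N(theta, step^2), accept
    with probability min(1, p(theta*|Y,M) / p(theta|Y,M))."""
    rng = random.Random(seed)
    theta, lp = theta0, log_post(theta0)
    samples = []
    for _ in range(n_steps):
        prop = theta + rng.gauss(0.0, step)
        lp_prop = log_post(prop)
        # Accept/reject in log space; symmetric proposal cancels out.
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        samples.append(theta)
    return samples
```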

Five parallel MCMC chains were run for each model. Each chain consisted of 40,000 iterations (20,000 iterations in the simple one-equation model shown in [Additional file 1]). The first 20,000 samples (10,000 samples in the simple one-equation model) of each chain were discarded, and thereafter every 10th iteration was stored. Convergence was assessed by applying the $\hat{R}$ (potential scale reduction) statistic described in [16] to the five parallel runs. This statistic was below 1.05 in all estimations, except where otherwise stated, indicating good convergence behaviour. Evaluation of one model takes a few minutes on a standard desktop machine. Monitoring and analysis of MCSim runs was performed with the statistical software R [24]. Maximum likelihood values were obtained by taking the highest value from the MCMC runs.
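A sketch of this potential-scale-reduction diagnostic for parallel chains (assuming NumPy; this follows the standard Gelman-Rubin formula, which may differ in detail from the variant described in [16]):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction R-hat from parallel chains of equal length
    (rows: chains, columns: iterations). Values near 1, e.g. below 1.05,
    indicate convergence."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)               # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)
```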