Nested sampling for parameter inference in systems biology: application to an exemplar circadian model

Background: Model selection and parameter inference are complex problems that have yet to be fully addressed in systems biology. In contrast with parameter optimisation, parameter inference computes both the parameter means and their standard deviations (or full posterior distributions), thus yielding important information on the extent to which the data and the model topology constrain the inferred parameter values.

Results: We report on the application of nested sampling, a statistical approach to computing the Bayesian evidence Z, to the inference of parameters, and the estimation of log Z, in an established model of circadian rhythms. A ten-fold difference in the coefficient of variation between degradation and transcription parameters is demonstrated. We further show that the uncertainty remaining in the parameter values is reduced by the analysis of increasing numbers of circadian cycles of data, up to 4 cycles, but is unaffected by sampling the data more frequently. Novel algorithms for calculating the likelihood of a model, and a characterisation of the performance of the nested sampling algorithm, are also reported. The methods we develop considerably improve the computational efficiency of the likelihood calculation, and of the exploratory step within nested sampling.

Conclusions: We have demonstrated in an exemplar circadian model that the estimates of posterior parameter densities (as summarised by parameter means and standard deviations) are influenced predominantly by the length of the time series, becoming more narrowly constrained as the number of circadian cycles considered increases. We have also shown the utility of the coefficient of variation for discriminating between highly-constrained and less-well-constrained parameters.

. Calculation of log Z for 5-, 10-, 20- and 30-dimensional isotropic Gaussian likelihood functions (e^(−Σ_i (X_i − µ)² / 2σ²)) over prior widths from 4 to 40 standard deviations (σ = 0.05). The mean estimates of log Z (error bars indicate 95% confidence intervals) were obtained by nested sampling using 25 active samples (n = 25) from 5 runs of the algorithm. The true values of log Z are indicated by the red lines (−10.38, −20.77, −41.54 and −62.30 in 5, 10, 20 and 30 dimensions respectively). Nested sampling computes log Z in an n-dimensional unit cube [0, 1]^n from the posterior points obtained. By design, these points are sampled from the region(s) of the prior where the likelihood is non-zero. The final estimate of log Z is obtained by multiplying the output of nested sampling by the volume of the prior in parameter space, (d_1^max − d_1^min) × … × (d_n^max − d_n^min), where the likelihood is non-zero. When the integral width becomes large (30 times the true standard deviation, or more) in higher dimensions, the proportion of the prior where the likelihood is greater than zero becomes significantly smaller than the volume of parameter space, and therefore we scale this volume by an estimate of the non-zero fraction. This estimate is obtained in an independent analysis by evaluating the likelihood at up to 10^7 random samples from the parameter space.
Results obtained using 10 steps of slice sampling with the step size updated adaptively (blue) do not differ systematically from those obtained with a fixed step size (green).
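As an illustration of the calculation described above, the sketch below runs nested sampling on an isotropic Gaussian likelihood in the unit cube and compares the estimate of log Z with the analytic value. The dimension, number of live points and iteration count are illustrative, and the exploration step is a simple constrained random walk standing in for the slice sampling used in the paper; this is a minimal sketch, not the authors' implementation.

```python
import math
import random

random.seed(1)

D = 2         # dimensions (illustrative; the figure uses 5-30)
SIGMA = 0.05  # true standard deviation, as in the caption
MU = 0.5      # likelihood centred in the unit cube

def log_likelihood(theta):
    """Isotropic Gaussian likelihood exp(-sum_i (x_i - mu)^2 / (2 sigma^2))."""
    return -sum((x - MU) ** 2 for x in theta) / (2.0 * SIGMA ** 2)

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def explore(start, log_l_min, live, steps=20):
    """Draw a new point with L > L_min by a constrained random walk.

    The step size is adapted to the current spread of the live points;
    this is a stand-in for the slice-sampling exploration step."""
    scales = [max(t[d] for t in live) - min(t[d] for t in live) + 1e-12
              for d in range(D)]
    theta = list(start)
    for _ in range(steps):
        prop = [min(1.0, max(0.0, theta[d] + random.uniform(-1, 1) * scales[d]))
                for d in range(D)]
        if log_likelihood(prop) > log_l_min:
            theta = prop
    return theta

def nested_sampling(n_live=100, n_iter=1000):
    live = [[random.random() for _ in range(D)] for _ in range(n_live)]
    log_l = [log_likelihood(t) for t in live]
    log_z = -math.inf
    for i in range(1, n_iter + 1):
        worst = min(range(n_live), key=lambda k: log_l[k])
        # weight w_i = X_{i-1} - X_i with prior volume X_i = exp(-i / n_live)
        log_w = -(i - 1) / n_live + math.log(1.0 - math.exp(-1.0 / n_live))
        log_z = logaddexp(log_z, log_l[worst] + log_w)
        start = random.choice([live[k] for k in range(n_live) if k != worst])
        live[worst] = explore(start, log_l[worst], live)
        log_l[worst] = log_likelihood(live[worst])
    # remaining contribution of the final live points
    log_x_final = -n_iter / n_live
    for ll in log_l:
        log_z = logaddexp(log_z, ll + log_x_final - math.log(n_live))
    # for a prior box other than the unit cube, add sum_d log(d_max - d_min)
    return log_z

estimate = nested_sampling()
true_log_z = D * math.log(math.sqrt(2 * math.pi) * SIGMA)  # about -4.15 in 2D
```

With σ = 0.05 the Gaussian mass lies well inside the unit cube, so the true evidence is (√(2π)σ)^D; in 5 dimensions this gives the value −10.38 quoted in the caption.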

Fig. S3. Calculation of log Z for 5-, 10-, 20- and 30-dimensional Gaussian likelihood functions (e^(−Σ_i (X_i − µ)² / 2σ²)) over prior widths from 4 to 40 standard deviations (σ = 0.05). The mean estimates of log Z (error bars indicate 95% confidence intervals) were obtained by nested sampling using 25 active samples (n = 25) from 5 runs of the algorithm. The true values of log Z are indicated by the red lines (−10.38, −20.77, −41.54 and −62.30 in 5, 10, 20 and 30 dimensions respectively). Results obtained using 1 step of slice sampling in the exploration step (blue) do not differ systematically from those obtained using 10 steps of slice sampling (green). The adaptive heuristic is used in both cases.

. Bridging sparse data. A. Schematic of bridging for a bridge length of three. The observed data are represented by the black crosses at 1h (B1) and 4h (B4). Initially, the bridge points B2, B3 and B_end are generated in sequence from B1 by equation (10). In this example, the expected value of B_end does not correspond to the known value B4, and as a consequence the bridge points are scaled to new values B2′ and B3′. B. Sparsely-sampled data (circles) and bridge points (solid lines) computed as in A for M (red), P_c (black) and P_n (blue) over two circadian cycles.

2 Parameter inference for the free-running (DD) system

Fig. S7. Parameter inference for the free-running (DD) system over 1-5 circadian cycles of data. All figures show box plots summarising 5 runs of the nested sampling algorithm. Each run used a different stochastic realisation of the circadian model. The same (uniform) prior range for parameters was used in all runs. The results are grouped and plotted in several different ways to ease comparison. The parameter values used to generate each of the synthetic data series are indicated by the blue dashed lines.
A. Mean of all parameters as a function of the number of cycles. Triangles indicate the upper and lower bound used for integration (prior width). B. Mean of all parameters plotted on a more detailed scale than in A. C. SD of all parameters as a function of the number of cycles. D. CV of all parameters as a function of the number of cycles. E. CV for 1, 3 and 5 circadian cycles plotted together for comparison.
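The bridging scheme described in the caption above can be sketched as follows. Equation (10) of the main text is not reproduced here, so a hypothetical `propagate` function stands in for one stochastic propagation step, and the rescaling rule (proportional scaling so the bridge ends at the observed value) is one plausible reading of "scaled to new values"; the numbers are illustrative.

```python
import random

random.seed(3)

def propagate(x, dt=1.0):
    """Hypothetical stand-in for equation (10): one stochastic
    propagation step of the model from value x over time dt."""
    return max(0.0, x + random.gauss(0.0, 0.2 * dt))

def bridge(b_start, b_end_obs, length=3):
    """Bridge two observations with `length` generated points.

    Points (e.g. B2, B3, B_end for length 3) are generated in sequence
    from b_start; because the propagated endpoint rarely matches the
    observed endpoint, all bridge points are rescaled (proportionally
    here, an assumed rule) so the bridge ends at b_end_obs.  The inner
    scaled points (B2', B3') are returned."""
    points = []
    x = b_start
    for _ in range(length):
        x = propagate(x)
        points.append(x)
    b_end_gen = points[-1]
    scale = b_end_obs / b_end_gen if b_end_gen > 0 else 1.0
    return [p * scale for p in points[:-1]]

# Observed data at 1h (B1) and 4h (B4), bridged at hourly intervals:
b1, b4 = 2.0, 1.4
inner = bridge(b1, b4, length=3)  # scaled B2', B3'
```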

Comparison of nested sampling and MCMC
A single realisation of the DD model was analysed by five repetitions of nested sampling and by a standard implementation of MCMC [1]. (Note that all other analyses apply nested sampling to five different realisations of the circadian model.) In both algorithms, the lower and upper parameter bounds are mapped to [0, 1] in all dimensions, and hence inference takes place within a unit cube. The output analysis of MCMC followed the recommendations in [1]. An average of 57,018 log-likelihood calls were required to obtain 1054 posterior samples in nested sampling, whereas MCMC required 150,000 evaluations to obtain 1500 samples (using a batch of length 100, acceptance rate 11%). The acceptance ratio of MCMC is strongly influenced by the scale parameter, which sets the proposal step size (scale × N(0, 1)). Alternative scales were explored to ensure consistent results from MCMC simulations of the circadian model, and from multi-dimensional Gaussian test problems. One complication that arises in high-dimensional problems is that lower values of scale improve the acceptance ratio but increase the computation required to explore the prior.
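A random-walk Metropolis scheme of the kind described above, with the proposal drawn as scale × N(0, 1) inside the unit cube, can be sketched as follows. The target here is a multi-dimensional Gaussian test problem rather than the circadian model, and the scale and sample counts are illustrative, not the values used in the study.

```python
import math
import random

random.seed(2)

D = 5         # dimensions of the Gaussian test problem (illustrative)
SIGMA = 0.05  # standard deviation of the target likelihood
MU = 0.5      # target centred in the unit cube

def log_likelihood(theta):
    return -sum((x - MU) ** 2 for x in theta) / (2.0 * SIGMA ** 2)

def metropolis(n_samples=20000, scale=0.05):
    """Random-walk Metropolis with proposal step scale * N(0, 1)
    and a uniform prior on the unit cube [0, 1]^D."""
    theta = [MU] * D  # start at the mode to sidestep burn-in in this sketch
    ll = log_likelihood(theta)
    accepted = 0
    samples = []
    for _ in range(n_samples):
        prop = [x + scale * random.gauss(0.0, 1.0) for x in theta]
        # proposals outside the prior support are rejected outright
        if all(0.0 <= x <= 1.0 for x in prop):
            ll_prop = log_likelihood(prop)
            if math.log(random.random()) < ll_prop - ll:
                theta, ll = prop, ll_prop
                accepted += 1
        samples.append(list(theta))
    return samples, accepted / n_samples

samples, acceptance_rate = metropolis()
```

Rerunning with smaller or larger values of `scale` reproduces the trade-off noted above: a small step raises the acceptance rate but lengthens the autocorrelation time, so more evaluations are needed to cover the prior.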
All nine model parameters are approximately normally distributed, as indicated by the sample histograms (Fig. S8A) and QQ plots (Fig. S8B) of MCMC samples. On the assumption that parameters are normally distributed, a graphical comparison of the (theoretical) normal densities inferred by nested sampling can be made with those inferred by MCMC, see Fig. S8C. The densities plotted are for the most typical run of nested sampling (the run giving the greatest number of median estimates of parameter means: 4 parameters of 9). While some differences in the location of the means can be noted, the predicted posterior densities are generally in good agreement.

. Entropy of data sampled over 1-3 circadian cycles. Total entropy of 13, 25 and 37 samples obtained from 1, 2 and 3 circadian cycles respectively, and of 49 samples selected from 1-3 circadian cycles. Total entropy is defined as the sum of p log(p) for all bridge points (p is calculated by equation (18) in the main text). Entropy increases with increasing numbers of samples, and is approximately equal when 49 samples are selected from 1-3 circadian cycles.
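The total-entropy measure just described can be sketched as below. Equation (18) is not reproduced here, so the bridge-point probabilities are illustrative uniform values; the sign is chosen so that entropy is non-negative (the text writes the sum of p log(p), i.e. the opposite sign convention).

```python
import math

def total_entropy(probabilities):
    """Total entropy over bridge points: -sum(p * log(p)).

    The probabilities p would come from equation (18) of the main text;
    zero-probability points contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# With uniform probabilities over N bridge points the entropy is log N,
# so it grows with the number of samples, as described in the caption.
entropies = {n: total_entropy([1.0 / n] * n) for n in (13, 25, 37, 49)}
```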