Simple 1-dimensional exponential example
Here we introduce a simple 1-dimensional exponential toy model example which we will use to illustrate our definitions of all the parts of a typical systems biology analysis, for example, the model, the input or rate parameters, observations with errors, model discrepancy, Bayesian emulators, implausibility measures and history matching. Specifically, this 1-dimensional example will be used throughout this “Methods” section to demonstrate each of the three main parts of our approach.
Say we are interested in the concentration of a chemical which evolves in time. We represent this concentration as f_t(x), where x is, for example, a reaction rate parameter and t is time. We model f_t(x) with the differential equation:
$$ \frac{df_{t}(x)}{dt} \;\;=\;\; x \; f_{t}(x) $$
(1)
which in this case we can solve precisely to give
$$ f_{t}(x) \;\;=\;\; f_{0} \; \text{exp}{(x t)} $$
(2)
We will temporarily assume the initial conditions are f_0 = f_{t=0}(x) = 1. The system runs from t=0 to t=5 and we are at first interested in the value of f_t(x) at t=3.5. This mathematical model features an input or rate parameter x, which we wish to learn about. We do this using a measurement of the real biological system at t=3.5, which we denote z; this corresponds to, but is not the same as, the model output f_{t=3.5}(x). Note that usually, for models of realistic complexity, we would not have the analytic solution for f_t(x) given by Eq. (2). Instead we would resort to a numerical integration method to solve Eq. (1), which might require significant time for one model evaluation, ranging from less than a second to hours, days or even weeks, depending on the full complexity of the model [6, 12]. Such a computational cost for a single evaluation of the model means that a full global parameter search is computationally infeasible, especially when the model has many rate parameters and therefore a high-dimensional input space, which may require vast numbers of evaluations to explore fully.
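To make this concrete, the toy model can be evaluated both analytically and with a simple numerical integrator. The following is a minimal Python sketch (the paper's own supplementary code is in R); the forward-Euler scheme and step count are illustrative choices standing in for the more sophisticated solvers a realistic model would require:

```python
import math

def f_analytic(x, t, f0=1.0):
    """Exact solution of Eq. (1): f_t(x) = f0 * exp(x t), as in Eq. (2)."""
    return f0 * math.exp(x * t)

def f_numeric(x, t, f0=1.0, n_steps=10000):
    """Forward-Euler integration of df/dt = x f, standing in for the
    numerical solver one would need when no analytic solution exists."""
    f, dt = f0, t / n_steps
    for _ in range(n_steps):
        f += dt * x * f
    return f
```

For this linear ODE the two agree closely, but for a realistic model only the (potentially slow) numerical route would be available.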
We typically begin the analysis by exploring the model’s behaviour for several different values of the unknown rate or input parameter x. Figure 1 (left panel) shows five evaluations of the model f_t(x) for different values of x between x=0.1 and 0.5, coloured red to purple respectively, with time on the x-axis. The measurement of the system is denoted z, and is represented as the black point in Fig. 1, with ±3σ_e error bars representing observational error, defined precisely below. This measurement was made at t=3.5, shown as the vertical dashed line. The most important questions for the biologist at this point are: can the model match the observed data z at all, and if so, what is the entire set of input parameter choices that give rise to acceptable matches between model output and observed data? Figure 1 (right panel) addresses this question: it shows only f_{t=3.5}(x), now represented purely as a function of the input parameter x on the x-axis, with the red to purple points consistent with those in the left panel. The observed data z is now represented as the solid black horizontal line, with the ±3σ_e error bars as the horizontal black dashed lines. We see that there will be acceptable values of x approximately between 0.3 and 0.35.
For a general complex model f_t(x), which possesses a large number of input or rate parameters and possibly several outputs, a full analysis of the model’s behaviour encounters the following issues:

1. When comparing the model to observed data from the real biological system, an adequate statistical description of the link between model and reality, covering all major uncertainties, is required.

2. For complex models, the time taken to evaluate the model numerically is so long that an exhaustive exploration of the model’s behaviour is not feasible.

3. The appropriate scientific goal should be to identify all locations in input parameter space that lead to acceptable fits between model and data, and not just find the location of a single good match.
Methods to address these three fundamental issues are described in the next three sections.
Model discrepancy and linking the model to reality
Most systems biology models are developed to help explain and understand the behaviour of corresponding real-world biological systems. An essential part of determining whether such models are adequate for this task is the process of comparing the model to experimental data. As a comparison of this kind involves several uncertainties that cannot be ignored, it is vital to develop a clearly defined statistical model for the link between the systems biology model f(x) and reality. This allows for a meaningful definition of an ‘acceptable’ match between a model run and the observed data. Here we describe a simple yet extremely useful statistical model for the link between the biological model and reality, which has been successfully applied in a variety of scientific disciplines, for example climate, cosmology, oil reservoir modelling and epidemiology [6, 8, 11, 12].
The most recognisable source of uncertainty is that of observational or experimental error. We represent the uncertain quantities of interest in the real biological system as the vector y, which we will measure with a vector of errors e to give the vector of observations z, such that
$$ z \;\;=\;\; y \; + \; e $$
(3)
where we represent the errors as additive, although more complex forms could be used. Note that z, y and e here represent vectors of random quantities, which will reduce to scalar random quantities if there is only one quantity of interest. A common specification [6] that we will employ here is to judge the errors to be independent of y, and unbiased with expectation E(e)=0 and, for the scalar case, \(\text {Var}(e) = \sigma _{e}^{2}\).
An important distinction to make is between the model of the biological system, represented as the vector f(x), and the system itself, y. We represent the difference between these using a model discrepancy term, as follows. Even were we to evaluate the model f(x) at its best possible choice of input x^*, the output f(x^*) would still not be in agreement with the real biological system value y, due to the many simplifications and approximations of the model. Hence we state that:
$$ y \;\;=\;\; f(x^{*}) \; + \; \epsilon $$
(4)
where ε is the model discrepancy: a vector of uncertain quantities that represents directly the difference between the model and the real system. Again we treat y, f, x^* and ε as vectors of random quantities. A simple and popular specification [6] would be to judge that ε is independent of f(x^*), x^* and e, with E(ε)=0 and, in the scalar case, \(\text {Var}(\epsilon) = \sigma _{\epsilon }^{2}\). In a multivariate setting, where f(x) describes a vector of outputs (for example, with each output labelled by time t), the vector ε may have an intricate structure, possessing non-zero covariances between its components. This could capture the related deficiencies of the model across differing time points. Various structures of increasing complexity are available (for examples see [6, 8, 9]), along with methods for specification of their components [6, 16].
While the explicit inclusion of the model discrepancy term ε may be unfamiliar, it is now standard practice in the statistical literature for complex models [7, 9, 17, 33]. Furthermore, any analysis performed without such a term is implicitly conditioned on the statement “given that the model is a perfect representation of reality for some value of the inputs x”, a statement that is rarely true. The model discrepancy allows us to perform a richer analysis than before, as we can now include any extra knowledge we have about the model’s deficiencies to improve our modelling of reality y, through the joint structure of ε (see for example [6, 16]). This is especially important in prediction: as a simple example, if our model undershot every auxin output we have measured so far, we may suspect that it will undershoot future measurements of auxin also, and may wish to build this into our prediction for future y [33].
We can specify probabilistic attributes of ε a priori, or learn about them by comparing to observed data. For direct specification, there are often various simple experiments that can be performed on the model itself to obtain assessments of σ_ε and other aspects if necessary. For example, models are often run from exact initial conditions, so performing a set of exploratory model evaluations with the initial conditions appropriately perturbed would provide a lower bound on σ_ε. See [16], where several such assessment methods are demonstrated, for more details.
1-dimensional example
Figure 2 (left panel) shows the case for the simple 1-dimensional exponential example model when both the observation error e and model discrepancy ε are ignored. The model f(x) is given by the purple line, while the observed data z is given by the horizontal black line. Here only one value of x can be viewed as acceptable (coloured green) while all others are unacceptable (red). This particular value of x is not robust, in that if we were to perform the measurement again, due to measurement error we would get a different value for z and hence for x. More importantly, if the model had a second output, say corresponding to a different time, that also depended on the same input x, we would be extremely unlikely to be able to match both outputs to their measurements, as we would have to obtain exact matches simultaneously for precisely the same value of x. Inferences and predictions about the biological system made using this value of x are therefore untrustworthy.
Figure 2 (right panel) shows the far more realistic situation where we include both observation error e (the black dashed lines represent z±3σ_e) and model discrepancy ε (the purple dashed lines show f(x)±3σ_ε). As we have taken into account both major types of uncertainty, there is now a range of acceptable values for x (green points), with borderline/unacceptable points in yellow/red. If we were to consider additional outputs of the model, we would still have a chance of matching them simultaneously to data for a subset of the currently acceptable points. If on the other hand we cannot find any acceptable points x even given the uncertainties represented by e and ε, then we can state that the model is inconsistent with the observed data and therefore most likely based on incorrect biological principles. Further, inclusion of the observation error and model discrepancy terms often aids a global parameter search, as they tend to smooth the likelihood surface (or comparable objective function), making it both easier to explore and more robust. They also help reduce the temptation to chase the often scientifically misleading global minimum, such as the lone green point in Fig. 2 (left panel), instead suggesting that the identification of a set of acceptable input points is the appropriate goal for such a search (see the green points in Fig. 2 (right panel)).
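The classification of inputs into acceptable, borderline and unacceptable sketched in Fig. 2 can be expressed as a small helper. A hypothetical Python sketch (the cutoff of 3 standardised deviations follows the ±3σ bands above; the width of the borderline band is an illustrative choice, not taken from the paper):

```python
import math

def match_status(fx, z, sigma_e, sigma_eps, cutoff=3.0):
    """Classify a model output fx against observation z, combining the
    observation error and model discrepancy variances of Eqs. (3)-(4).
    The 0.5-wide borderline band is illustrative."""
    sigma = math.sqrt(sigma_e**2 + sigma_eps**2)
    d = abs(fx - z) / sigma          # standardised distance from observation
    if d < cutoff:
        return "acceptable"          # green in Fig. 2
    elif d < cutoff + 0.5:
        return "borderline"          # yellow
    return "unacceptable"            # red
```

Ignoring both uncertainties (σ_e = σ_ε = 0) collapses the acceptable set to the single exact-match point of the left panel, which is the behaviour the text warns against.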
Bayesian emulation of systems biology models
Many complex mathematical models have been developed and employed within the area of systems biology. Often these models have high-dimensional input spaces, in that they possess several input parameters, for example reaction rate parameters, that must be specified in order to evaluate the model. We represent the list of such inputs as the vector x, with individual inputs x_k, k=1,…,d. The model may have any number of outputs, denoted as the vector f(x), with individual outputs f_i(x), i=1,…,q, the behaviour of which we want to investigate, possibly comparing some of these to observed data. For example, the index i may label the different times we are interested in, or the different chemical outputs of the model, or both. Most models are complex enough that they require numerical integration methods to solve, and hence take appreciable time to evaluate. This evaluation time can range anywhere from less than a second to minutes, hours or even days for highly sophisticated models: our approach is applicable in any of these cases, and adds more value as the dimensionality and evaluation time of the model increase.
A Bayesian emulator is a fast, approximate mimic of the full systems biology model. It gives insight into the structure of the model’s behaviour and can be used instead of the model in many complex calculations. The emulator gives a prediction of what the model’s output f(x) will be at a yet-to-be-evaluated input point x, and additionally provides an associated uncertainty for that prediction (these are often expressed as posterior distributions, or simply expectations and variances in some cases). Critically, an emulator is extremely fast to evaluate, as it only requires a few matrix multiplications, and hence can be used to explore the input space more fully, as for example in a global parameter search.
A popular choice of Bayesian emulator for the model f(x), with individual outputs f_i(x), i=1,…,q, is structured as follows:
$$ f_{i}(x) = \sum_{j} \beta_{ij} g_{ij}(x_{A_{i}}) + u_{i}(x_{A_{i}}) + w_{i}(x) $$
(5)
where the active variables \(x_{A_{i}}\) are a subset of the inputs x that are most influential for output f_i(x). The first term on the right hand side of the emulator Eq. (5) is a regression term, where the g_{ij} are known deterministic functions of \(x_{A_{i}}\), a common choice being low order polynomials, and the β_{ij} are unknown scalar regression coefficients. The second term, \(u_{i}(x_{A_{i}})\), is a Gaussian process over \(x_{A_{i}}\) (or, in a less fully specified version, a weakly second order stationary stochastic process), which means that if we choose a finite set of inputs \(\{x^{(1)}_{A_{i}},\ldots,x^{(s)}_{A_{i}} \}\), the uncertain outputs \(u_{i}(x^{(1)}_{A_{i}}),\ldots,u_{i}(x^{(s)}_{A_{i}})\) will have a multivariate normal distribution with a covariance matrix constructed from an appropriately chosen covariance function, for example:
$$ \text{Cov}\left(u_{i}(x_{A_{i}}),u_{i}(x'_{A_{i}})\right) \;=\; \sigma^{2}_{u_{i}} \, \text{exp}\left\{ -\frac{\|x_{A_{i}}-x'_{A_{i}}\|^{2}}{\theta_{i}^{2}}\right\} $$
(6)
where \(\sigma ^{2}_{u_{i}}\) and θ_i are the variance and correlation length of \(u_{i}(x_{A_{i}})\), which must be specified [6]. The third term, w_i(x), is a nugget: a white noise process, uncorrelated with the β_{ij}, with \(u_{i}(x_{A_{i}})\), and with itself at distinct inputs, such that
$$ \text{Cov}\left(w_{i}(x),w_{i}(x')\right) = \left\{ \begin{array}{cl} \sigma^{2}_{w_{i}} & \text{if} \quad x=x' \\ 0 & \text{otherwise} \end{array} \right. $$
(7)
with expectation zero and \(\text {Var} (w_{i}(x)) = \sigma ^{2}_{w_{i}}\), which represents the effects of the remaining inactive input variables [6].
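Under the specifications above, the covariance functions of Eqs. (6) and (7) are straightforward to code. A minimal Python sketch for scalar inputs (the parameter values used in the test below are illustrative):

```python
import math

def cov_u(x1, x2, sigma_u, theta):
    """Squared-exponential Gaussian-process covariance of Eq. (6),
    written here for a scalar active input."""
    return sigma_u**2 * math.exp(-abs(x1 - x2)**2 / theta**2)

def cov_w(x1, x2, sigma_w):
    """Nugget covariance of Eq. (7): white noise, nonzero only at x1 == x2."""
    return sigma_w**2 if x1 == x2 else 0.0
```

Note that cov_u(x, x, σ_u, θ) returns σ_u², the prior variance, and decays smoothly with distance at rate set by the correlation length θ, while the nugget contributes only on the diagonal.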
The emulator, as given by Eq. (5), possesses various desirable features. The regression term, \(\sum _{j} \beta _{ij} g_{ij}(x_{A_{i}})\), is often chosen to represent, say, a third order polynomial in the active inputs. This attempts to mimic the large scale global behaviour of the function f_i(x), and in many cases will capture a large proportion of the model’s structure. (It is worth noting that reasonably accurate emulators can often be constructed using regression models alone, for example using the lm() function in R. This can be a sensible first step before one attempts the construction of a full emulator of the form given in Eq. (5).) The second term, \(u_{i}(x_{A_{i}})\), the Gaussian process, mimics the local behaviour of f_i(x), specifically its local deviations from the third order polynomial given by the regression terms. We can choose the list of active inputs \(x_{A_{i}}\) using various statistical techniques, for example classical linear model fitting criteria such as AIC or BIC [6]. A list of, say, p active inputs for a particular output f_i(x) means that we have reduced the input dimension from d to p, which can result in large efficiency gains. The small remaining effect of the inactive inputs is captured by the third term, w_i(x), in Eq. (5).
We proceed by performing an initial set of carefully chosen model evaluations, often picked to be ‘space filling’, i.e. well spread out over the input space. For example, we may use a maximin Latin hypercube design, an approximately orthogonal design which attempts to ensure there are no large holes in between the run locations (see Fig. 3 and [34–36]). An n-point Latin hypercube design is created by dividing the range of each input into n subintervals, and placing points so that there is only ever one point in each subinterval (this can be done using the lhs package in R [37]). Many such Latin hypercube designs are generated, and the one with the maximum minimum distance between points is chosen.
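The maximin Latin hypercube construction just described can be sketched in a few lines (in Python here; the R lhs package mentioned above provides equivalent, more polished functionality). The number of candidate designs generated is an illustrative choice:

```python
import random

def latin_hypercube(n, d, rng):
    """One n-point Latin hypercube in [0,1]^d: each input's range is split
    into n subintervals, with exactly one point per subinterval."""
    cols = []
    for _ in range(d):
        perm = list(range(n))
        rng.shuffle(perm)                     # which subinterval each point gets
        cols.append([(p + rng.random()) / n for p in perm])
    return [[cols[k][i] for k in range(d)] for i in range(n)]

def maximin_lhc(n, d, n_designs=50, seed=0):
    """Generate many Latin hypercubes and keep the one whose smallest
    inter-point distance is largest (the maximin criterion)."""
    rng = random.Random(seed)
    def min_dist(design):
        return min(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                   for i, p in enumerate(design) for q in design[i + 1:])
    return max((latin_hypercube(n, d, rng) for _ in range(n_designs)),
               key=min_dist)
```

The maximin filter is what avoids the unlucky diagonal layouts that a single random Latin hypercube can produce.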
We then fit the emulator given by Eq. (5) to the set of model runs using our favourite statistical tools, guided by expert judgement. Specifically, we would prefer a fully Bayesian approach if we required full probability distributions [9], and a Bayes linear approach [38, 39], which we describe below, if we required purely expectations, variances and covariances of f(x). We make certain pragmatic choices in the emulator construction process: for example, while we keep the regression coefficients β_{ij} uncertain, we may directly specify \(\sigma ^{2}_{u_{i}}\), \(\sigma ^{2}_{w_{i}}\) and θ_i a priori, or use suitable plug-in estimates [6].
The emulators then provide an expectation and variance for the value of f(x) at an unexplored input point x. We can test the emulators using a series of diagnostics, for example checking their prediction accuracy over a new batch of runs [40]. See [10] for an introduction and [6, 7, 9] for detailed descriptions of emulator construction.
While there are several approaches to emulator construction, our preferred choice is to use Bayes linear methods, a more tractable version of Bayesian statistics which requires a far simpler prior specification and analysis [38, 39]. It deals purely with expectations, variances and covariances of all uncertain quantities, and uses the following update equations to adjust our beliefs in the light of new data. Say we had performed an initial wave of n runs at input locations x^{(1)}, x^{(2)}, …, x^{(n)}, giving a column vector of model output values D_i = (f_i(x^{(1)}), f_i(x^{(2)}), …, f_i(x^{(n)}))^T, where i labels the model output. We obtain the adjusted expectation and variance for f_i(x) at a new input point x using:
$$ \mathrm{E}_{D_{i}}(f_{i}(x)) \;=\; \mathrm{E}(f_{i}(x)) + \text{Cov}\left(f_{i}(x), D_{i}\right) \text{Var}(D_{i})^{-1} \left(D_{i} - \mathrm{E}(D_{i})\right) $$
(8)
$$ \text{Var}_{D_{i}}(f_{i}(x)) \;=\; \text{Var}(f_{i}(x)) - \text{Cov}\left(f_{i}(x), D_{i}\right) \text{Var}(D_{i})^{-1} \text{Cov}\left(D_{i},f_{i}(x)\right) $$
(9)
All quantities on the right hand side of Eqs. (8) and (9) can be calculated from Eqs. (5) and (6), combined with prior specifications for E(β_{ij}), Var(β_{ij}), \(\sigma ^{2}_{u_{i}}\), \(\sigma ^{2}_{w_{i}}\) and θ_i. Note that we could have used the entire collection of model outputs D = {D_1, D_2, …, D_q} instead of just D_i in Eqs. (8) and (9), had we specified a more complex, multivariate emulator [41].
\(\mathrm {E}_{D_{i}}(f_{i}(x))\) and \(\text {Var}_{D_{i}}(f_{i}(x))\) are used directly in the implausibility measures that drive the global parameter searches described below.
1-dimensional example
We now demonstrate the construction of an emulator for the simple one-dimensional exponential model. As there is only one output, f(x) is now a scalar, so we drop the i index from Eqs. (5)–(9).
Figure 4 (left panel) shows output from such an emulator of the simple model defined by Eq. (1). We suppose that only n=5 runs of the model have been performed, at the locations x^{(j)} = 0.1, 0.2, 0.3, 0.4, 0.5, shown as the purple points (these are the same as the five coloured points in Fig. 1). We therefore have the model output values
$$\begin{array}{*{20}l} D &= (f(x^{(1)}), f(x^{(2)}),\ldots,f(x^{(5)}))^{T} \\ &= (e^{0.1\times3.5}, e^{0.2\times3.5},\ldots,e^{0.5\times3.5})^{T} \end{array} $$
(10)
where again the output of interest has t=3.5.
We use a simplified form of the emulator given by Eq. (5), where we choose the polynomial terms β_j g_j(x_A) to represent only a constant term β_0. As we only have one input variable, there is no distinction between inactive and active variables, so we also set w(x) to zero, and hence the emulator Eq. (5) reduces to
$$ f(x) \;\;=\;\; \beta_{0} + u(x) $$
(11)
For simplicity we treat the constant term β_0 as known, and hence set Var(β_0)=0 and choose prior expectation E(β_0) = β_0 = 3.5, a value around which we expect the function outputs to be approximately centred. We specify the parameters in the covariance function for u(x), given by Eq. (6), to be σ_u = 1.5 and θ = 0.14, representing curves of moderate smoothness: this process will be discussed in more detail for the full Arabidopsis model.
All expectation, variance and covariance terms on the right hand side of Eqs. (8) and (9) can now be found using Eqs. (11), (6) and (10), for example,
$$\begin{array}{@{}rcl@{}} \mathrm{E}(f(x)) &=& \beta_{0} \end{array} $$
(12)
$$\begin{array}{@{}rcl@{}} \text{Var}(f(x)) &=& \sigma_{u}^{2} \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} \mathrm{E}(D) &=& (\beta_{0}, \ldots, \beta_{0})^{T} \end{array} $$
(14)
while Cov(f(x),D) is a row vector of length n with j-th component
$$\begin{array}{@{}rcl@{}} \text{Cov}(f(x), D)_{j} &=& \text{Cov}(f(x), f(x^{(j)})) \\ &=& \sigma^{2}_{u} \, \text{exp}\left\{ -\frac{\|x-x^{(j)}\|^{2}}{\theta^{2}}\right\} \end{array} $$
(15)
and similarly Var(D) is an n×n matrix with (j,k) element
$$\begin{array}{@{}rcl@{}} \text{Var}(D)_{jk} &=& \text{Cov}(f(x^{(j)}), f(x^{(k)})) \\ &=& \sigma^{2}_{u} \, \text{exp}\left\{ -\frac{\|x^{(j)}-x^{(k)}\|^{2}}{\theta^{2}}\right\} \end{array} $$
(16)
We can now calculate the adjusted expectation and variance, E_D(f(x)) and Var_D(f(x)), from Eqs. (8) and (9) respectively.
Figure 4 (left panel) shows E_D(f(x)) as a function of x, as the blue line. We can see that it precisely interpolates the five known runs at outputs D, which is desirable as f(x) is a deterministic function. The blue line also gives a satisfactory estimate of the true function f(x) = exp(3.5x). The pair of red lines gives the credible interval \(\mathrm {E}_{D}(f(x)) \pm 3 \sqrt {\text {Var}_{D}(f(x))}\) as a function of x. This defines a region between the lines that we believe is highly likely to contain the true function f(x). Another desirable feature of the emulator is that these credible intervals decrease to zero width at the five known run locations, as is appropriate for a deterministic function, since we know the value of f(x) precisely there (and because we have no inactive variables). Therefore when x is close to a known run we are more certain about the possible values of f(x) than when x is far from any such runs.
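The behaviour just described (exact interpolation at the five runs, with the credible interval collapsing to zero width there) can be checked numerically. The following is a self-contained Python sketch of Eqs. (8), (9), (15) and (16), using the prior values β_0 = 3.5, σ_u = 1.5 and θ = 0.14 from the text (the paper's own code, in R, is in Additional file 1; the hand-rolled linear solver is only to avoid dependencies):

```python
import math

# Training runs: the five model evaluations of Eq. (10), with t = 3.5
xs = [0.1, 0.2, 0.3, 0.4, 0.5]
D = [math.exp(3.5 * x) for x in xs]

beta0, sigma_u, theta = 3.5, 1.5, 0.14   # prior specification from the text

def cov(a, b):
    """Squared-exponential covariance of Eq. (6)."""
    return sigma_u**2 * math.exp(-(a - b)**2 / theta**2)

def solve(A, b):
    """Solve A v = b by Gauss-Jordan elimination with partial pivoting
    (numpy.linalg.solve would do the same job)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        piv = M[i][i]
        M[i] = [v / piv for v in M[i]]
        for r in range(n):
            if r != i:
                M[r] = [v - M[r][i] * w for v, w in zip(M[r], M[i])]
    return [M[i][n] for i in range(n)]

V = [[cov(a, b) for b in xs] for a in xs]      # Var(D), Eq. (16)
a_resid = solve(V, [d - beta0 for d in D])     # Var(D)^{-1} (D - E(D))

def emulate(x):
    """Adjusted expectation and variance, Eqs. (8) and (9)."""
    t = [cov(x, xj) for xj in xs]              # Cov(f(x), D), Eq. (15)
    E_D = beta0 + sum(ti * ai for ti, ai in zip(t, a_resid))
    Var_D = sigma_u**2 - sum(ti * bi for ti, bi in zip(t, solve(V, t)))
    return E_D, max(Var_D, 0.0)                # clip tiny negative round-off
```

At each training point emulate() returns the run value with essentially zero variance, while between runs the variance is strictly positive, reproducing the pinched credible intervals of Fig. 4.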
Figure 4 (right panel) shows an emulator as applied to a more complex 1-dimensional function. Here the true function is
$$ f(x) \;\;=\;\; 3\,x \sin\left(\frac{5\pi(x-0.1)}{0.4}\right) $$
(17)
which has been evaluated at only 10 input points evenly spread between x^{(1)} = 0.1 and x^{(10)} = 0.5. Here the prior emulator specifications were as in the previous example, but with E(β_0)=0, σ_u=0.6 and θ=0.06, allowing for functions with more curvature, centred around zero. As before, the blue and red lines show E_D(f(x)) and \(\mathrm {E}_{D}(f(x)) \pm 3 \sqrt {\text {Var}_{D}(f(x))}\) as functions of x. The true function f(x) is given by the solid black line, and it can be seen that it lies within the credible region for all x, only getting close to the boundary for x>0.5. This demonstrates the power of the emulation process: with the use of only 10 points, the emulator accurately mimics a reasonably complex function with five turning points. We will demonstrate the effectiveness of emulators in higher dimensions for the main Arabidopsis model example.
History matching: an efficient global parameter search
Bayesian emulation is very useful in a variety of situations. As emulators are extremely fast to evaluate, they can replace the original model in any larger calculation, for example when designing future experiments. They can also provide much structural insight into the behaviour of the model. One of the most important applications of emulation is to the problem of performing a global parameter search. In this section we describe a powerful iterative global search method known as history matching, which has been successfully employed in a variety of scientific disciplines including galaxy formation [6, 7, 15, 22, 24], epidemiology [11, 25, 42, 43], oil reservoir modelling [8, 26–28], climate modelling [12], environmental science [16, 44] and traffic modelling [29]. Many of these applications involved models with substantial runtime, for which the process of emulation is vital.
When confronting a systems biology model with observed data the following questions are typically asked:

1. Are there any input parameter settings that lead to acceptable matches between the model output and observed data?

2. If so, what is the full set \(\mathcal {X}\) that contains all such input parameter settings?
History matching is designed to answer these questions. It proceeds iteratively and employs implausibility measures to determine parts of the input space that can be discarded from further investigation.
We can ask the question: for an unexplored input parameter setting x, how far would the emulator’s expected value for the individual function output f_i(x) be from the corresponding observed value z_i before we would deem it highly unlikely that f_i(x) could give an acceptable match, were we to evaluate the function at this value of x? The implausibility measure I_i(x) captures this concept, and is given by:
$$ I_{i}^{2}(x) \;\;=\;\; \frac{\left(\mathrm{E}_{D_{i}}(f_{i}(x)) - z_{i}\right)^{2}}{\text{Var}_{D_{i}}(f_{i}(x)) + \text{Var}(\epsilon_{i}) + \text{Var}(e_{i})} $$
(18)
The numerator of Eq. (18) gives the distance between the emulator expectation \(\mathrm {E}_{D_{i}}(f_{i}(x))\) and the observation z_i, while the denominator standardises this quantity by all the relevant uncertainties regarding this distance: the emulator variance \(\text {Var}_{D_{i}}(f_{i}(x))\), the model discrepancy variance Var(ε_i) and the observation error variance Var(e_i). This structure is a direct consequence of Eqs. (3) and (4). A large value of I_i(x) for a particular x implies that we would be unlikely to obtain an acceptable match between f_i(x) and z_i were we to run the model there. Hence we can discard the input x from the parameter search if I_i(x) > c, for some cutoff c, and refer to such an input as implausible. We may choose the cutoff c by appealing to Pukelsheim’s 3-sigma rule [45], the powerful result that for any continuous, unimodal distribution, 95% of its probability must lie within ±3σ, regardless of asymmetry, skew or heavy tails, which suggests that a choice of c=3 could be deemed reasonable [6].
We can combine the implausibility measures I_i(x) from several outputs in various simple ways; for example, we could maximise over all outputs, defining
$$ I_{M}(x) \;\;=\;\; \underset{{i \in Q}}{\text{max}}\ I_{i}(x) $$
(19)
where Q represents the collection of all outputs, or some important subset of them (often we will only emulate a small subset of outputs in early iterations). A more robust approach would be to consider the second or third maximum implausibility, hence allowing for some inaccuracy in the emulators [6]. Multivariate implausibility measures are also available (see [6] for details), but these require a more detailed prior specification, for example covariances between different components of e and ε. Note that a low value of the implausibility I_M(x) does not imply that the input point x is ‘good’ or ‘plausible’, as it may still lead to a poor fit to outputs that have not yet been included in Q. Also, low implausibility at x may occur because of a high emulator variance \(\text {Var}_{D_{i}}(f_{i}(x))\), which, once resolved following further runs of the model, may then lead to a high implausibility at x. Hence we refer to low implausibility inputs x as “non-implausible”, consistent with the literature [6–8, 11, 24, 25, 28].
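Eqs. (18) and (19) amount to only a few lines of code. A minimal sketch (the argument structure, passing (expectation, variance) pairs per output, is a hypothetical convenience, not from the paper):

```python
import math

def implausibility(E_D, Var_D, z, var_eps, var_e):
    """Implausibility of Eq. (18) for one output: distance from the emulator
    expectation to the observation, standardised by the emulator variance,
    model discrepancy variance and observation error variance."""
    return abs(E_D - z) / math.sqrt(Var_D + var_eps + var_e)

def I_M(emulations, observations, var_eps, var_e):
    """Maximum implausibility over outputs, Eq. (19). `emulations` is a
    list of (E_D, Var_D) pairs, one per output in Q."""
    return max(implausibility(E, V, z, ve, vo)
               for (E, V), z, ve, vo
               in zip(emulations, observations, var_eps, var_e))
```

Note how a large emulator variance Var_D deflates the implausibility, which is exactly the mechanism behind the “non-implausible rather than plausible” caveat above.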
We proceed iteratively, discarding regions of the input parameter space in waves, refocussing our search on the remaining ‘non-implausible’ inputs at each wave. Prior to performing the k-th wave, we define the current set of non-implausible input points as \(\mathcal {X}_{k}\) and the set of outputs that we considered for emulation in the previous wave as Q_{k−1}. We proceed according to the following algorithm:

1. Design and evaluate a well chosen set of runs over the current non-implausible space \(\mathcal {X}_{k}\), e.g. using a maximin Latin hypercube with rejection [6].

2. Check to see if there are new, informative outputs that can now be emulated accurately (that were difficult to emulate well in previous waves) and add them to the previous set Q_{k−1}, to define Q_k.

3. Use the runs to construct new, more accurate emulators, defined only over the region \(\mathcal {X}_{k}\), for each output in Q_k.

4. Recalculate the implausibility measures I_i(x), i ∈ Q_k, over \(\mathcal {X}_{k}\), using the new emulators.

5. Impose cutoffs I_i(x) < c on the implausibility measures; this defines a new, smaller non-implausible volume \(\mathcal {X}_{k+1}\), which should satisfy \(\mathcal {X} \subset \mathcal {X}_{k+1} \subset \mathcal {X}_{k}\).

6. Unless (a) the emulator variances for all outputs of interest are now small in comparison to the other sources of uncertainty due to the model discrepancy and observation errors, or (b) the entire input space has been deemed implausible, return to step 1.

7. If 6(a) is true, generate as large a number as possible of acceptable runs from the final non-implausible volume \(\mathcal {X}\), sampled depending on the scientific goal.
We then analyse the form of the non-implausible volume \(\mathcal {X}\), the behaviour of model evaluations from different locations within it, and the corresponding biological implications. If the entire input space has been deemed implausible in step 6(b), this may imply that the model is inconsistent with the observed data, with respect to the specified uncertainties. This could be because the biological principles that underlie the model’s structure are incorrect, and hence remodelling is required. Alternatively, we may have underestimated the observation errors, the model discrepancy or even the emulator uncertainty, although emulator diagnostics [40] combined with the choice of fairly conservative cutoffs should make the latter unlikely. Note that concluding that a far larger model discrepancy is needed is essentially stating that the model is highly inaccurate, and it may, for example, be judged unfit for purpose.
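One wave of this algorithm, applied to the 1-dimensional exponential example, can be sketched as follows. Because the toy model is fast, the model itself is used in place of an emulator (so the emulator variance in Eq. (18) is zero); the pretend observation, error variance and grid below are illustrative values, not those of Fig. 5:

```python
import math

# Toy 1-d history match on f(x) = exp(3.5 x). The "true" input x* = 0.32,
# observation error and grid resolution are all illustrative choices.
def f(x):
    return math.exp(3.5 * x)

z = math.exp(3.5 * 0.32)          # pretend observation of the real system
var_e, var_eps = 0.05**2, 0.0     # observation error; zero model discrepancy

def implausibility(x):
    """Eq. (18) with Var_D(f(x)) = 0, since we run the model directly."""
    return abs(f(x) - z) / math.sqrt(var_e + var_eps)

# One wave: scan candidate inputs over the initial range and keep the
# non-implausible ones (cutoff c = 3, per Pukelsheim's 3-sigma rule)
grid = [0.075 + i * (0.525 - 0.075) / 400 for i in range(401)]
nonimplausible = [x for x in grid if implausibility(x) < 3.0]
```

Subsequent waves would redesign runs over the surviving interval and repeat; with an emulator in place of f, its variance would enter the denominator and the cut would be correspondingly more conservative.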
The history matching approach is powerful for several reasons:

As we progress through the waves and reduce the volume of the region of input space of interest, we expect the function f(x) to become smoother, and hence to be more accurately approximated by the regression part of the emulator \(\phantom {\dot {i}}\beta _{ij} g_{ij}(x_{A_{i}})\), which is often composed of low order polynomials (see Eq. 5).

At each new wave we have a higher density of points in a smaller volume and hence the Gaussian process term \(\phantom {\dot {i}}u_{i}(x_{A_{i}})\) in the emulator will be more effective, as it depends mainly on the proximity of x to the nearest runs.

In later waves the previously strongly dominant active inputs from early waves will have their effects curtailed, and hence it will be easier to select additional active inputs that went unnoticed before.

There may be several outputs that are difficult to emulate in early waves (perhaps because of their erratic behaviour in uninteresting parts of the input space) but simple to emulate in later waves once we have restricted the input space to a much smaller and more biologically realistic region.
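To make the two emulator components discussed above concrete, a bare-bones version of the regression-plus-Gaussian-process form can be sketched as below. The linear basis, the squared-exponential correlation, and the fixed hyperparameters `theta` and `sigma_u` are illustrative assumptions, not the paper's choices (which would normally be estimated from the runs):

```python
import numpy as np

def sq_exp(x1, x2, theta):
    """Squared-exponential correlation between two sets of 1-d inputs."""
    return np.exp(-((x1[:, None] - x2[None, :]) / theta) ** 2)

def fit_emulator(x, f, theta=0.5, sigma_u=0.1):
    """Fit an emulator of the form beta0 + beta1*x + u(x): a linear
    regression surface plus a zero-mean Gaussian process u that
    interpolates the residuals at the training runs."""
    G = np.vstack([np.ones_like(x), x]).T            # regression basis g(x)
    beta, *_ = np.linalg.lstsq(G, f, rcond=None)     # least-squares betas
    resid = f - G @ beta
    K = sigma_u**2 * sq_exp(x, x, theta) + 1e-10 * np.eye(len(x))
    alpha = np.linalg.solve(K, resid)                # GP weights
    def predict(xs):
        ks = sigma_u**2 * sq_exp(xs, x, theta)
        mean = np.vstack([np.ones_like(xs), xs]).T @ beta + ks @ alpha
        var = sigma_u**2 - np.einsum('ij,ji->i', ks, np.linalg.solve(K, ks.T))
        return mean, np.maximum(var, 0.0)
    return predict
```

At a training run the predictive variance collapses towards zero, which is exactly the behaviour exploited as the density of runs increases through the waves.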
1-dimensional example
We now demonstrate the history matching process as applied to the simple 1-dimensional exponential example. Figure 5 (left panel) shows the emulator expectation and credible intervals as in Fig. 4; however, the observation z and its observed error have now been included as the horizontal black solid and dashed lines respectively. Here we have set the model discrepancy to zero (\(\sigma_{\epsilon}=0\)) and reduced the size of the observation error \(\sigma_{e}\) for clarity. Also given are the implausibilities I(x), represented by the colours on the x-axis: red, yellow and green for high (I(x)>3.5), borderline (3<I(x)<3.5) and low (I(x)<3) implausibility respectively.
The non-implausible space \(\mathcal {X}_{1}\) at wave 1 is the full initial range of the rate parameter x, which is 0.075<x<5.25. If we impose the cutoff I(x)<3, this defines the wave 2 non-implausible space \(\mathcal {X}_{2}\), shown by the green region of the x-axis in Fig. 5 (left panel).
We then perform the second wave by designing a set of two more runs over \(\mathcal {X}_{2}\), reconstructing the emulator over this region, and recalculating the implausibility measure I(x). The results of this second wave are shown in Fig. 5 (right panel). It can be seen that the emulator is now highly accurate over the \(\mathcal {X}_{2}\) region and that the non-implausible region in green has been further reduced. As the emulator uncertainty is now far smaller than the corresponding observation error, we may stop the analysis at this wave, since \(\mathcal {X}_{3} \simeq \mathcal {X}\), implying that further runs will do little to reduce the non-implausible region further. Note that, provided we have enough runs in each wave, we would often create new emulators at each wave, defined only over the current green non-implausible region [6], instead of updating emulators from previous waves, as in Fig. 5. See Additional file 1 for R code to reproduce the example model output, discrepancy, emulation and history matching plots of Figs. 1, 2, 4 and 5 respectively.
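Because the exact solution of the toy model is available and cheap, the wave-1 implausibility cut of this example can be sketched directly, as below (in a realistic application the emulator expectation would replace f(x), with the emulator variance added inside the square root). The observation z = 2 and error sigma_e = 0.1 are illustrative stand-ins, not the values used to produce Fig. 5:

```python
import numpy as np

# Exact solution of the toy model at t = 3.5 (cheap, so no emulator needed here).
f = lambda x: np.exp(3.5 * x)

# Illustrative observation and uncertainties (NOT the values behind Fig. 5):
z, sigma_e, sigma_md = 2.0, 0.1, 0.0

x = np.linspace(0.075, 5.25, 10000)          # wave-1 range of the rate parameter
I = np.abs(f(x) - z) / np.sqrt(sigma_e**2 + sigma_md**2)
nonimp = x[I < 3]                            # green region: wave-2 space X_2
print(nonimp.min(), nonimp.max())
```

The retained interval is a small fraction of the initial range, illustrating how sharply even one wave can cut the input space.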
History matching and Bayesian MCMC
Here we discuss the standard form of a full Bayesian analysis, and compare it to the above history matching approach, highlighting the relative strengths and weaknesses of each method.
History matching attempts to answer efficiently some of the most important questions that a modeller may have: identifying whether the model can match the data, and where in input space such acceptable matches can be found. It requires only a limited specification related to the key uncertain quantities, in terms of means, variances and covariances. A fully Bayesian approach goes further [46], and delivers a posterior distribution across all uncertain quantities, which has the benefit of providing probabilistic answers to most scientific questions; in this context, for example, it gives the posterior distribution of the location of the true input \(x^{*}\). However, it requires a more detailed prior specification of joint probability distributions across all these quantities, and critically, it also assumes the existence of a single true \(x^{*}\) (and the accuracy of the statistical model that defines it). This may not be judged appropriate, for example, in a situation where two different \(x^{*}\)s were found that gave good matches to two different subsets of the observed data. In more detail, a fully Bayesian specification in the context of Eqs. (3) and (4) requires the multivariate distributions \(\pi(z|y)\), \(\pi(y|f(x^{*}))\) and \(\pi(f(x^{(j)})|x^{(j)})\) for a collection of inputs \(x^{(j)}\), and a prior distribution \(\pi(x^{*})\) over the true input \(x^{*}\). Meaningful specifications of this form can be difficult to make: familiar distributional forms such as the multivariate normal are often assumed, but such choices are frequently made for mathematical convenience or computational tractability. Choices of this kind demand careful justification, without which results such as the posterior for \(x^{*}\) rapidly lose meaning. For example, the fully Bayesian approach will, after substantial calculation, return a posterior for \(x^{*}\), which may be quite narrow, even if the model cannot fit the observed data at all. History matching, however, may quickly discover this mismatch after a few waves, making further analysis unnecessary.
The second drawback of the fully Bayesian approach is that it is often hard to perform the necessary calculations, and therefore various numerical schemes are required, the most popular being MCMC [47]. While MCMC has enjoyed much success, issues still remain over the convergence of an MCMC algorithm for even modest dimensional problems [48]. Often the likelihood may be highly multimodal, and therefore vast numbers of model evaluations are usually required to reach convergence, making MCMC prohibitively expensive for models of even moderate evaluation time. In contrast, the calculations for history matching are relatively fast and simple.
A third issue is that of robustness: small changes in the full Bayesian specification, especially involving the likelihood, can lead to substantial changes in the posterior. These sensitivities can go unnoticed and can be hard to analyse [49–52], but will call into question the resulting scientific conclusions.
Due to these issues, we support the fully Bayesian approach, but only for cases where such detailed calculations are warranted: say, for a well tested, accurate biological model with well understood model deficiencies, which is to be combined with data that have a trusted observation error structure, and, critically, where full probabilistic results are deemed essential. If, instead, the main concern is to check whether the model can fit the data, to see which regions of input parameter space give acceptable matches, and to use this for further model development, then a history match may be the more appropriate analysis. Even if one wishes to forecast the results of future experiments [33, 53] and to make subsequent decisions, history matching can be sufficient, as one can either appropriately reweight the samples generated in the final wave, as was done in [11], or, as stated in step 7 of the history matching algorithm, use more sophisticated sampling in the non-implausible region depending on the scientific question.
The use of emulators can of course facilitate the large number of model evaluations required for Bayesian MCMC algorithms, admittedly at the expense of increased uncertainty (see for example [9, 54]). However, a more serious problem is then encountered. The likelihood function, which is a core component of Bayesian calculations, is constructed from all outputs of interest (and therefore attempts to describe both the ‘good’ and ‘bad’ inputs simultaneously). Hence we need to be able to emulate all such outputs with sufficient accuracy, including their possibly complex joint behaviour. This may be extremely challenging, as often, especially in early waves, there exist several erratically behaved outputs that are extremely difficult to emulate and that will dramatically fail emulator diagnostics [40]. Unfortunately, the likelihood, and hence the posterior, may be highly sensitive to such poorly constructed emulators. Therefore, from a purely practical perspective, employing the full Bayesian paradigm using inadequate emulators constructed at wave 1 may be unwise [6, 15].
History matching as a precursor to MCMC
If one does wish to perform a fully Bayesian analysis on a well tested biological model, we would usually recommend performing a history match first [15]. This would identify the non-implausible region \(\mathcal {X}\), which should contain the vast majority of the posterior. Then MCMC or an equivalent Bayesian calculation (such as importance sampling) can be performed within \(\mathcal {X}\), using the accurate emulators defined in the final wave (for an example of this see [11]).
This is because if the model is of modest to high dimension, the posterior may often only occupy a tiny fraction of the original input space \(\mathcal {X}_{1}\). Unless the model is very fast to evaluate, we would need to use emulators to overcome the MCMC convergence issues, but performing enough model runs to construct sufficiently accurate emulators over the whole of \(\mathcal {X}_{1}\) would be extremely inefficient. Iterative history matching naturally provides the accurate emulators that we would need, defined only over \(\mathcal {X}\), which should contain the posterior.
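A minimal sketch of this strategy follows, with stand-ins for the final-wave emulator, the non-implausible interval and all numerical values (none of which are the paper's; the Gaussian likelihood and uniform prior over \(\mathcal {X}\) are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the final-wave emulator and non-implausible region X
# (all functions and numbers below are illustrative, not the paper's):
em_mean = lambda x: np.exp(3.5 * x)       # final emulator ~ exact toy model
em_var  = lambda x: 0.0                   # emulator variance ~ 0 after wave 2
in_X    = lambda x: 0.15 < x < 0.24       # final non-implausible interval
z, var_e = 2.0, 0.1**2                    # illustrative observation and error

def log_lik(x):
    # Gaussian log-likelihood built from the emulator, inflating the
    # observation error variance by the emulator variance.
    return -0.5 * (z - em_mean(x))**2 / (var_e + em_var(x))

x, chain = 0.2, []
for _ in range(5000):
    prop = x + 0.01 * rng.standard_normal()
    # Proposals outside X are rejected outright (uniform prior on X);
    # otherwise a standard Metropolis accept/reject step is taken.
    if in_X(prop) and np.log(rng.random()) < log_lik(prop) - log_lik(x):
        x = prop
    chain.append(x)
```

Restricting proposals to \(\mathcal {X}\) means the chain only ever queries the emulator where it is accurate, avoiding the wasted and potentially misleading evaluations of a whole-space MCMC.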
Note that the history match cuts out space efficiently, based on small numbers of easy-to-emulate outputs in early waves, and designs appropriate runs for the next wave that have good coverage properties for the current non-implausible region. Alternative iterative strategies, such as using MCMC at each wave to generate samples from the current posterior (which includes high emulator uncertainty), for use as model runs in the next wave, may be highly inefficient and could run into a series of difficulties. Such strategies would not fully exploit the smoothness of the model output, and may tend to cluster points together around the current posterior mode (hence squandering runs, for smooth deterministic models), leading to poor coverage of the full non-implausible space. Such strategies may also be highly misled by inaccurate emulators and the subsequent posterior sensitivity combined with multimodal likelihood issues, leading to clustered designs in the wrong parts of the input space, and in some cases a lack of convergence.
To conclude, if one really desires a fully Bayesian analysis, then performing a history match first can greatly improve efficiency. In this way we view history matching not as a straight competitor to alternative approaches, but instead as a complementary technique, which has many benefits in its own right. See [6] and the extended discussion in [15] for more details of this argument.
History matching and ABC
Another Bayesian technique that has been developed more recently, and compared (somewhat cautiously) to history matching, by, for example, [55], is that of Approximate Bayesian Computation or ABC [56]. While the two approaches seem to share some superficial similarities, they are fundamentally different in their goal and the principled way each approach is set up and implemented. For example, ABC attempts to approximate full Bayesian inference and hence to obtain an approximate Bayesian posterior distribution (critically, using a tolerance that tends to zero). History matching is not an inference procedure, as it is simply attempting to rule out all the input space that is clearly inconsistent with the data given the model discrepancy and observation error (which are meaningful ‘tolerances’ that critically will never tend to zero). It is worth noting that if one attempts to specify a meaningful minimum size for the tolerance in ABC, one is arguably not really employing ABC anymore, but is instead just back using Bayesian inference (as shown by [56]) in the form of a sampling-resampling algorithm (as described for example by [57]). History matching does not attempt to probabilize the remaining input space in any way, which can result in increased efficiency of the parameter search. We have directly compared and contrasted the two approaches in [58], where we demonstrated that a powerful version of ABC failed to find any part of parameter space that matched the observed data for a 22-dimensional stochastic epidemiology model, while history matching found the correct part of input space and many good matches using approximately half the number of runs that the (failed) ABC-SMC approach used.
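The difference in how the two methods treat their ‘tolerance’ can be made concrete on the 1-dimensional toy model; the tolerance eps and the values of z and sigma_e below are illustrative choices of our own:

```python
import numpy as np

f = lambda x: np.exp(3.5 * x)        # toy model output at t = 3.5
z, sigma_e = 2.0, 0.1                # illustrative observation and error

x = np.linspace(0.075, 5.25, 10000)

# ABC rejection: accept x if the distance to the data is below a
# tolerance eps, where exactness requires eps -> 0.
eps = 0.05
abc_accept = np.abs(f(x) - z) < eps

# History matching: rule out x only if I(x) > 3, using a fixed,
# meaningful observation-error variance that never tends to zero.
hm_keep = np.abs(f(x) - z) / sigma_e < 3

print(abc_accept.sum(), hm_keep.sum())
```

As eps shrinks, the ABC-accepted set collapses towards the single best-fitting input, whereas the history matching cut stabilises at the full set of inputs consistent with the fixed observation error.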