We consider a model given by the system of ODEs of the general form:
dy/dt = f(t, y, θ),    y(0) = y0(θ).    (1)
Here the m-dimensional vector θ contains all unknown parameters, y is an n-dimensional state vector, and f is a given vector function, differentiable with respect to t, y and θ. When components of the initial state vector y0 are not known, they are considered as unknown parameters. Thus, y0 may depend on θ.
As mentioned above, we assume that (1) is the 'right' model for the problem we are interested in, i.e. that (1) is a sufficiently accurate mathematical description of reality. This means that all relevant knowledge about the modeled processes is incorporated correctly in the vector function f. Thus, the only uncertainty in (1) is the vector of unknown parameters θ. Furthermore, it means that there exists a 'true' value θ* of the parameters θ for which (1) represents reality. Therefore, in principle, all unknown parameters can be determined, provided that sufficient and sufficiently accurate data are available.
Quantities that can be experimentally measured are called observables. The theory of identifiability holds in general for observables that are combinations of state variables; for the sake of simplicity, however, we consider here the particular case in which only components of the state vector are measured. Let us assume that for fitting (1) there are N measurements available. Each measurement, which we denote by ỹ_i, is specified by the time t_i at which the c_i-th component of the state vector y is measured. The corresponding model value obtained from (1) is denoted by y_{c_i}(t_i, θ). The assumptions outlined above imply that the difference y_{c_i}(t_i, θ*) − ỹ_i is solely due to experimental error. We denote the vector of discrepancies between the theoretical and the measured values by Y(θ), with components Y_i(θ) = y_{c_i}(t_i, θ) − ỹ_i. Then the least squares estimate of the parameters is the value of θ that minimizes the sum of squares [14, 15]

S(θ) = Y^T(θ) Y(θ) = Σ_{i=1}^{N} ( y_{c_i}(t_i, θ) − ỹ_i )².    (2)
We note that (2) is an appropriate measure under certain assumptions only, which we will discuss below. Other measures might be used when these assumptions do not hold.
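To make the setup concrete, the following sketch evaluates Y(θ) and S(θ) for a toy model; the right-hand side, the data arrays and all names are illustrative placeholders, not the gap gene model treated later.

# Minimal sketch: the discrepancy vector Y(theta) and the sum of squares
# S(theta) from (2), for a hypothetical two-state model.
import numpy as np
from scipy.integrate import solve_ivp

def f(t, y, theta):
    # Hypothetical right-hand side of (1): a linear two-compartment model.
    k1, k2 = theta
    return [-k1 * y[0], k1 * y[0] - k2 * y[1]]

def objective(theta, y0, t_obs, c_obs, y_obs):
    """Y(theta) and S(theta) for measurements y_obs[i] of component
    c_obs[i] taken at time t_obs[i]."""
    sol = solve_ivp(f, (0.0, float(t_obs.max())), y0, args=(theta,),
                    dense_output=True, rtol=1e-8, atol=1e-10)
    y_model = np.array([sol.sol(t)[c] for t, c in zip(t_obs, c_obs)])
    Y = y_model - y_obs           # vector of discrepancies
    return Y, float(Y @ Y)        # S(theta) = Y^T(theta) Y(theta)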
Parameter Estimation by the Levenberg-Marquardt Method
There exist a number of different optimization techniques for parameter estimation. The choice of technique usually depends on the type of model equations (deterministic or stochastic), on the number of unknown parameters (moderate or large), and on the dependence of the model solutions on the parameters (linear or nonlinear, continuous or discontinuous). For a survey of optimization methods in biochemical models we refer to [2, 16]. In general, model (1) – being nonlinear in θ – leads to a least squares problem (2) with several minima: first, because the problem may have more than one solution, and second, because the fitness function (2) can have stationary points that do not correspond to the lowest value of the fitness landscape (so-called local minima). Local search methods, like Levenberg-Marquardt (LM), easily get trapped in one of the local minima rather than finding the global minimum. To explore the whole search space one needs global search methods, like the Evolution Strategy used in [12]. Unfortunately, these methods converge very slowly once near a minimum. In contrast, gradient-based methods are efficient optimizers [17] for nonlinear least squares problems once a sufficiently good initial guess for the parameter values is available. In this paper we use the solutions from the global search in [12] as initial guesses for local optimization by the LM method [18]. In this way we reduce the chance of missing the global minimum, while all minima are determined precisely and quickly.
In general, any gradient-based optimization procedure seeks a correction δθ for the parameter vector, such that S(θ + δθ) ≤ S(θ) holds. The LM method [18] determines the correction as the solution of the equations
(J^T(θ)J(θ) + λI_m) δθ = −J^T(θ) Y(θ),    (3)
where λ ≥ 0 is a control parameter (see below), I_m is the identity matrix of size m, and the Jacobian J(θ) = ∂Y(θ)/∂θ is the so-called 'sensitivity' matrix of size N × m. The entry J_{i,j} of J(θ) shows how sensitive the model response at the i-th data point is to a change in the j-th parameter. The LM method can be seen as a combination of two gradient-based approaches: Gauss-Newton and steepest descent [17]. For λ = 0, (3) coincides with the Gauss-Newton method. However, when the matrix J^T(θ)J(θ) is (almost) singular, λ has to be positive for (3) to be solvable, and for large λ the LM method approaches the steepest descent method. During the optimization, λ is adapted such that the algorithm exploits the fast convergence of the Gauss-Newton method whenever possible [18, 19].
In order to solve (3), the singular value decomposition (SVD) [20] of the matrix J(θ) can be used, i.e.
J(θ) = U(θ) Σ(θ) V^T(θ),    (4)
where U(θ) is a column-orthogonal matrix of size N × m, such that U^T(θ)U(θ) = I_m, V(θ) is an orthogonal matrix of size m × m, such that V^T(θ)V(θ) = V(θ)V^T(θ) = I_m, and Σ(θ) is a diagonal matrix of size m × m containing the singular values σ_i in non-increasing order. The correction δθ can then be found as
δθ = −V(θ) (Σ²(θ) + λI_m)^{-1} Σ(θ) U^T(θ) Y(θ).    (5)
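Computing (5) is cheap once the SVD is available. The following minimal numpy sketch implements (5) together with one common λ-adaptation strategy (increase λ after a rejected step, decrease it after an accepted one); the function names and the factor ν = 10 are our own illustrative choices, not prescribed by [18].

import numpy as np

def lm_correction(J, Y, lam):
    """Solve (J^T J + lam I) dtheta = -J^T Y via the SVD (4), as in (5)."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)  # J = U diag(s) V^T
    return -(Vt.T * (s / (s**2 + lam))) @ (U.T @ Y)   # -V (S^2 + lam I)^-1 S U^T Y

def lm_step(theta, lam, eval_YJ, nu=10.0, max_tries=20):
    """One accepted LM step; eval_YJ(theta) must return (Y(theta), J(theta))."""
    Y, J = eval_YJ(theta)
    S = Y @ Y
    for _ in range(max_tries):
        d = lm_correction(J, Y, lam)
        Y_new, _ = eval_YJ(theta + d)
        if Y_new @ Y_new <= S:   # S decreased: accept, behave more like Gauss-Newton
            return theta + d, lam / nu
        lam *= nu                # rejected: behave more like steepest descent
    return theta, lam            # no improvement found within max_tries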
Later, when we study the reliability of the parameters computed, the SVD will play an important role again.
In order to execute an LM optimization step, the vector of discrepancies Y(θ), the matrix J(θ) and its SVD have to be evaluated for each new estimate of θ. For this purpose, for Y and the entries of J one needs to solve (1) and the additional system of variational equations (i = 1, 2,...,m)

d(∂y/∂θ_i)/dt = (∂f/∂y) (∂y/∂θ_i) + ∂f/∂θ_i,    (∂y/∂θ_i)(0) = ∂y0/∂θ_i,    (6)
respectively. We note that the costs for performing the SVD and computing the correction (5) are negligible in comparison with the computational costs for solving (1) and (6).
Thus, a single LM step requires the numerical solution of m + 1 coupled systems, each consisting of n ODEs. Fortunately, these systems are coupled in a special way: for each i = 1, 2,...,m, system (6) is a system of linear ODEs for ∂y/∂θ_i, coupled only with (1). The system of equations (6) has the same stiffness as (1), so for numerical stability the same step size can be used for the time integration of (1) and (6) (note that ODE stiffness is determined by the eigenvalues of the Jacobian matrix and is not related to parameter stiffness as described above). Therefore, the one-way coupling can be used to solve (1) and (6) efficiently.
Numerical integration of (1) and (6) requires a fast and reliable ODE solver. Search in the parameter space may lead to values of θ for which the systems of ODEs become stiff [21]. It is well known that for stiff ODE systems explicit schemes can give rise to numerical instability or, alternatively, extremely small time steps. Therefore, an implicit scheme is the best choice for time integration for stability reasons. Using an implicit scheme also allows us to exploit the specific coupling between (1) and (6) in an efficient way. At each time step τ, integrating (1) first provides the solution vector y. This requires the LU decomposition of an iteration matrix involving the Jacobian ∂f/∂y. Reusing this LU decomposition, the calculation of ∂y/∂θ_i from (6) reduces to a simple forward and backward substitution. In our simulations we use a tailor-made code [22] based on the implicit multistep Backward Differentiation Formulas (BDF) [23].
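As an illustration of this structure, the sketch below stacks (1) and the variational equations (6) into one augmented system and hands it to a stiff BDF integrator. A generic solver such as scipy's does not reuse the LU decomposition across (1) and (6) the way the tailor-made code [22] does; the function names and packaging here are assumptions for illustration only.

import numpy as np
from scipy.integrate import solve_ivp

def augmented_rhs(t, z, theta, f, dfdy, dfdth, n, m):
    """Right-hand side of (1) stacked with the variational equations (6)."""
    y, S = z[:n], z[n:].reshape(n, m)   # S[:, i] = dy/dtheta_i
    A = dfdy(t, y, theta)               # n x n Jacobian df/dy
    B = dfdth(t, y, theta)              # n x m matrix df/dtheta
    return np.concatenate([np.asarray(f(t, y, theta)), (A @ S + B).ravel()])

# Hypothetical usage, with S0 = dy0/dtheta from the initial conditions:
# z0 = np.concatenate([y0, S0.ravel()])
# sol = solve_ivp(augmented_rhs, (0.0, t_end), z0, method="BDF",
#                 args=(theta, f, dfdy, dfdth, n, m), rtol=1e-6, atol=1e-8)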
When the unknown parameters have to obey certain constraints – linear or non-linear – some additional work is needed. If the correction δθ found by (5) leads to violation of some constraints, then by the introduction of Lagrange multipliers a modified correction can be found, which fits all constraints. For the constrained minimization problem we refer the reader to [22].
For additional modeling and numerical aspects of this method we refer the reader to Additional file 1 (Section 1).
Statistical Analysis of Parameter Estimates
Above we used θ* to denote the 'true' parameter vector, for which (1) describes reality with sufficient accuracy, and by θ̂ we denote the parameter vector that minimizes (2). Even having the 'right' model and a parameter estimate that fits the data well does not mean that the whole modeling problem has been resolved successfully. It is important to know how reliable the obtained estimate is. This is the subject of a posteriori identifiability analysis [3, 4, 24]. One way to look at this is to inspect the fitness landscape S(θ) in the neighbourhood of θ̂. Roughly speaking, if it is a sharp trough, then the true parameter vector θ* and the obtained minimum are close. If it is flat in one or more directions, like the surface for a 2-parameter case in Fig. 1(a), then the minimum found can be far from the true parameter vector. Near the minimum, where the gradient of S(θ) vanishes, this surface is characterized by the second derivative or Hessian of S(θ). If the model is linear in the parameters, the Hessian is (up to a factor of 2) equal to J^TJ. This linearity assumption and some statistics underlie the following rigorous analysis [14, 15, 21].
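For completeness, the standard reasoning behind this approximation (see e.g. [14, 15]) reads:

∇S(θ) = 2 J^T(θ) Y(θ),    ∇²S(θ) = 2 J^T(θ)J(θ) + 2 Σ_{i=1}^{N} Y_i(θ) ∇²Y_i(θ).

If the model is linear in θ, the curvature terms ∇²Y_i vanish and the Hessian reduces to 2 J^TJ; near a good fit the residuals Y_i are small, so the same approximation is commonly used for mildly nonlinear models as well.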
We assume that the measurement errors in ỹ_i are mutually independent and normally distributed, with zero mean and constant standard deviation σ. Then θ̂ is a maximum likelihood estimate [14, 15]. By assumption the model with the 'true' solution θ* describes reality, thus

ỹ_i = y_{c_i}(t_i, θ*) + ϵ_i,

where ϵ_i are the measurement errors. Under these assumptions,

θ̂ ∼ N_m( θ*, σ² (J^T(θ*) J(θ*))^{-1} )    (7)

holds approximately [14]. Here N_m(·,·) denotes the m-dimensional multivariate normal distribution with the given mean and covariance matrix. Notice that (7) holds exactly when y is linear in θ. Next we can define a region around θ̂ in which the 'true' parameter vector θ* lies with a certain probability 1 − α. This (1 − α)-confidence region is determined by the inequality
(θ − θ̂)^T J^T(θ̂) J(θ̂) (θ − θ̂) ≤ m S(θ̂)/(N − m) F_α(m, N − m),    (8)

where F_α(m, N − m) is the upper α quantile of Fisher's F-distribution with m and N − m degrees of freedom. Geometrically, these confidence regions are given by contours of S(θ) (for linear models), cf. Fig. 1(a).
The ellipsoid defined by (8) is centered at θ̂ and has its principal axes directed along the eigenvectors of J^T(θ̂)J(θ̂). Using the SVD (4) for J(θ̂), we get

J^T(θ̂) J(θ̂) = V(θ̂) Σ²(θ̂) V^T(θ̂),

so the eigenvectors of J^T(θ̂)J(θ̂) are the columns of the matrix V(θ̂), and the ellipsoid has its principal axes directed along these column vectors. Moreover, the radii along the principal axes are inversely proportional to the corresponding singular values σ_i, the diagonal elements of Σ(θ̂). This can all be seen by applying the transformation (rotation)

z = V^T(θ̂) (θ − θ̂),    (9)

which yields

Σ_{i=1}^{m} σ_i² z_i² ≤ m S(θ̂)/(N − m) F_α(m, N − m).    (10)
On the other hand, since S(θ̂)/(N − m) is an unbiased estimator of σ², the inequality for the ellipsoid can be rewritten as

Σ_{i=1}^{m} σ_i² z_i² ≤ r̃²,    (11)

where r̃² = m S(θ̂)/(N − m) F_α(m, N − m) ≈ m σ² F_α(m, N − m) is proportional to the variance of the measurement errors. This form is more convenient to work with, because the z_i can be treated as uncorrelated variables; once conclusions have been drawn about the identifiability of z, the problem can be transformed back, revealing the quality of θ̂.
Now we assume that the model (1) is properly scaled, such that all parameter values are of the same order of magnitude, and that we are interested only in the first few digits of the parameter values. Let us introduce the sphere given by

Σ_{i=1}^{m} z_i² = r_ϵ²,

where r_ϵ defines the level of accuracy one desires for the parameter estimates. For instance, if the parameters are of order O(1) and one is interested only in the first two digits to the right of the decimal point, then r_ϵ = 0.01. In order to determine z_i accurately enough, the radius of the ellipsoid along its i-th principal axis, r̃/σ_i, should not exceed the radius of the sphere, which leads to the inequality

σ_i ≥ r̃ / r_ϵ.    (12)
A graphical representation of the ellipsoid and the sphere for the 2-dimensional case is given in Figure 1(b).
If only the k largest singular values satisfy (12), then only the first k entries of z are estimated with the required accuracy, and no sufficient information is available for the remaining components of z. Each of the first k entries of z defines a parameter, or a linear combination of parameters, which is well-determined. If a principal axis of the ellipsoid makes a significant angle with the axes in parameter space (i.e., there is more than one significant entry in the corresponding eigenvector), this implies correlation between parameters in θ̂.
To summarize, the level of noise in the data in combination with the accuracy requirement for the parameter estimates, defines the threshold for significant singular values in the matrix Σ. The number of singular values exceeding this threshold determines the number of parameter relations that can be derived from the experiment. How these relations relate to the individual parameters is described by the corresponding columns in the matrix V. The largest entries in these columns indicate the well-determined parameters. This method is illustrated on the basis of a simple enzymatic reaction in [2].
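A small numpy sketch of this procedure; the threshold r̃/r_ϵ follows (11)-(12), F_α can be obtained as scipy.stats.f.ppf(1 - alpha, m, N - m), and the packaging and printout are our own illustrative choices.

import numpy as np

def determinable_combinations(J, S_min, N, m, F_alpha, r_eps, top=3):
    """Count the singular values that pass (12) and report the dominant
    entries of the corresponding columns of V, which identify the
    well-determined parameter combinations."""
    _, s, Vt = np.linalg.svd(J, full_matrices=False)
    r_tilde = np.sqrt(m * S_min / (N - m) * F_alpha)  # r~ from (11)
    k = int(np.sum(s >= r_tilde / r_eps))             # criterion (12)
    for i in range(k):
        dominant = np.argsort(np.abs(Vt[i]))[::-1][:top]
        print(f"combination {i}: sigma = {s[i]:.3g}, "
              f"dominant parameter indices {dominant.tolist()}")
    return k, Vt[:k]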
Finally, (11) indicates that having, for instance, twice as accurate data, so that the standard deviation σ is halved, will decrease the radii along the ellipsoid's principal axes by a factor of 2. Therefore, in the case of very small singular values σ_i (i.e. strongly elongated ellipsoids), more accurate measurements will not improve the quality of the corresponding parameter estimates by much. In such a case, one certainly needs additional measurements of a different type (e.g. different components, different time points, or, in the case of PDEs, different spatial points).
Another way of assessing the information in the confidence region is to look at confidence intervals for the individual parameter estimates θ̂_i (i = 1, 2,...,m). From (8) one can derive dependent and independent confidence intervals. The dependent confidence interval is the intersection of the ellipsoidal region with the i-th parameter axis,

|θ_i − θ̂_i| ≤ r̃ / √( (J^T(θ̂) J(θ̂))_{ii} ),    (13)

i.e. one assumes that all other parameters are exactly determined. The independent confidence interval is the projection of the ellipsoidal region onto the i-th parameter axis,

|θ_i − θ̂_i| ≤ r̃ √( ( (J^T(θ̂) J(θ̂))^{-1} )_{ii} ).    (14)
Clearly, a small independent confidence interval for θ̂_i indicates that it is well-determined. However, in some cases considering only individual confidence intervals can be misleading. For instance, in the presence of strong correlations between parameters, the dependent confidence intervals underestimate the confidence region while the independent confidence intervals overestimate it.
From (7), the covariance matrix of θ̂ is given by

C = σ² (J^T(θ̂) J(θ̂))^{-1}.    (15)

Then, denoting the entries of C by C_{ij}, the correlation coefficient between θ̂_i and θ̂_j can be computed as

ρ_{ij} = C_{ij} / √(C_{ii} C_{jj}).    (16)
We note that by computing individual confidence intervals and correlations between parameters, one is not able to assess the identifiability of linear combinations of parameters. This can be seen only with the first approach, i.e. by inspection of the V and Σ matrices.
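The quantities (13)-(16) follow directly from J(θ̂) and S(θ̂). A minimal sketch, in the linear approximation (the function name and the return layout are ours):

import numpy as np

def intervals_and_correlations(J, S_min, N, m, F_alpha):
    """Half-widths of the dependent (13) and independent (14) confidence
    intervals, plus the correlation matrix (16)."""
    JTJ = J.T @ J
    JTJinv = np.linalg.inv(JTJ)
    s2 = S_min / (N - m)                        # unbiased estimate of sigma^2
    r_tilde = np.sqrt(m * s2 * F_alpha)
    dep = r_tilde / np.sqrt(np.diag(JTJ))       # (13): intersection with axis i
    indep = r_tilde * np.sqrt(np.diag(JTJinv))  # (14): projection onto axis i
    C = s2 * JTJinv                             # covariance matrix (15)
    d = np.sqrt(np.diag(C))
    return dep, indep, C / np.outer(d, d)       # correlations (16)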
The Biological Test Problem: Gap Gene Circuits
We apply the methodology described above to assess parameter determinability of gene circuit models for the gap gene network in early Drosophila development. Here, we provide a brief outline of gap gene circuit models. More detailed information can be found in [7, 8, 25].
Segment determination occurs during the blastoderm stage of Drosophila development, between 1.5 and 3 hours after egg laying [26]. During this stage, the embryo consists of a syncytium; there are no cell membranes between nuclei. These nuclei constitute the basic objects of the model. They are arranged in a row along the A-P axis. Nuclei divide rapidly and synchronously [27]. Periods between mitotic divisions are called cleavage cycles, where cycle n occurs between mitoses n - 1 and n. The models considered here run from early cycle 13 (t = 0.0 min) to the onset of gastrulation at the end of cycle 14A (t = 71.1 min). Mitosis occurs at the end of cycle 13, between t = 16.0 min and t = 21.1 min [27].
Gene circuit models describe the change in concentration of each gap gene product in each nucleus over time by the following system of ODEs:

dv_i^a/dt = R_a g(u_i^a) + D_a [ (v_{i−1}^a − v_i^a) + (v_{i+1}^a − v_i^a) ] − λ_a v_i^a,    (17)

with total regulatory input

u_i^a = Σ_{b=1}^{N_g} W^{ab} v_i^b + m_a v_i^{Bcd} + h_a.

Here a and b denote regulated genes and regulators, respectively; they are integer indices representing cad, hb, Kr, kni, gt, as well as the terminal gap gene tailless (tll). v_i^a denotes the concentration of the product of gene a in nucleus i. The Bcd gradient remains constant over time and is not regulated by the other genes in the model; v_i^{Bcd} denotes the concentration of Bcd protein in nucleus i. N_g = 6 is the number of genes in the model (excluding Bcd), and the function

g(u) = (1/2) ( u / √(u² + 1) + 1 )    (18)

is a sigmoid regulation-expression function.
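A compact sketch of the right-hand side of (17)-(18), vectorized over all genes and nuclei; the no-flux handling at the boundary nuclei matches the description in the next paragraph, while the names and packaging are our own.

import numpy as np

def g(u):
    """Sigmoid regulation-expression function (18)."""
    return 0.5 * (u / np.sqrt(u * u + 1.0) + 1.0)

def circuit_rhs(v, v_bcd, R, D, lam, W, m_a, h):
    """dv/dt from (17); v has shape (N_g, N_nuclei), v_bcd shape (N_nuclei,),
    and R, D, lam, m_a, h have shape (N_g,)."""
    u = W @ v + np.outer(m_a, v_bcd) + h[:, None]  # total regulatory input
    diff = np.zeros_like(v)                        # discrete diffusion term
    diff[:, 1:-1] = v[:, :-2] - 2.0 * v[:, 1:-1] + v[:, 2:]
    diff[:, 0] = v[:, 1] - v[:, 0]                 # no-flux at i = 0
    diff[:, -1] = v[:, -2] - v[:, -1]              # no-flux at i = i_max
    return R[:, None] * g(u) + D[:, None] * diff - lam[:, None] * v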
During mitosis, protein production is shut down. Nuclei divide instantaneously at the end of mitosis, and the distance between them is halved. Gap gene circuits cover the region from 35% to 92% A-P position, which includes 30 nuclei (cycle 13) and 58 nuclei (cycle 14A). Therefore, system (17) consists of 180 and 348 ODEs during cycles 13 and 14A, respectively. Initial conditions are prescribed by the maternal gradients of Bcd, Cad and Hb, and zero levels for all other gene products. We use no-flux boundary conditions at i = 0 and i = i_max.
In system (17) there are m = 66 unknown parameters. These include the genetic interconnection or regulatory weight matrix W of size N_g × N_g, whose elements W^{ab} represent the regulation of gene a by gene b, and the maternal coefficients m_a, which represent the regulatory effect of Bcd on gene a. Regulatory parameters represent repression (if negative), activation (if positive) or no interaction (if approximately zero). The other parameters are the promoter thresholds h_a, promoter strengths R_a, diffusion coefficients D_a, and decay rates λ_a. Estimates for these parameters have been obtained in previous studies by fits to quantitative expression data [11] using global search methods, such as parallelized Lam Simulated Annealing [7, 8] or the Stochastic Ranking Evolution Strategy (followed by downhill simplex direct search) [12], and using a first-improvement local search method with randomized order of examination [13]. In the latter, the initial parameter estimates are obtained by a splitting strategy: the parameters λ_a and D_a are estimated by assuming that protein production is constant within certain spatio-temporal domains, which reduces (17) to a system of linear equations uncoupled for each gene (the boundaries of the production domains are obtained from the data); the parameters in the nonlinear part of the model are then estimated by fitting the production term in (17), with the data given as input, as closely as possible to the quadrilateral production regions.
The data set used for model fitting consists of N = 2702 measurements of protein concentrations at nuclear resolution (obtained with multi-channel immunofluorescent antibody assays; available online [11]). Measurements were taken at one time point (T0) during cycle 13 and at eight time points T_j (1 ≤ j ≤ 8) during cycle 14A (Figure 2a). Concentrations of all gene products represented in the model are available at all time points, except for Cad at T7 and T8, and Tll before T3. The level of measurement error in the data is below 5% [28]. Each data point represents concentration values averaged across 9–62 embryos; therefore, by the Central Limit Theorem (CLT), we assume that the experimental errors are approximately normally distributed.
The quality of the parameter estimates is measured by the root mean square (RMS) of the discrepancy vector,

RMS = √( (1/N) Σ_{a=1}^{N_g} Σ_{j=0}^{N_t} κ_j^a Σ_{i=1}^{N_c} ( v_i^a(T_j) − ṽ_i^a(T_j) )² ),    (19)

where N_t = 8 is the number of time classes in cycle 14A, N_c is the number of nuclei, ṽ_i^a(T_j) denotes the measured concentration, and the weight κ_j^a is equal to zero for Tll at j = 0, 1, 2 and for Cad at j = 7, 8, and is equal to one otherwise. A solution is considered 'good' if RMS < 12.0 and there are no visible pattern defects in the model response [7, 8, 12, 13]. It is important to note that the RMS only shows the quality of the fit of the model to the data; it does not give any information about the quality of the parameter estimates. Our aim is to find parameter estimates that give a good fit and to apply statistical analysis in order to investigate how reliable these estimates are.
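A sketch of (19) as reconstructed above; the mask array encodes the missing Tll and Cad time classes, and normalizing by the number of available data points is our assumption.

import numpy as np

def rms(model, data, mask):
    """RMS of the discrepancy vector, cf. (19). All arrays have shape
    (N_g, N_c, number of time classes); mask is 0 where no data exist
    (Tll at j = 0, 1, 2; Cad at j = 7, 8) and 1 elsewhere."""
    resid = mask * (model - data)
    return float(np.sqrt(np.sum(resid**2) / np.sum(mask)))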
The search space for parameters is defined by the linear constraints
(20)
and by the nonlinear constraints
(21)
where v_max^b and v_max^{Bcd} are the maximum values in the data set for proteins b and Bcd, respectively. Note that in [7, 8, 13] the threshold parameters h_a for the genes Kr, kni, gt, and hb are fixed to negative values, representing a constitutively repressed state for the corresponding genes [7, 8]. Fixing some parameters to specific values may severely restrict the search space, leaving some solutions out of consideration. Contrary to their approach, we include the threshold parameters for these genes in the search by imposing the constraints −10.0 ≤ h_a ≤ 0.0.
In order to make the analysis of the parameter estimates easier, we scale all parameters used in (17), for all genes a and b. The choice of the scaling factors for R_a, D_a, and λ_a is based on the search ranges of the corresponding parameters; the choice of the scaling factors for the regulatory weights W^{ab} and the maternal coefficients m_a is based on the fact that the maximum level of protein concentration for all genes in the data set is of order O(10²). Thus, all scaled parameters are of order O(1).