Robust and efficient parameter estimation in dynamic models of biological systems

Background Dynamic modelling provides a systematic framework to understand function in biological systems. Parameter estimation in nonlinear dynamic models remains a very challenging inverse problem due to its nonconvexity and ill-conditioning. Associated issues like overfitting and local solutions are usually not properly addressed in the systems biology literature despite their importance. Here we present a method for robust and efficient parameter estimation which uses two main strategies to surmount the aforementioned difficulties: (i) efficient global optimization to deal with nonconvexity, and (ii) proper regularization methods to handle ill-conditioning. In the case of regularization, we present a detailed critical comparison of methods and guidelines for properly tuning them. Further, we show how regularized estimations ensure the best trade-offs between bias and variance, reducing overfitting, and allowing the incorporation of prior knowledge in a systematic way. Results We illustrate the performance of the presented method with seven case studies of different nature and increasing complexity, considering several scenarios of data availability, measurement noise and prior knowledge. We show how our method ensures improved estimations with faster and more stable convergence. We also show how the calibrated models are more generalizable. Finally, we give a set of simple guidelines to apply this strategy to a wide variety of calibration problems. Conclusions Here we provide a parameter estimation strategy which combines efficient global optimization with a regularization scheme. This method is able to calibrate dynamic models in an efficient and robust way, effectively fighting overfitting and allowing the incorporation of prior information. Electronic supplementary material The online version of this article (doi:10.1186/s12918-015-0219-2) contains supplementary material, which is available to authorized users.


S.1 High quality Jacobian computation
In the local phase of the hybrid optimization method, the NL2SOL algorithm is used, which requires the Jacobian of the residuals vector (R). As an alternative to the simple forward finite difference method: where δθ i is a perturbation in the i-th parameter value, the parametric sensitivity based calculation produces a Jacobian of higher quality. The sensitivity of the model output vector (y) to the parameters has to be computed: Note that the Jacobian J(θ) is a matrix of size N D × N θ , where N D is the number of data points and N θ is the number of parameters, such that [J(θ)] ij is the weighted sensitivity of the model prediction for the i-th data point with respect to the j-th model parameter.
The parametric sensitivities of the model outputs can be computed from the sensitivities of the state-variables as where g(·) is the observation function (see Eq. (2) in the main text) and x jθi = dxj dθi denotes the sensitivity of the j-th state variable with respect to the i-th parameter.
The sensitivities of the state-variables with respect to the model parameters can be obtained by solving the so-called forward sensitivity equations (FSEs), which is a well-known method in the perturbation theory of differential equations. The FSEs read as Note that this means that N θ × N x equations have to be solved, so it is a computationally expensive calculation, which increases with the number of state variables and parameters. When the Jacobian is required, these equations are usually solved together with the dynamic model equations, because they share the system's Jacobian ( ∂f (u,x,θ) ∂x ) and thus the two systems of equations have the same "stiffness". Note that modern solvers like CVODES [1] implement this type of computation.
The system's Jacobian and the inhomogeneous part ( ∂f (u,x,θ) ∂θi ) of (S.1.4) can be derived analytically (symbolically) from the dynamic equations prior to the model calibration. This procedure increases the speed and robustness of the initial value problem (IVP) and sensitivity computations. It should also be noted that the Jacobian computation by forward sensitivity equations requires a similar computational effort than the finite difference (FD) method, but in the case of the sensitivity based computation the error in the Jacobian is controlled. However, in the case of FD method, the error is unknown.
We compared the FSE and the FD methods based on the case studies presented in the main text, confirming that the accuracy of the Jacobian has a significant effect on the convergence of NL2SOL. A high quality Jacobian computation resulted in faster convergence with better chances of obtaining the global optima. However, we also observed that NL2SOL implements an intelligent adaptive scheme to tune the perturbation parameter (δθ i ) for the FD method (S.1.1); for small problems and with careful settings, it can be almost as efficient as solving the forward sensitivity equations.

S.3 Prediction error measure
To measure the prediction error, we used the following normalized root mean square error formula Here N D is the total number of data points, N e is the number of experiments, N y,k is the number of observables in the k-th experiment, N t,k,j is the number of time points in the k-th experiments for the j-th observable, y ijk is the where non-weighted Tikhonov regularization is recommended, and (iii) worst case scenario (no prior knowledge and therefore random guess of parameters) where a two-step regularization procedure is proposed. In the first step ridge regularization is applied which results the parameter vector with minimum norm, that fits the data reasonably well. In the second step this parameter vector is used as the reference parameter vector for Tikhonov regularization. In each scenario the regularized optimization is solved for a set of regularization parameter and the generalized cross validation method (GCV) is applied to choose the optimal candidate. model prediction for the dataỹ ijk . This formula computes the root of the sum of squared error between model prediction and data for each observable, and normalizes it by the squared range of the data corresponding to that observable. In this way, the observables are properly scaled.

S.4 Initial guess calculation for global optimization
The search for the global optima of the objective function is restricted to a N θ dimensional box. This box is specified by the lower and upper bounds of the parameters, i.e. θ min ≤ θ ≤ θ max . In the main text, we mention 4 strategies to generate points (initial guesses) from this box: 1. Multivariate uniform distribution, which selects points from the box with uniform probability. 3. Latin hypercube sampling (LHS) [2] 4. Logarithmic Latin hypercube sampling, which transforms the bounds as in 2. and then applies the LHS method.
The logarithmic scaling of the bounds is especially useful if the lower and upper bounds are different by more than an order of magnitude and we would like to collect samples from all the range of magnitudes. With the logarithmic scaling, each order of magnitude in the range of the parameters will have equal chance to contain the selected points.

S.5 Robust computation of the regularization candidates
Due to the non-convexity of the cost function, none of the stochastic global search algorithms can guarantee that the global minimum of regularized optimization problem is found. The global optimization problem is solved multiple times with regularization to generate the candidates. We can detect inconsistencies among these solutions to find cases that did not converge to the global optima. We can use a simple and well-known observation from bi-criteria optimization to filter out incorrect solutions due to convergence to local optima. If the bi-criteria optimization (here the criteria are formulated by the least squares term Q LS and the penalty Γ) is solved by weighting (here the weighting parameter is the regularization parameter), then the solutions are located on a convex curve (the points of which are then (Q LS (θ αi ), Γ(θ αi )) for i = 1 . . . I). This convex curve is referred as the L-curve in regularization theory (and in the main text), but it is also known as the Pareto front in multi-objective optimization.
For example, in case of two estimated parameter vectorsθ α1 andθ α2 corresponding to two regularization parameter values α 1 > α 2 is expected to have the following relations: Q LS (θ α1 ) > Q LS (θ α2 ) -i.e. larger regularization leads to worse model fit-, and Γ T (θ α1 ) < Γ T (θ α2 ), i.e. larger regularization gives smaller penalty function value. If the relation is not fulfilled, then one of the solution dominates the other and the dominated solution is not a global solution of the corresponding optimization problem, but is the artefact of local convergence.
A straightforward strategy would be to detect and remove the dominated points from the Pareto front obtained by independent estimations. But this would discard valuable computational results. Instead, we applied the following iterative search strategy. We assume that a set of regularization parameters α 1 > α 2 > . . . α I are already selected for which the optimization problem (Equation (7) in the main text) has to be solved, for example using the procedure described in the main text.
The penalty term is a quadratic, therefore convex function of the parameters, which takes its unique (global) minimum at the reference parameter vector θ ref .
On the other hand, the first part of the objective function Q LS (θ) can be highly multi-modal with many local minima. Therefore, it is recommended to start the search with the largest regularization parameter α 1 .
A procedure to compute a smooth L-curve.
Step 1. Global, sequential forward search. Solve the optimization problem ((7) in the main text) one-by-one for α 1 , α 2 , . . . α I using the proposed global optimization meta-heuristic (eSS2) such that, in the i-th run the previously obtained parameters {θ α1 ,θ α2 , . . .θ αi−2 ,θ αi−1 } are included in the initial guess set for the global search. This can save a large amount of time in the global search of the next optimal point, preventing dominated solutions.
Remark 1. Note that, as we solve a sequence of non-linear optimization problems with a global search method, it is possible, that at some point a much better optimum is reached; for exampleθ α k , which reveals that, the previously obtained pointsθ αi for i = 1, 2, . . . k − 1 are dominated, i.e. they have not converged to the global optima. But the reverse cannot happen, i.e. as mentioned aboveθ αi cannot be dominated by any ofθ αi−1 . . .θ α1 since they are used as initial guesses in the search.
Step 2. Backward search. In the second step we refine the optima one-by-one, but in the reverse order α I−1 , . . . α 1 . Here, only two local searches are performed for each regularization parameter. The initial guess for these searches are the parameters obtained in Step 1: for the optimization with α i the initial guesses areθ i+1 andθ i .
Remark 2. Note that, the reverse order optimization with the selected initial guesses eliminates the dominated points of the Pareto front that may appeared in Step 1.

S.6 Bias-variance computation
In Section 3.5 in the main text we used multiple sets of calibration data (replicates) to investigate the effect of the random noise in the data to the model prediction and parameter estimation. Calibrating the model on these multiple data sets results in a set (population) of estimated parameter vectors (each vector corresponds to a dataset). Using these estimated parameter vectors, we obtain a population of predictions. Based on these quantities, we can then compute the parameter estimation and prediction bias and variance using: whereθ α is the estimated parameters with regularization parameter α,ŷ α is the model prediction based on this parameter estimate, i.e.ŷ α = y(θ α ) and y t , θ t are the true (nominal) prediction and parameter vectors, respectively. This true values are known only for synthetic problems and used only for the bias-variance analysis.  [3]. Optimization runs terminate when at least one stopping criteria is reached. The most frequently activated stopping criteria for the global optimization was either the allowed computation time or the allowed number of objective function evaluation. The allowed computation time and number of function evaluations are reported for each case study in Additional File 2. Table S.7.1: Default optimization settings for the case studies. eSS settings: ndiverse is the number of diverse solutions generated, n1 is the number of global iteration before the local algorithm is called for the first time, n2 is the number of global iteration between consecutive calls of the local algorithm, local balance influences the selection of starting point among the members of the population for initiating the local optimization, log var generates the initial and new members of the population in the logarithmic scaled bounds of the parameters. NL2SOL settings: maxfuneval is the maximum number of function evaluation before the search terminates, maxiter is the maximum iteration number before termination, tolrfun is the relative tolerance (the algorithm terminates if the approximated global optima is within this tolerance value), tolobjr is the computational accuracy of the objective function and the Jacobian (which is tuned to the tolerance level of the ODE solver tolerance level).