An iterative identification procedure for dynamic modeling of biochemical networks

Balsa-Canto, Eva; Alonso, Antonio A; Banga, Julio R

doi:10.1186/1752-0509-4-11

Research article
Open access
Published: 17 February 2010

BMC Systems Biology volume 4, Article number: 11 (2010)

Abstract

Background

Mathematical models provide abstract representations of the information gained from experimental observations on the structure and function of a particular biological system. Conferring a predictive character on a given mathematical formulation often relies on determining a number of non-measurable parameters that largely condition the model's response. These parameters can be identified by fitting the model to experimental data. However, this fit can only be accomplished when identifiability can be guaranteed.

Results

We propose a novel iterative identification procedure for detecting and dealing with the lack of identifiability. The procedure involves the following steps: 1) performing a structural identifiability analysis to detect identifiable parameters; 2) globally ranking the parameters to assist in the selection of the most relevant parameters; 3) calibrating the model using global optimization methods; 4) conducting a practical identifiability analysis consisting of two phases (a priori and a posteriori) aimed at evaluating the quality of given experimental designs and of the parameter estimates, respectively; and 5) designing optimal experiments so as to compute the scheme of experiments that maximizes the quality and quantity of information for fitting the model.

Conclusions

The presented procedure was used to iteratively identify a mathematical model that describes the NF-κB regulatory module involving several unknown parameters. We demonstrated the lack of identifiability of the model under typical experimental conditions and computed optimal dynamic experiments that largely improved identifiability properties.

Background

Biological systems are mainly composed of genes that encode the molecular machines that execute the functions of life and networks of regulatory interactions specifying how genes are expressed, with both operating on multiple, hierarchical levels of organization [1]. Systems biology aims at understanding how such systems are organized by combining experimental data with mathematical modeling and computer-aided analysis techniques [1, 2].

The modeling and simulation of biochemical networks (e.g. metabolic or signaling pathways) has recently received a great deal of attention [3–5]. The modeling framework selected depends both on the properties of the studied system and the modeling goals. Lauffenburger et al. [4, 6] organized the models in terms of three main groups, depending on their level of detail: deterministic, probabilistic and statistical.

Currently, the most typical approach to representing biochemical networks is through a set of coupled deterministic ordinary differential equations intended to describe the network and the production and consumption rates for the individual species involved in the network [7]. The conceptual framework selected for the construction of rate equations enables models to be further classified as generalized mass-action-based models and power-law models [8].

Unfortunately, with model details come parameters, and most parameters are generally unknown, thereby hampering the possibility for obtaining quantitative predictions. Modern experimental techniques, such as time-resolved fluorescence spectroscopy or mass-spectrometry-based techniques, can be used to obtain time-series data for the biological system under consideration. The goal of model identification is then to estimate the non-measurable parameters so as to reproduce, insofar as is possible, the experimental data. Although apparently simple, non-linear model identification is usually a very challenging task, due to the usual lack of identifiability, either practical or, in the worst case, structural. In fact, several authors have reported difficulties in assessing unique and meaningful values for the parameters from given sets of experimental data since broad ranges of parameter values result in similar model predictions (see for example, [9–12]).

This problem has motivated the development of iterative procedures for model identification, such as those proposed by Feng and Rabitz [13], who, using a closed-loop strategy, attempted to estimate how to stimulate and how to observe a system for identification purposes. Kremling et al. [14] and Gadkar et al. [15] suggested alternative identification procedures that involve some type of experimental design, to either calculate stimuli profiles or to select species whose concentration measurements would maximally benefit model calibration and/or model discrimination.

It is important to note, however, that, in most cases, only a limited number of components in the network can be measured, usually far fewer components than incorporated in the model; only specific stimuli are available, and the system may only be stimulated in very specific ways (for example, via sustained or pulse-wise stimulation); the number of sampling times is usually rather limited, and finally, the experimental data are subject to substantial experimental noise. These constraints, together with the dynamic and typically non-linear character of the models under consideration result in identifiability problems, i.e. in the impossibility of providing a unique solution for the parameters.

Our research describes a novel general iterative identification procedure, extending the one originally outlined in Balsa-Canto et al. [16], that addresses model identification under these typical constraints and which aims to reduce the effects of the lack of identifiability.

With this aim in mind, the iterative identification procedure presented here involves the following steps:

Analysis of structural identifiability. This step, which is often disregarded, evaluates whether the parameters may be assigned unique values for a given pair of model and observables under ideal experimental conditions, and assesses, when this is possible, the reformulation of a given model or the implementation of an iterative procedure for model calibration.
Global ranking of parameters. This step helps decide which parameters are the most relevant to model output. In the case of lack of structural identifiability, global ranking may be used to decide whether to reformulate the model or which parameters to estimate.
Model calibration using global optimization methods. The model calibration problem can be formulated as a non-linear optimization problem. Unfortunately, since several sub-optimal solutions are usually possible, global optimization methods are needed to increase the likelihood that the best possible solution is located.
Practical identifiability analysis. Complementary to the structural identifiability test, the practical identifiability analysis enables an evaluation of the possibility of assigning unique values to the parameters from a given set of experimental data or experimental scheme, subject to experimental noise. In this paper we distinguish between two types of practical identifiability analyses: firstly, the expected quality of a given experimental scheme is analyzed a priori using what we call the expected uncertainty of the parameters; and secondly, the quality of the parameter estimates for a given set of experimental data using robust confidence intervals is analyzed a posteriori.
Optimal experimental design via dynamic optimization. The purpose of this step is to design dynamic experiments with the aim of maximizing data quality and quantity (as measured by the Fisher information matrix) for the purpose of model calibration.

To illustrate the difficulties that may be faced when identifying a nonlinear dynamic biological model, and the performance of the proposed identification procedure, we consider the mathematical model of the NF-κB regulatory module proposed by Lipniacki et al. [9].

Methods

Model building

A mathematical model has three important functions: first, it helps to better understand the biological phenomenon studied; secondly, it enables experiments to be specifically designed to make predictions of certain characteristics of the biological system that can then be experimentally verified; and finally, it summarizes the current body of knowledge in a format that can be easily communicated. Devising such a model involves a number of steps (Figure 1), commencing with a definition of its purpose and finishing with a preliminary working model.

The purpose of the model will condition the selection of the modeling framework and the information that should be included in the model. Only elements that might have an impact on the questions to be addressed by the model should be included. In this regard, account should be taken of the fact that reaction models can only include a small subset of all reactions taking place within a cell. Thus, assumptions must be made about the extent to which the species included in the model evolve independently of the species excluded from the model, and also about the species that are crucial for the purpose of the model. At this stage it is possible to define the network architecture and decide which type of modeling framework may be the most appropriate (deterministic generalized mass action based models, power-law models, stochastic models, partial differential equations, etc.)

In the next step, an initial mathematical model structure (or battery of model structures) is proposed. New experimental information must then be used to verify hypotheses, and to discriminate, if possible, among different model alternatives. The candidates will often depend on a number of unknown non-measurable parameters that can be computed by means of experimental data fitting (identification).

This crucial step provides the mathematical structure with the capacity to reproduce a given data set, make predictions and discriminate among different model candidates.

The last step is validation, which essentially means reconciling model predictions with any new data observed. This process is likely to reveal defects, in which case a new model structure and/or new (optimal) experiment is planned and implemented. This process is repeated iteratively until validation is considered to be complete and satisfactory.

Note that the success of this model-building loop relies on being able to perform experiments under a sufficient number of conditions to extract a rich ensemble of dynamic responses, to accurately measure such responses and to iterate in order to improve the predictive capabilities of the model without a significant cost.

Since model identification is a task that consumes large amounts of experimental data, an iterative identification procedure is proposed which is intended to accurately compute model unknowns while reducing experimental cost.

Optimal identification procedure

The proposed iterative identification procedure is depicted in Figure 2.

If there are several model candidates, two extra steps should be included in the loop: one to analyze structural distinguishability among candidates and the other to design experiments for model discrimination [17].

Mathematical model formulation

We will assume a biological system described by the vector of state variables x(t) ∈ X ⊂ ℝ^{n_x}, which is the unique solution of the set of nonlinear ordinary differential equations:

$$\dot{x}(t) = f\big(x(t), u(t), \theta, t\big), \qquad x(t_0) = x_0 \tag{1}$$

where u(t) ∈ U ⊂ ℝ^{n_u} corresponds to the external factors and θ ∈ Θ ⊂ ℝ^{n_θ} is the vector of model parameters, with Θ the feasible parameter space.

Moreover, given an experimental scheme with n_e experiments, n_o^e observables per experiment e and n_s^{e,o} sampling times per experiment e and observable o, y^{e,o} ∈ Y ⊂ ℝ^{n_s^{e,o}} will denote the vector of discrete-time measurements, as follows:

$$y^{e,o}(t_s^{e,o}) = g^{e,o}\big(x(t_s^{e,o}), \theta\big), \qquad s = 1, \dots, n_s^{e,o} \tag{2}$$

where t_s^{e,o} denotes the s^th sampling time for observable o in experiment e. Thus each experimental (measured) datum will be denoted as ỹ_s^{e,o} and, similarly, the corresponding model prediction will be denoted as y_s^{e,o}.
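As a minimal sketch of how such a state-space formulation is used in practice, the following Python code integrates a hypothetical one-state model with a fixed-step Runge-Kutta scheme and evaluates one observable at the sampling times. The toy dynamics `f_toy`, the observable `g_toy`, the stimulus `u_step` and all numerical values are illustrative assumptions; they are not part of the model studied in this work.

```python
import math

def rk4_step(f, x, t, h, u, theta):
    """One classical Runge-Kutta step for dx/dt = f(x, u(t), theta, t)."""
    k1 = f(x, u(t), theta, t)
    k2 = f([xi + 0.5 * h * k for xi, k in zip(x, k1)], u(t + 0.5 * h), theta, t + 0.5 * h)
    k3 = f([xi + 0.5 * h * k for xi, k in zip(x, k2)], u(t + 0.5 * h), theta, t + 0.5 * h)
    k4 = f([xi + h * k for xi, k in zip(x, k3)], u(t + h), theta, t + h)
    return [xi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def simulate(f, g, x0, u, theta, sampling_times, t0=0.0, substeps=50):
    """Integrate the states and return the observable at each sampling time."""
    x, t, y = list(x0), t0, []
    for ts in sampling_times:
        h = (ts - t) / substeps
        for _ in range(substeps):
            x = rk4_step(f, x, t, h, u, theta)
            t += h
        t = ts  # remove accumulated round-off in the time variable
        y.append(g(x, theta))
    return y

# Hypothetical toy model: production driven by the stimulus, first-order decay.
def f_toy(x, u, theta, t):
    return [theta[0] * u - theta[1] * x[0]]

def g_toy(x, theta):
    return x[0]  # the single state is measured directly

def u_step(t):
    return 1.0  # sustained stimulation

theta = [2.0, 0.5]
times = [1.0, 2.0, 4.0]
y = simulate(f_toy, g_toy, [0.0], u_step, theta, times)
```

For this toy case the predictions can be checked against the closed-form solution x(t) = (θ_1/θ_2)(1 − e^{−θ_2 t}).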

Structural identifiability analysis

Once the structure of the state-space representation, Eqns. (1)-(3), has been established, the structural identifiability problem is concerned with the possibility of calculating a unique solution for the parameters while assuming perfect data (noise-free and continuous in time and space). Structural identifiability is thus related to the model structure and possibly to the type of stimulation and independent of the parameter values.

There are at least two obvious reasons to assess structural identifiability: first, the model parameters have a biological meaning, and we are interested in knowing whether it is at all possible to determine their values from experimental data; second, a numerical optimization approach may run into problems when trying to calibrate a structurally unidentifiable model.

There are a few methods for testing the structural identifiability of nonlinear models [18, 19]: the similarity transformation approach [20], differential algebra methods [21, 22] and power series approaches [23, 24]. Unfortunately, no method is amenable to every model, so at some point one of the possibilities has to be selected. All of them present limitations related to the non-linearity and the size of the system under consideration, where size means the number of state variables, the number of parameters and the number of observables. Probably the easiest to apply, provided symbolic manipulation software is used, are the power series expansion methods. In this regard two possibilities have been developed: the Taylor series and the generating series.

Details of the Taylor series approach can be found in [23]. The approach is based on the fact that observations are unique analytic functions of time and so all their derivatives with respect to time should also be unique. It is thus possible to represent the observables by the corresponding Maclaurin series expansion and it is the uniqueness of this representation that will guarantee the structural identifiability of the system. The idea is to establish a system of non-linear algebraic equations on the parameters, based on the calculation of the Taylor series coefficients, and to check whether the system has a unique solution. The generating series approach [24] extends the analysis to the entire class of bounded and measurable stimuli. In this case the series is generated with respect to the stimuli domain. The method requires the model to be linear in the stimuli as follows:

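$$\dot{x}(t) = f_0(x, \theta, t) + \sum_{i=1}^{n_u} u_i(t)\, f_i(x, \theta, t), \qquad x(t_0) = x_0 \tag{3}$$

$$y(t) = g(x, \theta, t) \tag{4}$$

The observables can be expanded in series with respect to time and stimuli in such a way that the coefficients of this series are g(x, θ, t = 0) and the Lie derivatives:

$$L_{f_{j_0}} \cdots L_{f_{j_k}}\, g(x, \theta, t)\big|_{t=0}, \qquad j_0, \dots, j_k \in \{0, 1, \dots, n_u\} \tag{5}$$

where L_f g is the Lie derivative of g along the vector field f, given by:

$$L_f g(x, \theta, t) = \sum_{j=1}^{n_x} \frac{\partial g(x, \theta, t)}{\partial x_j}\, f_j(x, \theta, t) \tag{6}$$

with f_j the jth component of f.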
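As a worked illustration of how the series coefficients constrain the parameters, consider a hypothetical one-state model without stimuli, in which the decay rate, the initial condition and the observation gain are all unknown. This toy model and its parameter names are assumptions made purely for illustration. Without stimuli, only the vector field f_0 contributes, and the coefficients reduce to successive Lie derivatives along f_0 evaluated at t = 0:

```latex
\begin{aligned}
&\dot{x} = f_0(x,\theta) = -\theta_1 x, \qquad x(0) = \theta_2, \qquad y = g(x,\theta) = \theta_3 x \\[4pt]
&c_0 = g\big|_{t=0} = \theta_2\theta_3 \\
&c_1 = L_{f_0} g\big|_{t=0} = \frac{\partial g}{\partial x} f_0 \Big|_{t=0} = -\theta_1\theta_2\theta_3 \\
&c_2 = L_{f_0} L_{f_0} g\big|_{t=0} = \frac{\partial(-\theta_1\theta_3 x)}{\partial x}\,(-\theta_1 x)\Big|_{t=0} = \theta_1^2\theta_2\theta_3 \\[4pt]
&\Rightarrow\quad \theta_1 = -\frac{c_1}{c_0}, \qquad \theta_2\theta_3 = c_0 .
\end{aligned}
```

Every higher-order coefficient has the form (−θ_1)^k θ_2 θ_3, so θ_1 can be recovered uniquely, whereas θ_2 and θ_3 enter only through their product and cannot be distinguished from this observable alone.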

If s(θ) denotes the vector of all the coefficients of the series, a sufficient condition for the model to be identifiable is that there exists a unique solution for θ from s(θ), similarly to the Taylor series method. Note also that power series approaches assume that all the information on the progress of the observables is contained in the germ, i.e. the infinite set of power series coefficients evaluated at t = 0⁺. If the derivatives are zero, then the germ is said not to be informative and no conclusions about identifiability can be directly drawn. Later observations (t > 0) could give more information and restrict the set of feasible values of θ. Probably the major drawback of the power series approaches is that the necessary number of power series coefficients is usually unknown. For the Taylor series approach an upper limit has been shown for bilinear and polynomial systems [25, 26]. Additionally, Margaria et al. (2001) [27] showed that for the combination of the Taylor series and the differential algebra approaches, n_x + 1 derivatives are sufficient for the case of rational systems with one observable. However, no bounds are available for a general non-linear system. In addition, solving the non-linear system of equations resulting from the power series approaches is usually not a trivial task, particularly when the number of parameters is large and the number of observables is small. We therefore propose using the following identifiability tableaus to easily visualize the possible structural identifiability problems.

The tableau represents the non-zero elements of the Jacobian of the series coefficients with respect to the parameters. It consists of a table with as many columns as parameters and with as many rows as non-zero series coefficients, in principle, infinite, as shown in Figure 3.

If the Jacobian is rank deficient, i.e. the tableau presents empty columns, the corresponding parameters may be unidentifiable. Note that since the number of series coefficients may be infinite, unidentifiability cannot be fully guaranteed unless higher order series coefficients are demonstrated to be zero.

If the rank of the Jacobian coincides with the number of parameters, then it will be possible to, at least, locally identify the parameters. In this situation a careful inspection of the tableau will help to decide on an iterative procedure for solving the system of equations, as follows:

The number of non-zero coefficients is usually much larger than the number of parameters. In practice this means that we should select the first n_θ rows that guarantee the Jacobian rank condition. The tableau helps to easily detect the necessary coefficients and to generate a "minimum" tableau.
A unique non-zero element in a given row of the minimum tableau means that the corresponding parameter is structurally identifiable. If any, the parameters in this situation can be computed as functions of the power series coefficients and can be then eliminated from the "minimum" tableau to generate a "reduced" tableau. Subsequent reductions may lead to the appearance of new unique non-zero elements and so on. Thus all possible "reduced" tableaus should be built first.
Once no more reductions are possible, one should try to solve the remaining equations. Since it is often the case that not all remaining power series coefficients depend on all parameters, the tableau will help to decide on how to select the equations to solve for particular parameters.
If several meaningful solutions exist for a given set of parameters, then the model is said to be locally identifiable.
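The reduction steps above can be sketched in a few lines of Python. The sketch assumes that the tableau is already available as a Boolean matrix (rows are series coefficients, columns are parameters, and an entry is true when the corresponding Jacobian element is non-zero) and that it is already a "minimum" tableau; computing the Jacobian itself and checking its rank would require symbolic software and is not shown. The function names, parameter names and the example tableau are hypothetical.

```python
def empty_columns(tableau, names):
    """Parameters that appear in no series coefficient (possibly unidentifiable)."""
    return [names[j] for j in range(len(names))
            if not any(row[j] for row in tableau)]

def reduce_tableau(tableau, names):
    """Repeatedly pick rows with a single remaining non-zero entry.

    Each such row determines one parameter, which is then removed from the
    tableau; the process stops when no further reduction is possible.
    """
    remaining = set(range(len(names)))
    solved = []
    progress = True
    while progress and remaining:
        progress = False
        for row in tableau:
            active = [j for j in remaining if row[j]]
            if len(active) == 1:
                solved.append(names[active[0]])
                remaining.discard(active[0])
                progress = True
    return solved, [names[j] for j in sorted(remaining)]

# Hypothetical tableau with five parameters and four series coefficients.
names = ["k1", "k2", "k3", "k4", "k5"]
tableau = [
    [1, 0, 0, 0, 0],  # coefficient depends on k1 only
    [1, 1, 0, 0, 0],  # determines k2 once k1 is known
    [0, 1, 1, 1, 0],
    [1, 0, 1, 1, 0],
]

solved, unresolved = reduce_tableau(tableau, names)
```

In this example k1 and k2 are obtained by successive reductions, k3 and k4 remain coupled in the last two equations and must be solved for jointly, and k5 corresponds to an empty column.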

If the model turns out not to be completely identifiable, identifiability may be recovered by extending the set of observables; however, this may not be feasible in practice. Alternatively, one may consider fixing some parameters [21] or reformulating the model.

Global ranking of parameters

Observables will depend differently on different parameters and this may be used to rank the parameters in order of their relative influence on model predictions. Such influence may be quantified by the use of parametric sensitivities.

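Local parametric sensitivities for a given experiment e, observable o and at a sampling time t_s^{e,o} are defined as follows:

$$s_{\theta}^{e,o}(t_s^{e,o}) = \frac{\partial y^{e,o}}{\partial \theta}(t_s^{e,o}) \tag{7}$$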

They may be numerically computed by using the direct decoupled method within a backward differentiation formulae (BDF) based approach, as implemented in e.g. ODESSA [28].

The corresponding relative sensitivities can be used to assess the individual local parameter influence or importance, that is, to establish a ranking of parameters. Brun and Reichert (2001) [29] suggested several importance factors that may be generalized for the case of having several observables and experiments [16].
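As a rough illustration of local and relative sensitivities, the sketch below uses central finite differences on a hypothetical closed-form model output. This is a deliberately simpler technique than the direct decoupled method mentioned above, and the scaling of the relative sensitivity by θ_j / y is one common choice assumed here, not necessarily the one used in this work. The model and all names are illustrative.

```python
import math

def model_output(theta, t):
    """Hypothetical observable: y(t) = (k_in / k_out) * (1 - exp(-k_out * t))."""
    k_in, k_out = theta
    return k_in / k_out * (1.0 - math.exp(-k_out * t))

def local_sensitivity(model, theta, t, j, rel_step=1e-6):
    """Central finite-difference approximation of dy/dtheta_j at time t."""
    h = rel_step * max(abs(theta[j]), 1e-12)
    up, dn = list(theta), list(theta)
    up[j] += h
    dn[j] -= h
    return (model(up, t) - model(dn, t)) / (2.0 * h)

def relative_sensitivity(model, theta, t, j):
    """Sensitivity scaled by theta_j / y (assumed scaling, dimensionless)."""
    return theta[j] / model(theta, t) * local_sensitivity(model, theta, t, j)

theta = [2.0, 0.5]
s_kin = relative_sensitivity(model_output, theta, 2.0, 0)
s_kout = relative_sensitivity(model_output, theta, 2.0, 1)
```

Because the toy output is proportional to k_in, its relative sensitivity with respect to k_in equals one at every time, which gives a simple sanity check for the implementation.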

Of course, the values of the parameters are not known a priori, and even when optimally computed, optimal values are subject to uncertainty depending on the type of experiments and the presence of experimental noise. Consequently, the ranking for a given value of the parameters may be of limited value. Alternatively, one may compute ranking for a sufficiently large number of parameter vectors in the feasible parameter space.

The simplest approach is to apply a Monte Carlo sampling. By sampling repeatedly from the assumed joint-probability density function of the parameters and by evaluating the sensitivities for each sample, the distribution of sensitivity values, along with the mean and other characteristics, can be estimated. This approach yields reasonable results if the number of samples is quite large, requiring a great computational effort.

An alternative that can yield more precise estimates is Latin hypercube sampling (LHS). This method selects n_lhs different values for each of the parameters by dividing the range of each parameter into n_lhs non-overlapping intervals of equal probability. Next, one value is selected at random from each interval, according to the probability density within that interval.

The n_lhs values thus obtained for the first parameter are then paired in a random manner (equally likely combinations) with the n_lhs values for the second and successive parameters. This method allows the overall parameter space to be explored without requiring an excessively large number of samples. The importance factors will then read:

(8)

(9)

(10)

(11)

(12)

where N_D = n_lhs n_e n_o n_s. δ^msqr and δ^mabs quantify how sensitive a model is to a given parameter considering δ^mabs interactions between parameters. δ^max and δ^min indicate the presence of outliers and provide information about the sign. δ^mean provides information about the sign of the averaged effect a change in a parameter has on the model output.

Ordering the parameters according to these criteria, preferably in decreasing order, results in a parameter importance ranking. This information may be useful to decide on reformulating the model or to fix the less relevant parameters to improve either structural or practical identifiability.
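The global ranking can be sketched as follows. The code assumes uniform parameter ranges for the Latin hypercube, finite-difference relative sensitivities on a hypothetical closed-form model, and root-mean-square and mean-absolute importance factors in the spirit of Brun and Reichert [29]; the exact normalisation used in Eqns. (8)-(12) may differ. All names, bounds and sampling times are illustrative.

```python
import math
import random

def latin_hypercube(bounds, n_lhs, rng):
    """One value per equal-probability interval, paired at random across parameters."""
    columns = []
    for lo, hi in bounds:
        width = (hi - lo) / n_lhs
        column = [lo + (i + rng.random()) * width for i in range(n_lhs)]
        rng.shuffle(column)  # random pairing with the other parameters
        columns.append(column)
    return [list(point) for point in zip(*columns)]

def model_output(theta, t):
    k_in, k_out = theta
    return k_in / k_out * (1.0 - math.exp(-k_out * t))

def relative_sensitivity(theta, t, j, rel_step=1e-6):
    h = rel_step * abs(theta[j])
    up, dn = list(theta), list(theta)
    up[j] += h
    dn[j] -= h
    slope = (model_output(up, t) - model_output(dn, t)) / (2.0 * h)
    return theta[j] / model_output(theta, t) * slope

def importance_factors(s):
    """Summary statistics of the relative sensitivities of one parameter."""
    n = len(s)
    return {
        "msqr": math.sqrt(sum(v * v for v in s) / n),
        "mabs": sum(abs(v) for v in s) / n,
        "mean": sum(s) / n,
        "max": max(s),
        "min": min(s),
    }

names = ["k_in", "k_out"]
bounds = [(0.5, 5.0), (0.1, 1.0)]
times = [1.0, 2.0, 4.0, 8.0]
samples = latin_hypercube(bounds, n_lhs=20, rng=random.Random(0))

factors = {}
for j, name in enumerate(names):
    s = [relative_sensitivity(theta, t, j) for theta in samples for t in times]
    factors[name] = importance_factors(s)

# Decreasing order of the mean-square factor gives the importance ranking.
ranking = sorted(names, key=lambda name: factors[name]["msqr"], reverse=True)
```

For this toy model k_in ranks first, and the negative sign of all factors for k_out reflects that increasing the decay rate always lowers the output.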

Note that the summations will, in general, hide the different effects from the different experiments and observables unless they are in the same order of magnitude. Similar analyses may be performed for experiments and observables, thus providing information on the parameters that are more relevant to a particular observable in a particular type of experiment.

Model calibration

Given the measurements, the aim of model calibration or parameter identification is to estimate some or all of the parameters θ in order to minimize the distance between data and model predictions. The maximum-likelihood principle yields an appropriate cost function to quantify such distance, which, for the case of Gaussian noise with known or constant variance, reads as the widely used weighted least-squares function:

$$J(\theta) = \sum_{e=1}^{n_e} \sum_{o=1}^{n_o^e} \sum_{s=1}^{n_s^{e,o}} w_s^{e,o} \left( \tilde{y}_s^{e,o} - y_s^{e,o}(\theta) \right)^2 \tag{13}$$

where ỹ_s^{e,o} is a measured datum, y_s^{e,o}(θ) is the corresponding model prediction and w_s^{e,o} collects the information related to the experimental noise of a given measurement.

Parameter identification is then formulated as a non-linear optimization problem, where the decision variables are the parameters and the objective is to minimize J(θ) subject to the system dynamics in Eqns. (1)-(3) and also, possibly, to some algebraic constraints that define the feasible region Θ.

This problem has recently received a great deal of attention in the literature. Jaqaman and Danuser presented a guide for model calibration in the context of biological systems [30], noting that there are two major issues in minimizing Eqn. (13): first, the presence of local minima and, second, the lack of practical identifiability.

To deal with the first difficulty, several authors have proposed the use of global optimization methods [31–34], since most model calibration problems related to biological models have several sub-optimal solutions. In addition, sequential hybrid global-local methods [35, 36] have recently been suggested to enhance efficiency, particularly for highly multimodal and large-scale systems.
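The idea of a sequential global-local strategy can be illustrated with a deliberately crude sketch: uniform random sampling for the global phase followed by a compass (pattern) search for local refinement. These two simple techniques stand in for the global optimization methods cited above and are not the solvers used in this work. The exponential-decay model, the noise-free synthetic data, the bounds and all names are assumptions made for illustration.

```python
import math
import random

def model(theta, t):
    return theta[0] * math.exp(-theta[1] * t)

def cost(theta, times, data, weights):
    """Weighted least-squares distance between data and model predictions."""
    return sum(w * (d - model(theta, t)) ** 2
               for t, d, w in zip(times, data, weights))

def global_phase(J, bounds, n_samples, rng):
    """Crude global exploration: keep the best of n_samples random points."""
    best = None
    for _ in range(n_samples):
        theta = [rng.uniform(lo, hi) for lo, hi in bounds]
        value = J(theta)
        if best is None or value < best[0]:
            best = (value, theta)
    return best[1]

def local_phase(J, theta, bounds, tol=1e-9, max_iter=100000):
    """Compass search: try +/- steps per coordinate, halve steps on failure."""
    theta, best = list(theta), J(theta)
    steps = [0.1 * (hi - lo) for lo, hi in bounds]
    for _ in range(max_iter):
        if max(steps) < tol:
            break
        improved = False
        for j, (lo, hi) in enumerate(bounds):
            for direction in (1.0, -1.0):
                trial = list(theta)
                trial[j] = min(hi, max(lo, theta[j] + direction * steps[j]))
                value = J(trial)
                if value < best:
                    theta, best, improved = trial, value, True
        if not improved:
            steps = [0.5 * s for s in steps]
    return theta, best

theta_true = [2.0, 0.5]
times = [0.5, 1.0, 2.0, 4.0, 8.0]
data = [model(theta_true, t) for t in times]  # noise-free synthetic data
weights = [1.0] * len(times)
bounds = [(0.0, 5.0), (0.0, 2.0)]

J = lambda theta: cost(theta, times, data, weights)
theta_start = global_phase(J, bounds, n_samples=2000, rng=random.Random(0))
theta_hat, j_hat = local_phase(J, theta_start, bounds)
```

With noise-free data the refined estimate should coincide with the values used to generate the data; with noisy data the same code returns the weighted least-squares estimate instead.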

Practical identifiability analysis

As already mentioned in the introduction, practical identifiability analysis enables an evaluation of the possibility of assigning unique values to parameters from a given set of experimental data or experimental scheme subject to experimental noise. We distinguish between practical identifiability a priori, which anticipates the quality of the selected experimental scheme in terms of what we will call the expected uncertainty of the parameters, and practical identifiability a posteriori, which assesses the quality of the parameter estimates after model calibration in terms of the confidence region.

It is important to note that the major difference between the two analyses is that, a priori, we have to assume a maximum experimental error, whereas, a posteriori, since the experimental data are already available, the experimental error may be estimated either through experimental data manipulation (when replicates of the experiments are available) or after model calibration using the residuals (i.e. the differences between model predictions and the experimental data) [37].

Possibly the simplest approach to perform such analyses given a set of simulated (a priori) or real (a posteriori) experimental data is to draw contours of the cost J(θ) by pairs of parameters. This will help detect typical practical identifiability problems, such as strong correlation between parameters, the lack of identifiability for some parameters when the contours extend to infinity, or the presence of sub-optimal solutions.

To quantify the expected uncertainty of the parameters and/or the confidence region, we rely on a Monte Carlo-based sampling method [38–40]. The underlying idea is to simulate the possibility of performing hundreds of replicates of the same experimental scheme for a given experimental error. The model calibration problem is solved for each replicate and the cloud of solutions is recorded in a matrix. Note that, in order to avoid convergence to local solutions, an efficient global optimization method is required. The cloud of solutions is assumed to correspond to, or to be fully contained in, a hyper-ellipsoid. Principal component analysis applied to the 0.05-0.95 interquantile range of the cloud or matrix of solutions then provides information on hyper-ellipsoid eccentricity (correlation between parameters) and pseudo-volume (accuracy of the parameters). The analysis of the histograms of the parameter solutions provides the mean value of the parameters (μ) and either the maximum expected uncertainty (a priori) or the confidence intervals (a posteriori) for the parameters (C_θ). See details in [40].
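A minimal sketch of the Monte Carlo idea is given below. To keep it short, it assumes a hypothetical straight-line model whose least-squares fit has a closed form, so no global optimizer is needed for each replicate; for the non-linear models considered here every replicate would instead require a full calibration. The correlation coefficient of the cloud is used as a simple two-parameter proxy for eccentricity in place of the principal component analysis. All values and names are illustrative.

```python
import math
import random

def fit_line(times, data):
    """Closed-form least-squares estimate of (a, b) in y = a + b * t."""
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(data) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, data))
    b = sxy / sxx
    return y_mean - b * t_mean, b

def quantile(values, q):
    """Empirical quantile taken from the sorted sample."""
    ordered = sorted(values)
    return ordered[int(round(q * (len(ordered) - 1)))]

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)
theta_true = (1.0, 0.5)          # assumed nominal parameters
times = [0.0, 1.0, 2.0, 3.0, 4.0]
sigma = 0.1                      # assumed experimental error
cloud = []
for _ in range(500):             # simulated replicates of the same scheme
    data = [theta_true[0] + theta_true[1] * t + rng.gauss(0.0, sigma)
            for t in times]
    cloud.append(fit_line(times, data))

a_hat = [p[0] for p in cloud]
b_hat = [p[1] for p in cloud]
mu = (sum(a_hat) / len(a_hat), sum(b_hat) / len(b_hat))
intervals = [(quantile(v, 0.05), quantile(v, 0.95)) for v in (a_hat, b_hat)]
rho = correlation(a_hat, b_hat)
```

Here `mu` plays the role of the mean parameter values, `intervals` the role of the 0.05-0.95 ranges, and a value of `rho` close to ±1 signals a strongly elongated cloud, i.e. correlated parameters.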

The obtained expected uncertainty of the parameters will allow the different experimental designs to be compared a priori, i.e. without performing any experiment. The richest experiment, in terms of the quantity and quality of information, will be the one with the best compromise between pseudo-volume and eccentricity.

The confidence intervals obtained for the parameters will enable a decision to be made on the need to perform further experiments to improve the quality of the parameter estimates and, thus, the predictive capabilities of the model.

Optimal experimental design

A crucial aspect of experimental data is data quantity and quality. As mentioned in the previous section, a given set of data may result in practical identifiability problems. This is why data generation and modeling have to be implemented as parallel and interactive processes, thereby avoiding the generation of data that may eventually turn out to be unsuited for modeling.

In addition, the use of model-based (in silico) experimentation can greatly reduce the effort and cost of biological experiments, and simultaneously facilitate the understanding of complex biological systems [41–44].

The model identification loop is complemented with an optimal experimental design step. The aim is to calculate the best scheme of measurements in order to maximize the richness (quantity and quality) of the information provided by the experiments while minimizing, or at least, reducing, the experimental burden [38, 40].

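The richness of the experimental information may be quantified by the use of the Fisher Information Matrix (ℱ) [37, 45], which for the case of Gaussian noise with known or constant variance reads as follows:

$$\mathcal{F} = \mathop{E}_{\tilde{y} \mid \mu} \left\{ \left[ \frac{\partial J(\mu)}{\partial \theta} \right] \left[ \frac{\partial J(\mu)}{\partial \theta} \right]^{T} \right\} \tag{14}$$

where E represents the expectation for a given value of the parameters μ presumably close to the optimal solution θ*.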

The optimal experimental design is then formulated and solved as a general dynamic optimization problem (see details in [40]) that computes the time-varying stimuli profile, sampling times, experiment durations and (possibly) initial conditions so as to maximize a scalar measure of the Fisher Information Matrix subject to the system dynamics (Eqns. 1 and 3) and to other algebraic constraints associated with experimental limitations.

Regarding the selection of the scalar measure of ℱ, several alternatives exist, all of them related to the eigenvalues of ℱ and thus to the shape and size of the associated hyper-ellipsoid. The most popular are probably the D-optimality and E-optimality criteria, the former corresponding to the maximization of the determinant of ℱ and the latter corresponding to the maximization of its minimum eigenvalue. From previous studies [40] it may be concluded that the E-optimality criterion offers the best quantity-quality compromise for the information, particularly for cases where the parameters are highly correlated or the sensitivities with respect to the parameters are highly uneven; otherwise D-optimality may be more successful.
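The two criteria can be compared on a small example. The sketch below assumes the standard sensitivity-based form of the Fisher information for Gaussian noise with constant standard deviation, ℱ = σ⁻² Σ_s s(t_s) s(t_s)ᵀ, together with a hypothetical two-parameter exponential-decay model with analytic sensitivities. It evaluates two candidate sets of sampling times rather than solving the dynamic optimization problem described above. All names and values are illustrative.

```python
import math

def sensitivities(theta, t):
    """Analytic dy/dtheta for the toy model y = theta[0] * exp(-theta[1] * t)."""
    decay = math.exp(-theta[1] * t)
    return [decay, -theta[0] * t * decay]

def fisher_information(theta, times, sigma):
    """2x2 FIM for Gaussian noise with constant standard deviation sigma."""
    fim = [[0.0, 0.0], [0.0, 0.0]]
    for t in times:
        s = sensitivities(theta, t)
        for i in range(2):
            for j in range(2):
                fim[i][j] += s[i] * s[j] / sigma ** 2
    return fim

def d_criterion(fim):
    """Determinant of the FIM (D-optimality maximizes this)."""
    return fim[0][0] * fim[1][1] - fim[0][1] * fim[1][0]

def e_criterion(fim):
    """Smallest eigenvalue of the symmetric 2x2 FIM (E-optimality maximizes this)."""
    trace = fim[0][0] + fim[1][1]
    disc = math.sqrt(max(trace ** 2 - 4.0 * d_criterion(fim), 0.0))
    return 0.5 * (trace - disc)

theta_nominal = [2.0, 0.5]
sigma = 0.1
design_early = [0.1, 0.2, 0.3, 0.4]    # samples clustered at early times
design_spread = [0.1, 1.0, 2.0, 4.0]   # samples spread over the decay

fim_early = fisher_information(theta_nominal, design_early, sigma)
fim_spread = fisher_information(theta_nominal, design_spread, sigma)
d_early, d_spread = d_criterion(fim_early), d_criterion(fim_spread)
e_early, e_spread = e_criterion(fim_early), e_criterion(fim_spread)
```

With the same number of samples, the spread design scores higher on both criteria, because the early-time samples carry little information on the decay rate.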

Results and Discussion

The NF-κB regulatory module

NF-κB is implicated in several common diseases, especially those with inflammatory or autoimmune components, such as septic shock, cancer, arthritis, diabetes and atherosclerosis [46]. Mathematical models connected to experimental data have played a key role in revealing forms of regulation of NF-κB signaling and the underlying molecular mechanisms. Commencing with the original model proposed by Hoffmann et al. [47], several models have been proposed that include additional feedback loops, cross-talk with other pathways and NF-κB oscillations, as detailed in the recent reviews by Lipniacki and Kimmel [48] and Cheong et al. [49].

The model considered in this work was proposed by Lipniacki et al. [9]. This model presents several modifications with respect to the original by Hoffmann et al. [47]. Basically, while the original model accounts for the interplay among three isoforms of the inhibitory proteins IκBα, IκBβ and IκBε, Lipniacki et al. consider the inhibitory roles of IκBα and A20, incorporate new assumptions about the IKK activation and introduce the nuclear-cytoplasmic volume ratio.

The model involves two-compartment kinetics of the activators IKK and NF-κB, the inhibitors A20 and IκBα and their complexes. It is assumed that IKK exists in any one of three forms: neutral (IKKn), active (IKKa) or inactive (IKKi). In the presence of an extracellular signal such as TNF, IKK is transformed into its active (phosphorylated) form. In this form it is capable of phosphorylating IκBα, and this leads to its degradation. In resting cells, the unphosphorylated IκBα binds to NF-κB and sequesters it in an inactive form in the cytoplasm. As a result, degradation of IκBα releases the second activator, NF-κB. The free NF-κB enters the nucleus and upregulates transcription of the two inhibitors IκBα and A20 and of a large number of other genes including the control gene cgen. The newly synthesized IκBα again inhibits NF-κB, while A20 inhibits IKK by catalyzing its transformation into another inactive form in which it is no longer capable of phosphorylating IκBα.

The scheme of the pathway is illustrated in Figure 4. The corresponding mathematical model consists of 15 non-linear ordinary differential equations with 30 parameters; the full set of equations is given in [9].

In the model, IKKn represents the cytoplasmic concentration of the neutral form of IKK kinase; IKKa, the cytoplasmic concentration of the active form of IKK; IKKi, the cytoplasmic concentration of inactive IKK; IκBα, the cytoplasmic concentration of IκBα; IκBα_n, the nuclear concentration of IκBα; IκBα_t, the concentration of IκBα mRNA transcripts calculated per cytoplasmic volume V; and (IKKa/IκBα), the cytoplasmic concentration of complexes of IKKa and IκBα, with equivalent notation used for all the complexes. T_R is a logical variable representing the presence or absence of signal, and k_v is the ratio of cytoplasmic to nuclear volumes.

Results/Discussion

In their paper, Lipniacki et al. (2004) fixed some of the model parameters by using values from the literature. To fit the unknown parameters, they used experimental data from previous works by Lee et al. [50] and Hoffmann et al. [47]:

(15)

Lipniacki et al. concluded that several different sets of parameters are capable of reproducing the data. This lack of identifiability may originate either in the structure of the model and observables selected (lack of structural identifiability) or in the type of experiments performed and the experimental noise (lack of practical identifiability). Our aim was to determine the origin of the problem and to use the model identification loop presented here to improve the quality of the parameter estimates.