Identification of regulatory structure and kinetic parameters of biochemical networks via mixed-integer dynamic optimization
© Guillén-Gosálbez et al.; licensee BioMed Central Ltd. 2013
Received: 26 March 2013
Accepted: 14 October 2013
Published: 31 October 2013
Recovering the network topology and associated kinetic parameter values from time-series data are central topics in systems biology. Nevertheless, methods that simultaneously do both are few and lack generality.
Here, we present a rigorous approach for simultaneously estimating the parameters and regulatory topology of biochemical networks from time-series data. The parameter estimation task is formulated as a mixed-integer dynamic optimization problem with: (i) binary variables, used to model the existence of regulatory interactions and kinetic effects of metabolites in the network processes; and (ii) continuous variables, denoting metabolites concentrations and kinetic parameters values. The approach simultaneously optimizes the Akaike criterion, which captures the trade-off between complexity (measured by the number of parameters), and accuracy of the fitting. This simultaneous optimization mitigates a possible overfitting that could result from addition of spurious regulatory interactions.
The capabilities of our approach were tested in one benchmark problem. Our algorithm is able to identify a set of plausible network topologies with their associated parameters.
KeywordsParameter estimation Structure identification Akaike criterion Orthogonal collocation Dynamic optimization Biochemical networks
Mathematical models of biochemical systems are becoming essential in systems biology to complement and extract information from time series. This information can be of two types. On the one hand, if the structure of the molecular circuit that executes the process of interest is known, models can be used to infer the numerical parameters that govern the dynamics of the system [1–4]. On the other, models can be used to infer the structure of the system from time series data (see for example [5–7]).
In either case, to obtain a useful model, we face different challenges: (i) defining the system’s mass flow structure (stoichiometry), (ii), deciding the appropriate mathematical representation (kinetics), (iii) estimating the parameters that make the model response consistent with experimental data (parameter estimation), and (iv) inferring the system’s regulatory structure. In addition, once the model is well defined, it should be able to predict systemic responses under yet untested experimental conditions (model validation).
The four challenges described in the previous paragraph are often addressed in independent steps. Current solutions to the first challenge are generally based on compiling information about the system and using that information to create the stoichiometric matrix for the system one wants to analyze (see for instance ). To solve the second challenge we need to define kinetic functions that describe the dynamic behavior of the dependent variables of the system. If the kinetic functions are unknown, approximate formalisms that have a solid theoretical support can be used to describe the dynamic behavior of the system within a given accuracy [9, 10]. The third challenge is typically formulated as an optimization problem that minimizes the sum of squared residuals between the measured and simulated data (see a review of methods in ). The type of optimization problem being faced and the technical challenges to be solved depends upon the biological model of choice, upon the experimental data available, and upon the specific mathematical formalism used [11, 12]. In many practical applications, the target biological system is described through nonlinear ordinary differential equations (ODEs). Hence, the parameter estimation task gives rise to dynamic optimization problems that are hard to solve. The fourth challenge could in principle be addressed in the same way as the first. However, despite the enormous amount of biological information available in public databases, regulatory signals are, in general, poorly understood and hardly ever properly characterized in vivo. Regulatory signals appear in a model as parameters accounting for the influence that metabolites others than the substrates of a reaction have on its velocity. Hence, parameter fitting can also be used to address the fourth challenge. However, the overwhelming majority of parameter estimation methods assumes a given structure and considers a fix regulatory scheme (see a review in ). This simplification is motivated by the difficulty in identifying regulatory effects, a task for which a myriad of alternative kinetic models must be explored [7, 13–15].
Traditional methods for the selection of biological systems have mostly applied regression or chi-squared-based criteria (rather than information-theoretic fit criteria) . However, information-theoretic criteria such as the Akaike’s Information Criterion (AIC)  or the Bayesian Information Criterion (BIC) , are now perceived as important measures to assess quality of fit. AIC is often preferred over BIC becaue it has a more immediate connection to the theory of information . AIC captures the trade-off between the complexity (measured by the number of parameters), and accuracy of the fitting. Smaller AIC values imply a better approximation to the model sought.
In this work we propose a strategy to simultaneously address the four challenges described above that relies on the use of mixed-integer dynamic optimization (MIDO) methods. Our approach adopts a structured mathematical framework to represent the kinetics of the processes that is flexible enough to reproduce a set of plausible network topologies (by implementing slight modifications on a basic model formulation). The power-law  and the saturable and cooperative formalisms are examples of such general kinetic representations . Based on this type of general kinetic modeling framework, we develop our systematic parameter estimation method that provides as output a set of potential reaction and regulatory topologies for the target network along with the associated model parameters. We illustrate the capabilities of our approach using the GMA kinetic representation, a canonical model structure that uses the power-law kinetic formalism [21, 22].
Results and discussion
Parameter estimation when the regulatory structure is known
Original and predicted parameters values
As observed, the model is rather flexible, as there are many combinations of parameters values leading to very low residuals and essentially the same fit to the data. In practical terms, this means that given an experiment and an estimation procedure, we could obtain different parameter sets that closely reproduce the experimental measurements, but that differ from the actual values with which the dynamic profile was generated in silico. Thus, estimated parameter values don’t help comparing the obtained fit with the reference model. In practice, the residual error and the resulting time profiles should be used to assess the fit.
Parameters values with noisy data (one experiment)
Parameter values obtained from simulated noisy data (with noisy data (three experiments))
Identifying the regulatory structure
Performance using error free data
After testing the capabilities of the method when the structure is known, we studied its ability to identify the regulatory topology of the model. To this end, we explore the performance of the method using one experiment with low experimental error (i.e., assuming that the data follow normal distributions with a standard deviation of 0.5%). Larger errors result in a wider set of alternative structures and for simplicity’s sake we shall not discuss them here.
In order to simplify the search, we fix a maximum of two metabolites (the substrate of the reaction, which is given by the stoichiometric information, and one possible additional modifier, which is not a priori characterized) as potential variables affecting each velocity.
We note that it is typical to have some a priori knowledge about the biological system one is interested in. The complexity of the regulatory interactions in the identification problem is reduced if such knowledge can be used to constrain further both, the number of potential regulatory signals in the model and their signs (positive, negative). In such cases, we can introduce specific constraints for the relevant parameters to be fitted. For example, in our case kinetic-order corresponding to the substrates of a reaction must be positive.
The MINLP model that simultaneously fits the parameters and infers probable regulatory interactions was implemented in GAMS 23.7.3 and solved with the solver SBB in the same computer as before. The model has 72 binary variables, 391 continuous variables and 414 equations. The solution time was in the order of few minutes for each simulation.
As before, one strategy for increasing the possibility of correctly identifying the “true” regulatory structure is to use additional time-series data of the same system under different sets of initial conditions. To this end, we changed the initial concentrations of X 3 (0.2, 1.2, and 2.2). The MINLP model was again implemented in GAMS and solved with SBB in the same computer. In this case, the MINLP features 72 binary variables, 967 continuous variables and 980 equations. The solution time was in the order of few minutes for each simulation.
The use of MIDO techniques combined with orthogonal collocation allows posing the parameter estimation task as an algebraic optimization problem that can be efficiently solved using standard MINLP algorithms. Orthogonal collocation shows some appealing properties (see ), but has the drawback of increasing the model size because it adds auxiliary variables and equations that increase the problem complexity. Our MIDO approach, however, can be solved by any MIDO algorithm, and it is not restricted to the use of orthogonal collocation and MINLP reformulations.
A key point in our method is the selection of an appropriate starting point to initialize the MINLP algorithm. Standard MINLP algorithms typically solve an initial NLP where the binary variables are relaxed. If this NLP does not converge, the entire algorithm might fail. An initialization strategy that works well in practice is to integrate first the original kinetic model for some parameter values, and then use the dynamic profiles generated in silico to provide a starting point for the NLP solver. Another method consists of solving an auxiliary model where we relax some constraints through the addition of slack variables, and then minimize the summation of the slacks in order to obtain an initial feasible point. With this relaxed model, we can identify a feasible (but not necessarily optimal) solution for the initial NLP.
Even if the MINLP model converges, there is still the issue of getting trapped in local optima during the search. To avoid this, we can run the optimization algorithm from different starting points generated randomly. This strategy does not guarantee convergence to the global optimum, but tends to produce high quality solutions in short CPU times. In contrast, deterministic global optimization methods provide a rigorous interval within which the optimum should fall, but tend to lead to large CPU times (see [26, 27]).
In our case, we initialize the NLPs by solving a set of relaxed problems from different starting points and then pass these results to the first NLP solver. This approach provides feasible points from which the model converges to solutions with low residuals.
In general, due to the nonconvex nature of the reformulated MINLP, the nonlinear branch and bound implemented in SBB outperforms the outer-approximation used by DICOPT. This is because the supporting hyperplanes defined in the master MILP solved by DICOPT may chop-off feasible solutions due to the noconvex nature of some nonlinear inequalities.
We note that nonlinear models are hard to handle, and even more so when they contain binary variables. Standard NLP solvers can solve problems containing up to hundreds of thousands of variables and constraints. On the other hand, the computational burden of MIDO (and MINLP) models is rather sensitive to the number of binary variables. For the type of problems we are dealing with, it is difficult to provide a bound on the number of binaries above which the algorithm might fail. In practice, however, we found that this approach efficiently for less than one hundred binaries (around 30 parameters).
From a practical viewpoint, we face the challenging problem of discriminating between compatible regulatory structures for a given data set. On a worst case scenario, our method provides a ranked set of alternative regulatory topologies that can be tested and validated experimentally. If appropriate additional time-series data are available, the set of admissible solutions for testing can be further constrained and reduced. Our method finds a set of alternatives that are consistent with the dynamic data available and that can be further refined using additional information and expert knowledge on the system. (i.e., complementary biological information). For instance, kinetic-orders that correspond to substrates of a reaction may be safely restricted to be positive. Similarly, if we are fairly sure that a given metabolite does not participate in a reaction, its kinetic-order should be fixed to zero.
Our method can also be used to explore hypotheses about the regulatory structure of a system. For instance, we can force some parameters to take negative values, thereby representing inhibition effects, and then perform the optimization so as to determine if the fitting is good enough. Furthermore, we can follow the same procedure in order to identify regulatory effects that are consistent with this hypothesis.
In addition, we note that our approach can be easily adapted in order to work with other model selection criteria besides AIC. We remark, however, that the assessment of different selection criteria would deserve a comprehensive study that is beyond the scope of this work.
The simple examples presented in this paper show that estimating parameters in dynamic kinetic models is far from being an easy job and that models based on the power-law formalism facilitate the estimation task. Although this formalism is suitable for a wide variety of problems, one may argue that it may present some limitations. As an alternative, we can use extensions of this framework such as the Saturable and Cooperative formalism , which takes into account saturation effects. In both cases, a key point is the possibility of using a canonical mathematical formalism that facilitates the automatic search of alternative regulatory patterns. The method described here would be applicable to such models via recasting of the Saturating and Cooperative formalism into a power law .
In this work we have proposed a rigorous approach based on mathematical programming for the simultaneous identification of the regulatory signals and estimation of the kinetic parameters of models of biochemical networks. Our approach is based on the use of mixed-integer dynamic optimization (MIDO) models that minimize the Akaike criterion, and that can be solved by standard optimization algorithms. Particularly, we solve this MIDO by reformulating it as a mixed-integer nonlinear program (MINLP) using orthogonal collocation on finite elements, which makes it possible to apply standard MINLP solution algorithms in an iterative fashion in order to identify a set of plausible network topologies and associated kinetic parameters.
It is noteworthy that the difficult task of parameter estimation in nonlinear models becomes really complicated as the size of the models increases. Therefore, such estimation typically requires customized solution procedures. One key point is to use the appropriate initial conditions to ensure convergence of the calculations.
The proposed method can contribute to fill the lack of information on the regulatory signals that are in play in a given metabolic scenario. Although we cannot deal with genome-wide models, we have shown that dynamic profiles can be processed to provide clear hypothesis on the underlying regulatory structure. This is an important step towards completing essential information on different metabolic processes that are poorly understood.
There are two critical issues in defining this model. One is the selection of an appropriate mathematical representation for v r , which may be a function of an arbitrary number of variables (substrates, products, and modifiers). In most cases the mechanism for each process are unknown and choosing a specific mechanistic rate law, such as a Michaelis-Menten rate law, becomes an act of faith. The other issue is the problem of identifying the regulatory structure of the system.
The most straightforward and theoretically well supported solution to both issues is the use of an approximate formalism based on a standard mathematical representation . By adopting such a kinetic representation, identifying the regulatory structure of the system becomes synonymous to determining the set of values θ for the model parameters that better fit the available data. Hence, without losing generality, and as a first step towards a more complex framework, we will consider the case where the rates are modeled using a power-law formalism. Note, however, that our approach could be easily extended in order to accommodate any other structured kinetic formalism.
where γ r is an apparent rate constant for reaction r, and f r,j is the kinetic order of metabolite j in that process. Note that this equation accounts for the effect of n + m metabolites (n dependent and m independent) on each reaction.
The advantage of this representation is that the same functional form represents all the rates. The reaction structure of the system will constrain the range of admissible values for some of the parameters. For example, all γ and f parameters for the substrates and catalysts of the reactions are by definition larger than zero. In addition, the values of the f parameters for all metabolites that are not directly involved in a given process are zero in the rate that describes the process.
Note that the power-law formalism accounts for both the stoichiometry of the system (the network structure), and the reaction and regulatory structures (kinetic orders) using a single systematic nonlinear representation. This property is very important for defining a systematic way of exploring alternative regulatory signals. We will make use of this general and compact formalism in the derivation of the equations for the parameter estimation model.
Parameter estimation in a GMA model
where X i represents the state variables (i.e., metabolite concentrations), X 0i their initial conditions, X i,u exp denotes the experimental observations, and X i,u mod are the values calculated by the dynamic model (i.e., model predictions). i is the index for the set of state variables whose derivatives explicitly appear in the model, γ r and f r,j are the parameters to be estimated, and t u , is the time associated with experimental point u belonging to the set U of observations. k is the total number of experimental data points and n is the number of time dependent variables.
Conventional parameter estimation approaches seek parameter values that minimize the approximation error assuming a given regulatory scheme (i.e., fixing some f r,j to zero beforehand according to the aprioristic biochemical knowledge of the system). While this assumption simplifies the calculations, it can lead to poor approximations and hamper at the same time the discovery of new regulatory loops. In this work we introduce a rigorous and systematic parameter estimation and network identification method that makes no assumption regarding the regulatory network topology.
where Boolean variables Y have been replaced by auxiliary binary variables y. In these equations, M is a sufficiently large parameter whose value must be carefully set according to the bounds defined for the kinetic parameters.
There are different solution methods to solve this MIDO (see ). Without loss of generality, we propose here to reformulate this problem into an equivalent algebraic MINLP (mixed-integer nonlinear program) using orthogonal collocation on finite elements. This allows exploiting the rich optimization theory and software applications available for MINLP in the solution of the MIDO. Note that the reformulated MINLP might be nonconvex. This will give rise to multimodality (i.e., existence of multiple local optima), preventing standard gradient-based solvers from identifying the global optimum. Deterministic global optimization methods could be applied to solve the MINLP, but they might lead to large CPU times given the size and complexity of a standard dynamic problem of this type. Details on the application of deterministic global optimization methods to parameter estimation problems of small/medium size can be found elsewhere [30, 31]. For the reasons given above, in this work we will solve the reformulated MINLP using local optimizers.
Where ONE it and ZERO it represent the sets of binary variables that take a value of one and zero, respectively, in iteration it of the algorithm. After adding the integer cut, the model is solved again to produce a new regulatory topology, and this procedure is repeated iteratively until a desired number of configurations is generated. Hence, the algorithm produces as output a set of potential network configurations (encoded in the values of the binary solutions) rather than a single topology. Note that these regulatory topologies show a descendant value of the Akaike performance criterion.
Funding: The authors acknowledges the financial support of the following institutions: Spanish Ministry of Education and Science (CTQ2009-14420-C02, CTQ2012-37039-C02, DPI2012-37154-C02-02, BFU2008-00196/BMC, BFU2010-17704, SGR2009-0809 and ENE 2011-28269-CO3-03), Spanish Ministry of External Affairs (projects PHB 2008-0090-PC), and European Commission (Marie Curie Actions - IAPP program - FP7/251298).
- Chou IC, Voit EO: Recent developments in parameter estimation and structure identification of biochemical and genomic systems. Math Biosci. 2009, 219: 57-83. 10.1016/j.mbs.2009.03.002.PubMedPubMed CentralView ArticleGoogle Scholar
- Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmüller U, Timmer J: Structural and practical identifiability analysis of partially observed dynamical mod-elsby exploiting the profile likelihood. Bioinformatics. 2009, 25: 1923-1929. 10.1093/bioinformatics/btp358.PubMedView ArticleGoogle Scholar
- Srinath S, Gunawan R: Parameter identifiability of power-law biochemical system models. J Biotechnol. 2010, 149: 132-140. 10.1016/j.jbiotec.2010.02.019.PubMedView ArticleGoogle Scholar
- Voit EO: Characterizability of metabolic pathway systems from time series data. Math Biosci. 2013, doi:10.1016/j.mbs.2013.01.008Google Scholar
- Maki Y, Tominaga D, Okamoto M, Watanabe S, Eguchi Y: Development of a system for the inference of large scale genetic networks. Pac Symp Biocomput. 2001, 2001: 446-458.Google Scholar
- Vance W, Arkin A, Ross J: Determination of causal connectivities of species in reaction networks. Proc Natl Acad Sci U S A. 2002, 99 (9): 5816-5821. 10.1073/pnas.022049699.PubMedPubMed CentralView ArticleGoogle Scholar
- Sriyudthsak K, Shiraishi F, Hirai MY: Identification of a Metabolic Reaction Network from Time-Series Data of Metabolite Concentrations. PLoS One. 2013, 8 (1): e51212-10.1371/journal.pone.0051212.PubMedPubMed CentralView ArticleGoogle Scholar
- Orth JD, Conrad TM, Na J, Lerman JA, Nam H, Feist AM, Palsson BØ: A comprehensive genome-scale reconstruction of Escherichia coli metabolism. Mol Syst Biol. 2011, 7: 535-doi:10.1038/msb.2011.65PubMedPubMed CentralView ArticleGoogle Scholar
- Sorribas A, Hernandez-Bermejo B, Vilaprinyo E, Alves R: Cooperativity and saturation in biochemical networks: a saturable formalism using Taylor series approximations. Biotechnol Bioeng. 2007, 97: 1259-1277. 10.1002/bit.21316.PubMedView ArticleGoogle Scholar
- Alves R, Vilaprinyo E, Hernandez-Bermejo B, Sorribas A: Mathematical formalisms based on approximated kinetic representations for modeling genetic and metabolic pathways. Biotechnol Genet Eng Rev. 2008, 25: 1-40. 10.5661/bger-25-1.PubMedView ArticleGoogle Scholar
- Moles CG, Mendes P, Banga JR: Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res. 2003, 13 (11): 2467-2474. 10.1101/gr.1262503.PubMedPubMed CentralView ArticleGoogle Scholar
- Chis OT, Banga JR, Balsa-Canto E: Structural identifiability of systems biology models: a critical comparison of methods. PLoS One. 2011, 6 (11): e27755-10.1371/journal.pone.0027755. doi:10.1371/journal.pone.0027755PubMedPubMed CentralView ArticleGoogle Scholar
- Sorribas A, Samitier S, Canela EI, Cascante M: Metabolic pathway characterization from transient response data obtained in situ: parameter estimation in S-system models. J Theor Biol. 1993, 162: 81-102. 10.1006/jtbi.1993.1078.PubMedView ArticleGoogle Scholar
- Sorribas A, Cascante M: Structure identifiability in metabolic pathways: parameter estimation in models based on the power-law formalism. Biochem J. 1994, 298: 303-311.PubMedPubMed CentralView ArticleGoogle Scholar
- Vilela M, Chou IC, Vinga S, Vasconcelos AT, Voit EO, Almeida JS: Parameter optimization in S-system models. BMC Syst Biol. 2008, 2: 35-10.1186/1752-0509-2-35.PubMedPubMed CentralView ArticleGoogle Scholar
- Markon KE, Krueger RF: An Empirical Comparison of Information-Theoretic Selection Criteria for Multivariate Behavior Genetic Models. Behav Genet. 2004, 34: 593-610. 10.1007/s10519-004-5587-0.PubMedView ArticleGoogle Scholar
- Akaike H: New look at statistical-model identification. IEEE Trans Automat. 1974, 19: 716-723. 10.1109/TAC.1974.1100705. Contr. AC19 716View ArticleGoogle Scholar
- Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6: 461-464. 10.1214/aos/1176344136.View ArticleGoogle Scholar
- Burnham KP, Anderson DR: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2002, New York: Springer-Verlag, ISBN 0-387-95364-7, 2Google Scholar
- Voit E: Computational analysis of biochemical systems. A practical guide forbiochemists and molecular biologists. 2000, Cambridge: Cambridge University PressGoogle Scholar
- Savageau M, Biochemical systems analysis. i: Some mathematical properties of the rate law for the component enzymatic reactions. J Theor Biol. 1969, 25: 365-369. 10.1016/S0022-5193(69)80026-3.PubMedView ArticleGoogle Scholar
- Savageau M: Biochemical systems analysis. ii. The steady-state solutions for an n-pool system using a power-law approximation. J Theor Biol. 1969, 25: 370-379. 10.1016/S0022-5193(69)80027-5.PubMedView ArticleGoogle Scholar
- Voit EO, Almeida J: Decoupling dynamical systems for pathway identification from metabolic profiles. Bioinformatics. 2004, 20: 1670-1681. 10.1093/bioinformatics/bth140.PubMedView ArticleGoogle Scholar
- Wang Y, Joshi T, Zhang XS, Xu D, Chen L: Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics. 2006, 22 (19): 2413-2420. 10.1093/bioinformatics/btl396.PubMedView ArticleGoogle Scholar
- Biegler L, Grossmann IE: Retrospective on optimization. Comput Chem Eng. 2004, 28 (8): 1169-1192. 10.1016/j.compchemeng.2003.11.003.View ArticleGoogle Scholar
- Miró A, Pozo C, Guillén-Gosélbez G, Egea JA, Jiménez L: Deterministic global optimization algorithm based on outer approximation for the parameter estimation of nonlinear dynamic biological systems. BMC Bioinformatics. 2012, 13 (1): 90-10.1186/1471-2105-13-90.PubMedPubMed CentralView ArticleGoogle Scholar
- Grossmann IE, Biegler L: Part II. Future perspective on optimization. Comput Chem Eng. 2004, 28 (8): 1193-1218. 10.1016/j.compchemeng.2003.11.006.View ArticleGoogle Scholar
- Pozo C, Marin-Sanguino A, Guillén-Gosalbez G, Jimenez L, Alves R, Sorribas A: Steady-state global optimization of non-linear dynamic models through recasting into power-law canonical models. BMC Syst Biol. 2011, 5: 137-10.1186/1752-0509-5-137.PubMedPubMed CentralView ArticleGoogle Scholar
- Vecchietti A, Sangbum L, Grossmann I: Modeling of dicrete/continuous optimization problems: Characterization and formulation of disjunctions and their re-laxations. Comput Chem Eng. 2003, 27: 433-448. 10.1016/S0098-1354(02)00220-X.View ArticleGoogle Scholar
- Esposito W, Floudas C: Global optimization for the parameter estimation of differential-algebraic systems. Ind Eng Chem Res. 2000, 39 (5): 1291-1310. 10.1021/ie990486w.View ArticleGoogle Scholar
- Rodriguez-Fernandez M, Egea J, Banga J: Novelmetaheuristic for parameter estimation in nonlinear dynamic biological systems. BMC Bioinf. 2006, 7: 483-10.1186/1471-2105-7-483.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.