The problem we address here is to infer the regulatory structure of a metabolic system, given a known structure for the reaction network (stoichiometry) and experimental time series for the dynamic behavior of that system. To address this question, and to explore the practical problems associated, we consider the following general representation of a biochemical network:
(1)
where X
i
denotes the concentration of metabolite i, μ
i,r
is the stoichiometric coefficient of metabolite i in process r, which indicates the number of molecules of type i produced or destroyed by process r, and v
r
is the rate function of this process. In general, v
r
is represented as:
(2)
There are two critical issues in defining this model. One is the selection of an appropriate mathematical representation for v
r
, which may be a function of an arbitrary number of variables (substrates, products, and modifiers). In most cases the mechanism for each process are unknown and choosing a specific mechanistic rate law, such as a Michaelis-Menten rate law, becomes an act of faith. The other issue is the problem of identifying the regulatory structure of the system.
The most straightforward and theoretically well supported solution to both issues is the use of an approximate formalism based on a standard mathematical representation [10]. By adopting such a kinetic representation, identifying the regulatory structure of the system becomes synonymous to determining the set of values θ for the model parameters that better fit the available data. Hence, without losing generality, and as a first step towards a more complex framework, we will consider the case where the rates are modeled using a power-law formalism. Note, however, that our approach could be easily extended in order to accommodate any other structured kinetic formalism.
Power-law models
Using the power-law representation, the rate v
r
is expressed as follows:
(3)
where γ
r
is an apparent rate constant for reaction r, and f
r,j
is the kinetic order of metabolite j in that process. Note that this equation accounts for the effect of n + m metabolites (n dependent and m independent) on each reaction.
The advantage of this representation is that the same functional form represents all the rates. The reaction structure of the system will constrain the range of admissible values for some of the parameters. For example, all γ and f parameters for the substrates and catalysts of the reactions are by definition larger than zero. In addition, the values of the f parameters for all metabolites that are not directly involved in a given process are zero in the rate that describes the process.
By adopting such a kinetic representation, we can pose the problem of identifying the regulatory signals in a very compact mathematical form. If X
j
is a modifier of v
r
, then the corresponding kinetic order f
r,j
will be different from zero (positive if it is an activator, and negative if it is an inhibitor). By substituting (3) into equation (1), we get what is known as a Generalized Mass-Action (GMA) model.
(4)
Note that the power-law formalism accounts for both the stoichiometry of the system (the network structure), and the reaction and regulatory structures (kinetic orders) using a single systematic nonlinear representation. This property is very important for defining a systematic way of exploring alternative regulatory signals. We will make use of this general and compact formalism in the derivation of the equations for the parameter estimation model.
Parameter estimation in a GMA model
Given a set of experimental observations (i.e., time courses for the metabolites), our goal is to find the values of the apparent constants and kinetic orders that minimize the sum of least squared errors between the experimental data and the predicted dynamic profiles. This problem can be expressed in compact form as follows:
(5)
where X
i
represents the state variables (i.e., metabolite concentrations), X
0i
their initial conditions, X
i,u
exp denotes the experimental observations, and X
i,u
mod are the values calculated by the dynamic model (i.e., model predictions). i is the index for the set of state variables whose derivatives explicitly appear in the model, γ
r
and f
r,j
are the parameters to be estimated, and t
u
, is the time associated with experimental point u belonging to the set U of observations. k is the total number of experimental data points and n is the number of time dependent variables.
Conventional parameter estimation approaches seek parameter values that minimize the approximation error assuming a given regulatory scheme (i.e., fixing some f
r,j
to zero beforehand according to the aprioristic biochemical knowledge of the system). While this assumption simplifies the calculations, it can lead to poor approximations and hamper at the same time the discovery of new regulatory loops. In this work we introduce a rigorous and systematic parameter estimation and network identification method that makes no assumption regarding the regulatory network topology.
To model the existence of a regulatory interaction, we make use of the following disjunction:
(6)
In which Y
r,j
-,Y
r,j
and Y
r,j
+ are Boolean variables that are true if parameter f
r,j
is negative, zero or positive, respectively, and false otherwise. ϵ is a very small parameter. Note that only one term of the disjunction can be active (i.e., exclusive disjunction), while the others must be false. Hence, if Y
r,j
is true, metabolite i takes no part in velocity r. Conversely, if this metabolite has an influence on r, then Y
r,j
is false and either Y
r,j
- or Y
r,j
+ will be active. This disjunction can be translated into standard algebraic equations using either the big-M or convex-hull reformulations [29]. By applying the former, we get:
(7)
where Boolean variables Y have been replaced by auxiliary binary variables y. In these equations, M is a sufficiently large parameter whose value must be carefully set according to the bounds defined for the kinetic parameters.
A key issue in our approach is how to avoid overfitting. To this end, we make use of the Akaike criterion, which captures the trade-off between the number of kinetic parameters contained in the model and its ability to accurately reproduce the experimental data. If we assume that the error of the observations follows a normal distribution, the Akaike criterion takes the following mathematical form [17]:
(8)
Where AIC denotes the value of the Akaike criterion and C is a constant value that does not affect the optimization. The parameter estimation problem can be finally posed in mathematical terms using the following MIDO (mixed-integer dynamic optimization) formulation:
(9)
There are different solution methods to solve this MIDO (see [25]). Without loss of generality, we propose here to reformulate this problem into an equivalent algebraic MINLP (mixed-integer nonlinear program) using orthogonal collocation on finite elements. This allows exploiting the rich optimization theory and software applications available for MINLP in the solution of the MIDO. Note that the reformulated MINLP might be nonconvex. This will give rise to multimodality (i.e., existence of multiple local optima), preventing standard gradient-based solvers from identifying the global optimum. Deterministic global optimization methods could be applied to solve the MINLP, but they might lead to large CPU times given the size and complexity of a standard dynamic problem of this type. Details on the application of deterministic global optimization methods to parameter estimation problems of small/medium size can be found elsewhere [30, 31]. For the reasons given above, in this work we will solve the reformulated MINLP using local optimizers.
One important feature of our approach is that rather than calculating a single optimal solution, it identifies a set of plausible regulatory topologies by solving the model iteratively. That is, the model is first solved to identify a potential regulatory configuration represented by a binary solution (i.e., set of values of the binary variables). The model is then calculated again but this time adding the following integer cut, which excludes solutions identified so far in previous iterations from the search space:
(10)
Where ONE
it
and ZERO
it
represent the sets of binary variables that take a value of one and zero, respectively, in iteration it of the algorithm. After adding the integer cut, the model is solved again to produce a new regulatory topology, and this procedure is repeated iteratively until a desired number of configurations is generated. Hence, the algorithm produces as output a set of potential network configurations (encoded in the values of the binary solutions) rather than a single topology. Note that these regulatory topologies show a descendant value of the Akaike performance criterion.