- Research article
- Open access
- Published:

# Efficient, sparse biological network determination

*BMC Systems Biology*
**volume 3**, Article number: 25 (2009)

## Abstract

### Background

Determining the interaction topology of biological systems is a topic that currently attracts significant research interest. Typical models for such systems take the form of differential equations that involve polynomial and rational functions. Such nonlinear models make the problem of determining the connectivity of biochemical networks from time-series experimental data much harder. The use of linear dynamics and linearization techniques that have been proposed in the past can circumvent this, but the general problem of developing efficient algorithms for models that provide more accurate system descriptions remains open.

### Results

We present a network determination algorithm that can treat model descriptions with polynomial and rational functions and which does not make use of linearization. For this purpose, we make use of the observation that biochemical networks are in general 'sparse' and minimize the 1-norm of the decision variables (sum of weighted network connections) while constraints keep the error between data and the network dynamics small. The emphasis of our methodology is on determining the interconnection topology rather than the specific reaction constants and it takes into account the necessary properties that a chemical reaction network should have – something that techniques based on linearization can not. The problem can be formulated as a Linear Program, a convex optimization problem, for which efficient algorithms are available that can treat large data sets efficiently and uncertainties in data or model parameters.

### Conclusion

The presented methodology is able to predict with accuracy and efficiency the connectivity structure of a chemical reaction network with mass action kinetics and of a gene regulatory network from simulation data even if the dynamics of these systems are non-polynomial (rational) and uncertainties in the data are taken into account. It also produces a network structure that can explain the real experimental data of *L. lactis* and is similar to the one found in the literature. Numerical methods based on Linear Programming can therefore help determine efficiently the network structure of biological systems from large data sets. The overall objective of this work is to provide methods to increase our understanding of complex biochemical systems, particularly through their interconnection and their non-equilibrium behavior.

## Background

Determining the interaction topology in large-scale biological systems has been an active area of research for some time now. Most methodologies that deal with high-throughput experimental data make use of information about the behavior of the system locally around the steady-state [1–3]. For example, a class of techniques that fall under the rubric of 'stationary state Jacobian Matrix Elements' analyzes the system behavior when it is perturbed locally from steady-state and looks at whether the concentration of one species is increased or decreased when another species' concentration is increased. In [4] and [5], the authors have built on this approach and determined the functional interactions in cellular signaling and gene networks by taking into account the 'modular' structure of the networks in question. Alternatively, inferences about the topology of the network can be made by introducing pulse changes in concentration of a chemical species in the network, and observing the network's response, concluding causal chemical connectivities [6]. In [3], a linear dynamical system was considered to represent a gene regulatory network, and an approach, based on Linear Programming, was proposed in order to obtain the sparsest network structure from genetic perturbation experiments.

However, most information of the system dynamics is in its transient behavior and, more importantly, many reactions that the living cell requires are actually for most of the time far from steady state [7]. The problem of determining the network structure in the case where transient time-series data for non-equilibrium behavior are available is much harder and this has been an active area of research for over a decade. The reason is that while such data are often abundant due to the development of high-throughput, effective experimental techniques, at the same time, a high computational effort is required to extract information about the network structure using traditional techniques. A recent review of available techniques can be found in [8] or [9], but earlier articles, such as [10], also list several approaches to this network identification problem.

In actual fact, identifying the interconnection topology in biological and biochemical systems is of greater importance than extracting the parameters (the rates of the various reactions) that best fit the particular time series data. There are several reasons for this: first, the parameters are often collected under noisy experimental conditions and are sensitive to the laboratory environment. Second, as is often the case with large networks, persistence of observed phenomena is robust to a large range of most parameter values and therefore identifying these parameters exactly is not of great interest. Indeed, *chemical reaction networks* often have the same functionality in the neighborhood of most of the nominal reaction rates. But most importantly, networks are rarely robust to the random rewiring of the underlying interconnection structure and hence determining the network connectivity is much more important than determining the kinetic parameters that fit the particular data.

In this paper, we first consider chemical reaction networks with *mass action kinetics* (see references [11] and [12]) and seek to identify the chemical pathways and mechanisms, that is, how the chemical complexes interact within the chemical network. Chemical reaction networks are dynamical systems of the form

where the unknown matrix encompassing the connectivity structure is *A* and the vector of functions *f*: ℝ^{n}→ ℝ^{m}(which need to satisfy appropriate smoothness conditions to ensure local existence and uniqueness of solutions) is known. This makes (1) linear in the unknown parameters. Our main objective is to provide a procedure to identify the interconnection topology that is encapsulated in *A*, given experimental time-series data.

An important property of the network given by *A* is sparseness, i.e., it has much less edges than the full graph with the same vertex set. In this paper, we extend the results in [13] that focus on obtaining sparse interconnection networks from experimental data to general and large-scale networks. We apply the presented methodology in order to reconstruct a biochemical network from mock-up experimental data obtained through simulations. More importantly, we show its validity in determining the glycolytic pathway of *Lactococcus lactis* from real experimental data. Although this pathway has been investigated in great detail (see for example, [14–16]) and is the test object of many system identification approaches as a recent paper fittingly notes in its title, it is 'an unfinished systems biological case study' [14].

Finally, we suggest how the method of identifying connectivity for systems of the form (1) can be adjusted to determine the structure of *gene regulatory networks*, in which the unknown parameters do not enter the system dynamics in an affine way. We then apply the methodology in order to reconstruct a gene regulatory network from mock-up experimental data obtained through simulations.

## Notation

ℝ, ℝ^{n}, ℝ^{m × n}is the set of all real numbers, real vectors of length *n*, *m* × *n* real matrices

*A*_{
ij
}(*i*, *j*)th is the (*i*, *j*) entry of matrix *A* ∈ ℝ^{m × n}

{\mathbb{R}}_{+}^{n},{\overline{\mathbb{R}}}_{+}^{n} :{*x* ∈ ℝ^{n}: *x*_{
i
}> 0, *i* = 1, ..., *n*}, {*x* ∈ ℝ^{n}: *x*_{
i
}≥ 0, *i* = 1, ..., *n*}

vec(*A*) is a vector which contains the columns of *A* stacked one below each other

*e* =[1, 1, ⋯, 1]^{T}

diag(*A*), *A* ∈ ℝ^{n × n}is a vector of length *n*, where (diag(*A*))_{
i
}= *A*_{
ii
}

diag(*x*), *x* ∈ ℝ^{n}is a matrix in ℝ^{n × n}, where (diag(*x*))_{
ii
}= *x*_{
i
}and (diag(*x*))_{
ij
}= 0 if *j* ≠ *i*

## Methods

### Chemical reaction networks

Chemical reaction networks are used to describe and understand biological processes. An illustrative example is the following reaction network proposed by Michaelis and Menten for chemical reactions involving enzymes,

Here, *S* denotes the substrate, *E* the enzyme, *ES* the enzyme-substrate complex and *P* the final product. They are the so called *species* that participate in the reactions. The positive constants *k*_{1}, *k*_{-1}, *k*_{2} and *k*_{-2} are the reaction *rate constants*, *edges* represent reactions and *vertices* represent *complexes* (the objects that appear before and after the reaction arrows).

In *chemical kinetics*, it is common to assume that the dynamics of the chemical reaction network (such as the one given by (2)) can be described by the following set of nonlinear ODEs [17]:

where *v*(*x*) is the *rate vector* (or *flux vector*), *x* is the concentration vector of the different species and *N* is the *stoichiometric matrix*. If *p* molecules of species *i* appear before the reaction arrow in reaction *j* then *N*_{
ij
}= -*p* and if they appear after then *N*_{
ij
}= *p*.

The description given by (3) hides the underlying chemical network structure, which we aim to determine in this paper. Hence, in the following, we introduce the notation used in *chemical reaction network theory*, which decomposes *N* and *v*(*x*) into: the so called *bookkeeping matrix Y*, which maps the space of complexes into the space of species; the concentration vector of the different complexes Ψ(*x*); and matrix *A*_{
κ
}, which defines the network structure. For the Michaelis-Menten reaction, the vectors of species and complexes are given by

respectively. The elements of the *i* th row of matrix *Y* tell us in which complexes species *i* appears and how often; or, equivalently, the entries to the *j* th column tell us of how much of each species complex *j* is made of. For (2),

Matrix *K* is the transpose of the *weighted adjacency matrix* of the *digraph* representing the chemical reaction network; that is, entry *K*_{
ij
}is nonnegative and corresponds to the rate constant associated with the reaction from complex *j* to *i*. The so called *kinetic matrix A*_{
κ
}is given by *A*_{
κ
}= *K* - diag(*K*^{T}*e*). For (2),

In chemical reaction network theory, it is common to assume mass action kinetics. The law of mass action assumes that if reactions take place at constant temperature in a homogenous and well mixed solution then the probability of a collision between molecules is proportional to the product of their concentrations. This means that ln Ψ(*x*) = *Y*^{T} ln *x*, and one reformulates the set of nonlinear ODEs given by (3) as [18]:

In general, we assume that a chemical reaction network has *n* species and *m* complexes. Thus, in (4): *x* ∈ {\overline{\mathbb{R}}}_{+}^{n}, Ψ(*x*) ∈ {\overline{\mathbb{R}}}_{+}^{m}, *A*_{
κ
}∈ ℝ^{m × m}, and Y\in {\overline{\mathbb{R}}}_{+}^{n\times m}. Experimental data is stacked in vector Ψ(*x*) and often matrix *Y* is known such that we can explicitly search for the network structure given by *A*_{
κ
}. Finally, for clarity, we provide the expanded ODE representation of the Michaelis-Menten reaction given by (2):

### Determining affine and sparse interconnections in dynamical systems

Consider a dynamical system of the following form:

where *f*(·) ∈ ℝ^{m}is a vector of known functions, which satisfy appropriate smoothness conditions to ensure local existence and uniqueness of solutions. With *A* = *Y A*_{
κ
}and *f* (*x*) = Ψ(*x*), the above description results in a dynamical system of the form given by (6). Note that the unknown parameters (which also encode the network structure) are in *A*, which enters the system dynamics linearly. Let neither the value of the entries nor the structure of matrix *A* be known. What we wish to find is the structure and entries in matrix *A*, given experimental data.

For this purpose, we consider the following discrete-time system:

*x*(*t*_{k + 1}) = *x*(*t*_{
k
}) + (*t*_{k + 1}- *t*_{
k
}) *Af*(*x*(*t*_{
k
})),

which is the Euler discretization of (6).

Now, the set of measurements, which we denote by \widehat{x}, can be used to fit the unknown entries to *A* such as to minimize the error between the data and the model predictions, which are given by (7). It is popular to solve the minimization problem which results in the least 2-norm of the error between *x*_{
i
}(*t*_{k+1}) and {\widehat{x}}_{i}({t}_{k+1}) (least squares minimization problem). We can write such an error metric as:

min||*Ma* - *b*||_{2}

where *a* ∈ ℝ^{nm}is a vector containing *A*_{
ij
}, which we treat as decision variables, and *M* ∈ ℝ^{((p-1) × n) × nm}and *b* ∈ ℝ^{(p-1) × n}are defined by 'stacking' all such conditions obtained by manipulating the data as per (7). Here *p* corresponds to the number of measurements. This problem has the following analytical solution:

*a** = *M*^{†}*b* ≜ (*M*^{T} *M*)^{-1}*M*^{T}*b*

There are a few drawbacks of the above least-squares approach. In the presence of extra constraints on the variables *A*_{
ij
}(constrained regression), the problem does not have a closed-form solution, in general. Such constrained minimizations, the simplest of which is a *Second Order Cone Problem* (SOCP) [19], may carry a significant computational cost for large problems. The same is true if the data are contaminated with error which needs to be taken into account when producing *A*_{least-squares} [20]. Furthermore, the solution to a least-squares problem will not be sparse in general; it will rather result in a full matrix.

In [19] and more recently in [21], the fact was mentioned that a large number of elements of the solution *z* of a *Linear Program* (LP) such as

min ||*z*||_{1},

are zero, that is, (10) produces sparse solutions. For this reason, this is the approach we follow in the paper. In particular, if *A* is sparse then the following program seeks explicitly to minimize the entries to matrix *A* and, thus, tries to recover this property of the matrix:

By making {\mu}_{k}^{+} and {\mu}_{k}^{-} as small as possible for all *k*, we can ensure that the data are in close Euler-fit with the model making the approximation error as small as possible. The advantage of solving LPs is that the task can be performed using fast, efficient and readily available algorithms. Note also that the number of decision variables in (11) relates directly to the size of *A* and not of the data, which makes it suitable for the identification of large-scale systems. Support for the validity of above heuristic to obtain a sparse interconnection matrix *A* are also Theorem 1.1 of [22] and the work presented in [23].

An additional advantage of our approach is also that we may incorporate uncertainties in the measurements with little additional computational complexity. If we model the uncertainty in the measurements as

\tilde{x} (*t*_{
k
}) ≥ 0, \tilde{f} (*t*_{
k
}) ≥ 0, for all *k*, and *A*_{
ij
}≥ 0 then we can formulate the robust counterpart to (11) that is still an LP (see also [24, 25]). The following LP is a robust formulation of program (11):

In summary, using the above ideas, we aim to extract from data the sparsity pattern in matrix *A*, which is related to the connectivity of the underlying graph structure, drawing conclusions on the network interaction topology.

Finally, note that if data points are rare, that is *p* ≤ *m*, and there are not any constraints on matrix *A* then the error between the data and the model predictions can be made zero and (9) does not have a unique solution. However, the following LP can be used to try to recover the sparsity structure of the matrix:

### Obtaining the structure of a gene regulatory network

Using the same ideas, another class of a networks that can be determined are gene regulatory networks given microarray time-series data. We first briefly explain the function of gene regulatory networks and DNA microarray time-series.

A gene encodes the information necessary to produce a specific protein. The process, in which the information is used to synthesize a protein, is highly controlled and this control allows the cell to vary the level of a particular protein in the cell depending on the cell's need for this protein. The first step of synthesizing a protein from a gene is RNA polymerase transcribing gene information from DNA to mRNA (see Figure 1a). This mRNA is then translated into synthesized proteins by ribosomes. Control can occur at a number of stages including the synthesis of mRNA, subsequent processing of the mRNA, control of the ribosome and control of mRNA stability. Some proteins, called transcription factors, are responsible for the regulation of the initiation of transcription. A transcription factor binds to the DNA, at the promoter site, in order to either inhibit or activate (or alternatively increase the rate of) the transcription of mRNA that is responsible for the synthesis of a specific protein (see Figure 1b). (Note that self regulation is also possible.) The collection of DNA segments which interact with each other in the manner described is called the gene regulatory network.

In order to understand the dynamics and behavior of a gene regulatory network, three levels of information are required:

1. The network interconnection in the form of a *directed graph*;

2. Whether an edge from node *i* to node *j* means that transcription factor *i* is activating (denoted by arrow) or repressing (denoted by 'hammer') *j*;

3. The activation/repression rates for the transcription factors.

Time-series obtained from DNA microarrays [26, 27] are extremely helpful to obtain the structure of gene regulatory networks. This is because DNA microarrays allow observation of the presence of specific mRNA within the cell and in particular, time-series data allow measurements on how these change over time after a perturbation, or when following the cell cycle.

Now, consider the model of a gene regulatory network as described in [28] and [29], where nodes represent genes. Knowledge of the three hierarchal levels of information mentioned previously is necessary to describe the network. The first level determines whether there is an interaction between proteins *X*_{1} and *X*_{2}. An interaction of the form '*X*_{1} → *X*_{2}' means that protein *X*_{1} activates the production of protein *X*_{2} and '*X*_{1} ⊣ *X*_{2}' that *X*_{1} inhibits it. This information belongs to the second level. The activation and repression Hill input functions are given mathematically by (see [28]):

respectively, where *x*_{1} is the concentration of *X*_{1}. (In [29], the notation \frac{1}{K} is used instead of *k*. For clarity, we have adopted a 'simpler' notation.) Knowledge about the magnitude of activation or repression coefficient *k*, *k* > 0, and exponent *n*, *n* > 0, is part of the third level of information.

If there exists more than one connection to a particular gene/node then all transcription factors associated with the connections could be necessary to fulfill a specific task (activation or inhibition) or it might be that any of them is sufficient to have an effect on the transcription process; more complex combinations are also possible. In [28, 30], a generalized input function of the following form is presented, which takes the possible interplay of different transcription factors into account:

Here, activation of protein *X*_{
i
}by protein *X*_{
j
}is represented by *n*_{
ij
}= *m*_{
ij
}> 0, and repression by *n*_{
ij
}= 0, *m*_{
ij
}> 0. The contribution of the different transcription factors on the transcription rate is denoted by *b*_{
ij
}. Putting everything together, the mathematical description of the dynamics of the concentrations of protein *X*_{
i
}of an arbitrarily large gene regulatory network is as follows:

where *γ*_{
i
}> 0 is the basal transcription production rate and *d*_{
i
}> 0 is the degradation/dilution rate. In the above model, however, the vector field (right hand side of Equation (17)) is not affine in the unknown parameters and therefore one may think that the method proposed in the previous section can not be extended for this case; we show here how the above can be reformulated and cast in a form that allows identification using Linear Programming.

Let Δ*t* = *t*_{ℓ+1} - *t*_{ℓ}. A discrete-time system that approximates (17) is:

*x*_{
i
}(*t*_{ℓ+1}) = *x*_{
i
}(*t*_{ℓ}) + Δ*t*(*γ*_{
i
}+ *f*_{
i
}(*x*_{
i
}(*t*_{ℓ})) - *d*_{
i
}*x*_{
i
}(*t*_{ℓ})).

Indeed, if *b*_{
ij
}, *k*_{
ij
}and *m*_{
ij
}are unknown then (18) is not affine in the unknown parameters as is the case in (7). We rewrite (18) as follows:

In (19), if *n*_{
ij
}= 0 then we denote it by {\overline{n}}_{ij} and let {b}_{i}={\displaystyle {\sum}_{j}{b}_{ij}{x}_{j}^{{\overline{n}}_{ij}}}={\displaystyle {\sum}_{j}{b}_{ij}}. If *n*_{
ij
}> 0 then we denote it by {\tilde{n}}_{ij}. For all *i*, *j*, let an entry to matrix *B* be *b*_{
ij
}for which *n*_{
ij
}> 0, and let an entry of matrix *K* be *k*_{
ij
}. As before, given a set of measurements, which we denote by \widehat{x}, this set can be used to approximate the structure of the gene regulatory network determined by *b*_{
ij
}, *b*_{
i
}and *k*_{
ij
}if the Hill coefficients *m*_{
ij
}(and, thus, *n*_{
ij
}) are known and the basal production and degradation rates are known or considered uncertain but within a known range. For instance, we can try to recover *B, K* through a LP. The following LP puts emphasis on minimizing the 1-norm of vec(*B*), *b*, and vec(*K*), which are independent of each other, while we keep the Euler discretisation error, *μ*, as small as possible.

(The latter requirements in brackets correspond to the case of uncertain production and degradation rates.) Note that as per (16)

*k*_{
ij
}= 0 if and only if *b*_{
ij
}= 0 or *b*_{
i
}= 0, ∀*i*, *j*.

The following remark deals with the case when the solution of (20) violates (21). The rationale behind the idea is that by following these rules we can determine unambiguously whether activation or repression takes place.

**Remark 1**
*Since requirement (21) cannot be implemented in a LP, we deduce the following from the solution of (20) about the connectivity of the network when (21) is violated:*

- *if k*_{
ij
}≠ 0, *b*_{
ij
}= 0 *and b*_{
i
}= 0 *then the production of X*_{
i
}*is not affected by X*_{
j
}; *that is, it is the same case as when k*_{
ij
}= 0,

- *if b*_{
ij
}≠ 0 *and k*_{
ij
}= 0 *then X*_{
j
}*enhances the production of X*_{
i
}; *i. e., it is the same case as when k*_{
ij
}≠ 0,

- *if b*_{
i
}≠ 0 *and k*_{
ij
}= 0 *for all i then the production of X*_{
i
}*is not affected by X*_{
j
}; *that is, it is the same case as when b*_{
i
}= 0.

## Results and discussion

### Numerical experiments

#### An artificial chemical reaction network

Consider the network with 5 species \mathcal{S} = {*A*, *B*, *C*, *D*, *E*} and 5 complexes, \mathcal{C} = {*A*, 2*B*, *A* + *C*, *D*, *B* + *E*} in Figure 2. Suppose we are given time series data for all the species in this system, but we do not know the topology of the interconnection. The first experiment examines the recovered interconnection diagram using the (non-robust) LP (11). Later on, we will consider the same problem with missing data on one species and a robust network determination problem.

We have implemented the network shown in Figure 2 with a *K* matrix of the form:

and have simulated the system with uniformly distributed initial conditions. The data sets were obtained by simulating the above set of nonlinear equations using SIMULINK. Ten such data sets were generated and incorporated in the LP.

Since we do not know how the chemical network is connected, and we cannot even speculate how parts of it may be connected, we need to assume a general structure for it and write the dynamics for the complete network. A least-squares approach, would yield the following structure in matrix *K*, where non-zero entries denote the fractional occurrences of non-zero *k*_{
ij
}'s for the 10 data sets:

Essentially the only zero element predicted is *k*_{45}. Note that the diagonal of this matrix does not enter into our optimization. We write these entries as zero, but this is merely a convenient notational place holder. The resulting structure of the *K* matrix from our linear programming approach is given by

where again non-zero entries denote the fractional occurrences of non-zero entries for the 10 data sets tested. Observe that the second column is equal to zero which implies that the second complex is not the product of any reaction. Having determined this sparse structure for the system, we can repeat the same LP optimization, but now impose the new information about the sparse structure obtained in the new linear program, i.e. that *k*_{12} = 0 etc. Iterating once on this data, we get the following results:

This experiment reveals that the sparsity structure can be further reduced by an iterative procedure. One could also use the above as a 'probability' lookup table, and investigate sparsity structures, such as setting *k*_{23} and *k*_{24} equal to zero. Indeed this solution is also feasible, which reveals additional structure in the matrix *K*. Working this way, we have found that the following non-zero matrix results in feasible LPs:

which is the same as the network shown in Figure 2, but for a link between complex 1 and complex 3. Of course, this is not surprising: there is no unique reaction mechanism that can fit a data set; rather, there can be many networks which with some kinetic parameters can represent the same data within experimental error. In fact, we can only hope to *invalidate* a postulated reaction mechanism using data, a point we will return to in the concluding section.

The next experiment we performed was to assume that some of the species could not be observed in the experiments for technical reasons. In particular, we assumed that the concentration of species *A* could not be measured. This does not pose significant problems, as we can replace the occurrences of the terms in the vector field involving the variable *x*_{1} with a vector of new variables *q* which we also ask to be 'sparse', through minimization of the sum of *q*_{
i
}. Eight such substitutions need to be made; the result is a matrix of the form:

and a *q* = [*q*_{1}, ..., *q*_{8}] which corresponds to nonzero entries for *k*_{31}, *k*_{34}, *k*_{35} *k*_{13} and *k*_{15}. Therefore in this case too, a sparse topology interconnection is obtained, but the matrix in this case is not as sparse as before.

Suppose now that data are uncertain, and we want to search for *robust* sparse structures for the *K* matrix. We set {\u03f5}_{i}^{+} = {\u03f5}_{i}^{-} = 0.0004 for *i* = 1, ..., 5 and all data points – such uncertainty could be due to roundoff errors (see Equation (12)). A robust LP can be formulated, as discussed earlier, and the obtained optimization results in a network with a richer sparsity structure:

Finally, we note that once a candidate network is determined, we can perform a least-squares minimization to obtain the best *k* values for a particular sparsity structure. For example, if we choose *K*_{nom} as the sparsity structure and fit the least squares error over all 10 experiments, we get the following *K* matrix:

In figures 3A and 3B we show how the nominal system, with the *K* matrix given by Equation (22) compares in simulation with the *K* matrix given by Equation (23) for different initial conditions. We see that some initial conditions have better behavior for the two parameter sets than others. There is hope, however, that using other methods and through choice of a particular initial condition we can distinguish between the two network structures; the initial condition in Figure 3B shows some deviation in the dynamics of *x*_{1}, and designing an experiment that would yield 'maximum' deviation would allow for differentiability between various models that can explain the same data. More details about this approach can be found in [31].

#### A sample gene regulatory network

Consider the artificial gene regulatory network modeled through

where

*b*_{3} = 0.71, *b*_{5} = 0.80, *γ*_{
i
}= 0.1 and *d*_{
i
}= 1. The network is depicted in Figure 4, where solid lines with an arrow head denote activation and dash pointed lines with a hammer head denote inhibition.

We assume that *d*_{
i
}is known and *γ*_{
i
}= *γ* for all *i*, where 0.095 ≤ *γ* ≤ 0.105. We take 'measurements' every Δ*t* = 0.05 between *t* = 0 and *t* = 5 (time is in arbitrary units) from four different random initial conditions between 0 and 1 in order to obtain mock-up data. Solving (20) using the solver SeDuMi [32], we obtain the following results for matrices *B* and *K*:

*b*_{3} = 0.64, *b*_{5} = 0.80, and all other *b*_{
i
}= 0. Following the rules given by Remark 1, we are able to reconstruct the network shown in Figure 4. As the example shows, we are able to determine the interaction network given by (24) through the LP (20) even when degradation rates are considered uncertain.

### Reconstructing the glycolytic pathway of Lactococcus lactis

*Lactococcus lactis* is a bacterium used in the industrial production of cheese and buttermilk as it converts more than 90% of lactose (milk sugar) to lactic acid [14]. In general, the glycolytic pathway (or glycolysis) consists of chemical reactions that convert glucose into pyruvate. In the first step, glucose is converted into glucose-6-phosphate (G6P). A conversion of G6P into fructose-1,6-bisphosphate (FBP) follows, which is then converted sequentially to glyceraldehyde-3-phosphate (Ga3P), 3-phosphoglyceric acid (3-PGA) and PEP [16]. Additionally, Glucose and PEP are converted directly to pyruvate and G6P. Note that since measurement data for the intermediate Ga3P were unavailable, we include an additional rate denoting depletion of FBP. A simplified description of the pathway from reference [33] is depicted in Figure 5. The relative simplicity of this metabolic network makes *L. lactis* an attractive model for biological systems approaches [14]. A recent paper which presents an approach to determine the connectivity of this system and puts some emphasis on its sparsity is [16]. However, this approach does not take into account the characteristic particulars that make up a chemical reaction network. Here, we first use LP (11) to try to elucidate the glycolytic pathway of *L. lactis* using the same experimental data from [33].

Particularly, we wish to elucidate the glycolytic pathway of *L. lactis* under the assumption that the following complexes participate in the chemical reaction network: Glu, G6P, FBP, 2 × 3PGA, 2 × PEP, 2 × Pyru and Lact. In other words, we wish to obtain interaction topology *Aκ* of the chemical reaction network given by \dot{x} = *Y A*_{
κ
}*f* (*x*), where

Note that the network topology is completely determined by *A*_{
κ
}. Recall that

*A*_{
κ
}= *K* - diag(*K*^{T}*e*), *K*_{
ij
}≥ 0 ∀*i*, *j*.

Now, by solving (11) we indeed obtain a sparse chemical reaction topology (Figure 6(a)). However, the error between the model dynamics and experimental data is unreasonably large (Figure 6(b)). Therefore, it is not surprising that this configuration differs greatly from the the proposed pathway of Figure 5.

The ℓ_{1} regularized least squares problem, which is called *Lasso* is statistics, considers an objective function to minimize, which consists of the sum of the 1-norm of the vector of unknowns and the least squares of the error:

where *α* is a nonnegative constant that allows us to regulate the weight we put on the sparsity of *A* explicitly. Note that for *α* = 0, program (26) minimizes the the error between data and model dynamics solely (Figure 7(b)). This time, the error between the model dynamics and experimental data is considerably smaller. The connection topology is shown in Figure 7(a). Now, we increase *α* to see whether or which interconnections disappear without altering the system dynamics significantly. This pathway, which remains unaltered for 2 ≤ *α* ≤ 10, is shown in Figure 7(a). The dynamic behavior of this system is indistinguishable from the one shown in Figure 7(b) and, thus, is not shown.

Further increase of *α* results first in the disappearance of the links between G6P and FBP, and sequential changes do not result in 'sensible' connection topologies. Of course, this is something that in general the investigator does not know. While the pathway depicted in Figure 6(a) might be dismissed because the resulting model behavior compares badly with data, this argument does not hold for the pathway in Figure 7(a).

Now, we exploit the following related approach to try to deduce the interactions of the system by solving the following LP:

We solve (27) for *α* = 0, *α* = 2 and *α* = 3, and obtain the reaction pathway shown in Figure 8(a) which results in a model with the dynamics depicted in Figure 8(b). (Note that for 0 <*α* ≤ 75, the model dynamics are indistinguishable from the ones shown in Figure 8(b) and are not shown.)

The error between the model dynamics and experimental data is again considerably smaller than the error shown in Figure 6(b). As we can see from Figure 8(a), a relatively sparse reaction topology was obtained.

The pathway *x*_{1} → ... → *x*_{7} was almost reconstructed. A sensible assumption to make is that the degradation of G6P which appears at *α* = 3 corresponds to the conversion into FBP suggested at *α* = 2.

Also, the direct link between glucose and pyruvate was discovered. Finally, with

our approach provides a meaningful chemical reaction network of the form (4). (This matrix corresponds to the case when *α* = 2.) Nevertheless, without biochemical information the superiority of this pathway to the pathway in Figure 7(a) cannot be established and it follows that experiments have to be designed to discriminate between several competing models.

## Conclusion

We have presented a methodology for determining the interaction topology of biological networks, that are either affine in the unknown parameters or can be transformed to have this property, using time series data collected from experiments. We demonstrated the ability of our method to identify a chemical reaction network structure through several numerical examples. We have also tested our approach by elucidating the glycolytic pathway of the bacterium *Lactococcus lactis*. Our method respects the structural properties that chemical reaction network dynamics should have [11, 12].

In the case of gene regulatory networks, more realistic models could be used, but those would include additional parameters, first, by making the Hill coefficient in the activation and repression terms a free variable; and second, encoding the fact that when two transcription factors act on DNA, either both are required (AND) or any of them is sufficient (OR) for action. Thus, a valuable research direction is to investigate this case and establish whether similar analysis techniques to the ones presented in this paper can be used.

In (27) we introduced a free variable *α* whose value can change the solution considerably. Hence, it is worthwhile to explore different possible heuristics how to choose the value of this variable. (Here, we kept the balance between increasing *α* and keeping the model dynamics that followed from the solution of (27) relatively close to experimental data.) An iterative method can also be used, which uses 'live' information from simulations and then iterates with a simple Linear Program to find the network structure that fits best the parameters.

Finally, as shown, different methods or the same one with different constraints provide different models that represent the same data, which is an expected feature of such methods. It follows that experiments have to be designed to discriminate between competing models, in a way that 'closes the loop' between modelling and experiment (see for example [31]).

## References

Yeung MKS, Tegnér J, Collins JJ: Reverse engineering gene networks using singular value decomposition and robust regression. PNAS. 2002, 99 (9): 6163-6168.

Tegnér J, Yeung MKS, Hasty J, Collins JJ: Reverse engineering gene networks: Integrating genetic perturbations with dynamical modeling. PNAS. 2003, 100 (10): 5944-5949.

Julius AA, Zavlanos M, Boyd S, Pappas GJ: Genetic network identification using convex programming. Technical Report MS-CIS-07-20. 2007, Department of Computer and Information Science, University of Pennsylvania

Kholodenko BN, Kiyatkin A, Bruggeman FJ, Sontag ED, Westerhoff HV, Hoek JB: Untangling the wires: A strategy to trace functional interactions in signalling and gene networks. PNAS. 2002, 99 (20): 12841-12846.

Kholodenko BN, Sontag ED: Determination of functional network structure from local parameter dependence data. 2002, arXiv.org:physics/0205003.

Vance W, Arkin A, Ross J: Determination of causal connectivities of species in random networks. PNAS. 2002, 99 (9): 5816-5821.

Westerhoff HV, Kolodkin A, Conradie R, Wilkinson SJ, Bruggeman FJ, Krab K, van Schuppen JH, Hardin H, Bakker BM, Moné MJ, Rybakova KN, Eijken M, van Leeuwen HJP, Snoep JL: Systems biology towards life in silico: mathematics of the control of living cells. Journal of Mathematical Biology. 2009, 58 (1–2): 7-34.

Crampin EJ, Schnell S, McSharry PE: Mathematical and computational techniques to deduce complex biochemical reaction mechanisms. Prog Biophys Mol Biol. 2004, 86 (1): 77-112.

Ross J, Schreiber I, Vlad MO: Determination of Complex Reaction Mechanisms: Analysis of Chemical, Biological and Genetic Networks. 2006, Oxford, UK: Oxford University Press

Chevalier T, Schreiber I, Ross J: Toward a systematic determination of complex reaction mechanisms. J Phys Chem. 1993, 97 (26): 6776-6787.

Feinberg M: Lectures on chemical reaction networks. 1979, Mathematics Research Centre, University of Wisconsin, http://www.che.eng.ohio-state.edu/~FEINBERG/LecturesOnReactionNetworks

Feinberg M: Chemical reaction network structure and the stability of complex isothermal reactors-I. The deficiency zero and deficiency one theorems. Chem Eng Sci. 1987, 42 (10): 2229-2268.

Papachristodoulou A, Recht B: Determining Interconnections in Chemical Reaction Networks. Proceedings of the 2007 American Control Conference. 2007, 4872-4877. New York City, USA

Voit EO, Almeida J, Marino S, Lall R, Goel G, Neves AR, Santos H: Regulation of glycolysis in

*Lactococcus lactis*: An unfinished systems biological case study. Syst Biol (Stevenage). 2006, 153 (4): 286-298.del Rosario RCH, Mendoza E, Voit EO: Challenges in lin-log modelling of glycolysis in

*Lactococcus lactis*. IET Systems Biology. 2008, 2: 136-149.Srividhy J, Crampin EJ, McSharry PE, Schnell1 S: Reconstructing biochemical pathways from time course data. Proteomics. 2007, 7: 828-838.

Gunawardena J: Notes on Metabolic Control Analysis. 2002, Bauer Center for Genomics Research, Harvard University, Cambridge, MA 02138, USA, http://vcp.med.harvard.edu/papers/mca.pdf

Gunawardena J: Chemical reaction network theory for in-silico biologists. 2003, Bauer Center for Genomics Research, Harvard University, Cambridge, MA 02138, USA, http://vcp.med.harvard.edu/papers/crnt.pdf

Boyd S, Vandenberghe L: Convex Optimization. 2004, Cambridge, UK: Cambridge University Press

El Ghaoui L, Lebret H: Robust Solutions to Least-Squares Problems with uncertain data. SIAM J Matrix Anal Appl. 1997, 18 (4): 1035-1064.

Zavlanos MM, Julius AA, Boyd SP, Pappas GJ: Identification of Stable Genetic Networks using Convex Programming. 2008, 2755-2760. Proceedings of the American Control Conference, Seattle, Washington, USA

Candès EJ, Romberg J, Tao T: Stable signal recovery from incomplete and inaccurate measurements. Communications of Pure and Applied Mathematics. 2006, 59: 1207-1223.

Donoho DL, Tanner J: Sparse nonnegative solution of underdetermined linear equations by linear programming. PNAS. 2005, 102 (27): 9446-9451.

Soyster AL: Convex Programming with set-inclusive constraints and applications to inexact linear programming. Operations Research. 1973, 21: 1154-1157.

Bertsimas D, Sim M: The Price of Robustness. Operations Research. 2004, 52: 35-53.

Filkov V, Skiena S, Zhi J: Analysis Techniques for Microarray Time-Series Data. Journal of Computational Biology. 2002, 9 (2): 317-330.

Hana S, Yoon Y, Choc KH: Inferring biomolecular interaction networks based on convex optimization. Comput Biol Chem. 2007, 31 (5-6): 347-354.

Alon U: An Introduction to Systems Biology: Design Principles of Biological Circuits. 2006, Boca Raton, FL: Chapman & Hall/CRC

Mangan S, Alon U: Structure and function of the feed-forward loop network motif. PNAS. 2003, 100 (21): 11980-11985.

Setty Y, Mayo AE, Surette MG, Alon U: Detailed map of a cis-regulatory input function. PNAS. 2003, 100 (13): 7702-7707.

Papachristodoulou A, El-Samad H: Algorithms for Discriminating Between Biochemical Reaction Network Models: Towards Systematic Experimental Design. Proceedings of the 2007 American Control Conference, New York City, USA. 2007, 2714-2719.

Sturm JF: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software. 1999, 11–12: 625-653. http://sedumi.ie.lehigh.edu/

Neves AR, Ventura R, ansour N, Shearman C, Gasson MJ, Maycock C, Ramos A, Santos H: Is the Glycolytic Flux in

*Lactococcus lactis*Primarily Controlled by the Redox Charge? Kinetics of NAD^{+}and NADH Pools Determined*in vivo*by^{13}C NMR. The Journal of Biological Chemistry. 2002, 277 (31): 28088-28098.

## Acknowledgements

This work was financially supported by EPRSC project E05708X. The authors would like to thank Ben Recht and Mark Roberts for useful discussions about this paper.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Authors' contributions

The authors contributed equally to this work. The first author developed the algorithm for determining gene regulatory networks and performed the example of *L. Lactis* while the second author developed the algorithm for biochemical reaction networks and conceived the general idea. Both authors read and approved the final manuscript.

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

August, E., Papachristodoulou, A. Efficient, sparse biological network determination.
*BMC Syst Biol* **3**, 25 (2009). https://doi.org/10.1186/1752-0509-3-25

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/1752-0509-3-25