Regulatory network reconstruction using an integral additive model with flexible kernel functions

Novikov, Eugene; Barillot, Emmanuel

doi:10.1186/1752-0509-2-8

Research article
Open access
Published: 24 January 2008

BMC Systems Biology volume 2, Article number: 8 (2008)

Abstract

Background

Reconstruction of regulatory networks is one of the most challenging tasks of systems biology. A limited amount of experimental data and little prior knowledge make the problem difficult to solve. Although models that are currently used for inferring regulatory networks are sometimes able to make useful predictions about the structures and mechanisms of molecular interactions, there is still a strong demand to develop increasingly universal and accurate approaches for network reconstruction.

Results

The additive regulation model is represented by a set of differential equations and is frequently used for network inference from time series data. Here we generalize this model by converting differential equations into integral equations with adjustable kernel functions. These kernel functions can be selected based on prior knowledge or defined through iterative improvement in data analysis. This makes the integral model very flexible and thus capable of covering a broad range of biological systems more adequately and specifically than previous models.

Conclusion

We reconstructed network structures from artificial and real experimental data using differential and integral inference models. The artificial data were simulated using mathematical models implemented in JDesigner. The real data were publicly available yeast cell cycle microarray time series. The integral model outperformed the differential one for all cases. In the integral model, we tested the zero-degree polynomial and single exponential kernels. Further improvements could be expected if the kernel were selected more specifically depending on the system.

Background

One of the most challenging tasks of systems biology is to reconstruct structures and mechanisms of interaction between components of cellular systems from available experimental data. In view of recent technological developments for large-scale measurements of DNA expression levels, this problem can often be formulated more specifically as a problem of gene network inference from microarray gene expression data. In particular, microarray time-series represent an important source of information about cellular dynamics. Various approaches have been proposed to reconstruct network structures from microarray time series. These approaches include additive regulation models [1, 2], dynamic Bayesian networks (DBN) [3–5], S-system models [6, 7] and Boolean networks [8, 9]. Each of these concepts allows for several modifications, which multiplies the number of possible models for data analysis. The problem is not trivial as little is known about molecular interactions in experimentally observed systems. The mismatch between the real mechanisms used for data generation and the models used for network inference may lead to arbitrary network structures. Therefore it is difficult to expect that any one of the proposed formalizations can ensure acceptable performance for any biological system. Nevertheless further attempts to develop models that provide greater accuracy and flexibility with respect to the system under investigation would be appreciated.

The additive regulation model is a widely used approach for network inference from time series data [1]. It is represented by a set of ordinary differential equations:

\frac{d y_{i} (t)}{d t} = \sum_{j = 1}^{n} w_{i j} y_{j} (t) + b_{i}

(1)

where y_i(t) is the intensity level of node i at time t; n is the number of measured nodes; b_i is the constant output observed in the absence of regulatory inputs; and w_ij is the coefficient representing the influence of node j on the regulation of node i. As experimentally obtained time series are available only at a finite number N of discrete time points, the continuous differential representation (1) should be converted into the discrete-time form:

y_{i} (t_{k + 1}) = y_{i} (t_{k}) (1 + w_{i i} Δ t_{k}) + \sum_{\substack{j = 1 \\ j \neq i}}^{n} w_{i j} y_{j} (t_{k}) Δ t_{k} + b_{i} Δ t_{k}

(2)

where k = 1, ..., N-1 and Δt_k is the time interval between the measurements at times t_k and t_{k+1}.
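As an illustration, one update of the discrete-time form (2) can be sketched in a few lines of Python. The function name `discrete_step` and the nested-list layout of `w` are assumptions made for this sketch; they are not taken from the NETI package.

```python
def discrete_step(y, w, b, dt):
    """Advance all nodes by one time interval dt according to Eq. (2).

    y  : list of intensities y_j(t_k)
    w  : nested list, w[i][j] is the influence of node j on node i
    b  : list of constant outputs b_i
    """
    n = len(y)
    return [
        # self-regulation term y_i(t_k) * (1 + w_ii * dt)
        y[i] * (1.0 + w[i][i] * dt)
        # inputs from all other nodes (the sum excludes j == i)
        + sum(w[i][j] * y[j] for j in range(n) if j != i) * dt
        # constant background output
        + b[i] * dt
        for i in range(n)
    ]
```

Separating the w_ii term from the sum only mirrors the way Eq. (2) is written; the result is the same as y_i(t_k) + Δt_k (Σ_j w_ij y_j(t_k) + b_i).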

Network inference fits developed models to experimental data. Fitting adjusts the unknown model parameters so that an optimal value for a fitness criterion is ensured. For the inference model (2), this criterion can be defined as

χ^{2} = \frac{1}{N n - P} \sum_{i = 1}^{n} \sum_{k = 1}^{N} \frac{1}{ψ_{i k}^{2}} \left[ {\hat{y}}_{i} (t_{k + 1}) - {\hat{y}}_{i} (t_{k}) (1 + w_{i i} Δ t_{k}) - \sum_{\substack{j = 1 \\ j \neq i}}^{n} w_{i j} {\hat{y}}_{j} (t_{k}) Δ t_{k} - b_{i} Δ t_{k} \right]^{2}

(3)

where ŷ_i(t_k) are the measured time series, ψ_ik are the statistical weights and P is the number of estimated parameters. With the proper weights ψ_ik, a χ² criterion value close to 1 indicates an acceptable fit. The estimated parameters encode information about the structure of the network.
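A minimal sketch of how the criterion (3) could be evaluated is given below. It assumes that the data are stored as `y_hat[k][i]` (node i at time `t[k]`), that the sum runs over the N-1 transitions for which both ŷ_i(t_k) and ŷ_i(t_{k+1}) are measured, and that the normalization Nn - P is used as written in Eq. (3). The function name is hypothetical.

```python
def chi2_differential(y_hat, t, w, b, psi, n_params):
    """Evaluate the chi-square criterion of Eq. (3) for the differential model.

    y_hat[k][i] : measured intensity of node i at time t[k]
    psi[k][i]   : statistical weight for node i at time index k
    n_params    : number of estimated parameters P
    """
    n_times, n_nodes = len(t), len(y_hat[0])
    total = 0.0
    for k in range(n_times - 1):          # transitions t_k -> t_{k+1}
        dt = t[k + 1] - t[k]
        for i in range(n_nodes):
            # sum_j w_ij * y_j(t_k) + b_i, including the j == i term
            drive = sum(w[i][j] * y_hat[k][j] for j in range(n_nodes)) + b[i]
            residual = y_hat[k + 1][i] - y_hat[k][i] - drive * dt
            total += (residual / psi[k][i]) ** 2
    return total / (n_times * n_nodes - n_params)
```

With data generated exactly by Eq. (2) the residuals vanish and the function returns 0; values near 1 are expected only when `psi` matches the actual statistical error of the residuals.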

In this paper we generalize the additive regulation model by converting differential equations into integral equations with adjustable kernel functions. These kernel functions can be selected based on prior knowledge or defined through iterative improvement in data analysis. This makes the integral model very flexible and thus capable of covering a broad range of biological systems more adequately and specifically than previous models. As the number of the unknown parameters for even medium-sized networks may exceed the number of experimentally measured points, fitting algorithms for underdetermined problems have to be applied. Among different fitting strategies [10] the forward selection fitting algorithm has shown reasonable performance, in particular for sparse networks, and, therefore, it has been adopted in this paper.

We tested the proposed generalization for the additive regulation model with simulated and experimental data. Mathematical models have been developed for real biological systems including the glycolysis pathway in yeast [11] and the mitogen-activated protein kinase (MAPK) cascade [12]. These models are available as SBML modules [13, 14] that can be imported in JDesigner [15] to simulate time series. These time series are then sampled at random time intervals and statistical noise is added to mimic experimentally observed distortions. We also used the public yeast cell cycle microarray time series datasets measured by Spellman et al. [16] to demonstrate practical applicability of the developed approach.

Results

Mathematical Framework

The additive regulation model (1) can be easily used to derive first approximations for network structures. However, if the first-order ordinary differential equations (1) are not appropriate for a particular system or experimental dataset, the inference approach based on Eq. (1) provides little possibility for easy adjustments. Therefore we are looking for generalizations of the basic additive regulation model (1) that would allow us to systematically approximate a broader range of dynamic behaviors. With this aim we integrate the ordinary differential equation (1), yielding:

y_{i} (t) - y_{i} (t_{0}) = \sum_{j = 1}^{n} w_{i j} \int_{t_{0}}^{t} y_{j} (x) d x + b_{i} (t - t_{0})

(4)

where t₀ is the initial time point. The coefficient w_ij can be moved under the integral and converted into the function w_ij(t, x):

y_{i} (t) = \sum_{j = 1}^{n} \int_{t_{0}}^{t} w_{i j} (t, x) y_{j} (x) d x + b_{i} (t, t_{0})

(5)

where b_i(t, t₀) is a function generalizing the second term on the right-hand side of Eq. (4). The fitness criterion for the integral model can be defined similarly to Eq. (3):

χ^{2} = \frac{1}{N n - P} \sum_{i = 1}^{n} \sum_{k = 1}^{N} \frac{1}{ψ_{i k}^{2}} \left[ {\hat{y}}_{i} (t_{k}) - \sum_{j = 1}^{n} \int_{t_{0}}^{t_{k}} w_{i j} (t_{k}, x) {\hat{y}}_{j} (x) d x - b_{i} (t_{k}, t_{0}) \right]^{2}

(6)

Now the inference model is completely defined by the kernel functions w_ij(t, x) and by the background functions b_i(t, t₀). This model, besides higher flexibility, allows for a straightforward interpretation in terms of control theory [17]. The integral equation (5) can be considered as the response of a system (gene i, in our case) to the n external inputs, represented by y_j(t), with w_ij(t, x) being the system impulse response functions.
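To make Eq. (5) concrete, the sketch below evaluates its right-hand side for one node in the simplest case of a constant kernel, w_ij(t, x) = w_ij. Two choices here are assumptions of this illustration rather than statements about the Methods section: the integrals over the sampled data are approximated by the trapezoidal rule, and the background is taken as b_i(t, t₀) = y_i(t₀) + b_i (t - t₀), following Eq. (4). All function names are hypothetical.

```python
def cumulative_trapezoid(t, y):
    """Trapezoidal approximation of the integral of y from t[0] to each t[k]."""
    out = [0.0]
    for k in range(1, len(t)):
        out.append(out[-1] + 0.5 * (y[k] + y[k - 1]) * (t[k] - t[k - 1]))
    return out


def predict_constant_kernel(t, y_hat, w_i, b_i, y0_i):
    """Right-hand side of Eq. (5) for node i with a constant kernel.

    y_hat[k][j] : measured intensity of node j at time t[k]
    w_i[j]      : constant kernel value w_ij
    b_i, y0_i   : background slope and initial level (assumed form of b_i(t, t0))
    """
    n = len(w_i)
    # one running integral per input node j
    integrals = [cumulative_trapezoid(t, [row[j] for row in y_hat]) for j in range(n)]
    return [
        y0_i
        + sum(w_i[j] * integrals[j][k] for j in range(n))
        + b_i * (t[k] - t[0])
        for k in range(len(t))
    ]
```

The prediction is linear in w_ij and b_i, so each integral column can serve directly as a regressor in a linear fit.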

We propose the integral model (5) as a generic environment for devising more specific models. Instead of changing the form of the differential equation (which may lead to reprogramming of the inference algorithm), the integral model (5) allows for continuous change of the various parameters of the kernel or background functions. The parameters that are known from prior knowledge can be fixed in analysis, whereas the others can be made free and estimated from experimental data. Certain parameters can also be used to identify the shape of the kernel or background functions. Some examples of the generic representations for the kernel functions are given in the Methods section.

Higher model flexibility is accompanied by larger uncertainty about the derived structures, as different models or sets of model parameters can be in accordance with experimental data. Typical solutions for underdetermined problems are to collect more experimental data or to use more prior knowledge from the other sources of information. The advantage of the integral inference model is that (i) once we have more experimental data, we can leave more parameters free in fitting, and (ii) once we have more prior knowledge, we can smoothly integrate it in the inference model. In contrast, the differential model (2) needs to be redefined and reprogrammed in both cases.

The kernel or background functions can be rather complex for adequate description of the molecular/genetic interactions. As little has been formalized in this field so far, we have to use approximations. We look for representations of w_ij(t, x) and b_i(t, t₀) that make the inference model linear with respect to the unknown parameters. Such models can be represented as linear regression models, allowing us to compute the best-fit parameters directly from the data. It is also straightforward to apply non-linear models, but these lead to non-linear regression, which requires computationally intensive, iterative approaches. Therefore we generally prefer to use linear models unless we have strong evidence or prior knowledge that a model should be non-linear. Three linear models (polynomial, exponential and delta-function) for w_ij(t, x) and b_i(t, t₀) are presented in the Methods section.

Fitting Algorithm

The network reconstruction using the differential additive model (2) has been described in the Background section. The same approach can be applied for the developed integral model (5): this model is fit to experimental data and the unknown parameters are estimated by minimizing the χ² fitness criterion (6). Links are created for those estimated parameters that differ significantly from zero, and these links form the network structure. In [10], different strategies to search for optimal network structures have been reviewed and compared. The searching strategies are model independent and therefore can be applied to both models, (2) and (5), without modification. Here we apply the forward selection algorithm [10] as a good compromise between prediction accuracy and speed of processing. The algorithm we use is essentially equivalent to the "Forw-reest-K" algorithm from [10]; we have only extended the set of stopping criteria. The implemented algorithm is outlined as follows:

1. We begin without links for the network. A default model defined by Eq. (2) with all w_ij = 0 or by Eq. (5) with all w_ij(t, x) ≡ 0 is assigned to each non-interacting node.

2. The default model is fit to the data and the χ² fitness criterion is calculated for each node.

3. The node showing the largest χ² value is probably regulated by one of the other nodes. A link between the node of interest and each of the other nodes that are not yet identified as regulators for the node of interest is created.

4. The resulting sub-network is fit to the experimental data. The link that ensures the best quality of fit is conserved in the system.

5. The procedure generates links until the stopping criterion is fulfilled. We have implemented the following stop-criteria:

• We stop the procedure if the node with the lowest quality of fit is already linked to all the other nodes of the network. Thus, there are no more free nodes that can improve the fit for the node of interest (i.e. the node is saturated). This indicates that either the algorithm has reached a local minimum or the inference model is not correct. In either case we could still increase the overall quality of fit by fitting some of the other nodes more precisely. However, this may lead to over-fitting for these nodes and is therefore undesirable.

• The procedure can be stopped if the overall χ² quality criterion has reached a certain limit, or when the overall number of links (or the maximum number of links for one node) surpasses a user-defined value.

• Finally, the user can decide when to stop iterations based on visual inspection of the residuals – the differences between the experimental and the reconstructed time series. However, this may be problematic for large networks.
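The steps above can be sketched as follows. This is a simplified illustration, not the NETI implementation: it uses unit statistical weights (so the criterion is a plain residual sum of squares), omits the background term, allows any node (including the node itself) as a candidate regulator, and solves each small regression through the normal equations, which fails for collinear regressors. The names `solve_least_squares`, `node_rss` and `forward_selection` are hypothetical.

```python
def solve_least_squares(columns, target):
    """Least-squares coefficients from the normal equations (Gauss-Jordan)."""
    p = len(columns)
    if p == 0:
        return []
    # augmented matrix [X^T X | X^T z]
    a = [[sum(u * v for u, v in zip(columns[r], columns[c])) for c in range(p)]
         + [sum(u * z for u, z in zip(columns[r], target))] for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[r][p] / a[r][r] for r in range(p)]


def node_rss(columns, target):
    """Residual sum of squares of the best linear fit of target by columns."""
    coef = solve_least_squares(columns, target)
    return sum((z - sum(c * col[k] for c, col in zip(coef, columns))) ** 2
               for k, z in enumerate(target))


def forward_selection(regressors, targets, rss_limit, max_links):
    """Greedy link generation.

    regressors[j] : input column contributed by node j
    targets[i]    : response column of node i
    Returns a dict mapping each node to the list of its selected regulators.
    """
    n = len(targets)
    links = {i: [] for i in range(n)}
    rss = [node_rss([], targets[i]) for i in range(n)]   # default model, no links
    while sum(len(v) for v in links.values()) < max_links and sum(rss) > rss_limit:
        worst = max(range(n), key=lambda i: rss[i])      # node with the poorest fit
        free = [j for j in range(n) if j not in links[worst]]
        if not free:                                     # node is saturated
            break
        # refit the node with each candidate link added in turn
        trial = {j: node_rss([regressors[q] for q in links[worst] + [j]],
                             targets[worst]) for j in free}
        best = min(trial, key=trial.get)
        links[worst].append(best)                        # keep the best link
        rss[worst] = trial[best]
    return links
```

For the differential model (2), `targets[i]` would hold ŷ_i(t_{k+1}) - ŷ_i(t_k) and `regressors[j]` the products ŷ_j(t_k) Δt_k; for the integral model (5) with a constant kernel, `regressors[j]` would hold the running integrals of ŷ_j.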

We use the χ² criterion as an indicator of correspondence between the inference model and experimental data because the inference model is expected to reproduce experimental data. However, if the statistical weights ψ_ik in Eqs. (3) and (6) are not correct, the absolute value of the χ² criterion is meaningless. Using the experimental errors as ψ_ik can lead to overestimation of χ², because experimental data appear on both the left- and right-hand sides of the fitting models (2) and (5). Integration averages the experimental errors on the right-hand side of Eq. (5). Thus, their contribution to the overall statistical error can be ignored, and ψ_ik is equal to the experimental error. The sum on the right-hand side of Eq. (2) can also be considered a smoothing operation. However, the error of the experimental point y_i(t_k) in the first term on the right-hand side of Eq. (2) is comparable to that of y_i(t_{k+1}) on the left-hand side and must be taken into account. In this case we define ψ_ik as the product of the experimental error and √2. Then χ² values close to 1 indicate an appropriate fit for both models.
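The choice of weights described above amounts to a one-line rule, sketched here with a hypothetical helper. The factor √2 reflects the fact that the difference of two independent measurements with equal standard deviation σ has standard deviation √2·σ; treating the errors at adjacent time points as equal is an assumption of this sketch.

```python
import math


def statistical_weights(sigma, model):
    """Return psi[k][i] from the experimental errors sigma[k][i].

    model : "differential" scales the errors by sqrt(2), because a noisy point
            appears on both sides of Eq. (2); any other value (for example
            "integral") returns the experimental errors unchanged.
    """
    factor = math.sqrt(2.0) if model == "differential" else 1.0
    return [[factor * s for s in row] for row in sigma]
```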

If we assume that any link between any pair of nodes is possible, then the number of the unknown parameters can exceed the size of experimental datasets that are typically available. This leads to underdetermined systems and requires additional conditions to regularize the solution. In this respect the forward selection proceeds in a "natural", although not optimal, way: a new link is added only when it is necessary to increase the quality of fit.

The main problem of the algorithm is that it can easily be trapped in local minima. If a wrong node is selected at an early iteration because it gives the best quality of fit for the selected node, the decision cannot be reconsidered at later iterations, even in the light of links created after that wrong decision. Nevertheless, we found that this algorithm performs reasonably well in many cases, particularly for relatively sparse networks.

Testing

We compared performances of the differential and integral inference models using various artificial systems producing simulated data and three experimental datasets from [16].

As available experimental datasets are typically limited in size, we explored models where the number of free (fit) parameters was small. Thus we tested two kernels for the integral model: the zero-degree polynomial (L_w = 0 in Eq. (8) and L_b = 0 in Eq. (9)) and the single exponential (L_w = 1 in Eq. (13) and L_b = 0 in Eq. (14)). In each case we had one free parameter per link. This also equalizes the degrees of freedom in the compared differential and integral inference models. The delta-function model described in the Methods section was not applied because all tested systems demonstrate behavior continuous in time.

To assess how far our predictions are from random, we also applied the integral model with the zero-degree polynomial kernel to infer network structures from permuted data, i.e. data in which node labels are randomly assigned to the generated time series.
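Such a permutation control can be sketched as below. The function name and the `y_hat[k][i]` data layout are assumptions of this illustration; the random generator is passed in explicitly so that a run can be reproduced.

```python
import random


def permute_node_labels(y_hat, rng):
    """Randomly reassign node labels to the time series.

    y_hat[k][i] : intensity of node i at time index k
    rng         : a random.Random instance
    Returns the relabelled data and the permutation that was applied.
    """
    n = len(y_hat[0])
    perm = list(range(n))
    rng.shuffle(perm)
    # the series stored under label i now comes from node perm[i]
    permuted = [[row[perm[i]] for i in range(n)] for row in y_hat]
    return permuted, perm
```

The true network structure is kept under the original labels, so links inferred from the permuted data are scored against it in the usual way.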

Arbitrary Networks

In the first set of experiments the model used for network inference was that used for data generation.

Simulation

Artificial regulatory networks were generated with random and scale-free topologies. For random topology, any two nodes are connected with probability p, independently of the other connections. For scale-free topology [18], the number of links at each node is approximated by a power-law distribution p(k) ~ k^(-γ). We used the growing network with redirection algorithm [19] to generate networks with scale-free topology. The number of nodes in the generated networks was 20; the probability p for the random networks was equal to 0.05; and the parameter γ for the scale-free networks was set to 2.5 for all cases. We demonstrate examples of networks with random topology (Fig. 1a) and scale-free topology (Fig. 1b). A set of first-order ordinary differential equations (1) was used to simulate time series. The parameters w_ij were randomly generated from the uniform distribution in the interval [-1;1]. The background levels b_i were set to zero and the initial states y_i(t₀) were set to 1 for all nodes.
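The random-topology case can be sketched as follows (the scale-free generator of [19] is not reproduced here). The function name is hypothetical, and allowing self-links (i = j) is an assumption of this sketch.

```python
import random


def random_network(n, p, rng):
    """Generate a weight matrix for a random topology.

    Each ordered pair (i, j) is linked with probability p, independently of
    the other pairs; a linked pair receives a weight w_ij drawn uniformly
    from [-1, 1], an unlinked pair receives 0.
    """
    return [
        [rng.uniform(-1.0, 1.0) if rng.random() < p else 0.0 for _ in range(n)]
        for _ in range(n)
    ]


# settings used in the text: 20 nodes, p = 0.05
w = random_network(20, 0.05, random.Random(1))
```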

We used the fourth-order Runge-Kutta formula [20] to numerically solve differential equations (1). The solution was built on 1000 time points uniformly spaced over the interval [0;10]. The resulting time series were sampled to produce 20 time points to approximate the quality of experimental data. We split the original 1000-point time series into 20 intervals of 50 points. In each interval the output time point was randomly selected. This led to a time series with non-homogeneous (random) time intervals between subsequent measurements. Each of the 20 intensity values was statistically distorted. The distorted value was generated as a Gaussian random variable with the mean equal to the true value and standard deviation proportional to the true value. The coefficient of proportionality (the noise-to-signal level) was set to 0.05.
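The simulation pipeline described above (integration, random sampling, noise) can be sketched as follows. Function names and the data layout are assumptions of this illustration, and the standard deviation is taken as the noise level times the absolute true value so that it stays non-negative.

```python
import random


def rk4_linear(w, b, y0, t_end, steps):
    """Classical fourth-order Runge-Kutta solution of Eq. (1) on a uniform grid."""
    n = len(y0)

    def f(y):
        return [sum(w[i][j] * y[j] for j in range(n)) + b[i] for i in range(n)]

    h = t_end / (steps - 1)
    ys = [list(y0)]
    for _ in range(steps - 1):
        y = ys[-1]
        k1 = f(y)
        k2 = f([y[i] + 0.5 * h * k1[i] for i in range(n)])
        k3 = f([y[i] + 0.5 * h * k2[i] for i in range(n)])
        k4 = f([y[i] + h * k3[i] for i in range(n)])
        ys.append([y[i] + h / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
                   for i in range(n)])
    ts = [k * h for k in range(steps)]
    return ts, ys


def sample_and_distort(ts, ys, n_samples, noise, rng):
    """Pick one random point per interval and add proportional Gaussian noise."""
    block = len(ts) // n_samples              # e.g. 1000 // 20 = 50 points
    out_t, out_y = [], []
    for s in range(n_samples):
        k = rng.randrange(s * block, (s + 1) * block)
        out_t.append(ts[k])
        out_y.append([rng.gauss(v, noise * abs(v)) for v in ys[k]])
    return out_t, out_y
```

With the settings of the text one would call `rk4_linear(w, b, y0, 10.0, 1000)` followed by `sample_and_distort(ts, ys, 20, 0.05, rng)`.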

Inference

As time series were simulated using a set of first-order ordinary differential equations, the corresponding inference model is either the differential model (Eq. (2)) or the integral model (Eq. (5)) with the zero-degree polynomial kernel (L_w = 0 in Eq. (8) and L_b = 0 in Eq. (9)). Although the single exponential kernel may also be used in this case, it is clearly inadequate and therefore was not tested.

We reconstructed the networks from the generated time series using the forward selection procedure. Each time the fitting procedure added a new link, we updated the number of links for True Positives (TP), False Positives (FP) and False Negatives (FN). Then TP, FP and FN values were combined to estimate Positive Predictive Value (PPV) and Sensitivity value (Se) defined as in [21]:

\begin{matrix} P P V = \frac{T P}{T P + F P}; & S e = \frac{T P}{T P + F N} \end{matrix}

(7)
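Eq. (7) can be evaluated directly from the sets of true and inferred links, as in the sketch below. Representing a link as a (regulator, target) tuple and returning 0 when a denominator vanishes are conventions chosen for this illustration.

```python
def ppv_and_sensitivity(true_links, inferred_links):
    """Compute PPV and Se of Eq. (7) from two collections of links."""
    true_links, inferred_links = set(true_links), set(inferred_links)
    tp = len(true_links & inferred_links)   # correctly inferred links
    fp = len(inferred_links - true_links)   # inferred but not present
    fn = len(true_links - inferred_links)   # present but not inferred
    ppv = tp / (tp + fp) if tp + fp else 0.0
    se = tp / (tp + fn) if tp + fn else 0.0
    return ppv, se
```

Calling this function after every link added by the forward selection procedure yields PPV and Se as functions of the total number of links.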

Other possible performance measures, such as negative predictive value or specificity, are not relevant for sparse networks when the forward selection procedure is used for reconstruction. During the first iterations of the fitting procedure the number of true negatives (TN) greatly exceeds the number of TP, which levels out the differences between reconstruction models.

We stopped the forward selection procedure if the χ² fitting criterion became smaller than 0.5 or if a particular node became saturated. Adequate fit should give χ² values close to 1, as experimental errors (and thus the statistical weights ψ_ik) in the χ² criteria of Eqs. (3) and (6) are directly accessible in simulations. Limiting the value of the χ² criterion to 0.5 leads to substantial over-fitting. However, as we recorded the history of generated links (PPV, Se and χ² value after each added link), this allowed us to explore a broader range of model fitness values.

We averaged the dependence of PPV and Se on the total number of links over 100 runs of the simulation procedure. A different network structure, different link parameters, different time sampling and different noise realizations were generated at each run.

Artificial Biological Systems

We used two mathematical models for real biological systems (yeast glycolysis [11] and the MAPK cascade [12]) to test the performance of the developed inference models for more realistic systems. These models can be imported in JDesigner [15] as SBML modules [13, 14] and used to simulate time series. The network structures and SBML files used for simulations are also available from our web page [22]. We stress that we used these modules as they were originally developed, i.e. without any modifications in the structure or in the kinetic parameters of the models. Mathematical representations and kinetic parameters of the models can be viewed in JDesigner. We used JDesigner to integrate the models on 100 time points spaced uniformly over the interval [0;1] for yeast glycolysis and [0;100] for the MAPK cascade.

Two data distorting steps were performed as before: we left 20 time points at random time intervals, and added Gaussian noise with noise-to-signal level equal to 0.05. Examples of time series used for the inference are available on our web page [22].

Besides comparing the differential and integral inference models, we also tested here two kernels for the integral model: the zero-degree polynomial (L_w = 0 in Eq. (8) and L_b = 0 in Eq. (9)) and single exponential (L_w = 1 in Eq. (13) and L_b = 0 in Eq. (14)).

The forward selection fitting procedure generated the dependence of the PPV, Se (Eq. (7)) and χ² criteria (Eqs. (3) and (6)) on the total number of generated links. The resulting curves were averaged over 100 runs of the simulation procedure. The simulation procedure generated different time sampling and different realizations of noise at each run, whereas the network structure, kinetic laws and kinetic parameters remained the same.

Real Data

To demonstrate applicability of the developed approach to real experimental data, we used the yeast (Saccharomyces cerevisiae) cell cycle microarray time series dataset [16]. This dataset consists of three sub-sets measured using different cell synchronization methods [16]: α factor-based (alpha, 18 time points), size-based (elu, 14 time points) and cdc15-based (cdc15, 24 time points).

As others did [23–25], we selected a part of the yeast cell cycle pathway available from KEGG [26] (Fig. 2). Assuming that this pathway reflects biological reality, we can count the number of TP, FP and FN and calculate PPV and Se as it is done for artificial systems.

As experimental errors and therefore the statistical weights ψ_ik in Eqs. (3) and (6) were not available, the absolute value of the χ² fitting criterion could not be used as a stopping condition for the forward selection procedure. However, as will be shown for artificial systems (see the Discussion section), numerous FP links are required to bring the χ² criterion close to 1. Since the fitting models are very approximate, it may not always be reasonable to require perfect fitting quality. Therefore we investigated the performance (PPV and Se) of the inference models as a function of the number of generated links.

As for the artificial systems, we compared here performances of the differential and integral inference models. In the integral model we used the same two kernels: the zero-degree polynomial (L_w = 0 in Eq. (8) and L_b = 0 in Eq. (9)) and single exponential (L_w = 1 in Eq. (13) and L_b = 0 in Eq. (14)).

We also applied the DBN approach to infer network structures from the experimental datasets. We used the Banjo software [27] to perform Bayesian inference. For analysis, we selected the alpha and elu datasets, as only these two datasets were measured at equidistant time points, which is a prerequisite for Banjo. To run Banjo we used the same input settings as given in [21]. We calculated PPV and Se for the inferred networks that had the highest score in the Banjo output.

Independent artificial data

Finally, we performed an additional comparison of the differential and integral inference models based on an independent set of artificial data described in [21]. Briefly, 20 random 10-gene networks with an average in-degree per gene of 2 were generated. For each network, time-series data (1000 time points) were simulated using linear ordinary differential equations. Each data point was statistically distorted with noise-to-signal ratio equal to 0.1. In our analysis we first sampled the 1000-point time series to produce 20-point time series, which were then used for network reconstruction. As the network structures are known, we built the dependencies of PPV and Se on the number of generated links for each network. The obtained dependencies were further averaged over 20 networks.

Software

The developed algorithms for the network inference were implemented in the software package NETI, freely available from our web page [22].