An approach for dynamical network reconstruction of simple network motifs

Background One of the most important projects in the post-genome-era is the systemic identification of biological network. The almost of studies for network identification focused on the improvement of computational efficiency in large-scale network inference of complex system with cyclic relations and few attempted have been done for answering practical problem occurred in real biological systems. In this study, we focused to evaluate inferring performance of our previously proposed method for inferring biological network on simple network motifs. Results We evaluated the network inferring accuracy and efficiency of our previously proposed network inferring algorithm, by using 6 kinds of repeated appearance of highly significant network motifs in the regulatory network of E. coli proposed by Shen-Orr et al and Herrgård et al, and 2 kinds of network motif in S. cerevisiae proposed by Lee et. al. As a result, our method could reconstruct about 40% of interactions in network motif from time-series data set. Moreover the introduction of time-series data of one-factor disrupted model could remarkably improved the performance of network inference. Conclusions The results of network inference examination of E. coli network motif shows that our network inferring algorithm was able to apply to typical topology of biological network. A continuous examination of inferring well established network motif in biology would strengthen the applicability of our algorithm to the realistic biological network.


Background
The investigation of network dynamics in biology is a major issue in systems and synthetic biology. Recent advances in high-throughput technologies for comprehensive observation of cells produce a lot of data for analyzing dynamics of complex system such as gene regulatory networks and metabolic pathways. Time-series with dynamic behavior are one of such data involving enormous amount of information regarding the regulation of biological network in vivo. However, as such information is entirely implicit, it requires the development of adequate analytic and computational methods to reconstruct biological systems. The key in developing such computational methods is to build a reliable mathematical model for analyzing biological networks, and to explore parameter values in the model within vast searching space. Tominaga et al. and Maki et al. have developed a novel method [1,2] inferring conceptual biological networks by the combination of a dynamical network model called S-system [3] with a traditional parameter estimation based on simple genetic algorithms [4,5]. The S-system is based on an ordinary differential equation, in which the temporal (time-dependent) dynamic process of system components are characterized by power-law formalism. The S-system is suitable for conceptual modeling and describing complex systems with a loop or a cyclic interaction because the dynamic behavior of the network can be easily obtained by numerical integration and customized [6].
The values of interrelated coefficients in the formalism are directly or indirectly related to the regulation mechanism in the network model. The inferred network structure from the inference of parameters provides one of the best candidates for the biological network structure. However, S-system requires a large number of parameters that must be estimated to identify dynamical biological networks; the number of estimated parameters is 2n(n + 1) (where n is the number of system components).
We previously proposed efficient procedures for inferring biological network based on experimentally observed time-series data of mRNA or metabolites [7-10] using S-system and real-coded genetic algorithms (RCGAs) [11] with a combination of uni-modal normal distribution crossover(UNDX) [12] and minimal generation gap(MGG) [13]. Other groups have also developed several methods tooptimize parameters using S-system [14][15][16][17][18][19], Beside of S-system modeling, a lot of network reconstruction algorithms from time-series have been developed [20][21][22][23][24][25][26][27]. However, most of the works focused on the improvement of computational efficiency in large-scale network inference of complex systems with cyclic relations and few attempts have been done for answering practical problems occurred in real biological systems. Herrgård et. al., Shen-Orr et. al., and Lee et. al. proposed that the gene regulatory network in Escherichia coli or Saccharomyces cerevisiae identified by experimental studies is composed of the limited number of network motif; each motif has simple form of relationships between transcription factors and genes [28,29]. Little attention has been paid to evaluate the performance of network inference for such simple network motifs with dynamical modeling, S-system. In this paper, in order to evaluate the inferring performance of our previously proposed network inferring algorithms, we applied our algorithm to 8 kinds of simple form of network motifs proposed by Shen-Orr et. al. [29], Herrgård et. al. [28], and Lee et. al.

Results of network identification
We inferred network candidates 100 times for each network motif, based on artificial generated time-series data sets (see Figure 1, 2 and 3). After obtaining 100 network candidates, we calculated precision and recall (see Figure 4) to evaluate the accuracy and efficiency of our algorithm from network structural (topological) point of view. In the case of network inferences for DOR(Dense overlapping regulation), FF(Feed-forward), RM(Regulator Module) and TM(Target Module) networks, the value of recall becomes around 0.4, which indicates about 40% of interactions in the network model are properly reconstructed by our algorithm. The better case were for RI(Regulatory Interaction), AR (Autoregulation), and ML(Multicomponent-Loop) network, and our algorithm could reconstruct around 60% of interactions in the network motif. We also calculated F-measure to evaluate balance of accuracy and efficiency of network identification. Also in F-measure, network identification for RI, AR, and ML network represents better estimation results compared with DOR, FF, RM, SIM and TM. On the other hand, the low values of precision were observed in the all cases, indicating that network candidates inferred from our algorithms. The performance of inferring accuracy (precision) was relatively low, namely, the inferred network candidates contain many false-positive interactions. Figure 5(A) shows the best case of identified network topology for SIM (precision : 0.33, recall : 1.0). Figure 5(A) contains all regulatory interactions in network motif for SIM (shown in Figure 1(E)). However, this contains many false-positive interactions, such as self-degradation for synthetic process, and inhibitory regulation from X2, X3, and X4 to X1.
However, the lower values of precision were often observed in our previous works applied to other types of networks, so that we have already developed a method to remove the false-positive interactions inferred by parallel computing [7,8]. Even though we can apply our previously proposed method to improve the precision values, our aim here is to see how both precision and recall values can be improved by altering the information content of time-series data.
We thus focus on the inferred network candidates for SIM since the performance of accuracy and efficiency ( Figure 4(E)) was very low. There is a possibility that the imbalance between huge degree of freedom in S-system network modeling and information amount in reference time-series data yields such low performance of accuracy and efficiency. In other words, the information content of the single reference time-series data (shown in Figure 3(E)) is not enough to identify network candidates. To overcome this situation, we tried to infer network candidates by testing another kind of time-series data, more strictly, one-factor disrupted model. We prepared time-series data for one-factor disrupted model as shown in Figure 6. The S-system parameters for reconstructing networks are same as Figure 2 and 3 except the rate constant for the synthetic process of disrupted factor. We prepared timeseries data for the one-factor disrupted model with the rate constant for the synthetic process of disrupted factor i(a i ) set to 0.0. We inferred 8 network candidates from 5 time-series data including wild-type (see Figure 3(E)) and one-factor disrupted strain. The comparison between single and 5 time-series in inferring accuracy and efficiency is shown in Figure 7. The result shows Regulatory Interaction (RI): a factor X 1 regulates X 2 , X 2 regulates X 3 , and X 3 regulates X 4 . (B) Regulator Module (RM): a factor X 1 regulates a set of regulons shown as X 2 to X 4 . (C) Target Module (TM): a regulon X 4 is regulated by other factors shown as X 1 to X 4 . (D) Feed-Forward (FF): a regulator X 1 regulates X 2 , and both jointly regulates X 4 . (E) Single Input Module (SIM): same as RM, a factor X1 regulates a set of regulons shown as X 2 to X 4 , and also regulates X 1 itself. (F) Dense Overlapping Regulation(AR): regulons X 3 and X 4 were regulated by regulator X 1 and X 2 , respectively. (G) Autoregulation: X 1 regulates X 2 , X 2 regulates X 3 , X 3 regulates X 4 , and X 4 inhibitory regulates X 1 . (F) Multicomponent-Loop (ML): X 1 regulates X 2 , X 2 inhibitory regulates X 3 , X 3 regulates X 4 , and X 4 inhibitory regulates X 1 .
that the performance is remarkably improved compared with the case in single time-series. We applied the same data to other motifs (data not shown) and found that the introduction of time-series data using one-factor disrupted model can improve the performance of our algorithm.

Conclusions
We applied our previously proposed algorithm to the network motifs proposed by Herrgård and Shen-Orr. As a result, the efficiency (recall) of our method exhibited relatively high in most of network motifs. In particular, in the Regulatory Interactions (RI) model, we reconstructed about 68% of interactions in the model. Interestingly, the performance of network inference for complex regulatory network including cyclic interactions (AR and ML) was better than that for simple network analyzed in this study. It is likely that the abundant information related to dynamic behavior contained in time-series data for complex regulatory network constrains the degree of freedom S-system modeling, for this reason, the false-positive or false-negative interactions for complex network are reduced.
In order to examine how to improve both the accuracy and efficiency, we attempted to infer the network candidates based on 5 time-series data including time-series for one-factor disrupted model. In this situation, the performance of inferring accuracy and efficiency remarkably increased. This result suggests that the inferring performance can be improved by adding other kinds of time-series data.
Note that the present performance is examined by a set of data generated from arbitrary given parameter values. We should test the performance of our method for various structures of networks with different parameters as well as for observed data. From practical point of view, there have been various kinds of data accumulated under different experimental conditions. The differential information content of such data is expected to further improve the performance of our method. A continuous examination of inferring well-established network motifs in biology would strengthen the applicability of our algorithm to the realistic biological network including gene regulatory networks or metabolic pathways.

Material
In order to evaluate the applicability of our inferring algorithm, we prepared 8 kinds of artificial network models, Regulatory Interaction (RI), Regulator Module (RM), Target Module (TM), Feed-Forward (FF), Single  [28,29], and that of Saccharomyces cerevisiae proposed by Lee et. al [30]. We modified the 8 network motifs to network models consisting of 4 nodes (X 1 , X 2 , X 3 , and X 4 ) without a loss of each network topology. Figure 1 shows each network structure analyzed in this paper.
Subsequently, we prepared artificial time-series data containing 40 sampling points for each network motif by the numerical integration [6]. The reference time-series data of 8 network models are shown in Figure 2.

S-system formalism
S-system is a suitable formalism for dealing with gene expression network or conceptual metabolic pathway structures. It can sufficiently represent the structure of organizationally complex system, to capture the essence of experimentally observed response: where n is the number of system components (genes or metabolites) in the investigating network, X i is the experimentally observed response (gene expression level for gene expression network, or concentration of metabolites Figure 3 Reference time-series. The artificial reference time-series simulated based on 8 kinds of network motif in Figure 1. We prepared these time-series under following conditions: the number of sampling point is 40, initial value of X 1 is 10.0, that of X 2 is 7.0, X 3 is 3.0, and X 4 is 1.0 for RI, RM, TM, FF, SIM, and DOR. For ML, initial values are X 1 = X 2 = X 3 = X 4 = 5.0. We set h ii , which represents the interrelated coefficient for self degradation, at 1.    for metabolic pathway's investigation), a i and b i are apparent positive rate constant, and g ij and h ij are interrelated coefficients between X i s.
The first term on the right-hand side of Eq. (1) corresponds to the synthetic process of X i , and the second term expresses the degradation process of X i . The value of g ij (h ij ) express the interactive effects of X j to the synthetic process (degradation process) of X i . The value of g ij (h ij ) also determine the structure of the interactions between X i and X j . When the value of g ij (h ij ) is positive, X j induces a synthetic process (degradation process) of X i . On the other hand, when g ij (h ij ) is negative, X j suppresses the synthetic process (degradation process) of X i . When the value of g ij (h ij ) is zero, then there are no effects of X j on the synthetic (degradation) process of X j .
The biological network can be inferred by estimating a i , b i , and h ij in the S-system formula. A representation of S-system parameters to be estimated is shown in Figure 1.

Real-coded genetic algorithms
The S-system is a formalism of ordinary non-linear differential equations, and thus the system can easily be solved numerically by using numerical integration algorithm customized specifically for this formalism [6]. However, when an adequate time-course of relevant state variable is given, a set of parameter values a i , b i , g ij , and h ij , in many cases, will not be uniquely determined, because it is highly possible that the other set of parameter values will also show a similar time-course. Therefore, even if one set of parameter values that could explain the observed time-course is obtained, this set is still one of the best candidates that explain the observed time-courses. Our strategy is to explore and exploit these candidates within the immense huge searching space of parameter values.
In this problem, each set of parameter values to be estimated is evaluated by using following procedure: Suppose that X CAL d,i,t is the value of the numerically integrated time-course at time t of state variable X i in the d-th data-set, and X EXP d,i,t represents the experimentally observed time-course at time t of X i in the d-th data-set. Sum up the square values of relative error between X CAL d,i,t and X EXP d,i,t to get the total relative error E; where D is the total number of data-sets that experimentally observed under the different kind of experimental conditions such as disruption of genes or inhibition of kinase activities, N is the total number of experimentally observable state variables and T is the total number of sampling points over time in one experimental conditions. The computational task is to find out a set of parameter values that minimizes the objective function E. We have developed the efficient computational technique based on real-coded genetic algorithms (RCGAs) as a nonlinear numerical optimization method which is much less likely to be stranded in local minima. This technique is based on the combination of the operator called uni-modal normal distribution crossover (UNDX) [12] with the alternation of generation model called minimal generation gap (MGG) model [13]. Furthermore, in order to find the skeletal structure (small-size system) of the S-system formalism that explain the experimentally observed response, some of the parameters (g ij and h ij ), absolute values of which are less than a given threshold value are to be removed (reset to zero) during optimization procedures.

Evaluation of identified network
We used the precision and the recall for evaluating the inferred biological network candidates. The precision is defined as follows: where TP i is the number of true-positive interactions in i-th network candidate, FP i is the number of falsepositive interactions in i-th network candidate, and n is the number of inferred network candidates. The value of precision shows the inferring accuracy of biological network candidates. We also used recall, which indicates the inferring efficiency of network candidates as follows: For evaluating balance using both precision and recall, defined as follows: F -measure = 2 * precision * recall precision + recall (7)