Understanding gene interactions is a fundamental question in systems biology. Currently, modeling of gene regulations using the Bayesian Network (BN) formalism assumes that genes interact either instantaneously or with a certain amount of time delay. However in reality, biological regulations, both instantaneous and timedelayed, occur simultaneously. A framework that can detect and model both these two types of interactions simultaneously would represent gene regulatory networks more accurately.
Results
In this paper, we introduce a framework based on the Bayesian Network (BN) formalism that can represent both instantaneous and timedelayed interactions between genes simultaneously. A novel scoring metric having firm mathematical underpinnings is also proposed that, unlike other recent methods, can score both interactions concurrently and takes into account the reality that multiple regulators can regulate a gene jointly, rather than in an isolated pairwise manner. Further, a gene regulatory network (GRN) inference method employing an evolutionary search that makes use of the framework and the scoring metric is also presented.
Conclusion
By taking into consideration the biological fact that both instantaneous and timedelayed regulations can occur among genes, our approach models gene interactions with greater accuracy. The proposed framework is efficient and can be used to infer gene networks having multiple orders of instantaneous and timedelayed regulations simultaneously. Experiments are carried out using three different synthetic networks (with three different mechanisms for generating synthetic data) as well as real life networks of Saccharomyces cerevisiae, E. coli and cyanobacteria gene expression data. The results show the effectiveness of our approach.
Background
In any biological system, various genetic interactions occur concurrently amongst different genes. While some genes interact almost instantaneously, other genes could have time delayed interactions (see Figure 1). From a biological perspective, instantaneous regulations represent the scenarios where the effect of a change in the expression level of a regulator gene is carried on to the regulated gene (almost) instantaneously. In such cases, the effect is reflected almost immediately in the regulated gene’s expression level^{a}. On the other hand, in cases where regulatory interactions are timedelayed, its effect will be seen on the regulated gene after a finite time delay.
Bayesian network and its extension, dynamic Bayesian network (DBN), has found significant applications in the modeling of genetic interactions [1, 2]. To the best of our knowledge, prior works on inter and intraslice connections in the dynamic probabilistic network formalism [3, 4] have modelled a DBN using an initial network and a transition network employing the 1storder Markov assumption, where the initial network exists only during the initial period of time and subsequently the dynamics is expressed using only the transition network. Realising that a dth order DBN has variables replicated d times, a 1storder DBN for this task^{b} is therefore usually limited to around 10 variables. Alternately, if a 2ndorder DBN model is chosen, it can mostly deal with 67 variables [5]. Thus, prior works on DBNs were either unable to discover these two interactions simultaneously or were unable to fully exploit its potential, thereby restricting studies to simpler network configurations. However, since our proposed approach does not replicate variables, we can study any complex network configuration without limitations on the number of nodes. Zou et al. [2], while highlighting the existence of both instantaneous and timedelayed interactions among genes while considering the parentchild relationships of a particular order, did not account for the regulatory effects of other parents (having different order of regulation than the current one) on that particular child. This is in violation of the biological reality that parents with various orders of regulation can jointly regulate a child. Our proposed method supports multiple parents to regulate a child simultaneously, with different orders of regulation. Moreover, the limitation of detecting genetic interactions like , which are prevalent in genetic networks [6], is also overcome with the proposed method. Experiments conducted using both synthetic and reallife GRNs show the effectiveness of our approach.
Results and discussion
We evaluate our proposed method by studying both synthetic networks and reallife biological networks of Saccharomyces cerevisiae (yeast), E. coli and cyanobacteria. The overall accuracy of the inference method and correctness of the modeling approach is evaluated by four widely accepted performance measures given below. The terms, TP, FP, TN and FN, used in the following expressions respectively mean the number of true positives, number of false positives, number of true negatives and number of false negatives.
1.
Sensitivity(Se): It measures the proportion of true connections which are correctly inferred. It is defined as follows:
(1)
2.
Specificity (Sp): Specificity is defined by the following equation:
(2)
3.
Precision (Pr): Precision is proportional to the inferred connections which are correct. It is defined as follows:
(3)
4.
Fscore (F): Biologically, a good reconstruction algorithm should infer as many correct arcs as possible, in addition to the criteria that most of the inferred arcs should be correct. The Fscore measure is the harmonic mean of Se and Pr[7] and represents a compromise between these two objectives:
(4)
Since our method uses discrete data for the statistical significance tests embedded in the scoring function, we applied the Persist [8] algorithm to discretize the data into 3 levels. The confidence level (α) is set to 0.9. We will use a local search in the DAG space with the classical operators of arc addition, arc deletion and arc reversal. The starting point of the search is always an empty graph. The parameters for all the other methods that are used for comparison are set to their default values mentioned in their user manuals.
Synthetic network
Synthetic network using differential equation based models
For performing studies using synthetic networks, we generated 3 random networks of size 10, 25 and 50 using the genenetweaver tool [9]. This tool has been used to generate in silico benchmarks in the DREAM (both DREAM3 [10] and DREAM4 [11]) challenge initiative. The tool is able to obtain biologically plausible network topologies (and also biologically plausible network dynamics) of a given size by extracting random subnetworks of Saccharomyces cerevisiae and E. Coli[9, 12]. We used the tool to generate time series data as in the DREAM4 challenge with ten different perturbations for each experiment. Initial and final timestamps for the simulations were 0 and 1000, respectively, and the time step was 50. One of the objectives of this experiment was to test the usefulness of the proposed approach in the presence of noise in mRNA expression levels. Since microarray experiments can incur a wide range of noise levels depending on the technology, environment and the subject under study, we experimented under various noise levels that are likely to be present in the expression data. To mimic a reallife noisy environment, as in [13, 14], we added 5 different noise levels to the data samples (random Gaussian noise with zero mean and variance, σ^{2} = 0.0, 0.01, 0.02, 0.05, 0.10). The performance, measured by the four performance measures, corresponding to the three different sized networks is reported in Figure 2. Figure 2(A) shows the performance variation as a function of network size and noise level. The Xaxes represent the noise levels while the Yaxes represent the corresponding performance measures (Se, Sp, Pr, F). In Figure 2(B)(D), we compare our approach with three other methods, namely TDARACNE, BANJO and BNFinder (BDe and MDL) using the FScore (results corresponding to other measures are available in Additional file 1). It is evident from the results that there is no clear winner in all the cases. Some methods perform good in some cases, while others outperform it in other cases. However, it is clear that our proposed approach, albeit not always the best, it is always among the top performers and has consistently superior performance.
Probabilistic network of yeast
We use a subnetwork from the yeast cell cycle, shown in Figure 3, taken from Husmeier et al. [15]. The network consists of 12 genes and 11 interactions. For each interaction, we randomly assigned a regulation order of 0, 1, 2 or 3. We used two different conditional probabilities for the interactions between the genes, namely, the noisy regulation according to a binomial distribution and the noisy XORstyle coregulation. For the binomial distribution dependent noisy regulation, the parameters were set as follows: excitation: P(onon) = 0.9, P(onoff) = 0.1; inhibition: P(onon) = 0.1, P(onoff) = 0.9. For the noisy XORstyle coregulation the parameters were set as: P(onon, on) = P(onoff, off) = 0.1, P(onon, off) = P(onoff, on) = 0.9 [15]. Eight confounder nodes were also added, resulting in the total number of nodes to be 20.
We used 30, 50 and 100 samples, generated 5 datasets in each case and compared our approach with two other DBN based methods, namely BANJO [16] and BNFinder [17]. Since these methods detect only regulations of order 1, while calculating performance measures for these methods, we ignored the exact orders for the timedelayed interactions in the target network. We could not apply TDARACNE [7] to this network since the generated data has two levels of discrete values and TDARACNE returns error when applied to such discrete datasets. We show the results for this network in Table 1, where we observe that our method, coupled with a high precision, outperforms the other two in terms of both sensitivity and specificity. The Fscore is also the best in all the cases. This points to the strength of our method in discovering complex interaction scenarios where multiple regulators may jointly regulate target genes with varying timedelays.
Table 1
Comparison of proposed method with BANJO and BNFinder on the yeast subnetwork
N=30
N=50
N=100
Se
Sp
Pr
F
Se
Sp
Pr
F
Se
Sp
Pr
F
Proposed
0.62±
0.992±
0.57±
0.59±
0.80±
1.0±
0.79±
0.79±
0.82±
1.0±
0.76±
0.79±
Method
0.12
0.0045
0.11
0.11
0.04
0.0
0.07
0.05
0.06
0.0
0.03
0.04
BNFinder
0.53±
0.996±
0.68±
0.59±
0.62±
0.997±
0.74±
0.67±
0.69±
0.997±
0.74±
0.72±
+BDe
0.04
0.0006
0.02
0.02
0.04
0.0019
0.13
0.06
0.08
0.0007
0.06
0.07
BNFinder
0.51±
0.996±
0.63±
0.56±
0.60±
0.996±
0.68±
0.63±
0.65±
0.996±
0.69±
0.67±
+MDL
0.08
0.0006
0.07
0.08
0.05
0.0022
0.15
0.09
0.0
0.0
0.04
0.02
BANJO
0.51±
0.987±
0.49±
0.46±
0.55±
0.993±
0.57±
0.55±
0.60±
0.995±
0.61±
0.61±
0.08
0.01
0.2
0.15
0.09
0.0049
0.23
0.16
0.08
0.0014
0.09
0.08
Synthetic network of glucose homeostasis
In higher eukaryotes, glucose homeostasis is maintained via a complex system involving many organs and signaling mechanisms. The liver plays a crucial role in this system by storing glucose as glycogen when blood glucose levels are high, and releasing glucose into the bloodstream when blood glucose levels are low. To accomplish its task, the liver responds to circulating levels of hormones, mainly insulin, epinephrine, glucagon, and glucocorticoids [18].
Le et al. [18] conducted an extensive review of the literature regarding the biological components affecting perinatal glucose metabolism. Based on the study, a Bayesian Network model of glucose homeostasis containing 35 nodes and 52 interactions (shown in Figure 4) was constructed. We used the model for generating datasets of varying size (50, 75 and 100 samples), having first and secondorder regulations using the Bayes Net Toolbox [19]. The random multinomial CPDs used by this approach of data generation were obtained by sampling from a Dirichlet distribution with hyperparameters chosen by the method^{c} described in [20] with a corresponding Equivalent Sample Size (ESS) value of 10. The choice of this prior distribution for the conditional parameters ensures a reasonable level of dependence between dconnected variables in the generative structure [20].
We compare our method with the three other methods that were used previously for comparison, namely BANJO [16] and BNFinder [17](using BDe and MDL). While calculating performance measures for these methods, we ignored the exact orders for the timedelayed interactions in the target network. Similar to the probabilistic network of yeast, we could not apply TDARACNE for this network due to error occurring because TDARACNE is unable to cope with the discrete data. The results are shown in Table 2. We observe that, both in terms of specificity and precision, our method outperforms others. The Fscore is the highest in all the cases, indicating a good balance between sensitivity and precision.
Table 2
Comparison of proposed method with BANJO and BNFinder on the glucose homeostasis network
N=50
N=75
N=100
Se
Sp
Pr
F
Se
Sp
Pr
F
Se
Sp
Pr
F
Proposed
0.50
0.9812
0.54
0.52
0.46
0.9914
0.71
0.56
0.54
0.9906
0.72
0.62
Method
BNFinder
0.48
0.9488
0.29
0.37
0.52
0.9506
0.32
0.39
0.56
0.9557
0.36
0.44
+BDe
BNFinder
0.54
0.948
0.31
0.40
0.56
0.9395
0.29
0.38
0.54
0.9369
0.27
0.37
+MDL
BANJO
0.52
0.97
0.44
0.47
0.48
0.9838
0.57
0.52
0.54
0.9881
0.67
0.60
Reallife biological data of saccharomyces cerevisiae (IRMA)
To validate our method with a reallife biological gene regulatory network, we investigate a recent network reported in [21]. In that significant work, the authors built a network, called IRMA, of the yeast Saccharomyces cerevisiae[21]. They tested the transcription of network genes by culturing the cells in presence of galactose and glucose. The network is composed of five genes regulating each other; it is also negligibly affected by endogenous genes. It is one of the first attempts at building a reference data set having an accurately known target network [7]. There are two sets of gene profiles called Switch ON and Switch OFF for this network, each containing 16 and 21 time series data points, respectively. A ’simplified’ network, ignoring some internal protein level interactions, is also reported in [21]. To compare our reconstruction method, we consider 3 other methods, namely, TDARACNE [7], BANJO [16] and BNFinder [17].
IRMA ON dataset
The performance comparison amongst various method based on the ON dataset is shown in Table 3. We observe that our method clearly outperforms the others. There are no false predictions and precision is highest. The sensitivity and Fscore measures are also very high.
Table 3
Performance comparison based on IRMA ON dataset
Original Network
Simplified Network
Se
Sp
Pr
F
Se
Sp
Pr
F
Proposed Method
0.63
1.0
1.0
0.77
0.67
1.0
1.0
0.80
TDARACNE
0.63
0.88
0.71
0.67
0.67
0.90
0.80
0.73
BNFinder+BDe
0.13
0.82
0.25
0.17
0.17
0.80
0.33
0.22
BNFinder+MDL
0.13
0.82
0.25
0.17
0.17
0.80
0.33
0.22
BANJO
0.25
0.76
0.33
0.27
0.50
0.70
0.50
0.50
IRMA OFF dataset
Due to the lack of ’stimulus’, it is relatively difficult to reconstruct the exact network from the OFF dataset [7]. As a result, the overall performances of all the algorithms suffer to some extent. The comparison is shown in Table 4. Again, we observe that our method reconstructs the gene network with high precision. Specificity is also quite high, implying that the inference of false positives is low.
Table 4
Comparison based on IRMA OFF dataset
Original Network
Simplified Network
Se
Sp
Pr
F
Se
Sp
Pr
F
Proposed Method
0.50
0.94
0.80
0.62
0.50
0.90
0.75
0.60
TDARACNE
0.60

0.37
0.46
0.75

0.50
0.60
BNFinder+BDe
0.13
0.82
0.25
0.17
0.33
0.80
0.50
0.40
BNFinder+MDL
0.13
0.82
0.25
0.17
0.33
0.80
0.50
0.40
BANJO
0.38
0.88
0.60
0.46
0.33
0.90
0.67
0.44
Yeast KEGG pathway reconstruction
In order to test the proposed method’s performance on yeast S. cerevisiae cell cycle, we selected a eleven gene network of the G1phase: Cln3, Cdc28, Mbp1, Swi4, Clb6, Cdc6, Sic1, Swi6, Cln1, Cln2, Clb5. The data used was obtained from the cdc28 experiment of Spellman et al. [22]. In the later stage of the G1phase, the Cln3Cdc28 protein kinase complex activates two transcription factors, MBF and SBF, and these promote the transcription of some genes important for budding and DNA synthesis [7, 23]. Entry into the Sphase requires the activation of the protein kinase Cdc28p through binding with Clb5 or Clb6, and also the destruction of Sic1 [24]. Also, Swi4 becomes associated with Swi6 to form the SCB complex that activates CLN1 and CLN2 in late G1. Mbp1 forms the MCBbinding factor complex with Swi6, which activates DNA synthesis genes and Sphase cyclin genes CLB5 and CLB6 in late G1 [7]. In budding yeast, commitment to DNA replication during the normal cell cycle requires degradation of the cyclindependent kinase (CDK) inhibitor Sic1. The G1 cyclinCDK complexes Cln1Cdk1 and Cln2Cdk1 initiate the process of Sic1 removal by directly catalyzing Sic1 phosphorylation at multiple sites [7, 25].
In Figure 5(B)(F), we report network graphs reconstructed by our proposed approach, TDARACNE, BNFinder(BDe and MDL) and BANJO. We also report the KEGG pathway [26] of the cellcycle in yeast in 5(A). Since the ground truth for this network is not known, instead of applying performance measures as a means of determining network accuracy, we refer to the available correct interactions obtained from the KEGG pathway [26] and identify which of the predicted interactions are correct or otherwise. We observe from the results that our approach correctly identifies the regulation of SWI4SWI6 and MBP1SWI6 complex by the CLN3CDC28 complex. Also, the proposed approach infers that the SWI4SWI6 complex regulates the CLN1CLN2CDC28 complex, which is correct. Two more interactions inferred by our approach (CLN1→CLN2 and CLB5CLB6CDC28→CDC6) are also correct based on the KEGG pathway. Overall we observe that none of the methods perform particularly well on this network. However, the number of correct predictions by our method (5) is higher than the other methods.
SOS DNA repair network of E. coli
We analyze the wellknown SOS DNA repair network in E. coli as shown in Figure 6(A). This GRN is well known for its responsibility of repairing the DNA if it gets damaged. It is the largest, most complex, and best understood DNA damageinducible network to be characterized to date.
The expression of the genes in the SOS regulatory network is controlled by a complex circuitry which involves the RecA and LexA proteins [27]. Normally LexA acts as the master repressor of more than 20 genes, including lexA and recA genes. This repression is done by its binding to the interaction sites in the promoter regions of these genes. When DNA damage occurs, one of the SOS proteins, RecA, acts as a sensor. By binding to singlestranded DNA, it becomes activated, senses the damage and mediates LexA autocleavage [27]. The drop in LexA levels in turn stops the repression of the SOS genes and activates them. When the damage has been repaired, the level of activated RecA drops and it stops mediating LexA autocleavage. LexA level in turn increases, starting repression of the SOS genes, and the cell then returns to its normal state.
The expression data sets of the SOS DNA repair system were obtained from Uri Alon Lab [28]. These data are expression kinetics of 8 genes namely uvrD, lexA, umuD, recA, uvrA, uvrY, ruvA and polB. Four experiments were done for various UV light intensities (Exp. 1 and 2:5Jm^{−2}, Exp. 3 and 4:20Jm^{−2}). In each experiment, the above 8 genes were monitored at 50 instants which are evenly spaced by 6 minutes intervals.
The results corresponding to Experiment 1 is presented in Figure 6(B). Along with our result, we include the results from BANJO, TDARACNE and BNFinder in Figure 6(C)(F) and the target network in 6(A). The results corresponding to the other experiments are available in Additional file 2, Additional file 3, Additional file 4, Additional file 5 and Additional file 6. From the results, we observe that our method correctly identifies lexA and recA as the ’hub’ genes for this network. Again, the exact ground truth for this network is not precisely known, and hence it is not possible to calculate the wellknown performance measures. Instead, using the known interactions obtained from literature [13, 14], an analysis of correct and incorrect predictions by our method is obtained and shown in Table 5. We observe that most of the interactions inferred by our proposed method are correct. It successfully infers lexA as the regulator of uvrA, uvrD, umuD and recA. Also, considering the indirect regulation of RecA through LexA, two more interactions, namely recA→uvrY and recA→polB can also be considered correct. In contrast, 3 of the 5 identified interactions by TDARACNE [7] are correct. Most of the interactions identified by BANJO and BNFinder+MDL are incorrect. BNFinder+BDe successfully identifies regulation of ruvA, polB and uvrA by lexA. In addition, the regulation of umuD by recA can also be considered correct. However, compared to these methods, our proposed method infers the highest number of correct predictions. Number of incorrect predictions is also very low for our method.
Table 5
Analysis of individual interactions inferred by proposed method
correct/
Regulator
Target
incorrect
uvrD
correct
umuD
correct
LexA
recA
correct
uvrA
correct
uvrY
correct^{a}
RecA
polB
correct^{a}
uvrD
ruvA
incorrect
^{a}correct considering indirect regulation of RecA through LexA
Network analysis of strongly cycling genes in cyanobacteria, Cyanothece sp. ATCC 51142
To study our approach on a large scale network, we use a network of a strain of cyanobacteria, namely Cyanothece sp. strain ATCC 51142 [29]. The microarray data corresponding to the genes were collected from two publicly available genomewide microarray data sets of Cyanothece, performed in alternating lightdark (LD) cycles with samples collected every 4h over a 48h period: the first one starting with 1h into dark period followed by two DL cycles (DLDL), and the second one starting with two hours into light period, followed by one LD and one continuous LL cycle (LDLL) [30]. In total, there were 24 samples. Using a threshold filter with a 2fold change cutoff, 730 genes were selected for the analysis. The genes are responsible for performing the major tasks of energy metabolism and respiration, nitrogen fixation, protein translation and folding, and photosynthesis, along with several other tasks. Result obtained using our method is shown in Figure 7. The degree distribution is shown in Figure 8. To compare our result with the other methods, we applied BANJO, BNFinder(BDe and MDL) and TDARACNE. The results of all the three except BNFinder(BDe) was not satisfactory. As a result, we compare our method only with BNFinder+BDe.
Similar to other large scale datasets (e.g. the Human HeLa cell data [31], the Arabidopsis L. Heynth dataset [32]), the microarray data set for cyanobacteria also has very few samples. Moreover, being not a wellstudied organism, it requires caution in the interpretation of results. We note that GRN reconstruction studies of cyanobacteria reported earlier (e.g. [29, 33, 34]) commonly emphasise an evaluation criteria, namely “functional enrichment” analysis of subnetworks. Further, another common feature noted for genetic networks [35–37] is that transcriptional regulatory networks possess the scale free nature of the network topology^{d}. Since we have limited samples and also because the ground truth is unknown, we have carried out the evaluation of the inferred network using both: (i) statistical means i.e. GO functional enrichment analysis (using both p = 0.05 and p = 0.10), and (ii) the R^{2} measure of the powerlaw fit of the network to establish its scalefree nature.
The enrichment analysis was done by using gene ontology (GO) database (compiled using two sources: one from the Cyanobase database [38], and another from genomewide amino sequence matching using the Blast2GO software suite [39]; the the compiled database is available in Additional file 7), where every GO terms appearing in each subnetwork is assessed to find out whether a certain functional category is significantly overrepresented in a certain subnetwork/cluster, more than what would be expected by chance. The Cytoscape [40] plugin BiNGO [41] was used for GO functional category enrichment analysis. For BiNGO, we use the combined and filtered gene set as the reference set, the hypergeometric test as the test for functional overrepresentation, and False Discovery Rate (FDR) as the multiple hypothesis testing correction scheme.
First, we present the results corresponding to p = 0.05. The network obtained by BNFinder+BDe has 16 subnetworks each containing at least 3 genes. Of these, 6 subnetworks have significantly enriched functionalities (as determined by the GO functional enrichment test). Of the other 10, we compute the 3 most densely connected hubs for each subnetwork, and in 2 of 10 such subnetworks, the hubs have defined significantly enriched functionalities. On the other hand, in our result, there are 14 subnetworks in total having at least 3 genes. Of these, 3 subnetworks have defined enriched functions (the largest subnetwork has the role of nitrogen fixation according to the enrichment test). Of the other 11, we compute the 3 most densely connected hubs for each subnetwork, and in 5 of the 11 such subnetworks, the hubs have defined significantly enriched functionalities.
The results corresponding to p = 0.10 show that for BNFinder+BDe, 7 subnetworks have enriched functionalities (as determined by the test). Of the other 9, we compute the 3 most densely connected hubs for each subnetwork, and in 2 of the 9 such subnetworks, the hubs have defined enriched functionalities. On the contrary, the result using our approach has 5 subnetworks with defined significantly enriched functions (the largest subnetwork has the role of nitrogen fixation, similar to the p = 0.05 case). Of the other 9, we compute the 3 most densely connected hubs for each subnetwork, and in 6 of the 9 such subnetworks, the hubs have defined significantly enriched functionalities.
We also test the networks to assess whether they are scale free, using a power law fit. The R^{2} value of the fit corresponding to our network is 0.93, which is a better fit compared to BNFinder+BDe (0.62).
Conclusion
In this paper, we propose a framework that can simultaneously represent instantaneous and timedelayed genetic interactions. The proposed scoring metric uses information theoretic quantities having not only relevant properties but also implicitly includes the biological truth that some genes may jointly regulate other genes. Incorporating these novel features, we have implemented a score+search based GRN reconstruction algorithm. Experiments have been performed on different synthetic networks of varying complexities and also on reallife biological networks. Our method shows improved performance compared to other recent methods, both in terms of reconstruction accuracy and number of false predictions and at the same time maintaining comparable or better true predictions. A natural extension of the described method can be incorporation of apriori knowledge from sources like proteinprotein interactions databases and fusing the knowledge with existing regulatory networks to make the inferred networks much more reliable, and we are pursuing this objective. Along with these extensions, the proposed approach would improve the accuracy of gene regulatory network reconstruction and enhance research in systems biology.
Methods
The representational framework
Let us model a gene network containing n genes (denoted by ) with a corresponding microarray dataset having N time points. A basic DBNbased GRN reconstruction method would try to find associations between genes X_{
i
} and X_{
j
} by taking into consideration the data and or vice versa (small case letters mean data values in the microarray), where 1 ≤ δ ≤ d. That is, it will take into consideration the dth order Markov rule, for a gene having a maximum order of regulation d with its parents. This will effectively enable this model to capture at most dstep time delayed interactions. Conversely, a basic BNbased strategy would use the entire set of N time points and it will capture regulations that are effective instantaneously.
Now, to represent both instantaneous and multiple step timedelayed interactions, we consider an adjacency matrix based structure as shown in Figure 9. The zero entries in the figure denote no regulation. For the first n columns, the entries marked by 1 correspond to instantaneous regulations whereas for the last n columns nonzero entries denote the order of regulation. As an example, the entry 1 in the cell (X_{1}, X_{2}) means X_{1} has (almost) instantaneous regulatory effect on X_{2}. Similarly, the entry d in the cell means X_{
n
} regulates X_{2} with a dstep time delay. Using this representation, we do not need to replicate layers of interactions for each increment in the order of regulations, making it efficient and particularly suitable for representing GRNs, where higherorder regulations is a common phenomenon.
Complications in the alignment of data samples can arise if the parents have different orders of regulation with the child node. To make this notion clear, we describe an example where we have already assessed the degree of interest in adding two parents (gene B and C, having third and first order regulations, respectively) to the gene under consideration, X. Now, we want to assess the degree of interest in adding gene A as a parent of X with a second order regulatory relationship, that is we want to compute^{e}MI(X, A^{2}{B^{3}, C^{1}}), where superscripts on the parent variables denote the order of regulation it has with the child node.
There are two possibilities to consider. The first one corresponds to the scenario where the timeseries data is not periodic. In this case, we cannot use all the N samples for MI computation, rather we have to use (N − δ) samples where δ is the maximum order of regulation that the gene under consideration has, with its parent nodes (3 in this example). Figure 10 shows how the alignment of the samples can be done for the current example. In the figure, we have N samples and since δ = 3, we can effectively use (N − 3) samples.
The √symbol inside a cell denotes that this data sample will be used for MI computation, whereas empty cells denote that these data samples will not be considered for computing the MI. Similar alignments will need to be done for the other case, where the data is considered to be periodic (e.g., datasets of yeast compiled by [42] show such cyclic behavior [43]). However, we can use all the N data samples in this case, where the data is shifted in a circular manner.
The interpretation of the results obtained from an algorithm that uses this framework can be done in a straightforward manner. Using this framework and the aligned data samples, if we construct a network where we observe, for example, arc having order δ, we conclude that the interslice arc between X_{1} and X_{n} is inferred and X_{1} regulates X_{
n
} with a δstep timedelay. Similarly, if we find an arc X_{2} → X_{
n
}, we say that the intraslice arc between X_{2} and X_{n} is inferred and a change in the expression level of X_{2} will almost immediately effect the expression level of X_{n}. To ensure consistency in the resulting Bayesian networks, the following 3 conditions must also be satisfied:
1.
The network must be a directed acyclic graph.
2.
The interslice arcs must go in the correct direction (no backward arc).
3.
Interactions remain existent independent of time (stationarity assumption).
Our proposed scoring metric, CCIT
We share the same idea with MIT (Mutual Information Tests) [44] and MDL (the Minimum Description Length principle) for developing a scoring metric that can score both instantaneous and timedelayed interactions simultaneously: to use the MI/loglikelihood measure between a node X, and its parents, Pa(X) for measuring the degree of association between them, and penalizing structural complexity. The first part aims at minimizing the KullbackLeibler (KL) divergence between the joint distribution corresponding to the original network (p_{
D
}) and the graph under consideration (p_{
G
}), according to the following equation:
(5)
which is equivalent to maximizing the loglikelihood (i.e., the higher the MI/loglikelihood score, the better is the network) [44]. In our approach, calculation of the MI/loglikelihood score is done in a manner which is similar to the approaches in MIT/MDL, with a major difference: calculation of score (using MI/loglikelihood) in case of joint regulation. To make the notion clear consider Figure 1. Using MIT, the MI part for scoring for gene B is^{f}MI(B, {A^{0}, D^{0}}) + MI(B, C^{1}) (similar calculations of loglikelihood will be used for MDL). As we can see, the calculation of MI/loglikelihood for the zeroorder interactions do not take into account the parents who regulate it with timedelay. Unlike the approach in basic MIT and other approaches where zero and higherorder interactions are scored separately and then combined, in our approach, we also condition (during computation) on parents which have different orders of regulation with the target gene. The marginal probability for each node of this model thus becomes:
(6)
The term Pa(X_{
i
}[t]) in the above equation represents the parents of gene X_{
i
} at time t, which can be in the same timeslice or in one of the d previous timeslices (d is the maximum order of regulation) of gene X_{
i
} at time t. Thus, using our approach, the scoring function for B will calculate . Scoring in this manner enables us to score both intra and interslice interactions simultaneously, rather than considering these two types of interactions in an isolated manner, making it specially suitable for problems like reconstructing GRNs, where occurrence of joint regulation is a common phenomenon.
The idea of penalizing complex structures is ubiquitous, finding its place in most of the scores like BIC, MIT and MDL. The penalization component for BIC and MDL are global, whereas for MIT it is specific for each variable and its parents. Being local in nature, the MIT scheme usually outperforms the other two [44]. In this scheme, the localised penalization is based on a theorem of Kullback [45], which says that for a particular confidence level α, the quantity 2N.MI(X_{
i
}, X_{
j
}Pa(X_{
i
})) − χα,l_{
ij
} represents a statistical test of conditional independence, where l_{
ij
} is the degrees of freedom of a chisquared distribution, and is the statistical significance threshold. The more positive the value is, the more likely is that X_{
i
} and X_{
k
} are related (given the current parent set, Pa(X_{
i
})) and viceversa. Thus, adding up the MI quantities for all the genes (multiplied by 2*number of samples) and subtracting the corresponding local penalization measures effectively constitute a series of conditional independence (CI) tests, and this scheme is used for scoring using MIT.
However, porting this idea of local penalization directly to a gene regulatory network which is cursed with dimensionality (there are a large number of variables (genes), but only a few samples are available), has the problem of overpenalization. This can be exemplified using Figure 1. The penalization component for gene B according to MIT, will be: χ_{
α,4} + χ_{
α,12} + χ_{
α,36}, assuming the special case where we have 3 levels of discrete data (the details of how these penalization components can be computed will be shown later). For a Bayesian network design having thousands of samples available, this penalization is not a problem. However, but for GRN reconstruction with samples ranging between 2050, this penalization is too high. To remedy this situation, we propose to apply the penalization only on a perorder of regulation basis. Using this modified scheme, the penalization will be 2χ_{
α,4} + χ_{
α,12}, which constitutes considerable savings, thereby increasing better prediction ratio (in terms of sensitivity and specificity).
The approaches described above are summarised as a scoring metric, named CCIT (Combined Conditional Independence Tests) in Equation 7. The score, when applied to a graph G containing n genes (denoted by ), with a corresponding microarray dataset D, can be expressed as:
(7)
Here denotes the number of parents of gene X_{
i
} having a k step timedelayed regulation and δ_{
i
} is the maximum timedelay that gene X_{
i
} has with its parents. The parent set of gene X_{
i
}, Pa(X_{
i
}) is the union of the parent sets of X_{
i
} having zero timedelay (denoted by ), singlestep timedelay (denoted by ) and up to parents having the maximum timedelay (δ_{
i
}). This is defined as follows:
(8)
The number of effective data points, , depends on whether the data can be considered to be showing periodic behavior or not (e.g., datasets from [42] can be considered as showing periodic behavior [43]). In the case of aperiodicity, is determined by subtracting, from the total length of the time profile (N), the maximum order of the timedelay that the gene under consideration has with its parents (δ_{
i
}).
(9)
Finally, denote any permutation of the index set of the variables and , the degrees of freedom, is defined as follows:
(10)
where r_{
p
} denotes the number of possible values that gene X_{
p
} can take (after discretization, if the data is continuous). If the number of possible values that the genes can take is not the same for all the genes, the quantity denotes the permutation of the parent set where the first parent gene has the highest number of possible values, the second gene has the second highest number of possible values and so on.
Some properties of CCIT Score
In this section we study several useful properties of the proposed scoring metric. The first among these is the decomposability property, which is especially useful for local search algorithms:
Proposition 1
CCIT is a decomposable scoring metric.
Proof
This result is evident as the scoring function is, by definition, a sum of local scores. □
Next, we show in Theorem 1 that CCIT takes joint regulation into account while scoring and it is different than three related approaches, namely MIT [44] applied to: a Bayesian Network (which we call MIT_{0}); a dynamic Bayesian Network (called MIT_{1}); and also a naive combination of these two, where the intra and interslice networks are scored independently (called MIT_{0 + 1}). For this, we make use of the decomposition property of MI, defined next:
Property 1
(Decomposition Property of MI) In a BN, if Pa(X_{
i
}) is the parent set of a node X_{
i
}, and the cardinality of the set is s_{
i
}, the following identity holds [44]:
(11)
Theorem 1
CCIT scores intra and interslice arcs concurrently, and is different from MIT_{0}, MIT_{1} and MIT_{0 + 1} since it takes into account the fact that multiple regulators may regulate a gene simultaneously, rather than in an isolated manner.
Proof
We prove by showing a counter example, using the network in Figure 11. We apply our metric along with the three other techniques on the network, describe the working procedure in all these cases to show that the proposed metric indeed scores them concurrently, and finally show the difference with the other three approaches. The network in Figure 11 has 4 interactions, 2 of these are instantaneous and 2 are timedelayed (with δ = 1). We assume a nontrivial case where the data is supposed to be periodic (the proof is trivial otherwise). Also, we assume that all the gene expressions were discretized to 3 quantization levels.
1.
Application of MIT in a BN based framework:
(12)
2.
Application of MIT in a DBN based framework:
(13)
3.
A naive application of MIT in a combined BN and DBN based framework:
(14)
4.
Our proposed scoring metric:
(15)
The concurrent scoring behavior of CCIT is evident from the first term in RHS of (15). Also, the inclusion of C in the parent set in the first term of the RHS of the equation exhibits the manner by which it achieves the objective of taking into account the biological fact that multiple regulators may regulate a gene jointly (the calculation, however, needs to be carried out in accordance with the process we described in the Methods Section).
Considering (12) and (13), it is also obvious that CCIT is different from both MIT_{0} and MIT_{1}. To show that CCIT is different from MIT_{0 + 1}, we consider (14) and (15). It suffices to consider whether MI(B, {A^{0},D^{0}}) + MI(B, C^{1}) is different from . Using (11), this becomes equivalent to considering whether MI(B, {A^{0},D^{0}}C^{1}) is the same as MI(B, {A^{0}, D^{0}}), which are clearly inequal. This completes the proof. □
Endnotes
^{a}The timedelay will always be greater than zero. However, if the delay is small enough so that the regulated gene is effected before the next data sample is taken, it can be considered as an instantaneous interaction
^{c}The method works as follows: for a variable X_{
i
} with k states, a basis vector is constructed for P(X_{
i
}Pa(X_{
i
})) by normalizing the vector . For the jth instantiation pa(X_{
i
}) of Pa(X_{
i
}), samples are obtained for the probability corresponding to this instantiation by using θ_{
ij
} ∼ Dirichlet(sα_{
ij
}) where s is the equivalent sample size and the α_{
ij
}’s are obtained by shifting the basis vector to the right j places where j modulo k is not one.
^{d}We clarify that different processes including genetic networks will generate scale free networks. However, if a network obtained using microarray data is scale free, it indicates that it is modelling the underlying biological process more accurately
^{e}in this paper, we use Mutual Information (MI)/loglikelihood based Conditional Independence tests for analysis of regulatory interactions
^{f}it should be noted here that MIT/MDL are basic scoring metric for BNs, which can be extended to score both Static and Dynamic BNs separately. Here, we are discussing MIT/MDL applied to a network having both zero and higherorder interactions
Declarations
Acknowledgments
This research is a part of the larger project on genetic network modeling supported by Monash University and AustraliaIndia Strategic Research Fund. Comments and suggestions from Assoc. Prof. Manzur Murshed from Gippsland School of IT is gratefully acknowledged.
Authors’ Affiliations
(1)
Gippsland School of Information Technology, Faculty of Information Technology, Monash University
References
Ram R, Chetty M: A markovblanketbased model for gene regulatory network inference.Comput Biol Bioinf, IEEE/ACM Trans 2011,8(2):353–367.View Article
Zou M, Conzen S: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data.Bioinformatics 2005, 21:71.PubMedView Article
de Campos, Ji Q: Efficient structure learning of Bayesian networks using constraints.J Machine Learning Res 2011, 12:663–689.
Friedman N, Murphy K, Russell S: Learning the structure of dynamic probabilistic networks. Morgan Kaufmann Publishers Inc.; 1998.
Eaton D, Murphy K: Bayesian structure learning using dynamic programming and MCMC.Proceedings of the TwentyThird Confererence on Uncertainty in Artificial Intelligence (UAI 2007) 2007.
Chaitankar V, Ghosh P, Perkins E, Gong P, Deng Y, Zhang C: A novel gene network inference algorithm using predictive minimum description length approach.BMC Syst Biol 2010,4(Suppl 1):S7.PubMedView Article
Zoppoli P, Morganella S, Ceccarelli M: TimeDelayARACNE: Reverse engineering of gene networks from timecourse data by an information theoretic approach.BMC Bioinf 2010, 11:154.View Article
Mörchen F, Ultsch A: Optimizing time series discretization for knowledge discovery. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery and data mining. Chicago, IL, USA: ACM; 2005:660–665.View Article
Schaffter T, Marbach D, Floreano D: GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods.Bioinformatics 2011,27(16):2263–2270. Oxford Univ PressPubMedView Article
Prill R, Marbach D, SaezRodriguez J, Alexopoulos L, Sorger P, Xue X, Clarke N, AltanBonnet G, Stolovitzky G: Towards a rigorous assessment of systems biology models: the DREAM3 challenges.PloS one 2010,5(2):e9202.PubMedView Article
Marbach D, Schaffter T, Mattiussi C, Floreano D: Generating realistic in silico gene networks for performance assessment of reverse engineering methods.J Comput Biol 2009,16(2):229–239.PubMedView Article
Noman N, Iba H: Inferring gene regulatory networks using differential evolution with local search heuristics.IEEE/ACM Trans Comput Biol Bioinf (TCBB) 2007,4(4):634–647.View Article
Kimura S, Ide K, Kashihara A, Kano M, Hatakeyama M, Masui R, Nakagawa N, Yokoyama S, Kuramitsu S, Konagaya A: Inference of Ssystem models of genetic networks using a cooperative coevolutionary algorithm.Bioinformatics 2005,21(7):1154–1163. IEEE Computer Society PressPubMedView Article
Husmeier D: Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks.Bioinformatics 2003,19(17):2271.PubMedView Article
Yu J, Smith V, Wang P, Hartemink A, Jarvis E: Advances to Bayesian network inference for generating causal networks from observational biological data.Bioinformatics 2004,20(18):3594.PubMedView Article
Wilczyński B, Dojer N: BNFinder: exact and efficient method for learning Bayesian networks.Bioinformatics 2009,25(2):286.PubMedView Article
Le P, Bahl A, Ungar L: Using prior knowledge to improve genetic network reconstruction from microarray data.Silico Biol 2004,4(3):335–353.
Murphy K, et al.: The bayes net toolbox for matlab.Comput Sci Stat 2001,33(2):1024–1034.
Cantone I, Marucci L, Iorio F, Ricci M, Belcastro V, Bansal M, Santini S, Di Bernardo M, Di Bernardo D, Cosma M: A yeast synthetic network for in vivo assessment of reverseengineering and modeling approaches.Cell 2009, 137:172–181.PubMedView Article
Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.Mol Biol Cell 1998,9(12):3273–3297.PubMed
Cross F: Starting the cell cycle: what’s the point?Curr Opin Cell Biol 1995,7(6):790–797.PubMedView Article
Chun K, Goebl M: Mutational analysis of Cak1p, an essential protein kinase that regulates cell cycle progression.Mol Gen Genet MGG 1997,256(4):365–375.View Article
Sawarynski K, Kaplun A, Tzivion G, Brush G: Distinct activities of the related protein kinases Cdk1 and Ime2.Biochimica Et Biophysica Acta 2007,1773(3):450–456.PubMedView Article
Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet.Nucleic Acids Res 2002, 30:42–46.PubMedView Article
Kabir M, Noman N, Iba H: Reverse engineering gene regulatory network from microarray data using linear timevariant model.BMC Bioinf 2010,11(Suppl 1):S56. BioMed Central LtdView Article
Stöckel J, Welsh E, Liberton M, Kunnvakkam R, Aurora R, Pakrasi H: Global transcriptomic analysis of Cyanothece 51142 reveals robust diurnal oscillation of central metabolic processes.Proc Nat Acad Sci 2008,105(16):6156.PubMedView Article
Wang W, Ghosh B, Pakrasi H: Identification and modeling of genes with Diurnal Oscillations from microarray time series data.Comput Biol Bioinf, IEEE/ACM Trans 2011, 8:108–121.View Article
Whitfield M, Sherlock G, Saldanha A, Murray J, Ball C, Alexander K, Matese J, Perou C, Hurt M, Brown P, et al.: Identification of genes periodically expressed in the human cell cycle and their expression in tumors.Mol Biol Cell 2002,13(6):1977–2000.PubMedView Article
Yuan Y, Li C, Windram O: Directed partial correlation: inferring largescale gene regulatory network through induced topology disruptions.PLoS One 2011,6(4):e16835.PubMedView Article
McDermott J, Oehmen C, McCue L, Hill E, Choi D, Stöckel J, Liberton M, Pakrasi H, Sherman L: A model of cyclic transcriptomic behavior in the cyanobacterium Cyanothece sp. ATCC 51142.Mol BioSyst 2011,7(8):2407–2418.PubMedView Article
Toepel J, Welsh E, Summerfield T, Pakrasi H, Sherman L: Differential transcriptional analysis of the cyanobacterium Cyanothece sp. strain ATCC 51142 during lightdark and continuouslight growth.J Bacteriol 2008,190(11):3904–3913.PubMedView Article
Jeong H, Tombor B, Albert R, Oltvai Z, Barabási A: The largescale organization of metabolic networks.Nature 2000,407(6804):651–654.PubMedView Article
Jeong H, Mason S, Barabasi A, Oltvai Z: Lethality and centrality in protein networks.Arxiv preprint condmat/0105306 2001.
Guelzim N, Bottani S, Bourgine P, Képès F: Topological and causal structure of the yeast transcriptional regulatory network.Nat Genet 2002, 31:60–63.PubMedView Article
Götz S, GarcíaGómez J, Terol J, Williams T, Nagaraj S, Nueda M, Robles M, Talón M, Dopazo J, Conesa A: Highthroughput functional annotation and data mining with the Blast2GO suite.Nucleic Acids Res 2008,36(10):3420–3435.PubMedView Article
Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks.Genome Res 2003,13(11):2498–2504.PubMedView Article
Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks.Bioinformatics 2005,21(16):3448–3449.PubMedView Article
Cho R, Campbell M, Winzeler E, Steinmetz L, Conway A, Wodicka L, Wolfsberg T, Gabrielian A, Landsman D, et al.: A genomewide transcriptional analysis of the mitotic cell cycle.Mol Cell 1998, 2:65–73.PubMedView Article
Xing Z, Wu D: Modeling multiple time units delayed gene regulatory network using dynamic Bayesian network. In Proceedings of the Sixth IEEE International Conference on Data Mining  Workshops. IEEE; 2006:190–195.View Article
de Campos L: A scoring function for learning Bayesian networks based on mutual information and conditional independence tests.J Machine Learning Res 2006, 7:2149–2187.
Kullback S: Information Theory and Statistics. Volume. New York: Dover Publications; 1968.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.