Integrating multiple types of data to predict novel cell cycle-related genes
© Wang et al; licensee BioMed Central Ltd. 2011
Published: 20 June 2011
Skip to main content
© Wang et al; licensee BioMed Central Ltd. 2011
Published: 20 June 2011
Cellular functions depend on genetic, physical and other types of interactions. As such, derived interaction networks can be utilized to discover novel genes involved in specific biological processes. Epistatic Miniarray Profile, or E-MAP, which is an experimental platform that measures genetic interactions on a genome-wide scale, has successfully recovered known pathways and revealed novel protein complexes in Saccharomyces cerevisiae (budding yeast).
By combining E-MAP data with co-expression data, we first predicted a potential cell cycle related gene set. Using Gene Ontology (GO) function annotation as a benchmark, we demonstrated that the prediction by combining microarray and E-MAP data is generally >50% more accurate in identifying co-functional gene pairs than the prediction using either data source alone. We also used transcription factor (TF)–DNA binding data (Chip-chip) and protein phosphorylation data to construct a local cell cycle regulation network based on potential cell cycle related gene set we predicted. Finally, based on the E-MAP screening with 48 cell cycle genes crossing 1536 library strains, we predicted four unknown genes (YPL158C, YPR174C, YJR054W, and YPR045C) as potential cell cycle genes, and analyzed them in detail.
By integrating E-MAP and DNA microarray data, potential cell cycle-related genes were detected in budding yeast. This integrative method significantly improves the reliability of identifying co-functional gene pairs. In addition, the reconstructed network sheds light on both the function of known and predicted genes in the cell cycle process. Finally, our strategy can be applied to other biological processes and species, given the availability of relevant data.
According to , “mutations in two genes produce a phenotype that is surprising in light of each mutation's individual effects. This phenomenon, which defines genetic interaction, can reveal functional relationships between genes and pathways.” Thus, deciphering genetic interaction networks via high-throughput technologies can both reveal the schematic wiring of biological processes and predict novel genes. Recently, several such high-throughput technologies have been developed to identify genetic interactions at the genome scale, including Synthetic Genetic Array (SGA) , Diploid-based Synthetic Lethality Analysis on Microarrays (dSLAM) , and Epistatic Miniarray Profile (E-MAP) . The first two approaches aim to identify synthetic lethal, or negative, interactions, meaning that the double mutant is more lethal than the corresponding single mutants. On the other hand, assuming that the expected phenotype of a double mutation reflects the additional effects of the single mutations, E-MAP, an extension of SGA, gains power by identifying positive as well as negative interactions, which, in this case, would indicate that the double mutant is healthier than expected.
Here, we exploited the E-MAP methodology to discover novel genes involved in the cell cycle process in budding yeast. The distinct advantage of using E-MAP is the potential of discovering functionally associated genes which do not otherwise physically interact. Physical interaction assays, such as the yeast two-hybrid system or DNA-binding microarrays, are unlikely to reveal these associations. Despite the superiority of E-MAP, interpretation of the data is still challenging. First, genetic interactions occur both between and within functional modules. Thus, the function of a gene cannot be determined by its interacting partners. Second, E-MAP suffers from high false positive and negative rates, making it difficult to infer genetic interaction accurately and sufficiently. Consequently, the integration of external information, such as gene expression, Transcription Factor (TF)-DNA binding (chip-chip) and protein phosphorylation, is necessary in order to identify novel genes involved in the cell cycle process.
Several methods have been developed to integrate multiple types of data to infer a transcription regulatory network in eQTL analysis, including mRNA expression, chip-chip, physical interaction and protein phosphorylation [5–8]. In this paper, we integrated genetic interaction and other genomic data to construct a specific network which we then applied to the cell cycle process in budding yeast.
As expected, Figure 2A confirms that gene pairs having a higher cc are more likely to be co-functional. Also, Figure 2B shows that gene pairs with both significant positive and negative S-scores are more likely to be co-functional. By comparing the enrichment between Figures 2A and 2B, it is apparent that extreme S-scores could indicate co-functional membership more efficiently.
When combining these two kinds of data, we found that they were complementary. As shown in Figure 2C, for a certain cut-off of S-scores, gene pairs with a higher correlation of expression are more likely to be co-functional, and vice versa. Therefore, the results approved our original hypothesis that combining these two kinds of information could help to construct a more accurate potential cell cycle related gene set. We adopted an area by which to define significantly interacting gene pairs based on the data in Figure 2C. For a positive genetic interaction area, we require that the enrichment over random be larger than 2, and for a negative genetic interaction area, we require that it be larger than 4. Then the constraints are (S-score>2.5 and cc >0.9) or S-score>6 or (S-score<-3 and cc>0.9) or (S-Score<-14 and cc>0.85). Compared to the most powerful method at each point, the combination is generally >50% more accurate in the areas defined above (Figure 2D). Finally, 259 gene pairs between 206/1536 test strains and 48 KCCGs passed the filter. We use these 206 test genes as the potential set of cell cycle-related genes (PCCGs).
We compared our E-MAP data with the benchmark data. Similar to previous work , we tested the sensitivity and precision of the E-MAP data (see Methods and materials). Compared to genetic interactions in BIOGRID, both the positive and negative interactions are very precise (p-value < 10–50).
We also tested the efficiency when combining E-MAP with DNA microarray data. When the co-expression test was applied, the significance level of precision increases around 2-fold (Table S2 in Additional file 2). This result indicates that co-expression does indeed provide extra information about genetic interaction. Hence, our strategy can be used to identify potential cell cycle genes and their relationships with known cell cycle genes, thus enabling us to construct a reliable network.
We also compared our S-score with previously published large-scale genetic interaction data . Significantly interacting gene pairs show obvious correlation between the two datasets (r=0.64, Figure S1 in Additional file 3).
Functional enrichment analysis was performed on all GO biological process terms in both positive and negative parts of PCCGs. We defined the positive part as those genes having (S-score>2.5 and cc >0.9) or S-score>6 and the negative part as those genes having (S-score<-3 and cc>0.9) or (S-Score<-14 and cc>0.85). We distinguished these two parts because the principles on which positive and negative genetic interaction are based may be different for functional analysis, as discussed in a previous study . For the positive part, only one functional category, “nucleosome organization,” was enriched under 98% confidence level (q=0.012). For the negative part, five functional categories were enriched, including “DNA-dependent DNA replication,” “chromatin assembly,” “interphase of mitotic cell cycle,” “cell cycle checkpoint” and “regulation of organelle organization” (q=0.014, 0.013, 0.015, 0.009, 0.012). All these biological processes can either be interpreted as related to the cell cycle process, or just part of it. In addition, KCCGs were found to be mainly involved in “regulation of organelle organization,” “regulation of mitotic cell cycle,” “interphase of mitotic cell cycle,” “regulation of cell cycle process” and “cell cycle checkpoint” processes. Hence, in the negative part, there are more directly co-functional genes than the positive part. This is consistent with the surprising conclusion of previous work  indicating that negative, in contrast to positive, genetic interactions always occur between genes with overlapping functions.
Finally, we also analyzed the functional enrichment of all PCCGs. Three functions, including “chromatin assembly,” “regulation of organelle organization” and “nucleosome organization,” were enriched (q=0.015, 0.009, 0.002). This suggests the importance of separating PCCGs into two parts for a functional enrichment analysis. Such separation further helps us to understand how known cell cycle genes, both positive and negative, interact in terms of their functions and also helps us to find specific functions only enriched in one of the two parts, such as “DNA-dependent DNA replication,” “interphase of mitotic cell cycle” and “cell cycle checkpoint.” which only enriched in negative part but not in all PCCGs.
Periodicity and enrichment are consistent criteria since most of the known cell cycle TFs rank at the top in both cases. However, some TFs are ranked differently (See Table S3 in Additional file 4). For example, Mcm1 is ranked 6/130 in the enrichment test (ET); however, it ranks 124/183 in the periodic test (PT), which means that its expression does not show periodicity. We know that Mcm1 regulates different phases during the cell cycle [12, 13], and its expression will not be periodic. However, many of its neighboring genes in the transcriptional network are cell cycle genes, making its identification possible in the enrichment test. Similar to Mcm1, Skn7 ranks 23 and 114, respectively, in ET and PT. In contrast, HCM1 is ranked 3 in PT, but 42 in ET. One possible explanation of this apparent difference is that the PCCGs and KCCGs only cover a limited part of cell cycle-related genes, and some targets of HCM1 are missing in this set. Other examples like YHP1 are similar to HCM1. Based on this analysis, a TF that is significant in either test should be included. Hence, we use the multiplication of the two ranks as an index, and we use its rank to evaluate the priority order (see Methods and materials for details).
To determine how many TFs should be involved, we examined the coverage rates of TFs. The coverage rate is evaluated at two levels: the fraction of genes which are targets of the selected TFs in the PCCGs and KCCGs and the fraction of gene pairs which are co-regulated by any one of the selected TFs. In the PCCGs and KCCGs, 232/236 genes are involved in the chip-chip dataset (at least one TF can bind to them), and 77 gene pairs, which are both genetically interacting and co-expressed, can be simultaneously bound by the same TF. We noticed that when the top 25 TFs are selected, most of the 232 genes and 77 gene pairs (75% and 97%, respectively) could be covered (Figure S2 in Additional file 5). The cover rate increases quite slowly when more TFs are selected. We therefore used these 25 TFs to construct the transcriptional network based on the PCCGs and KCCGs.
We next determined whether the PCCGs are enriched with known cell-cycle genes. Among the PCCGs, we calculated the proportion of genes which are annotated to participate in the cell cycle process (in MIPS database) and used the hyper-geometric distribution to define the p-values. About 1/2 of the PCCGs (94/206) were determined to be known cell-cycle and DNA processing genes (p = 6 × 10–6). We performed the same test to the selected TFs. Eighteen of them are known to be cell cycle TFs (p = 5 × 10–11, Table S4 in Additional file 6).
Since cell cycle events are controlled by cyclin-dependent kinases (CDKs), we investigated whether Cdk1(CDC28) substrates were enriched in our PCCGs and selected TFs. As expected, both of them turned out to be enriched with CDC28 substrates (Table S5 in Additional file 7), further supporting the finding that both PCCGs and selected TFs are cell cycle-related.
We compared the difference between using all TFs in the database and only the selected 25 TFs to explain indirect transcriptional relationships between the 25 TFs and 232 target genes. Based on comparing the wild-type and TF mutant microarray data, we could tell how one TF could affect the expression of each gene. This data reflects the transcriptional relationship between the TFs and targets although the relation could be indirect. This independent evidence, which describes the transcriptional network, can be utilized to validate the network we constructed.
Between our 25 TFs and the 232 target genes, there are 140 indirect TF-target pairs. By using the transcriptional relationships of all 183 TFs in the chip-chip dataset, 103/140 pairs could be connected within three steps, although for more steps, quite a few indirect pairs could be explained (Figure S3 in Additional file 8). We also tested the fraction of indirect TF-target pairs which could be connected by only using the relationships of the 25 TFs. The result (Figure S3) shows that the sub TF-target network can explain 85.4% (88/103) of the indirect relations in the first three steps. This result illustrates that the 25 TFs form a cooperative transcriptional network which can explain its indirect connections quite well.
Our approach integrates the genetic interaction network, co-expression network, and transcriptional network, and it performed well in predicting cell cycle genes. Many previous papers have also discussing integrating these data source in eQTL analysis. However, comparing to these approaches, we started from known functional genes and their E-MAP profiles to build up the network step by step. In this process, we could see how these data source could describe the biology network, and how they are co-operated together. We illustrate that E-MAP and DNA microarray could be complementary in identifying PCCGs, and also the cluster result tells that how transcriptional relationships could reflect functional connections of genes in the network.
In addition, there are other types of networks, such as protein physical interaction networks, which are informative for the prediction of gene function. However, because physical interactions annotated in databases are quite sparse between KCCGs and the 1536 library strains, we have not performed an analysis of it. We believe the efficiency of prediction can be increased when such data are integrated in a reasonable framework.
Although the current study focused on the cell cycle process, our approach is not limited, and it can be easily applied to other biological processes, given the availability of data.
E-MAP technology is a powerful high-throughput tool to identify novel functional genes which genetic interacted with the known one. By screening forty eight cell cycle genes crossing 1536 library strains, E-MAP helps us obtain a large potential cell cycle-related gene set.
Our analysis shows that genetic interaction and gene co-expression could be complementary for identifying co-functional gene pairs, and combining them has significantly improved the accuracy of the prediction.
TF-DNA binding (chip-chip) and protein phosphorylation data were used to construct a cell cycle regulation network. Periodic expressed and being enriched of cell cycle genes in targets can both be used to identify TFs which regulate the cell cycle process. When comparing the cluster result to prior knowledge, we could believe that our cell cycle transcriptional network is well constructed. This network could help to illustrate how PCCGs are involved in cell cycle process.
Finally, four genes with unknown functions in PCCGs are laboured. From KCCGs which the four genes are genetic interacted and co-expressed, we could predict which phase of cell cycle they may be involved in. In addition, the time course expression data of them are consistent with the constructed transcriptional network, and some of them are substrate of CDK1 (CDC28) kinase which regulates the cell cycle process in budding yeast. All these analyses provided strong evidence that the four novel genes should be participate in cell cycle related process.
The 48 cell cycle genes were manually curate from the literature, which function in different phases of the cell cycle process (Figure S5 in Additional file 11). They had been screened by crossing a 1536 mutant strain library in budding yeast, and the relative double mutant strains were selected to obtain the EMAP data. However, the analytical framework we developed is not affected by the selection of these genes, and it can be applied to other processes as well.
We use eight time course microarray experiments from four previously published works to perform the co-expression analysis [14–17]. The data were downloaded from the supplementary data from the authors’ website, and the KNNImpute method was used  to impute the missing value.
To measure the similarity between the time course expression profiles of two genes, we used the time-lagged correlation . For multiple experiments, we adopted a loose definition for correlation between two genes which is the maximum time lag correlation score in all the eight experiments. This means that two genes showing high correlation in one experiment are considered to be co-expressed. We can use such a loose definition because we have already had a stringent constraint in E-MAP analysis to ensure that the interactions are reliable, even if two genes only show co-expression in one experiment.
We also use hyper geometric distribution to calculate the p-value of precision. The relative results are reported in Table S2 in Additional file 2.
Here m i is the number of selected genes which have the function i; n is the number of selected genes; M i is the number of test genes which have the function i; N is the number of test genes.
Finally, we use Bonferroni correction to control false discovery rate in this multiple testing problem and get the q value.
The Chip-Chip data and wild type vs. TF mutant microarray data were downloaded from YeastRact (http://www.yeastract.com[20, 21]). Among 183 TFs in our dataset, 37 are annotated as cell cycle-related in the MIPS database. The CDC28 substrate dataset was downloaded from the supplementary data of two previously published works [22, 23].
Here, m i is the neighbor in the PCCGs and KCCGs; n is the number of genes in PCCGs and KCCGs; M i isTF i ’s targets in the test genes, and N is the number of test genes.
Suppose the probability of one TF i not to be enriched and periodically expressed is . Then the probability that TF i is either enriched or periodically expressed is. . ForTF i , the multiplication of its rank of p-values (ascending order) keeps the order of the probability; . Thus the smaller the order of is, the larger Pr i of TF i is.
We thank Hongjing Qu and Xiaojing Yang and Chao Tang lab and Nevan Krogan lab for their help in EMAP experiment. We also thank Dr. David Martin for his critical reading of the manuscript. This work is supported by the National Natural Science Foundation of China (No.10871009, No.10721403, and No.10774009), the National High Technology Research and Development of China (No.2008AA02Z306), the National Key Basic Research Project of China (No. 2006CB910706, No.2009CB918503), and “The Fundamental Research Funds for the Central Universities” in China.
This article has been published as part of BMC Systems Biology Volume 5 Supplement 1, 2011: Selected articles from the 4th International Conference on Computational Systems Biology (ISB 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/5?issue=S1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.