Prediction of quantitative phenotypes based on genetic networks: a case study in yeast sporulation
© Shen et al; licensee BioMed Central Ltd. 2010
Received: 1 May 2010
Accepted: 10 September 2010
Published: 10 September 2010
An exciting application of genetic network is to predict phenotypic consequences for environmental cues or genetic perturbations. However, de novo prediction for quantitative phenotypes based on network topology is always a challenging task.
Using yeast sporulation as a model system, we have assembled a genetic network from literature and exploited Boolean network to predict sporulation efficiency change upon deleting individual genes. We observe that predictions based on the curated network correlate well with the experimentally measured values. In addition, computational analysis reveals the robustness and hysteresis of the yeast sporulation network and uncovers several patterns of sporulation efficiency change caused by double gene deletion. These discoveries may guide future investigation of underlying mechanisms. We have also shown that a hybridized genetic network reconstructed from both temporal microarray data and literature is able to achieve a satisfactory prediction accuracy of the same quantitative phenotypes.
This case study illustrates the value of predicting quantitative phenotypes based on genetic network and provides a generic approach.
Predicting the consequences of environmental cues or genetic perturbations based on genetic network is becoming a powerful tool to understand biological phenomena or gene functions from a systems point of view. Ordinary differentiation equation (ODE) can make detailed predictions on a network but its application is limited by the network size because determining the values of the kinetic parameters for a large number of ODEs and solving these questions are often nontrivial. Recently, one exceptional study was conducted on an archaeon H. salinarum NRC-1 . Subsequent to genome sequencing, a large number of microarray, proteomic and ChIP-chip assays were carried out to reconstruct the genetic network. The great amount of data allowed training of a computational method to predict expression changes of gene modules upon perturbations. Ideally, such a comprehensive and systematic approach can be applied to every organism. However, the tremendous expense and effort are often inhibitory particularly for much more complicated organisms such as human. Alternatively, large-scale networks have been reconstructed from genomic and proteomic data. Although relatively noisier than the genetic networks studied by ODE, which are usually assembled from literature, such large-scale networks can still generate insightful predictions. For example, Marcotte and colleagues have predicted the phenotypes of knocking out genes in yeast and worm using genetic networks reconstructed by integrating various sources of data [2, 3]. However, these networks only represent correlation between genes but not necessarily physical interactions. Predictions are made based on how tightly the gene of interest is correlated with genes annotated with desired specific phenotype. Similar approaches have been applied to predicting gene functions, particularly those related to diseases and thus potential drug targets, based on networks directly reconstructed from genomic and/or proteomic data [4–10].
In the present study, we aim to address the following issues in predicting phenotypes based on genetic network. First, can one perform de novo predictions of phenotypes without relying on existing annotations of genes? If this is feasible, it will not only help make new discoveries but also demonstrate the effectiveness of understanding biological phenomena at a systems level. Second, can one predict a phenotype that is quantitatively measured using a genetic network that consists of physical interactions? A quantitative phenotype may provide a rigorous assessment of the prediction accuracy and physical-interaction network often shed light on understanding the molecular mechanism of phenotype formation. Third, can genomic analysis capture the most prominent features, which may form the major regulatory interactions, of such network? Is this "scaffold" of the network still able to predict the quantitative phenotypes?
We choose the sporulation process in Saccharomyces cerevisiae to perform a case study. All sexually reproducing organisms undergo meiosis in which each diploid cell generates four haploid gametes. The meiotic process in budding yeast is coupled with spore morphogenesis in which spores are formed from the haploid cells. Regulation of yeast sporulation has been studied for years and numerous important regulators have been identified [11, 12]. Genome-wide expression assays have been performed to determine the transcriptional program [13–15]. In addition, effect of single-gene deletions on sporulation efficiency has been determined at a genomic scale, which provided quantitative phenotypic measurements .
We first collect experimental evidence from literature to construct a network that includes both protein-protein interactions and transcription factor (TF)-gene regulatory interactions. We then investigate the dynamics of the network using a Boolean network model. Our study demonstrates that the yeast sporulation network has a robust design and, once sporulation starts, the network topology ensures the completion of the process. We also reconstruct a transcriptional regulatory network for yeast sporulation from genomic data using a computational method called UMMI (U biquitous M odel selector for M otif I nteractions). Comparison between the curated and the predicted networks shows that the most important transcriptional edges of the curated network are correctly identified by UMMI. When the predicted transcriptional edges are combined with necessary non-transcriptional edges taken from literature, the hybrid network shows the same dynamic characteristics and similar predictive power as the fully curated one.
Construct yeast sporulation network from literature
The upper half of Fig. 1 shows that many protein-protein interactions are involved in regulating a master meiotic regulator - Ime1. After Ime1 is activated, it turns on the downstream sporulation activators such as Ime2 and Ndt80 to transcribe EMG and MMG (the lower half of Fig. 1). After both EMG and MMG are transcribed, the yeast cell is committed to complete the sporulation process .
Predict the yeast sporulation efficiency
A genome-wide study was performed previously to quantitatively determine the effect of deleting a single gene on the efficiency of yeast sporulation . A Prespo/Spore ratio, measured by microarray, represents the percentage of a single deletion strain that can complete sporulation. If the ratio is larger than one, the deleted gene is considered as sporulation deficient; otherwise, sporulation efficient.
We choose Boolean network to analyze the curated network and search for the fixed points (Fig. 1). We follow the previous work of  in updating the network state (defined as the states of all nodes in the network) using a Markov chain (see Methods). The only modification to the previous method is the inclusion of a logical AND function to mimic the effect of an AND node. We also define a product function to quantify the sporulation percentage using the two markers' states: when EMG and MMG are both in state "1", sporulation is complete; otherwise, sporulation is incomplete. Perturbations to the network can be performed by clamping a node to state "0", for deleting a gene, or removing an edge, for disrupting an interaction. To have a direct comparison with the measured Prespo/Spore ratio, we calculate the ratio of sporulation percentage before perturbation versus that after perturbation, denoted by a symbol a (see Methods). This is done by enumerating all possible initializations of Boolean networks with and without clamping the deleted gene to state "0". In the same way, we have also evaluated the effect of other types of perturbations to the network, such as deleting an edge or deleting multiple genes (see below).
To further illustrate obtaining such a correlation is nontrivial, we perform a negative control experiment by looking at the correlation between the "absolute" sporulation efficiency change caused by deleting a gene and the averaged or minimal shortest path from each gene to EMG and MMG. To calculate the "absolute" sporualtion efficiency change, Prespo/Spore ratios smaller than 1.0 are inversed. A negative correlation is thus expected because the shorter a gene's path to the markers, the larger its influence. If the averaged path is used, we have a Pearson correlation of -0.45 with a P-value of 0.04 and a Spearman rank correlation of -0.53 with a P-value of 0.01. If the minimal path is used, a Pearson correlation of -0.45 with a P-value of 0.03 and a Spearman rank correlation of -0.49 with a P-value of 0.02 are obtained. Both the correlation and the statistical significance are significantly lower than Boolean network predictions.
Robustness and hysteresis of the sporulation network
Positive and negative feedback loops of the three regulators of sporulation*.
Effects of removing positive feedback loops.
In addition to positive feedbacks, there are two negative feedback loops for Ime1 (Table 1). Such architecture determines that Ime1 forms a hysteretic switch of sporulation: Ime1 is absolutely needed to initiate the meiotic process; however, Ime1 becomes unneeded after the cell commits to sporulation. Indeed, it is known to be important for the yeast cell to inactivate Ime1 once the sporulation-specific genes have been transcribed . Consistently, we observe that Ime1 is in the final state "1" in only 44% of all possible initializations that lead to sporulation in the curated network, indicating the importance of the negative feedbacks. To further confirm this, we perturb the two negative feedback loops (Table 1) by deleting the repression edges to Ime1. Removing either Cln2/Cdc28--|Ime1 or Ime2/Rim4--|Ime1 raises the percentage to 61% in both cases. Removing both edges raises the percentage to 70%.
Predictions of other perturbations' effects on sporulation
We also find that the self-activation of the meiotic activators has minor impact on sporulation efficiency (Additional file 1, Table S4). Even when the self-activation of all five activators is disrupted, the effect is slightly sporulation deficient (a = 1.27). However, the PKA pathway plays an important role in suppressing sporulation as deletion of cAMP/PKA node is sporulation efficient (a = 0.74). This is consistent with the known role of this pathway in the literature . All these computational predictions are novel and can guide the future experimental investigations of the sporulation mechanisms.
Uncover transcriptional regulatory interactions of sporulation by a computational method
We finally exploit a computational approach, UMMI, for de novo discovery of the transcriptional regulatory interactions during the sporulation. UMMI is an extension to our previous method, GBNet , which aims to find sequence constraints, such as co-occurrence of two motifs and distance constraint between them, enriched in a group of co-regulated genes. Based on the rules identified, target genes of a TF can be inferred. Unlike GBNet that relies on gene clusters generated from multiple microarray experiments, UMMI can be applicable to a single gene expression experiment. In addition, we also develop a measurement in UMMI to control the reliability of the models discovered (see Methods). The gene expression data from Chu et al.  is used in our analysis, which covers seven time points of sporulation: Metabolic (0 h), Early I (0.5 h), Early II (2 h), Early-Mid (5 h), Middle (7 h), Mid-Late (9 h) and Late (11.5 h).
We have compiled a list of 794 DNA motifs in yeast, including known motifs taken from literature and computationally generated ones (see Additional file 1). At each time point, all the genes are divided into 5 groups based on their expression levels (see Methods). UMMI is then used to find the combination of motifs and sequence constraints between these motifs that are associated with gene expression levels. UMMI finds several highly reliable constraints at each time point (Additional file 1, Table S1) that pass a reliability threshold (frequency of occurrence in the learned models) of 0.1. Based on these sequence constraints, we have recovered the known key transcription factors (TFs) in sporulation: Ume6, Ndt80 and Sum1. Furthermore, we identify 75 Ume6's target genes that satisfy the Ume6's sequence constraints and show at least 2-fold over-expression at early stages of sporulation (0.5-5 h). The functions of these genes indicate that they play important roles in sporulation. For example, the top three enriched gene ontology (GO) terms of biological process are: M phase of meiotic cell cycle (3.1E-18), meiosis I (2.2E-17) and reciprocal meiotic recombination (1.3E-14) (See Additional file 4, Table S5). We also identify 263 and 121 target genes whose expression levels have at least 2-fold elevation at middle stages (5-9 h) and satisfy the sequence constraints for Ndt80 and Sum1, respectively. The top three enriched GO terms of biological process are: spore wall assembly (4.5E-17 and 3.5E-21), sporulation (4.4E-15 and 1.0E-18) and ascospore formation (1.8E-14 and 1.3E-18) (P-values for Ndt80 and Sum1, respectively) (See Additional file 5, Table S6 and Additional file 6, Table S7). Ndt80 and Sum1 share 49 common targets whose top three enriched GO terms for biological process are strongly associated with sporulation: spore wall assembly (1.1E-16), sporulation (3.2E-16) and ascospore formation (1.3E-15) (See Additional file 7, Table S8 for full list).
Known sporulation regulators as targets of Ume6, Ndt80 and Sum1.
RIM4, IME2, NDT80
CLB1, NDT80, IME2, IME1
Interestingly, the predicted network is able to achieve comparable accuracy on sporulation efficiency prediction as the curated network (Additional file 1, Table S3). The Pearson correlation between the computational prediction and the experimental measurement is 0.87 with a P-value of 5.8E-2 and the Spearman rank correlation is 0.67 with a P-value of 0.27. It should be noted that the dataset used to calculate the correlations for the predicted network is very small (five data points) and therefore the P-values are not highly significant. Nevertheless, these encouraging results suggest that the most prominent transcriptional regulatory interactions captured by genomic data can be recovered by computational methods combined with literature curation and such a hybrid network still has a satisfactory predictive power of phenotypes.
Discussion and Conclusions
Accurately predicting phenotypes based on genetic network that constitutes physical interactions can provide great mechanistic insight into phenotype formation. We have conducted a case study in yeast sporulation by predicting a quantitative phenotype, sporulation efficiency change after deleting a gene, based on a network assembled from the existing knowledge. Such a physical interaction network illustrates how the perturbations are propagated in the network to cause phenotype formation. Importantly, our predictions are de novo and only rely on network topology. This study is the first to reveal the direct relationship between network topology and phenotype formation.
It is no doubt that there are genes and/or links missing in the reconstructed network which is still small. However, the satisfactory prediction accuracy suggests that the major regulatory interactions have been uncovered. We have also demonstrated that computational methods can extract the most prominent features of the transcriptional regulation captured by genomic data. With a minimal set of protein-protein interactions added, such a scaffold network shows promising predictive power. Such a network can still be noisy but may contain the key regulatory interactions that are important to correctly predict phenotypes to a satisfactory extent.
We choose Boolean network to analyze the dynamics and robustness of the yeast sporulation network. Compared with the differential equation approach, Boolean network does not provide the detailed temporal change of each gene/protein or cooperation between genes such as the competitive binding of Ndt80 and Sum1 . On the other hand, it allows study of the network in Fig. 1 with 29 nodes, which is often difficult to determine all the kinetic parameters needed in differential equations and even more challenging to solve them. Encouragingly, such a simple and easy to implement model can make de novo predictions of phenotypes accurately. It would be interesting to explore the power of this model on much larger networks with hundreds or even thousands of nodes once the data for reconstructing the networks become available.
In this work, we predict the phenotype based on enumerating all possible initializations. The computational cost for such an approach is exponential in scale. For example, the computational time for a Boolean network with 30 nodes is 2 hours. It will be 83 days for a Boolean network with 40 nodes, and 228 years for 50 nodes. Instead of enumerating all possible initializations, we have tried to sample only a small number of the initializations at random. We find such a sampling approach can make predictions with high precisions (results not shown). Other researchers have made similar observations about Boolean networks: given the number of nodes and the node connectivity, randomly sampled Boolean networks have similar global properties . We thus argue that the random sampling technique may allow extension of our approach to a much larger network in the future.
where s ij is the state of parent node j of node i at time point t.
In this work, we initialize the network with all possible states and update all nodes synchronously. All the states then evolve into a set of converged states called attractors . A product function can be defined on the attractors, e.g. the percentage of node × being in state "1" among all attractors. The value of the product function therefore reflects the stable and overall dynamics of the gene regulatory network under study, which is also determined by the Boolean function and the connectivity among the nodes in the network.
where x represents nodes, e represents edges and I is the identity function (equals one if the condition is satisfied or zero otherwise). Perturbation to the network can be represented by clamping the corresponding nodes to "0" or deleting the relevant edges. For example, deletion of gene i corresponds to f(xi = 0, e) and deletion of edge between gene i and j corresponds to f(x,e(i, j) = 0).
The predictive power of our model can therefore be evaluated by the correlation between the values of a and the Prespo/Spore ratios.
where N S is network structure, D is data, Γ(.) is the gamma function, Np is the number of parent nodes, log10(K) is a network parameter to penalize the complex models, q is the number of possible parent states, r + 1 is the number of possible child states, , , N jk is the number of samples for child state k when parent state is j, a jk is a prior count. In GBNet, we only considered the case of r = 1, i.e. the child node is a binary variable. In order to achieve efficient structure learning, we have utilized a Gibbs sampling to search for global optimum and applied GBNet successfully to several datasets in yeast and human .
UMMI is an extension of GBNet with the flexibility to analyze a single microarray experiment. In UMMI, we extend the GBNet framework to consider a child node with more than two categories. That is, we allow r > 1 in Eq. 1. When analyzing a single microarray data, we first separate all genes into multiple categories and each category represents genes with similar expression levels (Additional file 1, Fig. S2). In this study, we choose to use five categories (r = 4) that span the whole spectrum of the gene expression levels with equal intervals. As in GBNet , the gene category labels and promoter sequences (600 bps upstream of the start codon) are fed into UMMI to learn the sequence constraints.
Each motif's ability to discriminate gene categories is evaluated by a Bayesian score which is the logarithm of the posterior probability of the Bayesian network (Eq. 1). In GBNet , we first rank the motifs by their Bayesian scores; a motif and its associated sequence constraints with higher rank are always tested before the motifs with lower rank . To avoid data-overfitting in Bayesian network learning , each model learned is only allowed to have a small number of parent nodes (i.e. regulatory rules). Therefore, those highly ranked motifs may dominate the results. To avoid this possible bias, UMMI generates 51 models for each gene expression dataset (Additional file 1, Fig. S2): one model is obtained by using motifs ranked by their Bayesian scores as the input, the other 50 models by using motifs in random order as the input. This way we hope to avoid bias towards the top ranked motifs and to obtain models that may have lower Bayesian scores but are still biologically meaningful. From the 51 learned models, we calculate each sequence constraint's occurrence and only consider significantly present sequence constraints as reliable. A heuristic threshold of 0.1 for occurrence is used in this study.
This study was partially supported by NIH (R01GM072856 to W.W.).
- Bonneau R, Facciotti MT, Reiss DJ, Schmid AK, Pan M, Kaur A, Thorsson V, Shannon P, Johnson MH, Bare JC, et al.: A predictive model for transcriptional control of physiology in a free living cell. Cell. 2007, 131 (7): 1354-1365. 10.1016/j.cell.2007.10.053View ArticlePubMedGoogle Scholar
- McGary KL, Lee I, Marcotte EM: Broad network-based predictability of Saccharomyces cerevisiae gene loss-of-function phenotypes. Genome Biol. 2007, 8 (12): R258- 10.1186/gb-2007-8-12-r258PubMed CentralView ArticlePubMedGoogle Scholar
- Lee I, Lehner B, Crombie C, Wong W, Fraser AG, Marcotte EM: A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat Genet. 2008, 40 (2): 181-188. 10.1038/ng.2007.70View ArticlePubMedGoogle Scholar
- Kim WK, Krumpelman C, Marcotte EM: Inferring mouse gene functions from genomic-scale data using a combined functional network/classification strategy. Genome Biol. 2008, 9 (Suppl 1): S5- 10.1186/gb-2008-9-s1-s5PubMed CentralView ArticlePubMedGoogle Scholar
- Shen L, Liu J, Wang W: GBNet: Deciphering regulatory rules in the co-regulated genes using a Gibbs sampler enhanced Bayesian network approach. BMC Bioinformatics. 2008, 9 (1): 395- 10.1186/1471-2105-9-395PubMed CentralView ArticlePubMedGoogle Scholar
- Wu X, Jiang R, Zhang MQ, Li S: Network-based global inference of human disease genes. Mol Syst Biol. 2008, 4: 189- 10.1038/msb.2008.27PubMed CentralView ArticlePubMedGoogle Scholar
- Chen J, Aronow BJ, Jegga AG: Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinformatics. 2009, 10: 73- 10.1186/1471-2105-10-73PubMed CentralView ArticlePubMedGoogle Scholar
- Karni S, Soreq H, Sharan R: A network-based method for predicting disease-causing genes. J Comput Biol. 2009, 16 (2): 181-189. 10.1089/cmb.2008.05TTView ArticlePubMedGoogle Scholar
- Huttenhower C, Haley EM, Hibbs MA, Dumeaux V, Barrett DR, Coller HA, Troyanskaya OG: Exploring the human genome with functional maps. Genome Res. 2009, 19 (6): 1093-1106. 10.1101/gr.082214.108PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu J, Zhang B, Smith EN, Drees B, Brem RB, Kruglyak L, Bumgarner RE, Schadt EE: Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet. 2008, 40 (7): 854-861. 10.1038/ng.167PubMed CentralView ArticlePubMedGoogle Scholar
- Kassir Y, Adir N, Boger-Nadjar E, Raviv NG, Rubin-Bejerano I, Sagee S, Shenhar G: Transcriptional regulation of meiosis in budding yeast. Int Rev Cytol. 2003, 224: 111-171. 10.1016/S0074-7696(05)24004-4View ArticlePubMedGoogle Scholar
- Vershon AK, Pierce M: Transcriptional regulation of meiosis in yeast. Current Opinion in Cell Biology. 2000, 12 (3): 334-339. 10.1016/S0955-0674(00)00104-6View ArticlePubMedGoogle Scholar
- Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I: The Transcriptional Program of Sporulation in Budding Yeast. Science. 1998, 282 (5389): 699-705. 10.1126/science.282.5389.699View ArticlePubMedGoogle Scholar
- Primig M, Williams RM, Winzeler EA, Tevzadze GG, Conway AR, Hwang SY, Davis RW, Esposito RE: The core meiotic transcriptome in budding yeasts. Nat Genet. 2000, 26 (4): 415-423. 10.1038/82539View ArticlePubMedGoogle Scholar
- Friedlander G, Joseph-Strauss D, Carmi M, Zenvirth D, Simchen G, Barkai N: Modulation of the transcription regulatory program in yeast cells committed to sporulation. Genome Biology. 2006, 7 (3): R20- 10.1186/gb-2006-7-3-r20PubMed CentralView ArticlePubMedGoogle Scholar
- Deutschbauer AM, Williams RM, Chu AM, Davis RW: Parallel phenotypic analysis of sporulation and postgermination growth in Saccharomycescerevisiae. Proceedings of the National Academy of Sciences of the United States of America. 2002, 99 (24): 15530-15535. 10.1073/pnas.202604399PubMed CentralView ArticlePubMedGoogle Scholar
- Saccharomyces Genome Database. http://www.yeastgenome.org/
- Li F, Long T, Lu Y, Ouyang Q, Tang C: The yeast cell-cycle network is robustly designed. Proceedings of the National Academy of Sciences of the United States of America. 2004, 101 (14): 4781-4786. 10.1073/pnas.0305937101PubMed CentralView ArticlePubMedGoogle Scholar
- Wang W, Cherry JM, Nochomovitz Y, Jolly E, Botstein D, Li H: Inference of combinatorial regulation in yeast transcriptional networks: A case study of sporulation. Proceedings of the National Academy of Sciences. 2005, 102 (6): 1998-2003. 10.1073/pnas.0405537102.View ArticleGoogle Scholar
- Guttmann-Raviv N, Martin S, Kassir Y: Ime2, a Meiosis-Specific Kinase in Yeast, Is Required for Destabilization of Its Transcriptional Activator, Ime1. Mol Cell Biol. 2002, 22 (7): 2047-2056. 10.1128/MCB.22.7.2047-2056.2002PubMed CentralView ArticlePubMedGoogle Scholar
- Suka N, Carmen AA, Rundlett SE, Grunstein M: The regulation of gene activity by histones and the histone deacetylase RPD3. Cold Spring Harb Symp Quant Biol. 1998, 63: 391-399. 10.1101/sqb.1998.63.391View ArticlePubMedGoogle Scholar
- Gershenson C: Classification of Random Boolean Networks. Artificial Life VIII: Proceedings of the Eight International Conference on Artificial Life: 2002; Sydney, Australia. 2002, 1-8. MIT PressGoogle Scholar
- Kauffman SA: Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology. 1969, 22 (3): 437-467. 10.1016/0022-5193(69)90015-0View ArticlePubMedGoogle Scholar
- Liang S, Fuhrman S, Somogyi R: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing: 1998; Hawaii, United States of America. 1998, 18-29.Google Scholar