Research article | Open access
Quantitative maps of genetic interactions in yeast - Comparative evaluation and integrative analysis
BMC Systems Biology volume 5, Article number: 45 (2011)
High-throughput genetic screening approaches have enabled systematic means to study how interactions among gene mutations contribute to quantitative fitness phenotypes, with the aim of providing insights into the functional wiring diagrams of genetic interaction networks on a global scale. However, it is poorly known how well these quantitative interaction measurements agree across the screening approaches, which hinders their integrated use toward improving the coverage and quality of the genetic interaction maps in yeast and other organisms.
Using large-scale data matrices from epistatic miniarray profiling (E-MAP), genetic interaction mapping (GIM), and synthetic genetic array (SGA) approaches, we carried out here a systematic comparative evaluation among these quantitative maps of genetic interactions in yeast. The relatively low association between the original interaction measurements or their customized scores could be improved using a matrix-based modelling framework, which enables the use of single- and double-mutant fitness estimates and measurements, respectively, when scoring genetic interactions. Toward an integrative analysis, we show how the detections from the different screening approaches can be combined to suggest novel positive and negative interactions which are complementary to those obtained using any single screening approach alone. The matrix approximation procedure has been made available to support the design and analysis of the future screening studies.
We have shown here that even if the correlation between the currently available quantitative genetic interaction maps in yeast is relatively low, their comparability can be improved by means of our computational matrix approximation procedure, which will enable integrative analysis and detection of a wider spectrum of genetic interactions using data from the complementary screening approaches.
The recent advances in experimental biotechnologies have made it possible to start screening genome-wide datasets of quantitative genetic interactions in model organisms such as yeast [1–3]. High-throughput genetic screening approaches, such as those based on epistatic miniarray profiling (E-MAP) [4–7], genetic interaction mapping (GIM), and synthetic genetic array (SGA) [9–11], have provided systematic means for the global investigation of the quantitative relationship between genotype and phenotype, with potential implications for a wide range of biological phenomena, including, for instance, modularity, essentiality, redundancy, buffering, epistasis, evolution, canalization and development of human disease [1–3, 12–21]. The rapid accumulation of quantitative genetic interaction data is providing us with unique opportunities to decipher how genes function as networks to regulate cellular processes and to maintain mutational robustness. However, the massive datasets also call for principled modelling frameworks and efficient analytic approaches to take full advantage of the in-depth information encoded in the available and emerging quantitative interaction datasets. In particular, efficient bioinformatics procedures enabling integrative analysis of multiple datasets from various screening approaches could increase the quality and coverage of the genetic interaction maps, with the aim of completing the genetic interaction networks in yeast and other organisms.
Comparing the results from the alternative experimental approaches is crucial for validating the observed interactions, estimating the biases related to each approach, and filling the gaps in the currently incomplete datasets. It is therefore likely that comprehensive mapping of the quantitative genetic interaction networks will require integration of a number of datasets from different screening approaches, similar to the recent efforts to complete the physical protein-protein interaction (PPI) networks in yeast and human [23–28]. A major challenge in such integrative analysis is that quantitative interaction data generated with the complementary experimental approaches in different laboratories are not directly comparable, due to differences, for instance, in experimental designs, growth conditions or screening protocols as well as in data pre-processing or scoring options. Even when the same mutant pairs are considered, the technical variation can lead to some disagreement in the detection results and to relatively large inconsistency between the datasets in general [8, 11]. The correction for such discrepancy can be beyond the capacity of the customized data processing techniques used within the individual screening approaches [29, 30]. A common modelling framework, adjusted for the different screening approaches, could improve the comparability of the results and allow for integrative analysis.
Compared to PPI networks, an additional challenge originates from the quantitative nature of the genetic interaction datasets; instead of comparing the overlap in binary terms, such as presence or absence of a physical interaction, here we should take into account the full spectrum of genetic interactions, ranging from extreme cases of negative interactions (i.e., synthetic sick and lethality) to the positive classes of interacting pairs (e.g., masking and suppression subcategories) [2, 3, 17]. We have recently shown that the quantitative data matrices obtained from the individual quantitative screening approaches can capture different portions of this spectrum, as compared to known classes of genetic interactions; for instance, the SGA and GIM datasets captured the negative classes of interactions relatively well, whereas the prediction of the positive interactions proved much more challenging when using the provided double-mutant fitness data alone. Similar observations have also been made when using the highly processed E-MAP data [32, 33]. To improve the predictive power of the individual quantitative datasets, we further developed our computational matrix approximation strategy, and showed that it could transform the original fitness matrices so that these allow for better discrimination of not only the negative but also the positive end of the interaction spectrum from the background variability.
In the present study, toward combining the quantitative detections from multiple large-scale genetic interaction approaches, we investigated the consistency among the currently available quantitative interaction datasets in yeast, as well as the sensitivity and specificity of the genetic interactions detected by using the three screening approaches (SGA, GIM and E-MAP), with respect to their overlap in common mutant pairs and coverage of known interacting pairs, as extracted from a gold-standard reference database of genetic interactions (BioGRID). We first show that the comparability of the detections between the different approaches can be improved using a standardized matrix-based modelling framework within each individual dataset. Using appropriate scoring and aggregation functions, we then demonstrate how the detections from the different screening approaches can be combined more effectively than when using the individual datasets alone, suggesting that the matrix approximation-based meta-analytic procedure allows for the full exploitation of the existing data when predicting novel interactions or designing new experiments. To promote its widespread usage in future screening studies, we have made publicly available an efficient, stand-alone R-implementation of the quantile-based matrix approximation procedure (QMAP), which includes a number of user-adjustable options that can be used to fine-tune the procedure for any given experimental dataset.
Results and Discussion
Scoring of quantitative genetic interactions
We have previously introduced a matrix-based modelling and approximation framework, and showed that it provides a quantitative and efficient means for scoring genetic interactions among thousands of genes, thereby leading to improved detection of both positive and negative pairs of interactions in large-scale quantitative screening experiments [31, 34]. Briefly, the matrix approximation strategy is based on the observation that most gene pairs in the large-scale genetic interaction screens have no significant interaction with each other [2, 3]. This implies that the single-mutant fitness effects, which are needed in the interaction scoring, could be estimated using solely the information encoded in the observed double-mutant fitness matrix $W$, with entries $w_{ab}$ corresponding to the $m$ query and $n$ array strains, respectively, that is, $a = 1, 2, \ldots, m$ and $b = 1, 2, \ldots, n$. The underlying idea of the matrix approximation is to decompose the original fitness matrix into separate components, $W = x \otimes y$, where the $m$- and $n$-dimensional vectors $x$ and $y$ model the variability across the array and query mutants, respectively [31, 34].
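The decomposition idea can be illustrated with a minimal sketch. The code below is not the published QMA procedure; it is a simplified stand-in that alternates row and column medians (the 0.5 quantile), under the assumptions that the fitness matrix is complete and strictly positive. The function name `rank_one_median_fit` and the iteration count are hypothetical choices made for this illustration.

```python
from statistics import median

def rank_one_median_fit(w, n_iter=20):
    """Robustly fit w[a][b] ~ x[a] * y[b] by alternating medians.

    Simplified illustration only (not the published QMA algorithm).
    Assumes every entry of w is strictly positive and none is missing.
    """
    m, n = len(w), len(w[0])
    x = [1.0] * m  # row (query) effects, initialised to neutral fitness
    y = [1.0] * n  # column (array) effects
    for _ in range(n_iter):
        # Each column effect is the median ratio over all rows.
        y = [median(w[a][b] / x[a] for a in range(m)) for b in range(n)]
        # Each row effect is the median ratio over all columns.
        x = [median(w[a][b] / y[b] for b in range(n)) for a in range(m)]
    return x, y

# Toy matrix: an exact outer product with one strongly aggravating pair.
x_true, y_true = [1.0, 0.8, 0.6], [0.9, 0.7, 0.5]
w = [[xa * yb for yb in y_true] for xa in x_true]
w[0][0] = 0.2  # simulated negative interaction
x_hat, y_hat = rank_one_median_fit(w)
residual = [[w[a][b] - x_hat[a] * y_hat[b] for b in range(3)]
            for a in range(3)]
```

Because most entries follow the rank-one pattern, the medians ignore the single outlier and the residual isolates it. Only the product of the two vectors is identifiable here; the individual vectors are determined up to a common scaling factor.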
In the symmetric case, the above equation expresses in matrix notation the well-established multiplicative null model, $w_{ab} = w_a w_b$, which states that the expected neutral phenotype of an organism's fitness, under the null hypothesis that it carries two non-interacting mutations ($a$ and $b$), can be estimated by the product of the corresponding single-mutant fitness effects ($w_a$ and $w_b$, respectively). It was shown on symmetric, high-resolution data that the product function is the best null model among a family of alternative models (minimum, additive and log functions), in the sense that it yields a distribution with location close to zero and low dispersion over all of the measured deviations $\varepsilon_{ab} = w_{ab} - w_a w_b$ [35, 36]. In the non-symmetric case, $n \neq m$, even though the single-mutant effects $x$ and $y$ are not necessarily equal, these together can provide individual estimates for $w_a$ and $w_b$, respectively. In the present work, the estimation of $x$ and $y$ was performed using a robust, rank-one matrix approximation method, named quantile-based matrix approximation (QMA).
After performing the approximation of the double-mutant fitness matrix $W$ under the null multiplicative model, the interaction class of a mutant pair $(a, b)$ can be predicted using a specific scoring function $s(x, y)$, such as minimum, maximum, product or scaled epistasis [13, 35, 36], which transforms the original fitness matrix into a score (or residual) matrix $s_{ab} = w_{ab} - s(x_a, y_b)$. It has been shown before that there exist effective alternatives to the traditional product function when further classifying the significant genetic interactions into the positive and negative classes [13, 31]. Accordingly, the score values $s_{ab}$ can be used in place of the traditional deviations $\varepsilon_{ab}$ to test for a genetic interaction between genes $a$ and $b$, where a large absolute score provides evidence for genetic interaction, while scores close to zero indicate non-interacting gene pairs. The positive interactions (or alleviating epistatic effects) should result in positive scores ($s_{ab} > 0$), and the negative interactions (aggravating epistatic effects) in negative scores ($s_{ab} < 0$), with synthetic lethality being the extreme case ($w_{ab} = 0$).
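As a hedged illustration of the residual scoring described above, the following sketch computes the score matrix for three of the named scoring functions (product, minimum, maximum). The scaled epistasis function is omitted because its definition is not given in this text, and the names `SCORING_FUNCTIONS` and `score_matrix` are hypothetical.

```python
# Candidate scoring functions s(x_a, y_b); the labels follow the text.
SCORING_FUNCTIONS = {
    "product": lambda xa, yb: xa * yb,
    "minimum": min,
    "maximum": max,
}

def score_matrix(w, x, y, scoring="product"):
    """Return s[a][b] = w[a][b] - s(x[a], y[b]) for the chosen function."""
    s = SCORING_FUNCTIONS[scoring]
    return [[w[a][b] - s(x[a], y[b]) for b in range(len(y))]
            for a in range(len(x))]

# One double mutant with fitness 0.5; single-mutant estimates 0.8 and 0.5.
w, x, y = [[0.5]], [0.8], [0.5]
scores = {name: score_matrix(w, x, y, name)[0][0]
          for name in SCORING_FUNCTIONS}
```

In this toy case the same pair receives a positive score under the product function, a zero score under the minimum and a negative score under the maximum, which shows why the choice of scoring function matters when separating positive from negative interactions.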
Following the lessons learned from the integrative analysis of high-throughput PPI datasets, we first evaluated separately the data from the individual screening approaches (SGA, GIM and E-MAP), against a gold-standard reference database of known interactions (BioGRID). Such within-approach benchmarking resulted in specific parameter combinations for the data-adjusted QMA estimates and scoring functions for positive and negative genetic interaction classes (Additional File 1). In the following analyses, we utilized these same parameters and scoring functions to assess their robustness, and to demonstrate the relative advantages of the generic matrix approximation strategy, in terms of both improved comparability of the interaction scores and integrative detection of genetic interactions among the screening approaches, in comparison to using the individual datasets alone. Our specific focus here is on the detection of pairs of positive interactions, the accurate scoring of which has been challenging in the past despite the quantitative approaches.
Agreement between the quantitative datasets
Using the datasets available from three representative screening approaches [5, 8, 10], we started with pairwise comparisons among the three datasets and characterized the number of common pairs of array and query mutants shared by the datasets (Table 1), as well as the distribution of the known pairs of positive and negative interactions into the data intersections (Table 2). The number of shared mutant pairs was largest in the SGA - E-MAP data pair (184 077 common pairs), second largest in the SGA - GIM data pair (58 215 common pairs), and smallest in the GIM - E-MAP data pair (12 461 common pairs). To investigate the coverage of the known pairs of genetic interactions in the three datasets, we used the existing information on genetically interacting pairs as available in the gold-standard BioGRID database. For the positive class of interactions, we combined the 'Positive Genetic' and 'Phenotypic Suppression' categories, which are composed of alleviating mutant pairs, and for the negative class of interactions, we merged four of BioGRID's aggravating categories, namely, 'Negative Genetic', 'Synthetic Growth Defect', 'Synthetic Lethality', and 'Phenotypic Enhancement'.
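The category merging described above can be written down as a small lookup. This is only a sketch: the category labels are those quoted in the text, while the function name `interaction_class` and the assumption that each record carries its category as a plain string are hypothetical.

```python
# BioGRID evidence categories merged into the two reference classes.
POSITIVE_CATEGORIES = {"Positive Genetic", "Phenotypic Suppression"}
NEGATIVE_CATEGORIES = {
    "Negative Genetic",
    "Synthetic Growth Defect",
    "Synthetic Lethality",
    "Phenotypic Enhancement",
}

def interaction_class(category):
    """Map one evidence category to 'positive', 'negative' or None."""
    if category in POSITIVE_CATEGORIES:
        return "positive"
    if category in NEGATIVE_CATEGORIES:
        return "negative"
    return None  # category not used in the reference classes

example = [interaction_class(c) for c in
           ("Phenotypic Suppression", "Synthetic Lethality", "Other")]
```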
Even if the interactions extracted from the three datasets under study were pairwise deleted from BioGRID's genetic interaction categories (Table 2), there may remain some bias in these categories toward the E-MAP approach due to the large number of interactions identified in the three other large-scale E-MAP studies [4, 6, 7]. If these had also been excluded from the comparative analyses, the sizes of the reference positive and negative classes would have become much smaller, hence hindering the comparative evaluations. Due to this potential bias, the interaction detection results for the data pairs other than the SGA - GIM should be interpreted with caution. Moreover, it was not initially expected that the matrix approximation could provide any further improvements in the E-MAP data, since this data has already been heavily pre-processed and custom-scored against an expected fitness, resulting in a symmetric and close to zero-centered data matrix. Therefore, we focus here on illustrating the benefits of QMA-based integrative analysis using the detection of positive interactions in the SGA - GIM data pair as our principal case study; however, the full set of results is provided in Additional files 2 - 7.
The correlations between the double-mutant fitness matrices were relatively poor among all the three dataset pairs (Table 3); especially striking is the negative correlation between the SGA and GIM fitness matrices (Pearson correlation -0.099 and Spearman correlation -0.021). Beyond the original fitness matrices, we also evaluated, as a point of comparison for our QMA-based scoring system, their scored versions using the custom-designed scoring systems in the SGA dataset (referred to as 'SGA custom score'), and the median estimate for the single-mutant effects with product scoring function in the GIM and E-MAP datasets (referred to as 'GIM/E-MAP median score'). We found, however, that the correlation of the interaction scores between the screening approaches remained relatively low even after such scoring of the individual datasets (Table 3). As expected, the comparability of the originally scored E-MAP dataset did not improve by the use of an additional median scoring, especially when the non-parametric Spearman's rank correlation coefficient was used as the measure of association between the datasets. In contrast, the SGA and GIM datasets benefited to some extent from their individual scorings.
The reason for the negligible correlation between the SGA and GIM datasets is clearly visible in their scatter plot (Figure 1). It is very difficult to see any apparent patterns of association between the original double-mutant fitness measurements, even for those mutant pairs coding for known genetic interactions (Figure 1A). The custom-scored versions could not provide much improvement in their association, especially for the positive pairs of interactions (Figure 1B). Similar observations were also made in the other dataset pairs (Additional File 2). It should be noted, however, that for reproducible identification of genetic interactions, it suffices that the datasets share similar levels of interaction scores for the most extreme pairs (here the 3% quantiles were used as an expected rate of interactions; Table 2). Similarly, since the ranking of the mutant pairs in terms of their evidence for genetic interactions is of more practical and biological importance, and due to the sensitivity of the Pearson correlation to data transformations and outlier pairs, we will use the more robust Spearman's rank correlation coefficient as our principal measure of association between the quantitative genetic interaction datasets in the next sections.
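To make the contrast between the two association measures concrete, here is a minimal sketch of Spearman's rank correlation computed as the Pearson correlation of average ranks. The toy score vectors are invented for illustration and are not taken from any of the datasets; the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(u, v):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def spearman(u, v):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(u), average_ranks(v))

# Two toy score vectors with identical ordering but different scales.
scores_a = [0.1, 0.4, 0.35, 0.8]
scores_b = [1.0, 100.0, 10.0, 1000.0]
rho = spearman(scores_a, scores_b)
r = pearson(scores_a, scores_b)
```

The two vectors order the pairs identically, so the rank correlation is perfect, whereas the Pearson coefficient is lowered by the nonlinear scale of the second vector.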
Predictive relationship between the datasets
To investigate whether the matrix approximation-based scoring strategy could enhance the between-approach comparability of the quantitative information encoded in the double-mutant fitness matrices, we next used the same estimation parameters and scoring functions defined in the previous within-approach evaluations. Briefly, three parameter combinations for the two QMA parameters were specified for each dataset: one for detecting all the interaction classes simultaneously (referred to as 'fixed setting'), and the others for detecting either the negative or positive classes separately ('adjusted settings'). The scoring functions were also shown to be specific to the alleviating and aggravating interaction classes (Additional File 1). With the use of these QMA-based single-mutant fitness estimation and interaction scoring options, there was an increasing trend in the Spearman's rank correlation coefficient between all the three datasets, when compared to the original double-mutant fitness measurements or the reference scoring approaches, especially when the adjusted QMA setting was used for the positive interactions (Figure 2). Interestingly, with QMA adjusted to negative interactions, the original SGA double-mutant fitness matrix provided better correlation with the GIM dataset than when using the custom-designed SGA scores (Additional File 3).
The relatively low Spearman's rank correlation in the interaction scores between the SGA and GIM datasets is also visible in their rank-based scatter plots (Figure 3A). Even if there were no clear patterns of association in the interaction rankings as a whole, the bulk of the positive pairs were supported consistently by both of the datasets, with only relatively few interaction pairs lying near one of the data axes only. Such mutant pairs with discrepancy in their interaction scores may be either due to differences in the screening approaches or due to false positive findings. The interaction pairs lying in the middle of the rank scatter plot are likely to correspond to true non-interacting mutant pairs (Figure 3A). With the negative interaction pairs, there seems to be more variability between the datasets, which may be attributable to the fact that we used here the fixed QMA parameters and scoring functions chosen for positive interactions for illustration purposes. Moreover, the number of known negative interactions is much higher than the number of positive interactions in the datasets (Table 2). Even so, the enrichment of both positive and negative pairs at the extreme corners of the two-dimensional grid was highly statistically significant (Figure 3B). Similar findings were also seen in the other datasets (Additional File 4).
Although the Spearman's correlation is useful for evaluating an overall association between interaction datasets, it may be dominated by the non-interacting pairs near the zero scores, which often are not the most interesting from the biological point of view. To evaluate the agreement between the interaction scores among the most extreme levels, we next tested how accurately one can predict the 3% of the most positive values across the datasets using the same options as in Figure 2. As with the rank correlation coefficient, the predictive accuracy increased when moving from the original double-mutant fitness values and their custom or median scores to the QMA-based scores using either its fixed or adjusted settings in all the three data pairs (Figure 4). These results demonstrate that it is possible to find such estimation parameters and scoring functions that can markedly improve the prediction of those most extreme positive interaction scores that are shared across the datasets, compared to using the original fitness values or interaction scores only. In the negative classes of interactions, these baseline prediction accuracies were already much higher, especially in those pairs involving the E-MAP dataset, and, accordingly, the benefits of the QMA procedure were not so evident here (Additional File 5).
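One simple way to look at agreement among the most extreme scores is sketched below. The text does not state which accuracy measure underlies Figure 4, so the overlap fraction used here is an assumption chosen for illustration, as are the function names and the toy data; only the 3% threshold comes from the text.

```python
def top_fraction(scores, fraction=0.03):
    """Indices of the highest-scoring pairs (at least one pair)."""
    k = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def top_overlap(scores_a, scores_b, fraction=0.03):
    """Share of dataset B's top pairs that are also top pairs in A."""
    top_a = top_fraction(scores_a, fraction)
    top_b = top_fraction(scores_b, fraction)
    return len(top_a & top_b) / len(top_b)

# Toy example with 100 shared pairs, so the top 3% contains 3 pairs.
scores_a = [i / 100 for i in range(100)]
scores_b = list(scores_a)
scores_b[97] = 0.0  # one of the top pairs in A scores low in B
overlap = top_overlap(scores_a, scores_b)
```

Here two of the three most positive pairs in the second vector are also among the three most positive pairs in the first, so the overlap is two thirds.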
The modelling framework also makes it possible to avoid performing the single-mutant growth experiments in the large-scale genetic interaction screens, without compromising their quantitative scoring accuracy. Moreover, the model-estimated array-vector was in good agreement with the experimentally-derived single-mutant fitness measurements available in the SGA data (Spearman's correlation ranged from 0.964 to 0.996, depending on whether we use the fixed QMA settings or those adjusted for positive interactions, respectively). Despite such high rank correlation levels, however, there is a significant difference in the location and scaling between the estimated and measured fitness values, indicating that the estimates encode added information for interaction scoring. The QMA settings used here were originally selected on the basis of the pre-release version of the SGA data, which contained only 1277 of the query mutations of the current SGA dataset (75%), thus indicating the robustness of the QMA settings. In the following section, we further highlight the potential of the model-based strategy in integrative analysis by using the same QMA setup selected specifically for the positive interactions, even if this is likely to result in compromised prediction accuracies in the negative interaction classes.
Integrative identification of genetic interactions
After showing that the usage of the matrix approximation-based scoring system in place of the original double-mutant fitness matrix or its custom-scored version can lead to improvements in the comparability between the dataset pairs, we next evaluated whether these observed improvements in the rank correlation or prediction of the extreme pairs could contribute also to improved identification of genetic interactions, when using multiple datasets together, compared to using single datasets alone. To choose an appropriate data integration approach, we first evaluated the predictive performance of four rank aggregation functions (product, minimum, maximum and Borda count, which is effectively the same as the additive function), in terms of how accurately they can detect known pairs of interacting genes. Even if the QMA-based scoring setup was aimed here at the detection of positive interactions, we further tested its prediction capability also for the negative interactions to study its generalization capability beyond the type of interactions it was initially designed for. The prediction performance is illustrated here using the unbiased GIM-SGA data pair, whereas the E-MAP - SGA and E-MAP - GIM pairs are provided in Additional File 6.
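The four rank aggregation functions named above can be sketched as follows. This is an illustration under stated assumptions: rank 1 denotes the strongest evidence for a positive interaction, ties are broken by input order rather than averaged, the Borda count is written as the sum of ranks (the additive function), and all names are hypothetical.

```python
def descending_ranks(scores):
    """Rank 1 = highest score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

AGGREGATORS = {
    "product": lambda ra, rb: ra * rb,
    "minimum": min,                    # liberal: one strong rank suffices
    "maximum": max,                    # conservative: both must be strong
    "borda": lambda ra, rb: ra + rb,   # additive combination of ranks
}

def aggregate_ranks(scores_a, scores_b, method="product"):
    """Combine two score vectors over the same mutant pairs.

    Smaller aggregated values mean stronger combined evidence.
    """
    combine = AGGREGATORS[method]
    ranks_a = descending_ranks(scores_a)
    ranks_b = descending_ranks(scores_b)
    return [combine(ra, rb) for ra, rb in zip(ranks_a, ranks_b)]

# Three shared mutant pairs scored by two hypothetical screens.
scores_a = [0.9, 0.1, 0.5]
scores_b = [0.2, 0.3, 0.8]
combined = {m: aggregate_ranks(scores_a, scores_b, m) for m in AGGREGATORS}
```

For negative interactions the same functions can be applied after negating the scores, so that the most negative pair receives rank 1.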
When combining the interaction scores in the SGA and GIM datasets to detect pairs of positive interactions, the conservative maximum aggregation score gave the best prediction accuracy in terms of the overall AUC (Table 4). However, when focusing on the early sensitivity at the highest specificity levels (or the smallest FPR-levels), which are often more important in many practical applications, the Borda count and rank product were the two best performing methods (Figure 5A). In the detection of the negative interactions in the SGA - GIM dataset pair, the rank product performed better than the Borda count or either of the individual datasets alone (Figure 5B), whereas the liberal minimum rank gave the highest overall AUC performance (Table 4). The good performance of the Borda count and rank product with the positive interactions was also supported by the integrative analysis in the SGA - E-MAP and E-MAP - GIM dataset pairs, especially at the highest specificity levels (Additional File 6). However, the maximum function soon outperformed these two methods when an increasing number of positive interactions were predicted (Table 4, overall AUC). The rank product was found generally best in the prediction of negative interactions in each of the dataset pairs.
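The two evaluation views used above, overall AUC and early sensitivity at small false positive rates, can be sketched as below. This is a generic illustration rather than the evaluation code of the study; it assumes that higher scores mean stronger evidence, that both classes are non-empty, and that score ties are rare. The function names and toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_at_fpr(scores, labels, max_fpr):
    """Highest true positive rate reached while FPR stays <= max_fpr."""
    n_pos = sum(1 for l in labels if l)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    best = 0.0
    for i in order:  # lower the detection threshold one pair at a time
        if labels[i]:
            tp += 1
        else:
            fp += 1
        if fp / n_neg > max_fpr:
            break
        best = tp / n_pos
    return best

# Six mutant pairs, two of which are known interactions (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
early = sensitivity_at_fpr(scores, labels, max_fpr=0.0)
```

When aggregated ranks are evaluated, where small values indicate strong evidence, the ranks can be negated before being passed as scores.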
Taken together, the integrative prediction results in the three dataset pairs show that the Borda count and the rank product performed equally well when the aim is to identify the first candidate set of positive interactions with the highest specificity for follow-up studies, whereas the more stringent maximum function provided the best prediction accuracy when larger numbers of positive interactions are being identified. In the detection of negative interactions, the intermediate rank product consistently showed the best results among all the data pairs, making it an appropriate rank aggregation function in case both positive and negative interactions are being detected using the same setup. In addition to showing the benefits of the integrative detection, these results can also be used for comparative evaluation of the detection power among the individual datasets from the different screening approaches. For instance, on the basis of the same reference set of known interactions on a common set of shared mutant pairs in the SGA and GIM datasets, the GIM approach seems to detect larger numbers of negative interactions particularly well (Table 4), whereas the nearly genome-wide SGA dataset provides comparable detection power in the positive end of the genetic interaction spectrum (Figure 5).
Although the integrative detection based on combined scores was shown to provide marked improvements in the detection of both positive and negative interaction classes when using the SGA and GIM datasets together, it was interesting to note that in the SGA - E-MAP dataset pair, the E-MAP data alone provided extremely good detection accuracies in the positive class of interactions (Table 4). Rather than being a result of the superiority of this particular dataset, this is more likely attributable to the fact that many of the pairs (23%) of positive interactions in the BioGRID originate from the other large-scale genetic interaction screens performed with the E-MAP approach [4, 6, 7] (Table 2). These pairs clearly dominate the joint distribution of the positive interactions, while being supported by the SGA approach to a varying degree (Additional File 4). Interestingly, the detection of the negative interactions by the E-MAP approach alone was found sub-optimal (Table 4). Moreover, the additional benefits gained by the integrative analysis were more pronounced in the GIM - E-MAP than in the SGA - E-MAP data pair (Table 4). These results demonstrate that the intrinsic differences between the screening approaches influence how much they can complement each other.
To our knowledge, the present study is the first systematic and objective comparative evaluation of data from the main large-scale quantitative genetic interaction screening approaches (SGA, GIM and E-MAP). We showed here that even if the association between the original fitness measurements or their interaction scores is relatively low, their comparability can be improved by means of our matrix approximation technique. Toward an integrative analysis, we showed that a multi-approach analysis of quantitative genetic interactions can provide novel findings which are complementary to those obtained using any single screening approach alone. An integrative analysis can therefore provide a systematic means to pool information from previous interaction studies, with the aim of maximizing the number of both positive and negative interactions without compromising the reliability of the detections, as well as of minimizing the number of additional experiments needed when prioritizing future screens. In general, such a computational approach can facilitate the experimental efforts by improving the quality and coverage of the current genetic interaction networks, towards completing the still incomplete information of genetic interactions in yeast, which is, by and large, complementary to that obtained from the physical protein interactions and complexes [1, 5, 11, 17, 39, 40].
Although these results already demonstrate the potential of integrating datasets across different screening approaches using the matrix approximation strategy, more comprehensive studies are warranted in the future that combine experimental data from various types of genetic interaction studies, such as those performed under different environmental conditions, using fitness phenotypes other than growth, or on multiple perturbations or study organisms to investigate questions related, for instance, to plasticity and evolution of genetic networks or higher-order and interspecies interactions [2, 3, 17, 41–47]. Although we illustrated here the feasibility of the integrative analysis through QMA with its previously fixed parameters and scoring functions selected for each screening approach individually, even better prediction accuracies are likely to be obtained after a systematic optimization of these options for each dataset combination, downstream analysis objective, and interaction strength level separately (Additional File 7). The efficient QMA R-package, which includes a number of user-adjustable parameters (Additional File 8), was made available here to enable such tailored matrix approximation that meets the needs of a given study.
A potential limitation of the current evaluation setup is the definition of the reference set of interactions using the BioGRID database. For instance, since the interactions in the BioGRID database originate from multiple genetic interaction screening studies, there can be cases where a mutant pair AB is reported as encoding an interaction, even if BA is not, or where the reciprocal pairs AB and BA are marked as belonging to different classes of interactions. To make sure that such cases do not interfere with the comparative evaluations, we filtered out any ambiguous interaction pairs, and for the remaining interactions, we used the same interaction class for the reciprocal mutant pairs. Moreover, to provide as fair an assessment as possible, we excluded those interactions identified from the datasets under comparison. Therefore, the detection accuracies presented here should be considered as lower bounds for the true accuracy of the screening approaches or their combination. Even if there may still remain some biases, especially toward the well-represented E-MAP approach, the BioGRID database also includes a wide range of other large-scale studies, thus providing a comprehensive reference set for the evaluations. To improve future benchmarking studies, it would be beneficial to add a specific category for known non-interacting mutant pairs, similar to that available for physically non-interacting protein pairs.
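The handling of reciprocal pairs can be sketched as follows. The exact filtering rules of the study are not spelled out in the text, so this sketch assumes that a pair is ambiguous when its reports disagree on the interaction class, and that a pair reported in only one orientation keeps its class for both orientations. The function name and the placeholder gene labels are hypothetical.

```python
def build_reference(records):
    """Build an orientation-free reference set of known interactions.

    records: iterable of (gene_a, gene_b, interaction_class) tuples.
    Returns a dict keyed by the unordered pair, keeping only pairs
    whose reports all agree on a single class.
    """
    classes = {}
    for gene_a, gene_b, cls in records:
        pair = frozenset((gene_a, gene_b))  # AB and BA share one key
        classes.setdefault(pair, set()).add(cls)
    return {pair: next(iter(found))
            for pair, found in classes.items()
            if len(found) == 1}  # drop pairs with conflicting classes

records = [
    ("A", "B", "positive"),
    ("B", "A", "negative"),  # conflicts with AB, so the pair is dropped
    ("A", "C", "negative"),  # reported in one orientation only
    ("C", "B", "positive"),
    ("B", "C", "positive"),  # consistent reciprocal report
]
reference = build_reference(records)
```

Looking up `frozenset(("C", "A"))` then returns the same class as the reported orientation, so both reciprocal pairs are treated identically in the evaluation.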
Analogous to efforts for completing the mapping of the physical PPI networks [23–28], it would be important to provide the community with easy access also to the raw interaction datasets, similar to that provided in the SGA database DRYGIN. For instance, our matrix approximation procedure was much more efficient with the original double-mutant fitness measurements, as provided by the SGA and GIM laboratories, compared to the highly processed and scored E-MAP datasets. The results with the E-MAP being one of the datasets were in many cases drastically different from those with the SGA - GIM dataset pair. As with any high-throughput assays, the large-scale genetic screening approaches are inherently noisy and biased, suggesting that each single assay can reveal only a limited part of the full spectrum of genetic interaction classes. Therefore, it is likely that integrative analysis of data from the complementary screening approaches will be essential to complete the quantitative genetic interaction networks in yeast and other organisms. We invite those participating in the genetic interaction mapping effort to try out the matrix approximation-based procedure and to give us input and suggestions for its further improvement.
The methodological aim of the present study was to enable an integrated analysis of multiple genetic interaction datasets using a common scoring framework adjusted for the high-throughput quantitative screening approaches. The next sections describe the genetic interaction datasets used to demonstrate the benefits of such an integrative approach, as well as the methods used to model, standardize, compare and merge these datasets, while maintaining their biological consistency and quantitative nature.
Genetic interaction matrices
Three large-scale quantitative data sets on yeast were used in the present work for the systematic and comparative evaluations. To investigate the potential limitations in the between-approach agreement and relative benefits gained by an integrative analysis among the currently available high-throughput quantitative genetic interaction maps, we chose representative example datasets across the spectrum of high-throughput interaction screening approaches currently used for Saccharomyces cerevisiae.
The first dataset was available from the epistatic miniarray profiling (E-MAP) study of quantitative genetic interactions between genes involved in yeast chromosome biology. The original fitness measurements among 754 alleles of 743 genes were highly filtered and processed, providing a symmetric data matrix with close to zero-centered quantitative distribution for the pairwise interaction scores [29, 49]. The raw, unprocessed double-mutant fitness measurements were not available from this study.
Representing another screening approach, the genetic interaction mapping (GIM) combines ideas from the synthetic lethality analysis by microarray (SLAM) [50, 51] and from synthetic genetic array (SGA) approaches [9, 10]. The data matrix available from its pilot study contains double-mutant fitness measurements among 5918 array and 73 query genes. The filtered fitness effects were transformed back to non-log-scale to produce a quantitative distribution with mean and median close to unity.
The third and the largest of the datasets is available from the recent SGA screening study. This data set contains double-mutant fitness measurements among 3885 array and 1712 query genes. The filtered and normalized double-mutant fitness data matrix, with median close to unity, was used in the matrix approximation procedure. The same dataset also includes a customized SGA scoring of the gene pairs [30, 52], which was used here as a baseline value for our QMA-based scoring procedure.
The quantile-based matrix approximation (QMA) is an efficient rank-one matrix approximation method, which is conceptually similar to Tukey's median polish procedure, except that QMA uses a multiplicative model instead of an additive model and quantiles instead of medians. More specifically, the estimation of the single-mutant fitness effects is based on subsequent calculation of the p- and q-quantile points for the rows and columns of the double-mutant fitness matrix W, respectively, and then arranging these quantiles in the estimated array and query vectors x and y.
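The quantile idea can be illustrated with a minimal Python sketch. It is not the QMA R-package: the two-step order (row quantiles first, then column quantiles of the row-scaled matrix), the linear-interpolation quantile, the default p = q = 0.5 and the function names `quantile` and `qma_rank_one` are all assumptions made for illustration.

```python
import math

def quantile(values, prob):
    """Linear-interpolation quantile of the non-NaN entries in `values`."""
    data = sorted(v for v in values if not math.isnan(v))
    if not data:
        return math.nan
    pos = prob * (len(data) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

def qma_rank_one(W, p=0.5, q=0.5):
    """Estimate vectors x (array) and y (query) so that W[a][b] ~ x[a] * y[b].

    Assumed two-step scheme, in the spirit of a multiplicative median polish:
      1. x[a] is the p-quantile of row a of W.
      2. y[b] is the q-quantile of column b after dividing each entry by x[a].
    """
    x = [quantile(row, p) for row in W]
    y = []
    for b in range(len(W[0])):
        # Skip rows whose estimate is zero or missing to avoid invalid ratios.
        scaled = [W[a][b] / x[a] for a in range(len(W))
                  if x[a] and not math.isnan(x[a])]
        y.append(quantile(scaled, q))
    return x, y

# Toy double-mutant fitness matrix built from known single-mutant effects.
x_true, y_true = [1.0, 0.8, 0.5], [1.0, 0.9, 1.0]
W = [[xa * yb for yb in y_true] for xa in x_true]
x_hat, y_hat = qma_rank_one(W)
```

On a noise-free matrix of this kind the sketch recovers the generating vectors; with real data, the choice of p and q controls how robust the estimates are to the interacting pairs in each row and column.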
Scoring of interactions
The presence and sign of an epistatic interaction between a gene pair (a,b) was scored using the residual s_ab = w_ab - s(x_a, y_b), where w_ab is the measured double-mutant fitness and s(x_a, y_b) is the scoring function evaluated at the estimated single-mutant fitness effects. To avoid potential bias among the different genes in the datasets, duplicate rows and columns in the double-mutant fitness matrices were combined by calculating the mean over the duplicates. The final dimensions of the data matrices are shown in Table 1. Before the data integration, each of the double-mutant fitness matrices was scored separately using the default QMA settings and scoring functions (Additional File 1), as described before.
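A short Python sketch of these two steps follows. The product x_a * y_b is used as the scoring function purely as an assumption; the scoring functions actually selected for each dataset are those listed in Additional File 1. The helper names are hypothetical, and duplicate columns can be handled by applying the same row helper to the transposed matrix.

```python
import math
from collections import defaultdict

def merge_duplicate_rows(labels, rows):
    """Average the rows that share a gene label, ignoring NaN entries."""
    groups = defaultdict(list)
    for label, row in zip(labels, rows):
        groups[label].append(row)
    merged_labels, merged_rows = [], []
    for label, members in groups.items():
        merged = []
        for column in zip(*members):
            finite = [v for v in column if not math.isnan(v)]
            merged.append(sum(finite) / len(finite) if finite else math.nan)
        merged_labels.append(label)
        merged_rows.append(merged)
    return merged_labels, merged_rows

def product_expectation(x_a, y_b):
    """Assumed scoring function: expected fitness with no interaction."""
    return x_a * y_b

def interaction_scores(W, x, y, expected=product_expectation):
    """Residual s_ab = w_ab - s(x_a, y_b) for every array/query pair."""
    return [[W[a][b] - expected(x[a], y[b]) for b in range(len(y))]
            for a in range(len(x))]

# Two measurements of gene g1 are averaged before scoring.
labels, W = merge_duplicate_rows(
    ["g1", "g2", "g1"], [[1.0, 0.8], [0.5, 0.2], [0.8, 1.0]])
scores = interaction_scores(W, x=[1.0, 0.5], y=[1.0, 1.0])
```

Under this convention a positive residual indicates a double mutant that is fitter than expected, and a negative residual one that is less fit than expected.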
Ranking of interactions
Each gene pair (a,b) was ranked according to its interaction score s_ab obtained in each individual dataset using the fixed QMA settings and scoring functions for positive interactions (Additional File 1). A rank-based data aggregation was used for robust integration of the scores from two screening approaches. More precisely, four rank aggregation functions (minimum of the ranks, maximum of the ranks, product of the ranks, and Borda count, which is effectively the sum of the ranks) were evaluated in terms of their accuracy, compared to using the rankings from a single dataset alone.
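The four aggregation functions can be sketched in Python as follows. The tie handling (average ranks) and the convention that rank 1 marks the highest score are assumptions, as are the names `rank_descending` and `aggregate_ranks`.

```python
def rank_descending(scores):
    """Rank 1 = highest score; tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

AGGREGATORS = {
    "min": min,                          # best rank in either dataset
    "max": max,                          # worst rank in either dataset
    "product": lambda r1, r2: r1 * r2,   # product of the ranks
    "borda": lambda r1, r2: r1 + r2,     # Borda count, i.e. sum of the ranks
}

def aggregate_ranks(scores_1, scores_2, method="borda"):
    """Combine the rankings of the same mutant pairs from two datasets.

    Smaller aggregated values mean stronger combined evidence.
    """
    combine = AGGREGATORS[method]
    ranks_1 = rank_descending(scores_1)
    ranks_2 = rank_descending(scores_2)
    return [combine(r1, r2) for r1, r2 in zip(ranks_1, ranks_2)]

# Three shared mutant pairs scored in two hypothetical datasets.
combined = aggregate_ranks([0.9, 0.1, 0.5], [0.2, 0.8, 0.6], method="product")
```

To rank candidates for negative interactions with the same code, the scores can simply be negated before ranking.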
Evaluation setup and measures
The three pairwise intersections of the datasets were evaluated separately in terms of their number of common array and query mutants (Table 1), the coverage of the known pairs of genetic interactions (Table 2), as well as their association in fitness values and interaction scores across the shared mutant pairs (Table 3). The shared intersection among all the three datasets was only 498 × 7 in size, including 178 known negative and only 31 known positive interactions from the BioGRID database. Therefore, this triple intersection could not be reliably evaluated here.
BioGRID interaction matrix
We used the interactions available in the gold-standard BioGRID database (version 3.0.64 for S. cerevisiae). We constructed a BioGRID interaction matrix by treating the gene pairs extracted from the database as unordered, meaning that if an interaction exists for a mutant pair AB, we also copied the same interaction for the mutant pair BA for biological consistency. A similar symmetric strategy has also been used in previous studies [4–7, 11, 31]. For each pairwise intersection between datasets, separate positive and negative interaction matrices were created for evaluation purposes.
BioGRID interaction classes
The positive interaction matrix was constructed using the 'Phenotypic Suppression' and 'Positive Genetic' categories from the BioGRID database, and the negative interaction matrix was generated by combining the 'Synthetic Lethal', 'Synthetic Growth Defect', 'Phenotypic Enhancement' and 'Negative Genetic' categories. Such interaction matrices are ternary matrices with entries representing either an interacting, non-interacting or ambiguous case, where the pair belongs to both interaction classes. Since the ambiguous cases can lead to biases in the evaluation results, they were excluded from the evaluations.
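The construction of the symmetric, ternary reference matrices can be sketched in Python as below. The input format (tuples of two gene names and a BioGRID category), the numeric codes 1, 0 and -1, and the function name are assumptions for illustration; the gene names in the example are placeholders.

```python
POSITIVE = {"Phenotypic Suppression", "Positive Genetic"}
NEGATIVE = {"Synthetic Lethal", "Synthetic Growth Defect",
            "Phenotypic Enhancement", "Negative Genetic"}

INTERACTING, NON_INTERACTING, AMBIGUOUS = 1, 0, -1

def build_reference_matrices(records, array_genes, query_genes):
    """Return (positive, negative) ternary matrices for array x query genes.

    Pairs are stored as unordered sets, so an interaction reported for AB
    is also used for BA. Pairs found in both classes are marked ambiguous.
    """
    pos_pairs, neg_pairs = set(), set()
    for gene_a, gene_b, category in records:
        pair = frozenset((gene_a, gene_b))
        if category in POSITIVE:
            pos_pairs.add(pair)
        elif category in NEGATIVE:
            neg_pairs.add(pair)
    ambiguous = pos_pairs & neg_pairs

    def fill(own_pairs):
        matrix = []
        for a in array_genes:
            row = []
            for b in query_genes:
                pair = frozenset((a, b))
                if pair in ambiguous:
                    row.append(AMBIGUOUS)
                elif pair in own_pairs:
                    row.append(INTERACTING)
                else:
                    row.append(NON_INTERACTING)
            matrix.append(row)
        return matrix

    return fill(pos_pairs), fill(neg_pairs)

records = [
    ("A", "B", "Synthetic Lethal"),
    ("B", "A", "Positive Genetic"),      # conflicts with the record above
    ("C", "A", "Negative Genetic"),
    ("A", "D", "Phenotypic Suppression"),
]
positive, negative = build_reference_matrices(records, ["A", "C"], ["B", "A", "D"])
```

Entries coded as ambiguous would then be skipped when the true and false interactions are counted in the evaluations.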
Agreement between the datasets
The congruence between the dataset pairs was evaluated by calculating the Pearson and Spearman correlations across those mutant pairs shared by both datasets. The agreement of the datasets in terms of their extreme fitness values or interaction scores was evaluated by constructing interaction matrices using one of the datasets to define positive and negative genetic interactions. We used the extreme 3% of the mutant pairs, according to the interaction rate estimate based on unbiased screens (3.15%), and the BioGRID interactions here among the three dataset intersections (2.99%; Table 2). Other cut-off levels (1% and 5%) were also considered (Additional File 7).
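The two agreement measures can be sketched in plain Python as follows. Whether the 3% cut-off was applied to each tail separately is not stated here, so taking the given fraction from each tail is an assumption, as are the function names.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equally long score vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def average_ranks(values):
    """Ascending ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(u, v):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(u), average_ranks(v))

def extreme_pairs(scores, fraction=0.03):
    """Index sets of the lowest and highest `fraction` of the scores.

    Assumption: the cut-off is taken from each tail separately.
    """
    k = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return set(order[:k]), set(order[-k:])

# Scores of the same shared mutant pairs in two hypothetical datasets.
dataset_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
dataset_2 = [1.0, 4.0, 9.0, 16.0, 25.0]
r_pearson = pearson(dataset_1, dataset_2)
r_spearman = spearman(dataset_1, dataset_2)
negative_idx, positive_idx = extreme_pairs(dataset_1, fraction=0.2)
```

The index sets returned by `extreme_pairs` for one dataset can serve as the reference labels when the ranking from the other dataset is evaluated.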
Receiver operating characteristics
The receiver operating characteristic (ROC) curves were used to assess the discovery rate of genetic interactions. A single ROC curve summarizes the trade-off between true positive rate (TPR) and false positive rate (FPR) on a ranked list of mutant pairs. The true and false interactions were defined here using the interaction matrices (from the BioGRID or using 3% extreme values). The overall prediction performance was summarized using the area under the ROC curve (AUC). For an ideal classifier, TPR = 1, FPR = 0 and AUC = 1, whereas a random classifier has an AUC of 0.5 on average.
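A minimal Python sketch of the ROC curve and its AUC is given below. It assumes binary labels (1 for a true interaction, 0 otherwise), at least one pair of each class, and scores where larger values indicate stronger evidence; the trapezoidal rule and the function names are illustrative choices.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by lowering the threshold over the scores.

    `labels` holds 1 for true interactions and 0 for the rest.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last pair of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Four mutant pairs, two of which are known interactions.
curve = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
area = auc(curve)
```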
Partial AUC and early sensitivity
In many practical application cases, only the first few candidate mutant pairs can be followed up in further validation studies. Therefore, it is important to evaluate also the performance of a mutant pair ranking at low FPR levels, that is, for those pairs with highest specificity. We used here the partial area under the ROC curve (pAUC), in which the range of FPR is limited to a predefined interval between zero and r (here r = 0.1), and the resulting area is then normalized by dividing it by r. To investigate the early sensitivity of the detections, we also calculated the TPR at FPR of 0.01.
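Both early-detection measures can be sketched in Python from a list of (FPR, TPR) points sorted by increasing FPR. Clipping the curve at FPR = r by linear interpolation, and reading the early sensitivity as the highest TPR reached without exceeding the FPR limit, are assumptions of this sketch.

```python
def partial_auc(points, r=0.1):
    """Area under the ROC curve for FPR in [0, r], divided by r."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= r:
            break
        if x1 > r:
            # Clip the segment at FPR = r by linear interpolation.
            y1 = y0 + (y1 - y0) * (r - x0) / (x1 - x0)
            x1 = r
        area += (x1 - x0) * (y0 + y1) / 2
    return area / r

def tpr_at_fpr(points, fpr_limit=0.01):
    """Highest TPR reached while FPR stays at or below `fpr_limit`."""
    return max(tpr for fpr, tpr in points if fpr <= fpr_limit)

perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # ideal ranking
diagonal = [(0.0, 0.0), (1.0, 1.0)]              # random ranking
pauc_perfect = partial_auc(perfect)
pauc_random = partial_auc(diagonal)
early_sensitivity = tpr_at_fpr(perfect)
```

With this normalization an ideal ranking gives a pAUC of 1, while a random ranking gives r/2 (0.05 for r = 0.1).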
Enrichment of genetic interactions
To test the enrichment of interactions over random in different parts of scatter plots, the plots were divided into a six-by-six grid. For each of these 36 parts, we performed a standard hypergeometric test to calculate the enrichment p-values for positive and negative interactions separately:
p = \sum_{i=m}^{\min(t, M)} \frac{\binom{M}{i} \binom{K-M}{t-i}}{\binom{K}{t}}
Here, K is the total number of gene pairs in the grid, M is the total number of (positive or negative) interactions (M ≤ K), m is the number of interactions found in the particular grid cell (m ≤ M), and t is the number of gene pairs in that grid cell (m ≤ t ≤ K). The p-values in the figures were limited between 10^-100 and 0.99.
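As a hedged illustration, the enrichment test can be written in Python with exact integer arithmetic. The sketch assumes the usual upper-tail form of the hypergeometric test, that is, the probability of observing at least m interactions among t pairs drawn from K pairs of which M are interactions; the function names are hypothetical.

```python
from math import comb

def enrichment_p_value(K, M, t, m):
    """Upper-tail hypergeometric probability P(X >= m).

    K: gene pairs in the whole grid, M: interactions among them,
    t: gene pairs in the grid cell, m: interactions found in the cell.
    """
    upper = min(t, M)
    tail = sum(comb(M, i) * comb(K - M, t - i) for i in range(m, upper + 1))
    return tail / comb(K, t)

def clip_p_value(p, low=1e-100, high=0.99):
    """Limit a p-value to the range used for display in the figures."""
    return min(max(p, low), high)

# Toy example: 10 pairs, 4 interactions, a cell with 3 pairs and 2 hits.
p = enrichment_p_value(K=10, M=4, t=3, m=2)
p_shown = clip_p_value(p)
```

Python integers have arbitrary precision, so the binomial coefficients stay exact even for grids with a very large number of gene pairs.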
To promote its widespread usage in future screening studies, we have made publicly available an efficient, stand-alone R-implementation of the quantile-based matrix approximation procedure (QMAP). This implementation includes a number of user-adjustable options that can be set through a graphical user interface to fine-tune the procedure for a given experimental dataset and downstream analysis objective under investigation. Along with the open source R-code, the implementation contains documentation of the data format for the input data, the parameters of the various options, as well as the output data of the QMAP (Additional File 8).
AUC: area under the curve
E-MAP: epistatic miniarray profiling
FPR: false positive rate
GIM: genetic interaction mapping
pAUC: partial area under the curve
QMA: quantile-based matrix approximation
SLAM: synthetic lethality analysis by microarray
ROC: receiver operating characteristic
SGA: synthetic genetic array
TPR: true positive rate.
Boone C, Bussey H, Andrews BJ: Exploring genetic interactions and networks with yeast. Nat Rev Genet. 2007, 8: 437-449. 10.1038/nrg2085
Dixon SJ, Costanzo M, Baryshnikova A, Andrews B, Boone C: Systematic mapping of genetic interaction networks. Annu Rev Genet. 2009, 43: 601-625. 10.1146/annurev.genet.39.073003.114751
Beltrao P, Cagney G, Krogan NJ: Quantitative genetic interactions reveal biological modularity. Cell. 2010, 141: 739-45. 10.1016/j.cell.2010.05.019
Schuldiner M, Collins SR, Thompson NJ, Denic V, Bhamidipati A, Punna T, Ihmels J, Andrews B, Boone C, Greenblatt JF, Weissman JS, Krogan NJ: Exploration of the function and organization of the yeast early secretory pathway through an epistatic miniarray profile. Cell. 2005, 123: 507-519. 10.1016/j.cell.2005.08.031
Collins SR, Miller KM, Maas NL, Roguev A, Fillingham J, Chu CS, Schuldiner M, Gebbia M, Recht J, Shales M, Ding H, Xu H, Han J, Ingvarsdottir K, Cheng B, Andrews B, Boone C, Berger SL, Hieter P, Zhang Z, Brown GW, Ingles CJ, Emili A, Allis CD, Toczyski DP, Weissman JS, Greenblatt JF, Krogan NJ: Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature. 2007, 446: 806-810. 10.1038/nature05649
Wilmes GM, Bergkessel M, Bandyopadhyay S, Shales M, Braberg H, Cagney G, Collins SR, Whitworth GB, Kress TL, Weissman JS, Ideker T, Guthrie C, Krogan NJ: A genetic interaction map of RNA-processing factors reveals links between Sem1/Dss1-containing complexes and mRNA export and splicing. Mol Cell. 2008, 32: 735-746. 10.1016/j.molcel.2008.11.012
Fiedler D, Braberg H, Mehta M, Chechik G, Cagney G, Mukherjee P, Silva AC, Shales M, Collins SR, van Wageningen S, Kemmeren P, Holstege FC, Weissman JS, Keogh MC, Koller D, Shokat KM, Krogan NJ: Functional organization of the S. cerevisiae phosphorylation network. Cell. 2009, 136: 952-963. 10.1016/j.cell.2008.12.039
Decourty L, Saveanu C, Zemam K, Hantraye F, Frachon E, Rousselle JC, Fromont-Racine M, Jacquier A: Linking functionally related genes by sensitive and quantitative characterization of genetic interaction profiles. Proc Natl Acad Sci USA. 2008, 105: 5821-5826. 10.1073/pnas.0710533105
Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Pagé N, Robinson M, Raghibizadeh S, Hogue CW, Bussey H, Andrews B, Tyers M, Boone C: Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science. 2001, 294: 2364-2368. 10.1126/science.1065810
Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, Chen Y, Cheng X, Chua G, Friesen H, Goldberg DS, Haynes J, Humphries C, He G, Hussein S, Ke L, Krogan N, Li Z, Levinson JN, Lu H, Ménard P, Munyana C, Parsons AB, Ryan O, Tonikian R, Roberts T, et al.: Global mapping of the yeast genetic interaction network. Science. 2004, 303: 808-813. 10.1126/science.1091317
Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, Prinz J, St Onge RP, VanderSluis B, Makhnevych T, Vizeacoumar FJ, Alizadeh S, Bahr S, Brost RL, Chen Y, Cokol M, Deshpande R, Li Z, Lin ZY, Liang W, Marback M, Paw J, San Luis BJ, Shuteriqi E, Tong AH, van Dyk N, et al.: The genetic landscape of a cell. Science. 2010, 327: 425-431. 10.1126/science.1180823
Hartman JL, Garvik B, Hartwell L: Principles for the buffering of genetic variation. Science. 2001, 291: 1001-1004. 10.1126/science.291.5506.1001
Segrè D, Deluna A, Church GM, Kishony R: Modular epistasis in yeast metabolism. Nat Genet. 2005, 37: 77-83.
Davierwala AP, Haynes J, Li Z, Brost RL, Robinson MD, Yu L, Mnaimneh S, Ding H, Zhu H, Chen Y, Cheng X, Brown GW, Boone C, Andrews BJ, Hughes TR: The synthetic genetic interaction spectrum of essential genes. Nat Genet. 2005, 37: 1147-1152. 10.1038/ng1640
Ooi SL, Pan X, Peyser BD, Ye P, Meluh PB, Yuan DS, Irizarry RA, Bader JS, Spencer FA, Boeke JD: Global synthetic-lethality analysis and yeast functional profiling. Trends Genet. 2006, 22: 56-63. 10.1016/j.tig.2005.11.003
Jasnos L, Korona R: Epistatic buffering of fitness loss in yeast double deletion strains. Nat Genet. 2007, 39: 550-554. 10.1038/ng1986
Beyer A, Bandyopadhyay S, Ideker T: Integrating physical and genetic maps: from genomes to interaction networks. Nat Rev Genet. 2007, 8: 699-710. 10.1038/nrg2144
Ulitsky I, Shamir R: Pathway redundancy and protein essentiality revealed in the Saccharomyces cerevisiae interaction networks. Mol Syst Biol. 2007, 3: 104- 10.1038/msb4100144
Lehner B: Modelling genotype-phenotype relationships and human disease with genetic interaction networks. J Exp Biol. 2007, 210: 1559-1566. 10.1242/jeb.002311
Phillips PC: Epistasis - the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008, 9: 855-867. 10.1038/nrg2452
Gao H, Granka JM, Feldman MW: On the classification of epistatic interactions. Genetics. 2010, 184: 827-837. 10.1534/genetics.109.111120
Breker M, Schuldiner M: Explorations in topology-delving underneath the surface of genetic interaction maps. Mol Biosyst. 2009, 5: 1473-1481. 10.1039/b907076c
Hart GT, Ramani AK, Marcotte EM: How complete are current yeast and human protein-interaction networks?. Genome Biol. 2006, 7: 120- 10.1186/gb-2006-7-11-120
Goll J, Uetz P: The elusive yeast interactome. Genome Biol. 2006, 7: 223.
Gentleman R, Huber W: Making the most of high-throughput protein-interaction data. Genome Biol. 2007, 8: 112- 10.1186/gb-2007-8-10-112
Futschik ME, Chaurasia G, Herzel H: Comparison of human protein-protein interaction maps. Bioinformatics. 2007, 23: 605-611. 10.1093/bioinformatics/btl683
Venkatesan K, Rual JF, Vazquez A, Stelzl U, Lemmens I, Hirozane-Kishikawa T, Hao T, Zenkner M, Xin X, Goh KI, Yildirim MA, Simonis N, Heinzmann K, Gebreab F, Sahalie JM, Cevik S, Simon C, de Smet AS, Dann E, Smolyar A, Vinayagam A, Yu H, Szeto D, Borick H, Dricot A, Klitgord N, Murray RR, Lin C, Lalowski M, Timm J, et al.: An empirical framework for binary interactome mapping. Nat Methods. 2009, 6: 83-90. 10.1038/nmeth.1280
Braun P, Tasan M, Dreze M, Barrios-Rodiles M, Lemmens I, Yu H, Sahalie JM, Murray RR, Roncari L, de Smet AS, Venkatesan K, Rual JF, Vandenhaute J, Cusick ME, Pawson T, Hill DE, Tavernier J, Wrana JL, Roth FP, Vidal M: An experimentally derived confidence score for binary protein-protein interactions. Nat Methods. 2009, 6: 91-97. 10.1038/nmeth.1281
Collins SR, Schuldiner M, Krogan NJ, Weissman JS: A strategy for extracting and analyzing large-scale quantitative epistatic interaction data. Genome Biol. 2006, 7: R63- 10.1186/gb-2006-7-7-r63
Koh JL, Ding H, Costanzo M, Baryshnikova A, Toufighi K, Bader GD, Myers CL, Andrews BJ, Boone C: DRYGIN: a database of quantitative genetic interaction networks in yeast. Nucleic Acids Res. 2010, 38: D502-D507. 10.1093/nar/gkp820
Eronen VP, Lindén RO, Lindroos A, Kanerva M, Aittokallio T: Genome-wide scoring of positive and negative epistasis through decomposition of quantitative genetic interaction fitness matrices. PLoS One. 2010, 5: e11611- 10.1371/journal.pone.0011611
Ulitsky I, Krogan NJ, Shamir R: Towards accurate imputation of quantitative genetic interactions. Genome Biol. 2009, 10: R140- 10.1186/gb-2009-10-12-r140
Ryan C, Greene D, Cagney G, Cunningham P: Missing value imputation for epistatic MAPs. BMC Bioinformatics. 2010, 11: 197- 10.1186/1471-2105-11-197
Järvinen AP, Hiissa J, Elo LL, Aittokallio T: Predicting quantitative genetic interactions by means of sequential matrix approximation. PLoS One. 2008, 3: e3284.
Mani R, St Onge RP, Hartman JL, Giaever G, Roth FP: Defining genetic interaction. Proc Natl Acad Sci USA. 2008, 105: 3461-3466. 10.1073/pnas.0712255105
St Onge RP, Mani R, Oh J, Proctor M, Fung E, Davis RW, Nislow C, Roth FP, Giaever G: Systematic pathway analysis using high-resolution fitness profiling of combinatorial gene deletions. Nat Genet. 2007, 39: 199-206. 10.1038/ng1948
Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006, 34: D535-D539. 10.1093/nar/gkj109
Le Meur N, Gentleman R: Modeling synthetic lethality. Genome Biol. 2008, 9: R135- 10.1186/gb-2008-9-9-r135
Bandyopadhyay S, Kelley R, Krogan NJ, Ideker T: Functional maps of protein complexes from quantitative genetic interaction data. PLoS Comput Biol. 2008, 4: e1000065- 10.1371/journal.pcbi.1000065
Ulitsky I, Shlomi T, Kupiec M, Shamir R: From E-MAPs to module maps: dissecting quantitative genetic interactions using physical interactions. Mol Syst Biol. 2008, 4: 209- 10.1038/msb.2008.42
Fischbach MA, Krogan NJ: The next frontier of systems biology: higher-order and interspecies interactions. Genome Biol. 2010, 11: 208.
Van Driessche N, Demsar J, Booth EO, Hill P, Juvan P, Zupan B, Kuspa A, Shaulsky G: Epistasis analysis with global transcriptional phenotypes. Nat Genet. 2005, 37: 471-477. 10.1038/ng1545
Harrison R, Papp B, Pál C, Oliver SG, Delneri D: Plasticity of genetic interactions in metabolic networks of yeast. Proc Natl Acad Sci USA. 2007, 104: 2307-2312. 10.1073/pnas.0607153104
Tischler J, Lehner B, Fraser AG: Evolutionary plasticity of genetic interaction networks. Nat Genet. 2008, 40: 390-391. 10.1038/ng.114
Dixon SJ, Andrews BJ, Boone C: Exploring the conservation of synthetic lethal genetic interaction networks. Commun Integr Biol. 2009, 2: 78-81.
Jonikas MC, Collins SR, Denic V, Oh E, Quan EM, Schmid V, Weibezahn J, Schwappach B, Walter P, Weissman JS, Schuldiner M: Comprehensive characterization of genes required for protein folding in the endoplasmic reticulum. Science. 2009, 323: 1693-1697. 10.1126/science.1167983
Battle A, Jonikas MC, Walter P, Weissman JS, Koller D: Automated identification of pathways from quantitative genetic interaction data. Mol Syst Biol. 2010, 6: 379- 10.1038/msb.2010.27
Smialowski P, Pagel P, Wong P, Brauner B, Dunger I, Fobo G, Frishman G, Montrone C, Rattei T, Frishman D, Ruepp A: The Negatome database: a reference set of non-interacting protein pairs. Nucleic Acids Res. 2010, 38: D540-D544. 10.1093/nar/gkp1026
Collins SR, Roguev A, Krogan NJ: Quantitative genetic interaction mapping using the E-MAP approach. Methods Enzymol. 2010, 470: 205-231.
Pan X, Yuan DS, Xiang D, Wang X, Sookhai-Mahadeo S, Bader JS, Hieter P, Spencer F, Boeke JD: A robust toolkit for functional profiling of the yeast genome. Mol Cell. 2004, 16: 487-496. 10.1016/j.molcel.2004.09.035
Pan X, Yuan DS, Ooi SL, Wang X, Sookhai-Mahadeo S, Meluh P, Boeke JD: dSLAM analysis of genome-wide genetic interactions in Saccharomyces cerevisiae. Methods. 2007, 41: 206-221. 10.1016/j.ymeth.2006.07.033
Baryshnikova A, Costanzo M, Dixon S, Vizeacoumar FJ, Myers CL, Andrews B, Boone C: Synthetic genetic array (SGA) analysis in Saccharomyces cerevisiae and Schizosaccharomyces pombe. Methods Enzymol. 2010, 470: 145-179.
The authors thank Prof. Charlie Boone and Dr. Cosmin Saveanu for providing us with the quantitative SGA and GIM datasets, respectively. The work was supported by the Academy of Finland (grants 120 569, 133 227 and 140 880).
TA conceived the study, and ROL participated in its design. ROL and VPE developed and implemented the matrix approximation method. ROL analyzed the datasets. ROL and TA wrote the manuscript. All authors read and approved the final manuscript.