Integrating transcriptional and protein interaction networks to prioritize condition-specific master regulators

Background
Genome-wide libraries of yeast deletion strains have been used to screen for genes that drive phenotypes such as stress response. A surprising observation emerging from these studies is that the genes with the largest changes in mRNA expression during a state transition are not those that drive that transition. Here, we show that integrating gene expression data with context-independent protein interaction networks can help prioritize master regulators that drive biological phenotypes.

Results
Genes essential for survival had previously been shown to exhibit high centrality in protein interaction networks. However, the set of genes that drive growth in any specific condition is highly context-dependent. We inferred regulatory networks from gene expression data and transcription factor binding motifs in Saccharomyces cerevisiae, and found that high-degree nodes in regulatory networks are enriched for transcription factors that drive the corresponding phenotypes. We then found that using a metric combining protein interaction and transcriptional networks improved the enrichment for drivers in many of the contexts we examined. We applied this principle to a dataset of gene expression in normal human fibroblasts expressing a panel of viral oncogenes. We integrated regulatory interactions inferred from this data with a database of yeast two-hybrid protein interactions and ranked 571 human transcription factors by their combined network score. The ranked list was significantly enriched in known cancer genes that could not be found by standard differential expression or enrichment analyses.

Conclusions
There has been increasing recognition that network-based approaches can provide insight into critical cellular elements that help define phenotypic state. Our analysis suggests that no one network, based on a single data type, captures the full spectrum of interactions. Greater insight can instead be gained by exploring multiple independent networks and by choosing an appropriate metric on each network. Moreover, we can improve our ability to rank phenotypic drivers by combining the information from individual networks. We propose that such integrative network analysis could be used to combine clinical gene expression data with interaction databases to prioritize patient- and disease-specific therapeutic targets.

Electronic supplementary material
The online version of this article (doi:10.1186/s12918-015-0228-1) contains supplementary material, which is available to authorized users.

RARA, and TAL2. Note that, due to the large number of cancer drivers, we analyzed only those that were present among the top 10% of TFs as ranked by at least one of the three metrics listed above. Figure 8 depicts all the interactions with this set of proteins. Most of the enriched GO terms with high odds ratios pertained to regulation of transcription and organism development. For example, the GO term "muscle organ development" is enriched due to the presence of four proteins including ID3, a transcriptional inhibitor that can modulate cell differentiation, and UNC45A, which is part of the progesterone signaling pathway and is needed for cell proliferation. Enrichment of the GO term "cytoplasmic sequestering of transcription factor" arises from SRI, which encodes a calcium-binding protein that can regulate the activity of calcium channels, and MXI1, a protein that competes with MYC for binding to MAX and thereby inhibits MYC function.
Taken together, these results suggest that driver TFs tend to be ranked higher when integrating the PPI network degree because they are more often regulated by proteins like nuclear exporters, transcriptional repressors, or chromatin remodelers, and in humans, they may interact with signaling molecules that carry information about the environment (e.g. calcium or hormone levels).
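To make the idea of integrating two network degrees concrete, the following is a minimal sketch in Python. The text here does not specify how the combined network score is computed, so the formula below (the mean of the two fractional ranks) is purely an illustrative assumption, and the function names, TF names, and degree values are hypothetical toy data rather than results from this study.

```python
from typing import Dict


def fractional_ranks(values: Dict[str, float]) -> Dict[str, float]:
    """Scale each key's average rank to (0, 1]; larger values get larger ranks."""
    items = sorted(values.items(), key=lambda kv: kv[1])
    n = len(items)
    ranks: Dict[str, float] = {}
    i = 0
    while i < n:
        # Find the end of the current group of tied values.
        j = i
        while j + 1 < n and items[j + 1][1] == items[i][1]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[items[k][0]] = avg_rank / n
        i = j + 1
    return ranks


def combined_score(tf_degree: Dict[str, float],
                   ppi_degree: Dict[str, float]) -> Dict[str, float]:
    """ASSUMED combination rule: mean of the two fractional ranks.

    Only TFs present in both networks are scored.
    """
    shared = tf_degree.keys() & ppi_degree.keys()
    r_tf = fractional_ranks({k: tf_degree[k] for k in shared})
    r_ppi = fractional_ranks({k: ppi_degree[k] for k in shared})
    return {k: (r_tf[k] + r_ppi[k]) / 2 for k in shared}


# Hypothetical toy degrees, for illustration only.
tf_degree = {"TF_A": 120, "TF_B": 40, "TF_C": 80, "TF_D": 10}
ppi_degree = {"TF_A": 3, "TF_B": 25, "TF_C": 30, "TF_D": 1}

scores = combined_score(tf_degree, ppi_degree)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy example, a TF with a moderate transcriptional degree but many protein interaction partners can overtake a TF that ranks first on transcriptional degree alone, which is the kind of re-ranking the paragraph above describes.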

Supplementary Figure Captions
Supplementary Figure 1
Combined network score improves enrichment in driver genes in a manner robust to data sampling. ROC curves show the performance of the three network measures for the 53 yeast strains most sensitive to rapamycin. P-values are computed using the Kolmogorov-Smirnov test. AUC = area under the curve. Bar graphs show the odds ratio for the overlap between driver TFs and the top 20% of TFs ranked by each measure.
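The odds ratio reported in the bar graphs can be read as the association in a 2x2 table of "in the top fraction" versus "driver TF". The Python sketch below shows one way such a value could be computed. The function name, the rounding used to set the size of the top fraction, and the toy scores are assumptions for illustration, not the procedure used to produce the figures.

```python
import math
from typing import Dict, Iterable


def top_fraction_odds_ratio(scores: Dict[str, float],
                            drivers: Iterable[str],
                            fraction: float = 0.2) -> float:
    """Odds ratio for the overlap between drivers and the top-ranked TFs."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumption: the top-fraction size is rounded to the nearest integer.
    n_top = max(1, round(fraction * len(ranked)))
    top = set(ranked[:n_top])
    driver_set = set(drivers) & set(ranked)

    a = len(top & driver_set)        # top-ranked and driver
    b = len(top) - a                 # top-ranked, not driver
    c = len(driver_set) - a          # driver, not top-ranked
    d = len(ranked) - a - b - c      # neither

    if b == 0 or c == 0:
        # The ratio is undefined or infinite when an off-diagonal cell is empty.
        return math.inf if a * d > 0 else math.nan
    return (a * d) / (b * c)


# Hypothetical toy data: ten TFs with scores 10 down to 1.
toy_scores = {f"TF{i}": 11 - i for i in range(1, 11)}
toy_drivers = {"TF1", "TF5", "TF9"}
print(top_fraction_odds_ratio(toy_scores, toy_drivers, fraction=0.2))
```

With small counts, an empty cell is common, so a real analysis would likely pair this ratio with an exact test or a continuity correction.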

Supplementary Figure 2
Driver TFs have high degree in the PANDA regulatory network associated with rapamycin response. (A) Receiver operating characteristic (ROC) curves showing the performance of two different measures (degree in the PANDA transcriptional network, and differential expression after addition of rapamycin for 50 minutes) in identifying driver TFs. P-values are computed using the Wilcoxon test. (B) Bar graphs show the odds ratio for the overlap between driver TFs and the top 10, 20 and 30% of TFs ranked by the same two measures. (C) Transcriptional network inferred by PANDA in rapamycin-perturbed yeast. Only TFs and their interactions are shown. Red nodes denote rapamycin driver genes. The size of each node is proportional to its degree in the full network, including all target genes.
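The area under an ROC curve is directly related to the Wilcoxon rank-sum (Mann-Whitney) statistic: it equals the fraction of driver/non-driver pairs in which the driver has the higher score, with ties counted as one half. The short Python sketch below illustrates that relationship on hypothetical toy scores. It uses a simple pairwise count for clarity and is not the code used to generate the curves.

```python
from typing import Dict, Set


def roc_auc(scores: Dict[str, float], drivers: Set[str]) -> float:
    """AUC as the normalized Mann-Whitney U statistic (ties count 0.5)."""
    pos = [s for tf, s in scores.items() if tf in drivers]
    neg = [s for tf, s in scores.items() if tf not in drivers]
    if not pos or not neg:
        raise ValueError("need at least one driver and one non-driver")
    # Count driver/non-driver pairs where the driver scores higher.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores (for example, network degree) and driver labels.
toy_scores = {"TF_A": 9, "TF_B": 7, "TF_C": 5, "TF_D": 3, "TF_E": 1}
toy_drivers = {"TF_A", "TF_C"}
print(roc_auc(toy_scores, toy_drivers))
```

An AUC of 0.5 corresponds to a ranking that is uninformative about driver status, and values closer to 1 indicate that drivers are concentrated near the top of the ranking.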

Supplementary Figure 3
Combining degree in the transcriptional network and in the yeast two-hybrid (Y2H) protein interaction network improves enrichment in driver genes. The figure depicts ROC curves showing the performance of three different measures (degree in the transcriptional network, degree in the Y2H protein interaction network, and the combined network score) in identifying driver TFs. P-values are computed using the Kolmogorov-Smirnov test. AUC = area under the curve. Bar graphs show the odds ratio for the overlap between driver TFs and the top 20% of TFs ranked by each of the three measures.

Supplementary Figure 4
For rapamycin response in yeast, combining differential expression with the protein interaction network yields an increase in enrichment for drivers similar to that obtained by combining the transcriptional network with the protein interaction network. The ROC curves show the performance of five network measures. Bar graphs show the odds ratio for the overlap between driver TFs and the top 20% of TFs ranked by each measure.

Supplementary Figure 5
Among stress conditions where the GMIT network is not enriched for drivers, the protein interaction network can improve enrichment. The figure depicts ROC curves comparing the performance of degree in the GMIT transcriptional network, degree in the protein interaction network, and the combined score, for all other growth conditions. P-values are computed using the Kolmogorov-Smirnov test and AUC denotes area under the curve.

Supplementary Figure 6
Among stress conditions where the PANDA network is not enriched for drivers, the protein interaction network can improve enrichment. The figure depicts ROC curves comparing the performance of degree in the PANDA transcriptional network, degree in the protein interaction network, and the combined score, for all other growth conditions. P-values are computed using the Kolmogorov-Smirnov test and AUC denotes area under the curve.

Supplementary Figure 7
For viral oncogene-perturbed human fibroblasts, combining differential expression with protein interaction data tends to decrease enrichment in drivers, as compared with combining the transcriptional and protein interaction networks. Large ROC curves show the performance of five network measures. Bar graphs show the odds ratio for the overlap between driver TFs and the top 10% of TFs ranked by each measure. Small ROC curves below show only the two combined scores for easier visual comparison.

Supplementary Figure 8
The driver TFs with high combined network score interact with proteins enriched for chromatin assembly, histone modification and nuclear transport. Heatmaps show the ranks of all driver TFs according to either the transcriptional network degree, the protein interaction degree, or the combined network score, in the cases of menadione, DTT and diamide response in yeast. The network depicts all direct protein interactors of the menadione driver TFs that had a higher rank in the combined network score than in either individual network alone. Edges represent evidence of direct protein interaction from yeast two-hybrid experiments.

[Figure panels not reproduced. Recoverable labels: panel title "Rapamycin (GMIT)"; ROC axes "False positive rate" and "True positive rate"; bar-graph axis "Odds ratio"; gene classes "Essential gene" and "Driver gene"; legend entries "Transcriptional network", "Protein interaction network" (all PPI or Y2H only), "Combined network score", "Differential expression", and "Diff. exp. + PPI".]