Protein Interaction Permutation Analysis (PIPA) Algorithm
For a given gene set, PIPA identifies the probability of seeing an observed number of protein interactions between siRNA hits by chance. We start with a gene set derived from a GeneGo Metacore Pathway. This gene set, G, is filtered to only contain genes targeted in the siRNA screen. Gene set members are labeled as "hits" or "non-hits" as described in the methods section. Gene set members are connected using literature-curated interactions to create a network N
G
where genes are represented as nodes and interactions are represented as edges. We label the number of observed edges connecting hits as O.
A network permutation method is used to derive a null distribution by permuting node labels. This approach is an extension of the graph permutation algorithm presented by Balasubramanian et al. and allows the topology of N
G
to remain constant [20]. By shuffling node labels we sidestep the connection bias of highly annotated (highly connected) gene set members. A minority of the knockdown genes in our siRNA screen are identified as hits (~6%) and accordingly a minority of permutations will label these highly connected nodes as hits.
Starting with network N
G
, we shuffle node labels with all other genes targeted in the screen to create a shuffled version of N
G
we refer to as
. The number of edges connecting hits within
is recorded. This process is iterated 10,000 times and a p-value is calculated as the proportion of sampled permutations that have O or greater edges connecting hits. We use a p-value threshold of 0.05 to reject the null hypothesis that the number of observed edges connecting hits is what we would expect by chance. Because of differences in overlap between members and networks across gene sets, we leave p-values unadjusted for multiple testing for a fair comparison between PIPA and conventional hypergeometric enrichment analysis.
We divide the global interactome network into sub-networks that are based on canonical gene sets for several reasons. As an initial experiment, we applied the above permutation algorithm to the global network and found that hits were not significantly connected. However when focusing on sub-networks derived from gene sets, we find significant enrichment for highly connected hits. This finding is consistent with work by Horvath et al. where network properties from co-expression sub-networks (modules) are more predictive than network properties from the global co-expression network [21]. Another advantage to dividing the global interactome into gene sets, is that by their definition canonical gene sets are biologically annotated and provide an intrinsic systems biology explanation of hits. Finally, by focusing on these sub-networks we are able to divide-and-conquer the global network in a biologically supported and computationally tractable manner.
Interacting gene pair siRNA hits are more reproducible than single gene hits
While replicate screens ideally should result in identical hit lists, experimental variability makes this not the case. A guiding principle behind ORA approaches is hits from pathways enriched for hits are more likely to be true positives and represent relevant biological processes. Similarly, a guiding hypothesis behind the development of PIPA is that screen hits that are directly connected in the global interactome are more likely to be true positives than are individual hits. To test this hypothesis we compared the overlap of single gene screen hits (node hits) to the overlap of interacting gene pair siRNA hits (edge hits) from replicate screens carried out in HeLa and TOV cell lines (Figure 1). Experimental description, data pre-processing, and labeling of hits are detailed in the Methods section. While significant overlap is observed for node hits in both screens, the overlap between edge hits indeed shows greater statistical significance.
Of note, the requirement that two hits share an interaction substantially reduces the number of edge hits compared to node hits. If we view nodes that are connected by edges as random independent variables with a ~6% chance of being a hit, the likelihood of drawing two hits by chance (~0.36%) is dramatically smaller. The use of interaction data to select among hits, consequently, substantially reduces the total number of hits while simultaneously enriching for hits that are more likely to be reproduced in replicate screens. This observation supports our hypothesis that analyzing high-throughput siRNA data in the context of protein interactions is likely to enable identification of essential signaling cascades.
PIPA identifies both protein complexes and signaling interactions
To utilize gene interaction information in combination with gene sets, the PIPA algorithm identifies gene sets in which edge hits are statistically overrepresented compared to a null distribution obtained by shuffling the node labels in each gene set. In the analysis of the HeLa and TOV screening data, we identified enriched pathways by combining GeneGO Metacore canonical signaling pathways with a human interactome map comprised of the union of binding, phosphorylation and expression regulation edges from three data sources (see Methods). The intersection of this interactome map with the canonical pathway gene sets yields 16,478 interactions, roughly evenly distributed among the three interaction types (Figure 2).
When PIPA is carried out using all 16,478 protein interactions, the majority of edge hits in enriched gene sets result from binding interactions (86.9%). Of these enriched binding edges, 72.5% are from protein complexes, as defined by GO Cellular Component categories that are children of the "protein complex" category (GO:0043234). It is expected that sub-units of protein complexes will have similar phenotypes in RNAi screens, and in fact RNAi studies often use essential protein complexes to identify false negatives because each knockdown targeted complex sub-unit is known to have a significant effect on phenotype [9, 15]. Additionally protein interaction databases typically include binding interactions between each sub-unit of large complexes, for example the eukaryotic initiation factor 3 (EIF3) or proteasome, resulting in highly-connected graphs between 13 or 27 complex sub-units (respectively) [22]. As a result, any edge-based ORA will be biased by these highly-connected portions of the global interactome map.
Although the identification of interactions between sub-units of essential protein complexes represents true biology, these highly connected pieces of the interactome obscure regulatory protein interactions which are of more interest for understanding cell signaling. Consequently, to identify essential signaling cascades, we restricted the interaction data set to 11,017 phosphorylation and expression regulation interactions. With this signaling-focused interaction set, 80 edge hits in enriched pathways were identified using PIPA. Of note, while the initial interaction set contained roughly equal proportions of phosphorylation and expression regulation edges, the edge hits from the enriched pathways were significantly enriched for phosphorylation interactions (Figure 2), additionally supporting the hypothesis that PIPA is detecting essential signaling events. Interestingly, this result also suggests that expression regulation is more robust to perturbation than is phosphorylation. This observation may reflect the fact that expression regulation interactions reported in these databases are more likely to be indirect interactions mediated by other genes and gene products than are phosphorylation interactions.
PIPA identifies distinct enriched pathways relative to the Hypergeometric
We applied PIPA to the HeLa and TOV siRNA screens to identify pathways that show significant enrichment for protein interactions connecting siRNA hits (Figure 3). Pathways with expected biological relevance to cell cycle regulation, translation, cell structure and various signal transduction cascades are shown to be enriched. Five enriched pathways are common to both HeLa and TOV cell lines: metaphase checkpoint, spindle assembly and chromosome separation, cytoskeletal remodeling focal adhesion kinase (FAK) signaling, role of SCF complex in cell cycle regulation, and role of Akt in hypoxia induced HIF1 activation.
PIPA identifies a distinct set of enriched pathways relative to HGA (Figure 4). Odds ratios and p-values indicate a strong agreement between methods, but approximately half of the enriched gene sets identified by PIPA are not detected by HGA (see Figure 3). Of note, because the coverage of the interactome is incomplete, many pathways have sparse interaction coverage, leading to fewer enriched gene sets detected by PIPA relative to HGA. As annotation and completeness of the interactome progresses, this limitation of edge-based approaches such as PIPA is expected to diminish.
PIPA detects active signaling branches with pathway gene sets
The Cytoskeletal Remodeling: FAK Signaling pathway (Figure 5) is significantly enriched for both HeLa and TOV siRNA hits using PIPA and not significantly enriched using HGA. FAK-mediated signal transduction has been implicated in control of cell migration, cell cycle progression and apoptosis [23]. Interestingly, both cell lines showed the VEGFA growth factor as being a hit while different branches of the pathway showed enrichment for edge hits. TOV hits highlight hits on FAK, PI3K and Akt signaling whereas HeLa hits show enrichment of MAPK signaling via Raf1 and Mek2. PIPA identifies enrichment in this pathway using protein interactions despite their being few scattered hits on the right side. Our results indicate that bombesian receptor signaling is not essential to HeLa and TOV cell lines.
Essential Signaling Network Identified in HeLa and TOV screens
To identify a sub network of essential signaling interactions detected in the siRNA screens, edge hits from gene sets enriched using HeLa and TOV hits were combined into a single large network (Figure 6). After visualizing the network using a force-based clustering algorithm, major clusters pertaining to GO categories relevant to cell growth rate can be identified. Of note, starting with the original hit list, no enriched GO terms are identified by HGA. Application of PIPA, as a result enables core signaling interactions and processes to be identified from comparatively noisy genome-wide siRNA screening data.