Discovery of small protein complexes from PPI networks with size-specific supervised weighting
- Chern Han Yong^{1},
- Osamu Maruyama^{2} and
- Limsoon Wong^{3}
https://doi.org/10.1186/1752-0509-8-S5-S3
© Yong et al.; licensee BioMed Central Ltd. 2014
Published: 12 December 2014
Abstract
The prediction of small complexes (consisting of two or three distinct proteins) is an important and challenging subtask in protein complex prediction from protein-protein interaction (PPI) networks. The prediction of small complexes is especially susceptible to noise (missing or spurious interactions) in the PPI network, while smaller groups of proteins are likelier to take on topological characteristics of real complexes by chance.
We propose a two-stage approach, SSS and Extract, for discovering small complexes. First, the PPI network is weighted by size-specific supervised weighting (SSS), which integrates heterogeneous data and their topological features with an overall topological isolatedness feature. SSS uses a naive-Bayes maximum-likelihood model to weight the edges with two posterior probabilities: that of being in a small complex, and of being in a large complex. The second stage, Extract, analyzes the SSS-weighted network to extract putative small complexes and scores them by cohesiveness-weighted density, which incorporates both small-co-complex and large-co-complex weights of edges within and surrounding the complexes.
We test our approach on the prediction of yeast and human small complexes, and demonstrate that our approach attains higher precision and recall than some popular complex prediction algorithms. Furthermore, our approach generates a greater number of novel predictions with higher quality in terms of functional coherence.
Introduction
Most cellular processes are performed not by individual proteins acting alone, but by complexes consisting of multiple proteins that interact (bind) physically. Protein complexes comprise the modular machinery of the cell, performing a wide variety of molecular functions, so determining the set of existing complexes is important for understanding the mechanism, organization, and regulation of cellular processes. Since proteins in a complex interact physically, protein-protein interaction (PPI) data, made available in large amounts by high-throughput experimental techniques, is an important resource in the study of complexes. PPI data is frequently represented as a PPI network (PPIN), where vertices represent proteins and edges represent interactions between proteins. Protein complexes are groups of proteins that interact with one another, so they are usually dense subgraphs in PPI networks. Many algorithms have been developed to discover complexes from PPI networks based mainly on this idea [1–6].
It has been noted that the distribution of complex sizes follows a power-law distribution [7], meaning that a large majority of complexes are small. Thus the discovery of small complexes is an important subtask within complex discovery. An inherent difficulty is that the strategy of searching for dense clusters becomes problematic: fully dense size-2 and size-3 clusters (i.e., cliques) correspond to edges and triangles respectively, and only a few among the abundant edges and triangles of the PPI network represent actual small complexes. Furthermore, high-throughput PPI data suffers from significant amounts of noise, in terms of false positives (spuriously detected interactions) as well as false negatives (missing interactions). This presents a challenge for complex discovery from PPI data, and is especially severe for the discovery of small complexes, which is much more sensitive to extraneous or missing edges: for a size-2 complex, a missing co-complex interaction disconnects its two member proteins, while only two extraneous interactions are sufficient to embed it within a larger clique (a triangle).
Our proposed approach to address these challenges consists of two steps. First, we weight the edges of the PPI network with the probabilities of belonging to a complex, in a size-specific manner. Second, we extract the small complexes from this weighted network. In the first step, our weighting approach, called size-specific supervised weighting (SSS), integrates three different data sources (PPIs, functional associations, and literature co-occurrences) with their topological characteristics (degree, shared neighbours, and connectivity between neighbours), as well as an overall topological isolatedness feature. SSS uses a supervised maximum-likelihood naive-Bayes model to weight each edge with two separate probabilities: that of belonging to a small complex, and of belonging to a large complex. In the second step, our complex extraction approach, called Extract, uses these weights to predict and score candidate small complexes, by weighting their densities with a cohesiveness function [5] that incorporates both small and large co-complex probabilities of edges within and around each cluster.
In our previous approach, Supervised Weighting for Composite Networks (SWC [8]), we integrated diverse data sources (including topological characteristics) with a supervised approach to accurately score edges with co-complex probabilities, and attained good performance in predicting large complexes (of size greater than three) in yeast and human. However, SWC's performance in scoring edges from small complexes is unsatisfactory. This is because edges in small complexes have radically different topological characteristics from edges in large complexes. And since there are a far greater number of edges from large complexes than from small complexes, the learned model reflects the features of the former rather than the latter. Thus, here we use a model for small complexes specifically, which captures the characteristics of their edges more accurately.
By integrating two additional data sources (functional associations and literature co-occurrences) with supervised learning, our approach reduces the amount of spurious interactions among the PPIs. Complexes tend to be characterized by certain topological characteristics in the PPI network (for example, they tend to be densely connected and bordered by a sparse region), but smaller groups of proteins are likelier to take on such characteristics by chance. Integrating topological features from multiple data sources reduces the discovery of false positive complexes, as it is less likely that all data sources share such characteristics by chance in a random set of proteins.
An important topological characteristic of complexes is that they tend to be topologically isolated, or bordered by a sparse region. Many complexes exhibit a core-attachment structure [9], where distinct complexes can share common subsets of proteins (called the core), with variations among the remaining proteins (attachments). Since distinct complexes can share proteins, they overlap in the PPI network, and thus are not expected to be completely isolated; nonetheless, proteins in small complexes with core-attachment structures are still more isolated than those in large complexes. Thus we incorporate an isolatedness feature derived from an initial posterior probability calculation, which contributes to discriminating between edges in small complexes, large complexes, or in no complex.
Predicted complexes are typically given some score indicative of confidence in the prediction. The weighted density of the predicted complex is frequently used for this purpose (for example in [4, 8]): assuming the edge weights represent co-complex estimates, the weighted density averages over the weights of all the edges within the predicted complex, giving an overall measurement of the prediction's reliability. However, for predicted small complexes the weighted density is derived from only one or three edges (corresponding to size-2 or size-3 clusters respectively), making it susceptible to noisy edge weights. Thus we incorporate a cohesiveness function in scoring predicted complexes, which includes both internal edges within the cluster, as well as outgoing edges around the cluster.
We test our approach on the prediction of small complexes in yeast and human, and obtain improved performance in both organisms. In the rest of the paper, we first describe each of the two steps of our approach. Next we describe our experimental methodology, and finally present and discuss our results.
Methods
In this section, we describe our approach for predicting small protein complexes, which consists of two stages: first, size-specific supervised weighting (SSS) of the PPIs; second, extracting small complexes from this weighted PPI network.
Size-specific supervised weighting (SSS) of the PPI network
SSS uses supervised learning to weight each edge of the reliable PPI network with two posterior probabilities, that of being a small-co-complex edge (i.e., of belonging to a small complex), and that of being a large-co-complex edge, given the edge's features. These features consist of diverse data sources, their topological characteristics, and an isolatedness feature derived from an initial calculation of the posterior. We first describe the data sources and features we use, then describe our weighting approach.
Data sources and features
We use three different data sources (PPI, functional association, and literature co-occurrence) together with their topological characteristics as features. Each data source provides a list of scored protein pairs: for each pair of proteins (a, b) with score s, a is related to b with score s, according to that data source. For both yeast and human, the following data sources are used:
where rel_{i} is the estimated reliability of experimental method i, E_{a,b} is the set of experimental methods that detected interaction (a, b), and n_{i,a,b} is the number of times that experimental method i detected interaction (a, b). The scores from the Consolidated dataset are discretized into ten equally spaced bins (0–0.1, 0.1–0.2, ...), each of which is treated as a separate experimental method in our scoring scheme. We avoid double-counting evidence across the datasets by using publication IDs (in particular, PPIs from the Consolidated dataset are removed from the BioGRID, IntAct, and MINT datasets).
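The equation that these definitions belong to is not reproduced in this text. A minimal sketch of how such a score could be computed, assuming the commonly used noisy-OR form score(a, b) = 1 − ∏_{i ∈ E_{a,b}} (1 − rel_{i})^{n_{i,a,b}}; this form, the function name, and the example reliabilities are assumptions made for illustration, not values from the paper:

```python
def ppi_reliability(detections, rel):
    """Combine per-method evidence for one interaction (a, b).

    detections: maps experimental method i -> n_{i,a,b} (times detected)
    rel:        maps experimental method i -> rel_i in [0, 1]
    Assumed form: 1 - prod_i (1 - rel_i) ** n_{i,a,b}.
    """
    prob_all_wrong = 1.0
    for method, count in detections.items():
        # Each detection is treated as independently wrong with prob. 1 - rel_i.
        prob_all_wrong *= (1.0 - rel[method]) ** count
    return 1.0 - prob_all_wrong


# Hypothetical reliabilities and detection counts, for illustration only.
rel = {"two-hybrid": 0.2, "affinity-capture": 0.5}
score = ppi_reliability({"two-hybrid": 1, "affinity-capture": 2}, rel)
print(round(score, 3))
```

Under this assumed form, repeated detections by a reliable method push the score towards 1, while a single detection by an unreliable method contributes little.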
• STRING: Predicted functional association data is obtained from the STRING database [15] (data downloaded in January 2014). STRING predicts each association between two proteins a and b (or their respective genes) using the following evidence types: gene co-occurrence across genomes; gene fusion events; gene proximity in the genome; homology; co-expression; physical interactions; co-occurrence in literature; and orthologs of the latter five evidence types transferred from other organisms (STRING also includes evidence obtained from databases, which we discard as this may include the co-complex relationships that we are trying to predict). Each evidence type is associated with quantitative information (e.g. the number of gene fusion events), which STRING maps to a confidence score of functional association based on co-occurrence in KEGG pathways. The confidence scores of the different evidence types are then combined probabilistically to give a final functional association score for (a, b). Only pairs with score greater than 0.5 are kept.
where A_{ x } is the set of PubMed papers that contain protein x. For yeast, that would be the papers that contain the gene name or open reading frame (ORF) ID of x as well as the word "cerevisiae"; for human that would be the papers that contain the gene name or Uniprot ID of x as well as the words "human" or "sapiens".
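The co-occurrence formula itself is not reproduced in this text. A minimal sketch, assuming the score is the Jaccard similarity of the two paper sets, |A_{a} ∩ A_{b}| / |A_{a} ∪ A_{b}|; the Jaccard form, the function name, and the paper IDs are assumptions for illustration:

```python
def literature_cooccurrence(papers_a, papers_b):
    """Jaccard similarity of the paper sets A_a and A_b (assumed form)."""
    union = papers_a | papers_b
    if not union:
        # Neither protein appears in any paper: no evidence of co-occurrence.
        return 0.0
    return len(papers_a & papers_b) / len(union)


# Hypothetical PubMed IDs, for illustration only.
A_a = {101, 102, 103, 104}
A_b = {103, 104, 105}
lit_score = literature_cooccurrence(A_a, A_b)
print(lit_score)
```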
For each protein pair in each data source, we derive three topological features: degree (DEG), shared neighbors (SHARED), and neighborhood connectivity (NBC). For each data source, the edge weight used to calculate these topological features is the data source score of the edge.
where w(x, y) is the data source score of edge (x, y), N_{ a } is the set of all neighbours of a, excluding a.
where w(x, y) is the data source score of edge (x, y); N_{ a,b } is the set of all neighbours of a and b, excluding a and b themselves; λ is a dampening factor.
• SHARED: The extent of shared neighbors between the protein pair, derived using the Iterative AdjustCD function (with two iterations) [4].
This gives a total of twelve features: the three data sources PPI, STRING, and LIT, and nine topological features (three for each data source), DEG_{PPI}, DEG_{STRING}, DEG_{LIT}, SHARED_{PPI}, SHARED_{STRING}, SHARED_{LIT}, NBC_{PPI}, NBC_{STRING}, and NBC_{LIT}. In addition, a feature called isolatedness is incorporated after an initial calculation of the posterior probabilities, as described below.
Size-specific supervised weighting of the PPI network (SSS)
In this step, we weight the edges of the PPI network with our size-specific supervised weighting (SSS) approach. We use a highly reliable subset of the PPI network, by keeping only the top k edges with the highest PPI reliability scores. In our experiments we set k = 10000, but similar results are obtained for other values of k. SSS uses supervised learning to weight each edge with three scores: its posterior probability of being a small-co-complex edge (i.e., of belonging to a small complex), of being a large-co-complex edge, and of not being a co-complex edge, given the features of the edge. These features consist of the twelve features described above (PPI, STRING, LIT, and nine topological features), as well as an isolatedness feature which is derived from an initial calculation of the posterior probabilities. We use a naive-Bayes maximum-likelihood model to derive the posterior probabilities.
1. Minimum description length (MDL) supervised discretization [16] is performed to discretize the features (excluding the isolatedness feature). MDL discretization recursively partitions the range of each feature to minimize the information entropy of the classes. If no partition of a feature reduces the information entropy, the feature cannot be discretized and is removed. Thus this step also serves as simple feature selection.
for each discretized value f of each feature F (excluding the isolatedness feature). n_{sm} is the number of edges with class label sm-comp, and n_{sm,F=f} is the number of edges with class label sm-comp whose feature F has value f; n_{lg} is the number of edges with class label lg-comp, and n_{lg,F=f} is the number of edges with class label lg-comp whose feature F has value f; n_{non} is the number of edges with class label non-comp, and n_{non,F=f} is the number of edges with class label non-comp whose feature F has value f.
The posterior probabilities are calculated in a similar fashion for the other two classes lg-comp and non-comp. We abbreviate the posterior probability of edge (a, b) being in each of the three classes as P_{(a,b),sm}, P_{(a,b),lg}, and P_{(a,b),non}.
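The maximum-likelihood estimates and the naive-Bayes posterior described in these steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes each likelihood is the count ratio n_{c,F=f} / n_{c}, that class priors are estimated from class frequencies, and that no smoothing is applied; the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def learn(edges):
    """edges: list of (features, label), where features maps a feature
    name to its discretized value."""
    class_counts = Counter(label for _, label in edges)  # n_c
    value_counts = defaultdict(Counter)  # label -> counts of (feature, value)
    for features, label in edges:
        for name, value in features.items():
            value_counts[label][(name, value)] += 1  # n_{c,F=f}
    return class_counts, value_counts


def posterior(features, class_counts, value_counts):
    """Posterior over the classes for one edge, assuming independent features."""
    total = sum(class_counts.values())
    joint = {}
    for label, n_label in class_counts.items():
        p = n_label / total  # class prior (assumed: class frequency)
        for name, value in features.items():
            p *= value_counts[label][(name, value)] / n_label  # n_{c,F=f} / n_c
        joint[label] = p
    norm = sum(joint.values())
    return {label: (p / norm if norm > 0 else 0.0) for label, p in joint.items()}


# Toy training edges with a single discretized feature, for illustration only.
training = (
    [({"PPI": "high"}, "sm-comp")] * 2
    + [({"PPI": "high"}, "lg-comp"), ({"PPI": "low"}, "lg-comp")]
    + [({"PPI": "low"}, "non-comp")] * 3
    + [({"PPI": "high"}, "non-comp")]
)
class_counts, value_counts = learn(training)
post = posterior({"PPI": "high"}, class_counts, value_counts)
print(post)
```

Because the estimates are unsmoothed, a feature value never seen with a class drives that class's posterior to zero; whether the original method guards against this is not stated in this text.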
where N_{ x } denotes the neighbours of x, excluding x. The ISO feature is discretized with MDL.
5. The maximum-likelihood parameters for the ISO feature are learned for the three classes.
6. The posterior probabilities for the three classes, P_{(a,b),sm}, P_{(a,b),lg}, and P_{(a,b),non}, are recalculated for each edge (a, b), this time incorporating the new ISO feature.
Extracting small complexes
After using SSS to weight the PPI network, the small complexes are extracted. This stage, called Extract, consists of two steps (see Figure 1): first, the small-co-complex probability weight of each edge is disambiguated into size-2 and size-3 complex components; next, each candidate complex is scored by its cohesiveness-weighted density, which is based on both its internal and outgoing edges.
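The scoring formula used by Extract is not reproduced in this text. The general idea of weighting a candidate's density by its cohesiveness can be sketched as below, under loud assumptions: a single weight per edge (the paper combines small-co-complex and large-co-complex weights), a ClusterONE-style cohesiveness without a penalty term, and a plain product of the two quantities. All names and weights are hypothetical.

```python
from itertools import combinations


def weighted_density(cluster, weight):
    """Mean edge weight over all protein pairs in the cluster
    (pairs without an edge count as weight 0)."""
    pairs = list(combinations(sorted(cluster), 2))
    return sum(weight.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def cohesiveness(cluster, weight):
    """Internal edge weight divided by internal plus outgoing edge weight."""
    w_in = w_out = 0.0
    for edge, w in weight.items():
        inside = len(edge & cluster)  # how many endpoints lie in the cluster
        if inside == 2:
            w_in += w
        elif inside == 1:
            w_out += w
    return w_in / (w_in + w_out) if w_in + w_out > 0 else 0.0


def cohesiveness_weighted_density(cluster, weight):
    # Assumed combination: simple product of the two quantities.
    return cohesiveness(cluster, weight) * weighted_density(cluster, weight)


# Toy weighted network, for illustration only.
weight = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"B", "C"}): 0.3,
    frozenset({"C", "D"}): 0.6,
}
candidate_score = cohesiveness_weighted_density(frozenset({"A", "B"}), weight)
print(round(candidate_score, 3))
```

In this toy network the candidate {A, B} has a single internal edge, so its density alone rests on one weight; the outgoing edge B–C lowers the score, which is the effect the cohesiveness term is meant to contribute.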
Results and discussion
Experimental setup
In our main experiments, we compare our two-stage approach (weighting with SSS, small complex extraction with Extract) against using the original PPI reliability (PPIREL) weighted network with standard clustering approaches to derive small complexes:
Markov Cluster Algorithm (MCL) [1] simulates stochastic flow to enhance the contrast between regions of strong and weak flow in the graph. The process converges to a partition with a set of high-flow regions (the clusters) separated by boundaries with no flow.
Restricted Neighborhood Search Clustering (RNSC) [2] is a local search algorithm that explores the solution space to minimize a cost function, calculated according to the number of intra-cluster and inter-cluster edges. RNSC first composes an initial random clustering, and then iteratively moves nodes between clusters to reduce the clustering's cost. It also makes diversification moves to avoid local minima. RNSC performs several runs, and reports the clustering from the best run.
IPCA [3] expands clusters from seeded vertices, based on rules that encode prior knowledge of the topological structure of protein complexes' PPI subgraphs. Whether a cluster is expanded to include a vertex is determined by the diameter of the resultant cluster and the connectivity between the vertex and the cluster.
Clustering by Maximal Cliques (CMC) [4] first generates all the maximal cliques from a given network, and then removes or merges highly overlapping clusters based on their inter-connectivity as follows. If the overlap between two maximal cliques exceeds a threshold overlap_thres, then CMC checks whether the inter-connectivity between the two cliques exceeds a second threshold merge_thres. If it does, then the two cliques are merged; otherwise, the clique with lower density is removed.
Clustering with Overlapping Neighborhood Expansion (ClusterONE) [5] greedily expands clusters from seeded vertices to maximize a cohesiveness function, which is based on the edge weights within a cluster and the edge weights connecting the cluster to the rest of the network. It then merges highly-overlapping clusters.
Proteins' Partition Sampler v2.3 (PPSampler2) [6] partitions the PPI network into clusters using a Markov-chain Monte-Carlo approach to optimize an objective function. Its novel aspect is that it incorporates the size distribution of clusters in the objective function, and thus accounts for the sizes of complexes found in actual biological systems, where most of the complexes are small.
The six clustering algorithms and their parameters used for small complex discovery.
Clustering algorithm | Parameters |
---|---|
CMC | overlap_thres = 1, merge_thres = 1 |
ClusterONE | all default |
IPCA | -P1 -T0.4 |
MCL | -I 2 |
RNSC | -e10 -D50 -d10 -t20 -T3 |
PPSampler2 | -f1DenominatorExponent 1 -f2 |
We also investigate the performance of using our SSS-weighted network with standard clustering approaches, and using the PPIREL network with our Extract approach.
We perform random sub-sampling cross-validation, repeated over ten rounds, using manually curated complexes as reference complexes for training and testing. For yeast, we use the CYC2008 [17] set which consists of 408 complexes, of which 259 are small (composed of two or three proteins). For human, we use the CORUM [18] set (filtered to remove duplicates and small complexes that are subsets of large ones), which consists of 1352 complexes, of which 701 are small. In each cross-validation round, t% of the complexes (large and small) are selected for testing, while all the remaining complexes are used for training. Each edge (a, b) in the network is given a class label lg-comp if a and b are in the same large training complex; otherwise it is labeled sm-comp if a and b are in the same small training complex; otherwise its class label is non-comp. Learning in SSS is performed using these labels, and the edges of the network are weighted using the learned models. Small complexes are then extracted from the weighted network. The predicted complexes are evaluated by matching them with only the small test complexes.
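The edge-labeling rule in the paragraph above can be sketched as follows. This is a minimal illustration, assuming complexes are given as sets of protein names and that "small" means at most three proteins; the function names and toy data are hypothetical.

```python
from itertools import combinations


def co_complex_pairs(complexes):
    """All unordered protein pairs that share at least one complex."""
    pairs = set()
    for members in complexes:
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs


def label_edges(edges, training_complexes, small_max=3):
    """Label each edge lg-comp, sm-comp, or non-comp, in that order of precedence."""
    large = co_complex_pairs(c for c in training_complexes if len(c) > small_max)
    small = co_complex_pairs(c for c in training_complexes if len(c) <= small_max)
    labels = {}
    for a, b in edges:
        pair = frozenset((a, b))
        if pair in large:  # large complexes take precedence
            labels[(a, b)] = "lg-comp"
        elif pair in small:
            labels[(a, b)] = "sm-comp"
        else:
            labels[(a, b)] = "non-comp"
    return labels


# Toy training complexes and edges, for illustration only.
training_complexes = [{"A", "B", "C", "D"}, {"C", "E"}, {"A", "B"}]
edges = [("A", "B"), ("C", "E"), ("E", "F")]
labels = label_edges(edges, training_complexes)
print(labels)
```

The edge (A, B) is labeled lg-comp even though A and B also form a small training complex, mirroring the precedence stated in the text.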
We design our experiments to simulate a real-use scenario of complex prediction in an organism where a few complexes might already be known, and novel complexes are to be predicted: in each round of cross-validation, the training complexes are those that are known and leveraged for learning to discover new complexes, while the test complexes are used to evaluate the performance of each approach at this task. Thus we use a large percentage of test complexes, t = 90%. In yeast, this gives about 233 small test complexes and 26 small training complexes per cross-validation round; in human, this gives about 631 small test complexes and 70 small training complexes.
Evaluation methods
The precision of clusters is calculated only among those clusters that do not match a training complex, to eliminate the bias of the supervised approach (SSS) towards predicting training complexes well. As a summary statistic, we also calculate the area under the curve (AUC) of each precision-recall graph.
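A minimal sketch of this evaluation, under stated assumptions: the criterion for matching a cluster to a complex is not given in this text, so the matches are taken as precomputed inputs; recall is taken as the fraction of small test complexes matched so far; and the AUC is approximated with the trapezoidal rule over the observed recall range. The function names and toy data are hypothetical.

```python
def precision_recall_curve(predictions, n_test):
    """predictions: list ranked by descending score; each item is
    (matches_training, matched_test_ids), where matched_test_ids is the
    set of test complexes that the cluster matches."""
    points = []
    recalled = set()
    kept = correct = 0
    for matches_training, matched_test_ids in predictions:
        recalled |= set(matched_test_ids)
        if not matches_training:
            # Only clusters that match no training complex count for precision.
            kept += 1
            correct += 1 if matched_test_ids else 0
        if kept:
            points.append((len(recalled) / n_test, correct / kept))
    return points


def auc(points):
    """Trapezoidal area under (recall, precision) points (assumed rule)."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Toy ranked predictions with four test complexes, for illustration only.
predictions = [(False, {1}), (True, set()), (False, set()), (False, {2})]
points = precision_recall_curve(predictions, n_test=4)
area = auc(points)
print(points[-1], round(area, 4))
```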
To measure the quality of a predicted complex, we derive the semantic coherence of its Gene Ontology (GO [19]) annotations across the three GO classes, biological process (BP), cellular compartment (CC), and molecular function (MF). First, we derive the BP semantic similarity between two proteins as the information content of their BP annotations' most informative common ancestor [20]. Then we define the BP semantic coherence of a predicted complex as the average BP semantic similarity between every pair of proteins in that complex (likewise for CC and MF).
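The semantic coherence calculation can be sketched as below. This is an illustration under assumptions: the information content of a term is taken as the negative log of its annotation frequency, each term's ancestor set includes the term itself, and when a protein has several annotations the best-scoring pair of terms is used. The toy ontology, frequencies, and names are hypothetical.

```python
import math
from itertools import combinations


def protein_similarity(terms_a, terms_b, ancestors, ic):
    """Information content of the most informative common ancestor,
    maximized over all pairs of annotations (assumed aggregation)."""
    best = 0.0
    for ta in terms_a:
        for tb in terms_b:
            common = ancestors[ta] & ancestors[tb]
            if common:
                best = max(best, max(ic[t] for t in common))
    return best


def semantic_coherence(proteins, annotations, ancestors, ic):
    """Average pairwise semantic similarity within a predicted complex."""
    pairs = list(combinations(sorted(proteins), 2))
    total = sum(
        protein_similarity(annotations[a], annotations[b], ancestors, ic)
        for a, b in pairs
    )
    return total / len(pairs)


# Toy ontology: root -> t1 -> t3, and root -> t2 (illustration only).
ancestors = {
    "root": {"root"},
    "t1": {"t1", "root"},
    "t2": {"t2", "root"},
    "t3": {"t3", "t1", "root"},
}
frequency = {"root": 1.0, "t1": 0.5, "t2": 0.25, "t3": 0.25}
ic = {term: -math.log(f) for term, f in frequency.items()}
annotations = {"P1": {"t3"}, "P2": {"t1"}, "P3": {"t2"}}
coherence = semantic_coherence({"P1", "P2", "P3"}, annotations, ancestors, ic)
print(round(coherence, 4))
```

Here only the pair (P1, P2) shares an informative ancestor (t1); the other two pairs share only the root, which carries no information, so the coherence is one third of the information content of t1.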
Prediction of small complexes
Figure 2b shows the precision-recall graphs comparing our approach (SSS + Extract) to the baselines of standard clustering algorithms applied on a PPIREL network. While our approach has lower precision among the initial top predictions (at recall less than 5%), beyond that we attain substantially greater precision: for example, at 40% recall, our approach attains more than three times the precision of the other clustering approaches (28% versus 9%). Furthermore, we attain substantially higher recall as well. Figure 2c shows the precision-recall graphs when the standard clustering algorithms are applied on the SSS-weighted network. Using the SSS-weighted network, most of the clustering algorithms achieve improved precision in the mid-recall ranges, as well as gains in recall. However, our approach (SSS + Extract) still maintains greater precision in most of the recall range.
Figures 3b and 3c show the corresponding precision-recall graphs. As in yeast, our approach (SSS + Extract) outperforms the standard clustering algorithms applied on the PPIREL-weighted network by achieving substantially higher recall, as well as greater precision in almost the whole recall range (Figure 3b). Using the SSS-weighted network instead of the PPIREL-weighted network, CMC and IPCA achieve higher precision, while the other clustering algorithms suffer from lower precision or recall (Figure 3c).
In the following section we investigate how the various techniques incorporated in SSS and Extract improve the performance of small complex prediction.
How do SSS and Extract improve performance?
Figures 6c and 6d show the corresponding charts for human complexes, with and without cohesiveness weighting. With the SSS network, cohesiveness weighting improves performance in four of seven clustering algorithms, whereas with the PPIREL network, cohesiveness weighting decreases performance in all clustering algorithms. Thus, in human complexes as well, cohesiveness weighting appears useful only when edges are weighted using SSS.
Example complexes
In this section we present some example complexes that are difficult to predict using the PPIREL network with any standard clustering algorithm, but can be predicted with our approach (SSS + Extract). Since the various clustering approaches output different numbers of predictions, we consider only the top scoring predicted clusters with a cross-validation precision level greater than some threshold. For yeast we use a precision threshold of 10%, but for human we use a lower precision threshold of 2%, since fewer human complexes are predicted with high precision.
Quality of novel complexes
In this section we compare the number and quality of high-confidence novel complexes predicted by our approach (SSS + Extract), against using standard clustering algorithms on the PPI reliability network. When weighting the network with SSS, the entire set of reference complexes is used for training. We filter the predicted complexes to remove those that match any reference complex, and to keep only high-confidence predictions: the score of each predicted complex is mapped to a precision value, using the cross-validation results, and only predicted complexes with estimated precision greater than a confidence threshold are kept. For yeast, this confidence threshold is 0.5; for human, a lower threshold of 0.1 is used, since far fewer complexes are predicted with high precision.
Figure 10b shows the corresponding charts for human predictions. Again, our approach generates more high-confidence novel predictions than the other approaches, with equal or greater quality: our predicted complexes have greater coherence than those of ClusterONE, MCL, RNSC, or PPSampler2 (p < 0.05 in at least one of BP, CC, or MF), and similar coherence to those of the other approaches. Our predicted complexes have similar semantic coherence to the CORUM reference complexes.
Finally, we briefly mention two novel complexes, predicted by our approach, that we have validated via a literature scan. Our approach predicts a high-scoring yeast cluster consisting of Cap1p and Cap2p, which is not found in our reference database of complexes. However, a literature scan revealed this to be the capping protein heterodimer, which binds to actin filaments to control filament growth [21]. Our approach also predicts a novel high-scoring human cluster consisting of PKD1 and PKD2. A literature scan revealed that these two proteins, which are involved in autosomal dominant polycystic kidney disease, have been found to form a PKD1-PKD2 heterodimer [22].
Conclusion
The size of protein complexes has been noted to follow a power-law distribution, meaning that a large majority of complexes are small (consisting of two or three distinct proteins). Thus the discovery of small complexes is an important subtask in protein complex prediction. Predicting small complexes from PPI networks is inherently challenging. Small groups of proteins are likelier to take on topological characteristics of real complexes by chance: for example, fully dense groups of two or three proteins correspond to edges or triangles respectively, but only a few of these actually correspond to small complexes. Furthermore, the prediction of small complexes is especially susceptible to noise (missing or spurious interactions) in the PPI network, as these can easily disconnect a small complex, or embed it within a larger clique.
We propose a two-stage approach, SSS and Extract, for discovering small complexes. First, the PPI network is weighted by size-specific supervised weighting (SSS), which integrates heterogeneous data and their topological features with an overall topological isolatedness feature, and uses a naive-Bayes maximum-likelihood model to weight the edges with their posterior probabilities of being in a small complex, and in a large complex. Integrating other data sources into the PPI network can help reduce noise, while incorporating the topological features across multiple data sources makes it less likely that random protein groups take on topological characteristics of complexes by chance. In our second stage, Extract, the SSS-weighted network is analyzed to extract putative small complexes and score them by cohesiveness-weighted density, which incorporates both small-co-complex and large-co-complex weights of internal and outgoing edges. This reduces the impact of noisy edge weights in deriving reliable scores for predictions, as more edge weights around the candidate complex are utilized.
While a few previous approaches have used supervised learning to weight PPI edges, none of them have done so in a complex-size-specific manner, or incorporated isolatedness as a feature in this way. Our adaptation of cohesiveness to address the problem of the small number of edge weights available in scoring small complexes is also novel.
We test our approach on the prediction of yeast and human small complexes, and demonstrate that our approach outperforms some commonly-used clustering algorithms applied on a PPI reliability network, attaining higher precision and recall. Furthermore, our approach generates a greater number of novel predictions with higher quality in terms of Gene Ontology semantic coherence. Nonetheless, the performance of small complex prediction still lags behind that of predicting large complexes, especially for human complexes.
We note that a significant challenge for human complex prediction is insufficient PPI data. An estimate of the human interactome size is around 220,000 PPIs [23]. Our human PPI data consists of around 140,000 PPIs, and with an estimated false-positive rate of 50%, this means that our human PPI network represents only a third of the true human PPI network. In comparison, in yeast an estimate of the interactome size is around 50,000 PPIs. Our yeast PPI data consists of around 120,000 PPIs, so even with an estimated false-positive rate of 50%, our yeast PPI network can be believed to be a good representation of the actual yeast PPI network. The much poorer representation of the true human interactome partially explains the poorer performance of our approach on human complexes.
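The arithmetic behind these coverage estimates, using the approximate figures quoted above and assuming that every true positive corresponds to a distinct real interaction, can be written out as:

```latex
\begin{align*}
\text{human: } & 140{,}000 \times (1 - 0.5) = 70{,}000 \text{ true PPIs},
  & \frac{70{,}000}{220{,}000} &\approx 0.32 \approx \tfrac{1}{3} \\
\text{yeast: } & 120{,}000 \times (1 - 0.5) = 60{,}000 \text{ true PPIs},
  & \frac{60{,}000}{50{,}000} &= 1.2
\end{align*}
```

A ratio above 1 for yeast indicates that the number of presumed true interactions already reaches the estimated interactome size, whereas the human network covers roughly a third of its estimate.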
Nonetheless, there is still room for further work on complex detection techniques that may improve the prediction of small human complexes. A possible future direction is to adapt other techniques that have proved useful for large complex prediction, such as GO term decomposition and hub removal [24], which might further improve the performance of small complex prediction.
Declarations
This work is supported in part by a Singapore Ministry of Education grant MOE2012-T2-1-061 and a National University of Singapore NGS scholarship. Publication charges for this article are funded in part by a Singapore Ministry of Education grant MOE2012-T2-1-061.
This article has been published as part of BMC Systems Biology Volume 8 Supplement 5, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S5.
References
- van Dongen S: Graph clustering by flow simulation. 2000, PhD thesis, University of UtrechtGoogle Scholar
- King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics. 2004, 20 (17): 3013-3020. 10.1093/bioinformatics/bth351.View ArticlePubMedGoogle Scholar
- Li M, Chen J, Wang J, Hu B, Chen G: Modifying the DPClus algorithm for identifying protein complexes based on new topological structures. BMC Bioinformatics. 2008, 9: 398-10.1186/1471-2105-9-398.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu G, Wong L, Chua HN: Complex discovery from weighted ppi networks. Bioinformatics. 2009, 25 (15): 1891-1897. 10.1093/bioinformatics/btp311.View ArticlePubMedGoogle Scholar
- Nepusz T, Yu H, Paccanaro A: Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012, 9: 471-472. 10.1038/nmeth.1938.
- Widita CK, Maruyama O: PPSampler2: Predicting protein complexes more accurately and efficiently by sampling. BMC Syst Biol. 2013, 7 (Suppl 6): S14. 10.1186/1752-0509-7-S6-S14.
- Tatsuke D, Maruyama O: Sampling strategy for protein complex prediction using cluster size frequency. Gene. 2012, 518 (1): 152-158.
- Yong CH, Liu G, Chua HN, Wong L: Supervised maximum-likelihood weighting of composite protein networks for complex prediction. BMC Syst Biol. 2012, 6 (Suppl 2): S13. 10.1186/1752-0509-6-S2-S13.
- Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, et al: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440: 631-636. 10.1038/nature04532.
- Chatr-aryamontri A, Breitkreutz BJ, Heinicke S, Boucher L, Winter A, Stark C, Nixon J, Ramage L, Kolas N, O'Donnell L, Reguly T, Breitkreutz A, Sellam A, Chen D, Chang C, Rust J, Livstone M, Oughtred R, Dolinski K, Tyers M: The BioGRID interaction database: 2013 update. Nucleic Acids Res. 2013, 41 (Database): 816-823.
- Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H: The MIntAct project-IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014, 42 (Database): 358-363.
- Licata L, Briganti L, Peluso D, Perfetto L, Iannuccelli M, Galeota E, Sacco F, Palma A, Nardozza AP, Santonico E, Castagnoli L, Cesareni G: MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 2012, 40 (Database): 857-861.
- Collins SR, Kemmeren P, Zhao XC, Greenblatt JF, Spencer F, Holstege FCP, Weissman JS: Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics. 2007, 6 (3): 439-450.
- Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics. 2006, 22 (13): 1623-1630. 10.1093/bioinformatics/btl145.
- Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, Jensen LJ: STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013, 41 (Database): 808-815.
- Fayyad UM, Irani KB: Multi-interval discretization of continuous valued attributes for classification learning. Proceedings of the 13th Annual International Joint Conference on Artificial Intelligence. 1993, 1022-1027.
- Pu S, Wong J, Turner B, Cho E, Wodak SJ: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009, 37 (3): 825-831. 10.1093/nar/gkn1005.
- Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW: CORUM: the comprehensive resource of mammalian protein complexes-2009. Nucleic Acids Res. 2010, 38: 497-501.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. Nature Genet. 2000, 25: 25-29. 10.1038/75556.
- Pesquita C, Faria D, Falcao AO, Lord P, Couto FM: Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009, 5 (7): 1000443. 10.1371/journal.pcbi.1000443.
- Kim K, Yamashita A, Wear MA, Maéda Y, Cooper JA: Capping protein binding to actin in yeast: biochemical mechanism and physiological relevance. J Cell Biol. 2004, 164 (4): 567-580.
- Tsiokas L, Kim E, Arnould T, Sukhatme VP, Walz G: Homo- and heterodimeric interactions between the gene products of PKD1 and PKD2. Proc Natl Acad Sci USA. 1997, 94 (13): 6965-6970. 10.1073/pnas.94.13.6965.
- Hart GT, Ramani AK, Marcotte EM: How complete are current yeast and human protein-interaction networks?. Genome Biol. 2006, 7 (11): 120. 10.1186/gb-2006-7-11-120.
- Liu G, Yong CH, Chua HN, Wong L: Decomposing PPI networks for complex discovery. Proteome Sci. 2011, 9 (Suppl 1): S15. 10.1186/1477-5956-9-S1-S15.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.