Supervised maximum-likelihood weighting of composite protein networks for complex prediction
© Yong et al.; licensee BioMed Central Ltd. 2012
Published: 12 December 2012
Protein complexes participate in many important cellular functions, so finding the set of existent complexes is essential for understanding the organization and regulation of processes in the cell. With the availability of large amounts of high-throughput protein-protein interaction (PPI) data, many algorithms have been proposed to discover protein complexes from PPI networks. However, such approaches are hindered by the high rate of noise in high-throughput PPI data, including spurious and missing interactions. Furthermore, many transient interactions are detected between proteins that are not from the same complex, while not all proteins from the same complex may actually interact. As a result, predicted complexes often do not match true complexes well, and many true complexes go undetected.
We address these challenges by integrating PPI data with other heterogeneous data sources to construct a composite protein network, and using a supervised maximum-likelihood approach to weight each edge based on its posterior probability of belonging to a complex. We then use six different clustering algorithms, and an aggregative clustering strategy, to discover complexes in the weighted network. We test our method on Saccharomyces cerevisiae and Homo sapiens, and show that complex discovery is improved: compared to previously proposed supervised and unsupervised weighting approaches, our method recalls more known complexes, achieves higher precision at all recall levels, and generates novel complexes of greater functional similarity. Furthermore, our maximum-likelihood approach allows learned parameters to be used to visualize and evaluate the evidence of novel predictions, aiding human judgment of their credibility.
Our approach integrates multiple data sources with supervised learning to create a weighted composite protein network, and uses six clustering algorithms with an aggregative clustering strategy to discover novel complexes. We show improved performance over previous approaches in terms of precision, recall, and number and quality of novel predictions. We present and visualize two novel predicted complexes in yeast and human, and find external evidence supporting these predictions.
Protein complexes participate in many important cellular functions, so finding the set of existent complexes is essential for understanding the mechanism, organization, and regulation of processes in the cell. Since protein complexes are groups of interacting proteins, many methods have been proposed to discover complexes from protein-protein interaction (PPI) data, which has been made available in large amounts by high-throughput experimental techniques. Typically, complexes are predicted based on topological characteristics in the PPI network. For example, many approaches search for regions of high density or connectivity [1–5]. Other approaches further incorporate subgraph diameters of known complexes, and core-attachment models of connected clusters [7, 8]. Qi et al. used a set of topological features including density, degree, edge weight, and graph eigenvalues, with a supervised naive-Bayes approach to learn these feature parameters from training complexes.
Many algorithms have been developed to assess the reliability of high-throughput protein interactions [13–15] or predict new protein interactions [16–19], using various information such as gene sequences, annotations, interacting domains, 3D structures, experimental repeatability, or topological characteristics of PPI networks. These approaches have been shown to be effective in reducing false positives or false negatives. In our previous work, we showed that using the topology of the PPI network to weight interactions, remove unreliable interactions, and posit new interactions improves the performance of several complex discovery algorithms. While such approaches are effective in reducing the impact of spuriously-detected and missing interactions, they do not directly address transient interactions and non-interacting complex proteins.
Researchers have also proposed integrating heterogeneous data sources with supervised approaches to predict co-complex protein pairs (protein pairs that belong to the same complex), using a reference set of training complexes. Data integration leverages the fact that diverse data sources other than PPI can also reveal co-complex relationships, while a supervised approach targeted at predicting co-complex protein pairs can be trained to discriminate between actual co-complex interactions and spuriously-detected or transient interactions. Qiu and Noble integrated PPI, protein sequences, gene expression, interologs, and functional information, to train kernel-based models, and achieved high classification accuracy in predicting co-complex protein pairs. However, they did not apply or test their method on reconstructing and predicting complexes. Wang et al. integrated PPI, gene expression, localization annotations, and transmembrane features, and applied a boosting method to predict co-complex protein pairs. They showed that this approach, combined with their proposed clustering method HACO, achieved higher sensitivity in recovering reference complexes compared to unsupervised approaches. However, they did not explore how well their classification approach works when used in conjunction with other clustering methods: while sensitivity was improved, many reference complexes could still not be predicted, in part due to limitations of HACO, which raises the question of whether other clustering methods may also see an improvement when used with their co-complex predictions. Furthermore, these approaches directly produce co-complex affinity scores between protein pairs, without providing measurements of the predictive strengths of the different data sources, or of how the different score values of each data source indicate co-complex relationships.
In our view, this is important when integrating different data sources: while using PPI for complex prediction is biologically reasonable because proteins in a complex interact and bind with each other, using other data sources such as sequences, expression, or literature co-occurrence is not as biologically intuitive, even if they do reveal co-complex relationships. Providing a measurement of how these data sources contribute to co-complex predictions allows human judgment of the validity and credibility of predicted novel complexes.
We propose a method to address these challenges of complex discovery: first, the PPI network is integrated with other heterogeneous data sources that specify relationships between proteins, such as functional association and co-occurrence in literature, to form an expanded, composite network. Next, each edge is weighted based on its posterior probability of belonging to a protein complex, using a naive-Bayes maximum-likelihood model learned from a set of training complexes. A complex discovery algorithm can then be used on this weighted composite network to predict protein complexes. Our method offers several advantages over current unsupervised or non-integrative weighting approaches. First, a composite protein network constructed from multiple data sources is more likely to have denser subgraphs for protein complexes, as it not only reduces the number of missing interactions, but also adds edges between non-interacting proteins from the same complex, because such proteins are likely to be related in ways other than by physical interactions. Second, learning a model from training complexes not only provides a powerful method to assess the reliability of interactions, but also allows the discrimination between transient and co-complex interactions. Third, utilizing multiple data sources to assess the reliability of interactions is likely to be more accurate than using just PPI data.
Our choice of a naive-Bayes maximum-likelihood model also offers several advantages over other supervised data-integration approaches. First, our model is transparent, in that learned parameters can be validated and analyzed, for example to reveal the predictive strengths of the different data sources. Furthermore, for a predicted complex, the learned parameters can then be used to visualize the component evidences from the different data sources, allowing human judgment of the credibility of the prediction. Second, maximum-likelihood models are known to be robust and have low variance, even when few training samples are available. Although we describe our experiments using yeast and human, this robustness is important when applying our approach to less-studied organisms with fewer known complexes available for training. Finally, we utilize different clustering algorithms as well as a simple aggregative clustering strategy to evaluate the performance of our method, and show that we improve the performance of complex prediction compared to other weighting methods.
Heterogeneous data sources are combined to build the composite network. Each data source provides a list of scored protein pairs: for each pair of proteins (u, v) with score s, u is related to v with score s, according to that data source. For both yeast and human, the following data sources are used:
PPI data is obtained by taking the union of physical interactions from BioGRID, IntAct and MINT (data from all three repositories downloaded in November 2011). Interactions are scored using a topological function, Iterative AdjustCD (with two iterations), which has been shown to improve the performance of complex discovery. Iterative AdjustCD uses expectation maximization to score each interaction (u, v) based on the number of shared neighbors of u and v. Interactions between proteins that have no shared neighbors are regarded as unreliable and are discarded. Protein pairs that do not directly interact but have shared neighbors are also scored; such pairs with scores above 0.1 are added as new interactions, and are called Level 2 or L2-PPIs. We consider PPIs and L2-PPIs as two separate data sources.
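The shared-neighbor scoring described above can be illustrated with a minimal sketch. The exact Iterative AdjustCD formula is not given in this text, so the degree-adjusted score below is an assumption made for illustration, and it performs a single pass rather than the two iterations mentioned above. The function name `shared_neighbour_scores` and the `adj` input format are hypothetical.

```python
from itertools import combinations

def shared_neighbour_scores(adj, l2_threshold=0.1):
    """Score protein pairs by shared neighbours (single pass, assumed formula).

    adj maps each protein to the set of its direct interaction partners
    (assumed symmetric). Returns (ppi_scores, l2_scores), both keyed by
    frozenset({u, v}).
    """
    avg_degree = sum(len(n) for n in adj.values()) / len(adj)
    # Assumed degree adjustment: pad low-degree proteins up to the average degree
    lam = {u: max(0.0, avg_degree - len(adj[u])) for u in adj}
    ppi_scores, l2_scores = {}, {}
    for u, v in combinations(sorted(adj), 2):
        shared = adj[u] & adj[v]
        if not shared:
            continue  # no shared neighbours: interaction discarded, or pair unrelated
        score = 2 * len(shared) / (len(adj[u]) + lam[u] + len(adj[v]) + lam[v])
        if v in adj[u]:
            ppi_scores[frozenset((u, v))] = score   # scored direct interaction
        elif score > l2_threshold:
            l2_scores[frozenset((u, v))] = score    # new Level 2 interaction
    return ppi_scores, l2_scores

# Toy network: triangle a-b-c with a pendant protein d attached to c
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
ppi, l2 = shared_neighbour_scores(adj)
```

In this toy network the interaction c-d is dropped because c and d share no neighbors, while a-d and b-d appear as L2-PPIs because they share the neighbor c.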
Predicted functional association data is obtained from the STRING database (data downloaded in January 2012). STRING predicts each association between two proteins u and v (or their respective genes) using the following evidence types: gene co-occurrence across genomes; gene fusion events; gene proximity in the genome; homology; coexpression; physical interactions; co-occurrence in literature; and orthologs of the latter five evidence types transferred from other organisms (STRING also includes evidence obtained from databases, which we discard as this may include co-complex relationships which we are trying to predict). Each evidence type is associated with quantitative information (e.g. the number of gene fusion events), which STRING maps to a confidence score of functional association based on co-occurrence in KEGG pathways. The confidence scores of the different evidence types are then combined probabilistically to give a final functional association score for (u, v). Only pairs with score greater than 0.5 are kept.
where A_x is the set of PubMed papers that contain protein x. For yeast, that would be the papers that contain the gene name or open reading frame (ORF) ID of x as well as the word "cerevisiae"; for human that would be the papers that contain the gene name or UniProt ID of x as well as the words "human" or "sapiens".
Statistics of data sources (columns, repeated for the two organisms: # distinct proteins, % complex edges; rows: physical protein-protein interactions, Level 2 PPI, predicted functional association)
In the composite network, vertices represent proteins and edges represent relationships between proteins. The composite network has an edge between proteins u and v if and only if there is a relationship between u and v according to any of the data sources.
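As a minimal sketch of this construction, the composite network can be held as a mapping from each unordered protein pair to its per-source scores. The input format (one dictionary of scored pairs per data source) and the name `build_composite_network` are assumptions for illustration; sources that do not relate a pair are recorded as 0, matching the feature definition used for weighting.

```python
def build_composite_network(sources):
    """Union the scored pairs of all data sources into one edge table.

    sources: dict mapping a source name to a dict {(u, v): score}.
    Returns a dict mapping frozenset({u, v}) to {source name: score},
    with score 0.0 for every source that does not relate u and v.
    """
    network = {}
    for name, pairs in sources.items():
        for (u, v), score in pairs.items():
            edge = frozenset((u, v))  # undirected: (u, v) and (v, u) coincide
            features = network.setdefault(edge, {n: 0.0 for n in sources})
            features[name] = score
    return network

# Toy input with two hypothetical sources
sources = {
    "PPI": {("a", "b"): 0.8},
    "STRING": {("b", "a"): 0.9, ("b", "c"): 0.6},
}
network = build_composite_network(sources)
```

An edge exists as soon as any one source relates the two proteins, so the pair b-c above is included even though it has no PPI evidence.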
Next, each edge (u, v) is weighted based on its posterior probability of being a co-complex edge (i.e. both u and v are in the same complex), given the scores of the data source relationships between u and v.
We use a naive-Bayes maximum-likelihood model to derive the posterior probability. Each edge (u, v) between proteins u and v of the composite network is cast as a data instance. The set of features is the set of data sources, and for each instance (u, v), feature F has value f if proteins u and v are related by data source F with score f. If u and v are not related by data source F, then feature F is given a score of 0. Using a reference set of protein complexes, each instance (u, v) in the training set is given a class label co-complex if both u and v are in the same complex; otherwise its class label is non-co-complex. Learning proceeds in two steps:
1. Minimum description length (MDL) supervised discretization is performed to discretize the features. MDL discretization recursively partitions the range of each feature to minimize the information entropy of the classes. If a feature cannot be discretized, that means it is not possible to find a partition that reduces the information entropy, so the feature is removed. Thus this step also serves as simple feature selection.
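2. The maximum-likelihood parameters are estimated for the two classes, co-complex and non-co-complex:
P(F = f | co-complex) = n_{c,F=f} / n_c
P(F = f | non-co-complex) = n_{\neg c,F=f} / n_{\neg c}
for each discretized value f of each feature F. Here n_c is the number of edges with class label co-complex, n_{c,F=f} is the number of edges with class label co-complex and whose feature F has value f, n_{\neg c} is the number of edges with class label non-co-complex, and n_{\neg c,F=f} is the number of edges with class label non-co-complex and whose feature F has value f.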
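The discretization in step 1 can be sketched as follows. The text names MDL supervised discretization without giving its stopping rule; the sketch assumes the widely used Fayyad-Irani MDL criterion, and all function names are hypothetical. It is written for clarity rather than speed (each split search is quadratic in the segment size).

```python
import math
from collections import Counter

def entropy(labels):
    """Class information entropy (in bits) of a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """Return (gain, index) of the boundary with the largest entropy reduction."""
    labels = [label for _, label in pairs]
    n, base, best = len(pairs), entropy(labels), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal feature values
        cond = (i * entropy(labels[:i]) + (n - i) * entropy(labels[i:])) / n
        if best is None or base - cond > best[0]:
            best = (base - cond, i)
    return best

def mdl_accepts(labels, i, gain):
    """Assumed Fayyad-Irani test: accept the cut only if the gain pays for it."""
    n, left, right = len(labels), labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n

def mdl_cut_points(values, labels):
    """Recursively partition a feature; an empty result means 'remove feature'."""
    cuts = []
    def recurse(segment):
        if len(segment) < 2:
            return
        best = best_split(segment)
        if best is None:
            return
        gain, i = best
        if not mdl_accepts([label for _, label in segment], i, gain):
            return
        cuts.append((segment[i - 1][0] + segment[i][0]) / 2)  # midpoint cut
        recurse(segment[:i])
        recurse(segment[i:])
    recurse(sorted(zip(values, labels)))
    return sorted(cuts)

# An informative toy feature (low scores non-co-complex, high scores co-complex)
informative = mdl_cut_points(
    [i / 100 for i in range(20)] + [0.8 + i / 100 for i in range(20)],
    [0] * 20 + [1] * 20)
# An uninformative toy feature (labels alternate regardless of score)
uninformative = mdl_cut_points(list(range(40)), [i % 2 for i in range(40)])
```

The informative feature receives a single cut point between its two score ranges, while the uninformative one receives none and would be dropped, which is how discretization doubles as feature selection.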
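The posterior probability of an edge (u, v) being a co-complex edge, given its feature values F_1 = f_1, ..., F_n = f_n, is then
P(co-complex | F_1 = f_1, ..., F_n = f_n)
= P(F_1 = f_1, ..., F_n = f_n | co-complex) P(co-complex) / Z
= \prod_i P(F_i = f_i | co-complex) P(co-complex) / Z
= \prod_i P(F_i = f_i | co-complex) P(co-complex) / [ \prod_i P(F_i = f_i | co-complex) P(co-complex) + \prod_i P(F_i = f_i | non-co-complex) P(non-co-complex) ]
where Z is a normalizing factor to ensure the probabilities sum to 1. Although the second last equality makes the assumption that the features are independent, naive-Bayes classifiers have been found to perform well even when this assumption is false. Specifically, while the probability estimates are frequently inaccurate, their rank orders usually remain correct, so that edges with likelier co-complex feature values are assigned higher scores than edges with likelier non-co-complex feature values.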
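A minimal sketch of the parameter estimation and edge weighting follows, assuming features have already been discretized and that both classes occur in the training set. The function names and the model layout are hypothetical, and no smoothing is applied because the text describes plain maximum-likelihood estimates.

```python
from collections import Counter

def learn_model(instances, labels):
    """Count-based maximum-likelihood estimates.

    instances: list of dicts {feature: discretized value}, one per edge.
    labels: list of bools, True for co-complex edges.
    """
    counts = {True: Counter(), False: Counter()}
    for feats, y in zip(instances, labels):
        for feature, value in feats.items():
            counts[bool(y)][(feature, value)] += 1
    n_c = sum(1 for y in labels if y)
    return {"n": {True: n_c, False: len(labels) - n_c}, "counts": counts}

def posterior_co_complex(model, feats):
    """Naive-Bayes posterior of the co-complex class for one edge."""
    total = model["n"][True] + model["n"][False]
    joint = {}
    for cls in (True, False):
        p = model["n"][cls] / total  # class prior
        for feature, value in feats.items():
            # P(F = f | class) as a ratio of counts
            p *= model["counts"][cls][(feature, value)] / model["n"][cls]
        joint[cls] = p
    z = joint[True] + joint[False]  # normalizing factor Z
    return joint[True] / z if z > 0 else 0.0

# Toy training set with one hypothetical feature
instances = [{"PPI": "high"}, {"PPI": "high"}, {"PPI": "low"},
             {"PPI": "low"}, {"PPI": "low"}, {"PPI": "high"}]
labels = [True, True, False, False, False, False]
model = learn_model(instances, labels)
w_high = posterior_co_complex(model, {"PPI": "high"})
w_low = posterior_co_complex(model, {"PPI": "low"})
```

Because the parameters are plain count ratios, each one can be inspected directly, which is what makes the learned model transparent.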
After the composite network is weighted, the top k edges are used by a clustering algorithm to predict protein complexes. We use the following clustering algorithms in our study:
Markov Cluster Algorithm (MCL) simulates stochastic flow to enhance the contrast between regions of strong and weak flow in the graph. The process converges to a partition with a set of high-flow regions (the clusters) separated by boundaries with no flow.
Restricted Neighborhood Search Clustering (RNSC) is a local search algorithm that explores the solution space to minimize a cost function, calculated according to the number of intra-cluster and inter-cluster edges. RNSC first composes an initial random clustering, and then iteratively moves nodes between clusters to reduce the clustering's cost. It also makes diversification moves to avoid local minima. RNSC performs several runs, and reports the clustering from the best run.
IPCA expands clusters from seeded vertices, based on rules that encode prior knowledge of the topological structure of protein complexes' PPI subgraphs. Whether a cluster is expanded to include a vertex is determined by the diameter of the resultant cluster and the connectivity between the vertex and the cluster.
Clustering by Maximal Cliques (CMC) first generates all the maximal cliques from a given network, and then removes or merges highly overlapping clusters based on their inter-connectivity as follows. If the overlap between two maximal cliques exceeds a threshold overlap_thres, then CMC checks whether the inter-connectivity between the two cliques exceeds a second threshold merge_thres. If it does, then the two cliques are merged; otherwise, the clique with lower density is removed.
Hierarchical Agglomerative Clustering with Overlap (HACO) first considers all vertices as individual clusters, then iteratively merges pairs of clusters with high connectivity between them. At each merge, the two constituting clusters are remembered; when the merged cluster A is later merged with another cluster B, it also tries to merge the remembered constituting clusters of A with the cluster B, and keeps the (possibly overlapping) resultant clusters if they are highly connected.
Clustering with Overlapping Neighborhood Expansion (ClusterONE) greedily expands clusters from seeded vertices to maximize a cohesiveness function, which is based on the edge weights within a cluster and the edge weights connecting the cluster to the rest of the network. It then merges highly-overlapping clusters.
CMC, MCL, HACO, and ClusterONE are able to utilize edge weights in their input networks, whereas RNSC and IPCA are not; for these two algorithms, selecting only the top k edges still provides a less noisy input network.
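The selection of the top k edges that feeds every clustering algorithm can be sketched as below. The function names are hypothetical, and the edge table is assumed to map frozenset({u, v}) to a weight; the second helper shows the weight-free edge list that an algorithm unable to use weights would receive.

```python
import heapq

def top_k_edges(weights, k):
    """Return the k highest-weighted (edge, weight) pairs, best first."""
    return heapq.nlargest(k, weights.items(), key=lambda item: item[1])

def as_unweighted(edges):
    """Drop the weights, keeping only the selected protein pairs."""
    return [tuple(sorted(edge)) for edge, _ in edges]

# Toy weighted network
weights = {
    frozenset(("a", "b")): 0.9,
    frozenset(("b", "c")): 0.2,
    frozenset(("a", "c")): 0.7,
}
selected = top_k_edges(weights, 2)
edge_list = as_unweighted(selected)
```

Even without weights in the input, the low-scoring edge b-c never reaches the clustering algorithm, which is the sense in which edge selection alone reduces noise.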
Parameters for clustering algorithms
min_deg_ratio = 1, min_size = 4, overlap_thres = 0.5, merge_thres = 0.25
min_deg_ratio = 1, min_size = 4, overlap_thres = 0.5, merge_thres = 0.5
min_deg_ratio = 1, min_size = 4, overlap_thres = 0.5, merge_thres = 0.75
-c c 1 -g 0.1
-c c 1 -g 0.3
-c c 0.75 -g 0.1
-s 4 -d 0
-S4 -P2 -T0.4
-e10 -D50 -d10 -t20 -T3
where V_X is the set of proteins contained in X.
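The formula that this definition belongs to is not preserved in the text. As a loudly labeled assumption, the sketch below uses the Jaccard similarity of the two protein sets, a common choice for deciding whether a predicted cluster matches a complex; the names `match_score` and `matches` are hypothetical, and the default threshold only mirrors the matching thresholds quoted elsewhere in the text.

```python
def match_score(x, y):
    """Assumed match score: Jaccard similarity of the protein sets V_X and V_Y."""
    vx, vy = set(x), set(y)
    union = vx | vy
    return len(vx & vy) / len(union) if union else 0.0

def matches(x, y, match_thres=0.5):
    """A cluster matches a complex if the score reaches the threshold."""
    return match_score(x, y) >= match_thres

score = match_score({"a", "b", "c", "d"}, {"a", "b", "c"})
```

Under this assumption, a stricter threshold such as 0.75 tolerates roughly one extra or missing protein in a four-protein complex, while 0.5 admits much looser matches.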
SWC: supervised weighting of composite network (our proposed method)
BOOST: supervised weighting of composite network using LogitBoost
TOPO: unsupervised topological weighting of PPI network with Iterative AdjustCD, including level-2 PPIs (these weights are equivalent to the PPI and L2-PPI features in our composite network)
STR: network of predicted and scored functional associations from STRING (these weights are equivalent to the STRING feature in our composite network)
NOWEI: unweighted PPI network
We perform random sub-sampling cross-validation, repeated over ten rounds, using manually curated complexes as reference complexes for training and testing. For yeast, we use the CYC2008 set which consists of 408 complexes. Only complexes of size greater than three proteins are used for testing; there are 149 such complexes in CYC2008. For human, we use the CORUM set which consists of 1829 complexes, of which 714 are of size greater than three. In each cross-validation round, t% of the complexes of size greater than three are selected for testing, while all the remaining complexes are used for training. Each edge (u, v) in the network is given a class label co-complex if u and v are in the same training complex, otherwise its class label is non-co-complex. For SWC and BOOST, learning is performed using these labels, and the edges of the entire network are then weighted using the learned models. TOPO, STRING, and NOWEI require no learning, so the labels are not used; instead, for TOPO the edges of the network are weighted with topological scores, for STRING the edges are weighted with functional association scores, and for NOWEI all edges are given weight 1. The top-weighted k edges from the network are then used by the clustering algorithms to predict complexes. For NOWEI we use , while for SWC, BOOST, TOPO, and STRING, we use . We do not use all edges for these four weighting methods, because weighting enriches the network in dense clusters, which causes some of the clustering algorithms to require too much time to run when all edges are used; moreover, our experiments indicate that the performance of these methods drops when more than 20000 edges are used. The predicted clusters are evaluated on how well they match the test complexes.
We designed our experiment to simulate a real-use scenario of complex prediction in an organism where a few complexes might already be known, and novel complexes are to be predicted: in each round of cross-validation, the training complexes are those that are known and leveraged for learning to discover new complexes, while the test complexes are used to evaluate the performance of each approach at this task. Thus we use a large percentage of test complexes . In yeast, this gives 134 test complexes (among the 149 complexes of size greater than three), and 274 training complexes (only 15 of size greater than three); in human, this gives 643 test complexes (among the 714 of size greater than three), and 1186 training complexes (71 of size greater than three).
The precision of clusters is calculated only among those clusters that do not match a training complex, to eliminate the bias of the supervised approaches (SWC and BOOST) for predicting training complexes well. The precision-recall area under curve (AUC) is used as a summarizing statistic for each method's performance. Besides evaluating the performance of complex prediction, we also evaluate the performance of edge classification, in which the edge weights are used to classify edges as co-complex or non-co-complex edges.
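One way to compute such a precision-recall curve is sketched below. Several details are assumptions rather than the source's stated procedure: clusters are ranked by score, the matching rule is passed in as a function, clusters matching a training complex are excluded from both precision and recall, and the area is a trapezoidal sum that starts at the first curve point. All names are hypothetical.

```python
def precision_recall_points(clusters, test_complexes, train_complexes, matches):
    """Return (recall, precision) after each ranked cluster.

    clusters: list of (score, set of proteins). Clusters that match a
    training complex are skipped, so supervised methods gain no credit
    for re-predicting what they were trained on.
    """
    ranked = sorted(clusters, key=lambda c: -c[0])
    kept = [c for _, c in ranked
            if not any(matches(c, t) for t in train_complexes)]
    points, recalled, n_correct = [], set(), 0
    for rank, cluster in enumerate(kept, start=1):
        hits = [i for i, t in enumerate(test_complexes) if matches(cluster, t)]
        if hits:
            n_correct += 1
        recalled.update(hits)
        points.append((len(recalled) / len(test_complexes), n_correct / rank))
    return points

def area_under_curve(points):
    """Trapezoidal area between consecutive (recall, precision) points."""
    return sum((r2 - r1) * (p1 + p2) / 2
               for (r1, p1), (r2, p2) in zip(points, points[1:]))

# Toy example with an assumed Jaccard-based matching rule
jaccard_match = lambda x, y: len(x & y) / len(x | y) >= 0.5
clusters = [(0.9, {"a", "b", "c", "d"}), (0.85, {"p", "q", "r", "s"}),
            (0.8, {"e", "f", "g", "h"}), (0.7, {"x", "y", "z", "w"})]
test_complexes = [{"a", "b", "c", "d"}, {"x", "y", "z", "w"}]
train_complexes = [{"e", "f", "g", "h"}]
points = precision_recall_points(clusters, test_complexes, train_complexes,
                                 jaccard_match)
auc = area_under_curve(points)
```

In the toy example the cluster scored 0.8 reproduces a training complex and is ignored, so it neither helps nor hurts the precision of the remaining three clusters.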
To evaluate the quality of novel predicted complexes, we define three measures of semantic coherence for each complex: its biological process (BP), cellular compartment (CC), and molecular function (MF) semantic coherence. These are calculated from the proteins' annotations to Gene Ontology (GO) terms, which span the three classes BP, CC, and MF. We use the most informative common ancestor method of calculating the semantic similarity between two proteins. Briefly, the semantic similarity of two GO terms is first defined as the information content of their most informative common ancestor. Next, the BP semantic similarity of two proteins is defined as the highest semantic similarity between their two sets of annotated BP terms. Then, we define the BP semantic coherence of a predicted complex as the average BP semantic similarity between every pair of proteins in that complex (likewise for CC and MF).
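The three-level definition above translates directly into code. In this sketch the inputs are assumptions: `ancestors` maps each GO term to the set of its ancestors including itself, `ic` maps each term to its information content, and `annotations` maps each protein to its annotated terms of one GO class. The toy ontology is invented purely for illustration.

```python
from itertools import combinations

def term_similarity(t1, t2, ancestors, ic):
    """Information content of the most informative common ancestor."""
    common = ancestors[t1] & ancestors[t2]
    return max((ic[a] for a in common), default=0.0)

def protein_similarity(p, q, annotations, ancestors, ic):
    """Highest term similarity between the two proteins' annotation sets."""
    return max((term_similarity(t1, t2, ancestors, ic)
                for t1 in annotations[p] for t2 in annotations[q]),
               default=0.0)

def semantic_coherence(proteins, annotations, ancestors, ic):
    """Average protein similarity over every pair of proteins in a complex."""
    pairs = list(combinations(sorted(proteins), 2))
    if not pairs:
        return 0.0
    return sum(protein_similarity(p, q, annotations, ancestors, ic)
               for p, q in pairs) / len(pairs)

# Invented toy ontology: root -> A -> B, and root -> C
ancestors = {"root": {"root"}, "A": {"A", "root"},
             "B": {"B", "A", "root"}, "C": {"C", "root"}}
ic = {"root": 0.0, "A": 1.0, "B": 2.0, "C": 2.0}
annotations = {"p1": {"B"}, "p2": {"B"}, "p3": {"C"}}
coherence = semantic_coherence({"p1", "p2", "p3"}, annotations, ancestors, ic)
```

Here p1 and p2 share the specific term B and score 2.0, while both only share the uninformative root with p3, giving an average of 2/3 over the three pairs.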
We first evaluate each approach in classification of co-complex edges. Here, each weighting approach is used to weight the network edges, and the edges are classified as co-complex by taking a threshold on their weights. We obtain precision-recall graphs (solid markers, left axis) by taking a series of decreasing thresholds; at each recall level, we also indicate the proportion of test complexes covered by at least one predicted edge (hollow markers, right axis).
On the other hand, SWC is more accurate than STRING in predicting co-complex edges with high weights, because many proteins that are highly functionally associated are not co-complex, while SWC's supervised learning approach produces weights that are targeted at predicting co-complex edges, so highly-weighted edges are likelier to be co-complex. However, when the weight threshold is lowered to retrieve even more co-complex edges, STRING's precision rises above SWC's, indicating that finding co-complex edges in this region might be better served simply by functional association.
BOOST integrates the same data sources as SWC, but uses LogitBoost instead to learn to classify co-complex edges. Its points in the graph are clustered in two regions: one set of edges is given high scores, achieving about 40% recall and 35% precision (lower than SWC's precision of 50% at this recall level), while the remaining edges are given low scores. Thus BOOST performs classification in a categorical manner, whereas SWC produces co-complex scores that reflect a wide range of confidence.
Finally, the performance of NOWEI, which uses unweighted PPI edges, appears as a single point on the graph, and shows that the PPI edges cover only 53% of co-complex edges, with a precision of 5%.
Figure 2b shows the corresponding precision-recall graphs for classification of co-complex edges in human. Compared to yeast, the coverage of co-complex edges is much lower in human.
Compared to TOPO, SWC has lower precision along TOPO's entire recall range. However, once again TOPO's predicted edges are clustered in fewer complexes, giving lower complex coverage: for example, to cover 80% of complexes requires TOPO to recall 22% of edges at a precision of 8%; SWC has to recall only 13% of edges at a higher precision of 11% to cover the same number of complexes. Thus, for human as well as yeast, SWC is able to predict co-complex edges for a wider range of complexes compared to TOPO, whose range is limited to fewer complexes that are densely connected.
For human, STRING's functional association scores are the least accurate for predicting co-complex edges, giving the lowest precision among all the weighting approaches.
Just as in yeast, BOOST performs classification in a categorical manner: a set of edges is predicted as co-complex with high scores, achieving 7% recall and precision levels similar to SWC's, while the remaining edges are predicted as non-co-complex with low scores.
SWC recalls the most test complexes, with the highest precision at almost all recall levels, especially with the stricter match_thres = 0.75. Thus it outperforms all other weighting approaches, especially at predicting complexes with fine granularity.
At match_thresh = 0.5, STR achieves almost the same recall as SWC with only slightly lower precision levels, but its recall and precision are much worse at a higher match_thresh = 0.75. Since STR classifies co-complex edges across a large range of clusters, it is able to recall many test complexes; but its lower accuracy in edge classification means that many of its clusters include extra or missing proteins, causing them not to be matched at a stricter matching threshold. BOOST achieves recall similar to STR's but with substantially lower precision levels at both match thresholds. Since it classifies edges categorically, many edges have similar scores that do not vary with classification accuracy; thus the ranking of clusters (based on their weighted densities) does not correlate as well with their correctness, giving lower precision levels. TOPO achieves the lowest recall of all approaches. While its precision for its highest-scoring clusters is comparable to SWC's at match_thresh = 0.5 (at the extreme left end of the graph), it drops rapidly for the remaining clusters. This is because TOPO classifies co-complex edges accurately for a limited number of complexes which are thus easy to predict, while the remaining complexes' edges are not as accurately classified, creating many false positive clusters and low recall. Finally, although NOWEI achieves slightly higher recall than TOPO, it generates a great number of false positives, giving extremely low precision.
SWC attains the highest recall at both match_thresh values, with higher precision at all recall levels (except that TOPO's top-scoring clusters have slightly higher precision at match_thresh = 0.5). The performance advantage is even more pronounced at match_thresh = 0.75, where SWC recalls 50% more test complexes compared to the other approaches, and maintains almost twice the precision throughout its recall range. BOOST attains the next highest recall, but with substantially lower precision at all recall levels. Just as in yeast, its categorical edge classification reduces the correctness of the ranking of its clusters, giving lower precision levels.
TOPO achieves lower recall, but at match_thresh = 0.5 its precision for its high-scoring clusters is higher than that of BOOST, and even comparable to SWC's for its highest-scoring clusters. Once again, TOPO's high accuracy in classifying edges for a limited number of complexes means it is only able to predict a few complexes well at rough granularity.
Unlike in yeast, here STR performs extremely poorly with the lowest recall and precision levels of all weighting approaches. This is not surprising given that STR performs poorly in edge classification as well. Indeed, even NOWEI achieves higher recall and precision at match_thresh = 0.5, with a similar recall at the higher match threshold.
We evaluate the five weighting approaches (SWC, STRING, TOPO, BOOST, and NOWEI) on the number and quality of high-confidence novel complexes predicted in yeast and human. For the supervised approaches (SWC and BOOST), we use the entire reference set of complexes (CYC2008 for yeast, CORUM for human) for training. Next, the edges of the entire network are weighted, and the top k edges are used to predict complexes with the COMBINED clustering strategy, which combines clusters predicted by the six clustering algorithms. We use k = 20000 for SWC, BOOST, and TOPO, k = 10000 for STRING, and k = all edges for NOWEI.
We filter the set of predicted complexes to obtain a set of unique, novel, high-confidence predictions. First, complexes that are too similar are removed: if any two predicted complexes match with match_thres = 0.5, then the complex with the lower score is removed. Next, only novel predictions are kept: if any predicted complex matches any reference complex with match_thres = 0.5, then that predicted complex is removed. Finally, only high-confidence predictions are kept: for each weighting approach, using the cross-validation results, the score of each predicted complex is benchmarked to a precision value, and predicted complexes whose estimated precision is less than a confidence threshold are removed. For yeast, this confidence threshold is 0.5; for human, since far fewer complexes are predicted with high precision, we use a 0.4 confidence threshold.
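The three filtering stages can be sketched as a small pipeline. The names are hypothetical, the matching rule and the score-to-precision benchmark are passed in as functions because their exact forms are not given here, and resolving similar pairs greedily in descending score order is one assumed reading of "the complex with the lower score is removed".

```python
def filter_predictions(predicted, reference, matches, estimated_precision,
                       min_precision):
    """Keep unique, novel, high-confidence predictions.

    predicted: list of (score, set of proteins).
    reference: list of known complexes (sets of proteins).
    estimated_precision: function mapping a cluster score to a precision
    value (assumed to be derived from cross-validation results).
    """
    ranked = sorted(predicted, key=lambda p: -p[0])
    # 1. Uniqueness: drop a cluster if it matches a higher-scoring kept one
    unique = []
    for score, cluster in ranked:
        if not any(matches(cluster, kept) for _, kept in unique):
            unique.append((score, cluster))
    # 2. Novelty: drop clusters that match any reference complex
    novel = [(s, c) for s, c in unique
             if not any(matches(c, r) for r in reference)]
    # 3. Confidence: drop clusters whose estimated precision is too low
    return [(s, c) for s, c in novel if estimated_precision(s) >= min_precision]

# Toy example with an assumed Jaccard-based matching rule
jaccard_match = lambda x, y: len(x & y) / len(x | y) >= 0.5
predicted = [(0.9, {"a", "b", "c", "d"}), (0.8, {"a", "b", "c", "e"}),
             (0.7, {"x", "y", "z", "w"}), (0.3, {"m", "n", "o", "p"}),
             (0.95, {"r1", "r2", "r3", "r4"})]
reference = [{"r1", "r2", "r3", "r4"}]
kept = filter_predictions(predicted, reference, jaccard_match,
                          estimated_precision=lambda s: s, min_precision=0.5)
```

In the toy run, the near-duplicate of the top novel cluster, the rediscovered reference complex, and the low-scoring cluster are each removed by a different stage.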
High-level biological processes of novel predicted yeast complexes (categories: protein metabolic process, RNA metabolic process, DNA metabolic process, small molecule metabolic process, regulation of metabolic process, regulation of gene expression, response to stress, response to chemical stimulus, cell cycle process)
High-level biological processes of novel predicted human complexes (same categories as for yeast)
The likelihood ratio is a reflection of "co-complexness strength". In general, the likelihood ratios increase as the scores for the data sources (i.e. the x-axes) increase. For the PPI and L2-PPI data sources, protein pairs with higher scores have a greater number of shared neighbors, and are likelier to be co-complex: when the score of PPI is close to 1, indicating that almost all of the protein pair's neighbors are shared, the pair is 40 times likelier to be co-complex in yeast and 35 times likelier to be co-complex in human. L2-PPI scores are imputed in edges whose proteins do not actually interact according to PPI databases, yet share many interaction partners. These scores have correspondingly lower likelihood ratios compared to PPI scores: with a score close to 1, the pair is less than 30 times likelier to be co-complex in yeast and less than 20 times likelier to be co-complex in human.
For the STRING data source, only protein pairs with very high functional association scores are likelier to be co-complex: those with the highest scores are almost 40 times likelier to be co-complex in yeast and 50 times likelier to be co-complex in human, whereas protein pairs with lower functional association scores do not seem any likelier to be co-complex.
For PubMed data, protein pairs that co-occur in literature, even infrequently, are already much likelier to be co-complex: about 20 times likelier in yeast and 10 times likelier in human. However, pairs that co-occur more frequently in literature are not any likelier to be co-complex compared to pairs that co-occur less frequently.
The likelihood ratios for the different data sources show that the co-complexness strength of each data source does not increase linearly with its score. Moreover, between the different data sources, the relationships between data score and co-complexness are different. Thus, combining data scores across different data sources without factoring their dissimilar co-complexness relationships is evidently unsound, while our supervised approach scales the heterogeneous scores to a uniform co-complexness score in terms of likelihoods, which can then be combined probabilistically using the naive-Bayes formulation.
The high likelihood ratios for the data sources also demonstrate that they are indeed indicative of edges belonging to complexes: during cross-validation for both yeast and human, none of the data sources were removed by feature selection in any round.
The likelihood network for the cluster (Figure 11d) visualizes the component evidences for the prediction: the contribution of each data source to an edge's SWC score is reflected in the edge thickness, which is scaled with its likelihood ratio, or co-complexness strength. The likelihood network reveals that diverse data sources connect many proteins within the cluster with high SWC scores. CYT1, RIP1, and QCR2 are fully connected with each other via all three data sources, making them the strongest co-complex triplet that is centrally embedded in the cluster, while CYT1-COR1-QCR2 and CYT1-QCR7-QCR2 are connected with two or more data sources, making them highly co-complex and deeply embedded as well. The other proteins appear less central in the cluster, especially COB, a fringe member which is only connected via functional associations to four proteins.
We select two novel complexes predicted with the COMBINED strategy using the SWC network, with the entire reference set of complexes for training.
Figure 13b shows a high-scoring novel human complex, generated by all six clustering algorithms, made up of four proteins, HCN1, HCN2, HCN3, and HCN4, and annotated with one high-level BP term, transport. These proteins are fully connected by six PPIs with strong co-complexness, while five functional associations with strong to moderate co-complexness and five literature co-occurrences with strong to weak co-complexness also connect the proteins. The strong PPIs, reinforced by the other data sources, provide high credibility to this prediction. Indeed, the UniProt descriptions for these proteins suggest that they may constitute subunits of a potassium channel complex.
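The idea of a cluster being generated by several clustering algorithms can be made concrete by counting, for each distinct cluster, how many algorithms produced a matching one. The sketch below is only one plausible reading of simple voting: the Jaccard similarity measure, the 0.75 matching threshold, the greedy grouping, and all function and algorithm names are assumptions for illustration rather than the COMBINED strategy as implemented by the authors.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of protein names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def count_votes(clusters_by_algorithm, threshold=0.75):
    """Count how many algorithms produced each (approximately) distinct cluster.

    clusters_by_algorithm: dict mapping an algorithm name to its list of clusters.
    threshold: assumed Jaccard cut-off above which two clusters count as the same.
    Returns a list of (sorted member list, number of voting algorithms).
    """
    merged = []  # entries are [representative cluster, set of algorithm names]
    for algorithm, clusters in clusters_by_algorithm.items():
        for cluster in clusters:
            members = frozenset(cluster)
            for entry in merged:
                if jaccard(entry[0], members) >= threshold:
                    entry[1].add(algorithm)  # one vote per algorithm
                    break
            else:
                # No sufficiently similar cluster seen yet: start a new group.
                merged.append([members, {algorithm}])
    return [(sorted(rep), len(voters)) for rep, voters in merged]
```

Under this reading, a cluster such as the HCN1-HCN4 complex would receive six votes if every algorithm produced a sufficiently similar cluster, and the vote count could then be used to rank or filter the combined predictions.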
In this paper, we introduce a maximum-likelihood supervised approach for weighting composite protein networks for predicting protein complexes, called SWC (Supervised Weighting of Composite networks). First, we construct a composite protein network using three heterogeneous data sources: PPI, predicted functional association, and co-occurrence in literature abstracts. Next, we weight each edge of the composite network based on its posterior probability of belonging to a protein complex, using a naive-Bayes maximum-likelihood model learned from a set of training complexes. The weighted composite network is then used by clustering algorithms to predict new complexes. We also propose a simple aggregative clustering strategy that combines clusters generated by multiple clustering algorithms, using simple voting. We evaluate our weighting scheme using six clustering algorithms, as well as our aggregative clustering strategy, on the prediction of yeast and human complexes. We demonstrate that our proposed method outperforms a supervised data-integration approach using boosting, a predicted functional-association network from STRING, an unsupervised approach using a topological function to weight PPI networks, as well as a baseline approach using unweighted PPI networks: our approach predicts more correct complexes at higher precision levels, and generates more high-confidence novel complexes with similar or better semantic coherence. Using a few example complexes, we show that our approach increases the density of the complexes' subgraphs, and filters them to remove extraneous edges. Furthermore, our approach allows visualization of the evidence of predicted complexes, using learned likelihood parameters to express strengths of co-complex relationships of each data type. This aids human evaluation of the credibility of predicted complexes.
Finally, we present two novel predicted complexes: a four-protein yeast complex possibly involved in DNA metabolism and stress response, and a four-protein human complex possibly involved in transport processes. We show that these predictions appear credible given their evidence, being supported by diverse data sources with strong co-complexness. Indeed, a recent paper presents the predicted yeast complex as the Cul8-RING ubiquitin ligase complex, while the UniProt database provides evidence that the predicted human complex may exist as a potassium channel complex.
SWC software package and data files are available at http://compbio.ddns.comp.nus.edu.sg/~cherny/SWC/.
This work was supported in part by Singapore National Research Foundation grant NRF-G-CRP-2007-04-082(d) and a National University of Singapore NGS scholarship.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 2, 2012: Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.