Protein complex prediction based on k-connected subgraphs in protein interaction network
- Mahnaz Habibi^{1},
- Changiz Eslahchi^{1}Email author and
- Limsoon Wong^{2}Email author
DOI: 10.1186/1752-0509-4-129
© Habibi et al; licensee BioMed Central Ltd. 2010
Received: 6 February 2010
Accepted: 16 September 2010
Published: 16 September 2010
Abstract
Background
Protein complexes play an important role in cellular mechanisms. Recently, several methods have been presented to predict protein complexes in a protein interaction network. In these methods, a protein complex is predicted as a dense subgraph of protein interactions. However, interactions data are incomplete and a protein complex does not have to be a complete or dense subgraph.
Results
We propose a more appropriate protein complex prediction method, CFA, that is based on connectivity number on subgraphs. We evaluate CFA using several protein interaction networks on reference protein complexes in two benchmark data sets (MIPS and Aloy), containing 1142 and 61 known complexes respectively. We compare CFA to some existing protein complex prediction methods (CMC, MCL, PCP and RNSC) in terms of recall and precision. We show that CFA predicts more complexes correctly at a competitive level of precision.
Conclusions
Many real complexes with different connectivity level in protein interaction network can be predicted based on connectivity number. Our CFA program and results are freely available from http://www.bioinf.cs.ipm.ir/softwares/cfa/CFA.rar.
Background
Several groups have produced a large amount of data on protein interactions [1–9]. It is desirable to use this wealth of data to predict protein complexes. Several methods have been applied to protein inter-actome graphs to detect highly connected subgraphs and predict them as protein complexes [10–25]. The main criterion used for protein complex prediction is cliques or dense subgraphs. Spirin and Mirny proposed the clique-finding and super-paramagnetic clustering with Monte Carlo optimization approach to find clusters of proteins [10]. Another method is Molecular Complex Detection (MCODE) [11], which starts with vertex weighting and finds dense regions according to given parameters. On the other hand, the Markov CLuster algorithm (MCL) [26, 27] simulates a flow on the network by using properties of the adjacency matrix. MCL partitions the graph by discriminating strong and weak flows in the graph. The next algorithm is RNSC (Restricted Neighborhood Search Clustering) [13]. It is a cost-based local search algorithm that explores the solution space to minimize a cost function, which is calculated based on the numbers of intra-cluster and inter-cluster edges.
However, many biological data sources contain noise and do not contain complete information due to limitations of experiments. Recently, some computational methods have estimated the reliability of individual interaction based on the topology of the protein interaction network (PPI network) [23, 28, 29]. The Protein Complex Prediction method (PCP) [30] uses indirect interactions and topological weight to augment protein-protein interactions, as well as to remove interactions with weights below a threshold. PCP employs clique finding on the modified PPI network, retaining the benefits of clique-based approaches. Liu et al. [31] proposed an iterative score method to assess the reliability of protein interactions and to predict new interactions. They then developed the Clustering based on Maximal Clique algorithm (CMC) that uses maximal cliques to discover complexes from weighted PPI networks.
Following these past works, we model the PPI network with a graph, where vertices represent proteins and edges represent interactions between proteins. We present a new algorithm CFA--short for k-Connected Finding Algorithm--to find protein complexes from this graph. Our algorithm is based on finding maximal k-connected subgraphs. The union of all maximal k-connected subgraphs (k ≥ 1) forms the set of candidate protein clusters. These candidate clusters are then filtered to remove (i) clusters having less than four proteins and (ii) clusters having a large diameter. We compare the results of our algorithm with the results of MCL, RNSC, PCP and CMC. Our algorithm produces results that are comparable or better than these existing algorithms on real complexes of [32, 33].
Preliminaries
Interaction Graphs
A PPI network is considered as an undirected graph G = 〈V, E〉, where each vertex v ∈ V represents a protein in the network and each edge uv ∈ E represents an observed interaction between proteins u and v. Two vertices u and v of G are adjacent or neighbors if and only if uv is an edge of G. The degree d(v) of a vertex v is defined as the number of neighbors that the protein v has.
If all the vertices of G are pairwise adjacent, then G is a complete graph and D_{ G } = 1. A complete graph on n vertices is denoted by K_{ n } . The cluster score of G is defined as D_{ G } × |V|.
K-Connectivity
A path in a non-empty graph G = 〈V, E〉 between two vertices u and v is a sequence of distinct vertices u = v_{ 0 } , v_{ 1 } , ..., v_{ k } = v such that v_{ i }v_{i+1}∈ E, 0 ≤ i < k - 1. G is called connected if every two vertices of G are linked by a path in G. G is called k-connected (for k ∈ ℵ) if |V| > k and the graph G = 〈V - X, E - (X × X)〉 is connected for every set X ⊆ V with |X| < k. The distance d(u, v) is the shortest path in G between two vertices u and v. The greatest distance between any two vertices in G is the diameter of G denoted by diamG. A non-empty 1-connected subgraph with the minimum number of edges is called a tree. It is well known that a connected graph is a tree if and only if the number of edges of the graph is one less than the number of its vertices. It is a classic result of graph theory-- the global version of Menger's theorem [34]--that a graph is k-connected if any two of its vertices can be joined by k independent paths (two paths are independent if they only intersect in their ends).
Results and Discussion
Data Sets
Protein-Protein Interaction Network Data
In this work, we use two high-throughput protein-protein interaction (PPI) data collections. The first data collection, GRID, contains six protein interaction networks from the Saccharomyces cerevisiae (bakers' yeast) genome. These include two-hybrid interactions from Uetz et al. [2] and Ito et al. [3], as well as interactions characterized by mass spectrometry technique from Ho, Gavin, Krogan and their colleagues [6–9]. We refer to these data sets as PPI_{ Uetz } , PPI_{ Ito } , PPI_{ Ho } , PPI_{ Gavin2 } , PPI_{ Gavin6 } , and PPI_{ Krogan } .
Summary statistics of each data set.
Data set | Proteins | Interactions | Min. Deg | Avg.Deg | Max. Deg |
---|---|---|---|---|---|
PPI _{ BioGRID } | 5040 | 27557 | 0 | 10.93 | 318 |
PPI _{Gavin 6} | 1563 | 6531 | 0 | 8.36 | 81 |
PPI _{Gavin 2} | 1373 | 3200 | 0 | 4.66 | 52 |
PPI _{ Krogan } | 2672 | 7073 | 0 | 5.29 | 140 |
PPI _{ Ho } | 1563 | 3596 | 1 | 4.60 | 62 |
PPI _{ Ito } | 775 | 732 | 0 | 1.8 | 54 |
PPI _{ Uetz } | 823 | 823 | 0 | 1.7 | 21 |
Protein Complex Data
Summary statistics of each protein complex data sets for each PPI network.
PPI | MPC | APC | ||||
---|---|---|---|---|---|---|
No. of Complex | Avg. Size | Max size | No. of Complex | Avg. Size | Max size | |
Biogrid | 651 | 11.94 | 88 | 62 | 9.29 | 34 |
Gavin 6 | 443 | 11.31 | 80 | 53 | 8.84 | 27 |
Gavin 2 | 439 | 11.35 | 88 | 54 | 8.72 | 26 |
Krogan | 531 | 10.89 | 75 | 56 | 8.94 | 31 |
Ho | 543 | 10.55 | 70 | 30 | 6.60 | 18 |
Ito | 119 | 5.85 | 20 | 15 | 4.86 | 8 |
Uetz | 355 | 9.15 | 56 | 12 | 6.41 | 14 |
Cellular Component Annotation
The level of noise in protein interaction data--especially those obtained by two-hybrid experiments--has been estimated to be as high as 50% [36–38]. Liu et al. [31] have shown that using a de-noised protein interaction network as input leads to better quality of protein complex predictions by existing methods. A protein complex can only be formed if its proteins are localized within the same component of the cell. So we use localization coherence of proteins to clean up the input protein interaction network. We use cellular component terms from Gene Ontology (GO) [39] to evaluate localization coherence. We find that among the 5040 yeast proteins, only 4345 or 86% of them are annotated. To avoid arriving at misleading conclusions caused by biases in the annotations, we use the concept of informative cellular component. We define a cellular component annotation as informative if it has at least k proteins annotated with it and each of its descendent GO terms has less than k proteins annotated with it. In this work, we set k as 10. This yields 150 informative cellular component GO terms on the BioGRID data set.
Performance Evaluation Measures
There are many studies that predict protein complexes. To evaluate the performance of various protein complex prediction methods, we compare the predicted protein complexes with real protein complex data sets, APC and MPC.
If Overlap(S,C) meets or exceeds a threshold θ, then we say S and C match. Following Liu et al. [31], we use an overlap threshold of 0.5 to determine a match.
The precision and recall are two numbers between 0 and 1. They are the commonly used measures to evaluate the performance of protein complex prediction methods [30, 31]. In particular, precision corresponds to the fraction of predicted clusters that matches real protein complexes; and recall corresponds to the fraction of real protein complexes that are matched by predicted clusters.
Observations
To justify using the connectivity definition and cellular component annotation, we analyze the connectivity number and localization coherence of reference complexes of MPC on PPI networks obtained by [6–9] as well as [35].
Co-Localization Score of Known Complexes
The locscore for MPC and APC are 0.74 and 0.86 respectively. The relatively large values of these numbers suggest that cleaning the input PPI network by cellular component information should help us improve precision and recall of existing algorithms.
Impact of Localization Information
In this work, the cleaning of PPI networks using informative cellular component GO terms is an important preprocessing step. So we analyze here the impact of using informative GO cellular component annotation on the performance of four existing algorithms--CMC, MCL, PCP, and RNSC-- on their standard parameters. (The CMC package comes with its own PPI-cleaning method. However, in order to observe the effect of cleaning based on cellular component GO terms on CMC, this method is not used in this work.)
Let G_{ i } = G[L_{ i } ] be the induced subgraph of G generated by the vertex set L_{ i } , where {L_{1}, L_{2}, ..., L_{ k } } is the set of localization groups. Thus each L_{ i } contains a set of proteins localized to the same cellular component--i.e., they are annotated by the same informative GO term. Let C_{ i } be the set of all clusters predicted by an algorithm on ${G}_{i}{C}_{L}={\displaystyle {\cup}_{i=1}^{k}{C}_{i}}$ denotes the set of all clusters predicted by the algorithm on G.
Features of clusters predicted by different algorithms on the both the original and C_{ L }networks.
CMC | MCL | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Avg | Avg | Avg | Avg | ||||||||
PPI | Setting | Cluster | Size | Prec | Recall | Den. | Cluster | Size | Prec | Recall | Den. |
BioGRID | (1) | 295 | 9.58 | 0.210 | 0.124 | 0.78 | 376 | 10.41 | 0.098 | 0.072 | 0.40 |
(2) | 296 | 9.36 | 0.361 | 0.155 | 0.75 | 647 | 9.43 | 0.2411 | 0.173 | 0.49 | |
Gavin 6 | (1) | 155 | 10.51 | 0.367 | 0.203 | 0.66 | 160 | 11.72 | 0.471 | 0.194 | 0.50 |
(2) | 299 | 9.55 | 0.401 | 0.239 | 0.62 | 327 | 9.40 | 0.486 | 0.261 | 0.57 | |
Gavin 2 | (1) | 110 | 11.5 | 0.390 | 0.118 | 0.44 | 115 | 9.81 | 0.652 | 0.252 | 0.39 |
(2) | 213 | 9.69 | 0.417 | 0.159 | 0.49 | 373 | 9.63 | 0.479 | 0.266 | 0.40 | |
Krogan | (1) | 215 | 8.93 | 0.251 | 0.124 | 0.60 | 246 | 8.07 | 0.146 | 0.094 | 0.43 |
(2) | 166 | 8.15 | 0.494 | 0.163 | 0.64 | 247 | 7.77 | 0.477 | 0.184 | 0.54 | |
Ho | (1) | 121 | 8.17 | 0.206 | 0.057 | 0.32 | 146 | 8.19 | 0.486 | 0.145 | 0.34 |
(2) | 149 | 7.89 | 0.335 | 0.103 | 0.45 | 96 | 7.28 | 0.500 | 0.110 | 0.36 | |
PCP | RNSC | ||||||||||
Avg | Avg | Avg | Avg | ||||||||
PPI | Setting | Cluster | Size | Prec | Recall | Den. | Cluster | Size | Prec | Recall | Den. |
BioGRID | (1) | 174 | 8.73 | 0.253 | 0.109 | 0.51 | 174 | 6.31 | 0.367 | 0.119 | 0.78 |
(2) | 341 | 9.27 | 0.343 | 0.158 | 0.60 | 301 | 7.36 | 0.425 | 0.156 | 0.81 | |
Gavin 6 | (1) | 95 | 9.61 | 0.463 | 0.185 | 0.63 | 105 | 6.41 | 0.381 | 0.126 | 0.74 |
(2) | 228 | 9.14 | 0.482 | 0.243 | 0.62 | 295 | 7.33 | 0.410 | 0.234 | 0.72 | |
Gavin 2 | (1) | 54 | 9.40 | 0.537 | 0.125 | 0.50 | 89 | 5.98 | 0.370 | 0.074 | 0.61 |
(2) | 121 | 9.17 | 0.446 | 0.141 | 0.45 | 158 | 6.81 | 0.487 | 0.132 | 0.59 | |
Krogan | (1) | 100 | 7.90 | 0.380 | 0.109 | 0.61 | 92 | 6.25 | 0.423 | 0.081 | 0.72 |
(2) | 205 | 7.66 | 0.458 | 0.158 | 0.68 | 200 | 6.77 | 0.510 | 0.165 | 0.69 | |
Ho | (1) | 42 | 5.59 | 0.285 | 0.040 | 0.29 | 15 | 6.85 | 0.333 | 0.046 | 0.41 |
(2) | 51 | 5.00 | 0.372 | 0.060 | 0.37 | 26 | 6.84 | 0.370 | 0.073 | 0.38 |
The precision and recall values obtained at the matching threshold θ = 0.5 are given in Table 3. RNSC performs best on PPI_{ Biogrid } , while MCL performs best on PPI_{Gavin 6}, PPI_{Gavin 2}, and PPI_{ Ho } . In the orginal network of PPI_{ krogan } , PCP shows better precision against recall compared to other methods, while after cleaning by using localization information almost all methods have similar performance. This table shows that none of these algorithms has the best precision vs recall in all networks.
Density of Known Complexes
Furthermore, almost all complexes which are complete or have high density are of the form K_{3}, while there are a large number of cliques of size 3 which are not complex. For example, in PPI_{ BioGRID } , there exist 176 known complexes of size three, while the number of cliques of size 3 in PPI_{ BioGRID } is 37230. It means that only about 0.47% of them are known real complexes. So, those clusters and complexes with size atmost three are removed in our work, to avoid an excessive number of false positive predictions.
We have also studied the number of known complexes of size four in PPI_{ BioGRID } . We find that there exist 138 real complexes of size four, while only 54 of them have high density.
The discussions above suggest that the density criterion alone cannot answer the question of finding complexes. We need to introduce another criterion to overcome this problem.
Connectivity of Known Complexes
We show in this section that connectivity is a reasonable alternative criterion for identifying protein complexes. Although this criterion is simple, it may directly describe the general understanding of the protein complex concept. This criterion is better than density because, while there are a lot of known complexes that are not complete or dense, there are many k-connected subgraphs with low density. For example, Figure 1(A) shows two real complexes of MPC with low density (0.34). Both of them have a large 2-connected subgraph.
where ${s}_{1}^{k}(c),{s}_{2}^{k}(c),\mathrm{...},{s}_{n}^{k}(c)$ are maximal k- connected subgraphs of complex c.
The kscore and average density of different PPI networks on MPC.
Data Set | PPI _{ BioGRID } | PPI _{Gavin 6} | PPI _{Gavin 2} | PPI _{ Krogan } | PPI _{ Ho } |
---|---|---|---|---|---|
Avg Density | 0.41 | 0.29 | 0.21 | 0.20 | 0.25 |
1score | 0.995 | 0.929 | 0.970 | 0.870 | 0.983 |
2score | 0.784 | 0.868 | 0.758 | 0.678 | 0.748 |
3score | 0.537 | 0.521 | 0.494 | 0.351 | 0.446 |
4score | 0.374 | 0.318 | 0.397 | 0.254 | 0.232 |
This suggests that using connectivity number as a criterion of protein complex prediction may be a good approach. Therefore, our algorithm is based on finding maximal k-connected subgraphs in PPI networks by keep increasing k until k cannot be increased any more. In other words, the algorithm continues until some integer k_{0} such that there is no k-connected subgraph with k > k_{0}.
Evaluation
Testing for Accuracy
To check the validity of CFA, we compare clusters predicted by CFA with the clusters obtained by CMC, MCL, PCP and RNSC, on the seven protein interaction networks of GRID and BioGRID. The networks are first segregated by informative cellular component GO terms before these algorithms are run. MPC and APC are used as benchmark real protein complexes.
Precision and recall values of different algorithms on each PPI network.
APC | MPC | |||||||
---|---|---|---|---|---|---|---|---|
Method | Data Set | No. of Cluster | Match Complex | Recall/Prec | Match Cluster | Match Complex | Recall/Prec | Match Cluster |
CFA | (1) | 423 | 52 | 0.838 0.310 | 131 | 119 | 0.182 0.435 | 184 |
CMC | (1) | 296 | 51 | 0.822 0.293 | 87 | 101 | 0.155 0.361 | 107 |
MCL | (1) | 647 | 51 | 0.822 0.179 | 116 | 113 | 0.173 0.241 | 156 |
PCP | (1) | 341 | 50 | 0.806 0.290 | 99 | 103 | 0.158 0.343 | 117 |
RNSC | (1) | 301 | 52 | 0.838 0.345 | 104 | 102 | 0.156 0.425 | 128 |
CFA | (2) | 324 | 51 | 0.962 0.456 | 148 | 122 | 0.275 0.543 | 176 |
CMC | (2) | 299 | 50 | 0.943 0.347 | 104 | 106 | 0.239 0.401 | 120 |
MCL | (2) | 327 | 50 | 0.943 0.422 | 138 | 116 | 0.261 0.486 | 159 |
PCP | (2) | 228 | 48 | 0.905 0.403 | 92 | 108 | 0.243 0.482 | 110 |
RNSC | (2) | 295 | 50 | 0.943 0.362 | 107 | 104 | 0.234 0.410 | 121 |
CFA | (3) | 235 | 49 | 0.907 0.497 | 117 | 119 | 0.271 0.595 | 140 |
CMC | (3) | 213 | 31 | 0.574 0.347 | 74 | 70 | 0.159 0.417 | 89 |
MCL | (3) | 373 | 47 | 0.870 0.332 | 124 | 117 | 0.266 0.479 | 179 |
PCP | (3) | 121 | 28 | 0.518 0.388 | 47 | 62 | 0.141 0.446 | 54 |
RNSC | (3) | 158 | 25 | 0.463 0.392 | 62 | 58 | 0.132 0.487 | 77 |
CFA | (4) | 330 | 45 | 0.803 0.451 | 149 | 104 | 0.195 0.533 | 176 |
CMC | (4) | 166 | 40 | 0.714 0.379 | 63 | 87 | 0.163 0.494 | 82 |
MCL | (4) | 247 | 45 | 0.803 0.368 | 91 | 98 | 0.184 0.477 | 118 |
PCP | (4) | 205 | 40 | 0.714 0.400 | 82 | 84 | 0.158 0.458 | 94 |
RNSC | (4) | 200 | 37 | 0.660 0.430 | 86 | 88 | 0.165 0.510 | 102 |
CFA | (5) | 120 | 13 | 0.433 0.166 | 20 | 62 | 0.114 0.416 | 50 |
CMC | (5) | 149 | 6 | 0.200 0.060 | 9 | 56 | 0.103 0.335 | 50 |
MCL | (5) | 96 | 12 | 0.400 0.250 | 24 | 60 | 0.110 0.500 | 48 |
PCP | (5) | 51 | 4 | 0.133 0.098 | 5 | 33 | 0.060 0.372 | 19 |
RNSC | (5) | 26 | 5 | 0.166 0.500 | 13 | 21 | 0.038 0.730 | 19 |
CFA | (6) | 45 | 3 | 0.200 0.088 | 4 | 15 | 0.126 0.226 | 12 |
CMC | (6) | 9 | 0 | 0.000 0.000 | 0 | 1 | 0.008 0.111 | 1 |
MCL | (6) | 65 | 3 | 0.200 0.076 | 5 | 15 | 0.126 0.230 | 15 |
PCP | (6) | 8 | 0 | 0.000 0.000 | 0 | 1 | 0.008 0.125 | 1 |
RNSC | (6) | 11 | 0 | 0.000 0.000 | 0 | 1 | 0.008 0.545 | 6 |
Table 5 shows that CFA performs better on PPI_{ Krogan } , PPI_{ Ito } , PPI_{Gavin 2}and PPI_{Gavin 6}compared to other methods. In fact, both precision and recall values of CFA are greater than all of the other algorithms in these networks. In PPI_{ Ho } , RNSC has the greatest precision. However, RNSC predicts merely 26 clusters and, among these predictions, 13 clusters are matched to 5 real complexes in APC and 19 clusters are matched to 21 real complexes in MPC. Thus the recall value of RNSC is very low (0.166 on APC and 0.038 on MPC). In contrast, CFA correctly predicts 13 real complexes of APC and 62 of MPC. The clusters of CFA give the precision value 0.416 (0.166) and the recall value 0.114 (0.433) on MPC (APC), which are generally better than that obtained by RNSC and other methods on PPI_{ Ho } .
Detailed breakdown of predicted clusters by different algorithms with respect to APC and MPC reference protein complexes.
Method | |A| | |B| | |A∪ B| | |A- B| | |B- A| | No. of Cluster | Precision |
---|---|---|---|---|---|---|---|
CFA | 184 | 131 | 208 | 77 | 24 | 423 | 0.492 |
CMC | 107 | 87 | 125 | 38 | 18 | 296 | 0.422 |
MCL | 156 | 116 | 177 | 61 | 21 | 647 | 0.274 |
PCP | 117 | 99 | 140 | 41 | 23 | 341 | 0.411 |
RNSC | 128 | 104 | 151 | 47 | 23 | 301 | 0.502 |
Some interactions in PPI_{ Biogrid } are derived from two-hybrid technique. Due to the level of noise in two-hybrid experiments, we expect those predicted clusters having the form of a tree structure to have lower reliability compared to other 1-connected subgraphs. Hence, in order to improve the results of CFA, we only use 1-connected subgraphs that are not trees. A tree with n vertices has n - 1 edges; so a connected cluster is a tree if and only if its cluster score is 2. Thus, we consider 1-connected subgraphs with cluster scores greater than 2. Similarly, we can do additional filtering for each k-connected subgraphs by considering the clusters with cluster score greater that k+1. The precision and recall values of the resulting further refined clusters are 0.465 and 0.178 in MPC and 0.347 and 0.838 in APC. So the precision vs recall of CFA, using cluster score filtering, shows significant improvement compared to other methods in PPI_{ Biogrid } on APC too.
Precision and recall values after removing highly overlapping clusters.
APC | MPC | |||||||
---|---|---|---|---|---|---|---|---|
Method | Data Set | No. of Cluster | Recall/Prec/F-measure | Recall/Prec/F-measure | ||||
CFA | (1) | 238 | 0.822 | 0.277 | 0.415 | 0.170 | 0.378 | 0.235 |
CMC | (1) | 208 | 0.741 | 0.235 | 0.358 | 0.145 | 0.322 | 0.200 |
MCL | (1) | 467 | 0.790 | 0.113 | 0.199 | 0.147 | 0.164 | 0.155 |
PCP | (1) | 230 | 0.758 | 0.226 | 0.348 | 0.133 | 0.282 | 0.181 |
RNSC | (1) | 186 | 0.809 | 0.274 | 0.409 | 0.150 | 0.365 | 0.213 |
CFA | (2) | 164 | 0.924 | 0.390 | 0.549 | 0.250 | 0.530 | 0.340 |
CMC | (2) | 197 | 0.924 | 0.274 | 0.423 | 0.214 | 0.355 | 0.267 |
MCL | (2) | 191 | 0.905 | 0.272 | 0.419 | 0.221 | 0.356 | 0.272 |
PCP | (2) | 144 | 0.811 | 0.319 | 0.458 | 0.214 | 0.416 | 0.283 |
RNSC | (2) | 152 | 0.924 | 0.348 | 0.506 | 0.205 | 0.263 | 0.230 |
CFA | (3) | 124 | 0.907 | 0.475 | 0.624 | 0.250 | 0.564 | 0.347 |
CMC | (3) | 122 | 0.500 | 0.237 | 0.322 | 0.123 | 0.295 | 0.173 |
MCL | (3) | 215 | 0.851 | 0.237 | 0.371 | 0.248 | 0.395 | 0.305 |
PCP | (3) | 82 | 0.481 | 0.487 | 0.485 | 0.116 | 0.365 | 0.176 |
RNSC | (3) | 90 | 0.425 | 0.255 | 0.320 | 0.120 | 0.377 | 0.183 |
CFA | (4) | 169 | 0.767 | 0.337 | 0.469 | 0.180 | 0.455 | 0.258 |
CMC | (4) | 120 | 0.714 | 0.341 | 0.462 | 0.158 | 0.450 | 0.234 |
MCL | (4) | 150 | 0.767 | 0.293 | 0.425 | 0.169 | 0.386 | 0.235 |
PCP | (4) | 130 | 0.678 | 0.300 | 0.416 | 0.133 | 0.369 | 0.196 |
RNSC | (4) | 108 | 0.660 | 0.370 | 0.475 | 0.160 | 0.500 | 0.242 |
CFA | (5) | 96 | 0.400 | 0.156 | 0.225 | 0.105 | 0.385 | 0.165 |
CMC | (5) | 109 | 0.166 | 0.045 | 0.072 | 0.073 | 0.247 | 0.113 |
MCL | (5) | 71 | 0.400 | 0.169 | 0.238 | 0.105 | 0.408 | 0.167 |
PCP | (5) | 43 | 0.133 | 0.093 | 0.110 | 0.055 | 0.325 | 0.094 |
RNSC | (5) | 16 | 0.166 | 0.312 | 0.217 | 0.023 | 0.562 | 0.045 |
CFA | (6) | 41 | 0.134 | 0.049 | 0.071 | 0.126 | 0.195 | 0.153 |
CMC | (6) | 8 | 0.000 | 0.000 | 0.000 | 0.008 | 0.125 | 0.015 |
MCL | (6) | 52 | 0.134 | 0.076 | 0.097 | 0.117 | 0.192 | 0.145 |
PCP | (6) | 8 | 0.000 | 0.000 | 0.000 | 0.008 | 0.125 | 0.015 |
RNSC | (6) | 5 | 0.000 | 0.000 | 0.000 | 0.008 | 0.400 | 0.016 |
Examples of Predicted Clusters
In this section, we present five matched and unmatched clusters predicted by CFA.
In Figure 1(A), two MIPS complexes, marked as 1 and 2, are depicted according to the protein interactions of PPI_{Gavin 2}. Complex 1 is an eleven- member complex (MIPS ID. 550.1.213; Probably transcription DNA Maintanace Chromatin Structure) that contains a protein, Y NL 113W, whose interactions with other proteins are missing from PPI_{Gavin 2}. Complex 2 contains 12 proteins (MIPS ID. 510.40.10; RNA polymerase II ) and there exists a protein, Y LR 418C, in this complex whose interactions with other proteins are missing in PPI_{Gavin 2}. There are four common proteins in these two complexes. Without considering localization annotations, CFA predicts all vertices of this graph (except for Y LR 418C and Y NL 113W) as a 2-connected subgraph. After segregating the network using GO terms, CFA predicts two clusters (Figure 1(B)) which are matched to the real complexes in Figure 1(A).
To gain further insights into the differences among CFA's clusters and clusters predicted by other algorithms, we consider the first CFA cluster presented in Figure 5. This cluster is matched perfectly to a 30-member complex on MPC. In contrast, CMC's clusters only overlap with at most 16 members of this complex. The corresponding cluster predicted by PCP is a twenty five-member cluster, and the other members of the real complex do not belong to the PCP cluster. Similarly, merely fifteen members of the corresponding RNSC cluster overlap with the same complex. Among these methods only MCL predicts a cluster which is matched to the same complex perfectly.
The third cluster shown in Figure 5 is an unmatched cluster which is obtained by CFA, CMC, PCP and RNSC algorithms. None of the proteins of this cluster belongs to any real complex in MPC and APC. However, MCL predicts a cluster containing all members of the above mentioned cluster with an extra protein with a different GO term annotation.
Conclusions
In the first part of this work, we study the impact of using informative cellular component GO term annotations on the performance of several different protein complex prediction algorithms. We have shown (Table 3) that existing algorithms predict protein complexes with significantly higher precision and recall when the input PPI network is cleansed using informative cellular component GO term annotations. Therefore, we propose for protein complex prediction algorithms a preprocessing step where the input PPI network is segregated by informative cellular component GO terms.
In the second part of this work, we study the density of protein interactions within protein complexes. We have shown (Figure 3) that there are many real complexes with different density. So density is not a good criterion for prediction of protein complexes. Therefore, we look at the connectivity number of complexes as a possible alternative criterion. We observe (Table 4) that 87%-99% of real protein complexes are 1-connected, 68%-87% are 2-connected, 35%-54% are 3-connected, and 23%-37% are 4-connected.
Precision and recall values of maximal k-connected (k ≥ 1) subgraphs, C 1, C 2, ..., C 9, and their union U.
PPI _{ BioGRID } | PPI _{Gavin 6} | PPI _{Gavin 2} | PPI _{ Krogan } | PPI _{ Ho } | ||||||
---|---|---|---|---|---|---|---|---|---|---|
Data | Prec/Recall | Prec/Recall | Prec/Recall | Prec/Recall | Prec/Recall | |||||
C 1 | 0.356 | 0.163 | 0.486 | 0.248 | 0.685 | 0.252 | 0.537 | 0.184 | 0.423 | 0.112 |
C 2 | 0.380 | 0.149 | 0.497 | 0.241 | 0.535 | 0.161 | 0.462 | 0.184 | 0.461 | 0.058 |
C 3 | 0.516 | 0.150 | 0.597 | 0.187 | 0.523 | 0.102 | 0.549 | 0.150 | 0.555 | 0.023 |
C 4 | 0.631 | 0.112 | 0.666 | 0.37 | 0.520 | 0.045 | 0.709 | 0.090 | 0.000 | 0.000 |
C 5 | 0.615 | 0.094 | 0.666 | 0.070 | 0.538 | 0.022 | 0.720 | 0.065 | -- | -- |
C 6 | 0.614 | 0.059 | 0.562 | 0.049 | 0.600 | 0.013 | 0.645 | 0.045 | -- | -- |
C 7 | 0.561 | 0.043 | 0.800 | 0.024 | 0.500 | 0.002 | 0.608 | 0.037 | -- | -- |
C 8 | 0.680 | 0.0353 | 0.714 | 0.018 | 1.000 | 0.002 | 0.666 | 0.033 | -- | -- |
C 9 | 0.880 | 0.0276 | 0.000 | 0.000 | -- | -- | -- | -- | -- | -- |
U | 0.435 | 0.182 | 0.543 | 0.275 | 0.595 | 0.271 | 0.533 | 0.195 | 0.416 | 0.114 |
Methods
In the Observations section we explained that cellular component annotations can help us to improve predictions. On the other hand, by studying the connectivity number of real complexes as subgraphs of PPI network, we showed that the connectivity number could be a reasonable criterion to predict complexes. So we present a new algorithm based on finding k-connected subgraphs (1 ≤ k) on PPI networks segregated by informative cellular component GO terms.
Algorithm
A new algorithm named CFA (k-Connected Finding Algorithm) is presented here to predict complexes from an input (cleansed) PPI network. The CFA algorithm comprises two main steps. In the first step, maximal k-connected subgraphs for various k are generated as candidate complexes. In the second step, a number of filtering rules are applied to eliminate unlikely candidates.
The heart of the first step of CFA contains two simple procedures. The first procedure is REFINE, which removes all vertices of degree less than k from the input graph. This is an obvious optimization since, by the global version of Menger's theorem [34], such vertices cannot be part of any k-connected subgraphs. The second procedure is COMPONENT, which takes the refined graph and fragments it into k-connected subgraphs. This procedure finds a set of h < k vertices that disconnects the input graph, producing several connected components of the graph. The procedure is then recursively called on each of these connected components. The procedure terminates on a connected component (and returns it as a maximal k-connected subgraph) if it cannot be made disconnected by removing h < k vertices. The correctness of this procedure follows straightforwardly from the global version of Menger's theorem.
Pseudo codes of CFA
Step1:// Find maximal k-connected subgraphs |
---|
Procedure REFINE |
Input: Graph G = (V, E) and a parameter k. |
Output: All vertices in G of degree less than k are removed. |
The reduced graph is returned. |
Procedure COMPONENT |
Input: Connected graph H = (V, E) and a parameter k. |
Output: Fragment the graph H into k-connected subgraphs. |
If H does not have more than k vertices, |
Then stop. |
Find some u_{1,...,} u_{ h } (h < k) in H such that H - {u 1,...,u_{ h }} is not a connected subgraph. |
If such a set u_{1},..., u_{ h } is found, |
Then for all connected component c in H - {u_{1},...,u_{ h }}, |
call COMPONENT(c,k) |
Else return H as a result. |
Procedure k-CONNECTED |
Input: Graph G = (V,E) |
Output: COMPONENT(REFINE(G,k),k). |
Step2:// Filtering |
Procedure CFA |
Input: Graph G = (V, E) |
Output: Maximal k-connected subgraphs in G of size at least 4. |
Set k to 1 |
While Ck is not empty |
Set Ck to the result of k-CONNECTED(G). |
Increment k. |
Set G 1 to 1-connected subgraphs from C 1 with the diameter < 4. |
Set Gk to k-connected subgraphs from Ck with the diameter < k (for k ≥ 2) |
Set U to the union of Gk's (k ≥ 1) |
Remove all subgraphs of size less than 4 in the set U. |
Implementation
Optimal parameters for CMC, MCL, PCP and RNSC algorithms.
Algorithm | Parameter | Optimal value |
---|---|---|
MCL | Inflation | 1.8 |
CMC | Min-deg-ratio | 1 |
Overlap-threshold | 0.5 | |
Merge-threshold | 0.25 | |
Min-size | 4 | |
PCP | FSWeight-threshold | 0.4 |
Min clique size | 4 | |
Overlap-threshold | 0.5 | |
RNSC | Diversification frequency | 50 |
Tabu length | 50 | |
Number of experiments | 3 | |
Scaled stopping tolerance | 15 | |
Shuffling diversification length | 9 |
Declarations
Acknowledgements
We thank Mehdi Sadeghi and Hamid Pezeshk for valuable comments and suggestions. We thank Sylvian Brohee for providing us the implementations of MCL and RNSC. We thank Hon Nian Chua for generously allocating us time and resources on the physical interactions of PPI_{ BioGRID } . This research was supported in part by Shahid Beheshti University (Eslahchi); a grant from Iran's Institute for Research in Fundamental Sciences (Eslahchi, Habibi); and a Singapore's National Research Foundation grant NRF-G-CRP-2997-04-082(d) (Wong).
Authors’ Affiliations
References
- Fields S: Proteomics. Proteomics in genomeland. Science. 2001, 291: 1221-1224. 10.1126/science.291.5507.1221View ArticlePubMedGoogle Scholar
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000, 403: 623-627. 10.1038/35001009View ArticlePubMedGoogle Scholar
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001, 98: 4569-4574. 10.1073/pnas.061034498PubMed CentralView ArticlePubMedGoogle Scholar
- Drees BL, Sundin B, Brazeau E, Caviston JP, Chen GC, Guo W: A protein interaction map for cell polarity development. J Cell Biol. 2001, 154: 549-571. 10.1083/jcb.200104057PubMed CentralView ArticlePubMedGoogle Scholar
- Fromont-Racine M, Mayes AE, Brunet-Simon A, Rain JC, Colley A, Dix I: Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast. 2000, 17: 95-110. 10.1002/1097-0061(20000630)17:2<95::AID-YEA16>3.0.CO;2-HPubMed CentralView ArticlePubMedGoogle Scholar
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams S-L, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Srensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002, 415: 180-183. 10.1038/415180aView ArticlePubMedGoogle Scholar
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hoefert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415 (6868): 141-147. 10.1038/415141aView ArticlePubMedGoogle Scholar
- Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440 (7084): 631-636. 10.1038/nature04532View ArticlePubMedGoogle Scholar
- Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrn-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul-Amin M, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006, 440 (7084): 637-643. 10.1038/nature04670View ArticlePubMedGoogle Scholar
- Spirin V, Mriny LA: Protein complexes and functional modules in molecular networks. PNAS. 2003, 100 (21): 12123-12128. 10.1073/pnas.2032324100PubMed CentralView ArticlePubMedGoogle Scholar
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4 (2): 1-27.Google Scholar
- Pereira-Leal JB, Enright AJ, Ouzounis CA: Detection of functional modules from protein interaction networks. Proteins. 2004, 54: 49-57. 10.1002/prot.10505View ArticlePubMedGoogle Scholar
- King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics. 2004, 20 (17): 3013-3020. 10.1093/bioinformatics/bth351View ArticlePubMedGoogle Scholar
- Arnau V, Mars S, Marin I: Iterative cluster analysis of protein interaction data. Bioinformatics. 2005, 21 (3): 364-378. 10.1093/bioinformatics/bti021View ArticlePubMedGoogle Scholar
- Lu H, Zhu X, Liu H, Skogerb G, Zhang J, Zhang Y, Cai L, Zhao Y, Sun S, Xu J, Bu D, Chen R: The interactome as a tree - an attempt to visualize the protein-protein interaction network in yeast. Nucleic Acids Res. 2004, 32 (16): 4804-4811. 10.1093/nar/gkh814PubMed CentralView ArticlePubMedGoogle Scholar
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006, 7: 207- 10.1186/1471-2105-7-207PubMed CentralView ArticlePubMedGoogle Scholar
- Said MR, Begley TJ, Oppenheim AV, Lauffen-burger DA, Samson LD: Global network analysis of phenotypic effects. protein networks and toxicity modulation in Saccharomyces cerevisiae. Proc Natl Acad Sci USA. 2004, 101 (52): 18006-18011. 10.1073/pnas.0405996101PubMed CentralView ArticlePubMedGoogle Scholar
- Dunn R, Dudbridge F, Sanderson CM: The use of edge-betweenness clustering to investigate biological function in protein interaction networks. BMC Bioinformatics. 2005, 6: 39- 10.1186/1471-2105-6-39PubMed CentralView ArticlePubMedGoogle Scholar
- Bandyopadhyay S, Sharan R, Ideker T: Systematic identification of functional orthologs based on protein network comparison. Genome Res. 2006, 16 (3): 428-435. 10.1101/gr.4526006PubMed CentralView ArticlePubMedGoogle Scholar
- Middendorf M, Ziv E, Wiggins CH: Inferring network mechanisms. the Drosophila melanogaster protein interaction network. Proc Natl Acad Sci USA. 2005, 102 (9): 3192-3197. 10.1073/pnas.0409515102PubMed CentralView ArticlePubMedGoogle Scholar
- Friedrich C, Schreiber F: Visualisation and navigation methods for typed protein-protein interaction networks. Appl Bioinformatics. 2003, 2 (3 Suppl): S19-S24.PubMedGoogle Scholar
- Ding C, He X, Meraz RF, Holbrook SR: A unified representation of multiprotein complex data for modeling interaction networks. Proteins. 2004, 57: 99-108. 10.1002/prot.20147View ArticlePubMedGoogle Scholar
- Brun C, Chevenet F, Martin D, Wojcik J, Gunoche A, Jacq B: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol. 2003, 5: R6- 10.1186/gb-2003-5-1-r6PubMed CentralView ArticlePubMedGoogle Scholar
- Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nat Biotechnol. 2003, 21 (6): 697-700. 10.1038/nbt825View ArticlePubMedGoogle Scholar
- Gagneur J, Krause R, Bouwmeester T, Casari G: Modular decomposition of protein-protein interaction networks. Genome Biol. 2004, 5 (8): R57- 10.1186/gb-2004-5-8-r57PubMed CentralView ArticlePubMedGoogle Scholar
- Van Dongen S: Graph clustering by flow simulation. PhD thesis. 2000, Center for Mathematics and Computer Science (CWI), University of UtrechtGoogle Scholar
- Enright AJ, Dongen SV, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30 (7): 1575-1584. 10.1093/nar/30.7.1575PubMed CentralView ArticlePubMedGoogle Scholar
- Chua HN, Sung WK, Wong L: Exploiting indirect neighbors and topological weight to predict protein function from protein-protein interactions. Bioinformatics. 2006, 22: 1623-1630. 10.1093/bioinformatics/btl145View ArticlePubMedGoogle Scholar
- Chen J, Hsu W, Lee ML, Ng SK: Discovering reliable protein interactions from high-throughput experimental data using network topology. Artificial Intelligence in Medicine. 2005, 35 (1): 37- 10.1016/j.artmed.2005.02.004View ArticlePubMedGoogle Scholar
- Chua HN, Ning K, Sung WK, Leong HW, Wong L: Using indirect protein interactions for the protein complex prediction. Journal of Bioinformatics and Computational Biology. 2008, 6 (3): 435-466. 10.1142/S0219720008003497View ArticlePubMedGoogle Scholar
- Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics. 2009, 25: 1891-1897. 10.1093/bioinformatics/btp311View ArticlePubMedGoogle Scholar
- Aloy P, Bottcher B, Ceulemans H, Leutwein C, Mell-wig C, Fischer S, Gavin A, Bork P, Superti-Furga G, Serrano L, Russell RB: Structure-based assembly of protein complexes in yeast. Science. 2004, 303: 2026-2029. 10.1126/science.1092645View ArticlePubMedGoogle Scholar
- Mewes HW, Heumann K, Kaps A, Mayer K, Pfeiffer F, Stocker S, Frishman D: MIPS. a database for genomes and protein sequences. Nucleic Acids Research. 1999, 27 (1): 44-48. 10.1093/nar/27.1.44PubMed CentralView ArticlePubMedGoogle Scholar
- Diestel R: Graph Theory. 2005, Springer-Verlage, HeidelbergGoogle Scholar
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Research. 2006, 34: D535-D539. 10.1093/nar/gkj109PubMed CentralView ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis. Protein phylogenetic profiles. Proceedings of the National Academy of Sciences, USA. 1999, 96: 4285-4288. 10.1073/pnas.96.8.4285.View ArticleGoogle Scholar
- Wu J, Kasif S, DeLisi C: Identification of functional links between genes using phylogenetic profiles. Bioinformatics. 2003, 19: 1524-1530. 10.1093/bioinformatics/btg187View ArticlePubMedGoogle Scholar
- Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences. 1998, 23: 324-328. 10.1016/S0968-0004(98)01274-2View ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight S, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics. 2000, 25: 25-29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Yang Y, Liu X: A re-examination of text categorization methods. Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval. 1999, 42-49.Google Scholar
- van Rijsbergen CJ: Information Retireval. 1979, Butterworths, LondonGoogle Scholar
- Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006, 7: 488- 10.1186/1471-2105-7-488PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.