Protein complex prediction based on k-connected subgraphs in protein interaction network

Background Protein complexes play an important role in cellular mechanisms. Recently, several methods have been presented to predict protein complexes in a protein interaction network. In these methods, a protein complex is predicted as a dense subgraph of protein interactions. However, interaction data are incomplete and a protein complex does not have to be a complete or dense subgraph. Results We propose a more appropriate protein complex prediction method, CFA, that is based on the connectivity number of subgraphs. We evaluate CFA using several protein interaction networks on reference protein complexes in two benchmark data sets (MIPS and Aloy), containing 1142 and 61 known complexes respectively. We compare CFA to some existing protein complex prediction methods (CMC, MCL, PCP and RNSC) in terms of recall and precision. We show that CFA predicts more complexes correctly at a competitive level of precision. Conclusions Many real complexes with different connectivity levels in a protein interaction network can be predicted based on connectivity number. Our CFA program and results are freely available from http://www.bioinf.cs.ipm.ir/softwares/cfa/CFA.rar.

The next algorithm is RNSC (Restricted Neighborhood Search Clustering) [13]. It is a cost-based local search algorithm that explores the solution space to minimize a cost function, which is calculated based on the numbers of intra-cluster and inter-cluster edges.
However, many biological data sources contain noise and do not contain complete information due to limitations of experiments. Recently, some computational methods have estimated the reliability of individual interactions based on the topology of the protein interaction network (PPI network) [23,28,29]. The Protein Complex Prediction method (PCP) [30] uses indirect interactions and topological weight to augment protein-protein interactions, as well as to remove interactions with weights below a threshold. PCP employs clique finding on the modified PPI network, retaining the benefits of clique-based approaches. Liu et al. [31] proposed an iterative score method to assess the reliability of protein interactions and to predict new interactions. They then developed the Clustering based on Maximal Clique algorithm (CMC) that uses maximal cliques to discover complexes from weighted PPI networks.
Following these past works, we model the PPI network with a graph, where vertices represent proteins and edges represent interactions between proteins. We present a new algorithm, CFA (short for k-Connected Finding Algorithm), to find protein complexes from this graph. Our algorithm is based on finding maximal k-connected subgraphs. The union of all maximal k-connected subgraphs (k ≥ 1) forms the set of candidate protein clusters. These candidate clusters are then filtered to remove (i) clusters having fewer than four proteins and (ii) clusters having a large diameter. We compare the results of our algorithm with the results of MCL, RNSC, PCP and CMC. Our algorithm produces results that are comparable to or better than those of these existing algorithms on the real complexes of [32,33].

Preliminaries
Generally, a complete or a dense subgraph of a protein interaction network is proposed to be a protein complex. But there are many complexes which have different topology and density (see Figure 1). So we need to define a criterion to predict protein complexes with different topology.

Interaction Graphs
A PPI network is considered as an undirected graph G = 〈V, E〉, where each vertex v ∈ V represents a protein in the network and each edge uv ∈ E represents an observed interaction between proteins u and v. Two vertices u and v of G are adjacent or neighbors if and only if uv is an edge of G. The degree d(v) of a vertex v is defined as the number of neighbors that the protein v has.
The density of a graph G = 〈V, E〉 is defined by

$$D_G = \frac{2|E|}{|V|(|V|-1)}$$

If all the vertices of G are pairwise adjacent, then G is a complete graph and D_G = 1. A complete graph on n vertices is denoted by K_n. The cluster score of G is defined as D_G × |V|.
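As a small illustration of these two quantities, the following sketch assumes the usual density of a simple undirected graph, D_G = 2|E| / (|V|(|V| − 1)), which is consistent with the later statement that a tree has cluster score 2. The function names and the convention for graphs with fewer than two vertices are choices made for this example only.

```python
def density(num_vertices: int, num_edges: int) -> float:
    """Density of a simple undirected graph: 2|E| / (|V| (|V| - 1))."""
    if num_vertices < 2:
        # Convention chosen for this sketch; the text does not cover this case.
        return 0.0
    return 2.0 * num_edges / (num_vertices * (num_vertices - 1))


def cluster_score(num_vertices: int, num_edges: int) -> float:
    """Cluster score as defined in the text: density times number of vertices."""
    return density(num_vertices, num_edges) * num_vertices


if __name__ == "__main__":
    # K_4 has 6 edges; a path on 4 vertices is a tree with 3 edges.
    print(density(4, 6))        # complete graph
    print(cluster_score(4, 3))  # tree
```

The complete graph K_4 reaches density 1, while any tree, whatever its size, has cluster score exactly 2.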

K-Connectivity
A path in a non-empty graph G = 〈V, E〉 between two vertices u and v is a sequence of distinct vertices u = v_0, v_1, ..., v_k = v such that v_i v_{i+1} ∈ E for 0 ≤ i < k. The distance between u and v is the length of the shortest path in G between them. The greatest distance between any two vertices in G is the diameter of G, denoted by diam G. A graph is connected (1-connected) if there is a path between every two of its vertices, and it is k-connected if it has more than k vertices and remains connected whenever fewer than k vertices are removed. A non-empty 1-connected subgraph with the minimum number of edges is called a tree. It is well known that a connected graph is a tree if and only if the number of edges of the graph is one less than the number of its vertices. It is a classic result of graph theory, the global version of Menger's theorem [34], that a graph is k-connected if and only if any two of its vertices can be joined by k independent paths (two paths are independent if they only intersect in their ends).

Figure 1 (caption, partial): In complex 1, except for one vertex, there are at least two independent paths between every two proteins. In complex 2, except for two vertices, there are at least two independent paths between every two proteins. Part (B) shows two 2-connected subgraphs obtained from the network in Part (A).

Protein-Protein Interaction Network Data
In this work, we use two high-throughput protein-protein interaction (PPI) data collections. The first data collection, GRID, contains six protein interaction networks from the Saccharomyces cerevisiae (baker's yeast) genome. These include two-hybrid interactions from Uetz et al. [2] and Ito et al. [3], as well as interactions characterized by mass spectrometry from Ho, Gavin, Krogan and their colleagues [6][7][8][9]. We refer to these data sets as PPI Uetz, PPI Ito, PPI Ho, PPI Gavin2, PPI Gavin6, and PPI Krogan.
The other data collection is obtained from BioGRID [35]. This data collection includes interactions obtained by several techniques. We only consider interactions derived from mass spectrometry and two-hybrid experiments, as these represent physical interactions and co-complexed proteins. We refer to this data set as PPI BioGRID. Some descriptive statistics of each protein interaction network are presented in Table 1.

Protein Complex Data
Two reference sets of protein complexes are used in our work. The first data set was gathered by Aloy et al. [32] and the other was released in the Munich Information Center for Protein Sequences (MIPS) [33] at the time of this work (September 2009). We refer to the two protein complex data sets as APC (Aloy Protein Complex) and MPC (MIPS Protein Complex), respectively. Details of these data sets are described in Table 2. During validation, those proteins which cannot be found in the input interaction network are removed from the complex data.

Cellular Component Annotation
The level of noise in protein interaction data, especially those obtained by two-hybrid experiments, has been estimated to be as high as 50% [36][37][38]. Liu et al. [31] have shown that using a de-noised protein interaction network as input leads to better quality of protein complex predictions by existing methods. A protein complex can only be formed if its proteins are localized within the same component of the cell. So we use localization coherence of proteins to clean up the input protein interaction network. We use cellular component terms from Gene Ontology (GO) [39] to evaluate localization coherence. We find that among the 5040 yeast proteins, only 4345 (86%) are annotated. To avoid arriving at misleading conclusions caused by biases in the annotations, we use the concept of informative cellular component. We define a cellular component annotation as informative if it has at least k proteins annotated with it and each of its descendant GO terms has fewer than k proteins annotated with it. In this work, we set k to 10. This yields 150 informative cellular component GO terms on the BioGRID data set.
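The definition of an informative cellular component term can be sketched as follows. This is a minimal illustration, not the procedure used to obtain the 150 terms: the dictionaries `annotations` (term to annotated proteins) and `descendants` (term to all descendant terms) are hypothetical inputs, and the sketch assumes annotations have already been propagated up the GO hierarchy.

```python
from typing import Dict, Set


def informative_terms(
    annotations: Dict[str, Set[str]],
    descendants: Dict[str, Set[str]],
    k: int = 10,
) -> Set[str]:
    """Terms with at least k annotated proteins whose descendants all have fewer than k."""
    result = set()
    for term, proteins in annotations.items():
        if len(proteins) < k:
            continue  # too few proteins annotated with this term
        children = descendants.get(term, set())
        if all(len(annotations.get(d, set())) < k for d in children):
            result.add(term)
    return result


if __name__ == "__main__":
    # Toy hierarchy (hypothetical terms): root > child > leaf, with k = 3.
    toy_annotations = {
        "root": {"p1", "p2", "p3", "p4", "p5"},
        "child": {"p1", "p2", "p3"},
        "leaf": {"p1"},
    }
    toy_descendants = {"root": {"child", "leaf"}, "child": {"leaf"}}
    print(informative_terms(toy_annotations, toy_descendants, k=3))
```

In the toy hierarchy only "child" qualifies: "root" is excluded because one of its descendants still has k proteins, and "leaf" has too few.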

Performance Evaluation Measures
There are many studies that predict protein complexes. To evaluate the performance of various protein complex prediction methods, we compare the predicted protein complexes with real protein complex data sets, APC and MPC.
To compare the clusters (i.e., predicted protein complexes) found by different algorithms to real protein complexes, we use a measure based on the fraction of proteins in the predicted cluster that overlaps with the known complex. Let S be a predicted cluster and C be a reference complex, with size |S| and |C| respectively. The matching score between S and C is denoted by Overlap(S, C). If Overlap(S, C) meets or exceeds a threshold θ, then we say S and C match. Following Liu et al. [31], we use an overlap threshold of 0.5 to determine a match.
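Given a set of reference complexes C = {C_1, C_2, ..., C_n} and a set of predicted complexes S = {S_1, S_2, ..., S_m}, precision and recall at the whole-complex level are defined as follows:

$$\mathrm{Precision} = \frac{|\{S_i \in S \mid \exists C_j \in C,\ \mathrm{Overlap}(S_i, C_j) \ge \theta\}|}{|S|}$$

$$\mathrm{Recall} = \frac{|\{C_j \in C \mid \exists S_i \in S,\ \mathrm{Overlap}(S_i, C_j) \ge \theta\}|}{|C|}$$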
The precision and recall are two numbers between 0 and 1. They are the commonly used measures to evaluate the performance of protein complex prediction methods [30,31]. In particular, precision corresponds to the fraction of predicted clusters that matches real protein complexes; and recall corresponds to the fraction of real protein complexes that are matched by predicted clusters.
Another measure which can be used to evaluate the performance of a method is the F-measure. According to [40], this measure was first introduced by Rijsbergen [41]. It is defined as the harmonic mean of precision and recall:

$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
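The whole-complex precision, recall and F-measure described above can be sketched as follows. The matching score is passed in as a function; the default used here, the Jaccard index, is an assumption of this sketch and is not confirmed by the text, so it should be replaced by whatever Overlap(S, C) is actually used. All function names are illustrative.

```python
from typing import Callable, Sequence, Set, Tuple


def jaccard(s: Set[str], c: Set[str]) -> float:
    """ASSUMED matching score: |S ∩ C| / |S ∪ C| (not taken from the text)."""
    union = s | c
    return len(s & c) / len(union) if union else 0.0


def precision_recall(
    predicted: Sequence[Set[str]],
    reference: Sequence[Set[str]],
    overlap: Callable[[Set[str], Set[str]], float] = jaccard,
    theta: float = 0.5,
) -> Tuple[float, float]:
    """Fraction of predicted clusters that match, and of reference complexes matched."""
    matched_pred = sum(
        1 for s in predicted if any(overlap(s, c) >= theta for c in reference)
    )
    matched_ref = sum(
        1 for c in reference if any(overlap(s, c) >= theta for s in predicted)
    )
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_ref / len(reference) if reference else 0.0
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [{"a", "b", "c"}, {"x", "y"}]
    reference = [{"a", "b", "c", "d"}, {"p", "q"}, {"r", "s"}]
    p, r = precision_recall(predicted, reference)
    print(p, r, f_measure(p, r))
```

In the toy data one of two predicted clusters matches one of three reference complexes, so precision is 1/2 and recall is 1/3.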

Observations
To justify using the connectivity definition and cellular component annotation, we analyze the connectivity number and localization coherence of reference complexes of MPC on PPI networks obtained by [6][7][8][9] as well as [35].

Co-Localization Score of Known Complexes
A protein complex is a set of proteins that interact with each other at the same time and place, forming a single multimolecular machine [10]. This biological definition of a protein complex helps us predict protein complexes. Using the information of cellular component annotation existing in GO, Liu et al. [31] define a localization group as the set of proteins annotated with a common informative cellular component GO annotation. They then define the co-localization score of a complex c as the maximum number of proteins in the complex that are in the same localization group, max{|c ∩ L_i| : i = 1, ..., k}, divided by the number of proteins in c with localization annotations, |{p ∈ c | ∃L_i ∈ L, p ∈ L_i}|, where L = {L_1, ..., L_k} is the set of localization groups. More formally, the co-localization score of a set of complexes C is the weighted average score over all complexes:

$$\mathrm{locscore}(C) = \frac{\sum_{c \in C} \max\{|c \cap L_i| : i = 1, \ldots, k\}}{\sum_{c \in C} |\{p \in c \mid \exists L_i \in L,\ p \in L_i\}|}$$

The locscore values for MPC and APC are 0.74 and 0.86 respectively. The relatively large values of these numbers suggest that cleaning the input PPI network by cellular component information should help us improve precision and recall of existing algorithms.
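The co-localization score can be sketched as below. The sketch assumes that the weighted average weights each complex by its number of annotated proteins, so that the score of a set of complexes is the sum of the per-complex maxima divided by the total number of annotated proteins; this weighting is an assumption of the example, and the function name `locscore` simply mirrors the term used in the text.

```python
from typing import Sequence, Set


def locscore(complexes: Sequence[Set[str]], groups: Sequence[Set[str]]) -> float:
    """Co-localization score of a set of complexes (weighting assumed, see lead-in)."""
    annotated_all = set().union(*groups)  # proteins with any localization group
    numerator = 0
    denominator = 0
    for c in complexes:
        annotated = c & annotated_all
        if not annotated:
            continue  # complex has no annotated protein and contributes nothing
        numerator += max(len(c & g) for g in groups)
        denominator += len(annotated)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    groups = [{"a", "b", "c"}, {"c", "d"}]
    complexes = [{"a", "b", "d", "z"}, {"c", "d"}]
    # First complex: 3 annotated proteins, at most 2 in one group.
    # Second complex: 2 annotated proteins, both in the second group.
    print(locscore(complexes, groups))
```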

Impact of Localization Information
In this work, the cleaning of PPI networks using informative cellular component GO terms is an important preprocessing step. So we analyze here the impact of using informative GO cellular component annotation on the performance of four existing algorithms (CMC, MCL, PCP, and RNSC) with their standard parameters. (The CMC package comes with its own PPI-cleaning method. However, in order to observe the effect of cleaning based on cellular component GO terms on CMC, this method is not used in this work.) Let G_i = G[L_i] be the induced subgraph of G generated by the vertex set L_i, where {L_1, L_2, ..., L_k} is the set of localization groups. Thus each L_i contains a set of proteins localized to the same cellular component; that is, they are annotated by the same informative GO term. Let C_i be the set of all clusters predicted by an algorithm on G_i. Then C_L = C_1 ∪ C_2 ∪ ... ∪ C_k denotes the set of all clusters predicted by the algorithm on G after segregation by localization group.
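The segregation step can be sketched as follows: each localization group induces a subgraph, a clustering algorithm is run on each induced subgraph, and the resulting clusters are pooled. The clustering algorithm is passed in as a function; `connected_components` below is only a stand-in used to make the example runnable and is not one of the four algorithms studied. All names are illustrative.

```python
from typing import Callable, Dict, Iterable, List, Set

Graph = Dict[str, Set[str]]  # vertex -> set of neighbours


def induced_subgraph(graph: Graph, vertices: Iterable[str]) -> Graph:
    """Subgraph induced by the given vertices (vertices absent from graph are ignored)."""
    keep = {v for v in vertices if v in graph}
    return {v: graph[v] & keep for v in keep}


def connected_components(graph: Graph) -> List[Set[str]]:
    """Stand-in clustering function: connected components by depth-first search."""
    seen: Set[str] = set()
    comps = []
    for start in graph:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for w in graph[v]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def clusters_on_segregated_network(
    graph: Graph,
    groups: Iterable[Set[str]],
    cluster_fn: Callable[[Graph], List[Set[str]]],
) -> List[frozenset]:
    """Union of the clusters found on each induced subgraph G[L_i]."""
    seen, result = set(), []
    for group in groups:
        for cluster in cluster_fn(induced_subgraph(graph, group)):
            key = frozenset(cluster)
            if key not in seen:  # the same cluster may arise from two groups
                seen.add(key)
                result.append(key)
    return result


if __name__ == "__main__":
    # Path a-b-c-d, with localization groups {a, b} and {c, d}.
    g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
    print(clusters_on_segregated_network(g, [{"a", "b"}, {"c", "d"}], connected_components))
```

On the toy path, clustering the original network gives a single four-protein cluster, whereas the segregated network gives two two-protein clusters, which mirrors the observation that segregation tends to produce more and smaller clusters.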
To evaluate the impact of localization information, we compare the precision and recall of C_L and clusters generated on the original PPI network G. Table 3 summarizes some general features of clusters predicted by the algorithms mentioned. We observe that, by using protein cellular component annotations, the number of predicted clusters generally increases, while the average cluster size decreases. We further observe that the average size of clusters predicted by the MCL and CMC algorithms is larger than that of clusters predicted by the others. We also compare the precision and recall of the clusters predicted by the four algorithms. We find that generally the precision and recall values have significant improvements in C_L.
The precision and recall values obtained at the matching threshold θ = 0.5 are given in Table 3. RNSC performs best on PPI BioGRID, while MCL performs best on PPI Gavin6, PPI Gavin2, and PPI Ho. In the original network of PPI Krogan, PCP shows better precision against recall compared to other methods, while after cleaning by using localization information almost all methods have similar performance. This table shows that none of these algorithms has the best precision vs recall in all networks.
We present two illustrative examples in Figure 2. The first example (Figure 2(A)) is the unmatched cluster predicted by CMC on the original network of PPI Gavin2. This cluster contains a four-member protein complex with a specific GO cellular component annotation (GO:0005956; protein kinase CK2 complex). The other seven proteins in the CMC cluster belong to other localization groups. This cluster is refined in C_L to match well with the same real complex. In Figure 2(B), PCP predicts a seven-member cluster matched to a complex of MPC using the localization annotation on PPI Krogan. In contrast, only four proteins in this complex are matched to the corresponding PCP cluster predicted on the original network.

Density of Known Complexes
We consider the density of known complexes with size at least three for each PPI network. Figure 3 shows that algorithms based on graph density cannot predict a large number of known complexes, and recall values of these algorithms are destined to be limited. We have also studied the number of known complexes of size four in PPI BioGRID . We find that there exist 138 real complexes of size four, while only 54 of them have high density.
The discussions above suggest that the density criterion alone cannot answer the question of finding complexes. We need to introduce another criterion to overcome this problem.

Connectivity of Known Complexes
We show in this section that connectivity is a reasonable alternative criterion for identifying protein complexes. Although this criterion is simple, it may directly describe the general understanding of the protein complex concept. This criterion is better than density because, while there are a lot of known complexes that are not complete or dense, there are many k-connected subgraphs with low density. For an example, see Figure 1. In Table 4, the kscore and average density of different PPI networks on MPC are shown. The average density of the set of real complexes is usually low. On the other hand, on average, 99.5% of proteins of each real complex are located in 1-connected subgraphs. Also 78.4%, 53.7% and 37.4% of proteins of each real complex are located in 2-connected, 3-connected, and 4-connected subgraphs respectively. As the connectivity number increases, this average decreases, but there exist some proteins which are located in a subset of a real complex with high k-connectivity.
This suggests that using connectivity number as a criterion of protein complex prediction may be a good approach. Therefore, our algorithm is based on finding maximal k-connected subgraphs in PPI networks, increasing k repeatedly until it cannot be increased any further. In other words, the algorithm continues until it reaches an integer k_0 such that there is no k-connected subgraph with k > k_0.

Testing for Accuracy
To check the validity of CFA, we compare clusters predicted by CFA with the clusters obtained by CMC, MCL, PCP and RNSC, on the seven protein interaction networks of GRID and BioGRID. The networks are first segregated by informative cellular component GO terms before these algorithms are run. MPC and APC are used as benchmark real protein complexes.
In PPI Uetz, none of the algorithms could produce any cluster matched by real complexes in MPC and APC. PPI Uetz is a difficult example because, as can be seen in Table 1, it is a much sparser and much more incomplete network compared to the other PPI networks. So in Table 5, we present the number of matched clusters and matched complexes predicted by the clustering methods on the other six PPI networks. Table 5 shows that CFA performs better on PPI Krogan, PPI Ito, PPI Gavin2 and PPI Gavin6 compared to other methods. In fact, both precision and recall values of CFA are greater than those of all the other algorithms in these networks. In PPI Ho, RNSC has the greatest precision. However, RNSC predicts merely 26 clusters and, among these predictions, 13 clusters are matched to 5 real complexes in APC and 19 clusters are matched to 21 real complexes in MPC. Thus the recall value of RNSC is very low (0.166 on APC and 0.038 on MPC). In contrast, CFA correctly predicts 13 real complexes of APC and 62 of MPC. The clusters of CFA give the precision value 0.416 (0.166) and the recall value 0.114 (0.433) on MPC (APC), which are generally better than those obtained by RNSC and other methods on PPI Ho.
We also study the number of matched clusters and matched complexes of predictions on PPI Biogrid . We find that almost all algorithms predict the same number of real complexes in APC. However, CFA matches a lot more complexes in MPC than CMC (18% more), MCL (5% more), PCP (15% more) and RNSC (17% more).
Furthermore, this significant superiority of CFA in recall comes with the highest precision value in MPC. The overall precision of CFA on the combined APC and MPC complexes, as can be computed from Table 6, is 0.492, which is comparable to CMC (0.422), PCP (0.411), and RNSC (0.502), and is superior to MCL (0.274).
We find that all complexes predicted by CMC and RNSC are identified by at least one of the other three algorithms. To compare real complexes predicted by CFA, MCL and PCP, Figure 4 shows a Venn diagram of complexes predicted by these algorithms on the combined set of APC and MPC complexes. It shows that CFA predicts the largest number of real complexes that MCL and PCP cannot predict. So CFA is finding a different group of complexes from other methods.
Some interactions in PPI BioGRID are derived from the two-hybrid technique. Due to the level of noise in two-hybrid experiments, we expect those predicted clusters having the form of a tree structure to have lower reliability compared to other 1-connected subgraphs. Hence, in order to improve the results of CFA, we only use 1-connected subgraphs that are not trees. A tree with n vertices has n - 1 edges; so a connected cluster is a tree if and only if its cluster score is 2. Thus, we consider 1-connected subgraphs with cluster scores greater than 2. Similarly, we can do additional filtering for each k-connected subgraph by considering the clusters with cluster score greater than k+1. The precision and recall values of the resulting further refined clusters are 0.465 and 0.178 in MPC and 0.347 and 0.838 in APC. So the precision vs recall of CFA, using cluster score filtering, shows significant improvement compared to other methods in PPI BioGRID on APC too.
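The cluster score thresholds can be checked with a short derivation. It assumes the usual density of a simple undirected graph, D_G = 2|E| / (|V|(|V| − 1)), and uses the standard fact that every vertex of a k-connected graph has degree at least k, so that |E| ≥ k|V|/2.

```latex
\begin{align*}
\text{cluster score}(G) &= D_G \cdot |V|
  = \frac{2|E|}{|V|(|V|-1)} \cdot |V| = \frac{2|E|}{|V|-1}, \\
\text{tree: } |E| = |V|-1 &\;\Rightarrow\; \text{cluster score} = 2, \\
\text{connected, not a tree: } |E| \ge |V| &\;\Rightarrow\;
  \text{cluster score} \ge \frac{2|V|}{|V|-1} > 2, \\
k\text{-connected: } |E| \ge \tfrac{k|V|}{2} &\;\Rightarrow\;
  \text{cluster score} \ge \frac{k|V|}{|V|-1} > k.
\end{align*}
```

Under these assumptions every k-connected subgraph already has cluster score above k, so the threshold k+1 is a genuinely stricter filter: it keeps only subgraphs with 2|E| > (k+1)(|V| − 1).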
On the other hand, we observe that some predicted clusters have large overlap with each other. That is, we have some clusters S_i and S_j such that Overlap(S_i, S_j) ≥ a. To get a more concise understanding of CFA and the other prediction methods, we also clean up the set of predictions by removing redundant clusters. In other words, when two predicted clusters show an overlap score above the threshold value (of a = 0.5), we keep the larger one. The precision and recall values after this additional cleaning of the set of predictions are given in Table 7. Table 7 shows that, generally, CFA identifies the largest number of complexes based on nonredundant predicted clusters on each PPI network.
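One way to realize this redundancy removal is a greedy pass over the clusters from largest to smallest, keeping a cluster only if it does not overlap too much with a cluster already kept. The greedy order and the use of the Jaccard index as overlap score are both assumptions of this sketch; the text only states that the larger of two overlapping clusters is kept.

```python
from typing import Callable, Iterable, List, Set


def jaccard(s: Set[str], c: Set[str]) -> float:
    """ASSUMED overlap score: |S ∩ C| / |S ∪ C| (not taken from the text)."""
    union = s | c
    return len(s & c) / len(union) if union else 0.0


def remove_redundant(
    clusters: Iterable[Set[str]],
    overlap: Callable[[Set[str], Set[str]], float] = jaccard,
    a: float = 0.5,
) -> List[Set[str]]:
    """Greedy filter: visit clusters by decreasing size, drop those overlapping a kept one."""
    kept: List[Set[str]] = []
    for cluster in sorted(clusters, key=len, reverse=True):
        if all(overlap(cluster, other) < a for other in kept):
            kept.append(cluster)
    return kept


if __name__ == "__main__":
    clusters = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"x", "y"}]
    print(remove_redundant(clusters))
```

In the toy input the three-protein cluster overlaps the four-protein cluster with score 0.75, so only the larger one and the unrelated pair survive.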

Examples of Predicted Clusters
In this section, we present five matched and unmatched clusters predicted by CFA.
In Figure 1, complex 1 ([...] Maintenance Chromatin Structure) contains a protein, YNL113W, whose interactions with other proteins are missing from PPI Gavin2. Complex 2 contains 12 proteins (MIPS ID. 510.40.10; RNA polymerase II) and there exists a protein, YLR418C, in this complex whose interactions with other proteins are missing in PPI Gavin2. There are four common proteins in these two complexes. Without considering localization annotations, CFA predicts all vertices of this graph (except for YLR418C and YNL113W) as a 2-connected subgraph. After segregating the network using GO terms, CFA predicts two clusters (Figure 1(B)) which are matched to the real complexes in Figure 1(A).

Table note: The "data sets" column refers to networks, where (1) denotes PPI BioGRID, (2) denotes PPI Gavin6, (3) denotes PPI Gavin2, (4) denotes PPI Krogan, (5) denotes PPI Ho, and (6) denotes PPI Ito. The best precision and recall values for each PPI network are highlighted in bold font.

In Figure 5, we show three matched and unmatched clusters. The first cluster contains 30 proteins from PPI Gavin6. The cluster is perfectly matched to a complex in MPC of size 30. The density of this complex is 0.2, so it can be considered a non-dense real complex. The second cluster is a nineteen-member cluster from PPI Krogan. This cluster contains a known complex in MPC of 18 proteins with a specific GO annotation (GO:0006511; ubiquitin-dependent protein catabolic process). The one additional protein (YDR363W-A) predicted by CFA to be in this cluster turns out to have the same biological process GO term annotation. We think that with more accurate experimental data, this 19th protein may also be a protein of this complex. The smallest cluster in our samples contains six proteins that are predicted by CFA in PPI BioGRID. The cluster members have the same specific GO annotation (GO:0015031; protein transport), though this cluster is not presented as a known complex in MPC and APC.
To gain further insights into the differences among CFA's clusters and clusters predicted by other algorithms, we consider the first CFA cluster presented in Figure 5. This cluster is matched perfectly to a 30-member complex in MPC. In contrast, CMC's clusters only overlap with at most 16 members of this complex. The corresponding cluster predicted by PCP is a twenty-five-member cluster, and the other members of the real complex do not belong to the PCP cluster. Similarly, merely fifteen members of the corresponding RNSC cluster overlap with the same complex. Among these methods only MCL predicts a cluster which is matched to the same complex perfectly.
The third cluster shown in Figure 5 is an unmatched cluster which is obtained by CFA, CMC, PCP and RNSC algorithms. None of the proteins of this cluster belongs to any real complex in MPC and APC. However, MCL predicts a cluster containing all members of the above mentioned cluster with an extra protein with a different GO term annotation.

Conclusions
In the first part of this work, we study the impact of using informative cellular component GO term annotations on the performance of several different protein complex prediction algorithms. We have shown (Table 3) that existing algorithms predict protein complexes with significantly higher precision and recall when the input PPI network is cleansed using informative cellular component GO term annotations. Therefore, we propose for protein complex prediction algorithms a preprocessing step where the input PPI network is segregated by informative cellular component GO terms.
In the second part of this work, we study the density of protein interactions within protein complexes. We have shown (Figure 3) that there are many real complexes with different density. So density is not a good criterion for prediction of protein complexes. Therefore, we look at the connectivity number of complexes as a possible alternative criterion. We observe (Table 4) that 87%-99% of real protein complexes are 1-connected, 68%-87% are 2-connected, 35%-54% are 3-connected, and 23%-37% are 4-connected.
So in the third part of this work, we propose the CFA algorithm to predict protein complexes based on finding k-connected subgraphs on an input PPI network that has been segregated according to informative cellular component GO term annotations on its proteins. Table 8 shows the precision and recall of maximal k-connected subgraphs on different PPI networks using MPC complexes as reference protein complexes. It can be seen that, by increasing the connectivity number of subgraphs, precision values show significant improvement compared to subgraphs with low connectivity numbers. However, the recall values decrease, due to a decrease in the number of predicted subgraphs. We have found that combining the k-connected subgraphs for various values of k as our set of predicted protein complexes yields the best precision vs recall performance. This combined set constitutes the predicted clusters output by CFA.
Finally, we compare the performance of CFA to several state-of-the-art protein complex prediction methods. We have shown (Table 5) that CFA performs better than other methods for most test cases. For example, in the largest network in our test sets (PPI BioGRID), the number of complexes predicted by RNSC is very low compared to CFA. In particular, CFA predicts 19 complexes which RNSC is unable to predict, while RNSC predicts 2 complexes which CFA is unable to predict. Furthermore, by varying the threshold on the matching score, we show in Figure 6 the F-measure graphs based on protein clusters predicted for various protein interaction networks. We observe that CFA consistently shows the best performance compared to other methods over the entire range.

Methods
In the Observations section we explained that cellular component annotations can help us to improve predictions. On the other hand, by studying the connectivity number of real complexes as subgraphs of PPI network, we showed that the connectivity number could be a reasonable criterion to predict complexes. So we present a new algorithm based on finding k-connected subgraphs (1 ≤ k) on PPI networks segregated by informative cellular component GO terms.

Algorithm
A new algorithm named CFA (k-Connected Finding Algorithm) is presented here to predict complexes from an input (cleansed) PPI network. The CFA algorithm comprises two main steps. In the first step, maximal k-connected subgraphs for various k are generated as candidate complexes. In the second step, a number of filtering rules are applied to eliminate unlikely candidates.
The heart of the first step of CFA contains two simple procedures. The first procedure is REFINE, which removes all vertices of degree less than k from the input graph. This is an obvious optimization since, by the global version of Menger's theorem [34], such vertices cannot be part of any k-connected subgraph. The second procedure is COMPONENT, which takes the refined graph and fragments it into k-connected subgraphs. This procedure finds a set of h < k vertices that disconnects the input graph, producing several connected components of the graph. The procedure is then recursively called on each of these connected components. The procedure terminates on a connected component (and returns it as a maximal k-connected subgraph) if it cannot be made disconnected by removing h < k vertices. The correctness of this procedure follows straightforwardly from the global version of Menger's theorem.
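The two procedures can be illustrated with the brute-force sketch below. It is not the authors' implementation: the separator search simply tries every vertex subset of size below k, which is only practical for small graphs. Two details are assumptions of this sketch rather than statements of the text: `refine` repeats the degree test until no low-degree vertex remains, and `component` re-attaches the separator vertices to each piece before recursing, so that a vertex shared by two k-connected subgraphs is not lost.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]  # vertex -> set of neighbours


def from_edges(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency-set graph from an edge list."""
    g: Graph = {}
    for u, v in edges:
        g.setdefault(u, set()).add(v)
        g.setdefault(v, set()).add(u)
    return g


def induced(graph: Graph, vertices: Set[str]) -> Graph:
    return {v: graph[v] & vertices for v in vertices}


def refine(graph: Graph, k: int) -> Graph:
    """Drop vertices of degree < k, repeating because removals lower other degrees."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}
    low = [v for v in g if len(g[v]) < k]
    while low:
        for v in low:
            for w in g.pop(v):
                if w in g:
                    g[w].discard(v)
        low = [v for v in g if len(g[v]) < k]
    return g


def components(graph: Graph, removed: FrozenSet[str] = frozenset()) -> List[Set[str]]:
    """Connected components of graph after deleting the vertices in `removed`."""
    seen, comps = set(removed), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for w in graph[v]:
                if w not in comp and w not in removed:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def find_separator(graph: Graph, k: int) -> Optional[FrozenSet[str]]:
    """Some set of fewer than k vertices whose removal disconnects graph, else None."""
    for h in range(1, k):
        for cut in combinations(graph, h):
            if len(components(graph, frozenset(cut))) > 1:
                return frozenset(cut)
    return None


def component(graph: Graph, k: int) -> List[Set[str]]:
    """Vertex sets of the maximal k-connected subgraphs of graph (brute force)."""
    graph = refine(graph, k)
    if len(graph) <= k:  # a k-connected graph needs more than k vertices
        return []
    parts = components(graph)
    if len(parts) == 1:
        cut = find_separator(graph, k)
        if cut is None:
            return [set(graph)]  # no small separator: graph is k-connected
        parts = [piece | cut for piece in components(graph, cut)]
    found: List[Set[str]] = []
    for part in parts:
        found.extend(component(induced(graph, part), k))
    return found


if __name__ == "__main__":
    # Two triangles sharing vertex "c".
    g = from_edges([("a", "b"), ("b", "c"), ("a", "c"),
                    ("c", "d"), ("d", "e"), ("c", "e")])
    for k in (1, 2, 3):
        print(k, [sorted(s) for s in component(g, k)])
```

On the toy graph, the whole graph is the single 1-connected subgraph, each triangle is a maximal 2-connected subgraph (both containing the shared vertex), and no 3-connected subgraph exists.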
In the second step of CFA, we call the procedures defined in the first step on larger and larger values of k until no more k-connected subgraphs are returned. This way, we obtain maximal k-connected subgraphs for various values of k. These subgraphs are then filtered using the following three simple rules: (1) 1-connected subgraphs having diameter greater than 4 are removed. (2) k-connected subgraphs (k ≥ 2) having diameter greater than k are removed. (3) Subgraphs of size less than 4 are removed. The pseudocode of the CFA algorithm is given in Table 9.
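The three filtering rules can be sketched as a single predicate on a candidate subgraph. The diameter is computed by breadth-first search from every vertex of the candidate. The sketch follows the prose rules above (a subgraph is removed when its diameter is greater than the limit); the pseudocode in Table 9 states the bounds with a strict inequality, so the comparison operator here is an assumption. Function names are illustrative.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[str, Set[str]]  # vertex -> set of neighbours


def from_edges(edges: Iterable[Tuple[str, str]]) -> Graph:
    g: Graph = {}
    for u, v in edges:
        g.setdefault(u, set()).add(v)
        g.setdefault(v, set()).add(u)
    return g


def diameter(graph: Graph, vertices: Set[str]) -> float:
    """Greatest shortest-path distance inside the subgraph induced by `vertices`."""
    best = 0
    for source in vertices:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w in vertices and w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        if len(dist) < len(vertices):
            return float("inf")  # induced subgraph is disconnected
        best = max(best, max(dist.values()))
    return best


def passes_filters(graph: Graph, vertices: Set[str], k: int) -> bool:
    """Apply the size and diameter rules to a k-connected candidate."""
    if len(vertices) < 4:              # rule (3)
        return False
    limit = 4 if k == 1 else k         # rules (1) and (2)
    return diameter(graph, vertices) <= limit


if __name__ == "__main__":
    path6 = from_edges([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")])
    cycle4 = from_edges([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
    print(passes_filters(path6, set(path6), 1))    # diameter 5
    print(passes_filters(cycle4, set(cycle4), 2))  # diameter 2
```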

Implementation
We choose fixed parameter values for each algorithm (Table 10). The implementations of RNSC and MCL were obtained from the main author of [42], Sylvain Brohée. The implementations of PCP and CMC were obtained from one of their authors, Limsoon Wong.

1 Faculty of Mathematics, Shahid-Beheshti University, g.c., Tehran, Iran. 2 School of Computing, National University of Singapore, Singapore.

Authors' contributions
LW and CE conceived the project and designed the experiments. All authors contributed to conceiving and improving the proposed algorithm. MH implemented the algorithm during all stages of its development and performed all the experiments. All authors contributed to writing the manuscript. All authors have read and approved the manuscript.

Table 9 Pseudo codes of CFA
Step 1: // Find maximal k-connected subgraphs

Procedure REFINE
Input: Graph G = (V, E) and a parameter k.
Output: All vertices in G of degree less than k are removed.
The reduced graph is returned.

Procedure COMPONENT
Input: Connected graph H = (V, E) and a parameter k.

Increment k.
Set G1 to 1-connected subgraphs from C1 with the diameter <4.
Set Gk to k-connected subgraphs from Ck with the diameter < k (for k ≥ 2).
Set U to the union of Gk's (k ≥ 1).
Remove all subgraphs of size less than 4 in the set U.