# An effective method for refining predicted protein complexes based on protein activity and the mechanism of protein complex formation

- Jianxin Wang
^{1}Email author, - Xiaoqing Peng
^{1}, - Qianghua Xiao
^{1}, - Min Li
^{1}and - Yi Pan
^{1, 2}

**7**:28

**DOI: **10.1186/1752-0509-7-28

© Wang et al.; licensee BioMed Central Ltd. 2013

**Received: **9 October 2012

**Accepted: **14 March 2013

**Published: **28 March 2013

## Abstract

### Background

Identifying protein complexes from protein-protein interaction network is fundamental for understanding the mechanism of cellular component and protein function. At present, many methods to identify protein complexes are mainly based on the topological characteristics or the functional similarity features, neglecting the fact that proteins must be in their active forms to interact with others and the formation of protein complex is following a just-in-time mechanism.

### Results

This paper firstly presents a protein complex formation model based on the just-in-time mechanism. By investigating known protein complexes combined with gene expression data, we find that most protein complexes can be formed in continuous time points, and the average overlapping rate of the known complexes during the formation is large. A method is proposed to refine the protein complexes predicted by clustering algorithms based on the protein complex formation model and the properties of known protein complexes. After refinement, the number of known complexes that are matched by predicted complexes, *Sensitivity*, *Specificity*, and *f*-measure are significantly improved, when compared with those of the original predicted complexes.

### Conclusion

The refining method can discard the spurious proteins by protein activity and generate new complexes by just-in-time assemble mechanism, which can enhance the ability to predict complex.

### Keywords

Protein activity Protein complex formation model Just-in-time Refining Gene expression## Background

In a cell, rather than function individually or in isolation, proteins interact physically with each other to form multisubunit protein complexes that act as sophisticated multimolecular machines, such as the anaphase-promoting complexes, RNA splicing and polyadenylation machinery, protein export and transport complexes[1]. The functionality of the cell depends on protein physical interactions and these multimolecular machines, thus great effort has been made to identify and describe all protein-protein interactions (PPIs) and protein complexes in a cell. The maturity of high-throughput experimental techniques and computational prediction, such as two-hybrid assay, mass spectrometry experiments, and the protein chip technology, make it possible to construct large-scale protein-protein interaction networks (PPINs) of many species. Many protein complexes are well understood, particularly in the model organism Saccharomyces cerevisiae (a strain of yeast). For this relatively simple organism, the study of protein complexes is now being performed genome wide and the elucidation of most protein complexes of the yeast is undergoing. Predicting protein complex from protein interaction networks is one of the most challenges but a fundament to analyze tissue and protein functionality [2–4]. Graph theory has become a powerful research tool for analyzing PPINs [5]. In graphs which are used to represent PPINs, proteins are represented by vertexes or nodes, and PPIs are represented by edges. Computation methods based on graph theory are also used to predict protein complex [6].

So far, many clustering methods are developed for identifying proteins complexes in PPINs [2–5, 7–13, 13–17, 17–28]. Dense sub-graph based methods and hierarchy algorithms are two representative categories. Based on the assumption that the members in the same protein complex and functional module strongly bind each other, a cluster can be referred as a densely connected subgraph within a PPIN. The density (*d*) of a subgraph with *n* vertices and *m* edges is generally defined as *d*=2*m*/(*n*(*n*−1)) [9]. Maximal cliques are used in several algorithms to identify protein complexes [9, 14, 22, 24, 29]. Spirin et al. [9] employ *d* = 1 to identify the maximal cliques as protein complexes. An algorithm named Clique Percolation Method (CPM) [29] is proposed by uniting the maximal cliques with *k*−1 common nodes, and the well-known protein network analysis tools CFinder [14] is developed on it. The maximal cliques in CMC [22] are generated from a weighted PPIN, and then combined or removed based on connectivity and overlapping rate. Wang et al [24] propose a new topological model by extending the definition of *k*-clique community of algorithm CPM and introducing distance restriction, and develop a novel algorithm called CP-DR based on the new topological model to identify protein complexes. Some other dense sub-graph based complex detection algorithms follow a “seed and extension” paradigm, such as MCODE [30], Density-Periphery based graph clustering algorithm (DPClus) [16], IPCA [4], and Core-Attachment method [21], using different mechanisms of seeds selection, cluster expansion and stop conditions to detect protein complexes. Recently, entropy-based graph clustering methods are also applied on PPINs to detect dense sub-graphs as protein complexes [27, 28]. Hierarchical clustering algorithms are based on similarity or distance to identify protein complexes, with the idea that the majority of proteins within a same protein complex tend to have similar or identical functions [31]. The similarity or distance between any two proteins is defined as the possibility of the two proteins in the same functional module. The most classic hierarchical clustering method is GN algorithm [8]. MoNet algorithm [19] is a typical coagulation, derived from the GN algorithm. HCS algorithm [32] is used to analyze the protein modular structure based on graph connectivity. HC-PIN method [26] uses the weighted edge clustering coefficient to perform fast hierarchical clustering.

Besides the topological characteristics or the functional similarity features of protein complexes, some researchers try to reveal the mechanism of protein complex formation. Focusing on the dynamic formation of protein complexes, De Lichtenberg et al. [33] construct time-dependent PPIN of Yeast by integrating PPIN and gene expression data. By analyzing the dynamics of protein complexes during the yeast cell cycle, they discover that most complexes are constituted by both periodically and constitutively expressed proteins, which suggests a mechanism of just-in-time assembly. Based on gene expression data, Komurov and White [34] also classify proteins of eukaryotic into periodically expressed proteins and constitutively expressed proteins. However, by analyzing the topology of the two classes of proteins, they find that most functional modules are consisted by proteins from one class (periodically expressed proteins or constitutively expressed proteins), seldom from both of the two classes. The different conclusions drawn from De Lichtenberg and Komurov might be caused by the difference of data sources and the considered module types.

A protein is active when it is in its active form, and it can interact with other active proteins and perform function [35]. During the formation of a complex, at one time point the co-active proteins will assemble together, and at the next time point the new co-active proteins will be added in, therefore the complex can be assembled step by step following a just-in-time way. In this paper, a protein complex formation model is presented based on the just-in-time mechanism. Relied on the protein complex formation model and combined with gene expression data, an investigation is carried out on 408 known complexes of yeast [36] and the protein complexes identified by existing methods. Based on the protein complex formation model and the properties of known complexes, we propose an effective method to refine the complexes predicted by existing methods. Several clustering algorithms are applied on yeast PPIN to predict protein complexes. The *speciality*, *sensitivity*, *f*-measure and other evaluation metrics are compared between the original predictions and the refined predictions.

## Methods

In this section, we first introduce our former study on protein activity, which is deduced from gene expression data. Later on, based on the active time points of proteins and the protein complex formation model, we investigate the formation of each known complex. These analyses are also carried out on the complexes predicted by existing methods. Based on the difference between known complexes and predicted complexes, an effective method is introduced to refine the complexes predicted by existing methods.

### Material

The PPI data of yeast is downloaded from DIP [37] updated to Feb. 28, 2012. The Database of Interacting Proteins (DIP) is a database that provides species specific subsets, which contains all the interactions of proteins from a particular species. In our experiment, the self-interactions and repeated ones in the original PPIN are discarded. The final PPIN used in our experiment contains 5023 proteins and 22570 interactions.

We use GSE3431 [38] in GEO to extract active time points of each protein, which is an expression profile of yeast by array affymetrix gene expression data over three successive metabolic cycles. The overall design of this expression experiment is 12 time intervals per cycle, and 25 minutes per time interval. Thus each gene has gene expression values (levels) at 12 time points in each cycle. In our method, one cycle with average expression value at every time point of three successive cycles is used to reduce the noise and error.

### Protein active information

*p*based on its algorithmic mean and variance can be calculated.

In Equations (1) and (2), *n* is the number of time points in a cell cycle, and *E* *V*_{
i
}(*p*) is the expression value of *p* at time point *i*. In Equation (4), *S*_{1}(*p*)=*u*(*p*) and *S*_{2}(*p*)=*u*(*p*)+3*σ*(*p*). The detail about the active threshold principle (3-sigma principle) is presented in [39].

*p*,

*u*(

*p*) is the algorithmic mean of its expression values and

*σ*(

*p*) is the standard variance of its expression values.

*F*(

*p*) reflects the fluctuation of its expression curve. The higher Standard Variance, and the smaller the

*F*(

*p*). As shown in Equation (4), the active threshold of

*p*is determined by both algorithmic mean and 3-sigma. If the fluctuation of expression values is low, corresponding to small

*σ*(

*p*),

*S*

_{1}(

*p*) plays a more important influence on the active threshold. Reversely,

*S*

_{2}(

*p*) plays a more important role to determine the active threshold. Protein

*p*is considered as active at some time points only when the expression values of these time points are above its active threshold

*A*

*c*

*t*

*i*

*v*

*e*_

*T*

*H*(

*p*).

*Active*(

*p*) is an active time point set of protein

*p*, which contains the time points when protein

*p*is active, defined as Equation 5.

In our experiments, although 96% of the proteins in the yeast PPIN from DIP can be covered by gene products in this gene expressing profile, the active time points of a small portion of proteins cannot be deduced from this gene expression data. The active information of 1142 proteins cannot be inferred from their expression curves in this gene expression profile, and 177 proteins have no expression values in this gene expression data. These proteins might be active in other gene expression experiments, since the interval between two time points in GSE3431 is considerably long and some proteins are active only under special environments. If the active time points of protein *p* cannot be inferred from this gene expression profile, a special active time point “0” is used. That is to say *Active* (*p*)={0} in this case.

### Protein complex formation mechanism and protein activity

*C*can be formed in a continuous time point set. A complex formation model based on the just-in-time mechanism can be illustrated as follows. Suppose

*A*

*P*

_{ i }(

*C*) (

*i*=0,…,

*n*) is the set of proteins which are active at time point

*i*and belong to the complex

*C*, where

*n*is the number of time points in a cell cycle.

*C*can be formed in a continuous time point set [

*S*,

*S*+

*K*], if the boolean function

*F*

*C*(

*C*,

*S*,

*K*) is true. If the follow conditions are satisfied,

*F*

*C*(

*C*,

*S*,

*K*) is true, otherwise it is false.

- (1), for$\left|\left(\bigcup _{i=S}^{S+j}A{P}_{i}\right(C\left)\right)\cap A{P}_{S+j+1}\left(C\right)\right|>0$
*j*=0,…,*K*−1(2)$\left|\bigcup _{i=S}^{S+K}A{P}_{i}\left(C\right)\right|=\left|C\right|$

*S*,

*S*+

*K*] for complex

*C*that can make

*F*

*C*(

*C*,

*S*,

*K*) be true, we say that the complex can be formed in the continuous active time point set. If there exists no interval for complex

*C*that can make

*F*

*C*(

*C*,

*S*,

*K*) be true, we say that the complex cannot be formed in a continuous active time point set. Combined with the active time point set of each protein and this model, a statistic is carried on the 408 known protein complexes of Yeast [36] and the complexes predicted by representative algorithms to calculate the percentage of complexes which can be formed in a continuous active time point set. The comparison is listed in Table 1. About 63.4% known complexes can be formed in a continuous time point set, while only a small portion of predicted complexes can be formed in a continuous time point set.

**The comparison of average co-active rate and the percentage of complexes formed in a continuous time point set of the known complexes and complexes predicted by algorithms**

Complexes | Average | Formed |
---|---|---|

( | complexes(%) | |

Known Complexes | 0.595 | 63.4% |

Predicted Complexes(CMC) | 0.210 | 23.0% |

Predicted Complexes(DPClus) | 0.329 | 36.5% |

Predicted Complexes(IPCA) | 0.269 | 29.3% |

Predicted Complexes(CPM) | 0.335 | 37.0% |

Predicted Complexes(MCL) | 0.252 | 28.8% |

Predicted Complexes(Core) | 0.265 | 29.7% |

*C*, when

*F*

*C*(

*C*,

*S*,

*K*) is true, the overlapping rate of time point

*S*+

*i*is the fraction of the number of common members in

*A*

*P*

_{S+i}(

*C*) and the protein set assembled in time points [

*S*,

*S*+

*i*−1] to the minimum size of two sets, defined as Equation (6). The overlapping of complex

*C*during the formation interval [

*S*,

*S*+

*K*] is the average of the overlapping rates of time points from

*S*+1 to

*S*+

*K*, defined as Equation (7). The average overlapping of the known complexes is calculated, so do the complexes predicted by representative algorithms, as shown in Table 1. We can observe that the average overlapping of the known complexes is above 0.5, and that of the complexes predicted by each algorithm is significantly lower. Therefore the property that protein complexes are formed step by step is neglected and the predicted complexes contain spurious proteins. Based on these shortcomings, we can refine the predicted complexes.

### Refining method

To improve the accuracy of complex prediction, we propose a method to refine the protein complexes predicted by existing methods based on protein activity and the protein complex formation model with the just-in-time mechanism.

#### The framework of the method

*OV*(

*C*

_{i,k},

*C*

_{j,l}), is defined as the fraction of the number of common members in the two clusters to the minimum size of two clusters, shown as Equation (8), where

*C*

_{i,k}is a sub-cluster

*k*at time point

*i*, and

*C*

_{j,l}is a sub-cluster

*l*at time point

*j*. Based on the overlapping of sub-clusters at adjacent time points, a protein complex can be formed step by step. Only the complexes, whose size is greater than 2, will be refined, because the complex with size=2 has only two possible forms after refinement, two singletons or the same complex with no change. The refining method has two steps, including splitting and assembling. The overall framework of the refining method is shown in Figure 1. Firstly, a clustering method is applied on a PPIN to predict complexes. Then, the splitting and assembling processes are consecutively carried out on each predicted complex.

#### Splitting of protein complex

*n*+1 time points, where

*n*is the number of time points in a gene expression profile and an additional special time point ‘0’ is used to contain the proteins whose

*Active*(

*p*)={0}. For each active time point, the proteins which are not active at the time point are excluded, thus the complex can be split into several connected components (sub-clusters) at the time point. In each component, all proteins are active at the time point.

The detail of the splitting algorithm is shown in Algorithm 1. The inputs of this algorithm are an original complex which contains proteins and interactions derived from PPIN, and the active time point set of the complex. For each active time point, the proteins which are not active at this time point will be removed from the original complex, and several sub-clusters are generated based on PPIN topology.

### Algorithm 1 Splitting

After splitting, a complex will become many sub-clusters on its whole active time point set. Most sub-clusters are smaller than the original complex. Some sub-clusters may be singletons, some sub-clusters may be identical ones, and some ones are high-overlapping with each other.

#### Assembling of protein complex

Based on the just-in-time mechanism, the assembling of a protein complex is to assemble the sub-clusters with certain overlapping at adjacent time points. The core idea of assembling is to combine the sub-clusters from adjacent time points that satisfy *O* *V*(*C*_{i,k},*C*_{i+1,l})≥*T*, where *T* is a threshold to discriminate high overlapping from low overlapping. Active time point ‘0’ is used to denote a special active time point for proteins whose active time points cannot be inferred from the current gene expression profile. These proteins have the potential to be active at some time points when special conditions are prepared. Thus, the sub-clusters at ‘0’ time point can combine with sub-clusters at arbitrary time points with overlapping great than *T*.

The description of assembling algorithm is shown in Algorithm 2. Firstly, let the sub-clusters at time point ‘0’ combine with sub-clusters at other time points. If there exists *C*_{i,k} at time point *i* and *C*_{0,l} at time point ‘0’ satisfying *O* *V*(*C*_{0,l},*C*_{i,k})≥*T*, let *C*_{i,k} be the union of *C*_{0,l} and itself. After all sub-clusters at each time point have been checked, delete the sub-clusters in the set at time point ‘0’ which have been combined. Secondly, the sub-clusters at time point *i* and the adjacent time point *i*+1 are checked. For each *C*_{i,k} at time point *i*, if there exists *C*_{i+1,l} that satisfies *O* *V*(*C*_{i+1,l},*C*_{i,k})≥*T*, let *C*_{i+1,l} be the union of *C*_{i,k} and itself. After all the sub-clusters in the set at time point *i* set have been checked, the sub-clusters at time point *i* are deleted if they have been combined. Finally, the left sub-clusters at each time point are considered as new protein complexes. These new protein complexes are assembled according to the time order, connectivity and the overlapping of active proteins. After the assembling process, an original protein complex might become several new protein complexes, or be discarded. New protein complexes tend to be smaller than the original ones. After all original predicted protein complexes are undertaken the splitting and assembling steps, the identical ones in the final results are deleted.

### Algorithm 2 Assembling

*t*

_{1}”, protein A, C and E are active and connective with each other, thus they can form a sub-cluster. Protein A, B and E are active at time point “

*t*

_{2}” and form a sub-cluster. At time point “

*t*

_{3}”, only protein B is active. At time point “

*t*

_{4}”, protein B and D are active, but not connective with each other, thus each of them becomes singletons. By combining the high-overlapping sub-clusters in adjacent time points, a new protein complex is generated, which contains A, B, C, and E, as shown in Figure 2(c).

### Evaluation metrics

*OS*(

*Pc*,

*Kc*), shown in Equation (10) is used to assess how effectively a predicted complex

*Pc*matches a known complex

*Kc*.

where |*V*_{
p
c
}| is the number of proteins in the predicted complex and |*V*_{
K
c
}| is the number of proteins in the known complex. If a predicted complex *Pc* that has no common proteins with a known complex *Kc*, then *OS*(*Pc*, *Kc*)=0. Usually, a predicted complex and a known complex are considered as a match if their overlapping *OS* is no less than 0.2 [16, 26, 30]. If we say a predicted complex *Pc* perfectly matches a known complex *Kc*, it means all proteins appears in *Pc* are equal with that in *Kc*, and *OS*(*Pc*, *Kc*)=1. *PC* is the number of the predicted complexes. *MKC* is the number of known complexes that are matched by predicted complexes and *MPC* is the number of predicted complexes that match the known complexes, when a threshold value of *OS* is adopted. *MKC* illustrates the ability to predict complex.

*Sn*) is the fraction of the known complexes that are matched by the predicted complexes (

*O*

*S*≥0.2) among all the known complexes [30], shown in Equation (11). Specificity (

*Sp*) is the fraction of the predicted complexes that match the known complexes (

*O*

*S*≥0.2) to the total number of the predicted complexes [30], defined in Equation (12).

where *TP* (True Positive) is the number of the predicted complexes that match the known complexes (*OS*(*Pc*, *Kc*) ≥ 0.2), *FP* (False Positive) is the number of the predicted complexes that don’t match the known complexes, and *FN* (False Negative) is the number of the known complexes that are not matched by any predicted complexes.

## Results and discussion

In order to evaluate the efficiency of the refining method, we apply it to refine the protein complexes predicted by six representative clustering algorithms. DPClus and IPCA are density-based local search algorithms [16, 26]. Clique Percolation Method (CPM) is a powerful algorithm to find protein complexes[29] and MCL is a fast and highly scalable clustering algorithm for networks based on stochastic flow [40, 41]. CMC [22] and Core-Attachment [21] are the latest ones for detecting community structures. We use the parameters recommended by their authors in these algorithms.

In the experiments, the size of predicted complexes which need to be refined should be not smaller than 3. According to the analysis of known complexes, the average co-active rate of the known complexes is above 0.5, thus in our refined method the *OV*’s threshold *T* is set as 0.6. The impact of the varying of *T* on the accuracy of protein complexes prediction is analyzed in subsection “Analysis of parameter *T*”.

### Comparison with known complexes

For convenience sake, the complexes predicted by each clustering method are mentioned as original complexes, the complexes refined by our method are referred as refined complexes, denoted as algorithm_O and algorithm_R in tables and figures, respectively. The original complexes and refined complexes of each algorithm are compared with the known protein complexes obtained from the literature published in Nucleic Acids Research [36] separately. There are 408 manually annotated complexes which are considered as the gold standard data and of which each consists of two or more proteins.

*T*=0.6) that they cannot be combined with each other. The average sizes of the original complexes predicted by CMC, IPCA and DPClus are relatively small, and after refinement the average sizes of the refined complexes are smaller than 3.

**The average size of the complexes predicted by algorithms before and after refinement**

Algorithm |
| Average Size |
---|---|---|

CMC_O | 1369 | 3.45 |

CMC_R | 1532 | 2.48 |

DPClus_O | 383 | 3.98 |

DPClus_R | 578 | 2.76 |

IPCA_O | 1650 | 3.72 |

IPCA_R | 1843 | 2.67 |

CPM_O | 197 | 13.30 |

CPM_R | 346 | 17.48 |

MCL_O | 621 | 6.71 |

MCL_R | 877 | 3.62 |

Core_O | 675 | 6.10 |

Core_R | 1025 | 3.49 |

*MKC*of three algorithms after refinement are improved significantly under different

*OS*threshold values, compared with those of the original complexes. It demonstrates that the refining method can discard the spurious proteins by protein activity and generate new complexes by just-in-time assemble mechanism, which can enhance the ability to predict complex.

*MKC*before and after refinement of the algorithms with relatively small average sizes of predicted complexes are compared with respect to different

*OS*threshold values. When the value of the

*OS*threshold is changed from 0.7 to 1,

*MKC*after refined is improved. Because these three methods prefer to find small complexes, and the average sizes of the refined complexes are smaller than 3, which makes the overlapping score between the refined complexes and the known complexes mainly fall in [0.7, 1], in which

*MKC*will increase after refinement. Therefore, for the algorithms that predict relatively small complexes, the refining method can make the prediction more precise.

### Analysis of sensitivity, specificity and *f*-measure

*Sn*of refined complexes of each algorithm is higher than that of the original ones. The improvements of

*Sn*of most algorithms are above 15%, except for IPCA, which is about 9%. The potential ability to identify more known complexes is available in each algorithm. The room for the improvement of

*Sn*is determined by the number and the average size of the original complexes predicted by an algorithm. If an algorithm prefers to predict a large number of complexes with small average size, the room for the improvement of

*Sn*is very limited, since a very small number of new different complexes will be generated from the small original complexes. The number of complexes predicted by IPCA is the largest, and the average size is relatively small, thus the space to discard spurious proteins and identify more known complexes by the refining method is limited. If an algorithm tends to identify complexes with relative larger average size, there is great room for improvement, because more new different complexes will be reassembled by the refinement, such as Core-Attachment. As shown in Figure 6(b), the

*Sp*of each algorithm after refinement is also boosted, and they are about 12% in CMC, 8% in CPM, 5% in DPClus, 6% in IPCA, 6% MCL, and 10% in Core-Attachment, respectively. It means although the number of refined complexes of most algorithms are increased, the percentages of

*MPC*are also increased. The

*f*-measure which is based on the increased

*Sn*and

*Sp*is also improved largely after refinement, and the average improvement is about 12%. The

*f*-measure of each algorithms after refinement is enhanced by about 15% in CMC, 16% in CPM, 10% in DPClus, 8% in IPCA, 10% MCL, and 14% in Core-Attachment, respectively, as shown in Figure 6(c). These improvements validate the efficiency of the refining method, which can improve the ability of each algorithm to predict more

*MKC*and

*MPC*.

### Analysis of parameter *T*

*OV*’s threshold

*T*is set as 0.6, while the average co-active rate in the known protein complexes is larger than 0.5. In this subsection, we will discuss the impact of different values of

*T*on the prediction results. Actually,

*T*can vary in the range of [0,1]. In Figure 7, the

*f*-measure of each algorithms varies with different values of

*T*. When

*T*=0, the refined prediction results are the original ones. It is very easy to find out that, the

*f*-measure of six algorithms are enhanced when

*T*>0. We can observe that for all the six algorithms, when

*T*is in the range of [0.1, 0.5], the changes of

*f*-measure is insignificant. Because certain overlapping exists in the sub-clusters and is relatively easy to satisfy the combination condition when

*T*is relatively small, the difference between the original ones and the refined ones is mainly the deletion of some proteins. The

*PC*are not significantly increased, so are the

*MPC*and

*MKC*, which in turns have an influence on the

*Sn*,

*Sp*and

*f*-measure. When

*T*is in the range of (0.5, 1.0], the combination condition is more critical. It makes hard to combine sub-clusters, which makes the prediction more precisely. Thus

*PC*is significantly increased, so are the

*MPC*and

*MKC*. For most algorithms, the

*f*-measures are also significantly increased in this region, except for IPCA. As shown in Figure 7, the

*f*-measure of IPCA after refinement is very flat in the range of [0.1, 1]. This is because the average size of the original complexes predicted by IPCA is relatively small, and the number of the original complexes is very large. On one had, it has tried its best to identify known complexes; On the other hand, the large number of original complexes with small average size limits the spaces to increase

*MPC*,

*MKC*,

*Sn*,

*Sp*and

*f*-measure. Thus, for most algorithms, the value of

*OV*’s threshold

*T*is recommended in the range of [0.5, 1.0].

## Conclusion

Based on the just-in-time mechanism, the protein complex formation model is presented. Furthermore, we analyze the known protein complexes based on the complex formation model and combined with gene expression data, and find out that most complexes can be formed in a continuous time point set and the average overlapping rate of the known complexes during the formation is larger than 0.5. For the complexes which are predicted by clustering methods, only a small portion of them can be formed in a continuous time point set, and the average overlapping rates during the formation are significant lower than that of the known complexes. This paper proposes a method to refine the predicted complexes based on the protein activity and the complex formation model. The refining method contains two steps, splitting and assembling. To evaluate the refining method, we apply it to six algorithms which prefer to predict complexes with different sizes. Through the comparison of the *MKC*, *f*-measure, *Sn*, and *Sp* of original complexes and the refined ones, the results show that the *MKC*, *f*-measure, *Sn*, and *Sp* of each algorithm are significantly improved after refinement. Furthermore, it is easy to find out that the performance of algorithms which predict complexes with relative large average size has been greatly improved by our refinement method.

## Declarations

### Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant No.61232001, No. 61128006, No. 61073036.

## Authors’ Affiliations

## References

- Barabási A, Oltvai Z: Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004, 5 (2): 101-113. 10.1038/nrg1272.PubMedView ArticleGoogle Scholar
- Garrels J: Yeast genomic databases and the challenge of the post-genomic era. Funct & Integr genomics. 2002, 2 (4): 212-237.View ArticleGoogle Scholar
- Gavin A, Superti-Furga G, et al: Protein complexes and proteome organization from yeast to man. Curr Opin Chem Biol. 2003, 7: 21-10.1016/S1367-5931(02)00007-8.PubMedView ArticleGoogle Scholar
- Li M, Chen J, Wang J, Hu B, Chen G: Modifying the DPClus algorithm for identifying protein complexes based on new topological structures. BMC bioinformatics. 2008, 9: 398-10.1186/1471-2105-9-398.PubMedPubMed CentralView ArticleGoogle Scholar
- Amoutzias G, de Peer YV: Single-gene and whole-genome duplications and the evolution of protein-protein interaction networks. Evolutionary Genomics and Systems Biology. Edited by: Gustavo Caetano-Anollés. 2010, Hoboken: Wiley-Blackwell, 413-429.View ArticleGoogle Scholar
- Wang J, Li M, Deng Y, Pan Y: Recent advances in clustering methods for protein interaction networks. BMC genomics. 2010, 11 (Suppl 3): S10-10.1186/1471-2164-11-S3-S10.View ArticleGoogle Scholar
- Hartuv E, Shamir R: A clustering algorithm based on graph connectivity. Inf Process Lett. 2000, 76 (4): 175-181.View ArticleGoogle Scholar
- Girvan M, Newman ME: Community structure in social and biological networks. Proceedings of the National Academy of Sciences. 2002, 99 (12): 7821-7826. 10.1073/pnas.122653799.View ArticleGoogle Scholar
- Spirin V, Mirny L: Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences. 2003, 100 (21): 12123-12128. 10.1073/pnas.2032324100.View ArticleGoogle Scholar
- King A, Pržulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics. 2004, 20 (17): 3013-3020. 10.1093/bioinformatics/bth351.PubMedView ArticleGoogle Scholar
- Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D: Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America. 2004, 101 (9): 2658-2663. 10.1073/pnas.0400054101.PubMedPubMed CentralView ArticleGoogle Scholar
- Newman M: Fast algorithm for detecting community structure in networks. Phys Rev E. 2004, 69 (6): 066133-View ArticleGoogle Scholar
- Veraksa A, Bauer A, Artavanis-Tsakonas S: Analyzing protein complexes in Drosophila with tandem affinity purification–mass spectrometry. Dev Dyn. 2005, 232 (3): 827-834. 10.1002/dvdy.20272.PubMedView ArticleGoogle Scholar
- Adamcsek B, Palla G, Farkas I, Derényi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006, 22 (8): 1021-1023. 10.1093/bioinformatics/btl039.PubMedView ArticleGoogle Scholar
- Li X, Tan S, Foo C, Ng S, et al: Interaction graph mining for protein complexes using local clique merging. GENOME INFORM SER. 2005, 16 (2): 260-269.Google Scholar
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC bioinformatics. 2006, 7: 207-219. 10.1186/1471-2105-7-207.PubMedPubMed CentralView ArticleGoogle Scholar
- Rungsarityotin W, Krause R, Schödl A, Schliep A: Identifying protein complexes directly from high-throughput TAP data with Markov random fields. BMC bioinformatics. 2007, 8: 482-10.1186/1471-2105-8-482.PubMedPubMed CentralView ArticleGoogle Scholar
- Cho Y, Hwang W, Ramanathan M, Zhang A: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC bioinformatics. 2007, 8: 265-10.1186/1471-2105-8-265.PubMedPubMed CentralView ArticleGoogle Scholar
- Luo F, Yang Y, Chen C, Chang R, Zhou J, Scheuermann R: Modular organization of protein interaction networks. Bioinformatics. 2007, 23 (2): 207-214. 10.1093/bioinformatics/btl562.PubMedView ArticleGoogle Scholar
- Zhang S, Wang R, Zhang X: Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A: Stat Mech Appl. 2007, 374: 483-490. 10.1016/j.physa.2006.07.023.View ArticleGoogle Scholar
- Leung H, Xiang Q, Yiu S, Chin F: Predicting protein complexes from PPI data: a core-attachment approach. J Comput Biol. 2009, 16 (2): 133-144. 10.1089/cmb.2008.01TT.PubMedView ArticleGoogle Scholar
- Liu G, Wong L, Chua H: Complex discovery from weighted PPI networks. Bioinformatics. 2009, 25 (15): 1891-1897. 10.1093/bioinformatics/btp311.PubMedView ArticleGoogle Scholar
- Lancichinetti A, Fortunato S, Kertész J: Detecting the overlapping and hierarchical community structure in complex networks. New J Phys. 2009, 11 (3): 033015-10.1088/1367-2630/11/3/033015.View ArticleGoogle Scholar
- Wang J, Liu B, Li M, Pan Y: Identifying protein complexes from interaction networks based on clique percolation and distance restriction. BMC Genomics. 2010, 11 (Suppl 2): S10-10.1186/1471-2164-11-S2-S10.View ArticleGoogle Scholar
- Jiang P, Singh M: SPICi: a fast clustering algorithm for large biological networks. Bioinformatics. 2010, 26 (8): 1105-1111. 10.1093/bioinformatics/btq078.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang J, Li M, Chen J, Pan Y: A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networksDavid00. Comput Biol Bioinformatics, IEEE/ACM Transactions on. 2011, 8 (3): 607-620.View ArticleGoogle Scholar
- Kenley E, Cho Y: Entropy-based graph clustering: Application to biological and social networks. Data Mining (ICDM), 2011 IEEE 11th International Conference on. 2011, : IEEE, 1116-1121.View ArticleGoogle Scholar
- Chen B, Shi J, Zhang S, Wu F: Identifying protein complexes in protein–protein interaction networks by using clique seeds and graph entropy. Proteomics. 2013, 13 (2): 269-277. 10.1002/pmic.201200336.PubMedView ArticleGoogle Scholar
- Palla G, Derényi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005, 435 (7043): 814-818. 10.1038/nature03607.PubMedView ArticleGoogle Scholar
- Bader G, Hogue C: An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics. 2003, 4: 2-10.1186/1471-2105-4-2.PubMedPubMed CentralView ArticleGoogle Scholar
- Eisenberg D, Marcotte E, Xenarios I, Yeates T, et al: Protein function in the post-genomic era. NATURE-LONDON-. 2000, 405: 823-826. 10.1038/35015694.PubMedView ArticleGoogle Scholar
- Pržulj N, Wigle D, Jurisica I: Functional topology in a network of protein interactions. Bioinformatics. 2004, 20 (3): 340-348. 10.1093/bioinformatics/btg415.PubMedView ArticleGoogle Scholar
- de Lichtenberg U, Jensen L, Brunak S, Bork P: Dynamic complex formation during the yeast cell cycle. Sci Signal. 2005, 307 (5710): 724-727.Google Scholar
- Komurov K, White M: Revealing static and dynamic modular architecture of the eukaryotic protein interaction network. Mol Syst Biol. 2007, 3: 110-PubMedPubMed CentralView ArticleGoogle Scholar
- Petsko G, Ringe D: Protein structure and function. 2004, : Sinauer Associates IncGoogle Scholar
- Pu S, Wong J, Turner B, Cho E, Wodak S: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009, 37 (3): 825-831. 10.1093/nar/gkn1005.PubMedPubMed CentralView ArticleGoogle Scholar
- Xenarios I, Salwinski L, Duan X, Higney P, Kim S, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30: 303-305. 10.1093/nar/30.1.303.PubMedPubMed CentralView ArticleGoogle Scholar
- Tu B, Kudlicki A, Rowicka M, McKnight S: Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science. 2005, 310 (5751): 1152-1158. 10.1126/science.1120499.PubMedView ArticleGoogle Scholar
- Wang J, Peng X, Li M, Pan Y: Construction and application of dynamic protein interaction network based on time course gene expression data. Proteomics. 2013, 13 (2): 301-312. 10.1002/pmic.201200277.PubMedView ArticleGoogle Scholar
- van Dongen S: Graph clustering by flow simulation. PhD thesis. University of Utrecht 2000
- Enright A, Van Dongen S, Ouzounis C: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30 (7): 1575-1584. 10.1093/nar/30.7.1575.PubMedPubMed CentralView ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.