Identifying protein complexes based on density and modularity in protein-protein interaction network

Background Identifying protein complexes is crucial to understanding principles of cellular organization and functional mechanisms. As many evidences have indicated that the subgraphs with high density or with high modularity in PPI network usually correspond to protein complexes, protein complexes detection methods based on PPI network focused on subgraph's density or its modularity in PPI network. However, dense subgraphs may have low modularity and subgraph with high modularity may have low density, which results that protein complexes may be subgraphs with low modularity or with low density in the PPI network. As the density-based methods are difficult to mine protein complexes with low density, and the modularity-based methods are difficult to mine protein complexes with low modularity, both two methods have limitation for identifying protein complexes with various density and modularity. Results To identify protein complexes with various density and modularity, including those have low density but high modularity and those have low modularity but high density, we define a novel subgraph's fitness, fρ, as fρ= (density)ρ*(modularity)1-ρ, and propose a novel algorithm, named LF_PIN, to identify protein complexes by expanding seed edges to subgraphs with the local maximum fitness value. Experimental results of LF-PIN in S.cerevisiae show that compared with the results of fitness equal to density (ρ = 1) or equal to modularity (ρ = 0), the LF-PIN identifies known protein complexes more effectively when the fitness value is decided by both density and modularity (0<ρ<1). Compared with the results of seven competing protein complex detection methods (CMC, Core-Attachment, CPM, DPClus, HC-PIN, MCL, and NFC) in S.cerevisiae and E.coli, LF-PIN outperforms other seven methods in terms of matching with known complexes and functional enrichment. Moreover, LF-PIN has better performance in identifying protein complexes with low density or with low modularity. Conclusions By considering both the density and the modularity, LF-PIN outperforms other protein complexes detection methods that only consider density or modularity, especially in identifying known protein complexes with low density or low modularity.


Background
Identifying protein complex is important in understanding the cellular organizations and functional mechanisms. However, the experimental methods to discover protein complexes are costly and time-consuming. Fortunately, with the development of high-throughput techniques such as yeast-two-hybrid [1], mass spectrometry [2], and protein chip technologies [3], protein-protein interactions (PPIs) are increasing fast and available conveniently, which results that large PPI networks for various species can be downloaded easily from public biological databases such as DIP [4], MIPS [5] and SGD [6]. Furthermore, many evidences have indicated that PPI network is a "small-world" network [7,8]. Cliques and dense subgraphs in it generally correspond to the protein complex [9][10][11][12][13]. Thus, a series of protein complexes detection methods are emerged based on mining dense subgraphs in PPI network and namely density-based methods [14][15][16][17][18][19][20][21][22][23].
Density-based methods, such as CPM [15,16], CP-DR [17], CMC [18], DPClus [19], IPCA [20], SPICi [21], and Core-Attachment [22], identify protein complexes in PPI networks based on detecting cliques or dense subgraphs. For example, CPM proposes a clique percolation method to mine adjacent k-cliques chains as protein complexes. CP-DR modified CPM by adding distance restriction. CMC algorithm first generates a weighted PPI network by an iterative scoring method and then identifies protein complexes by removing or merging highly overlapped maximal cliques of this weighted PPI network based on their interconnectivity. DPClus, IPCA and SPICi are all "seed-expanding" methods, which identify protein complexes by expanding seeds to density clusters by recursively adding the qualifying neighbours. Core-Attachment algorithm first mines complex core as dense subgraph and then identifies protein complex with its core and attachments separately.
The density-based methods can identify known protein complexes with high density effectively, but they will ignore the protein complexes with low density. However, many protein complexes are not dense subgraphs. For example, out of the 408 known protein complexes of S.cerevisiae which are provided by Pu S et al. in [24], 89 complexes have density lower than 0.5. Moreover, when they mining protein complexes, most of these methods may neglect many peripheral proteins that connect to the core protein clusters with few links, even though these peripheral proteins are also very important to the protein complex.
To solve this problem, many researchers investigated topologies of protein complexes in PPI networks and found that many protein complexes are densely connected within themselves but sparsely connected with the rest of the PPI network [9][10][11][12][13]. Thus, Radicchi et al. [25] proposed in-degree and out-degree of nodes in a subgraph to describe the connections within the subgraph and the connections of the subgraph with the rest of the graph. Radicchi et al. [25] and Li et al. [26,27] considered the modularity of a subgraph as the sum of the in-degree of all its vertices, divided by the sum of the out-degree of all its vertices. Radicchi et al. defined the weak module as a subgraph whose modularity is more than 1, and Luo et al. [13] proposed a hierarchical clustering algorithm, Monet, to identify protein complexes as weak modules. Li et al. defined the l-module as a subgraph whose modularity is more than the given l value, and proposed a fast hierarchical clustering algorithm, FAG-EC, to identify protein complexes as l-module. Wang et al. [28] modified FAG-EC and proposed HC-PIN algorithm to identify protein complexes in weighted PPI network. Based on l-module, Ren et al. [29] proposed MOMA algorithm and Wang et al. [30] proposed OH-PIN algorithm to identify overlapping and hierarchical protein complexes in PPI network. Lancichinetti et al.. [31] defined a fitness function f of a subgraph as the sum of the in-degree of all its vertices, divided by the sum of the degree (the sum of in-degree and out-degree) of all its vertices, and proposed NFC algorithm to identify protein complexes as subgraphs with the local maximum fitness value in PPI network. Obviously, a subgraph's fitness value has positive correlation with its modularity. Wang et al. [32] modified NFC algorithm by using essential proteins as seeds and proposed EPOF algorithm.
All the above algorithms identify protein complexes as subgraphs in PPI network with high modularity. So they are considered as modularity-based methods. These modularity-based methods can identify protein complexes with different densities, but they usually ignore protein complexes with low modularity. However, many protein complexes are dense subgraphs with low modularity. For example, out of the 408 known protein complexes of S.cerevisiae, 254 complexes have l value (modularity) lower than 0.5.
Conclusion above, both density-based methods and modularity-based methods have limitation. Densitybased methods may neglect protein complexes with low density and modularity-based methods may neglect protein complexes with low modularity. To identify both of these two kinds of protein complexes, dense subgraphs and subgraphs with high modularity, we define the subgraph's fitness by considering both the density and the modularity of a subgraph, and propose a novel algorithm, named LF_PIN, to identify protein complexes by extending each seed edge to a subgraph until its fitness reaches the local maximum value. LF_PIN chooses seed edges according to the edge clustering value because we find that the higher clustering value a PPI has, the more likely it is to be in a protein complex. The experimental results of S.cerevisiae and E.coli show that LF_PIN outperforms the other competing algorithms in terms of matching with known protein complexes and functional enrichment. Moreover, it can identify known protein complexes with low density or low modularity effectively.

Density, modularity and fitness
Dense subgraphs and modules in PPI network generally correspond to protein complexes. As dense subgraphs may have low modularity and modules may have low density, protein complexes have various values of modularity and density in PPI network. So we need to define a criterion to predict protein complexes with different topology, including those with low density but high modularity and low modularity but high density. To do it, we define a subgraph's fitness by considering both density and modularity and propose a novel protein complex model as a subgraph with the local maximum fitness value in PPI network.
A weighted PPI network is considered as an undirected weighted graph G = (V,E,W), where each vertex v V represents a protein, each edge <u,v> E represents an interaction between protein u and v, and each weight w u,v W represents the weight of an interaction between protein u and v. For an undirected weighted graph G, the density of a subgraph H (H⊆ G), donated as q H , is defined as: where m H and n H are the number of edges and vertices in H respectively.
For a vertex v in a subgraph H of an undirected weighted graph G, its weighted in-degree, denoted as d in w (H, v), is the sum of weights of edges connecting vertex v to other vertices in H; its weighted out-degree, denoted as d out w (H, v), is the sum of weights of edges connecting vertex v to other vertices in G-H; and its weighted degree, denoted as d w (H,v), is the sum of its weighted in-degree and its weighted out-degree [28,32].
Generally, the subgraph's modularity is defined as the sum of in-degree of all its vertices, divided by the sum of out-degree of all its vertices [25][26][27][28][29][30]33]. Obviously, this modularity takes value from 0 to ∞. To make the value range of modularity is as same as that of density, we refer to the fitness function of NFC and EPOF and modify the subgraph's modularity in an undirected weighted graph as the sum of weighted in-degree of all its vertices, divided by the sum of weighted degree of all its vertices [31,32].
Obviously, md H takes value from 0 to 1. If a subgraph has higher modularity, it has more connections within itself and less connection to the rest of the PPI network. When a subgraph's modularity is equal to 1, it has no connection to the rest of the PPI network.
By considering both subgraph's density and modularity, the fitness of a subgraph H in an undirected weighted graph G, denoted as f r (H), is defined as where the parameter ρ decides the importance of density in the fitness and takes value from 0 to 1.
Based on the above definition of fitness, the fitness of a vertex v with respect to a subgraph H, denoted as f r (v, H), is defined as the difference of the fitness of the subgraph H with and without vertex v [31][32][33].
where f r (H+{v}) is the fitness of the subgraph in which vertex v is added to H and f r (H-{v}) is the fitness of the subgraph in which vertex with v is removed from H.
When subgraph is a singleton edge, it has the maximum density of 1. When subgraph is the whole graph, it has the maximum modularity of 1. Generally, with the expanding of a subgraph, its modularity is increasing and its density is decreasing. Thus, by expanding from an edge, we can obtain a subgraph with the local maximum fitness value and output it as a complex. The process of a complex extending from an edge is adding neighbor vertices into the subgraph or removing vertices from the subgraph when the inclusion of a new neighbor vertex or the elimination of one vertex from the subgraph will increase the subgraph fitness.

Seed selecting
How to select the seeds is very important for identifying protein complexes. Obviously, the seeds are edges which have more possibility to be in protein complexes. Moreover, it is obvious that comparing with applying to an unweighted PPI network, the performance of a protein complex method can be improved when applying to a weighted PPI network whose edge's weight reflects the possibility of the edge in a protein complex [34][35][36]. So, if the input PPI network is a weighted PPI network and its edge's weight represents the possibility of the edge to be in a protein complex, seed are simply chosen as those edges with weight more than average weight. If the input PPI network is an un-weighted PPI network or a weighted PPI network but its edge's weight cannot reflect the possibility of the edge to be in a protein complex (for example, the weight represents the PPI's confidence), a weighted PPI network should be constructed from the input PPI network and its edge's weight represents the possibility of the edge to be in a protein complex. Then, our method LF-PIN can be applied to this weighted PPI network.
Wang et al. [28] defined the clustering value of an edge in a weighted and an un-weighted graph, and pointed out that two vertices connected by an edge with larger clustering value are more likely to lie in the same module. As modules in PPI network generally correspond to protein complexes [9][10][11][12][13], a PPI with higher edge clustering value in a PPI network has more possibility to be in a protein complex. So, it is reasonable to build a weighted PPI network by calculating its edge's weight according to the edge's clustering value and apply LF-PIN to this weighted PPI network. The clustering value of an edge <u,v> in a graph G, donated as ECV(u,v), is defined as [28]: where w u,k is the weight of edge <u,k> when G is a weighted graph and is equal to 1 when G is an unweighted graph, the N u and N v are the sets of neighbors of vertex u and vertex v respectively, and I u,v denotes the set of common vertices in N u and N v (i.e., However, in a PPI network, there are many edges whose clustering values are equal to 0. It is obvious that these edges cannot be deleted from the PPI network. As they also have little possibility to be in protein complexes, their weights are set as a small constant, which reflects their small possibility in protein complexes. Thus, the weight of an edge <u,v> in a PPI network G is calculated as: where a is the weight of an edge with ECV = 0, ECV avg is the average clustering value of all edges in G, including the edges with ECV = 0. a is set as a constant whose value is much smaller than 1 because the average weight is equal to 1 and the possibility of an edge whose ECV = 0 to be in a protein complex is much less than that of an edge selected randomly. For example, out of 15166 PPIs in the PPI network of S.cerevisiae download from DIP database [37], 2130 PPIs (14%) are in protein complexes. Out of 8573 PPIs whose ECV = 0 in the PPI network, 231 PPIs (2.7%) are in protein complexes. The possibility of a PPI whose ECV = 0 to be in a protein complex is only one fifth of that of a PPI selected randomly. So, for this PPI network, the value of a is set as 0.2.

Algorithm LF-PIN
Based on quantitative description of protein complexes, we propose a novel clustering algorithm LF-PIN (based on Local Fitness) to identify protein complex in a weighted PPI network whose edge's weight reflects the possibility of the edge to be in a protein complex. The detailed description of algorithm LF-PIN is shown in Figure 1. The input of algorithm LF-PIN is parameters ρ, and a weighted PPI network which is described as a simple undirected graph G(V, E, W). Algorithm LF-PIN has three stages: seed selecting, seed expanding, and outputting. Firstly, seed are selected as edges whose weights no less than average weight and sorted into seed queue Sq in non-increasing order by the edge weight. Then, when the seed queue Sq is not null, LF-PIN will always select the first edge in Sq as the seed and gradually add neighbor vertex or remove vertex decided by the measure of vertex fitness. If the expanding cluster has neighbor vertices with fitness more than zero, the neighbor vertex with maximum fitness is added to it to increase its fitness. Then, the fitness values of all vertices in the new cluster are recalculated and the vertices whose fitness are negative are deleted. However, it is possible that when adding a neighbor vertex, the seed's vertex will have negative fitness to the new cluster and will be removed. To avoid this case, if adding a neighbor vertex will results the seed's vertex have negative fitness, this neighbor vertex is discarded and the next neighbor vertex with maximum fitness is checked. The expanding will stop when all neighbor vertices are checked or have negative fitness, and an identified cluster is produced. At the same time, all edges which include vertices in the identified cluster are removed from Sq. The seed expanding processes will stop when the seed queue Sq is null. At last, LF-PIN outputs all identified clusters.
The input PPI network of LF-PIN should be a weighted PPI network whose edge's weight reflects the possibility of the edge to be in a protein complex. If the given PPI network is not a weighted PPI network of this kind, a pre-processor is run to generate the input PPI network from the given PPI network by the method proposed in the above section. The detailed description of the preprocessor is shown in Figure 2. Firstly, each edge's clustering value of the given PPI network is calculated by the formula (8). Then, each edge's weight is calculated according to their clustering value by the formula (9) and the weighted PPI network is generated and output.

Results and discussion
To evaluate the performance of algorithm LF-PIN, we compare it with seven previous competing algorithms: CMC [18], Core-Attachment [22], CPM [15,16], DPClus [19], HC-PIN [28], NFC [31], and MCL [38,39]. The first four algorithms are density-based methods. HC-PIN and NFC are modularity-based methods. MCL is a fast and highly scalable cluster algorithm for PPI networks based on stochastic flow. Obviously, HC-PIN, MCL, and NFC can all identify protein complexes with different density. The values of the parameters in each algorithm are selected from those recommended by the authors and listed in Table 1.
The original un-weighted PPI networks of S.cerevisiae and E.coli are downloaded from DIP database [37] updated to Jun. 14, 2010 and Oct. 10, 2010, respectively. To generate the input PPI network of LF-PIN, we first removed all self-connecting interactions and repeated In the section, evaluation methods used in our experiments are described firstly. Secondly, the effect of parameter ρ on clustering results is discussed. Thirdly, all the identified complexes of LF-PIN and those of seven other algorithms in S.cerevisiae are compared with the known protein complexes of S.cerevisiae in [24], which are listed in the Additional file 3. Fourthly, performances of LF-PIN and seven other algorithms for identifying protein complexes with low density or low modularity are discussed. Finally, all the identified complexes of LF-PIN and those of seven other algorithms in both S.cerevisiae and E.coli are compared in terms of functional enrichment. The original data of figures in the section are listed in the Additional file 4.

Evaluation methods
Two criteria are used in the paper to evaluate the performance of protein complex detection methods. One is matching the identified protein complex set with known protein complex set. Another is the functional enrichment of the identified protein complexes.
To determine how effectively a predicted complex (Pc) matches a known complex (Kc), the overlapping score OS(Pc, Kc) between a predicted complex Pc and a known complex Kc is calculated as [12,19,28]: where |V Pc | is the number of proteins in the predicted complex and |V Kc | is the number of proteins in the known complex. A predicted complex Pc and a known complex Kc are considered as a match if their overlapping score OS(Pc, Kc) is no less than a specific threshold. Generally, the typical value of the threshold is selected as 0.2 [12,19,28]. If OS(Pc, Kc) is equal to 1, we say that they are perfectly matched. Based on the matching of known complexes and predicted complexes, three popular evaluation criteria, Specificity (Sp), Sensitivity (Sn) and F-score, are used to quantify the quality of protein complexes detection methods. Specificity is the fraction of the predicted complexes that are matched by the known complexes among all the predicted complexes [12]. Sensitivity is the fraction of the  Table 1 Parameters of protein complex detection methods used in the paper known complexes that are matched by the predicted complexes among all the known complexes [12]. F-score combines the Sensitivity and Specificity and is defined as [28]: As F-score considers both Sensitivity and Specificity, it is a comprehensive evaluation and used as prediction accuracy in the paper.
To evaluate the functional enrichment of predicted protein complexes, the P-value of a protein complex with a given GO term is used to estimate whether the proteins in the complex are enriched for the GO term with a statistically significant probability compared to what one would expect by chance [19,40]. The smaller P-value indicates the predicted protein complexes is not accumulated at random and is more biologically significant than the one with a larger P-value [28,41]. As a protein complex has various P-values for various GO terms, its P-value defaults to its minimum P-value.

Effect of parameter r
The parameter ρ of LF-PIN decides the importance of density in the fitness. It takes value from 0 to 1. To evaluate the effect of the parameter ρ on the clustering results, we change the values of parameter ρ from 1 to 0 with 0.2 decrements and achieve eleven different output sets of protein complexes from the PPI network of S.cerevisiae. The experimental results are shown in Table 2 and Figure 3.
As shown in Table 2, with the decrease of ρ, the number and the density of identified complexes are decreasing; the size and the modularity of identified complexes are increasing. Generally, with the increase of a subgraph's size, its modularity is increasing and its density is decreasing. Meanwhile, with the decrease of ρ, the density is less important and the modularity is more important in the fitness. So, with the decrease of ρ, a seed edge should be expanded to a larger subgraph with smaller density and larger modularity. As shown in Figure 3, when overlapping score's threshold is no less than 0.2, the F-scores of LF-PIN with ρ = 1 and ρ = 0 are much less than those of LF-PIN with ρ in range from 0.2 to 0.8. As LF-PIN predicts protein complexes considering only subgraph's density when ρ = 1, considering only subgraph's modularity when ρ = 0, and considering both subgraph's density and modularity when 0<ρ<1, the F-score curves in Figure 3 imply that compared with only considering density or modularity, the prediction accuracy can be improved by considering both density and modularity. The F-scores of LF-PIN with ρ in range from 0.2 to 0.8 are very close, which means the performance of LF-PIN with ρ in range from 0.2 to 0.8 are close. As shown in Table 2, when ρ is in the range of 0.4 to 0.8, the density of identified protein complexes are too high (nearly 1). So, to identify various protein complexes, including those with low density and low modularity, the value of parameter ρ is selected as 0.2 in the paper.

Comparison with known complexes
To directly validate the effectiveness of algorithm LF-PIN for identifying protein complexes, we compare the protein complexes predicted by LF-PIN and other seven algorithms with the known protein complexes obtained from [24] and list the percentage of matched predicted complexes of these eight algorithms in Figure 4. We can see from Figure 4 that when overlapping score's threshold is equal to 0.2 (the typical value of overlapping score's threshold used in many literature), 63% complexes detected by LF-PIN are matched by the known complexes. This ratio is much higher than those identified by other seven competing algorithms at the same threshold. For example, when overlapping score's threshold is equal to 0.2, 45% complexes predicted by HC-PIN are matched, which is the best result in the seven competing algorithms. Even compared with this best result, 40% improvement can be obtained by using LF-PIN algorithms. Furthermore, Figure 4 shows that for each overlapping score's threshold, the percentage of matched complexes in the complex set identified by LF-PIN is much higher than those identified by other seven competing algorithms. All these indicate that LF-PIN outperforms other seven competing algorithms in terms of matching with the known complexes.   To compare the performance of LF-PIN with those of other seven competing algorithms, we calculate Sensitivity, Specificity, F-score, the number and the percentage of perfect matches of the eight algorithms and list them in Table  3. Here, the overlapping score's threshold is selected as 0.2. As shown in Table 3, the number of complexes identified by LF-PIN is 297, which is much less than those identified by CMC (only 26%), Core-Attachment (only 22%), and DPClus (only 25%). Obviously, the more complexes dose an algorithm identify, the more perfect matches and matched complexes dose the algorithm identify. Thus, LF-PIN identifies less perfect matches than Core-Attachment, and its Sensitivity value is less than those of CMC, Core-Attachment, and DPClus. However, Table 3 shows that the percentage of perfect matches in the identified complexes and the Specificity value of LF-PIN are both higher than those of other seven algorithms, which means that the percentages of perfect matches and matched complexes in the complexes identified by LF-PIN are both higher than those of the other algorithms. Moreover, LF-PIN has the highest F-score value in the eight algorithms. Even compared with the highest F-score value of other seven algorithms, 41% improvement can be obtained by using LF-PIN algorithms.
To give a more complete comparison, we compare the F-score of LF-PIN and other seven algorithms with respect to different overlapping score's thresholds in Figure 5. As shown in Figure 5, LF-PIN algorithm has the highest value of F-score in the eight algorithms when overlapping score's threshold is no less than 0.2, which means it has the highest prediction accuracy in the eight algorithms.

Comparison with known complexes of low density
A more attractive characteristic of LF-PIN is that this algorithm can identify significant protein complexes with low density. We can see from Table 2 that the density of protein complexes identified by LF-PIN (ρ = 0.2) can be as low as 0.17. To directly validate the effectiveness of algorithm LF-PIN for identifying protein complexes with low density, we select all known protein complexes obtained from [24] with density less than 0.5, counting up to 89 complexes, and compare them with the protein complexes predicted by LF-PIN and other seven algorithms in Figure 6 and Figure 7. Figure 6 shows the percentage of matched predicted complexes of LF-PIN and other algorithms. Figure 7 shows the F-score of LF-PIN and other algorithms.
We can see from Figure 6 that for each overlapping score's threshold, especially when overlapping score's threshold less than 0.6, the percentage of matched complexes in the complex set identified by LF-PIN is much higher than those identified by other seven algorithms, especially by the four density-based algorithms: CMC, Core-Attachment, CPM and DPClus. For example, when overlapping score's threshold is equal to 0.2, 18% complexes detected by LF-PIN are matched with known complexes with low density. This matched percentage is 3.2 times as that detected by CMC, 4.0 times as that detected by Core-Attachment, 1.8 times as that detected by CPM, and 4.1 times as that detected by DPClus.
As shown in Figure 7, for each overlapping score's threshold, especially when overlapping score's threshold is less than 0.7, the F-score of LF-PIN is much higher than those of other seven algorithms, especially for the four density-based algorithms. For example, when overlapping score's threshold is equal to 0.2, the F-score of LF-PIN is 0.268, which is 2.6 times as that detected by CMC, 3.2 times as that detected by Core-Attachment, 1.9 times as that detected by CPM, and 3.3 times as that detected by DPClus.
Conclusion above, compared with other seven algorithms, especially for the four density-based algorithms, LF-PIN has much better performance for identifying protein complexes with low density. When overlapping score's threshold is set to 0.2, compared with the four density-based algorithms, the prediction accuracy can be improved no less than 90% by using LF-PIN algorithm.

Comparison with known complexes of low modularity
Another attractive characteristic of LF-PIN is that this algorithm can identify significant protein complexes with low modularity. We can see from Table 2 that the modularity of protein complexes identified by LF-PIN (ρ = 0.2) can be as low as 0.08. To directly validate the effectiveness of algorithm LF-PIN for identifying protein complexes with low modularity, we select all known protein complexes obtained from [24] with modularity less than 0.3, counting up to 247 complexes, and compare them with the protein complexes predicted by LF-PIN and other seven algorithms in Figure 8 and Figure 9.      shows the percentage of matched predicted complexes of LF-PIN and other seven algorithms. Figure 9 shows the F-score of LF-PIN and other seven algorithms. We can see from Figure 8 that for each overlapping score's threshold, the percentage of matched complexes in the complex set identified by LF-PIN is higher than those identified by other seven algorithms, especially by NFC. For example, when overlapping score's threshold is equal to 0.2, 37% complexes detected by LF-PIN are matched. However, only 10% complexes detected by NFC are matched. The improvement of LF-PIN algorithms is about 2.7 times. Meanwhile, Figure 9 shows that when overlapping score's threshold no less than 0.2, the F-score of LF-PIN is also higher than those of other seven algorithms, especially of NFC. For example, when overlapping score's threshold is equal to 0.2, the F-score of LF-PIN is 0.404, and the F-score of NFC is only 0.135. The F-score of LF-PIN is three times as that of NFC.. All these imply that compared with other seven algorithms, especially for the modularity-based algorithm NFC,, LF-PIN has better performance for identifying protein complexes with low modularity.
However, from Figure 8 and Figure 9, we can see that compared with the improvements of using LF-PIN instead of NFC, the improvements of using LF-PIN instead of HC-PIN (another module-based algorithm) is little. For example, when overlapping score's threshold is equal to 0.2, compared with the results of HC-PIN, the percentages of matched complexes identified by LF-PIN is only improved 22% and the F-score of LF-PIN is only improved 24%. The possible reason is that the λ value of HC-PIN is only selected as 0.5 in the paper, which result the modularity of the protein complex identified by HC-PIN is smaller (when a subgraph's λ value is 0.5, its modularity is only 0.33). So, the protein complexes identified by HC-PIN can also match well with the known protein complexes with low modularity.

Comparison with other algorithms in terms of functional enrichment
To evaluate the effectiveness of LF-PIN, we apply it and other seven algorithms in PPI networks of S.cerevisiae and E.coli, and compare the functional enrichment of protein complexes identified by each algorithm in Table 4 and Table 5, respectively. Here, the GO annotation is downloaded from GO database [42] and Biological Processes are used.
As shown in Table 4, only 12.8% protein complexes identified by LF-PIN are insignificant (Generally, a complex with P-value≥0.01 is considered insignificant and that with P-value<0.01 is considered significant). This percentage is much lower than those of other seven algorithms. For example, the percentage of insignificant complexes identified by HC-PIN is 37.4%, which is the lowest in the seven other algorithms. Even this lowest percentage is about 3 times as that identified by LF-PIN. On the contrary, the percentage of significant complexes identified by LF-PIN, including those of complexes whose P-value falls within <E-10, [E-10, E-5], [E-5, 0.01], are all higher than those identified by other seven algorithms. All these mean that LF-PIN is more effective for identifying significant proteins complexes than other algorithms in S.cerevisiae.
From Table 5, we can draw the same conclusion in E.coli. The percentage of insignificant complexes identified by LF-PIN is lower than those identified by other seven algorithms. On the contrary, compared with the results of other seven algorithms, the percentage of significant complexes identified by LF-PIN is improved 44% to 133%. The statistical results of Table 4 and Table 5 indicate that LF-PIN has good performance for identifying significant proteins complexes.

Conclusions
In the post-genome era, one of the most important works is to discover the protein complexes with various density and modularity. In this paper, we propose a novel fitness function by considering both density and modularity and develop a fitness-based local search algorithm, named LF-PIN, to identify protein complexes with different density and modularity in PPI network. By tuning the value of parameter ρ in the fitness function, we can adjust the importance of density and modularity in the fitness. Experimental results in S.cerevisiae show that compared with considering only density (ρ = 1) or only modularity (ρ = 0), LF-PIN has better performance when considering both density and modularity (0<ρ<1). To compare algorithm LF-PIN with other protein complexes detection methods, we apply LF-PIN and other seven competing algorithms, including CMC, Core-Attachment, CPM, DPClus, HC-PIN, MCL, and NFC, to the protein interaction network of S.cerevisiae and E.coli. The experimental results in both S.cerevisiae and E.coli show that LF-PIN identifies much more significant protein complexes and generates much less insignificant protein complexes than other algorithms. When