Determining modular organization of protein interaction networks by maximizing modularity density
© Zhang et al; licensee BioMed Central Ltd. 2010
Published: 13 September 2010
With the ever-increasing amount of available data on biological networks, modeling and understanding the structure of these large networks is an important problem with profound biological implications. Cellular functions and biochemical events are carried out in a coordinated manner by groups of proteins interacting with each other in biological modules. Identifying such modules in protein interaction networks is very important for understanding the structure and function of these fundamental cellular networks. Developing an effective computational method to uncover biological modules is therefore both highly challenging and indispensable.
The purpose of this study is to introduce a new quantitative measure, modularity density, into the field of biomolecular networks and to develop new algorithms for detecting functional modules in protein-protein interaction (PPI) networks. Specifically, we adopt simulated annealing (SA) to maximize the modularity density and evaluate its efficiency on simulated networks. To address the computational complexity of the SA procedure, we devise a spectral method for optimizing the index and apply it to a yeast PPI network.
Our analysis of the modules detected by the present method suggests that most of them are biologically significant in the context of protein complexes. Comparison with MCL and modularity-based methods shows the efficiency of our method.
Understanding the cell as a system of interacting components is a fundamental goal of current biology. Various types of biological networks are being constructed for cellular systems, including PPI networks, gene regulatory networks and metabolic networks. Exploring how molecules interact to form cellular machinery is a key task in systems biology. Well-understood graph-theoretical concepts have become a powerful tool for exploring the topology, organization, function and evolution of biological networks. In this field, recent studies have made great progress, considerably expanding our insight into the organizational principles and cellular mechanisms of cellular systems [1, 2].
Modularity has been considered one of the main organizing principles of biological networks over the past decade. Biological modules, as a critical level of the biological hierarchy and as relatively independent units, play special roles in biological systems. Uncovering modular structures in various biological networks is a basic step toward understanding cellular functions and the organizational mechanisms of biosystems. For example, using network partitioning, Zhao et al. (2006) investigated the functional and evolutionary modularity of human metabolic networks from a topological perspective.
In the past few years, a large number of computational methods have been developed for detecting network modules and analyzing the structure of biological networks. Hierarchical clustering has proven to be a useful tool for analyzing biological networks. Ravasz et al. (2002) studied the hierarchical modular organization of metabolic networks based on a topological linkage matrix. Three groups [4–6] employed hierarchical clustering, each with a different clustering method, to analyze the modular structure of yeast protein interaction networks. The diffusion kernel of a graph has also been suggested as a universal similarity metric for constructing the clustering tree of a network. However, this type of approach may generate many identical distances (similarities), leading to the ‘tie in proximity’ problem during hierarchical clustering. It also imposes a stringent tree structure on the network, which is highly sensitive to the metric used to assess (dis)similarity, and it typically requires subjective evaluation to define modules. Moreover, evaluating these (dis)similarity measures for hierarchical clustering is not an easy problem.
Several studies of protein interaction networks have focused on detecting highly connected protein modules [8–10], which generally correspond to meaningful biological units such as protein complexes and functional modules. In general, these approaches employ only local connectivity among proteins and neglect many peripheral proteins that connect to the core protein clusters through few links. However, biological networks, including PPI networks, are generally very sparse. Because most methods identify only strongly connected subgraphs as modules, only a few modules are detected [9, 11, 12]. Biologically meaningful sparse protein modules are ignored by these approaches, and the lost peripheral proteins may represent experimentally true interactions. Furthermore, because these approaches rely heavily on local topological connectivity, they ignore the impact of the global organization of networks. Biological networks are globally coordinated systems, so methods based on local connectivity cannot be used to explore the relationships among modules. Another important factor is the noise in interaction data; other sources, such as function annotation data and gene expression data, have been integrated into protein interaction networks to improve the effectiveness of module detection [13, 14].
One popular class of methods for dissecting modular structure in general complex networks is based on optimizing a global quality function called modularity [15, 16] to partition the network into modules. It has been widely adopted to analyze biological networks [3, 17–19]. However, it has recently been shown that the resolution of modularity-based methods is intrinsically limited: they fail to find small communities in large networks, and groups of small communities are instead merged into larger ones. Li et al. (2008) proposed a novel quality function called modularity density (D), which aims to overcome the resolution limit of modularity. They tested it on many kinds of small networks for illustration, but not on large real networks.
In this study, we aim to introduce the new quantitative measure, modularity density, into the modular analysis of biomolecular networks and to develop new algorithms for detecting functional modules in protein-protein interaction (PPI) networks. We first adopt the simulated annealing (SA) technique to maximize the modularity density and evaluate its advantages on a suite of simulated networks in which the modules are known. To overcome the computational burden of the SA procedure, we adopt a spectral k-means method for optimizing the measure and apply it to a yeast PPI network. Our biological analysis of the detected modules suggests that most of them carry distinct biological significance. We also compare our method with two other methods, the popular MCL and modularity-based methods, to verify its effectiveness.
Materials and methods
Definition of modularity and modularity density
For a partition of a network into m modules, the modularity Q is defined as

$$Q = \sum_{i=1}^{m} \left[ \frac{l_i}{L} - \left( \frac{d_i}{2L} \right)^{2} \right],$$

where m is the number of modules, L is the total number of edges in the network, l_i is the number of edges between nodes in module i, and d_i is the sum of the degrees of the nodes in module i. The highest Q value over all possible partitions is called the network modularity. Empirical and simulation studies have shown that partitioning a network by maximizing modularity Q (MQ) performs well. However, Fortunato and Barthelemy (2007) recently pointed out the serious resolution limit of this method and claimed that the size of a detected module depends on the size of the whole network. The main reason is that Q does not capture information about the number of nodes in a module, and the choice of partition is highly sensitive to the total number of links in the network.
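As a small illustration, Q can be computed directly from the quantities named above. The sketch below is not the authors' implementation; the function name `modularity_q` and the input format (an edge list plus a node-to-module dictionary) are assumptions chosen for brevity.

```python
from collections import defaultdict

def modularity_q(edges, membership):
    """Q = sum_i [ l_i / L - (d_i / (2L))^2 ] for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping node -> module label (hypothetical input format).
    """
    L = len(edges)
    internal = defaultdict(int)    # l_i: edges with both endpoints in module i
    degree_sum = defaultdict(int)  # d_i: sum of node degrees in module i
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / L - (degree_sum[c] / (2 * L)) ** 2 for c in degree_sum)

# Toy graph: two triangles joined by one edge, each triangle taken as a module.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_value = modularity_q(edges, membership)  # 2 * (3/7 - (7/14)^2) = 5/14
```

Placing all six nodes in a single module gives Q = 0, so on this toy graph the two-triangle split is preferred by Q.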
This measure provides a way to determine whether a given mesoscopic description of the graph in terms of modules is accurate. The larger the value of D, the more accurate the partition. The community detection problem can therefore be viewed as the problem of finding a partition of a network that maximizes its modularity density D. Searching for the optimal modularity density D is challenging because the space of possible partitions grows faster than any power of the system size.
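For concreteness, the modularity density can be written out as follows. The notation is an assumption, following the usual statement of the measure by Li et al. (2008) and chosen to agree with the matrix 2A − B used in the spectral method; the worked example (two triangles joined by a single edge) is purely illustrative.

```latex
% Assumed notation: A is the adjacency matrix, V_i the node set of module i,
% \bar{V}_i its complement, and L(V_1, V_2) = \sum_{u \in V_1,\, v \in V_2} A_{uv}.
\[
  D \;=\; \sum_{i=1}^{m} \frac{L(V_i, V_i) - L(V_i, \bar{V}_i)}{|V_i|}
\]
% Worked example: two triangles joined by one edge, each triangle one module.
% L(V_i, V_i) = 6 (three internal edges, counted in both directions), L(V_i, \bar{V}_i) = 1.
\[
  D \;=\; \frac{6 - 1}{3} + \frac{6 - 1}{3} \;=\; \frac{10}{3},
  \qquad
  D_{\text{single module}} \;=\; \frac{14 - 0}{6} \;=\; \frac{7}{3}.
\]
```

Dividing by |V_i| is what brings module size into the measure, which is the information Q lacks.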
In the generalized measure Dλ, λ is a value ranging from 0 to 1; when λ = 0.5, D0.5 corresponds to the modularity density D. By varying λ, we can detect the detailed and hierarchical organization of biological systems. In other words, a small λ divides the network into large modules, and a large λ divides it into small modules.
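The effect of λ can be seen on a toy graph. The sketch below assumes the generalized form Dλ = Σ_i [2λ L(V_i, V_i) − 2(1 − λ) L(V_i, V̄_i)] / |V_i|, which reduces to D at λ = 0.5; this form, the function name `density_lambda` and the adjacency-dictionary input are assumptions made for illustration.

```python
def density_lambda(adj, modules, lam=0.5):
    """Generalized modularity density under the assumed form
    sum_i [2*lam*L(V_i,V_i) - 2*(1-lam)*L(V_i, rest)] / |V_i|.

    adj: dict node -> set of neighbours; modules: list of disjoint node sets.
    L(V_i, V_i) counts ordered pairs, so each internal edge contributes 2.
    """
    total = 0.0
    for module in modules:
        inside = sum(1 for u in module for v in adj[u] if v in module)
        outside = sum(1 for u in module for v in adj[u] if v not in module)
        total += (2 * lam * inside - 2 * (1 - lam) * outside) / len(module)
    return total

# Two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
whole = [{0, 1, 2, 3, 4, 5}]
split = [{0, 1, 2}, {3, 4, 5}]

# For each lambda, record which of the two candidate partitions scores higher.
preferred = {}
for lam in (0.2, 0.5, 0.8):
    better_split = density_lambda(adj, split, lam) > density_lambda(adj, whole, lam)
    preferred[lam] = "split" if better_split else "whole"
```

On this graph the single large module wins at λ = 0.2, while the two triangles win at λ = 0.5 and λ = 0.8, in line with the statement that small λ favours large modules.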
Simulated annealing for maximizing D (MD)
where C_i (C_f) is the cost before (after) the update.
Specific implementation details can be found in the cited literature. Note that we add a decision clause to ensure that each potential ‘module’ is connected. The approach that performs best consists of isolating the module from the rest of the network and performing a ‘nested’ SA, entirely independent of the ‘global’ one. Because Q and D are used directly as ‘fitness functions’, the method is more direct than those relying on heuristic procedures. Moreover, SA enables us to carry out an exhaustive search and to minimize the problem of finding sub-optimal partitions. We should note that the SA method does not scale to very large networks, but its exhaustive character makes it an effective evaluation method. Several efficient methods for optimizing Q have been proposed, but designing efficient algorithms for optimizing the new measure D remains an essential and challenging problem.
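A much-simplified sketch of such an annealing loop is given below. It assumes the standard Metropolis rule (accept with probability 1 if C_f ≤ C_i, and exp(−(C_f − C_i)/T) otherwise) with cost C = −D. The single-node move set, the geometric cooling schedule and all parameter values are illustrative assumptions; the connectivity check and the nested SA described above are deliberately left out.

```python
import math
import random

def acceptance_probability(cost_before, cost_after, temperature):
    """Metropolis rule: always accept improvements, else exp(-(C_f - C_i) / T)."""
    if cost_after <= cost_before:
        return 1.0
    return math.exp(-(cost_after - cost_before) / temperature)

def density(adj, labels):
    """Modularity density D of the partition encoded as node -> module id."""
    groups = {}
    for node, lab in labels.items():
        groups.setdefault(lab, set()).add(node)
    total = 0.0
    for members in groups.values():
        inside = sum(1 for u in members for v in adj[u] if v in members)
        outside = sum(1 for u in members for v in adj[u] if v not in members)
        total += (inside - outside) / len(members)
    return total

def anneal(adj, labels, t_start=1.0, t_end=0.01, cooling=0.95, moves_per_t=50, seed=0):
    """Simplified SA: move one node at a time into an existing module (cost = -D)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    current = dict(labels)
    cost = -density(adj, current)
    best, best_cost = dict(current), cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            node = rng.choice(nodes)
            target = rng.choice(sorted(set(current.values())))
            if target == current[node]:
                continue
            trial = dict(current)
            trial[node] = target
            trial_cost = -density(adj, trial)
            if rng.random() < acceptance_probability(cost, trial_cost, t):
                current, cost = trial, trial_cost
                if cost < best_cost:
                    best, best_cost = dict(current), cost
        t *= cooling  # geometric cooling (an assumed schedule)
    return best, -best_cost

# Two triangles joined by one edge; start with every node in its own module.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
initial = {n: n for n in adj}
best_labels, best_d = anneal(adj, initial)
```

Because every move re-evaluates D from scratch, this sketch is only practical for tiny graphs; an efficient implementation would update the cost incrementally.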
Spectral method for maximizing D (SpeMD)
By a standard result in linear algebra, the optimum of the above trace maximization is closely related to the leading k eigenvectors of 2A − B once the partition indicator matrix is relaxed to an arbitrary orthonormal matrix. We can therefore adopt the corresponding spectral algorithms and use the leading k eigenvectors of 2A − B to optimize the modularity density D. To obtain the final network partition, we apply the k-means clustering method to the eigenvectors. Importantly, the same principle can be derived for Dλ.
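The relaxation can be written out in a few lines. The derivation below is a sketch under assumed notation: A is the adjacency matrix, B the diagonal degree matrix, x_i the 0/1 indicator vector of module V_i, and μ_1 ≥ … ≥ μ_k the largest eigenvalues of 2A − B. The Dλ line relies on the assumed form Dλ = Σ_i [2λ L(V_i, V_i) − 2(1 − λ) L(V_i, V̄_i)] / |V_i|.

```latex
\begin{align*}
  L(V_i, V_i) &= x_i^{T} A\, x_i, \qquad
  L(V_i, \bar{V}_i) = x_i^{T} (B - A)\, x_i, \qquad
  |V_i| = x_i^{T} x_i, \\[4pt]
  D &= \sum_{i=1}^{k} \frac{x_i^{T} (2A - B)\, x_i}{x_i^{T} x_i}
     = \operatorname{tr}\!\bigl( \tilde{X}^{T} (2A - B)\, \tilde{X} \bigr),
     \qquad \tilde{x}_i = \frac{x_i}{\lVert x_i \rVert_2}, \\[4pt]
  \max_{\tilde{X}^{T} \tilde{X} = I_k}
     \operatorname{tr}\!\bigl( \tilde{X}^{T} (2A - B)\, \tilde{X} \bigr)
     &= \sum_{j=1}^{k} \mu_j
     \quad \text{(attained by the $k$ leading eigenvectors)}, \\[4pt]
  D_{\lambda} &= \operatorname{tr}\!\bigl( \tilde{X}^{T} \bigl( 2A - 2(1 - \lambda) B \bigr) \tilde{X} \bigr).
\end{align*}
```

The relaxed optimum is an upper bound on D; k-means on the rows of the eigenvector matrix is one way of mapping the relaxed solution back to a discrete partition.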
The procedure of the algorithm
k-means: for each value of k, 2 ≤ k ≤ K
Form the matrix U_k = [u_2, u_3, …, u_k] from the matrix U_K.
Treat the rows of U_k as points in R^k and cluster them into k clusters using k-means or another clustering method.
Maximizing modularity density D or Dλ with given λ: Pick the k and the corresponding partition P k that maximizes D or Dλ.
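The steps above can be sketched in dependency-free Python as follows. This is an illustration rather than the authors' code: eigenvectors are obtained by shifted power iteration with deflation (any eigensolver would do), k-means uses a deterministic farthest-point start, the matrix for Dλ is taken to be 2A − 2(1 − λ)B under the assumed form of Dλ, and the first k eigenvector columns are used for each k. All function names are hypothetical.

```python
import math

def leading_eigenvectors(M, k, iters=2000):
    """Top-k eigenvectors of a symmetric matrix via shifted power iteration."""
    n = len(M)
    shift = max(sum(abs(x) for x in row) for row in M)  # Gershgorin: M + shift*I is PSD
    vecs = []
    for j in range(k):
        v = [math.sin((j + 1) * (i + 1)) for i in range(n)]  # fixed generic start
        for _ in range(iters):
            w = [sum(m * x for m, x in zip(row, v)) + shift * vi for row, vi in zip(M, v)]
            for u in vecs:  # deflation: project out eigenvectors already found
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        vecs.append(v)
    return vecs

def kmeans(points, k, iters=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def density_lambda(A, labels, lam=0.5):
    """Generalized modularity density of a labelling (assumed form of D_lambda)."""
    n, total = len(A), 0.0
    for c in set(labels):
        idx = [i for i in range(n) if labels[i] == c]
        inside = sum(A[u][v] for u in idx for v in idx)
        outside = sum(A[u][v] for u in idx for v in range(n) if labels[v] != c)
        total += (2 * lam * inside - 2 * (1 - lam) * outside) / len(idx)
    return total

def spemd(A, K, lam=0.5):
    """Return (score, k, labels) maximizing D_lambda over k = 2..K."""
    n = len(A)
    deg = [sum(row) for row in A]
    # 2A - 2(1 - lam)B, which equals 2A - B when lam = 0.5.
    M = [[2 * A[i][j] - (2 * (1 - lam) * deg[i] if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = leading_eigenvectors(M, K)
    best = None
    for k in range(2, K + 1):
        rows = [[U[j][i] for j in range(k)] for i in range(n)]  # node i as a point in R^k
        labels = kmeans(rows, k)
        score = density_lambda(A, labels, lam)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best

# Two triangles joined by the edge 2-3.
A = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
score, k, labels = spemd(A, K=3)
```

On this toy graph the procedure selects k = 2 and recovers the two triangles. For real PPI networks a sparse eigensolver and an optimized k-means would replace the naive routines used here.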
We should note that this type of spectral clustering technique has been successfully applied to general clustering problems as well as graph clustering problems [24, 25]. Here, we exploit the characteristics of modularity density D and derive a new spectral clustering method for maximizing D (Dλ), which we call SpeMD. The SpeMD procedure described here can be seen as a particular way of applying the standard k-means algorithm to the elements of the leading k eigenvectors to extract k clusters simultaneously. Convergence and computational complexity are key concerns when SpeMD is applied to large complex networks. Fortunately, several strategies can mitigate these problems. First, we can initialize k-means so that the starting centroids are as orthogonal as possible. This strategy does not change the time complexity, but it can improve the quality of convergence and thus reduce the need to restart the random initialization process. Second, several fast techniques for solving eigensystems have been developed, and several methods for accelerating k-means can be found in the literature. With such techniques, for large sparse networks with m ~ n (m here denoting the number of edges) and k ≪ n, the SpeMD procedure scales roughly linearly with the number of nodes n. Here we do not consider these refinements and focus only on the validity of the SpeMD method.
where N is the size of the PPI network, |C| and |M| are the sizes of an experimental complex and a computed protein module respectively, and k is the number of proteins they have in common.
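A P-value of this kind can be computed with exact integer arithmetic. The sketch below assumes the usual hypergeometric tail, that is, the probability that a random set of |M| proteins out of N shares at least k proteins with a complex of size |C|; the function name `overlap_p_value` is hypothetical.

```python
from math import comb

def overlap_p_value(N, complex_size, module_size, k):
    """P(X >= k) for X ~ Hypergeometric(N, complex_size, module_size).

    N: number of proteins in the network; complex_size = |C|;
    module_size = |M|; k: number of proteins shared by complex and module.
    """
    upper = min(complex_size, module_size)
    total = comb(N, module_size)
    # Sum the upper tail directly to avoid the cancellation in 1 - P(X < k).
    tail = sum(comb(complex_size, i) * comb(N - complex_size, module_size - i)
               for i in range(k, upper + 1))
    return tail / total

# Toy numbers: 10 proteins, a complex of 3 and a module of 3 sharing all 3 proteins.
p_full_overlap = overlap_p_value(10, 3, 3, 3)  # 1 / C(10, 3) = 1/120
```

A small P-value indicates that the overlap between the module and the complex is unlikely to arise by chance.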
Furthermore, the geometric accuracy and separation described by Brohee and van Helden are employed to evaluate the performance of the module-detection methods. We first build a contingency table T, where row i corresponds to the i-th experimental complex, column j corresponds to the j-th module, and the value of cell T_ij is the number of proteins shared by complex i and module j. The contingency table has n rows (complexes) and m columns (modules). Using this table, each module partitioning result is compared with the experimental complexes.
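Both scores follow directly from the contingency table. The sketch below assumes the definitions commonly attributed to Brohee and van Helden: sensitivity Sn = Σ_i max_j T_ij / Σ_i N_i (N_i being the size of complex i), positive predictive value PPV = Σ_j max_i T_ij / Σ_j T_.j, accuracy Acc = sqrt(Sn × PPV), and separation built from T_ij² / (T_i. T_.j) averaged over complexes and over modules. The function name and the `complex_sizes` argument are hypothetical.

```python
import math

def accuracy_and_separation(T, complex_sizes):
    """Geometric accuracy and separation from an n x m contingency table T.

    T[i][j]: proteins shared by complex i and module j.
    complex_sizes[i]: number of proteins in complex i.
    """
    n, m = len(T), len(T[0])
    row_sums = [sum(row) for row in T]
    col_sums = [sum(T[i][j] for i in range(n)) for j in range(m)]

    sensitivity = sum(max(row) for row in T) / sum(complex_sizes)
    ppv = sum(max(T[i][j] for i in range(n)) for j in range(m)) / sum(col_sums)
    accuracy = math.sqrt(sensitivity * ppv)

    # Sep_ij = (T_ij / row sum) * (T_ij / column sum); zero cells contribute nothing.
    sep_total = sum(T[i][j] ** 2 / (row_sums[i] * col_sums[j])
                    for i in range(n) for j in range(m) if T[i][j] > 0)
    separation = math.sqrt((sep_total / n) * (sep_total / m))
    return accuracy, separation

# Two complexes of size 3 against two modules, with one protein misplaced.
acc, sep = accuracy_and_separation([[2, 1], [0, 3]], complex_sizes=[3, 3])
```

A perfect one-to-one match between complexes and modules gives accuracy and separation both equal to 1; the imperfect table above gives roughly 0.83 and 0.75.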
In this section, we apply the present method to a suite of simulated networks and a yeast PPI network to test its efficiency. We first present detailed numerical results showing the difference between the network partitions obtained by maximizing the modularity density D and the modularity Q with the simulated annealing (SA) technique. In general, maximizing D (MD) gives more detailed and valid results, while maximizing Q (MQ) encounters a serious resolution limit in simulated networks.
Then we apply the new spectral method for maximizing the generalized Dλ (SpeMD) to a yeast PPI network to identify functional modules, which show significant biological relevance. In comparison with MQ and MCL, we show that SpeMD achieves performance competitive with the well-known MCL method and resolves much finer modular structure than the MQ method. To extract appropriate modules, SpeMD and MCL each rely on one parameter. Here, we run SpeMD and MCL with adjusted parameters to obtain the ‘best’ geometric accuracy and separation. For SpeMD, we tune λ from 0.4 to 0.7 in steps of 0.05, and for MCL, we sample inflation parameter values from 1.5 to 2 in steps of 0.1.
First, we carry out comprehensive tests on a group of simulated networks with pronounced modular structure. In earlier work, the D-based method was shown to achieve performance competitive with the Q-based method. However, the artificial networks generated by Newman's popular procedure and its variants are too small to expose the serious resolution limit of Q. Therefore, we devise a new type of artificial network. The network is composed of 2m cliques (m n1-cliques and m n2-cliques), and external edges are placed randomly with a fixed expectation so that each node has, on average, k_out connections to nodes of other cliques. Each network thus has m(n1 + n2) nodes and about m(n1(n1 − 1)/2 + n2(n2 − 1)/2) + m(n1 + n2)k_out/2 edges. In the following tests, we let n1 = 10 and n2 = 15. Note that we could also relax the cliques to dense modules for testing, but here we show only the clique case for convenience.
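A generator for this kind of benchmark might look as follows. The sketch makes one simplifying assumption that the text does not spell out: every pair of nodes in different cliques is linked independently with a single probability p, chosen so that the expected number of external edges equals m(n1 + n2)k_out/2. Function names are hypothetical.

```python
import random
from itertools import combinations

def clique_network(m, n1, n2, k_out, seed=0):
    """m cliques of size n1 and m of size n2, plus random inter-clique edges.

    Returns (edges, module_of), where module_of[v] is the clique index of node v.
    """
    rng = random.Random(seed)
    sizes = [n1] * m + [n2] * m
    module_of = [c for c, size in enumerate(sizes) for _ in range(size)]
    n = len(module_of)
    pairs = list(combinations(range(n), 2))
    intra = [(u, v) for u, v in pairs if module_of[u] == module_of[v]]
    inter = [(u, v) for u, v in pairs if module_of[u] != module_of[v]]
    # Assumed scheme: one uniform probability giving average external degree k_out.
    p = k_out * n / (2 * len(inter))
    external = [edge for edge in inter if rng.random() < p]
    return intra + external, module_of

def expected_edge_count(m, n1, n2, k_out):
    """Approximate edge count quoted in the text."""
    return m * (n1 * (n1 - 1) / 2 + n2 * (n2 - 1) / 2) + m * (n1 + n2) * k_out / 2

# A network with 20 cliques (m = 10), n1 = 10, n2 = 15 and k_out = 6.
edges, module_of = clique_network(m=10, n1=10, n2=15, k_out=6)
```

With these settings the network has 250 nodes, exactly 1500 intra-clique edges and about 750 external edges, matching the expected total of 2250.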
The most interesting observation is that the performance of MD remains almost unchanged, while that of MQ decreases greatly as NC (the number of cliques, and hence the size of the network) increases. For example, for 50 random networks with k_out = 6, on average more than 99.9% of nodes are classified correctly by MD on all four network sizes (NC = 20, 40, 60, 80), compared with about 92.40%, 78.18%, 69.75% and 62.59% of nodes, respectively, by MQ. This shows the serious resolution limit of modularity Q, which cannot be observed on small networks such as those simulated using Newman's method.
To test the performance of MD and MQ in selecting the number of communities, we count the number of detected modules. Figure 2 shows the average number of modules found by MD and MQ on four network sizes (NC = 20, 40, 60, 80) as a function of k_out. MD performs much better than MQ. MD almost always identifies the right number of modules on all four network sizes for k_out ≤ 7, whereas MQ does not. For example, for 50 random networks with NC = 60 and k_out = 7, on average 59.7 modules are identified by MD, but only about 37.20 by MQ. For the harder case (k_out = 8), MD still does much better than MQ. In fact, even in the easiest case (k_out = 2), MQ cannot identify the right modules, finding only 52.50 modules for NC = 80. This exposes the underlying resolution limit, as pointed out previously. In summary, MD recovers the underlying community structure more often than MQ, by a sizable margin, in the simulated modular networks. The modularity density D relies more on the local connectivity of a network and can uncover finer modular structure, whereas the modularity Q relies more on the size and total number of links of a network and can suffer from a serious resolution limit. Moreover, the limit becomes more serious as network size increases.
Results on a PPI network
Illustration of detected modules.
Comparison with MCL and MQ
There are many methods for detecting network modules, and comparing all of them is not an easy task. Here, we compare MD (SpeMD) with two classical methods: MQ and MCL. As mentioned above, the module-detection method based on modularity (Q) maximization has been widely applied in many fields, including the analysis of biological networks. The other method is the Markov Cluster algorithm (MCL), developed by van Dongen. The method simulates a flow on the network by calculating successive powers of the network adjacency matrix. In each iteration, an inflation step is applied to enhance the contrast between regions of strong and weak flow in the network. The process converges toward a partition of the network, with a set of high-flow regions separated by boundaries with no flow. The value of the inflation parameter strongly influences the size and number of the detected modules. In a recent evaluation study, the algorithm was found to be superior to several representative graph clustering algorithms, including MCODE, RNSC and SPC, for the prediction of protein complexes.
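The expansion and inflation steps can be illustrated with a bare-bones sketch. It is not the reference MCL program: self-loops are added, the matrix is kept column-stochastic, a fixed number of iterations replaces a convergence test, and each node is simply assigned to the row carrying the most flow in its column. These choices and all names are assumptions made for illustration.

```python
def normalise_columns(M):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    n = len(M)
    col_sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(adjacency, inflation=2.0, iters=50):
    """Simplified Markov clustering on a dense adjacency matrix."""
    n = len(adjacency)
    # Self-loops keep every column sum positive and damp oscillation.
    M = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    M = normalise_columns(M)
    for _ in range(iters):
        M = mat_mul(M, M)                                   # expansion: flow spreads
        M = normalise_columns([[x ** inflation for x in row] for row in M])  # inflation
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: M[i][j])    # strongest flow into column j
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())

# Two triangles joined by the edge 2-3.
A = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
clusters = mcl(A, inflation=2.0)
```

Raising the inflation value sharpens the contrast between strong and weak flow and tends to produce more, smaller clusters, which is why it is the parameter tuned in the comparison.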
Discussion and conclusion
PPI networks are typical examples of complex biological systems that are difficult to understand from raw experimental data alone. Algorithmic and modeling progress in the analysis of biomolecular networks could contribute to the understanding of biological processes and organization. Many methods have been developed to organize, display and extract significant patterns in these systems.
A number of network clustering algorithms have been proposed to find modular structures in PPI networks and other biological networks. Our work is a further development along this line for dissecting biological systems. Here we introduce the quantitative measure modularity density (D), designed for exploring the modular organization of networks, to the field of biomolecular networks. We suggest the SA technique to maximize it for rigorous evaluation, and we propose an efficient spectral k-means method for the decomposition procedure. Our comparative experiments with MCL and MQ on a yeast PPI network show that the MD (SpeMD) method can effectively detect protein interaction modules in a complex interaction network. In the current study, we use known complexes to choose the optimal λ as well as the inflation parameter for the MCL algorithm. However, we could also adopt an intrinsic measure, which compares the resulting modules against the original network, to choose the most appropriate parameter in an unsupervised manner. For example, van Dongen suggested the so-called efficiency measure to assess the performance of network clustering. The present method can therefore be easily adapted into a fully self-contained method that does not rely on any known data or given parameters. The current algorithm, like most clustering methods, uncovers only disjoint modules (clusters). In real biological systems, however, proteins can belong to more than one functional module or complex. Zhang et al. (2007c) suggested applying the fuzzy c-means clustering method in a spectral space to uncover fuzzy modules. Overlap can also be addressed using an intrinsic measure based on the original network, as suggested previously, by post-processing the modules obtained from the present algorithm.
Modularity Q has been extensively employed for dissecting and evaluating the modular organization of biomolecular networks [3, 17–19] as well as for clustering graph representations of gene expression profile data. However, the severe resolution limit of modularity Q means that researchers should use it cautiously, and the modularity density D may become an alternative measure for achieving these goals.
In summary, our method is very effective for uncovering modular organization in biomolecular networks. It provides an objective approach to exploring the organization and interactions of biological processes. With the increasing amount of biological ‘interaction’ data available, MD (SpeMD) can facilitate the construction of a more complete view of the composition and interconnection of functional modules and the understanding of the organization of the whole cell at the system level. We plan to automate this algorithm to compute functional modules for a large number of biological networks. We hope that related studies will benefit from the present method coupled with the modularity density D (Dλ).
This work is partially supported by the National Natural Science Foundation of China under grants No. 10631070 and No. 60873205, the Innovation Project of the Chinese Academy of Sciences (kjcs-yw-s7), the Ministry of Science and Technology, China, under grant No. 2006CB503905, and the ‘Special Presidential Prize and Excellent PhD Thesis Award’ Scientific Research Foundation of CAS.
This article has been published as part of BMC Systems Biology Volume 4 Supplement 2, 2010: Selected articles from the Third International Symposium on Optimization and Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/4?issue=S2
1. Barabasi A, Oltvai Z: Network biology: understanding the cell's functional organization. Nature Rev. Gen. 2004, 5: 101-113. 10.1038/nrg1272.
2. Zhang S, Jin G, Zhang XS, Chen L: Discovering functions and revealing mechanisms at molecular level from biological networks. Proteomics. 2007, 7: 2856-2869. 10.1002/pmic.200700095.
3. Zhao J, Yu H, Luo JH, Cao ZW, Li YX: Hierarchical modularity of nested bow-ties in metabolic networks. BMC Bioinformatics. 2006, 7: 386. 10.1186/1471-2105-7-386.
4. Brun C, Chevenet F, Martin D, Wojcik J, Guenoche A, Jacq B: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol. 2003, 5: R6. 10.1186/gb-2003-5-1-r6.
5. Lu H, Zhu X, Liu H, Skogerbo G, Zhang J, Zhang Y, Cai L, Zhao Y, Sun S, Xu J, Bu D, Chen R: The interactome as a tree - an attempt to visualize the protein-protein interaction network in yeast. Nucleic Acids Res. 2004, 32: 4804-4811. 10.1093/nar/gkh814.
6. Rives AW, Galitski T: Modular organization of cellular networks. Proc. Natl. Acad. Sci. USA. 2003, 100: 1128-1133. 10.1073/pnas.0237338100.
7. Zhang S, Ning XM, Zhang XS: Graph kernels, hierarchical clustering, network community structure: experiment and comparative analysis. Eur. Phys. J. B. 2007, 57: 67-74. 10.1140/epjb/e2007-00146-y.
8. Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. USA. 2003, 100: 12123-12126. 10.1073/pnas.2032324100.
9. Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4: 2. 10.1186/1471-2105-4-2.
10. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al.: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res. 2003, 31: 2443-2450. 10.1093/nar/gkg340.
11. King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics. 2004, 20: 3013-3020. 10.1093/bioinformatics/bth351.
12. Palla G, Derényi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks. Nature. 2005, 435: 814-818. 10.1038/nature03607.
13. Cho Y, Hwang W, Ramanathan M, Zhang A: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics. 2007, 8: 265. 10.1186/1471-2105-8-265.
14. Segal E, Wang H, Koller D: Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics. 2003, 19 (S1): 264-272. 10.1093/bioinformatics/btg1037.
15. Newman ME, Girvan M: Finding and evaluating community structure in networks. Phys. Rev. E. 2004, 69: 026113. 10.1103/PhysRevE.69.026113.
16. Newman MEJ: Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA. 2006, 103: 8577-8582. 10.1073/pnas.0601602103.
17. Guimerà R, Amaral LAN: Functional cartography of complex metabolic networks. Nature. 2005, 433: 895-900. 10.1038/nature03288.
18. Caretta-Cartozo C, De Los Rios P, Piazza F, et al.: Bottleneck Genes and Community Structure in the Cell Cycle Network of S. pombe. PLoS Comput. Biol. 2007, 3: e103. 10.1371/journal.pcbi.0030103.
19. Wang Z, Zhang J: In search of the biological significance of modular structures in protein networks. PLoS Comput. Biol. 2007, 3: e107. 10.1371/journal.pcbi.0030107.
20. Fortunato S, Barthélemy M: Resolution limit in community detection. Proc. Natl. Acad. Sci. USA. 2007, 104: 36-41. 10.1073/pnas.0605965104.
21. Li Z, Zhang S, Wang RS, Zhang XS, Chen L: Quantitative function for community detection. Physical Review E. 2008, 77: 036109. 10.1103/PhysRevE.77.036109.
22. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science. 2002, 297: 1551-1555. 10.1126/science.1073374.
23. Bach F, Jordan M: Learning spectral clustering. In Proceedings of 17th Advances in Neural Information Processing Systems. 2004.
24. White S, Smyth P: A spectral clustering approach to finding communities in graphs. In Proceedings of SIAM International Conference on Data Mining. 2005.
25. Zhang S, Wang RS, Zhang XS: Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A. 2007, 374: 483-490. 10.1016/j.physa.2006.07.023.
26. Ng A, Jordan M, Weiss Y: On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Systems. 2002, 14: 849-856.
27. Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 2002, 30: 31-34. 10.1093/nar/30.1.31.
28. Brohée S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006, 7: 488. 10.1186/1471-2105-7-488.
29. van Dongen S: Graph clustering by flow simulation. PhD thesis, University of Utrecht, Centers for mathematics and computer science (CWI). 2000.
30. Friedel CC, Krumsiek J, Zimmer R: Bootstrapping the interactome: unsupervised identification of protein complexes in yeast. RECOMB. 2008, 4955: 3-16.
31. Stone EA, Ayroles JF: Modulated modularity clustering as an exploratory tool for functional genomic inference. PLoS Genetics. 2009, 5: e1000479. 10.1371/journal.pgen.1000479.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.