Revisiting the variation of clustering coefficient of biological networks suggests new modular structure
© Hao et al.; licensee BioMed Central Ltd. 2012
Received: 12 October 2011
Accepted: 16 February 2012
Published: 1 May 2012
A central idea in biology is the hierarchical organization of cellular processes. A commonly used method to identify the hierarchical modular organization of a network relies on detecting a global signature known as the variation of the clustering coefficient (so-called modularity scaling). Although several studies have suggested other possible origins of this signature, it is still widely used to identify hierarchical modularity, especially in the analysis of biological networks. Therefore, a further, systematic investigation of this signature for different types of biological networks is necessary.
We analyzed a variety of biological networks and found that the commonly used signature of hierarchical modularity is actually a reflection of spoke-like topology, suggesting a different view of network architecture. We proved that the existence of super-hubs is the reason why the clustering coefficient of a node follows a particular scaling law with degree k in metabolic networks. To study the modularity of biological networks, we systematically investigated the relationship between the repulsion of hubs and the variation of the clustering coefficient. We provided direct evidence that repulsion between hubs is the underlying origin of the variation of the clustering coefficient, and found that for biological networks with no anti-correlation between hubs, such as the gene co-expression network, the clustering coefficient does not depend on degree.
Here we have shown that the variation of the clustering coefficient is neither sufficient nor exclusive for a network to be hierarchical. Our results suggest the existence of spoke-like modules as opposed to the “deterministic model” of hierarchical modularity, and suggest the need to reconsider the organizational principle of biological hierarchy.
The close relationship between functional organization and topological features has motivated the development of statistical measures to characterize cellular networks. Increasingly, these measures reveal that biological network organization is characterized by a power-law degree distribution, modularity, and degree correlations between connected nodes [1–3]. Networks with high modularity have dense connections between nodes within the same cellular function but sparse connections between nodes in different functions. Furthermore, a central theory in biology is the hierarchical organization of cellular processes, which means that high-level processes are built by connecting low-level ones [4, 5]. For example, mitosis is composed of several low-level functions, such as spindle assembly, centrosome separation and chromosome alignment. Consequently, it is reasonable to suppose that functional modules of interest are hierarchically organized in the same way: small modules are combined into larger modules, which are then further combined into even larger ones. This complexity poses great challenges to researchers trying to understand the modular structure of cellular networks.
To identify the hierarchical modularity of metabolic networks, Ravasz et al. focused on detecting a “global signature” of network architecture [6, 7]. They showed that for metabolic networks and for certain hierarchical networks the clustering coefficient C(k) of a node follows a scaling law with degree k, C(k) ~ k^{-1}. To explain this, they proposed a network model that possesses both a power-law degree distribution and the scaling law of C(k). The starting point of this model is a small cluster of five fully connected nodes. Four identical replicas of this cluster are then created, and the peripheral nodes of each replica are connected to the central node of the original cluster, resulting in a larger 25-node cluster. Next, four replicas of this 25-node cluster are generated and the 16 peripheral nodes of each replica are connected to the central node of the original cluster, yielding a cluster of 125 nodes. These replication and connection steps can be repeated indefinitely to generate a hierarchical architecture. At step i, the number of nodes in the network is 5^i. This network model, which we explicitly refer to as the “deterministic hierarchical model”, has since had a great influence on studies in network biology [8, 9], and the scaling of C(k) is now widely used to decide whether or not a network is hierarchically organized.
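The replication steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the original authors' code: the node labelling (node 0 as the central node, the remaining labels assigned by offset) and the function names `hierarchical_model` and `local_clustering` are assumptions made here for clarity.

```python
from itertools import combinations


def hierarchical_model(levels):
    """Return (node_count, edge_set) after `levels` construction steps."""
    # Step 1: five fully connected nodes; node 0 is central, 1-4 are peripheral.
    edges = set(combinations(range(5), 2))
    peripheral = [1, 2, 3, 4]
    n = 5
    for _ in range(levels - 1):
        new_edges = set(edges)
        new_peripheral = []
        for replica in range(1, 5):
            offset = replica * n
            # Copy the current cluster with shifted node labels.
            new_edges |= {(a + offset, b + offset) for a, b in edges}
            # Connect the replica's peripheral nodes to the original central node.
            for p in peripheral:
                new_edges.add((0, p + offset))
                new_peripheral.append(p + offset)
        edges, peripheral, n = new_edges, new_peripheral, n * 5
    return n, edges


def local_clustering(edges, node):
    """Fraction of pairs of `node`'s neighbours that are themselves linked."""
    neighbours = {b if a == node else a for a, b in edges if node in (a, b)}
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in edges if a in neighbours and b in neighbours)
    return 2.0 * links / (k * (k - 1))


n, edges = hierarchical_model(3)
print(n, len(edges), local_clustering(edges, 0))
```

Comparing `local_clustering` for the central node at successive steps shows its clustering coefficient falling as its degree grows, which is the C(k) dependence the model was built to reproduce.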
Two earlier studies suggested that the decrease of C(k) can be tentatively attributed to the tendency of large-degree nodes to connect to small-degree ones in biological networks [1, 10]. For example, Soffer and Vazquez proposed a novel measure of the clustering coefficient that takes the degrees of a node's neighbors into account and that does not scale with k. Their work suggested that the variation of C(k) can be attributed to the neighborhood degree distribution. However, the “deterministic model” is also anti-correlated. Thus, it is still possible that both the degree anti-correlation and the variation of C(k) are reflections of hierarchy, suggesting that a proper “null model” is needed to clarify their relationship. Moreover, metabolic networks are nicely approximated by C(k) ~ k^{-1}, providing strong evidence for the existence of hierarchy in these networks. However, to the best of our knowledge, earlier studies did not directly explain why C(k) strictly follows this scaling law (k^{-1}) in metabolic networks. These may be the reasons why the variation of C(k) is still widely used in assessing biological network hierarchy. In fact, almost every study on biological networks that observed the variation of C(k), including protein-protein networks, functional networks, human disease networks and even ecological networks, claimed to have found a hierarchical modular structure; see, for example, [11–17]. This situation suggests that a further, systematic investigation of the clustering coefficient across different types of biological networks is necessary. In this work we reveal the reason why C(k) scales with k^{-1} in metabolic networks and show, using a “null model”, that the variation of C(k) is neither sufficient nor exclusive for a hierarchical network. Our findings suggest the existence of spoke-like topology as opposed to the “deterministic hierarchical model”.
Results and Discussion
Origin of the scaling law in metabolic networks
One may argue that, given the degree distribution, the hierarchical structure of a network is largely defined, so it is not surprising that random networks generated this way show a similar dependence of the clustering coefficient on node degree. To rule out this possibility, we next investigated whether the scaling law of C(k) can be reproduced in a totally random network. For this purpose, we first generated a random scale-free network of 10,000 nodes with degrees following P(k) ~ k^{-2.6}, and then added several large-degree nodes not expected from the degree distribution (Figure 2C, using preferential attachment). As shown in Figure 2D, C(k) of the network with several large-degree nodes added appears to scale with k^{-1}, whereas the original network shows no variation. Therefore, several super-hubs are sufficient to give rise to the scaling law of C(k). This result is reasonable. For example, the metabolic network of E. coli has 2,409 edges with average clustering coefficient and degree of the largest hub . To keep the same value of clustering coefficient for this super-hub, its neighbors would have to be connected by edges, which is nearly 3 times the number of edges in the network!
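The arithmetic behind this argument follows from the standard definition of the local clustering coefficient, C = 2E / (k(k - 1)), where E is the number of links among the k neighbors of a node. The short sketch below rearranges that definition to give the number of neighbor-neighbor links a hub would need in order to hold a given C. The function name and the numbers in the loop are purely illustrative assumptions; they are not the E. coli values.

```python
def neighbour_links_needed(clustering, degree):
    """Links among a node's neighbours implied by C = 2E / (k (k - 1))."""
    return clustering * degree * (degree - 1) / 2.0


# Hypothetical values for illustration only (not taken from the E. coli network).
for k in (10, 100, 1000):
    print(k, neighbour_links_needed(0.1, k))
```

Because the required number of links grows roughly as k squared at fixed C, a hub whose degree is comparable to the total number of edges cannot keep the network-average clustering coefficient.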
We next present an analytical investigation of this result. Consider an undirected random network with S nodes, M edges and average clustering coefficient . The probability that a newly added node j has a link with a node i is , and thus the expected number of edges by which the newly added node j connects to i is , where k_i and k_j are the degrees of nodes i and j, and the function min() ensures that at most one edge connects two nodes. In a random network with no degree correlation, the average degree of the neighbors of a node is the average degree of the network, <k>. Thus, the expected number of edges connecting the newly added node to a neighbor of node i is . The number of newly added triangles involving node i that are generated by node j connecting to both node i and its neighbors can be roughly estimated by . For a node j with small degree k_j, takes the value , and thus , where is determined by k_j. The clustering coefficient of node i is then , which does not vary with degree k_i. However, for a node j with large k_j, takes the value and thus . Considering that the clustering coefficient of a random scale-free network is extremely small (, for example, there are thousands of triangles in biological networks, whereas there are only tens of triangles in random networks of similar size), the is now mainly determined by , thus . To test this, we constructed a network with 10,000 nodes following the distribution P(k) ~ k^{-2.6}, which has only 64 triangles in total and thus the . We then randomly added either 2 nodes with degree 2,000 or 200 nodes with degree 20 to this network, and counted the number of newly added triangles as a function of degree k (Figure 2E). Although the number of newly added edges is the same in both cases, the number of newly added triangles increases at different rates as a function of k: in the first case and in the second case, respectively!
As a result, the clustering coefficient shows a perfect scaling dependence on node degree in the first case, whereas it does not vary with k in the second case (Figure 2F). This striking difference comes from the restriction . For nodes with small degrees, takes the value , whereas for nodes with large degrees, takes the value . Notably, this formula reflects the fact that at most one edge connects any two nodes in these biological networks. Hence, this formula implies that connections between large-degree nodes in metabolic networks are highly suppressed compared to a random network with no constraints on edge multiplicity. For example, the two largest hubs in the metabolic network of E. coli would be connected by edges without constraints on edge multiplicity! In this case, a large-degree node is forced to connect to small-degree ones; as a consequence, its clustering coefficient is relatively small.
It should be noted that the clustering coefficient in the first case is at least an order of magnitude larger than that of the network in the second case, suggesting that the existence of super-hubs is one of the origins of high clustering of metabolic networks. Thus, when the level of clustering coefficient is regarded as a measure of modularity level, the existence of super-hubs should be considered, otherwise the modularity level of biological networks would be highly overestimated [7, 18].
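The numerical test described in this section (adding new nodes by preferential attachment and counting the triangles they create) can be sketched as follows. This is an assumption-laden illustration, not the authors' procedure: the graph is stored as a dict of neighbor sets with integer labels, the helper names are hypothetical, and drawing targets with replacement while discarding duplicates only approximates degree-proportional sampling. Discarding duplicates is what enforces the constraint of at most one edge per node pair.

```python
import random


def attach_preferentially(adj, k, rng):
    """Add one node with k distinct neighbours, favouring high-degree targets."""
    # Each node appears once per link it has, so rng.choice is degree-weighted.
    stubs = [v for v, nbrs in adj.items() for _ in nbrs]
    if k > len(set(stubs)):
        raise ValueError("k exceeds the number of nodes with at least one link")
    targets = set()
    while len(targets) < k:
        targets.add(rng.choice(stubs))  # repeated picks collapse: no multi-edges
    new = max(adj) + 1
    adj[new] = set(targets)
    for t in targets:
        adj[t].add(new)
    return new


def triangles_through(adj, node):
    """Number of triangles that contain `node`."""
    nbrs = adj[node]
    # Every linked pair of neighbours is counted twice, once from each end.
    return sum(len(adj[u] & nbrs) for u in nbrs) // 2


# Tiny demonstration on a single triangle.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
hub = attach_preferentially(adj, 3, random.Random(0))
print(hub, triangles_through(adj, hub))
```

Applying `attach_preferentially` with a few large values of k, or with many small ones, to the same seed network and comparing `triangles_through` for the existing nodes would mirror the two cases contrasted in Figure 2E.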
Variation of C(k) is a reflection of degree correlation
Next, we asked whether the existence of super-hubs is the only reason that biological networks show a dependence of the clustering coefficient on node degree. We found that for other biological networks, the C(k) curve can differ markedly from that of random networks with the same degree distribution (for example, the protein-protein interaction network and the genetic synthetic lethal network; Additional file 1: Figure S1), suggesting that the variation of C(k) cannot be attributed solely to the existence of super-hubs. For metabolic networks, we have shown that the dependence of the clustering coefficient on node degree has its origin in the suppression of hub-hub connections (). Hence, it is possible that even without super-hubs, the anti-correlation between hubs is enough to cause the variation of C(k). Earlier studies have found that many biological networks are disassortative, indicating that strong repulsion between hubs is frequently observed [1, 19–21].
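One common way to quantify whether a network is disassortative is the degree assortativity coefficient, that is, the Pearson correlation between the degrees found at the two ends of each edge; negative values indicate that high-degree nodes tend to attach to low-degree ones. The sketch below is an illustrative implementation under that definition, not necessarily the measure used for the figures here, and the function name is an assumption.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Count every undirected edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for a, b in edges:
        xs += [degree[a], degree[b]]
        ys += [degree[b], degree[a]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys contain the same values, so they share a mean
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("all edge ends have the same degree; r is undefined")
    return cov / var


# A single spoke (star) is the extreme disassortative case.
star = [(0, i) for i in range(1, 6)]
print(degree_assortativity(star))
```

A pure spoke, in which one hub connects only to degree-1 nodes, gives the most negative value possible, which is why an abundance of spoke-like topology and disassortativity go hand in hand.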
One concern is that a few edges of the generated random networks may overlap with those of the biological networks, so that hierarchical structure is conserved in the null networks. To rule out this possibility, we further generated much more stringent but uncorrelated random networks in which a large fraction of edges overlap with the biological network. Even so, we found that the variation of C(k) largely disappeared (Additional file 2: Figure S2). One should also note that the clustering coefficient distributions in Figure 3 deviate clearly from any scaling law C(k) ~ k^{-β}, further suggesting that biological networks cannot be characterized by the “deterministic hierarchical model”.
Spoke versus “deterministic hierarchical model”
The “deterministic hierarchical model” suggests that the variation of C(k) is caused by a rigid hierarchy that is built by connecting the external nodes of low-level dispersed modules to the central nodes of a high-level module (Figure 4A). However, our results suggest that the variation of C(k) in biological networks is caused by the abundance of large-degree nodes connecting to nodes of much smaller degree, which we refer to as a heuristic “spoke” model (Figure 4B). The two models can be easily checked by visualizing the connections of a few hubs in a real network. Figure 4B shows the connections of the top 6 best-connected proteins and their neighbors in a small protein interaction network formed by proteins localized in the nucleus according to a high-confidence dataset (Figure 4C). Evidently, the protein network supports the picture of the “spoke” model rather than the rigid hierarchy of the “deterministic hierarchical model”.
What do our results suggest for the concept of modularity? First of all, they suggest the existence of functional modules that are spoke-like or built by connecting spoke-like topologies. This new view includes many biological modules that cannot be revealed by finding densely connected regions such as cliques or k-cores. For example, the functional module associated with cell wall organization is built by connecting several spokes (Additional file 3: Figure S3). Many biological pathways that include an enzyme and tens of its substrates may be better depicted by this view of modularity. We found that even protein complexes can be spoke-like. Figure 4D shows three protein complexes of S. cerevisiae, of which the FBP degradation complex and the nucleolar ribonuclease P complex are each built from a single spoke, while the mitochondrial ribosomal small subunit is built by connecting two spokes centered on mrp4 and mrps5, respectively. However, we stress that the traditional idea of modularity as finding densely connected regions is still useful in identifying cellular machines. In fact, the protein network integrates “spoke” topology and densely connected regions into a highly interconnected web. A single molecule can be both a member of a clique and a member of a spoke-like topology. For example, srb4 encodes a core component of the SRB mediator complex of S. cerevisiae and is required for transcription of most yeast genes. However, the function of srb4 also relies on interactions with many poorly connected genes outside the complex, such as cbs1, a mitochondrial translational activator of cob mRNA, resulting in a large spoke centered on srb4 (Figure 4E). This explains why C(k) shows a negative dependence on node degree in the protein network, even though it contains a large number of protein complexes.
New hierarchical modularity paradigm
Finally, our work raises two fundamental questions: what drives the emergence of spoke-like topology during evolution, and how low-level modules communicate with each other to generate high-level ones. A possible answer to the first question is that suppression of connections between hubs confines mutational perturbations locally. It is widely accepted that hub genes are more essential than poorly connected genes. Thus, the overabundance of spoke-like topology may reduce the cumulative effect of mutational perturbations of two directly connected hubs. Another possible answer to the first question is that the overabundance of spoke-like topology shortens the distance between molecules, so that signals propagate more quickly. A molecule connected to a hub can propagate its signal more easily than a molecule connected to a poorly connected node. Given that most molecules in biological networks are poorly connected, this may be one of the reasons why these networks favor spoke-like topology. This speculation is supported by the finding that more nodes fail to connect to the largest component, and thus to propagate their signal, in an assortative network (e.g., a social network) than in a disassortative network (e.g., the World Wide Web) [19, 22].
It is widely accepted that biological hierarchy can be well characterized by a “deterministic hierarchical model”, because it reconciles modularity and scale-freeness, with C(k) following a scaling law. A later study further developed a more general power law of C(k) to identify hierarchical networks. Although the model successfully shows that C(k) of a “deterministic hierarchical model” network follows the scaling law C(k) ~ k^{-1}, there is no evidence that a network following this scaling law is necessarily hierarchical. Therefore, the scaling law is not sufficient to identify network hierarchy. Further evidence comes from the fact that many networks with no significant variation of C(k) are also hierarchically organized. It has been found that many complex systems have hierarchical organization, including social networks, which are known to be assortative and lack the variation of C(k) [26, 27]. These studies further suggest that the scaling of C(k) is neither a sufficient nor a necessary condition for a network to be hierarchical. Although two earlier studies pointed out the shortcomings of using the variation of C(k) in assessing network hierarchy [10, 26], our study provides further and more direct evidence. Many sophisticated models have since been developed to reproduce both the variation of C(k) and the degree distribution. However, since the variation of C(k) is still widely used as a standard indicator of hierarchical network structure, it is necessary to point out specifically the limitations of the “deterministic hierarchical model”. By doing this, our study suggests the need to reconsider the modular nature of biological systems. In particular, we stress the importance of overlap in the communication between different modules. Our study may be applicable to other complex networks as well, such as the WWW, for which the variation of C(k) was also interpreted as evidence of network hierarchy.
Our analysis includes four types of biological networks of yeast: the physical protein interaction network, the genetic synthetic lethal network, the gene co-expression network and the metabolic network. The protein-protein interaction dataset was obtained from DIP (version 10/2010). To display the organization of the top 6 best-connected nuclear proteins, a high-confidence dataset curated from the literature and high-throughput sources was used, with subcellular localization taken from the MIPS annotation. The synthetic lethal interaction dataset was obtained from BioGRID (version 3.1.72), and the metabolic networks of 43 organisms were obtained from Jeong et al. The gene co-expression network was constructed from the yeast cell cycle expression data. Arrays in which more than 10% of the gene expression values were missing were removed, as were genes whose expression values were missing in more than 7 arrays. Then, the Pearson correlation coefficient was calculated for every gene pair, and only gene pairs whose coefficient had an absolute value larger than 0.65 were used to construct the gene co-expression network.
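The co-expression filtering and thresholding steps can be sketched as below. Several details are assumptions made for this illustration because the text does not specify them: missing values are represented as `None`, arrays are filtered before genes, correlations are computed over the arrays where both genes have a value, and the function and parameter names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation over positions where both profiles are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpression_edges(expr, max_array_missing=0.10, max_gene_missing=7, cutoff=0.65):
    """expr maps gene -> list of values (None = missing), one entry per array."""
    genes = list(expr)
    n_arrays = len(expr[genes[0]])
    # 1. Drop arrays in which more than 10% of the genes lack a value.
    keep = [j for j in range(n_arrays)
            if sum(expr[g][j] is None for g in genes) <= max_array_missing * len(genes)]
    filtered = {g: [expr[g][j] for j in keep] for g in genes}
    # 2. Drop genes with missing values in more than 7 of the remaining arrays.
    filtered = {g: v for g, v in filtered.items() if v.count(None) <= max_gene_missing}
    # 3. Link gene pairs whose absolute correlation exceeds the cutoff.
    names = sorted(filtered)
    return [(g, h) for i, g in enumerate(names) for h in names[i + 1:]
            if abs(pearson(filtered[g], filtered[h])) > cutoff]


toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1], "d": [4, 3, 2, 1]}
print(coexpression_edges(toy))
```

Using the absolute value of the correlation means that strongly anti-correlated gene pairs are linked as well as strongly co-expressed ones, as the description above implies.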
To generate seed networks that preserve the joint probability P(k_i, k_j), we draw nodes from the degree distribution P(k) for each degree k, and then form a node set S containing k_i copies of each node i, where N denotes the number of nodes in the biological network. Then, we select two nodes at random from S, connect them to generate a new random network, and remove them from S. Each time, we estimate the joint probability R(k_i, k_j) in the random network and test if . When the condition is not fulfilled, we discard the two nodes and draw two new ones from S. This step is repeated until for all the degrees.
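The draw-connect-or-discard loop described above resembles stub matching with rejection, which can be sketched as follows. This is a simplified illustration and not the authors' implementation: the acceptance test on the joint probability is not recoverable from the text, so `accept` is a hypothetical placeholder that accepts every pair by default. Rejecting self-loops and repeated edges, and stopping after `max_tries` draws, are further assumptions made here.

```python
import random


def stub_matching(degrees, accept=lambda ki, kj: True, seed=0, max_tries=100000):
    """Randomly pair degree 'stubs' into a simple graph, with rejection.

    `accept(ki, kj)` stands in for the joint-probability check described in
    the text; it is a placeholder, not the original condition.
    """
    rng = random.Random(seed)
    # The set S: node i appears degrees[i] times.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    edges = set()
    tries = 0
    while len(stubs) >= 2 and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(stubs)), 2)
        a, b = stubs[i], stubs[j]
        edge = (min(a, b), max(a, b))
        if a == b or edge in edges or not accept(degrees[a], degrees[b]):
            continue  # discard this pair and draw two new stubs
        edges.add(edge)
        # Remove the two used stubs, larger index first so positions stay valid.
        for idx in sorted((i, j), reverse=True):
            stubs.pop(idx)
    return edges


print(sorted(stub_matching([3, 2, 2, 1, 1, 1], seed=1)))
```

Because pairs can be rejected, the loop may end with a few unmatched stubs; a production version would need a restart or rewiring strategy, which is outside the scope of this sketch.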
CL was supported by the National Natural Science Foundation of China (Grant No. 30600367) and the Foundation of Harbin Medical University (WLD-QN1107).
1. Maslov S, Sneppen K: Specificity and stability in topology of protein networks. Science 2002, 296(5569):910-913. doi:10.1126/science.1065103
2. Barabasi AL, Albert R: Emergence of scaling in random networks. Science 1999, 286(5439):509-512. doi:10.1126/science.286.5439.509
3. Girvan M, Newman ME: Community structure in social and biological networks. Proc Natl Acad Sci USA 2002, 99(12):7821-7826. doi:10.1073/pnas.122653799
4. Alon U: Biological networks: the tinkerer as an engineer. Science 2003, 301(5641):1866-1867. doi:10.1126/science.1089072
5. Hartwell LH, Hopfield JJ, Leibler S, Murray AW: From molecular to modular cell biology. Nature 1999, 402(6761 Suppl):C47-52.
6. Ravasz E: Detecting hierarchical modularity in biological networks. Methods Mol Biol 2009, 541:145-160. doi:10.1007/978-1-59745-243-4_7
7. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science 2002, 297(5586):1551-1555. doi:10.1126/science.1073374
8. Alm E, Arkin AP: Biological networks. Curr Opin Struct Biol 2003, 13(2):193-202. doi:10.1016/S0959-440X(03)00031-9
9. Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5(2):101-113. doi:10.1038/nrg1272
10. Soffer SN, Vazquez A: Network clustering coefficient without degree-correlation biases. Phys Rev E Stat Nonlin Soft Matter Phys 2005, 71(5 Pt 2):057101.
11. Yook SH, Oltvai ZN, Barabasi AL: Functional and topological characterization of protein interaction networks. Proteomics 2004, 4(4):928-942. doi:10.1002/pmic.200300636
12. Hawkins T, Chitale M, Kihara D: Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP. BMC Bioinformatics 2010, 11:265. doi:10.1186/1471-2105-11-265
13. Kumar G, Ranganathan S: Network analysis of human protein location. BMC Bioinformatics 2010, 11(Suppl 7):S9.
14. Zhang X, Zhang R, Jiang Y, Sun P, Tang G, Wang X, Lv H, Li X: The expanded human disease network combining protein-protein interaction information. Eur J Hum Genet 2011.
15. Zhou J, Deng Y, Luo F, He Z, Tu Q, Zhi X: mBio 2010, 1(4):e00169-10. doi:10.1128/mBio.00169-10
16. Wuchty S, Almaas E: Evolutionary cores of domain co-occurrence networks. BMC Evol Biol 2005, 5:24. doi:10.1186/1471-2148-5-24
17. Muller H, Acquati F: Topological properties of co-occurrence networks in published gene expression signatures. Bioinform Biol Insights 2008, 2:203-213.
18. Wagner A, Fell DA: The small world inside large metabolic networks. Proc Biol Sci 2001, 268(1478):1803-1810. doi:10.1098/rspb.2001.1711
19. Newman ME: Assortative mixing in networks. Phys Rev Lett 2002, 89(20):208701.
20. Song C, Havlin S, Makse HA: Origins of fractality in the growth of complex networks. Nature Physics 2006, 2:275-281. doi:10.1038/nphys266
21. Yamada T, Bork P: Evolution of biomolecular networks: lessons from metabolic and protein interactions. Nat Rev Mol Cell Biol 2009, 10(11):791-803. doi:10.1038/nrm2787
22. Newman ME: Mixing patterns in networks. Phys Rev E Stat Nonlin Soft Matter Phys 2003, 67(2 Pt 2):026126.
23. Ravasz E, Barabasi AL: Hierarchical organization in complex networks. Phys Rev E Stat Nonlin Soft Matter Phys 2003, 67(2 Pt 2):026112.
24. Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD, Tyers M: Still stratus not altocumulus: further evidence against the date/party hub distinction. PLoS Biol 2007, 5(6):e154. doi:10.1371/journal.pbio.0050154
25. Belli G, Gari E, Aldea M, Herrero E: Osmotic stress causes a G1 cell cycle delay and downregulation of Cln3/Cdc28 activity in Saccharomyces cerevisiae. Mol Microbiol 2001, 39(4):1022-1035. doi:10.1046/j.1365-2958.2001.02297.x
26. Sales-Pardo M, Guimera R, Moreira AA, Amaral LA: Extracting the hierarchical organization of complex systems. Proc Natl Acad Sci USA 2007, 104(39):15224-15229. doi:10.1073/pnas.0703740104
27. Clauset A, Moore C, Newman ME: Hierarchical structure and the prediction of missing links in networks. Nature 2008, 453(7191):98-101. doi:10.1038/nature06830
28. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32(Database issue):D449-D451. doi:10.1093/nar/gkh086
29. Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, et al.: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 2004, 32(Database issue):D41-D44. doi:10.1093/nar/gkh092
30. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34(Database issue):D535-D539. doi:10.1093/nar/gkj109
31. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407(6804):651-654. doi:10.1038/35036627
32. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273-3297.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.