On the origin of distribution patterns of motifs in biological networks
© Konagurthu and Lesk; licensee BioMed Central Ltd. 2008
Received: 20 December 2007
Accepted: 12 August 2008
Published: 12 August 2008
Inventories of small subgraphs in biological networks have identified commonly-recurring patterns, called motifs. The inference that these motifs have been selected for function rests on the idea that their occurrences are significantly more frequent than random.
Our analysis of several large biological networks suggests, in contrast, that the frequencies of appearance of common subgraphs are similar in natural and corresponding random networks.
Indeed, certain topological features of biological networks give rise naturally to the common appearance of the motifs. We therefore question whether frequencies of occurrences are reasonable evidence that the structures of motifs have been selected for their functional contribution to the operation of networks.
The network or directed graph description has become the preferred representation of the integrated activity of components of biological processes. The exponential growth of biological network data in the last five years has its source in recent advances in technologies such as mass spectrometry, genome-scale ChIP-chip experiments, yeast two-hybrid assays, combinatorial reverse genetic screens, and rapid literature mining techniques.
The science of systems biology has the aim of understanding the functional constraints and design principles of biological networks. Alon and colleagues were the first to introduce the notion of "motifs" in biological networks [2, 3]. Motifs are small patterns observed to recur throughout a network, with frequencies statistically higher than expected in random networks of similar connectivity parameters. Since the introduction of this concept, motifs have been reported in many biological networks: metabolic, signaling pathway, protein-protein interaction, and ecological networks amongst others [2–6]. Moreover, the prevalence of a motif is often taken as evidence that it has been evolutionarily selected to implement a specific function [2, 3, 7]. Motifs are believed to be building blocks of the functional architecture of a biological network.
In this paper we investigate the role of small network subgraphs as building blocks of biological networks. We analysed several biological networks: transcription regulation networks of Saccharomyces cerevisiae under different physiological conditions, the transcription regulation network of Escherichia coli, and a neuronal signalling pathway network of the hippocampal CA1 neuron.
Contrary to previous reports, we find that commonly accepted motifs are neither over- nor under-represented in these real networks in comparison to their random formulations. We discuss how the topology of biological networks automatically predisposes them to contain a certain distribution of motifs. This suggests that the evidence for the functional significance of motifs should be reevaluated.
We use the transcription regulatory networks of Saccharomyces cerevisiae under various physiological conditions (composite, cell cycle, sporulation, diauxic shift, DNA damage, and stress response) published by Luscombe and coworkers. Their largest (composite) network contains 3459 nodes and 7014 interactions (http://networks.gersteinlab.org/regulation/dynamics/index2.html).
To aid comparison of our work with that of Shen-Orr et al., we also use their Escherichia coli transcription network containing 424 nodes and 577 interactions (http://www.weizmann.ac.il/mcb/UriAlon/Network_motifs_in_coli/ColiNet-1.0/).
Additionally, we use the neuronal signalling pathway network of the hippocampal CA1 neuron published by Máayan and colleagues, containing 594 nodes and 1422 interactions (http://www.mssm.edu/labs/iyengar/).
We implemented Ullmann's algorithm for subgraph isomorphism to enumerate fixed-size subgraph patterns (e.g. FFL, 3-cycle).
In enumerating variable sized (maximal) subgraph patterns such as SIMs and MIMs, we used our algorithms described in . We note that Bifans are counted as MIMs with exactly two elements each in both parent and child sets. (See Definitions.)
To generate random networks conserving the degree sequence of the real network, we use the method described by Shen-Orr et al.: starting with the same number of nodes as in the original network, each node in the random graph is assigned the same number of in- and out-"edge-stubs" as its counterpart in the original network. Randomly chosen pairs of in- and out-edge-stubs are then joined, giving rise to a random (directed) graph.
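The stub-joining step can be illustrated with a minimal Python sketch. It assumes the network is given as a list of (source, target) pairs; the function names `random_graph_from_degrees` and `degree_sequence` are hypothetical and not taken from the original work. This simple version may produce self-edges or repeated edges, and the text does not say how such cases were handled, so no filtering is attempted here.

```python
import random
from collections import Counter

def random_graph_from_degrees(edges, seed=None):
    """Join randomly paired out- and in-edge-stubs.

    Every node keeps the in-degree and out-degree it has in `edges`.
    Assumption: self-edges and duplicate edges are NOT filtered out.
    """
    rng = random.Random(seed)
    out_stubs = [src for src, _ in edges]  # one out-stub per original edge
    in_stubs = [dst for _, dst in edges]   # one in-stub per original edge
    rng.shuffle(in_stubs)                  # random pairing of stubs
    return list(zip(out_stubs, in_stubs))

def degree_sequence(edges):
    """Return (out-degree, in-degree) counters for an edge list."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    return out_deg, in_deg
```

Comparing `degree_sequence` on the original and the randomized edge lists gives a quick check that the degree sequence has been conserved.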
A feed-forward loop (FFL) is a set of three nodes (source, intermediate, and target) with two paths from source to target: a direct edge, and an indirect path through the intermediate node (see Figure 1(a)).
A 3-cycle (3-CYC) is a three-node directed cyclic graph (Figure 1(b)).
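On small graphs, both fixed-size patterns can be counted by brute force. The sketch below is not Ullmann's algorithm; it simply tests every ordered triple of nodes against the two definitions above. The function name `count_ffl_and_3cyc` and the edge-list input are assumptions made for illustration, and node labels are assumed to be mutually comparable (for example, all strings).

```python
from itertools import permutations

def count_ffl_and_3cyc(edges):
    """Count FFLs and 3-cycles in a directed graph given as (source, target) pairs."""
    # Self-edges cannot take part in a pattern on three distinct nodes.
    edge_set = {(u, v) for u, v in edges if u != v}
    nodes = {n for edge in edge_set for n in edge}
    ffl = cyc = 0
    for a, b, c in permutations(nodes, 3):
        if (a, b) in edge_set and (b, c) in edge_set:
            # a -> b -> c plus the direct edge a -> c is a feed-forward loop.
            if (a, c) in edge_set:
                ffl += 1
            # a -> b -> c -> a is a 3-cycle; count it once, from its smallest node.
            if (c, a) in edge_set and a == min(a, b, c):
                cyc += 1
    return ffl, cyc
```

The cubic running time is acceptable only for toy inputs; a subgraph-isomorphism method is needed for networks of the size studied here.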
Single and multiple input modules (SIM and MIM) in a directed graph are maximal subgraphs comprising two non-empty disjoint sets (layers): $\mathcal{P}$ and $\mathcal{C}$ (standing for Parent and Child). By maximal we mean, for example, that each MIM is not contained in a larger MIM.
A SIM requires that $\mathcal{P}$ contain only one node and $\mathcal{C}$ contain at least two nodes, such that the full graph contains an edge from the parent node to every $c_i \in \mathcal{C}$. We also require the indegree (the number of incoming edges) of every $c_i$ to be exactly one within the full network, not just within the subgraph. By this definition of a SIM, no edges can exist between any $c_i, c_j \in \mathcal{C}$. It follows that the single node in $\mathcal{P}$ is the only parent of all nodes in $\mathcal{C}$.
A MIM requires that both $\mathcal{P}$ and $\mathcal{C}$ contain ≥ 2 nodes, that there is an edge from every $p_i \in \mathcal{P}$ to every $c_j \in \mathcal{C}$, no edge between any $p_i, p_j \in \mathcal{P}$, and no edge between any $c_i, c_j \in \mathcal{C}$. A Bifan is a maximal MIM with $\mathcal{P}$ and $\mathcal{C}$ each containing exactly 2 elements (Figure 1(e)).
We note that in counting both SIMs and MIMs, we ignore self-edges.
We emphasize that we impose the criterion of maximality when enumerating SIMs and MIMs. In the case of a SIM, the child set $\mathcal{C}$ is maximal, whereas for MIMs both the parent set $\mathcal{P}$ and the child set $\mathcal{C}$ are maximal.
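The SIM criteria stated above (a single parent, at least two children, every child with indegree exactly one in the full network, self-edges ignored) translate into a short Python sketch. This is an illustration of the definition only, not the authors' enumeration algorithm; the name `find_sims` and the dictionary output are assumptions.

```python
from collections import defaultdict

def find_sims(edges):
    """Return {parent: sorted list of children} for every maximal SIM."""
    edge_set = {(u, v) for u, v in edges if u != v}  # self-edges are ignored
    indegree = defaultdict(int)
    children = defaultdict(set)
    for u, v in edge_set:
        indegree[v] += 1
        children[u].add(v)
    sims = {}
    for parent, kids in children.items():
        # Keep only children whose sole incoming edge (network-wide) is from this parent.
        exclusive = sorted(c for c in kids if indegree[c] == 1)
        if len(exclusive) >= 2:
            sims[parent] = exclusive  # all such children together: the maximal child set
    return sims
```

Collecting every exclusively regulated child of a parent in one step is what makes each reported SIM maximal.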
These statements define the fundamental network motif set (FFL, SIM, and MIM) as, in a sense, "orthogonal": no subgraph can be more than one of the FFL, SIM, and MIM.
Table 1: Frequencies of canonical subgraph patterns in biological networks. Panels: (a) Yeast transcription, composite; (b) Yeast transcription, Cell Cycle; (c) Yeast transcription, Sporulation; (d) Yeast transcription, Diauxic Shift; (e) Yeast transcription, DNA Damage; (f) Yeast transcription, Stress Response; (g) Escherichia coli transcription; (h) Hippocampal CA1 neuronal signalling pathway.
For each network, 1000 random networks were generated conserving the degree sequence of the original network. Comparisons were made between the frequencies of appearances of various patterns in the real network, and the means and dispersions of their appearances in corresponding random networks.
Table 1 presents the significance profiles of various patterns. The results show that the various subgraph patterns are not significantly over- or under-represented in real networks when compared to their random formulations. A few outliers (where |z-score| > 2) appear in Table 1: FFLs in Yeast Sporulation (z-score = 2.31), 3-CYCs in Yeast Stress Response (z-score = 2.47) and in the neuronal signalling pathway (z-score = 2.4), and Bifans in Yeast Composite (z-score = -2.05) and Cell Cycle (z-score = -2.33). Some outliers are slightly overrepresented (z > 0), and others are slightly underrepresented (z < 0). We observe no outliers with |z-score| > 2.47.
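For readers who wish to reproduce such a comparison, a z-score can be computed as sketched below. The sketch assumes the usual definition, (count in the real network minus mean count over the random networks) divided by the standard deviation over the random networks; the text does not spell out the formula, and the name `z_score` is hypothetical.

```python
from statistics import mean, stdev

def z_score(real_count, random_counts):
    """Deviation of the real count from the random ensemble, in standard deviations.

    Assumption: the sample standard deviation is used as the measure of dispersion.
    """
    mu = mean(random_counts)
    sigma = stdev(random_counts)
    return (real_count - mu) / sigma
```

In the setting described here, `random_counts` would hold the 1000 counts of a given pattern, one per randomized network.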
We employ the same random model as used in earlier related works [2, 3, 5, 7]. While conserving the degree sequence of the original network, the edges in a random network are chosen randomly so that the resultant network is free from the pressure of "evolutionary selection" which is incident on real biological networks. However, in addition to the conservation of the degree sequence, more sophisticated random models can be generated by embedding other connectivity constraints observed in real networks, such as rules of clustering together of nodes in a neighbourhood, and path-lengths between pairs of nodes. These additional constraints will only make the random null hypothesis more stringent to refute. Nevertheless, even using the basic random model employed in our work, we fail to gather any statistical evidence that the canonical patterns appear in real networks at non-random frequencies.
We note that there are differences in the counts of various motifs reported by Luscombe et al.  and this work, even though we use the same datasets (Table 1(a–f)). Our figures supersede those reported by Luscombe et al. (see  for a detailed explanation).
Our definitions of fixed-size subgraphs such as FFL and 3-CYC are consistent with those originally defined by Alon and colleagues [2, 3]. Consequently, we agree on the absolute count of these subgraph patterns in the real network. Surprisingly, however, our counts of FFLs in random networks differ greatly from theirs. To reconfirm our results reported in Table 1(g), we generated another set of 1000 random networks using an alternative method of random network generation: starting with the original network, two randomly chosen interactions are swapped, over a large number of repetitions (i.e., interactions (P1,C1) and (P2,C2) become (P1,C2) and (P2,C1)). With this alternative method we indeed obtain statistical significance results similar to those reported in Table 1(g).
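The edge-swapping procedure can be sketched in Python as follows. The function name `swap_randomize` is hypothetical, the input is assumed to be a list of distinct (parent, child) pairs, and the rule of rejecting swaps that would create a self-edge or a repeated edge is an assumption; the text does not state how such swaps were treated.

```python
import random

def swap_randomize(edges, n_swaps, seed=None):
    """Repeatedly swap the targets of two randomly chosen edges.

    (P1, C1), (P2, C2) -> (P1, C2), (P2, C1); in- and out-degrees are preserved.
    """
    rng = random.Random(seed)
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (p1, c1), (p2, c2) = edges[i], edges[j]
        new1, new2 = (p1, c2), (p2, c1)
        # Assumption: reject swaps creating a self-edge or an already existing edge.
        if p1 == c2 or p2 == c1 or new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges
```

Because rejected swaps leave the graph unchanged, `n_swaps` counts attempts rather than successful swaps.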
Our definition of Bifan ensures that we count only those patterns where a pair of target genes is strictly regulated by a pair of transcription factors: Bifans are maximal MIMs where $|\mathcal{P}| = |\mathcal{C}| = 2$. We believe Shen-Orr et al. fail to maintain this strictness, thereby overcounting Bifans by including in their count two-parent, two-child subMIMs of larger maximal MIMs. (See Discussion.)
Similarly, our definitions and enumeration methods of SIMs and MIMs are mathematically more rigorous than those used by Shen-Orr and colleagues. Our counts of maximal MIMs and SIMs could be converted directly to counts of non-maximal MIMs and SIMs (see below). We note therefore that the non-observance of statistically significant differences between natural and randomized networks in counts of maximal MIMs and SIMs implies that there are no statistically significant differences between natural and randomized networks in counts of non-maximal MIMs and SIMs. This comment, together with the reminder that our definitions (and counts) of FFLs and 3-CYCs are identical with those of Alon et al., shows clearly that the discrepancies are not a simple effect of alternative definitions of SIMs, MIMs and Bifans.
The observed discrepancy in occurrence frequency of FFLs and 3-CYCs is a natural consequence of topological properties of networks
Occurrences of FFLs and 3-CYCs in various biological networks (see Table 1) show patterns: there are a relatively large number of FFLs and relatively small number of 3-CYCs. In this section we explain the topological basis for these differences in their frequencies.
First we note that random connectivity within three-node subgraphs itself favours FFLs. Consider a directed, complete three-node graph (3-graph), that is, one with an edge between every pair of nodes. Excluding bidirectional edges, for any set of 3 nodes there are $2^3 = 8$ possible directed 3-graphs. Each of these configurations is isomorphic to either a FFL or a 3-CYC. Out of 8 possibilities, 6 form FFLs, and 2 form 3-CYCs. Allowing bidirectional edges, there are an extra 19 possible configurations containing at least one bidirectional edge. Each of these possibilities gives multiple FFLs or 3-CYCs or both. With or without bidirectional edges, there is a natural 3:1 bias towards forming an FFL over a 3-CYC in a 3-graph.
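The 6:2 split for complete 3-graphs without bidirectional edges can be checked by direct enumeration. The short Python sketch below orients each of the three node pairs in both possible directions and classifies the result; the function name and the out-degree test are illustrative choices, not part of the original analysis.

```python
from itertools import product

def classify_complete_3graphs():
    """Enumerate the 2**3 orientations of a complete graph on nodes a, b, c."""
    pairs = [("a", "b"), ("b", "c"), ("a", "c")]
    counts = {"FFL": 0, "3-CYC": 0}
    for flips in product([False, True], repeat=3):
        edges = {(v, u) if flip else (u, v) for (u, v), flip in zip(pairs, flips)}
        out_degrees = sorted(sum(1 for s, _ in edges if s == n) for n in "abc")
        # In a 3-cycle every node has out-degree 1; otherwise the
        # out-degrees are 0, 1, 2, which is the FFL arrangement.
        kind = "3-CYC" if out_degrees == [1, 1, 1] else "FFL"
        counts[kind] += 1
    return counts

print(classify_complete_3graphs())  # {'FFL': 6, '3-CYC': 2}
```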
Global properties of biological networks also favour FFLs over 3-CYCs. Most biological networks, such as those used in our study, are scale-free. In scale-free networks, the connectivity of nodes follows a power law: the probability of a node having k neighbours is $P(k) \sim k^{-\gamma}$. Only a few nodes in such a network are highly connected (and form hubs), while most nodes are sparsely connected.
Table: Percentage of FFLs in various networks having exactly n of its nodes as hubs. Columns: n = 1, n = 2, n = 3, n = 0. Rows: Yeast Cell Cycle, Yeast DNA damage, Yeast Stress response.
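We also observe that there is an imbalance between indegree and outdegree around hubs: there are significantly more outgoing edges than incoming edges. We have seen above that FFLs are naturally favoured over 3-CYCs in 3-graphs. The imbalance between in- and out-degree around the hubs further enhances the formation of FFLs. Consider a hub with m incoming edges and n outgoing edges. An edge added between two of the m in-neighbours, or between two of the n out-neighbours, always forms an FFL, whereas an edge between an in-neighbour and an out-neighbour forms an FFL or a 3-CYC with equal probability, depending on its direction. With a random addition of an edge between any pair of the (m + n) nodes adjacent to this hub, the probability of forming an FFL in this system is therefore $P_{FFL} = \left[\binom{m}{2} + \binom{n}{2} + \frac{mn}{2}\right] / \binom{m+n}{2}$, while that of forming a cycle is $P_{3\text{-}CYC} = \frac{mn}{2} / \binom{m+n}{2}$. Then $P_{FFL} / P_{3\text{-}CYC} = \frac{m-1}{n} + \frac{n-1}{m} + 1$, which is symmetric in m and n. If there is a large disparity between m and n (i.e., m ≪ n, or m ≫ n), then one of the terms $\frac{m-1}{n}$ or $\frac{n-1}{m}$ dominates, resulting in $P_{FFL} \gg P_{3\text{-}CYC}$. For example, when m = 2 and n = 20, $P_{FFL}$ = 0.91 and $P_{3\text{-}CYC}$ = 0.09. This shows the odds against the formation of a 3-CYC in networks with structures typical of biological networks.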
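The worked example (m = 2, n = 20) can be reproduced by counting, under the assumption that every ordered pair of distinct hub neighbours is equally likely to receive the new directed edge. In this reading, only an edge running from a node downstream of the hub back to a node upstream of it closes a 3-CYC; every other placement yields an FFL. The function name `hub_probabilities` is hypothetical.

```python
from itertools import permutations

def hub_probabilities(m, n):
    """Return (P_FFL, P_3CYC) for one random edge among a hub's neighbours."""
    in_nbrs = [("in", i) for i in range(m)]    # nodes with an edge into the hub
    out_nbrs = [("out", i) for i in range(n)]  # nodes with an edge from the hub
    ffl = cyc = 0
    for src, dst in permutations(in_nbrs + out_nbrs, 2):
        # out-neighbour -> in-neighbour closes the loop: in -> hub -> out -> in
        if src[0] == "out" and dst[0] == "in":
            cyc += 1
        else:
            ffl += 1
    total = ffl + cyc
    return ffl / total, cyc / total

p_ffl, p_cyc = hub_probabilities(2, 20)
print(round(p_ffl, 2), round(p_cyc, 2))  # 0.91 0.09
```

Swapping the arguments gives the same pair of values, which reflects the symmetry in m and n noted in the text.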
There have been suggestions that 3-CYC is an "anti-motif", a motif that is selected against in many biological networks. But, as described above, the suppression of 3-CYCs is an expected consequence of topological properties of biological networks.
These properties are sufficient to account for the observed profiles of FFLs and 3-CYCs.
Assemblies of motifs
Table: Number of occurrences of various assemblies shown in Figure 2 (frequencies of patterns in Figure 2). Rows: Yeast Cell Cycle, Yeast Diauxic Shift, Yeast DNA damage, Yeast Stress Response.
Table: Frequencies of Bi-FFL assembly in various networks. Rows: Yeast Cell Cycle, Yeast Diauxic Shift, Yeast DNA Damage, Yeast Stress Response.
On SIMs, MIMs and Bifans
SIMs and MIMs are variable-sized subgraphs. Alon and colleagues defined the dense overlapping regulon (DOR) as a two-layered subgraph with not necessarily complete connections between the layers. MIMs are special DORs, a concept that arose as a generalization of the Bifan (Figure 1(e)) subgraph. These Bifans were observed to be present in large numbers in biological networks. However, some investigators fail to impose the criterion of maximality while counting MIMs. This can lead to significant inflation of counts [2, 5]. Note that this applies equally to natural graphs and random ones; hence we emphasize that the differences between our results and those of Alon et al. are not explicable solely on the basis of alternative definitions of some of the motifs.
A maximal MIM with m parents and n children contains $[2^m - (m + 1)] \times [2^n - (n + 1)] - 1$ easily enumerable non-maximal "subMIMs". Our definition of a Bifan ensures that we are only counting (maximal) MIMs that contain 2 parents and 2 children. Counting subMIMs as Bifans will combinatorially increase their counts, as each maximal MIM will contribute $\binom{m}{2} \times \binom{n}{2}$ Bifans. For example, the Yeast composite network contains a large MIM containing 2 parents and 119 children. This alone contributes 7021 non-maximal Bifans. The same consistency is maintained when counting SIMs. The list of subgraph occurrences in various networks used in this paper can be downloaded from http://hollywood.bx.psu.edu/networks/analysis/.
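Both counts are easy to check numerically. The Python sketch below evaluates the two expressions for a maximal MIM with m parents and n children; the function names are hypothetical.

```python
from math import comb

def count_submims(m, n):
    """Non-maximal subMIMs: choose >= 2 parents and >= 2 children, minus the MIM itself."""
    return (2**m - (m + 1)) * (2**n - (n + 1)) - 1

def count_bifans_in_mim(m, n):
    """Two-parent, two-child subgraphs contained in one maximal MIM."""
    return comb(m, 2) * comb(n, 2)

print(count_bifans_in_mim(2, 119))  # 7021
```

The printed value matches the figure quoted for the large MIM in the Yeast composite network; a maximal MIM with 2 parents and 2 children has no subMIMs, so `count_submims(2, 2)` is 0.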
The natural appearance of bipartite graphs in dense general graphs has received some attention in graph theory. It has also been demonstrated, using Ramsey theory, that bipartite cliques appear in sufficiently dense bipartite graphs [19, 20]. MIMs are bipartite cliques. Biological networks contain regions in which dense bipartite graphs naturally appear, hence giving rise to bipartite cliques. This in itself speaks against the notion of evolutionary selection of MIMs.
Evidence for selection of motifs?
Analysis of natural networks shows that several commonly observed subgraphs identified as motifs do not appear at frequencies significantly greater than in corresponding random graphs. Instead, their frequency of occurrence is the result of the small-world character of many biological networks, and of the associated degree distribution.
1. It might be asserted that the general type of motif – for instance FFL rather than 3-cycle – is selected because of a general propriety to serve a particular function (for example, Alon et al. pointed out that a FFL with AND logic at the output node can function as a filter rejecting transient stimuli).
2. Or it might be asserted that individual FFLs (or 3-cycles) within a network play specific functional roles at specific points.
Statistics of frequency of occurrences of specific motifs, and the comparison of observed frequencies in natural networks relative to random networks, do not – no matter what numerical results emerge – provide evidence for or against assertions of type 2. If any individual subgraph at some node plays an essential functional role in a network, it could be selected – whether it is a commonly-occurring subgraph or not. Conversely, an observation of significantly non-random occurrence frequencies of motifs would suggest the action of positive or negative selection, acting at the level of assertions of type 1 or type 2. Indeed it seems inescapable that if assertions of type 1 are true, then at least some assertions of type 2 must also be true, but not vice versa.
Our results suggest that there is no evidence for type 1 assertions.
We have analysed several biological networks. Our results provide no evidence of selection for or against subgraph patterns such as FFL, 3-CYC, SIM, MIM, and Bifan. We have shown that there is no need to invoke selection to explain the structure of the observed networks: the topological properties of the networks themselves favour the observed frequency profiles of the various subgraph patterns.
1. Sharan R, Ideker T: Modeling cellular machinery through biological network comparison. Nature Biotechnology. 2006, 24 (4): 427-430.
2. Shen-Orr SS, Milo R, Mangan S, Alon U: Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics. 2002, 31: 64-68.
3. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U: Network motifs: simple building blocks of complex networks. Science. 2002, 298 (5594): 824-827.
4. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002, 298 (5594): 799-804.
5. Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M: Genomic analysis of regulatory network dynamics reveals large topological changes. Nature. 2004, 431 (7006): 308-312.
6. Máayan A, Jenkins SL, Neves S, Hasseldine A, Grace E, Dubin-Thaler B, Eungdamrong NJ, Weng G, Ram PT, Rice JJ, Kershenbaum A, Stolovitzky GA, Blitzer RD, Iyengar R: Formation of regulatory patterns during signal propagation in a mammalian cellular network. Science. 2005, 309: 1078-1083.
7. Mangan S, Alon U: Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci USA. 2003, 100: 11980-11985.
8. Mangan S, Itzkovitz S, Zaslaver A, Alon U: The incoherent feed-forward loop accelerates the response-time of the gal system of Escherichia coli. J Mol Biol. 2006, 356: 1073-1081.
9. Mangan S, Zaslaver A, Alon U: The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks. J Mol Biol. 2003, 334: 197-204.
10. Kalir S, Mangan S, Alon U: A coherent feed-forward loop with a SUM input function prolongs flagella expression in Escherichia coli. Mol Syst Biol. 2005, 1: E1-E6.
11. Zaslaver A, Mayo AE, Rosenberg R, Bashkin P, Sberro H, Tsalyuk M, Surette MG, Alon U: Just-in-time transcription program in metabolic pathways. Nature Genetics. 2004, 36: 486-491.
12. Ullmann JR: An Algorithm for Subgraph Isomorphism. J. ACM. 1976, 23: 31-42.
13. Konagurthu AS, Lesk AM: Single and Multiple input modules in regulatory networks. Proteins. 2008, 2008 Apr 23.
14. Alon U: An Introduction to Systems Biology: Design Principles of Biological Circuits (Chapman & Hall/Crc Mathematical and Computational Biology Series). 2006, Chapman & Hall/CRC.
15. Barabási AL, Albert R: Emergence of scaling in random networks. Science. 1999, 286 (5439): 509-512.
16. Kashtan N, Itzkovitz S, Milo R, Alon U: Topological generalizations of network motifs. Phys Rev E Stat Nonlin Soft Matter Phys. 2004, 70 (3 Pt 1): 031909.
17. Holyer I: The NP-completeness of some edge partitioning problems. SIAM J Computing. 1981, 10: 713-717.
18. Graham RL, Rothschild BL, Spencer JH: Ramsey theory. Discrete mathematics and optimization. 1980, New York, NY: John Wiley.
19. Erdős P, Spencer JH: Probabilistic methods in combinatorics. 1974, New York, NY: Academic press.
20. Feder T, Motwani R: Clique partitions, graph compression and speeding-up algorithms. STOC '91: Proceedings of the twenty-third annual ACM symposium on Theory of computing. 1991, 123-133. New York, USA: ACM.