On the origin of distribution patterns of motifs in biological networks

Background: Inventories of small subgraphs in biological networks have identified commonly-recurring patterns, called motifs. The inference that these motifs have been selected for function rests on the idea that their occurrences are significantly more frequent than random.
Results: Our analysis of several large biological networks suggests, in contrast, that the frequencies of appearance of common subgraphs are similar in natural and corresponding random networks.
Conclusion: Indeed, certain topological features of biological networks give rise naturally to the common appearance of the motifs. We therefore question whether frequencies of occurrences are reasonable evidence that the structures of motifs have been selected for their functional contribution to the operation of networks.


Background
The network or directed graph description has become the preferred representation of the integrated activity of components of biological processes. The exponential growth of biological network data in the last five years has its source in recent advances in technologies such as mass spectrometry, genome-scale ChIP-chip experiments, yeast two-hybrid assays, combinatorial reverse genetic screens, and rapid literature mining techniques [1].
The science of systems biology has the aim of understanding the functional constraints and design principles of biological networks. Alon and colleagues were the first to introduce the notion of "motifs" in biological networks [2,3]. Motifs are small patterns observed to recur throughout a network, with frequencies statistically higher than expected in random networks of similar connectivity parameters. Since the introduction of this concept, motifs have been reported in many biological networks: metabolic, signaling pathway, protein-protein interaction, and ecological networks amongst others [2][3][4][5][6]. Moreover, the prevalence of motifs is often considered as evidence for evolutionary selection, for implementing a specific function [2,3,7]. Motifs are believed to be building blocks of the functional architecture of a biological network [3].
Consider for example the canonical set of motifs in transcription regulatory networks: Single input module (SIM), Multiple input module (MIM), and Feedforward loop (FFL) [3] (see Figure 1). Originally, Alon and colleagues [2] proposed a dense overlapping regulon (DOR) as a motif; MIMs are special DORs that arose as a generalization of the Bifan motif. Specific functions have been ascribed to each type of motif [2,7,8,9,10,11]: SIMs are commonly associated with temporal ordering of gene expression, MIMs with combinatorial gene regulation, and FFLs with filters that do not pass on transient signals [2]. These functions depend not only on the topology of the subgraph, but also on the logic at nodes receiving multiple inputs. The common occurrence of these motifs, relative to corresponding randomized graphs, has been taken as evidence for their selection for function.
In this paper we investigate the role of small network subgraphs as building blocks of biological networks. We analysed several biological networks: transcription regulation networks of Saccharomyces cerevisiae under different physiological conditions, the transcription regulation network of Escherichia coli, and a neuronal signalling pathway network of the hippocampal CA1 neuron.
Contrary to previous reports, we find that commonly accepted motifs are neither over- nor under-represented in these real networks in comparison to their random formulations. We discuss how the topology of biological networks automatically predisposes them to contain a certain distribution of motifs. This suggests that the evidence for the functional significance of motifs should be reevaluated.
In enumerating variable-sized (maximal) subgraph patterns such as SIMs and MIMs, we used the algorithms described in [13]. We note that Bifans are counted as MIMs with exactly two elements in each of the parent and child sets (see Definitions). To generate random networks conserving the degree sequence of the real network, we use the method described by Shen-Orr et al. [2]: starting with the same number of nodes as in the original network, each node in the random graph is assigned a specific number of in- and out-"edge-stubs". Randomly chosen pairs of in- and out-edge-stubs are joined, giving rise to a random (directed) graph.
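The stub-joining procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the sketch assumes that self-edges and repeated edges produced by random pairing are simply kept, a detail the text does not specify.

```python
import random
from collections import Counter

def random_network_with_same_degrees(edges, rng):
    """Join randomly chosen out-stubs and in-stubs, preserving every node's
    in-degree and out-degree (hypothetical helper, not from the source)."""
    # Each original edge contributes one out-stub at its source
    # and one in-stub at its target.
    out_stubs = [src for src, _ in edges]
    in_stubs = [dst for _, dst in edges]
    # Shuffling one side and pairing positionally joins stubs at random.
    rng.shuffle(in_stubs)
    # Assumption: self-edges and duplicate edges are not filtered out here.
    return list(zip(out_stubs, in_stubs))

def degree_sequences(edges):
    """Return (out-degree, in-degree) counters for an edge list."""
    return Counter(s for s, _ in edges), Counter(t for _, t in edges)

real = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
randomized = random_network_with_same_degrees(real, random.Random(0))
```

Whatever pairing the shuffle produces, `degree_sequences(randomized)` equals `degree_sequences(real)`, which is the only property this null model conserves.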

Definitions
A FFL is a set of three nodes (source, intermediate, and target) with one direct path, and another indirect path through an intermediate node, from source to target (See Figure 1(a)).
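As an illustration of this definition, the following Python sketch counts FFLs in an edge list. It is a simplified stand-in, not the enumeration algorithm of [13]; it assumes that every ordered (source, intermediate, target) triple satisfying the definition counts once and that additional edges among the three nodes are allowed.

```python
def count_ffls(edges):
    """Count (source, intermediate, target) triples with edges
    source->intermediate, intermediate->target and source->target."""
    edge_set = {(u, v) for u, v in edges if u != v}  # drop self-edges
    successors = {}
    for u, v in edge_set:
        successors.setdefault(u, set()).add(v)
    count = 0
    for source, direct_targets in successors.items():
        for intermediate in direct_targets:
            for target in successors.get(intermediate, ()):
                # The indirect path must end at a node that the source
                # also reaches directly.
                if target != source and target in direct_targets:
                    count += 1
    return count

ffl_example = [("s", "i"), ("i", "t"), ("s", "t")]
cycle_example = [("a", "b"), ("b", "c"), ("c", "a")]
```

Here `count_ffls(ffl_example)` is 1, while the three-node cycle in `cycle_example` contains no FFL.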
Single and multiple input modules (SIM and MIM) in a directed graph are maximal subgraphs comprising two non-empty disjoint sets (layers): P and C (standing for Parent and Child). By maximal we mean, for example, that each MIM is not contained in a larger MIM.
[Figure 1: Canonical subgraph patterns in biological networks.]
A SIM requires that P contain only one node and C contain at least two nodes, such that the full graph contains an edge from the parent node to every c_i ∈ C. We also require the indegree (the number of incoming edges) of every c_i to be exactly one within the full network, not just within the subgraph. By this definition of a SIM, no edges can exist between any c_i, c_j ∈ C. It follows that the single node in P is the only parent of all nodes in C.
A MIM requires that both P and C contain ≥ 2 nodes, that there is an edge from every p_i ∈ P to every c_j ∈ C, no edge between any p_i, p_j ∈ P, and no edge between any c_i, c_j ∈ C. A Bifan is a maximal MIM with P and C each containing exactly 2 elements [14] (Figure 1(e)). We note that in counting both SIMs and MIMs, we ignore self-edges.
We emphasize that we impose the criterion of maximality when enumerating SIMs and MIMs. In the case of a SIM, the set C is maximal, whereas with MIMs both the P and C sets are maximal.
These statements define the fundamental network motif set (FFL, SIM, and MIM) as, in a sense, "orthogonal": no subgraph can be more than one of FFL, SIM, and MIM [13].
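The SIM definition translates fairly directly into code. The sketch below is a minimal Python reading of it, not the algorithm of [13]; the function name is hypothetical. Maximality is obtained by collecting, for each parent, all of its children whose indegree in the full network is exactly one.

```python
from collections import defaultdict

def find_sims(edges):
    """Return {parent: set_of_children} for every maximal SIM."""
    parents_of = defaultdict(set)
    for u, v in edges:
        if u != v:                      # self-edges are ignored
            parents_of[v].add(u)
    children_of = defaultdict(set)
    for child, parents in parents_of.items():
        if len(parents) == 1:           # indegree exactly one, network-wide
            (parent,) = parents
            children_of[parent].add(child)
    # A SIM needs at least two children in its child layer.
    return {p: cs for p, cs in children_of.items() if len(cs) >= 2}

sim_example = [("p", "a"), ("p", "b"), ("p", "c"), ("q", "c"), ("r", "d")]
sims = find_sims(sim_example)
```

In `sim_example`, node `c` has two parents and so is excluded, and `r` has only one child, so the only SIM found is parent `p` with children `a` and `b`.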

the transcription network of Escherichia coli
For each network, 1000 random networks were generated conserving the degree sequence of the original network. Comparisons were made between the frequencies of appearances of various patterns in the real network, and the means and dispersions of their appearances in corresponding random networks. Table 1 presents the significance profiles of various patterns. The results show that the various subgraph patterns are not significantly over- or under-represented in real networks when compared to their random formulations. A few outliers (where |z-score| > 2) appear in Table 1: FFLs in Yeast Sporulation (z-score = 2.31), 3-CYCs in Yeast Stress Response (z-score = 2.47) and neuronal signalling pathway (z-score = 2.4), and Bifans in Yeast Composite (z-score = -2.05) and Cell Cycle (z-score = -2.33). Some outliers are slightly overrepresented (z > 0), and others are slightly underrepresented (z < 0). We observe no outliers with |z-score| > 2.47.
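For readers who want to reproduce this kind of comparison, a z-score can be computed as in the Python sketch below. The helper names are hypothetical, and the use of the sample standard deviation is an assumption: the text speaks only of "means and dispersions".

```python
from statistics import mean, stdev

def z_score(real_count, random_counts):
    """z = (count in real network - mean over random networks) / std. dev."""
    mu = mean(random_counts)
    sigma = stdev(random_counts)   # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("pattern count does not vary across random networks")
    return (real_count - mu) / sigma

def is_outlier(real_count, random_counts, threshold=2.0):
    """Flag a pattern whose |z-score| exceeds the threshold used in the text."""
    return abs(z_score(real_count, random_counts)) > threshold

toy_random_counts = [8, 10, 12]    # toy numbers, not data from Table 1
toy_z = z_score(12, toy_random_counts)
```

With the toy counts above the mean is 10 and the sample standard deviation is 2, so a real-network count of 12 gives z = 1 and is not flagged as an outlier.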
We employ the same random model as used in earlier related works [2,3,5,7]. While conserving the degree sequence of the original network, the edges in a random network are chosen randomly so that the resultant network is free from the pressure of "evolutionary selection" which acts on real biological networks. However, in addition to the conservation of the degree sequence, more sophisticated random models can be generated by embedding other connectivity constraints observed in real networks, such as rules of clustering together of nodes in a neighbourhood, and path-lengths between pairs of nodes. These additional constraints would only make the random null hypothesis harder to refute. Nevertheless, even using the basic random model employed in our work, we fail to gather any statistical evidence that the canonical patterns appear in real networks at non-random frequencies.
We note that there are differences in the counts of various motifs reported by Luscombe et al. [5] and this work, even though we use the same datasets (Table 1(a-f)). Our figures supersede those reported by Luscombe et al. (see [13] for a detailed explanation).
Our reanalysis of the Escherichia coli transcription network provides the most direct comparison of our results with those of Alon and coworkers (see Table 1(g)). We fail to see any statistical evidence to suggest that the canonical subgraphs appear more frequently than random. On comparing our results with those published by Shen-Orr et al. [2], we find that:
1. Our definitions of fixed-size subgraphs such as FFL and 3-CYC are consistent with those originally defined by Alon and colleagues [2,3]. Consequently, we agree on the absolute count of these subgraph patterns in the real network. Surprisingly, however, our counts of FFLs in random networks differ greatly. To reconfirm our results reported in Table 1(g), we generated another set of 1000 random networks using […]
3. Similarly, our definitions and enumeration methods of SIMs and MIMs are mathematically more rigorous than those used by Shen-Orr and colleagues [2]. Our counts of maximal MIMs and SIMs could be converted directly to counts of non-maximal MIMs and SIMs (see below). We note therefore that the non-observance of statistically significant differences between natural and randomized networks in counts of maximal MIMs and SIMs implies that there are no statistically significant differences between natural and randomized networks in counts of non-maximal MIMs and SIMs. This comment, together with the reminder that our definitions (and counts) of FFLs and 3-CYCs are identical with those of Alon et al., shows clearly that the discrepancies are not a simple effect of alternative definitions of SIMs, MIMs and Bifans.

The observed discrepancy in occurrence frequency of FFLs and 3-CYCs is a natural consequence of topological properties of networks
Occurrences of FFLs and 3-CYCs in various biological networks (see Table 1) show patterns: there are a relatively large number of FFLs and relatively small number of 3-CYCs. In this section we explain the topological basis for these differences in their frequencies.
First we note that random connectivity within three-node subgraphs itself favours FFLs. Consider a directed, complete three-node graph (3-graph), in which there is an edge between every pair of nodes. Excluding bidirectional edges, for any set of 3 nodes there are 2^3 = 8 possible directed 3-graphs. Each of these configurations is isomorphic to either a FFL or a 3-CYC; that is, any directed complete 3-graph is either a FFL or a 3-CYC. Out of 8 possibilities, 6 form FFLs, and 2 form 3-CYCs. Allowing bidirectional edges, there are an extra 19 possible configurations containing at least one bidirectional edge. Each of these possibilities gives multiple FFLs or 3-CYCs or both. With or without bidirectional edges, there is a natural 3:1 bias towards forming an FFL over a 3-CYC in a 3-graph.
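The 6:2 split for complete 3-graphs without bidirectional edges can be verified by brute-force enumeration. The following Python sketch is illustrative only and uses hypothetical names; it orients each of the three node pairs in both directions and classifies the result by out-degrees.

```python
from itertools import product

NODES = ("a", "b", "c")
PAIRS = (("a", "b"), ("b", "c"), ("a", "c"))

def classify(edges):
    """Label a complete directed 3-graph without bidirectional edges."""
    out_degrees = sorted(sum(1 for u, _ in edges if u == n) for n in NODES)
    # In a 3-cycle every node has out-degree 1;
    # in an FFL the out-degrees are 0 (target), 1 (intermediate), 2 (source).
    return "3-CYC" if out_degrees == [1, 1, 1] else "FFL"

def enumerate_3graphs():
    """Count FFLs and 3-CYCs over all 2^3 orientations of the three pairs."""
    counts = {"FFL": 0, "3-CYC": 0}
    for flips in product((False, True), repeat=3):
        edges = [(v, u) if flip else (u, v) for (u, v), flip in zip(PAIRS, flips)]
        counts[classify(edges)] += 1
    return counts

tally = enumerate_3graphs()
# Each pair can be ->, <- or <->: 3^3 configurations, 2^3 without any <->.
with_bidirectional = 3 ** 3 - 2 ** 3
```

The enumeration gives 6 FFLs and 2 3-CYCs, and `with_bidirectional` evaluates to 19, matching the number of extra configurations quoted in the text.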
Global properties of biological networks also favour FFLs over 3-CYCs. Most biological networks, such as those used in our study, are scale-free [15]. In scale-free networks, the connectivity of nodes follows a power law: the probability of a node having k neighbours is P(k) ~ k^(-γ). Only a few nodes in such a network are highly connected (and form hubs), while most nodes are sparsely connected [15].
We asked how many of the FFLs in various networks contain hubs among their nodes. (We consider as hubs the top 10% of nodes in the network that are highly connected, having more than 10 neighbours.) Table 2 contains the percentages of FFLs enumerated in various networks that have n ∈ {0, 1, 2, 3} nodes as hubs. A large majority of the FFLs contain at least one hub, the most common being FFLs with hubs at two of their nodes. In the Yeast Composite network, 961 of 997 FFLs share a source-intermediate edge with at least one other FFL. These 961 FFLs can be grouped into 114 clusters (each with a distinct source-intermediate edge), revealing that connected hubs often share many common children, automatically giving rise to FFLs. We believe that the principle of preferential attachment predisposes a biological network to have connected hubs that have shared children. This gives a network its robustness to random node failure [15].
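The grouping of FFLs by shared source-intermediate edge can be sketched in Python as follows. This is an assumed reading of the clustering step, with a hypothetical function name and toy data; FFLs are given as (source, intermediate, target) triples.

```python
from collections import defaultdict

def cluster_by_source_intermediate(ffls):
    """Group FFLs that share the same source->intermediate edge."""
    clusters = defaultdict(list)
    for source, intermediate, target in ffls:
        clusters[(source, intermediate)].append(target)
    # Keep only edges shared by at least two FFLs.
    return {edge: targets for edge, targets in clusters.items() if len(targets) >= 2}

toy_ffls = [
    ("hub1", "hub2", "x"),
    ("hub1", "hub2", "y"),
    ("hub1", "hub2", "z"),
    ("hub3", "hub4", "x"),   # no other FFL shares this edge
]
toy_clusters = cluster_by_source_intermediate(toy_ffls)
clustered_ffl_count = sum(len(t) for t in toy_clusters.values())
```

In the toy data, three of the four FFLs fall into a single cluster on the edge `hub1 -> hub2`, mirroring on a small scale the pattern of two connected hubs with many shared children.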
We also observe that there is an imbalance between indegree and outdegree around hubs: there are significantly more outgoing edges than incoming edges. We have seen above that FFLs are naturally favoured over 3-CYCs in 3-graphs. The imbalance between in- and out-degree around the hubs further enhances the formation of FFLs. Consider a hub with m incoming edges and n outgoing edges. With a random addition of an edge between any pair of the (m + n) nodes adjacent to this hub, the probability of forming an FFL in this system is […], while that of forming a cycle is […]. There have been suggestions that the 3-CYC is an "anti-motif", a motif that is selected against in many biological networks [14]. But, as described above, the suppression of 3-CYCs is an expected consequence of topological properties of biological networks.
These properties are sufficient to account for the observed profiles of FFLs and 3-CYCs.
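The hub argument can be explored numerically. The Python sketch below is our own illustration under stated assumptions, not a formula taken from the text: the m in-neighbours and n out-neighbours of the hub are distinct nodes, no edges exist among them beforehand, and the added edge is a uniformly random ordered pair of neighbours. All names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def has_ffl(edges, nodes):
    return any((s, i) in edges and (i, t) in edges and (s, t) in edges
               for s, i, t in permutations(nodes, 3))

def has_cycle(edges, nodes):
    return any((a, b) in edges and (b, c) in edges and (c, a) in edges
               for a, b, c in permutations(nodes, 3))

def hub_edge_outcomes(m, n):
    """Return (P(FFL), P(3-cycle)) for one random edge among a hub's
    m in-neighbours and n out-neighbours (requires m + n >= 2)."""
    ins = [f"in{i}" for i in range(m)]
    outs = [f"out{j}" for j in range(n)]
    base = {(u, "hub") for u in ins} | {("hub", v) for v in outs}
    ffl = cycle = total = 0
    for u, v in permutations(ins + outs, 2):   # candidate new edge u -> v
        edges = base | {(u, v)}
        triple = (u, v, "hub")
        ffl += has_ffl(edges, triple)
        cycle += has_cycle(edges, triple)
        total += 1
    return Fraction(ffl, total), Fraction(cycle, total)

p_ffl, p_cycle = hub_edge_outcomes(1, 3)
```

Under these assumptions a 3-cycle arises only when the new edge runs from an out-neighbour to an in-neighbour, so the enumeration yields a cycle probability of mn / ((m + n)(m + n - 1)); for m = 1 and n = 3 this is 1/4 against 3/4 for an FFL. The more unequal m and n are, the smaller the cycle probability becomes.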

Assemblies of motifs
Kashtan and colleagues [16] observed that regulatory networks contain multi-output FFL generalizations (see Figure 2(a)) in frequencies much higher than multi-input (Figure 2(d)) and multi-intermediate (Figure 2(f)) generalisations. (These authors also suggested that multi-output FFLs were selected to achieve some information processing role [16].) We, in contrast, observe that the varied frequencies of assemblies of multiple FFLs are a consequence of the occurrence of FFLs around hubs. Figure 2 shows all possible assemblies involving two FFLs sharing a common edge. In Table 3 we enumerate the occurrences of each such assembly in various networks. Clearly, the multi-output assembly of two FFLs is far more abundant than the other possibilities, simply because a large number of FFLs share a common source-intermediate edge.
Thus the numbers of multi-output FFLs grow combinatorially with the number of FFLs sharing a common source-intermediate edge. The count of k-output assemblies of FFLs (k < n), where n is the number of FFLs sharing two common (source and intermediate) nodes, is expected to increase as the binomial coefficient C(n, k). For example, 5 FFLs having a common source-intermediate edge (see Figure 3) will give rise to 10 non-redundant bi-output FFLs. Table 4 shows the statistical significance of finding bi-output FFLs in various real networks used in this work, by comparing the occurrences with those observed in their corresponding random networks. Statistically, their frequencies are not significantly greater than in random networks.
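The combinatorial growth is easy to check by listing the assemblies explicitly. The Python sketch below uses hypothetical names and toy node labels; it enumerates k-output assemblies on one shared source-intermediate edge and compares the count with the binomial coefficient.

```python
from itertools import combinations
from math import comb

def k_output_assemblies(source, intermediate, targets, k=2):
    """List every choice of k targets among the FFLs that share the
    edge source -> intermediate."""
    return [(source, intermediate, chosen)
            for chosen in combinations(sorted(targets), k)]

five_targets = ["t1", "t2", "t3", "t4", "t5"]
bi_output = k_output_assemblies("s", "i", five_targets, k=2)
```

For the five FFLs of the example, `len(bi_output)` is 10, which equals `comb(5, 2)`.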

On SIMs, MIMs and Bifans
SIMs and MIMs are variable sized subgraphs. Alon and colleagues [2] defined the dense overlapping regulon (DOR) as a two-layered subgraph with not necessarily complete connections between them. MIMs are special DORs, a concept that arose as a generalization of the Bifan (Figure 1(e)) subgraph. These Bifans were observed to be present in large numbers in biological networks. However, some investigators fail to impose the criterion of maximality while counting MIMs. This can lead to significant inflation of counts [2,5]. Note that this applies equally to natural graphs and random ones (Hence we emphasize that the differences between our results and those of Alon et al. are not explicable solely on the basis of alternative definitions of some of the motifs).
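To give a sense of how much counts can inflate without the maximality criterion, the Python sketch below counts sub-patterns inside a single maximal MIM. It rests on an assumption of ours about what a non-maximal count would include: any choice of at least two parents and at least two children from the maximal MIM, with Bifans as the 2 x 2 case. The function name is hypothetical.

```python
from itertools import combinations

def non_maximal_counts(m, n):
    """Return (number of 2x2 Bifans, number of sub-MIMs with >= 2 nodes
    per layer) inside one maximal MIM with m parents and n children.
    The sub-MIM count includes the maximal MIM itself."""
    parents, children = range(m), range(n)
    bifans = sum(1 for _ in combinations(parents, 2)
                   for _ in combinations(children, 2))
    sub_mims = sum(1
                   for i in range(2, m + 1) for _ in combinations(parents, i)
                   for j in range(2, n + 1) for _ in combinations(children, j))
    return bifans, sub_mims

inflated = non_maximal_counts(3, 4)
```

A single maximal MIM with 3 parents and 4 children already contains 18 Bifans and 44 sub-MIMs under this counting, in agreement with the closed forms C(3, 2) x C(4, 2) and (2^3 - 3 - 1)(2^4 - 4 - 1).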
A maximal MIM with m parents and n children contains […]
[Table caption: Frequencies of patterns in Figure 2.]
The natural appearance of bipartite graphs in dense general graphs has received some attention in graph theory [17]. It has also been demonstrated, using Ramsey theory [18], that bipartite cliques appear in sufficiently dense bipartite graphs [19,20]. MIMs are bipartite cliques. Biological networks contain regions in which dense bipartite graphs naturally appear, hence giving rise to bipartite cliques. This in itself speaks against the notion of evolutionary selection of MIMs [2].

Evidence for selection of motifs?
Analysis of natural networks shows that several commonly observed subgraphs identified as motifs do not appear at frequencies significantly greater than in corresponding random graphs. Instead, their frequency of occurrence is the result of the small-world character of many biological networks, and of the associated degree distribution.
What does this imply about the idea that motifs have been selected, by evolution, for function? The statement that motifs are selected for function has two possible interpretations, not necessarily incompatible:
1. It might be asserted that the general type of motif (for instance FFL rather than 3-cycle) is selected because of a general property that suits it to a particular function. (For example, Alon et al. [1] pointed out that a FFL with AND logic at the output node can function as a filter rejecting transient stimuli.)
2. Or it might be asserted that individual FFLs (or 3-cycles) within a network play specific functional roles at specific points.
Statistics of frequency of occurrences of specific motifs, and the comparison of observed frequencies in natural networks relative to random networks, do not (no matter what numerical results emerge) provide evidence for or against assertions of type 2. If any individual subgraph at some node plays an essential functional role in a network, it could be selected, whether it is a commonly-occurring subgraph or not. Conversely, an observation of significantly non-random occurrence frequencies of motifs would suggest the action of positive or negative selection, acting at the level of assertions of type 1 or type 2. Indeed it seems inescapable that if assertions of type 1 are true, then at least some assertions of type 2 must also be true, but not vice versa.
Our results suggest that there is no evidence for type 1 assertions.

Conclusion
We have analysed several biological networks. Our results suggest that there is no evidence of selection for or against subgraph patterns such as FFL, 3-CYC, SIM, MIM, and Bifan. We have shown that there is no need to invoke selection to explain the structure of the observed networks: the topological properties of networks automatically favour the observed frequency profiles of various subgraph patterns.

Authors' contributions
Both authors contributed equally to the planning and execution of this study; both authors contributed to the draft, and have read and approved the final manuscript.
[Figure 3: Example of FFLs sharing two hub nodes that are connected.]