The topological significance of a pattern instance
Let G = (V, E) be a directed graph without multiple edges that represents a regulatory network, where the vertices v ∈ V denote biological entities, e.g., proteins, genes or small molecules. Causal relationships between these entities are represented by directed edges e ∈ E. A topological pattern is given by n connected vertices and the way they are connected with each other. The particular coherence described by a pattern is always based on all edges that exist between the n vertices. The entirety of all distinct n-vertex patterns in G is then given by
$$P^{n} = \{P_{1}^{n}, P_{2}^{n}, \ldots\},$$
where $P_{i}^{n}$ is the i-th pattern consisting of n vertices. Strictly speaking, each pattern $P_{i}^{n}$ represents a set of isomorphic connected subgraphs which have the same structural properties and differ only in the participating vertices. Accordingly, a pattern $P_{i}^{n}$ comprises a set of instances, i.e., $P_{i}^{n} = \{P_{i1}^{n}, \ldots, P_{iJ}^{n}\}$, and each instance is a unique subgraph $P_{ij}^{n} = (V_{ij}, E_{ij})$ of G, with the subset of vertices $V_{ij} \subseteq V$ and the subset of edges $E_{ij} \subseteq E$ (Figure 1). The edges in $E_{ij}$ are only incident to vertices in $V_{ij}$ and we denote them as the intrinsic edges of the pattern instance $P_{ij}^{n}$. Other edges, $e \in E \setminus E_{ij}$, do not contribute to the coherence of the vertices $V_{ij}$. Rather, these extrinsic edges are part of the environment of $P_{ij}^{n}$, which describes how the pattern instance is embedded into the network. If there are more pattern instances $P_{ij}^{n}$ in G than in similar random networks, then the respective pattern $P_{i}^{n}$ is called a motif. Consequently, the entirety of n-vertex patterns in G may contain several n-vertex motifs $M_{i}^{n} \in P^{n}$. The motif $M_{i}^{n}$ then encompasses its own representatives, the instances of the motif $M_{ij}^{n}$.
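The distinction between instances and intrinsic edges can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names and the toy network are hypothetical, the enumeration is brute force over all n-vertex subsets (feasible only for small graphs), and the grouping of instances into isomorphism classes (patterns) is deliberately left out.

```python
from itertools import combinations

def intrinsic_edges(edges, vertices):
    """Edges of the graph whose endpoints both lie in `vertices`."""
    vs = set(vertices)
    return {(u, v) for (u, v) in edges if u in vs and v in vs}

def is_weakly_connected(vertices, edge_subset):
    """True if the vertices form one component when edge direction is ignored."""
    vertices = list(vertices)
    neighbours = {v: set() for v in vertices}
    for u, v in edge_subset:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        for w in neighbours[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)

def pattern_instances(edges, n=3):
    """Yield (vertex set, intrinsic edge set) for every connected n-vertex induced subgraph."""
    all_vertices = sorted({x for e in edges for x in e})
    for combo in combinations(all_vertices, n):
        e_in = intrinsic_edges(edges, combo)
        if is_weakly_connected(combo, e_in):
            yield frozenset(combo), frozenset(e_in)

# Hypothetical toy network: a feed-forward loop X->Y, Y->Z, X->Z plus one edge Z->W.
toy_edges = {("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")}
instances = list(pattern_instances(toy_edges))

for vertex_set, edge_set in instances:
    extrinsic = toy_edges - edge_set  # all remaining edges form the environment
    print(sorted(vertex_set), sorted(edge_set), sorted(extrinsic))
```

Because each instance is built from all edges that exist between its vertices, the vertex set {X, Y, Z} yields exactly one instance (the feed-forward loop), not additional two-edge chains.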
Following the logic of [24], we define the topological significance of a pattern instance $P_{ij}^{n}$ (the j-th instance of the i-th n-vertex pattern) as how essential it is for all connections within a network. To quantify this significance we eliminate all edges of a pattern instance (i.e., its intrinsic edges $E_{ij}$) and measure how this affects the number of connected ordered pairs of vertices in the network. An ordered pair of vertices (u, v), with u ≠ v and u, v ∈ V, is connected iff there is at least one path from vertex u to vertex v in G. Note that the ordered pair (u, v) is different from (v, u) in a directed network. The more ordered pairs become disconnected upon the removal of all edges of a pattern instance, the higher is the topological significance of this instance for the whole network. We define the pairwise disconnectivity index of a pattern instance, Dis($P_{ij}^{n}$), as the fraction of those initially connected pairs of vertices in a network which become disconnected if the intrinsic edges of the pattern instance $P_{ij}^{n}$ are removed from the network:
$$\mathrm{Dis}(P_{ij}^{n}) = \frac{N - N'}{N} = 1 - \frac{N'}{N} \qquad (1)$$
In Eq. 1, N is the total number of ordered pairs of vertices in a graph G = (V, E) that are connected by at least one directed path of any length. It is supposed that N > 0, i.e., there exists at least one edge in the network that links two different vertices. N' is the number of connected ordered pairs of vertices in the subgraph G' = (V, E') of G, where $E' = E \setminus E_{ij}$. Therefore, G' is the subgraph of G that results from removing the intrinsic edges of the pattern instance $P_{ij}^{n}$ from G. The pairwise disconnectivity index of a pattern instance ranges between 0 and 1, where zero indicates that the removal of its intrinsic edges does not disconnect any vertices within the network and one denotes the case in which no pair of vertices is connected any more.
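The definition above translates directly into a reachability count before and after edge removal. The following is a minimal sketch, not the authors' implementation: the function names and the five-vertex toy network are assumptions made for illustration, and the depth-first search per vertex is the simplest possible choice rather than an efficient one.

```python
def reachable_pairs(vertices, edges):
    """Count ordered pairs (u, v), u != v, with a directed path from u to v."""
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
    count = 0
    for source in vertices:
        seen, stack = {source}, [source]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        count += len(seen) - 1  # exclude the source itself
    return count

def disconnectivity(vertices, edges, removed_edges):
    """Fraction (N - N') / N of connected ordered pairs lost when `removed_edges` are deleted."""
    n_before = reachable_pairs(vertices, edges)
    if n_before == 0:
        raise ValueError("the network has no connected ordered pair (N = 0)")
    n_after = reachable_pairs(vertices, set(edges) - set(removed_edges))
    return (n_before - n_after) / n_before

# Hypothetical toy network: a feed-forward loop X->Y, Y->Z, X->Z
# with one upstream vertex A and one downstream vertex B.
vertices = ["A", "X", "Y", "Z", "B"]
edges = {("A", "X"), ("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "B")}
ffl_edges = {("X", "Y"), ("Y", "Z"), ("X", "Z")}

dis_ffl = disconnectivity(vertices, edges, ffl_edges)
print(dis_ffl)  # 10 connected pairs before, 2 after
```

In this toy network only the pairs (A, X) and (Z, B) survive the removal, so eight of the ten initially connected ordered pairs are lost.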
Figure 2 illustrates how an instance of the feed-forward loop (FFL), one of the best studied network motifs [14, 15, 20–23], may affect the existing communication in a network. The FFL is a three-vertex pattern that is given here by the intrinsic edges X → Y, Y → Z and X → Z. It is linked to the rest of the network by its vertices X, Y and Z, each of which can be the start or end of an extrinsic edge (blue dotted edges). Further extrinsic edges connect other pairs of vertices in the environment of an FFL instance, e.g., the ordered pair (E1, E2). Whether an FFL instance can have an impact on the connection between two vertices depends on the composition of the paths that link them. If these paths consist of extrinsic edges only, then the connection will not be affected by the removal of the FFL (e.g., the pair (E1, E3)). The FFL instance can be critical only for those pairs whose paths include at least one intrinsic edge of the instance. Even then, its effect depends on the presence of alternative (i.e., parallel) paths between the corresponding vertices that use extrinsic edges only. For example, the pair (E2, Y) does not critically depend on the FFL instance due to the presence of another path (E2, X, E3, E4, Y) that includes no intrinsic edge of the instance. In contrast, the pair (E2, E6) loses its connection upon the deletion of the FFL instance although three parallel paths connect these vertices.
Usually, several instances of a particular pattern can be found in a network. To estimate the topological significance of the pattern itself, the impact of all its representatives has to be considered. We find that the average pairwise disconnectivity index of all instances of a pattern reflects this appropriately and define
$$\mathrm{Dis}(P_{i}^{n}) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{Dis}(P_{ij}^{n}) \qquad (2)$$
as the pairwise disconnectivity index of a pattern $P_{i}^{n}$ that consists of J instances $P_{ij}^{n}$. Eq. 2 thus also gives the expected topological significance of a randomly chosen instance of the pattern $P_{i}^{n}$.
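As a small worked illustration of this averaging, the sketch below computes the index of each instance of a chain-like pattern in a hypothetical five-vertex chain and then takes the mean over the J = 3 instances. The helper names and the toy network are assumptions; the instances are listed by hand rather than detected automatically.

```python
def reachable_pairs(vertices, edges):
    """Count ordered pairs (u, v), u != v, with a directed path from u to v."""
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
    count = 0
    for source in vertices:
        seen, stack = {source}, [source]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        count += len(seen) - 1
    return count

def disconnectivity(vertices, edges, removed_edges):
    """Fraction of initially connected ordered pairs lost after the removal."""
    n_before = reachable_pairs(vertices, edges)
    n_after = reachable_pairs(vertices, set(edges) - set(removed_edges))
    return (n_before - n_after) / n_before

# Hypothetical toy network: the linear chain A -> B -> C -> D -> E.
chain_vertices = ["A", "B", "C", "D", "E"]
chain_edges = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")}

# The three instances of the two-edge chain pattern, given by their intrinsic edges.
chain_instances = [
    {("A", "B"), ("B", "C")},
    {("B", "C"), ("C", "D")},
    {("C", "D"), ("D", "E")},
]

instance_dis = [disconnectivity(chain_vertices, chain_edges, inst) for inst in chain_instances]
pattern_dis = sum(instance_dis) / len(instance_dis)  # mean over the J instances
print(instance_dis, pattern_dis)
```

The middle instance disconnects more pairs than the two outer ones, which is the kind of heterogeneity among instances that the mean summarizes.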
Applying the pairwise disconnectivity index to the analysis of topological patterns in regulatory networks
We have applied our approach to the characterization of three-vertex topological patterns in transcription regulation networks from three different groups of organisms: a bacterium (Escherichia coli) [14], a unicellular eukaryote (the yeast Saccharomyces cerevisiae) [15] and higher eukaryotes (mammals: human, mouse, rat) [25, 26]. 3-vertex motifs were identified by means of the Z-Score as proposed by Alon and colleagues [15]. This normalized value states whether the abundance of a pattern in the real network exceeds its occurrence in an ensemble of random networks: a positive Z-Score indicates over-representation in the real network, whereas a negative Z-Score indicates under-representation. Since there is no commonly accepted threshold Z-Score value for defining motifs, we consider patterns with Z-Score > 0 as motifs and all others as non-motifs. For the networks of E. coli and S. cerevisiae, 3-vertex motifs have already been identified [14, 15], whereas for the mammalian transcription network they are reported here for the first time. To distinguish between different motifs, many of which have no commonly accepted names, we used the identification numbers (IDs) of small connected graphs as provided by the FANMOD software [27, 28]. The name of a pattern instance was generated by combining a prefix E, Y or M, referring to E. coli, S. cerevisiae or mammals, respectively, with the corresponding pattern ID, followed by the pairwise disconnectivity index rank of the instance among all instances of the given pattern (e.g., E.38.1).
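The Z-Score criterion can be sketched as follows, assuming the usual form in which the count in the real network is compared with the mean and standard deviation of the counts in the random ensemble. The function name, the use of the sample standard deviation and all counts below are illustrative assumptions, not values from the networks analyzed here.

```python
from statistics import mean, stdev

def z_score(real_count, random_counts):
    """(real count - mean of random counts) / standard deviation of random counts."""
    mu = mean(random_counts)
    sigma = stdev(random_counts)  # sample standard deviation (an assumption)
    if sigma == 0:
        # e.g. a pattern that never occurs in any random network
        raise ValueError("Z-Score is undefined when the random counts do not vary")
    return (real_count - mu) / sigma

# Hypothetical counts of one pattern in the real network and in four random networks.
real_count = 40
random_counts = [10, 12, 8, 10]

z = z_score(real_count, random_counts)
is_motif = z > 0  # the threshold used in this study
print(z, is_motif)
```

Note that the score is undefined when a pattern has the same count in every random network, so such patterns need separate treatment.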
Bacterial transcription network
The E. coli transcription network consists of 418 vertices and 519 edges. It exhibits four 3-vertex patterns, two of which are motifs according to the Z-Score criterion (Figure 3). One of these motifs (ID = 6) appears most frequently and seems to be part of larger motifs known as the single-input module [20]. The mean pairwise disconnectivity index of its instances is 0.0039, i.e., only about 0.4% of all connected pairs of genes become disconnected when a randomly selected instance of this motif is deleted from the network. The second motif, ID = 38, is known as the feed-forward loop [14, 15]. It appears in the E. coli network less often than the former, but its instances exhibit a higher average pairwise disconnectivity index (0.018). The patterns ID = 12 and ID = 36 are not over-represented here (negative Z-Score) and are therefore not ranked as motifs. The pattern ID = 12 denotes a chain-like structure in which a gene regulates another one, which itself regulates a third one. Its average pairwise disconnectivity index is of the same order as that of the feed-forward loop. In contrast, the pattern ID = 36, which represents the influence of two genes on a third one, has a much lower mean pairwise disconnectivity index than the ID = 12 pattern, but a higher one than the ID = 6 motif.
The boxplots in Figure 4 show how the pairwise disconnectivity index is distributed among the instances of the different 3-vertex patterns (see Figure 3, E. coli). The population of each pattern is very heterogeneous. Most instances exhibit a low pairwise disconnectivity index. However, a few pattern instances cause a significant effect when deleted, indicating that the network is vulnerable to a targeted removal of particular instances. While about 3% of all motif instances are not crucial for sustaining the connection between any gene pair, nearly 9% of them disconnect at least 1% of the gene pairs in E. coli. In contrast, the instances of non-over-represented patterns always disconnect at least one gene pair, and one third of them disconnect 1% or more. In general, comparing the medians of the pattern instances (shown in Figure 4 as a solid horizontal bar) indicates that motifs are not topologically more significant than the non-motif patterns.
Nevertheless, the instance with the highest pairwise disconnectivity index in the E. coli network is a motif instance. This feed-forward loop consists of the genes hns, flhDC and fliAZY (Figure 5, ID = E.38.1). Interestingly, the gene flhDC is part of all pattern instances with a high topological significance, either together with the gene fliAZY or with ompR_envZ (Figure 5). Like hns and fliAZY, the gene flhDC is involved in the synthesis of flagella in E. coli. A reduced activity of flhDC and fliAZY results in the loss of motility in E. coli [29, 30], which has vital consequences for the bacterium. The same may hold for a loss of the ompR_envZ regulatory system, which is known to play a critical role in stress response by regulating the transcription of porin genes in response to medium osmolarity [31]. Altogether, the high topological significance of the pattern instances in Figure 5 seems to adequately reflect the importance for E. coli of the few recurring interactions between these essential genes.
Yeast transcription network
The transcription network of S. cerevisiae consists of 688 vertices and 1079 edges. It features three additional patterns besides those that have already been identified in E. coli. A positive Z-Score is attributed to four patterns in S. cerevisiae, although the patterns ID = 102 and ID = 166 occur only once (Figure 3). As in E. coli, the average topological significance of the motif ID = 6 is lower than that of the feed-forward loop. On average, a randomly selected FFL instance breaks the connection between less than 1% of all connected pairs of genes, which is lower than for instances of the pattern ID = 12. The mean pairwise disconnectivity index of the latter is about 0.0135, the highest of all patterns with a negative Z-Score in the S. cerevisiae network.
Except for the pattern ID = 14, the pairwise disconnectivity index varies considerably among the instances of a pattern in this network (Figure 6). The patterns to which the instances with a high topological significance belong display positive as well as negative Z-Scores, i.e., they include both over-represented and under-represented patterns. Hence, motifs are not favoured over non-motif patterns in sustaining the pairwise connections between genes. In contrast to the E. coli network, the S. cerevisiae network seems to be more robust against the elimination of a pattern instance, since far fewer instances have a notable effect on the existing pairwise connections between genes: at 0.002, the average pairwise disconnectivity index of a pattern instance is just half as high as in the E. coli network. Thus, more alternative paths are available that strengthen the pairwise connections between genes, so that fewer instances cause a significant perturbation in the network (about 3% with a pairwise disconnectivity index Dis($P_{ij}^{n}$) ≥ 0.05 in yeast, compared with 10% in the E. coli network). Nevertheless, the overall impact of these pattern instances is comparable to that in the E. coli network (see Figures 5 and 7). A reason for this might be that such pattern instances are embedded in a similar fashion in both networks and may therefore have a similar influence on the existing connections.
The highest pairwise disconnectivity index is about 0.08 (Figure 6) and belongs to a feed-forward loop instance that comprises the genes RME1, IME1 and IME1_UME6 (Figure 7). RME1 is known to encode a zinc finger protein that can repress the transcription of IME1 [32]. RME1 and IME1 are the master regulators of meiosis in S. cerevisiae [33–35]. An ime1 disruption prevents expression of almost all meiotic genes and all tested meiotic events [33]. RME1 is essential for sustaining the connections between many gene pairs, similar to the genes MCM1, SNF2_SWI1 and SWI5. Gene MCM1 is central to the transcription control of cell-type specific genes and the pheromone response. The SNF2/SWI complex is an evolutionarily conserved ATP-dependent chromatin remodeling complex that plays an important role in DNA damage repair, DNA replication and stress response [36]. SWI5 activates the expression of cell cycle genes [37]. Altogether, these genes exert vital functions in S. cerevisiae and each of them appears quite frequently among the pattern instances with the highest topological significance.
Mammalian transcription network
The third network represents genes coding for transcription factors in mammalian species (human, mouse, and rat) and their interplay. This mammalian network consists of 279 vertices and 657 edges and has been extracted from the contents of the TRANSPATH® database on signal transduction [25] and the TRANSFAC® database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors [26]. Unlike the other two networks, it contains all of the thirteen possible 3-vertex patterns. Although five patterns display positive Z-Scores, only four of them indicate a clear over-representation (Figure 3). In addition, one might find it difficult to classify the pattern ID = 102 as a motif due to its low frequency. Nevertheless, the FFL is a motif in mammals and the only pattern that is over-represented in all three networks. Although its occurrence rises with the increasing density and complexity of the networks, its topological significance decreases notably. In fact, a low average pairwise disconnectivity index can be observed for almost all motifs in mammals, with motif ID = 174 as the only exception.
Three of the seven patterns with a negative Z-Score have been found in the networks of E. coli and S. cerevisiae too, but unlike in mammals the pattern ID = 6 is a motif in them. Yet, its average topological significance does not differ greatly between these networks. The same applies to the pattern ID = 12, which exhibits one of the highest mean pairwise disconnectivity indices here as well. In contrast, the pattern ID = 36 seems to play just a minor role although it is the second most common one. Most other non-motif patterns in the mammalian network are, on average, crucial for linking only about 1% of gene pairs. Nevertheless, their appearance hints at the more complex organization of transcription regulation in higher organisms. It therefore seems fitting that the pattern ID = 238 can be found only here (Figure 3): it represents the mutual transcription control of three retinoic acid receptor isoforms with the vertices RAR-alpha, RAR-beta and RAR-gamma. Note that this pattern does not even occur in any random network of similar size and degree distribution. On the other hand, it is still surprising that the pattern ID = 164 appears nearly 200 times in the mammalian network, but neither in the network of E. coli nor in that of S. cerevisiae.
Despite the overall low mean topological significance of the various patterns in the mammalian network, the pairwise disconnectivity index of their instances covers a broad range of values (Figure 8). This spread is even wider for non-over-represented patterns and more pronounced than in the other two networks. Thus, here too, a high topological significance is not associated with motifs. However, this network differs with regard to the robustness of its architecture: about one third of all pattern instances do not affect any of the pairwise connections between genes, and more than 15% disconnect at least 1% of the gene pairs. No motif instance exhibits a pairwise disconnectivity index higher than 0.04. Higher values are found exclusively for instances of non-over-represented patterns (ID = 6, 12, 36, 164).
The strongest perturbation in the mammalian network exceeds those caused by the topologically most significant pattern instances in the other two networks. Deleting this pattern instance, which comprises the genes c-myc, HMGA1 and PAX3, disconnects 10% of all connected gene pairs in the mammalian network (M.6.1, Figure 9). The proto-oncogene c-myc is engaged in diverse processes ranging from cell proliferation to apoptosis [38], and its interaction with PAX3 repeatedly occurs in the pattern instances with the highest topological significance (Figure 9). Such a frequent appearance has been observed for some genes in the networks of E. coli and S. cerevisiae too. Furthermore, these genes have been found to exert vital functions in their organism. The same applies to PAX3 and c-myc in mammals: the paired box gene 3 (PAX3) activates developmental genes (e.g., Mitf), and the loss of PAX3, like that of c-myc, is lethal [39]. It is interesting to note that all transcription factors encoded by the genes constituting the interlinked patterns M.6.1, M.6.2, M.12.1, M.12.2 and M.164.1 (Figure 9) play pronounced roles in cell proliferation (E2F-1, c-myc, c-fos, HMGA1, and NSEP1) or are important developmental regulators (PAX3, Mitf).
A note on the joint deletion of intrinsic edges
The unusually frequent appearance of the same links (i.e., intrinsic edges) between genes in the pattern instances with the highest pairwise disconnectivity indices in all three networks raises the question of how much these links contribute to the estimated significance of the pattern instances. Possibly, the removal of individual intrinsic edges already destroys the connection between many gene pairs, so that their simultaneous removal is not as crucial. Alternatively, the edges may have a significant non-additive impact when taken together. Answering this question requires knowing the effect of deleting a single interaction (i.e., edge) in a network, which can be determined in a similar way as for a pattern instance. This measure has been introduced as the pairwise disconnectivity index of an edge in [24] and specifies the fraction of ordered pairs that become disconnected due to the removal of an individual edge.
As a first attempt, this fraction has been estimated for each intrinsic edge of a pattern instance in the three networks, and the sum of these fractions has been compared with the pairwise disconnectivity index of the respective pattern instance. Although such a comparison indicates only a tendency as to whether and how far the intrinsic edges of a pattern instance act synergistically, it is applicable to all kinds of patterns independent of their specific arrangement. Figure 10 illustrates this approximation for the pattern instances with the highest pairwise disconnectivity index in each network. The edge weights denote the topological significance of an edge for the corresponding network, e.g., Dis(hns → flhDC) = 0.005 for the edge from gene hns to flhDC in E. coli. Hence, the deletion of this interaction disconnects merely half a percent of all pairwise linked genes in E. coli. As expected, removing the edge from hns to fliAZY has no effect, since there is always the alternative path via flhDC. In contrast, a relatively high pairwise disconnectivity index has been measured for the edge from flhDC to fliAZY. Still, the summed effect of deleting these intrinsic edges separately from the E. coli network (0.049) is considerably lower than the topological significance of the whole pattern instance, Dis(E.38.1) = 0.086. The same holds for the other two pattern instances in Figure 10. Thus, the joint action of the intrinsic edges clearly has a much stronger impact on pairwise connections between genes than the edges have individually.
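The comparison between the summed single-edge indices and the index of the whole instance can be reproduced on a toy example. The sketch below is an illustration only: the network (a feed-forward loop with one upstream vertex A and one downstream vertex B) and all names are hypothetical and are not taken from the E. coli data.

```python
def reachable_pairs(vertices, edges):
    """Count ordered pairs (u, v), u != v, with a directed path from u to v."""
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
    count = 0
    for source in vertices:
        seen, stack = {source}, [source]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        count += len(seen) - 1
    return count

def disconnectivity(vertices, edges, removed_edges):
    """Fraction of initially connected ordered pairs lost after the removal."""
    n_before = reachable_pairs(vertices, edges)
    n_after = reachable_pairs(vertices, set(edges) - set(removed_edges))
    return (n_before - n_after) / n_before

# Hypothetical toy network with an embedded feed-forward loop X->Y, Y->Z, X->Z.
vertices = ["A", "X", "Y", "Z", "B"]
edges = {("A", "X"), ("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "B")}
ffl_edges = [("X", "Y"), ("Y", "Z"), ("X", "Z")]

# Index of each intrinsic edge removed on its own, and their sum.
single = {e: disconnectivity(vertices, edges, {e}) for e in ffl_edges}
sum_single = sum(single.values())

# Index of the whole instance (all intrinsic edges removed together).
joint = disconnectivity(vertices, edges, ffl_edges)

print(single, sum_single, joint)
```

As in the hns, flhDC, fliAZY example, the direct edge X → Z has no effect on its own because the path via Y remains, and the joint removal disconnects more pairs than the separate removals taken together.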
Whether this can be generalized for all pattern instances found in the three networks is shown in Figure 11. Most pattern instances in the three networks cluster near the diagonal since the joint removal of their intrinsic edges disconnects approximately the same number of gene pairs as the separate elimination of them does. However, some exceptions have been found, especially among those patterns that exhibit a high pairwise disconnectivity index per se.
A pattern instance is positioned below the diagonal dotted lines in Figure 11 if there is considerable overlap between the sets of pairwise linked genes that become disconnected upon the separate removal of the intrinsic edges of the instance. For example, consider how the vertices 1 and 5 in Figure 1A are linked. To disconnect them it is enough to delete either of the edges 1 → 2 or 2 → 5. Such dependencies seem to exist on a larger scale in the analyzed networks, pointing to many gene pairs that are connected in a linear chain-like manner as reflected by the pattern ID = 12 (Figure 3). There are almost no independent alternative paths between such gene pairs, so that the connection between them is very sensitive to the deletion of a single intrinsic edge. Accordingly, instances of the pattern ID = 12 are found almost exclusively below the diagonal dotted lines in Figure 11.
The concurrent elimination of the intrinsic edges of a pattern instance located above the diagonal dotted lines also breaks pairwise connections between genes that are not as easily disrupted as those described above. At least two paths exist between such genes, each using a unique combination of intrinsic edges. Thus, they cannot be disconnected by eliminating a single intrinsic edge only. For example, in Figure 2 there are three paths linking vertex E2 with E6: the first one includes the intrinsic edge X → Z, the second includes the intrinsic edges X → Y and Y → Z, and the third contains only the intrinsic edge Y → Z. No matter which single intrinsic edge is deleted, the vertex pair (E2, E6) remains connected since at least one of the three paths is still present. Their connection is disrupted only if the whole pattern instance is deleted. Such dependencies can be observed in Figure 11 for a few pattern instances in E. coli, but increasingly in the other two networks. This trend is most distinctive in the mammalian network. Besides the pattern instances with a high pairwise disconnectivity index, a considerable number of motif instances appear in the lower left corner of the plot for the mammalian network (Figure 11, red triangles): their intrinsic edges, taken individually, have an extremely small impact or even no impact at all on pairwise connections between genes. Taken as whole instances, however, they are a bottleneck for linking many gene pairs.
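Both situations, overlapping losses and losses that occur only under joint removal, can be made explicit by working with the sets of disconnected pairs rather than their counts. The following sketch uses two hypothetical toy networks (a three-vertex chain and a feed-forward loop with one upstream and one downstream vertex); the function names are assumptions introduced for illustration.

```python
def connected_pairs(vertices, edges):
    """Set of ordered pairs (u, v), u != v, with a directed path from u to v."""
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
    pairs = set()
    for source in vertices:
        seen, stack = {source}, [source]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        pairs.update((source, target) for target in seen if target != source)
    return pairs

def lost_pairs(vertices, edges, removed):
    """Ordered pairs that are connected before but not after removing `removed`."""
    return connected_pairs(vertices, edges) - connected_pairs(vertices, set(edges) - set(removed))

def overlap_and_synergy(vertices, edges, instance_edges):
    """Return (pairs lost by more than one single edge, pairs lost only by joint removal)."""
    per_edge = [lost_pairs(vertices, edges, {e}) for e in instance_edges]
    joint = lost_pairs(vertices, edges, instance_edges)
    union = set().union(*per_edge)
    shared = {p for p in union if sum(p in lost for lost in per_edge) > 1}
    return shared, joint - union

# Chain A -> B -> C: the pair (A, C) is lost by deleting either edge (overlap).
chain_shared, chain_joint_only = overlap_and_synergy(
    ["A", "B", "C"], {("A", "B"), ("B", "C")}, [("A", "B"), ("B", "C")]
)

# Embedded feed-forward loop: some pairs are lost only when all three edges go.
ffl_shared, ffl_joint_only = overlap_and_synergy(
    ["A", "X", "Y", "Z", "B"],
    {("A", "X"), ("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "B")},
    [("X", "Y"), ("Y", "Z"), ("X", "Z")],
)
print(chain_shared, chain_joint_only, ffl_shared, ffl_joint_only)
```

In the chain, the shared pair is counted twice when single-edge indices are summed, so the sum exceeds the joint effect. In the feed-forward loop, pairs such as (A, B) survive every single deletion through a parallel path and are lost only when the whole instance is removed.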