On cycles in the transcription network of Saccharomyces cerevisiae

Background We investigate the cycles in the transcription network of Saccharomyces cerevisiae. Unlike a similar network of Escherichia coli, it contains many cycles. We characterize properties of these cycles and their place in the regulatory mechanism of the cell. Results Almost all cycles in the transcription network of Saccharomyces cerevisiae are contained in a single strongly connected component, which we call LSCC (L for "largest"), except for a single cycle of two transcription factors. The fact that LSCC includes almost all cycles is well explained by the properties of a random graph with the same in- and out-degrees of the nodes. Among different physiological conditions, cell cycle has the most significant relationship with LSCC, as the set of 64 transcription interactions that are active in all phases of the cell cycle has overlap of 27 with the interactions of LSCC (of which there are 49). Conversely, if we remove the interactions that are active in all phases of the cell cycle (25% of interactions to transcription factors), the LSCC would have only three nodes and 5 edges, many fewer than expected. This subgraph of the transcription network consists mostly of interactions that are active only in the stress response subnetwork. We also characterize the role of LSCC in the topology of the network. We show that LSCC can be used to define a natural hierarchy in the network and that in every physiological subnetwork LSCC plays a pivotal role. Conclusion Apart from those well-defined conditions, the transcription network of Saccharomyces cerevisiae is devoid of cycles. It was observed that two conditions that were studied and that have no cycles of their own are exogenous: diauxic shift and DNA repair, while cell cycle and sporulation are endogenous. We claim that in a certain sense (slow recovery) stress response is endogenous as well.


Background
Cycles have a central role in the control of continuing processes (for an example, see Hartwell [1]). Therefore we expect the regulatory mechanism of a cell to have many cycles of interactions. Only some of these interactions have the form of a transcription factor (TF for short) regulating expression of a target gene. Our question is: given that there are cycles of transcription interactions, are they important in the regulation of life processes?
Graph properties of the regulatory networks have been reported in a number of papers. Shen-Orr et al. [2] analyzed the regulatory networks statistically and observed certain characteristic motifs that are more frequent than in the random model and which have functional significance (while other small subgraphs are significantly less frequent). Cycles, or feedback loops, may also have some typical regulatory role, e.g. they may be related to multiple steady states [3][4][5].
Luscombe et al. [6] studied the dynamics of the regulatory network of Saccharomyces cerevisiae as it changes for multiple conditions and proposed a method for the statistical analysis of network dynamics. They have found large changes in the topology of the network and compared it with random graphs. We have found that the transcription network of Saccharomyces cerevisiae contains a single large strongly connected component (a union of overlapping cycles), which we call LSCC, and that the topology changes discussed by Luscombe et al. [6] are well reflected within LSCC, in spite of its small size.
Yu and Gerstein [7] have examined the structure of regulatory networks and showed that it exhibited a certain natural hierarchy. We propose another hierarchical partition of the network: above the LSCC, the LSCC, below the LSCC and "parallel" to the LSCC (see Fig. 1, 2) and we show that this partition is in some sense natural.
Comparisons of biological networks with random graphs were the subject of methodological investigations by Barabasi and Albert [8], who proposed a scale-free model. This model is difficult to apply here. While the networks we investigated have the key property of scale-free networks, i.e. they have many nodes with degree much higher than the average, the distribution of the degrees is too irregular to match a particular power law. In a scale-free network the ratio of #{nodes with degrees k to 2k - 1} to #{nodes with degrees 2k to 4k - 1} is convergent, but in our networks it varies widely for different values of k (for a recent study of the scale-free nature of biological networks, see also [9,10]). Therefore Milo et al. [11] (see also Newman et al. [12]) proposed several methods of generating graphs that have the same in- and out-degrees as the reference network. We used their "matching algorithm" whenever possible, as well as faster and somewhat biased variants.
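The degree-ratio test mentioned above can be illustrated with a short sketch. The function name `octave_ratios` and the choice of `k` values are assumptions made for illustration only. For an idealized continuous power law with exponent γ the ratio would approach 2^(γ-1) for every k, so widely varying values suggest that no single power law fits.

```python
def octave_ratios(degrees, k_values=(1, 2, 4, 8, 16)):
    """For each k, return #{nodes with degree in [k, 2k-1]} divided by
    #{nodes with degree in [2k, 4k-1]} (None when the denominator is 0)."""
    ratios = {}
    for k in k_values:
        low = sum(1 for d in degrees if k <= d <= 2 * k - 1)
        high = sum(1 for d in degrees if 2 * k <= d <= 4 * k - 1)
        ratios[k] = low / high if high else None
    return ratios


# Toy degree sequence (hypothetical, not taken from the data sets).
toy_degrees = [1, 1, 1, 1, 2, 2, 3, 4]
toy_ratios = octave_ratios(toy_degrees, k_values=(1, 2, 4))
```

A roughly constant value across `k` would be consistent with a power law; the text reports that this is not what the yeast networks show.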

Results and Discussion
In the data set of Luscombe et al. [6] we can see the LSCC with 25 TFs and one small strongly connected component with two TFs.
To see if the cycles of the LSCC are significant, we checked how the topological changes of the transcription network during various physiological conditions are reflected inside the LSCC, we checked several graph characteristics of the TFs in the LSCC, and we compared the characteristics of the LSCC to the cycles in random networks.

General characterization of the cycles

Size of LSCC is relatively small
The cycles form two connected components, one "degenerate", consisting of 2 TFs, and one "large", consisting of 25 TFs.
The degenerate component consists of two TFs with indistinguishable interactions that have self-loops, thus they are TFs of themselves, and of each other. This may be a result of a relatively recent gene duplication. Thus we will ignore this cycle in our discussions.
The size of the largest cyclic component, 25, is rather small compared with random models (averages 42-43), with p-value ca. 0.025. The number of nodes in the remaining cycles, 2, is not very different from the average (0.8 to 1.3).
By way of contrast, the transcription network of Escherichia coli is either devoid of cycles or contains very few of them (depending on the data set, see Cosentino Lagomarsino et al. [13]).
Figure 1. Classifying TFs and TTs of the Luscombe network by their positions on the longest paths; note that class INT is empty. The paths are computed in the graph of scc's; in particular, we view LSCC as a single node. The entry in column i and row j shows the number of nodes u with these properties: the longest path through node u has i + j edges and the longest path from u to another node (a TT) has i edges (consequently, the longest path from another node to u has j edges). Note that the only way a node may be on a path of length 3 is when it has an edge from the node that corresponds to LSCC.

LSCC connected very strongly to the cell cycle
The transcription network reported by Luscombe et al. [6] has 142 TFs and 7074 interactions, of which we disregard 21 "self-loop" interactions; of the remainder, 254 are TF to TF. We use ITF to denote the latter set (interactions to transcription factors). 25 TFs and 49 interactions form the LSCC. The subnetworks associated with the 5 stages of the cell cycle have 64 interactions in common (we name this set CCC, "common to cell cycle"), all of them directed to TFs (hence in ITF), and 27 of them are present in the LSCC. If even one of these two sets, LSCC or CCC, is random, the expected number of common elements would be smaller than 13 (49 × 64/254) and the probability of |LSCC ∩ CCC| ≥ 27 would be below $10^{-6}$ (estimated by the binomial formula). This shows that LSCC is very strongly related to the cell cycle.
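A minimal sketch of this kind of overlap test follows. The text says only that a binomial formula was used, so the parametrization here is an assumption: the 64 CCC interactions are treated as trials, each landing in LSCC with probability 49/254. A hypergeometric tail (sampling without replacement) is included for comparison. The function names are hypothetical, and the sketch does not attempt to reproduce the exact estimate quoted above.

```python
from math import comb


def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def hypergeom_tail(population, marked, draws, k):
    """P(X >= k) when `draws` items are sampled without replacement from
    `population` items, of which `marked` are marked."""
    upper = min(marked, draws)
    favourable = sum(
        comb(marked, i) * comb(population - marked, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)


# Counts taken from the text: |ITF|, |LSCC|, |CCC|, |LSCC ∩ CCC|.
ITF, LSCC, CCC, OVERLAP = 254, 49, 64, 27

expected_overlap = LSCC * CCC / ITF                    # 49 * 64 / 254
p_binomial = binomial_tail(CCC, LSCC / ITF, OVERLAP)   # assumed parametrization
p_hypergeom = hypergeom_tail(ITF, LSCC, CCC, OVERLAP)
```

Both tails depend on the modelling choice, so they are best read as order-of-magnitude indications of how unlikely an overlap of 27 is under randomness.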

Cycles of subnetworks other than cell cycle
Stress response is special in the sense that it has cycles of its own, all of which involve YAP6, which is not active in any other subnetwork. It seems that the cyclic interaction of this TF with two other TFs distinguishes the stress response condition from the other exogenous conditions, diauxic shift and DNA damage. The latter have similar sets of active interactions in LSCC, but they lack 5 interactions involving YAP6.
One cycle consists of 3 interactions that are common to all conditions, REB1 → SIN3 → HSF1 → REB1. Note that HSF1 is a Heat Stress Factor, very important in the stress response, but also in "basal level sustained transcription" (see Mager and Ferreira [14]). One possible role of cycles in stress response is slowing down the recovery transition from the stress condition, so it can last several hours [14]. During the recovery, sporulation and cell cycle activities are suppressed. In this sense, stress response is partially endogenous, to use the classification of Luscombe et al. [6] (they group Cell Cycle and Sporulation as endogenous and the other conditions as exogenous). Fig. 3 shows the graph formed by the transcription factors and interactions of LSCC, with nodes placed on a square grid so as to minimize the edge lengths.

LSCC has an orderly layout
Figure 3. The diagram of LSCC; each node is a TF listed in Table 6.
In the diagram, al (apricot color) marks the nodes present in the cycles of all subnetworks. The cycles in the diauxic shift and DNA damage subnetworks contain only these nodes. (Note that an interaction of LSCC can be active in a subnetwork without belonging to a cycle in that subnetwork.)

The cycles in the sporulation subnetwork sp contain apricot and strawberry nodes.
The cycles in the cell cycle subnetwork cc contain apricot, strawberry and cerulean nodes.
The cycles in the stress response subnetwork sr contain apricot and sienna nodes.
Nodes that are not included in the cycles of any subnetwork are black.
We managed to find an orderly layout for LSCC, in which few edges are long while nodes with the same color are grouped together.

LSCC has small feedback vertex set
Another property of LSCC is that it has a small and unique minimum feedback vertex set, a set of nodes whose removal destroys all cycles.
The fact that there exists a unique minimum feedback vertex set with three nodes (vertices) can be clearly seen in Fig. 4. Let us call this set F = {1, 3, 25}.
We can use F to distinguish three natural cyclic units within LSCC, $S_b$ for each $b \in F$. We can think of b as the "boss" of $S_b$. We define $S_b$ as the union of all simple cycles that go through b but not through $F - \{b\}$. Only one node can have two bosses: $\{4\} = S_1 \cap S_{25}$. Because there is only one path from 1 to 4 and three disjoint paths from 25 to 4, we remove 4 from $S_1$ to make our units disjoint. The three sets coincide well with functional categories: $S_3 = \{3, 21, 24\}$ are the nodes on cycles of $LSCC_{sr}$, $S_1$ are the nodes on cycles of $LSCC_{sp}$, and $S_{25}$ are the nodes on cycles of $LSCC_{cc}$ minus $S_1$ (observe that $S_1$ is contained in $LSCC_{cc}$).
(Actually, $S_{25}$ has 11 nodes and it has one node that is not in $LSCC_{cc}$, 18, and one node of the cell cycle network is missed, 8.) Thus the cyclic subnetwork has three cyclic parts, plus two acyclic parts: 5 nodes on paths from $S_{25}$ to $S_3$, and 1 node on a path from $S_{25}$ to $S_1$. We show this schematically in Fig. 5.
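For a component as small as LSCC, the minimum feedback vertex sets can be found by exhaustive search. The sketch below is one way to do this, not necessarily the authors' procedure. It tries removal sets of increasing size and tests acyclicity with Kahn's algorithm; the function names and the toy graph are hypothetical.

```python
from itertools import combinations


def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the digraph induced on `nodes` has no cycle."""
    indeg = {u: 0 for u in nodes}
    succ = {u: [] for u in nodes}
    for u, v in edges:
        if u in indeg and v in indeg:   # ignore edges touching removed nodes
            succ[u].append(v)
            indeg[v] += 1
    stack = [u for u in indeg if indeg[u] == 0]
    processed = 0
    while stack:
        u = stack.pop()
        processed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return processed == len(indeg)


def minimum_feedback_vertex_sets(nodes, edges):
    """Return all feedback vertex sets of minimum size (exhaustive search)."""
    nodes = list(nodes)
    for size in range(len(nodes) + 1):
        found = []
        for removed in combinations(nodes, size):
            rest = [u for u in nodes if u not in removed]
            if is_acyclic(rest, edges):
                found.append(set(removed))
        if found:
            return found
    return []


# Toy graph: two 2-cycles sharing node 2, so {2} is the unique minimum set.
toy_nodes = [1, 2, 3, 4, 5]
toy_edges = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5)]
toy_fvs = minimum_feedback_vertex_sets(toy_nodes, toy_edges)
```

If the function returns a single set, the minimum feedback vertex set is unique. With 25 nodes and a minimum of size three, only a few thousand subsets need to be checked.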

Differences and similarities of subnetworks are reflected in LSCC
For subnetwork A we define $LSCC_A$ as the set of interactions of A that are also in LSCC; to measure the difference between two sets we use $|A \oplus B|$, the number of elements that are in one of the sets A and B but not in both.
One way to show the importance of LSCC to the regulatory mechanism is that the differences and similarities between physiological conditions tend to be "exaggerated" when we use LSCC as the "window". When we compare a symmetric difference $|ITF_x \oplus ITF_y|$ with $|LSCC_x \oplus LSCC_y|$, the size of the latter should be, on average, 49/254 of the former. These comparisons are in Figure 6. In general, sp is closely related to cc, and the difference inside LSCC is smaller than expected, while dd, ds and sr are unrelated, and the differences in LSCC are larger than expected, especially in the case of sr, the stress response.
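The comparison described above can be sketched with plain set operations. The function name `lscc_window` and the toy edge sets are hypothetical; the default sizes 254 and 49 are the ITF and LSCC interaction counts given in the text.

```python
def lscc_window(itf_x, itf_y, lscc_edges, n_itf=254, n_lscc=49):
    """Compare the symmetric difference of two subnetworks inside ITF with
    the one seen through the LSCC 'window'.

    Returns (|ITF_x (+) ITF_y|, |LSCC_x (+) LSCC_y|, expected LSCC difference),
    where the expectation scales the ITF difference by n_lscc / n_itf.
    """
    diff_itf = len(itf_x ^ itf_y)
    diff_lscc = len((itf_x & lscc_edges) ^ (itf_y & lscc_edges))
    expected = diff_itf * n_lscc / n_itf
    return diff_itf, diff_lscc, expected


# Toy example: a universe of 6 interactions, 2 of which lie in "LSCC".
toy_lscc = {("a", "b"), ("b", "a")}
toy_x = {("a", "b"), ("c", "d")}
toy_y = {("b", "a"), ("c", "d"), ("e", "f")}
toy_result = lscc_window(toy_x, toy_y, toy_lscc, n_itf=6, n_lscc=2)
```

An observed LSCC difference above the expected value corresponds to the "exaggerated" differences reported for unrelated conditions, and a value below it to related conditions such as sp and cc.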

Statistic profile of the TFs from the LSCC for three different original networks
We tested properties of LSCC in randomly generated networks. We also tabulated results of random tests based on two larger data sets. In our tables, we refer to the networks using names of the first authors of the paper in which they were published [6,7,15], hence we call them Luscombe, Yu and Balaji.
In our random networks we kept all original connections from TFs to Terminal Targets (i.e. regulated genes which are not TFs themselves; later we refer to them with the abbreviation TTs). The remaining connections were "rewired" at random, using three criteria, R, F and B. Criterion R was a uniformly random permutation of the edge ends, conditional on obtaining a "correct network": no self-loops or duplicated edges. Criterion F was creating a bias in the selection of the permutation so the resulting number of feed-forward loops was close to the actual value in the original network. Criterion B was similar, but with bi-fans rather than feed-forward loops.
When we refer to our computed average values we use the form x (y, z) to denote "average obtained using criterion R (F, B)".
Figure. The smallest feedback vertex set of LSCC and the subdivision of LSCC.

Average size and out-degree
The size of LSCC is considerably smaller than the average, 25 versus 42 (41, 43), with p-value of 0.025 (0.04, 0.02), and the situation is similar for Yu and Balaji. (The sizes of LSCC, as well as the classes defined in the next section in terms of LSCC, are in Tables 1, 2, 3.) The average number of targets for the TFs of the LSCC is much higher than the average for all TFs, 128 versus 50. This discrepancy is somewhat smaller when we make such a comparison in a random model, 97 (96, 100) versus 50. Because cycles are sets of edges, a node with large out-degree has a greater chance of belonging to a cycle, or to a union of overlapping cycles such as LSCC. For example, in the actual network, almost half of the members of LSCC (12 of 25) belong to the top 20 TFs if we rank them by the number of targets.
The lower average out-degree of the LSCC in random models is perhaps a simple consequence of the fact that they have, on average, a much larger LSCC, so the TFs from the top 20 do not dominate the average as much as in the smaller LSCC of the actual network. Detailed comparisons of average out-degrees can be found in Table 4.

Position of LSCC in the hierarchy
Only 9 TFs belong to the in-component of the LSCC (denoted In-LSCC) in the sense that there are paths from these TFs to the LSCC; of these 9 paths 8 are single edges and one consists of two edges. If we consider that path to be an exception, collectively the LSCC has an unambiguous hierarchical position, 2nd from the top. In a random network, on average we have 17 (16, 17.5) TFs in In-LSCC. In this sense, the LSCC is higher in the hierarchy than the average in the random models.
Almost all paths with more than 2 edges are related to the LSCC in the following sense: either they include a TF from the LSCC, or form the final part of a path that starts in the LSCC. Two TFs form an exception to that rule, namely they can start a path with more than 2 edges that is not such a final part.
After collapsing scc's to single nodes we measured for each TF the maximum path length (for paths to which it belongs), and we call it MPL. For 38 TFs the value of MPL is at most 2, and they form a rather separate part of the transcription network which we call SIMPLE. 104 TFs have MPL of at least 3. The maximum of MPL is 13, more than the average in random networks, which is 8.3 (8.4, 8.5). (The maximum length of a simple path is perhaps a better measure, but it requires a much more complex program to compute. It is closely related to the feedback vertex set problem.)
Figure 6. Intersections and symmetric differences of functional subnetworks inside ITF and LSCC.
Figure 5. Three cyclic units of LSCC with connections.
Yu and Gerstein [7] propose a partition of networks according to the length of shortest paths to those TFs that have only TTs as their targets. This definition would not work with the length of the shortest paths to TTs: this length is 1 for all TFs but ten, and for those ten it is 2, so the hierarchy would be trivial. Because LSCC has such a special and statistically significant position in the network, we propose to partition TFs by their relation to LSCC, as indicated in Fig. 1.
We performed our study using the data of Luscombe et al. [6] because we wanted to compare the cycles with physiological subnetworks described in their paper. Nevertheless, we compared our definition of a hierarchy with that of Yu and Gerstein [7], who performed their investigation in a larger transcription network. We performed two tests applied by Yu and Gerstein to their classes (see Fig. 2 for the partition of Yu network into classes).
The division we propose is closely related to the notion proposed by Yu and Gerstein: a division of transcription control mechanisms into reflex processes and cogitation processes. SIMPLE clearly corresponds to reflex processes. In a cogitation process, one that involves a long path of interactions, we can partition the process into the beginning, the middle and the ending part. As the various paths have very different lengths, identifying LSCC as the middle is both "objective" and independent from the path length, and at the same time quite arbitrary. However, we show in the next subsection that LSCC has a "switchboard" property even in the physiological conditions in which paths do not form cycles, and we have just seen that the percentage of cancer related genes sharply drops as we move from the middle to the final part of the long paths.

Topological changes inside LSCC
In Fig. 7 and Fig. 8 we can see the interactions of LSCC that are active in various physiological conditions. We can observe large differences between the subnetworks, both in the composition and in topological characteristics like average path length.
Luscombe et al. [6] measured the following topological characteristic in the subnetworks: the average length of shortest paths from TFs to TTs. By its very nature, LSCC is In other words, only 12% of shortest connections between TFs and TTs do not go through LSCC, and these paths contribute only 5.8% to the sum of lengths.
Because so many TF-to-TT paths go through LSCC, the differences between average path lengths that were observed for different subnetworks by Luscombe et al. [6] are largely caused by the different presence of these networks in the LSCC. In Table 5 we use PERCENTPATH to denote the percentage of the shortest paths from transcription factors to the terminal targets that either originate or go through LSCC, and PERCENTLENGTH to denote the similar percentage for the sum of lengths of shortest paths. Table 5 shows that even in DNA damage and diauxic shift subnetworks the majority of shortest paths between TFs and TTs goes through LSCC; we may say that LSCC has a role of a switchboard (each node is TF in Table 6).

Conclusion
We inspected graph-theoretic properties of the cycles in the transcription network of Saccharomyces cerevisiae. While in general cycles are "avoided" by the network, interactions common to all phases of the cell cycle form a big exception, and interactions specific to the stress response form a smaller exception. In spite of their modest number (they involve 25 of 142 transcription factors that were included in the data set), the transcription factors that are included in cycles have a large topological impact: most of the shortest paths between transcription factors and terminal targets go through them.
One should compile many kinds of data to establish the exact role of the cycles of transcription interactions in controlling life processes. In particular, cell cycle, which is closely related to cancer, possesses a long cycle that can be easily interrupted at many different points, and the process itself can be interrupted by a number of different conditions (like DNA damage).
We have shown that LSCC is a key part of the regulatory network and that it can be divided into functional subunits. Further work will yield a fuller and clearer picture of these subunits and their interactions under various conditions.
Figure 7. Parts of LSCC that are active during endogenous conditions (or, conditions with a larger number of active cycles). Cell cycle: The interaction between 5 and 15 appears to repress stress response. Sporulation: Most of the cell cycle interactions are present, but the cycle interactions leaving node 25 are not. Replication of DNA is an activity shared with the cell cycle. Stress response: When we compare the part of LSCC that is active during stress response with parts of LSCC that are active during cell cycle and sporulation, we note that in the latter cases the stress response cycle is totally inactive, but it is partially active during the diauxic shift and DNA damage, which are related to stress (damage: obvious; diauxic shift: the shift toward a less favored nutrition source). The center of the cell cycle is activated during stress response, which can be part of a repression mechanism.

Data
We used supplementary materials for [6]; we also used supplementary materials of [7,15] and the list of yeast homologs of human cancer genes personally communicated by Haiyuan Yu.

Graph-theoretic definitions
A graph of a network consists of nodes (which correspond to TFs, transcription factors, and TTs, terminal targets) and directed edges/interactions.
Figure 8. Parts of LSCC that are active during exogenous conditions (or, conditions with the fewest active cycles). DNA damage: The activity is, with a small exception, a subset of stress response, but without the cycle-closing activity of node 3. Diauxic shift: Part of stress response is activated too, as for DNA damage.
In each physiological subnetwork we consider the set P of pairs of the form TF-TT such that there is a chain of interactions (a path) from the TF of the pair to the TT; for $p \in P$ we define $\ell_p$, the length of the shortest path for this pair; moreover, Q is the subset of P such that the respective path has to go through LSCC. Then average path length $= \sum_{p \in P} \ell_p / |P|$, PERCENTPATH $= |Q|/|P|$, and PERCENTLENGTH $= \sum_{p \in Q} \ell_p / \sum_{p \in P} \ell_p$.
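The three quantities just defined can be sketched with breadth-first search. The membership rule for Q used here is an assumption: a TF-TT pair is counted in Q when the TF lies in LSCC, or when no shortest path for the pair avoids LSCC. The text does not spell out this rule, and the function names and toy network are hypothetical. The sketch returns fractions rather than percentages.

```python
from collections import defaultdict, deque


def bfs_distances(succ, source, blocked=frozenset()):
    """Shortest path lengths (in edges) from `source`, never entering `blocked`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in succ.get(u, ()):
            if v not in dist and v not in blocked:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_statistics(edges, tfs, tts, lscc):
    """Average shortest TF-to-TT path length, PERCENTPATH and PERCENTLENGTH.
    Assumes at least one TF-TT pair is connected."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    lscc = frozenset(lscc)
    n_p = n_q = sum_p = sum_q = 0
    for tf in tfs:
        dist = bfs_distances(succ, tf)
        # Distances when LSCC is forbidden; empty if the TF itself is in LSCC.
        avoid = {} if tf in lscc else bfs_distances(succ, tf, blocked=lscc)
        for tt in tts:
            if tt not in dist:
                continue                      # pair is not in P
            n_p += 1
            sum_p += dist[tt]
            if avoid.get(tt) != dist[tt]:     # assumed rule for membership in Q
                n_q += 1
                sum_q += dist[tt]
    return {
        "average_path_length": sum_p / n_p,
        "PERCENTPATH": n_q / n_p,
        "PERCENTLENGTH": sum_q / sum_p,
    }


# Toy network: TFs a, b, c with LSCC = {b, c}; terminal targets x, y.
toy_edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "x"), ("a", "y")]
toy_stats = path_statistics(toy_edges, ["a", "b", "c"], ["x", "y"], {"b", "c"})
```

In the toy network four TF-TT pairs are connected and three of them need LSCC, so most of the path count and most of the total length are attributed to LSCC.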
A path in a graph is a sequence of nodes $(u_0, \ldots, u_{k-1})$ such that each consecutive pair $(u_{i-1}, u_i)$ is an edge. If additionally there exists an edge $(u_{k-1}, u_0)$ we say that this is a cycle.
A single node (u) forms a degenerate cycle.
Nodes in a graph are partitioned into strongly connected components, or SCC's. A node u is contained in SCC(u) which is the union of the node sets of all cycles that contain u.
SCC's with one node are called trivial.
For graph G we define the strong component graph $G_{SCC}$, the graph of SCC's of G. Nodes of $G_{SCC}$ are scc's of G, and edges are pairs of the form (SCC(u), SCC(v)) such that (u, v) is an edge of G and SCC(u) ≠ SCC(v).
$G_{SCC}$ cannot have cycles of its own, and therefore it is easy to compute longest paths in that graph (the algorithm is considered folklore). The path lengths in that graph are used in Fig. 1.
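The longest-path computation on an acyclic graph can be sketched with dynamic programming over a topological order. For every node the sketch computes the longest path ending at it and the longest path starting from it; their sum is the length of the longest path through the node, the quantity called MPL in the text. The function names and the toy graph are hypothetical, and every edge endpoint is assumed to be listed in `nodes`.

```python
from collections import defaultdict


def topological_order(nodes, edges):
    """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
    succ = defaultdict(list)
    indeg = {u: 0 for u in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    order = []
    stack = [u for u in indeg if indeg[u] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    if len(order) != len(indeg):
        raise ValueError("graph has a cycle")
    return order, succ


def max_path_lengths(nodes, edges):
    """For each node of a DAG, the number of edges on the longest path through it."""
    order, succ = topological_order(nodes, edges)
    into = {u: 0 for u in order}    # longest path ending at u
    out = {u: 0 for u in order}     # longest path starting at u
    for u in order:
        for v in succ[u]:
            into[v] = max(into[v], into[u] + 1)
    for u in reversed(order):
        for v in succ[u]:
            out[u] = max(out[u], out[v] + 1)
    return {u: into[u] + out[u] for u in order}


# Toy DAG: a -> b -> c -> d with a shortcut a -> c and a side entry e -> d.
toy_mpl = max_path_lengths("abcde", [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c"), ("e", "d")])
```

The two passes run in time linear in the number of nodes and edges, which is why the restriction to the acyclic graph of scc's makes the problem easy.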
We use LSCC to denote the largest strongly connected component in a graph. We apply this definition when the majority of elements of non-trivial scc's belongs to one of them, so there is no ambiguity as to which one is "the largest".

Algorithms
To compute non-trivial scc's we first obtained a "dictionary" protein code ↔ number followed by pairs of numbers representing the edges. We computed scc's and the graph of scc's using the method described in section 22.5 of Cormen et al. [16].
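A minimal sketch of a two-pass depth-first search for strongly connected components follows, in the spirit of the textbook method. It is not the authors' program: the iterative formulation, the function names and the toy graph are assumptions, and `nodes` is assumed to list every edge endpoint.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Map each node to a representative of its strongly connected component."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    # Pass 1: depth-first search on G, recording nodes as they finish.
    finished, visited = [], set()
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            u, neighbours = stack[-1]
            for v in neighbours:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(succ[v])))
                    break
            else:                       # all neighbours explored: u is finished
                finished.append(u)
                stack.pop()

    # Pass 2: search the reversed graph in decreasing finishing time.
    comp = {}
    for root in reversed(finished):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            u = stack.pop()
            for v in pred[u]:
                if v not in comp:
                    comp[v] = root
                    stack.append(v)
    return comp


def condensation(nodes, edges):
    """Return the component map and the edge set of the graph of scc's."""
    comp = strongly_connected_components(nodes, edges)
    scc_edges = {(comp[u], comp[v]) for u, v in edges if comp[u] != comp[v]}
    return comp, scc_edges


# Toy graph with components {1, 2, 3}, {4, 5} and {6}.
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (6, 1)]
toy_comp, toy_scc_edges = condensation(range(1, 7), toy_edges)
```

Grouping nodes by their representative gives the scc's, and the largest group plays the role of LSCC.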
Shortest paths used in subsection on Position of LSCC in the hierarchy were computed using breadth first search.

Defining motifs, generating random graphs
We define a feed-forward loop (ffl for short) as a triple of nodes $\{u_0, u_1, u_2\}$ such that there exist three edges: two form a path $(u_0, u_1, u_2)$ while the third forms a shortcut, $(u_0, u_2)$. A bi-fan is a quadruple of nodes $(u_0, u_1, v_0, v_1)$ such that all of the 4 possible edges of the form $(u_i, v_j)$ exist.
When we count ffl's and bi-fans we remove the self-loops (edges of the form (u, u)) from the graph.
Moreover, every triple/quadruple is counted separately, even when they share nodes.
To count ffl's and bi-fans we made a table Overlap that for a pair of TFs stored the number of common targets. For every positive entry k = Overlap(a, b) we add k(k - 1)/2 to the count of bi-fans, and if there is an edge from a to b, we add Overlap(a, b) to the count of ffl's.
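The counting rule above can be sketched as follows. The function name `count_motifs` and the toy edges are hypothetical. Overlap(a, b) is computed on the fly as the size of the intersection of the two target sets, rather than stored in a table.

```python
from collections import defaultdict
from itertools import combinations


def count_motifs(edges):
    """Return (number of feed-forward loops, number of bi-fans)."""
    targets = defaultdict(set)
    for u, v in edges:
        if u != v:                              # self-loops are removed first
            targets[u].add(v)
    ffl = bifan = 0
    for a, b in combinations(list(targets), 2):
        k = len(targets[a] & targets[b])        # Overlap(a, b)
        bifan += k * (k - 1) // 2               # choose 2 of the k common targets
        if b in targets[a]:                     # edge a -> b closes k ffl's
            ffl += k
        if a in targets[b]:                     # edge b -> a closes k ffl's
            ffl += k
    return ffl, bifan


# Toy network: a and b share targets c and d, and a regulates b.
toy_counts = count_motifs([("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("a", "a")])
```

Each unordered pair of regulators is visited once, so every triple and quadruple is counted separately even when they share nodes.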
We generated networks to make statistical comparisons. First, we generated random networks, or R. For the Luscombe network, we permuted TF entries of adjacency lists at random. After permutation, lists could contain errors: a TF that "owns" the respective list, or a TF that has another copy earlier on the list. We repeated random permutations until error-free lists were obtained, a process that took 1-2 seconds.
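The permutation-with-rejection step can be sketched as below. It is a simplified illustration, not the authors' program: it shuffles the target ends of a list of TF-to-TF edges and retries until no self-loop or duplicated edge remains. The name `rewire_by_permutation` and the `max_tries` cap are assumptions.

```python
import random


def rewire_by_permutation(edges, rng=None, max_tries=100_000):
    """Shuffle edge targets until the result has no self-loops or duplicates.

    Every node keeps its original in-degree and out-degree.
    """
    rng = rng or random.Random()
    sources = [u for u, _ in edges]
    targets = [v for _, v in edges]
    for _ in range(max_tries):
        rng.shuffle(targets)
        candidate = list(zip(sources, targets))
        no_self_loops = all(u != v for u, v in candidate)
        no_duplicates = len(set(candidate)) == len(candidate)
        if no_self_loops and no_duplicates:
            return candidate
    raise RuntimeError("no error-free permutation found")


# Toy usage with a seeded generator for reproducibility.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c"), ("b", "a")]
toy_rewired = rewire_by_permutation(toy_edges, random.Random(0))
```

Rejecting whole permutations keeps the sample uniform over error-free networks, at the cost of many retries on larger or denser networks, which is the difficulty reported for the larger data sets.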
For Yu and Balaji, this provably unbiased approach [11] had no results within 30 minutes, so we used a variation of the Metropolis random walk. Starting from the original network, we repeatedly selected pairs of edges at random to swap their endpoints; a swap introducing new errors was performed with probability β and rejected otherwise. We set β so the process would result in an error-free network in a reasonable time (several seconds, or several million attempts on average).
Random networks were modified to boost the number of motifs, either feed-forward loops (version F) or bi-fans (version B). Boosting was performed via a Metropolis process in which a randomly selected swap was rejected if it decreased the number of desired motifs by k (more precisely, such a swap was rejected with probability $1 - \alpha^k$ for some α), or if it increased the number of errors by l (a swap was rejected with probability $1 - \beta^l$). Parameter α was adapted by the algorithm: decreased if the number of motifs was too small and not growing, and increased when it was too large.
OUT-LSCC, the out-component of LSCC. SIMPLE, TFs whose longest paths to which they belong is at most 1 or 2. INT, TFs whose longest paths to which they belong is at most 3. EXCP, TFs that are not in any of IN-LSCC, LSCC, OUT-LSCC, or SIMPLE. ffl, feed-forward loop. In tables, we used Luscombe, Yu and Balaji to refer to networks from the data sets published in [6,7,15] respectively, and we used R, F and B to refer to random models generated with the simple Metropolis method (R), a variation of that method that increased the number of ffl's (F) to the actually observed value, and a similar variation for the bi-fan motifs (B). The terms PERCENTPATH and PERCENTLENGTH are explained in detail in the caption of Table 5.