Ranges of control in the transcriptional regulation of Escherichia coli
© Sonnenschein et al. 2009
Received: 12 June 2009
Accepted: 24 December 2009
Published: 24 December 2009
The positioning of genes in the genome is an important evolutionary degree of freedom for organizing gene regulation. Statistical properties of these distributions have been studied particularly in relation to the transcriptional regulatory network. The systematics of gene-gene distances thus become an important source of information on the control that different biological mechanisms exert on gene expression.
Here we study a set of categories that, to our knowledge, has not been analyzed before. We distinguish between genes that do not participate in the transcriptional regulatory network (i.e. genes that, according to current knowledge, neither produce transcription factors nor possess binding sites for transcription factors in their regulatory region) and genes that, via transcription factors, either are regulated by or regulate other genes. We find that the two types of genes ("isolated" and "regulatory" genes) show a clear statistical repulsion and have different ranges of correlations. In particular, we find that isolated genes have a preference for shorter intergenic distances.
These findings support previous evidence from gene expression patterns for two distinct logical types of control, namely digital control (i.e. network-based control mediated by dedicated transcription factors) and analog control (i.e. control based on genome structure and mediated by neighborhood on the genome).
The circular genome of E. coli is still an object of intense scientific research (see, e.g., ). It is a rich source of information on the organization of gene regulation and on the interplay of different types of control exerted on gene expression, a model system for analyzing DNA topology, and the organism for which the most detailed electronically accessible transcriptional regulatory network has been compiled.
Many processes, acting on a broad range of scales, contribute to the evolution of bacterial chromosomes. Genes are organized in operons, i.e., groups of genes sharing a regulatory domain. The genome is shaped by point mutations, large-scale rearrangements, strand breaks and inversions during replication. The gene inventory is modified by gene duplications or deletions and lateral or horizontal transfer of genes. It is striking that an ever closer look at statistical properties of data reveals ever more systematic information, shaped by evolution, on an ever broader range of length scales.
Starting from the work by De Martelaere and Van Gool (1981)  and Jurka and Savageau (1985)  the gene density along the circular chromosome of E. coli has been discussed as a potential source of information on the evolutionary shaping of the system and in particular as a means of using DNA topology (i.e. the 3D structure of the genome) for regulatory purposes (see also ).
The papers by Warren and ten Wolde (2004)  and Képès (2004)  focus on distances between genes or operons. Both study the specific patterns in the distributions of distances between regulatory pairs (genes or operons regulating each other, or pairs of genes or operons co-regulated by other genes). Warren and ten Wolde (2004)  find a substantially reduced distance between operons in such regulatory pairs, suggesting an evolutionary pressure to reduce such distances for efficient regulation. To obtain this result, they use classical characteristics of point process statistics, namely partial pair correlation functions and nearest neighbor distance probability density functions.
Képès (2004)  observes a periodicity in the distances between regulator and target, where the period length is of the same order of magnitude as known loop domains in the 3D organization of the E. coli chromosome.
Darling et al. (2008)  discuss biases in genomic inversions with respect to the replichores and other patterns of genome rearrangement in bacterial chromosomes. Another important factor influencing gene-gene distance statistics on a very general level is gene clustering. The origin of observed gene clustering has been attributed to gene duplication and divergence, to an evolutionary advantage of clustering (as it might increase a gene's chance for horizontal gene transfer) or, lastly, to a selective advantage of gene clusters due to functional coupling and the efficient organization of transcription (see the discussion in ).
From the systems perspective, mainly the regulatory control mediated by direct binding of transcription factors has been investigated. The compilation of these interactions for E. coli into a database  allows the construction of a transcriptional regulatory network (TRN) . This view yields deep topological insights into the hierarchical organization of TRNs (Ma et al., 2004 ; Yu and Gerstein, 2006 ) and their composition out of specific network motifs (Shen-Orr et al., 2002 ). The TRN has been used for the interpretation of expression patterns (Gutierrez-Rios et al., 2003 ; Herrgard et al., 2003 ), revealing both the potential and the limitations of this perspective. In particular, it recently became obvious that other effects with very different regulatory mechanisms have to be taken into account, such as alterations of the DNA structure on a small [16, 17] and larger  scale. Thus, understanding the organizational logic of gene regulation first requires a clear distinction between the different control types, as a prerequisite for assessing their impact on regulation.
Another link between these two research areas, gene distribution and TRN, comes from the observation that gene neighborhood explains some features of observed gene expression patterns (Marr et al. 2008 ; Blot et al. 2006 ). In particular, Marr et al. (2008)  analyze the interplay between two types of control in gene expression profiles in E. coli, one network-mediated and the other mediated by DNA topology.
These two control types have been termed digital (referring to the fact that the TRN provides static information on the connections between unique, discontinuous components, e.g. a particular pair of regulator and regulated gene) and analog (referring to the fact that the expression of specific genes is under the control of continuous information provided by distributions of supercoiling energy in the genome), respectively .
The statistical properties of gene distributions and gene spacings have been studied to detect deviations from randomness and interpret these deviations in a suitable evolutionary context. To a large extent, these investigations differ (apart from the technical details of the statistical tools and the construction of suitable null models) predominantly in the categories of genes analyzed. In the present paper we show results for two analysis steps, where the first analysis distinguishes between two classes. Analysis I discusses genes involved in regulation (i.e., either being regulated by a transcription factor or producing a transcription factor regulating other genes; class 1) and genes not involved in regulation mediated by transcription factors (which in the following we will call "isolated genes"; class 2). Analysis II consists of pairs of genes regulated by a common transcription factor. Distances between the genes in such a pair will be contrasted to the distances between arbitrary genes.
The biological hypothesis behind these categories is that different means of gene regulation essentially have different length scales. The novel feature of our approach lies in two points: (1) the distribution of regulated/regulating genes vs. (regulatorily) isolated genes has not been studied before. Our finding here, a pronounced deviation from randomness for the isolated genes, fits the hypothesis stemming from previous investigations of control types in gene expression patterns (Marr et al. 2008 ); (2) in order to detect deviations from randomness we employ different non-classical types of correlation functions. Our hypothesis, based on the findings from Marr et al. (2008) , is that the existence of distinct logical types of control (namely digital and analog) has a systematic impact on the statistical features of gene distributions. In particular, distances between isolated genes and all others should be smaller than average distances between genes, as isolated genes tend to be co-regulated by spatial neighborhood via the 3D structure of the genome.
In the following, results are presented both on the level of individual genes and on the level of operons.
First we present the gene distance distributions for the two gene classes (isolated genes and genes involved in regulation; see above). Then we discuss pair correlation functions $g(s)$, partial pair correlation functions $g_{ij}(s)$, mark connection functions $p_{ij}(s)$, connectivity correlation functions $c(s)$, and control correlation functions $k_3(s)$ (see Materials and Methods).
Thus we are confronted with problems of point process statistics (see Materials and Methods), where the genes or operons are the points. They are marked by 1 or 2, corresponding to the classes above, 1: involved in regulation, 2: isolated.
Typical gene sizes range from a few hundred bp to several thousand bp, with the mean size centered around 1 kbp.
There is even a maximum for distances of 1 and 2 kbp, while for larger values the curves rapidly approach the value 1, which corresponds to the absence of location correlation. The range of correlation is somewhat longer for the operons than for the genes; it extends to 6 kbp.
A suitable tool for analyzing the relative contributions of the different categories to these correlations is the partial pair correlation function $g_{ij}(s)$ with $ij = 11$ (between genes involved in regulation), $ij = 22$ (between isolated genes) and $ij = 12$ (one gene involved in regulation, the other isolated), respectively.
The curves for $g_{12}(s)$, however, are new and, to a certain extent, unexpected: the distances between isolated and regulatory genes do not show a peak at intermediate distances. Evidently, the repulsion between isolated and regulatory genes is stronger and longer-ranged than that between genes of the same type, extending to 7 kbp. In contrast, for operons it is shorter, only 3 kbp.
The term "repulsion" is used here in a simplifying sense, in order to say that there is a tendency that the distances between isolated and regulatory genes are larger than between genes of the same type. This may be a result of real repulsion as well as of relative "attraction" of the members of one class towards itself. We interpret this repulsion as an unmixing of genes predominantly regulated by transcription factors (digital control; cf. ) and genes predominantly regulated by the 3D structure of the genome (analog control). For the first type (class 1) distance correlations should be less important than for the second type (class 2) where regulation is mediated (among other processes) by the neighborhood of genes on the genome.
It should be noted that the partial pair correlation functions $g_{ij}(s)$, unlike the mark connection functions $p_{ij}(s)$, are individually normalized. In contrast to $p_{ii}(s)$, we see maxima of $g_{ii}(s)$ around 2 kbp. Comparison between the types 1 and 2 shows that regulatory genes are more regularly distributed than isolated genes (as the maximum is higher for $g_{11}(s)$). We would also like to point out that the estimates of the partial pair correlation function and mark connection function depend continuously on the proportions of class 1 and class 2 genes in this analysis (see also Methods). We thus expect that small fluctuations in the data will leave the main results of our analysis intact.
Both in the partial pair correlation function $g_{12}(s)$ and in the mark connection function $p_{12}(s)$ one can see that the two classes (isolated genes and genes involved in regulation) repel each other. On the level of the operons this repulsion is less clearly visible (and has a range up to approximately 2.5 kbp); in general, operons are more irregularly spaced than the genes. In all these cases, this can be explained by the elimination of many short (intra-operon) distances from consideration when passing from the gene level of description to the operon level.
Our statement that short distances and analog control are qualitatively related can also be checked on the level of this data set. While it should be noted that our key result is a statistical signal emerging from the collective ensemble of genes (and here we show additionally, how these findings can again be cross-validated against high-throughput data), we again resort to the data from  and compare a histogram of inter-gene distances obtained from supercoiling-sensitive genes with a histogram obtained from a random selection of genes. The trend towards smaller distances is clearly seen. This figure is included as supplementary information (Additional File 1).
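A minimal sketch of such a histogram comparison, assuming that "inter-gene distance" means the gap between consecutive genes of the selected set along the circular chromosome. The function names, the toy coordinates and the use of NumPy are illustrative assumptions, not the procedure used for Additional File 1.

```python
import numpy as np

def consecutive_gaps(positions, genome_length):
    """Gaps between neighbouring points on a circle, including the wrap-around gap."""
    x = np.sort(np.asarray(positions, dtype=float) % genome_length)
    return np.diff(np.append(x, x[0] + genome_length))

def random_subset_gaps(positions, subset_size, genome_length, rng):
    """Gaps for a random selection of genes of the same size (null comparison)."""
    pool = np.asarray(positions, dtype=float)
    chosen = rng.choice(pool, size=subset_size, replace=False)
    return consecutive_gaps(chosen, genome_length)

# Toy data with hypothetical coordinates (not taken from any database)
rng = np.random.default_rng(0)
genome_length = 100_000
all_genes = rng.uniform(0, genome_length, size=500)
selected = all_genes[:50]  # stand-in for a gene set of interest
gaps_selected = consecutive_gaps(selected, genome_length)
gaps_random = random_subset_gaps(all_genes, 50, genome_length, rng)
```

The two gap arrays can then be binned with the same bin edges to compare their histograms; repeating the random selection many times would give a spread for the null comparison.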
The inset in Figure 10b summarizes the two parts of Figure 10 by showing the difference between the isolated gene curve and the regulatory gene curve from Figure 10a (full curve in the inset; tsEPODs) and from Figure 10b (dotted curve in the inset; heEPODs), respectively. A particularly interesting feature seen in the inset is that at short distances the full curve goes up and the dotted curve goes down, i.e. there are (at short distances) far more isolated genes in the vicinity of transcriptionally silenced EPODs and more regulatory genes in the vicinity of highly expressed EPODs.
Patterns (i.e. systematic deviations from randomness in the arrangement of genes) in the genome of E. coli have been studied on many different scales.
Here we analyzed another facet of this topic by distinguishing between genes involved and not involved in regulation based on transcription factors. Our key finding is that these two classes, regulatory and isolated genes, display a statistical repulsion. Furthermore, the (operon-level) partial pair correlation function has a peak at shorter distances for isolated genes than for regulatory genes. This preference for shorter distances among isolated genes is also visible in the mark connection function and supports our hypothesis that analog control is more important for this class of genes than for the regulatory genes, for which digital control is a longer-ranging alternative.
Whether the statistical properties of inter-gene distances discussed here originate from the need to organize gene regulation or from the dynamics of genome rearrangement cannot be ultimately decided based on the data at hand.
Minimal models of genome arrangement dynamics and its impact on gene expression could be a useful tool for deciding whether the distance pattern between genes is indirectly shaped (and therefore deviates from pure randomness) by these dynamics, rather than being evolutionarily constrained to contribute more directly to gene regulation.
The statistical differences between isolated and regulatory genes described here suggest that, indeed, the genes currently classified as isolated from the perspective of the available TRN are systematically different from the genes involved in regulation. We by no means want to suggest either (a) that these genes are indeed unregulated or (b) that the current version of RegulonDB (version 6.2) is complete. However, when considering the extreme cases of isolated genes being just gaps in the database and, on the other hand, isolated genes being systematically regulated by other means, our results support the latter view.
Even though we consider our findings in an evolutionary context (by making visible some deviations from randomness of the gene distances in the E. coli genome, which can only be understood evolutionarily), we do not directly discuss the comparative genomics aspect here. It would be particularly interesting to analyze the degree of evolutionary conservation as a function of the distance between genes, separately for the two categories of genes. A hypothesis for such an extension of our analysis could be that pairs of genes contributing strongly to the patterns we observe have a higher degree of evolutionary conservation. This is, indeed, a whole work package we plan to tackle in a future investigation.
Eventually one needs to arrive at a more holistic view of the system and explain the interplay between gene arrangement, DNA binding site distributions, physical properties of DNA binding sites, the architectural properties of the transcriptional regulatory network and the spatial gene expression patterns, in order to understand the binding site code behind global gene expression and to unravel the universal design principles of transcriptional regulation.
In the statistical analyses of this paper, the genes are considered as points on a circle C, the circular chromosome of E. coli. Thus, a random system of points is analysed, which leads to the application of methods of point process statistics. (The term "process" is related to early applications where the points were time instants. Also the term "stationary" is related to these applications; "homogeneous" could be an equivalent.) These methods have been mainly developed for the planar (d = 2) and spatial (d = 3) case, but can be easily applied also in the one-dimensional (d = 1) case considered here. So our main reference is Illian et al. (2008) .
Similarly to the investigations of [5, 6] we assume that the point pattern belongs to a "stationary" point process, i.e. that the point distribution is rotation invariant. This implies that the local point density shows only irregular fluctuations, as is the case here. Thus it makes sense to speak about the "intensity", the mean number of points per length unit. As in  it is denoted here by $\lambda$.
The points considered are marked. There are two marks, namely "1" and "2", where "1" stands for "regulatory" and "2" for "isolated". A point with mark $i$ is called an i-point. The fraction of i-points in the point process is denoted by $p_i$, for $i = 1, 2$; it can be interpreted as the probability that a randomly chosen point is an i-point.
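The intensity and the mark fractions follow directly from counts. The sketch below is illustrative only: the function name is hypothetical, the genome length of 4,639,675 bp is an assumed value for E. coli K-12 MG1655, and the example assumes that the 1474 genes with regulatory information reported for RegulonDB 6.2 carry mark 1 while the remaining genes carry mark 2.

```python
from collections import Counter

def intensity_and_mark_fractions(marks, genome_length):
    """Return the intensity (points per length unit) and the fraction of each mark."""
    n = len(marks)
    intensity = n / genome_length
    fractions = {mark: count / n for mark, count in Counter(marks).items()}
    return intensity, fractions

# Example with the gene counts quoted in the text (mark assignment is an assumption)
n_total, n_regulatory = 4548, 1474
marks = [1] * n_regulatory + [2] * (n_total - n_regulatory)
assumed_genome_length = 4_639_675  # bp, assumed value, not taken from the text
lam, p = intensity_and_mark_fractions(marks, assumed_genome_length)
```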
The statistical analysis uses a series of summary characteristics, which have been successfully employed in spatial point process statistics. All these functions depend on a variable s, which is a distance. In all cases this is the shortest distance along the circular genome.
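The shortest distance along a circle can be computed as sketched below. The function name and the example coordinates are illustrative assumptions.

```python
def circular_distance(x, y, genome_length):
    """Shortest distance between positions x and y along a circle of the given length."""
    d = abs(x - y) % genome_length
    return min(d, genome_length - d)

# Two positions close to the coordinate origin are near each other on the circle,
# even though their linear coordinates differ by almost the full length.
example = circular_distance(10, 90, 100)
```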
All these functions can be called "correlation functions", but not all include only point pairs; therefore some of them are not second-order characteristics.
The best known function is the pair correlation function g(s), which is explained here as in , p. 219, since the explanation there is closer to the "two-point interpretation" used for explaining the other functions. (The explanation in Warren and ten Wolde is different but equivalent.)
Consider two points $x$ and $y$ on $C$ at distance $s$ and two infinitesimal length elements of lengths $dx$ and $dy$ centred at $x$ and $y$. Denote by $p(x, y)$ the probability that each of the two elements contains a point. This probability is given by the so-called product density $\varrho(x, y)$ as $p(x, y) = \varrho(x, y)\,dx\,dy$.
In the stationary case (which is assumed), $\varrho(x, y)$ depends only on the distance $s$ between $x$ and $y$, and the simpler symbol $\varrho(s)$ is used. The pair correlation function is then $g(s) = \varrho(s)/\lambda^2$.
The normalisation by $\lambda^2$ ensures that $g(s)$ approaches 1 for large $s$. Values of $g(s)$ larger than 1 for small $s$ indicate clustering, while values smaller than 1 indicate some tendency towards regularity or repulsion between the points. See the discussion of the information given by a pair correlation function in , pp. 219.
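The normalisation can be made concrete with a simple histogram estimator: count the point pairs whose circular distance falls into each bin and divide by the count expected for uniformly random points. This is a sketch under stated assumptions (function name, binning and synthetic positions are hypothetical), and it is not necessarily the estimator used for the figures in this study.

```python
import numpy as np

def pair_correlation(positions, genome_length, bin_width, max_distance):
    """Histogram estimate of g(s) for points on a circle."""
    if max_distance >= genome_length / 2:
        raise ValueError("max_distance must be smaller than half the genome length")
    x = np.asarray(positions, dtype=float)
    n = x.size
    i, j = np.triu_indices(n, k=1)        # all unordered pairs; O(n^2) memory
    d = np.abs(x[i] - x[j]) % genome_length
    d = np.minimum(d, genome_length - d)  # shortest distance along the circle
    edges = np.arange(0.0, max_distance + bin_width, bin_width)
    counts, _ = np.histogram(d, bins=edges)
    # For uniformly random points, an unordered pair falls into a bin of width
    # bin_width with probability 2 * bin_width / genome_length.
    expected = n * (n - 1) * bin_width / genome_length
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts / expected

# Synthetic check: uniformly random points should give g(s) close to 1
rng = np.random.default_rng(1)
pts = rng.uniform(0, 1_000_000, size=2000)
s, g = pair_correlation(pts, 1_000_000, bin_width=1000, max_distance=20_000)
```

For clustered points the estimate exceeds 1 at short distances, and for regularly spaced points it drops below 1, in line with the interpretation given above.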
The partial pair correlation functions $g_{ij}(s)$ are defined using refined product densities $\varrho_{ij}(s)$, where one of the points in the infinitesimal intervals is an i-point and the other a j-point, see , p. 325. These functions are normalized by $p_i p_j \lambda^2$, which leads to $g_{11}(s)$, $g_{12}(s)$ and $g_{22}(s)$. The $g_\alpha(s)$ in  are similar to $g_{11}(s)$.
Again, the normalisation leads to values around 1 for large $s$, and the general interpretation is similar to that of $g(s)$, see , pp. 325. For $i \neq j$, the relations between different sorts of points are characterized.
For example, values of $g_{ij}(s)$ smaller than 1 indicate some tendency of repulsion or inhibition between points of the different types $i$ and $j$.
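A histogram sketch of the partial pair correlation function, restricted to pairs with the requested marks and normalised by the pair count expected under independent uniform placement. The function name, the binning and the synthetic data are assumptions for illustration.

```python
import numpy as np

def partial_pair_correlation(positions, marks, mark_i, mark_j,
                             genome_length, bin_width, max_distance):
    """Histogram estimate of g_ij(s) for marked points on a circle."""
    if max_distance >= genome_length / 2:
        raise ValueError("max_distance must be smaller than half the genome length")
    x = np.asarray(positions, dtype=float)
    m = np.asarray(marks)
    xi, xj = x[m == mark_i], x[m == mark_j]
    d = np.abs(xi[:, None] - xj[None, :]) % genome_length
    d = np.minimum(d, genome_length - d)
    if mark_i == mark_j:
        d = d[np.triu_indices(xi.size, k=1)]  # unordered pairs within one class
        n_pairs = xi.size * (xi.size - 1) / 2
    else:
        d = d.ravel()                         # every i-point with every j-point
        n_pairs = xi.size * xj.size
    edges = np.arange(0.0, max_distance + bin_width, bin_width)
    counts, _ = np.histogram(d, bins=edges)
    # Each pair falls into a given bin with probability 2 * bin_width / genome_length
    expected = n_pairs * 2 * bin_width / genome_length
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts / expected

# Synthetic check with marks assigned independently of position
rng = np.random.default_rng(2)
pts = rng.uniform(0, 1_000_000, size=2000)
mk = np.array([1] * 600 + [2] * 1400)
s12, g12 = partial_pair_correlation(pts, mk, 1, 2, 1_000_000, 1000, 20_000)
s22, g22 = partial_pair_correlation(pts, mk, 2, 2, 1_000_000, 1000, 20_000)
```

With marks independent of position, both estimates fluctuate around 1; a value of $g_{12}(s)$ below 1 at short distances would correspond to the repulsion between classes discussed in the text.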
for i ≠ j. It is useful to consider the mark connection functions in addition to the partial pair correlation functions, since they characterize the occurrence of the point types with the influence of fluctuations in point density eliminated; see , p. 332.
Comparison of Figures 5 and 6 shows the power of this approach. The curves in Figure 5 are heavily dominated by the frequencies of point distances, which show for the genes a maximum between s = 2000 and s = 3000, while Figure 6 shows the true nature of the marking: the probability that two points at distance s have, for example, both mark 2 decreases monotonically with s.
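The probabilistic reading of the mark connection function suggests a direct estimate: among all point pairs at a given distance, take the fraction with each combination of marks. The sketch below assumes the convention that the mixed case counts unordered pairs, so that the three fractions sum to 1; this convention, the function name and the synthetic data are assumptions, not taken from the text.

```python
import numpy as np

def mark_connection(positions, marks, genome_length, bin_width, max_distance):
    """Fractions of pairs per distance bin with marks (1,1), (1,2) and (2,2)."""
    x = np.asarray(positions, dtype=float)
    m = np.asarray(marks)
    i, j = np.triu_indices(x.size, k=1)  # all unordered pairs
    d = np.abs(x[i] - x[j]) % genome_length
    d = np.minimum(d, genome_length - d)
    edges = np.arange(0.0, max_distance + bin_width, bin_width)
    total, _ = np.histogram(d, bins=edges)
    both1 = (m[i] == 1) & (m[j] == 1)
    both2 = (m[i] == 2) & (m[j] == 2)
    n11, _ = np.histogram(d[both1], bins=edges)
    n22, _ = np.histogram(d[both2], bins=edges)
    n12 = total - n11 - n22              # pairs with one point of each mark
    with np.errstate(divide="ignore", invalid="ignore"):
        p = {"11": n11 / total, "12": n12 / total, "22": n22 / total}
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, p

# Synthetic check: marks independent of position, 30% of points with mark 1
rng = np.random.default_rng(3)
pts = rng.uniform(0, 1_000_000, size=2000)
mk = np.array([1] * 600 + [2] * 1400)
s, p = mark_connection(pts, mk, 1_000_000, 1000, 20_000)
```

Because each bin is normalised by the number of pairs it contains, fluctuations in point density cancel, which is the property highlighted above. Bins without any pairs yield NaN.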
The connectivity correlation function $c(s)$ is defined as

$$c(s) = \frac{\varrho_{\mathrm{conn}}(s)}{\varrho_{11}(s)},$$

where $\varrho_{11}(s)$ is the product density of the 1-points and $\varrho_{\mathrm{conn}}(s)$ is a quantity which yields the probability that between the points $x$ and $y$ in the infinitesimal intervals above, if both are regulatory (both have mark 1), there is a direct regulatory relationship, i.e. one of them regulates the other or each regulates the other. It is similar to the connectivity function in , p. 249, and can be interpreted as the conditional probability that between two regulatory points at distance $s$ there is a direct regulatory relationship.
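Read as a conditional probability, the connectivity correlation function can be estimated as the fraction of regulatory pairs in each distance bin that are directly linked. The sketch below assumes that the regulatory links are given as index pairs into the list of regulatory genes; the function name, the input layout and the toy data are hypothetical.

```python
import numpy as np

def connectivity_correlation(positions, links, genome_length, bin_width, max_distance):
    """Fraction of regulatory pairs per distance bin with a direct regulatory link."""
    x = np.asarray(positions, dtype=float)
    n = x.size
    linked = np.zeros((n, n), dtype=bool)
    for a, b in links:                   # direction of regulation is ignored
        if a != b:                       # self-regulation is not a pair property
            linked[a, b] = linked[b, a] = True
    i, j = np.triu_indices(n, k=1)
    d = np.abs(x[i] - x[j]) % genome_length
    d = np.minimum(d, genome_length - d)
    edges = np.arange(0.0, max_distance + bin_width, bin_width)
    total, _ = np.histogram(d, bins=edges)
    connected, _ = np.histogram(d[linked[i, j]], bins=edges)
    with np.errstate(divide="ignore", invalid="ignore"):
        c = connected / total            # NaN where a bin contains no pairs
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, c

# Toy example: four regulatory genes on a circle of length 100, one link (0 -> 1)
s, c = connectivity_correlation([0, 10, 20, 60], [(0, 1)], 100, 15, 45)
```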
The control correlation function $k_3(s)$ is defined as

$$k_3(s) = \frac{\varrho_3(s)}{\varrho_{pp}(s)}.$$

It is defined for the sub-point process of all points that are regulated by other points ("passive" points, a subset of all 1-points); its product density is denoted by $\varrho_{pp}(s)$. Furthermore, $\varrho_3(s)$ is a quantity which yields the probability that for two passive points $x$ and $y$ in the infinitesimal intervals above there exists a third point which regulates both $x$ and $y$. Thus, $k_3(s)$ can be interpreted as the conditional probability that for two passive points at distance $s$ there is a third point which controls both of them.
We obtained the data from RegulonDB (version 6.2) , which is a database specifically dedicated to the transcriptional regulation of E. coli. A total of 4548 genes are included in this database, of which 1474 bear information about their transcriptional regulation and thus have been classified as class 1 genes.
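In the same spirit, the control correlation function can be estimated as the fraction of passive pairs in each distance bin that share at least one regulator. The sketch assumes that each passive gene comes with the set of its regulators; the function name, the input layout and the regulator labels ("tfA", "tfB", "tfC") are hypothetical.

```python
import numpy as np

def control_correlation(positions, regulators, genome_length, bin_width, max_distance):
    """Fraction of passive-gene pairs per distance bin sharing a common regulator."""
    x = np.asarray(positions, dtype=float)
    i, j = np.triu_indices(x.size, k=1)
    d = np.abs(x[i] - x[j]) % genome_length
    d = np.minimum(d, genome_length - d)
    # True where the two regulator sets of a pair intersect
    shared = np.array([bool(regulators[a] & regulators[b]) for a, b in zip(i, j)],
                      dtype=bool)
    edges = np.arange(0.0, max_distance + bin_width, bin_width)
    total, _ = np.histogram(d, bins=edges)
    co_regulated, _ = np.histogram(d[shared], bins=edges)
    with np.errstate(divide="ignore", invalid="ignore"):
        k3 = co_regulated / total        # NaN where a bin contains no pairs
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, k3

# Toy example: four passive genes with hypothetical regulator labels
regs = [{"tfA"}, {"tfA", "tfB"}, {"tfB"}, {"tfC"}]
s, k3 = control_correlation([0, 10, 20, 60], regs, 100, 15, 45)
```

Comparing this estimate with the same fraction computed for arbitrary gene pairs corresponds to the contrast drawn in Analysis II.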
MTH acknowledges support by Volkswagen Foundation. NS is supported by a Jacobs University scholarship. We are indebted to Georgi Muskhelishvili (Bremen, Germany) and Carsten Marr (Munich, Germany) for helpful comments on the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.