System size reduction in stochastic simulations of the facilitated diffusion mechanism
© Zabet; licensee BioMed Central Ltd. 2012
Received: 10 April 2012
Accepted: 3 September 2012
Published: 8 September 2012
Site-specific Transcription Factors (TFs) are proteins that bind to specific sites on the DNA and control the activity of a target gene by enhancing or decreasing the rate at which the gene is transcribed by RNA polymerase. The process by which TF molecules locate their target sites is a key component of transcriptional regulation. Therefore it is essential to gain insight into the mechanisms by which TFs search for the target sites.
Research in this area uses experimental and analytical approaches, but also stochastic simulations of the search process. Previous work based on stochastic simulations focussed only on short sequences, primarily for reasons of technical feasibility. Many of these studies had to disregard possible biases introduced by reducing a genome-wide system to a smaller subsystem. In particular, we identified crucial parameters that require adjustment and that were not adequately adapted in these previous studies.
We investigated several methods that adequately adapt the parameters of stochastic simulations of the facilitated diffusion when the full sequence space is reduced to smaller regions of interest. We found two methods that scale the system accordingly: the copy number model and the association rate model. We systematically compared the results produced by simulations of the subsystem with those of the original system. Our results confirmed that the copy number model is adequate only for high abundance TFs, while for low abundance TFs the association rate model is the only one that reproduces with high accuracy the results of the full system.
We propose a strategy to reduce the size of the system that adequately adapts important parameters to capture the behaviour of the full system. This enables correct simulations of a smaller sequence space (which can be as small as 100 Kbp) and, thus, provides independence from computationally intensive genome-wide simulations of the facilitated diffusion mechanism.
Transcription Factors need to locate their target sites on the DNA within a time frame that is shorter than can be achieved by random diffusion. The search process is further complicated by the fact that target sites are usually similar to a significant number of other sites (decoys), and by the fact that there are other molecules searching for their target sites simultaneously. To understand transcriptional regulation better, it is therefore essential to have a complete understanding of the mechanistic way in which this search process takes place.
In the last 40 years, both theoretical and experimental research have identified the search mechanism as a combination of three-dimensional diffusion and a one-dimensional random walk, which is often referred to as the facilitated diffusion mechanism [1–6]. Despite considerable progress, and mainly due to technical limitations, there is still a significant gap in our understanding of how TFs locate their target sites. One issue is the way in which the TF performs the one-dimensional random walk, in the sense that there is still no consensus on whether the TF molecules predominantly slide (do not lose contact with the DNA during the one-dimensional random walk) [9–11] or hop (perform small jumps on the DNA during the one-dimensional random walk) [7, 12]. Another example is the disagreement between the values for the proportion of time that TF molecules spend on the DNA: analytical computations of an optimal search process differ from values measured experimentally.
One way to address these questions is stochastic simulations of the facilitated diffusion mechanism [14–16]. In [17, 18] we proposed a computational model, GRiP, that allows genome-wide simulation of the facilitated diffusion mechanism. In particular, the CPU time required to simulate 1 s of an E.coli K-12 cell with lac repressor (lacI) TFs in GRiP lies between 1 h and 4 h on a 2×2.26 GHz quad-core Intel Xeon MacPro computer.
Despite the significant speed-up compared to previous tools, it is still not feasible to use the full genomic sequence as a search space. To address a scientific question with GRiP, multiple simulations need to be performed to allow a meaningful statistical analysis of the results. Thus, even small improvements in simulation speed can add up to significant time savings. Optimising the algorithm or the implementation can potentially increase the speed of the simulations, but this is limited by the level of detail in the simulated model. In addition, even in the case of significant algorithmic optimisations, simulating eukaryotic systems that have more than 100 Mbp and 10⁷ TFs becomes impractical.
One strategy to increase simulation speed consists of system size reduction, following the logic that the properties of the search process are the same irrespective of whether one simulates only a subset or the full genomic sequence. However, this requires a few simulation parameters to be adapted to the size of the subsystem (e.g. the number of TF molecules in the subsystem as compared to the full system). This change in system parameters is required in order to avoid biases in the results; e.g. TFs could locate the target sites faster, or target sites might be occupied for longer time intervals, if there is an inappropriate number of TFs. The main advantage of this approach is that smaller systems simulate faster due to the shorter DNA region and, consequently, the lower number of molecules bound to the DNA that perform the one-dimensional random walk.
Our results indicate that if the diffusion parameters are conserved and if the proportion of covered DNA is similar for the original system and the subsystem, then the subsystem captures the dynamic and steady state behaviour of the original system with negligible error.
In this contribution, we present two adaptation methods (the copy number method and the association rate method) that keep the simulation results of the subsystem consistent with those of the full system. We systematically investigate the degree to which the simulation results are affected when reducing the size of the system. The first method (copy number method) is simpler to implement, but is limited with respect to how much the system can be reduced and in terms of accuracy. This is caused by the fact that TF copy numbers are integers: values lower than 1 cannot be represented, while intermediate values need to be rounded to the nearest integer.
The second approach, the association rate method, is slightly more difficult to implement (it requires measuring a priori the proportion of time the molecules spend on the DNA), but overcomes the limitations of the previous method: it is more accurate and the size of the smallest subsystem is no longer limited by the TF copy number.
Overall, we show that the copy number method performs well in the case of high abundance TFs, while for low abundance TFs one needs to rely on the association rate method.
In this study we consider the lac repressor (lacI) TF, since this is one of the best described TFs with respect to the facilitated diffusion mechanism. Details regarding the lacI parameters used in this paper are presented in the Methods section. For the purpose of this study, we did not aim to provide a complete and exact description of the lac repressor system, but rather to describe under which conditions one can reduce the size of the system.
Next we present two models that are intended to keep the subsystems equivalent to the full system with respect to the facilitated diffusion mechanism.
crowding = TF_bound / M

where TF_bound represents the number of bound molecules and M the length of the DNA (in bp).
A subsystem with a DNA molecule of size λM (with λ ∈ [0,1]) is equivalent to the full system if the one-dimensional random walk behaviour in the two systems is the same and if the local crowding (the number of TF molecules bound to the DNA, normalised by the length of the DNA) remains the same in both the full system and the subsystem. We assume that the one-dimensional diffusion parameters are the same in all systems (which leads to similar one-dimensional diffusion measures, see Additional file 1: Supplementary Material), so we need to impose that the local crowding is also the same. One measure for crowding is the number of bound molecules per base pair, and this leads to the following condition on crowding
TF_bound^λ / (λM) = TF_bound / M, i.e. TF_bound^λ = λ · TF_bound

where TF_bound^λ is the number of bound molecules in the subsystem λ.
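As an illustration, the copy number condition can be sketched in a few lines of Python; the function name and the error handling are our own and not part of GRiP:

```python
def scaled_copy_number(tf_total, lam):
    """Copy number model (Model I) sketch: scale the TF copy number by
    the DNA size ratio lambda, rounding to the nearest integer.

    The rounding step is the source of the accuracy loss discussed in
    the text: for low abundance TFs, lam * tf_total can fall far from
    an integer, and values below one are not representable at all.
    """
    scaled = round(tf_total * lam)
    if scaled < 1:
        raise ValueError("subsystem too small for this TF abundance")
    return scaled

# e.g. 1000 molecules in a 100-fold smaller subsystem become 10 molecules,
# while 10 molecules in the same subsystem cannot be represented
```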
TF_bound ≈ k_assoc · t_R · TF_free · (M_free / M)

where t_R is the residence time of a molecule on the DNA, TF_free the number of free molecules and M_free/M the ratio of free DNA. Note that, in the Additional file 1: Supplementary Material, the accuracy of this estimate is systematically investigated in the case of DNA crowding. The results confirm that for a DNA occupancy of up to 50% (as in the case of E.coli) the equation displays negligible errors.
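Assuming a steady-state balance in which the binding flux k_assoc · TF_free · (M_free/M) equals the unbinding flux TF_bound/t_R, one can solve for the association rate that maintains a target number of bound molecules in a subsystem. The following sketch uses illustrative names and is not GRiP code:

```python
def association_rate(tf_bound_target, tf_total, t_r, free_dna_ratio):
    """Association rate model (Model II) sketch: solve the assumed
    steady-state balance
        k_assoc * TF_free * (M_free/M) = TF_bound / t_R
    for k_assoc, given the number of bound molecules to maintain.

    Because k_assoc is a real number, the scaling is not restricted
    to integer copy numbers, which is the advantage over Model I.
    """
    tf_free = tf_total - tf_bound_target
    if tf_free <= 0:
        raise ValueError("no free molecules left to associate")
    return tf_bound_target / (t_r * tf_free * free_dna_ratio)
```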
Copy number and association rate for subsystems when we use the DNA size ratio method
Note that the association rate for the full system (2400 s⁻¹) leads to a slightly different occupancy on the DNA than initially computed (f = 0.9). This is due to the fact that the value of 2400 s⁻¹ was computed under the assumption that the system consists of both cognate and non-cognate molecules, which cover 25% of the DNA. Since in this case we consider a significantly lower occupancy, the proportion of time the TFs spend on the DNA increases. This is important because the association rate model requires that the correct value for the proportion of time spent on the DNA (f) is provided.
The proportion of time spent on the DNA can be computed using the approach described in , but the accuracy can slightly suffer in the case of crowding on the DNA; see Additional file 1: Figure S4. To ensure a parameter estimation of high accuracy, we used a set of 20 simulations and measured the observed proportion of time spent on the DNA (f); see Additional file 1: Figure S3.
In addition, Table 1 shows that for low abundance proteins, it is not possible to apply Model I (copy number model) to reduce the size of the system beyond certain limits. For example, in the case of a low abundance protein with only 10 molecules per cell, the genome cannot be reduced further than 10 times (minimum size for 10 molecules is 460 Kbp).
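The limit discussed above follows directly from requiring at least one molecule in the subsystem; a minimal sketch (function name hypothetical):

```python
def min_subsystem_bp(genome_bp, tf_copies):
    """Smallest subsystem supported by the copy number model:
    lambda must satisfy lambda * tf_copies >= 1, so the subsystem
    cannot shrink below genome_bp / tf_copies base pairs."""
    return genome_bp / tf_copies

# E.coli K-12 (~4.6 Mbp) with 10 molecules -> 460 Kbp minimum
```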
Figure 3 shows that there is a strong correlation between the full system and the six subsystems in terms of occupancy biases. In particular, both models (copy number and association rate) seem to capture the same occupancy bias as the full system even for subsystems that are 100 times smaller, i.e., the boxplots of the correlations for all subsystems do not seem to deviate too much from that of the full system (leftmost). This holds for all types of TF abundances (low, medium or high). Nevertheless, because TF copy numbers are integers, applying a large system reduction to low abundance proteins can result in a lower accuracy of the method.
The correlation between the occupancy bias of the full system and all the subsystems indicates that the peaks in the occupancy bias data are captured by all subsystems for both models (copy number and association rate). However, to capture the complete perspective on the occupancy bias, we need to investigate whether the size of these peaks is conserved, i.e., we are interested in whether the same ratio between occupancy and affinity is found in the subsystems as in the full system. To do this, we use the ratio between normalized affinity and normalized occupancy for all sites that have a certain minimum affinity. This minimum affinity threshold removes low affinity sites from the data, where the noise in occupancy bias is high (and could lead to misinterpretation of the data).
For the sites with affinity above a certain value, we compute the ratio between the normalized affinity and the normalized occupancy. For low and medium abundance TFs we expect the ratio to be around one, but in the case of high abundance TFs, due to the high crowding, the ratio should be significantly lower than 1 (resulting in many false positives, in the sense that these sites are identified as highly occupied sites, with prospective high affinity, but the actual affinity is lower than predicted based on the occupancy) [18, 20].
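This ratio computation can be sketched as follows; normalising by the maximum value is an illustrative choice here, and the exact normalisation used in the analysis may differ:

```python
def affinity_occupancy_ratios(affinity, occupancy, min_affinity):
    """For sites with affinity above a threshold, return the ratio
    between normalised affinity and normalised occupancy.
    Ratios near 1 indicate occupancy tracking affinity; ratios well
    below 1 indicate false positives (occupancy exceeding affinity).
    """
    a_max = max(affinity)
    o_max = max(occupancy)
    return [
        (a / a_max) / (o / o_max)
        for a, o in zip(affinity, occupancy)
        if a >= min_affinity
    ]
```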
Figure 4 shows that in both models the size of the system can be reduced while displaying similar distributions of the ratio between affinity and occupancy. Interestingly, in a highly crowded environment, there seems to be a high peak centred around 0.5 (higher occupancy than affinity; false positives), but also a small number of values around 1 (similar degree of affinity to occupancy). However, we found that, for low/medium abundance TFs, the copy number model in conjunction with the smallest subsystem (46 Kbp) can lead to results that deviate significantly from those of the full system (the second panel from the top in Figure 4). Overall, we conclude that the association rate model always performs well and that, for high abundance proteins, both models show good results.
Next, we are interested in how the system size reduction influences the search process. Figure 5 shows that, for both models, reducing the size of the system does not significantly change the time required to locate the target site for all types of TF abundances. Again, the association rate model seems to slightly outperform the copy number model, due to the higher accuracy with which the scaling factor is incorporated into the model: in Model I the copy number can take only integer values, resulting in lower accuracy for low abundance TFs, while in Model II the association rate can take any positive real value, so the accuracy is limited only by the floating point representation of numbers in computers.
Usually, the activity of TF regulated genes is controlled by the presence or absence of TF molecules at certain target sites. Using our model, we measured the proportion of time the target site was occupied. For long time intervals, this proportion of time approximates the probability that a target site is occupied by a TF.
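The relationship between the measured proportion of time and the occupancy probability can be illustrated with a minimal two-state (bound/unbound) simulation with exponential waiting times; the rates below are hypothetical and are not lacI parameters:

```python
import random

def occupancy_fraction(k_on, k_off, t_total, seed=0):
    """Simulate a single site flipping between unbound and bound states
    with exponential waiting times, and return the proportion of time
    it was occupied. For long t_total this fraction approaches the
    stationary occupancy probability k_on / (k_on + k_off)."""
    rng = random.Random(seed)
    t, bound, t_bound = 0.0, False, 0.0
    while t < t_total:
        rate = k_off if bound else k_on
        dt = min(rng.expovariate(rate), t_total - t)  # truncate last step
        if bound:
            t_bound += dt
        t += dt
        bound = not bound
    return t_bound / t_total

# with k_on = k_off the site should be occupied about half the time
```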
Figure 6 compares the probability that the target site is occupied in the full system with that in the subsystems for both models. Both models seem to approximate the behaviour of the full system with negligible error. In particular, both the average value and the variability of the probabilities seem to be conserved in all subsystems.
Both models produce accurate results compared to the full system and lead to significant enhancement of the simulation speed. In particular, the errors in the approximate subsystems compared to the full system are negligible and are overshadowed by the speed enhancement produced by these methods.
When simulating the facilitated diffusion mechanism, one usually needs multiple long runs for the same set of parameters. This can take a significant amount of CPU time and can lead to undesirably long simulation times (greater than 2 months). One solution is to enhance the current algorithms, but this might lead to coarser grained models unable to capture enough details of the mechanism of facilitated diffusion. Alternatively, one could simulate a subsystem of the full system. Figure 7 shows that by decreasing the system size 100 times, the CPU time required to simulate 1 s in the cell can be reduced 100-fold.
To keep the full system and the subsystems equivalent, we developed two models: (i) the copy number model (Model I) and (ii) the association rate model (Model II). Model I is easier to construct, but has two main drawbacks. First, there is a limit on how much one can reduce the system, due to the fact that the TF copy number has to be at least 1. This is mainly an issue for low abundance proteins (e.g. for a TF with 10 molecules, the smallest system one could obtain is 460 Kbp). Secondly, because TF copy numbers are integers, the accuracy of the method might suffer. For example, when we reduce the size of the system from 4.6 Mbp to 46 Kbp for a TF with 100 molecules, the ratio between the affinity and the occupancy of the subsystem deviates significantly from that of the full system. Nevertheless, this approach is easy to apply and displays negligible errors for high abundance proteins.
The association rate model overcomes both drawbacks of the copy number model: it can reduce the system independently of the TF copy number and reproduces the results of the full system with high accuracy. However, this model requires measuring a priori the actual proportion of time the TF molecules spend on the DNA in the full system, which might be time consuming.
In the context of the GRiP software [17, 18], this indicates that, when reducing the system size for high abundance TFs (e.g. non-cognate TFs have ∼10⁵ molecules), one could use the copy number model, while, for low abundance proteins, the association rate model needs to be applied.
In conclusion, this paper offers a comprehensive description and analysis of the methods that need to be applied when performing non-genome-wide stochastic simulations of the facilitated diffusion mechanism. More specifically, we show that one does not have to perform genome-wide studies of the TF search process for target sites as long as the parameters of the subsystem (which considers only a small area around the region of interest) are correctly adjusted.
We consider the case of the lac repressor in E.coli K-12. The lac repressor tetramer has only three known high affinity sites: (i) AATTGTGAGCGGATAACAATT, (ii) AAATGTGAGCGAGTAACAACC and (iii) GGCAGTGAGCGCAACGCAATT. To construct the PWM of the lac repressor, we use these three sites, but we assume that there is a 9 bp gap in the middle of the motif, which leads to the following three sequences: (i) AATTGTNNNNNNNNNACAATT, (ii) AAATGTNNNNNNNNNACAACC and (iii) GGCAGTNNNNNNNNNGCAATT. This assumption is justified by the fact that the lac repressor tetramer consists of two dimers, each recognising only 6 bp.
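The gapped motif construction can be illustrated by tallying base counts per position, which is the first step towards a PWM. Treating the 9 bp gap as uninformative 'N' positions follows the assumption above; the function itself is only a sketch, not the construction used in GRiP:

```python
from collections import Counter

# The three gapped lacI binding sites from the text (6 bp + 9 bp gap + 6 bp)
SITES = [
    "AATTGTNNNNNNNNNACAATT",
    "AAATGTNNNNNNNNNACAACC",
    "GGCAGTNNNNNNNNNGCAATT",
]

def position_counts(sites):
    """Per-position base counts over a set of aligned sites; 'N' marks
    the 9 bp gap in the middle of the 21 bp motif and is uninformative."""
    length = len(sites[0])
    assert all(len(s) == length for s in sites), "sites must be aligned"
    return [Counter(s[i] for s in sites) for i in range(length)]

counts = position_counts(SITES)
# position 5 (0-based) is 'T' in all three sites; positions 6-14 are the gap
```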
The binding energies for the entire E.coli K-12 genome are normally distributed with mean 〈E_lacI〉/K_BT = 2.47 and standard deviation 2.16. In addition, the mean of the exponential binding energy is 〈exp(E_lacI)〉 = 1.04. Using the approach described in , we computed the specific waiting time . Furthermore, we assume that the average length of the binding motif in prokaryotes is 23 bp, the average DNA occupancy is 0.25 (within the range [0.1, 0.5]), the system has 50000 non-cognate TFs and an association rate of k_assoc = 2400 s⁻¹. The rest of the parameters describing this system are detailed in . The selected parameters resulted in an average proportion of time spent on the DNA of f ≈ 0.88, a residence time of t_R ≈ 4.5 ms and a sliding length of , which are in accordance with the values estimated in .
The lac repressor has three sites that control the activity of the lac operon, namely O_1, O_2 and O_3 (see Additional file 1: Supplementary Material). Our PWM correctly predicts that O_1 is the strongest site on the DNA and, in our analysis, we use this site when we measure the time required for a lacI molecule to reach a target site or the proportion of time the target site was occupied by lacI molecules.
The author would like to thank Boris Adryan and his group (in particular to Rob Foy) for useful discussions and comments on the manuscript. This work was supported by the Medical Research Council [G1002110].
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.