Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels
- Xiaolei Wang^{1, 2},
- Hiroyuki Kuwahara^{1, 2} and
- Xin Gao^{1, 2}Email author
https://doi.org/10.1186/1752-0509-8-S5-S5
© Wang et al.; licensee BioMed Central Ltd. 2014
Published: 12 December 2014
Abstract
Background
A quantitative understanding of interactions between transcription factors (TFs) and their DNA binding sites is key to the rational design of gene regulatory networks. Recent advances in high-throughput technologies have enabled high-resolution measurements of protein-DNA binding affinity. Importantly, such experiments revealed the complex nature of TF-DNA interactions, whereby the effects of nucleotide changes on the binding affinity were observed to be context dependent. A systematic method to give high-quality estimates of such complex affinity landscapes is, thus, essential to the control of gene expression and the advance of synthetic biology.
Results
Here, we propose a two-round prediction method that is based on support vector regression (SVR) with weighted degree (WD) kernels. In the first round, a WD kernel with shifts and mismatches is used with SVR to detect the importance of subsequences with different lengths at different positions. The subsequences identified as important in the first round are then fed into a second WD kernel to fit the experimentally measured affinities. To our knowledge, this is the first attempt to increase the accuracy of the affinity prediction by applying two rounds of string kernels and by identifying a small number of crucial k-mers. The proposed method was tested by predicting the binding affinity landscape of Gcn4p in Saccharomyces cerevisiae using datasets from HiTS-FLIP. Our method explicitly identified important subsequences and showed significant performance improvements when compared with other state-of-the-art methods. Based on the identified important subsequences, we discovered two surprisingly stable 10-mers and one sensitive 10-mer which were not reported before. Further test on four other TFs in S. cerevisiae demonstrated the generality of our method.
Conclusion
We proposed in this paper a two-round method to quantitatively model the DNA binding affinity landscape. Since the ability to modify genetic parts to fine-tune gene expression rates is crucial to the design of biological systems, such a tool may play an important role in the success of synthetic biology going forward.
Keywords
Introduction
A major goal of synthetic biology is to manipulate existing organisms so as to construct new biological systems that possess desired functions [1–3]. The ability to adjust the expression of genes precisely is then necessary if the behavior of a synthetic biological system is to be fine-tuned for a given functional specification. Since the initiation of transcription is one of the most important steps in gene regulation [4], a quantitative understanding of interactions between transcription factors (TFs) and their DNA binding sites is key to predicting the dynamics of gene circuits. However, the mechanistic characterization of intricate TF-DNA interactions from first principles of biochemistry still remains elusive. Consequently, the use of phenomenological models to characterize the affinity of TF-DNA interactions is essential for the rational design of synthetic gene circuits.
There are several high-throughput methods available to perform high-resolution measurements of protein-DNA interactions. These include protein-binding microarrays (PBMs), which characterize in vitro binding specificities of TFs for relatively short DNA sequences [5], and chromatin immunoprecipitation (ChIP)-based methods, which, in a cell-type specific fashion, can map genome-wide binding locations of TFs, provided that the relevant antibodies are available [6]. Mechanically induced trapping of molecular interactions (MITOMI) was developed which is capable of detecting low affinity TF-DNA binding using a microfluidic device [7]. They further developed a second-generation of MITOMI that was capable of measuring thousands of interactions in parallel [8]. Recently, high-throughput sequencing - fluorescent ligand interaction profiling (HiTS-FLIP) has been developed [9]. This is a method based on a second-generation DNA sequencing technology, which allows for hundreds of millions of in vitro measurements of TF-DNA binding affinities and provides a more comprehensive picture of the binding affinity landscapes of TFs. Further, HiTS-FLIP permits measurements of longer sequences of DNA, making possible analysis of complex binding affinity landscapes of dimeric and oligomeric TFs.
Various statistical and computational models have been developed to characterize binding affinity [10, 11]. The most commonly used one of these is the position weight matrix (PWM) [12, 13]. The basic PWM model aligns the DNA binding sequences and calculates the weights of different nucleotides at different positions within the alignment. There are different variants of PWM. All PWM models, however, assume that mononucleotides at different positions contribute independently. Although such models provided relatively accurate predictions for short binding motifs [14], alternative models have been developed to encode short-range information through building much larger matrices for subsequences (k-mers) [5, 15]. MDscan [16], for example, combined the word enumeration and position-specific weight matrix updating to iteratively approximate maximum a posteriori scoring function. Foat et al. proposed MatrixREDUCE [17], a statistical mechanics method that took high-throughput measurements of binding affinity as inputs and performed a least-squares fit to estimate the position specific affinity matrix that contained the relative energy contribution of each nucleotide at different positions. RankMotif++ [18] learned PWM motif models by maximum likelihood estimation of a probabilistic model for binding preferences.
Recent advances in high-throughput measurements of binding affinity and machine learning techniques have enabled the direct learning of the DNA-binding affinity landscape of TFs. A special case of the mismatch string kernel, di-mismatch kernel, has been proposed [19] that maps each binding sequence to a kernel space depending on similarity to all unique k-mers in the training data, for a fixed k and certain allowance of mismatches. Spectrum kernel has been applied to classify mammalian enhancers [20]. More recently, Annala et al. [21] proposed a linear model (HK→ ME) that assumes the binding affinity to be the sum of the contributions of certain subsequences of the binding sequences. Their method was the best performer in the Dialogue for Reverse Engineering Assessment and Methods 5 (DREAM5) transcription factor/DNA motif recognition competition.
Despite the significant advances made in computational methods for modeling DNA-binding affinity landscapes, there are still bottlenecks that continue to hamper the progress. The major one is that although most existing methods assume that k-mers make an important contribution to the binding affinity, there is no systematic overview provided of the importance of all k-mers with different lengths at different positions. For example, the PWM model assumes mononucleotides to be independent [13], whereas the di-mismatch kernel reveals which of the k-mers are important for a specific length, but cannot determine at which positions the k-mers are important or whether those with shorter length also contribute to the affinity. Similarly, the HK→ ME method used k-mers of specific lengths, i.e., all of the 4-6-mers as well as those 7- and 8-mers with the highest median intensity. This fact makes it difficult for the existing methods to well capture the important k-mers and their important positions.
In this paper, we propose a two-round support vector regression (SVR) method based on weighted degree (WD) kernels to overcome this bottleneck. In the first round, a WD kernel with shifts and mismatches is used with SVR to detect the importance of subsequences with different lengths at different positions. The identified subsequences are then fed into a second WD kernel to fit the experimentally measured affinities. Our method can systematically explore all the subsequences up to a certain length at all positions, and the results can easily be interpreted by users. We have applied this method to predict the binding affinity landscape of Gcn4p in Saccharomyces cerevisiae by using datasets from HiTS-FLIP. Through comparison with state-of-the-art predictors, we demonstrate that our method can provide significant improvements. We also demonstrate that our method can be straightforwardly used to visualize the importance of any k-mer at any position in binding sequences, thus to gain insights in the design of binding site sequences. Furthermore, we predict two high-affinity 10-mer motifs that are significantly more stable than the previously reported binding motifs. To evaluate the generalization power of our method, we further test it on the datasets from MITOMI2.0 [8] of four other TFs in S. cerevisiae, i.e., Cbf1p, Cin5p, Pho4p and Yap1p. Our method shows consistent improvements over state-of-the-art methods.
Materials and methods
Support vector regression
where ξ_{i} are ${\xi}_{i}^{*}$ are slack variables and C > 0 is the trade-off constant. If the relationship between x_{ i } and y_{ i } is non-linear, SVR can perform non-linear regression by kernel tricks which implicitly map x_{ i } to higher-dimensional feature spaces, i.e., f (x) = 〈w, Φ(x)〉 + b, where Φ(x) is a kernel mapping representation.
String kernels
String kernels are positive definite kernel functions defined on pairs of strings. The basic idea of string kernels is to map each string to a high-dimensional feature space and calculate the inner product of the two feature vectors. In other words, string kernels measure the similarity between pairs of strings. The more similar the two strings, a and b, the higher will be the value of the string kernel, K(a, b).
The two main types of string kernels are distribution-based kernels and k -mer-based kernels. Distribution-based kernels attempt to model uncertainties using random variables. Such kernels include the probability product kernel [23] and the spectral latent kernel [24]. Such kernels, however, require relatively long input sequences to capture the statistically meaningful distributions of subsequences.
In contrast to the distribution-based kernels, the k -mer-based string kernels essentially count all subsequences in the two sequences with lengths up to a pre-defined value and use these as features. A k-mer is a length-k subsequence in a sequence, a. There are several types of k-mer-based kernels each of which handles different assumptions. The spectrum kernel [25] maps each sequence into a feature space where each dimension counts the number of occurrences of a particular subsequence. The underlying assumption of the spectrum kernel is that the positions at which the subsequences occur are not important, rather the frequencies of their occurrences are the informative factor. Unlike the spectrum kernel, which is position independent, the weighted degree (WD) kernel [26] compares matches of subsequences at exact positions.
where β_{ k } are weights for different k-mers and $\mathbb{I}\left[\cdot \right]$ is an indicator function such that it is 1 when the condition inside the bracket is true and 0 otherwise. From Eq. 1, the computational complexity of calculating WD kernel between two sequences a and b is O(dL).
where β_{ k,m } are the weights for k-mers and m mismatches, γ_{ i } are the weights for different sequence positions, ${\omega}_{s}=\frac{1}{2\left(s+1\right)}$ are the weights assigned to shifts (in either direction) of extent s, and S(i) determines the shift range at position i. $\mathbb{I}$[a_{ k } (i + s) =_{ m } b_{ k } (i)] equals to 1 if and only if a_{ k } (i + s) and b_{ k } (i) differ by exactly m mismatches, and 0 otherwise. In this case, the computational complexity of calculating WD kernel between two sequences a and b is O(dLs). Thus, the total runtime to compute the kernel matrix for the entire training dataset is O(n^{2}dLs).
A two-round SVR with WD kernel method
In the first round, we map the input training sequences into a kernel space by the WD kernel with shift limit, s, and mismatch limit, m, according to Eq. 2. Here, all the k-mers up to length d are used. We then apply WD-kernel-based SVR to learn a model that maps DNA sequences and their K_{ d } values. The setting of d, s and m can be determined by the cross validation (CV) on the training set. From the learned SVR model, the top ten k-mers that contribute most to the high binding affinity (small K_{ d } values) and another ten that contribute the most to the low binding affinity (large K_{ d } values) for each k up to d are selected according to their expected decrease and increase of f (x), respectively, where f (·) is the learned regression model and x is an input sequence of length L. Here ten is the default parameter of our method and can be customized by users. Following [30], we quantify the importance of x_{ k } (i), which represents a k-mer starting from position i of x, as Q(x_{ k } (i)) = $\mathbb{E}$[f (x)|x_{ k } (i)] − $\mathbb{E}$ [f (x)]. In the second round, we learn another SVR model by encoding only the selected k-mers from Round 1 in the WD kernel. That is, when checking k-mers in the kernel function, we only count those matches that belong to the selected k-mers from the first round. In this round, since the important k-mers are supposed to be more conserved than other subsequences, we allow only shifts up to s but not mismatches, and, thus, follow Eq. 2. Again, the parameters d and s are set by cross-validation on the training set.
Once a test DNA binding sequence is given, it is mapped to a kernel space that is composed of the selected k-mers from Round 1. The learned SVR model in Round 2 is then applied to predict the K_{ d } value for this test binding sequence.
Results
Datasets
We first applied our method to predict the binding affinity landscape of Gcn4p in Saccharomyces cerevisiae by using datasets from HiTS-FLIP. Gcn4p is a master regulator that transcriptionally controls the expression of many genes including those for the amino acid biosynthesis pathway [31]. Gcn4p is a basic leucine zipper protein which interacts with DNA binding sites as a dimer [32], and is known to preferentially bind to several sequence motifs, including a 7-mer motif, TGACTCA [33, 34]. On the basis of its important role in yeast in general control of amino acids [35], a quantitative characterization of the binding of Gcn4p to its promoter sites is crucial not only for elucidation of the regulatory mechanisms involved in such stress-response pathways, but also to allow design of a synthetic GCN4-induced response pathway in yeast.
The HiTS-FLIP datasets contain K_{ d } values of 83,252 DNA sequences of length 12bp. In TF-DNA interactions, K_{ d } represents the concentration of the TF at which the DNA region is occupied 50% of the time at equilibrium. The K_{ d } values in the HiTS-FLIP datasets range from 8 nM to 1000 nM, where a small K_{ d } represents high binding affinity and a large K_{ d } represents low binding affinity. Since the adjustment of binding-affinity for optimal TF-recognition sites is often important in the fine-tuning of the behavior of gene circuits, we focused on modeling the DNA-binding affinity landscape of Gcn4p with strong interactions. Here, DNA sequences with K_{ d } less than 100 nM were defined as optimal Gcn4p recognition sites. This threshold was chosen by looking up in the HiTS-FLIP datasets the K_{ d } value of a specific 9-mer, to which Gn4p was reported to bind relatively poorly in vivo and in vitro [34]. This resulted in 1,393 DNA sequences. This 12-mer dataset was randomly partitioned into 10 subsets to perform 10-fold CV. All the results of our method and other methods in this paper are based on the same 10-fold CV.
To evaluate the generalization power of our method, we further tested it on four other TF data sets of S. cerevisiae, i.e., Cbf1p, Cin5p, Pho4p and Yap1p. These four TFs are the ones measured by MITOMI2.0 [8]. Each data set contains the relative binding affinities for nucleotide sequences with 52bp in length, in which shorter binding sites are included. After removing the sequences with "nan" (not a number) and taking average relative affinities for the same sequences, each data set contains 1,456 52bp sequences, with their corresponding relative binding affinities.
Performance measures
where y_{ i } and ${y}_{i}^{\prime}$ are the real and predicted K_{ d } values, z_{ i } and ${{z}^{\prime}}_{i}$ are the real and predicted rank of K_{ d } values, for the i-th binding sequence respectively, and n is the number of binding sequences in the training or test sets.
Results of 10-fold cross validation
Average prediction performance of the Rounds 1 and 2 of our method on test sets of the 10-fold CV.
Test Performance of Round 1: WD with s = 1 & m = 1 | ||||
---|---|---|---|---|
d | Runtime | RMSE | Pearson Cor | Spearman Cor |
2 | 572 | 20.06 | 0.74 | 0.46 |
3 | 1034 | 19.99 | 0.74 | 0.47 |
4 | 1448 | 19.87 | 0.75 | 0.48 |
5 | 1834 | 19.79 | 0.75 | 0.49 |
6 | 2221 | 19.77 | 0.75 | 0.49 |
7 | 2430 | 19.76 | 0.75 | 0.50 |
8 | 2908 | 19.75 | 0.75 | 0.50 |
9 | 3193 | 19.74 | 0.75 | 0.50 |
Test Performance of Round 2: WD with s = 0 & m = 0 | ||||
d | Runtime | RMSE | Pearson Cor | Spearman Cor |
2 | 47 | 18.82 | 0.78 | 0.55 |
3 | 90 | 18.09 | 0.80 | 0.59 |
4 | 128 | 17.65 | 0.81 | 0.62 |
5 | 166 | 17.34 | 0.82 | 0.65 |
6 | 200 | 17.09 | 0.83 | 0.66 |
7 | 235 | 16.89 | 0.84 | 0.68 |
8 | 268 | 16.89 | 0.84 | 0.65 |
9 | 302 | 16.87 | 0.84 | 0.65 |
Comparison of the relative performance of Rounds 1 and 2 reveals that Round 2 has significant improvements with respect to all the measures (Table 1). In particular, the RMSE, Pearson correlation and Spearman correlation of Round 2 are better than those of Round 1 by 15%, 12% and 36%, respectively. This is due to the fact that Round 1 encodes many irrelevant k-mers, whereas Round 2 encodes only the important k-mers identified in Round 1. The kernel mapping of Round 2 is thus far more accurate than that of Round 1. With d = 7, Round 1 encodes 21,844 k-mers to calculate the kernel function, whereas Round 2 encodes only 140 k-mers. This explains the significant improvement on the runtime of Round 2 over Round 1, although Round 2 requires inputs from Round 1.
Comparison with state-of-the-art methods
Our method was further compared with state-of-the-art methods on the same datasets. HK→ ME [21] was the best performer at the DREAM5 transcription factor/DNA motif recognition competition. According to Annala's method, HK→ ME was set to use all the 4-6-mers as well as 2000 7-mers and 1000 8-mers with the lowest median K_{ d } values. We also compared with PWM which assumed the mononucleotide contributed independently to the binding affinity. The model x of PWM is solved from the function A · × = K where A is an n × u matrix, where n is the number of training sequences, and u = L × z where L is the length of each training sequence and z is the size of the dictionary. For the Gcn4p dataset, L is 12 and z is 4. Here A[i, j] is set to 1 if the i-th sequence contains the specific nucleotide at the specific position indicated by the index j, otherwise 0. The x is a u-dimensional column vector to be trained, each entry of which represents the weight for the corresponding nucleotide at the corresponding position. For the Gcn4p dataset, K is set to be ln(K_{ d }) because ln(K_{ d }) is proportional to the binding free energy, which is assumed to be additive.
Comparison with state-of-the-art methods.
PWM | HK→ ME | SVR w. WD | Our Method | |
---|---|---|---|---|
RMSE | 20.2 | 25.4 | 22.5 | 16.9 |
RMSRE | 46% | 51% | 58% | 44% |
Pearson Cor | 0.56 | 0.49 | 0.70 | 0.84 |
Spearman Cor | 0.50 | 0.45 | 0.50 | 0.68 |
Discovery of a stable high-affinity 10-mer motif
Statistics of the ten 7-mers that were identified to be important for high-affinity 12-mers through Round 1.
Rank | 7-mer | Freq. | MIN | MAX | Average | Standard Deviation |
---|---|---|---|---|---|---|
1 | ATGACTC | 419 | 8.49 | 409.08 | 39.31 | 43.04 |
2 | TGACTCA | 990 | 8.49 | 567.81 | 56.66 | 54.61 |
3 | GTGACTC | 446 | 9.83 | 648.79 | 74.46 | 96.84 |
4 | TGAGTCA | 453 | 14.52 | 303.87 | 63.66 | 54.64 |
5 | TATGACT | 224 | 8.74 | 896.78 | 112.54 | 190.25 |
6 | GACTCAT | 392 | 8.49 | 963.28 | 167.26 | 254.46 |
7 | ATGAGTC | 504 | 15.60 | 975.18 | 276.01 | 292.93 |
8 | TGACTAA | 327 | 14.67 | 821.67 | 192.02 | 199.69 |
9 | TACTCAC | 847 | 9.65 | 975.05 | 437.92 | 336.43 |
10 | GACTAAT | 808 | 14.67 | 984.67 | 528.74 | 300.75 |
Multiple sequence alignment of the six robust 7-mers reveals a high-affinity 10- mer motif, TRTGACTCAT. Interestingly, we found that, in the 12-mer dataset, the sequences composed of this 10-mer motif were guaranteed to have high binding affinity with the K_{ d } value being less than 100 nM. In addition, we found that this 10-mer motif leads to even lower variability in the K_{ d } values than the six 7-mers (Figure 4). Indeed, the mean and the standard deviation of all 12-mers composed of the two concrete sequences from this motif show high binding affinity with a low dispersion rate; TATGACTCAT has mean 23.89 nM and standard deviation nM, while TGTGACTCAT has mean 26.88 nM and standard deviation 18.25 nM. These results indicate that the high-affinity 10-mer motif we found is even more dominant factor that alone can stabilize K_{ d } at a low level. [9] confirmed that a palindromic 9-mer motif, ATGACTCAT, has a higher binding affinity than the consensus 7-mer motif, TGACTCA, as observed in previous studies [33, 36]. While this 9-mer motif can indeed result in high binding affinity, it can also lead to much lower binding affinity than our 10-mer motif. For example, a 12-mer sequence, CATGACTCATAG, is observed to have K_{ d } value of 265.8 nM in the HiTS-FLIP dataset. Since the last 8 bases of the two motifs are the same (i.e., 5'-TGACTCAT- 3'), these observations indicate that our motif can further stabilize the K_{ d } values at a high affinity level by including an additional nucleotide in the left half-site. This is consistent with a previous experimental study in that, while Gcn4p binds to DNA sites as a dimer, the left half-site plays more important role than the right half-site in strong Gcn4p-DNA interactions [34].
In addition to the high-affinity 10-mer motif, we found a 10-mer sequence, CAT-GACTAAT, by performing multiple sequence alignment of 4 other 7-mers identified through Round 1 (i.e., ATGAGTC, TGACTAA, GACTCAC, and GACTAAT). Unlike the 10-mer motif that we found, however, this 10-mer is more context- dependent. While 12-mers composed of this 10-mer also gave relatively high mean binding affinity (K_{ d } = 97.88 nM), the range of the binding affinities increased significantly to include the K_{ d } values between 20.61 nM and 432.00 nM and to have the standard deviation of 90.65 nM (Figure 4). This shows that two nucleotide substitutions from the low-variance, high-affinity 10-mer motif can substantially alter the characteristics of the 10-mer. These observations indicate that the DNA binding affinity landscape of Gcn4p is very complex and that a strong interdependency is prevalent. This suggests that models based on additive, independent characteristics of binding free energy may not be able to quantitatively capture interactions of DNA and dimeric--and more generally oligomeric--TFs and that efficient models that consider interdependency of subsequences are key to understanding the DNA binding affinity landscape of such proteins and to fine-tuning of gene expression processes.
Results on four other TFs in S. cerevisiae
Comparison with state-of-the-art methods on four other TFs in S. cerevisiae
PWM | HK→ ME | SVR w. WD | Our Method | |
---|---|---|---|---|
Cbf1p | 0.25 | 0.30 | 0.36 | 0.63 |
Cin5p | 0.21 | 0.26 | 0.47 | 0.62 |
Pho4p | 0.19 | 0.24 | 0.41 | 0.61 |
Yap1p | 0.22 | 0.24 | 0.40 | 0.58 |
Conclusion
In this paper, we have proposed a novel two-round support vector regression method that is based on weighted degree kernels with shifts and mismatches, with the first round focusing on feature selection and the second round focusing on regression. The WD kernels have been used with support vector classification method and successfully applied to a number of biological sequence classification problems, including transcription start site prediction [37], splice site prediction [38], alternative splicing site prediction [28], trans-splicing site prediction [39], and translation initiation site prediction [40]. However, the power of combining the WD kernels with the support vector regression has not been well studied in bioinformatics. Further, to the best of our knowledge, two rounds of string kernels have not been applied to identify crucial k-mers and to avoid projecting the input sequences to overly high-dimensional kernel space.
We applied the proposed two-round method to model the mapping of DNA sequences to their binding affinity for the Gcn4p binding in yeast using high-resolution datasets measured by HiTS-FLIP. We showed that the quantitative prediction from our new method is significantly improved over existing methods. We further demonstrated that the identification of important subsequences would allow extraction of human-interpretable rules for the purpose of quantitative control of binding affinity. Two 10-mers were predicted by our method that were surprisingly stable but were not previously reported. Another 10-mer that just has two nucleotide changes from one of the stable ones was predicted that was comparatively sensitive. Additional tests on four other TFs validate the generalization power of the proposed method. Our program and sample data are freely available at http://sfb.kaust.edu.sa/Pages/Software.aspx.
Declarations
Acknowledgements
We thank Polly Fordyce for valuable discussions about the MITOMI2.0 datasets. This work and publication costs were supported by the grant number, FCC/1/1976-04-01, made by King Abdullah University of Science and Technology (KAUST).
This article has been published as part of BMC systems Biology Volume 8 Supplement 5, 2014: Proceedings of the 25th International Conference on Genome Informatics (GIW/ISCB-Asia): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S5.
Authors’ Affiliations
References
- Endy D: Foundations for engineering biology. Nature. 2005, 438 (7067): 449-453. 10.1038/nature04342.View ArticlePubMedGoogle Scholar
- Purnick PEM, Weiss R: The second wave of synthetic biology: from modules to systems. Nat Rev Mol Cell Biol. 2009, 10 (6): 410-422. 10.1038/nrm2698.View ArticlePubMedGoogle Scholar
- Kuwahara H, Fan M, Wang S, Gao X: A framework for scalable parameter estimation of gene circuit models using structural information. Bioinformatics. 2013, 29 (13): 98-107. 10.1093/bioinformatics/btt232.View ArticleGoogle Scholar
- Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P: Molecular Biology of the Cell. Garland Science. 2002, New York, 4thGoogle Scholar
- Berger MF, Bulyk ML: Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat Protoc. 2009, 4 (3): 393-411. 10.1038/nprot.2008.195.PubMed CentralView ArticlePubMedGoogle Scholar
- Gottardo R: Modeling and analysis of ChIP-chip experiments. Methods Mol Biol. 2009, 567: 133-143. 10.1007/978-1-60327-414-2_9.View ArticlePubMedGoogle Scholar
- Maerkl SJ, Quake SR: A systems approach to measuring the binding energy landscapes of transcription factors. Science. 2007, 315 (5809): 233-237. 10.1126/science.1131007.View ArticlePubMedGoogle Scholar
- Fordyce PM, Gerber D, Tran D, Zheng J, Li H, DeRisi JL, Quake SR: De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis. Nat Biotechnol. 2010, 28 (9): 970-975. 10.1038/nbt.1675.PubMed CentralView ArticlePubMedGoogle Scholar
- Nutiu R, Friedman RC, Luo S, Khrebtukova I, Silva D, Li R, Zhang L, Schroth GP, Burge CB: Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat Biotechnol. 2011, 29 (7): 659-664. 10.1038/nbt.1882.PubMed CentralView ArticlePubMedGoogle Scholar
- Alleyne TM, Peña-Castillo L, Badis G, Talukder S, Berger MF, Gehrke AR, Philippakis AA, Bulyk ML, Morris QD, Hughes TR: Predicting the binding preference of transcription factors to individual DNA k-mers. Bioinformatics. 2009, 25 (8): 1012-1018. 10.1093/bioinformatics/btn645.PubMed CentralView ArticlePubMedGoogle Scholar
- Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, Saez-Rodriguez J, Cokelaer T, Vedenko A, Talukder S, DREAMC , Bussemaker HJ, Morris QD, Bulyk ML, Stolovitzky G, Hughes TR: Evaluation of methods for modeling transcription factor sequence specificity. Nat Biotechnol. 2013, 31 (2): 126-134. 10.1038/nbt.2486.PubMed CentralView ArticlePubMedGoogle Scholar
- Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins. statistical-mechanical theory and application to operators and promoters. J Mol Biol. 1987, 193 (4): 723-750. 10.1016/0022-2836(87)90354-8.View ArticlePubMedGoogle Scholar
- Stormo GD: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16 (1): 16-23. 10.1093/bioinformatics/16.1.16.View ArticlePubMedGoogle Scholar
- Benos PV, Bulyk ML, Stormo GD: Additivity in protein-DNA interactions: how good an approximation is it?. Nucleic Acids Res. 2002, 30 (20): 4442-4451. 10.1093/nar/gkf578.PubMed CentralView ArticlePubMedGoogle Scholar
- Bulyk ML, Johnson PLF, Church GM: Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 2002, 30 (5): 1255-1261. 10.1093/nar/30.5.1255.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu XS, Brutlag DL, Liu JS: An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol. 2002, 20 (8): 835-839. 10.1038/nbt717.View ArticlePubMedGoogle Scholar
- Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide transcription factor occupancy data by matrixreduce. Bioinformatics. 2006, 22 (14): 141-149. 10.1093/bioinformatics/btl223.View ArticleGoogle Scholar
- Chen X, Hughes TR, Morris Q: RankMotif++: a motif-search algorithm that accounts for relative ranks of K -mers in binding transcription factors. Bioinformatics. 2007, 23 (13): 72-79. 10.1093/bioinformatics/btm224.View ArticleGoogle Scholar
- Agius P, Arvey A, Chang W, Noble WS, Leslie C: High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions. PLoS Comput Biol. 2010, 6 (9):Google Scholar
- Lee D, Karchin R, Beer MA: Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011, 21 (12): 2167-2180. 10.1101/gr.121905.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Annala M, Laurila K, Lähdesmäki H, Nykter M: A linear model for transcription factor binding affinity prediction in protein binding microarrays. PLoS One. 2011, 6 (5): 20059-10.1371/journal.pone.0020059.View ArticleGoogle Scholar
- Vapnik V, Chervonenkis A: Theory of Pattern Recognition. Nauka. 1974, MoscowGoogle Scholar
- Jebara T, Kondor R, Howard A: Probability product kernels. The Journal of Machine Learning Research. 2004, 5: 819-844.Google Scholar
- Xie B, Jankovic BR, Bajic VB, Song L, Gao X: Poly(A) motif prediction using spectral latent features from human DNA sequences. Bioinformatics. 2013, 29 (13): 316-325. 10.1093/bioinformatics/btt218.View ArticleGoogle Scholar
- Leslie C, Eskin E, Noble WS: The spectrum kernel: a string kernel for SVM protein classification. Proceedings of Pacific Symposium on Biocomputing (PSB2002). 2002, 546-575.Google Scholar
- Rätsch G, Sonnenburg S: Accurate splice site detection for C. elegans. Kernel Methods in Computional Biology. 2004, 277-298.Google Scholar
- Leslie CS, Eskin E, Cohen A, Weston J, Noble WS: Mismatch string kernels for discriminative protein classification. Bioinformatics. 2004, 20 (4): 467-476. 10.1093/bioinformatics/btg431.View ArticlePubMedGoogle Scholar
- Rätsch G, Sonnenburg S, Schälkopf B: RASE: recognition of alternatively spliced exons in C.elegans. Bioinformatics. 2005, 21 (Suppl 1): 369-377. 10.1093/bioinformatics/bti1053.View ArticleGoogle Scholar
- Mohapatra A, Mishra PM, Padhy S: Discriminative DNA classification and motif prediction using weighted degree string kernels with shift and mismatch. Proceedings of ICAC3'09. 2009, ACM, New York, NY, USA, 56-61.View ArticleGoogle Scholar
- Sonnenburg S, Zien A, Philips P, Rätsch G: POIMs: positional oligomer importance matrices--understanding support vector machine-based signal detectors. Bioinformatics. 2008, 24 (13): 6-14. 10.1093/bioinformatics/btn170.View ArticleGoogle Scholar
- Natarajan K, Meyer MR, Jackson BM, Slade D, Roberts C, Hinnebusch AG, Marton MJ: Transcriptional profiling shows that GCN4P is a master regulator of gene expression during amino acid starvation in yeast. Mol Cell Biol. 2001, 21 (13): 4347-4368. 10.1128/MCB.21.13.4347-4368.2001.PubMed CentralView ArticlePubMedGoogle Scholar
- Hope IA, Struhl K: GCN4, a eukaryotic transcriptional activator protein, binds as a dimer to target DNA. EMBO J. 1987, 6 (9): 2781-2784.PubMed CentralPubMedGoogle Scholar
- Hill DE, Hope IA, Macke JP, Struhl K: Saturation mutagenesis of the yeast his3 regulatory site: requirements for transcriptional induction and for binding by GCN4 activator protein. Science. 1986, 234 (4775): 451-457. 10.1126/science.3532321.View ArticlePubMedGoogle Scholar
- Sellers JW, Vincent AC, Struhl K: Mutations that define the optimal half-site for binding yeast GCN4 activator protein and identify an ATF/CREB-like repressor that recognizes similar DNA sites. Mol Cell Biol. 1990, 10 (10): 5077-5086.PubMed CentralView ArticlePubMedGoogle Scholar
- Hinnebusch AG: Translational regulation of GCN4 and the general amino acid control of yeast. Annu Rev Microbiol. 2005, 59: 407-450. 10.1146/annurev.micro.59.031805.133833.View ArticlePubMedGoogle Scholar
- Zhu C, Byers KJRP, McCord RP, Shi Z, Berger MF, Newburger DE, Saulrieta K, Smith Z, Shah MV, Radhakrishnan M, Philippakis AA, Hu Y, De Masi F, Pacek M, Rolfs A, Murthy T, Labaer J, Bulyk ML: High-resolution DNA-binding specificity analysis of yeast transcription factors. Genome Res. 2009, 19 (4): 556-566. 10.1101/gr.090233.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Sonnenburg S, Zien A, Rätsch G: Arts: accurate recognition of transcription starts in human. Bioinformatics. 2006, 22 (14): 472-480. 10.1093/bioinformatics/btl250.View ArticleGoogle Scholar
- Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G: Accurate splice site prediction using support vector machines. BMC Bioinformatics. 2007, 8 (Suppl 10): 7-10.1186/1471-2105-8-S10-S7.View ArticleGoogle Scholar
- Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A, Krüger N, Sonnenburg S, Rätsch G: MGENE: accurate SVM-based gene finding with an application to nematode genomes. Genome Res. 2009, 19 (11): 2133-2143. 10.1101/gr.090597.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Saeys Y, Abeel T, Degroeve S, Van de Peer Y: Translation initiation site prediction on a genomic scale: beauty in simplicity. Bioinformatics. 2007, 23 (13): 418-423. 10.1093/bioinformatics/btm177.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.