MetaDBSite: a meta approach to improve protein DNA-binding sites prediction
- Jingna Si†1, 3,
- Zengming Zhang†1,
- Biaoyang Lin1,
- Michael Schroeder2 and
- Bingding Huang1, 2
© Si et al; licensee BioMed Central Ltd. 2011
Published: 20 June 2011
Protein-DNA interactions play an important role in many fundamental biological activities such as DNA replication, transcription and repair. Identifying the amino acid residues involved in DNA binding is critical for understanding the mechanisms of gene regulation. Over the last decade, a number of computational approaches have been developed to predict protein-DNA binding sites from protein sequence and/or structural information.
In this article, we present metaDBSite, a meta web server that predicts DNA-binding residues in DNA-binding proteins. MetaDBSite integrates the prediction results of six available online web servers (DISIS, DNABindR, BindN, BindN-rf, DP-Bind and DBS-PRED) and uses only the sequence information of proteins. A large dataset of DNA-binding proteins was constructed from the Protein Data Bank and serves as a gold-standard benchmark for evaluating metaDBSite and the six individual predictors.
The comparison shows that metaDBSite outperforms each individual approach. We believe that metaDBSite will become a useful, integrative tool for predicting DNA-binding residues in proteins. The metaDBSite web server is freely available at http://projects.biotec.tu-dresden.de/metadbsite/ and http://sysbio.zju.edu.cn/metadbsite.
Protein-DNA complexes perform essential functions in many cellular activities. For example, transcription factors bind to specific DNA sequences in promoters to activate gene expression. Protein-DNA interactions also play important roles in many other biological processes, including DNA replication, DNA repair, viral infection, DNA packing and DNA modification. However, the biophysical mechanism of protein-DNA interactions is not yet clear, and identifying protein-DNA interactions by experimental methods remains difficult.
Although more than 60,000 experimentally determined structures were deposited in the Protein Data Bank as of June 2010, only several hundred of them are structures of protein-DNA complexes, far fewer than the number of protein-DNA complexes that exist in nature. With recent advances in DNA sequencing, such as next-generation sequencing technology, the genome sequences of many organisms have been completed in recent years, producing a huge number of protein sequences, many of which belong to DNA-binding proteins. Predicting the DNA-binding properties of these proteins will be very useful for understanding their biological functions.
Table 1. Summary of the detailed characteristics of the six available web servers for DNA-binding site prediction.
Machine learning methods: Support Vector Machine (SVM); Naïve Bayes classifier; Kernel logistic regression; Penalized logistic regression
Properties used in training: Predicted secondary structure; Predicted solvent accessibility; Relative solvent accessibility; The side chain pKa value; Position-specific scoring matrix (PSSM); Protein sequence information
However, several limitations impair the application of the above servers: each method was built on its own dataset, uses its own definition of binding sites, derives different parameters from the sequence, applies a different machine learning method, achieves a different accuracy and sensitivity, and runs at a very different speed. Therefore, a better and more consistent prediction server is needed. To meet this goal, we developed metaDBSite, a meta web server that predicts protein DNA-binding sites based solely on the amino acid sequences of proteins. MetaDBSite combines the six available online web servers listed in Table 1 and uses a support vector machine (SVM) to learn from their outputs. We constructed a large dataset, PDNA-316, from the PDB and compared the performance of metaDBSite with that of the six servers. On this benchmark dataset, metaDBSite showed a higher sensitivity in identifying DNA-binding sites. We believe that metaDBSite will become a useful tool for researchers who need to predict protein-DNA binding residues.
Results and discussion
Performance on PDNA-316 benchmark dataset
The prediction results of metaDBSite (10-fold SVM cross-validation) and the other six methods alone for the PDNA-316 benchmark dataset.
Comparison of various definitions of DNA-binding sites
Using structural information to eliminate false positives
Note that the six methods and metaDBSite all use protein sequence information only. Because the PDNA-316 dataset is derived from the PDB, the structures of all its proteins are known, and this structural information could be used to improve the predictions of metaDBSite. To test this, we spatially clustered the residues predicted by metaDBSite using the coordinates of their CA atoms, in an attempt to eliminate false positive (FP) predictions. After clustering, clusters containing only a small number of residues were treated as false positives and removed. We tried several different parameter settings for this clustering procedure. In all cases, accuracy and specificity increased but sensitivity decreased (data not shown). In a protein-DNA complex structure, because of the helical shape of the DNA molecule, the real DNA-binding residues defined by a distance cut-off do not necessarily cluster together in space. Some real DNA-binding sites consist of three or fewer residues and may lie at an isolated position on the protein surface. Therefore, when small clusters are eliminated, some true positives (TPs) are removed together with the FPs. This explains why the clustering post-process increases specificity and decreases sensitivity.
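The clustering post-process described above can be sketched as follows. This is a minimal illustration, not the implementation used for metaDBSite: the single-linkage rule, the 10 Å linking distance and the minimum cluster size of 3 are all assumptions chosen for the example, since the article does not report the parameters that were tried. The function names are hypothetical.

```python
import math
from collections import deque


def cluster_residues(ca_coords, cutoff=10.0):
    """Group predicted residues by single linkage on their CA atoms.

    ca_coords maps a residue id to its CA coordinate (x, y, z) in Å.
    Two residues are linked when their CA atoms lie within `cutoff` Å
    (an assumed value, not one taken from the article).
    """
    unvisited = set(ca_coords)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster = [seed]
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            # Collect neighbours first, then remove them from the pool.
            near = [r for r in unvisited
                    if math.dist(ca_coords[current], ca_coords[r]) <= cutoff]
            for r in near:
                unvisited.remove(r)
                cluster.append(r)
                queue.append(r)
        clusters.append(sorted(cluster))
    return clusters


def filter_small_clusters(ca_coords, cutoff=10.0, min_size=3):
    """Keep only residues that belong to clusters of at least `min_size`."""
    kept = set()
    for cluster in cluster_residues(ca_coords, cutoff):
        if len(cluster) >= min_size:
            kept.update(cluster)
    return kept


# Hypothetical predicted residues: three neighbours and one isolated residue.
predicted = {
    10: (0.0, 0.0, 0.0),
    11: (3.8, 0.0, 0.0),
    12: (7.6, 0.0, 0.0),
    50: (40.0, 0.0, 0.0),
}
print(sorted(filter_small_clusters(predicted)))
```

In this toy input, residue 50 forms a cluster of its own and is discarded. If it were a genuine but isolated binding residue, it would be lost as a true positive, which is the sensitivity cost discussed above.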
Predicting DNA-binding residues from protein sequence is of great importance for understanding the mechanism of protein-DNA interactions. Considerable research effort has gone into discriminating DNA-binding residues from non-DNA-binding ones. Various machine learning methods have been applied, using different kinds of features based on protein sequence and/or structural information. However, it is hard to compare the existing prediction methods directly, because they use different datasets, definitions and evaluation criteria. Here, based on the prediction results of six available predictors, we have developed metaDBSite, a meta server for predicting DNA-binding residues from protein sequence. We evaluated metaDBSite and the six other predictors on a large dataset using the same definition and criteria, and showed that metaDBSite achieves a better balance of sensitivity and specificity.
Materials and methods
To evaluate these prediction methods, we derived a large dataset of protein-DNA complexes from the current PDB. We downloaded 865 protein-DNA complexes with a resolution better than 3.0 Å from the PDB and submitted their sequences to the program H-CD-HIT to obtain a non-redundant dataset. The 865 proteins were first clustered at a high identity (90%), and the resulting non-redundant sequences were then clustered at a lower identity (60%). A third round of clustering was performed at a still lower identity (30%). The default clustering parameters of H-CD-HIT were used. After clustering, 316 protein-DNA complexes remained; this set is called the PDNA-316 dataset. The dataset is listed in the supplemental data on the metaDBSite web server and can be downloaded freely.
Defining real DNA-binding sites
Previous studies on protein-DNA binding site prediction [8, 13–15] have used various definitions of DNA-binding sites. In a protein-DNA complex, an amino acid residue of the protein is defined as a DNA-binding site if the distance between any atom of the residue and any atom of the DNA molecule is less than a distance threshold. This threshold ranged from 3.5 Å to 6.0 Å in the previous studies. All other residues are regarded as non-DNA-binding sites. We also tried defining binding sites by solvent accessible surface area (ASA). We calculated the surface area of each protein residue in the absence and in the presence of the DNA molecule. Residues whose solvent accessible surface area changed by at least 1 Å² upon DNA binding were considered DNA-binding residues; all other residues were regarded as non-DNA-binding residues. In the final metaDBSite approach, a distance threshold of 3.5 Å was chosen to define the real DNA-binding sites.
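Both definitions can be expressed in a few lines. The sketch below assumes that atom coordinates and per-residue ASA values have already been extracted from a complex structure; the input layout (dictionaries keyed by a residue id) and the function names are hypothetical, and the ASA values would have to come from an external surface-area program that is not shown here.

```python
import math


def binding_by_distance(protein_atoms, dna_atoms, cutoff=3.5):
    """Residues with any atom closer than `cutoff` Å to any DNA atom.

    protein_atoms maps a residue id to a list of (x, y, z) atom coordinates;
    dna_atoms is a list of (x, y, z) coordinates of all DNA atoms.
    """
    binding = set()
    for res_id, atoms in protein_atoms.items():
        if any(math.dist(a, d) < cutoff for a in atoms for d in dna_atoms):
            binding.add(res_id)
    return binding


def binding_by_asa(asa_free, asa_bound, min_change=1.0):
    """Residues whose ASA changes by at least `min_change` Å² upon binding.

    asa_free and asa_bound map residue ids to the ASA computed without
    and with the DNA molecule, respectively.
    """
    return {res_id for res_id in asa_free
            if abs(asa_free[res_id] - asa_bound[res_id]) >= min_change}


# Toy example with made-up coordinates and ASA values.
protein = {"A:12": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "A:13": [(10.0, 0.0, 0.0)]}
dna = [(3.0, 0.0, 0.0)]
print(binding_by_distance(protein, dna))              # 3.5 Å threshold
print(binding_by_distance(protein, dna, cutoff=8.0))  # looser threshold
print(binding_by_asa({"A:12": 50.0, "A:13": 30.0}, {"A:12": 35.5, "A:13": 29.8}))
```

Raising the cut-off from 3.5 Å toward the upper end of the range used in earlier studies enlarges the set of residues labelled as binding, which is one reason results obtained under different definitions are hard to compare.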
In the formulas above, TP denotes true positives (residues predicted to be DNA-binding that are in fact DNA-binding); TN denotes true negatives (residues predicted to be non-DNA-binding that are in fact non-DNA-binding); FP denotes false positives (residues predicted to be DNA-binding that are in fact non-DNA-binding); and FN denotes false negatives (residues predicted to be non-DNA-binding that are in fact DNA-binding). These definitions and measures are comparable to those used in previous studies.
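For reference, the measures named in this article (sensitivity, specificity and accuracy) are conventionally computed from these four counts as shown below. The sketch uses the standard textbook definitions, which are assumed here to match the measures used in the evaluation; the function name and the example counts are hypothetical.

```python
def performance(tp, tn, fp, fn):
    """Standard two-class measures computed from the confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),   # fraction of real binding residues found
        "specificity": ratio(tn, tn + fp),   # fraction of non-binding residues rejected
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Made-up counts for illustration only.
print(performance(tp=40, tn=900, fp=100, fn=10))
```

Because non-binding residues greatly outnumber binding residues, accuracy alone can look high even when few binding residues are found, so sensitivity and specificity are reported alongside it.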
In this work, the six predictors were combined into a prediction system called metaDBSite with the help of a Support Vector Machine (SVM). As a machine learning method for two-class classification, an SVM aims to find a rule that assigns each member of a training set to its correct class. Here, the SVM was trained to distinguish DNA-binding residues from non-binding residues. DNA-binding amino acids were treated as positive samples and non-DNA-binding amino acids as negative samples. A residue was defined as a binding site if the distance between any atom of the residue and any atom of the DNA molecule was less than 3.5 Å. With this definition, the PDNA-316 dataset contains 5342 positive samples and 67396 negative samples.
The detailed procedure of metaDBSite is illustrated in Figure 1. The given sequence is submitted to the six web servers and the prediction results are retrieved. Four of the six predictors (DISIS, DNABindR, BindN and BindN-rf) return predictions based on their own scoring functions, in which residues with a score above a certain threshold are considered DNA-binding residues. These scores provide four input parameters for the SVM. The other two predictors, DP-Bind and DBS-PRED, only indicate whether or not each residue is predicted to bind DNA. For these two methods, we therefore simply assign a score of “+1” to binding sites and “0” to non-binding sites. In total, six parameters are used in the SVM training.
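The construction of the six input parameters per residue can be sketched as follows. The sketch assumes the server outputs have already been retrieved and parsed into per-residue lists; the dictionary layout and the function name are hypothetical, and no rescaling of the raw scores is shown because the article does not describe any.

```python
SCORING_SERVERS = ("DISIS", "DNABindR", "BindN", "BindN-rf")  # return raw scores
BINARY_SERVERS = ("DP-Bind", "DBS-PRED")                      # return binding / non-binding


def build_features(scores, calls):
    """Build one six-element feature vector per residue.

    scores maps each scoring server to a list of per-residue scores;
    calls maps each binary server to a list of per-residue booleans.
    """
    length = len(scores[SCORING_SERVERS[0]])
    for name in SCORING_SERVERS:
        if len(scores[name]) != length:
            raise ValueError(f"{name} returned a different sequence length")
    for name in BINARY_SERVERS:
        if len(calls[name]) != length:
            raise ValueError(f"{name} returned a different sequence length")

    features = []
    for i in range(length):
        row = [float(scores[name][i]) for name in SCORING_SERVERS]
        # Binary predictions become +1 (binding) or 0 (non-binding).
        row += [1.0 if calls[name][i] else 0.0 for name in BINARY_SERVERS]
        features.append(row)
    return features


# Made-up outputs for a two-residue sequence.
scores = {"DISIS": [0.2, 0.9], "DNABindR": [0.1, 0.8],
          "BindN": [-0.3, 1.2], "BindN-rf": [0.4, 0.7]}
calls = {"DP-Bind": [False, True], "DBS-PRED": [False, False]}
for row in build_features(scores, calls):
    print(row)
```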
The PDNA-316 dataset was divided into 10 roughly equal subsets and 10-fold cross-validation was performed. To predict whether a given amino acid in a sequence is a DNA-binding or a non-DNA-binding site, the subset containing this residue was labelled as the “test” set, whereas the nine remaining subsets were labelled as the “training” set. An SVM model was developed for each training set. The class labels for positive and negative samples were set to +1 and -1, respectively. The ratio of positive to negative samples in the training set was about 1:10. Training at such a ratio would bias the SVM model toward predicting every residue as a negative case, so the ratio in the training set was set to 1:1 by discarding a random selection of the negative samples prior to training. The SVM implementation used was LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/), with the radial basis function (RBF) as the kernel function. The corresponding parameter settings for SVM learning were optimized automatically by LIBSVM.
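The fold construction and the 1:1 balancing of each training set can be sketched as below. This is an illustration under stated assumptions: samples are split at the residue level with a seeded shuffle (the article does not say whether subsets were formed from residues or from whole proteins), and the SVM training step itself is left out. All function names are hypothetical.

```python
import random


def make_folds(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def balance_training_set(train_idx, labels, seed=0):
    """Discard random negatives (-1) until positives and negatives are 1:1."""
    rng = random.Random(seed)
    positives = [i for i in train_idx if labels[i] == 1]
    negatives = [i for i in train_idx if labels[i] == -1]
    kept_negatives = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + kept_negatives


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (balanced training indices, test indices) for each fold."""
    folds = make_folds(len(labels), k, seed)
    for t, test_idx in enumerate(folds):
        train_idx = [i for f, fold in enumerate(folds) if f != t for i in fold]
        yield balance_training_set(train_idx, labels, seed + t), test_idx


# Toy labels with a 1:10 positive-to-negative ratio.
labels = [1] * 10 + [-1] * 100
for train_idx, test_idx in cross_validation_splits(labels):
    n_pos = sum(labels[i] == 1 for i in train_idx)
    print(len(train_idx), n_pos, len(test_idx))
```

Only the training sets are balanced. Each test fold keeps its original class ratio, so the measured sensitivity and specificity reflect the imbalance found in real proteins.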
Funding from the Ministry of Science and Technology of China (grant No. 2008DFA11320) and the EU 7th Framework Marie Curie Action IRESE program (grant No. 247097) is gratefully acknowledged.
This article has been published as part of BMC Systems Biology Volume 5 Supplement 1, 2011: Selected articles from the 4th International Conference on Computational Systems Biology (ISB 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/5?issue=S1.
- Ptashne M: Regulation of transcription: from lambda to eukaryotes. Trends Biochem Sci. 2005, 30: 275-279. 10.1016/j.tibs.2005.04.003.
- Ofran Y, Mysore V, Rost B: Prediction of DNA-binding residues from sequence. Bioinformatics. 2007, 23: i347-353. 10.1093/bioinformatics/btm174.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
- Yan C, Terribilini M, Wu F, Jernigan RL, Dobbs D, Honavar V: Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinformatics. 2006, 7: 262. 10.1186/1471-2105-7-262.
- Wang L, Brown SJ: BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res. 2006, 34: W243-248. 10.1093/nar/gkl298.
- Wang L, Yang MQ, Yang JY: Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics. 2009, 10 (Suppl 1): S1. 10.1186/1471-2164-10-S1-S1.
- Hwang S, Gou Z, Kuznetsov IB: DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics. 2007, 23: 634-636. 10.1093/bioinformatics/btl672.
- Ahmad S, Gromiha MM, Sarai A: Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics. 2004, 20: 477-486. 10.1093/bioinformatics/btg432.
- Lv S, Wang X, Cui Y, Jin J, Sun Y, Tang Y, Bai Y, Wang Y, Zhou L: Application of attention network test and demographic information to detect mild cognitive impairment via combining feature selection with support vector machine. Comput Methods Programs Biomed. 97: 11-18. 10.1016/j.cmpb.2009.05.003.
- Calle ML, Urrea V: Letter to the Editor: Stability of Random Forest importance measures. Brief Bioinform. 2010.
- De Roach JN: Neural networks--an artificial intelligence approach to the analysis of clinical data. Australas Phys Eng Sci Med. 1989, 12: 100-106.
- Wang L, Brown SJ: Prediction of DNA-binding residues from sequence features. J Bioinform Comput Biol. 2006, 4: 1141-1158. 10.1142/S0219720006002387.
- Tsuchiya Y, Kinoshita K, Nakamura H: PreDs: a server for predicting dsDNA-binding site on protein molecular surfaces. Bioinformatics. 2005, 21: 1721-1723. 10.1093/bioinformatics/bti232.
- Ahmad S, Sarai A: PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics. 2005, 6: 33. 10.1186/1471-2105-6-33.
- Jones S, Shanahan HP, Berman HM, Thornton JM: Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins. Nucleic Acids Res. 2003, 31: 7189-7198. 10.1093/nar/gkg922.
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22: 1658-1659. 10.1093/bioinformatics/btl158.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.