BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features
© Wang et al. 2010
Published: 28 May 2010
Understanding how biomolecules interact is a major task of systems biology. To model protein-nucleic acid interactions, it is important to identify the DNA or RNA-binding residues in proteins. Protein sequence features, including the biochemical properties of amino acids and evolutionary information in the form of a position-specific scoring matrix (PSSM), have been used for DNA or RNA-binding site prediction. However, PSSM was designed primarily for PSI-BLAST searches, and it may not contain all the evolutionary information needed for modelling DNA or RNA-binding sites in protein sequences.
In the present study, several new descriptors of evolutionary information have been developed and evaluated for sequence-based prediction of DNA and RNA-binding residues using support vector machines (SVMs). The new descriptors were shown to improve classifier performance. Interestingly, the best classifiers were obtained by combining the new descriptors and PSSM, suggesting that they captured different aspects of evolutionary information for DNA and RNA-binding site prediction. The SVM classifiers achieved 77.3% sensitivity and 79.3% specificity for prediction of DNA-binding residues, and 71.6% sensitivity and 78.7% specificity for RNA-binding site prediction.
Predictions at this level of accuracy may provide useful information for modelling protein-nucleic acid interactions in systems biology studies. We have thus developed a web-based tool called BindN+ (http://bioinfo.ggc.org/bindn+/) to make the SVM classifiers accessible to the research community.
Protein-DNA and protein-RNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanisms of protein-nucleic acid recognition, it is important to identify the DNA or RNA-binding amino acid residues in proteins. The identification is straightforward if the structure of a protein-DNA or protein-RNA complex is known. Unfortunately, it is very expensive and time-consuming to solve the structure of a protein-DNA/RNA complex. Currently, only a few hundred protein-nucleic acid complexes have structural data available in the Protein Data Bank (PDB, http://www.rcsb.org/pdb/). With the rapid accumulation of sequence data, predictive methods are needed for identifying potential DNA or RNA-binding residues in protein sequences.
Several machine learning methods have been reported for predicting DNA or RNA-binding residues directly from amino acid sequences [1–3], using biochemical features of amino acid residues [4, 5], and by incorporating evolutionary information in terms of position-specific scoring matrices [6–8]. Ahmad et al. investigated representative structures of protein-DNA complexes, and used the amino acid sequences in these structures to train artificial neural networks (ANNs) for prediction of DNA-binding residues. Yan et al. constructed Naïve Bayes classifiers for DNA-binding site prediction from amino acid identities. Naïve Bayes classifiers were also developed for predicting RNA-binding residues directly from amino acid sequences. However, without using biological knowledge for classifier construction, the prediction accuracy was relatively low in these studies.
The use of evolutionary information for input encoding has been shown to improve classifier performance. Ahmad and Sarai constructed ANN classifiers for DNA-binding site prediction using evolutionary information in the form of a position-specific scoring matrix (PSSM). More recently, PSSM profiles have also been used to train support vector machines (SVMs) and logistic regression models for sequence-based prediction of DNA-binding residues [7, 8]. For a given protein sequence, its PSSM profile can be derived from the result of a PSI-BLAST search against a large sequence database. PSSM scores indicate how well an amino acid position in the query sequence is conserved among its homologues. Since functional sites, including DNA and RNA-binding residues, tend to be conserved among homologous proteins, PSSM can provide relevant information for classifier construction. However, PSSM was designed primarily for PSI-BLAST searches, and it may not contain all the evolutionary information needed for modelling DNA or RNA-binding sites.
In our previous studies [4, 5], ANN and SVM classifiers were constructed for DNA or RNA-binding site prediction using relevant biochemical features, including the hydrophobicity index, side chain pKa value, and molecular mass of an amino acid. These features were used to represent biological knowledge, which might not be learned from the training data. It was found that classifier performance was enhanced by using the biochemical features for input encoding, and the SVM classifiers outperformed the ANN predictors. Nevertheless, it is still unknown whether classifier performance can be further improved by combining the biochemical features with evolutionary information.
This study aimed to examine different descriptors of evolutionary information for DNA and RNA-binding site prediction, and to improve classifier performance by combining relevant sequence features. Three new descriptors of evolutionary information as well as PSSM were used to construct SVM classifiers, and the new descriptors were shown to improve classifier performance. Interestingly, the most accurate classifiers were obtained by combining the new descriptors with PSSM and relevant biochemical features for input encoding. The results suggest that PSSM, although useful for classifier construction, does not capture all the evolutionary information for predicting DNA and RNA-binding residues in protein sequences. A new web server called BindN+ (http://bioinfo.ggc.org/bindn+/) has been developed to make the SVM classifiers accessible to the biological research community.
Two amino acid sequence datasets, PDNA-62 and PRINR25, were derived from structural data of protein-DNA and protein-RNA complexes available at the Protein Data Bank (PDB at http://www.rcsb.org/pdb/). The PDNA-62 dataset was used to train classifiers for DNA-binding residues as in previous studies [4–7]. PDNA-62 was derived from 62 structures of representative protein-DNA complexes. The PRINR25 dataset was prepared for RNA-binding site prediction in our previous study. PRINR25 was derived from 174 structures of protein-RNA complexes. Both PDNA-62 and PRINR25 had less than 25% identity among the sequences in each dataset.
As in the previous studies [1, 4–6], an amino acid residue was designated as a DNA or RNA-binding site if the side chain or backbone atoms of the residue fell within a cutoff distance of 3.5 angstroms (Å) from any atoms of the DNA or RNA molecule in the complex. All the other residues were regarded as non-binding sites. Both PDNA-62 and PRINR25 were imbalanced datasets with ~15% residues labelled as binding sites and ~85% residues as non-binding sites.
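The labelling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `label_binding_residues`, the input layout (lists of atom coordinates) and the toy coordinates are all assumptions, and the cutoff is treated as inclusive (distance ≤ 3.5 Å).

```python
import math

def label_binding_residues(residue_atoms, nucleic_atoms, cutoff=3.5):
    """Label each residue True (binding) or False (non-binding).

    residue_atoms: list of residues, each a list of (x, y, z) atom coordinates
                   (side chain and backbone atoms together).
    nucleic_atoms: list of (x, y, z) coordinates of the DNA or RNA atoms.
    A residue is binding if any of its atoms lies within `cutoff` angstroms
    of any nucleic acid atom.
    """
    labels = []
    for atoms in residue_atoms:
        is_binding = any(
            math.dist(a, n) <= cutoff
            for a in atoms
            for n in nucleic_atoms
        )
        labels.append(is_binding)
    return labels

# Toy coordinates (hypothetical, not taken from any PDB entry).
protein = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],    # residue 1: closest atom is 2.5 A away
    [(10.0, 0.0, 0.0), (11.5, 0.0, 0.0)],  # residue 2: more than 7 A away
]
nucleic = [(3.0, 2.0, 0.0)]
labels = label_binding_residues(protein, nucleic)
```

In practice the coordinates would be parsed from a PDB file, and a spatial index would replace the brute-force pairwise loop for large complexes.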
Support vector machines (SVMs) were trained using residue-wise data instances derived from the sequence datasets. From a sequence with n amino acid residues, a total of (n – w + 1) data instances were extracted, where w was the sliding window size. In this study, each instance consisted of eleven consecutive residues (w = 11) with the target residue positioned in the middle of the subsequence. An instance was labelled as 1 (positive) if the target residue was DNA/RNA-binding, or as -1 (negative) if the target residue was non-binding. The context information provided by the five neighboring residues on each side of the target residue was previously shown to be optimal for sequence-based prediction of DNA or RNA-binding residues [4, 5].
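The window extraction described above can be sketched as follows. The function name `extract_instances` and the toy sequence are hypothetical; the sketch only assumes what the text states, namely a window of w = 11 residues, the target in the middle, and labels of 1 or -1.

```python
def extract_instances(sequence, binding_flags, w=11):
    """Slide a window of size w over the sequence.

    Returns a list of (subsequence, label) pairs, where label is 1 if the
    residue in the middle of the window is binding and -1 otherwise.
    A sequence of n residues yields n - w + 1 instances.
    """
    half = w // 2
    instances = []
    for start in range(len(sequence) - w + 1):
        window = sequence[start:start + w]
        target = start + half  # index of the middle residue
        label = 1 if binding_flags[target] else -1
        instances.append((window, label))
    return instances

seq = "MKRTLSAVGHEWQ"      # hypothetical 13-residue sequence
flags = [False] * len(seq)
flags[5] = True             # mark the target residue of the first window as binding
instances = extract_instances(seq, flags)
```

With n = 13 and w = 11 this gives three instances. Note that under this scheme the first and last five residues of a sequence never appear as target residues.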
To generate the input vector for training SVMs, each residue was represented with three biochemical features and several descriptors of evolutionary information (see below). The three biochemical features, including the hydrophobicity index (feature H), side chain pKa value (feature K), and molecular mass (feature M) of an amino acid, were previously used to construct classifiers for DNA or RNA-binding site prediction [4, 5].
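The SVMs were trained with the radial basis function (RBF) kernel, $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2)$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are two data vectors, and γ is a training parameter. A smaller γ value makes the decision boundary smoother. Another parameter for SVM training is the regularization factor C, which controls the trade-off between low training error and large margin. Different values for the γ and C parameters have been tested in this study to optimize the classifier performance.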
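To illustrate the role of γ, the sketch below assumes the kernel is the radial basis function (RBF) kernel, which the description of γ suggests. The function name `rbf_kernel` and the example vectors are illustrative only.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x = [1.0, 0.0]
y = [0.0, 1.0]  # squared distance to x is 2.0

# With a small gamma the two points still look similar to the kernel,
# so each training point influences a wide region (smoother boundary).
k_small = rbf_kernel(x, y, gamma=0.01)

# With a large gamma the similarity decays quickly with distance,
# so the boundary can bend tightly around individual points.
k_large = rbf_kernel(x, y, gamma=10.0)
```

In a typical workflow γ and C would be chosen together by a grid search with cross-validation; the specific grid used in this study is not stated in the text.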
Considering the great complexity of protein-DNA/RNA interactions, the labelled datasets derived from the available structures are rather small in size. On the other hand, there are abundant unlabeled sequence data in public databases such as UniProt. The unlabeled data contain evolutionary information about the conservation of each sequence position, and DNA/RNA-binding residues tend to be conserved among homologous proteins.
Position-specific scoring matrix (PSSM) has often been used as a descriptor of evolutionary information. PSSM profiles can be derived by searching a protein sequence database using the PSI-BLAST program. For each position in a query sequence, there are 20 PSSM scores. The evolutionary information captured by PSSM was previously shown to improve the performance of artificial neural networks and support vector machines for DNA-binding site prediction [6, 7].
where $x_{ij}$ is the value of feature X for the amino acid residue in the homologous sequence $b_j$ that is aligned to $a_i$ at position i in the query sequence p.
Although X can be any biological feature with a numerical domain, the three biochemical features relevant for DNA and RNA-binding site prediction have been investigated in this study, that is, $X \in \{H, K, M\}$. The new descriptors of evolutionary information can be defined as follows:
(1) Descriptor for feature H: The mean and standard deviation of the H feature values for each residue $a_i$ in the sequence p. Hydrophobicity (H) is a key factor in amino acid side chain packing and protein folding. Hydrophobic amino acids, which are often located inside proteins, are underrepresented at the DNA interaction interfaces [1–4]. Thus, if a residue has a greater mean hydrophobicity with a smaller standard deviation in the sequence alignment, the residue in the query sequence is less likely to be located at the interaction interface.
(2) Descriptor for feature K: This descriptor measures how well the side chain pKa value (K) of an amino acid residue is conserved among the homologous sequences in the alignment. The side chain pKa determines the ionization state of a residue. Since the phosphate groups of nucleic acids are negatively charged, the ionization state of amino acid side chains affects the interaction with DNA or RNA molecules. Amino acid residues with positively charged side chains (e.g., arginine) are overrepresented at the interaction interface. In other words, if a residue has a greater mean of feature K with a smaller standard deviation in the sequence alignment, the residue in the query sequence is more likely to be a DNA or RNA-binding residue.
(3) Descriptor for feature M: Each amino acid has a unique value of molecular mass (feature M), which is closely related to the volume of space occupied by the residue in protein structures. DNA or RNA-binding residues may be subject to a size constraint in order to fit into the interaction interface, and the mean and standard deviation of M may be used to represent the evolutionary information for this size constraint.
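The three descriptors share one computation: for each query position, the mean and standard deviation of a feature over the residues aligned to that position. The sketch below shows one plausible reading. Several details are assumptions, not taken from the paper: the feature table `TOY_FEATURE` holds placeholder values rather than the actual H, K or M scales, homologues are given as strings already aligned to the query, gaps are skipped, sequences are unweighted, and the population standard deviation is used.

```python
import math

# Placeholder feature values for a few amino acids (hypothetical; these are
# NOT the hydrophobicity, pKa or molecular mass scales used in the paper).
TOY_FEATURE = {"A": 0.5, "R": -1.0, "K": -0.8, "L": 1.0, "D": -0.9}

def column_mean_sd(query, homologues, feature):
    """For each query position i, return (mean, sd) of the feature values of
    the homologue residues aligned to position i.

    Gap characters and residues missing from the feature table are skipped.
    Positions with no aligned residues get (0.0, 0.0).
    """
    descriptors = []
    for i in range(len(query)):
        values = [feature[h[i]] for h in homologues if h[i] in feature]
        if not values:
            descriptors.append((0.0, 0.0))
            continue
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        descriptors.append((mean, math.sqrt(variance)))
    return descriptors

query = "ARL"
aligned = ["ARL", "AKL", "-RL"]  # hypothetical homologues, pre-aligned to the query
desc = column_mean_sd(query, aligned, TOY_FEATURE)
```

In this toy alignment the second column mixes R and K, so its standard deviation is non-zero, while the fully conserved columns have a standard deviation of zero. Each position thus contributes two numbers per feature to the input vector.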
In the performance measures, TP is the number of true positives; TN is the number of true negatives; FP is the number of false positives; and FN is the number of false negatives. Since the datasets used in this study are imbalanced, both sensitivity and specificity are also computed from the prediction results. Furthermore, the average of sensitivity and specificity, referred to as strength in this paper, has been shown to provide a fair measure of classifier performance [1–4].
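The measures reported in this paper can be computed from the four counts as sketched below. The formulas are the standard definitions of accuracy, sensitivity, specificity and the Matthews correlation coefficient (MCC), which are assumed to match the ones used in the study; strength is the average of sensitivity and specificity as stated above. The function name `evaluate` and the counts are hypothetical.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Compute standard binary classification measures from confusion counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    strength = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "strength": strength,
        "mcc": mcc,
    }

# Hypothetical counts for 1000 residues with 15% binding sites.
m = evaluate(tp=120, tn=680, fp=170, fn=30)
```

With these counts sensitivity, specificity and strength all equal 0.8, yet the MCC is only about 0.47, which shows how class imbalance lowers the MCC even when both rates are balanced.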
The Receiver Operating Characteristic (ROC) curve is probably the most robust approach for classifier evaluation and comparison. The ROC curve is drawn by plotting the true positive rate (i.e., sensitivity) against the false positive rate, which equals (1 – specificity). In this work, the ROC curve has been generated by varying the output threshold of a classifier and plotting the true positive rate against the false positive rate for each threshold value. The area under the ROC curve (AUC) can be used as a reliable measure of classifier performance. Since the ROC plot is a unit square, the maximum value of AUC is 1, which is achieved by a perfect classifier. Weak classifiers have AUC values close to 0.5.
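The threshold sweep can be sketched as follows. This is a minimal illustration with hypothetical classifier outputs; the function names `roc_points` and `auc` are assumptions, and the area is estimated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points obtained by
    sweeping the output threshold from high to low.

    A residue is predicted as binding when its score is >= the threshold.
    Labels are 1 for binding and any other value for non-binding.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != 1)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical SVM outputs and true labels for six residues.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, -1, 1, -1, -1]
points = roc_points(scores, labels)
area = auc(points)
```

For this toy data the area is 8/9, which equals the fraction of (binding, non-binding) pairs in which the binding residue receives the higher score.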
Effect of evolutionary information on DNA-binding site prediction.
Classifier performance was improved to varying levels when each of the three new descriptors of evolutionary information was added to the biochemical features for input encoding. As shown in Table 1, the descriptor based on the mean and standard deviation of feature K gave rise to the highest performance with 74.2% prediction strength (73.4% sensitivity and 75.0% specificity), MCC = 0.365 and ROC AUC = 0.813. The classifier using all three new descriptors (derived from features H, K and M) achieved slightly better performance with 74.6% prediction strength (72.4% sensitivity and 76.8% specificity), MCC = 0.377 and ROC AUC = 0.817. Therefore, the use of the three new evolutionary information descriptors for input encoding was found to improve classifier performance.
Position-specific scoring matrix (PSSM) was previously shown to improve the accuracy of DNA-binding site prediction [6–8]. In this study, the SVM classifier constructed using PSSM in addition to the biochemical features achieved high performance with 76.5% prediction strength (74.8% sensitivity and 78.2% specificity), MCC = 0.409 and ROC AUC = 0.849. Interestingly, the most accurate classifier was obtained by combining PSSM with the new descriptors of evolutionary information for input encoding. This classifier achieved 78.3% prediction strength (77.3% sensitivity and 79.3% specificity), MCC = 0.440 and ROC AUC = 0.859 (Table 1).
The results suggest that although PSSM can be used to improve classifier performance, it does not capture all the evolutionary information for DNA-binding site prediction. While PSSM scores indicate whether an amino acid residue is conserved among homologous sequences, the three new descriptors can be used to represent the conservation of the relevant biochemical properties for DNA-binding residues. However, since classifier performance is improved only slightly by combining PSSM with the new descriptors, the evolutionary information captured by the different descriptors probably overlaps in part.
Effect of evolutionary information on RNA-binding site prediction.
Classifier performance was improved by using each of the new descriptors of evolutionary information. In particular, the use of descriptor resulted in slightly better performance with 70.5% prediction strength (66.5% sensitivity and 74.6% specificity), MCC = 0.312 and ROC AUC = 0.774. The performance was improved to 71.6% prediction strength (67.4% sensitivity and 75.8% specificity), MCC = 0.331 and ROC AUC = 0.784 when all the three new descriptors of evolutionary information were used for classifier construction (Table 2).
The use of PSSM was also found to significantly improve RNA-binding site prediction, and the classifier achieved 74.6% prediction strength (71.5% sensitivity and 77.7% specificity), MCC = 0.380 and ROC AUC = 0.818. Nevertheless, the classifier constructed using all the descriptors of evolutionary information (PSSM and the three new descriptors) appeared to give the best predictive performance with 75.2% prediction strength (71.6% sensitivity and 78.7% specificity), MCC = 0.393 and ROC AUC = 0.825 (Table 2).
The best SVM classifiers developed in this study compare favourably with the other existing predictors. For DNA-binding site prediction, DBS-PSSM, a PSSM-based artificial neural network predictor constructed using the PDNA-62 dataset, was shown to give 68.2% sensitivity and 66.0% specificity. By contrast, the best classifier in this study achieved 77.3% sensitivity and 79.3% specificity (Table 1).
The DP-Bind system provided several classifiers for DNA-binding site prediction, and these classifiers were also constructed using the PDNA-62 dataset. The PSSM-based SVM classifier of DP-Bind achieved 76.9% sensitivity and 74.7% specificity with ROC AUC = 0.836 on imbalanced test datasets. The best performance was achieved by the PSSM-based kernel logistic regression predictor, and the average of sensitivity and specificity reached 76.5%. In this study, the best SVM classifier achieved 78.3% prediction strength and ROC AUC = 0.859 (Table 1).
Yan et al. developed a Naïve Bayes classifier for DNA-binding residues, and evolutionary information was not used for input encoding. The Matthews correlation coefficient of the Naïve Bayes classifier reached 0.28, which is significantly lower than that of the present study (MCC = 0.440, Table 1).
For RNA-binding site prediction, Terribilini et al. reported a Naïve Bayes classifier that could predict at 38% sensitivity and 93% specificity (65.5% prediction strength). The highest MCC value of the Naïve Bayes classifier was 0.35. In contrast, this study achieved 75.2% prediction strength and MCC = 0.393 (Table 2). With the specificity level set to 93.0% on the ROC curve (Figure 3), the best SVM classifier had 47.0% sensitivity and MCC = 0.421. Thus, the SVM classifier developed in this study appears to be more accurate than the Naïve Bayes model for RNA-binding site prediction.
To make the SVM classifiers accessible to the biological research community, we have developed the BindN+ web server (http://bioinfo.ggc.org/bindn+/). The web interface of BindN+ is similar to that of our previous system, BindN. Users can enter an amino acid sequence in FASTA format, choose whether to predict DNA or RNA-binding residues, and specify the desired level of sensitivity or specificity for the prediction result. The system performs a three-iteration PSI-BLAST search against the UniProtKB database to extract evolutionary information as described in Methods. The query sequence is encoded using the three biochemical features (H, K and M), PSSM, and the three new descriptors of evolutionary information. The most accurate SVM classifier constructed in this study is then used to scan the query sequence for putative DNA or RNA-binding residues. To make predictions, the user-defined level of sensitivity or specificity is used to choose the output threshold of the SVM model according to its ROC curve shown in Figure 2 or Figure 3.
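Choosing an output threshold from a user-defined specificity level can be sketched as below. How BindN+ implements this step is not described in the text, so the function `threshold_for_specificity`, its inputs (classifier outputs and labels from an evaluation set) and the toy values are all assumptions. The sketch returns the lowest threshold that meets the requested specificity, which also gives the highest sensitivity available at that level.

```python
def threshold_for_specificity(scores, labels, target_specificity):
    """Return the lowest output threshold whose specificity is at least
    `target_specificity`.

    A residue is predicted as binding when its score is >= the threshold,
    so specificity is the fraction of non-binding residues scoring below it.
    Labels are 1 for binding and any other value for non-binding.
    """
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    # Infinity is a fallback threshold that predicts no residue as binding.
    candidates = sorted(set(scores)) + [float("inf")]
    for t in candidates:
        true_negatives = sum(1 for s in negatives if s < t)
        if true_negatives / len(negatives) >= target_specificity:
            return t
    return float("inf")

# Hypothetical SVM outputs and true labels from an evaluation set.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, -1, 1, -1, -1]

moderate = threshold_for_specificity(scores, labels, 0.66)
strict = threshold_for_specificity(scores, labels, 1.0)
```

Raising the requested specificity moves the threshold up (here from 0.6 to 0.8), so fewer residues are predicted as binding and sensitivity drops. A request for a given sensitivity could be handled symmetrically by scanning thresholds from high to low.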
BindN+ represents a significant upgrade to the previous web server BindN, which was based on SVM models constructed with the relevant biochemical features. BindN has been frequently accessed, and the prediction results have been shown to provide useful information for biological research. Since our approach does not require structural information for binding site prediction, BindN+ can be used for genome-wide analyses of DNA and RNA-binding proteins. The analytical results may provide useful information for systematic understanding of protein-nucleic acid interactions.
In this study, several descriptors of evolutionary information have been examined for sequence-based prediction of DNA and RNA-binding residues. The new descriptors of evolutionary information have been shown to improve classifier performance. Interestingly, the most accurate classifiers have been obtained by combining the new descriptors, PSSM and relevant biochemical features for input encoding. The results suggest that although PSSM can be used to improve classifier performance, it does not capture all the evolutionary information for DNA and RNA-binding site prediction. The SVM classifiers developed in this study compare favourably with the other existing predictors. Thus, a new web server called BindN+ (http://bioinfo.ggc.org/bindn+/) has been developed to make the SVM classifiers publicly available. We anticipate that BindN+ can provide a useful tool for modelling protein-nucleic acid interactions in systems biology studies.
LW initiated and designed the study. LW and CH conducted the data analysis. LW drafted the manuscript. MQY and JYY provided valuable insights on biomolecular interactions and systems biology modeling, and participated in result interpretation and manuscript preparation. All authors have reviewed the final version and agreed on the content.
This work is supported by the CSREES/USDA, under project number SC-1700355.
This article has been published as part of BMC Systems Biology Volume 4 Supplement 1, 2010: Proceedings of the ISIBM International Joint Conferences on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS). The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/4?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.