A graph kernel method for DNA-binding site prediction
© Yan Wang; licensee BioMed Central Ltd. 2014
Published: 8 December 2014
Protein-DNA interactions play important roles in many biological processes. Computational methods that can accurately predict DNA-binding sites on proteins will greatly expedite research on problems involving protein-DNA interactions.
This paper presents a method for predicting DNA-binding sites on protein structures. The method represents protein surface patches using labeled graphs and uses a graph kernel method to calculate the similarities between graphs. A new surface patch is predicted to be an interface or non-interface patch based on its similarities to known DNA-binding patches and non-DNA-binding patches. The proposed method achieved an overall accuracy of 88.7% when tested on a representative set of 146 protein-DNA complexes using leave-one-out cross-validation. Then, the method was applied to identify DNA-binding sites on 13 unbound structures of DNA-binding proteins. In each of the unbound structures, the top 1 patch predicted by the proposed method precisely indicated the location of the DNA-binding site. Comparisons with other methods showed that the proposed method was competitive in predicting DNA-binding sites on unbound proteins.
The proposed method uses graphs to encode the distribution of features in 3-dimensional (3D) space. Thus, compared with vector-based methods, it has the advantage of taking into account the spatial distribution of features on the proteins. By using an efficient kernel method to compare graphs, the proposed method also avoids the demanding computations required for comparing 3D objects. It provides a competitive method for predicting DNA-binding sites without requiring structure alignment.
Structural genomics projects are yielding an increasingly large number of protein structures with unknown function. As a result, computational methods for predicting functional sites on these structures are in urgent demand. There has been significant interest in developing computational methods for identifying amino acid residues that participate in protein-DNA interactions based on combinations of sequence, structure, evolutionary information, and chemical and physical properties. For example, Jones et al. analyzed residue patches on the surface of DNA-binding proteins and used electrostatic potentials of residues to predict DNA-binding sites. Later, they extended that method by including DNA-binding structural motifs. In related studies, Tsuchiya et al. used a structure-based method to identify protein-DNA binding sites based on electrostatic potentials and surface shape. Gao and Skolnick predicted DNA-binding sites using structural template comparison and a statistical potential. Sophisticated machine-learning methods, such as SVMs, neural networks, and Random Forests, have also been used to predict DNA-binding sites by integrating a wide range of features [5–9]. In another direction, several methods have been developed for predicting DNA-binding sites using only protein sequence-derived information as input [10–15]. To date, the methods that take advantage of structure-derived information achieve better results than those using only sequence-derived information.
One common limitation of the above-mentioned methods is that the sequence and structural properties of a surface patch are input to machine-learning methods in the form of vectors. When the properties of a surface patch are encoded as a vector, the information of how these properties distribute over the surface is lost. For example, if a surface patch includes five amino acid residues, the above-mentioned methods will encode the amino acid identities of this surface patch as five independent values in a vector. In this representation, the spatial arrangement of these five residues on the surface patch is not encoded. Unfortunately, the spatial arrangement of properties on a surface patch plays a crucial role in determining the function of the surface patch.
To overcome this limitation, this paper presents a graph approach for DNA-binding site prediction. In this study, graphs are used to represent surface patches, such that the spatial arrangement of various properties on the surface is explicitly encoded. The similarities between surface patches are then computed using a graph kernel method. A voting strategy is then used to classify surface patches into DNA-binding sites versus non-binding sites based on their similarity to known DNA-binding surfaces and non-DNA-binding surfaces. When applied to a set of unbound structures of DNA-binding proteins, the proposed method can precisely identify the locations of DNA-binding sites.
DNA-binding proteins were obtained from our previous study. In that study, we extracted all protein-DNA complexes from the PDB. Then, the dataset was culled using PISCES. The resulting dataset consisted of 171 proteins with mutual sequence identity ≤ 30%, and each protein had at least 40 amino acid residues. All the structures have a resolution better than 3.0 Å and an R factor less than 0.3. In the current study, seven features are evaluated for their usefulness in the prediction of DNA-binding sites. Thus, seven features were calculated for each protein. Among them, structural conservation was calculated based on the alignment of structural neighbors (see section 2.2 for details). Twenty-five proteins were discarded because no structural neighbors were found. In the end, 146 DNA-binding proteins were used to evaluate the proposed method in cross-validation.
DNA was removed from the protein-DNA complexes and seven features were calculated for each amino acid of the protein: (1) Relative solvent accessibility was calculated using NACCESS; (2) Electrostatic potential was calculated using Delphi with the same parameters used in the study of Jones et al. The electrostatic potential of a residue is defined as the average of the electrostatic potentials at the locations of all its atoms, as described in Jones et al.; (3) Sequence entropy at each residue position (the sequence entropy for the corresponding column in the multiple sequence alignment) was extracted from the HSSP database. Sequence entropy is a measure of sequence conservation. The lower the value, the more conserved is the corresponding residue position; (4) Surface curvature at each residue position was calculated using MSP (http://connolly.best.vwh.net/); (5) Pockets on the protein surface were detected using Proshape (http://csb.stanford.edu/~koehl/ProShape/download.php). The pocket size of a residue is defined as the size of the pocket that the residue is located in. If a residue is not located in any pocket, then a value of 0 is assigned to the pocket size of the residue; (6) The DALI server was used to search for structural neighbors in the PDB for each of the DNA-binding proteins. The DALI server returned a multiple alignment of the query structure and its structural neighbors. Then, a structural conservation score was calculated for each residue position using Scorecons based on the multiple alignment; and (7) A position-specific scoring matrix (PSSM) of a protein was built by running 4 iterations of PSI-BLAST against the NCBI non-redundant (nr) database. In the PSSM, each residue position corresponds to 20 values. Thus, in total, each amino acid residue is associated with 26 attributes. All these attributes were normalized to the range of 0.
Interface residues and surface residues
Interface residues are defined as in Jones et al. Solvent accessible surface area (ASA) was computed for each residue in the unbound protein (in the absence of DNA) and in the protein-DNA complex. A residue is defined to be an interface residue if its ASA in the protein-DNA complex is less than its ASA in the unbound protein by at least 1 Å². A residue is defined to be a surface residue if its relative accessibility in the unbound protein is > 5%. In total, 4,337 interface residues and 27,248 surface residues were obtained.
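The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `classify_residues`, the dictionary layout, and the example ASA values are all hypothetical, and the per-residue ASA and relative accessibility are assumed to have been computed beforehand (for example with NACCESS).

```python
def classify_residues(asa_unbound, asa_complex, rel_acc_unbound,
                      min_asa_loss=1.0, min_rel_acc=5.0):
    """Return (interface, surface) sets of residue ids.

    asa_unbound / asa_complex: residue id -> ASA in square angstroms.
    rel_acc_unbound: residue id -> relative accessibility in percent.
    """
    # Interface: ASA drops by at least 1 square angstrom upon DNA binding.
    interface = {r for r, asa in asa_unbound.items()
                 if asa - asa_complex[r] >= min_asa_loss}
    # Surface: relative accessibility in the unbound protein above 5%.
    surface = {r for r, acc in rel_acc_unbound.items() if acc > min_rel_acc}
    return interface, surface


# Hypothetical values for three residues.
asa_unbound = {"A10": 45.0, "A11": 30.2, "A12": 2.0}
asa_complex = {"A10": 12.5, "A11": 29.8, "A12": 2.0}
rel_acc = {"A10": 38.0, "A11": 21.0, "A12": 1.5}
interface, surface = classify_residues(asa_unbound, asa_complex, rel_acc)
```

Here only A10 loses enough accessible area to count as an interface residue, while A10 and A11 both qualify as surface residues.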
Interface patches and non-interface patches
For each DNA-binding protein, an interface patch and a non-interface patch were obtained. The interface patch included all the interface residues. The non-interface patch was randomly taken from the protein surface such that (1) it consisted of a group of contiguous surface residues; (2) it had the same number of residues as the interface patch; and (3) it did not include any interface residue.
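One way to satisfy the three conditions on the non-interface patch is to grow a patch from a random seed over a residue contact map. The sketch below is an assumption about how such sampling could be done, not a description of the procedure actually used; `sample_non_interface_patch`, the `adjacency` dictionary, and the retry limit are hypothetical.

```python
import random


def sample_non_interface_patch(adjacency, interface, size, rng=None, max_tries=1000):
    """Grow a contiguous patch of `size` surface residues that avoids `interface`.

    adjacency: surface residue id -> set of contacting surface residue ids.
    Returns a set of residue ids, or None if no patch was found.
    """
    rng = rng or random.Random()
    candidates = sorted(set(adjacency) - set(interface))
    if not candidates:
        return None
    for _ in range(max_tries):
        seed = rng.choice(candidates)
        patch = {seed}
        frontier = {n for n in adjacency[seed] if n not in interface}
        # Add one random neighboring residue at a time so the patch stays contiguous.
        while len(patch) < size and frontier:
            nxt = rng.choice(sorted(frontier))
            patch.add(nxt)
            frontier |= {n for n in adjacency[nxt] if n not in interface}
            frontier -= patch
        if len(patch) == size:
            return patch
    return None


# Toy surface: 8 residues arranged in a ring, residues 0-2 form the interface.
ring = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
patch = sample_non_interface_patch(ring, interface={0, 1, 2}, size=3,
                                   rng=random.Random(0))
```

Because every added residue contacts a residue already in the patch, contiguity holds by construction, and interface residues are never admitted to the frontier.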
Graph representation of patches
Each amino acid residue is represented using a node labeled with the 26 attributes of the residue. Two residues are considered contacting if the closest distance between their heavy atoms is less than the sum of the radii of the atoms plus 0.5 Å. An edge is added between two nodes if the corresponding residues are contacting. In this way, a surface patch of residues is represented as a labeled graph.
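The graph construction described above can be illustrated as follows. This is a sketch under stated assumptions: atoms are given as hypothetical `(x, y, z, radius)` tuples, and two residues are treated as contacting if any pair of their heavy atoms is closer than the sum of the two radii plus 0.5 Å, which is one reading of the contact criterion. The names `in_contact` and `build_patch_graph` are illustrative.

```python
import math
from itertools import combinations


def in_contact(atoms_a, atoms_b, margin=0.5):
    """True if some heavy-atom pair is closer than radius_a + radius_b + margin."""
    for xa, ya, za, ra in atoms_a:
        for xb, yb, zb, rb in atoms_b:
            if math.dist((xa, ya, za), (xb, yb, zb)) < ra + rb + margin:
                return True
    return False


def build_patch_graph(residues, margin=0.5):
    """residues: id -> {"attrs": [26 floats], "atoms": [(x, y, z, radius), ...]}.

    Returns (nodes, edges): node labels keyed by residue id and a set of
    undirected edges stored as sorted id pairs.
    """
    nodes = {rid: r["attrs"] for rid, r in residues.items()}
    edges = set()
    for a, b in combinations(sorted(residues), 2):
        if in_contact(residues[a]["atoms"], residues[b]["atoms"], margin):
            edges.add((a, b))
    return nodes, edges


# Toy patch with one atom per residue (attribute vectors shortened for brevity).
residues = {
    "R1": {"attrs": [0.1, 0.2], "atoms": [(0.0, 0.0, 0.0, 1.7)]},
    "R2": {"attrs": [0.3, 0.4], "atoms": [(3.5, 0.0, 0.0, 1.7)]},
    "R3": {"attrs": [0.5, 0.6], "atoms": [(10.0, 0.0, 0.0, 1.7)]},
}
nodes, edges = build_patch_graph(residues)
```

In the toy patch, R1 and R2 are 3.5 Å apart, which is below the 3.9 Å cutoff (1.7 + 1.7 + 0.5), so they share the only edge.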
Kernel methods are widely used in data mining. Put simply, a kernel function yields a positive definite matrix that measures the similarity between each pair of input data. In the current study, a graph kernel method, namely the shortest-path kernel developed by Borgwardt and Kriegel, is used to compute the similarities between graphs.
The first step of the shortest-path kernel is to transform original graphs into shortest-path graphs. A shortest-path graph has the same nodes as its original graph, and between each pair of nodes, there is an edge labeled with the shortest distance between the two nodes in the original graph. In the current study, the edge label will be referred to as the weight of the edge. This transformation can be done using any algorithm that solves the all-pairs-shortest-paths problem. In the current study, the Floyd-Warshall algorithm was used.
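The transformation to a shortest-path graph can be sketched with a plain Floyd-Warshall implementation. The sketch assumes that every contact edge in the original graph has length 1 (so edge weights are hop counts) and that node pairs with no connecting path are simply left out; neither choice is stated in the text, and the function name `shortest_path_graph` is illustrative.

```python
import math


def shortest_path_graph(nodes, edges):
    """Return {(u, v): shortest distance} for all connected node pairs, u < v.

    nodes: iterable of node ids; edges: iterable of undirected (u, v) pairs,
    each assumed to have length 1.
    """
    nodes = sorted(nodes)
    dist = {u: {v: (0 if u == v else math.inf) for v in nodes} for u in nodes}
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    # Floyd-Warshall: allow each node k in turn as an intermediate stop.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    sp_edges = {}
    for idx, u in enumerate(nodes):
        for v in nodes[idx + 1:]:
            if dist[u][v] < math.inf:
                sp_edges[(u, v)] = dist[u][v]
    return sp_edges


# A path graph a-b-c-d becomes a complete graph weighted by hop count.
sp = shortest_path_graph(["a", "b", "c", "d"],
                         [("a", "b"), ("b", "c"), ("c", "d")])
```

The algorithm runs in cubic time in the number of nodes, which is inexpensive for patches of a few dozen residues.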
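The shortest-path kernel between two graphs $G_1$ and $G_2$ is then computed from the edge sets $E_1$ and $E_2$ of their shortest-path graphs:
$$K(G_1, G_2) = \sum_{e_1 \in E_1} \sum_{e_2 \in E_2} k_{edge}(e_1, e_2)$$
where $k_{edge}$ is a kernel function for comparing two edges (including the node labels and the edge weight).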
The edge-weight term of $k_{edge}$ is
$$k_{weight}(e_1, e_2) = \max(0, c - |weight(e_1) - weight(e_2)|)$$
where weight(e) returns the weight of edge e. $k_{weight}$ is a Brownian bridge kernel that assigns the highest value to edges that are identical in length. The constant c was set to 2, as in Borgwardt et al. We tried values of c from 1 to 5 in increments of 1; the change in accuracy was less than 1%.
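A shortest-path kernel with a Brownian bridge weight term can be sketched as below. Several parts are assumptions rather than the authors' implementation: the Brownian bridge kernel is taken in its usual form max(0, c - |w1 - w2|), the node kernel `k_node` is a hypothetical Gaussian on the attribute vectors, and the edge kernel averages over both endpoint pairings because edges are undirected. Each graph is a pair of node labels and shortest-path edge weights.

```python
import math


def k_weight(w1, w2, c=2.0):
    """Brownian bridge kernel: largest when the two path lengths are equal."""
    return max(0.0, c - abs(w1 - w2))


def k_node(x, y, gamma=1.0):
    """Hypothetical Gaussian kernel on node attribute vectors (an assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def k_edge(e1, e2, labels1, labels2, c=2.0):
    """Compare two shortest-path edges via their endpoints and weights."""
    (u1, v1), w1 = e1
    (u2, v2), w2 = e2
    kw = k_weight(w1, w2, c)
    if kw == 0.0:
        return 0.0
    # Undirected edges: average over the two ways of pairing the endpoints.
    straight = k_node(labels1[u1], labels2[u2]) * k_node(labels1[v1], labels2[v2])
    crossed = k_node(labels1[u1], labels2[v2]) * k_node(labels1[v1], labels2[u2])
    return kw * (straight + crossed) / 2.0


def shortest_path_kernel(g1, g2, c=2.0):
    """Sum k_edge over all pairs of edges of the two shortest-path graphs."""
    labels1, sp1 = g1
    labels2, sp2 = g2
    return sum(k_edge(e1, e2, labels1, labels2, c)
               for e1 in sp1.items() for e2 in sp2.items())


# Toy graphs with identical node labels, so only path lengths matter.
g = ({"a": [0.0], "b": [0.0]}, {("a", "b"): 1})
h = ({"x": [0.0], "y": [0.0], "z": [0.0]},
     {("x", "y"): 1, ("y", "z"): 1, ("x", "z"): 2})
k_gg = shortest_path_kernel(g, g)
k_gh = shortest_path_kernel(g, h)
```

With these toy inputs the larger graph h scores higher against g than g does against itself, simply because its sum has more terms; kernel values are therefore only directly comparable between graphs of equal size.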
If $G_x$ has more nodes than $G_y$, then $|E_x| > |E_y|$, where $E_x$ and $E_y$ are the sets of edges in the shortest-path graphs of $G_x$ and $G_y$. Therefore, the summation in $K(G, G_x)$ includes more terms than the summation in $K(G, G_y)$. Each term (i.e., each $k_{edge}$ value) inside the summation is non-negative. Consequently, $K(G, G_x) > K(G, G_y)$ does not necessarily indicate that $G_x$ is more similar to $G$ than $G_y$ is; it could instead be an artifact of $G_x$ having more nodes than $G_y$. To overcome this problem, a voting strategy was developed for predicting whether a graph (i.e., a patch) is an interface patch:
Algorithm Voting_Strategy (G)
Input: graph G
Output: G is an interface patch or non-interface patch
Let T be the set of proteins in the training set
Let n be the number of proteins in T
Let v be the number of votes given to "G is an interface patch"
v = 0
While (T is not empty)
    Take one protein (P) out of T
    Let G_int and G_non-int be the interface and non-interface patches from P
    If K(G, G_int) > K(G, G_non-int), then increase v by 1
If v > n/2, then G is an interface patch
Else G is a non-interface patch
Using this strategy, when $K(G, G_{int})$ is compared with $K(G, G_{non-int})$, $G_{int}$ and $G_{non-int}$ are guaranteed to have an identical number of nodes, since they are the interface and non-interface patches extracted from the same protein (see section 2.4 for details). Each time $K(G, G_{int}) > K(G, G_{non-int})$ is true, one vote is given to "G is an interface patch". In the end, G is predicted to be an interface patch if "G is an interface patch" gets more than half of the total votes, i.e., $v > n/2$, where $n$ is the number of proteins in the training set.
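The voting strategy translates almost directly into Python. In this sketch the kernel is passed in as a function, and the training set is a list of (interface patch, non-interface patch) pairs, one per training protein; the function name `vote` and the toy one-dimensional "graphs" with a negative-distance similarity are hypothetical stand-ins used only to make the example self-contained.

```python
def vote(graph, training_pairs, kernel):
    """Return (votes, is_interface) for `graph`.

    training_pairs: list of (g_int, g_non_int), one pair per training protein.
    kernel: function (graph_a, graph_b) -> similarity value.
    """
    votes = 0
    for g_int, g_non_int in training_pairs:
        # One vote whenever the query is more similar to the interface patch.
        if kernel(graph, g_int) > kernel(graph, g_non_int):
            votes += 1
    # Interface patch only if more than half of the proteins voted for it.
    return votes, votes > len(training_pairs) / 2


def toy_kernel(a, b):
    """Stand-in similarity on numbers; a real run would use the graph kernel."""
    return -abs(a - b)


training = [(1.0, 5.0), (2.0, 9.0), (8.0, 3.0)]
votes, is_interface = vote(1.5, training, toy_kernel)
```

The vote count is also what later serves as the prediction score for ranking patches on a test protein.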
Results and discussion
Distinguish interface patches from non-interface patches
146 interface patches and 146 non-interface patches were obtained from the dataset. The graph kernel method was used to compute similarities between patches and the voting strategy was used to classify these patches into interface versus non-interface patches. When evaluated using leave-one-out cross-validation, this method achieved an overall accuracy of 88.7%. It correctly predicted 87.7% of the interface patches (sensitivity) and 89.7% of the non-interface patches (specificity).
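The three reported measures follow from a 2×2 confusion matrix. The counts in the sketch below (128 of 146 interface patches and 131 of 146 non-interface patches correct) are not given in the text; they are back-calculated assumptions that happen to reproduce the reported percentages after rounding.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and overall accuracy as fractions."""
    return {
        "sensitivity": tp / (tp + fn),   # correctly predicted interface patches
        "specificity": tn / (tn + fp),   # correctly predicted non-interface patches
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Assumed counts, chosen to be consistent with the reported percentages.
metrics = binary_metrics(tp=128, fn=18, tn=131, fp=15)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}
```

Because the two classes are the same size, the overall accuracy is the mean of sensitivity and specificity.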
Contributions of the features
Contributions of features.
Prediction of DNA-binding residues
Predicting DNA-binding sites on unbound proteins
Thirteen test proteins with both DNA-bound and unbound structures in the PDB were taken from a previous study. Fourteen such proteins were considered in the study by Tjong and Zhou. Here, we discarded 2abk because the sequence identity between the bound and unbound proteins was only 45%. In this section, the DNA-binding sites on the 13 unbound proteins are predicted using the graph kernel method. The prediction results are evaluated based on the actual DNA-binding sites gleaned from the corresponding protein-DNA complexes. For each surface residue on the test proteins, we obtained a surface patch that included the residue and its 5 closest neighbors. Then, the patches were classified into interface versus non-interface patches using the 146 proteins as the training set. For each test protein, the training set was filtered such that none of the proteins in the training set shares > 30% identical residues with the test protein.
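Patch extraction around a surface residue can be sketched as a nearest-neighbor lookup. The text does not say how the distance between residues is measured, so the sketch assumes one representative coordinate per residue (for example a centroid); the function name `residue_patch` and the coordinates are hypothetical.

```python
import math


def residue_patch(center, coords, n_neighbors=5):
    """Return the center residue followed by its n closest surface residues.

    coords: surface residue id -> (x, y, z) representative coordinate.
    Ties in distance are broken by residue id to keep the result deterministic.
    """
    others = [r for r in coords if r != center]
    others.sort(key=lambda r: (math.dist(coords[center], coords[r]), r))
    return [center] + others[:n_neighbors]


# Eight hypothetical surface residues spaced 1 unit apart along a line.
coords = {f"r{i}": (float(i), 0.0, 0.0) for i in range(8)}
patch = residue_patch("r0", coords)
```

Each patch built this way contains six residues, which matches the patch size used when the top 1 patch is evaluated.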
Table 2. Predictions by the top 1 patch. Columns: PDB id, top 1 patch, P_random (%).
The top 1 patch overlaps with the actual DNA-binding site
Using the voting strategy, each patch was assigned a number representing the number of votes it received. The higher the vote count, the more similar the patch is to the interface patches. For each test protein, we sorted the patches by the number of votes they received, such that the top 1 patch received the most votes. Table 2 shows that on every test protein, the top 1 patch overlaps with the actual DNA-binding site. On 7 of the 13 proteins, all six residues in the top 1 patch are actual interface residues (6 true positives, 0 false positives). When averaged over the 13 proteins, the top 1 patch contains 4.8 interface residues and 1.2 non-interface residues, i.e., on average, 80% of the residues in the top 1 patch are interface residues. These results show that on a test protein, the top 1 patch can precisely indicate the location of the actual DNA-binding site.
If a patch is randomly picked from a test protein, what is the probability ($P_{random}$) of obtaining a patch that is at least as good as the top 1 patch in terms of predicting the DNA-binding sites? For each test protein, $P_{random}$ is calculated as $N/N_{all}$, where $N_{all}$ is the total number of patches on the protein and $N$ is the number of patches that have at least as many interface residues as the top 1 patch. The results (Table 2) show that for 9 of the 13 proteins, $P_{random}$ is less than 10%. The average $P_{random}$ for the 13 proteins is 9.8%. This indicates that the top 1 predictions are unlikely to be obtained by chance.
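The random-baseline probability defined above can be computed as follows. The sketch assumes that the top 1 patch is itself one of the patches counted, so the result is never zero; the function name `p_random` and the toy residue numbers are hypothetical.

```python
def p_random(patches, interface, top_patch):
    """Fraction of patches with at least as many interface residues as top_patch."""
    def n_interface(patch):
        return len(set(patch) & interface)

    threshold = n_interface(top_patch)
    n_good = sum(1 for patch in patches if n_interface(patch) >= threshold)
    return n_good / len(patches)


# Toy protein with four 3-residue patches; residues 1-3 form the interface.
interface = {1, 2, 3}
patches = [[1, 2, 3], [1, 2, 9], [7, 8, 9], [3, 8, 9]]
p = p_random(patches, interface, top_patch=[1, 2, 3])
```

In the toy case only one of the four patches matches the top patch's three interface residues, giving a probability of 0.25.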
Obtaining higher coverage by combining multiple top-ranking patches
In the evaluation of DNA-binding site prediction methods, researchers are mainly interested in two measures: coverage ($TP/N_{int}$) and accuracy ($TP/N_{pr}$), where $TP$ is the number of true positives, i.e., the number of residues that are predicted to be interface residues and are actually interface residues, $N_{int}$ is the total number of interface residues, and $N_{pr}$ is the number of residues that are predicted to be interface residues. Coverage is the percentage of the actual interface residues that are correctly predicted, and accuracy is the percentage of the predicted interface residues that are actually interface residues.
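Both measures are simple set ratios. The sketch below also shows one plausible way to combine several top-ranking patches, namely taking the union of their residues as the predicted set; that combination rule is an assumption, and the function name and residue numbers are hypothetical.

```python
def coverage_and_accuracy(predicted, interface):
    """Return (coverage, accuracy) = (TP / N_int, TP / N_pr) as fractions."""
    tp = len(predicted & interface)
    coverage = tp / len(interface) if interface else 0.0
    accuracy = tp / len(predicted) if predicted else 0.0
    return coverage, accuracy


# Assumed combination rule: predicted residues are the union of the top patches.
top_patches = [{1, 2, 3}, {2, 3, 4}]
predicted = set().union(*top_patches)          # {1, 2, 3, 4}
interface = {2, 3, 4, 5, 6, 7}
coverage, accuracy = coverage_and_accuracy(predicted, interface)
```

Adding more patches can only keep or raise coverage, while accuracy may fall as lower-ranked patches bring in non-interface residues.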
Comparison with other methods
While many computational methods have been proposed for the prediction of DNA-binding sites, it is difficult to make direct comparisons between them due to the lack of a standardized benchmark for evaluation. Here, it is not our intent to make a systematic comparison between different methods. We only compared our method with two recent methods, MV and DISPLAR, regarding their ability to find DNA-binding sites on the 13 unbound proteins. Both MV and DISPLAR use structural and sequence information in predicting DNA-binding sites.
The MV method integrates a wide range of structural, evolutionary, energy-based and experimental data and uses a random forest method to predict functional sites, including protein-, peptide-, DNA-, and RNA-binding sites on protein structures. The 13 unbound protein structures were submitted to the MV online server to obtain the predicted DNA-binding sites. The MV server returned a list of amino acid residues with their corresponding prediction scores. By changing the score threshold used for prediction, the MV method obtained an AUC of 0.85 for the ROC curve. In comparison, our method returned a list of surface patches with their prediction scores (i.e., vote counts). Our method achieved a slightly better AUC of 0.87.
Comparison with other methods.
MV and our method returned a list of residues (or patches) with decreasing prediction scores and allowed users to trade off coverage against accuracy by choosing a threshold. To compare them with DISPLAR, for each test protein, we gradually decreased the prediction threshold until the coverage achieved was equal to or higher than that of DISPLAR. Then the coverage and accuracy of the methods were compared. On a test protein, method A is better than method B if accuracy(A) > accuracy(B) and coverage(A) ≥ coverage(B). Table 3 shows the comparisons. On each protein, the best performance among the three methods is shown in bold. On 1mml, no method is better than the others on both accuracy and coverage, thus a best performance cannot be identified. Our method achieved the best performance on 6 of the 13 proteins (tied with MV on 1zzk).
This paper presents a competitive method for predicting DNA-binding sites on proteins. The effectiveness of the method is demonstrated using cross-validation and by applying it to 13 unbound protein structures. Unlike other methods that represent the sequence and structural properties of a surface using vectors, the method proposed in this study uses labeled graphs. Compared to vectors, one advantage of labeled graphs is that they can explicitly encode the spatial arrangement of the properties on the protein surface. Since proteins and DNA interact in 3-dimensional space, the spatial arrangement of the properties on the protein surface plays a pivotal role in the interactions. Therefore, computational methods for predicting the interface should consider the spatial arrangement of the properties. The proposed method uses a graph kernel to exploit this information. By using this graph kernel, the proposed method avoids the demanding computation involved in structural alignment and comparison.
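The comparison protocol can be sketched in two small functions: one lowers the score threshold until a target coverage is reached, the other applies the "better than" rule. For simplicity the sketch scores individual residues, whereas our method scores patches; the function names, scores and target coverage are hypothetical, and the score dictionary is assumed to be non-empty.

```python
def match_coverage(scores, interface, target_coverage):
    """Lower the threshold over distinct scores until coverage >= target.

    scores: residue id -> prediction score (higher means more likely interface).
    Returns (coverage, accuracy) at the first threshold that reaches the target,
    or at the lowest threshold if the target is never reached.
    """
    predicted = set()
    for threshold in sorted(set(scores.values()), reverse=True):
        predicted = {r for r, s in scores.items() if s >= threshold}
        if len(predicted & interface) / len(interface) >= target_coverage:
            break
    tp = len(predicted & interface)
    return tp / len(interface), tp / len(predicted)


def is_better(a, b):
    """a and b are (coverage, accuracy); a wins on accuracy without losing coverage."""
    return a[1] > b[1] and a[0] >= b[0]


scores = {1: 9, 2: 8, 3: 7, 4: 6, 5: 5}
interface = {1, 3, 5, 6}
ours = match_coverage(scores, interface, target_coverage=0.5)
```

Under this rule two methods can be incomparable, for example when one has higher accuracy and the other higher coverage, in which case neither is declared best.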
This work was supported by NSF awards 1262265 and 1305175 to C. Yan.
Publication costs for this work were funded by NSF award 1262265 to C. Yan.
This article has been published as part of BMC Systems Biology Volume 8 Supplement 4, 2014: Thirteenth International Conference on Bioinformatics (InCoB2014): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S4.
- Jones S, Shanahan HP, Berman HM, Thornton JM: Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins. Nucl Acids Res. 2003, 31 (24): 7189-7198. 10.1093/nar/gkg922.
- Shanahan HP, Garcia MA, Jones S, Thornton JM: Identifying DNA-binding proteins using structural motifs and the electrostatic potential. Nucl Acids Res. 2004, 32 (16): 4732-4741. 10.1093/nar/gkh803.
- Tsuchiya Y, Kinoshita K, Nakamura H: Structure-based prediction of DNA-binding sites on proteins using the empirical preference of electrostatic potential and the shape of molecular surfaces. Proteins. 2004, 55 (4): 885-894. 10.1002/prot.20111.
- Gao M, Skolnick J: DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Res. 2008, 36 (12): 3978-3992. 10.1093/nar/gkn332.
- Keil M, Exner T, Brickmann J: Pattern recognition strategies for molecular surfaces: III. Binding site prediction with a neural network. J Comput Chem. 2004, 25 (6): 779-789. 10.1002/jcc.10361.
- Ahmad S, Gromiha MM, Sarai A: Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics. 2004, 20 (4): 477-486. 10.1093/bioinformatics/btg432.
- Tjong H, Zhou H-X: DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces. Nucl Acids Res. 2007, 35 (5): 1465-1477. 10.1093/nar/gkm008.
- Xiong Y, Liu J, Wei D-Q: An accurate feature-based method for identifying DNA-binding residues on protein surfaces. Proteins. 2011, 79 (2): 509-517. 10.1002/prot.22898.
- Segura J, Jones PF, Fernandez-Fuentes N: A holistic in silico approach to predict functional sites in protein structures. Bioinformatics. 2012, 28 (14): 1845-1850. 10.1093/bioinformatics/bts269.
- Yan C, Terribilini M, Wu F, Jernigan RL, Dobbs D, Honavar V: Identifying amino acid residues involved in protein-DNA interactions from sequence. BMC Bioinformatics. 2006, 7: 262. 10.1186/1471-2105-7-262.
- Ahmad S, Sarai A: PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics. 2005, 6 (1): 33. 10.1186/1471-2105-6-33.
- Wang L, Brown SJ: BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucl Acids Res. 2006, 34: W243-W248. 10.1093/nar/gkl298.
- Wang L, Yang M, Yang J: Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics. 2009, 10 (Suppl 1): S1. 10.1186/1471-2164-10-S1-S1.
- Wu J, Liu H, Duan X, Ding Y, Wu H, Bai Y, Sun X: Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature. Bioinformatics. 2009, 25 (1): 30-35. 10.1093/bioinformatics/btn583.
- Ofran Y, Mysore V, Rost B: Prediction of DNA-binding residues from sequence. Bioinformatics. 2007, 23 (13): i347-353. 10.1093/bioinformatics/btm174.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucl Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
- Wang G, Dunbrack RLJ: PISCES: a protein sequence culling server. Bioinformatics. 2003, 19: 1589-1591. 10.1093/bioinformatics/btg224.
- Hubbard SJ: NACCESS. 1993, Department of Biochemistry and Molecular Biology, University College, London.
- Rocchia W, Alexov E, Honig B: Extending the applicability of the nonlinear Poisson-Boltzmann equation: Multiple dielectric constants and multivalent ions. J Phys Chem. 2001, 105: 6507-6514. 10.1021/jp010454y.
- Sander C, Schneider R: Database of homology derived protein structures and the structural meaning of sequence alignment. Proteins. 1991, 9: 56-68. 10.1002/prot.340090107.
- Holm L, Sander C: Dali: a network tool for protein structure comparison. Trends Biochem Sci. 1995, 20: 478-480. 10.1016/S0968-0004(00)89105-7.
- Valdar WSJ: Scoring residue conservation. Proteins: Struct, Funct, Genet. 2002, 48 (2): 227-241. 10.1002/prot.10146.
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Borgwardt KM, Kriegel HP: Shortest-path kernels on graphs. The fifth IEEE International Conference on Data Mining (ICDM'05). 2005, 74-81.
- Borgwardt KM, Ong CS, Schonauer S, Vishwanathan SVN, Smola AJ, Kriegel HP: Protein function prediction via graph kernels. Bioinformatics. 2005, 21 (suppl_1): i47-56.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.