Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation
© Xu et al.; licensee BioMed Central Ltd. 2015
Published: 6 February 2015
DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation. Several computational methods have been proposed in the literature for DNA-binding protein identification. However, most of them cannot provide a knowledge base that is valuable for understanding DNA-protein interactions.
We first present a new protein sequence encoding method called PSSM Distance Transformation, and then construct a DNA-binding protein identification method (SVM-PSSM-DT) by combining PSSM Distance Transformation with a support vector machine (SVM). First, the PSSM profiles are generated by using the PSI-BLAST program to search the non-redundant (NR) database. Next, the PSSM profiles are transformed into uniform numeric representations by the distance transformation scheme. Lastly, the resulting uniform numeric representations are fed into an SVM classifier, which predicts whether a sequence binds to DNA or not. In a benchmark test on 525 DNA-binding and 550 non DNA-binding proteins using jackknife validation, the present model achieved an ACC of 79.96%, MCC of 0.622 and AUC of 86.50%. This performance is considerably better than that of most existing state-of-the-art predictive methods. When tested on a recently constructed independent dataset, PDB186, SVM-PSSM-DT also achieved the best performance with ACC of 80.00%, MCC of 0.647 and AUC of 87.40%, and outperformed some existing state-of-the-art methods.
The experimental results demonstrate that PSSM Distance Transformation is a viable protein sequence encoding method and that SVM-PSSM-DT is a useful tool for identifying DNA-binding proteins. A user-friendly web-server for SVM-PSSM-DT was constructed, which is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/PSSM-DT/.
DNA-binding proteins are pivotal to cell functions such as DNA replication, transcriptional regulation, packaging, recombination, DNA repair, DNA modification and other fundamental activities associated with DNA. For example, in eukaryotic cells, histones, a typical type of DNA-binding protein, help package chromosomal DNA into a compact structure. Restriction enzymes, another typical type of DNA-binding protein, are DNA-cutting enzymes found in bacteria that recognize and cut DNA only at a particular nucleotide sequence, serving a host-defense role. DNA-binding proteins represent a broad category of proteins, known to be highly diverse in sequence and structure. Structurally, they have been divided into eight structural groups, which were further classified into 54 protein structural families [1, 2]. Functionally, protein-DNA interactions play various roles across the entire genome, as previously mentioned. The past decade has witnessed tremendous progress in genome sequencing [4–7]. According to the Genome On Line Database, the complete sequenced genomes of almost 1000 cellular organisms have been released, and about 5000 active genome sequencing projects are under way [8, 9]. This unprecedented amount of genetic information has provided hundreds of thousands of protein sequences, which poses the challenging problem of elucidating their functions.
At present, several experimental techniques are employed for identifying DNA-binding proteins, such as filter binding assays, genetic analysis, chromatin immunoprecipitation on microarrays, and X-ray crystallography. However, experimental approaches for identifying DNA-binding proteins are costly and time consuming. It would be highly desirable to develop computational approaches that can automatically determine whether a novel sequence binds to DNA or not. Therefore, reliable identification of DNA-binding proteins with effective computational approaches is an important research topic in proteomics. Many attempts have been made to identify DNA-binding proteins, and many effective computational prediction methods have been proposed in the literature. These computational methods represent a broad category of prediction methods for DNA-binding proteins, known to be highly diverse in classifiers and protein representation.
In terms of classifiers, the computational methods can be divided into template-based and machine-learning-based methods, depending on how they use the information from the putative DNA-binding proteins. Template-based methods can be further classified into two classes. The first class utilizes a structural comparison protocol to detect significant structural similarity between the query and a template known to bind DNA, at either the domain or the structural motif level, in order to assess the DNA-binding preference of the target sequence [11, 12]. The second class employs a sequence comparison protocol (such as PSI-BLAST) to detect significant sequence similarity between the query and a template known to bind DNA in order to evaluate the DNA-binding preference of the target sequence. Machine-learning-based methods do not perform direct structural comparison, but typically follow a machine-learning framework. To obtain a good predictive model, various machine-learning algorithms have been employed to construct classification models, such as support vector machine (SVM) [14–17], neural network [18–22], random forest, naïve Bayes classifier [24, 25], nearest neighbor and ensemble classifiers [27, 28].
In the task of computational protein function prediction, there are two major problems: the choice of the classification algorithm and the choice of the protein representation. Depending on the choice of protein representation, these computational predictive methods can be classified into two categories: i) analysis from protein structure [19, 20, 28, 30] and ii) prediction from amino acid sequence [11, 21, 31–33]. Among the structure-based prediction methods, Stawiski et al. examined positively charged patches on the surface of putative DNA-binding proteins in comparison with those on non DNA-binding proteins. They employed 12 features, including the patch size, hydrogen-bonding potential, the fraction of evolutionarily conserved positively charged residues and other properties of the protein, to train a neural network (NN) for identifying DNA-binding proteins. Ahmad and Sarai trained an NN classifier using three features: the net charge, the electric dipole moment and the quadrupole moment of the protein. Bhardwaj et al. examined the sizes of positively charged patches on the surface of putative DNA-binding proteins. They based their SVM classifier on the protein's overall charge and its overall and surface amino acid composition. Szilágyi and Skolnick previously trained a logistic regression classifier using the amino acid composition, the asymmetry of the spatial distribution of specific residues and the dipole moment of the protein. Nimrod, Szilágyi et al. recently developed a random forest classifier based on the electrostatic potential, cluster-based amino acid conservation patterns and the secondary structure content of the patches, as well as features of the whole protein including its dipole moment. Since negative samples greatly outnumber real DNA-binding proteins, this is an imbalanced binary classification problem from the viewpoint of machine learning. Song et al. employed an ensemble classifier to address this problem and improved the identification.
Several methods considering the sequence-order effects were proposed, and the experimental results showed that this information can improve the predictive performance [37, 38].
The accuracy of structure-based prediction methods is usually higher, but they cannot be used for high-throughput annotation because they require a high-resolution 3D structure of the query protein. For this reason, many computational methods have been proposed for identifying DNA-binding proteins directly from their amino acid sequences. Four categories of protein sequence features and three kinds of sequence encoding methods have been proposed [31, 39–41]. The four categories of features are composition information, structural and functional information, physicochemical properties and evolutionary information. The three kinds of encoding methods are overall composition-transition-distribution, called OCTD (global method), autocross-covariance (ACC) transformation (nonlocal method) and split amino acid (SAA) transformation (local method). A comprehensive survey of these methods can be found in related research work [42–44]. However, most of the present encoding methods provide limited information for explaining the mechanisms of DNA-protein interactions. It is desirable to explore a novel encoding method that can reveal the binding mechanism of DNA-protein interactions.
In the current study, to further advance the prediction accuracy and the understanding of the binding mechanism of DNA-protein interactions, we present a novel encoding method called PSSM distance transformation (PSSM-DT), which transforms the PSSM profiles of query sequences into uniform numeric representations. We then constructed a DNA-binding protein identification method, SVM-PSSM-DT, by combining PSSM-DT with SVM. The benchmark test and the independent test showed that PSSM-DT is a promising protein encoding method.
As shown by a series of recent publications [45–59] and summarized in a comprehensive review, to develop a useful statistical prediction method or model for a biological system, one needs to follow these procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) construct a web-server for the prediction method. Below, we describe our proposed method following this general procedure.
The benchmark dataset $S$ can be formulated as

$$S = S^{+} \cup S^{-}$$

where the subset $S^{+}$ contains 525 DNA-binding proteins, the subset $S^{-}$ consists of 550 non DNA-binding proteins and the symbol ∪ represents the "union" in set theory. The benchmark dataset was obtained according to the following procedure. (1) Extract DNA-binding protein sequences from the Protein Data Bank (PDB) release of December 2013 by searching the mmCIF keyword 'DNA binding protein' through the advanced search interface. (2) Remove the sequences shorter than 50 amino acid residues and those containing the character 'X'. (3) Use PISCES to remove those sequences that have >= 25% pairwise sequence identity to any other sequence in the same subset. This yields the subset $S^{+}$ of 525 sequences. (4) Randomly extract non DNA-binding proteins from the Protein Data Bank, then use PISCES to remove those sequences that have >= 25% pairwise sequence identity to any other sequence in the same subset, and remove all sequences with fewer than 50 amino acids or containing the character 'X'. This yields the subset $S^{-}$ of 550 sequences. A complete list of all the PDB codes and sequences for the benchmark dataset can be found in Supporting Information S1.
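The length and 'X' filters in steps (2) and (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy records are hypothetical, and the redundancy removal with PISCES is an external step that is not reproduced here.

```python
def keep_sequence(seq, min_length=50):
    """Return True if a sequence passes the length and 'X' filters."""
    seq = seq.strip().upper()
    return len(seq) >= min_length and "X" not in seq


def filter_sequences(records, min_length=50):
    """Filter a dict {identifier: sequence}, keeping only acceptable sequences."""
    return {name: seq for name, seq in records.items()
            if keep_sequence(seq, min_length)}


# Toy records; identifiers and sequences are invented for illustration.
records = {
    "toy_ok": "M" + "AKRLE" * 12,       # 61 residues, no 'X'
    "toy_short": "MKRLA",               # shorter than 50 residues
    "toy_unknown": "M" + "AKXLE" * 12,  # contains the character 'X'
}
kept = filter_sequences(records)
print(sorted(kept))
```

Only the first toy record survives both filters; the 25% identity cutoff would then be applied to the surviving sequences.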
Position Specific Scoring Matrix
The PSSM of a protein can be written as an L × 20 matrix:

$$\mathrm{PSSM} = \begin{bmatrix} S_{1,1} & S_{1,2} & \cdots & S_{1,20} \\ S_{2,1} & S_{2,2} & \cdots & S_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ S_{L,1} & S_{L,2} & \cdots & S_{L,20} \end{bmatrix}$$

where L is the length of the protein, $S_{i,j}$ represents the occurrence probability of amino acid j at position i of the protein sequence, the rows of the matrix represent the positions of the sequence and the columns of the matrix represent the 20 types of standard amino acids. PSSM scores are generally shown as positive or negative integers. Positive scores indicate that the given amino acid occurs more frequently in the alignment than expected by chance, while negative scores indicate that the given amino acid occurs less frequently than expected. Large positive scores often indicate critical functional residues, which may be active site residues or residues required for other intermolecular interactions. Therefore, the elements of a PSSM profile can be used to approximately measure the occurrence probability of the corresponding amino acid at a specific position.
PSSM distance transformation
where i is one type of amino acid, L is the length of the sequence, and $S_{i,j}$ is the PSSM score of amino acid j at position i. In this way, the number of PSSM-SDT features is 20*LG, where LG is the maximum value of lg (lg = 1, 2, ..., LG).
where i1 and i2 refer to two different types of amino acids. Similarly, the total number of PSSM-DDT features is 380*LG.
PSSM-DT is the combination of variable PSSM-SDT and PSSM-DDT. Thus a sequence can be transformed into a uniform feature vector with a fixed dimension of 400*LG by using variable PSSM-DT from its PSSM profile.
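The feature counts above (20*LG for PSSM-SDT, 380*LG for PSSM-DDT, 400*LG in total) can be illustrated with a short Python sketch. The exact descriptor formula is not reproduced in this text, so the sketch assumes that each descriptor is the average product of the PSSM scores of two amino acid types at positions separated by lg; that normalization, the function name and the toy matrices are assumptions made for illustration only.

```python
def pssm_dt(pssm, max_lg):
    """Transform a PSSM (list of L rows, one column per amino acid type)
    into a fixed-length feature vector.

    Assumed descriptor: for columns a and b and distance lg, the mean over
    positions p of pssm[p][a] * pssm[p + lg][b]. Pairs with a == b play the
    role of PSSM-SDT, pairs with a != b the role of PSSM-DDT.
    """
    length = len(pssm)
    n_types = len(pssm[0])
    if max_lg >= length:
        raise ValueError("max_lg must be smaller than the sequence length")
    features = []
    for lg in range(1, max_lg + 1):
        n_pairs = length - lg
        for a in range(n_types):
            for b in range(n_types):
                total = sum(pssm[p][a] * pssm[p + lg][b] for p in range(n_pairs))
                features.append(total / n_pairs)
    return features


# Toy 30 x 20 matrix of integer scores (invented, not real PSI-BLAST output).
toy_pssm = [[(p * 7 + a * 3) % 11 - 5 for a in range(20)] for p in range(30)]
toy_features = pssm_dt(toy_pssm, max_lg=5)
print(len(toy_features))

# Tiny hand-checkable case with two columns and three positions.
tiny_features = pssm_dt([[1, 2], [3, 4], [5, 6]], max_lg=1)
print(tiny_features)
```

With 20 columns, each value of lg contributes 20 same-type and 380 different-type descriptors, so the vector length is 400*LG regardless of the sequence length.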
Support vector machine
In this study, the SVM parameter γ and the penalty parameter C were optimized by grid search with 5-fold cross-validation on the sequences in the benchmark dataset. The jackknife test was used as the evaluation method to calculate the evaluation criteria. For a dataset with N sequences, each time one sequence is taken out as the testing sequence and the remaining sequences are used as the training dataset. This process is repeated until each sequence in the dataset has been tested exactly once. The average performance over all rounds is taken as the final result.
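The jackknife (leave-one-out) procedure described above can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier is used as a stand-in; it is not the SVM used in this study, and all function names and toy data are hypothetical.

```python
def jackknife(samples, labels, train, predict):
    """Leave-one-out test: each sample is predicted by a model trained on the rest."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


# Stand-in classifier (nearest class centroid), NOT the SVM of the paper.
def train_centroids(xs, ys):
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy, clearly separable two-dimensional data.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
predictions = jackknife(samples, labels, train_centroids, predict_centroid)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

Passing the training and prediction functions as arguments keeps the loop independent of the classifier, so an SVM with the tuned γ and C could be substituted without changing the evaluation code.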
In this study, TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively. ACC denotes the percentage of both positive and negative instances correctly predicted. SN and SP represent the percentage of positive instances correctly predicted and that of negative instances correctly predicted, respectively. A ROC curve is a plot of sensitivity versus (1-specificity), generated by shifting the decision threshold. AUC, the area under the ROC curve, gives a measure of classifier performance. An AUC of 1.0 indicates a perfect classifier, whereas a classifier no better than random has an AUC of 0.5. The value of MCC measures the degree of overlap between the predicted labels and the true labels of all the samples in the benchmark dataset. It takes a value between -1 and +1. A perfect prediction at 100% accuracy yields an MCC of +1, a random prediction gives an MCC of 0, and a completely wrong prediction at 0% accuracy produces an MCC of -1.
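As a worked illustration of these criteria, the sketch below computes SN, SP, ACC and MCC from the four counts using their standard definitions. The function name and the toy counts are hypothetical and are not results from this study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute SN, SP, ACC and MCC from confusion-matrix counts.

    Assumes at least one positive and one negative instance, so that
    the SN and SP denominators are non-zero.
    """
    sn = tp / (tp + fn)                       # sensitivity
    sp = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)     # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


toy = classification_metrics(tp=40, fp=10, tn=45, fn=5)
perfect = classification_metrics(tp=5, fp=0, tn=5, fn=0)
inverted = classification_metrics(tp=0, fp=5, tn=0, fn=5)
print(toy)
print(perfect["MCC"], inverted["MCC"])
```

The last two calls reproduce the boundary cases mentioned in the text: an MCC of +1 for a perfect prediction and -1 when every instance is misclassified.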
Results and discussion
The selection of LG and features
Results on benchmark dataset of different features through jackknife validation.
where M is the matrix of sequence representatives in PSSM-DT; A is the weight vectors of the training samples; N is the number of training samples; j is the dimension of the feature vector. The element in W represents the discriminative power of the corresponding feature.
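A minimal sketch of this weight calculation is given below. It assumes that each element of W is the sum, over the N training samples, of the sample weight multiplied by the corresponding feature value (W = A·M), and that features are ranked by the absolute value of their weight; both the formula as written here and the ranking rule are assumptions, and the names and toy numbers are hypothetical.

```python
def feature_weights(sample_weights, feature_matrix):
    """Return W where W[j] = sum_i sample_weights[i] * feature_matrix[i][j]."""
    n_features = len(feature_matrix[0])
    weights = [0.0] * n_features
    for a_i, row in zip(sample_weights, feature_matrix):
        for j, value in enumerate(row):
            weights[j] += a_i * value
    return weights


def rank_features(weights):
    """Indices of features sorted by decreasing absolute weight (assumed rule)."""
    return sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)


# Toy values: three training samples, three features.
A = [0.5, -0.5, 1.0]
M = [[1, 0, 2],
     [0, 1, 2],
     [1, 1, 0]]
W = feature_weights(A, M)
print(W, rank_features(W))
```

In this toy case the first feature receives the largest weight and the third feature receives a weight of zero, so it would be considered the least discriminative.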
The discriminant weights of the descriptors for pairs (R, R), (R, P), (P, R) and (A, R) with different lg values are shown in Figure 2B. As indicated by the figure, the descriptor with lg of 4 for pair (R, R) has the highest discriminant power. For pairs (R, P) and (P, R), the discriminant weights of the descriptors differ only slightly. For pair (A, R), the descriptor with lg of 5 is the most discriminative feature. In conclusion, for an amino acid pair, the distance between the two amino acids along the sequence can affect its discriminant power in DNA-binding protein identification.
Additionally, we take protein 1AKH [PDB:1AKH] chain A as an example to show the usefulness of the PSSM-DT based protein representation for DNA-binding protein identification. 1AKH is known as the MATa1/MATα2 homeodomain heterodimer, and its chain A is the yeast mating type transcription factor MATa1. MATa1 proteins are members of the homeodomain superfamily of DNA-binding proteins and contact the DNA with their homeodomain, which folds into a compact three-helix domain containing a helix-turn-helix DNA-binding motif. Figure 2C shows the distribution of descriptors for the top four most discriminative pairs on the sequence of the MATa1 protein. From this figure we can see that there are 5 occurrences of the proposed descriptors in the DNA-binding region and no occurrence in the non DNA-binding regions. The 5 descriptors that occur in the DNA-binding region are pair (R, R) with lg of 1, pair (R, R) with lg of 3, pair (P, R) with lg of 2, pair (P, R) with lg of 3 and pair (A, R) with lg of 1. This is further confirmed by the three dimensional structure shown in Figure 2D. As indicated by the figure, no descriptor for the top four most discriminative amino acid pairs occurs in the non DNA-binding regions, and all five occurrences are within the one DNA-binding region. Furthermore, the figure shows that pair (R, R) with lg of 1 and pair (P, R) with lg of 3 are very close to the DNA in the three dimensional structure, indicating that these two descriptors are very discriminative for DNA-protein interaction.
Comparison with existing PSSM based encoding schemes
Results on benchmark dataset of different PSSM based encoding schemes through jackknife validation.
As shown in Table 2 and Figure 3, the PSSM-DT based protein representation generated the highest performance and outperformed the other four protein representations based on PSSM, indicating that PSSM-DT based protein representation is effective for DNA-binding protein identification.
Comparison with existing prediction methods
Results on benchmark dataset of different predictors through jackknife validation.
From Table 3 and Figure 4 we can see that SVM-PSSM-DT achieved the best performance with ACC of 79.96%, MCC of 0.62 and AUC of 86.50%, outperforming the other four methods by 4.56-7.41% in terms of ACC, 0.12-0.18 in terms of MCC and 5-10.4% in terms of AUC. This indicates that PSSM-DT can improve the predictive performance of DNA-binding protein identification from PSSM based sequence information.
Results on independent dataset PDB186 of different predictors.
From Table 4 and Figure 5 we can see that among the seven predictive methods, the proposed method has the highest performance with ACC of 80.00%, MCC of 0.674 and AUC of 87.40%, and that DBPPred is the previously reported method with the best predictive performance (ACC = 76.90%, MCC = 0.538 and AUC = 79.10%). Thus, compared with DBPPred, the independent prediction of SVM-PSSM-DT improves ACC by 3.10%, MCC by 0.136 and AUC by 8.30%, indicating that SVM-PSSM-DT is an effective prediction model for DNA-binding protein identification.
We have constructed a user-friendly web-server of SVM-PSSM-DT freely accessible to the public. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided below on how to use the web-server to get the desired results.
Step 2. Either type or copy and paste the query protein sequence into the input box at the center of Figure 6. Because the server must calculate the PSSM profile of every protein sequence with PSI-BLAST, which is a time-consuming operation, it accepts only one query protein sequence at a time. The input sequence should be in FASTA format; example sequences in FASTA format can be seen by clicking on the Example button right above the input box.
Step 3. Click on the Submit button to submit the query sequence to the server; the predicted result will then be shown on your screen. For example, if protein 1IGN chain B is used as the query sequence, the predicted result shown on your screen is "DNA-binding protein".
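Before submitting, a query can be checked locally against the two input requirements stated above (FASTA format, one sequence at a time). The sketch below is a hypothetical client-side check; it does not reflect how the web-server itself validates input, and the function name and toy query are invented for illustration.

```python
def parse_single_fasta(text):
    """Parse FASTA text and return (header, sequence); require exactly one record."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) != 1:
        raise ValueError("only one query sequence may be submitted at a time")
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("the sequence is empty")
    return lines[0][1:], sequence


# Toy query; the identifier and residues are invented.
header, sequence = parse_single_fasta(">toy_query\nMKRLA\nGGHTE\n")
print(header, sequence)
```

Input containing two header lines raises a `ValueError`, mirroring the one-sequence-per-submission limit.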
In this work, we investigated the idea of identifying DNA-binding proteins from sequence by combining SVM and PSSM-DT. PSSM-DT extracts features from the PSSM by considering the probabilities of pairs of amino acids separated by a certain number of sites along the sequence. A benchmark test on a dataset of 525 DNA-binding proteins and 550 proteins that do not bind to DNA, using jackknife validation, showed that SVM-PSSM-DT achieved the best predictive performance with ACC of 79.96%, MCC of 0.62 and AUC of 86.50%, and performed better than other state-of-the-art methods by 4.56-7.41% in terms of ACC, 5-10.4% in terms of AUC and 0.12-0.18 in terms of MCC. Subsequently, the blind test performed on the independent dataset PDB186 indicated that the proposed predictive method obtained an ACC of 80.00%, MCC of 0.647 and AUC of 87.40%, and outperformed some existing state-of-the-art methods. Additionally, the discriminant weights of the descriptors in the PSSM-DT-based protein representation were calculated on the benchmark dataset, and the analysis showed that pair (R, R), pair (R, P), pair (P, R) and pair (A, R) are the top four most discriminative amino acid pairs. The three dimensional structure of protein 1AKH chain A showed that the descriptors for the top four most discriminative amino acid pairs occur only in the DNA-binding regions of the protein, indicating that PSSM-DT is a useful tool for identifying DNA-binding proteins.
Availability of supporting data
The data set supporting the results of this article is included within the article and its additional file 1.
This work was supported by the National Natural Science Foundation of China (No. 61300112, 61370165, 61203378, 61370010), the Natural Science Foundation of Guangdong Province (No. S2012040007390, S2013010014475), the Scientific Research Innovation Foundation in Harbin Institute of Technology (Project No. HIT.NSRIF.2013103), the Shanghai Key Laboratory of Intelligent Information Processing, China (Grant No. IIPL-2012-002), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, MOE Specialized Research Fund for the Doctoral Program of Higher Education 20122302120070, Open Projects Program of National Laboratory of Pattern Recognition, Shenzhen International Cooperation Research Funding GJHZ20120613110641217 and Baidu Collaborate Research Funding.
This article has been published as part of BMC Systems Biology Volume 9 Supplement 1, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/9/S1
- Luscombe N, Austin SE, Berman HM, Thornton JM: An overview of the structure of protein-DNA complex. Gonome Biol. 2000, 1 (1): 1-37.Google Scholar
- Lin C, Zou Y, Qin J, Liu XR, Jiang Y, Ke CH, Zou Q: Hierarchical classification of protein folds using a novel ensemble classifier. PLoS One. 2013, 8 (2): e56499-10.1371/journal.pone.0056499.PubMed CentralView ArticlePubMedGoogle Scholar
- Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E: Genome-wide location and function of DNA binding proteins. Science. 2000, 290 (5500): 2306-2309. 10.1126/science.290.5500.2306.View ArticlePubMedGoogle Scholar
- Harris T, Buzb PR, Babcock H, Beer E, Bowers J: Singlemolecule DNA sequencing of a viral genome. Science. 2008, 320: 106-109. 10.1126/science.1150427.View ArticlePubMedGoogle Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437: 376-380.PubMed CentralPubMedGoogle Scholar
- Shendure J, Porreca GJ, Reppas NB, Lin XX, McCutcheon JP: Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005, 309: 1728-1732. 10.1126/science.1117389.View ArticlePubMedGoogle Scholar
- Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452: 872-U875. 10.1038/nature06884.View ArticlePubMedGoogle Scholar
- Liolios K, Hugenholtz P, Kyrpides NC: The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 2006, 34: D332-D334. 10.1093/nar/gkj145.PubMed CentralView ArticlePubMedGoogle Scholar
- Zou Q, Li XB, Jiang WR, Liu ZY, Li GL, Chen K: Survey of MapReduce frame operation in bioinformatics. Briefings in bioinformatics. 2014, 15: 637-647. 10.1093/bib/bbs088.View ArticlePubMedGoogle Scholar
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006, 34: D187-D191. 10.1093/nar/gkj161.PubMed CentralView ArticlePubMedGoogle Scholar
- Gao M, Skolnick J: DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Res. 2008, 36: 3978-3992. 10.1093/nar/gkn332.PubMed CentralView ArticlePubMedGoogle Scholar
- Shanahan HP, Garcia MA, Jones S, Thornton JM: Identifying DNAbinding proteins using structural motifs and the electrostatic potential. Nucleic Acids Res. 2004, 32: 4732-4741. 10.1093/nar/gkh803.PubMed CentralView ArticlePubMedGoogle Scholar
- Marcotte EM, Pellegrin M, Ng HL, Rice DW, Yeate TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science. 1999, 285: 751-753. 10.1126/science.285.5428.751.View ArticlePubMedGoogle Scholar
- Brown J, Akutsu T: Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology. BMC Bioinforma. 2009, 10 (1): 25-10.1186/1471-2105-10-25.View ArticleGoogle Scholar
- Bhardwaj N, Langlois RE, Zhao G, Lu H: Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res. 2005, 33 (20): 6486-6493. 10.1093/nar/gki949.PubMed CentralView ArticlePubMedGoogle Scholar
- Huang HL, Lin IC, Liou YF, Tsai CT, Hsu KT, Huang WL, Ho SJ, Ho SY: Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties. BMC Bioinforma. 2011, 12 (Suppl): S47-View ArticleGoogle Scholar
- Xiong Y, Liu J, Wei DQ: An accurate feature-based method for identifying DNA-binding residues on protein surfaces. Proteins. 2011, 79 (2): 509-517. 10.1002/prot.22898.View ArticlePubMedGoogle Scholar
- Ahmad S, Andrabi M, Mizuguchi K, Sarai A: Prediction of mono- and dinucleotide-specific DNA-binding sites in proteins using neural networks. BMC Struct Biol. 2009, 9: 30-10.1186/1472-6807-9-30.PubMed CentralView ArticlePubMedGoogle Scholar
- Stawiski EW, Gregoret LM, Mandel-Gutfreund Y: Annotating nucleic acid binding function based on protein structure. J Mol Biol. 2003, 326 (4): 1065-1079. 10.1016/S0022-2836(03)00031-7.View ArticlePubMedGoogle Scholar
- Ahmad S, Sarai A: Moment-based prediction of DNA-binding proteins. J Mol Biol. 2004, 341 (1): 65-71. 10.1016/j.jmb.2004.05.058.View ArticlePubMedGoogle Scholar
- Kumar M, Gromiha M, Raghava G: Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinforma. 2007, 8 (1): 463-10.1186/1471-2105-8-463.View ArticleGoogle Scholar
- Wei L, Liao M, Gao Y: Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014, 11: 192-201.View ArticlePubMedGoogle Scholar
- Nimrod G, Schushan M, Szilágyi A, Leslie C, Ben-Tal N: iDBPs: a web server for the identification of DNA binding proteins. Bioinformatics. 2010, 26 (5): 692-693. 10.1093/bioinformatics/btq019.PubMed CentralView ArticlePubMedGoogle Scholar
- Yan C, Wu F, Jernigan R, Dobbs D, Honavar V: Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinformatics. 2006, 7 (1): 262-10.1186/1471-2105-7-262.PubMed CentralView ArticlePubMedGoogle Scholar
- Govindan G, Nair AS: New Feature Vector for Apoptosis Protein Subcellular Localization Prediction. Advances in Computing and Communications Communications. 2011, 170: 294-301.View ArticleGoogle Scholar
- Qian ZL, Cai YD, Li YX: A novel computational method to predict transcription factor DNA binding preference. Biochem Biophys Res Commun. 2006, 348 (3): 1034-1037. 10.1016/j.bbrc.2006.07.149.View ArticlePubMedGoogle Scholar
- Nann L, Lumini A: Combing ontologies and dipeptide composition for predicting DNA-binding proteins. Amino Acids. 2008, 34 (4): 635-641. 10.1007/s00726-007-0016-3.View ArticleGoogle Scholar
- Xia JF, Zhao XM, Huang DS: Predicting protein-protein interactions from protein sequences using meta predictor. Amino Acids. 2010, 39 (5): 1595-1599. 10.1007/s00726-010-0588-1.View ArticlePubMedGoogle Scholar
- Zou Q, Li XB, Jiang Y, Zhao YM, Wang GH: BinMemPredict: a Web server and software for predicting membrane protein types. Current Proteomics. 2013, 10: 2-9. 10.2174/1570164611310010002.View ArticleGoogle Scholar
- Tjong H, Zhou HX: DISPLAR: an accurate method for predicting DNAbinding sites on protein surfaces. Nucleic Acids Res. 2007, 35 (5): 1465-1477. 10.1093/nar/gkm008.PubMed CentralView ArticlePubMedGoogle Scholar
- Fang Y, Guo Y, Feng Y, Li M: Predicting DNA-binding proteins: approached from Chou's pseudo amino acid composition and other specific sequence features. Amino Acids. 2008, 34 (1): 103-109. 10.1007/s00726-007-0568-2.View ArticlePubMedGoogle Scholar
- Shao X, Tian Y, Wu L, Wang Y, Jing L, Deng N: Predicting DNA- and RNAbinding proteins from sequences with kernel methods. J Theor Biol. 2009, 258 (2): 289-293. 10.1016/j.jtbi.2009.01.024.View ArticlePubMedGoogle Scholar
- Cai Y, Lin S: Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim Biophys Acta. 2003, 1648 (1-2): 127-133. 10.1016/S1570-9639(03)00112-2.View ArticlePubMedGoogle Scholar
- Szilágyi A, Skolnick J: Efficient prediction of nucleic acid binding function from low-resolution protein structures. J Mol Biol. 2006, 358: 922-933. 10.1016/j.jmb.2006.02.053.View ArticlePubMedGoogle Scholar
- Song L, Li D, Zeng X: nDNA-prot: Identification of DNA-binding Proteins Based on Unbalanced Classification. BMC bioinformatics. 2014, 15 (1): 298-10.1186/1471-2105-15-298.PubMed CentralView ArticlePubMedGoogle Scholar
- Lin C, Chen WQ, Qiu C, Wu YF, Krishnan S, Zou Q: LibD3C: Ensemble classifiers with a clustering and dynamic selection strategy. Neurocomputing. 2014, 123: 424-435.View ArticleGoogle Scholar
- Liu B, Xu J, Fan SX, Xu RF, Zhou JY, Wang XL: PseDNA-Pro: DNA-Binding Protein Identification by Combining Chou's PseAAC and Physicochemical Distance Transformation. Molecular Informatics. 2014, 34 (1): 8-17.View ArticleGoogle Scholar
- Liu B, Xu JH, Lan X, Xu RF, Zhou JY, Wang XL, Chou KC: iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition. PLoS One. 2014, 9 (9): e106691-10.1371/journal.pone.0106691.PubMed CentralView ArticlePubMedGoogle Scholar
- Chou KC: Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol. 2011, 14 (4): 236-247.View ArticleGoogle Scholar
- Yuan Y, Shi X, Li X, Lu W, Cai Y, Gu L, Liu L, Li M, Kong X, Xing M: Prediction of interactiveness of proteins and nucleic acids based on feature selections. Mol Divers. 2010, 14 (4): 627-633. 10.1007/s11030-009-9198-9.View ArticlePubMedGoogle Scholar
- Song J, Tan H, Takemoto K, Akutsu T: HSEpred: predict half-sphere exposure from protein sequences. Bioinformatics. 2008, 24 (13): 1489-1497. 10.1093/bioinformatics/btn222.View ArticlePubMedGoogle Scholar
- Nanni L, Brahnam S, Lumini A: High performance set of PseAAC and sequence based descriptors for protein classification. J Theor Biol. 2010, 266 (1-10):Google Scholar
- Zhang Z, Kochhar S, Grigorov MG: Descriptor-based protein remote homology identification. Protein Sci. 2005, 14 (2): 431-444. 10.1110/ps.041035505.
- Zou C, Gong J, Li H: An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinformatics. 2013, 14: 90. 10.1186/1471-2105-14-90.
- Chen W, Feng PM, Lin H, Chou KC: iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research. 2013, 41: e69. 10.1093/nar/gks1455.
- Chen W, Lin H, Feng PM, Ding C, Zuo YC, Chou KC: iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS One. 2012, 7 (10): e47843. 10.1371/journal.pone.0047843.
- Xiao X, Wang P, Lin WZ, Chou KC: iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Analytical Biochemistry. 2013, 436 (2): 168-177. 10.1016/j.ab.2013.01.019.
- Xu Y, Shao XJ, Wu LY, Deng NY, Chou KC: iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ. 2013, 1: e171.
- Liu B, Zhang D, Xu R, Xu J, Wang X, Chen Q, Dong Q, Chou KC: Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics. 2014, 30 (4): 472-479. 10.1093/bioinformatics/btt709.
- Liu B, Wang XL, Chen QC, Dong QW, Lan X: Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection. PLoS One. 2012, 7 (9): e46633. 10.1371/journal.pone.0046633.
- Liu B, Wang XL, Lin L, Dong QW, Wang X: Exploiting three kinds of interface propensities to identify protein binding sites. Computational Biology and Chemistry. 2009, 33 (4): 303-311. 10.1016/j.compbiolchem.2009.07.001.
- Liu B, Wang XL, Lin L, Tang BZ, Dong QW, Wang X: Prediction of protein binding sites in protein structures using hidden Markov support vector machine. BMC Bioinformatics. 2009, 10: 381. 10.1186/1471-2105-10-381.
- Liu B, Wang XL, Zou Q, Dong QW, Chen QC: Protein Remote Homology Detection by Combining Chou's Pseudo Amino Acid Composition and Profile-Based Protein Representation. Molecular Informatics. 2013, 32: 775-782. 10.1002/minf.201300084.
- Zhang Y, Liu B, Dong Q, Jin VX: An improved profile-level domain linker propensity index for protein domain boundary prediction. Protein and Peptide Letters. 2011, 18 (1): 7-16. 10.2174/092986611794328717.
- Zou Q, Wang Z, Wu Y, Liu B, Lin Z, Guan X: An Approach for Identifying Cytokines Based On a Novel Ensemble Classifier. BioMed Research International. 2013, 686090.
- Liu B, Liu F, Fang L, Wang X, Xu RF, Chou KC: repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. (doi: 10.1093/bioinformatics/btu1820).
- Feng PM, Chen W, Lin H, Chou KC: iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal Biochem. 2013, 442 (1): 118-125. 10.1016/j.ab.2013.05.024.
- Chen W, Feng PM, Deng EZ, Lin H, Chou KC: iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Analytical Biochemistry. 2014, 462: 76-83.
- Liu B, Yi J, SV A, Lan X, Ma Y, Huang TH, Leone G, Jin VX: QChIPat: a quantitative method to identify distinct binding patterns for two biological ChIP-seq samples in different experimental conditions. BMC Genomics. 2013, 14 (Suppl 8): S3. 10.1186/1471-2164-14-S8-S3.
- Jones DT: Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics. 2007, 23: 538-544. 10.1093/bioinformatics/btl677.
- Biswas AK, Noman N, Sikder AR: Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinformatics. 2010, 11.
- Ruchi V, Grish CV, Raghava GPS: Prediction of mitochondrial proteins of malaria parasite using split amino acid composition and PSSM profile. Amino Acids. 2010, 39: 101-110. 10.1007/s00726-009-0381-1.
- Zhao XW, Li XT, Ma ZQ, Yin MH: Prediction of lysine ubiquitylation with ensemble classifier and feature selection. Int J Mol Sci. 2011, 12: 8347-8361. 10.3390/ijms12128347.
- Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001, 29 (14): 2994-3005. 10.1093/nar/29.14.2994.
- Liu B, Xu JH, Xu RF, Wang XL, Chen QC: Using distances between Top-n-gram and residue pairs for protein remote homology detection. BMC Bioinformatics. 2014, 15 (Suppl 2): S3.
- Vapnik VN: Statistical learning theory. 1998, New York: Wiley.
- Ding H, Feng PM, Chen W, Lin H: Identification of bacteriophage virion proteins by the ANOVA feature selection and analysis. Mol Biosyst. 2014, 10 (8): 2229-2235. 10.1039/C4MB00316K.
- Guo SH, Deng EZ, Xu LQ, Ding H, Lin H, Chen W, Chou KC: iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics. 2014, 30 (11): 1522-1529. 10.1093/bioinformatics/btu083.
- Liu B, Wang X, Lin L, Dong Q, Wang X: A Discriminative Method for Protein Remote Homology Detection and Fold Recognition Combining Top-n-grams and Latent Semantic Analysis. BMC Bioinformatics. 2008, 9: 510. 10.1186/1471-2105-9-510.
- Park KJ, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19 (13): 1656-1663. 10.1093/bioinformatics/btg222.
- Yu CS, Chen YC, Lu CH, Hwang JK: Prediction of protein subcellular localization. Proteins: Structure, Function, and Bioinformatics. 2006, 64 (3): 643-651. 10.1002/prot.21018.
- Sieber M, Allemann RK: Arginine (348) is a major determinant of the DNA binding specificity of transcription factor E12. Biological Chemistry. 1998, 379 (6): 731-735.
- Rohs R, West SM, Sosinsky A, Liu P: The role of DNA shape in protein-DNA recognition. Nature. 2009, 461 (7268): 1248-1253. 10.1038/nature08473.
- Kumar KK, Pugalenthi G, Suganthan PN: DNA-Prot: identification of DNA binding proteins from protein sequence information using random forest. Journal of Biomolecular Structure and Dynamics. 2009, 26 (6): 679-686. 10.1080/07391102.2009.10507281.
- Lou WC, Wang XQ, Chen F, Chen YX, Bo J, Zhang H: Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes. PLoS One. 2014, 9 (1): e86703. 10.1371/journal.pone.0086703.
- Dong Q, Zhou S, Guan J: A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics. 2009, 25: 2655-2662. 10.1093/bioinformatics/btp500.
- Li W, Jaroszewski L, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2001, 26: 82-83.
- Gao M, Skolnick J: A threading-based method for the prediction of DNA-binding proteins with application to the human genome. PLoS Comput Biol. 2009, 5 (11): e1000567. 10.1371/journal.pcbi.1000567.
- Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. [http://bioinformatics.hitsz.edu.cn/PSSM-DT/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.