Hot spot prediction in protein-protein interactions by an ensemble system
© The Author(s) 2018
- Published: 31 December 2018
Hot spot residues are functional sites in protein interaction interfaces. Identifying hot spot residues with experimental methods is time-consuming and laborious. To address this issue, many computational methods have been developed to predict hot spot residues. Most of these prediction methods are based on structural features, sequence characteristics, and/or other protein features.
This paper proposes an ensemble learning method for predicting hot spot residues that uses only sequence features and the relative accessible surface area of amino acid sequences. In this work, a novel feature selection technique was developed, an auto-correlation function combined with a sliding window technique was applied to obtain the characteristics of amino acid residues in a protein sequence, and an ensemble classifier with SVM and KNN base classifiers was built to achieve the best classification performance.
The experimental results showed that our model yields the highest F1 score of 0.92 and an MCC value of 0.87 on the ASEdb dataset. Compared with other machine learning methods, our model achieves a substantial improvement in hot spot prediction.
- Hot spot residues
- Protein-protein interaction
- Ensemble learning
Proteins are among the most important biological macromolecules in organisms. Protein-protein interactions play a mediating role in protein function. To better understand the mechanism of protein-protein interactions, hot spot residues have to be studied. By studying hot spot residues, small molecules that bind to them can be designed to prevent erroneous protein-protein interactions. The study of hot spot residues can also be used to predict the secondary structure of proteins. Saraswathi et al. found that different amino acid distributions play a crucial role in determining secondary structures. In previous studies, hot spot residues were identified by experimental methods, such as alanine scanning mutagenesis. From the large number of mutations created by experimental methods, researchers can extract many accurate hot spot residues and apply them to investigate functional sites of protein-protein interactions. As mutation data accumulated, researchers established many standard databases focused on hot spot residues, such as the binding interface database (BID) and the Alanine Scanning Energetics database (ASEdb). However, experimental methods are too time-consuming and laborious to keep up with the increasing demand for research data. Machine learning methods can alleviate the disadvantages of experimental methods and identify hot spot residues.
Feature selection is an important part of developing a prediction method. With the popularity of big data, researchers have developed multiple websites for feature extraction and selection. Our previous work proposed a new sequence-based model that combines physicochemical features with the relative accessible surface area of amino acid sequences for hot spot prediction. Bin Liu et al. developed a Python package that can extract features and implement model training, which can be used to identify post-translational modification sites and protein-protein binding sites. In addition, they proposed a server that can generate pseudo components of biological samples, such as protein and DNA, which yields different outputs for different modes, including sequence types, feature vectors, and heat vectors between feature vectors. Furthermore, some researchers have suggested that websites dedicated to feature selection can be used for different models. Chen et al. proposed a Python package for feature extraction and selection, which processes the sequence and structural characteristics of proteins and peptides to make these features more suitable for training models.
Many machine learning methods have been developed to identify hot spot residues. Some of them, such as the Robetta server, determine hot spot residues by calculating the energy contribution of each interfacial residue during protein-protein interactions. Most machine learning methods train a model on features extracted from the sequence or structure information of proteins and then test it on unknown hot spot data. For example, β ACV ASA integrated water exclusion theory into β contacts to predict hot spots. Other methods used structure-based calculations to predict hot spot residues. Wang et al. proposed a novel structure-based computational approach to identify hot spot residues by docking protein homologs. Furthermore, Xia et al. proposed the APIS model based on structural features and amino acid physicochemical characteristics, and used SVM to train the model. The classification model worked well and yielded an F1 score of 0.64. In addition, some researchers developed network methods to predict hot spots. Ye et al. used residue-residue network features and micro-environment features in combination with support vector machines to predict hot spots, which yielded an F1 score of 0.79. Although many methods have been developed to predict hot spots, the prediction performance is still low and the structural features they use are difficult to obtain. Therefore, it is important to improve hot spot prediction and find more effective features.
Ensemble learning methods have been applied in various research fields. Ensemble learning is divided into feature fusion and decision fusion, which can combine the advantages and avoid the disadvantages of different classifiers, thus optimizing the model and improving classification accuracy. For example, He et al. developed an ensemble learning method for face recognition, which trained KNN and SVM on the features and used weighted summation of decision matrices to obtain the optimal ensemble classifier. In general, combining multiple classifiers performs better than a single classifier. For example, Pan et al. integrated GTB (Gradient Tree Boosting), SVM and ERT (Extremely Randomized Trees) to predict hot spot residues between proteins and RNA, which yielded an ACC of 0.86.
To address the above issues in hot spot prediction, this paper proposes a novel ensemble machine learning system with feature extraction to identify hot spot residues. The method is based on protein sequence information alone. First, our method obtained 46 independent amino acid sequence properties from AAindex1 and the relative accessible surface area (relASA) from the NetSurfP website to encode protein sequences. Then, the method combined an auto-correlation function with a sliding window to encode these properties into amino acid features. Last, a new ensemble classifier, which combined k-Nearest-Neighbours (KNN) and SVM with a Gaussian radial basis function kernel, was built to train and test the curated data sets. Here, the publicly available LIBSVM software was used to predict hot spot residues. As a result, our model achieved good prediction performance on different data sets. On the ASEdb training set, our method achieved an F1 value of 0.92 and an MCC value of 0.87, both higher than those of state-of-the-art methods.
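The decision-fusion step can be illustrated with a minimal sketch. The paper's conclusion states that the optimal ensemble was obtained by majority voting over top-ranked base classifiers; the ranking score, the tie-breaking rule, and the function names `select_top` and `majority_vote` below are assumptions made for illustration, not the authors' implementation. The base classifiers themselves (SVM and KNN) are treated as black boxes that have already produced 0/1 predictions.

```python
from typing import List, Sequence


def select_top(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k best-scoring base classifiers.

    `scores` is a hypothetical per-classifier validation score (e.g. F1).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine 0/1 predictions from several base classifiers.

    predictions[c][s] is the label classifier c assigns to sample s.
    A sample is labeled 1 (hot spot) only if strictly more than half of
    the classifiers vote 1; ties fall to 0 (an assumption).
    """
    if not predictions:
        raise ValueError("need at least one base classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same samples")
    combined = []
    for votes in zip(*predictions):
        positives = sum(votes)
        combined.append(1 if 2 * positives > len(votes) else 0)
    return combined


# Toy example: three base classifiers, three residues.
base_predictions = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
validation_scores = [0.7, 0.9, 0.8]
top = select_top(validation_scores, k=3)
ensemble_labels = majority_vote([base_predictions[i] for i in top])
```

Using an odd number of base classifiers avoids ties altogether, which may be one reason an odd ensemble size is convenient in practice.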
Previous studies have used several definitions of hot spot residues. In alanine scanning mutagenesis experiments, when an interface residue of a PPI is mutated to alanine, the residue is defined as a hot spot if the change in binding free energy is greater than 2 kcal/mol and as a non-hot spot if the change is less than 0.4 kcal/mol; the remaining residues are excluded. Most previous studies adopted this criterion. Under this definition, the ratio of positive to negative instances is close to 1, which makes the criterion more reliable for training a model. According to this definition, two data sets were used in this work: the training set from the Alanine Scanning Energetics Database (ASEdb) and the test set from the binding interface database (BID). The data in both databases are verified by alanine scanning mutagenesis experiments. The BID data set is divided into four sub-groups: 'strong', 'intermediate', 'weak' and 'insignificant' interactions. Here, residues labeled 'strong' are considered hot spots and the remaining residues are considered non-hot spots for our model.
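The labeling rule can be written down as a short sketch. The thresholds and the BID categories come from the text; the function names and the treatment of values exactly at 2.0 kcal/mol (counted here as hot spots) are assumptions.

```python
from typing import Optional

HOT_SPOT_THRESHOLD = 2.0      # kcal/mol, from the definition in the text
NON_HOT_SPOT_THRESHOLD = 0.4  # kcal/mol, from the definition in the text


def label_by_ddg(ddg: float) -> Optional[int]:
    """Label an alanine-scanning record by its binding free energy change.

    Returns 1 for a hot spot, 0 for a non-hot spot, and None for residues
    in the intermediate range, which are excluded from the data set.
    Including the boundary value 2.0 as a hot spot is an assumption.
    """
    if ddg >= HOT_SPOT_THRESHOLD:
        return 1
    if ddg < NON_HOT_SPOT_THRESHOLD:
        return 0
    return None


def label_by_bid_strength(strength: str) -> int:
    """Label a BID record: only 'strong' interactions count as hot spots."""
    return 1 if strength.strip().lower() == "strong" else 0


# Toy records: (residue id, ddG in kcal/mol); the ids are made up.
records = [("A:K15", 2.6), ("A:T11", 0.1), ("A:G37", 1.2)]
labeled = [(rid, label_by_ddg(ddg)) for rid, ddg in records]
training_set = [(rid, y) for rid, y in labeled if y is not None]
```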
Databases for hot spots prediction
Independent test(Mix set)
Ensemble learning method
To identify whether residues are hot spots, protein sequences have to be encoded as numerical sequences. To better characterize protein sequences, the AAindex1 database was used, which contains 544 physicochemical and biochemical properties for the 20 types of amino acids. Since highly correlated properties may bias the predictions, properties with a correlation coefficient greater than 0.5 were removed in this work. First, the correlation coefficients, $CC_{p_i}$, between a property $p_i$ ($i = 1, \ldots, 544$) and the other properties are calculated. Then the number of correlated properties, $N_{p_i}$, is counted for the property $p_i$. The calculation is repeated for all 544 properties. After this process, 46 properties were obtained and used to characterize protein sequences. The details of the properties are listed in Additional file 1.
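A correlation-based filter of this kind can be sketched as follows. The text gives the 0.5 cutoff and the counting of correlated partners per property, but not the exact removal order, so the greedy strategy here (repeatedly drop the property with the most correlated partners) and the use of the absolute Pearson coefficient are assumptions. In the real setting each property would be a vector of 20 values, one per amino acid type.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def filter_correlated(props: Dict[str, List[float]], cutoff: float = 0.5) -> List[str]:
    """Greedily remove properties until no pair exceeds the cutoff.

    At each step, count for every remaining property how many other
    properties correlate with it above the cutoff, then drop the one
    with the highest count. Stops when all counts are zero.
    """
    remaining = dict(props)
    while True:
        counts = {
            name: sum(
                1
                for other, values in remaining.items()
                if other != name and abs(pearson(remaining[name], values)) > cutoff
            )
            for name in remaining
        }
        worst = max(counts, key=counts.get, default=None)
        if worst is None or counts[worst] == 0:
            return sorted(remaining)
        del remaining[worst]


# Toy example with made-up property names and 4-value vectors.
toy_props = {
    "prop_a": [1.0, 2.0, 3.0, 4.0],
    "prop_b": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with prop_a
    "prop_c": [1.0, -1.0, 1.0, -1.0],  # weakly correlated with both
}
kept = filter_correlated(toy_props)
```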
To reflect the importance of residue order in a protein sequence, an auto-correlation function was used to calculate the correlation of a property between one residue and its neighboring residues in the sequence, giving a one-dimensional feature.
where $h_l$ is the value of one amino acid property for the $l$-th residue, $L$ is the length of the protein sequence, and $M$ is the number of neighbors, which needs to be tuned.
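A minimal sketch of such an encoding is given below. It assumes one common form of the auto-correlation function that is consistent with the symbols defined above, $r_n = \frac{1}{L-n}\sum_{l=1}^{L-n} h_l\, h_{l+n}$ for $n = 1, \ldots, M$; this form is an assumption and may differ from the exact formula used by the authors (for example in normalization). The function name `autocorrelation` is hypothetical.

```python
from typing import List


def autocorrelation(h: List[float], max_lag: int) -> List[float]:
    """Auto-correlation features of one property along a sequence.

    h[l] is the property value of the residue at position l, len(h) is
    the sequence length L, and max_lag plays the role of M. The n-th
    output is the mean of h[l] * h[l + n] over all valid positions l.
    """
    length = len(h)
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < len(h)")
    return [
        sum(h[l] * h[l + n] for l in range(length - n)) / (length - n)
        for n in range(1, max_lag + 1)
    ]


# Toy example: a 4-residue property profile with M = 2.
features = autocorrelation([1.0, 2.0, 3.0, 4.0], max_lag=2)
```

Repeating this for each of the 46 selected properties and concatenating the results gives one feature vector per sequence segment; how the segments are cut by the sliding window is not specified here.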
In addition, the ASA value of each residue can be calculated by the web server NetSurfP (http://www.cbs.dtu.dk/services/NetSurfP/) and then used as a feature in this work. In total, every residue is represented by an input vector with 46*L features.
Many metrics are available to evaluate the quality of a machine learning model. Some of the most commonly used ones are accuracy (ACC), specificity (SPE), recall, F1 score (F1) and the Matthews correlation coefficient (MCC). Furthermore, Receiver Operating Characteristic (ROC) curves and area under the ROC curve (AUC) values can also be used as evaluation criteria. Among them, F1, MCC and AUC are the important metrics for comprehensively evaluating models [33, 34].
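For reference, the threshold-based metrics can be computed from the confusion matrix as in the sketch below, which uses the standard textbook definitions. The function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions; AUC is omitted because it needs ranked scores rather than hard labels.

```python
from math import sqrt
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """ACC, SPE, recall, precision, F1 and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of failing when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "SPE": ratio(tn, tn + fp),
        "recall": recall,
        "precision": precision,
        "F1": ratio(2 * precision * recall, precision + recall),
        "MCC": ratio(
            tp * tn - fp * fn,
            sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }


# Toy example: 3 hot spots and 3 non-hot spots, one error in each class.
scores = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```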
Performance of ensemble classifiers on different M for auto-correlation function
Performance of different classifiers on ASEdb
Performance of our model on train and test sets
Prediction performance of top 83 classifier on training and test sets
Prediction performance of model with top 83 classifiers on different test sets
Comparison with other methods
Prediction comparison of different methods on BID test sets
B-factor, individual atomic contacts and the co-occurring contacts
Physicochemical, structural neighborhood features
Structural neighborhood features
Feature selection algorithm
Comparison of performance under different feature selection on training set
Hu’s feature selection
Feature correlation analysis
The classification and quantity statistics of base classifiers
1-3, 5-14, 18-31, 33, 34, 36, 37, 39, 40, 43-46
Descriptor cluster analysis
The classification and quantity statistics of AAindex1 properties
Alpha and Turn propensities
GEIM800103, CHAM83102, QIAN880129, ROBB760111, RICJ880114
RACS820104, QIAN880117, WOLS870103, FASG760104, ISOY800106
ROBB760107, QIAN880139, QIAN880113, RICJ880117, SNEP660104
NAKH900113, QIAN880128, PRAM820101, KHAG800101, SUEM840102
WERD780103, RICJ880104, VASM830102, ROSM880103, RICJ880105
ISOY800107, RACS820103, JOND750102, TANS770108, KLEP840101, VELV850101
GERO01103, NADH010107, AURR980118, AURR980120, WILM950104
This paper proposed a novel ensemble system that integrates feature selection and two types of base classifiers to achieve the best performance in hot spot prediction. Notably, we used only the amino acid sequence information of proteins and the relative accessible surface area (relASA) feature. First, 46 descriptors of amino acids were obtained from the AAindex1 database. Next, an auto-correlation function was combined with a sliding window to obtain amino acid features for each protein sequence. Finally, the encoded data were fed into an ensemble model containing SVM and KNN base classifiers. After training and testing, the optimal ensemble model was obtained by majority voting. The ensemble model with the top 83 classifiers yielded the best performance on the training and test datasets. On ASEdb and BID, the model achieved F1 scores of 0.92 and 0.89, respectively. On different independent test sets (the SKEMPI, dbMPIKT and Mix datasets), our model achieved good F1 scores of 0.8579, 0.8472 and 0.8657, respectively. In comparison with other state-of-the-art methods, our model performs the best.
The publication costs of this article were covered by the National Natural Science Foundation of China (No. 61672035). This work was also supported by the National Natural Science Foundation of China (Nos. 61472282 and 61872004), Anhui Province Funds for Excellent Youth Scholars in Colleges (gxyqZD2016068) and Anhui Scientific Research Foundation for Returned Scholars.
Availability of data and materials
About this supplement
This article has been published as part of BMC Systems Biology Volume 12 Supplement 9, 2018: Proceedings of the 29th International Conference on Genome Informatics (GIW 2018): systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/supplements/volume-12-supplement-9.
QL and PC conceived the study; QL, JZ and BW participated in the database design; QL and PC carried it out and drafted the manuscript. All authors revised the manuscript critically. JL, BW and PC approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Caufield JH, Wimble C, Shary S, Wuchty S, Uetz P. Bacterial protein meta-interactomes predict cross-species interactions and protein function. BMC Bioinformatics. 2017; 18(1):171.
- Xu D, Si Y, Meroueh SO. A computational investigation of small-molecule engagement of hot spots at protein–protein interaction interfaces. J Chem Inf Model. 2017; 57(9):2250–2272.
- Saraswathi S, Fernández-Martínez JL, Koliński A, Jernigan RL, Kloczkowski A. Distributions of amino acids suggest that certain residue types more effectively determine protein secondary structure. J Mol Model. 2013; 19(10):4337–48.
- Wells JA. Systematic mutational analyses of protein-protein interfaces. Methods Enzymol. 1991; 202(1):390–411.
- Romero-Durana M, Pallara C, Glaser F, Fernández-Recio J. Modeling Binding Affinity of Pathological Mutations for Computational Protein Design. New York: Springer; 2017.
- Fischer TB, Arunachalam KV, Bailey D, Mangual V, Bakhru S, Russo R, Huang D, Paczkowski M, Lalchandani V, Ramachandra C. The binding interface database (BID): a compilation of amino acid hot spots in protein interfaces. Bioinformatics. 2003; 19(11):1453.
- Thorn KS, Bogan AA. ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics. 2001; 17(3):284–5.
- Hu S-S, Chen P, Wang B, Li J. Protein binding hot spots prediction from sequence only by a new ensemble learning method. Amino Acids. 2017; 49:1773–85. https://doi.org/10.1007/s00726-017-2474-6.
- Liu B, Wu H, Zhang D, Wang X, Chou KC. Pse-analysis: a python package for dna/rna and protein/peptide sequence analysis based on pseudo components and kernel methods. Oncotarget. 2017; 8(8):13338–43.
- Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-one: a web server for generating various modes of pseudo components of dna, rna, and protein sequences. Nucleic Acids Res. 2015; 43(Web Server issue):65–71.
- Chen Z, Zhao P, Li F, Leier A, Marquezlago TT, Wang Y, Webb GI, Smith AI, Daly RJ, Chou KC. ifeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics. 2018; 34(14):2499–2502.
- Kim DE, Chivian D, Baker D. Protein structure prediction and analysis using the robetta server. Nucleic Acids Res. 2004; 32(Web Server issue):526–31.
- Liu Q, Hoi SC, Kwoh CK, Wong L, Li J. Integrating water exclusion theory into beta contacts to predict binding free energy changes and binding hot spots. BMC Bioinformatics. 2014; 15(1):57.
- Wang L, Hou Y, Quan H, Xu W, Bao Y, Li Y, Yuan F, Zou S. A compound-based computational approach for the accurate determination of hot spots. Protein Sci. 2013; 22(8):1060–70.
- Xia JF, Zhao XM, Song J, Huang DS. APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility. BMC Bioinformatics. 2010; 11(1):174.
- Ye L, Kuang Q, Jiang L, Luo J, Jiang Y, Ding Z, Li Y, Li M. Prediction of hot spots residues in protein–protein interface using network feature and microenvironment feature. Chemometr Intell Lab Syst. 2014; 131(3):16–21.
- He Y, Wu H, Zhong R. Face recognition based on ensemble learning with multiple lbp features. Appl Res Comput. 2018; 35(1):292–295.
- Pan Y, Wang Z, Zhan W, Deng L. Computational identification of binding energy hot spots in protein-rna complexes using an ensemble approach. Bioinformatics. 2017; 34(9):1473–1480.
- Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M. Aaindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008; 36(Database issue):202–5.
- Petersen B, Petersen TN, Andersen P, Nielsen M, Lundegaard C. A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct Biol. 2009; 9:51. https://doi.org/10.1186/1472-6807-9-51.
- Guo G, Wang H, Bell D, Bi Y, Greer K. Knn model-based approach in classification. Lect Notes Comput Sci. 2003; 2888:986–96.
- Romero R, Iglesias EL, Borrajo L. A linear-rbf multikernel svm to classify big text corpora. Biomed Res Int. 2015; 2015:878291.
- Chang CC, Lin CJ. Libsvm: A library for support vector machines. ACM Trans Intell Syst Technol. 2011; 2(3):1–27.
- Tuncbag N, Gursoy A, Keskin O. Identification of computational hot spots in protein interfaces: combining solvent accessibility and inter-residue potentials improves the accuracy. Bioinformatics. 2009; 25(12):1513–20.
- Li L, Kuang H, Zhang Y, Zhou Y, Wang K, Wan Y. Prediction of eukaryotic protein subcellular multi-localisation with a combined knn-svm ensemble classifier. J Comput Biol Bioinforma Res. 2011; 3:15–24.
- Moal IH, Fernández-Recio J. SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics. 2012; 28(20):2600–7.
- Liu Q, Chen P, Wang B, Zhang J, Li J. dbMPIKT: a database of kinetic and thermodynamic mutant protein interactions. BMC Bioinformatics. 2018; 19:455.
- Chen P, Li J, Wong L, Kuwahara H, Huang JZ, Gao X. Accurate prediction of hot spot residues through physicochemical characteristics of amino acid sequences. Proteins Struct Funct Bioinforma. 2013; 81(8):1351–62.
- Zhang SW, Pan Q, Zhang HC, Shao ZC, Shi JY. Prediction of protein homo-oligomer types by pseudo amino acid composition: approached with an improved feature extraction and naive bayes feature fusion. Amino Acids. 2006; 30(4):461–8.
- Marsh JA, Teichmann SA. Relative solvent accessible surface area predicts protein conformational changes upon binding. Structure. 2011; 19(6):859–67.
- Polikar R. Ensemble learning. Scholarpedia. 2009; 4(1):1–34.
- Zhang H, Berg AC, Maire M, Malik J. Svm-knn: Discriminative nearest neighbor classification for visual category recognition. Proc IEEE Conf Comput Vis Pattern Recognit. 2006; 2:2126–36.
- Bradley AP. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997; 30(7):1145–59.
- Chen P, Hu S, Zhang J, Gao X, Li J, Xia J, Wang B. A sequence-based dynamic ensemble learning system for protein ligand-binding site prediction. IEEE/ACM Trans Comput Biol Bioinforma. 2016; 13:901–12. https://doi.org/10.1109/TCBB.2015.2505286.
- Ting KM. Confusion Matrix, Encyclopedia of Machine Learning and Data Mining. Boston: Springer; 2017.
- Tuncbag N, Keskin O, Gursoy A. Hotpoint: hot spot prediction server for protein interfaces. Nucleic Acids Res. 2010; 38(Web Server issue):402.
- Liu Q, Ren J, Song J, Li J. Co-occurring atomic contacts for the characterization of protein binding hot spots. PloS ONE. 2015; 10(12):0144486.
- Xia J, Yue Z, Di Y, Zhu X, Zheng CH. Predicting hot spots in protein interfaces based on protrusion index, pseudo hydrophobicity and electron-ion interaction pseudopotential features. Oncotarget. 2016; 7(14):18065–75.
- Deng L, Guan J, Wei X, Yi Y, Zhang QC, Zhou S. J Comput Biol J Comput Mol Cell Biol. 2013; 20(11):878–91.
- Hu SS, Peng C, Bing W, Li J. Protein binding hot spots prediction from sequence only by a new ensemble learning method. Amino Acids. 2017; 49(1):1–13.
- Zhang Y, Zha Y, Zhao S, Xiuquan DU. Protein structure class prediction based on autocorrelation coefficient and pseaac. J Front Comput Sci Technol. 2014; 8(1):103–110.
- Otaki JM, Tsutsumi M, Gotoh T, Yamamoto H. Secondary structure characterization based on amino acid composition and availability in proteins. J Chem Inf Model. 2010; 50(4):690–700.
- Hubert L, Baker FB. Data analysis by single-link and complete-link hierarchical clustering. J Educ Stat. 1976; 1(2):87–111.
- Janson G, Zhang C, Prado MG, Paiardini A. Pymod 2.0: improvements in protein sequence-structure analysis and homology modeling within pymol. Bioinformatics. 2017; 33(3):444.
- Dennis MS, Eigenbrot C, Skelton NJ, Ultsch MH, Santell L, Dwyer MA, O’Connell MP, Lazarus RA. Peptide exosite inhibitors of factor viia as anticoagulants. Nature. 2000; 404(6777):465–70.