Improved protein-protein interactions prediction via weighted sparse representation model combining continuous wavelet descriptor and PseAA composition
- Yu-An Huang†1,
- Zhu-Hong You†2,
- Xing Chen3 and
- Gui-Ying Yan4
https://doi.org/10.1186/s12918-016-0360-6
© The Author(s). 2016
- Published: 23 December 2016
Abstract
Background
Protein-protein interactions (PPIs) are essential to most biological processes. As bioscience has entered the era of genomics and proteomics, there is a growing demand for knowledge about PPI networks. High-throughput biological technologies can identify new PPIs, but they are expensive, time-consuming, and tedious, so computational methods for predicting PPIs have an important role. In recent years, an increasing number of computational methods, such as protein structure-based approaches, have been proposed for predicting PPIs. The major limitation of these methods is that they need prior information about the proteins to infer PPIs. It is therefore important to develop computational methods that use only the information in protein amino acid sequences.
Results
Here, we report a highly efficient approach for predicting PPIs. The main improvements come from a novel protein sequence representation that combines a continuous wavelet descriptor with Chou’s pseudo amino acid composition (PseAAC), and from adopting a weighted sparse representation based classifier (WSRC). Cross-validated on the PPI datasets of Saccharomyces cerevisiae, Human and H. pylori, the method achieves accuracies of 92.50%, 95.54% and 84.28% respectively, better than previously proposed methods. Extensive experiments are performed to compare the proposed method with a state-of-the-art Support Vector Machine (SVM) classifier.
Conclusions
The outstanding results yielded by our model show that the proposed feature extraction method, which combines two kinds of descriptors, has strong expressive ability and is expected to provide comprehensive and effective information for machine learning-based classification models. In addition, the prediction performance in the comparison experiments shows that the combined feature and WSRC cooperate well. Thus, the proposed method is a very efficient method for predicting PPIs and may be a useful supplementary tool for future proteomics studies.
Keywords
- Protein-protein interactions
- Protein sequence
- Continuous wavelet transform
- Sparse representation based classifier
Background
In this post-genomic era, proteins, as the major components of organisms, are widely studied because of their important role in nearly all cell functions, including DNA transcription and replication, metabolic cycles, and signaling cascades. Research shows that many functions of complex biological systems seem to be determined more closely by the interactions among their components than by the individual components themselves. Protein-protein interaction networks have therefore been drawing increasing research attention and interest. Moreover, recent advances in practical applications in drug discovery have further promoted studies on PPIs, which provide great insight into the mechanisms of human diseases. Efforts have been devoted to developing experimental methods for detecting PPIs and constructing protein interaction networks, such as yeast two-hybrid (Y2H) screens [1, 2], tandem affinity purification (TAP) [3], mass spectrometric protein complex identification (MS-PCI) [3] and other high-throughput biological techniques. However, experimental methods are expensive, time-consuming and tedious, and experimentally identified PPIs are usually associated with high rates of both false positive and false negative results. To detect a larger fraction of the whole PPI network and to make use of the vast and valuable biological data provided by experimental methods, there is a growing need for computational methods capable of identifying PPIs.
A number of computational approaches have been proposed for detecting PPIs based on various data types, such as genomic information, protein domains and protein structure information [4]. However, these methods are limited by their need for prior information about proteins, and their accuracy is sensitive to the reliability of that information. In addition, newly discovered protein sequences are accumulating exponentially in many different types of databases. It is therefore important to develop sequence-based PPI prediction systems that mine information directly from amino acid sequences. Many researchers have attempted to establish sequence-based systems for predicting PPIs and have obtained some preliminary results. Zhou et al. [5] proposed an approach combining a support vector machine with local protein sequence descriptors, which account for the interactions between sequentially distant amino acid residues. When applied to predicting yeast PPIs, this method yielded a promising accuracy of 88.56%. Najafabadi et al. [6] found that similarity in codon usage is a strong predictor of protein interactions and reported a 75% increase in sensitivity in their experiments when codon usage was taken into account. Shi et al. [7] explored a descriptor named correlation coefficient transformation and used it with a support vector machine; this method adequately considers the neighboring effect and the level of the correlation coefficient.
Computational systems for predicting pairwise protein interactions usually rely on two main components: feature extraction and a machine learning model. Efficient feature descriptors can mine useful information and normalize proteins of different lengths to the same size. Furthermore, effective feature extraction methods can improve prediction performance. A number of sequence-based feature extraction approaches have been proposed so far, and most of them consider the sequence order effect. Employing graphic approaches to mine protein information, by contrast, is novel. In this work, we adopt a novel descriptor named CW-LBP and show that it is sufficient to reveal the complicated relations between protein interactions and amino acid sequences. This sequence representation first encodes the protein sequence as a numerical sequence by substituting each amino acid with the value of a specific physicochemical property. Then, the Meyer continuous wavelet transformation is employed to represent the protein sequence as an image. Finally, an image texture descriptor, Local Binary Pattern Histogram Fourier (LBP-HF), is used to extract features. To describe a protein in a discrete model that can provide comprehensive information, Chou’s pseudo amino acid composition (PseAAC) is employed as a second kind of feature descriptor. PseAAC is a popular protein descriptor that uses its first 20 factors to reflect the composition of the 20 conventional amino acids (AA) and λ additional factors to reflect the influence of sequence order.
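The first step of the CW-LBP representation, turning a sequence into a numerical signal, can be sketched as follows. This is a minimal illustration, not the authors' code: the Kyte-Doolittle hydropathy scale is an assumed choice (the text only says a physicochemical property such as hydrophobicity is used), and `encode_sequence` is a hypothetical helper name.

```python
# Kyte-Doolittle hydropathy values. ASSUMPTION: the paper does not say which
# hydrophobicity index it uses; this scale is only a well-known example.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(seq, scale=KYTE_DOOLITTLE):
    """Replace each residue with its property value.

    Letters that are not one of the 20 standard amino acids are skipped.
    """
    return [scale[aa] for aa in seq.upper() if aa in scale]


if __name__ == "__main__":
    print(encode_sequence("MKVLA"))  # [1.9, -3.9, 4.2, 3.8, 1.8]
```

The resulting list has one value per residue, so proteins of different lengths still give signals of different lengths; the later wavelet and texture steps are what bring them to a fixed size.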
As the second step of computational methods for predicting PPIs, a wide range of machine learning models have been applied in previous works. However, popular classifiers such as SVM [8, 9] and neural networks [10] need much effort to tune their parameters. Recently, Sparse Representation based Classification (SRC) has emerged as a new technique in face recognition because of its robustness to illumination variations, occlusions, and random noise. SRC is therefore a natural match for the proposed graphic-based features (i.e., LBP-HF descriptors). As indicated in [11], the Weighted Sparse Representation based Classifier (WSRC), a variant of basic SRC, additionally considers the local information of each training sample and therefore has a stronger classification ability than the original SRC. In addition, WSRC needs little manual intervention to tune its parameters, which is a significant advantage given the vast volume of protein sequence data. Thus, the WSRC algorithm is used as the machine learning tool to make the final prediction based on the extracted feature sets.
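As a hedged sketch of what the classifier described above solves, weighted sparse representation can be written as a weighted l1 minimization followed by a class-wise residual test. The symbols are assumptions introduced here for illustration: A holds the training feature vectors as columns, y is the test sample, W is a diagonal matrix of locality weights computed from the distance between y and each training sample, δ_c keeps only the coefficients that belong to class c, and ε is the reconstruction tolerance. The exact form of the weights is given in [11].

```latex
\hat{x} \;=\; \arg\min_{x} \; \lVert W x \rVert_{1}
\quad \text{subject to} \quad \lVert y - A x \rVert_{2} \le \varepsilon ,
\qquad
\operatorname{class}(y) \;=\; \arg\min_{c} \; \lVert y - A\,\delta_{c}(\hat{x}) \rVert_{2}
```

Setting W to the identity recovers the basic SRC formulation; the test sample is assigned to the class whose training samples reconstruct it with the smallest residual.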
In this study, we report a novel computational method for predicting protein-protein interactions from amino acid sequences, using the WSRC classifier and combined features consisting of CW-LBP and PseAAC descriptors. First, each protein is transformed into a CW image derived from its amino acid sequence, and CW-LBP features are then extracted from these images using the LBP-HF texture descriptor. Second, for a more comprehensive representation of protein sequences, we extract Chou’s pseudo amino acid composition of each sample and merge it with the CW-LBP descriptor to form the whole feature set. As a result, our feature representation of one protein has 216 dimensions, of which 176 come from the CW-LBP descriptor and 40 from Chou’s PseAA composition. Finally, WSRC is used for classification. To evaluate its performance, the proposed approach is applied to three different PPI data sets: Saccharomyces cerevisiae, Human, and H. pylori.
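To make the texture step more concrete, here is a minimal sketch of the plain 8-neighbour local binary pattern (LBP) operator on a small image stored as a list of lists. It is only the building block: LBP-HF [15] goes further by working on the histogram of uniform patterns and taking discrete Fourier transform magnitudes to gain rotation invariance, which is not shown. The function names and the neighbour ordering are assumptions made for illustration.

```python
def lbp_codes(image):
    """Return the 8-neighbour LBP code of every interior pixel.

    `image` is a 2D list of numbers. A neighbour contributes a 1-bit when its
    value is greater than or equal to the centre pixel.
    """
    # Neighbour offsets in a fixed circular order, starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(image):
    """Count how often each of the 256 possible LBP codes occurs."""
    hist = [0] * 256
    for row in lbp_codes(image):
        for code in row:
            hist[code] += 1
    return hist


if __name__ == "__main__":
    toy = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    print(lbp_codes(toy))  # [[120]]
```

The histogram has a fixed length regardless of image size, which is what lets a texture descriptor map proteins of different lengths to feature vectors of equal size.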
Results
Evaluation measures
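The tables in this section report accuracy (Accu.), precision (Prec.), sensitivity (Sen.) and the Matthews correlation coefficient (MCC). A minimal sketch of these measures is given here, assuming the standard confusion-matrix definitions with TP, FP, TN and FN as the counts of true positives, false positives, true negatives and false negatives; the function name is hypothetical.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, sensitivity and MCC from raw counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy counts, not taken from the paper.
    for name, value in classification_metrics(90, 5, 95, 10).items():
        print(f"{name}: {100 * value:.2f}%")
```

The tables express every measure, including MCC, as a percentage, so the values returned here would be multiplied by 100 for comparison.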
Assessment of prediction ability
For the sake of impartiality, we set the same parameters (σ = 1.5, ε = 0.00005) for WSRC when applying the proposed method to predict PPIs on the Saccharomyces cerevisiae and H. pylori datasets. To minimize overfitting of the prediction model and to test the robustness of the proposed method, 5-fold cross-validation was used in our experiments. In 5-fold cross-validation, the dataset is divided into five parts, four of which are used for training and the remaining one for testing. In this way, five models were generated from the original dataset.
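The 5-fold procedure described above can be sketched with the standard library alone. This is an illustrative split, not the authors' code: the shuffling seed and the helper name `k_fold_indices` are assumptions, and the sample count of 11188 is the size of the Saccharomyces cerevisiae data set given in the Methods.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold, (train, test) in enumerate(k_fold_indices(11188), start=1):
        print(f"fold {fold}: {len(train)} training pairs, {len(test)} test pairs")
```

Each sample lands in exactly one test fold, so the five models together are evaluated on the whole data set without ever testing on a pair seen during training.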
5-fold cross validation result obtained in predicting Yeast PPIs dataset
Test set | Accu.(%) | Prec.(%) | Sen.(%) | MCC(%) | AUC(%) |
---|---|---|---|---|---|
1 | 93.43 | 96.98 | 89.93 | 87.70 | 97.70 |
2 | 92.27 | 95.01 | 89.39 | 85.71 | 96.99 |
3 | 92.36 | 96.62 | 87.64 | 85.81 | 97.39 |
4 | 92.62 | 95.65 | 89.19 | 89.30 | 97.09 |
5 | 91.83 | 95.10 | 87.95 | 84.94 | 96.80 |
Average | 92.50 ± 0.59 | 95.87 ± 0.89 | 88.82 ± 0.98 | 86.09 ± 1.02 | 97.20 ± 0.35 |
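The "Average" row can be checked from the per-fold values. The short sketch below uses the five accuracies from the table above and assumes the reported spread is the sample standard deviation (n − 1 in the denominator), which reproduces the reported figure; `summarize` is a hypothetical helper.

```python
import statistics


def summarize(values):
    """Format a list of per-fold scores as 'mean ± sample standard deviation'."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return f"{mean:.2f} ± {std:.2f}"


if __name__ == "__main__":
    yeast_accuracy = [93.43, 92.27, 92.36, 92.62, 91.83]
    print(summarize(yeast_accuracy))  # 92.50 ± 0.59
```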
5-fold cross validation result obtained in predicting H.pylori PPIs dataset
Test set | Accu.(%) | Prec.(%) | Sen.(%) | MCC(%) | AUC(%) |
---|---|---|---|---|---|
1 | 85.03 | 82.18 | 90.67 | 74.28 | 92.36 |
2 | 83.30 | 78.25 | 91.12 | 71.91 | 91.33 |
3 | 84.34 | 80.00 | 90.46 | 73.44 | 91.84 |
4 | 84.17 | 82.99 | 89.27 | 72.83 | 92.04 |
5 | 84.59 | 78.85 | 91.11 | 73.79 | 91.96 |
Average | 84.28 ± 0.64 | 80.45 ± 2.07 | 90.54 ± 0.77 | 73.25 ± 0.92 | 91.91 ± 0.37 |
The flowchart for the feature extraction process
ROC curves from proposed method result for Saccharomyces cerevisiae PPIs dataset
ROC curves from proposed method result for H.pylori PPIs dataset
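The area under a ROC curve (the AUC column in the tables above) can be computed directly from classifier scores. The sketch below uses the rank-based definition: the AUC is the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name and the toy data are assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative sample."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: three of the four positive/negative pairs are ordered correctly.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```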
Comparison with SVM-based method
5-fold cross validation result obtained in predicting Human PPIs dataset
Classification model | Testing set | Accu.(%) | Prec.(%) | Sen.(%) | MCC(%) |
---|---|---|---|---|---|
WSRC | 1 | 95.53 | 99.14 | 91.17 | 91.35 |
WSRC | 2 | 95.89 | 98.61 | 92.59 | 92.06 |
WSRC | 3 | 95.22 | 99.19 | 91.09 | 90.86 |
WSRC | 4 | 95.83 | 98.74 | 92.31 | 91.94 |
WSRC | 5 | 95.22 | 99.04 | 91.08 | 90.85 |
WSRC | Average | 95.54 ± 0.32 | 98.95 ± 0.25 | 91.65 ± 0.74 | 91.41 ± 0.58 |
SVM | 1 | 87.68 | 87.60 | 85.64 | 78.26 |
SVM | 2 | 87.56 | 88.04 | 85.18 | 78.10 |
SVM | 3 | 87.68 | 88.66 | 86.14 | 78.38 |
SVM | 4 | 90.07 | 89.54 | 89.31 | 82.05 |
SVM | 5 | 87.63 | 89.92 | 84.05 | 78.23 |
SVM | Average | 88.13 ± 1.09 | 88.75 ± 0.98 | 86.06 ± 1.97 | 79.00 ± 1.71 |
ROC curves from proposed method result for Human PPIs dataset
ROC curves from SVM-based method result for Human PPIs dataset
Comparison with other methods
Performance comparison of different methods on the Yeast dataset
Model | Method | Accu.(%) | Prec.(%) | Sen.(%) | MCC(%) |
---|---|---|---|---|---|
Guo’s work [23] | ACC | 89.33 ± 2.67 | 88.87 ± 6.16 | 89.93 ± 3.68 | N/A |
Guo’s work [23] | AC | 87.36 ± 1.38 | 87.82 ± 4.33 | 87.30 ± 4.68 | N/A |
Zhou’s work [5] | SVM + LD | 88.56 ± 0.33 | 89.50 ± 0.60 | 87.37 ± 0.22 | 77.15 ± 0.68 |
Yang’s work [24] | Cod1 | 75.08 ± 1.13 | 74.75 ± 1.23 | 75.81 ± 1.20 | N/A |
Yang’s work [24] | Cod2 | 80.04 ± 1.06 | 82.17 ± 1.35 | 76.77 ± 0.69 | N/A |
Yang’s work [24] | Cod3 | 80.41 ± 0.47 | 81.86 ± 0.99 | 78.14 ± 0.90 | N/A |
Yang’s work [24] | Cod4 | 86.15 ± 1.17 | 90.24 ± 1.34 | 81.03 ± 1.74 | N/A |
Proposed method | WSRC | 92.50 ± 0.59 | 95.87 ± 0.89 | 88.82 ± 0.98 | 86.09 ± 1.02 |
Discussion
In the proposed model, the protein features are extracted by successively applying numerical sequence encoding, the continuous wavelet transform and Local Binary Pattern Histogram Fourier (see Fig. 1). This feature extraction method is based mainly on the assumption that protein sequences carry enough information for predicting protein-protein interactions, and on the fact that the hydrophobicity of proteins influences the interaction process. To retain comprehensive information during feature extraction, two kinds of descriptors, namely CW-LBP and PseAAC, are adopted to capture the continuous and discrete information, respectively. In addition, to combine well with the CW-LBP feature and to obtain a prediction model that needs little manual intervention, the weighted sparse representation-based classifier is used to make the final prediction.
Several aspects of the proposed approach are worth highlighting based on the experimental results. (1) The outstanding prediction performance shows that the continuous wavelet transformation cooperates well with the Local Binary Pattern Histogram Fourier descriptor for protein feature extraction. (2) The comparison of WSRC with SVM demonstrates that WSRC combines well with the graphic-based feature extraction method, and the use of CW-LBP may help WSRC realize its full potential. (3) WSRC yielded stable and satisfactory prediction performance while keeping the same parameters in all experiments. Compared with other conventional classifiers, including SVM, WSRC has the valuable advantage that it does not need much manual intervention to tune its parameters, and it therefore has great potential for large-scale prediction of new PPIs. (4) Approaches using ensemble classifiers are known to usually achieve more accurate and robust performance than methods using a single classifier. However, using a single classifier, our proposed model obtains performance similar to that of methods using ensemble classifiers such as boosting. These comparisons demonstrate that the WSRC-based model combining the continuous wavelet transform descriptor and PseAA composition can improve prediction accuracy compared with current state-of-the-art classification methods.
Conclusions
The growing demand for PPI knowledge is promoting the development of computational methods for predicting PPIs. In this paper, we propose a new PPI prediction model that uses only the information in protein sequences. Since hydrophobic interaction plays an important role in protein interactions, we take the hydrophobic property of amino acids into account during feature extraction by transforming protein sequences into numerical ones. We then adopt continuous wavelet descriptors and Chou’s pseudo amino acid composition, which aim at capturing the continuous and discrete information from the hydrophobicity sequences. In addition, the weighted sparse representation based classifier is used as the classification model because it needs little manual intervention in parameter adjustment and cooperates well with the features.
Results obtained from our experiments show that representing proteins with graphic texture extraction approaches is a promising strategy, and that our proposed method is feasible and effective. When applied to the Saccharomyces cerevisiae, Human and H. pylori datasets, the proposed method achieved promising results, with high average accuracies of 92.50%, 95.54% and 84.28% respectively.
Methods
Gold standard datasets
We verify the proposed method on a high-confidence Saccharomyces cerevisiae PPI data set gathered from the publicly available Database of Interacting Proteins (DIP). We removed protein pairs that have ≥40% sequence identity or whose lengths are less than 50 residues. The remaining 5594 protein pairs form the positive data set. In addition, 5594 protein pairs whose sub-cellular localizations differ were chosen to build the negative data set. As a result, the whole data set consists of 11188 protein pairs, half positive and half negative.
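Two of the construction steps above, the length filter and the choice of negative pairs from proteins with different localizations, can be sketched as follows. This is a toy illustration under stated assumptions: the protein identifiers, the dictionaries `sequences` and `localization`, and both function names are hypothetical, and the sequence-identity filter is omitted because it needs an alignment or clustering tool.

```python
from itertools import combinations


def filter_short_pairs(pairs, sequences, min_len=50):
    """Keep only pairs in which both proteins have at least `min_len` residues."""
    return [(a, b) for a, b in pairs
            if len(sequences[a]) >= min_len and len(sequences[b]) >= min_len]


def candidate_negatives(localization, positives):
    """List protein pairs with different localizations that are not known positives."""
    known = {frozenset(pair) for pair in positives}
    return [(a, b) for a, b in combinations(sorted(localization), 2)
            if localization[a] != localization[b]
            and frozenset((a, b)) not in known]


if __name__ == "__main__":
    # Hypothetical toy data, not from DIP.
    sequences = {"P1": "A" * 60, "P2": "A" * 55, "P3": "A" * 30, "P4": "A" * 80}
    positives = filter_short_pairs([("P1", "P2"), ("P1", "P3")], sequences)
    localization = {"P1": "nucleus", "P2": "nucleus", "P4": "membrane"}
    print(positives)                                     # [('P1', 'P2')]
    print(candidate_negatives(localization, positives))  # [('P1', 'P4'), ('P2', 'P4')]
```

In practice the candidate list would be sampled down so that the negative set has the same size as the positive set, as in the balanced 5594 + 5594 data set described above.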
To demonstrate the generality of the proposed method, we also verify our approach on two other PPI data sets. The first is collected from the Human Protein Reference Database (HPRD). We removed protein pairs that have ≥25% sequence identity and used the remaining 3899 experimentally verified protein-protein pairs, from 2502 different human proteins, as the gold standard positive dataset. For the gold standard negative dataset, we followed previous work [12] in assuming that proteins in different subcellular compartments do not interact with each other. In this way, we obtained 4262 protein pairs from 661 different human proteins as the negative dataset. Consequently, the Human dataset consists of 8161 protein pairs. The second PPI dataset is composed of 2916 Helicobacter pylori protein pairs (1458 interacting pairs and 1458 non-interacting pairs) as described by Martin et al. [13].
Continuous wavelet transformation
Wavelets are very effective and popular descriptors for all kinds of applications. Li et al. [14] first used wavelet features to describe protein sequences, which offers a novel insight into mining protein information. Compared with the Fourier transform, the wavelet transform has a completely different merit function: it uses functions that are localized in both real and Fourier space, whereas the Fourier transform decomposes the input signal into sines and cosines. As an implementation of the wavelet transform, the continuous wavelet transform (CWT) uses arbitrary scales and almost arbitrary wavelets. Because its redundancy tends to reinforce the traits of the signal, continuous analysis is often easier to interpret.
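A continuous wavelet transform turns a one-dimensional signal into a scale-by-position matrix, which is what allows a protein to be treated as an image. The sketch below is a direct, unoptimized implementation in plain Python. It deliberately swaps in the Mexican hat (Ricker) wavelet, which has a simple closed form, for the Meyer wavelet named in the paper, which is defined in the frequency domain; the function names, the scales and the toy signal are assumptions.

```python
import math


def mexican_hat(u):
    """Mexican hat (Ricker) wavelet, used here as a stand-in for Meyer."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - u * u) * math.exp(-u * u / 2.0)


def cwt(signal, scales, wavelet=mexican_hat):
    """Return a len(scales) x len(signal) matrix of wavelet coefficients.

    The coefficient at scale a and position b is
    (1 / sqrt(a)) * sum_t signal[t] * wavelet((t - b) / a).
    """
    coeffs = []
    for a in scales:
        row = []
        for b in range(len(signal)):
            total = sum(x * wavelet((t - b) / a) for t, x in enumerate(signal))
            row.append(total / math.sqrt(a))
        coeffs.append(row)
    return coeffs


if __name__ == "__main__":
    # Toy hydrophobicity-like signal; each row of the result is one scale.
    signal = [1.9, -3.9, 4.2, 3.8, 1.8, -0.4, -3.5, 2.8]
    image = cwt(signal, scales=[1, 2, 4])
    print(len(image), "scales x", len(image[0]), "positions")
```

Small scales respond to residue-level changes in the signal, while large scales respond to slower trends along the sequence, so the rows of the matrix can be read as the rows of an image.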
Local binary pattern histogram fourier (LBP-HF)
Pseudo amino acid composition (PseAAC)
Due to its simplicity and effectiveness, the amino acid composition (AAC) model has become a popular feature description for detecting protein attributes. To avoid losing sequence-order information, pseudo amino acid composition [16] was proposed, which adds values that reflect the influence of sequence order. PseAAC formed by this concatenation therefore has stronger representation ability than the traditional AAC. Several studies [17] have shown that many useful descriptors can be produced when the amino acid sequence is coupled with other information related to the physicochemical properties of amino acids. For this reason, we applied the hydrophobicity index of amino acids in producing PseAAC descriptors. In this work, we adopted the autocovariance (AC) approach, which is one of the sequence-based variants of Chou’s pseudo amino acid composition.
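As an illustration of how 20 composition factors and λ sequence-order factors combine into one vector, the sketch below implements the classic form of Chou's pseudo amino acid composition [16] with a single property. It is not necessarily the AC variant the authors used, and several choices are assumptions: the Kyte-Doolittle scale as the hydrophobicity index, λ = 20 (inferred from the 40-dimensional descriptor mentioned earlier), and the weight w = 0.05.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: Kyte-Doolittle hydropathy as the hydrophobicity index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalized_scale(scale):
    """Rescale the property to zero mean and unit variance over the 20 residues."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (scale[a] - mean) / std for a in AMINO_ACIDS}


def pse_aac(seq, lam=20, weight=0.05, scale=KYTE_DOOLITTLE):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    h = normalized_scale(scale)
    residues = [a for a in seq.upper() if a in h]
    if len(residues) <= lam:
        raise ValueError("sequence must be longer than lam")
    # First 20 factors: plain amino acid composition.
    freqs = [residues.count(a) / len(residues) for a in AMINO_ACIDS]
    # Next lam factors: mean squared property difference at sequence distance k.
    thetas = []
    for k in range(1, lam + 1):
        diffs = [(h[residues[i]] - h[residues[i + k]]) ** 2
                 for i in range(len(residues) - k)]
        thetas.append(sum(diffs) / len(diffs))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


if __name__ == "__main__":
    vector = pse_aac("ACDEFGHIKLMNPQRSTVWY" * 3)
    print(len(vector))  # 40
```

Because every factor is divided by the same denominator, the components sum to one, so the vector keeps the interpretation of a composition while also encoding order information up to distance λ.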
Weighted sparse representation based classification (WSRC)
Declarations
Acknowledgements
ZY and YH were supported by the National Natural Science Foundation of China under Grant No. 61572506, in part by the Pioneer Hundred Talents Program of Chinese Academy of Sciences. XC was supported by the National Natural Science Foundation of China under Grant No. 11301517 and 11631014. GY was supported by National Natural Science Foundation of China under Grant No. 11371355 and 11631014.
Declarations
This article has been published as part of BMC Systems Biology Volume 10 Supplement 4, 2016: Proceedings of the 27th International Conference on Genome Informatics: systems biology. The full contents of the supplement are available online at http://bmcsystbiol.biomedcentral.com/articles/supplements/volume-10-supplement-4.
Funding
The publication costs for this article were funded by the corresponding author’s institution. The publication funding came from National Natural Science Foundation of China under Grant No. 61572506, No. 11301517, No. 11631014, and No. 11371355.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Authors’ contributions
YH conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript. ZY & XC designed the project, analyzed experiments, and revised the manuscript. GY revised the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
1. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci. 2001;98(8):4569–74.
2. Pazos F, Valencia A. In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins. 2002;47(2):219–27.
3. Gavin A-C, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon A-M, Cruciat C-M. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002;415(6868):141–7.
4. Skrabanek L, Saini HK, Bader GD, Enright AJ. Computational prediction of protein–protein interactions. Mol Biotechnol. 2008;38(1):1–17.
5. Zhou YZ, Gao Y, Zheng YY. Prediction of protein-protein interactions using local description of amino acid sequence. In: Advances in Computer Science and Education Applications. Berlin, Heidelberg: Springer; 2011: 254–262.
6. Najafabadi HS, Salavati R. Sequence-based prediction of protein-protein interactions by means of codon usage. Genome Biol. 2008;9(5):R87.
7. Shi M-G, Xia J-F, Li X-L, Huang D-S. Predicting protein–protein interactions from sequence using correlation coefficient and high-quality interaction dataset. Amino Acids. 2010;38(3):891–9.
8. Koike A, Takagi T. Prediction of protein–protein interaction sites using support vector machines. Protein Eng Des Sel. 2004;17(2):165–73.
9. Dong Q, Wang X, Lin L, Guan Y. Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins. BMC Bioinformatics. 2007;8(1):1.
10. Chen H, Zhou HX. Prediction of interface residues in protein–protein complexes by a consensus neural network method: test against NMR data. Proteins. 2005;61(1):21–35.
11. Lu C-Y, Min H, Gui J, Zhu L, Lei Y-K. Face recognition via weighted sparse representation. J Vis Commun Image Represent. 2013;24(2):111–6.
12. You Z-H, Yu J-Z, Zhu L, Li S, Wen Z-K. A MapReduce based parallel SVM for large-scale predicting protein–protein interactions. Neurocomputing. 2014;145:37–43.
13. Martin S, Roe D, Faulon J-L. Predicting protein–protein interactions using signature products. Bioinformatics. 2005;21(2):218–26.
14. Li F-M, Li Q-Z. Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach. Protein Pept Lett. 2008;15(6):612–6.
15. Ahonen T, Matas J, He C, Pietikäinen M. Rotation invariant image description with local binary pattern histogram fourier features. In: Image Analysis. Berlin, Heidelberg: Springer; 2009: 61–70.
16. Chou KC. Prediction of protein cellular attributes using pseudo‐amino acid composition. Proteins. 2001;43(3):246–55.
17. Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 2000;28(1):374.
18. Candes EJ, Tao T. Near-optimal signal recovery from random projections: Universal encoding strategies? Inf Theory IEEE Trans. 2006;52(12):5406–25.
19. Candes EJ, Romberg JK, Tao T. Stable signal recovery from incomplete and inaccurate measurements. Commun Pure Appl Math. 2006;59(8):1207–23.
20. Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM Rev. 2001;43(1):129–59.
21. Chou K-C. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol. 2011;273(1):236–47.
22. Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y. Locality-constrained linear coding for image classification. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE; 2010: 3360–3367.
23. Guo Y, Yu L, Wen Z, Li M. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 2008;36(9):3025–30.
24. Yang L, Xia J-F, Gui J. Prediction of protein-protein interactions from protein sequence using local descriptors. Protein Pept Lett. 2010;17(9):1085–90.
25. Bock JR, Gough DA. Whole-proteome interaction mining. Bioinformatics. 2003;19(1):125–34.
26. Nanni L. Hyperplanes for predicting protein–protein interactions. Neurocomputing. 2005;69(1):257–63.