ppiPre: predicting protein-protein interactions by combining heterogeneous features
© Deng et al.; licensee BioMed Central Ltd. 2013
Published: 14 October 2013
The Erratum to this article has been published in BMC Systems Biology 2015 9:50
Protein-protein interactions (PPIs) are crucial in cellular processes. Because current experimental techniques are time-consuming and expensive, and their results are incomplete and noisy, computational methods and software tools for predicting PPIs are needed. Although several approaches have been proposed, they often support only a limited number of species and require additional data such as homologous interactions in other species, protein sequences and protein expression. Moreover, the predictive abilities of different features for different kinds of PPI data have not been studied.
In this paper, we propose ppiPre, an open-source framework for PPI analysis and prediction that combines heterogeneous features: three GO-based semantic similarities, one KEGG-based co-pathway similarity and three topology-based similarities. It supports up to twenty species. Users need to provide only the original PPI data and gold-standard PPI data. Experiments on binary and co-complex gold-standard yeast PPI data sets show large differences among the predictive abilities of different features on different kinds of PPI data sets. The prediction performance on the two data sets shows that ppiPre can handle PPI data of different kinds and sizes. ppiPre is implemented in the R language and is freely available on CRAN (http://cran.r-project.org/web/packages/ppiPre/).
We applied our framework to both binary and co-complex gold-standard PPI data sets. The detailed analysis of the three GO aspects suggests that different GO aspects should be used on different kinds of data sets, and that combining all three aspects often gives the best result. The analysis also shows that features based solely on the topology of the PPI network can give very good results when predicting co-complex PPI data. ppiPre provides useful functions for analysing PPI data and can be used to predict PPIs for multiple species.
Although different experimental methods [1, 2] have generated a large amount of PPI data for many model species in recent years, these existing PPI data are incomplete and contain many false positive interactions. Computational approaches are therefore urgently needed to refine these data.
Recent studies have shown that PPI data can be integrated with other kinds of biological data to predict PPIs by supervised learning [4–7]. In supervised learning, a classifier is trained on truly interacting protein pairs (positive samples) and on protein pairs that do not interact (negative samples). The trained classifier can then recover false negative interactions and remove false positive interactions from the PPIs supplied by users.
Existing studies differ mainly in the features selected for the prediction framework. In these studies, different kinds of biological evidence are extracted and used as features for training the classifier, including Gene Ontology (GO) functional annotations [8, 9], protein sequences and co-expressed proteins. For organisms or proteins that are poorly studied, biological features may not work well, so features based on network topology also need to be integrated [12–14].
Although some frameworks and tools have been proposed for predicting PPIs [15–20], they share two general disadvantages. First, most of them support only a few well-studied model organisms. Second, they often require users to provide additional biological data along with the PPIs. Moreover, different species often require different features, which makes these existing frameworks inconvenient to use.
In this paper, we describe ppiPre, an open-source framework for the PPI prediction problem. The framework is implemented in the R language, so, unlike tools accessed via web services, it can work together with other R packages that handle biological data and networks. ppiPre integrates features extracted from multiple heterogeneous data sources, including GO, KEGG and the topology of the PPI network. Users do not need to provide any biological data other than gold-standard PPI data. ppiPre provides functions for measuring the similarity between proteins and for predicting PPIs from existing PPI data.
Heterogeneous features are integrated in the prediction framework of ppiPre: three GO-based semantic similarities, one KEGG-based similarity that indicates whether two proteins are involved in the same pathways, and three topology-based similarities that use only the structure of the PPI network.
We chose these three kinds of features because they are widely available for the PPIs of different species and can be easily accessed in the R environment. Unlike other methods and software tools, ppiPre does not integrate biological features, such as structural and domain information, that may be unavailable for poorly studied species or proteins.
GO-based semantic similarities
Proteins are annotated by GO with terms from three aspects: biological process (BP), molecular function (MF) and cellular component (CC). Each aspect is described by a directed acyclic graph (DAG). Interacting proteins are known to be more likely than non-interacting proteins to be involved in similar biological processes or to be located in similar cellular components [2, 24, 25]. Thus, if two proteins are semantically similar according to their GO annotations, they are more likely to actually interact than two proteins that are less similar.
Several similarity measures have been developed for evaluating the semantic similarity between two GO terms [26–28]. The information content (IC) of GO terms and the structure of the GO DAG are often used in these measures.
The IC of a term t is defined as IC(t) = -log p(t), where p(t) is the probability of occurrence of the term t in a certain GO aspect. Two recently proposed IC-based semantic similarity measures are integrated in ppiPre: Topological Clustering Semantic Similarity (TCSS) and IntelliGO.
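As a rough illustration of the information content idea, the sketch below computes IC(t) = -log p(t) from a toy annotation table. The input format, the protein and term names, and the choice to estimate p(t) as the share of all annotations that use term t are assumptions made for this example only; they are not taken from the ppiPre implementation.

```python
import math
from collections import Counter

def information_content(annotations):
    """Return IC(t) = -log p(t) for every term in a toy annotation table.

    `annotations` maps a protein to the set of GO terms annotating it
    (hypothetical format; ancestor terms are assumed to be listed already).
    """
    counts = Counter(term for terms in annotations.values() for term in terms)
    total = sum(counts.values())
    # p(t) is estimated here as the fraction of all annotations that use t.
    return {term: -math.log(n / total) for term, n in counts.items()}

# Toy data: the root term annotates everything, "GO:b" is the rarest term.
toy_annotations = {
    "P1": {"GO:root", "GO:a"},
    "P2": {"GO:root", "GO:a", "GO:b"},
    "P3": {"GO:root"},
}
ic = information_content(toy_annotations)
```

Rare, specific terms receive a high IC and frequent, general terms a low one, which is why IC is useful for weighting shared annotations.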
In TCSS, the GO DAGs are divided into subgraphs. A PPI is scored higher if the two proteins are in the same subgraph. The algorithm is made up of two major steps.
In the first step, a threshold on the ICs of all terms is used to generate multiple subgraphs. The roots of the subgraphs are the terms whose IC is below the predefined threshold. If the roots of two subgraphs have similar IC values, the two subgraphs are merged. Overlapping subgraphs may occur because some GO terms have more than one parent term. Overlap between subgraphs is removed by edge removal and term duplication. Transitive reduction of the GO DAG removes overlapping edges by generating the smallest graph that has the same transitive closure as the original subgraph. After edge removal, a term that is still included in two or more subgraphs is duplicated into each of them. More details are described in the original TCSS paper.
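The transitive reduction mentioned above can be sketched on a small graph as follows. This is a minimal illustration on a toy DAG stored as a parent-to-children dictionary (an assumed representation), not the TCSS or ppiPre code: an edge is dropped whenever its target is already reachable through another child of the same node.

```python
def reachable(graph, start, target):
    """Depth-first search: is `target` reachable from `start`?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False

def transitive_reduction(graph):
    """Drop every edge u -> v of a DAG that is implied by a longer path."""
    reduced = {}
    for u, children in graph.items():
        kept = set()
        for v in children:
            # u -> v is redundant if another child of u already leads to v.
            redundant = any(w != v and reachable(graph, w, v) for w in children)
            if not redundant:
                kept.add(v)
        reduced[u] = kept
    return reduced

# Toy DAG: the edge root -> c is implied by the path root -> a -> c.
toy_dag = {"root": {"a", "b", "c"}, "a": {"c"}, "b": set(), "c": set()}
reduced = transitive_reduction(toy_dag)
```

The reduced graph has the same reachability relation (transitive closure) as the input but contains no shortcut edges.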
After the first step, a meta-graph is constructed by connecting all subgraphs. The second step, called normalized scoring, is then performed. For two GO terms, the normalized semantic similarity is calculated on the meta-graph rather than on the whole GO DAG, so that more balanced semantic similarity scores are obtained.
where P_t is the set of proteins annotated by t in aspect O and N(t) is the set of child terms of t.
where LCA(s_m, t_n) is the common ancestor of the terms s_m and t_n with the highest IC, and T_i and T_j are the sets of GO terms that annotate proteins i and j, respectively.
where w(g, t) is the weight of the evidence code (EC) that indicates the annotation origin between protein g and GO term t, and IAF (Inverse Annotation Frequency) reflects how frequently term t occurs among all the proteins annotated in the aspect to which t belongs.
A detailed explanation of the definition can be found in the original IntelliGO paper.
The similarity measure proposed by Wang et al., which is based on the graph structure of the GO DAG, is also implemented in the ppiPre package.
In the GO DAG, each edge has a type, either "is-a" or "part-of". In Wang's measure, each edge is given a weight according to its type. DAG_t = (t, T_t, E_t) represents the subgraph made up of term t and its ancestors, where T_t is the set of ancestor terms of t and E_t is the set of edges in DAG_t.
where SV(m) is the sum of the semantic contributions of all the terms in DAG_m.
The semantic similarity between two proteins i and j is defined as the maximum similarity between any term that annotates i and any term that annotates j.
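To make the graph-based measure more concrete, here is a small sketch following the published description of Wang's method: each term contributes 1 to itself, each ancestor receives the best product of edge weights along any path, and two terms are compared through the contributions of their shared ancestors. The edge weights (0.8 for "is-a", 0.6 for "part-of") are the values suggested in Wang et al.'s paper and are assumed here, as are the toy DAG and the data layout; the protein-level score takes the maximum over term pairs as stated above.

```python
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}  # assumed edge weights (from Wang et al.)

def s_values(term, parents):
    """Semantic contribution of `term` and each of its ancestors to `term`.

    `parents` maps a term to a list of (parent, edge_type) pairs (toy format).
    """
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent, edge_type in parents.get(child, []):
            value = WEIGHTS[edge_type] * s[child]
            if value > s.get(parent, 0.0):  # keep the best path to each ancestor
                s[parent] = value
                frontier.append(parent)
    return s

def wang_term_similarity(a, b, parents):
    """Shared contributions of common ancestors, normalised by SV(a) + SV(b)."""
    sa, sb = s_values(a, parents), s_values(b, parents)
    shared = sum(sa[t] + sb[t] for t in set(sa) & set(sb))
    return shared / (sum(sa.values()) + sum(sb.values()))

def protein_similarity(terms_i, terms_j, parents):
    """Maximum term-level similarity over all cross pairs of annotating terms."""
    return max(wang_term_similarity(s, t, parents)
               for s in terms_i for t in terms_j)

# Toy DAG: C is-a A, A is-a root, B part-of root.
toy_parents = {
    "A": [("root", "is_a")],
    "B": [("root", "part_of")],
    "C": [("A", "is_a")],
}
score = protein_similarity({"C"}, {"A", "B"}, toy_parents)
```

In this toy example the protein annotated with C is scored against A and B, and the closer pair (C, A) determines the protein-level similarity.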
where P(i) is the set of pathways in the KEGG database in which protein i is involved.
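A co-pathway similarity over the pathway sets P(i) and P(j) could be sketched as below. The exact formula is not reproduced in this text, so the sketch assumes a Jaccard-style ratio of shared pathways to all pathways of the two proteins; the function name and the example pathway identifiers are illustrative only.

```python
def kegg_similarity(pathways_i, pathways_j):
    """Share of pathways common to both proteins (assumed Jaccard-style ratio)."""
    union = pathways_i | pathways_j
    if not union:
        # Neither protein has a pathway annotation: no evidence either way.
        return 0.0
    return len(pathways_i & pathways_j) / len(union)

# Illustrative pathway identifiers for two hypothetical yeast proteins.
p_i = {"sce00010", "sce00020"}
p_j = {"sce00020", "sce03010"}
sim = kegg_similarity(p_i, p_j)
```

Returning 0.0 for two unannotated proteins is a design choice of this sketch; it mirrors the point made in the text that annotation-based features carry no information for such pairs.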
To deal with proteins that have no annotations in the GO or KEGG databases, topology-based similarity measures are also integrated. Three different topological similarities are implemented in ppiPre.
where N(i) is the set of all direct neighbours of protein i in the PPI network.
where k_n is the degree of protein n.
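The three topological similarities can be illustrated with the usual textbook forms of the Jaccard, Adamic-Adar and Resource Allocation indices, written in terms of the neighbour sets N(i) and degrees k_n defined above. The sketch below assumes those standard definitions and a toy edge list; it is not the ppiPre code.

```python
import math

def neighbours(edges):
    """Build N(i) for every protein from an undirected edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def jaccard(adj, i, j):
    """|N(i) & N(j)| / |N(i) | N(j)|."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

def adamic_adar(adj, i, j):
    """Sum of 1 / log(k_n) over common neighbours n (each has k_n >= 2)."""
    return sum(1.0 / math.log(len(adj[n])) for n in adj[i] & adj[j])

def resource_allocation(adj, i, j):
    """Sum of 1 / k_n over common neighbours n."""
    return sum(1.0 / len(adj[n]) for n in adj[i] & adj[j])

# Toy network: A and B share the neighbours C (degree 2) and D (degree 3).
toy_edges = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("D", "E")]
adj = neighbours(toy_edges)
```

All three scores need only the network itself, so they remain available for proteins without GO or KEGG annotations. The last two down-weight common neighbours that are hubs.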
Experimentally verified interacting protein pairs are very incomplete, and non-interacting protein pairs far outnumber interacting pairs. The classical SVM, which can handle small and unbalanced data, was therefore chosen to integrate the different features in ppiPre. We tested different kernels in the e1071 package and found no significant difference in the results, so the default kernel and parameters are used in ppiPre.
Results and discussion
Table: Gold-standard positive yeast protein interaction data sets (columns: Number of Interactions, Number of Proteins)
Non-interacting pairs were randomly selected from the proteins in the gold-standard positive data sets to form the gold-standard negative data sets. The positive and negative data sets have the same size. To minimize the impact on the topological characteristics of the PPI network, the degree of each protein was maintained.
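One possible way to draw such a negative set is sketched below. It is only one reading of the description above and not the authors' procedure: every protein is listed once per positive interaction it takes part in, the list is shuffled and cut into pairs, and the draw is repeated until no pair is a self-pair, a duplicate or a known interaction. The function name, the seed and the retry limit are assumptions of this example.

```python
import random

def sample_negatives(positives, seed=0, max_tries=10000):
    """Draw len(positives) non-interacting pairs, keeping each protein's degree."""
    rng = random.Random(seed)
    positive_set = {frozenset(pair) for pair in positives}
    # One "stub" per interaction endpoint: a protein appears as often as its degree.
    stubs = [protein for pair in positives for protein in pair]
    for _ in range(max_tries):
        rng.shuffle(stubs)
        pairs = {frozenset(stubs[k:k + 2]) for k in range(0, len(stubs), 2)}
        valid = (
            len(pairs) == len(positives)          # no duplicate pairs
            and all(len(p) == 2 for p in pairs)   # no self-pairs
            and not (pairs & positive_set)        # no known interactions
        )
        if valid:
            return sorted(tuple(sorted(p)) for p in pairs)
    raise RuntimeError("no degree-preserving negative set found")

# Toy positive set: a ring of six proteins, each with degree 2.
toy_positives = [("A", "B"), ("B", "C"), ("C", "D"),
                 ("D", "E"), ("E", "F"), ("F", "A")]
negatives = sample_negatives(toy_positives)
```

The rejection loop is simple but can become slow on dense networks, where an edge-swapping scheme might be preferable.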
10-fold cross validation was used to evaluate the performance of the prediction framework.
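The evaluation protocol can be sketched with two small helpers: one that splits sample indices into ten folds, and one that computes the area under the ROC curve (AUC) from classifier scores using the rank-sum identity. Both are generic illustrations with assumed names and toy inputs, independent of the R packages used by ppiPre.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::k] for f in range(k)]

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Each fold serves once as the test set; the other nine form the training set.
folds = k_fold_indices(25, k=10)
example_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

In a full run, a classifier would be trained on nine folds, scored on the held-out fold, and the AUC values averaged over the ten rounds.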
Predictive abilities of GO-based similarities
Table: AUC for the yeast gold-standard PPI data sets using a single GO aspect (columns: Binary data set, Co-complex data set)
For the binary data set, the BP aspect shows the best performance among the three aspects in the ROC analysis of the three GO-based semantic similarities (Figure 2, Table 2). This result is expected, because the BP aspect is directly related to protein interactions and can thus be used to predict them.
For the co-complex data set, the CC aspect shows the best performance in ROC analysis of three GO-based semantic similarities (Figure 3, Table 2). Since the MIPS data set is composed of protein complexes, and a protein complex can only be formed if its proteins are localized within the same compartment of the cell, terms in the CC aspect correctly reflect the functional grouping of proteins in these complexes.
Table: AUC for the yeast gold-standard PPI data sets using a combination of GO aspects (columns: Binary data set, Co-complex data set)
Predictive abilities of KEGG-based and topological similarities
Table: AUC for the yeast gold-standard PPI data sets using KEGG-based and different topological similarities (columns: Binary data set, Co-complex data set)
Integration of biological and topological similarities
After analysing biological and topological features separately, we integrated these heterogeneous features together.
The result shows that integrating biological and topological similarities can improve the prediction performance. It is therefore necessary to integrate heterogeneous features when dealing with the PPI prediction problem. All the features are integrated in ppiPre.
For proteins with unknown annotations in GO and KEGG, the GO-based and KEGG-based similarity measures cannot be computed. However, the impact on these two data sets is negligible, since only 2 interactions in the binary data set (0.19%) and 16 in the MIPS data set (1.84%) lack annotations. When ppiPre is applied to a large number of proteins that are poorly annotated in GO, users should be aware that its performance may be hampered.
Implementation and usage
The current version of ppiPre supports 20 species. The details of the supported species and of the IC data used in the GO-based semantic similarities are described in . The GO and KEGG annotation data are obtained from the packages GO.db and KEGG.db.
Table: Functions provided in ppiPre (function descriptions)
- Computes the Adamic-Adar similarity
- Computes biological and topological similarities
- Predicts false negative interactions using topological similarities
- Computes KEGG-based similarity and GO-based similarities
- Computes IntelliGO semantic similarity
- Computes the Jaccard similarity
- Computes KEGG-based similarity
- Computes the Resource Allocation similarity
- Trains the SVM classifier, and then predicts false interactions
- Computes TCSS semantic similarity
- Computes all three topological similarities
- Computes Wang's semantic similarity
In this paper, we propose ppiPre, an open-source framework for PPI prediction. Several heterogeneous features are combined in ppiPre, including three GO-based similarities, one KEGG-based similarity and three topology-based similarities. To make predictions, users do not need to provide any biological data other than gold-standard PPI data.
ppiPre can be integrated into existing bioinformatics analysis pipelines in the R environment. Other features will be evaluated and integrated in future work, and the framework will be tested on PPI data of more species, especially those poorly annotated in GO.
We thank the anonymous reviewers for many helpful suggestions and comments. We thank Dr. Yong Wang of AMSS, CAS for insightful discussions. We also thank Rongjie Shao and Gang Wang of Xidian University for technical support. A preliminary version of this paper was published in the proceedings of IEEE ISB2012.
The publication of this article was funded by the NSFC (91130006, 60933009, 61202174) and the Fundamental Research Funds for the Central Universities (K5051223003, K5051223005).
This article has been published as part of BMC Systems Biology Volume 7 Supplement 2, 2013: Selected articles from The 6th International Conference of Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S2.
- Gavin A-C, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon A-M, Cruciat C-M, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier M-A, Copley RR, Edelmann A, Querfurth E, Rybin V: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415: 141-147. 10.1038/415141a.
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000, 403: 623-627. 10.1038/35001009.
- De Las Rivas J, Fontanillo C: Protein-Protein Interactions Essentials: Key Concepts to Building and Analyzing Interactome Networks. PLoS Comput Biol. 2010, 6: e1000807. 10.1371/journal.pcbi.1000807.
- Ben-Hur A, Noble WS: Kernel methods for predicting protein-protein interactions. Bioinformatics. 2005, 21: i38-46. 10.1093/bioinformatics/bti1016.
- Chen X-W, Liu M: Prediction of protein-protein interactions using random decision forest framework. Bioinformatics. 2005, 21: 4394-4400. 10.1093/bioinformatics/bti721.
- Patil A, Nakamura H: Filtering high-throughput protein-protein interaction data using a combination of genomic features. BMC Bioinformatics. 2005, 6: 100. 10.1186/1471-2105-6-100.
- Lin X, Liu M, Chen X: Assessing reliability of protein-protein interactions by integrative analysis of data in model organisms. BMC Bioinformatics. 2009, 10 (Suppl 4): S5. 10.1186/1471-2105-10-S4-S5.
- Mahdavi M, Lin Y-H: False positive reduction in protein-protein interaction predictions using gene ontology annotations. BMC Bioinformatics. 2007, 8: 262. 10.1186/1471-2105-8-262.
- Kuchaiev O, Rašajski M, Higham DJ, Pržulj N: Geometric De-noising of Protein-Protein Interaction Networks. PLoS Comput Biol. 2009, 5.
- Wang C, Cheng J, Su S: Prediction of Interacting Protein Pairs from Sequence Using a Bayesian Method. The Protein Journal. 2009, 28: 111-115. 10.1007/s10930-009-9170-7.
- Qi Y, Klein-Seetharaman J, Bar-Joseph Z: A mixture of feature experts approach for protein-protein interaction prediction. BMC Bioinformatics. 2007, 8 (Suppl 10): S6. 10.1186/1471-2105-8-S10-S6.
- Lü L, Zhou T: Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications. 2011, 390: 1150-1170. 10.1016/j.physa.2010.11.027.
- Guimerà R, Sales-Pardo M: Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences. 2009, 106: 22073-22078. 10.1073/pnas.0908366106.
- Chua HN, Ning K, Sung W-K, Leong HW, Wong L: Using indirect protein-protein interactions for protein complex prediction. J Bioinform Comput Biol. 2008, 6: 435-466. 10.1142/S0219720008003497.
- Kim S, Shin S-Y, Lee I-H, Kim S-J, Sriram R, Zhang B-T: PIE: an online prediction system for protein-protein interactions from text. Nucleic Acids Research. 2008, 36 (Web Server): W411-W415. 10.1093/nar/gkn281.
- Guo Y, Li M, Pu X, Li G, Guang X, Xiong W, Li J: PRED_PPI: a server for predicting protein-protein interactions based on sequence data with probability assignment. BMC Research Notes. 2010, 3: 145. 10.1186/1756-0500-3-145.
- Li D, Liu W, Liu Z, Wang J, Liu Q, Zhu Y, He F: PRINCESS, a Protein Interaction Confidence Evaluation System with Multiple Data Sources. Mol Cell Proteomics. 2008, 7: 1043-1052. 10.1074/mcp.M700287-MCP200.
- Michaut M, Kerrien S, Montecchi-Palazzi L, Chauvat F, Cassier-Chauvat C, Aude J-C, Legrain P, Hermjakob H: InteroPORC: Automated Inference of Highly Conserved Protein Interaction Networks. Bioinformatics. 2008, 24: 1625-1631. 10.1093/bioinformatics/btn249.
- Pitre S, Dehne F, Chan A, Cheetham J, Duong A, Emili A, Gebbia M, Greenblatt J, Jessulat M, Krogan N, Luo X, Golshani A: PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics. 2006, 7: 365. 10.1186/1471-2105-7-365.
- McDowall MD, Scott MS, Barton GJ: PIPs: human protein-protein interaction prediction database. Nucleic Acids Research. 2009, 37 (Database): D651-D656. 10.1093/nar/gkn870.
- Csárdi G, Nepusz T: The igraph software package for complex network research. InterJournal Complex Systems. 2006, 1695.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research. 2000, 28: 27-30. 10.1093/nar/28.1.27.
- Lehner B, Fraser AG: A first-draft human protein-interaction map. Genome Biology. 2004, 5: R63. 10.1186/gb-2004-5-9-r63.
- Jansen R: A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data. Science. 2003, 302: 449-453. 10.1126/science.1087361.
- Resnik P: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI. 1995, 448-453.
- Jiang J, Conrath D: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. International Conference Research on Computational Linguistics (ROCLING X). 1997, 9008.
- Lord PW, Stevens RD, Brass A, Goble CA: Semantic similarity measures as tools for exploring the gene ontology. Pac Symp Biocomput. 2003, 601-612.
- Jain S, Bader G: An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics. 2010, 11: 562. 10.1186/1471-2105-11-562.
- Benabderrahmane S, Smail-Tabbone M, Poch O, Napoli A, Devignes M-D: IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC Bioinformatics. 2010, 11: 588. 10.1186/1471-2105-11-588.
- Rogers MF, Ben-Hur A: The use of gene ontology evidence codes in preventing classifier assessment bias. Bioinformatics. 2009, 25: 1173-1177. 10.1093/bioinformatics/btp122.
- Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F: A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007, 23: 1274-1281. 10.1093/bioinformatics/btm087.
- Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins. 2006, 63: 490-500. 10.1002/prot.20865.
- van Noort V, Snel B, Huynen MA: Exploration of the omics evidence landscape: adding qualitative labels to predicted protein-protein interactions. Genome Biology. 2007, 8: R197. 10.1186/gb-2007-8-9-r197.
- Jaccard P: Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaud Sci Nat. 1901, 37: 541.
- Adamic LA, Adar E: Friends and neighbors on the Web. Social Networks. 2003, 25: 211-230. 10.1016/S0378-8733(03)00009-1.
- Zhou T, Lü L, Zhang Y-C: Predicting missing links via local information. The European Physical Journal B - Condensed Matter and Complex Systems. 2009, 71: 623-630. 10.1140/epjb/e2009-00335-8.
- Vapnik VN: The Nature of Statistical Learning Theory. 2000, Springer.
- Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, Hao T, Rual J-F, Dricot A, Vazquez A, Murray RR, Simon C, Tardivo L, Tam S, Svrzikapa N, Fan C, de Smet A-S, Motyl A, Hudson ME, Park J, Xin X, Cusick ME, Moore T, Boone C, Snyder M, Roth FP: High-Quality Binary Protein Interaction Map of the Yeast Interactome Network. Science. 2008, 322: 104-110. 10.1126/science.1158684.
- Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han J-DJ, Bertin N, Chung S, Vidal M, Gerstein M: Annotation Transfer Between Genomes: Protein-Protein Interologs and Protein-DNA Regulogs. Genome Research. 2004, 14: 1107-1118. 10.1101/gr.1774904.
- Deng Y, Gao L: ppiPre - an R package for predicting protein-protein interactions. 2012 IEEE 6th International Conference on Systems Biology (ISB). 2012, 333-337.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.