Affinity purification-mass spectrometry (AP-MS) has been widely used to generate bait-prey data sets for identifying the underlying protein-protein interactions and protein complexes. However, AP-MS data sets expressed as bait-prey pairs are highly noisy, and the candidate pairs contain many false positives. Recently, numerous computational methods have been developed to identify genuine interactions from AP-MS data sets. However, most of these methods aim at removing false positives caused by contaminants and ignore the distinction between direct interactions and indirect interactions.
In this paper, we present an initialization-and-refinement framework for inferring direct PPI networks from AP-MS data, in which an initial network is first generated with existing scoring methods and then a refined network is constructed by the application of indirect association removal methods. Experimental results on several real AP-MS data sets show that our method is capable of identifying more direct interactions than traditional scoring methods.
The proposed framework is sufficiently general to incorporate any feasible method in each step, and therefore has the potential to handle different types of AP-MS data in future applications.
Proteins play an important role in a variety of biological activities within the cells of an organism. Knowing the interactions between proteins can facilitate the identification of protein functions and the discovery of new drug targets. Therefore, the accurate inference of the protein-protein interaction (PPI) network from experimental data is one of the most important and challenging topics in bioinformatics and proteomics.
Affinity purification-mass spectrometry (AP-MS) is a mainstream experimental method for identifying PPIs in a high-throughput manner. In each AP-MS experiment, a tagged protein (bait) is first selectively purified along with its potential interacting partners (preys) from a cell or tissue lysate. Then, MS is used to identify and quantify these affinity purified proteins. Such purification experiments are repeated many times with different bait proteins. The set of bait-prey pairs from all purifications, termed the AP-MS data, is used to infer the underlying protein-protein interaction network structure.
Ideally, each bait protein would have a real and direct interaction with every associated prey protein. However, AP-MS data contain a large number of false positive interactions in which the prey protein is a non-specific contaminant. In addition, some prey proteins do not interact with the bait protein directly but are connected to it via other intermediate proteins. To remove these spurious interactions and improve the quality and reliability of the network, many scoring algorithms have been proposed for the problem of PPI inference from AP-MS data. As summarized in several recent reviews [1–3], these scoring methods can be categorized into two classes according to the underlying assumption on the composition of candidate interactions: spoke and matrix models. Spoke models consider only bait-prey interactions, whereas matrix models additionally incorporate prey-prey pairs into the set of candidate interactions. These methods also differ in the type of AP-MS data they are designed to handle. For qualitative AP-MS data, scoring methods mainly measure the strength of interactions according to the co-occurrence correlation between proteins. Typical methods in this category include SA, PE, DC, Hart and IDBOS. For quantitative AP-MS data, methods such as SAINT, MiST, ComPASS and HGSCore infer interactions between proteins by exploiting the quantitative information on proteins.
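The difference between the spoke and matrix models can be illustrated with a minimal sketch. The function name `candidate_pairs`, the input layout (a list of bait/prey-list tuples) and the toy purifications are assumptions made for illustration only; they are not taken from any of the scoring methods cited above.

```python
from itertools import combinations


def candidate_pairs(purifications, model="spoke"):
    """Enumerate candidate interactions from AP-MS purifications.

    purifications: list of (bait, preys) tuples (hypothetical layout).
    model: "spoke" keeps bait-prey pairs only; "matrix" also adds
           prey-prey pairs observed in the same purification.
    Returns a set of unordered pairs (frozensets of two protein names).
    """
    if model not in ("spoke", "matrix"):
        raise ValueError("model must be 'spoke' or 'matrix'")
    pairs = set()
    for bait, preys in purifications:
        preys = set(preys) - {bait}  # ignore a bait listed as its own prey
        for prey in preys:
            pairs.add(frozenset((bait, prey)))
        if model == "matrix":
            for p, q in combinations(sorted(preys), 2):
                pairs.add(frozenset((p, q)))
    return pairs


# Toy data: two purifications with baits A and B.
toy = [("A", ["B", "C"]), ("B", ["A", "D"])]
spoke = candidate_pairs(toy, "spoke")
matrix = candidate_pairs(toy, "matrix")
print(len(spoke), len(matrix))
```

In this toy example the matrix model adds the prey-prey pairs B-C and A-D to the three bait-prey pairs of the spoke model, so the matrix candidate set is always a superset of the spoke candidate set.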
Despite recent algorithmic advances in inferring PPI networks from AP-MS data, several challenging problems remain unsolved. In this paper, we focus on one such question: can we accurately infer the direct PPI network from AP-MS data? Note that there are two types of protein interactions: direct (physical, binary) interactions and indirect (co-complex) interactions. Direct interactions are those in which the interacting proteins come into close contact and bind together as a complex in some biological process and then perform certain functions. An indirect interaction between two proteins refers only to their functional relationship, without such direct physical contact. In other words, two proteins with an indirect interaction cooperate to carry out a given task without actually engaging in physical contact. Mathematically, if the PPI network is represented as a graph, then each edge in the graph corresponds to a direct interaction, and two proteins have an indirect interaction if they are connected by a path in the graph but share no direct edge. However, most existing scoring methods infer PPI networks whose edges are a mixture of direct and indirect interactions. In other words, these methods do not distinguish direct interactions from indirect interactions when constructing PPI networks. To our knowledge, only a few studies have investigated the problem of constructing a PPI network that is composed of only direct interactions [15–17].
Therefore, there is still a strong need for effective algorithms that infer direct protein-protein interactions from AP-MS data. This paper presents a general framework for inferring direct PPIs from AP-MS data. It is composed of two phases: an initialization phase and a refinement phase. In the initialization phase, we use an existing scoring method to generate a PPI network that may contain both direct and indirect interactions. In the refinement phase, we distinguish direct interactions from indirect interactions in the initial PPI network. Note that this framework is general and very flexible, since different algorithms can be used in each phase. To demonstrate the feasibility and advantages of our framework, we conduct a series of comprehensive performance studies. In the experiments, we use the SA, PE, DC and Hart methods as the scoring methods in the first phase and two indirect interaction removal methods [18, 19] in the second phase. Experimental results show that our method is capable of detecting more direct interactions than traditional scoring methods.
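The graph-based definition above can be made concrete with a small sketch. The function name `classify_pair` and the toy edge list are hypothetical; the code only encodes the stated definition (an edge is a direct interaction, a connecting path without an edge is an indirect interaction) and is not part of any published method.

```python
from collections import deque


def classify_pair(edges, u, v):
    """Classify the relationship between proteins u and v in a PPI graph.

    edges: iterable of (protein, protein) tuples, one per direct interaction.
    Returns "direct" if u and v share an edge, "indirect" if they are only
    connected through intermediate proteins, and "none" otherwise.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if v in adjacency.get(u, set()):
        return "direct"
    # Breadth-first search for any path from u to v.
    seen, queue = {u}, deque([u])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == v:
                return "indirect"
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return "none"


# Toy network: A-B and B-C bind directly; D-E form a separate component.
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
print(classify_pair(toy_edges, "A", "B"))
print(classify_pair(toy_edges, "A", "C"))
print(classify_pair(toy_edges, "A", "D"))
```

Here A and C are linked only through B, which is exactly the kind of pair that a co-complex scoring method may report but that a direct PPI network should exclude.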
The rest of the paper is organized as follows. “Methods” section describes our PPI network inference framework. “Results and discussion” section presents the experimental results and “Conclusion” section concludes this paper.
Here we propose a general initialization-and-refinement framework for inferring the direct PPI network from AP-MS data. The initial idea of such a two-step procedure was discussed in our previous work, which has been available online since 2014. In addition, some preliminary experimental results were presented by the corresponding author in the highlight track of ISB 2015. In this paper, we further formalize this idea and conduct extensive empirical studies to demonstrate its feasibility and effectiveness in practice. In the first step, we use existing interaction scoring methods to generate an initial PPI network that is a mixture of direct and indirect interactions. In the second step, we remove indirect interactions from the initial network by means of so-called network cleaning methods.
Figure 1 provides an overview of this framework. In the following, we elaborate on each step in detail.
(I) In the initialization phase, we use existing scoring methods to construct an initial PPI network. Any feasible scoring algorithm can be used in this step. As discussed in the introduction, many interaction scoring algorithms have been proposed to infer PPI networks from AP-MS data, and these methods are designed to tackle different types of AP-MS data. Therefore, the choice of scoring method depends on the input AP-MS data. For qualitative AP-MS data sets, where we only know the co-occurrence of bait-prey pairs, we need to choose methods such as SA, PE and DC. For quantitative AP-MS data sets with protein abundance information, we can use methods such as SAINT and MiST.
Despite the apparent differences among existing methods, the problem of interaction prediction from AP-MS data can be modeled as a complex pairwise correlation mining problem. Here variables correspond to proteins, while samples correspond to purifications. Note that this problem differs from the traditional correlation mining problem in several respects: (1) Each variable (i.e., protein) may take different roles (bait vs. prey). Such information is valuable for effective interaction detection and has been incorporated in many scoring methods such as SA and PE; (2) The data sets are highly noisy, and many frequently occurring proteins may be contaminants.
Overall, many existing algorithms in the literature can be used in this step. A detailed description and discussion of the advantages and limitations of the available methods is beyond the scope of this paper and can be found in a recent review paper.
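To illustrate the "proteins as variables, purifications as samples" view, the following is a minimal sketch of a co-occurrence score on qualitative data. It uses the standard Dice coefficient, 2|A∩B| / (|A| + |B|), as a stand-in; it deliberately ignores the bait/prey roles discussed above and should not be read as the exact SA, PE, DC or Hart scoring procedure. The function name `dice_scores` and the toy matrix are assumptions.

```python
import numpy as np


def dice_scores(presence):
    """Pairwise Dice co-occurrence scores between proteins.

    presence: binary array of shape (n_purifications, n_proteins), where
              presence[k, i] = 1 if protein i was detected in purification k.
    Returns a symmetric (n_proteins, n_proteins) score matrix with a zero
    diagonal.
    """
    X = np.asarray(presence, dtype=float)
    co = X.T @ X                # co[i, j]: purifications containing both i and j
    counts = np.diag(co)        # purifications containing each single protein
    denom = counts[:, None] + counts[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        scores = np.where(denom > 0, 2.0 * co / denom, 0.0)
    np.fill_diagonal(scores, 0.0)
    return scores


# Toy data: 3 purifications (rows) x 3 proteins (columns).
toy_presence = [[1, 1, 0],
                [1, 1, 1],
                [0, 0, 1]]
toy_scores = dice_scores(toy_presence)
print(toy_scores)
```

Proteins 0 and 1 always occur together and receive the maximal score, whereas protein 2 co-occurs with them in only one of its two purifications. A frequently detected contaminant would accumulate moderate scores with many partners under such a simple measure, which is one reason the published scoring methods add further modeling.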
(II) In the refinement phase, we obtain a filtered PPI network by applying indirect association cleaning methods to the initial network. Recently, several algorithms have been proposed to recover direct relationships from an observed correlation matrix containing both direct and indirect relationships (e.g. network deconvolution, Silencer). Since the initial PPI network generated in the first phase is a mixture of direct and indirect interactions, it is feasible to use such indirect association cleaning methods to remove indirect interactions in this phase.
Although these association cleaning methods were developed from quite different starting points, their objective is the same: inferring the unknown true direct network from the measured correlation matrix, which may be a mixture of direct and indirect associations. The basic idea of these methods is summarized as follows.
Suppose a network is represented as an observed pairwise correlation matrix G, which is derived from measurements of the total effect (both direct and indirect) of each variable on every other variable. If S is the true matrix of direct associations, then each entry of G can be obtained by summing up the direct effects mediated through the direct neighbors of the corresponding variable in the true network S. Based on this relationship, both the network deconvolution method and the Silencer method provide an approximate closed-form solution for S in terms of G. Note that both approaches are related to the partial correlation, which is the correlation between two variables when the effects of the other variables are removed. The two methods scale the inverse correlation matrix in different manners.
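As an illustration of such a closed-form solution, the sketch below assumes the series model usually associated with network deconvolution, in which the observed matrix accumulates indirect effects of all path lengths, G = S + S^2 + S^3 + ... = S(I − S)^{-1}, which gives S = G(I + G)^{-1} whenever the series converges. The function name `deconvolve` and the toy matrices are hypothetical; the eigenvalue rescaling used in practical implementations and the different scaling used by Silencer are omitted.

```python
import numpy as np


def deconvolve(G):
    """Recover direct associations S from observed associations G.

    Assumes G = S + S^2 + S^3 + ... (convergent), so that S = G (I + G)^{-1}.
    This is a bare closed-form sketch without any eigenvalue rescaling.
    """
    G = np.asarray(G, dtype=float)
    identity = np.eye(G.shape[0])
    # Solve S (I + G) = G via the transposed system (I + G)^T S^T = G^T.
    return np.linalg.solve((identity + G).T, G.T).T


# Toy chain A - B - C: only A-B and B-C interact directly.
S_true = np.array([[0.0, 0.4, 0.0],
                   [0.4, 0.0, 0.4],
                   [0.0, 0.4, 0.0]])
# Simulate the observed matrix, which also contains the indirect A-C effect.
G_obs = S_true @ np.linalg.inv(np.eye(3) - S_true)
S_hat = deconvolve(G_obs)
print(np.round(G_obs, 3))
print(np.round(S_hat, 3))
```

In the simulated observation the A-C entry is clearly non-zero even though A and C never interact directly; after deconvolution it returns to zero while the A-B and B-C entries are recovered. Real score matrices from the first phase do not satisfy the series model exactly, so the output is better treated as a re-ranking of candidate pairs than as an exact reconstruction.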
Results and discussion
To demonstrate the efficacy and utility of our framework, we conduct a series of tests with several real data sets. The experimental settings, data sets used and performance evaluation results are given in the following sub-sections.
We use two public large-scale yeast AP-MS data sets, Gavin and Krogan, whose raw experimental data were downloaded from http://interactome-cmp.ucsf.edu/. In addition, we use a larger combined data set, which is generated by integrating the purifications from the above two data sets. The relevant information on these three data sets is summarized in Table 1.
Although many databases have been constructed for storing PPIs from different species (e.g. [23–25]), there is still no comprehensive gold standard set for direct protein interactions in the literature. Here we follow Schelhorn et al. and use three reference sets of experimentally validated binary protein interactions for performance assessment in the experiments. These reference sets are denoted by Y2H, PCA, and BGS, respectively. The first two sets are collections of binary protein interactions experimentally determined with the Y2H technique and the PCA technique. The third reference set is composed of manually curated yeast interactions supported by the literature and is taken from an extensive validation of the Y2H method. The Y2H and BGS reference sets were downloaded from http://www.sciencemag.org/content/322/5898/104/suppl/DC1 and the PCA reference set was obtained from http://www.sciencemag.org/cgi/content/full/1153878/DC1.
Performance evaluation results
To quantify the effectiveness of the initialization-and-refinement framework, we compare the initial network and the filtered network to check whether more direct interactions are reported on each data set. The experimental results on the Gavin data, Krogan data, and Combined data are given in Figs. 2, 3, and 4, respectively. As shown in these figures, our two-step method identifies more experimentally validated direct interactions than the corresponding initial scoring method in most cases. To illustrate this quantitatively, we calculate the normalized AUC (area under the curve) value as an overall performance indicator for each method. Here the normalized AUC value is defined as the quotient of the AUC value and xmax×ymax, where xmax and ymax are the maximal values on the x-axis and y-axis, respectively. The experimental results on the three data sets in terms of normalized AUC values are summarized in Tables 2, 3, and 4, respectively. As shown in these tables, the proposed procedure boosts the performance of the initial networks in terms of normalized AUC values in most cases.
To make the discussion easier to follow, we take the experimental results on the Gavin data set as an example. Table 2 compares the different methods, listing a total of 24 pairs of normalized AUC values. In the table, ↑ marks the cases in which the normalized AUC value increases for the curve produced by our proposed framework, whereas ↓ marks the cases without any improvement. Notably, among the 24 pairs of experiments for both ND and Silencer, 21 pairs show an improvement from the filtered network, versus only 3 pairs with no improvement. Accordingly, our two-step framework helps identify more direct interactions than the traditional scoring methods.
Similar conclusions can be drawn from the experimental results on the Krogan data set and the Combined data set. More precisely, ND and Silencer provide at least 19 improved cases in both Tables 3 and 4. Moreover, compared to the traditional scoring methods in terms of normalized AUC, the smallest improvements of the filtered network are 0.02 in Table 3 and 0.01 in Table 4, and the largest improvements are 0.33 and 0.27, respectively. Meanwhile, in the cases where no performance improvement is achieved, the decrease in performance is almost negligible. This means that it is safe to apply our framework to boost the performance of existing scoring methods for inferring direct PPI networks. In other words, the proposed procedure improves the robustness (i.e., reduces the variance) of the final results across a variety of scoring methods.
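The normalized AUC defined above can be sketched as follows. The axis choice (x: number of top-ranked PPIs considered, y: number of those found in a reference set), the trapezoidal rule, and the names `hits_curve` and `normalized_auc` are assumptions for illustration; the source only specifies the division by xmax×ymax.

```python
def hits_curve(ranked_pairs, reference):
    """Cumulative count of reference interactions among the top-k pairs.

    ranked_pairs: list of pairs ordered from highest to lowest score.
    reference: set of validated direct interactions.
    Returns (xs, ys), both starting at 0.
    """
    xs, ys, hits = [0], [0], 0
    for k, pair in enumerate(ranked_pairs, start=1):
        if pair in reference:
            hits += 1
        xs.append(k)
        ys.append(hits)
    return xs, ys


def normalized_auc(xs, ys):
    """Trapezoidal area under the curve divided by xmax * ymax."""
    x_max, y_max = max(xs), max(ys)
    if x_max <= 0 or y_max <= 0:
        raise ValueError("curve must reach positive x and y values")
    points = list(zip(xs, ys))
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return area / (x_max * y_max)


# Toy ranking of four candidate pairs against three validated interactions.
ranked = ["A-B", "A-C", "B-C", "A-D"]
validated = {"A-B", "B-C", "A-D"}
xs, ys = hits_curve(ranked, validated)
score = normalized_auc(xs, ys)
print(xs, ys, round(score, 4))
```

A ranking that places validated interactions earlier yields a curve that rises sooner and therefore a larger normalized AUC, which is how an initial network and its filtered counterpart can be compared at equal list lengths.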
To check the overlap among the PPIs generated from the same data set by different PPI scoring methods (SA, PE, DC and Hart) after the indirect interaction removal procedure, we plot three Venn diagrams in each of Figs. 5 and 6, with ND and Silencer used as the refinement algorithm, respectively. As shown in Fig. 5, we let each scoring method report approximately 10,000 PPIs on each data set. Among these PPIs, the number of PPIs that are reported by all four methods is 4029 (Gavin), 1732 (Krogan) and 1953 (Combined data), respectively. Moreover, the number of PPIs that are reported by only one method ranges from 1427 to 4622. Therefore, the results obtained by different methods are very diverse. Similar conclusions can be drawn from the Venn diagrams in Fig. 6 as well. This indicates that our proposed framework is applicable to different scenarios.
To illustrate why significant improvements are observed when some scoring methods are used in the first phase, we present the top-10 ranked PPIs and other related details in Additional file 1: Tables S1–S24. When ND is used in the second phase, Tables 2, 3, and 4 show that the refinement procedure consistently achieves a significant performance improvement over the PE method. This is because many “true PPIs” with low initial ranks in Additional file 1: Tables S2, S6, S8 and S10 are re-ranked to be among the 10 highest ranked interactions. In contrast, the initial top-10 PPIs and the top-10 PPIs after refinement are almost the same when the SA method is used in the initialization phase. As a result, the performance improvement is less visible in Tables 2, 3, and 4 when SA is used as the initial scoring method. When Silencer is used in the second phase, similar conclusions can be drawn from Additional file 1: Tables S13–S24.
Overall, our two-step method generally performs better than the corresponding component scoring method on each data set. This indicates that the proposed framework is effective in inferring direct PPIs from AP-MS data. Moreover, the refinement step provides a significant performance gain when the results generated in the first step are not good enough. Hence, the two-step framework is of considerable practical value and opens a new route to discovering PPI networks composed only of direct interactions.
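The quantities read off such Venn diagrams can be computed directly from the reported PPI sets. The sketch below is an assumption about how one might do this; the function name `overlap_summary`, the string encoding of pairs and the toy sets are hypothetical and do not reproduce the data behind Figs. 5 and 6.

```python
from collections import Counter


def overlap_summary(results):
    """Summarize the overlap among PPI sets reported by several methods.

    results: dict mapping a method name to its set of reported PPIs.
    Returns the number of PPIs reported by every method and, for each
    method, the number of PPIs that no other method reports.
    """
    support = Counter(pair for pairs in results.values() for pair in set(pairs))
    n_methods = len(results)
    shared_by_all = sum(1 for count in support.values() if count == n_methods)
    unique = {method: sum(1 for pair in set(pairs) if support[pair] == 1)
              for method, pairs in results.items()}
    return {"shared_by_all": shared_by_all, "unique": unique}


# Toy PPI sets for the four scoring methods.
toy_results = {
    "SA": {"A-B", "A-C", "B-C"},
    "PE": {"A-C", "B-C", "B-D"},
    "DC": {"B-C", "C-D"},
    "Hart": {"B-C", "A-C"},
}
summary = overlap_summary(toy_results)
print(summary)
```

In the toy example only B-C is reported by all four methods, while SA, PE and DC each contribute one PPI that no other method reports.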
As AP-MS experiments have generated large amounts of data, it is critical to establish the genuine PPI network from the experimental data. Our two-step framework combines existing PPI scoring methods with network deconvolution techniques and achieves better performance than the traditional scoring methods on several AP-MS data sets. The framework is sufficiently general to incorporate any feasible method in each step, and therefore has the potential to handle different types of AP-MS data in future applications.
In future work, we will investigate optimization models that can infer the direct PPI network from AP-MS data in a single procedure. In addition, it is also critical to develop fast algorithms that can solve the network inference problem in linear time.
Abbreviations
AP-MS: Affinity purification-mass spectrometry
AUC: Area under the curve
BGS: Binary gold standard
ComPASS: Comparative Proteomic Analysis Software Suite
HGSCore: Hypergeometric Spectral Counts score
ISB: International Conference on Systems Biology
MiST: Mass Spectrometry Interaction Statistics
PCA: Protein fragment complementation assay
SAINT: Significance Analysis of the Interactome
Nesvizhskii AI. Computational and informatics strategies for identification of specific protein interaction partners in affinity purification mass spectrometry experiments. Proteomics. 2012; 12(10):1639–55.
Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006; 440(7084):637–43.
Yu H, Braun P, Yıldırım MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, et al. High-quality binary protein interaction map of the yeast interactome network. Science. 2008; 322(5898):104–10.
BT performed the implementations and drafted the manuscript. CZ and FG participated in the analysis of experimental results. ZH conceived the study and finalized the manuscript. All authors read and approved the final manuscript.
Supplementary Tables. This file provides the supplementary tables (Tables S1–S24) that illustrate why significant improvements are observed when some scoring methods are used in the first phase. We present the top-10 ranked PPIs detected from three data sets after the refinement procedure. Meanwhile, we also record the initial ranks of these top-10 ranked PPIs in these tables. Note that we list more than 10 PPIs in some tables such as Table S3 and Table S5 because there are top-k ranked PPIs (k >10) that have the same ranking score. (PDF 67 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Tian, B., Zhao, C., Gu, F. et al. A two-step framework for inferring direct protein-protein interaction network from AP-MS data. BMC Syst Biol 11 (Suppl 4), 82 (2017). https://doi.org/10.1186/s12918-017-0452-y