Volume 10 Supplement 2
Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2015: systems biology
Protein-protein interaction prediction based on multiple kernels and partial network with linear programming
- Lei Huang^{1},
- Li Liao^{1} and
- Cathy H. Wu^{1, 2}
https://doi.org/10.1186/s12918-016-0296-x
© The Author(s) 2016
Published: 1 August 2016
Abstract
Background
Prediction of de novo protein-protein interaction is a critical step toward reconstructing PPI networks, which is a central task in systems biology. Recent computational approaches have shifted from making PPI prediction based on individual pairs and single data source to leveraging complementary information from multiple heterogeneous data sources and partial network structure. However, how to quickly learn weights for heterogeneous data sources remains a challenge. In this work, we developed a method to infer de novo PPIs by combining multiple data sources represented in kernel format and obtaining optimal weights based on random walk over the existing partial networks.
Results
Our proposed method uses the Barker algorithm and the training data to construct a transition matrix that constrains how a random walk traverses the partial network. Multiple heterogeneous features for the proteins in the network are then combined into a weighted kernel fusion, which provides a new "adjacency matrix" for the whole network. This network may consist of disconnected components, but the fused matrix is required to comply with the transition matrix on the training subnetwork. This requirement is met by adjusting the weights to minimize the element-wise difference between the transition matrix and the weighted kernels. The minimization problem is solved by linear programming. The weighted kernel fusion is then transformed into a regularized Laplacian (RL) kernel to infer missing or new edges in the PPI network, which can potentially connect the previously disconnected components.
Conclusions
The results on synthetic data demonstrated the soundness and robustness of the proposed algorithms under various conditions. The results on real data show that the accuracies of PPI prediction for yeast data and human data, measured as AUC, are increased by up to 19 % and 11 % respectively, as compared to a control method without optimal weights. Moreover, the weights learned by our method, Weight Optimization by Linear Programming (WOLP), are very consistent with those learned by sampling, and can provide insights into the relations between PPIs and various feature kernels, thereby improving PPI prediction even for disconnected PPI networks.
Keywords
Protein interaction network, Network inference, Interaction prediction, Random walk, Linear programming
Background
Protein-protein interaction (PPI) plays an essential role in many cellular processes. In order to have a better understanding of intracellular signaling pathways, modeling of protein complex structures and elucidating various biochemical processes, many high-throughput experimental methods, such as yeast two-hybrid system and mass spectrometry method, have been used to uncover protein interactions. However, these methods are known to be prone to having high false-positive rates, besides their high cost. Therefore, great efforts have been made to develop efficient and accurate computational methods for PPI prediction.
Many computational approaches based on pair-wise biological similarity have been developed to predict whether a given pair of proteins interact, using properties such as sequence homology, gene co-expression, phylogenetic profiles, three-dimensional structural information, etc. [1–7]. However, without first principles to tell deterministically whether two given proteins interact, pair-wise biological similarity based on various features and attributes can run out of predictive power, as the signals may be weak, noisy, or inconsistent. This can present serious issues even for methods based on integrated heterogeneous pair-wise features, e.g. genomic features, semantic similarities, etc. [8–11].
To circumvent the limitations of pair-wise biological similarity, pair-wise topological features have been used to measure the similarity of any given node pair and to predict PPIs for the corresponding proteins [12–15], once a PPI network is constructed with nodes representing proteins and edges representing interactions. Moreover, to go beyond these node-centric topological features and involve the whole network structure, variants of random walk [16] based methods [17–19] have been developed, but the computational cost of these methods increases by N times for all-against-all PPI prediction. Thus many kernels on networks for link prediction and semi-supervised classification have been systematically studied [20], which can measure the random-walk distance for all node pairs at once. But neither the variants of random walk nor random walk based kernels perform well in detecting interacting proteins when the direct edge connecting them in the network is removed and the remaining path connecting them is long [20]. Besides, instead of computing proximity measures between nodes from the network structure explicitly, many latent features based on rank reduction and spectral analysis have been used for prediction, such as geometric de-noising methods [1, 21], multi-way spectral clustering [22], and matrix factorization based methods [23, 24]. In most cases, the prediction task of these methods reduces to a convex optimization problem whose objective function must be carefully designed to ensure fast convergence and avoid being stuck in local optima. Furthermore, biological features and topological features can supplement each other to improve prediction performance, for example by assigning weights to edges in the network based on pair-wise biological similarity scores. Then, methods based on explicit or latent features, such as supervised random walk [19] or matrix factorization [23, 24], can be applied to the weighted network to make predictions based on multi-modal biological sources. However, these methods use only the pair-wise features of edges that already exist in the PPI network, even though, from a PPI prediction perspective, it is particularly useful to incorporate pair-wise features for node pairs that are not currently linked by a direct edge but would be if a new edge (PPI) were predicted.
Therefore, it is of great interest to infer the PPI network directly from multi-modal biological feature kernels that involve all node pairs. Doing so can not only improve prediction performance but also provide insights into the relations between PPIs and various similarity features of protein pairs. Yamanishi et al. [25] developed a method based on kernel canonical correlation analysis (CCA) to infer PPI networks from multiple types of genomic data. However, in that work all genomic kernels are simply added together, with no weights to regulate the contribution of these heterogeneous and potentially noisy data sources towards PPI prediction. Meanwhile, it seems that the partial network needed for supervised learning based on kernel CCA needs to be sufficiently large, e.g., a leave-one-out cross validation is used, to attain good performance. In Huang et al. [26] the weights for different data sources are optimized using a sampling based method, ABC-DEP, which is computationally demanding.
In this paper, we propose a new method to infer de novo PPIs by combining multiple data sources represented in kernel format and obtaining optimal weights based on random walk over the existing partial network. The novelty of the method lies in using the Barker algorithm to construct the transition matrix for the training subnetwork and finding the optimal weights by linear programming, which minimizes the element-wise difference between the transition matrix and the adjacency matrix, i.e., the weighted kernel built from multiple heterogeneous data. Then we apply the regularized Laplacian (RL) kernel to the weighted kernel to infer missing or new edges in the PPI network. A preliminary version of this work was described in [27]. Relative to that paper, the current work adds an extension that handles interaction prediction for PPI networks consisting of disconnected components, and new results on the human PPI network, which is much sparser than the yeast PPI network. Our method can circumvent the issue of unbalanced data faced by many machine learning methods in bioinformatics by training on only a small partial network. Our method works particularly well in detecting interactions between nodes that are far apart in the network.
Methods
Problem definition
Formally, a PPI network can be represented as a graph G=(V,E), where i and j are two nodes in the node set V, and (i,j)∈E represents an edge between i and j. The graph is called connected if there is a path of edges connecting any two nodes in the graph. Given that many PPI networks are not connected and have many connected components of various sizes, we select a large connected component (e.g. the largest connected component) as the golden standard network for supervised learning. Specifically, by adopting the same setting as in [26], we divide the golden standard network into three parts: connected training network G _{ tn }=(V,E _{ tn }), validation set G _{ vn }=(V _{ vn },E _{ vn }) and testing set G _{ tt }=(V _{ tt },E _{ tt }), such that E=E _{ tn }∪E _{ vn }∪E _{ tt }, and any edge in G belongs to exactly one of these three parts.
To ensure good inference, it is important to learn optimal weights for G _{ tn } and various K _{ i } to build kernel fusion K _{ fusion }. Otherwise, given the multiple heterogeneous kernels from different data sources, the kernel fusion without optimized weights is likely to generate erroneous inference on PPI.
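As a rough illustration of the two steps named here, the following Python sketch builds a weighted kernel fusion and applies a regularized Laplacian kernel to it. It is a sketch under stated assumptions, not the authors' implementation: the fusion is taken to be the weighted sum W _{0}G _{ tn } + Σ W _{ i }K _{ i } (the same form as K _{ OPT } in the synthetic experiments), the RL kernel is taken in its commonly used form (I + αL)^{-1} with L = D − A, and the function names, the value of α and the toy matrices are all hypothetical.

```python
def fuse_kernels(weights, kernels):
    """Weighted sum of equally sized square matrices (lists of lists)."""
    n = len(kernels[0])
    fused = [[0.0] * n for _ in range(n)]
    for w, k in zip(weights, kernels):
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * k[i][j]
    return fused


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def regularized_laplacian(adjacency, alpha=0.5):
    """RL kernel (I + alpha * L)^-1 with L = D - A (assumed standard form)."""
    n = len(adjacency)
    system = [[0.0] * n for _ in range(n)]
    for i in range(n):
        degree = sum(adjacency[i])
        for j in range(n):
            laplacian_ij = (degree if i == j else 0.0) - adjacency[i][j]
            system[i][j] = (1.0 if i == j else 0.0) + alpha * laplacian_ij
    return invert(system)


# Toy data (hypothetical): a 3-node path as G_tn and one feature kernel
# that weakly links the two end nodes.
g_tn = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
k_feat = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
fused = fuse_kernels([1.0, 0.5], [g_tn, k_feat])
rl = regularized_laplacian(fused, alpha=0.5)
```

In this toy case the entry `rl[0][2]` scores the node pair (0, 2), which has no direct edge in `g_tn`; ranking such entries is how candidate edges would be proposed.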
Weight optimization with linear programming (WOLP)
This stationary distribution provides constraints for optimizing the weights. For example, the positive training examples (nodes that are closer to the start node s) should have higher probability than the negative training examples (nodes that are far away from s). In Backstrom et al. [19], this is used as a constraint in minimizing the L2 norm of the weights. A gradient descent optimization method is adopted there to obtain the optimal weights, and only the pair-wise features of existing edges in the network are used, which means Q(i,j) is nonzero only for an edge (i,j) that already exists in the training network. To leverage more information from multiple heterogeneous sources, in our case Q(i,j), as defined in Eq. (4), is nonzero unless there are no features for the pair (i,j) in any kernel K _{ a }. Having many nonzero elements in Q makes it much more difficult for traditional gradient descent optimization to converge and to find the global optimum.
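The abstract states that the transition matrix on the training subnetwork is constructed with the Barker algorithm, but the corresponding equations are not reproduced in this section. The sketch below therefore shows only the textbook Barker rule, as an assumption about the general idea: given a proposal q and a target distribution π, a move from i to j is accepted with probability π _{ j }q _{ ji } / (π _{ i }q _{ ij } + π _{ j }q _{ ji }), which yields a chain whose stationary distribution is π. The uniform-over-neighbours proposal, the function name and the toy target distribution are hypothetical.

```python
def barker_transition_matrix(adjacency, target):
    """Transition matrix whose stationary distribution is `target`.

    Assumes a symmetric 0/1 adjacency matrix. The proposal is uniform over
    the neighbours of the current node; the Barker acceptance probability is
    pi_j*q_ji / (pi_i*q_ij + pi_j*q_ji).
    """
    n = len(adjacency)
    degree = [sum(1 for x in row if x) for row in adjacency]
    transition = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i][j]:
                q_ij = 1.0 / degree[i]
                q_ji = 1.0 / degree[j]
                forward = target[i] * q_ij
                backward = target[j] * q_ji
                transition[i][j] = q_ij * backward / (forward + backward)
        # Rejected proposals keep the walker at node i.
        transition[i][i] = 1.0 - sum(transition[i])
    return transition


# Toy data (hypothetical): a 4-node path; nodes nearer to node 0 are given
# higher target probability, mimicking "closer to the start node".
adjacency = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
target = [0.4, 0.3, 0.2, 0.1]
P = barker_transition_matrix(adjacency, target)
```

Because the rule satisfies detailed balance, π _{ i }P(i,j) = π _{ j }P(j,i) for every pair, so the desired ordering of node probabilities is encoded directly in the matrix.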
As the number of elements in the transition matrix is typically much larger than the number of weights, Eq. (9) provides more equations than the number of variables, making it an overdetermined linear equation system. This overdetermined linear equation system can be solved with linear programming using standard programs in [33, 34].
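One standard way to pose such an overdetermined system as a linear program is to minimize the sum of absolute residuals with auxiliary slack variables. The formulation below is a sketch under assumptions, since Eq. (9) is not reproduced in this section: the symbols A, b, t and m are introduced here for illustration only, with b ∈ ℝ^{m} stacking the training-subnetwork entries of the transition matrix, column a of A ∈ ℝ^{m×(n+1)} stacking the corresponding entries of the a-th kernel, and the nonnegativity of the weights being an additional assumption.

\[
\begin{aligned}
\min_{W,\,t}\quad & \sum_{k=1}^{m} t_{k} \\
\text{s.t.}\quad & -t_{k} \;\le\; \sum_{a=0}^{n} A_{ka} W_{a} - b_{k} \;\le\; t_{k}, \qquad k=1,\dots,m, \\
& W_{a} \ge 0, \qquad a=0,\dots,n .
\end{aligned}
\]

At the optimum each t _{ k } equals the absolute residual of equation k, so the objective is the element-wise (L1) difference, and the objective and all constraints are linear in (W, t).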
The optimized weights W ^{∗} can then be plugged back into Eq. (4) to form an optimal transition matrix for the whole set of nodes. The random walk from the source node using this optimal transition matrix leverages the information from multiple data sources and is expected to give more accurate predictions for missing and/or de novo links: nodes that are most frequented by the random walk are more likely, if not yet detected, to have a direct link to the source node. The formal procedure for solving this overdetermined linear system and inferring PPIs for a particular node is shown in Algorithm 1.
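The ranking idea in this paragraph can be sketched in Python as follows. This is a minimal sketch, not Algorithm 1: it assumes the walk is a random walk with restart at the source node (as in the supervised random walk setting of [19]), computes the stationary distribution by power iteration, and ranks nodes not yet linked to the source by visiting probability. The restart probability, function names and toy matrix are hypothetical.

```python
def row_normalize(matrix):
    """Turn a nonnegative weight matrix into a row-stochastic matrix."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total for x in row] if total > 0 else [0.0] * len(row))
    return out


def random_walk_with_restart(transition, start, restart=0.15,
                             tol=1e-12, max_iter=10000):
    """Stationary visiting probabilities of a walk that restarts at `start`."""
    n = len(transition)
    p = [0.0] * n
    p[start] = 1.0
    for _ in range(max_iter):
        nxt = [0.0] * n
        for i in range(n):
            if p[i] == 0.0:
                continue
            for j in range(n):
                nxt[j] += (1.0 - restart) * p[i] * transition[i][j]
        nxt[start] += restart
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy fused weight matrix (hypothetical values) on 4 nodes; node 0 is the source.
fused = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.2],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.2, 1.0, 0.0],
]
p = random_walk_with_restart(row_normalize(fused), start=0)
# Candidate partners: nodes without a direct link to the source, most visited first.
candidates = sorted((v for v in range(4) if v != 0 and fused[0][v] == 0.0),
                    key=lambda v: -p[v])
```

Here node 2 is visited more often than node 3, so it would be proposed first as a new partner of node 0.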
PPI prediction and network inference
As discussed in the Background section, a random walk from a single start node is not efficient for all-against-all prediction, especially for large and sparse PPI networks. Therefore, it would be of great interest if the weights learned by WOLP based on a single start node could also work network-wide. It is widely observed that many biological networks contain several hubs (i.e., nodes with high degree) [35]. Thus we extend our algorithm to all-against-all PPI inference by hypothesizing that the weights learned based on a start node with high degree can be reused by other nodes. We verify this hypothesis by doing all-against-all PPI inference on real PPI networks.
We design a supervised version of WOLP that can learn weights more accurately for large and sparse PPI networks. If the whole PPI network is connected, then it serves as the golden standard network itself; otherwise, the golden standard network used for supervised learning should be a large component of the disconnected PPI network. We divide the golden standard network into three parts: connected training network G _{ tn }=(V,E _{ tn }), validation set G _{ vn }=(V _{ vn },E _{ vn }) and testing set G _{ tt }=(V _{ tt },E _{ tt }), such that E=E _{ tn }∪E _{ vn }∪E _{ tt }, and any edge in G belongs to exactly one of these three parts. Then we use WOLP to learn weights based on G _{ tn } and G _{ vn }, and finally use G _{ tt } to verify the prediction capability of these weights. The main structure of our method is shown in Algorithm 2, and the supervised version of WOLP is shown in Algorithm 3. The while loop in Algorithm 3 is used to find the setting of D, L and mapping strategy (upper or lower) that generates the best weights W _{ opt } with respect to inferring G _{ tn } and G _{ vn }.
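The three-way division described above can be sketched as follows. How the authors construct the connected training network is not stated in this section, so the sketch makes an assumption: it first draws a random spanning tree (which guarantees that G _{ tn } is connected and covers all of V), then tops it up with random extra edges, and splits the remaining edges into validation and testing sets. The function name, parameters and toy graph are hypothetical.

```python
import random


def split_edges(nodes, edges, n_train, n_valid, seed=0):
    """Split edges into (train, valid, test) with a connected training set."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree, rest = [], []
    for u, v in shuffled:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v))      # edge joins two components: keep in tree
        else:
            rest.append((u, v))
    if len(tree) != len(nodes) - 1:
        raise ValueError("input network is not connected")
    extra = n_train - len(tree)
    if extra < 0:
        raise ValueError("n_train is too small for a connected training network")
    if extra + n_valid > len(rest):
        raise ValueError("not enough edges left for the validation set")
    train = tree + rest[:extra]
    valid = rest[extra:extra + n_valid]
    test = rest[extra + n_valid:]
    return train, valid, test


# Toy network (hypothetical): a 6-cycle with four chords, 10 edges in total.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (0, 2), (1, 3), (2, 4), (3, 5)]
train, valid, test = split_edges(nodes, edges, n_train=6, n_valid=2)
```

A spanning tree on |V| nodes needs |V|−1 edges, which is why a connected training network can be only slightly larger than the node count.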
Moreover, many existing network-level link prediction or matrix completion methods [1, 19, 21, 23, 24] work well only on connected PPI networks, and detecting interacting pairs in disconnected PPI networks has been a challenge for these methods. Our WOLP method can handle this problem effectively, because the various feature kernels can connect all the disconnected components of the originally disconnected PPI network. We believe that once the optimal weights have been learned from the training network generated from a large connected component (e.g. the largest connected component), they can also be used to build the kernel fusion when the prediction task scales up to the originally disconnected PPI network. To do so, we update Algorithm 2 to Algorithm 4, which shows the detailed process of interaction prediction for disconnected PPI networks. Given an originally disconnected network G, we first learn the optimal weights by Algorithm 3 based on a large connected component G _{ cc } of G. After that, we randomly divide the edge set E of the disconnected G into a training edge set G _{ tn } and a testing edge set G _{ tt }, use the previously learned optimal weights directly to linearly combine G _{ tn } and the corresponding feature kernels into the kernel fusion, and finally evaluate the performance by predicting G _{ tt }. Here we call G _{ tn } a training edge set, because G _{ tn } no longer needs to be connected to learn any weights.
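The first step of this procedure, picking the largest connected component G _{ cc } of a disconnected network, can be sketched with a breadth-first search. This is an illustrative helper rather than part of Algorithm 4 as published; the function name and the toy network are hypothetical.

```python
from collections import deque


def largest_connected_component(nodes, edges):
    """Return (node set, edge list) of the largest connected component."""
    neighbors = {v: set() for v in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, best = set(), set()
    for root in nodes:
        if root in seen:
            continue
        # Breadth-first search collects every node reachable from `root`.
        component = {root}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in neighbors[u]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    component_edges = [(u, v) for u, v in edges if u in best and v in best]
    return best, component_edges


# Toy network (hypothetical): a triangle, a separate edge and an isolated node.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
cc_nodes, cc_edges = largest_connected_component(nodes, edges)
```

Weights would then be learned on `cc_nodes` and `cc_edges`, while the final prediction uses the full node set, including nodes such as `f` that only the feature kernels can connect.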
Results and discussion
We examine the soundness and robustness of the proposed algorithms with use of both synthetic and real data. Our goal here is to demonstrate that the weights obtained by our method can help build a better kernel fusion leading to more accurate PPI prediction.
Experiments on single start node and synthetic data
Here, R _{5093} denotes a 5093 by 5093 random matrix with elements in [0,1], which can also be seen as a background noise matrix; J _{5093} denotes a 5093 by 5093 all-one matrix; rand _{ diff }(J _{5093},G _{ syn },ρ _{ i }) randomly generates a difference matrix between J _{5093} and G _{ syn } with density ρ _{ i } (if entry (i, j) is 1 in G _{ syn }, then entry (i, j) is 0 in the difference matrix); rand _{ sub }(G _{ syn },ρ _{ i }) generates a subnetwork of G _{ syn } with density ρ _{ i }; ρ _{ i } is different for each kernel; η is a positive parameter in [0,1]; and R _{5093} is regenerated for each kernel.
The general process of experimenting with synthetic data is as follows. We first generate the synthetic network G _{ syn } and the synthetic feature kernels K, and then divide the nodes V of G _{ syn } into D, L and M, where D and L serve as training nodes and M serves as testing nodes. Using G _{ syn }, the start node s and K, we obtain the stationary distribution p based on the optimized kernel fusion \( K_{OPT}(u,v) = W_{0}G_{syn}(u,v) + \sum \limits _{i=1}^{n} W_{i}K_{i}(u,v) \). Finally, we try to show that K _{ OPT } is better than the control kernel fusion \( K_{EW} = G_{syn} + \sum \limits _{i=1}^{n}K_{i} \) built with equal weights. This is the case if p(M) is more similar to p ^{′}(M), which is based on G _{ syn }, than p ^{″}(M), which is based on the control kernel fusion K _{ EW }, where p(M) denotes the rank of stationary probabilities with respect to the testing nodes M. We evaluate the rank similarity between the pairs (p(M),p ^{′}(M)) and (p ^{″}(M),p ^{′}(M)) by discounted cumulative gain (DCG) [38].
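For readers unfamiliar with DCG, the following sketch shows the standard definition, DCG@k = Σ _{ i=1..k } rel _{ i } / log _{2}(i+1), applied to comparing a predicted ranking against a reference ranking. How relevance scores are derived from p ^{′}(M) is not specified in this section, so the grading used here (a node in reference position r among the top k gets relevance k − r, counted from zero) and the normalization by the ideal DCG are assumptions, and the values it produces are not expected to match the table that follows. All names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Standard DCG@k: sum of rel_i / log2(i + 1) over the first k positions."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances[:k], start=1))


def rank_agreement(predicted, reference, k):
    """Normalized DCG@k of `predicted` against the `reference` ordering.

    Assumption: the node at zero-based reference position r < k has
    relevance k - r; every other node has relevance 0.
    """
    relevance = {node: k - pos for pos, node in enumerate(reference[:k])}
    gains = [relevance.get(node, 0) for node in predicted]
    ideal = dcg_at_k([relevance[node] for node in reference[:k]], k)
    return dcg_at_k(gains, k) / ideal if ideal else 0.0


# Toy rankings (hypothetical): the reference order and two predictions.
reference = ["a", "b", "c", "d"]
same = rank_agreement(["a", "b", "c", "d"], reference, k=3)
reverse = rank_agreement(["d", "c", "b", "a"], reference, k=3)
```

Under this grading, a prediction that reproduces the reference order scores 1, and the score drops as highly ranked reference nodes are pushed down the predicted list.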
DCG@20 of rank comparison
Repetition | DCG@20(p(M),p ^{′}(M)) | DCG@20(p ^{″}(M),p ^{′}(M)) |
---|---|---|
1 | 0.7101 | 0.6304 |
2 | 0.9305 | 0.4423 |
3 | 0.4035 | 0.2657 |
4 | 0.8524 | 0.5690 |
5 | 0.7256 | 0.4417 |
6 | 0.3683 | 0.3009 |
7 | 0.7707 | 0.2753 |
8 | 1.0034 | 0.3663 |
9 | 0.7119 | 0.4603 |
10 | 0.6605 | 0.6123 |
Experiments on network inference with real data
We use the yeast PPI network downloaded from DIP database (Release 20150101) [37] and the high-confidence human PPI network downloaded from PrePPI database [39] to test our algorithm.
Data and kernels of yeast PPI networks
For the yeast PPI network, some interactions without Uniprotkb ID have been filtered out in order to do name mapping and make use of genomic similarity kernels [40]. As a result, the originally disconnected PPI network contains 5093 proteins and 22,423 interactions. The largest connected component consists of 5030 proteins and 22,394 interactions, and is used to serve as the golden standard network.
Six feature kernels are included in PPI inference for the yeast data:
- G _{ tn }: the connected training network that provides connectivity information. It can also be thought of as a base network for the inference.
- K _{ Jaccard } [41]: measures the similarity of a protein pair i,j in terms of \(\frac {|neighbors(i) \cap neighbors(j)|}{|neighbors(i) \cup neighbors(j)|}\).
- K _{ SN }: measures the total number of neighbors of proteins i and j, K _{ SN }=|neighbors(i)|+|neighbors(j)|.
- K _{ B } [40]: a sequence-based kernel matrix generated using BLAST [42].
- K _{ E } [40]: a gene co-expression kernel matrix constructed entirely from microarray gene expression measurements.
- K _{ Pfam } [40]: a similarity measure derived from Pfam HMMs [43].
All these kernels are normalized to the scale of [0,1] in order to avoid bias.
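The two topological kernels in this list can be computed directly from the training network. The sketch below is illustrative only: the normalization to [0,1] is assumed to be min-max scaling, diagonal entries are left at zero, and the function names and toy network are hypothetical.

```python
def topology_kernels(nodes, edges):
    """Return (Jaccard, SN) matrices indexed in the order of `nodes`."""
    neighbors = {v: set() for v in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(nodes)
    jaccard = [[0.0] * n for _ in range(n)]
    sn = [[0.0] * n for _ in range(n)]
    for a, u in enumerate(nodes):
        for b, v in enumerate(nodes):
            if a == b:
                continue  # diagonal left at zero (assumption)
            union = neighbors[u] | neighbors[v]
            if union:
                jaccard[a][b] = len(neighbors[u] & neighbors[v]) / len(union)
            sn[a][b] = float(len(neighbors[u]) + len(neighbors[v]))
    return jaccard, sn


def min_max_scale(matrix):
    """Rescale all entries to [0, 1] (assumed normalization)."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0] * len(row) for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


# Toy training network (hypothetical).
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
k_jaccard, k_sn = topology_kernels(nodes, edges)
k_sn_scaled = min_max_scale(k_sn)
```

Note that K _{ Jaccard } already lies in [0,1], whereas K _{ SN } grows with node degree and therefore needs rescaling before it is combined with the other kernels.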
Data and kernels of human PPI networks
The originally disconnected human PPI network has 3993 proteins and 6669 interactions, and is much sparser than the yeast PPI network. The largest connected component, which serves as the golden standard network, contains 3285 proteins and 6310 interactions.
Eight feature kernels are included in PPI inference for the human data:
- G _{ tn }: the connected training network that provides connectivity information. It can also be thought of as a base network for the inference.
- K _{ Jaccard } [41]: measures the similarity of a protein pair i,j in terms of \(\frac {|neighbors(i) \cap neighbors(j)|}{|neighbors(i) \cup neighbors(j)|}\).
- K _{ SN }: measures the total number of neighbors of proteins i and j, K _{ SN }=|neighbors(i)|+|neighbors(j)|.
- K _{ B }: a sequence-based kernel matrix generated using BLAST [42].
- K _{ D }: a domain-based similarity kernel matrix measured by the method of neighborhood correlation [44].
- K _{ BP }: a biological process based semantic similarity kernel measured by Resnik with BMA [45].
- K _{ CC }: a cellular component based semantic similarity kernel measured by Resnik with BMA [45].
- K _{ MF }: a molecular function based semantic similarity kernel measured by Resnik with BMA [45].
PPI inference based on the largest connected component
Division of golden standard PPI networks
Species | G _{ tn } | G _{ vn } | G _{ tt } |
---|---|---|---|
Yeast | 5,030 nodes, 5,394 edges | 1,000 edges | 16,000 edges |
Human | 3,285 nodes, 3,310 edges | 300 edges | 2,700 edges |
With the weights learned by WOLP using the i-th hub as the start node, we build the kernel fusion WOLP-K-i by Eq. (2). PPI network inference is made by the RL kernel, Eq. (3), and named RL _{ WOLP-K-i }, i=1,2,3. The performance of inference is evaluated by how well the testing set G _{ tt } is recovered. Specifically, all node pairs are ranked in decreasing order by their edge weights in the RL matrix; edges in the testing set G _{ tt } are labeled as positive and node pairs with no edges in G are labeled as negative. An ROC curve is plotted for true positives versus false positives by running down the ranked list of node pairs. For comparison, besides the PPI inferences RL _{ WOLP-K-i }, i=1,2,3 learned by our WOLP, we also include two other PPI network inferences: \( {RL}_{G_{tn}} \) and RL _{ EW-K }, where \( {RL}_{G_{tn}} \) indicates that the RL based PPI inference is made solely from the training network G _{ tn }, and RL _{ EW-K } indicates that the RL based PPI inference is made from the kernel fusion built with equal weights, i.e. w _{ i }=1, i=0,1...n. Additionally, G _{ set }∼n indicates that the set G _{ set } contains n edges, e.g. G _{ tn }∼5394 means the connected training network G _{ tn } contains 5394 edges.
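The evaluation protocol described above can be sketched as a small AUC computation: testing edges are positives, node pairs without any edge in G are negatives, and both are ranked by their RL score. The sketch uses the rank-sum (Mann-Whitney) formulation of the area under the ROC curve, which is a standard equivalent rather than necessarily the authors' code; the function name and the toy scores are hypothetical.

```python
def auc_from_scores(scores, positives, negatives):
    """Area under the ROC curve via the rank-sum statistic.

    `scores` maps a node pair to its RL kernel entry; tied scores share
    the average rank.
    """
    labeled = [(scores[pair], 1) for pair in positives]
    labeled += [(scores[pair], 0) for pair in negatives]
    labeled.sort(key=lambda item: item[0])
    rank_sum = 0.0
    i = 0
    while i < len(labeled):
        j = i
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        rank_sum += average_rank * sum(label for _, label in labeled[i:j])
        i = j
    n_pos, n_neg = len(positives), len(negatives)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy RL scores (hypothetical) for five node pairs.
scores = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.3, (0, 3): 0.4, (1, 3): 0.1}
test_edges = [(0, 1), (0, 3)]            # positives: edges in G_tt
non_edges = [(0, 2), (1, 2), (1, 3)]     # negatives: pairs with no edge in G
auc = auc_from_scores(scores, test_edges, non_edges)
```

In this toy case five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6.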
Comparison of AUCs for yeast PPI prediction
Rep | Avg AUC(RL _{ WOLP-K-1∼10}) | AUC(\({RL}_{G_{tn}}\)) | AUC(RL _{ EW-K }) |
---|---|---|---|
1 | 0.8367 ± 0.0134 | 0.7127 | 0.6976 |
2 | 0.7937 ± 0.0584 | 0.7768 | 0.7014 |
3 | 0.7802 ± 0.0545 | 0.7732 | 0.7009 |
4 | 0.7811 ± 0.0507 | 0.7406 | 0.7029 |
5 | 0.8349 ± 0.0301 | 0.7477 | 0.6991 |
6 | 0.8160 ± 0.0492 | 0.7180 | 0.7091 |
7 | 0.7670 ± 0.0636 | 0.7513 | 0.6992 |
8 | 0.8018 ± 0.0539 | 0.7739 | 0.7042 |
9 | 0.7989 ± 0.0552 | 0.7302 | 0.7017 |
10 | 0.8172 ± 0.0388 | 0.7387 | 0.6953 |
Comparison of AUCs for human PPI prediction
Rep | Avg AUC(RL _{ WOLP-K-1∼10}) | AUC(\({RL}_{G_{tn}}\)) | AUC(RL _{ EW-K }) |
---|---|---|---|
1 | 0.8871 ± 0.0122 | 0.8228 | 0.7823 |
2 | 0.8986 ± 0.0144 | 0.8106 | 0.8127 |
3 | 0.8988 ± 0.0088 | 0.8216 | 0.8088 |
4 | 0.8955 ± 0.0114 | 0.8161 | 0.8142 |
5 | 0.8994 ± 0.0089 | 0.8190 | 0.8088 |
6 | 0.8875 ± 0.0182 | 0.7927 | 0.8067 |
7 | 0.8904 ± 0.0237 | 0.8302 | 0.8096 |
8 | 0.8978 ± 0.0121 | 0.8205 | 0.8153 |
9 | 0.9011 ± 0.0101 | 0.7995 | 0.8130 |
10 | 0.8818 ± 0.0281 | 0.8078 | 0.8104 |
Effects of the training data
Effects of training data size on prediction performance (AUC) for yeast
Method | G _{ tt }∼15000 | G _{ tt }∼14000 | G _{ tt }∼13000 |
---|---|---|---|
\( {RL}_{ {WOLP-K-1}:G_{tn}\sim 5394} \) | 0.8658 | - | - |
\( {RL}_{G_{tn}\sim 7394} \) | 0.7931 | - | - |
\( {RL}_{ {EW-K}:G_{tn}\sim 5394} \) | 0.7519 | - | - |
\( {RL}_{ {WOLP-K-1}:G_{tn}\sim 5394} \) | - | 0.8659 | - |
\( {RL}_{G_{tn}\sim 8394} \) | - | 0.8538 | - |
\( {RL}_{ {EW-K}:G_{tn}\sim 5394} \) | - | 0.7537 | - |
\( {RL}_{ {WOLP-K-1}:G_{tn}\sim 5394} \) | - | - | 0.8659 |
\( {RL}_{G_{tn}\sim 9394} \) | - | - | 0.8619 |
\( {RL}_{ {EW-K}:G_{tn}\sim 5394} \) | - | - | 0.7520 |
Effects of training data size on prediction performance (AUC) for human
Method | G _{ tt }∼2600 | G _{ tt }∼2100 | G _{ tt }∼1600 |
---|---|---|---|
\( {RL}_{ {WOLP-K-1}:G_{tn}\sim 3310} \) | 0.9277 | - | - |
\( {RL}_{G_{tn}\sim 3710} \) | 0.8359 | - | - |
\( {RL}_{ {EW-K}:G_{tn}\sim 3310} \) | 0.8590 | - | - |
\( {RL}_{ {WOLP-K-1}:G_{tn}\sim 3310} \) | - | 0.9305 | - |
\( {RL}_{G_{tn}\sim 4210} \) | - | 0.8779 | - |
\( {RL}_{ {EW-K}:G_{tn}\sim 3310} \) | - | 0.8620 | - |
\( {RL}_{ {WOLP-K-1}:G_{tn}\sim 3310} \) | - | - | 0.9338 |
\( {RL}_{G_{tn}\sim 4710} \) | - | - | 0.9227 |
\( {RL}_{ {EW-K}:G_{tn}\sim 3310} \) | - | - | 0.8639 |
Detection of interacting pairs far apart in the network
The basic idea of using random walk or random walk based kernels [17–20] for PPI prediction is that good interacting candidates are usually not far away from the start node, e.g. only 2 or 3 edges away in the network. Consequently, the testing nodes have been chosen to be within a certain distance range, which largely contributes to the good performance reported by many network-level link prediction methods. In reality, however, a method that is good at detecting interacting pairs far apart in the network can be even more useful, for example in uncovering cross talk between pathways that are not nearby in the PPI network.
Detection of interacting pairs for disconnected PPI networks
Analysis of weights
Conclusion
In this work we developed a novel and fast optimization method using linear programming to integrate multiple heterogeneous data for the PPI inference problem. The proposed method, verified with synthetic data and tested with the DIP yeast PPI network and the PrePPI high-confidence human PPI network, enables quick and accurate inference of PPI networks from topological and genomic feature kernels in an optimized integrative way. Compared to the baselines (G _{ tn } and EW-K), our WOLP method achieved a performance improvement in PPI prediction with over 19 % higher AUC on yeast data and 11 % higher AUC on human data, and this margin is maintained even when the control methods use a significantly larger training set. We also demonstrated that, by integrating topological and genomic features into the regularized Laplacian kernel, the method avoids the short-range problem encountered by random-walk based methods, namely that the inference becomes less reliable for nodes that are far from the start node of the random walk, and shows obvious improvements in predicting faraway interactions. The weights learned by our WOLP are highly consistent with the weights learned by the sampling based method, which can provide insights into the relations between PPIs and various similarity features of protein pairs, thereby helping us make good use of these features. Moreover, we further demonstrated that those relations are also maintained when the golden standard network (the largest connected component) scales up to the original PPI network that consists of disconnected components. That is to say, the weights learned based on the connected training subnetwork of the largest connected component can also help to detect interactions in the originally disconnected PPI networks effectively and accurately. As more protein features are collected from various -omics studies, they can be used to characterize protein pairs in terms of feature kernels from different perspectives. Thus we believe that our method provides a quick and accurate way to fuse various feature kernels from heterogeneous data, thereby improving PPI prediction.
Declarations
Publication of this article is funded by Delaware INBRE program, with grant from the National Institute of General Medical Sciences-NIGMS (8 P20 GM103446-12) from the National Institutes of Health. This article has been published as part of BMC Systems Biology Vol 10 Suppl 2 2016: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2015: systems biology. The full contents of the supplement are available online at http://bmcsystbiol.biomedcentral.com/articles/supplements/volume-10-supplement-2.
Authors’ contributions
LH designed the algorithm and experiments, and performed all calculations and analyses. LL and CHW aided in interpretation of the data and preparation of the manuscript. LH wrote the manuscript, LL and CHW revised it. LL and CHW conceived of this study. All authors have read and approved this manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Kuchaiev O, Rašajski M, Higham DJ, Pržulj N. Geometric de-noising of protein-protein interaction networks. PLoS Comput Biol. 2009; 5(8):1000454.View ArticleGoogle Scholar
- Murakami Y, Mizuguchi K. Homology-based prediction of interactions between proteins using averaged one-dependence estimators. BMC Bioinformatics. 2014; 15(1):213.PubMedPubMed CentralView ArticleGoogle Scholar
- Salwinski L, Eisenberg D. Computational methods of analysis of protein–protein interactions. Curr Opin Struct Biol. 2003; 13(3):377–82.PubMedView ArticleGoogle Scholar
- Craig R, Liao L. Phylogenetic tree information aids supervised learning for predicting protein-protein interaction based on distance matrices. BMC Bioinformatics. 2007; 8(1):6.PubMedPubMed CentralView ArticleGoogle Scholar
- Gonzalez A, Liao L. Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines. BMC Bioinformatics. 2010; 11(1):537.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang QC, Petrey D, Deng L, Qiang L, Shi Y, Thu CA, Bisikirska B, Lefebvre C, Accili D, Hunter T, Maniatis T, Califano A, Honig B. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature. 2012; 490(7421):556–60.PubMedPubMed CentralView ArticleGoogle Scholar
- Singh R, Park D, Xu J, Hosur R, Berger B. Struct2net: a web service to predict protein–protein interactions using a structure-based approach. Nucleic Acids Res. 2010; 38(suppl 2):508–15.View ArticleGoogle Scholar
- Deng Y, Gao L, Wang B. ppipre: predicting protein-protein interactions by combining heterogeneous features. BMC Syst Biol. 2013; 7(Suppl 2):8.View ArticleGoogle Scholar
- Sun J, Sun Y, Ding G, Liu Q, Wang C, He Y, Shi T, Li Y, Zhao Z. Inpreppi: an integrated evaluation method based on genomic context for predicting protein-protein interactions in prokaryotic genomes. BMC Bioinformatics. 2007; 8(1):414.PubMedPubMed CentralView ArticleGoogle Scholar
- Cho YR, Mina M, Lu Y, Kwon N, Guzzi P. M-finder: Uncovering functionally associated proteins from interactome data integrated with go annotations. Proteome Sci. 2013; 11(Suppl 1):3.View ArticleGoogle Scholar
- Jung SH, Jang WH, Han DS. A computational model for predicting protein interactions based on multidomain collaboration. IEEE/ACM Trans Comput Biol Bioinformatics. 2012; 9(4):1081–90.View ArticleGoogle Scholar
- Chen HH, Gou L, Zhang XL, Giles CL. Discovering missing links in networks using vertex similarity measures. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing. SAC ’12. New York, NY, USA: ACM: 2012. p. 138–43.Google Scholar
- Lü L, Zhou T. Link prediction in complex networks: A survey. Physica A. 2011; 390(6):11501170.View ArticleGoogle Scholar
- Lei C, Ruan J. A novel link prediction algorithm for reconstructing protein-protein interaction networks by topological similarity. Bioinformatics. 2012. doi:https://doi.org/10.1093/bioinformatics/bts688. http://bioinformatics.oxfordjournals.org/content/early/2012/12/11/bioinformatics.bts688.full.pdf+html.
- Pržulj N. Protein-protein interactions: Making sense of networks via graph-theoretic modeling. BioEssays. 2011; 33(2):115–23.
- Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab (November 1999). Previous number = SIDL-WP-1999-0120. http://ilpubs.stanford.edu:8090/422/.
- Tong H, Faloutsos C, Pan JY. Random walk with restart: fast solutions and applications. Knowl Inform Syst. 2008; 14(3):327–46.
- Li RH, Yu JX, Liu J. Link prediction: The power of maximal entropy random walk. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. CIKM ’11. New York, NY, USA: ACM: 2011. p. 1147–56. doi:https://doi.org/10.1145/2063576.2063741.
- Backstrom L, Leskovec J. Supervised random walks: Predicting and recommending links in social networks. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. WSDM ’11. New York, NY, USA: ACM: 2011. p. 635–44.
- Fouss F, Francoisse K, Yen L, Pirotte A, Saerens M. An experimental investigation of kernels on graphs for collaborative recommendation and semisupervised classification. Neural Netw. 2012; 31:53–72.
- Cannistraci CV, Alanis-Lobato G, Ravasi T. Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding. Bioinformatics. 2013; 29(13):199–209.
- Symeonidis P, Iakovidou N, Mantas N, Manolopoulos Y. From biological to social networks: Link prediction based on multi-way spectral clustering. Data Knowl Eng. 2013; 87:226–42.
- Wang H, Huang H, Ding C, Nie F. Predicting protein–protein interactions from multimodal biological data sources via nonnegative matrix tri-factorization. J Comput Biol. 2013; 20(4):344–58. doi:https://doi.org/10.1089/cmb.2012.0273.
- Menon AK, Elkan C. Link prediction via matrix factorization. In: Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II. ECML PKDD’11. Berlin, Heidelberg: Springer: 2011. p. 437–52.
- Yamanishi Y, Vert JP, Kanehisa M. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics. 2004; 20(suppl 1):363–70.
- Huang L, Liao L, Wu CH. Inference of protein-protein interaction networks from multiple heterogeneous data. EURASIP J Bioinformatics Syst Biol. 2016; 2016(1):1–9. doi:https://doi.org/10.1186/s13637-016-0040-2.
- Huang L, Liao L, Wu CH. Protein-protein interaction network inference from multiple kernels with optimization based on random walk by linear programming. In: Proceedings of 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Washington, DC, USA: IEEE Computer Society: 2015. p. 201–7.
- Ito T, Shimbo M, Kudo T, Matsumoto Y. Application of kernels to link analysis. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. KDD ’05. New York, NY, USA: ACM: 2005. p. 586–92.
- Smola AJ, Kondor R. Kernels and Regularization on Graphs. In: Schölkopf B, Warmuth MK, editors. Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24–27, 2003. Proceedings. Berlin, Heidelberg: Springer: 2003. p. 144–58.
- Mantrach A, van Zeebroeck N, Francq P, Shimbo M, Bersini H, Saerens M. Semi-supervised classification and betweenness computation on large, sparse, directed graphs. Pattern Recognit. 2011; 44(6):1212–24.
- Pan JY, Yang HJ, Faloutsos C, Duygulu P. Automatic multimedia cross-modal correlation discovery. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’04. New York, NY, USA: ACM: 2004. p. 653–8.
- Baker J. An algorithm for the location of transition states. J Comput Chem. 1986; 7(4):385–95. doi:https://doi.org/10.1002/jcc.540070402.
- Paige CC, Saunders MA. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans Math Softw. 1982; 8(1):43–71.
- Fong DC-L, Saunders M. LSMR: An iterative algorithm for sparse least-squares problems. SIAM J Sci Comput. 2011; 33(5):2950–71.
- Barabási AL. Scale-free networks: A decade and beyond. Science. 2009; 325(5939):412–3. doi:https://doi.org/10.1126/science.1173299. http://science.sciencemag.org/content/325/5939/412.full.pdf.
- Kumar R, Raghavan P, Rajagopalan S, Sivakumar D, Tomkins A, Upfal E. Stochastic models for the web graph. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. FOCS ’00. Washington, DC, USA: IEEE Computer Society: 2000. p. 57.
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004; 32(90001):449–51.
- Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. New York, USA: Cambridge University Press; 2008.
- Zhang QC, Petrey D, Garzón JI, Deng L, Honig B. PrePPI: a structure-informed database of protein–protein interactions. Nucleic Acids Res. 2013; 41(D1):828–33. doi:https://doi.org/10.1093/nar/gks1231. http://nar.oxfordjournals.org/content/41/D1/D828.full.pdf+html.
- Lanckriet GRG, De Bie T, Cristianini N, Jordan MI, Noble WS. A statistical framework for genomic data fusion. Bioinformatics. 2004; 20(16):2626–35.
- Jaccard P. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles. 1901; 37:547–79.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.
- Sonnhammer ELL, Eddy SR, Durbin R. Pfam: A comprehensive database of protein domain families based on seed alignments. Proteins: Struct Funct Bioinformatics. 1997; 28(3):405–20.
- Song N, Joseph JM, Davis GB, Durand D. Sequence similarity network reveals common ancestry of multidomain proteins. PLoS Comput Biol. 2008; 4(5):1–19. doi:https://doi.org/10.1371/journal.pcbi.1000063.
- Resnik P. Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1. IJCAI’95. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc: 1995. p. 448–53. http://dl.acm.org/citation.cfm?id=1625855.1625914.
- Huang L, Liao L, Wu CH. Evolutionary model selection and parameter estimation for protein-protein interaction network based on differential evolution algorithm. IEEE/ACM Trans Comput Biol Bioinformatics. 2015; 12(3):622–31.
- Deng M, Mehta S, Sun F, Chen T. Inferring domain–domain interactions from protein–protein interactions. Genome Res. 2002; 12(10):1540–8.
- Itzhaki Z, Akiva E, Altuvia Y, Margalit H. Evolutionary conservation of domain-domain interactions. Genome Biol. 2006; 7(12):125.
- Park J, Lappe M, Teichmann SA. Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J Mol Biol. 2001; 307(3):929–38.
- Betel D, Isserlin R, Hogue CWV. Analysis of domain correlations in yeast protein complexes. Bioinformatics. 2004; 20(suppl 1):55–62.