Joint clustering of protein interaction networks through Markov random walk
- Yijie Wang^{1} and
- Xiaoning Qian†^{1, 2, 3}
https://doi.org/10.1186/1752-0509-8-S1-S9
© Wang and Qian; licensee BioMed Central Ltd. 2014
Published: 24 January 2014
Abstract
Biological networks obtained by high-throughput profiling or human curation are typically noisy. For functional module identification, single network clustering algorithms may not yield accurate and robust results. To borrow information across multiple sources and alleviate these data quality problems, we propose a new joint network clustering algorithm, ASModel. We construct an integrated network that combines network topological information from protein-protein interaction (PPI) datasets with homological information introduced by constituent similarity between proteins across networks. We develop a novel random walk strategy on the integrated network for joint network clustering and formulate an optimization problem that searches for low conductance sets defined on the derived transition matrix of the random walk, which fuses both topology and homology information. The optimization problem is solved by a derived spectral clustering algorithm. We apply ASModel and several state-of-the-art clustering algorithms both to PPI networks within the same species (two yeast PPI networks and two human PPI networks) and to PPI networks from different species (a yeast PPI network and a human PPI network). Experimental results demonstrate that ASModel outperforms the existing single network clustering algorithms as well as another recent joint clustering algorithm in terms of complex prediction and Gene Ontology (GO) enrichment analysis.
Introduction
Over the past decade, one goal of systems biology has been to understand how different molecules work together to maintain cellular functionalities [1, 2]. It is now commonly believed that many complex diseases, including cancer, are due to systems impairments caused not only by single genetic mutations but also by disruption of molecular interactions under different conditions, which have been conjectured to be the probable sources of disease heterogeneity as well as treatment response heterogeneity [3–5]. Hence, by analyzing large-scale gene expression profiles and protein-protein interaction (PPI) data, computational methods may help us better understand biological pathways and cellular organization, and in turn their relationships to diseases and potential drug responses [1, 2]. One way to investigate these large-scale data is to analyze them in the framework of network analysis [2]. In this paper, we focus on the analysis of PPI networks. We are interested in network clustering, which divides a given network into small parts that can be considered as potential functional modules or pathways [6–8], since biological functions are carried out by groups of genes and proteins in a coordinated way [9, 10].
There are many existing algorithms for clustering single PPI networks. The normalized cut (NCut) method [11] partitions the network based on a global criterion that contrasts the total dissimilarity across different clusters with the total similarity within clusters, both measured on the network topology. The formulation of NCut is equivalent to finding low conductance sets on the transition matrix of the Markov random walk on the network under analysis [12, 13]. The Markov CLustering algorithm (MCL) [14] detects clusters based on stochastic flow simulation and has been shown to be effective at clustering biological networks. Recently, an enhanced version of MCL, Regularized MCL (RMCL) [15, 16], has been proposed; it penalizes large clusters at each iteration of MCL to obtain more balanced clusters and has been shown to identify clusters with potential functional specificity more effectively.
However, it is well known that the current public PPI datasets are quite noisy and contain both false positive and false negative interactions for various technical reasons [17]. Therefore, clustering based on one network constructed from a single data source may not yield robust and accurate results. We may need to appropriately integrate multiple information sources so that they borrow strength from each other and suppress the noise in existing PPI datasets. AlignNemo [18] is one such recent effort, which detects network clusters on an alignment network of two given PPI networks. AlignNemo takes into account not only the network topology of the two PPI networks but also the homology information between proteins across the two networks. However, based on the reported experiments and our empirical findings, AlignNemo has low clustering coverage: the alignment network is constructed only from proteins that are similar by sequence, and proteins that do not appear in the alignment network are never considered for clustering.
In this paper, we propose a joint clustering algorithm based on a new Markov random walk on an integrated network, which is constructed by integrating protein-protein interactions in given PPI networks as well as homological interactions introduced by sequence similarity between proteins across networks. A novel alternative random walk strategy is proposed on the integrated network with the transition matrix integrating both topology and homology information. We formulate the joint clustering problem as searching for low conductance sets defined by this transition matrix. We then derive an approximate spectral solution algorithm for joint network clustering.
The rest of the paper is organized as follows. In section 2, we introduce the construction of the integrated network, the new random walk strategy, our final optimization problem formulation and the spectral algorithm for joint clustering. Section 3 contains experimental results on clustering two PPI networks within the same species (two yeast PPI networks and two human PPI networks, respectively) as well as two PPI networks from different species (one yeast and one human PPI network). Our experimental results demonstrate that our joint clustering algorithm, which we call ASModel, outperforms the state-of-the-art single network clustering algorithms as well as AlignNemo [18] in terms of both protein complex prediction and Gene Ontology (GO) enrichment analysis [19]. Finally, we draw our conclusions in section 4.
Methodology
Terminology
where BLAST(u_{ i }, v_{ j }) stands for the bit score of sequence similarity between proteins u_{ i } and v_{ j } by BLAST [20]. Based on (2), we note that S_{12}(u_{ i }, v_{ j }) is in the range [0, 1].
Integrated network
Examples of ${\mathcal{M}}_{T}$ and ${\mathcal{M}}_{H}$ are also illustrated in Figure 1A.
Random walk strategy on the integrated network
where $P_{A}=D_{A}^{-1}A$ and $P_{\bar{S}}=D_{\bar{S}}^{-1}\bar{S}$. The matrix D_{ A } is a diagonal matrix with the degree of each node on its diagonal. $\bar{S}=S+I_{N\times N}$ is the adjacency matrix of network ${\mathcal{M}}_{H}$ with self-loops indicating self similarity of proteins. $D_{\bar{S}}$ is the corresponding diagonal matrix with $D_{\bar{S}}(i,i)=\sum_{j}\bar{S}(i,j)$, where i, j ∈ {1, 2, ..., N} are new node indices in the integrated network and $\bar{S}(i,j)>0$ when i, j indicate proteins from different PPI networks. Again, $\bar{S}(i,i)=1$ for self similarity. Furthermore, P_{ A } is the transition matrix of the random walk on ${\mathcal{M}}_{T}$ and $P_{\bar{S}}$ is the transition matrix of the random walk on ${\mathcal{M}}_{H}$ including self-loops.
where $P_{S}=D_{S}^{-1}S$ and $P_{\bar{A}}=D_{\bar{A}}^{-1}\bar{A}$. Here, D_{ S } is a diagonal matrix with $D_{S}(i,i)=\sum_{j}S(i,j)$, and $\bar{A}$ is the adjacency matrix of ${\mathcal{M}}_{T}$ with self-loops, which allow the random walker to stay at the current node. $D_{\bar{A}}$ is the corresponding diagonal matrix with the node degrees in $\bar{A}$ on its diagonal. P_{ S } is the transition matrix of the random walk on ${\mathcal{M}}_{H}$ and $P_{\bar{A}}$ is the transition matrix of the random walk on ${\mathcal{M}}_{T}$ including self-loops.
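The four transition matrices defined above are all row normalizations of an adjacency matrix, with or without self-loops. The following minimal Python sketch illustrates this on a toy integrated network. The block layout of A and S (two proteins per network, similarity only across networks), the numeric values, and the helper names `row_normalize` and `add_self_loops` are illustrative assumptions, as is the choice to leave an all-zero row unchanged. How the matrices are combined into P is not shown here.

```python
def row_normalize(M):
    """Return D^{-1} M, where D is the diagonal matrix of the row sums of M."""
    P = []
    for row in M:
        total = sum(row)
        # Assumption: a node with no outgoing weight keeps an all-zero row.
        P.append([x / total if total > 0 else 0.0 for x in row])
    return P


def add_self_loops(M):
    """Return M + I, i.e. add a unit-weight self-loop to every node."""
    n = len(M)
    return [[M[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


# Toy integrated network: nodes 0, 1 come from network 1; nodes 2, 3 from network 2.
# A holds the topological (PPI) edges inside each network.
A = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
# S holds hypothetical sequence similarities, nonzero only across networks.
S = [[0.0, 0.0, 0.9, 0.1],
     [0.0, 0.0, 0.2, 0.8],
     [0.9, 0.2, 0.0, 0.0],
     [0.1, 0.8, 0.0, 0.0]]

P_A = row_normalize(A)                       # walk on M_T
P_S_bar = row_normalize(add_self_loops(S))   # walk on M_H with self-loops
P_S = row_normalize(S)                       # walk on M_H
P_A_bar = row_normalize(add_self_loops(A))   # walk on M_T with self-loops
```

Each matrix is row-stochastic on this toy input, so a product of two of them is again a valid transition matrix.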
Searching for low conductance sets based on P
In ${\mathcal{M}}_{T}$, proteins with topological interactions ${\mathcal{E}}_{T}$ are likely to participate in similar cellular functions. Likewise, proteins with stronger homological interactions ${\mathcal{E}}_{H}$ in ${\mathcal{M}}_{H}$ are more likely to be functionally similar. Because the random walk on the integrated network considers both types of interactions, each element P (i, j) of the corresponding transition matrix can be interpreted as a measure of how likely proteins i and j are to have similar functions: the larger P (i, j) is, the more likely the walker moves from one protein to the other. Based on this, we can use the concept of conductance defined on the Markov chain to identify clusters based on P [11, 21] by searching for low conductance sets.
where $\pi $ is the stationary distribution of the corresponding Markov random walk on the integrated network and ${P}^{T}\pi =\pi $.
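As an illustration of the quantities involved, the sketch below computes a stationary distribution satisfying $P^{T}\pi=\pi$ by power iteration and evaluates the conductance of a node set. The conductance formula used here (probability flow leaving the set divided by the stationary mass of the set) is one common definition and is an assumption of this sketch, not necessarily the exact form used in the paper. The toy matrix and the function names are hypothetical.

```python
def stationary_distribution(P, iters=10000, tol=1e-12):
    """Power iteration for pi with P^T pi = pi (assumes an ergodic chain)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def conductance(P, pi, subset):
    """Assumed definition: flow out of `subset` divided by its stationary mass."""
    inside = set(subset)
    n = len(P)
    flow_out = sum(pi[i] * P[i][j]
                   for i in inside for j in range(n) if j not in inside)
    mass = sum(pi[i] for i in inside)
    return flow_out / mass


# Hypothetical 3-state chain: states 0 and 1 mix quickly, state 2 is "sticky".
P = [[0.5, 0.4, 0.1],
     [0.3, 0.6, 0.1],
     [0.1, 0.1, 0.8]]

pi = stationary_distribution(P)
phi_pair = conductance(P, pi, {0, 1})   # set made of the two well-mixed states
phi_single = conductance(P, pi, {2})    # the sticky state on its own
```

Under this definition a set has low conductance when a walker started inside it (from the stationary distribution) rarely leaves in one step, which is the property the clustering formulation looks for.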
where $D_{\bar{P}}$ is a diagonal matrix with $D_{\bar{P}}(i,i)=\sum_{j}\bar{P}(i,j)$; X is an N × k assignment matrix whose element x_{ iℓ } denotes whether node i belongs to cluster ℓ; 1_{ k } and 1_{ N } are all-ones vectors with k and N elements, respectively. Equations (8) and (11) have been proven to be equivalent in [21]. We can derive a spectral method to solve the above problem based on [12]. The directed network with P and its equivalent undirected network with $\bar{P}$ are illustrated in Figure 1C.
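To make the idea of an equivalent undirected network concrete, the sketch below builds a symmetric matrix from a directed transition matrix. It assumes the random-walk symmetrization $(\Pi P+P^{T}\Pi)/2$ with $\Pi=\mathrm{diag}(\pi)$, which is one of the symmetrizations discussed in [21]; whether $\bar{P}$ in the paper is defined exactly this way is an assumption here. The toy matrix and function names are hypothetical.

```python
def stationary_distribution(P, iters=10000, tol=1e-13):
    """Power iteration for pi with P^T pi = pi (assumes an ergodic chain)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def symmetrize(P, pi):
    """Assumed symmetrization: entry (i, j) is (pi_i P_ij + pi_j P_ji) / 2."""
    n = len(P)
    return [[0.5 * (pi[i] * P[i][j] + pi[j] * P[j][i]) for j in range(n)]
            for i in range(n)]


# Hypothetical asymmetric transition matrix (a directed weighted network).
P = [[0.5, 0.4, 0.1],
     [0.3, 0.6, 0.1],
     [0.1, 0.1, 0.8]]

pi = stationary_distribution(P)
P_bar = symmetrize(P, pi)
# Node "degrees" of the undirected network, i.e. the diagonal of D_{P_bar}.
degrees = [sum(row) for row in P_bar]
```

With this choice the degree of node i in the undirected network equals $\pi_i$, so cuts and volumes measured on the symmetric matrix agree with probability flow and stationary mass on the directed chain.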
Joint Clustering Algorithm (ASModel)
Our joint clustering algorithm can be summarized in three steps, which are illustrated in Figure 1. The first step is to construct the integrated network $\mathcal{M}$. The second step is to compute the transition matrix P based on the alternative random walk strategy in (7). The final step is to find low conductance sets on the equivalent network by applying the spectral method to solve the optimization problem. Algorithm 1 provides the pseudo code for ASModel.
Algorithm 1. ASModel for Joint Network Clustering
Input: Adjacency matrices A_{1} and A_{2}, Sequence similarity matrix S_{12}, and the number of desired clusters k
Output: Cluster assignment matrix X
1. Construct the integrated network $\mathcal{M}$ and compute A and S;
2. Compute the transition matrix P based on the random walk strategy using (7);
3. Obtain the equivalent adjacency matrix $\bar{P}$, which has the same low conductance sets as P;
4. Use the spectral algorithm [12] to find k low conductance sets of $\bar{P}$ from (11).
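Step 4 can be illustrated for the simplest case k = 2. The sketch below is not the k-way spectral algorithm of [12]; it is a plain normalized-cut bipartition that splits nodes by the sign of the second eigenvector of $D^{-1/2}WD^{-1/2}$, found by power iteration with deflation. The symmetric input W stands in for $\bar{P}$, and the toy graph, the function name and the fixed iteration count are assumptions made for illustration.

```python
import math


def spectral_bipartition(W, iters=500):
    """Two-way normalized-cut relaxation on a symmetric nonnegative matrix W.

    Assumes every node has positive degree and the graph is connected.
    """
    n = len(W)
    d = [sum(row) for row in W]
    # Normalized similarity M = D^{-1/2} W D^{-1/2}; its top eigenvalue is 1.
    M = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # The top eigenvector is proportional to sqrt(d) and carries no cluster signal.
    u1 = [math.sqrt(x) for x in d]
    norm = math.sqrt(sum(x * x for x in u1))
    u1 = [x / norm for x in u1]
    v = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iters):
        # Deflate: remove the component along the trivial eigenvector.
        dot = sum(a * b for a, b in zip(v, u1))
        v = [a - dot * b for a, b in zip(v, u1)]
        # Apply (M + I) / 2 so that all eigenvalues are nonnegative.
        v = [0.5 * (sum(M[i][j] * v[j] for j in range(n)) + v[i])
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Map back to the random-walk embedding and split by sign.
    z = [v[i] / math.sqrt(d[i]) for i in range(n)]
    return [0 if x >= 0 else 1 for x in z]


# Toy graph: two triangles (0,1,2) and (3,4,5) joined by one weak edge 2-3.
W = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0.1, 0, 0],
     [0, 0, 0.1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]

labels = spectral_bipartition(W)
```

For k > 2 one would typically keep the leading k eigenvectors and round them to an assignment matrix X, for example with k-means, which is beyond this sketch.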
Experiments
Algorithms, data, and metrics
We compare our joint clustering algorithm ASModel to NCut [11], MCL [14], RMCL [15, 16], and AlignNemo [18]. Among these algorithms, AlignNemo [18] is a recently proposed protein complex detection algorithm that also takes into account the homology and topology information from two PPI networks. NCut is equivalent to searching for low conductance sets using the transition matrix defined directly on the given single network. Therefore, the comparison with NCut aims to show that finding low conductance sets on the integrated network by ASModel is superior to separately finding low conductance sets on the individual networks. MCL and RMCL are two state-of-the-art algorithms that have been shown to be effective for analyzing biological networks. Comparing with them can further demonstrate that ASModel achieves better performance than clustering single networks separately. Both NCut and ASModel have one input parameter, the number of clusters k. We sample k in [100, 3000] with an interval of 100 and report the best results. MCL also has one parameter, the inflation number, for which we similarly search for the best performing value from 1.2 to 5.0 with an interval of 0.1. For RMCL, we adopt the parameters suggested in [15, 16]. AlignNemo is a heuristic algorithm without any tuning parameters [18], and we directly apply the provided algorithm in our experiments.
Table 1. Information on the four real-world PPI networks.
Network | # nodes | # edges | SGD complexes | CORUM complexes | GO terms |
---|---|---|---|---|---|
Sce DIP | 4980 | 22076 | 305 | -- | 956 |
Sce BGS | 5640 | 59748 | 306 | -- | 1005 |
Hsa HPRD | 9269 | 36917 | -- | 1294 | 4755 |
Hsa PIPs | 5226 | 37024 | -- | 1193 | 4560 |
where |g| and |root| are the number of proteins in GO term g and the number of proteins in its corresponding GO category. The information of reference complex datasets and GO terms is also provided in Table 1.
where C = {C_{1}, C_{2}, ..., C_{ k }} are the clusters identified by the different algorithms and R = {R_{1}, R_{2}, ..., R_{ l }} denotes the corresponding set of reference complexes. The neighbor affinity $NA(C_{i},R_{j})=\frac{|C_{i}\cap R_{j}|^{2}}{|C_{i}|\times |R_{j}|}$ measures the overlap between the predicted complex C_{ i } and the reference complex R_{ j }.
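The neighbor affinity is straightforward to compute on sets of protein identifiers. The sketch below implements the formula above and counts which reference complexes are matched by at least one cluster, using the NA > 0.25 threshold applied later in the evaluation. The function names and the toy protein labels are hypothetical.

```python
def neighbor_affinity(cluster, reference):
    """NA(C, R) = |C ∩ R|^2 / (|C| * |R|) for two collections of protein IDs."""
    c, r = set(cluster), set(reference)
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    return overlap ** 2 / (len(c) * len(r))


def matched_references(clusters, references, threshold=0.25):
    """Reference complexes recovered by at least one cluster with NA > threshold."""
    return [r for r in references
            if any(neighbor_affinity(c, r) > threshold for c in clusters)]


# Hypothetical predicted clusters and reference complexes.
clusters = [{"a", "b", "c", "d"}, {"e", "f"}]
references = [{"a", "b", "c"}, {"e", "x", "y", "z"}]

na_first = neighbor_affinity(clusters[0], references[0])    # 3^2 / (4 * 3)
na_second = neighbor_affinity(clusters[1], references[1])   # 1^2 / (2 * 4)
recovered = matched_references(clusters, references)
```

In this toy case only the first reference complex is recovered: the second shares a single protein with its best cluster, which gives an affinity of 0.125.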
We choose the lowest p-value of all enriched GO terms in the derived cluster as its final p-value. A GO term is enriched when the p-value of any cluster corresponding to this GO term is less than 1e-3.
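The rule above (take the lowest p-value over GO terms and compare it with 1e-3) can be sketched as follows. The enrichment formula itself is not reproduced in this text, so the sketch assumes the commonly used hypergeometric tail test, with |g| as the term size and |root| as the population size. That choice, the function names and the toy data are all assumptions.

```python
from math import comb


def hypergeom_pvalue(overlap, cluster_size, term_size, root_size):
    """P(X >= overlap) when drawing cluster_size proteins out of root_size,
    of which term_size are annotated with the GO term (assumed test)."""
    upper = min(cluster_size, term_size)
    total = comb(root_size, cluster_size)
    tail = sum(comb(term_size, x) * comb(root_size - term_size, cluster_size - x)
               for x in range(overlap, upper + 1))
    return tail / total


def cluster_pvalue(cluster, go_terms, root_size):
    """Lowest p-value over all GO terms, used as the cluster's final p-value."""
    cluster = set(cluster)
    return min(
        hypergeom_pvalue(len(cluster & members), len(cluster), len(members), root_size)
        for members in go_terms.values()
    )


# Hypothetical example: 10 annotated proteins, one GO term with 3 members.
go_terms = {"g1": {"p0", "p1", "p2"}}
cluster = {"p0", "p1", "p2"}

p = cluster_pvalue(cluster, go_terms, root_size=10)
is_enriched = p < 1e-3  # threshold stated in the text
```

Even a perfect overlap is not called enriched here because the toy population is tiny (p = 1/120); with realistic category sizes the same overlap would give a far smaller p-value.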
Synthetic networks
We first compare the clustering performance of our proposed ASModel with that of running a random walk on the individual networks, as well as with that of running a normal random walk directly on the integrated network, which contains both the interactions within networks and the similarities across networks. The goal of this set of experiments is to demonstrate not only that joint clustering performs better than clustering individual networks by NCut, but also that ASModel achieves better performance than the normal random walk on the integrated network using the same integrated information.
Joint clustering of PPI networks within the same species
In this section, we first jointly cluster two PPI networks from the same species to demonstrate the effectiveness of our ASModel. Through applying ASModel, we expect that each PPI network can borrow strengths from the other PPI network to enhance the clustering performance.
Joint clustering of the SceDIP and SceBGS PPI networks
Complex prediction
Information on the clusters derived by all competing algorithms.
PPI | Metric | NCut | MCL | RMCL | ASModel (DIP+BGS) | ASModel (HPRD+PIPs) | ASModel (DIP+HPRD) |
---|---|---|---|---|---|---|---|
Sce DIP | # clusters | 525 | 659 | 814 | 737 | -- | 702 |
Sce DIP | coverage | 2572 | 3630 | 3725 | 4537 | -- | 4425 |
Sce BGS | # clusters | 414 | 338 | 772 | 704 | -- | -- |
Sce BGS | coverage | 4879 | 3544 | 5210 | 5169 | -- | -- |
Hsa HPRD | # clusters | 981 | 1239 | 1508 | -- | 1113 | 1231 |
Hsa HPRD | coverage | 6534 | 7800 | 6879 | -- | 8631 | 8729 |
Hsa PIPs | # clusters | 491 | 576 | 581 | -- | 560 | -- |
Hsa PIPs | coverage | 4542 | 4134 | 3966 | -- | 4358 | -- |
One important reason that our results for protein complex prediction by AlignNemo differ from those originally reported is that we use a stricter evaluation criterion: a reference complex R_{ j } is considered recovered by an identified cluster C_{ i } only when NA(C_{ i }, R_{ j }) > 0.25. In the original AlignNemo paper [18], a reference complex is considered recovered if at least two of its proteins overlap with a detected cluster, which may introduce evaluation bias. Consider a cluster of 10 proteins in which every two proteins belong to a different reference complex. That criterion would conclude that five different complexes are recovered by the algorithm, although such a clustering result is not necessarily desirable. Our results may indicate that the random walk strategy in ASModel integrates the available information across networks better than the heuristic strategy adopted in AlignNemo, and thereby discovers biologically more meaningful clusters.
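The 10-protein example can be worked through with the neighbor affinity defined earlier. Assuming the cluster shares exactly two proteins with a reference complex R_{ j }, the complex has at least two members, so:

```latex
\[
NA(C_i, R_j) \;=\; \frac{|C_i \cap R_j|^{2}}{|C_i| \times |R_j|}
\;=\; \frac{2^{2}}{10 \times |R_j|}
\;\le\; \frac{4}{10 \times 2} \;=\; 0.2 \;<\; 0.25 .
\]
```

Under the NA > 0.25 criterion none of the five complexes would count as recovered by this cluster, whereas the two-protein overlap criterion would count all five.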
GO enrichment analysis
In summary, from both complex prediction and GO enrichment analysis, ASModel can achieve more biologically meaningful results. These promising results imply that joint clustering can improve the clustering performance for every individual PPI network when we integrate information from them appropriately.
Joint clustering of the HsaHPRD and HsaPIPs networks
Complex prediction
GO enrichment analysis
From these two experiments on jointly clustering PPI networks from the same species, we note that ASModel can make full use of topology and homology information to improve the clustering performance for each PPI network.
Joint clustering of PPI networks from different species
Joint clustering of PPI networks within the same species has been shown above to yield promising results. To show that ASModel can also improve the clustering performance for PPI networks from different species, we perform the following experiment.
Joint clustering with SceDIP and HsaHPRD PPI networks
Complex prediction
For the Hsa HPRD network, we compare the results of ASModel obtained from joint clustering of the Hsa HPRD and Hsa PIPs networks and from joint clustering of the Hsa HPRD and Sce DIP PPI networks with those of AlignNemo, NCut, MCL, and RMCL. The comparison of the number of matched reference complexes and the F-measure is given in Figure 9. From the figure, we find that RMCL achieves the best performance in terms of these two metrics. ASModel achieves competitive performance when jointly clustering the two human networks, as shown before, and this setting gives better results than jointly analyzing the yeast and human networks. From this set of experiments, we find that jointly clustering two networks within the same species works better than jointly clustering networks from different species. We expect this because networks within the same species share more information, which can be used to supplement each other and improve clustering performance. For two networks from different species, in contrast, joint clustering may not help as much, since the species may have different cellular constitution and organization due to evolutionary differences.
GO enrichment analysis
From these experiments, whether analyzing two PPI networks from the same species or from two different species, our joint clustering algorithm ASModel can achieve better results than analyzing these networks separately using single network clustering algorithms. Furthermore, we find that joint clustering using two PPI networks from the same species achieves a more significant performance improvement than using two PPI networks from different species. This coincides with our intuition that clustering results are more robust and accurate when the networks come from the same species or from phylogenetically close species, so that the conservation across networks helps to derive more confident clustering results.
Conclusions and future work
In this paper, we have proposed a joint network clustering algorithm, ASModel, based on a new alternative random walk strategy. The experimental results based on both complex prediction and GO enrichment analysis demonstrate that using ASModel to jointly cluster two PPI networks can achieve better clustering results than single network clustering algorithms and AlignNemo. Furthermore, by comparing the performance of jointly clustering PPI networks within the same species (section 3.2) with that of jointly clustering PPI networks from different species (section 3.3), we find that the more information the PPI networks in the integrated network share, the better the clustering results. In future work, we are collaborating with biologists to explore opportunities to use ASModel to identify biologically meaningful clusters in different species. By carefully investigating the recovered clusters, we may gain a better understanding of protein functionalities, cellular organization, and the underlying signal transduction mechanisms for deriving future systematic intervention strategies.
Declarations
Acknowledgements
This work was supported by Award R21DK092845 from the National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health; and Award #1244068 from the National Science Foundation.
Declarations
Publication of this article was funded by the faculty startup fund for XQ from Texas A&M University.
This article has been published as part of BMC Systems Biology Volume 8 Supplement 1, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/8/S1.
References
- Butland G, Peregrin-Alvarez J, Li J, et al.: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature. 2005, 433: 531-537. 10.1038/nature03239.
- Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. Nat Biotech. 2005, 23: 561-566. 10.1038/nbt1096.
- Raman K: Construction and analysis of protein-protein interaction networks. Automated Experimentation. 2010, 2 (2).
- Zhang L, Zhang Y, Adusumilli S, et al.: Molecular interactions that enable movement of the lyme disease agent from the tick gut into the hemolymph. PLoS Pathog. 2011, 7 (6): 1002079. 10.1371/journal.ppat.1002079.
- Wang Q, Feng J, Wang J, et al.: Disruption of tab1/p38[alpha] interaction using a cell-permeable peptide limits myocardial ischemia/reperfusion injury. Molecular Therapy advance online publication. 2013.
- Pereira-Leal J, Enright A, Ouzounis C: Detection of functional modules from protein interaction networks. Proteins. 2004, 54 (1): 49-57.
- Spirin V, Mirny L: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA. 2003, 100: 12123-12128. 10.1073/pnas.2032324100.
- Poyatos J, Hurst L: How biologically relevant are interaction-based modules in protein networks?. Genome Biol. 2004, 5: 93. 10.1186/gb-2004-5-11-r93.
- Bernardo DD, Gardner T, Collins J: Robust identification of large genetic networks. Pac Symp Biocomput. 2004, 9: 486-497.
- Kolker E, Makarova K, Shabalina S, et al.: Identification and functional analysis of 'hypothetical' genes expressed in Haemophilus influenzae. Nucleic Acids Res. 2004, 32 (8): 2353-2361. 10.1093/nar/gkh555.
- Shi J, Malik J: Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence. 2000, 22 (8).
- Wang Y, Qian X: Functional module identification in protein interaction networks by interaction patterns. Bioinformatics. 2013.
- Xing E, Jordan M: On semidefinite relaxation for normalized k-cut and connections to spectral clustering. Technical report, UC Berkeley. 2003.
- Dongen SV: A cluster algorithm for graphs. Technical Report INS-R0010. 2000.
- Satuluri V, Parthasarathy S: Scalable Graph Clustering Using Stochastic Flows: Applications to Community Discovery. 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'09). 2009.
- Satuluri V, Parthasarathy S, Ucar D: Markov Clustering of Protein Interaction Networks. ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2010. 2010.
- Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Muller T: Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics. 2007, 24 (13): 223-231.
- Ciriello G, Mina M, Guzzi PH, Cannataro M, Guerra C: Alignnemo: A local network alignment method to integrate homology and topology. PLoS ONE. 2012, 7 (6): 38107. 10.1371/journal.pone.0038107.
- Ashburner M, Ball C, Blake J, et al.: Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
- Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
- Satuluri V, Parthasarathy S: Symmetrizations for Clustering Directed Graphs. 14th International Conference on Extending Database Technology (EDBT11). 2011.
- Salwinski L, Miller C, Smith A, Pettit F, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004, 32: 449-451.
- Breitkreutz B, Stark C, et al.: The BioGRID Interaction Database: 2008 update. Nucleic Acids Res. 2008, 36: 637-640.
- Prasad T, et al.: Human Protein Reference Database--2009 update. Nucleic Acids Res. 2009, 37: 767-772. 10.1093/nar/gkn892.
- McDowall M, Scott M, Barton G: Pips: Human protein-protein interactions prediction database. Nucleic Acids Res. 2009, 37: 651-656. 10.1093/nar/gkn870.
- Hong E, et al.: Gene ontology annotations at sgd: New data sources and annotation methods. Nucleic Acids Res. 2008, 36: 577-581.
- Ruepp A, Brauner B, Dunger-Kaltenbach I, et al.: Corum: The comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 2008, 36: 646-650.
- Shih Y, Parthasarathy S: Identifying functional modules in interaction networks through overlapping markov clustering. Bioinformatics. 2012, 28: 473-479. 10.1093/bioinformatics/bts370.
- Shih Y, Parthasarathy S: Scalable global alignment for multiple biological networks. BMC Bioinformatics. 2012, 13 (Suppl 3): 11. 10.1186/1471-2105-13-S3-S11.
- Maslov S, Sneppen K: Specificity and stability in topology of protein networks. Science. 2002, 296: 910-913. 10.1126/science.1065103.
- Lancichinetti A, Fortunato S, Kertész J: Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics. 2009, 11 (3): 033015. 10.1088/1367-2630/11/3/033015.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.