cisPath: an R/Bioconductor package for cloud users for visualization and management of functional protein interaction networks
© Wang et al.; licensee BioMed Central Ltd. 2015
Published: 21 January 2015
With the rapid development of cloud technology and services, an increasing number of users prefer to run their applications on the cloud. All software and associated data are hosted on the cloud, allowing users to access them via a web browser from any computer, anywhere. This paper presents cisPath, an R/Bioconductor package deployed on cloud servers for client users to visualize, manage, and share functional protein interaction networks.
With this R package, users can easily integrate protein-protein interaction information downloaded from different online databases with private data to construct new and personalized interaction networks. Additional functions allow users to generate specific networks based on private databases. Since the results produced by this package are in the form of web pages, cloud users can easily view and edit the network graphs in the browser, using a mouse or touch screen, without downloading them to a local computer. This package can also be installed and run on a local desktop computer. Depending on user preference, results can be published or shared by uploading them to a web server or cloud drive, allowing other users to access the results directly via a web browser.
This package can be installed and run on a variety of platforms. Since all network views are shown in web pages, the package is particularly useful for cloud users. Its easy installation and operation make it attractive for R beginners and for users with no previous experience with cloud services.
Functions of the cisPath package.
Format PPI file downloaded from PINA database
Format PPI file downloaded from iRefIndex database
Format PPI file downloaded from STRING database
Combine PPI information generated from different databases
Generate protein identifier mapping file
Visualize input proteins in PPI network and display evidence that supports specific interactions among network proteins
Identify and visualize shortest PPI paths between PPI network proteins
Open network graph editor
In recent years, cloud applications, in which application software and all associated data are hosted centrally on the cloud, have become increasingly popular [5–7]. Cloud users typically access the software and data via a thin-client web browser. RStudio, which was developed by RStudio Inc. and is available under a free software license for academic use, is a powerful integrated development environment for the R programming language. Deploying RStudio and R on a cloud server allows primary users to provide R as a cloud application to authorized users, who can access this R workspace from any computer. In this case, only primary users need to install and update R and related packages. Using R as a cloud application provides many benefits, but the need to download output to a local machine is an inconvenience that diminishes the benefit of the cloud model. Taking this into consideration, cisPath displays all of its results in web format. As such, using this package as a cloud application allows users to visualize, manage, and share results through the web browser instead of downloading them. Users who have only a cloud drive, such as Google Drive, rather than a cloud computing server can upload results to the cloud drive, where they can be visualized and shared via a browser. Details on how to use Google Drive to host online results can be found on our website.
Environment and technologies
This cisPath package is available through the Bioconductor Project. For users of cloud servers, RStudio is recommended as it enables primary users to provide a browser-based interface to a version of R running on a remote Linux server. Louis Aslett has provided various Amazon Machine Images (AMIs) which make deploying an RStudio Server fast and easy. These AMIs are highly recommended, especially for free micro instance users. The Bioconductor team has also developed AMIs optimized for running Bioconductor packages with the Amazon Elastic Compute Cloud (EC2). Users without any cloud services experience can easily launch the AMI using instructions on the Bioconductor website. An introduction to the use of this package for R beginners is also available on our website.
Data collection and integration
There are several protein interaction databases, such as PINA, STRING, and iRefIndex [13, 14], which allow PPI information to be downloaded free of charge for academic purposes, but the files downloaded from different databases do not share a common format. In this cisPath package, functions are provided to convert the files downloaded from the PINA, STRING and iRefIndex databases into a standard working format. To remove redundant interactions, UniProt Knowledgebase (UniProtKB) accession numbers are used as unique protein identifiers. UniProtKB is a part of the UniProt database and serves as a central hub for the collection of functional information on proteins with accurate annotation. UniProtKB consists of two sections: UniProtKB/Swiss-Prot (reviewed and manually annotated) and UniProtKB/TrEMBL (unreviewed and automatically annotated). Proteins with names that cannot be mapped to UniProtKB accession numbers are discarded.
The PINA database includes unified PPI data integrated from six manually curated databases: IntAct, MINT, BioGRID, DIP, HPRD and MIPS MPact. Like PINA, the iRefIndex database also provides an index of protein interactions integrated from primary interaction databases. PPI data downloaded from the PINA and iRefIndex databases contain the PubMed IDs of the papers that support the PPIs. The STRING database contains not only known PPIs but also predicted protein associations with confidence scores. The latest version of STRING (v9.1) currently covers 5,214,234 proteins from 1,133 organisms. Although the PINA and iRefIndex databases are both integrated from manually curated databases, each contains many interactions that the other lacks. Thus, several functions have been included in this package to format PPI data downloaded from different databases, allowing users to edit the downloaded information or merge it with privately collected data to construct more comprehensive PPI networks.
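The integration step described above can be illustrated with a minimal Python sketch. It is not the package's implementation: the function name `merge_ppi`, the record layout, and all protein names, accessions, and evidence strings are hypothetical placeholders. It only shows the idea of mapping names to accession numbers, discarding unmapped names, and collapsing redundant interactions onto a single unordered accession pair.

```python
from collections import defaultdict


def merge_ppi(sources, name_to_accession):
    """Merge interaction records from several databases.

    sources: dict of database label -> iterable of (name_a, name_b, evidence)
    name_to_accession: dict of protein name -> accession number
    Returns {(acc_a, acc_b): {database label: set of evidence strings}}.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for database, records in sources.items():
        for name_a, name_b, evidence in records:
            acc_a = name_to_accession.get(name_a)
            acc_b = name_to_accession.get(name_b)
            if acc_a is None or acc_b is None:
                continue  # names without an accession number are discarded
            # Sort the pair so that A-B and B-A collapse into one interaction.
            pair = tuple(sorted((acc_a, acc_b)))
            merged[pair][database].add(evidence)
    return {pair: dict(by_db) for pair, by_db in merged.items()}


# Toy input: every name, accession, and evidence string is a placeholder.
mapping = {"PROT_A": "ACC_A", "PROT_B": "ACC_B", "PROT_C": "ACC_C"}
sources = {
    "PINA": [("PROT_A", "PROT_B", "pubmed:placeholder-1")],
    "iRefIndex": [
        ("PROT_B", "PROT_A", "pubmed:placeholder-2"),
        ("PROT_C", "UNMAPPED_NAME", "pubmed:placeholder-3"),
    ],
    "STRING": [("PROT_A", "PROT_C", "score:900")],
}
merged = merge_ppi(sources, mapping)
for pair, evidence in sorted(merged.items()):
    print(pair, evidence)
```

In this toy run the PINA and iRefIndex records for the same pair end up under one key with evidence from both databases, while the record containing an unmapped name is dropped.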
Functions for visualization
In some cases, users may want to identify interaction paths with more than two interacting steps between a pair of given proteins in a PPI network, and another function may be used to yield this type of result. The function cisPath identifies and outputs the shortest PPI paths between a pair of given proteins, including paths that involve multiple interaction steps. Users can obtain the shortest path(s) either by directly requesting the path(s) with minimal cost using the default "cost" values of edges, or by manually assigning "costs" to specific edges in the PPI network by editing the input file. The "cost" of an edge between two interacting proteins is a numerical value greater than or equal to one that quantifies the extent to which an interaction is unfavorable. The default "cost" of each edge generated from the PINA and iRefIndex databases is 1, and the "cost" of an edge generated from the STRING database is given as $\max(1, \log_{100}(1000 - \mathrm{STRING\_SCORE}))$, where STRING_SCORE is the confidence score given by the STRING database. An example of this function is shown in Figure 2B. Evidence consisting of the STRING score or the PubMed IDs of relevant manuscripts is shown for all interaction paths. Similar to the networkView function, other proteins that can interact with at least two of the proteins that lie on the shortest PPI path are also displayed, giving a full range of possibilities even though these may form suboptimal paths. All of the shortest paths are listed in a table under the network view and can be shown graphically when selected (Figure 2B). To identify the paths with the fewest steps regardless of the associated "costs", the parameter byStep may be set to TRUE. In this case, all edge "costs" are set to 1 and the PPI paths with the minimum number of steps between a pair of given proteins are produced.
Research groups that focus on specific proteins may require screening of the shortest interaction paths from a single fixed protein to all other proteins in the input database. In this case, only the source protein name should be supplied to the cisPath function. All proteins in the input database are scanned for the shortest interaction paths to the fixed protein, and all of the shortest PPI paths from the fixed protein to each of the relevant proteins are output. Upon finding a new protein of interest, users can query the shortest interaction paths to the fixed protein with a browser without launching R. Although more CPU time and storage space are required to compute and store these results, they can easily be placed on a cloud drive or web server for quick access over the Internet. Sample results for fixed source proteins TP53 and PTEN can be found on our website.
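The cost model and path search described in this paragraph can be sketched in Python. This is an illustration under stated assumptions, not the package's code: the cost expression is read as the base-100 logarithm of (1000 - STRING_SCORE) with a floor of 1, STRING scores are assumed to be below 1000, the search uses Dijkstra's algorithm (the text does not name the algorithm used), and the names `string_edge_cost`, `shortest_paths`, and `by_step` are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def string_edge_cost(string_score):
    """Assumed default STRING cost: max(1, log base 100 of (1000 - score))."""
    return max(1.0, math.log(1000 - string_score, 100))


def shortest_paths(edges, source, target, by_step=False):
    """Return (total cost, all minimal-cost paths) on an undirected graph.

    edges: iterable of (protein_a, protein_b, cost) with cost >= 1.
    by_step: if True, every edge counts as 1, so paths minimize step count.
    """
    graph = defaultdict(dict)
    for a, b, cost in edges:
        weight = 1.0 if by_step else float(cost)
        graph[a][b] = graph[b][a] = weight
    dist = {source: 0.0}
    preds = defaultdict(list)  # predecessors lying on some shortest path
    heap = [(0.0, source)]
    eps = 1e-12  # tolerance for comparing floating-point path costs
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nxt, weight in graph[node].items():
            new_d = d + weight
            old_d = dist.get(nxt, math.inf)
            if new_d < old_d - eps:
                dist[nxt] = new_d
                preds[nxt] = [node]
                heapq.heappush(heap, (new_d, nxt))
            elif abs(new_d - old_d) <= eps:
                preds[nxt].append(node)  # equally good alternative route
    if target not in dist:
        return math.inf, []

    def backtrack(node):
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in backtrack(p)]

    return dist[target], sorted(backtrack(target))


# Toy network with generic node names; the cost of 3 stands for a manually
# assigned cost on an unfavorable direct edge.
edges = [("A", "B", 1), ("B", "D", 1), ("A", "C", 1), ("C", "D", 1), ("A", "D", 3)]
print(shortest_paths(edges, "A", "D"))
print(shortest_paths(edges, "A", "D", by_step=True))
```

With the costs as given, the two two-step routes tie and both are returned; with `by_step=True` the direct edge wins because only the number of steps matters. Under the assumed formula, STRING scores of 900 or more map to the minimum cost of 1, and lower scores map to costs between 1 and 1.5.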
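The idea of computing all paths from one fixed protein once and querying them later without recomputation can be sketched as follows. This is a hypothetical illustration rather than the format cisPath actually writes: the functions `build_path_table` and `lookup_paths`, the JSON layout, and the toy network are all assumptions. A single Dijkstra run from the source stores, for every reachable protein, its cost and its predecessors on shortest paths; the table is serialized as text and any later query only walks the stored predecessors.

```python
import heapq
import json
import math
from collections import defaultdict


def build_path_table(edges, source):
    """One search from the source; returns {protein: {"cost", "preds"}}."""
    graph = defaultdict(dict)
    for a, b, cost in edges:
        graph[a][b] = graph[b][a] = float(cost)
    dist, preds = {source: 0.0}, {source: []}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nxt, weight in graph[node].items():
            new_d = d + weight
            if new_d < dist.get(nxt, math.inf):
                dist[nxt], preds[nxt] = new_d, [node]
                heapq.heappush(heap, (new_d, nxt))
            # Exact equality is adequate for the integer costs of this toy data.
            elif new_d == dist[nxt] and node not in preds[nxt]:
                preds[nxt].append(node)
    return {p: {"cost": dist[p], "preds": preds[p]} for p in dist}


def lookup_paths(table, target):
    """Rebuild every shortest path to target from the stored predecessors."""
    if target not in table:
        return []
    entry = table[target]
    if not entry["preds"]:
        return [[target]]  # only the source has no predecessor
    return [path + [target]
            for pred in entry["preds"]
            for path in lookup_paths(table, pred)]


# Toy network with generic node names.
edges = [("A", "B", 1), ("B", "D", 1), ("A", "C", 1), ("C", "D", 1), ("D", "E", 1)]
stored_text = json.dumps(build_path_table(edges, "A"))  # computed once
restored = json.loads(stored_text)                      # queried later
print(restored["E"]["cost"], sorted(lookup_paths(restored, "E")))
```

The trade-off matches the one noted in the text: the table costs more time and space to produce than a single pairwise query, but each later lookup is a cheap walk over stored data that any client able to read the file can perform.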
The functions networkView and cisPath described above allow users to set the color and size of the nodes in the network view before running. An additional editor allows easy modification of network graphs after running. Figure 2C shows a screenshot of this tool. This editor is accessed via an "Edit graph" button on the output webpage, and allows users to make changes to the output graph as well as draw new network graphs that are directed or undirected, using different edge and arrow styles. The editor is compatible with a range of different browsers. Since most commonly used browsers support HTML5 Web Storage, users can store the network graph view and open it later using the same browser. An additional function of this editor allows the graph view to be converted into a string of text. As the text can be converted back into an editable graph view, output graphs can easily be shared via email or online messenger. This editor can be used independently, and is included in the source package. It is also available on our website for online access or for download and offline use.
Results and discussion
As examples to demonstrate the use of this package, integrated PPI data from the PINA, iRefIndex and STRING databases for six model species are available for downloading from our website. Users should cite the databases accordingly when using these files. When integrating interactions from different databases, the function combinePPI will also count the number of valid proteins and protein interactions from each database. In view of the fact that the STRING database includes not only known PPIs but also predicted associations, only high confidence interactions from STRING are retained in these samples. The threshold of the STRING score is set to 700 by default, which can be changed manually with the function formatSTRINGPPI.
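The thresholding and counting described above can be sketched in Python. This is a hedged illustration, not the behavior of formatSTRINGPPI or combinePPI themselves: the helper names are hypothetical, the threshold is assumed to be inclusive (the text does not say whether a score of exactly 700 is kept), and the records and accession labels are placeholders.

```python
def filter_string(records, threshold=700):
    """Keep STRING associations scoring at least the threshold (assumed inclusive)."""
    return [(a, b, score) for a, b, score in records if score >= threshold]


def proteins(records):
    """Set of protein identifiers appearing in the records."""
    return {p for record in records for p in record[:2]}


def interactions(records):
    """Set of unordered protein pairs, so A-B and B-A count once."""
    return {tuple(sorted(record[:2])) for record in records}


def summarize(databases):
    """Count proteins and interactions per database and those shared by all."""
    summary = {
        label: {"proteins": len(proteins(recs)),
                "interactions": len(interactions(recs))}
        for label, recs in databases.items()
    }
    summary["shared by all"] = {
        "proteins": len(set.intersection(
            *(proteins(recs) for recs in databases.values()))),
        "interactions": len(set.intersection(
            *(interactions(recs) for recs in databases.values()))),
    }
    return summary


# Toy records: accession labels, scores, and evidence are placeholders.
string_raw = [("ACC_1", "ACC_2", 900), ("ACC_2", "ACC_3", 650), ("ACC_3", "ACC_4", 700)]
pina = [("ACC_2", "ACC_1", "pubmed:placeholder"), ("ACC_1", "ACC_5", "pubmed:placeholder")]

string_kept = filter_string(string_raw)
summary = summarize({"STRING": string_kept, "PINA": pina})
for label, counts in summary.items():
    print(label, counts)
```

Raising the threshold trades coverage for confidence: fewer predicted associations survive, so the STRING counts and the overlap with the curated databases both shrink.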
Number of overlapping and distinct proteins listed for Homo sapiens in different databases.
Number of overlapping and distinct protein interactions listed for Homo sapiens in different databases.
Compared to other existing tools such as Cytoscape and the online web servers PINA and STRING, this cisPath package is especially useful for R and cloud users. For R users, this package can be installed and updated by using only two commands, and no other plugins are required. Users can view, edit, and save the output network views with most modern web browsers. Cloud users can simply connect to the RStudio server with a browser and construct and publish personalized PPI databases and networks. It allows the user to play the role of PPI database administrator, rather than simply act as a common user.
This cisPath package offers great convenience for cloud users who wish to visualize and manage functional protein interaction networks. With this package, PPI data from different databases can be integrated and used to deduce network graphs from one or several given proteins. This package can also identify the shortest interaction paths between a pair of proteins. Published evidence of interactions from different databases is present in the output HTML files. The graph editor adds a further layer of functionality, allowing different kinds of network views to be drawn according to user preference. With an RStudio server, molecular biologists can run cisPath functions and easily view results on mobile devices via web browsers. For cloud users, the Amazon EC2 service with the AMIs generated by Louis Aslett is recommended. One micro instance, which is available free of cost, is sufficient to launch the AMI and run this package.
In developing this cisPath package, one of our main aims is to offer an easy-to-use tool for mobile devices and panel PCs that is accessible through the browser. Therefore only functions that can be easily operated on mobile devices are included. A number of other functions, although useful, were excluded from the package, as they were deemed unsuitable for mobile use. We hope that with this tool, users will now be able to evaluate their ideas concerning protein interactions through visualization and management of protein networks, and limitations of time and location will become a thing of the past.
Availability and requirements
Project name: cisPath
Availability: All sources and compiled code are free for academic use
Operating system(s): Platform independent
Programming language: R
Other requirements: R (>= 2.10.0)
License: GPL (>= 3)
Any restrictions to use by non-academics: None
We greatly appreciate the responses and suggestions from initial users which helped to improve this package. Publication of this article has been funded by National Natural Science Foundation of China (No.31401132 and No.81321003) and the 111 Project (No.B07001).
This article has been published as part of BMC Systems Biology Volume 9 Supplement 1, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/9/S1
- Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011, 39: D561-568. 10.1093/nar/gkq973.
- Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, Jensen LJ: STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013, 41: D808-815. 10.1093/nar/gks1094.
- Cowley MJ, Pinese M, Kassahn KS, Waddell N, Pearson JV, Grimmond SM, Biankin AV, Hautaniemi S, Wu J: PINA v2.0: mining interactome modules. Nucleic Acids Res. 2012, 40: D862-865. 10.1093/nar/gkr967.
- Wu J, Vallenius T, Ovaska K, Westermarck J, Makela TP, Hautaniemi S: Integrated network analysis platform for protein-protein interactions. Nat Methods. 2009, 6: 75-77. 10.1038/nmeth.1282.
- Dudley JT, Butte AJ: In silico research in the era of cloud computing. Nat Biotechnol. 2010, 28: 1181-1185. 10.1038/nbt1110-1181.
- Schatz MC, Langmead B, Salzberg SL: Cloud computing and the DNA data race. Nat Biotechnol. 2010, 28: 691-693. 10.1038/nbt0710-691.
- Zhang L, Gu S, Liu Y, Wang B, Azuaje F: Gene set analysis in the cloud. Bioinformatics. 2012, 28: 294-295. 10.1093/bioinformatics/btr630.
- Racine JS: RStudio: A Platform-Independent IDE for R and Sweave. Journal of Applied Econometrics. 2012, 27: 167-172. 10.1002/jae.1278.
- cisPath. [http://www.isb.pku.edu.cn/cisPath/]
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80. 10.1186/gb-2004-5-10-r80.
- Aslett L: RStudio AMIs. [http://www.louisaslett.com/RStudio_AMI/]
- Bioconductor AMI. [http://www.bioconductor.org/help/bioconductor-cloud-ami/]
- Razick S, Magklaras G, Donaldson IM: iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics. 2008, 9: 405. 10.1186/1471-2105-9-405.
- Mora A, Donaldson IM: iRefR: an R package to manipulate the iRefIndex consolidated protein interaction database. BMC Bioinformatics. 2011, 12: 455. 10.1186/1471-2105-12-455.
- Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013, 41: D43-47.
- Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, Kerssemakers J, Leroy C, Menden M, Michaut M, Montecchi-Palazzi L, Neuhauser SN, Orchard S, Perreau V, Roechert B, van Eijk K, Hermjakob H: The IntAct molecular interaction database in 2010. Nucleic Acids Res. 2010, 38: D525-531. 10.1093/nar/gkp878.
- Licata L, Briganti L, Peluso D, Perfetto L, Iannuccelli M, Galeota E, Sacco F, Palma A, Nardozza AP, Santonico E, Castagnoli L, Cesareni G: MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 2012, 40: D857-861. 10.1093/nar/gkr930.
- Chatr-Aryamontri A, Breitkreutz BJ, Heinicke S, Boucher L, Winter A, Stark C, Nixon J, Ramage L, Kolas N, O'Donnell L, Reguly T, Breitkreutz A, Sellam A, Chen D, Chang C, Rust J, Livstone M, Oughtred R, Dolinski K, Tyers M: The BioGRID interaction database: 2013 update. Nucleic Acids Res. 2013, 41: D816-823. 10.1093/nar/gks1158.
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004, 32: D449-451. 10.1093/nar/gkh086.
- Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A: Human Protein Reference Database--2009 update. Nucleic Acids Res. 2009, 37: D767-772. 10.1093/nar/gkn892.
- Guldener U, Munsterkotter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW, Stumpflen V: MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res. 2006, 34: D436-441. 10.1093/nar/gkj003.
- Saito R, Smoot ME, Ono K, Ruscheinski J, Wang PL, Lotia S, Pico AR, Bader GD, Ideker T: A travel guide to Cytoscape plugins. Nat Methods. 2012, 9: 1069-1076. 10.1038/nmeth.2212.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.