A semantic proteomics dashboard (SemPoD) for data management in translational research
© Jayapandian et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
One of the primary challenges in translational research data management is breaking down the barriers between multiple data silos and integrating 'omics data with clinical information to complete the cycle from the bench to the bedside. Contextual metadata, also called provenance information, is a key factor in effective data integration, reproducibility of results, correct attribution of original source, and answering research queries involving "What", "Where", "When", "Which", "Who", "How", and "Why" (also known as the W7 model). At present, however, there is limited or no effective approach to managing and leveraging provenance information for integrating data across studies or projects. Hence, there is an urgent need for a paradigm shift in creating a "provenance-aware" informatics platform to address this challenge. We introduce an ontology-driven, intuitive Semantic Proteomics Dashboard (SemPoD) that uses provenance together with domain information (semantic provenance) to enable researchers to query, compare, and correlate different types of data across multiple projects, and allow integration with legacy data to support their ongoing research.
SemPoD is an intuitive and powerful provenance ontology-driven data access and query platform that uses the MIAPE and MIMIx metadata guidelines to create an integrated view over large-scale systems molecular biology datasets. SemPoD leverages the SysPro ontology to create an intuitive dashboard for biologists to compose queries, explore the results, and use a query manager for storing queries for later use. SemPoD can be deployed over many existing database applications storing 'omics data, including, as illustrated here, the LabKey data-management system. The initial user feedback evaluating the usability and functionality of SemPoD has been very positive, and it is being considered for wider deployment beyond the proteomics domain and in other 'omics centers.
Though many molecular systems biology research centers now have significant infrastructure in terms of instrumentation to acquire 'omics datasets, most of these datasets end up in study-specific data silos. Specifically, more than 50% of data being generated in laboratories are stored in local lab servers, which not only reduces data utilization and re-use, but is also a significant waste of funding resources. In addition, the size of experiment datasets continues to grow; more than 48% of respondents to a recent Science journal survey regularly generate datasets of 1 GB (gigabyte) or larger. Therefore, there is an urgent need to effectively organize the data, cross-link the datasets across 'omics and clinical studies as part of the translational research roadmap, facilitate integration with legacy data, and allow seamless query across different types of data to gain research insight and accelerate research. Proteomic studies typically make use of multiple different workflows that provide information at different scales. For example, protein profiling allows for large-scale analysis of protein expression whereas interaction proteomics focuses on specific protein complexes or networks. The objective of this work is to provide a means of integrating data across proteomics studies and workflows to provide a more global view of the biological problem being studied. In addition, the primary proteomics data should be integrated with resources that provide annotation information such as protein function and pathways. For example, a researcher might acquire large-scale proteomics data from tumors of 30 patients corresponding to several different clinical stages of colorectal cancer and would like to answer questions such as:
Extending the above scenario, the researcher may consider that although activation of pathway X is altered in a mouse model of disease Y, it is not clear that this is also the case in humans. Thus, if the researcher acquires datasets from several different cohorts of patients with disease Y, she might ask:
At present, there is no informatics infrastructure in the Center for Proteomics and Bioinformatics (CPB) that is capable of supporting these categories of queries. In addition, the lack of an effective query platform is a key reason that, once the 'omics data has been acquired, analyzed and interpreted, the data is typically archived and serves no further purpose. This is an important issue in terms of maximizing the return on research funding, because the value of 'omics data can be significantly increased if that data is carefully integrated into a growing corpus of data that can then be re-used in different contexts. For example, a researcher with a long-standing interest in disease X has acquired multiple large-scale proteomics and transcriptomics datasets over several years. In response to a newly published finding that Single-Nucleotide Polymorphisms (SNPs) in gene Y are associated with disease X, the researcher now wants to query all of her legacy data and ask
In general, these types of queries are difficult to perform because they integrate several types of information, including biological annotations from outside sources.
The role of contextual metadata describing the experimental conditions, for example sample type, instrumentation, sample preparation, and statistical measures, is being increasingly noted as a key factor in managing translational research data. Contextual metadata is also called provenance information, derived from the Latin word provenire, and describes the origin or history of data. Provenance metadata supports integration of comparable datasets, facilitates correlation of data across projects, and also supports analyses of data by answering "What", "Where", "When", "Which", "Who", "How", and "Why" queries (also known as the W7 model) [3, 4]. Provenance has long been used in many domains to track the ownership of cultural artifacts and also in scientific research [5–7]. Traditional translational informatics tools have either ignored the role of provenance to the detriment of data quality or used it for basic operations (e.g. file versioning).
In addition to incorporating provenance metadata in medical informatics platforms, there is a well-recognized need for an intuitive and powerful query interface that can be directly used by researchers. Frequently, analysis and querying of 'omics datasets requires expertise that may not be available to many translational laboratories. For example, in a recent survey in the Science journal, about 57% of researchers either have no support for data analysis or are dependent on others for managing experiment data. Hence, there is a clear need to combine the query environment with the provenance-aware data integration platform to enable researchers to use contextual information to query and compare datasets using explicitly defined experiment conditions. In addition, the query environment should demonstrably reduce the technical complexity of query composition through use of visual interactive interfaces that transparently query distributed data, allow users to store query results for future reference, and show results in an intuitive manner.
The SemPoD platform is designed to address these challenges through use of provenance information integrated with a visual, ontology-driven query environment.
The first workflow is the affinity-purification mass-spectrometry (AP-MS) workflow, which enables the identification of specific protein complexes, thus identifying proteins that are associated with one another.
The second workflow is shotgun expression proteomics, which identifies and quantifies proteins in an unbiased manner from cells or tissues of interest.
Together, these two workflows account for approximately 50% of all experiments performed in the CPB and have been used in approximately 20 separate projects, generating over 3 Terabytes (TB) of data.
SemPoD leverages the SysPro ontology as the core resource to support various query functionalities, including "smart filtering" for reducing user effort in composing complex query patterns.
At present, the provenance metadata associated with the different stages of the proteomics workflow at CPB is not collected in a systematic manner. Often, the provenance metadata is stored as hand-written notes in a lab book and is not immediately available for query and analysis of the proteomics dataset. Further, any modification in the experiment protocols or related experiment metadata information makes it difficult to correlate or integrate data from previous runs with new datasets. The use of a variety of terms to describe provenance increases terminological heterogeneity across different projects and makes it difficult to effectively integrate datasets.
Hence, the SysPro ontology was developed to model experiment metadata by re-using and extending existing minimum information reporting guidelines defined by the 'omics community. Several "minimum information" reporting frameworks have been developed and are now part of the minimum reporting guidelines for biological and biomedical investigations (MIBBI) project, which facilitates collection and representation of experiment metadata in a variety of scientific domains. The minimum information required for reporting a molecular interaction experiment (MIMIx) framework is part of the MIBBI project and extends the minimum information about a proteomics experiment (MIAPE) framework with additional metadata terms describing interaction information that are used in the experiment workflows at the CPB. Concepts and terms already described in MIMIx, for example "interaction detection method" and "co-immunoprecipitation", were used as initial concepts in the construction of the SysPro ontology. Further, additional proteomics workflow-specific terms were added to SysPro to reflect the specific requirements of provenance modeling in CPB by extending the World Wide Web Consortium (W3C) PROV ontology (PROV-O).
The SysPro ontology also facilitates cross-linking of 'omics data with a variety of related genomics and clinical datasets, which are annotated with domain ontologies. A rapidly increasing number of biomedical domains, such as genetics, infectious diseases, and cancer, have created ontologies to model their domain information. These domain ontologies have significantly enhanced the use of standardized terminology across these communities. The most notable example is the Gene Ontology (GO), which is widely used to consistently annotate gene-related information across a variety of applications.
The SemPoD query builder uses the SysPro ontology to support an advanced feature called "smart filtering" that dynamically updates the query interface in response to previous user selections. Figure 6 illustrates this feature with the selection of two classes, namely "Cell line" and "Bait gene", and the corresponding drop-down menus that are automatically populated with instance values of the classes defined in the SysPro ontology. The "smart filtering" approach allows users to quickly compose large query patterns by significantly reducing the time needed to search and locate appropriate values in the query builder interface.
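The idea behind "smart filtering" can be illustrated with a minimal Python sketch. This is not the SemPoD implementation: the record layout, the `dropdown_values` helper, and the sample values are assumptions made purely for illustration. The sketch narrows the values offered for one metadata class to those that remain consistent with the selections already made for other classes.

```python
from typing import Dict, List

# Hypothetical experiment metadata records. The field names mirror SysPro
# classes mentioned in the text; the values are illustrative only.
EXPERIMENTS: List[Dict[str, str]] = [
    {"Cell line": "RKO", "Bait gene": "DNMT1", "Organism": "Homo sapiens"},
    {"Cell line": "RKO", "Bait gene": "APC", "Organism": "Homo sapiens"},
    {"Cell line": "Embryonic stem", "Bait gene": "POU5F1", "Organism": "Mus musculus"},
]


def dropdown_values(
    records: List[Dict[str, str]], field: str, selections: Dict[str, str]
) -> List[str]:
    """Return the sorted values of `field` that are consistent with `selections`.

    A selection already made on `field` itself is ignored, so the user can
    still change that value after choosing it.
    """
    matching = [
        record
        for record in records
        if all(record.get(k) == v for k, v in selections.items() if k != field)
    ]
    return sorted({record[field] for record in matching if field in record})


# With no prior selection, every bait gene is offered.
print(dropdown_values(EXPERIMENTS, "Bait gene", {}))
# → ['APC', 'DNMT1', 'POU5F1']

# After choosing a cell line, the bait gene menu shrinks accordingly.
print(dropdown_values(EXPERIMENTS, "Bait gene", {"Cell line": "RKO"}))
# → ['APC', 'DNMT1']
```

In a deployed system the candidate values would presumably come from ontology instances and the underlying database rather than an in-memory list; the sketch only shows the narrowing step.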
SemPoD has been deployed at the CPB and has been in use for over 2 months. SemPoD was evaluated both through a systematic user survey and in terms of scalability for queries with different levels of complexity over increasing sizes of data.
Details of queries used to evaluate the scalability of SemPoD
METADATA TERMS IN QUERY PATTERN
Q1. Search proteomics experiments in any human sample
Organism = 'Homo Sapiens'
Q2. Search proteomics experiments for 'Embryonic stem' cell line in any human sample
Organism = 'Homo Sapiens' (OR) Cell Line = 'Embryonic stem'
Q3. Search proteomics experiments for Human samples with Cell Line 'Embryonic stem' or Perturbed with 0 Dosage in Cytosol Subcellular Fraction
Organism: 'Homo sapiens' (AND) Cell Line: 'Embryonic stem' (OR) Perturbation: 'Dose = 0' (OR) Subcellular Fraction: 'Cytosol'
Q4. Search Experiments for Bait Gene 'DNMT1' in AP-MS experiments or WNT3A perturbations in Bait Run Group for Cell Line RKO
Bait Gene Symbol = "DNMT1" (AND) Experiment Type = "AP-MS" (OR) Perturbation = "WNT3A" (AND) Run Group = "Bait" (OR) Cell Line = "RKO"
Q5. Search Protein Expression Experiments for T-cells Cell Lines for Drosophila melanogaster organism perturbed with 10 ng in treated cell cultures
Experiment Type: 'Protein Expression' (OR) Cell Line: 'T-cells (Boom)' (AND) Organism: 'Drosophila melanogaster' (AND) Perturbation: '10 ng' (OR) Run Group: 'Treated' (AND) Sample Type: 'Cell culture'
Q6. Search Experiments for 'POU5F1' Bait Genes for 'Embryonic stem' Cell Lines in AP-MS or 'Mus musculus' organisms that are not perturbed or endogenous cell cultures
Bait Gene = "POU5F1" (OR) Cell Line = "Embryonic stem" (AND) Experiment Type = "AP-MS" (OR) Organism = "Mus musculus" (AND) Perturbation = "Not Applicable" (OR) Sample Type = "Cell culture" (OR) Bait Type = "Endogenous"
Q7. Search Protein Expression Experiments or T-cells Cell Lines for Drosophila melanogaster organism for Tagged cell cultures not perturbed for APC Bait Genes and No vector control run groups
Experiment Type: 'Protein Expression' (OR) Cell Line: 'T-cells (Boom)' (AND) Organism: 'Drosophila melanogaster' (AND) Bait Type: 'Tagged' (AND) Sample Type: 'Cell culture' (OR) Perturbation: 'Not Applicable' (AND) Bait Gene: 'APC' (AND) Run Group: 'No Vector Control'
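The query patterns listed above chain metadata terms with (AND) and (OR) connectives. The source does not state how the connectives are grouped, so the following Python sketch assumes the conventional rule that AND binds more tightly than OR. The `matches` function and the sample records are hypothetical and serve only to show how such a pattern could be evaluated against one experiment's metadata.

```python
from typing import Dict, List, Tuple

Term = Tuple[str, str]  # (metadata field, required value)


def matches(record: Dict[str, str], terms: List[Term], operators: List[str]) -> bool:
    """Evaluate term1 op1 term2 op2 ... against one metadata record.

    Assumption: AND binds more tightly than OR, so the pattern is treated as
    an OR over groups of AND-ed terms.
    """
    if len(operators) != len(terms) - 1:
        raise ValueError("need exactly one operator between consecutive terms")
    results = [record.get(field) == value for field, value in terms]
    group = results[0]   # value of the AND-group currently being built
    any_group = False    # True once any completed AND-group was satisfied
    for op, result in zip(operators, results[1:]):
        if op == "AND":
            group = group and result
        elif op == "OR":
            any_group = any_group or group
            group = result
        else:
            raise ValueError(f"unknown operator: {op}")
    return any_group or group


# Query pattern Q4 from the table above.
q4_terms = [
    ("Bait Gene Symbol", "DNMT1"),
    ("Experiment Type", "AP-MS"),
    ("Perturbation", "WNT3A"),
    ("Run Group", "Bait"),
    ("Cell Line", "RKO"),
]
q4_ops = ["AND", "OR", "AND", "OR"]

# Illustrative records, not taken from the CPB data.
hit = {"Bait Gene Symbol": "DNMT1", "Experiment Type": "AP-MS", "Cell Line": "HeLa"}
miss = {"Bait Gene Symbol": "APC", "Experiment Type": "AP-MS", "Cell Line": "HeLa"}

print(matches(hit, q4_terms, q4_ops))   # → True
print(matches(miss, q4_terms, q4_ops))  # → False
```

If SemPoD instead evaluates connectives strictly left to right, the grouping logic in the loop would need to change; the sketch makes that assumption explicit rather than settling it.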
The results clearly show that the total time for increasingly complex queries is relatively stable over the two datasets. Although there is a notable difference in performance between the 20 GB and 50 GB datasets for the same query (Figure 12), this issue can be effectively addressed by improving the hardware configuration of the server. For example, Figure 12 shows that simply upgrading the cache size, from 512 KB to 24 MB, significantly improves the performance for all the queries. Hence, the total time for query execution in SemPoD is not expected to be a significant bottleneck for complex queries over large datasets.
The functionality of the SemPoD query environment is primarily limited by the provenance and domain information modeled in the SysPro ontology. Hence, in the next phase of SemPoD development, we are modeling terms from additional metadata standards included in the MIBBI project. In addition, the SysPro ontology is being expanded to include concepts from GO and PRO to enable linking of genotype and protein data from external sources with CPB internal datasets. This will allow researchers to query across genotype and phenotype data, including clinical information.
The manual mapping of SysPro ontology terms to the underlying database is an important challenge that can be addressed by creating semi-automated mapping techniques, which can define initial mappings through use of lexical matching that are subsequently reviewed by researchers. Since automated schema mapping is still an open research problem, the involvement of researchers to manually verify the ontology-to-database mapping will ensure the accuracy of results in SemPoD. We plan to release the first version of the SemPoD codebase as a GitHub open source project, which will allow other users and developers to review and use SemPoD in other 'omics centers. Similarly, the first version of the SysPro ontology will be released for public use through listing at the National Center for Biomedical Ontologies (NCBO). We propose to define mappings between SysPro and other experiment metadata ontologies already listed at NCBO, including the Ontology for Biomedical Investigation (OBI) and Experiment Factors Ontology (EFO) (derived from OBI).
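As an illustration of the semi-automated mapping step, the following Python sketch proposes candidate ontology-to-column mappings by lexical similarity, leaving them for a researcher to confirm. It is only a sketch under stated assumptions: the column names, the `propose_mappings` helper, and the 0.6 threshold are hypothetical, and the similarity measure is the standard library's `difflib.SequenceMatcher` rather than any technique named in the source.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def normalize(name: str) -> str:
    """Lower-case a label and treat underscores as spaces."""
    return " ".join(name.replace("_", " ").lower().split())


def propose_mappings(
    ontology_terms: List[str], db_columns: List[str], threshold: float = 0.6
) -> Dict[str, Optional[str]]:
    """Suggest the most similar column for each term, or None below threshold.

    The output is a list of candidates for manual review, not a final mapping.
    """
    proposals: Dict[str, Optional[str]] = {}
    for term in ontology_terms:
        scored = [
            (SequenceMatcher(None, normalize(term), normalize(col)).ratio(), col)
            for col in db_columns
        ]
        score, best = max(scored, default=(0.0, None))
        proposals[term] = best if score >= threshold else None
    return proposals


# Hypothetical ontology labels and database column names.
terms = ["Cell line", "Bait gene", "Organism"]
columns = ["cell_line", "bait_gene_symbol", "organism_name", "run_id"]

for term, column in propose_mappings(terms, columns).items():
    print(f"{term!r} -> {column!r}  (pending manual review)")
```

A purely lexical match will miss synonyms and can pair unrelated labels that happen to look alike, which is why the review step described above remains necessary.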
Many researchers routinely use several different proteomics workflows to study biomedical problems. Studies may use different cohorts of patients, different cell lines or different techniques, but their value for biomedical discovery is significantly increased if researchers can query across these different studies as well as integrate with legacy data. The SemPoD platform is an ontology-driven, intuitive query platform that leverages provenance metadata to address these challenges effectively. The SemPoD platform features four components: a query builder that facilitates query composition using existing experiment metadata standard terms, an integrated ontology browser, a result browser, and a query manager to store queries for subsequent re-use or sharing with other researchers. The evaluation results for SemPoD, both in terms of positive user feedback and scalability for complex queries over increasing size of datasets, show that SemPoD can successfully meet the informatics requirements of large 'omics research centers.
This research was supported in part by the PhysioMIMI project (grant #NCRR-94681DBS78) and Case Western Reserve University/Cleveland Clinic & CTSA Grant (grant #UL1 RR024989). We also thank members of the Center for Proteomics and Bioinformatics for their help in evaluating SemPoD prototypes.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.