Browser utilization
DASMiner establishes a general-purpose procedure for discovering and retrieving data from DAS sources. It explores the DAS formalism (Figure 1B) and provides an intuitive interface (Figure 2) without exposing the user to the minutiae of DAS command syntax. Specifically, the application automates the process of writing DAS queries and allows the user to fully explore any DAS source by trying different commands and configurations; these are all explicitly available as alternative operations in the browser. Navigation of DAS sources is aided by info links pointing to the DAS Registry, which provides information about the service and hints on what type of input is expected (e.g. what kind of ID or coordinate system is accepted by the source). Since each DAS source can choose to implement a particular subset of the DAS protocol, i.e. a specific set of commands and coordinate systems, the browser's layout changes to expose the set of commands and coordinate systems supported by the specific service.
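To make concrete what "writing a DAS query" involves, the following is a minimal Python sketch of how a request URL can be assembled from a server, a source, a command and a segment. It is an illustration only and not the DASMiner Matlab API: the function name build_das_url and the server address are hypothetical, and the URL layout (server/source/command?segment=ID:start,stop) follows the general DAS convention.

```python
from urllib.parse import quote


def build_das_url(server, source, command, segment=None, start=None, stop=None):
    """Assemble a DAS-style request URL: server/source/command?segment=ID:start,stop."""
    url = f"{server.rstrip('/')}/{quote(source)}/{quote(command)}"
    if segment is not None:
        seg = quote(segment)
        # A range is appended only when both coordinates are given.
        if start is not None and stop is not None:
            seg += f":{start},{stop}"
        url += f"?segment={seg}"
    return url


# Hypothetical server and source names, used only for illustration.
url = build_das_url("http://example.org/das", "uniprot", "features",
                    segment="P53_HUMAN")
ranged = build_das_url("http://example.org/das/", "toy_genome", "features",
                       segment="7", start=1000, stop=2000)
print(url)
print(ranged)
```

A browser such as the one described here would fill these arguments from the menus and text fields, so the user never has to type the URL by hand.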
The browser has two main panels (Figure 2): (i) a query definition panel, where the user chooses commands and sets their arguments, and (ii) a data display and export panel aimed at visualizing and manipulating the XML response from the DAS source.
Query assembly was designed so that the user is prompted to enter query settings in a cascading fashion. Depending on which command has been selected, fields are displayed in the query settings panel, where the parameters for the command should be typed. For instance, Figure 2 depicts how the browser appears to a user running the features command to retrieve the annotation of the human p53 protein from UniProt. The first step is to select the data source from the sources menu (Figure 2, upper left corner), which populates the interface with the orange info icon, the description of the source and the capabilities menu. The next step is to select features from the command menu, which results in the display of the segment definitions panel. This navigation follows the generic DAS model described in the diagram in Figure 1B. Finally, the protein ID (P53_HUMAN) is provided, followed by the selection of 'All features' (default) or 'Browse features' in the feature selector menu and a click on the search button. The DAS request is then sent to the UniProt DAS server, which sends back an XML-formatted response. All query information is saved as variables in the Matlab workspace, so that the user can manipulate query results easily. After a query is performed, the user is informed that four variables have been created in the workspace: (i) DASquery_XML: string, the XML returned by the DAS service; (ii) DASquery_url: string, the URL assembled by the API to retrieve the data; (iii) DASquery_struct: struct, the XML transformed into a Matlab struct that can be explored (in the case studies below we used structs to manipulate DAS data); (iv) DASquery_struct2: struct, the XML transformed into a Matlab struct using an alternative parser that creates a DOM tree out of the XML string.
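The conversion of an XML response into an explorable structure can be sketched as follows. This Python example is not the parser used by DASMiner; it only illustrates the idea on a small, invented response that follows the general layout of a DAS features document (SEGMENT elements containing FEATURE elements with TYPE, START and END children). The identifiers and coordinates in SAMPLE_XML are toy values, not real annotation.

```python
import xml.etree.ElementTree as ET

# Toy response: identifiers and coordinates are invented for illustration.
SAMPLE_XML = """<DASGFF>
  <GFF version="1.0" href="http://example.org/das/toy/features?segment=TOY_PROTEIN">
    <SEGMENT id="TOY_PROTEIN" start="1" stop="300">
      <FEATURE id="feat1" label="Toy domain">
        <TYPE id="domain">domain</TYPE>
        <START>100</START>
        <END>250</END>
      </FEATURE>
      <FEATURE id="feat2" label="Toy site">
        <TYPE id="site">site</TYPE>
        <START>42</START>
        <END>42</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Flatten a DAS-style features document into a list of dictionaries."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feat in segment.findall("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feat.get("id"),
                "label": feat.get("label"),
                "type": feat.findtext("TYPE", default="").strip(),
                "start": int(feat.findtext("START", default="0")),
                "end": int(feat.findtext("END", default="0")),
            })
    return features


features = parse_features(SAMPLE_XML)
for f in features:
    print(f["id"], f["type"], f["start"], f["end"])
```

Each dictionary plays a role comparable to one entry of the struct described above: the fields can be filtered, sorted or plotted without further reference to the raw XML.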
Additionally, the XML output can be either exported to a file or visualized in the browser (Figure 2). Also, the query URL (top data display panel) can be exported to the Matlab workspace in the form of an API call and can be either executed by the function eval or inserted into any script.
A DAS Registry Discovery module has also been included to search the registry for sources (Figure 3). New services can be made available locally for browsing after being discovered through this module. The criteria for searching the registry include organism, coordinate system, authority, capability and label. As a criterion is selected, a pull-down menu is dynamically populated with the available options.
Figure 3 depicts the 57 DAS sources automatically retrieved for a query on Homo sapiens. The results table displays basic information about the sources, such as title, description and a link to the registry. The user can then select a DAS source for querying with the main interface.
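The registry search described above amounts to filtering source records by one or more criteria. A minimal Python sketch of that filtering step is given below; the record layout, the function name search_sources and the three toy entries are assumptions made for illustration and do not reflect the actual registry format or the DASMiner implementation.

```python
def search_sources(sources, **criteria):
    """Return the sources whose fields match every criterion (case-insensitive)."""
    def matches(source, key, wanted):
        value = source.get(key, [])
        # A field may hold a single string or a list of strings.
        values = value if isinstance(value, list) else [value]
        return wanted.lower() in (v.lower() for v in values)

    return [s for s in sources
            if all(matches(s, key, wanted) for key, wanted in criteria.items())]


# Toy registry entries, invented for illustration.
TOY_REGISTRY = [
    {"title": "toy_source_1", "organism": "Homo sapiens",
     "capability": ["features", "types"]},
    {"title": "toy_source_2", "organism": "Mus musculus",
     "capability": ["features"]},
    {"title": "toy_source_3", "organism": "Homo sapiens",
     "capability": ["sequence"]},
]

human = search_sources(TOY_REGISTRY, organism="Homo sapiens")
human_features = search_sources(TOY_REGISTRY, organism="homo sapiens",
                                capability="features")
print([s["title"] for s in human])
print([s["title"] for s in human_features])
```

Combining criteria narrows the result set, which mirrors how selecting several search fields in the discovery module restricts the table of sources.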
Examples of API applications
The DASMiner API was used to create enriched data sets of histone modification data and protein interactions by accessing multiple DAS sources. The following case studies can be reproduced by running the files available in the Examples folder of the distributed source code. The example files are named after their corresponding figures, as described in the figure captions. In general, the example scripts execute DASMiner API calls to collect the data, parse the data locally to construct an appropriate data representation, and then plot a graphical visualization.
A) Creating and visualizing enriched histone modification data sets
The ENCODE project was a large-scale community effort that sought to analyze 1% (30 Mb) of the human genome through an array of experimental techniques that studied in detail the functions of selected DNA regions [13]. All assays performed were made accessible through the web interface of the UCSC ENCODE Genome Browser (Figure 4A) as well as through a DAS service [14].
One of the goals of ENCODE was to characterize histone modifications in normal human cell lines, e.g., GM06990 (lymphoblastoid) and HFL1 (lung fibroblast), and also in cancer cell lines, e.g., K562 (leukemia) and HeLa (cervical carcinoma). Using ChIP-chip arrays [15], several H3 and H4 methylation and acetylation signals were measured, including H3K4me1, H3K4me2, H3K4me3, H3K9me3, H3K27me3, H3K36me3, H3K79me3, H3ac, and H4ac. Taken together, these marks are a subset of what is known as the histone code. They act as a first-layer regulatory mechanism of gene expression by promoting or repressing chromatin accessibility and the recruitment of initiation factors [16].
We used histone modification data generated by ENCODE to exemplify how one can access data from DAS sources and handle this experimental data to create new modes of visualization. Figure 4A shows how the UCSC Genome Browser displays histone data tracks, sorted by cell line. The graph in Figure 4B compares two specific positive histone marks, H3K4me3 and H3ac, measured in a normal cell line (GM06990) and a cancer cell line (K562) over chromosome 7. This side-by-side view of selected histone marks and selected cell lines facilitates the identification of regions of interest (ROIs) to be investigated further. For example, the graph shows that cancer cells have weaker positive marks than normal cells in regions located within bands q21.11 and q11.22 of chromosome 7. This is evidence of negatively modulated DNA, which may encode, for example, anti-tumorigenic functions. Other K562 ROIs are those that have gained positive marks and are therefore likely to be more accessible to the transcription machinery. Regions within bands p14.1 and p21.1 fall into this category, as they show significant enrichment of H3K4me3 and H3ac modifications.
We also illustrate the potential of the API by creating an enriched histone data set that integrates information from multiple DAS sources. Figure 4C shows a heatmap of histone profiles in GM06990 cells for 5 marks, namely H4ac, H3ac, H3K4me1, H3K4me2, and H3K4me3. The data set for clustering was built by fetching ChIP-chip array data from UCSC using DASMiner. The data set of genomic regions with histone measurements was then expanded by integrating two other DAS sources: the Vega/Havana Database [17] for retrieving gene annotation and the Genetic Association Database (GAD) [18] for finding links to cancer. After retrieving these sources, a heatmap was generated in which each column corresponds to a chromosome region that may map to a gene, and this gene may in turn be associated with some cancer type. Finally, the data were organized by hierarchical clustering using the Euclidean distance between histone modification profiles. This heatmap view provides an intuitive way to identify regions in the genome that share a similar histone modification pattern, and then to study these regions to characterize their function. In Figure 4D, we zoom in on a selected group of 39 regions with high signals for the positive marks H3ac, H3K4me2, and H3K4me3. According to ENCODE findings, regions with this profile consist of very actively transcribed DNA and are usually associated with gene promoters. Within the group there are regions coding for the genes TES, CAV1 and CAV2, which perform tumor suppressor activities. In addition, from the GAD DAS annotation, we know that CAV1 and CAV2 are associated with prostate cancer.
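The clustering step can be illustrated with a small, dependency-free Python sketch. It performs agglomerative hierarchical clustering on toy histone profiles using the Euclidean distance. The linkage rule is not stated in the text, so average linkage is assumed here, and the profile values are invented; the distributed example scripts remain the reference for how the figures were actually produced.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clustering(profiles):
    """Agglomerative clustering with average linkage (an assumption).

    Returns the merges in order, each as (members_a, members_b, distance),
    where members are indices into `profiles`.
    """
    clusters = [[i] for i in range(len(profiles))]
    merges = []

    def average_linkage(c1, c2):
        total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy profiles: one row per genomic region, one column per histone mark
# (H4ac, H3ac, H3K4me1, H3K4me2, H3K4me3). Values are invented.
profiles = [
    [2.0, 2.1, 0.1, 1.9, 2.2],
    [2.1, 2.0, 0.2, 2.0, 2.1],
    [0.1, 0.2, 1.5, 0.1, 0.0],
    [0.0, 0.1, 1.6, 0.2, 0.3],
]

merges = hierarchical_clustering(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

Regions 0 and 1 share high signals for the acetylation and H3K4me2/me3 marks and are merged first, which is the kind of grouping that the heatmap makes visible at a larger scale.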
B) Creating and visualizing integrated molecular interaction data sets
Another kind of molecular biology data available via DAS is protein interaction data. The DASMI project [19] made dozens of molecular interaction databases accessible via the DAS protocol, such as iPFAM, InterDom, Human Protein Reference Database, Bioverse, HomoMint and IntAct, to name just a few. We used these data to create an integrated model of a tumor suppressor (TS) network involving well-known human TS [20] and their interacting partners. Figure 5A illustrates a fragment of the TS network built using interactions reported in iPFAM. In this network, two proteins are connected when their domains interact in a 3D conformation. After representing this information as a network, we can interrogate it with graph algorithms to extract knowledge about TS connectivity. For example, we can find the subgraph of common interactors of p53 and Brca1, as depicted in Figure 5B. Both p53 and Brca1 participate in the DNA damage checkpoint at the G1/S transition of the cell cycle. They activate signalling pathways that carry out DNA repair and apoptosis in the cell, and their common interacting proteins also participate in these processes. For example, Mdc1 is involved in double-strand break repair, while PARP1 acts in base excision DNA repair [21].
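The query for common interactors can be sketched in a few lines of Python. The network is stored as an adjacency map, and the common interactors of two proteins are the intersection of their neighbor sets. The edge list below is a toy example: the Mdc1 and PARP1 links follow the text, while PARTNER_X and PARTNER_Y are hypothetical placeholders, and the function names are not part of the DASMiner API.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def common_interactors(network, a, b):
    """Proteins that interact with both a and b, excluding a and b themselves."""
    return (network.get(a, set()) & network.get(b, set())) - {a, b}


# Toy edge list; PARTNER_X and PARTNER_Y are hypothetical placeholders.
interactions = [
    ("p53", "Mdc1"), ("Brca1", "Mdc1"),
    ("p53", "PARP1"), ("Brca1", "PARP1"),
    ("p53", "Brca1"),
    ("p53", "PARTNER_X"),
    ("Brca1", "PARTNER_Y"),
]

network = build_network(interactions)
shared = common_interactors(network, "p53", "Brca1")
print(sorted(shared))
```

The subgraph of Figure 5B corresponds to the two query proteins, the shared set, and the edges among them.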
As a second illustrative example, we built an integrated TS network using the information contained in 11 DAS sources, including PFAM and HPRD, and we visualized this data set using heatmaps. Figure 5C shows the TS network heatmap, where TS nodes are represented in columns and non-TS nodes in rows. The color of a specific interaction encodes the number of hits supporting this interaction across the different databases. This heatmap therefore shows how connected each TS is and also allows the reliability of a given TS/non-TS interaction to be assessed. Visual inspection of this plot shows that Rb1, p53, Cdkn2a, Stk11, and Smarcb1 are among the most connected TS. Figure 5D provides a closer look at the heatmap, highlighting a group of 30 proteins and how they are linked to TS. For instance, we note that several cyclin-dependent kinases, i.e., cdk2, cdk3, cdk4, cdk5, cdk6, cdk7, cdk8, and cdk9, which are enzymes that control progression of the cell cycle, are usually found to interact with Rb1 and Cdkn2a, negative regulators of the cell cycle [21].
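The heatmap values described above are support counts: for each TS/non-TS pair, the number of databases that report the interaction. The Python sketch below shows one way such a matrix could be computed. The source names and edge lists are invented for illustration, the function names are hypothetical, and counting each database at most once per pair is an assumption about how hits are tallied.

```python
from collections import Counter


def count_support(interactions_by_source, ts_set):
    """Count, for each (non-TS, TS) pair, how many sources report it."""
    counts = Counter()
    for pairs in interactions_by_source.values():
        seen = set()
        for a, b in pairs:
            if a in ts_set and b not in ts_set:
                seen.add((b, a))
            elif b in ts_set and a not in ts_set:
                seen.add((a, b))
        # Each source contributes at most one hit per pair (an assumption).
        counts.update(seen)
    return counts


def support_matrix(counts):
    """Arrange counts as rows (non-TS) by columns (TS), both sorted by name."""
    rows = sorted({partner for partner, _ in counts})
    cols = sorted({ts for _, ts in counts})
    matrix = [[counts.get((r, c), 0) for c in cols] for r in rows]
    return rows, cols, matrix


# Toy data: source names and edges are invented for illustration.
interactions_by_source = {
    "source_A": [("Rb1", "cdk4"), ("Cdkn2a", "cdk4")],
    "source_B": [("cdk4", "Rb1"), ("Rb1", "cdk6")],
    "source_C": [("Rb1", "cdk4"), ("Rb1", "cdk4")],
}

counts = count_support(interactions_by_source, ts_set={"Rb1", "Cdkn2a"})
rows, cols, matrix = support_matrix(counts)
print(rows, cols, matrix)
```

Column sums of such a matrix indicate how connected each TS is, while individual cells indicate how many databases agree on a given interaction.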