EnRICH: Extraction and Ranking using Integration and Criteria Heuristics
© Zhang et al.; licensee BioMed Central Ltd. 2013
Received: 11 April 2012
Accepted: 7 January 2013
Published: 15 January 2013
High throughput screening technologies enable biologists to generate candidate genes faster than those genes can be studied experimentally in the laboratory, given time and cost constraints. Thus, it has become increasingly important to prioritize candidate genes for experiments. To accomplish this, researchers need to apply selection requirements based on their knowledge, which necessitates qualitative integration of heterogeneous data sources and filtration using multiple criteria. A similar approach can also be applied to putative candidate gene relationships. While automation can assist in this routine and imperative procedure, flexibility of data sources and criteria must not be sacrificed. A tool that can optimize the trade-off between automation and flexibility to simultaneously filter and qualitatively integrate data is needed to prioritize candidate genes and generate composite networks from heterogeneous data sources.
We developed the Java application EnRICH (Extraction and Ranking using Integration and Criteria Heuristics) to address this need. Here we present a case study in which we used EnRICH to integrate and filter multiple candidate gene lists in order to identify potential retinal disease genes. As a result of this procedure, a candidate pool of several hundred genes was narrowed down to five candidate genes, of which four are confirmed retinal disease genes and one is associated with a retinal disease state.
We developed a platform-independent tool that is able to qualitatively integrate multiple heterogeneous datasets and use different selection criteria to filter each of them, provided the datasets are tables that have distinct identifiers (required) and attributes (optional). With the flexibility to specify data sources and filtering criteria, EnRICH automatically prioritizes candidate genes or gene relationships for biologists based on their specific requirements. Here, we also demonstrate that this tool can be effectively and easily used to apply highly specific user-defined criteria and can efficiently identify high quality candidate genes from relatively sparse datasets.
Hundreds to thousands of candidate genes, or genes of interest, can now be generated from a single experiment utilizing high throughput screening technologies. However, the number of candidate genes that can be experimentally studied in depth is often constrained by time and cost. Therefore, prioritization of candidate genes is a critical step in the experimental process. Approaches to identify ‘the most promising’ candidates are becoming increasingly sophisticated. For example, when microarray studies were initially reported, ‘the most promising’ candidates were often the most differentially expressed and could be obtained by a simple ranking of candidates based on fold change. As more data have become available, biologists have begun to look for ways [1–4] to use multiple data sources to increase the accuracy of candidate gene prioritization. Some tools have already been developed to address this need [5–11]. These tools prioritize candidates by their similarity to genes already known to be important for a particular biological process (e.g., genes known to regulate cell cycle in yeast). Multiple data sources including published literature, gene sequence, functional annotation, etc. can be considered when comparing the similarity of candidates to ‘known genes’. These tools [5–11] have made important progress towards the problem of candidate prioritization. However, they use data queried from predetermined sources, such as public databases, and include embedded criteria. Thus, these software packages have limited utility when researchers need to supply their own datasets and selection criteria.
Biologists with expertise in a given area generally already have a list of criteria that could be applied to identify high quality candidates. For a given set of experiments and resulting datasets, the best candidates are likely to satisfy one set of criteria in one dataset and a separate set of criteria in another dataset. Currently, there is no tool that allows simultaneous consideration of heterogeneous datasets to identify candidates that satisfy multiple criteria. This problem relates not only to candidate genes, but also to putative relationships between genes in networks.
Putative gene relationships can be inferred from many heterogeneous sources (e.g., physical interactions, genetic interactions, expression correlation and interactions predicted by computational models). While each of these relationships from a given dataset should be interpreted differently (and subject to very different criteria), the ability to easily hypothesize gene relationships based on their meeting appropriate criteria in multiple datasets is an attractive prospect. This task not only calls for an automated filtering and integration tool, but also demands great flexibility of data sources and the ability to set filtering criteria. Finally, for proper interpretation, visualization of the resulting network must facilitate inspection by 1) retaining the original data sources of each putative relationship and 2) providing a mechanism to easily manage the size of the displayed network. While some tools have been developed to generate composite networks from multiple data sources (e.g., the Cytoscape plugin CABIN, GraphWeb and GeneMania), they do not fully address the problems stated above. For example, CABIN supports only one filter for a single source network and thus multiple criteria cannot be applied. GraphWeb does not support filtering by user-defined criteria or interactive network visualization. GeneMania helps to predict the function of a set of input genes by utilizing functional association data to generate a functionally relevant network, but does not address integration of user-determined data and filtration with user-defined criteria.
We identified the need for a tool that is able to: 1) filter individual datasets using appropriate criteria and then integrate them to prioritize candidates that meet the criteria in multiple datasets; 2) allow users to define the most appropriate datasets and filtering criteria; and 3) provide an interactive visualization to facilitate the generation of an integrated network with a manageable size and connectedness. To address the open demand for filtering and qualitative integration of heterogeneous datasets, we have developed a stand-alone, portable and flexible Java application with its own user-interactive visualization. EnRICH (Extraction and Ranking using Integration and Criteria Heuristics) will assist biologists in prioritization of genes and gene relationships from heterogeneous-source data.
EnRICH was implemented in Java (SE 6 JDK). EnRICH visualization was written in Processing (http://processing.org/), an open-source programming language to create images, animation and interactions. The separation of non-visual and visual modules of EnRICH lays a flexible foundation for future development and provides the user easy access to both the text and visual output results.
The objectives of EnRICH are, first, to provide a tool for integration of multiple or heterogeneous datasets to prioritize candidate molecules that fulfill user-defined criteria and, second, to make the integration process flexible and simple for biologists who have little programming skill. Our aim-oriented design principles are 1) user-defined data sources and criteria, 2) simplicity, which allows straightforward application of user-defined criteria to filter user-defined datasets, and 3) platform independence.
The current version of EnRICH accepts two types of data: list and network. A list is a set of elements that could be genes, proteins, etc., which have their own unique identification code or name. List data can come from a large variety of sources. For example, a list of genes can be differentially expressed genes (DEGs) from the analysis of a microarray experiment, genes identified by genome-wide association mapping, or genes retrieved from a database query. Each list member may have one or more attributes. For example, each gene in a list of DEGs has its own significance value, functional annotation, etc. For EnRICH, list data is represented as a named matrix that is composed of one column of elements and zero to multiple attribute columns. Attributes can either be value attributes that will be taken as mathematical values or label attributes treated as tags.
A network is a set of nodes that are interconnected by edges representing particular relationships between nodes. Like list data, network data can originate from heterogeneous sources including yeast two-hybrid experiments, computational or statistical inferences, literature summaries or database queries. Although there are several standard languages or formats for network representation, we assume that biologists may not be familiar with those standards. Thus, EnRICH applies a popular node-pair/edge list format as the input format for network data, where an edge is denoted by the pair of nodes it connects. In the matrix format, network edges are represented by two columns of node names. Like list data, network edges may have values and label attributes. Accordingly, a network is a named matrix consisting of two columns of nodes and zero to multiple attribute columns. EnRICH allows blank fields in the attribute column when data are missing.
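The two input shapes described above can be illustrated with a short sketch. It is written in Python for brevity and is not taken from EnRICH's Java source; the function names `parse_table` and `to_value`, the presence of a header row, the treatment of edges as undirected, and the example column names and values are all assumptions made for illustration.

```python
import csv
import io


def to_value(field):
    """Blank field -> None, numeric field -> float (value attribute),
    anything else -> string (label attribute)."""
    field = field.strip()
    if field == "":
        return None
    try:
        return float(field)
    except ValueError:
        return field


def parse_table(text, key_columns):
    """Parse a tab-delimited table that starts with a header row.

    key_columns is 1 for list data (one element column) and 2 for
    network data (two node columns). All remaining columns are
    attributes, and missing attribute fields are allowed.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    attr_names = header[key_columns:]
    records = {}
    for row in body:
        key = tuple(row[:key_columns])
        if key_columns == 2:
            key = tuple(sorted(key))  # assumption: edges are undirected
        fields = row[key_columns:]
        records[key] = {
            name: to_value(fields[i] if i < len(fields) else "")
            for i, name in enumerate(attr_names)
        }
    return records


# Illustrative input only; the attribute values below are made up.
gene_list = parse_table(
    "gene\tp_value\tannotation\n"
    "Rho\t0.001\tphototransduction\n"
    "Nr2e3\t\ttranscription factor\n",
    key_columns=1,
)
network = parse_table(
    "node_a\tnode_b\tscore\n"
    "Nrl\tRho\t0.9\n"
    "Nr2e3\tNrl\t\n",
    key_columns=2,
)
```

Keying each record by a tuple keeps list elements and network edges in the same structure, so later filtering and integration steps can treat both data types uniformly.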
EnRICH runs in two modes: undefined (without filters) and defined (using specific criteria to filter attributes). The undefined mode simply ignores the attributes of networks or lists: each list or network is considered a source, and all sources are merged together. The defined mode combines integration of networks or lists with user-defined criteria, which are applied to each network or list to filter out elements that do not meet them. In both modes, candidates (edges of a network or elements of a list) are ranked by their reoccurrence across all sources after integration. The filtering process is completely user-defined. Because filters are entirely attribute-based, the user sets the filters most appropriate for their biological question, which may include a combination of filters for a single attribute and multiple filters across multiple attributes. For example, two of the comparison operators (<, <=, >, >=, ==) can be applied at the same time to set a cutoff range for a value attribute, and several tags can be combined (with an OR operator between them) when the user wants to select multiple label values (e.g., two annotations) for one attribute. When there are multiple attributes, multiple filters (with an AND operator between them) can be applied simultaneously.
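The filtering and ranking logic described above can be sketched as follows. This is a minimal Python illustration, not EnRICH's implementation: the names `passes` and `integrate`, the dictionary layout of sources and filters, and the choice to let a missing value fail any filter on that attribute are assumptions.

```python
import operator
from collections import defaultdict

OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def passes(attrs, value_filters=(), label_filters=None):
    """Return True if a candidate's attributes satisfy every filter.

    value_filters: (attribute, operator, threshold) triples joined by AND;
                   two triples on one attribute define a cutoff range.
    label_filters: attribute -> set of accepted tags (OR within the set).
    """
    for name, op, threshold in value_filters:
        value = attrs.get(name)
        # Assumption: a missing value fails the filter.
        if value is None or not OPS[op](value, threshold):
            return False
    for name, accepted in (label_filters or {}).items():
        if attrs.get(name) not in accepted:
            return False
    return True


def integrate(sources, filters=None):
    """Merge sources and rank candidates by reoccurrence.

    sources: source name -> {candidate: attribute dict}
    filters: source name -> keyword arguments for passes();
             None reproduces the undefined mode (attributes ignored).
    Returns (candidate, [source names]) pairs, most recurrent first.
    """
    found_in = defaultdict(list)
    for source_name, records in sources.items():
        criteria = (filters or {}).get(source_name)
        for candidate, attrs in records.items():
            if criteria is None or passes(attrs, **criteria):
                found_in[candidate].append(source_name)
    return sorted(found_in.items(), key=lambda item: (-len(item[1]), item[0]))
```

A range such as 2.0 <= fold change < 4.0 is expressed as two value filters on the same attribute, for example `[("fc", ">=", 2.0), ("fc", "<", 4.0)]`, where `fc` is a hypothetical attribute name.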
EnRICH saves output results as a tab-delimited text file. In the output text file, the user can see which files were integrated, which filters were applied to each file, and the result. For list data, the result is a table consisting of three columns: the label of an element, its reoccurrence across all lists, and the names of its source lists. For network data, the result includes four tables: node statistics, edge statistics, nodes, and edges. Because node degree reveals the topological importance of a node, the table of node statistics contains two columns: the node degree (the number of connections a single node has) and the number of nodes whose degree is greater than or equal to (>=) that value. Similarly, in the table of edge statistics, one column is the edge reoccurrence (the number of times a single edge is recovered across all datasets) and the other is the number of edges whose reoccurrence is greater than or equal to (>=) that value. The table of nodes and the table of edges are quite similar. Each has a column of nodes or edges, their reoccurrence, and their source networks. The only difference between them is that a node is represented by one node label and an edge by two node labels. Because the table of edges is tab-delimited, with columns for node labels, edge reoccurrence and source, the user can directly copy or import it into another network visualization tool such as Cytoscape [12, 16] if desired.
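The two cumulative statistics tables can be derived from an integrated edge table as in the following sketch. It is a Python illustration under stated assumptions rather than EnRICH's code: the function names and the `{(node_a, node_b): [source names]}` input layout are hypothetical, and degree is counted on the merged network.

```python
from collections import Counter


def at_least_table(counts):
    """Return (value, number of entries with a value >= that value) pairs,
    in ascending order of value."""
    histogram = Counter(counts)
    table, running = [], 0
    for value in sorted(histogram, reverse=True):
        running += histogram[value]
        table.append((value, running))
    return table[::-1]


def network_statistics(edge_sources):
    """edge_sources: {(node_a, node_b): [source names]} after integration.

    Returns (node_stats, edge_stats), where node_stats is built from node
    degree and edge_stats from edge reoccurrence across sources.
    """
    degree = Counter()
    for node_a, node_b in edge_sources:
        degree[node_a] += 1
        degree[node_b] += 1
    node_stats = at_least_table(degree.values())
    edge_stats = at_least_table(len(names) for names in edge_sources.values())
    return node_stats, edge_stats
```

Reading the tables from the bottom up shows how many nodes or edges would remain if the displayed network were restricted to a given minimum degree or reoccurrence, which is one way to choose a threshold that keeps the network at a manageable size.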
Retinal disease genes are genes that, when knocked out or mutated, cause retinal degeneration (https://sph.uth.tmc.edu/retnet/disease.htm). The identification of retinal disease genes is a major goal of retinal degenerative disease research and as part of this effort, there have been a significant number of experiments that describe transcriptional changes during normal retinal development [17–23]. Here, we present a case study in which we use EnRICH to integrate multiple gene lists to identify potential retinal disease genes.
Nrl [24–27] is a retinal disease gene that is associated with the retinal degenerative disease enhanced s-cone syndrome. When Nrl is mutated, the resulting phenotype is an abundance of s-cone photoreceptors at the expense of rod photoreceptor differentiation [25, 29], leading to the eventual death of all photoreceptors. During normal development, Nrl influences the cone versus rod cell fate decision by activating rod-specific genes, including Rho and Nr2e3. Rho [31–33] is a rod-specific gene, the mutation of which leads to rod photoreceptor cell death and retinal degeneration. Nr2e3 [34, 35] is also essential during retinal development, as it promotes the expression of rod-specific genes (including Rho) and represses the expression of cone-specific genes in rods. Mutation of Nr2e3 also causes enhanced s-cone syndrome. Based on the known regulatory relationships between these three disease genes and their importance for normal photoreceptor development, we reasoned that the behavior of these genes would provide good criteria for identifying additional retinal disease genes.
Using these assumptions, we defined the following criteria to identify retinal disease genes: 1) candidates must be highly co-expressed with Nrl, Nr2e3 and Rho during rod photoreceptor development in wild-type mice; and 2) candidates must be dysregulated when Nrl is knocked out (as Nr2e3 and Rho are). With these criteria in mind, we decided to use a microarray dataset (GSE4051), which profiles gene expression in isolated rod photoreceptors at multiple developmental stages (E16, P2, P6, P10, 4 weeks) in both Nrl-knockout and wild-type mice. In these microarrays, we confirmed that Nr2e3 and Rho are highly co-expressed with Nrl in wild-type mice and are no longer co-expressed in the Nrl mutant.
The execution of our workflow generated five candidate genes (see Additional file 2: Table S2) from an initial pool of 272 unique differentially expressed genes (see Additional file 3). Based on a literature/database search, four of our five candidate genes (pde6b, gnb1, guca1a and cnga1) are confirmed retinal disease genes [37–46], and the fifth gene (kcne2) has been shown to be upregulated during a neuroinflammatory response in the retinas of diabetic rats, making it a reasonable candidate for a disease gene as well. Thus, in our example analysis to identify disease genes, 80% of our candidates are known disease genes, while the remaining candidate has a demonstrated tie to the diseased retina and is therefore a high quality candidate. Using a Fisher test, we also concluded that retinal disease genes are significantly overrepresented among the genes prioritized by EnRICH, compared with genes not prioritized by EnRICH (see Additional file 3).
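The overrepresentation test mentioned above can be illustrated with a small, dependency-free sketch of Fisher's exact test on a 2x2 table. The text does not state whether the test was one-sided or two-sided, nor how many of the non-prioritized genes are known retinal disease genes (those details are in Additional file 3), so the one-sided form, the use of the 272-gene pool as background, and the value of `known_not_prioritized` below are assumptions for illustration only.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

    a: prioritized, known disease gene      b: prioritized, not known
    c: not prioritized, known disease gene  d: not prioritized, not known
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, row1)


# From the text: 5 prioritized genes, 4 of them confirmed disease genes,
# drawn from a pool of 272 differentially expressed genes.
known_not_prioritized = 10  # hypothetical count, NOT taken from the paper
p_value = fisher_one_sided(
    4, 1, known_not_prioritized, 272 - 5 - known_not_prioritized
)
```

The resulting `p_value` depends entirely on the hypothetical count and should not be read as the result reported in the paper.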
Our case study demonstrates that well-conceived data integration and criteria-based filtration, as implemented in EnRICH, can effectively identify a limited number of high quality candidate genes for careful hypothesis-based investigation. If the number of candidates returned is too small, slight adjustments to the filtration criteria can easily be made to generate a larger, yet still reasonably sized, candidate pool.
EnRICH is a free Java application that can qualitatively integrate results from large, heterogeneous data sources while simultaneously applying filters to each of them. It allows the user to define data sources, to integrate them, and to specify multiple sorting criteria specific to each data source. It provides an interactive network visualization tool with which the user can identify an integrated network with a desirable balance between network size and quality. With EnRICH, biologists have an automated yet flexible integration tool to carry out their data analysis and effectively prioritize candidate genes for further investigation.
Project name: EnRICH (see Additional file 4 for the jar file of the EnRICH program).
Operating system(s): platform-independent.
Programming language: Java.
Other requirements: Java 1.4.2 or higher.
License: GNU General Public License.
Any restrictions to use by non-academics: NO.
The authors wish to thank Guan Wang and Fadi Towfic for their helpful comments on this software. Financial support was provided by the Center for Integrative Animal Genomics at Iowa State University to JMS and MHWG.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.