Cross-species Conservation of context-specific networks
BMC Systems Biology volume 10, Article number: 76 (2016)
Many large data compendia on context-specific high-throughput genomic and regulatory data have been made available by international research consortia such as ENCODE, TCGA, and Epigenomics Roadmap. The use of these resources is impaired by the sheer size of the available big data and big metadata. Many of these context-specific data can be modeled as data derived regulatory networks (DDRNs) representing the complex and complicated interactions between transcription factors and target genes. These DDRNs are useful for the understanding of regulatory mechanisms and helpful for interpreting biomedical data.
The Cross-species Conservation framework (CroCo) provides a network-oriented view on the ENCODE regulatory data (CroCo network repository), convenient ways to access and browse networks and metadata, and a method to combine networks across compendia, experimental techniques, and species (CroCo tool suite). DDRNs can be combined with additional information and networks derived from the literature, curated resources, and computational predictions in order to enable detailed exploration and cross checking of regulatory interactions. Applications of the CroCo framework range from simple evidence look-up for user-defined regulatory interactions to the identification of conserved sub-networks in diverse cell-lines, conditions, and even species.
CroCo adds an intuitive unifying view on the data from the ENCODE projects via a comprehensive repository of derived context-specific regulatory networks and enables flexible cross-context, cross-species, and cross-compendia comparison via a basis set of analysis tools.
The CroCo web-application and Cytoscape plug-in are freely available at: http://services.bio.ifi.lmu.de/croco-web. The web-page links to a detailed system description, a user guide, and tutorial videos presenting common use cases of the CroCo framework.
The ENCODE , mouseENCODE , and modENCODE  consortia published a huge amount of high-throughput genome-wide and context-specific (regulatory) data sets, including extensive metadata for human, mouse, fly, and worm. The goal of the ENCODE projects is to provide a comprehensive view on regulatory features in hundreds of different contexts measured with a wide range of standardized high-throughput techniques. Such a compendium of measurements (data) and associated describing information (metadata) provides a rich resource for further analyses, experiments, and comparisons. Moreover, additional compendia such as TCGA  and Epigenomics Roadmap  offer even larger and more diverse data and metadata on the regulation of biological systems. The data provided in these large and complex compendia are well-structured according to various dimensions such as species, tissue, cell-line, experimental technique, treatment, gene, and protein, essentially forming a large and high-dimensional data cube. Of course, there are many ways to interpret and analyze the available information and data.
Any approach for comprehensive interpretation (not to speak of understanding) is difficult due to the sheer amount and complexity of the data. In principle, it is straightforward to navigate the data cube and select specific (sets of) data sets, but the currently available browsers of the compendia are limited to narrowing down the selection of experiments via simple filters. For example, one way to analyze regulatory information is to look at specific genomic regions (e.g., genes or promoters) in the raw data with tools like the UCSC Genome Browser , or portals like modMine  and the ENCODE meta-database . However, the measured complex and complicated interactions between genes and specific genomic regions are obfuscated by analyzing individual data sets and individual genomic measurements. Thus, a standard approach to analyze, interpret, and visualize context-specific raw data is via modeling of Transcription Factor (TF) – Target Gene (TG) regulatory networks [9, 10]. We call these networks Data Derived Regulatory Networks (DDRNs). DDRNs are graphs (V,E) defined on a common set of vertices V and specific sets of edges E between them. They provide an abstracted view on the data and are used as the main tool to enable a straightforward analysis, comparison and integration of data sets across different contexts, experimental techniques, and cell-lines, even across different compendia and across different species via basic network operations. Such regulatory networks are used in various contexts for generating and validating new biological hypotheses, for explaining experimental data [11–14], and for studying evolutionary mechanisms [10, 15, 16]. Initial analyses revealed that regulatory elements and the corresponding DDRNs are highly context-specific and complex [10, 17, 18]. Thus, differential and context-specific network analysis is now becoming a prevalent tool, as it enables the identification of new interactions, complexes and pathways, which would be obscured in context independent networks . For example, tools and web-services like NetWAS and GIANT  allow to infer and browse tissue-specific functional networks in order to identify tissue-specific disease-gene associations.
The construction of DDRNs from experimental binding data requires the identification of binding sites and the prediction of possible targets for the DNA binding protein in the respective context. ChIP-seq experiments directly measure the binding of a protein to the DNA, making the inference for regulatory targets for the ChIP-ed factor possible for all genes with bindings within the promoter region. This approach was, for instance, used by Kim et al.  for several transcription factors in mouse embryonic stem cells in order to induce cell type-specific regulatory sub-networks. Advanced experimental techniques and computational predictions like the combination of open chromatin data and transcription factor specific Position Weight Matrices (PWM) allow for the construction of networks for many factors at once. Neph et al.  introduce such an approach by combining Digital Genomic Footprinting (DGF)  from 41 cell-lines and tissues with binding site predictions using PWMs to infer TF-TF regulations on a genome-wide scale for 475 transcription factors at once, in order to investigate the cell-specificity of transcription factors and well-studied regulatory sub-networks. Via an interactive web tool, these 41 DDRNs can be visually compared. Furthermore, platforms like the Network Data Exchange  framework allow users to share, upload and distribute biological networks publicly. Although these tool and platforms provide a user-friendly overview of the networks, their functionalities are limited with respect to comparative analysis and the number of available networks. Using Cytoscape  in combination with additional plug-ins, advanced network analysis and operations can be performed, but networks have to be defined and imported (manually) into Cytoscape making it infeasible to work with many and huge context-specific regulatory DDRNs. Approaches like Diffany  enable context-specific and differential network analysis and inference of networks from an arbitrary number of heterogeneous data sets. However, to the best of our knowledge, no comprehensive network repository and tool set exists currently for the cross-species and context-specific regulatory networks analysis.
With CroCo, we present a repository of pre-computed regulatory networks (the CroCo network repository) and a user-friendly tool suite (implemented as a Cytoscape plug-in and an interactive web-application) for the efficient cross-species and cross-context comparison of DDRNs from the ENCODE data sets and regulatory networks derived from the literature, curated resources, and computational predictions.
Architecture of CroCo framework
The representation of regulatory data as graphs G=(V,E) (as so called Data Derived Regulatory Networks (DDRNs)) with a common set of genes V and distinct set of regulations (edges) E allows for a uniform processing of the regulations from thousands of data sets of heterogeneous types, various experimental techniques, and from different sources. In the CroCo network repository, we maintain such a common set of genes by mapping the respective measured objects, i.e., the transcription factors and target genes, to ENSEMBL gene identifiers . The same set of genes allows simple means to combine different networks in a straightforward and easy to interpret way. Furthermore, a more challenging mapping between species is obtained from orthology mappings via ENSEMBL Compara  to allow mapping of networks across species [14, 27, 28], again with a common set of genes for the respective species (see also Fig. 1).
As thousands of different DDRNs can be extracted from ENCODE experimental data, a convenient organization is needed to allow for the flexible selection of networks of interest. The ENCODE data sets are classified with respect to various criteria defined in the associated metadata. This induces a multidimensional organization of the available compendium. We use the seven classification criteria (dimensions): Compendium × Development stage × ChIP-Factor × Experimental technique × Species × Tissue/Cell-line × Treatment. CroCo systematically exploits this structure for representing, browsing, and processing the available data sets and the associated networks in the software.
Users may start in the CroCo Cytoscape plug-in and the CroCo web-application navigating the dimensions of the compendia in any order, starting with an arbitrary dimension and continuing subsequently along any other of the remaining dimensions. This is implemented by the CroCo multidimensional browsing feature. Thereby, CroCo provides an intuitive overview of all available networks and the associated data sets in the corresponding compendia.
Moreover, we systematically employ ontologies to structure any dimension of the compendia. These ontologies are either provided by the metadata of the data compendia or are derived from additional information.
Components of the CroCo framework
The network repository (croco-repo) is the central component of the CroCo system which consists of more than 7,500 pre-computed global networks and context-specific DDRNs for human, mouse, fly, and worm together with gene annotations and ortholog mappings. The analysis of the networks in the croco-repo is supported via a client side Cytoscape plug-in (croco-cyto) and a web-application (croco-web). With croco-web we offer a web interface, which allows —without the need of installing additional client-side software — to query and compare network statistics or to look-up evidence for specific TF-TG relations using standard web browsers. For complex downstream network analyses we developed a plug-in for the bioinformatics network tool Cytoscape in order to access the croco-repo and to efficiently perform common network operations.
Thus, CroCo consists of five components:
the network repository croco-repo,
the Application Programming Interface (API) croco-api,
the Cytoscape plug-in croco-cyto,
the web application croco-web, and
the web-service for remote access to the central repository (see Fig. 2 for the interplay of the different components).
Through the combination of the publicly available web-service and the croco-api the server-side data repository can be accessed from the client-side.
We structured the croco-api into: (i) a repository query layer, (ii) a network operation layer, and (iii) the network construction workflows (used for the network definition from the raw data). The query layer provides a low level set of operations to access the croco-repo such as: list networks in the repository, read a network, and retrieve the metadata and the construction parameters for a specific network. The croco-repo can either be accessed via a direct database connection using the Structured Query Language (SQL) or via the Hypertext Transfer Protocol (HTTP). The web-service exposes the query operations to a web-server and tunnels the requests to a server side croco-api instance with direct access to the croco-repo. On top of this API, we developed croco-cyto and croco-web for network analyses.
The components have been implemented in Java in combination with MySQL for the croco-repo and the Open Source Community Edition of the ZKOSS Web Framework  for croco-web.
Regulatory network repository and network definition
In the croco-repo the following two types of networks were integrated: (Context-specific) DDRNs: networks that represent a specific state of a system derived from an experimental measurement and Global networks: networks that represent context-independent interactions of a system (e.g., interactions combined from several states and contexts).
Besides global networks and context-specific DDRNs, croco-repo contains ortholog mappings for 59 eukaryotic species from ENSEMBL Compara, and gene annotations for the organisms investigated in an ENCODE project (human, mouse, worm, and fly). The networks (G=(V,E) are represented as nodes V with ENSEMBL gene identifiers and edges E⊆V×V as directed pair of nodes. This uniform representation of the networks facilitates the comparison of species-specific networks between different contexts (inter-context) and between species (inter-species). We downloaded genomic regions with enriched aligned ChIP and open chromatin reads (hereinafter referred to as peaks) from the ENCODE project sites and followed standard procedures for the networks definition [10, 20]. We used the following network definitions to derive the edges (E) from binding site predictions and ChIP, as well as open chromatin experiments:
Transcription factor binding site (TFBS) predicted networks: We used FIMO  with Position Weight Matrices (PWM) from TRANSFAC (Version 9.3) , JASPAR (Version 2014) , UniPROBE , Wei et al. , Wang et al. , and Chen et al.  to scan for possible binding sites in the genomes of human, mouse, and worm. We set the p-value threshold in FIMO for the reported bindings to 10−5. Regulations were predicted between TF-TG if a PWM hit associated with the TF is located within ± 5,000 base pairs (bp) of the Transcription Start Sites (TSS) of the TG in human and mouse and ± 500 bp in worm. Furthermore, we constructed a high-confidence network by further filtering the binding site predictions with a p-value threshold of 10−6. In addition, conserved TFBS predictions in 12 Drosophila genomes were integrated from Kheradpour et al.  for fly.
ChIP-chip and ChIP-seq networks: ChIP peaks are provided by the ENCODE projects for human, mouse, fly, and worm for thousands of different contexts.
Regulations were inferred between the ChIP-ed protein and all TGs with peaks within ± 5,000 bp for human and mouse and 500 bp for worm and fly of their TSSs.
Open chromatin network
We integrated the 41 human pre-computed cell-specific TF-TF networks derived from Digital Genomic Footprinting (DGF) published by Neph et al. . They used DGF footprints with a length of 6–40 bp  and overlapped those footprints with predicted TRANSFAC motif-binding sites using FIMO with a p-value threshold of 10−5. Regulations were inferred between two transcription factors if an associated PWM for the first TF is found within a footprint of the second TF.
Similar to Neph et al. , we used open chromatin peaks derived from all ENCODE DGF, DNase and FAIR-seq experiments to predict regulations.
We overlaid these peaks with our TFBS predicted networks in order to infer interactions between TFs and all TGs with predicted bindings.
In addition, we integrated networks from ConReg , a resource for global regulatory networks. ConReg provides the following network types:
Literature-networks: Networks derived from the scientific literature (PubMed and PubMedCentral) using a text mining approach. First, sentences with at least two genes and a regulatory keyword like ‘regulates’ were identified.
Next, relations were predicted between all genes identified in those sentences.
In order to filter the networks and to generate more specific networks, versions of the text mining network were produced using a species filter and an approach to generate directed networks (see  for the detailed network construction workflow).
CroCo network operations
Since the croco-repo contains many and large networks, efficient network operations are crucial to perform network analyses in a user-friendly and interactive way. Thus, the croco-api provides various common network operations optimized to work on the networks from the repository. Each network operation takes one or more networks and additional parameters as input in order to produce a new network. In addition to the basic network operations Union, Intersection, and Set-Difference, the following operations are provided: Orthology Transfer: Transfers a network using orthologs from the croco-repo to another species. Gene Set Filter: Creates an induced network consisting of genes only with a particular Gene Ontology (GO) annotation, or genes from a user-defined gene set. Support Filter: Removes edges which have been observed in less than a user-defined number of times in a merged network. Binding Site Filter: Filters interactions based on the p-value of the predicted binding, or the genomic distance of the binding of the TF to the transcription start site (TSS) of the TG.
Interactive comparative network operation with croco-cyto and croco-web
croco-cyto: The networks contained in the croco-repo, together with the available network operations in the croco-api can be accessed via croco-cyto, a plug-in for the networks analysis tool Cytoscape. Result networks, for example shared conserved networks of the analogous leukemia cell-lines in mouse and human, for genes involved in a certain pathway can be produced by selecting the respective networks and applying a sequence of network operations in croco-cyto. croco-cyto uses the croco-api to retrieve the pre-computed networks, ortholog mappings, and gene descriptions from the server-side croco-repo via the publicly available web-service. Thus, networks of interest can be easily defined, e.g., shared sub-networks between cell-lines, or conserved sub-networks among different species.
croco-web: The web-service croco-web gives a first view on the networks in the croco-repo and enables several network queries, without the need of installing additional software. croco-web consists of three analysis tools:
Geneset Overlap Browser, and
The Network-Browser allows to perform cross-species comparisons of networks according to the following network metrics:
the number of interactions,
the number of genes,
the in-, out- and total-degree of specific genes, and
the overlap of interactions with a user-defined regulatory sub-network.
Any combination of networks from the croco-repo can be selected for comparisons and, if desired, the network operation union and intersect can be applied to the selected networks.
The regulatory sub-network overlap function allows to investigate the occurrence of regulatory interactions (e.g. regulations between major pluripotency factors) in different contexts. With croco-web, such regulatory sub-networks can either be manually defined, or selected from a pre-defined set of small networks from the croco-repo.
Selected networks can be organized into groups in order to support the structured comparison of sets of networks, for example, a collection of stem cell networks can be assigned to one group and a collection of T-cell networks can be assigned to another group. Furthermore, for the special case where exactly two groups are defined, a t-test between the two groups is performed according to the selected metrics.
The second main feature is the Gene-Set Overlap Browser, which allows to navigate the croco-repo according to the number of interactions between a user-defined set of genes. Finally, the Evidence-Lookup allows to investigate regulatory interactions and literature references from the croco-repo for any user-defined TF-TG pair of interest.
Results and discussion
In the following use cases, we demonstrate the application of the CroCo Cytoscape plug-in (croco-cyto) and the CroCo web-application (croco-web) to the identification of conserved and context-specific regulations, network features, and sub-networks. We summarize the number of networks in the publicly available CroCo network repository (croco-repo) and describe their organization according to ontologies which support the navigation of networks of interest. Subsequently, an evaluation to existing software solutions for the comparative cross-species and cross-context network analysis is provided.
Identification of conserved regulatory networks in human K562 and mouse MEL cell-lines
The first use case demonstrates the identification of conserved networks with croco-cyto in the Data Derived Regulatory Networks (DDRNs) from the analogous leukemia cell-lines K562 and MEL in human and mouse, respectively. We filtered the networks to genes and orthologs involved in the human KEGG  chronic myeloid leukemia pathway (KEGG pathway id: hsa05220).
For the selected gene set, we created two networks, a unified network consisting of all interactions from the K562 and MEL network (union) and a network consisting only of the interactions, which were apparent in both networks (intersection). In order to compare the networks from the different species, the networks need to be transferred to a common gene set (either to human, or to mouse gene identifiers).
Here, we transferred the mouse MEL networks to human gene identifiers as the desired gene set of the KEGG chronic myeloid leukemia pathway was defined for human. We used functionalities of croco-cyto in order to retrieve the K562 and MEL DDNRs from the croco-repo and to transfer, unify, intersect, and filter the networks.
In Fig. 3 the selected network operations in croco-cyto are shown for constructing the combined and intersected networks.
The combined K562 and MEL DDRN for the selected gene set consists of 597 regulations (see Fig. 4 a). From these 597 regulations, 52 regulations are observed in both the human and mouse DDRNs. Furthermore, for 26 of the conserved regulations further support in the scientific literature could be found (green edges in Fig. 4 b) using the human Text-Mining network as an additional resource.
Thus, via straightforward and simple network and network operation selections in croco-cyto, complex cross-species network comparisons can be performed. Since K562 and MEL are analogous cell-lines in different species, the analysis allows determining species-specific regulatory adaptions and similarities in the same context (in this example leukemia).
Interactive cross-species and cross-context network browsing
The second use case highlights some of the network comparison features of croco-web. croco-web enables a wide-range of complex and powerful on-demand cross-species and cross-context network comparisons and explorations of all networks available in the croco-repo. Thus, interactive and fast testing of biological hypotheses can be conducted. To demonstrate some of the interactive analysis features of croco-web, we:
used the network overlap feature in order to study the context-specificity of a well-curated regulatory sub-network,
compared the transcription factor activity of Transformation-Related Protein 53 (TRP53) in mouse Embryonic Stem cells (ES-cells) and T-cells, and
queried evidence for the regulation of Early Growth Response protein 1 (EGR1) and the Myc proto-oncogene protein (MYC).
Context-specific regulatory sub-networks
Similar to Neph et al. , we used a stem cell related sub-network consisting of four major pluripotency factors with 13 regulatory interactions from Kim et al.  and overlapped this sub-network with DDRNs extracted from mouse T- and ES-cells in order to compute the fraction of common interactions in the sub-network and the selected DDRNs. We selected the corresponding networks from the Network-Browser in croco-web and used the network overlap metric to compare the networks according to their overlap with the stem cell sub-network. The results of this comparison plotted by croco-web show that the cells cluster, as expected, according to their biological classification (see Fig. 5). That is, a significant overlap with the regulatory sub-network for the pluripotency factors is observed only in the embryonic stem cells. Thus, the potential regulatory function of this sub-network appears to be inactive in T-cells. Such an overlap functionality allows to distinguish between context-specific regulations (i.e. regulations which are only active in certain cell-lines and conditions) and unspecific regulations.
Transcription factor activity of transformation-related protein 53 (TRP53) in mouse ES- and T-cells:
P53 is one of the best characterized genes, due to its role in apoptosis, genome stability and its frequent mutation in human cancer . Recent studies indicate that P53 also plays an important role in stem cells and has a high abundance in ES-cells [43–45]. To further analyze this, we again used the Network-Browser in croco-web to compare the number of target genes of TRP53 (the mouse ortholog of human P53) in mouse Embryonic Stem cells (ES-cells) and T-cells (see Fig. 6). As the networks have different sizes, the number of target genes for a specific transcription factor was normalized by the network size. This analysis highlights that, in general, a significantly larger number of regulations of TRP53 is observed in ES-cells than in T-cells.
A detailed analysis of the target genes of TP53 in these DDRNs may help to further characterize the function and role of TP53 in ES-cells.
Query of regulatory evidence for TF-TGs
The Evidence-Lookup tool in croco-web allows to query regulatory relations between a transcription factor and target gene. We employed the Evidence-Lookup tool as a use case to query information of the regulation of Early Growth Response protein 1 (EGR1) and the Myc proto-oncogene protein (MYC) from the croco-repo. For that particular example, several binding site predictions and experimental ChIP bindings of EGR1 were found within the promoter region of MYC (see Fig. 7). Furthermore, several open chromatin regions overlap with the experimental ChIP peaks and the predicted TFBS bindings. In addition, literature references describe regulatory mechanisms between the selected TF and TG.
Thus, there is evidence for this particular regulation in different contexts and the regulation is also well described in the scientific literature.
croco-repo: a comprehensive regulatory network repository
The previously demonstrated use cases can be conducted with croco-cyto and croco-web for all networks in the croco-repo. In the following, we summarize the available networks and their organization in the croco-repo.
croco-repo contains context- and species-specific DDRNs for every ChIP and open chromatin ENCODE data set. In addition, 81 global networks are integrated in the repository, yielding 7,546 networks in total.
The availability of experimental data differs across the considered species in the croco-repo. For instance, open chromatin derived networks are only available for human and mouse, whereas literature derived and binding site derived networks are available for all species. For human, 4,843 networks are contained in the repository: 4 networks from curated databases, 12 networks predicted with different PWM collection sets, 4 literature derived networks, 1,206 ChIP-derived factor-specific networks for 103 different cell-lines, and 3,617 open chromatin derived networks. ChIP experiments have been conducted for different conditions and with different antibodies. For example, for the human cell K562, the binding of 116 different TF have been measured, whereas for the human cell-line WI-38 only the binding of CTCF was investigated. The different human open chromatin networks stem from 207 DNase-seq, 54 DGF, 37 FAIR-seq experiments in combination with the 12 different binding site predicted networks and the 41 networks from Neph et al. , i.e. (207+54+37) experiments × 12 binding site networks + 41 networks = 3,617 different open chromatin networks. In Table 1, we summarize the available networks for the considered species in the croco-repo.
In order to enable flexible navigation and selection of networks from the croco-repo in croco-cyto and croco-web, we organized the networks according to various dimensions (annotations). We used the following seven dimensions: Compendium: ENCODE, modENCODE, mouseENCODE Species: Human, Mouse, Fly, Worm ENCODE gene name: Transcription factors with corresponding ENCODE ChIP-seq experiments, e.g., CTCF, GATA1, MEF2A Development stage: e.g., embroys, adult-male, adult-female Treatment: e.g., insulin, IFNa30, estrogen Experimental technique: Chromatin Immunoprecipitation (ChIP), Database (curated resources), Open Chromatin, Text-Mining, Transcription Factor Binding Site (TFBS) prediction Tissue/Cell-line: e.g., K562, H7-hESC, ES-E14.
Each dimension can be further structured according to simple value lists, or according to specific ontologies. Networks are assigned to node(s) in the dimension-specific value lists and ontologies based on their meta-information. For example, a DDRN from a human ChIP-seq experiment performed by ENCODE for CTCF in K562 cells is assigned to: ENCODE in the Compendium dimension, Human in the Species dimension, CTCF in the ENCODE gene name dimension, ChIP-seq in the Experimental technique dimension, and K562 in the Tissue/Cell-line dimension. Note, however, that not all networks can be categorized and organized according to all dimensions. We build a meta-ontology, which includes the seven dimensions including their categorizations. In the components of the CroCo system, users can start at any of our specified dimensions, e.g. Species, and browse for attributes of interest in the corresponding value list and ontologies (see Fig. 8). As soon as a leaf node within the specific dimension is reached (e.g., Human for the Species dimension) the user can select a further dimension to browse the remaining networks according to the not yet selected dimensions (only those dimensions are shown, which further separate the remaining networks). That way it is possible to first select a species and then to select an experimental technique, or vice versa.
The different components of CroCo have been designed to: (i) support the identification of possible conservations as well as differences between networks from different species and from different cell-lines, (ii) provide a uniform collection of networks for several model organisms, and (iii) allow for the straightforward navigation through thousands of derived networks. CroCo has a particular focus on the ENCODE projects and currently contains all ENCODE regulatory data sets. Use cases for CroCo range from the general comparison of network properties to the validation of hypotheses such as the identification of sub-networks that are conserved or unique in a set of cell-lines or species.
CroCo supports the following key functionalities, which we identified to perform the outlined use cases: ENCODE data: Availability of ENCODE context-specific DDRNs in a unified format, and structured according to their metadata via appropriate ontologies. Comparative analysis: Fast and efficient network operations on many and large networks for the collection of ENCODE networks to enable comparative network analysis. Downstream analysis: Downstream network analysis such as network clustering or significant sub-network identification on DDRNs from different ENCODE data sets. Network property look-up: Comparative network property analysis in order to get a first overview on the networks. Additional information: Query of additional evidence for a particular regulatory relations, e.g. query literature evidence for a specific regulation.
Databases like the NCBI SRA, the GEO database , modMine , and the ENCODE Genome browser  provide only the raw data, thus making computational and labor intensive work necessary in order to collect the data and derive networks from them.
In the supplement of the Neph et al.  publication, an interactive web application is presented, which allows to visually compare their derived regulatory networks for the 41 human cell-lines. Although the interactive web site gives a convenient overview of the networks, its functionality is limited with respect to comparative analysis and the number of available networks.
Cytoscape, in combination with additional plug-ins, comes close to implementing the needed functionalities provided by CroCo. With the Cytoscape Advanced Network Merge and ID Mapping option, networks can be intersected, set-differenced and merged. Available plug-ins allow to perform various downstream analyses including ortholog network transfers using the Homecat plug-in  and literature queries for protein interactions using the Agilent Literature Search plug-in. Via the Network Analysis feature, network properties can be visualized. Finally, the Diffany  Cytoscape plug-in enables different network analysis and inference. Nevertheless, there are shortcomings, which limit the usability of currently available Cytoscape plugins for the analysis of many and large DDRNs. For example, each network needs to be loaded into Cytoscape before network operations can be performed, and thus network operations on a large number of networks require tedious and error-prone manual work. Furthermore, since no comprehensive ENCODE network repository exists, DDRNs need to be manually created, which requires knowledge of the ENCODE data structure and resource intensive computation in order to derive networks.
With CroCo, we provide a collection of pre-computed networks (croco-repo) derived from ENCODE (ENCODE data) and further external databases organized in an easy to navigate multidimensional network ontology. Network operations, including the transfer of many and large networks from the croco-repo, can be processed at once. This can be done via croco-cyto, which makes the repository and the CroCo API functionality available within Cytoscape.
Also, several network queries, like the comparison of network features and the query of regulatory evidence for user-specified TF-TG pairs, can be performed directly via croco-web (Network property look-up; Additional information).
The raw data processing workflow, the choice of thresholds, and the used data processing tools have impacts on the network model.
Currently, the networks in the croco-repo are pre-processed, which allow, on the one hand, a fast retrieval of networks, but, on the other hand, limits the network (re-)definition.
In order to provide a higher flexibility, the network construction workflows are included in the croco-api. These workflows allow the generation of entirely new networks with desired parameters and input data. Additionally, interactions can be filtered using different criteria like the PWM p-value threshold, and the distance to a TSS.
The CroCo system implies several avenues for further research: possible extensions are the integration of networks from further sources, e.g., DDRNs from the TCGA  and Roadmap Epigenomics , from protein-protein interaction networks, and from proteomics data. Other extensions involve the development of approaches for fast, flexible, and resource efficient redefinition of the networks included in the croco-repo. This would enable the flexible variation of parameters defining the edges and allow for the analysis of the implied changes in the resulting networks.
The ENCODE projects (ENCODE, mouseENCODE, modENCODE) and other large-scale compendia, such as TCGA and Epigenomics roadmap provide genome-wide annotations for hundreds of cell-lines, tissues, and treatments using standardized experimental protocols for the model organisms human, mouse, fly, and worm.
Determining the similarities and differences between different species and cell-lines can improve our understanding of the evolution and conservation of regulatory mechanisms and the interpretation of observed data . However, due to the large amount of data, the use and the systematic analysis of these data compendia is not straightforward as a systematic cross-species and cross-condition comparative analysis requires a lot of cumbersome work as well as local storage to download and process the data. The CroCo network repository provides a network-oriented view on all regulatory data sets from the ENCODE projects structured according to ontologies derived from the metadata. Thereby, CroCo provides intuitive views on the data and allows for the investigation of regulatory mechanisms. Via the CroCo tool suite, convenient browsing and aggregating of networks along the ontology structures is enabled.
Network representations can be employed for intuitive abstracted views on the data and the investigation of regulatory mechanisms. Available network tools, such as Cytoscape, contain a wide range of interesting network analysis functionalities. In order to use them in combination with ENCODE data, time-consuming manual work is required for downloading and pre-processing the data as well as extracting the networks. With the CroCo system, systematic analysis of networks is supported via a network repository containing thousands of global and context-specific data derived regulatory networks.
The modular design of CroCo and the wide range of query operations available via the web application (croco-web), via a Cytoscape plug-in (croco-cyto), and via an Application Programming Interface (croco-api) provide access to networks and analysis tools to a broad user community. Finally, CroCo features the data cube paradigm characterized by: (i) the convenient multidimensional navigation in any order of the dimensions, (ii) the ontology-supported browsing of the data cube dimensions, (iii) combination, and (iv) comparison of networks, including cross-species transfer of regulatory network models.
Availability and requirements
Project name: CroCoProject home page: http://services.bio.ifi.lmu.de/croco-web Operating system(s): Platform independentProgramming language: JavaOther requirements: Java 1.8 or higher, and Cytoscape 3.0.1 or higher for croco-cyto License: GNU Lesser General Public License (LGPL) v3.0.
ENCODE Project Consortium. An integrated encyclopedia of dna elements in the human genome. Nature. 2012; 489(7414):57–74.
Mouse ENCODE Consortium. An encyclopedia of mouse DNA elements (Mouse ENCODE). Genome Biol. 2012; 13(8):418.
Celniker SE, Dillon LAL, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, Micklem G, Piano F, Snyder M, Stein L, White KP, Waterston RH, modENCODE Consortium. Unlocking the secrets of the genome. Nature. 2009; 459(7249):927–30.
Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008; 455(7216):1061–8.
Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, Ziller MJ, Amin V, Whitaker JW, Schultz MD, Ward LD, Sarkar A, Quon G, Sandstrom RS, Eaton ML, Wu YC, Pfenning AR, Wang X, Claussnitzer M, Liu Y, Coarfa C, Harris RA, Shoresh N, Epstein CB, Gjoneska E, Leung D, Xie W, Hawkins RD, Lister R, Hong C, Gascard P, Mungall AJ, Moore R, Chuah E, Tam A, Canfield TK, Hansen RS, Kaul R, Sabo PJ, Bansal MS, Carles A, Dixon JR, Farh KH, Feizi S, Karlic R, Kim AR, Kulkarni A, Li D, Lowdon R, Elliott G, Mercer TR, Neph SJ, Onuchic V, Polak P, Rajagopal N, Ray P, Sallari RC, Siebenthall KT, Sinnott-Armstrong NA, Stevens M, Thurman RE, Wu J, Zhang B, Zhou X, Beaudet AE, Boyer LA, De Jager PL, Farnham PJ, Fisher SJ, Haussler D, Jones SJM, Li W, Marra MA, McManus MT, Sunyaev S, Thomson JA, Tlsty TD, Tsai LH, Wang W, Waterland RA, Zhang MQ, Chadwick LH, Bernstein BE, Costello JF, Ecker JR, Hirst M, Meissner A, Milosavljevic A, Ren B, Stamatoyannopoulos JA, Wang T, Kellis M. Integrative analysis of 111 reference human epigenomes. Nature. 2015; 518(7539):317–30.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at ucsc. Genome Res. 2002; 12(6):996–1006.
Contrino S, Smith RN, Butano D, Carr A, Hu F, Lyne R, Rutherford K, Kalderimis A, Sullivan J, Carbon S, Kephart ET, Lloyd P, Stinson EO, Washington NL, Perry MD, Ruzanov P, Zha Z, Lewis SE, Stein LD, Micklem G. modMine: flexible access to modENCODE data. Nucleic Acids Res. 2012; 40(Database issue):1082–8.
Rosenbloom KR, Dreszer TR, Long JC, Malladi VS, Sloan CA, Raney BJ, Cline MS, Karolchik D, Barber GP, Clawson H, Diekhans M, Fujita PA, Goldman M, Gravell RC, Harte RA, Hinrichs AS, Kirkup VM, Kuhn RM, Learned K, Maddren M, Meyer LR, Pohl A, Rhead B, Wong MC, Zweig AS, Haussler D, Kent WJ. ENCODE whole-genome data in the UCSC Genome Browser: update 2012. Nucleic Acids Res. 2012; 40(Database issue):912–7.
Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol. 2008; 9(10):770–80.
Neph S, Stergachis AB, Reynolds A, Sandstrom R, Borenstein E, Stamatoyannopoulos JA. Circuitry and dynamics of human transcription factor regulatory networks. Cell. 2012; 150(6):1274–86.
Küffner R, Fundel K, Zimmer R. Expert knowledge without the expert: integrated analysis of gene expression and literature to derive active functional contexts. Bioinformatics. 2005; 21 Suppl 2:259–67.
Van Landeghem S, De Bodt S, Drebert ZJ, Inzé D, Van de Peer Y. The potential of text mining in data integration and network biology for plant research: a case study on arabidopsis. Plant Cell. 2013; 25(3):794–807.
Faro A, Giordano D, Spampinato C. Combining literature text mining with microarray data: advances for system biology modeling. Brief Bioinform. 2012; 13(1):61–82.
Pesch R, Böck M, Zimmer R. ConReg: Analysis and Visualization of Conserved Regulatory Networks in Eukaryotes In: Böcker S, Hufsky F, Scheubert K, Schleicher J, Schuster S, editors. German Conference on Bioinformatics 2012. OpenAccess Series in Informatics (OASIcs). Dagstuhl: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik: 2012. p. 69–81.
Ideker T, Krogan NJ. Differential network biology. Mol Syst Biol. 2012; 8:565.
Cheng Y, Ma Z, Kim BH, Wu W, Cayting P, Boyle AP, Sundaram V, Xing X, Dogan N, Li J, Euskirchen G, Lin S, Lin Y, Visel A, Kawli T, Yang X, Patacsil D, Keller CA, Giardine B, The Mouse ENCODE Consortium, Kundaje A, Wang T, Pennacchio LA, Weng Z, Hardison RC, Snyder MP. Principles of regulatory information conservation between mouse and human. Nature. 2014; 515(7527):371–5.
Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, Min R, Alves P, Abyzov A, Addleman N, Bhardwaj N, Boyle AP, Cayting P, Charos A, Chen DZ, Cheng Y, Clarke D, Eastman C, Euskirchen G, Frietze S, Fu Y, Gertz J, Grubert F, Harmanci A, Jain P, Kasowski M, Lacroute P, Leng J, Lian J, Monahan H, O’Geen H, Ouyang Z, Partridge EC, Patacsil D, Pauli F, Raha D, Ramirez L, Reddy TE, Reed B, Shi M, Slifer T, Wang J, Wu L, Yang X, Yip KY, Zilberman-Schapira G, Batzoglou S, Sidow A, Farnham PJ, Myers RM, Weissman SM, Snyder M. Architecture of the human regulatory network derived from encode data. Nature. 2012; 489(7414):91–100.
Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, Garg K, John S, Sandstrom R, Bates D, Boatman L, Canfield TK, Diegel M, Dunn D, Ebersol AK, Frum T, Giste E, Johnson AK, Johnson EM, Kutyavin T, Lajoie B, Lee BK, Lee K, London D, Lotakis D, Neph S, Neri F, Nguyen ED, Qu H, Reynolds AP, Roach V, Safi A, Sanchez ME, Sanyal A, Shafer A, Simon JM, Song L, Vong S, Weaver M, Yan Y, Zhang Z, Zhang Z, Lenhard B, Tewari M, Dorschner MO, Hansen RS, Navas PA, Stamatoyannopoulos G, Iyer VR, Lieb JD, Sunyaev SR, Akey JM, Sabo PJ, Kaul R, Furey TS, Dekker J, Crawford GE, Stamatoyannopoulos JA. The accessible chromatin landscape of the human genome. Nature. 2012; 489(7414):75–82.
Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, Chasman DI, FitzGerald GA, Dolinski K, Grosser T, Troyanskaya OG. Understanding multicellular function and disease with human tissue-specific networks. Nat Genet. 2015; 47(6):569–76.
Kim J, Chu J, Shen X, Wang J, Orkin SH. An extended transcriptional network for pluripotency of embryonic stem cells. Cell. 2008; 132(6):1049–1061.
Hesselberth JR, Chen X, Zhang Z, Sabo PJ, Sandstrom R, Reynolds AP, Thurman RE, Neph S, Kuehn MS, Noble WS, Fields S, Stamatoyannopoulos JA. Global mapping of protein-dna interactions in vivo by digital genomic footprinting. Nat Methods. 2009; 6(4):283–9.
Pratt D, Chen J, Welker D, Rivas R, Pillich R, Rynkov V, Ono K, Miello C, Hicks L, Szalma S, Stojmirovic A, Dobrin R, Braxenthaler M, Kuentzer J, Demchak B, Ideker T. Ndex, the network data exchange. Cell Syst. 2015; 1(4):302–5.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13(11):2498–504.
Landeghem SV, Parys TV, Dubois M, Inzé D, de Peer YV. Diffany: an ontology-driven framework to infer, visualise and analyse differential molecular networks. BMC Bioinforma. 2016; 17(1):18.
Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Girón CG, Gordon L, Hourlier T, Hunt S, Johnson N, Juettemann T, Kähäri AK, Keenan S, Kulesha E, Martin FJ, Maurel T, McLaren WM, Murphy DN, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ruffier M, Sheppard D, Taylor K, Thormann A, Trevanion SJ, Vullo A, Wilder SP, Wilson M, Zadissa A, Aken BL, Birney E, Cunningham F, Harrow J, Herrero J, Hubbard TJP, Kinsella R, Muffato M, Parker A, Spudich G, Yates A, Zerbino DR, Searle SMJ. Ensembl 2014. Nucleic Acids Res. 2014; 42(Database issue):749–55.
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. Ensemblcompara genetrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009; 19(2):327–35.
Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Res. 2001; 11(12):2120–126.
Pesch R, Zimmer R. Complementing the eukaryotic protein interactome. PLoS ONE. 2013; 8(6):66635.
Potix Corporation. ZK: Leading Enterprise Java Web Framework. http://www.zkoss.org. Accessed 10 Aug 2016.
Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011; 27(7):1017–1018.
Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006; 34(Database issue):108–10.
Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, Buchman S, Chen C-y, Chou A, Ienasescu H, Lim J, Shyr C, Tan G, Zhou M, Lenhard B, Sandelin A, Wasserman WW. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 2014; 42(Database issue):142–7.
Robasky K, Bulyk ML. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-dna interactions. Nucleic Acids Res. 2011; 39(Database issue):124–8.
Wei GH, Badis G, Berger MF, Kivioja T, Palin K, Enge M, Bonke M, Jolma A, Varjosalo M, Gehrke AR, Yan J, Talukder S, Turunen M, Taipale M, Stunnenberg HG, Ukkonen E, Hughes TR, Bulyk ML, Taipale J. Genome-wide analysis of ets-family dna-binding in vitro and in vivo. EMBO J. 2010; 29(13):2147–160.
Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, Rando OJ, Birney E, Myers RM, Noble WS, Snyder M, Weng Z. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012; 22(9):1798–1812.
Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008; 133(6):1106–1117.
Kheradpour P, Stark A, Roy S, Kellis M. Reliable prediction of regulator targets using 12 drosophila genomes. Genome Res. 2007; 17(12):1919–1931.
Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, Thurman RE, John S, Sandstrom R, Johnson AK, Maurano MT, Humbert R, Rynes E, Wang H, Vong S, Lee K, Bates D, Diegel M, Roach V, Dunn D, Neri J, Schafer A, Hansen RS, Kutyavin T, Giste E, Weaver M, Canfield T, Sabo P, Zhang M, Balasundaram G, Byron R, MacCoss MJ, Akey JM, Bender MA, Groudine M, Kaul R, Stamatoyannopoulos JA. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature. 2012; 489(7414):83–90.
Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, Aerts S, Mahony S, Sleumer MC, Bilenky M, Haeussler M, Griffith M, Gallo SM, Giardine B, Hooghe B, Van Loo P, Blanco E, Ticoll A, Lithwick S, Portales-Casamar E, Donaldson IJ, Robertson G, Wadelius C, De Bleser P, Vlieghe D, Halfon MS, Wasserman W, Hardison R, Bergman CM, Jones SJM, The Open Regulatory Annotation Consortium. Oreganno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res. 2008; 36(Database issue):107–13.
Gallo SM, Gerrard DT, Miner D, Simich M, Des Soye B, Bergman CM, Halfon MS. REDfly v3.0: toward a comprehensive database of transcriptional regulatory elements in drosophila. Nucleic Acids Res. 2011; 39(Database issue):118–23.
Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. Kegg as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016; 44(D1):457–62.
Surget S, Khoury MP, Bourdon JC. Uncovering the role of p53 splice variants in human malignancy: a clinical perspective. Onco Targets Ther. 2013; 7:57–68.
Solozobova V, Blattner C. p53 in stem cells. World J Biol Chem. 2011; 2(9):202–14.
Solozobova V, Blattner C. Regulation of p53 in embryonic stem cells. Exp Cell Res. 2010; 316(15):2434–446.
Bonizzi G, Cicalese A, Insinga A, Pelicci PG. The emerging role of p53 in stem cells. Trends Mol Med. 2012; 18(1):6–12.
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2008; 36(Database issue):13–21.
Meyer LR, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, Wong M, Sloan CA, Rosenbloom KR, Roe G, Rhead B, Raney BJ, Pohl A, Malladi VS, Li CH, Lee BT, Learned K, Kirkup V, Hsu F, Heitner S, Harte RA, Haeussler M, Guruvadoo L, Goldman M, Giardine BM, Fujita PA, Dreszer TR, Diekhans M, Cline MS, Clawson H, Barber GP, Haussler D, Kent WJ. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res. 2013; 41(Database issue):64–9.
Zorzan S, Lorenzetto E, Ettorre M, Pontelli V, Laudanna C, Buffelli M. Homecat: consensus homologs mapping for interspecific knowledge transfer and functional genomic data integration. Bioinformatics. 2013; 29(12):1574–1576.
Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, Farnham PJ, Hirst M, Lander ES, Mikkelsen TS, Thomson JA. The NIH Roadmap Epigenomics Mapping Consortium,. Nat Biotechnol. 2010; 28(10):1045–1048.
We would like to thank Madox Sesen for prototyping the network motif overlap functionality.
Part of this work has been funded by DFG graduate school RECESS and the DFG CRC 1035 Conformational Switches.
RP and RZ conceived and designed the software system. RP implemented the system. RP and RZ wrote the manuscript. Both authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
About this article
Cite this article
Pesch, R., Zimmer, R. Cross-species Conservation of context-specific networks. BMC Syst Biol 10, 76 (2016). https://doi.org/10.1186/s12918-016-0304-1