Summarizing cellular responses as biological process networks

Lasher, Christopher D; Rajagopalan, Padmavathy; Murali, T M

doi:10.1186/1752-0509-7-68

Methodology article
Open access
Published: 29 July 2013

Christopher D Lasher¹,
Padmavathy Rajagopalan²,³ &
T M Murali³,⁴

BMC Systems Biology volume 7, Article number: 68 (2013)

Abstract

Background

Microarray experiments can simultaneously identify thousands of genes that show significant perturbation in expression between two experimental conditions. Response networks, computed through the integration of gene interaction networks with expression perturbation data, may themselves contain tens of thousands of interactions. Gene set enrichment has become standard for summarizing the results of these analyses in terms of functionally coherent collections of genes such as biological processes. However, even these methods can yield hundreds of enriched functions that may overlap considerably.

Results

We describe a new technique called Markov chain Monte Carlo Biological Process Networks (MCMC-BPN) capable of reporting a highly non-redundant set of links between processes that describe the molecular interactions that are perturbed under a specific biological context. Each link in the BPN represents the perturbed interactions that serve as the interfaces between the two processes connected by the link.

We apply MCMC-BPN to publicly available liver-related datasets to demonstrate that the networks formed by the most probable inter-process links reported by MCMC-BPN show high relevance to each biological condition. We also demonstrate MCMC-BPN’s ability to discern the few key links in a very large solution space by comparing its results with those of two other methods for detecting inter-process links.

Conclusions

MCMC-BPN is successful in using few inter-process links to explain as many of the perturbed gene-gene interactions as possible. Thereby, BPNs summarize the important biological trends within a response network by reporting a digestible number of inter-process links that can be explored in greater detail.

Background

Motivation

The deluge of publicly available molecular biology data, including genome-wide gene expression measurements [1, 2] and gene and protein interaction networks [3, 4], has necessitated the development of computational methods that produce comprehensible views of large numbers of biological molecules and their connections. Reporting perturbation in gene expression on the basis of individual genes [5, 6] (of which there may be thousands) has given way to more holistic techniques, referred to as functional enrichment, that instead report the significance of the collective perturbation of processes, i.e., sets of biologically related genes (e.g., [7–11]). Results from these analyses reveal important trends that large lists of genes can obscure.

Recent work in functional enrichment of gene expression data [9, 11] has emphasized finding concise, non-redundant sets of processes that account for much of the overall perturbation among the genes. Methods by Lu et al.[9] and Bauer, Gagneur, and Robinson [11] assume that a cell or tissue perturbs certain biological processes in response to a change in internal or external conditions. In their models, perturbed processes cause the perturbation of the expression of individual genes belonging to each of those processes. They consider the actual measurements of perturbation of the individual genes, as assessed by DNA microarrays, for example, as noisy observations of signals generated by perturbation of specific processes by cells in response to a stimulus. Both methods use generative models to assess the goodness of fit of a set of candidate perturbed processes to the observed gene perturbations, while differing in their precise formulations. The two methods use standard algorithms (greedy [9] and Markov chain Monte Carlo (MCMC) [11]) to find the set of processes with the greatest fit to the observed data.

Products of genes do not act independently but rather in concert with products of other genes through numerous interactions. In a similar vein, biological processes, composed of genes, may themselves interact. Accordingly, researchers have begun developing methods to identify connections between processes based on the underlying gene interaction networks. A method by Li et al.[12] computed the “crosstalk” between processes by counting the number of interactions that occur among the genes of each process and assessing the significance of this number against empirical distributions. Dotan-Cohen et al. published a more direct method [13] which uses Fisher’s Exact Test to determine if one process is linked to another, i.e., if genes in the first process have significantly more interactions with genes in the second process than would be expected by chance. Wang et al.[14] published a method that calculates what they call “functional similarity” between two processes using the sum of the distances between all pairs of genes belonging to those processes.

While the previous methods represent advances in finding high-level connections between processes, they do not incorporate information which could lead to discovering which connections have relevance under specific biological contexts. Motivated by this methodological gap, in earlier work [15] we extended the method of Dotan-Cohen et al.[13] by integrating gene expression data with gene-gene interactions to compute what we termed “Contextual Biological Process Linkage Networks” (CBPLNs). A link in a CBPLN indicates not only that the genes of two processes have a significant number of interactions among them, but that genes at the interface exhibit a large amount of perturbation in expression. Thus, it became possible to infer the inter-process connections relevant to a cell or tissue’s response to an internal or external stimulus.

The CBPLN method has several aspects that need improvement. First, because it must build empirical distributions to determine the significance of each link, it becomes prohibitively computationally expensive as the number of links to test grows. Second, the method reports all significant links. Since it makes no distinction among two or more links that are found to be significant on account of nearly identical sets of gene-gene interactions, it may output many redundant significant links. This latter drawback is common to all methods that compute inter-process links, and also to most techniques for functional enrichment.

Here we present a new method that simultaneously addresses the shortcomings of earlier methods. Our method takes inspiration from the methods for functional enrichment reported by Lu et al.[9] and Bauer et al.[11]. We assume that links between biological processes become perturbed during the response of a cell or tissue to some stimulus, and that this perturbation of inter-process links propagates via the individual gene-gene interactions between genes belonging to the different processes. We cannot directly observe perturbation of the links between the processes; instead, our method considers the perturbation of genes participating in the interfacing interactions of processes as noisy observations generated from the perturbed inter-process links. Our method infers a non-redundant set of processes and their perturbed links, which we call a Biological Process Network (BPN), from the interactions between the observed perturbed genes. We compute the likelihood of candidate BPNs in terms of parameters accounting for the noisiness in the observed states of the gene-gene interactions. Using Markov chain Monte Carlo (MCMC), we identify BPNs of high likelihood. We label this new method “MCMC Biological Process Networks” (MCMC-BPN). BPNs thus computed summarize the important biological trends within a response network by reporting to the user a digestible number of inter-process links that can be explored in greater detail.

Overview of the method

MCMC-BPN aims to explain as many interactions between genes with perturbed expression as possible using as few inter-process links as possible. By including a link between a pair of processes in the BPN, we say that link “explains” the interactions cross-annotated by that pair of processes. Our objective is to reward the inclusion of links in the BPN that explain many interactions between perturbed genes not already explained by other links in the BPN. At the same time, we aim to penalize including more links in the BPN than necessary, including links that mostly explain unperturbed interactions, and leaving a large number of perturbed interactions unexplained. To this end, we define a likelihood function as the product of several Bernoulli distributions, controlled by three parameters used to adjust for the amount of “noise” in the observed perturbation of the cross-annotated links.

The first of these parameters, the link prior λ, serves to reduce the number of links in a BPN, for when λ is low, having few links increases the overall likelihood. A low value for the second parameter, the false-positive rate α, encourages BPNs that explain many perturbed interactions. Finally, when the parameter β, which represents the false-negative rate, has a low value, it encourages BPNs that explain few unperturbed interactions. Note that modifying the BPN in a way that increases the contribution of one parameter to the likelihood may be offset, or even outweighed, by a decrease in the contribution of another parameter. For example, including every possible link in a BPN will have a favorable likelihood contribution via α, since it necessarily explains all perturbed interactions. Such an inclusive BPN will, however, lead to very poor contributions via λ (many links are included in the BPN) and β (many unperturbed interactions will also be explained). Thus, such a BPN will have a very low overall likelihood. A desirable BPN must balance the tension among all three parameters (neither including too many links, nor explaining too few perturbed interactions, nor explaining too many unperturbed interactions) in order to maximize the overall likelihood.
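The trade-off among λ, α, and β can be made concrete with a small sketch. The code below is not the authors' implementation. It assumes one plausible reading of the description above: every possible link is included with prior probability λ, an interaction left unexplained is perturbed with probability α, and an explained interaction is unperturbed with probability β. The function name, argument names, and toy data are all hypothetical.

```python
import math


def log_likelihood(selected, link_to_interactions, perturbed, lam, alpha, beta):
    """Log-likelihood of a candidate BPN under an ASSUMED Bernoulli noise model.

    selected: set of links included in the candidate BPN
    link_to_interactions: dict mapping every possible link to the set of
        interactions it cross-annotates
    perturbed: set of interactions observed as perturbed
    lam, alpha, beta: link prior, false-positive rate, false-negative rate,
        each strictly between 0 and 1
    """
    all_interactions = set().union(*link_to_interactions.values())
    explained = set()
    for link in selected:
        explained |= link_to_interactions[link]
    unexplained = all_interactions - explained

    n_selected = len(selected)
    n_unselected = len(link_to_interactions) - n_selected
    return (
        n_selected * math.log(lam)                              # links included
        + n_unselected * math.log(1 - lam)                      # links left out
        + len(explained & perturbed) * math.log(1 - beta)       # explained, perturbed
        + len(explained - perturbed) * math.log(beta)           # explained, unperturbed
        + len(unexplained & perturbed) * math.log(alpha)        # missed perturbed
        + len(unexplained - perturbed) * math.log(1 - alpha)    # correctly ignored
    )


# Toy data (hypothetical): three possible links and six interactions.
toy_links = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5, 6}}
toy_perturbed = {1, 2, 3, 4}
params = dict(lam=0.1, alpha=0.2, beta=0.2)

ll_none = log_likelihood(set(), toy_links, toy_perturbed, **params)
ll_a = log_likelihood({"A"}, toy_links, toy_perturbed, **params)
ll_ab = log_likelihood({"A", "B"}, toy_links, toy_perturbed, **params)
ll_all = log_likelihood(set(toy_links), toy_links, toy_perturbed, **params)
```

Of the four candidates scored here, the BPN with every link scores lowest, and link A alone scores higher than A and B together. Under these assumed parameter values, B adds only one newly explained perturbed interaction, which does not pay for the extra link.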

While the likelihood function provides a means of scoring the quality of a given BPN, for any given data set, there exist 2^|L| possible BPNs, where L is the set of all possible links between pairs of processes. To search this potentially enormous solution space, we use the Metropolis-Hastings algorithm for Markov chain Monte Carlo (MCMC) [16]. Each state in the Markov chain represents a particular set of values for the parameters λ, α, and β, as well as a particular configuration of inter-process links. The neighbors of a state are those which have one additional or one fewer link, or which have one parameter with a different value. The parameter values and links which contribute to BPNs with high likelihoods will tend to remain consistent from one visited state to the next. Thus, we report the final BPN as the links that appear most frequently throughout the states visited during the MCMC.
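The search over link configurations can be illustrated with a simplified sampler. This is a sketch under stated assumptions, not the authors' algorithm. It holds λ, α, and β fixed, whereas the method described above also moves through parameter values. It uses a symmetric "toggle one link" proposal, so the Metropolis-Hastings acceptance ratio reduces to the likelihood ratio. It also reuses an assumed Bernoulli scoring function. All names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def log_score(selected, link_to_interactions, perturbed, lam, alpha, beta):
    """ASSUMED Bernoulli log-likelihood of a set of links."""
    explained = set()
    for link in selected:
        explained |= link_to_interactions[link]
    unexplained = set().union(*link_to_interactions.values()) - explained
    return (
        len(selected) * math.log(lam)
        + (len(link_to_interactions) - len(selected)) * math.log(1 - lam)
        + len(explained & perturbed) * math.log(1 - beta)
        + len(explained - perturbed) * math.log(beta)
        + len(unexplained & perturbed) * math.log(alpha)
        + len(unexplained - perturbed) * math.log(1 - alpha)
    )


def link_frequencies(link_to_interactions, perturbed, steps=20000, burn_in=2000,
                     lam=0.1, alpha=0.2, beta=0.2, seed=0):
    """Sample link sets and return the fraction of kept states containing each link."""
    rng = random.Random(seed)
    links = sorted(link_to_interactions)
    state = set()
    current = log_score(state, link_to_interactions, perturbed, lam, alpha, beta)
    counts = Counter()
    for step in range(steps):
        # Symmetric proposal: add or remove one randomly chosen link.
        proposal = state ^ {rng.choice(links)}
        proposed = log_score(proposal, link_to_interactions, perturbed,
                             lam, alpha, beta)
        # Accept with probability min(1, likelihood ratio).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            state, current = proposal, proposed
        if step >= burn_in:
            counts.update(state)
    kept = steps - burn_in
    return {link: counts[link] / kept for link in links}


# Toy data (hypothetical): link C explains only unperturbed interactions.
toy_links = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5, 6}}
toy_perturbed = {1, 2, 3, 4}
freq = link_frequencies(toy_links, toy_perturbed)
```

Links that appear in a large fraction of the visited states would form the reported BPN. In this toy example, link A should be visited far more often than link C.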

Application

We applied MCMC-BPN to three treatment-control experiments relating to the liver and liver disease. In the first application, we compared gene expression of rat hepatocytes in two common in vitro culture systems [17]: hepatocyte monolayer (HM) and collagen sandwich (CS). The remaining two experiments contrast gene expression levels from liver tissue samples from dozens of human patients diagnosed with hepatitis C virus (HCV)-induced cirrhosis and hepatocellular carcinoma (HCC) with expression levels of samples from healthy patients [18]. Approximately 170 million people worldwide suffer from HCV infection [19]. HCC ranks third among the deadliest cancers worldwide, and HCV is among its leading causes [20]. Below, we present and discuss the BPNs computed to summarize the major trends of differential expression of each of these three data sets. We found that the BPNs contained links between biological processes that were anticipated, as well as unexpected connections that suggest further exploration.

Results

Data sources and contrasts

Table 1 summarizes the data sources for the three contrasts we studied. For the “CS vs. HM” contrast, we used the samples for CS day 8 as the treatment and samples for HM day 8 as the control. For this contrast, we pruned the STRING network to include those interactions with a score of 500 or greater. For the “Cirrhosis” contrast, we used samples from patients designated to be in the cirrhosis category as the treatment; for the “Very Advanced HCC” contrast, we used samples from patients designated as being in the “Very Advanced HCC” category as the treatment; in both contrasts, we used the samples from uninfected patients as the control.

Table 1 Data sources for each contrast

We obtained functional annotations for the genes from the c2 canonical pathways and c5 GO gene sets of the Molecular Signatures Database (MSigDB) version 3.0 [7], downloaded on February 7, 2011, CORUM complexes [21] downloaded on February 7, 2011, NetPath signal transduction pathways [22] downloaded June 6, 2009, and NCI Pathway Interaction Database’s curated pathways [23] downloaded February 7, 2011. For the rat data, we normalized all data into the Ensembl Peptide ID namespace through a combination of the Synergizer [24] and MadGENE [25] mapping services. For the human data, we used the same services to normalize all the data into Entrez Gene namespace.

Next, we integrated the annotations with the gene interaction networks. We say that a pair of processes “cross-annotates” interactions in the underlying gene-gene interaction network if one of the two genes belongs to one of the processes in the link and the other gene belongs to the other process. (See the section titled “The MCMC-BPN algorithm” for details.) For each contrast, Table 2 presents the number of processes, the number of cross-annotating pairs among these processes, the number of interactions which the process pairs cross-annotate, and the number of those interactions which we consider “perturbed” (i.e., both interacting genes exhibit perturbed expression for that contrast; see the section titled “The MCMC-BPN algorithm” for details).
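The definition of cross-annotation above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' code. It enumerates process pairs by brute force, which would be too slow at the scale reported in this study without an index from genes to processes. The function names, gene identifiers, and process names are hypothetical.

```python
from itertools import combinations


def cross_annotations(process_genes, interactions):
    """Map each unordered process pair to the interactions it cross-annotates.

    An interaction (a, b) is cross-annotated by processes (p, q) when one gene
    belongs to p and the other belongs to q.
    """
    result = {}
    for p, q in combinations(sorted(process_genes), 2):
        genes_p, genes_q = process_genes[p], process_genes[q]
        hits = {
            tuple(sorted((a, b)))
            for a, b in interactions
            if (a in genes_p and b in genes_q) or (a in genes_q and b in genes_p)
        }
        if hits:
            result[(p, q)] = hits
    return result


def perturbed_subset(interactions, perturbed_genes):
    """Interactions in which both genes show perturbed expression."""
    return {
        tuple(sorted((a, b)))
        for a, b in interactions
        if a in perturbed_genes and b in perturbed_genes
    }


# Toy data (hypothetical).
process_genes = {"P1": {"g1", "g2"}, "P2": {"g3"}, "P3": {"g4"}}
interactions = [("g1", "g3"), ("g3", "g2"), ("g1", "g2"), ("g3", "g4")]
perturbed_genes = {"g1", "g3", "g4"}

cross = cross_annotations(process_genes, interactions)
perturbed = perturbed_subset(interactions, perturbed_genes)
```

In this toy network, the interaction between g1 and g2 lies entirely within P1, so no pair of processes cross-annotates it.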

Table 2 Statistics on inputs by contrast

For each contrast, we performed a total of five runs of MCMC-BPN. Each run took between 15 and 30 hours on a single core of a 2.8 GHz AMD Opteron 4184 processor using our implementation in Python. We first describe results on the consistency of the BPNs computed by the different MCMC runs and summarize BPN statistics. Second, we show that the BPNs contain very little redundancy. Third, for each contrast, we display an example BPN and provide detail on several interesting links in the reported BPNs. Fourth, we demonstrate that the BPNs produced by MCMC-BPN are more informative while also being less redundant than those computed by two previous methods: CBPLN [15] and biological process linkage networks (BPLN) [13]. Finally, we describe some general observations of the behavior of the MCMC and features which affect the performance of our algorithm. Two additional files accompany these results. The supplementary information (Additional file 1) contains (a) details on how we executed the MCMC-BPN software to obtain and visualize our results and (b) a description of the files in the supplementary results, which are available in Additional file 2. This file contains all five BPNs for each of the contrasts studied and the parameters estimated by each run of the software.

Consistency and statistics of BPNs computed by MCMC-BPN

We measured the consistency between the five BPNs for each contrast in two ways: how many links (i.e., the pairs of processes) each pair of BPNs shared, and how many explained interactions each pair of BPNs had in common. The average Jaccard Index (JI) for all ten pairwise comparisons of the shared links in the CS vs. HM BPNs was 0.91; in these and subsequent results, we report averages but not the standard deviations, since they were one to two orders of magnitude smaller than the averages. Figure 1 presents, for each of the three contrasts, a pair of heatmaps showing the overlap between each pair of BPNs on the basis of their common perturbed and unperturbed explained interactions. For the CS vs. HM contrast, the average Jaccard Index for the common perturbed interactions was 0.92, illustrating the high degree of overlap between the reported BPNs. The five BPNs for the CS vs. HM contrast consisted of an average of 27.6 processes with 20.0 inter-process links explaining 1686.0 interactions, of which 1070.2 interactions were perturbed. The BPNs explained 27.7% of all perturbed interactions using 0.1% of the possible links.
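The consistency measure above is the standard Jaccard Index, averaged over all pairs of runs. Five runs give ten pairwise comparisons. A minimal sketch follows. The function names and toy link sets are hypothetical, and treating two empty sets as identical is an assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard Index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(runs):
    """Average Jaccard Index over every unordered pair of runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Five runs yield ten pairwise comparisons.
n_pairs_for_five_runs = len(list(combinations(range(5), 2)))

# Toy example (hypothetical link sets from three runs).
toy_runs = [{"L1", "L2", "L3"}, {"L1", "L2", "L3"}, {"L1", "L2"}]
toy_mean = mean_pairwise_jaccard(toy_runs)
```

The same function applies whether the sets hold links or explained interactions.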

Unlike the CS vs. HM contrast, the BPNs reported for the Cirrhosis contrast showed mixed consistency. Figure 1 (center) clearly illustrates the divergence in the BPNs computed for the Cirrhosis contrast in terms of the overlap of their explained interactions. Three of the five BPNs (BPNs 1, 3, and 4) were identical, with 18 processes and 14 links between these processes. The two remaining BPNs had only two processes with one link and four processes with two links, respectively, none of which were present in the three identical BPNs. We found that β, the false-negative rate, took on a very high value (0.95) for these runs in comparison to the others (0.6). This value of β indicated that only 5% of the interactions explained by these BPNs were perturbed. We discarded these two BPNs from further analyses, reasoning that they represented a situation where the MCMC could not escape a local optimum. We found that the 14 links of the three remaining BPNs explained 947 interactions, 380 of which were perturbed. Thus the BPNs explained 19.5% of all perturbed interactions using 0.2% of all possible links.

Similar to the Cirrhosis contrast, three of the BPNs computed for the Very Advanced HCC contrast had a high degree of similarity (BPNs 1, 2, and 4 in Figure 1 (right)). The remaining two BPNs, which had a modest similarity to each other, showed very little overlap with the first three BPNs. Unlike the Cirrhosis contrast, the two groups of BPNs had similar numbers of processes and links; the three similar BPNs had a mean of 44.0 processes with 36.7 links between them, and the two remaining BPNs had a mean of 38.5 processes and 41.5 links. They differed remarkably, however, in the number of interactions their links explained. The three similar BPNs explained a mean of 8114.3 interactions, of which 5670.7 were perturbed. The remaining two BPNs explained a mean of 3470.5 interactions, of which 890.5 were perturbed. As in the Cirrhosis contrast, we found that β assumed high values (0.7 and 0.75) in these two runs compared to the others (0.3). Again reasoning that the MCMC may have failed to escape local optima, we excluded the two dissimilar BPNs from the remainder of our analyses.

Lack of redundancy in BPNs

We sought to determine whether there was any redundancy within each BPN for each contrast. We used two measures for this evaluation: (i) the overlap among links in a BPN in terms of common interactions and (ii) the number of links in each BPN that explained each interaction. We define these measures in more detail in the section titled “Measuring redundancy within a BPN.”

We measured the amount of overlap between every pair of links within each BPN in terms of the number of common explained interactions, averaging the results over the BPNs computed for each contrast. Figure 2 displays, for each contrast, the distributions of the maximum observed JIs for each link, divided into perturbed explained interactions and unperturbed explained interactions. Among the five CS vs. HM BPNs, when considering perturbed interactions, a mean of 80.0% of links had a maximum JI between 0 and 0.2. For unperturbed interactions, this number was 59.9%. Moreover, 80.7% of perturbed explained interactions and 82.2% of unperturbed explained interactions in CS vs. HM had only one link explaining them on average.

Links in Cirrhosis BPNs also exhibited little overlap (see Figure 2 (center)), with all links having a maximum JI of at most 0.2, both for perturbed and for unperturbed interactions. At least 85% of the perturbed explained interactions and the unperturbed explained interactions were explained by only one link. Links exhibited little overlap in Very Advanced HCC BPNs as well, as shown in Figure 2 (right). Nearly 90% of the links had a maximum JI of at most 0.2 in the case of perturbed explained interactions, with the number being nearly 70% for unperturbed explained interactions. Moreover, about 72% of both perturbed and unperturbed explained interactions were explained by only one link.

Overall, the dominance of low JIs for the processes and links indicated that the BPNs computed by MCMC-BPN demonstrated very little redundancy. The fact that most explained interactions had only one explaining link supported this observation.
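The two redundancy measures can be sketched as follows. This is an illustrative reading of the description above, not the definitions from the section titled “Measuring redundancy within a BPN”. For each link, it takes the maximum Jaccard Index against every other link in the same BPN. For each interaction, it counts the links that explain it. The function names and toy BPN are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard Index of two sets (0.0 when both are empty, by assumption here)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_jaccard_per_link(link_to_interactions):
    """For each link, the largest overlap with any other link in the BPN."""
    return {
        link: max(
            (jaccard(inter, other_inter)
             for other, other_inter in link_to_interactions.items()
             if other != link),
            default=0.0,
        )
        for link, inter in link_to_interactions.items()
    }


def explaining_link_counts(link_to_interactions):
    """For each explained interaction, the number of links that explain it."""
    counts = Counter()
    for inter in link_to_interactions.values():
        counts.update(inter)
    return counts


# Toy BPN (hypothetical): two links share a single interaction.
toy_bpn = {"A-B": {1, 2, 3, 4}, "B-C": {4, 5}, "C-D": {6, 7}}

max_ji = max_jaccard_per_link(toy_bpn)
counts = explaining_link_counts(toy_bpn)
fraction_single = sum(1 for c in counts.values() if c == 1) / len(counts)
```

In the toy BPN, six of the seven explained interactions have exactly one explaining link, and no link has a maximum Jaccard Index above 0.2.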

Interpretation of the BPNs

CS vs. HM

Figure 3 presents one of the BPNs computed using the MCMC-BPN method on the data for the CS vs. HM contrast. The BPN contained up- and down-regulated processes in different components. Most up-regulated processes were related to metabolic functions performed by the liver, including lipid and carbohydrate metabolism, while most down-regulated processes related to cell replication and the cytoskeleton. These reflect the greater retention of physiological function of hepatocytes in CS culture versus HM culture, and the greater degree of de-differentiation for cells in HM versus CS, respectively, as reported by Kim et al.[17].

Two main components dominate the BPNs. The first component contained a mix of processes related to fatty acid metabolism (OXIDOREDUCTASE_ACTIVITY, KEGG_PPAR_SIGNALING_PATHWAY, REACTOME_REGULATION_OF_LIPID_METABOLISM_BY_PEROXISOME_PROLIFERATOR_ACTIVATED_RECEPTOR_ALPHA, and KEGG_BIOSYNTHESIS_OF_UNSATURATED_FATTY_ACIDS) and processes related to amino acid and carbohydrate metabolism (REACTOME_METABOLISM_OF_CARBOHYDRATES, REACTOME_METABOLISM_OF_AMINO_ACIDS, and KEGG_ARGININE_AND_PROLINE_METABOLISM), all critical functions carried out by hepatocytes [26]. A link between OXIDOREDUCTASE_ACTIVITY and REACTOME_METABOLISM_OF_AMINO_ACIDS bridges these two groups of processes. The second component contained down-regulated processes related to the de-differentiation of the hepatocytes in HMs.

Although the names of some of the processes appear to be very similar, their actual gene content tended to overlap very little. For example, the sets of genes annotated to CELL_CYCLE_GO_0007049 and to KEGG_CELL_CYCLE had a JI of only 0.23. Similarly, KEGG_PPAR_SIGNALING_PATHWAY and REACTOME_REGULATION_OF_LIPID_METABOLISM_BY_PEROXISOME_PROLIFERATOR_ACTIVATED_RECEPTOR_ALPHA, which are directly linked in the BPN, had a gene-based JI of 0.32. Figure 4 shows the dense network of interactions explained by this link. While genes belonging to both processes, such as peroxisome proliferator-activated receptor α (PPARA) and cholesterol 7 α-hydroxylase (CYP7A1), are involved in some interactions, there are many interactions that involve genes belonging to only one of the two processes.

Cirrhosis

The three consistent BPNs in the Cirrhosis contrast were composed entirely of immune response-related processes, as shown in Figure 5. While we anticipated seeing a response in terms of liver-related processes as in the two previous analyses, two factors likely played a large role in the dominance of the immunity processes. First, all cirrhosis patients had sustained infection by HCV. Second, samples in the previous two analyses contained RNA extracted solely from hepatocytes, the cells responsible for the bulk of metabolic functions of the liver. The samples in this contrast (as well as Very Advanced HCC) were from the whole liver. Thus, they contained a mixture of cell types, which could dilute the signal from metabolic processes. Our results corroborate those found by Wurmbach et al.[18], who categorized the bulk of the differentially expressed genes as participating in immune response.

Very Advanced HCC

The majority of processes in the three similar BPNs of the Very Advanced HCC contrast related to cell replication, owing to the advanced nature of HCC in the patients from whom the samples were derived. (See Figure 6.) The largest component of the BPN contained 17 processes and 18 links, including both down- and up-regulated processes, most of them related to cell replication. The BPN contained a few links related to liver-specific functions, however, such as that between KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION and OXIDOREDUCTASE_ACTIVITY_ACTING_ON_THE_CH_CH_GROUP_OF_DONORS, both down-regulated in comparison to control patients, indicating the progression of liver damage in the HCC patients. Interestingly, REACTOME_INNATE_IMMUNITY_SIGNALING was down-regulated in HCC patients compared to controls, suggesting a breakdown in immune response. MCMC-BPN reported a significant link between this process and REGULATION_OF_MITOTIC_CELL_CYCLE.

Comparison with CBPLN

We compared performance of MCMC-BPN to the CBPLN method by running MCMC-BPN over the day 8 CS vs. HM dataset taken directly from the CBPLN study [15], which featured the same gene expression and interaction data as the CS vs. HM study presented above, but a subset of older annotation data from MSigDB. Specifically, the CBPLN study featured a set of 18 processes significantly upregulated in CS in comparison to HM that we had manually identified and selected.

We performed five independent runs of MCMC-BPN on the CBPLN day 8 dataset. Two runs had identical sets of links with values of α and β between 0.30 and 0.35. The other three runs had high values of 0.60 and 0.65 for both α and β. We retained only the BPNs in the first group for further analyses. Both these BPNs contained 14 of the 18 terms and 16 (10.7%) of the 150 possible links, which explained 1,028 interactions, including 719 (59.7%) of the 1,205 perturbed interactions.

CBPLN produces a BPN with directed links. We ignored these directions to facilitate comparison to MCMC-BPN. We considered a link significant in the CBPLN results if the corrected p-value was at most 0.01, per the original CBPLN study [15]. The resulting undirected BPN for CBPLN contained all 18 processes with 58 links (38.7% of all possible links). The links explained 2,103 interactions, including 1,125 perturbed interactions (93.4% of all perturbed interactions).

Compared to the BPN produced by CBPLN, the BPNs produced by MCMC-BPN explained approximately two-thirds (63.9%) as many perturbed interactions in the underlying response network; however, they incorporated only approximately one-quarter (27.6%) as many links. As shown in Figure 7, the links in BPNs from MCMC-BPN had much less overlap (81.3% of links with a maximum JI at most 0.2 for all interactions, and 87.5% for perturbed interactions) than the links in the BPN produced by CBPLN (39.7% for all interactions and 37.9% for perturbed interactions).
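The percentages in this comparison follow directly from the counts reported above for the CBPLN day 8 dataset. The short check below reproduces them. The variable names are hypothetical, and the final perturbed-interactions-per-link figures are a derived quantity that the text does not state explicitly.

```python
def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts reported in the text for the CBPLN day 8 dataset.
possible_links = 150
total_perturbed = 1205
mcmc_bpn = {"links": 16, "perturbed_explained": 719}
cbpln = {"links": 58, "perturbed_explained": 1125}

# Share of all possible links and of all perturbed interactions, per method.
mcmc_link_share = pct(mcmc_bpn["links"], possible_links)                   # 10.7
cbpln_link_share = pct(cbpln["links"], possible_links)                     # 38.7
mcmc_coverage = pct(mcmc_bpn["perturbed_explained"], total_perturbed)      # 59.7
cbpln_coverage = pct(cbpln["perturbed_explained"], total_perturbed)        # 93.4

# MCMC-BPN relative to CBPLN.
relative_perturbed = pct(mcmc_bpn["perturbed_explained"],
                         cbpln["perturbed_explained"])                     # 63.9
relative_links = pct(mcmc_bpn["links"], cbpln["links"])                    # 27.6

# Derived (not stated in the text): perturbed interactions explained per link.
mcmc_per_link = mcmc_bpn["perturbed_explained"] / mcmc_bpn["links"]
cbpln_per_link = cbpln["perturbed_explained"] / cbpln["links"]
```

On these counts, each MCMC-BPN link explains more than twice as many perturbed interactions, on average, as each CBPLN link.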

Thus, while MCMC-BPN produced BPNs which explained somewhat less of the response network than the BPN produced by CBPLN, it did so using a much more concise, much less redundant set of links. Furthermore, CBPLN required explicitly defining the set of links for which to test for significant perturbation, whereas MCMC-BPN did not require any such specification.

Finally and most importantly, we note that MCMC-BPN was able to compute all five BPNs in fewer than 40 hours cumulative runtime on a standard modern desktop PC, whereas CBPLN required several hundred hours of cumulative runtime on a high-performance computing cluster on the same dataset, primarily due to its need to build empirical distributions to determine the statistical significance of each link. Executing CBPLN becomes nearly intractable on the full CS vs. HM, Cirrhosis, and Very Advanced HCC datasets. As stated in the “Motivation” section, the computational expense of running CBPLN to compute links between more than a few dozen processes served as one of our primary motivations for developing MCMC-BPN.

Comparison with BPLNs

We also compared MCMC-BPN to the BPLN method presented by Dotan-Cohen et al.[13]. We computed BPLNs for the three contrasts using the method of Dotan-Cohen et al.[13] (see the section titled “Computation of BPLNs”). We used the same input annotations as for MCMC-BPN runs, i.e., those processes found significantly perturbed for the CS vs. HM contrast by GSEA. Since BPLN does not consider the state of perturbation of genes in the interaction network, we restricted the interaction network to all the perturbed interactions. Like CBPLN, BPLN also produced directed links between processes, so we considered a significant link in either direction sufficient to indicate a significant undirected link.

For each contrast and for each of two stringent significance thresholds, Table 3 lists the number of significant (undirected) links in the BPLN, the number of processes connected by these links, and the number of interactions that these links explain. The first significance threshold of 0.0001 is an arbitrary albeit reasonable threshold that an investigator might select when exploring results from BPLN. The second threshold produces a BPLN with a number of links as close to, but no fewer than, the average number of links reported for MCMC-BPN (shown in the final row for the contrast). We discuss these results below but only for the second threshold for each contrast, in order to avoid repetition.
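The choice of the second threshold can be sketched as a simple selection over sorted p-values. This sketch is an illustration under stated assumptions, not the procedure used to build Table 3. It collapses directed links by keeping the smaller p-value of the two directions, which is equivalent to accepting a significant link in either direction. It also treats a link as significant when its p-value is at most the threshold. All names and p-values are hypothetical.

```python
import math


def undirected_pvalues(directed_pvalues):
    """Collapse directed links, keeping the smaller p-value per process pair."""
    result = {}
    for (src, dst), p in directed_pvalues.items():
        key = tuple(sorted((src, dst)))
        result[key] = min(p, result.get(key, 1.0))
    return result


def matching_threshold(pvalues, target_links):
    """Smallest threshold that keeps at least `target_links` links.

    `target_links` may be fractional (for example, an average over runs);
    it is rounded up so the result never has fewer links than the target.
    """
    needed = max(1, math.ceil(target_links))
    ordered = sorted(pvalues.values())
    if needed > len(ordered):
        raise ValueError("fewer candidate links than the requested target")
    return ordered[needed - 1]


# Toy directed p-values (hypothetical).
directed = {
    ("A", "B"): 1e-6, ("B", "A"): 0.3,
    ("A", "C"): 1e-3,
    ("C", "B"): 1e-5, ("B", "C"): 1e-8,
    ("C", "D"): 0.2,
}
undirected = undirected_pvalues(directed)
threshold = matching_threshold(undirected, target_links=2.4)
selected = {pair for pair, p in undirected.items() if p <= threshold}
```

With tied p-values, the selected set can contain more links than the target, which is consistent with the "no fewer than" requirement.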

Table 3 Statistics on BPLNs computed for the CS vs. HM contrast
