Summarizing cellular responses as biological process networks
 Christopher D Lasher^{1},
 Padmavathy Rajagopalan^{2, 3} and
 T M Murali^{3, 4}Email author
https://doi.org/10.1186/17520509768
© Lasher et al.; licensee BioMed Central Ltd. 2013
Received: 5 September 2012
Accepted: 26 June 2013
Published: 29 July 2013
Abstract
Background
Microarray experiments can simultaneously identify thousands of genes that show significant perturbation in expression between two experimental conditions. Response networks, computed through the integration of gene interaction networks with expression perturbation data, may themselves contain tens of thousands of interactions. Gene set enrichment has become standard for summarizing the results of these analyses in terms functionally coherent collections of genes such as biological processes. However, even these methods can yield hundreds of enriched functions that may overlap considerably.
Results
We describe a new technique called Markov chain Monte Carlo Biological Process Networks (MCMCBPN) capable of reporting a highly nonredundant set of links between processes that describe the molecular interactions that are perturbed under a specific biological context. Each link in the BPN represents the perturbed interactions that serve as the interfaces between the two processes connected by the link.
We apply MCMCBPN to publicly available liverrelated datasets to demonstrate that the networks formed by the most probable interprocess links reported by MCMCBPN show high relevance to each biological condition. We show that MCMCBPN’s ability to discern the few key links from in a very large solution space by comparing results from two other methods for detecting interprocess links.
Conclusions
MCMCBPN is successful in using few interprocess links to explain as many of the perturbed genegene interactions as possible. Thereby, BPNs summarize the important biological trends within a response network by reporting a digestible number of interprocess links that can be explored in greater detail.
Keywords
Background
Motivation
The deluge of publicly available molecular biology data, including genomewide gene expression measurements [1, 2] and gene and protein interaction networks [3, 4] has necessitated the development of computational methods that produce comprehensible views of large numbers of biological molecules and their connections. Reporting perturbation in gene expression on the basis of individual genes [5, 6] (of which there may be thousands) has given way to more holistic techniques—referred to as functional enrichment—that instead report the significance of the collective perturbation of processes—sets of biologically related genes (e.g., [7–11]. Results from these analyses reveal important trends that large lists of genes can obscure.
Recent work in functional enrichment of gene expression data [9, 11] has emphasized finding concise, nonredundant sets of processes that account for much of the overall perturbation among the genes. Methods by Lu et al.[9] and Bauer, Gagneur, and Robinson [11] assume that a cell or tissue perturbs certain biological processes in response to a change internal or external conditions. In their models, perturbed processes cause the perturbation of the expression of individual genes belonging to each of those processes. They consider the actual measurements of perturbation of the individual genes, as assessed by DNA microarrays, for example, as noisy observations of signals generated by perturbation of specific processes by cells in response to a stimulus. Both methods use generative models to assess the goodness of fit of a set of candidate perturbed processes to the observed gene perturbations, while differing in their precise formulations. The two methods use standard algorithms (greedy [9] and Markov chain Monte Carlo (MCMC) [11]) to find the set of processes with the greatest fit to the observed data.
Products of genes do not act independently but rather in concert with products of other genes through numerous interactions. In a similar vein, biological processes, composed of genes, may themselves interact. Accordingly, researchers have begun developing methods to identify connections between processes based on the underlying gene interaction networks. A method by Li et al.[12] computed the “crosstalk” between processes by counting the number of interactions that occur among the genes of each process and assessing the significance of this number against empirical distributions. DotanCohen et al. published a more direct method [13] which uses Fisher’s Exact Test to determine if one process is linked to another, i.e., if genes in the first process have significantly more interactions with genes in the second process than would be expected by chance. Wang et al.[14] published a method that calculates what they call “functional similarity” between two processes using the sum of the distances between all pairs of genes belonging to those processes.
While the previous methods represent advances in finding highlevel connections between processes, they do not incorporate information which could lead to discovering which connections have relevance under specific biological contexts. Motivated by this methodological gap, in earlier work [15] we extended the method of DotanCohen et al.[13] by integrating gene expression data with genegene interactions to compute what we termed “Contextual Biological Process Linkage Networks” (CBPLNs). A link in a CBPLN indicates not only that the genes of two processes have a significant number of interactions among them, but that genes at the interface exhibit a large amount of perturbation in expression. Thus, it became possible to infer the interprocess connections relevant to a cell or tissue’s response to an internal or external stimulus.
The CBPLN method has several aspects that need improvement. First, because it must build empirical distributions to determine the significance of each link, it becomes prohibitively computationally expensive as the number of links to test grows. Second, the method reports all significant links, Since it makes no distinction among two or more links that are found to be significant on account of nearly identical sets of genegene interactions, it may output many redundant significant links. This latter drawback is universal to all methods that compute interprocess links, and also to most techniques for functional enrichment.
Here we present a new method that simultaneously addresses the shortcomings of earlier methods. Our method takes inspiration from the methods for functional enrichment reported by Lu et al.[9] and Bauer et al.[11]. We assume that links between biological processes become perturbed during the response of a cell or tissue to some stimulus, and that this perturbation of interprocess links propagates via the individual genegene interactions between genes belonging to the different processes. We can not directly observe perturbation of the links between the processes; instead, our method considers the perturbation of genes participating in the interfacing interactions of processes as noisy observations generated from the perturbed interprocess links. Our method infers a nonredundant set of processes and their perturbed links, which we call a Biological Process Network (BPN), from the interactions between the observed perturbed genes. We compute the likelihood of candidate BPNs in terms of parameters accounting for the noisiness in the observed states of the genegene interactions. Using Markov chain Monte Carlo (MCMC), we identify BPNs of high likelihood. We label this new method “MCMC Biological Process Networks” (MCMCBPN). BPNs thus computed summarize the important biological trends within a response network by reporting to the user a digestible number of interprocess links that can be explored in greater detail.
Overview of the method
MCMCBPN aims to explain as many interactions between genes with perturbed expression by as few interprocess links as possible. By including a link between a pair of processes in the BPN, we say that link “explains” the interactions crossannotated by that pair of terms. Our objective is to the reward inclusion of links in the BPN that explain many interactions between perturbed genes not already explained by other links in the BPN. Simultaneously, another objective is to penalize the inclusion of more links in the BPN than necessary, including links which mostly explain unperturbed interactions, and missing a large number of perturbed interactions. To this end, we define a likelihood function as the product of several Bernoulli distributions, controlled by three parameters used to adjust for the amount of “noise” in the observed perturbation of the crossannotated links.
The first of these parameters, the link prior λ, serves to reduce the number of links in a BPN, for when λ is low, having few links increases the overall likelihood. A low value for the second parameter, the falsepositive rate α, encourages BPNs that explain many perturbed interactions. Finally, when the parameter β, which represents the falsenegative rate, has a low value, it encourages BPNs that explain few unperturbed interactions. Note that modifying the BPN in such a way that increases the contribution of one parameter to the likelihood may be offset, or even outweighed, by a decrease in the contribution to the likelihood by another parameter. For example, including every possible link in a BPN will have a favorable likelihood contribution via α, since it necessarily explains all perturbed interactions. Such an inclusive BPN will, however, lead to very poor contributions via λ (many links are included in the BPN) and β (many unperturbed interactions will also be included). Thus, such a BPN will have a very low overall likelihood. A desirable BPN must strike a balance among the tension of all three parameters—neither including too many links, nor explaining too few perturbed interactions, nor explaining too many unperturbed interactions—in order maximize the overall likelihood.
While the likelihood function provides a means of scoring the quality of a given BPN, for any given data set, there exist 2^{L} possible BPNs, where L is the set of all possible links between pairs of processes. To search this potentially enormous solution space, we use the MetropolisHastings algorithm for Markov chain Monte Carlo (MCMC) [16]. Each state in the Markov chain represents a particular set of values for the parameters λ, α, and β, as well as a particular configuration of interprocess links. The neighbors of the state are those which have one additional or one less link, or which have one parameter with a different value. The parameter values and links which contribute to BPNs with high likelihoods will tend to remain consistent from one visited state to the next. Thus, we report the final BPN as the links that appear most frequently throughout the states visited during the MCMC.
Application
We applied MCMCBPN to three treatmentcontrol experiments relating to the liver and liver disease. In the first application, we compared gene expression of rat hepatocytes in two common in vitro culture systems [17]: hepatocyte monolayer (HM) and collagen sandwich (CS). The remaining two experiments contrast gene expression levels from liver tissue samples from dozens of human patients diagnosed with hepatitis C virus (HCV)induced cirrhosis and hepatocellular carcinoma (HCC) with expression levels of samples from healthy patients [18]. Approximately 170 million people worldwide suffer from HCV infection [19]. HCC ranks third among the deadliest cancers worldwide, of which HCV is among the leading causes of incidence [20]. Below, we present and discuss the BPNs computed to summarize the major trends of differential expression of each of these three data sets. We found the BPNs contained links between biological processes that were anticipated, as well as unexpected connections that suggest further exploration.
Results
Data sources and contrasts
We obtained functional annotations for the genes from the c2 canonical pathways and c5 GO gene sets of the Molecular Signatures Database (MSigDB) version 3.0 [7], downloaded on February 7, 2011, CORUM complexes [21] downloaded on February 7, 2011, NetPath signal transduction pathways [22] downloaded June 6, 2009, and NCI Pathway Interaction Database’s curated pathways [23] downloaded February 7, 2011. For the rat data, we normalized all data into the Ensembl Peptide ID namespace through a combination of the Synergizer [24] and MadGENE [25] mapping services. For the human data, we used the same services to normalize all the data into Entrez Gene namespace.
Statistics on inputs by contrast
Contrast  Processes  process pairs Crossannotating  interactions Crossannotated  interactions Perturbed 

CS vs. HM  210  14796  11585  3861 
Cirrhosis  148  8714  12913  1954 
Very Advanced HCC  345  36460  30201  15400 
For each contrast, we performed a total of five runs of MCMCBPN. Each run took between 15 and 30 hours on a single core of a 2.8GHz AMD Opteron 4184 processor using our implementation in Python. We first describe results on the consistency of the BPNs computed by the different MCMC runs and summarize BPN statistics. Second, we show that the BPNs contain very little redundancy. Third, for each contrast, we display an example BPN and provide detail on several interesting links in the reported BPNs. Fourth, we demonstrate that the BPNs produced by MCMCBPN are more informative while also being less redundant than those computed by two previous methods: CBPLN [15] and biological process linkage networks (BPLN) [13]. Finally, we describe some general observations of the behavior of the MCMC and features which affect the performance of our algorithm. Two additional files accompany these results. The supplementary information (Additional file 1) contains (a) details on how we executed the MCMCBPN software to obtain and visualize our results and (b) a description of the files in the supplementary results, which are available in Additional file 2. This file contains all the five BPNs for each of the contrasts studied and the parameters estimated by each run of the software.
Consistency and statistics of BPNs computed by MCMCBPN
Unlike the CS vs. HM contrast, the BPNs reported for the Cirrhosis contrast showed mixed consistency. Figure 1 (center) clearly illustrates the divergence in the BPNs computed for the Cirrhosis contrast in terms of the overlap of their explained interactions. Three of the five BPNs (BPNs 1, 3, and 4) were identical, with 18 processes and 14 links between these processes. The two remaining BPNs had only two processes with one link and four processes with two links, respectively, none of which were present in the three identical BPNs. We found that β, the falsenegative rate, took on a very high value (0.95) for these runs in comparison to the others (0.6). This value of β indicated that only 5% of the interactions explained by these BPNs were perturbed. We discarded these two BPNs from further analyses, reasoning that they represented a situation where the MCMC could not escape a local minimum. We found that the 14 links of the three remaining BPNs explained 947 interactions, 380 of which were perturbed. Thus the BPNs explained 19.5% of all perturbed interactions using 0.2% of all possible links.
Similar to the Cirrhosis contrast, three of the BPNs computed for the Very Advanced HCC contrast had a high degree of similarity (BPNs 1, 2, and 4 in Figure 1 (right)). The remaining two BPNs, which had a modest similarity to each other, showed very little overlap with the first three BPNs. Unlike the Cirrhosis contrast, the two groups of BPNs had similar numbers of processes and links; the three similar BPNs had a mean of 44.0 processes with 36.7 links between them, and the two remaining BPNs had a mean of 38.5 processes and 41.5 links. They differed remarkably, however, in the number of interactions their links explained. The three similar BPNs explained a mean of 8114.3 interactions, of which 5670.7 were perturbed. The remaining two BPNs explained a mean of 3470.5 interactions, of which 890.5 were perturbed. As in the Cirrhosis contrast, we found that β assumed high values (0.7 and 0.75) in these two runs compared to the others (0.3). Again reasoning the MCMC may have failed to escape local minima, we excluded the two dissimilar BPNs from the remainder of our analyses.
Lack of redundancy in BPNs
We sought to determine whether there was any redundancy within each BPN for each contrast. We used two measures for this evaluation: (i) the overlap among links in a BPN in terms of common interactions and (ii) the number of links in each BPN that explained each interaction. We define these measures in more detail in the section titled “ Measuring redundancy within a BPN.”
Links in Cirrhosis BPNs also exhibited little overlap (see Figure 2 (center)), with all links having a maximum JI of at most 0.2, both for perturbed and for unperturbed interactions. At least 85% of the perturbed explained interactions and the unperturbed explained interactions were explained by only one link. Links exhibited little overlap in Very Advanced HCC BPNs as well, as shown in Figure 2 (right). Nearly 90% of the links had a maximum JI of at most 0.2 in the case of perturbed explained interactions, with the number being nearly 70% for unperturbed explained interactions. Moreover, about 72% of both perturbed and unperturbed explained interactions were explained by only one link.
Overall, the dominance of low JIs for the processes and links indicated that the BPNs computed by MCMCBPN demonstrated very little redundancy. The fact that most explained interactions had only one explaining link supported this observation.
Interpretation of the BPNs
CS vs. HM
Two main components dominate the BPNs. The first component contained a mix of processes related to fatty acid metabolism (OXIDOREDUCTASE_ACTIVITY, KEGG_PPAR_SIGNALING_PATHWAY, REACTOME_ REGULATION_OF_LIPID_METABOLISM_BY_PEROXISOME_PROLIFERATOR_ACTIVATED_RECEPTOR_ ALPHA, and KEGG_BIOSYNTHESIS_OF_UNSATURATED_FATTY_ACIDS) and processes related to amino acid and carbohydrate metabolism (REACTOME_ METABOLISM_OF_CARBOHYDRATES, REACTOME_ METABOLISM_OF_AMINO_ACIDS, and KEGG_ARGININE_AND_PROLINE_METABOLISM), all critical functions carried out by hepatocytes [26]. A link between OXIDOREDUCTASE_ACTIVITY and REACTOME_METABOLISM_OF_AMINO_ACIDS bridges these two groups of processes. The second component contained downregulated processes related to the dedifferentiation of the hepatocytes in HMs.
Cirrhosis
Very Advanced HCC
Comparison with CBPLN
We compared performance of MCMCBPN to the CBPLN method by running MCMCBPN over the day 8 CS vs. HM dataset taken directly from the CBPLN study [15], which featured the same gene expression and interaction data as the CS vs. HM study presented above, but a subset of older annotation data from MSigDB. Specifically, the CBPLN study featured a set of 18 processes significantly upregulated in CS in comparison to HM that we had manually identified and selected.
We performed five independent runs of MCMCBPN on the CBPLN day 8 dataset. Two runs had identical sets of links with values of α and β between 0.30 and 0.35. The other three runs had high values of 0.60 and 0.65 for both α and β. We retained only the BPNs in the first group for further analyses. Both these BPNs contained 14 of the 18 terms and 16 (10.7%) of the 150 possible links, which explained 1,028 interactions, including 719 (59.7%) of the 1,205 perturbed interactions.
CBPLN produces a BPN with directed links. We ignored these directions to facilitate comparison to MCMCBPN. We considered a link significant in the CBPLN results if the corrected pvalue was at most 0.01, per the original CBPLN study [15]. The resulting undirected BPN for CBPLN contained all 18 processes with 58 links (38.7% of all possible links). The links explained 2,103 interactions, including 1,125 perturbed interactions (93.4% of all perturbed interactions).
Thus, while MCMCBPN produced BPNs which explained somewhat less of the response network than the BPN produced by CBPLN, it did so using a much more concise, much less redundant set of links. Furthermore, CBPLN required explicitly defining the set of links for which to test for significant perturbation, whereas MCMCBPN did not require any such specification.
Finally and most importantly, we note that MCMCBPN was able to compute all five BPNs in fewer than 40 hours cumulative runtime on a standard modern desktop PC, whereas CBPLN required several hundred hours of cumulative runtime on a highperformance computing cluster on the same dataset, primarily due to its need to build empirical distributions to determine the statistical significance of each link. Executing CBPLN becomes nearly intractable on the full CS vs. HM, Cirrhosis, and Very Advanced HCC datasets. As stated in the “ Motivation” section, the computational expense of running CBPLN to compute links between more than a few dozen processes served as one of our primary motivations for developing MCMCBPN.
Comparison with BPLNs
We also compared MCMCBPN to the BPLN method presented by DotanCohen et al.[13]. We computed BPLNs for the three contrasts using the method of DotanCohen et al.[13] (see the section titled “Computation of BPLNs”). We used the same input annotations as for MCMCBPN runs, i.e., those processes found significantly perturbed for the CS vs. HM contrast by GSEA. Since BPLN does not consider the state of perturbation of genes in the interaction network, we restricted the interaction network to all the perturbed interactions. Like CBPLN, BPLN also produced directed links between processes, so we considered a significant link in either direction sufficient to indicate a significant undirected link.
Statistics on BPLNs computed for the CS vs. HM contrast
Link significance threshold  Processes  Links  Explained perturbed interactions  Unexplained perturbed interactions 

CS vs. HM  
0.0001  191  1049  3186  675 
1.51 ×10^{−25}  23  20  698  3163 
MCMCBPN  27.6  20.0  1686.0  2175.0 
Cirrhosis  
0.0001  99  199  994  960 
9.04 ×10^{−14}  15  14  176  1778 
MCMCBPN  18.0  14.0  380.0  1574.0 
Very Advanced HCC  
0.0001  307  7249  13213  2187 
2.26 ×10^{−81}  27  38  2776  12624 
MCMCBPN  44.0  36.7  5670.7  9729.3 
CS vs. HM
Cirrhosis
A threshold of 9.04 ×10^{−14} gave a number of significant links for BPLN that matched the reported average for MCMCBPN. At this threshold, the links in the BPLN results explained fewer than half the number of perturbed genegene interactions. Over 65% of links had a maximum JI of 0.8 or greater (see Figure 8 (center)).
Very Advanced HCC
At the threshold of 2.26 ×10^{−81}, the BPLN contained 38 significant links, matching as closely as possible to the reported average of 36.7 for MCMCBPN. These links explained less than half the number of perturbed interactions as those explained by the links from MCMCBPN. The links from BPLN involved fewer processes overall compared to MCMCBPN. The links from the BPLN at this threshold, however, displayed a large amount of redundancy, with 88% having a maximum JI of 0.8 or greater; see Figure 8 (right).
Thus, for all contrasts, the links reported by BPLN proved less informative and more redundant than those reported by MCMCBPN. We concluded that BPLN computed a much poorer summary of the perturbed gene interaction network in comparison to MCMCBPN.
Behavior of the MCMC
Since our Markov Chain has the property of irreducibility (the MCMC can reach all states from any given state with positive probability) and aperiodicity (the MCMC will not remain trapped in cycles), we expect that, given sufficient number of steps, MCMC will visit each state in the solution space with a frequency proportional to the likelihood of the state [27]. To demonstrate that MCMCBPN follows this behavior, we performed five additional runs of MCMCBPN for the CS vs. HM contrast wherein we recorded the frequency with which the MCMC visited each state. In these runs, we fixed the parameters to the most probable values as determined by the first five runs (λ = 0.01, α = 0.2, β = 0.35), and permitted only linksbased transitions.
Conclusions
We have presented a method for computing connections between biological processes specific to a biological context corresponding to comparing gene expression measurements from two conditions (e.g., casecontrol gene expression studies). Our method, which we call MCMCBPN, uses MCMC to search a solution space of possible interprocess links that can explain the perturbed interactions between genes of the processes. We computed BPNs for three liverrelated contrasts: (i) rat hepatocytes in CS compared to those in HM (CS vs. HM), and samples from livers of human patients with (ii) HCVinduced cirrhosis (Cirrhosis) or (iii) HCC (Very Advanced HCC) compared to samples from patients with healthy livers.
The BPNs varied in size, roughly in proportion to the number of perturbed interactions for the contrast. The BPNs explained around 20–40% of the perturbed interactions per contrast, using only 0.1% of the possible links, and exhibited very little redundancy. In contrast, BPNs computed by CBPLN and BPLN contained several more links than MCMCBPNs. Moreover, these methods produced BPNs with considerable redundancy. We demonstrated that the BPNs reported contextually relevant connections between processes, such as cellgrowth processes related to latestage cancer in the Very Advanced HCC contrast, or metabolic pathway processes related to better retention of physiological function in CS vs. HM.
In the Cirrhosis and Very Advanced HCC contrasts, we observed a bifurcation in the BPNs reported by MCMCBPN, where one set contained BPNs that explained many more perturbed interactions than the other. We noticed that during these runs, the poorer quality BPNs tended to assume high values for the falsenegative rates, suggesting that the MCMC entered a solution space of poor likelihood that nevertheless was large and distant from more optimal solution spaces. One possible solution to alleviating this problem is to monitor the MCMCs as they progress, terminating or restarting those that diverge greatly from solutions of high likelihood. A more simple solution would be to restrict the values which the parameters can take even further (e.g., allow a maximum value of 0.7 rather than 0.95 for the falsenegative parameter).
It may be possible to increase both the number of probable links (and thus the connectivity within a BPN), as well as the probabilities for each link that contributes strongly to the overall likelihood by calculating the link probability as the fraction of most likely BPNs (e.g., the top 100,000) in which the link appears. Strategies to reduce the required number of MCMC steps, and thus the running time, include precomputing the possible contribution of each link to the likelihood before starting the MCMC, and pruning out the links with low contribution, thus significantly reducing the search space. We hope to include these and other improvements in future versions of MCMCBPN. Alternatively, statespacesearching algorithms other than MCMC could be applied, such as simulated annealing (SA). We believe that MCMCBPN and future extensions will prove useful in revealing the larger stories hidden within the everincreasing amounts of highthroughput life science data.
Methods
Computing gene expression perturbation
For each contrast, we applied Linear Models for Microarray Data (LIMMA) [6] to the microarray data to compute pvalues indicating the significance of differential expression of each gene. We declared all genes with a LIMMA pvalue≤0.05 as perturbed.
Selection of processes for computation of BPNs
The number of candidate BPNs is $O\left({2}^{\left(\genfrac{}{}{0.0pt}{}{n}{2}\right)}\right)$, where n is the number of processes. Thus, the space of candidate BPNs grows extremely rapidly in comparison to the number of processes considered. For this reason, prior to running MCMCBPN, we screened the processes to include only those that showed significant perturbation as determined by GSEA [7]. For each contrast, we retained any processes with a false discovery rate (qvalue) at or below the threshold value of 0.1. We also excluded those processes with fewer than 10 genes or more than 300 genes, to remove overlyspecific and overlygeneral processes, respectively.
The MCMCBPN algorithm
Identifying perturbed crossannotated interactions
We then define the perturbed crossannotated interactions D⊆C as the subset of crossannotated interactions for which both incident genes are perturbed. Lastly, we denote the subset of perturbed interactions crossannotated by specific terms (p_{ i },p_{ j }), where p_{ i }≠p_{ j }, as D_{ i j }=C_{ i j }∩D. For example, in Figure 10, D_{1,3}={(v_{2},v_{3})}.
Calculation of BPN likelihood

Pr(X): the probability of selecting this subset of links X from among all possible pairs of terms in P

Pr(DX): the probability that the links in X explain the observed perturbed interactions D
We explicitly define each of these individual probabilities below.
where 0≤λ≤1 represents the prior probability of selecting a given link from the set of all links.
 (i)
I _{11}: perturbed interactions explained by at least one link in X, i.e., the “true positives” (I _{11} = {(v _{1},v _{3}),(v _{2},v _{3})} in the example);
 (ii)
I _{10}: perturbed interactions not explained by any links in X, i.e., the “false positives” (I _{10} = {(v _{1},v _{2})} in the example);
 (iii)
I _{01}: interactions that are not perturbed, but which are explained by one or more links in X, i.e., the “false negatives” (I _{01} = {(v _{3},v _{5})} in the example);
 (iv)
I _{00}: interactions that are not perturbed and are not explained by any links in X, i.e., the “true negatives” (I _{00} = {(v _{5},v _{6})} in the example).
We briefly note here that the example in Figure 10 contains one additional interaction, (v_{3},v_{4}), excluded from consideration in these categories as it does not meet the definition of a crossannotated interaction, since both v_{3} and v_{4} have identical annotations.
where α represents the falsepositive rate (i.e., the prior probability that a perturbed interaction is not explained by any link in X), and β represents the falsenegative rate (i.e., the prior probability an unperturbed interaction is explained by one or more links in X).
since we assume that all configurations of parameter values are equally likely.
Distributions of the parameters
We restrict the values of the parameters to discrete, rather than continuous, distributions. We allow λ, the prior probability of selecting a link, to take values in the set $\{0.05k\mid 1\le k\le 10,\phantom{\rule{0.3em}{0ex}}k\in \mathbb{N}\}\cup \{0.00001,0.00005,0.0001,0.0005,0.001,0.005,0.01\}$. This set spans a wide enough range to cover computing BPNs from many possible links, where only a small fraction may be included, to few links, where a large fraction of the links may be included. We restrict the set of values for both α, the falsepositive rate, and β, the falsenegative rate, to $\{0.05k\mid 1\le k<20,\phantom{\rule{0.3em}{0ex}}k\in \mathbb{N}\}$.
Design of the Markov chain
We define the set of all possible states in the Markov chain as $\mathcal{\mathcal{M}}$. Each state $m(\Phi ,X)\in \mathcal{\mathcal{M}}$ consists of two configurations: a configuration of parameters Φ and a set of explanatory links X. We restrict each state m(Φ,X) to having two types of neighboring states m^{′}(Φ^{′},X^{′}): neighbors with different parameter configurations, i.e., Φ^{′}≠Φ but X^{′}=X, and neighbors with different links configurations, i.e., Φ^{′}=Φ but X^{′}≠X. We restrict the parametersbased neighbors of m(Φ,X) to be those with a different value for only one parameter. We restrict the linksbased neighbors of m(Φ,X) to be those with a set of links containing one additional or one fewer link than X^{′}.
Design of the Markov chain Monte Carlo
In order to transition from the current state m(Φ,X) to a neighboring state m^{′}(Φ^{′},X^{′}), we propose a linksbased neighbor with probability ρ and a parametersbased neighbor with probability 1−ρ, where ρ, 0≤ρ≤1. If we propose a linksbased neighbor, then we draw neighbor m^{′}(Φ^{′}=Φ,X^{′}≠X) uniformly at random from the set of all linksbased neighbors of m(Φ,X). Otherwise, we draw a neighbor m^{′}(Φ^{′}≠Φ,X^{′}=X) uniformly at random from the set of all parametersbased neighbors. In this study, we set ρ=0.9, so that approximately 90% of proposed transitions were linksbased.
where N(m(Φ,X)) is the number of neighbors of state m(Φ,X) and N(m^{′}(Φ^{′},X^{′})) is the number of neighboring states of state m^{′}(Φ^{′},X^{′}). We note that because of the design of the Markov chain, N(m(Φ,X)) and N(m^{′}(Φ^{′},X^{′})) are equal and thus cancel each other out. If we accept the transition, then m^{′}(Φ^{′},X^{′}) becomes the current state; otherwise, m(Φ,X) remains the current state. Note that any time the proposed state has a greater likelihood than the current state, we will accept the transition. On the other hand, we still accept transitions to proposed states with poorer likelihoods with probability proportional to the ratio of likelihood of the proposed state to the likelihood of the current state. By allowing unfavorable transitions, the MCMC may escape local minima.
At the start of each MCMC run, we begin at a state which includes no links (i.e., X=∅) and where each parameter is set to a value drawn uniformly at random from its respective set of possible values. We then allow the MCMC to progress for a designated number of steps, as described below.
Reporting the BPN
We run the MCMC for a burnin period of 10^{7} steps. Following the burnin period, we run the MCMC for 10^{8} steps, recording at each step the value of each parameter λ, α, and β, as well as the links in X. Finally, the probability of a parameter (λ, α, or β) being a particular value is the fraction of recorded steps in which the parameter was observed at this value. The probability of a link Pr(l) is the fraction of recorded steps in which a given link l∈L was observed in X. We reported links with probabilities meeting or exceeding a userdefined threshold θ (we used θ=0.7) as those comprising the BPN, i.e., the reported BPN is the set of links {l∈L∣ Pr(l)≥θ}.
Computation of BPLNs
For each experiment, using significantly perturbed processes as determined by GSEA and the subnetwork induced by the set of perturbed interactions D as the inputs, we computed BPLNs as described by DotanCohen et al.[13]. Briefly, to test whether a link exists from one process to another, BPLN counts the number of genes belonging to the second process that also neighbor genes in the first process. Using a onesided Fisher’s Exact Test, it then determines whether this count is greater than expected by chance. After applying BenjaminiHochberg correction for multiple hypothesis testing [28], we declared significant and included all links which had a qvalue≤0.05. Since links in BLPN are directed, and links in BPNs returned by the MCMC method are undirected, we considered two processes p_{ i },p_{ j }∈P to be linked in the BPLN if the qvalue for either (p_{ i },p_{ j }) or (p_{ j },p_{ i }) was significant.
Measuring redundancy within a BPN
Redundancy of links
To assess the redundancy of links in a BPN, we calculated the JI of the sets of interactions crossannotated by every pair of links in the BPN. We calculated the JI on the basis of only perturbed interactions and only unperturbed interactions. For each link, we recorded the maximum JI between that link and all other links. We computed the mean of each of the JIs for each link over the BPNs computed for a contrast by independent executions of MCMC.
Explaining links per interaction
For each interaction explained by one or more links in a BPN, we counted the number of links that explain it. For every positive integer, k, we recorded the fraction of interactions which that were explained by k links. We reported the average fraction over the BPNs computed for a contrast.
Software availability
Our Python implementations of MCMCBPN, CBPLN, and BPLN are available under the Open Source Initiativeapproved MIT License from the Python Package Index at http://pypi.python.org/pypi/BiologicalProcessNetworks.
Declarations
Acknowledgements
We gratefully acknowledge financial support for this work from National Science Foundation grant CBET0933225, Environmental Protection Agency STAR grant EPARD83499801, the Institute for Critical Technology and Applied Sciences Center for Systems Biology of Engineered Tissues at Virginia Tech, and the Genetics, Bioinformatics, and Computational Biology Interdisciplinary Ph.D. Program of Virginia Tech.
Authors’ Affiliations
References
 Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Soboleva A: NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res. 2010, 39 (Database): D1005D1010.PubMedPubMed CentralView ArticleGoogle Scholar
 Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, Holloway E, Lukk M, Malone J, Mani R, Pilicheva E, Rayner TF, Rezwan F, Sharma A, Williams E, Bradley XZ, Adamusiak T, Brandizi M, Burdett T, Coulson R, Krestyaninova M, Kurnosov P, Maguire E, Neogi SG, RoccaSerra P, Sansone S: ArrayExpress update–from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 2009, 37 (Database): D868D872. 10.1093/nar/gkn889.PubMedPubMed CentralView ArticleGoogle Scholar
 Tarcea VG, Weymouth T, Ade A, Bookvich A, Gao J, Mahavisno V, Wright Z, Chapman A, Jayapandian M, Ozgur A, Tian Y, Cavalcoli J, Mirel B, Patel J, Radev D, Athey B, States D, Jagadish HV: Michigan molecular interactions r2: from interacting proteins to pathways. Nucleic Acids Res. 2009, 37 (Database): D642D646. 10.1093/nar/gkn722.PubMedPubMed CentralView ArticleGoogle Scholar
 Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C: STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 2009, 37 (suppl_1): D412416.PubMedPubMed CentralView ArticleGoogle Scholar
 Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Nat Acad Sci. 2001, 98 (9): 51165121. 10.1073/pnas.091062498.PubMedPubMed CentralView ArticleGoogle Scholar
 Smyth GK: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Edited by: Irizarry R, Gentlemen R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 397420.View ArticleGoogle Scholar
 Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledgebased approach for interpreting genomewide expression profiles. Proc Nat Acad Sci USA. 2005, 102 (43): 1554515550. 10.1073/pnas.0506580102.PubMedPubMed CentralView ArticleGoogle Scholar
 Kim S, Volsky D: BMC Bioinformatics. 2005, 6: 14410.1186/147121056144.PubMedPubMed CentralView ArticleGoogle Scholar
 Lu Y, Rosenfeld R, Simon I, Nau GJ, BarJoseph Z: A probabilistic generative model for GO enrichment analysis. Nucleic Acids Res. 2008, 36 (17): e10910.1093/nar/gkn434.PubMedPubMed CentralView ArticleGoogle Scholar
 Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37: 113. 10.1093/nar/gkn923.PubMed CentralView ArticleGoogle Scholar
 Bauer S, Gagneur J, Robinson PN: GOing Bayesian: modelbased gene set analysis of genomescale data. Nucleic Acids Res. 2010, 38 (11): 35233532. 10.1093/nar/gkq045.PubMedPubMed CentralView ArticleGoogle Scholar
 Li Y, Agarwal P, Rajagopalan D: A global pathway crosstalk network. Bioinformatics. 2008, 24 (12): 14421447. 10.1093/bioinformatics/btn200.PubMedView ArticleGoogle Scholar
 DotanCohen D, Letovsky S, Melkman AA, Kasif S: Biological process linkage networks. PLoS ONE. 2009, 4 (4): e531310.1371/journal.pone.0005313.PubMedPubMed CentralView ArticleGoogle Scholar
 Wang Q, Sun J, Zhou M, Yang H, Li Y, Li X, Lv S, Li X, Li Y: A novel networkbased method for measuring the functional relationship between gene sets. Bioinformatics. 2011, 27 (11): 15211528. 10.1093/bioinformatics/btr154.PubMedView ArticleGoogle Scholar
 Lasher CD, Rajagopalan P, Murali TM: PLoS ONE. 2011, 6: e1524710.1371/journal.pone.0015247.PubMedPubMed CentralView ArticleGoogle Scholar
 Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: Equation of state calculations by fast computing machines. J Chem Phys. 1953, 21 (6): 108710.1063/1.1699114.View ArticleGoogle Scholar
 Kim Y, Lasher CD, Milford LM, Murali T, Rajagopalan P: A comparative study of genomewide transcriptional profiles of primary hepatocytes in collagen sandwich and monolayer cultures. Tissue Eng Part C: Methods. 2010, 16 (6): 14491460. 10.1089/ten.tec.2010.0012.View ArticleGoogle Scholar
 Wurmbach E, Chen Y, Khitrov G, Zhang W, Roayaie S, Schwartz M, Fiel I, Thung S, Mazzaferro V, Bruix J, Bottinger E, Friedman S, Waxman S, Llovet JM: Genomewide molecular profiles of HCVinduced dysplasia and hepatocellular carcinoma. Hepatology. 2007, 45 (4): 938947. 10.1002/hep.21622.PubMedView ArticleGoogle Scholar
 Marcellin P: Hepatitis B and hepatitis C in 2009. Liver Int. 2009, 29: 18.PubMedView ArticleGoogle Scholar
 Yang JD, Roberts LR: Hepatocellular carcinoma: a global view. Nat Rev Gastroenterol Hepatol. 2010, 7 (8): 448458. 10.1038/nrgastro.2010.100.PubMedPubMed CentralView ArticleGoogle Scholar
 Ruepp A, Waegele B, Lechner M, Brauner B, Fobo G, Frishman G, Montrone C, Mewes H, DungerKaltenbach I: CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 2009, 38 (Database): D497D501.PubMedPubMed CentralView ArticleGoogle Scholar
 Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GSS, Venugopal AK, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi SK, Tattikota SG, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob HKC, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra YL, Rahiman BA, Prasad TSK, Lin J, Houtman JCD, Desiderio S, Renauld J: NetPath: a public resource of curated signal transduction pathways. Genome Biol. 2010, 11: R310.1186/gb2010111r3.PubMedPubMed CentralView ArticleGoogle Scholar
 Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH: PID: the Pathway Interaction Database. Nucleic Acids Res. 2009, 37 (Database): D674D679. 10.1093/nar/gkn653.PubMedPubMed CentralView ArticleGoogle Scholar
 Berriz GF, Roth FP: The Synergizer service for translating gene, protein and other biological identifiers. Bioinformatics. 2008, 24 (19): 22722273. 10.1093/bioinformatics/btn424.PubMedPubMed CentralView ArticleGoogle Scholar
 Baron D, Bihouée A, Teusan R, Dubois E, Savagner F, Steenman M, Houlgatte R, Ramstein G: MADGene: retrieval and processing of gene identifier lists for the analysis of heterogeneous microarray datasets. Bioinformatics. 2011, 27 (5): 725726. 10.1093/bioinformatics/btq710.PubMedPubMed CentralView ArticleGoogle Scholar
 Arias IM, Boyer JL, Chisari FV, Fausto M, Schachter D, Shafritz DA: The Liver: Biology and Pathobiology. 2001, Philadelphia: Lippincott Williams and WilkinsGoogle Scholar
 Andrieu C, De Freitas N, Doucet A, Jordan MI: An introduction to MCMC for machine learning. Mach Learn. 2003, 50: 543. 10.1023/A:1020281327116.View ArticleGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. Ser B (Methodological). 1995, 57: 289300.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.