Identification of direction in gene networks from expression and methylation
- David M Simcha^{1}Email author,
- Laurent Younes^{2},
- Martin J Aryee^{3, 4} and
- Donald Geman^{5}
https://doi.org/10.1186/1752-0509-7-118
© Simcha et al.; licensee BioMed Central Ltd. 2013
Received: 17 September 2012
Accepted: 17 October 2013
Published: 1 November 2013
Abstract
Background
Reverse-engineering gene regulatory networks from expression data is difficult, especially without temporal measurements or interventional experiments. In particular, the causal direction of an edge is generally not statistically identifiable, i.e., cannot be inferred as a statistical parameter, even from an unlimited amount of non-time series observational mRNA expression data. Some additional evidence is required and high-throughput methylation data can viewed as a natural multifactorial gene perturbation experiment.
Results
We introduce IDEM (Identifying Direction from Expression and Methylation), a method for identifying the causal direction of edges by combining DNA methylation and mRNA transcription data. We describe the circumstances under which edge directions become identifiable and experiments with both real and synthetic data demonstrate that the accuracy of IDEM for inferring both edge placement and edge direction in gene regulatory networks is significantly improved relative to other methods.
Conclusion
Reverse-engineering directed gene regulatory networks from static observational data becomes feasible by exploiting the context provided by high-throughput DNA methylation data.
An implementation of the algorithm described is available at http://code.google.com/p/idem/.
Keywords
Background
As the analysis of high-throughput gene expression data, notably phenotypic classification of samples [1–7], has expanded and matured, the focus has begun to shift towards mechanism and systems modeling [8–10]. In particular, much of the unrealized value of high-throughput molecular data may be in increasing our understanding of how various molecules interact in vivo, i.e., by reverse-engineering biological networks, hopefully revealing how disease states form and what targets might be available for their treatment. In the case of transcript data the most relevant type of network is one modeling transcriptional regulation. This may be thought of as a causal graph, wherein each node represents a variable and a directed edge is placed from every cause to each of its direct effects. From a causal gene regulatory graph, one then infers the effects of under- or over-expressing the mRNA level of one gene on the mRNA expression of other genes. The definition in terms of mRNA expression is a pragmatic choice, as this can be easily measured in a high-throughput fashion and can serve as a surrogate for protein concentration under some circumstances [11].
Because of this non-identifiability, the central challenge in reverse-engineering a gene regulatory network from observational data is placing directed edges: determining the direction of the causal arrow between a pair of genes that are irreducibly statistically dependent and believed at least provisionally to be causally related. However, an isolated causal assumption cannot be tested using only observational data [14]. Therefore, most past attempts to reverse-engineer gene regulatory networks using expression data fall into one of three categories. The first category of methods requires time series data and assumes that cause will temporally preceed effect, examples being dynamic Bayesian networks [15, 16], Granger causality [17] or a similar time shifted correlation technique [18]. The second category uses techniques such as ordinary differential equations and requires targeted perturbation of specific genes in quantitatively well-defined ways [19]. Compounding these difficulties, time series or perturbation data is often difficult to obtain, either for ethical or technical reasons, from human in vivo biological states. Methods in the third category utilize static data but allow edge directions to remain unidentifiable for many or all subgraphs; these include information-theoretic algorithms [20, 21], decision tree based algorithms [22], and static Bayesian network algorithms [23–25].
Our approach to dealing with causal direction is to broaden the context beyond mRNA expression and extract information from high-throughput data about auxiliary variables associated with each gene. We use causal assumptions which are justified on biological rather than computational grounds about connections in the extended network between the genes and the auxiliary variables. Finally, we then test whether the observed data is consistent with additional causal assumptions under the Causal Markov Condition. Even though a causal assumption cannot be tested in isolation using observational data, a set of causal assumptions can yield predictions that can be tested using such data [14]. We use methylation data in this study to illustrate our approach although other choices are possible. In mammalian cells, DNA methylation in the promoter region of a gene is frequently used as an epigenetic gene silencing signal. Notably, changes in promoter region methylation appear to cause targeted, gene-specific effects. For example, methylation appears to play a role in maintaining gene silencing in genomic imprinting [26]. Similarly, tumor suppressor genes are frequently hypermethylated in cancer [27, 28]. Techniques have been recently developed for measuring methylation in a high-throughput fashion [29]. In many cases, such measurements provide the context necessary to make edge directions identifiable when reverse-engineering gene regulatory networks.
This context is exploited by building enhanced directed regulatory network using two types of nodes for each gene: conventional ones representing mRNA expression and others representing methylation levels, both measurements being obtained from non-time series, high-throughput, observational data. A key simplifying assumption usually made in methodologies designed for large-scale reverse-engineering [20, 21], including ours, is that biological interactions among genes (such as regulator-target relationships) imply statistical dependence at a pairwise level. Therefore, for every pair of genes with a significant statistical interaction, we construct a simple, four-variable Bayesian network representing two mRNA variables (i.e., two genes) and two corresponding methylation states. This construction includes estimating the direction of the arrow between the two genes using a likelihood ratio test. In effect, the resulting algorithm, IDEM (Identification of Direction from Expression and Methylation) can be thought of as a taking advantage of a natural multifactorial gene perturbation experiment. When a gene promoter region becomes differentially methylated across samples, the expression of the target gene may be perturbed. Measuring promoter region methylation can be thought of as measuring how the system has been perturbed. A key model assumption, therefore, is that methylation of the promoter region of a gene directly affects the mRNA expression of the downstream gene and does not directly causally affect the expression of any other gene. Under this assumption, discovering the direction of a causal edge from non-time series observational data becomes possible in some cases, especially in subnetworks that are acyclic (tree-like). (Details are in the Theoretical results section).
Our main contribution is then a simple and novel statistical test to determine the direction of regulatory edges from the joint distribution of methylation and mRNA expression data. No time-series or intervention data is used. We explore the conditions under which this test is consistent from a theoretical perspective and measure its accuracy empirically using both real and simulated data, demonstrating that causal gene regulatory networks can at least be partially inferred from static observational data provided there is sufficient auxiliary information about the regulatory context.
Methods
Data acquisition and pre-processing
Sample size
Dataset | N samples | N genes |
---|---|---|
GBM | 279 | 11834 |
Ovarian | 536 | 11270 |
Where technical replicates existed, the values were averaged. Probes for which data was missing for any sample were discarded. To reduce the severity of batch effects [31], the batch-specific mean expression or methylation value for each gene was subtracted out, and then the global mean added back. Thus, all batches were forced to have the same mean for any given gene. Both expression and methylation data were then transformed to rank space on a per-sample basis. Finally, to simplify computations involving mutual information, each probe was binned into B equal frequency bins. Where multiple probe sets (for expression or methylation) mapped to the same gene, the pair of probe sets (one for methylation, one for expression) with the highest mutual information was selected to represent that gene. Preprocessed data can be found in Additional files 1, 2, 3 and 4.
Causal graphs and Bayesian networks
As indicated earlier, the relationship between causal graphs and Bayesian networks is complex. First, the Causal Markov Condition holds for a causal graph G containing vertices V only if all common causes of any pair of variables in V are included in the set of vertices V[12]. Marginalizing over common causes excluded from V can introduce statistical dependencies not predicted by the Causal Markov Condition as applied to G. Even under the strong assumptions that all common causes are measured (causal sufficiency [12]) and that no independences not implied by the causal DAG and Causal Markov Condition are present (causal faithfulness [12, 13]), the Causal Markov Condition does not imply that a Bayesian network graph that accurately describes the independence relationships among the variables under study accurately represents causality when interpreted as a causal graph. Finally, there is the identifiability problem illustrated in Figure 1.
Statistical framework
As mentioned above, a standard assumption in learning regulatory networks is that the mRNA levels of pairs of genes which are biologically interacting are dependent statistically as random variables. Plausible scenarios where this assumption is untrue do exist. For example, if a gene is regulated by the interaction of two regulators via XOR logic (activation if both regulators are active or if both are inactive and inhibition otherwise), then the gene can be independent of each of its regulators taken individually. On the other hand, the number of parameters to be estimated increases exponentially in the order (binary, ternary, etc.) of interactions to be learned. Therefore, we argue that for realistic sample sizes the bias error created by ignoring these higher-order scenarios will likely be smaller than the variance error suffered by attempting to recover them. It’s also important to note that statistical dependence alone does not imply causal dependence, for example if the expression of two genes has a common regulator or hidden variable. We attempt to mitigate this with the non-causal (NC) pruning and data processing inequality steps detailed later in this section.
Let G be the set of all genes for which both mRNA expression and promoter region methylation data are available. For any gene g ∈ G, let M_{ g } be the promoter region methylation of gene g and let E_{ g } be the mRNA expression level of gene g. As is common practice, the mutual information I(X;Y) between two random variables X, Y[32] will serve as a test statistic for independence, recalling that X ⊥ Y if and only if I(X;Y) = 0. Similarly, for three random variables X, Y, Z, we have X ⊥ Y|Z if and only if I(X;Y|Z) = 0. (Here and in the rest of this paper, we will use the standard notation X ⊥ Y to indicate that variables X and Y are independent and X ⊥ Y|Z to indicate that they are conditionally independent given a third variable, Z).
The first step of IDEM is to construct a mutual information relevance network [21] for mRNA expression. This means placing an undirected edge between every pair of genes G_{1} and G_{2} if the empirical evaluation of the mutual information of their mRNA expression (that we will denote $\xce({E}_{1};{E}_{2})$) exceeds a threshold and concluding that E_{1} and E_{2} are not statistically independent. Using the fact that, under null hypothesis that E_{1} ⊥ E_{2}, Wilks’ Theorem [33] implies that $2\mathrm{N\xce}({E}_{1},{E}_{2})$ approximately follows a chi-square with (B - 1)^{2} degrees of freedom where N represents the sample size, we place an undirected edge between E_{1} and E_{2} when the corresponding p-value is less than some value α.
Local Bayesian network
where we can assume all variables except V take values in {1, …, B}. For simplicity we will write p(e_{1}) for P(E_{1} = e_{1}), p(e_{1}|e_{2}) for P(E_{1} = e_{1}|E_{2} = e_{2}), and so forth. The meaning of all marginal and conditional distributions should be clear from the context.
The key difference is between the two models how the methylation of one gene affects the expression of the other. In Model 1 we are assuming that E_{1} ⊥ M_{2}|M_{1} and E_{2} ⊥ M_{1}|E_{1}, M_{2}, whereas in Model 2 it is the reverse: E_{2} ⊥ M_{1}|M_{2} and E_{1} ⊥ M_{2}|E_{2}, M_{1}.
From biological knowledge about methylation we know that one of these models is the correct causal graph for {M_{1}, M_{2}, E_{1}, E_{2}} if a causal link exists between E_{1}, E_{2}. This model is also accurate as a Bayesian network graph if marginalizing over all other relevant variables in the full network (such as the expression and methylation of other genes) does not introduce any additional statistical dependencies among {M_{1}, M_{2}, E_{1}, E_{2}}. This is because the Causal Markov Condition assumes that all common causes are included in a model. However, only low-order analysis is statistically and computationally feasible for large gene regulatory networks.
A likelihood ratio test for direction
In summary, assuming a local Bayesian network for which the Causal Markov Condition holds, and assuming E _{1} and E _{2} are causally linked, the direction of the edge between them is identifiable except in the degenerate case in which all four independence statements are true.
By the Law of Large Numbers, logρ converges to E_{ P }l l r(E_{1}, E_{2}, M_{1}, M_{2}) when the sample size goes to infinity. This is the test statistic for our test for edge direction. Except in the degenerate case mentioned above, the logarithm of ρ divided by N converges to a strictly positive value under Model 1 and a strictly negative value under Model 2. Hence: IDEM Decision Rule: Select Model 1 if ρ > 1 and Model 2 if ρ < 1.
Consequently, IDEM places an oriented edge from E_{1} to E_{2} if and only if $\xce({E}_{1},{E}_{2}|{M}_{2})>\xce({E}_{1},{E}_{2}\left|{M}_{1}\right)$, i.e., if M_{1} has a stronger effect in decoupling E_{1} and E_{2} than M_{2}.
Pruning
We now introduce two pruning criteria that will reduce the large number of edges typically returned by relevance networks. The first criterion attempts to detect dependencies between E_{1} and E_{2} that could be due to co-regulation induced by a third, unobserved, variable, while the second ones prunes triangles by removing their weakest edge based on the data processing inequality.
Non-causal pruning
Indirect edge removal
After edges are placed, maximum likelihood directions determined, and NC pruning is carried out, the graph is pruned using the data processing inequality method as introduced in ARACNE [20]. The goal here is to remove spurious edges, for which nonzero mutual information can be attributed to indirect interactions among modeled variables. For example, if Gene A regulates gene B and gene B regulates gene C, then I(A;C) may be strictly positive even if no direct regulatory link exists. Since feedback loops cannot be detected from non-time series observational data and inconsistencies in the likelihood ratio test in loopy scenarios prevent reliable detection of feed-forward loops, the most parsimonious explanation for a three-clique in the relevance network is that the weakest edge (the one with the smallest mutual information) is due to indirect regulation. Therefore, this edge removed.
GENIE3 comparison
The results of IDEM were compared to those produced by GENIE3 [22]. GENIE3 attempts to reverse engineer directed edges (though the biological interpretation of the direction is not explicitly stated) using only expression data, even though causal direction is often not identifiable from such data even under the strong assumptions of causal sufficiency and causal faithfulness. The purpose of this comparison is to demonstrate the value added by methylation data, especially when inferring directed networks. The data provided to GENIE3 was the same expression data provided to IDEM and was fully preprocessed except for the binning step. Since GENIE3 produces a weight for each edge rather than a hard decision, we considered the top M edges for each dataset, where M is the number of edges (both true and false) discovered by IDEM on the same dataset. Since GENIE3 produces a weight for both directions for every edge, only the larger of the two were considered when selecting the top M edges.
Results
IDEM was applied to the TCGA datasets mentioned previously, as well as the synthetic datasets, with B = 2. At the sample sizes available, using B = 2 proved empirically more successful than larger values of B. (Data not shown). Since α = 10^{-3} provided a good balance between high precision and sufficient recall to draw meaningful conclusions, detailed analyses (all analyses except the PR curve) were performed using α = 10^{-3}. The full network that we reverse-engineered for each dataset is available in Additional files 5 and 6.
Synthetic data
We first attempted to validate IDEM using synthetic expression data generated using GeneNetWeaver [34–36], the software used to simulate data for the DREAM challenge. Since differential methylation constitutes a natural multifactorial perturbation experiment in our model and GeneNetWeaver does not include any facilities to simulate methylation data, we created a set of perturbations analogous to methylation data as described below. We then used GeneNetWeaver’s multifactorial perturbation feature to generate expression data with these perturbations. The perturbations were generated by a procedure that was designed to make the distribution of absolute correlations between methylation variables and corresponding expression variables similar to that observed on the real GBM and ovarian data. For each gene g, first generate a standard deviation σ_{ g } from a uniform distribution over [0,0.05], and then generate perturbation values m_{g,j}, j = 1, …, N, from a Normal(0, σ_{ g }) density. The matrix of perturbations (i.e., genes by samples) was supplied to IDEM as the “methylation” data.
A network of 1,000 genes was generated by randomly placing approximately 2,000 edges. Each gene g was assigned a random outgoing and incoming weight, which were proportional to the probabilities of each edge being outgoing from g and incoming to g respectively, or in other words proportional to the expected out and in degree of each gene. The outgoing weights were sampled from the distribution P(W) = w^{-2}, w ≥ 1. The intent was for the out degree distribution to approximate a scale-free network. The incoming weights were sampled from P(W) = 2e^{-2w}, w ≥ 0. This network is available in Additional file 7.
After the network and perturbation data were generated, the expression data was generated using GeneNetWeaver, using the stochastic differential equation model and otherwise using the default settings. The expression data used included the simulated microarray noise that GeneNetWeaver is capable of producing. The synthetic datasets are available in Additional files 8 and 9.
Edge placement benchmark
No predicted R → T | Predicted R → T | |
---|---|---|
No actual R → T | TN | FP |
Actual R → T | FN | TP |
Synthetic data edge placement results
Method | N | Precision | Null | Recall | Null | Fisher |
---|---|---|---|---|---|---|
samples | prec. | recall | p-value | |||
IDEM | 100 | 0.565 | 0.0039 | 0.007 | 4.6e-5 | 3.6e-28 |
IDEM | 500 | 0.681 | 0.0039 | 0.032 | 1.8e-4 | 5.1e-127 |
IDEM | 1000 | 0.692 | 0.0039 | 0.057 | 3.3e-4 | 6.1e-226 |
GENIE3 | 100 | 0.130 | 0.0039 | 0.002 | 4.6e-5 | 9.7e-5 |
GENIE3 | 500 | 0.165 | 0.0039 | 0.008 | 1.8e-4 | 2.7e-20 |
GENIE3 | 1000 | 0.119 | 0.0039 | 0.010 | 3.3e-4 | 1.5e-22 |
Synthetic data edge direction results
Method | N samples | Fract correct | Binomial p-value |
---|---|---|---|
IDEM | 100 | 1 | 0.001 |
IDEM | 500 | 1 | 2.2e-19 |
IDEM | 1000 | 1 | 7.7e-34 |
GENIE3 | 100 | 0.43 | 0.77 |
GENIE3 | 500 | 0.48 | 0.64 |
GENIE3 | 1000 | 0.5 | 0.56 |
Knockdown validation
Knockdown experiments, where the expression of individual genes is perturbed in a targeted manner, can provide valuable information about regulatory networks. To the best of our knowledge, no publicly available knockdown data exists for the same tissue types for which TCGA methylation and expression data are available. We therefore evaluted IDEM’s predictions using a publicly available siRNA knockdown dataset from a human myeloid leukemia cell line [37]. This dataset was used under the assumption that gene regulatory networks are partially conserved across tissue types. It contains expression levels for control samples as well as samples with approximately 50 genes knocked down. The genes that were knocked down are referred to as “knockdown genes”. Seventeen negative control replicates were included and for most knockdown genes three replicates were included. Where multiple probe sets mapped to the same gene, the maximum expression level was used.
Since knocking down a gene is an intervention in the context of a controlled experiment, changes in a gene’s expression upon knocking down the knockdown gene are assumed to be caused by the knockdown. Therefore, we declared a target gene T to be regulated by a regulator R if T was differentially expressed between control samples and samples with R knocked down with a p-value ≤0.01 as assessed by the Wilcoxon Rank Sum Test and with a fold change greater than two between the median control and knockdown expression levels. We also required that T be expressed in either the control or knockdown samples with a geometric mean detection p-value ≤0.001 for the probe set used to represent each gene. This definition allows R to regulate T indirectly. Distinguishing direct from indirect regulation was not feasible given the nature of the knockdown dataset. To accommodate this ambiguity, the indirect edge removal step of the IDEM algorithm was skipped when preparing IDEM predictions for this validation.
Knockdown data edge direction benchmark
Predicted T → R | Predicted R → T | |
---|---|---|
No Knockdown R → T | N _{11} | N _{12} |
Knockdown R → T | N _{21} | N _{22} |
Knockdown data edge placement results
Dataset | Method | Precision | Null | Recall | Null | Fisher |
---|---|---|---|---|---|---|
prec. | recall | P-Val | ||||
GBM | IDEM | 0.0367 | 0.017 | 0.015 | 0.0073 | 1.13e-16 |
Ovarian | IDEM | 0.0397 | 0.018 | 0.018 | 0.0081 | 7.68e-21 |
GBM | GENIE3 | 0.032 | 0.017 | 0.010 | 0.0057 | 1.76e-08 |
Ovarian | GENIE3 | 0.033 | 0.018 | 0.011 | 0.0058 | 3.87e-09 |
Knockdown data edge direction results
Dataset | Method | P _{ signif } | P _{ non-signif } | Fisher P-val |
---|---|---|---|---|
GBM | IDEM | 0.606 | 0.51 | 0.0019 |
Ovarian | IDEM | 0.618 | 0.529 | 0.0025 |
GBM | GENIE3 | 0.680 | 0.656 | 0.305 |
Ovarian | GENIE3 | 0.705 | 0.696 | 0.452 |
KEGG validation
KEGG validation
Dataset | N correct direction | N interactions | binomial P-Val |
---|---|---|---|
Ovarian | 20 | 44 | 0.774 |
GBM | 33 | 51 | 0.024 |
Computational complexity
IDEM is designed to scale computationally to large datasets. Therefore, each step is of reasonable time complexity. Let N be the number of samples and M be the number of genes. The time complexity of the binning step is O(M N logN). The complexity of building the relevance network is O(M^{2}N). The time complexity of performing the likelihood ratio test is O(E N) where E is the number of edges remaining at this step. The time complexity of the indirect edge removal is O(M^{3}) in the worst case but in practice much smaller because the graph to which it is applied is typically sparse. Every major step can also be efficiently parallelized, and the building of the relevance network, the application of the likelihood ratio test and the application of the data processing inequality are parallelized in our reference implementation. The reference D implementation takes under 10 minutes to run an 8-core Xeon X5647 machine for any dataset described.
Theoretical results
This section proves a set of results with regard to the consistency of IDEM. Consistency means that, given infinite samples, the correct causal graph would be recovered. Since this section discusses consistency, it is assumed that, given adequate sample size, an arbitrarily large number of bins could be used for the binning process. This would asymptotically eliminate any biases due to the binning process.
Proof of consistency of likelihood ratio test
so that E_{ P }(llr) = R_{2} - R_{1}.
Effects of marginalization
This subsection examines the effects of marginalizing on any variables not present in {M_{1}, M_{2}, E_{1}, E_{2}} on the identifiability of the E_{1} - E_{2} edge, assuming that marginalization adds dependencies not present in Model 1 or Model 2. In this section, graphs are to be interpreted only as Bayesian networks and not as causal graphs, since the intent is to explore the effects of missing causal variables that introduce additional dependencies.
To simplify the notation, we marginalize over V first, which is equivalent to placing an edge between M_{1}, M_{2}. The direction of this edge is not important since both directions lead to the same independence relationships among {M_{1}, M_{2}, E_{1}, E_{2}}.
There are two edges that can be added to this graph while keeping it directed acyclic and without adding vertices: M_{1} - E_{2} or M_{2} - E_{1}. (Since these edges appear on the diagonal in all figures in this section, we refer to them as “diagonal edges”). Again, the direction is not important as long as the directions of these edges and the M_{1} - M_{2} edge are chosen such that the graph is acyclic. The diagonal edges are not to be interpreted biologically as direct causal relations. They only model the statistical effects of marginalizing over variables excluded from Models 3 and 4. If both edges are added, the graph is fully connected. No independence relationships remain regardless of the direction of the edge between E_{1}, E_{2}, so this direction is unidentifiable.
Under Model 3 E_{1} ⊥ M_{2}|M_{1} and under Model 4 E_{1} ⊥ M_{2}|M_{1}, E_{2}. Using notation and arguments similar to those in the previous section it can be shown that E_{ P }[ R_{34}] = I(M_{2};E_{1}|M_{1},E_{2}) - I(M_{2};E_{1}|M_{1}).
Therefore, these models are identifiable via a likelihood ratio test unless the independence relationships implied by both models apply simultaneously.
Proof of consistency for acyclic causal Markov graphs
- 1.
The Causal Markov Condition applies for G. Most importantly, this means no common causes have been omitted [12].
- 2.
Causal faithfulness [12, 13] applies. This means that only the independence relationships specified by the causal graph and Causal Markov Condition exist.
- 3.
The causal graph must be acyclic. This means no directed or undirected cycles.
- 4.
The Data Processing Inequality must be strict. Let A - B - C be a Markov chain implied by a causal graph and the Causal Markov Condition. Then I(A;C) < min(I(A;B), I(B;C)). If I(A;C) = min(I(A;B), I(B;C)) then direct vs. indirect causality will not be identifiable.
- 1.
The ARACNE [20] method is consistent in its recovery of irreducible pairwise statistical dependencies if the graph of these dependencies is a tree. (Since the graph produced by ARACNE is undirected, a tree is equivalent to an acyclic graph.) The proof can be found in the ARACNE reference. Since IDEM uses the mutual information relevance network and data processing inequality steps from ARACNE, the same logic applies to it. IDEM will also recover irreducible pairwise statistical dependences.
- 2.
Given an acyclic graph G for which the Causal Markov Condition applies, irreducible statistical dependency between variables A, B as defined in [20] exists only if a causal edge exists in G between A, B. This is best demonstrated by enumerating cases. In an acyclic causal graph there are three possible scenarios:
a.) There is a causal path between A and B. If this is not a direct causal path, then the A - B edge will be eliminated by the Data Processing Inequality step if the Data Processing Inequality is strict.
b.) There is no causal path between A and B but there is still statistical dependency. Then A and B must have a common cause.
Then the A - B edge will be eliminated by the Data Processing Inequality step.
c.) There is no causal path between A and B and no common cause. A will then be independent of B.
- 3.
If there are no cycles the likelihood ratio test as described previously is consistent with respect to edge direction between E _{1}, E _{2}. The degenerate case cannot occur under causal faithfulness. Assume without loss of generality that for some set of variables {E _{1}, E _{2}} the edge direction is E _{1} → E _{2}. Adding the respective methylation variables and all possible connections that {M _{1}, M _{2}, E _{1}, E _{2}} might have to the larger causal graph yields the graph shown in Figure 8. This is the most general acyclic model since the variables in {A, B, C, D, F, G, H, K} may be multidimensional and cannot be connected to each other without forming a cycle. Under this model the two independence relationships that apply to Model 1 as depicted in Figure 3 apply to {M _{1}, M _{2}, E _{1}, E _{2}}, so the likelihood ratio test is consistent.
This set of conditions is more general than it appears at first glance. While the complete causal graph of all genes is almost certainly not acyclic, some subgraphs may be. If no two nodes in this subgraph have a common cause outside the subgraph then the Causal Markov Condition applies and if all other conditions are met IDEM will produce consistent results.
Discussion
To the best of our knowledge, this work is the first attempt to mitigate the problem of identifying edge direction in gene regulatory networks using only high-throughput, non-time series, observational data. The performance of the algorithm on synthetic data using the GeneNetWeaver simulator is excellent. However, the discrepancy between the accuracy of IDEM as assessed by synthetic data vs. real expression, methylation and knockdown data is substantial. The knockdown benchmark results should be taken as a lower bound on the performance of IDEM. The knockdown data comes from a different cell type than those available in TCGA. Typically only three replicates are available for each knockdown experiment, decreasing the power to infer weak regulation. To compensate we considered differential expression statistically significant if the p-value was ≤0.01 without adjusting for multiple testing. Therefore, a significant number of edges in our “ground truth” data are likely false positives. Furthermore, it is possible that a substantial number of edges inferred from the knockdown data are the result of batch effects. Finally, since the knockdown data does not allow direct vs. indirect regulation to be distinguished, the indirect edge pruning step of IDEM is not used for this benchmark. Weak, indirect edges may be much harder to reverse-engineer in practice than strong, direct edges. Nonetheless our highest confidence edge direction predictions achieve an accuracy of 64–67% using only non-time series observational data. Likewise, IDEM’s performance in correctly predicting edge direction between members of the Pathways in Cancer KEGG gene set also represents a lower performance bound. This pathway represents gene relationships described across a multitude of experiments in many different cancer systems. As a result, it is likely that many of these edges are weak or non-existant in our test datasets, hampering IDEM’s ability to correctly infer directionality.
The consideration of only pairwise and (for pruning indirect edges) three-way interactions can clearly lead to biases, especially in the case of loopy networks. A significant difficulty, detailed in Theoretical results, is that marginalization over variables outside the {M_{1}, M_{2}, E_{1}, E_{2}} subnetwork might add statistical dependencies not suggested by Model 1 or Model 2 and make our likelihood ratio test inconsistent. However this is not an issue for acyclic subgraphs for which the Causal Markov Condition holds. Furthermore, in some cases a more complex likelihood ratio test can work around the marginalization issue even while involving only variables in {M_{1}, M_{2}, E_{1}, E_{2}}. Examining higher- order joint probabilities would increase the amount of data required to maintain a constant level of estimation error by an exponential factor in the dimensionality of the interactions examined. Empirically, despite the biases introduced by low-order analysis, accuracy of more traditional pairwise methods on real and simulated data is comparable to the accuracy of methods that use higher-order analyses [39].
It is important to note that, in our network, an edge need not represent direct transcription factor binding. The reverse-engineered network is a purely phenomenological prediction of what genes would be affected if the mRNA expression of a given gene were perturbed. For example, consider a hypothetical gene K. When K is expressed, it produces a kinase protein that interacts at the protein-protein level with a constitutively expressed transcription factor protein F. When F is phosphorylated, it activates or inhibits the transcription of a target gene T. Since F is constitutively expressed, its expression will not have high mutual information with any other gene’s expression. Biologically, the gene most relevant in explaining variation in the expression of T is K. In our method, K will have large mutual information with T and the edge K – T will likely be inferred. This edge does not represent direct transcription factor-target binding but is nonetheless biologically meaningful in that perturbing K would affect the expression of T with no other mRNA concentrations being affected as a necessary condition. Similarly, due to our lack of either expression or methylation data for about half of all known genes, an edge placed between two available genes might physically involve a third, unmodeled gene as an intermediary.
The primary practical use for IDEM will likely be generating hypotheses about the nature of human disease states, or treatment targets for such diseases. A significant benefit to the methodology is that the amount of publicly available joint methylation and mRNA expression data is rapidly increasing. As such data increases in availability, various datasets can be combined to produce a network with increasing sensitivity. Two limitations of IDEM are i)the use of methylation-induced epigenetic silencing to provide context for reverse-engineering directed edges precludes the use of this method in lower organisms in which methylation-induced silencing is not widely used; and ii) a graph of a regulatory network, built from publicly available data, does not provide clear conclusions on its own, but provides a useful starting point from which further studies can be undertaken to confirm and quantify the results.
Moving forward, several expression context variables in addition to methylation can be used to make edge direction identifiable. Methylation was used because it was the most practical at this time. However, variables such as copy number and the concentrations of highly targeted microRNAs can also be used. In principle, gene sets of higher order than pairs could also be considered given sufficient data and computational power. Considering higher order interactions would allow situations such as XOR logic to be discovered and remove some inconsistencies from likelihood ratio test for direction in loopy scenarios.
Conclusions
We demonstrate the feasibility of using DNE methylation data to make directed gene regulatory edges statistically identifiable from non-time series observational data. This is shown both theoretically and empirically, on both synthetic and real data.
Declarations
Acknowledgements
The work of DS and DG was partially supported by NIH-NCRR Grant UL1 RR 025005.
Authors’ Affiliations
References
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.http://www.sciencemag.org/cgi/content/abstract/286/5439/531PubMedView ArticleGoogle Scholar
- Geman D, d’Avignon C, Naiman DQ, Winslow RL: Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol. 2004, 3 (Article19): http://www.ncbi.nlm.nih.gov/pubmed/16646797[PMID: 16646797]Google Scholar
- Price ND, Trent J, El-Naggar AK, Cogdell D, Taylor E, Hunt KK, Pollock RE, Hood L, Shmulevich I, Zhang W: Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas. Proc Natl Acad Sci. 2007, 104 (9): 3414-3419. 10.1073/pnas.0611373104.http://www.pnas.org/content/104/9/3414.abstractPubMedPubMed CentralView ArticleGoogle Scholar
- Xu L, Tan A, Winslow R, Geman D: Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics. 2008, 9: 125-10.1186/1471-2105-9-125.http://www.biomedcentral.com/1471-2105/9/125PubMedPubMed CentralView ArticleGoogle Scholar
- Dettling M, Bühlmann P: Boosting for tumor classification with gene expression data. Bioinformatics. 2003, 19 (9): 1061-1069. 10.1093/bioinformatics/btf867.http://bioinformatics.oxfordjournals.org/content/19/9/1061.abstractPubMedView ArticleGoogle Scholar
- Zhang H, Yu CY, Singer B: Cell and tumor classification using gene expression data: Construction of forests. Proc Natl Acad Sci. 2003, 100 (7): 4168-4172. 10.1073/pnas.0230559100.http://www.pnas.org/content/100/7/4168.abstractPubMedPubMed CentralView ArticleGoogle Scholar
- Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci. 2002, 99 (10): 6567-6572. 10.1073/pnas.082099299.http://www.pnas.org/content/99/10/6567.abstractPubMedPubMed CentralView ArticleGoogle Scholar
- Eddy JA, Hood L, Price ND, Geman D: Identifying tightly regulated and variably expressed networks by Differential Rank Conservation (DIRAC). PLoS Comput Biol. 2010, 6 (5): e1000792-10.1371/journal.pcbi.1000792.http://dx.doi.org/10.1371PubMedPubMed CentralView ArticleGoogle Scholar
- Chuang HYY, Lee E, Liu YTT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007,, 3. http://dx.doi.org/10.1038/msb4100180Google Scholar
- Pe’er D, Hacohen N: Principles and strategies for developing network models in cancer. Cell. 2011, 144 (6): 864-873. 10.1016/j.cell.2011.03.001.http://dx.doi.org/10.1016/j.cell.2011.03.001PubMedPubMed CentralView ArticleGoogle Scholar
- Taniguchi Y, Choi PJ, Li G, Chen H, Babu M, Hearn J, Emili A, Xie XS: Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science (New York, N.Y.). 2010, 329 (5991): 533-538. 10.1126/science.1188308. http://www.ncbi.nlm.nih.gov/pubmed/20671182View ArticleGoogle Scholar
- Scheines R: An introduction to causal inference. Dep Philos. 1997,, Paper 430. http://repository.cmu.edu/philosophy/430Google Scholar
- Spirtes P, Glymour C, Scheines R: Causation, Prediction, and Search, second edition. 2001, Cambridge: The MIT Press,http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20path=ASIN/0262194406Google Scholar
- Pearl J: The causal foundations of structural equation modeling. 2011,http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.170.6668rep=rep1type=pdfGoogle Scholar
- Perrin B, Ralaivola L, Mazurie A, Bottani S, Mallet J, d’Alche-Buc F: Gene networks inference using dynamic Bayesian networks. Bioinformatics. 2003, 19 (suppl_2): ii138-ii148.http://bioinformatics.oxfordjournals.org/cgi/content/abstract/19/suppl_2/ii138PubMedGoogle Scholar
- Yu J, Smith VA, Wang PP, Hartemink AJ, Jarvis ED: Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics. 2004, 20 (18): 3594-3603. 10.1093/bioinformatics/bth448.http://bioinformatics.oxfordjournals.org/content/20/18/3594.abstractPubMedView ArticleGoogle Scholar
- Mukhopadhyay ND, Chatterjee S: Causality and pathway search in microarray time series experiment. Bioinformatics. 2007, 23 (4): 442-449. 10.1093/bioinformatics/btl598.http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/4/442PubMedView ArticleGoogle Scholar
- Ram R, Chetty M: Comput Biol Bioinformatics, IEEE/ACM Trans. 2011, 8 (2): 353-367.View ArticleGoogle Scholar
- Gardner TS, di Bernardo D, Lorenz D, Collins JJ: Inferring genetic networks and identifying compound mode of action via expression profiling. Science (New York, N.Y.). 2003, 301 (5629): 102-105. 10.1126/science.1081900. http://www.ncbi.nlm.nih.gov/pubmed/12843395View ArticleGoogle Scholar
- Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, Califano A: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006, 7 (Suppl 1): S7-S7. 10.1186/1471-2105-7-S1-S7. [PMID: 16723010 PMCID: 1810318]PubMedPubMed CentralView ArticleGoogle Scholar
- Butte AJ, Kohane IS, Kohane IS: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput. 2000, 5: 415-426.http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.7575Google Scholar
- Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P: Inferring regulatory networks from expression data using tree-based methods. PLoS ONE. 2010, 5 (9): e12776-10.1371/journal.pone.0012776.http://dx.doi.org/10.1371/journal.pone.0012776PubMedPubMed CentralView ArticleGoogle Scholar
- Friedman N, Linial M, Nachman I: Using Bayesian networks to analyze expression data. J Comput Biol. 2000, 7: 601-620. 10.1089/106652700750050961.http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.5246PubMedView ArticleGoogle Scholar
- Friedman N: Inferring cellular networks using probabilistic graphical models. Science. 2004, 303 (5659): 799-805. 10.1126/science.1094068.http://www.sciencemag.org/content/303/5659/799.abstractPubMedView ArticleGoogle Scholar
- Hartemink A, Gifford D, Jaakkola T, Young R: Bayesian methods for elucidating genetic regulatory networks. Intell Syst IEEE. 2002, 17 (2): 37-43.Google Scholar
- Reik W, Walter J: Genomic imprinting: parental influence on the genome. Nat Rev Genet. 2001, 2: 21-32.http://dx.doi.org/10.1038/35047554PubMedView ArticleGoogle Scholar
- Herman JG, Baylin SB: Gene silencing in cancer in association with promoter hypermethylation. New England J Med. 2003, 349 (21): 2042-2054. 10.1056/NEJMra023075. http://www.ncbi.nlm.nih.gov/pubmed/14627790View ArticleGoogle Scholar
- Yang B, Guo M, Herman JG, Clark DP: Aberrant promoter methylation profiles of tumor suppressor genes in hepatocellular carcinoma. Am J Pathol. 2003, 163 (3): 1101-1107. 10.1016/S0002-9440(10)63469-4.http://ajp.amjpathol.org/cgi/content/abstract/163/3/1101PubMedPubMed CentralView ArticleGoogle Scholar
- Bibikova M, Lin Z, Zhou L, Chudin E, Garcia EW, Wu B, Doucet D, Thomas NJ, Wang Y, Vollmer E, Goldmann T, Seifart C, Jiang W, Barker DL, Chee MS, Floros J, Fan J: High-throughput DNA methylation profiling using universal bead arrays. Genome Res. 2006, 16 (3): 383-393. 10.1101/gr.4410706.http://genome.cshlp.org/content/16/3/383.abstractPubMedPubMed CentralView ArticleGoogle Scholar
- Institute NC, Institute NHGR: The cancer genome Atlas.http://cancergenome.nih.gov/index.asp
- Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010, 11 (10): 733-739. 10.1038/nrg2825.http://dx.doi.org/10.1038/nrg2825PubMedView ArticleGoogle Scholar
- Cover TM, Thomas JA: Elements of Information Theory. 2006, Hoboken: John Wiley and SonsGoogle Scholar
- Wilks SS: The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann Math Stat. 1938, 9: 60-62. 10.1214/aoms/1177732360.View ArticleGoogle Scholar
- Marbach D, Schaffter T, Mattiussi C, Floreano D: Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J Comput Biol J Comput Mol Cell Biol. 2009, 16 (2): 229-239. 10.1089/cmb.2008.09TT. http://www.ncbi.nlm.nih.gov/pubmed/19183003View ArticleGoogle Scholar
- Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G: Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS ONE. 2010, 5 (2): e9202-10.1371/journal.pone.0009202.http://dx.doi.org/10.1371/journal.pone.0009202PubMedPubMed CentralView ArticleGoogle Scholar
- Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G: Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci. 2010, 107 (14): 6286-6291. 10.1073/pnas.0913357107.http://www.pnas.org/content/107/14/6286.abstractPubMedPubMed CentralView ArticleGoogle Scholar
- Consortium F, Suzuki H, Forrest AR, van Nimwegen E, Daub CO, Balwierz PJ, Irvine KM, Lassmann T, Ravasi T, Hasegawa Y, de Hoon MJ, Katayama S, Schroder K, Carninci P, Tomaru Y, Katayama KM, Kubosaki A, Akalin A, Ando Y, Arner E, Asada M, Asahara H, Bailey T, Bajic VB, Bauer D, Beckhouse AG, Bertin N, Bjorkegren J, Brombacher F, Bulger E: The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line. Nat Genet. 2009, 41 (5): 553-562. 10.1038/ng.375.http://dx.doi.org/10.1038/ng.375View ArticleGoogle Scholar
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 1999, 27: 29-34. 10.1093/nar/27.1.29.http://dx.doi.org/10.1093/nar/27.1.29PubMedPubMed CentralView ArticleGoogle Scholar
- Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D: How to infer gene networks from expression profiles. Mol Syst Biol. 2007, 3: 78-http://www.ncbi.nlm.nih.gov/pubmed/17299415[PMID: 17299415]PubMedPubMed CentralView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.