Interpretation of the resulting graph
The above algorithm returns a partially directed partial correlation graph, whose directed edges form a causal network.
This procedure can be motivated by the following connection between partial correlation graph and a system of linear equations, where each node is in turn taken as a response variable and regressed against all other remaining nodes. In this setting the partial correlation coefficient is the geometric mean of and the corresponding reciprocal coefficient , i.e.
(4)
[see also equation 16 of ref. [20]]. In this light, an undirected edge between two nodes A and B in a partial correlation graph may also be interpreted as bidirected edge, in the sense that A influences B and vice versa in the underlying system of regression. Therefore, the test = 1 can be understood as removing one of these two directions, where Equation 2 suggests that only the relative variance reduction between the two involved nodes needs to be considered for establishing the final direction.
Reconstruction efficiency and approximations underlying the algorithm
Topology of the network
The proposed algorithm is an extension of the GGM inference approach of [19, 20]. Its accuracy of correctly recovering the topology of the partial correlation graph has been established, e.g., in [30].
However, it is well known that a directed Bayesian network and the corresponding undirected graph are not necessarily topologically identical: in the undirected graph for computing the partial correlations one conditions on all other nodes whereas in the directed graph one conditions only on a subset of nodes, in order to avoid conditioning "on the future" (i.e. on the dependent nodes). Therefore, it is critical to evaluate to what extent full order partial correlations are reasonable approximations for lower order partial correlations. This has already been investigated intensively by [31] who showed that in certain situations (sparse graphs, faithfulness assumption etc.) lower order partial correlations may be used as approximate substitute of full conditional correlations. Therefore, in the proposed algorithm we adopt the very same argument but apply it in the different direction, i.e. we approximate lower order partial correlation by full order partial correlation.
Node ordering
A second approximation implicit in our algorithm concerns the determination of the ordering of the nodes, which is done by multiple testing of pairwise ratios of standardized partial variances. We have conducted a number of numerical simulations (data not shown) that indicate that for randomly simulated DAGs the ordering of the nodes is indeed well reflected in the partial variances, as expected.
However, from variable selection in linear models it is also known that the partial variance (or the related R2) may not always be a reliable indicator for variable importance. Nevertheless, the partial ordering of nodes according to SPV and the implicit model selection in the underlying regressions is a very different procedure in comparison to the standard variable selection approaches, in which the increase or decrease of the R2 is taken as indicator of whether or not a variable is to be included, or a decomposition of R2 is sought [for a review see, e.g., [32]]. The distinctive feature of our procedure is that by performing all tests log() ≠ 0 simultaneously we consider all p regression equations at once, even if the final feature selection occurs only locally on the level of an individual regression.
It is also noteworthy that, as we impose directionality from the less well explained variable (large SPV, "exogenous", "independent") to the one with relatively lower SPV (well explained, "endogenous", "dependent" variable), we effectively choose the direction with the relatively smaller regression coefficient (conditional that the corresponding partial correlation is also significant).
Further properties of the heuristic algorithm and of the resulting graphs
The simple heuristic network discovery algorithm exhibits a number of further properties worth noting:
1. The estimated partially directed network cannot contain any (partially) directed cycles. For instance, it is not possible for a graph to contain a pattern such as A → B → A. This example would imply SPV
A
> SPV
B
> SPV
A
, which is a contradiction. As a consequence, the subgraph containing the directed edges only is also acyclic (and hence a DAG).
2. The assignment of directionality is transitive. If there is a directed edge from A to B and from B to C then there must also be a directed edge from A to C. Note however, that actual inclusion of a directed edge into the causal network is conditional on a non-zero partial correlation coefficient.
3. As the algorithm relies on correlations as input, causal processes that produce the same correlation matrix lead to the same inferred graph, and hence are indistinguishable. The existence of such equivalence classes is well known for SEMs [33] and also for Bayesian belief networks [34].
4. The proposed algorithm is scale-invariant by construction. Hence, a (linear) change in any of units of the data has no effect on the overall estimated partially directed network, and the implied causal relations.
5. We emphasize that the partially directed network is not the chain graph representing the equivalence class of the causal network that is obtained by considering only its directed edges – see [34].
6. The computational complexity of the algorithm is O(p3). Hence, it is no more expensive than computing the partial correlation graph, and thus allows for estimation of networks containing in the order of thousands and more nodes.
Analysis of a plant expression data set
To illustrate our algorithm for discovering causal structure, we applied the approach to a real world data example. Specifically, we reanalyzed expression time series resulting from an experiment investigating the impact of the diurnal cycle on the starch metabolism of Arabidopsis thaliana [35]. This is the same data set we used in a sister paper concerning the estimation of a vector autoregressive model [36].
The data are gene expression time series measurements collected at 11 different time points (0, 1, 2, 4, 8, 12, 13, 14, 16, 20, and 24 hours after the start of the experiment). The corresponding calibrated signal intensities for 22,814 genes/probe sets and for two biological replicates are available from the NASCArrays repository, experiment no. 60 [37]. After log-transforming the data we filtered out all genes containing missing values and whose maximum signal intensity value was lower than 5 on a log-base 2 scale. Subsequently, we applied the periodicity test of [38] to identify the probes associated with the day-night cycle. As a result, a subset of 800 genes remained for further analysis.
In order to estimate the correlation matrix for the 800 genes described by the data set we employed the dynamical correlation shrinkage estimator of [39] as this takes account of the autocorrelation. The corresponding correlation graph is displayed in Figure 1. It shows the 150 edges with the largest absolute values of correlation. This graph is very hard to interpret, the branches do not have any immediate or intuitive meaning (a complete annotation of the nodes can be found along with the dataset itself in the R package "GeneNet" [40]). For instance, there are no hubs as typically observed in biological networks [41, 42].
This is in great contrast to the partially directed partial correlation graph. For this specific data set, by multiple testing of the factor we identified 6, 102 significant edges connecting 669 nodes. For the second factor , determined whether edges are directed, the distribution of log() is displayed in Figure 2. The null distribution (dashed line) follows a normal distribution and characterizes the edges that cannot be directed. The alternative distribution (solid line) coincides with the directed edges. In total, we found 15, 928 significant directions.
To construct the network, we projected upon the significant edges (factor ) the significant directions (factor ). In the network of significant associations, 1,216 directions were significant. Note that the fraction of significant directions is by far greater in the subset of the significant partial correlations than in the complete set of all partial correlations. This agrees with the intuitive notion, that causal influences can only be attributed to existing connections between variables.
The resulting partially causal network is shown in Figure 3. For reasons of clarity we show only the subnetwork containing the 150 most significant edges, which connect 107 nodes. This graph exhibits a clear "hub" connectivity structure (nodes filled with red color). A prominent example for this is node 570, others are 81, 558, 783 and a few more genes. We see that many of the hub nodes have mostly outgoing arcs, which is indicative for key regulatory genes. This applies, e.g., to node 570, an AP2 transcription factor, or to node 81, a gene involved in DNA-directed RNA polymerase. An interesting aspect of the partially causal network is the web of highly connected genes (colored yellow in the lower right corner of Figure 3), which we hypothesize to constitute some form of a functional module. In this module, it is not possible to determine any directions, which could be due to complex interactions among the nodes of the module. Node 627 is another hub in the network that connects the functional module with the rest of the network and which according to the annotation of [35] encodes a protein of unknown function.
We also see that the partially directed network contains both directed and undirected nodes. This is a distinct advantage of the present approach. Unlike, e.g., a vector autoregressive model [36], it does not force directions onto the edges.
Finally, in order to investigate the stability of the inferred partial causal network, we randomly removed data points from the sample, and repeatedly reconstructed the network from the reduced data set. In all cases the general topological structure of the network remained intact, which indicates that this is a signal inherent in the data. This is also confirmed by the analysis using vector autoregressions [36].