### Granger causality for sets of time series

Granger causality identification is a data-driven approach for detecting possible interactions between time series, couched in terms of temporal precedence. The main idea is that temporal precedence does not imply causality, but it may help to identify causal relationships, since a cause never occurs after its effect.

A formal definition of Granger causality for sets of time series[20] can be given as follows.

#### Definition 1

*Granger causality for sets of time series*[20]: Suppose that *ℑ*_{t} is a set containing all relevant information available up to and including time point *t*. Let **X**_{t}, **X**_{t}^{i} and **X**_{t}^{j} be sets of time series containing *p*, *m* and *n* time series, respectively, where **X**_{t}^{i} and **X**_{t}^{j} are disjoint subsets of **X**_{t}, i.e., each time series belongs to only one set, and thus *p* ≥ *m* + *n*. Let **X**_{t}(*h*|*ℑ*_{t}) be the optimal (i.e., the one which produces the minimum mean squared error (MSE) prediction) *h*-step predictor of the set of *m* time series **X**_{t}^{i} from time point *t*, based on the information in *ℑ*_{t}. The forecast MSE of the linear combination of **X**_{t}^{i} will be denoted by *Ω*_{X}(*h*|*ℑ*_{t}). The set of *n* time series **X**_{t}^{j} is said to Granger-cause the set of *m* time series **X**_{t}^{i} if

{\Omega}_{\mathbf{X}}\left(h|{\Im}_{t}\right)<{\Omega}_{\mathbf{X}}\left(h|{\Im}_{t}\setminus \{{\mathbf{X}}_{s}^{j}|s\le t\}\right)\quad \text{for at least one}\quad h=1,2,\dots

(1)

where *ℑ*_{t} ∖ {**X**_{s}^{j} | *s* ≤ *t*} is the set containing all relevant information except for the information in the past and present of **X**_{t}^{j}. In other words, if **X**_{t}^{i} can be predicted more accurately when the information in **X**_{t}^{j} is taken into account, then **X**_{t}^{j} is said to be Granger-causal for **X**_{t}^{i}.

For the linear case, **X**_{t}^{j} is Granger non-causal for **X**_{t}^{i} if the following condition holds:

\text{CCA}({\mathbf{X}}_{t}^{i},{\mathbf{X}}_{t-1}^{j}|{\mathbf{X}}_{t}\setminus \left\{{\mathbf{X}}_{t-1}^{j}\right\})=\rho =0,

(2)

where *ρ* is the largest correlation calculated by Canonical Correlation Analysis (CCA).

In order to simplify both notation and concepts, only the identification of Granger causality for sets of time series in an autoregressive process of order one is presented; generalizations to higher orders are straightforward.
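As an illustration, the conditional CCA statistic of Eq. (2) can be approximated numerically. The sketch below is ours, not the authors' implementation: function names are our own, and partialization on the conditioning set is done by ordinary least squares residualization, which is one standard choice.

```python
import numpy as np

def largest_cca(a, b):
    """Largest canonical correlation between the column sets a and b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    # canonical correlations = singular values of qa' qb
    return np.linalg.svd(qa.T @ qb, compute_uv=False)[0]

def residualize(y, z):
    """Residuals of y after OLS regression on z (the partialization step)."""
    z1 = np.column_stack([np.ones(len(z)), z])
    beta, *_ = np.linalg.lstsq(z1, y, rcond=None)
    return y - z1 @ beta

def granger_cca(x_i, x_j_lag, cond):
    """CCA(X_t^i, X_{t-1}^j | cond): partialize both sets on cond, then CCA."""
    return largest_cca(residualize(x_i, cond), residualize(x_j_lag, cond))
```

A value of `granger_cca` near zero is consistent with Granger non-causality from the lagged set to the present set, conditioned on the remaining information.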

### Functional clustering in terms of Granger causality

There are numerous definitions of clusters in networks in the literature[24]. A functional cluster in terms of Granger causality can be defined as a subset of genes that interact strongly among themselves but interact weakly with the rest of the network.

A usual approach for network clustering when the structure of the graph is known is spectral clustering[25]. In biological data, however, the structure of the regulatory network is usually unknown.

In order to overcome this limitation, we developed a framework to cluster genes by their topological proximity using the time series gene expression information. We developed concepts of distance and degree for sets of time series based on Granger causality, and combined them with a modified spectral clustering algorithm. The procedures are detailed below.

#### Functional clustering

Given a set of time series *x*_{t}^{1}, *x*_{t}^{2}, …, *x*_{t}^{p} (where *p* is the number of time series) and a definition of similarity *w*_{ij} ≥ 0 between all pairs of data points *x*_{t}^{i} and *x*_{t}^{j}, the intuitive goal of clustering is to divide the time series into several groups such that time series in the same group are highly connected by Granger causality and time series in different groups are not connected or show few connections to each other. One usual representation of the connectivity between time series is a graph *G* = (*V*,*E*). Each vertex *v*_{i} in this graph represents a time series of gene expression *x*_{t}^{i}. Two vertices are connected if the similarity *w*_{ij} between the corresponding time series *x*_{t}^{i} and *x*_{t}^{j} is not zero (the edge of the graph is weighted by *w*_{ij}). In other words, *w*_{ij} > 0 represents the existence of Granger causality between the time series *x*_{t}^{i} and *x*_{t}^{j}, and *w*_{ij} = 0 represents Granger non-causality. The problem of clustering can now be reformulated using the similarity graph: we want to find a partition of the graph such that there is little Granger causality between different groups and strong Granger causality within each group.

Let *G* = (*V*,*E*) be an undirected graph with vertex set *V* = {*v*_{1},…,*v*_{p}} (where each vertex represents one time series) and weighted edge set *E*. In the following we assume that the graph *G* is weighted, that is, each edge between two vertices *v*_{i} and *v*_{j} carries a non-negative weight *w*_{ij} ≥ 0. The weighted adjacency matrix of the graph is the matrix **W** = (*w*_{ij}); *i*,*j* = 1,…,*p*. If *w*_{ij} = 0, the vertices *v*_{i} and *v*_{j} are not connected by an edge. As *G* is undirected, we require *w*_{ij} = *w*_{ji}. In terms of Granger causality, *w*_{ij} can therefore be derived from the distance between the two time series *x*_{t}^{i} and *x*_{t}^{j}. This distance can be defined as

##### Definition 2

*Distance between two (sets of) time series* *x*_{t}^{i} *and* *x*_{t}^{j}:

\mathit{\text{dist}}\left({x}_{t}^{i},{x}_{t}^{j}\right)=1-\frac{\left|\mathit{\text{CCA}}\left({x}_{t}^{i},{x}_{t-1}^{j}\right)\right|+\left|\mathit{\text{CCA}}\left({x}_{t}^{j},{x}_{t-1}^{i}\right)\right|}{2}.

(3)

Notice that CCA(*x*_{t}^{i}, *x*_{t−1}^{j}) is the Granger causality from time series *x*_{t}^{j} to *x*_{t}^{i}. In the case of sets of time series, simply replace *x*_{t}^{i} and *x*_{t}^{j} by the sets of time series **X**_{t}^{i} and **X**_{t}^{j}[20, 21]. Since the absolute value of the CCA ranges from zero to one, and a higher CCA indicates a greater flow of information, the higher the CCA, the shorter the distance. Furthermore, the average of CCA(*x*_{t}^{i}, *x*_{t−1}^{j}) and CCA(*x*_{t}^{j}, *x*_{t−1}^{i}) is taken because the distance must be symmetric: the higher the CCA coefficient, the lower the distance between the time series (or sets of time series), independent of the direction of Granger causality.

Moreover, notice that the CCA is the Pearson correlation after dimension reduction; therefore, dist(*x*_{t}^{i}, *x*_{t}^{j}) satisfies three of the four criteria for a metric, namely (i) non-negativity, (ii) identity of indiscernibles and (iii) symmetry, but not (iv) the triangle inequality, so, like the Pearson correlation, it is not a true metric. The Pearson correlation is nevertheless commonly used as a distance measure in gene expression data analysis[26, 27]. The main advantage of this definition of distance is that the clustering process can be interpreted in terms of Granger causality.
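Definition 2 can be sketched numerically as follows. This is our own minimal sketch: for brevity it uses the unconditioned CCA between present and lagged values, and all function names are ours.

```python
import numpy as np

def largest_cca(a, b):
    """Largest canonical correlation between the column sets a and b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    return np.linalg.svd(qa.T @ qb, compute_uv=False)[0]

def granger_dist(x_i, x_j):
    """Eq. (3): symmetric Granger-causality distance between two (sets of) series.

    x_i, x_j: arrays of shape (T,) or (T, d).
    """
    x_i = x_i.reshape(len(x_i), -1)
    x_j = x_j.reshape(len(x_j), -1)
    ij = abs(largest_cca(x_i[1:], x_j[:-1]))  # information flow j -> i
    ji = abs(largest_cca(x_j[1:], x_i[:-1]))  # information flow i -> j
    return 1.0 - (ij + ji) / 2.0
```

Strongly interacting series yield a distance near zero; series with no information flow in either direction yield a distance near one.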

Another necessary concept is the degree of a time series *x*_{t}^{i} (vertex *v*_{i}), which can be defined as

##### Definition 3

*Degree of* *x*_{t}^{i} *is defined by:*

\mathit{\text{degree}}\left({x}_{t}^{i}\right)=\frac{\mathit{\text{in-degree}}\left({x}_{t}^{i}\right)+\mathit{\text{out-degree}}\left({x}_{t}^{i}\right)}{2},

(4)

*where in-degree and out-degree are respectively*

\mathit{\text{in-degree}}\left({x}_{t}^{i}\right)=\left|\mathit{\text{CCA}}\left({x}_{t}^{i},{\mathbf{X}}_{t-1}|{\mathbf{X}}_{t}\setminus \left\{{\mathbf{X}}_{t-1}\right\}\right)\right|

(5)

\mathit{\text{out-degree}}\left({x}_{t}^{i}\right)=\left|\mathit{\text{CCA}}\left({\mathbf{X}}_{t},{x}_{t-1}^{i}|{\mathbf{X}}_{t}\setminus \left\{{x}_{t-1}^{i}\right\}\right)\right|.

(6)

Notice that in-degree and out-degree represent the total information flow that “enters” and “leaves” the vertex *v*_{i}, respectively. Therefore, the degree of vertex *v*_{i} captures the total information flow passing through vertex *v*_{i}.

Without loss of generality, it is possible to extend the concept of the degree of a vertex *v*_{i} (time series *x*_{t}^{i}) to a set of time series (sub-network) **X**_{t}^{u}, where *u* = 1,…,*k* and *k* is the number of sub-networks.

##### Definition 4

*Degree of sub-network* **X**_{t}^{u} *is defined by:*

\mathit{\text{degree}}\left({\mathbf{X}}_{t}^{u}\right)=\frac{\mathit{\text{in-degree}}\left({\mathbf{X}}_{t}^{u}\right)+\mathit{\text{out-degree}}\left({\mathbf{X}}_{t}^{u}\right)}{2},

(7)

*where in-degree and out-degree are respectively*

\mathit{\text{in-degree}}\left({\mathbf{X}}_{t}^{u}\right)=\left|\mathit{\text{CCA}}\left({\mathbf{X}}_{t}^{u},{\mathbf{X}}_{t-1}|{\mathbf{X}}_{t}\setminus \left\{{\mathbf{X}}_{t-1}\right\}\right)\right|,

(8)

\mathit{\text{out-degree}}\left({\mathbf{X}}_{t}^{u}\right)=\left|\mathit{\text{CCA}}\left({\mathbf{X}}_{t},{\mathbf{X}}_{t-1}^{u}|{\mathbf{X}}_{t}\setminus \left\{{\mathbf{X}}_{t-1}^{u}\right\}\right)\right|.

(9)

Now, using the definitions of distance and degree for time series and sets of time series in terms of Granger causality, it is possible to develop a spectral clustering-based algorithm to identify sub-networks (sets of time series that are highly connected within a set and poorly connected between sets) in regulatory networks. The algorithm, based on spectral clustering[25], is as follows:

**Input:** The *p* time series *x*_{t}^{i}, *i* = 1,…,*p*, and the number *k* of sub-networks to construct.

**Step 1:** Let **W** be the (*p* × *p*) symmetric weighted adjacency matrix, where *w*_{ij} = *w*_{ji} = 1 − dist(*x*_{t}^{i}, *x*_{t}^{j}), *i*,*j* = 1,…,*p*.

**Step 2:** Compute the non-normalized (*p* × *p*) Laplacian matrix **L** as (Mohar, 1991)

\mathbf{L}=\mathbf{D}-\mathbf{W}

(10)

where **D** is the (*p* × *p*) diagonal matrix with the degrees *d*_{1},…,*d*_{p} (degree(*x*_{t}^{i}) = *d*_{i}; *i* = 1,…,*p*) on the diagonal.

**Step 3:** Compute the first *k* eigenvectors {**e**_{1},…,**e**_{k}} (corresponding to the *k* smallest eigenvalues) of **L**.

**Step 4:** Let **U** ∈ *ℜ*^{p×k} be the matrix containing the vectors {**e**_{1},…,**e**_{k}} as columns.

**Step 5:** For *i* = 1,…,*p*, let **y**_{i} ∈ *ℜ*^{k} be the vector corresponding to the *i*-th row of **U**.

**Step 6:** Cluster the points (**y**_{i})_{i=1,…,p} ∈ *ℜ*^{k} with the *k*-means algorithm into clusters {**X**_{1},…,**X**_{k}}. For *k*-means, one may use a large number of initial values in order to reach (or approach) the globally optimal configuration; in our simulations, we generated 100 different initial values.

**Output:** Sub-networks {**X**_{1},…,**X**_{k}}.

Notice that this clustering approach does not infer the entire structure of the network.
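Steps 1–6 can be sketched with NumPy, assuming the similarity matrix **W** has already been computed from the distances. We follow von Luxburg's convention, in which the "first *k* eigenvectors" are those of the *k* smallest eigenvalues of **L**; the plain *k*-means routine with random restarts is our own minimal stand-in, not the authors' implementation.

```python
import numpy as np

def kmeans(X, k, n_init, rng):
    """Plain k-means with random restarts (Step 6)."""
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(100):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            new = np.array([X[labels == c].mean(axis=0) if (labels == c).any()
                            else centers[c] for c in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels

def spectral_subnetworks(W, k, n_init=100, seed=0):
    """Steps 2-6 for a symmetric similarity matrix W = 1 - dist (Step 1)."""
    rng = np.random.default_rng(seed)
    D = np.diag(W.sum(axis=1))        # diagonal degree matrix
    L = D - W                         # unnormalized Laplacian, Eq. (10) (Step 2)
    _, vecs = np.linalg.eigh(L)       # eigh returns eigenvalues in ascending order
    U = vecs[:, :k]                   # first k eigenvectors (Steps 3-4)
    return kmeans(U, k, n_init, rng)  # rows of U clustered by k-means (Steps 5-6)
```

On a similarity matrix with two dense blocks and weak cross-block weights, this recovers the two blocks as the two sub-networks.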

#### Estimation of the number of clusters

The method presented so far describes a framework for clustering genes (time series) using their topological proximity in terms of Granger causality.

Now the challenge consists in determining the optimal number of sub-networks *k*. The choice of *k* is often difficult and depends on what the researcher is interested in. In our specific problem, one is interested in identifying clusters presenting dense connectivity within clusters and sparse connectivity between clusters.

In order to determine the most appropriate number of clusters in this specific context, we used a variant of the silhouette method[28].

Let us first define the cluster index *s*(*i*) in the case of dissimilarities. Take any time series *x*_{t}^{i} in the data set, and denote by **A** the sub-network to which it has been assigned. When sub-network **A** contains time series other than *x*_{t}^{i}, we can compute *a*(*i*) = dist(*x*_{t}^{i}, **A**), the average dissimilarity of *x*_{t}^{i} to **A**. Now consider any sub-network **C** different from **A** and compute dist(*x*_{t}^{i}, **C**), the dissimilarity of *x*_{t}^{i} to **C**. After computing dist(*x*_{t}^{i}, **C**) for all sub-networks **C** ≠ **A**, we take the smallest of those numbers and denote it by *b*(*i*) = min_{**C** ≠ **A**} dist(*x*_{t}^{i}, **C**). The sub-network **B** for which this minimum is attained (that is, dist(*x*_{t}^{i}, **B**) = *b*(*i*)) is called the neighbor sub-network, or neighbor cluster, of *x*_{t}^{i}. The neighbor cluster is the second-best cluster for the time series *x*_{t}^{i}: if *x*_{t}^{i} could not belong to sub-network **A**, the best sub-network for it to belong to would be **B**. Therefore, *b*(*i*) is very useful for identifying the best alternative cluster for each time series in the network. Note that the construction of *b*(*i*) depends on the availability of sub-networks other than **A**; it is therefore necessary to assume that there is more than one sub-network (*k* > 1) in a given network[28].

After calculating *a*(*i*) and *b*(*i*), the cluster index *s*(*i*) can be obtained by combining them as follows:

s\left(i\right)=\frac{b\left(i\right)-a\left(i\right)}{\max \left\{a\left(i\right),b\left(i\right)\right\}}.

(11)

Indeed, from the above definition we easily see that −1 ≤ *s*(*i*) ≤ 1 for each time series *x*_{t}^{i}. Therefore, there are three main cases to analyze, namely *s*(*i*) ≈ 1, *s*(*i*) ≈ 0 and *s*(*i*) ≈ −1. For the cluster index *s*(*i*) to be close to one we require *a*(*i*) ≪ *b*(*i*). As *a*(*i*) measures how dissimilar *x*_{t}^{i} is to its own sub-network, a small value means it is well matched. Furthermore, a large *b*(*i*) implies that *x*_{t}^{i} is badly matched to its neighboring sub-network. Thus, a cluster index *s*(*i*) close to one means that the gene is appropriately clustered. If *s*(*i*) is close to minus one, then by the same logic *x*_{t}^{i} would be more appropriately placed in its neighboring sub-network. A cluster index *s*(*i*) near zero means that the gene lies on the border between two sub-networks. In other words, the cluster index *s*(*i*) can be interpreted as the fitness of the time series *x*_{t}^{i} to its assigned sub-network.

The average cluster index *s*(*i*) of a sub-network is a measure of how tightly grouped all the genes in the sub-network are. Thus, the average cluster index of the entire dataset is a measure of how appropriately the genes have been clustered, from a topological point of view and in terms of Granger causality.
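A minimal sketch of the cluster index of Eq. (11), assuming a precomputed (*p* × *p*) distance matrix and a label vector; the convention of assigning *s*(*i*) = 0 to singleton sub-networks is ours:

```python
import numpy as np

def cluster_index(dist, labels):
    """Eq. (11): silhouette-style cluster index s(i) from a distance matrix."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False
        if not own.any():          # singleton sub-network: keep s(i) = 0
            continue
        a = dist[i, own].mean()    # a(i): average dissimilarity to own sub-network
        b = min(dist[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])  # b(i)
        s[i] = (b - a) / max(a, b)
    return s
```

The average of the returned vector is the dataset-wide cluster index *s* used below to choose the number of sub-networks.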

#### Estimation of the number of clusters in biological data

In order to estimate the most appropriate number of sub-networks present in the data set, we compute the average cluster index *s* of the entire dataset for each number of clusters *k*. When the number of identified sub-networks is equal to or lower than the adequate number of sub-networks, the cluster index values are very similar. However, when the number of identified sub-networks becomes higher than the adequate number, the cluster index *s* decreases abruptly, because one of the highly connected sub-networks is split into two new sub-networks. Notice that these two new sub-networks present high connectivity between them because they are, in fact, a single sub-network. Figure 2 illustrates this event. In Figure 2a, genes in cluster 1 are highly interconnected. Now, suppose that one increases the number of clusters by splitting cluster 1 into two clusters, namely clusters 1 and 5 (Figure 2c). Notice that clusters 1 and 5 are highly connected to each other. If the number of clusters is higher than the adequate number (four, in our case), the value of *s* decreases substantially, since the Granger causality between clusters increases while that within clusters decreases. The breakpoint where *s* drops abruptly can be used to determine the adequate number of sub-networks. In practice, this breakpoint can be identified visually from the plot, much like the standard elbow method used in *k*-means. If one wants to determine the breakpoint in an objective manner, one can fit two linear regressions, one to the first *q* points and another to the remaining points, and choose the breakpoint (the value *q*) that minimizes the sum of squared errors (Figure 3).
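The objective breakpoint search described above can be sketched as follows. This is a minimal implementation of the two-regression idea (our own naming and conventions); it assumes at least four (*k*, *s*) points so each regression has at least two points.

```python
import numpy as np

def elbow_breakpoint(ks, s_vals):
    """Fit two linear regressions (first q points, remaining points) and pick
    the split minimizing the total sum of squared errors; returns the k at the
    end of the flat segment (the estimated number of sub-networks)."""
    ks = np.asarray(ks, dtype=float)
    s_vals = np.asarray(s_vals, dtype=float)

    def sse(x, y):
        coef = np.polyfit(x, y, 1)              # straight-line fit
        return float(((np.polyval(coef, x) - y) ** 2).sum())

    best_q, best_err = 2, np.inf
    for q in range(2, len(ks) - 1):             # each side needs >= 2 points
        err = sse(ks[:q], s_vals[:q]) + sse(ks[q:], s_vals[q:])
        if err < best_err:
            best_q, best_err = q, err
    return int(ks[best_q - 1])
```

For a flat run of *s* values followed by an abrupt linear drop, the chosen split lands exactly at the breakpoint.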

### Network construction

The network connecting clusters is constructed following procedures previously described[20, 21]. Briefly, after Classification Expectation-Maximization (CEM)[29], Principal Component Analysis (PCA) is used to remove redundancy and to extract the eigen-time series from each cluster. PCA allows us to keep only the most significant components leading to variability in the dataset, thus reducing the number of variables for subsequent processing. In this study, we retained only components accounting for more than 5% of the temporal variance in each cluster[22]. The eigen-time series are then clustered as described in the section Functional clustering, and the network can be inferred by applying the method proposed in[20, 21].
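The PCA step can be sketched as follows, assuming each cluster is stored as a (*T* × *g*) matrix of *g* gene time series. The 5% threshold follows the text; the SVD route is simply one standard way to compute principal components, not necessarily the authors' code.

```python
import numpy as np

def eigen_time_series(X, var_threshold=0.05):
    """Eigen-time series of one cluster: principal component scores of the
    (T x g) matrix X, keeping components that explain more than var_threshold
    of the temporal variance."""
    Xc = X - X.mean(axis=0)
    u, sv, _ = np.linalg.svd(Xc, full_matrices=False)  # SVD-based PCA
    explained = sv ** 2 / (sv ** 2).sum()              # variance ratios
    keep = explained > var_threshold
    return u[:, keep] * sv[keep]                       # retained component scores
```

For a cluster of noisy copies of one underlying expression profile, a single eigen-time series survives the threshold and tracks the shared profile.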

The Granger causality between clusters is identified by:

\text{CCA}({\mathbf{X}}_{t}^{i},{\mathbf{X}}_{t-1}^{j}|{\mathbf{X}}_{t}\setminus \left\{{\mathbf{X}}_{t-1}^{j}\right\})=\widehat{\rho}\quad \text{for all}\quad i,j=1,\dots ,k

(12)

where *ρ̂* is the sample canonical correlation between the sets **X**_{t}^{i} and **X**_{t−1}^{j}, partialized on all the information contained in **X**_{t} except the set **X**_{t−1}^{j}.

Then, test

H_{0}: CCA(**X**_{t}^{i}, **X**_{t−1}^{j} | **X**_{t} ∖ {**X**_{t−1}^{j}}) = *ρ̂* = 0 (Granger non-causality)

H_{1}: CCA(**X**_{t}^{i}, **X**_{t−1}^{j} | **X**_{t} ∖ {**X**_{t−1}^{j}}) = *ρ̂* ≠ 0 (Granger causality)

where H_{0} and H_{1} are the null and alternative hypotheses, respectively.
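The original papers use an analytical test for H_{0}[20, 21]. As a generic stand-in, a permutation test can be sketched by shuffling the rows of the lagged set to destroy any temporal alignment; conditioning on the remaining series is omitted here for brevity, and the function names are ours.

```python
import numpy as np

def largest_cca(a, b):
    """Largest canonical correlation between the column sets a and b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    return np.linalg.svd(qa.T @ qb, compute_uv=False)[0]

def perm_pvalue(x_i, x_j_lag, n_perm=500, seed=0):
    """Permutation p-value for H0: rho = 0, shuffling the rows of the
    lagged predictor set."""
    rng = np.random.default_rng(seed)
    observed = largest_cca(x_i, x_j_lag)
    exceed = sum(largest_cca(x_i, x_j_lag[rng.permutation(len(x_j_lag))]) >= observed
                 for _ in range(n_perm))
    return (1 + exceed) / (1 + n_perm)
```

A small p-value leads to rejecting H_{0}, i.e., declaring Granger causality from the lagged set to the present set.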

### Simulations

Four sets of Monte Carlo simulations were carried out in order to evaluate the proposed approach under controlled conditions. The first scenario represents four sub-networks without Granger causality between them (Figure 4a). The second scenario consists of four sub-networks constituting a cyclic graph (Figure 4b). The third scenario presents a feedback loop between sub-networks A and B (Figure 4c). The fourth scenario is composed of a network with one sub-network (sub-network D) that only receives Granger causality and one sub-network (sub-network A) that only sends Granger causality (Figure 4d). Since biological data usually contain several highly correlated genes (genes which carry the same information from a statistical standpoint), we constructed 10 highly correlated time series for each *x*_{t}^{i}, *i* = 1,…,20. In other words, *x*_{t}^{1} is represented by 10 time series with a correlation of 0.6 between them, *x*_{t}^{2} is represented by 10 time series with a correlation of 0.6 between them, and so on. Therefore, instead of 20 time series, each scenario is in fact composed of 200 time series.

For each scenario, four time series lengths were used: 50, 75, 100 and 200 time points. The number of repetitions for each scenario is 1,000. The synthetic gene expression time series data in sub-networks A, B, C and D were generated by the equations below.

Simulation 1:

\left\{\begin{array}{ccc}\phantom{\rule{.8em}{0ex}}{x}_{1,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{4,t-1}^{\text{A}}+\beta {x}_{3,t-1}^{\text{A}}+{\epsilon}_{1,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{2,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{3,t-1}^{\text{A}}+\beta {x}_{4,t-1}^{\text{A}}+{\epsilon}_{2,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{3,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{5,t-1}^{\text{A}}+{\epsilon}_{3,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{4,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{3,t-1}^{\text{A}}-\beta {x}_{5,t-1}^{\text{A}}+{\epsilon}_{4,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{5,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{2,t-1}^{\text{A}}-\beta {x}_{3,t-1}^{\text{A}}+\beta {x}_{4,t-1}^{\text{A}}+{\epsilon}_{5,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{6,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{9,t-1}^{\text{B}}-\beta {x}_{7,t-1}^{\text{B}}+{\epsilon}_{6,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{7,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{7,t-1}^{\text{B}}+{\epsilon}_{7,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{8,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+\beta {x}_{7,t-1}^{\text{B}}+{\epsilon}_{8,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{9,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{7,t-1}^{\text{B}}-\beta {x}_{8,t-1}^{\text{B}}+\beta {x}_{9,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+{\epsilon}_{9,t}\hfill \\ {x}_{10,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+\beta {x}_{9,t-1}^{\text{B}}+{\epsilon}_{10,t}\hfill \\ {x}_{11,t}^{\text{C}}\hfill & =\hfill & \beta {x}_{12,t-1}^{\text{C}}-\beta {x}_{15,t-1}^{\text{C}}+{\epsilon}_{11,t}\hfill \\ {x}_{12,t}^{\text{C}}\hfill & =\hfill & \beta {x}_{14,t-1}^{\text{C}}-\beta {x}_{13,t-1}^{\text{C}}+{\epsilon}_{12,t}\hfill \\ {x}_{13,t}^{\text{C}}\hfill & =\hfill & \beta 
{x}_{14,t-1}^{\text{C}}-\beta {x}_{11,t-1}^{\text{C}}+{\epsilon}_{13,t}\hfill \\ {x}_{14,t}^{\text{C}}\hfill & =\hfill & \beta {x}_{13,t-1}^{\text{C}}-\beta {x}_{11,t-1}^{\text{C}}+{\epsilon}_{14,t}\hfill \\ {x}_{15,t}^{\text{C}}\hfill & =\hfill & \beta {x}_{15,t-1}^{\text{C}}-\beta {x}_{12,t-1}^{\text{C}}+\beta {x}_{13,t-1}^{\text{C}}-\beta {x}_{14,t-1}^{\text{C}}+{\epsilon}_{15,t}\hfill \\ {x}_{16,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{19,t-1}^{\text{D}}-\beta {x}_{20,t-1}^{\text{D}}+\beta {x}_{17,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{16,t}\hfill \\ {x}_{17,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{17,t-1}^{\text{D}}+\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{17,t}\hfill \\ {x}_{18,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{18,t}\hfill \\ {x}_{19,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{17,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{19,t}\hfill \\ {x}_{20,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{19,t-1}^{\text{D}}+{\epsilon}_{20,t}\hfill \end{array}\right.
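The systems above all follow the generic VAR(1) pattern x_t = **B**x_{t−1} + ε_t, with **B** encoding the β (and γ) coefficients. The sketch below shows such a generator together with one way to produce the 10 correlated copies of each series; the shared-signal construction for the copies is our assumption, chosen only so that the pairwise correlation comes out near 0.6, and the tiny 2 × 2 **B** in the usage example is illustrative, not the full 20-gene system.

```python
import numpy as np

def simulate_var1(B, n_points, noise_sd=1.0, seed=0, burn_in=100):
    """Simulate x_t = B x_{t-1} + eps_t, discarding an initial burn-in period."""
    rng = np.random.default_rng(seed)
    p = B.shape[0]
    x = np.zeros((n_points + burn_in, p))
    for t in range(1, n_points + burn_in):
        x[t] = B @ x[t - 1] + noise_sd * rng.normal(size=p)
    return x[burn_in:]

def correlated_copies(series, n_copies=10, rho=0.6, seed=0):
    """n_copies noisy versions of one series with pairwise correlation ~ rho,
    via a shared-signal construction (an assumption about the mechanism)."""
    rng = np.random.default_rng(seed)
    noise_sd = series.std() * np.sqrt((1.0 - rho) / rho)
    return np.column_stack([series + noise_sd * rng.normal(size=len(series))
                            for _ in range(n_copies)])
```

Applying `correlated_copies` to each of the 20 simulated series yields the 200-series data sets used in each scenario.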

Simulation 2:

\left\{\begin{array}{ccc}\phantom{\rule{.8em}{0ex}}{x}_{1,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{4,t-1}^{\text{A}}+\beta {x}_{3,t-1}^{\text{A}}+{\epsilon}_{1,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{2,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{3,t-1}^{\text{A}}+\beta {x}_{4,t-1}^{\text{A}}+{\epsilon}_{2,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{3,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{5,t-1}^{\text{A}}+\gamma {x}_{19,t-1}^{\text{D}}+{\epsilon}_{3,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{4,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{3,t-1}^{\text{A}}-\beta {x}_{5,t-1}^{\text{A}}+{\epsilon}_{4,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{5,t}^{\text{A}}\hfill & =\hfill & \beta {x}_{2,t-1}^{\text{A}}-\beta {x}_{3,t-1}^{\text{A}}+\beta {x}_{4,t-1}^{\text{A}}+{\epsilon}_{5,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{6,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{9,t-1}^{\text{B}}-\beta {x}_{7,t-1}^{\text{B}}+\gamma {x}_{3,t-1}^{\text{A}}+{\epsilon}_{6,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{7,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{7,t-1}^{\text{B}}+{\epsilon}_{7,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{8,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+\beta {x}_{7,t-1}^{\text{B}}+{\epsilon}_{8,t}\hfill \\ \phantom{\rule{.8em}{0ex}}{x}_{9,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{7,t-1}^{\text{B}}-\beta {x}_{8,t-1}^{\text{B}}+\beta {x}_{9,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+{\epsilon}_{9,t}\hfill \\ {x}_{10,t}^{\text{B}}\hfill & =\hfill & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+\beta {x}_{9,t-1}^{\text{B}}+{\epsilon}_{10,t}\hfill \\ {x}_{11,t}^{\text{C}}\hfill & =\hfill & \beta {x}_{12,t-1}^{\text{C}}-\beta {x}_{15,t-1}^{\text{C}}+{\epsilon}_{11,t}\hfill \\ {x}_{12,t}^{\text{C}}\hfill & =\hfill & \beta {x}_{14,t-1}^{\text{C}}-\beta 
{x}_{13,t-1}^{\text{C}}+{\epsilon}_{12,t}\hfill \\ {x}_{13,t}^{\text{C}}\hfill & =\hfill & \beta {x}_{14,t-1}^{\text{C}}-\beta {x}_{11,t-1}^{\text{C}}+{\epsilon}_{13,t}\hfill \\ {x}_{14,t}^{\text{C}}\hfill & =\hfill & \beta {x}_{13,t-1}^{\text{C}}-\beta {x}_{11,t-1}^{\text{C}}+\gamma {x}_{10,t-1}^{\text{B}}+{\epsilon}_{14,t}\hfill \\ {x}_{15,t}^{\text{C}}\hfill & =\hfill & \beta {x}_{15,t-1}^{\text{C}}-\beta {x}_{12,t-1}^{\text{C}}+\beta {x}_{13,t-1}^{\text{C}}-\beta {x}_{14,t-1}^{\text{C}}+{\epsilon}_{15,t}\hfill \\ {x}_{16,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{19,t-1}^{\text{D}}-\beta {x}_{20,t-1}^{\text{D}}+\beta {x}_{17,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{16,t}\hfill \\ {x}_{17,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{17,t-1}^{\text{D}}+\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{17,t}\hfill \\ {x}_{18,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{18,t}\hfill \\ {x}_{19,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{17,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{19,t}\hfill \\ {x}_{20,t}^{\text{D}}\hfill & =\hfill & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{19,t-1}^{\text{D}}+\gamma {x}_{15,t-1}^{\text{C}}+{\epsilon}_{20,t}\hfill \end{array}\right.

Simulation 3:

\left\{\begin{array}{ccc}
{x}_{1,t}^{\text{A}} & = & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{4,t-1}^{\text{A}}+\beta {x}_{3,t-1}^{\text{A}}+{\epsilon}_{1,t}\\
{x}_{2,t}^{\text{A}} & = & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{3,t-1}^{\text{A}}+\beta {x}_{4,t-1}^{\text{A}}+{\epsilon}_{2,t}\\
{x}_{3,t}^{\text{A}} & = & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{5,t-1}^{\text{A}}+\gamma {x}_{6,t-1}^{\text{B}}+{\epsilon}_{3,t}\\
{x}_{4,t}^{\text{A}} & = & \beta {x}_{3,t-1}^{\text{A}}-\beta {x}_{5,t-1}^{\text{A}}+{\epsilon}_{4,t}\\
{x}_{5,t}^{\text{A}} & = & \beta {x}_{2,t-1}^{\text{A}}-\beta {x}_{3,t-1}^{\text{A}}+\beta {x}_{4,t-1}^{\text{A}}+{\epsilon}_{5,t}\\
{x}_{6,t}^{\text{B}} & = & \beta {x}_{9,t-1}^{\text{B}}-\beta {x}_{7,t-1}^{\text{B}}+{\epsilon}_{6,t}\\
{x}_{7,t}^{\text{B}} & = & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{7,t-1}^{\text{B}}+\gamma {x}_{3,t-1}^{\text{A}}+{\epsilon}_{7,t}\\
{x}_{8,t}^{\text{B}} & = & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+\beta {x}_{7,t-1}^{\text{B}}+{\epsilon}_{8,t}\\
{x}_{9,t}^{\text{B}} & = & \beta {x}_{7,t-1}^{\text{B}}-\beta {x}_{8,t-1}^{\text{B}}+\beta {x}_{9,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+{\epsilon}_{9,t}\\
{x}_{10,t}^{\text{B}} & = & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+\beta {x}_{9,t-1}^{\text{B}}+{\epsilon}_{10,t}\\
{x}_{11,t}^{\text{C}} & = & \beta {x}_{12,t-1}^{\text{C}}-\beta {x}_{15,t-1}^{\text{C}}+{\epsilon}_{11,t}\\
{x}_{12,t}^{\text{C}} & = & \beta {x}_{14,t-1}^{\text{C}}-\beta {x}_{13,t-1}^{\text{C}}+{\epsilon}_{12,t}\\
{x}_{13,t}^{\text{C}} & = & \beta {x}_{14,t-1}^{\text{C}}-\beta {x}_{11,t-1}^{\text{C}}+{\epsilon}_{13,t}\\
{x}_{14,t}^{\text{C}} & = & \beta {x}_{13,t-1}^{\text{C}}-\beta {x}_{11,t-1}^{\text{C}}+\gamma {x}_{10,t-1}^{\text{B}}+{\epsilon}_{14,t}\\
{x}_{15,t}^{\text{C}} & = & \beta {x}_{15,t-1}^{\text{C}}-\beta {x}_{12,t-1}^{\text{C}}+\beta {x}_{13,t-1}^{\text{C}}-\beta {x}_{14,t-1}^{\text{C}}+{\epsilon}_{15,t}\\
{x}_{16,t}^{\text{D}} & = & \beta {x}_{19,t-1}^{\text{D}}-\beta {x}_{20,t-1}^{\text{D}}+\beta {x}_{17,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{16,t}\\
{x}_{17,t}^{\text{D}} & = & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{17,t-1}^{\text{D}}+\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{17,t}\\
{x}_{18,t}^{\text{D}} & = & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{18,t}\\
{x}_{19,t}^{\text{D}} & = & \beta {x}_{17,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{19,t}\\
{x}_{20,t}^{\text{D}} & = & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{19,t-1}^{\text{D}}+\gamma {x}_{15,t-1}^{\text{C}}+{\epsilon}_{20,t}
\end{array}\right.

Simulation 4:

\left\{\begin{array}{ccc}
{x}_{1,t}^{\text{A}} & = & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{4,t-1}^{\text{A}}+\beta {x}_{3,t-1}^{\text{A}}+{\epsilon}_{1,t}\\
{x}_{2,t}^{\text{A}} & = & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{3,t-1}^{\text{A}}+\beta {x}_{4,t-1}^{\text{A}}+{\epsilon}_{2,t}\\
{x}_{3,t}^{\text{A}} & = & \beta {x}_{1,t-1}^{\text{A}}-\beta {x}_{5,t-1}^{\text{A}}+{\epsilon}_{3,t}\\
{x}_{4,t}^{\text{A}} & = & \beta {x}_{3,t-1}^{\text{A}}-\beta {x}_{5,t-1}^{\text{A}}+{\epsilon}_{4,t}\\
{x}_{5,t}^{\text{A}} & = & \beta {x}_{2,t-1}^{\text{A}}-\beta {x}_{3,t-1}^{\text{A}}+\beta {x}_{4,t-1}^{\text{A}}+{\epsilon}_{5,t}\\
{x}_{6,t}^{\text{B}} & = & \beta {x}_{9,t-1}^{\text{B}}-\beta {x}_{7,t-1}^{\text{B}}+{\epsilon}_{6,t}\\
{x}_{7,t}^{\text{B}} & = & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{7,t-1}^{\text{B}}+\gamma {x}_{3,t-1}^{\text{A}}+{\epsilon}_{7,t}\\
{x}_{8,t}^{\text{B}} & = & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+\beta {x}_{7,t-1}^{\text{B}}+{\epsilon}_{8,t}\\
{x}_{9,t}^{\text{B}} & = & \beta {x}_{7,t-1}^{\text{B}}-\beta {x}_{8,t-1}^{\text{B}}+\beta {x}_{9,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+{\epsilon}_{9,t}\\
{x}_{10,t}^{\text{B}} & = & \beta {x}_{10,t-1}^{\text{B}}-\beta {x}_{6,t-1}^{\text{B}}+\beta {x}_{9,t-1}^{\text{B}}+{\epsilon}_{10,t}\\
{x}_{11,t}^{\text{C}} & = & \beta {x}_{12,t-1}^{\text{C}}-\beta {x}_{15,t-1}^{\text{C}}+{\epsilon}_{11,t}\\
{x}_{12,t}^{\text{C}} & = & \beta {x}_{14,t-1}^{\text{C}}-\beta {x}_{13,t-1}^{\text{C}}+{\epsilon}_{12,t}\\
{x}_{13,t}^{\text{C}} & = & \beta {x}_{14,t-1}^{\text{C}}-\beta {x}_{11,t-1}^{\text{C}}+\gamma {x}_{8,t-1}^{\text{B}}+{\epsilon}_{13,t}\\
{x}_{14,t}^{\text{C}} & = & \beta {x}_{13,t-1}^{\text{C}}-\beta {x}_{11,t-1}^{\text{C}}+\gamma {x}_{5,t-1}^{\text{A}}+{\epsilon}_{14,t}\\
{x}_{15,t}^{\text{C}} & = & \beta {x}_{15,t-1}^{\text{C}}-\beta {x}_{12,t-1}^{\text{C}}+\beta {x}_{13,t-1}^{\text{C}}-\beta {x}_{14,t-1}^{\text{C}}+{\epsilon}_{15,t}\\
{x}_{16,t}^{\text{D}} & = & \beta {x}_{19,t-1}^{\text{D}}-\beta {x}_{20,t-1}^{\text{D}}+\beta {x}_{17,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{16,t}\\
{x}_{17,t}^{\text{D}} & = & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{17,t-1}^{\text{D}}+\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{17,t}\\
{x}_{18,t}^{\text{D}} & = & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{18,t}\\
{x}_{19,t}^{\text{D}} & = & \beta {x}_{17,t-1}^{\text{D}}-\beta {x}_{18,t-1}^{\text{D}}+{\epsilon}_{19,t}\\
{x}_{20,t}^{\text{D}} & = & \beta {x}_{20,t-1}^{\text{D}}-\beta {x}_{19,t-1}^{\text{D}}+\gamma {x}_{15,t-1}^{\text{C}}+{\epsilon}_{20,t}
\end{array}\right.

where *β* = 0.6, *γ* = 0.3, *ε*_{i,t}∼*N*(0,**Σ**) with

{\mathbf{\Sigma}}_{(20\times 20)}={\mathbf{I}}_{(2\times 2)}\otimes \mathbf{\Gamma}

(13)

and

\mathbf{\Gamma}={\left(\begin{array}{cccc}1 & 0.6 & \dots & 0.6\\ 0.6 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0.6\\ 0.6 & \dots & 0.6 & 1\end{array}\right)}_{(10\times 10)}

(14)

for *i* = 1,…,20.
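One realization of such a simulation is straightforward to generate. Below is a minimal NumPy sketch (not the authors' code) for Simulation 3; it assumes the noise covariance is the 20×20 block-diagonal matrix **I**₍₂ₓ₂₎ ⊗ **Γ**₍₁₀ₓ₁₀₎, the only Kronecker product of an identity matrix and the 10×10 **Γ** whose dimension matches the 20 simulated series, and an arbitrary series length of 200 time points.

```python
# Sketch of Simulation 3: a 20-variable VAR(1), x_t = A x_{t-1} + eps_t,
# with eps_t ~ N(0, Sigma) and Sigma = I_(2x2) kron Gamma_(10x10) (assumed).
import numpy as np

rng = np.random.default_rng(0)
beta, gamma = 0.6, 0.3
p, T = 20, 200                        # number of series; T is illustrative

Gamma = np.full((10, 10), 0.6)        # 0.6 off the diagonal ...
np.fill_diagonal(Gamma, 1.0)          # ... and 1 on it
Sigma = np.kron(np.eye(2), Gamma)     # 20x20 block-diagonal noise covariance

# A[i-1, j-1] is the coefficient of lagged series j in the equation of
# series i, transcribed term by term from the Simulation 3 system above.
links = {
    1: [(1, beta), (4, -beta), (3, beta)],
    2: [(1, beta), (3, -beta), (4, beta)],
    3: [(1, beta), (5, -beta), (6, gamma)],
    4: [(3, beta), (5, -beta)],
    5: [(2, beta), (3, -beta), (4, beta)],
    6: [(9, beta), (7, -beta)],
    7: [(10, beta), (7, -beta), (3, gamma)],
    8: [(10, beta), (6, -beta), (7, beta)],
    9: [(7, beta), (8, -beta), (9, beta), (6, -beta)],
    10: [(10, beta), (6, -beta), (9, beta)],
    11: [(12, beta), (15, -beta)],
    12: [(14, beta), (13, -beta)],
    13: [(14, beta), (11, -beta)],
    14: [(13, beta), (11, -beta), (10, gamma)],
    15: [(15, beta), (12, -beta), (13, beta), (14, -beta)],
    16: [(19, beta), (20, -beta), (17, beta), (18, -beta)],
    17: [(20, beta), (17, -beta), (18, beta)],
    18: [(20, beta), (18, -beta)],
    19: [(17, beta), (18, -beta)],
    20: [(20, beta), (19, -beta), (15, gamma)],
}
A = np.zeros((p, p))
for i, terms in links.items():
    for j, coef in terms:
        A[i - 1, j - 1] = coef

# Iterate the recursion from x_0 = 0 with correlated Gaussian noise.
eps = rng.multivariate_normal(np.zeros(p), Sigma, size=T)
x = np.zeros((T, p))
for t in range(1, T):
    x[t] = A @ x[t - 1] + eps[t]
```

The inter-set links (the γ terms, e.g. cluster B feeding series 3 of cluster A) are exactly the edges the set-level Granger analysis is meant to recover.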

### Actual biological data

In order to illustrate an application of the proposed approach, a dataset collected by [30] was used. That work presents whole-genome gene expression data, measured with cDNA microarrays, during the cell division cycle of a human cancer cell line (HeLa). The dataset covers three complete cell cycles of ∼16 hours each, with a total of 48 time points sampled at one-hour intervals. The full dataset is available at http://genome-www.stanford.edu/Human-CellCycle/HeLa/.

In order to evaluate our proposed approach, we analyzed the same set of 50 genes examined in Figure 5 of [10].
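The comparison underlying Definition 1 can be mimicked on data of this kind with a plain one-step least-squares predictor: fit the target set once on the full information set and once with the candidate set's past removed, and compare the forecast MSEs. The sketch below is only an illustration of that idea on synthetic data (not the HeLa set, and not the authors' estimator); the set sizes and coefficients are arbitrary.

```python
# Toy version of Definition 1: set B Granger-causes set A if deleting
# B's past from the information set worsens the one-step-ahead MSE.
import numpy as np

rng = np.random.default_rng(1)
T = 500
B = rng.normal(size=(T, 2))            # candidate causal set (2 series)
A = np.zeros((T, 2))                   # target set (2 series)
eps = rng.normal(size=(T, 2))
for t in range(1, T):                  # B's past genuinely feeds into A
    A[t] = 0.3 * A[t - 1] + 0.5 * B[t - 1] + eps[t]

def one_step_mse(target, info):
    """OLS fit of target_t on info_{t-1}; returns the mean squared residual."""
    X = np.hstack([info[:-1], np.ones((T - 1, 1))])  # lagged regressors + intercept
    Y = target[1:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.mean((Y - X @ coef) ** 2))

mse_full = one_step_mse(A, np.hstack([A, B]))   # information set with B's past
mse_restricted = one_step_mse(A, A)             # B's past removed
print(mse_full < mse_restricted)                # → True: B Granger-causes A
```

In practice the strict inequality must of course be backed by a significance test rather than a raw MSE comparison, since the nested full model never fits worse in-sample.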