Functional clustering of time series gene expression data by Granger causality
- André Fujita^{1},
- Patricia Severino^{2},
- Kaname Kojima^{3},
- João Ricardo Sato^{4},
- Alexandre Galvão Patriota^{1} and
- Satoru Miyano^{3}
DOI: 10.1186/1752-0509-6-137
© Fujita et al.; licensee BioMed Central Ltd. 2012
Received: 14 October 2011
Accepted: 17 October 2012
Published: 30 October 2012
Abstract
Background
A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes.
Results
In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence.
Conclusions
This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them.
Background
Gene network analysis of complex datasets, such as DNA microarray results, aims to identify relevant structures that help the understanding of a certain phenotype or condition. These networks comprise hundreds to thousands of genes that may interact generating intricate structures. Consequently, pinpointing genes or sets of genes that play a crucial role becomes a complicated task.
Common analyses explore gene-gene level relationships and generate broad networks. Although this is a valuable approach, genes might interact more intensely with a few members of the network, and the identification of these so-called sub-networks should lead to a better comprehension of the entire regulatory process.
The concept of Granger causality[6] has been previously shown to help in the identification and interpretation of regulatory networks in time series gene expression datasets[9–18]. The main advantage of Granger causality analysis in the context of gene expression datasets is that each edge of the network represents the information flow from one gene to another[19]. Nevertheless, it is necessary to point out that Granger causality is not effective causality in the Aristotelian sense because it is based on prediction and numerical calculations. Fujita et al.[20–22] suggested a concept for the identification of Granger causality between groups of time series. The application was, however, limited to scenarios in which clusters could be previously defined based on particular data characteristics. Here, we propose a method to define clusters by their topological proximity in the network. For this purpose we introduce an extension of the concept of functional clustering, initially proposed by[23] in neuroscience. In[23], the authors applied mutual information in order to group the most active brain regions. We are interested in clustering the genes by using the concept of information flow[19] between sets of time series[20]. The gene expression time series are grouped depending on the hidden structure underlying the network topology, in such a way that genes which are topologically close in terms of Granger causality are clustered (Figure 1a). We use the generalization of Granger causality for sets of time series proposed by[20, 21] in order to define concepts of distance, degree and flow that are useful to determine gene sets that highly interact in terms of Granger causality. In other words, we will derive the Granger causality-based functional clustering directly from the time series gene expression data. For this purpose, an approach that allows the identification of the optimum number of clusters for a given dataset is also presented.
Materials and Methods
Granger causality for sets of time series
Granger causality identification is a potential approach for the detection of possible interactions in a data driven framework couched in terms of temporal precedence. The main idea is that temporal precedence does not imply, but may help to identify causal relationships, since a cause never occurs after its effect.
A formal definition of Granger causality for sets of time series[20] can be given as follows.
Definition 1
where $\Im_{t}\setminus \{\mathbf{X}_{s}^{j} \mid s\le t\}$ is the set containing all relevant information except for the information in the past and present of $\mathbf{X}_{t}^{j}$. In other words, if $\mathbf{X}_{t}^{i}$ can be predicted more accurately when the information in $\mathbf{X}_{t}^{j}$ is taken into account, then $\mathbf{X}_{t}^{j}$ is said to be Granger-causal for $\mathbf{X}_{t}^{i}$.
where ρ is the largest correlation calculated by Canonical Correlation Analysis (CCA).
In order to simplify both notation and concepts, only the identification of Granger causality for sets of time series in an Autoregressive process of order one is presented. Generalizations for higher orders are straightforward.
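As an illustration of the order-one case, the sketch below computes the largest canonical correlation between the present of one set of time series and the past of another. It is a minimal sketch, not the authors' implementation: the function names `largest_canonical_correlation` and `granger_cca_order1` are hypothetical, NumPy is assumed to be available, the centered data matrices are assumed to have full column rank, and the conditioning on the remaining series in the network is ignored.

```python
import numpy as np


def largest_canonical_correlation(X, Y):
    """Largest canonical correlation between the columns of X and Y.

    Rows are time points, columns are time series. Assumes the centered
    matrices have full column rank.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    X = X.reshape(len(X), -1)  # accept a single series given as a 1-D array
    Y = Y.reshape(len(Y), -1)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two column spaces; the singular values of
    # Qx^T Qy are the canonical correlations.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))  # clip tiny floating-point overshoot


def granger_cca_order1(Xi, Xj):
    """CCA(X_t^i, X_{t-1}^j): present of set i against the past of set j."""
    Xi = np.asarray(Xi, dtype=float)
    Xj = np.asarray(Xj, dtype=float)
    return largest_canonical_correlation(Xi[1:], Xj[:-1])
```

A value close to one suggests that the past of set j carries much information about the present of set i, while a value close to zero suggests little information flow in that direction.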
Functional clustering in terms of Granger causality
There are numerous definitions for clusters in networks in the literature[24]. A functional cluster in terms of Granger causality can be defined as a subset of genes that strongly interact among themselves but interact weakly with the rest of the network.
A usual approach for network clustering when the structure of the graph is known is the spectral clustering proposed by[25]. However, in biological data, the structure of the regulatory network is usually unknown.
In order to overcome this limitation, we developed a framework to cluster genes by their topological proximity using the time series gene expression information. We developed concepts of distance and degree for sets of time series based on Granger causality, and combined them with a modified spectral clustering algorithm. The procedures are detailed below.
Functional clustering
Given a set of time series $x_{t}^{1},x_{t}^{2},\dots ,x_{t}^{p}$ (where p is the number of time series) and a definition of similarity w_{ ij } ≥ 0 between all pairs of data points $x_{t}^{i}$ and $x_{t}^{j}$, the intuitive goal of clustering is to divide the time series into several groups such that time series in the same group are highly connected by Granger causality and time series in different groups are not connected or show few connections to each other. One usual representation of the connectivity between time series is in the form of a graph G = (V,E). Each vertex v_{ i } in this graph represents a time series gene expression $x_{t}^{i}$. Two vertices are connected if the similarity w_{ ij } between the corresponding time series $x_{t}^{i}$ and $x_{t}^{j}$ is not zero (the edge of the graph is weighted by w_{ ij }). In other words, w_{ ij } > 0 represents the existence of Granger causality between time series $x_{t}^{i}$ and $x_{t}^{j}$, and w_{ ij } = 0 represents Granger non-causality. The problem of clustering can now be reformulated using the similarity graph: we want to find a partition of the graph such that there is less Granger causality between different groups and more Granger causality within each group.
Let G = (V,E) be an undirected graph with vertex set V = {v_{1},…,v_{ p }} (where each vertex represents one time series) and weighted edge set E. In the following we assume that the graph G is weighted, that is, each edge between two vertices v_{ i } and v_{ j } carries a non-negative weight w_{ ij } ≥ 0. The weighted adjacency matrix of the graph is the matrix W = (w_{ ij }); i,j = 1,…,p. If w_{ ij } = 0, this means that the vertices v_{ i } and v_{ j } are not connected by an edge. As G is undirected, we require w_{ ij } = w_{ ji }. Therefore, in terms of Granger causality, w_{ ij } can be derived from the distance between two time series $x_{t}^{i}$ and $x_{t}^{j}$. This distance can be defined as
Definition 2
Notice that $\text{CCA}(x_{t}^{i},x_{t-1}^{j})$ is the Granger causality from time series $x_{t}^{j}$ to $x_{t}^{i}$. In the case of sets of time series, just replace $x_{t}^{i}$ and $x_{t}^{j}$ by the sets of time series $\mathbf{X}_{t}^{i}$ and $\mathbf{X}_{t}^{j}$[20, 21]. Since the absolute value of CCA ranges from zero to one, and the higher the CCA, the larger the quantity of information flow, the higher the CCA, the shorter the distance. Furthermore, it is necessary to point out that the average of $\text{CCA}(x_{t}^{i},x_{t-1}^{j})$ and $\text{CCA}(x_{t}^{j},x_{t-1}^{i})$ is calculated because the distance must be symmetric. The intuitive idea is that the higher the CCA coefficient, the lower the distance between the time series (or sets of time series), independent of the direction of Granger causality.
Moreover, notice that the CCA is the Pearson correlation after dimension reduction. Therefore, $\text{dist}(x_{t}^{i},x_{t}^{j})$ satisfies three out of four criteria for distances: (i) non-negativity; (ii) identity of indiscernibles; and (iii) symmetry. It does not satisfy (iv) the triangle inequality, so a distance based on Pearson correlation is not a true metric. However, it is commonly used as a distance measure in several gene expression data analyses[26, 27]. The main advantage of this definition of distance is that it is possible to interpret the clustering process through a Granger causality concept.
Another necessary concept is the idea of degree of a time series $x_{t}^{i}$ (vertex v_{ i }), which can be defined as
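The distance described above can be sketched for single time series as follows. The formula of Definition 2 is not reproduced in this text, so the form used here, one minus the average of the two absolute lag-one correlations, is an assumption inferred from the surrounding description. For a pair of single series the canonical correlation reduces to the absolute Pearson correlation, which keeps the example dependency-free. The names `pearson`, `lagged_cca` and `granger_distance` are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def lagged_cca(xi, xj):
    """|corr(x_t^i, x_{t-1}^j)|: order-one Granger causality from j to i."""
    return abs(pearson(xi[1:], xj[:-1]))


def granger_distance(xi, xj):
    """Assumed form: 1 - average of the two directional absolute CCAs."""
    return 1.0 - 0.5 * (lagged_cca(xi, xj) + lagged_cca(xj, xi))
```

Because the two directions are averaged, `granger_distance(xi, xj)` equals `granger_distance(xj, xi)`, and strong information flow in either direction shortens the distance.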
Definition 3
Notice that in-degree and out-degree represent the total information flow that “enters” and “leaves” the vertex v_{ i }, respectively. Therefore, the degree of vertex v_{ i } contains the total information flow passing through vertex v_{ i }.
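The in-degree, out-degree and degree of a vertex can be illustrated with a small sketch. The formula of Definition 3 is not reproduced in this text, so the code assumes that the degrees are sums of absolute Granger causality values, with `G[i][j]` holding the absolute CCA from series j to series i, and that self-loops are excluded. Both the matrix layout and the function names are hypothetical.

```python
def in_degree(G, i):
    """Total information flow entering vertex i (assumed: sum over j != i of G[i][j])."""
    return sum(G[i][j] for j in range(len(G)) if j != i)


def out_degree(G, i):
    """Total information flow leaving vertex i (assumed: sum over j != i of G[j][i])."""
    return sum(G[j][i] for j in range(len(G)) if j != i)


def degree(G, i):
    """Total information flow passing through vertex i."""
    return in_degree(G, i) + out_degree(G, i)


# Hypothetical 3-gene example: G[i][j] = |CCA(x_t^i, x_{t-1}^j)|
G = [
    [0.0, 0.5, 0.1],
    [0.2, 0.0, 0.3],
    [0.4, 0.6, 0.0],
]
degrees = [degree(G, i) for i in range(len(G))]
```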
Without loss of generality, it is possible to extend the concept of degree of a vertex v_{ i } (time series $x_{t}^{i}$) to a set of time series (sub-network) $\mathbf{X}_{t}^{u}$, where u = 1,…,k and k is the number of sub-networks.
Definition 4
Now, by using the definitions of distance and degrees for time series and sets of time series in terms of Granger causality, it is possible to develop a spectral clustering-based algorithm to identify sub-networks (set of time series that are highly connected within sets and poorly connected between sets) in the regulatory networks. The algorithm based on spectral clustering[25] is as follows:
Input: The p time series ($x_{t}^{i};i=1,\dots ,p$) and the number k of sub-networks to construct.
Step 1: Let W be the (p × p) symmetric weighted adjacency matrix where $w_{i,j}=w_{j,i}=1-\text{dist}(x_{t}^{i},x_{t}^{j}),\ i,j=1,\dots ,p$.
Step 2: Compute the matrix L from W and D, where D is the (p × p) diagonal matrix with the degrees d_{1},…,d_{ p } ($\text{degree}(x_{t}^{i})=d_{i};i=1,\dots ,p$) on the diagonal.
Step 3: Compute the first k eigenvectors {e_{1},…,e_{ k }} (corresponding to the k largest eigenvalues) of L.
Step 4: Let U ∈ ℜ^{p×k} be the matrix containing the vectors {e_{1},…,e_{ k }} as columns.
Step 5: For i = 1,…,p, let y_{ i } ∈ ℜ^{ k } be the vector corresponding to the i th row of U.
Step 6: Cluster the points (y_{ i })_{i=1,…,p} ∈ ℜ^{ k } with the k-means algorithm into clusters {X_{1},…,X_{ k }}. For k-means, one may select a large number of initial values in order to reach (or get closer to) the globally optimal configuration. In our simulations, we generated 100 different initial values.
Output: Sub-networks {X_{1},…,X_{ k }}.
Notice that this clustering approach does not infer the entire structure of the network.
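The steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the authors' implementation: the formula for L in Step 2 is not reproduced in this text, so the code assumes the normalized matrix D^{-1/2} W D^{-1/2}, a common choice when the eigenvectors of the k largest eigenvalues are used. By default the degrees are taken as the row sums of W, but degrees computed from Granger causality can be passed in instead. All degrees are assumed to be positive. The functions `kmeans` and `spectral_subnetworks` are hypothetical names, and the k-means routine is a plain Lloyd iteration with random restarts.

```python
import numpy as np


def kmeans(Y, k, n_init=100, n_iter=100, seed=0):
    """Plain Lloyd k-means with several random restarts; returns the best labels."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = Y[rng.choice(len(Y), size=k, replace=False)]
        for _ in range(n_iter):
            d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            new_centers = np.array([
                Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                for c in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        cost = d2[np.arange(len(Y)), labels].sum()
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels


def spectral_subnetworks(W, k, degrees=None, seed=0):
    """Cluster p time series into k sub-networks from a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1) if degrees is None else np.asarray(degrees, dtype=float)
    inv_sqrt = 1.0 / np.sqrt(d)
    # Assumed Step 2: L = D^{-1/2} W D^{-1/2}
    L = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    # Steps 3-4: eigenvectors of the k largest eigenvalues (eigh sorts ascending)
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, -k:]
    # Steps 5-6: each row of U is one point; cluster the rows with k-means
    return kmeans(U, k, seed=seed)
```

For example, a weight matrix with two blocks of strongly connected series and weak links between the blocks should yield one label per block when called with k = 2.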
Estimation of the number of clusters
The method presented so far describes a framework for clustering genes (time series) using their topological proximity in terms of Granger causality.
Now, the challenge consists in determining the optimum number of sub-networks k. The choice of k is often difficult and depends on what the researcher is interested in. In our specific problem, one is interested in identifying clusters that present dense connectivity within each cluster and sparse connectivity between clusters.
In order to determine the most appropriate number of clusters in this specific context, we used a variant of the silhouette method[28].
Let us first define the cluster index s(i) in the case of dissimilarities. Take any time series $x_{t}^{i}$ in the data set, and denote by A the sub-network to which it has been assigned. When sub-network A contains other time series apart from $x_{t}^{i}$, then we can compute $a(i)=\text{dist}(x_{t}^{i},\mathbf{A})$, which is the average dissimilarity of $x_{t}^{i}$ to A. Let us now consider any sub-network C which is different from A and compute $\text{dist}(x_{t}^{i},\mathbf{C})$, which is the dissimilarity of $x_{t}^{i}$ to C. After computing $\text{dist}(x_{t}^{i},\mathbf{C})$ for all sub-networks C ≠ A, we select the smallest of those numbers and denote it by $b(i)=\min_{\mathbf{C}\ne \mathbf{A}}\text{dist}(x_{t}^{i},\mathbf{C})$. The sub-network B for which this minimum value is attained (that is, $\text{dist}(x_{t}^{i},\mathbf{B})=b(i)$) we call the neighbor sub-network, or neighbor cluster, of $x_{t}^{i}$. The neighbor cluster would be the second-best cluster for time series $x_{t}^{i}$. In other words, if $x_{t}^{i}$ could not belong to sub-network A, the best sub-network to belong to would be B. Therefore, b(i) is very useful to know the best alternative cluster for the time series in the network. Note that the construction of b(i) depends on the availability of other sub-networks apart from A, thus it is necessary to assume that there is more than one sub-network within a given network, that is, k > 1[28].
Indeed, from the above definition we easily see that −1 ≤ s(i) ≤ 1 for each time series $x_{t}^{i}$. Therefore, there are at least three cases to be analyzed, namely, when s(i) ≈ 1 or s(i) ≈ 0 or s(i) ≈ −1. For the cluster index s(i) to be close to one we require a(i) ≪ b(i). As a(i) is a measure of how dissimilar i is to its own sub-network, a small value means it is well matched. Furthermore, a large b(i) implies that i is badly matched to its neighboring sub-network. Thus, a cluster index s(i) close to one means that the gene is appropriately clustered. If s(i) is close to negative one, then by the same logic we see that $x_{t}^{i}$ would be more appropriately clustered in its neighboring sub-network. A cluster index s(i) near zero means that the gene is on the border of two sub-networks. In other words, the cluster index s(i) can be interpreted as the fitness of the time series $x_{t}^{i}$ to the assigned sub-network.
The average cluster index s(i) of a sub-network is a measure of how tightly grouped all the genes in the sub-network are. Thus, the average cluster index s(i) of the entire dataset is a measure of how appropriately the genes have been clustered from a topological point of view and in terms of Granger causality.
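The cluster index can be illustrated with the standard silhouette formula of[28], s(i) = (b(i) − a(i)) / max{a(i), b(i)}, which is consistent with the bounds and the three cases discussed above. The sketch below is an illustration under assumptions, not the authors' variant: the dissimilarity of a time series to a sub-network is taken as the average pairwise distance to its members, a series that is alone in its sub-network receives s(i) = 0, and at least two sub-networks are required. The function names are hypothetical.

```python
def cluster_index(dist, labels, i):
    """Silhouette-style index s(i) from a pairwise distance matrix and cluster labels."""
    n = len(labels)
    own = labels[i]
    same = [j for j in range(n) if labels[j] == own and j != i]
    if not same:
        return 0.0  # convention for a series that is alone in its sub-network
    a = sum(dist[i][j] for j in same) / len(same)
    others = []
    for c in set(labels):
        if c == own:
            continue
        members = [j for j in range(n) if labels[j] == c]
        others.append(sum(dist[i][j] for j in members) / len(members))
    b = min(others)  # requires at least two sub-networks
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def average_cluster_index(dist, labels):
    """Average of s(i) over the whole dataset."""
    return sum(cluster_index(dist, labels, i) for i in range(len(labels))) / len(labels)


def best_partition(dist, candidate_labelings):
    """Pick the labeling (for example, one per candidate k) with the largest average index."""
    return max(candidate_labelings, key=lambda labels: average_cluster_index(dist, labels))
```

In practice one would run the clustering for each candidate k, compute the average cluster index of each resulting partition, and keep the k with the largest value.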
Estimation of the number of clusters in biological data
Network construction
The network connecting clusters is constructed following procedures previously described[20, 21]. Briefly, after Classification Expectation Maximization (CEM)[29], Principal Component Analysis (PCA) is used to remove redundancy and to extract the eigen-time series from each cluster. PCA allows us to keep only the most significant components leading to variability in the dataset, thus reducing the number of variables for subsequent processing. In this study, we retained only components accounting for more than 5% of the temporal variance in each cluster[22]. The eigen-time series are then clustered as described in the section Functional clustering and the network can be inferred by applying the method proposed by[20, 21].
where $\widehat{\rho}$ is the sample canonical correlation between the sets $\mathbf{X}_{t}^{i}$ and $\mathbf{X}_{t-1}^{j}$ partialized by all information contained in X_{ t } minus the set $\mathbf{X}_{t-1}^{j}$.
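The extraction of eigen-time series can be sketched as a PCA that keeps only the components explaining more than 5% of the variance of a cluster. This is a minimal sketch assuming NumPy and a data matrix with time points in rows and genes in columns. The function name `eigen_time_series` and the parameter `min_fraction` are hypothetical.

```python
import numpy as np


def eigen_time_series(X, min_fraction=0.05):
    """Principal component time series of one cluster, keeping components
    that explain more than `min_fraction` of the total variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each gene
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)        # variance fraction per component
    keep = explained > min_fraction
    scores = U[:, keep] * s[keep]              # one column per retained eigen-time series
    return scores, explained[keep]
```

The returned columns have the same length as the original time series and can be used in place of the individual genes of the cluster in subsequent steps.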
Then, test
$\text{H}_{0}:\text{CCA}(\mathbf{X}_{t}^{i},\mathbf{X}_{t-1}^{j} \mid \mathbf{X}_{t}\setminus \{\mathbf{X}_{t-1}^{j}\})=\widehat{\rho}=0$ (Granger non-causality)
$\text{H}_{1}:\text{CCA}(\mathbf{X}_{t}^{i},\mathbf{X}_{t-1}^{j} \mid \mathbf{X}_{t}\setminus \{\mathbf{X}_{t-1}^{j}\})=\widehat{\rho}\ne 0$ (Granger causality), where H_{0} and H_{1} are the null and alternative hypotheses, respectively.
Simulations
For each scenario, four time series lengths were used: 50, 75, 100 and 200 time points. The number of repetitions for each scenario is 1,000. The synthetic gene expression time series data in sub-networks A, B, C and D were generated by the equations described below.
for i = 1,…,20.
Actual biological data
In order to illustrate an application of the proposed approach, a dataset collected by[30] was used. The work presents whole genome gene expression data during the cell division cycle of a human cancer cell line (HeLa) characterized using cDNA microarrays. The dataset contains three complete cell cycles of ∼16 hours each, with a total of 48 time points distributed at intervals of one hour. The full dataset is available at: http://genome-www.stanford.edu/Human-CellCycle/HeLa/.
Results
Simulated data
In order to study the properties of the proposed functional clustering method and to check its consistency, we performed four simulations with distinct network characteristics in terms of structure and Granger causality.
Frequency of the selected number of clusters for each scenario and time series length
Time series length/Number of clusters | 1 | 2 | 3 | 4 | 5 | 6 | silhouette width | |
---|---|---|---|---|---|---|---|---|
Scenario 1 | ||||||||
50 | 0 | 0 | 48 | 700 | 252 | 0 | 0.502 (0.098) | |
75 | 0 | 0 | 1 | 785 | 214 | 0 | 0.582 (0.054) | |
100 | 0 | 0 | 3 | 805 | 192 | 0 | 0.610 (0.042) | |
200 | 0 | 0 | 4 | 825 | 171 | 0 | 0.641 (0.034) | |
Scenario 2 | ||||||||
50 | 0 | 0 | 65 | 713 | 222 | 0 | 0.479 (0.112) | |
75 | 0 | 0 | 28 | 760 | 212 | 0 | 0.555 (0.071) | |
100 | 0 | 0 | 9 | 834 | 157 | 0 | 0.587 (0.050) | |
200 | 0 | 0 | 3 | 883 | 114 | 0 | 0.621 (0.029) | |
Scenario 3 | ||||||||
50 | 0 | 0 | 63 | 666 | 271 | 0 | 0.461 (0.123) | |
75 | 0 | 0 | 18 | 784 | 198 | 0 | 0.552 (0.078) | |
100 | 0 | 0 | 8 | 851 | 141 | 0 | 0.586 (0.050) | |
200 | 0 | 0 | 6 | 883 | 111 | 0 | 0.618 (0.031) | |
Scenario 4 | ||||||||
50 | 0 | 0 | 53 | 686 | 261 | 0 | 0.465 (0.110) | |
75 | 0 | 0 | 17 | 786 | 197 | 0 | 0.551 (0.075) | |
100 | 0 | 0 | 11 | 815 | 174 | 0 | 0.581 (0.055) | |
200 | 0 | 0 | 6 | 887 | 107 | 0 | 0.619 (0.033) |
Average of the percentage of correctly clustered time-series in 1,000 repetitions given the correct number of clusters
Scenario/Time series length | 50 | 75 | 100 | 200 |
---|---|---|---|---|
1 | 78.8 | 96.0 | 98.9 | 99.9 |
2 | 72.9 | 91.2 | 95.8 | 99.2 |
3 | 71.6 | 90.6 | 95.2 | 99.7 |
4 | 68.9 | 88.7 | 93.7 | 99.1 |
Percentage of edges for time series lengths equal to 50/75/100/200 when the estimated number of clusters was correctly identified as four
from/to | A | B | C | D |
---|---|---|---|---|
Scenario 1 | ||||
A | 100/100/100/100 | 6.7/6.3/5.2/5.4 | 8.9/6.0/5.0/5.3 | 4.8/5.7/5.4/4.5 |
B | 6.9/7.1/5.5/6.8 | 99.9/100/100/100 | 7.8/6.2/6.3/4.6 | 5.6/6.9/4.9/5.6 |
C | 7.6/5.9/6.5/5.6 | 6.9/7.7/4.7/5.1 | 100/100/100/100 | 4.9/5.4/5.7/5.8 |
D | 6.2/5.3/5.1/4.7 | 5.3/5.2/5.3/5.7 | 7.0/5.2/5.2/5.6 | 100/100/100/100 |
Scenario 2 | ||||
A | 100/100/100/100 | 28.9/59.8/80.4/99.7 | 8.0/6.4/6.8/5.2 | 6.4/6.6/5.0/5.0 |
B | 5.4/5.3/5.5/4.6 | 100/100/100/100 | 29.6/60.9/82.1/99.9 | 6.4/6.3/5.7/5.7 |
C | 7.5/5.4/6.7/4.5 | 8.8/6.6/6.6/6.3 | 100/100/100/100 | 23.0/50.4/71.2/99.1 |
D | 17.6/35.5/51.2/95.4 | 6.5/4.2/3.4/5.0 | 12.5/10.4/7.5/5.0 | 100/100/100/100 |
Scenario 3 | ||||
A | 100/100/100/100 | 29.6/61.9/82.1/100 | 7.8/7.3/4.5/5.0 | 7.4/6.8/4.6/5.2 |
B | 28.5/53.0/78.0/99.9 | 100/100/100/100 | 31.8/61.1/82.9/99.9 | 7.0/7.1/6.2/4.7 |
C | 8.4/6.9/6.4/5.6 | 7.6/7.8/7.3/5.2 | 99.9/100/100/100 | 25.5/46.8/70.6/99.3 |
D | 6.8/5.6/5.8/5.0 | 5.5/4.5/5.7/4.3 | 13.9/8.2/6.1/5.4 | 100/100/100/100 |
Scenario 4 | ||||
A | 100/100/100/100 | 25.1/52.6/75.8/99.6 | 22.9/41.8/59.5/96.0 | 6.8/5.8/5.2/4.7 |
B | 6.7/5.9/5.7/5.9 | 100/100/100/100 | 28.6/58.4/81.9/100 | 7.9/6.0/6.1/5.2 |
C | 9.3/8.8/6.1/6.2 | 8.8/6.2/6.3/4.5 | 100/100/100/100 | 26.5/53.2/75.4/99.2 |
D | 5.4/5.8/5.1/4.7 | 5.8/5.0/4.2/5.2 | 14.9/11.9/7.9/5.4 | 100/100/100/100 |
Biological data
By applying the method described in section Functional clustering to the biological dataset, the optimum number of sub-networks was identified as three. Notice in Figure 5 that there is a clear breakpoint when the number of clusters is three.
In[10], a network depicting Granger interaction among genes from this same gene dataset was presented. The authors analyzed the network in the context of tumor progression and identified gene-gene connections associated with NF-κB, p53, and STAT3. Here, cluster 1 groups not only NF-κB, p53, and STAT3, but also the functionally associated gene BCL-XL, the NF-κB regulator A20 and the targets IAP and IκBα. The presence of NF-κB and fibroblast growth factors (FGFs) and receptors (FGFRs) in the same cluster is also in agreement with the previous work. Members of the FGF family and NF-κB have been shown to interact in various contexts and, despite distinct roles, are involved in cell proliferation, migration and survival[32, 33].
Even though MCL-1 and P21 play important roles in cell survival, and BAI1 is transcriptionally regulated by P53, the analysis run here clustered them separately from the P53-containing cluster. This result suggests that, in the context of this dataset, their interaction is stronger with genes such as c-JUN, also functionally related to cell survival, the proto-oncogene MET and the tumor suppressor MASPIN, for instance. Also worth noticing is the interaction of this cluster with the two members of cluster 3: FGF5 and FOP. Like the other members of the FGF family grouped in cluster 2, FGF5 is involved in cell survival activities, while FOP was originally discovered as a fusion partner with FGFR1 in oncoproteins that give rise to stem cell myeloproliferative disorders. It would be interesting to identify specific details regarding the intensity and direction of the information flow within this cluster for a clearer understanding of their relationship in the context of cell cycle progression.
Discussions
Fujita et al.[20, 21] suggested both a concept of Granger causality for sets of time series and a method for its identification with a statistical test to control the rate of false positives. Although this method is useful for the identification of Granger causality between sets of time series in Bioinformatics and Neuroscience[22], the application was limited to pre-defined clusters, i.e., the time series composing each cluster needed to be previously known. We developed an objective method to define clusters based on the intuitive concept that a gene cluster should interact more intensely in terms of Granger causality within itself than with neighboring clusters.
Krishna et al.[34] proposed a Granger causality clustering method based on the structure of a pair-wise network. Their method consists in identifying pairwise Granger causality between gene expression time series and then applying the method proposed by Bader and Hogue (2003) to detect dense regions in the network. The difference between their approach and ours is that they take into account the number of edges and the density of the network, which is given by the number of estimated edges divided by the total number of possible edges. The presence of an edge is determined by a p-value threshold, so the results can change depending on the threshold chosen. In our framework, we take into account the weight of Granger causality between sets of time series in order to identify how close two sets are. Consequently, it is possible to obtain a notion of distance between two clusters based on Granger causality, i.e., a continuous measure (distance in terms of Granger causality) instead of a discrete measure (presence or absence of an edge). Moreover, by using the concept of Granger causality between sets of time series proposed by[20], the concept of density of a network can be easily defined in terms of Granger causality instead of a density based on the number of edges as proposed by[34].
A disadvantage of our method is that it cannot be applied to very large datasets. The larger the number of time series (genes), or the higher the order of the autoregressive process to be analyzed, the higher the chance of generating non-invertible covariance matrices in the calculation of distance (definition 2) and degree (definition 4) between clusters. We believe that this drawback can be overcome through sparse canonical correlation analysis[35], recently proposed in the literature. However, this topic deserves further study before it can be used in both clustering and identification of Granger causality between sets of time series, since penalized methods relying on L1 penalization[35] or kernels[36] may present biased estimators.
We only analyzed the autoregressive process of order one because gene expression time series, possibly due to experimental limitations, are typically short. However, if one is interested in analyzing higher orders, one minus the maximum canonical correlation value among all the tested autoregressive orders can be used as the distance measure between two time series.
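The higher-order distance mentioned above can be sketched for single time series as follows. The text does not specify how each order enters the calculation, so this sketch makes two assumptions: order q is represented by the absolute correlation between x_t^i and x_{t-q}^j alone, and the two directions are averaged after taking the maximum over the tested orders, mirroring the order-one distance. The function names are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def max_lagged_cca(xi, xj, max_order):
    """Largest |corr(x_t^i, x_{t-q}^j)| over the tested orders q = 1..max_order."""
    return max(abs(pearson(xi[q:], xj[:-q])) for q in range(1, max_order + 1))


def granger_distance_max_order(xi, xj, max_order):
    """Assumed form: 1 - average of the two directional maxima."""
    forward = max_lagged_cca(xi, xj, max_order)
    backward = max_lagged_cca(xj, xi, max_order)
    return 1.0 - 0.5 * (forward + backward)
```

Since the maximum is taken over a growing set of orders, the distance can only decrease or stay the same as `max_order` increases.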
The clustering algorithm used here is based on the well-known spectral clustering. Although results were satisfactory, other graph clustering methods may be used. The normalized cuts algorithm proposed by[37], for instance, presents better results in non-Gaussian data sets.
Finally, which biological processes underlie the correlation between time series datasets remains a difficult question to answer. Studies suggest that correlated genes may belong to common pathways or present the same biological function. However, it is also known that methods based exclusively on correlation cannot reconstruct entire gene networks. Further studies in the field of systems biology might be able to answer this question in the future.
Conclusions
We propose a time series clustering approach based on Granger causality and a method to determine the number of clusters that best fit the data. This method consists of (1) the definition of degree and distance, usually used in graph theory but now generalized for time series data analysis in terms of Granger causality; (2) a clustering algorithm based on spectral clustering and (3) a criterion to determine the number of clusters. We demonstrate, by simulations, that our approach is consistent even when the number of genes is greater than the time series’ length.
We believe that this approach can be useful to understand how gene expression time series relate to each other, and therefore help in the functional interpretation of data.
Declarations
Acknowledgements
The supercomputing resource was provided by Human Genome Center (Univ. of Tokyo). This work was supported by FAPESP and CNPq - Brazil and RIKEN - Japan.
Authors’ Affiliations
References
- Ng SK, McLachlan GJ, Wang K, Jones LB-T, Ng S-W: A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics. 2006, 22: 1745-1752. 10.1093/bioinformatics/btl165.
- Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 2003, 34: 166-176. 10.1038/ng1165.
- Shiraishi Y, et al.: Inferring cluster-based networks from differently stimulated multiple time-course gene expression data. Bioinformatics. 2010, 26: 1073-1081. 10.1093/bioinformatics/btq094.
- Stuart JM, Segal E, Koller D, Kim SK: A gene co-expression network for global discovery of conserved genetics modules. Science. 2003, 302: 249-55. 10.1126/science.1087447.
- Yamaguchi R, Yoshida R, Imoto S, Higuchi T, Miyano S: Finding module-based networks with state-space models - mining high-dimensional and short time-course gene expression data. IEEE Signal Process Mag. 2007, 24: 37-46.
- Granger CWJ: Investigating causal relationships by econometric models and cross-spectral methods. Econometrica. 1969, 37: 424-438. 10.2307/1912791.
- Ahmed HA, Mahanta P, Bhattacharyya DK, Kalita JK: GERC: tree based clustering for gene expression data. 11th IEEE Int Conference Bioinf Bioeng. 2011, 299-302.
- Bandyopadhyay S, Bhattacharyya M: A biologically inspired measure for coexpression analysis. IEEE/ACM Trans comput biol bioinf. 2011, 8: 929-942.
- Fujita A, Sato JR, Garay-Malpartida HM, Morettin PA, Sogayar MC, Ferreira CE: Time-varying modeling of gene expression regulatory networks using wavelet dynamic vector autoregressive method. Bioinformatics. 2007a, 23: 1623-1630.
- Fujita A, Sato JR, Garay-Malpartida HM, Yamaguchi R, Miyano S, Sogayar MC, Ferreira CE: Modeling gene expression regulatory networks with the sparse vector autoregressive model. BMC Syst Biol. 2007b, 1: 39. 10.1186/1752-0509-1-39.
- Fujita A, Sato JR, Garay-Malpartida HM, Sogayar MC, Ferreira CE, Miyano S: Modeling nonlinear gene regulatory networks from time-series gene expression data. J Bioinf Comput Biol. 2008, 6: 961-79. 10.1142/S0219720008003746.
- Fujita A, Patriota AG, Sato JR, Miyano S: The impact of measurement error in the identification of regulatory networks. BMC Bioinf. 2009, 10: 412. 10.1186/1471-2105-10-412.
- Guo S, Wu J, Ding M, Feng J: Uncovering interactions in the frequency domain. PLoS Comput Biol. 2008, 4: e1000087. 10.1371/journal.pcbi.1000087.
- Kojima K, Fujita A, Shimamura T, Imoto S, Miyano S: Estimation of nonlinear gene regulatory networks via L1 regularized NVAR from time series gene expression data. Genome Informatics. 2008, 20: 37-51.
- Lozano AC, Abe N, Liu Y, Rosset S: Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics. 2009, 25: i110-i118. 10.1093/bioinformatics/btp199.
- Mukhopadhyay ND, Chatterjee S: Causality and pathway search in microarray time series experiments. Bioinformatics. 2007, 23: 442-449. 10.1093/bioinformatics/btl598.
- Nagarajan R: A note on inferring acyclic network structures using Granger causality tests. Int J Biostatistics. 2009, 5: 10.
- Shojaie A, Michailidis G: Discovering graphical Granger causality using the truncating lasso penalty. Bioinformatics. 2010, 26: i517-i523. 10.1093/bioinformatics/btq377.
- Baccala LA, Sameshima K: Partial directed coherence: A new concept in neural structure determination. Biol Cybernetics. 2001, 84: 463-474. 10.1007/PL00007990.
- Fujita A, Sato JR, Kojima K, Gomes LR, Nagasaki M, Sogayar MC, Miyano S: Identification of Granger causality between gene sets. J Bioinf Comput Biol. 2010a, 8: 679-701. 10.1142/S0219720010004860.
- Fujita A, Kojima K, Patriota AG, Sato JR, Severino P, Miyano S: A fast and robust statistical test based on Likelihood ratio with Bartlett correction to identify Granger causality between gene sets. Bioinformatics. 2010b, 26: 2349-2351. 10.1093/bioinformatics/btq427.
- Sato JR, Fujita A, Cardoso EF, Thomaz CE, Brammer MJ, Amaro E: Analyzing the connectivity between regions of interest: An approach based on cluster Granger causality for fMRI data analysis. NeuroImage. 2010, 52: 1444-1455. 10.1016/j.neuroimage.2010.05.022.
- Tononi G, McIntosh AR, Russel DP, Edelman GM: Functional clustering: identifying strongly interactive brain regions in neuroimaging data. NeuroImage. 1998, 7: 133-149. 10.1006/nimg.1997.0313.
- Edachery J, Sen A, Brandenburg F: Graph clustering using distance-k cliques. Proceedings of the Seventh International Symposium on Graph Drawing. Lecture Notes in Computer Science. vol. Edited by: Smith Y. 1731, Berlin, Heidelberg, Germany: Springer-Verlag GmbH, 1999-1999.
- Ng A, et al.: Advances in Neural Information Processing Systems. 2002, New York: MIT PressGoogle Scholar
- Bhattacharya A, De RK: Divisive correlation clustering algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles. Bioinformatics. 2008, 24: 1359-1366. 10.1093/bioinformatics/btn133.View ArticleGoogle Scholar
- Ihmels J, Bergmann S, Berman J, Barkai N: Comparative gene expression analysis by differential clustering approach: applications to the Candida albicans transcription program. PLoS Genet. 2005, 1: e39-10.1371/journal.pgen.0010039.View ArticleGoogle Scholar
- Rousseeuw PJ: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987, 20: 53-65.View ArticleGoogle Scholar
- Celeux G, Govaert G: A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal. 1992, 14: 315-332. 10.1016/0167-9473(92)90042-E.View ArticleGoogle Scholar
- Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002, 13: 1977-2000. 10.1091/mbc.02-02-0030..View ArticleGoogle Scholar
- Lee TL, Yeh J, Friedman J, Yan B, Yang X, Yeh NT, Waes CV, Chen Z: A signal network involving coactivated NF-κB and STAT3 and altered p53 modulates BAX/BCL-XL expression and promotes cell survival of head and neck squamous cell carcinomas. Int J Cancer. 2008, 122: 1987-1998. 10.1002/ijc.23324.View ArticleGoogle Scholar
- Lungu G, Covaleda L, Mendes O, Martini-Stoica H, Stoica G: FGF-1-induced matrix metalloproteinase-9 expression in breast cancer cells is mediated by increased activities of NF-κB and activating protein-1. Mol Carcinog. 2008, 47: 424-435. 10.1002/mc.20398.View ArticleGoogle Scholar
- Drafahl KA, McAndrew CW, Meyer AN, Haas M, Donoghue DJ: The Receptor Tyrosine Kinase FGFR4 Negatively Regulates NF-kappaB Signaling. PLoS ONE. 2010, 5: e14412-10.1371/journal.pone.0014412.View ArticleGoogle Scholar
- Krishna R, Li CT, Buchanan-Wollaston V: A temporal precedence based clustering method for gene expression microarray data. BMC Bioinf. 2010, 11: 68-10.1186/1471-2105-11-68.View ArticleGoogle Scholar
- Witten DM, Tibshirani R, Hastie T: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2010, 10: 515-534.View ArticleGoogle Scholar
- Hardoon DR, Shawe-Taylor J: Sparse canonical correlation analysis. Machine Learning. 2011, 83: 331-353. 10.1007/s10994-010-5222-7.View ArticleGoogle Scholar
- Shi J, Malik J: Normalized cuts and image segmentation. IEEE Trans Pattern Anal Machine Intelligence. 2000, 22: 888-905. 10.1109/34.868688.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.