Distance correlation screening
Our screening step is based on distance correlation (DC), a measure of dependence between two random vectors that are not necessarily of the same dimension [10]. For given random vectors X and Y, let \(\phi_{\pmb{x}}(\pmb{t})\) and \(\phi_{\pmb{y}}(\pmb{s})\) be the respective characteristic functions; then the distance covariance between X and Y can be defined as follows:
$$ {}\text{dCov}^{2}(\pmb{X}, \pmb{Y}) = \int_{\mathbb{R}^{d_{x}+d_{y}}}\frac{||\phi_{\pmb{x}, \pmb{y}}(\pmb{t}, \pmb{s}) - \phi_{\pmb{x}}(\pmb{t})\phi_{\pmb{y}}(\pmb{s})||^{2}}{c_{d_{x}}c_{d_{y}}||\pmb{t}||^{1+d_{x}}_{d_{x}}||\pmb{s}||^{1+d_{y}}_{d_{y}}}\,d\pmb{t}\,d\pmb{s}, $$
(1)
where \(d_{x}\) and \(d_{y}\) are the dimensions of X and Y, \(c_{d_{x}}=\frac {\pi ^{(1+d_{x})/2}}{\Gamma \{(1+d_{x})/2\}}\) and \(c_{d_{y}}=\frac {\pi ^{(1+d_{y})/2}}{\Gamma \{(1+d_{y})/2\}}\). Unless otherwise specified, \(||\pmb {z}||_{d_{z}}\) denotes the Euclidean norm of \(\pmb {z}\in \mathbb {R}^{d_{z}}\), and \(||\phi ||^{2} = \phi \bar {\phi }\) for a complex-valued function ϕ and its conjugate \(\bar {\phi }\).
Similar to Pearson’s correlation coefficient, the DC between X and Y is defined as a rescaled distance covariance:
$$ \text{dCor}{(\mathbf{X}, \mathbf{Y})} = \frac{\text{dCov}(\pmb{X}, \pmb{Y})}{\sqrt{\text{dCov}(\pmb{X}, \pmb{X})\text{dCov}(\pmb{Y}, \pmb{Y})}}. $$
(2)
Generally, we have 0≤dCor(X,Y)≤1, in contrast with Pearson’s correlation, which ranges between −1 and 1. One remarkable property of DC is that dCor(X,Y)=0 if and only if X and Y are independent [11–13], indicating that DC can also measure nonlinear associations. With random samples \(\{(\pmb{X}_{i},\pmb{Y}_{i}), i=1,\ldots,n\}\), a natural estimator of dCov(X,Y) can be obtained as follows:
$$ \widehat{\text{dCov}}^{2}(\pmb{X},\pmb{Y}) = \frac{1}{n^{2}}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}A_{ij}B_{ij}, $$
(3)
where
$$A_{ij}=a_{ij}-\bar{a}_{i}-\bar{a}_{j}+\bar{a},$$
$$B_{ij}=b_{ij}-\bar{b}_{i}-\bar{b}_{j}+\bar{b},$$
with \(a_{ij}=||\pmb {X}_{i}-\pmb {X}_{j}||_{d_{X}}\), \(\bar {a}_{i}=\frac {1}{n}\sum \limits _{k=1}^{n}||\pmb {X}_{k}-\pmb {X}_{i}||_{d_{X}}\), \(\bar {a}_{j}=\frac {1}{n}\sum \limits _{l=1}^{n}||\pmb {X}_{l}-\pmb {X}_{j}||_{d_{X}}\), \(\bar {a}=\frac {1}{n^{2}}\sum \limits _{k=1}^{n}\sum \limits _{l=1}^{n}||\pmb {X}_{l}-\pmb {X}_{k}||_{d_{X}}\), \(b_{ij}=||\pmb {Y}_{i}-\pmb {Y}_{j}||_{d_{Y}}\), \(\bar {b}_{i}=\frac {1}{n}\sum \limits _{k=1}^{n}||\pmb {Y}_{k}-\pmb {Y}_{i}||_{d_{Y}}\), \(\bar {b}_{j}=\frac {1}{n}\sum \limits _{l=1}^{n}||\pmb {Y}_{l}-\pmb {Y}_{j}||_{d_{Y}}\), \(\bar {b}=\frac {1}{n^{2}}\sum \limits _{k=1}^{n}\sum \limits _{l=1}^{n}||\pmb {Y}_{l}-\pmb {Y}_{k}||_{d_{Y}}\). The sample estimate of DC follows immediately:
$$ \widehat{\text{dCor}}(\pmb{X},\pmb{Y}) = \frac{\widehat{\text{dCov}}(\pmb{X},\pmb{Y})}{\sqrt{\widehat{\text{dCov}}(\pmb{X},\pmb{X})\widehat{\text{dCov}}(\pmb{Y},\pmb{Y})}}. $$
(4)
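As an illustration of Eqs. (3)–(4), the following minimal R sketch computes the sample distance covariance and correlation directly from double-centered Euclidean distance matrices (the energy package provides equivalent, optimized dcov and dcor functions):

```r
# Sample distance covariance/correlation, Eqs. (3)-(4).
dcov_hat <- function(x, y) {
  a <- as.matrix(dist(x))                    # a_ij = ||X_i - X_j||
  b <- as.matrix(dist(y))                    # b_ij = ||Y_i - Y_j||
  A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)   # A_ij
  B <- sweep(sweep(b, 1, rowMeans(b)), 2, colMeans(b)) + mean(b)   # B_ij
  sqrt(mean(A * B))                          # (1/n^2) sum_ij A_ij B_ij, then sqrt
}
dcor_hat <- function(x, y) {
  dcov_hat(x, y) / sqrt(dcov_hat(x, x) * dcov_hat(y, y))
}

# Example: a quadratic association gives near-zero Pearson correlation
# but a clearly positive distance correlation.
set.seed(1)
x <- runif(100, -1, 1); y <- x^2 + rnorm(100, sd = 0.1)
c(pearson = cor(x, y), dcor = dcor_hat(x, y))
```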
One can test for the significance of DC using an approximate t-test proposed by Szekely and Rizzo (2013) [13], which is implemented in the R package energy [14]. Szekely and Rizzo (2013) established the following result under high dimensions:
$$\mathcal{T}_{n}=\sqrt{v-1}\frac{\mathcal{R}^{*}_{n}(\pmb{X},\pmb{Y})}{\sqrt{1-(\mathcal{R}^{*}_{n}(\pmb{X},\pmb{Y}))^{2}}}\rightarrow t_{df=v-1},$$
where \(\mathcal {R}^{*}_{n}(\pmb {X},\pmb {Y})\) represents a modified distance correlation between X and Y (see Szekely and Rizzo (2013), Eq 2.10, p.197) and \(v=\frac {n(n-3)}{2}\). It is worth noting that although the t-approximation above is derived under high dimensions, it also works well in low-dimensional cases (in our problem, the dimensions of X and Y both equal one for each test). To evaluate the performance of the t-approximation in dimension one, we consider two independence settings:
- Setting 1: \(X_{i}\sim \mathrm{N}(0,1)\), \(Y_{i}\sim \mathrm{N}(0,2)\), i=1,2,…,50,
- Setting 2: \(X_{i}\sim \text{Uniform}(0,1)\), \(Y_{i}\sim \text{Uniform}(0,2)\), i=1,2,…,50.
For each setting, we generated 10,000 data sets and calculated the test statistic \(\mathcal {T}_{n}\) for each data set. Figure 1 compares the sampling distribution of \(\mathcal {T}_{n}\) with the asymptotic t distribution (close to a standard normal distribution, as the degrees of freedom v−1 are generally large). Furthermore, we compared the approximate p-value with the permutation p-value (based on 10,000 random shuffles) in 100 replications. As shown in Fig. 2, the approximate p-values are very close to the permutation p-values, indicating satisfactory performance of the t-approximation in low dimensions.
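As a concrete check under Setting 1, a small R sketch might look as follows; it assumes a recent version of the energy package, in which the t-approximation is exposed as dcorT.test (older releases call it dcor.ttest) and the permutation test as dcor.test:

```r
library(energy)

set.seed(2023)
n <- 50
x <- rnorm(n, mean = 0, sd = 1)          # Setting 1: X_i ~ N(0, 1)
y <- rnorm(n, mean = 0, sd = sqrt(2))    # Y_i ~ N(0, 2), independent of X (variance 2 assumed)

# Approximate p-value from the t-test of Szekely and Rizzo (2013)
p_t <- dcorT.test(x, y)$p.value

# Permutation p-value based on 10,000 random shuffles
p_perm <- dcor.test(x, y, R = 10000)$p.value

c(t_approx = p_t, permutation = p_perm)  # the two p-values should be close under H0
```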
The distance correlation measure has been applied in previous genomic studies to quantify gene co-expression [15]. Besides DC, several other measures can pick up nonlinear dependence between variables, although each has its own practical limitations. Clark (2013) [16] empirically compared six popular measures, including Pearson’s correlation, Spearman’s correlation, distance correlation, mutual information (MI), the maximal information coefficient (MIC), and Hoeffding’s D, under a variety of settings and found that the six methods perform almost equally well in detecting linear correlation. Under nonlinear dependence, however, distance correlation and MIC performed notably better than the other measures. Two considerations led to the choice of DC over MIC in our analysis. First, DC is straightforward to calculate exactly, whereas MIC relies on a user-defined number of grids for approximation. Second, as pointed out in recent studies [17, 18], DC exhibits greater statistical power than MIC under moderate or small sample sizes.
Edge-count test
Our testing step compares two multivariate distributions (of dimension 2 in the DCE analysis). In the statistics literature, there are mainly two types of multivariate tests, namely the multi-dimensional Kolmogorov-Smirnov (KS) test [19] and the edge-count test [20, 21]. Both methods, however, have practical limitations when applied to real data. For instance, the KS test is very conservative, i.e., the null hypothesis is too often not rejected. Also, with a brute-force algorithm, the multi-dimensional KS test can be prohibitively computationally intensive. The edge-count test is easy to implement, but it is known to be problematic under location and scale alternatives. Recently, Chen and Friedman [9] developed a modified version of the edge-count test, which works properly under different alternatives and exhibits substantial power gains over existing edge-count tests. Like other edge-count tests, the new test is based on a similarity graph, such as the minimum spanning tree (MST, [22]), constructed over the pooled samples from the different groups. Generally, if two groups have different distributions, samples tend to be closer to others from the same group than to those from the other group, so the edges in the MST are more likely to connect samples from the same group. The test rejects the null if the number of between-group edges is significantly smaller than expected.
To be precise, we let \(x_{1},x_{2},\ldots,x_{n}\) and \(y_{1},y_{2},\ldots,y_{m}\) be i.i.d. samples from two multivariate distributions \(F_{X}\) and \(F_{Y}\), respectively. We first pool the samples from the two groups and index them by 1,2,…,N=n+m. An MST is then constructed on the pooled samples using Kruskal’s algorithm [22]. Unless otherwise specified, G represents the MST (or other similarity graph) as well as its set of edges, and |G| denotes the total number of edges. To illustrate the technical details, we adopt the notation from Chen and Friedman’s paper. Let \(g_{i}=0\) if sample i is from group X and \(g_{i}=1\) otherwise. For the edge e connecting samples i and j, i.e., e=(i,j), we define:
$$ J_{e} = \left\{\begin{array}{ll} 0 &\text{if \(g_{i}\neq g_{j}\)}\\ 1 &\text{if \(g_{i}=g_{j}=0\)}\\ 2 &\text{if \(g_{i}=g_{j}=1\)}\\ \end{array},\right. $$
(5)
and
$$ R_{k}=\sum\limits_{e\in G}I_{J_{e}=k}, k=0, 1, 2. $$
(6)
Here R1 and R2 represent the numbers of edges connecting samples from the same group (groups X and Y, respectively), and R0 stands for the number of edges connecting samples from different groups. The new test statistic quantifies the deviation of (R1,R2) from its expected value under H0. It has the following quadratic form:
$$ S=(R_{1}-\mu_{1}, R_{2}-\mu_{2})\Sigma^{-1}\left(\begin{array}{l} R_{1}-\mu_{1} \\ R_{2}-\mu_{2} \end{array}\right), $$
(7)
where μ1=E(R1), μ2=E(R2) and \(\Sigma=\text{Var}\{(R_{1},R_{2})^{T}\}\) have the following expressions (see the Appendix of Chen and Friedman’s paper for a detailed proof):
$$ {}\begin{aligned} \mu_{1} &=|G|\frac{n(n-1)}{N(N-1)}, \\ \mu_{2} &=|G|\frac{m(m-1)}{N(N-1)}, \\ \Sigma_{11} &=\mu_{1}(1\,-\,\mu_{1})+2C\frac{n(n-1)(n-2)}{N(N-1)(N-2)}+(|G|(|G|-1)\\&-2C)\frac{n(n-1)(n-2)(n-3)}{N(N-1)(N-2)(N-3)},\\ \Sigma_{22} &=\mu_{2}(1\,-\,\mu_{2})+\!2C\frac{m(m-1)(m-2)}{N(N-1)(N-2)}+(|G|(|G|-1)\\&-2C)\frac{m(m-1)(m-2)(m-3)}{N(N-1)(N-2)(N-3)}, \\ \Sigma_{12} &=(|G|(|G|\,-\,1)\,-\,2C)\frac{nm(n\,-\,1)(m\,-\,1)}{N(N\,-\,1)(N\,-\,2)(N\,-\,3)}-\!\mu_{1}\mu_{2}, \end{aligned} $$
where \(C=\frac {1}{2}{\sum \nolimits }_{i=1}^{N}|G_{i}|^{2}-|G|\), and \(G_{i}\) stands for the subgraph of G consisting of all edges that connect to node i. It was proved that under the permutation null hypothesis, S asymptotically follows a Chi-square distribution with 2 degrees of freedom [9]. The p-value approximation generally works well even for relatively small sample sizes, for instance, when min(n,m)=20. Chen and Friedman also suggested that using k-MST graphs (e.g., 3-MST or 5-MST) may lead to a better p-value approximation in practice.
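To make the construction concrete, the following self-contained R sketch computes R1, R2, and the quadratic statistic of Eq. (7) on a single MST; for brevity the tree is built with a simple Prim-type routine rather than Kruskal’s algorithm, and the Chi-square approximation with 2 degrees of freedom gives the p-value (the gTests R package also provides implementations of graph-based two-sample tests):

```r
prim_mst <- function(D) {                    # MST edge list from a distance matrix
  N <- nrow(D); in_tree <- 1L
  edges <- matrix(0L, N - 1, 2)
  for (k in seq_len(N - 1)) {
    out <- setdiff(seq_len(N), in_tree)
    sub <- D[in_tree, out, drop = FALSE]
    ij  <- which(sub == min(sub), arr.ind = TRUE)[1, ]
    edges[k, ] <- c(in_tree[ij[1]], out[ij[2]])
    in_tree <- c(in_tree, out[ij[2]])
  }
  edges
}

edge_count_test <- function(x, y) {          # x: n x d matrix, y: m x d matrix
  n <- nrow(x); m <- nrow(y); N <- n + m
  g <- c(rep(0L, n), rep(1L, m))             # group labels g_i as in Eq. (5)
  E <- prim_mst(as.matrix(dist(rbind(x, y))))
  nG <- nrow(E)                              # |G| = N - 1 for an MST
  gi <- g[E[, 1]]; gj <- g[E[, 2]]
  R1 <- sum(gi == 0 & gj == 0)               # within-group-X edges
  R2 <- sum(gi == 1 & gj == 1)               # within-group-Y edges
  deg <- tabulate(c(E), nbins = N)           # node degrees |G_i|
  C <- sum(deg^2) / 2 - nG
  mu1 <- nG * n * (n - 1) / (N * (N - 1))
  mu2 <- nG * m * (m - 1) / (N * (N - 1))
  S11 <- mu1 * (1 - mu1) +
    2 * C * n * (n - 1) * (n - 2) / (N * (N - 1) * (N - 2)) +
    (nG * (nG - 1) - 2 * C) * n * (n - 1) * (n - 2) * (n - 3) /
      (N * (N - 1) * (N - 2) * (N - 3))
  S22 <- mu2 * (1 - mu2) +
    2 * C * m * (m - 1) * (m - 2) / (N * (N - 1) * (N - 2)) +
    (nG * (nG - 1) - 2 * C) * m * (m - 1) * (m - 2) * (m - 3) /
      (N * (N - 1) * (N - 2) * (N - 3))
  S12 <- (nG * (nG - 1) - 2 * C) * n * m * (n - 1) * (m - 1) /
    (N * (N - 1) * (N - 2) * (N - 3)) - mu1 * mu2
  Sigma <- matrix(c(S11, S12, S12, S22), 2, 2)
  d <- c(R1 - mu1, R2 - mu2)
  S <- drop(t(d) %*% solve(Sigma) %*% d)     # quadratic form of Eq. (7)
  list(statistic = S, p.value = pchisq(S, df = 2, lower.tail = FALSE))
}

# Example: two 20 x 2 samples differing in scale
set.seed(1)
edge_count_test(matrix(rnorm(40), 20), matrix(rnorm(40, sd = 1.5), 20))
```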
It is worth noting that Chen and Friedman’s method was developed for two-group comparisons. In the case of multiple groups, a sequence of pairwise comparisons would need to be conducted. Recently, we extended Chen and Friedman’s test to the multiple-group case and proposed an overall test to compare more than two groups simultaneously. In our technical report [23], it was proved that the test statistic for p groups asymptotically follows a Chi-square distribution with p degrees of freedom under mild regularity conditions. To be precise, for an edge e in graph G, we let
$$ {}\begin{aligned} A_{e} & = \{e\} \cup \{e'\in G: e' \text{ and } e \text{ share a node}\}, \\ B_{e} & = A_{e}\! \cup\! \{e^{\prime\prime} \!\in G\! : \exists~e' \!\in A_{e}, \text{ such that } e^{\prime\prime} \text{ and } e' \text{ share a node}\}, \end{aligned} $$
then the following theorem can be derived:
Theorem 1
If \(|G|=O(N)\), \({\sum \nolimits }_{k=1}^{N}|G_{k}|^{2}-\frac {4|G|^{2}}{N}=O(N)\), \(|A_{e}||B_{e}|=o(N^{3/2})\), and \({\lim }_{N\rightarrow \infty }\frac {n_{i}}{N}=\lambda _{i}\in (0, 1)\), then
$${}S:=(R_{1}-\mu_{1}, R_{2}-\mu_{2}, \ldots, R_{p}-\mu_{p})\Sigma^{-1}\begin{pmatrix} R_{1}-\mu_{1} \\ R_{2}-\mu_{2} \\ \vdots \\ R_{p}-\mu_{p}\end{pmatrix}\longrightarrow \chi^{2}_{p}, $$
where i=1,…,p is the group index.
The expected values and covariance matrix can be derived as in (7):
$$ {}\begin{aligned} \mu_{k, 1\leq k\leq p} &=|G|\frac{n_{k}(n_{k}-1)}{N(N-1)}, \\ \Sigma_{kk, 1\leq k\leq p} &=\mu_{k}(1-\mu_{k})+2C\frac{n_{k}(n_{k}-1)(n_{k}-2)}{N(N-1)(N-2)}+(|G|(|G|-1)\\&-2C)\frac{n_{k}(n_{k}-1)(n_{k}-2)(n_{k}-3)}{N(N-1)(N-2)(N-3)},\\ \Sigma_{jk, 1\leq j\neq k\leq p} &=(|G|(|G|-1)-2C)\frac{n_{j}n_{k}(n_{j}-1)(n_{k}-1)}{N(N-1)(N-2)(N-3)}-\mu_{j}\mu_{k}, \end{aligned} $$
where \(N={\sum \nolimits }_{k=1}^{p}n_{k}\) and \(C=\frac {1}{2}{\sum \nolimits }_{i=1}^{N}|G_{i}|^{2}-|G|\). The detailed proof for Theorem 1 can be found in the Appendix of Zhang et al. (2017) [23].
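Assuming the two-sample routine sketched above (in particular its prim_mst helper), the p-group statistic of Theorem 1 can be assembled directly from these formulas; the sketch below reports a p-value from the Chi-square distribution with p degrees of freedom:

```r
# Multi-group edge-count statistic (Theorem 1); samples is a list of p matrices,
# each n_k x d, and prim_mst() is the helper defined in the two-sample sketch.
multi_edge_count_test <- function(samples) {
  p  <- length(samples)
  nk <- vapply(samples, nrow, integer(1))
  N  <- sum(nk)
  g  <- rep(seq_len(p), nk)                  # group index of each pooled sample
  E  <- prim_mst(as.matrix(dist(do.call(rbind, samples))))
  nG <- nrow(E)
  deg <- tabulate(c(E), nbins = N)           # node degrees |G_i|
  C   <- sum(deg^2) / 2 - nG
  R   <- vapply(seq_len(p),
                function(k) sum(g[E[, 1]] == k & g[E[, 2]] == k), numeric(1))
  mu  <- nG * nk * (nk - 1) / (N * (N - 1))
  Sigma <- matrix(NA_real_, p, p)
  for (k in seq_len(p)) {
    Sigma[k, k] <- mu[k] * (1 - mu[k]) +
      2 * C * nk[k] * (nk[k] - 1) * (nk[k] - 2) / (N * (N - 1) * (N - 2)) +
      (nG * (nG - 1) - 2 * C) * nk[k] * (nk[k] - 1) * (nk[k] - 2) * (nk[k] - 3) /
        (N * (N - 1) * (N - 2) * (N - 3))
    for (j in seq_len(p)[-k]) {
      Sigma[j, k] <- (nG * (nG - 1) - 2 * C) *
        nk[j] * nk[k] * (nk[j] - 1) * (nk[k] - 1) /
        (N * (N - 1) * (N - 2) * (N - 3)) - mu[j] * mu[k]
    }
  }
  d <- R - mu
  S <- drop(t(d) %*% solve(Sigma) %*% d)
  list(statistic = S, p.value = pchisq(S, df = p, lower.tail = FALSE))
}
```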
Simulation study: edge-count test versus two existing approaches
We performed a simulation study to empirically compare the edge-count test with two existing methods based on Pearson’s correlation and mutual information. Specifically, we considered the following linear and nonlinear settings, where X and Y represent the expression levels of two genes and the subscripts “1” and “2” index the two conditions (a data-generation sketch is given after the list):
- Linear setting: \((X_{1}, Y_{1})^{T}\sim \mathrm {N}\left [\left (\begin {array}{l} 0\\ 0 \end {array}\right), \left (\begin {array}{ll} 1 & \rho \\ \rho & 1 \end {array}\right)\right ]\), \((X_{2}, Y_{2})^{T}\sim \mathrm {N}\left [\left (\begin {array}{l} 0\\ 0 \end {array}\right), \left (\begin {array}{cc} 1 & \rho +\Delta \\ \rho +\Delta & 1 \end {array}\right)\right ]\), where ρ=0.3, Δ∈{0.1,0.2,…,0.6}.
- Nonlinear setting: \(X_{i}\sim \text{Uniform}(-2,2)\), \(Y_{i}=X_{i}^{2}+\epsilon _{i}\), \(\epsilon _{i}\sim \mathrm {N}\left (0, \sigma _{i}^{2}\right)\), i=1,2, σ1=0.5, σ2=σ1+Δ, Δ∈{0.1,0.2,…,0.6}.
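A data-generation sketch for one replicate of each setting (with Δ fixed at 0.3 for illustration; MASS::mvrnorm is used for the bivariate normal draws):

```r
library(MASS)   # for mvrnorm

set.seed(10)
n1 <- n2 <- 100
rho <- 0.3; Delta <- 0.3                     # one value from {0.1, ..., 0.6}

# Linear setting: bivariate normals whose correlation changes by Delta
lin1 <- mvrnorm(n1, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
lin2 <- mvrnorm(n2, mu = c(0, 0), Sigma = matrix(c(1, rho + Delta, rho + Delta, 1), 2))

# Nonlinear setting: quadratic relation whose noise level changes by Delta
s1 <- 0.5; s2 <- s1 + Delta
x1 <- runif(n1, -2, 2); y1 <- x1^2 + rnorm(n1, sd = s1)
x2 <- runif(n2, -2, 2); y2 <- x2^2 + rnorm(n2, sd = s2)
nl1 <- cbind(x1, y1); nl2 <- cbind(x2, y2)
```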
For each setting, we generated 1,000 data sets with sample sizes n1=n2=100, and the three approaches were applied to test for a difference between the two joint distributions. For the edge-count test, we used a 3-MST based on Euclidean distance and computed the p-value using the Chi-square approximation. The R package infotheo [24] was used to estimate the entropies of \(X_{i}\) and \(Y_{i}\), as well as the mutual information between \(X_{i}\) and \(Y_{i}\), i=1,2. To evaluate the significance of the change in mutual information, we applied the Fisher’s z transformation introduced in Zhang et al. (2012) [25]. To be precise, let \(H(X_{i})\) be the entropy of variable \(X_{i}\), and \(I(X_{i},Y_{i})\) be the mutual information between \(X_{i}\) and \(Y_{i}\); then the transformed \(z_{i}\) given below is approximately normally distributed with variance \(\frac {1}{n_{i}-3}\):
$$z_{i}=\frac{1}{2}\log\frac{1+I^{*}(X_{i},Y_{i})}{1-I^{*}(X_{i},Y_{i})},$$
where \(I^{*}(X_{i},Y_{i})=\frac {I(X_{i},Y_{i})}{H(X_{i})+H(Y_{i})}\). The p-value can then be obtained by a two-sample z test, i.e.,
$$p\text{-value}_{MI}=2P\left(Z>\frac{|z_{1}-z_{2}|}{\sqrt{\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}}\right).$$
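Concretely, the mutual-information comparison can be sketched as below; it relies on infotheo’s discretize, entropy, and mutinformation functions (equal-frequency binning is the package default) together with the Fisher-type z transformation above:

```r
library(infotheo)

# z statistic from the normalized mutual information of one condition
mi_z <- function(x, y) {
  dx <- discretize(x); dy <- discretize(y)   # equal-frequency binning by default
  Istar <- mutinformation(dx, dy) / (entropy(dx) + entropy(dy))   # I*(X_i, Y_i)
  0.5 * log((1 + Istar) / (1 - Istar))       # transformed z_i
}

# Two-sample z test for a change in mutual information between conditions
mi_change_test <- function(x1, y1, x2, y2) {
  z <- (mi_z(x1, y1) - mi_z(x2, y2)) /
    sqrt(1 / (length(x1) - 3) + 1 / (length(x2) - 3))
  2 * pnorm(abs(z), lower.tail = FALSE)      # p-value_MI
}

# e.g., using the nonlinear-setting draws generated earlier:
# mi_change_test(x1, y1, x2, y2)
```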
For each data set, we conducted a quantile normalization to match the marginals and tested the hypothesis \(H_{0}: \pmb {F}^{*}_{1}(X_{1},Y_{1})=\pmb {F}^{*}_{2}(X_{2},Y_{2})\) at α=0.05 with the three methods, where \(\pmb {F}^{*}_{i}\) denotes the joint distribution of \((X_{i},Y_{i})\) after the marginal matching. The accuracy (true positive rate) of each method under each setting is summarized in Fig. 3. All three methods achieved good accuracy in the linear setting (except in the subtle case of Δ=0.1), with Pearson’s correlation and the edge-count test performing slightly better than mutual information. In the nonlinear (quadratic) setting, the edge-count test substantially outperformed the other two methods, while Pearson’s method completely failed to identify the difference.
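The marginal matching can be done in several ways; one simple rank-based version consistent with the description above (purely illustrative, as the exact normalization used may differ) maps each margin to standard normal scores within each condition before testing:

```r
# Illustrative rank-based marginal matching (one possible form of the
# quantile normalization described above; the exact procedure may differ).
to_normal_scores <- function(v) qnorm(rank(v) / (length(v) + 1))
match_marginals  <- function(xy) apply(xy, 2, to_normal_scores)

# e.g., before testing H0: F*_1 = F*_2 with the three methods:
# lin1_star <- match_marginals(lin1); lin2_star <- match_marginals(lin2)
```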
Our simulation study demonstrated the capability of the edge-count test to capture both linear and nonlinear changes. Generally, the edge-count test performs as well as Pearson’s correlation and mutual information under the linear setting but achieves substantially better sensitivity under the nonlinear setting.