Distance correlation screening
Our screening step is based on distance correlation (DC), a measure of dependence between two random vectors that need not have the same dimension [10]. For given random vectors X and Y, let ϕ_x(t) and ϕ_y(s) be the respective characteristic functions; the distance covariance between X and Y is then defined as follows:
$$ \text{dCov}^{2}(\pmb{X}, \pmb{Y}) = \int_{\mathbb{R}^{d_{x}+d_{y}}}\frac{\left|\phi_{\pmb{x},\pmb{y}}(\pmb{t},\pmb{s}) - \phi_{\pmb{x}}(\pmb{t})\phi_{\pmb{y}}(\pmb{s})\right|^{2}}{c_{d_{x}}c_{d_{y}}\|\pmb{t}\|^{1+d_{x}}_{d_{x}}\|\pmb{s}\|^{1+d_{y}}_{d_{y}}}\,d\pmb{t}\,d\pmb{s}, $$
(1)
where d_x and d_y are the dimensions of X and Y, \(c_{d_{x}}=\frac{\pi^{(1+d_{x})/2}}{\Gamma\{(1+d_{x})/2\}}\) and \(c_{d_{y}}=\frac{\pi^{(1+d_{y})/2}}{\Gamma\{(1+d_{y})/2\}}\). Unless otherwise specified, \(\|\pmb{z}\|_{d_{z}}\) denotes the Euclidean norm of \(\pmb{z}\in\mathbb{R}^{d_{z}}\), and \(|\phi|^{2} = \phi\bar{\phi}\) for a complex-valued function ϕ and its conjugate \(\bar{\phi}\).
Similar to Pearson's correlation coefficient, the DC between X and Y is defined as a rescaled distance covariance:
$$ \text{dCor}{(\mathbf{X}, \mathbf{Y})} = \frac{\text{dCov}(\pmb{X}, \pmb{Y})}{\sqrt{\text{dCov}(\pmb{X}, \pmb{X})\text{dCov}(\pmb{Y}, \pmb{Y})}}. $$
(2)
Generally, we have 0≤dCor(X,Y)≤1; unlike Pearson's correlation, DC cannot be negative. One remarkable property of DC is that dCor(X,Y)=0 if and only if X and Y are independent [11–13], which means that DC can also capture nonlinear associations. Given random samples {(X_i, Y_i), i=1,…,n}, a natural estimator of dCov(X,Y) can be obtained as follows:
$$ \widehat{\text{dCov}}^{2}(\pmb{X},\pmb{Y}) = \frac{1}{n^{2}}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}A_{ij}B_{ij}, $$
(3)
where
$$A_{ij}=a_{ij}-\bar{a}_{i}-\bar{a}_{j}+\bar{a},$$
$$B_{ij}=b_{ij}-\bar{b}_{i}-\bar{b}_{j}+\bar{b},$$
with \(a_{ij}=\|\pmb{X}_{i}-\pmb{X}_{j}\|_{d_{X}}\), \(\bar{a}_{i}=\frac{1}{n}\sum\limits_{k=1}^{n}\|\pmb{X}_{k}-\pmb{X}_{i}\|_{d_{X}}\), \(\bar{a}_{j}=\frac{1}{n}\sum\limits_{l=1}^{n}\|\pmb{X}_{l}-\pmb{X}_{j}\|_{d_{X}}\), \(\bar{a}=\frac{1}{n^{2}}\sum\limits_{k=1}^{n}\sum\limits_{l=1}^{n}\|\pmb{X}_{l}-\pmb{X}_{k}\|_{d_{X}}\), \(b_{ij}=\|\pmb{Y}_{i}-\pmb{Y}_{j}\|_{d_{Y}}\), \(\bar{b}_{i}=\frac{1}{n}\sum\limits_{k=1}^{n}\|\pmb{Y}_{k}-\pmb{Y}_{i}\|_{d_{Y}}\), \(\bar{b}_{j}=\frac{1}{n}\sum\limits_{l=1}^{n}\|\pmb{Y}_{l}-\pmb{Y}_{j}\|_{d_{Y}}\), \(\bar{b}=\frac{1}{n^{2}}\sum\limits_{k=1}^{n}\sum\limits_{l=1}^{n}\|\pmb{Y}_{l}-\pmb{Y}_{k}\|_{d_{Y}}\). The sample estimate of DC can be obtained immediately:
$$ \widehat{\text{dCor}}(\pmb{X},\pmb{Y}) = \frac{\widehat{\text{dCov}}(\pmb{X},\pmb{Y})}{\sqrt{\widehat{\text{dCov}}(\pmb{X},\pmb{X})\widehat{\text{dCov}}(\pmb{Y},\pmb{Y})}}. $$
(4)
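For concreteness, here is a minimal R sketch of the estimators in Eqs. (3)-(4): it double-centers the pairwise Euclidean distance matrices and averages their elementwise product. The function names (dcenter, dcov2_hat, dcor_hat) are ours for illustration.

```r
# Double-center a distance matrix: A_ij = a_ij - abar_i - abar_j + abar.
dcenter <- function(d) sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)

# Sample distance covariance of Eq. (3); X and Y may be vectors or matrices
# with one row per observation.
dcov2_hat <- function(X, Y) {
  A <- dcenter(as.matrix(dist(X)))
  B <- dcenter(as.matrix(dist(Y)))
  mean(A * B)                      # (1/n^2) * sum_ij A_ij B_ij
}

# Sample distance correlation of Eq. (4).
dcor_hat <- function(X, Y) {
  sqrt(dcov2_hat(X, Y) / sqrt(dcov2_hat(X, X) * dcov2_hat(Y, Y)))
}

# Sanity check against the reference implementation in the 'energy' package:
# set.seed(1); x <- rnorm(30); y <- x^2 + rnorm(30)
# all.equal(dcor_hat(x, y), energy::dcor(x, y))
```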
One can test for the significance of DC using an approximate t-test proposed by Szekely and Rizzo (2013) [13], which is implemented in the R package energy [14]. Szekely and Rizzo (2013) established the following result in the high-dimensional setting:
$$\mathcal{T}_{n}=\sqrt{v-1}\,\frac{\mathcal{R}^{*}_{n}(\pmb{X},\pmb{Y})}{\sqrt{1-\left(\mathcal{R}^{*}_{n}(\pmb{X},\pmb{Y})\right)^{2}}}\rightarrow t_{df=v-1},$$
where \(\mathcal{R}^{*}_{n}(\pmb{X},\pmb{Y})\) represents a modified distance correlation between X and Y (see Szekely and Rizzo (2013), Eq. 2.10, p. 197) and \(v=\frac{n(n-3)}{2}\). Although the t-approximation above is derived under high dimensions, it also works well in low-dimensional cases (in our problem, the dimensions of X and Y both equal one for each test). To evaluate the performance of the t-approximation under dimension one, we consider two independence settings:
- Setting 1: X_i∼N(0,1), Y_i∼N(0,2), i=1,2,…,50;
- Setting 2: X_i∼Uniform(0,1), Y_i∼Uniform(0,2), i=1,2,…,50.
For each setting, we generated 10,000 data sets and calculated the test statistic \(\mathcal{T}_{n}\) for each data set. Figure 1 compares the sample distribution of \(\mathcal{T}_{n}\) with the asymptotic t distribution (close to a standard normal distribution, since the degrees of freedom v−1 are generally large). Furthermore, we compared the approximate p-value with the permutation p-value (based on 10,000 random shuffles) in 100 replications. As shown in Fig. 2, the approximate p-values are very close to the permutation p-values, indicating satisfactory performance of the t-approximation under low dimensions.
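This check can be reproduced with the energy package; a minimal sketch for one Setting-1 data set is below. We assume N(0,2) specifies a variance of 2, and note that the t-test appears as dcorT.test() in current versions of energy (older versions expose it as dcor.ttest()).

```r
library(energy)

set.seed(1)
x <- rnorm(50, mean = 0, sd = 1)
y <- rnorm(50, mean = 0, sd = sqrt(2))  # assumes N(0,2) denotes variance 2

dcorT.test(x, y)               # t-approximation p-value (Szekely & Rizzo 2013)
dcor.test(x, y, R = 10000)     # permutation p-value from 10,000 shuffles
```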
The distance correlation measure has been applied in previous genomic studies to quantify gene co-expression [15]. Besides DC, several other measures can pick up nonlinear dependence between variables, although each has its own practical limitations. Clark (2013) [16] empirically compared six popular measures, namely Pearson's correlation, Spearman's correlation, distance correlation, mutual information (MI), the maximal information coefficient (MIC) and Hoeffding's D, under a variety of settings, and found that the six methods perform almost equally well in detecting linear correlation. Under nonlinear dependence, however, distance correlation and MIC performed notably better than the other measures. Two considerations led to the choice of DC over MIC in our analysis. First, DC is straightforward to calculate exactly, while MIC relies on a user-defined number of grids for approximation. Second, as pointed out in recent studies [17, 18], DC exhibits more statistical power than MIC under moderate or small sample sizes.
Edge-count test
Our testing step compares two multivariate distributions (of dimension 2 in the DCE analysis). In the statistics literature, there are mainly two types of multivariate tests, namely the multidimensional Kolmogorov–Smirnov (KS) test [19] and the edge-count test [20, 21]. Both methods, however, have practical limitations when applied to real data. The KS test is very conservative, i.e., the null hypothesis is too often not rejected; moreover, with a brute-force algorithm the multidimensional KS test can be prohibitively computationally intensive. The edge-count test is easy to implement, but it is known to be problematic under location and scale alternatives. Recently, Chen and Friedman [9] developed a modified edge-count test that works properly under different alternatives and exhibits substantial power gains over existing edge-count tests. Like other edge-count tests, the new test is based on a similarity graph, such as the minimum spanning tree (MST, [22]), constructed over the pooled samples from the different groups. Generally, if two groups have different distributions, samples will tend to be closer to others from the same group than to those from the other group, so the edges in the MST are more likely to connect samples from the same group. The test rejects the null if the number of between-group edges is significantly smaller than expected.
To be precise, let x_1, x_2,…, x_n and y_1, y_2,…, y_m be i.i.d. samples from two multivariate distributions F_X and F_Y, respectively. We first pool the samples from the two groups and index them by 1,2,…,N=n+m. An MST is then constructed on the pooled samples using Kruskal's algorithm [22]. Unless otherwise specified, G represents the MST (or another similarity graph) as well as its set of edges, and |G| denotes the total number of edges. To present the technical details, we adopt the notation of Chen and Friedman's paper. Let g_i=0 if sample i is from group X and g_i=1 otherwise. For the edge e connecting samples i and j, i.e., e=(i,j), we define:
$$ J_{e} = \left\{\begin{array}{ll} 0 &\text{if \(g_{i}\neq g_{j}\)}\\ 1 &\text{if \(g_{i}=g_{j}=0\)}\\ 2 &\text{if \(g_{i}=g_{j}=1\)}\\ \end{array},\right. $$
(5)
and
$$ R_{k}=\sum\limits_{e\in G}I_{J_{e}=k}, k=0, 1, 2. $$
(6)
Here R_1 and R_2 represent the numbers of edges connecting samples from the same group (group X and group Y, respectively), and R_0 is the number of edges connecting samples from different groups. The new test statistic quantifies the deviation of (R_1, R_2) from its expected value under the null hypothesis H_0. It has the following quadratic form:
$$ S=(R_{1}-\mu_{1}, R_{2}-\mu_{2})\,\Sigma^{-1}\left(\begin{array}{l} R_{1}-\mu_{1} \\ R_{2}-\mu_{2} \end{array}\right), $$
(7)
where μ_1=E(R_1), μ_2=E(R_2) and Σ=Var((R_1, R_2)^T) have the following expressions (see the Appendix of Chen and Friedman's paper for detailed proofs):
$$ \begin{aligned} \mu_{1} &=|G|\frac{n(n-1)}{N(N-1)}, \\ \mu_{2} &=|G|\frac{m(m-1)}{N(N-1)}, \\ \Sigma_{11} &=\mu_{1}(1-\mu_{1})+2C\frac{n(n-1)(n-2)}{N(N-1)(N-2)}\\&\quad+\left(|G|(|G|-1)-2C\right)\frac{n(n-1)(n-2)(n-3)}{N(N-1)(N-2)(N-3)},\\ \Sigma_{22} &=\mu_{2}(1-\mu_{2})+2C\frac{m(m-1)(m-2)}{N(N-1)(N-2)}\\&\quad+\left(|G|(|G|-1)-2C\right)\frac{m(m-1)(m-2)(m-3)}{N(N-1)(N-2)(N-3)}, \\ \Sigma_{12} &=\left(|G|(|G|-1)-2C\right)\frac{nm(n-1)(m-1)}{N(N-1)(N-2)(N-3)}-\mu_{1}\mu_{2}, \end{aligned} $$
where \(C=\frac{1}{2}\sum_{i=1}^{N}|G_{i}|^{2}-|G|\), and G_k stands for the subgraph of G consisting of all edges that connect to node k (so |G_k| is the degree of node k). It was proved that, under the permutation null hypothesis, S asymptotically follows a Chi-square distribution with 2 degrees of freedom [9]. The p-value approximation generally works well for relatively small sample sizes, for instance when min(n,m)=20. In their work, Chen and Friedman also suggested that using k-MST graphs (e.g., 3-MST or 5-MST) may lead to a better p-value approximation in practice.
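To make the construction concrete, the R sketch below builds an MST on the pooled samples and evaluates S with its Chi-square p-value, using only the formulas above. For brevity it uses Prim's algorithm (the paper builds the MST with Kruskal's; with distinct edge weights both yield the same tree), and all function names are ours.

```r
# Prim's algorithm on a full distance matrix; returns an (N-1) x 2 edge list.
prim_mst <- function(d) {
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(0, n - 1, 2)
  for (k in seq_len(n - 1)) {
    dd <- d[in_tree, !in_tree, drop = FALSE]       # tree-to-outside distances
    idx <- which(dd == min(dd), arr.ind = TRUE)[1, ]
    edges[k, ] <- c(which(in_tree)[idx[1]], which(!in_tree)[idx[2]])
    in_tree[edges[k, 2]] <- TRUE
  }
  edges
}

# Two-sample statistic S of Eq. (7) with the moments given above.
# edges: MST edge list; g: 0/1 group labels of the N pooled samples.
edgecount_S <- function(edges, g) {
  N <- length(g); n <- sum(g == 0); m <- sum(g == 1)
  nG  <- nrow(edges)                     # |G|, total number of edges
  deg <- tabulate(c(edges), nbins = N)   # |G_i|, node degrees
  C   <- sum(deg^2) / 2 - nG
  R1  <- sum(g[edges[, 1]] == 0 & g[edges[, 2]] == 0)
  R2  <- sum(g[edges[, 1]] == 1 & g[edges[, 2]] == 1)
  mu1 <- nG * n * (n - 1) / (N * (N - 1))
  mu2 <- nG * m * (m - 1) / (N * (N - 1))
  q   <- (nG * (nG - 1) - 2 * C) / (N * (N - 1) * (N - 2) * (N - 3))
  S11 <- mu1 * (1 - mu1) + 2 * C * n * (n - 1) * (n - 2) / (N * (N - 1) * (N - 2)) +
         q * n * (n - 1) * (n - 2) * (n - 3)
  S22 <- mu2 * (1 - mu2) + 2 * C * m * (m - 1) * (m - 2) / (N * (N - 1) * (N - 2)) +
         q * m * (m - 1) * (m - 2) * (m - 3)
  S12 <- q * n * m * (n - 1) * (m - 1) - mu1 * mu2
  v   <- c(R1 - mu1, R2 - mu2)
  S   <- drop(t(v) %*% solve(matrix(c(S11, S12, S12, S22), 2)) %*% v)
  c(S = S, p.value = pchisq(S, df = 2, lower.tail = FALSE))
}

# Example: two bivariate normal samples with a mean shift, n = m = 20.
set.seed(7)
pooled <- rbind(matrix(rnorm(40), 20), matrix(rnorm(40, mean = 1), 20))
g <- rep(0:1, each = 20)
edgecount_S(prim_mst(as.matrix(dist(pooled))), g)
```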
It is worth noting that Chen and Friedman's method was developed for two-group comparisons; with multiple groups, a sequence of pairwise comparisons would need to be conducted. Recently, we extended Chen and Friedman's test to the multiple-group case and proposed an overall test that compares more than two groups simultaneously. In our technical report [23], it was proved that the test statistic for p groups asymptotically follows a Chi-square distribution with p degrees of freedom under mild regularity conditions. To be precise, for an edge e in graph G, we let
$$ {}\begin{aligned} A_{e} & = \{e\} \cup \{e'\in G: e' \text{ and } e \text{ share a node}\}, \\ B_{e} & = A_{e}\! \cup\! \{e^{\prime\prime} \!\in G\! : \exists~e' \!\in A_{e}, \text{ such that } e^{\prime\prime} \text{ and } e' \text{ share a node}\}, \end{aligned} $$
then the following theorem can be derived:
Theorem 1
If \(|G|=O(N)\), \(\sum_{k=1}^{N}|G_{k}|^{2}-\frac{4|G|^{2}}{N}=O(N)\), \(\sum_{e\in G}|A_{e}||B_{e}|=o(N^{3/2})\), and \(\lim_{N\rightarrow \infty}\frac{n_{i}}{N}=\lambda_{i}\in (0, 1)\), then
$$S:=(R_{1}-\mu_{1}, R_{2}-\mu_{2}, \ldots, R_{p}-\mu_{p})\,\Sigma^{-1}\begin{pmatrix} R_{1}-\mu_{1} \\ R_{2}-\mu_{2} \\ \vdots \\ R_{p}-\mu_{p}\end{pmatrix}\longrightarrow \chi^{2}_{p}, $$
where i=1,…,p is the group index.
The expected values and covariance matrix can be derived as in (7):
$$ \begin{aligned} \mu_{k} &=|G|\frac{n_{k}(n_{k}-1)}{N(N-1)}, \quad 1\leq k\leq p, \\ \Sigma_{kk} &=\mu_{k}(1-\mu_{k})+2C\frac{n_{k}(n_{k}-1)(n_{k}-2)}{N(N-1)(N-2)}\\&\quad+\left(|G|(|G|-1)-2C\right)\frac{n_{k}(n_{k}-1)(n_{k}-2)(n_{k}-3)}{N(N-1)(N-2)(N-3)}, \quad 1\leq k\leq p,\\ \Sigma_{jk} &=\left(|G|(|G|-1)-2C\right)\frac{n_{j}n_{k}(n_{j}-1)(n_{k}-1)}{N(N-1)(N-2)(N-3)}-\mu_{j}\mu_{k}, \quad 1\leq j\neq k\leq p, \end{aligned} $$
where \(N=\sum_{k=1}^{p}n_{k}\) and \(C=\frac{1}{2}\sum_{i=1}^{N}|G_{i}|^{2}-|G|\). The detailed proof of Theorem 1 can be found in the Appendix of Zhang et al. (2017) [23].
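Under the same conventions, a sketch of the p-group statistic of Theorem 1 follows; it reuses prim_mst() and the edge-list format from the two-sample sketch above, with group labels now taking values in 1,…,p.

```r
# Overall p-group statistic of Theorem 1. edges: MST edge list on the pooled
# N samples; g: integer group labels in 1..p.
multigroup_S <- function(edges, g) {
  N  <- length(g); p <- max(g)
  nk <- tabulate(g, nbins = p)                     # group sizes n_k
  nG <- nrow(edges)                                # |G|
  deg <- tabulate(c(edges), nbins = N)             # |G_i|
  C  <- sum(deg^2) / 2 - nG
  same <- g[edges[, 1]] == g[edges[, 2]]
  Rk <- tabulate(g[edges[, 1]][same], nbins = p)   # within-group edge counts
  mu <- nG * nk * (nk - 1) / (N * (N - 1))
  q  <- (nG * (nG - 1) - 2 * C) / (N * (N - 1) * (N - 2) * (N - 3))
  Sigma <- q * outer(nk * (nk - 1), nk * (nk - 1)) - outer(mu, mu)
  diag(Sigma) <- mu * (1 - mu) +
    2 * C * nk * (nk - 1) * (nk - 2) / (N * (N - 1) * (N - 2)) +
    q * nk * (nk - 1) * (nk - 2) * (nk - 3)
  v <- Rk - mu
  S <- drop(t(v) %*% solve(Sigma) %*% v)
  c(S = S, p.value = pchisq(S, df = p, lower.tail = FALSE))
}
```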
Simulation study: edge-count test versus two existing approaches
We performed a simulation study to empirically compare the edge-count test with two existing methods based on Pearson's correlation and mutual information. In particular, we considered the following linear and nonlinear settings, where X and Y represent the expression levels of two genes and the subscripts "1" and "2" stand for two conditions (a sketch generating one replicate from each setting follows the list):
- Linear setting: \((X_{1}, Y_{1})^{T}\sim \mathrm{N}\left[\left(\begin{array}{l} 0\\ 0 \end{array}\right), \left(\begin{array}{ll} 1 & \rho \\ \rho & 1 \end{array}\right)\right]\), \((X_{2}, Y_{2})^{T}\sim \mathrm{N}\left[\left(\begin{array}{l} 0\\ 0 \end{array}\right), \left(\begin{array}{cc} 1 & \rho+\Delta \\ \rho+\Delta & 1 \end{array}\right)\right]\), where ρ=0.3, Δ∈{0.1,0.2,…,0.6};
- Nonlinear setting: X_i∼Uniform(−2,2), \(Y_{i}=X_{i}^{2}+\epsilon_{i}\), \(\epsilon_{i}\sim \mathrm{N}\left(0, \sigma_{i}^{2}\right)\), i=1,2, σ_1=0.5, σ_2=σ_1+Δ, Δ∈{0.1,0.2,…,0.6}.
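For concreteness, one replicate from each setting can be generated as follows (a sketch assuming Δ=0.3; mvrnorm is from the MASS package):

```r
library(MASS)   # for mvrnorm

set.seed(42)
rho <- 0.3; Delta <- 0.3

# Linear setting: the correlation shifts from rho to rho + Delta.
d1 <- mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
d2 <- mvrnorm(100, mu = c(0, 0),
              Sigma = matrix(c(1, rho + Delta, rho + Delta, 1), 2))

# Nonlinear setting: quadratic signal with condition-specific noise scale.
x1 <- runif(100, -2, 2); y1 <- x1^2 + rnorm(100, sd = 0.5)
x2 <- runif(100, -2, 2); y2 <- x2^2 + rnorm(100, sd = 0.5 + Delta)
```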
For each setting, we generated 1,000 data sets with sample sizes n_1=n_2=100, and the three approaches were applied to test for a difference between the two joint distributions. For the edge-count test, we used the 3-MST based on Euclidean distance and computed the p-value with the Chi-square approximation. The R package infotheo [24] was used to estimate the entropies of X_i and Y_i, as well as the mutual information between X_i and Y_i, i=1,2. To evaluate the significance of the change in mutual information, we applied the Fisher's z transformation introduced in Zhang et al. (2012) [25]. To be precise, let H(X_i) be the entropy of X_i and I(X_i, Y_i) the mutual information between X_i and Y_i; then the transformed z_i given below approximately follows a normal distribution with variance \(\frac{1}{n_{i}-3}\):
$$z_{i}=\frac{1}{2}\log\frac{1+I^{*}(X_{i},Y_{i})}{1-I^{*}(X_{i},Y_{i})},$$
where \(I^{*}(X_{i},Y_{i})=\frac{I(X_{i},Y_{i})}{H(X_{i})+H(Y_{i})}\). The p-value can then be obtained from a two-sample z-test, i.e.,
$$p\text{-value}_{MI}=2P\left(Z>\frac{|z_{1}-z_{2}|}{\sqrt{\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}}\right).$$
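A sketch of this MI pipeline in R is given below. infotheo estimates entropy and MI on discretized data; the equal-frequency binning used here is the package default and our assumption, since the exact binning is not specified above.

```r
library(infotheo)

# Fisher's z statistic for the normalized mutual information I* under one
# condition; equal-frequency discretization (infotheo default) is assumed.
mi_z <- function(x, y) {
  dx <- discretize(x); dy <- discretize(y)
  Istar <- mutinformation(dx, dy) / (entropy(dx) + entropy(dy))  # I* above
  0.5 * log((1 + Istar) / (1 - Istar))
}

# Two-sample z test on the change in normalized MI between two conditions.
mi_pvalue <- function(x1, y1, x2, y2) {
  z  <- c(mi_z(x1, y1), mi_z(x2, y2))
  se <- sqrt(1 / (length(x1) - 3) + 1 / (length(x2) - 3))
  2 * pnorm(abs(z[1] - z[2]) / se, lower.tail = FALSE)
}
```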
For each data set, we performed a quantile normalization to match the marginals and tested the hypothesis \(H_{0}: \pmb{F}^{*}_{1}(X_{1},Y_{1})=\pmb{F}^{*}_{2}(X_{2},Y_{2})\) at α=0.05 with the three methods, where \(\pmb{F}^{*}_{i}\) denotes the joint distribution of (X_i, Y_i) after the marginal matching. The accuracy (true positive rate) of each method under each setting is summarized in Fig. 3. All three methods achieved good accuracy in the linear setting (except in the subtle case Δ=0.1), with Pearson's correlation and the edge-count test performing slightly better than mutual information. In the nonlinear (quadratic) setting, the edge-count test substantially outperformed the other two methods, while Pearson's method failed completely to identify the difference.
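The marginal matching mentioned above can be done, for example, with a rank-based quantile normalization; the sketch below maps each margin to standard-normal quantiles so that both conditions share the same marginals and only the joint structure can differ. This particular recipe is our assumption, as the exact normalization procedure is not spelled out here.

```r
# Rank-based quantile normalization of each column of a data matrix
# (an assumed recipe, for illustration only).
match_margins <- function(m) {
  apply(m, 2, function(v) qnorm(rank(v) / (length(v) + 1)))
}

set.seed(3)
x1 <- runif(100, -2, 2); y1 <- x1^2 + rnorm(100, sd = 0.5)
d1_star <- match_margins(cbind(x1, y1))  # marginals now approximately N(0,1)
```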
Our simulation study demonstrates the ability of the edge-count test to capture both linear and nonlinear changes. Generally, the edge-count test performs as well as Pearson's correlation and mutual information in the linear setting but achieves substantially better sensitivity in the nonlinear setting.