For simplicity, we introduce our method in the context of microarray studies, but it can be generalized to studies of other high-throughput assays. We consider I replicate microarray studies for p genes. In this paper we focus on the situation of two replicate studies, I=2, although our method can be readily extended to the case of more than two studies. We assume each study includes two groups, e.g., a treatment and a control group, with sample size \(n_{ik}\) for group k, k=1,2, in the i-th study. Let \(x_{gijk}\) be the normalized and transformed measurement of gene expression for gene g of the j-th sample from group k in the i-th study. The test statistic of the two-sample unpaired t-test for gene g in the i-th study is
$${}\begin{aligned} d_{gi}&=\frac{\bar x_{gi2}-\bar x_{gi1}}{s_{gi}},\text{where}\\ \bar x_{gi1}&=\sum_{j=1,\cdots,n_{i1}}x_{gij1}/n_{i1},\bar x_{gi2}=\sum_{j=1,\cdots,n_{i2}}x_{gij2}/n_{i2}\\ s_{gi}&=\left[(1/n_{i1}+1/n_{i2})\left\{\sum_{j=1,\cdots,n_{i1}}(x_{gij1}-\bar x_{gi1})^{2}\right.\right.\\ &\left.\left.\quad+\sum_{j=1,\cdots,n_{i2}}(x_{gij2}-\bar x_{gi2})^{2} \right\}/(n_{i1}+n_{i2}-2)\right]^{1/2} \end{aligned} $$
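As a concrete check, the statistic above can be computed in a vectorized way over genes. This is a minimal sketch on simulated data; the function name and toy values are ours:

```python
import numpy as np

def pooled_t_statistics(x1, x2):
    """Two-sample unpaired t-statistics d_gi, one per gene.

    x1, x2: arrays of shape (p, n_i1) and (p, n_i2) holding the normalized,
    transformed expression values for groups 1 and 2 of one study.
    """
    n1, n2 = x1.shape[1], x2.shape[1]
    xbar1, xbar2 = x1.mean(axis=1), x2.mean(axis=1)
    # Pooled within-group sum of squares, matching s_gi in the formula above.
    ss = ((x1 - xbar1[:, None]) ** 2).sum(axis=1) \
       + ((x2 - xbar2[:, None]) ** 2).sum(axis=1)
    s = np.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (xbar2 - xbar1) / s

# Simulated toy data: 5 genes, 20 samples per group.
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=(5, 20))   # control group
x2 = rng.normal(0.5, 1.0, size=(5, 20))   # treatment group
d = pooled_t_statistics(x1, x2)
```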
We present an empirical Bayes hierarchical model to account for the various sources of variability. When the sample size \(n_{ik}\) is reasonably large, say \(n_{i1}+n_{i2}\geq 30\), the test statistic \(d_{gi}\) is well approximated by a normal distribution:
$$\begin{array}{@{}rcl@{}} d_{gi} | \mu_{gi} \sim \mathcal{N}(\delta_{S_{i}}\mu_{gi}, 1) \end{array} $$
(1)
where \(\mu_{gi}\) is the expected group mean difference for gene g in the i-th study, and \(\delta_{S_{i}}=\tilde{\sigma}_{i}^{-1}(1/n_{i1}+1/n_{i2})^{-1/2}\) with \(\tilde{\sigma}_{i}\) being the common standard deviation of \(\{x_{gij1}\}\), \(j=1,2,\dots,n_{i1}\), and \(\{x_{gij2}\}\), \(j=1,2,\dots,n_{i2}\). When the sample size is small, the same procedure as in [11] can be applied to construct z-tests based on the two-sample t-tests. For simplicity we assume the within-group between-sample standard deviation is the same for all genes; the general case can be derived in a similar fashion but is a bit more tedious.
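The quality of this normal approximation can be probed by simulation. A minimal sketch, with illustrative (not data-derived) values for the sample sizes, \(\tilde{\sigma}_i\) and \(\mu_{gi}\):

```python
import numpy as np

# Monte-Carlo check of the approximation d_gi ~ N(delta * mu, 1): simulate
# many replications of one gene with true mean difference mu and common
# within-group s.d. sigma, then compare the observed mean and s.d. of d
# with delta * mu and 1. All parameter values here are illustrative.
rng = np.random.default_rng(1)
n1, n2, sigma, mu = 25, 25, 1.5, 0.8
delta = (1.0 / sigma) * (1.0 / n1 + 1.0 / n2) ** -0.5

reps = 20000
x1 = rng.normal(0.0, sigma, size=(reps, n1))
x2 = rng.normal(mu, sigma, size=(reps, n2))
ss = ((x1 - x1.mean(1, keepdims=True)) ** 2).sum(1) \
   + ((x2 - x2.mean(1, keepdims=True)) ** 2).sum(1)
s = np.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
d = (x2.mean(1) - x1.mean(1)) / s
```

With \(n_{i1}+n_{i2}=50\ge 30\), the empirical mean and standard deviation of d land close to \(\delta\mu\) and 1, as the approximation predicts.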
For the expected group mean difference \(\mu_{gi}\), we assume it follows
$$\begin{array}{@{}rcl@{}} \mu_{gi} | \mu_{g} \sim \mathcal{N}\left(\mu_{g},\sigma_{g}^{2}\right) \end{array} $$
(2)
where \(\mu_{g}\) is the “true” group mean difference for gene g across all studies and \(\sigma_{g}^{2}\) models the between-study variability due to varying experimental conditions.
Furthermore, we assume \(\mu_{g}\) comes from the mixture distribution
$$\begin{array}{@{}rcl@{}} \mu_{g} \sim \pi_{0}I_{\{0\}} + \pi_{1}\mathcal{N}\left(\mu_{G_{1}},\sigma_{G_{1}}^{2}\right) + \pi_{2}\mathcal{N}\left(\mu_{G_{2}},\sigma_{G_{2}}^{2}\right) \end{array} $$
(3)
where \(\pi_{i}\geq 0\), i=0,1,2, with \(\pi_{0}+\pi_{1}+\pi_{2}=1\), \(\mu_{G_{1}}>0\) and \(\mu_{G_{2}}<0\). The distribution has three components: the null case, where the gene is not differentially expressed; the “up-regulated” case, where the treatment stimulates the gene's expression; and the “down-regulated” case, where the treatment suppresses it. Generally for microarray studies \(\pi_{0}\simeq 1\). Similar mixture models have been considered in [11–16]. In particular, we choose to model the cluster of up-regulated (or down-regulated) genes with a Gaussian distribution for computational convenience, as in [12]. Alternative choices include the semiparametric mixture model in [11, 14], mixtures of Gaussian distributions in [13, 15], and mixtures of t-distributions in [16].
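To build intuition for the prior (3), one can simulate \(\mu_g\) for many genes; the hyperparameter values below are hypothetical:

```python
import numpy as np

# Draw mu_g from the three-component prior (3). The hyperparameters are
# hypothetical; pi_0 is taken close to 1, as is typical for microarrays.
rng = np.random.default_rng(2)
p = 100_000
pi = np.array([0.90, 0.06, 0.04])            # (pi_0, pi_1, pi_2)
mu_G = np.array([0.0, 2.0, -2.0])            # component means
sd_G = np.array([0.0, 0.5, 0.5])             # component s.d.'s (0 = point mass)

comp = rng.choice(3, size=p, p=pi)           # latent component label per gene
mu_g = np.where(comp == 0, 0.0,              # point mass at 0 for null genes
                rng.normal(mu_G[comp], sd_G[comp]))
```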
We can show that the test statistics \((d_{g1},d_{g2})\) follow a Gaussian mixture model. The derivation is standard, repeatedly applying the law of total expectation and the law of total variance, and is thus omitted. The mixture model is
$$ \begin{aligned} (d_{g1},d_{g2}) \sim \pi_{0}\mathcal{N}(\mu_{0},\Sigma_{0})&+\pi_{1}\mathcal{N}(\mathcal{\mu}_{1},\Sigma_{1})\\ &+\pi_{2}\mathcal{N}(\mathcal{\mu}_{2},\Sigma_{2}), \end{aligned} $$
(4)
where \(\mathcal{N}(\mathcal{\mu}_{l},\Sigma_{l})\) (l=0,1,2) is the bivariate normal distribution with mean vector \(\mathcal{\mu}_{l}\) and covariance matrix \(\Sigma_{l}\). Let \(I_{2}\) and \(J_{2}\) be the identity matrix and the square matrix of ones, respectively, both of order 2. This mixture model classifies the candidates into three components: \(\mathcal{N}(\mathcal{\mu}_{0},\Sigma_{0})\) is the irreproducible component, with zero mean \(\mathcal{\mu}_{0}=(0,0)^{T}\) and covariance structure \(\Sigma_{0}=\left(\sigma_{g}^{2}+1\right)I_{2}\); \(\mathcal{N}(\mathcal{\mu}_{1},\Sigma_{1})\) and \(\mathcal{N}(\mathcal{\mu}_{2},\Sigma_{2})\) are two reproducible components, with \(\mathcal{\mu}_{1}=(\delta_{S_{1}}\mu_{G_{1}}, \delta_{S_{2}}\mu_{G_{1}})^{T}>0\) and \(\Sigma_{1}=\left(\sigma_{g}^{2}+1\right)I_{2} + \sigma_{G_{1}}^{2}J_{2}\) representing the up-regulated genes, and \(\mathcal{\mu}_{2}=(\delta_{S_{1}}\mu_{G_{2}}, \delta_{S_{2}}\mu_{G_{2}})^{T}<0\) and \(\Sigma_{2}=\left(\sigma_{g}^{2}+1\right)I_{2} + \sigma_{G_{2}}^{2}J_{2}\) representing the down-regulated genes, where the inequalities are interpreted component-wise.
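The component means and covariances can be assembled directly from these expressions; a sketch with hypothetical parameter values:

```python
import numpy as np

# Hypothetical parameter values for illustration only.
delta = np.array([3.0, 2.5])          # (delta_S1, delta_S2)
mu_G1, mu_G2 = 2.0, -2.0
sg2, sG1_2, sG2_2 = 0.3, 0.25, 0.25   # sigma_g^2, sigma_G1^2, sigma_G2^2

I2, J2 = np.eye(2), np.ones((2, 2))
mu0, Sigma0 = np.zeros(2), (sg2 + 1.0) * I2
mu1, Sigma1 = delta * mu_G1, (sg2 + 1.0) * I2 + sG1_2 * J2
mu2, Sigma2 = delta * mu_G2, (sg2 + 1.0) * I2 + sG2_2 * J2

# Correlation within a reproducible component: it shrinks toward 0 as the
# between-study variance sigma_G1^2 shrinks, as noted in the text.
rho1 = Sigma1[0, 1] / np.sqrt(Sigma1[0, 0] * Sigma1[1, 1])
```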
Note that with increased sample sizes or decreased within-group between-sample variability, the means \(\mathcal{\mu}_{1}\) and \(\mathcal{\mu}_{2}\) of the reproducible components move further from the origin, making the three components more separable. Also note that the test statistics from replicate studies have zero correlation in the irreproducible component; in the reproducible components, the correlation grows as the between-study variability shrinks; and for all components, smaller between-study variability gives a smaller variance, again yielding more separable components.
Under the Gaussian mixture model, the posterior probability that \((d_{g1},d_{g2})\) belongs to component l is
$$\begin{array}{@{}rcl@{}} p_{gl} = \frac{\pi_{l}\phi(d_{g1},d_{g2}|\mathcal{\mu}_{l},\Sigma_{l})}{\sum_{\ell=0,1,2}\pi_{\ell}\phi(d_{g1},d_{g2}|\mathcal{\mu}_{\ell},\Sigma_{\ell}) },l=0,1,2. \end{array} $$
(5)
where ϕ(·|·) is the density function of the bivariate normal distribution. Following [10], the posterior probability \(p_{g0}\) of being in the irreproducible/null component can serve as the individual significance score, namely the local false discovery rate. When \(p_{g0}\) is less than a significance level α, gene g is classified as reproducible.
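Equation (5) translates directly into code. The sketch below uses our own helper names and hypothetical parameter values; in practice the parameters are estimated as described next:

```python
import numpy as np

def bvn_pdf(d, mu, Sigma):
    """Bivariate normal density at each row of d (shape (n, 2))."""
    diff = d - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

def posterior_probs(d, pis, mus, Sigmas):
    """p_gl of Eq. (5): posterior component memberships, shape (n, 3)."""
    dens = np.stack([pi * bvn_pdf(d, m, S)
                     for pi, m, S in zip(pis, mus, Sigmas)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

# Hypothetical parameter values (not estimates from real data).
pis = [0.9, 0.06, 0.04]
mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, -3.0])]
Sigmas = [np.eye(2) * 1.3,
          np.eye(2) * 1.3 + 0.25,    # adds sigma_G1^2 * J_2
          np.eye(2) * 1.3 + 0.25]

# Three toy genes: one near the origin, one up-, one down-regulated.
d = np.array([[0.1, -0.2], [3.2, 2.8], [-3.0, -3.1]])
p = posterior_probs(d, pis, mus, Sigmas)
alpha = 0.05
reproducible = p[:, 0] < alpha       # local-fdr style call at level alpha
```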
Next, we consider estimation of the unknown parameters
$$ \mathcal{\theta}=(\mu_{1}, \mu_{2}, \Sigma_{0},\Sigma_{1},\Sigma_{2}, \pi_{0},\pi_{1},\pi_{2}) $$
(6)
in the mixture model (4) to obtain the estimate of \(p_{g0}\) for individual genes. It is natural to use the expectation-maximization (EM) algorithm to estimate \(\mathcal{\theta}\) by maximizing the log-likelihood of the data [17], i.e.,
$$ \begin{aligned} \ell(\mathcal{\theta})&=\sum_{g=1}^{p}\log\{P(d_{g1},d_{g2}|\mathcal{\theta})\}\\ &= \sum_{g=1}^{p}\log\left\{\sum_{l=0}^{2}\pi_{l}\phi(d_{g1},d_{g2}|\mathcal{\mu}_{l},\Sigma_{l})\right\} \end{aligned} $$
(7)
In our algorithm, we start with initial values \(\mathcal{\theta}^{0}\) for the parameters, then iterate between two steps: (1) evaluate the posterior probabilities \(p_{gl}\) using the current parameter values; (2) update the parameter estimates by maximizing the expected complete-data log-likelihood given the current posterior probabilities. The details of the EM procedure are provided in the Appendix. Multiple random initial values are used to avoid being trapped at a local maximum.
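The two steps can be sketched as a generic EM for a three-component bivariate Gaussian mixture. This sketch does not impose the structural constraints of model (4) (e.g., \(\mathcal{\mu}_{0}\) fixed at the origin and the shared variance terms), which the M-step in the Appendix does; it is only an illustration on simulated data:

```python
import numpy as np

def bvn_pdf(d, mu, Sigma):
    """Bivariate normal density at each row of d (shape (n, 2))."""
    diff = d - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

def em_gmm(d, pis, mus, Sigmas, n_iter=200):
    """Unconstrained EM for a 3-component bivariate Gaussian mixture."""
    pis, mus, Sigmas = list(pis), list(mus), list(Sigmas)
    for _ in range(n_iter):
        # E-step: posterior probabilities p_gl, as in Eq. (5).
        dens = np.stack([pi * bvn_pdf(d, m, S)
                         for pi, m, S in zip(pis, mus, Sigmas)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of weights, means, covs.
        for l in range(3):
            w = resp[:, l]
            pis[l] = w.mean()
            mus[l] = (w[:, None] * d).sum(0) / w.sum()
            diff = d - mus[l]
            Sigmas[l] = (w[:, None, None] * diff[:, :, None]
                         * diff[:, None, :]).sum(0) / w.sum()
    return pis, mus, Sigmas, resp

# Simulate 3000 genes from a well-separated mixture with true pi_0 = 0.9.
rng = np.random.default_rng(3)
true_pis = [0.9, 0.06, 0.04]
true_mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, -3.0])]
true_Sig = [np.eye(2) * 1.3,
            np.eye(2) * 1.3 + 0.25,
            np.eye(2) * 1.3 + 0.25]
labels = rng.choice(3, size=3000, p=true_pis)
d = np.stack([rng.multivariate_normal(true_mus[l], true_Sig[l])
              for l in labels])

# One EM run from one starting point; in practice use several random starts.
pis, mus, Sigmas, resp = em_gmm(
    d, [0.8, 0.1, 0.1],
    [np.zeros(2), np.array([2.0, 2.0]), np.array([-2.0, -2.0])],
    [np.eye(2)] * 3)
```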