Quantitative reproducibility analysis for identifying reproducible targets from high-throughput experiments

Background High-throughput assays are widely used in biological research to select potential targets. One single high-throughput experiment can efficiently study a large number of candidates simultaneously, but is subject to substantial variability. Therefore it is scientifically important to performance quantitative reproducibility analysis to identify reproducible targets with consistent and significant signals across replicate experiments. A few methods exist, but all have limitations. Methods In this paper, we propose a new method for identifying reproducible targets. Considering a Bayesian hierarchical model, we show that the test statistics from replicate experiments follow a mixture of multivariate Gaussian distributions, with the one component with zero-mean representing the irreproducible targets. Results A target is thus classified as reproducible or irreproducible based on its posterior probability belonging to the reproducible components. We study the performance of our proposed method using simulations and a real data example. Conclusion The proposed method is shown to have favorable performance in identifying reproducible targets compared to other methods.


Background
In biological research, high-throughput assays, such as microarrays, are widely used to effectively select potential targets by studying a large number of candidates in a single experiment. However a high-throughput assay is often subject to substantial variability. Reproducibility of high-throughput assays, such as the level of agreement across replicate samples, test sites or data analytical platforms, is a concerned topic in scientific applications, and has been discussed in [1] for microarray and [2] for ChIPseq technology. Therefore quantitative analysis for the reproducibility of high-throughput assays is an important exercise for evaluating the reliability and robustness of scientific discoveries across studies.
*Correspondence: wenfei.zhang@sanofi.com 2 Sanofi, Framingham, MA, USA Full list of author information is available at the end of the article Reproducibility is nonstandard and unsettled across the sciences. Goodman et al. [3] provides a survey on the papers with the word reproducibility included in titles, abstracts and keywords, and concludes that the interpretation of reproducibility varies among different papers. Goodman et al. [3] further allies the word reproducibility in the papers and classifies them into three terms: methods reproducibility, results reproducibility and inferential reproducibility. In [3], methods reproducibility refers to the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated, such as [1] and [2]; results reproducibility refers to obtaining the same results from the conduct of an independent study whose procedures are as closely matched to the original experiment as possible, such as [4] and [5]; Inferential reproducibility refers to the drawing of qualitatively similar conclusions from Table 1 The summary of misclassification rates for the four compared methods under different significant levels (α) and proportions of reproducible genes (γ ) The proposed Method The copula mixture method [10] Benjamini & Heller method [9] The rank product method [8] α=0 either an independent replication of a study or a reanalysis of the original study, such as [1] and [2]. In this paper, our reproducibility analysis aims to identify reproducible targets with consistent and significant signals across replicate studies, which belongs to the category of inferential reproducibility as defined in [3]. Our reproducibility analysis is different from meta-analysis, such as [6] and [7]. Meta-analysis combines the data from multiple studies to gain extra power for identifying targets with signals. The identified targets may not necessarily be significant across all studies.
A few methods have been developed for our reproducibility analysis. Hong et al. [8] proposed a permutation based method through estimating the empirical distribution of the rank product. Benjamini & Heller [9] developed a framework for testing partial conjunction hypothesis that the discovery is true in at least u studies out of total n studies. Most recently, [10] proposed a copula mixture model for estimating the irreproducible discovery rate across studies.
However all existing methods potentially have limitations. The permutation based method [8] can be computationally expensive when dealing with a large number of candidates. Benjamini & Heller method [9] aims at identifying candidates with reproduced signals in a few but not all the studies, which is a related but generally weaker goal than ours. The special case of Benjamini & Heller method testing whether signals are reproduced in all studies is identical to using the largest p-value. The copula mixture [10] method builds the copula mixture using the rank transformation of the original data, which might be less powerful than modeling the original data with a proper probabilistic model as in our proposed method. A major drawback of both Benjamini & Heller method [9] and the copula mixture [10] method is that they both use the significant score of signals, such as p-value, without taking into account the directionality of signals, hence is prune to selecting candidates with significant scores but different directions across studies. For example, in the context of two replicate microarray studies with a treatment and a control group, consider genes with significant p-values in both experiments, but are up-regulated in one study and down-regulated in the other. Although those genes have inconsistent signals across studies, both methods will likely classify them as reproducible based on p-values alone. In contrast, our proposed method models the test statistics directly and is expected to correctly classify those genes as irreproducible most of the time.
In this paper, we propose a Bayesian hierarchical model and show the test statistics from replicate studies can be approximated by a mixture of multivariate Gaussian distributions. The proposed Gaussian mixture model Table 2 The summary of sensitivities for the four compared methods under different significant levels (α) and proportions of reproducible genes (γ ) The proposed Method The copula mixture method [10] Benjamini & Heller method [9] The rank product method [8] α=0  Table 3 The summary of specificities for the four compared methods under different significant levels (α) and proportions of reproducible genes (γ ) The proposed Method The copula mixture method [10] Benjamini & Heller method [9] The rank product method [8] α=0 classifies the signals into three components: one irreproducible component and two reproducible components for consistent up-regulated and down-regulated signals respectively. The posterior probability of belonging to the reproducible components is used as a measure for reproducibility.

Methods
For simplicity, we will introduce our method in the context of microarray studies but it can be generalized to studies of other high-throughput assays. We consider I replicate microarray studies for p genes. In this paper, we focus on the situation of two replicate studies I = 2, although our method can be readily extended to the case with more than two studies. We assume a study includes two groups, e.g., the treatment and control group, with sample size equal to n ik for group k, k=1,2, in the i-th study. Let x gijk be the normalized and transformed measurement of gene expression of the jth sample from group k for gene g in the i-th study. The test statistics of two-sample unpaired t-test for gene g in the i-th study is We present an empirical Bayesian hierarchical model to account for various sources of variability. When the sample size n ik is reasonably large, say n i1 + n i2 ≥ 30, the test statistics d gi is well approximated by a normal distribution: where μ gi is the expected group mean difference for gene g in the i-th study, and δ S i =σ −1 i (1/n i1 + 1/n i2 ) −1/2 withσ i being the common standard deviation for {x gij1 }, j = 1, 2, . . . , n i1 and {x gij2 }, j = 1, 2, . . . , n i2 . When the sample size is small, the same procedure as in [11] can be applied to construct z-tests based on two-sample ttests. For simplicity we assume the within-group betweensample standard deviation is the same for all the genes. The general case can be derived in a similar fashion but a bit more tedious.
For the expected group mean difference μ gi , we assume it follows where μ g is the "true" group mean difference for gene g across all studies and σ 2 g models the between-study variability due to various experiment conditions. Fig. 1 Bivariate plot of test statistics from two studies. The x axis represents the test statistics from GSE 28042 study [18], and the y axis represents the test statistics from GSE 33566 [19]. The green points are the top 500 reproducible genes selected by the proposed method Fig. 2 Bivariate plot of test statistics from two studies. The x axis represents the test statistics from GSE 28042 study [18], and the y axis represents the test statistics from GSE 33566 [19]. The green points are the top 500 reproducible genes selected by the copula mixture model [10] Furthermore we assume μ g is from a mixture distribution where π i ≥ 0, i = 0, 1, 2, with π 0 + π 1 + π 2 = 1, μ G 1 > 0 and μ G 2 < 0. The distribution has three components: the null case where there is no differentially expressed gene, the "up-regulated" case where the treatment stimulates the gene expression, and the "down-regulated" case where the treatment suppresses the gene expression. Generally for microarray studies π 0 1. Similar mixture models have been considered in [11][12][13][14][15][16]. Particularly we choose to model the cluster of up-regulated (or down-regulated) genes with a Gaussian distribution for the computational convenience, same as in [12]. Alternative choices include the semiparametric mixture model in [11,14], mixture of Gaussian distributions in [13,15] and mixture of tdistributions in [16].
We can show that the test statistics (d g1 , d g2 ) follow a Gaussian mixture model. The derivations are standard by repeatedly applying the law of total expectation and the law of total variance and thus omitted. The mixture model is where N (μ l , l ) (l = 0, 1, 2) is the biviariate normal distribution with mean vector μ l and covariance matrix l . Let I 2 and J 2 be the identity matrix and the square matrix of ones respectively, both with order 2. This mixture model classify the candidates into three components: N (μ 0 , 0 ) is the irreproducible component with zeromean μ 0 = (0, 0) T and covariance structure 0 = σ 2 g + 1 I 2 ; N (μ 1 , 1 ) and N (μ 2 , 2 ) are two reproducible components with μ 1 = (δ S 1 μ G 1 , δ S 2 μ G 1 ) > 0 and 1 = σ 2 g + 1 I 2 + σ 2 G 1 J 2 representing the upregulated genes, and μ 2 = (δ S 1 μ G 2 , δ S 2 μ G 2 ) < 0 and 2 = σ 2 g + 1 I 2 + σ 2 G 2 J 2 representing the downregulated genes, where the inequalities are meant to be interpreted component-wise.
Note with increased sample sizes or decreased withingroup between-sample variability, the mean μ 1 and μ 2 of the reproducible components move further away from the origin, making the three components more separable. Also note the test statistics from replicate studies have zero correlations in the irreproducible components; in the reproducible components, the correlations become larger when the between-study variability becomes smaller; for all components, the variance is smaller with less betweenstudy variability, resulting in more separable components.
Under the Gaussian mixture model, the posterior probability of (d g1 , d g2 ) belonging to a component is where φ(·|·) is the density function of bivariate normal distribution. According to [10], the posterior probability of being in the irreproducible/null component p i0 can be introduced as the individual significant score, namely local false discovery rate. When p g0 is less than a significant level α, gene g is classified as reproducible. Fig. 3 Bivariate plot of test statistics from two studies. The x axis represents the t-statistics from GSE 28042 study [18], and the y axis represents t-statistics from GSE 33566 [19]. The green points are the top 500 reproducible genes selected by Benjamini & Heller method [9]  Next, we consider estimation of the unknown parameters θ = (μ 1 , μ 2 , 0 , 1 , 2 , π 0 , π 1 , π 2 ) (6) in the mixture model (4) to get the estimate of p g0 for individual genes. It is natural to use the expectationmaximization (EM) algorithm to estimate θ by maximizing the log-likelihood of the data [17], i.e., In our algorithm, we start with some initials value for the parameters θ 0 , then iterate between two steps: (1) Evaluate the current posterior probabilities p gl using the current parameters; (2) Maximize the likelihood estimator given current posterior probabilities. The details of the EM procedures are provided in Appendix. Multiple random initial vaues are used to avoid being trapped at the local maximum.

Simulation studies
In this section, we present numerical simulations to illustrate the performance of our proposed method compared to three existing methods, the copula mixture model [10], Benjamini & Heller method [9], and the rank product method [8]. We use the following model to simulate data From this model, the mean expression level of gene g for group 1 of study s is modeled as μ gs1 = μ+α g +β i +(αβ) gi , where μ is the overall mean; α g is the main effect of gene g; β i is the main effect of study i; (αβ) gi is the gene-study interaction. We set μ = 0, α g ∼ N (0, 1), β i = 0.1, and (αβ) gi ∼ N (0, 0.5 2 ). For non-differentially expressed genes, the mean expression level for both groups are the same, i.e., μ gs1 = μ gs2 . For differentially expressed genes, (8) models the difference between the two comparison groups as μ gi2 − μ gi1 = δ + γ g + (γβ) gi , where δ is the fixed effect of group difference; γ g is the effect of gene on the group difference; (γβ) gi is the gene-study interaction of the group difference. We set δ = 0, generate γ g from N (2, 0.5 2 ) or N (−2, 0.5 2 ) to mimic two possible directions of signals, (γβ) gi ∼ N (0, 0.5 2 ). gijk is the random error term, and following the distribution N (0, 0.5 2 ).
For each simulation run, we generate 2 studies. Each study has two groups with 10 samples per group. We generate G = 5000 genes per sample and choose the proportions of reproducible genes (γ ) from (80%, 60%, 40%, 20%, 10%, 5%, 1%). We apply the proposed method and the three existing methods to the simulated data, and classify the genes as reproducible based on two commonly used significant levels (α) 0.05 and 0.1. The performance of the four compared methods is evaluated by three criteria, i.e., sensitivity, specificity and misclassification rate. Results from 50 simulations are summarized in Tables 1, 2 and Table 5 The list of 7 selected genes, which are in the list of the top 500 reproducbile genes selected by the copula mixture model [10], but have opposite signs of signals in two studies Gene t-statistics in GSE 28042 [18]  3 respectively. The results shows our proposed method performs the best among the four methods with the smallest misclassification rates (Table 1), highest sensitivity (Table 2) and highest specificities (Table 3).

Results
In this section, we illustrate our proposed method using a real example. This example includes two microarray studies [18] and [19]  We only consider the overlap 17708 common genes for reproducibility analysis. We apply our proposed method, the copula mixture model [10] and Benjamini & Heller method [9]. The rank product method [8] is too computationally intensive to be applied to this example and thus excluded from this study. Figures 1, 2 and 3 show the results of selected reproducible genes from the three compared methods respectively (green). In all three figures, the x axis represents the test statistics from GSE 28042 [18], and the y axis represents the test statistics from GSE 33566 [19]. The top 500 reproducible genes selected by three methods are highlighted in green. As shown in Fig. 1, our proposed method only selects genes with consistently significant signals in both studies. Benjamini & Heller method [9] incorrectly identifies 23 genes (the upper left and bottom right corners of Fig. 2) as reproducible, which actually have opposite directions in two studies. The complete list of the 23 genes incorrectly selected by Benjamini & Heller method [9] is provided in Table 4. The copula mixture model [10] selects 7 genes (Table 5) with opposite directions of signals. It's also noted that the copula mixture model [10] appears to be less powerful in separating the irreproducible and reproducible genes and has incorrectly selected some insignificant genes (see the center of Fig. 3), likely resulting from the rank transformation. Overall, our method performs favorably in identifying reproducible genes.

Conclusion and discussion
This paper proposes a new method for identifying consistent and significant signals across replicate highthroughput experiments. Existing methods ignore the directionality of signals, and can incorrectly identify signals with opposite directions as reproducible ones. Our proposed method considers both the significant scores and directions of signals by modeling the test statistics directly, leading to improved performance in selecting reproducible candidates. When the proposed method is applied to a real data example for identifying reproducible genes in studies of idiopathic pulmonary fibrosis samples, it is shown to have better performance in detecting significant and reproducible genes compared to other methods. Simulations also demonstrate that our method compares favorably to the existing methods.
Step 4: Solution The algorithm continues between Expectation-Step and Maximization-Step until the following two conditions are satisfied.
1. The difference between θ v and θ v+1 is less than a small value δ 1 for all their elements; 2. The change in log-likelihood function (θ) between two consecutive iterations does not exceed a small value δ 2 .