 Methodology Article
 Open access
 Published:
Quantitative reproducibility analysis for identifying reproducible targets from highthroughput experiments
BMC Systems Biology volume 11, Article number: 73 (2017)
Abstract
Background
Highthroughput assays are widely used in biological research to select potential targets. One single highthroughput experiment can efficiently study a large number of candidates simultaneously, but is subject to substantial variability. Therefore it is scientifically important to performance quantitative reproducibility analysis to identify reproducible targets with consistent and significant signals across replicate experiments. A few methods exist, but all have limitations.
Methods
In this paper, we propose a new method for identifying reproducible targets. Considering a Bayesian hierarchical model, we show that the test statistics from replicate experiments follow a mixture of multivariate Gaussian distributions, with the one component with zeromean representing the irreproducible targets.
Results
A target is thus classified as reproducible or irreproducible based on its posterior probability belonging to the reproducible components. We study the performance of our proposed method using simulations and a real data example.
Conclusion
The proposed method is shown to have favorable performance in identifying reproducible targets compared to other methods.
Background
In biological research, highthroughput assays, such as microarrays, are widely used to effectively select potential targets by studying a large number of candidates in a single experiment. However a highthroughput assay is often subject to substantial variability. Reproducibility of highthroughput assays, such as the level of agreement across replicate samples, test sites or data analytical platforms, is a concerned topic in scientific applications, and has been discussed in [1] for microarray and [2] for ChIPseq technology. Therefore quantitative analysis for the reproducibility of highthroughput assays is an important exercise for evaluating the reliability and robustness of scientific discoveries across studies.
Reproducibility is nonstandard and unsettled across the sciences. Goodman et al. [3] provides a survey on the papers with the word reproducibility included in titles, abstracts and keywords, and concludes that the interpretation of reproducibility varies among different papers. Goodman et al. [3] further allies the word reproducibility in the papers and classifies them into three terms: methods reproducibility, results reproducibility and inferential reproducibility. In [3], methods reproducibility refers to the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated, such as [1] and [2]; results reproducibility refers to obtaining the same results from the conduct of an independent study whose procedures are as closely matched to the original experiment as possible, such as [4] and [5]; Inferential reproducibility refers to the drawing of qualitatively similar conclusions from either an independent replication of a study or a reanalysis of the original study, such as [1] and [2].
In this paper, our reproducibility analysis aims to identify reproducible targets with consistent and significant signals across replicate studies, which belongs to the category of inferential reproducibility as defined in [3]. Our reproducibility analysis is different from metaanalysis, such as [6] and [7]. Metaanalysis combines the data from multiple studies to gain extra power for identifying targets with signals. The identified targets may not necessarily be significant across all studies.
A few methods have been developed for our reproducibility analysis. Hong et al. [8] proposed a permutation based method through estimating the empirical distribution of the rank product. Benjamini & Heller [9] developed a framework for testing partial conjunction hypothesis that the discovery is true in at least u studies out of total n studies. Most recently, [10] proposed a copula mixture model for estimating the irreproducible discovery rate across studies.
However all existing methods potentially have limitations. The permutation based method [8] can be computationally expensive when dealing with a large number of candidates. Benjamini & Heller method [9] aims at identifying candidates with reproduced signals in a few but not all the studies, which is a related but generally weaker goal than ours. The special case of Benjamini & Heller method testing whether signals are reproduced in all studies is identical to using the largest pvalue. The copula mixture [10] method builds the copula mixture using the rank transformation of the original data, which might be less powerful than modeling the original data with a proper probabilistic model as in our proposed method. A major drawback of both Benjamini & Heller method [9] and the copula mixture [10] method is that they both use the significant score of signals, such as pvalue, without taking into account the directionality of signals, hence is prune to selecting candidates with significant scores but different directions across studies. For example, in the context of two replicate microarray studies with a treatment and a control group, consider genes with significant pvalues in both experiments, but are upregulated in one study and downregulated in the other. Although those genes have inconsistent signals across studies, both methods will likely classify them as reproducible based on pvalues alone. In contrast, our proposed method models the test statistics directly and is expected to correctly classify those genes as irreproducible most of the time.
In this paper, we propose a Bayesian hierarchical model and show the test statistics from replicate studies can be approximated by a mixture of multivariate Gaussian distributions. The proposed Gaussian mixture model classifies the signals into three components: one irreproducible component and two reproducible components for consistent upregulated and downregulated signals respectively. The posterior probability of belonging to the reproducible components is used as a measure for reproducibility.
Methods
For simplicity, we will introduce our method in the context of microarray studies but it can be generalized to studies of other highthroughput assays. We consider I replicate microarray studies for p genes. In this paper, we focus on the situation of two replicate studies I=2, although our method can be readily extended to the case with more than two studies. We assume a study includes two groups, e.g., the treatment and control group, with sample size equal to n _{ ik } for group k, k=1,2, in the ith study. Let x _{ gijk } be the normalized and transformed measurement of gene expression of the jth sample from group k for gene g in the ith study. The test statistics of twosample unpaired ttest for gene g in the ith study is
We present an empirical Bayesian hierarchical model to account for various sources of variability. When the sample size n _{ ik } is reasonably large, say n _{ i1}+n _{ i2}≥30, the test statistics d _{ gi } is well approximated by a normal distribution:
where μ _{ gi } is the expected group mean difference for gene g in the ith study, and \(\delta _{S_{i}}=\tilde {\sigma }_{i}^{1}(1/n_{i1}+1/n_{i2})^{1/2}\) with \(\tilde {\sigma }_{i}\) being the common standard deviation for {x _{ g i j1}}, j=1,2,…,n _{ i1} and {x _{ g i j2}}, j=1,2,…,n _{ i2}. When the sample size is small, the same procedure as in [11] can be applied to construct ztests based on twosample ttests. For simplicity we assume the withingroup betweensample standard deviation is the same for all the genes. The general case can be derived in a similar fashion but a bit more tedious.
For the expected group mean difference μ _{ gi }, we assume it follows
where μ _{ g } is the “true" group mean difference for gene g across all studies and \(\sigma _{g}^{2}\) models the betweenstudy variability due to various experiment conditions.
Furthermore we assume μ _{ g } is from a mixture distribution
where π _{ i }≥0, i=0,1,2, with π _{0}+π _{1}+π _{2}=1, \(\mu _{G_{1}}>0\) and \(\mu _{G_{2}}<0\). The distribution has three components: the null case where there is no differentially expressed gene, the “upregulated” case where the treatment stimulates the gene expression, and the “downregulated” case where the treatment suppresses the gene expression. Generally for microarray studies π _{0}≃1. Similar mixture models have been considered in [11–16]. Particularly we choose to model the cluster of upregulated (or downregulated) genes with a Gaussian distribution for the computational convenience, same as in [12]. Alternative choices include the semiparametric mixture model in [11, 14], mixture of Gaussian distributions in [13, 15] and mixture of tdistributions in [16].
We can show that the test statistics (d _{ g1},d _{ g2}) follow a Gaussian mixture model. The derivations are standard by repeatedly applying the law of total expectation and the law of total variance and thus omitted. The mixture model is
where \(\mathcal {N}(\mathcal {\mu }_{l},\Sigma _{l})\) (l=0,1,2) is the biviariate normal distribution with mean vector \(\mathcal {\mu }_{l}\) and covariance matrix Σ _{ l }. Let I _{2} and J _{2} be the identity matrix and the square matrix of ones respectively, both with order 2. This mixture model classify the candidates into three components: \(\mathcal {N}(\mathcal {\mu }_{0},\Sigma _{0})\) is the irreproducible component with zeromean \(\mathcal {\mu }_{0}=(0,0)^{T}\) and covariance structure \(\Sigma _{0}=\left (\sigma _{g}^{2}+1\right)I_{2}\); \(\mathcal {N}(\mathcal {\mu }_{1},\Sigma _{1})\) and \(\mathcal {N}(\mathcal {\mu }_{2},\Sigma _{2})\) are two reproducible components with \(\mathcal {\mu }_{1}=(\delta _{S_{1}}\mu _{G_{1}}, \delta _{S_{2}}\mu _{G_{1}})>0\) and \(\Sigma _{1}=\left (\sigma _{g}^{2}+1\right)I_{2} + \sigma _{G_{1}}^{2}J_{2}\) representing the upregulated genes, and \(\mathcal {\mu }_{2}=(\delta _{S_{1}}\mu _{G_{2}}, \delta _{S_{2}}\mu _{G_{2}})<0\) and \(\Sigma _{2}=\left (\sigma _{g}^{2}+1\right)I_{2} + \sigma _{G_{2}}^{2}J_{2}\) representing the downregulated genes, where the inequalities are meant to be interpreted componentwise.
Note with increased sample sizes or decreased withingroup betweensample variability, the mean \(\mathcal {\mu }_{1}\) and \(\mathcal {\mu }_{2}\) of the reproducible components move further away from the origin, making the three components more separable. Also note the test statistics from replicate studies have zero correlations in the irreproducible components; in the reproducible components, the correlations become larger when the betweenstudy variability becomes smaller; for all components, the variance is smaller with less betweenstudy variability, resulting in more separable components.
Under the Gaussian mixture model, the posterior probability of (d _{ g1},d _{ g2}) belonging to a component is
where ϕ(··) is the density function of bivariate normal distribution. According to [10], the posterior probability of being in the irreproducible/null component p _{ i0} can be introduced as the individual significant score, namely local false discovery rate. When p _{ g0} is less than a significant level α, gene g is classified as reproducible.
Next, we consider estimation of the unknown parameters
in the mixture model (4) to get the estimate of p _{ g0} for individual genes. It is natural to use the expectationmaximization (EM) algorithm to estimate \(\mathcal {\theta }\) by maximizing the loglikelihood of the data [17], i.e.,
In our algorithm, we start with some initials value for the parameters \(\mathcal {\theta }^{0}\), then iterate between two steps: (1) Evaluate the current posterior probabilities p _{ gl } using the current parameters; (2) Maximize the likelihood estimator given current posterior probabilities. The details of the EM procedures are provided in Appendix. Multiple random initial vaues are used to avoid being trapped at the local maximum.
Simulation studies
In this section, we present numerical simulations to illustrate the performance of our proposed method compared to three existing methods, the copula mixture model [10], Benjamini & Heller method [9], and the rank product method [8]. We use the following model to simulate data
From this model, the mean expression level of gene g for group 1 of study s is modeled as μ _{ g s1}=μ+α _{ g }+β _{ i }+(α β)_{ gi }, where μ is the overall mean; α _{ g } is the main effect of gene g; β _{ i } is the main effect of study i; (α β)_{ gi } is the genestudy interaction. We set μ=0, \(\alpha _{g} \sim \mathcal {N}(0,1)\), β _{ i }=0.1, and \((\alpha \beta)_{gi} \sim \mathcal {N}(0,0.5^{2})\). For nondifferentially expressed genes, the mean expression level for both groups are the same, i.e., μ _{ g s1}=μ _{ g s2}. For differentially expressed genes, (8) models the difference between the two comparison groups as μ _{ g i2}−μ _{ g i1}=δ+γ _{ g }+(γ β)_{ gi }, where δ is the fixed effect of group difference; γ _{ g } is the effect of gene on the group difference; (γ β)_{ gi } is the genestudy interaction of the group difference. We set δ=0, generate γ _{ g } from \(\mathcal {N}(2, 0.5^{2})\) or \(\mathcal {N}(2, 0.5^{2})\) to mimic two possible directions of signals, \((\gamma \beta)_{gi}\sim \mathcal {N}(0,0.5^{2})\). ε _{ gijk } is the random error term, and following the distribution \(\mathcal {N}(0,0.5^{2})\).
For each simulation run, we generate 2 studies. Each study has two groups with 10 samples per group. We generate G=5000 genes per sample and choose the proportions of reproducible genes (γ) from (80%, 60%, 40%, 20%, 10%, 5%, 1%). We apply the proposed method and the three existing methods to the simulated data, and classify the genes as reproducible based on two commonly used significant levels (α) 0.05 and 0.1. The performance of the four compared methods is evaluated by three criteria, i.e., sensitivity, specificity and misclassification rate. Results from 50 simulations are summarized in Tables 1, 2 and 3 respectively. The results shows our proposed method performs the best among the four methods with the smallest misclassification rates (Table 1), highest sensitivity (Table 2) and highest specificities (Table 3).
Results
In this section, we illustrate our proposed method using a real example. This example includes two microarray studies [18] and [19] comparing idiopathic pulmonary fibrosis (IPF) samples with healthy control samples. Data from both studies are obtained from Gene Expression Omnibus [20]. GSE 28042 [18] measures profiles of peripheral blood mononuclear cell (PBMC) for 75 IPF samples and 16 control samples through GeneChip Human 1.0 exon ST arrays, and GSE 33566 [19] measures profiles of peripheral blood RNA for 93 IPF patients and 30 control samples through Agilent Whole Human Genome Oligonucleotide Microarrays. We only consider the overlap 17708 common genes for reproducibility analysis.
We apply our proposed method, the copula mixture model [10] and Benjamini & Heller method [9]. The rank product method [8] is too computationally intensive to be applied to this example and thus excluded from this study. Figures 1, 2 and 3 show the results of selected reproducible genes from the three compared methods respectively (green). In all three figures, the x axis represents the test statistics from GSE 28042 [18], and the y axis represents the test statistics from GSE 33566 [19]. The top 500 reproducible genes selected by three methods are highlighted in green. As shown in Fig. 1, our proposed method only selects genes with consistently significant signals in both studies. Benjamini & Heller method [9] incorrectly identifies 23 genes (the upper left and bottom right corners of Fig. 2) as reproducible, which actually have opposite directions in two studies. The complete list of the 23 genes incorrectly selected by Benjamini & Heller method [9] is provided in Table 4. The copula mixture model [10] selects 7 genes (Table 5) with opposite directions of signals. It’s also noted that the copula mixture model [10] appears to be less powerful in separating the irreproducible and reproducible genes and has incorrectly selected some insignificant genes (see the center of Fig. 3), likely resulting from the rank transformation. Overall, our method performs favorably in identifying reproducible genes.
Conclusion and discussion
This paper proposes a new method for identifying consistent and significant signals across replicate highthroughput experiments. Existing methods ignore the directionality of signals, and can incorrectly identify signals with opposite directions as reproducible ones. Our proposed method considers both the significant scores and directions of signals by modeling the test statistics directly, leading to improved performance in selecting reproducible candidates. When the proposed method is applied to a real data example for identifying reproducible genes in studies of idiopathic pulmonary fibrosis samples, it is shown to have better performance in detecting significant and reproducible genes compared to other methods. Simulations also demonstrate that our method compares favorably to the existing methods.
Appendix
Expectationmaximization (EM) algorithm to estimate model parameters
The algorithm for estimating \(\mathcal {\theta }\) in (6) is an iterative algorithm between Expectation steps and maximization step. We use \(\widehat {\mathcal {\theta }}^{v}\) to denote the estimate at vth iteration. The algorithm includes the following steps:

Step 1: Initial Values Generate the initial values for \(\mathcal {\theta }\) and denote it as \(\widehat {\mathcal {\theta }}^{0}\)

Step 2: ExpectationStep Continue from the vth iteration step with the estimate \(\widehat {\mathcal {\theta }}^{v}\). We can obtain the estimated posterior probability \(\widehat {p_{gl}}^{v}\) of (d _{ g1},d _{ g2}) from (5) by
$$\begin{array}{@{}rcl@{}} \widehat{p_{gl}}^{v}&=&\frac{\widehat{\pi_{l}}^{v}\phi\left(d_{g1},d_{g2}\widehat{\mathcal{\mu}_{l}}^{v},\widehat{\Sigma_{l}}^{v}\right)}{\sum_{\ell=0,1,2}\widehat{\pi_{\ell}}^{v}\phi\left(d_{g1},d_{g2}\widehat{\mathcal{\mu}_{\ell}}^{v},\widehat{\Sigma_{\ell}}^{v}\right)},l=0,1,2. \end{array} $$(9) 
Step 3: MaximizationStep Update the parameter \(\widehat {\mathcal {\theta }}^{v+1}\) by maximizing the loglikelihood function \(\ell (\mathcal {\theta })\) in (7) given the current estimated posterior probability \(\widehat {p_{gl}}^{v}\). The estimated parameters from the maximization are
$$\begin{aligned} \widehat{\pi_{l}}^{v+1}&=\sum_{g=1}^{p}\widehat{p_{gl}}^{v}/p,l=0,1,2.\\ \widehat{\mu_{1}}^{v+1}&=\left(\widehat{\mu_{11}}^{v+1},\widehat{\mu_{12}}^{v+1}\right)= \left(\frac{\sum_{g=1}^{p} \widehat{p_{g2}}^{v} d_{g1}}{\sum_{g=1}^{p} \widehat{p_{g2}}^{v}}, \frac{\sum_{g=1}^{p} \widehat{p_{g2}}^{v} d_{g2}}{\sum_{g=1}^{p} \widehat{p_{g2}}^{v}}\right)\\ \widehat{\mu_{2}}^{v+1}&=\left(\widehat{\mu_{21}}^{v+1},\widehat{\mu_{22}}^{v+1}\right)= \left(\frac{\sum_{g=1}^{p} \widehat{p_{g3}}^{v} d_{g1}}{\sum_{g=1}^{p} \widehat{p_{g3}}^{v}}, \frac{\sum_{g=1}^{p} \widehat{p_{g3}}^{v} d_{g2}}{\sum_{g=1}^{p} \widehat{p_{g3}}^{v}}\right) \\ \widehat{\sigma_{g}^{2}}^{v+1}&= \frac{\sum_{g=1}^{p} \widehat{p_{g1}}^{v}\left(d_{g1}^{2}+d_{g2}^{2}\right)}{2\sum_{g=1}^{p} \widehat{p_{g1}}^{v}} 1 \\ \widehat{\sigma_{G_{1}}^{2}}^{v+1}&= \frac{\sum_{g=1}^{p} \left[\widehat{p_{g2}}^{v} \left(d_{g1}\widehat{\mu_{11}}^{v+1}\right)^{2}+\widehat{p_{g2}}^{v} \left(d_{g2}\widehat{\mu_{12}}^{v+1}\right)^{2}\right]}{2\sum_{g=1}^{p} \widehat{p_{g2}}^{v}} \\ &\quad\frac{\sum_{g=1}^{p} \widehat{p_{g1}}^{v}\left(d_{g1}^{2}+d_{g2}^{2}\right)}{2\sum_{g=1}^{p} \widehat{p_{g1}}^{v}} \\ \widehat{\sigma_{G_{2}}^{2}}^{v+1}&= \frac{\sum_{g=1}^{p} \left[\widehat{p_{g3}}^{v} \left(d_{g1}\widehat{\mu_{21}}^{v+1}\right)^{2}+\widehat{p_{g3}}^{v} \left(d_{g2}\widehat{\mu_{22}}^{v+1}\right)^{2}\right]}{2\sum_{g=1}^{p} \widehat{p_{g3}}^{v}} \\ &\quad\frac{\sum_{g=1}^{p} \widehat{p_{g1}}^{v}\left(d_{g1}^{2}+d_{g2}^{2}\right)}{2\sum_{g=1}^{p} \widehat{p_{g1}}^{v}} \end{aligned} $$ 
Step 4: Solution The algorithm continues between ExpectationStep and MaximizationStep until the following two conditions are satisfied.

1.
The difference between \(\widehat {\mathcal {\theta }}^{v} \) and \(\widehat {\mathcal {\theta }}^{v+1}\) is less than a small value δ _{1} for all their elements;

2.
The change in loglikelihood function \(\ell (\mathcal {\theta })\) between two consecutive iterations does not exceed a small value δ _{2}.

1.
References
Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, De Longueville F, Kawasaki ES, Lee KY, et al. The microarray quality control (maqc) project shows interand intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006; 24(9):1151–61.
Park PJ. Chip–seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009; 10(10):669–80.
Goodman SN, Fanelli D, Ioannidis JP. What does research reproducibility mean?. Sci Transl Med. 2016; 8(341):341–1234112.
Darbani B, Stewart CN. Reproducibility and reliability assays of the gene expressionmeasurements. J Biol Res (Thessaloniki). 2014; 21(1):3.
Parmigiani G, GarrettMayer ES, Anbazhagan R, Gabrielson E. A crossstudy comparison of gene expression studies for the molecular classification of lung cancer. Clin Cancer Res. 2004; 10(9):2922–7.
Choi H, Shen R, Chinnaiyan AM, Ghosh D. A latent variable approach for metaanalysis of gene expression data from multiple microarray experiments. BMC Bioinforma. 2007; 8(1):364.
Parmigiani G, Garrett ES, Anbazhagan R, Gabrielson E. A statistical framework for expressionbased molecular classification in cancer. J R Stat Soc Ser B Stat Methodol. 2002; 64(4):717–36.
Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J. Rankprod: a bioconductor package for detecting differentially expressed genes in metaanalysis. Bioinformatics. 2006; 22(22):2825–7.
Benjamini Y, Heller R, Yekutieli D. Selective inference in complex research. Philos Trans R Soc Lond A Math Phys Eng Sci. 2009; 367(1906):4255–71.
Li Q, Brown JB, Huang H, Bickel PJ. Measuring reproducibility of highthroughput experiments. Annals Appl Stat. 2011; 5:1752–79.
Efron B. Microarrays, empirical bayes and the twogroups model. Stat Sci. 2008; 23(1):1–22.
Chen MH, Ibrahim JG, Chi YY. A new class of mixture models for differential gene expression in dna microarray data. J Stat Plan Infer. 2008; 138(2):387–404.
Najarian K, Zaheri M, Rad AA, Najarian S, Dargahi J. A novel mixture model method for identification of differentially expressed genes from dna microarray data. BMC Bioinforma. 2004; 5(201):201–10.
Newton MA. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004; 5(2):155–76.
Wei Pan JL, Le CT. A mixture model approach to detecting differentially expressed genes with microarray data. Funct Integr Genomics. 2003; 3:117–24.
G.J. McLachlan RWB, Peel D. A mixture modelbased approach to the clustering of microarray expression data. Bininformatics. 2002; 18(3):413–22.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B Methodol. 1977; 38:1–38.
HerazoMaya JD, Noth I, Duncan SR, Kim S, Ma SF, Tseng GC, Feingold E, JuanGuardela BM, Richards TJ, Lussier Y, et al. Peripheral blood mononuclear cell gene expression profiles predict poor outcome in idiopathic pulmonary fibrosis. Sci Transl Med. 2013; 5(205):205–136205136.
Yang IV, Luna LG, Cotter J, Talbert J, Leach SM, Kidd R, Turner J, Kummer N, Kervitsky D, Brown KK, et al. The peripheral blood transcriptome identifies the presence and extent of disease in idiopathic pulmonary fibrosis. PLoS One. 2012; 7(6):37708.
Gene Expression Omnibus. http://www.ncbi.nlm.nih.gov/geo/.
Acknowledgements
We would like thank referees for their time on reviewing this manuscript.
Availability of data and materials
All data are from Gene Expression Omnibus [20].
Author information
Authors and Affiliations
Contributions
All authors equally distributed. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhang, W., Liu, Y., Zhang, M. et al. Quantitative reproducibility analysis for identifying reproducible targets from highthroughput experiments. BMC Syst Biol 11, 73 (2017). https://doi.org/10.1186/s129180170444y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s129180170444y