Quantitative reproducibility analysis for identifying reproducible targets from highthroughput experiments
 Wenfei Zhang^{2}Email author,
 Ying Liu^{1},
 Mindy Zhang^{2},
 Cheng Zhu^{2} and
 Yuefeng Lu^{2}
https://doi.org/10.1186/s129180170444y
© The Author(s) 2017
Received: 17 February 2017
Accepted: 30 June 2017
Published: 11 August 2017
Abstract
Background
Highthroughput assays are widely used in biological research to select potential targets. One single highthroughput experiment can efficiently study a large number of candidates simultaneously, but is subject to substantial variability. Therefore it is scientifically important to performance quantitative reproducibility analysis to identify reproducible targets with consistent and significant signals across replicate experiments. A few methods exist, but all have limitations.
Methods
In this paper, we propose a new method for identifying reproducible targets. Considering a Bayesian hierarchical model, we show that the test statistics from replicate experiments follow a mixture of multivariate Gaussian distributions, with the one component with zeromean representing the irreproducible targets.
Results
A target is thus classified as reproducible or irreproducible based on its posterior probability belonging to the reproducible components. We study the performance of our proposed method using simulations and a real data example.
Conclusion
The proposed method is shown to have favorable performance in identifying reproducible targets compared to other methods.
Keywords
Reproducibility Highthroughput experiment Bayesian classification Empirical Bayes Gaussian mixture EM algorithmBackground
In biological research, highthroughput assays, such as microarrays, are widely used to effectively select potential targets by studying a large number of candidates in a single experiment. However a highthroughput assay is often subject to substantial variability. Reproducibility of highthroughput assays, such as the level of agreement across replicate samples, test sites or data analytical platforms, is a concerned topic in scientific applications, and has been discussed in [1] for microarray and [2] for ChIPseq technology. Therefore quantitative analysis for the reproducibility of highthroughput assays is an important exercise for evaluating the reliability and robustness of scientific discoveries across studies.
Reproducibility is nonstandard and unsettled across the sciences. Goodman et al. [3] provides a survey on the papers with the word reproducibility included in titles, abstracts and keywords, and concludes that the interpretation of reproducibility varies among different papers. Goodman et al. [3] further allies the word reproducibility in the papers and classifies them into three terms: methods reproducibility, results reproducibility and inferential reproducibility. In [3], methods reproducibility refers to the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated, such as [1] and [2]; results reproducibility refers to obtaining the same results from the conduct of an independent study whose procedures are as closely matched to the original experiment as possible, such as [4] and [5]; Inferential reproducibility refers to the drawing of qualitatively similar conclusions from either an independent replication of a study or a reanalysis of the original study, such as [1] and [2].
In this paper, our reproducibility analysis aims to identify reproducible targets with consistent and significant signals across replicate studies, which belongs to the category of inferential reproducibility as defined in [3]. Our reproducibility analysis is different from metaanalysis, such as [6] and [7]. Metaanalysis combines the data from multiple studies to gain extra power for identifying targets with signals. The identified targets may not necessarily be significant across all studies.
A few methods have been developed for our reproducibility analysis. Hong et al. [8] proposed a permutation based method through estimating the empirical distribution of the rank product. Benjamini & Heller [9] developed a framework for testing partial conjunction hypothesis that the discovery is true in at least u studies out of total n studies. Most recently, [10] proposed a copula mixture model for estimating the irreproducible discovery rate across studies.
However all existing methods potentially have limitations. The permutation based method [8] can be computationally expensive when dealing with a large number of candidates. Benjamini & Heller method [9] aims at identifying candidates with reproduced signals in a few but not all the studies, which is a related but generally weaker goal than ours. The special case of Benjamini & Heller method testing whether signals are reproduced in all studies is identical to using the largest pvalue. The copula mixture [10] method builds the copula mixture using the rank transformation of the original data, which might be less powerful than modeling the original data with a proper probabilistic model as in our proposed method. A major drawback of both Benjamini & Heller method [9] and the copula mixture [10] method is that they both use the significant score of signals, such as pvalue, without taking into account the directionality of signals, hence is prune to selecting candidates with significant scores but different directions across studies. For example, in the context of two replicate microarray studies with a treatment and a control group, consider genes with significant pvalues in both experiments, but are upregulated in one study and downregulated in the other. Although those genes have inconsistent signals across studies, both methods will likely classify them as reproducible based on pvalues alone. In contrast, our proposed method models the test statistics directly and is expected to correctly classify those genes as irreproducible most of the time.
In this paper, we propose a Bayesian hierarchical model and show the test statistics from replicate studies can be approximated by a mixture of multivariate Gaussian distributions. The proposed Gaussian mixture model classifies the signals into three components: one irreproducible component and two reproducible components for consistent upregulated and downregulated signals respectively. The posterior probability of belonging to the reproducible components is used as a measure for reproducibility.
Methods
where μ _{ gi } is the expected group mean difference for gene g in the ith study, and \(\delta _{S_{i}}=\tilde {\sigma }_{i}^{1}(1/n_{i1}+1/n_{i2})^{1/2}\) with \(\tilde {\sigma }_{i}\) being the common standard deviation for {x _{ g i j1}}, j=1,2,…,n _{ i1} and {x _{ g i j2}}, j=1,2,…,n _{ i2}. When the sample size is small, the same procedure as in [11] can be applied to construct ztests based on twosample ttests. For simplicity we assume the withingroup betweensample standard deviation is the same for all the genes. The general case can be derived in a similar fashion but a bit more tedious.
where μ _{ g } is the “true" group mean difference for gene g across all studies and \(\sigma _{g}^{2}\) models the betweenstudy variability due to various experiment conditions.
where π _{ i }≥0, i=0,1,2, with π _{0}+π _{1}+π _{2}=1, \(\mu _{G_{1}}>0\) and \(\mu _{G_{2}}<0\). The distribution has three components: the null case where there is no differentially expressed gene, the “upregulated” case where the treatment stimulates the gene expression, and the “downregulated” case where the treatment suppresses the gene expression. Generally for microarray studies π _{0}≃1. Similar mixture models have been considered in [11–16]. Particularly we choose to model the cluster of upregulated (or downregulated) genes with a Gaussian distribution for the computational convenience, same as in [12]. Alternative choices include the semiparametric mixture model in [11, 14], mixture of Gaussian distributions in [13, 15] and mixture of tdistributions in [16].
where \(\mathcal {N}(\mathcal {\mu }_{l},\Sigma _{l})\) (l=0,1,2) is the biviariate normal distribution with mean vector \(\mathcal {\mu }_{l}\) and covariance matrix Σ _{ l }. Let I _{2} and J _{2} be the identity matrix and the square matrix of ones respectively, both with order 2. This mixture model classify the candidates into three components: \(\mathcal {N}(\mathcal {\mu }_{0},\Sigma _{0})\) is the irreproducible component with zeromean \(\mathcal {\mu }_{0}=(0,0)^{T}\) and covariance structure \(\Sigma _{0}=\left (\sigma _{g}^{2}+1\right)I_{2}\); \(\mathcal {N}(\mathcal {\mu }_{1},\Sigma _{1})\) and \(\mathcal {N}(\mathcal {\mu }_{2},\Sigma _{2})\) are two reproducible components with \(\mathcal {\mu }_{1}=(\delta _{S_{1}}\mu _{G_{1}}, \delta _{S_{2}}\mu _{G_{1}})>0\) and \(\Sigma _{1}=\left (\sigma _{g}^{2}+1\right)I_{2} + \sigma _{G_{1}}^{2}J_{2}\) representing the upregulated genes, and \(\mathcal {\mu }_{2}=(\delta _{S_{1}}\mu _{G_{2}}, \delta _{S_{2}}\mu _{G_{2}})<0\) and \(\Sigma _{2}=\left (\sigma _{g}^{2}+1\right)I_{2} + \sigma _{G_{2}}^{2}J_{2}\) representing the downregulated genes, where the inequalities are meant to be interpreted componentwise.
Note with increased sample sizes or decreased withingroup betweensample variability, the mean \(\mathcal {\mu }_{1}\) and \(\mathcal {\mu }_{2}\) of the reproducible components move further away from the origin, making the three components more separable. Also note the test statistics from replicate studies have zero correlations in the irreproducible components; in the reproducible components, the correlations become larger when the betweenstudy variability becomes smaller; for all components, the variance is smaller with less betweenstudy variability, resulting in more separable components.
where ϕ(··) is the density function of bivariate normal distribution. According to [10], the posterior probability of being in the irreproducible/null component p _{ i0} can be introduced as the individual significant score, namely local false discovery rate. When p _{ g0} is less than a significant level α, gene g is classified as reproducible.
In our algorithm, we start with some initials value for the parameters \(\mathcal {\theta }^{0}\), then iterate between two steps: (1) Evaluate the current posterior probabilities p _{ gl } using the current parameters; (2) Maximize the likelihood estimator given current posterior probabilities. The details of the EM procedures are provided in Appendix. Multiple random initial vaues are used to avoid being trapped at the local maximum.
Simulation studies
From this model, the mean expression level of gene g for group 1 of study s is modeled as μ _{ g s1}=μ+α _{ g }+β _{ i }+(α β)_{ gi }, where μ is the overall mean; α _{ g } is the main effect of gene g; β _{ i } is the main effect of study i; (α β)_{ gi } is the genestudy interaction. We set μ=0, \(\alpha _{g} \sim \mathcal {N}(0,1)\), β _{ i }=0.1, and \((\alpha \beta)_{gi} \sim \mathcal {N}(0,0.5^{2})\). For nondifferentially expressed genes, the mean expression level for both groups are the same, i.e., μ _{ g s1}=μ _{ g s2}. For differentially expressed genes, (8) models the difference between the two comparison groups as μ _{ g i2}−μ _{ g i1}=δ+γ _{ g }+(γ β)_{ gi }, where δ is the fixed effect of group difference; γ _{ g } is the effect of gene on the group difference; (γ β)_{ gi } is the genestudy interaction of the group difference. We set δ=0, generate γ _{ g } from \(\mathcal {N}(2, 0.5^{2})\) or \(\mathcal {N}(2, 0.5^{2})\) to mimic two possible directions of signals, \((\gamma \beta)_{gi}\sim \mathcal {N}(0,0.5^{2})\). ε _{ gijk } is the random error term, and following the distribution \(\mathcal {N}(0,0.5^{2})\).
The summary of misclassification rates for the four compared methods under different significant levels (α) and proportions of reproducible genes (γ)
The proposed Method  The copula mixture method [10]  Benjamini & Heller method [9]  The rank product method [8]  

α=0.1  α=0.05  α=0.1  α=0.05  α=0.1  α=0.05  α=0.1  α=0.05  
γ=80%  0.007(0.001)  0.008(0.0012)  0.24(0.0708)  0.271(0.0954)  0.025(0.0022)  0.032(0.0025)  0.197(0.0044)  0.25(0.0036) 
γ=60%  0.007(0.0013)  0.008(0.0013)  0.402(0.0022)  0.404(0.0028)  0.022(0.0017)  0.027(0.002)  0.073(0.0031)  0.099(0.0035) 
γ=40%  0.005(0.001)  0.006(0.001)  0.568(0.0059)  0.541(0.01)  0.016(0.0017)  0.02(0.0019)  0.02(0.0018)  0.028(0.0021) 
γ=20%  0.004(8e04)  0.004(8e04)  0.166(0.0026)  0.186(0.0015)  0.01(0.0014)  0.013(0.0015)  0.004(9e04)  0.006(0.0011) 
γ=10%  0.002(6e04)  0.002(6e04)  0.058(0.0104)  0.077(0.0075)  0.007(0.001)  0.008(0.0011)  0.002(5e04)  0.002(6e04) 
γ=5%  0.001(5e04)  0.001(5e04)  0.011(0.0038)  0.025(0.0042)  0.004(9e04)  0.005(0.001)  0.001(4e04)  0.001(3e04) 
γ=1%  0.001(4e04)  0(4e04)  0.001(6e04)  0.002(9e04)  0.001(7e04)  0.002(7e04)  0.001(4e04)  0(3e04) 
The summary of sensitivities for the four compared methods under different significant levels (α) and proportions of reproducible genes (γ)
The proposed Method  The copula mixture method [10]  Benjamini & Heller method [9]  The rank product method [8]  

α=0.1  α=0.05  α=0.1  α=0.05  α=0.1  α=0.05  α=0.1  α=0.05  
γ=80%  0.992(0.0014)  0.991(0.0016)  0.948(0.0881)  0.905(0.1184)  0.97(0.0027)  0.96(0.0031)  0.754(0.0055)  0.687(0.0045) 
γ=60%  0.99(0.002)  0.988(0.0021)  0.978(0.0071)  0.956(0.0119)  0.966(0.0028)  0.955(0.0033)  0.878(0.0052)  0.836(0.0058) 
γ=40%  0.989(0.0024)  0.987(0.0024)  0.975(0.0069)  0.937(0.0161)  0.962(0.0046)  0.951(0.005)  0.949(0.0045)  0.931(0.0051) 
γ=20%  0.985(0.0037)  0.983(0.004)  0.176(0.0149)  0.069(0.0081)  0.949(0.007)  0.937(0.0076)  0.978(0.0046)  0.972(0.0053) 
γ=10%  0.984(0.0048)  0.982(0.0051)  0.421(0.1033)  0.228(0.0746)  0.934(0.0098)  0.92(0.0108)  0.985(0.0053)  0.982(0.0055) 
γ=5%  0.984(0.0069)  0.983(0.0075)  0.773(0.0741)  0.509(0.0832)  0.925(0.0191)  0.909(0.0195)  0.99(0.0049)  0.988(0.0057) 
γ=1%  0.986(0.0176)  0.984(0.0177)  0.907(0.0592)  0.842(0.0882)  0.866(0.0673)  0.844(0.0706)  0.99(0.0163)  0.99(0.0163) 
The summary of specificities for the four compared methods under different significant levels (α) and proportions of reproducible genes (γ)
The proposed Method  The copula mixture method [10]  Benjamini & Heller method [9]  The rank product method [8]  

α=0.1  α=0.05  α=0.1  α=0.05  α=0.1  α=0.05  α=0.1  α=0.05  
γ=80%  0.996(0.002)  0.997(0.0017)  0.009(0.0058)  0.025(0.0152)  0.994(0.0021)  0.999(0.001)  1(0)  1(0) 
γ=60%  0.998(9e04)  0.999(7e04)  0.029(0.0075)  0.057(0.0144)  0.997(0.0015)  0.999(7e04)  1(0)  1(0) 
γ=40%  0.999(7e04)  0.999(6e04)  0.07(0.0136)  0.139(0.0268)  0.999(7e04)  1(4e04)  1(0)  1(0) 
γ=20%  0.999(4e04)  0.999(3e04)  0.999(9e04)  1(4e04)  1(3e04)  1(1e04)  1(0)  1(0) 
γ=10%  0.999(4e04)  1(3e04)  1(1e04)  1(1e04)  1(1e04)  1(1e04)  1(2e04)  1(1e04) 
γ=5%  1(3e04)  1(3e04)  1(1e04)  1(0)  1(1e04)  1(0)  1(3e04)  1(1e04) 
γ=1%  1(3e04)  1(3e04)  1(1e04)  1(0)  1(0)  1(0)  1(4e04)  1(2e04) 
Results
In this section, we illustrate our proposed method using a real example. This example includes two microarray studies [18] and [19] comparing idiopathic pulmonary fibrosis (IPF) samples with healthy control samples. Data from both studies are obtained from Gene Expression Omnibus [20]. GSE 28042 [18] measures profiles of peripheral blood mononuclear cell (PBMC) for 75 IPF samples and 16 control samples through GeneChip Human 1.0 exon ST arrays, and GSE 33566 [19] measures profiles of peripheral blood RNA for 93 IPF patients and 30 control samples through Agilent Whole Human Genome Oligonucleotide Microarrays. We only consider the overlap 17708 common genes for reproducibility analysis.
The list of 23 selected genes, which are in the list of the top 500 reproducible genes selected by Benjamini & Heller method [9], but have opposite signs of signals in two studies
Genes  tstatistics in GSE 28042 [18]  tstatistics in GSE 33566 [19]  

1  A1BG  3.34  3.63 
2  ANKRD39  3.93  3.35 
3  CA4  4.4  4.94 
4  CDK14  4.88  3.34 
5  CHCHD2  3.5  3.65 
6  CXCR2  4.67  3.38 
7  HCG27  4.68  3.29 
8  KAT6A  3.48  3.54 
9  MFSD3  4.25  3.29 
10  MMP9  3.51  5.77 
11  MRPL14  4.06  3.69 
12  MRPL15  3.99  3.38 
13  MRPL55  3.63  3.95 
14  NDUFB7  3.79  3.54 
15  NDUFS3  3.98  3.89 
16  PRPS1  3.87  4.13 
17  RBBP6  3.66  3.67 
18  ROMO1  3.33  3.41 
19  SEPHS1  4  3.44 
20  TANC2  3.59  3.95 
21  TCN1  4.69  3.36 
22  TMEM141  3.45  3.64 
23  TRIM33  4.64  3.47 
The list of 7 selected genes, which are in the list of the top 500 reproducbile genes selected by the copula mixture model [10], but have opposite signs of signals in two studies
Conclusion and discussion
This paper proposes a new method for identifying consistent and significant signals across replicate highthroughput experiments. Existing methods ignore the directionality of signals, and can incorrectly identify signals with opposite directions as reproducible ones. Our proposed method considers both the significant scores and directions of signals by modeling the test statistics directly, leading to improved performance in selecting reproducible candidates. When the proposed method is applied to a real data example for identifying reproducible genes in studies of idiopathic pulmonary fibrosis samples, it is shown to have better performance in detecting significant and reproducible genes compared to other methods. Simulations also demonstrate that our method compares favorably to the existing methods.
Appendix
Expectationmaximization (EM) algorithm to estimate model parameters

Step 1: Initial Values Generate the initial values for \(\mathcal {\theta }\) and denote it as \(\widehat {\mathcal {\theta }}^{0}\)

Step 2: ExpectationStep Continue from the vth iteration step with the estimate \(\widehat {\mathcal {\theta }}^{v}\). We can obtain the estimated posterior probability \(\widehat {p_{gl}}^{v}\) of (d _{ g1},d _{ g2}) from (5) by$$\begin{array}{@{}rcl@{}} \widehat{p_{gl}}^{v}&=&\frac{\widehat{\pi_{l}}^{v}\phi\left(d_{g1},d_{g2}\widehat{\mathcal{\mu}_{l}}^{v},\widehat{\Sigma_{l}}^{v}\right)}{\sum_{\ell=0,1,2}\widehat{\pi_{\ell}}^{v}\phi\left(d_{g1},d_{g2}\widehat{\mathcal{\mu}_{\ell}}^{v},\widehat{\Sigma_{\ell}}^{v}\right)},l=0,1,2. \end{array} $$(9)

Step 3: MaximizationStep Update the parameter \(\widehat {\mathcal {\theta }}^{v+1}\) by maximizing the loglikelihood function \(\ell (\mathcal {\theta })\) in (7) given the current estimated posterior probability \(\widehat {p_{gl}}^{v}\). The estimated parameters from the maximization are$$\begin{aligned} \widehat{\pi_{l}}^{v+1}&=\sum_{g=1}^{p}\widehat{p_{gl}}^{v}/p,l=0,1,2.\\ \widehat{\mu_{1}}^{v+1}&=\left(\widehat{\mu_{11}}^{v+1},\widehat{\mu_{12}}^{v+1}\right)= \left(\frac{\sum_{g=1}^{p} \widehat{p_{g2}}^{v} d_{g1}}{\sum_{g=1}^{p} \widehat{p_{g2}}^{v}}, \frac{\sum_{g=1}^{p} \widehat{p_{g2}}^{v} d_{g2}}{\sum_{g=1}^{p} \widehat{p_{g2}}^{v}}\right)\\ \widehat{\mu_{2}}^{v+1}&=\left(\widehat{\mu_{21}}^{v+1},\widehat{\mu_{22}}^{v+1}\right)= \left(\frac{\sum_{g=1}^{p} \widehat{p_{g3}}^{v} d_{g1}}{\sum_{g=1}^{p} \widehat{p_{g3}}^{v}}, \frac{\sum_{g=1}^{p} \widehat{p_{g3}}^{v} d_{g2}}{\sum_{g=1}^{p} \widehat{p_{g3}}^{v}}\right) \\ \widehat{\sigma_{g}^{2}}^{v+1}&= \frac{\sum_{g=1}^{p} \widehat{p_{g1}}^{v}\left(d_{g1}^{2}+d_{g2}^{2}\right)}{2\sum_{g=1}^{p} \widehat{p_{g1}}^{v}} 1 \\ \widehat{\sigma_{G_{1}}^{2}}^{v+1}&= \frac{\sum_{g=1}^{p} \left[\widehat{p_{g2}}^{v} \left(d_{g1}\widehat{\mu_{11}}^{v+1}\right)^{2}+\widehat{p_{g2}}^{v} \left(d_{g2}\widehat{\mu_{12}}^{v+1}\right)^{2}\right]}{2\sum_{g=1}^{p} \widehat{p_{g2}}^{v}} \\ &\quad\frac{\sum_{g=1}^{p} \widehat{p_{g1}}^{v}\left(d_{g1}^{2}+d_{g2}^{2}\right)}{2\sum_{g=1}^{p} \widehat{p_{g1}}^{v}} \\ \widehat{\sigma_{G_{2}}^{2}}^{v+1}&= \frac{\sum_{g=1}^{p} \left[\widehat{p_{g3}}^{v} \left(d_{g1}\widehat{\mu_{21}}^{v+1}\right)^{2}+\widehat{p_{g3}}^{v} \left(d_{g2}\widehat{\mu_{22}}^{v+1}\right)^{2}\right]}{2\sum_{g=1}^{p} \widehat{p_{g3}}^{v}} \\ &\quad\frac{\sum_{g=1}^{p} \widehat{p_{g1}}^{v}\left(d_{g1}^{2}+d_{g2}^{2}\right)}{2\sum_{g=1}^{p} \widehat{p_{g1}}^{v}} \end{aligned} $$

Step 4: Solution The algorithm continues between ExpectationStep and MaximizationStep until the following two conditions are satisfied.
 1.
The difference between \(\widehat {\mathcal {\theta }}^{v} \) and \(\widehat {\mathcal {\theta }}^{v+1}\) is less than a small value δ _{1} for all their elements;
 2.
The change in loglikelihood function \(\ell (\mathcal {\theta })\) between two consecutive iterations does not exceed a small value δ _{2}.
 1.
Declarations
Acknowledgements
We would like thank referees for their time on reviewing this manuscript.
Availability of data and materials
All data are from Gene Expression Omnibus [20].
Authors’ contributions
All authors equally distributed. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, De Longueville F, Kawasaki ES, Lee KY, et al. The microarray quality control (maqc) project shows interand intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006; 24(9):1151–61.View ArticlePubMedGoogle Scholar
 Park PJ. Chip–seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009; 10(10):669–80.View ArticlePubMedPubMed CentralGoogle Scholar
 Goodman SN, Fanelli D, Ioannidis JP. What does research reproducibility mean?. Sci Transl Med. 2016; 8(341):341–1234112.View ArticleGoogle Scholar
 Darbani B, Stewart CN. Reproducibility and reliability assays of the gene expressionmeasurements. J Biol Res (Thessaloniki). 2014; 21(1):3.View ArticleGoogle Scholar
 Parmigiani G, GarrettMayer ES, Anbazhagan R, Gabrielson E. A crossstudy comparison of gene expression studies for the molecular classification of lung cancer. Clin Cancer Res. 2004; 10(9):2922–7.View ArticlePubMedGoogle Scholar
 Choi H, Shen R, Chinnaiyan AM, Ghosh D. A latent variable approach for metaanalysis of gene expression data from multiple microarray experiments. BMC Bioinforma. 2007; 8(1):364.View ArticleGoogle Scholar
 Parmigiani G, Garrett ES, Anbazhagan R, Gabrielson E. A statistical framework for expressionbased molecular classification in cancer. J R Stat Soc Ser B Stat Methodol. 2002; 64(4):717–36.View ArticleGoogle Scholar
 Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J. Rankprod: a bioconductor package for detecting differentially expressed genes in metaanalysis. Bioinformatics. 2006; 22(22):2825–7.View ArticlePubMedGoogle Scholar
 Benjamini Y, Heller R, Yekutieli D. Selective inference in complex research. Philos Trans R Soc Lond A Math Phys Eng Sci. 2009; 367(1906):4255–71.View ArticleGoogle Scholar
 Li Q, Brown JB, Huang H, Bickel PJ. Measuring reproducibility of highthroughput experiments. Annals Appl Stat. 2011; 5:1752–79.View ArticleGoogle Scholar
 Efron B. Microarrays, empirical bayes and the twogroups model. Stat Sci. 2008; 23(1):1–22.View ArticleGoogle Scholar
 Chen MH, Ibrahim JG, Chi YY. A new class of mixture models for differential gene expression in dna microarray data. J Stat Plan Infer. 2008; 138(2):387–404.View ArticleGoogle Scholar
 Najarian K, Zaheri M, Rad AA, Najarian S, Dargahi J. A novel mixture model method for identification of differentially expressed genes from dna microarray data. BMC Bioinforma. 2004; 5(201):201–10.View ArticleGoogle Scholar
 Newton MA. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004; 5(2):155–76.View ArticlePubMedGoogle Scholar
 Wei Pan JL, Le CT. A mixture model approach to detecting differentially expressed genes with microarray data. Funct Integr Genomics. 2003; 3:117–24.View ArticlePubMedGoogle Scholar
 G.J. McLachlan RWB, Peel D. A mixture modelbased approach to the clustering of microarray expression data. Bininformatics. 2002; 18(3):413–22.View ArticleGoogle Scholar
 Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B Methodol. 1977; 38:1–38.Google Scholar
 HerazoMaya JD, Noth I, Duncan SR, Kim S, Ma SF, Tseng GC, Feingold E, JuanGuardela BM, Richards TJ, Lussier Y, et al. Peripheral blood mononuclear cell gene expression profiles predict poor outcome in idiopathic pulmonary fibrosis. Sci Transl Med. 2013; 5(205):205–136205136.View ArticleGoogle Scholar
 Yang IV, Luna LG, Cotter J, Talbert J, Leach SM, Kidd R, Turner J, Kummer N, Kervitsky D, Brown KK, et al. The peripheral blood transcriptome identifies the presence and extent of disease in idiopathic pulmonary fibrosis. PLoS One. 2012; 7(6):37708.View ArticleGoogle Scholar
 Gene Expression Omnibus. http://www.ncbi.nlm.nih.gov/geo/.