Shrinkage regression-based methods for microarray missing value imputation
© Wang et al.; licensee BioMed Central Ltd. 2013
Published: 13 December 2013
Missing values commonly occur in microarray data, which usually contain more than 5% missing values with up to 90% of genes affected. Inaccurate missing value estimation reduces the power of downstream microarray data analyses. Many types of methods have been developed to estimate missing values. Among them, the regression-based methods are very popular and have been shown to perform better than the other types of methods on many test microarray datasets.
To further improve the performance of the regression-based methods, we propose shrinkage regression-based methods. Our methods take advantage of the correlation structure in the microarray data and select similar genes for the target gene by Pearson correlation coefficients. In addition, our methods incorporate the least squares principle, use a shrinkage estimation approach to adjust the coefficients of the regression model, and then use the new coefficients to estimate missing values. Simulation results show that the proposed methods provide more accurate missing value estimation on six test microarray datasets than the existing regression-based methods do.
Imputation of missing values is a very important aspect of microarray data analyses because most of the downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for estimating missing values has become an essential issue. Since our proposed shrinkage regression-based methods can provide accurate missing value estimation, they are competitive alternatives to the existing regression-based methods.
Microarray technology has become an important and useful tool in functional genomics research. This high-throughput technique allows the characterization of the gene expression of the whole genome by measuring the relative transcript levels of thousands of genes under various experimental conditions or at various time points. Microarray data analyses have been widely used to investigate various biological processes such as the cell cycle process [2–8] and the stress response [9, 10].
Although microarray technology has been in use for more than a decade, typical microarray data still contain more than 5% missing values with up to 90% of genes affected. Missing values can arise for various reasons, including technological failures, administrative error, insufficient resolution, image corruption, and dust or scratches on the slide. As many downstream analysis methods (such as gene clustering, disease classification and gene network reconstruction) require complete datasets, missing value estimation becomes an important pre-processing step in microarray data analysis [11–13].
The missing values in a microarray dataset are traditionally handled by repeating the microarray experiments or by simply replacing the missing values with zero or the row average (the average expression over the experimental conditions). Because these approaches are either time-consuming or lead to serious estimation errors, more advanced missing value imputation methods are needed. In 2001, Troyanskaya et al. published the first two missing value imputation algorithms, based on the k-nearest neighbors (kNNimpute) and the singular value decomposition (SVDimpute). Since then, many missing value imputation methods have been proposed, such as Bayesian principal component analysis (BPCA), Gaussian mixture clustering imputation (GMCimpute), conditional ordered list imputation, and random-forest-based imputation.
Among the existing missing value imputation methods, the regression-based methods are very popular and include many algorithms, such as least squares imputation (LSimpute), local least squares imputation (LLSimpute), sequential local least squares imputation (SLLSimpute), and iterated local least squares imputation (ILLSimpute). LSimpute estimates the missing values in the target gene by using a weighted average of the k estimates from the k most similar genes. Each estimate is obtained by fitting a simple regression model of the target gene on one similar gene. LLSimpute represents the target gene as a linear combination of k similar genes by a multiple regression model and uses the regression coefficients to estimate the missing values. SLLSimpute modifies LLSimpute by estimating the missing values sequentially, starting from the gene containing the fewest missing values, and partially reusing these estimated values. ILLSimpute modifies LLSimpute by not choosing a fixed number k of similar genes; instead, it defines the similar genes as the genes whose distances from the target gene are less than a distance threshold, and then runs LLSimpute iteratively.
In this study, we focus on the regression-based methods because these methods have been shown to perform better than the other existing methods on many test microarray datasets [20, 21]. To further improve the performance of the regression-based methods, we propose shrinkage regression-based methods, which replace the least squares estimator with a shrinkage estimator when estimating the regression coefficients in the regression model. Shrinkage estimators such as the James-Stein estimator have been shown to dominate the least squares estimator in many statistical models [22, 23]. By adopting our new regression coefficients in the regression-based methods, we show that missing value estimation can be improved on six test microarray datasets.
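The LLSimpute idea described above can be illustrated with a minimal sketch for a single target gene. It assumes missing entries are stored as NaN, that candidate genes are restricted to rows without missing values, and that similarity is the absolute Pearson correlation over the target's observed conditions. The function name `lls_impute_gene` and these choices are illustrative assumptions, not the published implementation.

```python
import numpy as np

def lls_impute_gene(G, i, k):
    """Impute NaN entries of row i of G (genes x conditions) from k similar genes."""
    target = G[i]
    miss = np.isnan(target)
    obs = ~miss
    # Candidate genes: rows without missing values, excluding the target itself.
    complete = [j for j in range(G.shape[0])
                if j != i and not np.isnan(G[j]).any()]
    # Rank candidates by absolute Pearson correlation on the observed conditions.
    corr = np.array([abs(np.corrcoef(target[obs], G[j, obs])[0, 1])
                     for j in complete])
    nearest = [complete[j] for j in np.argsort(-corr)[:k]]
    # Regress the observed part of the target on the k similar genes.
    A = G[nearest][:, obs].T          # observed conditions x k
    B = G[nearest][:, miss].T         # missing conditions x k
    coef, *_ = np.linalg.lstsq(A, target[obs], rcond=None)
    # Apply the fitted coefficients at the missing positions.
    filled = target.copy()
    filled[miss] = B @ coef
    return filled
```

When the target gene is an exact linear combination of the selected genes, this sketch recovers the removed value up to floating-point error; on real data the fit is only approximate.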
Shrinkage estimation approach
all have uniformly smaller mean squared error than the intuitive estimator Y_i, for k ≥ 3 and 0 < c < 2(k - 2). Among all the estimators of the form in (4), the estimator in (3) has the minimum mean squared error. The shrinkage estimation approach has also been shown to perform well in interval estimation [24, 25]. Based on the James-Stein estimator in (3), we developed shrinkage regression-based imputation methods.
where g_i^T denotes the transpose of a column vector g_i. A missing value in the l-th position of the i-th gene, i.e. the entry G_{i,l} = g_il, is denoted by a dedicated missing-value symbol.
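For readers unfamiliar with the James-Stein estimator, the following minimal sketch uses its standard textbook form, (1 - c / ||Y||^2) Y, for a k-dimensional observation Y assumed to have unit-variance normal components. That this form corresponds to estimators (3) and (4) above is an assumption, with c = k - 2 taken as the risk-minimizing choice; the function name `james_stein` is hypothetical.

```python
import numpy as np

def james_stein(y, c=None):
    """Shrink a k-vector y toward the origin: (1 - c / ||y||^2) * y.

    Assumes y ~ N(theta, I) with k >= 3. By default c = k - 2, the
    classical James-Stein choice (an assumption about the form used here).
    """
    y = np.asarray(y, dtype=float)
    k = y.size
    if c is None:
        c = k - 2
    return (1.0 - c / np.dot(y, y)) * y
```

For example, with y = (3, 4, 0) the squared norm is 25 and c = 1, so each component is multiplied by 0.96.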
Shrinkage local least squares imputation (Shrinkage LLSimpute)
Shrinkage sequential local least squares imputation (Shrinkage SLLSimpute)
By a similar argument as for the shrinkage LLSimpute, we apply the shrinkage estimator to SLLSimpute. The shrinkage SLLSimpute adjusts the coefficients of the regression model by the formula in (10) and uses the formula in (11) to estimate the missing values.
Shrinkage iterated local least squares imputation (Shrinkage ILLSimpute)
The LLSimpute and SLLSimpute methods select the k nearest neighbor genes for a target gene, where k is a fixed number. The ILLSimpute method, however, does not fix the number of similar genes selected. Instead, it defines the similar genes as the genes whose distances to the target gene are less than a distance threshold. The rationale for using a distance threshold rather than a fixed number of similar genes is that some of the k nearest neighbor genes may already be far from the target gene and thus not very similar to it.
The procedure of ILLSimpute is as follows. In the first iteration, the missing values of each target gene are filled with the row average. Then a distance threshold is used to select the similar genes of each target gene. Finally, the LLSimpute method is used to estimate the missing values of each target gene. In each later iteration, ILLSimpute uses the imputed results from the previous iteration to reselect the similar genes of each target gene (using the same distance threshold) and applies LLSimpute to re-estimate the missing values.
By a similar argument as for the shrinkage LLSimpute, we apply the shrinkage estimator to ILLSimpute. The shrinkage ILLSimpute adjusts the coefficients of the regression model by the formula in (10) and uses the formula in (11) to estimate the missing values.
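The iteration just described can be sketched as follows. This is a simplified illustration under several assumptions: missing entries are NaN, the distance is Euclidean on the currently filled matrix, `threshold` is a plain user-supplied number, and the number of iterations is fixed. The function name `ills_impute` and its parameters are hypothetical and do not reproduce the original ILLSimpute software.

```python
import numpy as np

def ills_impute(G, threshold, n_iter=3):
    """Iterated local least squares sketch; NaN marks missing entries in G."""
    miss = np.isnan(G)
    # First iteration starts from row-average imputation.
    X = np.where(miss, np.nanmean(G, axis=1, keepdims=True), G)
    for _ in range(n_iter):
        X_new = X.copy()
        for i in np.flatnonzero(miss.any(axis=1)):
            obs = ~miss[i]
            # Similar genes: every other gene closer than the distance threshold.
            dist = np.linalg.norm(X - X[i], axis=1)
            near = np.flatnonzero((dist < threshold) & (np.arange(len(X)) != i))
            if near.size == 0:
                continue  # no gene is close enough; keep the current estimate
            # Local least squares on the observed conditions of the target gene.
            A = X[near][:, obs].T
            coef, *_ = np.linalg.lstsq(A, X[i, obs], rcond=None)
            X_new[i, miss[i]] = X[near][:, miss[i]].T @ coef
        # The next iteration reselects similar genes from these imputed results.
        X = X_new
    return X
```

Observed entries are never modified; only positions that were missing in the input are re-estimated in each pass.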
Results and Discussion
We conducted several experiments to compare the performance of our shrinkage regression-based methods and the original regression-based methods under different scenarios. In the first subsection, we introduce the benchmark datasets. In the second subsection, we describe how we measure the performance of the various imputation methods. In the following three subsections, we report the comparison results for different numbers of similar genes, different missing rates, and different noise levels. Finally, we compare the performance of our shrinkage regression-based methods with that of three existing non-regression-based methods.
The performance measure
where y_guess and y_ans are vectors whose elements are, respectively, the values estimated by an imputation method and the known true values for all missing entries.
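A common definition of the normalized root mean squared error (NRMSE) in the imputation literature divides the mean squared error between the estimated and true values by the variance of the true values and takes the square root. The sketch below assumes that definition; the helper name `nrmse` is hypothetical.

```python
import numpy as np

def nrmse(y_guess, y_ans):
    """Normalized root mean squared error over all imputed entries.

    Assumed definition: sqrt(mean((y_guess - y_ans)^2) / var(y_ans)).
    """
    y_guess = np.asarray(y_guess, dtype=float)
    y_ans = np.asarray(y_ans, dtype=float)
    return np.sqrt(np.mean((y_guess - y_ans) ** 2) / np.var(y_ans))
```

Under this definition a perfect imputation scores 0, and replacing every missing entry with the mean of the true values scores 1, so values well below 1 indicate a useful method.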
Performance comparison for different k values
Table 2. The optimal k value for each benchmark dataset.
For each of the six benchmark datasets, we also compared the performance of the proposed shrinkage regression-based methods and the original regression-based methods for several possible k values (50, 100, 150, 200, 250 and 300). In our numerical experiments, the missing rate for each benchmark dataset was set to 5%. That is, for each dataset, we randomly removed 5% of the entries of the complete matrix to generate a matrix with missing values, and then estimated the missing values using the shrinkage and the original regression-based methods. The same procedure was run for five independent rounds, and the average NRMSE of these five simulations was used to compare the performance of the different imputation methods.
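The evaluation protocol described above (remove a fraction of entries at random, impute, score, and average over five rounds) can be sketched as follows. Row-average imputation is used only as a stand-in imputer, the NRMSE definition is the commonly used one and is an assumption, and all function names and the `seed` parameter are hypothetical.

```python
import numpy as np

def row_average_impute(G):
    """Stand-in imputer: replace each NaN with the mean of its row."""
    return np.where(np.isnan(G), np.nanmean(G, axis=1, keepdims=True), G)

def average_nrmse(complete, impute, missing_rate=0.05, rounds=5, seed=0):
    """Average NRMSE of `impute` over independent random-removal rounds."""
    rng = np.random.default_rng(seed)
    n_missing = int(round(missing_rate * complete.size))
    scores = []
    for _ in range(rounds):
        # Randomly choose which entries of the complete matrix to remove.
        idx = rng.choice(complete.size, size=n_missing, replace=False)
        mask = np.zeros(complete.size, dtype=bool)
        mask[idx] = True
        mask = mask.reshape(complete.shape)
        filled = impute(np.where(mask, np.nan, complete))
        # Score only the removed entries against their known true values.
        err = filled[mask] - complete[mask]
        scores.append(np.sqrt(np.mean(err ** 2) / np.var(complete[mask])))
    return float(np.mean(scores))
```

Passing a different `missing_rate` (for example 0.01, 0.10, 0.15 or 0.20) reuses the same routine for the missing-rate comparison, and any function that maps a matrix with NaN entries to a filled matrix can be passed as `impute`.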
Performance comparison for different missing rates
In real applications, different microarray datasets may have different missing rates. It is informative to know how an imputation method performs for different missing rates. Therefore, we compared the performance of the shrinkage regression-based methods and the original regression-based methods on microarray data with different missing rates (1%, 5%, 10%, 15% and 20%). That is, for each of the six benchmark datasets, we randomly removed x% (x = 1, 5, 10, 15 or 20) of the entries of the complete matrix to generate a matrix with missing values, and then estimated the missing values using the shrinkage and the original regression-based methods. The same procedure was run for five independent rounds, and the average NRMSE of these five simulations was used to compare the performance of the different imputation methods. Note that the optimal k value used for each benchmark dataset is listed in Table 2.
Performance comparison for different noise levels
In real applications, different microarray datasets may contain different levels of noise. It is informative to know how an imputation method performs for different levels of noise inherent in the microarray data. Therefore, we compared the performance of the shrinkage regression-based methods and the original regression-based methods on microarray data with different noise levels. For each of the six benchmark datasets, we added Gaussian noise of different levels to the data. The noise magnitudes were set in terms of the standard deviation, ranging from 0 to 0.25 with a step size of 0.05. In our numerical experiments, the missing rate for each benchmark dataset was set to 5%, and the optimal k value used for each benchmark dataset is listed in Table 2. That is, for each dataset (after adding Gaussian noise to the data), we randomly removed 5% of the entries of the complete matrix to generate a matrix with missing values, and then estimated the missing values using the shrinkage and the original regression-based methods. The same procedure was run for five independent rounds, and the average NRMSE of these five simulations was used to compare the performance of the different imputation methods.
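The noise-injection step can be sketched in a few lines. The sketch assumes zero-mean Gaussian noise added independently to every entry, with the standard deviation taken from the grid 0, 0.05, ..., 0.25; the names `add_gaussian_noise` and `noise_levels` are hypothetical.

```python
import numpy as np

def add_gaussian_noise(data, sd, rng):
    """Return a copy of `data` with independent N(0, sd^2) noise on each entry."""
    return data + rng.normal(0.0, sd, size=data.shape)

# Standard deviations from 0 to 0.25 in steps of 0.05.
noise_levels = [round(0.05 * step, 2) for step in range(6)]
```

Each noisy matrix would then go through the same remove, impute and score procedure at a 5% missing rate.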
Performance comparison with three existing non-regression-based methods
Imputation of missing values is a very important aspect of microarray data analyses because most downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for estimating missing values has become an essential issue. In this study, regression-based methods combined with a shrinkage estimation approach are proposed to estimate missing values in microarray data. Our methods take advantage of the correlation structure in the microarray data and select similar genes for the target gene by Pearson correlation coefficients. In addition, our methods incorporate the least squares principle, use a shrinkage estimation approach to adjust the coefficients of the regression model, and apply the new coefficients to estimate missing values. Simulation results show that the proposed shrinkage regression-based methods provide more accurate missing value estimation for various types of datasets than the original regression-based methods do. Since our proposed methods can be applied to modify any kind of regression-based method and can provide accurate missing value estimation, they are competitive alternatives to the existing regression-based methods.
This study was supported by the National Cheng Kung University and Taiwan National Science Council NSC 99-2628-B-006-015-MY3 and NSC 101-2118-M-009-006-MY2.
The full funding for the publication fee came from Taiwan National Science Council and College of Electrical Engineering and Computer Science, National Cheng Kung University.
This article has been published as part of BMC Systems Biology Volume 7 Supplement 6, 2013: Selected articles from the 24th International Conference on Genome Informatics (GIW2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S6.
1. Schena M, Shalon D, Davis R, Brown P: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467-470. 10.1126/science.270.5235.467.
2. Wu W, Li W, Chen B: Computational reconstruction of transcriptional regulatory modules of the yeast cell cycle. BMC Bioinformatics. 2006, 7: 421. 10.1186/1471-2105-7-421.
3. Rowicka M, Kudlicki A, Tu B, Otwinowski Z: High-resolution timing of cell cycle-regulated gene expression. Proc Natl Acad Sci USA. 2007, 104: 16892-16897. 10.1073/pnas.0706022104.
4. Wu W, Li W, Chen B: Identifying regulatory targets of cell cycle transcription factors using gene expression and ChIP-chip data. BMC Bioinformatics. 2007, 8: 188. 10.1186/1471-2105-8-188.
5. Futschik M, Herzel H: Are we overestimating the number of cell-cycling genes? The impact of background models on time-series analysis. Bioinformatics. 2008, 24: 1063-1069. 10.1093/bioinformatics/btn072.
6. Wu W, Li W: Systematic identification of yeast cell cycle transcription factors using multiple data sources. BMC Bioinformatics. 2008, 9: 522. 10.1186/1471-2105-9-522.
7. Siegal-Gaskins D, Ash J, Crosson S: Model-based deconvolution of cell cycle time-series data reveals gene expression details at high resolution. PLoS Comput Biol. 2009, 5: e1000460. 10.1371/journal.pcbi.1000460.
8. Wang H, Wang Y, Wu W: Yeast cell cycle transcription factors identification by variable selection criteria. Gene. 2011, 485: 172-176. 10.1016/j.gene.2011.06.001.
9. Gasch A, Spellman P, Kao C, Carmel-Harel O, Eisen M, Storz G, Botstein D, Brown P: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000, 11: 4241-4257. 10.1091/mbc.11.12.4241.
10. Wu W, Li W: Identifying gene regulatory modules of heat shock response in yeast. BMC Genomics. 2008, 9: 439. 10.1186/1471-2164-9-439.
11. Ouyang M, Welsh W, Georgopoulos P: Gaussian mixture clustering and imputation of microarray data. Bioinformatics. 2004, 20: 917-923. 10.1093/bioinformatics/bth007.
12. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman R: Missing value estimation methods for DNA microarrays. Bioinformatics. 2001, 17: 520-525. 10.1093/bioinformatics/17.6.520.
13. Cai Z, Heydari M, Lin G: Iterated local least squares microarray missing value imputation. J Bioinform Comput Biol. 2006, 4: 935-957. 10.1142/S0219720006002302.
14. Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics. 2003, 19: 2088-2096. 10.1093/bioinformatics/btg287.
15. Yu T, Peng H, Sun W: Incorporating nonlinear relationships in microarray missing value imputation. IEEE/ACM Trans Comput Biol Bioinform. 2011, 8: 723-731.
16. Stekhoven D, Bühlmann P: MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012, 28: 112-118. 10.1093/bioinformatics/btr597.
17. Bø T, Dysvik B, Jonassen I: LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res. 2004, 32: e34. 10.1093/nar/gnh026.
18. Kim H, Golub G, Park H: Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics. 2005, 21: 187-198. 10.1093/bioinformatics/bth499.
19. Zhang X, Song X, Wang H, Zhang H: Sequential local least squares imputation estimating missing value of microarray data. Comput Biol Med. 2008, 38: 1112-1120. 10.1016/j.compbiomed.2008.08.006.
20. Celton M, Malpertuy A, Lelandais G, de Brevern A: Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics. 2010, 11: 15. 10.1186/1471-2164-11-15.
21. Brock G, Shaffer J, Blakesley R, Lotz M, Tseng G: Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinformatics. 2008, 9: 12. 10.1186/1471-2105-9-12.
22. Stein C: Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. 1956, 1: 197-206.
23. James W, Stein C: Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. 1961, 1: 361-379.
24. Wang H: Brown's paradox in the estimated confidence approach. The Annals of Statistics. 1999, 27: 610-626. 10.1214/aos/1018031210.
25. Wang H: Improved confidence estimators for the multivariate normal confidence set. Statistica Sinica. 2000, 10: 659-664.
26. Ogawa N, DeRisi J, Brown P: New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic expression analysis. Molecular Biology of the Cell. 2000, 11: 4309-4321. 10.1091/mbc.11.12.4309.
27. Bohen S, Troyanskaya O, Alter O, Warnke R, Botstein D, Brown P, Levy R: Variation in gene expression patterns in follicular lymphoma and the response to rituximab. Proc Natl Acad Sci USA. 2003, 100: 1926-1930. 10.1073/pnas.0437875100.
28. Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Yang L, Marti G, Moore T, Hudson J, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage J, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403: 503-511. 10.1038/35000501.
29. Brauer M, Saldanha A, Dolinski K, Botstein D: Homeostatic adjustment and metabolic remodeling in glucose-limited yeast cultures. Mol Biol Cell. 2005, 16: 2503-2517. 10.1091/mbc.E04-11-0968.
30. Shapira M, Segal E, Botstein D: Disruption of yeast forkhead-associated cell cycle transcription by oxidative stress. Mol Biol Cell. 2004, 15: 5659-5669. 10.1091/mbc.E04-04-0340.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.