Volume 6 Supplement 3
The International Conference on Intelligent Biology and Medicine (ICIBM): Systems Biology
Statistical aspects of omics data analysis using the random compound covariate
- Pei-Fang Su^{1},
- Xi Chen^{1, 2},
- Heidi Chen^{1, 2} and
- Yu Shyr^{1, 2}
https://doi.org/10.1186/1752-0509-6-S3-S11
© Su et al; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Abstract
Background
Dealing with high-dimensional markers, such as gene expression data obtained using microarray chip technology or genomics studies, is a key challenge because the number of features greatly exceeds the number of biological samples. After selecting biologically relevant genes, how to summarize the expression of the selected genes and then build a prediction model is an important issue in medical applications. One intuitive method of addressing this challenge assigns different weights to different features and then combines this information into a single score, named the compound covariate. Investigators commonly employ this score to assess whether an association exists between the compound covariate and clinical outcomes, adjusted for baseline covariates. However, we found that some clinical papers concerned with such analysis report biased p-values based on a flawed compound covariate analysis of their training data set.
Results
We correct this flaw in the analysis and also propose treating the compound score as a random covariate, to achieve more appropriate results and significantly improve study power for survival outcomes. With this proposed method, we thoroughly assess the performance of two commonly used estimated gene weights through simulation studies. When the sample size is 100, and censoring rates are 50%, 30%, and 10%, power is increased by 10.6, 3.5, and 0.4 percentage points, respectively, by treating the compound score as a random covariate rather than a fixed covariate. Finally, we assess our proposed method using two publicly available microarray data sets.
Conclusion
In this article, we correct this flaw in the analysis. The proposed method, which treats the compound score as a random covariate, can achieve more appropriate results and improve study power for survival outcomes.
Introduction
High-dimensional omics data
Personalized medicine is expected to enable a more predictive discipline, in which therapies are targeted toward the molecular constitution of individual patients and their disease; thus, molecular biomarkers are widely expected to revolutionize the current practice of medicine. For example, the progress of genomics has made it possible to evaluate molecular signatures to predict cancer metastasis [1, 2]. Various technological breakthroughs have led to a plethora of high-dimensional omics data to support personalized medicine, and these data have a common characteristic: the number of features greatly exceeds the number of biological samples. Because biological phenomena are the result of sets of features (e.g., concerted expression of multiple genes), the analysis of a group of related features (e.g., genes) may be more effective and may provide more directly interpretable results than the analysis of individual genes.
As high-dimensional omics research has advanced, the compound covariate (or compound score) has generally been held as a simpler and more straightforward approach. After biologically relevant genes have been selected in a training cohort, such a score is often a useful device in medical applications to define the information contained in a single set of data and to summarize the association of a set of variables with disease. Tukey [3] first advocated the use of compound covariates in the clinical trial setting. To develop a compound score, the individual covariates are summed; the association between such a compound covariate and outcome is then evaluated via regression analysis. Tomasson [4] used a compound score for binary outcomes by fitting a logistic regression. Later, Hedenfalk [5] successfully applied the compound covariate method to class prediction analysis for breast cancer data. Because the use of the compound covariate is intuitive and seems useful, many other leading researchers have also applied this method for analyzing omics data sets [6–9].
Problem statements
A compound covariate is a linear combination of the basic covariates being studied, with each covariate having its own coefficient or weight. For survival outcomes, a commonly used scheme is to 1) compute the univariate Cox regression [10] for each gene of interest, 2) assign a weight to each gene (typically, the estimated regression coefficients or Wald statistics from the univariate Cox regressions), and 3) combine the weighted genes in a linear model that incorporates gene expression levels in each sample. This method of modeling weighted genes is believed to reflect the importance of each individual gene to the outcome; the higher the weight assigned, the more significant a particular gene is.
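Step 3 of the scheme above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `compound_covariate`, the genes-by-patients layout of `expression`, and the example numbers are all assumptions. In practice the weights would be the estimated coefficients or Wald statistics from the univariate Cox fits in steps 1 and 2.

```python
def compound_covariate(weights, expression):
    """Return one compound score per patient.

    weights[i]       -- weight for gene i (e.g. a Cox coefficient or Wald statistic)
    expression[i][j] -- expression of gene i in patient j (assumed layout)
    """
    if len(weights) != len(expression):
        raise ValueError("need exactly one weight per gene")
    n_patients = len(expression[0])
    # Score for patient j is the weighted sum of that patient's gene expressions.
    return [
        sum(w * gene[j] for w, gene in zip(weights, expression))
        for j in range(n_patients)
    ]


# Hypothetical weights for two genes and expression values for three patients.
weights = [0.5, -1.0]
expression = [
    [1.0, 2.0, 0.0],  # gene 1
    [3.0, 4.0, 1.0],  # gene 2
]
scores = compound_covariate(weights, expression)
```

Each entry of `scores` is the single condensed index for one patient; it is this value, not the individual gene expressions, that enters the Cox regression.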
In this paper, we first contend the compound covariate should be treated as a random observation. Our idea is based on that proposed by Prentice [12], who analyzed covariates with measurement error and used a partial likelihood function technique to infer whether the parameter for the covariate was significant. In addition, if a training data set is used for a double purpose (i.e., to construct the compound covariate and then to test it), the resulting over-fitting means the p-value is not reliable when testing the regression parameter. Therefore, we use a 2-fold method (e.g., [13, 14]), splitting all observations in the training cohort into two parts, one part for assigning gene weights, and another part for testing the regression parameter through a partial likelihood score test. The remainder of this paper is organized as follows: We outline creation of the compound score using a random covariate approach. Then, we investigate the accuracy of the asymptotic distribution of the proposed tests. We thoroughly assess the performance of two commonly used estimated weights, "estimated coefficient" and "Wald statistic", for the Cox proportional hazards model. Finally, we illustrate the implementation of the proposed methods through two real data sets, and offer concluding remarks.
The proposed method
The compound covariate
for each patient j, j = 1, 2, ..., n. From the perspective of biology, the weighting policy is believed to reflect the importance of each individual gene to the survival outcome: the higher the weight, the more important the gene. In other words, the score can be regarded as a condensed index, representing the collective effects of gene expression.
where h_{0}(t) is a baseline hazard function and γ_{0} is the corresponding parameter for the compound covariate z; they then use the same data set to test the null hypothesis H_{0} :γ_{0} = 0. Under the null hypothesis, however, this method results in an uncontrolled type I error, because the training data set has been used twice, both for building the model and for testing the regression parameter. If independent data are available, it is possible to carry ${\widehat{\beta}}_{1},{\widehat{\beta}}_{2},\dots ,{\widehat{\beta}}_{p}$ from the training data set and test them on the independent data set, allowing unbiased model validation and preventing over-fitting. However, if investigators wish to report a testing result in the training cohort, an alternative, though less optimal, study design is to use a k-fold method or to split the training cohort data randomly (2-fold), with 50% of the data assigned to develop the compound covariate score and 50% to evaluate its performance. The limitation of this approach is that it requires a relatively large sample size. With this method, Kaplan-Meier survival curves [15] for the two sets should be examined to ensure that the random split of the training cohort data has not produced a significant difference between them.
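The random 2-fold split described above might look as follows. This is a sketch under stated assumptions: the function name `two_fold_split`, the fixed seed, and the use of patient indices rather than patient records are illustrative choices, not details given in the text.

```python
import random


def two_fold_split(n_patients, seed=0):
    """Randomly split patient indices 0..n_patients-1 into two halves."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    half = n_patients // 2
    weight_half = sorted(indices[:half])  # used only to estimate the gene weights
    test_half = sorted(indices[half:])    # used only to test the compound score
    return weight_half, test_half


weight_half, test_half = two_fold_split(10, seed=42)
```

The gene weights would be estimated from the patients in `weight_half` and then applied to the expression values of the patients in `test_half`, so that no patient contributes to both the construction and the testing of the score.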
Cox regression with a random compound covariate
where γ_{0} is the parameter for the compound covariate, ${\mathit{\gamma}}_{1}$ is the vector of corresponding parameters for the fixed observations and ${\mathit{\gamma}}_{1}^{\text{T}}$ is the transpose of ${\mathit{\gamma}}_{1}$.
A partial likelihood function and score test
where m_{ i } is the number of failures at time t_{ i }(i = 1, 2,...l). Let $a=E\left[\text{exp}\left\{{\gamma}_{0}z+{\mathit{\gamma}}_{1}^{\text{T}}\mathbf{w}\right\}\right]$, $b=\partial a\mathsf{\text{/}}\partial \mathit{\gamma}$ and $c=\partial b\mathsf{\text{/}}\partial \mathit{\gamma}$ where $\mathit{\gamma}={\left[{\gamma}_{0},\phantom{\rule{2.77695pt}{0ex}}{\mathit{\gamma}}_{1}^{\text{T}}\right]}^{\text{T}}$ (The explicit forms of a, b and c are shown in Additional file 1). The score statistic then can be derived as
by using Wald statistics as weight. We show the derivation in more detail in Additional file 2.
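To make the roles of a, b and c concrete, consider a single compound covariate with no fixed covariates. The explicit forms used in the paper are in Additional file 1 and are not reproduced here; what follows is only a sketch under the added assumption that the score of patient j is normally distributed, $z_j \sim N(\mu_j, \sigma_j^2)$, which is consistent with the quadratic form $\text{exp}\left({\gamma}_{0}{\mu}_{j}+{\sigma}_{j}^{2}{\gamma}_{0}^{2}/2\right)$ quoted in the simulation results. The subscript j on a, b and c is notation introduced for this illustration. The expectation is then the moment generating function of a normal variable:

```latex
\begin{aligned}
a_j &= E\left[\exp(\gamma_0 z_j)\right]
     = \exp\left(\gamma_0\mu_j + \sigma_j^2\gamma_0^2/2\right),\\
b_j &= \partial a_j/\partial\gamma_0
     = \left(\mu_j + \sigma_j^2\gamma_0\right) a_j,\\
c_j &= \partial b_j/\partial\gamma_0
     = \left\{\sigma_j^2 + \left(\mu_j + \sigma_j^2\gamma_0\right)^2\right\} a_j .
\end{aligned}
```

Under the null hypothesis γ_{0} = 0 these reduce to $a_j = 1$, $b_j = \mu_j$ and $c_j = \mu_j^2 + \sigma_j^2$, so the variance of each score enters the test only through the second derivative.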
Multiple gene sets
Let $a=E\left[\text{exp}\left({\mathit{\gamma}}_{0}^{\text{T}}\mathbf{z}+{\mathit{\gamma}}_{1}^{\text{T}}\mathbf{w}\right)\right]$, $b=\partial a\mathsf{\text{/}}\partial \mathit{\gamma}$ and $c=\partial b\mathsf{\text{/}}\partial \mathit{\gamma}$, where $\mathit{\gamma}={\left[{\mathit{\gamma}}_{0}^{\text{T}},{\mathit{\gamma}}_{1}^{\text{T}}\right]}^{\text{T}}$. The score statistic and the observed information matrix can be further derived as (3) and (4) as well. Consequently, under the null hypothesis ${H}_{0}:\mathit{\gamma}=0$, the partial likelihood score test v^{T}V^{-1} v has an asymptotic ${\chi}_{d+q}^{2}$ distribution when V is nonsingular. If we reject the null hypothesis, we can conclude that the covariate vector is associated with survival time.
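Once the score vector v and the matrix V are available, the test itself is a quadratic form compared with a chi-square distribution. The following sketch computes the statistic and its p-value using only the standard library; the function names (`solve`, `chi2_sf`, `score_test`) and the example inputs are hypothetical, and the code does not derive v or V, which come from expressions (3) and (4).

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("V is singular; the test is not defined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if stat <= 0:
        return 1.0
    half = stat / 2.0
    if df % 2 == 0:
        # even df: exp(-x/2) times a finite sum of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # odd df: start from the df = 1 tail and add a finite correction series
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * stat / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= stat / (2 * r + 1)
    return total


def score_test(v, V):
    """Return (v^T V^{-1} v, p-value) with len(v) degrees of freedom."""
    stat = sum(vi * xi for vi, xi in zip(v, solve(V, v)))
    return stat, chi2_sf(stat, len(v))


# Hypothetical score vector and matrix for d + q = 2.
stat, p_value = score_test([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Solving the linear system instead of inverting V explicitly is a design choice; it gives the same quadratic form and fails loudly when V is singular, the case the text excludes.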
Simulation results
To assess the performance of the proposed testing procedure for the compound covariate, we conducted simulation studies under various scenarios to study type I error rate and power. For the scenario in which the training data set is split into two parts and the compound scores are treated as random covariates, we denote the compound score using $\widehat{\beta}$ as a weight function as SRC_{ B }, and the compound score using the Wald statistic as a weight function as SRC_{ W }. The corresponding notations, SC_{ B } and SC_{ W }, refer to split data but without treating the compound covariate as a random covariate (i.e., typical Cox regression [10]). The notations DC_{ B } and DC_{ W } refer to compound scores with double use of the training data set, both for building the model and then for testing on the same data set. Assume there are p selected genes in a gene set. Gene expression data were generated from a multivariate normal distribution with mean vector β = [β_{1}, β_{2}, ..., β_{ p }]^{T}, and variance-covariance matrix equal to the identity matrix for n cases. Generated survival times were associated with gene expression via a proportional hazards model, exp(x^{T}β). All tests were applied at a nominal significance level of 0.05, and empirical rejection probabilities were obtained from 2000 simulation runs.
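One simulated cohort of this kind could be generated as sketched below. Several details are assumptions, because the text does not state them: a unit exponential baseline hazard, independent exponential censoring, and the hypothetical parameter `cens_rate`, which would have to be tuned by trial to reach a target censoring proportion. The function name `simulate_cohort` is also illustrative.

```python
import math
import random


def simulate_cohort(n, beta, mean, cens_rate, seed=0):
    """Simulate expression, follow-up time and event indicator for n patients.

    Assumed, not taken from the text: baseline hazard 1 (exponential survival)
    and exponential censoring with rate `cens_rate`.
    """
    rng = random.Random(seed)
    x_by_patient, times, events = [], [], []
    for _ in range(n):
        # Independent unit-variance normals = multivariate normal with identity covariance.
        x = [rng.gauss(m, 1.0) for m in mean]
        hazard = math.exp(sum(b * xi for b, xi in zip(beta, x)))
        t_event = rng.expovariate(hazard)    # survival time with hazard exp(x'beta)
        t_cens = rng.expovariate(cens_rate)  # censoring time
        x_by_patient.append(x)
        times.append(min(t_event, t_cens))
        events.append(1 if t_event <= t_cens else 0)
    return x_by_patient, times, events


p = 30
beta = [0.5] * p
# The mean vector is set to beta, as the text describes.
x_by_patient, times, events = simulate_cohort(
    n=100, beta=beta, mean=beta, cens_rate=0.1, seed=1
)
censoring_proportion = 1.0 - sum(events) / len(events)
```

Repeating this for 2000 seeds and recording how often a test rejects at the 0.05 level would give an empirical rejection probability of the kind reported in the tables.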
Empirical type I error rates
Method | n | cen. | 10 genes | 30 genes | 50 genes | 70 genes |
---|---|---|---|---|---|---|
SRC_{ B } | 50 | 10% | 0.052 | 0.057 | 0.051 | 0.048 |
SRC_{ B } | 50 | 40% | 0.041 | 0.047 | 0.045 | 0.046 |
SRC_{ B } | 75 | 10% | 0.052 | 0.048 | 0.045 | 0.046 |
SRC_{ B } | 75 | 40% | 0.044 | 0.046 | 0.050 | 0.046 |
SRC_{ B } | 100 | 10% | 0.056 | 0.049 | 0.052 | 0.052 |
SRC_{ B } | 100 | 40% | 0.045 | 0.044 | 0.048 | 0.050 |
SRC_{ W } | 50 | 10% | 0.058 | 0.046 | 0.052 | 0.050 |
SRC_{ W } | 50 | 40% | 0.034 | 0.046 | 0.036 | 0.043 |
SRC_{ W } | 75 | 10% | 0.046 | 0.042 | 0.051 | 0.051 |
SRC_{ W } | 75 | 40% | 0.044 | 0.038 | 0.044 | 0.040 |
SRC_{ W } | 100 | 10% | 0.051 | 0.046 | 0.048 | 0.060 |
SRC_{ W } | 100 | 40% | 0.044 | 0.041 | 0.046 | 0.048 |
DC_{ B } | 50 | 10% | 0.937 | 1.000 | 1.000 | 1.000 |
DC_{ B } | 50 | 40% | 0.910 | 1.000 | 1.000 | 1.000 |
DC_{ B } | 75 | 10% | 0.944 | 1.000 | 1.000 | 1.000 |
DC_{ B } | 75 | 40% | 0.946 | 1.000 | 1.000 | 1.000 |
DC_{ B } | 100 | 10% | 0.957 | 1.000 | 1.000 | 1.000 |
DC_{ B } | 100 | 40% | 0.952 | 1.000 | 1.000 | 1.000 |
DC_{ W } | 50 | 10% | 0.926 | 1.000 | 1.000 | 1.000 |
DC_{ W } | 50 | 40% | 0.916 | 1.000 | 1.000 | 1.000 |
DC_{ W } | 75 | 10% | 0.920 | 1.000 | 1.000 | 1.000 |
DC_{ W } | 75 | 40% | 0.936 | 1.000 | 1.000 | 1.000 |
DC_{ W } | 100 | 10% | 0.929 | 1.000 | 1.000 | 1.000 |
DC_{ W } | 100 | 40% | 0.933 | 1.000 | 1.000 | 1.000 |
Power comparison under two different scenarios
n | cen. | Scenario 1: SRC_{ B } | Scenario 1: SC_{ B } | Scenario 1: SRC_{ W } | Scenario 1: SC_{ W } | Scenario 2: SRC_{ B } | Scenario 2: SC_{ B } | Scenario 2: SRC_{ W } | Scenario 2: SC_{ W } |
---|---|---|---|---|---|---|---|---|---|
Strong effect: β= [β_{1}, β_{2}, ..., β_{30}]^{T} = [1, 1, ..., 1]^{T} | | | | | | | | | |
50 | 10% | 0.757 | 0.742 | 0.675 | 0.650 | 0.600 | 0.599 | 0.723 | 0.692 |
50 | 30% | 0.624 | 0.580 | 0.536 | 0.490 | 0.480 | 0.422 | 0.546 | 0.505 |
50 | 50% | 0.448 | 0.350 | 0.381 | 0.312 | 0.350 | 0.250 | 0.359 | 0.294 |
75 | 10% | 0.960 | 0.956 | 0.907 | 0.902 | 0.876 | 0.870 | 0.944 | 0.942 |
75 | 30% | 0.883 | 0.864 | 0.828 | 0.766 | 0.783 | 0.771 | 0.875 | 0.822 |
75 | 50% | 0.758 | 0.626 | 0.690 | 0.526 | 0.607 | 0.494 | 0.694 | 0.580 |
100 | 10% | 0.998 | 0.997 | 0.985 | 0.982 | 0.974 | 0.973 | 0.996 | 0.995 |
100 | 30% | 0.982 | 0.974 | 0.948 | 0.917 | 0.940 | 0.905 | 0.966 | 0.955 |
100 | 50% | 0.928 | 0.846 | 0.868 | 0.730 | 0.806 | 0.695 | 0.883 | 0.802 |
Low effect: β= [β_{1}, β_{2}, ..., β_{30}]^{T} = [0.5, 0.5, ..., 0.5]^{T} | | | | | | | | | |
50 | 10% | 0.666 | 0.625 | 0.594 | 0.576 | 0.266 | 0.242 | 0.326 | 0.305 |
50 | 30% | 0.498 | 0.487 | 0.440 | 0.430 | 0.206 | 0.165 | 0.224 | 0.214 |
50 | 50% | 0.362 | 0.285 | 0.328 | 0.244 | 0.144 | 0.122 | 0.162 | 0.124 |
75 | 10% | 0.930 | 0.924 | 0.859 | 0.850 | 0.492 | 0.466 | 0.574 | 0.570 |
75 | 30% | 0.824 | 0.756 | 0.756 | 0.688 | 0.370 | 0.325 | 0.432 | 0.421 |
75 | 50% | 0.642 | 0.553 | 0.571 | 0.469 | 0.263 | 0.206 | 0.312 | 0.224 |
100 | 10% | 0.992 | 0.990 | 0.964 | 0.950 | 0.662 | 0.654 | 0.796 | 0.792 |
100 | 30% | 0.959 | 0.944 | 0.918 | 0.850 | 0.558 | 0.505 | 0.652 | 0.594 |
100 | 50% | 0.852 | 0.760 | 0.802 | 0.654 | 0.412 | 0.319 | 0.472 | 0.370 |
Results are shown in Table 2.
For scenario 1, all 30 genes have effects. As expected, the power of the tests increases with total sample size and gene effect, but decreases as the censoring proportion grows. Under the first scenario, the power of the SRC_{ B } method is always better than that of SC_{ B }, and SRC_{ W } is always better than SC_{ W }. This result indicates that treating the compound score as a random covariate yields higher power than treating the score as a fixed covariate. When the sample size is 100, the average power increases by 10.6, 3.5, and 0.4 percentage points for 50%, 30%, and 10% censoring, respectively. This is a reasonable result, because the random covariate approach involves fitting a quadratic Cox regression model, $\text{exp}\left({\gamma}_{0}{\mu}_{j}+{\sigma}_{j}^{2}{\gamma}_{0}^{2}/2\right)$, instead of exp(γ_{0}μ_{ j }). The quadratic form takes into account the variance of each score; using the compound score without acknowledging covariate error yields lower power.
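The averages quoted above can be checked against the n = 100 rows of Table 2. The sketch below assumes (this is an inference from the numbers, not something the text states) that each figure is the mean of the random-minus-fixed power difference over both weights, both scenarios and both effect sizes. The names `power_n100` and `mean_gain` are illustrative.

```python
# Power at n = 100 copied from Table 2. Each row lists
# (SRC_B, SC_B, SRC_W, SC_W) for scenario 1, then the same four for scenario 2.
power_n100 = {
    "10%": [
        (0.998, 0.997, 0.985, 0.982, 0.974, 0.973, 0.996, 0.995),  # strong effect
        (0.992, 0.990, 0.964, 0.950, 0.662, 0.654, 0.796, 0.792),  # low effect
    ],
    "30%": [
        (0.982, 0.974, 0.948, 0.917, 0.940, 0.905, 0.966, 0.955),  # strong effect
        (0.959, 0.944, 0.918, 0.850, 0.558, 0.505, 0.652, 0.594),  # low effect
    ],
    "50%": [
        (0.928, 0.846, 0.868, 0.730, 0.806, 0.695, 0.883, 0.802),  # strong effect
        (0.852, 0.760, 0.802, 0.654, 0.412, 0.319, 0.472, 0.370),  # low effect
    ],
}


def mean_gain(rows):
    """Average SRC minus SC power difference, in percentage points."""
    # Columns come in (random, fixed) pairs at positions 0-1, 2-3, 4-5, 6-7.
    gains = [row[k] - row[k + 1] for row in rows for k in range(0, 8, 2)]
    return 100.0 * sum(gains) / len(gains)


gains = {cens: round(mean_gain(rows), 1) for cens, rows in power_n100.items()}
```

Under that assumption the computation gives 0.4, 3.5 and 10.6 percentage points for 10%, 30% and 50% censoring, matching the figures in the text.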
Examples
In this section, we demonstrate our methodology using two examples, an Amsterdam 70-gene breast cancer gene signature [1] and a data set involving two pathways for non-small-cell lung cancer. All tests with nominal level 0.05 were applied to the training cohort. The R code for obtaining p-values for the proposed testing procedure is available from the authors upon request.
Breast cancer data set
Breast cancer data set analysis
Method | Coef | RR | p-value |
---|---|---|---|
SRC _{ B } | 0.052 | 1.12 | 1.9 × 10^{-8} |
SRC _{ W } | 0.022 | 1.12 | 1.8 × 10^{-8} |
SC _{ B } | 0.093 | 1.10 | 1.1 × 10^{-7} |
SC _{ W } | 0.040 | 1.04 | 1.3 × 10^{-7} |
DC _{ B } | 0.078 | 1.08 | 8.6 × 10^{-13} |
DC _{ W } | 0.015 | 1.02 | 1.1 × 10^{-13} |
Although all coefficients and relative risks are very close, the p-values are very different. When using DC_{ B } and DC_{ W }, the p-values are 8.6 × 10^{-13} and 1.1 × 10^{-13}, respectively. When treating the compound covariate as fixed, the p-values of SC_{ B } and SC_{ W } are 1.1 × 10^{-7} and 1.3 × 10^{-7}. When using our procedure, the p-values of SRC_{ B } and SRC_{ W } are 1.9 × 10^{-8} and 1.8 × 10^{-8}. Although the results remain significant regardless of method, we achieve appropriate p-values for the training cohort, showing that the 70-gene prognosis signature can be used to evaluate early events in breast cancer patients. Our conclusion is consistent with that of other studies [16, 17].
Non-small cell lung cancer data set
Non-small-cell lung cancer data set analysis
Method | Pathway | Coef | RR | p-value | Overall p-value |
---|---|---|---|---|---|
SRC_{ B } | NOD | 0.033 | 1.0013 | 0.59 | 0.236 |
SRC_{ B } | P53 | 0.037 | 1.0044 | 0.67 | |
SRC_{ W } | NOD | 0.016 | 1.0063 | 0.37 | 0.358 |
SRC_{ W } | P53 | 0.001 | 1.0002 | 0.99 | |
SC_{ B } | NOD | 0.077 | 1.08 | 0.36 | 0.432 |
SC_{ B } | P53 | 0.015 | 1.01 | 0.90 | |
SC_{ W } | NOD | 0.034 | 1.03 | 0.24 | 0.432 |
SC_{ W } | P53 | -0.01 | 0.99 | 0.74 | |
DC_{ B } | NOD | 0.072 | 1.07 | 0.37 | 2.29 × 10^{-6} |
DC_{ B } | P53 | 0.314 | 1.37 | 0.003 | |
DC_{ W } | NOD | 0.019 | 1.02 | 0.21 | 1.85 × 10^{-5} |
DC_{ W } | P53 | 0.055 | 1.06 | 0.006 | |
To summarize all the information, two compound covariates were used. As shown, conventional Cox regression yields overall p-values that are strongly statistically significant (2.29 × 10^{-6} for DC_{ B } and 1.85 × 10^{-5} for DC_{ W }). When treating the compound score as a fixed covariate and using a split data set, however, the p-values of SC_{ B } and SC_{ W } become 0.432 for both. When treating the compound score as a random covariate, the p-values of SRC_{ B } and SRC_{ W } become 0.236 and 0.358, respectively. Such divergent p-values suggest that an inappropriate method may well lead to misleading results.
Concluding remarks
In this paper, we focused on survival outcomes and proposed a feasible and correct method for testing the compound covariate to evaluate its association with survival outcomes for training cohort data. We have described the use of a random covariate, SRC_{ B }/SRC_{ W }, to achieve correct testing results for training cohort data and moderately improve power as compared to the use of SC_{ B }/SC_{ W }. The simulation study shows that our proposed method performs consistently better than SC_{ B }/SC_{ W }, because the quadratic term utilized in the SRC_{ B }/SRC_{ W } method takes into account error in the compound covariate. We further found that an increase in sample size improves power when there is a high proportion of censored data. If the gene set of interest includes noise genes, we suggest that the compound covariate SRC_{ W } is a better choice than SRC_{ B }; whether noise genes or non-functionally related genes are hidden in the gene set is a judgment call for a geneticist. In addition, we contend that a flaw of some biomedical papers concerned with such topics is that they report a biased p-value based on a flawed compound covariate analysis of the same training data set. In this paper, we use the well-known 2-fold concept, with one part of the data used to build the compound covariate and the remaining part used to test whether there is an association between survival outcomes and the score, to ensure correct p-values in the training data set. Note that the proportional hazards assumption still needs to be checked.
using the same analysis procedure. The chosen weight $\widehat{\beta}$ or $\widehat{w}$ can be adjusted for the other observed clinical covariates in the proposed framework. Our method, however, cannot be used to test the interaction between two random covariates, because of the complexity of specifying the distribution for the interaction between two random covariates; this is an area worthy of further investigation. Another potential practical concern with the proposed method is that the sample size must not be too small; higher fractions of censored data create the need for a further increase in sample size. Similarly, to achieve high power when studying a large number of genes, a greater sample size is needed. When studying a large number of genes, ignoring the covariance that exists between genes does not influence the type I error rate; however, taking the covariance into account may increase power. Further research is required to address these limitations. Note that the permutation test (e.g., [20]) might be another method to calculate an appropriate p-value for the training data set; however, with the permutation test, weights are not easily adjusted for the other covariates. Even for a small gene set, this approach may be too expensive in computer time.
Declarations
Acknowledgements
The authors wish to thank editorial assistants, Lynne Berry and Yvonne Poindexter, for their editorial work on this manuscript. This work was supported by National Cancer Institute (grant numbers P30 CA068485, P50 CA095103, P50 CA090949, P50 CA098131).
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
References
- van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend S: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415: 530-536. 10.1038/415530a.
- Wang Y, Klijn J, Zhang Y, Sieuwerts A, Look M, Yang F, Talantov D, Timmermans M, Meijer-van Gelder M, Yu J, Jatkoe T, Berns E, Atkins D, Foekens J: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365: 671-679.
- Tukey JW: Tightening the clinical trial. Control Clin Trials. 1993, 14: 266-285. 10.1016/0197-2456(93)90225-3.
- Tomasson H: Risk scores from logistic regression: unbiased estimates of relative and attributable risk. Stat Med. 1995, 14: 1331-1339. 10.1002/sim.4780141206.
- Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J, Raffeld M, Yakhini Z, Ben-Dor A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Sauter G: Gene-expression profiles in hereditary breast cancer. N Engl J Med. 2001, 344: 539-548. 10.1056/NEJM200102223440801.
- Lossos IS, Czerwinski DK, Alizadeh AA: Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Engl J Med. 2004, 350: 1828-1837. 10.1056/NEJMoa032520.
- Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, Yuan A, Cheng CL, Wang CH, Terng HJ, Kao SF, Chan WK, Li HN, Liu CC, Singh S, Chen WJ, Chen JJ, Yang PC: A five-gene signature and clinical outcome in non-small-cell lung cancer. N Engl J Med. 2007, 356: 11-20. 10.1056/NEJMoa060096.
- Hsu YC, Yuan S, Chen HY, Yu SL, Liu CH, Hsu PY, Wu G, Lin CH, Chang GC, Li KC, Yang PC: A four-gene signature from NCI-60 cell line for survival prediction in non-small cell lung cancer. Clin Cancer Res. 2009, 15: 7309-7315. 10.1158/1078-0432.CCR-09-1572.
- Beer DG, Kardia SLR, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JMG, Iannettoni MD, Orringer MB, Hanash S: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med. 2002, 8: 816-824.
- Cox DR: Regression models and life-tables. J R Statist Soc B. 1972, 34: 187-220.
- Salmon S, Chen H, Chen S, Herbst R, et al: Classification by mass spectrometry can accurately and reliably predict outcome in patients with non-small cell lung cancer treated with erlotinib-containing regimen. J Thorac Oncol. 2009, 4: 689-696. 10.1097/JTO.0b013e3181a526b3.
- Prentice RL: Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika. 1982, 69: 331-342. 10.1093/biomet/69.2.331.
- Zhao H, Tibshirani R, Brooks J: Gene expression profiling predicts survival in conventional renal cell carcinoma. PLoS Med. 2005, 3: 115-124.
- Efron B, Tibshirani R: On testing the significance of sets of genes. Ann Appl Stat. 2007, 1: 107-129. 10.1214/07-AOAS101.
- Kaplan EL, Meier P: Nonparametric estimation from incomplete observations. J Amer Stat Assoc. 1958, 53: 457-481. 10.1080/01621459.1958.10501452.
- van de Vijver MJ, He YD, van 't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 2002, 347: 1999-2009. 10.1056/NEJMoa021967.
- Buyse M, Loi S, van't Veer L, Viale G, Delorenzi M, Glas AM, Saghatchian d'Assignies M, Bergh J, Lidereau R, Ellis P, Harris A, Bogaerts J, Therasse P, Floore A, Amakrane M, Piette F, Rutgers E, Sotiriou C, Cardoso F, Piccart MJ: Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Nat Cancer Inst. 2006, 98: 1183-1192. 10.1093/jnci/djj329.
- Rosenstiel P, Till A, Schreiber S: NOD-like receptors and human diseases. Microb Infect. 2007, 9: 648-657. 10.1016/j.micinf.2007.01.015.
- Shyr Y, Kim K: Weighted flexible compound covariate method for classifying microarray data. A practical approach to microarray data analysis. 2003, Norwell: Kluwer Academic Publishers, New York, 186-200.
- Radmacher MD, McShane LM, Simon R: A paradigm for class prediction using gene expression profiles. J Comput Biol. 2002, 9: 505-511. 10.1089/106652702760138592.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.