Exploring molecular links between lymph node invasion and cancer prognosis in human breast cancer
© Kim et al; licensee BioMed Central Ltd. 2011
Published: 14 December 2011
Skip to main content
© Kim et al; licensee BioMed Central Ltd. 2011
Published: 14 December 2011
Lymph node invasion is one of the most powerful clinical factors in cancer prognosis. However, molecular level signatures of their correlation are remaining poorly understood. Here, we propose a new approach, monotonically expressed gene analysis (MEGA), to correlate transcriptional patterns of lymph node invasion related genes with clinical outcome of breast cancer patients.
Using MEGA, we scored all genes with their transcriptional patterns over progression levels of lymph node invasion from 278 non-metastatic breast cancer samples. Applied on 65 independent test data, our gene sets of top 20 scores (positive and negative correlations) showed significant associations with prognostic measures such as cancer metastasis, relapse and survival. Our method showed better accuracy than conventional two class comparison methods. We could also find that expression patterns of some genes are strongly associated with stage transition of pathological T and N at specific time. Additionally, some pathways including T-cell immune response and wound healing serum response are expected to be related with cancer progression from pathway enrichment and common motif binding site analyses of the inferred gene sets.
By applying MEGA, we can find possible molecular links between lymph node invasion and cancer prognosis in human breast cancer, supported by evidences of feasible gene expression patterns and significant results of meta-analysis tests.
The presence of lymph node invasion is one of the strongest indicators for prognoses of distant metastasis and survival in most cancers [1, 2]. In the multi-step process of cancer metastasis development, invasion into a vascular or a lymphatic system has generally been believed to be a key step of tumor cell dissemination [3–5]. Once tumor cells acquire abilities of intravasation and survival in an unfavorable vascular environment, they circulate around the whole body parts to form new tumors at the secondary site . While the exact mechanisms of cancer metastasis through blood vessels and lymph nodes are still being studied, it is necessary to explain the processes in a genetic level as a key factor of cancer patients’ prognosis.
Many researchers have devoted their efforts to understand lymph node invasion in breast cancers, because regional lymph nodes are frequently observed as the first site of metastasis . Survival analyses with clinical features showed that lymph node status is generally marked as a top significant factor among conventional clinical features [8–10]. Studies of finding molecular markers using genome-wide expression profiles identified various genetic signatures for prediction of lymph node and distant metastasis [11–19]. However, the associations between conventional clinical features including tumor size, lymph node involvement and distant metastasis (TNM staging ) and prognosis are not yet identified in a genetic level. Moreover, the existence of a common gene set for lymph node metastasis in a transcriptional level is unclear .
So far, t-test based differential expression analysis or clustering methods between lymph node negative and positive samples have been used to detect corresponding gene sets [17–19, 21]. Although these methods are straightforward and intuitive, there are several inherent problems in them. First, direct comparison within two classes (lymph node negatives and positives) may simplify the subtle changes over cancer progression. Usually, four-stage pathological N (N0~N3) is used to indicate the degree of lymph node invasion in breast cancer; N0 denotes no lymph node invasion observed. Regarding the expression values as longitudinal data to find patterns over lymph node progression might benefit from utilizing known biomedical information. For instance, a gene whose expression is significantly high only at a certain stage (e.g. N2) is hardly accepted as a closely related gene from current metastasis model. However, a two-class comparison (e.g. N0 vs. others) would mark it differently expressed. Second, the effect of a factor (e.g. lymph node invasion) should be separated from the effect of the others (e.g. tumor size or histological subtype). These factors are generally not independent and will lead to false findings unless carefully analysed. And, the validation of inferred gene signatures should be performed on sufficient number of independent sets in a strict statistical manner. The data statistics and characteristics including inherent biases should be recognized, appropriately treated and be properly analyzed in a meta-analytic way.
There are several statistical models applicable for multivariate correlation scoring (instead of two-class based scoring). Linear/Non-linear (multiple) regression and analysis of variance models (two-way ANOVA and MANOVA) have been widely used in various fields. Both models (linear and non-linear), however, have a few weaknesses; a gene expression pattern over lymph node progression is not necessarily linear, and the data has too few time points to be assessed in a non-linear way. ANOVA models are usually used to test if there is a significant difference among the mean values, so it is not robust to inconsistent fluctuations of expression values. In time series analyses, autoregressive moving average model and its variants (ARMA, ARMAX and ARIMA) are widely used especially in electronic engineering and system identification fields, and some unit root tests (for stationarity test in time series including Augmented Dickey Fuller test ) have been used in statistics and econometrics as well. However, there are a few difficulties in adapting these models to our problem; the number of time points is very few, intervals are not regular and the stage is a pseudo-time. After reviewing the conventional models, we developed a new multivariate correlation measure specially designed for non-linear and small data point analysis. Nevertheless, the conventional models were applied as well and tested to compare with our measure and two-class based analyses.
Our method, monotonically expressed gene analysis (MEGA), scores gene expression patterns with their non-linear monotonicity over a stage progression of interest. It accumulates all the normalized expressional differences between two consecutive stages (see Methods). If the direction of expressional change is consistently positive or negative, the score increases; otherwise, the sum of differences will be cancelled out. Because there are two non-independent factors (stage T and N), one variable should be fixed while the other variable is being used. In MEGA, a two dimensional matrix is constructed, each dimension of which is composed of four points (N0~N3 and T1~T4, T0 is excluded due to the lack of data) generating totally 16 data points per a gene. So, applying the scoring function to each row or column represents calculating the cumulative expressional changes over one factor while the other is fixed. MEGA also has a weight parameter to emphasize a specific stage transition (e.g. N1→ N2) to capture genes activated or repressed in a particular time range. After calculating scores, top k genes are collected and named N-wise monotonically expressed genes (N-MEG) or T-wise monotonically expressed genes (T-MEG) depending on which factor is used for the analysis. Validation of inferred gene sets can be done in a retrospective way to see how accurately the gene sets classify prognostic outcomes in other independent data. P-values from each test data are integrated by meta-analysis to report more confident accuracy of the gene sets. This is basically one of the most unbiased ways for evaluating usefulness of inferred gene signatures.
If the gene sets show consistence and confident accuracy, a series follow-up analysis can be used for reasoning biological meaning (e.g. common pathways or transcription factors). First, gene set analysis can discover some biological pathways involving in metastasis progression. Considering pathways instead of individual genes as an acting unit of biological phenomena explains how different gene sets are sometimes associated with same conditions. And we can find more succinct way to describe the whole processes. Second, the fact that the genes show similar expression patterns as the cancer metastasis progresses leads us to a hypothesis that some common transcription factors play a crucial role in the process. Here, all the genes are not necessarily causative; rather, they are effect from changes of a fewer number of genes in upper hierarchy. In this case, finding frequently represented motifs from the promoter regions of the gene sets might be a good analysis for discovering the transcription factors. This would be more powerful information in practical applications such as pharmaceutical research and patient treatment.
Totally four gene sets of size 20 are constructed from 278 breast tumor gene expression data (expO database) by applying monotonically expressed gene analysis (MEGA). They are N-wise monotonically expressed genes (N-MEG) and T-wise monotonically expressed genes (T-MEG), which are further divided into positive and negative correlation sets. Given these four gene sets (N-MEG+, N-MEG-, T-MEG+, and T-MEG-), we tested on65 independent breast cancer prognosis data sets downloaded from ONCOMINE database (See Methods for details) how much the expression values of the genes are correlated with prognostic outcomes.
T-MEG (n=40, 20 T-MEG+ and 20 T-MEG-) were also significantly correlated with breast cancer prognosis including metastasis and relapse, but the significance was generally worse than N-MEG (Table 1). In the prognosis of metastasis studies, both of the T-MEG+ and T-MEG- were significant (p-values of 4.3x10-8 and3.1x10-3respectively), but they were not as effective as N-MEG (p-values of 1.2x10-15 and1.4x10-6). This result agrees with the previously known pathological facts; both of the degree of lymph node invasion and tumor size are important in predicting metastasis probabilities, while the former gives more direct evidences. We can also notice that tumor size related gene were either not significant (in prognosis of survival and tumor stages) or less significant than lymph node invasion related genes (in prognosis of relapse and tumor grade).
Comparison of N-MEG and T-MEG in a meta-analysis test
Comparison of stage transition specific genes
N-MEG0 → 1 (n=40)
N-MEG1 → 2 (n=40)
N-MEG2 → 3 (n=40)
Firstly, we expected that N-MEG0→1 would be more informative than N-MEG in the other stage transitions. Because, it is thought that if a set of tumor cells acquire high motility to migrate and intravasate into lymph nodes, dissemination of tumor cells over the larger parts of lymph nodes would follow spontaneously . But the result of meta-analysis test represents that there would be another transcriptional changing event in the late step of lymph node invasion before raising distant metastasis.
To inspect the characteristics of N-MEG0→1 and N-MEG2→3, we chose 200 genes from each gene set (100 positive and 100 negative genes in N-MEG0→1 and N-MEG2→3) and compared them each other. We found that there were few overlaps between two gene sets; no overlap in top 20 genes, and only two overlaps in 200 genes. But in the gene function analysis using Gorilla , both gene sets were enriched in the immune response GO terms (p-values ~ 1.0×10-4). Where the immune response is a well-known process affecting lymph node invasion [25–27], it is convincing that both gene sets are distinct but closely related to lymph node invasion by connected pathways (see Additional Files 1 and 2 for full enrichment map).
Enriched pathway analysis of N-MEG+ using GSEA
Gene Set Name
Gene Set Description
Up-regulated with adaptive T cell activities in viral clearance
Up-regulated in immature T cell in CD4+ T cell differentiation
Up-regulated in undifferentiated tumor cells
Up-regulated in serum response of fibroblasts (wound healing)
Up-regulated in breast tumor cells of negative results (prognosis)
We also conducted the same analysis with N-MEG0→1 and N-MEG2→3 (β=10). While there was no significant common binding sites in N2→N3 progression, we found STAT1, IRF2, and IRF1 can be common binding transcription factors in 50 N-MEG0→1 (p-values of 4.4×10-13,1.1×10-12, and7.4×10-12 respectively, data shown in Figure 6). We found that IRF1, who plays a tumor suppressing role, is negatively regulated by competitive transcriptional binding of IRF2, both of which were significantly correlated with tumor stage (p-value 0.001), depth of tumor infiltration (p-value 0.006) and lymph node metastasis (p-value 0.015) in human esophageal cancers . Here, we also suggest that IRF2-IRF1 pathway is likely to be involved in lymph node invasion and metastasis progression in human breast cancers with well-known activities of STAT1 [41–44].
In this study, we proposed a monotonically expressed gene analysis (MEGA) for extracting genes that are related to lymph node invasion and tumor size in breast cancer. We analysed expression patterns over a two dimensional N×T space and provided results of meta-analysis to evaluate the gene sets. The test has been conducted on completely independent data sets. We showed that gene sets selected from the suggested LE’ and TE’ functions are strongly correlated with cancer prognoses including metastasis, relapse and survival, and showed significantly better results than conventional approaches. These functions are specially designed to capture expressional differences between two consecutive stages and consistency of expression patterns as well. The MEGA model also enabled us to analyze the impact of each clinical factor independently, and to inspect a specific stage transition in a cancer progression.
Although the MEGA analysis provided a feasible link to clinical factors and cancer prognoses in a genetic level, some parts remain to be improved. First, the monotonicity can be defined in various ways. Currently we can rarely determine the activities of genes in an absolute expression level. Some genes have higher saturation level so that the gene expression pattern might show a monotonic increase through all the clinical stages. On the other hand, if an activity of a gene is easily saturated, the gene expression pattern above a certain degree would not be informative anymore. Handling and determining the optimized pattern on every gene is almost impossible, but other heuristics will be available using kinetics and text-mining. Second, integration with other omics data always count. It is still a big question; how to connect two different types and levels of information. As shown in Figure 7, protein level information explains another big part of clinical outcomes. In our opinion, those information sets should be integrated to an augmented ‘gene’ entity with other available information like point mutations, SNPs, and CNVs. In this case, the MEGA model has to be revised in its scoring functions. Lastly, finding for driving mechanisms of progression is one of the ultimate goals in this field. We tried to elucidate these mechanisms through pathway analyses and commonly binding transcription factor analyses here, but it is yet to be a striking discovery. After we solve the prior questions, our approach might be more helpful in clarifying the core genes or genetic events that can be essential in therapeutic applications.
We used 278 breast tumor gene expression data from the expO (expression project for Oncology) database (http://www.intgen.org/expo.cfm, International Genomics Consortium). The data can be also downloaded from NCBI’s GEO database (GSE2109). From 2,158 gene expression profiles for all tissues and tumor types, we chose only breast carcinomas. Samples without pathological N and T stage records, or whose pathological M stage and histological information indicate inclusion of distant metastasis were removed. Finally, the 278 non-metastasis breast tumor samples were categorized into 16 N×T classes (N0~N3, T1~T4). We also used seven normal breast gene expression profiles from GSE3744  to infer the deviation of each N×T stage against the normal condition. Normalization of data was processed with the Simpleaffy Package  in R by applying RMA normalization for every N×T stage with normal breast samples. After normalization, probe sets were collapsed into gene symbols using the GSEA collapse tool . Each gene was scored by log 2 based fold change.
For test sets, we used 65 breast cancer data sets from ONCOMINE database . The 65 data sets were firstly classified into three major analysis types (Prognosis, Tumor Stage, and Tumor Grade) each of which is further classified into matching minor subtypes. Minor typing has been done by manual inspection. From the ONCOMINE database, we could download pre-analyzed tables which include sample size, statistics, and two-tailed p-values. Two-tailed p-values were further converted into one-tailed p-values.
where the first row and the first column denote gene expressions of normal samples.
Because β is only meaningful in two consecutive stages, the β matrix is much like a diagonal matrix. The off-diagonal entries are always defined as 1. In this study, we applied the leaping factors only for N stage progression.
We can interpret the value of LE and TE functions as a sum of moving deviations in the course of N and T stage progression. For each stage, we calculate the gene expression differences between the current stage and previous stages from the beginning of stage to the one step before the current stage. If a certain gene shows a monotonic increase or decrease in its gene expression along the N or T stages, the absolute value of the function would be larger.
We selected top 40 genes for each monotonicity function (LE’ and TE’) with their absolute scores. The 40 genes are composed of 20 genes of high score and 20 genes of low score. The genes of high LE’ score means that the expression of those genes showed monotonic increase as the lymph node invasion progresses, so the set of the genes is named N-wise monotonically expressed genes with positive correlation (N-MEG+). Similarly, N-wise monotonically expressed genes with negative correlation (N-MEG-) and two other gene sets on a tumor size factor (T-MEG+ and T-MEG-) were defined.
During this procedure, some genes were missed due to the different naming strategies and the difference of coverage among the test sets. To minimize the loss of information, we searched for all aliases and symbols of earlier versions using the recent version of HUGO Gene Nomenclature Committee (HGNC) database (2009/08/23).
Before finalizing p-values of meta-analyses, we noticed that the p-values in the ONCOMINE data sets are upwardly biased. In the 65 studies, 55 studies have bigger number of significantly changed genes than we expected. And we also found that 43 studies have more up-regulated genes than down-regulated genes. We do not insist that these results mean experimental errors; we would rather think it is natural that many of genes are going to be actively expressed as the cancer progresses and regulatory mechanisms are being broke down. But in the case of test procedures, we are likely to get more false positives unless we consider the background biases. For example, some random gene sets may represent ‘more than average’ results and will be thought to be significant for cancer phenotypes.
To correct the background biases, we generated 1,000 random gene sets (n=20) and tested on the ONCOMINE data sets. And we computed Stouffer’s overall Z score on each random gene set. Averaged Z scores represent an expected Z score from a random gene set. From this result, we could conclude that the meta-analysis result of gene sets look more up-regulated in the cancer progression than they really are. So we corrected all the Stouffer’s Z scores result from N-MEG and T-MEG by subtracting the mean values of Z in the up-regulation test and adding in the down-regulation test.
We first extracted lymph node invasion related gene sets from previous studies (Suzuki et al , Abba et al , and Ellsworth et al ). Each gene set was tested using the corrected Stouffer’s Z test described in previous section. We found that the gene set from Abba et al was already reduced from 300 to 46 genes using eight prognosis experiments. So the Abba set was not used in further comparisons. We also found there are four experiments which used expO data set in the ONCOMINE test set (Bittner et al). All the overlapping data was excluded in the test procedure. Additionally, we selected 40 genes from expO data using two-way ANOVA and multiple regression models. For each gene, both models were constructed using ‘aov’ and ‘lm’ functions in R. In ANOVA models, genes were sorted by N stage dependent two-tailed p-values derived from their F-statistics because of the model’s non-directionality. In multiple regression models, 40 genes with the highest P-values of N stage were selected where their T stage and interaction terms are not significant. Directions of regulation were determined from the estimated coefficients (up >0, down <0). Finally five gene sets (N-MEG, multiple regression set, two-way ANOVA set, Suzuki set, and Ellsworth set) were tested and their p-values were reported. P-values from meta-analysis were converted into log-odd score using –log10(P). For each study, the final score was calculated from adding the two log-odd scores (from up and down regulated gene sets).
The efforts of the International Genomics Consortium (IGC) and expO (expression project for Oncology) are greatly acknowledged. We would like to thank CHUNG Moon Soul Center for BioInformation and BioElectronics for providing research facilities. This work was supported by the World Class University program (R32-2008-000-10218-0) of the Ministry of Education, Science and Technology through the National Research Foundation of Korea.
This article has been published as part of BMC Systems Biology Volume 5 Supplement 2, 2011: 22nd International Conference on Genome Informatics: Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/5?issue=S2.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.