Skip to main content

GSAE: an autoencoder with embedded gene-set nodes for genomics functional characterization



Bioinformatics tools have been developed to interpret gene expression data at the gene set level, and these gene set based analyses improve the biologists’ capability to discover functional relevance of their experiment design. While elucidating gene set individually, inter-gene sets association is rarely taken into consideration. Deep learning, an emerging machine learning technique in computational biology, can be used to generate an unbiased combination of gene set, and to determine the biological relevance and analysis consistency of these combining gene sets by leveraging large genomic data sets.


In this study, we proposed a gene superset autoencoder (GSAE), a multi-layer autoencoder model with the incorporation of a priori defined gene sets that retain the crucial biological features in the latent layer. We introduced the concept of the gene superset, an unbiased combination of gene sets with weights trained by the autoencoder, where each node in the latent layer is a superset. Trained with genomic data from TCGA and evaluated with their accompanying clinical parameters, we showed gene supersets’ ability of discriminating tumor subtypes and their prognostic capability. We further demonstrated the biological relevance of the top component gene sets in the significant supersets.


Using autoencoder model and gene superset at its latent layer, we demonstrated that gene supersets retain sufficient biological information with respect to tumor subtypes and clinical prognostic significance. Superset also provides high reproducibility on survival analysis and accurate prediction for cancer subtypes.


Nowadays gene set based analysis has been an essential step for interpreting gene expression data, for which a variety of bioinformatics tools have been developed to extract biological insights from different aspects. Among all methods, functional enrichment is the most common gene set based analysis to determine classes of genes that are associated with disease phenotypes, such as Gene Set Enrichment Analysis (GSEA) [1]. Function over-representation is another category for enrichment analysis, represented by The Database for Annotation, Visualization and Integrated Discovery (DAVID) [2, 3], among many others [4,5,6]. Researchers also employ gene set as a classifier; for example, the 50-gene PAM50 model was used to classify the subtypes of breast cancer [7]. Furthermore, many studies have conducted survival analysis at gene set level to predict clinical outcomes [8, 9]. Overall, gene set analysis improves the biologists’ capability to interpret functional impact to their experiment design. However, some studies have also disclosed the inconsistency of gene set results. Lau et al. showed that there are only minimal overlaps between the putative prognostic gene sets for non-small-cell lung cancer found in nine various studies [10]. Hence, inter gene sets association should be taken into consideration, as suggested by various studies, to limit inconsistency. While combined gene sets may provide consistency, its biological relevance is rarely discussed.

Deep learning methods have emerged recently in computational biology due to the increase of molecular and cellular profiling data. Convolutional neural network (CNN) methods were implemented for prediction of DNA-protein binding [11] or detection of phenotype-associated cell subsets [12]. Autoencoder, which is an unsupervised learning algorithm, was used for modeling gene expression through dimensionality reduction in many studies [13,14,15]. Lin et al. proposed a supervised neural network model for single-cell RNA-seq data that incorporate protein–protein interaction (PPI) and protein–DNA interaction (PDI) information [13]. However, the prior biological knowledge was only utilized to improve the performance of dimensionality reduction and cell type-specific identification, and the influence of combining PPI nodes was not examined.

In this study, we proposed Gene Superset AutoEncoder (GSAE), a multi-layer autoencoder model that incorporates a priori defined gene sets to preserve the crucial biological features from combining gene sets in the latent layer. We introduced the concept of the gene superset, an unbiased combination of gene sets, with weights trained by the autoencoder, where each node in the latent layer is termed a superset. The goal of this study is to determine the functional or clinical relevance of the learned gene supersets from our model, where the model evaluates gene expression data at the level of superset. To achieve our goal, we used large-scale RNA-seq data sets from The Cancer Genome Atlas (TCGA) to test GSAE and investigate the top ranked gene sets in the statistically significant supersets. We demonstrated that gene supersets preserve sufficient biological information with respect to tumor subtypes and clinical prognostic significance. Our study also compared different neural network classifiers and the superset classifier showed high accuracy in cancer subtype prediction. We concluded that superset produces more reproducible results than single gene sets, provides robustness in cancer subtype classification, and has the capability to learn potential gene sets association.


Data sets in this study

For Pan-cancer (PanCan) analysis, we collected TCGA RNA-seq data that was organized by TumorMap [16], which contains 9806 samples in 33 cancer types. In addition to entire TCGA data, we also selected breast invasive carcinoma (BRCA) data with 1099 samples for characterizing network nodes. For survival analysis, lung adenocarcinoma (LUAD) with 515 samples were chosen. Furthermore, we used four data sets with sufficient survival information, LUAD, BRCA, lower grade glioma (LGG, 523 samples), and skin cutaneous melanoma (SKCM, 469 samples) to compare the reproducibility of supersets and gene sets. The expression profiles of all tumor RNA-seq in this study are in the Transcripts Per Million (TPM) unit and then log-transformed (logTPM = log2 (TPM + 1)), which are re-analyzed uniformly for all samples [16].

Gene superset autoencoder

The architecture of GSAE is shown in Fig. 1. The input of the model is the gene expression profiles in log2 TPM values. The output x of the jth node in the ith layer can be formulated as

$$ {x}_{ij}=g\left({b}_{\left(i-1\right)}+\sum \limits_j{w}_{\left(i-1\right)j}{x}_{\left(i-1\right)j}\right) $$

where the bias b and the weight w are the two parameters that are learned in training, g() is the activation function, where we used the linear activation in the output layer and rectified linear unit (ReLU, defined in Eq. 2) in other layers to provide non-linearity while keeping a scoring feature in the model.

$$ ReLU=\left\{\begin{array}{c}x, if\ x>0\ \\ {}0, otherwise\ \end{array}\right. $$
Fig. 1
figure 1

The architecture of gene superset autoencoder (GSAE). In the gene set layer, one color node represents a gene set, and edges in the same color show connect associate genes to a gene set

Besides the input layer in our proposed autoencoder, the first two layers are the encoding network that extracts the important features from gene expression. ReLU activation mimics pathway activation/deactivation function, and comparing with with linear activation in all layers, GSAE with ReLU activation in the hidden layers has much better performance in cancer subtype prediction (data not shown). The decoder part comprises the last two layers; it is a complementary function of the encoder, and it aims to reconstruct the input from the converge layer. If the model is designed as a neural network-based classifier for cancer subtype prediction, the decoder network is replaced by a softmax function that is used as the output layer. At last, we choose the loss function to be either a mean square error function for the reconstruction loss, or a categorical cross-entropy function for multi-class classification error.

Incorporate gene sets into the encoder layer

We designed the first layer in the encoder as the gene set layer, which incorporates the information of a collection of gene sets. Specifically, each node in this layer represents a gene set, where only genes in the input layer that belong to a gene set have connection to the node [13], and the weight for each connection is determined by the backpropagation in training steps. This is different from the fully connected layer commonly used in autoencoder. We adopted the chemical and genetic perturbations (CGP) collection downloaded from the Molecular Signatures Database (MSigDB) [1, 17] and replaced some highly dependent gene sets with a representative gene set among them.

From the output of the gene set layer, we can retrieve the gene set score of each input sample. Following with a dimension reduced dense layer, the superset layer (latent layer), can be used to investigate the combination of gene sets while keeping the crucial features needed to reconstruct the input data by the decoder. The superset layer is the key layer of our model, which we obtain a group of gene sets that correlate with divergence of cancer subtypes. Each node in this layer is considered as a superset, which is a combination of different gene set terms. In this study, we set the superset layer size to 200. With the information of supersets, we can analyze characteristics of the data set, such as the development of subpopulations or clinical relevance of a disease.

Resolve dependencies among gene sets

The CGP collection in MSigDB includes the gene sets that represent expression signatures of genetic and chemical perturbations published in the literature. However, some gene sets are highly similar, and we need to mitigate the dependency. We used a similar method as in our previous study [18] to cluster gene sets with significant similarity. First, we omitted the gene sets that have less than 15 or more than 500 genes, which is also the default setting in original GSEA implementation [1]. We subsequently used kappa statistics to measure the similarity between all gene sets. We clustered gene sets with P-value < 10−7, and assigned the largest gene set as the representative of the cluster. At last, there were 2334 CGP gene sets including 18,107 genes selected to form the gene set layer.

Establish and train the gene superset autoencoder

We implemented the model using Keras 1.2.2 ( and used the custom layer method in Keras to accomplish the sparsity of gene set layer in order to keep the zero weights while optimizing the parameters. Since ReLU is used as the activation function, we selected He uniform initialization as the initializers for all layers [19]. To train the autoencoder, we used the stochastic gradient descent (SGD) optimizer that was revised in Lin et al. study [13], which was designed to deal with the optimization problem for sparse layers. The SGD parameters were set as following, learning rate = 0.05, decay = 10− 6, momentum = 0.9, and Nesterov = 1.

While training the model for a data set, we extracted 5% of data to be the validation set to avoid overfitting. With the callbacks.EarlyStopping() function in Keras, the model stops training when the loss of validation split does not improve in three consecutive epochs. At last, we imported the data set into the trained model and exported the outputs and weights of the encoder layers for further analyses in R.

The use of additional machine learning tools

In this study, we have applied t-Distributed Stochastic Neighbor Embedding (t-SNE, [20, 21], which has been widely used for dimensionality reduction. We performed t-SNE on superset results and embedded the high-dimensional data into a two-dimensional space, where potential subpopulations of the data were revealed. Another machine learning method, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN, [22, 23], was used in the tumor subtype analysis. Comparing with many other clustering algorithms, HDBSCAN has good performance and stability in exploratory data analysis. We performed HDBSCAN on the t-SNE results to determine the possible clusters among the data. Ambiguous samples were classified as noise and omitted from further analysis.

Evaluation of the clustering performance of t-SNE results

To compare the clustering performance of a t-SNE result, three index methods were used in this study: 1) Dunn index (\( \frac{\min_{all\ \left(i,j,i\ne j\right)}{d}_B\left({C}_i,{C}_j\right)}{{\mathit{\max}}_k{d}_W\left({C}_k\right)}\Big) \), where numerator is the minimal between-cluster distance, and denominator is the largest within-cluster distance) (clv v0.3–2.1 in R); 2) Silouette index (the mean of the mean silhouettes through all the clusters) (clValid 0.6–6 in R); and 3) inter-intra distance (IID) index. Slightly different from Dunn Index, the IID index takes the ratio of mean over between-cluster distances to the mean over within-cluster distances. We also define dB(Ci, Cj) as the center-to-center distance, where cluster center is defined as the median of all samples within a cluster, and dW(Ck) is defined as the distance of all samples within cluster Ck to the center of Ck, or

$$ IID\ Index=\frac{1/{n}_B{\sum}_{all\ i,j;i\ne j}{d}_B\left({C}_i,{C}_j\right)}{1/{n}_W{\sum}_k{d}_W\left({C}_k\right)} $$

where nB and nW are the number of between-cluster pairs and the number of clusters, respectively.

Differential superset analysis between tumor subtypes

After performing t-SNE on the superset layer outputs, we subsequently determined the subtypes of a data set by using HDBSCAN. To find the supersets with a subtype pattern, we compared superset values between one tumor subtype (group 1) and the other subtypes (group 2) by one-tailed Mann-Whitney-Wilcoxon U test (MWW) with a location shift of “mu” (mu was assigned to change the stringency of the test). Significant supersets (MWW P-value < 0.01) that have larger values in group 1 were named as up-supersets, whereas down-supersets were the significant supersets with larger than in group 2. We further investigated gene sets in the significant supersets. To quantify the contribution of ith gene set in jth superset, gsScore was calculated as following,

$$ {gsScore}_{ij}=\left({\mu}_1^{(i)}-{\mu}_2^{(i)}\right)\times {w}_{ij} $$

where μ1 and μ2 are the average of the ith gene set values in the two groups, and wij is the weight in the model corresponding to the connection from the ith gene set to the jth superset. In up-supersets, gene sets with gsScore greater than a positive cutoff (in the right tail) were selected. In the opposite, gene sets in the down-supersets with gsScore less than a negative cutoff (in the left tail) were selected. Those gene sets are the potential high impact gene sets of the subtype (group 1).

Kaplan-Meier survival analysis on superset layer

We examined whether GSAE retains survival related features. For each superset and gene set, we used a median split (median of the superset or gene set value) to create two groups and performed log-rank test. For each prognostic significant superset, we ranked gene sets according to the gsScore (Eq. 4) and further investigated the survival relevance of top gene sets.


Cancer type information preserved in low dimension outcome

To test the capability of GSAE to retain crucial features in the superset layer, we used TCGA PanCan RNA-seq logTPM data, 15,975 genes selected with μ > 1 and σ > 0.5 across 9806 samples in 33 cancer types, as GSAE inputs and exported the superset layer results. We performed t-SNE on TCGA logTPM data and the superset layer outputs (200 nodes), and the results are shown in Fig. 2, in which the color of each node was labelled according to the cancer type information. The groupings of cancer types in the two t-SNE plots are nearly identical, where most cancer types form an individual cluster. The mingling of few cancer types are also similar in both figures. We used three index methods, Dunn index, Silouette index, and IID index, to evaluate the resemblance of the two t-SNE results in Fig. 2. Overall, with the input dimension reducing by more than 98%, it leads to 23.48% loss in the clustering performance between the two t-SNE results with Dunn index (Table 1). However, we got comparable clustering performance while using the other two index methods (− 0.85% in Silouette index and − 2.54% in IID index, respectively, Table 1). We concluded that the model is able to retain cancer type-associated features of a data while reducing dimensionality.

Fig. 2
figure 2

The t-SNE results of TCGA 9806 samples using (a) logTPM data with 15,975 genes (an initial PCA step was performed), and (b) 200 superset outputs

Table 1 Evaluation of the clustering performance of the two t-SNE results in Fig. 2. As a reference, the compression rate from 15,975 features down to 200 supersets is about 98.7%

Indication of gene sets associated with breast cancer subtypes

In Fig. 2, we learned that the samples labeled in red are separated into two clusters, and we further verified that they belonged to BRCA. We used GSAE to analyze the BRCA data separately to discover gene sets that are support this subtype differentiation. There were 15,183 genes in 1099 samples that meet the criterion of μ > 1 and σ > 0.5, where they were used as the model input. After training of the model, we exported the superset results and performed t-SNE, which is shown in Fig. 3a. We applied HDBSCAN, which clustered the samples into two groups, where group 1 (G1) is labeled in red and group 2 (G2) in green. The noisy samples defined by the algorithm were omitted. Four up-supersets and three down-supersets were determined (P-value < 0.01) using one-tailed Mann Whitney U test with location shift mu = 9, where only supersets with a huge difference between the two groups could pass the test. In each significant superset, those with gsScore > 2 sd (standard deviation of all gsScores in the superset) are the high impact gene sets of the superset. All high impacts gene sets of 7 significant supersets are listed in Additional File 1: Table S1, and the PScore (−log10(P-value)) of Mann Whitney U test (location shift set as 0.5) of each gene set was also included.

Fig. 3
figure 3

Subtype analysis in BRCA data set. (a) The t-SNE results of BRCA data, where HDBSCAN classified the samples into two groups. The noisy samples were labeled in black and omitted from further analysis. (b) The density plots of the most significant up-superset and three selected top gene sets. The blue/yellow arrow corresponds to positive/negative weight in the model between the gene set and superset. (c) The density plots of the most significant down-superset and three selected top gene sets. (d) The Venn diagram of the significant gene sets in the top 3 up-supersets

Top 15 gene sets in the most significant up-superset and down-superset are listed in Tables 2 and 3. The two superset density plots of gene set values (output of gene set nodes) in Fig. 3b and c show the vast difference between the two groups in those significant supersets. We also identified many high impact gene sets associated with breast cancer subtypes. For example, “FARMER_BREAST_CANCER_BASAL_VS_LULMINAL” clearly indicates the two groups are possible Basal and Luminal subtypes [24]. The study of “STEIN_ESRRA_TARGETS_RESPONSIVE_TO_ESTROGEN_UP” gene set also suggested that ESRRα might be a therapeutic target for triple negative breast cancer [25]. Group 1 has a higher value in “DOANE_BREAST_CANCER_ESR1_DN”, which matches the gene set condition where these genes were down-regulated in ESR1 positive samples [26]. Genes that are involved in “PEDERSEN_METASTASIS_BY_ERBB2_ISOFORM_7” differentiates the HER2 positive and negative BRCA subtype [27]. A study has shown that c-Myb differed significantly across the subtypes, where Basal-like has the lowest expression [28], which fits the result of “LEI_MYB_TARGETS”. There is no direct connection of Estradiol with breast cancer subtype, but it is an estrogen and its target gene set “FRASOR_RESPONSE_TO_ESTRADIOL_DN” might be a potential subtype marker.

Table 2 Top 15 gene sets in up-superset #1 in BRCA subtype analysis
Table 3 Top 15 gene sets in down-superset #1 in BRCA subtype analysis

After reviewing these gene sets, because the Basal subtype accounts for 15% of the breast cancer population, we hypothesized that G1, the small subpopulation in red in Fig. 3a, is the Basal subtype of breast cancer. We checked the TCGA clinical information and PAM50 classification results and verified that 156 of 175samples (with complete estrogen status or PAM50 subtype information) in G1 are either Basal-like or triple negative samples. This result demonstrates that our proposed superset autoencoder is able to reveal the subpopulation features and biological relevance.

We compared with GSEA results between G1 and G2, and 53 out of 124 (42.7%) high impact gene sets are also GSEA enriched gene sets (nom P-value < 0.05), which indicates the reliability of our results. To examine whether a superset contains some exclusive gene sets, we compared the top 3 up-supersets (Additional File 1: Table S1A-C) and the Venn diagram is shown in Fig. 3d. Many of the overlapped gene sets are associated with the Basal subtype (12 common gene sets in Additional File 1: Table S1, bold font). Up-superset 1 has additional estrogen related gene sets (Additional File 1: Table S1A, colored in blue); up-superset 2 holds some gene sets that are relevant to ERBB2 (Additional File 1: Table S1B, colored in blue).

Prediction of breast cancer PAM50 subtypes with superset classifier

To test if our model can be used as a classifier to predict cancer subtypes, we reconstructed our model to the architecture shown in Additional File 2: Fig. S1A, where the decoder network is replaced by a softmax function output (input – encoder – prediction output). With the clinical annotation organized by the UCSC Cancer Genomics Browser [29, 30] (captured in 2015,, we selected 821 BRCA samples with 15,183 genes in Basal, LumA, LumB, and Her2 PAM50 subtypes as input data to test the performance of the superset classifier (normal-like subtype was removed due to small sample size). Using 10-fold cross-validation to test the superset classifier, we achieved a good performance of 88.79% prediction accuracy.

With the same input, we also compared with four different neural network models, 1) gene set classifier, in which the superset layer is removed (Additional File 2: Fig. S1B), 2) 2-layer fully connected encoder network with the same size of the superset classifier (Additional File 2: Fig. S1C), 3) 2-layer fully connected encoder network, where the size of each layer was optimized by Hyperas [31] (Additional File 2: Fig. S1C), and 4) 4-layer fully connected encoder network, where the size of each layer was optimized by Hyperas (Additional File 2: Fig. S1D). The size and 10-fold cross-validation accuracy of each classifier are listed in Table 4. We have tuned the SGD parameter setting of each model in order to get the best performance.

Table 4 The size of encoder layers and the 10-fold cross-validation accuracy of each neural network classifier

The prediction accuracy of gene set classifier (87.69%) is close to that of the superset classifier, which implies the gene set layer contains sufficient information for classification. On the other hand, all three classifiers with fully connected encoder have low prediction accuracy (< 50%, Additional File 2: Fig. S1C and D), mainly due to the large number of weights need to be trained to attain (or fail to attain) an optimal model. To alleviate the training burden, we reduced the input number by performing principal component analysis (PCA) on BRCA data first and selected top 500 principal components (PCs) to test the models with fully connected encoder (Additional File 2: Fig. S1C and D, layer size was also optimized by Hyperas, Table 4). The prediction results (87.57%) are equivalent to the superset classifier, indicating that the gene set layer and top PCs both preserve important subtype features. While both PC classifier and gene set classifier achieved the same accuracy, we can design our network to emphasize certain features (e.g. PAM50 subtype classification), based on the fact that we understand the biological functions of a priori defined gene set, and the flexibility of choosing different functional sets (signaling pathways, immunological signatures, etc).

We further tested the mean sensitivities and specificities of superset classifier by ten (10) iterations of 10-fold cross-validations (Table 5). We have near perfect specificity in all four BRCA subtype, especially in Basal (1.000) and HER2 (0.977). In addition, these two subtypes both have high sensitivity (Basal: 0.957 and HER2: 0.924). Prediction errors mostly occurred between Luminal A and Luminal B subtypes with relatively low sensitivities (0.862 and 0.835, respectively) while maintaining consistent specificities (0.935 and 0.907, respectively). This is expected due to the ambiguous cutoffs to define Luminal A & B at gene expression levels [32]. Overall, the superset classifier provides high sensitivity and specificity in BRCA subtype prediction.

Table 5 The mean sensitivities and specificities of superset classifier by ten iterations of 10-fold cross-validations

Prognostic significance for lung adenocarcinoma

TCGA LUAD data set was employed to test if the model is capable of retaining survival related features in the superset layer. With the same gene selection criterion, 15,188 genes in 515 samples were used as the model input. We also organized the TCGA LUAD survival information to a 5-year survival record, where the maximum survival time was set as 1825 days, and a death event that occurred after five years was censored at 5 years. After performing log-rank test on the superset results, we determined 6 supersets with log-rank P-value < 0.001, which were considered as prognostic significant nodes. We ranked the gene sets in those six supersets by the gsScore, and the top 20 gene sets in each superset are listed in Additional File 3: Table S2. The top ranked gene sets that also showed significance in gene set log-rank test were selected to probe the biological relevance of lung adenocarcinoma.

We picked the first and fourth ranked supersets as two examples, and the top 15 gene sets in the two supersets are listed in Tables 6 and 7. We chose the 4th ranked superset due to the least overlap of significant gene sets with the 1st ranked superset. We selected three gene sets tested significant by the log-rank test from the two supersets and plotted the Kaplan-Meier survival curves in Fig. 4. In rank 1 superset, several significant gene sets are related to the survival of LUAD. A study has shown that decreased mRNA expression of TCF21, a tumor suppressor, is a core predictor for poor prognosis in patients with lung cancers in two studies [33,34,35], agree with what we found the prognosis association from TCGA LUAD with gene set “CUI_TCF21_TARGETS_2_UP” (P = 1.30 × 10− 4). “KIM_WT1_TARGETS_DN” (P = 0.0064) is related to the oncogene WT1 in lung cancer, and the high expression of WT1 links to an unfavorable impact on the prognosis [36]. We also found some gene sets that no previous study showed direct connection with the prognosis of LUAD. Previous studies have revealed that ETS-related transcription factors are associated with non-small-cell lung cancers (NSCLC) [37, 38]. ELK3 is also an ETS transcription factor, and the related gene set “GROSS_HYPOXIA_VIA_ELK3_UP” (P = 5.21 × 10− 4) might be relevant to LUAD survival. Two chemical compounds related gene sets were discovered in superset 1, “MARTINEZ_RESPONSE_ TO_TRABECTEDIN_DN” (P = 0.0015) and “CONCANNON_APOPTOSIS_BY_EPOXOMICIN_DN” (P = 0.0264). While both gene lists were derived from studies of other cancer types (e.g., HCT116 colon cancer cell-line), other studies have demonstrated the effectiveness of both Epoxomicin and Trabectedin in lung cancer treatment. Carfilzomib, which is a designed drug based on Epoxomicin, demonstrated anti-proliferative activity and resulted in prolonged survival in mice with SHP-77 small cell lung cancer xenografts [39]. There was only one study testing treatment with trabectedin on NSCLC patients, but no recommendation was given to use trabectedin as single agent treatment [40]. Thus, these two gene sets could be further examined to look for the biological relevance to LUAD.

Table 6 Top 15 gene sets in the highest ranked superset in LUAD survival analysis
Table 7 Top 15 gene sets in 4th ranked superset in LUAD survival analysis
Fig. 4
figure 4

The Kaplan-Meier Curves of (a) 1st ranked superset and selected three top 20 gene sets associated with the superset, (b) 4th ranked superset and selected three top 20 gene sets associated with the superset. The blue/yellow arrow corresponds to positive/negative weight in the model between the gene set and superset

In 4th ranked superset, “IWANAGA_CARCINOGENESIS_BY_KRAS_PTEN_UP” (P = 0.0249) is a widely studied gene set to show the association with LUAD prognosis. The study that concluded this gene set observed the aberration in NSCLC with oncogenic form of KRAS and inactivated PTEN, in which condition resulted in shorter survival [41]. The gene set “ZHANG_BREAST_CANCER_PROGENITORS_UP” (P = 0.0248) shows the effect of progenitor cells in survival. Ooi et al. suggested that the presence of a putative tumor-initiating progenitor cell population in NSCLC is a biomarker with a worse prognosis [42]. MAPK8 related gene set “YOSHIMURA_MAPK8_TARGETS_DN” (P = 0.0195) is also a potential prognostic associated gene set, while only one study implied indirect impact of poor prognosis due to MAPK8 repression [43].

From the two selected supersets, we already found some gene sets highly associated with LUAD survival, there are some novel prognostic gene set candidates that need to be further studied. In conclusion, superset results encompass survival-associated features and sort out the priority of potential prognostic gene sets.

Improved survival reproducibility from supersets

To compare the reproducibility of survival results between the superset and gene set layers, we selected four TCGA data sets (BRCA, LUAD, SKCM, and LGG) to examine the reproducibility of GSAE. For each data set, we omitted genes that did not meet the criterion of μ > 1 and σ > 0.5. We next randomly split 60% of the data as the training set and the remaining 40% as the test set. After the autoencoder was trained on the training set, we obtained the superset outputs for the training and test sets. Median split and log-rank test were performed on training and test superset results to determine survival-related supersets and gene sets.

We assumed that the prognostic significant gene sets and supersets should be similar between training and test data. To evaluate the performance of gene set and superset results, we compared the significant gene sets and supersets obtained from training data and those from test data by Jaccard index. Furthermore, we used two population proportions z-test to examine whether supersets have greater overlap proportion in the training data, and the results are shown in Table 8.

Table 8 The statistical information of GSAE outputs between the training and test TCGA data sets of four cancer types

In the largest data set BRCA, we found out that superset has much higher Jaccard index (34%) than gene set (12%), and the two overlap proportions differ significantly (P = 2 × 10− 4). In two other smaller data sets, LUAD and SKCM, superset (Jaccard Index ~18%) still outperforms gene set (11 and 7% for LUAD and SKCM, respectively; z-test P-value < 0.05). In LGG, because of the large number of prognostic significant nodes for superset and gene sets, both Jaccard coefficients are high (~48% for both superset and gene set) and the performance of gene set and superset is identical. To avoid the potential of sampling bias, we repeated the whole process in BRCA and LUAD several times, and we obtained similar stability measure (z-test P-value, data not shown). Overall, we concluded that superset has better reproducibility performance over gene set.


Same as other machine learning algorithms, the selective process of GSAE is an issue. Despite getting identical losses, different nodes (or gene sets) in different training may selective activated or de-activated (output value ~0) with the same training data. Take our study for example, we might obtain the same outcome (e.g. tumor subtype classification) of a dataset in the superset layer, but it is difficult to match superset between runs, and the top ranked gene set components in significant supersets might also be different, although highly relevant gene sets appear more frequent. This observation can be used to assess the significance of a given gene set or superset to a specific aim (e.g. survival association).

We also tried to understand the major cause of selective process in our model, and two possible factors were concluded – the dependency among gene sets in the CGP collection and the initialization of the model weights. Even though we tried to mitigate the dependency effect, many gene sets still share a subset of genes. In the model, we observed that some gene sets with zero values (deactivated) are highly overlapped with top ranked gene sets (activated). We assume that the information (member genes) of a zero-value gene set can be replaced by a combination of other gene sets. In addition, all weights in the GSAE model are randomly initialized. Due to the randomly initialized weights and dependency among gene sets, the model can reconstruct the input data through different gene sets, which results in the selective process of activated or deactivated gene sets. A better choice for independent or less overlapping gene sets could be Gene Ontology (GO) slims, a cut-down version of the whole GO. We might also alleviate this selective issue by assigning saved initial weights from a previous run or pre-trained weights of other data.

Another limitation of our model is the requirement of large sample size, which is a constraint for usual bulk RNA-seq experiments. However, the characteristic of single-cell RNA-seq (scRNA-seq) experiments, lower read depth with much larger sample size (near half-million scRNA-seq in some studies [44, 45]), perhaps is more suitable to be analyzed by deep learning methods. There are many advantages to examine scRNA-seq data using GSAE. First, scRNA-seq analysis with our model will not be restricted by statistical assumptions, where we can avoid dealing with the diverse statistical characteristics of single-cell data [46]. Second, we can directly determine the exclusive gene sets or GO functions of each identified subpopulation, without the need to find the representative genes of a subpopulation. With the support of other additional tools, analyzing scRNA-seq data with our model will be more thorough. For example, using only highly variable genes detected by scVEGs [47] will increase the diversity of subpopulations while lowering the variable dimension of the model. The HDBSCAN algorithm can cluster the multiple subpopulations of single-cell data precisely after t-SNE is applied. Overall, there is a huge potential using deep learning methods for scRNA-seq data analysis.

The concept of gene supersets not only provides better reproducibility, it also gives us a chance to understand the inter-dependency of gene sets. In this study we investigated the associations between significant supersets and gene sets. However, relations between those top ranked gene sets in the same superset has yet to be discussed. One possible solution is to find the corresponding input genes that have large contribution to a significant superset (by interpreting the weights in the first layer as the gene weights in each gene set), where we can further form a set of genes based on the superset. All these alternative approaches will guide our future study to bolster the biological functions of supersets.


In this paper, we proposed a multi-layer autoencoder model with the incorporation of annotated gene set information. The model is capable of preserving crucial biological features of gene expression data in the dimension reduced superset layer. From the superset results, we have found out information such as tumor subtype differentiation and clinical prognostic significance. With the concept of superset, an unbiased combination of gene sets, we can improve the reproducibility of survival analysis, provide robust prediction of cancer subtypes, and indicate potential gene sets association of a disease. GSAE has the versatility to incorporate different gene set collection, discover different biological relevance, and analyze different kinds of gene expression data.



Breast invasive carcinoma


Chemical and genetic perturbations


Convolutional neural network


The Database for Annotation, Visualization and Integrated Discovery


Gene Onotology


Gene Superset Autoencoder


Gene Set Enrichment Analysis


Hierarchical Density-Based Spatial Clustering of Applications with Noise

IID index:

Inter-Intra Distance index


Lower grade glioma


Lung adenocarcinoma


Molecular Signatures Database


Mann-Whitney-Wilcoxon U test


Non-small-cell lung cancers




Principal Component Analysis


Protein–DNA Interactions


Protein–Protein Interactions


Rectified linear unit


single-cell RNA-seq


Stochastic Gradient Descent


Skin cutaneous melanoma


The Cancer Genome Atlas


Transcripts Per Million


t-Distributed Stochastic Neighbor Embedding


  1. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545–50.

    Article  CAS  Google Scholar 

  2. Huang d W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1–13.

    Article  Google Scholar 

  3. Huang d W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44–57.

    Article  CAS  Google Scholar 

  4. Berriz GF, King OD, Bryant B, Sander C, Roth FP. Characterizing gene sets with FuncAssociate. Bioinformatics. 2003;19(18):2502–4.

    Article  CAS  Google Scholar 

  5. Chen J, Bardes EE, Aronow BJ, Jegga AG. ToppGene suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 2009;37(Web Server):W305–11.

    Article  CAS  Google Scholar 

  6. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, Koplev S, Jenkins SL, Jagodnik KM, Lachmann A, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44(W1):W90–7.

    Article  CAS  Google Scholar 

  7. Cancer Genome Atlas N. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70.

    Article  Google Scholar 

  8. Chang YH, Chen CM, Chen HY, Yang PC. Pathway-based gene signatures predicting clinical outcome of lung adenocarcinoma. Sci Rep. 2015;5:10979.

    Article  CAS  Google Scholar 

  9. Tang H, Xiao G, Behrens C, Schiller J, Allen J, Chow CW, Suraokar M, Corvalan A, Mao J, White MA, et al. A 12-gene set predicts survival benefits from adjuvant chemotherapy in non-small cell lung cancer patients. Clin Cancer Res. 2013;19(6):1577–86.

    Article  CAS  Google Scholar 

  10. Lau SK, Boutros PC, Pintilie M, Blackhall FH, Zhu CQ, Strumpf D, Johnston MR, Darling G, Keshavjee S, Waddell TK, et al. Three-gene prognostic classifier for early-stage non small-cell lung cancer. J Clin Oncol. 2007;25(35):5562–9.

    Article  Google Scholar 

  11. Zeng H, Edwards MD, Liu G, Gifford DK. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics. 2016;32(12):i121–7.

    Article  CAS  Google Scholar 

  12. Arvaniti E, Claassen M. Sensitive detection of rare disease-associated cell subsets via representation learning. Nat Commun. 2017;8:14825.

    Article  CAS  Google Scholar 

  13. Lin C, Jain S, Kim H, Bar-Joseph Z. Using neural networks for reducing the dimensions of single-cell RNA-Seq data. Nucleic Acids Res. 2017;45(17):e156.

    Article  Google Scholar 

  14. Tan J, Hammond JH, Hogan DA, Greene CS. ADAGE-based integration of publicly available Pseudomonas aeruginosa gene expression data with Denoising autoencoders illuminates microbe-host interactions. mSystems. 2016;1(1):e00025–00015.

  15. Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics. 2016;32(12):1832–9.

    Article  CAS  Google Scholar 

  16. Newton Y, Novak AM, Swatloski T, McColl DC, Chopra S, Graim K, Weinstein AS, Baertsch R, Salama SR, Ellrott K, et al. TumorMap: exploring the molecular similarities of Cancer samples in an interactive portal. Cancer Res. 2017;77(21):e111–4.

    Article  CAS  Google Scholar 

  17. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–40.

    Article  CAS  Google Scholar 

  18. Hsiao TH, Chiu YC, Hsu PY, Lu TP, Lai LC, Tsai MH, Huang TH, Chuang EY, Chen Y. Differential network analysis reveals the genome-wide landscape of estrogen receptor modulation in hormonal cancers. Sci Rep. 2016;6:23035.

    Article  CAS  Google Scholar 

  19. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ArXiv e-prints. vol. 1502; 2015.

    Google Scholar 

  20. van der Maaten LJP. Accelerating t-SNE using tree-based algorithms. J Mach Learn Res. 2014;15:3221–45.

    Google Scholar 

  21. van der Maaten LJP, Hinton GE. Visualizing high-dimensional data using t-SNE. J Mach Learn Res. 2008;9:2579–605.

    Google Scholar 

  22. Campello RJGB, Moulavi D, Sander J: Density-Based Clustering Based on Hierarchical Density Estimates. In: 2013; Berlin, Heidelberg: Springer Berlin Heidelberg; 2013: 160–172.

  23. McInnes L, Healy J, Astels S. hdbscan: hierarchical density based clustering. The Journal of Open Source Software. 2017;2.

  24. Farmer P, Bonnefoi H, Becette V, Tubiana-Hulin M, Fumoleau P, Larsimont D, Macgrogan G, Bergh J, Cameron D, Goldstein D, et al. Identification of molecular apocrine breast tumours by microarray analysis. Oncogene. 2005;24(29):4660–71.

    Article  CAS  Google Scholar 

  25. Stein RA, Chang CY, Kazmin DA, Way J, Schroeder T, Wergin M, Dewhirst MW, McDonnell DP. Estrogen-related receptor alpha is critical for the growth of estrogen receptor-negative breast cancer. Cancer Res. 2008;68(21):8805–12.

    Article  CAS  Google Scholar 

  26. Doane AS, Danso M, Lal P, Donaton M, Zhang L, Hudis C, Gerald WL. An estrogen receptor-negative breast cancer subset characterized by a hormonally regulated transcriptional program and response to androgen. Oncogene. 2006;25(28):3994–4008.

    Article  CAS  Google Scholar 

  27. Pedersen K, Angelini PD, Laos S, Bach-Faig A, Cunningham MP, Ferrer-Ramon C, Luque-Garcia A, Garcia-Castillo J, Parra-Palau JL, Scaltriti M, et al. A naturally occurring HER2 carboxy-terminal fragment promotes mammary tumor growth and metastasis. Mol Cell Biol. 2009;29(12):3319–31.

    Article  CAS  Google Scholar 

  28. Thorner AR, Parker JS, Hoadley KA, Perou CM. Potential tumor suppressor role for the c-Myb oncogene in luminal breast cancer. PLoS One. 2010;5(10):e13073.

    Article  Google Scholar 

  29. Goldman M, Craft B, Swatloski T, Cline M, Morozova O, Diekhans M, Haussler D, Zhu J. The UCSC Cancer genomics browser: update 2015. Nucleic Acids Res. 2015;43(Database issue):D812–7.

    Article  CAS  Google Scholar 

  30. Zhu J, Sanborn JZ, Benz S, Szeto C, Hsu F, Kuhn RM, Karolchik D, Archie J, Lenburg ME, Esserman LJ, et al. The UCSC Cancer genomics browser. Nat Methods. 2009;6(4):239–40.

    Article  CAS  Google Scholar 

  31. Pumperla M. Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization; 2016.

    Google Scholar 

  32. Yersal O, Barutca S. Biological subtypes of breast cancer: prognostic and therapeutic implications. World J Clin Oncol. 2014;5(3):412–24.

    Article  Google Scholar 

  33. Landi MT, Dracheva T, Rotunno M, Figueroa JD, Liu H, Dasgupta A, Mann FE, Fukuoka J, Hames M, Bergen AW, et al. Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS One. 2008;3(2):e1651.

    Article  Google Scholar 

  34. Xiao J, Liu A, Lu X, Chen X, Li W, He S, He B, Chen Q. Prognostic significance of TCF21 mRNA expression in patients with lung adenocarcinoma. Sci Rep. 2017;7(1):2027.

    Article  Google Scholar 

  35. Zhang Y, Foreman O, Wigle DA, Kosari F, Vasmatzis G, Salisbury JL, van Deursen J, Galardy PJ. USP44 regulates centrosome positioning to prevent aneuploidy and suppress tumorigenesis. J Clin Invest. 2012;122(12):4362–74.

    Article  CAS  Google Scholar 

  36. Qi XW, Zhang F, Wu H, Liu JL, Zong BG, Xu C, Jiang J. Wilms' tumor 1 (WT1) expression and prognosis in solid cancer patients: a systematic review and meta-analysis. Sci Rep. 2015;5:8924.

    Article  CAS  Google Scholar 

  37. Hiroumi H, Dosaka-Akita H, Yoshida K, Shindoh M, Ohbuchi T, Fujinaga K, Nishimura M. Expression of E1AF/PEA3, an Ets-related transcription factor in human non-small-cell lung cancers: its relevance in cell motility and invasion. Int J Cancer. 2001;93(6):786–91.

    Article  CAS  Google Scholar 

  38. Yamaguchi E, Nakayama T, Nanashima A, Matsumoto K, Yasutake T, Sekine I, Nagayasu T. Ets-1 proto-oncogene as a potential predictor for poor prognosis of lung adenocarcinoma. Tohoku J Exp Med. 2007;213(1):41–50.

    Article  CAS  Google Scholar 

  39. Baker AF, Hanke NT, Sands BJ, Carbajal L, Anderl JL, Garland LL. Carfilzomib demonstrates broad anti-tumor activity in pre-clinical non-small cell and small cell lung cancer models. J Exp Clin Cancer Res. 2014;33:111.

    Article  Google Scholar 

  40. Massuti B, Cobo M, Camps C, Domine M, Provencio M, Alberola V, Vinolas N, Rosell R, Taron M, Gutierrez-Calderon V, et al. Trabectedin in patients with advanced non-small-cell lung cancer (NSCLC) with XPG and/or ERCC1 overexpression and BRCA1 underexpression and pretreated with platinum. Lung Cancer. 2012;76(3):354–61.

    Article  Google Scholar 

  41. Iwanaga K, Yang Y, Raso MG, Ma L, Hanna AE, Thilaganathan N, Moghaddam S, Evans CM, Li H, Cai WW, et al. Pten inactivation accelerates oncogenic K-ras-initiated tumorigenesis in a mouse model of lung cancer. Cancer Res. 2008;68(4):1119–27.

    Article  CAS  Google Scholar 

  42. Ooi AT, Mah V, Nickerson DW, Gilbert JL, Ha VL, Hegab AE, Horvath S, Alavi M, Maresh EL, Chia D, et al. Presence of a putative tumor-initiating progenitor cell population predicts poor prognosis in smokers with non-small cell lung cancer. Cancer Res. 2010;70(16):6639–48.

    Article  CAS  Google Scholar 

  43. Zhang W, Sun J, Luo J. High expression of Rab-like 3 (Rabl3) is associated with poor survival of patients with non-small cell lung Cancer via repression of MAPK8/9/10-mediated autophagy. Med Sci Monit. 2016;22:1582–8.

    Article  CAS  Google Scholar 

  44. Han X, Wang R, Zhou Y, Fei L, Sun H, Lai S, Saadatpour A, Zhou Z, Chen H, Ye F, et al. Mapping the mouse cell atlas by microwell-Seq. Cell. 2018;172(5):1091–107 e1017.

    Article  CAS  Google Scholar 

  45. Rozenblatt-Rosen O, Stubbington MJT, Regev A, Teichmann SA. The human cell atlas: from vision to reality. Nature. 2017;550(7677):451–3.

    Article  CAS  Google Scholar 

  46. Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods. 2017;14(4):414–6.

    Article  CAS  Google Scholar 

  47. Chen HI, Jin Y, Huang Y, Chen Y. Detection of high variability in gene expression from single-cell RNA-seq profiling. BMC Genomics. 2016;17(Suppl 7):508.

    Article  Google Scholar 

Download references


The authors greatly appreciate the brilliant and constructive inputs from reviewers and participants of the International Conference on Intelligent Biology and Medicine (ICIBM 2018).


Funding for this research was provided partially by the NCI Cancer Center Shared Resources (NIH-NCI P30CA54174 to YC), NIH (CTSA 1UL1RR025767–01 to YC, R01GM113245 to YH), CPRIT (RP160732 to YC), San Antonio Life Sciences Institute Innovation Challenge Award 2016 to YC and YH, and San Antonio Life Sciences Institute Postdoctoral Fellowship to YCC. The publication costs for this article were funded by the aforementioned CPRIT grants to YC.

Availability of data and materials

Data will be shared upon request.

About this supplement

This article has been published as part of BMC Systems Biology Volume 12 Supplement 8, 2018: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2018: systems biology. The full contents of the supplement are available online at

Author information

Authors and Affiliations



All authors contribute to the manuscript. HHC, YCC, TZ, SZ, YH and YC conceived and designed the study. HHC carried out the Python and R scripts. HCC and YCC investigated the biological relevance of the results. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yufei Huang or Yidong Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

Authors declare no competing interest in preparing the paper and developing the software associated to this paper.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1:

Table S1. Top high impact gene sets of the four up-supersets and three down-supersets in the BRCA tumor subtype analysis. The highly overlapped collections of a gene set were determined by MSigDB (in the compute overlaps section). (XLSX 27 kb)

Additional file 2:

Figure S1. The architectures of four neural network classifiers. (A) superset classifier, (B) gene set classifier, (C) 2-layer fully connected encoder network classifier, and (D) 4-layer fully connected encoder classifier. (PDF 328 kb)

Additional file 3:

Table S2. Top 20 gene sets of the six prognostic significant supersets in LUAD survival analysis. (XLSX 15 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Chen, HI.H., Chiu, YC., Zhang, T. et al. GSAE: an autoencoder with embedded gene-set nodes for genomics functional characterization. BMC Syst Biol 12 (Suppl 8), 142 (2018).

Download citation

  • Published:

  • DOI: