Calculating Expression Variance of human genes
In order to calculate global patterns of expression variance (EV) of human genes, we used the extensive collection of human cancer tissue microarrays of the Expression Project for Oncology (ExpO) of the International Genomics Consortium (http://www.intgen.org/expo), which contains expression microarray profiles of 2158 tumor tissue samples. This dataset contains samples dissected from diverse tissues with various types of cancer with different characteristics and treatments, and therefore spans a wide spectrum of cellular environments. We calculated EVs for each human gene by taking statistical variance of normalized expression levels of each gene across the whole dataset. Normalization of expression levels of each gene was done by first normalizing the samples in the dataset by quantile normalization [16], so that each sample in the dataset has an identical distribution, and then dividing the expression level of the gene in each sample by the median of its expression level across all samples, which resulted in a measure of deviation of the expression levels of each gene from their median. Representative plots of expression profiles of some low and high EV genes are shown in Figure 1. Varying levels of expression variation is evident between high and low EV genes. Expression levels of genes with low EV seem to be relatively stable regardless of extracellular conditions, while some genes seem to have extremely variable expression pattern across many different conditions. This suggests that some genes may be preferentially regulated during cellular adaptation to its environment, while some genes are generally not regulated at a transcriptional level.
It is possible that the EV values simply reflect basal tissue specific expression variations of genes and not the variability of their expression under different cellular conditions. In order to test this, we calculated tissue-specific EVs of genes using only samples in the ExpO dataset collected from breast, lung or colon, thereby obtaining EVs of genes for each individual tissue type. If the EVs reflect tissue-specific expression variations of genes, there should not be high correlation between tissue-specific EVs. However, there is a high correlation between breast and lung tissue-specific EVs (Additional File 1). Similarly high correlation was also observed between ovary and lung tissue EVs (Additional file 2, which indicates that EV mostly reflects variability of genes between different cellular and extracellular conditions rather than tissue-specific expression patterns of genes. We also tested correlation of EV values between different probes of the same gene, and find similar high correlation (Spearman's ρ = 0.45, n = 10,263, P <<10-16), indicating that the EV values identified here represent gene-specific variations of mRNA levels.
EV reflects gene regulation under varying extracellular conditions
To further confirm that our EV values reflect cellular response to varying extracellular conditions rather than being an artifact of tissue samples, we compiled an independent collection of microarray gene expression datasets from 14 different studies measuring responses of cultured human cells to various receptor ligands (EGF, heregulin, TGF-beta, TNF-alpha, interleukin 1, FGF2, arachidonic acid, thrombin, leukotriene, estradiol and sphingosine) (CK dataset, see Methods). We normalized each microarray sample in the dataset by their corresponding controls (i.e. no treatment conditions) and discarded the control samples, so that each sample in the CK dataset reflects fold-changes in response to the corresponding stimulus. Therefore, this dataset contains measurements of gene expression change in various human cell lines under 149 different stimulation conditions. The expression variance of genes calculated using the CK dataset is in a significantly high agreement with the EV values calculated using the ExpO dataset of 2158 tissue samples (Spearman's ρ = 0.69, see Additional file 3). Since EVCK values reflect fold changes of gene expression upon a large number of different stimulations, our observation indicates that EVexpo values reflect true expression variations of genes within cells under different extracellular environments, rather than an artifact of tissue- or cell type-specific expressional variations.
EV is not an artifact of mRNA abundance
Total mRNA expression levels of genes are extremely variable (spanning almost 4 logs), and this can substantially contribute to the variability of genes between different conditions. Indeed, EV of genes has a significant negative correlation with their average expression levels across the whole ExpO dataset (Spearman's ρ = -0.59 for EVexpo and -0.53 for EVCK), so that genes with low mRNA abundance are more likely to have variable expression. Therefore, it is possible that our observations above and below simply reflect the correlations of total expression levels of gene mRNAs rather than their variability. In order to test this, we calculated partial correlation between EVexpo and EVCK having controlled for average mRNA expression levels of genes and find that the correlation strength between these EV values calculated from different datasets is still significantly strong (partial Spearman's ρ = 0.58), indicating that the observed EV is not an artifact of mRNA abundance. In order to confirm this observation, we selected genes with similar average mRNA levels (300 < average expression < 350, n = 831), and tested if the correlation between EVexpo and EVCK is still high. Indeed, although the correlation of total mRNA levels with either EVCK or EVexpo is lost (Spearman's ρ values of 0.02 and -0.003, respectively), the correlation between EVCK and EVexpo is still significantly high (Spearman's ρ = 0.50, Additional file 4), which strongly suggests that expression variability is an intrinsic characteristic of genes rather than an artifact of their total mRNA abundance. Importantly, the correlations we present below are also reproducible having controlled for the total mRNA levels of genes (see below and Additional file 5).
EV reflects RNA Polymerase II promoter occupancy
Next, we asked if EVs of genes correlates with the pattern of their promoter activities. Kim et al (2005) conducted a comprehensive study mapping active promoters across the human genome and identified 4 classes of genes based on their expression and RNA polymerase II pre-initiation complex (PIC) promoter occupancy [8]. The first class (Class I) of genes had PIC occupancy and increased histone acetylation in their promoters and were actively transcribed. The second class (Class II) had PIC occupancy, however were not actively transcribed. The third class of genes were actively transcribed although no PIC could be detected, while the fourth class (Class IV) had no PIC occupancy or detectable transcript levels and had reduced histone acetylation at their promoters [8]. We find that there is a high concordance between EV values and these gene classes (Figure 2A), which was also reproduced with EVs of the CK dataset (Additional file 6). Low-EV genes mostly belong to classes I and III, reflecting constitutive and high promoter activity, whereas high-EV genes mostly belong to classes II and IV, which are mostly expressed in a condition-specific manner. This is concordant with the invariant constitutive expression pattern of low EV genes and highly variant condition-specific expression pattern of high EV genes. These observations suggest that low EV genes are generally highly active and abundantly transcribed, while high EV genes are transcribed in a condition-specific manner.
Functional distinction of genes based on EV
We have previously shown organization of genes into separate modules based on their expression variation in yeast [10, 12]. We wanted to determine if expression variation in human genes has a functional significance similar to that observed in yeast. In order to answer this question, we constructed a comprehensive network of human genes based on their functional similarity, where each interaction is between two genes sharing a significant functional annotation from either Gene Ontology [17] or KEGG [18] (the Fun-Net, see Methods). Then, we tested whether subnetworks of genes with specific functional associations segregated based on their EV. In order to gain a comprehensive view of gene-gene association preferences in the Fun-Net based on their EV, we binned genes into 50 bins based on their EV and calculated interaction preferences between each bin pair in the Fun-Net. As expected, the heatmap of interaction preferences shows a clear clustering of low and high EV genes into separate functional categories (Figure 2B). This is not an artifact of the network connectivity, as this pattern is not observed in a network where node positions have been randomly shuffled (Additional file 7). Similarly analysis using different bin sizes did not significantly alter the outcome (Additional file 8). A similar high correlation and interaction preference pattern, albeit weaker, is observed when protein-protein interaction network is used for gene-gene interactions instead of the Fun-Net (Additional file 9). These observations show that human genes can be functionally separated based on their EV patterns. Low overall association of genes with low and high EV genes in either network suggests that the cellular functions performed by the low and high EV genes are distinct, similar to what we have shown for yeast.
Next, in order to see which cellular functions are represented by the high and low EV genes, we calculated relative enrichment of the top and bottom 500 genes within the EV distribution for specific GO functional categories. Figure 2C shows pie-charts of most enriched (hypergeometric distribution p-value < 10-5) functional categories in 500 genes with lowest or highest EV. Genes encoding for cellular functions pertaining to cellular homeostasis: mRNA transcription and processing, protein synthesis and proteasomal protein degradation, are the most significantly enriched functional categories among genes with lowest EV (Figure 2C). However, genes exhibiting highest EV are mainly composed of genes encoding proteins in the extracellular space, including extracellular matrix (ECM) components, growth factors and extracellular proteases. A similar pattern is identified using the EVCK values (Additional file 10), where the values reflect fold inductions of genes within the same cell line in response to a treatment. Therefore, the differential enrichment of high and low EV genes for, respectively, extracellular space and intracellular homeostasis genes reflects biological pattern of cellular response to extracellular conditions.
EV of signaling genes reflects their role in the signaling hierarchy
Based on the observations above, we reasoned that extracellular ligands for cellular transmembrane receptors may be more variable than their receptors, meaning that cells are more likely to modulate the expression levels of secreted factors rather than their receptors in response to extracellular cues. We compared EVs of genes annotated as "receptor binding" (GO:0005102), "growth factor activity" (GO:0008083) or "cytokine activity" (GO:0005125) and "extracellular space" (GO:0005615) (SF list, n = 269 genes) to those annotated as "transmembrane receptor activity" (GO:0004888), "receptor activity" (GO:0004872) and "plasma membrane" (GO:0005886) (GR list, n = 1038 genes) (see Additional file 11). Although EVs of both classes are significantly higher when compared to the rest of genes, EVs of the SF list are significantly higher than those of the GR list (see Figure 3B). We wanted to determine if EVs of genes involved in signal transduction reflect the hierarchical position of the corresponding signaling molecules within the signaling network. In order to answer this question, we compiled a comperehensive signaling network from online databases (5499 genes and ~22,000 interactions, see Methods), and defined 5 levels of signaling hierarchy based on the positions of the signaling molecules (Figure 3A). The first level, growth factor modulators (GM class), are secreted molecules that modulate the activities of receptor-binding secreted factors. This class includes genes such as SFRP2 (Secreted Frizzled-Related Protein 2, regulator of WNT proteins), MMP1 (matrix metaloprotease 1, regulator of various growth factors/cytokines), IGFBP1 (IGF binding protein 1) and LTBP1 (latent TGF-beta binding protein). The next two levels are secreted factors (SF) and receptors (GR), explained above. Receptor substrates (RS) are molecules immediately downstream of receptors (GR), such as G-proteins (GNA genes), receptor-associated kinases (e.g. IRAK genes, ADRBK2, JAK1), and adaptor proteins (e.g. GRB2, SOS1-2, FADD, IRS1) among others; and the next class are molecules that mediate signal transduction downstream of RS (RS2) (see Methods). Strikingly, EV patterns of these levels display a gradient, with the GM level being the most variable, and RS being the least variable among these hierarchy levels (Figure 3B). This pattern is also reproduced with EVCK values (Additional file 10). This suggests that transcriptional regulation of intracellular signaling pathways mostly happens at the level of secreted growth factor modulators and growth factors, while signaling molecules immediately downstream of signaling receptors seem to be the least transcriptionally modulated. Accordingly, the RS and RS2 levels are mostly found in class I through III of genes based on their PIC occupancy, while GM, SF and GR levels are in class IV (Figure 3C).
Interestingly, class II, which represents genes with PIC occupancy but no detectable transcription, is enriched for intracellular signaling genes in the RS level, not secreted factors, although this class has significantly high EV (see Figure 2A). This may indicate that class II contains condition-specific intracellular signaling genes, while classes I and III are enriched for constitutively expressed intracellular signaling genes. Indeed, genes with class II promoters contain high EV genes of the RS level, while classes I and III contain the low EV genes of the RS level (Figure 3D). Importantly, these observations suggest not only that genes coding for extra- and intra-cellular proteins can be distinguished based on their promoter architecture, but also that promoters of intracellular proteins among themselves are distinguished based on whether they are constitutive or condition-specific. In addition, while genes for condition-specific extracellular proteins are located within densely packed hypo-acetylated regions, condition-specific intracellular genes have relatively open promoters with a pre-assembled PIC. This suggests that regulation of transcription of genes coding for extracellular proteins may be fundamentally different from those coding for intracellular signaling genes (see Discussion).
EV reflects functional centrality
Next, we asked whether the correlation of EV patterns of signaling genes with their positioning within the signaling hierarchy has a biological significance in terms of regulation of signaling pathways. Secreted factors are very specific in terms of their signaling targets (e.g. the chemokine CXCL12 specifically activates the G-protein coupled receptor CXCR4) therefore variation of their levels through transcriptional regulation may provide a high level of specificity in the regulation of signaling pathways. However, the receptor substrates are utilized by many signaling pathways (e.g. G-proteins are utilized by a large variety of G-protein coupled receptors), and therefore variation in their expression can lead to major rearrangements in the signaling architecture of the cell. Therefore, it is possible that transcriptional regulation of signaling in response to extracellular cues involves highly selective activation/inhibition of specific pathways, rather than involving large rearrangements of the signaling network. In order to test this, we compared total number of protein-protein interactions within each hierarchical class. Since the functions of signaling molecules mainly involve protein-protein interactions, the total number of protein-protein interactions of a signaling protein may provide an estimate of the number of different processes/functions that it can be involved in. Indeed, the RS class has the most overall number of interactions among all the classes (Figure 4A). In order to see if lower EV of most central proteins is a general trend in the intracellular protein interaction network, we correlated total number of interactions of the RS and RS2 proteins with their EVs. There is a significant negative correlation of EVs of RS and RS2 proteins and their number of protein interactions (Figure 4B, Spearman's ρ = -0.27, n = 1189, P < 10-20), which supports our hypothesis that differential expressional variation of genes within the signaling hierarchy can at least in part be explained by their functional centrality in the signaling network. Moreover, the RS and RS2 level proteins with high EV (EV > 0.9) have significantly less number of protein-protein interactions than those with low EV (EV < 0.1) (Figure 4C). In addition, the intracellular signaling genes with Class II promoters, which are mainly condition-specific (see Figure 3D), have significantly less number of protein-protein interactions than those with Class I promoters (Figure 4D), which are mainly constitutive (see Figure 3D). In order to get a view of the layout of proteins with different EVs within the signaling network, we plotted the signaling network of proteins of the RS and RS2 levels with high and low EVs. This network in Figure 5 shows a clear differential distribution of high and low EV proteins. While proteins with low EV are mainly located in the central dense regions of the network, those with high EV are mainly located at the periphery and generally have sparser connections. These observations support our hypothesis that condition-specific genes encode proteins with less central roles in the signaling network. The finding that genes with lowest EVs mostly comprise genes involved in the cellular homeostatic processes (transcription, translation, etc...), which can be regarded as the most central cellular processes, also adds to the hypothesis that mRNA levels of genes with highest functional centrality in the cell are modulated less and the most variable genes are those encoding more specialized regulatory proteins.