The DNA promoter methylation data for human heart, kidney and liver tissue samples was obtained from the NCBI GEO database with the accession number GSE26033 http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE26033. This data was generated by Pai et al. who studied the methylation patterns between human and chimpanzee tissues [20]. Six individual samples were collected for each tissue type. The methylation profiles were generated by Illumina HumanMethylation27 DNA Analysis BeadChip that contains 27,578 CpG loci located in either CpG islands or non-CpG islands of promoter regions.
The DNA methylation profiles of human heart, kidney, and liver were integrated with RNA-Seq data for analysis. The RNA-Seq data were obtained from Human BodyMap 2.0 project by Illumina, Inc. Human BodyMap 2.0 project conducted deep sequencing for total RNA of 16 individual human tissues with one sample for each tissue type. The RNA-Seq data was available in NCBI GEO database with accession number GSE30611 http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE30611. In this study, we combined RNA-Seq data for heart, kidney and liver tissues with DNA methylation data.
Data quantification and quality
The variability in gene methylation level has various sources, including tissue differentiation, individual sample variation, and technical variance. For the same tissue, genes are expected to carry similar epigenetic modification to control tissue differentiation, thus the variability between sample replicates is expected to be marginal comparing with the variability between different tissues. To evaluate the quality of the DNA methylation data, we plotted the methylation levels between each pair of samples from the same tissue. All plots consistently showed low variability. Figure 1A-C show 3 plots for random pair of samples from heart, kidney and liver. The amount of genes that have a standard deviation less than 0.1 were 98.3%, 91.1%, and 97.6% for heart, kidney and liver, respectively. When we plot the average methylation level for each CpG marker between tissues, the variation was substantial larger (Figure 1D-F). It is clear that points in 1E and 1F are more widespread than that in 1D, implying difference between heart and kidney is smaller than that between heart and liver or between kidney and liver. Heart and kidney both develop from mesoderm and liver develops from endoderm, therefore, not only do epigenetic changes correctly differentiate between differentiated cell types, but they also reveal the distinct embryonic origins of tissues.
The RNA-Seq data from Illumina BodyMap 2.0 project was generated by Illumina HiSeq 2000 with 50 bp paired end sequencing. For each sample, ~160 million reads were generated. We used Tophat [22] to map short reads onto the reference sequence of human genome 19 (GRCh37/hg19). About 55% paired reads were aligned with the reference genome. Software Cufflinks [23] was then used to quantify the expression levels of transcripts. We counted from 45,000 to 48,000 transcripts from 3 tissues and compared the transcriptional levels between each pair of tissues. Figure 1G-I plot the pairwise expression levels between two tissues. The transcriptional levels are represented at logarithm scale for FPKM values calculated from Cufflinks. At log scale, the expression levels are aligned well between tissues with the majority of plots located in the lower level. To further explore the expression patterns, we drew quantile-quantile plots to compare the expression distributions between tissues (Figure 1J-L). All data distributions are aligned well in the high expression levels. However, at the lower end, there are dramatic differences between tissues. Heart has a number of genes with significantly lower expression than both kidney and liver, and liver has more low-expression genes than kidney.
DNA methylation signature identifies tissue specificity
Figure 1D-E indicate variation in DNA methylation between tissues. Nonetheless, it is not clear whether the tissue-specific signatures can be identified from methylation data such that tissues can be classified based on DNA methylation pattern. We investigated sample classification for DNA methylation data with three different methods.
First, we performed hierarchical clustering using all 27,578 CpG markers. We tested different dissimilarity measures, including Euclidean distance, Pearson's dissimilarity and Spearman's dissimilarity. Pearson's and Spearman's dissimilarity are both based on data correlation, thus, they have similar performance that is better than Euclidean distance. For Illumina HumanMethylation27 data, most of genes have two or more CpG markers and the markers from the same genes should be correlated. Thus, correlation based dissimilarity method is more appropriate for this analysis. Ward linkage method was used in hierarchical clustering. The resulting dendrogram was shown in Figure 2A. The 18 methylation samples were clustered into 3 major groups, each representing a human tissue. Again, in terms of DNA methylation, heart and kidney are closer to each other than they are from liver. There was only one misplacement in this clustering, in that one kidney sample was clustered into heart group though it also has short distance to other kidney samples. Overall, tissue samples were successfully clustered based on their DNA methylation patterns.
Secondly, we looked for approaches to improve the hierarchical clustering result. The misplacement occurred in hierarchical clustering is likely due to the large number of markers. The markers with low variation should not contribute to tissue classification, but their variances may present as a noise to affect clustering. Therefore, we intended to filter out the markers with low variance prior to clustering. We selected only the CpG markers that have a standard deviation greater than 0.2 and we obtained 488 markers. A heatmap is shown in Figure 2B. Hierarchical clustering correctly classified all 18 samples with their tissue labels. In Figure 2B, green is for low methylation levels and red is for high levels. Most markers showed distinct methylation levels among the three tissues, therefore, they can be used as DNA signatures to represent different tissue. Note that the methylation levels of the first 69 markers in the heatmap are not aligned with tissue labels. We found that 68 markers are located in X chromosome, and they are methylation markers for sex imprinted genes. Their high levels (red) indicate female, and low levels (green) indicate male.
Thirdly, to further investigate methylation clustering, we used an alternative method, principal components analysis (PCA). One of the advantages of PCA is that it disassociates the correlation between markers when methylation data are transformed into principal components. We plotted the 18 samples in the first three principal components, and the samples are grouped into their tissue labels with the second and third principal components (Figure 2C). The circle points in black color are for liver samples, the red squares are kidney and the green triangles are heart. All points are grouped well except one kidney sample that is close to heart samples as in Figure 1A. However, from the dimension of principal component 2, this kidney sample is still in the range of other kidney samples rather than heart samples.
Significantly expressed genes are affected by DNA methylation
Next, we wanted to determine to what degree the gene expression differences among tissues are affected by epigenetic changes. We used cuffdiff to find expression variation (in FPKM) between tissues and then associated the selected genes with their methylation levels in HumanMethylation27 array. In total, we selected 1296 genes that were differentially expressed between any pair of the three tissues and that exist in HumanMethylatin27 array. We further looked at whether these 1296 genes showed DNA methylation variation between tissues as well. Unpaired t test was used to detect the methylation differences among tissues and the results are summarized in Figure 3. From the 1296 significantly expressed genes, only about one third of them (483 genes) were not shown to have significant changes in methylation between the three tissues. In total, 610 genes were shown to have significant methylation difference between heart and liver, 599 genes were significant between kidney and liver, and 418 genes were significant between heart and kidney. Among them, many genes overlap to be significant in two or three tests. For example, 340 genes showed significant results in comparing both kidney and heart against liver, and 135 genes were significant in all three tests, heart vs. kidney, heart vs. liver, and kidney vs. liver. The list of these 135 genes is shown in additional file 1. The distance of each CpG markers to the transcriptional start site (TSS) was used a covariate to fit into a linear model, but it was not identified as a confounding factor to influence gene expression. In short, the majority of differentially expressed genes showed significant changes in DNA methylation, implying DNA methylation plays an important role in mediating tissue differentiation.
We closely examined the 135 genes that not only have significantly different level of mRNA expression, but also have remarkably different DNA methylation between any pair of the three tissues. We first studied those genes that express very low in kidney, heart and liver (FPKM <1). Their marginal expression may regulate important cellular and biochemical functions, but this requires further elucidation, and and their expression should be more finely tuned. For instance, iodotyrosine deiodinase (IYD) was expressed at low levels in all three tissues (0 in heart, 0.02 in kidney, and 0.26 in liver), which was in accordance to its high methylation level (0.81 in heart, 0.61 in kidney and 0.43 in kidney). Its expression in heart and kidney is negligible, but it has low expression in liver. This is consistent with the lower level of DNA methylation in liver. Iodotyrosine deiodinase, also known as iodotyrosine dehalogenase 1 especially expressed in the thyroid, and salvages iodide by catalyzing deiodination of mono- and diiodotyrosine during the biosynthesis of the thyroid hormone thyroxine [24]. The function of IYD in liver remains unknown. Next, we examined other genes that have distinct high expressions in heart, kidney and liver (FPKM>1), which includes 111 genes. This pool of genes may be critical to the development and/or function of kidney, heart and liver. Several genes were selected for further consideration. Transcobalamin I (TCN1) is a vitamin B12 binding protein that regulates the absorption, trafficking and secretion. TCN1 abnormality in the liver and kidney has been thought to associate with impairment of these tissues [25]. Its FPKM values were 0, 1.75 and 12.97 in heart, kidney and liver respectively, corresponding to beta values of 0.81, 0.59 and 0.73. Thus, its low expression in heart may be partially regulated by DNA methylation. For another example, serine protease 23 (PRSS23), belongs to the peptidase S1 family [26], and is highly expressed in kidney, heart and liver (117.69, 30.86 and 217.32 respectively), and this is consistent with a negative regulatory role for DNA methylation in controlling gene expression as all three tissues show relatively low levels of DNA methylation (0.04 in kidney, 0.07 in heart and 0.06 in liver).
We next investigated whether DNA methylation is associated with induced or repressed gene expression during tissue differentiation. We selected the genes from the 1296 genes that had sufficient changes in methylation levels between tissues (change of average beta values greater than 0.3) and plotted the changes in methylation and expression levels in Figure 4. The fold change in expression levels was plotted on a logarithmic scale. Although higher methylation levels are associated with both lower and higher expression levels, Figure 4 shows a trend that high methylation is more likely to repress gene expression. For example in Figure 4B of kidney vs. heart, 10 genes show higher methylation levels in kidney. Among the 10 genes, 8 show lower expression levels, and only 2 show higher expression levels in kidney than in heart. Therefore, the effect of DNA methylation on gene expression may be bi-directional, nonetheless, higher methylation tends to repress gene expression.
DNA methylation change correlates with gene expression
We further investigated whether the significant changes in methylation will result in a measurable difference in gene expression. To select the genes with significant variation in methylation among the three tissues, we performed analysis of variance (ANOVA) for each gene. To adjust p values for multiple tests, we control false discovery rate (FDR) at 5% and obtained 8687 CpG markers that showed significant variation among the three tissues. We then conducted regression analysis for gene methylation levels and transcriptional levels. Similarly, we adjusted p values by controlling FDR at 5%. Amazingly, most of the 8687 CpG dinucleotides (5735) showed significant correlation between methylation and expression. The p value distribution of regression analysis is shown in Figure 5A. It is clear from this figure that the amount of significant correlations is not generated from random sampling. Thus, we conclude that DNA methylation extensively correlates with gene expression during tissue differentiation.
From the 5735 CpG dinucleotides, 2960 were associated with repressed gene expression, which was only slightly larger than the number of genes (2775) whose methylation levels were positively correlated gene expression. This result further confirms our finding that the effects of methylation on gene expression are bi-directional during tissue differentiation. As HumanMethylation27 markers are located either within or outside CpG island, we are interested to know if the location of markers affects its regulation on expression. Figure 5B-C show the distributions of Pearson's correlations for markers that are located within CpG island (B) or outside of known CpG islands (C). The two distributions are similar with both peaks in positive and negative correlation regions. The peak in negative correlation region is slightly higher than that in positive region, but it is not significant. It indicates a weak trend of increased methylation being associated with decreased gene expression.
We identified several tissue specific genes by sorting these genes whose methylation levels significantly correlates with gene expression. We identified C4BPA, expression of 0.03 in kidney, 0.04 in heart and 97.02 in liver, associated with DNA methylation of 0.77, 0.80 and 0.33, respectively. C4BPA encodes Complement Component 4 Binding Protein, which is known to be only produced in liver [27]. Other selected genes involving in complement system include C5AR1, CD55, FGG, and KRT1. Aquaporin 2 (AQP2) was expressed high in the kidney (15.84 in kidney, 0.17 in heart and 0.03 in liver) with corresponding DNA methylation of 0.64 in kidney, 0.77 in heart and 0.87 in liver. We noticed that the values of DNA methylation are relatively high in all three tissues, while the value difference between kidney and heart is 0.14 and that between kidney and liver is as high as 0.24. AQP2, found in the apical cell membrane and intracellular vesicles of the kidney's collecting duct principle cells, is an important vasopressin-regulated water channel, which supports out finding of high correlation between mRNA expression and DNA methylation [28]. These results showed that integrative analysis of genome wide DNA methylation and RNA-Seq transcriptome data can reveal global regulation of gene expression by DNA methylation.
Comments
View archived comments (3)