Genome-wide meta-analysis of genetic susceptible genes for Type 2 Diabetes
© Hale et al; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Skip to main content
© Hale et al; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Many genetic studies, including single gene studies and Genome-wide association studies (GWAS), aim to identify risk alleles for genetic diseases such as Type II Diabetes (T2D). However, in T2D studies, there is a significant amount of the hereditary risk that cannot be simply explained by individual risk genes. There is a need for developing systems biology approaches to integrate comprehensive genetic information and provide new insight on T2D biology.
We performed comprehensive integrative analysis of Single Nucleotide Polymorphisms (SNP's) individually curated from T2D GWAS results and mapped them to T2D candidate risk genes. Using protein-protein interaction data, we constructed a T2D-specific molecular interaction network consisting of T2D genetic risk genes and their interacting gene partners. We then studied the relationship between these T2D genes and curated gene sets.
We determined that T2D candidate risk genes are concentrated in certain parts of the genome, specifically in chromosome 20. Using the T2D genetic network, we identified highly-interconnected network "hub" genes. By incorporating T2D GWAS results, T2D pathways, and T2D genes' functional category information, we further ranked T2D risk genes, T2D-related pathways, and T2D-related functional categories. We found that highly-interconnected T2D disease network “hub” genes most highly associated to T2D genetic risks to be PI3KR1, ESR1, and ENPP1. The well-characterized TCF7L2, contractor to our expectation, was not among the highest-ranked T2D gene list. Many interacted pathways play a role in T2D genetic risks, which includes insulin signalling pathway, type II diabetes pathway, maturity onset diabetes of the young, adipocytokine signalling pathway, and pathways in cancer. We also observed significant crosstalk among T2D gene subnetworks which include insulin secretion, regulation of insulin secretion, response to peptide hormone stimulus, response to insulin stimulus, peptide secretion, glucose homeostasis, and hormone transport. Overview maps involving T2D genes, gene sets, pathways, and their interactions are all reported.
Large-scale systems biology meta-analyses of GWAS results can improve interpretations of genetic variations and genetic risk factors. T2D genetic risks can be attributable to the summative genetic effects of many genes involved in a broad range of signalling pathways and functional networks. The framework developed for T2D studies may serve as a guide for studying other complex diseases.
Type 2 Diabetes (T2D) is a complex metabolic disease that affects 25.8 million Americans in 2011, according to statistics reported by Centers for Disease Control and Prevention (CDC). T2D occurs when the body develops resistance to insulin due to the malfunction of insulin producing β-cells. The developmental process of T2D involves a complex interplay between genetic and environmental factors. However, it is not clear how the underlying genetic defects give rise to T2D pathogenesis over time. Recent T2D genetic study results, particularly those from genome-wide association studies (GWAS), have yielded insights to the molecular mechanisms and underlying genetic risk factors of T2D . Among the many risk genes identified are: transcription factor 7-like 2 (TCF7L2)[2–4], peroxisome proliferator-activated receptor gamma (PPARG)[5–7], and potassium inwardly-rectifying channel, subfamily J, member 11 (KCNJ11)[5, 6].
These GWAS results were challenging to interpret. Many single nucleotide polymorphisms (SNPs) identified from GWAS tend to show strong sample biases and may not extrapolate from one population to another. In T2D, only approximately 28% of the disease heritability may be explained by identified individual SNPs that showed statistical significance in these samples/population--a problem known as missing heritability . The combined effects of multiple risk SNP's can increase the overall odds ratio of T2D by 1.24 per allele for up to 8.68 among 18 risk alleles in one study  and by 1.265 per allele in another study . The additive effect suggests the presence of molecular system structures that are essential to T2D pathogenesis.
To confirm the presence of molecular systems structures that may better explain missing heritability problems for T2D, we adopted a Systems Biology approach to studying T2D genetic risk gene networks as a whole rather than the risk genes individually. Prior to this study, several reports [10, 11] examined genes implicated T2D differential expressions in affected tissues. In this study, we used T2D-associated SNP information curated from the Type 2 Diabetes Genetic Association Database (T2DGADB), which integrated comprehensively reported SNPs, their odds ratios, population description, and all related metadata from various T2D GWAS performed worldwide . We further annotated individual SNPs collected from T2DGADB with information from the DbSNP database , including information such as nearby genes, Chromosomal location, gene functional class, and base changes. To create a model for T2D genetic risk gene molecular systems structure, we built a gene interaction network seeded by T2D risk genes collected from T2DGADB and expanded with high-confidence protein interaction data collected from the Human Annotated and Predicted Protein Interaction database (HAPPI) . We also ranked risk genes in the network according to these high confidence interactions.
Data from both the ftp site and web pages of T2DGADB were downloaded. On the ftp site, only a gene list and a SNP list without annotation were available for download. Therefore, the complete information from individual web pages of T2DGADB was extracted into a single Excel file manually. Data entries with dbSNP SNP cross-references were kept and entries without dbSNP SNP cross-reference information were removed from this study. Gene annotation information is derived from the VEGA  database. The Excel file was imported into the ORACLE 11 g database for subsequent efficient database querying.
Once we collected information integrated from T2DGADB and dbSNP, we applied standard hyper-geometric tests (using an R software package called phyper) to the data set to determine which chromosomes were over-represented/under-represented. We determined significance on three data sets. First, the distribution of dbSNP of all human SNP's in current build of known origins were compared against that of risk SNP's across all chromosomes. Second, the distribution of genes where the risk SNPs can be mapped to were compared against that of all the genes across all chromosomes. Third, the distribution of protein-coding genes where the risk SNPs can be mapped to were compared against that of all the protein-coding genes across all chromosomes.
In this equation, r p score measures the relative significance of a gene/protein in the subnetwork. p and q are proteins in the subnetwork. k is a constant set at two for our purposes. conf (p, q) is the confidence score of the interaction between the two proteins provided by the HAPPI database and is 0 if p and q do not interact. N (p, q) is 1 if the proteins have an interaction and is 0 otherwise. With all of this we are able to generate a network map of all involved genes.
r mod adjusts r p score of any risk gene with both the count of populations, P count , in which significant risk SNPs were identified, and the average odds ratio (OR avj ) of reported risk SNPs found in these studies. Genes containing only one significant study can still be adjusted using the formula provided here. In constructing the final T2D risk gene network, we modified the original network to exclude studies in which the risk genes were determined to be insignificant (r p score < 2) before we calculated r mod scores for risk genes. Cytoscape software was used to visualize network relationships among genes and gene sets (to be described next).
To further sift the results and explore functional connections, we also mapped genes onto known gene sets. For this purpose, we used DAVID [22, 23] to search for enriched KEGG  pathways. We also used GARNET  to identify enriched Gene Ontology categories and their relationships.
Based on T2DGADB, we collected 4358 T2D SNP entries that cover 518 PubMed articles reporting T2D Genome-wide association studies (GWAS) results worldwide. Since not all study reported statistics on all SNPs, there are only 3720 SNP entries with P-values, 2715 SNP entries with complete odds ratios, 2406 SNP entries with sample size and minor allele frequency values. All together, there are 1269 SNP entries with the above-mentioned complete set of statistics. After comparing collected information against dbSNP entries manually, we validated 333 SNP gene annotations, re-annotated 11 SNP gene annotations, and flagged 140 additional genes that do not appear to be consistent with dbSNP curated locus information. The "mis-annotations" in T2DGADB were partly due to the presence of two genes, e.g., ABCC8 and KCNJ11, for some reported SNPs, therefore confounding manual curations. In other cases, gene symbols that are similar to each other, e.g., AGER and RAGE, which do not even appear on the same chromosome, also seemed to have been mixed up. The flagging of putative genes, e.g., LOC387761, according to dbSNP was performed, because we wanted to prioritize genes with known gene functions. In the end, we collected 4085 distinct SNP entries that cover 1539 SNPs in 370 different genes. Among these SNP entries, 598 SNPs from 255 different genes passed a P-Value significance cutoff of 0.05.
P-value of Chromosomal Specificity Significant Test.
P-value (all genes)
P-value (coding genes)
Contributing to this result are genes including Hepatocyte Nuclear Factor 4 Alpha (HNF4A) and Protein Tyrosine Phosphotase non-receptor type 1 (PTPN1), both of which have been extensively studied for their roles to T2D genetics. Linkage studies dating back to 1997 showed a modest association with this region of chromosome 20 [26, 27], when whole genome data was not available. Particularly interesting is that the interplay between genetics and environment also may act on this chromosomal region, e.g., the epigenetic effect of diet on the promoter regions of HNF4A .
Top-ranking T2D risk genes ordered by their r p scores in the T2D risk gene protein interaction network.
r p Score
Top-ranking T2D risk genes ordered by their r mod scores in the T2D risk gene protein interaction network.
Rank by r mod Score
r mod Score
r p Score
Original Rank by r p Score
Enriched pathways identified in the T2D risk genes.
Adipocytokine signalling pathway
Type II diabetes mellitus
Insulin signalling pathway
Maturity onset diabetes of the young
PPAR signalling pathway
Calcium signalling pathway
Renal cell carcinoma
Hypertrophic cardiomyopathy (HCM)
Aldosterone-regulated sodium reabsorption
VEGF signalling pathway
Pathways in cancer
Our study suggests PI3KR1 may be the gene with highest T2D genetic risk associated. PIK3R1 is a regulatory subunit of the phosphoinositide-3-kinase, which is a protein known to be involved in insulin actions, cancer signaling, and cytokine signaling. While the coverage of the gene's functional relationship to T2D risks in T2DGADB is very limited (with one article only) , there is increasing evidence, including a recent SNP UTR study  and a mixed methods meta-analysis , that supports our finding.
The second highest-ranking gene in the T2D risk gene network by r p scores is ESR1, the estrogen receptor 1 gene. The gene encodes a transcription factor that responds to estrogen action and cancer, and will also form a heterodimer with ESR2. There are two articles cited in the T2DGADB database [32, 33].
The third highest-ranking gene in the network is ENPP1, Ectonucleotide Pyrophosphatase/Phosphodiesterase 1, a trans-membrane glycoprotein involved in metabolism and has been shown to have an effect of insulin signaling and glucose metabolism . ENPP1 was well studied among 20 GWAS-related publications collected through T2DGADB and 10 of those studies returned positive results in the population examined.
To demonstrate that the network hub genes are indeed functionally associated with T2D risks, we performed a t-test on the distribution of risk SNP per gene between the top ranked 25% of risk genes and the bottom 75% of risk genes. When ranks are given by the original r p scores, the results showed significant difference (P = 0.01) between the two groups for the reported risk SNP per gene raition (3.06 SNP/gene for the top 25% "hub genes" vs. 1.67 SNP/gene for the bottom 75% "non-hub genes"). When ranks are given by the modified r p scores, the results showed even higher significant difference (P = 3.55E-4) between the same two groups (3.74 SNP/gene for the top 25% "hub genes" vs. 1.45 SNP/gene for the bottom 75% "non-hub genes").
To gain an overview of functional categories represented by the T2D high-risk genes, we mapped these genes onto curated pathways. In Table 4, we list top significantly over-represented pathways identified by the DAVID software. Two highly ranked "novel" pathways identified (excluding known T2D pathways) are the Adipocytokine signaling and the PPAR signaling pathway.
Enriched gene ontology categories identified in the T2D risk genes.
Gene Ontology Term
regulation of glucose transport
regulation of insulin secretion
response to insulin stimulus
response to peptide hormone stimulus
cellular response to hormone stimulus
positive regulation of glucose import
positive regulation of fatty acid metabolic process
cellular response to insulin stimulus
negative regulation of macrophage derived foam cell differentiation
positive regulation of glucose metabolic process
regulation of lipid metabolic process
In this study, we showed our findings of T2D genetic risk gens. Genes from the chromosome 20 collectively accounted for the highest T2D genetic risks of all the chromosomes. However, the individual contribution of these chromosome 20 risk genes is relatively small. HNF4A as the most significant gene on chromosome 20 has a relatively small r p score of 4.72, which is far lower than all r p scores shown for the top-ranking T2D risk genes in Table 2. Nonetheless, when all other information derived from GWAS results are integrated into the r mod score of 118, the significant contribution of HNF4A to T2D risks becomes clear. The "missing inheritability" problem of T2D genetic risks are therefore partially explained with the calculated integration of T2D molecular interaction network information and T2D genotype-phenotype association study results.
GWAS results show population-specific effectiveness in using TCF7L2 for T2D genetic risk profiling.
Number of Studies
Average Odds Ratio
Maximum Odds Ratio
Minimum Odds Ratio
TCF7L2 was not among the highest ranked T2D genes, primarily due to the emphasis on the quality of protein interaction data that we bring in. The HAPPI database reported 238 protein interactions for TCF7L2 but only 11 of those were above the confidence threshold of 0.8. This is in contrast to ENPP1, in which we identified 743 protein interactions from the HAPPI database and 87 of them passed our confidence threshold of 0.8. The relatively low network centrality explained why TCF7L2 is not ranked at the top overall, although it is a population target for many T2D GWAS.
Future analysis that is built upon this work could benefit by integrating additional genomics and functional genomics information, e.g., available miRNA or mRNA expression information, available copy number variations results, and whole genome sequencing data.
Large-scale systems biology meta-analyses of GWAS results can improve interpretations of genetic variations and genetic risk factors. In this work, we determined that T2D candidate risk genes are located in higher concentration in certain parts of the genome, specifically in chromosome 20. Using the T2D genetic network, we identified highly interconnected network "hub" genes. By incorporating T2D GWAS results, T2D pathways, and T2D genes' functional category information, we further ranked T2D risk genes, T2D-related pathways, and T2D-related functional categories. Overview maps involving T2D genes, gene sets, pathways, and their interactions are all reported. Moreover, we demonstrate a computational framework built upon disease-specific data integration, model construction, and data analysis. The framework developed for T2D studies may serve as a guide for studying other complex diseases.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.