Gene, pathway and network frameworks to identify epistatic interactions of single nucleotide polymorphisms derived from GWAS data
© Liu et al; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Skip to main content
© Liu et al; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Interactions among genomic loci (also known as epistasis) have been suggested as one of the potential sources of missing heritability in single locus analysis of genome-wide association studies (GWAS). The computational burden of searching for interactions is compounded by the extremely low threshold for identifying significant p-values due to multiple hypothesis testing corrections. Utilizing prior biological knowledge to restrict the set of candidate SNP pairs to be tested can alleviate this problem, but systematic studies that investigate the relative merits of integrating different biological frameworks and GWAS data have not been conducted.
We developed four biologically based frameworks to identify pairwise interactions among candidate SNP pairs as follows: (1) for each human protein-coding gene, a set of SNPs associated with that gene was constructed providing a gene-based interaction model, (2) for each known biological pathway, a set of SNPs associated with the genes in the pathway was constructed providing a pathway-based interaction model, (3) a set of SNPs associated with genes in a disease-related subnetwork provides a network-based interaction model, and (4) a framework is based on the function of SNPs. The last approach uses expression SNPs (eSNPs or eQTLs), which are SNPs or loci that have defined effects on the abundance of transcripts of other genes. We constructed pairs of eSNPs and SNPs located in the target genes whose expression is regulated by eSNPs. For all four frameworks the SNP sets were exhaustively tested for pairwise interactions within the sets using a traditional logistic regression model after excluding genes that were previously identified to associate with the trait. Using previously published GWAS data for type 2 diabetes (T2D) and the biologically based pair-wise interaction modeling, we identify twelve genes not seen in the previous single locus analysis.
We present four approaches to detect interactions associated with complex diseases. The results show our approaches outperform the traditional single locus approaches in detecting genes that previously did not reach significance; the results also provide novel drug targets and biomarkers relevant to the underlying mechanisms of disease.
The typical analytic framework for the genome-wide association studies (GWAS) of complex diseases or traits considers the additive contribution of common variants (usually SNPs) to genetic risk one at a time, based on an assumption of an underlying simplified genetic architecture [1–4]. In the last few years, more than 400 GWAS have identified an unprecedented number of candidate disease-associated DNA sites (> 1,200) , many of them already well-known for their disease association, while many others have also been replicated and confirmed by follow-on studies . However, for a given disease/trait, the genetic variation explained by GWAS is significantly less than the estimated heritability . In addition, although the individual SNP contributions are considered independent with respect to additive heritability when Linkage Disequilibrium (LD) is taken into account, their independence in contributing to a complex disease is by no means assured. The potential for gene-gene interaction has been proposed to be one of possible reasons for the so-called "missing heritability" of GWAS, along with other possible factors such as rare variants and environmental factors. In addition, biological interactions may be very important in contributing to overall disease outcomes [6, 7]. In this paper we come to the problem of GWAS analysis using an alternative assumption, not one of the independent action of SNPs or genes but rather one that assumes that they may indeed interact in causing complex disease. This is a well-known idea and when genes function primarily through a complex mechanism that involves multiple genes, the joint effect (behavior) of those genes' variants is referred to as a gene-gene interaction (or epistasis) [8–10], though biological interaction and statistical interaction are often confused . The contributions of gene-gene interaction to the risk of diseases have been well documented, e.g., in the case of breast cancer and coronary heart disease [8, 12].
One of the challenges in detecting gene-gene interactions is the computational burden due to the size of the search space for exhaustive pairwise testing . Consequently, most methods employ either a heuristic or a two-step (screen-testing) approach, which may miss some true interactions . However, some recent methods are reported to be efficient enough for an exhaustive search of GWAS data in a relatively short time [14, 15]. An alternative approach to probing the effects of multiple genes is gene set enrichment analysis (such as GSEA-SNP and others [16, 17]), which can serve to identify pathways that are implicated in disease. However, this approach considers all the genes in the pathway as equal, and cannot reveal the discrete structure of potential relationships of mechanistic interest. Another challenge in multi-SNP analysis is that the statistical significance threshold (p-value) has to be extremely low due to multiple hypothesis-testing corrections; typically, the p-value has to be smaller than 10-13 to be significant for an exhaustive search in GWAS data and, very few interactions can pass such stringent thresholds . On the other hand, one way to tackle this challenge is to use biological knowledge to narrow the statistical search space  and this idea dominates the approach taken here.
The progress of biomedical research over the last several decades has resulted in the accumulation of vast amounts of biological information stored in public databases, including gene/genome annotations, pathway, and network information. The analysis of GWAS data can benefit from the application of such rich resources. Various biological frameworks have been integrated with GWAS to detect biomarkers or pathways associated with complex diseases [19–23]. Recently, a number of approaches have been suggested to guide the search for gene-gene interaction based on the use of prior biological knowledge [18, 24–26]. These approaches either focused on a single pathway or were based on interaction databases, and thus reduced the search space dramatically. However, because the databases are far from complete, using these alternative approaches alone would not identify novel interactions that are absent in the current interaction databases.
A recent study described a method to search for what could be interactions within the local SNP "neighborhood", but was proposed rather as a simple way to detect associated haplotypes, and demonstrated that this approach is efficient in detecting new disease associations . By applying logistic regression to test adjacent SNPs pairs, six pairs of SNPs were found to significantly associate with coronary artery disease (CAD), and one pair locates in a known major CAD-associated region (9p21) . Encouraged by this success, here we propose new and broader approaches to integrate the latest available gene annotation, pathway, and network information, along with the functionality of eSNPs, to detect gene-gene interactions associated with complex diseases. Briefly, the statistical interactions among SNPs involved in the same genes and pathways were exhaustively tested; and we also constructed a disease related subnetwork based on prior knowledge, and the interactions among genes in the subnetwork were also exhaustively searched.
Many searches for interactions based on biological information and annotation have focused on the SNPs located in coding gene regions. However, the vast majority of SNPs are located in intergenic regions, many of them are likely to have unannotated biological functions, and their potential interactions have been neglected in previous studies. Recent studies showed that many intergenic genomic loci can modulate gene expression by both cis and/or trans mechanisms, and the loci identified in this manner are referred to as expression quantitative trait loci, or eQTLs [28, 29]. Highly significant associations between SNPs located in eQTLs and gene expression in various tissues have been established (such SNPs are referred to as expression SNPs, or eSNPs) [30–32]. Furthermore, one recent study has demonstrated that SNPs associated with complex traits are significantly more likely to be eSNPs . However, the potential interactions related to eSNPs have not been studied and reported. In this study, we propose general methods to search for the interactions between eSNPs and SNPs located in a gene whose expression is affected by the eSNPs.
We have applied the approach of using several biological frameworks (gene, pathway, network, and eSNP) to reduce the relevant search space and make exhaustive pairwise searches tractable. Using these frameworks, which can be generalized to all complex diseases, we probe for pairwise SNP statistical interactions to provide novel candidate genes associated with type 2 diabetes (T2D) using the Wellcome Trust Case Control Consortium (WTCCC) data . The results illustrate that our approach can detect multiple, new disease associations for complex diseases, which can point to novel biomarker and drug targets that illuminate the molecular mechanisms underlying the diseases.
We obtained permission to access the WTCCC dataset for T2D from the consortium websites (https://www.wtccc.org.uk/info/access_to_data_samples.shtml, ). The detailed description of the study samples can be found in the original report. In brief, the data set has a pool of 3,004 controls (which consist of a 1958 Birth Cohort (1,504 individuals) and a recently recruited UK Blood Service sample (1,500 individuals)), and 1,999 T2D affected individuals. The majority of subjects were of European ancestry. Samples from the individuals were genotyped using Affymetrix GeneChip 500K arrays. The genotypes estimated with the algorithm CHIAMO were used in this study. The following exclusion criteria were used for quality control: (i) Hardy-Weinberg Equilibrium exact test P value < 5 × 10-7 in controls; (ii) allele frequency difference test based on 1 degree of freedom (df) trend test P value < 5 × 10-7 between the two control groups; (iii) minor allele frequencies < 1%. We were most interested to see if a focus on interactions could promote SNPs that were non-significant to significant status. After filtering and taking out the SNPs in genes found by single SNP analysis, the number of SNPs analyzed decreased from 500,568 to 418,097.
We downloaded the genomic coordinates of 18,657 genes from the plink website (http://pngu.mgh.harvard.edu/~purcell/plink/res.shtml, ), which were generated from the UCSC table browser for all RefSeq genes on July 24th 2008. The coordinates of SNPs from the WTCCC dataset were used to map them to these genes. Because of the potential that regulatory elements could be located in the annotated gene neighborhood, 20 kb of sequence up and down stream of a gene was also considered as part of the gene. This may associate more than one gene with a single SNP. Gene assignments using this approach were used for the pathway, network, and eSNP analyses below.
To reduce the computational burden, only genes with less than 100 SNPs were analyzed. This constraint could be relaxed in future studies by considering the Linkage Disequilibrium (LD) structure, where SNPs located in the same LD blocks are highly correlated, and the genotyping information becomes redundant. The final analyzed set has 15,953 genes, which include 205,402 SNPs in total. For each of the 15,953 genes, SNP pairs were generated exhaustively based on SNPs located in the same gene, and tested for interactions. In total, more than 2.7 × 106 SNP pairs were tested. To correct for the multiple hypothesis-testing problem, a Bonferroni correction was used with a p-value cutoff of 1.85 × 10-8 for a significance level of 5%.
Canonical pathway data were downloaded from the Molecular Signatures Database (http://www.broadinstitute.org/gsea/msigdb/index.jsp, c2.cp.v3.0). The initial data contain 880 canonical pathways. Some pathways have very general functions, and contain large numbers of genes, e.g., the gene expression pathway from the Reactome has 425 genes, and the pathways in cancer from KEGG have 328 genes. To focus on pathways with more specific functions and to increase computational efficiency, only pathways with less than 50 genes were analyzed in this study. The final set has 655 pathways, and 1.9 × 105 SNPs in total. For each pathway, the interactions among SNPs located in different genes were tested, which led to 2.7 × 107 tests in total.
The subnetwork associated with T2D was constructed in three steps: 1) first, a human protein-protein interactome was downloaded from a public database; 2) then, genes associated with T2D (T2D genes) were also obtained from a database curated by literature mining; 3) finally, the T2D genes were used as seeds to extract a subnetwork from the interactome by applying the Steiner tree algorithm. The details for each of these steps are as follows.
The human interactome was downloaded from the STRING database maintained by EMBL (http://string-db.org/, ). Note that STRING contains known and predicted protein/gene interactions, which include direct (physical) and indirect (functional, such as mRNA co-expression) associations. They are derived mainly from four sources: high-throughput experiments for interaction detection (yeast two-hybrid, affinity purification followed by mass spectrometry, whole genome expression, literature mining, and genomic context). Based on the strength of evidence for each interaction, a score is assigned to reflect the confidence level. Those interactions with a score more than 0.80 were extracted to generate the human interactome, which contains 10,571 genes and 286,876 interactions.
Genes associated with T2D (T2D genes) were downloaded from a public database (T2DGADB, http://t2db.khu.ac.kr:8080/, ), derived from 701 publications of T2D association studies. 446 T2D genes showed disease association in from one to 49 publications. T2D genes (seed genes) were mapped to the interactome, and a T2D related subnetwork was constructed by adding new genes to connect T2D genes using a Steiner tree algorithm. Details of this algorithm can be found in the original paper and its applications [37–39]. Briefly, as a first step, T2D genes absent from the interactome are removed, then the algorithm adds other genes to connect the remaining genes, finally, the network is simplified based on the criterion of the shortest paths between seed genes. The final subnetwork has 453 genes and 2374 interactions, and 354 genes are initial T2D genes (seed genes) while 99 genes (nodes) are added to optimize the connectivity.
The SNPs located in the 453 genes of the subnetwork were collected, and the SNP pairs were exhaustively generated from all SNPs. SNP pairs from the same genes were removed. The final SNP pairs were tested for interactions, which results in 4.7 × 107 tests.
To detect the interactions involved in SNPs located in intergenic regions, we analyzed the interactions of SNP pairs between eSNPs and SNPs positioned in the genes whose expressions are affected by the eSNPs.
The association data between eSNPs and genes was downloaded from a previous study (http://www.sph.umich.edu/csg/liang/asthma/, ) and public database (http://www.scandb.org, ). The p-value cutoff (< 10-5) was used to filter out the unreliable associations between eSNPs and the expression of genes. In total, association was established between 151,571 eSNPs and 11,558 genes. SNPs located in 11,558 genes were mapped to genes as described above. Overall 3.5 × 106 eSNP-SNP pairs were generated.
where P is the probability of being affected, b0 is the intercept, b1, b2, and b3 are coefficient terms, and rs1 and rs2 are the additive codes for the two SNPs (i.e. the number of susceptibility alleles, 0,1 or 2). The biological meaning of these terms are as follows: b0, the baseline odds of disease; b1 and b2 are odds of disease due to the two SNPs respectively; b3, the odds of disease due to interaction between two SNPs. They are calculated by traditional logistic regression analysis. Then, a two-sided test of the null hypothesis b3 = 0 is performed assuming the test statistic follows its asymptotic distribution. The Bonferroni method was used to correct for multiple hypothesis-testing separately for each of the gene, pathway, network, and eSNP levels. Corrected p-values < 0.05 were considered as significant. The associations of single SNPs with T2D were also compared to interaction p-values.
The initial analysis of this data by the WTCCC identified three SNPs significantly associated with T2D (rs9465871, rs4506565, and rs9939609). We found that most of the p-values for SNP pairs containing those SNPs are highly significant. It is not clear if there is a true association among those pairs because of the strong association of those three SNPs with T2D. Further analyses are needed to clarify this result, and thus those SNPs pairs are not presented in this work to focus the results on interactions that promote non-significant SNPs to significance.
SNP pairs detected when testing among SNPs located in the same genes.
3.7 × 10-1
3.3 × 10-5
1.3 × 10-8
3.3 × 10-5
3.0 × 10-1
1.6 × 10-8
4.8 × 10-1
1.4 × 10-3
1.5 × 10-11
9.2 × 10-1
1.5 × 10-5
9.4 × 10-9
6.5 × 10-1
1.5 × 10-5
1.2 × 10-8
8.3 × 10-1
1.5 × 10-5
1.2 × 10-8
3.4 × 10-1
5.7 × 10-5
1.7 × 10-10
5.7 × 10-5
2.3 × 10-1
3.5 × 10-10
2.0 × 10-1
5.7 × 10-5
6.1 × 10-10
One significant SNP pair is detected for ZFAT (zinc finger and AT hook domain); while the individual SNPs have non-significant p-values, the correlation between them is less than 0.1, and the p-value for interaction is the most significant one (1.51 × 10-11). ZFAT is not well studied, but likely binds DNA and functions as a transcriptional regulator related to apoptosis and cell survival [41, 42]. Two pairs of SNPs are detected for NDST3. This is annotated as a monomeric bifunctional enzyme, which catalyzes the N-deacetylation and N-sulfation of N-acetylglucosamine residues in heparan sulfate and heparin. C9orf3 contains three significant SNP pairs, and is likely a member of the M1 zinc amino-peptidase family based on the annotation in NCBI. The proposed encoded protein is a zinc-dependent metallopeptidase that catalyzes the removal of an amino acid from the amino terminus of a protein or peptide. There is no prior association with T2D reported for ZFAT, NDST3 or C9orf3, and thus they are novel targets for further study.
The fourth gene with significant SNP pairs is PPM1A, which is a member of the PP2C family of Ser/Thr protein phosphatases and reported to be indirectly associated with T2D . PPM1A is involved in the IRS (insulin regulated signaling) pathway. Its function is to dephosphorylate and negatively regulate the activities of MAP kinases, which are important for insulin-signaling . Moreover, it has been shown to inhibit the activation of p38 and JNK kinase cascades induced by environmental stresses . Over-expression of this phosphatase is reported to activate the expression of the tumor suppressor gene TP53/p53, which leads to G2/M cell cycle arrest and apoptosis .
SNP pairs detected through analysis of pathway, network and eSNPs
1.5 × 10-5
1.3 × 10-7
7.0 × 10-11
2.2 × 10-7
2.2 × 10-2
2.9 × 10-11
1.3 × 10-7
2.2 × 10-2
9.9 × 10-10
5.5 × 10-1
2.9 × 10-1
9.6 × 10-10
PPARA is a nuclear transcription factor, which also mediates peroxisome proliferators and affects the expression of target genes involved in cell proliferation, in cell differentiation and in immune and inflammation responses. PPARA is believed to associate with diabetic "microvascular" complications (damage to the small blood vessels), and is considered as a potential treatment target for such complications . Furthermore, PPARA interacts with PPARGC1A (peroxisome proliferator-activated receptor gamma, coactivator 1 alpha), which interacts with PPARG (peroxisome proliferator-activated receptor gamma), which permits the interaction of this protein with multiple transcription factors. PPARG is reported to be significantly associated with T2D in the original analysis of the WTCCC GWAS dataset . The protein coded by the CDC6 gene is essential for the initiation of DNA replication, and functions as a regulator at the early steps of DNA replication. RARA regulates transcription in a ligand-dependent manner. It is implicated in the regulation of development, differentiation, apoptosis, granulopoeisis, and transcription of clock genes. Both CDC6 and RARA are less well-studied genes, and there is no T2D association with CDC6 or RARA reported so far. Our results suggest that they might be associated with T2D through their interaction with PPARA.
Two pairs of SNPs were found with significant p-values (Table 2), and they shared one common SNP: rs41433646, which locates in an intron of the RBM19 (RNA binding motif protein 19) gene. One of the other two SNPs (rs2490429) locates in between two genes: ATF6 (activating transcription factor 6) and OLFML2B (olfactomedin-like 2B). Although the distances between rs2490429 with ATF6 and OLFML2B are almost the same, ATF6 and rs2490429 belong to the same LD block, as illustrated in Figure 2. Thus it is likely that ATF6 and rs2490429 are associated to each other. Interestingly, the third SNP is rs4253764, which is part of the SNP pair detected in the above pathway analysis and located in the intron region of PPARA.
The protein encoded by ATF6 is a transcription factor. During endoplasmic reticulum (ER) stress, ATF6 activates target genes for the unfolded protein response. There are reports that its polymorphisms are associated with diabetes in various populations, such as Dutch Caucasians and Indians [48, 49]. OLFML2B locates close to ATF6, and is a relatively less known gene. RBM19 encodes a nuclear protein that contains six RNA-binding motifs, which may be involved in regulating ribosome biogenesis. No strong association with T2D has been reported for RBM19.
The above analysis focused on testing of interactions among SNPs based on gene-gene based frameworks. We also attempted to explore the interactions of SNPs through an examination of eSNPs and associated genes whose expression is altered by the eSNPs. In total, 3.5 × 106 tests were performed for the search of interaction between expression altered genes and eSNPs. One SNP pair was detected with significant p-value after correction (Table 2). rs12517663 is located in an intergenic region of chromosome 5, and significantly affects the expression of KLHDC4 (kelch domain containing 4, chromosome 16) in lymphoblastoid cell lines with p-value < 1.0 × 10-5. The interaction between rs12517663 and rs3751726 (located in the neighborhood of KLHDC4) is significantly associated with T2D (p-value 9.6 × 10-10). The function of KLHDC4 is unknown. However, Kelch proteins are commonly seen to associate with actin tails.
Two major assumptions of the analysis are: 1) the numbers of individuals in the nine genotype cells of the contingency table for two SNPs, for both cases and controls, are large enough to assume an asymptotic null distribution for the test statistic, and 2) our results are not confounded by population structure. To be sure that the first assumptions does not invalidate our results we made sure that each of the pairs of SNPs showed a significant difference between cases and controls using Fisher's exact test, as implemented in the R function fisher.test. This was an overall test for the effect of the two SNPs, not just for the interaction term in our logistic regression model. The results showed the p-values for all detected SNP pairs are still significant after correction. Regarding the second assumption, we calculated the 10 top principal components using ancestry informative marker SNPs with Plink , and then we repeated each 2-SNP analysis where we found a significant interaction but including as covariates in the model the 10 top principal components of all the SNPs. The results showed that there is virtually no change in p-values of the coefficients of the interaction terms, which indicates that the associations detected are not caused by population stratification.
In conclusion, this study presented several approaches to search for disease-associated gene-gene interactions from GWAS data based on prior biological knowledge and discrete biological frameworks. This is in stark contrast to the single locus approach and the results provide many interesting genes and interactions with significant p-values. While some of the identified SNPs are linked to genes that are well known for their association with T2D, such as PPARA, others are novel, and potentially provide new avenues for further research. The original analysis based on single locus models revealed only three genes associated with T2D: PPARG, KCNJ11 and TCF7L2 . Our analysis uncovered 12 additional genes that might be associated with the disease through the statistical interaction of SNP pairs in the same or different genes. We believe that the multiple-locus models presented in this and previous studies, such as the method based on adjacent SNPs , may outperform the single locus model to detect true associations, where the addition of relevant biological knowledge can dramatically improve the efficiency of the search.
genome wide association study
single nucleotide polymorphism
expression single nucleotide polymorphism
expression quantitative trait loci
type two diabetes
wellcome trust case control consortium.
This study makes use of data generated by the Wellcome Trust Case Control Consortium.
Funding for the original WTCCC project was provided by the Wellcome Trust under award 076113. A full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk/info/participants.shtml. This work is supported by funding from NIH (R01-LM-11247 and R43-GM-103404) and National Science Foundation (Grant Number CCF-0953195). This publication was made possible by the Clinical and Translational Science Collaborative of Cleveland, UL1TR000439 from the National Center for Advancing Translational Sciences (NCATS) component of the National Institutes of Health and NIH roadmap for Medical Research. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health and National Science Foundation.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.