Genetic studies of complex human diseases: Characterizing SNP-disease associations using Bayesian networks
© Han et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Skip to main content
© Han et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Detecting epistatic interactions plays a significant role in improving pathogenesis, prevention, diagnosis, and treatment of complex human diseases. Applying machine learning or statistical methods to epistatic interaction detection will encounter some common problems, e.g., very limited number of samples, an extremely high search space, a large number of false positives, and ways to measure the association between disease markers and the phenotype.
To address the problems of computational methods in epistatic interaction detection, we propose a score-based Bayesian network structure learning method, EpiBN, to detect epistatic interactions. We apply the proposed method to both simulated datasets and three real disease datasets. Experimental results on simulation data show that our method outperforms some other commonly-used methods in terms of power and sample-efficiency, and is especially suitable for detecting epistatic interactions with weak or no marginal effects. Furthermore, our method is scalable to real disease data.
We propose a Bayesian network-based method, EpiBN, to detect epistatic interactions. In EpiBN, we develop a new scoring function, which can reflect higher-order epistatic interactions by estimating the model complexity from data, and apply a fast Branch-and-Bound algorithm to learn the structure of a two-layer Bayesian network containing only one target node. To make our method scalable to real data, we propose the use of a Markov chain Monte Carlo (MCMC) method to perform the screening process. Applications of the proposed method to some real GWAS (genome-wide association studies) datasets may provide helpful insights into understanding the genetic basis of Age-related Macular Degeneration, late-onset Alzheimer’s disease, and autism.
To identify genetic variants that affect susceptibility of a variety of diseases, genome-wide association studies (GWAS) genotype a dense set of common SNPs (Single Nucleotide Polymorphism) and test allelic frequencies among a cohort of affected people and non-affected people . Traditional analysis methods for GWAS data only consider one SNP at a time and test its association with disease. This type of analysis strategy is only suitable for simple Mendelian disorders. Some common complex diseases such as various types of cancers, cardiovascular disease, and diabetes are influenced by multiple genetic variants. Therefore, detecting high-order epistasis, which refers to the interactive effect of two or more genetic variants on complex human diseases, can help to unravel how genetic risk factors confer susceptibility to complex diseases . However, the very large number of SNPs checked in a typical GWAS and the enormous number of possible SNP combinations make detecting high-order epistatic interactions from GWAS data computationally challenging . Moreover, how to measure the association between a set of SNPs and the phenotype presents another grand statistical challenge.
During the past decade, two types of heuristic computational methods have been proposed to detect epistatic interactions: prediction/classification-based methods and association-based methods. Prediction/classification-based methods try to find the best set of SNPs, which can generate the highest prediction/classification accuracy including, for example, multifactor dimensionality reduction (MDR) , penalized logistic regression (e.g., stepPLR , and lassoPLR ), support vector machine (SVM) , and random forest . MDR is a non-parametric and model-free method based on constructing a risk table for every SNP combination . If the case and control ratio in a cell of this risk table is larger than 1, MDR will label it as "high risk", otherwise, "low risk". By the risk table, MDR can predict disease risk and will select the SNP combination with the highest prediction accuracy. StepPLR and lassoPLR make some modifications to avoid the overfitting problems that standard logistic regression methods suffer from  when detecting epistatic interactions. For example, stepPLR combines the logistic regression criterion with a penalization of the L2-norm of the coefficients. This modification makes stepPLR more robust to high-order epistatic interactions . Two machine learning methods: SVM  and random forest  have also been applied to detecting epistatic interactions. Machine learning methods are based on binary classification (prediction) and treat cases as positives and controls as negatives in SNP data. They use SVM or random forest as a predictor and select a set of SNPs with the highest prediction/classification accuracy by feature selection. Some prediction/classification-based methods can only be applied to small-scale analysis (i.e., a small set of SNPs) due to their computational complexity. Moreover, almost all prediction/classification-based methods tend to introduce a large number of false positives, which may result in a huge cost for further biological validation experiments .
Bayesian epistasis association mapping (BEAM) is a scalable and association-based method . It partitions SNPs into three groups: group 0 is for normal SNPs, group 1 contains disease SNPs affecting disease risk independently, and group 2 contains disease SNPs that jointly contribute to the disease risk (interactions). Given a fixed partition, BEAM can get the posterior probability of this partition from SNP data based on Bayesian theory. A Markov Chain Monte Carlo method is used to reach the optimal SNP partition with maximum posterior probability in BEAM. One drawback of BEAM is that identifying both single disease SNP and SNP combinations simultaneously makes BEAM over-complex and weakens its power.
Recently, we propose a new Markov blanket-based method, DASSO-MB, to detect epistatic interactions in case-control studies . The Markov Blanket is a minimal set of variables, which can completely shield the target variable from all other variables based on Markov condition property . Thus, Markov blanket methods can detect the causal disease SNPs with the fewest false positives. Furthermore, the heuristic search strategy in Markov blanket methods can avoid the time-consuming training process as in SVM and random forests. However, the faithfulness assumption in Markov blanket methods, which can hardly always be ensured, may hinder their applications in detecting epistatic interactions .
In this paper, we address the two critical challenges (small sample sizes and high dimensionality) in epistatic interaction detection by introducing a score-based Bayesian network structure learning method, EpiBN (Epistatic interaction detection using Bayesian Network model), which employs a Branch-and-Bound technique and a new scoring function. Bayesian networks provide a succinct representation of the joint probability distribution and conditional independence among a set of variables. In general, a score-based structure learning method for Bayesian networks first defines a scoring function reflecting the fitness between each possible structure and the observed data, and then searches for a structure with the maximum score. Comparing to Markov blanket methods, the merits of applying score-based Bayesian network structure learning method to epistatic interaction detection include: (1) the faithfulness assumption can be relaxed and (2) heuristic search method can solve the classical XOR (Exclusive or) problem . We apply the EpiBN method to simulated datasets based on four disease models and three real datasets: Age-related Macular Degeneration (AMD) dataset, late-onset Alzheimer’s disease (LOAD) dataset, and autism dataset. We demonstrate that the proposed method outperforms some commonly-used methods such as SVM, MDR, and BEAM, especially when the number of samples is small.
where Pa(X i ) denotes the set of parents of X i in G. Therefore, there are two components in a Bayesian network. The first component is the DAG G reflecting the structure of the network. The second component, θ, describes the conditional probability distribution P(X i |Pa(X i )) to specify the unique distribution J on G.
Bayesian networks provide models of causal influence and allow us to explore causal relationships, perform explanatory analysis, and make predictions. Genome-wide association studies attempt to identify the epistatic interaction among a set of SNPs, which are associated with one certain type of disease. Therefore, we can use Bayesian networks to represent the relationship between genetic variants and a phenotype (disease status). The n SNP nodes and the disease status/label node form a two-layer Bayesian network and we want to determine which SNP nodes are the parent nodes of the disease status node. In this type of Bayesian network, we only allow edges from SNP nodes to the disease status node. Edges from the disease status node to SNP nodes and edges among SNP nodes are prohibited.
By modelling the association between SNPs and the disease status based on Bayesian networks, we transform detecting epistatic interactions into structure learning of Bayesian networks from GWAS data. There are two types of structure learning methods for Bayesian networks: constraint-based methods and score-and-search methods. The constraint-based methods first build a skeleton of the network (undirected graph) by a set of dependence and independence relationships. Next they direct links in the undirected graph to construct a directed graph with d-separation properties corresponding to the dependence and independence determined [17, 18]. Although constraint-based methods are developed with a rigorous theoretical foundation, errors in conditional dependence and independence will affect the stability of constraint-based methods, especially for small sample problems, which is also a problem of Markov Blanket methods in detecting epistatic interactions. Therefore, in this paper, we focus on score-and-search methods. The score-and-search methods view a Bayesian network as a statistical model and transform the structure learning of Bayesian networks into a model selection problem . To select the best model, a scoring function is needed to indicate the fitness between a network and the data. Then the learning task is to find the network with the highest score. Thus, score-and-search methods typically consist of two components, (1) a scoring function, and (2) a search procedure. Next, we discuss in detail the proposed EpiBN algorithm, which consists of three components: scoring, searching, and screening.
One of the most important issues in score-and-search methods is the selection of scoring function. A natural choice of scoring function is the likelihood function. However, the maximum likelihood score often overfits the data because it does not reflect the model complexity. Therefore, a good scoring function for Bayesian networks’ structure learning must have the capability of balancing between the fitness and the complexity of a selected structure. There are several existing scoring functions based on a variety of principles, such as the information theory and minimum description length (e.g. BIC score, AIC score, and MDL score) [20–22] and Bayesian approach (BDe score) .
The BIC score and AIC score are derived from some approximations when the number of samples N approaches infinity . If the number of samples is small, the approximation in the inference of both AIC and BIC scores can not hold any more and the structure penalty term in Eq. (5) and Eq. (6) are not suitable . When using information-based scores in the Bayesian network model to detect epistatic interactions by GWAS data, which show a non-skewed distribution, the BIC score is too strict and prefers to select simple structures, while the AIC score prefers to select complex structures .
where C(LDS) is the complexity of the local disease structure, q is the number of configurations of parent SNP nodes, r is the number of states of the disease status node, N jk is the number of instances where the disease status node takes its k-th value and the parent SNP nodes take their j-th configuration, N j is the number of instances where the parent SNP nodes take their j-th configuration, and .
where N k is the number of instances in which the disease status node takes its k-th value, and .
i.e. the mutual information between X and Pa(X) coincides with the difference between the log-likelihood of LDS and LDS 0 .
where Cat(V) is the number of categories of the variable V, and thus Cat(X) = r and Cat(Pa(X)) = q .
where log P(D|LDS 0) is a constant.
The distribution of G 2 asymptotically approximates to that of χ 2 with the same number of degrees of freedom . The χ 2 distribution with k degrees of freedom has a variance of 2k, and therefore 2DF(G 2(X, Pa(X))) is the variance of the corresponding G 2 distribution. Since G 2(X, Pa(X)) reflects the bias, the AIC score in Eq. (16) indicates a trade-off between bias and variance in terms of the G 2 statistic G 2(X, Pa(X)) and its variance.
where Var D (G 2(X, Pa(X))) comes from the estimation of the variance of the corresponding G 2 distribution from data. Our new scoring function estimates the penalty term from the data to make it consistent with the data, which is similar to the DIC (Deviance Information Criterion) score trying to identify models that best explain the observed data .
Due to the estimation of the variance of G 2(X, Pa(X)) from data in Eq. (17), EpiScore is not score-equivalent. However, we are not very concerned about this: there are no equivalent structures in the two-layer Bayesian network for the restriction on the direction of edges we describe in the previous section.
The computational task in score-and-search methods is to find a network structure with the highest score. The searching space consists of a super-exponential number of structures and thus exhaustively searching optimal structure from data for Bayesian networks is NP-hard . One simple heuristic search algorithm is greedy hill-climbing algorithm, where three types of operators are defined to change one edge at each step: adding an edge, removing an edge, and reversing an edge. By these three operators, we can construct the local neighbourhood of the current network. Then we select the network with the highest score in the local neighbourhood to get the maximal gain. This process can be repeated until it reaches a local maximum. However, greedy hill-climbing algorithm cannot guarantee a global maximum . Other structure learning methods for Bayesian networks include Branch-and-Bound (B&B) [28, 32] and Markov chain Monte Carlo .
Even though the B&B algorithm uses an upper score bound to reduce the searching space, it still has an exponential time complexity in the worst case and is not feasible to be directly applied to real GWAS data. Therefore, an efficient screening method is necessary. Traditional screening methods assign a score to every single SNP and select a subset of SNPs with high scores. However, these methods ignore the joint effect of SNPs on disease and are not suitable for detecting epistatic interactions from real GWAS data.
The likelihood of local disease structure, P(D|LDS), can be calculated by Eq. (17). Based on the result from MCMC method, we select SNP nodes associated with edges whose posterior probabilities larger than 0. Since we consider the association of multiple SNPs with disease status at each step of the Markov chain in our MCMC method, the potential disease SNPs related with epistatic interactions will be kept in the final subset of SNPs.
Simulated Data We first evaluate the proposed EpiBN method on four simulated data sets, which are generated from three commonly used two-locus epistatic models in  and one three-locus epistatic model developed in . Model-1 is a multiplicative model, model-2 demonstrates two-locus interaction multiplicative effects, and model-3 specifies two-locus interaction threshold effects. There are three disease loci in model-4 . Some certain genotype combinations can increase disease risk in model-4 and there are almost no marginal effects for each disease locus.
We generate data based on the similar parameter settings as in [9–11] for three parameters associated with each model: the marginal effect of each disease locus (λ), the minor allele frequencies (MAF) of both disease loci, and the strength of linkage disequilibrium (LD, quantified by the squared correlation coefficient r 2 calculated from allele frequencies) between the unobserved disease locus and a genotyped locus . For each parameter setting on each model, we generate 50 datasets and each dataset contains 100 markers genotyped for 1,000 cases and 1,000 controls. To measure the performance of each method, we use power as our evaluation criterion, which is defined as the proportion of simulated datasets in which only the true diseases associated markers are identified without any false positives.
Accuracy comparison of EpiBN, BEAM, and SVM.
0.76 ± 0.27
0.76 ± 0.27
0.34 ± 0.38
0.87 ± 0.32
0.75 ± 0.34
0.32 ± 0.43
0.61 ± 0.29
0.91 ± 0.19
0.43 ± 0.31
0.90 ± 0.21
0.90 ± 0.20
0.14 ± 0.29
0.91 ± 0.26
0.75 ± 0.31
0.29 ± 0.38
0.69 ± 0.29
0.95 ± 0.15
0.34 ± 0.31
0.78 ± 0.30
0.79 ± 0.30
0.31 ± 0.43
0.83 ± 0.35
0.74 ± 0.37
0.34 ± 0.49
0.72 ± 0.28
0.88 ± 0.24
0.33 ± 0.35
1.00 ± 0.00
1.00 ± 0.00
0.00 ± 0.00
0.41 ± 0.49
0.20 ± 0.29
1.05 ± 0.47
0.41 ± 0.32
0.61 ± 0.38
0.76 ± 0.40
Comparison of EpiScore, BIC score, and AIC score.
Sample efficiency Typically, GWAS can not generate a large number of samples due to the high experiment cost. Thus, the performance of various computational methods for epistatic interaction detection in case of small samples is important. We explore the effect of the number of samples on the performance of EpiBN, MDR, BEAM and SVM. The parameters used are: λ = 1.1 for model-1, λ = 0.9 for model-2, λ = 1.8 for model-3, and θ = 7 for model-4. To test the scalability of EpiBN on large number of SNPs, we generate synthetic datasets containing different number of SNPs (40, 200, and 1000) genotyped for different number of samples (100, 200, 300, 400, 600, 1000, and 2000) with r 2 = 1 and MAF = 0.5.
In this section and the following two sections, we apply EpiBN to large-scale datasets in real genome-wide case-control studies, which often require genotyping of 30,000-1,000,000 common SNPs. We first make use of an Age-related Macular Degeneration (AMD) dataset containing 116,204 SNPs genotyped with 96 cases and 50 controls  (i.e., high dimensionality and small sample sizes). Multiple genetic factors cause AMD, which is a complex retinal degenerative disorder. To remove inconsistently genotyped SNPs, we perform the same filtering process as in [11, 38]. After filtering, there are 97,327 autosomal SNPs remained.
We first perform the screening process and select 51 potential disease SNPs related with AMD by MCMC method. Among these 51 selected SNPs, EpiBN detects two associated SNPs showing the highest EpiScore: rs380390 and rs2402053. Klein et al. demonstrated that the first SNP, rs380390, is associated with AMD . The second SNP, rs2402053, is intergenic between TFEC and TES in chromosome 7q31 . Even though no evidences show that rs2402053 is related with AMD, it is worth noting that mutations in some genes on 7q31-q32 are revealed in patients with retinal disorders . Therefore, rs2402053 may be a new genetic factor, on chromosome 7q, contributing to the underlying mechanism of AMD.
Late-onset Alzheimer’s disease (LOAD) is the most common form of Alzheimer’s disease and usually occurs in persons over 65. It causes patients’ degeneration of the ability of thinking, memory, and behaviour. The apolipoprotein E (APOE) gene is one genetic factor that accounts for affecting the risk of LOAD. There are three common variants of the APOE gene: ε2, ε3, and ε4. The appearance of the ε4 allele in a person’s APOE genotype increases the LOAD risk . Rieman et al. conducted genome-wide association studies to detect other generic risk factors related with LOAD . They found 10 SNPs showing significant association with LOAD in the APOE ε4 carriers. All these 10 SNPs are in the GRB-associated binding protein 2 (GAB2) gene.
We download the LOAD GWAS data from http://www.tgen.org/neurogenomics/data. After pre-processing, we have 287,479 SNPs and 1408 samples (857 cases and 551 controls). EpiBN keeps APOE as one parent of the disease status node and identifies two other SNPs: rs1931565 and rs4505578, which may interact with APOE and affect the LOAD risk. The rs1931565 SNP is intergenic between ABCA4 and ARHGAP29 in chromosome 1p22. ABCA4 is related with some brain-related diseases including stargardt disease 1, early-onset severe retinal dystrophy and age-related macular degeneration . On the other hand, some ABC transporter family genes such as ABCA1, ABCA2, ABCA7 and ABCA12 are associated with Alzheimer’s disease . Therefore, we can speculate that the interaction among rs1931565, rs4505578 and APOE may affect some brain functions and therefore increase the LOAD risk. Our results do not contain any of the 10 SNPs in GAB2 found in . One reason is that Rieman et al. only explored two-locus interactions related with LOAD. In fact, the epistatic interactions are very complicated. If we restrict the number of genetic risk factors as two, we will miss some potential disease SNPs associated with complex diseases.
Autism is a common early onset neurodevelopmental disorder, which affects the brain’s normal development and impairs social interaction and communication. To pinpoint the causal SNPs and genes involved in autism, a large number of genotyping data have been generated from subjects with and without autism. Some of the genotyping data have been deposited on the AGRE (Autism Genetic Resource Exchange) website http://www.agre.org for further analysis by the research community. In this paper, we analyse one of the largest genotype dataset contributed by Children’s Hospital of Philadelphia (CHOP), which contains 513,312 SNPs genotyped from 1784 cases and 2441 controls . To reduce the searching space and focus on more relevant SNPs, we only use SNPs in autism-related genes. We first get information of 277 autism-related genes from the autism genetic database (AGD) http://wren.bcf.ku.edu/. Then we obtain a list of 205,589 SNPs in these autism-related genes from UCSC genome browser http://genome.ucsc.edu/. The CHOP dataset contains 9330 of these 205,589 SNPs. Samples with missing rate > 10% and SNPs with missing rate > 10%, MAF < 0.05, and p-value of HWE < 0.001 were removed. After applying the aforementioned filtering process, our genotype dataset contains 4222 samples (1783 cases, 2439 controls) and 8198 SNPs.
Heterogeneity of phenotypic presentation in autism makes it difficult to detect epistatic interactions related with this complex disorder . One proposed approach to reduce the phenotypic heterogeneity of autistic subjects is dividing them into several subgroups by clustering method on ADI-R (Autism Diagnostic Interview-Revised) data . The ADI-R is a clinical diagnostic interview to assess autism in children and adults and contains 93 items about behaviours in three domains: quality of social interaction, communication and language ability, and repetitive, restricted and stereotyped interests and behaviour . In this paper, we use an alternative method to reduce the phenotypic heterogeneity: biclustering . Biclustering methods can identify subgroups of autism samples that show similar behaviour patterns on a specific subset of ADI-R items. By applying the biclustering method , we find a bicluster of constant value in 235 subjects for 8 out of 77 numerically scored ADI-R items (0 = normal; 3 = most severe). These autistic subjects have the same ADI-R score (i.e., 3 which is most severe) on the 8 ADI-R items which are: CCONVER, CINAPPQ, CPRON, CNEOID, CVERRIT, CINR, CSPEECH, and CFRIEND. Most of these 8 items are about verbal and nonverbal communication. Therefore these 235 autism samples may represent a subset with the most severe communication problems.
To explore the genetic basis in the identified more homogeneous subset, we use the SNP data for these 235 autistic subjects (cases) and 2439 controls in CHOP dataset. The MCMC method first selects 111 candidate SNPs. Then our EpiBN detects an epistatic interaction of three SNPs: rs706363, rs7780487, and rs12536378. The first SNP, rs706363, is on the autism candidate gene DAB1 on chromosome 1. Both rs7780487 and rs12536378 are on the autism candidate gene DPP6 on chromosome 7. If we search HPRD (Human Protein-protein Interaction Database), we can find a pathway from DAB1 to DPP6: DAB1--APLP2--PRNP--DPP6 . This suggests a potential interaction between the detected SNPs using our EpiBN, which warrants further investigations to assess this in silico prediction by molecular functional experiments.
Jiang et al. also tried to use score-based Bayesian network structure learning methods to detect epistatic interactions . They evaluated the performance of 22 BN scoring criteria by scoring all 1-SNP, 2-SNP, 3-SNP, and 4-SNP Bayesian networks on simulation datasets and showed that the BDeu score with large values of hyperparameters α achieved the best performance. Since the prior knowledge on the optimal α and the true number of disease SNPs is unknown, their methods can hardly address the two critical challenges (small sample sizes and high dimensionality) in epistatic interaction detection very well.
To address the two critical challenges (small sample sizes and high dimensionality) in epistatic interaction detection from GWAS data, several machine learning or statistical methods have been proposed during the past decade. However, these proposed machine learning or statistical methods still encounter some problems: scalability to real genome-wide dataset, tending to introduce false positives, sample-efficiency, and poor performance when detecting epistatic interactions with weak or no marginal effects.
In this paper, we propose a Bayesian network-based method, EpiBN, to detect epistatic interactions. We develop a new scoring function, which can reflect higher-order epistatic interactions by estimating the model complexity from data, and apply a fast B&B algorithm to learn the structure of a two-layer Bayesian network containing only one target node. To make our method scalable to GWAS data, we propose the use of a MCMC method to perform the screening process.
We apply the proposed method to both simulated datasets based on four disease models and three real datasets. Our experimental results demonstrate that our method outperforms some other commonly-used methods and is scalable to GWAS data.
This work is supported by the US National Science Foundation Award IIS-0644366 and the KCALSI-10-3 Patton Trust Grant.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.