Data Collection and Preprocessing
We utilized SAGE data [2], which featured 3,829 subjects and considered 948,658 SNPs from across the human genome, as well as several demographic variables. The data included human samples from three prior studies [2]; 30% of the individuals were African Americans and 70% were European Americans. The SAGE dataset includes 1,897 Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) cases and 1,932 alcohol-exposed non-dependents. We used 15 environmental variables (demographic factors) that are listed in Table 1. Several demographic factors were left out, especially ones relating to comorbidities, because their distributions across the cases and controls were heavily imbalanced. All continuous demographic variables in the data (e.g. income) were discretized. We first removed any SNPs out of Hardy-Weinberg equilibrium (P < 0.0001). Hardy-Weinberg equilibrium tests were run separately on the African Americans and the European Americans in order to ensure identification of any SNPs common only in one race out of equilibrium. SNPs with minor allele frequency (MAF) below 0.01 or call rate below 98% were also removed from consideration, leaving a total of 934,128 SNPs. Finally, the 3,776 samples (1909 cases and 1867 controls) with a genotyping rate above 98% were maintained. A Cochran-Mantel-Haenszel (CMH) association test was used to rank the 934,128 SNPs [30]. The association analysis was performed with the software PLINK [37]. The top 652 SNPs (p < 0.0005) were maintained for network construction as detailed in the next few subsections.
Maximum-weight Dependence Tree (MWDT)
First-order dependence tree of maximum weight is proposed initially by Chow and Liu [16] and further developed and evaluated by Friedman et al. [38]. Although there is no biological evidence that dependence between variables (genes or SNPs) follow a tree structure, but limitations on the number of available sample points compared to the complexity of the problem in hand require the joint distribution of variables be approximated by some simplifying assumptions. In this regard, tree dependence assumption is made to approximate a n
th order joint probability distribution by a product of n-1 s-order distributions. To understand the working principle in the context of GWAS, let P(x) denote the probability mass function of a random vector x. The mutual information between two variables (here SNP1 and SNP2) is given by
$$ I\left({\mathrm{SNP}}_1,{\mathrm{SNP}}_2\right)={\displaystyle \sum_{{\mathrm{SNP}}_1,{\mathrm{SNP}}_2}} P\left({\mathrm{SNP}}_1,{\mathrm{SNP}}_2\right)\ \log \left(\frac{P\left({\mathrm{SNP}}_1,{\mathrm{SNP}}_2\right)}{P\left({\mathrm{SNP}}_1\right) P\left({\mathrm{SNP}}_2\right)}\right) $$
Intuitively, I(SNP1, SNP2) measures the amount of information that SNP1 carries about SNP2 and vice versa. In a graphical representation of dependency among SNPs, we assume the dependencies have a tree structure (meaning each node has a single parent and one node (the root) has no parent), and assign to every edge of the tree an \( I\left({\mathrm{SNP}}_i,{\mathrm{SNP}}_{m_i}\right) \). Then the tree with the maximum weight is the one that maximizes \( {\displaystyle {\sum}_{i=1}^n I\left({\mathrm{SNP}}_i,{\mathrm{SNP}}_{m_i}\right)} \) where m
i
denotes the parent node of node i and n is the number of SNPs under study. Note that there is no difficulty to maximize \( {\displaystyle {\sum}_{i=1}^n I\left({\mathrm{SNP}}_i,{\mathrm{SNP}}_{m_i}\right)} \) without considering the class labels; however, doing so leads to a static network that may not differentiate one class from another. In other words, it is not possible to use the network as an inferential tool. The technique originally proposed in [16] resolves this problem by stratifying the samples at the outset and constructing one network for each class. Nevertheless, having a different network of interactions for each class will not only make the inference a more difficult and elusive task, but may not have a biological ground either.
In a case–control study, we can define a “class” variable C to measure the amount of information between SNPs given the phenotype (case or control). In this case, the maximum weight first-order dependence tree becomes the one with the maximum \( {\displaystyle {\sum}_{i=1}^n I\left({\mathrm{SNP}}_i,{\mathrm{SNP}}_{m_i}\Big|\mathrm{C}\right)} \). By the first-order tree assumption on the structure of dependencies between SNPs, one can write the joint distribution between all SNPs given C as
$$ P\left({\mathrm{SNP}}_1,{\mathrm{SNP}}_2,\dots, {\mathrm{SNP}}_n\Big|\mathrm{C}\right)={\displaystyle \prod_{i=1}^n} P\left({\mathrm{SNP}}_i\Big|{\mathrm{SNP}}_{m_i},\mathrm{C}\right) $$
This decomposition of joint probability to product of “second-order” distributions or the distribution of first-order tree dependence leads to an algorithm that can “grow” the tree in polynomial time (Kruskal algorithm detailed in [16]). In practice, the knowledge of conditional probability distributions is not available, and they must be estimated from data. Nevertheless, it can be shown that due to decomposition of joint probability distributions as mentioned above, the strategy that finds the tree with maximum weights is also the maximum likelihood estimate (MLE) of the joint distribution. In other words, finding the tree with maximum \( {\displaystyle {\sum}_{i=1}^n\widehat{I}\left({\mathrm{SNP}}_i,{\mathrm{SNP}}_{m_i}\Big|\mathrm{C}\right)} \), with \( \widehat{I}\left({\mathrm{SNP}}_i,{\mathrm{SNP}}_{m_i}\Big|\mathrm{C}\right) \) being the sample estimate of \( I\left({\mathrm{SNP}}_i,{\mathrm{SNP}}_{m_i}\Big|\mathrm{C}\right) \), is equivalent to the MLE of the joint distribution of SNPs, P(SNP1, SNP2, …, SNP
n
|C), under the first order dependence tree structure (see [16]). This implies that if the true dependence between SNPs has a tree structure, then as the sample size increases, the estimated trees converge to the true tree with probability one. For further details on estimating \( I\left({\mathrm{SNP}}_i,{\mathrm{SNP}}_{m_i}\Big|\mathrm{C}\right) \), see Additional file 1, Supplementary Notes, Section 2. Another interesting feature of MWDT is that approximating and estimating the joint distribution of SNPs create a flow of information among nodes of the network. As opposed to other network approaches based on mutual information [17–19], this interesting property of the network gives us the ability to employ the network as an inferential tool. For example, for an observation of unknown class, one can assign a case label if
$$ {\displaystyle \prod_{i=1}^n} P\left({\mathrm{SNP}}_i\Big|{\mathrm{SNP}}_{m_i},\mathrm{C}=\mathrm{case}\right) > {\displaystyle \prod_{i=1}^n}\left({\mathrm{SNP}}_i\Big|{\mathrm{SNP}}_{m_i},\mathrm{C}=\mathrm{control}\right) $$
AUC in Ranking Networks of Interactions
From the previous section, we have to note that the MWDT guarantees the maximum likelihood estimate of the joint distribution given the true tree dependency among a set of given SNPs. However, for a set of SNPs of size n (here 652 SNPs selected as described before), there will be 2n-1 potential maximum weight networks that can be constructed on any subset of n variables. Of course one may choose to grow the tree on all n SNPs but here we propose a complementary step to further narrow down the list of potential genetic factors used in the proposed network of alcoholism. To do so, we use the network as a classifier and use the AUC to rank a set of potential networks (see next subsection) and choose the one with the highest AUC. Unless otherwise stated, we employ 3-fold cross-validation procedure to compute AUCs. Nevertheless, since for the initial dimensionality reduction step, we use the CMH test on the full training data, we shall not interpret AUC as the predictive ability of our constructed network on a subset of SNPs and/or other factor. In other words, the use of AUC here is merely a measure to rank constructed sub-networks of interactions.
Ranking mechanism
To construct the optimal network of interactions, two approaches were employed: one is a backward sequential iterative approach described below, and the other is an approach based on a combination of linkage disequilibrium (LD) analysis [39] and the backward iterative approach. In the (backward) iterative approach, the MWDT was first trained with the remaining SNPs and the 15 demographic variables as part of the network. In each subsequent iteration, the 50 SNPs with the largest CMH p-values were removed and a new network was constructed using the reduced list of SNPs. The best network was the one with the highest AUC in differentiating cases from controls. The LD analysis-based approach sought to eliminate redundant SNPs. LD analysis was performed and SNPs that were strongly linked (i.e. frequently co-occurred in both the cases and the controls) were grouped into bins. The approach outlined by Carlson, et al. [40], with the r2 threshold lowered from 0.8 to 0.4, was used to produce a single tag SNP for each LD bin. Only the tag SNPs were maintained, and the iterative approach was applied to them. This approach ensures that multiple SNPs that are proxies due to low LD distance are not selected. The tag SNP acts as a proxy for all SNPs in that region. The best networks from the two approaches were compared, and the one with the highest AUC was selected as the SNP×SNP×E network.
Analysis of network composition
To study the gene level interactions with the phenotype based on SNP level variations, we enumerate all genes with at least one SNP in the network. For each gene, we construct a sub-network of SNPs involved in the full SNP×SNP×E network located on that gene and record the AUC of a newly constructed sub-network. We consider race and sex as part of each sub-network. This would unlock the full potential of race- or sex-specific SNPs. In a sense, this analysis is similar to the adjustment for sex and age in the classical regression analysis.
We next considered important demographic features. To evaluate the importance of each demographic factor, we calculated the decline in resubstitution AUC (AUC on the training set) upon removal of that factor and all edges connected to it from the full SNP×SNP×E network. Resubstitution was used because the response of cross-validation AUC to minor changes is relatively imprecise due to larger variance of cross-validation estimators [41]. We used the Molecular Signatures Database [42] to determine the lists of genes related to 186 pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [43]. For each KEGG pathway, we recorded the AUC of the corresponding sub-network constructed using SNPs in the full network that are within the pathways’s genes, as well as race and sex. Finally, to detect most important interactions, we successively removed each edge in the full SNP×SNP×E network and recorded the decline in AUC. The analysis left us with an AUC for each gene, pathway, and a decline in AUC for each demographic feature and interaction. Rather than reporting the actual AUCs, which here is merely used for ranking purposes, we calculated a p-value associated with each AUC. Although here ranking based on AUC or p-value leads to the same result, we use the p-value threshold of 0.05 (non-adjusted) to narrow down the list.
To determine a p-value for each gene- or pathway-specific network, we constructed 1,000 networks, each with the same number of nodes for which the AUC in question was calculated, and determined their AUCs. The set of genetic features for each of the 1,000 networks was drawn randomly from the background set of SNPs. Race and sex were included as features in all 1,000 networks in order to ensure parity with the procedure used to generate the gene- or pathway-specific network. The list of 1,000 random AUCs enabled the calculation of a p-value for the AUC in question.
To determine the statistical significance of each decline in AUC (used for quantifying the importance of each demographic variable and the interactions in the SNP×SNP×E network), we used the same background set to construct 1,000 random networks with the same set of demographic factors and the same number of SNPs as in the SNP×SNP×E network. For each randomly generated model, we recorded the decline in AUC upon removal of a random SNP (in the case of the declines in AUC for demographic factors) or a random edge (in the case of the declines in AUC for interaction). The 1,000 random declines in AUC enabled the calculation of a p-value for the decline in AUC of interest. Each gene, demographic factor, pathway, and interaction relationship was now associated with a p-value.