Simulation Results
The performance of GGM was evaluated with both large (N = 200) and small (N = 50) datasets. The simulated gene network contains 256 true edges. With large-sample data GGM identified 140 significant edges (54.69%), out of which 126 are true (90%), while only 97 (37.89%) significant edges (83 of which are true, 85.57%) were identified with small-sample data. The false-positive rate is about 0.3% in both cases. In [7] the power (the proportion of true edges identified) is about 50–60% for N = 200 and about 10–20% for N = 50; therefore our observed power is comparable to the simulation results in the large-sample case, and significantly better in the small-sample case, which may be due to the design of the "true network" based on a real data set.
We next added SNP data to the defined networks by testing the association between the SNPs and the genes for gene pairs with strong links. The large-sample case simulation includes 1,777 true cis-acting (i.e., within 0.005) SNP-gene associations. Though conditional dependency between SNPs and genes mapping to distances greater than 0.005 were permitted in the simulation (that is, cis-associated SNP with gene A could be associated with gene B due to correlations between genes A and B), we did not simulate any independent trans-acting effects. We detected 1,120 significant associations (63.03%), of which 1,113 were true (99.38%, i.e., only 7 false positives). We note that no significant independent trans-acting associations were detected, consistent with the simulated model. In contrast, failure to consider conditional dependence would reveal a large number of indirect associations: linear regression of all possible SNP-trait associations (regardless of distance) identifies 507 significant associations (out of a possible 3,447), though these associations are all indirect. As expected in light of the large number of comparisons performed, SNP-association mapping in the small-sample case is underpowered (only 170 of 1,244 true associations are detected, 13.67%), though no false-positive cis-acting association were observed, and again, no false-positive direct trans-acting associations were identified (compared with 61 of 2,414 detected by linear regression).
Asthma Integrative Genomics Dataset
An integrative genomics study of the genetics of gene expression was performed in a subset of young adult asthmatics (n = 299) participating in the Childhood Asthma Management Program (CAMP) Phase 2, a multicenter follow-up study of childhood asthma. CD4+ lymphocytes were isolated from peripheral blood samples with anti-CD4+ microbeads by column separation (Miltenyi Biotec, Auburn CA). Total RNA was extracted from CD4+ lymphocytes with the QIAGEN RNeasy Mini Protocol (Valencia, CA). Expression profiles were generated with Illumina HumanRef8 v2 BeadChip oligonucleotide arrays (Illumina, San Diego CA) and scanned with the BeadArray scanner. Raw expression intensities were processed with the lumi package [8]. Each array underwent background adjustment with RMA convolution [9] and log2 transformation for variance stabilization. The combined samples were quantile normalized. Adequate DNA for genome-wide genotyping was available for 154 subjects of self-reported white ancestry with RNA samples. Genotyping was performed with the Illumina Infinium II HumanHap550 Genotyping BeadChip. We apply our method to a dataset of genome-wide SNP genotype data (534,290 autosomal markers) and peripheral blood CD4+ lymphocyte gene expression profiles (22,177 transcripts) in 154 asthmatic subjects. Schäfer and Strimmer [7] suggest that the power and positive discovery rate drop to zero when the number of genes outnumbers the samples by fivefold. These effects are also evident in our simulations. We therefore performed necessary data reduction using genefilter [10] by focusing only on 3,203 RefSeq-annotated genes variably expressed across samples (minimum interquantile range of 1.0 on log2 scale). Though p exceeded n by more than 20-fold, our GGM modeling identified 513,203 gene-gene associations (i.e., edges) with posterior probability ≥ .50, representing 10.01% of all possible
edges. To validate the gene network, we apply the same algorithm on an independent, publicly available gene expression set from CD4+ lymphocytes in both asthmatics and normal subjects (GEO series 473, see [11]). With a more stringent threshold (posterior probability ≥ 0.90), we find that a significant proportion of edges overlap in the two data sets (1,913 edges, p = 5.5 × 10-106). We also find 40 genes with over 100 significant connections with other genes in both datasets [See Additional file 1], which can be considered as hubs [12]. The extensive and reproducible connectivity of these 40 hub genes suggests they are of particular biological interest in CD4+ lymphocytes.
To illustrate the utility of network modeling using GGM, we focus on the network segment surrounding one of the identified hub genes – the beta-2 subunit of the Interleukin-12 receptor (IL12RB2, OMIM *601642). The IL12 receptor is constitutively expressed on CD4+ lymphocytes and is induced by antigen receptor triggering, and its ligand (IL12) is a potent immunomodulator in allergic airways disease [13]. GGM identified 306 genes with direct edges to IL12RB2 in the CAMP dataset [see Additional file 2]. 5,611 SNPs map to within 50 kb of these genes. Because the number of SNPs per gene is generally small (18.34 SNPs on average), we can efficiently estimate partial networks over the 5,611 SNPs quickly, as described in step 3. After FDR adjustment we identify 225 SNP-gene pairs (4.01%) with significant association [see Additional file 3]. Given the simulation results, we expect few of them to be false positives. Target candidates of interest include RAP1A [14] and TBKBP1 [15].
SNP-association mapping is productive for non-hub genes as well. For illustrative purposes, we mapped regulatory variants for genes linked to Interleukin-1B (IL1B, OMIM *147720), another biological asthma candidate gene. 353 SNP-gene pairs (5.2% of all cis-acting pairs tested) with significant association were identified with FDR-adjusted p-value ≤ 0.05 [see Additional file 4].
GeneVar Dataset
We used the publicly available GeneVar dataset http://www.sanger.ac.uk/humgen/genevar/ to assess the reproducibility of our network building. GeneVar consists of gene expression profiling of Epstein-Barr virus-transformed lymphoblastoid cell lines for the 270 HapMap Consortium (Phase II) individuals and their available genotype data (3,967,792 SNPs) [3].
We apply our method to the 30 Caucasian trios in GeneVar. The dataset contains expression values from 47,293 probes for each individual, and we use the same filtering criteria as in CAMP to filter it down to 2,247 genes. The gene network contains 146,310 (5.8%) significant edges with posterior probability greater than 0.5, 67,345 (2.67%) with posterior probability greater than 0.8, and 44,207 (1.75%) greater than 0.9. With only 90 samples in the dataset, the latter categories are likely most reliable.
To compare the networks from GeneVar and CAMP, we focus on a subset of 608 genes that appear in both datasets after filtering. Using the posterior probability threshold of 0.9, we find 3,302 significant edges in CAMP and 2,239 in GeneVar. There are 86 edges that appear in both sets (See Figure 3), considerably more than would be expected by chance (p-value of 6.77 × 10-11), but far fewer than the large overlap of 1,913 noted between the two datasets.
IL1B is in both datasets after filtering. Using the 0.9 threshold, we find 51 genes directly linked to IL1B in GeneVar, and we estimate the gene-SNP network with 5,404 SNPs within 50 Kb of those genes. We find 133 significant SNP-gene pairs [see Additional file 5]. In CAMP, however, there are only 4 genes connected to IL1B in the reduced dataset and no overlap with GeneVar. Note that the subset accounts for only 2.74% of the genes in the original dataset.
We note that graph models can facilitate biologic interpretation of observed associations of SNP with multiple genes by distinguishing between conditionally independent (Model 1 in Figure 1) and dependent (Model 2) SNP-gene associations. For example, of the 133 significant SNP-gene associations noted in the GeneVar IL1B subset analysis, 25 (18.8%) of these SNPs are associated with IL1B expression in a univariate analysis (for example, by linear regression of SNP genotype on the expression trait – see Additional file 5). None of these trans-associations remain significant when the effects of the cis-associated genes are considered, suggesting that the SNPs identified influence IL1B expression by first altering the expression of the cis-associated gene, thereby strengthening support for a true gene-gene interaction. Without considering conditional dependency, these insights would be missed.