Machine learning identifies SNPs predictive of advanced coronary artery calcium in ClinSeq® and Framingham Heart Study cohorts

One goal of personalized medicine is leveraging the emerging tools of data science to guide medical decision-making. Achieving this using disparate data sources is most daunting for polygenic traits and requires systems level approaches. To this end, we employed random forests (RF) and neural networks (NN) for predictive modeling of coronary artery calcification (CAC), which is an intermediate end-phenotype of coronary artery disease (CAD). Model inputs were derived from advanced cases in the ClinSeq® discovery cohort (n=16) and the FHS replication cohort (n=36) from 89th−99th CAC score percentile range, and age-matching controls (ClinSeq® n=16, FHS n=36) with no detectable CAC (all subjects were Caucasian males). These inputs included clinical variables (CLIN), genotypes of 57 SNPs associated with CAC in past GWAS (SNP Set-1), and an alternative set of 56 SNPs (SNP Set-2) ranked highest in terms of their nominal correlation with advanced CAC state in the discovery cohort. Predictive performance was assessed by computing the areas under receiver operating characteristics curves (AUC). Within the discovery cohort, RF models generated AUC values of 0.69 with CLIN, 0.72 with SNP Set-1, and 0.77 with their combination. In the replication cohort, SNP Set-1 was again more predictive (AUC=0.78) than CLIN (AUC=0.61), but also more predictive than the combination (AUC=0.75). In contrast, in both cohorts, SNP Set-2 generated enhanced predictive performance with or without CLIN (AUC> 0.8). Using the 21 SNPs of SNP Set-2 that produced optimal predictive performance in both cohorts, we developed NN models trained with ClinSeq® data and tested with FHS data and replicated the high predictive accuracy (AUC>0.8) with several topologies, thereby identifying several potential susceptibility loci for advanced CAD. Several CAD-related biological processes were found to be enriched in the network of genes constructed from these loci. In both cohorts, SNP Set-1 derived from past CAC GWAS yielded lower performance than SNP Set-2 derived from “extreme” CAC cases within the discovery cohort. Machine learning tools hold promise for surpassing the capacity of conventional GWAS-based approaches for creating predictive models utilizing the complex interactions between disease predictors intrinsic to the pathogenesis of polygenic disorders.

Best-guessed genotypes with imputation quality >0.3 were used for markers that were not available from 196 the first two actual genotyping platforms. 197 We compiled a set of 57 SNPs (listed in Table S4) that were associated with coronary calcium in SNPs in both SNP sets were coded as 0 or 2 (homozygous for either allele) or 1 (heterozygous) using the 203 same reference alleles in both ClinSeq R and FHS cohorts.

204
Predictive modeling using random forests and neural networks 205 We implemented the random forest classification method using the Statistics and Machine Learning After ranking all features in each distinct feature set (e.g., all clinical variables), we decreased the number 228 of features gradually by leaving out weaker predictors to identify the optimal predictive performance 229 and the corresponding optimal set of features. We repeated this procedure to compare the predictive 230 performances of models trained and tested by combining clinical and genotype data, as well as using each 231 layer data in isolation. By identifying the predictive SNPs in GWAS-based SNP Set-1 and the alternative 232 SNP Set-2, we were also able to compare the cumulative predictive importance scores from SNP-risk 233 factor and SNP-CAD related phenotype associations (tied to each SNP set) based on past studies. The 234 predictive patterns generated by data from the ClinSeq R discovery cohort were also compared with the 235 patterns generated by the independent FHS replication cohort. Finally, random forest models were also 236 used to identify a subset of SNPs in SNP Set-2 that generated the optimal predictive performance in both 237 ClinSeq R and FHS cohorts.

238
Upon identifying the subset of SNPs in SNP Set-2 that generate random forest models with optimal 239 performance in both cohorts, we implemented a neural-network-based classification approach using  We trained all network configurations with the genotypes of the optimal subset of SNPs within SNP 248 Set-2 from all subjects in the ClinSeq R discovery cohort (approximately 20% of these samples include below the median AUC value obtained when the case-control statuses are not permuted (i.e., actual data).

6/24
We first built random forest models using all of the nine clinical variables from the ClinSeq R discovery 259 cohort and identified that three of them had positive predictive importance values as listed in Table  1. These predictors included HDL Cholesterol (a major risk factor for coronary calcium (Allison and to the design of our study. 275 We next built random forest models for the ClinSeq R discovery cohort using the genotypes of the 57  (Table S6), whereas six of them have been linked to CAD, MI, stroke, and aortic 279 valve calcium (Table S7). For a detailed discussion of these associations and loci (including PCSK9 and 280 9p21), we refer the reader to Supplementary Text (Section 1).

281
To compare the predictive patterns generated by the discovery and replication cohorts based on the 282 SNP Set-1 genotype data, we next developed random forest models for the FHS replication cohort and 283 identified 19 SNPs among SNP Set-1 with positive predictive importance in this cohort. Figure    Previous GWAS and meta-analyses studies on CAC focused on the presence of coronary calcium (in-303 cluding low levels of CAC), rather than its extreme levels. Since our discovery and replication cohorts 304 both included cases with CAC scores within 89 th -99 th percentile range, we next targeted the ClinSeq R 305 discovery cohort genotype data to identify SNPs highly predictive of advanced CAC in order to utilize the 306 advantages provided by random forest models (ability to identify optimal model structure for training data 307 while utilizing interactions between SNPs without multiple testing penalty) over GWAS and conventional 308 regression approaches. Expecting a correlation between the SNP genotype and the binary advanced CAC 309 state (healthy control vs. 89 th -99 th percentile CAC score range) for such predictive SNPs, we used a genotype-phenotype correlation criterion to identify an additional SNP set with approximately the same 311 size as the SNP Set-1 from the ClinSeq R discovery cohort data. First, we verified the rationale behind the 312 implemented genotype-phenotype correlation criterion. As shown in Figure 4, the predictive importance 313 values of the SNPs in SNP Set-1 and the correlation between each SNP's genotype and the case-control 314 statuses of our subjects were highly correlated with each other. Here, the Pearson-based correlation 315 coefficient was computed as 0.73 with a p-value of 2.61E-10 estimated by a two-tailed t-test. In simple 316 terms, for any SNP within SNP Set-1, a higher correlation between the genotype and the case-control 317 statuses of the 32 subjects led to a higher predictive importance value. Using this rationale, we identified 318 an alternative "SNP Set-2" (56 SNPs not associated with CAC in past studies) whose genotypes had the 319 highest correlation values with the case-control status. Within the ClinSeq R discovery cohort, the range 320 of genotype-phenotype correlation among the SNPs in SNP Set-2 was 0.63-0.73, whereas the same range 321 was 0-0.51 among the SNPs in SNP Set-1. Hence, there was no overlap between the two sets of SNPs.

322
Upon incorporating the genotypes of SNP Set-2 within the ClinSeq R discovery cohort into random 323 forest models, the AUC value turned out to be 0.9975, thereby verifying the superb ability of this set of 324 markers. As shown in Table S8, 42 of these 56 predictive SNPs have been previously associated with a 325 total of 18 risk factors, whereas the total number of SNP-risk factor pairs was 86 with many individual 326 SNPs being associated with multiple risk factors. This was in contrast to only 11 of the 17 predictive 327 SNPs in SNP Set-1 that were associated with a total of 18 risk factors forming 28 SNP-risk factor pairs.

328
In addition, the susceptibility score, which is computed as the cumulative predictive importance values of 329 SNPs tied to CAD risk factors in previous GWAS, increased from 446 to 1229 aligning with the improved 330 predictive accuracy from 0.72 (maximum AUC in Figure 3B for SNP Set-1 in the ClinSeq R discovery 331 cohort) to ≈1.00 as we moved between the two sets of predictors, namely SNP Set-1 and SNP Set-2.
332 Table S9 shows that ten of the predictive SNPs in Set-2 that have been associated with stroke and aortic 333 valve calcium in past GWAS, a trend we also observe with three SNPs in SNP Set-1 (Table S7). Two of the 334 predictive SNPs in Table S9 have also been linked to mitral annular calcium, another disease phenotype S8 and S9 suggests that the highly predictive SNPs identified from the ClinSeq R discovery cohort data 338 (or SNP Set-2) could be potential susceptibility loci for advanced CAC. Comparing predictive performance of SNP Set-2 using FHS and ClinSeq R data sets 340 In order to test whether the higher predictive performance of SNP Set-2 over the past GWAS-based SNP  Table 3 shows the list of 21 SNPs in SNP Set-2 generated optimal predictive performance in ClinSeq R We identified a total of 13 genes that included the 21 SNPs leading to optimal predictive performance 389 in both cohorts. Using GeneMANIA, we derived a network that included this group of 13 genes in addition  Figure 6 shows this network, whereas the abbreviated gene symbols and the corresponding gene names are listed in Table S10. The proteins coded by the genes in 393 the network have a wide range of roles. 12 of them are either a transcription factor or an enzyme, one is a 394 translational regulator, and two are transmembrane receptors. 395 Figure 6. Network derived from GeneMANIA based on 244 studies in humans. The connections in pink are derived from gene coexpression data, whereas the connections in green are derived from genetic interaction data from the literature. The inner circle is composed of genes on which the subset of SNPs in SNP Set-2 leading to optimal performance in both cohorts are present, whereas the genes forming the outer circle are additional genes identified by GeneMANIA. The thicknesses of the links (or edges) between the genes are proportional to the interaction strengths, whereas the node size for each gene is proportional to the rank of the gene based on its importance (or gene score) within the network.
In order to identify whether our list of genes was enriched in any biological functions or processes   Table 4 shows the 22 cardiovascular disease related biological functions and phenotypes, which are 434 identified by IPA based on Fisher's exact test (p-value<0.01), enriched within our network of genes.

435
Several of these functions and phenotypes are involved in biological processes associated with "vascular 436 aging", which is highly relevant to CAC, since aged vascular smooth muscle cells (VSMCs) are known to 437 have less resistance against phenotypic modulations promoting vascular calcification (Shanahan, 2013).

438
In fact, along with seven traditional risk factors (age, gender, total cholesterol, HDL cholesterol, systolic 439 BP, smoking status, hypertension medication status), the Agatston CAC score is used as a parameter 440 in quantifying "vascular age" in the MESA arterial age calculator (Dat, 2016b). Among our network 441 genes previously linked to biological processes related to "accelerated" arterial aging, TLR5 is a member   potentially strong susceptibility to CAD as well as coronary calcium among our subjects with advanced 551 CAC. Upon identifying a subset of 21 SNPs from this alternative set that led to optimal predictive 552 performance in both cohorts, we developed neural network models trained with the ClinSeq R genotype 553 data and tested with the FHS genotype data and obtained high predictive accuracy values (AUC>0.8) 554 under a wide range of network topologies, thereby replicating the collective predictive ability of these 555 SNPs in FHS and identifying several potential susceptibility loci for advanced CAD pathogenesis. At the 556 gene network level, several biological processes previously linked to cardiovascular disease, including 557 differentiation of adipocytes, were found to be enriched among these loci.

558
A potential extension of our modeling study is the expansion of the panel of SNPs that are highly 559 predictive of advanced coronary calcium levels around their loci for building more comprehensive models.

560
Subsequently, we can assess these loci as predictors of rapid CAC progression and early onset of MI 561 with longitudinal data in independent cohorts, especially for cases poorly predicted by traditional risk nhlbi.org/Calcium/input.aspx).