Shrunken methodology to genome-wide SNPs selection and construction of SNPs networks

Abstract

Background

Recent development of high-resolution single nucleotide polymorphism (SNP) arrays allows detailed assessment of genome-wide human genome variations. There is increasing recognition of the importance of SNPs for medicine and developmental biology. However, a SNP data set typically has a large number of SNPs (e.g., about 400 thousand SNPs in a genome-wide Parkinson disease data set) and only a few hundred samples. Conventional classification methods may not be effective when applied to such genome-wide SNP data.

Results

In this paper, we use a shrunken dissimilarity measure to analyze and select relevant SNPs for classification problems. Examples using HapMap data and Parkinson disease (PD) data demonstrate the effectiveness of the proposed method and illustrate its potential as an analysis tool for SNP data sets. Using the Parkinson disease data as an example, we perform a whole-genome analysis. From the 367440 SNPs with less than 1% missing genotypes on all 22 chromosomes, we select 357 SNPs. For the unique genes in which those SNPs are located, gene-gene similarity values are computed using GOSemSim, and gene pairs whose similarity value exceeds a threshold are selected to construct several groups of genes. For the SNPs involved in these groups of genes, the statistical software PLINK is used to compute pair-wise SNP-SNP interactions, and SNP pairs with significance P < 0.01 are used to construct SNPs networks. Because the SNPs networks are built on Gene Ontology knowledge, each SNP network can be related to a biological process. An analysis shows that such networks are directly or indirectly related to Parkinson disease.

Conclusions

Experimental results show that our approach is suitable for handling genetic variations and provides useful knowledge in a genome-wide SNP study.

Background

A single nucleotide polymorphism (SNP) is a DNA sequence variation occurring when a single nucleotide - A, C, G, or T - differs at the same position between individuals [1]. SNPs are believed to contribute to differences between individuals, such as susceptibility to diseases [2]. They are abundant in the human genome [3, 4] and are considered invaluable markers and potentially powerful tools for both genetic research and practical applications [5–8], for instance, disease gene discovery [9], drug development [10], and clinical treatment [11]. More and more genetic research and practical applications combining SNP data sets with machine learning or statistical methods can be expected, as SNPs provide useful information that other methods do not reveal.

In a SNP data set, the association between a disease and a set of relevant SNPs is investigated. Patients and normal controls are often categorized into groups according to their SNP genotypes (categorical values). Thousands of SNPs in different regions of chromosomes are used to describe the characteristics of patient/normal samples. Data sets for such a classification task have two key properties: they are high-dimensional and categorical.

When many SNPs are used to detect the association between a disease and multiple marker genotypes, it is common to find that only a small number of SNPs have genotype patterns that are highly specific to each group of individuals. These SNPs are called the relevant SNPs, as opposed to the irrelevant SNPs that do not help much in identifying the group (i.e., individuals of the same type). Because a large number of SNPs are irrelevant to each group, two individuals in the same group could have low similarity when measured by a simple similarity function that considers the genotypes of all SNPs. The groups may thus be undetectable by classification algorithms.

Many researchers have sought such a set of SNPs whose genotype patterns are highly specific to each group of individuals. Dai et al. [12] proposed the SNP-Haplotype Adaptive Regression (SHARE) algorithm, which seeks the most informative set of SNPs for genetic association in a targeted candidate region by growing and shrinking haplotypes one SNP at a time in a stepwise fashion and comparing the prediction errors of different models via cross-validation. Xu et al. [13] developed a set of web-based SNP selection tools that select SNPs based on Genome-wide Association Studies (GWAS) results, linkage disequilibrium (LD), and predicted functional characteristics of both coding and non-coding SNPs. In a prostate cancer example, the tools selected a small panel of SNPs that included many of the recently validated prostate cancer SNPs. Latourelle et al. [14] sought to identify genetic modifiers of onset age using a genome-wide association study in familial Parkinson disease (PD). Meta-analysis across three studies detected consistent association (P < 10^−5) of five SNPs, suggesting an influence of genes involved in endocytosis and lysosomal sorting in PD pathogenesis. Gao et al. [15] conducted a genome-wide parametric and nonparametric linkage analysis and found two loci for PD, indicating that additional PD susceptibility genes might be identified through targeted candidate gene studies in these regions. Srinivasan et al. [16] considered pathway association of SNP variation, which may be inconsistent with traditional individual SNP associations, and suggested combining pathway and SNP analysis in the future.

The classification problem is defined for such a scenario; see, for instance, [17]. Each group is a set of individuals with an associated set of relevant SNPs such that, with respect to the relevant SNPs, the individuals are similar to each other but dissimilar to individuals outside the group. In this paper, we test HapMap data downloaded from the HapMap webpage [18] and Parkinson disease genome-wide SNP genotyping data obtained from the Coriell Institute for Medical Research. A new nearest shrunken centroid method is used to select SNPs from these two data sets. In the literature, Schwender [19] developed SAM for the analysis of SNP data. The method studies contingency tables to test whether the distribution of SNP genotypes differs between groups. The Pearson χ2 statistic is used to decide whether to reject the hypothesis. Shrunken χ2 statistics are then constructed to analyze relevant SNPs. In [20], Park et al. used the classical nearest shrunken centroid method [21, 22] to select SNPs. Their idea is to represent genotypes directly by numbers and then apply the nearest shrunken centroid method to the resulting numerical data set. The classical nearest shrunken centroid method was designed for numerical microarray data sets. The main aim of this paper is to apply a new nearest shrunken centroid method that handles SNP data in a categorical manner and detects association between a disease and multiple marker genotypes based on a set of selected relevant SNPs. In addition, we compare our method with Park's method [20] based on one of the chromosomes. The genes in which the selected SNPs are located are organized into several groups using GOSemSim [23], keeping gene pairs whose similarity value exceeds a threshold. For the SNPs involved in these groups, pair-wise SNP-SNP interactions are then checked using PLINK [24] at a statistical significance of P < 0.01; the resulting networks can be considered an extension of existing Gene Ontology [25] knowledge.

Methods

Data source

HapMap data

The HapMap SNPs data [18] are downloaded from the HapMap webpage. According to the LD map of chromosome 22 (see [26]), 200 SNPs from chromosome 22 of 4 populations, namely Utah residents with ancestry from northern and western Europe (CEU), Han Chinese in Beijing, China (CHB), Japanese in Tokyo, Japan (JPT) and Yoruba in Ibadan, Nigeria (YRI), are picked randomly from the region 3.44e7−3.5e7 kb [27], which shows a great difference in SNP positions on the LD map across the 4 populations. Here the LD map shows the intensity of linkage disequilibrium of SNPs. In the map, a “flat” curve means that the SNPs are in strong linkage disequilibrium, i.e., recombination rarely occurs between them, while a “steep” curve means that recombination occurs frequently in this part of the chromosome. Samples are collected from the CEU (30 trios), CHB (45 unrelated individuals), JPT (45 unrelated individuals) and YRI (30 both-parent-and-adult-child trios) populations. There are 90 samples each for the CEU and YRI populations, and 45 samples each for the CHB and JPT populations. Missing data are treated as a category in the calculation.

Parkinson disease data

The Parkinson disease SNPs data is based on genome-wide genotyping of 270 individuals with idiopathic Parkinson disease (case) and 271 neurologically normal controls (control), downloaded from the Coriell Institute for Medical Research (http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gap). The genotyping was performed using the Illumina Infinium I and Infinium II assays. The Illumina Infinium I assay assesses 109,365 unique gene-centric SNPs, while the Infinium II assay assesses 317,511 haplotype tagging SNPs based upon Phase I of the International HapMap Project. The Illumina Infinium I and II assays share 18,073 SNPs. Therefore, the combination of the two assays represents 408,803 unique SNPs. In the following experiment, SNPs with more than 1% missing genotypes across all samples are not considered. After these SNPs are filtered out, the number of SNPs decreases to 367440.
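The SNP counts above can be checked directly, and the missing-data filter can be sketched in a few lines. This is a minimal illustration, assuming genotypes are stored as strings with `None` marking a missing call and that a SNP is kept when its missing fraction is at most 1%; the function names and the toy genotype table are hypothetical and are not taken from the study.

```python
def unique_snp_count(assay_one, assay_two, shared):
    """Number of distinct SNPs covered by two assays that share some SNPs."""
    return assay_one + assay_two - shared


def filter_by_missing_rate(genotypes, max_missing=0.01):
    """Keep SNPs whose fraction of missing calls is at most max_missing.

    genotypes maps a SNP ID to a list of genotype calls, one per sample;
    None marks a missing call (an assumed encoding).
    """
    kept = {}
    for snp_id, calls in genotypes.items():
        missing = sum(1 for call in calls if call is None)
        if missing / len(calls) <= max_missing:
            kept[snp_id] = calls
    return kept


# Counts reported for the Infinium I and Infinium II assays.
total_unique = unique_snp_count(109_365, 317_511, 18_073)

# Toy table with 200 samples per SNP (hypothetical data).
toy = {
    "snp_a": ["AA"] * 199 + [None],       # 0.5% missing, kept
    "snp_b": ["Aa"] * 190 + [None] * 10,  # 5% missing, dropped
}
kept_ids = sorted(filter_by_missing_rate(toy))
```

Here `total_unique` reproduces the 408,803 unique SNPs stated in the text, and only `snp_a` survives the filter in the toy table.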

Shrunken methodology

The nearest shrunken centroid method was developed to handle numerical microarray data sets. The main difference between gene expression data and SNP data is that expression values are continuous whereas SNP genotypes are categorical [28].

In this paper, we make use of the shrinkage idea and adapt the algorithm to categorical SNP data by using a genotype distribution measure for categorical objects and modes instead of means for groups. These extensions remove the numeric-only limitation of the nearest shrunken centroid method and enable the classification process to deal efficiently with genome-wide categorical SNP data sets.

Let $x_{ij}$ be the categorical value for SNP i = 1, 2, …, p and sample j = 1, 2, …, n. There are K classes, and $C_k$ denotes the indices of the $n_k$ samples in class k. The centroid of the i-th SNP in class k is defined as:

(1)

where the mode is the category with the highest frequency of appearance.

The overall centroid for SNP i is:

(2)

Let

(3)

where the two vectors compared in (3) are the genotype distribution vector associated with the i-th SNP centroid in class k and the genotype distribution vector associated with the i-th SNP overall centroid, ‖·‖₂ is the Euclidean norm, and $s_i$ is the pooled within-class standard deviation for SNP i:

(4)

and

(5)

Here $C_k$ denotes the indices of the $n_k$ samples in class k, and $s_0$ is a positive constant included to prevent a SNP with small deviation from producing a large $d_{ij}$. In (3), we consider the distance from a class centroid to the overall centroid for the i-th SNP. In our proposal, genotype distributions are used to measure categorical SNP data.

In the next step, the soft thresholding can be defined similarly by:

(6)

From (3), if the difference between a class centroid and the overall centroid is small, the difference is insignificant or is just noise in the classification process, and the soft thresholding in (6) is intended to suppress it. Let t be a test sample; the class label of t is determined by:

(7)

and

where $\pi_k$ is the prior probability of class k, i.e., the proportion of class k in the population. If it is unknown, it can be set to .
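The role of the genotype distribution vectors and of the soft threshold can be illustrated with a small sketch. The formulas below are assumptions modelled on the classical nearest shrunken centroid [21, 22], with genotype distribution vectors in place of class means; they are not a transcription of equations (1)-(7). In particular, the scaling term `m_k`, the one-hot form of the pooled deviation, the default `s0`, and the genotype coding in `CATEGORIES` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Assumed genotype coding; "NA" treats missing data as its own category.
CATEGORIES = ("AA", "Aa", "aa", "NA")


def mode(calls):
    """Most frequent category, used as a centroid in place of a mean."""
    return Counter(calls).most_common(1)[0][0]


def distribution(calls):
    """Genotype distribution vector: relative frequency of each category."""
    counts = Counter(calls)
    return [counts[c] / len(calls) for c in CATEGORIES]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def shrunken_scores(calls, labels, delta, s0=0.1):
    """Soft-thresholded class scores for one SNP (assumed formulas)."""
    n = len(calls)
    classes = sorted(set(labels))
    overall = distribution(calls)
    by_class = {k: [c for c, l in zip(calls, labels) if l == k] for k in classes}
    class_dist = {k: distribution(v) for k, v in by_class.items()}

    # Pooled within-class deviation of one-hot genotypes around the
    # class distribution vector (an assumed categorical analogue).
    ss = 0.0
    for k, members in by_class.items():
        for call in members:
            one_hot = [1.0 if call == c else 0.0 for c in CATEGORIES]
            ss += euclidean(one_hot, class_dist[k]) ** 2
    s_i = math.sqrt(ss / (n - len(classes)))

    scores = {}
    for k, members in by_class.items():
        m_k = math.sqrt(1.0 / len(members) + 1.0 / n)  # assumed scaling
        d = euclidean(class_dist[k], overall) / (m_k * (s_i + s0))
        scores[k] = max(d - delta, 0.0)  # soft threshold of a non-negative d
    return scores


labels = ["case"] * 4 + ["control"] * 4
informative = ["AA", "AA", "AA", "Aa", "aa", "aa", "aa", "Aa"]
uninformative = ["AA", "Aa", "AA", "Aa", "AA", "Aa", "AA", "Aa"]

informative_scores = shrunken_scores(informative, labels, delta=0.5)
uninformative_scores = shrunken_scores(uninformative, labels, delta=0.5)
```

In this toy example the second SNP has identical genotype distributions in both classes, so its scores are shrunk to zero and it would not be used for classification, while the first SNP keeps positive scores in both classes.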

Cross validation

A 10-fold cross validation is adopted in our classification procedure to evaluate the performance of the proposed nearest shrunken centroid method. In each trial, all the samples are randomly divided into 10 equal partitions. Each partition in turn is used as the testing set, and the remaining nine partitions are used as the training set. Ten trials are performed, and the reported results are based on this 10-fold cross validation procedure.
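One trial of the partitioning described above can be sketched as follows. This is an illustrative sketch rather than the program used in the study: the function name `ten_fold_splits` and the fixed seed are hypothetical, and when the sample count is not divisible by 10 (for example 541 samples) the partitions can only be nearly equal in size.

```python
import random


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for one cross-validation trial."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds partitions of near-equal size.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test


splits = list(ten_fold_splits(541))
fold_sizes = [len(test) for _, test in splits]
```

Each sample appears in exactly one testing set per trial; repeating the call with different seeds would give the ten trials.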

SNP network construction

The SNPs selected by the shrunken methodology that are located in genes belong to 122 unique genes. We compute all the pair-wise functional similarities of these gene products using GOSemSim, a package of Bioconductor [29]; Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data that runs on the R platform. GOSemSim estimates the similarity scores of gene pairs according to their GO terms in three ontologies: molecular function (MF), biological process (BP) and cellular component (CC) [25]. In this paper, we consider only two of these, MF and BP, and adopt Rel's method [30] to compute the similarity values. This method is based on the information content of the GO terms, where information content is defined from the frequency with which each term occurs in the GO corpus. Afterwards, gene pairs with a similarity value greater than a threshold are selected to construct several groups of genes using Cytoscape [31].
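The grouping step can be pictured as building a graph whose edges are gene pairs above the threshold and reading off its connected components. The sketch below assumes that reading of "groups of genes"; the gene names `G1` to `G5` and the similarity values are hypothetical placeholders, not GOSemSim output.

```python
def gene_groups(similarity, threshold):
    """Group genes linked by similarity above threshold (connected components)."""
    neighbours = {}
    for (gene_a, gene_b), value in similarity.items():
        if value > threshold:
            neighbours.setdefault(gene_a, set()).add(gene_b)
            neighbours.setdefault(gene_b, set()).add(gene_a)

    groups, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting every gene reachable from start.
        stack, component = [start], set()
        while stack:
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend(neighbours[gene] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


# Hypothetical pairwise similarity values.
toy_similarity = {
    ("G1", "G2"): 0.40,
    ("G2", "G3"): 0.30,
    ("G3", "G4"): 0.10,
    ("G4", "G5"): 0.27,
}
groups = gene_groups(toy_similarity, threshold=0.25)
```

Genes with no pair above the threshold never enter the graph, so every returned group contains at least two genes.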

For the SNPs involved in these groups of genes, we performed a statistical analysis between these SNPs and all the other SNPs selected by our method using PLINK [24], a free, open-source whole genome association analysis toolset designed to perform a range of basic, large-scale analyses in a computationally efficient manner. PLINK provides a logistic regression test for interaction that assumes an allelic model for both the main effects and the interactions. All pairwise combinations of SNPs can be tested. The odds ratio for interaction, the χ2 statistic and the asymptotic P-value are provided in the output file. By constructing SNPs networks from SNP pairs with P < 0.01, we can identify potential SNP-SNP interactions that are still unknown.
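Turning an interaction report into network edges amounts to keeping the SNP pairs whose P value is below the cutoff. The sketch below assumes a whitespace-delimited report with the column names `CHR1 SNP1 CHR2 SNP2 OR_INT STAT P`; this layout is modelled on PLINK's epistasis output and should be treated as an assumption to verify against the PLINK documentation. The SNP IDs and numbers in `toy_report` are made up for illustration.

```python
def significant_pairs(report_text, p_cutoff=0.01):
    """Return (snp1, snp2, p) for rows whose P value is below p_cutoff."""
    lines = [line.split() for line in report_text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    col = {name: idx for idx, name in enumerate(header)}
    edges = []
    for row in rows:
        p_value = float(row[col["P"]])
        if p_value < p_cutoff:
            edges.append((row[col["SNP1"]], row[col["SNP2"]], p_value))
    return edges


# Hypothetical report; the column layout is an assumption.
toy_report = """
CHR1 SNP1 CHR2 SNP2 OR_INT STAT P
14 snpA 21 snpB 2.10 9.50 0.0021
14 snpA 17 snpC 1.05 0.30 0.5839
17 snpC 1 snpD 0.45 7.20 0.0073
"""
edges = significant_pairs(toy_report)
```

The resulting edge list can be loaded into a network tool, with each SNP as a node and each retained pair as an edge.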

Results and discussion

HapMap SNP data set

In the first test, we take every pair of the four populations in the HapMap data set to set up two-class classification problems. Cross-validation is used so that testing is done on data independent of the training data. The results are shown in Figures 1, 2, 3, 4, 5, 6. As these figures show, when the threshold Δ is less than 2 all problems reach a high accuracy of more than 90 percent, except the CHB-JPT classification problem, which reaches only about 50 percent. Accuracy then decreases as the amount of shrinkage increases, since fewer SNPs are used in the prediction. The reason for the poor accuracy of the CHB-JPT classification is that these two populations are quite similar in their SNPs; see Figure 7.

Figure 1

CEU-CHB classification. Two populations: CEU and CHB out of the 4 populations in HapMap data set are picked out to set up a two-class classification. The X axis is the amount of shrinkage Δ and Y axis is the accuracy (accuracy refers to the correctly classified samples in testing data sets in the 10-fold cross validation) obtained by using our shrunken method.

Figure 2

CEU-JPT classification. Two populations: CEU and JPT out of the 4 populations in HapMap data set are picked out to set up a two-class classification. The X axis is the amount of shrinkage Δ and Y axis is the accuracy (accuracy refers to the correctly classified samples in testing data sets in the 10-fold cross validation) obtained by using our shrunken method.

Figure 3

YRI-CHB classification. Two populations: YRI and CHB out of the 4 populations in HapMap data set are picked out to set up a two-class classification. The X axis is the amount of shrinkage Δ and Y axis is the accuracy (accuracy refers to the correctly classified samples in testing data sets in the 10-fold cross validation) obtained by using our shrunken method.

Figure 4

YRI-JPT classification. Two populations: YRI and JPT out of the 4 populations in HapMap data set are picked out to set up a two-class classification. The X axis is the amount of shrinkage Δ and Y axis is the accuracy (accuracy refers to the correctly classified samples in testing data sets in the 10-fold cross validation) obtained by using our shrunken method.

Figure 5

CEU-YRI classification. Two populations: CEU and YRI out of the 4 populations in HapMap data set are picked out to set up a two-class classification. The X axis is the amount of shrinkage Δ and Y axis is the accuracy (accuracy refers to the correctly classified samples in testing data sets in the 10-fold cross validation) obtained by using our shrunken method.

Figure 6

CHB-JPT classification. Two populations: CHB and JPT out of the 4 populations in HapMap data set are picked out to set up a two-class classification. The X axis is the amount of shrinkage Δ and Y axis is the accuracy (accuracy refers to the correctly classified samples in testing data sets in the 10-fold cross validation) obtained by using our shrunken method.

Figure 7

The values of soft threshold. The SNPs used in prediction and their soft-thresholded values (from top to bottom: CEU, YRI, CHB, JPT; Δ = 1.5). Values shown in blue indicate that the corresponding SNP appears in all four populations, while values shown in red indicate that the corresponding SNP appears in only one population.

In the second test, we consider a four-class classification problem, i.e., classifying the four populations CEU, CHB, JPT and YRI. The setting is the same as in the first experiment. Figure 8 shows the cross-validation classification accuracy using different values of Δ for 200 SNPs. The best accuracy is 77.78 percent, obtained when Δ = 1.5. When Δ < 1.5, many SNPs are used in the classification, but some of them are likely redundant. When Δ > 1.5, many SNPs are not used, and some useful SNPs may be thrown away in the classification process. The confusion matrix in Table 1 shows that the prediction is quite good for CEU and YRI but poor for CHB and JPT, for which the accuracy is not high. When we use all 51793 SNPs in chromosome 22 to perform the classification, the best accuracy is 94.44 percent (Δ = 0.5); see Figure 9.

Figure 8

Classification accuracy for the four-class problem using 200 SNPs in Chromosome 22. Four populations, CEU, CHB, YRI and JPT, in the HapMap data set are used to set up a four-class classification. The X axis is the amount of shrinkage Δ and the Y axis is the accuracy (the proportion of correctly classified samples in the testing data sets in the 10-fold cross validation) obtained by using our shrunken method. Only 200 SNPs located in 3.44e7-3.5e7kb of chromosome 22 are used in this experiment.

Table 1 Confusion matrix when Δ = 1.5.

Figure 9

Classification accuracy for the four-class problem using all 51793 SNPs in Chromosome 22. Four populations, CEU, CHB, YRI and JPT, in the HapMap data set are used to set up a four-class classification. The X axis is the amount of shrinkage Δ and the Y axis is the accuracy (the proportion of correctly classified samples in the testing data sets in the 10-fold cross validation) obtained by using our shrunken method. All 51793 SNPs of chromosome 22 are used in this experiment.

By shrinkage (Δ is set to 1.5), the number of SNPs used for classification decreases from 200 to 143, 143, 142 and 142 for CEU, YRI, CHB, and JPT respectively. In Figure 7, we show the SNPs used in prediction and their soft-thresholded values. Values shown in blue indicate that the corresponding SNP appears in all four populations, while values shown in red indicate that the corresponding SNP appears in only one population. In Table 2, we show the centroid genotype distribution vectors corresponding to the values shown in red.

Table 2 Genotype distribution vector of 12th SNP (left) and 127th SNP (right).

As shown in Table 2, at the 12th SNP, the genotype distribution vector of YRI is quite different from the others; similarly, at the 127th SNP, the genotype distribution vector of CEU differs from those of the other three populations. For the 12th SNP, the reason is that the mode of YRI is “aA”, while that of the whole population is “AA”; the YRI population therefore shows more variation and has a large soft-thresholded value.

Parkinson disease SNPs data

Next we use the Parkinson disease data set to show the effectiveness of the shrunken methodology and to construct SNPs networks. Table 3 shows the average classification accuracy (the proportion of correctly classified samples in the testing sets of the 10-fold cross validation) for each of the 22 chromosomes of the Parkinson disease data set, obtained with the nearest shrunken centroid program. We use the most frequent genotypes in the case and control groups as the modes for the program. The parameter Δ is tuned for each chromosome to obtain the highest accuracy in the test. To demonstrate the effectiveness of the proposed method, we also compare it with Park's method [20] on the same data set. Here we use the numerical values (0, 1, 2, 3) to represent different genotypes for Park's method. According to Table 3, our shrunken centroid method performs better than Park's method in terms of accuracy and number of selected SNPs.

Table 3 Comparisons between the proposed method and Park's method.

We also choose Chromosome 14 as an example to demonstrate the SNPs selected by the proposed method. Figure 10 shows the accuracies obtained when we increase Δ from zero to three in one trial of the 10-fold cross validation. The figure shows that our method reaches a reasonably good accuracy of 64.81% when Δ is equal to 0.8. By shrinkage, the number of SNPs selected for the classification decreases from 11728 to 20. In Table 4, we show the genotype distributions of these 20 SNPs in the disease and control groups, where A and a represent the major and minor alleles. The column under “Missing” gives the percentage of missing genotypes in each group. According to the table, the SNP genotype distributions in the two groups are quite different.

Table 4 Genotype distributions of selected 20 SNPs in Chromosome 14.

Figure 10

Relationship between Δ and accuracy in Chromosome 14. Illustration of the accuracy obtained in Chromosome 14 of the Parkinson disease genome-wide data set when Δ changes from 0 to 3. For Chromosome 14, in each trial, all 541 samples (control and case) are randomly divided into 10 equal partitions. Each partition in turn is used as the testing set and the remaining nine partitions are used as the training set. Ten trials are performed and the results are collected based on this 10-fold cross validation procedure. This figure is drawn from the one of these ten trials in which the highest accuracy (the percentage of correctly classified samples over all test samples) is obtained. The X axis is the Δ value, which increases from 0 to 3. The Y axis is the accuracy obtained in Chromosome 14 using our method; it fluctuates as different Δ values are applied, and the highest accuracy is obtained when Δ is equal to 0.8.

We randomly select one trial of this 10-fold cross validation as an example for further analysis. In this trial, of the 367440 SNPs from the 22 chromosomes of the Parkinson disease data set, 357 are selected in total and 171 of them are located in gene coding areas. Next we make use of the knowledge of these genes to construct SNPs networks. For the 122 genes in which those 171 SNPs are located, we cluster the genes based on their similarity values computed using GOSemSim. The closely related biological process and molecular function roles of each gene are checked with GOSemSim against a threshold. When the similarity value between two genes is less than the threshold, their relationship is not considered. In this way, several groups of genes are formed. As we are interested in gene-gene interactions, we only consider groups that contain more than one gene. In Table 5, we show the number of groups of genes formed using different threshold values and the number of gene pairs involved.

Table 5 Groups of genes formed for different threshold values.

We see in Table 5 that the number of groups of genes first increases with the threshold value, as more groups are formed. However, when the threshold value increases further, the number of groups decreases because more groups contain just one gene. According to Table 5, we select a threshold of 0.25 for analysis, as the number of groups of genes is higher than with the other threshold values. Figure 11 shows the groups of genes constructed by our method when the threshold is equal to 0.25. Gene pairs placed in the same group suggest a strong potential for interaction effects in a biological process. The figure contains 12 groups, including 68 genes.

Figure 11

Gene network when GOSemSim threshold=0.25. Gene network constructed using Cytoscape. Similarity values of gene pairs are computed using GOSemSim, and gene pairs with a similarity value > 0.25 are grouped together. Every node in the figure is labeled with its gene symbol, and an edge between two genes indicates that this pair of genes has a similarity value > 0.25. There are 12 clusters in the gene network when threshold=0.25.

For each group of genes constructed, we check all the pairwise SNP-SNP interactions using PLINK between the SNPs involved in the group of genes and all the other SNPs selected by the shrunken method. Based on the P-values of the PLINK epistasis test, we construct SNPs networks. Because there are more groups of genes when the GOSemSim threshold value is between 0.22 and 0.28, we are interested in the corresponding SNPs networks. In particular, Figures 12, 13, 14 show the SNPs networks when the threshold values are 0.22, 0.26 and 0.27 respectively. We find that the two SNPs networks shown in Figure 14 appear frequently among the networks constructed when the GOSemSim threshold value is between 0.22 and 0.28. Table 6 shows all SNP pairs of these SNPs networks in Figure 14 that have interactions with P < 0.01 significance.

Figure 12

SNPs network when GOSemSim threshold=0.22. SNPs network constructed using Cytoscape. For each group of the gene network where gene pairs have a similarity value > 0.22, all the pairwise SNP-SNP interactions are checked using PLINK between the SNPs involved in the groups of genes and all the other SNPs selected by the shrunken method. The SNPs network is constructed based on the P value of the PLINK epistasis test. Each node in the figure is labeled with its SNP ID, and an edge between two SNPs indicates that this pair of SNPs interacts at a P < 0.01 significance. There are 6 SNPs networks in this figure.

Figure 13

SNPs network when GOSemSim threshold=0.26. SNPs network constructed using Cytoscape. For each group of the gene network where gene pairs have a similarity value > 0.26, all the pairwise SNP-SNP interactions are checked using PLINK between the SNPs involved in the groups of genes and all the other SNPs selected by the shrunken method. The SNPs network is constructed based on the P value of the PLINK epistasis test. Each node in the figure is labeled with its SNP ID, and an edge between two SNPs indicates that this pair of SNPs interacts at a P < 0.01 significance. There are 5 SNPs networks in this figure.

Figure 14

SNPs network when GOSemSim threshold=0.27. SNPs network constructed using Cytoscape. For each group of the gene network where gene pairs have a similarity value > 0.27, all the pairwise SNP-SNP interactions are checked using PLINK between the SNPs involved in the groups of genes and all the other SNPs selected by the shrunken method. The SNPs network is constructed based on the P value of the PLINK epistasis test. Each node in the figure is labeled with its SNP ID, and an edge between two SNPs indicates that this pair of SNPs interacts at a P < 0.01 significance. There are 2 SNPs networks in this figure.

Table 6 Pair-wise interactions among SNPs when GOSemSim threshold=0.27 (P < 0.01).

We find some interesting relationships in these two SNPs networks. For example, the SNPs rs11626809 and rs2836392 interact strongly; their corresponding genes, RAD51L1 and ERG respectively, are located in different clusters of the gene network, which suggests that these two clusters could perhaps be merged. As another example, rs4968656 interacts with rs4658673, which is located in an intergenic area and has no record in Gene Ontology so far; the gene information of rs4968656, KCNH6, could perhaps be used to further analyze the function of rs4658673 and to extend GO afterward.

Indeed, some of the SNPs selected by the shrunken method are directly or indirectly related to PD. For example, ERG and anatomical abnormalities are reported to cause retinopathy in dementia with Lewy bodies [32], which shares similar symptoms with PD and is thought to be related to PD or to sometimes occur together with it. KCNH6, located in Chromosome 17, is reported to have diverse functions, including regulating neurotransmitter release, heart rate, insulin secretion, neuronal excitability, epithelial electrolyte transport, smooth muscle contraction, and cell volume. These functions are also related to symptoms of PD.

Conclusions

In this paper, we review the nearest shrunken centroid method for gene expression data and extend it to tackle SNP data classification. The main contribution of this paper is a shrunken dissimilarity measure for SNP data classification problems. The method can be implemented very efficiently on a PC. Relevant SNPs are selected for the HapMap data and the Parkinson disease data, and experimental results are reported to show the effectiveness of the method. In particular, we find some SNPs located in genes that are relevant to Parkinson disease. Based on the SNPs networks, we can suggest previously unknown relationships between the corresponding genes, which can be considered an extension of existing GO knowledge. In the future, detailed biological analysis of SNPs in other genome-wide SNP data sets will be studied. The analysis of genomic variation can take into account functional as well as linkage disequilibrium information, and more importance can be attached to some SNPs than others based on their positions within coding or regulatory regions or splice sites.

Authors' contributions

MN designed this study and developed the new algorithm. YL designed this study, coded the program, and ran the experiments. Both authors wrote the manuscript.

References

  1. Brookes AJ: The essence of SNPs. Gene. 1999, 234: 177-186. 10.1016/S0378-1119(99)00219-X

  2. Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273: 1516-1517. 10.1126/science.273.5281.1516

  3. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Shaw N, Lane CR, Lim EP, Kalyanaraman N, Nemesh J, Ziaugra L, Friedland L, Rolfe A, Warrington J, Lipshutz R, Daley GQ, Lander ES: Characterization of single nucleotide polymorphisms in coding regions of human genes. Nature Genet. 1999, 22: 231-238. 10.1038/10290

  4. Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL, Hunt SE, Cole CG, Coggill PC, Rice CM, Ning Z, Rogers J, Bentley DR, Kwok PY, Mardis ER, Yeh RT, Schultz B, Cook L, Davenport R, Dante M, Fulton L, Hillier L, Waterston RH, McPherson JD, Gilman B, Schaffner S, Van Etten WJ, Reich D, Higgins J, Daly MJ, Blumenstiel B, Baldwin J, Stange-Thomann N, Zody MC, Linton L, Lander ES, Altshuler D: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001, 409: 928-933. 10.1038/35057149

    Article  PubMed  CAS  Google Scholar 

  5. Schork NJ, Fallin D, Lanchbury JS: Single nucleotide polymorphisms and the future of genetic epidemiology. Clin. Genet. 2000, 58: 250-264. 10.1034/j.1399-0004.2000.580402.x

    Article  PubMed  CAS  Google Scholar 

  6. Wang Z, Moult J: SNPs, protein structure, and disease. Human mutation. 2001, 17: 263-270. 10.1002/humu.22

    Article  PubMed  Google Scholar 

  7. Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 2005, 6: 95-108. 10.1038/nrg1521

    Article  PubMed  CAS  Google Scholar 

  8. Syvanen AC: Toward genome-wide SNP genotyping. Nature Genetics. 2005, 37: S5-S10. 10.1038/ng1558

    Article  PubMed  Google Scholar 

  9. Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, Tsunoda T, Sato H, Sato H, Hori M, Nakamura Y, Tanaka T: Functional SNPs in the lymphotoxin- gene that are associated with susceptibility to myocardial infarction. Nature Genetics. 2002, 32: 650-654. 10.1038/ng1047

    Article  PubMed  CAS  Google Scholar 

  10. Rothberg BEG: Mapping a role for SNPs in drug development. Nature Biotechnology. 2001, 19: 209-211. 10.1038/85631

    Article  PubMed  CAS  Google Scholar 

  11. Erichsen HC, Chanock SJ: SNPs in cancer research and treatment. British Journal of Cancer. 2004, 90: 747-751. 10.1038/sj.bjc.6601574

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  12. Dai JY, Leblanc M, Smith NL, Psaty B, Kooperberg C: SHARE: an adaptive algorithm to select the most informative set of SNPs for candidate genetic association. Biostatistics. 2009, 10: 680-693. 10.1093/biostatistics/kxp023

    Article  PubMed  PubMed Central  Google Scholar 

  13. Xu Z, Taylor JA: SNPinfo: integrating GWAS and candidate gene information into functional SNP selection for genetic association studies. Nucleic Acids Res. 2009, 37: W600-605. 10.1093/nar/gkp290

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  14. Latourelle JC, Pankratz N, Dumitriu A, Wilk JB, Goldwurm S, Pezzoli G, Mariani CB, DeStefano AL, Halter C, Gusella JF, Nichols WC, Myers RH, Foroud T: Genomewide association study for onset age in Parkinson disease. BMC Medical Genetics. 2009, 10: 98- 10.1186/1471-2350-10-98

    Article  PubMed  PubMed Central  Google Scholar 

  15. Gao X, Martin ER, Liu Y, Mayhew G, Vance JM, Scott WK: Genome-wide linkage screen in familial Parkinson disease identifies loci on chromosomes 3 and 18. Am J Hum Genet. 2009, 84: 499-504. 10.1016/j.ajhg.2009.03.005

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  16. Srinivasan BS, Doostzadeh J, Absalan F, Mohandessi S, Jalili R, Bigdeli S, Wang J, Mahadevan J, Lee CL, Davis RW, William LJ, Ronaghi M: Whole genome survey of coding SNPs reveals a reproducible pathway determinant of Parkinson disease. Hum Mutat. 2009, 30: 228-238. 10.1002/humu.20840

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  17. Khlestkina EK, Salina EA: SNP markers: Methods of analysis, ways of development, and comparison on an example of common wheat. Russian Journal of Genetics. 2006, 42: 585-594. 10.1134/S1022795406060019

    Article  CAS  Google Scholar 

  18. Gibbs RA, Belmont JW, Hardenbol TD, Willis P, Yu FL, Yang HM, Ch'ang LY, Huang W, Liu B, Shen Y, Tam PKH, Tsui LC, Waye MMY, Wong JTF, Zeng CQ, Zhang QR, Chee MS, Galver LM, Kruglyak S, Murray SS, Oliphant AR, Montpetit A, Hudson TJ, Chagnon F, Ferretti V, Leboeuf M, Phillips MS, Verner A, Kwok PY, Duan SH, Lind DL, Miller RD, Rice JP, Saccone NL, Taillon-Miller P, Xiao M, Nakamura Y, Sekine A, Sorimachi K, Tanaka T, Tanaka Y, Tsunoda T, Yoshino E, Bentley DR, Deloukas P, Hunt S, Powell D, Altshuler D, Gabriel SB, Qiu RZ, Ken A, Dunston GM, Kato K, Niikawa N, Knoppers BM, Foster MW, Clayton EW, Wang VO, Watkin J, Gibbs RA, Belmont JW, Sodergren E, Weinstock GM, Wilson RK, Fulton LL, Rogers J, Birren BW, Han H, Wang HG, Godbout M, Wallenburg JC, L'Archeveque P, Bellemare G, Todani K, Fujita T, Tanaka S, Holden AL, Lai EH, Collins FS, Brooks LD, McEwen JE, Guyer MS, Jordan E, Peterson JL, Spiegel J, Sung LM, Zacharia LF, Kennedy K, Dunn MG, Seabrook R, Shillito M, Skene B, Stewart JG, Valle DL, Clayton EW, Jorde LB, Belmont JW, Chakravarti A, Cho MK, Duster T, Foster MW, Jasperse M, Knoppers BM, Kwok PY, Licinio J, Long JC, Marshall PA, Ossorio PN, Wang VO, Rotimi CN, Royal CDM, Spallone P, Terry SF, Lander ES, Lai EH, Nickerson DA, Abecasis GR, Altshuler D, Bentley DR, Boehnke M, Cardon LR, Daly MJ, Deloukas P, Douglas JA, Gabriel SB, Hudson RR, Hudson TJ, Kruglyak L, Kwok PY, Nakamura Y, Nussbaum RL, Royal CDM, Schaffner SF, Sherry ST, Stein LD, Tanaka T: The International HapMap project. Nature. 2003, 426: 789-796. 10.1038/nature02168

    Article  CAS  Google Scholar 

  19. Schwender H: Modifying microarray analysis methods for categorical data-SAM and PAM for SNPs. Classification-The Ubiquitous Challenge, Proceedings of the 28th Annual Conference of the Gesellschaft fur Klassifikation e. V., University of Dortmund: 9-11 March 2004; Dortmund. Edited by: Weihs C, Gaul W. 2004, 370-377. Springer

    Google Scholar 

  20. Park J, Hwang S, Lee YS, Kim SC, Lee D: SNP@Ethnos: a database of ethnically variant single-nucleotide polymorphisms. Nucleic Acids Res. 2007, 35 (Database issue): D711-D715. 10.1093/nar/gkl962

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  21. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS. 2002, 99: 6567-6572. 10.1073/pnas.082099299

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  22. Bair E, Tibshirani R: Machine learning method applied to DNA microarray data can improve the diagnosis of cancer. SIGKDD Explorations. 2003, 5: 48-55. 10.1145/980972.980980.

    Article  Google Scholar 

  23. Bioconductor. http://www.bioconductor.org

  24. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  25. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  26. Liao H, Ng MK, Fung E, Sham P: Unidimensional nonnegative scaling for genome-wide linkage disequilibrium maps. International Journal of Bioinformatics Research and, Applications. 2008, 4: 417-434. 10.1504/IJBRA.2008.021177

    Article  CAS  Google Scholar 

  27. Liao H, Liu Y, Ng MK: Shrunken Dissimilarity Measure for Genome-wide SNP Data Classification. Proceedings of the Third International Symposium OSB09: 20-22 September 2009; Zhangjiajie, China. Edited by: Chen L. 2009, 73-80. World Publishing Corporation

    Google Scholar 

  28. Schwender H, Ickstadt K, Rahnenfuhrer J: Classification with High-Dimensional Genetic Data: Assigning Patients and Genetic Features to Known Classes. Biometrical Journal. 2008, 50: 911-926. 10.1002/bimj.200810475

    Article  PubMed  Google Scholar 

  29. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80- 10.1186/gb-2004-5-10-r80

    Article  PubMed  PubMed Central  Google Scholar 

  30. Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics. 2006, 7: 302- 10.1186/1471-2105-7-302

    Article  PubMed  PubMed Central  Google Scholar 

  31. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13: 2498-2504. 10.1101/gr.1239303

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  32. Devos D, Tir M, Maurage CA, Waucquier N, Defebvre L, Defoort-Dhellemmes S, Destee A: ERG and anatomical abnormalities suggesting retinopathy in dementia with Lewy bodies. Neurology. 2005, 65: 1107-1110. 10.1212/01.wnl.0000178896.44905.33

    Article  PubMed  CAS  Google Scholar 

Download references

Acknowledgements

We thank the participants and the submitters for depositing samples at the NINDS Neurogenetics repository. The samples for this study are derived from the NINDS Neurogenetics repository at Coriell Cell Repositories. When we first downloaded them, the samples and data were available from the website: http://ccr.coriell.org/Sections/BrowseCatalog/DiseaseDetail.aspx?PgId=403&omim=PAR40000&coll. Currently, they are available from NCBI: http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gap. This work was financially supported by Research Grant Council [201508] and Hong Kong Baptist University FRGs.

This article has been published as part of BMC Systems Biology Volume 4 Supplement 2, 2010: Selected articles from the Third International Symposium on Optimization and Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/4?issue=S2

Author information

Corresponding author

Correspondence to Michael Ng.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Liu, Y., Ng, M. Shrunken methodology to genome-wide SNPs selection and construction of SNPs networks. BMC Syst Biol 4 (Suppl 2), S5 (2010). https://doi.org/10.1186/1752-0509-4-S2-S5

  • DOI: https://doi.org/10.1186/1752-0509-4-S2-S5