# SAVI: a statistical algorithm for variant frequency identification

- Vladimir Trifonov
^{1, 2, 3}Email author, - Laura Pasqualucci
^{4, 5}, - Enrico Tiacci
^{6}, - Brunangelo Falini
^{6}and - Raul Rabadan
^{1, 2, 3}Email author

**7(Suppl 2)**:S2

https://doi.org/10.1186/1752-0509-7-S2-S2

© Trifonov et al.; licensee BioMed Central Ltd. 2013

**Published: **14 October 2013

## Abstract

### Background

Many problems in biomedical research can be posed as a comparison between related samples (healthy vs. disease, subtypes of the same disease, longitudinal data representing the progression of a disease, etc). In the cases in which the distinction has a genetic or epigenetic basis, next-generation sequencing technologies have become a major tool for obtaining the difference between the samples. A commonly occurring application is the identification of somatic mutations occurring in tumor tissue samples driving a single cell to expand clonally. In this case, the progression of the disease can be traced through the trajectory of the frequency of the oncogenic alleles. Thus obtaining precise estimates of the frequency of abnormal alleles at various stages of the disease is paramount to understanding the processes driving it. Although the procedure is conceptually simple, technical difficulties arise due to inhomogeneous samples, existence of competing subclonal populations, and systematic and non-systematic errors introduced by the sequencing technologies.

### Results

We present a method, Statistical Algorithm for Variant Frequency Identification (SAVI), to estimate the frequency of alleles in a set of samples. The method employs Bayesian analysis and uses an iterative procedure to derive empirical priors. The approach allows for the comparison of allele frequencies across several samples, e.g. normal/tumor pairs and more complex experimental designs comparing multiple samples in tumor progression, as well as analyzing sequencing data from RNA sequencing experiments.

### Conclusions

Analyzing sequencing data through estimating allele frequencies using empirical Bayes methods is a powerful complement to the ever-increasing throughput of the sequencing technologies.

## Keywords

## Background

High-throughput sequencing (HTS) has become one of the most important tools in the arsenal of biomedical researchers in their quest for understanding the foundations of diseases with a strong genetic component like cancer [1]. Sequencing technologies have significantly increased the amount of data produced by a single run of an instrument with their modern high-throughput versions producing more than 1Tb in a single run. Thus, in the case of the human genome, the scale of the sequencing experiments has been pushed into the realm of whole-exome [2, 3], whole transcriptome [4, 5], and even whole-genome studies [6, 7]. The sheer volume of this data combined with the inherently random aspect of the sequencing process conspire to introduce uncertainties in the output of the experiments. Furthermore, the intricate chemistry of the novel sequencing instruments often results in systematic errors, which can make artifacts indistinguishable from actual genetic lesions. Although detecting such sources of non-random errors is an important part of analyzing the data produced by sequencing experiments, the purpose of this paper is to develop a framework for the correction of the random component in the noise introduced by the sequencers. Furthermore, our goal is not to establish the biochemical sources of those errors, but to develop a statistical approach applicable in diverse scenarios in which the assumption about the random nature of the source of errors holds.

A major application of HTS is to compare related samples, e.g. healthy vs. disease or various stages in the progression of a disease, based on their genetic makeup. Many such examples are provided in cancer research, where identifying the genetic or epigenetic lesions that contribute to the oncongenic process is an important step towards better diagnosis, evaluation of prognosis, and treatment of the disease. Thus HTS has become a powerful unbiased approach to delineating the landscape of genetic alterations in different tumor types [8, 9]. An often-applied technique involves identifying aberrant somatic alleles fixed in a population of cancer cells, which are absent from a control sample. The wisdom behind this technique is based soundly on the observation that the drivers of oncogenic processes are fixed by the clonal expansion [10]. From a more general perspective, and in addition to making a digital (binary) present/absent call, one could strive for identifying the frequency of alleles in a given population of cells. The importance of this analysis to cancer research is immense since low frequency alleles can have a major contribution to the disease in later stages [11], e.g. by conferring resistance to treatment. Furthermore, detecting low frequency alleles at an early stage of the disease, and even before the disease has manifested itself, can be crucial for its prognosis.

Besides the clinical importance of diagnosing and treating cancer, having a detailed picture of the alleles present at different stages of the evolution of a population of cells is an important step towards understanding the intricate biological processes underlying its existence. Several approaches have been used to identify the genetic alterations differentiating a set of related samples. Some methods [12, 4] involve ad hoc thresholds on the number of reads reporting a variant allele, their quality and/or the total number of reads covering the position of the variant allele to produce a binary present/absent call for that variant. A comparison between different samples is then based on differences in those calls (see Figure 1). Arbitrary thresholds on the number of reads reporting a variant introduce biases due to the uneven distribution of depths along the genome. Such methods can miss variants which have somewhat lower quality but high depth or lower depth, but high quality. Furthermore, decisions based on a discretized view of the data depend strongly on the level of discretization and other parameters of the discretization process. This could be a problem if the samples being compared are not perfectly homogeneous, e.g. by contamination or the natural presence of tumor cells in the control sample. Other methods [13, 14] assume allele distributions particular to a homogenous population of diploid genomes. Similar to our approach, the algorithm SNVMix [15] uses a Bayesian framework and is able to identify variants in samples with different ploidy and tumor cellularity. The algorithm employs an expectation-maximization for constructing a prior.

In this paper we present the statistical Algorithm for Variant frequency Identification (SAVI) developed in the course of analyzing data from sequencing experiments of cancer samples from nine Hairy Cell Leukemia (HCL) patients and the corresponding paired normal samples. The initial findings are published in [2]. The algorithm is based on constructing Bayesian posteriors distributions on allele frequencies and employs an iterative procedure for constructing an empirical prior from a given dataset. Having posterior distributions lets us obtain a high credibility interval for the frequency of a particular allele as well as estimates on its expected and most likely value. In contrast to other applications of Bayesian analysis to the problem of frequency estimation, our prior is not fixed to belong to a predetermined set that might not represent an accurate description of the data. Our desire was to develop a method which is more empirically grounded and governed by the data itself as much as possible. With the advent of large datasets the significance of Bayesian methods using priors based on empirical observations has increased [16, 17]. Thus our goal to tailor the prior to the data was naturally steered towards such considerations. Empirical Bayesian methods have been applied before in the context of gene expression in RNA sequencing experiments [18] and estimates of positive selection in tumors [19]. Although, as outlined above, the techniques developed in this manuscript are quite general, we focus here on two important aspects of its application: detection and genotyping of variant alleles in a population of cells and comparing allele frequencies across different samples.

## Results

The methods described in this paper were developed to analyze the sequencing data from a study on HCL - a lymphoid malignancy in which bone marrow, spleen and liver are infiltrated by leukemic B cells showing abundant cytoplasm with characteristic "hairy" projections. In the study, the exome of samples from peripheral blood leukemic hairy cells and paired normal mononuclear cells from 9 HCL patients was sequenced with Illumina GAIIx and HiSeq2000. Genetic material from the exome of those samples was obtained by enrichment with SureSelect. The mate-pair reads produced by the sequencer were aligned to the hg18/NCBI 36.1 human reference using MAQ. On average around 27M of the 28M exonic positions reported in the NCBI 36.3 CCDS database were covered by reads with an average haploid depth of 55. After filtering, described next, on average 16K exonic positions were determined to contain a variant. Of those, 2,500 were novel (not recorded in dbSNP 130) and 1,900 were in addition non-synonymous. Finally, across all samples a total of 81 of the novel non-synonymous variants were predicted to be somatic. These variants were compared against a list of 121 tested variants 67 of which were confirmed to be somatic. Fifty-six of the confirmed somatic variants were predicted and only 1 of the tested, but not confirmed variants, was predicted.

To analyze the variants produced by the sequencer we applied the Bayesian framework outlined in the Methods section. As discussed there, for every potential variant allele we obtained the number of reads confirming the variant, the average Phred score of the nucleotides containing the variant, and the total number of reads covering its locus. Assuming a binomial likelihood of the variant depth given a particular allele frequency, total depth, and a prior distribution on those frequencies, we obtained a posterior distribution for the allele frequency for every tentative allele using Bayes theorem. In the following discussion, the priors and posteriors have precision 1% unless explicitly stated otherwise.

### Detection of variants

Given the posterior distribution for the frequency of an allele we obtained the posterior probability that the allele has a non-zero frequency. Our decision was to consider as present variants whose posterior probability of having a non-zero frequency was at least 1-10^{-6} (see the Methods section for the choice of confidence cutoffs). Figure 3 (center) shows in red the aggregated posterior obtained from variants determined to be present with the criterion that the hypothesis "frequency at least 1%" is credible at the 1-10^{-6} level and using the empirical prior. The plot also contains (in blue) the aggregated posterior corresponding to the uniform prior. A comparison between the two posteriors shows again the shrinking towards a common mean, this time towards the frequency 50% for heterozygous variants and towards 100% for homozygous variant alleles. Also shown here are the aggregated posteriors for variants present in both the tumor and normal samples (in black) and novel variants not documented in the NCBI dbSNP130 database (in green).

*f*for which the hypothesis "frequency at least

*f"*is credible a posteriori with confidence at least 1-10

^{-6}- i.e. we obtained a one-sided credible interval for that level. Next, for every possible cutoff frequency we obtained the contingency table for the test corresponding to the cutoff on that frequency vs. the known polymorphic locations present with frequency at least 1% with high posterior confidence in both samples. We computed the mutual information corresponding to that contingency table and chose the frequency cutoff which maximizes it. In our case a frequency cutoff of 2% was chosen. Figure 4 (right) shows the result of this analysis and Figure 4 (center) shows the ROC curve for the frequency cutoff parameter, where the chosen cutoff frequency is marked in red. Finally, we revisited the definition of presence of an allele and used a frequency cutoff of 2% at posterior credibility level 1-10

^{-6}. The result of this filtering is shown in Figure 3 (right). For comparison, in that plot we have also included the result of filtering with frequency cutoff 1%, which was shown in red in Figure 3 (center). From the plot it is evident that the cutoff at 2% removes a small peak and some of the bulk at low frequencies of the posterior distribution, which we hypothesize are due to sequencing artifacts. A possible extension iterates the process with the newly established cut-off until the aggregated posteriors stop changing substantially as measured, for example, by the Kullback-Leibler divergence.

### Detection of somatic variants

Similar to variant detection, the identification of somatic alleles is based on choosing a suitable cutoff on frequency difference. In the variant detection context our search for suitable frequency cutoff was guided by the frequencies of variants at known (as recorded in NCBI dbSNP 130) locations observed in both tumor and normal samples. To establish a variant difference cutoff we leveraged the validation results of 121 potential somatic alleles 67 of which were confirmed to be genuine somatic mutations present only in the tumor samples.

The list of potential somatic alleles was chosen in the following way. First, starting from a uniform prior on frequencies 0%, 50%, and 100%, with the iterative procedure outlined in the Methods section we constructed an empirical prior for those frequencies. Next, amongst non-synonymous variants whose locations do not appear in dbSNP we selected those with posterior confidence of having zero frequency in tumor at most 10^{-6} and a posterior confidence of having non-zero frequency in normal at most 10^{-6}. Finally, we restricted our attention to the variants having an observed frequency at least 25% in the tumor and observed frequency at most 5% in the normal. Those 121 variants were subjected to validation by a direct Sanger sequencing, and 67 were confirmed to be present in the tumor, but absent from the normal samples.

To select a frequency difference able to distinguish between the confirmed variants from those which were not confirmed in the validation experiment, we performed a mutual information analysis, similar to the one we used to obtain a variant frequency cutoff in the context of detection of variants. More precisely, for every possible frequency difference cutoff we compared the variants obtained at this cutoff with the result of the validation experiment and chose the cutoff for which the mutual information of the corresponding contingency table was maximized. In this way, a cutoff of 10% on frequency difference was chosen. Figure 6 (center, right) contains the details of this analysis and Figure 6 (left, in green) contains the aggregated posterior of the validated variants, showing an increase in frequency of at least 10% with high posterior confidence. As designed, this cutoff was able to distinguish the confirmed from the non-confirmed variants quite well, so we went back and filtered the novel non-synonymous variants using this criterion. In addition, we filtered for variants for which a decrease of at least 10% in allele frequency was observed. Our hypothesis was that such variants might be due to a loss of heterozygosity, which was confirmed subsequently. The aggregated posterior distribution of the frequency differences of the filtered variants is given in Figure 5 (right).

## Conclusions

We have presented an empirical Bayesian framework for estimating allele frequencies in a given sample and for comparing such frequencies across samples. The method constructs an empirical prior through an iterative procedure with a fixed-point distribution predicting the empirically observed distribution on data as a marginal. Given the prior, the method combines the data of a particular position with Binomial likelihood to produce a posterior allele frequency distribution. We use high credibility intervals derived from that distribution to decide between the competing variant present/absent hypotheses and distinguish alleles existing in the sample from variants introduced as artifacts by the sequencing technology. For a particular sample involving the study of a human/cancer genome, the parameters of that decision can be inferred from the variants present at locations polymorphic in the human population. Finally, when confronted with a pair of samples, we obtain a posterior distribution on the allele frequency differences between them and detect the alleles presenting a substantial differential. This differential can constitute the appearance of oncogenic alleles, as well as a loss of alleles, e.g. loss of heterozygosity, as a result of the oncogenic process. In our work we used a set of validated variants to deduce the cutoff on frequency difference and have shown that the chosen cutoff distinguishes variants confirmed to distinguish the samples, e.g. somatic variants, from those which do not, e.g. germline variants. Regarding cutoffs we can make the general comment that although obtaining theoretical guarantees on those cutoffs is important, in experiments in which subsequent independent validation is available the stress is more on the cutoff parameter as setting a priority in the validation and leaving the decision of a cutoff to be guided by the design and the resources of the experiment.

This method was developed in order to analyze the sequencing data from tumor and paired normal samples from HCL patients and was successful in identifying important somatic mutations present in the tumor sample. The goal of the project is to obtain the mutational landscape of that disease and will be reported in a separate publication. We saw the study as an opportunity to explore the larger goal of estimating the frequency of alleles in a population of cells. Our motivation for this was enabled by the ability of the novel high-throughput sequencing technologies to detect alleles at a growing depth, and hence at a decreasing frequency. The importance of this to the future of biomedical research is immense, since it will let us fine-tune to a higher precision our understanding of complex populations of cells. More precisely, it will let us detect low frequency alleles, which might have been or will be selected for a clonal expansion in the evolution of that population. Furthermore, as is often the need in biomedical research, this approach will let us study in finer detail the differences between two samples as captured by their genetic makeup. This is important in contexts in which impurities in the sample might confound a more coarsely discretized view of the data. Our conclusion is that the approach to analyzing sequencing data through estimating allele frequencies using empirical Bayes methods is a powerful complement to the ever-increasing throughput of the sequencing technologies.

## Methods

### Bayesian framework

*f*. Assume also a probability

*e*of error in reading the bit. Then the probability of observing a "1" is

*p(f)*of the frequency

*f*and a prior for the error

*e*uniform on the interval [0,

*E*] for a fixed 0 ≤

*E*≤ 1. A random experiment produces a string of

*m*bits with

*n*"1"s. Assuming a binomial likelihood of the data, the posterior density on

*(f, e)*is

*C*ensures that $\int P\left(f,e\right)dfde=1$. For the marginal posterior on

*f*we obtain

An event/hypothesis *H* is a set of frequencies. We compute the prior *p(H)* and posterior *P(H)* of *H*, by summing/integrating over the frequencies in *H*. Following the Bayesian framework, *p(H)* reflects our prior knowledge about the hypothesis *H* and *P(H)* our belief in the hypothesis after we have observed the data/evidence.

If {*H*_{
1
}*, ..., H*_{
k
}} is a disjoint and complete set of hypotheses, then ${\sum}_{i}P\left({H}_{i}\right)=1$ and we assume the hypothesis with the highest posterior to hold. Alternatively, given a confidence threshold α, we can find a set of hypotheses, which are satisfied with confidence at least 1 - α. A particular choice for this set orders the hypotheses in decreasing order of their posterior and then selects the smallest *l* so that the total posterior weight of the top *l* hypotheses is at least 1 - α.

*p*

_{ 0 },

*p*

_{ 1/2 }, and

*p*

_{ 1 }, i.e.

Consider the hypothesis *H*_{
fair
} *= {½}* and *H*_{
fake
} *= {0,1}*. Interpreting the random experiment as a sequence of *m* coin flips of which *n* have come up "Heads", the posterior *P(H*_{
fair
}*)* reflects our belief that the coin we are observing is fair.

In general, a random experiment can consist of *k* independent pieces of data/evidence *(n*_{
1
}*, m*_{
1
}*), ..., (n*_{
k
}*, m*_{
k
}*)*. A hypothesis *H* in this setting is a set of sequences of frequencies. For example, for a given frequency *f* the hypothesis *H*_{
f
} *= {f = f*_{
1
} *= ... = f*_{
k
}*}* consists of all sequences which have all of their components equal to *f*. Since hypotheses are sets, we can combine them with the usual operations permitted on sets, i.e. union, set, and intersection. Thus, to continue our example, we can form the hypothesis *H*_{
eq
} consisting of all sequences with all components being equal, regardless of their common value, as the union ${\cup}_{f}{H}_{f}$. We define the prior probability *p(f*_{
1
}*, ..., f*_{
k
}*)* of a sequence of frequencies *(f*_{
1
}*, ..., f*_{
k
}*)* to be the product *p(f*_{
1
}*)...p(f*_{
k
}*)* of the priors of its components. Since the pieces of evidence are independent, we obtain the posterior probability *P(f*_{
1
}*, ..., f*_{
k
}*)* of the sequence to be the product *P(f*_{
1
}*)...P(f*_{
k
}*)* of the posteriors of its components. The prior/posterior of a hypothesis *H* is formed by summing the priors/posteriors of its elements.

### Genotyping a diploid organism

In sequencing data, each position observed by the sequencing instrument acquires an independent piece of evidence regarding the frequency of the allele at that position. Diploidity means that for each position there are three possible zigosity types: both homozygous and equal to the reference genome (two alleles equal to the reference), heterozygous (exactly one allele equal to the reference), or both homozygous and different from the reference genome (two alleles different from the reference). The reference genome in this case refers to the particular genome to which the reads of the sequencing experiment have been aligned. These three possibilities correspond naturally to three possible frequencies 0, ½, and 1 for the random experiment discussed in the general section on the Bayesian framework, where the frequency of a bit being "1" captures the frequency of the non-reference/variant allele.

In a given sample, the data at a particular position consists of the total number *m* of reads mapping at that position and the number *n* of those reads confirming the variant allele. We interpret the data *(n, m)* as the evidence for the zygosity type of the particular position. Following the Bayesian framework, we compute the posteriors *P*_{
0
}, *P*_{
½
}, and *P*_{
1
} for the three possible zygosity types and choose the type with highest posterior to be the type of the position. The upper bound *E* on the error in reading the nucleotide allele corresponds to the probability of a non-systematic/random sequencing error. Letting *Q* to be the average of the Phred scores provided by the sequencing instrument for all variant nucleotides mapping at the position, we set *E* to be *10*^{
-Q/10
}.

### Detection of variants in diploid genomes

To detect the presence of a variant at a particular position, we form two hypotheses one for the position being a variant (heterozygous or homozygous) and one for the position being a non-variant (homozygous for the reference), and consider their corresponding posteriors *P*_{
var
} and *P*_{
ref
}.

### Comparing variants across samples from diploid organisms

*(n*

_{ t }

*, m*

_{ t }

*)*and

*(n*

_{ n }

*, m*

_{ n }

*)*. Using this data one can assign the posterior probability to the hypothesis that the variant is somatic to be

and then set the non-somatic posterior to *P*_{
nsom
} *=* 1 *- P*_{
som
}.

### Detection of variants in non-diploid genomes

An objection to the approach outlined so far is that although the cancer genome descends from the diploid human genome, the zigosity of a particular allele can be distorted considerably by the oncogenic process, to the point where the diplodity of that allele is dubious. For example, it is known that wildly varying copy numbers are an important characteristic of many cancer genomes. Furthermore, impure samples containing several types of genomes, e.g. mixture of tumor and normal cells, can produce non-diploid frequencies. To account for this, one has to abandon the notion that alleles come in only three possibly frequencies, and allow for more possibilities in the priors and posteriors. Further applications of sequencing experiments, e.g. sequencing cellular transcriptomes or experiments characterizing metagenomic samples, point to the necessity of a wider vocabulary of allele frequencies.

To simplify our exposition, and since the resolution of current sequencing technologies is rather limited, we decided to consider as a possibility the set {0, 0.01, 0.02, 0.03, ..., 1} of frequencies with resolution 0.01. Under this setting for a given position in the sequenced genome we have a hypothesis for every possible frequency *f* of the non-reference allele with is corresponding posterior *P*_{
f
} obtained from the data *(n, m)* for that position. Given the posteriors probability of every possible frequency, one can find the most likely and the expected frequency. Furthermore, given a confidence threshold α, we can obtain a credible interval such that the posterior weight of the frequencies outside of this interval is at most α.

The decision whether a variant allele is present is based on selecting a frequency cutoff *f* and confidence threshold α < 0.5 and then calling present the variants for which the event "frequency at least *f*" has posterior confidence at least 1 - α. The frequency cutoff *f* can be chosen to be equal to 1%, i.e. variant is present, if it has non-zero frequency, or selected according to a set of validated variants, for example from a paired SNP array analysis.

### Comparing variants across samples from non-diploid genomes

*f*

_{ 1 }and

*f*

_{ 2 }we can obtain their posteriors ${P}_{1,{f}_{1}}$ and ${P}_{2,{f}_{2}}$ according to each of the samples. Then assuming that the samples are independent we can assign the posterior of the pair ${P}_{{f}_{1},{f}_{2}}$ to be ${P}_{1,{f}_{1}}\times {P}_{2,{f}_{2}}$. Next, for every difference

*Δ*we can form the posterior probability ${P}_{\mathrm{\Delta}}$ that the difference

*f*

_{ 1 }-

*f*

_{ 2 }is equal to

*Δ*, namely

Finally, we can obtain the most likely and the expected frequency difference, as well as a credible interval for confidence level 1 - α. One can use those posterior measurements to focus on a subset of the alleles, e.g. somatic alleles, which differentiate the two samples.

Similar to the case of detection of variants, the decision whether a variant allele has different frequency in the two samples is based on selecting a frequency difference cutoff *d* and confidence threshold α < 0.5. If one is interested only in variants in which the allele frequency has increased, for example somatic point mutations, then one can focus only on those variants for which the event "frequency difference at least *d*" has posterior confidence at least 1 - α. If in addition the goal is to obtain locations with a loss of allele, e.g. as part of obtaining the loci of a loss of heterozygosity, then the event is "frequency difference at least *d* in absolute value". The frequency cutoff *d* can be chosen to be equal to 1% or adjusted to an existing set of validated variant positions.

### Construction of priors

The prior distribution is an essential component of the Bayesian framework. Generally speaking, the prior captures our belief in a particular hypothesis before we have observed the data. From a practical point of view, when many repetitions of a random experiment are available our belief should coincide with the frequency of occurrence of the hypothesis.

For the application of this framework to sequencing data of diploid organisms, a prior should capture our knowledge about the relative prevalence of the three zygosity types in the genome of the organism. Ignoring our knowledge in that respect, we can set the prior distribution to be uniform over the three types. Concerning a sequencing experiment focused on the coding part of the human genome, a better-informed prior takes into consideration that this part of the human contains around 10^{7} nucleotides, and that the average number of variants (heterozygous and homozygous and different from the reference) with respect to a reference genome is around 10^{4} [21]. Furthermore, considering that the reference genome is given as haploid, whereas the data is obtained from the diploid cancer/normal samples one can show that the number of heterozygous variants is expected to be twice as many as the homozygous ones. This holds because fixing the order of the two homologues of the human genome, a heterozygous variant can be located on either one, whereas for the homozygous variant there is a single choice. This leads to a prior in which the proportion of the three zygosity types is roughly 10^{3}: 2: 1.

An empirical prior can be obtained if we acknowledge that we have two samples from a patient - from tumor and normal tissue. Hence one can estimate the proportion of the three zygosity types from one of the samples and use this estimate to form the prior for the other. In addition to this, an empirical prior can be obtained form previous sequencing experiment using the same sequencing technology. This approach is applicable also to the more general frequency setting.

where summation is over all pieces of data and ${P}_{i}\left(f|{D}_{j}\right)$ is the posterior probability of frequency *f* based on prior ${p}_{i}$ for the *j-* th piece of data. The empirical prior ${p}_{e}$ is a stationary point of this process. In practice we terminate the process as soon as there is no substantial change in the prior. For the data obtained from the sequencing the human exome and using frequencies of resolution 1% we used around 10 iterations at which point the Kullback-Leibler divergence between the consecutive priors was 0.0013 bits (see Figure 2).

In the case of the human genome the empirical priors obtained in this way were similar to the theoretical prior outlined above and exhibited the expected modes at 0.5 and 1 (variant alleles), and 0 (reference alleles). Furthermore, the frequency distribution in the vicinity of the 0.5 mode was observed to be fitted well by a beta distribution. This empirical observation is consistent with the classically established prescription for choosing the beta distribution as a conjugate prior to a binomial likelihood. Furthermore, the beta distribution was justified because we observed that the depths follow a negative binomial distribution, and hence the observed allele frequencies are a ratio of such distributions, which is approximated well by a beta distribution. As far as the observed negative binomial distribution of the depths, our conjecture is that this is related to the fact that locally the depth distribution is Poisson, but that the means of these Poissons vary widely due to the heterogeneity in the nucleotide content of the human genome and are long tailed over the whole genome. Hence globally the depth distribution is a convolution of Poisson distributions with a long tailed distribution of the means, which is the nature of the negative binomial distribution. The variance of the beta distribution fitting the distribution of the heterozygous alleles depends on the ability of the sequencing technology to detect the two alleles at a heterozygous locus: the more uncorrelated the observed frequencies of the two alleles the wider the variance.

### Choosing confidence cutoffs

Classically, the Bayesian framework prescribes the choice between two competing hypotheses to the one with higher posterior weight. In our case we follow this prescription with the addition that we do not make a decision in the case that the posterior weight of the more likely hypothesis is not high enough. Our decision is justified by the observation that, assuming independent random experiments, maximizing the posterior when deciding on a hypothesis in each will maximize the posterior over all choices because in this case the posterior over all choices is equal to the product of the individual choices. This is not necessarily true if the data is not independent and in this case a more sophisticated analysis is necessary. Furthermore, restricting our attention only to the experiments for which the most likely hypothesis has a high enough posterior guarantees a high enough overall posterior. We answer the question of how high is high enough by noting that if for each experiment the most likely hypothesis has a posterior confidence at least 1 - *α* and we have *N* experiments, then the overall posterior is at least ${\left(1-\alpha \right)}^{N}\approx 1-N\alpha $. Thus to reach a high enough overall posterior confidence, a reasonable posterior confidence for each experiment can be obtained by taking *α* proportional to the inverse of the number of experiments in a manner similar to the Bonferroni correction in the frequentist setting.

Since for current estimates the human exome contains around 10^{7} nucleotides, in the case of sequencing experiments related to that part of the human genome, when deciding a posteriori between two competing hypothesis, e.g. present/absent or germline/somatic, we can to take *α* = 10^{-6} for each genomic location.

## Declarations

### Acknowledgements

We would like to thank Joseph Chan, Riccardo dalla Favera, Oliver Elliott, Giulia Fabbri, and Hossein Khiabanian, for helpful discussions, comments, and help in the preparation of this manuscript.

**Declarations**

The publication of this article was funded by the Northeast Biodefence Center (U54-AI057158), the National Institute of Health (U54 CA121852-05), and the National Library of Medicine (1R01LM010140-01).

This article has been published as part of *BMC Systems Biology* Volume 7 Supplement 2, 2013: Selected articles from The 6th International Conference of Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S2.

## Authors’ Affiliations

## References

- Chin L, Hahn W, Getz G, Meyerson M: Making sense of cancer genomic data. Genes Dev. 2011, 25 (6): 534-55. 10.1101/gad.2017311.PubMed CentralView ArticlePubMedGoogle Scholar
- Tiacci E, et al.: BRAF mutations in hairy-cell leukemia. N Engl J Med. 2011, 364 (24): 2305-15. 10.1056/NEJMoa1014209.PubMed CentralView ArticlePubMedGoogle Scholar
- Pasqualucci L, et al.: Inactivating mutations of acetyltransferase genes in B-cell lymphoma. Nature. 2011, 471: 189-195. 10.1038/nature09730.PubMed CentralView ArticlePubMedGoogle Scholar
- Morin R, et al.: Somatic mutations altering EZH2 (Tyr641) in follicular and diffuse large B-cell lymphomas of germinal-center origin. Nature Genetics. 2010, 42: 181-185. 10.1038/ng.518.PubMed CentralView ArticlePubMedGoogle Scholar
- Shah S, et al.: Mutational evolution in a lobular breast tumor profiled at single nucleotide resolution. Nature. 2009, 461: 809-813. 10.1038/nature08489.View ArticlePubMedGoogle Scholar
- Ley T, et al.: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008, 456: 66-72. 10.1038/nature07485.PubMed CentralView ArticlePubMedGoogle Scholar
- Mardis E, et al.: Recurring mutations found by sequencing an acute myeloid leukemia genome. N Engl J Med. 2009, 361: 1058-1066. 10.1056/NEJMoa0903840.PubMed CentralView ArticlePubMedGoogle Scholar
- Sjöblom T, et al.: The Consensus Coding Sequences of Human Breast and Colorectal Cancers. Science. 2006, 314 (5797): 268-274. 10.1126/science.1133427.View ArticlePubMedGoogle Scholar
- Vogelstein B, Kinzler K: Cancer genes and the pathways they control. Nat Med. 2004, 10: 789-799. 10.1038/nm1087.View ArticlePubMedGoogle Scholar
- Stratton M, Campbell P, Futreal A: The cancer genome. Nature. 2009, 458: 719-724. 10.1038/nature07943.PubMed CentralView ArticlePubMedGoogle Scholar
- Fabbri G, et al.: Analysis of the Chronic Lymphocytic Leukemia Coding Genome: Role of
*NOTCH1*Mutational Activation. Journal of Experimental Biology. 2011, 208 (7): 1389-401.Google Scholar - Li M, Wang I, Li Y, Bruzel A, Richards A, Toung J, Cheung V: Widespread RNA and DNA Sequence Differences in the Human Transcriptome. Science. 2011, 333 (6038): 53-58. 10.1126/science.1207018.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, et al.: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- Li R, et al.: SOAP: short oligonucleotide alignment program. BMC Bioinformatics. 2008, 24: 713-714. 10.1093/bioinformatics/btn025.View ArticleGoogle Scholar
- Goya R, et al.: SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. BMC Bioinformatics. 2010, 26 (6): 730-6. 10.1093/bioinformatics/btq040.View ArticleGoogle Scholar
- Robbins H: An Empirical Bayes Approach to Statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. 1956, 157-163. Volume 1: Contributions to the Theory of StatisticsGoogle Scholar
- Carlin B, Louis T: Bayes and Empirical Bayes Methods for Data Analysis (2nd ed.). 2000, Chapman & Hall/CRCView ArticleGoogle Scholar
- Hardcastle T, et al.: baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010, 11: 422-10.1186/1471-2105-11-422.PubMed CentralView ArticlePubMedGoogle Scholar
- Jones S, et al.: Comparative lesion sequencing provides insights into tumor evolution. Proc Natl Acad Sci USA. 2008, 105 (11): 4283-8. 10.1073/pnas.0712345105.PubMed CentralView ArticlePubMedGoogle Scholar
- Casella G: An Introduction to Empirical Bayes Data Analysis. The American Statistician. 1985, 39 (2): 83-87. 10.2307/2682801.Google Scholar
- 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.