Bayesian framework
Consider a random source which produces bits ("0"s and "1"s) with the probability/frequency of a "1" equal to f. Assume also a probability e of an error in reading a bit. Then the probability of observing a "1" is

P("1") = f(1 - e) + (1 - f)e.
Assume a prior density p(f) of the frequency f and a prior for the error e uniform on the interval [0, E] for a fixed 0 ≤ E ≤ 1. A random experiment produces a string of m bits with n "1"s. Assuming a binomial likelihood of the data, the posterior density on (f, e) is

P(f, e) = C · [f(1 - e) + (1 - f)e]^n · [1 - f(1 - e) - (1 - f)e]^(m-n) · p(f) / E,   for 0 ≤ e ≤ E,
where the constant C ensures that ∫∫ P(f, e) df de = 1. For the marginal posterior on f we obtain

P(f) = C · p(f) · L(f),

where

L(f) = (1/E) ∫_0^E [f(1 - e) + (1 - f)e]^n [1 - f(1 - e) - (1 - f)e]^(m-n) de.
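As an illustration, the marginal posterior over a grid of frequencies can be computed numerically by averaging the binomial kernel over the uniform error prior. This is only a sketch of the computation described above; the function names and the discretization of the integral over e are our own choices.

```python
import numpy as np

def observed_one_prob(f, e):
    # Probability of reading a "1": a true "1" read correctly,
    # or a true "0" misread as a "1".
    return f * (1.0 - e) + (1.0 - f) * e

def marginal_likelihood(f, n, m, E, n_grid=200):
    # L(f): average of the binomial kernel over the error e ~ Uniform[0, E].
    if E == 0.0:
        g = observed_one_prob(f, 0.0)
        return g**n * (1.0 - g)**(m - n)
    e = np.linspace(0.0, E, n_grid)
    g = observed_one_prob(f, e)
    return float(np.mean(g**n * (1.0 - g)**(m - n)))

def posterior_on_grid(freqs, prior, n, m, E):
    # P(f) proportional to p(f) * L(f); normalization plays the role of C.
    like = np.array([marginal_likelihood(f, n, m, E) for f in freqs])
    post = prior * like
    return post / post.sum()
```

For example, for a string with n = 48 ones out of m = 100 bits and a prior concentrated on {0, ½, 1}, the posterior mass lands almost entirely on f = ½.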
An event/hypothesis H is a set of frequencies. We compute the prior p(H) and the posterior P(H) of H by summing/integrating over the frequencies in H. Following the Bayesian framework, p(H) reflects our prior knowledge about the hypothesis H, and P(H) our belief in the hypothesis after we have observed the data/evidence.
If {H_1, ..., H_k} is a disjoint and complete set of hypotheses, then P(H_1) + ... + P(H_k) = 1, and we assume the hypothesis with the highest posterior to hold. Alternatively, given a confidence threshold α, we can find a set of hypotheses which are satisfied with confidence at least 1 - α. A particular choice for this set orders the hypotheses in decreasing order of their posterior and then selects the smallest l so that the total posterior weight of the top l hypotheses is at least 1 - α.
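The selection of the smallest credible set of hypotheses described above can be sketched as follows (the function name is ours):

```python
def credible_hypothesis_set(posteriors, alpha):
    # Order hypotheses by decreasing posterior and keep the smallest prefix
    # whose total posterior weight is at least 1 - alpha.
    order = sorted(range(len(posteriors)), key=lambda i: -posteriors[i])
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += posteriors[i]
        if total >= 1.0 - alpha:
            break
    return chosen
```

With posteriors (0.6, 0.3, 0.1) and α = 0.15, the set {0, 1} already carries weight 0.9 ≥ 0.85, so the third hypothesis is dropped.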
A particular example of the above framework is obtained by taking the prior to be concentrated on 0, ½, and 1 with densities p_0, p_½, and p_1, i.e.

p(f) = p_0 · δ_0(f) + p_½ · δ_½(f) + p_1 · δ_1(f),

where δ_x denotes the point mass at x.
Consider the hypotheses H_fair = {½} and H_fake = {0, 1}. Interpreting the random experiment as a sequence of m coin flips of which n have come up "Heads", the posterior P(H_fair) reflects our belief that the coin we are observing is fair.
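For this coin example, the posteriors can be computed in closed form when the read error is taken to be zero (a simplification of the model above; the helper below is illustrative):

```python
from math import comb

def coin_posteriors(n, m, p0=1/3, p_half=1/3, p1=1/3):
    # Likelihoods of n heads in m flips under the three point-mass frequencies;
    # f = 0 (resp. f = 1) is only possible if the flips are all tails (resp. heads).
    like = {0.0: 1.0 if n == 0 else 0.0,
            0.5: comb(m, n) * 0.5**m,
            1.0: 1.0 if n == m else 0.0}
    weights = {0.0: p0, 0.5: p_half, 1.0: p1}
    unnorm = {f: weights[f] * like[f] for f in like}
    z = sum(unnorm.values())
    return {f: w / z for f, w in unnorm.items()}
```

Ten heads out of ten flips make P(H_fake) = P(0) + P(1) overwhelm P(H_fair), while any mixed outcome forces P(H_fair) = 1 under this error-free simplification.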
In general, a random experiment can consist of k independent pieces of data/evidence (n_1, m_1), ..., (n_k, m_k). A hypothesis H in this setting is a set of sequences of frequencies. For example, for a given frequency f the hypothesis H_f = {f = f_1 = ... = f_k} consists of all sequences which have all of their components equal to f. Since hypotheses are sets, we can combine them with the usual set operations, i.e. union, intersection, and complement. Thus, to continue our example, we can form the hypothesis H_eq, consisting of all sequences with all components equal, regardless of their common value, as the union H_eq = ∪_f H_f. We define the prior probability p(f_1, ..., f_k) of a sequence of frequencies (f_1, ..., f_k) to be the product p(f_1) ⋯ p(f_k) of the priors of its components. Since the pieces of evidence are independent, the posterior probability P(f_1, ..., f_k) of the sequence is the product P(f_1) ⋯ P(f_k) of the posteriors of its components. The prior/posterior of a hypothesis H is formed by summing the priors/posteriors of its elements.
Genotyping a diploid organism
In sequencing data, each position observed by the sequencing instrument acquires an independent piece of evidence regarding the frequency of the allele at that position. Diploidy means that for each position there are three possible zygosity types: homozygous and equal to the reference genome (both alleles equal to the reference), heterozygous (exactly one allele equal to the reference), or homozygous and different from the reference genome (both alleles different from the reference). The reference genome in this case refers to the particular genome to which the reads of the sequencing experiment have been aligned. These three possibilities correspond naturally to the three possible frequencies 0, ½, and 1 of the random experiment discussed in the general section on the Bayesian framework, where the frequency of a bit being "1" captures the frequency of the non-reference/variant allele.
In a given sample, the data at a particular position consists of the total number m of reads mapping at that position and the number n of those reads confirming the variant allele. We interpret the data (n, m) as the evidence for the zygosity type of the particular position. Following the Bayesian framework, we compute the posteriors P_0, P_½, and P_1 for the three possible zygosity types and choose the type with the highest posterior to be the type of the position. The upper bound E on the error in reading the nucleotide allele corresponds to the probability of a non-systematic/random sequencing error. Letting Q be the average of the Phred scores provided by the sequencing instrument for all variant nucleotides mapping at the position, we set E = 10^(-Q/10).
Detection of variants in diploid genomes
To detect the presence of a variant at a particular position, we form two hypotheses, one for the position being a variant (heterozygous or homozygous) and one for the position being a non-variant (homozygous for the reference), and consider their corresponding posteriors P_var and P_ref.
Comparing variants across samples from diploid organisms
The discussion so far has focused on the reads obtained from the sequencing of a single sample. In the cancer sequencing experiment discussed in this paper, we have two types of samples for each patient: one from the cancer tissue and one from a normal tissue. The goal is to set the variants obtained from the cancer tissue against the background provided by the normal and obtain the somatic variants, which are present in the tumor tissue but do not appear in the normal. Thus, the evidence for the lineage type, i.e. germline vs. somatic, of the variant at a particular position in the genome consists of the two independent pieces of evidence for the zygosity type of that position in the tumor and the normal tissue. Since the data for the zygosity type is simply the total counts, the evidence for the lineage type is (n_t, m_t) and (n_n, m_n). Using this data one can assign the posterior probability of the hypothesis that the variant is somatic to be

P_som = P_var^t · P_ref^n,

i.e. the product of the posterior for a variant in the tumor and the posterior for the reference type in the normal, and then set the non-somatic posterior to P_nsom = 1 - P_som.
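Given the per-sample zygosity posteriors, the somatic posterior is a product of independent terms. A minimal sketch (triples ordered as (P_0, P_½, P_1); the function name is ours):

```python
def somatic_posterior(post_tumor, post_normal):
    # Somatic: variant (heterozygous or homozygous) in the tumor
    # AND homozygous reference in the normal; the samples are independent.
    p_var_tumor = post_tumor[1] + post_tumor[2]
    p_ref_normal = post_normal[0]
    p_som = p_var_tumor * p_ref_normal
    return p_som, 1.0 - p_som
```

For a confidently heterozygous tumor call over a confidently reference normal, P_som is close to 1.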
Detection of variants in non-diploid genomes
An objection to the approach outlined so far is that, although the cancer genome descends from the diploid human genome, the zygosity of a particular allele can be distorted considerably by the oncogenic process, to the point where the diploidy of that allele is dubious. For example, it is known that wildly varying copy numbers are an important characteristic of many cancer genomes. Furthermore, impure samples containing several types of genomes, e.g. a mixture of tumor and normal cells, can produce non-diploid frequencies. To account for this, one has to abandon the notion that alleles come in only three possible frequencies and allow for more possibilities in the priors and posteriors. Further applications of sequencing experiments, e.g. sequencing cellular transcriptomes or experiments characterizing metagenomic samples, point to the necessity of a wider vocabulary of allele frequencies.
To simplify our exposition, and since the resolution of current sequencing technologies is rather limited, we decided to consider the set {0, 0.01, 0.02, 0.03, ..., 1} of frequencies with resolution 0.01. Under this setting, for a given position in the sequenced genome we have a hypothesis for every possible frequency f of the non-reference allele, with its corresponding posterior P_f obtained from the data (n, m) for that position. Given the posterior probability of every possible frequency, one can find the most likely and the expected frequency. Furthermore, given a confidence threshold α, we can obtain a credible interval such that the posterior weight of the frequencies outside of this interval is at most α.
The decision whether a variant allele is present is based on selecting a frequency cutoff f and a confidence threshold α < 0.5, and then calling present the variants for which the event "frequency at least f" has posterior confidence at least 1 - α. The frequency cutoff f can be chosen to be equal to 1%, i.e. a variant is present if it has non-zero frequency, or selected according to a set of validated variants, for example from a paired SNP array analysis.
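On the 1%-resolution grid, the posterior, its mode and mean, and a credible interval can be computed as follows (a sketch using a plain binomial likelihood, i.e. the read error is ignored here; the function names are ours):

```python
import numpy as np

def frequency_posterior(n, m, prior=None, step=0.01):
    # Posterior over the grid {0, 0.01, ..., 1} for data (n, m).
    freqs = np.round(np.arange(0.0, 1.0 + step / 2, step), 10)
    if prior is None:
        prior = np.ones_like(freqs) / len(freqs)
    like = freqs**n * (1.0 - freqs)**(m - n)   # numpy: 0.0**0 == 1.0
    post = prior * like
    return freqs, post / post.sum()

def summarize(freqs, post, alpha=0.05):
    # MAP frequency, posterior mean, and a credible interval built from the
    # highest-posterior frequencies until their weight reaches 1 - alpha.
    map_f = float(freqs[np.argmax(post)])
    mean_f = float(np.dot(freqs, post))
    order = np.argsort(post)[::-1]
    cum = np.cumsum(post[order])
    kept = order[: np.searchsorted(cum, 1.0 - alpha) + 1]
    return map_f, mean_f, (float(freqs[kept].min()), float(freqs[kept].max()))
```

For n = 30 variant reads out of m = 100 the most likely frequency is 0.30, with the credible interval concentrated around it.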
Comparing variants across samples from non-diploid genomes
The setting can be extended to the comparison of two samples in the following way. For every two frequencies f_1 and f_2 we can obtain their posteriors P^(1)(f_1) and P^(2)(f_2) according to each of the samples. Then, assuming that the samples are independent, we can assign the posterior of the pair to be P(f_1, f_2) = P^(1)(f_1) · P^(2)(f_2). Next, for every difference Δ we can form the posterior probability that the difference f_1 - f_2 is equal to Δ, namely

P(Δ) = Σ_{f_1 - f_2 = Δ} P^(1)(f_1) · P^(2)(f_2).
Finally, we can obtain the most likely and the expected frequency difference, as well as a credible interval for confidence level 1 - α. One can use those posterior measurements to focus on a subset of the alleles, e.g. somatic alleles, which differentiate the two samples.
Similarly to the detection of variants, the decision whether a variant allele has different frequency in the two samples is based on selecting a frequency difference cutoff d and a confidence threshold α < 0.5. If one is interested only in variants in which the allele frequency has increased, for example somatic point mutations, then one can focus only on those variants for which the event "frequency difference at least d" has posterior confidence at least 1 - α. If in addition the goal is to obtain locations with a loss of an allele, e.g. as part of obtaining the loci of a loss of heterozygosity, then the event is "frequency difference at least d in absolute value". The difference cutoff d can be chosen to be equal to 1% or adjusted to an existing set of validated variant positions.
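On a common grid the posterior of the difference Δ = f_1 - f_2 is a discrete cross-correlation of the two per-sample posteriors, as in the sum over pairs above. A small sketch (the function name is ours):

```python
import numpy as np

def difference_posterior(post1, post2, freqs):
    # P(Delta) = sum over pairs with f1 - f2 == Delta of P1(f1) * P2(f2);
    # freqs is an evenly spaced grid shared by both posteriors.
    k = len(freqs)
    step = freqs[1] - freqs[0]
    deltas = np.round(np.arange(-1.0, 1.0 + 1e-9, step), 10)
    p_delta = np.zeros(len(deltas))
    for i in range(k):
        for j in range(k):
            d_idx = (i - j) + (k - 1)   # maps f_i - f_j onto the delta grid
            p_delta[d_idx] += post1[i] * post2[j]
    return deltas, p_delta
```

Calling a somatic-like change then amounts to checking that the posterior weight of the event {Δ ≥ d} is at least 1 - α.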
Construction of priors
The prior distribution is an essential component of the Bayesian framework. Generally speaking, the prior captures our belief in a particular hypothesis before we have observed the data. From a practical point of view, when many repetitions of a random experiment are available our belief should coincide with the frequency of occurrence of the hypothesis.
For the application of this framework to sequencing data of diploid organisms, a prior should capture our knowledge about the relative prevalence of the three zygosity types in the genome of the organism. Ignoring our knowledge in that respect, we can set the prior distribution to be uniform over the three types. For a sequencing experiment focused on the coding part of the human genome, a better-informed prior takes into consideration that this part of the human genome contains around 10^7 nucleotides, and that the average number of variants (heterozygous, and homozygous and different from the reference) with respect to a reference genome is around 10^4 [21]. Furthermore, considering that the reference genome is given as haploid, whereas the data is obtained from the diploid cancer/normal samples, one can show that the number of heterozygous variants is expected to be twice that of the homozygous ones. This holds because, fixing the order of the two homologues of the human genome, a heterozygous variant can be located on either one, whereas for a homozygous variant there is a single choice. This leads to a prior in which the proportion of the three zygosity types is roughly 10^3 : 2 : 1.
An empirical prior can be obtained if we acknowledge that we have two samples from a patient: one from tumor and one from normal tissue. Hence one can estimate the proportion of the three zygosity types from one of the samples and use this estimate to form the prior for the other. In addition, an empirical prior can be obtained from previous sequencing experiments using the same sequencing technology. This approach is also applicable to the more general frequency setting.
More precisely, to obtain an empirical prior we use the fact that a sequencing experiment consists of multiple random experiments, one for each location of the genome. Thus an empirical prior based on a particular sequencing experiment should reflect the allele distribution observed in the genome by that experiment. Since in the Bayesian framework the prior and the likelihood determine the distribution of the data as a certain marginal, the empirical prior derived from a particular experiment should predict the distribution of the data as observed in that experiment. In particular we must have

p_emp(f) = Σ_{(n, m)} D_emp(n, m) · P_emp(f | n, m),

where p_emp is the empirical prior, P_emp(f | n, m) the posterior based on that prior, and D_emp the empirically observed distribution of the data. Thus we establish the following iterative procedure for constructing an empirical prior. Let p^(0) be the prior before the experiment. In the case of lack of prior knowledge we can set p^(0) to be the uniform distribution. Then on iteration i + 1 we set

p^(i+1)(f) = (1/k) Σ_{j=1}^{k} P^(i)(f | n_j, m_j),

where the summation is over all k pieces of data and P^(i)(f | n_j, m_j) is the posterior probability of frequency f based on the prior p^(i) for the j-th piece of data. The empirical prior is a stationary point of this process. In practice we terminate the process as soon as there is no substantial change in the prior. For the data obtained from sequencing the human exome and using frequencies of resolution 1%, we used around 10 iterations, at which point the Kullback-Leibler divergence between consecutive priors was 0.0013 bits (see Figure 2).
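The iteration prior ← average posterior can be sketched as follows (the stopping rule on the KL divergence mirrors the 0.0013-bit criterion mentioned above; the function name and defaults are our own, and the read error is again ignored for brevity):

```python
import numpy as np

def empirical_prior(data, freqs, n_iter=10, tol_bits=1e-3):
    # data: list of (n, m) pairs, one per genomic position.
    # Iterate prior <- average posterior until the KL divergence (in bits)
    # between consecutive priors drops below tol_bits.
    prior = np.ones(len(freqs)) / len(freqs)
    for _ in range(n_iter):
        new_prior = np.zeros_like(prior)
        for n, m in data:
            like = freqs**n * (1.0 - freqs)**(m - n)  # numpy: 0.0**0 == 1.0
            post = prior * like
            new_prior += post / post.sum()
        new_prior /= len(data)
        nz = new_prior > 0
        kl = float(np.sum(new_prior[nz] * np.log2(new_prior[nz] / prior[nz])))
        prior = new_prior
        if kl < tol_bits:
            break
    return prior
```

On synthetic data mixing reference, heterozygous, and homozygous positions, the resulting prior develops modes near 0, 0.5, and 1 in proportion to their abundance.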
In the case of the human genome, the empirical priors obtained in this way were similar to the theoretical prior outlined above and exhibited the expected modes at 0.5 and 1 (variant alleles) and at 0 (reference alleles). Furthermore, the frequency distribution in the vicinity of the 0.5 mode was observed to be fitted well by a beta distribution. This empirical observation is consistent with the classically established prescription of choosing the beta distribution as a conjugate prior to a binomial likelihood. The beta distribution is further justified because we observed that the depths follow a negative binomial distribution, and hence the observed allele frequencies are a ratio of such distributions, which is approximated well by a beta distribution. As for the observed negative binomial distribution of the depths, our conjecture is that it arises because locally the depth distribution is Poisson, but the means of these Poissons vary widely due to the heterogeneity in the nucleotide content of the human genome and are long-tailed over the whole genome. Hence globally the depth distribution is a mixture of Poisson distributions with a long-tailed distribution of the means, which is the nature of the negative binomial distribution. The variance of the beta distribution fitting the distribution of the heterozygous alleles depends on the ability of the sequencing technology to detect the two alleles at a heterozygous locus: the more uncorrelated the observed frequencies of the two alleles, the wider the variance.
Choosing confidence cutoffs
Classically, the Bayesian framework prescribes choosing, between two competing hypotheses, the one with the higher posterior weight. We follow this prescription, with the addition that we do not make a decision in the case that the posterior weight of the more likely hypothesis is not high enough. This is justified by the observation that, assuming independent random experiments, maximizing the posterior when deciding on a hypothesis in each experiment maximizes the posterior over all choices, because in this case the posterior over all choices is equal to the product of the individual posteriors. This is not necessarily true if the data is not independent, in which case a more sophisticated analysis is necessary. Furthermore, restricting our attention only to the experiments for which the most likely hypothesis has a high enough posterior guarantees a high enough overall posterior. We answer the question of how high is high enough by noting that if for each experiment the most likely hypothesis has a posterior confidence of at least 1 - α and we have N experiments, then the overall posterior is at least (1 - α)^N ≥ 1 - Nα. Thus, to reach a high enough overall posterior confidence, a reasonable per-experiment posterior confidence can be obtained by taking α proportional to the inverse of the number of experiments, in a manner similar to the Bonferroni correction in the frequentist setting.
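The bound (1 - α)^N ≥ 1 - Nα translates into the following Bonferroni-style choice of the per-experiment threshold (the helper names are ours):

```python
def per_experiment_alpha(overall_alpha, n_experiments):
    # Taking alpha proportional to 1/N keeps the overall posterior
    # confidence at roughly 1 - overall_alpha, since (1 - a)^N >= 1 - N*a.
    return overall_alpha / n_experiments

def overall_confidence(alpha, n_experiments):
    # Lower bound on the overall posterior when each of N independent
    # experiments is decided with confidence at least 1 - alpha.
    return (1.0 - alpha) ** n_experiments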
Since by current estimates the human exome contains around 10^7 nucleotides, in sequencing experiments targeting that part of the human genome, when deciding a posteriori between two competing hypotheses, e.g. present/absent or germline/somatic, we can take α = 10^-6 for each genomic location.