Volume 6 Supplement 2
Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012)
Two combinatorial optimization problems for SNP discovery using base-specific cleavage and mass spectrometry
 Xin Chen^{1},
 Qiong Wu^{1, 2},
 Ruimin Sun^{1} and
 Louxin Zhang^{3}
DOI: 10.1186/1752-0509-6-S2-S5
© Chen et al.; licensee BioMed Central Ltd. 2012
Published: 12 December 2012
Abstract
Background
The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of this SNP discovery approach.
Results
In this study, we formulate two new combinatorial optimization problems. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as $\mathsf{SNPMS}_{\mathcal{P}}$, limits its search to sequences whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast, the second problem, denoted as $\mathsf{SNPMS}_{\mathcal{Q}}$, limits its search to sequences whose in silico predicted mass spectra instead contain all the signals of the measured mass spectra. We present an exact dynamic programming algorithm for solving the $\mathsf{SNPMS}_{\mathcal{P}}$ problem and also show that the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem is NP-hard by a reduction from a restricted variation of the 3-partition problem.
Conclusions
We believe that an efficient solution to either problem above could offer a seamless integration of information in four complementary base-specific cleavage reactions, thereby improving the capability of the underlying biotechnology for sensitive and accurate SNP discovery.
Background
Single nucleotide polymorphisms (SNPs) are a common type of DNA sequence variation that occurs when a single nucleotide base is altered at a specific locus. They are among the most important genetic factors that contribute to human disease and biological functions. However, discovering novel SNPs is a scientifically challenging task. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry [1–3].
The SNP discovery approach based on base-specific cleavage and mass spectrometry usually adopts the data-acquisition procedure summarized below. First, a target sample DNA sequence is PCR-amplified using primers that incorporate the T7 promoter sequences. Then, the PCR products are in vitro transcribed and subsequently digested with the endonuclease RNase A in four base-specific cleavage reactions. Each reaction cleaves the sample sequence to completion at every locus where a specific base is found. Finally, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is applied to the cleavage products, resulting in four measured mass spectra, each corresponding to one base-specific cleavage reaction.
The early proof-of-concept studies of the above SNP discovery approach were presented in [3–5], where the identification of SNPs was however done by visual inspection. Shortly afterwards, two automated computational solutions were developed [1, 2]: one was implemented in the proprietary MassARRAY™ SNP Discovery software package from Sequenom, Inc., and the other in the software package RNaseCut, which is freely available online [6]. In particular, the solution in [1] mainly comprises two separate procedures. It first computes all potential SNPs that give rise to each unanticipated base composition and then scores them by taking into account the mass spectrometry data from the four base-specific cleavage reactions. Thus, the four base-specific cleavage reactions are integrated only in the second step. Such an integration strategy is far from optimal, since the first step at least assumes that the occurrences of potential SNPs are independent.
In this paper, we study two new combinatorial optimization problems to exploit the full potential of the above SNP discovery approach. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as $\mathsf{SNPMS}_{\mathcal{P}}$, limits its search to sequences whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast, the second problem, denoted as $\mathsf{SNPMS}_{\mathcal{Q}}$, limits its search to sequences whose in silico predicted mass spectra instead contain all the signals of the measured mass spectra. Then, we present an exact dynamic programming algorithm for solving the $\mathsf{SNPMS}_{\mathcal{P}}$ problem and also show that the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem is NP-hard by a reduction from the restricted variation of the 3-partition problem [7, 8].
Methods
Preliminaries
Let s ∈ Σ* denote a string over the four-base alphabet $\Sigma =\{\mathsf{A},\mathsf{C},\mathsf{G},\mathsf{T}\}$. The length of s is denoted by |s|, the ith base of s by s[i], and the substring of s from the ith base to the jth base by s[i, j], for 1 ≤ i ≤ j ≤ |s|. We use ε to denote the empty string so that |ε| = 0. The concatenation of two strings s and t is denoted by s · t, and the concatenation of l copies of a string s is denoted by s^{ l }.
Given a string s and a cut base $x\in \Sigma$, a cleavage fragment refers to a substring of s that does not contain x and that cannot be extended on either side without crossing a base x. Formally, the substring s[i, j] is a cleavage fragment with respect to the cut base x if the following three conditions are satisfied: (i) s[i − 1] = x if i ≠ 1, (ii) s[j + 1] = x if j ≠ |s|, and (iii) s[k] ≠ x, ∀k ∈ [i, j]. In addition, the empty string ε is a cleavage fragment if there exists i ∈ [1, |s| − 1] such that s[i] = s[i + 1] = x. Given a cleavage fragment, we use $\mathsf{A}_{i}\mathsf{C}_{j}\mathsf{G}_{k}\mathsf{T}_{l}$ to denote its base composition of i As, j Cs, k Gs, and l Ts. In [1], this base composition is termed a compomer of the string s with respect to the cut base x. The whole set of compomers is hence called the compomer spectrum of the string s with respect to the cut base x, and denoted by $\mathcal{C}_{x}(s)$. Finally, let $\mathcal{C}_{\Sigma}(s)=\{\mathcal{C}_{x}(s):x\in \Sigma\}=\{\mathcal{C}_{\mathsf{A}}(s),\mathcal{C}_{\mathsf{C}}(s),\mathcal{C}_{\mathsf{G}}(s),\mathcal{C}_{\mathsf{T}}(s)\}$ be the collection of four compomer spectra of the string s, each generated with one cut base.
Example 1 Let s := ACATGCTACATTA. Then, the string s contains four cleavage fragments with respect to the cut base A: C, TGCT, C, and TT. With respect to the cut base T, it instead contains five cleavage fragments: ACA, GC, ACA, ε, and A. Their respective compomer spectra are $\mathcal{C}_{\mathsf{A}}(s)=\{\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{0}\mathsf{T}_{0},\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{2},\mathsf{A}_{0}\mathsf{C}_{0}\mathsf{G}_{0}\mathsf{T}_{2}\}$ and $\mathcal{C}_{\mathsf{T}}(s)=\{\mathsf{A}_{2}\mathsf{C}_{1}\mathsf{G}_{0}\mathsf{T}_{0},\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{0},\mathsf{A}_{0}\mathsf{C}_{0}\mathsf{G}_{0}\mathsf{T}_{0},\mathsf{A}_{1}\mathsf{C}_{0}\mathsf{G}_{0}\mathsf{T}_{0}\}$. Note that each compomer appears in a compomer spectrum at most once.
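The definitions above translate almost directly into code. The following Python sketch enumerates cleavage fragments and compomer spectra and can be checked against Example 1. The function names and the encoding of a compomer as a tuple (number of As, Cs, Gs, Ts) are illustrative assumptions, not notation from the paper.

```python
from collections import Counter

ALPHABET = "ACGT"

def cleavage_fragments(s, x):
    """Return the cleavage fragments of s for cut base x, left to right."""
    pieces = s.split(x)
    last = len(pieces) - 1
    fragments = []
    for idx, piece in enumerate(pieces):
        at_end = idx == 0 or idx == last
        # An empty piece at either end only means that s starts or ends
        # with x, which yields no fragment. An empty piece in the middle
        # comes from two adjacent cut bases and is the empty fragment.
        if piece or not at_end:
            fragments.append(piece)
    return fragments

def compomer(fragment):
    """Base composition of a fragment as (num A, num C, num G, num T)."""
    counts = Counter(fragment)
    return tuple(counts[b] for b in ALPHABET)

def compomer_spectrum(s, x):
    """Set of compomers of s with respect to cut base x."""
    return {compomer(f) for f in cleavage_fragments(s, x)}

s = "ACATGCTACATTA"
fragments_A = cleavage_fragments(s, "A")
fragments_T = cleavage_fragments(s, "T")
spectrum_A = compomer_spectrum(s, "A")
spectrum_T = compomer_spectrum(s, "T")
```

Because a spectrum is a set, the two fragments C (and likewise the two fragments ACA) contribute a single compomer each, matching the remark that each compomer appears at most once.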
Problem formulation
Let d_{ H } $\left(s,{s}^{\prime}\right)$ denote the Hamming distance between two strings s and ${s}^{\prime}$ of equal length. It measures the minimum number of substitutions required to transform one string into the other. Given a collection of compomer spectra ${\mathcal{C}}_{\Sigma}=\left\{{\mathcal{C}}_{x}:x\in \Sigma \right\}$ of an unknown string ${s}^{\prime}$ (i.e., the sample DNA sequence experimented) which can in principle be generated from a mass spectrometry experiment, and a string s (i.e., the reference DNA sequence) which is believed to differ from the unknown string ${s}^{\prime}$ by a number of substitutions only, we formulate below two combinatorial optimization problems for SNP discovery.
Definition 2 (The $\mathsf{SNPMS}_{\mathcal{P}}$ problem) Given a string s and a collection of compomer spectra $\mathcal{C}_{\Sigma}=\{\mathcal{C}_{x}:x\in \Sigma\}$, find a string $s'$ such that $\mathcal{C}_{x}(s')\subseteq \mathcal{C}_{x}$ for all $x\in \Sigma$ and $d_{H}(s,s')$ is minimized.
Definition 3 (The $\mathsf{SNPMS}_{\mathcal{Q}}$ problem) Given a string s and a collection of compomer spectra $\mathcal{C}_{\Sigma}=\{\mathcal{C}_{x}:x\in \Sigma\}$, find a string $s'$ such that $\mathcal{C}_{x}\subseteq \mathcal{C}_{x}(s')$ for all $x\in \Sigma$ and $d_{H}(s,s')$ is minimized.
The only difference between the above two problem formulations is that one requires ${\mathcal{C}}_{x}\left({s}^{\prime}\right)\subseteq {\mathcal{C}}_{x}$ and the other requires ${\mathcal{C}}_{x}\subseteq {\mathcal{C}}_{x}\left({s}^{\prime}\right)$, for all the cut bases. Once the string ${s}^{\prime}$ is found, it is easy to identify the SNPs in ${s}^{\prime}$, i.e., those base substitutions that transform ${s}^{\prime}$ into s.
The feasible solutions to the $\mathsf{SNPMS}_{\mathcal{P}}$ problem for the above instance include strings such as ATATA, TATAT, TTATT, ATATT, and ATTAT. Their respective Hamming distances to the input string s are 2, 3, 2, 1, and 1. The string $s'$ = TTAAT is not a feasible solution because the compomer $\mathsf{A}_{2}\mathsf{T}_{0}\in \mathcal{C}_{\mathsf{T}}(s')$ but $\mathsf{A}_{2}\mathsf{T}_{0}\notin \mathcal{C}_{\mathsf{T}}$, so that $\mathcal{C}_{\mathsf{T}}(s')\nsubseteq \mathcal{C}_{\mathsf{T}}$.
The feasible solutions to the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem for the above instance include strings such as TTATA, TATTA, ATATT, and ATTAT. Their respective Hamming distances to the input string s are 3, 5, 1, and 1. The string $s'$ = TTAAT is not a feasible solution because the compomer $\mathsf{A}_{1}\mathsf{T}_{0}\in \mathcal{C}_{\mathsf{T}}$ but $\mathsf{A}_{1}\mathsf{T}_{0}\notin \mathcal{C}_{\mathsf{T}}(s')$, so that $\mathcal{C}_{\mathsf{T}}\nsubseteq \mathcal{C}_{\mathsf{T}}(s')$.
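On very small inputs, both feasibility conditions can be checked by exhaustive search. The Python sketch below is a brute-force illustration only, not the algorithm developed later. The instance it uses is an assumption: the reference string ATAAT and the spectra C_A = {A0T1, A0T2} and C_T = {A0T0, A1T0} over the alphabet {A, T} were chosen because they are consistent with the feasible strings and Hamming distances listed above. Compomers are encoded as (number of As, number of Ts), and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

def spectrum(s, x, alphabet):
    """Compomer spectrum of s for cut base x, as a set of count tuples."""
    pieces = s.split(x)
    last = len(pieces) - 1
    # Keep non-empty pieces, plus empty pieces strictly inside the string
    # (two adjacent cut bases give the empty fragment).
    frags = [p for i, p in enumerate(pieces) if p or 0 < i < last]
    return {tuple(Counter(f)[b] for b in alphabet) for f in frags}

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def feasible(candidate, spectra, alphabet, mode):
    """mode 'P': predicted spectra inside the given ones; 'Q': the reverse."""
    for x in alphabet:
        predicted = spectrum(candidate, x, alphabet)
        if mode == "P" and not predicted <= spectra[x]:
            return False
        if mode == "Q" and not spectra[x] <= predicted:
            return False
    return True

def brute_force(s, spectra, alphabet, mode):
    """Try every string of length |s| and keep a closest feasible one."""
    best = None
    for letters in product(alphabet, repeat=len(s)):
        cand = "".join(letters)
        if feasible(cand, spectra, alphabet, mode):
            d = hamming(s, cand)
            if best is None or d < best[0]:
                best = (d, cand)
    return best

# Assumed instance, inferred from the strings and distances listed above.
alphabet = "AT"
s = "ATAAT"
spectra = {"A": {(0, 1), (0, 2)}, "T": {(0, 0), (1, 0)}}

best_P = brute_force(s, spectra, alphabet, "P")
best_Q = brute_force(s, spectra, alphabet, "Q")
```

The search visits all |Σ|^{|s|} candidates, which is exactly the cost that the dynamic programming approach for the first problem tries to avoid.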
The measured mass spectra of a sample sequence are rarely perfect in practice. Some peaks may actually represent noise, while some true signal peaks may be missing. The problem $\mathsf{SNPMS}_{\mathcal{P}}$ is formulated so that its computational solution is robust against noisy peaks but susceptible to missing peaks (i.e., there is a good chance of recovering the sample sequence even if some noisy peaks are present in the measured mass spectra, but the chance becomes much smaller if some true signal peaks are missing). In contrast, the problem $\mathsf{SNPMS}_{\mathcal{Q}}$ is formulated so that its computational solution is robust against missing peaks but susceptible to noisy peaks.
Several computational problems in the literature are more or less related to the problems introduced above. In [9], a so-called sequencing from compomers problem was studied which, like the $\mathsf{SNPMS}_{\mathcal{P}}$ problem, also aimed to reconstruct the sample sequence from a given collection of compomer spectra, but without the help of a reference sequence. In [10], the spectral alignment problem differs from the $\mathsf{SNPMS}_{\mathcal{P}}$ problem mainly in that it works with short-read sequencing data rather than mass/compomer spectra data, which may have wide implications for the subsequent algorithm design and complexity analysis. Moreover, in [1], a so-called SNP discovery from mass spectrometry problem was defined in a similar way to the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem. However, it takes only a single compomer as input, as opposed to the collection of four complementary compomer spectra used in the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem.
Results
An exact dynamic programming algorithm for $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$
In this subsection, we describe an exact dynamic programming algorithm for solving the $\mathsf{SNPMS}_{\mathcal{P}}$ problem. Without loss of generality, we may assume in the remainder of this section that every base of Σ occurs in the optimal solution to a given instance of the $\mathsf{SNPMS}_{\mathcal{P}}$ problem. Consequently, only those feasible solutions that contain all the bases of Σ need to be considered when we search for the optimal solution. If some base x does not occur in the optimal solution $s'$, it becomes relatively easy to find $s'$, since we would have $s'\in \mathcal{L}_{x}\cap \mathcal{R}_{x}$ and |s'| = |s|. See below for the definitions of $\mathcal{L}_{x}$ and $\mathcal{R}_{x}$.
Let us start with some preliminary definitions and notations. For a string s, a cleavage fragment s[i, j] is called internal if neither i = 1 nor j = |s|, left-ended if i = 1, or right-ended if j = |s|. In addition, a cleavage fragment ε is always considered internal. Given a collection of compomer spectra $\mathcal{C}_{\Sigma}$, we call a string I-compatible if the compomers of its internal cleavage fragments are all contained in $\mathcal{C}_{\Sigma}$ (under the respective cut base). A string is called L-compatible (resp. R-compatible) if it is I-compatible and if the compomers of its left-ended (resp. right-ended) cleavage fragments are all contained in $\mathcal{C}_{\Sigma}$ as well.
Example 5 Consider the string s given in Example 1. The four cleavage fragments of s with respect to the cut base A are all internal. Among the five cleavage fragments of s with respect to the cut base T, the first cleavage fragment ACA is left-ended, the last cleavage fragment A is right-ended, and the other three cleavage fragments in the middle are all internal.
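The classification of fragments in Example 5 can be reproduced with a few lines of Python. This is a minimal sketch under the definitions above; the function name and the string labels it returns are illustrative assumptions.

```python
def classify_fragments(s, x):
    """Return (fragment, kind) pairs for the cleavage fragments of s
    with respect to cut base x, in left-to-right order."""
    pieces = s.split(x)
    last = len(pieces) - 1
    out = []
    for idx, piece in enumerate(pieces):
        left = idx == 0      # the fragment would start at position 1
        right = idx == last  # the fragment would end at position |s|
        if not piece and (left or right):
            continue         # s starts or ends with x: no fragment here
        if left and right:
            kind = "left-ended and right-ended"  # s contains no x at all
        elif left:
            kind = "left-ended"
        elif right:
            kind = "right-ended"
        else:
            kind = "internal"  # includes the empty fragment between xx
        out.append((piece, kind))
    return out

s = "ACATGCTACATTA"
by_A = classify_fragments(s, "A")
by_T = classify_fragments(s, "T")
```

Since s begins and ends with A, no fragment for the cut base A touches either end of the string, which is why all four are internal.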
Examples.
strings | I-compatible | L-compatible | R-compatible
ATGATAC | | |
ATGCTAC | | |
ACATGCT | | |
TACATTA | | |
CTACATTA | | |
Then, let $\mathcal{I}_{\Sigma}=\{\mathcal{I}_{\mathsf{A}},\mathcal{I}_{\mathsf{C}},\mathcal{I}_{\mathsf{G}},\mathcal{I}_{\mathsf{T}}\}$. Analogously, we may define $\mathcal{L}_{x}(\mathsf{A}_{i}\mathsf{C}_{j}\mathsf{G}_{k}\mathsf{T}_{l})$, $\mathcal{R}_{x}(\mathsf{A}_{i}\mathsf{C}_{j}\mathsf{G}_{k}\mathsf{T}_{l})$, $\mathcal{L}_{\Sigma}=\{\mathcal{L}_{\mathsf{A}},\mathcal{L}_{\mathsf{C}},\mathcal{L}_{\mathsf{G}},\mathcal{L}_{\mathsf{T}}\}$ and $\mathcal{R}_{\Sigma}=\{\mathcal{R}_{\mathsf{A}},\mathcal{R}_{\mathsf{C}},\mathcal{R}_{\mathsf{G}},\mathcal{R}_{\mathsf{T}}\}$ for the L-compatible strings and the R-compatible strings, respectively. Clearly, $\mathcal{L}_{x}\subseteq \mathcal{I}_{x}$ and $\mathcal{R}_{x}\subseteq \mathcal{I}_{x}$, for all x ∈ Σ.
Example 7 Consider the collection of compomer spectra $\mathcal{C}_{\Sigma}$ given in Example 6. For the compomer $\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{2}\in \mathcal{C}_{\mathsf{A}}$, we have $\mathcal{I}_{\mathsf{A}}(\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{2})=\{\mathsf{CGTT},\mathsf{CTTG},\mathsf{GCTT},\mathsf{GTTC},\mathsf{TCGT},\mathsf{TGCT},\mathsf{TTCG},\mathsf{TTGC}\}$, and $\mathcal{L}_{\mathsf{A}}(\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{2})=\mathcal{R}_{\mathsf{A}}(\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{2})=\varnothing$. For the compomer $\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{0}\in \mathcal{C}_{\mathsf{T}}$, we have $\mathcal{I}_{\mathsf{T}}(\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{0})=\mathcal{L}_{\mathsf{T}}(\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{0})=\mathcal{R}_{\mathsf{T}}(\mathsf{A}_{0}\mathsf{C}_{1}\mathsf{G}_{1}\mathsf{T}_{0})=\varnothing$.
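The sets in Example 7 can be reproduced by enumerating all arrangements of a compomer and filtering them by compatibility. The Python sketch below does so under three assumptions that are not stated in the surviving text: the spectra of Example 6 are taken to be those of the string of Example 1; a set such as I_x(c) is taken to contain the arrangements of the compomer c that are I-compatible; and, in line with the convention of this section, only arrangements containing every base of Σ other than x are admitted. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

ALPHABET = "ACGT"

def fragments(s, x):
    """Yield (fragment, is_left_ended, is_right_ended) for cut base x."""
    pieces = s.split(x)
    last = len(pieces) - 1
    for idx, piece in enumerate(pieces):
        left, right = idx == 0, idx == last
        if piece or not (left or right):
            yield piece, left, right

def compomer(fragment):
    counts = Counter(fragment)
    return tuple(counts[b] for b in ALPHABET)

def spectra_of(s):
    """Collection of the four compomer spectra of s."""
    return {x: {compomer(f) for f, _, _ in fragments(s, x)} for x in ALPHABET}

def compatible(t, spectra, need_left=False, need_right=False):
    """I-compatibility, optionally also checking left- or right-ended
    fragments (L- or R-compatibility)."""
    for x in ALPHABET:
        for frag, left, right in fragments(t, x):
            internal = not left and not right
            check = internal or (need_left and left) or (need_right and right)
            if check and compomer(frag) not in spectra[x]:
                return False
    return True

def strings_for(comp, x, spectra, need_left=False, need_right=False):
    """Arrangements of compomer comp that pass the compatibility test."""
    letters = "".join(b * n for b, n in zip(ALPHABET, comp))
    if set(letters) != set(ALPHABET) - {x}:
        return set()  # assumed convention: all bases except x must occur
    candidates = {"".join(p) for p in permutations(letters)}
    return {t for t in candidates
            if compatible(t, spectra, need_left, need_right)}

spectra = spectra_of("ACATGCTACATTA")  # assumed to match Example 6
I_A = strings_for((0, 1, 1, 2), "A", spectra)
L_A = strings_for((0, 1, 1, 2), "A", spectra, need_left=True)
R_A = strings_for((0, 1, 1, 2), "A", spectra, need_right=True)
I_T = strings_for((0, 1, 1, 0), "T", spectra)
```

Under these assumptions the four arrangements CTGT, GTCT, TCTG and TGTC are rejected because a single C or G enclosed by two Ts is not a compomer of the T spectrum.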
Given a string t which could be a potential cleavage fragment with respect to the cut base x (i.e., the string t does not contain any base x), we say a string s begins with the string t if t · x is a prefix of s · x, or say a string s ends with the string t if x · t is the suffix of x · s. The following lemma is useful to design a dynamic programming algorithm for solving the $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$ problem. Its easy proof is omitted. Recall that our discussions in this section are limited only to the feasible solutions containing all the bases of Σ.
Lemma 8 A string s' of length |s| is a feasible solution to the $\mathsf{SNPMS}_{\mathcal{P}}$ problem if and only if
 all the substrings of $s'$ are I-compatible with $\mathcal{C}_{\Sigma}$,
 $s'$ begins with a string in $\mathcal{L}_{x}$ for some $x\in \Sigma$, and
 $s'$ ends with a string in $\mathcal{R}_{x}$ for some $x\in \Sigma$.

 all its substrings are I-compatible with $\mathcal{C}_{\Sigma}$,
 it begins with a string from $\mathcal{L}_{y}$ for some y ∈ Σ, and
 it ends with the given string t.
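Then, let x' := (x · t)[k], p := (x · t)[1, k − 1], and q := (x · t)[k, |x · t|]. Here k is understood to be the smallest index such that (x · t)[1, k] contains all the bases of Σ. Note that x' ≠ x and the string p contains all the bases of Σ except x'.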
Example 9 Let t := CGTT ∈ $\mathcal{I}_{\mathsf{A}}$. Then, x · t = ACGTT, k = 4, $x'=\mathsf{T}$, p = ACG, and q = TT.
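The decomposition of x · t into p and q can be sketched in Python as follows. The sketch assumes that k is the smallest index at which the prefix of x · t has seen every base of Σ, a reading that agrees with Example 9; the function name is hypothetical.

```python
def split_prefix(x, t, alphabet="ACGT"):
    """Return (k, x_new, p, q) for the string x·t, where k is 1-based."""
    xt = x + t
    seen = set()
    for k, base in enumerate(xt, start=1):
        seen.add(base)
        if seen == set(alphabet):
            # x_new = (x·t)[k], p = (x·t)[1, k-1], q = (x·t)[k, |x·t|]
            return k, xt[k - 1], xt[:k - 1], xt[k - 1:]
    raise ValueError("x·t must contain every base of the alphabet")

example_9 = split_prefix("A", "CGTT")
```

By construction p · q = x · t, and p is the longest prefix of x · t that still misses exactly one base, namely the returned x_new.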
then s' would be an optimal solution to the input instance $\langle s,\mathcal{C}_{\Sigma}\rangle$ of the $\mathsf{SNPMS}_{\mathcal{P}}$ problem.
Proof: For the correctness of the above dynamic programming algorithm, we need to show that (i) every feasible solution of the $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$ problem would be essentially evaluated by the dynamic programming algorithm, and (ii) every string evaluated by the dynamic programming algorithm must be a feasible solution of the $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$ problem.
Let the string s' be a feasible solution. Consider a cleavage fragment t of s' that contains all the bases of Σ except its corresponding cut base x. Clearly, $t\in \mathcal{I}_{x}$ and t is the suffix of a substring s'[1, i] for some integer i. Without loss of generality, we can further suppose that t ≠ s'[1, i]. To show (i), what we mainly need to show is that there exists a string $t'\in \mathcal{I}_{x'}$ such that p is the suffix of t' and t' is the suffix of the substring s'[1, i − |q|], where x', p, and q are computed for the string t as described earlier. Indeed, we can find the string t' as follows. First, let (i' − 1) be the position of the last occurrence of the base x' in the substring s'[1, i − |t|]; if there is no such occurrence, we let i' = 1. Then, we assign t' := s'[i', i − |q|]. Obviously, t' is the suffix of s'[1, i − |q|]. Because s'[i − |t|] = x and x' ≠ x, we have i' ≤ i − |t|. It then follows from p = s'[i − |t|, i − |q|] that p is the suffix of t'. Since p contains all the bases of Σ except x', so does t'. Moreover, t' is a cleavage fragment of s' with respect to the cut base x' because we have either s'[i' − 1] = x' or i' = 1 on the left end of t' and s'[i − |q| + 1] = x' on the right end of t'. By Lemma 8, we can see that $t'\in \mathcal{I}_{x'}$.
For the reader's convenience, we demonstrate in the following example how to find t' from t. Let s' = ACATGCTACATTA, t = s'[4, 7] = TGCT, i = 7, x = A, and $\mathcal{C}_{\Sigma}$ be the one given in Example 6. Note that $t\in \mathcal{I}_{\mathsf{A}}$. Further, for the given string t = TGCT, we have x' = C, p = ATG, and q = CT. Then, we obtain i' = 3 and t' = s'[3, 7 − 2] = s'[3, 5] = ATG. It is easy to check that p is the suffix of t', t' is the suffix of the substring s'[1, i − |q|], and $t'\in \mathcal{I}_{x'}$.
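The construction of t' from t can be traced in code. The Python sketch below follows the steps of the argument with 1-based positions converted to Python slices; it assumes, as before, that k is the smallest index at which the prefix of x · t contains every base of Σ. Function names are hypothetical.

```python
def split_prefix(x, t, alphabet="ACGT"):
    """Return (x_new, p, q) for x·t, cutting at the first position where
    the prefix contains every base of the alphabet."""
    xt = x + t
    seen = set()
    for k, base in enumerate(xt, start=1):
        seen.add(base)
        if seen == set(alphabet):
            return xt[k - 1], xt[:k - 1], xt[k - 1:]
    raise ValueError("x·t must contain every base of the alphabet")

def find_t_prime(s_prime, i, t, alphabet="ACGT"):
    """Given the fragment t ending at 1-based position i of s_prime,
    return (i_new, t_new, x_new, p, q)."""
    start = i - len(t) + 1                  # 1-based start of t
    assert start >= 2 and s_prime[start - 1:i] == t
    x = s_prime[start - 2]                  # cut base at position i - |t|
    x_new, p, q = split_prefix(x, t, alphabet)
    # Last occurrence of x_new within s'[1, i - |t|]; rfind gives a
    # 0-based index, or -1 when there is no occurrence.
    last = s_prime.rfind(x_new, 0, i - len(t))
    i_new = last + 2                        # equals 1 when rfind gives -1
    t_new = s_prime[i_new - 1:i - len(q)]   # s'[i_new, i - |q|]
    return i_new, t_new, x_new, p, q

result = find_t_prime("ACATGCTACATTA", 7, "TGCT")
```

The assertion at the top of `find_t_prime` encodes the assumption t ≠ s'[1, i] made in the proof, so that a cut base exists immediately to the left of t.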

 If j ≥ i − |t| + 2, then s'[j, k] is an internal cleavage fragment of s'[i − |t| + 1, |s'|]. Since s'[i − |t| + 1, |s'|] is already assumed to be I-compatible with $\mathcal{C}_{\Sigma}$, the base composition of s'[j, k] is also contained in $\mathcal{C}_{x''}$.
 If j = i − |t| + 1, then x″ = x, which further implies that k = i and s'[j, k] = t. Since $t\in \mathcal{I}_{x}$, the base composition of s'[j, k] is contained in $\mathcal{C}_{x''}$.
 If j ≤ i − |t| and k ≥ i − |q|, then s'[i − |t|, i − |q|] is a substring of s'[j, k]. Since s'[i − |t|, i − |q|] contains all the bases of Σ, the string s'[j, k] cannot be a cleavage fragment (as a cleavage fragment must not contain its corresponding cut base). Therefore, this case cannot occur.
 If k ≤ i − |q| − 1, then s'[j, k] is an internal cleavage fragment of t' = s'[i', i − |q|]. Since $t'\in \mathcal{I}_{x'}$, the base composition of s'[j, k] is contained in $\mathcal{C}_{x''}$.
In conclusion, for every internal cleavage fragment of s'[i', |s'|], its base composition is contained in $\mathcal{C}_{\Sigma}$ under the respective cut base. Therefore, the extended substring s'[i', |s'|] is still I-compatible with $\mathcal{C}_{\Sigma}$.
Note that computing each entry $\mathcal{H}(i,t)$ of the dynamic programming table may take time $O(|s|\cdot |\mathcal{I}_{\Sigma}|)$, where $|\mathcal{I}_{\Sigma}|=|\mathcal{I}_{\mathsf{A}}|+|\mathcal{I}_{\mathsf{C}}|+|\mathcal{I}_{\mathsf{G}}|+|\mathcal{I}_{\mathsf{T}}|$. Hence, the above dynamic programming algorithm runs in time $O(|s|^{2}\cdot |\mathcal{I}_{\Sigma}|^{2})$. In the worst case, we may have $|\mathcal{I}_{\Sigma}|=O(|s|!)$, that is, $|\mathcal{I}_{\Sigma}|$ is in the factorial order of the input problem size. In practice, however, we would expect $|\mathcal{I}_{\Sigma}|$ to be small enough to be manageable, because cleavage fragments are usually of small size. Therefore, the above dynamic programming algorithm could be a practically feasible solution to the problem $\mathsf{SNPMS}_{\mathcal{P}}$, especially when compared to the brute-force algorithm, which needs to examine all the possible strings s'. For the special case where $|\Sigma|=2$, $\mathsf{SNPMS}_{\mathcal{P}}$ is actually an easy problem, as we can see from the above that $|\mathcal{I}_{\Sigma}|=O(|s|)$.
Corollary 11 The above dynamic programming algorithm can solve the $\mathsf{SNPMS}_{\mathcal{P}}$ problem in polynomial time when $|\Sigma|=2$.
The NP-hardness of $\mathsf{SNPMS}_{\mathcal{Q}}$
This subsection is dedicated to proving that the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem is NP-hard. We begin with a brief introduction to the 3-partition problem.
Definition 12 (The general form of the 3-partition problem) Given a multiset of positive integers $\mathcal{A}=\{a_{1},a_{2},\cdots ,a_{n}\}$ where n = 3m and $\sum_{i=1}^{n}a_{i}=mB$, can we partition the multiset $\mathcal{A}$ into m multisets $\mathcal{A}_{1},\mathcal{A}_{2},\cdots ,\mathcal{A}_{m}$, such that the sum of each multiset is equal to B?
The 3-partition problem is strongly NP-complete [7]. Therefore, it remains NP-complete even when the integers in $\mathcal{A}$ and the integer B are encoded in unary. In this case, the size of a problem instance is Θ(nB). In contrast, it becomes O(n log B) when using the binary encoding of integers.
Definition 13 (The restricted variation of the 3-partition problem) Given a set of positive integers $\mathcal{A}=\{a_{1},a_{2},\cdots ,a_{n}\}$ where n = 3m, $\sum_{i=1}^{n}a_{i}=mB$, and $\frac{B}{4}<a_{i}<\frac{B}{2},\forall 1\le i\le n$, can we partition the set $\mathcal{A}$ into m subsets $\mathcal{A}_{1},\mathcal{A}_{2},\cdots ,\mathcal{A}_{m}$, such that the sum of each subset is equal to B?
There are two constraints imposed in the above restricted variation of the 3-partition problem. The first one limits $\mathcal{A}$ to be a set so that all the integers in $\mathcal{A}$ are distinct. The second one limits all the integers in $\mathcal{A}$ to lie strictly between $\frac{B}{4}$ and $\frac{B}{2}$, which subsequently forces every subset $\mathcal{A}_{i}$ to consist of exactly three elements. Interestingly, this restricted variation of the 3-partition problem remains strongly NP-complete [8], just like the general form of the 3-partition problem. Note that the second constraint $\frac{B}{4}<a_{i}<\frac{B}{2}$ was actually not imposed in [8]. However, it can easily be enforced by adding B to each a_{ i } and then multiplying B by 4.
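The shift-and-scale step mentioned above is small enough to write out. The following Python sketch assumes an input in which every a_i satisfies 0 < a_i < B and the intended parts are triples; the function name and the sample numbers are illustrative only.

```python
def restrict_instance(a, B):
    """Add B to every integer and use 4B as the new target sum, so that
    every new integer lies strictly between (4B)/4 and (4B)/2."""
    return [ai + B for ai in a], 4 * B

# Hypothetical instance: two triples, (1, 5, 6) and (2, 3, 7), with B = 12.
a, B = [1, 5, 6, 2, 3, 7], 12
new_a, new_B = restrict_instance(a, B)
```

A triple that sums to B in the original instance sums to B + 3B = 4B after the shift, and distinct integers stay distinct, so valid partitions are preserved in both directions.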
Theorem 14 The $\mathsf{SNPMS}_{\mathcal{Q}}$ problem is NP-hard, even when $|\Sigma|=2$.

 1. Let Σ = {G, T}.
 2. Let s be the string such that s · T = (G^{B+2}T)^{ m }. That is, let s · T be the concatenation of m copies of the fragment GG · · · GT, where each fragment consists of (B + 2) consecutive base Gs followed by one base T. Note that |s| = m(B + 3) − 1 = mB + 3m − 1.
 3. Let $\mathcal{C}_{\mathsf{G}}=\{\mathsf{G}_{0}\mathsf{T}_{0},\mathsf{G}_{0}\mathsf{T}_{1}\}$ and $\mathcal{C}_{\mathsf{T}}=\{\mathsf{G}_{a_{i}}\mathsf{T}_{0}:1\le i\le n\}$ so that $\mathcal{C}_{\Sigma}=\{\mathcal{C}_{\mathsf{G}},\mathcal{C}_{\mathsf{T}}\}$.
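The three construction steps can be sketched in Python as follows. Compomers are encoded as (number of Gs, number of Ts); the function name and the sample instance (B = 48 with the integers 13, 15, 20, 14, 16, 18) are assumptions made for illustration.

```python
def reduce_instance(a, B):
    """Build the reference string s and the spectra {C_G, C_T} from a
    restricted 3-partition instance (a, B)."""
    m = len(a) // 3
    # s · T is m copies of G^(B+2) T, so s is that string minus the last T.
    s = (("G" * (B + 2) + "T") * m)[:-1]
    C_G = {(0, 0), (0, 1)}        # the empty fragment and a single T
    C_T = {(ai, 0) for ai in a}   # one run of Gs per integer a_i
    return s, {"G": C_G, "T": C_T}

a, B = [13, 15, 20, 14, 16, 18], 48
s, spectra = reduce_instance(a, B)
```

Writing s out explicitly takes time proportional to mB, which is why the unary encoding permitted by strong NP-completeness matters for the reduction.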
First, we check whether this construction can be done in polynomial time in the size of the input instance of the 3-partition problem. Since the restricted variation of the 3-partition problem is strongly NP-complete, we may encode the integers in unary so that the size of the input instance is Θ(nB). In the above reduction, we can easily see that the first step can be done in constant time, the second step in time O(mB), and the third step in time O(n log B). Therefore, the total time needed for the construction is O(nB), which is polynomial in the size of the input instance of the 3-partition problem.
Next, we show that every feasible solution s″ to the reduced instance $\langle s,\mathcal{C}_{\Sigma}\rangle$ of the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem is such that (i) $\mathcal{C}_{\mathsf{T}}(s'')=\mathcal{C}_{\mathsf{T}}$, (ii) s″ contains exactly 3m − 1 base Ts, and (iii) d_{ H }(s, s″) ≥ 2m. For each compomer $\mathsf{G}_{a_{i}}\mathsf{T}_{0}\in \mathcal{C}_{\mathsf{T}}\subseteq \mathcal{C}_{\mathsf{T}}(s'')$, there exists at least one cleavage fragment $\mathsf{G}^{a_{i}}$ in s″ that is obtained with respect to the cut base T. Since all the integers a_{ i } are distinct, all such cleavage fragments are pairwise non-overlapping. Thus, the string s″ contains at least $\sum_{i=1}^{n}a_{i}=mB$ base Gs and at least n − 1 = 3m − 1 base Ts. On the other hand, since |s| = mB + 3m − 1, the string s″ consists of exactly mB + 3m − 1 bases. Therefore, we can deduce that s″ contains exactly 3m − 1 base Ts and further that $\mathcal{C}_{\mathsf{T}}(s'')$ cannot have any compomer other than those in $\mathcal{C}_{\mathsf{T}}$. By construction, we also know that the string s contains exactly m − 1 base Ts, which hence implies that d_{ H }(s, s″) ≥ 2m.
Now, we are going to show that there exists a valid partition for the input instance of the 3-partition problem if and only if there exists an optimal solution s' for the reduced instance of the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem such that d_{ H }(s, s') = 2m.
s' := ε; // start from the empty string
for i = 1 to m
    for j = 1 to 3
        s' += $\mathsf{G}^{a_{i_{j}}}\mathsf{T}$; // append the string $\mathsf{G}^{a_{i_{j}}}\mathsf{T}$ to s'
    end
end
s' := s'[1, |s'| − 1]; // remove the last base T
As one can easily check, the resulting string s' is such that |s'| = mB + 3m − 1, $\mathcal{C}_{\mathsf{G}}\subseteq \mathcal{C}_{\mathsf{G}}(s')$, and $\mathcal{C}_{\mathsf{T}}\subseteq \mathcal{C}_{\mathsf{T}}(s')$. Therefore, s' is a feasible solution to the reduced instance $\langle s,\mathcal{C}_{\Sigma}\rangle$ of the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem. On the other hand, since $\sum_{j=1}^{3}a_{i_{j}}=B,\forall 1\le i\le m$, we can deduce that s'[k] = s[k] if s'[k] = G or s[k] = T; otherwise, s'[k] ≠ s[k], ∀k ∈ [1, mB + 3m − 1]. Therefore, d_{ H }(s, s') = |{k : s'[k] ≠ s[k]}| = |s| − |{k : s'[k] = s[k]}| = mB + 3m − 1 − |{k : s'[k] = G}| − |{k : s[k] = T}| = mB + 3m − 1 − mB − m + 1 = 2m. It hence follows that s' is indeed an optimal solution to the reduced instance $\langle s,\mathcal{C}_{\Sigma}\rangle$ of the $\mathsf{SNPMS}_{\mathcal{Q}}$ problem.
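The forward direction can be checked numerically on a small case. The Python sketch below builds s' from a given partition and verifies the stated properties; the partition (13, 15, 20), (14, 16, 18) with B = 48 and all function names are assumptions chosen for illustration.

```python
from collections import Counter

def spectrum(s, x, alphabet="GT"):
    """Compomer spectrum of s for cut base x as (num G, num T) tuples."""
    pieces = s.split(x)
    last = len(pieces) - 1
    frags = [p for i, p in enumerate(pieces) if p or 0 < i < last]
    return {tuple(Counter(f)[b] for b in alphabet) for f in frags}

def build_solution(partition):
    """Concatenate G^a T for every integer a, triple by triple, and
    drop the final T."""
    s_prime = "".join("G" * a + "T" for triple in partition for a in triple)
    return s_prime[:-1]

B = 48
partition = [(13, 15, 20), (14, 16, 18)]  # each triple sums to B
m = len(partition)

s = (("G" * (B + 2) + "T") * m)[:-1]      # reference string of the reduction
s_prime = build_solution(partition)
distance = sum(a != b for a, b in zip(s, s_prime))

C_G = {(0, 0), (0, 1)}
C_T = {(a, 0) for triple in partition for a in triple}
is_feasible = (C_G <= spectrum(s_prime, "G")
               and C_T <= spectrum(s_prime, "T"))
```

Every T of s coincides with a T of s', so the only mismatches are the 2m extra Ts that s' places inside the G runs of s.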
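(⇐) Conversely, suppose that $s'$ is an optimal solution to the reduced instance such that $d_H(s, s') = 2m$. We recover the subsets $\mathcal{A}_1, \ldots, \mathcal{A}_m$ from $s$ and $s'$ as follows:
    $s := s \cdot \mathsf{T}$; $s' := s' \cdot \mathsf{T}$; // append a sentinel T to both strings
    $i := 1$; $j := 1$;
    $\mathcal{A}_i := \varnothing$; $a_{i_j} := 0$;
    for $k = 1$ to $mB + 3m$
        if $s'[k] = \mathsf{T}$ // a run of Gs in $s'$ ends here
            $\mathcal{A}_i := \mathcal{A}_i \cup \{a_{i_j}\}$;
            $j{+}{+}$;
            if $s[k] = \mathsf{T}$ // a block of $s$ ends here; start the next subset
                $i{+}{+}$; $j := 1$;
                $\mathcal{A}_i := \varnothing$;
            end
            $a_{i_j} := 0$;
        else
            $a_{i_j}{+}{+}$;
        end
    end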
It follows from the earlier discussions that $\mathcal{C}_{\mathsf{T}}(s') = \mathcal{C}_{\mathsf{T}} = \{\mathsf{G}_{a_i}\mathsf{T}_0 : 1 \le i \le n\}$ and also that $s'$ contains exactly $3m - 1$ base Ts. Furthermore, since $d_H(s, s') = 2m$, we can deduce that $s'[k] = s[k]$ if $s[k] = \mathsf{T}$, for all $k \in [1, mB + 3m - 1]$. Notice that $s[k] = \mathsf{T}$ if and only if $k$ is a multiple of $(B + 3)$, that is, $k = i(B + 3) \in [1, mB + 3m - 1]$ for some integer $i$. Therefore, $s'[k] = \mathsf{T}$ whenever $k = i(B + 3) \in [1, mB + 3m - 1]$, which subsequently implies that $\mathcal{C}_{\mathsf{T}}(s'[(i - 1)(B + 3) + 1,\ i(B + 3) - 1]) \subseteq \mathcal{C}_{\mathsf{T}}(s')$, for each $i \in [1, m]$. Note that $s[(i - 1)(B + 3) + 1,\ i(B + 3) - 1]$ is a substring of $s$ that consists of $(B + 2)$ base Gs; it is located either strictly between two consecutive base Ts or strictly between one base T and one end of the string $s$. Since $\mathcal{C}_{\mathsf{T}}(s'[(i - 1)(B + 3) + 1,\ i(B + 3) - 1]) \subseteq \mathcal{C}_{\mathsf{T}}(s')$, we can let $\mathcal{C}_{\mathsf{T}}(s'[(i - 1)(B + 3) + 1,\ i(B + 3) - 1]) = \{\mathsf{G}_{a_{i_1}}\mathsf{T}_0,\ \mathsf{G}_{a_{i_2}}\mathsf{T}_0,\ \dots,\ \mathsf{G}_{a_{i_j}}\mathsf{T}_0\}$ such that $a_{i_1} + a_{i_2} + \cdots + a_{i_j} + j - 1 = B + 2$.
Since $\frac{B}{4} < a_{i_l} < \frac{B}{2}$ for every $l$, we can deduce that $j = 3$: the $j$ integers must sum to $B + 3 - j$, whereas $j \le 2$ would give a sum smaller than $B$ (while $B + 3 - j \ge B + 1$) and $j \ge 4$ would give a sum larger than $B$ (while $B + 3 - j \le B - 1$). Hence $a_{i_1} + a_{i_2} + a_{i_3} = B$. Let $\mathcal{A}_i = \{a_{i_1},\ a_{i_2},\ a_{i_3}\}$, for all $i \in [1, m]$. Then, we can see that $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_m$ is a partition of $\mathcal{A}$ such that the sum of integers in each subset is equal to $B$.
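The recovery of the subsets from $s$ and $s'$ can also be sketched in a few lines of Python. This is a minimal illustration under the same assumption on the shape of $s$ (base T exactly at the multiples of $B + 3$); the function name `recover_partition` and the toy strings ($m = 2$, $B = 60$) are hypothetical and serve only to make the scan concrete.

```python
def recover_partition(s, s_prime):
    """Scan s and s' in parallel and collect the lengths of the G-runs of s'.

    A sentinel T is appended to both strings so that the last run is closed.
    A subset is completed whenever a T of s' coincides with a T of s.
    """
    s, s_prime = s + "T", s_prime + "T"
    subsets, current, run = [], [], 0
    for ref_base, new_base in zip(s, s_prime):
        if new_base == "T":
            current.append(run)      # a run of Gs in s' ends here
            run = 0
            if ref_base == "T":      # a block of s ends here
                subsets.append(current)
                current = []
        else:
            run += 1
    return subsets


# Toy input (hypothetical): B = 60, m = 2.
B = 60
s = "T".join(["G" * (B + 2)] * 2)
s_prime = "T".join("G" * a for a in (16, 20, 24, 17, 21, 22))

partition = recover_partition(s, s_prime)
print(partition)
```

Each recovered subset should contain three integers that sum to $B$, in agreement with the argument above.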
Extensions to edit distance
Naturally we may extend our previous problem formulations to the edit distance (i.e., Levenshtein distance). The resulting two new problems are formally defined as follows.
Definition 15 (The $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$ problem) Given a string $s$ and a collection of compomer spectra $\mathcal{C}_{\Sigma} = \{\mathcal{C}_x : x \in \Sigma\}$, find a string $s'$ such that $\mathcal{C}_x(s') \subseteq \mathcal{C}_x$ for all $x \in \Sigma$ and $d_E(s, s')$ is minimized.
Definition 16 (The $\mathsf{\text{SNPM}}{\mathsf{\text{S}}}_{\mathcal{Q}}$ problem) Given a string $s$ and a collection of compomer spectra $\mathcal{C}_{\Sigma} = \{\mathcal{C}_x : x \in \Sigma\}$, find a string $s'$ such that $\mathcal{C}_x \subseteq \mathcal{C}_x(s')$ for all $x \in \Sigma$ and $d_E(s, s')$ is minimized.
These extensions make it possible to detect not only base substitutions but also base insertions and deletions. Hence, they would permit mutation discovery in DNA sequences (see [1]). In Additional file 1, we show that both $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$ and $\mathsf{\text{SNPM}}{\mathsf{\text{S}}}_{\mathcal{Q}}$ are NP-hard, together with an exact dynamic programming algorithm for solving the $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$ problem.
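For readers less familiar with the edit distance $d_E$, the following Python sketch computes it with the standard textbook dynamic program (unit cost for each substitution, insertion and deletion). It only illustrates the distance measure used in Definitions 15 and 16; it is not the algorithm of Additional file 1, and the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t, computed row by row."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to the empty string
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                # delete a from s
                curr[j - 1] + 1,            # insert b into s
                prev[j - 1] + (a != b),     # substitute a by b (free on a match)
            ))
        prev = curr
    return prev[-1]


# A single deleted base is one edit operation.
print(edit_distance("GGTGG", "GGGG"))

# A shift by one position costs two edits, although all six positions
# differ when the two strings are compared position by position.
print(edit_distance("GTGTGT", "TGTGTG"))
```

The second example shows why the Hamming distance can greatly overestimate the number of mutations once insertions and deletions are allowed.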
Conclusions
To exploit the full potential of the SNP discovery approach using base-specific cleavage and mass spectrometry, in this paper we have studied two new combinatorial optimization problems, called $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$ and $\mathsf{\text{SNPM}}{\mathsf{\text{S}}}_{\mathcal{Q}}$, respectively. We believe that any efficient solution to either problem could offer a more seamless integration of information in four complementary base-specific reactions than previously done in [1, 2], thereby improving the capability of the underlying biotechnology (i.e., base-specific cleavage and mass spectrometry) for sensitive and accurate SNP discovery.
Although we cannot change the inherent complexity of our proposed dynamic programming algorithm for the $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$ problem, we believe that by improving and optimizing its implementation, the running time can be reduced to an extent suitable for practical use. On the other hand, the NP-hardness result indicates that in the most general situation, solving the $\mathsf{\text{SNPM}}{\mathsf{\text{S}}}_{\mathcal{Q}}$ problem exactly in polynomial time is impossible unless P = NP. In more realistic situations where only very few SNPs (e.g., two or three SNPs) occur in a target sample sequence, however, the problem can be tackled quite easily, e.g., using an exhaustive search approach. In future work, we shall try to prove that the $\mathsf{\text{SNP}}\mathsf{\text{M}}{\mathsf{\text{S}}}_{\mathcal{P}}$ problem is NP-hard and develop an efficient heuristic algorithm for the $\mathsf{\text{SNPM}}{\mathsf{\text{S}}}_{\mathcal{Q}}$ problem for practical use.
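The exhaustive search mentioned above can be sketched as follows. This is a rough illustration under strong assumptions that are not taken from the original text: cleavage is assumed to be complete, the compomer of a fragment is modelled as its vector of base counts over the alphabet ACGT, and empty fragments are ignored. All function names are hypothetical. The search enumerates candidate strings by increasing Hamming distance $k$ from the reference, so it examines on the order of $\sum_{k} \binom{|s|}{k} 3^{k}$ candidates, which is manageable only for very small $k$.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACGT"


def compomer_spectrum(s, cut):
    """Assumed model: compomers of the maximal cut-free fragments of s."""
    spectrum = set()
    for fragment in s.split(cut):
        if fragment:                                   # ignore empty fragments
            counts = Counter(fragment)
            spectrum.add(tuple(counts[b] for b in ALPHABET))
    return spectrum


def is_feasible(candidate, measured):
    """Every measured compomer must appear in the predicted spectrum."""
    return all(measured[x] <= compomer_spectrum(candidate, x) for x in ALPHABET)


def exhaustive_search(s, measured, max_snps):
    """Return (candidate, k) with the smallest k <= max_snps, or None."""
    for k in range(max_snps + 1):
        for positions in combinations(range(len(s)), k):
            choices = [[b for b in ALPHABET if b != s[p]] for p in positions]
            for bases in product(*choices):
                chars = list(s)
                for p, b in zip(positions, bases):
                    chars[p] = b                       # substitute one base
                candidate = "".join(chars)
                if is_feasible(candidate, measured):
                    return candidate, k
    return None


# Toy example (hypothetical): the sample differs from the reference by one SNP.
reference = "GGATGCA"
sample = "GGATTCA"
measured = {x: compomer_spectrum(sample, x) for x in ALPHABET}
result = exhaustive_search(reference, measured, max_snps=2)
print(result)
```

Note that the returned candidate is only guaranteed to be feasible at minimum distance; several candidates may share the same spectra, so it need not coincide with the true sample sequence.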
Declarations
Acknowledgements
We would like to thank Yuguang Mu and Kai Tang for introducing us to the problem of SNP discovery using base-specific cleavage and mass spectrometry. X.C.'s research was supported by the Singapore National Medical Research Council grant (CBRG11nov091) and a College of Science Collaborative Research Award at NTU. Q.W.'s research was supported by National Science Foundation for Young Scientists of China (61103066). L.Z.'s research was supported by the Singapore MOE AcRF Tier 2 grant (R146000134112).
This article has been published as part of BMC Systems Biology Volume 6 Supplement 2, 2012: Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S2.
References
1. Bocker S: SNP and mutation discovery using base-specific cleavage and MALDI-TOF mass spectrometry. Bioinformatics. 2003, 19 (Suppl 1): i44-53. 10.1093/bioinformatics/btg1004.
2. Krebs S, Medugorac I, Seichter D, Forster M: RNaseCut: a MALDI mass spectrometry-based method for SNP discovery. Nucleic Acids Research. 2003, 31 (7):
3. Stanssens P, Zabeau M, Meersseman G, Remes G, Gansemans Y, Storm N, Hartmer R, Honisch C, Rodi CP, Bocker S, van den Boom D: High-throughput MALDI-TOF discovery of genomic sequence polymorphisms. Genome Research. 2004, 14: 126-133.
4. Hartmer R, Storm N, Bocker S, Rodi CP, Hillenkamp F, Jurinke C, van den Boom D: RNase T1 mediated base-specific cleavage and MALDI-TOF MS for high-throughput comparative sequence analysis. Nucleic Acids Research. 2003, 31 (9):
5. Honisch C, Raghunathan A, Cantor CR, Palsson BO, van den Boom D: High-throughput mutation detection underlying adaptive evolution of Escherichia coli-K12. Genome Research. 2004, 14 (12): 2495-2502. 10.1101/gr.2977704.
6. RNaseCut webpage link. [http://www.vetmed.unimuenchen.de/gen/forschung.html]
7. Garey MR, Johnson DS: Complexity results for multiprocessor scheduling under resource constraints. SIAM Journal on Computing. 1975, 4: 397-411. 10.1137/0204035.
8. Hulett H, Will TG, Woeginger GJ: Multigraph realizations of degree sequences: Maximization is easy, minimization is hard. Operations Research Letters. 2008, 36 (5): 594-596. 10.1016/j.orl.2008.05.004.
9. Bocker S: Sequencing from compomers: Using mass spectrometry for DNA de novo sequencing of 200+ nt. Journal of Computational Biology. 2004, 11 (6): 1110-1134. 10.1089/cmb.2004.11.1110.
10. Pevzner PA, Tang HX, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America. 2001, 98 (17): 9748-9753. 10.1073/pnas.171285098.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.