Palindromic sequence impedes sequencing-by-ligation mechanism
© Huang et al.; licensee BioMed Central Ltd. 2012
Published: 12 December 2012
Skip to main content
© Huang et al.; licensee BioMed Central Ltd. 2012
Published: 12 December 2012
Current next-generation sequencing (NGS) platforms adopt two types of sequencing mechanisms: by synthesis or by ligation. The former is employed by 454 and Solexa systems, while the latter by SOLiD system. Although the pros and cons for each sequencing mechanism have more or less been discussed in a number of occasions, the potential obstacle imposed by palindromic sequences has not yet been addressed.
To test the effect of the palindromic region on sequencing efficacy, we clonally amplified a paired-end ditag sequence composed of a 24-bp palindromic sequence flanked by a pair of tags from the E. coli genome. We used the near homogeneous fragments produced from MmeI digestion of the amplified clone to generate a sequencing library for SOLiD 5500xl sequencer.
Results showed that, traditional ABI sequencers, which adopt sequencing-by-synthesis mechanism, were able to read through the palindromic region. However, SOLiD 5500xl was unable to do so. Instead, the palindromic region was read as miscellaneous random sequences. Moreover, readable tag sequence turned obscure ~2 bp prior to the palindromic region.
Taken together, we demonstrate that SOLiD machines, which employ sequencing-by-ligation mechanism, are unable to read through the palindromic region. On the other hand, sequencing-by-synthesis sequencers had no difficulty in doing so.
Next-Generation Sequencing (NGS) promises to provide robust sequencing platforms for basic biological researches as well as clinical and molecular diagnoses [1–4]. Currently, the sequencing market is dominated by 3 NGS lineages: Roche's 454, Illumina's Solexa, and Life Technologies' SOLiD. In general, the cost for 454 is much higher than the other two, but its sequence length can easily reach 400 bps, followed by the Solexa lineage and then SOLiD, which can only produce short reads up to 75 bp.
The sequencing chemistries of these three systems more or less differ. 454 uses emulsion PCR to amplify temples on beads followed by pyrosequencing with DNA polymerase . Solexa lineage adopts bridge PCR to amplify temples on slide followed by colorimetric sequencing with DNA polymerase. As such, both 454 and Solexa lineages adopt "sequencing-by-synthesis" approach. On the other hand, although SOLiD systems also use emulsion PCR to amplify templates on beads, its sequencing is extended by two unique steps: first by using oligos to hybridize to the target in the single-stranded template, followed by using ligase to seal the nick between the 5' of the growing strand and the 3' of the oligo. This sequencing approach, which grows in 3' to 5' direction, is coined as "sequencing-by-ligation" mechanism. The most evident difference between sequencing-by-synthesis and sequencing-by-ligation is that the former uses DNA polymerase to incorporate complementary nucleotides to the elongating strand, while the latter uses ligase to seal the junction between the elongating strand and the newly incorporated complementary oligonucleotides. Since DNA polymerase is an essential enzyme in the cell, the former is considered as a more natural approach compared with the latter.
Potential underrepresentation of GC-rich regions have been reviewed for both sequencing-by-synthesis and sequencing-by-ligation mechanisms . However, the potential obstacles imposed by palindromic regions have never been addressed. A palindromic sequence is a nucleic acid sequence that reads the same no matter from the 5' end of the sequence itself or from the 5' end of its complementary strand. It has a potential to form a hairpin structure . The success of ligation has to rely on the availability of its complementary strand. Given the fact that short stretches of palindromic sequences may form hairpin structures in the single-stranded template, palindromic sequences which naturally exist in genomes or in RNA sequences (esp. mRNAs or miRNAs) may become regional obstacles for sequencing-by-ligation sequencers, causing problems in genome assembly and introducing bias in transcriptome analysis. Thus, it is an important issue to further understand the potential obstacles caused by palindromic sequences, especially for the sequencing-by-ligation mechanism.
In this article, we used centralized palindromic sequence to study the effects of a palindromic sequence on sequencing. We constructed a single clone containing a palindromic region flanked by two tag sequences (~18 bp), amplified the clone, digested with MmeI, and constructed a sequencing library from the MmeI fragments. The library was sequenced by SOLiD 5500xl sequencer. Here we report the results, indicating the incapability for a sequencing-by-ligation machine to read through the palindromic region.
Paired-end ditag DNA fragment of 5'-GGCAA TGGCA CCATC GCTAG TCGGA GTCTG CGCAG ACTCC GACTC CACGA CCGCT GAGGT T-3' was cloned into yT&A vector (Cat. YC013, Yeastern Biotech. Co., Ltd.) and verified by traditional Sanger sequencing using ABI PRISM® 96-capillary 3730xl DNA Analyzer (Notice that this step proved the capability of a traditional sequencing-by-synthesis method to sequence through the palindromic region). This clone contains a centralized palindromic sequence (26 bp, 5'-AGTCG GAGTC TGCGC AGACT CCGAC T-3') flanked by two tags originated from the E. coli genome: Tag1 (18 bp, 5'-GGCAA TGGCA CCATC GCT-3') and Tag2 (17 bp, 5'-CCACG ACCGC TGAGG TT-3', or 5'-AACCT CAGCG GTCGT GG-3' on the complementary strand which was sequenced by SOLiD 5500xl sequencer), making up a total length of 61 bp. The inserted DNA in yT&A vector was amplified with PCR using M13 forward (5'-GTTTT CCCAG TCACG AC-3') and reverse (5'-TCACA CAGGA AACAG CTATG AC-3') primers and EconoTaq™ PLUS GREEN 2X Master Mix (Lucigen Corp.). PCR amplified fragments were digested with MmeI, end-repaired, A-tailed and ligated with P1 and P2 sequencing adaptors to make the palindromic library for sequencing by SOLiD 5500xl. The quality and quantity of library were checked by Agilent 2100 Bioanalyzer and Roche LightCycler® 480 II Real-Time PCR system according to the manufacturer's instructions.
The fragment library prepared above was sequenced by SOLiD 5500xl for 7 ligation cycles (X 5 primers) to obtain 35 bp sequences. By using Exact Call Chemistry (ECC) module, we directly obtained nucleotide sequences, instead of colour space information.
Raw sequences were processed for quality control . Briefly, sequences with N, polyA, polyT, polyG, polyC, and PCR primer sequence were removed. Various quality values (QVs) were used, depending on what types of sequences we wanted to observe. Only the qualified sequences were used for analysis.
The original clone was a double-stranded construct containing a centralized palindromic sequence (26 bp) flanked by two tags, Tag1 (18 bp) and Tag2 (17 bp), originated from the E. coli genome, making up a total length of 61 bp (Figure 1). For testing the capability of a SOLiD machine to read through the palindromic region, a sequence length of 35 bp is long enough for our purpose, and there was no need to sequence the full construct.
Notice that the internal adaptor region is sealed by two MmeI sites (TCCGAC) each followed by an extra T previously used for stick-end ligation to E. coli genomic fragments. This extra T can be assigned to the tag region, but it was not originated from the E. coli genome.
As mentioned above, there was no need to sequence the full construct (total of ~62 bp) as this study was set out to test the capability of SOLiD 5500xl to read through a palindromic structure. As such, we run 7 cycles for each one of the five primers to make up a total of 35 bp sequence length. The 35 bp sequences produced by SOLiD 5500xl were expected to contain the 5' end of either tag followed a ~17 bp of the palindromic region.
Percentage (compared with qualified reads)
Total of Tag1/Tag2-containing reads
Positions read by sequencing-by-ligation follows the unique two-base encoding pattern. As defined by the design of the oligo probes, each 2-base reading is followed by a 3-base leap (gap). To compose a complete reading, each sequencing run utilizes 5 primers (P1-5) and there is a single-base shifting between primers (Figure 4): Primer 1 reads positions (1,2) ... (6,7) ... and so on; Primer 2 reads positions (0,1)...(5,6)... and so on; Primer 3 reads positions (4,5)...(9,10)... and so on; Primer 4 reads positions (3,4)...(8,9)... and so on; Primer 5 reads positions (2,3)...(7,8)... and so on. P6, an ECC primer not for actual sequence reading, is not included.
SOLiD systems use good-and-best beads ratio as an indicator to evaluate the quality of the library. High good-and-best beads ratio implies high library quality. During the sequencing of the palindromic library, we found that the ratios of good-and-best beads for the first 2 cycles seemed quite normal, but started to decline for P3 and P4 during their third cycle, followed by a global decline during the fourth cycle (Figure 4). This result provided further evidence showing the incapability of SOLiD 5500xl sequencer for reading through the palindromic region in the library.
As indicated previously, the palindromic region starts from position 18 or 19. We noticed a gradual decline of quality value, and it completely dropped to the manufacturer-defined baseline QV 7, at position 17 - about a couple of bases before the palindromic region (Figure 5). As such, the palindromic region would disappear if the QV cutoff was set to be 8 or above. That is, the faulty/random palindromic sequences can only be seen with QV cutoff equal to or below 7.
The original construct was also sequenced by ABI PRISM® 96-capillary 3730xl DNA Analyzer, which adopts sequencing-by-synthesis approach, indicating that palindromic sequences imposed no difficulty for the sequencing-by-synthesis sequencers. However, as indicated by our results, the SOLiD 5500xl sequencing-by-ligation sequencer was unable to read through the palindromic region presumably due to the plausible formation of hairpin structure in the palindromic region.
As such, sequencing-by-ligation-based sequencers may not be suitable for genome assembly, because palindromes of various sizes are known to present in both prokaryotic and eukaryotic genomes. On the other hand, this type of sequencers may remain suitable for sequencing transcriptomes and small DNA molecules of which hairpin structures are less likely to be present. Moreover, the length of a palindrome is expected to be negatively correlated with its readability, although not in proportional.
This research was sponsored by Academia Sinica, Taiwan ROC. We thank the Sequencing Core Facility, Scientific Instrument Center at Academia Sinica for plasmid DNA sequencing.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 2, 2012: Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.