Volume 6 Supplement 3
Optimizing hybrid assembly of next-generation sequence data from Enterococcus faecium: a microbe with highly divergent genome
© Wang et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Sequencing of bacterial genomes has become an essential approach to study pathogen virulence and the phylogenetic relationships among closely related strains. The bacterium Enterococcus faecium has emerged as an important nosocomial pathogen that is often associated with resistance to common antibiotics in hospitals. With its highly divergent gene content, it presents a challenge to next-generation sequencing (NGS) technologies, which feature high throughput and short read lengths. This study was designed to investigate the properties and systematic biases of NGS technologies and to evaluate critical parameters influencing the outcomes of hybrid assemblies using combinations of NGS data.
A hospital strain of E. faecium was sequenced using three different NGS platforms, 454 GS-FLX, Illumina GAIIx, and ABI SOLiD4.0, to approximately 28-, 500-, and 400-fold coverage depth, respectively. We built a pipeline that merged contigs from each NGS data set into hybrid assemblies. The results revealed that each single-NGS assembly had a ceiling in continuity that could not be overcome by simply increasing coverage depth. Each NGS technology displayed intrinsic properties such as base calling error and systematic bias. The gaps and low-coverage regions of each NGS assembly were associated with lower GC content. To optimize the hybrid assembly approach, we tested varying amounts and different combinations of NGS data and obtained optimal conditions for assembly continuity. We also showed, for the first time, that SOLiD data could help produce much improved assemblies of the E. faecium genome when combined with other types of NGS data in the hybrid approach.
The current study addressed the difficult issue of how to most effectively construct a complete microbial genome using today's state-of-the-art sequencing technologies. We characterized the sequence data and genome assembly from each NGS technology, tested conditions for hybrid assembly with combinations of NGS data, and obtained optimized parameters for achieving the most cost-efficient assembly. Our study helped form guidelines to direct genomic work on other microorganisms and thus has important practical implications.
Sequencing of bacterial genomes is an essential approach to understand the virulence mechanisms of pathogens and the evolutionary relationships among closely related pathogenic strains. Bacterial isolates of the same species from vastly different ecological environments often display surprisingly divergent gene content. Such genome divergence, which results from harsh selection and frequent horizontal gene transfer, presents a unique challenge to modern sequencing technologies featuring high throughput and short read length, and limits our ability to re-sequence and construct a genome draft of a bacterial variant by taking advantage of a "genome reference".
The bacterium Enterococcus faecium is a prime example of this challenge. E. faecium emerged as an important nosocomial pathogen from hospital environments and was often associated with resistance to many common antibiotics. It was a major cause of infections in hospitalized immuno-deficient patients. E. faecium is a Gram-positive bacterium with a genome size of roughly 3 Mb. The first draft genome of E. faecium strain TX0016 was assembled in 2000. Since then, more than 25 partial genomes were published in succession. The lack of a complete genome sequence of E. faecium might be due to factors including the genome plasticity of E. faecium, which harbors large insertions and deletions, and abundant repetitive sequences that hinder the assembly of a complete genome from sequence reads [2, 3]. A genome sequence of a vancomycin-resistant E. faecium strain (Aus0004) was completed more recently, revealing large segments of repetitive DNA and insertion sequence elements in its genome.
The development of next-generation sequencing technologies, e.g. 454 GS-FLX, Illumina Genome Analyzer, ABI SOLiD, and Pacific Biosciences SMRT, made it possible to sequence a bacterial genome at considerably lower cost [5, 6]. While these technologies are attractive for sequencing and constructing bacterial genomes, several major factors seriously affect their performance. Among them, short sequence reads, high base calling error rates, and the systematic bias of next-generation sequencers were often cited as drawbacks that made de novo genome assembly difficult, incomplete, and/or erroneous [5, 7]. To address these issues and obtain high quality genomes, one common approach was to increase the coverage depth of sequencing reads. Genome drafts of Helicobacter acinonychis and panda were constructed with this method. However, this approach not only reduced the cost-effectiveness of next-generation sequencing technologies, but also failed to reduce the systematic bias of the sequencing platform. Another approach was to combine sequence data from different technologies, which in theory could correct sequence errors and biases and improve the quality of the draft genome. This so-called "hybrid" approach was adopted with combinations of 454 and Illumina data [10, 11], and of 454, Illumina and Sanger data. While the hybrid approach achieved higher efficiency in genome assembly, the highly cost-effective SOLiD data were excluded from all these studies. Whether SOLiD data could significantly contribute to the hybrid assembly method remained an open question. In addition, many variables influenced the outcome of a hybrid assembly. The main goals of the present study were to investigate how these parameters affect the quality of genome assemblies and how to design a "hybrid" project with the greatest cost efficiency.
In the current study, we had the opportunity to apply the hybrid approach to assembling a variant strain of the nosocomial pathogen E. faecium, a medically important microbe. As mentioned above, the high genome divergence of E. faecium had prevented the completion of a genome draft although at least twenty-eight different variant strains had been sequenced, some to very high coverage depth. Against this background, we first sequenced an E. faecium variant isolated from a hospitalized patient using three different next-generation sequencing (NGS) technologies: 454 GS-FLX (454), Illumina GAIIx (GAIIx) and ABI SOLiD4.0 (SOLiD). We built a new analysis pipeline: 1) to perform primary assemblies with each single NGS data set, which established a baseline for comparing results and evaluating parameters for hybrid assemblies; 2) to perform secondary assemblies with combinations of two or three single NGS data sets. With this design we characterized some systematic errors and biases of each NGS platform and were able to optimize parameters for performing hybrid assembly. Our results revealed that the hybrid assembly method greatly improved efficiency in comparison with a single NGS technology, an improvement that could not be achieved by simply increasing the coverage depth of a single NGS platform. We also assessed a number of parameters that would help guide the design and preparation of hybrid assembly studies of bacterial genomes.
DNA preparation and sequencing
The Enterococcus faecium strain was isolated from a patient's peritoneal drainage fluid at HuaShan hospital. Genomic DNA was extracted and prepared for sequencing on three sequencers. For the Roche GS-FLX platform, the SR library was constructed and sequenced with the methods described by Margulies and co-workers. PE library preparation and sequencing on the Illumina GAIIx sequencer were performed according to the standard Illumina protocols (Illumina, San Diego, CA, USA). For sequencing on the SOLiD 4 system, 50-base SR library preparation, sequencing and base calling were performed according to the manufacturer's recommendations (Applied Biosystems, Carlsbad, California, USA).
For the Illumina raw data, reads containing uncalled base positions were removed first. We then trimmed low quality bases with the Perl script fastq_qualitytrim_window.pl from http://xyala.cap.ed.ac.uk/Gene_Pool/scripts.tar.gz, setting the quality threshold and window size to 20 and 2, respectively. After pre-processing, the high quality Solexa dataset was assembled with Velvet 1.1.04. After testing a series of k-mers (from 21 to 71 at intervals of five), the optimized k-mer parameter (hash length) of 31 was used for the assembly.
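The pre-processing step can be illustrated with a minimal Python sketch. It assumes that window-based trimming truncates a read at the first window whose mean Phred score falls below the threshold; the text does not describe how fastq_qualitytrim_window.pl works internally, so this rule and the function names trim_by_window and has_uncalled_base are illustrative assumptions, not the script's actual logic.

```python
def has_uncalled_base(seq):
    """Return True if the read contains an uncalled position ('N')."""
    return "N" in seq.upper()


def trim_by_window(seq, quals, threshold=20, window=2):
    """Truncate a read at the first window whose mean quality is below threshold.

    seq   -- read bases as a string
    quals -- list of integer Phred scores, one per base
    The truncation rule is an assumption made for illustration.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(0, len(seq) - window + 1):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q < threshold:
            return seq[:start], quals[:start]
    return seq, quals
```

With the threshold of 20 and window of 2 quoted above, a read whose qualities are 30, 30, 30, 10, 10, 30 would be cut after the third base under this rule.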
For the SOLiD raw data, reads containing uncalled bases were removed, and the remaining reads were corrected with the SAET program. We assembled the SOLiD data with the Denovo2.2 pipeline (http://www.appliedbiosystems.com.cn/), using the optimized k-mer parameter of 25.
For each platform, preassembled contigs shorter than 100 bp were discarded. The remaining contigs were aligned to the NCBI plasmid library (ftp://ftp.ncbi.nlm.nih.gov/genomes/Plasmids/) with blastn. Contigs in which at least 60% of the bases could be aligned to plasmids with identity over 90% were considered plasmid sequences and were discarded before the secondary assembly. We then collected the remaining contigs from the three platforms and performed the secondary assembly with Phrap 1.090518 (Figure 1). To make the overlaps more specific, we set the minimum matching length to 50, -repeat_stringency to 0.95 and default_qual to 30. In addition, secondarily assembled contigs that were extended with only GAIIx data or only SOLiD data were removed.
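The contig filtering rules above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the blastn hits have already been parsed into (start, end, percent identity) tuples in contig coordinates, and the names is_plasmid_contig and filter_contigs are hypothetical.

```python
def is_plasmid_contig(contig_len, hits, min_identity=90.0, min_fraction=0.60):
    """Decide whether enough of a contig aligns to plasmid sequences.

    hits -- list of (start, end, percent_identity), 1-based inclusive contig
            coordinates; assumed to be parsed from blastn output beforehand.
    """
    intervals = sorted((min(s, e), max(s, e))
                       for s, e, ident in hits if ident > min_identity)
    covered = 0
    cur_start = cur_end = None
    for s, e in intervals:
        if cur_end is None or s > cur_end:
            # Close the previous merged interval and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = s, e
        else:
            # Overlapping hit: extend the current merged interval.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / contig_len >= min_fraction


def filter_contigs(contigs, hits_by_name, min_len=100):
    """Keep contigs of at least min_len bases that are not flagged as plasmid."""
    kept = {}
    for name, seq in contigs.items():
        if len(seq) < min_len:
            continue
        if is_plasmid_contig(len(seq), hits_by_name.get(name, [])):
            continue
        kept[name] = seq
    return kept
```

Overlapping hits are merged before counting so that a base aligned by several plasmid hits is counted only once toward the 60% fraction.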
Assembled contigs analysis
It is difficult to select a suitable reference for Enterococcus faecium because of the high variance among strains. We therefore took the hybrid contigs from the three platforms as the reference genome and named this group of contigs HEf-3 (GenBank: AJTW00000000). For each platform, we aligned the preassembled contigs to HEf-3. The pairwise alignment was performed with Mummer3 using default parameters. After the alignment, the regions of the reference not covered by the contigs from a platform were considered gaps for that platform. Additionally, 28 archived draft Enterococcus faecium genomes in NCBI (http://www.ncbi.nlm.nih.gov/genome/) were aligned to HEf-3 with Mummer3 to evaluate our assembled contigs. (GenBank:NZ_ABQA00000000, GenBank:NZ_ABQI00000000, GenBank:NZ_ABQJ00000000, GenBank:NZ_ABRY00000000, GenBank:NZ_ABSC00000000, GenBank:NZ_ABSW00000000, GenBank:NZ_ACAS00000000, GenBank:NZ_ACAX00000000, GenBank:NZ_ACAY00000000, GenBank:NZ_ACAZ00000000, GenBank:NZ_ACBA00000000, GenBank:NZ_ACBB00000000, GenBank:NZ_ACBC00000000, GenBank:NZ_ACBD00000000, GenBank:NZ_ACHL00000000, GenBank:NZ_ACIY00000000, GenBank:NZ_ACJQ00000000, GenBank:NZ_ACOB00000000, GenBank:NZ_ACOS00000000, GenBank:NZ_ACZZ00000000, GenBank:NZ_ADMM00000000, GenBank:NZ_AEBC00000000, GenBank:NZ_AEBU00000000, GenBank:NZ_AECH00000000, GenBank:NZ_AECI00000000, GenBank:NZ_AECJ00000000, GenBank:NZ_AAAK00000000, GenBank:NZ_AEBG00000000).
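The gap definition above (reference regions not covered by any aligned contig) can be sketched in Python. The sketch assumes the alignment coordinates on a single reference contig have already been extracted as (start, end) pairs; the function name uncovered_regions is hypothetical and nothing here reproduces Mummer3 itself.

```python
def uncovered_regions(ref_len, aligned):
    """Return reference intervals not covered by any alignment.

    ref_len -- length of the reference sequence
    aligned -- list of (start, end) reference coordinates, 1-based inclusive;
               reversed pairs (end < start) are normalised first.
    """
    gaps = []
    next_uncovered = 1  # leftmost reference position not yet known to be covered
    for start, end in sorted((min(s, e), max(s, e)) for s, e in aligned):
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= ref_len:
        gaps.append((next_uncovered, ref_len))
    return gaps
```

Running this once per platform on the same reference gives per-platform gap lists that can then be compared directly.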
Sequencing Bias analysis from three NGS platforms
The coverage depth along HEf-3 was evaluated by mapping the reads from each platform to HEf-3. For 454, Illumina and SOLiD reads, we used Newbler2.0.01.14, BWA0.5.9-r16, and Bioscope1.3 (http://www.appliedbiosystems.com.cn/), respectively, to perform the alignment with default parameters. Subsequently, for each base of the reference, we calculated the number of reads that covered it. For each platform, the GC content of the regions with the bottom and top 5% coverage depth was analysed.
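A minimal sketch of the per-base depth calculation and the GC comparison follows. It assumes mapped reads are available as (start, end) intervals on one reference sequence and that "bottom and top 5%" is taken over individual base positions ranked by depth; the text does not say how regions were delimited, so that choice and the function names are assumptions.

```python
def per_base_depth(ref_len, read_intervals):
    """Count the reads covering each reference base.

    read_intervals -- (start, end) pairs, 1-based inclusive, start <= end.
    Uses a difference array so each read costs constant time.
    """
    diff = [0] * (ref_len + 2)
    for start, end in read_intervals:
        diff[start] += 1
        diff[end + 1] -= 1
    depth, running = [], 0
    for pos in range(1, ref_len + 1):
        running += diff[pos]
        depth.append(running)
    return depth


def gc_of_extremes(reference, depth, fraction=0.05):
    """Return GC% of the lowest-depth and highest-depth fraction of positions."""
    n = max(1, int(len(reference) * fraction))
    order = sorted(range(len(reference)), key=lambda i: depth[i])

    def gc_percent(indices):
        return 100.0 * sum(reference[i] in "GCgc" for i in indices) / len(indices)

    return gc_percent(order[:n]), gc_percent(order[-n:])
```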
After mapping the reads to HEf-3, we analysed the substitution errors of each platform. Bases in the mapped reads that were not consistent with the corresponding bases in HEf-3 were considered substitution errors. The 12 types of substitution errors were counted and compared across the three platforms.
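The 12 substitution types are the ordered pairs of distinct bases (A>C, A>G, and so on). A hedged sketch of the tally, assuming aligned reference/read base pairs have already been extracted from the mappings (the names below are hypothetical):

```python
from collections import Counter
from itertools import permutations

BASES = set("ACGT")
# All ordered pairs of distinct bases: 4 * 3 = 12 substitution types.
SUBSTITUTION_TYPES = [f"{a}>{b}" for a, b in permutations("ACGT", 2)]


def count_substitutions(aligned_pairs):
    """Tally mismatches by type.

    aligned_pairs -- iterable of (reference_base, read_base) for aligned,
                     ungapped positions; ambiguous bases such as N are skipped.
    """
    counts = Counter({t: 0 for t in SUBSTITUTION_TYPES})
    for ref_base, read_base in aligned_pairs:
        ref_base, read_base = ref_base.upper(), read_base.upper()
        if ref_base in BASES and read_base in BASES and ref_base != read_base:
            counts[f"{ref_base}>{read_base}"] += 1
    return counts
```

Because the reference here is itself an assembly, a mismatch counted this way can reflect either a read error or a consensus error.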
The k-mer depth, defined as the number of times that a k-mer appears in the sequencing data, was used to identify repetitive sequences in the genome. A higher k-mer depth indicated that the k-mer was more likely to lie in a repeat region. For the k-mer bias comparison, the 10,000 k-mers with the highest depth were first extracted with a JAVA script. Then, for each NGS platform, the depth of these k-mers was normalized by dividing by the depth of total k-mers. Finally, the normalized k-mer depth was compared among the three NGS data sets with a density diagram.
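The k-mer depth calculation can be sketched in Python as below. The normalization is read here as dividing each top k-mer's depth by the summed depth of all k-mers in that data set; this interpretation, and the function names, are assumptions, and the original analysis used a JAVA script that is not shown in the text.

```python
from collections import Counter


def kmer_depths(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def normalized_top_kmers(counts, top_n=10000):
    """Return the top_n deepest k-mers with depth divided by total k-mer depth."""
    total = sum(counts.values())
    return {kmer: depth / total for kmer, depth in counts.most_common(top_n)}
```

Normalizing by the total makes the values comparable between platforms sequenced to very different coverage depths.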
Optimizing parameters for hybrid assembly
To optimize the hybrid assembly, we investigated how the assembled contigs from one platform were influenced by different amounts of data from another platform. Based on the pipeline in Figure 1, all the data from one platform (for example, 454) were first preassembled. We then randomly sampled different amounts of data from another platform (GAIIx or SOLiD) and preassembled these data. Each subgroup sampling was repeated 5 times from our sequencing data pool. Finally, the secondary assembly was performed with these two groups of preassembled contigs using Phrap, with the same assembly parameters as above. Similarly, we studied how the assembled Illumina contigs were influenced by subgroup data from the 454 and SOLiD platforms, and how the assembled SOLiD contigs were influenced by subgroup data from the 454 and GAIIx platforms.
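The subsampling design can be sketched as follows. This is an illustration under stated assumptions: reads are held in memory as strings, the target depth is converted to a base budget using an assumed genome size, and the names subsample_reads and replicate_subsamples are hypothetical. The text does not describe how the sampling was actually implemented.

```python
import random


def subsample_reads(reads, genome_size, target_depth, seed=None):
    """Randomly pick reads until their total length reaches the target depth.

    target_depth is fold coverage, so the base budget is
    genome_size * target_depth.
    """
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    target_bases = genome_size * target_depth
    picked, total = [], 0
    for read in shuffled:
        if total >= target_bases:
            break
        picked.append(read)
        total += len(read)
    return picked


def replicate_subsamples(reads, genome_size, target_depth, replicates=5, seed=0):
    """Draw independent subsamples, one per replicate, with distinct seeds."""
    return [subsample_reads(reads, genome_size, target_depth, seed + i)
            for i in range(replicates)]
```

Each replicate would then be preassembled separately and merged with the baseline contigs, giving five assembly outcomes per coverage level.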
Generating and processing sequence data from NGS platforms
In order to investigate the properties of sequencing data from each NGS platform and combine them to achieve the best genome assembly, we sequenced the nosocomial pathogen Enterococcus faecium with three popular NGS platforms: Roche 454, Illumina GAIIx, and ABI SOLiD 4. We refer to them as 454, GAIIx, and SOLiD hereafter. This strain was originally isolated from a patient's peritoneal drainage fluid. E. faecium was reported to show a high degree of genomic variation among closely related strains, which made it challenging to characterize the complete genome of a variant strain.
Sequence data obtained from three NGS platforms for the E. faecium strain. Table columns: Number of Reads; Average Read Length (bp); Total Bases (bp); Estimated Coverage Depth.
Assemblies with single-NGS data sets
Genome drafts of the E. faecium strain from primary and secondary assemblies. Table columns: Genome Draft Size (bp); Number of Contigs; Max Contig (bp).
Assembly with 454 data set
Assembly with GAIIx data set
Using the GAIIx dataset, a genome draft of 2.948 Mb was constructed, with an N50 size of 9,804 bp and 1,013 contigs (Table 2). We then randomly sampled GAIIx reads and performed assemblies at variable coverage depth. As shown in Figure 2, the draft genome size reached a plateau at 200-fold coverage depth, while the number of contigs increased to ~2,500 before falling back to ~1,100 at about 180-fold coverage depth. The N50 size of the genome draft continued to increase until it levelled out at 240-fold coverage depth. Increasing the sequence reads beyond 240-fold had very little effect on the assembly results. The higher number of contigs and smaller N50 size resulted from the shorter read length of GAIIx compared with 454 sequence data.
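N50 is used throughout as the measure of assembly continuity. As a reminder of the standard definition (the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly size), a short Python sketch is given here; the function name n50 is illustrative only.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0
```

For contigs of 8, 8, 4, 3, 3, 2, 2 and 2 bases (32 in total), the two longest contigs already reach 16 bases, so the N50 is 8.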
Assembly with SOLiD data set
Using the SOLiD dataset, a genome draft of 2.98 Mb was constructed, with an N50 size of 1,389 bp and 3,385 contigs (Table 2). We also randomly sampled SOLiD reads to vary the coverage depth for assembly with Denovo2.2. As shown in Figure 2, the draft genome size reached a plateau at 100-fold coverage depth, while the number of contigs increased to ~7,500 before falling back to ~4,000 at about 220-fold coverage depth. The N50 size of the genome draft continued to increase until it levelled out at 300-fold coverage depth. Increasing the sequence reads beyond 300-fold had very little effect on the assembly outcome. Similarly, the higher number of contigs and smaller N50 size resulted from the short read length of SOLiD. In comparison with the assembly results from 454 or GAIIx data, SOLiD data produced the worst draft in terms of sequence continuity.
Hybrid assembly with all NGS data
The genome drafts produced from single-NGS datasets indicated that each NGS platform had some intrinsic "defect" that limited the quality of the genome draft. Such an inherent limitation could not be overcome by simply increasing coverage depth, as illustrated by our results (Figure 2). Furthermore, the NGS platforms differed in assembly efficiency; 454 data produced a higher degree of contiguity than GAIIx or SOLiD data.
To remedy the single-NGS limitation, we applied a hybrid assembly approach that merged the single-NGS assemblies in a secondary assembly step. The process for hybrid assembly is outlined in Figure 1. The contigs constructed in the primary assembly step were merged using Phrap (ver 1.090518). The hybrid assembly with all three NGS data sets, named HEf-3 (Table 2), formed a consensus with a total size of 3,103,094 bp. The consensus genome draft had its number of contigs reduced to 204 and its N50 size increased to 34,849 bp (Table 2).
Hybrid assembly was also attempted by combining two of the three single-NGS data. As shown in Table 2, the combinations of two single-NGS assemblies also improved secondary assembly remarkably over primary assemblies.
Evaluating results of hybrid assemblies
Among the 28 other E. faecium strains compared, two E. faecium draft genomes, U0317 (GenBank: NZ_ABSW00000000) and 1,231,502 (GenBank: NZ_ACAX00000000), which are 2,893,029 bp and 2,926,114 bp in size, respectively, showed the highest similarity to HEf-3. They had 93.8% and 90.6% overall identity to HEf-3 in the aligned regions (Figure 3). The N50 sizes of the two genome drafts were 31,583 bp and 28,295 bp, respectively, smaller than that of HEf-3 (34,849 bp). By this measure, the HEf-3 draft generated with the hybrid assembly approach was more contiguous than these two reported E. faecium genome drafts. The consensus genome from HEf-3 was then used as a reference for further analysis.
Analysis of sequencing biases from three NGS platforms
We demonstrated that the hybrid assembly approach generated a genome draft superior to those from single NGS data sets. The data also clearly suggested that each NGS platform had systematic but distinct biases across the E. faecium genome.
Bias for coverage depth
It was observed that the reads aligned to genome assembly were not distributed uniformly among different platforms. The 454 reads were more uniformly mapped than any of the other two (Figure 3). To quantify the extent of biases in each NGS, we first measured the variations of coverage depth. The standard deviation of coverage depth for 454, GAIIx, and SOLiD was 25.5, 400.7 and 426.4, respectively. The results indicated SOLiD and GAIIx had a higher coverage variation than 454. We then computed the correlation coefficients (r) of per-base coverage depth among 454, GAIIx, and SOLiD. The coefficients (r) between 454 and GAIIx, 454 and SOLiD, and GAIIx and SOLiD on the same sample were 0.698, 0.690, and 0.747, respectively. These numbers indicated a comparatively stronger correlation between GAIIx and SOLiD than correlation between 454 and GAIIx, and 454 and SOLiD.
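The two statistics used in this comparison can be computed as in the following Python sketch. It assumes the per-base depth vectors of two platforms are equal-length lists over the same reference and uses the population form of the standard deviation; the text does not state which form was used, and the function names are hypothetical.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Population standard deviation of per-base coverage depth."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson_r(x, y):
    """Pearson correlation coefficient between two per-base depth vectors."""
    if len(x) != len(y):
        raise ValueError("depth vectors must have the same length")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

Because the platforms were sequenced to very different mean depths, the raw standard deviations are not directly comparable; dividing each by its mean depth (the coefficient of variation) is one way to put them on a common scale.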
Bias for GC contents
GC content in different regions of the E. faecium genome. Table columns: Avg GC% content; GC% in regions of bottom 5% coverage depth; GC% in regions of top 5% coverage depth.
Bias for k-mer diversity
Bias for substitution error
Optimizing parameters for hybrid assembly
Similar results were obtained when using Illumina or SOLiD data as baseline, and varying amount of other NGS data in secondary assembly. With Illumina assembly as baseline, the assembly outcomes for adding 454 data stabilized at 25x coverage depth and at 110x for SOLiD (Figure 7C, D). With SOLiD assembly as baseline, the assembly outcomes for adding 454 data stabilized at 25x coverage depth and at 200x for Illumina (Figure 7E, F).
Conclusions and discussion
Enterococcus faecium is a commensal bacterium inhabiting the human gastrointestinal tract and an important nosocomial pathogen that is often multidrug resistant. Studies employing multi-locus sequence typing (MLST) determined that most E. faecium strains from hospital patients belonged to a genetic lineage termed clonal complex 17 (CC17) [21, 22]. Strains from CC17 often carried pathogenicity island(s), insertion sequence (IS) elements and antibiotic resistance and/or virulence genes. MLST analysis illustrated a genetic difference between the strains from hospital patients and those from healthy human hosts. Recently, the available multi-strain genome data provided new insight into these two subpopulations. Core genome analysis suggested a significant difference (3.5-4.2% at the DNA level) between them.
The current study attempted to address a difficult issue faced by microbiologists: how to most effectively construct a complete microbial genome using today's state-of-the-art sequencing technologies. To accomplish this, we first characterized the sequence data and genome assembly from each NGS platform, then tested various conditions for hybrid assembly with combinations of NGS data, and obtained optimized parameters for achieving the most cost-efficient assembly. Our results helped form guidelines to direct genomic work on other analogous microorganisms and thus have important practical implications.
In this study, we employed the three most popular NGS technologies, Illumina GAIIx, ABI SOLiD4.0, and 454 GS-FLX, each having some unique properties. High coverage depth was obtained with GAIIx and SOLiD, while 454 data were limited to 28-fold because of the higher associated cost. After quality filtering, 454 and SOLiD retained over 99% of their sequence reads while GAIIx lost close to one quarter. The larger proportion of low quality reads was repeatedly observed with GAIIx, so GAIIx experiments should be planned with additional sequencing to compensate. By performing assemblies with each NGS platform, we obtained baselines and upper and lower boundaries for each NGS technology (Figure 2). The distinct results from each NGS platform were associated with the intrinsic properties of each technology, such as sequence read length, base calling error rates and systematic bias. The results indicated that each NGS assembly had a ceiling in continuity that could not be overcome by simply increasing coverage depth. Previously, the amount and read length of sequencing data were considered important factors influencing assembled contigs [24, 25]. Our work showed that the amount of sequence data influenced the assembly outcomes within a certain range, and that increasing coverage depth beyond this range had very little impact. The longer read length of 454 data generated longer continuous assemblies than Illumina or SOLiD data did.
We tested the hybrid assembly approach that integrated data from three next-generation sequencing platforms. The N50 size of the hybrid assembly was significantly increased over each single-NGS result (Table 2). In addition, we observed marked improvement of N50 in hybrid assemblies using two single-NGS data sets. The SOLiD data, although having the shortest read length, helped produce much improved assemblies of the E. faecium genome using the hybrid approach. This was the first such study to combine SOLiD sequence data with other types of NGS data.
The genomic divergence of E. faecium strains was analysed by comparing our hybrid assembled genome HEf-3 with 28 partial genome sequences deposited at NCBI. We observed significant alignment differences between these strains and believe the differences were primarily due to the divergent E. faecium genome, which carries a large amount of strain-specific gene content. Willems and co-workers also showed that different E. faecium strains contained significantly different gene content (up to 12%), and that the acquisition of mobile elements, such as insertion sequence (IS) elements, phage genes, plasmid sequences, antibiotic resistance genes and regulatory genes, mainly contributed to this divergence. In addition, because of the intrinsic biases of the sequencing platforms, the alignment differences might reflect some inevitable sequencing errors. Thus, the highly divergent genomes hindered the construction of a novel genome. The de novo assembly approach therefore urgently needed optimization to resolve this issue.
In order to better integrate NGS sequence data, it is important to characterize the biases of each NGS technology. These biases include skewed coverage depth, bias for particular sequences (i.e. GC content and k-mer diversity), and unequal tendencies in substitution error among NGS platforms. We observed that the gap regions of each NGS platform generally had a lower GC content than other regions. In particular, the gap regions of SOLiD had the lowest GC content of the three. What had not been characterized previously were the detailed differences in bias among NGS technologies. Although GC content was shown to have a significant influence on coverage depth in all three NGS platforms, the gaps that could not be filled by SOLiD data were the most sensitive to GC content. For GAIIx data, the effect of GC content on k-mer depth might cause high-GC repeat sequences to be over-represented. In addition, each NGS platform showed a different pattern of substitution errors, presumably related to its different chemistry.
Previously, some studies performed hybrid assemblies using Sanger, 454, and Solexa reads, or two of the three sets. DiGuistini and co-workers combined Sanger PE, 454 SE, and Illumina PE sequence data to perform a hybrid assembly, in which they used Forge to generate a genome draft with a length of 32.5 Mb and an N50 size of 32 kb. Though the longer Sanger reads were advantageous in extending an assembly, the generation of 18,424 Sanger reads was associated with much higher cost. Reinhardt et al. assembled their genome using only Solexa and 454 reads. They first assembled short Solexa reads into contigs before merging these contigs with long 454 reads. However, their use of Newbler for the secondary assembly has possible drawbacks. Besides the fact that Newbler was developed and tested mostly with 454 data, there was a limit on the length of sequences that could be used as input for the secondary assembly. Also, the coverage depth of the 454 data used directly in the secondary assembly could skew the constructed genome, as the Illumina contigs from the primary assembly could be counted only as one-fold coverage depth.
In order to optimize the hybrid assembly with data from the different NGS technologies, we tested assembly conditions by varying the amount of different NGS data. We observed continuing growth of assembly N50 as the coverage depth of the varying NGS data increased. However, like assembly with single NGS data, there were ceilings for the increase in N50, which were stalled at distinct coverage depths for different NGS data. The progressive results provided us with some basis to optimize future sequence study using the hybrid assembly approach.
Our hybrid assembly approach consisted of primary and secondary assembly steps. Our pipeline could easily be extended by adding a primary assembly path for processing data from new platforms, e.g. SMRT and Ion PGM. Pacific Biosciences announced that SMRT technology could generate reads longer than 1,000 bp. Integrating a primary assembly step for SMRT long reads is certainly a focus for extending our pipeline in the future.
Useful guidelines for hybrid assembly
It is difficult to optimize the parameters of a hybrid assembly study without a thorough understanding of the properties of each NGS technology. We compared the assemblies from each single NGS data set and characterized the systematic biases (coverage, GC content, k-mer diversity, substitution error) of each platform. Based on our investigations, we offer some guidelines to help researchers choose the best strategy and optimize conditions for the hybrid assembly approach. To construct a microbial genome similar in size to that of E. faecium, sequencing with 454 GS-FLX, which has the best efficiency compared with Illumina GAIIx and SOLiD4, is desirable. Usually 25-fold coverage depth by 454 GS-FLX is sufficient. Adding 240-fold Illumina GAIIx or 300-fold SOLiD sequence coverage will produce a much improved assembly in terms of total length, N50, and error correction. If cost allows, the optimal outcome can be achieved by combining 454 GS-FLX, Illumina GAIIx, and SOLiD4 sequencing data at the abovementioned coverage depths. If two types of NGS data are to be used, the combination of 454 GS-FLX and Illumina GAIIx is preferred over the other two possible combinations.
List of abbreviations
PE: paired-end read. NCBI: National Center for Biotechnology Information
non-gap regions. HEf-3: Hybrid assembled contigs from three NGS data.
The authors would like to thank Zhi-Yong Shen, Jie Ping, and Jia Sheng for their assistance on computation support, and Lulu Zheng for his help in drafting the manuscript. This work is supported by the National Basic Research Program of China (973 Program, Contract No. 2012CB316501), National Natural Science Foundation of China (Contract No. 81171613) and in part by Shanghai Pujiang Scholarship Program (Contract No. 10PJ1408000).
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM) - Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.