 Research
 Open Access
 Published:
MultiCSAR: a multiple referencebased contig scaffolder using algebraic rearrangements
BMC Systems Biology volume 12, Article number: 139 (2018)
Abstract
Background
One of the important steps in the process of assembling a genome sequence from short reads is scaffolding, in which the contigs in a draft genome are ordered and oriented into scaffolds. Currently, several scaffolding tools based on a single reference genome have been developed. However, a single reference genome may not be sufficient alone for a scaffolder to generate correct scaffolds of a target draft genome, especially when the evolutionary relationship between the target and reference genomes is distant or some rearrangements occur between them. This motivates the need to develop scaffolding tools that can order and orient the contigs of the target genome using multiple reference genomes.
Results
In this work, we utilize a heuristic method to develop a new scaffolder called MultiCSAR that is able to accurately scaffold a target draft genome based on multiple reference genomes, each of which does not need to be complete. Our experimental results on real datasets show that MultiCSAR outperforms other two multiple referencebased scaffolding tools, Ragout and MeDuSa, in terms of many average metrics, such as sensitivity, precision, Fscore, genome coverage, NGA50, scaffold number and running time.
Conclusions
MultiCSAR is a multiple referencebased scaffolder that can efficiently produce more accurate scaffolds of a target draft genome by referring to multiple complete and/or incomplete genomes of related organisms. Its standalone program is available for download at https://github.com/ablabnthu/MultiCSAR.
Background
Although sequencing technologies have greatly advanced in recent years, assembling a genomic sequence from a large number of generated reads still remains a challenging task [1, 2]. Largely because of the presence of repetitive sequences, most of assembled genomes are just draft genomes that may be composed of several hundreds of fragmented sequences called contigs. The completeness of an assembled genome actually is significant to its downstream analysis and interpretation in many biological applications [3]. For the purpose of producing a more complete genome, the contigs in a draft genome usually are ordered and oriented into larger gapcontaining scaffolds, in which their gaps can be filled in the subsequent gapclosing process [4].
Although a lot of referencebased scaffolders have been developed, most of them utilize only one genome as the reference to scaffold (i.e., order and orient) the contigs of a target draft genome [5–12]. Actually, the algorithmic methods of all these single referencebased scaffolders can be classified into either alignmentbased approaches [5–8] or rearrangementbased approaches [9–12]. For the alignmentbased scaffolding approaches, they align contig sequences from a draft genome with the sequence of a reference genome and scaffold these contigs based on their matched positions on the reference genome. As for the rearrangementbased scaffolding approaches, they utilize the information of genome structures to scaffold the contigs in a draft genome such that the order and orientation of conserved genes (or sequence markers) between the scaffolded contigs and the reference genome are as similar as possible. Among the single referencebased scaffolders mentioned above, CAR [11] and CSAR [12] were developed by us based on different rearrangementbased algorithms [13, 14]. In principle, CSAR can be considered as an improved version of CAR, because the reference genome used by CAR is required to be complete, but the one used by CSAR can be incomplete.
In fact, a single reference genome may not be sufficient alone for a scaffolding tool to correctly generate the scaffolds of a target draft genome, especially when the evolutionary relationship between target and reference genomes is distant or some rearrangements (e.g., reversals, transpositions and translocations) occur between them. This motivates the need to develop multiple referencebased scaffolders that can scaffold the contigs of the target draft genome using multiple reference genomes derived from related organisms, which may provide different but complementary types of scaffolding information.
Previously, we utilized a heuristic approach to extend our single referencebased scaffolder CAR to a multiple referencebased scaffolder called MultiCAR [15] and demonstrated that it performed better than other similar existing tools, such as Ragout [16] and MeDuSa [17], when all the reference genomes are complete. Unlike Ragout and MeDuSa, however, MultiCAR is not able to accept an incomplete genome as a reference, which ultimately limits its widespread adoption because in practice complete reference genomes are not always available for a target draft genome [18]. In principle, Ragout constructed a breakpoint graph by representing each contig in a target draft genome by two vertices and a contig adjacency supported by reference genomes by an edge with a parsimony cost. The parsimony cost of an edge was computed based on a given phylogenetic tree for the target and reference genomes. Ragout then inferred the contig adjacencies in the target genome from a perfect matching with minimum parsimony cost in the breakpoint graph. By contrast, MeDuSa formulated the contig scaffolding problem as finding a path cover with maximum weight in a scaffolding graph, in which each vertex represents a contig in a target draft genome and each edge represents a contig adjacency with a weight denoting the number of supported reference genomes. Since the computation of an optimal path cover is NPhard, MeDuSa adopted a 2approximation algorithm to compute an approximate path cover from the scaffolding graph and then inferred the scaffolds of the target genome from this approximate path cover.
In this study, we further improve our MultiCAR into a new multiple referencebased scaffolding tool called MultiCSAR that can utilize multiple complete and/or incomplete genomes as the references to scaffold the contigs of a target draft genome. Our experimental results on real datasets containing multiple incomplete genomes as the references have finally shown that MultiCSAR still outperforms Ragout and MeDuSa in terms of many average evaluation metrics, such as sensitivity, precision, Fscore, genome coverage, NGA50, scaffold number and running time.
Methods
The algorithmic method we use to implement our multiple referencebased scaffolder MultiCSAR is a graphbased heuristic approach, which (i) utilizes our CSAR [12] to infer single referencederived scaffolds for a target draft genome based on each of multiple reference genomes, (ii) uses all single referencederived scaffolds to build an edgeweighted contig adjacency graph, (iii) finds a maximum weighted perfect matching from the contig adjacency graph, and (iv) constructs a multiple referencederived scaffold of the target draft genome according to the maximum weighted perfect matching. In the following, we describe the details of these four steps in our multiple referencebased scaffolding algorithm.
Suppose that we are given a target draft genome T consisting of n contigs c_{1},c_{2},…,c_{n}, as well as k references of complete or incomplete genomes R_{1},R_{2},…,R_{k} with weights w_{1},w_{2},…,w_{k}, respectively. We first utilize our single referencebased scaffolder CSAR [12] to obtain a scaffolding result S_{i} of T based on each R_{i}, where 1≤i≤k. After that, we construct a contig adjacency graph G=(V,E) [15], which is an undirected edgeweighted graph as defined below. In principle, a contig c_{j}∈T, where 1≤j≤n, is a fragmented sequence of DNA with two extremities, respectively called head and tail. For our purpose, two vertices, denoted by \(c_{j}^{h}\) and \(c_{j}^{t}\), are used to represent the head and tail of c_{j} in G, respectively, and an undirected edge is used to connect any two vertices in G that are not the extremities from the same contig. In other words, we have \(V = \left \{c_{j}^{t},c_{j}^{h}  1 \le j \le n\right \}\) and E={(u,v)u,v∈V and both u and v are not the extremities of the same contig }. We say that an edge in G is supported by R_{i} if both of its vertices are adjacent extremities from two different but consecutive contigs in a scaffold of S_{i}. If an edge in G can be supported by multiple reference genomes simultaneously, it has a weight equal to the sum of the weights of all these reference genomes. However, if an edge in G is not supported by any reference genome, it receives a weight of zero. Next, we use the Blossom V program [19] to find a maximum weighted perfect matching M in G, where a subset of edges in G is called a perfect matching if every vertex in G is incident to exactly one edge in this subset. Let \(C = \left \{\left (c_{j}^{t},c_{j}^{h}\right)  1 \le j \le n\right \}\) and M^{′} be a subset of edges obtained from M by deleting some of its edges with the minimum total weight such that M^{′}∪C contains no cycle. Finally, we order and orient the contigs of T into scaffolds based on the edge connections in M^{′}. Note that CSAR was developed by us based on a nearlinear time algorithm [14] and the running time of Blossom V is \(\mathcal {O}\left (n^{4}\right)\) for a graph with n vertices. Therefore, the above multiple referencebased scaffolding method we used to implement MultiCSAR is a polynomialtime algorithm. We refer the reader to Fig. 1 for its pseudocode description.
Below, we give an example to illustrate how our scaffolding algorithm works (see Fig. 2 for an example). As mentioned previously, a contig is a fragmented sequence of DNA with two extremities, a head and a tail. Given a scaffold, we scan its ordered and oriented contigs in the lefttoright direction. If the tail of a contig, say c_{i}, precedes its head, we write this contig as +c_{i} in the scaffold; otherwise, we write it as −c_{i}. Suppose that we have the following three scaffolding results S_{1}=(+c_{1},+c_{2},+c_{3}), S_{2}=(+c_{2},+c_{3},+c_{4}) and S_{3}=(−c_{2},−c_{1},−c_{4},−c_{3}) that are respectively obtained by applying the CSAR program on a target genome consisting of four contigs T={c_{1},c_{2},c_{3},c_{4}} and three reference genomes R_{1},R_{2} and R_{3} with equal weight of one. We then utilize S_{1},S_{2} and S_{3} to construct the contig adjacency graph G=(V,E) of T and apply the Blossom V program on G to derive a maximum weighted perfect matching \(M=\left \{\left (c_{1}^{h}, c_{2}^{t}\right), \left (c_{2}^{h}, c_{3}^{t}\right), \left (c_{3}^{h}, c_{4}^{t}\right), \left (c_{4}^{h}, c_{1}^{t}\right)\right \}\). By definition, we have \(C=\left \{\left (c_{1}^{t}, c_{1}^{h}\right), \left (c_{2}^{t}, c_{2}^{h}\right), \left (c_{3}^{t}, c_{3}^{h}\right), \left (c_{4}^{t}, c_{4}^{h}\right)\right \}\) in this instance. Clearly, M∪C forms a cycle. In this case, we can remove the minimum weighted edge \(\left (c_{4}^{h}, c_{1}^{t}\right)\) from M to obtain \(M^{\prime }=\left \{\left (c_{1}^{h}, c_{2}^{t}\right), \left (c_{2}^{h}, c_{3}^{t}\right), \left (c_{3}^{h}, c_{4}^{t}\right)\right \}\) such that M^{′}∪C contains no cycles. Finally, we can derive the scaffold (+c_{1},+c_{2},+c_{3},+c_{4}) of T, which is equivalent to (−c_{4},−c_{3},−c_{2},−c_{1}), according to the edge connections in M^{′}.
It is worth mentioning that the weights of the reference genomes mentioned before can be derived by MultiCSAR automatically using the following sequence identitybased weighting scheme. As mentioned in our previous study [12], CSAR utilizes either NUCmer or PROmer to identify aligned sequence markers between the target genome T and each reference genome R_{i}, where 1≤i≤k. NUCmer and PROmer are from the MUMmer sequence alignment package [20] that is a set of programs to detect similar regions (i.e. sequence markers) between biological sequences. Particularly, NUCmer detects markers directly on input DNA sequences, while PROmer detects markers on the sixframe protein translation of the input DNA sequences. Suppose that there are τ such sequence markers, say m_{1},m_{2},…,m_{τ}, between T and R_{i}. In principle, each such marker m_{j} actually is a local alignment between T and R_{i}, where 1≤j≤τ. Let L(m_{j}) and I(m_{j}) be the alignment length and percent identity of m_{j}, respectively. The weight of R_{i} is then given as \(w_{i} = \sum _{j=1}^{\tau } L(m_{j}) \times I(m_{j})\). Note that the weights of the reference genomes are all defaulted to one when running MultiCSAR, unless the sequence identitybased weighting scheme is used.
From algorithmic point of view, MultiCSAR has the following two new features when compared with its previous version MultiCAR. First, MultiCSAR utilizes CSAR, rather than CAR as used in MultiCAR, to obtain the single referencederived scaffold of the target draft genome. As mentioned in the introduction, the reference genome used by CAR is required to be complete, but the one used by CSAR can be incomplete. Due to this reason, MultiCSAR therefore can accept incomplete genomes as references. Second, MultiCSAR can be run with the sequence identitybased weighting scheme to automatically measure the weight of each reference genome. Generally, the more similar a reference genome is to the target genome, the more weight it receives to support an edge in the contig adjacency graph. In MultiCAR, however, the weights of all reference genomes must be assigned by the user; otherwise, they are defaulted to one.
Results
We tested MultiCSAR, as well as other two multiple referencebased scaffolders Ragout (version 1.0) and MeDuSa (version 1.6), on five real bacterial datasets as shown in Table 1, which were originally prepared and analyzed by Bosi et al. in the study of MeDuSa [17]. Each testing dataset comprises a draft genome to be scaffolded (hereafter called target genome) and two or more references of complete and/or incomplete genomes. All the multiple referencebased scaffolders evaluated in this study were run with their default parameters, except Ragout for which a reliable phylogenetic tree for each testing dataset was unknown and hence a star tree was used instead. Consequently, their average performance results over the five bacterial datasets are shown in Table 2. In addition, the average performance results of MultiCSAR when running with the sequence identitybased weighting scheme are shown in Table 3.
Discussion
For the target genome in each testing dataset, Bosi et al. also provided a reference order of its contigs, which actually was derived from the complete sequence of the target genome and hence can be served as a truth standard in our evaluation. All the tested multiple referencebased scaffolders were evaluated using several different metrics, such as sensitivity, precision, Fscore, genome coverage, NGA50, scaffold number and running time. In principle, sensitivity, precision and Fscore are measures to access the accuracy of scaffolds, genome coverage to access the coverage of scaffolds on the target genome, and NGA50 and scaffold number to access the contiguity of scaffolds. In the following, we describe their definitions in detail.
Given two consecutive contigs in a scaffold, they are considered as a correct join if they also appear in consecutive order and correct orientation in the reference order. The number of the correct contig joins in a scaffolding result is then called as true positive (TP) and the number of the others (i.e., incorrect joins) as false positive (FP). Denote by P the number of all contig joins in the reference order. The sensitivity of a scaffolding result is thus defined as \(\frac {\text {TP}}{P}\), its precision as \(\frac {\text {TP}}{\text {TP}+\text {FP}}\), and its Fscore (i.e., the harmonic mean of sensitivity and precision) as \(\frac {2 \times \text {sensitivity} \times \text {precision}}{\text {sensitivity} + \text {precision}}\) [21]. In principle, Fscore is a balanced measure between sensitivity and precision and it is high only when both sensitivity and precision are high. To conveniently define the metric of genome coverage below, we assume that the target genome contains only circular DNAs. In this case, therefore, each contig has two neighbor contigs respectively on its both sides. Given a contig in a scaffolding result, if it is correctly joined with its two neighbor contigs on its both sides, its whole length is counted as contributing to the genome coverage (as will be defined later). If this contig is correctly joined with exactly one neighbor contig, half of its length is counted. If it is incorrectly joined with other contigs on its both sides, its length is not counted entirely. The genome coverage of a scaffolding result is thus defined as the ratio of the sum of the contig lengths counted using the rules mentioned above to the sum of all contig lengths [10]. Note that if the target genome contains linear DNAs, the first and last contigs located in the reference order of each linear DNA have only one neighbor contig and hence just half of their lengths will be counted in the numerator (if they are correctly joined with their neighbor contigs) and denominator of the genome coverage. The NGA50 value of a scaffolding result is obtained by aligning its scaffolds to the target complete sequence, breaking them at misassembly breakpoints, deleting unaligned regions, and finally computing the NG50 value of the resulting scaffolds that is the size of the smallest scaffold satisfying that 50% of the genome is contained in scaffolds of size NG50 or larger [22].
Clearly, as shown in Table 2, MultiCSAR running with NUCmer achieves the best scaffolding results in sensitivity, Fscore, genome coverage, NGA50 and running time, while still exhibiting the second best scaffolding results in precision and scaffold number. On the other hand, when using PROmer to identify sequence markers, MultiCSAR obtains the best performance in scaffold number, whereas the second best performance in sensitivity, Fscore, genome coverage and NGA50. From the viewpoint of precision, Ragout performs the best among the evaluated scaffolders. However, its sensitivity is much lower than those obtained by MultiCSAR running with NUCmer and PROmer, resulting in that its Fscore is substantially inferior to those of MultiCSAR with NUCmer and PROmer. In addition, Ragout gives the worst performance in scaffold number and running time. As for MeDuSa, it yields the second best result in running time, but the worst results in sensitivity, precision, Fscore, genome coverage and NGA50.
On the other hand, it is worth mentioning that, as shown in Table 3, several average accuracy measures of MultiCSAR, such as sensitivity, precision, Fscore, genome coverage and NGA50, can be further improved if it is run with the sequence identitybased weighting scheme.
Conclusions
Scaffolder is a helpful tool for a sequencing project to obtain a more complete sequence of a genome. In this study, we presented MultiCSAR, an easytouse multiple referencebased scaffolder that can efficiently produce more accurate scaffolds of a target draft genome by referring to multiple complete and/or incomplete genomes of related organisms. MultiCSAR was implemented by a graphbased heuristic approach that utilizes our CSAR to obtain all the single referencederived scaffolding results, uses them to build an edgeweighted contig adjacency graph, finds a maximum weighted perfect matching from this graph, and finally constructs a multiple referencederived scaffolding result based on this matching. All the steps in this heuristic approach can be done in polynomial time. Compared with its previous version MultiCAR, MultiCSAR has the following two new features: (i) it can accept an incomplete genome as a reference, thus greatly improving its applicability since most of available reference genomes are still incomplete, and (ii) it can automatically derive the supporting weights of reference genomes using a sequence identitybased weighting scheme. By testing on five real prokaryotic datasets containing multiple references of incomplete genomes, our MultiCSAR indeed outperforms other two multiple referencebased scaffolders Ragout and MeDuSa in terms of average sensitivity, precision, Fscore, genome coverage, NGA50, scaffold number and running time. In the future, it will be interesting to investigate whether the performance quality of our MultiCSAR can be further enhanced by incorporating other single referencebased scaffolders, such as OSLay [6], Mauve Aligner [7] and r2cat [8].
Abbreviations
 CAR:

Contig assembly using rearrangements
 CSAR:

Contig scaffolding using algebraic rearrangements
 DNA:

Deoxyribonucleic acid
 FP:

False positive
 Mbp:

Megabase pair
 MeDuSa:

Multidraft based scaffolder
 MultiCAR:

Multiple referencebased contig assembly using rearrangements
 MultiCSAR:

Multiple referencebased contig scaffolder using algebraic rearrangements
 MUMmer:

Maximal unique matchmer
 NG50:

Length of the shortest scaffold for which longer and equal length scaffolds cover at least 50% of the genome
 NGA50:

Analogous to NG50 where the scaffolds are replaced by regions that can be aligned to the target complete sequence
 NUCmer:

Nucleotide MUMmer
 OSLay:

Optimal syntenic layouter
 PROmer:

Protein MUMmer
 r2cat:

Related reference contig arrangement tool
 Ragout:

Referenceassisted genome ordering utility
 TP:

True positive
References
 1
Pop M, Salzberg SL. Bioinformatics challenges of new sequencing technology. Trends Genet. 2008; 24:142–9.
 2
Pop M.Genome assembly reborn: recent computational challenges. Brief Bioinform. 2009; 10:354–66.
 3
Mardis E, McPherson J, Martienssen R, Wilson RK, McCombie WR. What is finished, and why does it matter. Genome Res. 2002; 12:669–71.
 4
Hunt M, Newbold C, Berriman M, Otto TD. A comprehensive evaluation of assembly scaffolding tools. Genome Biol. 2014; 15:R42.
 5
van Hijum SA, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gapclosure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005; 33:W560—6.
 6
Richter DC, Schuster SC, Huson DH. OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics. 2007; 23:1573–9.
 7
Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT. Reordering contigs of draft genomes using the Mauve Aligner. Bioinformatics. 2009; 25:2071–3.
 8
Husemann P, Stoye J. r2cat: synteny plots and comparative assembly. Bioinformatics. 2010; 26:570–1.
 9
Munoz A, Zheng C, Zhu Q, Albert VA, Rounsley S, Sankoff D. Scaffold filling, contig fusion and comparative gene order inference. BMC Bioinformatics. 2010; 11:304.
 10
Dias Z, Dias U, Setubal JC. SIS: a program to generate draft genome sequence scaffolds for prokaryotes. BMC Bioinformatics. 2012; 13:96.
 11
Lu CL, Chen KT, Huang SY, Chiu HT. CAR: contig assembly of prokaryotic draft genomes using rearrangements. BMC Bioinformatics. 2014; 15:381.
 12
Chen KT, Liu CL, Huang SH, Shen HT, Shieh YK, Chiu HT, et al.CSAR: a contig scaffolding tool using algebraic rearrangements. Bioinformatics. 2018; 34:109–11.
 13
Li CL, Chen KT, Lu CL. Assembling contigs in draft genomes using reversals and blockinterchanges. BMC Bioinformatics. 2013; 14(Suppl 5):9.
 14
Lu CL. An efficient algorithm for the contig ordering problem under algebraic rearrangement distance. J Comput Biol. 2015; 22:975–87.
 15
Chen KT, Chen CJ, Shen HT, Liu CL, Huang SH, Lu CL. MultiCAR: a tool of contig scaffolding using multiple references. BMC Bioinformatics. 2016; 17:469.
 16
Kolmogorov M, Raney B, Paten B, Pham S. Ragout: a referenceassisted assembly tool for bacterial genomes. Bioinformatics. 2014; 30:i302—9.
 17
Bosi E, Donati B, Galardini M, Brunetti S, Sagot MF, Lio P, et al.MeDuSa: a multidraft based scaffolder. Bioinformatics. 2015; 31:2443–51.
 18
Pagani I, Liolios K, Jansson J, Chen IMA, Smirnova T, Nosrat B, et al.The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2012; 40:D571—9.
 19
Kolmogorov V. Blossom V: a new implementation of a minimum cost perfect matching algorithm. Math Program Comput. 2009; 1:43–67.
 20
Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al.Versatile and open software for comparing large genomes. Genome Biol. 2004; 5:R12.
 21
Mandric I, Zelikovsky A. ScaffMatch: scaffolding algorithm based on maximum weight matching. Bioinformatics. 2015; 31:2632–8.
 22
Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013; 29:1072–5.
Acknowledgements
This study was partially supported by Ministry of Science and Technology of Republic of China under grant MOST1072221E007066MY2.
Funding
The publication costs of this paper were funded by Ministry of Science and Technology of Republic of China under grant MOST1072221E007066MY2.
Availability of data and materials
All data analyzed during this study are included in this published article.
About this supplement
This article has been published as part of BMC Systems Biology Volume 12 Supplement 9, 2018: Proceedings of the 29th International Conference on Genome Informatics (GIW 2018): systems biology. The full contents of the supplement are available online at https://bmcsystbiol.biomedcentral.com/articles/supplements/volume12supplement9.
Author information
Author notes
Affiliations
Contributions
CLL conceived of the study, designed the algorithm, carried out experimental analyses and wrote the manuscript. KTC participated in the design of algorithm, implemented the software, conducted the experiments and analyzed experimental results. HTS participated in the software implementation, dataset experiments and analyses of experimental results. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Chin Lung Lu.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Published
DOI
Keywords
 Bioinformatics
 Sequencing
 Contig
 Scaffolding
 Multiple reference genomes