Gregory K Farrant, Mark Hoebeke, Frédéric Partensky, Gwendoline Andres, Erwan Corre, Laurence Garczarek.
Abstract
BACKGROUND: The sequencing depth provided by high-throughput sequencing technologies has allowed a rise in the number of de novo sequenced genomes that could potentially be closed without further sequencing. However, genome scaffolding and closure require costly human supervision that often results in genomes being published as drafts. A number of automatic scaffolders were recently released, which improved the global quality of genomes published in the last few years. Yet, none of them reach the efficiency of manual scaffolding.
Year: 2015 PMID: 26335184 PMCID: PMC4559175 DOI: 10.1186/s12859-015-0705-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Flowchart describing the whole pipeline from genome assembly to finishing, which includes the generation of input data for WiseScaffolder (top), the semi-automatic scaffolder itself (middle) and the manual scaffolding and genome finishing steps (bottom). Dark grey boxes correspond to raw sequencing data, light blue boxes to either input files for WiseScaffolder or exchange files used during the scaffolding process, the dark blue box to the outputs of WiseScaffolder and light grey boxes to the different genome stages along the whole pipeline. The four capital letters correspond to the four subcommands of WiseScaffolder
Description of the three datasets used in this study
| | | Synechococcus sp. WH8103 | Rhodobacter sphaeroides | Homo sapiens Chr.14 |
|---|---|---|---|---|
| Genome | Genome size (bp) | 2,429,688 | 4,603,060 | 88,289,540 |
| | GC % | 59.48 | 68.79 | 40.89 |
| | Genome composition | 1 x chromosome | 2 x chromosomes, 5 x plasmids | 1 x chromosome |
| Pair-End library | Read length (bp) | 101 | n.a. | n.a. |
| | Nb of reads | 2.0 × 10^7 | | |
| | Average insert size (bp) | 75 | | |
| Mate-Pair library | Read length (bp) | 101 | 101 | 101 |
| | Nb of reads | 4.1 × 10^7 | 2.05 × 10^6 | 3.65 × 10^7 |
| | Maximal insert size (bp) | 3,800 | 3,000 | 3,500 |
| Contigs | Contigs assembler | CLC AssemblyCell© | Bambus2 [ | CABOG [ |
| | Nb of contigs | 42 | 177 | 3,541 |
| | Nb of bp in contigs | 2,398,638 | 4,371,571 | 86,255,201 |
| Reference | | This study | [ | [ |
n.a. not applicable
Comparative statistics for the assembly and scaffolding of Synechococcus sp. WH8103, Rhodobacter sphaeroides and Homo sapiens Chr.14
| | | Synechococcus sp. WH8103 | | | | | | Rhodobacter sphaeroides | | | | | Homo sapiens Chr.14 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | CONTIGS | SCAFFOLDS | | | | | CONTIGS | SCAFFOLDS | | | | CONTIGS | SCAFFOLDS | | | |
| Contiguity statistics | Assembler/Scaffolder | CLC | WISCA | SSPACE | SOPRA | SCARPA | CLC | Bambus2 | WISCA | SSPACE | SOPRA | SCARPA | CABOG | WISCA | SSPACE | SOPRA | SCARPA |
| | Number of scaffolds | n.a. | 3 | 13 | 5 | 7 | 36 | n.a. | 17 | 116 | 18 | 30 | n.a. | 228 | 930 | 414 | 259 |
| | Unscaffolded contigs | 42 | 13 | n.a. | 22 | 23 | n.a. | 177 | 33 | n.a. | 136 | 39 | 3,541 | 184 | n.a. | 2,587 | 468 |
| | Fragments^a ≥ 10 kbp | 17 | 3 | 4 | 8 | 15 | 13 | 72 | 16 | 59 | 64 | 34 | 2,132 | 221 | 435 | 1,885 | 362 |
| | Max. scaffold size (kbp) | 357 | 1,296 | 889 | 779 | 357 | 500 | 279 | 2,502 | 279 | 407 | 1,346 | 297 | 2,554 | 1,592 | 497 | 2,253 |
| | N50^b (kbp) | 211 | 1,296 | 729 | 367 | 222 | 278 | 97 | 2,502 | 134 | 118 | 178 | 47 | 694 | 348 | 56 | 499 |
| | LG50^c | 5 | 1 | 2 | 3 | 5 | 4 | 17 | 1 | 13 | 14 | 5 | 563 | 38 | 74 | 460 | 52 |
| | LG75^c | 8 | 2 | 3 | 5 | 8 | 6 | 38 | 5 | 27 | 31 | 13 | 1,217 | 79 | 161 | 1,008 | 115 |
| | Nb of Ns (kbp) | n.a. | 2.4 | 10.2 | 7.8 | 15.0 | 20.0 | n.a. | 8.9 | 35.0 | 11.0 | 55.6 | n.a. | 178.7 | 359.3 | 38.0 | 304.8 |
| | Genome coverage (%)^d | 98.72 | 98.86 | 98.92 | 98.72 | 98.72 | 98.80 | 94.97 | 94.97 | 95.06 | 94.97 | 94.77 | 97.70 | 97.73 | 97.69 | 97.70 | 97.35 |
| Misassemblies^h | Misassemblies | 0 | 13 | 2 | 2 | 5 | 2 | 4 | 62 | 7 | 6 | 10 | 108 | 981 | 177 | 120 | 200 |
| | - Relocations^e | 0 | 11 | 2 | 2 | 5 | 2 | 1 | 48 | 3 | 2 | 6 | 106 | 743 | 175 | 118 | 199 |
| | - Translocations^f | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 8 | 4 | 4 | 4 | 0 | 0 | 0 | 0 | 0 |
| | - Inversions^g | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 2 | 238 | 2 | 2 | 1 |
| | Misassembled contigs | 0 | 2 | 1 | 1 | 3 | 2 | 4 | 10 | 6 | 6 | 7 | 87 | 177 | 105 | 96 | 101 |
| | Local misassemblies | 11 | 28 | 36 | 22 | 12 | 19 | 308 | 365 | 352 | 324 | 369 | 454 | 1,932 | 2,125 | 668 | 1,368 |
| | Mismatches | 2 | 7 | 20 | 2 | 7 | 34 | 254 | 287 | 279 | 268 | 280 | 87,135 | 86,626 | 86,856 | 87,335 | 85,748 |
| | Indels | 0 | 3 | 1 | 3 | 1 | 3 | 255 | 277 | 273 | 267 | 270 | 19,990 | 20,656 | 20,740 | 20,511 | 21,102 |
| | - short indels | 0 | 1 | 0 | 1 | 1 | 2 | 195 | 204 | 206 | 202 | 203 | 17,000 | 16,912 | 17,126 | 17,190 | 17,333 |
| | - long indels | 0 | 2 | 1 | 2 | 0 | 1 | 60 | 73 | 67 | 65 | 67 | 2,990 | 3,744 | 3,614 | 3,321 | 3,769 |
| | Indels length | 0 | 19 | 15 | 55 | 3 | 38 | 2,035 | 2,423 | 2,223 | 2,124 | 2,230 | 72,044 | 103,854 | 92,992 | 80,016 | 92,442 |
n.a. not applicable; WISCA WiseScaffolder; ^a i.e. scaffolds and unscaffolded contigs; ^b contig size over which the sum of contig sizes corresponds to 50 % of the assembly; ^c minimal number of fragments (contigs/scaffolds) needed to cover 50 %/75 % of the reference genome; ^d calculated as ; ^e misassembly where contiguous sequence fragments align on the same chromosome but where the left sequence fragment aligns over 1 kbp away from the right sequence fragment on the reference, or overlaps it by more than 1 kbp; ^f misassembly where the sequence fragments align on different chromosomes; ^g misassembly where the sequence fragments align on opposite strands of the same chromosome; ^h misassembly statistics are results from QUAST [15]
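The N50 and LG50/LG75 definitions given in the footnotes can be illustrated with a short Python sketch. It is only a minimal illustration, not the QUAST implementation: the function names `n50` and `lg` are hypothetical, and LG50/LG75 are approximated here from summed fragment lengths, whereas an actual evaluation would rely on alignments to the reference genome.

```python
def n50(lengths):
    """Fragment size at which the cumulative size of the fragments,
    taken from largest to smallest, reaches 50 % of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no fragment lengths given")


def lg(lengths, genome_size, fraction=0.5):
    """Minimal number of fragments whose summed length reaches `fraction`
    of the reference genome size (assumption: summed length is used as a
    stand-in for aligned coverage). Returns None if never reached."""
    target = fraction * genome_size
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return count
    return None


# Toy fragment sizes (kbp), not taken from the study
fragments = [50, 40, 30, 20, 10]
print(n50(fragments))                      # N50 of the toy assembly
print(lg(fragments, genome_size=200))      # LG50 against a 200 kbp reference
print(lg(fragments, 200, fraction=0.75))   # LG75 against the same reference
```

Note that N50 is relative to the assembly size, whereas LG50 and LG75 are relative to the reference genome size, which is why a fragmented or incomplete assembly may never reach the LG75 threshold.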
Fig. 3 Local assembly at the edge of a contig using mate-pair reads. Extracting and locally assembling the mates of reads that map in the vicinity of a region of interest (in the present case, the 3′-end of contig_1) allows the contig to be extended up to the 5′-end of the neighboring contig_2
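The read selection step described in this caption can be sketched in Python as follows. This is a hedged illustration under simplifying assumptions, not the WiseScaffolder code: the `Alignment` record, the `mates_near_3prime_end` function and the `window` parameter are hypothetical, read orientation is ignored, and the local assembly of the selected mates is left to an external assembler.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    pair_id: str            # identifier shared by both reads of a mate pair
    mate: int               # 1 or 2
    contig: Optional[str]   # None when the read is unmapped
    pos: int                # 0-based leftmost mapped position (unused if unmapped)


def mates_near_3prime_end(alignments, contig, contig_length, window):
    """Return (pair_id, mate) keys of the mates of reads that map within
    `window` bp of the 3'-end of `contig`. These mates are the candidate
    reads for a local assembly extending the contig."""
    start = max(0, contig_length - window)
    anchored = {
        (a.pair_id, a.mate)
        for a in alignments
        if a.contig == contig and a.pos >= start
    }
    # The mate of read 1 is read 2 and vice versa (3 - 1 = 2, 3 - 2 = 1)
    return sorted((pair_id, 3 - mate) for pair_id, mate in anchored)
```

In practice, `window` would plausibly be chosen in relation to the maximal insert size of the mate-pair library, since mates of reads located further from the contig edge are unlikely to fall beyond it.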
Fig. 2 Comparison of the scaffolds of Synechococcus sp. WH8103 obtained using (a) WiseScaffolder, (b) SSPACE, (c) SOPRA and (d) SCARPA against the closely related genome, Synechococcus sp. WH8102. Whole genome alignments were performed using MUMmer [34]. Only scaffolds and contigs of Synechococcus sp. WH8103 larger than 10 kbp are displayed, and they have been organized to be syntenic with the WH8102 genome. Breakpoints in chromosome reconstruction are represented by light grey areas. Note that the large gap around position 450 kbp in all four assemblies corresponds to a genomic island with similar size but different gene content in Synechococcus spp. WH8102 and WH8103 (see also Additional file 1: Figure S1)
Fig. 4 Distribution of the mate-pair insert sizes generated using the WiseScaffolder ‘preprocess’ subcommand
Fig. 5 Diagram showing the different outputs generated by the ‘preprocess’ subcommand of WiseScaffolder, called A through D. See text for further details