Marten Boetzer, Walter Pirovano.
Abstract
BACKGROUND: The recent introduction of the Pacific Biosciences RS single molecule sequencing technology has opened new doors to scaffolding genome assemblies in a cost-effective manner. The long read sequence information promises to enhance the quality of incomplete and inaccurate draft assemblies constructed from Next Generation Sequencing (NGS) data.
Year: 2014 PMID: 24950923 PMCID: PMC4076250 DOI: 10.1186/1471-2105-15-211
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Statistics of the sequence datasets used for the comparative study
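| MiSeq reads (×1,000) | MiSeq bases (Mb) | MiSeq mean read length (bp) | 454 reads (×1,000) | 454 bases (Mb) | 454 mean read length (bp) | PacBio reads (×1,000) | PacBio bases (Mb) | PacBio mean read length (bp) | |
|---|---|---|---|---|---|---|---|---|---|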
| 3,046.4 | 460.0 | 151 | 44.94 | 232.0 | 516 | 383.5 | 929.1 | 2,422 | |
| 3,794.9 | 548.5 | 144 | 946.2 | 219.1 | 231 | 403.9 | 1,100.3 | 2,724 | |
| 1,718.2 | 249.2 | 145 | 351.9 | 190.7 | 542 | 205.1 | 499.9 | 2,437 | |
| 1,724.4 | 249.4 | 144 | NA | NA | NA | 176.0 | 531.2 | 3,019 | |
| 926.7 | 199.2 | 214 | 178.9 | 74.5 | 416 | 176.4 | 399.8 | 2,266 | |
| 1,943.8 | 279.8 | 143 | 351.5 | 127.1 | 361 | 394.7 | 1,000.2 | 2,534 | |
Overall, a genome coverage of 90-100× for Illumina MiSeq and 45-50× for Roche 454 is used. For the PacBio dataset, an input coverage of ~200× is available.
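The relation between read counts, sequenced bases and coverage can be sketched in a few lines of Python. This is only an illustration, not part of the original study: it assumes that each platform's three columns in the statistics table are read count (in thousands), sequenced bases (in Mb) and mean read length (in bp). The function names and the 5 Mb genome size are hypothetical.

```python
def mean_read_length(reads_thousands: float, bases_mb: float) -> float:
    """Mean read length in bp, from a read count (x1,000) and a yield in Mb."""
    return (bases_mb * 1e6) / (reads_thousands * 1e3)


def coverage(bases_mb: float, genome_size_bp: int) -> float:
    """Fold coverage: total sequenced bases divided by genome size."""
    return (bases_mb * 1e6) / genome_size_bp


def subsample_fraction(bases_mb: float, genome_size_bp: int,
                       target_coverage: float) -> float:
    """Fraction of reads to keep to reach a target coverage (capped at 1.0)."""
    available = coverage(bases_mb, genome_size_bp)
    return min(1.0, target_coverage / available)


if __name__ == "__main__":
    # First row of the statistics table, first three columns (assumed MiSeq).
    print(round(mean_read_length(3046.4, 460.0)))  # close to the listed 151
    # Hypothetical 5 Mb genome: coverage offered by 460.0 Mb of reads.
    print(coverage(460.0, 5_000_000))
    # Fraction of a 929.1 Mb long-read set needed for 50x on that genome.
    print(subsample_fraction(929.1, 5_000_000, 50))
```

Under these assumptions the recomputed mean read lengths agree with the listed values to within rounding, which is what supports reading the columns this way.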
Genome reconstruction of 6 bacterial genomes using different sequencing platforms and assembly strategies
| Ray | - | Unknown | 34 | - | 2,384,099 | 212,852 | 0 | - | - | - | |
| | AHA | Unknown | 21 | - | 2,390,466 | 245,559 | 6,367 | - | - | 110 min | |
| | SSPACE LongRead | Unknown | 7 | - | 2,410,351 | 1,215,562 | 8,899 | - | - | 16 min | |
| CLC | - | Unknown | 62 | - | 2,361,409 | 146,347 | 0 | - | - | - | |
| | AHA | Unknown | 36 | - | 2,389,684 | 222,352 | 16,915 | - | - | 118 min | |
| | SSPACE LongRead | Unknown | 6 | - | 2,395,822 | 1,361,277 | 8,650 | - | - | 19 min | |
| - | Unknown | 58 | - | 2,362,898 | 117,742 | 0 | - | - | - | ||
| | AHA | Unknown | 21 | - | 2,391,876 | 505,738 | 12,781 | - | - | 117 min | |
| | |||||||||||
| Ray | - | 1 | 99 | 0 | 4,583,740 | 95,924 | 0 | 0 | 2 | - | |
| | AHA | 1 | 57 | 0 | 4,632,207 | 220,952 | 32,147 | 2 | 2 | 194 min | |
| | SSPACE LongRead | 1 | 11 | 0 | 4,636,946 | 570,605 | 30,741 | 1 | 9 | 28 min | |
| - | 1 | 126 | 0 | 4,554,695 | 88,183 | 0 | - | - | - | ||
| | AHA | 1 | 57 | 0 | 4,636,666 | 497,336 | 34,587 | 2 | 6 | 214 min | |
| | |||||||||||
| Newbler | - | 1 | 80 | 0 | 4,567,139 | 117,490 | 0 | - | - | - | |
| | AHA | 1 | 12 | 0 | 4,652,318 | 3,320,126 | 45,090 | 6 | 14 | 201 min | |
| | SSPACE LongRead | 1 | 2 | 0 | 4,635,316 | 3,716,545 | 7,793 | 7 | 10 | 32 min | |
| Ray | - | 10 | 144 | 1 | 5,432,073 | 112,112 | 0 | - | - | - | |
| | AHA | 10 | 110 | 1 | 5,475,255 | 227,802 | 34,035 | 1 | 2 | 226 min | |
| | SSPACE LongRead | 10 | 38 | 1 | 5,845,919 | 348,040 | 58,068 | 2 | 23 | 31 min | |
| - | 10 | 293 | 13 | 5,335,444 | 105,156 | 0 | - | - | - | ||
| | AHA | 10 | 238 | 8 | 5,437,860 | 201,528 | 42,214 | 4 | 9 | 312 min | |
| | |||||||||||
| Newbler | - | 10 | 279 | 14 | 5,322,767 | 142,438 | 0 | - | - | - | |
| | AHA | 10 | 209 | 8 | 5,471,954 | 254,465 | 65,936 | 5 | 9 | 297 min | |
| | SSPACE LongRead | 10 | 39 | 3 | 5,565,065 | 703,452 | 75,126 | 11 | 34 | 37 min | |
| Ray | - | 3 | 100 | 0 | 1,806,660 | 25,623 | 0 | - | - | - | |
| | AHA | 3 | 38 | 0 | 1,859,591 | 82,151 | 47,651 | 1 | 5 | 95 min | |
| | SSPACE LongRead | 3 | 8 | 0 | 1,886,509 | 279,967 | 27,386 | 1 | 8 | 14 min | |
| CLC | - | 3 | 110 | 1 | 1,780,141 | 25,117 | 0 | - | - | - | |
| | AHA | 3 | 53 | 1 | 1,844,586 | 63,063 | 50,494 | 0 | 6 | 104 min | |
| | SSPACE LongRead | 3 | 7 | 1 | 1,877,533 | 444,696 | 19,639 | 2 | 6 | 18 min | |
| - | 3 | 316 | 0 | 1,653,291 | 8,912 | 0 | - | - | - | ||
| | AHA | 3 | 61 | 0 | 1,965,997 | 69,167 | 255,189 | 7 | 7 | 95 min | |
| | |||||||||||
| Ray | - | Unknown | 80 | - | 2,639,260 | 75,015 | 0 | - | - | - | |
| | AHA | Unknown | 44 | - | 2,676,952 | 108,006 | 25,336 | - | - | 148 min | |
| | SSPACE LongRead | Unknown | 14 | - | 2,682,588 | 703,034 | 29,889 | - | - | 21 min | |
| - | Unknown | 129 | - | 2,630,768 | 63,442 | 0 | - | - | - | ||
| | AHA | Unknown | 41 | - | 2,769,108 | 239,432 | 73,082 | - | - | 166 min | |
| | |||||||||||
| Ray | - | 4 | 119 | 2 | 4,972,739 | 90,542 | 0 | - | - | - | |
| | AHA | 4 | 40 | 2 | 5,012,323 | 203,631 | 34,496 | 0 | 4 | 190 min | |
| | SSPACE LongRead | 4 | 20 | 2 | 5,112,337 | 488,483 | 27,988 | 0 | 6 | 28 min | |
| CLC | - | 4 | 238 | 5 | 4,974,534 | 43,328 | 0 | - | - | - | |
| | AHA | 4 | 62 | 4 | 5,064,555 | 376,354 | 68,292 | 3 | 7 | 200 min | |
| | SSPACE LongRead | 4 | 7 | 3 | 5,038,082 | 3,235,544 | 21,588 | 6 | 2 | 34 min | |
| - | 4 | 101 | 12 | 4,990,994 | 372,513 | 0 | - | - | - | ||
| | AHA | 4 | 69 | 12 | 5,040,830 | 787,589 | 30,907 | 2 | 6 | 193 min | |
The platform/strategy that yields the lowest number of assembled scaffolds is highlighted in italic-bold. The number of expected scaffolds refers to the number of chromosomes plus the number of plasmids present in the reference genome (if available). In general, the combination of 1) draft assembly using CLCbio for Illumina MiSeq reads or Newbler for Roche 454 reads and 2) scaffolding using SSPACE-LongRead for PacBio CLR reads gives the best results in terms of closure and time. Notably, some draft assembly contigs are not covered by PacBio reads (such as PhiX control or bacterial host sequences). The number of errors introduced during scaffolding is limited, and these errors are often a consequence of true variations between the sequenced library and the previously deposited reference genome.
Figure 1. The effect of PacBio RS long read coverage on genome closure. Results are displayed for SSPACE-LongRead based on the CLCbio draft assembly for 5 organisms. For all samples, the addition of PacBio reads has a positive effect and leads to a significant reduction in the number of contigs. In general, 50× coverage is sufficient to scaffold over most gaps, though ideally 110-160× coverage is required to guarantee optimal performance of our software. Higher coverage (>160×) appears to lead to more fragmented genomes, which is likely due to the increased complexity of the assembly graph.
Figure 2. Overview of the SSPACE-LongRead scaffolding algorithm. A) The input consists of a set of pre-assembled contigs (or scaffolds) in FASTA format and a set of PacBio CLR reads (in FASTA or FASTQ format). B) The PacBio CLR reads are aligned against the contigs using BLASR and only the best alignment matches are kept. A repeated element is indicated in red. C) Contig pairings and multi-contig linkage information are stored; repeated elements are also detected from this information. D) Based on the pairing and linkage information, contigs are ordered, oriented and connected into scaffolds. E) A post-processing step performs the final linearization and circularization.
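Steps B and C of the caption can be illustrated with a small Python sketch. It is not the SSPACE-LongRead implementation. The `Alignment` record, the scoring, the `min_links` threshold and the repeat heuristic are all hypothetical, and the sketch ignores strand and contig orientation, which a real scaffolder has to resolve. It assumes that "best alignment" means the highest-scoring hit per read and contig.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    read: str
    contig: str
    read_start: int  # start of the aligned segment on the read
    read_end: int    # end of the aligned segment on the read
    score: float     # hypothetical score, higher is better


def best_alignments(alignments):
    """Step B (simplified): keep the best-scoring hit per (read, contig)."""
    best = {}
    for aln in alignments:
        key = (aln.read, aln.contig)
        if key not in best or aln.score > best[key].score:
            best[key] = aln
    return list(best.values())


def contig_links(alignments):
    """Step C (simplified): count reads joining each pair of adjacent contigs."""
    by_read = defaultdict(list)
    for aln in alignments:
        by_read[aln.read].append(aln)
    links = defaultdict(int)
    for hits in by_read.values():
        # Order the contigs by where they align along the read.
        hits.sort(key=lambda a: a.read_start)
        for left, right in zip(hits, hits[1:]):
            links[(left.contig, right.contig)] += 1
    return dict(links)


def candidate_repeats(links, min_links=2):
    """Flag contigs with more than one well-supported successor."""
    successors = defaultdict(set)
    for (left, right), count in links.items():
        if count >= min_links:
            successors[left].add(right)
    return {contig for contig, nxt in successors.items() if len(nxt) > 1}
```

In this toy model, a contig `R` that is followed by `B` in some reads and by `D` in others is flagged as a candidate repeat, mirroring the idea that repeated elements show up as conflicting linkage information.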