Jean-Marc Aury, Corinne Cruaud, Valérie Barbe, Odile Rogier, Sophie Mangenot, Gaelle Samson, Julie Poulain, Véronique Anthouard, Claude Scarpelli, François Artiguenave, Patrick Wincker.
Abstract
BACKGROUND: Massively parallel DNA sequencing instruments are enabling the decoding of whole genomes at significantly lower cost and higher throughput than classical Sanger technology. Each of these technologies has been estimated to yield assemblies with more problematic features than the standard method. These problems differ in nature depending on the technique used, so an appropriate mix of technologies may help resolve most difficulties and eventually provide high-quality assemblies without requiring any Sanger-based input.
Year: 2008 PMID: 19087275 PMCID: PMC2625371 DOI: 10.1186/1471-2164-9-603
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Characteristics of the assemblies with different data inputs
| Data input | Coverage | Contigs (number) | Contigs (N50) | Scaffolds (number) | Scaffolds (N50) | Assembly size (% of reference) | Mis-assemblies | Total errors | Substitutions | Insertions/Deletions |
| Sanger | 7.4x | 173 | 39 kb | 2 | 2.2 Mb | 3.417 Mb (95%) | 0 | 3442 | 2494 | 948 |
| Unpaired 454 | 20x | 119 | 48.7 kb | 119 | 48.7 kb | 3.542 Mb (98%) | 0 | 420 | 67 | 353 |
| Unpaired + paired 454 | 25x | 119 | 58.2 kb | 10 | 1 Mb | 3.544 Mb (98%) | 0 | 431 | 75 | 356 |
| Unpaired + paired 454 with Illumina/Solexa GA1 | 25x and 50x | 119 | 58.2 kb | 10 | 1 Mb | 3.544 Mb (98%) | 0 | 163 | 71 | 92 |
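The N50 columns above follow the usual definition: the length L such that contigs (or scaffolds) of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation is given here; the function name `n50` and the example lengths are hypothetical and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half
        # of the total assembled bases is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths in bp, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)
```

A larger N50 for the same total size means the assembly is less fragmented, which is how the contig and scaffold N50 values in the table can be compared across data inputs.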
Figure 1. Largest contig size for each Newbler assembly from a coverage of 1× to 27×.
Figure 2. Genome coverage (scaled around the average coverage) at a resolution of 10 kb (upper graph) along the entire genome of Acinetobacter, and genome coverage of two genomic regions at the base level (two bottom graphs). The black curve is the Sanger read coverage (the average coverage is the black dashed line), blue lines are Solexa read coverage, and red lines are GSFLX read coverage.
Figure 3. Number of corrected and remaining errors after correction with different coverage of Solexa reads.
Remaining errors after correction using Solexa/Illumina reads with different coverage
| Solexa Coverage | Remaining Substitutions | Remaining Insertions | Remaining Deletions |
| 5x | 77 | 127 | 161 |
| 10x | 75 | 95 | 110 |
| 20x | 73 | 59 | 74 |
| 30x | 73 | 48 | 53 |
| 40x | 71 | 46 | 48 |
| 50x | 71 | 46 | 46 |
| 60x | 71 | 43 | 40 |
| 70x | 70 | 43 | 36 |
| 80x | 70 | 43 | 38 |
| 90x | 70 | 43 | 38 |
| 100x | 69 | 40 | 38 |
| 110x | 69 | 39 | 38 |
| 120x | 69 | 38 | 36 |
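The plateau in the table above is easier to see once the three error classes are summed per coverage level. The short sketch below does that arithmetic on the values copied from the table; the variable names are illustrative only.

```python
# (Solexa coverage, substitutions, insertions, deletions), copied from the table.
remaining = [
    (5, 77, 127, 161),
    (10, 75, 95, 110),
    (20, 73, 59, 74),
    (30, 73, 48, 53),
    (40, 71, 46, 48),
    (50, 71, 46, 46),
    (60, 71, 43, 40),
    (70, 70, 43, 36),
    (80, 70, 43, 38),
    (90, 70, 43, 38),
    (100, 69, 40, 38),
    (110, 69, 39, 38),
    (120, 69, 38, 36),
]

# Total remaining errors at each coverage level.
totals = {cov: subs + ins + dels for cov, subs, ins, dels in remaining}

# Errors removed by each step up in coverage (negative means a slight increase).
gains = {
    cov: totals[prev_cov] - totals[cov]
    for (prev_cov, *_), (cov, *_) in zip(remaining, remaining[1:])
}

for cov in totals:
    print(f"{cov:>3}x  total={totals[cov]:>3}  gain={gains.get(cov, '-')}")
```

Summed this way, the 50x row gives 163 remaining errors, which agrees with the total reported for the 454 plus Illumina/Solexa assembly in the first table (71 substitutions and 92 insertions/deletions). Going from 50x to 120x removes only 20 further errors, whereas going from 5x to 50x removes 202.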
Figure 4. Proposed optimized strategy for sequencing a prokaryote genome with Roche-454 and Solexa/Illumina data.
Figure 5. (A) Coverage of the Acinetobacter genome with Solexa/Illumina reads. The replication origin was replaced around 3 Mb. (B) Coverage of the Mycoplasma agalactiae genome with Solexa/Illumina reads. The replication origin was replaced around 700 kb.
Evolution of read coverage during the error-correction process (starting from 50x of Solexa reads leads to a usable coverage of around 17x)
| Stage | Number of reads | Number of bases | Genome coverage |
| Sequenced reads | 5,000,000 | 180,000,000 | 50.0x |
| Uniquely mapped reads | 4,543,370 | 163,561,320 | 45.5x |
| Filtered reads | 3,497,539 | 60,680,570 | 16.9x |
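The coverage figures in the table above can be checked with simple arithmetic: coverage is the number of bases divided by the genome size. The sketch below assumes a genome size of 3.6 Mb, which is inferred here from the first row (180,000,000 bases at 50x) rather than stated in this record, so the recomputed values are approximate.

```python
# (number of reads, number of bases, reported coverage), copied from the table.
stages = {
    "sequenced": (5_000_000, 180_000_000, 50.0),
    "uniquely_mapped": (4_543_370, 163_561_320, 45.5),
    "filtered": (3_497_539, 60_680_570, 16.9),
}

# Assumption: genome size inferred from the first row, not given in the source.
genome_size = 180_000_000 / 50.0  # 3,600,000 bp

coverage = {}
mean_read_length = {}
for name, (reads, bases, reported) in stages.items():
    coverage[name] = bases / genome_size      # recomputed genome coverage
    mean_read_length[name] = bases / reads    # average usable bases per read
    print(f"{name:16s} coverage={coverage[name]:5.1f}x "
          f"(reported {reported}x)  bases/read={mean_read_length[name]:.1f}")
```

Under this assumption the sequenced and uniquely mapped reads average 36 bases each, while the filtered reads retain roughly 17 bases each. Most of the drop from about 45x to about 17x therefore comes from fewer usable bases per read rather than from fewer reads.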