Aleksey V Zimin1,2, Kristian A Stevens3, Marc W Crepeau3, Daniela Puiu2, Jill L Wegrzyn4, James A Yorke1, Charles H Langley3, David B Neale5, Steven L Salzberg2,6.
Abstract
The 22-gigabase genome of loblolly pine (Pinus taeda) is one of the largest ever sequenced. The draft assembly published in 2014 was built entirely from short Illumina reads, with lengths ranging from 100 to 250 base pairs (bp). The assembly was quite fragmented, containing over 11 million contigs whose weighted average (N50) size was 8206 bp. To improve this result, we generated approximately 12-fold coverage in long reads using the Single Molecule Real Time sequencing technology developed at Pacific Biosciences. We assembled the long and short reads together using the MaSuRCA mega-reads assembly algorithm, which produced a substantially better assembly, P. taeda version 2.0. The new assembly has an N50 contig size of 25 361 bp, more than three times as large as achieved in the original assembly, and an N50 scaffold size of 107 821 bp, 61% larger than the previous assembly.
Keywords: Conifers; Genome assembly; Genomics; Next-gen sequencing; Pine genomes
Year: 2017 PMID: 28369353 PMCID: PMC5437942 DOI: 10.1093/gigascience/giw016
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Summary of raw data, super-reads, and mega-reads for the Pinus taeda 2.0 assembly. Coverage is based on a genome size of 22 Gbp. Illumina reads were generated from DNA fragments of 300–500 bp (second row) and from longer 5–10 Kb fragments (third row). Clone coverage refers to the depth of coverage using the entire fragment from which each pair of reads was sequenced (see Methods)
| Data type | Number | Total Length (bp) | Mean read length | Coverage | Clone Coverage |
|---|---|---|---|---|---|
| PacBio reads | 27 667 399 | 267 426 106 405 | 9665 | 12× | n/a |
| Illumina reads | 10 563 266 162 | 1 499 483 795 334 | 142 | 68× | 96× |
| Illumina reads from 5–10 Kb fragments | 3 152 047 806 | 475 959 218 706 | 151 | 22× | 69× |
| Super-reads | 96 369 476 | 44 307 329 021 | 460 | 2× | n/a |
| Mega-reads | 27 986 125 | 103 129 750 091 | 3685 | 4.7× | n/a |
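The Coverage and Mean read length columns follow directly from the other columns: coverage is total length divided by the 22 Gbp genome size given in the caption, and mean read length is total length divided by the number of reads. The short Python sketch below recomputes both from the table values; the function and variable names are illustrative choices, not anything defined in the paper.

```python
GENOME_SIZE_BP = 22_000_000_000  # 22 Gbp, the genome size stated in the caption

# (data type, number of reads, total length in bp), copied from the table above
ROWS = [
    ("PacBio reads", 27_667_399, 267_426_106_405),
    ("Illumina reads", 10_563_266_162, 1_499_483_795_334),
    ("Illumina reads from 5-10 Kb fragments", 3_152_047_806, 475_959_218_706),
    ("Super-reads", 96_369_476, 44_307_329_021),
    ("Mega-reads", 27_986_125, 103_129_750_091),
]


def coverage(total_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Fold coverage: sequenced bases divided by genome size."""
    return total_length_bp / genome_size_bp


def mean_read_length(total_length_bp, number_of_reads):
    """Average read length in bp."""
    return total_length_bp / number_of_reads


for name, n_reads, total in ROWS:
    print(f"{name}: {coverage(total):.1f}x coverage, "
          f"mean length {mean_read_length(total, n_reads):.1f} bp")
```

The recomputed values agree with the table to within rounding. Clone coverage is not reproduced here because it depends on fragment lengths that the table does not list.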
Comparison of two assemblies of Pinus taeda, version 1.01 based on Illumina data only, and version 2.0 using the same Illumina data plus 12× coverage in PacBio reads. Total scaffold span includes the sizes of estimated gaps.
| Assembly | Ptaeda 1.01 | Ptaeda 2.0 |
|---|---|---|
| Total size | 20 148 103 497 bp | 20 613 845 687 bp |
| Total scaffold span | 22 564 679 219 bp | 22 104 209 064 bp |
| N50 contig size | 8206 bp | 25 361 bp |
| Number of contigs | 16 461 900 | 2 855 700 |
| Number of contigs >500 bp | 2 527 203 | 2 445 689 |
| N50 scaffold size | 66 920 bp | 107 036 bp |
| Number of scaffolds >200 bp | 7 068 375 | 1 762 655 |
| Number of scaffolds >500 bp | 2 158 326 | 1 496 869 |
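For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch using the common definition: the length L such that contigs (or scaffolds) of length at least L contain half or more of the total assembled bases. The function name `n50` and the toy lengths are illustrative assumptions; the paper does not describe how its own figures were computed.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one length")


# Toy example: total is 32 bases, and the two 8-base contigs cover half of it.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])

# Fold improvement in N50 contig size, using the values from the table above.
contig_gain = 25_361 / 8_206
```

With the table values, `contig_gain` is slightly above 3, consistent with the "more than three times as large" statement in the abstract.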
Comparison of alignments of 2438 contigs assembled from fosmids to each of the two Pinus taeda assemblies
| Assembly | Total aligned bases | % of contigs covered | % identity |
|---|---|---|---|
| Ptaeda 1.01 | 70 296 106 | 97.67 | 98.79 |
| Ptaeda 2.0 | 70 469 590 | 97.91 | 98.85 |
Evaluation of alignments of 458 core (CEGMA) proteins from Arabidopsis thaliana to the two Pinus taeda assemblies and to two other conifer genomes. Entries show how many proteins have at least 90 % of their sequence contained in a single contig (column 2) or scaffold (column 3)
| Assembly | Proteins aligned to a single contig (%) | Proteins aligned to a single scaffold (%) |
|---|---|---|
|  | 39 | 53 |
|  | 40 | 45 |
|  | 27 | 36 |
|  | 27 | 43 |
Figure 1. Construction of super-reads and mega-reads from Illumina reads. Illumina reads (top left) were used to build longer super-reads (green lines), which in turn were used to construct a database of all 15-mers in those reads. For P. taeda, each super-read replaced an average of ∼150 Illumina reads (Table 1) [5]. PacBio reads (purple lines) and super-reads were then aligned using the 15-mer database. Inconsistent super-reads are shown as kinked lines; these were discarded and the remaining super-reads were merged, using the PacBio reads as templates, to produce mega-reads. The sequence of the mega-reads was thus derived entirely from the low-error-rate super-reads, not from the raw PacBio reads (figure modified from Zimin et al. [7]).
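The k-mer database step in the caption can be illustrated with a toy sketch: index every k-mer of the super-reads, then count how many k-mers each super-read shares with a long read to find candidate matches. This is not MaSuRCA's implementation. All names here (`build_kmer_index`, `candidate_super_reads`, `min_shared`) are hypothetical, the example uses k = 4 on tiny made-up sequences instead of 15-mers, and the steps that discard inconsistent super-reads and merge the rest into mega-reads are omitted.

```python
from collections import defaultdict

K = 15  # k-mer size named in the figure caption


def kmers(seq, k=K):
    """Yield (offset, k-mer) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_kmer_index(super_reads, k=K):
    """Map each k-mer to the set of super-read names that contain it."""
    index = defaultdict(set)
    for name, seq in super_reads.items():
        for _, kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def candidate_super_reads(long_read, index, k=K, min_shared=2):
    """Count k-mers shared between a long read and each super-read.

    Returns {super-read name: shared k-mer count} for super-reads that
    share at least min_shared k-mers with the long read.
    """
    counts = defaultdict(int)
    for _, kmer in kmers(long_read, k):
        for name in index.get(kmer, ()):
            counts[name] += 1
    return {name: c for name, c in counts.items() if c >= min_shared}


# Toy data (made up for illustration); k=4 keeps the example readable.
super_reads = {"sr1": "ACGTACGGTCA", "sr2": "TTTTGGGGCCCC"}
long_read = "GGACGTACGGTCATT"
index = build_kmer_index(super_reads, k=4)
hits = candidate_super_reads(long_read, index, k=4)
```

In this toy case only `sr1` shares k-mers with the long read, so it is the only candidate returned. A real long read would contain sequencing errors, which is why shared k-mers are counted rather than requiring an exact full-length match.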