Aleksey V Zimin1,2, Daniela Puiu1, Richard Hall3, Sarah Kingan3, Bernardo J Clavijo4, Steven L Salzberg1,5.
Abstract
Common bread wheat, Triticum aestivum, has one of the most complex genomes known to science, with 6 copies of each chromosome, enormous numbers of near-identical sequences scattered throughout, and an overall haploid size of more than 15 billion bases. Multiple past attempts to assemble the genome have produced assemblies that were well short of the estimated genome size. Here we report the first near-complete assembly of T. aestivum, using deep sequencing coverage from a combination of short Illumina reads and very long Pacific Biosciences reads. The final assembly contains 15 344 693 583 bases and has a weighted average (N50) contig size of 232 659 bases. This represents by far the most complete and contiguous assembly of the wheat genome to date, providing a strong foundation for future genetic studies of this important food crop. We also report how we used the recently published genome of Aegilops tauschii, the diploid ancestor of the wheat D genome, to identify 4 179 762 575 bp of T. aestivum that correspond to its D genome components.
Keywords: PacBio sequencing; genome assembly; hybrid assembly; plant genomes; wheat genome
Year: 2017 PMID: 29069494 PMCID: PMC5691383 DOI: 10.1093/gigascience/gix097
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Table 1: Assembly statistics for each of the assemblies of Triticum aestivum, constructed as described in the text
| Assembly | Element type | Number | Total size, bp | Average size, bp | N50 size, bp |
|---|---|---|---|---|---|
| Triticum 1.0 | Contigs | 829 839 | 17 045 571 778 | 20 541 | 76 267 |
| Triticum 1.0 | Scaffolds > 2 Kb | 576 137 | 16 889 295 941 | 29 314 | 101 195 |
| Triticum 2.0 | Contigs | 375 328 | 14 395 027 822 | 38 353 | 75 599 |
| Triticum 2.0 | Scaffolds > 2 Kb | 252 501 | 14 412 484 332 | 57 078 | 100 805 |
| FALCON Trit 1.0 | Contigs | 97 809 | 12 939 100 857 | 132 289 | 215 314 |
| Triticum 3.0 | Contigs | 279 439 | 15 343 711 528 | 54 908 | 232 613 |
| Triticum 3.1 | Contigs | 279 439 | 15 344 693 583 | 54 912 | 232 659 |
To enable fair comparisons, all N50 sizes were computed using an estimated genome size of 15.34 Gb.
Next, in order to detect and remove redundant regions of the assembly, we aligned the assembly against itself using the nucmer program from the MUMmer package [13]. We identified and excluded scaffolds that were completely contained in and ≥96% identical to other scaffolds. After this de-duplication procedure, the reduced assembly, Triticum 2.0, contained 14.40 Gbp in 375 328 contigs with an N50 contig size of 75 599 bp, with scaffolds spanning 14.45 Gbp and an N50 scaffold size of 100 805 bp (Table 1).
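The two computations mentioned here, the containment filter and an N50 measured against a fixed genome size estimate rather than the assembly's own total, can be illustrated with a minimal Python sketch. The `SelfAlignment` record layout and the function names are hypothetical and chosen for illustration only. Parsing real nucmer output is not shown, and the authors' actual filtering code may differ in how it resolves mutually contained scaffolds.

```python
from typing import NamedTuple


class SelfAlignment(NamedTuple):
    """One self-alignment hit between two scaffolds (hypothetical layout)."""
    query: str        # scaffold that may be redundant
    target: str       # scaffold it aligns to
    query_len: int    # full length of the query scaffold, bp
    aligned_len: int  # query bases covered by the alignment, bp
    identity: float   # percent identity of the alignment


def contained_scaffolds(hits, min_identity=96.0):
    """Names of scaffolds fully covered by another scaffold at >= min_identity."""
    redundant = set()
    for h in hits:
        # Ignore self-hits and targets already marked for removal, so two
        # identical scaffolds do not eliminate each other.
        if h.query == h.target or h.target in redundant:
            continue
        if h.aligned_len >= h.query_len and h.identity >= min_identity:
            redundant.add(h.query)
    return redundant


def n50(lengths, genome_size=None):
    """N50 against genome_size, or against the summed lengths if None."""
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the sequences cover less than half of genome_size


hits = [
    SelfAlignment("s2", "s1", 500, 500, 98.5),   # contained, identical enough
    SelfAlignment("s3", "s1", 800, 400, 99.0),   # only partially covered
    SelfAlignment("s4", "s1", 300, 300, 90.0),   # contained, identity too low
]
print(contained_scaffolds(hits))
print(n50([100, 80, 60, 40, 20]), n50([100, 80, 60, 40, 20], genome_size=400))
```

Fixing the denominator at the genome size estimate means that removing redundant sequence does not by itself change the N50 threshold, which is what makes the values in Table 1 comparable across assemblies of different total size.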
Figure 1: Illustration of the merging process for the Triticum 2.0 and FALCON Trit 1.0 assemblies. If 2 contigs A and B from the FALCON assembly overlapped a Triticum 2.0 contig by at least 5000 bp, then A and B were merged together using the Triticum 2.0 contig to fill the gap.
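The merging rule in this caption can be illustrated with a heavily simplified Python sketch. It assumes, beyond what the caption states, that all three sequences are on the same strand, that the end of A aligns to the start of the bridging contig, and that the start of B aligns to its end. The function name and the coordinate parameters are hypothetical, and the authors' actual procedure works from alignments rather than from precomputed coordinates.

```python
def merge_via_bridge(a, b, bridge, a_end_on_bridge, b_start_on_bridge,
                     min_overlap=5000):
    """Join contigs a and b using bridge to fill the gap between them.

    Assumes the suffix of a aligns to bridge[:a_end_on_bridge] and the
    prefix of b aligns to bridge[b_start_on_bridge:]. Returns the merged
    sequence, or None if the rule does not apply.
    """
    overlap_a = a_end_on_bridge
    overlap_b = len(bridge) - b_start_on_bridge
    if overlap_a < min_overlap or overlap_b < min_overlap:
        return None  # an overlap is shorter than the required minimum
    if b_start_on_bridge < a_end_on_bridge:
        return None  # a and b overlap each other; not handled in this sketch
    gap_fill = bridge[a_end_on_bridge:b_start_on_bridge]
    return a + gap_fill + b


# Toy example with a 4 bp threshold instead of 5000 bp.
merged = merge_via_bridge("AAAACCCC", "TTTTAAAA", "CCCCGGTTTT",
                          a_end_on_bridge=4, b_start_on_bridge=6,
                          min_overlap=4)
print(merged)
```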
Figure 2: K-mer uniqueness ratios for the wheat genome (Triticum aestivum) compared to the cow, fruit fly, rice, loblolly pine, and Ae. tauschii genomes. The plot shows the percentage of each genome that is covered (y-axis) by unique sequences of length k for various values of k (x-axis).
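One plausible reading of the quantity plotted here is the percentage of k-mer start positions in a genome whose k-mer occurs exactly once. The Python sketch below computes that on a toy string. The exact definition behind the figure (for example, whether both strands are considered) is not given in this text, so the function is an assumption, and it counts the forward strand only.

```python
from collections import Counter


def kmer_uniqueness(seq, k):
    """Percentage of k-mer start positions in seq whose k-mer occurs once."""
    n_positions = len(seq) - k + 1
    if n_positions <= 0:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n_positions))
    # A k-mer seen exactly once accounts for exactly one position.
    unique_positions = sum(1 for c in counts.values() if c == 1)
    return 100.0 * unique_positions / n_positions


toy = "ACGTACGTTT"
for k in (2, 4, 6):
    print(k, round(kmer_uniqueness(toy, k), 2))
```

On the toy string the percentage rises with k, which is the general shape such curves take. A highly repetitive genome needs a much larger k before most positions become unique.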
Figure 3: Missing 31-mers in the different assemblies of Triticum aestivum. Using the Illumina read data from a previously published assembly of the same genome, we counted all 31-mers in the reads and then plotted how many of these 31-mers are missing from each assembly. The x-axis shows how often the k-mers occur in the reads. The y-axis shows how many distinct k-mers are missing from each assembly. The FALCON Trit 1.0 assembly had the most missing k-mers, while the MaSuRCA-driven Triticum 2.0 assembly had the fewest.
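The analysis behind this figure can be sketched in Python as follows: count every k-mer in the reads, then tally, for each read multiplicity, how many distinct k-mers never appear in the assembly. The function names are hypothetical. Treating a k-mer and its reverse complement as the same (canonical) k-mer is an assumption made here because assembled contigs have arbitrary strand, and an in-memory counter like this one would not scale to a real wheat data set, where a dedicated k-mer counting tool would be needed.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq, k):
    """Yield the canonical form of every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def missing_kmer_histogram(reads, contigs, k=31):
    """Map read multiplicity -> number of distinct k-mers absent from contigs."""
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    assembly_kmers = {km for contig in contigs for km in kmers(contig, k)}
    histogram = Counter()
    for km, count in read_counts.items():
        if km not in assembly_kmers:
            histogram[count] += 1
    return dict(histogram)


# Toy example with k=3 instead of 31.
toy_hist = missing_kmer_histogram(["ACGTA", "ACGTA", "GGGCC"], ["ACGT"], k=3)
print(sorted(toy_hist.items()))
```

K-mers that are missing at low read multiplicity are often sequencing errors, while those missing at high multiplicity more likely reflect real sequence absent from the assembly, which is why the histogram is plotted against multiplicity.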