Martin Mascher1,2, Thomas Wicker3, Jerry Jenkins4, Christopher Plott4, Thomas Lux5, Chu Shin Koh6, Jennifer Ens7, Heidrun Gundlach5, Lori B Boston4, Zuzana Tulpová8, Samuel Holden9, Inmaculada Hernández-Pinzón9, Uwe Scholz1, Klaus F X Mayer5, Manuel Spannagl5, Curtis J Pozniak7, Andrew G Sharpe6, Hana Šimková8, Matthew J Moscou9, Jane Grimwood4, Jeremy Schmutz4, Nils Stein1,10.
Abstract
Sequence assembly of large and repeat-rich plant genomes has been challenging, requiring substantial computational resources and often several complementary sequence assembly and genome mapping approaches. The recent development of fast and accurate long-read sequencing by circular consensus sequencing (CCS) on the PacBio platform may greatly increase the scope of plant pan-genome projects. Here, we compare current long-read sequencing platforms regarding their ability to rapidly generate contiguous sequence assemblies in pan-genome studies of barley (Hordeum vulgare). Most long-read assemblies are clearly superior to the current barley reference sequence based on short reads. Assemblies derived from accurate long reads excel in most metrics, but the CCS approach was the most cost-effective strategy for assembling tens of barley genomes. A downsampling analysis indicated that 20-fold CCS coverage can yield very good sequence assemblies, while even five-fold CCS data may capture the complete sequence of most genes. We present an updated reference genome assembly for barley with near-complete representation of the repeat-rich intergenic space. Long-read assembly can underpin the construction of accurate and complete sequences of multiple genomes of a species to build pan-genome infrastructures in Triticeae crops and their wild relatives. © American Society of Plant Biologists 2021.
Year: 2021 PMID: 33710295 PMCID: PMC8290290 DOI: 10.1093/plcell/koab077
Source DB: PubMed Journal: Plant Cell ISSN: 1040-4651 Impact factor: 11.277
Sequence datasets analyzed in the present study
| Acronym | Description | Reference |
|---|---|---|
| TRITEX | Illumina short-read data from multiple library types (paired-end, mate-pair, 10× Chromium); used for hybrid assemblies | |
| PE450 | Overlapping 2 × 250 reads with an insert size ∼450 bp; used for polishing of long-read assemblies; subset of TRITEX | |
| CLR | PacBio continuous long reads; 121× coverage | |
| CCS | PacBio circular consensus reads; 27× coverage | This study |
| ONT | Oxford Nanopore reads; 85× coverage | This study |
| Hi-C | Chromosome conformation capture sequencing data; used for pseudomolecule construction | |
Metrics of different sequence assemblies of the genome of barley cv. Morex
| Acronym | Input data | Size | Size > 1 Mb | Contig N50 | Scaffold N50 | BUSCO | Isoseq | Label sites | HC genes |
|---|---|---|---|---|---|---|---|---|---|
| TRITEX | TRITEX | 4.65 Gb | 4.23 Gb | 33 kb | 40.2 Mb | 96.0% | 96.7% | 89.2% | 98.3% |
| CLR_MECAT | CLR, PE450 | 4.14 Gb | 3.94 Gb | 10.2 Mb | 10.2 Mb | 95.3% | 95.6% | 95.8% | 95.2% |
| CLR_wtdbg2 | CLR, PE450 | 4.07 Gb | 3.32 Gb | 2.85 Mb | 2.85 Mb | 92.9% | 93.8% | 91.6% | 91.2% |
| Hybrid_Wengan | CLR, PE450 contigs | 4.14 Gb | 769 Mb | 496 kb | 496 kb | 94.8% | 95.7% | 81.0% | 94.0% |
| ONT_smartdenovo | ONT, PE450 | 4.14 Gb | 4.05 Gb | 14.2 Mb | 14.2 Mb | 97.4% | 96.9% | 95.6% | 91.6% |
| CCS_Falcon | CCS, PE450 | 4.19 Gb | 4.09 Gb | 24.2 Mb | 24.2 Mb | 96.5% | 97.0% | 98.0% | 96.9% |
| CCS_Canu | CCS | 4.48 Gb | 4.18 Gb | 28.7 Mb | 28.7 Mb | 96.5% | 97.1% | 99.0% | 97.1% |
Note that gene models are defined on the TRITEX assembly and can be affected by structural errors in that assembly. Genes not aligned to TRITEX (1.7%) are due to alignment uncertainty.
Total assembly size.
Cumulative size of sequences contained in scaffolds larger than 1 Mb.
Long-read assemblies are gap-free, hence scaffold and contig N50s are identical.
Proportion of complete BUSCO gene models (total: 425, viridiplantae_odb10) present in one or more copies.
Proportion of aligned Isoseq reads (total: 123,875), minimum alignment length: 90%, minimum identity: 97%.
Proportion of aligned DLE1 label sites of the Bionano map.
Proportion of aligned Morex V2 HC gene models (total: 32,787), minimum alignment length: 99%, minimum identity: 100%.
Contigs assembled from PE450 data with Minia3 (Monat et al. 2019).
Summary statistics of Bionano optical maps of cv. Morex
| | DLS (DLE-1) | NLRS (Nt.BspQI) |
|---|---|---|
| Number of filtered molecules | 2,791,276 | 774,557 |
| Molecule N50 | 281 kb | 340 kb |
| Number of contigs | 257 | 2,875 |
| Contig N50 | 87.6 Mb | 2.1 Mb |
| Assembly length | 4,249 Mb | 4,289 Mb |
| Genome coverage | 116× | 57× |
Reported by Mascher et al. (2017).
Relative performance of different assemblies in resolving resistance gene loci
| Assembly | | | | | | | Rank sum | Total rank |
|---|---|---|---|---|---|---|---|---|
| ONT_smartdenovo | 2 | 1 | 1 | 3 | 2 | 2 | 11 | 1 |
| CCS_Canu | 1 | 3 | 2 | 1 | 1 | 4 | 12 | 2 |
| CCS_Falcon | 5 | 2 | 2 | 2 | 3 | 3 | 17 | 3 |
| CLR_MECAT | 3 | 4 | 4 | 3 | 5 | 1 | 20 | 4 |
| CLR_wtdbg2 | 4 | 6 | 5 | 5 | 4 | 5 | 29 | 5 |
| TRITEX | 6 | 5 | 6 | 6 | 5 | 6 | 34 | 6 |
| Hybrid_Wengan | 7 | 6 | 7 | 7 | 7 | 6 | 40 | 7 |
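The rank sums and total ranks in the table above can be reproduced with a few lines of Python. This is only an illustrative sketch: the per-locus ranks are copied from the table, the six locus names are not preserved in this record and are therefore left unnamed, and the function name `total_ranks` is a hypothetical helper, not part of the study's pipeline.

```python
# Per-locus ranks copied from the table above (six resistance gene loci;
# the locus names are not preserved in this record, so columns are unnamed).
PER_LOCUS_RANKS = {
    "ONT_smartdenovo": [2, 1, 1, 3, 2, 2],
    "CCS_Canu": [1, 3, 2, 1, 1, 4],
    "CCS_Falcon": [5, 2, 2, 2, 3, 3],
    "CLR_MECAT": [3, 4, 4, 3, 5, 1],
    "CLR_wtdbg2": [4, 6, 5, 5, 4, 5],
    "TRITEX": [6, 5, 6, 6, 5, 6],
    "Hybrid_Wengan": [7, 6, 7, 7, 7, 6],
}


def total_ranks(per_locus_ranks):
    """Sum the per-locus ranks and order assemblies by that sum (lowest is best).

    Returns a list of (assembly, rank_sum, total_rank) tuples.
    Ties in the rank sum would keep their input order; none occur in this table.
    """
    sums = {name: sum(values) for name, values in per_locus_ranks.items()}
    ordered = sorted(sums, key=sums.get)
    return [(name, sums[name], position) for position, name in enumerate(ordered, start=1)]


if __name__ == "__main__":
    for name, rank_sum, position in total_ranks(PER_LOCUS_RANKS):
        print(f"{position}\t{name}\t{rank_sum}")
```

Summing ranks weights every locus equally, which is why an assembly can rank first overall (ONT_smartdenovo) without being the best at every individual locus.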
Figure 1. Structural complexity at the R gene locus Mla. A, A dotplot of the TRITEX (short-read) scaffold versus the CCS_Canu (long-read) contig encompassing the Mla locus. The region is intact and correct in CCS_Canu, but collapsed in the TRITEX assembly (repeated parallel diagonal lines) and with a small inversion (inverted diagonal line). B, Physical interval of the Mla locus from the reference barley accession Morex that contains three gene families, RGH1 (orange), RGH2 (blue), and RGH3 (green), encoding nucleotide-binding, leucine-rich repeat proteins. Gray arrows define a 39.7 kb tandem duplication. The duplicated segments are 99.9% identical, differing by only 13 SNPs and 11 InDels.
Assembly statistics after scaffolding and gap-filling
| | Before gap-filling | After gap-filling |
|---|---|---|
| Assembly size | 4.2 Gb | |
| Number of scaffolds | 386 | |
| Number of contigs | 588 | 439 |
| Scaffold N50 | 118.9 Mb | |
| Scaffold N90 | 21.9 Mb | |
| Contig N50 | 31.9 Mb | 69.6 Mb |
| Contig N90 | 7.2 Mb | 19.3 Mb |
| Gap size | 3.37 Mb | 1.32 Mb |
Contiguous gap-free stretches within scaffolds.
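For readers unfamiliar with the N50 and N90 statistics used in these tables, the following is a minimal sketch of the standard definition: the length L such that sequences of length at least L together cover at least 50% (or 90%) of the total assembly length. The function name `n_statistic` and the toy contig lengths are illustrative assumptions and are not taken from the study.

```python
def n_statistic(lengths, percent=50):
    """Return the Nx statistic (N50 by default) for a list of sequence lengths.

    Sequences are visited from longest to shortest; the statistic is the
    length of the sequence at which the running total first reaches
    `percent` percent of the summed length. Integer arithmetic is used to
    avoid floating-point rounding at the threshold.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= percent * total:
            return length


if __name__ == "__main__":
    # Toy contig lengths (arbitrary units), chosen only for illustration.
    toy_contigs = [80, 70, 50, 40, 30, 20, 10]
    print("N50:", n_statistic(toy_contigs, 50))
    print("N90:", n_statistic(toy_contigs, 90))
```

Because the statistic depends only on the sorted lengths, filling gaps that join neighboring contigs raises contig N50 and N90 even when the total assembly size barely changes, as in the table above.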
Figure 2. Alignments between Hi-C-based pseudomolecules and genetic maps. Panel A shows POPSEQ markers in the Morex × Barke and Oregon Wolfe Barley (OWB) maps (Mascher et al., 2013). Framework markers of the Morex × Barke and OWB maps are shown in red and blue, respectively. Markers integrated to the consensus POPSEQ markers are shown as gray dots. Panel B shows GBS markers mapped in Morex × Barke recombinant inbred lines (Mascher et al., 2017). Gray lines indicate scaffold boundaries.
Figure 3. Alignments of Morex V3 and Morex V2 pseudomolecules.
Figure 4. Alignments of Morex V3 and V2 pseudomolecules in the terminal 10 Mb of each chromosome arm. Gray lines indicate scaffold boundaries.
Figure 5. Full-length LTR-retrotransposon (fl-LTR) characteristics of the three Morex chromosome-level assembly versions. A, Fl-LTR insertion age distribution for all high-quality gap-free fl-LTR copies and superfamily subsets (RLC: Copia superfamily; RLG: Gypsy superfamily; RLX: unassigned). B, Overall repetitivity of fl-LTR copies in terms of 20-mer frequencies.
Numbers of full-length BARE1 retrotransposons and solo-LTRs in the three versions of the Morex pseudomolecules
| | Morex V1 | Morex V2 | Morex V3 |
|---|---|---|---|
| Candidates | 4,277 | 6,193 | 20,944 |
| Excluded | 259 | 139 | 686 |
| Full-length copies | 4,018 | 6,054 | 20,258 |
| TSD | 3,313 (82%) | 5,303 (87%) | 17,820 (88%) |
| Fraction of gaps | 0.03% | 1.72% | 0% |
| Solo LTRs | 3,473 | 3,742 | 6,216 |
All elements shorter than 8.5 kb were excluded. The maximum allowed length was 9 kb in V1 and V3, and 9.8 kb in V2.
Figure 6. Sizes and insertion age distributions of full-length BARE1 retrotransposons extracted from different Morex assembly versions (V1–V3). Panels A–C show size distributions of the extracted full-length retrotransposons. Those extracted from V2 tend to be much longer due to extended stretches of unfilled gaps represented by N characters. Panels D–F show insertion age distributions of the extracted full-length retrotransposons. Retrotransposons from V1 and V2 are on average older. In V2, very young retrotransposons are almost absent. They could not be identified with our pipeline since LTRs of young elements tend to have sequence gaps.
Figure 7. Distribution of sequence gaps and sequence differences in BARE1 elements between Morex V1 and V3. The graph is a compilation of results from sequence alignments of 3,305 V1 and V3 full-length BARE1 retrotransposons. As individual retrotransposon copies can differ in length, the length was normalized to 1,000 bins. The plot shows numbers of SNPs and numbers of N's in 10-bin windows. The LTRs correspond to approximately the first and last 20% of the retrotransposon. These regions are highly enriched in SNPs and sequence gaps because of the inability of short-read assemblies to resolve highly similar regions longer than a few hundred base pairs.
Metrics of Hi-Canu assemblies of down-sampled CCS data
| ID | 19k | 22k | Reads (Gb) | Coverage | N50 (Mb) | N90 (Mb) | Size (Gb) | Size > 1 Mb (Gb) | HC genes | Label sites |
|---|---|---|---|---|---|---|---|---|---|---|
| 3_2_trim | 3 | 2 | 132.7 | 26.5 | 28.7 | 3.6 | 4.48 | 4.17 | 97.1% | 99.0% |
| 3_2 | 3 | 2 | 132.7 | 26.5 | 31.1 | 3.9 | 4.50 | 4.17 | 97.1% | 99.0% |
| 3_1 | 3 | 1 | 109.3 | 21.9 | 25.8 | 4.4 | 4.42 | 4.16 | 97.0% | 98.7% |
| 2_2 | 2 | 2 | 99.2 | 19.8 | 21.6 | 3.8 | 4.39 | 4.15 | 96.9% | 98.5% |
| 3_0 | 3 | 0 | 89.8 | 18.0 | 19.5 | 3.7 | 4.35 | 4.15 | 96.8% | 98.5% |
| 2_1 | 2 | 1 | 79.9 | 16.0 | 13.1 | 2.9 | 4.33 | 4.13 | 96.2% | 98.3% |
| 1_2 | 1 | 2 | 69.8 | 14.0 | 8.0 | 1.9 | 4.30 | 4.05 | 95.3% | 98.2% |
| 2_0 | 2 | 0 | 63.0 | 12.6 | 5.8 | 1.4 | 4.28 | 4.00 | 94.8% | 98.1% |
| 1_1 | 1 | 1 | 52.8 | 10.6 | 2.5 | 0.6 | 4.26 | 3.50 | 92.0% | 97.5% |
| 1_0 | 1 | 0 | 33.6 | 6.7 | 0.4 | 0.1 | 4.18 | 0.48 | 80.5% | 88.2% |
Two libraries with average insert sizes of 19 and 22 kb, respectively, were prepared. The 19 k library was sequenced on three SMRT cells, the 22 k library on four. The columns report the number of SMRT cells whose reads were included in the assemblies.
Total assembly size.
Cumulative size of sequences contained in scaffolds larger than 1 Mb.
Proportion of aligned Morex V2 HC gene models, minimum alignment length: 99%, minimum identity: 100%.
Proportion of aligned DLE1 label sites of the Bionano map.
3_2_trim is the CCS_Canu assembly used for constructing the Morex V3 pseudomolecules. In 3_2, the trimming step was omitted in HiCanu.
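The Coverage column of the down-sampling table is numerically consistent with dividing the read yield in Gb by a genome size of about 5 Gb. That genome size is an inference from the table values, not a figure stated in this record, so the sketch below marks it as an assumption. The names `ASSUMED_GENOME_SIZE_GB`, `fold_coverage`, and `reads_needed_gb` are hypothetical helpers for illustration only.

```python
# ASSUMPTION: a genome size of ~5 Gb is inferred from the table
# (e.g. 132.7 Gb of reads is reported as 26.5-fold coverage);
# it is not stated explicitly in this record.
ASSUMED_GENOME_SIZE_GB = 5.0

# Read yield (Gb) and reported coverage, copied from the table above.
DOWNSAMPLED = {
    "3_2": (132.7, 26.5),
    "3_1": (109.3, 21.9),
    "2_2": (99.2, 19.8),
    "3_0": (89.8, 18.0),
    "2_1": (79.9, 16.0),
    "1_2": (69.8, 14.0),
    "2_0": (63.0, 12.6),
    "1_1": (52.8, 10.6),
    "1_0": (33.6, 6.7),
}


def fold_coverage(reads_gb, genome_gb=ASSUMED_GENOME_SIZE_GB):
    """Fold coverage as total read bases divided by genome size."""
    return reads_gb / genome_gb


def reads_needed_gb(target_coverage, genome_gb=ASSUMED_GENOME_SIZE_GB):
    """Read yield (Gb) required to reach a target fold coverage."""
    return target_coverage * genome_gb


if __name__ == "__main__":
    for run_id, (reads_gb, reported) in DOWNSAMPLED.items():
        computed = fold_coverage(reads_gb)
        print(f"{run_id}\t{reads_gb} Gb\tcomputed {computed:.1f}x\treported {reported}x")
    print("Reads for 20-fold coverage (Gb):", reads_needed_gb(20))
```

Under this assumption, the 20-fold coverage mentioned in the abstract corresponds to roughly 100 Gb of CCS reads, close to the 2_2 data set (99.2 Gb).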