Cécile Monat, Sudharsan Padmarasu, Thomas Lux, Thomas Wicker, Heidrun Gundlach, Axel Himmelbach, Jennifer Ens, Chengdao Li, Gary J Muehlbauer, Alan H Schulman, Robbie Waugh, Ilka Braumann, Curtis Pozniak, Uwe Scholz, Klaus F X Mayer, Manuel Spannagl, Nils Stein, Martin Mascher.
Abstract
Chromosome-scale genome sequence assemblies underpin pan-genomic studies. Recent genome assembly efforts in the large-genome Triticeae crops wheat and barley have relied on the commercial closed-source assembly algorithm DeNovoMagic. We present TRITEX, an open-source computational workflow that combines paired-end, mate-pair, and 10X Genomics linked-read data with chromosome conformation capture sequencing data to construct sequence scaffolds with megabase-scale contiguity ordered into chromosomal pseudomolecules. We evaluate the performance of TRITEX on publicly available sequence data of tetraploid wild emmer and hexaploid bread wheat, and construct an improved annotated reference genome sequence assembly of the barley cultivar Morex as a community resource.
Year: 2019 PMID: 31849336 PMCID: PMC6918601 DOI: 10.1186/s13059-019-1899-5
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Input datasets for TRITEX
| Name | Library type (number¹) | Insert size | Read length | Coverage² |
|---|---|---|---|---|
| PE450 | PCR-free paired-end (2) | 400–470 bp | 2 × 250 bp | 70× |
| PE800 | PCR-free paired-end (2) | 700–800 bp | 2 × 150 bp | 30× |
| MP3 | Nextera mate-pair (2) | 2–4 kb | 2 × 150 bp | 30× |
| MP6 | Nextera mate-pair (2) | 5–7 kb | 2 × 150 bp | 30× |
| MP9 | Nextera mate-pair (2) | 8–10 kb | 2 × 150 bp | 30× |
| 10X | 10X Chromium (2) | | 2 × 150 bp | 30× |
| Hi-C | TCC [ | | 2 × 100 bp | 200–400 million read pairs |
¹ Number of independent libraries to be prepared
² Haploid genome coverage for paired-end, mate-pair, and 10X libraries. As Hi-C analysis is count-based, read numbers are more relevant than sequence amount
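To make the coverage column concrete, the following minimal sketch converts a target haploid coverage into a number of read pairs. The 5 Gb genome size is an illustrative assumption rather than a value from the table, and `read_pairs_for_coverage` is a hypothetical helper, not part of TRITEX.

```python
import math


def read_pairs_for_coverage(genome_size_bp: int, coverage: int, read_length_bp: int) -> int:
    """Number of read pairs needed to reach a target haploid coverage.

    Each pair contributes 2 * read_length_bp sequenced bases.
    """
    bases_needed = genome_size_bp * coverage
    return math.ceil(bases_needed / (2 * read_length_bp))


# Assumed haploid genome size, for illustration only (not taken from the table).
GENOME_SIZE_BP = 5_000_000_000

# Read lengths and coverages as listed in the table for PE450 and PE800.
pe450_pairs = read_pairs_for_coverage(GENOME_SIZE_BP, 70, 250)
pe800_pairs = read_pairs_for_coverage(GENOME_SIZE_BP, 30, 150)
print(f"PE450: {pe450_pairs:,} pairs; PE800: {pe800_pairs:,} pairs")
```

Under the assumed genome size, the 70× PE450 library dominates the sequencing effort, which is consistent with its role as the sole input to unitig assembly.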
Overview of the TRITEX pipeline
| Step¹ | | Software | Input | Output |
|---|---|---|---|---|
| 1 | Read merging | BBMerge [ | PE450 read pairs | Merged PE450 reads |
| 2 | PE450 error correction | BFC [ | Merged PE450 reads | Corrected PE450 reads, hash table of k-mer counts |
| 3.1 | Unitig assembly | Minia3 [ | Corrected PE450 reads | Unitigs |
| 3.2 | Error correction of PE800 and MP reads | BFC [ | PE800, MP3, MP6, and MP9 reads, hash table of k-mer count (step 2) | Corrected PE800, MP3, MP6, and MP9 reads |
| 4 | Scaffolding | SOAPDenovo2 [ | Unitigs; corrected PE800, MP3, MP6, and MP9 reads | Scaffolds |
| 5 | Gap-filling | Gapcloser [ | Scaffolds, corrected PE450 reads | Scaffolds after gap-filling |
| 6.1 | Alignment of 10X reads | Minimap2 [ | Scaffolds after gap-filling, 10X reads | 10X alignment records |
| 6.2 | Alignment of Hi-C reads | As in 6.1, EMBOSS [ | Scaffolds after gap-filling, Hi-C reads | Hi-C alignment records |
| 6.3 | Alignment of genetic markers | Minimap2 [ | Scaffolds after gap-filling, marker sequences | Marker alignment records |
| 7 | Pseudomolecule construction | Custom R scripts | Scaffolds after gap-filling, 10X alignment records, Hi-C alignment records, marker alignment records | Pseudomolecules, Hi-C contact maps |
¹ Steps with identical leading digits can be run in parallel
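The footnote's parallelism rule can be sketched as a grouping of steps into stages. This is only an illustration of the numbering convention; the `group_into_stages` helper is hypothetical and does not reflect how the TRITEX scripts are actually organized.

```python
from collections import defaultdict

# Step identifiers and names as listed in the pipeline table.
STEPS = [
    ("1", "Read merging"),
    ("2", "PE450 error correction"),
    ("3.1", "Unitig assembly"),
    ("3.2", "Error correction of PE800 and MP reads"),
    ("4", "Scaffolding"),
    ("5", "Gap-filling"),
    ("6.1", "Alignment of 10X reads"),
    ("6.2", "Alignment of Hi-C reads"),
    ("6.3", "Alignment of genetic markers"),
    ("7", "Pseudomolecule construction"),
]


def group_into_stages(steps):
    """Group steps by the leading number of their identifier.

    Returns a list of (stage_number, [step names]) sorted by stage number.
    Steps sharing a stage may run in parallel; stages run in order.
    """
    stages = defaultdict(list)
    for step_id, name in steps:
        stages[int(step_id.split(".")[0])].append(name)
    return sorted(stages.items())


stages = group_into_stages(STEPS)
for number, names in stages:
    mode = "parallel" if len(names) > 1 else "single"
    print(f"stage {number} ({mode}): {', '.join(names)}")
```

Only stages 3 and 6 contain more than one step, so those are the points where a scheduler could dispatch jobs concurrently.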
Fig. 1. Estimate of assembly size and k-mer coverage as a function of k-mer size. Assembly size (a) and k-mer coverage (b) were estimated from the error-corrected PE450 reads used for Zavitan unitig assembly, based on k-mer cardinalities computed with NtCard [38] and Kmerstream [37]
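Why k-mer coverage falls as k grows can be illustrated with a standard back-of-the-envelope relation: a read of length L contains L − k + 1 k-mers, so the expected k-mer coverage is the base coverage scaled by (L − k + 1)/L. The sketch below assumes error-free reads of uniform length; it is not the cardinality-based estimation used for the figure, and the read length and coverage values are illustrative assumptions.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Expected k-mer coverage for error-free reads of uniform length.

    A read of length L yields L - k + 1 k-mers, so coverage scales by
    (L - k + 1) / L relative to base coverage.
    """
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


# Illustrative assumptions: merged reads of about 450 bp at 70x base coverage.
READ_LENGTH = 450
BASE_COVERAGE = 70

for k in (100, 200, 300, 400):
    print(k, round(kmer_coverage(BASE_COVERAGE, READ_LENGTH, k), 1))
```

Under these assumptions, long k-mers are only usable because the merged reads are long; with shorter reads the coverage at large k would approach zero.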
Assembly statistics for Zavitan and Chinese Spring
| | Zavitan | Zavitan | Chinese Spring | Chinese Spring |
|---|---|---|---|---|
| | TRITEX | Avni et al. [ | TRITEX | IWGSC [ |
| Unitig assembly size | 10.8 Gb | | 15.1 Gb | |
| Unitig N50 | 21.7 kb | | 21.4 kb | |
| Unitig N90 | 1.5 kb | | 1.7 kb | |
| Assembled sequence in contigs ≥ 1 kb | 10.0 Gb | | 14.0 Gb | |
| Assembled sequence in contigs ≥ 10 kb | 7.8 Gb | | 10.8 Gb | |
| Scaffold assembly size | 11.1 Gb | 10.5 Gb | 15.7 Gb | 14.5 Gb |
| Scaffold N50 | 1.3 Mb | 7.0 Mb | 2.3 Mb | 7.0 Mb |
| Scaffold N90 | 97 kb | 1.2 Mb | 281 kb | 1.2 Mb |
| Assembled sequence in scaffolds ≥ 1 kb | 10.4 Gb | 10.5 Gb | 14.8 Gb | 14.5 Gb |
| Assembled sequence in scaffolds ≥ 1 Mb | 6.7 Gb | 9.6 Gb | 11.9 Gb | 13.4 Gb |
| Unfilled internal gaps | 209 Mb (1.9%) | 171 Mb (1.6%) | 476 Mb (3.0%) | 262 Mb (1.8%) |
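The N50 and N90 values in this table follow the usual definition: the length L such that sequences of length at least L together contain at least 50% (or 90%) of the assembled bases. A minimal sketch of that calculation follows; the `nx` helper and the toy lengths are illustrative and not taken from the assemblies.

```python
def nx(lengths, percent: int) -> int:
    """Return the Nx length of an assembly.

    Sequences are visited from longest to shortest; the Nx value is the
    length of the sequence at which the running total first reaches
    `percent` percent of the total assembly size.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise AssertionError("unreachable for percent <= 100")


# Toy scaffold lengths (arbitrary units), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(toy_lengths, 50))
print("N90:", nx(toy_lengths, 90))
```

Because N90 is reached much further down the sorted list, it is far more sensitive to the tail of short scaffolds than N50, which helps explain the wide N50/N90 gap in the TRITEX columns.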
Comparison of different assemblies of barley cv. Morex
| | BAC-by-BAC | | TRITEX | TRITEX |
|---|---|---|---|---|
| | Morex V1 [ | Dovetail | Morex V2 | MP9 only |
| Scaffold assembly size | 4.79 Gb | | 4.65 Gb | 4.6 Gb |
| Scaffold N50 | 79 kb | | 3.4 Mb | 2.6 Mb |
| Scaffold N90 | 4.4 kb | | 287 kb | 150 kb |
| Assembled sequence in scaffolds ≥ 1 kb | 4.67 Gb | | 4.34 Gb | 4.32 Gb |
| Assembled sequence in scaffolds ≥ 1 Mb | 0 bp | | 3.80 Gb | 3.49 Gb |
| Unfilled internal gaps | 216 Mb (4.5%) | | 116 Mb (2.5%) | 106 Mb (2.3%) |
| Super-scaffold N50 | 1.9 Mb | 1.3 Mb | 40.2 Mb | 32.6 Mb |
| Super-scaffold N90 | 336 kb | 7.5 kb | 2.0 Mb | 1.2 Mb |
| Size of pseudomolecules | 4.58 Gb | | 4.26 Gb | 4.20 Gb |
| Size of unanchored sequences (chrUn) | 246 Mb | | 83 Mb² | 111 Mb² |
| Proportion of complete full-length cDNAs¹ | 81.8% | 84.1% | 89.8% | 90.4% |
¹ Proportion of 28,622 full-length cDNAs of barley cv. Haruna Nijo [42] aligned with ≥ 90% coverage and ≥ 97% alignment identity
² Sequences shorter than 1 kb were not included in chrUn
Fig. 2. Example of a chimeric scaffold. The chimeric nature of a sequence scaffold joining two unlinked sequences originating from barley chromosomes 2H and 5H is supported by multiple lines of evidence. (a) Genetic chromosome assignments of marker sequences aligned to scaffold_1005. (b) 10X molecule coverage. (c) Physical Hi-C coverage. Coverage in b and c was normalized for distance from the scaffold ends and the log2-fold observed vs. expected ratio was plotted. The red, dotted lines mark the breakpoint at 3.32 Mb
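The observed vs. expected ratio described in this caption can be sketched as follows. This is a simplified illustration, not the TRITEX R implementation: the expected coverage per bin is taken as given (standing in for the normalization by distance from the scaffold ends), and the pseudocount and threshold are arbitrary assumptions.

```python
import math


def log2_ratios(observed, expected, pseudocount=1.0):
    """Per-bin log2(observed / expected) coverage with a pseudocount.

    The pseudocount (an assumption of this sketch) avoids log(0) in
    bins without any spanning molecules or Hi-C links.
    """
    return [
        math.log2((o + pseudocount) / (e + pseudocount))
        for o, e in zip(observed, expected)
    ]


def candidate_breakpoint(ratios, threshold=-2.0):
    """Index of the lowest-ratio bin if it falls at or below the threshold.

    Returns None when no bin is depleted enough to suggest a chimeric join.
    """
    index = min(range(len(ratios)), key=ratios.__getitem__)
    return index if ratios[index] <= threshold else None


# Toy spanning-coverage counts per bin; bin 3 is strongly depleted.
observed = [31, 31, 31, 3, 31, 31]
expected = [31, 31, 31, 31, 31, 31]

ratios = log2_ratios(observed, expected)
print(ratios)
print("candidate breakpoint bin:", candidate_breakpoint(ratios))
```

A sharp, localized drop in both 10X and Hi-C ratios at the same position, as in panels b and c, is stronger evidence for a misjoin than a drop in either data type alone.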
Fig. 3. Example of errors in scaffold orientation. The top panels show the Hi-C contact matrix for barley chromosome 3H before (a) and after (b) manual correction. The bottom panels show the directionality biases in the Hi-C data as defined by Himmelbach et al. [47] before (c) and after (d) manual correction. Two inverted scaffolds are evident as deviations from the expected Rabl configuration [3] and as diagonals bounded by discontinuities in the directionality biases
Fig. 4. Collinearity between TRITEX and DeNovoMagic assemblies of wheat. Dot plots showing the longest alignments between scaffold pairs of the TRITEX and DeNovoMagic assemblies of Zavitan (a) and Chinese Spring (b), respectively. Alignments were done with Minimap2 [32]
Chinese Spring transcript alignment statistics
| Transcript dataset | No. of transcripts | Assembly | Proportion of complete transcripts¹ (%) |
|---|---|---|---|
| IWGSC v1.0 transcripts [ | 269,583 | TRITEX | 96.2 |
| | | IWGSC [ | 97.0 |
| | | Clavijo et al. [ | 87.8 |
| | | Zimin et al. [ | 88.5 |
| Full-length cDNAs [ | 6137 | TRITEX | 97.1 |
| | | IWGSC [ | 96.3 |
| | | Clavijo et al. [ | 91.6 |
| | | Zimin et al. [ | 85.4 |
¹ Proportion of transcripts with at least one alignment with ≥ 99% coverage and ≥ 99% identity (for IWGSC transcripts) or with ≥ 90% coverage and ≥ 99% identity (for full-length cDNAs). Alignments were done with GMAP [50]
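The completeness metric in the footnote counts a transcript as complete if at least one of its alignments passes both thresholds. A minimal sketch of that counting rule is shown below; the `Alignment` record and the toy data are hypothetical, and parsing of actual GMAP output is not covered.

```python
from collections import namedtuple

# Hypothetical alignment record: transcript ID, percent coverage, percent identity.
Alignment = namedtuple("Alignment", ["transcript", "coverage", "identity"])


def proportion_complete(alignments, n_transcripts, min_coverage, min_identity):
    """Percentage of transcripts with at least one alignment passing both thresholds.

    Transcripts without any alignment count toward the denominator only.
    """
    complete = {
        a.transcript
        for a in alignments
        if a.coverage >= min_coverage and a.identity >= min_identity
    }
    return 100.0 * len(complete) / n_transcripts


# Toy data: four transcripts in total, t4 has no alignment at all.
toy_alignments = [
    Alignment("t1", 99.5, 99.2),
    Alignment("t1", 80.0, 95.0),   # secondary hit, fails both thresholds
    Alignment("t2", 99.0, 98.9),   # fails the identity threshold
    Alignment("t3", 100.0, 100.0),
]

# Thresholds as stated in the footnote for IWGSC transcripts.
result = proportion_complete(toy_alignments, 4, min_coverage=99.0, min_identity=99.0)
print(f"{result:.1f}% complete")
```

Using a set of transcript IDs ensures that a transcript with several passing alignments, as is common in polyploid wheat, is counted only once.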
Fig. 5. Morex V2 assembly validated by complementary resources. Morex scaffold_10x_163 was aligned to the Morex V1 assembly (a), the Dovetail assembly of Morex (b), and the genome-wide optical map of Morex (c). Sequence alignments are shown as dot plots (a, b). (c) The alignment of four optical contigs to scaffold_10x_163. Single aligned restriction sites are connected by black lines. Red lines indicate unaligned restriction sites in either the sequence scaffold or the optical contig
Fig. 6. Collinearity of Morex V1 and V2 assemblies. (a) Dot plots showing the alignments between the chromosomal pseudomolecules of the Morex V1 and V2 assemblies. (b) Intra-chromosomal Hi-C contact matrices of the Morex V2 assembly. (c) Intra-chromosomal Hi-C contact matrices of the Morex V1 assembly