Shujun Ou, Jianing Liu, Kapeel M Chougule, Arkarachai Fungtammasan, Arun S Seetharam, Joshua C Stein, Victor Llaca, Nancy Manchanda, Amanda M Gilbert, Sharon Wei, Chen-Shan Chin, David E Hufnagel, Sarah Pedersen, Samantha J Snodgrass, Kevin Fengler, Margaret Woodhouse, Brian P Walenz, Sergey Koren, Adam M Phillippy, Brett T Hannigan, R Kelly Dawe, Candice N Hirsch, Matthew B Hufford, Doreen Ware.
Abstract
Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes. Still, an assessment of critical sequence depth and read length is important for allocating limited resources. To this end, we have generated eight assemblies for the complex genome of the maize inbred line NC358 using PacBio datasets ranging from 20 to 75× genomic depth and with N50 subread lengths of 11-21 kb. Assemblies with ≤30× depth and N50 subread length of 11 kb are highly fragmented, with even low-copy genic regions showing degradation at 20× depth. Distinct sequence-quality thresholds are observed for complete assembly of genes, transposable elements, and highly repetitive genomic features such as telomeres, heterochromatic knobs, and centromeres. In addition, we show high-quality optical maps can dramatically improve contiguity in even our most fragmented base assembly. This study provides a useful resource allocation reference to the community as long-read technologies continue to mature.
Year: 2020 PMID: 32385271 PMCID: PMC7211024 DOI: 10.1038/s41467-020-16037-7
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Summary statistics for NC358 assemblies.
| Experiment [a] | 21k_20× | 21k_30× | 21k_40× | 21k_50× | 21k_60× | 21k_75× | 11k_50× | 16k_50× |
|---|---|---|---|---|---|---|---|---|
| Subreads size (Gb) | 45.62 | 68.16 | 91.01 | 113.89 | 136.80 | 171.08 | 113.63 | 113.60 |
| Subread coverage | 20× | 30× | 40× | 50× | 60× | 75× | 50× | 50× |
| Max read length (kb) | 89.6 | 103.3 | 103.3 | 103.3 | 103.3 | 103.3 | 88.3 | 69.8 |
| Subread N25 (kb) | 30.1 | 30.1 | 30.1 | 30.1 | 30.1 | 30.1 | 14.5 | 21.6 |
| Subread N50 (kb) | 21.2 | 21.2 | 21.2 | 21.2 | 21.2 | 21.2 | 11.1 | 16.8 |
| Corrected reads (Gb) | 25.11 | 48.13 | 66.05 | 82.96 | 88.93 | 100.90 | 79.26 | 80.22 |
| Corrected coverage | 11× | 21× | 29× | 37× | 39× | 44× | 35× | 35× |
| Corrected read N50 (kb) | 18.42 | 17.13 | 17.10 | 17.25 | 18.80 | 20.05 | 10.37 | 14.48 |
| Contig number | 10,563 | 2015 | 641 | 407 | 360 | 327 | 5683 | 1036 |
| Contig total (Gb) | 1.60 | 2.11 | 2.12 | 2.12 | 2.13 | 2.13 | 2.10 | 2.12 |
| Longest contig (Mb) | 1.06 | 11.50 | 47.89 | 76.00 | 79.68 | 78.40 | 4.37 | 21.45 |
| Contig N50 (Mb) | 0.18 | 1.82 | 7.48 | 16.27 | 22.12 | 24.54 | 0.56 | 4.24 |
| Longest scaffold (Mb) | 198.5 | 198.7 | 237.1 | 237.2 | 237.1 | 237.3 | 205.4 | 237.6 |
| Superscaffold N50 (Mb) | 95.3 | 96.9 | 99.2 | 98.5 | 99.4 | 99.2 | 98.5 | 99.4 |
| Assembled (%) [b] | 70.4% | 92.8% | 93.3% | 93.3% | 93.7% | 93.7% | 92.4% | 93.2% |
| Assembly gaps (%) | 24.50% | 0.90% | 0.43% | 0.34% | 0.31% | 0.31% | 2.01% | 0.48% |
| Effective assembly size (Gb) [c] | 1.33 | 1.67 | 1.70 | 1.72 | 1.74 | 1.75 | 1.68 | 1.70 |
| Optical map conflict [d] | 594 | 125 | 56 | 31 | 22 | 21 | 386 | 107 |
| Complete BUSCOs [e] | 68.0% | 95.5% | 96.5% | 96.4% | 96.2% | 96.3% | 95.7% | 96.7% |
| LTR Assembly Index (LAI) | 12.2 | 19.8 | 20.4 | 20.2 | 20.4 | 20.6 | 19.1 | 21.0 |
| Falcon CPU hour | 1563 | 4162 | 6363 | 10,693 | 12,386 | 32,950 | 9721 | 9224 |
| Falcon RAM (Gb) | 75 | 75 | 75 | 75 | 75 | 75 | 75 | 75 |
| Canu CPU hour | 1860 | 4036 | 5959 | 7914 | 8849 | 11,520 | 6400 | 7174 |
| Canu RAM (Gb) | 61 | 112 | 149 | 177 | 201 | 120 | 183 | 174 |
a Each dataset was assembled only once with the Falcon–Canu hybrid approach (see Methods).
b Calculated based on total contig size and the estimated genome size of 2.2724 Gb.
c Sum of unique 150-mers.
d The optical map was generated using the Direct Label and Stain (DLS) approach with enzyme DLE-1.
e Pilon-polished assemblies were used to calculate Benchmarking Universal Single-Copy Orthologs (BUSCO) scores.
CPU, central processing unit.
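The coverage rows and the "Assembled (%)" row follow from dividing by the estimated genome size of 2.2724 Gb given in the table footnotes. The following is a minimal Python sketch of that arithmetic, using values copied from the table; the function and variable names are illustrative and do not come from the paper. Because the table entries are rounded, recomputed values can differ from the published ones in the last digit.

```python
# Estimated NC358 genome size in Gb, taken from the table footnotes.
GENOME_SIZE_GB = 2.2724


def coverage(bases_gb, genome_gb=GENOME_SIZE_GB):
    """Sequencing depth (fold coverage) = sequenced bases / genome size."""
    return bases_gb / genome_gb


def assembled_percent(contig_total_gb, genome_gb=GENOME_SIZE_GB):
    """Share of the estimated genome represented by the total contig size."""
    return 100.0 * contig_total_gb / genome_gb


# A few columns copied from the summary table (subread size and contig total, in Gb).
subreads_gb = {"21k_20x": 45.62, "21k_50x": 113.89, "21k_75x": 171.08}
contig_total_gb = {"21k_20x": 1.60, "21k_50x": 2.12, "21k_75x": 2.13}

for name in subreads_gb:
    depth = coverage(subreads_gb[name])
    pct = assembled_percent(contig_total_gb[name])
    print(f"{name}: {depth:.1f}x subread depth, {pct:.1f}% assembled")
```

The same `coverage` helper applied to the "Corrected reads (Gb)" row gives the "Corrected coverage" row, for example 25.11 Gb corresponds to about 11×.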
Fig. 1 Assembly of NC358 using various read lengths and coverage.
a Hybrid scaffolding using the Bionano optical map. A 199 Mb scaffold from chromosome 5 is shown. Gray areas on the chromosome cartoon represent the 199 Mb scaffold; the white area is the remaining 23 Mb scaffold in chromosome 5; the red dot is the centromere. Green tracts represent scaffolded sequences and blue tracts show the contigs that comprise this scaffold with contigs jittered across three levels. b Contig NG(x). c Scaffold NG(x). d Benchmarking Universal Single-Copy Orthologs (BUSCO) of Pilon-polished assemblies. e The number of assembly misjoins revealed by DLE-1 conflicts and the number of contigs of each assembly. f Regional LTR Assembly Index (LAI) values estimated based on 3 Mb windows with 300 kb steps. The box shows the median, upper, and lower quartiles. Whiskers indicate values ≤ 1.5× interquartile range. Outliers are plotted as dots. g Unique mapping rate of RNA-seq libraries. A total of ten tissues with two biological replicates were sequenced. Each dot represents an RNA-seq library. The box shows the median, upper, and lower quartiles. Whiskers indicate values ≤ 1.5× interquartile range. h Central processing unit (CPU) core hours required for Falcon correction and Canu assembly. i Bionano optical map inconsistency. Deletions and insertions are cases where sequences are shorter or longer than the size estimated by the optical map, respectively. Structural variations (SVs) are plotted as dots. The box shows the median, upper and lower quartiles. Whiskers indicate values ≤ 1.5× interquartile range. Source data underlying Fig. 1b, c, f, g are provided as a Source Data file.
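Panels b and c report contig and scaffold NG(x), and the summary table reports N50 values. As a reminder of how these statistics are defined, here is a minimal Python sketch using the standard definitions; it is not the authors' code, the function names `ngx` and `nx` are hypothetical, and the contig lengths are a toy example rather than data from the study.

```python
def ngx(lengths, genome_size, x=50):
    """Return NG(x): the largest length L such that sequences of length >= L
    together cover at least x% of genome_size.

    Returns 0 when the assembly never reaches that fraction of the genome.
    """
    target = genome_size * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def nx(lengths, x=50):
    """Return N(x): the same statistic, measured against the assembly's own total size."""
    return ngx(lengths, sum(lengths), x)


# Toy contig lengths (arbitrary units), not taken from the paper.
toy_contigs = [50, 40, 30, 20, 10]

print("N50 :", nx(toy_contigs))                      # relative to the assembly total (150)
print("NG50:", ngx(toy_contigs, genome_size=200))    # relative to an assumed genome size of 200
```

Measuring against a fixed genome size rather than each assembly's own total keeps the values comparable across assemblies of different total size, such as the 1.60 Gb and 2.13 Gb assemblies in the summary table.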
Fig. 2 Assembly of repetitive components in the NC358 genome.
a The assembled size of the 180-bp knob repeat, the knob TR-1 element, the chromosome 6 nucleolus organizer region (NOR) region, CentC arrays, and subtelomere arrays in each of the NC358 assemblies. b Length distribution of long terminal repeat (LTR) retrotransposons longer than 26 kb. Each dot represents an annotated sequence. The box shows the median, upper and lower quartiles. Whiskers indicate values ≤ 1.5× interquartile range. d Telomere 7-mer counts in telomere regions of NC358 assemblies. Assembly of (c) LTR retrotransposons, (e) CentC arrays, (f) the chromosome 6 NOR region, (g) the Rp1-D and zein tandem gene arrays, and (h) two example knobs in each of the NC358 assemblies. The NC358 Bionano optical map was used to estimate the size of these components. Ngap, estimated gap size. Source data underlying Fig. 2a, b, i are provided as a Source Data file.
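Panel d reports telomere 7-mer counts. The caption does not describe how the counts were obtained, so the following Python sketch only illustrates one simple way to count a telomere repeat unit on both strands. It assumes the plant-type repeat unit `TTTAGGG`, which is not stated in this record, and the function names are hypothetical.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_telomere_kmers(seq, unit="TTTAGGG"):
    """Count occurrences of a telomere repeat unit and of its reverse complement.

    `unit` defaults to the plant-type 7-mer; this default is an assumption,
    not a value taken from the paper.
    """
    seq = seq.upper()
    return seq.count(unit) + seq.count(revcomp(unit))


# Toy sequence: three forward-strand units followed by unrelated bases.
toy_region = "TTTAGGG" * 3 + "ACGT"
print(count_telomere_kmers(toy_region))
```

Counting both orientations matters because telomere repeats at opposite chromosome ends appear on opposite strands of the assembled sequence.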