Weiwen Wang1, Ashutosh Das1,2, David Kainer1, Miriam Schalamun1,3, Alejandro Morales-Suarez4, Benjamin Schwessinger1, Robert Lanfear1.
Abstract
BACKGROUND: Eucalyptus pauciflora (the snow gum) is a long-lived tree of high economic and ecological importance. Little genomic information is currently available for E. pauciflora. Here, we assemble the genome of E. pauciflora with several different methods, and combine multiple existing and novel approaches to help select the best genome assembly.
Keywords: Eucalyptus pauciflora; assembly comparison; genome assessment; genome polishing; haplotig separation; hybrid assembly; long-read assembly; nanopore sequencing
Year: 2020 PMID: 31895413 PMCID: PMC6939829 DOI: 10.1093/gigascience/giz160
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: The E. pauciflora individual sequenced in this study. The tree is located in Thredbo, Kosciuszko National Park, New South Wales, Australia (36 29.6597 N, 148 16.9788 E).
Raw assembly statistics (before polishing and haplotig removal)
| Assembly | Long-read | Short-read | Assembler | Assembly time (CPU hours) | Length (bp) | Contigs | Largest contig (bp) | N50 (bp) | L50 | GC (%) | Ns (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Canu_1kb | ≥1 kb (∼174×) | X | Canu | ∼300,000 | 871,577,052 | 2,867 | 7,123,373 | 629,835 | 259 | 39.18 | 0 |
| Canu_35kb | ≥35 kb (∼40×) | X | Canu | ∼50,000 | 825,916,527 | 2,550 | 10,153,603 | 962,598 | 158 | 39.18 | 0 |
| SMARTdenovo_1kb | ≥1 kb (∼174×) | X | SMARTdenovo | ∼8,000 | 610,858,639 | 729 | 6,287,341 | 1,711,661 | 107 | 39.29 | 0 |
| SMARTdenovo_35kb | ≥35 kb (∼40×) | X | SMARTdenovo | ∼4,000 | 586,903,502 | 704 | 9,494,401 | 1,868,532 | 91 | 39.27 | 0 |
| Flye_1kb | ≥1 kb (∼174×) | X | Flye | ∼700 | 596,007,484 | 5,930 | 2,755,662 | 255,434 | 652 | 39.12 | 0 |
| Flye_35kb | ≥35 kb (∼40×) | X | Flye | ∼500 | 561,349,738 | 4,145 | 2,407,003 | 352,050 | 448 | 39.17 | 0 |
| Marvel_35kb | ≥35 kb (∼40×) | X | Marvel | ∼28,000 | 649,061,435 | 1,181 | 6,453,759 | 795,971 | 182 | 39.07 | 0 |
| MaSuRCA_1kb | ≥1 kb (∼174×) | ∼228× | MaSuRCA | ∼23,000 | 778,288,575 | 1,311 | 12,224,271 | 1,885,174 | 95 | 39.35 | 0.04 |
| MaSuRCA_35kb | ≥35 kb (∼40×) | ∼228× | MaSuRCA | ∼21,000 | 773,035,614 | 1,703 | 8,684,546 | 1,304,720 | 146 | 39.39 | 0.09 |
All long reads were corrected by Canu before assembly. The Canu correction step took ∼200,000 CPU hours, which has not been included in the assembly runtime.
With ∼1 TB of RAM.
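The N50 and L50 columns above follow the usual contiguity definitions: with contigs sorted from longest to shortest, N50 is the length of the contig at which the running total first reaches half of the assembly length, and L50 is the number of contigs needed to get there. The following is a minimal sketch of that calculation, not the tool used in the study; the function name `n50_l50` and the toy contig lengths are hypothetical.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    # Sort from longest to shortest so the largest contigs are counted first.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running sum to half the
        # assembly length defines N50; its rank is L50.
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (total 300 bp), not data from the paper.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_contigs))  # (70, 2)
```

Read this way, a higher N50 together with a lower L50 (for example SMARTdenovo_35kb versus Flye_35kb in the table) indicates a more contiguous assembly.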
Assembly size and assembly ploidy during polishing and haplotig removal
| Assembly | Stage 1 (bp) | Assembly ploidy | Stage 2 (bp) | Assembly ploidy | Stage 3 (bp) | Assembly ploidy | Stage 4 (bp) | Assembly ploidy | Stage 5 (bp) | Assembly ploidy |
|---|---|---|---|---|---|---|---|---|---|---|
| Canu_1kb | 871,577,052 | 1.74 | 893,781,515 | 1.79 | 645,703,255 | 1.29 | 622,473,836 | 1.24 | 622,218,742 | 1.24 |
| Canu_35kb | 825,916,527 | 1.65 | 847,395,928 | 1.69 | 605,520,689 | 1.21 | 586,032,599 | 1.17 | 585,785,283 | 1.17 |
| SMARTdenovo_1kb | 599,580,691 | 1.20 | 610,858,639 | 1.22 | 514,822,476 | 1.03 | 514,822,476 | 1.03 | 514,714,831 | 1.03 |
| SMARTdenovo_35kb | 575,805,356 | 1.15 | 586,903,502 | 1.17 | 504,644,753 | 1.01 | 504,644,753 | 1.01 | 504,515,539 | 1.01 |
| Flye_1kb | 596,007,484 | 1.19 | 593,219,654 | 1.19 | 529,107,244 | 1.06 | 528,619,533 | 1.06 | 528,563,896 | 1.06 |
| Flye_35kb | 561,349,738 | 1.12 | 561,597,192 | 1.12 | 517,329,093 | 1.03 | 517,061,277 | 1.03 | 516,992,152 | 1.03 |
| Marvel_35kb | 649,061,435 | 1.30 | 666,317,308 | 1.33 | 547,630,224 | 1.10 | 537,813,575 | 1.08 | 537,615,613 | 1.08 |
| MaSuRCA_1kb | 778,288,575 | 1.56 | 778,307,850 | 1.56 | 608,764,671 | 1.22 | 594,680,200 | 1.19 | 594,528,099 | 1.19 |
| MaSuRCA_35kb | 773,035,614 | 1.55 | 773,071,231 | 1.55 | 608,629,204 | 1.22 | 595,020,257 | 1.19 | 594,871,467 | 1.19 |
Stage 1: raw assembly size (bp) before polishing. Stage 2: assembly size (bp) after polishing. Stage 3: assembly size (bp) after Purge Haplotigs. Stage 4: assembly size (bp) after Purge Haplotigs and GCICA. Stage 5: assembly size (bp) after Purge Haplotigs, GCICA, and extra polishing.
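The ploidy values in this table are numerically consistent with dividing each assembly size by a haploid genome size of about 500 Mb. That figure is an assumption inferred from the table itself and is not stated in this section, so the sketch below should be read as a plausibility check rather than the authors' exact procedure; `ASSUMED_HAPLOID_GENOME_BP` and `assembly_ploidy` are hypothetical names.

```python
# Assumption: haploid genome size of ~500 Mb, inferred from the table values.
ASSUMED_HAPLOID_GENOME_BP = 500_000_000


def assembly_ploidy(assembly_bp, haploid_bp=ASSUMED_HAPLOID_GENOME_BP):
    """Assembly size relative to the (assumed) haploid genome size."""
    return round(assembly_bp / haploid_bp, 2)


# Canu_1kb assembly sizes at Stages 1 to 5, copied from the table above.
canu_1kb_stages = [871_577_052, 893_781_515, 645_703_255, 622_473_836, 622_218_742]
print([assembly_ploidy(size) for size in canu_1kb_stages])
# [1.74, 1.79, 1.29, 1.24, 1.24]
```

Under this reading, a value near 1 suggests that most haplotigs have been removed, while a value near 2 suggests that both haplotypes are largely still present in the assembly.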
Figure 2: A. Total length of primary contigs and haplotigs in each assembly. B. Number of complete BUSCO genes (1,440 in total) in the primary contigs of each assembly. C. Number of duplicated BUSCO genes in the primary contigs of each assembly.
Comparison of the final assemblies
| Assembly | Length (bp) | Contig No. | Contig N50 (bp) | BUSCO complete genes (of 1,440) | BUSCO complete (%) | BUSCO duplicated genes | BUSCO duplicated (%) | BUSCO fragmented genes | BUSCO fragmented (%) | LAI score | Assembly ploidy | Short-read mapping rate | Short-read error rate | Long-read mapping rate | Long-read error rate | CGAL score | Structural variants |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Canu_1kb | 622,218,742 | 895 | 1,502,325 | 1,346 | 93.47% | 183 | 12.71% | 23 | 1.60% | 7.04 | 1.24 | 96.02% | 0.0061 | 91.73% | 0.1661 | −1.959E+06 | 4,243 |
| Canu_35kb | 585,785,283 | 655 | 2,258,674 | 1,345 | 93.40% | 138 | 9.58% | 29 | 2.01% | 5.34 | 1.17 | 95.52% | 0.0066 | 92.64% | 0.1677 | −2.226E+06 | 5,043 |
| SMARTdenovo_1kb | 514,714,831 | 364 | 2,092,790 | 1,342 | 93.19% | 100 | 6.94% | 27 | 1.88% | 7.02 | 1.03 | | 0.0080 | 92.38% | 0.1678 | −4.275E+06 | 5,940 |
| SMARTdenovo_35kb | 504,515,539 | 370 | 2,178,079 | 1,341 | 93.13% | 100 | 6.94% | 30 | 2.08% | 6.73 | 1.01 | 98.35% | 0.0082 | 92.20% | 0.1679 | −5.869E+06 | 6,024 |
| Flye_1kb | 528,563,896 | 2,947 | 295,613 | 1,344 | 93.33% | 100 | 6.94% | 31 | 2.15% | 5.70 | 1.06 | 94.86% | 0.0077 | | 0.1694 | −2.536E+06 | 7,137 |
| Flye_35kb | 516,992,152 | 2,548 | 385,290 | 1,336 | 92.78% | | | 31 | 2.15% | 6.50 | 1.03 | 94.24% | 0.0080 | 92.34% | 0.1699 | −2.726E+06 | 7,458 |
| Marvel_35kb | 537,615,613 | 730 | 1,202,845 | 1,180 | 81.94% | 153 | 10.63% | 32 | 2.22% | 3.77 | 1.08 | 87.37% | 0.0075 | 85.18% | 0.1689 | −4.451E+06 | 5,162 |
| MaSuRCA_1kb | 594,528,099 | 415 | 3,234,447 | | | 201 | 13.96% | | | 9.27 | 1.19 | 94.91% | | 91.57% | 0.1656 | | 4,020 |
| MaSuRCA_35kb | 594,871,467 | 416 | | | | 200 | 13.89% | | | | 1.19 | 94.92% | | 91.49% | | −1.790E+06 | |
Note: The best value of each assessment is highlighted in boldface.
Figure 3: Structural variation analysis of the primary contigs of each assembly. Each variant was supported by ≥10 long reads. A. Total number of structural variant events in each assembly. B. Insertion events in each assembly. C. Translocation events in each assembly. D. Deletion events in each assembly.
Figure 4: Sequence coverage of whole-genome alignments among the different assemblies. Sequence coverage was calculated as the length of aligned reference sequence divided by the total length of the reference genome.
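The coverage ratio in the Figure 4 caption can be sketched as follows. The caption does not say how overlapping alignments are handled; this sketch assumes that each reference base is counted once, so overlapping alignment intervals are merged before summing. The function names and the half-open `(start, end)` interval representation are hypothetical.

```python
def merged_length(intervals):
    """Bases covered by half-open (start, end) intervals, overlaps counted once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


def sequence_coverage(aligned_intervals, reference_length):
    """Aligned reference length divided by total reference length."""
    return merged_length(aligned_intervals) / reference_length


# Hypothetical alignments on a 1,000 bp reference.
print(sequence_coverage([(0, 100), (50, 150), (300, 400)], 1000))  # 0.25
```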
Figure 5: A. Histogram of the location and coverage of the E. pauciflora genome aligned to the 11 chromosomes of E. grandis. The y-axis ranges from 0 to 2× coverage, and each bar represents 1 Mb. Coverage was calculated as the total aligned length of E. grandis within each bar divided by the length of the bar. If a site in E. grandis is aligned by E. pauciflora twice or more, it is counted twice or more. B. Repeat landscape comparison between E. pauciflora and E. grandis. Only repeats found in both genomes are shown. Older repeat insertions have had more time to accumulate mutations than newer insertions, so they show a higher level of divergence (towards the right side of the graph).
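The per-bar coverage described for Figure 5A can be sketched as below. This is an illustration under stated assumptions rather than the authors' script: alignments are given as half-open `(start, end)` positions on a single E. grandis chromosome, overlapping alignments are deliberately counted more than once (as the caption describes), and a shorter final bar is divided by its own length. The function name `binned_coverage` and the example coordinates are hypothetical.

```python
def binned_coverage(alignments, chrom_length, bin_size=1_000_000):
    """Per-bin coverage: summed aligned length in each bin / bin length."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    aligned = [0] * n_bins
    for start, end in alignments:
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        for b in range(first_bin, last_bin + 1):
            bin_start = b * bin_size
            bin_end = min(bin_start + bin_size, chrom_length)
            # Add only the part of this alignment that falls inside bin b.
            # Overlapping alignments are summed, so a site hit twice counts twice.
            aligned[b] += min(end, bin_end) - max(start, bin_start)
    coverage = []
    for b in range(n_bins):
        bin_length = min((b + 1) * bin_size, chrom_length) - b * bin_size
        coverage.append(aligned[b] / bin_length)
    return coverage


# Hypothetical 2.5 Mb chromosome with three alignments, two of which overlap.
example = [(0, 500_000), (250_000, 1_250_000), (2_000_000, 2_500_000)]
print(binned_coverage(example, 2_500_000))  # [1.25, 0.25, 1.0]
```

In this toy example the first bar exceeds 1× because the region from 250 kb to 500 kb is aligned twice, which is the behaviour the caption describes for sites aligned more than once.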