Yu Chen, Yixin Zhang, Amy Y Wang, Min Gao, Zechen Chong.
Abstract
Long-read de novo genome assembly continues to advance rapidly. However, there is a lack of effective tools to accurately evaluate assembly results, especially for structural errors. We present Inspector, a reference-free long-read de novo assembly evaluator that faithfully reports the types of errors and their precise locations. Notably, Inspector can correct assembly errors based on consensus sequences derived from the raw reads covering the erroneous regions. Using in silico assemblies and long-read assemblies generated from multiple long-read datasets and assemblers, we demonstrate that, in addition to providing generic metrics, Inspector can accurately identify both large-scale and small-scale assembly errors.
Keywords: Assembly error; Assembly evaluation; De novo assembly; Genome assembly; Long reads
Year: 2021 PMID: 34775997 PMCID: PMC8590762 DOI: 10.1186/s13059-021-02527-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Inspector workflow for evaluating de novo assembly results. By mapping the long reads to the contigs, Inspector calculates basic statistical assembly evaluation metrics and also reports structural errors and small-scale errors with their precise locations. The identified errors can also be corrected by Inspector to generate more accurate contigs
Assembly error identification accuracy in simulated assembly
| Method | Haploid recall (%) | Haploid precision (%) | Haploid F1 score (%) | Diploid recall (%) | Diploid precision (%) | Diploid F1 score (%) |
|---|---|---|---|---|---|---|
| Inspector structural – CLR | 96.76 | 100.0 | 98.35 | 95.98 | 98.48 | 97.21 |
| Inspector structural – HiFi | 97.64 | 100.0 | 98.80 | 97.61 | 98.87 | 98.23 |
| Inspector small-scale – CLR | 86.84 | 99.53 | 92.75 | 86.60 | 96.99 | 91.50 |
| Inspector small-scale – HiFi | 98.99 | 99.65 | 99.32 | 98.91 | 99.62 | 99.26 |
| Merqury | 71.01 | 91.66 | 80.03 | 70.92 | 91.63 | 79.95 |
| QUAST-LG | 5.73 | 5.96 | 5.84 | 7.08 | 8.48 | 7.72 |
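The F1 scores in this table can be checked against the recall and precision columns. The snippet below is a minimal sketch that assumes the standard definition of F1 as the harmonic mean of recall and precision; the function name `f1_score` is illustrative and not part of Inspector. Because the inputs are already rounded to two decimals, a recomputed value may differ from the reported one in the last digit.

```python
def f1_score(recall_pct: float, precision_pct: float) -> float:
    """Harmonic mean of recall and precision, all expressed in percent."""
    if recall_pct + precision_pct == 0:
        return 0.0
    return 2 * recall_pct * precision_pct / (recall_pct + precision_pct)


# (recall, precision, reported F1) copied from the haploid columns of the table
haploid_rows = {
    "Inspector structural - CLR": (96.76, 100.0, 98.35),
    "Inspector small-scale - CLR": (86.84, 99.53, 92.75),
    "QUAST-LG": (5.73, 5.96, 5.84),
}

for name, (recall, precision, reported) in haploid_rows.items():
    recomputed = f1_score(recall, precision)
    print(f"{name}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```

For these three rows the recomputed values agree with the reported ones to two decimals.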
Evaluation summary of HG002 assemblies
| Dataset | Assembly | Contig continuity: # Contig | Contig continuity: Total | Contig continuity: Max | Contig continuity: N50 | Assembly error: Structural | Assembly error: Small-scale | Assembly error: QV | QUAST-LG: Misassembly | QUAST-LG: MM | Merqury: QV | Reference-based mode: NA50 | Reference-based mode: MR (%) | Reference-based mode: Coverage (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLR | Canu | 4751 | 2.91 | 72.0 | 7.2 | 103 | 39.82 | 43.63 | 8341 | 18.84 | 38.51 | 1.32 | 99.15 | 89.41 |
| CLR | Flye | 2168 | 2.82 | 66.6 | 12.0 | 192 | 30.88 | 43.38 | 4005 | 16.46 | 38.71 | 1.47 | 99.36 | 88.67 |
| CLR | wtdbg2 | 2947 | 2.77 | 48.5 | 7.0 | 158 | 430.00 | 33.46 | 8943 | 29.13 | 29.42 | 0.43 | 97.77 | 86.17 |
| HiFi | Canu | 1376 | 3.37 | 192.2 | 65.3 | 5 | 1.90 | 54.85 | 47672 | 29.17 | 46.57 | 2.20 | 95.95 | 91.71 |
| HiFi | Flye | 2379 | 2.96 | 136.6 | 35.1 | 256 | 20.74 | 43.69 | 14478 | 17.34 | 48.08 | 2.28 | 97.82 | 90.36 |
| HiFi | wtdbg2 | 1652 | 2.76 | 74.8 | 16.3 | 251 | 83.06 | 39.42 | 4124 | 14.65 | 42.66 | 1.56 | 99.38 | 86.77 |
| HiFi | hifiasm | 559 | 3.07 | 199.4 | 111.1 | 18 | 3.62 | 53.62 | 31143 | 21.47 | 45.88 | 2.53 | 97.37 | 92.03 |
| Nanopore | Canu | 745 | 2.90 | 101.3 | 33.1 | 1432 | 3845.99 | 24.05 | 14926 | 100.03 | 22.94 | 0.27 | 98.27 | 88.46 |
| Nanopore | Flye | 584 | 2.87 | 109.9 | 51.7 | 481 | 316.46 | 34.30 | 7688 | 33.94 | 30.46 | 1.48 | 99.32 | 89.80 |
| Nanopore | wtdbg2 | 7959 | 2.97 | 54.2 | 8.2 | 2226 | 2116.76 | 24.91 | 23159 | 65.88 | 24.49 | 0.30 | 93.79 | 84.91 |
| Nanopore | Shasta | 1258 | 2.80 | 129.3 | 23.3 | 2527 | 2554.72 | 25.74 | 9063 | 70.15 | 24.76 | 0.31 | 99.16 | 87.71 |
The unit of Max, N50, and NA50 is Mbp. The unit of Total is Gbp. The unit of small-scale and MM is per Mbp. Misassembly of QUAST-LG includes both extensive and local misassembly. Mismatch of QUAST-LG includes both mismatches and indels
Total: total number of bases; Max: length of the longest contig; MM: number of mismatches; MR: mapping ratio of assembled contigs
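For readers unfamiliar with the continuity columns, the sketch below shows how N50 is conventionally computed from a list of contig lengths: it is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. This is the standard definition rather than code taken from Inspector, and the contig lengths are made-up toy values, not data from the table.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of all assembled bases.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


# Hypothetical toy assembly (lengths in arbitrary units), for illustration only
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 80 + 70 = 150 is at least half of 290, so N50 is 70
```

NA50 follows the same idea but is computed on aligned blocks after contigs are split at misassembly breakpoints against a reference, which is why it can be much smaller than N50.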
False discovery rate of assembly errors in HG002 assemblies
| Dataset | Assembly | Inspector small-scale | Inspector structural | Merqury | QUAST-LG |
|---|---|---|---|---|---|
| CLR | Canu | 3.57 | –ᵃ | 14.36 | 35.23 |
| CLR | Flye | 5.77 | 0.00 | 21.93 | 51.65 |
| CLR | wtdbg2 | 0.94 | 0.00 | 15.33 | 38.37 |
| HiFi | Canu | 6.21 | –ᵃ | 3.61 | 38.96 |
| HiFi | Flye | 0.41 | 0.00 | 56.13 | 52.64 |
| HiFi | wtdbg2 | 0.90 | 2.38 | 72.64 | 64.23 |
| HiFi | hifiasm | 8.85 | 0.00 | 9.99 | 51.63 |
| Nanopore | Canu | 1.01 | 0.00 | 3.89 | 23.16 |
| Nanopore | Flye | 1.28 | 7.69 | 5.22 | 52.39 |
| Nanopore | wtdbg2 | 0.72 | 0.00 | 6.37 | 12.32 |
| Nanopore | Shasta | 1.96 | 0.32 | 5.15 | 46.68 |
| | Average | 2.88 | 1.15 | 19.51 | 42.48 |

ᵃ Assemblies with no structural error located in the benchmark regions of HG002 are marked with “–”
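The final row of this table is consistent with a per-column mean over the eleven assemblies, with the two "–" entries left out of the structural column. The snippet below is a small sketch of that check; the dictionary layout and the helper name `column_mean` are assumptions made for illustration. Three columns reproduce the reported values exactly, while the small-scale column comes out at 2.87 rather than 2.88, presumably because the published figure was averaged before rounding.

```python
from statistics import mean

# FDR values copied from the table in row order; None marks the "–" entries
fdr = {
    "Inspector small-scale": [3.57, 5.77, 0.94, 6.21, 0.41, 0.90, 8.85,
                              1.01, 1.28, 0.72, 1.96],
    "Inspector structural": [None, 0.00, 0.00, None, 0.00, 2.38, 0.00,
                             0.00, 7.69, 0.00, 0.32],
    "Merqury": [14.36, 21.93, 15.33, 3.61, 56.13, 72.64, 9.99,
                3.89, 5.22, 6.37, 5.15],
    "QUAST-LG": [35.23, 51.65, 38.37, 38.96, 52.64, 64.23, 51.63,
                 23.16, 52.39, 12.32, 46.68],
}


def column_mean(values):
    """Mean of a column, ignoring assemblies with no value (None)."""
    present = [v for v in values if v is not None]
    return mean(present)


for method, values in fdr.items():
    print(f"{method}: {column_mean(values):.2f}")
```

Skipping the missing entries rather than counting them as zero matters here: treating them as 0.00 would lower the structural average to about 0.94.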
Fig. 2 Characterization of structural assembly errors in HG002 assemblies. a Pie charts showing the proportion of four types of structural errors identified in Canu, Flye, wtdbg2, hifiasm, and Shasta assemblies with CLR, HiFi, and Nanopore datasets, respectively. The number of assembly errors is also marked on each sector. b Size distribution of identified structural assembly errors in all HG002 assemblies
Fig. 3 Enrichment of assembly errors in repetitive regions. a Proportion of assembly errors located in repetitive regions in each assembly. The dashed line indicates the fraction of the human reference genome annotated as repeats. P values were calculated by one-sample t-test to compare the proportion of assembly errors with this baseline. b Repeat annotation of structural and small-scale errors for five assemblers
Fig. 4 Improved assembly accuracy after error correction. a Methods of assembly error correction for small-scale and structural errors. b, c Number of corrected structural (b) and small-scale (c) errors in HG002 assemblies. Negative values indicate more assembly errors after the polishing process