Dengfeng Guan, Shane A McCarthy, Jonathan Wood, Kerstin Howe, Yadong Wang, Richard Durbin.
Abstract
MOTIVATION: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.
Year: 2020 PMID: 31971576 PMCID: PMC7203741 DOI: 10.1093/bioinformatics/btaa025
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. K-mer comparison plots for draft and purge_dups Mm assemblies (k = 21). The horizontal axis represents the copy number of k-mers in short reads from the same sample, the vertical axis shows the number of distinct k-mers, and the colored lines denote k-mers that occur the given number of times in the assembly. (a) The purple line shows 209.1 million two-copy k-mers accumulating in the haploid and diploid areas, which correspond to duplicated haplotigs or overlaps in the primary assembly. (b) Only 7.6 million two-copy k-mers remain after purging with purge_dups.
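The quantity plotted in Fig. 1 can be illustrated with a toy sketch: for every distinct k-mer in the reads, record how often it occurs in the reads and how many copies of it the assembly contains, then count distinct k-mers per (assembly copies, read multiplicity) pair. This is only a minimal illustration under simplifying assumptions, not the tool used to produce the figure: the function names `count_kmers` and `spectra_cn` are hypothetical, the sequences are made up, k is tiny, and reverse complements are not merged into canonical k-mers.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer occurrence across a list of sequences (forward strand only)."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def spectra_cn(reads, assembly, k):
    """Histogram of distinct read k-mers keyed by (assembly copies, read multiplicity)."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    hist = Counter()
    for kmer, multiplicity in read_counts.items():
        hist[(asm_counts.get(kmer, 0), multiplicity)] += 1
    return hist


# Toy data (hypothetical): the second contig duplicates part of the first,
# mimicking a retained haplotig.
reads = ["ACGTAC", "ACGTAC", "GTACGG"]
assembly = ["ACGTACGG", "ACGTAC"]
hist = spectra_cn(reads, assembly, k=4)

for (asm_copies, multiplicity), n in sorted(hist.items()):
    print(f"assembly copies={asm_copies} read multiplicity={multiplicity} distinct k-mers={n}")
```

In a plot like Fig. 1, each assembly copy number would be drawn as one colored line over the read-multiplicity axis; a large two-copy line at haploid or diploid read depth is the signature of duplicated sequence.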
BUSCO scores and assembly metrics
| Assembly | BUSCO C (%) | BUSCO C(S) (%) | BUSCO C(D) (%) | BUSCO F (%) | BUSCO M (%) | Assembly size (Mb) | Num. contigs |
|---|---|---|---|---|---|---|---|
| At-orig | | 91.9 | 6.2 | | | 140 | 172 |
| At-PH | 97.7 | 96.0 | 1.7 | 0.6 | 1.7 | 123 | 109 |
| At-PD | 97.8 | | | 0.6 | | | |
| At-HM | 96.8 | 95.6 | 1.2 | 0.6 | 2.6 | 122 | 117 |
| At-HMm | 96.8 | 95.7 | | 0.6 | 2.6 | | 102 |
| | | | | | | | |
| Ac-orig | 98.7 | 94.7 | 4.0 | 0.6 | 0.7 | 266 | 372 |
| Ac-PH | 98.8 | 96.9 | 1.9 | | 0.7 | 253 | 224 |
| Ac-PD | | | 0.3 | 0.6 | | 246 | |
| Ac-HM | 98.5 | 98.2 | 0.3 | 0.6 | 0.9 | | 223 |
| Ac-HMm | 98.6 | 98.4 | | 0.6 | 0.8 | 246 | 212 |
| | | | | | | | |
| Vv-orig | | 79.8 | 12.4 | | 6.3 | 591 | 718 |
| Vv-PH | 92.1 | 88.1 | 4.0 | 1.6 | 6.3 | 457 | |
| Vv-PD | 91.9 | | 2.0 | 1.9 | | | 324 |
| Vv-HM | NA | NA | NA | NA | NA | NA | NA |
| Vv-HMm | 91.8 | | | 1.8 | 6.4 | 458 | 383 |
| | | | | | | | |
| Mm-orig | | 79.0 | 16.8 | | | 1250 | 1290 |
| Mm-PH | 94.5 | 89.1 | 5.4 | 2.4 | 3.1 | 888 | 517 |
| Mm-PD | 94.4 | 90.9 | 3.5 | 2.7 | 2.9 | | 563 |
| Mm-HM | 94.6 | 91.3 | 3.3 | 2.5 | 2.9 | 850 | 600 |
| Mm-HMm | 94.7 | | | 2.6 | 2.7 | 845 | |
| | | | | | | | |
| Mm-origS | | 70.7 | 24.6 | | | 1252 | 764 |
| Mm-PHS | 94.7 | 87.5 | 7.2 | 2.5 | 2.8 | 891 | |
| Mm-PDS | 94.8 | 91.2 | 3.6 | 2.7 | | | 222 |
| Mm-HMS | 94.9 | 91.3 | 3.6 | 2.5 | 2.6 | 852 | 343 |
| Mm-HMmS | 94.8 | | | 2.5 | 2.7 | 848 | 365 |
C, complete genes; C(S), complete single-copy genes; C(D), complete duplicated genes; F, fragmented genes; M, missing genes; orig, Falcon-unzip; PH, purge_haplotigs; PD, purge_dups; HM, HaploMerger2; HMm, HaploMerger2 with masking; PHS, PDS, HMS and HMmS, purge_haplotigs, purge_dups, HaploMerger2 without repeat masking and HaploMerger2 with repeat masking, respectively, after scaffolding and polishing. Values in bold indicate the best score of each type in each section. The HaploMerger2 run without masking on Vv did not complete.
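The BUSCO categories defined above are related by simple arithmetic: complete genes are the sum of single-copy and duplicated genes, and complete, fragmented and missing genes together account for all genes searched. The sketch below checks both identities on a few rows copied from the table, as a quick way to validate transcribed values. The helper name `check_busco` and the rounding tolerance are assumptions made for illustration.

```python
# BUSCO percentages (C, C(S), C(D), F, M) copied from fully reported table rows.
ROWS = {
    "At-PH": (97.7, 96.0, 1.7, 0.6, 1.7),
    "Ac-orig": (98.7, 94.7, 4.0, 0.6, 0.7),
    "Vv-PH": (92.1, 88.1, 4.0, 1.6, 6.3),
    "Mm-PD": (94.4, 90.9, 3.5, 2.7, 2.9),
    "Mm-HMS": (94.9, 91.3, 3.6, 2.5, 2.6),
}

# Assumed tolerance: each value is rounded to one decimal, so sums may drift slightly.
TOL = 0.2


def check_busco(c, s, d, f, m, tol=TOL):
    """Return True if C = C(S) + C(D) and C + F + M = 100 hold within tol."""
    complete_ok = abs(c - (s + d)) <= tol
    total_ok = abs((c + f + m) - 100.0) <= tol
    return complete_ok and total_ok


results = {name: check_busco(*values) for name, values in ROWS.items()}

for name, ok in results.items():
    print(f"{name}: {'consistent' if ok else 'inconsistent'}")
```

A row that fails either identity would point to a transcription or extraction error rather than a property of the assembly.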