Abstract
BACKGROUND: De novo genome assembly typically produces a set of contigs instead of the complete genome. Thus additional data such as genetic linkage maps, optical maps, or Hi-C data is needed to resolve the complete structure of the genome. Most of the previous work uses the additional data to order and orient contigs.
Keywords: Genetic linkage maps; Genome assembly
Year: 2022 PMID: 35525918 PMCID: PMC9077837 DOI: 10.1186/s12859-022-04701-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 An example of how different points of the hierarchy affect the assembly. At the bottom, raw reads are assembled into contigs in the leaf nodes. Each internal node then takes the contigs from its children and merges them further
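The bottom-up flow described in the Fig. 1 caption can be sketched as a post-order traversal. This is only an illustration: the `Node` class and the `assemble_reads` and `merge_contigs` callables are hypothetical placeholders, not HGGA's actual data structures or assembly routines, and the toy functions below simply concatenate strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical hierarchy node: leaves hold reads, internal nodes hold children."""
    reads: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def assemble_hierarchy(
    node: Node,
    assemble_reads: Callable[[List[str]], List[str]],
    merge_contigs: Callable[[List[str]], List[str]],
) -> List[str]:
    """Return the contigs available at `node`, processing children first."""
    if not node.children:
        # Leaf node: raw reads are assembled into contigs.
        return assemble_reads(node.reads)
    # Internal node: collect the contigs of all children, then merge them further.
    child_contigs: List[str] = []
    for child in node.children:
        child_contigs.extend(assemble_hierarchy(child, assemble_reads, merge_contigs))
    return merge_contigs(child_contigs)


# Toy stand-ins for real assembly steps (assumptions for illustration only).
def toy_assemble(reads: List[str]) -> List[str]:
    return ["".join(reads)]


def toy_merge(contigs: List[str]) -> List[str]:
    return ["".join(contigs)]


root = Node(children=[Node(reads=["AC", "GT"]), Node(reads=["TT"])])
print(assemble_hierarchy(root, toy_assemble, toy_merge))
```

Stopping the recursion at a given height instead of the root would give the intermediate contig sets that the "Height" rows of the internal-node table compare.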
Characteristics of the read data sets and genetic maps used in the experiments
| Organism | # of reads | Mean read length (bp) | Total read length (Mbp) | Coverage | # of markers | # of bins |
|---|---|---|---|---|---|---|
| C. elegans (sim) | 478,836 | 8,214 | 3,933 | 40 | 98,978 | 81,788 |
| C. elegans (real) | 3,316,106 | 8,801 | 29,185 | 291 | 388,202 | – |
| A. thaliana | 1,135,065 | 9,475 | 10,755 | 90 | 49,617 | 45,930 |
| P. pungitius | 10,918,547 | 4,948 | 54,025 | 115 | 76,036 | 34,845 |
| H. sapiens | 25,986,153 | 8,916 | 231,694 | 76 | 996,603 | 936,534 |
The C. elegans (sim) reads were simulated with SimLoRD. The C. elegans (sim), A. thaliana, and H. sapiens genetic linkage maps were simulated by randomly positioning the markers on the genome. The C. elegans (real), A. thaliana, P. pungitius, and H. sapiens reads are real PacBio reads
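As a consistency check on the table above, coverage is conventionally the total read length divided by the genome size, so each row implies an approximate genome size. The sketch below assumes that definition (the source does not state how coverage was computed) and uses the (total read length, coverage) pairs in table row order. The helper name is illustrative.

```python
def implied_genome_size_mbp(total_read_length_mbp: float, coverage: float) -> float:
    """Genome size implied by coverage = total read length / genome size."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return total_read_length_mbp / coverage


# (total read length in Mbp, coverage) for the five table rows, in order.
rows = [(3_933, 40), (29_185, 291), (10_755, 90), (54_025, 115), (231_694, 76)]

for total_mbp, coverage in rows:
    size = implied_genome_size_mbp(total_mbp, coverage)
    print(f"{total_mbp:>8,} Mbp at {coverage:>3}x -> about {size:,.0f} Mbp")
```

For instance, the row with 54,025 Mbp of reads at 115x implies roughly 470 Mbp, close to the 466 Mbp length quoted for the P. pungitius reference assembly.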
The effect of the minimum leaf size on the assembly of the simulated C. elegans data
| Min leaf size (% of reads) | # of contigs | NGA50 (bp) | Genome fraction | Misassemblies | BUSCO Complete (%) | Reads mapped (%) | Runtime (min) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|---|
| 0.1% | 42 | 3,901,186 | 99.699 | 11 | 97.8 | 99.78 | 60 | 1564 |
| 0.5% | 34 | 4,282,525 | 99.564 | 11 | 93.4 | 99.76 | 55 | 487 |
| 1.0% | 30 | 4,274,710 | 99.592 | 9 | 93.4 | 99.77 | 53 | 901 |
| 1.5% | 31 | 5,901,436 | 99.595 | 14 | 97.2 | 99.78 | 51 | 1334 |
| 2.0% | 37 | 4,691,641 | 99.604 | 15 | 93.5 | 99.77 | 53 | 1776 |
| 2.5% | 38 | 3,900,976 | 99.568 | 12 | 93.3 | 99.78 | 54 | 2226 |
| 5.0% | 39 | 5,335,812 | 99.571 | 16 | 98.1 | 99.78 | 40 | 3954 |
The effect of the minimum leaf size on the assembly of the real P. pungitius data. The length of the scaffold level reference assembly (GCA_902500615.3) is 466 Mbp
| Min leaf size (% of reads) | # of contigs | N50 (bp) | Total length (bp) | BUSCO complete (%) | Reads mapped (%) | Runtime (h) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|
| 0.1% | 1945 | 918,119 | 453,155,823 | 88.0 | 92.6 | 13.34 | 6,212 |
| 0.5% | 1084 | 1,799,563 | 489,091,741 | 91.3 | 93.52 | 14.3 | 8,289 |
| 1.0% | 884 | 1,877,796 | 511,024,231 | 92.1 | 93.95 | 13.88 | 11,680 |
| 1.5% | 790 | 2,119,727 | 503,905,067 | 92.5 | 93.91 | 13.44 | 16,001 |
| 2.0% | 779 | 2,059,129 | 499,019,519 | 91.7 | 93.82 | 11.98 | 17,884 |
| 2.5% | 784 | 2,027,447 | 481,429,790 | 91.7 | 93.65 | 17.95 | 22,713 |
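The N50 column above follows the standard definition: the largest length L such that contigs of length at least L together cover at least half of the total assembly length. A minimal sketch is shown below. The optional `target_total` parameter is an addition for illustration that swaps in a reference genome length, as NG50 does. NGA50, used in the other tables, additionally works on aligned blocks broken at misassemblies, which this sketch does not model.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], target_total: Optional[int] = None) -> int:
    """Return N50 of the given contig lengths.

    If target_total is given (e.g. a reference genome length), half of that
    value is used as the threshold instead of half of the assembly length.
    Returns 0 when the contigs cover less than half of target_total.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    total = target_total if target_total is not None else sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point issues with odd totals.
        if 2 * running >= total:
            return length
    return 0


print(n50([8, 8, 4, 3, 3, 2, 2, 2]))
```

Because N50 depends only on the assembly itself, it is the metric reported for P. pungitius, where only a scaffold-level reference is available.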
The effect of the map density on the assembly of the C. elegans data
| Method | # of markers | # of contigs | NGA50 (bp) | Genome fraction | Misassemblies | BUSCO complete (%) | Reads mapped (%) | Runtime (min) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|---|---|
| Kermit | 1k | 850 | 89,141 | 74.315 | 13 | 73.4 | 91.84 | 21 | 11,943 |
| Kermit | 10k | 733 | 82,640 | 68.808 | 16 | 67.8 | 90.69 | 21 | 11,809 |
| Kermit | 20k | 216 | 818,928 | 95.417 | 9 | 93.8 | 98.09 | 22 | 12,434 |
| Kermit | 50k | 69 | 3,450,849 | 99.539 | 12 | 98.0 | 99.74 | 23 | 12,542 |
| Kermit | 100k | 61 | 3,476,344 | 99.563 | 11 | 98.3 | 99.75 | 23 | 12,543 |
| Kermit | 150k | 64 | 3,450,700 | 99.555 | 12 | 98.1 | 99.77 | 23 | 12,555 |
| Kermit | 200k | 64 | 3,476,344 | 99.563 | 11 | 98.3 | 99.75 | 23 | 12,542 |
| Kermit | 500k | 64 | 3,476,344 | 99.563 | 11 | 98.2 | 99.75 | 23 | 12,544 |
| HGGA | 1k | 69 | 2,488,265 | 95.627 | 8 | 93.8 | 97.63 | 38 | 1902 |
| HGGA | 10k | 44 | 3,668,792 | 99.698 | 9 | 97.9 | 99.75 | 40 | 1837 |
| HGGA | 20k | 44 | 3,668,641 | 99.680 | 10 | 95.6 | 98.52 | 42 | 1835 |
| HGGA | 50k | 46 | 3,668,702 | 99.708 | 9 | 95.9 | 98.52 | 42 | 1827 |
| HGGA | 100k | 49 | 3,668,667 | 99.646 | 9 | 97.8 | 99.78 | 42 | 1874 |
| HGGA | 150k | 51 | 3,668,731 | 99.669 | 8 | 98.1 | 99.75 | 42 | 1886 |
| HGGA | 200k | 52 | 3,869,053 | 99.568 | 8 | 96.2 | 99.36 | 43 | 1833 |
| HGGA | 500k | 47 | 3,668,735 | 99.652 | 13 | 98.0 | 99.76 | 48 | 1837 |
The effect of assembly in the internal nodes on the C. elegans data
| Height | # of contigs | NGA50 (bp) | Genome fraction | Misassemblies | BUSCO Complete (%) | Reads mapped (%) |
|---|---|---|---|---|---|---|
| leaves | 221 | 2,840,136 | 99.619 | 17 | 98.6 | 99.88 |
| 1 | 112 | 3,323,225 | 99.599 | 17 | 97.7 | 99.84 |
| 2 | 71 | 3,473,215 | 99.571 | 13 | 98.1 | 99.81 |
| 3 | 59 | 3,540,478 | 99.551 | 12 | 97.6 | 99.79 |
| 4 | 51 | 3,549,527 | 99.551 | 12 | 97.6 | 99.78 |
| root | 31 | 5,901,436 | 99.595 | 14 | 97.2 | 99.78 |
Comparison of HGGA, miniasm, and Kermit on the simulated C. elegans data
| Method | # of contigs | NGA50 (bp) | Genome fraction | Misassemblies | BUSCO complete (%) | Reads mapped (%) | Runtime (min) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|---|
| Miniasm | 126 | 1,982,361 | 99.443 | 10 | 98.1 | 99.75 | 20 | 18,332 |
| Kermit | 83 | 2,819,353 | 99.535 | 7 | 98.3 | 99.75 | 23 | 19,578 |
| HGGA | 31 | 5,901,436 | 99.595 | 14 | 97.2 | 99.78 | 51 | 1,334 |
Comparison of HGGA, miniasm, and Kermit on the A. thaliana data with real reads and simulated genetic linkage map
| Method | # of contigs | NGA50 (bp) | Genome fraction | Misassemblies | BUSCO Complete (%) | Reads mapped (%) | Runtime (h) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|---|
| Miniasm | 712 | 2,552,623 | 98.766 | 346 | 84.5 | 96.63 | 2.37 | 34,128 |
| Kermit | 123 | 2,552,489 | 98.185 | 174 | 85.1 | 89.07 | 2.08 | 34,486 |
| HGGA | 136 | 4,173,314 | 98.247 | 242 | 86.3 | 95.87 | 3.41 | 10,050 |
Comparison of HGGA, miniasm, and Kermit on the H. sapiens data with real reads and simulated genetic linkage map
| Method | # of contigs | NGA50 (bp) | Genome fraction | Misassemblies | BUSCO complete (%) | Reads mapped (%) | Runtime (h) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|---|
| Miniasm | 8,789 | 692,902 | 89.761 | 3,669 | 76.5 | 61.37 | 237.84 | 565,309 |
| Kermit | 4,503 | 1,050,164 | 90.069 | 762 | 77.9 | 60.65 | 239.29 | 565,307 |
| HGGA | 2,204 | 6,814,538 | 93.181 | 3,004 | 86.5 | 70.45 | 37.46 | 69,492 |
Comparison of HGGA, miniasm, and Kermit on the C. elegans data with real genetic linkage map and reads
| Method | # of contigs | NGA50 (bp) | Genome fraction | Misassemblies | BUSCO complete (%) | Reads mapped (%) | Runtime (h) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|---|
| Miniasm | 472 | 1,582,439 | 99.478 | 420 | 95.2 | 94.43 | 5.52 | 88,371 |
| Kermit | 95 | 1,864,384 | 99.187 | 197 | 95.8 | 93.41 | 4.88 | 88,028 |
| HGGA | 217 | 1,927,968 | 99.072 | 195 | 95.1 | 94.61 | 9.07 | 9,101 |
Comparison of HGGA, miniasm, and Kermit on the real P. pungitius data. The length of the scaffold level reference assembly (GCA_902500615.3) is 466 Mbp
| Method | # of contigs | N50 (bp) | Total length (bp) | BUSCO complete (%) | Reads mapped (%) | Runtime (h) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|
| Miniasm | 1,873 | 1,182,753 | 461,795,357 | 92.7 | 93.58 | 13.49 | 165,716 |
| Kermit | 833 | 1,392,886 | 432,823,234 | 92.1 | 93.08 | 13.19 | 165,061 |
| HGGA | 790 | 2,119,727 | 503,905,067 | 92.5 | 93.91 | 13.44 | 16,001 |
Fig. 2 An example of how reads, shown as black horizontal lines, are assigned to leaf nodes. Each read is first assigned to one or more preliminary leaf nodes (delimited by black vertical lines). Each preliminary leaf is then split in half (dashed vertical lines). These halves are merged back together with their neighbors (grey rectangles) and assigned to the final leaf nodes in their order of appearance
Fig. 3 A bidirected overlap graph corresponding to overlaps between contigs a, b, and c. The contig edges are shown in gray and the overlap edges in black. An assembly path through the graph alternates between contig edges and overlap edges. In this graph the path is an assembly path
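The alternation property stated in the Fig. 3 caption can be checked mechanically. The sketch below rests on an assumed representation, not the paper's: each contig has two end nodes, `(name, "start")` and `(name, "end")`, joined by a contig edge, and overlap edges join ends of different contigs. The overlaps listed are hypothetical, since the figure's actual edges are not recoverable from the caption.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

End = Tuple[str, str]  # (contig name, "start" or "end")


def edge_kind(u: End, v: End, overlaps: Set[FrozenSet[End]]) -> Optional[str]:
    """Classify the edge between two contig ends, or return None if absent."""
    if u[0] == v[0] and u[1] != v[1]:
        return "contig"  # the two ends of the same contig
    if frozenset((u, v)) in overlaps:
        return "overlap"
    return None


def is_assembly_path(nodes: List[End], overlaps: Set[FrozenSet[End]]) -> bool:
    """True if consecutive nodes are joined by edges that alternate in kind."""
    kinds = [edge_kind(u, v, overlaps) for u, v in zip(nodes, nodes[1:])]
    if not kinds or None in kinds:
        return False
    return all(first != second for first, second in zip(kinds, kinds[1:]))


# Hypothetical overlaps between contigs a, b, and c.
overlaps = {
    frozenset((("a", "end"), ("b", "start"))),
    frozenset((("b", "end"), ("c", "start"))),
}
path = [("a", "start"), ("a", "end"), ("b", "start"),
        ("b", "end"), ("c", "start"), ("c", "end")]
print(is_assembly_path(path, overlaps))
```

The check only enforces alternation and edge existence. Whether a path must begin and end with a contig edge is not stated in the caption, so that constraint is left out.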