Riku Walve, Pasi Rastas, Leena Salmela.
Abstract
BACKGROUND: With long reads getting even longer and cheaper, large-scale sequencing projects can be accomplished without short reads at an affordable cost. Due to high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose locations on the genome are approximately known. They therefore provide long-range information that has the potential to greatly aid de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist for incorporating linkage maps into assembly; instead, large amounts of manual labour are needed to order the contigs into chromosomes.
Keywords: Coloured overlap graph; Genome assembly; Linkage maps
Year: 2019 PMID: 30930956 PMCID: PMC6425630 DOI: 10.1186/s13015-019-0143-x
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 Overview of our method. First, the reads are mapped to the draft assembly and assigned colours (top left). Each colour represents one bin in the linkage map. In this example we have three bins (brown, green and red), ordered brown < green < red, and all bins belong to the same chromosome. Miniasm is then used to construct the overlap graph, which is augmented with the colours. Next, a vertex is coloured through the colour propagation process. Finally, the edges that have inconsistent colourings are removed.
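The colouring steps in the caption can be illustrated with a small sketch. This is not the authors' implementation: the propagation rule (an uncoloured vertex inherits the union of its neighbours' colours), the consistency rule (two reads are compatible if some pair of their colours is at most `max_dist` bins apart), and all function names are assumptions made for illustration. Bins are encoded as integers in linkage-map order (brown = 0, green = 1, red = 2).

```python
# Illustrative sketch only; the rules below are assumptions, not Kermit's actual algorithm.

def propagate_colours(adj, colours):
    """Give each uncoloured vertex the union of its neighbours' colours, round by round."""
    colours = {v: set(colours.get(v, ())) for v in adj}
    while True:
        updates = {}
        for v, nbrs in adj.items():
            if colours[v]:
                continue  # vertex already has a colour
            inherited = set().union(*(colours[u] for u in nbrs))
            if inherited:
                updates[v] = inherited
        if not updates:
            return colours
        colours.update(updates)


def is_consistent(cu, cv, max_dist=1):
    """Assumed rule: some pair of colours lies within max_dist bins of each other."""
    if not cu or not cv:
        return True  # no colour evidence, so keep the edge
    return any(abs(a - b) <= max_dist for a in cu for b in cv)


def remove_inconsistent_edges(adj, colours, max_dist=1):
    """Drop every edge whose endpoints have inconsistent colourings."""
    return {
        v: {u for u in nbrs if is_consistent(colours[v], colours[u], max_dist)}
        for v, nbrs in adj.items()
    }


# Toy overlap graph (symmetric adjacency sets); read "c" starts uncoloured.
adj = {
    "a": {"b", "e"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d", "a"},
}
seed = {"a": {0}, "b": {0}, "d": {1}, "e": {2}}  # brown=0, green=1, red=2

coloured = propagate_colours(adj, seed)
cleaned = remove_inconsistent_edges(adj, coloured)
print(coloured["c"])
print(sorted(cleaned["a"]))
```

In this toy graph, read `c` picks up brown and green from its neighbours, and the overlap between `a` (brown) and `e` (red) is dropped because the two bins are two steps apart.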
Summary of read data used in the experiments
| Data set | Reads | Total bases | Coverage | Accession |
|---|---|---|---|---|
| S. cerevisiae (simulated) | 36,639 | 486,048,334 | 39.98 | Simulated |
| C. elegans (simulated) | 296,230 | 3,930,689,562 | 39.99 | Simulated |
| S. cerevisiae (real) | 52,208 | 690,899,144 | 56.83 | PacBio (a) |
| C. elegans (real) | 740,776 | 8,118,404,281 | 82.59 | PacBio (b) |
| H. erato | 10,818,653 | 27,094,241,328 | 60.89 | SRR3476970, SRR4039325 |
| B. pendula | 1,898,360 | 19,032,363,776 | 49.71 | ERR2003767, ERR2003768 |

(a) https://github.com/PacificBiosciences/DevNet/wiki/Saccharomyces-cerevisiae-W303-Assembly-Contigs
(b) https://github.com/PacificBiosciences/DevNet/wiki/C.-elegans-data-set
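As a quick sanity check on the read-data table, the sketch below assumes that the Coverage column is total bases divided by genome length (the table does not state the formula) and inverts it to get the genome length each row implies. The function name is hypothetical; the input numbers are copied from the two simulated rows.

```python
def implied_genome_size(total_bases, coverage):
    """Genome length implied by coverage = total_bases / genome_length (assumed formula)."""
    return total_bases / coverage


# (total bases, coverage) taken from the two simulated rows of the table
simulated_rows = {
    "first simulated row": (486_048_334, 39.98),
    "second simulated row": (3_930_689_562, 39.99),
}

for name, (bases, cov) in simulated_rows.items():
    size_mb = implied_genome_size(bases, cov) / 1e6
    print(f"{name}: about {size_mb:.2f} Mb")
```

Under that assumption, the first row implies a genome of roughly 12 Mb and the second one of roughly 98 Mb, which is in line with the sizes of the yeast and nematode genomes.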
Summary of linkage maps used in the experiments
| Data set | Markers | Marker density | Bins | Bin density | References |
|---|---|---|---|---|---|
| S. cerevisiae (simulated) | 100,009 | 0.008 | 19,283 | 0.002 | Simulated |
| C. elegans (simulated) | 750,004 | 0.008 | 162,601 | 0.002 | Simulated |
| H. erato | 2,781,314 | 0.007 | 145,863 | 0.002 | Van Belleghem et al. |
| B. pendula | 2,979,993 | 0.007 | 925,123 | 0.002 | Salojärvi et al. |
Number of reads fully inside and outside their acceptable colour ranges
| Data set | Reads inside | Reads outside |
|---|---|---|
| S. cerevisiae (simulated) | 36,562 (99.79%) | 31 (0.08%) |
| C. elegans (simulated) | 296,134 (99.97%) | 3 (0.001%) |
Number of edges supported by the positions the reads were simulated from
| Data set | Graph | Genomic edges | Spurious edges |
|---|---|---|---|
| S. cerevisiae | Miniasm | 76,538 (99.39%) | 466 (0.60%) |
| S. cerevisiae | Kermit | 76,518 (99.93%) | 52 (0.07%) |
| S. cerevisiae | Miniasm cleaned | 7146 (99.92%) | 6 (0.08%) |
| S. cerevisiae | Kermit cleaned | 7114 (100.0%) | 0 (0.0%) |
| C. elegans | Miniasm | 668,012 (99.80%) | 1306 (0.19%) |
| C. elegans | Kermit | 667,970 (99.99%) | 58 (0.003%) |
| C. elegans | Miniasm cleaned | 60,416 (99.95%) | 28 (0.05%) |
| C. elegans | Kermit cleaned | 60,356 (99.997%) | 2 (0.003%) |
Graphs marked "cleaned" also use the graph cleaning steps that are already implemented in Miniasm.
Assembly statistics for simulated S. cerevisiae and C. elegans reads and simulated linkage maps
| Assembly | S. cerevisiae, Miniasm | S. cerevisiae, Kermit | C. elegans, Miniasm | C. elegans, Kermit |
|---|---|---|---|---|
| # of contigs | 26 | 22 | 291 | 261 |
| Total length | 11,831,837 | 11,728,421 | 102,040,817 | 101,632,493 |
| N50 | 605,399 | 640,779 | 2,337,914 | 2,293,633 |
| NGA50 | 565,122 | 585,849 | 2,070,983 | 2,337,914 |
| Misassemblies | 2 | 1 | 7 | 7 |
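For readers unfamiliar with the N50 statistic reported in the assembly tables, the following minimal sketch uses the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the toy contig lengths are illustrative and are not data from the paper. NGA50 is a related measure computed from blocks aligned to a reference genome and is not reproduced here.

```python
def n50(lengths):
    """Return N50: the contig length at which the running sum of lengths,
    taken from longest to shortest, first reaches half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Toy contig lengths (hypothetical, for illustration only)
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

With these toy lengths the total is 290, and the two longest contigs (80 + 70 = 150) already cover half of it, so the function returns 70.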
Assembly statistics for real S. cerevisiae and C. elegans reads and simulated linkage maps
| Assembly | S. cerevisiae, Miniasm | S. cerevisiae, Kermit | S. cerevisiae, Canu | C. elegans, Miniasm | C. elegans, Kermit | C. elegans, Canu |
|---|---|---|---|---|---|---|
| # of contigs | 31 | 29 | 34 | 177 | 188 | 159 |
| Total length | 12,118,143 | 11,997,376 | 12,426,814 | 109,318,925 | 104,545,368 | 108,154,535 |
| N50 | 732,688 | 763,111 | 739,529 | 2,270,602 | 1,928,805 | 3,202,659 |
| NGA50 | 345,801 | 376,210 | 375,952 | 272,995 | 271,763 | 271,783 |
| Misassemblies | 65 | 60 | 82 | 1837 | 1651 | 2019 |
Assembly statistics for real H. erato and B. pendula reads and real linkage maps
| Assembly | H. erato, Miniasm | H. erato, Kermit | H. erato, Canu | B. pendula, Miniasm | B. pendula, Kermit | B. pendula, Canu |
|---|---|---|---|---|---|---|
| # of contigs | 7444 | 6091 | 100,615 | 2201 | 1587 | 14,189 |
| Total length | 327,725,353 | 280,881,758 | 691,789,561 | 473,300,369 | 425,356,395 | 387,624,902 |
| N50 | 58,892 | 60,356 | 12,592 | 435,830 | 539,400 | 45,255 |
Assembly statistics for B. pendula dataset with different colour distance parameters
| Assembly | Unicoloured | | | | |
|---|---|---|---|---|---|
| # of contigs | 1583 | 1594 | 1587 | 1585 | 1583 |
| Total length | 426,546,974 | 425,333,715 | 425,356,395 | 425,531,261 | 425,708,460 |
| N50 | 537,429 | 539,400 | 539,400 | 538,989 | 538,770 |
The unicoloured column shows results for colouring each chromosome with a single colour
Wall clock times for all steps taken by the tools
| Data set | Tool | Overlap | Map | Colour | Layout | Consensus | Total |
|---|---|---|---|---|---|---|---|
| S. cerevisiae (simulated) | Miniasm | 52 s | – | – | 6 s | 4 min 52 s | 5 min 50 s |
| S. cerevisiae (simulated) | Kermit | 52 s | 6 s | 0 s | 6 s | 3 min 29 s | 4 min 38 s |
| C. elegans (simulated) | Miniasm | 9 min 55 s | – | – | 1 min 58 s | 28 min 19 s | 40 min 17 s |
| C. elegans (simulated) | Kermit | 9 min 55 s | 2 min 17 s | 5 s | 55 s | 28 min 57 s | 42 min 9 s |
| S. cerevisiae (real) | Miniasm | 1 min 9 s | – | – | 8 s | 2 min 45 s | 4 min 2 s |
| S. cerevisiae (real) | Kermit | 1 min 9 s | 10 s | 1 s | 8 s | 2 min 44 s | 4 min 12 s |
| S. cerevisiae (real) | Canu | – | – | – | – | – | 2 h 12 min |
| C. elegans (real) | Miniasm | 16 min 54 s | – | – | 4 min | 53 min 28 s | 1 h 14 min |
| C. elegans (real) | Kermit | 16 min 54 s | 4 min 8 s | 11 s | 2 min 29 s | 50 min 29 s | 1 h 14 min |
| C. elegans (real) | Canu | – | – | – | – | – | 12 h 31 min |
| H. erato | Miniasm | 8 h 40 min | – | – | 1 h 30 min | – | 10 h 10 min |
| H. erato | Kermit | 8 h 40 min | 9 min | 2 min | 2 h 42 min | – | 11 h 32 min |
| H. erato | Canu | – | – | – | – | – | 7 days |
| B. pendula | Miniasm | 3 h 24 min | – | – | 1 h 32 min | – | 4 h 57 min |
| B. pendula | Kermit | 3 h 24 min | 9 min | 17 s | 1 h 37 min | – | 5 h 11 min |
| B. pendula | Canu | – | – | – | – | – | 6 days |
The consensus phase was very slow on the large genomes of H. erato and B. pendula, so it was not run on those data sets.