Mohammed-Amin Madoui, Carole Dossat, Léo d'Agata, Jan van Oeveren, Edwin van der Vossen, Jean-Marc Aury.
Abstract
BACKGROUND: Scaffolding is an essential step in the genome assembly process. Current methods based on large fragment paired-end reads or long reads allow an increase in contiguity but often lack consistency in repetitive regions, resulting in fragmented assemblies. Here, we describe a novel tool to link assemblies to a genome map to aid complex genome reconstruction by detecting assembly errors and allowing scaffold ordering and anchoring.
Year: 2016 PMID: 26936254 PMCID: PMC4776351 DOI: 10.1186/s12859-016-0969-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 MaGuS pipeline. a Flowchart of the MaGuS pipeline. b Comparison of the QUAST and MaGuS metrics. c Application of MaGuS to WGP data
QUAST and MaGuS quality metrics for the five assemblies. The R² values indicate the Pearson correlation coefficients between the QUAST NAx and MaGuS Anx values
| Assembly metrics | SOAP | SSPACE | SGA | BESST | OPERA-LG |
|---|---|---|---|---|---|
| Assembly size (bp) | 115 319 220 | 116 017 208 | 114 956 386 | 114 996 281 | 116 406 702 |
| N50 (bp) | 821 817 | 982 887 | 284 070 | 1 299 606 | 1 272 891 |
| L50 | 39 | 31 | 115 | 22 | 26 |
| N75 (bp) | 306 051 | 340 070 | 118 727 | 643 037 | 566 836 |
| L75 | 96 | 81 | 270 | 54 | 60 |
| QUAST metrics | | | | | |
| Number of N's per 100 kb | 3 851.60 | 3 000.11 | 4 251.16 | 2 845.19 | 3 139.94 |
| Misassemblies | 9 | 9 | 3 | 23 | 51 |
| Largest alignment (bp) | 3 482 036 | 4 678 885 | 1 680 656 | 6 501 653 | 5 259 610 |
| NA50 (bp) | 757 250 | 926 429 | 276 557 | 1 210 586 | 945 419 |
| NA75 (bp) | 268 694 | 291 099 | 100 235 | 516 026 | 351 844 |
| MaGuS metrics | | | | | |
| An50 (bp) | 31 217 | 32 028 | 23 164 | 35 466 | 33 908 |
| An75 (bp) | 11 887 | 12 052 | 6 981 | 14 315 | 13 113 |
| R² | 0.99 | 0.98 | 0.96 | 0.99 | 0.96 |
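
The N50, L50, N75 and L75 rows above follow the standard contiguity definitions: Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x % of the assembly, and Lx is the number of sequences in that set. The following is a minimal Python sketch of that calculation, not code from MaGuS or QUAST; the function name `nx_lx` and the toy lengths are illustrative assumptions. How the aligned or anchored fragments behind NAx and Anx are obtained is not described in this record, so the sketch only covers the final length-based step.

```python
def nx_lx(lengths, x=50):
    """Return (Nx, Lx) for a collection of sequence lengths.

    Nx: length of the sequence at which the cumulative sum of lengths,
        taken from longest to shortest, first reaches x % of the total.
    Lx: number of sequences needed to reach that point.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")

    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count


# Hypothetical toy assembly (lengths in bp), not data from the paper.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(toy_scaffolds, 50)
n75, l75 = nx_lx(toy_scaffolds, 75)
print(f"N50={n50} L50={l50} N75={n75} L75={l75}")
```

Passing a list of aligned block lengths instead of scaffold lengths would give the corresponding NAx-style value under the same definition.
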
Fig. 2 Comparison of MaGuS and QUAST quality metrics for the five assemblies. a MaGuS Anx plot. b QUAST NAx plot. c Correlation between Anx and NAx values
Assembly metrics after MaGuS scaffolding for the five assemblies
| Assembly metrics | SOAP | SSPACE | SGA | BESST | OPERA-LG |
|---|---|---|---|---|---|
| Assembly size (bp) | 115 563 956 | 116 414 299 | 115 703 449 | 115 174 685 | 116 556 828 |
| N50 (bp) | 1 350 715 | 1 680 424 | 635 106 | 1 751 177 | 1 442 963 |
| N50 fold change | 1.64 | 1.74 | 2.24 | 1.35 | 1.13 |
| L50 | 23 | 18 | 47 | 18 | 22 |
| N75 (bp) | 509 384 | 646 442 | 288 240 | 787 050 | 695 198 |
| N75 fold change | 1.66 | 1.9 | 2.43 | 1.22 | 1.23 |
| L75 | 58 | 48 | 110 | 42 | 51 |
| Number of N's per 100 kb | 4 055.34 | 3 331.14 | 4 869.38 | 2 995.68 | 3 264.70 |
| Largest alignment (bp) | 5 012 555 | 7 708 756 | 3 361 051 | 6 902 343 | 5 597 743 |
| NA50 (bp) | 1 187 620 | 1 455 792 | 579 394 | 1 407 579 | 1 258 868 |
| NA50 fold change | 1.57 | 1.57 | 2.1 | 1.16 | 1.18 |
| NA75 (bp) | 354 088 | 508 625 | 215 751 | 609 320 | 560 902 |
| NA75 fold change | 1.32 | 1.75 | 2.15 | 1.18 | 1.59 |
| Total misassemblies | 23 | 19 | 19 | 36 | 62 |
| MaGuS misassemblies | 14 | 10 | 16 | 13 | 5 |
| Number of map-links | 534 | 481 | 1 034 | 371 | 368 |
| Number of MP-validated links | 209 (39.14 %) | 214 (44.49 %) | 516 (49.9 %) | 93 (25.07 %) | 78 (21.2 %) |
| Number of correct MP-validated links | 195 (36.51 %) | 204 (42.41 %) | 500 (48.53 %) | 80 (21.56 %) | 73 (19.83 %) |
| False positive rate (%) | 6.7 | 4.7 | 3.1 | 14 | 6.4 |
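
The link rows of the table can be related to each other with simple ratios. The sketch below is an assumption inferred from the reported numbers, not a formula stated in this record: the percentage of MP-validated links is taken relative to the number of map-links, and the false positive rate as the share of MP-validated links that are not correct. The class and function names are hypothetical; the counts are the SOAP and BESST columns of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkCounts:
    """Link counts for one assembly, as listed in the table."""
    map_links: int
    mp_validated: int
    correct_mp_validated: int


def validated_pct(c: LinkCounts) -> float:
    """MP-validated links as a percentage of all map-links."""
    return 100 * c.mp_validated / c.map_links


def false_positive_pct(c: LinkCounts) -> float:
    """Assumed definition: incorrect MP-validated links
    as a percentage of all MP-validated links."""
    return 100 * (c.mp_validated - c.correct_mp_validated) / c.mp_validated


soap = LinkCounts(map_links=534, mp_validated=209, correct_mp_validated=195)
besst = LinkCounts(map_links=371, mp_validated=93, correct_mp_validated=80)

for name, counts in (("SOAP", soap), ("BESST", besst)):
    print(f"{name}: validated {validated_pct(counts):.2f} %, "
          f"false positives {false_positive_pct(counts):.1f} %")
```

Under these assumed definitions the SOAP column gives 39.14 % validated links and a 6.7 % false positive rate, matching the table.
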
Fig. 3 Distribution of the number of mate-pairs that validate map-links for the five assemblies
Fig. 4 Effect of intra- and inter-BAC contig errors on MaGuS scaffolds. The intra- and inter-BAC contig errors are named e1 and e2, respectively. a Effect of the map errors on the An50 values. b Effect of the map errors on the N50. c Effect of the map errors on the N90. d Effect of the map errors on the NA50. e Effect of the map errors on the NA90. f Effect of the map errors on misassemblies. Grey areas show the values between the upper and lower pointwise confidence limits around the mean; these values were obtained from a log regression