Martin D Muggli, Simon J Puglisi, Roy Ronen, Christina Boucher.
Abstract
MOTIVATION: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used simulated optical mapping data for loblolly pine and F. tularensis and used real optical mapping data for rice and budgerigar.
Year: 2015 PMID: 26072512 PMCID: PMC4542784 DOI: 10.1093/bioinformatics/btv262
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An illustration of the systematic alterations that occur with rearrangements, inversions, collapsed repeats and expanded repeats. (a) Proper read alignment, where mate-pair reads have the correct orientation and distance from each other. A rearrangement or inversion presents itself through incorrect read orientation and/or a mate-pair distance that is significantly smaller or significantly larger than the expected insert size. This is shown in (b) and (c), respectively. (d) The proper read depth, which is uniform across the genome. (e) A collapsed repeat, which results in the read depth being greater than expected. (f) An expanded repeat, which results in the read depth being lower than expected.
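The signals described in this caption can be sketched as two simple checks. This is only an illustration under stated assumptions, not the misSEQuel implementation: the function names, the `k` standard-deviation cutoff on insert size, and the `tolerance` on read depth are hypothetical choices made for the example.

```python
def mate_pair_is_discordant(correct_orientation, distance,
                            mean_insert, sd_insert, k=3.0):
    """Flag a mate pair with the wrong orientation, or with a distance that
    deviates from the expected insert size by more than k standard deviations.
    The cutoff k is an assumption made for this sketch."""
    if not correct_orientation:
        return True
    return abs(distance - mean_insert) > k * sd_insert


def classify_read_depth(depth, expected_depth, tolerance=0.5):
    """Label a region by its read depth relative to the expected depth.
    Following the caption: excess depth suggests a collapsed repeat and
    reduced depth suggests an expanded repeat. The tolerance is hypothetical."""
    if depth > expected_depth * (1 + tolerance):
        return "collapsed repeat"
    if depth < expected_depth * (1 - tolerance):
        return "expanded repeat"
    return "expected"
```

For instance, with an assumed insert size of 300 ± 30 bp, a correctly oriented pair 500 bp apart would be flagged, while one 310 bp apart would not.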
The performance comparison between major assembly tools on the F. tularensis dataset, which has a genome length of 1 892 775 bp and 6 907 220 reads of 101 bp, using QUAST in default mode (Gurevich )
| Assembler | No. contigs (no. unaligned) | N50 | Largest (bp) | Total (bp) | MA | local MA | MA (bp) | GF (%) |
|---|---|---|---|---|---|---|---|---|
| Velvet | 358 (3 + 35 part) | 7377 | 39 381 | 1 762 202 | 11 | 36 | 84 965 | 92.09 |
| SOAPdenovo | 307 (3 + 31 part) | 8767 | 39 989 | 2 018 158 | 10 | 35 | 96 258 | 92.05 |
| ABySS | 96 (1 part) | 27 975 | 88 275 | 1 875 628 | 64 | 32 | 1 330 684 | 95.87 |
| SPAdes (−rr) | 102 (2 + 11 part) | 25 148 | 87 449 | 1 788 634 | 11 | 30 | 258 309 | 92.81 |
| SPAdes (+rr) | 100 (2 + 17 part) | 26 876 | 87 891 | 1 797 197 | 23 | 31 | 497 356 | 93.75 |
| IDBA | 109 (1 + 10 part) | 23 223 | 87 437 | 1 768 958 | 10 | 31 | 221 087 | 92.64 |
All statistics are based on contigs no shorter than 500 bp. N50 is defined as the length for which the collection of all contigs of that length or longer contains at least half of the sum of the lengths of all contigs, and for which the collection of all contigs of that length or shorter also contains at least half of the sum of the lengths of all contigs. The no. unaligned is the number of contigs that did not align to the reference genome or were only partially aligned (part). Total is the sum of the lengths of all contigs. MA is the number of (extensively) misassembled contigs. Local MA is the total number of contigs that had local misassemblies. MA (bp) is the total length of the MA contigs. GF is the genome fraction percentage, which is the fraction of genome bases that are covered by the assembly. −rr and +rr denote before and after repeat resolution, respectively.
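As an illustration of the N50 definition, the following minimal sketch computes N50 from a list of contig lengths. It is not QUAST's code; the function name `n50` and the `min_length` parameter are assumptions for the example, with the 500 bp default mirroring the cutoff stated above.

```python
def n50(contig_lengths, min_length=500):
    """Return the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    Contigs shorter than min_length are ignored (assumed 500 bp cutoff)."""
    lengths = sorted((n for n in contig_lengths if n >= min_length),
                     reverse=True)
    if not lengths:
        raise ValueError("no contigs at or above min_length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length
```

For example, `n50([100, 600, 700, 1000])` drops the 100 bp contig, leaving a total of 2300 bp. The 1000 bp contig alone covers less than half, and adding the 700 bp contig reaches 1700 bp, so the result is 700.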
The performance comparison of our method on the F. tularensis dataset
| Correction method | Assembler | MA TPR | local MA TPR | FPR |
|---|---|---|---|---|
| misSEQuel (paired-end data only) | Velvet | 100% (11/11) | 100% (36/36) | 58% (180/312) |
| | SOAPdenovo | 100% (10/10) | 100% (35/35) | 63% (165/263) |
| | ABySS | 100% (64/64) | 100% (32/32) | 87% (20/23) |
| | SPAdes (−rr) | 100% (11/11) | 100% (30/30) | 83% (52/63) |
| | SPAdes (+rr) | 100% (23/23) | 100% (31/31) | 86% (49/57) |
| | IDBA | 100% (10/10) | 100% (31/31) | 38% (57/149) |
| misSEQuel (optical mapping data only) | Velvet | 55% (6/11) | 69% (25/36) | 24% (76/312) |
| | SOAPdenovo | 80% (8/10) | 63% (22/35) | 29% (77/263) |
| | ABySS | 69% (44/64) | 88% (28/32) | 13% (3/23) |
| | SPAdes (−rr) | 91% (10/11) | 87% (26/30) | 21% (13/63) |
| | SPAdes (+rr) | 87% (20/23) | 81% (25/31) | 16% (9/57) |
| | IDBA | 90% (9/10) | 77% (24/31) | 10% (15/149) |
| REAPR | Velvet | 55% (6/11) | 11% (4/36) | < 1% (2/312) |
| | SOAPdenovo | 20% (2/10) | 14% (5/35) | 2% (6/263) |
| | ABySS | 13% (8/64) | 13% (4/32) | 4% (1/23) |
| | SPAdes (−rr) | 27% (3/11) | 27% (8/30) | 5% (3/63) |
| | SPAdes (+rr) | 0% (0/23) | 19% (6/31) | 11% (6/57) |
| | IDBA | 40% (4/10) | 13% (4/31) | 4% (6/149) |
| Pilon | Velvet | 27% (3/11) | 3% (1/36) | < 1% (3/312) |
| | SOAPdenovo | 10% (1/10) | 9% (3/35) | 2% (5/263) |
| | ABySS | 3% (2/64) | 6% (2/32) | 4% (3/23) |
| | SPAdes (−rr) | 0% (0/11) | 3% (1/30) | 5% (5/63) |
| | SPAdes (+rr) | 0% (0/23) | 10% (3/31) | 12% (7/57) |
| | IDBA | 0% (0/10) | 10% (3/31) | 4% (5/149) |
A true positive in this context is a contig that is misassembled and is predicted to be so. A false positive is a correctly assembled contig that was predicted to be misassembled. The TPR and FPR are given as percentages, with the raw counts given in brackets. Bold values emphasize the benefit of using both data sources.
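The percentages in the table can be reproduced from the bracketed counts. The sketch below assumes the denominators are the number of misassembled contigs (for TPR) and the number of correctly assembled contigs (for FPR); the helper name `rate_percent` is hypothetical.

```python
def rate_percent(predicted, total):
    """Return predicted / total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * predicted / total


# Velvet row, misSEQuel with optical mapping data only (counts from the table).
ma_tpr = rate_percent(6, 11)   # misassembled contigs flagged / all misassembled
fpr = rate_percent(76, 312)    # correct contigs flagged / all correct contigs
print(f"MA TPR = {ma_tpr:.0f}%, FPR = {fpr:.0f}%")
# prints "MA TPR = 55%, FPR = 24%"
```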
The performance comparison between major assembly tools on the loblolly pine genome dataset (62 647 324 bp, 31.3 million reads, 100 bp) using QUAST in default mode (Gurevich )
| Assembler | No. contigs (no. unaligned) | N50 | Largest (bp) | Total (bp) | MA | local MA | MA (bp) | GF (%) |
|---|---|---|---|---|---|---|---|---|
| Velvet | 13 327 (0) | 1740 | 10 823 | 51 851 131 | 0 | 0 | 0 | 62.21 |
| SOAPdenovo | 16 126 (0 + 1 part) | 7950 | 63 004 | 57 205 817 | 0 | 0 | 0 | 90.01 |
| ABySS | 4586 (16 + 89 part) | 37 089 | 201 382 | 63 349 408 | 127 | 715 | 1 391 565 | 98.17 |
| SPAdes (−rr) | 20 671 (4 + 10 part) | 4809 | 44 993 | 45 079 764 | 7 | 11 | 65 079 | 81.30 |
| SPAdes (+rr) | 8607 (7 + 102 part) | 16 957 | 108 442 | 59 730 939 | 299 | 57 | 3 734 609 | 94.57 |
| IDBA | 22 409 (3 + 31 part) | 3990 | 40 213 | 49 765 854 | 61 | 200 | 292 769 | 79.03 |
The performance comparison of our method on the loblolly pine dataset
| Correction method | Assembler | MA TPR | local MA TPR | FPR |
|---|---|---|---|---|
| REAPR | ABySS | 7% (9/127) | 2% (12/715) | 3% (112/3754) |
| | SPAdes (−rr) | 14% (1/7) | 27% (3/11) | 6% (1323/20 653) |
| | SPAdes (+rr) | 7% (21/299) | 5% (3/57) | 5% (424/8254) |
| | IDBA | 2% (1/61) | 6% (12/200) | 11% (2354/22 150) |
| Pilon | ABySS | 7% (8/127) | 2% (11/715) | 2% (70/3754) |
| | SPAdes (−rr) | 14% (1/7) | 18% (2/11) | 4% (923/20 653) |
| | SPAdes (+rr) | 5% (16/299) | 5% (3/57) | 5% (388/8254) |
| | IDBA | 2% (1/61) | 5% (12/200) | 8% (1823/22 150) |
Again, a true positive in this context is a contig that is misassembled and is predicted to be so. A false positive is a correctly assembled contig that was predicted to be misassembled. Bold values highlight misSEQuel results.