Francesco Vezzi, Giuseppe Narzisi, Bud Mishra.
Abstract
In just the last decade, a multitude of bio-technologies and software pipelines have emerged to revolutionize genomics. To further their central goal, they aim to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA sequences/reads. However, the performance of each of these tools is contingent on the length and quality of the sequencing data, the structure and complexity of the genome sequence, and the resolution and quality of long-range information. Furthermore, in the absence of any metric that captures the most fundamental "features" of a high-quality assembly, there is no obvious recipe for users to select the most desirable assembler/assembly. This situation has prompted the scientific community to rely on crowd-sourcing through international competitions, such as Assemblathons or GAGE, with the intention of identifying the best assembler(s) and their features. Somewhat circuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on the availability of a high-quality fully assembled reference genome sequence. Still worse, reference-guided evaluations are often difficult to analyze, leading to conclusions that are difficult to interpret. In this paper, we circumvent many of these issues by relying upon a tool, dubbed [Formula: see text], which is capable of evaluating de novo assemblies from the read-layouts even when no reference exists. We extend the FRCurve approach to cases where layout information may have been obscured, as is true in many de Bruijn-graph-based algorithms. As a by-product, FRCurve now expands its applicability to a much wider class of assemblers, thus identifying higher-quality members of this group, their inter-relations, and their sensitivity to carefully selected features, with or without the support of a reference sequence or layout for the reads. The paper concludes by reevaluating several recently conducted assembly competitions and the datasets that have resulted from them.
Year: 2012 PMID: 23284938 PMCID: PMC3532452 DOI: 10.1371/journal.pone.0052210
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Description of implemented features.
| Feature | Description |
| LOW_COV_PE | |
| HIGH_COV_PE | |
| LOW_NORM_COV_PE | |
| HIGH_NORM_COV_PE | |
| COMPR_PE | |
| STRECH_PE | |
| HIGH_SINGLE_PE | |
| HIGH_SPAN_PE | |
| HIGH_OUTIE_PE | |
| COMPR_MP | |
| STRECH_MP | |
| HIGH_SINGLE_MP | |
| HIGH_SPAN_MP | |
| HIGH_OUTIE_MP | |
The Table provides a brief description for each implemented feature.
Staphylococcus aureus (GAGE) assembly evaluation and features estimation.
| assembler | Ctg | NG50 (Kbp) | Chaff (%) | Indels | Misjoins | Inv | Reloc | Sens | Spec |
| ABySS | 246 | 34 | 6.66 | 10 | 6 | 4 | 2 | 99.25 | 62.70 |
| Allpaths-LG | | 1,092 | 0.03 | 12 | | | 4 | 84.79 | 89.97 |
| Bambus2 | 17 | 1,084 | 0.00 | 215 | 14 | 2 | 12 | 97.14 | 83.51 |
| MSR-CA | 17 | | 0.00 | 14 | 15 | 9 | 6 | 88.12 | 92.89 |
| SGA | 546 | 208 | 0.00 | | | 1 | | 95.48 | 63.71 |
| SOAPdenovo | 99 | 3312 | 0.35 | 36 | 25 | 2 | 23 | 95.32 | 86.69 |
| Velvet | 45 | 762 | 0.41 | 16 | 31 | 10 | 21 | 96.83 | 84.26 |
For each assembler we report the number of contigs/scaffolds produced (Ctg), the NG50, the percentage of short (Chaff) contigs (computed with respect to the real genome length), the number of long (i.e., >5 bp) indels (Indels), the number of misjoins (Misjoins), the number of inversions (Inv), the number of relocations (Reloc), the feature sensitivity (Sens), and the feature specificity (Spec).
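The two size-based columns can be reproduced from a plain list of contig lengths. The following is a minimal sketch, not the evaluation code behind the tables: the function names are hypothetical, and the 200 bp chaff cut-off is an assumption, since the caption does not state the threshold used.

```python
def ng50(contig_lengths, genome_length):
    """Return the NG50 in bp: the length of the contig at which the
    cumulative length of contigs, taken from longest to shortest,
    first reaches half of the genome length."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_length:
            return length
    return 0  # the assembly covers less than half of the genome


def chaff_percentage(contig_lengths, genome_length, chaff_threshold=200):
    """Return the total length of contigs shorter than chaff_threshold
    (an assumed cut-off) as a percentage of the genome length."""
    chaff_bases = sum(l for l in contig_lengths if l < chaff_threshold)
    return 100.0 * chaff_bases / genome_length


if __name__ == "__main__":
    lengths = [50, 40, 30, 20, 10]  # toy values, not real data
    print(ng50(lengths, genome_length=120))
    print(chaff_percentage(lengths, genome_length=120, chaff_threshold=25))
```

Unlike N50, NG50 is normalized by the genome length and not by the assembly length, which is what makes values comparable across assemblers that output different total amounts of sequence.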
Rhodobacter sphaeroides (GAGE) assembly evaluation and features estimation.
| assembler | Ctg | NG50 (Kbp) | Chaff (%) | Indels | Misjoins | Inv | Reloc | Sens | Spec |
| ABySS | 1701 | 9 | 1.59 | 38 | 24 | 2 | 22 | 98.92 | 37.26 |
| Allpaths-LG | | | 0.01 | 37 | 6 | | 6 | 90.73 | 93.36 |
| Bambus2 | 92 | 2,439 | 0.00 | 378 | 5 | 0 | 7 | 75.84 | 82.76 |
| CABOG | 130 | 66 | 0.00 | 24 | 15 | 5 | 10 | 89.04 | 82.51 |
| MSR-CA | 43 | 2,976 | 0.00 | 31 | 15 | 3 | 12 | 87.87 | 93.92 |
| SGA | 2096 | 51 | 0.00 | | | | | 96.66 | 62.89 |
| SOAPdenovo | 166 | 660 | 0.44 | 431 | 11 | 1 | 10 | 92.90 | 86.62 |
| Velvet | 178 | 353 | 0.48 | 27 | 21 | 6 | 15 | 92.04 | 83.33 |
For each assembler we report the number of contigs/scaffolds produced (Ctg), the NG50, the percentage of short (Chaff) contigs (computed with respect to the real genome length), the number of long (i.e., >5 bp) indels (Indels), the number of misjoins (Misjoins), the number of inversions (Inv), the number of relocations (Reloc), the feature sensitivity (Sens), and the feature specificity (Spec).
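The Sens and Spec columns are percentages. The sketch below uses the standard definitions of sensitivity and specificity. The text here does not specify how features are matched against reference-confirmed errors, so the four counts are assumed inputs and all names are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Percentage of real errors that are flagged (TP / (TP + FN))."""
    total = true_pos + false_neg
    return 100.0 * true_pos / total if total else float("nan")


def specificity(true_neg, false_pos):
    """Percentage of error-free items left unflagged (TN / (TN + FP))."""
    total = true_neg + false_pos
    return 100.0 * true_neg / total if total else float("nan")


if __name__ == "__main__":
    # Toy counts, not taken from the tables.
    print(sensitivity(true_pos=90, false_neg=10))
    print(specificity(true_neg=80, false_pos=20))
```

Under these definitions, an assembler with high Sens but low Spec is one whose real errors are almost always flagged, at the cost of many flags on correct sequence.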
Human chromosome 14 (GAGE) assembly evaluation and features estimation.
| assembler | Ctg | NG50 (Kbp) | Chaff (%) | Indels | Misjoins | Inv | Reloc | Sens | Spec |
| ABySS | 51301 | 2,1 | 34.78 | 762 | | | | 95.83 | 18.79 |
| Allpaths-LG | | | 0.02 | 2575 | 146 | 44 | 102 | 68.46 | 96.79 |
| Bambus2 | 1792 | 324 | 0.00 | 5651 | 3409 | 1759 | 1650 | 86.26 | 55.04 |
| CABOG | 479 | 393 | 0.00 | 2894 | 746 | 435 | 311 | 62.19 | 95.92 |
| MSR-CA | 1425 | 893 | 0.01 | 3097 | 2311 | 83 | 1439 | 86.10 | 84.71 |
| SGA | 30975 | 83 | 0.00 | | 150 | 90 | 60 | 92.13 | 65.38 |
| SOAPdenovo | 13501 | 455 | 3.09 | 3902 | 1529 | 537 | 992 | 90.59 | 73.10 |
| Velvet | 3565 | 1,190 | 4.23 | 4172 | 9525 | 4023 | 5502 | 91.60 | 67.55 |
For each assembler we report the number of contigs/scaffolds produced (Ctg), the NG50, the percentage of short (Chaff) contigs (computed with respect to the real genome length), the number of long (i.e., >5 bp) indels (Indels), the number of misjoins (Misjoins), the number of inversions (Inv), the number of relocations (Reloc), the feature sensitivity (Sens), and the feature specificity (Spec).
Figure 1. FRCurve computed on the three GAGE datasets and on Assemblathon 1 entries.
Figures A, B, and C show the FRCurves computed on the three GAGE datasets (Staphylococcus aureus, Rhodobacter sphaeroides, and Human chromosome 14). Figure D shows the FRCurves computed on Assemblathon 1 entries.
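For readers unfamiliar with how such a curve is built, here is a minimal sketch of a feature-response curve computation. It assumes the usual description of the approach: contigs are taken from longest to shortest, and for each feature threshold only the contigs whose cumulative feature count stays within the threshold contribute to the reported coverage. The function name, the input layout, and the choice to stop at the first contig that would exceed the threshold are assumptions, not the tool's actual implementation.

```python
def frcurve(contigs, genome_length, thresholds):
    """Compute points of a feature-response curve.

    contigs: list of (length_bp, feature_count) pairs (hypothetical layout).
    Returns a list of (threshold, coverage_percent) points.
    """
    # Longest contigs first.
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    points = []
    for phi in thresholds:
        features = 0
        covered = 0
        for length, n_features in ordered:
            # Assumption: stop once adding a contig would exceed the threshold.
            if features + n_features > phi:
                break
            features += n_features
            covered += length
        points.append((phi, 100.0 * covered / genome_length))
    return points


if __name__ == "__main__":
    toy = [(500, 2), (300, 0), (200, 3)]  # toy values, not real data
    for phi, cov in frcurve(toy, genome_length=1000, thresholds=[0, 2, 5]):
        print(phi, cov)
```

A curve that rises steeply reaches high coverage while tolerating few features, which is the sense in which one assembly's curve can be read as better than another's.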
Assemblathon 1 assembly evaluation and features estimation.
| assembler | Ctg | NG50 (Kbp) | Chaff (%) | Indels | Misjoins | Sens | Spec |
| BROAD | 989 | | 0.00 | 903 | 236 | 92.99 | 93.88 |
| BGI | 1897 | 1,716 | 0.26 | 994 | 656 | 81.39 | 97.48 |
| WTSI-S | 1380 | 2,874 | 0.00 | | 197 | 95.10 | 96.55 |
| DOEJGI | 771 | 9,073 | 0.03 | 163 | | 94.32 | 96.80 |
| CSHL | 1842 | 3,254 | 3.05 | 3704 | 733 | 90.76 | 95.18 |
| CRACS | 6165 | 2,712 | 0.00 | 319 | 990 | 96.59 | 83.27 |
| BCCGSC | 3314 | 825 | 2.92 | 488 | 636 | 96.69 | 88.97 |
| EBI | 2173 | 959 | 0.39 | 674 | 1021 | 78.66 | 94.24 |
| IoBUGA | | 1,801 | 0.18 | 3596 | 1249 | 71.65 | 94.16 |
| RHUL | 4999 | 43 | 0.00 | 336 | 1040 | 91.70 | 95.88 |
| WTSI-P | 1448 | 502 | 0.00 | 4121 | 2389 | 93.53 | 89.94 |
| DCSISU | 4790 | 315 | 0.00 | 1284 | 2366 | 90.14 | 79.60 |
| IRISA | 3539 | 1,406 | 0.05 | 2518 | 350 | 95.28 | 76.90 |
| ASTR | 6228 | 57 | 0.00 | 336 | 2265 | 91.79 | 69.97 |
| UCSF | 14821 | 22 | 0.00 | 12131 | 5127 | 93.85 | 66.19 |
| GACWT | 24297 | 9 | 0.00 | 2197 | 1487 | 94.10 | 49.36 |
| CIUoC | 14993 | 6 | 0.00 | 3215 | 1889 | 77.09 | 67.29 |
For each assembler we report the number of contigs/scaffolds produced (Ctg), the NG50, the percentage of short (Chaff) contigs (computed with respect to the real genome length), the number of long (i.e., >5 bp) indels (Indels), the number of misjoins (Misjoins), the feature sensitivity (Sens), and the feature specificity (Spec).
Figure 2. Dotplot validation of the longest scaffolds produced by Allpaths-LG and MSR-CA on the Staphylococcus aureus dataset.
Figures A and B show the dotplot of the longest scaffolds produced by Allpaths-LG and MSR-CA against the reference genome.
Figure 3. Dotplot validation of the longest scaffolds produced by CABOG on the Rhodobacter sphaeroides dataset.
Dotplot validation of the longest scaffold produced by CABOG on the Rhodobacter sphaeroides dataset. The green lines represent the features identified by [Formula: see text].
Figure 4. FRCurve computed on Assemblathon 2 entries.
Figure A shows FRCurves for all the features, while Figure B shows the FRCurves plotted on a single feature (i.e., High Spanning Paired Ends).