Giuseppe Narzisi, Bud Mishra.
Abstract
Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled interest in the whole-genome sequence assembly (WGSA) problem, inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics for comparing assembled sequences emphasize only size and poorly capture contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-off between contigs' quality and their sizes. For this purpose, most of the publicly available major sequence assemblers, for both low-coverage long-read (Sanger) and high-coverage short-read (Illumina) technologies, are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read lengths, coverages and accuracies, with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing "next-generation" assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.
Year: 2011 PMID: 21559467 PMCID: PMC3084767 DOI: 10.1371/journal.pone.0019175
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
List of sequence assemblers.
| Name | Read Type | Algorithm | Reference |
| SUTTA | long & short | B&B | Narzisi and Mishra |
| ARACHNE | long | OLC | Batzoglou et al. |
| CABOG | long & short | OLC | Miller et al. |
| Celera | long | OLC | Myers et al. |
| Edena | short | OLC | Hernandez et al. |
| Minimus (AMOS) | long | OLC | Sommer et al. |
| Newbler | long | OLC | 454/Roche |
| CAP3 | long | Greedy | Huang and Madan |
| PCAP | long | Greedy | Huang et al. |
| Phrap | long | Greedy | Green |
| Phusion | long | Greedy | Mullikin and Ning |
| TIGR | long | Greedy | Sutton et al. |
| ABySS | short | SBH | Simpson et al. |
| ALLPATHS | short | SBH | Butler et al. |
| Euler | long | SBH | Pevzner et al. |
| Euler-SR | short | SBH | Chaisson and Pevzner |
| Ray | long & short | SBH | Boisvert et al. |
| SOAPdenovo | short | SBH | Li et al. |
| Velvet | long & short | SBH | Zerbino and Birney |
| PE-Assembler | short | Seed-and-Extend | Ariyaratne and Sung |
| QSRA | short | Seed-and-Extend | Bryant et al. |
| SHARCGS | short | Seed-and-Extend | Dohm et al. |
| SHORTY | short | Seed-and-Extend | Hossain et al. |
| SSAKE | short | Seed-and-Extend | Warren et al. |
| Taipan | short | Seed-and-Extend | Schmidt et al. |
| VCAKE | short | Seed-and-Extend | Jeck et al. |
Reads are defined as “long” if produced by Sanger technology and “short” if produced by Illumina technology. Note that Velvet was designed for micro-reads (e.g. Illumina), but long reads can be supplied as additional input to resolve repeats in a greedy fashion.
Benchmark data.
| Genome | Length (bp) | Num. of reads | Avg. read length (bp) | Std. (bp) | Coverage |
First and second columns report the genome name and length; columns 3 to 6 report the statistics of the shotgun projects: number of reads, average and standard deviation of the read length and genome coverage (region [35,000,001–38,000,000]).
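The coverage column can be related to the other columns by a short calculation. The sketch below assumes the usual definition of shotgun coverage (total sequenced bases divided by genome length); the function name and the example numbers are hypothetical and are not taken from the benchmark table.

```python
def shotgun_coverage(num_reads: int, avg_read_length: float, genome_length: int) -> float:
    """Expected depth of coverage: total sequenced bases / genome length.

    Assumes the conventional definition; the paper's table may round
    or compute this value differently.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * avg_read_length / genome_length


# Hypothetical project: 50,000 reads averaging 600 bp on a 3 Mbp genome.
example_coverage = shotgun_coverage(50_000, 600.0, 3_000_000)
print(example_coverage)  # 10.0
```

Under this definition, doubling either the number of reads or the average read length doubles the coverage.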
Long reads comparison (without mate-pairs).
| Genome | Assembler | # contigs | # big contigs ( | Max (kbp) | Mean big contigs (kbp) | N50 (kbp) | Big contigs coverage (%) |
| | Euler | | | | | | |
| | Minimus | | | | | | |
| | PCAP | | | | | | |
| | PHRAP | | | | | | |
| | SUTTA | | | | | | |
| | TIGR | | | | | | |
| | Euler | | | | | | |
| | Minimus | | | | | | |
| | PCAP | | | | | | |
| | PHRAP | | | | | | |
| | SUTTA | | | | | | |
| | TIGR | | | | | | |
| | Euler | | | | | | |
| | Minimus | | | | | | |
| | PCAP | | | | | | |
| | PHRAP | | | | | | |
| | SUTTA | | | | | | |
| | TIGR | | | | | | |
| | Euler | | | | | | |
| | Minimus | | | | | | |
| | PCAP | | | | | | |
| | PHRAP | | | | | | |
| | SUTTA | | | | | | |
| | TIGR | | | | | | |
Long reads assembly comparison without mate-pair information (clone sizes and forward-reverse constraints). The first and second columns report the genome and assembler name; columns 3 to 7 report the contig size statistics: number of contigs, number of big contigs, maximum contig size, mean size of the big contigs, and N50 size (N50 is the largest number N such that the combined length of all contigs of length ≥ N is at least 50% of the total length of all contigs). Finally, column 8 reports the coverage achieved by the big contigs. Coverage is computed by double-counting overlapping regions of the contigs when aligned to the genome.
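The two definitions in this caption, N50 and double-counted coverage, can be made concrete with a minimal sketch. The function names and example values below are illustrative assumptions; the alignment step that produces the aligned lengths is not shown.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Largest N such that contigs of length >= N hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # only reached for an empty assembly


def double_counted_coverage(aligned_lengths: Iterable[int], genome_length: int) -> float:
    """Coverage (%) where overlapping aligned regions are counted once per contig.

    Because overlaps are double-counted, the result can exceed 100%.
    """
    return 100.0 * sum(aligned_lengths) / genome_length


# Hypothetical contig lengths (bp): total 290, so half is 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 70, since 80 + 70 = 150 >= 145
# Hypothetical aligned lengths on a 3,000 bp genome.
print(double_counted_coverage([1500, 1000, 800], 3000))  # 110.0
```

The second example shows why a coverage value above 100% in such a table does not by itself indicate an error: it may simply reflect contigs that overlap on the reference.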
Figure 1. Feature-Response curve comparison for S. epidermidis.
For this comparison, no mate-pair information was used in the assembly.
Figure 2. Feature-Response curve comparison for Chromosome Y (3 Mbp of p11.2 region).
For this comparison, no mate-pair information was used in the assembly.
Long reads comparison (with mate-pairs).
| Genome | Assembler | # contigs | # big contigs ( | Max (kbp) | Mean big contigs (kbp) | N50 (kbp) | Big contigs coverage (%) |
| | ARACHNE | | | | | | |
| | CABOG | | | | | | |
| | Euler | | | | | | |
| | PCAP | | | | | | |
| | SUTTA | | | | | | |
| | TIGR | | | | | | |
| | ARACHNE | | | | | | |
| | CABOG | | | | | | |
| | Euler | | | | | | |
| | PCAP | | | | | | |
| | SUTTA | | | | | | |
| | TIGR | | | | | | |
| | ARACHNE | | | | | | |
| | CABOG | | | | | | |
| | Euler | | | | | | |
| | PCAP | | | | | | |
| | SUTTA | | | | | | |
| | TIGR | | | | | | |
| | ARACHNE | | | | | | |
| | CABOG | | | | | | |
| | Euler | | | | | | |
| | PCAP | | | | | | |
| | SUTTA | | | | | | |
| | TIGR | | | | | | |
Long reads assembly comparison using mate-pair information. The first and second columns report the genome and assembler name; columns 3 to 7 report the contig size statistics: number of contigs, number of big contigs, maximum contig size, mean size of the big contigs, and N50 size (N50 is the largest number N such that the combined length of all contigs of length ≥ N is at least 50% of the total length of all contigs). Finally, column 8 reports the coverage achieved by the big contigs. Coverage is computed by double-counting overlapping regions of the contigs when aligned to the genome.
Figure 3. Feature-Response curve comparison for S. epidermidis.
For this comparison, mate-pair information was used in the assembly.
Figure 4. Feature-Response curve comparison for Chromosome Y (3 Mbp of p11.2 region).
For this comparison, mate-pair information was used in the assembly.
Figure 5. Dot plots of the assemblies for Staphylococcus epidermidis.
Assemblies generated by ARACHNE, CABOG, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundaries between the assembled contigs represented on the vertical axis. Note that a number of the single dots are artifacts of the sensitivity of the MUMmer alignment tool; they can be reduced or removed by using a larger value for the minimum cluster length parameter --mincluster (default 65).
Figure 6. Feature-Response curve comparison by feature type for S. epidermidis using mate-pairs.
Short reads comparison (without mate-pairs).
| Genome | Assembler | # correct | # mis-assembled | N50 (kbp) | Mean (kbp) | Max (kbp) | Coverage (%) |
| | ABySS | | | | | | |
| (strain MW2) | Edena (strict) | | | | | | |
| | Edena (nonstrict) | | | | | | |
| | EULER-SR | | | | | | |
| | SOAPdenovo | | | | | | |
| | SSAKE | | | | | | |
| | SUTTA | | | | | | |
| | Taipan | | | | | | |
| | Velvet | | | | | | |
| | ABySS | | | | | | |
| (strain Sheeba) | Edena (strict) | | | | | | |
| | Edena (nonstrict) | | | | | | |
| | EULER-SR | | | | | | |
| | SOAPdenovo | | | | | | |
| | SSAKE | | | | | | |
| | SUTTA | | | | | | |
| | Taipan | | | | | | |
| | Velvet | | | | | | |
Short reads assembly comparison without mate-pair information. The first and second columns report the genome and assembler name; columns 3 to 7 report the contig statistics: number of correct contigs, number of mis-assembled contigs, N50 size, mean contig size, and maximum contig size (N50 is the largest number N such that the combined length of all contigs of length ≥ N is at least 50% of the total length of all contigs). Finally, column 8 reports the coverage achieved by all the contigs.
Short reads comparison (with mate-pairs).
| Genome | Assembler | # correct | # mis-assembled (mean kbp) | N50 (kbp) | Mean (kbp) | Max (kbp) | Coverage (%) |
| | ABySS | | | | | | |
| (K12 MG1655) | Edena | | | | | | |
| | EULER-SR | | | | | | |
| | SOAPdenovo | | | | | | |
| | SSAKE | | | | | | |
| | SUTTA | | | | | | |
| | Taipan | | | | | | |
| | Velvet | | | | | | |
Short reads assembly comparison using mate-pair information. The first and second columns report the genome and assembler name; columns 3 to 7 report the contig statistics: number of correct contigs, number of mis-assembled contigs (with their mean size in kbp), N50 size, mean contig size, and maximum contig size (N50 is the largest number N such that the combined length of all contigs of length ≥ N is at least 50% of the total length of all contigs). Finally, column 8 reports the coverage achieved by all the contigs.
Figure 7. Feature-Response curve comparison for E. coli.
For this comparison, mate-pair information was used in the assembly.
Figure 8. Algorithm pseudo-code for computing the Feature-Response curve.
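The pseudo-code of Figure 8 is not reproduced on this page, so the following is only a rough sketch of one way a Feature-Response curve could be computed, not the AMOS tool's implementation. It assumes that each contig carries a count of suspicious features, that contigs are tallied from largest to smallest while the running feature count stays within a threshold, and that the response is the approximate genome coverage of the tallied contigs. All names and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed contig representation: (length in bp, number of features found on it).
Contig = Tuple[int, int]


def feature_response_curve(
    contigs: Sequence[Contig],
    genome_length: int,
    thresholds: Sequence[int],
) -> List[Tuple[int, float]]:
    """Return (feature threshold, coverage %) points, one per threshold."""
    # Largest contigs first, so size is traded against accumulated features.
    ordered = sorted(contigs, key=lambda contig: contig[0], reverse=True)
    curve = []
    for phi in thresholds:
        total_features = 0
        covered_bases = 0
        for length, features in ordered:
            if total_features + features > phi:
                break  # the next contig would exceed the feature budget
            total_features += features
            covered_bases += length
        curve.append((phi, 100.0 * covered_bases / genome_length))
    return curve


# Hypothetical assembly of a 1,000 bp genome.
example = [(500, 3), (300, 0), (200, 1)]
print(feature_response_curve(example, 1000, [0, 3, 4]))
# [(0, 0.0), (3, 80.0), (4, 100.0)]
```

In this reading, a curve that rises steeply at low thresholds corresponds to an assembly whose large contigs carry few features, which is the quality-versus-size trade-off the abstract describes.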