Yu-Chieh Liao, Shu-Hung Lin, Hsin-Hung Lin.
Abstract
Determining the genomic sequences of microorganisms is the basis and prerequisite for understanding their biology and for their functional characterization. While the advent of low-cost, extremely high-throughput second-generation sequencing technologies and the parallel development of assembly algorithms have enabled rapid and cost-effective genome assemblies, such assemblies are often unfinished, fragmented draft genomes because read lengths are short and long repeats are present in multiple copies. Third-generation PacBio sequencing technologies circumvent this problem by greatly increasing read length. Hybrid approaches (ALLPATHS-LG, the PacBio corrected reads (PBcR) pipeline, SPAdes, and SSPACE-LongRead) and non-hybrid approaches (the hierarchical genome-assembly process (HGAP) and the PBcR pipeline via self-correction) have therefore been proposed to exploit PacBio long reads, which can span many thousands of bases, to facilitate the assembly of complete microbial genomes. However, standardized procedures for evaluating and comparing these approaches are currently insufficient. To address this issue, we provide a comprehensive comparison by collecting datasets for a comparative assessment of the above-mentioned five assemblers. In addition to offering explicit and beneficial recommendations to practitioners, this study aims to aid in the design of a paradigm for completing bacterial genome assemblies.
Year: 2015 PMID: 25735824 PMCID: PMC4348652 DOI: 10.1038/srep08747
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Description of the datasets employed in this study
| Data | Organism | Fragment | Jump | Long read | Reference |
|---|---|---|---|---|---|
| D1 | E. coli | 2 × 101 bp, 180 bp insert (SRR447685) | 2 × 93 bp, 3000 bp insert (SRR401827 and SRR492488) | 1–3 Kbp (Ribeiro's ftpᵃ) | NC_000913 |
| D2 | R. sphaeroides | 2 × 101 bp, 180 bp insert (SRR125492) | 2 × 101 bp, 3000 bp insert (SRR388672) | 1–3 Kbp (Ribeiro's ftpᵃ) | NC_007488-90, NC_007493-94, NC_009007-08 |
| D3 | S. pneumoniae | 2 × 101 bp, 180 bp insert (SRR387335) | 2 × 93 bp, 3000 bp insert (SRR364158) | 1–3 Kbp (Ribeiro's ftpᵃ) | NC_003028 |
| D4 | E. coli | 2 × 151 bp, 300 bp insert (Illumina data websiteᵇ) | | | NC_000913 |
| D5 | E. coli | | | 10 Kbp, 17 SMRT cells (SRX255228) | NC_000913 |
| D6 | E. coli | | | 8–10 Kbp, 8 SMRT cells (SRX260475) | NC_000913 |
| D7 | M. ruber | | | 8–10 Kbp, 4 SMRT cells (SRX260496) | NC_013946 |
| D8 | P. heparinus | | | 8–10 Kbp, 7 SMRT cells (SRX260506) | NC_013061 |
| D9 | E. coli | | | PacBio RS II System and P4-C2 chemistryᵉ | NC_000913 |
ᵃ Long reads were downloaded from ftp.broadinstitute.org/pub/papers/assembly/Ribeiro2012/data.
ᵇ Paired reads were provided at http://www.illumina.com/systems/miseq/scientific_data.ilmn.
ᶜ PacBio HDF5 files were requested from the NCBI Sequence Read Archive (SRA).
ᵈ PacBio HDF5 files were downloaded from http://files.pacb.com/software/hgap/index.html.
ᵉ PacBio HDF5 files were downloaded from https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-20kb-Size-Selected-Library-with-P4-C2.
Figure 1. Comparisons of the assemblers conducted in this study.
SSPACE-LongRead is a scaffolder that uses single-molecule long reads to upgrade pre-assembled contigs constructed from short reads. ALLPATHS-LG and SPAdes are hybrid assemblers that take both short reads and long reads as inputs. The PBcR pipeline uses short reads to correct long reads with pacBioToCA and then assembles the corrected long reads (PBcR) with the Celera assembler (runCA). The hierarchical genome-assembly process (HGAP) and the PBcR pipeline via self-correction (PBcR pipeline(S)) take only long reads as input and produce non-hybrid assemblies.
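The grouping described in this legend can be sketched as a small lookup, shown here in Python purely as an illustration. The `Assembler` record, the `ASSEMBLERS` list, and the `approach` and `by_approach` helpers are hypothetical names introduced for this sketch; only the tool names and the read types they consume come from the legend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assembler:
    """Illustrative record; the field names are assumptions, not from the study."""
    name: str
    uses_short_reads: bool
    uses_long_reads: bool
    role: str  # "scaffolder" or "assembler"


# Inputs as described in the Figure 1 legend.
# SSPACE-LongRead consumes short reads indirectly, via pre-assembled contigs.
ASSEMBLERS = [
    Assembler("SSPACE-LongRead", True, True, "scaffolder"),
    Assembler("ALLPATHS-LG", True, True, "assembler"),
    Assembler("SPAdes", True, True, "assembler"),
    Assembler("PBcR pipeline", True, True, "assembler"),
    Assembler("HGAP", False, True, "assembler"),
    Assembler("PBcR pipeline(S)", False, True, "assembler"),
]


def approach(a: Assembler) -> str:
    """Classify a tool by the read types it needs."""
    if a.uses_short_reads and a.uses_long_reads:
        return "hybrid"
    if a.uses_long_reads:
        return "non-hybrid"
    return "short-read only"


def by_approach(assemblers):
    """Group tool names by approach, preserving input order."""
    groups = {}
    for a in assemblers:
        groups.setdefault(approach(a), []).append(a.name)
    return groups
```

Calling `by_approach(ASSEMBLERS)` yields four hybrid entries and two non-hybrid entries, matching the split drawn in the figure.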
Assembly results obtained by ALLPATHS-LG and SPAdes on D1–D3
| Dataset | Input data | ALLPATHS-LG with PacBio: No. of contigs | ALLPATHS-LG with PacBio: N50 (bp) | ALLPATHS-LG without PacBio: No. of contigs | ALLPATHS-LG without PacBio: N50 (bp) | SPAdes with PacBio: No. of contigs | SPAdes with PacBio: N50 (bp) | SPAdes without PacBio: No. of contigs | SPAdes without PacBio: N50 (bp) |
|---|---|---|---|---|---|---|---|---|---|
| D1 | Website data | 1 | 4638970 | 2 | 4631220 | 16 | 692096 | 31 | 555967 |
| D1 | Raw data | 1 | 4625005 | 1 | 4633080 | 28 | 1092719 | 40 | 693826 |
| D1 | Fractional data | 14 | 4638970 | 5 | 4575759 | | | | |
| D1 | 50X coverage | 1 | 4638970 | 3 | 4629108 | | | | |
| D1 | 100X coverage | 1 | 4638970 | 2 | 4638312 | | | | |
| D2 | Website data | 11 | 3188818 | 31 | 3188995 | 57 | 318530 | 114 | 183697 |
| D2 | Raw data | 13 | 3188540 | 57 | 3186675 | 44 | 422736 | 93 | 223105 |
| D2 | Fractional data | 10 | 3188847 | 32 | 1492665 | | | | |
| D2 | 50X coverage | NA | NA | 79 | 99916 | | | | |
| D2 | 100X coverage | 12 | 3188773 | 29 | 2634704 | | | | |
| D3 | Website data | 1 | 2162245 | 4 | 1663585 | 20 | 210016 | 65 | 84287 |
| D3 | Raw data | 5 | 1340620 | 6 | 2135901 | 90 | 365564 | 142 | 81903 |
| D3 | Fractional data | 1 | 2151421 | 4 | 1671738 | | | | |
| D3 | 50X coverage | 2 | 1189234 | 4 | 1675149 | | | | |
| D3 | 100X coverage | 1 | 2150940 | 7 | 1812035 | | | | |
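N50, reported throughout these tables, is conventionally the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal Python sketch of that standard definition follows. The function name `n50` and the toy contig lengths are illustrative and not taken from the study, whose evaluation tool may handle edge cases differently.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths in bp.

    Standard definition: sort contigs from longest to shortest and
    accumulate lengths until at least half of the total is reached;
    the length of the contig that crosses that threshold is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy example (not study data): total = 1000 bp, half = 500 bp.
# 400 alone is below 500; 400 + 300 = 700 reaches it, so N50 = 300.
print(n50([100, 200, 300, 400]))  # prints 300
```

A single-contig assembly has an N50 equal to that contig's length, which is why the complete ALLPATHS-LG assemblies above report N50 values close to the genome size.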
Assembly results obtained from hybrid one short and one long library (D4 + D5)
| SMRT | Hybrid approach | No. contigs | N50 | No. misassemblies | No. N's per 100 Kbp | No. genes | Running time |
|---|---|---|---|---|---|---|---|
| 0 | SPAdes | 86 | 139882 | 2 | 0 | 4399 | 2 h 28 m |
| 1 | PBcR pipeline | 19 | 356974 | 8 | 0.19 | 4473 | Over 12 h |
| 1 | PBcR pipeline (wgs-8.2) | 24 | 564692 | 7 | 0 | | 6 h 8 m |
| 1 | SPAdes | | | | | 4479 | |
| 1 | SSPACE-LongRead | | 2497845 | 9 | 97.89 | 4467 | 2 h 38 m |
| 2 | PBcR pipeline | 17 | 405539 | 7 | | 4466 | Over 12 h |
| 2 | PBcR pipeline (wgs-8.2) | | 981448 | 11 | | | 9 h 52 m |
| 2 | SPAdes | 14 | | | | 4485 | |
| 2 | SSPACE-LongRead | 18 | 1238868 | 10 | 67.67 | 4465 | 2 h 47 m |
| 3 | PBcR pipeline | 15 | 323732 | | | 4467 | Over 12 h |
| 3 | PBcR pipeline (wgs-8.2) | | | 10 | | | 10 h 24 m |
| 3 | SPAdes | 12 | 1241619 | | | 4492 | |
| 3 | SSPACE-LongRead | 16 | 2501081 | 10 | 77.33 | 4476 | 2 h 59 m |
| 4 | PBcR pipeline | 12 | 834736 | 9 | 0.13 | 4456 | Over 12 h |
| 4 | PBcR pipeline (wgs-8.2) | | | 9 | | | 11 h 57 m |
| 4 | SPAdes | 11 | 1750947 | | | 4492 | |
| 4 | SSPACE-LongRead | 15 | 3194637 | 10 | 77.37 | 4477 | 3 h 3 m |
| 17 | PBcR pipeline | 5 | 1215597 | 8 | 0.02 | 4487 | Over 12 h |
| 17 | PBcR pipeline (wgs-8.2) | | | 11 | | | Over 12 h |
| 17 | SPAdes | 6 | 4644452 | | | | |
| 17 | SSPACE-LongRead | 17 | 1238635 | 8 | 91.03 | 4467 | 5 h 12 m |
ᵃ It produced a non-hybrid assembly within 30 min, with a single contig (4.6 Mbp), when using the long reads of 4 SMRT cells.
Evaluation of the non-hybrid assembly on assembly completeness in terms of the number of contigs; triplicate experiments were performed where applicable
| Dataset | Data description | Assembly approach | 4 SMRT cells | 6 SMRT cells | 8 SMRT cells | All SMRT cells |
|---|---|---|---|---|---|---|
| D5 | HGAP | 4, 2, 4 | 2, 2, 3 | Fail | ||
| PBcR pipeline(S) | 1, | 1, 2, 4 | 1 | |||
| D6 | HGAP | 8, 10, 12 | 4, 9, 12 | 7 | ||
| PBcR pipeline(S) | 8, 10, 14 | 1, 1, 4 | ||||
| D7 | HGAP | |||||
| PBcR pipeline(S) | ||||||
| D8 | HGAP | 2, 2, 5 | 2 | |||
| PBcR pipeline(S) | 3, 3, 3 | 1 | ||||
| D9 | HGAP | 1 | ||||
| PBcR pipeline(S) | 1 |
ᵃ HGAP, hierarchical genome-assembly process; PBcR pipeline(S), PacBio corrected reads pipeline via self-correction.
ᵇ Due to memory limitations, we were unable to load the 17 SMRT cells successfully.
Evaluation of the latest PBcR pipeline(S) included in wgs-8.2 on assembly completeness and accuracy in terms of the number of contigs and diagonal-like dot plots
| Dataset | Genome size setting | 3 SMRT | 4 SMRT | 6 SMRT | 8 SMRT | All SMRT |
|---|---|---|---|---|---|---|
| D5 17 SMRT cells | without | |▪ | ▪▪▪ | ▪▪▪ | ▪ | |
| 0.8 | 6□6□|▪ | ▪▪▪ | ▪▪▪ | ▪ | ||
| 0.9 | 6□6□|▪ | ▪▪▪ | ▪▪▪ | ▪ | ||
| 1 | 4□6□|▪ | ▪▪▪ | ▪▪▪ | ▪ | ||
| 1.1 | 5□6□|▪ | ▪▪▪ | ▪▪▪ | ▪ | ||
| 1.2 | 4□6□|▪ | ▪▪▪ | ▪▪▪ | ▪ | ||
| D6 8 SMRT cells | 0.8 | |7□7□10□ | 2□ | |||
| 0.9 | |6□7□7□ | 2□ | ||||
| 1 | |6□6□6□ | 2□ | ||||
| 1.1 | |6□6□6□ | 2□ | ||||
| 1.2 | |3□6□6□ | 4□ | ||||
| D7 4 SMRT cells | without | |▪▪▪▪ | ||||
| 0.8 | ▪ | |||||
| 0.9 | ▪ | |||||
| 1 | ▪ | |||||
| 1.1 | ▪ | |||||
| 1.2 | ▪ | |||||
| D8 7 SMRT cells | without | |▪▪▪ | ▪ | |||
| without | |▪▪▪ | ▪ | ||||
| 0.8 | ▪ | |||||
| 0.9 | ▪ | |||||
| 1 | ▪ | |||||
| 1.1 | ▪ | |||||
| 1.2 | ▪ | |||||
| D9 1 RSII SMRT cell | without | |▪ | ||||
| 0.8 | |▪ | |||||
| 0.9 | |▪ | |||||
| 1 | |▪ | |||||
| 1.1 | |▪ | |||||
| 1.2 | |▪ |
▪: Accurate and complete assembly.
□: Accurate assembly with a diagonal-like dot plot against the reference genome; the number next to the box is the number of contigs in the assembly.
▪: A single but misassembled contig. The vertical bar represents a cutoff of 75X long reads.
ᵃ Command of PBcR pipeline: PBcR -pbCNS -length 500 -partitions 200 genomeSize = 4650000 (for E. coli D5, D6, and D9); genomeSize = 3100000 (for M. ruber D7); genomeSize = 5170000 (for P. heparinus D8).
ᵇ Command of PBcR pipeline: PBcR -length 500 -partitions 200.
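The arithmetic behind the genome size settings and the 75X cutoff can be sketched in Python. This assumes that the settings 0.8 to 1.2 in the table are multipliers applied to the genomeSize values in footnote a, which the source does not state explicitly. The names `BASE_GENOME_SIZE`, `scaled_genome_sizes`, and `bases_for_coverage` are hypothetical; the base sizes are copied from footnote a.

```python
# genomeSize values (bp) as given in footnote a.
BASE_GENOME_SIZE = {
    "D5": 4650000,  # E. coli
    "D6": 4650000,  # E. coli
    "D9": 4650000,  # E. coli
    "D7": 3100000,  # M. ruber
    "D8": 5170000,  # P. heparinus
}

# Assumption: the table's settings are multipliers of the base genome size.
FACTORS = (0.8, 0.9, 1.0, 1.1, 1.2)


def scaled_genome_sizes(dataset):
    """Return {factor: genomeSize in bp} for one dataset under that assumption."""
    base = BASE_GENOME_SIZE[dataset]
    return {factor: round(base * factor) for factor in FACTORS}


def bases_for_coverage(genome_size, coverage=75):
    """Total long-read bases needed to reach a given fold coverage."""
    return genome_size * coverage
```

Under this reading, a setting of 0.8 for D5 corresponds to a genomeSize of 3720000, and the 75X cutoff for a 4650000 bp genome corresponds to 348750000 bases of long reads.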