Xiaofan Zhou, David Peris, Jacek Kominek, Cletus P Kurtzman, Chris Todd Hittinger, Antonis Rokas.
Abstract
The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms has increased the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges of de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS gives the user great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.
Keywords: de novo assembly; experimental design; genome sequencing; high-throughput sequencing; nonmodel organism; simulation
Year: 2016 PMID: 27638685 PMCID: PMC5100864 DOI: 10.1534/g3.116.034249
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. iWGS workflow. A typical iWGS analysis consists of four steps: (1) data simulation (optional); (2) preprocessing (optional); (3) de novo assembly; and (4) assembly evaluation. iWGS supports both Illumina short reads and PacBio long reads, and a wide selection of assemblers to enable de novo assembly using either or both types of data. Users can start the analysis by simulating data drawn from a reference genome assembly or, alternatively, use real sequencing data as input and skip the simulation step. iWGS, in silico Whole Genome Sequencer and Analyzer; MP, mate-pair; PacBio, Pacific Biosciences; PE, paired-end.
Sequencing strategies (top) and assembly protocols (bottom) evaluated in the three case studies
| Name | Read Type | Parameters for Read Simulation |
|---|---|---|
| LIB1 | Illumina PE | Depth: 50 ×; read length: 100 bp; insert size: 180 ± 9 bp |
| LIB2 | Illumina MP | Depth: 50 ×; read length: 100 bp; insert size: 8000 ± 400 bp |
| LIB3 | Illumina PE | Depth: 50 ×; read length: 250 bp; insert size: 450 ± 23 bp |
| LIB4 | PacBio CLR | Depth: 60 ×; read accuracy: 0.87 ± 0.03; read length: 11,500 ± 8000 bp |
| LIB5 | PacBio CLR | Depth: 10 ×; read accuracy: 0.87 ± 0.03; read length: 11,500 ± 8000 bp |

| Name | Assembler | Sequencing Strategies Used for Assembly |
|---|---|---|
| ILMN1 | ABySS | LIB1, LIB2 (Illumina-only) |
| ILMN2 | ALLPATHS-LG | LIB1, LIB2 (Illumina-only) |
| ILMN3 | MaSuRCA | LIB1, LIB2 (Illumina-only) |
| ILMN4 | SGA | LIB1, LIB2 (Illumina-only) |
| ILMN5 | SOAPdenovo2 | LIB1, LIB2 (Illumina-only) |
| ILMN6 | SPAdes | LIB1, LIB2 (Illumina-only) |
| ILMN7 | Velvet | LIB1, LIB2 (Illumina-only) |
| META | Metassembler | LIB1, LIB2 (Illumina-only) |
| ILMN8 | DISCOVAR | LIB3 (Illumina-only) |
| PACB1 | Celera Assembler | LIB4 (PacBio-only) |
| PACB2 | Canu | LIB4 (PacBio-only) |
| PACB3 | FALCON | LIB4 (PacBio-only) |
| HYBR1 | SPAdes | LIB1, LIB2, LIB5 (Hybrid) |
| HYBR2 | DBG2OLC | LIB1, LIB5 (Hybrid) |
PE, paired-end; MP, mate pair; PacBio, Pacific Biosciences; CLR, continuous long-read.
For HYBR2, SparseAssembler (Ye ) was used to assemble LIB1 into contigs, which were then used as input for DBG2OLC.
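The depth and read-length settings in the table above translate into read counts through the usual coverage relation (depth = total sequenced bases / genome size). The following is a minimal sketch of that arithmetic only; the function name and the 30 Mb genome size are hypothetical, and it does not represent how iWGS or its read simulators compute read numbers internally.

```python
import math


def reads_for_depth(genome_size_bp, depth, read_length_bp, paired=True):
    """Number of reads (or read pairs, if paired=True) needed to reach a target depth.

    Uses depth = total sequenced bases / genome size. A read pair
    contributes two reads of read_length_bp each.
    """
    bases_needed = genome_size_bp * depth
    bases_per_unit = read_length_bp * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_unit)


# Hypothetical 30 Mb genome (assumption for illustration only).
GENOME_SIZE = 30_000_000

# LIB1-like settings: 50x depth, 100 bp paired-end reads.
print(reads_for_depth(GENOME_SIZE, 50, 100, paired=True))  # 7500000 read pairs

# LIB4-like settings: 60x depth, mean read length 11,500 bp, single reads.
print(reads_for_depth(GENOME_SIZE, 60, 11_500, paired=False))
```

For long-read libraries the read length varies widely, so a count based on the mean length is only an approximation of the number of reads a simulator would emit.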
Figure 2. Performance comparison of five representative experimental designs on three Dothideomycetes genomes. The five designs shown include three Illumina-only designs (ILMN2: ALLPATHS-LG, META: Metassembler, and ILMN8: DISCOVAR), the best performing PacBio-only design (PACB2: Canu), and the best performing hybrid design (HYBR2: DBG2OLC) for each genome. The statistics on the assembled fraction of the reference genome, scaffold N50, and largest scaffold size are all after correction for assembly errors using the reference genome as reported by QUAST in GAGE mode. By default, QUAST (in GAGE mode) corrects contigs/scaffolds by breaking them at assembly errors larger than 5 bp. Scaffold N50 and largest scaffold size are shown in log10 scale.
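To make the "corrected" scaffold N50 concrete, here is a minimal sketch of how N50 is computed and how breaking a scaffold at an assembly error lowers it. The scaffold lengths and the breakpoint are made-up values, and `break_at` is a simplified stand-in for illustration; QUAST's actual correction procedure is more involved.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def break_at(length, breakpoints):
    """Split one sequence of the given length at the given positions
    and return the lengths of the resulting pieces."""
    edges = [0, *sorted(breakpoints), length]
    return [b - a for a, b in zip(edges, edges[1:]) if b > a]


# Hypothetical scaffold lengths in kb.
scaffolds = [100, 80, 50, 30, 20]
print(n50(scaffolds))  # 80

# Suppose the 100 kb scaffold contains one assembly error at position 60 kb.
corrected = break_at(100, [60]) + [80, 50, 30, 20]
print(n50(corrected))  # 60
```

The total assembled length is unchanged by the break, but the N50 drops because the longest scaffold no longer counts as one piece.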
Performance of all experimental designs evaluated in case study II
| Nuclear:Mitochondrial Genome Ratio | Complete, Single Contig Assembly of the Mitochondrial Genome | Assembled Fraction of Mitochondrial Genome ≥ 99% | 20% ≤ Assembled Fraction of Mitochondrial Genome < 99% | Assembled Fraction of Mitochondrial Genome < 20% |
|---|---|---|---|---|
| 1:50 (low mitochondrial content) | ILMN1, ILMN6, ILMN8, PACB2, HYBR1, HYBR2 | ILMN7 | ILMN2, ILMN4, ILMN5, PACB1, PACB3 | ILMN3 |
| 1:200 (high mitochondrial content) | PACB2, HYBR1, HYBR2 | ILMN6, ILMN7 | ILMN1, ILMN8, PACB1, PACB3 | ILMN2, ILMN3, ILMN4, ILMN5 |
The de novo assembly generated by each strategy was compared against the reference mitochondrial genome of S. cerevisiae using both QUAST and BLASTN. Unless a single contig was found to represent the complete mitochondrial genome, the assembled fraction of mitochondrial genome was determined based on the number of “missing reference bases” reported by QUAST, and further confirmed by the BLASTN result.
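The binning described above can be sketched as follows: the assembled fraction is derived from the count of missing reference bases, then mapped to one of the four table columns. The function names, the 85,000 bp reference length, and the missing-base counts are hypothetical values chosen for illustration; they are not taken from the study's results.

```python
def assembled_fraction(reference_length_bp, missing_reference_bases):
    """Fraction of the reference covered by the assembly."""
    return 1 - missing_reference_bases / reference_length_bp


def classify(fraction, single_complete_contig=False):
    """Map an assembly to one of the four performance categories."""
    if single_complete_contig:
        return "complete, single contig"
    if fraction >= 0.99:
        return ">= 99%"
    if fraction >= 0.20:
        return "20% to < 99%"
    return "< 20%"


# Hypothetical mitochondrial reference length (assumption for illustration).
REF_LEN = 85_000

for missing in (425, 42_500, 80_000):
    frac = assembled_fraction(REF_LEN, missing)
    print(missing, round(frac, 3), classify(frac))
```

The single-contig category is checked first because, per the note above, the missing-base calculation is only used when no single contig represents the complete mitochondrial genome.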
Summary of top-ranking assemblies generated in case study III
| Organism (Genome Size) | Best Assembly from Each Sequencing Strategy | Scaffold N50 (kb) | Largest Scaffold (kb) | Assembled Fraction of the Reference Genome |
|---|---|---|---|---|
| | ILMN2 | 169.7 | 1,007.9 | 89.1% |
| | ILMN8 | 155.0 | 1,007.7 | 91.8% |
| | PACB2 | 5,107.5 | 13,108.3 | 99.3% |
| | HYBR2 | 279.3 | 1,536.8 | 89.7% |
| | ILMN2 | 307.0 | 1,789.4 | 97.3% |
| | ILMN8 | 266.6 | 2,533.4 | 98.5% |
| | PACB2 | 2,065.7 | 8,552.9 | 99.7% |
| | HYBR2 | 289.3 | 1,412.2 | 97.4% |
| | ILMN2 | 28.0 | 146.2 | 96.7% |
| | ILMN8 | 222.0 | 729.9 | 98.4% |
| | PACB2 | 282.9 | 1,378.8 | 99.6% |
| | HYBR2 | 15.5 | 91.5 | 97.4% |
| | ILMN2 | 167.1 | 641.7 | >100% |
| | ILMN8 | 141.2 | 631.4 | 97.6% |
| | PACB3 | 1,574.7 | 3,355.3 | 97.7% |
| | HYBR2 | 198.4 | 602.1 | 92.2% |
The statistics for the simulation-based analyses of D. melanogaster, A. thaliana, and Pl. falciparum 3D7 are after correction for assembly errors using the reference genome, as reported by QUAST in GAGE mode. By default, QUAST (in GAGE mode) corrects contigs/scaffolds by breaking them at assembly errors larger than 5 bp. The statistics for the real-data-based analysis of Pl. falciparum IT are calculated from the original de novo assemblies.