Abstract
BACKGROUND: High-throughput next-generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms, whereas in silico samples provide exact truth. To be of the greatest value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible.
Year: 2014 PMID: 25123167 PMCID: PMC4246604 DOI: 10.1186/1756-0500-7-533
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Popular read simulators
| Technology | Supported platforms | Capabilities | Limitations |
|---|---|---|---|
| PBSim | PacBio | Simulates both continuous long reads (CLRs) and circular consensus sequences (CCS); supports sampling-based simulation (in which both length and quality scores are sampled from a real read set) and model-based simulation | Limited insertion and deletion (indel) support |
| FlowSim | Roche 454 | Simulates read length and quality in flow space | No indel support |
| dwgsim | Illumina, IonTorrent | Whole-genome simulator | Uniform read length |
| ART | Roche 454, Illumina Solexa | Read error model, quality profiles | No simulation of indels for short tandem repeats (STRs) |
| Maq | Illumina Solexa | Single nucleotide polymorphism (SNP) simulation | Fixed mutation rate does not model real-world data |
| Grinder | Platform-independent | Shotgun or amplicon read libraries | Limited indel support |
| MetaSim | 454, Illumina, Sanger | Simulation, assembly, mapping | Does not assign quality values to reads |
| GemSim | 454, Illumina | Simulation | Fixed length and mutation rates |
Figure 1. FASTQSim pipeline.
Figure 2. IonTorrent dataset characterization profile.
Figure 3. Illumina dataset characterization profile for human whole blood sample.
Figure 4. Roche dataset characterization profile for human whole blood sample.
Figure 5. PacBio dataset characterization profile for human whole blood sample.
Figure 6. FASTQspike curve fitting for IonTorrent human whole blood sample dataset. Several types of functions are fit to the empirical data, and least squares linear regression is used to determine the curve of best fit.
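The curve-fitting step described in the Figure 6 caption can be illustrated with a minimal sketch. It is not the FASTQspike implementation: the four candidate families (linear, logarithmic, exponential, power), the function names, and the choice to select by the sum of squared errors are all assumptions made for illustration. Each candidate becomes a straight line after a transform of x and/or y, so ordinary least squares linear regression can fit it. The transforms require strictly positive x and y values.

```python
import math

def linear_regression(u, v):
    """Ordinary least squares for v = a + b*u; returns (a, b)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    b = sxy / sxx
    a = mean_v - b * mean_u
    return a, b

# Hypothetical candidate families, each given as
# (transform of x, transform of y, inverse of the y transform).
CANDIDATES = {
    "linear": (lambda x: x, lambda y: y, lambda v: v),
    "logarithmic": (math.log, lambda y: y, lambda v: v),
    "exponential": (lambda x: x, math.log, math.exp),
    "power": (math.log, math.log, math.exp),
}

def fit_best_curve(xs, ys):
    """Fit every candidate and return (name, predict, sse) for the one
    with the smallest sum of squared errors on the original scale."""
    best = None
    for name, (fx, fy, inv) in CANDIDATES.items():
        a, b = linear_regression([fx(x) for x in xs], [fy(y) for y in ys])
        # Bind the fitted values as defaults so each closure keeps its own fit.
        predict = lambda x, a=a, b=b, fx=fx, inv=inv: inv(a + b * fx(x))
        sse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[2]:
            best = (name, predict, sse)
    return best
```

Calling `fit_best_curve(positions, values)` on an empirical profile returns the name of the winning family together with a `predict` function that can be evaluated at new positions. Comparing errors on the original scale, rather than on the transformed scale, keeps the candidates comparable with one another.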
Composition of Illumina dataset generated with FASTQSim for the DTRA metagenomic algorithm challenge
| Organism taxonomy | Number of reads | Number of genes |
|---|---|---|
| Bacteria, Proteobacteria, Gammaproteobacteria, Thiotrichales, Francisellaceae, Francisella, tularensis. [Genbank:NC_008245.1] | 206 | 163 |
| Bacteria, Proteobacteria, Alphaproteobacteria, Rhizobiales, Methylobacteriaceae, Methylobacterium radiotolerans JCM 2831. [Genbank:CP001001.1] | 148 | 110 |
| Bacteria, Proteobacteria, Gammaproteobacteria, Pseudomonadales, Pseudomonadaceae, Pseudomonas aeruginosa PAO1. [Genbank:NC_002516.2] | 201 | 101 |
| Bacteria, Actinobacteria, Actinobacteridae, Actinomycetales, Corynebacterineae, Mycobacteriaceae, Mycobacterium avium complex (MAC). [Genbank:EU854994.1] | 200 | 111 |
| Bacteria, Firmicutes, Bacilli, Lactobacillales, Streptococcaceae, Streptococcus pneumoniae ATCC 700669. [Genbank:NC_011900.1] | 201 | 119 |
| Bacteria, Proteobacteria, Gammaproteobacteria, Legionellales, Legionellaceae, Legionella pneumophila Philadelphia 1. [Genbank:NC_002942.5] | 50 | 37 |
| Human immunodeficiency virus I. [Genbank:NC_001802.1] | 5 | 4 |
Figure 7. Six leading metagenomic algorithms were evaluated on an Illumina dataset generated with FASTQspike. True positive and false positive algorithm calls were compared at the genus, species, and strain levels.
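Because an in silico dataset has a known composition, the comparison in the Figure 7 caption reduces to set arithmetic at each taxonomic level. The sketch below is illustrative only and is not the scoring code used in the evaluation: the tuple layout `(genus, species, strain)`, the function name, and the rule that a call left unresolved at a level is simply not scored at that level are all assumptions.

```python
LEVELS = ("genus", "species", "strain")

def score_calls(truth, calls):
    """Count true and false positive calls at each taxonomic level.

    truth, calls: iterables of (genus, species, strain) tuples. A call may
    use None for ranks it does not resolve; such a call is skipped at those
    levels (an assumption of this sketch).
    Returns {level: (true_positives, false_positives)}.
    """
    truth = list(truth)
    calls = list(calls)
    scores = {}
    for depth, level in enumerate(LEVELS, start=1):
        # Truncate each lineage to the current level before comparing.
        truth_at_level = {t[:depth] for t in truth}
        calls_at_level = {c[:depth] for c in calls if all(c[:depth])}
        tp = len(calls_at_level & truth_at_level)
        fp = len(calls_at_level - truth_at_level)
        scores[level] = (tp, fp)
    return scores
```

Truncating to a level before comparing means a call with the right species but the wrong strain still counts as a true positive at the genus and species levels, and as a false positive only at the strain level.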
Newbler characterization for high coverage IonTorrent dataset
| Reference accession | Num unique matching reads | Pct of all unique matches | Pct of all reads | Pct coverage of reference | Description |
|---|---|---|---|---|---|
| NC_000913.2 | 225797 | 98.04% | 42.2% | 97.50% | Escherichia coli str. K-12 substr. MG1655, complete genome |
| gi\|215104\|gb\|J02459.1 | 2872 | 1.26% | 0.50% | 98.72% | Enterobacteria phage lambda, complete genome, with codon substitution |
| EU496103.1 | 1543 | 0.70% | 0.30% | 100.0% | BioBrick cloning vector pSB3C5-I52001, complete sequence |