Chih-Hao Fang, Yu-Jung Chang, Wei-Chun Chung, Ping-Heng Hsieh, Chung-Yen Lin, Jan-Ming Ho.
Abstract
BACKGROUND: Recent progress in next-generation sequencing technology has afforded several improvements, such as ultra-high throughput at low cost, very high read quality, and substantially increased sequencing depth. State-of-the-art high-throughput sequencers, such as the Illumina MiSeq system, can generate ~15 Gbp of sequencing data per run, with >80% of bases above Q30 and a sequencing depth of up to several thousand-fold for small genomes. The Illumina HiSeq 2500 can generate up to 1 Tbp per run, with >80% of bases above Q30 and often >100x sequencing depth for large genomes. To speed up otherwise time-consuming genome assembly, and/or to quickly obtain a skeleton of the assembly for scaffolding or progressive assembly, methods that remove noise and reduce redundancy in the original data while yielding nearly equal or better assembly results are worth studying.
Year: 2015 PMID: 26678408 PMCID: PMC4682372 DOI: 10.1186/1471-2164-16-S12-S9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
The sequencing datasets used in the experiments.
| Dataset | 1 | 2 | 3 |
|---|---|---|---|
| Species¹ | E. coli | B. cereus | Grouper |
| Genome size | 4.6 Mbp | 5.2 Mbp | ~1.1 Gbp² |
| Read length | 2 × 300 bp | 2 × 300 bp | 2 × 200 bp |
| Mean quality score | 34 | 34 | 35 |
| % Bases with quality score > 30 | 83% | 85% | 92% |
| Depth | 2853x | 2669x | ~110-120x² |
1 The full scientific names of these species are Escherichia coli, Bacillus cereus, and Epinephelus lanceolatus.
2 These values are estimates from ALLPATHS-LG, because a complete reference genome is not yet available.
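
The depth values relate to the other figures by simple arithmetic: depth is approximately the total number of sequenced bases divided by the genome size. The following is a minimal sketch, assuming the 125 Gbp size reported for the original grouper dataset and the ~1.1 Gbp genome size estimate; the function name `sequencing_depth` is illustrative and does not come from the paper.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Approximate depth (coverage) as sequenced bases per genome base."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Grouper: ~125 Gbp of reads over an estimated ~1.1 Gbp genome.
grouper_depth = sequencing_depth(125e9, 1.1e9)
print(f"{grouper_depth:.0f}x")  # falls within the ~110-120x estimate
```

Because the genome size is itself an estimate, the resulting depth should be read as a range rather than an exact value.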
Figure 1. Statistics of minimal quality value for the reads in the E. coli dataset. (a) The percentage of reads for a minimal quality value. (b) The cumulative percentages of (a).
Figure 2. Statistics of minimal quality value for the reads in the B. cereus dataset. (a) The percentage of reads for a minimal quality value. (b) The cumulative percentages of (a).
Figure 3. Statistics of correctness score for the reads in the E. coli dataset. (a) The percentage of reads for a correctness score. (b) The cumulative percentages of (a).
Figure 4. Statistics of correctness score for the reads in the B. cereus dataset. (a) The percentage of reads for a correctness score. (b) The cumulative percentages of (a).
Figure 5. Corrected contig N50 size vs. subset size of the E. coli dataset.
Figure 6. Corrected contig N50 size vs. subset size of the B. cereus dataset.
Figure 7. Statistics of minimal quality value for the paired-end reads (PEs) in the grouper dataset. (a) The percentage of PEs for a minimal quality value. (b) The cumulative percentages of (a).
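
The "minimal quality value" plotted in these figures can be illustrated with a short sketch. It assumes FASTQ quality strings in Phred+33 encoding and assumes that the minimal quality of a PE is the lower of its two mates' minima; both assumptions, and the function names, are illustrative rather than taken from the paper.

```python
def min_quality(quality_string: str, offset: int = 33) -> int:
    """Return the lowest Phred score encoded in a FASTQ quality string."""
    if not quality_string:
        raise ValueError("empty quality string")
    return min(ord(ch) - offset for ch in quality_string)


def min_quality_pair(qual_1: str, qual_2: str, offset: int = 33) -> int:
    """Assumed PE rule: the pair's minimum is the lower of the two mates' minima."""
    return min(min_quality(qual_1, offset), min_quality(qual_2, offset))


# Toy quality strings (Phred+33): 'I' is 40, 'H' is 39, '?' is 30, '5' is 20.
mate_1 = "IIIIH5II"
mate_2 = "IIII?III"
print(min_quality(mate_1), min_quality(mate_2), min_quality_pair(mate_1, mate_2))
```

Tallying this value over all reads (or PEs) gives the kind of distribution shown in panel (a), and its running sum gives panel (b).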
Comparison of the assembly results of PE subset selection for the grouper dataset.¹
| | Original dataset | Selected subset |
|---|---|---|
| Dataset size (Gbp) | 125 | 63 |
| # read pairs | 319,878,932 | 158,651,599 |
| Mean length of reads | 195.3 | 198.6 |
| %GC content of reads | 41.0% | 39.7% |
| # contigs | 39,911 | 53,488 |
| Total contig length | 996,203,993 | 991,109,739 |
| N50 contig size (Kbp) | 82.2 | 43.5 |
| # scaffolds | 3,917 | 4,043 |
| Total scaffold length | 1,076,396,971 | 1,062,462,514 |
| Largest scaffold length | 12,701,604 | 21,777,629 |
| N50 scaffold size (Kbp) (L50 number)² | 3,354 (97 scaffolds) | 5,443 (61 scaffolds) |
| N75 scaffold size (Kbp) (L75 number)² | 1,429 (218 scaffolds) | 2,493 (131 scaffolds) |
| %GC of scaffolds | 41.23% | 41.17% |
| # 'N's | 79,902,759 | 71,510,549 |
| # 'N's per 100 Kbp | 7,423.10 | 6,730.57 |
| # scaffolds for 1 Gbp³ | 482 | 304 |
1 All statistics are based on contigs and scaffolds of size ≥ 1 Kbp.
2 L50/L75 denotes the minimal number of scaffolds that together contain 50%/75% of the bases of the assembly (i.e., of all the scaffolds).
3 The minimal number of scaffolds whose total length is ≥ 1 Gbp.
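
The N50/L50 and N75/L75 statistics in footnote 2, and the scaffold count in footnote 3, can be computed with one pass over the lengths sorted from longest to shortest. The sketch below uses made-up toy lengths, not data from the paper, and the function names are illustrative.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx): Nx is the length of the scaffold at which the running
    total, taken longest first, reaches `fraction` of all bases; Lx is how
    many scaffolds were needed to get there."""
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count


def scaffolds_to_reach(lengths, threshold):
    """Minimal number of scaffolds whose total length is >= threshold,
    or None if the whole assembly is shorter than the threshold."""
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return count
    return None


toy_lengths = [100, 80, 60, 40, 20]  # total = 300
print(nx_lx(toy_lengths, 0.5))   # (80, 2): 100 + 80 >= 150
print(nx_lx(toy_lengths, 0.75))  # (60, 3): 100 + 80 + 60 >= 225
print(scaffolds_to_reach(toy_lengths, 200))  # 3
```

Larger N50/N75 together with smaller L50/L75 indicates that fewer, longer scaffolds cover the same share of the assembly.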