Chien-Chi Lo, Patrick S G Chain.
Abstract
BACKGROUND: Next generation sequencing (NGS) technologies, which parallelize the sequencing process and produce thousands to hundreds of millions of sequences in a single run, have revolutionized genomic and genetic research. Because of the vagaries of each platform's sequencing chemistry, experimental processing, machine failure, and so on, the quality of sequencing reads is never perfect and often declines as the read is extended. These errors invariably affect downstream analyses and applications and should therefore be identified early to mitigate any unforeseen effects.
Year: 2014 PMID: 25408143 PMCID: PMC4246454 DOI: 10.1186/s12859-014-0366-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Features comparison for various QC tools
%FaQCs records a minimum of five bases of quality scores from both ends.
*Uses a separate program/script to generate the result.
#Module for Galaxy platform.
$A separate web version is required.
Figure 1. FaQCs flowchart. Input FASTQ files are first checked for their quality-encoding format, then split into a set (pile) of files, each a subset of the original input. Each file is processed independently, with the processes managed by the Parallel::ForkManager Perl module. A global data structure stores the results returned from each parallel process. All reports are then merged, and a processed FASTQ file is output along with a series of detailed graphics in PDF format.
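The split, process-in-parallel, and merge pattern described in the Figure 1 caption can be sketched in a few lines. The Python below is illustrative only and is not FaQCs code (FaQCs is written in Perl and uses Parallel::ForkManager). The function names, the chunk size, the encoding-detection heuristic, and the use of a thread pool in place of forked processes are all assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def parse_fastq(text):
    """Yield (header, sequence, quality) tuples from 4-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        yield lines[i], lines[i + 1], lines[i + 3]


def guess_offset(records):
    """Assumed heuristic: a quality character below '@' (ASCII 64) implies Phred+33."""
    lowest = min(ord(c) for _, _, qual in records for c in qual)
    return 33 if lowest < 64 else 64


def split_records(records, chunk_size):
    """Split the reads into consecutive chunks of at most chunk_size records."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk, offset):
    """Compute statistics for one chunk, independently of every other chunk."""
    stats = Counter()
    for _, seq, qual in chunk:
        stats["reads"] += 1
        stats["bases"] += len(seq)
        stats["quality_sum"] += sum(ord(c) - offset for c in qual)
    return stats


def run_qc(text, chunk_size=2, workers=4):
    """Check the encoding, split, process chunks concurrently, then merge."""
    records = list(parse_fastq(text))
    offset = guess_offset(records)
    chunks = split_records(records, chunk_size)
    merged = Counter()  # plays the role of the global result structure
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for stats in pool.map(lambda c: process_chunk(c, offset), chunks):
            merged.update(stats)  # Counter.update adds the per-chunk totals
    return offset, merged
```

Because each chunk returns its own small result object, merging reduces to summing counters, which is why the chunks can be processed in any order.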
Figure 2. Boxplots of quality scores. Rectangular boxes show the interquartile range (IQR). The whiskers extend to at most 1.5 × IQR beyond the box. The horizontal line in each box is the median value at that bp position. A horizontal line at quality 20 marks a predicted per-base error rate of 1/100. For easy comparison, FaQCs generates two boxplots side by side: the left panel shows the raw reads and the right panel shows the processed reads. This is only one of the sets of figures generated in the final PDF report (see Additional file 1: Figure S1).
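The quantities behind such a boxplot can be sketched as follows. The standard Phred relation, error probability = 10^(-Q/10), gives the 1/100 error rate at quality 20. The whisker rule used here (the most extreme values within 1.5 × IQR of the box) is the common Tukey convention and is an assumption about how the plot is drawn; the function names are hypothetical and the code is not taken from FaQCs.

```python
import statistics


def phred_to_error(q):
    """Standard Phred relation: error probability = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def boxplot_stats(scores):
    """Quartiles and Tukey-style whisker ends for the scores at one position.

    Needs at least two scores, because statistics.quantiles requires it.
    """
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [s for s in scores if low_fence <= s <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


def per_position_stats(quality_rows):
    """quality_rows holds one list of scores per read; lengths may differ."""
    longest = max(len(row) for row in quality_rows)
    columns = [[row[i] for row in quality_rows if i < len(row)]
               for i in range(longest)]
    return [boxplot_stats(column) for column in columns]
```

Running `per_position_stats` once on the raw reads and once on the processed reads would yield the two sets of values that a side-by-side comparison plots.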
Figure 3. Plots from k-mer profiling. a) The k-mer frequency histogram of the E. coli MiSeq dataset shows an obvious peak in k-mer coverage near 216X (small arrow, inset figure) and a minimum inflection point at ~41X (long arrow, inset figure). K-mers below the minimum inflection point are due to sequencing artifacts and errors. The other small peaks typically indicate repeats in the genome. b) The k-mer rarefaction curve shows a reduction in k-mers after trimming. The blue and red solid lines are the k-mer rarefaction curves of the raw and trimmed E. coli MiSeq data, respectively. The green and beige solid lines are the k-mer rarefaction curves of the raw and trimmed HMP Mock data, respectively. The dashed line represents the baseline where all observed k-mers are distinct.
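Both plots in Figure 3 derive from simple k-mer bookkeeping, sketched below in Python. This is a minimal illustration, not the FaQCs implementation: the function names are hypothetical, the value of k is left to the caller, and reverse-complement (canonical) k-mers are not merged, which a real profiler may well do.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def frequency_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def rarefaction_curve(reads, k):
    """Return (total k-mers sampled, distinct k-mers seen) after each read."""
    seen, total, curve = set(), 0, []
    for read in reads:
        for kmer in kmers(read, k):
            total += 1
            seen.add(kmer)
        curve.append((total, len(seen)))
    return curve
```

In the histogram, error-derived k-mers pile up at low multiplicities, while the rarefaction curve falls below the all-distinct baseline (where the two coordinates are equal) as soon as k-mers start to repeat.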
Data analyzed in this study
| Source | Platform | Raw reads (bases) | Raw mean length (bp) | Raw mean quality | Trimmed reads (bases) | Trimmed mean length (bp) | Trimmed mean quality |
|---|---|---|---|---|---|---|---|
| Illumina* | Illumina MiSeq | 11458940 (1730299940) | 151.00 | 32.32 | 11239533 (1599703670) | 142.33 | 33.70 |
| NCBI SRA SRX687104 | Illumina MiSeq | 26654456 (2692100056) | 101.00 | 29.63 | 20586466 (2006232265) | 97.45 | 33.08 |
| NCBI SRA SRX067313 | Ion Torrent PGM | 783097 (79769722) | 101.86 | 18.32 | 435634 (25445098) | 58.41 | 26.03 |
| NCBI SRA SRX055380 SRX055381 | Illumina GAII | 14494884 (1087116300) | 75.00 | 25.13 | 13502777 | 68.32 | 28.40 |
*http://www.illumina.com/systems/miseq/scientific_data.ilmn.
The comparison of the computational performance using MG1655 MiSeq dataset and HMP Mock GAII dataset
| | | | | | | |
|---|---|---|---|---|---|---|
| 1 | 4 | 8 | 12 | 1 | 1 | 1 |
| 192.60 M | 354.97 M | 568.36 M | 739.09 M | 74.70 M | 4.05 G | 85.87 M |
| 1:59:28 | 0:36:34 | 0:26:16 | 0:14:16 | 0:31:15 | 3:03:29 | 4:20:55 |
| 1:58:22 | 1:59:59 | 2:02:43 | 2:03:10 | 0:30:27 | 2:59:57 | 4:13:07 |
| | | | | | | |
| 1 | 4 | 8 | 12 | 1 | 1 | 1 |
| 189.99 M | 341.71 M | 573.69 M | 738.57 M | 39.08 M | 16.46 G | 38.48 M |
| 1:16:35 | 0:25:12 | 0:17:09 | 0:12:54 | 0:17:04 | 2:47:51 | 1:29:03 |
| 1:14:53 | 1:14:49 | 1:21:49 | 1:24:22 | 0:16:21 | 2:43:26 | 1:28:23 |