Chuming Chen¹, Sari S Khaleel², Hongzhan Huang¹, Cathy H Wu¹.
Abstract
BACKGROUND: When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets.
Keywords: De novo assembly; Illumina; Next-generation sequencing; Perl; Reference-based assembly; Trimming
Year: 2014 PMID: 24955109 PMCID: PMC4064128 DOI: 10.1186/1751-0473-9-8
Source DB: PubMed Journal: Source Code Biol Med ISSN: 1751-0473
Comparison with other publicly available pre-processing tools
| Perl | 454, Illumina | FastQ, Illumina QSEQ | Yes | Yes | Yes | Yes: 3'-end, quality window and filter out low quality reads | FastQ | Yes |
| Perl | 454, Illumina¹ | FastQ, FastA (+ .qual) | Yes | Yes | Yes | Yes: filter out low quality reads | FastA (+.qual), FastQ | Yes |
| C/C++ | Non-specific | FastQ², FastA (not .qual) | No | No | No | Yes: filter out low quality reads | FastA, FastQ | No |
| Perl | Non-specific³ | FastA (+ .qual), Phred | No | No | No | Yes: filter out low quality reads | FastA (+.qual) | Yes |
| Python⁴ | 454, Illumina, SOLID⁵ | FastQ, SOLID’s cs.FastA + cs.FastA.qual | No | No | No | Yes: filter out low quality reads | FastQ, SOLID’s cs.FastA + cs.FastA.qual | No |
| C++⁶ | Illumina | FastQ⁶ | No | No | No | Yes: quality window | FastQ | No |
| Perl | Illumina | FastQ | Yes | No | No⁷ | Yes: quality window and filter out low quality reads | FastQ | Yes |
| C/C++⁸ | Illumina | FastQ | Yes | No | No | Yes: quality window | FastQ | Yes |
| C/C++⁸ | Illumina | FastQ | No | No | Yes, but only 3' | No | FastQ | Yes |
| Java | Illumina | FastQ | Yes | Yes | Yes | Yes: quality window and filter out low quality reads | FastQ | Yes |
¹ NGS QC’s IlluQC only works for FastQ files, and 454QC only works for FastA (+.qual) files [15].
² FASTX toolkit does not accept multi-line FastQ files and requires reformatting to one-line FastQ files using the provided tools [26].
³ While SeqTrim isn’t platform-specific, it can only take FastA files (with/without .qual and chromatogram) [25].
⁴ Most of CutAdapt is written in Python, but the alignment algorithm was written in C for speed [14].
⁵ CutAdapt was designed with RNA-Seq technology in mind [14].
⁶ Btrim’s C++ implementation is designed for single reads. The tool website offers an un-optimized Perl script that organizes separately trimmed paired-end files [27].
⁷ SolexaQA does not provide primer/adapter trimming [8].
⁸ Scythe and Sickle require Zlib (http://www.zlib.net/) [24,28].
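The FASTX toolkit footnote above distinguishes multi-line from one-line FastQ. As a minimal sketch of what that reformatting involves (this is not the FASTX toolkit's own converter; the function names `unwrap_fastq` and `to_one_line` are hypothetical), the following Python reads line-wrapped records and emits four-line records. It assumes well-formed input in which the quality string has the same length as the sequence, which is what lets it tell a quality line starting with `@` from a new header.

```python
def unwrap_fastq(lines):
    """Yield (header, seq, qual) from possibly line-wrapped FastQ lines."""
    it = (ln.rstrip("\r\n") for ln in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for ln in it:
            if ln.startswith("+"):
                break
            seq_parts.append(ln)
        seq = "".join(seq_parts)
        # Quality lines run until they cover the whole sequence; a quality
        # line may itself start with '@', so length is the only safe guide.
        qual = ""
        while len(qual) < len(seq):
            ln = next(it, None)
            if ln is None:
                raise ValueError(f"truncated record: {header!r}")
            qual += ln
        if len(qual) != len(seq):
            raise ValueError(f"quality/sequence length mismatch: {header!r}")
        yield header, seq, qual


def to_one_line(record):
    """Format one record as a four-line ("one-line") FastQ entry."""
    header, seq, qual = record
    return "\n".join([header, seq, "+", qual])


if __name__ == "__main__":
    wrapped = ["@r1", "ACGT", "ACGT", "+", "IIII", "@III", "@r2", "AC", "+r2", "#I"]
    for rec in unwrap_fastq(wrapped):
        print(to_one_line(rec))
```

Counting quality characters rather than scanning for the next `@` is the design choice that matters here, because `@` is a valid Phred+33 quality symbol.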
Short read sequence pre-processing algorithms
| Category | Algorithm | Description |
| Sequencing Artifacts Removal | 5adpt | Detects (using exact or approximate matching) sequencing artifacts listed in an input file and removes them. |
| | rmHP | Removes homopolymer sequences. |
| QSEQ Specific Methods | qseq0 | Removes QSEQ reads with “Failed_Chastity” filter flags. |
| | qseqB | Removes reads with more than a certain number of "B"-scored bases. |
| Reads with “N” Bases Removal/Splitting | nperc | Filters out reads with un-called “N” bases exceeding a percentage cutoff. |
| | ncutoff | Filters out reads with un-called “N” bases exceeding a number cutoff. |
| | nsplit | Searches and removes “N” bases, then splits the read around the removed “N” bases into two smaller daughter reads. |
| Quality Score Based Trimming | LQR | Removes “low quality” reads using quality score cutoff or percent cutoff. |
| | Mott | Quality-window extraction (trim both the 5'- and 3'-ends of a read). |
| | TERA | Trims low quality-score bases from the 3'-ends of reads based on their running average quality scores. |
| 5'/3'-end Bases Trimming | 3end | Trims bases from 3'-end of a read. |
| | 5end | Trims bases from 5'-end of a read. |
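To make a few of the table's descriptions concrete, here is a minimal Python sketch of the "N"-base filters, the N-splitting step, fixed-length end trimming, and a running-average 3'-end trim. It is an illustration written from the one-line descriptions only, not the tool's Perl implementation. All function names, parameter names, and the Phred+33 offset are assumptions, and the running-average rule shown is just one plausible reading of the TERA description.

```python
def passes_nperc(seq, max_percent):
    """nperc-like filter: keep a read whose 'N' percentage is within the cutoff."""
    if not seq:
        return True
    return 100.0 * seq.upper().count("N") / len(seq) <= max_percent


def passes_ncutoff(seq, max_n):
    """ncutoff-like filter: keep a read with at most max_n 'N' bases."""
    return seq.upper().count("N") <= max_n


def nsplit(seq, qual, min_len=1):
    """nsplit-like step: drop 'N' bases and return the flanking pieces.

    Assumption: every N-free piece of at least min_len bases is kept, so a
    read with a single run of N yields two daughter reads.
    """
    pieces, start = [], None
    for i, base in enumerate(seq + "N"):  # sentinel 'N' flushes the last piece
        if base in "Nn":
            if start is not None and i - start >= min_len:
                pieces.append((seq[start:i], qual[start:i]))
            start = None
        elif start is None:
            start = i
    return pieces


def trim_ends(seq, qual, n5=0, n3=0):
    """5end/3end-like trimming: remove n5 bases from the 5'-end, n3 from the 3'-end."""
    end = max(n5, len(seq) - n3)  # never let the slice end wrap around
    return seq[n5:end], qual[n5:end]


def trim_3prime_running_average(seq, qual, cutoff, offset=33):
    """Running-average 3' trim (one plausible reading of TERA, an assumption).

    Walk inward from the 3'-end, tracking the mean quality of the bases seen
    so far; stop at the first base that lifts the mean to the cutoff and trim
    everything after it.
    """
    total, trimmed = 0, 0
    for k, ch in enumerate(reversed(qual), start=1):
        total += ord(ch) - offset
        if total / k >= cutoff:
            break
        trimmed = k
    keep = len(seq) - trimmed
    return seq[:keep], qual[:keep]


if __name__ == "__main__":
    print(nsplit("ACGNNTTA", "IIIIIIII"))
    print(trim_ends("ACGTACGT", "IIIIIIII", n5=2, n3=3))
    print(trim_3prime_running_average("ACGTA", "III&#", cutoff=15))
```

Each function takes and returns plain strings so the steps can be chained in any order, which mirrors the table's presentation of the methods as independent options.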
The descriptions of raw short-read sequences used in the evaluation experiments
| 6239 | 559292 | 83334 | |
| 100.3 M | 12.2 M | 5.5 M | |
| 7* | 17* | 1 | |
| SRR065390 | SRR449310 | SRR957847 | |
| Illumina Genome Analyzer II | Illumina HiSeq 2000 | Illumina MiSeq | |
| WGS | WGS | WGS | |
| Genomic | Genomic | Genomic | |
| Paired | Paired | Paired | |
| 100 | 76 | 150 | |
| 356 | 230 | 350 | |
| 33,808,546 | 1,898,259 | 2,241,778 | |
| 6,761,709,200 | 288,535,368 | 672,533,400 | |
| 29.49 | 34.17 | 33.12 | |
| 1,902,576 (2.81%) | 167,669 (4.42%) | 76,598 (1.71%) | |
| 67.4x | 23.7x | 122.3x | |
| 35 | 39 | 50 |
*The mitochondrial chromosome is included.
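The row labels of this table did not survive extraction, but several rows are arithmetically linked. The sketch below assumes (and this is an inference, not something the table states) that the rows are genome size in bases, read length, number of read pairs, total bases, and sequencing depth, and checks that total bases = pairs × 2 × read length and depth = total bases / genome size. The dictionary keys are hypothetical names; the numbers are copied from the table.

```python
# Values copied from the table; the meaning of each row is an assumption.
DATASETS = {
    "SRR065390": {"genome_bp": 100_300_000, "read_len": 100, "pairs": 33_808_546},
    "SRR449310": {"genome_bp": 12_200_000, "read_len": 76, "pairs": 1_898_259},
    "SRR957847": {"genome_bp": 5_500_000, "read_len": 150, "pairs": 2_241_778},
}


def total_bases(pairs, read_len):
    """Bases in a paired-end run: two reads per pair, read_len bases each."""
    return pairs * 2 * read_len


def coverage(pairs, read_len, genome_bp):
    """Average sequencing depth: total sequenced bases over genome size."""
    return total_bases(pairs, read_len) / genome_bp


if __name__ == "__main__":
    for run, d in DATASETS.items():
        bases = total_bases(d["pairs"], d["read_len"])
        depth = coverage(d["pairs"], d["read_len"], d["genome_bp"])
        print(f"{run}: {bases:,} bases, {depth:.1f}x")
```

Under these assumptions the computed totals match the 6,761,709,200, 288,535,368, and 672,533,400 entries, and the depths round to the listed 67.4x, 23.7x, and 122.3x.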
Multi-threading performance
| 1 | 385.567 | 9.86 | NA | NA |
| 8 | 58.733 | 26.49 | 47.950 | 10.650 |
| 16 | 36.850 | 44.93 | 26.433 | 10.283 |
| 32 | 25 | 81.96 | 14.333 | 10.550 |
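The column headers of this table are missing from the extraction. Assuming the first column is the number of threads and the second is total run time (units unknown), parallel speedup and efficiency can be derived as in this small sketch; the function name `speedup_table` is hypothetical and the remaining columns are left uninterpreted.

```python
# (threads, run time) pairs copied from the first two columns of the table.
# Interpreting them this way is an assumption; the headers were lost.
RUNS = [(1, 385.567), (8, 58.733), (16, 36.850), (32, 25.0)]


def speedup_table(runs):
    """Return (threads, speedup, efficiency) relative to the 1-thread run."""
    baseline = dict(runs)[1]
    rows = []
    for threads, elapsed in runs:
        speedup = baseline / elapsed       # how many times faster than 1 thread
        efficiency = speedup / threads     # fraction of ideal linear scaling
        rows.append((threads, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for threads, speedup, efficiency in speedup_table(RUNS):
        print(f"{threads:>2} threads: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
```

Read this way, the run is about 6.6 times faster at 8 threads and about 15.4 times faster at 32 threads, so efficiency drops from roughly 0.82 to roughly 0.48 as threads are added.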