Xiaoshuang Liu, Zhenhe Yan, Chao Wu, Yang Yang, Xiaomin Li, Guangxin Zhang.
Abstract
BACKGROUND: Next-generation sequencing technology is developing rapidly, and the vast amount of data it generates must be preprocessed before downstream analysis. However, software that can efficiently perform all quality assessment and filtering of raw data is still lacking.
Keywords: Adapter removing; NGS; Quality control
Year: 2019 PMID: 31208325 PMCID: PMC6580563 DOI: 10.1186/s12859-019-2936-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Tools developed for processing next-generation sequencing data
| Software | Major functions | Programming language |
|---|---|---|
| FastQC | Quality check | Java |
| PIQA | Quality check | R, C++ |
| FASTX-Toolkit | A collection of tools to filter low-quality reads and remove adapters | C, C++ |
| Fqtools | FASTQ file manipulation, such as validating FASTQ files and trimming reads | C |
| seqtk | Toolkit for processing sequences in FASTA/Q formats, such as format conversion and subsampling of sequences | C |
| PRINSEQ | Filter, reformat and trim sequences | Perl, R |
| MultiQC | Aggregate results across many samples into a single report | Python |
| NGS QC Toolkit | Filter low-quality reads and remove adapters | Perl |
| fastp | Filter low-quality reads and trim adapters | C++ |
| FaQCs | Filter low quality reads and trim adapters | C++ |
Fig. 1 Statistical graphs provided in an HTML report. (a) Overview of the range of quality values across all bases at each position in a FASTQ file. (b) Proportion of each base at each position in the FASTQ files before and after quality control. (c) Percentages of reads filtered by different criteria. ‘+’ indicates reads that were filtered by more than one criterion. (d) Number of reads containing adapter sequence at different numbers of overlapping bases. (e) Percentages of each base in the FASTQ files. (f) Length distribution of sequences
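To make the statistics behind panels (a) and (b) concrete, the following is a minimal sketch of how per-position mean quality and per-position base proportions can be computed from FASTQ records. It is not FastProNGS code: the function name `per_position_stats`, the Phred+33 quality encoding, and the strict four-line record layout are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import io


def per_position_stats(handle, phred_offset=33):
    """Return (mean_quality, base_fraction), both keyed by 0-based read position.

    Assumes four-line FASTQ records and Phred+33 quality encoding.
    """
    qual_sum = defaultdict(int)        # summed Phred scores per position
    depth = defaultdict(int)           # number of reads covering each position
    base_counts = defaultdict(Counter) # base tallies per position
    while True:
        header = handle.readline()
        if not header:                 # end of input
            break
        seq = handle.readline().strip()
        handle.readline()              # '+' separator line
        qual = handle.readline().strip()
        for pos, (base, q) in enumerate(zip(seq, qual)):
            qual_sum[pos] += ord(q) - phred_offset
            depth[pos] += 1
            base_counts[pos][base] += 1
    mean_quality = {p: qual_sum[p] / depth[p] for p in depth}
    base_fraction = {
        p: {b: n / depth[p] for b, n in base_counts[p].items()} for p in depth
    }
    return mean_quality, base_fraction


# Two hypothetical 4-base reads, used only to exercise the function.
example = "@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nII5!\n"
mean_q, base_frac = per_position_stats(io.StringIO(example))
for pos in sorted(mean_q):
    print(pos, mean_q[pos], base_frac[pos])
```

Running the same function on the input and output FASTQ files would give the before and after views described for panel (b). A production tool would stream compressed input and use fixed-size arrays rather than dictionaries.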
Run-time efficiency of tools for processing next-generation sequencing data
| Tool | Run-time (min)<sup>c</sup> | rss (MB)<sup>d</sup> | vmem (MB)<sup>e</sup> | Average CPU utilization<sup>f</sup> | rchar (GB)<sup>g</sup> | wchar (GB)<sup>h</sup> | Run-time/CPU (min)<sup>i</sup> | Process .gz<sup>j</sup> |
|---|---|---|---|---|---|---|---|---|
| NGS QC Toolkit | 344 | 3062 | 3400 | 3.312 | 88 | 54 | 1139 | R + W |
| FASTX-Toolkit | 308 | 10 | 29 | 1.556 | 60 | 46 | 479 | W |
| PRINSEQ | 252 | 20 | 115 | 1.173 | 177 | 99 | 296 | – |
| FastQC | 7 | 395 | 2150 | 2.403 | 6 | 0 | 17 | R |
| Cutadapt<sup>a</sup> | 4 | 670 | 1645 | 6.548 | 56 | 75 | 26 | R |
| Cutadapt<sup>b</sup> | 58 | 675 | 1659 | 1.605 | 84 | 84 | 93 | R + W |
| FastProNGS<sup>b</sup> | 3 | 7168 | 7291 | 6.396 | 6 | 22 | 19 | R + W |
| FastProNGS<sup>a</sup> | 13 | 7066 | 7270 | 6.008 | 6 | 4 | 78 | R |
| FaQCs | 27 | 121 | 644 | 3.876 | 7 | 27 | 105 | R |
| fastp<sup>a</sup> | 8 | 723 | 1167 | 4.614 | 7 | 5 | 36 | R + W |
| fastp<sup>b</sup> | 7 | 720 | 1110 | 5.421 | 7 | 21 | 40 | R |
All tools were installed locally and run against the test data set
<sup>a</sup> Output files are gzip-compressed
<sup>b</sup> Output files are not compressed
<sup>c</sup> Minimum task execution time (minutes)
<sup>d</sup> Mean real memory (resident set) size of the process (MB)
<sup>e</sup> Mean virtual memory size of the process (MB)
<sup>f</sup> Average number of CPUs utilized by the process
<sup>g</sup> Number of bytes the process read (GB)
<sup>h</sup> Number of bytes the process wrote (GB)
<sup>i</sup> Minimum task execution time using one CPU (minutes)
<sup>j</sup> Process .gz indicates whether the tool natively read (R) or wrote (W) compressed files in the test. ‘–’ indicates that it neither read nor wrote compressed files
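The record does not state how the run-time/CPU column was derived. One reading that fits the tabulated numbers is that it is roughly the run-time multiplied by the average number of CPUs utilized. The sketch below checks that assumption against values copied from the table; the helper name `single_cpu_minutes` and the row labels are illustrative only.

```python
# (run-time in minutes, average CPUs utilized, reported run-time/CPU in minutes),
# copied from the table above. "gz" marks runs with gzip-compressed output.
rows = {
    "NGS QC Toolkit": (344, 3.312, 1139),
    "FASTX-Toolkit": (308, 1.556, 479),
    "PRINSEQ": (252, 1.173, 296),
    "FastQC": (7, 2.403, 17),
    "Cutadapt (gz)": (4, 6.548, 26),
    "Cutadapt": (58, 1.605, 93),
    "FastProNGS": (3, 6.396, 19),
    "FastProNGS (gz)": (13, 6.008, 78),
    "FaQCs": (27, 3.876, 105),
    "fastp (gz)": (8, 4.614, 36),
    "fastp": (7, 5.421, 40),
}


def single_cpu_minutes(run_time_min, avg_cpus):
    """Estimate single-CPU time as wall-clock minutes times mean CPUs used.

    This is an assumed relationship, not a formula given in the source.
    """
    return run_time_min * avg_cpus


for tool, (run_time, cpus, reported) in rows.items():
    estimate = single_cpu_minutes(run_time, cpus)
    print(f"{tool:16s} estimated {estimate:7.1f} min, reported {reported:5d} min")
```

Under this assumption the estimate matches the reported value to within a minute for most rows. The two fastp rows differ by a few minutes, which would be consistent with the displayed run-times being rounded to whole minutes, although the source does not confirm this.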
Fig. 2 Time costs of different preprocessing tools. Cutadapt_gz and Cutadapt differ only in whether the output files are compressed: Cutadapt_gz indicates that the output files were gzip-compressed. (a) Time cost of different tools using multiple threads. (b) Time cost of different tools when only one CPU was used
Fig. 3 Resources used by different preprocessing tools. (a) Number of CPUs. (b) Mean virtual memory size (vmem) and real memory (resident set) size (rss). (c) Mean number of bytes read and written