Qian Zhou1, Xiaoquan Su2,3, Gongchao Jing2, Songlin Chen4, Kang Ning5.
Abstract
BACKGROUND: RNA-Seq has become one of the most widely used applications based on next-generation sequencing technology. However, raw RNA-Seq data may have quality issues, which can significantly distort analytical results and lead to erroneous conclusions. Therefore, the raw data must be subjected to rigorous quality control (QC) procedures before downstream analysis. Currently, an accurate and complete QC of RNA-Seq data requires a suite of different QC tools used consecutively, which is inefficient in terms of usability, running time, file usage, and interpretability of the results.
Keywords: Alignment statistics; Contamination identification; Parallel computing; Quality control; RNA-Seq
Year: 2018 PMID: 29444661 PMCID: PMC5813327 DOI: 10.1186/s12864-018-4503-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 The workflow and functions of RNA-QC-Chain. First, reads with sequencing-quality defects are trimmed by Parallel-QC. Second, internal (ribosomal RNA) and external (non-target species) contamination is identified and filtered by rRNA-filter. Finally, multiple statistics based on the alignment results are reported by SAM-stats
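The first step of the workflow, quality trimming, can be illustrated with a minimal sketch. This is not the algorithm used by Parallel-QC, which is not described here; it assumes Phred+33 quality encoding and a simple rule that removes trailing bases below a threshold. The function names, the threshold of 20, and the example read are all hypothetical.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q.

    Illustrative rule only; real QC tools may use sliding windows or
    other criteria.
    """
    scores = phred_scores(qual, offset)
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: eight high-quality bases followed by two poor ones.
read_seq = "ACGTACGTAC"
read_qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_3prime(read_seq, read_qual)
print(trimmed_seq, trimmed_qual)
```

A read whose bases all fall below the threshold is trimmed to an empty string, so a real pipeline would also need a minimum-length filter after trimming.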
Summary of datasets used in this paper
| Name | Type | Species | No. of reads | Read length (bp) | Size (Gb) |
|---|---|---|---|---|---|
| Dataset 1 | real data | microalgae | 7,045,705 | 2 × 100 | 1.4 |
| Dataset 2 | semi-simulated data | Real: Sprague–Dawley rats | 9,809,056 | 51 | 0.9 |
| | | Simulated: yeast | 8,531,300 | 51 | 0.4 |
| Dataset 3 | real data | Human | 25,536,632 | 2 × 75 | 9.6 |
| Dataset 4 | real data | Human | 54,477,454 | 2 × 75 | 16.4 |
Fig. 2 Selected outputs of RNA-QC-Chain for an RNA-Seq dataset of Nannochloropsis (Dataset 1). a Sequencing-quality and rRNA measurements. b External contamination screening by 18S rRNA identification. c Distribution of read coverage on genes. d Distribution of mapped reads over different genomic regions. e Distribution of read coverage over the gene body. f Comparison of running time of parallel and serial computation (speed-up shown by the arrow)
Fig. 3 Contamination identification for a semi-simulated RNA-Seq dataset (Dataset 2) using RNA-QC-Chain. a Eukaryotic species identified by 18S rRNA screening. b Prokaryotic species identified by 16S rRNA screening
Comparison of functions and features of different QC tools for RNA-Seq data
| | RNA-QC-Chain | RSeQC | RNA-SeQC |
|---|---|---|---|
| Functions | | | |
| Quality evaluation of raw reads | Yes | Yes | No |
| Quality trimming of raw reads | Yes | No | No |
| rRNA detection | Yes | Yes | Yes |
| rRNA removal | Yes | No | No |
| Contaminating species identification | Yes | No | No |
| Alignment statistics | Yes | Yes | Yes |
| Features | | | |
| Language | C++ | Python and C | Java |
| Additional input files (beyond commonly required files) | None | A chromosome size file and a BED file | An indexed BAM file, a reference sequence, the index of the reference sequence, and a sequence dictionary |
| Output format | FASTQ/FASTA, TXT, PNG, HTML | PDF, TXT | HTML |
| Usage | One command line for each step | Multiple separate scripts | One command line |
| rRNA reference file | A built-in rRNA database | Requires user to provide | Requires user to provide |
| Visualization dependence | Gnuplot | UCSC Genome Browser or R scripts | N/A |
| Parallel computation | Yes | No | No |
| Running time for Dataset 1 (min) | 8 | 55 | 120 |
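The running times in the last row of the table can be turned into relative ratios with a short calculation. The sketch below uses only the three values from the table. The variable names are illustrative, and the ratios are not a like-for-like benchmark, because the three tools do not perform the same set of functions.

```python
# Running time for Dataset 1 in minutes, copied from the comparison table.
running_time_min = {"RNA-QC-Chain": 8, "RSeQC": 55, "RNA-SeQC": 120}

baseline = running_time_min["RNA-QC-Chain"]

# Ratio of each other tool's running time to that of RNA-QC-Chain.
speedup = {
    tool: minutes / baseline
    for tool, minutes in running_time_min.items()
    if tool != "RNA-QC-Chain"
}

for tool, ratio in speedup.items():
    print(f"{tool}: {ratio:.1f}x the running time of RNA-QC-Chain")
```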
Fig. 4 Running time of RNA-QC-Chain compared with RSeQC and RNA-SeQC on the test datasets. a Total running time for the complete set of functions of each tool. b Running time for the functions shared by the three tools. The RSeQC scripts used in this test were bam_stat.py, geneBody_coverage.py, infer_experiment.py, inner_distance.py, read_duplication.py, read_GC.py, read_distribution.py and split_bam.py
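Alignment statistics of the kind compared above are typically derived from the FLAG field of SAM records. The following is a minimal sketch of such a tally. It is not the implementation of SAM-stats or of any RSeQC script. The flag bit values come from the SAM format specification, while the function name, the counter keys, and the example records are made up for illustration.

```python
from collections import Counter

# Bit values of the SAM FLAG field, as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def flag_stats(sam_lines):
    """Tally basic alignment categories from SAM text lines (hypothetical helper)."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.rstrip("\n").split("\t")[1])
        # Count secondary/supplementary records separately so that each
        # read contributes to the totals only once.
        if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            counts["non_primary"] += 1
            continue
        counts["total"] += 1
        if flag & FLAG_UNMAPPED:
            counts["unmapped"] += 1
            continue
        counts["mapped"] += 1
        if flag & FLAG_REVERSE:
            counts["reverse_strand"] += 1
        if flag & FLAG_DUPLICATE:
            counts["duplicate"] += 1
    return counts


# Made-up records for illustration only.
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tchr1\t200\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r1\t256\tchr2\t300\t0\t10M\t*\t0\t0\t*\t*",
    "r4\t1024\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
stats = flag_stats(example_sam)
print(dict(stats))
```

Reading BAM rather than SAM text would require a library such as pysam, or conversion with samtools. The tallying logic would stay the same.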