Abstract
RNA sequencing (RNA-seq) is currently the standard method for genome-wide expression profiling. RNA-seq reads often need to be mapped to a reference genome before read counts can be produced for genes. Read trimming methods have been developed to assist read mapping by removing adapter sequences and low-sequencing-quality bases. It is, however, unclear what impact read trimming has on the quantification of RNA-seq data, an important task in RNA-seq data analysis. In this study, we used a benchmark RNA-seq dataset and simulation data to assess the impact of read trimming on mapping and quantification of RNA-seq reads. We found that adapter sequences can be effectively removed by the read aligner via 'soft-clipping' and that many low-sequencing-quality bases, which would be removed by read trimming tools, were rescued by the aligner. Accuracy of gene expression quantification using untrimmed reads was found to be comparable to or slightly better than that using trimmed reads, based on Pearson correlation with reverse transcriptase-polymerase chain reaction data and simulation truth. Total data analysis time was reduced by up to an order of magnitude when read trimming was not performed. Our study suggests that read trimming is a redundant process in the quantification of RNA-seq expression data.
Year: 2020 PMID: 33575617 PMCID: PMC7671312 DOI: 10.1093/nargab/lqaa068
Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268
Percentages of mapped read bases with or without read trimming prior to mapping
| Method | UHRR (%) | HBRR (%) |
|---|---|---|
| No trimming + Subread | 86.4 | 85.5 |
| Trimmomatic–adapters and SW + Subread | 82.4 | 81.7 |
| Trimmomatic–adapters and MI + Subread | 83.2 | 82.3 |
| TrimGalore + Subread | 85.1 | 84.2 |
Subread was used for mapping of untrimmed or trimmed reads.
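The percentages in the table above depend on how mapped and soft-clipped bases are counted in the aligner output. The following is a minimal sketch of such a tally from SAM CIGAR strings, not the procedure used in the study. It assumes that bases under the `M`, `=`, `X` and `I` operations count as mapped, that soft-clipped (`S`) bases count as unmapped, and that the denominator is supplied by the caller; the function names `count_bases` and `percent_mapped_bases` are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Assumption: read bases under these operations are treated as "mapped".
MAPPED_OPS = {"M", "=", "X", "I"}


def count_bases(cigar):
    """Return (mapped_bases, soft_clipped_bases) for one CIGAR string."""
    mapped = 0
    clipped = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in MAPPED_OPS:
            mapped += int(length)
        elif op == "S":
            # Soft-clipped bases stay in the read but are not aligned.
            clipped += int(length)
    return mapped, clipped


def percent_mapped_bases(cigars, total_read_bases):
    """Percentage of read bases that are mapped.

    `cigars` holds one CIGAR string per read; "*" marks an unmapped read.
    `total_read_bases` is the denominator chosen by the caller (assumption:
    the number of bases in the reads before any trimming).
    """
    mapped = sum(count_bases(c)[0] for c in cigars if c != "*")
    return 100.0 * mapped / total_read_bases


if __name__ == "__main__":
    # Three hypothetical 100 bp reads: fully mapped, 12 bases soft-clipped
    # (for example an adapter at the 3' end), and unmapped.
    example = ["100M", "88M12S", "*"]
    print(round(percent_mapped_bases(example, 300), 1))
```

Passing the denominator explicitly keeps the choice visible: dividing by the base count of the untrimmed reads and dividing by the base count of the trimmed reads give different percentages for the same alignments.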
Correlation of trimmed and untrimmed RNA-seq data with the TaqMan RT-PCR data
| Method | UHRR (100 bp PE) | HBRR (100 bp PE) | UHRR (50 bp SE) | HBRR (50 bp SE) |
|---|---|---|---|---|
| No trimming + Subread | 0.851 | 0.870 | 0.848 | 0.870 |
| Trimmomatic–adapters and SW + Subread | 0.850 | 0.870 | 0.848 | 0.869 |
| Trimmomatic–adapters and MI + Subread | 0.850 | 0.871 | 0.849 | 0.869 |
| TrimGalore + Subread | 0.850 | 0.870 | 0.849 | 0.869 |
Shown are the Pearson correlation coefficients between the log2 expression values of 949 genes measured by the TaqMan RT-PCR technique and the RNA-seq expression levels of the same genes generated by each method (log2-RPKM). ‘100 bp PE’ in the table denotes the 100 bp paired-end SEQC dataset. First reads (R1 reads) in this dataset were extracted and truncated to 50 bp to generate the 50 bp single-end dataset used here (‘50 bp SE’).
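The comparison described in this note can be sketched as follows, using the standard definitions of RPKM and the Pearson correlation coefficient. This is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, and the `prior_count` added before taking logarithms (to avoid log2 of zero) is an assumption that the text does not specify.

```python
import math


def log2_rpkm(count, gene_length_bp, total_mapped_reads, prior_count=0.5):
    """log2 of reads per kilobase of gene per million mapped reads.

    `prior_count` is an assumed offset that keeps the logarithm defined
    for genes with zero reads; set it to 0 for the plain RPKM formula.
    """
    rpkm = (count + prior_count) * 1e9 / (gene_length_bp * total_mapped_reads)
    return math.log2(rpkm)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up toy values for four genes; these are NOT data from the study.
    counts = [120, 15, 900, 60]
    lengths = [2000, 1500, 3000, 1000]
    rtpcr_log2 = [5.1, 2.9, 7.4, 5.0]
    total = 1_000_000
    rnaseq_log2 = [log2_rpkm(c, l, total) for c, l in zip(counts, lengths)]
    print(pearson(rnaseq_log2, rtpcr_log2))
```

In practice the two vectors would hold one value per gene for the genes measured by both platforms, matched by gene identifier before the correlation is computed.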
Figure 1. Time cost of different methods running on a UHRR RNA-seq dataset that includes 15 million 100 bp read pairs. All software tools were run with eight CPU threads. Input data to the trimming and mapping tools were in gzipped FASTQ format, which is the standard format of data generated by Illumina sequencers.