| Literature DB >> 30783653 |
Yang Liao1,2, Gordon K Smyth1,3, Wei Shi1,4.
Abstract
We present Rsubread, a Bioconductor software package that provides high-performance alignment and read counting functions for RNA-seq reads. Rsubread is based on the successful Subread suite with the added ease-of-use of the R programming environment, creating a matrix of read counts directly as an R object ready for downstream analysis. It integrates read mapping and quantification in a single package and has no software dependencies other than R itself. We demonstrate Rsubread's ability to detect exon-exon junctions de novo and to quantify expression at the level of either genes, exons or exon junctions. The resulting read counts can be input directly into a wide range of downstream statistical analyses using other Bioconductor packages. Using SEQC data and simulations, we compare Rsubread to TopHat2, STAR and HTSeq as well as to counting functions in the Bioconductor infrastructure packages. We consider the performance of these tools on the combined quantification task starting from raw sequence reads through to summary counts, and in particular evaluate the performance of different combinations of alignment and counting algorithms. We show that Rsubread is faster and uses less memory than competitor tools and produces read count summaries that more accurately correlate with true values.Entities:
Mesh:
Year: 2019 PMID: 30783653 PMCID: PMC6486549 DOI: 10.1093/nar/gkz114
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
The main Rsubread functions for read alignment and quantification
| Function | Description |
|---|---|
|
| Create hash table of target genome |
|
| Basic alignment with soft-clipping, for gene-level analyses |
|
| Alignment with identification of exon–exon junctions |
|
| Compute mapping statistics |
|
| Compute count matrix for specified genomic features |
Other Rsubread functions are briefly discussed in the Discussion section.
Figure 1.Run times of read aligners. Each aligner used ten threads to map 15 million 100 bp read-pairs from the SEQC UHRR sample to the human reference genome GRCh38. Rsubread::align is faster than STAR or TopHat2 regardless of whether the full index (align-F) or a gapped index (align-G) is used.
Figure 2.Running time of different quantification tools. Labels under each bar indicate the quantification method and the aligner (in parenthesis) that produced the mapped reads used for counting. Mapped reads were assigned to NCBI RefSeq human genes. featureCounts is the only tool that supports multi-threaded read counting and it was run with four threads.
Gene-level accuracy comparison. The table gives Pearson correlations between true log2-expression levels and log2-FPKM values produced by each workflow. The align-F + featureCounts workflow gives the best correlation in each case
| Workflow | UHRR | HBRR | Simulation |
|---|---|---|---|
| align-F + featureCounts |
|
|
|
| align-G + featureCounts | 0.850 | 0.869 |
|
| STAR + featureCounts | 0.848 | 0.867 | 0.901 |
| STAR + htseq-count | 0.845 | 0.864 | 0.877 |
| STAR + summarizeOverlaps | 0.845 | 0.864 | 0.877 |
| TopHat2 + htseq-count | 0.843 | 0.863 | 0.921 |
| TopHat2 + summarizeOverlaps | 0.843 | 0.863 | 0.921 |
Columns ‘UHRR’ and ‘HBRR’ are for the SEQC UHRR and SEQC HBRR samples respectively. For the SEQC columns, the log2-expression values of 958 genes measured by TaqMan RT-PCR are taken as true values. Column ‘Simulation’ shows simulation results for 28 395 genes. For all columns, an offset of 1 was added to raw gene counts to avoid taking logarithms of zeros.
Exon-level accuracy comparison. The table shows the Pearson correlation between the true log2-FPKM expression values of exons and log2-FPKM values produced by each workflow. The Rsubread workflows give the best correlation with the true values
| Workflow | Correlation |
|---|---|
| subjunc-F + featureCounts |
|
| subjunc-G + featureCounts |
|
| STAR + featureCounts | 0.950 |
| STAR + dexseq_count | 0.927 |
| STAR + countOverlaps | 0.950 |
| TopHat2 + dexseq_count | 0.942 |
| TopHat2 + countOverlaps | 0.980 |
An offset of 1 was added to raw exon counts to avoid taking logarithms of zero values.
Aligner performance in mapping junction reads and reporting exon–exon junctions. Results are based on simulated data
| Junctions | Junction reads | |||||
|---|---|---|---|---|---|---|
| Workflow | Recall | Prec | F1 | Recall | Prec | F1 |
| subjunc-F | 99.80 | 99.53 |
| 95.51 | 98.18 |
|
| subjunc-G | 99.82 | 99.43 | 99.63 | 95.50 | 98.16 | 96.81 |
| STAR | 98.48 | 99.87 | 99.17 | 89.87 | 98.06 | 93.78 |
| TopHat2 | 99.12 | 99.91 | 99.51 | 90.15 | 98.57 | 94.17 |
Column ‘Recall’ gives the percentage of correctly called junctions (or junction reads) out of all junctions (or junction reads) generated in the simulated dataset. Column ‘Precision’ gives the percentage of correctly called junctions (or junction reads) out of all reported junctions (or junction reads). Column ‘F1’ gives the F1 score that is the harmonic mean of precision and recall.