Michael I Love, Charlotte Soneson, Peter F Hickey, Lisa K Johnson, N Tessa Pierce, Lori Shepherd, Martin Morgan, Rob Patro.
Abstract
Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. It also makes it more difficult to locate the transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package tximeta that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows, by helping to reduce overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.
Year: 2020 PMID: 32097405 PMCID: PMC7059966 DOI: 10.1371/journal.pcbi.1007664
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Flowchart of Salmon quantification followed by tximeta.
The quantification and import pipeline results in a SummarizedExperiment object with reference transcript provenance metadata added by tximeta (see Design and Implementation). The SummarizedExperiment object contains estimated counts and other relevant metadata, and can be used with downstream statistical packages.
Comparison of tximeta to related software.
| Software | Domain | Ranges | Release | Post hoc |
|---|---|---|---|---|
| tximeta | RNA-seq import | ✓ | ✓ | ✓ |
| tximport | RNA-seq import | | | |
| Arkas* | RNA-seq analysis | ✓ | ✓ | |
| ARMOR† | RNA-seq analysis | ✓ | ✓ | ✓ |
| htseq | RNA-seq counting | | | |
| featureCounts | RNA-seq counting | ✓ | | |
| summarizeOverlaps | RNA-seq counting | ✓ | ✓ | |
| pepkit | Workflow management | - | - | |
| basejump | Metadata utilities | - | - | |
| Refgenie | Genome management | - | - | ✓ |
| CRAM+RefGet | Read alignment | - | - | ✓ |
| CWLProv | Workflow tracing | - | - | ✓ |
Tximeta is compared to related software, grouped by domain. Columns indicate if the transcript or gene ranges are automatically attached to the output of the software, whether the transcriptome and genome release information is automatically attached, and whether post hoc lookup of transcriptome-related metadata is possible. A hyphen (-) indicates that the column is not directly applicable.
*Arkas attaches transcript ranges and release information for Ensembl transcripts only.
†ARMOR imports tximeta for object construction.
Pre-computed reference transcripts checksums as of early 2020.
| Source | Organism | Releases | Transcript sequence file |
|---|---|---|---|
| GENCODE | Human | 23 – 33 | transcripts.fa |
| GENCODE | Mouse | M6 – M24 | transcripts.fa |
| Ensembl | | 76 – 99 | *.cdna.all.fa (NR) |
| Ensembl | | 76 – 99 | *.cdna.all.fa (NR) |
| Ensembl | | 79 – 99 | *.cdna.all.fa (NR) |
| Ensembl | | 76 – 99 | *.cdna.all.fa + *.ncrna.fa |
| Ensembl | | 76 – 99 | *.cdna.all.fa + *.ncrna.fa |
| Ensembl | | 79 – 99 | *.cdna.all.fa + *.ncrna.fa |
| RefSeq† | Human | p1 – p12 | *_rna.fa |
| RefSeq† | Mouse | p2 – p5 | *_rna.fa |
The set of pre-computed checksums spans the stable releases from these sources for the years 2015 to 2019. (NR): not recommended; we recommend combining coding and non-coding transcripts for accurate RNA-seq quantification.
†RefSeq assembly versions p13 (human) and p6 (mouse) are currently the “latest” versions and are subject to sequence updates under the same assembly version, so they are not stable releases.
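
To make the checksum idea concrete, the following is a minimal Python sketch of matching a reference transcriptome against a table of pre-computed checksums. It is not the tximeta implementation, which is an R/Bioconductor package. The function names `sequence_checksum` and `lookup_provenance`, the choice of SHA-256 over concatenated sequences, and the toy sequences and table entry are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, Optional

Provenance = Dict[str, str]


def sequence_checksum(sequences: Iterable[str]) -> str:
    """Hash the transcript sequences in order (illustrative scheme only)."""
    digest = hashlib.sha256()
    for seq in sequences:
        # Normalise whitespace and case so formatting differences do not
        # change the checksum; only the sequence content matters.
        digest.update(seq.strip().upper().encode("ascii"))
    return digest.hexdigest()


def lookup_provenance(
    checksum: str, table: Dict[str, Provenance]
) -> Optional[Provenance]:
    """Return the provenance record for a known checksum, else None."""
    return table.get(checksum)


# Toy reference transcriptome and a hypothetical pre-computed table.
toy_reference = ["ACGTACGT", "GGGTTTAA"]
known_checksums = {
    sequence_checksum(toy_reference): {
        "source": "GENCODE",
        "organism": "Homo sapiens",
        "release": "33",
    }
}

# The same sequences resolve to the stored provenance record.
matched = lookup_provenance(sequence_checksum(toy_reference), known_checksums)

# A single changed base yields a different checksum and therefore no match.
altered_reference = ["ACGTACGA", "GGGTTTAA"]
unmatched = lookup_provenance(sequence_checksum(altered_reference), known_checksums)

print(matched)
print(unmatched)
```

Because the key is derived from the sequence content rather than from a file name, a renamed or relocated reference file still resolves to the same provenance record, while any change to the sequences produces no match.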