Sateesh Peri, Sarah Roberts, Isabella R Kreko, Lauren B McHan, Alexandra Naron, Archana Ram, Rebecca L Murphy, Eric Lyons, Brian D Gregory, Upendra K Devisetty, Andrew D L Nelson.
Abstract
Next-generation RNA-sequencing is an incredibly powerful means of generating a snapshot of the transcriptomic state within a cell, tissue, or whole organism. As the questions addressed by RNA-sequencing (RNA-seq) become both more complex and greater in number, there is a need to simplify RNA-seq processing workflows and to make them more efficient, interoperable, and capable of handling both large and small datasets. This is especially important for researchers who need to process hundreds to tens of thousands of RNA-seq datasets. To address these needs, we have developed a scalable, user-friendly, and easily deployable analysis suite called RMTA (Read Mapping, Transcript Assembly). RMTA can easily process thousands of RNA-seq datasets with features that include automated read quality analysis, filters for lowly expressed transcripts, and read counting for differential expression analysis. RMTA is containerized using Docker for easy deployment within any compute environment [cloud, local, or high-performance computing (HPC)] and is available as two apps in CyVerse's Discovery Environment, one for normal use and one specifically designed for introducing undergraduates and high school students to RNA-seq analysis. For extremely large datasets (tens of thousands of FASTQ files), we developed a high-throughput, scalable, and parallelized version of RMTA optimized for launching on the Open Science Grid (OSG) from within the Discovery Environment. OSG-RMTA allows users to use the Discovery Environment for data management, parallelization, and job submission to the OSG, and then to employ the OSG for distributed, high-throughput computing. Alternatively, OSG-RMTA can be run directly on the OSG through the command line. RMTA is designed to be useful for data scientists of any skill level who are interested in rapidly and reproducibly analyzing their large RNA-seq datasets.
Keywords: RNA-seq; bioinformatics; high throughput (-omics) techniques; transcriptomics; workflow
Year: 2020 PMID: 32038716 PMCID: PMC6993073 DOI: 10.3389/fgene.2019.01361
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. RNA-sequencing (RNA-seq) data deposited in the National Center for Biotechnology Information's (NCBI) Sequence Read Archive (SRA). SRA run information associated with transcriptomic analyses was downloaded and sorted by year deposited. The amount of RNA-seq data deposited per year, in tera base pairs (Tbp, 1E+12), is shown with the gray line and plotted on the left y-axis. The number of experiments deposited per year, in thousands, is shown with the black line and plotted on the right y-axis.
Deployment options for read mapping and transcript assembly (RMTA).
| Platform | App Name | Size of Datasets That Can Be Handled | Data Storage Available | Genome Services Available |
|---|---|---|---|---|
| DE | RMTA v2.5.1.2 | 1–100 | Yes | Yes |
| DE | OSG-RMTA v2.5.1.2 | 100–1000s | Yes | Yes |
| DE | RMTA-Instructional | 1–10 | Yes | Yes |
| Local | RMTA in Docker | Restricted to user capacity | No | No |
| OSG* | OSG-RMTA | 100–1000s | No | No |
Platforms include the Discovery Environment (DE), a local computer or high-performance computing center, and the Open Science Grid (OSG). *Users wishing to use the OSG outside of the DE will need their own OSG account.
Figure 2. Read mapping and transcript assembly (RMTA) workflow with suggested downstream analyses. The standard RMTA workflow consists of read mapping by either HISAT2 or Bowtie 2, transcript assembly by StringTie, comparison of the assembly to the reference annotation by Cuffcompare to identify novel transcripts, and then read counting by featureCounts. Several optional features are included, such as quality control of RNA-sequencing (RNA-seq) data with FastQC, filtering of lowly expressed transcripts, and removal of duplicate reads (Bowtie 2 only). The outputs are listed and are ready for downstream analyses such as those shown.
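The order of the steps named in this caption can be sketched as a list of command lines. This is a minimal illustration under stated assumptions, not RMTA's actual invocation: the function name, file names, and output layout are hypothetical, the intermediate `samtools sort` step is an assumption (the caption does not mention it), and only the HISAT2 paired-end path is shown. The commands are built but never executed.

```python
import shlex
from pathlib import Path


def build_pipeline_commands(sample, reads_1, reads_2, index_prefix,
                            annotation_gtf, out_dir, threads=4):
    """Return ordered command lines for one paired-end sample (hypothetical helper).

    Nothing is run here; each entry is an argument list suitable for subprocess.run.
    """
    out = Path(out_dir)
    sam = str(out / f"{sample}.sam")
    bam = str(out / f"{sample}.sorted.bam")
    assembly = str(out / f"{sample}.gtf")
    counts = str(out / f"{sample}.counts.txt")
    return [
        # 1. Read mapping with HISAT2 (Bowtie 2 is the other aligner named in the caption).
        ["hisat2", "-p", str(threads), "-x", index_prefix,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # Assumed intermediate step: coordinate-sort the alignments for StringTie.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # 2. Transcript assembly with StringTie, guided by the reference annotation.
        ["stringtie", bam, "-G", annotation_gtf, "-o", assembly, "-p", str(threads)],
        # 3. Compare the assembly to the reference annotation with Cuffcompare.
        ["cuffcompare", "-r", annotation_gtf, "-o", str(out / sample), assembly],
        # 4. Count reads per annotated feature with featureCounts (-p: paired-end).
        ["featureCounts", "-p", "-T", str(threads),
         "-a", annotation_gtf, "-o", counts, bam],
    ]


if __name__ == "__main__":
    # Hypothetical inputs, printed as shell-quoted command lines for inspection.
    for cmd in build_pipeline_commands("sample1", "sample1_1.fastq", "sample1_2.fastq",
                                       "genome_index", "reference.gtf", "results"):
        print(shlex.join(cmd))
```

Keeping each step as an argument list makes it straightforward to hand the commands to `subprocess.run(cmd, check=True)` one at a time, so a failed step stops the chain before later tools see incomplete input.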
Mapping rates and time to completion for the example read mapping and transcript assembly (RMTA) analyses.
| Mapping rate > 90% | Mapping rate 90–75% | Mapping rate 75–50% | Mapping rate < 50% | Gbp mapped | Mbp/minute |
|---|---|---|---|---|---|
| 63% | 16% | 9% | 12% | 863 | 45 |
| 76% | 15% | 5% | 4% | 406 | 26 |
RMTA was used to process 100 paired-end (PE) and 1,000 single-end (SE) Arabidopsis sequence read archives (SRAs). The percentages of these SRAs with mapping rates of >90%, 90–75%, 75–50%, and <50% are shown. Gbp = 1 × 10^9 base pairs mapped. Mbp/minute = million base pairs mapped per minute.
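The percentage columns in the table above can be derived from per-SRA mapping rates by binning. The following is a minimal sketch of that bookkeeping; the function names and the example rates are hypothetical, and the treatment of the bin boundaries (exactly 90% falls in the 90-75% bin, exactly 75% and 50% fall in the higher bin) is an assumption, since the source does not specify it.

```python
from collections import Counter

# Bin labels follow the column headings of the mapping-rate table.
BINS = ("> 90%", "90-75%", "75-50%", "< 50%")


def mapping_rate_bin(rate):
    """Assign one mapping rate (in percent) to a bin; boundary handling is assumed."""
    if rate > 90:
        return "> 90%"
    if rate >= 75:
        return "90-75%"
    if rate >= 50:
        return "75-50%"
    return "< 50%"


def summarize_mapping_rates(rates):
    """Return the percentage of datasets falling in each bin, rounded to integers."""
    if not rates:
        raise ValueError("at least one mapping rate is required")
    counts = Counter(mapping_rate_bin(r) for r in rates)
    return {label: round(100 * counts[label] / len(rates)) for label in BINS}


if __name__ == "__main__":
    # Hypothetical per-SRA mapping rates, not values from the study.
    example_rates = [95.2, 91.0, 99.0, 93.3, 97.8, 92.1, 88.4, 76.5, 60.0, 42.1]
    print(summarize_mapping_rates(example_rates))
```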
Figure 3. Examples of downstream analyses facilitated by the read mapping and transcript assembly (RMTA) workflow. The output generated by RMTA is immediately useful for the analyses that typically follow an RNA-sequencing experiment. (A) EPIC-CoGe screenshot of Arabidopsis root and flower RNA-sequencing (RNA-seq) data processed by RMTA, highlighting a gene, AT1G01280, that is highly expressed in flower tissue but not in roots. (B) Principal component analysis (PCA) of 100 Arabidopsis sequence read archives (SRAs), generated in R using the read count output file from RMTA. (C) Comparison of the length of Evolinc-identified long non-coding RNAs (lncRNAs) relative to other nuclear and organellar genes. PC, protein-coding gene; Mt, mitochondria; Cp, chloroplast. (D) Comparison of the GC content of Evolinc-identified lncRNAs relative to other nuclear and organellar genes. (E) EPIC-CoGe visualization of the expression of a locus identified by Evolinc as a lncRNA. The boundaries of the lncRNA and its orientation have been added to the EPIC-CoGe screenshot.