aRNApipe: a balanced, efficient and distributed pipeline for processing RNA-seq data in high-performance computing environments.

Arnald Alonso¹,², Brittany N Lasseigne¹, Kelly Williams¹, Josh Nielsen¹, Ryne C Ramaker¹,³, Andrew A Hardigan¹,³, Bobbi Johnston¹, Brian S Roberts¹, Sara J Cooper¹, Sara Marsal², Richard M Myers¹.

Abstract

SUMMARY: The wide range of RNA-seq applications and their high computational needs require the development of pipelines that orchestrate the entire workflow and optimize usage of available computational resources. We present aRNApipe, a project-oriented pipeline for processing RNA-seq data in high-performance cluster environments. aRNApipe is highly modular and can be easily migrated to any high-performance computing (HPC) environment. The applications currently included in aRNApipe combine the essential RNA-seq primary analyses, including quality control metrics, transcript alignment, count generation, transcript fusion identification, alternative splicing and sequence variant calling. aRNApipe is project-oriented and dynamic, so users can easily update analyses to include or exclude samples or enable additional processing modules. Workflow parameters are easily set using a single configuration file that provides centralized tracking of all analytical processes. Finally, aRNApipe incorporates interactive web reports for sample tracking and a tool for managing the genome assemblies available to perform an analysis.

AVAILABILITY AND DOCUMENTATION: https://github.com/HudsonAlpha/aRNAPipe ; DOI: 10.5281/zenodo.202950.

CONTACT: rmyers@hudsonalpha.org.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017; PMID: 28108448; PMCID: PMC5447234; DOI: 10.1093/bioinformatics/btx023

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

Quantification of RNA transcripts by next-generation sequencing technologies continues to increase in both throughput and capabilities as sequencing becomes more affordable and accessible (McGettigan, 2013). Unlike gene expression microarrays, RNA-seq not only quantifies gene expression levels but also measures alternative splicing, transcript fusions, and RNA sequence variants (Finotello and Di Camillo, 2015; Koboldt et al., 2012; Maher et al., 2009). This broad spectrum of applications has fostered development of a rich set of bioinformatics methods focused on each processing stage (Conesa et al., 2016). Current applications for primary analysis of RNA-seq data usually cover a single processing step, involve complex dependencies between processing stages, and depend on the sequencing protocol performed (see Supplementary Section S1). Consequently, there is an increasing need for tools that orchestrate the analysis workflow to ensure repeatability of RNA-seq data processing. In addition to the need for data processing integration, the computational requirements of some RNA-seq analysis steps are a bottleneck (Scholz et al., 2012), and the use of high-performance computing (HPC) clusters is unavoidable. Because HPC clusters are a valuable and often limited resource, tools integrating RNA-seq processing stages must be carefully designed and optimized. Considering these challenges, we developed a balanced, efficient and distributed pipeline for RNA-seq data analysis: aRNApipe (automated RNA-seq pipeline). This pipeline was optimized to efficiently exploit HPC clusters and to scale from tens to thousands of RNA-seq libraries, and it includes modules yielding a complete RNA-seq primary analysis.

2 Methods

aRNApipe is designed to overcome the challenges of integration, synchronization and reporting of RNA-seq data analysis by using a project-oriented and balanced design optimized for HPC clusters (Fig. 1).

Fig. 1. aRNApipe workflow for primary analysis of RNA-seq data

The core application of aRNApipe (Supplementary Section S1) includes six operating modes: (i) executing a new analysis, (ii) updating a previous analysis to include new samples or enable new modules, (iii) showing analysis progress, (iv) building a project skeleton, (v) showing available genome builds, and (vi) stopping an ongoing analysis.

Input data: aRNApipe requires two input files: (i) the analysis configuration and (ii) the list of samples to include in the analysis. In the configuration file, the user sets the execution parameters, which include enabling processing modules and designating their arguments, assigning computational resources to modules, and selecting the reference genome build.

Current applications: aRNApipe currently includes applications covering the main variations of RNA-seq data generation (Supplementary Section S2). Throughout the workflow, a main daemon process manages pipeline execution (i.e. inter-dependencies between applications) and monitors analysis of each sample at each stage (Fig. 1). First, low-quality reads/bases and adapter sequences can be filtered. Then, a second stack of applications is run in parallel, including assessment of raw data quality, transcript alignment and quantification, and identification of gene fusions. The main process then launches a third stack of analyses, including quantification of genes and exons, conversion of SAM files to sorted BAM files, and assessment of alignment quality. Finally, a fourth stack of applications, including the variant calling and alternative splicing modules, is run.

Report generation: The Spider is an aRNApipe add-on module that generates interactive web reports summarizing an aRNApipe analysis. These reports review each module, including sample quality control metrics (Supplementary Figs S1 and S2). The user can also observe the computational resources used by each module and access all logs generated during analysis (Supplementary Fig. S3).
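The staged execution described under Current applications can be pictured as a small dependency graph in which each stack contains the modules whose prerequisites have already finished. The following is a minimal sketch of that idea, not aRNApipe's actual code: the module names, the dependency table and the `execution_stacks` helper are all hypothetical, and the exact dependencies between modules are assumed for illustration only.

```python
# Hypothetical module names and dependencies, loosely following the four
# stacks described in the text; these are NOT aRNApipe's real identifiers.
DEPENDENCIES = {
    "trimming": [],
    "raw_qc": ["trimming"],
    "alignment": ["trimming"],
    "transcript_quantification": ["trimming"],
    "fusion_detection": ["trimming"],
    "gene_exon_counts": ["alignment"],
    "sam_to_sorted_bam": ["alignment"],
    "alignment_qc": ["alignment"],
    "variant_calling": ["sam_to_sorted_bam"],
    "alternative_splicing": ["sam_to_sorted_bam"],
}


def execution_stacks(dependencies, enabled):
    """Group enabled modules into stacks that can each run in parallel.

    Simplification (an assumption of this sketch): a dependency on a
    disabled module is treated as already satisfied.
    """
    enabled = set(enabled)
    remaining = set(enabled)
    done, stacks = set(), []
    while remaining:
        ready = sorted(
            module for module in remaining
            if all(dep in done or dep not in enabled
                   for dep in dependencies[module])
        )
        if not ready:
            raise ValueError("circular dependency among modules")
        stacks.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return stacks


if __name__ == "__main__":
    stacks = execution_stacks(DEPENDENCIES, DEPENDENCIES)
    for number, stack in enumerate(stacks, start=1):
        print(number, ", ".join(stack))
```

With every module enabled, the sketch yields four stacks. Disabling the optional trimming step moves the former second stack to the front.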
Additionally, the Spider generates matrix-like count data files of raw counts and RPKMs, together with the corresponding annotation files (i.e. gene identifier and length). Supplementary Section S3 provides a list of generated outputs.

Reference builder: The programs used for RNA-seq data processing often use different formats and standards. To address this problem and to provide a centralized repository of available genome builds, aRNApipe includes a reference builder that generates all required files for a genome build from initial files obtained from sources such as the NCBI and Ensembl repositories (Supplementary Section S4 and Fig. S4).

Implementation: aRNApipe has been developed using Python 2.7 (Supplementary Fig. S5). When running on an HPC cluster, aRNApipe relies on the workload management application to submit jobs for each processing stage, taking into account cross-stage dependencies and custom resource requirements for each stage. aRNApipe has been implemented with the workload management system IBM Platform LSF, but its design allows quick migration to any other workload manager by editing one Python library (Supplementary Section S5). A single-machine version is also provided. A configuration library supplies the paths to all applications used by aRNApipe.
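To illustrate how per-stage resources and cross-stage dependencies might be expressed to LSF, the sketch below builds `bsub` command lines without running them. It is an assumption-laden illustration rather than aRNApipe's own library: the `bsub_command` helper, the job names and the `align.sh`/`count.sh` scripts are hypothetical. Only the standard `bsub` options `-J` (job name), `-n` (number of slots), `-o` (output file) and `-w` (dependency expression) are used.

```python
import shlex


def bsub_command(job_name, command, cores=1, depends_on=(), log_file=None):
    """Build (but do not run) an LSF bsub argument list for one stage."""
    args = ["bsub", "-J", job_name, "-n", str(cores)]
    if log_file:
        args += ["-o", log_file]
    if depends_on:
        # Start only after every listed job has finished successfully.
        condition = " && ".join('done("{}")'.format(name) for name in depends_on)
        args += ["-w", condition]
    args.append(command)
    return args


if __name__ == "__main__":
    # Hypothetical stage scripts and sample name, for illustration only.
    align = bsub_command("S1_align", "align.sh S1", cores=8,
                         log_file="logs/S1_align.log")
    counts = bsub_command("S1_counts", "count.sh S1", cores=1,
                          depends_on=["S1_align"])
    for job in (align, counts):
        print(shlex.join(job))  # shlex.join requires Python 3.8+
```

Keeping command construction in a single function reflects the design point made above: moving to a different workload manager would, in this sketch, mean rewriting only that function.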

3 Results

We have extensively tested aRNApipe and used it to analyze hundreds of RNA-seq libraries with multiple configurations, including different species, different genome builds and different RNA-seq protocols. The reports generated for four example datasets can be accessed online (http://arnapipe.bitbucket.org): (i) strand-specific paired-end RNA-seq data from nine human samples of different tissues (GSE69241), (ii) unstranded single-end data from two paired normal and colorectal tumor tissues (GSE29580), (iii) unstranded paired-end data from 11 melanoma samples (GSE20156) and 1 prostate cancer cell line (NCIH660), and (iv) in-house unstranded paired-end data from 20 zebrafish libraries.

4 Conclusions

aRNApipe provides an integrated and efficient workflow for analyzing single-end and stranded or unstranded paired-end RNA-seq data. Unlike previous pipelines, aRNApipe is focused on HPC environments, and the independent designation of computational resources at each stage allows optimization of HPC resources (see Supplementary Section S1). The application is highly flexible because of its project configuration and management options. The Spider provides functional reports for the user at all analytical stages, and the reference builder is a valuable genome build manager. Implementation of this pipeline allows users to quickly and efficiently complete primary RNA-seq analysis.

References

1. Matthew B Scholz, Chien-Chi Lo, Patrick S G Chain. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr Opin Biotechnol, 2011-12-09.

2. Daniel C Koboldt, Qunyuan Zhang, David E Larson, Dong Shen, Michael D McLellan, Ling Lin, Christopher A Miller, Elaine R Mardis, Li Ding, Richard K Wilson. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res, 2012-02-02.

3. Paul A McGettigan. Transcriptomics in the RNA-seq era. Curr Opin Chem Biol, 2013-01-02.

4. Francesca Finotello, Barbara Di Camillo. Measuring differential gene expression with RNA-seq: challenges and strategies for data analysis. Brief Funct Genomics, 2014-09-18.

5. Christopher A Maher, Chandan Kumar-Sinha, Xuhong Cao, Shanker Kalyana-Sundaram, Bo Han, Xiaojun Jing, Lee Sam, Terrence Barrette, Nallasivam Palanisamy, Arul M Chinnaiyan. Transcriptome sequencing to detect gene fusions in cancer. Nature, 2009-01-11.

6. Ana Conesa, Pedro Madrigal, Sonia Tarazona, David Gomez-Cabrero, Alejandra Cervera, Andrew McPherson, Michał Wojciech Szcześniak, Daniel J Gaffney, Laura L Elo, Xuegong Zhang, Ali Mortazavi. A survey of best practices for RNA-seq data analysis. Genome Biol, 2016-01-26.
