Phelelani T Mpangase, Jacqueline Frost, Mohammed Tikly, Michèle Ramsay, Scott Hazelhurst.
Abstract
The rate of raw sequence production through Next-Generation Sequencing (NGS) has been growing exponentially due to improved technology and reduced costs. This has enabled researchers to answer many biological questions through "multi-omics" data analyses. Even though such data promise new insights into how biological systems function and into disease mechanisms, computational analyses performed on such large datasets come with their own challenges and potential pitfalls. The aim of this study was to develop a robust, portable and reproducible bioinformatic pipeline for the automation of RNA sequencing (RNA-seq) data analyses. Using Nextflow as a workflow management system and Singularity for application containerisation, the nf-rnaSeqCount pipeline was developed for mapping raw RNA-seq reads to a reference genome and quantifying the abundance of identified genomic features for differential gene expression analyses. The pipeline provides a quick and efficient way to obtain a matrix of read counts that can be used with tools such as DESeq2 and edgeR for differential expression analysis. Robust and flexible bioinformatic and computational pipelines for RNA-seq data analysis, from QC to sequence alignment and comparative analyses, will reduce analysis time and increase the accuracy and reproducibility of findings, thereby promoting transcriptome research.
Keywords: RNA-seq; bioinformatics; container; nextflow; pipelines; reproducible; singularity; workflows
Year: 2021 PMID: 35574063 PMCID: PMC9097006 DOI: 10.18489/sacj.v33i2.830
Source DB: PubMed Journal: S Afr Comput J ISSN: 1015-7999
Figure 1: Summary of resources and best practices for development, maintenance, sharing and publishing of reproducible and portable pipelines.
Development of reproducible pipelines starts on individual desktop machines using Nextflow (Di Tommaso et al., 2017), Singularity (Kurtzer et al., 2017) and Git (https://git-scm.com/). A pipeline repository can be created on GitHub (https://github.com/) to track version changes. SingularityHub (https://singularity-hub.org/) or DockerHub (https://hub.docker.com/) can be used to create and archive containers, with builds triggered by a GitHub push. The pipeline can then be cloned onto an HPC cluster or cloud services for analyses on a larger scale.
Figure 2: Overall summary of the nf-rnaSeqCount pipeline.
The nf-rnaSeqCount pipeline works in 4 stages: (1) Data Preparation: for downloading Singularity containers and indexing the reference genome using STAR and Bowtie; (2) Quality Control: for assessing the quality of RNA-seq reads using FastQC and trimming low-quality bases using Trimmomatic; (3) Alignment & Quantification: for aligning reads to the reference genome using STAR and quantifying the abundance of identified genomic features using featureCounts and htseq-count; (4) MultiQC: for assessing the quality of the steps in the pipeline using MultiQC. The main outputs of the nf-rnaSeqCount pipeline are the read count matrices produced by featureCounts and htseq-count, as well as a QC report from MultiQC.
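To make the "read count matrix" output concrete, the following is a minimal sketch, not part of the pipeline itself, of how a featureCounts table could be loaded for downstream use. It assumes the standard featureCounts text layout (a leading `#` comment line, six annotation columns, then one count column per sample). The function name `read_count_matrix` and the example gene names, sample names and values are hypothetical.

```python
import csv
import io

# Annotation columns that featureCounts writes before the per-sample counts.
ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def read_count_matrix(handle):
    """Return (sample_names, {gene_id: [counts...]}) from featureCounts text output."""
    # Skip the leading '#' comment line that records the program and command.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    if header[:len(ANNOTATION_COLUMNS)] != ANNOTATION_COLUMNS:
        raise ValueError("unexpected header: %r" % header)
    samples = header[len(ANNOTATION_COLUMNS):]
    counts = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        counts[row[0]] = [int(value) for value in row[len(ANNOTATION_COLUMNS):]]
    return samples, counts


# Hypothetical two-sample, two-gene table (made-up values, not real data).
EXAMPLE = (
    "# Program:featureCounts; Command:(elided)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "geneX\tchr1\t100\t500\t+\t401\t12\t7\n"
    "geneY\tchr2\t900\t1500\t-\t601\t0\t33\n"
)

samples, counts = read_count_matrix(io.StringIO(EXAMPLE))
print(samples)
print(counts["geneY"])
```

A genes-by-samples structure of this kind is what tools such as DESeq2 and edgeR expect as input, after the annotation columns have been dropped.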
Figure 3: nf-rnaSeqCount and Rsubread performance benchmarking.
The nf-rnaSeqCount pipeline (top row) was compared to the Rsubread package (bottom row) in terms of time (1st column), memory (2nd column) and CPU usage (3rd column) when performing the standard RNA-seq workflow, i.e., indexing (red), read alignment (green) and read counting (blue).
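The record does not state how time, memory and CPU usage were collected for this comparison. As an illustration only, the sketch below shows one way such per-step measurements could be taken on a Unix system with the Python standard library. The helper name `measure` and the placeholder command are assumptions, not the authors' benchmarking method.

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run a command and return its wall time, child CPU time and peak RSS."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # CPU time (user + system) consumed by child processes during this call.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    # ru_maxrss is the largest peak resident set size of any waited-for child
    # so far; it is reported in kilobytes on Linux and in bytes on macOS.
    return {"wall_seconds": wall, "cpu_seconds": cpu, "peak_rss": after.ru_maxrss}


# Placeholder workload standing in for an indexing, alignment or counting step.
result = measure([sys.executable, "-c", "sum(range(100000))"])
print(result)
```

Because `ru_maxrss` tracks the largest child seen so far, each step would need to be measured in a fresh parent process to obtain independent peak-memory figures.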