Nicholas J Eagles, Emily E Burke, Jacob Leonard, Brianna K Barry, Joshua M Stolz, Louise Huuki, BaDoi N Phan, Violeta Larios Serrato, Everardo Gutiérrez-Millán, Israel Aguilar-Ordoñez, Andrew E Jaffe, Leonardo Collado-Torres.
Abstract
BACKGROUND: RNA sequencing (RNA-seq) is a widespread biological assay, and an increasing amount of data is generated with it. In practice, a researcher must perform a large number of individual steps before raw RNA-seq reads yield directly valuable information, such as differential gene expression data. Existing software tools are typically specialized, performing only one step of a larger workflow, such as alignment of reads to a reference genome. The demand for a more comprehensive and reproducible workflow has led to the production of a number of publicly available RNA-seq pipelines. However, we have found that most require computational expertise to set up or share among several users, are not actively maintained, or lack features we have found to be important in our own analyses.
Keywords: Bioconductor; Pipeline; RNA-seq
Year: 2021 PMID: 33932985 PMCID: PMC8088074 DOI: 10.1186/s12859-021-04142-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 An example samples.manifest. The samples.manifest file for paired-end samples is composed of five tab-separated columns: (1) path to the first FASTQ file in the pair, (2) optional md5 signature for the first FASTQ file in the pair, (3) path to the second FASTQ file in the pair, (4) optional md5 signature for the second FASTQ file in the pair, (5) sample ID. The first two entries use the same sample ID, which is useful when a biological sample was sequenced in multiple lanes and thus generated multiple FASTQ files. The first two pairs of FASTQ files will be merged
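The five-column layout described in Fig. 1 can be illustrated with a short Python sketch that groups FASTQ pairs by sample ID, so that entries sharing an ID are collected together as candidates for merging. This is not SPEAQeasy's own parsing code: the function name `read_paired_manifest`, the file names, and the `0` placeholder used for the optional md5 columns are assumptions made for illustration.

```python
import csv
from collections import defaultdict
from io import StringIO


def read_paired_manifest(handle):
    """Group paired-end FASTQ files by sample ID from a 5-column manifest."""
    samples = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or all(not field.strip() for field in row):
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"expected 5 tab-separated columns, got {len(row)}")
        # Columns: read 1 path, md5 for read 1, read 2 path, md5 for read 2, sample ID
        read1, _md5_1, read2, _md5_2, sample_id = (field.strip() for field in row)
        samples[sample_id].append((read1, read2))
    return dict(samples)


# Hypothetical manifest: the first two rows share a sample ID (two lanes).
example = (
    "sample1_L1_R1.fastq.gz\t0\tsample1_L1_R2.fastq.gz\t0\tsample1\n"
    "sample1_L2_R1.fastq.gz\t0\tsample1_L2_R2.fastq.gz\t0\tsample1\n"
    "sample2_R1.fastq.gz\t0\tsample2_R2.fastq.gz\t0\tsample2\n"
)
manifest = read_paired_manifest(StringIO(example))
for sample_id, pairs in manifest.items():
    print(sample_id, len(pairs))
```

A sample with more than one pair in the resulting dictionary corresponds to the multi-lane case in the caption, where the pairs are merged.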
Fig. 2 SPEAQeasy workflow diagram. A simplified workflow diagram for each pipeline execution. The red box indicates the FASTQ files are inputs to the pipeline; green coloring denotes major output files from the pipeline; the remaining boxes represent computational steps. Yellow-colored steps are optional or not always performed; for example, preparing a particular set of annotation files occurs once and uses a cache for further runs. Finally, blue-colored steps are ordinary processes which occur on every pipeline execution. The workflow proceeds downward, and each row in the diagram implicitly represents the ability for several computation steps to execute in parallel
Fig. 3 Mandatory options in the main script. a The three required pieces of information the user provides are the reference genome, sample pattern, and expected strandness pattern present in all samples. The valid options are depicted horizontally to the right in this figure. b An example of a full command is shown; in this case, a test run on an SGE scheduler without Docker is also specified
Fig. 4 Main output files from SPEAQeasy. SPEAQeasy produces the files described in the blue boxes, as the final products of interest. Counts of genes, exons, and exon-exon junctions are aggregated into three respective R objects of the familiar RangedSummarizedExperiment class. This allows users to immediately follow up with a number of Bioconductor tools to perform any desired differential expression analyses. If the --coverage option is provided, RData files are produced to provide expression information over regions in the genome. This allows users to compute differentially expressed regions using any of a number of Bioconductor packages as appropriate for the experiment. Finally, for experiments on human samples, variants are called to ultimately produce a single VCF file of genotype calls at 740 particular SNVs. Together with genotype data recorded before sequencing the samples, one can resolve mislabellings and other identity issues which inevitably occur during the sequencing process (http://research.libd.org/SPEAQeasy-example)
Fig. 5 Example analysis results from applying SPEAQeasy to a subset of the BipSeq PsychENCODE dataset. a Heatmap of the Spearman correlation across samples using variant information derived from the RNA-seq data produced by SPEAQeasy. Off-diagonal high correlation values indicate potential sample swaps. b Top two principal components (PCs) derived from the gene expression counts produced by SPEAQeasy, colored by diagnosis. c Boxplots of the normalized log2 expression for the top differentially expressed gene between controls and bipolar disorder affected individuals, using a subset of the BipSeq PsychENCODE data processed using SPEAQeasy. d Heatmap of the top differentially expressed genes with annotations for the brain region (amygdala or sACC), sex (male or female) and diagnosis (bipolar or control). See http://research.libd.org/SPEAQeasy-example/ for the full example analysis
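The idea behind panel a of Fig. 5, that an unexpectedly high Spearman correlation between two different samples hints at a swap or duplicate, can be sketched in plain Python. This is a minimal illustration, not the analysis code from the example site: the 0/1/2 genotype encoding, the sample names, the 0.9 threshold, and all function names are assumptions.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def flag_similar_pairs(genotypes, threshold=0.9):
    """Return (sample_a, sample_b, rho) for distinct pairs with rho >= threshold."""
    flagged = []
    for a, b in combinations(sorted(genotypes), 2):
        rho = spearman(genotypes[a], genotypes[b])
        if rho >= threshold:
            flagged.append((a, b, rho))
    return flagged


# Hypothetical genotype calls (alternate allele counts) at eight SNVs.
genotypes = {
    "S1": [0, 1, 2, 0, 1, 2, 0, 1],
    "S2": [0, 1, 2, 0, 1, 2, 0, 1],  # identical to S1: a possible duplicate or swap
    "S3": [2, 0, 1, 1, 0, 0, 2, 1],
}
flagged = flag_similar_pairs(genotypes)
print([(a, b) for a, b, _ in flagged])
```

In practice, flagged pairs would be checked against the genotype data recorded before sequencing, as the Fig. 4 caption describes, rather than treated as confirmed swaps.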