Xueyi Dong1, Luyi Tian1, Quentin Gouil1, Hasaru Kariyawasam1, Shian Su1, Ricardo De Paoli-Iseppi2, Yair David Joseph Prawer2, Michael B Clark2, Kelsey Breslin1, Megan Iminitoff1, Marnie E Blewitt1, Charity W Law1, Matthew E Ritchie1.
Abstract
Application of Oxford Nanopore Technologies' long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to high sequencing error rates and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs ('sequins') and a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.
Year: 2021 PMID: 33937765 PMCID: PMC8074342 DOI: 10.1093/nargab/lqab028
Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268
Figure 1. Analysis workflow and quality metrics. (A) Overview of the analysis workflow used to process the mouse NSC direct-cDNA long-read and short-read RNA-seq data. (B) The number of raw reads, quality filtered reads, trimmed and demultiplexed reads, reads from chosen samples and gene-level counts in the NSC dataset. (C) Distribution of read quality in the NSC dataset, stratified by read length. Read quality is defined by the average base quality score of a read. (D) The total number of reads assigned to each sample in the NSC dataset (green: Smchd1-null samples; orange: WT samples). (E) A hexagonal 2D density plot showing the correlation between gene length and average gene expression (log-CPM) in the NSC dataset.
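As an illustration of the log-CPM scale used in panel E, the following is a minimal sketch rather than the authors' code. It assumes the voom-style transformation log2((count + 0.5) / (library size + 1) × 10^6); the gene names, counts and the helper names `log_cpm` and `average_log_cpm` are hypothetical.

```python
import math

def log_cpm(counts, lib_sizes, prior=0.5):
    """Return log2 counts-per-million for each gene and sample.

    counts: dict mapping gene -> list of raw counts (one per sample).
    lib_sizes: list of total read counts (one per sample).
    Assumes the voom-style offsets (prior=0.5 on counts, +1 on library size).
    """
    return {
        gene: [math.log2((c + prior) / (n + 1.0) * 1e6)
               for c, n in zip(row, lib_sizes)]
        for gene, row in counts.items()
    }

def average_log_cpm(logcpm):
    """Average the log-CPM values of each gene across samples."""
    return {gene: sum(vals) / len(vals) for gene, vals in logcpm.items()}

# Hypothetical toy data: two genes, two samples.
toy_counts = {"geneA": [120, 80], "geneB": [0, 3]}
toy_lib_sizes = [500_000, 400_000]
toy_logcpm = log_cpm(toy_counts, toy_lib_sizes)
toy_average = average_log_cpm(toy_logcpm)
```

The small offsets keep genes with zero counts finite on the log scale, which matters for nanopore libraries where many genes receive few reads.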
Figure 2. Results for differential gene expression analysis. (A) Scatter plot of the observed logFC between mix A and B versus the expected logFC in the sequins long-read data. The blue line is the linear regression line. (B) Scatter plot of the t-statistics calculated between mix A and B from the sequins short-read and long-read data. The blue line is the linear regression line. (C) MDS plot showing the relationship between NSC samples based on gene-level logCPM. (D) Voom mean-variance trend in NSC data where points represent genes. (E) Gene-level plot of logFC for Smchd1-null versus WT plotted against average log2-expression values. Differentially expressed genes are highlighted (red: up-regulated genes, blue: down-regulated genes). (F) The barcode plot shows the correlation between our long-read differential expression results and the results from a previous short-read dataset collected on the same NSC sample types. Each vertical bar represents a DE gene from the previous short-read study (red: up-regulated genes, blue: down-regulated genes), and the position of the bar on the x-axis represents the moderated t-statistic of the same gene in the long-read results. The length of each vertical bar represents the logFC of the gene in the short-read results. The red worm at the top and the blue worm at the bottom represent the relative enrichment of the vertical bars in each part of the plot, with the smooth fit obtained using a moving average with tricube weights.
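The comparison in panel A can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' analysis: the expected logFC is taken to be log2 of the ratio of a sequin's known input concentrations in the two mixes (the direction of the contrast is an assumption), the line is fitted by ordinary least squares, and all concentrations and observed values are made up.

```python
import math

def expected_logfc(conc_a, conc_b):
    """Expected log2 fold change from known spike-in concentrations.

    The direction (A relative to B) is an assumption for this sketch.
    """
    return math.log2(conc_a / conc_b)

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical concentrations (mix A, mix B) and observed logFC per gene.
toy_conc = [(8.0, 2.0), (1.0, 1.0), (1.0, 4.0)]
toy_expected = [expected_logfc(a, b) for a, b in toy_conc]
toy_observed = [1.7, 0.1, -1.6]
toy_slope, toy_intercept = fit_line(toy_expected, toy_observed)
```

A slope near 1 and an intercept near 0 would indicate that observed fold changes track the designed ones; a slope below 1 suggests compression of the estimated fold changes.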
Figure 3. Isoform identification and differential transcript usage analysis. (A) A bar plot showing the number of discovered isoform types in the sequins long-read dataset. The bars are separated into isoform categories (by colour), and the dashed line represents the true number of isoform types. The red ‘full-splice-match’ category represents the known transcripts present in the sequin controls (i.e. true positives), while the other categories represent erroneous transcripts. (B) A bar plot showing the number of counts from isoforms in the sequins long-read dataset. The bars are separated by colour into the isoform categories with which the counts are associated. (C) A scatter plot showing the correlation between the fraction of full-length reads assigned to a transcript and the length of the annotated transcript. Dots are coloured by the transcript count (log2-scale). (D) The correlation between observed transcript counts and expected transcript abundance of each gene from each sequins sample. (E) A bar plot showing the false discovery rate (FDR) from different tests of DTU in sequins long-read data. (F) A bar plot showing the true positive rate (TPR) from different tests of DTU in sequins long-read data.
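The two summary rates in panels E and F can be computed from a set of DTU calls and a ground-truth set, as in the minimal sketch below. It uses the standard definitions (false discovery rate = false positives / all calls; true positive rate = true positives / all true DTU genes); the gene identifiers and the function name `dtu_error_rates` are hypothetical and do not come from the study.

```python
def dtu_error_rates(called, truth):
    """Return (FDR, TPR) for a set of called genes against a truth set.

    called: identifiers reported as having differential transcript usage.
    truth: identifiers known to have differential transcript usage.
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)
    false_pos = len(called - truth)
    # Guard against empty sets to avoid division by zero.
    fdr = false_pos / len(called) if called else 0.0
    tpr = true_pos / len(truth) if truth else 0.0
    return fdr, tpr

# Hypothetical example: four calls, three truly DTU genes.
toy_called = {"g1", "g2", "g3", "g4"}
toy_truth = {"g1", "g2", "g5"}
toy_fdr, toy_tpr = dtu_error_rates(toy_called, toy_truth)
```

With spike-in controls such as sequins the truth set is known by design, which is what makes both rates directly measurable.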