Ales Varabyou, Steven L Salzberg, Mihaela Pertea.
Abstract
RNA sequencing is widely used to measure gene expression across a vast range of animal and plant tissues and conditions. Most studies of computational methods for gene expression analysis use simulated data to evaluate the accuracy of these methods. These simulations typically include reads generated from known genes at varying levels of expression. Until now, simulations did not include reads from noisy transcripts, which might include erroneous transcription, erroneous splicing, and other processes that affect transcription in living cells. Here we examine the effects of realistic amounts of transcriptional noise on the ability of leading computational methods to assemble and quantify the genes and transcripts in an RNA sequencing experiment. We show that the inclusion of noise leads to systematic errors in the ability of these programs to measure expression, including systematic underestimates of transcript abundance levels and large increases in the number of false-positive genes and transcripts. Our results also suggest that alignment-free computational methods sometimes fail to detect transcripts expressed at relatively low levels.
Year: 2020 PMID: 33361112 PMCID: PMC7849408 DOI: 10.1101/gr.266213.120
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Types and abundance of transcripts and genes observed in an assembly of nearly 10,000 GTEx RNA-seq experiments (Pertea et al. 2018)
Figure 1. Properties of the GTEx data set computed from transcriptome assemblies built for the CHESS database (Pertea et al. 2018) compared with simulated data. (A) Distributions of the number of annotated and intergenic loci observed per tissue. (B) Distributions of the number of annotated and intergenic loci observed per sample. (C) Distributions of the number of transcripts representing each noise type in a sample. (D) Fraction of expression in a typical sample that comes from real isoforms versus noisy isoforms. Only loci having both annotated and noisy transcripts being expressed are included. (E) Fraction of total expression from noisy transcripts in simulated samples.
Figure 2. Effects of transcriptional noise on the transcript-level abundance estimation quantified across the 30 samples in the simulated data set. (A) Distribution of the number of false-positive (FP) observations per sample, with (brown) and without (blue) noise. (B) Expression levels assigned to FPs in the absence and presence of noise. (C) Distribution of the number of false-negative (FN) observations per sample. (D) Expression levels of FNs in the absence and presence of noise.
Figure 3. Effects of noisy transcription on gene-level abundance estimation. (A) Distributions of the number of FP genes per sample, that is, the number of reported gene loci at which no actual transcripts were expressed. (B) Distributions of the number of FN genes per sample, that is, the number of gene loci for which the simulated data contained at least one expressed transcript but where the program failed to report any. (C) Percentage of change in the number of reads assigned to a gene as a function of the fraction of expression at that locus that comes from unannotated transcripts. Percentage of change was computed relative to the total number of reads simulated for all annotated transcripts at each locus. Only loci with more than zero reads from annotated transcripts are shown.
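The gene-level quantities named in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code. The function names, the dictionary inputs, and the sign convention for the percentage of change (assigned minus simulated, divided by simulated) are assumptions made for the example. It also assumes that "actual transcripts" means the real (non-noise) transcripts in the simulation.

```python
def classify_genes(simulated_reads, reported_reads):
    """Return (false_positive_loci, false_negative_loci) for one sample.

    simulated_reads: dict mapping locus -> reads simulated from real transcripts
    reported_reads:  dict mapping locus -> reads the program assigned to the locus
    Both input layouts are hypothetical, chosen only for this sketch.
    """
    expressed = {locus for locus, n in simulated_reads.items() if n > 0}
    reported = {locus for locus, n in reported_reads.items() if n > 0}
    false_positives = reported - expressed   # reported, but nothing real was expressed
    false_negatives = expressed - reported   # expressed, but nothing was reported
    return false_positives, false_negatives


def percent_change(assigned_reads, annotated_reads):
    """Percentage of change relative to reads simulated for annotated transcripts.

    The sign convention (positive means over-assignment) is an assumption.
    """
    if annotated_reads <= 0:
        # The caption restricts panel C to loci with more than zero annotated reads.
        raise ValueError("locus must have more than zero annotated reads")
    return 100.0 * (assigned_reads - annotated_reads) / annotated_reads


def noise_fraction(annotated_expression, noise_expression):
    """Fraction of expression at a locus that comes from unannotated transcripts."""
    total = annotated_expression + noise_expression
    if total <= 0:
        raise ValueError("locus has no expression")
    return noise_expression / total
```

Under these assumptions, each point in panel C would pair `noise_fraction(...)` on the x-axis with `percent_change(...)` on the y-axis for one locus.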