Philip Davies, Matt Jones, Juntai Liu, Daniel Hebenstreit.
Abstract
RNA-seq, including single cell RNA-seq (scRNA-seq), is plagued by insufficient sensitivity and lack of precision. As a result, the full potential of (sc)RNA-seq is limited. A major factor in this respect is the presence of global bias in most datasets, which affects detection and quantitation of RNA in a length-dependent fashion. In particular, scRNA-seq is affected by technical noise and a high rate of dropouts, where the vast majority of original transcripts are not converted into sequencing reads. We discuss the origins and implications of these biases, bioinformatics approaches to correct for them, and how biases can be exploited to infer characteristics of the sample preparation process, which in turn can be used to improve library preparation.
Keywords: RNA-seq; bias; gene expression; modeling; quantitation; software
Year: 2021 PMID: 33959753 PMCID: PMC8574610 DOI: 10.1093/bib/bbab148
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1
(A) RNA-seq coverages by sequencing read along an example gene (Ube2s) for two biological replicates. Abrupt changes in exonic read densities (vertical dashed lines) often coincide across samples, suggesting that the local sequence environment is responsible for this type of bias. Data from GEO, accession numbers GSM710183 and GSM710184. (B) RNA-seq coverage along a typical transcript can be subject to bias at different scales; the schematic illustration depicts an absence of visible bias (top left), a local bias (bottom left), a global bias (top right) and a combination of the latter two (bottom right). (C) Global bias depends on transcript lengths. Schematic illustration of the length-dependent effects compared to a short reference transcript with no visible bias (top left). Upon considering longer transcripts in the same sample, a global bias can appear (bottom left), which does not necessarily lead to a skewed overall representation of the transcripts (the dashed horizontal line indicates average coverage equal to the reference). However, different lengths often do lead to unequal representation of transcripts due to global bias that might be invisible or visible in terms of coverage (top and bottom right, respectively).
Figure 4
Biases in experimental datasets as illustrated by heatmaps and sections of fitted theoretical models. (A) Typical bias resulting from a Smart-seq based dataset (ENCODE project accession number ENCSR096STK). Heatmap (left) as in Figure 3. LiBiNorm software [36] was used to fit a bias model to the data; its predicted coverages are shown for two transcript lengths (right). (B) As A, for a typical random priming based dataset (Ovation® system, NuGEN; GEO accession number GSE84724). Typical global patterns and coverage shapes are indicated in orange for Smart-seq in A and for random priming in B.
Figure 2
Heatmap representation of bias. Data were simulated to contain no bias (top left) or increasing levels of local bias (left to right) and/or global bias (bottom row). Each heatmap displays transcripts spanning 100 bases to 10 kb that are aligned at 5′ and 3′ ends and are ordered from shortest to longest (top to bottom, respectively). Read coverages are indicated by color in 20 bins along transcripts (color key, top right). The global bias exhibits non-linear length-dependent scaling, from uniform coverage, to 5′ bias, to a bimodal distribution (dark streaks). Typical underrepresentation of transcript ends due to inefficient fragmentation is indicated by orange arrows (shown for one of the three bottom plots affected by it).
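The heatmap construction described in this caption (coverage summarized in 20 bins per transcript, transcripts ordered from shortest to longest) can be sketched as follows. This is a minimal illustration, not the code used to produce the figure: the function names `bin_coverage` and `heatmap_rows`, the per-base coverage input format, and the normalization of each row to sum to 1 are all assumptions made here.

```python
from typing import Dict, List, Sequence, Tuple


def bin_coverage(per_base: Sequence[float], n_bins: int = 20) -> List[float]:
    """Average per-base coverage in n_bins equal-width bins (5' to 3').

    The row is scaled to sum to 1 so that transcripts with different
    expression levels are comparable (an assumption of this sketch).
    """
    length = len(per_base)
    if length < n_bins:
        raise ValueError("transcript is shorter than the number of bins")
    binned = []
    for b in range(n_bins):
        # Integer bin edges; every bin holds at least one base.
        start = b * length // n_bins
        end = (b + 1) * length // n_bins
        segment = per_base[start:end]
        binned.append(sum(segment) / len(segment))
    total = sum(binned)
    return [x / total for x in binned] if total > 0 else binned


def heatmap_rows(
    coverages: Dict[str, Sequence[float]], n_bins: int = 20
) -> List[Tuple[str, List[float]]]:
    """Return (name, binned coverage) rows ordered from shortest to longest."""
    ordered = sorted(coverages.items(), key=lambda item: len(item[1]))
    return [(name, bin_coverage(cov, n_bins)) for name, cov in ordered]


if __name__ == "__main__":
    # Hypothetical toy input: one uniform and one 5'-depleted transcript.
    toy = {
        "tx_long": [0.0] * 200 + [3.0] * 200,
        "tx_short": [2.0] * 100,
    }
    for name, row in heatmap_rows(toy):
        print(name, [round(x, 3) for x in row])
```

Because each transcript is rescaled to the same number of bins, the 5′ and 3′ ends line up across rows regardless of length, which is what makes length-dependent patterns visible when the rows are stacked.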
Figure 3
Biases in experimental degradation datasets of human brain tissue illustrated by heatmaps (data from NCBI BioProject, accession number PRJNA389171, brain number Br1385). Samples were left at room temperature to degrade for different amounts of time, and both poly(A)+ and ribodepleted libraries were sequenced. Increasing degradation times typically result in a more pronounced coverage bias towards the 3′ end of the transcript for the poly(A)+ libraries but not for the ribodepleted libraries.
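One simple way to put a number on the 3′ shift described here is the fraction of a transcript's binned coverage that falls in its 3′ half. The sketch below is purely illustrative: the metric and the name `three_prime_fraction` are assumptions introduced here, not a measure taken from the review or from the underlying dataset.

```python
from typing import Sequence


def three_prime_fraction(binned: Sequence[float]) -> float:
    """Fraction of total coverage located in the 3' half of the bins.

    `binned` holds coverage per bin, ordered 5' to 3'. A value near 0.5
    indicates no visible 5'/3' skew; values approaching 1 indicate a
    strong 3' bias. With an odd number of bins the middle bin is
    counted towards neither half of the numerator.
    """
    if len(binned) < 2:
        raise ValueError("need at least two bins")
    total = sum(binned)
    if total <= 0:
        raise ValueError("transcript has no coverage")
    half = len(binned) // 2
    return sum(binned[len(binned) - half:]) / total


if __name__ == "__main__":
    # Hypothetical binned coverages, not real data.
    intact = [1.0] * 20
    degraded = [0.2] * 10 + [1.0] * 10
    print(three_prime_fraction(intact), three_prime_fraction(degraded))
```

Tracking such a summary across degradation time points, separately for poly(A)+ and ribodepleted libraries, would be one way to compare the two library types, under the assumption that the heatmap rows are available as binned coverage vectors.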
Different types of biases and their properties. The information was collated based on basic logic, heuristic considerations and literature examples where available. Note that some issues addressed in this review have not yet been systematically researched.
| Types of bias | Visibility in coverage | Local | Global | Non-linear length-scaling | Strength of bias | References |
|---|---|---|---|---|---|---|
| Fragmentation efficiency | Yes | Yes | Yes | Yes | Moderate | [ |
| PCR | No | No | Yes | Potentially | Strong | [ |
| Random priming | Potentially | Yes | No | No | Moderate | [ |
| Sequence-specific | Yes | Yes | No | No | Small | [ |
| Processivity random priming | Yes | No | Yes | Yes | Moderate | [ |
| Processivity SMART | Yes | No | Yes | Yes | Strong | [ |
| RNA degradation | Yes | No | Yes | Yes | Strong | [ |
Selection of bias correction software tools. The list is intended to give an overview of the landscape; it is not exhaustive, and the descriptions are necessarily brief.
| Name | Protocol type | Bias type addressed | Comments | Reference |
|---|---|---|---|---|
| LiBiNorm | Coverage based, specifically Smart-seq2 | Global | Only current tool that addresses cDNA-related global bias | [ |
| Wan | Coverage | Global | Interprets global bias as RNA degradation only | [ |
| Flux simulator | Coverage | Local and global | Captures some simplified features of global bias, but only for simulation, not correction; only model that at least in principle considers cDNA priming/synthesis as bias source | [ |
| RNASeqBias R package | Coverage | Local and global | Assumes independence between expression level and gene/RNA length and thus corrects globally | [ |
| Sailfish | Coverage | Local and global | Method is based on [ | [ |
| AIDE | Coverage | Local | Focus on isoforms | [ |
| BCseq | Coverage, specifically scRNA-seq | Local and dropouts | Focus on scRNA-seq and addresses dropouts | [ |
| Bento-seq | Coverage | Local | Focus on splicing | [ |
| iReckon | Coverage | Local | Focus on isoforms | [ |
| kallisto | Coverage | Local | Focus on sequence specific bias, uses a similar method to [ | [ |
| Maxcounts | Coverage | Local | Novel approach; appealing in its simplicity; limited in its power | [ |
| Mix2 | Coverage | Local | Focuses on positional biases using mixture models, closed source C++ implementation | [ |
| CEM | Coverage | Local | Focus on isoforms and transcriptome assembly | [ |
| Howard and Heber | Coverage | Local | Focuses on positional biases for isoform quantitation | [ |
| Wu | Coverage | Local | Focus on isoforms | [ |
| Huang | Coverage | Local | Focus on isoforms | [ |
| Liu | Coverage | Local | Focus on sequence specific bias | [ |
| Alnasir and Shanahan | Coverage | Local | Focus on sequence specific bias | [ |
| Zhang | Coverage | Local | Employs deep learning for sequence specific bias correction | [ |
| Jiang and Salzman | Coverage | Local | Focus on isoforms | [ |
| Roberts | Coverage | Local | Implemented in several software tools (Cufflinks, kallisto, etc.) in various iterations | [ |
| NLDMseq | Coverage | Local | Focus on isoforms | [ |
| PBSeq | Coverage | Local | Focus on positional and sequence specific biases | [ |
| PennSeq | Coverage | Local | Focus on isoforms | [ |
| PGseq | Coverage | Local | Considers positional and sequence specific biases | [ |
| PM-seq | Coverage | Local | Uses mixture models | [ |
| RSEM | Coverage | Local | Concentrates on positional bias | [ |
| Salmon/Alpine | Coverage | Local | Uses the method of [ | [ |
| seqbias | Coverage | Local | Concentrates on sequence specific bias | [ |
| Sequgio | Coverage | Local | Focus on isoforms | [ |
| SparseIso | Coverage | Local | Focus on isoforms | [ |
| WemIQ | Coverage | Local | Focus on isoforms | [ |
| XAEM | Coverage | Local | Focus on isoforms | [ |
| bayNorm | scRNA-seq | Dropouts | Bayesian approach and non-zero inflated binomial distribution | [ |
| MAGIC | scRNA-seq | Dropouts | Dropout recovery by sharing information from neighborhood cells | [ |
| Qiu | scRNA-seq | Dropouts | Cell classifier based on dropout co-occurrence | [ |
| SAVER | scRNA-seq | Dropouts | Empirical Bayes approach for dropout imputation based on intergenic correlation | [ |
| scImpute | scRNA-seq | Dropouts | Bayesian approach to rescue dropout gene using information from similar cells | [ |
| scVI | scRNA-seq | Dropouts | Neural network approach for scRNA-seq data processing | [ |
| ZIFA | scRNA-seq | Dropouts | Dimension reduction accounts for dropouts | [ |
| ZINB-WaVE | scRNA-seq | Dropouts | Imputation method based on zero-inflated model | [ |
| Buttner | scRNA-seq | Batch correction | Benchmark batch correction methods | [ |
| Xi | scRNA-seq | Doublet discrimination | Benchmark doublet discrimination methods | [ |
| SoupX | scRNA-seq | Ambient gene expression | Use empty droplets to learn model | [ |
| EmptyDrops | scRNA-seq | Empty droplets | Model ambient RNA pool to detect empty droplets | [ |
Figure 5
Schematic illustration of RNA library preparation and sequencing with UMIs, produced using simulations similar to [55]. (A) Initial number of RNA molecules in the sample; colors indicate different genes. (B) Molecules get ‘captured’ by UMI tags (dots). Non-captured RNA molecules are lost (dashed lines). (C) Captured molecules are amplified by PCR. (D) Sequencing of the cDNA library; sequencing depth determines how many copies are lost (transparent). (E) The relationship between reads and mRNA counts is noisy due to stochastic effects during amplification and sequencing, as well as variation in PCR efficiency between genes. Color brightness indicates PCR amplification efficiency, with darker colors indicating lower efficiency. (F) Sequenced UMIs are used to remove duplicate reads, improving the estimate of the initial number of RNA molecules.
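The steps in panels A to F can be mimicked with a small stochastic simulation. The sketch below is not the simulation used for the figure or in [55]; it is a minimal toy model under stated assumptions: each molecule is captured independently with a fixed probability, each captured molecule receives an idealized collision-free UMI (an integer ID), every copy is duplicated in each PCR cycle with a gene-specific probability, and sequencing draws reads uniformly without replacement. The function name `simulate_umi_library` and all parameter names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Tuple


def simulate_umi_library(
    molecules: Dict[str, int],
    capture_prob: float,
    pcr_cycles: int,
    efficiency: Dict[str, float],
    depth: int,
    seed: int = 0,
) -> Tuple[Counter, Counter, Counter]:
    """Return (captured molecules, read counts, UMI counts) per gene."""
    rng = random.Random(seed)

    # (A)-(B) Capture: each molecule is tagged with probability capture_prob.
    captured = []
    next_umi = 0
    for gene, n in molecules.items():
        for _ in range(n):
            if rng.random() < capture_prob:
                captured.append((gene, next_umi))  # idealized unique UMI
                next_umi += 1

    # (C) PCR: each copy is duplicated with its gene's efficiency per cycle.
    pool = list(captured)
    for _ in range(pcr_cycles):
        pool.extend([m for m in pool if rng.random() < efficiency[m[0]]])

    # (D) Sequencing: sample reads without replacement up to the depth.
    reads = rng.sample(pool, min(depth, len(pool)))

    # (E) Raw read counts, inflated unevenly by amplification.
    read_counts = Counter(gene for gene, _ in reads)
    # (F) Deduplicate by UMI to estimate the captured molecules.
    umi_counts = Counter(gene for gene, _ in set(reads))
    captured_counts = Counter(gene for gene, _ in captured)
    return captured_counts, read_counts, umi_counts


if __name__ == "__main__":
    # Hypothetical input values chosen only for illustration.
    start = {"geneA": 50, "geneB": 50}
    eff = {"geneA": 0.9, "geneB": 0.6}
    captured, reads, umis = simulate_umi_library(start, 0.3, 8, eff, 500)
    for g in start:
        print(g, "captured", captured[g], "reads", reads[g], "UMIs", umis[g])
```

In this toy model, read counts depend on the assumed PCR efficiency of each gene, whereas UMI counts are bounded by the number of captured molecules. Molecules lost at the capture step are not recovered by deduplication, which is the dropout effect the caption refers to.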