Klay Saunders1, Andrew G Bert1, B Kate Dredge1, John Toubia1, Philip A Gregory1,2, Katherine A Pillman1, Gregory J Goodall1,2, Cameron P Bracken3,4.
Abstract
The attachment of unique molecular identifiers (UMIs) to RNA molecules prior to PCR amplification and sequencing makes it possible to amplify libraries to a level sufficient to identify rare molecules, whilst simultaneously eliminating PCR bias through the identification of duplicated reads. Accurate de-duplication depends upon a pool of UMIs that is complex enough to allow unique labelling. In applications dealing with complex libraries, such as total RNA-seq, only a limited variety of UMIs is required because the variation in the molecules to be sequenced is enormous. However, when sequencing a less complex library, such as small RNAs, for which there is a more limited range of possible sequences, we find that greater UMI variation is required, even beyond that provided in a commercial kit specifically designed for the preparation of small RNA libraries for sequencing. We show that a pool of UMIs randomly varying across eight nucleotides is not sufficiently diverse to uniquely tag the microRNAs to be sequenced. This results in over de-duplication of reads and a marked under-estimation of the expression of the more abundant microRNAs. Whilst still arguing for the utility of UMIs, this work demonstrates the importance of their considered design to avoid errors in the estimation of gene expression in libraries derived from select regions of the transcriptome or from small genomes.
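The scale of the problem described in the abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes that every molecule of a given microRNA draws its UMI independently and uniformly at random from all 4^L possible sequences, and that reads are collapsed only when their UMIs match exactly. The function name `expected_unique_umis` and the molecule counts are hypothetical, chosen only for illustration.

```python
def expected_unique_umis(n_molecules: int, umi_length: int) -> float:
    """Expected number of distinct UMIs among n_molecules identical RNA
    molecules, assuming each UMI is drawn uniformly at random (assumption)
    from the 4**umi_length possible sequences."""
    pool = 4 ** umi_length
    # Probability that a given UMI is never drawn is (1 - 1/pool)**n.
    return pool * (1.0 - (1.0 - 1.0 / pool) ** n_molecules)


# Hypothetical abundances of a single microRNA sequence in a library.
for n in (1_000, 100_000, 1_000_000):
    for length in (8, 12):
        kept = expected_unique_umis(n, length)
        print(
            f"molecules={n:>9,}  UMI length={length:>2}  "
            f"pool={4 ** length:>10,}  expected de-duplicated count={kept:>11,.0f}  "
            f"({kept / n:.1%} of true)"
        )
```

Under these assumptions the de-duplicated count of any one sequence can never exceed the pool size, which for eight random nucleotides is 4^8 = 65,536, so the more abundant a microRNA is, the more severely it is under-counted.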
Year: 2020 PMID: 32884024 PMCID: PMC7471316 DOI: 10.1038/s41598-020-71323-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Highly expressed microRNAs are subject to over de-duplication. Four separate smRNA libraries (HMLE and MesHMLE cells; total RNA and RNA co-immunoprecipitated with AGO) are shown, with each miRNA represented as a dot plotted on axes of total RNA reads (x axis) and de-duplicated read counts (y axis). In (a), otherwise identical reads whose UMIs differ by a single nucleotide (Hamming distance = 1) are also de-duplicated to account for PCR or sequencing error, which effectively draws from a more limited pool of UMIs. In (b), no UMI sequence divergence is accommodated (Hamming distance = 0). For clarity, only four libraries are represented here. Data from additional biological replicates are included in Supplementary Fig. 1.
Figure 2. Over de-duplication due to limiting UMIs drastically decreases the apparent expression of more abundant miRNAs. MiRNA expression (counts per million) from raw reads or after de-duplication (Hamming distance = 0 or 1) is shown for the top 20 miRNAs from each of the libraries analysed.
Figure 3. “Over-sequencing” is not responsible for limiting UMIs. Raw read counts of every isomiR detected were ordered and plotted to reveal the frequency with which individual molecules were sequenced.
Figure 4. UMIs correct isomiR-specific PCR bias. (a) All isomiRs from HMLE and MesHMLE cells and all individual isomiRs of the 3 most expressed miRNAs in (b) HMLE and (c) MesHMLE cells are represented as dots plotted on axes of total RNA reads (x axis) and de-duplicated read counts (y axis, Hamming distance = 0).
Figure 5. Simulation of de-duplication performance with differing UMI length and sequencing depth.
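The caption for Figure 5 does not describe how the simulation was performed, so the following is only a minimal sketch of one way such a simulation could be set up, not the authors' method. It assumes a single microRNA sequence, UMIs drawn uniformly at random, uniform amplification (reads are sampled with replacement from the tagged molecules), no sequencing errors, and exact-match de-duplication (Hamming distance = 0). The function name `simulate_dedup` and all parameter values are hypothetical.

```python
import random


def simulate_dedup(n_molecules: int, umi_length: int, n_reads: int, seed: int = 0):
    """Return (true_molecules_sequenced, deduplicated_count) for one
    microRNA sequence under the simplifying assumptions stated above."""
    rng = random.Random(seed)
    pool = 4 ** umi_length
    # Tag every molecule with a random UMI, encoded as an integer in [0, pool).
    umis = [rng.randrange(pool) for _ in range(n_molecules)]
    # Sequencing: each read comes from a randomly chosen tagged molecule.
    sampled = [rng.randrange(n_molecules) for _ in range(n_reads)]
    true_molecules = len(set(sampled))
    # All reads share one sequence, so de-duplication keeps one read per UMI.
    dedup_count = len({umis[m] for m in sampled})
    return true_molecules, dedup_count


# Hypothetical grid of UMI lengths and sequencing depths.
for umi_length in (8, 10, 12):
    for n_reads in (10_000, 100_000):
        true_n, dedup_n = simulate_dedup(50_000, umi_length, n_reads)
        print(
            f"UMI length={umi_length:>2}  reads={n_reads:>7,}  "
            f"molecules sequenced={true_n:>6,}  after de-duplication={dedup_n:>6,}  "
            f"recovered={dedup_n / true_n:.1%}"
        )
```

In this sketch, the gap between the number of molecules actually sequenced and the de-duplicated count measures over de-duplication. Under these assumptions, that gap widens with depth for short UMIs and shrinks as UMI length increases.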