Kendell Clement, Rick Farouni, Daniel E Bauer, Luca Pinello.
Abstract
Motivation: Unique molecular identifiers (UMIs) are added to DNA fragments before PCR amplification to discriminate between alleles arising from the same genomic locus and sequencing reads produced by PCR amplification. While computational methods have been developed to take into account UMI information in genome-wide and single-cell sequencing studies, they are not designed for modern amplicon-based sequencing experiments, especially in cases of high allelic diversity. Importantly, no guidelines are provided for the design of optimal UMI length for amplicon-based sequencing experiments.
Year: 2018 PMID: 29949956 PMCID: PMC6022702 DOI: 10.1093/bioinformatics/bty264
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (a) Schematic showing the utility of UMIs in identifying PCR duplicates. In libraries using UMIs, a short sequence of random nucleotides is added to each DNA fragment before PCR amplification. All PCR products of that fragment will contain the same UMI. After library sequencing, DNA fragments with the same sequence (shown as square nucleotides on the right part of the read) can be identified as either PCR duplicates or not, based on the UMI sequence (shown as rounded nucleotides on the left part of the read). (b) Outline of a standard experiment utilizing UMI technology. The steps shown in gray are computational processing steps, and are the procedures performed by our software, AmpUMI.
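The deduplication idea in panel (a) can be sketched in a few lines. This is a minimal illustration, not AmpUMI's actual implementation: it assumes each read has already been split into a `(umi, sequence)` pair, and the function name `deduplicate` and the toy reads are hypothetical.

```python
from collections import Counter

def deduplicate(reads):
    """Collapse reads that share both UMI and sequence into one molecule.

    `reads` is an iterable of (umi, sequence) pairs (an assumed input format).
    Returns a Counter mapping each sequence to its number of distinct UMIs.
    """
    # Reads with identical UMI and sequence are treated as PCR duplicates.
    unique_molecules = set(reads)
    return Counter(sequence for _umi, sequence in unique_molecules)

# Toy data: the first two reads are PCR duplicates of one molecule.
reads = [
    ("ACGT", "AAAA"),
    ("ACGT", "AAAA"),
    ("TTGA", "AAAA"),
    ("ACGT", "CCCC"),
]
counts = deduplicate(reads)
```

Here `counts` records two original molecules for `AAAA` and one for `CCCC`, because the same sequence observed with two different UMIs is counted twice, while an exact repeat of UMI and sequence is counted once.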
Fig. 2. Association between distortion of allelic frequency and UMI length. Colored bars show simulated allelic fractions of four alleles after deduplication of reads with the same UMI and allele. Simulated samples consisted of 100 000 reads and were drawn from a population with an allelic diversity given by . Reads were generated using UMIs between 1 bp and 18 bp long. For each UMI length, the average simulated proportion of each allele is shown after removing UMI-allele collisions. Samples of 100 were simulated for each UMI length. The right column marked 'Truth' shows the underlying allelic diversity from which the simulated samples were drawn. Dots connected by lines show the allele frequency predicted by our model [Equation (12)] and are in complete concordance with the simulation results. The gray histogram at the bottom of the plot shows the TAFD [Equation (13)] for each UMI length.
Fig. 3. Distribution of collisions in simulated populations. Collision counts from each simulated sample used in Figure 2 were aggregated by UMI length. Boxplots show the median (thick line), interquartile range (box) and the range of the data (whiskers). The count of collisions is defined as the number of simulated reads (UMI-molecule combinations) that had already been observed in the simulation.
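The simulations behind Figures 2 and 3 can be approximated with a short sketch. It assumes UMIs are drawn uniformly at random and independently of the allele, and it uses hypothetical allele fractions (`TRUE_FRACS`) and a smaller sample size than the 100 000 reads in the captions; the function name `simulate_dedup` is likewise an illustration, not the authors' code.

```python
import random
from collections import Counter

def simulate_dedup(n_reads, umi_length, allele_fracs, seed=0):
    """Simulate UMI tagging, then deduplicate identical (UMI, allele) pairs.

    Returns (allele fractions after deduplication, number of collisions).
    A collision is a simulated read whose (UMI, allele) pair was already seen.
    """
    rng = random.Random(seed)
    n_umis = 4 ** umi_length  # number of distinct UMIs of this length
    alleles = rng.choices(range(len(allele_fracs)), weights=allele_fracs, k=n_reads)
    seen = set()
    collisions = 0
    for allele in alleles:
        molecule = (rng.randrange(n_umis), allele)  # UMI encoded as an integer
        if molecule in seen:
            collisions += 1
        else:
            seen.add(molecule)
    kept = Counter(allele for _umi, allele in seen)
    total = sum(kept.values())
    fractions = [kept[a] / total for a in range(len(allele_fracs))]
    return fractions, collisions

TRUE_FRACS = [0.5, 0.3, 0.15, 0.05]  # hypothetical allelic diversity
short_fracs, short_collisions = simulate_dedup(1000, 1, TRUE_FRACS)
long_fracs, long_collisions = simulate_dedup(1000, 18, TRUE_FRACS)
```

With a 1 bp UMI there are only 4 × 4 = 16 possible UMI-allele pairs, so nearly every read collides and the deduplicated fractions flatten toward equal shares. With an 18 bp UMI collisions become rare and the fractions stay close to `TRUE_FRACS`.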
Fig. 4. Probability of having no UMI collisions for Case 2 (worst case scenario): The probability of no collision as a function of sample size n for 5 consecutive values of UMI lengths, such that (colored curves). The vertical dotted line shows the n = 100 000 sample size referenced in Figures 2 and 5.
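A curve of this kind can be sketched with the classical birthday-problem calculation. The sketch assumes that the worst case means all n molecules carry the same allele, so any repeated UMI is a collision, and that the 4^L possible UMIs of length L are equally likely. The paper's own model may differ in detail, and `prob_no_collision` is a hypothetical name.

```python
import math

def prob_no_collision(n, umi_length):
    """Probability that n uniformly drawn UMIs of the given length are all distinct.

    Computes prod_{i=0}^{n-1} (1 - i / 4**umi_length) in log space for stability.
    """
    n_umis = 4 ** umi_length
    if n > n_umis:
        return 0.0  # pigeonhole: more molecules than distinct UMIs
    log_p = sum(math.log1p(-i / n_umis) for i in range(n))
    return math.exp(log_p)

# Compare two UMI lengths at the n = 100 000 sample size marked in the figure.
p_15 = prob_no_collision(100_000, 15)
p_20 = prob_no_collision(100_000, 20)
```

Under these assumptions the probability of no collision rises steeply with UMI length at a fixed sample size, which matches the qualitative shape the caption describes.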
Fig. 5. Probability of having no UMI collisions in simulated samples. We simulated samples of size 100 000, with DNA fragments randomly selected from a set containing five unique fragments, each with a random fraction of presence in the sample. Simulated DNA fragments were paired with a given set of UMIs, and the rate of UMI collisions was measured. The average percent of all 1000 simulated samples having no collisions is shown with the blue line. Three were carried out with 1000 samples each. The red reference line is computed by our model, and shows the values in Figure 4.
Number of reads, alleles and the percent of reads matching the reference allele that are present before and after AmpUMI analysis of a dataset of deep amplicon sequencing (Kou )
| | No. Reads | No. Alleles | Pct. Ref Allele |
|---|---|---|---|
| Unprocessed | 6 440 216 | 20 760 | 95.77 |
| Deduplicated–Step 1 | 426 447 | 2524 | 97.05 |
| Deduplicated–Step 2 | 269 564 | 176 | 99.19 |
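
The scale of the reduction in the table is easier to read as a fraction of the unprocessed data. The short calculation below uses only the values in the table; the variable names are illustrative.

```python
# Rows copied from the table: (reads, alleles, percent reference allele).
rows = {
    "Unprocessed": (6_440_216, 20_760, 95.77),
    "Deduplicated-Step 1": (426_447, 2_524, 97.05),
    "Deduplicated-Step 2": (269_564, 176, 99.19),
}

baseline_reads = rows["Unprocessed"][0]

# Fraction of the unprocessed reads that remain at each stage.
retained = {
    name: reads / baseline_reads
    for name, (reads, _alleles, _pct_ref) in rows.items()
}

for name, fraction in retained.items():
    print(f"{name}: {fraction:.2%} of unprocessed reads retained")
```

Roughly 6.6% of reads remain after Step 1 and roughly 4.2% after Step 2, while the number of distinct alleles falls from 20 760 to 176.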