Shuntai Zhou1, Ronald Swanstrom1,2.
Abstract
Next generation sequencing (NGS) platforms have the ability to generate almost limitless numbers of sequence reads starting with a PCR product. This gives the illusion that it is possible to analyze minor variants in a viral population. However, including a PCR step obscures the sampling depth of the viral population, the key parameter needed to understand the utility of the data set for finding minor variants. Also, these high throughput sequencing platforms are error prone at the level where minor variants are of interest, confounding the interpretation of detected minor variants. A simple strategy has been applied in multiple applications of NGS to solve these problems. Prior to PCR, individual molecules are "tagged" with a unique molecular identifier (UMI) that can be used to establish the actual sample size of viral genomes sequenced after PCR and sequencing. In addition, since PCR generates many copies of each sequence tagged to a specific UMI, a template consensus sequence (TCS) can be created from the many reads of each template, removing virtually all of the method error. From this perspective we examine our own use of a UMI, called Primer ID, in the detection of minor drug resistant variants in HIV-1 populations.
Keywords: HIV; drug resistance mutations; next generation sequencing; primer ID; unique molecular identifier
Year: 2020 PMID: 32759675 PMCID: PMC7472098 DOI: 10.3390/v12080850
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.048
Figure 1. A Typical Unique Molecular Identifier (UMI) (Primer ID) Frequency Distribution. The raw sequence output from a MiSeq run was sorted by individual UMI, and the number of reads for each UMI (after PCR) is plotted against the number of UMIs with that read number. A large number of UMIs have very small read numbers (green dots to the left of the curve); these represent offspring Primer IDs, i.e., Primer IDs generated from real Primer IDs by sequencing mistakes. The middle part of the graph shows the distribution of real Primer IDs (blue dots), which are present in unequal numbers in the PCR product. This inequality is not solely due to sampling of an otherwise equal distribution in the PCR product: a simulation in which the same number of Primer IDs as in this dataset (851) was sampled 37,177 times (equal to the total number of raw reads in this dataset), giving the same mean read number as the observed distribution, shows a much narrower spread (plotted as a red line in the figure). Finally, a small number of “jackpot” Primer IDs to the right (orange dots) represent extreme PCR skewing and/or resampling of the Primer ID/UMI population. The blue dotted vertical line shows the offspring cut-off based on a simulation.
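The equal-sampling comparison described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that every one of the 851 Primer IDs is equally likely to be drawn for each of the 37,177 reads; the function name `simulate_equal_sampling` and the fixed seed are hypothetical choices for illustration, not the authors' simulation code.

```python
import random
from collections import Counter


def simulate_equal_sampling(n_umis, n_reads, seed=0):
    """Draw n_reads reads, each from a uniformly chosen UMI (assumption)."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    counts = Counter(rng.randrange(n_umis) for _ in range(n_reads))
    # Include UMIs that happened to receive zero reads.
    return [counts.get(i, 0) for i in range(n_umis)]


# Numbers taken from the Figure 1 caption.
reads_per_umi = simulate_equal_sampling(n_umis=851, n_reads=37177)

mean = sum(reads_per_umi) / len(reads_per_umi)
variance = sum((c - mean) ** 2 for c in reads_per_umi) / len(reads_per_umi)
spread = variance ** 0.5

print(f"mean reads per UMI: {mean:.1f}")
print(f"standard deviation under equal sampling: {spread:.1f}")
```

Under this assumption the read counts are close to Poisson distributed, so the standard deviation is expected to be near the square root of the mean (about 44 reads per UMI). An observed distribution that is much wider than this indicates unequal representation of templates in the PCR product.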
Figure 2. Comparison of Selected HIV-1 DRMs in the RT Region Analyzed Using Raw Sequence Reads or Template Consensus Sequences (TCS) Created Using Primer ID. (a) Sequencing of defective HIV-1 RNA from homogeneous 8E5 virus. Top panel: analysis of DRMs in the raw sequence reads, with the number of reads from triplicate sequencing runs shown (color coded). Bottom panel: the same data analyzed using the Primer IDs in the data set to create TCS, with the number of TCS in each of the replicates shown. The dashed line in the top panel is an arbitrary 1% cutoff often used for reporting minor variants. The dashed line in the bottom panel is the average sampling depth (95% confidence of detection) for each of the triplicate sequencing runs. DRMs detected in the raw reads are shown. (b) Sequencing of HIV-1 RNA from a clinical specimen. These two panels are set up the same as in (a); the smaller number of TCS in the data set is indicated in the lower panel, and the sensitivity of detection (dashed line) is reduced accordingly.
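The link between the number of TCS and the sensitivity of detection can be sketched as follows. This is an assumption about how a "95% confidence of detection" threshold may be derived, not necessarily the calculation used for the figure: if a variant is present at frequency f and n templates are sampled independently, the chance of seeing it at least once is 1 - (1 - f)^n, and solving for f at 95% gives the lowest reliably detectable frequency. The function name `detection_limit` is hypothetical.

```python
def detection_limit(n_tcs, confidence=0.95):
    """Lowest variant frequency seen at least once with the given
    probability, assuming independent sampling of n_tcs templates."""
    if n_tcs <= 0:
        raise ValueError("n_tcs must be a positive integer")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tcs)


# More template consensus sequences give a lower (better) detection limit.
for n in (100, 1000, 10000):
    print(f"{n:>6} TCS -> detection limit {detection_limit(n):.3%}")
```

With roughly 100 TCS the limit is close to 3%, which is why a data set with fewer TCS cannot support claims about variants well below 1%, however many raw reads it contains.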
Figure 3. Illustration of the Primer ID cDNA primer and how Primer ID reduces method error. (a) Primer ID cDNA primer. The Ns in the Primer ID cDNA primer represent a block of degenerate nucleotides. (b) Reads with the same Primer ID are collapsed into a template consensus sequence, which greatly reduces errors. Colored vertical lines on the sequence reads indicate PCR mis-incorporations or sequencing errors, and the colored horizontal block indicates PCR recombination.
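The collapsing step in panel (b) can be sketched as a simple majority vote. This is a minimal illustration, not the authors' pipeline: it assumes reads are already paired with their Primer ID, are aligned and of comparable length, and it uses a fixed hypothetical cut-off (`min_reads=3`) in place of the simulation-based offspring cut-off mentioned in Figure 1. The function name `build_tcs` and the toy sequences are invented for the example.

```python
from collections import Counter, defaultdict


def build_tcs(reads, min_reads=3):
    """Collapse (umi, sequence) pairs into one consensus sequence per UMI.

    UMIs with fewer than min_reads reads are dropped (hypothetical fixed
    cut-off). Each position takes the most common base; on a tie, the
    base seen first among the reads is kept.
    """
    by_umi = defaultdict(list)
    for umi, seq in reads:
        by_umi[umi].append(seq)

    tcs = {}
    for umi, seqs in by_umi.items():
        if len(seqs) < min_reads:
            continue  # likely an offspring UMI or an under-sampled template
        length = min(len(s) for s in seqs)  # compare only shared positions
        tcs[umi] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return tcs


# Toy data: made-up UMIs and sequences for illustration only.
reads = [
    ("ACGTACGT", "ATGGCA"),
    ("ACGTACGT", "ATGGCA"),
    ("ACGTACGT", "ATGTCA"),  # one read carries an error at the fourth base
    ("TTGCAAGC", "ATGGCG"),
    ("TTGCAAGC", "ATGGCG"),
    ("TTGCAAGC", "ATGGCG"),
    ("ACGTACGA", "ATGGCA"),  # single-read UMI, dropped by the cut-off
]

consensus = build_tcs(reads)
for umi, seq in consensus.items():
    print(umi, seq)
```

In this toy input the isolated error in the first template is outvoted, the second template keeps its genuine difference at the last base, and the single-read UMI contributes nothing, so the number of consensus sequences (two) reflects the number of templates actually sampled.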