Dmitriy A Shagin1,2,3, Irina A Shagina2,3, Andrew R Zaretsky2,3, Ekaterina V Barsova1,3, Ilya V Kelmanson1,3, Sergey Lukyanov1,2, Dmitriy M Chudakov4,5,6,7, Mikhail Shugay8,9,10.
Abstract
The accuracy with which DNA polymerase can replicate a template DNA sequence is an extremely important property that can vary by an order of magnitude from one enzyme to another. The rate of nucleotide misincorporation is shaped by multiple factors, including PCR conditions and proofreading capabilities, and proper assessment of polymerase error rate is essential for a wide range of sensitive PCR-based assays. In this paper, we describe a method for studying polymerase errors with exceptional resolution, which combines unique molecular identifier tagging and high-throughput sequencing. Our protocol is less laborious than commonly used methods, and is also scalable, robust and accurate. In a series of nine PCR assays, we have measured a range of polymerase accuracies that is in line with previous observations. However, we were also able to comprehensively describe individual errors introduced by each polymerase after either 20 PCR cycles or a linear amplification, revealing specific substitution preferences and the diversity of PCR error frequency profiles. We also demonstrate that the detected high-frequency PCR errors are highly recurrent and that the position in the template sequence and polymerase-specific substitution preferences are among the major factors influencing the observed PCR error rate.
Year: 2017 PMID: 28578414 PMCID: PMC5457411 DOI: 10.1038/s41598-017-02727-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Experimental design for UMI-based evaluation of PCR errors. (a) Schematic representation of our five-stage experiment to evaluate errors introduced during conventional PCR amplification. DNA molecules were tagged with unique molecular identifiers (UMI) using linear amplification (1), followed by PCR amplification with various polymerases (2). We predict approximately 1.3 × 10⁵-fold amplification given a PCR efficiency of 1.8 and 20 cycles of PCR. Next, we performed a series of dilutions (~10⁶-fold downsampling) (3) to ensure that no more than one molecule with a given UMI tag is subjected to a second round of PCR (4). Finally, the PCR-amplified pool was subjected to high-throughput sequencing. The downsampling step guarantees that errors produced during the first PCR will represent the more frequent variant within reads tagged with the same UMI after sequencing, clearly distinguishing them from errors introduced at subsequent stages. Sequencing reads were processed (5) to ensure that only errors from the first PCR are retained, while those arising during the second PCR step and sequencing errors are filtered. (b) A simpler protocol for evaluating the error rate of the linear amplification stage used for UMI attachment. Such errors are indistinguishable from those arising during the first PCR step in the experimental setup depicted in (a), and the results from this simpler procedure are used to adjust the error rates inferred for conventional PCR amplification.
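The amplification estimate and the UMI-based filtering described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes reads are already aligned to the template and have equal length; the function names (`expected_fold_amplification`, `umi_consensus`, `count_mismatches`) and the `min_reads` cutoff are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter, defaultdict


def expected_fold_amplification(efficiency: float, cycles: int) -> float:
    """Fold amplification if every cycle multiplies the pool by `efficiency`."""
    return efficiency ** cycles


def umi_consensus(reads, min_reads=2):
    """Group (umi, sequence) pairs by UMI and take a per-position majority vote.

    Assumption: all sequences are pre-aligned and of equal length.
    UMIs with fewer than `min_reads` reads are dropped (hypothetical cutoff).
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        if len(seqs) < min_reads:
            continue  # too few reads to out-vote late PCR or sequencing errors
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


def count_mismatches(consensus, template):
    """Count consensus bases that differ from the known template."""
    return sum(
        sum(a != b for a, b in zip(seq, template))
        for seq in consensus.values()
    )


if __name__ == "__main__":
    # Efficiency 1.8 over 20 cycles, as stated in the caption.
    print(f"{expected_fold_amplification(1.8, 20):.3g}")
```

Errors that survive the majority vote are the ones attributed to the first PCR (or to the linear amplification step), because anything introduced later is expected to be a minority variant within its UMI group.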
Error rate estimates from two independent experiments. The table shows the number of mismatches that remain after consensus sequence assembly for reads tagged with the same UMI, the total number of unique UMI tags observed, and error rate estimates for 20 (25 for Phusion) cycles of PCR amplification with a 150-bp template. Error rate estimates are provided as the number of erroneous bases per template base per cycle, with 95% confidence intervals (CI) calculated using normal approximation for binomial proportions. Error rate (LA) represents estimates for the linear amplification step, while Error rate (corr) represents the estimate of conventional PCR error rate after correcting for linear amplification errors.
| Polymerase* | Experiment | Mismatches | UMIs | Error rate, ×10⁻⁵ [95% CI] | Error rate (LA), ×10⁻⁵ | Error rate (corr), ×10⁻⁵ [95% CI] |
|---|---|---|---|---|---|---|
| Encyclo | 1 | 24557 | 185560 | 4.30 [4.25, 4.35] | 12.79 | 3.68 [3.63, 3.73] |
| Encyclo | 2 | 14211 | 101516 | 4.55 [4.48, 4.62] | 12.79 | 3.93 [3.86, 3.99] |
| Kapa HF | 1 | 339 | 7876 | 1.40 [1.25, 1.55] | 8.17 | 1.00 [0.88, 1.13] |
| Kapa HF | 2 | 2519 | 57052 | 1.44 [1.38, 1.49] | 8.17 | 1.04 [0.99, 1.08] |
| Phusion | 1 | 30 | 1348 | 0.72 [0.47, 0.98] | 4.99 | 0.48 [0.27, 0.69] |
| Phusion | 2 | 23 | 1351 | 0.55 [0.33, 0.78] | 4.99 | 0.31 [0.14, 0.48] |
| SD-HS | 1 | 6714 | 33076 | 6.60 [6.46, 6.74] | 26.89 | 5.29 [5.16, 5.42] |
| SD-HS | 2 | 10362 | 58518 | 5.76 [5.66, 5.86] | 26.89 | 4.45 [4.36, 4.54] |
| SNP-detect | 1 | 457 | 13870 | 1.07 [0.97, 1.17] | 5.04 | 0.83 [0.74, 0.91] |
| SNP-detect | 2 | 848 | 32310 | 0.85 [0.80, 0.91] | 5.04 | 0.61 [0.56, 0.66] |
| Taq-HS | 1 | 1875 | 15137 | 4.03 [3.86, 4.20] | 18.36 | 3.13 [2.98, 3.29] |
| Taq-HS | 2 | 3113 | 24082 | 4.20 [4.07, 4.34] | 18.36 | 3.31 [3.18, 3.43] |
| Tersus-buffer1 | 1 | 2550 | 46927 | 1.77 [1.70, 1.83] | 5.98 | 1.48 [1.41, 1.54] |
| Tersus-buffer1 | 2 | 5504 | 154226 | 1.16 [1.13, 1.19] | 5.98 | 0.87 [0.84, 0.89] |
| Tersus-buffer2 | 1 | 6282 | 130891 | 1.56 [1.52, 1.60] | 5.1 | 1.31 [1.28, 1.35] |
| Tersus-buffer2 | 2 | 1299 | 30683 | 1.38 [1.30, 1.45] | 5.1 | 1.13 [1.06, 1.19] |
| TruSeq | 1 | 312 | 14164 | 0.72 [0.64, 0.79] | 4.1 | 0.52 [0.45, 0.58] |
| TruSeq | 2 | 362 | 16705 | 0.70 [0.63, 0.78] | 4.1 | 0.50 [0.44, 0.57] |
| KTN | 1 | 16802 | 132733 | 4.12 [4.06, 4.17] | 12.92 | 3.49 [3.43, 3.54] |
| KTN | 2 | 6298 | 44331 | 4.62 [4.51, 4.73] | 12.92 | 3.99 [3.89, 4.09] |
*Encyclo (Evrogen JSC), Kapa HiFi PCR Kit (Kapa Biosystems), Phusion High-Fidelity DNA Polymerase (NEB), SD-HS[29], SNP-detect, HS Taq, Tersus in buffer 1 (Mg²⁺ 3.5 mM, pH 8.5), Tersus in buffer 2 (Mg²⁺ 2.5 mM, pH 8.0) (Evrogen JSC), TruSeq Custom Enrichment Kit (Illumina), KTN (KlenTaq N polymerase)[30].
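The per-cycle error rate and its confidence interval can be sketched from the table's raw counts using the normal approximation for a binomial proportion, as the caption describes. The function name and the exact denominator (UMIs × template length × cycles) are assumptions. With a 150-bp template and 20 cycles, this sketch gives roughly 4.4 × 10⁻⁵ for the first Encyclo row rather than the tabulated 4.30, so the authors' exact denominator evidently differs slightly.

```python
import math


def per_cycle_error_rate(mismatches, umis, template_len=150, cycles=20, z=1.96):
    """Return (rate, ci_low, ci_high) in errors per base per cycle.

    Assumption: every UMI contributes `template_len` inspected bases and
    errors accumulate linearly over `cycles`.
    """
    bases = umis * template_len                       # consensus bases inspected
    p = mismatches / bases                            # per-base error proportion
    half_width = z * math.sqrt(p * (1 - p) / bases)   # normal approximation
    return p / cycles, (p - half_width) / cycles, (p + half_width) / cycles


if __name__ == "__main__":
    rate, low, high = per_cycle_error_rate(24557, 185560)
    print(f"{rate:.2e} [{low:.2e}, {high:.2e}]")
```

Because the interval width shrinks with the square root of the number of inspected bases, samples with few UMIs (for example Phusion) have much wider intervals than deeply covered ones.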
Figure 2. Substitution type preferences and unique error profiles of different polymerases. (a) Share of each substitution type produced during 20 PCR cycles (top) and one cycle of linear amplification (bottom) in each sample. Substitution frequencies were normalized to the ratio of the corresponding base type in the template sequence. Note that linear amplification preserves the strand information and thus allows us to resolve all twelve possible substitution types. (b) Hierarchical clustering of PCR error profiles, computed as the frequency of each of three possible substitutions at each template position produced in two independent experiments. Correlation and Euclidean distance were used to cluster polymerases (columns) and substitutions (rows), respectively. The color panel at left shows substitution type, as represented in the legend at top.
Frequency of substitution types produced during 20 PCR cycles.
| Polymerase | A > C, T > G | A > G, T > C | A > T, T > A | C > A, G > T | C > G, G > C | C > T, G > A |
|---|---|---|---|---|---|---|
| Encyclo | 4% | | 5% | 2% | 1% | 29% |
| Kapa HF | 2% | 10% | 2% | 14% | 1% | |
| SD-HS | 3% | | 14% | 5% | 17% | 30% |
| SNP-detect | 5% | 27% | 1% | 2% | 1% | |
| Taq-HS | 3% | | 7% | 7% | 2% | 25% |
| Tersus-buffer1 | 5% | 35% | 2% | 6% | 1% | |
| Tersus-buffer2 | 5% | 45% | 2% | 2% | 1% | |
| TruSeq | 6% | 13% | 3% | 25% | 3% | |
| KTN | 4% | | 4% | 2% | 1% | 29% |
Relative frequency of each of six substitution types that can be inferred without strand information across all errors produced by each polymerase. We used pooled data from two independent experiments to construct the table, with frequencies normalized to the ratio of the corresponding base in the template sequence. The most frequent substitution for each polymerase is shown in bold.
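The collapsing of twelve substitutions into six strand-independent classes, followed by normalization to template base composition, can be sketched as follows. This is one plausible reading of "normalized to the ratio of the corresponding base in the template sequence": each class count is divided by the fraction of template bases that could give rise to it, and the result is rescaled to sum to 1. The function names and this exact weighting are assumptions.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def collapse(ref, alt):
    """Map a substitution to its strand-independent class, e.g. T>G -> 'A>C'."""
    if ref in "GT":  # report from the strand where the reference base is A or C
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def normalized_shares(errors, template):
    """Share of each collapsed class, weighted by template base composition.

    `errors` is an iterable of (ref_base, alt_base) pairs.
    """
    base_counts = Counter(template)
    n = len(template)
    pair_fraction = {
        "A": (base_counts["A"] + base_counts["T"]) / n,  # A:T positions
        "C": (base_counts["C"] + base_counts["G"]) / n,  # C:G positions
    }
    counts = Counter(collapse(ref, alt) for ref, alt in errors)
    weighted = {k: c / pair_fraction[k[0]] for k, c in counts.items()}
    total = sum(weighted.values())
    return {k: v / total for k, v in weighted.items()}


if __name__ == "__main__":
    demo = [("A", "G"), ("T", "C"), ("A", "G"), ("G", "A")]
    print(normalized_shares(demo, "AATC"))
```

Without this weighting, a template rich in one base pair would inflate the apparent share of substitutions at that pair.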
Frequency of substitution types produced during linear amplification.
| Polymerase | A > C | A > G | A > T | C > A | C > G | C > T | G > A | G > C | G > T | T > A | T > C | T > G |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Encyclo | 6% | 22% | 3% | 10% | 2% | 19% | | 1% | 1% | 1% | 11% | 1% |
| Kapa HF | 3% | 4% | 1% | | 1% | 23% | 16% | 1% | 7% | 0% | 2% | 0% |
| SD-HS | 5% | 11% | | 8% | 7% | 13% | 16% | 3% | 4% | 4% | 6% | 0% |
| SNP-detect | 8% | 9% | 0% | 5% | 2% | 26% | | 0% | 1% | 1% | 6% | 0% |
| Taq-HS | 4% | 15% | 8% | | 2% | 13% | 19% | 1% | 2% | 4% | 8% | 0% |
| Tersus-buffer1 | 9% | 6% | 0% | 22% | 1% | 18% | | 1% | 1% | 1% | 4% | 0% |
| Tersus-buffer2 | 10% | 10% | 0% | 6% | 2% | 22% | | 1% | 1% | 1% | 7% | 0% |
| TruSeq | 5% | 3% | 1% | | 1% | 8% | 15% | 1% | 20% | 0% | 2% | 0% |
| KTN | 6% | 24% | 3% | 8% | 2% | 19% | | 0% | 1% | 1% | 10% | 0% |
Relative frequency of each of twelve possible substitution errors produced by each polymerase. We used pooled data from two independent experiments to construct the table, with frequencies normalized to the ratio of the corresponding base in the template sequence. The most frequent substitution for each polymerase is shown in bold.
Figure 3. Recurrent high-frequency errors in two separate experiments. Scatter plots of individual substitution frequencies obtained from two independent experiments with 20 cycles of PCR. Red circles indicate PCR errors that were not detected in one of the experiments; in this case, their frequency is assumed to be 1 divided by the corresponding region coverage. Linear fitting is shown by blue lines.
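The pairing of error frequencies across two experiments, including the 1/coverage substitute for undetected errors, can be sketched as below. Using a single coverage value per experiment is a simplifying assumption (the caption refers to region coverage), and the helper names are hypothetical. Whether the authors fitted raw or log-transformed frequencies is not stated, so the fit here is applied to whatever values are passed in.

```python
def paired_frequencies(freq1, freq2, coverage1, coverage2):
    """Align two {error: frequency} maps; missing errors get 1 / coverage."""
    keys = sorted(set(freq1) | set(freq2))
    xs = [freq1.get(k, 1 / coverage1) for k in keys]
    ys = [freq2.get(k, 1 / coverage2) for k in keys]
    return keys, xs, ys


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Keys are (substitution, template position); values are illustrative only.
    exp1 = {("A>G", 10): 1e-3, ("C>T", 20): 1e-4}
    exp2 = {("A>G", 10): 2e-3}
    _, xs, ys = paired_frequencies(exp1, exp2, coverage1=1000, coverage2=5000)
    print(linear_fit(xs, ys))
```

A slope near 1 with a small intercept would indicate that the same errors recur at similar frequencies in both experiments.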
Figure 4. Frequency distribution of individual PCR errors. The mixture of individual error (grouped by position and substitution type) frequency histograms for each substitution type (shown by color) in each sample. Dashed line shows the mean PCR error rate estimate after 20 PCR cycles.
Figure 5. Evidence for bridge PCR errors in quality-thresholded sequencing data. (a) Erroneous variants from an unamplified control were thresholded by sequencing quality, and the error rate of each substitution type was computed as the ratio of corresponding errors to the number of reads that have quality greater than or equal to the threshold. The overall error rate was between 10⁻³ and 10⁻⁴ when no filtering was applied, in keeping with the fact that most bases in this experiment have a sequencing quality score of Phred 35. Errors characteristic of the TruSeq polymerase signature after one cycle of linear amplification (C > A, G > T) become dominant beyond the Phred 30 quality threshold. (b) Correlation between the PCR error profile (i.e., frequencies of substitutions at each template position) and errors in the amplification-free control at various sequencing quality thresholds. (c) Hierarchical clustering of error profiles produced by TruSeq and profiles obtained at Q10 and Q35 quality thresholds with an unamplified control.
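The quality-thresholded error rate in panel (a) can be sketched as a ratio of errors to retained base calls at each threshold. The input layout (one `(phred_quality, substitution_or_None)` tuple per base call) and the function names are assumptions made for illustration. The Phred conversion itself is the standard definition, under which Q35 corresponds to an error probability of about 3 × 10⁻⁴, consistent with the unfiltered range quoted in the caption.

```python
from collections import Counter


def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


def error_rates_by_threshold(calls, thresholds):
    """Per-substitution error rate among base calls with quality >= threshold.

    `calls` is a list of (phred_quality, substitution) tuples, where
    substitution is a string such as 'C>A' or None for a correct call.
    """
    rates = {}
    for t in thresholds:
        kept = [sub for q, sub in calls if q >= t]
        errors = Counter(sub for sub in kept if sub is not None)
        rates[t] = {sub: n / len(kept) for sub, n in errors.items()} if kept else {}
    return rates


if __name__ == "__main__":
    demo = [(35, None), (35, "C>A"), (12, "A>G"), (20, None)]
    print(error_rates_by_threshold(demo, [10, 30]))
    print(phred_to_error_prob(35))
```

Errors that persist at high thresholds are unlikely to be base-calling errors, which is the reasoning behind attributing them to amplification on the flow cell.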
Figure 6. Template position and other factors affecting PCR error rate. (a) Normalized log₁₀ error rate for each template position. The normalization was performed by scaling log-transformed error rate values for each nucleotide type and polymerase to zero mean and unit standard deviation. The Spearman correlation coefficient between normalized error rate and position is R = −0.21 (P < 10⁻¹¹). (b) Percent of variance in error rate explained by GC content, nucleotide type, polymerase, polymerase-specific nucleotide preference, and position. GC content was computed within a 15-bp window centered on the erroneous base. The position variable was computed by splitting the template sequence into non-overlapping 15-bp bins, where the index of the bin corresponding to each erroneous base was used as a variable. Explainable variance was computed using ANOVA of a linear model that includes all of the aforementioned factors. (c) Observed error rate plotted against the error rate predicted using the linear model. The correlation between the observed and predicted error rate was R = 0.63 (P < 10⁻¹¹⁶).
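The covariates and the normalization described in the Figure 6 caption can be sketched with a few small helpers. The names are hypothetical, and two details are assumptions not stated in the caption: the GC window is truncated at the template ends, and the population standard deviation is used for scaling.

```python
import math
from statistics import mean, pstdev


def gc_content(template, pos, window=15):
    """GC fraction in a window centered on `pos` (truncated at the ends)."""
    half = window // 2
    segment = template[max(0, pos - half):min(len(template), pos + half + 1)]
    return sum(base in "GC" for base in segment) / len(segment)


def position_bin(pos, bin_size=15):
    """Index of the non-overlapping bin that contains 0-based position `pos`."""
    return pos // bin_size


def zscore_log10(rates):
    """Scale log10-transformed rates to zero mean and unit standard deviation.

    Intended to be applied separately to each (nucleotide, polymerase) group.
    """
    logs = [math.log10(r) for r in rates]
    mu, sd = mean(logs), pstdev(logs)
    return [(x - mu) / sd for x in logs]


if __name__ == "__main__":
    template = "ACGT" * 10
    print(gc_content(template, 20), position_bin(20))
    print(zscore_log10([1e-5, 1e-4, 1e-3]))
```

Scaling within each nucleotide and polymerase group removes their overall differences in error rate, so the remaining trend reflects position along the template.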