Matthias Döring, Christoph Kreer, Nathalie Lehnen, Florian Klein, Nico Pfeifer.
Abstract
Successful primer design for polymerase chain reaction (PCR) hinges on the ability to identify primers that efficiently amplify template sequences. Here, we generated a novel Taq PCR data set that reports the amplification status for pairs of primers and templates from a reference set of 47 immunoglobulin heavy chain variable sequences and 20 primers. Using logistic regression, we developed TMM, a model for predicting whether a primer amplifies a template given their nucleotide sequences. The model suggests that the free energy of annealing, ΔG, is the key driver of amplification (p = 7.35e-12) and that 3' mismatches should be considered in dependence on ΔG and the mismatch closest to the 3' terminus (p = 1.67e-05). We validated TMM by comparing its estimates with those from the thermodynamic model of DECIPHER (DE) and a model based solely on the free energy of annealing (FE). TMM outperformed the other approaches in terms of the area under the receiver operating characteristic curve (TMM: 0.953, FE: 0.941, DE: 0.896). TMM can improve primer design and is freely available via openPrimeR (http://openPrimeR.mpi-inf.mpg.de).
Year: 2019 PMID: 31341211 PMCID: PMC6656877 DOI: 10.1038/s41598-019-47173-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Experimental layout and labeling of the PCR reactions.
Figure 2. Examples for encoding mismatches within the 3′ hexamer region. Primers are indicated as arrows and templates are indicated as horizontal bars. Arrowheads indicate the 3′ hexamer region. Mismatches within the 3′ hexamer are encoded via z ∈ {0, 1}^6, X, and i ∈ {0, 1, …, 6}. While z uses a binary encoding to indicate the presence of mismatches within the 3′ hexamer, X gives the total number of 3′ hexamer mismatches, and i indicates the position of the 3′ hexamer mismatch closest to the 3′ terminus. (a) Absence of 3′ terminal mismatches between primer and template. (b) Mismatches in the 3′ hexamer at positions 4 and 6.
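The encoding described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the openPrimeR implementation: the function name `encode_hexamer_mismatches` is hypothetical, both sequences are assumed to be written 5′ to 3′ in the primer's orientation and aligned without gaps, and hexamer positions are assumed to be numbered 1 to 6 with position 6 at the 3′ terminus (so that i = 0 means no hexamer mismatch).

```python
from typing import List, Tuple


def encode_hexamer_mismatches(primer: str, binding_site: str) -> Tuple[List[int], int, int]:
    """Encode 3' hexamer mismatches as (z, X, i).

    Assumption: both strings are 5'->3' in the primer's orientation and
    aligned without gaps, so the last character is the 3' terminus.
    """
    if len(primer) != len(binding_site):
        raise ValueError("primer and binding site must have equal length")
    if len(primer) < 6:
        raise ValueError("at least six aligned positions are required")
    pairs = zip(primer[-6:].lower(), binding_site[-6:].lower())
    # z: one binary indicator per hexamer position (1 = mismatch)
    z = [int(p != t) for p, t in pairs]
    # X: total number of mismatches in the hexamer
    X = sum(z)
    # i: highest mismatching position (closest to the 3' end), 0 if none
    i = max((pos for pos, mismatch in enumerate(z, start=1) if mismatch), default=0)
    return z, X, i


# Mismatches at hexamer positions 4 and 6, as in panel (b)
example = encode_hexamer_mismatches("aaaaaaacgtac", "aaaaaaacgaag")
```

Under these assumptions, `example` holds z = [0, 0, 0, 1, 0, 1], X = 2, and i = 6.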
Overview of the properties of the IGHV data set.
| Property | Interpretation | Set 1 | Set 2 |
|---|---|---|---|
| ΔG | Free energy of annealing [kcal/mol] | [−4.9, −2.0] | [−8.6, −5.2] |
| i | Mismatch closest to 3′ end | [2, 6] | [0, 1] |
| X | Number of 3′ hexamer mismatches | [1, 3] | [0, 1] |
| | Extent of GC clamp | [1, 2] | [1, 1] |
| Δ | Free energy of folding [kcal/mol] | [−1.53, −0.24] | [−1.24, −0.76] |
| Δ | Free energy of self-dimerization [kcal/mol] | [−2.1, −0.7] | [−1.2, −0.8] |
| yi = | Positive amplification status | 217 of 720 (30.1%) | 165 of 188 (87.8%) |
| | Number of mismatches at the start of the 3′ hexamer | 271 | 25 |
| | Number of mismatches at the 2nd position of the 3′ hexamer | 226 | 4 |
| | Number of mismatches at the 3rd position of the 3′ hexamer | 272 | 31 |
| | Number of mismatches at the 4th position of the 3′ hexamer | 246 | 11 |
| | Number of mismatches at the 5th position of the 3′ hexamer | 308 | 12 |
| | Number of mismatches at the 3′ terminal position | 308 | 12 |
Values shown in brackets indicate the inter-quartile range of the observed values.
Empirical amplification rates in dependence on the number of primer-template mismatches and other properties.
| Number of mismatches | | ΔG | Amplification rate | Primer set |
|---|---|---|---|---|
| 0 | [0, 0] | [−16.616, −15.696] | 100% | Overall |
| 1 | [0, 3] | [−14.353, −12.1] | 100% | Overall |
| 2 | [0, 3] | [−12.0455, −9.656] | 100% | Overall |
| 3 | [0, 4] | [−11.607, −7.9185] | 100% | Overall |
| 4 | [2, 6] | [−10.796, −7.409] | 92.31% | Overall |
| 5 | [0, 3] | [−7.047, −6.047] | 88.89% | Overall |
| 6 | [0, 0] | [−8.603, −5.11325] | 83.33% | Overall |
| 7 | [0, 3] | [−5.39, −4.212] | 67.19% | Overall |
| 8 | [3, 6] | [−5.56075, −2.539] | 34.04% | Overall |
| 9 | [4, 6] | [−3.5335, −2.1325] | 23.08% | Overall |
| 10 | [4, 6] | [−4.09, −1.724] | 18.02% | Overall |
| 11 | [4, 6] | [−3.74, −1.695] | 10.53% | Overall |
| 12 | [6, 6] | [−2.624, −1.413] | 3.75% | Overall |
| 0 | [0, 0] | [−16.07, −15.609] | 100% | Set 1 |
| 1 | [0, 3] | [−13.283, −12.1] | 100% | Set 1 |
| 2 | [0, 3.25] | [−11.94175, −9.656] | 100% | Set 1 |
| 3 | [0, 4] | [−11.607, −7.66375] | 100% | Set 1 |
| 4 | [2, 6] | [−10.974, −6.686] | 90.91% | Set 1 |
| 5 | [2.5, 4.5] | [−8.36825, −6.4925] | 75% | Set 1 |
| 6 | [3.25, 4] | [−4.4545, −2.9] | 33.33% | Set 1 |
| 7 | [3, 6] | [−4.212, −2.539] | 9.52% | Set 1 |
| 8 | [4, 6] | [−3.303, −2.06275] | 18.06% | Set 1 |
| 9 | [5, 6] | [−3.0985, −2.0395] | 13.51% | Set 1 |
| 10 | [5, 6] | [−3.393, −1.695] | 11.26% | Set 1 |
| 11 | [5, 6] | [−3.351, −1.695] | 4.2% | Set 1 |
| 12 | [6, 6] | [−2.608, −1.413] | 2.6% | Set 1 |
| 0 | [0, 0] | [−20.79275, −16.616] | 100% | Set 2 |
| 1 | [0, 2] | [−17.782, −14.045] | 100% | Set 2 |
| 2 | [0, 0] | [−14.4805, −12.5605] | 100% | Set 2 |
| 3 | [1, 1] | [−10.505, −10.505] | 100% | Set 2 |
| 4 | [0.75, 2.25] | [−10.29475, −9.29225] | 100% | Set 2 |
| 5 | [0, 0] | [−6.047, −6.047] | 100% | Set 2 |
| 6 | [0, 0] | [−8.603, −5.208] | 100% | Set 2 |
| 7 | [0, 0] | [−5.39, −5.208] | 95.35% | Set 2 |
| 8 | [0, 0] | [−5.937, −3.95] | 86.36% | Set 2 |
| 9 | [1, 6] | [−5.58, −2.89] | 78.95% | Set 2 |
| 10 | [0, 3] | [−5.208, −2.956] | 66.67% | Set 2 |
| 11 | [0, 2.25] | [−5.208, −2.8395] | 64.29% | Set 2 |
| 12 | [4, 5.5] | [−2.6225, −1.9615] | 33.33% | Set 2 |
Amplification properties are shown when evaluated on primers from all primer sets as well as on primers from Set 1 or Set 2 only, respectively.
Figure 3. Impact of the free energy of annealing (ΔG) and 3′ terminal mismatches on the amplification of templates. The x-axis indicates, for every primer-template pair (PTP), the mismatch position closest to the primer 3′ terminus such that position 1 in the plot corresponds to i = 6 and position 6 corresponds to i = 1. PTPs with zero mismatches are denoted by None. Every point represents a PTP. Pairs that are labeled as Amplified are shown in blue, while those that are labeled as Unamplified are shown in red. Observations from Set 1 are indicated by circles and those from Set 2 by triangles. The dashed lines indicate cutoffs that are suitable for separating observations according to their amplification status. The vertical dashed line indicates the end of the 3′ hexamer, while the horizontal dashed line indicates a free energy of −5 kcal/mol.
Primers used for performing IGHV PCRs.
| Primer ID | Sequence | GC Ratio | Δ | Δ |
|---|---|---|---|---|
| Set 1.1 | cacctgtggttcttcctcct | 59.1% | −0.8 | 0 |
| Set 1.2 | cacctgtggttcttcctcct | 59.1% | −0.8 | 0 |
| Set 1.3 | atggagtttgggctgagct | 57.1% | −2.3 | 0 |
| Set 1.4 | atggagttggggctgagct | 60% | −2.3 | 0 |
| Set 1.5 | tggagttttggctgagct | 57.1% | −2.3 | −0.1 |
| Set 1.6 | actttgctccacgctcct | 60% | −0.3 | 0 |
| Set 1.7 | atggactggacctggagcat | 57.1% | −1.9 | 0 |
| Set 1.8 | atggactggacctggaggtt | 59.1% | −2.1 | −1.9 |
| Set 1.9 | atggactgcacctggaggat | 57.1% | −1.9 | 0 |
| Set 1.10 | atggactggacctggagggtctt | 58.3% | −1.9 | −3.6 |
| Set 1.11 | tctgtctccttcctcatcttcct | 52% | 0.4 | 0 |
| Set 1.12 | ggactggatttggagggtcctctt | 56% | −2.2 | −3.2 |
| Set 1.13 | gctccgctgggttttcctt | 60% | 0.4 | 0 |
| Set 1.14 | tggggtcaaccgccat | 66.7% | −0.7 | −1.6 |
| Set 1.15 | ggcctctccacttaaaccca | 59.1% | −1.9 | 0 |
| Set 1.16 | tggacacactttgctacacact | 50% | 0 | 0 |
| Set 2.1 | acaggtgcccactcccaggtgca | 66.7% | −0.8 | −1.2 |
| Set 2.2 | aaggtgtccagtgtga | 54.3% | −1.2 | 0 |
| Set 2.3 | cccagatgggtcctgtcccaggtgca | 66.7% | −1.3 | −2.6 |
| Set 2.4 | caaggagtctgttccgaggtgca | 58.3% | −0.8 | −0.3 |
The extent of the primer 3′ GC clamp is indicated in bold. Primer IDs prefixed with Set 1 denote primers from Set 1, while those prefixed with Set 2 denote primers from Set 2.
Comparison of logistic regression models without (LR1) and with (LR2) correction for the association between ΔG and i, as well as TMM, which was defined using feature selection.
| Feature | LR1 Estimate | LR1 p-value | LR2 Estimate | LR2 p-value | TMM Estimate | TMM p-value |
|---|---|---|---|---|---|---|
| Intercept | −2.86 | | −5.76 | | −5.6177 | |
| | −0.50 | 0.058 | −0.187 | 0.4929 | — | — |
| | −0.00 | 0.977 | −0.144 | 0.6164 | — | — |
| | −0.92 | | −0.424 | 0.1359 | — | — |
| | −0.97 | | −0.46 | 0.1340 | — | — |
| | 0.04 | 0.894 | 0.574 | 0.1085 | — | — |
| | −1.57 | | −0.659 | 0.1069 | — | — |
| | NA | NA | NA | NA | — | — |
| | −0.83 | | −1.576 | | −1.5448 | |
| | — | — | 0.400 | 0.0829 | 0.3279 | 0.0818 |
| | — | — | 0.180 | | 0.1837 | |
NA indicates features that could not be estimated due to singularities. Dashes indicate features that were not considered by a model. Asterisks and bold font indicate significant features. Based on an initial significance threshold of 0.05, the following multiple hypothesis testing adjusted thresholds were used (Bonferroni): 0.05/9 = 0.0056 (LR1), 0.05/11 = 0.0045 (LR2), and 0.05/4 = 0.0125 (TMM).
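The adjusted thresholds quoted above follow from dividing the initial threshold by the number of tested terms per model. A small sketch of that arithmetic (the helper name `bonferroni_threshold` is illustrative, not taken from the authors' code):

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the Bonferroni-adjusted significance threshold alpha / n_tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Number of tested terms per model, as stated in the table note
tested_terms = {"LR1": 9, "LR2": 11, "TMM": 4}
thresholds = {model: bonferroni_threshold(0.05, k) for model, k in tested_terms.items()}

# A p-value of 0.0818 (reported for one TMM term) does not pass the TMM threshold
passes = 0.0818 < thresholds["TMM"]
```

Rounded to four decimals, `thresholds` reproduces 0.0056, 0.0045, and 0.0125, and `passes` is `False`.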
Optimized cutoffs for the considered models for predicting PCR amplification.
| Model | Cutoff interpretation | Cutoff s | Cutoff Y |
|---|---|---|---|
| TMM | Probability of amplification | 83.9% | 46.1% |
| DE | Efficiency of PCR | 9.71e-05 | 1.88e-05 |
| FE | Free energy of annealing | −6.05 | −4.83 |
The column Cutoff interpretation indicates the type of values to which the cutoffs were applied. The column for cutoff s indicates the cutoff that was selected so as to ensure an empirical specificity of at least 99%. The column for cutoff Y indicates the cutoff that maximized Youden's index.
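The two cutoff criteria can be illustrated with a short sketch. It assumes that higher scores indicate amplification (so free energies such as those of FE would need to be negated first), that candidate cutoffs are restricted to observed scores, and that both classes are present. The function names and the toy data are hypothetical, and the authors' actual optimization procedure may differ.

```python
from typing import Optional, Sequence, Tuple


def confusion_rates(scores: Sequence[float], labels: Sequence[int], cutoff: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when predicting positive for score >= cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def youden_cutoff(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Observed score that maximizes Youden's index J = sensitivity + specificity - 1."""
    def youden_j(cutoff: float) -> float:
        sens, spec = confusion_rates(scores, labels, cutoff)
        return sens + spec - 1
    return max(sorted(set(scores)), key=youden_j)


def specificity_cutoff(scores: Sequence[float], labels: Sequence[int],
                       min_specificity: float = 0.99) -> Optional[float]:
    """Smallest observed score whose specificity reaches min_specificity, or None."""
    for cutoff in sorted(set(scores)):
        if confusion_rates(scores, labels, cutoff)[1] >= min_specificity:
            return cutoff
    return None


# Toy data (hypothetical): predicted probabilities and true labels (1 = Amplified)
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
cutoff_y = youden_cutoff(toy_scores, toy_labels)
cutoff_s = specificity_cutoff(toy_scores, toy_labels)
```

On the toy data, the specificity-driven cutoff is stricter than the Youden cutoff, which mirrors the relation between the two cutoff columns in the table.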
Model performance in terms of the AUC when validating models on test set observations from individual primer sets.
| Test set | TMM | DE | FE |
|---|---|---|---|
| Overall | 0.954 | 0.896 | 0.941 |
| Set 1 | 0.938 | 0.863 | 0.923 |
| Set 2 | 0.980 | 0.941 | 0.980 |
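For reference, the AUC reported in this table can be computed from scores and labels as the probability that a randomly chosen amplified pair receives a higher score than a randomly chosen unamplified pair. The sketch below uses this pairwise definition with ties counted as one half. It is an illustration on hypothetical toy data, not the evaluation code used for the table, and it assumes higher scores indicate amplification.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): predicted probabilities and true labels (1 = Amplified)
auc_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
auc_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
toy_auc = roc_auc(auc_scores, auc_labels)
```

The pairwise form is quadratic in the number of observations, which is acceptable for data sets of a few hundred pairs. Rank-based formulas give the same value more efficiently.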
Figure 4. Performance of three models for identifying primer amplification events. TMM indicates our newly developed logistic regression model, DE refers to the approach from DECIPHER, and FE is solely based on the free energy of annealing. Models subscripted with s use cutoffs optimized for high specificity, while models subscripted with Y use cutoffs optimized for overall performance.
Interpretation of variables used in the formulation of the TMM model.
| Term | Interpretation |
|---|---|
| | Estimated likelihood of amplification |
| ln | Log odds of amplification |
| | Model weights |
| ΔG | Free energy of annealing [kcal/mol] |
| i | Position of 3′ hexamer mismatch closest to 3′ terminus of the PTP |
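The terms listed above fit the standard logistic regression form. As a sketch only: the symbols \hat{p} and \beta_j are notation chosen here, and the assignment of TMM's four weights to an intercept, ΔG, i, and a ΔG-by-i interaction is an assumption rather than something the table confirms.

```latex
\ln\frac{\hat{p}}{1-\hat{p}}
  = \beta_0 + \beta_1\,\Delta G + \beta_2\, i + \beta_3\,\Delta G \cdot i,
\qquad
\hat{p} = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta_1\,\Delta G + \beta_2\, i + \beta_3\,\Delta G \cdot i)\bigr)}
```

The second expression is the first one solved for \hat{p}, so a given pair of ΔG and i maps to an estimated probability between 0 and 1.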
Figure 5. Visualization of the TMM model. Individual dots show the prediction function of the model. Red dots indicate low probabilities of amplification while blue dots indicate high probabilities. The rectangles show the model estimate for the observations contained in the data set. Here, red points indicate primer-template pairs that were labeled as Unamplified, while blue points indicate observations labeled as Amplified.
Distribution of data set labels.
| Data set | N | y = Amplified | y = Unamplified |
|---|---|---|---|
| Full | 908 (100%) | 382 (42.1%) | 526 (57.9%) |
| Validation | 227 (25%) | 96 (42.3%) | 131 (57.7%) |
| Training | 454 (50%) | 197 (43.4%) | 256 (56.6%) |
| Testing | 227 (25%) | 92 (40.5%) | 135 (59.5%) |
The total number of observations N and their labels y are shown for the full data set and the constructed subsets for validation, training, and testing.
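A 50%/25%/25% partition with the subset sizes shown in the table can be sketched as follows. How the subsets were actually drawn is not stated here, so the plain seeded random split below is an assumption, and `split_indices` is a hypothetical helper name.

```python
import random
from typing import Dict, List


def split_indices(n: int, seed: int = 0) -> Dict[str, List[int]]:
    """Randomly partition range(n) into training (50%), validation (25%), testing (rest)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n // 2
    n_val = n // 4
    return {
        "training": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "testing": indices[n_train + n_val:],
    }


# 908 observations, as in the full data set
subsets = split_indices(908)
sizes = {name: len(idx) for name, idx in subsets.items()}
```

For 908 observations, this yields subsets of 454, 227, and 227 indices. A stratified split would additionally keep the label proportions similar across subsets.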