Reidar Andreson, Tõnu Möls, Maido Remm.
Abstract
We have developed statistical models for estimating the failure rate of polymerase chain reaction (PCR) primers using 236 primer sequence-related factors. The model involved 1314 primer pairs and is based on more than 80 000 PCR experiments. We found that the most important factor in determining PCR failure is the number of predicted primer-binding sites in the genomic DNA. We also compared different ways of defining primer-binding sites (fixed-length word versus thermodynamic model; exact match versus matches including 1-2 mismatches). We found that the most efficient prediction of PCR failure rates can be achieved using a combination of four factors (the number of primer-binding sites counted in different ways plus the GC% of the primer) combined into a single statistical model, GM1. According to our estimates from experimental data, the GM1 model can reduce the average failure rate of PCR primers nearly 3-fold (from 17% to 6%). The GM1 model can easily be implemented in software to premask genome sequences for potentially failing PCR primers, thus improving large-scale PCR-primer design.
Year: 2008 PMID: 18492719 PMCID: PMC2441781 DOI: 10.1093/nar/gkn290
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
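The fixed-length-word, exact-match definition of a primer-binding site mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the default word size of 16 nt, and the reading of MAX/MIN as the larger and smaller count within a primer pair are assumptions made here, and primers are assumed to be written 5′→3′.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def count_occurrences(text: str, word: str) -> int:
    """Count occurrences of word in text, including overlapping ones."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def count_binding_sites(primer: str, genome: str, word_size: int = 16) -> int:
    """Count exact matches of the primer's 3'-terminal word on both strands.

    Searching the given strand for the word and for its reverse complement
    covers both strands. A palindromic word is therefore counted twice.
    """
    word = primer.upper()[-word_size:]
    genome = genome.upper()
    return (count_occurrences(genome, word)
            + count_occurrences(genome, reverse_complement(word)))


def pair_max_min(forward: str, reverse: str, genome: str, word_size: int = 16):
    """Hypothetical MAX/MIN factors: larger and smaller count in a primer pair."""
    counts = [count_binding_sites(p, genome, word_size) for p in (forward, reverse)]
    return max(counts), min(counts)


def gc_percent(seq: str) -> float:
    """GC content of a sequence, in percent."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)
```

For a real genome a linear scan per primer is far too slow; an indexed word list would be needed, which this sketch deliberately leaves out.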
The complete list of factors used in the study for building the models
| Factor description | Factor name | GM1 | GM1MM | GM2 | GM2MM | PCR | Number of factors |
|---|---|---|---|---|---|---|---|
| The number of binding sites of PCR primers (exact, with one and two mismatches allowed) with different word sizes from the 3′-end and from random positions in the primer sequence. | MAX/MIN[ | + | + | + | 12 | ||
| MAX/MIN_FULL; | + | 2 | |||||
| MAX/MIN[ | + | 12 | |||||
| MAX/MIN[ | + | + | 8 | ||||
| MAX/MIN_FULL_1MM; | + | 2 | |||||
| MAX/MIN[ | + | 8 | |||||
| MAX/MIN[ | + | + | 8 | ||||
| MAX/MIN_FULL_2MM; | + | 2 | |||||
| MAX/MIN[ | + | 8 | |||||
| The number of binding sites of PCR primers (exact, with one and two mismatches allowed) with variable word sizes from the 3′-end and from random positions in the primer sequence. The word size for each primer is extended until three different free energy levels are achieved: ΔG < −10, −15, −20 kcal/mol. | MAX/MIN_DG[ | + | + | + | 6 | ||
| MAX/MIN_DG[ | + | 6 | |||||
| MAX/MIN_DG[ | + | + | 6 | ||||
| MAX/MIN_DG[ | + | 6 | |||||
| MAX/MIN_DG[ | + | + | 6 | ||||
| MAX/MIN_DG[ | + | 6 | |||||
| The number of all binding sites of PCR primers (exact, with one and two mismatches allowed) counted with NCBI BLASTN (-F F). | [MAX,MIN]_BLASTALL | + | 2 | ||||
| PCR primer length | PRIM_LENGTH_[MAX,MIN] | + | 2 | ||||
| GC content of PCR primer with different word sizes from the 3′-end and full primer | PRIM_GC_PRC_[ | + | + | + | + | + | 6 |
| PRIM_GC_PRC_[MAX,MIN] | + | 2 | |||||
| The free energies of different subsequences from the primer 3′-end | PRIM_DG[ | + | + | + | + | + | 14 |
| DUST score of PCR primer | PRIM_DUS_[MAX,MIN] | + | 2 | ||||
| The strongest free energies of the dimers of primers alone and in pairs using local and global alignment approaches | MAX/MIN_PRIM_END1; | + | 2 | ||||
| PRIM_PAIR_END1; | + | 1 | |||||
| MAX/MIN_PRIM_END2; | + | 2 | |||||
| PRIM_PAIR_END2; | + | 1 | |||||
| MAX/MIN_PRIM_ANY; | + | 2 | |||||
| PRIM_PAIR_ANY | + | 1 | |||||
| The strongest secondary structure of the PCR primers in a given pair predicted with MFOLD at 55°C | [MAX,MIN]_PRIM_MFOLD | + | 2 | ||||
| The Tm of the primer, difference of melting temperatures between the two primers in a given pair and the difference between annealing (used in PCR experiments) and melting temperature | TM_[MAX,MIN]; | + | 2 | ||||
| TM_DIFF; | + | 1 | |||||
| TM_TA_[MAX,MIN]_DIFF | + | 2 | |||||
| Total number of SNPs in both primers and the position of the SNP closest to the 3′-end | NO_OF_SNPS; | + | + | + | + | + | 1 |
| ALL_POS_FROM_3_END; | + | + | + | + | + | 1 | |
| NO_OF_VALID_SNPS; | + | + | + | + | + | 1 | |
| VALID_POS_FROM_3_END | + | + | + | + | + | 1 | |
| The terminal and last two nucleotides of primer sequence, also the first nucleotide of amplicon following the primer sequence. These are categorical values (0 – given nuc. is not present in both primers, 1 – is present at least in one primer, 2 – is present in both primers). | PRIM_LAST_ONE_NUC_[A,C,G,T]; | + | 4 | ||||
| PRIM_LAST_TWO_NUC_[AA,AC, AG,AT,CC,CG,CT,GG,GT,TT]; | + | 10 | |||||
| PROD_FIRST_ONE_NUC_[A,C,G,T] | + | 4 | |||||
| The number of predicted products with maximum length of 1000, 3000 and 10 000 nt for exact binding sites with different word sizes from the 3′-end and from random positions in the primer sequence. | PROD[ | + | 6 | ||||
| PROD_FULL_1000; | + | 1 | |||||
| PROD[ | + | 6 | |||||
| PROD[ | + | 6 | |||||
| PROD_FULL_3000; | + | 1 | |||||
| PROD[ | + | 6 | |||||
| PROD[ | + | 6 | |||||
| PROD_FULL_10000; | + | 1 | |||||
| PROD[ | + | 6 | |||||
| The number of predicted products with maximum length of 1000, 3000 and 10 000 nt for exact binding sites with variable word sizes from the 3′-end and from random positions in the primer sequence. The word size for each primer is extended until three different free energy levels are achieved: ΔG < −10, −15, −20 kcal/mol. | PROD_DG[ | + | 3 | ||||
| PROD_DG[ | + | 3 | |||||
| PROD_DG[ | + | 3 | |||||
| PROD_DG[ | + | 3 | |||||
| PROD_DG[ | + | 3 | |||||
| PROD_DG[ | + | 3 | |||||
| PCR product length | PROD_LENGTH | + | 1 | ||||
| GC content of PCR product | PROD_GC_PRC | + | 1 | ||||
| Area under the GC curve and above 65% of the PCR product (7) | PROD_AUCGC | + | 1 | ||||
| Number of GC windows with values above 65% divided by the length of the PCR product (×100) (7) | PROD_RATIOGC_100 | + | 1 | ||||
| PROD_AUCGC × PROD_RATIOGC (7) | PROD_AUCGC2 | + | 1 | ||||
| The strongest secondary structure of PCR product predicted with MFOLD at 55°C | PROD_MFOLD_55 | + | 1 | ||||
| Percentage of masked nucleotides of PCR product using DUST | PROD_DUST_PRC | + | 1 | ||||
| Percentage of masked nucleotides of PCR product using Repeat Masker with different sensitivity parameters (-s, -q, -qq) | PROD_RMs_PRC; | + | 1 | ||||
| PROD_RMq_PRC; | + | 1 | |||||
| PROD_RMqq_PRC | + | 1 | |||||
| Percentage of masked nucleotides of PCR product using GenomeMasker with different word sizes (exact matches) | PROD_GM[ | + | 5 | ||||
Factors marked with ‘+’ under a model were used in building that model.
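The 0/1/2 coding described in the table for the terminal-nucleotide factors can be sketched as below. This is an illustration under stated assumptions: primers are taken to be written 5′→3′, so the last character is the 3′-terminal nucleotide, and the function name is hypothetical. Only the PRIM_LAST_ONE_NUC_[A,C,G,T] factors are shown.

```python
def last_nucleotide_factors(forward: str, reverse: str) -> dict:
    """Code each 3'-terminal nucleotide as 0, 1 or 2.

    0 = neither primer ends with the nucleotide,
    1 = exactly one primer ends with it,
    2 = both primers end with it.
    """
    factors = {}
    for nuc in "ACGT":
        # Each endswith() test contributes 0 or 1, so the sum is 0, 1 or 2.
        factors[f"PRIM_LAST_ONE_NUC_{nuc}"] = (
            int(forward.upper().endswith(nuc))
            + int(reverse.upper().endswith(nuc))
        )
    return factors
```

Because each primer has exactly one terminal nucleotide, the four values always sum to 2 for a primer pair.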
Figure 1. The distribution of factors between different model types.
List of the best factors (top 4) and the corresponding one-degree-of-freedom chi-squares (χ²(1)) from the GENMOD Type I analysis using the whole dataset
| Factor name | χ²(1) | Model |
|---|---|---|
| MAX_DG15_2MM | 4862 | PCR |
| MAX_DG15_RAND*MAX_DG15_RAND | 1374 | PCR |
| PROD_DG20_1000_RAND*PROD_DG20_1000_RAND | 378 | PCR |
| PROD_LENGTH*PROD_LENGTH | 298 | PCR |
| MAX_DG15_2MM | 4862 | GM2MM |
| MAX_DG15_1MM*MAX_DG15_1MM | 1091 | GM2MM |
| MAX_DG20_2MM | 244 | GM2MM |
| MAX_DG20_1MM*MAX_DG20_1MM | 262 | GM2MM |
| MAX_DG20 | 4085 | GM2 |
| MAX_DG15*MAX_DG15 | 1106 | GM2 |
| PRIM_GC_PRC_8_MIN | 386 | GM2 |
| MIN_DG20*MIN_DG20 | 277 | GM2 |
| MAX15_2MM | 2854 | GM1MM |
| MAX12_1MM*MAX12_1MM | 1681 | GM1MM |
| MAX12 | 1291 | GM1MM |
| PRIM_GC_PRC_16_MAX | 789 | GM1MM |
| MAX16 | 2507 | GM1 |
| MAX15*MAX15 | 2394 | GM1 |
| PRIM_GC_PRC_16_MAX | 1126 | GM1 |
| MAX14 | 272 | GM1 |
All factors are significant at P < 0.0001. An asterisk in a factor name marks a polynomial regression term of the given independent variable. The χ² values illustrate the estimated simultaneous (Type I) effects of the best four factors in each model.
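As a reading aid for the χ²(1) values above: the source names the GENMOD Type I analysis but does not spell out the statistic. Assuming the usual sequential likelihood-ratio definition of a Type I analysis, the value reported for the k-th factor entered into a model would be the following, where the symbols ℓ and M_k are notation introduced here and not taken from the source.

```latex
% M_k : model containing the first k factors, entered in the listed order
% \ell(M_k) : maximized log-likelihood of model M_k
\chi^2_k(1) \;=\; 2\,\bigl[\ell(M_k) - \ell(M_{k-1})\bigr], \qquad k = 1, \dots, 4
```

Under this definition each value depends on the order in which factors are entered, which would explain why the values within a model are not necessarily listed in decreasing order.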
Figure 2. Comparison of five model types at different sensitivity levels. The figure illustrates the efficiency of the models in predicting the actual failure rate at several cutoff levels with a single factor (A) or four factors (B) included in the model. Columns illustrate the average failure rate of the remaining primer pairs in 10 control sets using the given cutoff. The dashed line marks the average PCR failure rate of the whole dataset before applying any model. Error bars show 95% confidence limits for the real mean.
Figure 3. The relationship between some of the statistically significant factors and the PCR failure rate. (A) The binding sites of the primer pairs were counted with the most descriptive factor for each model: maximum number of hits with two mismatches allowed and free energy level ≤ −15 kcal/mol (PCR and GM2MM models); no mismatches allowed and free energy level ≤ −20 kcal/mol (GM2 model); two mismatches allowed and word size = 15 nt (GM1MM model); no mismatches allowed and word size = 16 nt (GM1 model). The effects of the primer GC content (B) and the number of predicted PCR products (C) are also shown. Error bars show 95% confidence limits for the real mean.
Comparison of our GM1 model with other masking methods
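| Measure | No masking | RepeatMasker | GenomeMasker | GM1 with 4 factors, 10% | GM1 with 4 factors, 20% | GM1 with 4 factors, 30% | GM1 with 4 factors, 40% | GM1 with 4 factors, 50% |
|---|---|---|---|---|---|---|---|---|
| Failing primer pairs (%) (a) | 16.9 | 13.8 | 10.2 | 6.0 | 8.9 | 9.5 | 11.2 | 11.4 |
| Regions with at least one designable primer pair (%) (b) | 100 | 69.3 | 96.4 | 98.5 | 99.3 | 99.6 | 99.6 | 99.7 |
| Masked nucleotides in genome (%) (c) | 0 | 49.5 | 52.2 | 80.7 | 59.7 | 51.4 | 45.5 | 35.6 |
| Masked nucleotides in introns (%) (c) | 0 | 12.5 | 38.8 | 83.7 | 63.4 | 54.5 | 47.7 | 37.6 |
| Masked nucleotides in exons (%) (c) | 0 | 4.5 | 28.2 | 86.0 | 69.7 | 61.1 | 54.0 | 40.4 |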
(a) Fraction of failing primer pairs after using the given masking method, calculated from the experimental data of 1314 primer pairs.
(b) Fraction of masked genomic regions for which at least one primer pair could be designed, using 1000 random regions from the human genome (each 1000 nt long). Primer3 was used to design primer pairs.
(c) Fraction of masked nucleotides in three different random sequence sets: 1000 genomic regions (each 1000 nt long), 1000 exonic sequences (average length 150 nt) and 1000 intronic sequences (average length 400 nt). With GenomeMasker and GM1 we used an option to mask only one nucleotide from the 3′-end of the repeated word.
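The "nearly 3-fold" reduction quoted in the abstract can be checked against the table with a short calculation. The sketch below assumes that the first data row holds the failure rates in percent (footnote a), in the column order no masking, RepeatMasker, GenomeMasker, then GM1 at 10% to 50%. The dictionary keys and the function name are labels chosen here.

```python
# Failure rates (%) as read from the first data row of the table (assumed order).
failure_rate = {
    "No masking": 16.9,
    "RepeatMasker": 13.8,
    "GenomeMasker": 10.2,
    "GM1 10%": 6.0,
    "GM1 20%": 8.9,
    "GM1 30%": 9.5,
    "GM1 40%": 11.2,
    "GM1 50%": 11.4,
}


def fold_reduction(baseline: float, after: float) -> float:
    """How many times lower the failure rate is after masking."""
    return baseline / after


baseline = failure_rate["No masking"]
folds = {
    method: round(fold_reduction(baseline, rate), 2)
    for method, rate in failure_rate.items()
    if method != "No masking"
}

for method, fold in folds.items():
    print(f"{method}: {fold}-fold")
```

On this reading, GM1 at the 10% level gives 16.9 / 6.0, or roughly a 2.8-fold reduction, which matches the abstract's rounded "17% to 6%".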