Xingyuan Li, Zhili He, Jizhong Zhou.
Abstract
The specificity of an oligonucleotide for microarray hybridization can be predicted from its sequence identity to non-targets, its continuous stretch shared with non-targets, and/or its binding free energy to non-targets. Most currently available programs use only one or two of these criteria, and may therefore select 'false' specific oligonucleotides or miss 'true' optimal probes in a considerable proportion of cases. We have developed a software tool, called CommOligo, that uses new algorithms and all three criteria to select optimal oligonucleotide probes. A series of filters, including sequence identity, free energy, continuous stretch, GC content, self-annealing, distance to the 3'-untranslated region (3'-UTR) and melting temperature (T(m)), is used to check each possible oligonucleotide. Sequence identity is calculated from gapped global alignments. A traversal algorithm is used to generate alignments for free energy calculation. The optimal T(m) interval is determined from the probe candidates that have passed all other filters. Final probes are picked using a combination of user-configurable piece-wise linear functions and an iterative process. The thresholds for the identity, stretch and free energy filters are determined automatically from experimental data by an accessory software tool, CommOligo_PE (CommOligo Parameter Estimator). The program was used to design probes for both whole-genome and highly homologous sequence data. CommOligo and CommOligo_PE are freely available to academic users upon request.
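As a rough illustration of two of the simpler filters named in the abstract (GC content and continuous stretch), the following Python sketch may help. It is not CommOligo's code: the function names, the `gc_range` default and the way the filters are combined are assumptions made for this example, and only the stretch limit of 17 is taken from the cutoff table.

```python
def gc_content(probe: str) -> float:
    """Fraction of G or C bases in the probe (0.0 for an empty string)."""
    probe = probe.upper()
    if not probe:
        return 0.0
    return (probe.count("G") + probe.count("C")) / len(probe)


def longest_common_stretch(probe: str, non_target: str) -> int:
    """Length of the longest contiguous run of bases shared by both sequences."""
    best = 0
    prev = [0] * (len(non_target) + 1)
    for p in probe:
        curr = [0] * (len(non_target) + 1)
        for j, t in enumerate(non_target, start=1):
            if p == t:
                # Extend the run that ended at the previous base of each sequence.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def passes_basic_filters(probe, non_targets, gc_range=(0.40, 0.60), max_stretch=17):
    """Hypothetical combined check; gc_range is an assumed default, not from the paper."""
    low, high = gc_range
    if not low <= gc_content(probe) <= high:
        return False
    return all(longest_common_stretch(probe, nt) <= max_stretch for nt in non_targets)
```

For example, `passes_basic_filters("ACGTACGT", ["TTACGTAA"], max_stretch=4)` rejects the probe because it shares a 5-base run with the non-target.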
Year: 2005; PMID: 16246912; PMCID: PMC1266071; DOI: 10.1093/nar/gki914
Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971
Figure 1. Flowchart for CommOligo.
Figure 2. Directed graph of the dynamic programming matrix for alignment of sequence ACCAA and ACGGA (A), and the traversal of the dynamic programming matrix for alignment of sequence ACCAA and ACGGA with two mismatches (B).
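The dynamic programming matrix in Figure 2 can be illustrated with a textbook gapped global alignment. The sketch below is a plain Needleman-Wunsch implementation, not CommOligo's traversal algorithm; the scoring values (match +1, mismatch −1, gap −1) and the definition of identity as matches divided by alignment columns are assumptions chosen for the example.

```python
def global_identity(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Return (alignment score, identity) for one optimal gapped global alignment."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal path, counting matched columns and total columns.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                if a[i - 1] == b[j - 1]:
                    matches += 1
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    identity = matches / columns if columns else 0.0
    return score[n][m], identity
```

With these assumed scores, `global_identity("ACCAA", "ACGGA")` aligns the two sequences without gaps, giving three matches and two mismatches, consistent with the two-mismatch case in panel B.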
Table 1. Cutoff values for five sets of criteria estimated by CommOligo_PE
| Signal threshold (%) | Criteria | NPV (%) | Coverage (%) |
|---|---|---|---|
| 5 | Identity | | |
| 5 | Stretch | | |
| 5 | Energy ≥ −12.00 | 96.2 | 30.1 |
| 5 | Identity and stretch | | |
| 5 | Identity ≤ 0.85, stretch ≤ 13 and energy ≥ −12.00 | 96.2 | 30.1 |
| 8 | Identity ≤ 0.77 | 96.6 | 28.3 |
| 8 | Stretch | | |
| 8 | Energy ≥ −19.00 | 95.8 | 46.5 |
| 8 | Identity ≤ 0.81 and stretch ≤ 12 | 97.4 | 38.4 |
| 8 | Identity ≤ 0.87, stretch ≤ 17 and energy ≥ −24.00 | 95.0 | 57.6 |
| 10 | Identity ≤ 0.77 | 96.6 | 27.5 |
| 10 | Stretch ≤ 11 | 95.7 | 43.1 |
| 10 | Energy ≥ −19.00 | 95.8 | 45.1 |
| 10 | Identity ≤ 0.87 and stretch ≤ 11 | 95.7 | 43.1 |
| 10 | Identity ≤ 0.87, stretch ≤ 17 and energy ≥ −29.00 | 96.0 | 69.6 |
The optimization goal was set to automatic mode, and the minimal NPV was set to 95%. Blank cells indicate that no cutoff values were found in the search range under these constraints. Data from Rhee et al. (10) and He et al. (11) in the Supplementary Data were used for training. When the optimization goal was changed to maximizing coverage, ‘identity ≤ 0.81 and stretch ≤ 12’ changed to ‘identity ≤ 0.81 and stretch ≤ 18’, and ‘identity ≤ 0.87, stretch ≤ 17 and energy ≥ −29.00’ changed to ‘identity ≤ 0.87, stretch ≤ 17 and energy ≥ −32.00’, while the other cutoffs remained the same.
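The kind of search described in this note can be sketched as a scan over candidate cutoffs that keeps the one with the highest coverage among those meeting the minimal NPV. This is only an illustration, not the CommOligo_PE procedure. It assumes that a pair is "predicted negative" when its value falls on the permissive side of the cutoff, that NPV is the fraction of predicted negatives that are truly negative (signal below the threshold), and that coverage is the fraction of all true negatives that are predicted negative. All function and parameter names are hypothetical.

```python
def npv_and_coverage(values, is_negative, cutoff, at_most=True):
    """NPV and coverage for one cutoff, or None when either is undefined.

    at_most=True predicts negative when value <= cutoff (identity, stretch);
    at_most=False predicts negative when value >= cutoff (free energy).
    """
    picked = [neg for v, neg in zip(values, is_negative)
              if (v <= cutoff if at_most else v >= cutoff)]
    total_negative = sum(is_negative)
    if not picked or total_negative == 0:
        return None
    true_negative = sum(picked)
    return true_negative / len(picked), true_negative / total_negative


def best_cutoff(values, is_negative, candidates, min_npv=0.95, at_most=True):
    """Return (cutoff, npv, coverage) maximizing coverage with npv >= min_npv.

    Returns None when no candidate qualifies, mirroring a blank table cell.
    """
    best = None
    for cutoff in candidates:
        result = npv_and_coverage(values, is_negative, cutoff, at_most)
        if result is None:
            continue
        npv, coverage = result
        if npv >= min_npv and (best is None or coverage > best[2]):
            best = (cutoff, npv, coverage)
    return best
```

For instance, with identities `[0.5, 0.6, 0.7, 0.8, 0.9]` where only the pair at 0.8 shows signal, the scan over `[0.7, 0.8, 0.9]` settles on 0.7, since the looser cutoffs drop below the 95% NPV floor.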
Table 2. Estimated criteria with cross-validation
| Signal threshold (%) | Criteria | Training NPV (%) | Training coverage (%) | Testing NPV (%) | Testing coverage (%) | No. of runs with cutoffs generated |
|---|---|---|---|---|---|---|
| 5 | Identity | | | | | 0 |
| 5 | Stretch | | | | | 0 |
| 5 | Energy ≥ −12.40 | 95.9 | 31.2 | 93.3 | 30.8 | 10 |
| 5 | Identity ≤ 0.81 and stretch ≤ 11 | 96.2 | 34.1 | 70.8 | 39.1 | 2 |
| 5 | Identity ≤ 0.85, stretch ≤ 13 and energy ≥ −13.00 | 95.9 | 32.0 | 91.7 | 30.9 | 10 |
| 8 | Identity ≤ 0.77 | 96.3 | 31.3 | 82.5 | 31.3 | 10 |
| 8 | Stretch ≤ 11 | 95.2 | 43.9 | 66.7 | 33.3 | 2 |
| 8 | Energy ≥ −20.00 | 95.8 | 48.8 | 91.7 | 55.0 | 10 |
| 8 | Identity ≤ 0.81 and stretch ≤ 14 | 96.5 | 42.5 | 85.0 | 38.7 | 10 |
| 8 | Identity ≤ 0.83, stretch ≤ 17 and energy ≥ −29.20 | 96.8 | 58.1 | 89.1 | 58.6 | 10 |
| 10 | Identity ≤ 0.77 | 96.0 | 28.8 | 98.0 | 30.3 | 10 |
| 10 | Stretch ≤ 11.00 | 95.9 | 43.2 | 96.3 | 44.5 | 10 |
| 10 | Energy ≥ −20.10 | 95.9 | 47.9 | 91.7 | 48.6 | 10 |
| 10 | Identity ≤ 0.84 and stretch ≤ 13 | 96.0 | 44.2 | 91.3 | 36.0 | 10 |
| 10 | Identity ≤ 0.87, stretch ≤ 17 and energy ≥ −30.70 | 95.7 | 72.2 | 93.3 | 72.6 | 10 |
Data were partitioned into 10 subsets. Values shown are averages. Settings were the same as in Table 1.
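A 10-subset partition of this kind is commonly implemented as k-fold cross-validation: each subset is held out once for testing while cutoffs are estimated on the remaining nine. The sketch below shows only the partitioning step. How the original data were actually split (randomly or otherwise) is not stated here, so the seeded shuffle and the function name are assumptions.

```python
import random


def k_fold_splits(n_items: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    The seeded shuffle is an assumption; any fixed partition would work.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k subsets of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test
```

Averaging a metric such as testing NPV over the 10 resulting runs would give values of the kind reported in the table, with runs that produce no cutoff left out of the average.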
Figure 3. The relationships between sequence identity (A), stretch length (B) or binding free energy (C) and the number of designed probes. Vertical lines x = 0.87, x = 17 and x = −29 indicate the fitted thresholds for probe design criteria.
Table 3. Number and quality of designed probes by different programs
| Programs used | Whole-genome sequences of | | | | Group sequences of | | | |
|---|---|---|---|---|---|---|---|---|
| | ORFs rejected | Probes designed | Sim. ≤ 87% and Str. ≤ 17 | Sim. > 87% or Str. > 17 | ORFs rejected | Probes designed | Sim. ≤ 87% and Str. ≤ 17 | Sim. > 87% or Str. > 17 |
| ArrayOligoSelector | 7 | 1759 | 1463 | 296 | 0 | 842 | 146 | 696 |
| OligoArray | 68 | 1698 | 1670 | 28 | 35 | 807 | 112 | 695 |
| OligoArray 2.0 | 68 | 1698 | 1506 | 292 | 51 | 791 | 55 | 736 |
| OligoPicker | 42 | 1724 | 1721 | 3 | 673 | 169 | 162 | 7 |
| CommOligo | 32 | 1734 | 1734 | 0 | 655 | 187 | 187 | 0 |
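The two quality columns in this table split each program's probes by whether they satisfy both the similarity and stretch limits. A minimal sketch of that tally follows, assuming each probe is summarized by its maximal similarity (in percent) and maximal stretch length to any non-target. The function name, the input format and the `within`/`outside` labels are hypothetical.

```python
def tally_probe_quality(probes, max_similarity=87.0, max_stretch=17):
    """Count probes within both limits versus probes exceeding either one.

    probes: iterable of (similarity_percent, stretch_length) pairs.
    """
    counts = {"within": 0, "outside": 0}
    for similarity, stretch in probes:
        if similarity <= max_similarity and stretch <= max_stretch:
            counts["within"] += 1  # Sim. <= 87% and Str. <= 17
        else:
            counts["outside"] += 1  # Sim. > 87% or Str. > 17
    return counts
```

For example, `tally_probe_quality([(80, 10), (90, 10), (85, 18), (87, 17)])` places the first and last probes within the limits and the other two outside.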