Marc Sturm, Sascha Quinten, Christian G Huber, Oliver Kohlbacher.
Abstract
We propose a new model for predicting the retention time of oligonucleotides. The model is based on ν-support vector regression (ν-SVR) using features derived from the base sequence and the predicted secondary structure of oligonucleotides. Because it uses secondary structure information, the model is applicable even at relatively low temperatures, where the secondary structure is not suppressed by thermal denaturing. This makes it possible to predict oligonucleotide retention times at arbitrary temperatures, provided that the target temperature lies within the temperature range of the training data. We describe different ways of calculating features from base sequence and secondary structure, present the results, and compare our model to existing models.
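For orientation, the textbook primal formulation of ν-support vector regression is sketched below. The notation is the standard one from the SVM literature and is an assumption here, since the abstract does not spell it out: $x_i$ would be the feature vector of oligonucleotide $i$, $y_i$ its measured retention time, $\phi$ the kernel feature map, $\ell$ the number of training points, and $C$ and $\nu$ the hyperparameters.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*},\,\varepsilon}\quad
  & \tfrac{1}{2}\lVert w\rVert^{2}
    + C\Bigl(\nu\varepsilon + \tfrac{1}{\ell}\sum_{i=1}^{\ell}\bigl(\xi_i + \xi_i^{*}\bigr)\Bigr) \\
\text{subject to}\quad
  & \bigl(w^{\top}\phi(x_i) + b\bigr) - y_i \le \varepsilon + \xi_i, \\
  & y_i - \bigl(w^{\top}\phi(x_i) + b\bigr) \le \varepsilon + \xi_i^{*}, \\
  & \xi_i \ge 0,\quad \xi_i^{*} \ge 0,\quad \varepsilon \ge 0 .
\end{aligned}
```

In this formulation the tube width $\varepsilon$ is optimized rather than fixed, and $\nu \in (0, 1]$ acts as an upper bound on the fraction of training points outside the tube and a lower bound on the fraction of support vectors.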
Year: 2007 PMID: 17567619 PMCID: PMC1919494 DOI: 10.1093/nar/gkm338
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Sample separation of an oligonucleotide (second peak) with internal standards (dC)14 (first peak) and (dT)26 (third peak) at 80°C.
Overview of the model components used in this work
| Name | Features | Description |
|---|---|---|
| COUNT | 4 | Base frequencies (#A, #C, #G and #T in the sequence). |
| CONTACT | 16 | Dinucleotide frequencies (#CG, #CA, #CT, #CC, …). |
| SCONTACT | 10 | Dinucleotide frequencies, independent of direction (#CG+#GC, #CA+#AC, #CC, …). |
| PAIRED | 4 | Fraction of A, C, G and T inside stems. |
| UNPAIRED | 4 | Fraction of A, C, G and T outside stems. |
| STRUCTURE | 12 | Fraction of A, C, G and T in stems, in loops, or unpaired. |
| MULTISTRUCT | 6 | Fraction of bases in stems. |
| MULTITWO | 12 | Fraction of bases that are unpaired, and in stems or loops. |
| MULTIDETAIL | 36 | Fraction of bases in stems, and loops. Fraction of unpaired A, C, G and T. |
| SESUM | 1 | Sum of the stacking energies of adjacent bases. |
| STACKING | 2 | Sum of the enthalpic (ΔH) and entropic (TΔS) contributions to the free energy. |
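The three sequence-only components in the table (COUNT, CONTACT, SCONTACT) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, raw counts are used (whether the original model normalizes them is not stated here), and the input is assumed to be an uppercase string over A, C, G, T.

```python
from itertools import combinations_with_replacement, product

BASES = "ACGT"


def count_features(seq):
    """COUNT: number of A, C, G and T in the sequence (4 features)."""
    return {base: seq.count(base) for base in BASES}


def contact_features(seq):
    """CONTACT: counts of all 16 ordered dinucleotides (overlapping windows)."""
    feats = {a + b: 0 for a, b in product(BASES, repeat=2)}
    for i in range(len(seq) - 1):
        feats[seq[i:i + 2]] += 1
    return feats


def scontact_features(seq):
    """SCONTACT: dinucleotide counts independent of direction (10 features).

    For two different bases the counts of both orders are added
    (e.g. CG + GC); homodimers such as CC are counted once.
    """
    contact = contact_features(seq)
    feats = {}
    for a, b in combinations_with_replacement(BASES, 2):
        if a == b:
            feats[a + b] = contact[a + b]
        else:
            feats[a + b] = contact[a + b] + contact[b + a]
    return feats


if __name__ == "__main__":
    # Sequence taken from the caption of Figure 3.
    oligo = "GTGCTCAGTGTAGCCCAGGATGGG"
    print(count_features(oligo))
    print(scontact_features(oligo))
```

The feature counts match the table: 4 values for COUNT, 16 for CONTACT and 10 for SCONTACT, since the 6 unordered pairs of distinct bases and the 4 homodimers give 10 direction-independent dinucleotides.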
Figure 3. Predicted secondary structure of GTGCTCAGTGTAGCCCAGGATGGG at 40°C.
Example of the feature calculation from the predicted secondary structure
| Composition | 5 | 10 | 5 | 4 |
| Bases in stems | 3 | 3 | 1 | 1 |
| Fraction of bases in stems | 0.60 | 0.30 | 0.20 | 0.25 |
| Bases in loops | 0 | 2 | 0 | 1 |
| Fraction of bases in loops | 0.00 | 0.20 | 0.00 | 0.25 |
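The arithmetic in the table (bases of one type in a structural region divided by the total number of bases of that type) can be sketched as follows. Several things here are assumptions rather than facts from the source: the structure is given in dot-bracket notation, an unpaired base enclosed by at least one base pair is treated as "loop", an unpaired base outside all pairs is treated as "free", and the function names and the toy hairpin are hypothetical.

```python
BASES = "ACGT"


def classify_positions(dot_bracket):
    """Label every position as 'stem', 'loop' or 'free'.

    Assumption: '(' and ')' mark paired (stem) bases; '.' is a loop base
    if it lies inside at least one base pair, otherwise it is free.
    """
    labels = []
    depth = 0  # number of base pairs currently enclosing the position
    for symbol in dot_bracket:
        if symbol == "(":
            depth += 1
            labels.append("stem")
        elif symbol == ")":
            depth -= 1
            labels.append("stem")
        else:
            labels.append("loop" if depth > 0 else "free")
    return labels


def structure_fractions(seq, dot_bracket):
    """Per-base fraction of bases found in stems and in loops."""
    labels = classify_positions(dot_bracket)
    fractions = {}
    for base in BASES:
        total = seq.count(base)
        for region in ("stem", "loop"):
            hits = sum(
                1 for b, label in zip(seq, labels)
                if b == base and label == region
            )
            # Bases absent from the sequence get a fraction of 0.0.
            fractions[(base, region)] = hits / total if total else 0.0
    return fractions


if __name__ == "__main__":
    # Hypothetical toy hairpin, not the structure shown in Figure 3.
    print(structure_fractions("GGGAAACCCT", "(((...)))."))
```

With the numbers from the table, the same division gives 3 / 5 = 0.60 for the first column of "Fraction of bases in stems" and 2 / 10 = 0.20 for the second column of "Fraction of bases in loops".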
Prediction performance (Q2) of selected models in cross-validation
| Model | 30°C | 40°C | 50°C | 60°C | 80°C | ALL | Average |
|---|---|---|---|---|---|---|---|
| count_scontact_multistruct | 0.978 | 0.978 | 0.950 | 0.965 | 0.975 | 0.954 | 0.967 |
| count_multistruct_stacking | 0.956 | 0.977 | 0.960 | 0.957 | 0.983 | 0.953 | 0.964 |
| count_multistruct | 0.951 | 0.976 | 0.962 | 0.959 | 0.974 | 0.954 | 0.963 |
| count_multitwo_stacking | 0.950 | 0.975 | 0.964 | 0.959 | 0.985 | 0.936 | 0.961 |
| scontact_multistruct_stacking | 0.971 | 0.972 | 0.953 | 0.954 | 0.963 | 0.954 | 0.961 |
| scontact_multistruct | 0.972 | 0.968 | 0.949 | 0.957 | 0.955 | 0.955 | 0.959 |
| count_multitwo | 0.938 | 0.974 | 0.964 | 0.963 | 0.981 | 0.934 | 0.959 |
| paired_unpaired_stacking | 0.947 | 0.953 | 0.956 | 0.955 | 0.983 | 0.934 | 0.955 |
| scontact_multitwo_stacking | 0.958 | 0.965 | 0.946 | 0.950 | 0.964 | 0.941 | 0.954 |
| paired_unpaired | 0.942 | 0.949 | 0.954 | 0.962 | 0.963 | 0.929 | 0.950 |
| scontact | 0.912 | 0.897 | 0.897 | 0.952 | 0.975 | 0.931 | 0.927 |
| count_scontact | 0.902 | 0.866 | 0.843 | 0.961 | 0.988 | 0.924 | 0.914 |
| count | 0.583 | 0.621 | 0.762 | 0.918 | 0.988 | 0.873 | 0.791 |
| Average | 0.920 | 0.929 | 0.928 | 0.955 | 0.975 | 0.936 | |
Models are sorted according to average performance.
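The exact formula behind Q² is not given in this record. The sketch below assumes one common definition, the cross-validated analogue of R²: Q² = 1 − PRESS / TSS, where PRESS is the sum of squared errors of the held-out predictions and TSS is the total sum of squares around the mean. The fold assignment, the function names and the one-dimensional least-squares model used as a stand-in for the regression model are all illustrative assumptions.

```python
def q_squared(observed, predicted):
    """Q^2 = 1 - PRESS / TSS (assumed definition, see lead-in)."""
    mean_obs = sum(observed) / len(observed)
    press = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    total = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - press / total


def k_fold_indices(n, k):
    """Yield (train, test) index lists; point i goes to fold i % k."""
    for fold in range(k):
        test = list(range(fold, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def cross_val_predictions(xs, ys, fit, k):
    """Predict every point with a model trained without that point's fold."""
    predictions = [None] * len(xs)
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            predictions[i] = model(xs[i])
    return predictions


def fit_line(xs, ys):
    """Stand-in model: ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.0]  # made-up illustrative data
    preds = cross_val_predictions(xs, ys, fit_line, k=3)
    print(round(q_squared(ys, preds), 3))
```

Under this definition a model that always predicts the mean of the observations scores 0, and perfect held-out predictions score 1, which is how the values in the table would be read.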
Prediction performance on homology-reduced datasets at 30 and 80°C
| Dataset | Oligonucleotides | 30°C | 80°C |
|---|---|---|---|
| Unreduced | 72 | 0.957 | 0.983 |
| Two bases differing | 52 | 0.954 | 0.986 |
| Three bases differing | 38 | 0.957 | 0.965 |
| Five bases differing | 29 | 0.911 | 0.963 |
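How the homology-reduced datasets were built is not described in this record. One plausible reading of "n bases differing" is that every pair of retained oligonucleotides differs in at least n positions. The sketch below implements that reading with a greedy filter; the use of Levenshtein edit distance as the difference measure, the greedy order and the function names are assumptions, not the authors' procedure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def reduce_homology(sequences, min_diff):
    """Greedily keep sequences that differ from all kept ones by >= min_diff."""
    kept = []
    for seq in sequences:
        if all(edit_distance(seq, other) >= min_diff for other in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    # Hypothetical toy sequences, not the dataset from the table.
    toy = ["ACGT", "ACGA", "TTTT", "ACCA"]
    print(reduce_homology(toy, min_diff=2))
```

A greedy filter depends on the input order, so a different ordering of the same sequences can retain a different subset of a different size.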
Figure 4. Comparison of the Gilar model to our model ‘count_multistruct_stacking’.
Figure 5. Results of the Gilar model at 30°C and 80°C. The hairpin structures are marked with circles.
Average relative model error of oligonucleotides with a similar fraction of bases involved in secondary structure
| Temperature | Model | 0.0 – 0.2 | 0.2 – 0.4 | 0.4 – 0.6 | 0.6 – 0.8 | 0.8 – 1.0 |
|---|---|---|---|---|---|---|
| 30°C | count | 0.98% | 1.12% | 1.41% | 15.47% | 37.62% |
| 30°C | count_multistruct_stacking | 0.65% | 1.01% | 1.21% | 2.35% | 1.19% |
| 50°C | count | 0.78% | 0.80% | 1.11% | 17.95% | 22.70% |
| 50°C | count_multistruct_stacking | 0.38% | 0.39% | 0.43% | 0.27% | 0.26% |
| 80°C | count | 0.57% | 2.12% | na | 1.94% | 0.40% |
| 80°C | count_multistruct_stacking | 0.44% | 0.34% | na | 0.28% | 0.09% |
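The binning behind this table can be sketched as follows. The definition of relative error used here, |predicted − observed| / observed expressed in percent, is an assumption, as are the function names, the handling of bin boundaries (a value on an edge goes to the upper bin) and the toy records. Empty bins are reported as `None`, corresponding to the "na" entries.

```python
from bisect import bisect_right

# Upper edges of the first four bins; the fifth bin runs up to 1.0.
BIN_EDGES = [0.2, 0.4, 0.6, 0.8]
BIN_LABELS = ["0.0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", "0.8-1.0"]


def relative_error_percent(observed, predicted):
    """Relative error in percent (assumed definition, see lead-in)."""
    return 100.0 * abs(predicted - observed) / observed


def binned_relative_error(records):
    """Average relative error per structure-fraction bin.

    records: iterable of (fraction_in_structure, observed, predicted).
    Returns a dict mapping bin label to the mean error, or None if empty.
    """
    sums = [0.0] * len(BIN_LABELS)
    counts = [0] * len(BIN_LABELS)
    for fraction, observed, predicted in records:
        index = bisect_right(BIN_EDGES, fraction)
        sums[index] += relative_error_percent(observed, predicted)
        counts[index] += 1
    return {
        label: (sums[i] / counts[i] if counts[i] else None)
        for i, label in enumerate(BIN_LABELS)
    }


if __name__ == "__main__":
    # Hypothetical (fraction, observed, predicted) triples.
    toy = [(0.10, 10.0, 11.0), (0.15, 20.0, 19.0), (0.90, 10.0, 10.5)]
    print(binned_relative_error(toy))
```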
Figure 6. Number of training data points plotted versus prediction performance. The error bars show the SD, derived from 200 repetitions of the experiment.