Gustavo Camps-Valls, Alistair M Chalk, Antonio J Serrano-López, José D Martín-Guerrero, Erik L L Sonnhammer.
Abstract
BACKGROUND: This paper presents the use of Support Vector Machines (SVMs) for the prediction and analysis of antisense oligonucleotide (AO) efficacy. The collected database comprises 315 AO molecules, each described by 68 features, which yields a problem well suited to SVMs. Feature selection is crucial given the presence of noisy or redundant features and the well-known curse of dimensionality. We propose a two-stage strategy to develop an optimal model: (1) feature selection using correlation analysis, mutual information, and SVM-based recursive feature elimination (SVM-RFE), and (2) AO prediction using standard and profiled SVM formulations. A profiled SVM gives different weights to different parts of the training data in order to focus training on the most important regions.
Year: 2004 PMID: 15383156 PMCID: PMC526382 DOI: 10.1186/1471-2105-5-135
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Feature ranking using (a) the correlation coefficient between input features and efficacy, (b) mutual information feature selection (MIFS) with β = 0.75, (c) SVM-based Recursive Feature Elimination (SVM-RFE), and (d) the best selection in [21] using the correlation coefficient. Each feature is listed with its score under the corresponding method.
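| Rank | (a) Feature | r | (b) Feature | MIFS score | (c) Feature | SVM-RFE score | (d) Feature | r |
|---|---|---|---|---|---|---|---|---|
| 1 | ΔG | -0.35 | ΔG | 0.094 | ΔH | 0.680 | GGGA | 0.26 |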
| 2 | # Cytosine | 0.31 | # Cytosine | 0.089 | ΔS | 0.671 | # Cytosine | 0.23 |
| 3 | TCCC | 0.28 | %GC content | 0.077 | ΔG | 0.193 | ΔH | -0.19 |
| 4 | 5pΔG | -0.26 | ΔG/length | 0.075 | # Cytosine | 0.045 | ΔG | -0.18 |
| 5 | ΔH | -0.24 | ΔH | 0.064 | Hairpin quality | 0.035 | CAGT | -0.18 |
| 6 | ΔH/length | -0.22 | ΔH/length | 0.061 | # Adenine | 0.024 | AGAG | 0.18 |
| 7 | %GC content | 0.22 | ΔS | 0.060 | # Thymine | 0.018 | GTGG | 0.17 |
| 8 | CCCT | 0.21 | # Adenine | 0.043 | Hairpin length | 0.014 | # Guanine | -0.15 |
| 9 | CCAC | 0.21 | # Guanine | 0.042 | 5pΔG | 0.009 | 3pΔG | 0.14 |
| 10 | CCCC | 0.21 | 5pΔG | 0.040 | 3pΔG | 0.005 | ΔS | -0.14 |
| 11 | CTCT | 0.20 | Hairpin quality | 0.027 | Dimer | 0.004 | CCCC | -0.13 |
| 12 | CCCA | 0.20 | Hairpin length | 0.024 | Hairpin energy (Mfold) | 0.003 | Hairpin quality | -0.11 |
| 13 | ACAC | -0.16 | Hairpin Energy | 0.022 | # Guanine | 0.001 | %GC content | 0.11 |
| 14 | # Adenine | -0.16 | # Thymine | 0.016 | Hairpin energy (vienna) | 0.000 | TGGC | -0.10 |
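The first two ranking criteria in the table can be illustrated with a short, dependency-free sketch. It is not the authors' implementation: the equal-width binning, the number of bins, the function names, and the toy feature names in the usage note are all assumptions made for illustration. The greedy MIFS step follows the usual form of that criterion, selecting at each step the feature f that maximizes I(f; y) − β · Σ I(f; s) over the already selected features s.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def rank_by_correlation(features, target):
    """Sort (name, r) pairs by |r| with the target, strongest first."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


def discretize(values, bins=4):
    """Equal-width binning (an assumption; any discretization could be used)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid division by zero for constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (pa[i] * pb[j])) for (i, j), c in pab.items()
    )


def mifs(features, target, k, beta=0.75, bins=4):
    """Greedy MIFS: relevance to the target minus beta times redundancy."""
    binned = {name: discretize(col, bins) for name, col in features.items()}
    y = discretize(target, bins)
    selected, remaining = [], set(binned)
    while remaining and len(selected) < k:

        def score(name):
            redundancy = sum(
                mutual_information(binned[name], binned[s]) for s in selected
            )
            return mutual_information(binned[name], y) - beta * redundancy

        # Iterate in sorted order so that ties are broken deterministically.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

A call such as `mifs({"dG": dg_values, "hairpin": hairpin_values}, efficacy, k=2)` returns the selected feature names in selection order; the dictionary keys here are hypothetical. Larger values of β penalize features that duplicate information already captured by the selected set, while β = 0 reduces the criterion to a plain mutual-information ranking.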
Figure 1. Illustration of Gaussian-like profiles for the penalization factor and the ε-insensitive region in the P-SVR approach. Errors committed in the high- and low-efficacy regions are penalized more heavily. In addition, the insensitive region is wider at medium AO efficacies, so few AOs in that range contribute to the cost function and, consequently, become support vectors. The formulation introduces only one additional parameter, the width σ of the Gaussian profile.
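The idea in the caption can be sketched as a per-sample penalization factor and a per-sample insensitivity width that both follow a Gaussian-like profile over efficacy. The caption does not give the exact functional forms, so everything below is an assumption made for illustration: the profile is centred at an efficacy of 0.5, the penalization grows from `c_base` at the centre towards `2 * c_base` at the extremes, and the insensitive width shrinks from `eps_max` at the centre towards zero at the extremes. Only σ controls the shape, as the caption states.

```python
import math


def gaussian_profile(y, center=0.5, sigma=0.2):
    """Gaussian-like bump: 1 at the centre, decaying towards 0 at the extremes."""
    return math.exp(-((y - center) ** 2) / (2.0 * sigma ** 2))


def profiled_parameters(efficacies, c_base=1.0, eps_max=0.1, sigma=0.2):
    """Return one (C_i, eps_i) pair per sample.

    Hypothetical forms (not taken from the paper):
      C_i   = c_base * (2 - g_i)   -> heavier penalty at high/low efficacy
      eps_i = eps_max * g_i        -> wider insensitive tube at medium efficacy
    where g_i is the Gaussian profile evaluated at the sample's efficacy.
    """
    params = []
    for y in efficacies:
        g = gaussian_profile(y, sigma=sigma)
        params.append((c_base * (2.0 - g), eps_max * g))
    return params


if __name__ == "__main__":
    for y, (c_i, eps_i) in zip(
        [0.0, 0.25, 0.5, 0.75, 1.0],
        profiled_parameters([0.0, 0.25, 0.5, 0.75, 1.0]),
    ):
        print(f"efficacy={y:.2f}  C_i={c_i:.3f}  eps_i={eps_i:.4f}")
```

A smaller σ concentrates the wide insensitive region around medium efficacies, whereas a large σ flattens both profiles and approaches a standard SVR with constant C and ε.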
Mean error (ME), root-mean-squared error (RMSE) and correlation coefficient (r) of models in the validation set. Success rates (SR) for efficacy higher than 0.75 or below 0.25 are also given for each feature selection method.
| Metric | | | | | | |
|---|---|---|---|---|---|---|
| r | 0.356 | 0.367 | 0.374 | 0.398 | 0.430 | 0.440 |
| ME | -0.0280 | -0.0223 | -0.0104 | -0.0068 | 0.031 | 0.022 |
| RMSE | 0.312 | 0.300 | 0.301 | 0.299 | 0.299 | 0.278 |
| SR, efficacy > 0.75 (%) | 82.8 | 87.5 | 86.7 | 87.5 | 83.3 | 83.3 |
| SR, efficacy < 0.25 (%) | 71.4 | 73.9 | 71.4 | 73.9 | 76.2 | 82.9 |
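For reference, the figures of merit named in the caption can be computed as in the following sketch. Two points are assumptions rather than definitions taken from the paper: the sign convention of the mean error (predicted minus observed) and the reading of the success rate as the percentage of AOs predicted beyond a threshold whose observed efficacy also lies beyond it. The function names are illustrative.

```python
import math


def mean_error(obs, pred):
    """Mean error; assumed sign convention is predicted minus observed."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root-mean-squared error between observed and predicted efficacy."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def pearson_r(obs, pred):
    """Correlation coefficient between observed and predicted efficacy."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp) if so > 0 and sp > 0 else 0.0


def success_rate(obs, pred, threshold, high=True):
    """Percentage of AOs predicted beyond the threshold that are observed beyond it.

    This is one possible reading of "success rate", labelled as an assumption.
    Returns NaN when no prediction falls beyond the threshold.
    """
    if high:
        hits = [o > threshold for o, p in zip(obs, pred) if p > threshold]
    else:
        hits = [o < threshold for o, p in zip(obs, pred) if p < threshold]
    return 100.0 * sum(hits) / len(hits) if hits else float("nan")
```

With this reading, `success_rate(obs, pred, 0.75, high=True)` and `success_rate(obs, pred, 0.25, high=False)` would correspond to the two SR rows of the table.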