Abstract
Highly accurate knockdown functional analyses based on RNA interference (RNAi) require the most complete possible hydrolysis of the targeted mRNA while avoiding the degradation of untargeted genes (off-target effects). This in turn requires significant improvements to target selection, for two reasons. First, the average silencing activity of randomly selected siRNAs is as low as 62%. Second, applying more than five different siRNAs may lead to saturation of the RNA-induced silencing complex (RISC) and to the degradation of untargeted genes. Therefore, selecting a small number of highly active siRNAs is critical for maximizing knockdown and minimizing off-target effects. To satisfy these needs, we present a publicly available and transparent machine learning tool that ranks all possible siRNAs for each targeted gene. Support vector machines (SVMs) with polynomial kernels and constrained optimization models select and use the most predictive feature combinations from 572 sequence, thermodynamic, accessibility and self-hairpin features over 2200 published siRNAs. The tool reaches an accuracy of 92.3% in cross-validation experiments. We fully present the underlying biophysical signature, which involves free energy, accessibility and dinucleotide characteristics. We show that while complete silencing is possible at certain structured target sites, accessibility information improves the prediction of the 90% active siRNA target sites. Fast siRNA activity predictions can be performed on our web server at http://optirna.unl.edu/.
Year: 2006 PMID: 17169992 PMCID: PMC1802606 DOI: 10.1093/nar/gkl1065
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Table 1. Overview of the 572 sequence, thermodynamic and accessibility features of the siRNAs
| Both global and positional features: |
| • Δ |
| • The ratio Δ |
| • Average probabilities of target site positions to form secondary structures (mono-, di- and tetranucleotides); |
| • G + C content. |
| Global features covering the complete antisense strand: |
| • Δ |
| • Relative frequencies of mono- or dinucleotides; |
| • Relative frequencies of homotri- and tetranucleotides; |
| • Maximal length of the G/C runs; |
| • Minimal free energy of the secondary structures at the mRNA target site; |
| • Melting temperature of the double-stranded siRNA; |
| • The probability and Δ |
| • Position of the target locus at the mRNA relative to the translation initiation site; |
| • Concentration of the siRNA. |
| Features specific to each position of the antisense strand: |
| • Presence or absence of mono- and dinucleotides; |
| • Presence of G or C mononucleotides; |
| • Probability of the target site positions to form secondary structures; |
| • Change in free energy during complex formation between the siRNA and the target mRNA. |
Both global and positional features were used. SVM and constrained optimization methods performed the iterative selection of the most predictive features shown in Table 2 and Supplementary Table S1.
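A few of the purely sequence-based features listed above can be sketched in a handful of lines. This is a minimal illustration, not the authors' implementation: the function names and the example strand are hypothetical, and the thermodynamic and accessibility features (free energies, pairing probabilities, melting temperature) are deliberately left out because they require external folding software.

```python
from collections import Counter


def gc_content(seq):
    """Global G + C content: fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def relative_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer (k=1 mono-, k=2 dinucleotides)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: n / len(kmers) for kmer, n in counts.items()}


def max_gc_run(seq):
    """Maximal length of an uninterrupted run of G/C bases."""
    longest = current = 0
    for base in seq:
        current = current + 1 if base in "GC" else 0
        longest = max(longest, current)
    return longest


def positional_presence(seq, k):
    """Indicator features keyed by (1-based position, k-mer found there)."""
    return {(i + 1, seq[i:i + k]): 1 for i in range(len(seq) - k + 1)}


# Hypothetical 21-nt antisense strand, used only to exercise the functions.
antisense = "UUGGCCAUAGCAUUCGGACUU"
print(gc_content(antisense))
print(max_gc_run(antisense))
print(relative_frequencies(antisense, 2).get("UU", 0.0))
print(positional_presence(antisense, 1)[(1, "U")])
```

Keying the positional indicators by `(position, k-mer)` keeps absent features implicit (any missing key means 0), which is one convenient way to handle a sparse feature set of this size.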
Table 2. The predictive performance of features
| Individual (a) | | | Combined (b) | | |
|---|---|---|---|---|---|
| Feature | Position | Correlation | Feature | Position | Weight |
| Δ | 1–2 | 0.38 | Δ | All | 0.146 |
| U | 1 | 0.36 | CC | All | −0.134 |
| G | 1 | −0.31 | | All | −0.128 |
| Δ | 1–2 | 0.30 | U | 1 | 0.109 |
| Δ | 1–2 | 0.27 | Δ | 18–19 | −0.107 |
| U | All | 0.26 | A | 19 | −0.099 |
| Δ | All | 0.25 | G | 1 | −0.094 |
| UU | 1 | 0.23 | UU | 18–19 | −0.086 |
| G | All | −0.22 | Δ | 20–21 | 0.084 |
| Δ | All | 0.22 | U | 2 | 0.068 |
| Δ | 3–5 − 19–21 | 0.21 | A | 2 | 0.066 |
| Δ | 1–3 − 19–21 | 0.21 | AU | 6–7 | −0.063 |
| Δ | All | 0.21 | AA | 17–18 | −0.059 |
| GG | 1 | −0.20 | GG | 20–21 | 0.058 |
| GC | 1 | −0.20 | AA | 18–19 | −0.056 |
| UA | All | 0.18 | AU | 9–10 | −0.055 |
| U | 2 | 0.17 | Δ | 3–4 | 0.055 |
| C | 1 | −0.17 | C | 1 | −0.054 |
| GG | All | −0.17 | GG | 16–17 | −0.053 |
| Δ | 1–5 − 17–21 | 0.17 | CG | 1–2 | −0.052 |
| Δ | 18 | −0.17 | AG | 20–21 | 0.052 |
| Δ | 13 | 0.17 | G | 14 | −0.050 |
| Δ | 2 | 0.17 | UG | 4–5 | −0.049 |
| GC | All | −0.16 | A | 20 | −0.047 |
| CC | All | −0.16 | UG | 20–21 | 0.046 |
| UU | All | 0.16 | CC | 13–14 | −0.044 |
| CG | 1 | −0.16 | GU | 5–6 | 0.040 |
| A | 19 | −0.16 | A | 1 | 0.039 |
| Δ | All | −0.15 | CC | 20–21 | −0.036 |
| CC | 1 | −0.15 | U | 7 | 0.035 |
Weights were optimized by an SVM with a linear kernel. The absolute value of each weight indicates the contribution of that feature to the prediction of the linear-kernel model limited to 30 features. Note that the practical predictions use 142 features, shown in Supplementary Table S1 online. p3 is the probability that each base of the tetranucleotide (i, i + 1, i + 2, i + 3) is paired, as predicted by the sfold algorithm.
(a) The 30 features with the strongest correlations to siRNA activity in the Novartis dataset.
(b) Features that in combination account for the most accurate predictions of the siRNA knockdown activity.
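To make the role of the linear-kernel weights concrete, the sketch below sums the weights of the position-specific mononucleotide features from the Combined columns that are present in a given antisense strand. It is only an illustration under several assumptions: the function name and example strands are hypothetical, the Δ and dinucleotide rows are omitted, and the bias term and any feature scaling used by the actual model are unknown and ignored. The resulting number is therefore not an activity prediction; the practical tool uses a polynomial kernel with 142 features.

```python
# (base, 1-based antisense position) -> weight, copied from the mononucleotide
# rows of the "Combined" columns. All other rows are intentionally left out.
POSITIONAL_WEIGHTS = {
    ("U", 1): 0.109,
    ("A", 19): -0.099,
    ("G", 1): -0.094,
    ("U", 2): 0.068,
    ("A", 2): 0.066,
    ("C", 1): -0.054,
    ("G", 14): -0.050,
    ("A", 20): -0.047,
    ("A", 1): 0.039,
    ("U", 7): 0.035,
}


def partial_linear_score(antisense):
    """Sum the weights of the listed features that occur in the strand.

    Assumption: each feature is a 0/1 indicator and no bias term is added.
    """
    score = 0.0
    for (base, position), weight in POSITIONAL_WEIGHTS.items():
        if len(antisense) >= position and antisense[position - 1] == base:
            score += weight
    return score


# Hypothetical strands: one matching favorable features, one unfavorable.
favorable = "UAGGCCUUAGCAUUCGGACUU"
unfavorable = "GC" + "A" * 11 + "G" + "A" * 7
print(partial_linear_score(favorable))
print(partial_linear_score(unfavorable))
```

The sign pattern is the informative part: U or A at the first two antisense positions raises the partial score, while G or C at position 1 and A at positions 19 and 20 lower it.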
Figure 1. Overlap between the distributions of ΔG at positions 1 and 2, global G + C content and U content between siRNAs with >90% (solid lines) and <80% activity (dotted lines) in the Novartis dataset (1).
Figure 2. The accuracy of SVM using different kernels and constrained optimization methods as functions of the number of features. Results of 10× cross-validation experiments (see Materials and Methods) are shown. Note that constrained optimization eliminated all but 72 features in the first iteration.
Figure 3. Observed versus predicted activities in the Novartis dataset (1). Predictions were performed by the polynomial kernel SVM using the 142 features shown in Supplementary Table S1.
Figure 4. Three examples of aligned feature (dinucleotide) combinations selected by RFE and/or WFE with common sequence motifs. All of these features decrease siRNA activity. The selection against the dinucleotide CC at position 9 is expressed by disfavoring cytosines at position 9 in RFE and the dinucleotide CC at position 10 in both methods. In WFE, the selection against CC at 9 is expressed both directly (CC at 9) and indirectly by disfavoring AC and CC at 8, and CC at positions 8, 9, and 10 (see text).
Figure 5. Contributions of the individual antisense sequence positions to the predictions in MVR, BC, SVM/WFE with linear kernel, and SVM/RFE with a polynomial kernel. For the first four methods, we show the sum of weights (absolute values) for the features at that position. For RFE (magenta line), we display the total decrease in the prediction accuracy when features specific to a given position are eliminated.
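The per-position summary described for Figure 5 (sum of absolute weights of the features at each position) can be sketched as follows. The feature list is a small subset of the position-specific rows in the Combined columns of the predictive-performance table, and the function name is hypothetical. How the authors assign a dinucleotide feature to positions is not stated here; this sketch assumes it is credited to every position it spans.

```python
from collections import defaultdict

# (feature, first position, last position, weight): a subset of the
# position-specific rows from the "Combined" columns of the table.
FEATURES = [
    ("U", 1, 1, 0.109),
    ("A", 19, 19, -0.099),
    ("G", 1, 1, -0.094),
    ("UU", 18, 19, -0.086),
    ("U", 2, 2, 0.068),
    ("A", 2, 2, 0.066),
    ("AU", 6, 7, -0.063),
    ("AA", 17, 18, -0.059),
    ("GG", 20, 21, 0.058),
    ("AA", 18, 19, -0.056),
]


def position_contributions(features, length=21):
    """Sum |weight| over the features touching each 1-based position.

    Assumption: a multi-position feature counts fully at each spanned position.
    """
    totals = defaultdict(float)
    for _name, first, last, weight in features:
        for pos in range(first, last + 1):
            totals[pos] += abs(weight)
    return [totals[pos] for pos in range(1, length + 1)]


contributions = position_contributions(FEATURES)
for pos, value in enumerate(contributions, start=1):
    if value > 0:
        print(pos, round(value, 3))
```

Even with this partial list, the weight mass concentrates at the two ends of the antisense strand, which is the kind of pattern a per-position plot is meant to expose.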