Jochen W Klingelhoefer, Loukas Moutsianas, Chris Holmes.
Abstract
MOTIVATION: Short interfering RNA (siRNA)-induced RNA interference is an endogenous pathway in sequence-specific gene silencing. The potency of different siRNAs to inhibit a common target varies greatly, and the features affecting inhibition are of high current interest. The limited success reported so far in predicting siRNA potency could originate in the small number and heterogeneity of available datasets, in addition to the knowledge-driven, empirical basis on which features thought to affect siRNA potency are often chosen. We attempt to overcome these problems by first constructing a meta-dataset of 6483 publicly available siRNAs (targeting mammalian mRNA), the largest to date, and then applying a Bayesian analysis which accommodates feature set uncertainty. A stochastic logistic regression-based algorithm is designed to explore a vast model space of 497 compositional, structural and thermodynamic features, identifying associations with siRNA potency.
Year: 2009 PMID: 19420052 PMCID: PMC2940241 DOI: 10.1093/bioinformatics/btp284
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1.Structure of short interfering RNA (siRNA). Both guide and passenger strands are displayed and the guide strand nucleotides are numbered in 5′ to 3′ direction.
Overview of datasets employed in this study
| Dataset | Size | Reference | Strand format | siRNA concentration | Potency | Also contained in |
|---|---|---|---|---|---|---|
| siRecords (SIR) | 2881 | Ren | S | Variable | Four classes | – |
| Sloan–Kettering (SLO) | 601 | Jagla | AS | 100 nM | [0, 1] | – |
| Isis (ISI) | 67 | Vickers | AS | 100 nM | [0, 1] | SHA, SAE, SIR |
| Novartis (NOV) | 2431 | Huesken | AS | 50 nM | [0, 1] | – |
| Katoh (KAT) | 702 | Katoh and Suzuki | S | 10/25 nM | [0, 1] | – |
| Shabalina (SHA) | 653 | Shabalina | AS | Variable | [0, 1] | SAE, SIR |
| Saetrom (SAE) | 537 | Saetrom | AS | Variable | [0, 1] | – |
| Phipps (PHI) | 26 | Phipps | S | 300 nM | [0, 1] | SIR |
| Amgen–Dharmacon (AMG) | 239 | Reynolds | AS | 100 nM | [0, 1] | SHA, SAE, SIR |
The columns for dataset, size and reference give the name and abbreviation, the number of siRNA samples contained and the reference to the dataset, respectively. Strand format indicates which strand is reported in the original study: sense (S) or antisense (AS). The next column states the siRNA concentration as reported in the respective study. Potency is reported either on a continuous scale over [0, 1], where 0 represents fully potent and 1 non-potent, or as a discrete value, where samples are split into different potency classes.
a For these datasets, siRNA data were collected from various experiments, each under slightly different experimental conditions, so a common siRNA concentration cannot be stated.
b In these datasets, potency was reported on a continuous scale over [0, 1], but fully potent entries were represented by 1 and non-potent entries by 0.
c For these combined datasets, a reference list of the individual datasets they contain is given in the main article.
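Because the datasets differ in strand format and in the direction of the potency scale, they need to be brought to a common convention before they can be pooled. The following is a minimal sketch of that step, not the authors' code: the function names are hypothetical, and it assumes the sense strand is fully complementary to the antisense strand with no overhangs included in the reported sequence.

```python
# Hypothetical helpers for illustration; not taken from the original study.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_antisense(seq: str, strand_format: str) -> str:
    """Return the antisense strand, written 5' to 3'.

    Assumes a fully complementary duplex without overhangs in `seq`.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    if strand_format == "AS":
        return seq
    if strand_format == "S":
        # The reverse complement of the sense strand is the antisense strand.
        return "".join(COMPLEMENT[nt] for nt in reversed(seq))
    raise ValueError(f"unknown strand format: {strand_format!r}")


def to_common_potency(value: float, one_means_potent: bool) -> float:
    """Map a [0, 1] value to the convention 0 = fully potent, 1 = non-potent."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("potency must lie in [0, 1]")
    return 1.0 - value if one_means_potent else value
```

For example, `to_antisense("AUGC", "S")` gives `"GCAU"`, and a value of 0.9 from a dataset that encodes fully potent as 1 maps to roughly 0.1 on the common scale.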
Ratio of potent to non-potent entries contained in the datasets
| Dataset | Size | Potent | Non-potent | Ratio |
|---|---|---|---|---|
| SLO | 601 | 179 | 422 | 0.4242 |
| NOV | 2431 | 1222 | 1209 | 1.0108 |
| KAT | 702 | 176 | 526 | 0.3346 |
| SAE | 509 | 197 | 312 | 0.6314 |
| SIR | 2240 | 1577 | 663 | 2.3786 |
| Total | 6483 | 3351 | 3132 | 1.0699 |
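The ratio column is the number of potent entries divided by the number of non-potent entries. As a quick arithmetic check, the sketch below recomputes the ratios from the counts in the table; the variable names are illustrative. Summing the five rows gives 6483 entries, which matches the meta-dataset size quoted in the abstract.

```python
# (potent, non-potent) counts copied from the table above.
counts = {
    "SLO": (179, 422),
    "NOV": (1222, 1209),
    "KAT": (176, 526),
    "SAE": (197, 312),
    "SIR": (1577, 663),
}

# Potent-to-non-potent ratio per dataset, rounded to four decimals.
ratios = {name: round(p / n, 4) for name, (p, n) in counts.items()}

# Totals over all five datasets.
total_potent = sum(p for p, _ in counts.values())
total_non_potent = sum(n for _, n in counts.values())
total_size = total_potent + total_non_potent
total_ratio = round(total_potent / total_non_potent, 4)

for name, r in ratios.items():
    print(f"{name}: {r}")
print(f"Total: size={total_size}, ratio={total_ratio}")
```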
Percentage occurrences of the ‘UCU’ and ‘ACGA’ motifs
| Dataset | All datasets (%) | Excluding SIR (%) |
|---|---|---|
| Potent entries (all samples) | 51.7 | 41.5 |
| Potent entries (only samples containing ‘UCU’) | 60.1 | 49.4 |
| Potent entries (only samples containing ‘ACGA’) | 36.5 | 25.4 |
Comparison of the percentage of potent siRNA sequences among all samples and among samples containing the ‘UCU’ or ‘ACGA’ motif, for the case that all five of our datasets (SLO, NOV, KAT, SAE, SIR) are considered and the case that the SIR dataset is excluded.
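Each row of the table is a conditional share: the percentage of potent entries among the samples that contain a given motif. A minimal sketch of that calculation follows. The function name and the toy sequences are made up for illustration and are not entries from the meta-dataset.

```python
import math
from typing import Iterable, Optional, Tuple


def potent_share(samples: Iterable[Tuple[str, bool]],
                 motif: Optional[str] = None) -> float:
    """Percentage of potent entries among samples containing `motif`.

    `samples` holds (sequence, is_potent) pairs. With motif=None all
    samples are used. Returns NaN if no sample matches.
    """
    selected = [potent for seq, potent in samples
                if motif is None or motif in seq]
    if not selected:
        return math.nan
    return 100.0 * sum(selected) / len(selected)


# Toy data, invented purely to exercise the function.
toy = [
    ("AUCUGG", True),
    ("GGUCUA", True),
    ("ACGAUU", False),
    ("GGCCAA", False),
    ("UCUACGA", False),
]

share_all = potent_share(toy)
share_ucu = potent_share(toy, "UCU")
share_acga = potent_share(toy, "ACGA")
```

Comparing the motif-specific share with the share over all samples shows whether the motif is enriched or depleted among potent siRNAs, which is how the three rows of the table are read.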
List of features that were most dominant in the generated models
| Feat. ID | Feat No | Occurrence (%) | Corr. Coeff. | Feature explanation |
|---|---|---|---|---|
| 1 | 11 | 94.84 | 0.1029 | NT10 is ‘A’ (cleavage site) |
| 2 | 140 | 93.24 | 0.1485 | Motif ‘UCU’ is present in siRNA |
| 3 | 20 | 88.84 | −0.1213 | NT19 is ‘A’ |
| 4 | 40 | 84.84 | 0.2176 | NT1 is ‘U’ |
| 5 | 433 | 84.54 | 0.2709 | ΔG in NT1..NT4 (dG1-4) |
| 6 | 210 | 78.65 | −0.0972 | Motif ‘ACGA’ is present in siRNA |
| 7 | 437 | 75.42 | 0.1492 | ΔG in NT5..NT8 (dG5-8) |
| 8 | 38 | 69.76 | −0.0034 | NT18 is ‘G’ |
| 9 | 34 | 67.75 | −0.1045 | NT14 is ‘G’ |
| 10 | 483 | 62.84 | 0.0415 | GC content > 35% |
| 11 | 426 | 61.13 | 0.1286 | ΔG in NT13..NT14 |
| 12 | 431 | 58.50 | −0.1495 | ΔG in NT18..NT19 (dG18-19) |
| 13 | 2 | 58.13 | 0.0884 | NT1 is ‘A’ |
| 14 | 491 | 42.33 | 0.2009 | GC content < 70% |
| 15 | 125 | 39.40 | −0.1323 | Motif ‘GCC’ is present in siRNA |
| 16 | 450 | 37.30 | −0.1957 | Folding is present in siRNA (binary value) |
| 17 | 492 | 35.80 | 0.2280 | GC content < 75% |
| 18 | 259 | 33.49 | −0.0911 | Motif ‘GUGG’ is present in siRNA |
| 19 | 347 | 32.84 | 0.0217 | Motif ‘UCCG’ is present in siRNA |
List of features that were most dominant in the models generated by our Bayesian Markov chain Monte Carlo algorithm. The columns give the overall percentage occurrence of each feature (100% representing an appearance of 25 000 000 times in the models generated by our algorithm), the correlation between the feature and the product level variable (a positive correlation coefficient indicates an increase in siRNA potency), and the feature's biological meaning. The first thirteen features appear in more than 50% of the runs.
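The occurrence column can be read as an inclusion frequency: the share of sampled models in which a feature appears. The sketch below shows how such a percentage could be computed from a list of sampled models. It is an illustration under assumptions, not the authors' implementation: the function name is hypothetical, and the short list of models stands in for the output of the sampler.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def inclusion_percentages(sampled_models: Iterable[Set[int]]) -> Dict[int, float]:
    """Percentage of sampled models that contain each feature number."""
    models = list(sampled_models)
    if not models:
        return {}
    # Count each feature at most once per model.
    counts = Counter(feat for model in models for feat in set(model))
    return {feat: 100.0 * c / len(models) for feat, c in counts.items()}


# Four made-up models, each a set of feature numbers (illustrative only).
models = [{11, 140}, {11}, {11, 40}, {140, 11}]
percentages = inclusion_percentages(models)

# Rank features from most to least frequently included.
ranking = sorted(percentages.items(), key=lambda kv: kv[1], reverse=True)
```

With the toy input, feature 11 appears in every model, feature 140 in half of them and feature 40 in a quarter, so the ranking mirrors the ordering by occurrence used in the table.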