Ye Han, Yuanning Liu, Hao Zhang, Fei He, Chonghe Shu, Liyan Dong.
Abstract
Small interfering RNAs (siRNAs) induce posttranscriptional gene silencing in various organisms. siRNAs targeted to different positions of the same gene show different effectiveness; hence, predicting siRNA activity is a crucial step. In this paper, we developed and evaluated a powerful tool named "siRNApred" with a new mixed feature set to predict siRNA activity. To improve the prediction accuracy, we proposed 2-3NTs as our new features. A Random Forest siRNA activity prediction model was constructed using the feature set selected by our proposed Binary Search Feature Selection (BSFS) algorithm. Experimental data demonstrated that the binding site of the Argonaute protein correlates with siRNA activity. "siRNApred" is effective for selecting active siRNAs, and the prediction results demonstrate that our method can outperform other current siRNA activity prediction methods in terms of prediction accuracy.
Year: 2017 PMID: 28243313 PMCID: PMC5294759 DOI: 10.1155/2017/5043984
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Primary dinucleotides with minimal p value.
| Position | Dinucleotide motif | Freq (potent) | Freq (nonpotent) | Type of corr. | p value |
|---|---|---|---|---|---|
| 1 | UU1 | 178/1218 | 25/1213 | Positive | 9.45 |
| | GG1 | 36/1218 | 159/1213 | Negative | 1.52 |
| 2 | UA2 | 73/1218 | 32/1213 | Positive | 4.62 |
| | GC2 | 48/1218 | 96/1213 | Negative | 3.26 |
| 3 | AA3 | 76/1218 | 53/1213 | Positive | 0.0397 |
| | CC3 | 57/1218 | 91/1213 | Negative | 0.0036 |
| 4 | UU4 | 111/1218 | 69/1213 | Positive | 0.0013 |
| | CC4 | 60/1218 | 107/1213 | Negative | 0.0001 |
| 5 | AU5 | 94/1218 | 56/1213 | Positive | 0.0015 |
| | CC5 | 66/1218 | 102/1213 | Negative | 0.0036 |
| 6 | UU6 | 117/1218 | 63/1213 | Positive | 3.19 |
| | CC6 | 47/1218 | 110/1213 | Negative | 1.63 |
| 7 | UU7 | 104/1218 | 67/1213 | Positive | 0.0036 |
| | CA7 | 70/1218 | 120/1213 | Negative | 0.0001 |
| 8 | CG8 | 32/1218 | 51/1213 | Negative | 0.0323 |
| 9 | CA9 | 108/1218 | 66/1213 | Positive | 0.0010 |
| | GU9 | 56/1218 | 84/1213 | Negative | 0.0138 |
| 10 | AU10 | 101/1218 | 62/1213 | Positive | 0.0017 |
| | CC10 | 63/1218 | 96/1213 | Negative | 0.0062 |
| 11 | AA11 | 74/1218 | 46/1213 | Positive | 0.0094 |
| | GG11 | 78/1218 | 111/1213 | Negative | 0.0114 |
| 12 | CG12 | 32/1218 | 56/1213 | Negative | 0.0086 |
| 13 | AU13 | 108/1218 | 65/1213 | Positive | 0.0008 |
| | GG13 | 59/1218 | 114/1213 | Negative | 1.22 |
| 14 | UU14 | 105/1218 | 72/1213 | Positive | 0.0108 |
| | GG14 | 60/1218 | 110/1213 | Negative | 6.10 |
| 15 | CA15 | 113/1218 | 74/1213 | Positive | 0.0033 |
| | GG15 | 72/1218 | 108/1218 | Negative | 0.0048 |
| 16 | AC16 | 82/1218 | 46/1213 | Positive | 0.0012 |
| | GG16 | 68/1218 | 137/1213 | Negative | 3.82 |
| 17 | AC17 | 80/1218 | 45/1213 | Positive | 0.0014 |
| | GA17 | 51/1218 | 95/1213 | Negative | 0.0002 |
| 18 | UC18 | 114/1218 | 69/1213 | Positive | 0.0006 |
| | AA18 | 29/1218 | 87/1213 | Negative | 2.76 |
| 19 | CU19 | 124/1218 | 53/1213 | Positive | 3.23 |
| | AC19 | 30/1218 | 63/1213 | Negative | 0.0004 |
| 20 | UG20 | 146/1218 | 67/1213 | Positive | 1.59 |
| | CC20 | 52/1218 | 101/1213 | Negative | 3.73 |
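The record does not say which statistical test produced the p values in the table above. As a minimal sketch, assuming a Pearson chi-square test on the 2×2 contingency table (motif present or absent, in the first or second group) with one degree of freedom and no continuity correction, a p value and the sign of the correlation could be computed as follows. The function names `chi_square_p` and `correlation_type` are hypothetical and are not taken from the paper.

```python
import math


def chi_square_p(hits_a: int, total_a: int, hits_b: int, total_b: int) -> float:
    """p value of a Pearson chi-square test on a 2x2 table.

    Assumption: 1 degree of freedom, no continuity correction.
    """
    a, b = hits_a, total_a - hits_a  # group A: motif present / absent
    c, d = hits_b, total_b - hits_b  # group B: motif present / absent
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the 2x2 table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def correlation_type(hits_a: int, total_a: int, hits_b: int, total_b: int) -> str:
    """'Positive' if the motif is relatively more frequent in group A."""
    return "Positive" if hits_a / total_a > hits_b / total_b else "Negative"


if __name__ == "__main__":
    # Counts of the AA3 row: 76/1218 versus 53/1213.
    print(correlation_type(76, 1218, 53, 1213), chi_square_p(76, 1218, 53, 1213))
```

Under these assumptions the AA3 row (76/1218 versus 53/1213) evaluates to a p value close to 0.04, which is in line with the tabulated 0.0397.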
Primary trinucleotides with minimal p value.
| Position | Trinucleotide motif | Freq (potent) | Freq (nonpotent) | Type of corr. | p value |
|---|---|---|---|---|---|
| 1 | UUG1 | 52/1218 | 5/1213 | Positive | 9.48 |
| | GGG1 | 4/1218 | 50/1213 | Negative | 1.90 |
| 2 | UUA2 | 14/1218 | 4/1213 | Positive | 0.0184 |
| | GCC2 | 10/1218 | 33/1213 | Negative | 0.0004 |
| 3 | AUU3 | 28/1218 | 9/1213 | Positive | 0.0009 |
| | CAC3 | 9/1218 | 29/1213 | Negative | 0.0005 |
| 4 | UAU4 | 19/1218 | 5/1213 | Positive | 0.0021 |
| | CCA4 | 19/1218 | 41/1213 | Negative | 0.0019 |
| 5 | AUU5 | 29/1218 | 11/1213 | Positive | 0.0021 |
| | CCC5 | 6/1218 | 30/1213 | Negative | 2.59 |
| 6 | UUU6 | 40/1218 | 12/1213 | Positive | 4.53 |
| | CCA6 | 10/1218 | 41/1213 | Negative | 5.20 |
| 7 | UCU7 | 37/1218 | 18/1213 | Positive | 0.005 |
| | CGU7 | 3/1218 | 16/1213 | Negative | 0.0013 |
| 8 | ACA8 | 29/1218 | 13/1213 | Positive | 0.0066 |
| | AAU8 | 8/1218 | 28/1213 | Negative | 0.0004 |
| 9 | CAA9 | 26/1218 | 7/1213 | Positive | 0.0004 |
| | AUU9 | 12/1218 | 30/1213 | Negative | 0.0024 |
| 10 | ACA10 | 35/1218 | 11/1213 | Positive | 0.0002 |
| | CGA10 | 2/1218 | 12/1213 | Negative | 0.0036 |
| 11 | CUA11 | 32/1218 | 13/1213 | Positive | 0.0022 |
| | GCG11 | 6/1218 | 23/1213 | Negative | 0.0007 |
| 12 | AUU12 | 30/1218 | 11/1213 | Positive | 0.0014 |
| | GGG12 | 9/1218 | 31/1213 | Negative | 0.0002 |
| 13 | UUU13 | 33/1218 | 16/1213 | Positive | 0.0074 |
| | CCG13 | 6/1218 | 20/1213 | Negative | 0.0028 |
| 14 | CCA14 | 36/1218 | 16/1213 | Positive | 0.0026 |
| | CCC14 | 6/1218 | 21/1213 | Negative | 0.0018 |
| 15 | UAU15 | 16/1218 | 4/1213 | Positive | 0.0036 |
| | UGG15 | 19/1218 | 46/1218 | Negative | 0.0003 |
| 16 | ACU16 | 31/1218 | 12/1213 | Positive | 0.0018 |
| | CGA16 | 1/1218 | 10/1213 | Negative | 0.0032 |
| 17 | CUG17 | 49/1218 | 21/1213 | Positive | 0.0004 |
| | GUU17 | 9/1218 | 34/1213 | Negative | 5.57 |
| 18 | UCU18 | 43/1218 | 11/1213 | Positive | 5.54 |
| | AAA18 | 8/1218 | 28/1213 | Negative | 0.0004 |
| 19 | CUG19 | 61/1218 | 16/1213 | Positive | 9.70 |
| | AGA19 | 7/1218 | 31/1213 | Negative | 4.05 |
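The motif labels in the two tables above (for example UU1 or CUG19) combine a di- or trinucleotide with a position. The following is a minimal sketch of how such position-specific 2-3NT features might be extracted from a sequence. It assumes that the number is the 1-based start position of the motif and that the input is a 21-nt strand, which is consistent with dinucleotide positions 1 to 20 and trinucleotide positions 1 to 19 but is not stated in this record. The function name `motif_features` is hypothetical.

```python
from typing import Dict, Iterable


def motif_features(seq: str, sizes: Iterable[int] = (2, 3)) -> Dict[str, int]:
    """Return the position-specific motifs present in `seq`.

    Keys look like 'UU1' or 'CUG19' (motif + assumed 1-based start position).
    Motifs that are absent are simply missing from the dict (implicit 0).
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    features: Dict[str, int] = {}
    for n in sizes:
        for start in range(len(seq) - n + 1):
            features[f"{seq[start:start + n]}{start + 1}"] = 1
    return features


if __name__ == "__main__":
    # Made-up 21-nt example sequence, for illustration only.
    feats = motif_features("UUGACUAGCAUCGAUCGACUG")
    print(sorted(feats))
```

A sparse dict is used here for readability; a model such as a Random Forest would typically consume these as a fixed-length binary vector over all possible motif and position combinations.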
Algorithm 1. The calculation process of threshold k.
Figure 1. Comparison between model 1 and model 2. Observed siRNA activities of the Huesken_test are plotted against predicted siRNA activities by model 1 (a) and model 2 (b).
The performance of our model with the top k features.
| Step | Number of features (k) | Pearson Correlation Coefficient (PCC) |
|---|---|---|
| 1 | 230 | 0.705 |
| 2 | 230/2 = 115 | 0.713 |
| 3 | 115/2 = 57 | |
| 4 | 57/2 = 28 | 0.712 |
| 5 | 28 + (57 − 28)/2 = 42 | 0.720 |
| 6 | 42 + (57 − 42)/2 = 49 | 0.721 |
| 7 | 49 + (57 − 49)/2 = 53 | 0.721 |
| 8 | 53 + (57 − 53)/2 = 55 | 0.719 |
| 9 | 55 + (57 − 55)/2 = 56 | 0.721 |
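The body of Algorithm 1 is not reproduced in this record, so the following is only a sketch inferred from the sequence of k values in the table above (230, 115, 57, 28, 42, 49, 53, 55, 56). It assumes two phases: k is halved with integer division until the score drops, and the interval between the last worse k and the best k is then bisected. The rule applied when a midpoint scores at least as well as the current best is a guess, and `score` is a hypothetical callback standing in for "train on the top k features and return the PCC".

```python
from typing import Callable, List, Tuple


def binary_search_k(n_features: int,
                    score: Callable[[int], float]) -> Tuple[int, List[int]]:
    """Return (selected k, list of evaluated k values)."""
    visited = [n_features]
    high, high_score = n_features, score(n_features)

    # Phase 1 (assumed): halve k for as long as the score does not get worse.
    while high // 2 >= 1:
        low = high // 2
        visited.append(low)
        low_score = score(low)
        if low_score < high_score:
            break  # `low` is worse; the best k lies in (low, high]
        high, high_score = low, low_score
    else:
        return high, visited  # reached k = 1 without any drop in score

    # Phase 2 (assumed): bisect between the worse `low` and the best `high`.
    while True:
        mid = low + (high - low) // 2
        if mid == low:  # interval can no longer be split
            break
        visited.append(mid)
        mid_score = score(mid)
        if mid_score < high_score:
            low = mid
        else:
            high, high_score = mid, mid_score
    return high, visited


if __name__ == "__main__":
    # Synthetic score peaking at k = 57; NOT the paper's PCC values.
    print(binary_search_k(230, lambda k: -abs(k - 57)))
```

With a synthetic score that peaks at 57, the sketch evaluates the same k values as the table and selects k = 57, matching the 57 features reported for the BSFS method.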
Figure 2. The 57 features selected by the BSFS method.
Figure 3. Boxplots of the top 15 features. For each plot, the left side represents potent siRNAs, and the right side represents nonpotent siRNAs.
PCC between observed and predicted siRNA activities for five algorithms.
| Method | PCC |
|---|---|
| Biopredsi | 0.660 |
| | 0.654 |
| ThermoComposition-21 | 0.659 |
| DSIR | 0.670 |
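The PCC values above compare observed with predicted siRNA activities. For reference, a minimal standard-library sketch of the Pearson correlation coefficient is given below; the function name `pearson_r` and the example values are illustrative and do not come from the paper.

```python
import math
from typing import Sequence


def pearson_r(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_o * var_p)


if __name__ == "__main__":
    # Made-up activities, for illustration only.
    print(pearson_r([0.2, 0.5, 0.9, 0.4], [0.3, 0.45, 0.8, 0.5]))
```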
|
|
|
Figure 4. ROC curves of the five algorithms.
The five algorithms' sensitivities in the high specificity area.
| Method | Sensitivity | Sensitivity |
|---|---|---|
| Biopredsi | 16.3% | 8.1% |
| | 24.4% | 6.7% |
| ThermoComposition-21 | 28.9% | 18.5% |
| DSIR | 20.0% | 10.4% |
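The specificity levels behind the two sensitivity columns are not preserved in this record. As a hedged illustration of the quantity itself, the sketch below computes the sensitivity of a scoring method at a chosen specificity. It assumes binary labels (1 for potent, 0 for nonpotent) and that higher scores mean "predicted potent"; the function name `sensitivity_at_specificity` and the example data are hypothetical.

```python
import math
from typing import Sequence


def sensitivity_at_specificity(labels: Sequence[int],
                               scores: Sequence[float],
                               specificity: float) -> float:
    """Fraction of positives scored above the cutoff that keeps at least
    `specificity` of the negatives at or below the cutoff."""
    negatives = sorted(s for l, s in zip(labels, scores) if l == 0)
    positives = [s for l, s in zip(labels, scores) if l == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative")
    # Smallest index such that (index + 1) / n_neg >= specificity.
    idx = math.ceil(specificity * len(negatives)) - 1
    cutoff = negatives[idx] if idx >= 0 else -math.inf
    # A sample is called positive only if its score is strictly above the cutoff.
    return sum(s > cutoff for s in positives) / len(positives)


if __name__ == "__main__":
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.2, 0.7, 0.3, 0.2, 0.1]
    print(sensitivity_at_specificity(labels, scores, 0.75))
```

Reading sensitivity at a fixed high specificity corresponds to looking at the leftmost part of a ROC curve, where few nonpotent siRNAs are accepted.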
Figure 5. Comparisons of ten algorithms using the three independent datasets of Vickers, Reynolds, and Harborth.