Kyle A McQuisten, Andrew S Peek.
Abstract
BACKGROUND: Exogenous short interfering RNAs (siRNAs) induce a gene knockdown effect in cells by interacting with naturally occurring RNA processing machinery. However, not all siRNAs induce this effect equally. Several heterogeneous kinds of machine learning techniques and feature sets have been applied to modeling siRNAs and their abilities to induce knockdown. There is growing agreement on which techniques produce maximally predictive models, yet there is little consensus on methods for comparing predictive models. There are also few comparative studies that address how the choice of learning technique, feature set, or cross-validation approach affects finding and discriminating among predictive models.
Year: 2009 PMID: 19847297 PMCID: PMC2760777 DOI: 10.1371/journal.pone.0007522
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Computational systems used in developing models for predicting effective RNAi.
| # | Technique(s) | class/reg | siRNA data set | Total Features | Reference(s) |
| 1 | Rule | classification | 180–19mers | 8 | |
| 2 | Rule | classification | 62–19mers | 4 | |
| 3 | Rule | classification | 46–19mers-train, 34–19mers-test | 9 | |
| 4 | Rule | classification | 148–19mers | 18 | |
| 5 | Rule | classification | 249–19mers | 12 | |
| 6 | Rule | classification | 23–19mers | 2 | |
| 7 | GPBoost, SVM | class/reg | 204–19mers | ? | |
| 8 | GPBoost, SVM | regression | 581–19mers | ? | |
| 9 | DT | class/reg | 398–19mers | 11 | |
| 10 | Rule | classification | composite | 8 | |
| 11 | ANN | regression | 2431–21mers | 84 | |
| 12 | ANN | classification | 180–19mers | 6 | |
| 13 | Rule, DT | classification | 601–19mers | 55 | |
| 14 | GSK SVM | classification | 94–19mers | 84 | |
| 15 | Rule DT, SVM | classification | 33–21mers | 4 | |
| 16 | SVM | classification | 2431–21mers, 581–19mers | 84+15+20 | |
| 17 | ANN | regression | 581–19mers-train, 2431–21mers-test | 200 | |
| 18 | linear | regression | 526–19mers | 84 | |
| 19 | linear | regression | 2431–21mers, 653–19mers | 84+84 | |
| 20 | DRM | classification | 3277 | 276-initial 21-final | |
| 21 | Rule | classification | 420 and 1220 | 6+4+16+64 | |
| 22 | SVM | class/reg | 2252–21mers, 240–19mers | 572 | |
| 23 | linear | regression | 2431–21mers | 84+ | |
| 24 | SVM | regression | 2431–21mers, 579–19mers | 1566 | |
| 25 | Rule, DT, GPBoost, ANN, linear | class/reg | 2431–21mers, 601–19mers, 238–19mers, 67–19mers | 84+84, 22-final | |
| 26 | SVM | classification | 2431–21mers, 653–19mers | 28 | |
| 27 | Rule, SVM, RFR | regression | 3589 | 41 | |
| 28 | linear | regression | 702–19mers | 76+3 | |
| 29 | Rule HS | classification | 474 subset of 2433–21mers, 99 subset of 294–21mers, 360 21–mers | 4 | |
| 30 | Rule DT | classification | 62 21-mers | 8 | |
GPBoost: Genetic Programming and Boosting.
SVM: Support Vector Machine.
DT: Decision Tree.
ANN: Artificial Neural Network.
GSK: General String Kernel.
DRM: Disjunctive Rule Merging.
RFR: Random Forest Regression.
HS: Hierarchical Sorting.
Model performance by learning technique and feature mapping method for correlations and mean squared error on the entire dataset and by 10-fold cross validation.
| | | ANN | ANN | ANN | ANN | GLM | GLM | GLM | GLM | SVM | SVM | SVM | SVM |
| | | Train on 2431, test on 2431 | Train on 2431, test on 2431 | 10-fold cross validation | 10-fold cross validation | Train on 2431, test on 2431 | Train on 2431, test on 2431 | 10-fold cross validation | 10-fold cross validation | Train on 2431, test on 2431 | Train on 2431, test on 2431 | 10-fold cross validation | 10-fold cross validation |
| Mapping method | Number Features | R | MSE | R | MSE | R | MSE | R | MSE | R | MSE | R | MSE |
| PSBC | 84 | 0.658 | 0.023 | 0.636 | | 0.631 | 0.029 | | | 0.764 | 0.017 | 0.643 | 0.024 |
| THER | 23 | 0.562 | 0.028 | 0.567 | 0.029 | 0.514 | 0.840 | 0.511 | 0.844 | 0.722 | 0.019 | 0.579 | 0.027 |
| NG25 | 1360 | 0.871 | 0.015 | 0.464 | 0.049 | 0.450 | 0.936 | 0.357 | 0.929 | 0.897 | 0.009 | 0.509 | 0.030 |
| GSSF | 32 | 0.316 | 0.036 | 0.278 | 0.038 | 0.198 | 0.072 | 0.152 | 0.115 | 0.232 | 0.039 | 0.215 | 0.039 |
| GSSS | 23 | 0.301 | 0.037 | 0.279 | 0.038 | 0.207 | 0.091 | 0.201 | 0.091 | 0.339 | 0.036 | 0.271 | 0.038 |
| P+13 | 168 | 0.703 | 0.021 | | 0.027 | 0.500 | 0.252 | 0.474 | 0.257 | 0.779 | 0.010 | 0.681 | 0.022 |
| P+25 | 1444 | 0.898 | 0.012 | 0.572 | 0.047 | 0.513 | 1.144 | 0.439 | 1.109 | 0.931 | 0.006 | | |
| ALL | 1522 | 0.430 | 0.136 | 0.524 | 0.055 | 0.509 | 2.605 | 0.444 | 2.529 | 0.934 | 0.006 | 0.644 | 0.025 |
Learning Techniques: Artificial Neural Network (ANN), General Linear Model (GLM), Support Vector Machine (SVM).
Mapping methods: Position Specific Base Composition (PSBC), Thermodynamic (THER), N-Grams of length 2 through 5 (NG25), Guide Strand Structure Features (GSSF), Guide Strand Secondary Structure (GSSS), Position Specific Base Composition plus N-Grams of length 1 through 3 (P+13), Position Specific Base Composition plus N-Grams of length 2 through 5 (P+25), and the combination of the methods PSBC, THER, NG25, GSSF and GSSS (ALL).
R = Pearson correlation coefficient, of model predicted activities to observed activities.
MSE = Mean Squared Error of model predicted activities to observed activities.
Column maxima for R and minima for MSE are in bold for 10-fold cross validations; the same values are bolded in Table 7.
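The two performance measures defined above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the activity values are hypothetical and are not taken from the study.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between predicted and observed activities."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def mean_squared_error(predicted, observed):
    """Mean of the squared differences between predicted and observed activities."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical normalized knockdown activities (illustration only).
observed = [0.10, 0.40, 0.55, 0.80, 0.95]
predicted = [0.20, 0.35, 0.60, 0.70, 0.90]

r = pearson_r(predicted, observed)
mse = mean_squared_error(predicted, observed)
print(f"R = {r:.3f}, MSE = {mse:.4f}")
```

R measures how well the predictions track the observations regardless of scale, while MSE penalizes absolute deviations, which is why a model can score well on one measure and poorly on the other.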
Comparison among learning technique and mapping method for building significantly dissimilar models by 10-fold cross validation.
| | TEC | ANN | ANN | ANN | ANN | ANN | ANN | ANN | ANN |
| TEC | MET | PSBC | THER | NG25 | GSSF | GSSS | P+13 | P+25 | ALL |
| ANN | PSBC | 0.636 | 8.89E-04** | 2.26E-05** | 2.20E-09** | 1.41E-11** | 2.26E-01 | 3.02E-02* | 1.21E-04** |
| ANN | THER | 2.76E-03* | 0.567 / 0.029 | 5.70E-03* | 2.31E-08** | 7.02E-10** | 1.48E-04** | 6.00E-01 | 2.02E-01 |
| ANN | NG25 | 3.30E-05** | 1.37E-04** | 0.464 / 0.049 | 2.44E-05** | 1.03E-05** | 5.23E-06** | 5.13E-03* | 6.49E-02 |
| ANN | GSSF | 2.21E-10** | 1.24E-07** | 7.87E-03* | 0.278 / 0.038 | 9.84E-01 | 3.66E-10** | 4.90E-08** | 1.60E-07** |
| ANN | GSSS | 2.20E-10** | 9.99E-08** | 8.43E-03* | 8.62E-01 | 0.279 / 0.038 | 2.97E-12** | 2.22E-08** | 1.53E-08** |
| ANN | P+13 | 3.26E-01 | 2.01E-01 | 3.91E-05** | 1.07E-05** | 1.16E-05** | | 6.85E-03* | 2.30E-05** |
| ANN | P+25 | 1.28E-02* | 3.01E-02* | 8.99E-01 | 2.06E-01 | 2.12E-01 | 1.86E-02* | 0.572 / 0.047 | 1.32E-01 |
| ANN | ALL | 5.47E-05** | 1.55E-04** | 2.31E-01 | 2.72E-03* | 2.87E-03* | 6.31E-05** | 3.74E-01 | 0.524 / 0.055 |
| GLM | PSBC | 7.04E-04** | | | | | | | |
| GLM | THER | 1.45E-11** | | | | | | | |
| GLM | NG25 | 2.21E-12** | | | | | | | |
| GLM | GSSF | 4.65E-06** | | | | | | | |
| GLM | GSSS | 4.37E-08** | | | | | | | |
| GLM | P+13 | 3.23E-07** | | | | | | | |
| GLM | P+25 | 4.99E-15** | | | | | | | |
| GLM | ALL | 1.68E-11** | | | | | | | |
| SVM | PSBC | 3.20E-01 | | | | | | | |
| SVM | THER | 2.36E-01 | | | | | | | |
| SVM | NG25 | 2.59E-04** | | | | | | | |
| SVM | GSSF | 5.99E-02 | | | | | | | |
| SVM | GSSS | 8.81E-02 | | | | | | | |
| SVM | P+13 | 1.35E-02* | | | | | | | |
| SVM | P+25 | 2.42E-04** | 4.71E-03* | | | | | | |
| SVM | ALL | 6.29E-05** | | | | | | | |
Diagonal cells from upper left to lower right contain the mean correlation R (upper) and MSE (lower) from the 10-fold cross validation predictions within the learning technique and mapping method, equivalent to the 10-fold cross validation R and MSE columns in Table 2.
Cells above and to the right of the diagonal are the t-test probabilities of the 10-fold cross validations R rejecting the H0: xa = xb, where xa is mean R of combined technique and method a and xb is the mean R of combined technique and method b.
Cells below and to the left of the diagonal are the t-test probabilities of the 10-fold cross validations MSE rejecting the H0: xa = xb, where xa is mean MSE of combined technique and method a and xb is the mean MSE of combined technique and method b.
The cells off the upper left to lower right diagonal are unlabeled where P≥0.05.
The cells off the diagonal are labeled with a * where P<0.05 and P≥0.001 (<5.0E-02 and >1.0E-03).
The cells off the diagonal are labeled with a ** where P<0.001 or 1.0E-03.
Learning technique (TEC) and mapping method (MET) labels are consistent with Table 2.
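The labeling rules stated in the notes above can be written as a small helper. This is only a sketch of the notation: the function names are hypothetical, and the t-test that produces each probability is not reproduced here.

```python
def significance_label(p_value):
    """Return the marker used in the table for a t-test probability."""
    if p_value < 0.001:
        return "**"  # P < 0.001
    if p_value < 0.05:
        return "*"   # 0.001 <= P < 0.05
    return ""        # P >= 0.05: cell left unlabeled


def format_cell(p_value):
    """Format a probability in the table's E-notation with its marker."""
    return f"{p_value:.2E}{significance_label(p_value)}"


# Three probabilities taken from the ANN PSBC row of the table.
for p in (8.89e-04, 3.02e-02, 2.26e-01):
    print(format_cell(p))
```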
Individual model ANOVA on correlation (R) cross validation replicates.
| Mdl | Model formula | Adjusted R² | RSS | df | F | P |
| M | | 0.02958 | 4.5157 | 2, 147 | 3.27 | 0.041 |
| M | | 0.8314 | 0.7741 | 4, 145 | 184.6 | <2.2×10^-16 |
| M | | 0.8734 | 0.5731 | 6, 143 | 172.3 | <2.2×10^-16 |
| M | | 0.8822 | 0.5033 | 14, 135 | 80.73 | <2.2×10^-16 |
Model comparisons by ANOVA for R.
| Mdl | Mdl | RSS (first model) | RSS (second model) | Residual df (first model) | Residual df (second model) | df | F | P |
| M | M | 4.5157 | 0.5731 | 147 | 143 | 4 | 245.93 | <2.2×10^-16 |
| M | M | 0.7741 | 0.5731 | 145 | 143 | 2 | 25.069 | 4.646×10^-10 |
| M | M | 0.5731 | 0.5033 | 143 | 135 | 8 | 2.3414 | 0.02181 |
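The F values in this comparison table can be checked with the standard nested-model F test. The sketch below assumes that the two leading numeric columns are the residual sums of squares of the smaller and larger model and that the next two are their residual degrees of freedom; that reading is an interpretation of the extracted table, and the function name is hypothetical.

```python
def nested_f_statistic(rss_small, rss_large, res_df_small, res_df_large):
    """F statistic for comparing a smaller model nested within a larger one.

    F = ((RSS_small - RSS_large) / (df_small - df_large)) / (RSS_large / df_large)
    """
    df_diff = res_df_small - res_df_large
    numerator = (rss_small - rss_large) / df_diff
    denominator = rss_large / res_df_large
    return numerator / denominator


# Rows of the table: (RSS small, RSS large, residual df small, residual df large).
# The column interpretation is an assumption, as noted in the lead-in.
rows = [
    (4.5157, 0.5731, 147, 143),
    (0.7741, 0.5731, 145, 143),
    (0.5731, 0.5033, 143, 135),
]
for row in rows:
    print(round(nested_f_statistic(*row), 2))
```

Under that assumption the computed values agree with the tabulated F statistics to within the rounding of the printed sums of squares.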
Individual model ANOVA on Mean Squared Error (MSE) cross validation replicates.
| Mdl | Model formula | Adjusted R² | RSS | df | F | P |
| M | | 0.3535 | 7.9759 | 2, 147 | 41.73 | 4.442×10^-15 |
| M | | 0.1904 | 9.8519 | 4, 145 | 9.759 | 5.107×10^-7 |
| M | | 0.5564 | 5.3238 | 6, 143 | 32.14 | <2.2×10^-16 |
| M | | 0.9931 | 0.0778 | 14, 135 | 1540 | <2.2×10^-16 |
Model comparisons by ANOVA for MSE.
| Mdl | Mdl | RSS (first model) | RSS (second model) | Residual df (first model) | Residual df (second model) | df | F | P |
| M | M | 7.9759 | 5.3238 | 147 | 143 | 4 | 17.81 | 6.937×10^-12 |
| M | M | 9.8519 | 5.3238 | 145 | 143 | 2 | 60.815 | <2.2×10^-16 |
| M | M | 5.3238 | 0.0778 | 143 | 135 | 8 | 1138 | <2.2×10^-16 |
Figure 1. Box-and-whisker diagrams for the cross validation estimates of model precision performance, or Pearson correlation (R).
Boxes encompass the first to third quartile of the distribution. The medians of the distributions are given as horizontal lines within the boxes. Whiskers encompass the 5% to 95% confidence regions of the distribution. Statistical outliers are shown as open circles. The left side of the diagram groups the model precision estimates by machine learning technique. The right side of the diagram groups the model precision estimates by feature mapping method.
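The summary statistics behind such a diagram can be sketched with the standard library. The example below assumes the whiskers are the 5th and 95th percentiles and treats values outside them as the plotted outliers; the function name and the sample R values are hypothetical, and the interpolation method is a choice made for this sketch rather than the one used for the figure.

```python
import statistics


def box_summary(values):
    """Quartiles, 5th/95th percentiles, and points lying outside the whiskers."""
    # method="inclusive" interpolates within the observed range only.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    twentieths = statistics.quantiles(values, n=20, method="inclusive")
    p05, p95 = twentieths[0], twentieths[-1]
    outliers = [v for v in values if v < p05 or v > p95]
    return {"p05": p05, "q1": q1, "median": median, "q3": q3,
            "p95": p95, "outliers": outliers}


# Hypothetical R values from ten cross-validation folds (illustration only).
fold_r = [0.58, 0.61, 0.62, 0.63, 0.64, 0.64, 0.65, 0.66, 0.67, 0.71]
summary = box_summary(fold_r)
print(summary)
```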
Figure 2. Box-and-whisker diagrams for the cross validation estimates of model accuracy performance, or Mean Squared Error (MSE).
See Figure 1 for more details.
Comparisons among regression learning techniques results for model precision and accuracy.
| R (entire dataset) | MSE (entire dataset) | CV type | R (cross validation) | MSE (cross validation) | technique | availability | source | ref | Availability Dataset 2431 |
| −0.513 | - | - | - | - | Rule | - | Reynolds-Khvorova | | - |
| −0.236 | - | - | - | - | Rule | - | Uitei-Saigo v1 | | - |
| −0.457 | - | - | - | - | Rule | - | Uitei-Saigo v2 | | - |
| −0.423 | - | - | - | - | DT | - | Jagla-Rothman | | - |
| −0.476 | - | - | - | - | Boosting | - | Sætrom | | - |
| −0.425 | - | - | - | - | Rule | - | Amarzguioui-Prydz | | - |
| −0.449 | - | - | - | - | Rule | webserver | Henschel-Habermann | | - |
| −0.666 | - | - | - | - | ANN | Contact authors | Shabalina-Ogurtsov | | - |
| −0.670 | - | hold out | 0.660 | - | ANN | webserver | Huesken-Hall | | + |
| −0.666 | - | - | - | - | GLM | webserver | Vert-Vandenbrouck | | + |
| - | - | 10-fold | - | 92.3% (2) | SVM | webserver | Ladunga | | + |
| 0.635 | - | Single hold out | 0.577 | - | GLM | VB code | Ichihara | | + |
| 0.797 | 0.015 | 10-fold | 0.760 | 0.023 | SVM | C++ code | Peek | | + |
| −0.578 | - | - | - | - | GLM | data | Matveeva model 1 | | + |
| −0.650 | - | - | - | - | GLM | As above | Matveeva model 2 | | + |
| 0.917 | 6.804 (1) | - | - | - | SVM | webserver | Jiang | | + |
| 0.726 (3) | - | - | - | - | GLM | webserver | Katoh | | + (3) |
| 0.703 | 0.021 | 10-fold | 0.636 | 0.025 | ANN | C++ code | | - | + |
| 0.631 | 0.029 | 10-fold | 0.607 | 0.031 | GLM | As above | | - | + |
| 0.931 | 0.006 | 10-fold | 0.711 | 0.020 | SVM | As above | | - | + |
All values of R presented as negative values are from [33] as such a negative R model would not yield useful MSE values.
1) MSE, labeled as RMSE.
2) accuracy was defined as “100 minus the average percentage difference between predicted and observed knockdown activities”.
3) a dataset of 702 siRNAs was used, not the 2431 dataset considered by the remainder of the table.
Comparison of model cross-validation procedures on the PSBC feature mapping method across 3 learning techniques.
| Rep | Part | CV-fold | ANN R | ANN MSE | GLM R | GLM MSE | SVM R | SVM MSE |
| 1 | Strat | 2 | 0.620 (2.09E-03) | 0.0253 (5.47E-04) | 0.586 (2.52E-02) | 0.0334 (3.52E-03) | 0.622 (7.44E-03) | 0.0249 (4.73E-04) |
| 1 | Strat | 3 | 0.625 (2.05E-02) | 0.0249 (1.00E-03) | 0.600 (2.14E-02) | 0.0320 (2.13E-03) | 0.626 (1.89E-02) | 0.0247 (8.31E-04) |
| 1 | Strat | 5 | 0.632 (3.19E-02) | 0.0247 (2.16E-03) | 0.600 (4.07E-02) | 0.0315 (3.74E-03) | 0.639 (3.46E-02) | 0.0240 (1.86E-03) |
| 1 | Strat | 10 | | | | | | |
| 1 | Strat | 20 | 0.638 (5.00E-02) | 0.0248 (2.85E-03) | 0.611 (5.84E-02) | 0.0307 (4.79E-03) | 0.647 (4.85E-02) | 0.0237 (2.71E-03) |
| 1 | Rand | 2 | 0.616 (1.70E-02) | 0.0258 (5.21E-04) | 0.594 (1.19E-02) | 0.0326 (1.60E-03) | 0.619 (1.40E-02) | 0.0251 (9.55E-04) |
| 1 | Rand | 3 | 0.630 (2.29E-03) | 0.0245 (1.04E-03) | 0.604 (1.52E-02) | 0.0316 (1.59E-03) | 0.639 (4.47E-03) | 0.0241 (1.20E-03) |
| 1 | Rand | 5 | 0.630 (1.86E-02) | 0.0247 (1.85E-03) | 0.606 (2.85E-02) | 0.0311 (2.65E-03) | 0.636 (1.79E-02) | 0.0242 (2.01E-03) |
| 1 | Rand | 10 | 0.633 (3.84E-02) | 0.0244 (2.24E-03) | 0.608 (4.31E-02) | 0.0309 (3.02E-03) | 0.643 (3.56E-02) | 0.0238 (2.46E-03) |
| 1 | Rand | 20 | 0.637 (4.64E-02) | 0.0247 (3.47E-03) | 0.609 (5.13E-02) | 0.0307 (3.50E-03) | 0.646 (4.15E-02) | 0.0237 (3.21E-03) |
| 5 | Rand | 2 | 0.622 (9.79E-03) | 0.0258 (1.26E-03) | 0.594 (1.50E-02) | 0.0326 (1.91E-03) | 0.625 (1.19E-02) | 0.0248 (7.22E-04) |
| 5 | Rand | 3 | 0.632 (1.61E-02) | 0.0250 (1.57E-03) | 0.601 (1.97E-02) | 0.0317 (1.84E-03) | 0.636 (1.77E-02) | 0.0242 (1.14E-03) |
| 5 | Rand | 5 | 0.634 (2.58E-02) | 0.0252 (1.55E-03) | 0.605 (2.24E-02) | 0.0312 (1.69E-03) | 0.638 (1.61E-02) | 0.0241 (1.25E-03) |
| 5 | Rand | 10 | 0.633 (3.11E-02) | 0.0248 (1.81E-03) | 0.607 (3.34E-02) | 0.0309 (2.15E-03) | 0.642 (2.87E-02) | 0.0239 (1.98E-03) |
| 5 | Rand | 20 | 0.636 (5.12E-02) | 0.0249 (3.40E-03) | 0.608 (5.00E-02) | 0.0308 (3.57E-03) | 0.642 (4.82E-02) | 0.0238 (3.36E-03) |
| 10 | Rand | 2 | 0.622 (8.93E-03) | 0.0256 (8.99E-04) | 0.592 (1.35E-02) | 0.0328 (1.69E-03) | 0.625 (1.16E-02) | 0.0248 (7.67E-04) |
| 10 | Rand | 3 | 0.632 (1.33E-02) | 0.0251 (1.13E-03) | 0.601 (1.68E-02) | 0.0316 (1.64E-03) | 0.636 (1.40E-02) | 0.0242 (9.80E-04) |
| 10 | Rand | 5 | 0.633 (2.46E-02) | 0.0249 (1.91E-03) | 0.606 (2.03E-02) | 0.0312 (1.67E-03) | 0.638 (1.73E-02) | 0.0241 (1.54E-03) |
| 10 | Rand | 10 | 0.633 (3.59E-02) | 0.0248 (2.06E-03) | 0.608 (3.03E-02) | 0.0309 (2.30E-03) | 0.643 (2.78E-02) | 0.0239 (2.13E-03) |
| 10 | Rand | 20 | 0.636 (4.55E-02) | 0.0249 (3.68E-03) | 0.610 (4.63E-02) | 0.0307 (3.77E-03) | 0.644 (4.45E-02) | 0.0238 (3.27E-03) |
| 20 | Rand | 2 | 0.626 (1.18E-02) | 0.0256 (1.18E-03) | 0.593 (1.39E-02) | 0.0327 (1.70E-03) | 0.626 (1.19E-02) | 0.0248 (7.08E-04) |
| 20 | Rand | 3 | 0.630 (1.48E-02) | 0.0250 (1.11E-03) | 0.602 (1.67E-02) | 0.0316 (1.65E-03) | 0.636 (1.40E-02) | 0.0242 (9.37E-04) |
| 20 | Rand | 5 | 0.633 (2.54E-02) | 0.0250 (1.47E-03) | 0.606 (2.12E-02) | 0.0311 (1.66E-03) | 0.640 (1.90E-02) | 0.0240 (1.38E-03) |
| 20 | Rand | 10 | 0.634 (3.46E-02) | 0.0249 (2.53E-03) | 0.608 (3.24E-02) | 0.0308 (2.46E-03) | 0.644 (2.88E-02) | 0.0238 (2.06E-03) |
| 20 | Rand | 20 | 0.634 (5.05E-02) | 0.0250 (3.36E-03) | 0.609 (4.96E-02) | 0.0307 (3.85E-03) | 0.645 (4.58E-02) | 0.0238 (3.22E-03) |
Rep: replication level; Part: partitioning type, either stratification or random; CV-fold: cross-validation fold level; Bold: is the model cross validation procedure of single replicate stratified 10-fold cross validation.
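The partitioning schemes compared in this table can be sketched as follows. The record does not say how stratification was done for a continuous response, so the approach here (sorting by activity and spreading each run of k neighbours across the k folds) is an assumption, as are all function names and the simulated activities.

```python
import random


def random_folds(n_items, k, rng):
    """Randomly partition item indices into k folds of near-equal size."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def stratified_folds(activities, k, rng):
    """Partition indices so each fold spans the full range of activities.

    Assumed scheme: sort by activity, then deal each block of k neighbours
    across the k folds in random order.
    """
    order = sorted(range(len(activities)), key=lambda i: activities[i])
    folds = [[] for _ in range(k)]
    for start in range(0, len(order), k):
        block = order[start:start + k]
        rng.shuffle(block)
        for fold, index in zip(folds, block):
            fold.append(index)
    return folds


def replicated_random_folds(n_items, k, replicates, seed=0):
    """Repeat random k-fold partitioning with a different seed per replicate."""
    return [random_folds(n_items, k, random.Random(seed + rep))
            for rep in range(replicates)]


rng = random.Random(0)
activities = [rng.random() for _ in range(2431)]  # simulated, not real data
folds = stratified_folds(activities, 10, rng)
print([len(fold) for fold in folds])
```

Each held-out fold serves once as the test set while the remaining folds train the model; replication repeats the whole procedure on fresh random partitions so that the spread of R and MSE across partitions can be estimated.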