Yi Wang¹, Yi Li¹, Weilin Pu¹, Kathryn Wen², Yin Yao Shugart², Momiao Xiong³, Li Jin¹.
Abstract
Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed well in testing with an independent dataset, a real psoriasis genome-wide association study (GWAS).
Year: 2016 PMID: 27444562 PMCID: PMC4957112 DOI: 10.1038/srep30086
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
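The abstract describes a two-stage pipeline: many small 3-layer sparse random networks with threshold activations produce binary features ("random bits"), a gradient-boosting scheme selects the useful ones, and a modified random forest makes the final prediction. The following is a hypothetical sketch of the first stage only, not the authors' implementation: the network shapes, the candidate counts, and the correlation-with-residual selection rule are all illustrative stand-ins for the paper's boosting scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_bit(X, n_inputs=3, n_hidden=2):
    """One tiny sparse random network: inputs -> threshold hidden layer -> one output bit."""
    idx = rng.choice(X.shape[1], size=n_inputs, replace=False)  # sparse connectivity
    w1 = rng.normal(size=(n_inputs, n_hidden))
    b1 = rng.normal(size=n_hidden)
    hidden = (X[:, idx] @ w1 + b1 > 0).astype(float)            # threshold units
    w2 = rng.normal(size=n_hidden)
    b2 = rng.normal()
    return (hidden @ w2 + b2 > 0).astype(float)                 # one binary feature

def corr(b, r):
    """Absolute correlation; 0.0 for degenerate (constant) vectors."""
    if b.std() == 0 or r.std() == 0:
        return 0.0
    return abs(np.corrcoef(b, r)[0, 1])

def generate_bits(X, y, n_rounds=50, n_candidates=4):
    """Boosting-style selection: each round keeps the candidate bit most
    correlated with the current residual, then shrinks the residual along it."""
    residual = y - y.mean()
    kept = []
    for _ in range(n_rounds):
        cands = [random_bit(X) for _ in range(n_candidates)]
        best = max(cands, key=lambda b: corr(b, residual))
        kept.append(best)
        centered = best - best.mean()
        if centered.any():
            coef = centered @ residual / (centered @ centered)  # least-squares step
            residual = residual - coef * centered
    return np.column_stack(kept)

X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
bits = generate_bits(X, y)
print(bits.shape)  # (300, 50)
```

In the paper the selected bits go to a modified random forest for the final prediction; in this sketch any off-the-shelf tree ensemble could stand in for that second stage.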
Figure 1. The summarized process. A 3-layer sparse neural network with random weights; the hidden units apply threshold functions.
Figure 2. Maximum AUC on the independent ADO testing dataset for different numbers of markers.
Psoriasis prediction performance of all methods using the selected best number of SNPs.
| | Independent testing dataset (ADO dataset) | | | | | Training dataset (GRU dataset) with 10-fold cross-validation | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Method | Sensitivity | Specificity | Accuracy | AUC | 95% CI of AUC | Sensitivity | Specificity | Accuracy | AUC | 95% CI of AUC |
| NN | 0.6404 | 0.5840 | 0.6055 | 0.6563 | [0.6240, 0.6886] | 0.5347 | 0.6657 | 0.5899 | 0.6192 | [0.4388, 0.7893] |
| KNN | 0.6241 | 0.7279 | 0.6884 | 0.7021 | [0.6699, 0.7344] | 0.6428 | 0.6553 | 0.6478 | 0.6660 | [0.5342, 0.7830] |
| ELM | 0.6589 | 0.6610 | 0.6602 | 0.7053 | [0.6738, 0.7368] | 0.6305 | 0.6403 | 0.6346 | 0.6618 | [0.5210, 0.8094] |
| RF | 0.6311 | 0.7051 | 0.6770 | 0.7134 | [0.6820, 0.7448] | 0.6036 | 0.6703 | 0.6314 | 0.6603 | [0.5072, 0.7954] |
| SVM | 0.6589 | 0.6952 | 0.6814 | 0.7132 | [0.6815, 0.7449] | 0.6569 | 0.6419 | 0.6503 | 0.6694 | [0.5319, 0.7843] |
| GBM | 0.6473 | 0.7080 | 0.6849 | 0.7187 | [0.6873, 0.7500] | 0.5890 | 0.7129 | 0.6415 | 0.6707 | [0.5153, 0.7986] |
| RBF | 0.6543 | 0.7151 | 0.6920 | | [0.6930, 0.7548] | 0.6317 | 0.6490 | 0.6390 | | [0.5254, 0.8275] |
Bold values indicate the best result among all methods compared. *Sensitivity, specificity, accuracy, and AUC for the GRU dataset are averages over the 10-fold cross-validation; the 95% CI of AUC gives the range of the 95% CIs across the 10 folds.
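The metrics reported in the table can all be computed from a vector of predicted scores. A minimal numpy sketch on synthetic labels and scores (not the paper's data); the 0.5 decision threshold is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)                 # true binary labels
scores = y + rng.normal(scale=1.0, size=500)     # noisy predicted scores

pred = (scores > 0.5).astype(int)                # hard predictions at threshold 0.5
tp = np.sum((pred == 1) & (y == 1))
tn = np.sum((pred == 0) & (y == 0))
fp = np.sum((pred == 1) & (y == 0))
fn = np.sum((pred == 0) & (y == 1))

sensitivity = tp / (tp + fn)                     # true-positive rate
specificity = tn / (tn + fp)                     # true-negative rate
accuracy = (tp + tn) / len(y)

# AUC via the rank (Mann-Whitney) formulation: the probability that a
# randomly chosen positive outscores a randomly chosen negative.
order = np.argsort(scores)
ranks = np.empty(len(y))
ranks[order] = np.arange(1, len(y) + 1)
n_pos, n_neg = y.sum(), len(y) - y.sum()
auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 3), round(auc, 3))
```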
Figure 3. ROC curves of the six best benchmarked methods on the independent ADO group of the psoriasis GWAS dataset, using the selected best number of SNPs.
Figure 4. Average 10-fold cross-validation ROC curves of the six best benchmarked methods on the GRU group of the psoriasis GWAS dataset, using the selected best number of SNPs.
Regression RMSE of all methods on 14 datasets.
| Dataset | Sample | Feature | Linear | KNN | NN | ELM | SVM | GBM | RF | RBF |
|---|---|---|---|---|---|---|---|---|---|---|
| | 209 | 7 | 69.62 | 63.13 | 134.91 | 159.23 | 93.63 | 91.67 | 59.66 | |
| | 308 | 6 | 9.13 | 6.43 | 1.18 | 1.96 | 1.03 | 1.16 | | |
| | 506 | 12 | 4.88 | 4.10 | 4.94 | 7.92 | 3.16 | 3.40 | 3.13 | |
| | 517 | 13 | 1.50 | 2.10 | 1.50 | 1.41 | | | | |
| | 536 | 8 | 0.01 | 0.04 | 0.02 | | | | | |
| | 1030 | 9 | 10.53 | 8.28 | 6.36 | 13.18 | 5.25 | 4.72 | 4.53 | |
| | 5875 | 19 | 9.74 | 6.10 | 6.69 | 10.35 | 6.02 | 2.10 | 1.65 | |
| | 6497 | 11 | 0.74 | 0.70 | 0.73 | 0.92 | 0.67 | 0.67 | 0.58 | |
| | 17389 | 16 | 141.87 | 104.58 | 65.99 | 94.56 | 102.37 | 75.47 | 39.97 | |
| | 28179 | 97 | 1.45 | 0.76 | 0.37 | 1.58 | 1.49 | | | |
| | 45730 | 9 | 5.19 | 3.79 | 6.12 | 6.12 | 4.16 | 5.05 | 3.45 | |
| | 434874 | 2 | 18.37 | 6.44 | 15.55 | 16.95 | 12.53 | 14.82 | 3.86 | |
| | 515345 | 90 | 9.55 | 9.22 | 10.93 | 11.47 | — | 9.63 | 9.24 | |
| | 583250 | 78 | 1.33 | 0.52 | 0.51 | 1.03 | — | 0.48 | | |
Bold values indicate the best result among all methods compared.
*An asterisk marks datasets whose dependent variable was log-transformed to be closer to normal.
RBF's RMSE was significantly lower than that of the second-best method, RF (Wilcoxon matched-pairs signed-ranks test, p = 0.007185).
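The paired comparison above uses the Wilcoxon matched-pairs signed-ranks test, which compares two methods dataset by dataset. A short illustration with made-up per-dataset RMSE pairs (illustrative numbers only, not the values from the table):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-dataset RMSE values for two methods; method B is
# constructed to be consistently lower than method A.
rmse_a = np.array([3.1, 4.5, 1.2, 6.3, 0.9, 2.8, 5.0, 3.3, 2.1, 4.0])
rmse_b = rmse_a - np.abs(np.random.default_rng(2).normal(0.3, 0.1, size=10))

# Paired two-sided test on the per-dataset differences.
stat, p = wilcoxon(rmse_a, rmse_b)
print(p < 0.05)  # True
```

Because the test ranks the paired differences rather than the raw errors, it is robust to the very different RMSE scales across datasets seen in the table.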
Classification error (%) of all methods on 14 datasets.
| Dataset | Sample | Feature | LR | KNN | NN | ELM | SVM | GBM | RF | RBF |
|---|---|---|---|---|---|---|---|---|---|---|
| | 100 | 9 | 15.00 | 12.00 | 15.00 | 24.00 | 12.00 | | | |
| | 208 | 60 | 26.00 | 13.02 | 21.67 | 14.43 | 12.52 | 12.52 | 12.02 | |
| | 306 | 3 | 25.85 | 25.16 | 30.71 | 27.40 | 26.45 | 27.12 | 27.40 | |
| | 351 | 34 | 10.26 | 10.25 | 11.98 | 10.28 | 5.13 | 6.26 | 6.55 | |
| | 540 | 18 | 7.04 | 5.56 | 5.93 | 4.44 | 5.74 | 6.48 | 4.81 | |
| | 569 | 30 | 5.09 | 2.81 | 8.45 | 8.80 | 3.33 | 2.98 | 2.28 | |
| | 579 | 10 | 27.83 | 27.82 | 30.21 | 28.34 | 28.51 | 27.47 | 26.42 | |
| | 748 | 4 | 22.86 | 24.46 | 23.80 | 20.19 | 21.66 | 21.79 | 19.92 | |
| | 1055 | 41 | 13.37 | 13.75 | 14.98 | 22.38 | 12.14 | 12.89 | 12.42 | |
| | 1212 | 100 | 42.00 | 45.71 | 5.28 | 23.42 | 34.73 | 43.89 | 40.50 | |
| | 1372 | 4 | 1.02 | 0.15 | 0.15 | 0.51 | | | | |
| | 14980 | 14 | 35.75 | 15.37 | 31.57 | 42.34 | 19.52 | 8.46 | 5.96 | |
| | 19020 | 10 | 20.88 | 15.86 | 13.17 | 22.64 | 12.30 | 11.75 | 11.73 | |
Bold values indicate the best result among all methods compared.
RBF's classification error was significantly lower than that of the second-best method, SVM (Wilcoxon matched-pairs signed-ranks test, p = 0.04584).