
Random Bits Forest: a Strong Classifier/Regressor for Big Data.

Yi Wang1, Yi Li1, Weilin Pu1, Kathryn Wen2, Yin Yao Shugart2, Momiao Xiong3, Li Jin1.   

Abstract

Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed well in testing with an independent dataset, a real psoriasis genome-wide association study (GWAS).

Year:  2016        PMID: 27444562      PMCID: PMC4957112          DOI: 10.1038/srep30086

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


The most widely used methods for prediction include linear regression, logistic regression, k-Nearest Neighbors (k-NN) [1], support vector machines (SVM) [2], neural networks (NNs) [3], extreme learning machines (ELM) [4], deep learning (DL) [5], random forests (RF) [6,7], and generalized boosted models (GBM) [8,9]. However, each method has its own drawbacks. For instance, linear regression and logistic regression handle linear and log-linear conditions, respectively, but may fail on nonlinear tasks. k-NN is sensitive to the local structure of the data, with the best choice of k depending on the properties of each dataset [10]. SVMs have uncalibrated class membership probabilities, large memory requirements (O(N^2)), and difficult-to-interpret parameters [2,11,12]. NNs and DL are computationally expensive, with features learnt and tuned iteratively [13,14]. ELMs do not have sufficient features to handle complex tasks [15]. GBMs have high memory consumption and low evaluation speed [16], as all base-learners must be evaluated in order to obtain predictions for the model. For RFs, decision trees are axis-parallel, which may lead to suboptimal trees; though oblique random forests provide one way to improve the performance of random forests [17], ultimately they may fail on datasets with greater depth [18]. We created Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks, boosting, and random forests. We compared the performance of RBF with that of seven other methods, using 28 datasets from the UCI (University of California, Irvine) Machine Learning Repository. We then tested RBF on real psoriasis genome-wide association study (GWAS) data.

Methods

Summary

For clarity, features were standardized by subtracting the mean and dividing by standard deviation. The features were then transformed into random features/basis, by gradient boosting of the Random Bits base learner, a 3-layer sparse neural network with random weights, and fed to a random forest classifier/regressor to obtain predictions (Fig. 1).
Figure 1

The summarized process.

A 3-layer sparse neural network with random weights. represents threshold functions.

Random Bits

Our derived feature/basis/base learner is called Random Bits. It is a 3-layer sparse neural network with random weights. Two parameters were used to construct the neural network: twist1 (the number of features connected to each hidden node) and twist2 (the number of hidden nodes). The features connected to each hidden node are randomly assigned, and interlayer weights are drawn from a standard normal distribution. The hidden nodes and the top node are threshold units. For each node, the linear sum of its inputs, z_i, is computed for every sample i, and the z_i of a randomly chosen sample is used as the threshold [15].
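The construction of a single Random Bit can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name make_random_bit, the use of >= for the threshold comparison, and the top node's weights being drawn over all hidden nodes are choices made for the example.

```python
import random

def make_random_bit(X, twist1, twist2, rng):
    """Return a function mapping a feature row to 0/1 (one hypothetical Random Bit)."""
    n_features = len(X[0])

    def weighted_sum(weights, values):
        return sum(w * v for w, v in zip(weights, values))

    # Hidden layer: each node is connected to twist1 randomly chosen features.
    hidden = []
    for _ in range(twist2):
        idx = rng.sample(range(n_features), twist1)
        w = [rng.gauss(0.0, 1.0) for _ in idx]          # standard normal weights
        z = [weighted_sum(w, [row[j] for j in idx]) for row in X]
        hidden.append((idx, w, rng.choice(z)))          # threshold = z of a random sample

    def hidden_layer(row):
        # Assumption: a node fires when its input sum reaches its threshold.
        return [1 if weighted_sum(w, [row[j] for j in idx]) >= t else 0
                for idx, w, t in hidden]

    # Top node: random weights over the hidden outputs, thresholded the same way.
    w_top = [rng.gauss(0.0, 1.0) for _ in range(twist2)]
    z_top = [weighted_sum(w_top, hidden_layer(row)) for row in X]
    t_top = rng.choice(z_top)

    def bit(row):
        return 1 if weighted_sum(w_top, hidden_layer(row)) >= t_top else 0

    return bit

# Toy usage on 20 standardized samples with 5 features.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)]
bit = make_random_bit(X, twist1=2, twist2=3, rng=rng)
bits = [bit(row) for row in X]
```

Because each threshold is taken from an actual sample, every node is guaranteed to fire for at least one training sample under the >= convention assumed here.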

Boosting Random Bits

In order to generate many Random Bits, we used a gradient boosting scheme with the following pseudocode:

For boost = 1 to B:
    residual = Y
    For step = 1 to S:
        1: MaxVar = 0; BestBit = NULL
        2: For cand = 1 to C:
            1: Draw a random bit, RB
            2: Calculate the residual explained by RB: Var
            3: if (Var > MaxVar) {MaxVar = Var; BestBit = RB;}
        3: Set random_bit_pool[(boost − 1) * S + step] = BestBit
        4: Mean[0] = E(residual | BestBit = 0); Mean[1] = E(residual | BestBit = 1)
        5: residual = residual − Mean[BestBit]

The algorithm launched B independent boosting chains, each with S steps. Each boosting chain undergoes the standard gradient boosting procedure, starting with a residual of Y and updating it at every step. In each step, C candidate Random Bits (C > 100) were generated, and the bit that explained the most of the current residual was chosen. The Random Bits from each independent boosting chain were collected to form a large (~10,000) feature pool. The Random Bits were stored in a compressed format requiring 1 bit per Random Bit per sample.
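One boosting chain from the procedure above can be sketched in Python as follows. This is a hedged illustration rather than the released RBF code: draw_bit is a hypothetical callable returning a candidate bit's 0/1 values on the training samples, and "residual explained" is assumed to mean the reduction in the residual sum of squares when each group's mean is subtracted.

```python
import random

def explained(residual, bits):
    """Reduction in the sum of squared residuals after removing per-group means."""
    total = 0.0
    for value in (0, 1):
        group = [r for r, b in zip(residual, bits) if b == value]
        if group:
            total += sum(group) ** 2 / len(group)
    return total

def boost_chain(y, draw_bit, steps, candidates):
    """Run one boosting chain; return the selected bits and the final residual."""
    residual = list(y)                    # each chain starts from Y
    pool = []
    for _ in range(steps):
        best_bits, best_var = None, -1.0  # -1 so that some candidate is always kept
        for _ in range(candidates):
            bits = draw_bit()
            var = explained(residual, bits)
            if var > best_var:
                best_bits, best_var = bits, var
        pool.append(best_bits)
        # Subtract the conditional mean of the residual given the chosen bit.
        means = {}
        for value in (0, 1):
            group = [r for r, b in zip(residual, best_bits) if b == value]
            means[value] = sum(group) / len(group) if group else 0.0
        residual = [r - means[b] for r, b in zip(residual, best_bits)]
    return pool, residual

# Toy usage: random 0/1 vectors stand in for real Random Bits.
rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(50)]
draw = lambda: [rng.randint(0, 1) for _ in range(50)]
pool, final_residual = boost_chain(y, draw, steps=5, candidates=10)
```

Running B such chains independently and concatenating their pools gives the B × S feature pool described in the text.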

Random Bits Forest

The produced Random Bits are eventually fed to Random Bits Forest. Random Bits Forest is a random forest classifier/regressor, slightly modified for speed: each tree was grown with a bootstrapped sample and bootstrapped bits, the number of which can be tuned by users. The best bit among the bootstrapped bits was chosen for each split. By making full use of the binary nature of Random Bits, through special coding and Streaming SIMD Extensions (SSE), the modified random forest was accelerated enough to afford ~10,000 binary features on large datasets (N = 500,000).
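To illustrate why binary features allow fast split evaluation, the sketch below packs one bit per sample into an integer and obtains the class counts on each side of a split with bitwise AND and a population count. It is an assumption-laden Python analogy for a binary classification node, not the authors' SSE code; the names pack, popcount, and split_counts are hypothetical.

```python
def pack(bits):
    """Pack a list of 0/1 values into one integer, one bit per sample."""
    word = 0
    for i, b in enumerate(bits):
        if b:
            word |= 1 << i
    return word

def popcount(word):
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")

def split_counts(feature, labels, in_node):
    """Class counts on both sides of a split on one binary feature.

    All three arguments are packed bit vectors over the same samples;
    in_node marks the samples that reached the current tree node.
    """
    right = in_node & feature      # samples in the node with feature == 1
    left = in_node & ~feature      # samples in the node with feature == 0
    return {
        "left_neg": popcount(left & ~labels),
        "left_pos": popcount(left & labels),
        "right_neg": popcount(right & ~labels),
        "right_pos": popcount(right & labels),
    }

# Toy usage with six samples, all of them in the node.
feature = pack([1, 0, 1, 1, 0, 0])
labels = pack([1, 0, 0, 1, 1, 0])
in_node = pack([1, 1, 1, 1, 1, 1])
counts = split_counts(feature, labels, in_node)
```

From these four counts an impurity measure such as the Gini index can be computed for each candidate bit without touching individual samples.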

Benchmarking

We benchmarked nine methods: linear regression (Linear), logistic regression (LR), k-Nearest Neighbors (kNN), neural networks (NN), support vector machines (SVM), extreme learning machines (ELM), random forests (RF), generalized boosted models (GBM), and Random Bits Forest (RBF). We used the RBF software available at http://sourceforge.net/projects/random-bits-forest/ and implemented the other eight methods using various R (v3.2.1) packages: stats, RWeka (v0.4-24), nnet (v7.3-8), kernlab (v0.9-19), randomForest (v4.6-10), elmNN (v1.0), and gbm (v2.1). We used ten-fold cross validation to evaluate each method’s performance, reporting accuracy, sensitivity, specificity, and AUC. For methods sensitive to parameter selection, we manually tuned the parameters to obtain the best performance. Because each method was run with its best hand-picked parameters, the resulting performances are comparable across methods. The results of tuning the parameters of sensitive methods on the real psoriasis genome-wide association study (GWAS) dataset are provided in Supplemental Materials 1. Benchmarking was performed on a desktop PC equipped with an AMD FX-8320 CPU and 32 GB of memory. On some large-sample datasets, SVM failed to complete benchmarking within a reasonable time (1 week), so those results were left blank.
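For readers unfamiliar with the evaluation protocol, ten-fold cross validation can be sketched in Python as below. The shuffling, seed, and fold assignment are assumptions made for the example; the text does not state how folds were drawn in the benchmark.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Toy usage: 25 samples split into 10 folds.
splits = list(k_fold_indices(25, k=10))
tested = sorted(j for _, test in splits for j in test)
```

Each method would be fitted on the train indices and scored on the test indices of every split, with the per-fold scores averaged.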

Benchmarked UCI Datasets Study

We benchmarked all datasets from the UCI Machine Learning Repository [19] that fulfilled the following criteria: (1) the dataset contains no missing values; (2) the dataset is in dense matrix form; (3) for classification, the dataset uses only binary classification; and (4) the dataset has clear instructions and specifies the target variable. We included 14 regression datasets [20-30] and 14 classification datasets [31-42].

Applications on GWAS Dataset Study

We applied each method to a psoriasis genome-wide association study (GWAS) genetic dataset [43,44] to predict disease outcomes. We obtained the dataset, a part of the Collaborative Association Study of Psoriasis (CASP), from the Genetic Association Information Network (GAIN) database, a partnership of the Foundation for the National Institutes of Health. The data were available at http://dbgap.ncbi.nlm.nih.gov through dbGaP accession number phs000019.v1.p1. All genotypes were filtered by checking for data quality [44]. We used 1590 subjects (915 cases, 675 controls) in the general research use (GRU) group and 1133 subjects (431 cases and 702 controls) in the autoimmune disease only (ADO) group. A dermatologist diagnosed all psoriasis cases. Each participant’s DNA was genotyped with the Perlegen 500K array. Both cases and controls signed a consent form, and controls (≥18 years old) had no confounding factors relative to a known diagnosis of psoriasis. We used both SNP ranking and multiple logistic regression methods, based upon allelic association p-values, for feature selection in training datasets and compared the different methods in both training and testing datasets. First, we trained the model on the GRU dataset with different numbers of top associated SNPs, and used a robust and popular method (LR) to select the best number of SNPs as predictors, based on the maximum AUC on the independent ADO (testing) dataset (Fig. 2 and Supplemental Materials 2). We then used the best number (50) of top associated SNPs as input variables and evaluated the performance of each learning algorithm (except LR) on both the GRU (training) dataset and the independent ADO (testing) dataset. For more information on these 50 top associated SNPs, Pearson’s R squared and odds ratios [45] are provided in Supplemental Materials 3.
Figure 2

Maximum AUC of the independent ADO testing dataset with different numbers of markers.

To evaluate a classification method’s performance on an imbalanced dataset, we used the area under the receiver operating characteristic (ROC) curve. The area under the curve (AUC) measures the global classification accuracy and is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [46]. We used the AUC as a measure of classifier performance for both GRU (training) and ADO (testing) datasets (Table 3, Figs 3 and 4). The 95% confidence interval (CI) of the AUC [47] was also calculated, as were the sensitivity, specificity, and accuracy of each method at the optimal threshold value.
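The ranking interpretation of the AUC can be made concrete with a short Python sketch that compares every positive instance against every negative one. The function name auc_by_ranking and the convention of counting ties as one half are assumptions for the example; this quadratic-time version is meant for clarity, not for large datasets.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # assumption: ties count as half
    return wins / (len(pos) * len(neg))

# Toy usage: three of the four positive/negative pairs are ordered correctly.
example_auc = auc_by_ranking([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Because the value depends only on the ordering of the scores, it is unaffected by the case/control ratio, which is why it suits imbalanced datasets.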
Table 3

Psoriasis prediction performance with all methods based on best number of SNP subsets.

Independent testing dataset (ADO dataset)
Method  Sensitivity  Specificity  Accuracy  AUC     95% CI of AUC
NN      0.6404       0.5840       0.6055    0.6563  [0.6240, 0.6886]
KNN     0.6241       0.7279       0.6884    0.7021  [0.6699, 0.7344]
ELM     0.6589       0.6610       0.6602    0.7053  [0.6738, 0.7368]
RF      0.6311       0.7051       0.6770    0.7134  [0.6820, 0.7448]
SVM     0.6589       0.6952       0.6814    0.7132  [0.6815, 0.7449]
GBM     0.6473       0.7080       0.6849    0.7187  [0.6873, 0.7500]
RBF     0.6543       0.7151       0.6920    0.7239  [0.6930, 0.7548]

Training dataset (GRU dataset) with 10-fold cross validation*
Method  Sensitivity  Specificity  Accuracy  AUC     95% CI of AUC
NN      0.5347       0.6657       0.5899    0.6192  [0.4388, 0.7893]
KNN     0.6428       0.6553       0.6478    0.6660  [0.5342, 0.7830]
ELM     0.6305       0.6403       0.6346    0.6618  [0.5210, 0.8094]
RF      0.6036       0.6703       0.6314    0.6603  [0.5072, 0.7954]
SVM     0.6569       0.6419       0.6503    0.6694  [0.5319, 0.7843]
GBM     0.5890       0.7129       0.6415    0.6707  [0.5153, 0.7986]
RBF     0.6317       0.6490       0.6390    0.6739  [0.5254, 0.8275]

Bold marks the best result among all methods compared.

*AUC, sensitivity, specificity, and accuracy are average values over the 10-fold CV; the 95% CI of AUC is the range of the 95% CIs of AUC over the 10-fold CV.

Figure 3

ROC curves of the six best benchmarked methods on the independent ADO group of the psoriasis GWAS dataset, using the selected best number of SNPs.

Figure 4

Average ten-fold cross-validation ROC curves of the six best benchmarked methods on the GRU group of the psoriasis GWAS dataset, using the selected best number of SNPs.

Results

Results from UCI Datasets Study

Table 1 shows the regression root-mean-square error (RMSE) of all methods on 14 datasets. RBF was the top performing method in 13 and the second best performing method in 1. In the one case (Housing) in which RBF was not the best method, the difference between RBF and the top performing method (RF) was within 2%. RF was the second best performing method among the regression datasets. RBF’s performance exhibited the greatest improvement over that of the other methods on the 3D Road Network dataset, a shallow task in which the methods predicted the altitude at specific points on a 3D map. Even on this shallow task, RBF outperformed RF by allowing non-axis-parallel splitting.
Table 1

Regression RMSE of all methods on 14 datasets.

Regression RMSE                    Sample  Feature  Linear  KNN     NN      ELM     SVM     GBM     RF      RBF
Computer hardware                  209     7        69.62   63.13   134.91  159.23  93.63   91.67   59.66   58.39
Yacht hydrodynamics                308     6        9.13    6.43    1.18    1.96    1.03    1.16    1.00    1.00
Housing                            506     12       4.88    4.10    4.94    7.92    3.16    3.40    3.07    3.13
Forest fire*                       517     13       1.50    1.40    2.10    1.40    1.50    1.40    1.41    1.40
Istanbul stock exchange            536     8        0.01    0.01    0.04    0.02    0.01    0.01    0.01    0.01
Concrete compressive strength      1030    9        10.53   8.28    6.36    13.18   5.25    4.72    4.53    4.18
Parkinsons telemonitoring          5875    19       9.74    6.10    6.69    10.35   6.02    2.10    1.65    1.19
Wine quality                       6497    11       0.74    0.70    0.73    0.92    0.67    0.67    0.58    0.57
Bike sharing                       17389   16       141.87  104.58  65.99   94.56   102.37  75.47   39.97   38.26
Buzz in social media tomhardware*  28179   97       1.45    0.76    0.37    1.58    1.49    0.31    0.31    0.31
Physicochemical properties         45730   9        5.19    3.79    6.12    6.12    4.16    5.05    3.45    3.27
3D Road Network                    434874  2        18.37   6.44    15.55   16.95   12.53   14.82   3.86    1.20
Year prediction MSD                515345  90       9.55    9.22    10.93   11.47   -       9.63    9.24    8.87
Buzz in social media twitter*      583250  78       1.33    0.52    0.51    1.03    -       0.48    0.47    0.47

Bold marks the best result among all methods compared.

*The dependent variable of the corresponding dataset was log-transformed to make it closer to a normal distribution.

RBF’s RMSE was significantly lower than that of RF, the second-best method (Wilcoxon matched-pairs signed-ranks test, p-value = 0.007185).

Table 2 shows the classification error of each method among 14 datasets. RBF was the top performer in 8 datasets, the second best in 5, and the third best in 1. In the cases where RBF was not the best method, the difference between RBF and the top performing method was within 2%. SVM was the second best method among classification datasets. RBF’s performance exhibited the greatest improvement over that of the other methods on the Hill valley with noise dataset, a deep task in which the methods classified the shape (“hill” or “valley”) of a time series with 100 time points. Although all other methods, except neural networks, performed poorly on this task, RBF and its 3-layer random neural network features worked well on this dataset.
Table 2

Classification error of all methods on 14 datasets.

Classification error%               Sample  Feature  LR      KNN     NN      ELM     SVM     GBM     RF      RBF
Fertility                           100     9        15.00   12.00   15.00   24.00   12.00   12.00   12.00   12.00
Connectionist Bench                 208     60       26.00   13.02   21.67   14.43   10.14   12.52   12.52   12.02
Habermans survival                  306     3        25.85   25.16   30.71   27.40   26.45   27.12   27.4    25.12
Ionosphere                          351     34       10.26   10.25   11.98   10.28   5.13    6.26    6.55    4.26
Climate Model Simulation Crashes    540     18       4.26    7.04    5.56    5.93    4.44    5.74    6.48    4.81
Breast Cancer Wisconsin Diagnostic  569     30       5.09    2.81    8.45    8.80    1.93    3.33    2.98    2.28
Indian Liver Patient                579     10       27.83   27.82   30.21   28.34   28.51   27.47   26.09   26.42
Blood Transfusion Service Center    748     4        22.86   19.65   24.46   23.80   20.19   21.66   21.79   19.92
QSAR biodegradation                 1055    41       13.37   13.75   14.98   22.38   12.14   12.89   12.42   11.95
Hill valley with noise              1212    100      42.00   45.71   5.28    23.42   34.73   43.89   40.50   2.47
Banknote authentication             1372    4        1.02    0.15    0.00    0.00    0.00    0.15    0.51    0.00
EEG Eye State                       14980   14       35.75   15.37   31.57   42.34   19.52   8.46    5.96    3.66
MAGIC Gamma Telescope               19020   10       20.88   15.86   13.17   22.64   12.30   11.75   11.73   10.36

Bold marks the best result among all methods compared.

RBF’s error% was significantly lower than that of SVM, the second-best method (Wilcoxon matched-pairs signed-ranks test, p-value = 0.04584).

Furthermore, we also observed that the datasets in which RBF performed best were all big datasets (N > 1000 with limited features, Table 1 and Table 2). This is due to the nature of trees, which inherently require larger samples than do regressions.

Results from GWAS dataset study

Figure 2 and Supplemental Materials 2 show that the ideal number of biomarkers for prediction of psoriasis was 50 with the efficient LR classifier. When the number of biomarkers was less than 20, the AUC of the LR classifier on the independent ADO (testing) dataset was unstable. As the number of biomarkers approached 50, performance improved and stabilized: the best AUC for LR was 0.7063. Performance did not significantly improve as the number of biomarkers increased beyond 50. As seen in Table 3, all benchmarked methods were used to construct effective diagnosis models for psoriasis prediction based on the optimal number of SNPs. No significant imbalances were found in the training and testing datasets, suggesting the credibility and stability of the prediction models. The average AUC of 10-fold cross-validation [48] in the training dataset and the AUC of the independent testing dataset were used to evaluate the performance of all methods. The AUC of each method ranged from 0.6192 to 0.6739 in the training dataset and from 0.6563 to 0.7239 in the testing dataset. We found that RBF, GBM, SVM and RF were the four top performing methods in both the training dataset and the testing dataset. RBF was the top performer in both the training dataset (AUC = 0.6739, 95% CI: [0.5254, 0.8275], sensitivity = 0.6317, specificity = 0.6490, accuracy = 0.6390) and the testing dataset (AUC = 0.7239, 95% CI: [0.6930, 0.7548], sensitivity = 0.6543, specificity = 0.7151, accuracy = 0.6920). The ROC curves for each method are shown in Fig. 3 and Fig. 4 for visual comparison of performance. Furthermore, RBF appeared to be robust in sensitivity and specificity in both the training and testing datasets. Although the sensitivity and specificity of RBF were not the best for all datasets, its AUC was still the highest in both the GRU (training) and ADO (testing) datasets.
This characteristic of RBF also applies to unbalanced datasets, where prediction performance may be easily influenced by the disease population ratio. In Table 3, we see that although KNN has the second-highest accuracy (accuracy = 0.6884) in the testing dataset, its AUC (AUC = 0.7021) is comparatively poor because it favors specificity (specificity = 0.7279) over sensitivity (sensitivity = 0.6241).

Discussion

Random forests are among the top performing algorithms for machine learning, as they are accurate, fast, flexible, and mature. Random forest [6] is a substantial modification of bagging which builds a large number of de-correlated trees and then averages them. The main idea of random forests is to improve the variance reduction of bagging by reducing the correlation between trees without increasing the variance heavily [49]. This is achieved in the tree-growing process by randomly selecting the input variables. Thus, Random Bits Forest mainly focuses on automated feature engineering for random forests. We also obtain good results if we feed random bits to a regularized linear regression, though, in big data cases, no better than those we get from random forests. The statistical inference [50] of random forests applies equally to RBF. RBF outperforms the random forest algorithm by overcoming two of its limitations: the limitation to axis-parallel splitting, which may lead to suboptimal trees [17], and the decision tree depth of two, which could fail on datasets with greater depth [18]. To overcome the first limitation, we used random projections. Because many (~10,000) random projections are generated in advance, the tree is allowed to grow with more freedom. To overcome the second limitation, we improved naïve random projections with a 3-layer random neural network: we defined a random neural network based on the original features and took its output as a derived feature/basis. Such additional depth may be crucial for specific datasets (UCI dataset: Hill valley with noise, shown in Table 2). Compared to oblique random forests, RBF generates non-axis-parallel features before the random forest is grown, while oblique random forests generate oblique splits within the tree-growing process. One crucial improvement to our random projections was to use 3-layer random neural networks as the random projection/basis, giving the random forest more depth.
Additional layers did not improve accuracy on the benchmarked datasets, potentially because 3-layer neural networks are already universal approximators. In order to make full use of our ~10,000 bits budget, we need a feature selection procedure rather than naïve random projections. Feature selection was achieved by employing the gradient boosting framework. Instead of directly using the boosting predictions, we collected the boosted basis and fed them into the random forest. At each step, we found the random bit that best explained the residual and subtracted its effect from the residual to avoid highly correlated random bits. For the Hill valley with noise dataset, this method for feature selection reduced error from 11% to 2.5%, compared with naïve random projections. In the boosting procedure, we used multiple independent boost chains, originally just for ease of parallel computing. However, multiple chains also reduced the local optimum problem and led to better prediction. For small datasets, 256 boost chains were used. Large samples (N > 1000) are important for the success of RBF, since trees are more flexible models than linear models and as a result require a larger sample size. For smaller samples, regularization is useful, which was achieved by limiting the bootstrapped sample size. The consequence is that each tree is suboptimal and biased, but the trees are further decorrelated, thus reducing variance. Reducing the feature bootstrap size also helped to regularize the problem. In summary, we present Random Bits Forest (RBF), an original classification and regression algorithm that integrates the advantages of neural networks (for learning depth), boosting (for learning width), and random forests (for prediction accuracy). This combination may explain why RBF performed better than the other benchmarked methods. In conclusion, RBF is a novel, robust method for machine learning, which is especially effective on datasets with large sample sizes (N > 1000).
Our work indicates that RBF performs better if fed with features extracted or selected by appropriate feature selection methods.

Additional Information

How to cite this article: Wang, Y. et al. Random Bits Forest: a Strong Classifier/Regressor for Big Data. Sci. Rep. 6, 30086; doi: 10.1038/srep30086 (2016).