Literature DB >> 36175944

Effective hybrid feature selection using different bootstrap enhances cancers classification performance.

Noura Mohammed Abdelwahed1, Gh S El-Tawel2, M A Makhlouf3.   

Abstract

BACKGROUND: Machine learning can be used to predict the different onset of human cancers. Highly dimensional data have enormous, complicated problems. One of these is an excessive number of genes plus over-fitting, fitting time, and classification accuracy. Recursive Feature Elimination (RFE) is a wrapper method for selecting the best subset of features that cause the best accuracy. Despite the high performance of RFE, time computation and over-fitting are two disadvantages of this algorithm. Random forest for selection (RFS) proves its effectiveness in selecting the effective features and improving the over-fitting problem.
METHOD: This paper proposed a method, namely, positions first bootstrap step (PFBS) random forest selection recursive feature elimination (RFS-RFE) and its abbreviation is PFBS- RFS-RFE to enhance cancer classification performance. It used a bootstrap with many positions included in the outer first bootstrap step (OFBS), inner first bootstrap step (IFBS), and outer/ inner first bootstrap step (O/IFBS). In the first position, OFBS is applied as a resampling method (bootstrap) with replacement before selection step. The RFS is applied with bootstrap = false i.e., the whole datasets are used to build each tree. The importance features are hybrid with RFE to select the most relevant subset of features. In the second position, IFBS is applied as a resampling method (bootstrap) with replacement during applied RFS. The importance features are hybrid with RFE. In the third position, O/IFBS is applied as a hybrid of first and second positions. RFE used logistic regression (LR) as an estimator. The proposed methods are incorporated with four classifiers to solve the feature selection problems and modify the performance of RFE, in which five datasets with different size are used to assess the performance of the PFBS-RFS-RFE.
RESULTS: The results showed that the O/IFBS-RFS-RFE achieved the best performance compared with previous work and enhanced the accuracy, variance and ROC area for RNA gene and dermatology erythemato-squamous diseases datasets to become 99.994%, 0.0000004, 1.000 and 100.000%, 0.0 and 1.000, respectively.
CONCLUSION: High dimensional datasets and RFE algorithm face many troubles in cancers classification performance. PFBS-RFS-RFE is proposed to fix these troubles with different positions. The importance features which extracted from RFS are used with RFE to obtain the effective features.
© 2022. The Author(s).

Entities:  

Keywords:  High dimensional data; Learning algorithms; Machine Learning; Over-fitting; Random Forest feature importance; Recursive feature elimination and its disadvantages

Year:  2022        PMID: 36175944      PMCID: PMC9523996          DOI: 10.1186/s13040-022-00304-y

Source DB:  PubMed          Journal:  BioData Min        ISSN: 1756-0381            Impact factor:   4.079


Introduction

Artificial intelligence (AI) is a science that plays an important role in all fields, especially in the biomedical field, and it aims to simulate reality [1, 2]. Different AI applications have been applied in this field for 20 years due to many factors, including the availability of different datasets in this field, computer devices with high capabilities and arithmetic algorithms [2]. AI has great importance, as a survey has proven that it has great effectiveness in health, and it will outperform the performance of specialists in this field. In addition, it has proven effective in cancer research [2]. Furthermore, AI has become providing human specialists with many information and accordingly, the decision is taken, as it has become one of the most important elements in the medical team [2]. It also works to improve accuracy, speed up diagnosis and discover features or genes affecting cancer as recommendations for human specialists to take into consideration [2]. AI is considered a second decision that helps the specialist make their decision [2]. AI differs from the manual method because it provides human specialists with more information and details. Its diagnosis is more accurate and efficient and does not require more labor. The manual method may be stressful for the patient, as it puts him under great pressure and takes more time to know the results of the sample, which makes him tense [3]. Cancer has become very widespread in recent times, as it has become a major cause of disease and death [4]. It can be defined as a group of more than one disease due to abnormal cell growth or changes in genes, and it can occur anywhere in the body [5]. Many factors cause cancer including [6]: - (1) tobacco consumption, (2) poor diet, (3) lack of physical activity, (4) alcohol, (5) radiation, (6) infection, (7) genetic factors, (8) smoking and (9) age [6]. There are many different types of human cancer, but in this paper, we used some types that included Breast Invasive Carcinoma (BR), Bladder urothelial carcinoma (BL), Colon and rectum (CO), Glioblastoma multiform (GB), Head and neck squamous cell (HN), Kidney renal clear-cell (KI), Parkinson’s disease (PD), Prostate adenocarcinoma (PRAD) and Lung adenocarcinoma (LUAD). There are enormous problems in big datasets involved in the features numbers, fitting time, classification accuracy, and model performance. Feature selection is a process for selecting the most relevant features and discarding insignificant ones. Feature selection plays a vital role in many directions to enhance the model performance [7-9]. This process aims to select the most relevant subset r features from the original R features set (r < R) in given datasets [9]. R includes all features in a dataset. It suffers from many problems included in high dimension, noisy, repetitive and over-fitting. The ineffective features are deleted. These features diminish the classification accuracy and waste time. By deleting irrelevant features, all previous problems are solved and improved. Feature selection procedures have three major types: filter, wrapper [9, 10], and embedded [11]. Filter procedure selects the features by evaluating their relevance of features. These features are ranked in decreased order, and low-ranking features are omitted to obtain the most relevant features [12]. The filter approach can use many measures included in gain ratio, mutual information based feature selection (MIFS), information gain based feature selection (IGF), relaxed functional dependencies [9], and chi-square [10]. This procedure does not depend on any machine learning and is faster than the wrapper procedure. Despite its simplicity, it suffers from an over-fitting problem. The best subset of features is selected depending on machine learning to estimate this subset [9, 10]. This procedure suffers from expensive computationally when applied on high dimensions. On the other hand, it guarantees to select the most relevant and effective subset of features. Feature selection is an integral part of the classification model in the embedded procedure. It is embedded in the phase of learning [11]. This procedure has many advantages, including being less computationally expensive, reducing over-fitting problems, and selecting the most accurate features. In this direction, we adopted the integration of wrapper procedure with embedded one to select the relevant features using proposed methods to minimize the previous drawbacks and maximize the classification accuracy. Selecting influencing features is an effective step in the classification process to obtain accurate results. Many datasets always suffer from high dimensions problems, which negatively affect the model performance’s accuracy. The feature selection step is considered one of the processes that positively impact solving many problems facing different datasets. In this direction, many authors applied different feature selection algorithms to minimize processing time, over-fitting, maximize classification accuracy and find the most relevant features, which still need more researches to improve. Therefore, there are numerous different methods for feature selection to fix the previous drawbacks included in the filter, wrapper, and embedded methods. The filter method is simple, and it selects the features based on their ranking according to a class. Still, it suffers from over-fitting problems in high dimensions datasets and disregards feature dependencies. Elsadek et al. [12] proposed a method using IGF to classify six human cancer types based on DNA copy number variation (CNV) dataset. The proposed method selected 16,381 features as the most relevant features. More than one learning algorithm is applied, such as logistic regression (LR), support vector machine (SVM), random forest (RF), J48, neural network, bagging and dagging. LR learning algorithm achieved the best classification accuracy of about 85% and ROC area 0.965. Rajit et al. [13] proposed selecting best and select percentile filter methods. The proposed method used a breast cancer dataset. There are more than one learning algorithms are used. LR classifier achieved a better result. Furthermore, many filter methods are proposed by Pinar Yildirim [14]. Different filter methods are applied in Cfs Subset eval, principal component analysis (PCA), consistency subset eval, IGF, One-R attribute eval, and relief attribute eval. The proposed method used the Hepatitis datasets and proved that the Consistency Subset, IGF, One-R Attribute Eval, and Relief Attribute Eval filter methods achieved better results. In addition, Alirezanejad et al. [15] proposed a filter method for gene selection using two heuristic methods. These methods, namely, Xvariance and mutual congestion. The Xvariance gave the best results with the standard datasets, while mutual congestion enhanced the accuracy of high-dimensional datasets. Kuswanto et al. [16] proposed a comparison method for feature selection using different filtering methods. Three filtering methods included in MIFS, correlation based feature selection (CFS) and fast correlation based feature selection (FCBF) are applied. The results of these methods are forwarded to K-nearest neighbors (KNN) classifer. The results showed that the FCBF selected a small number of features, while other methods performed well. Furthermore, Ghasemi et al. [17] proposed a method using IGF and gini index to select important features. These features are used to early predict of heart disease. This proposed method aimed to minimize the dimension and maximize the performance of the diagnosis of heart disease with less medical experiments. Mahmood [18] proposed a method to minimize a dimension for facial expression recognition dataset. Two feature selection methods are applied to obtain minimum number of features included in Chi-Square and Relief-F. These methods selected the first highest six features. Four different classifiers are applied to evaluate the performance. In addition, Spencer et al. [19] proposed a method to predict heart disease dataset. Four proposed methods are used for feature selection included in ReliefF, Chi-squared, symmetrical uncertainty and PCA. Different machine learning classifiers are applied to create models for comparison. The best prediction with less subset of features is selected using Chi-Square. Mohamed et al. [20] proposed a method to obtain the most important subset of feature rather than the whole dataset. Chi-square, IG and Bat algorithm are applied for feature selection. Many varieties of classifiers are used to evaluate the model performance. Vikas et al. [21] proposed a method to minimize processing time and maximize classification accuracy using lung cancer detection. To select the most relevant features, Chi-square algorithm is applied. Two different classifiers are used to evaluate the performance included in SVM and RF. Many authors applied wrapper methods to solve the optimization problems and to get the most important subset features using different datasets. AH et al. [22] proposed an algorithm using the wrapper approach. The proposed algorithm enhanced the basic salp swarm algorithm (SSA) to improve reliability, convergence speed, and classification accuracy. The algorithm was enhanced by adding inertia weight to achieve better results. Hegazy et al. [9] used the hybrid wrapper method by applying chaotic maps to improve the performance of the salp swarm algorithm (SSA) and overcome its drawbacks. To control the exploitation/exploration rates, they used five chaotic maps. The proposed algorithm (CSSA) was applied on twenty-seven datasets and gave the best results. Although it gave the best results using twenty-seven datasets, it did not achieve good results using high-dimensional datasets. Sanaa et al. [8] proposed a wrapper method included in particle swarm optimization (PSO) and genetic algorithm (GA) to classify six human cancers types using DNA CNV dataset. The hybrid proposed method was applied to minimize the features and maximize the classification accuracy. It selected 2051 features from 16,381 features. The selected features achieved 84.6% classification accuracy. However, it suffered from many problems included in over-fitting, fitting time, relevant features, and classification accuracy. RFE is considered a wrapper method for feature selection. It suffers from time-consuming, especially when using big data. Li et al. [23] proposed fixing the support vector machine recursive feature elimination (SVMRFE) problem. They first proposed random value-based oversampling as a resampling method. The proposed variable step size (VSSRFE) to speed up the feature selection process. Another method is proposed called linear SVM (LLSVM). The two proposed methods are used together for feature selection. Jeon et al. [24] proposed a hybrid RFE method using benchmark datasets. This proposed method used SVM-RFE, random forest RFE (RF-RFE), and gradient boosting machines RFE (GBM-RFE) methods which combined the feature-importance-based RFE methods. There were two types of weighting functions used in the proposed methods. The first type sums the weight of three proposed RFE methods, and the second one reflects the classification accuracies and weights of features. Rani et al. [25] proposed a hybrid wrapper method by integrating GA and RFE algorithms. This method is compared with other feature selection methods. The proposed method improved the classification performance after canceling irrelevant features. Zvarevashe et al. [26] proposed a method to select the most relevant subset features using RFE algorithm based on RF. The proposed method was compared with a deep learning algorithm. It proved its powerful for selecting features. Senan et al. [27] proposed a method to select the relevant features using RFE algorithm for a kidney disease dataset. Four classification algorithms are applied for the classification step. The RF algorithm gave the best results. Many researchers used a hybrid method which combined filter and wrapper methods to select relevant features, but it had many limitations that filter method may cancel important features and wrapper methods take more time. High dimensional is another limitation when applying this hybrid [28]. Ansari et al. [10] used filter and wrapper approaches as a feature selection process. They proposed two different hybrid methods. F-score feature ranker and Chi-square feature ranker are applied in the first method and took the intersection between them. The intersection between these features is applied to obtain the most important features. The results of the intersection process are applied on binary particle swarm optimization (BPSO) as a feature optimization approach. In the second one, after the intersection between features, RFE approach is applied. Zhang et al. [7] proposed a method to classify six human cancer types using CNV level values. Zhang selected the features using the methods of mRMR (minimum Redundancy Maximum Relevance Feature selection) and IFS (Incremental Feature Selection). The first method selected features by ranking the importance of these features. This method selected 200 features. The second method used IFS to select the optimal set of features. IFS selected 19 features with an accuracy value 0.75. However, this proposed method gave insufficient classification accuracy. Pirgazi et al. [29] proposed a hybrid method using filter and wrapper for feature selection in high dimensional datasets. In the first stage, they applied a filter method using the Relief method to weight the features. In the second stage, they applied a wrapper method using shuffled frog leaping algorithm (SFLA) and IWSSr algorithms. Mandal et al. [30] proposed a hybrid method for feature selection using the filter and wrapper method. They applied MIFS, ReliefF, Chi-Square, and Xvariance for the filter method. The union for four filter methods is applied to obtain the most important features. The wrapper method is applied using Whale Optimization Algorithm to overcome any limitation in the filter method. Venkatesh et al. [31] proposed a hybrid method using MIFS as a filter method and RFE as a wrapper method. The hybrid method gave better results than the individual algorithms. Gakii et al. [32] proposed comparison methods using three algorithms for feature selection included in the PCA, RFE and graph-based feature selection. The results proved that the graph-based feature selection enhanced the performance of sequential minimal optimization and multilayer perceptron classifiers. In addition, researchers applied a hybrid method using the advantages of both wrapper and embedded methods to obtain the most effective features to solve the drawbacks in the previous studies. Liu et al. [28] proposed a hybrid method using GA as a global search with an embedded regularization approach as a local search. They proposed this method to solve the over-fitting problems and select relevant features. It is compared with individual algorithms, proving its effectiveness for feature selection. Aruna et al. [33] proposed a hybrid method using LR and RFE algorithms for the diabetes dataset. The RFE is based on LR as an estimator. The RF is applied for a classification step. Venkatachalam et al. [34] proposed a hybrid method that combined the ridge regression and RFE algorithms. It solved the problem of over-fitting for feature selection. The proposed method is compared with other models. RF is applied for the classification step. Due to the previous research gaps, this paper presents the proposed method PFBS-RFS-RFE with three positions to fix feature selection problems and improve the classification model over different datasets. It tries to enhance many issues included in time consuming using RFE algorithm, classification accuracy, over-fitting problems, fitting time and select the most effective features to know the chromosome that is considered the most developing human cancers in the datasets. Furthermore, we applied a resampling method to enhance the classification accuracy and improve the over-fitting problem [35]. The bootstrap is a resampling method that reduces the variance and bias between features; therefore, the over-fitting problem is minimized, and classification accuracy is maximized. We utilize PFBS as a resampling step with the hybrid RFS-RFE to reduce the over-fitting problem and improve the classification accuracy. We compared the proposed methods with RFE, RFS, and with previous work over five datasets. Four efficient supervised machine learning were used to evaluate the model performance of the proposed hybrid feature selection methods. The main contributions are summarized as follows: - We propose hybrid methods, namely, positions first bootstrap step random forest selection recursive feature elimination (PFBS-RFS-RFE) based on feature selection that combines the advantages of the wrapper and embedded methods to solve many feature selection problems, including over-fitting, time consuming, relevant features, classification accuracy and solving the problem in RFE algorithm, which suffers from time-consuming with high-dimensional datasets. The motivation behind the proposed methods is to know the genes or features associated with cancers; therefore, we can know the chromosome that is considered the most developing human cancers by taking the average number of runs and the intersection between features. The structure of the article is as follows. The “Introduction” section presents the feature selection troubles and how previous work tried to solve them. The “Results” section presents the results of hybrid algorithm and the comparison with other studies using the same datasets. The “Discussion” section summarizes and discusses the application of the hybrid algorithm. The “Conclusions” section presents the main idea and the importance of the proposed methods. The “Method” section presents the hybrid algorithm to enhance and solve these troubles.

Results

The hybrid proposed methods applied two important stages included in feature selection and model performance. They are applied using proposed datasets to select the effective cancer genes and improve the drawbacks included in over-fitting and classification accuracy. The selected features are utilized to feed more than one classifier using 10 cross-validations. The proposed classifiers are LR, support vector machine (SVM), RF and bagging (Bagg). The proposed method is compared with the individual algorithm such as RFE and RF and with the previous work. The proposed methods confirmed the results.

Performance metrics

Performance evaluation is a very important step in machine learning. Selecting the most relevant features increases the classification accuracy and decreases the classification error. We proposed a hybrid method to obtain the accurate classification value, therefore; we fixed any previous drawbacks. The proposed methods are compared with individual algorithms included in RFE and RFS using the following metrics: - The size of feature selection: - is the number of selected features. Processing time: - is the time of the fitting process in second. Performance accuracy is the percentage of the samples that are correctly evaluated by a classifier. Performance evaluation included: - Precision, F1-score, Recall, variance, Receiver operating characteristic (ROC) area, and Area under curve (AUC) [8, 12] is used to measure the classification performance by plotting the relationship between True Positive (TP) and False Positive (FP) rates. The calculation formula is applied to evaluate the model performance using ensemble and regularization classifiers with 10 cross-validation. Table 1 presents the meanings of the symbols that used in the proposed methods. The calculation formula is as follows: -
Table 1

The meanings of the symbol

SymbolMeaning
PPVPositive predictive value
TPTue positive (cancer type diagnosed correctly as a cancer type)
TNTrue negative (non-cancer type diagnosed correctly non-cancer type)
FNFalse-negative (cancer type diagnosed incorrectly as non-cancer type)
FPFalse-positive (non-cancer type diagnosed incorrectly as a cancer type
SFThe size of the selected features after applying the algorithm
TFThe total size of features
The meanings of the symbol

Parameter setting

The experiments were run in Python on a pc with windows 10, R TM CPU 1.80 GHz, and 8 GB memory. All parameter values are determined based on domain-specific knowledge or trial and error. The parameter setting for all proposed methods is given in Table 2, with a simple declaration for each parameter.
Table 2

The meaning of parameter setting

ParameterValueDefinition
NRuns20No of runs
Problem DimensionsNo of features in the dataset.
X*2916The number of data produced after the bootstrap resamples method.
M100The number of trees using in the Random Forest algorithm.
CriterionThe method which measures the quality of split, Entropy is applied.
min_samples_leaf100The minimum number of samples required to be at a leaf node.
RFE estimatorsA supervised learning algorithm. LR is applied.
C0.05Regularization parameter.
Max-iteration100Max iteration in LR classifier.
Tol0.0001Tolerance to stop criteria in LR classification.
CV10No of folds in cross-validation.
The meaning of parameter setting

Numerical results and discussion

The fundamental goal of these proposed methods is to enhance the performance of RFE to reach the optimum subset features that show the most associated features (genes) with cancers. Another goal of the proposed methods is to solve and fix the problem of over-fitting between training and testing data. The proposed method was compared with the original algorithms included in RFE and RFS. Table 3 presents the performance of the individual algorithms such as RFE and RF using the proposed classifiers LR with 10 folds stratified cross-validation before applying the feature selection proposed methods. Stratified cross-validation splits data into folds to ensure that the ratio between label classes is the same in each fold as in the full data.
Table 3

Performance of original algorithms before applying the proposed methods

Algo.Train%Test%Over-FittingDiff.%PreRecF1-ScoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
RNA gene dataset
RFE100.00099.8000.2000.9990.9980.99810,265190,00060.0001.0000.0000299.800
RFS100.00099.8000.2000.9990.9980.998374.00013.0150.2751.0000.0000299.800
DNA CNV dataset
RFE97.50087.00010.5000.7410.7060.7098190182,29540.0000.9600.02312587.000
RFS89.80384.0545.7490.8190.7640.77512345.0005.0850.9550.00019384.054
Parkinson’s disease dataset
RFE76.44175.1331.3080.3840.4800.426376.000144.7830.1770.6890.0014575.133
RFS76.48475.0001.4840.6290.5570.537224.0001.4740.1580.7060.0010875.000
BreastEW dataset
RFE95.00094.0001.0000.9480.9370.94115.0000.1420.0990.9900.0005094.000
RFS95.00093.0002.0000.9380.9280.93227.0000.0900.0080.9890.0058393.000
Dermatology erythemato-squamous diseases dataset
RFE97.54196.9970.5440.9720.9600.96317.0000.0620.0010.9970.00107396.997
RFS89.86087.7252.1350.8370.8210.81511.0000.0940.0020.9850.00281987.725
In Table 3, the RFE algorithm spent more time on feature selection with high-dimensional datasets. Therefore, it did not achieve good results for classification accuracy. The Parkinson’s disease dataset shows that the classification accuracy achieved low results before applying the proposed methods. Using the BreastEw dataset, we can notice that both RFE and RFS achieved the best results before applying the proposed methods. Still, we need to reach optimal classification accuracy with the smallest subset features. The terms Algo., over-fitting Diff., Pre, Rec, NO.F, F-Time, C-Time, and var. referred to proposed algorithms, difference percentage between training and testing dataset, Precision, Recall, Number of selected features, Fitting time of feature selection, classification fitting time and variance, respectively. Performance of original algorithms before applying the proposed methods We noticed the previous results that the single algorithms suffered from many problems in the fitting time of feature selection (F-Time), classification fitting time (C-Time), number of selected features, over-fitting, and classification accuracy. Therefore, we proposed the methods to fix any previous problems in original algorithms when run as a single algorithm and obtain the most effective cancers genes. In addition, we noticed that the single algorithms did not give the best results, so we applied a hybrid method using the wrapper and embedded procedure. In Table 4, the average results of the proposed method OFBS-RFS-RFE are presented using stratified cross-validation with proposed classifiers included in LR, SVM, RF and Bagg. The proposed methods are run 2o times to obtain the best results. The PFBS has many positions of the first bootstrap step included in OFBS, IFBS and both outer and O/IFBS. The following table presented the OFBS-RFSRFE after 20 runs.
Table 4

Average results after applying OFBS-RFS-RFE after 20 runs

Algo.TrainData %TestData %Over-fittingDiff.%PreRecF1-scoreNo.FF-Time (sec)C-Time (sec)AUCVar.ACC%
RNA gene dataset
LR classifier
OFBS-RFS100.00099.9440.0561.0000.9991.000379.1009.5370.2961.0000.00000399.944
OFBS-RFS-RFE100.00099.9810.0191.0001.0001.000142.500189.350.4451.0000.000000499.981
SVM classifier
OFBS-RFS100.00099.9450.0551.0000.9991.000379.1009.5370.2961.0000.00000399.945
OFBS-RFS-RFE100.00095.0384.9621.0000.9611.000142.500189.350.1921.0000.000000295.038
RF classifier
OFBS-RFS100.00099.8750.1250.9990.9990.999379.1009.5371.0070.9990.00001399.875
OFBS-RFS-RFE100.00099.9250.0751.0000.9990.999142.500189.350.8070.9990.00000599.925
Bagg classifier
OFBS-RFS99.96799.4390.5280.9950.9940.994379.1009.5370.9120.9990.00007499.439
OFBS-RFS-RFE99.97299.5130.4590.9960.9950.995142.500189.350.44820.9990.00006399.513
DNA CNV dataset
LR classifier
OFBS-RFS93.83890.8572.9810.9140.8750.88813515.4355.6200.9810.0013890.857
OFBS-RFS-RFE93.46291.0202.4420.9190.8750.887675.00027552.6370.9830.0002891.020
SVM classifier
OFBS-RFS94.24890.9803.2680.9230.8730.89013515.43527.3160.9810.0001290.980
OFBS-RFS-RFE94.24890.9803.2680.9230.8730.890675.000275527.3160.9810.0004890.980
RF classifier
OFBS-RFS90.96686.6134.3530.9180.8340.88813515.4352.9340.9850.0002186.613
OFBS-RFS-RFE95.68791.2654.4210.9210.8750.890675.00027552.1470.9860.0007491.265
Bagg classifier
OFBS-RFS98.62292.9715.6510.9270.9070.91613515.4359.0800.9290.0002492.971
OFBS-RFS-RFE95.52592.7622.7630.9250.9100.912675.00027554.5020.9810.0002392.762
Parkinson’s disease dataset
LR classifier
OFBS-RFS78.45377.8640.5890.7370.6020.605228.1501.7040.1280.7360.0010377.864
OFBS-RFS-RFE77.05072.7404.3100.7000.5560.579113.85011.2130.1490.7050.0009372.740
SVM classifier
OFBS-RFS76.10975.6240.4850.6230.5430.512228.1501.7040.6440.6430.0004375.624
OFBS-RFS-RFE76.00375.4990.5040.6170.5410.509113.85011.2140.4960.6380.0004175.499
RF classifier
OFBS-RFS100.00094.6345.3660.9480.9100.926228.1501.7041.4340.9860.0006494.634
OFBS-RFS-RFE100.00095.0005.0000.9450.9060.922113.85011.2141.1340.9850.0006295.000
Bagg classifier
OFBS-RFS99.71993.1636.5560.9170.9040.908228.1501.7041.7900.9660.0009293.163
OFBS-RFS-RFE99.73593.0086.7270.9160.9010.906113.85011.2140.8200.9660.0007893.008
Dermatology erythemato-squamous diseases dataset
LR classifier
OFBS-RFS96.02395.0370.9860.9260.9120.90718.6250.2160.0140.9950.00019095.037
OFBS-RFS-RFE97.24196.4810.7600.9320.9340.92616.0000.2030.5170.9970.00073096.481
SVM classifier
OFBS-RFS79.26978.3750.8940.6720.7080.66818.6250.2160.1690.9730.00259078.375
OFBS-RFS-RFE99.48498.9400.5440.9880.9860.98416.0000.2030.0640.9980.00036898.940
RF classifier
OFBS-RFS100.000100.0000.01.0001.0001.00018.6250.2160.5621.0000.0100.000
OFBS-RFS-RFE100.000100.0000.01.0001.0001.00016.0000.2030.5001.0000.0100.000
Bagg classifier
OFBS-RFS99.97099.7300.2400.9970.9950.99618.6250.2160.0770.9970.00005799.730
OFBS-RFS-RFE99.96699.7960.1700.9980.9950.99616.0000.2030.2420.9980.00005599.796
BreastEW dataset
LR classifier
OFBS-RFS94.77694.2180.5580.9220.9330.93727.1000.2980.0120.9880.00000194.218
OFBS-RFS-RFE95.06994.5870.4820.9470.9390.94113.3160.1300.0910.9890.00081094.587
SVM classifier
OFBS-RFS92.16791.9020.2660.9340.8970.90927.1000.2980.0760.9780.00098691.902
OFBS-RFS-RFE93.30193.1140.1870.9130.9140.92713.3160.1300.0700.9820.00111593.114
RF classifier
OFBS-RFS100.00097.8642.1360.9840.9810.97927.1000.2980.5060.9970.00027097.864
OFBS-RFS-RFE100.00098.0002.0000.9830.9790.98213.3160.1300.4280.9970.00030098.000
Bagg classifier
OFBS-RFS99.88997.5482.3410.9770.9720.97427.1000.2980.1010.9490.00028097.548
OFBS-RFS-RFE99.88897.7242.1640.9780.9740.97613.3160.1300.1040.9480.00043097.724
Average results after applying OFBS-RFS-RFE after 20 runs For more illustration, in Table 4, the proposed method using OFBS-RFS-RFE enhanced the performance of RFE algorithm. The over-fitting percentage was reduced from the RNA gene dataset after applying previous classifiers, so the accuracy difference between training and testing dataset was reduced compared with the single algorithm. The LR classifier achieved the best classification accuracy result with 99.981%, while the SVM classifier gave the best variance result with 0.0000002. From DNA CNV dataset the difference between training and testing became 2.442 and 2.763% using LR and Bagg classifiers, respectively, and the accuracy results were increased with 91.020 and 92.762%, respectively using the same classifiers. In addition, the variance between features was reduced using the same classifiers to become 0.00028 and 0.00023, respectively. The OFBS-RFS-RFE enhances the over-fitting and variance and minimizes features’ fitting time and number. From the Parkinson’s disease dataset, the classification accuracy, precision, recall, f1-score, AUC and variance are enhanced to 95.000%, 0.945, 0.906, 0.922, 0.985 and 0.00062, respectively using RF classifier. It suggested that only 113.85 features were good enough for the classification step with 1.134 s as a computational time. In addition, for dermatology erythemato-squamous diseases dataset, RF classifier gave the best classification accuracy, precision, recall, f1-score, AUC and variance to become 100.000%, 1.000, 1.000, 1.000, 1.000 and 0.0. On the other hand, the OFBS-RFS-RFE using the BreastEw dataset achieved the best computational time after applying LR and SVM in contrast with the other optimizer. We can notice that the RF gave the best over-fitting percentage, precision, recall, f1-score, AUC, variance, and accuracy to become 2.00%, 0.983, .979, 0.982, 0.997, 0.000302 and 98%, respectively. In Table 5, the average results of the proposed method PFBS-RFS-RFE using IFBS after 20 runs are presented. The different positions of bootstrap lead to different results. The IFBS used the bootstrap step inside the RFS algorithm for feature selection.
Table 5

Average results after applying IFBS-RFS-RFE after 20 runs

Algo.TrainData %TestData %Over-fittingDiff.%PreRecF1-scoreNo.FF-Time (sec)C-Time (sec)AUCVar.ACC%
RNA gene dataset
LR Classifier
IFBS-RFS100.00099.9250.0750.9990.9990.999239.0005.4210.1931.0000.00000499.925
IFBS-RFS-RFE100.00099.9750.0250.9990.9990.999125.25015.2010.3571.0000.000000499.975
SVM Classifier
IFBS-RFS99.99999.9060.0930.9990.9980.999239.0005.4210.2251.0000.00000599.906
IFBS-RFS-RFE100.00099.9880.0120.9990.9990.999125.25015.2010.1531.0000.000000299.988
RF Classifier
IFBS-RFS100.00099.6940.3060.9980.9970.997239.0005.4210.9011.0000.00003099.694
IFBS-RFS-RFE100.00099.8070.1930.9990.9980.998125.25015.2010.7370.9990.00001999.807
Bagg Classifier
IFBS-RFS99.94799.0020.9450.9910.9890.989239.0005.4210.6350.9990.00007599.002
IFBS-RFS-RFE99.95599.0270.9280.9920.9890.990125.25015.2010.3270.9990.00007599.027
DNA CNV dataset
LR Classifier
IFBS-RFS88.52582.8895.6360.8120.7520.763966.0006.2503.8760.9420.00037082.889
IFBS-RFS-RFE88.00084.0004.0000.8310.7640.804482.00015452.1490.9590.00042084.000
SVM Classifier
IFBS-RFS88.34181.6376.7040.8150.7290.745966.0006.05029.0780.9550.00058081.637
IFBS-RFS-RFE89.66882.2687.4000.8270.7380.753482.000154515.7210.9600.00088082.268
RF Classifier
IFBS-RFS89.66080.0899.5710.7680.70250.709966.0006.0503.1500.9380.00041080.089
IFBS-RFS-RFE89.93580.1389.7970.7700.7030.719482.00015452.4870.9410.00047080.138
Bagg Classifier
IFBS-RFS97.69778.31619.3810.7330.7220.702966.0006.0507.8500.8670.00045078.316
IFBS-RFS-RFE89.03578.30910.7260.7300.6950.702482.00015454.0230.9100.00048078.309
Parkinson’s disease dataset
LR Classifier
IFBS-RFS78.12276.7181.4040.6640.5880.582154.2631.0710.0820.7320.00221076.718
IFBS-RFS-RFE76.69373.9982.6950.6670.5880.58180.0504.5940.1640.7220.00199073.998
SVM Classifier
IFBS-RFS75.68472.2483.4360.4680.4980.448154.2631.0710.4820.6210.00077072.248
IFBS-RFS-RFE75.69772.2283.4690.4640.4970.44880.0504.5940.4070.6190.00076072.228
RF Classifier
IFBS-RFS99.99983.91216.0870.8110.7380.760154.2631.0711.2300.8660.00348083.912
IFBS-RFS-RFE100.00081.48518.5150.7730.7000.71980.0504.5940.9640.8340.00350081.485
Bagg Classifier
IFBS-RFS99.59080.81017.780.7540.7310.737154.2631.0711.2250.8260.00346080.810
IFBS-RFS-RFE99.58079.19120.3890.7290.7070.71380.0504.5940.5460.8040.00344079.191
Dermatology erythemato-squamous diseases dataset
LR classifier
IFBS-RFS91.53191.0000.5310.7710.7960.77713.0000.5150.0020.9880.00088191.000
IFBS-RFS-RFE92.19891.8010.3970.7730.7990.78012.0000.0160.0200.9880.00097591.801
SVM classifier
IFBS-RFS94.87093.9790.8910.8880.8780.87513.0000.5150.0230.9880.00128593.979
IFBS-RFS-RFE94.86993.9790.8900.8880.8780.87512.0000.0160.0750.9890.00128593.979
RF classifier
IFBS-RFS97.02593.1833.4820.9000.8920.88913.0000.5150.1420.9840.00149393.183
IFBS-RFS-RFE97.00093.5003.5000.9000.8920.88912.0000.0160.1400.9800.00149093.500
Bagg classifier
IFBS-RFS96.90392.1024.8010.8950.8840.88113.0000.5150.0160.9890.00329792.102
IFBS-RFS-RFE97.17781.19415.9830.7890.7640.76012.0000.0160.0140.9700.08125181.194
BreastEW dataset
LR Classifier
IFBS-RFS94.39493.6780.4610.9380.9380.93823.1000.4100.0120.9880.00069093.678
IFBS-RFS-RFE94.85594.4030.4520.9460.9460.94611.9000.1030.0910.9920.00052094.403
SVM Classifier
IFBS-RFS92.01091.5630.4470.9290.9290.92923.1000.4100.0690.9760.00101091.563
IFBS-RFS-RFE93.88893.5030.3850.9440.9440.94411.9000.1030.0590.9830.00055093.503
RF Classifier
IFBS-RFS100.00096.4113.5890.9650.9650.96523.1000.4100.4520.9910.00098096.411
IFBS-RFS-RFE100.00095.2774.7230.9520.9520.95211.9000.1030.4330.9890.00093095.277
Bagg Classifier
IFBS-RFS99.62595.3024.3230.9540.9540.95423.1000.4100.0990.9850.00092095.302
IFBS-RFS-RFE99.61094.4165.1940.9440.9440.94411.9000.1030.0850.9810.00117094.416
Average results after applying IFBS-RFS-RFE after 20 runs For more illustration, in Table 5, the SVM classifier achieved the best classification accuracy and variance results with 99.988% and 0.0000002, respectively. Although the inner position gave the best results using RNA gene dataset, but it did not give the best result for other datasets. In Table 6, the average results of PFBS-RFS-RFE using O/IFBS after 20 runs are presented. In this position the FBS is placed before selecting the features and during the feature selecting algorithm.
Table 6

Average results after applying O/IFBS-RFS-RFE after 20 runs

Algo.TrainData %TestData %Over-fittingDiff.%PreRecF1-scoreNO.FF-Time (sec)C-Time (sec)AUCVar.ACC%
RNA gene dataset
LR Classifier
O/IFBS-RFS-100.00099.9750.0250.9990.9990.999238.8004.2200.1761.0000.000000699.975
O/IFBS-RFS-RFE100.00099.9940.0060.9990.9990.999119.20013.7260.3071.0000.000000499.994
SVM Classifier
O/IFBS-RFS-100.00099.9500.050.9990.9990.999238.8004.2200.1971.0000.000002599.950
O/IFBS-RFS-RFE100.00099.9810.0190.9990.9990.999119.20013.7260.1251.0000.000000499.981
RF Classifier
O/IFBS-RFS-100.00099.8880.1120.9990.9990.999238.8004.2200.7551.0000.000007699.888
O/IFBS-RFS-RFE100.00099.9130.0870.9990.9990.999119.20013.7260.5960.9990.000005499.913
Bagg Classifier
O/IFBS-RFS-99.97499.3570.6170.9940.9920.993238.8004.2200.5130.9990.000082899.357
O/IFBS-RFS-RFE99.97299.3630.6090.9940.9930.993119.20013.7260.2660.9990.00008399.363
DNA CNV dataset
LR Classifier
O/IFBS-RFS-92.58189.8182.7630.9040.8610.877973.0003.6503.8500.9750.0003189.818
O/IFBS-RFS-RFE91.87889.6012.2770.9060.8570.885485.00014601.9500.9360.0003589.601
SVM Classifier
O/IFBS-RFS-93.36190.2533.1080.9170.8600.878973.0003.65022.000.9800.0006590.253
O/IFBS-RFS-RFE94.24190.9793.2620.9250.8730.891485.000146011.7000.9850.0002890.979
RF Classifier
O/IFBS-RFS-95.52790.7644.7630.9140.8680.882973.0003.6502.6500.9840.0002790.764
O/IFBS-RFS-RFE95.68190.9544.7270.9190.8720.890485.00014601.7500.9410.0002790.954
Bagg Classifier
O/IFBS-RFS-97.95892.7125.2460.9260.9060.913973.0003.6506.5500.9800.0002792.712
O/IFBS-RFS-RFE95.31892.8342.4840.9270.9060.913485.00014603.1500.9800.0002792.834
Parkinson’s disease dataset
LR classifier
O/IFBS-RFS-79.05078.4820.5680.7420.6190.626155.501.0580.0930.7640.0012378.482
O/IFBS-RFS-RFE77.74477.4270.3170.7120.5970.59877.5505.5510.1180.7310.0009277.427
SVM classifier
O/IFBS-RFS-76.00975.4420.5670.6120.5390.508155.5001.0580.5110.6370.0004175.442
O/IFBS-RFS-RFE77.50076.6720.8280.6530.5660.54277.5505.5510.4200.6690.0005176.672
RF classifier
O/IFBS-RFS-100.00094.4945.5060.9450.9090.924155.5001.0581.1220.9850.0006494.494
O/IFBS-RFS-RFE100.00094.0825.9180.9430.9010.91777.5505.5510.9110.9830.0007094.082
Bagg Classifier
O/IFBS-RFS-99.72093.1966.5240.9160.9060.909155.5001.0581.0910.9650.0009393.196
O/IFBS-RFS-RFE99.71992.9176.8020.9140.9000.90577.5505.5500.5110.9660.0008492.917
Dermatology erythemato-squamous diseases dataset
LR classifier
O/IFBS-RFS-96.69196.4410.2500.6490.6240.63011.0000.1670.0250.9980.00084896.441
O/IFBS-RFS-RFE92.53292.3500.2120.8010.7510.76610.0000.5000.1280.9990.00079092.350
SVM classifier
O/IFBS-RFS-95.08295.0000.0820.6380.6080.61311.0000.1670.0250.9770.00063295.000
O/IFBS-RFS-RFE98.36198.3560.0050.8920.9000.89510.0000.5000.0470.9990.00104098.356
RF classifier
O/IFBS-RFS-100.000100.0000.01.0001.0001.00011.0000.1670.5621.0000.0100.00
O/IFBS-RFS-RFE100.000100.0000.01.0001.0001.00010.0000.5000.5001.0000.0100.00
Bagg classifier
O/IFBS-RFS-100.000100.0000.01.0001.0001.00011.0000.1670.5200.9990.0100.000
O/IFBS-RFS-RFE100.000100.0000.01.0001.0001.00010.0000.5000.5000.9910.0100.000
BreastEw dataset
LR classifier
O/IFBS-RFS-94.64794.1480.4990.9440.9320.93622.9000.3990.0100.9880.0009594.148
O/IFBS-RFS-RFE95.30594.8420.4630.9490.9420.94411.3000.10330.0910.9920.0008694.842
SVM classifier
O/IFBS-RFS-92.11091.8890.2210.9340.8970.90922.9000.3990.0670.9780.0009891.889
O/IFBS-RFS-RFE93.51593.4000.1150.9430.9180.92711.3000.1030.0580.9830.0009493.400
RF classifier
O/IFBS-RFS-99.56397.5002.0630.9810.9760.97722.9000.39940.4110.9960.0003197.500
O/IFBS-RFS-RFE100.00098.0002.0000.9790.9770.97811.3000.1030.4040.9970.0003198.000
Bagg Classifier
O/IFBS-RFS-99.81997.6182.2010.9770.9730.97422.9000.3990.0890.9940.0003897.618
O/IFBS-RFS-RFE99.80397.5052.2980.9760.9720.97311.3000.1030.0650.9930.0003497.505
Average results after applying O/IFBS-RFS-RFE after 20 runs For more illustration, in Table 6, the accuracy and variance results are increased from the RNA gene dataset to 99.994% and 0.0000004, respectively, using LR classifier. Bagg classifier gave the best accuracy and variance results using DNA CNV dataset to become 92.834% and 0.00027, respectively. In addition, RF classifier gave the best accuracy and variance using dermatology erythemato-squamous diseases dataset to become 100% and 0.0, respectively. At the same time, the O/IFBS-RFS-RFE did not give good results for other datasets. In Fig. 1, the classification accuracy using the proposed methods is illustrated using all datasets. We can notice that RNA gene dataset achieved the best results with O/IFBS using LR classifier, while the DNA CNV dataset achieved the best results with O/IFBS using Bagg classifier. In addition, the Parkinson’s disease dataset achieved the best results with OFBS using LR classifier. The dermatology erythemato-squamous diseases and breast datasets achieved the best result using RF classifier with both OFBS and O/IFBS.
Fig. 1

Comparison between proposed methods on all datasets using classification accuracy

Comparison between proposed methods on all datasets using classification accuracy In Fig. 2, the number of selected features using the proposed methods is showed on all datasets. From this figure, we can note that the best algorithm that gave the smallest number of features was O/IFBS with RNA gene, Parkinson’s disease, dermatology erythemato-squamous diseases and breast datasets. On the other hand, the IFBS algorithm achieved the smallest number of features using DNA CNV dataset.
Fig. 2

Number of the selected features using all datasets

Number of the selected features using all datasets In Fig. 3, the variance of the proposed methods is illustrated. We can notice that the RNA gene dataset using LR and SVM classifiers gave the best variance results with all position of bootstrap. On the other hand, the DNA CNV dataset achieved the best variance result using the Bagg classifier with OFBS. In addition, the Parkinson’s disease dataset achieved the best variance result using SVM classifier with OFBS. OFBS and O/IFBS achieved the best variance result using RF and Bagg classifiers for dermatology erythemato-squamous diseases dataset. For Breast dataset, the RF classifier gave the best results with OFBS.
Fig. 3

Variance of the proposed methods using all bootstrap positions

Variance of the proposed methods using all bootstrap positions

Comparison with other studies

The results before and after PFBS-RFS-RFE are compared. In addition, these results are compared with the previous work using the same datasets. Table 7 showed the comparison before and after applying PFBS-RFS-RFE after 20 runs. The proposed methods improved the results and solved feature selection problems in high dimensions. Table 8 presented the results of the previous studies using the same dataset.
Table 7

The comparison between results before and after PFBS-RFS-RFE

DatasetsBefore PFBS-RFS-RFEAfter PFBS-RFS-RFE
ACC%OverfittingDiff%No.FC-timeVar.ACC%OverfittingDiff%NO. FC-timeVar.
RNA gene99.8000.20020,53116.5470.00001599.9940.006119.2000.307 s0.0000004
DNA CNV85.00012.60016,381170 s0.00058092.7622.763675.0000.981 s0.000230

Parkinson’s

disease

93.6770.8007532.000 s0.00063495.0005.000113.8501.134 s0.000620
Dermatology diseases97.8070.493340.003 s0.000810100.0000.010.0000.500 s0.0
BreastEW75.9282.072300.500 s0.00209298.0002.00013.3000.428 s0.000300
Table 8

Achievement of accuracy in different research for cancer classification using the same datasets [7–9, 12, 36, 37]

ReferenceDatasetFS ApproachNo of selected featuresVar.AUCACC%
García-Díaz et al. [36]RNA geneGGA490.00030398.810
Zhang et al. [7]DNA CNVmRMR & IFS190.0005800.97375.000
Sanaa et al. [8]PSO & GA20500.96184.600
Sanaa et al. [12]IG16,3810.96585.900
Sakar et al. [37]Parkinson’s disease,mRMR5085.000
Hegazy et al. [9]BreastEwCSSA5.20097.080
The comparison between results before and after PFBS-RFS-RFE Parkinson’s disease Achievement of accuracy in different research for cancer classification using the same datasets [7–9, 12, 36, 37] The proposed methods were compared with filter ones methods using MIFS, IGF and mRMR. Tables 9, 10 and 11 showed the results of MIFS, IGF and mRMR for all datasets. For MIFS method, the results proved that the LR classifier gave the best accuracy for RNA gene and DNA CNV datasets, while the RF classifier gave the best accuracy for Parkinson’s disease and BreastEW datasets. In addition, SVM classifier gave the best results for dermatology erythemato-squamous diseases dataset. For IGF method, LR classifier gave the best accuracy for RNA gene dataset. SVM classifier gave the best results for DNA CNV and dermatology erythemato-squamous diseases datasets, while the RF classifier gave the best accuracy for Parkinson’s disease and BreastEW datasets. Furthermore, mRMR achieved the best results for RNA gene dataset using LR classifer, while SVM classifier gave the best results for DNA CNV dataset. In addition, RF classifer achieved the best results for dermatology erythemato-squamous diseases, Parkinson’s disease and BreastEW datasets. Although filter ones methods improved the results, they did not give better results than the PFBS-RFS-RFE.
Table 9

The proposed methods compared with the MIFS method

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
LR classifier
 RNA gene100.00099.8750.1250.9990.9980.98810,000192.5522.8961.0000.00001699.875
 DNA CNV96.59784.97811.6190.8170.7820.7889000173.95525.1950.9540.00041684.978

 Parkinson’s

disease

77.05875.5251.5330.6200.5560.5383000.3770.0370.6820.00100175.525

 Dermatology

diseases

97.84596.9890.8560.9710.9650.966250.2030.0030.9970.00058596.989
 BreastEW94.39693.6780.7180.9380.9280.932200.0670.0020.9880.00069493.678
SVM classifier
 RNA gene100.00099.7500.2500.9980.9970.99710,000192.5522.5341.0000.00002899.750
 DNA CNV91.60684.1227.4840.8600.7560.7759000173.95575.3940.9490.00066884.122

 Parkinson’s

disease

75.67672.2283.4480.4720.4980.4483000.2030.1380.6270.00081472.228

 Dermatology

diseases

98.42197.5230.8980.9760.9670.969250.2030.0280.9980.00092497.523
 BreastEW92.01391.5630.4500.9290.8950.906200.0670.0170.9760.00101491.563
RF classifier
 RNA gene100.00099.6270.3730.9980.9960.99710,000192.5521.2521.0000.00003699.627
 DNA CNV92.96280.62312.3390.7710.7190.7189000173.9553.5280.9420.00061480.623

 Parkinson’s

disease

100.00084.78215.2180.8270.7480.7733000.3770.3760.8760.00230384.782

 Dermatology

diseases

100.00096.4563.5440.9720.9500.955250.2030.1480.9990.00147396.456
 BreastEW100.00096.1403.8600.9630.9560.958200.0670.1100.9900.00094496.140
Bagg classifier
 RNA gene99.84798.6281.2190.9890.9850.98710,000192.5527.3220.9990.00003698.628
 DNA CNV98.96078.80620.1540.7330.6990.7079000173.95525.3090.9120.00061378.806

 Parkinson’s

disease

99.57479.23920.3350.7290.7290.7273000.3770.6730.7940.00200279.239

 Dermatology

diseases

99.69695.1054.5910.9550.9400.939250.2030.0210.9950.00147395.105
 BreastEW99.49295.4354.0570.9570.9470.950200.0670.0220.986.00090195.435
Table 10

The proposed methods compared with the IGF method

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
LR classifier
 RNA gene100.00099.8750.1250.9990.9990.99835761.1822.1211.0000.00001699.875
 DNA CNV93.11581.31011.8050.7820.7060.70533155.6510.5950.9510.00057681.310

 Parkinson’s

disease

77.82276.9840.8380.6800.5760.5663960.0930.0570.7100.00144576.984

 Dermatology

diseases

97.78497.2600.5240.9730.9680.969250.0320.00090.9980.00067797.260
 BreastEW94.17093.6740.4960.9420.9280.931220.0640.0010.9890.00217693.674
SVM classifier
 RNA gene100.00099.7500.2500.9990.9970.99835761.1822.2721.0000.00002899.750
 DNA CNV94.27385.8728.4010.8730.7800.80133155.6513.1420.9690.00048685.872

 Parkinson’s

disease

75.66672.3793.2870.4340.4970.4433960.0930.2040.6400.00437872.379

 Dermatology

diseases

98.26997.5300.7390.9750.9720.972250.0320.0140.9990.00075297.530
 BreastEW92.00791.5690.4380.9300.8950.904220.0640.0210.9790.00450291.569
RF classifier
 RNA gene100.00099.5020.4980.9970.9940.99635761.1820.8260.9990.00041099.502
 DNA CNV92.55881.13911.4190.7730.7140.72133155.6511.5840.9440.00053181.139

 Parkinson’s

disease

100.00083.73316.2670.7930.7260.7343960.0930.7190.8600.00905783.733

 Dermatology

diseases

100.00096.9973.0030.9730.9620.964250.0320.0980.9990.00056796.997
 BreastEW99.98296.1403.8420.9610.9590.958220.0640.1180.9860.00228096.140
Bagg classifier
 RNA gene99.94099.1260.8140.9960.9900.99235761.1823.0400.9990.00026099.126
 DNA CNV98.70179.04519.6560.7410.6980.70833155.6511.0950.9110.00055379.045

 Parkinson’s

disease

99.65382.29717.3560.7900.7520.7543960.0931.0790.8300.01127082.297

 Dermatology

diseases

99.69695.3754.3210.9580.9500.949250.0320.0120.9930.00146695.375
 BreastEW99.63695.2534.3830.9570.9460.948220.0640.0290.9870.00178895.253
Table 11

The proposed methods compared with the mRMR method

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
LR classifier
 RNA gene100.00099.7500.2500.9990.9970.9986501200.0110.2511.0000.00002899.750
 DNA CNV91.81979.69912.1200.7460.6880.6895052296.4090.6860.9400.00052979.699

 Parkinson’s

disease

74.61773.0111.6060.5000.5150.47914561.0050.0170.6590.00250273.011

 Dermatology

diseases

95.50895.0750.4330.9500.9080.919153.9960.0020.9950.00079695.075
 BreastEW93.08592.6200.4650.9360.9100.917194.1810.0020.9810.00335892.620
SVM classifier
 RNA gene100.00099.7480.2520.9990.9970.9986501200.0110.3821.0000.00002899.748
 DNA CNV92.48683.8488.6380.8450.7470.7665052296.4093.6090.9610.00055983.848

 Parkinson’s

disease

75.66172.3793.2820.4350.4970.44314561.0050.1420.6390.00437872.379

 Dermatology

diseases

52.79352.1850.6080.3250.4630.363153.9960.0530.9480.00230252.185
 BreastEW89.04988.9380.1110.9150.8600.870194.1810.0440.9450.00540588.938
RF classifier
 RNA gene100.00099.6270.3730.9980.9960.9976501200.0110.3981.0000.00003699.627
 DNA CNV90.95979.93511.0240.7270.6900.6895052296.4090.5340.9420.00124979.935

 Parkinson’s

disease

100.00081.91818.0820.7670.7030.70914561.0050.4670.8330.01113881.918

 Dermatology

diseases

100.00097.5532.4470.9810.9680.972153.9960.10000.9990.00056197.553
 BreastEW100.0095.6044.3960.9600.9500.952194.1810.1830.9910.00269395.604
Bagg classifier
 RNA gene99.96198.7461.2150.9910.9840.9866501200.0111.6800.9990.00043098.746
 DNA CNV97.81777.46820.3490.7310.6820.6875052296.4091.1350.9100.00093777.468

 Parkinson’s

disease

99.49879.36920.1290.7250.7120.70514561.0050.5610.7990.01062079.369

 Dermatology

diseases

99.54594.0175.5280.9510.9360.937153.9960.0090.9820.00241294.017
 BreastEW99.64293.6845.9580.9430.9280.931194.1810.0420.9800.00236993.684
The proposed methods compared with the MIFS method Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases The proposed methods compared with the IGF method Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases The proposed methods compared with the mRMR method Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases The proposed methods were compared with many different filters methods as cited in the introduction section included in CfsSubsetEval, ReliefAttributeEval, OneRAttributeEval, ConsistencySubsetEval and PCA methods. Tables 12, 13, 14, 15 and 16 showed the results of these different filters methods. The ReliefAttributeEval method achieved the best results for RNA gene and BreastEW datasets, while ConsistencySubsetEval method gave the best results for DNA CNV dataset. In addition, CfsSubsetEval method gave the best results for Parkinson’s disease dataset, while the PCA method gave the best results for dermatology erythemato-squamous diseases dataset. Although filter methods improved the results, they did not give better results than the PFBS-RFS-RFE.
Table 12

The proposed methods compared with the CfsSubsetEval method

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
J48 classifier
 RNA gene99.15497.1252.0290.9730.9740.97240833.8601.0300.9870.00062797.125
 DNA CNV64.03463.6820.3520.5390.5370.533412.9500.0090.8150.00099763.682

 Parkinson’s

disease

87.68378.4219.2620.7300.6730.6911190.1800.0160.7130.00599478.421

 Dermatology

diseases

74.43873.7760.6620.5330.5980.54790.1500.00020.8840.00296773.776
 BreastEW94.08390.8653.2180.9050.9020.90230.0500.00020.9500.00059690.865
Naïve base classifier
 RNA gene99.86198.5031.3580.9860.9790.98140833.8600.0480.9870.00013198.503
 DNA CNV64.86464.1960.6680.6010.6200.598412.9500.0010.8840.00121964.196

 Parkinson’s

disease

80.90479.9150.9890.7530.6810.6951190.1800.0010.7620.00541079.915

 Dermatology

diseases

62.36061.4410.9190.5730.6280.56190.1500.00070.9350.00555461.441
 BreastEW93.55692.9730.5830.9320.9210.92430.0500.00080.9790.00088692.973
K-nearest neighbors(KNN) classifier
 RNA gene99.79299.6270.1650.9980.9960.99740833.8600.0071.0000.00003699.627
 DNA CNV78.98672.5996.3870.6910.6260.630412.9500.000090.8620.00028972.599

 Parkinson’s

disease

89.02180.2828.7390.7510.6940.7081190.1800.00080.7430.00200780.282

 Dermatology

diseases

85.24779.7675.480.8230.7830.78090.1500.00040.9430.00668179.767
 BreastEW87.50381.1726.3310.8150.7930.79730.0500.00080.8570.00563981.172
Table 13

The proposed methods compared with the ReliefAttributeEval method

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
J48 classifier
 RNA gene99.15497.6251.5290.9800.9790.97910,0001.9502.8870.9920.00043297.625
 DNA CNV65.32263.9221.40.5460.5480.54080001.5500.9940.8160.00122563.922

 Parkinson’s

disease

83.85874.7599.0990.6350.5990.5933000.9500.0600.7100.01370274.759

 Dermatology

diseases

79.35678.7010.6550.5700.6470.591200.5000.00070.9240.00071978.701
 BreastEW96.48593.4992.9860.4670.4480.455160.3500.0030.9590.00299493.499
Naïve base classifier
 RNA gene99.85596.8812.9740.9620.9550.95510,0001.9500.1150.9730.00097696.881
 DNA CNV65.67864.6790.9990.6290.6680.62080001.5500.2970.8490.00135964.679

 Parkinson’s

disease

81.55079.3382.2120.7420.7290.7223000.9500.0030.7750.01472779.338

 Dermatology

diseases

87.40385.5701.8330.8080.8520.806200.5000.00070.9790.00245485.570
 BreastEW94.68194.4150.2660.4810.4440.460160.3500.00050.9890.00216794.415
K-nearest neighbors(KNN) classifier
 RNA gene99.87599.8730.0020.9990.9990.99910,0001.9500.0161.0000.00003199.873
 DNA CNV80.73574.2466.4890.7080.6550.65480001.5500.0130.8740.00037874.246

 Parkinson’s

disease

85.80171.70814.0930.5950.5860.5793000.9500.00070.6080.00720371.708

 Dermatology

diseases

92.28986.0596.230.8720.8480.840200.5000.0010.9610.00470486.059
 BreastEW94.00591.7392.2660.4620.4270.442160.3500.00020.9640.00055391.739
Table 14

The proposed methods compared with the OneRAttributeEval method

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
J48 classifier
 RNA gene99.15497.6251.5290.9780.9800.97870002.0211.7790.9910.00050297.625
 DNA CNV65.36764.0591.3080.5430.5570.54250001.0200.6440.8130.00126164.059

 Parkinson’s

disease

85.62778.7446.8830.7410.6700.6732000.1500.0380.7640.01279678.744

 Dermatology

diseases

79.44877.0802.3680.5630.6330.578150.1200.00060.9090.00151777.080
 BreastEW96.64192.0954.5460.9190.9120.914170.1050.0020.9620.00138092.095
Naïve base classifier
 RNA gene99.91793.6316.2860.9400.9120.91670002.0210.0760.9490.00092093.631
 DNA CNV67.36866.5280.8400.6170.6250.61050001.0200.1950.8500.00103566.528

 Parkinson’s

disease

74.26273.7840.4780.4000.4340.4152000.1500.0010.7150.00947373.784

 Dermatology

diseases

86.00483.5892.4150.8120.7910.758150.1200.00090.9580.00187383.589
 BreastEW93.45193.1960.2500.4690.4600.462170.1050.0010.9860.00308993.196
K-nearest neighbors(KNN) classifier
 RNA gene99.72399.6270.0960.9980.9960.99770002.0210.0100.9990.00003699.627
 DNA CNV78.15571.7776.3780.6510.6030.59750001.0200.0090.8420.00018971.777

 Parkinson’s

disease

80.68872.4818.2070.3880.4430.4142000.1500.00010.6230.00252972.481

 Dermatology

diseases

87.43184.7002.7310.7870.7870.780150.1200.0020.9630.00337084.700
 BreastEW94.72892.9761.7520.9300.9210.924170.1050.0010.9610.00095392.976
Table 15

The proposed methods compared with the ConsistencySubsetEval method

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
J48 classifier
 RNA gene93.10691.1361.970.8800.8690.86931.8500.0010.9630.00064191.136
 DNA CNV64.84263.9210.9210.5500.5490.543421.6000.0120.8160.00117263.921

 Parkinson’s

disease

86.34680.2796.0670.7650.6880.707111.1000.0030.7380.00240780.279

 Dermatology

diseases

87.91887.7400.1780.7510.7550.742120.1020.00060.9460.00240787.740
 BreastEW96.99394.3802.6130.9500.9330.93980.0900.00080.9710.00087394.380
Naïve base classifier
 RNA gene97.54597.3800.1650.9720.9700.97031.8500.0010.9940.00018897.380
 DNA CNV73.01571.6061.4090.6820.6980.680421.6000.0060.9230.00127671.606

 Parkinson’s

disease

76.94075.5161.4240.6660.5990.605111.1000.00050.7290.00305175.516

 Dermatology

diseases

90.26589.8390.4260.8780.9010.867120.1020.00070.9950.00274089.839
 BreastEW94.63094.2010.4290.9430.9340.93780.0900.0020.9880.00068694.201
K-nearest neighbors(KNN) classifier
 RNA gene97.51797.1310.3860.9630.9640.96231.8500.0010.9930.00031197.131
 DNA CNV82.73577.0595.6760.7500.6800.685421.6000.00010.8880.00046577.059

 Parkinson’s

disease

79.10069.6989.4020.5570.5240.517111.1000.0020.5830.00296269.698

 Dermatology

diseases

97.63295.9161.7160.9620.9430.947120.1020.00080.9940.00133995.916
 BreastEW95.70493.4962.2080.9370.9250.92980.0900.00090.9650.00062193.496
Table 16

The proposed methods compared with the PCA method

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
J48 classifier
 RNA gene96.82394.8841.9390.9430.9540.9427001.0270.1740.9850.00074094.884
 DNA CNV62.33460.6651.6690.5260.5160.505280049.7951.5490.8270.00172460.665

 Parkinson’s

disease

83.74573.8169.9290.6250.6070.6022500.0790.0280.6450.00241873.816

 Dermatology

diseases

81.14880.8780.2700.5760.6630.604180.0160.0030.9250.00014280.878
 BreastEW95.72393.4932.2300.9420.9250.929200.0160.0020.9590.00083293.493
Naïve base classifier
 RNA gene87.07279.4037.6690.7940.8060.7947001.0270.0050.9540.00211679.403
 DNA CNV29.33627.6411.6950.2550.3520.233280049.7950.0770.6800.00031927.641

 Parkinson’s

disease

74.47173.8210.650.6040.5580.5452500.0790.0090.6980.00282673.821

 Dermatology

diseases

98.36196.1792.1820.9610.9520.953180.0160.00010.9970.00102596.179
 BreastEW90.04189.8030.2380.8960.8860.889200.0163.0570.9620.00170789.803
K-nearest neighbors(KNN) classifier
 RNA gene99.75099.7400.0100.9990.9970.9987001.0270.0020.9990.00005999.740
 DNA CNV81.20074.3486.8520.6630.6390.634280049.7950.0100.8670.00027374.348

 Parkinson’s

disease

81.15872.6128.5460.6120.5710.5752500.0790.0010.6270.00230872.612

 Dermatology

diseases

92.38087.1625.2180.8620.8600.844180.0500.000020.9690.00198487.162
 BreastEW94.72892.9761.7520.9300.9210.924200.0300.00030.9610.00095392.976
The proposed methods compared with the CfsSubsetEval method Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases The proposed methods compared with the ReliefAttributeEval method Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases The proposed methods compared with the OneRAttributeEval method Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases The proposed methods compared with the ConsistencySubsetEval method Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases The proposed methods compared with the PCA method Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Table 17 showed the comparison between the proposed methods, MIFS, CBF and FCBF methods as cited in the introduction section. The CBF gave the best results for RNA gene dataset, while FCBF method gave the best results for DNA CNV, Parkinson’s disease and BreastEW datasets. In addition, MIFS gave the best results for dermatology erythemato-squamous diseases dataset. These methods did not give the best results when compared with the PFBS-RFS-RFE.
Table 17

The proposed methods compared with the MIFS, CBF and FCBF methods

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
Mutual Information
KNN classifier
 RNA gene99.73699.6270.1090.9980.9960.99710,000258.9020.0081.0000.00003699.627
 DNA CNV82.68676.0976.5890.7450.6630.6679000180.3140.0110.8540.00036876.097

 Parkinson’s

disease

80.87972.4798.4000.6100.5680.5723002.1210.00010.6240.00234472.479

 Dermatology

diseases

97.96697.2670.6990.9750.9690.969250.3510.0020.9630.00083997.267
 BreastEW94.43592.6281.8070.9270.9170.920200.0830.000020.9580.00141992.628
Correlation Based Feature
KNN classifier
 RNA gene99.86799.7480.1190.9990.9970.9989002.6000.0031.0000.00009299.748
 DNA CNV52.83149.0733.7580.4470.4020.3697501.8500.0030.6690.00049049.073

 Parkinson’s

disease

81.15872.6128.5460.6120.5710.5753200.2550.0020.6270.00230872.612

 Dermatology

diseases

94.17190.9533.2180.8710.8550.846200.2020.0020.9470.00224390.953
 BreastEW94.74792.9761.7710.9310.9200.924170.1050.0020.9610.00095392.976
Fast Correlation Based Feature
KNN classifier
 RNA gene99.74299.6250.1170.9980.9960.9974001.7500.0011.0000.00013199.625
 DNA CNV81.39076.2365.1540.7210.6710.676130.8000.0070.9050.00113176.236

 Parkinson’s

disease

82.65773.2709.38773.2710.5850.587161.5000.0020.6750.00176773.270

 Dermatology

diseases

97.93697.0050.9310.9700.9670.966140.1010.0020.9610.00121797.005
 BreastEW95.33395.0780.2550.9530.9450.94770.0060.0020.9530.00026195.078
The proposed methods compared with the MIFS, CBF and FCBF methods Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Table 18 showed the proposed methods compared with the Chi-square method as cited in the introduction section using SVM and RF classifiers. The SVM classifiers gave the best results for RNA gene and DNA CNV datasets, while RF classifier gave the best results for, Parkinson’s disease, BreastEW and dermatology erythemato-squamous diseases datasets. This method did not give the best results when compared with the PFBS-RFS-RFE.
Table 18

The proposed methods compared with the Chi-square method

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
SVM classifier
 RNA gene100.00099.6250.3750.9970.9950.99675550.08012.3791.0000.00003699.625
 DNA CNV79.86270.1309.7320.5920.5860.58455550.5283.0500.9010.00036970.130

 Parkinson’s

disease

75.66172.2283.4330.4710.4970.4483980.0160.2100.6280.00081472.228

 Dermatology

diseases

71.22070.4880.7320.5560.6530.565240.0940.0930.6530.00130570.488
 BreastEW91.99491.5630.4310.9290.8950.906210.0160.0160.9760.00101491.563
RF classifier
 RNA gene100.00099.5020.4980.9970.9950.99675550.08011.0091.0000.00004199.502
 DNA CNV86.93468.55218.3820.5850.5720.57055550.5282.8170.8910.00024068.552

 Parkinson’s

disease

100.00081.08718.9130.7550.7010.7043980.0160.4710.8360.00878381.087

 Dermatology

diseases

100.00098.3551.6450.9840.9810.982240.0940.2290.9980.00036398.355
 BreastEW100.00096.8323.1680.9730.9620.965210.0160.1040.9900.00126596.832
The proposed methods compared with the Chi-square method Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Table 19 showed the proposed methods compared with the IGF, Chi-square and Bat algorithm as cited in the introduction section. The Bat algorithm gave the best results for RNA gene, DNA CNV and BreastEW datasets, while Chi-square method gave the best results for Parkinson’s disease dataset. In addition, the IGF method gave the best results for dermatology erythemato-squamous diseases dataset. These methods did not give the best results when compared with the PFBS-RFS-RFE.
Table 19

The proposed methods compared with the IGF, Chi-square and Bat algorithm methods

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
Information gain
KNN classifier
 RNA gene99.72399.6270.0960.9980.9960.99735761.1820.0051.0000.00003699.627
 DNA CNV81.31074.4486.8620.6710.6400.63633155.6510.0140.8500.00136174.448

 Parkinson’s

disease

81.02672.4798.5470.7770.8850.8273960.0930.00080.4520.00238472.479

 Dermatology

diseases

97.99697.2670.7290.9740.9700.969250.0320.00020.9640.00049697.267
 BreastEW94.72892.9761.7520.9240.8880.904220.0640.0020.9690.00095392.976
Naïve base classifier
 RNA gene100.00096.3803.6200.96850.9460.95235761.1820.0290.9700.00071096.380
 DNA CNV66.99465.6371.3570.6470.6570.62633155.6510.4350.8320.00133665.637

 Parkinson’s

disease

74.61874.0700.5480.8030.8670.8333960.0930.0040.7210.00623574.070

 Dermatology

diseases

86.94785.7811.1660.8280.8560.802250.0320.00070.9840.00114385.781
 BreastEW94.25993.8530.4060.9460.8870.914220.0640.0020.9880.00076693.853
Decision tree classifier
 RNA gene99.15497.2501.9040.9750.9770.97535761.1820.8360.9890.00044497.250
 DNA CNV65.62664.5741.0520.5530.5590.55133155.6512.0670.8200.00090064.574

 Parkinson’s

disease

86.44975.52810.9210.8100.8780.8413960.0930.0770.7070.00564575.528

 Dermatology

diseases

88.49485.0153.4790.7250.7450.721250.0320.00080.9340.00320985.015
 BreastEW96.46694.0292.4370.9240.9200.918220.0640.0030.9670.00131094.029
Chi-square
KNN classifier
 RNA gene99.84799.7500.0970.9990.9970.99875550.08010.0101.0000.00002899.750
 DNA CNV70.28359.63510.6480.5260.4980.49255550.5280.0050.7530.00214259.635

 Parkinson’s

disease

81.15872.6128.5460.7780.8860.8283980.0160.0010.4520.00230872.612

 Dermatology

diseases

92.62288.4984.1240.8890.8750.865240.0940.00090.9670.00264188.498
 BreastEW94.72892.9761.7520.9240.8880.904210.0160.0020.9690.00095392.976
Naïve base classifier
 RNA gene100.00075.78724.2130.7460.6830.68075550.08010.1840.8090.00483375.787
 DNA CNV49.60048.7650.8350.5120.5060.46855550.5280.5520.7340.00089348.765

 Parkinson’s

disease

74.69174.2070.4840.7970.8790.8353980.0160.0060.7080.00807374.207

 Dermatology

diseases

89.44587.1642.2810.8160.8600.817240.0940.00040.9750.00195487.164
 BreastEW94.22093.6780.5420.9460.8820.912210.0160.00070.9880.00096793.678
Decision tree classifier
 RNA gene99.15497.2501.9040.9740.9760.97475550.08015.2210.9900.00041097.250
 DNA CNV58.45155.8632.5880.4640.4580.45255550.5282.1550.7760.00025355.863

 Parkinson’s

disease

83.12776.7216.4060.8230.8780.8493980.0160.1170.7290.00533876.721

 Dermatology

diseases

88.49485.0153.4790.7250.7450.721240.0940.00080.9340.00320985.015
 BreastEW96.46693.3273.1390.9150.9110.909210.0160.0100.9660.00251393.327
Bat algorithm
KNN classifier
 RNA gene99.86199.7520.1090.9990.9970.998648313500.0121.0000.00002799.752
 DNA CNV81.78675.3096.4770.6850.6440.639530112800.0080.8640.00023575.309

 Parkinson’s

disease

80.27769.32610.9510.7570.8690.809350.3059.4220.5070.00347569.326

 Dermatology

diseases

98.02797.2600.7670.9710.9700.969190.2550.0020.9740.00167897.260
 BreastEW94.72892.9761.7520.9240.8880.904140.2000.0010.9690.00095392.976
Naïve base classifier
 RNA gene99.88283.77716.1050.8580.7870.792648313500.0840.8750.00249883.777
 DNA CNV67.46366.2901.1730.6540.6630.632530112800.1860.8730.00201266.290

 Parkinson’s

disease

75.61774.7440.8730.7920.8990.841350.3050.00080.7060.00588574.744

 Dermatology

diseases

86.10985.5400.5690.8010.8500.799190.2550.0010.9790.00191585.540
 BreastEW95.47795.0990.3780.9620.9060.931140.2000.00050.9900.00118395.099
Decision tree classifier
 RNA gene99.32098.7500.5700.9880.9900.988648313501.7820.9940.00013998.750
 DNA CNV65.59264.6080.9840.5530.5600.551530112800.5780.8220.00089464.608

 Parkinson’s

disease

81.82074.6097.2110.7830.9150.843350.3050.0100.6990.00341374.609

 Dermatology

diseases

89.01187.1771.8340.7470.7610.742190.2550.0020.9420.00290987.177
 BreastEW96.46694.0292.4370.9240.9200.918140.2000.0030.9680.00131094.029
The proposed methods compared with the IGF, Chi-square and Bat algorithm methods Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Parkinson’s disease Dermatology diseases Table 20 showed the comparison between the PFBS-RFS-RFE and other filter ones methods. The results showed that the PFBS-RFS-RFE gave the best results when compared with other filter ones methods.
Table 20

The comparison between the PFBS-RFS-RFE and other filter ones methods

AlgorithmACC%NO.FPreRecF1-scoreAUCVar.
MIFS99.87510,0000.9990.9980.9881.0000.000016
IGF99.87535760.9990.9990.9981.0000.000016
mRMR99.7506500.9990.9970.9981.0000.000028
CfsSubsetEval99.62740830.9980.9960.9971.0000.000036
ReliefAttributeEval99.87310,0000.9990.9990.9991.0000.000031
OneRAttributeEval99.62770000.9980.9960.9970.9990.000036
ConsistencySubsetEval97.38030.9720.9700.9700.9940.000188
PCA99.7407000.9990.9970.9980.9990.000059
MIFS, CBF and FCBF99.7489000.9990.9970.9981.0000.000092
Chi-square99.62575550.9970.9950.9961.0000.000036

IGF, Chi-square and

Bat algorithm

99.75264830.9990.9970.9981.0000.000027

Proposed method

(PFBS-RFS-RFE)

100.00010.0001.0001.0001.0001.0000.0
The comparison between the PFBS-RFS-RFE and other filter ones methods IGF, Chi-square and Bat algorithm Proposed method (PFBS-RFS-RFE) The proposed methods were compared with some hybrid-recursive feature elimination methods as cited in the introduction section. Table 21 showed the results of the hybrid-recursive feature elimination methods for all datasets using RFE and LR. The results proved that this hybrid method gave the best results for RNA Gene, dermatology erythemato-squamous diseases and BreastEW datasets. This hybrid method did not give the best results when compared with the PFBS-RFS-RFE.
Table 21

The proposed methods compared with the hybrid of MIFS and RFE

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
RF classifier
 RNA gene100.00099.5010.4990.7940.7150.723500010,227.5791.1991.0000.00004199.501
 DNA CNV92.90885.0347.8740.7700.7160.717450088,434.6273.4110.9460.00069885.034

 Parkinson’s

disease

100.00083.86116.1390.8090.7370.75915074.4450.4110.8760.00286783.861

 Dermatology

diseases

99.72794.8194.9080.9410.9300.930121.1130.0790.9960.00152894.819
 BreastEW100.00095.9654.0350.9610.9530.956101.5920.1330.9880.00075695.965
The proposed methods compared with the hybrid of MIFS and RFE Parkinson’s disease Dermatology diseases Another hybrid method is applied to show the comparison between the proposed method and hybrid method using GA and RFE. Table 22 showed the results of the hybrid method using GA and RFE. The results proved that this hybrid method gave the best results for RNA gene and BreastEW datasets. This hybrid method did not give the best result when compared with the PFBS-RFS-RFE.
Table 22

The proposed methods compared with the hybrid of GA and RFE

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
SVM classifier
 RNA gene99.79199.7500.2210.9990.9970.998312315,746.0430.7271.0000.00002899.750
 DNA CNV93.27184.9808.2910.8600.7700.790294062,405.81035.1180.9650.00062084.980

 Parkinson’s

disease

75.52974.9960.5330.5300.5230.474149.00055.1140.0710.7680.00065274.996

 Dermatology

diseases

84.80084.7220.0780.8540.8350.8305.0000.6510.0160.9600.00005284.722
 BreastEW91.79991.3940.4050.4630.4200.4395.0000.6560.0160.9770.00077691.394
The proposed methods compared with the hybrid of GA and RFE Parkinson’s disease Dermatology diseases In addition, the proposed method was compared with another hybrid method using ridge regression and RFE. Table 23 showed the results of the hybrid method using ridge regression and RFE. The results proved that this hybrid method gave the best results for RNA gene, dermatology erythemato-squamous diseases and BreastEW datasets. This hybrid method did not give the best result when compared with the PFBS-RFS-RFE.
Table 23

The proposed methods compared with the hybrid of Ridge regression and RFE

DatasetsTrainData %TestData %Over-fittingDiff. %PreRecF1-scoreNO.FF-Time(sec)C-Time(sec)AUCVar.ACC%
SVM classifier
 RNA gene100.00099.6270.3730.9980.8300.83110,26510,160.7202.9621.0000.00003699.627
 DNA CNV93.44680.76112.6850.7720.7070.710819037,302.173.2160.9440.00052780.761

 Parkinson’s

disease

100.00082.93017.0700.8100.7140.738376.0005.4821.1950.8550.00341082.930

 Dermatology

diseases

99.72794.8054.9220.9500.9470.94413.0000.0160.0800.9940.00155694.805
 BreastEW100.00093.6756.3250.9410.9260.93215.0000.01590.1010.9840.00096993.675
The proposed methods compared with the hybrid of Ridge regression and RFE Parkinson’s disease Dermatology diseases Table 24 showed the comparison between the PFBS-RFS-RFE and other RFE hybrid methods. The results showed that the PFBS-RFS-RFE gave the best results when compared with other RFE hybrid methods.
Table 24

The comparison between the PFBS-RFS-RFE and other RFE hybrid methods

AlgorithmACC%NO.FPreRecF1-scoreAUCVar.
MIFS and RFE99.50145000.7940.7150.7231.0000.000041
GA and RFE99.75031230.9990.9970.9981.0000.000028
Ridge regression and RFE99.62710,2650.9980.8300.8311.0000.000036
Proposed method (PFBS-RFS-RFE)100.00010.0001.0001.0001.0001.0000.0
The comparison between the PFBS-RFS-RFE and other RFE hybrid methods After the number of runs, the selected features are intersected to know the genes (features) associated with cancers which considered the most developing human cancers. Table 25 presented the features after the intersection, which played an important role in knowing the most genes and features developing human cancers.
Table 25

The selected features after intersection [38–58]

Datasets

No.

Intersection

Features

Feature indices

or feature names

Feature or gene Description

Reference

in cancer

RNA gene1G110

DNA

CNV

12PPP1R8

Through alternative splicing, three this gene encodes

different isoforms [38].

[39]
SCARNA1Small Cajal body-specific RNA 1 [38].[40]
RPA2Protein A (RPA) complex is encoded by this gene [38].[41]
SMPDL3BSphingomyelin phosphodiesterase acid like 3B [38].[42]
XKR8

Promotes phosphatidylserine exposure apoptotic cell surface,

possibly by mediating phospholipid scrambling [43].

[44]
PHACTR4

A member of the phosphatase and actin regulator (PHACTR)

family are encoded by this gene [38].

[45]
RCC1Regulator of chromosome condensation 1 [38].[46]
SNHG3Small nucleolar RNA host gene 3 [32].[47]
SNORD99Small nucleolar RNA, C/D box 99 [38].[48]
SNORA16ASmall nucleolar RNA, H/ACA box 16A [38].[49]
RAB42Member RAS oncogene family [38].
TFA12

This gene Control of transcription by RNA polymerase II

[38].

[50]

Parkinson’s

disease

7IMF_SNR_TKEO
IMF_NSR_TKEO
mean_MFCC_1st_coef
mean_4th_delta_delta
mean_5th_delta_delta
mean_6th_delta_delta
mean_7th_delta_delta
BreastEW1Radius

Can be defined as the mean of distances from center to points

on the perimeter [51].

[51]
Datasets

No.

Intersection

Features

Feature indices

or feature names

Feature or gene Description

Reference

in derma-

tology

Dermatology erythemato-squamous diseases5BordersThe border of the lesion which important for diagnosing and for other features [52, 53].[53]
ParakeratosisNucleated keratinocytes are existed in the stratum corneum due to accelerated keratinocytic turnover [54].[54]
SpongiosisIntraepidermal eosinophils is existed in spongiotic zones [55].[55, 56]
ItchingItching is a bad feeling that causes itching continuously, which affects the human psyche [57].[56]
AgeThe age at disease onset [58].[58]
The selected features after intersection [38-58] No. Intersection Features Feature indices or feature names Reference in cancer DNA CNV Through alternative splicing, three this gene encodes different isoforms [38]. Promotes phosphatidylserine exposure apoptotic cell surface, possibly by mediating phospholipid scrambling [43]. A member of the phosphatase and actin regulator (PHACTR) family are encoded by this gene [38]. This gene Control of transcription by RNA polymerase II [38]. Parkinson’s disease Can be defined as the mean of distances from center to points on the perimeter [51]. No. Intersection Features Feature indices or feature names Reference in derma- tology For DNA CNV dataset, the PHACTR4 was associated with prostate, breast and colon cancer [59], while RPA2 was associated with breast cancer [41]. We can notice that the proposed method achieved the best results and reached the most effective genes that develop human cancer. For dermatology erythemato-squamous diseases dataset, the age, itching and spongiosis features were associated with psoriasis dis- ease [56, 58].

Discussion

The proposed PFBS-RFS-RFE was applied to classify different human cancer using big, medium and small datasets and other medical dataset. It used five different datasets. PFBS-RFS-RFE was proposed to enhance drawbacks included in over-fitting, time-consuming, high dimension, variance and classification accuracy. The PFBS was applied in different position to obtain different results. It was applied using three positions outer, inner and outer/inner. After applying PFBS, the RFS algorithm for feature selection was applied to select the most relevant features and reduce time consumption in RFE algorithm. RFE algorithm was used to obtain the final relevant subset of features with higher classification accuracy results. The OFBS-RFS-RFE method achieved the best results using all datasets. The RF classifier achieved the best classification accuracy with 100% using dermatology erythemato-squamous diseases dataset with 0.0 variance results. The features and time were reduced to become 16.000 and 0.500, respectively. Furthermore, LR classifier achieved the best classification accuracy result with 99.981% using RNA gene dataset, while the SVM classifier gave the best variance result with 0.0000002. The number of features and time were reduced to become 142.500 and 0.192 s, respectively. From DNA CNV dataset the difference between training and testing was reduced using LR and Bagg classifiers, and the accuracy results were increased with 91.020 and 92.762%, respectively using the same classifiers. In addition, the OFBS-RFS-RFE reduced the variance between features to become 0.00028 and 0.00023, respectively, using the previous classifiers. The number of features and time were reduced to become 675 and 2.147 s, respectively. From Parkinson’s disease dataset the classification accuracy and variance are enhanced to become 95.000% and 0.00062, respectively using RF classifier. The features were reduced to 113.85 features which well enough for classification step with 1.134 s as a computational time. From BreastEw dataset the best computational time was after applying LR and SVM in contrast with the other optimizer. The RF gave the best variance and accuracy to become 0.000302 and 98%, respectively. The features and time were reduced to become 0.070 and 0.070 s, respectively. The IFBS-RFS-RFE not achieves the best results in all datasets. The SVM classifier achieved the best classification accuracy and variance results from the RNA gene dataset with 99.988% and 0.0000002, respectively. The features and time were minimized to 125.25 features and 0.153 s, respectively. For other datasets it did not give good results. The O/IFBS-RFS-RFE achieved the best results for dermatology erythemato-squamous diseases dataset. RF and Bagg classifiers gave the best results with 10 features. The classification accuracy, variance and time were improved to become 100%, 0.0 and 0.500, respectively. In addition, The O/IFBS-RFS-RFE achieved the best results in high dimension datasets using RNA gene. The LR classifier increased the accuracy and variance results to 99.994% and 0.0000004, respectively. From DNA CNV dataset, the Bagg classifier gave the best accuracy and variance results to become 92.834% and 0.00027, respectively. At the same time, the outer/inner position did not provide good results for other datasets. For future work, our proposed method will apply the incremental feature selection (IFS) for different datasets using PFBS. The IFS will select the most relevant subset features to minimize the time when using all features and overcome the feature selection drawback.

Conclusions

In our study, new hybrid methods are proposed to enhance cancers classification performance using different size of datasets. The PFBS using EDF equation is enhanced the RFS and RFE performance. Many bootstrap positions are applied to improve the problem of over-fitting and to fix the feature selection problems. Furthermore, our proposed methods achieved high results using different size of datasets. It is compared with previous work and it gave high results.

Method

Dataset description

We used five healthcare datasets in the experiments. The DNA CNV dataset is used in [7, 8, 12] and downloaded from the cBioPortal for Cancer Genomics [59-61] to classify different types of human cancers. The other four datasets are downloaded from the UCI machine learning repository [62] and used in [9, 23]. A brief description of each adopted dataset is presented in Table 26.
Table 26

Datasets Description

Category TypeDS No.Datasets#Features#Samples#Class
Small < 100D1BreastEW305692
D2Dermatology erythemato-squamous diseases343666

Medium

100 < D2 < 1000

D3Parkinson’s disease7537562

Large

1000 < D < 21,000

D4DNA CNV16,38129166
D5RNA gene20,5318015
Datasets Description Medium 100 < D2 < 1000 Large 1000 < D < 21,000

The proposed hybrid feature selection methods

The main motivation of the proposed methods is to select the most important and relevant features from all original features. This step is considered vital and plays a significant role in obtaining good classification results. Non- influencing features waste time and lead to many complex problems included in poor classification accuracy, over-fitting, and feature size. The wrapper method for feature selection selects the features based on machine learning to find optimal features, but it takes more time to obtain these features and has chances of over-fitting problems. On the other hand, the advantage of embedded methods for feature selection is that the selected features are embedded in machine learning or during the model building process. It is applied to reduce the over-fitting problem, reducing the variance between features. Based on the advantages of the two previous methods, we proposed hybrid methods for feature selection to obtain the most relevant subset feature. The proposed methods are shown in Fig. 4. Resampling method with different positions is applied to minimize the over-fitting problem and maximize the classification accuracy. After the resampling step, the most important features are selected using RFS algorithm. The hybrid between resampling and RF algorithms are applied to solve many problems such as (1) time consuming when using RFE algorithm, (2) over-fitting problem, (3) the most relevant features, and (4) classification accuracy. The wrapper method is applied to select the most important features, therefore; reduce the datasets dimensional and maximizing the classification accuracy. The RFE using LR classification as an estimator is integrated with the previous features to achieve the desired goals.
Fig. 4

Hybrid proposed methods for feature selection

Hybrid proposed methods for feature selection

First bootstrap step as a resampling method

A lot of high-dimensional datasets suffer from over-fitting problems and low classification accuracy. We apply the FBS step as a resampling method to avoid these problems. The bootstrap samples are drawn with replacement as the same size of the original data. Given the original datasets X = X1, X2, X3, ........, XO With O observations with a distribution function called empirical distribution function (EDF). The bootstrap sample is denoted as X* = X*1, X*2, X*3, ......., X*O. The (EDF) is denoted as follows [63]: - Where I(·) denotes the indicator function, the bootstrap resampling method is applied in many positions to achieve the desired task. The first position of bootstrap is before selecting the essential features called OFBS, but we need to apply different positions to obtain the best results. In this position the EDF is applied as a resampling method before selecting features. The IFBS is applied during selecting the feature selection. On the other hand, the O/IFBS is applied before and during selecting features. All bootstrap positions are applied to overcome the over-fitting and classification accuracy. After these positions, the classification accuracy and over-fitting problems are improved. Therefore, the proposed positions selected the most relevant features.

Feature selection using random Forest (RFS)

A random forest algorithm is applied for feature selection to improve the performance of the classifiers, reduce the over-fitting problem and time consuming due to the disadvantage of RFE algorithm. It is considered the embedded feature selection that interacts directly with classifiers and reduces the time complexity found in the wrapper method. The RFS algorithm can identify the importance of the feature. The training samples are created using bootstrap when applying IFBS method but using all datasets to create samples when applying OFBS to improve the over-fitting and classification accuracy. The trees are constructed with a specific size. Select M trees from the dataset to build the decision trees. Decision trees are constructed from the M trees and they are repeated B times. Construct the smallest subset of features F at each node and separate the best features for F by Gini importance scores. It is sorted the features according to their scores from smallest to largest. The features below the threshold will be eliminated.

Recursive feature elimination (RFE)

Selecting the most significant features is the main goal in the classification step. In this direction, we applied RFE algorithm to select the most important features therefore; reach to the chromosome which considered the most developing human cancers. RFE is an instance of backward feature elimination. The classifier estimator is trained on the initial set of features and these features are sorted according to their weights. The features with the smallest weights are removed because these features are not important during the classification process. The previous steps are repeated until the most relevant features are reached. RFE is applied with LR as an estimator. The classification accuracy is improved after applying the proposed method. The step size is proposed in the RFE method called recursive feature elimination with cross-validation (RFECV) to achieve the best results. The features are sorted according to their importance at each step, and the smallest ranked feature is deleted. The proposed methods are presented in Tables 27, 28 and 29 as follows:
Table 27

Algorithm 1 of the first hybrid proposed method using OFBS-RFS-RFE

Table 28

Algorithm 2 of the second hybrid proposed method using IFBS-RFS-RFE

Table 29

Algorithm 3 of the third hybrid proposed method using O/IFBS-RFS-RFE

Algorithm 1 of the first hybrid proposed method using OFBS-RFS-RFE Algorithm 2 of the second hybrid proposed method using IFBS-RFS-RFE Algorithm 3 of the third hybrid proposed method using O/IFBS-RFS-RFE
  33 in total

1.  Heuristic filter feature selection methods for medical datasets.

Authors:  Mehdi Alirezanejad; Rasul Enayatifar; Homayun Motameni; Hossein Nematzadeh
Journal:  Genomics       Date:  2019-07-02       Impact factor: 5.736

2.  PHACTR4 regulates proliferation, migration and invasion of human hepatocellular carcinoma by inhibiting IL-6/Stat3 pathway.

Authors:  F Cao; M Liu; Q-Z Zhang; R Hao
Journal:  Eur Rev Med Pharmacol Sci       Date:  2016-08       Impact factor: 3.507

3.  Classification of cancers based on copy number variation landscapes.

Authors:  Ning Zhang; Meng Wang; Peiwei Zhang; Tao Huang
Journal:  Biochim Biophys Acta       Date:  2016-06-03

4.  The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data.

Authors:  Ethan Cerami; Jianjiong Gao; Ugur Dogrusoz; Benjamin E Gross; Selcuk Onur Sumer; Bülent Arman Aksoy; Anders Jacobsen; Caitlin J Byrne; Michael L Heuer; Erik Larsson; Yevgeniy Antipin; Boris Reva; Arthur P Goldberg; Chris Sander; Nikolaus Schultz
Journal:  Cancer Discov       Date:  2012-05       Impact factor: 39.397

Review 5.  Pruritus: Progress toward Pathogenesis and Treatment.

Authors:  Jing Song; Dehai Xian; Lingyu Yang; Xia Xiong; Rui Lai; Jianqiao Zhong
Journal:  Biomed Res Int       Date:  2018-04-11       Impact factor: 3.411

6.  Efficient feature selection and classification for microarray data.

Authors:  Zifa Li; Weibo Xie; Tao Liu
Journal:  PLoS One       Date:  2018-08-20       Impact factor: 3.240

Review 7.  Artificial intelligence in cancer imaging: Clinical challenges and applications.

Authors:  Wenya Linda Bi; Ahmed Hosny; Matthew B Schabath; Maryellen L Giger; Nicolai J Birkbak; Alireza Mehrtash; Tavis Allison; Omar Arnaout; Christopher Abbosh; Ian F Dunn; Raymond H Mak; Rulla M Tamimi; Clare M Tempany; Charles Swanton; Udo Hoffmann; Lawrence H Schwartz; Robert J Gillies; Raymond Y Huang; Hugo J W L Aerts
Journal:  CA Cancer J Clin       Date:  2019-02-05       Impact factor: 508.702

8.  Editorial: Artificial Intelligence in Positron Emission Tomography.

Authors:  Hanyi Fang; Kuangyu Shi; Xiuying Wang; Chuantao Zuo; Xiaoli Lan
Journal:  Front Med (Lausanne)       Date:  2022-01-31

9.  Diagnosis of Chronic Kidney Disease Using Effective Classification Algorithms and Recursive Feature Elimination Techniques.

Authors:  Ebrahime Mohammed Senan; Mosleh Hmoud Al-Adhaileh; Fawaz Waselallah Alsaade; Theyazn H H Aldhyani; Ahmed Abdullah Alqarni; Nizar Alsharif; M Irfan Uddin; Ahmed H Alahmadi; Mukti E Jadhav; Mohammed Y Alzahrani
Journal:  J Healthc Eng       Date:  2021-06-09       Impact factor: 2.682

10.  Diagnostic approach of eosinophilic spongiosis.

Authors:  Karina Lopes Morais; Denise Miyamoto; Celina Wakisaka Maruta; Valéria Aoki
Journal:  An Bras Dermatol       Date:  2019-10-26       Impact factor: 1.896

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.