Vaibhav Rupapara1, Furqan Rustam2, Wajdi Aljedaani3, Hina Fatima Shahzad2, Ernesto Lee4, Imran Ashraf5.
Abstract
Blood cancer has been a growing concern during the last decade and requires early diagnosis for proper treatment to begin. The diagnosis process is costly and time-consuming, involving medical experts and several tests. Thus, an automatic diagnosis system for accurate prediction is of significant importance. Diagnosing blood cancer from leukemia microarray gene data using machine learning has become an important area of medical research. Despite research efforts, the desired accuracy and efficiency necessitate further enhancements. This study proposes an approach for blood cancer prediction using supervised machine learning. The leukemia microarray gene dataset, containing 22,283 genes, is used. ADASYN resampling and Chi-squared (Chi2) feature selection techniques are used to resolve the imbalanced and high-dimensional dataset problems: ADASYN generates artificial data to balance the dataset for each target class, and Chi2 selects the best features out of the 22,283 to train the learning models. For classification, a hybrid logistic vector trees classifier (LVTrees) is proposed, which combines logistic regression, a support vector classifier, and an extra trees classifier. Besides extensive experiments on the datasets, a performance comparison with state-of-the-art methods is made to determine the significance of the proposed approach. LVTrees outperforms all other models when the ADASYN and Chi2 techniques are applied, reaching 100% accuracy. Further, a statistical significance T-test is performed to show the efficacy of the proposed approach. Results using k-fold cross-validation confirm the superiority of the proposed model.
Year: 2022 PMID: 35046459 PMCID: PMC8770560 DOI: 10.1038/s41598-022-04835-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of the studies systematically analyzed in related work.
| Study | Models | Dataset | Evaluation metrics | Results |
|---|---|---|---|---|
| [ | Bayes Network learning, Conjunctive Rule, NBTree, VFI, Random Subspace, Naïve Bayes Updateable, and PART | Three datasets containing 7,130 genes | Accuracy | 97.22% for 500 genes |
| [ | Local Directional path | 90 high-quality | Sensitivity, Specificity, Precision, F-Measure | Sensitivity: 100%, Specificity: 80%, Precision: 85.74%, F-Measure: 93.4% |
| [ | K-Means, Fuzzy C Means, Weighted K Means | Heart dataset from UCI machine learning repository | Cluster accuracy, error rate and execution time | Leukemia, K-Means: 78%, Fuzzy means: 75%, WK-Means: 85% |
| [ | Updatable NB, MLP, KNN, SVM | 25 variables or features and 82 instances or records | Accuracy | NB 94.76%, MLP 95.24%, SVM 96.20%, KNN 91.43% |
| [ | Fuzzy c-means clustering, PCA, SVM | 21 peripheral blood smear and bone marrow slides of 14 patients with all and 7 normal persons | sensitivity, specificity, accuracy, precision and false negative | Sensitivity 98%, Specificity 97%, Accuracy 98%, Precision 98% |
| [ | Linde–Buzo–Gray, Kekre's Proportionate Error, K-Means | 115 digital images of size | Sensitivity, specificity, accuracy | Sensitivity 100%, Specificity 99.747%, Accuracy 99.7617% |
| [ | KNN, SVM, DT, RF, GBDT | Three RNA-seq data sets | Precision, recall and accuracy | Accuracy LUAD: 98.80 (± 1.79), STAD: 98.78 (± 1.44), BRCA: 98.41 (± 0.41) |
| [ | Deep convolutional neural networks | Images from ALL-Image DataBase (IDB) | Sensitivity, specificity, accuracy | Sensitivity 100%, Specificity 98.11%, Accuracy of 99.50% |
| [ | AlexNet | 2,820 images | Precision, Recall, accuracy | 100% classification accuracy |
| [ | Alert Net-RWD | 16 datasets with 2,415 images | Accuracy, precision | Accuracy 97.18%, Precision 97.23% |
| [ | SVM, KNN, NB, and RF | NCBI/GEO public database: 11 series from Microarray and 2 series from RNA-seq | ANOVA statistical test, accuracy, F1 | 10 Genes F1-score: SVM: 97.13%, KNN: 96.28%, NB: 97.29%, RF: 97.01% |
| [ | DNN deep learning network | 36 cases containing 22,283 gene expression of acute myeloid leukemia (AML) microarray | Accuracy | Accuracy: 96.6% |
Figure 1. Methodology applied for the study.
Number of samples for each class with and without applying ADASYN technique.
| Target | Count | After ADASYN |
|---|---|---|
| B-CELL_ALL | 74 | 74 |
| B-CELL_ALL_TCF3-PBX1 | 22 | 74 |
| B-CELL_ALL_HYPERDIP | 51 | 64 |
| B-CELL_ALL_HYPO | 18 | 74 |
| B-CELL_ALL_MLL | 17 | 73 |
| B-CELL_ALL_T-ALL | 46 | 74 |
| B-CELL_ALL_ETV6-RUNX1 | 53 | 76 |
| Total Samples | 281 | 509 |
Number of features for experiments.
| Features | Original | After Chi2 |
|---|---|---|
| Total | 22,283 | 400 |
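The Chi2 reduction from 22,283 to 400 features can be sketched with scikit-learn's `SelectKBest`. This is a small illustration on synthetic stand-in data (1,000 features instead of 22,283); the `chi2` score function requires non-negative inputs, which holds for microarray intensity values.

```python
# Sketch of Chi-squared feature selection, keeping the top-k features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in data; shift to non-negative values, as chi2 requires.
X, y = make_classification(n_samples=281, n_features=1000, random_state=0)
X = X - X.min()

# The study keeps the 400 best-scoring features out of 22,283.
selector = SelectKBest(score_func=chi2, k=400)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # 281 samples, 400 selected features
```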
Number of samples and features in training and testing sets.
| Technique | Training samples | Training features | Testing samples | Testing features |
|---|---|---|---|---|
| Original dataset | 238 | 22,283 | 43 | 22,283 |
| After ADASYN | 432 | 22,283 | 77 | 22,283 |
| After Chi2 | 238 | 400 | 43 | 400 |
| After ADASYN+Chi2 | 432 | 400 | 77 | 400 |
Target count for each class in Leukemia_GSE28497 dataset.
| Target | Count |
|---|---|
| B-CELL_ALL | 74 |
| B-CELL_ALL_TCF3-PBX1 | 22 |
| B-CELL_ALL_HYPERDIP | 51 |
| B-CELL_ALL_HYPO | 18 |
| B-CELL_ALL_MLL | 17 |
| B-CELL_ALL_T-ALL | 46 |
| B-CELL_ALL_ETV6-RUNX1 | 53 |
| Total Samples | 281 |
Sample of Leukemia_GSE28497 dataset.
| Type | 1007_s_at | 1053_at | … | AFFXTrpnXM_at |
|---|---|---|---|---|
| BCELL_ALL | 7.409521 | 5.009216 | … | 2.608381 |
| BCELL_ALL | 7.177109 | 5.415108 | … | 2.634063 |
Target count for each class in Leukemia_GSE9476.
| Target | Count |
|---|---|
| AML | 26 |
| Bone_Marrow | 10 |
| Bone_Marrow_CD34 | 8 |
| PB | 10 |
| PBSC_CD34 | 10 |
| Total | 64 |
Sample of Leukemia_GSE9476 dataset.
| Type | 1007_s_at | 1053_at | … | AFFXTrpnXM_at |
|---|---|---|---|---|
| Bone_Marrow_CD34 | 7.745245 | 7.811210 | … | 4.139249 |
| Bone_Marrow_CD34 | 8.087252 | 7.240673 | … | 4.122700 |
Description of used machine learning models.
| Model | Description |
|---|---|
| RF | RF is a tree-based ensemble learning model that makes accurate predictions by combining multiple weak learners. It uses the bagging method to train several decision trees on different bootstrap samples, where each bootstrap sample is drawn from the training data with replacement and is the same size as the training set[ |
| LR | Classification problems are often dealt with using logistic regression, a regression model based on probability theory and a predictive analysis algorithm. It is most commonly used to interpret binary data, in which one or more variables together generate an outcome. Using the sigmoid function, logistic regression establishes a relationship between one or more independent variables and an estimated probability[ |
| SVC | Classification aims to divide a data collection into categories according to a set of criteria, so the data can be organized in a more meaningful way. SVC is a classification method based on the support vector technique. The SVC fits the supplied data and returns a "best fit" hyperplane that separates the classes; new samples are then classified according to which side of the hyperplane they fall on. This makes the algorithm particularly suitable for our purposes, though it can be used in a variety of contexts[ |
| KNN | KNN is a basic machine learning model used for both regression and classification. A sample is assigned to the class of its nearest neighbors, which the technique determines using a distance metric over the data. The KNN model yields promising results in this experiment when the value of k (the number of neighbors) is set to 4 |
| NB | Based on the Bayes theorem, the supervised learning algorithm called Naive Bayes is used to solve classification problems. Training an NB classifier requires only a limited number of data points, making it fast and scalable. It is a probabilistic classifier that predicts the probability of an object's class. The NB classifier assumes that each feature's likelihood is independent of the others and that they do not overlap, so that each feature contributes equally to a sample belonging to a given class. The NB classifier is easy to use, quick to compute, and works well on massive high-dimensional datasets[ |
| ETC | The ETC works similarly to the random forest, except for the tree-building process: the ETC uses the whole original training sample to build each decision tree. The best feature for splitting the data in a tree is chosen using the Gini index, and, unlike RF, the candidate split thresholds are drawn at random |
| DT | A DT is a tree-like structure used for classification. Decision trees are commonly used in medical processing because they are quick to build and fast to execute. A decision tree has three kinds of nodes: (1) the root node (the main node, on which the roles of the other nodes depend); (2) interior nodes (which handle the various attribute tests); and (3) leaf nodes (also called end-nodes, the final nodes that represent the result of each test)[ |
| ADA | ADA is typically used in combination with other algorithms to improve their accuracy. It focuses on boosting weak learners into strong learners. Each AdaBoost tree is built based on the error rate of the previously constructed tree[ |
Models hyperparameters settings and hyperparameter range used for tuning.
| Model | Hyperparameters setting | Hyperparameter range |
|---|---|---|
| RF | n_estimators = 300, max_depth = 25 | n_estimators = 20 to 500, max_depth = 2 to 50 |
| LR | multi_class = "multinomial", C = 2.0 | solver = liblinear, saga, sag, multi_class = "multinomial", C = 1.0–5.0 |
| SVC | kernel = “linear”, C = 2.0 | kernel = linear, sigmoid, poly, C = 1.0–5.0 |
| KNN | n_neighbors = 4 | n_neighbors = 2–6 |
| NB | Default setting | – |
| ETC | n_estimators = 300, max_depth = 25 | n_estimators = 20–500, max_depth = 2–50 |
| DT | max_depth = 25 | max_depth = 2–50 |
| ADA | n_estimators = 300, learning_rate = 0.2 | n_estimators = 20–500, learning_rate = 0.1–0.8 |
| LVTrees | Models = LR, SVC, ETC, Voting = Hard | Voting = Hard and Soft |
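With the hyperparameters listed above, the hybrid LVTrees ensemble can be sketched as a hard-voting combination of LR, SVC, and ETC using scikit-learn's `VotingClassifier`. This is a minimal illustration on synthetic data, not the paper's exact implementation; `multi_class="multinomial"` from the table is omitted here because recent scikit-learn versions deprecate that argument.

```python
# Sketch of the LVTrees ensemble: majority (hard) vote over LR, SVC, ETC.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic multi-class stand-in for the gene-expression data.
X, y = make_classification(n_samples=300, n_features=40, n_classes=3,
                           n_informative=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          random_state=1)

lvtrees = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=2.0, max_iter=1000)),
        ("svc", SVC(kernel="linear", C=2.0)),
        ("etc", ExtraTreesClassifier(n_estimators=300, max_depth=25,
                                     random_state=1)),
    ],
    voting="hard",  # each model casts one vote; majority label wins
)
lvtrees.fit(X_tr, y_tr)
print("accuracy:", lvtrees.score(X_te, y_te))
```

Hard voting uses only the predicted labels, which is why `SVC` needs no `probability=True` here; soft voting (the other tuning option in the table) would average predicted class probabilities instead.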
Figure 2. Architecture of the proposed hybrid LVTrees model.
Performance of models on original dataset.
| Model | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| LVTrees | 0.91 | 0.95 | 0.89 | 0.89 |
| KNN | 0.91 | 0.95 | 0.88 | 0.88 |
| ETC | 0.88 | 0.80 | 0.84 | 0.82 |
| ADA | 0.65 | 0.78 | 0.67 | 0.67 |
| SVC | 0.91 | 0.96 | 0.88 | 0.88 |
| RF | 0.88 | 0.81 | 0.84 | 0.82 |
| NB | 0.86 | 0.79 | 0.81 | 0.79 |
| DT | 0.72 | 0.74 | 0.72 | 0.73 |
| LR | 0.91 | 0.95 | 0.88 | 0.88 |
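The four metrics reported in these tables can be computed with scikit-learn; a reasonable assumption for a multi-class problem like this is that precision, recall, and F1 are macro-averaged over the classes. A tiny worked example with hypothetical label arrays:

```python
# Sketch of computing accuracy and macro-averaged precision/recall/F1,
# assuming hypothetical true and predicted label arrays.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # one sample of class 1 misclassified as 2

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                   average="macro")
print(f"Accuracy {acc:.2f}  Precision {prec:.2f}  "
      f"Recall {rec:.2f}  F1 {f1:.2f}")
```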
Figure 3. Confusion matrix of LVTrees on the original dataset.
Performance of models applying ADASYN technique.
| Model | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| LVTrees | 0.99 | 0.99 | 0.99 | 0.99 |
| KNN | 0.87 | 0.91 | 0.88 | 0.87 |
| ETC | 0.97 | 0.98 | 0.98 | 0.98 |
| ADA | 0.75 | 0.86 | 0.78 | 0.77 |
| SVC | 0.99 | 0.99 | 0.99 | 0.99 |
| RF | 0.99 | 0.99 | 0.99 | 0.99 |
| NB | 0.95 | 0.95 | 0.95 | 0.95 |
| DT | 0.87 | 0.87 | 0.88 | 0.87 |
| LR | 0.99 | 0.99 | 0.99 | 0.99 |
Figure 4. Confusion matrix of the best performer, LVTrees, after applying the ADASYN technique.
Performance of models after applying Chi2 technique.
| Model | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| LVTrees | 0.91 | 0.81 | 0.85 | 0.83 |
| KNN | 0.79 | 0.82 | 0.81 | 0.81 |
| ETC | 0.86 | 0.79 | 0.82 | 0.80 |
| ADA | 0.72 | 0.64 | 0.59 | 0.59 |
| SVC | 0.86 | 0.78 | 0.82 | 0.80 |
| RF | 0.88 | 0.81 | 0.83 | 0.82 |
| NB | 0.86 | 0.85 | 0.85 | 0.85 |
| DT | 0.74 | 0.73 | 0.4 | 0.73 |
| LR | 0.88 | 0.81 | 0.83 | 0.82 |
Figure 5. Confusion matrix of LVTrees after applying the Chi2 technique.
Performance of models after applying both Chi2 and ADASYN techniques.
| Model | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| LVTrees | 1.00 | 1.00 | 1.00 | 1.00 |
| KNN | 0.95 | 0.96 | 0.92 | 0.92 |
| ETC | 0.97 | 0.97 | 0.96 | 0.97 |
| ADA | 0.86 | 0.88 | 0.85 | 0.84 |
| SVC | 0.99 | 0.99 | 0.98 | 0.98 |
| RF | 0.99 | 0.99 | 0.98 | 0.98 |
| NB | 0.92 | 0.91 | 0.90 | 0.91 |
| DT | 0.84 | 0.87 | 0.81 | 0.82 |
| LR | 0.97 | 0.97 | 0.97 | 0.97 |
Figure 6. Confusion matrix of LVTrees after applying Chi2 and ADASYN techniques.
Figure 7. Results of the models' performance after applying each technique.
Figure 8. Accuracy score comparison with all approaches.
Performance of proposed approach on Leukemia_GSE9476 dataset.
| Model | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| LVTrees | 0.90 | 0.95 | 0.92 | 0.92 |
| LVTrees (Chi2+ADASYN) | 1.00 | 1.00 | 1.00 | 1.00 |
Performance when resampling is applied to the training data alone.
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| LVTrees (Original) | 0.91 | 0.95 | 0.89 | 0.89 |
| LVTrees (Chi2+ADASYN) | 0.95 | 0.93 | 0.95 | 0.94 |
Performance results when feature selection is performed after data splitting.
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| LVTrees | 0.97 | 0.97 | 0.97 | 0.97 |
| KNN | 0.89 | 0.91 | 0.90 | 0.89 |
| ETC | 0.96 | 0.96 | 0.96 | 0.96 |
| ADA | 0.38 | 0.42 | 0.45 | 0.39 |
| SVC | 0.95 | 0.95 | 0.95 | 0.95 |
| RF | 0.96 | 0.96 | 0.96 | 0.96 |
| NB | 0.94 | 0.94 | 0.94 | 0.94 |
| DT | 0.86 | 0.85 | 0.86 | 0.85 |
| LR | 0.96 | 0.96 | 0.96 | 0.96 |
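The leakage-free protocol evaluated in the table above (feature selection fit on the training split only) can be sketched with a scikit-learn `Pipeline`, which guarantees the selector never sees test-set statistics. This is an illustrative setup on synthetic stand-in data, not the paper's exact code.

```python
# Sketch of fitting the Chi2 selector on the training split only, then
# applying it unchanged to the held-out test split via a Pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=281, n_features=500, random_state=0)
X = X - X.min()  # chi2 requires non-negative inputs

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          stratify=y, random_state=0)

pipe = Pipeline([
    ("chi2", SelectKBest(chi2, k=100)),       # fit on training data only
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)   # the selector sees no test-set statistics
print("test accuracy:", pipe.score(X_te, y_te))
```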
Results of 10-fold cross validation for all models.
| Model | Accuracy (original data) | Accuracy (Chi2+ADASYN) |
|---|---|---|
| LVTrees | 0.90 | 0.97 |
| KNN | 0.79 | 0.92 |
| ETC | 0.86 | 0.95 |
| ADA | 0.48 | 0.57 |
| SVC | 0.89 | 0.96 |
| RF | 0.86 | 0.96 |
| NB | 0.83 | 0.90 |
| DT | 0.70 | 0.86 |
| LR | 0.89 | 0.95 |
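The 10-fold cross-validation protocol behind the table above can be sketched with `cross_val_score`, which reports per-fold accuracies from which a mean and standard deviation are derived. Synthetic stand-in data and a single model are used here for illustration.

```python
# Sketch of 10-fold cross-validation: mean accuracy and SD for one model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=40, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=10, scoring="accuracy")
print(f"accuracy: {scores.mean():.2f} (SD {scores.std():.2f})")
```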
Comparison with previous approaches.
| Reference | Year | Model | Data | Accuracy |
|---|---|---|---|---|
| [ | 2019 | SVM, KNN, NB, and RF | Microarray gene | KNN: 96.28%, NB: 97.29%, RF: 97.01% |
| [ | 2020 | DNNs deep learning network | Microarray gene | 96.6% |
| Current study | 2021 | LVTrees | Microarray gene | 100% |