Yousra El Alaoui, Adel Elomri, Marwa Qaraqe, Regina Padmanabhan, Ruba Yasin Taha, Halima El Omri, Abdelfatteh El Omri, Omar Aboumarzouk.
Abstract
BACKGROUND: Machine learning (ML) and deep learning (DL) methods have recently garnered a great deal of attention in the field of cancer research by making a noticeable contribution to the growth of predictive medicine and modern oncological practices. Considerable focus has been particularly directed toward hematologic malignancies because of the complexity in detecting early symptoms. Many patients with blood cancer do not get properly diagnosed until their cancer has reached an advanced stage with limited treatment prospects. Hence, the state-of-the-art revolves around the latest artificial intelligence (AI) applications in hematology management.
Keywords: artificial intelligence; cancer; deep learning; hematology; machine learning; malignancy; management; oncology; prediction
Year: 2022 PMID: 35819826 PMCID: PMC9328784 DOI: 10.2196/36490
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 7.076
Figure 1. Review chart showing paper elimination and categorization process. AI: artificial intelligence; DL: deep learning; ML: machine learning.
Figure 2. Citation statistics for the most cited documents.
Figure 3. Thematic evolution of hematology management research.
Figure 4. Statistics for most cited keywords.
Classification of journal papers by category.
| Categories | Number of articles | Studies |
| Review papers | 13 | [ |
| Conference proceedings | 33 | [ |
| Journal articles | 98 | [ |
Distribution of the 131 studies based on malignancy type.
| Malignancy type | Values, n (%) |
| Acute myeloid leukemia | 42 (32.1) |
| Acute lymphoblastic leukemia | 40 (30.5) |
| Chronic lymphocytic leukemia | 13 (9.9) |
| Chronic myeloid leukemia | 2 (1.5) |
| Lymphoma | 17 (13.0) |
| Other | 17 (13.0) |
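The percentages in the table above follow from dividing each count by the total number of studies. A minimal sketch of that arithmetic, assuming the counts are simply copied from the table (the names `counts` and `share_table` are illustrative, not from the source):

```python
# Counts copied from the malignancy-type table; names are illustrative.
counts = {
    "Acute myeloid leukemia": 42,
    "Acute lymphoblastic leukemia": 40,
    "Chronic lymphocytic leukemia": 13,
    "Chronic myeloid leukemia": 2,
    "Lymphoma": 17,
    "Other": 17,
}


def share_table(counts):
    """Return {label: percentage of the total, rounded to 1 decimal place}."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total = sum(counts.values())
shares = share_table(counts)

# Print rows in the same "n (%)" layout the table uses.
for label, n in counts.items():
    print(f"{label}: {n} ({shares[label]:.1f})")
```

The counts sum to 131, which matches the N stated in the table caption.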
Classification of journal publications by pathway stage (N=131).
| Pathway stage | Values, n (%) | Studies |
| Prediction | 27 (20.6) | [ |
| Screening | 34 (26.0) | [ |
| Diagnosis | 52 (39.7) | [ |
| Treatment | 18 (13.7) | [ |
Study analysis for journal publications on the prediction phase.
| Reference | Objective, data set, and methodology | Performance and remarks |
| [ | Objective: ALL^a relapse prediction. Data set: 336 newly diagnosed children with ALL. Methodology: Random forest algorithm. | Performance: Accuracy: 0.829. AUC^b: 0.902. Usage of 4 ML^c algorithms and 104 features. Good model performance in all risk-level groups. Adoption of a special feature selection strategy: 100-fold Monte Carlo cross validation combined with 10-fold cross validation. Data set imbalance (relapsed and nonrelapsed children). Strong predictors were excluded from the variable set. 10-fold cross validation. |
| [ | Objective: Prediction of patients with CML^d and non-CML using complete blood count records. Data set: Complete blood count records of 1623 patients with a BCR-ABL1 test extracted from the US Veterans Health Administration. Methodology: XGBoost and LASSO. | Performance: AUC range: 0.87-0.96 at the time of diagnosis. Use of 2 models. Use of 2 feature selection methods. Imbalanced data set (predominant gender is male). Nonstandard data collection process. Split sample validation (20% of the data for validation). |
| [ | Objective: Leukemia detection based on biomedical data. Data set: 401 leukemia datapoints from Z H Sikder Medical College and Hospital. Methodology: Decision tree. | Performance: Accuracy: 100%. Use of 4 supervised ML algorithms. Overfitting. 10-fold cross validation. |
| [ | Objective: Prediction of leukemia survivability. Data set: 131,615 records and 133 attributes for patients with leukemia from the SEER^e database. Methodology: Deep neural network model. | Performance: Accuracy: 74.85%. Use of a DNN^f ensemble method. Many problems in the leukemia data set (redundant attributes, missing values, and unknown values). 10-fold cross validation. Ensemble method. |
| [ | Objective: Predictive identification of patients at risk during treatment. Data set: 737 samples of patients diagnosed with CLL^g at Mayo Clinic. Methodology: Logistic regression, support vector machine, gradient boosting machine, and random forest. | Performance: ROC^h-AUC: above 80%. Binary classification outperforms survival analytic methods. Lack of actionable information provided by the ML algorithms. 100 runs of 5-fold cross validation. |
^a ALL: acute lymphoblastic leukemia.
^b AUC: area under the curve.
^c ML: machine learning.
^d CML: chronic myeloid leukemia.
^e SEER: Surveillance, Epidemiology, and End Results.
^f DNN: deep neural network.
^g CLL: chronic lymphocytic leukemia.
^h ROC: receiver operating characteristic.
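Several studies in the table above report the area under the ROC curve. As a rough illustration of what that number measures, the sketch below computes AUC as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are assumptions made for this example; they are not data from any reviewed study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: iterable of 0/1 class labels (1 = positive).
    scores: iterable of model scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not from any study in the review).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; rank-based formulations give the same value more efficiently on large data sets.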
Study analysis for journal papers in the screening phase.
| Reference | Objective, data set, and methodology | Performance and remarks |
| [ | Objective: Classification of white blood cell leukemia. Data set: Acute Lymphoblastic Leukemia Image Database for Image Processing 1 and 2. Methodology: A hybrid model (CNN^a and SESSA^b). | Performance: Accuracy: 99.2%. Sensitivity: 100%. Powerful performance using CNN. Use of the salp swarm optimization method. Hybrid classification method. Use of transfer learning. Small limited data set insufficient to train CNNs. 5-fold internal cross validation and 20% testing (external validation). |
| [ | Objective: Automated identification of acute lymphoblastic leukemia. Data set: Blood smear images obtained from the Department of Hematology at the University Hospital Ostrava. Methodology: Support vector machine/artificial neural networks. | Performance: Accuracy: 98.19%. High classification accuracy. Successful feature selection. Extensive preprocessing is required. Lack of medical data sets. Inability to generalize the results and trends for lack of comparison with other methods. 10-fold cross validation repeated 10 times. |
| [ | Objective: Classification of chronic myeloid leukemia phases. Data set: 500 pictures from Patliputra Medical College and Hospital, Dhanbad, and the blood journal repository. Methodology: CNN. | Performance: Accuracy: 97.8%. Use of transfer learning. Limited data set. Internal validation (14 left for testing). |
^a CNN: convolutional neural network.
^b SESSA: statistically enhanced salp swarm algorithm.
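The screening studies above report accuracy and sensitivity separately, and the two can diverge. A minimal sketch of how these metrics derive from confusion-matrix counts follows; the function name `confusion_metrics` and the counts passed to it are hypothetical and chosen only to show that 100% sensitivity does not imply 100% accuracy.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute common classification metrics from confusion-matrix counts.

    tp/fp/tn/fn: true positives, false positives, true negatives, false negatives.
    A metric whose denominator is zero is returned as NaN.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall on the positive class
        "specificity": ratio(tn, tn + fp),   # recall on the negative class
        "precision": ratio(tp, tp + fp),
    }


# Hypothetical counts: every positive is caught, but 2 negatives are misflagged.
m = confusion_metrics(tp=50, fp=2, tn=48, fn=0)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, no positive case is missed (sensitivity 1.0), yet the two false positives lower accuracy, specificity, and precision.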
Study analysis for journal publications on the diagnosis phase.
| Reference | Objective, data set, and methodology | Performance and remarks |
| [ | Objective: Classification of mature B-cell neoplasm. Data set: 20,622 routine diagnostic samples from Munich Leukemia Laboratory. Methodology: CNN-SOM^a transformation. | Performance: Accuracy: 95%. Large data set. High accuracy. Nonuniform distribution of misclassifications due to similarity in flow cytometric profiles. 10% validation split. |
| [ | Objective: Detection of immature leukocytes and their classification into 4 types. Data set: Images extracted from a publicly available data set at The Cancer Imaging Archive. Methodology: Random forest algorithm. | Performance: Accuracy: 92.99%. High precision results for each class. High number of false positives leading to low precision and specificity. 5-fold cross validation. |
| [ | Objective: Identification of the leukemia type based on patient genetic expression. Data set: A sample of 7129 genes that represent the genetic expressions of 72 people from Kaggle. Methodology: XGBoost, artificial neural networks, and random forest algorithm. | Performance: Random forest accuracy: 80.8%. XGBoost accuracy: 92.3%. Use of principal component analysis for dimensionality reduction and faster computation. Use of grid search for the best hyperparameter selection. Small data set (72 people). Internal validation (65%/35% split). |
| [ | Objective: Classification of lymphocytic cells. Data set: The ALL-IDB2 Database. Methodology: Bare bones particle swarm optimization–based feature optimization. | Performance: Accuracy: 94.94%-96.25%. A good performance on capturing prognostic chronic myeloid leukemia markers by the model. Challenge of capturing relationships between data types with no information loss in clinical clustering. Validation on an external independent clinical trial. |
| [ | Objective: Detection of leukemia and its types. Data set: 220 blood smear images from healthy individuals and patients with leukemia. Methodology: Support vector machine. | Performance: Accuracy: Above 80%. Use of 3 segmentation methods. Broader range of leukemia classification (types and subtypes). Costly method based on imaging data. Internal validation (train test split). |
| [ | Objective: Automated detection of malignant lymphoma. Data set: Prepared histopathologic images (388 sections, 259 diffuse large B-cell lymphomas, 89 follicular lymphomas, and 40 reactive lymphoid hyperplasia). Methodology: Deep neural network classifier. | Performance: Accuracy: 97%. High accuracy outperforming 7 pathologists. Model ensemble comprising 3 classifiers. Classifier requires a manual annotation. Model not able to classify all the subtypes. K-fold cross validation repeated 5 times. |
| [ | Objective: Multiclassification of leukemia. Data set: 100 blood smear images. Methodology: Neural network classifiers. | Performance: Accuracy: 97.7%. Two-step neural network classifier. Limited data set (100 blood smear images). Internal validation (90 images used for training and 10 kept for validation). |
| [ | Objective: Leukemia and lymphoma diagnosis. Data set: 283 blood and bone marrow sample images from patients with leukemia and lymphoma. Methodology: Decision tree. | Performance: Correctness: 95%. Application of the LASSO algorithm for regularization. Model robustness and strength against false negatives. Complexity of the decision tree and the risk of overfitting through the production of too large trees. 30-fold cross validation. |
| [ | Objective: Leukemia image segmentation. Data set: The Acute Lymphoblastic Leukemia Image Database. Methodology: HSCRKM^b/particle swarm optimization/K-means. | Performance: Accuracy: 80% and above. Use of 7 machine learning methods. Application of soft covering rough approximation. Suitable for medical images only. Application on multiple color images increases the processing time. Different train/test sizes were used for model evaluation. |
| [ | Objective: Determining the most predictive features for acute lymphoblastic leukemia identification. Data set: 94 pediatric patient samples collected from the Department of Hematology and Oncology, Children Hospital and Institute of Child Health, Lahore. Methodology: Random forest, boosting machine, C5.0 decision tree, and classification and regression trees. | Performance: Accuracy: 87.4%. High accuracy. Balanced data set. Small-scale study. Few machine learning models. Socioeconomic risk factors not selected automatically. Internal validation (train/validation data). 10-fold cross validation. |
| [ | Objective: Leukemia diagnosis and its subtypes. Data set: 200 blood smear images extracted from Vidyalankar Institute of Technology, Mumbai and online databases. Methodology: Support vector machine. | Performance: Accuracy: 97.8%. Good detection accuracy. Thorough image segmentation process. Challenging detection process due to the irregularity of the cancer cell’s shape and nucleus. Use of only support vector machine for classification. |
^a CNN-SOM: convolutional neural network-self-organizing map.
^b HSCRKM: histogram-based soft covering rough K-means clustering.
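Several diagnosis studies above rely on a single internal hold-out split (for example, 65%/35% or 90/10) rather than cross validation. The sketch below shows one way such a split could be drawn reproducibly. The function name `holdout_split`, the seed, and the rounding rule are assumptions made for illustration; the reviewed studies do not state how their splits were generated.

```python
import random


def holdout_split(n_samples, test_fraction, seed=0):
    """Return (train_indices, test_indices) for one shuffled hold-out split.

    The test set size is n_samples * test_fraction rounded to the nearest
    integer; this rounding rule is an illustrative choice.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    n_test = round(n_samples * test_fraction)
    if n_test == 0 or n_test == n_samples:
        raise ValueError("split would leave the train or test set empty")
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # seeded so the split is repeatable
    return sorted(order[n_test:]), sorted(order[:n_test])


# Illustrative sizes taken from the table: 72 people split 65%/35%.
train_idx, test_idx = holdout_split(72, 0.35)
print(len(train_idx), len(test_idx))
```

With small samples, a single split of this kind yields an estimate that depends heavily on which cases land in the test set, which is one reason the table flags small data sets as a limitation.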
Study analysis for journal publications on the treatment phase.
| Reference | Objective, data set, and methodology | Performance and remarks |
| [ | Objective: Digital analysis of blood smears and preclassification of cells. Data set: Images of blood smears from a hematologic laboratory. Methodology: MERGE algorithm. | Performance: Accuracy: 90%. Introduction of a new computational and statistical method to determine gene markers. Small data set comprising only 30 patients with acute myeloid leukemia. Leave-one-out cross validation. |
| [ | Objective: Prediction of complete remission of acute myeloid leukemia. Data set: 473 bone marrow samples from the Children’s Oncology Group. Methodology: K-nearest neighbor, support vector machine, and hill climbing. | Performance: Area under the curve: 0.84. Use of 3 feature selection algorithms: randomized LASSO, recursive feature elimination, and hill climbing. Use of 3 classifiers: support vector machine, random forest, and K-nearest neighbor. Small data set. 100 iterations of a 5-fold cross validation. |
| [ | Objective: Identify the right patterns to improve risk stratification of patients with CLL^a. Data set: (1) the first cohort comprised CLL cells of 196 individuals; (2) the second cohort comprised CLL cells of 98 individuals including their clinical data and RNA-seq. Methodology: (1) EM algorithm and the Gaussian mixture models; (2) Boosted tree ensemble method. | Performance: Precision: 90%. High accuracy and precision. Large data set and 5-year monitoring is required. External validation on an independent cohort. |
^a CLL: chronic lymphocytic leukemia.
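The validation schemes named in the table above (repeated 5-fold cross validation and leave-one-out cross validation) differ only in how sample indices are partitioned. A minimal standard-library sketch of both follows. The function names, the seed, and the sample size of 30 (borrowed from the 30-patient study) are illustrative assumptions rather than the procedures the authors actually ran.

```python
import random


def k_fold_indices(n_samples, k, rng):
    """Yield (train, test) index lists for one shuffled k-fold pass."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    order = list(range(n_samples))
    rng.shuffle(order)
    # Deal shuffled indices round-robin so fold sizes differ by at most 1.
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_k_fold(n_samples, k, repeats, seed=0):
    """Yield splits for `repeats` independent reshuffled k-fold passes."""
    rng = random.Random(seed)
    for _ in range(repeats):
        yield from k_fold_indices(n_samples, k, rng)


def leave_one_out(n_samples):
    """Yield splits in which each sample is the sole test case exactly once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


# 100 repetitions of 5-fold cross validation on 30 samples: 500 splits.
splits = list(repeated_k_fold(n_samples=30, k=5, repeats=100))
loo = list(leave_one_out(30))
print(len(splits), len(loo))
```

Each split would then be used to fit a model on the training indices and score it on the test indices; averaging the scores over all splits reduces the dependence on any single partition.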