Erdal Tasci, Ying Zhuge, Kevin Camphausen, Andra V Krauze.
Abstract
Recent technological developments have led to an increase in the size and types of data in the medical field derived from multiple platforms such as proteomic, genomic, imaging, and clinical data. Many machine learning models have been developed to support precision/personalized medicine initiatives such as computer-aided detection, diagnosis, prognosis, and treatment planning by using large-scale medical data. Bias and class imbalance represent two of the most pressing challenges for machine learning-based problems, particularly in medical (e.g., oncologic) data sets, due to the limitations in patient numbers, cost, privacy, and security of data sharing, and the complexity of generated data. Depending on the data set and the research question, the methods applied to address class imbalance problems can provide more effective, successful, and meaningful results. This review discusses the essential strategies for addressing and mitigating the class imbalance problems for different medical data types in the oncologic domain.
Keywords: artificial intelligence; class imbalance; clinical data; machine learning; oncology
Year: 2022 PMID: 35740563 PMCID: PMC9221277 DOI: 10.3390/cancers14122897
Source DB: PubMed Journal: Cancers (Basel) ISSN: 2072-6694 Impact factor: 6.575
Figure 1. The number of publications on class imbalance listed in PubMed by year for 2012 to 2021 (blue = total publications per year on class imbalance, orange = total publications per year on class imbalance in oncology, grey = the number of publications on class imbalance pertaining to oncology as a percentage of the total number of publications that year) [8].
Figure 2. Clinical, imaging, omic, and outcome-based algorithms (left panel) are generated from data sets that harbor underrepresentation or overrepresentation of one or several categories related to demographics (gender, race, social determinants of health; right top panel), management (extent of surgical resection, type of upfront management, management upon recurrence; right middle panel), and disease characteristics (molecular classification, response to systemic management and radiation therapy; right lower panel). Class imbalance affects ancillary data sets such as imaging, omics based on biospecimens, and outcomes (middle panel). The resulting algorithms have limited reproducibility in other cohorts (Study population, B, C, and a theoretical infinite other population that does not share identical class imbalance). Because both the sources and the extent of over- and underrepresentation of certain classes differ in other populations/data sets (lower panel), defining features reflective of class imbalance and mitigating these via compensatory methods in all data subtypes can address the lack of reproducibility and help identify additional features that can be used to further optimize algorithms, allowing for transferable results (lower panel).
Pertinent literature addressing approaches to class imbalance.
| Study | Technique | Setting |
|---|---|---|
| Bose et al. [ | Convolutional neural networks (CNN), long short-term memory (LSTM), over-sampling | Radiation Oncology, incident reporting |
| Brown et al. [ | Under-sampling | Radiation Oncology, prostate cancer |
| Liu et al. [ | Synthetic Minority Oversampling Technique (SMOTE) | Glioblastoma imaging |
| Suarez-Garcia et al. [ | Under-sampling | Glioma imaging |
| Li et al. [ | Under-sampling, K-Means++ and learning vector quantization (LVQ) | Liver cancer |
| Isensee et al. [ | Deep learning, nnU-net | Brain tumor segmentation |
| Goyal et al. [ | Recognition-based, one class classification | Real-world data sets across different domains: tabular data, images (CIFAR and ImageNet), audio, and time-series |
| Gao et al. [ | Recognition-based, one class classification | Medical imaging |
| Welch et al. [ | Recognition-based | Head and Neck Radiation therapy |
| Leevy et al. [ | Cost-sensitive | General |
| Nguyen et al. [ | Cost-sensitive, comparison of techniques | SMOTE and Deep Belief Network (DBN) against |
| Milletari et al. [ | Cost-sensitive | Medical image segmentation |
| Lin et al. [ | Cost-sensitive, Focal loss | Medical imaging |
| Jaeger et al. [ | Cost-sensitive, Focal loss | Medical imaging object detection |
| Xiong et al. [ | Cost-Sensitive Naive Bayes Stacking Ensemble | Various malignancy data sets and data types |
| Shon et al. [ | Cost-sensitive | Kidney cancer data (TCGA) |
| Dong et al. [ | Ensemble-Learning | General |
| Sagi et al. [ | Ensemble-Learning | General |
| Tang et al. [ | Ensemble-Learning, bagging-based | Transcriptome and functional proteomics data breast cancer |
| Le et al. [ | Ensemble-Learning, ResNet50 CNN | Skin cancer |
| Wang et al. [ | Ensemble-Learning, multi-layer perceptron (MLP)-based | Gastric cancer |
| Khushi et al. [ | Comparative Performance Analysis of Data | Various malignancy data sets and data types |
| Chen et al. [ | Combination of methods | General |
| Zhao et al. [ | Random Under-Sampling Boost (RUSBoost) | Colorectal cancer, microarray data |
| Urdal et al. [ | Random Under-Sampling Boost (RUSBoost) | Urothelial carcinoma, histopathology data |
| Mirza et al. [ | Integrative analysis of biomedical big data | |
| Guyon et al. [ | Variable and feature selection overview | |
| Hilario et al. [ | Class imbalance in proteomic biomarker studies overview | |
| Tibshirani et al. [ | LASSO (Least Absolute Shrinkage and Selection Operator) overview | |
| Yan et al. [ | Graph- and kernel-based—omics data integration algorithms | |
| Fawcett et al. [ | ROC analysis overview | |
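
Several of the techniques listed in the table (over-sampling, under-sampling, cost-sensitive learning) rest on a simple idea that can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the method of any cited study: the function names `random_resample` and `inverse_frequency_weights` and the toy 90/10 label split are hypothetical. It uses plain random resampling rather than SMOTE, which synthesizes new minority samples instead of duplicating existing ones.

```python
import random
from collections import Counter


def random_resample(samples, labels, mode="over", seed=0):
    """Balance classes by random over- or under-sampling (illustrative only)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    sizes = [len(xs) for xs in by_class.values()]
    # Over-sampling grows every class to the largest class size;
    # under-sampling shrinks every class to the smallest class size.
    target = max(sizes) if mode == "over" else min(sizes)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            chosen = rng.sample(xs, target)  # draw without replacement
        else:
            # Keep all originals and add random duplicates of minority samples.
            chosen = xs + rng.choices(xs, k=target - len(xs))
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


def inverse_frequency_weights(labels):
    """Cost-sensitive class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


# Hypothetical toy cohort: 90 majority-class and 10 minority-class cases.
labels = [0] * 90 + [1] * 10
samples = list(range(100))

_, y_over = random_resample(samples, labels, mode="over")
_, y_under = random_resample(samples, labels, mode="under")
weights = inverse_frequency_weights(labels)
print(Counter(y_over), Counter(y_under), weights)
```

Over-sampling keeps every original case but risks overfitting to duplicated minority samples, while under-sampling discards majority cases. Class weights leave the data untouched and instead make errors on the minority class more costly during training.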
Figure 3. An overview of the existing methods for the class imbalance problem.
Figure 4. An overview of the feature selection methods.