Sharmin Afrose, Wenjia Song, Charles B Nemeroff, Chang Lu, Danfeng Daphne Yao.
Abstract
Background: Many clinical datasets are intrinsically imbalanced, dominated by overwhelming majority groups. Off-the-shelf machine learning models that optimize the prognosis of majority patient types (e.g., healthy class) may cause substantial errors on the minority prediction class (e.g., disease class) and demographic subgroups (e.g., Black or young patients). In the typical one-machine-learning-model-fits-all paradigm, racial and age disparities are likely to exist but go unreported. In addition, some widely used whole-population metrics give misleading results.
Keywords: Cancer; Prognosis
Year: 2022 PMID: 36059892 PMCID: PMC9436942 DOI: 10.1038/s43856-022-00165-w
Source DB: PubMed Journal: Commun Med (Lond) ISSN: 2730-664X
Fig. 1 Workflow for improving data balance in machine learning prognosis prediction using double prioritized (DP) bias correction.
Sample Enrichment prepares a number of new training datasets by incrementally enriching a specific demographic subgroup; Candidate Training uses each of the n + 1 datasets to train a candidate machine learning model; Model Selection identifies the optimal model; Prediction applies the selected model to new patient data. AUC-PR denotes the area under the precision-recall curve.
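The four stages in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (`x`, `y`, `group`), the helper names `enrich` and `dp_select`, and the choice to enrich by duplicating the subgroup's minority-class (C1) samples k times are all hypothetical. The selection criterion is left to a caller-supplied `score` function, which could for example be AUC-PR on a validation set.

```python
from typing import Any, Callable, Dict, List, Tuple

Sample = Dict[str, Any]  # hypothetical record: {"x": ..., "y": 0 or 1, "group": str}


def enrich(train: List[Sample], group: str, k: int) -> List[Sample]:
    """Sample Enrichment (assumed form): append k extra copies of the
    class-1 samples that belong to `group`; k = 0 returns the original data."""
    extra = [s for s in train if s["group"] == group and s["y"] == 1]
    return list(train) + extra * k


def dp_select(
    train: List[Sample],
    group: str,
    n: int,
    fit: Callable[[List[Sample]], Any],
    score: Callable[[Any], float],
) -> Tuple[int, Any]:
    """Candidate Training + Model Selection: train one candidate per
    enrichment level k = 0..n (n + 1 datasets) and keep the best-scoring one."""
    best_score, best_k, best_model = None, None, None
    for k in range(n + 1):
        model = fit(enrich(train, group, k))   # Candidate Training
        s = score(model)                       # e.g. validation AUC-PR (assumption)
        if best_score is None or s > best_score:
            best_score, best_k, best_model = s, k, model
    return best_k, best_model


# Toy usage with stand-in fit/score functions (not a real classifier).
toy_train = (
    [{"x": i, "y": 1, "group": "A"} for i in range(2)]
    + [{"x": i, "y": 0, "group": "B"} for i in range(3)]
)
toy_fit = len                              # "model" is just the dataset size
toy_score = lambda m: -abs(m - 9)          # pretend size 9 validates best
chosen_k, chosen_model = dp_select(toy_train, "A", 4, toy_fit, toy_score)
```

In a real pipeline `fit` would train a classifier and the returned model would then be applied to new patient data in the Prediction stage.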
Fig. 2 Recall values for both classes C0 and C1 and training data statistics for the in-hospital mortality (IHM) and the 5-year breast cancer survivability (BCS) tasks.
a Percentage of the minority class C1, Recall C0, and Recall C1 of each subgroup of the MIMIC dataset for the IHM task. Statistics of b prediction class distribution, c racial group distribution, and d age group distribution for the MIMIC IHM dataset. The MIMIC IHM training set consists of 45.1% female samples and 54.8% male samples. e Percentage of the minority class C1, Recall C0, and Recall C1 of each subgroup of the SEER dataset for the BCS task. Statistics of f prediction class distribution, g racial group distribution, and h age group distribution for the SEER BCS dataset. The SEER BCS training set consists of 99.4% female samples and 0.6% male samples.
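The per-subgroup Recall C0 and Recall C1 values reported in this figure can be computed as in the following sketch. It assumes binary labels (0 for C0, 1 for C1) and one subgroup label per sample; the function name `recall_by_subgroup` and the toy data are hypothetical and do not come from the MIMIC or SEER datasets.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional, Sequence


def recall_by_subgroup(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    groups: Sequence[Hashable],
) -> Dict[Hashable, Dict[str, Optional[float]]]:
    """Return per-subgroup recall for class 0 and class 1.

    Recall for a class is (correctly predicted samples of that class) /
    (all samples of that class); it is None when the subgroup has no
    samples of that class.
    """
    # group -> class -> [correct, total]
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for t, p, g in zip(y_true, y_pred, groups):
        counts[g][t][1] += 1
        if t == p:
            counts[g][t][0] += 1

    result = {}
    for g, per_class in counts.items():
        result[g] = {
            f"recall_c{cls}": (
                per_class[cls][0] / per_class[cls][1] if per_class[cls][1] else None
            )
            for cls in (0, 1)
        }
    return result


# Toy data: subgroup "B" looks acceptable on overall accuracy but has
# zero recall on the minority class C1.
y_true = [0, 0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B"]
recalls = recall_by_subgroup(y_true, y_pred, groups)
overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Reporting recall separately per class and per subgroup, rather than a single whole-population score, is what makes this kind of gap visible.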