Heewon Chung, Chul Park, Wu Seong Kang, Jinseok Lee.
Abstract
Artificial intelligence (AI) technologies have been applied in various medical domains to predict patient outcomes with high accuracy. As AI becomes more widely adopted, the problem of model bias is increasingly apparent. In this study, we investigate the model bias that can occur when a model is trained on data from only one gender, and we aim to present new insights into the bias issue. For the investigation, we considered an AI model that predicts severity at an early stage based on the medical records of coronavirus disease (COVID-19) patients. For 5,601 confirmed COVID-19 patients, we used 37 medical records, namely, basic patient information, physical index, initial examination findings, clinical findings, comorbidity diseases, and general blood test results at an early stage. To investigate the gender-based AI model bias, we trained and evaluated two separate models: one trained using only the male group, and the other using only the female group. When the model trained on the male-group data was applied to the female testing data, overall performance decreased: sensitivity from 0.93 to 0.86, specificity from 0.92 to 0.86, accuracy from 0.92 to 0.86, balanced accuracy from 0.93 to 0.86, and area under the curve (AUC) from 0.97 to 0.94. Similarly, when the model trained on the female-group data was applied to the male testing data, overall performance again decreased: sensitivity from 0.97 to 0.90, specificity from 0.96 to 0.91, accuracy from 0.96 to 0.91, balanced accuracy from 0.96 to 0.90, and AUC from 0.97 to 0.95. Furthermore, when we evaluated each gender-dependent model with test data from the same gender used for training, the resulting accuracy was also lower than that of the unbiased model.
Keywords: COVID-19; artificial intelligence bias; feature importance; gender dependent bias; severity prediction
Year: 2021 PMID: 34912242 PMCID: PMC8667070 DOI: 10.3389/fphys.2021.778720
Source DB: PubMed Journal: Front Physiol ISSN: 1664-042X Impact factor: 4.566
Medical records used in developing AI model for severity prediction.
| No. | Category | Medical record |
|---|---|---|
| 1 | Basic patient information | Age |
| 2 | | Gender |
| 3 | | Pregnancy |
| 4 | | Pregnancy week |
| 5 | Physical index | Body mass index |
| 6 | Initial examination findings | Systolic blood pressure |
| 7 | | Diastolic blood pressure |
| 8 | | Heart rate |
| 9 | | Temperature |
| 10 | Clinical findings | Fever |
| 11 | | Cough |
| 12 | | Sputum production |
| 13 | | Sore throat |
| 14 | | Runny nose/rhinorrhea |
| 15 | | Muscle aches/myalgia |
| 16 | | Fatigue/malaise |
| 17 | | Shortness of breath/dyspnea |
| 18 | | Headache |
| 19 | | Altered consciousness/confusion |
| 20 | | Vomiting/nausea |
| 21 | | Diarrhea |
| 22 | Current or previous comorbidity diseases | Diabetes mellitus |
| 23 | | Hypertension |
| 24 | | Heart failure |
| 25 | | Chronic cardiac disease |
| 26 | | Asthma |
| 27 | | Chronic obstructive pulmonary disease |
| 28 | | Chronic kidney disease |
| 29 | | Cancer |
| 30 | | Chronic liver disease |
| 31 | | Rheumatism/autoimmune diseases |
| 32 | | Dementia |
| 33 | General blood test results | Hemoglobin |
| 34 | | Hematocrit |
| 35 | | Lymphocyte |
| 36 | | Platelets |
| 37 | | White blood cell |
AI, artificial intelligence.
Number of data groups for training and testing based on gender and severity.
| Dataset | Gender | Non-severe | Severe | Total |
|---|---|---|---|---|
| Training data | Male | 1,732 | 116 | 1,848 |
| | Female | 2,535 | 97 | 2,632 |
| Testing data | Male | 434 | 28 | 462 |
| | Female | 629 | 30 | 659 |
| Total | | 5,330 | 271 | 5,601 |
FIGURE 1. Results of the ranked feature importance values for the male group using (A) RF, (B) XGBoost, (C) AdaBoost, and (D) average after normalization.
FIGURE 2. Results of the ranked feature importance values for the female group using (A) RF, (B) XGBoost, (C) AdaBoost, and (D) average after normalization.
FIGURE 3. Cross-validation performance using the metrics of AUC and balanced accuracy: (A) male group and (B) female group. AUC, area under the curve.
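Panel (D) of Figures 1 and 2 averages the feature importance values of the three tree-based models after normalization. The record does not say which normalization was used, so the sketch below assumes min-max scaling to [0, 1] per model. The importance values are hypothetical placeholders rather than results from the study, and `min_max` and `average_rank` are illustrative helper names.

```python
def min_max(values):
    """Scale one model's importance values to [0, 1] (assumed normalization)."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {name: (v - lo) / span if span else 0.0 for name, v in values.items()}


def average_rank(importances_by_model):
    """Average the normalized importances across models and rank features."""
    normalized = [min_max(imp) for imp in importances_by_model.values()]
    features = list(normalized[0].keys())
    averaged = {
        f: sum(model[f] for model in normalized) / len(normalized)
        for f in features
    }
    # Highest averaged importance first.
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical raw importances for three of the 37 medical records.
importances = {
    "RF": {"Age": 0.30, "Lymphocyte": 0.20, "Cough": 0.05},
    "XGBoost": {"Age": 0.50, "Lymphocyte": 0.10, "Cough": 0.02},
    "AdaBoost": {"Age": 0.12, "Lymphocyte": 0.08, "Cough": 0.04},
}

ranked = average_rank(importances)
for feature, score in ranked:
    print(f"{feature}: {score:.3f}")
```

Normalizing before averaging keeps a model with a larger raw importance scale from dominating the combined ranking.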
Cross-validation results (mean ± SD) from male- and female-group models.
| Group | Fold | N | TN | FP | FN | TP | Sensitivity | Specificity | Accuracy | Balanced accuracy | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Male | 1 | 369 | 310 | 35 | 4 | 20 | 0.8333 | 0.8986 | 0.8943 | 0.8659 | 0.9266 |
| | 2 | 370 | 314 | 31 | 5 | 20 | 0.8000 | 0.9101 | 0.9027 | 0.8551 | 0.9250 |
| | 3 | 370 | 324 | 22 | 4 | 20 | 0.8333 | 0.9364 | 0.9297 | 0.8849 | 0.9458 |
| | 4 | 370 | 326 | 21 | 3 | 20 | 0.8696 | 0.9395 | 0.9351 | 0.9045 | 0.9473 |
| | 5 | 369 | 316 | 33 | 4 | 16 | 0.8000 | 0.9054 | 0.8997 | 0.8527 | 0.8930 |
| | Mean | 1,848 | 318 | 28.4 | 4 | 19.2 | 0.83 ± 0.03 | 0.91 ± 0.02 | 0.91 ± 0.02 | 0.87 ± 0.02 | 0.93 ± 0.02 |
| Female | 1 | 526 | 456 | 51 | 1 | 18 | 0.9474 | 0.8994 | 0.9011 | 0.9234 | 0.9536 |
| | 2 | 536 | 478 | 36 | 4 | 18 | 0.8182 | 0.9300 | 0.9254 | 0.8741 | 0.9398 |
| | 3 | 526 | 491 | 8 | 3 | 24 | 0.8889 | 0.9840 | 0.9791 | 0.9364 | 0.9597 |
| | 4 | 526 | 488 | 22 | 1 | 15 | 0.9375 | 0.9569 | 0.9563 | 0.9472 | 0.9648 |
| | 5 | 518 | 466 | 39 | 3 | 10 | 0.7692 | 0.9228 | 0.9189 | 0.8460 | 0.9240 |
| | Mean | 2,632 | 475.8 | 31.2 | 2.4 | 17 | 0.87 ± 0.07 | 0.94 ± 0.03 | 0.94 ± 0.03 | 0.91 ± 0.03 | 0.95 ± 0.01 |
*AUC, area under the curve.
*TN: true negatives, FP: false positives, FN: false negatives, TP: true positives.
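The sensitivity, specificity, accuracy, and balanced accuracy columns follow directly from the four confusion-matrix counts. As a minimal sketch, the function below applies the standard definitions to the first male-group fold (TN = 310, FP = 35, FN = 4, TP = 20). The function name `confusion_metrics` is illustrative, and AUC is omitted because it cannot be derived from the counts alone.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (severe cases found)
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
    }


# Counts from the first male-group cross-validation fold.
fold1 = confusion_metrics(tn=310, fp=35, fn=4, tp=20)
for name, value in fold1.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these values agree with the corresponding row of the table (0.8333, 0.8986, 0.8943, 0.8659).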
Testing data results from the same gender data.
| Gender | Model | TN | FP | FN | TP | Sensitivity | Specificity | Accuracy | Balanced accuracy | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Male | Random forest | 376 | 58 | 5 | 23 | 0.8214 | 0.8664 | 0.8636 | 0.8439 | 0.9236 |
| | XGBoost | 402 | 32 | 8 | 20 | 0.7143 | 0.9263 | 0.9134 | 0.8203 | 0.9115 |
| | AdaBoost | 401 | 33 | 4 | 24 | 0.8571 | 0.9240 | 0.9199 | 0.8906 | 0.9366 |
| | DNN | 401 | 33 | 2 | 26 | 0.9286 | 0.9240 | 0.9242 | 0.9263 | 0.9660 |
| Female | Random forest | 593 | 36 | 3 | 27 | 0.9000 | 0.9428 | 0.9408 | 0.9214 | 0.8596 |
| | XGBoost | 612 | 17 | 8 | 22 | 0.7333 | 0.9730 | 0.9621 | 0.8532 | 0.8365 |
| | AdaBoost | 559 | 70 | 2 | 28 | 0.9333 | 0.8887 | 0.8907 | 0.9110 | 0.8574 |
| | DNN | 587 | 42 | 2 | 28 | 0.9333 | 0.9332 | 0.9332 | 0.9333 | 0.9539 |
*AUC, area under the curve; DNN, deep neural network.
*TN: true negatives, FP: false positives, FN: false negatives, TP: true positives.
Testing data results from different gender data.
| Testing data | Model | TN | FP | FN | TP | Sensitivity | Specificity | Accuracy | Balanced accuracy | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Female | Trained by all (unbiased) | 605 | 24 | 1 | 29 | 0.9667 | 0.9618 | 0.9621 | 0.9643 | 0.9727 |
| | Trained by male-group only (biased) | 570 | 59 | 3 | 27 | 0.9000 | 0.9062 | 0.9059 | 0.9031 | 0.9499 |
| | Random forest (biased) | 536 | 93 | 2 | 28 | 0.9333 | 0.8521 | 0.8558 | 0.8927 | 0.9479 |
| | XGBoost (biased) | 611 | 18 | 8 | 22 | 0.7333 | 0.9714 | 0.9605 | 0.8524 | 0.9227 |
| | AdaBoost (biased) | 593 | 36 | 6 | 24 | 0.8000 | 0.9428 | 0.9363 | 0.8714 | 0.9495 |
| Male | Trained by all (unbiased) | 407 | 27 | 2 | 26 | 0.9286 | 0.9378 | 0.9372 | 0.9332 | 0.9795 |
| | Trained by female-group only (biased) | 375 | 59 | 4 | 24 | 0.8571 | 0.8641 | 0.8636 | 0.8606 | 0.9435 |
| | Random forest (biased) | 405 | 29 | 7 | 21 | 0.7500 | 0.9332 | 0.9221 | 0.8416 | 0.9338 |
| | XGBoost (biased) | 406 | 28 | 6 | 22 | 0.7857 | 0.9355 | 0.9264 | 0.8606 | 0.9398 |
| | AdaBoost (biased) | 374 | 60 | 4 | 24 | 0.8571 | 0.8618 | 0.8615 | 0.8597 | 0.9449 |
*AUC, area under the curve.
*TN: true negatives, FP: false positives, FN: false negatives, TP: true positives.
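One way to read this table is as a per-metric gap between the model trained on all patients and the model trained on the other gender only. The sketch below computes that gap for the female testing data from the reported confusion-matrix counts. The helper names `metrics` and `bias_gap` are illustrative, and the gap is a simple difference, not a statistic defined in the study.

```python
def metrics(tn, fp, fn, tp):
    """Sensitivity, specificity, accuracy, balanced accuracy from counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "balanced_accuracy": (sens + spec) / 2,
    }


def bias_gap(unbiased_counts, biased_counts):
    """Per-metric drop when moving from the unbiased to the biased model."""
    unbiased = metrics(**unbiased_counts)
    biased = metrics(**biased_counts)
    return {name: unbiased[name] - biased[name] for name in unbiased}


# Female testing data: counts copied from the table above.
trained_by_all = {"tn": 605, "fp": 24, "fn": 1, "tp": 29}
trained_by_male_only = {"tn": 570, "fp": 59, "fn": 3, "tp": 27}

gap = bias_gap(trained_by_all, trained_by_male_only)
for name, drop in gap.items():
    print(f"{name}: drop of {drop:.4f}")
```

For these two rows every metric drops by roughly 0.06 to 0.07, so the loss is spread across both classes and is not confined to the severe cases.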