Jonathan Huang, Galal Galal, Mozziyar Etemadi, Mahesh Vaidyanathan.
Abstract
BACKGROUND: Racial bias is a key concern regarding the development, validation, and implementation of machine learning (ML) models in clinical settings. Despite the potential of bias to propagate health disparities, racial bias in clinical ML has yet to be thoroughly examined and best practices for bias mitigation remain unclear.
Keywords: algorithm; algorithmic fairness; artificial intelligence; assessment; bias; clinical machine learning; diagnosis; fairness; machine learning; medical machine learning; mitigation; model; outcome prediction; prediction; race; racial bias; scoping review; score prediction
Year: 2022 PMID: 35639450 PMCID: PMC9198828 DOI: 10.2196/36388
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. The clinical machine learning development workflow (orange boxes) offers several opportunities (blue boxes) to evaluate and mitigate potential biases introduced by the data set or model. Preprocessing methods seek to adjust the existing data set to preempt biases resulting from inadequate data representation or labeling. In-processing methods impose fairness constraints as additional metrics optimized by the model during training or present data in a structured manner to avoid biases in the sampling process. Postprocessing methods account for model biases by adjusting model outputs or changing the way they are used.
Group fairness metrics encountered in this review.
| Term | Description |
| AUROCa | Assesses overall classifier performance by measuring the TPRb and FPRc of a classifier at different thresholds. |
| Average odds | Compares the average of the TPR and FPR for the classification outcome between protected and unprotected groups. |
| Balanced accuracy | A measure of accuracy corrected for data imbalance, calculated as the average of sensitivity and specificity for a group. |
| Calibration | Assesses how well the risk score or probability predictions reflect actual outcomes. |
| Disparate impact | Measures deviation from statistical parity, calculated as the ratio of the rate of the positive outcome between protected and unprotected groups. Ideally, the disparate impact is 1. |
| Equal opportunity | For classification tasks in which one outcome is preferred over the other, equal opportunity is satisfied when the preferred outcome is predicted with equal accuracy between protected and unprotected groups. Ideally, the TPR or FNRd disparity between groups is 0. |
| Equalized odds | The TPR and FPR are equal between protected and unprotected groups. |
| Error rate | Compares the error rate of predictions, calculated as the number of incorrect predictions divided by the total number of predictions, between protected and unprotected groups. Ideally, the error rate disparity between groups is 0. |
| Statistical parity | Statistical parity (also known as demographic parity) is satisfied when the rate of positive outcomes is equal between protected and unprotected groups. |
aAUROC: area under the receiver operating characteristic curve.
bTPR: true-positive rate.
cFPR: false-positive rate.
dFNR: false-negative rate.
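As an illustration of how several of the metrics in the table above can be computed from binary predictions, the following is a minimal Python sketch. It is not code from the review or from any of the included studies. The function names (`group_rates`, `fairness_report`), the toy labels, and the sign convention for differences (protected group minus unprotected group) are assumptions made here for illustration only.

```python
def group_rates(y_true, y_pred):
    """Confusion-matrix rates for one group, with binary 0/1 labels.

    Assumes the group has at least one positive and one negative case;
    otherwise the TPR or FPR denominator would be zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = tp + fn + fp + tn
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return {
        "tpr": tpr,
        "fpr": fpr,
        "fnr": 1.0 - tpr,
        "positive_rate": (tp + fp) / n,
        "error_rate": (fp + fn) / n,
        "balanced_accuracy": (tpr + (1.0 - fpr)) / 2.0,
    }


def fairness_report(y_true, y_pred, group, protected):
    """Compare a protected group with everyone else (the unprotected group).

    Differences use the assumed convention: protected minus unprotected.
    """
    rows = list(zip(y_true, y_pred, group))
    prot = group_rates([t for t, _, g in rows if g == protected],
                       [p for _, p, g in rows if g == protected])
    unprot = group_rates([t for t, _, g in rows if g != protected],
                         [p for _, p, g in rows if g != protected])
    return {
        "statistical_parity_difference":
            prot["positive_rate"] - unprot["positive_rate"],
        "disparate_impact":
            prot["positive_rate"] / unprot["positive_rate"],
        "equal_opportunity_difference": prot["tpr"] - unprot["tpr"],
        "average_odds_difference":
            0.5 * ((prot["fpr"] - unprot["fpr"]) + (prot["tpr"] - unprot["tpr"])),
        "error_rate_difference": prot["error_rate"] - unprot["error_rate"],
    }


# Toy data (hypothetical): 8 patients in group "B" followed by 8 in group "A".
y_true = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0] + [1, 1, 1, 0, 1, 1, 0, 0]
group = ["B"] * 8 + ["A"] * 8

report = fairness_report(y_true, y_pred, group, protected="B")
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the two groups have the same error rate (0.375) and the same balanced accuracy (0.625), yet the disparate impact is 0.6 and the equal opportunity difference is -0.25. Different group fairness metrics can therefore disagree on the same set of predictions.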
Figure 2. PRISMA flowchart of study inclusion. ML: machine learning; PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-analyses.
Study characteristics.
| Author (year) | Clinical objective | How was fairness evaluated? | Was racial bias identified? | How was the AIa model biased? | Was racial bias mitigated? | Protected class |
| Abubakar et al (2020) [ | Identification of images of burns vs healthy skin | Accuracy | Yes | Poor accuracy of models trained on a Caucasian data set and validated on an African data set and vice versa | Yes | Dark-skinned patients, light-skinned patients |
| Allen et al (2020) [ | Intensive care unit (ICU) mortality prediction | Equal opportunity difference (FNRb disparity) | N/Ac | N/A | Yes | Non-White patients |
| Briggs and Hollmén (2020) [ | Prediction of future health care expenditures of individual patients | Balanced accuracy, statistical parity, disparate impact, average odds, equal opportunity | N/A | N/A | Yes | Black patients |
| Burlina et al (2021) [ | Diagnosis of diabetic retinopathy from fundus photography | Accuracy | Yes | Lower diagnostic accuracy in darker-skinned individuals compared to lighter-skinned individuals | Yes | Dark-skinned patients |
| Chen et al (2019) [ | ICU mortality prediction, psychiatric readmission prediction | Error rate (0-1 loss) | Yes | Differences in error rates in ICU mortality between racial groups | No | Non-White patients |
| Gianattasio et al (2020) [ | Dementia status classification | Sensitivity, specificity, accuracy | Yes | Existing algorithms varying in sensitivity and specificity between race/ethnicity groups | Yes | Hispanic, non-Hispanic Black patients |
| Noseworthy et al (2020) [ | Prediction of left ventricular ejection fraction ≤35% from the electrocardiogram (ECG) | AUROCd | No | N/A | No | Non-White patients |
| Obermeyer et al (2019) [ | Prediction of future health care expenditures of individual patients | Calibration | Yes | Black patients with a higher burden than White patients at the same algorithmic risk score | Yes | Black patients |
| Park et al (2021) [ | Prediction of postpartum depression and postpartum mental health service utilization | Disparate impact, equal opportunity difference (TPRe disparity) | Yes | Black women with a worse health status than White women at the same predicted risk level | Yes | Black patients |
| Seyyed-Kalantari et al (2021) [ | Diagnostic label prediction from chest X-rays | Equal opportunity difference (TPR disparity) | Yes | Greater TPR disparity in Hispanic patients | No | Non-White patients |
| Thompson et al (2021) [ | Identification of opioid misuse from clinical notes | Equal opportunity difference (FNR disparity) | Yes | Greater FNR in the Black subgroup than in the White subgroup | Yes | Black patients |
| Wissel et al (2019) [ | Assignment of surgical candidacy score for patients with epilepsy using clinical notes | Regression analysis of the impact of the race variable on the candidacy score | No | N/A | No | Non-White patients |
aAI: artificial intelligence.
bFNR: false-negative rate.
cN/A: not applicable.
dAUROC: area under the receiver operating characteristic curve.
eTPR: true-positive rate.
Bias mitigation methods among reviewed studies.
| Description of strategies used | Effectiveness |
| Preprocessing | |
| Reweighing training data | An equal opportunity difference (FNRa difference) of 0.016 ( The mean fairness measure (average of statistical parity difference, disparate impact measure, average odds difference, and equal opportunity difference) improved to 0.06 from 0.12 for prediction of health care costs [ Disparate impact improved from 0.31 to 0.79, and the equal opportunity (TPRb) difference improved from –0.19 to 0.02 for prediction of postpartum depression development; prediction of mental health service use in pregnant individuals improved from 0.45 to 0.85 and –0.11 to –0.02, respectively [ |
| Combining data sets to increase heterogeneity | The accuracy of skin burn identification increased to 99.5% using a combined data set compared to 83.4% and 87.5% when trained on an African and evaluated on a Caucasian data set and vice versa [ |
| Generating synthetic minority class data | Disparity in diabetic retinopathy diagnostic accuracy improved from 12.5% to 7.5% and 0.5% when augmenting with retina appearance-optimized images and diabetic retinopathy status-optimized images created with a generative adversarial network, respectively [ |
| Adjusting label selection | Improved congruence in health outcomes between groups after developing models to predict other labels for health status besides financial expenditures [ |
| Removing race information from training data | Disparate impact improved from 0.31 to 0.61 and equal opportunity (TPR) difference improved from –0.19 to –0.05 for prediction of postpartum depression development; respective improvements from 0.45 to 0.63 and –0.11 to –0.04 for prediction of mental health service use in pregnant individuals [ |
| In-processing | |
| Use of a regularizer during training | Disparate impact improved, but accuracy and the equal opportunity (TPR) difference decreased when implementing the prejudice remover regularizer in prediction of postpartum depression in pregnant individuals [ |
| Adversarial debiasing | The mean fairness measure (average of statistical parity difference, disparate impact measure, average odds difference, and equal opportunity difference) worsened to 0.07 from 0.05 for prediction of health care costs [ |
| Postprocessing | |
| Calibration | The equal opportunity (FNR) difference improved from 0.15 to 0.03 for identification of opioid misuse [ |
| Reject option-based classification | The mean fairness measure (average of statistical parity difference, disparate impact measure, average odds difference, and equal opportunity difference) improved to 0.09 from 0.15 for prediction of health care costs [ |
| Varying cut-point selection | The equal opportunity (FNR) difference improved from 0.15 to 0.04 for identification of opioid misuse [ The congruence in sensitivity and specificity between groups improved without reduction in accuracy for classification of dementia status [ |
aFNR: false-negative rate.
bTPR: true-positive rate.
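To make the "Reweighing training data" strategy in the table above more concrete, the following is a minimal Python sketch of one common formulation of reweighing. Each training example receives the weight P(group) × P(label) / P(group, label), so that group membership and outcome label become statistically independent in the weighted data. This is an assumed, generic formulation and not necessarily the exact procedure used in the reviewed studies. The function names (`reweighing_weights`, `weighted_positive_rate`) and the toy data are hypothetical.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Return one weight per example: P(group) * P(label) / P(group, label).

    Assumed formulation of reweighing; probabilities are estimated from
    simple counts over the training data passed in.
    """
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, target_group):
    """Weighted share of positive labels (label == 1) within one group."""
    positive = sum(w for g, y, w in zip(groups, labels, weights)
                   if g == target_group and y == 1)
    total = sum(w for g, w in zip(groups, weights) if g == target_group)
    return positive / total


# Toy data (hypothetical): group "A" has 4 of 6 positive labels,
# group "B" has 1 of 4 positive labels.
groups = ["A"] * 6 + ["B"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
for g in ("A", "B"):
    rate = weighted_positive_rate(groups, labels, weights, g)
    print(f"group {g}: weighted positive rate = {rate:.3f}")
```

In this toy data set, the unweighted positive-label rates are 0.667 for group A and 0.25 for group B. After reweighing, both weighted rates equal 0.5, which is the overall positive rate. Weights of this kind would typically be passed to a training routine that accepts per-example weights.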