H Echo Wang, Matthew Landers, Roy Adams, Adarsh Subbaswamy, Hadi Kharrazi, Darrell J Gaskin, Suchi Saria.
Abstract
OBJECTIVE: Health care providers increasingly rely upon predictive algorithms when making important treatment decisions; however, evidence indicates that these tools can lead to inequitable outcomes across racial and socio-economic groups. In this study, we introduce a bias evaluation checklist that gives model developers and health care providers a means to systematically appraise a model's potential to introduce bias.
Keywords: bias; clinical decision-making; health care disparity; hospital readmission; predictive model
Year: 2022 PMID: 35579328 PMCID: PMC9277650 DOI: 10.1093/jamia/ocac065
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 7.942
Figure 1. The PRISMA diagram for selecting common 30-day hospital readmission models.
Bias evaluation checklist to assess the potential for a machine learning model to introduce bias and perpetuate disparate performance across subgroups
| Source of bias | How the bias can arise | Example(s) | Checklist question(s) |
|---|---|---|---|
| Stage 1: Model definition and design | | | |
| Label bias | The use of a biased proxy target variable in place of the ideal prediction target during model learning. | Health systems often rely on prediction algorithms to identify patients for their “high-risk care management” programs. The ideal prediction target for these models is patients’ future health care needs, but algorithms often predict a concrete proxy variable, future health care costs, to represent those needs. Black patients typically have lower health care costs because they are less likely to seek or receive care. Consequently, algorithms that predict future health care costs as a surrogate for future health care needs create disparities in medical decision-making for tens of millions of patients. | Is the prediction target an appropriate proxy for patient health care outcomes or needs? |
| Modeling bias | The use of a model that, due to its design, leads to inequitable outcomes. | (1) One study found colon cancer screening, sinusitis, and accidental injury to be statistically significant predictors in a stroke risk prediction model. These variables are not actually relevant to stroke prediction; they simply reflect high utilization of health care resources. Using them can therefore create performance disparities between patients with low health care utilization and those with high health care utilization. (2) Lending algorithms sometimes make decisions from nonuniversal generalizations, such as the neighborhood in which an applicant lives, instead of applicant-specific data. By using neighborhood-level data and excluding important individual-level inputs, lending models cannot capture the variation within each subpopulation that would result in different outcomes for different individuals. As a result, qualified applicants who live in disqualified neighborhoods are denied loans without merit. | Are there any modeling choices that could lead to bias? For example, are there dependencies between inputs and outcomes that could lead to discriminatory performance across groups? Are any important features excluded from the model? Does the model algorithmically account for bias? For example, does the model attempt to limit bias as part of its optimization criteria? Does the model account for training data imbalance? |
| Stage 2: Data collection and acquisition | | | |
| Population bias | The algorithm performs poorly in subsets of the deployment population because the data used for model training do not adequately represent the population in which the algorithm will operate. | (1) A melanoma detection model achieved accuracy parity with a board-certified dermatologist; however, the model was trained primarily on light-colored skin, so it is likely to underperform for patients with dark skin, and the potential benefit of early detection through machine learning will be limited for these patients. (2) Amazon used an algorithm to review job applicants’ resumes, but the model favored male candidates because it was trained with data from a period during which most applicants were men. | Was the data used to train the model representative of the population in the deployment environment? If not, was the model developed to be robust to changes in the population? |
| Measurement bias | Bias introduced by differences in the quality of, or the way in which, features are selected and calculated across subgroups. | (1) Under-served subgroups are disproportionately assessed as high-risk borrowers and are thus less likely to have their mortgages approved. The difference in mortgage approval rates is due to the relative absence of data (eg, short credit histories, little diversity in loan types) for minority groups; as a result of this missing data, prediction algorithms are less precise for minorities, which leads to the approval rate inequity. (2) Machine learning algorithms typically require large datasets for training, yet existing biomedical datasets have historically misrepresented or excluded data for immigrants, minorities, and socioeconomically disadvantaged groups. | Are input variables defined and measured in the same way for all patients? Was the prediction target measured similarly across subgroups and environments? Are input variables more likely to be missing in one subgroup than another? |
| Stage 3: Validation | | | |
| Missing validation bias | An absence of validation studies that measure and address performance differences across subgroups. | (1) Machine learning models are often not assessed for disparate performance across subgroups before they are deployed; this has led to the introduction and perpetuation of bias in kidney transplant list placement. (2) External validation of an acute kidney injury prediction model with excellent performance at the source hospital demonstrated deteriorating performance at 5 external sites due to the heterogeneity of risk factors across populations. | Do validation studies report and address performance differences between groups? (A minimal sketch of such a subgroup check follows this table.) |
| Stage 4: Deployment and model use | | | |
| Human use bias | An inconsistent response to the algorithm’s output for different subgroups by the humans evaluating the output. | (1) In a study assessing the effect of criminal justice risk prediction algorithms, judges were presented with vignettes that described a defendant’s index offense, criminal history, and social background; some judges were also provided with the defendant’s estimated likelihood of re-offending. For affluent defendants, the probability of incarceration decreased from 59.5% to 44.4% when risk assessment information was provided, whereas for relatively poor defendants it increased from 45.8% to 61.2%. The authors concluded that, in some cases, providing judges with risk assessment scores can exacerbate disparities in incarceration for disadvantaged defendants. (2) A machine learning algorithm developed to help pathologists differentiate liver cancer types did not improve every pathologist’s accuracy despite the model’s high rate of correct classification: pathologists’ accuracy improved when the model’s prediction was correct but decreased when it was incorrect, demonstrating the potential unintended effects of using an algorithm to guide decision-making. | Might a user interpret the model’s output differently for different subgroups? Might the use of the model perpetuate disparities even if the model’s predictions are accurate across groups? Might the model’s output lead to more uncertainty in decision-making (eg, if the model’s output is ambiguous)? |
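Several of the checklist questions above can be given a first-pass quantitative check during validation. The sketch below is illustrative only and is not from the paper: it assumes a held-out validation table with a hypothetical subgroup column (`race_group`), outcome column (`readmit_30d`), and model score column (`risk_score`), and probes the Stage 2 question on differential missingness and the Stage 3 question on disparate discrimination across subgroups.

```python
# Illustrative sketch only (not from the paper): a first-pass quantitative check
# for two checklist questions on a held-out validation set. The file name and the
# column names (race_group, readmit_30d, risk_score) are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("validation_cohort.csv")  # hypothetical held-out validation data

subgroup_col, outcome_col, score_col = "race_group", "readmit_30d", "risk_score"
feature_cols = [c for c in df.columns if c not in (subgroup_col, outcome_col, score_col)]

# Stage 2, measurement bias: are input variables more likely to be missing in one subgroup?
missing_by_group = df[feature_cols].isna().groupby(df[subgroup_col]).mean()
print("Per-subgroup missingness rates:")
print(missing_by_group.round(3))

# Stage 3, missing validation bias: does discrimination (AUROC) differ across subgroups?
for group, g in df.groupby(subgroup_col):
    if g[outcome_col].nunique() < 2:
        print(f"{group}: only one outcome class present; AUROC undefined")
        continue
    auc = roc_auc_score(g[outcome_col], g[score_col])
    print(f"{group}: n={len(g)}, AUROC={auc:.3f}")
```

Similar per-subgroup comparisons can be run for calibration or error rates; a passing check here does not rule out the other bias sources described in the table.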
Figure 2. Model assessment heat map. An overall rating was given for each bias type based on the qualitative assessment of the checklist questions (details in Appendix 1). Red indicates there is potential for concern, green indicates there is limited potential for concern, and yellow indicates the potential for concern is unclear or there is not enough information with which to draw a conclusion.
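The heat map condenses the checklist into one rating per bias type. Below is a minimal sketch of how such a roll-up could be scripted, assuming a hypothetical three-level answer per question and a "worst answer wins" aggregation rule; the example answers are made up, and this is not the authors' rubric, which is qualitative and described in Appendix 1.

```python
# Hypothetical roll-up of per-question checklist answers into one heat-map color per
# bias type. The three-level scale and the "worst answer wins" rule are assumptions
# for illustration; they are not the paper's rubric (see its Appendix 1).
CONCERN, UNCLEAR, LIMITED = "red", "yellow", "green"
SEVERITY = {CONCERN: 2, UNCLEAR: 1, LIMITED: 0}

def overall_rating(answers):
    """Return the most severe answer for a bias type (red > yellow > green)."""
    return max(answers, key=SEVERITY.__getitem__)

# Example: answers for one candidate readmission model (values are made up).
checklist_answers = {
    "Label bias": [LIMITED],
    "Modeling bias": [LIMITED, UNCLEAR, LIMITED],
    "Population bias": [CONCERN],
    "Measurement bias": [UNCLEAR, CONCERN, UNCLEAR],
    "Missing validation bias": [UNCLEAR],
    "Human use bias": [UNCLEAR, LIMITED, LIMITED],
}
for bias_type, answers in checklist_answers.items():
    print(f"{bias_type}: {overall_rating(answers)}")
```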