
A scoping review of ethics considerations in clinical natural language processing.

Oliver J Bear Don't Walk¹, Harry Reyes Nieva¹,², Sandra Soo-Jin Lee³, Noémie Elhadad¹.

Abstract

Objectives: To review through an ethics lens the state of research in clinical natural language processing (NLP) for the study of bias and fairness, and to identify gaps in research.
Methods: We queried PubMed and Google Scholar for articles published between 2015 and 2021 concerning clinical NLP, bias, and fairness. We analyzed articles using a framework that combines the machine learning (ML) development process (ie, design, data, algorithm, and critique) and bioethical concepts of beneficence, nonmaleficence, autonomy, justice, as well as explicability. Our approach further differentiated between biases of clinical text (eg, systemic or personal biases in clinical documentation towards patients) and biases in NLP applications.
Results: Out of 1162 articles screened, 22 met criteria for full text review. We categorized articles based on the design (N = 2), data (N = 12), algorithm (N = 14), and critique (N = 17) phases of the ML development process.
Discussion: Clinical NLP can be used to study bias in applications reliant on clinical text data as well as explore biases in the healthcare setting. We identify 3 areas of active research that require unique ethical considerations about the potential for clinical NLP to address and/or perpetuate bias: (1) selecting metrics that interrogate bias in models; (2) opportunities and risks of identifying sensitive patient attributes; and (3) best practices in reconciling individual autonomy, leveraging patient data, and inferring and manipulating sensitive information of subgroups. Finally, we address the limitations of current ethical frameworks to fully address concerns of justice. Clinical NLP is a rapidly advancing field, and assessing current approaches against ethical considerations can help the discipline use clinical NLP to explore both healthcare biases and equitable NLP applications.


Keywords: bias; ethically informed; fairness; natural language processing

Year: 2022 PMID: 35663112 PMCID: PMC9154253 DOI: 10.1093/jamiaopen/ooac039

Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531

INTRODUCTION

Recently, there has been a sharp increase in research at the intersection of machine learning (ML) and bias in clinical and biomedical research. While ML approaches may help advance human health, they also hold the potential to further entrench existing, even well-documented, biases. Such biases have led to growing concerns regarding the exacerbation of healthcare disparities, lack of funding for certain research topics, and inadequate diversity in the demographic makeup of study populations. Prior work has largely focused on elucidating biases via quantitative analysis of structured electronic health record (EHR) data. Natural language processing (NLP) of clinical text presents yet another robust method for inquiry, and differs from general ML in that its methods or algorithms take in or produce unstructured, free-text data. Most notably, NLP-based work has uncovered gender differences in disease associations, disparities in smoking documentation, and differences in financial consideration discussions among racial and ethnic groups. These examples offer a glimpse into biases in both the practice and discussion of healthcare delivery. Beyond discovery of bias in clinical text, NLP may also offer solutions (eg, techniques for information extraction have improved identification and representation of subgroups in clinical data), though, unless actively monitored and addressed, clinical NLP can contribute to biases through its development and use. Clinical text is a rich and nuanced source of patient data, but the subjective and nonstandardized means by which information is recorded and discussed in clinical notes also raises unique ethical considerations.
Moreover, NLP has been widely leveraged in a multitude of tasks (eg, information extraction, understanding clinical workflow, risk prediction and patient stratification, patient trajectory prediction, decision support, and question answering) without a structured approach to examining and understanding the sources and implications of its biases in the complex environment of clinical practice. Multiple agents work to maximize benefits, minimize harms, and respect autonomy while maintaining a fair distribution of resources within the biomedical ecosystem. The 4 core bioethical principles of beneficence, nonmaleficence, autonomy, and justice present a framework for ethical decision making that allows for the complexity of healthcare delivery. Nonetheless, to fully examine and understand bias and its relevance to the application of NLP in the clinical setting, an expanded framework is necessary. Ethical concerns surrounding ML, in particular bias and fairness, have been discussed by many researchers. Related studies have demonstrated that ML models can inherit, exacerbate, or even create new biases leading to disparities. Similar to bioethics, this work involves multiple agents interacting within various environments. However, to our knowledge, focused study of bias and ethics of clinical NLP remains a relatively nascent domain. Given the growing body of literature concerning ethical ML and the sensitive nature of clinical text, it is important to understand and anticipate ethical concerns before NLP applications are put into practice. A scoping review is well suited to provide an overview of bias in clinical NLP as it can rapidly map key concepts “underpinning a research area.” The objective of this work was to perform a scoping review of literature at the intersection of clinical NLP, bias, and fairness. It incorporates a robust framework that combines traditional bioethical principles with the stages of a proposed ML development process.
This approach offers a unique lens through which clinical NLP and its broader ethical implications on healthcare decision making may be viewed and better understood. Overall, we find that clinical NLP can be used to uncover and ameliorate bias in healthcare, but is not without its own ethical concerns and even well-intentioned work can potentially expose patients to harm. While clinical NLP can support research into biases in the clinical setting, it also has the potential to inherit or exacerbate the biases we hope to study, or even lead to new biases altogether.

METHODS

We conducted our scoping review based on recommendations from the PRISMA Extension for Scoping Reviews (PRISMA-ScR) guidelines.

Eligibility criteria

We included 2 types of articles: (1) empirical studies on identifying or mitigating bias in clinical notes and (2) tasks focused on predictive analytics, classification, or information extraction using clinical text. We excluded articles if they did not focus on English text or did not involve a data-driven approach. We further required articles to measure bias. Of note, our definition of bias differs from the term bias in ML literature that describes the difference between an estimator’s expected value and the true value of the parameter being estimated. In this study, we define bias in a sociological sense that remains tied to ML: systematic differences in representations, predictions, or outcomes for individuals correlated with inherent or acquired characteristics related to systemic marginalization.

Search strategy

We searched PubMed and Google Scholar on June 8, 2021 using search terms related to the concepts of NLP, clinical data, and bias (Table 1) and included all work published since 2015 (inclusive). To account for concepts related to bias that fall under our definition we expanded our search to include terms such as fairness, health disparities, and explicability. We focused our review on more recent discussions of bias in clinical NLP as the ML and NLP-related work has been advancing at a rapid pace.

Table 1.

Search terms and queries for PubMed and Google Scholar

Source	NLP search terms	Clinical data search terms	Bias search terms	Query
PubMed	“natural language processing”, “machine learning”, “artificial intelligence”, “information storage and retrieval”	“unstructured”, “electronic health records”, “clinical”	“bias*”, “fair”, “fairness”, “health disparities”, “explicability”, “interpretab*”, “explainab*”	(“natural language processing” OR “machine learning” OR “artificial intelligence” OR “information storage and retrieval”) AND (“unstructured” OR “electronic health records” OR “clinical”) AND (“bias*” OR “fair” OR “fairness” OR “health disparities” OR “explicability” OR “interpretab*” OR “explainab*”)
Google Scholar	“natural language processing”, “machine learning”	“clinical note”, “clinical text”, “electronic health records”	“bias*”, “fairness”, “health disparities”	(“natural language processing” OR (“machine learning” AND (“clinical note” OR “clinical text”))) AND (“electronic health records”) AND (“bias*” OR “fairness” OR “health disparities”)
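
As an illustration of how the PubMed query in Table 1 is assembled, the following minimal sketch joins the terms of each concept with OR and then joins the three concepts with AND. The helper names `or_group` and `build_query` are hypothetical and not from the article; the term lists are taken from the PubMed row of Table 1, and straight quotes stand in for the typographic quotes shown in the table.

```python
# Hypothetical helpers for illustration; term lists follow the PubMed row of Table 1.

def or_group(terms):
    """Quote each term, join the terms with OR, and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(*concept_groups):
    """Join one parenthesized OR-group per concept with AND."""
    return " AND ".join(or_group(group) for group in concept_groups)


NLP_TERMS = [
    "natural language processing",
    "machine learning",
    "artificial intelligence",
    "information storage and retrieval",
]
CLINICAL_TERMS = ["unstructured", "electronic health records", "clinical"]
BIAS_TERMS = [
    "bias*",
    "fair",
    "fairness",
    "health disparities",
    "explicability",
    "interpretab*",
    "explainab*",
]

PUBMED_QUERY = build_query(NLP_TERMS, CLINICAL_TERMS, BIAS_TERMS)
print(PUBMED_QUERY)
```

The Google Scholar query nests an AND inside its first group, so it would need to be written out by hand or built with an additional helper.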


Study selection

Two reviewers (OJBDW and HRN) independently screened all titles and abstracts to determine eligibility for full text review. Any disagreements were adjudicated to reach consensus and an additional reviewer (SS-JL) was consulted, if needed.

Data extraction

Briefly, we analyzed articles along 2 axes, ML and ethics, to capture which aspects of both fields were investigated thus far in the literature. The ML axis (Figure 1) was modeled after existing work on ethical ML in healthcare and Box’s Loop and serves to conceptualize the stages of developing ML. For this component, we analyzed the 4 phases of each paper: study design (which is influenced by the research objectives), choice of data source (and possible selection biases), algorithm employed (and the features used for algorithm input), and self-critique (with respect to aspects such as assumptions made and the relative importance of individual elements of the work). The ethics axis (Figure 2) is based on a synthesis of existing ethical frameworks conducted by AI4People, a multi-stakeholder forum concerned with laying the foundations for a “Good AI Society.” Each dimension (beneficence, nonmaleficence, justice, autonomy, and explicability) was tailored for its application to ML development and adoption.

Figure 1.

Proposed stages of the ML development process. Design, data, and algorithm capture stages discussed in prior work, while the critique stage incorporates Box’s Loop and illustrates the cyclic nature inherent in development. ML: machine learning.

Figure 2.

The ethical framework proposed by AI4People (An Ethical Framework for a Good AI Society: Opportunities, Risks, Principles, and Recommendations) and used to understand the complex interactions between multiple actors and clinical NLP technologies in this work. The framework focuses on the 4 traditional bioethics principles and introduces explicability to enable the other principles for application to AI. AI: artificial intelligence; NLP: natural language processing.

The ML axis integrates the 4 stages of the ML development process introduced by Chen et al: design, data, algorithm, and critique. For the critique stage, we also incorporated a cyclic graph structure as a generalization of the principles of Box’s Loop, which proposes that building and computing probabilistic models is an iterative process (we extend this definition to all ML models). The ML development process begins in the design stage during which researchers determine what is being studied and which stakeholders are included. During the data stage, decisions are made concerning how to collect data, inclusion and exclusion criteria, what features to extract, and which groups are studied and how they are defined. Next, the algorithm stage refers to algorithm choice, training, optimization, and validation. In this phase, researchers may choose to optimize for a proxy if the true outcome is too difficult to measure. Examples of proxy labels in NLP tasks include: predicting missing words and logical sentence flow or billing codes to learn general language representations, and quality of life as measured by standardized surveys. The critique stage is unique in that it interacts with all other stages.
During the critique stage researchers reflexively examine their decisions made in all other stages including why specific research questions are asked or ignored, why a given population is being studied, why certain groups are included or excluded, why the research question was operationalized into a specific study design, why a given outcome was selected for optimization, and how a model is evaluated. The critique stage asks researchers to examine their own worldview and the lens through which they conduct research. At the critique stage, the broader collective structural incentives and one’s position within that society (eg, “interest and funding”) help examine the other stages, design, data, and algorithm. While the critique stage is introduced last here, it is important to note that the setup calls for researchers to critique their work early and often throughout all other stages in an iterative manner. AI4People has synthesized existing ethical ML frameworks to guide ML development and adoption. The ethics axis (Figure 2) adopts a framework developed by AI4People, which combines the 4 traditional bioethics principles (beneficence, nonmaleficence, justice, and autonomy) with explicability (to better ensure intelligibility and accountability) to form a unified approach for ethical ML development and adoption. We chose the AI4People framework because it combined bioethical principles used in clinical research and supported their application to ML. In this context, beneficence pertains to designing and producing ML that benefits humanity and promotes well-being. Nonmaleficence covers aspects of good intentions gone awry and deals with preventing harms arising from ML through deliberate misuse or unpredictable behavior. Autonomy concerns human decision making and what/when decision making is ceded to ML. Justice states that ML should seek to eliminate discrimination by equally distributing the benefits and risk of healthcare resources, technologies, and datasets. 
Finally, explicability entails understanding ML from an epistemological perspective (how ML applications work) and as a matter of accountability (who is responsible for how ML applications work). Important to this work, explicability supports the 4 bioethics principles by exposing the “technical system and the broader human process, structures, and systems around [ML]” and promoting accountability. For each article, we collected data on the study objective, NLP methods employed, the bias measure used by the authors, and the marginalized population(s) mentioned. We also extracted information relevant to the ML development process and ethical framework axes by assigning each article to one or more stages of the ML development process (if ethical considerations of that stage were mentioned in the article) and one or more ethical categories (if there was implicit or explicit mention of a given principle, eg, beneficence). We then generated a matrix of analysis (Figure 3) based on assignments from that binarized decision process (applicable/not applicable). One reviewer (OJBDW) performed the initial extraction of relevant study information. A second reviewer (SS-JL) conducted a confirmatory assessment to evaluate consistency in data collection and assignment across studies.
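
The binarized assignment described above can be pictured as a small tally over stage and principle pairs. The sketch below is only one possible reading of that process, not the authors’ implementation: it assumes each article is recorded as a set of (stage, principle) cells marked applicable, and the function name `analysis_matrix` and the article labels are hypothetical placeholders rather than studies from the review.

```python
from collections import Counter

STAGES = ("design", "data", "algorithm", "critique")
PRINCIPLES = ("beneficence", "nonmaleficence", "autonomy", "justice", "explicability")


def analysis_matrix(applicable):
    """Count, for every (stage, principle) cell, how many articles were marked applicable.

    `applicable` maps an article label to the set of (stage, principle) pairs
    judged applicable for that article (assumed data layout).
    """
    counts = Counter()
    for cells in applicable.values():
        for stage, principle in set(cells):
            if stage not in STAGES or principle not in PRINCIPLES:
                raise ValueError(f"unknown cell: {(stage, principle)}")
            counts[(stage, principle)] += 1
    # Rows follow STAGES, columns follow PRINCIPLES.
    return [[counts[(s, p)] for p in PRINCIPLES] for s in STAGES]


# Placeholder assignments for illustration only.
toy_assignments = {
    "article_A": {("data", "justice"), ("critique", "justice")},
    "article_B": {("algorithm", "explicability"), ("data", "justice")},
}

matrix = analysis_matrix(toy_assignments)
for stage, row in zip(STAGES, matrix):
    print(stage, row)
```

Keeping the raw applicable/not applicable cells per article, rather than only the totals, makes it straightforward for a second reviewer to check individual assignments.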

Figure 3.

Articles were analyzed according to the ML development process and an ethical framework resulting in this matrix of analysis. ML: machine learning.

RESULTS

The PRISMA-ScR flow diagram (Figure 4) summarizes our study selection process. We identified and screened 1162 unique articles, conducted a full text review of 268 publications for eligibility, and selected 22 studies (Supplementary Table S1) for our main analysis.

Figure 4.

A flowchart of the article screening process in accordance with PRISMA guidelines.

Design

The design stage includes research question conception and experimental design choices such as the study population, stakeholder inclusion, and phenomena studied. In the design stage we identified 2 concerns: stakeholder inclusion during the design process and balancing goals with group representation. Stakeholder inclusion is paramount to understanding important background context, study requirements, and potential pitfalls involved in research. It is especially relevant to studies concerning populations made vulnerable by systemic inequality. Three articles studied patients from LGBTQ+ communities, and a fourth studied geriatric patients. In all cases, authors cited a lack of representation as motivation for their work. No articles mentioned including patients in their research design process, though one engaged with home healthcare nurses to study attitudes and perceptions about sexual orientation and gender identity. Pfohl et al mentioned that technical fixes for biased models must take into account the sociopolitical contexts, and recommended including community stakeholders during the design phase. Balancing cohort definitions and group representation begins in the design phase as researchers outline inclusion and exclusion criteria, such as requiring complete data. Though data completeness can be achieved by linking to other data sources, concerns with data rights and privacy arise. As an alternative to complete data, “complete enough” data can avoid such issues, but it can also create additional biases. Weber et al explore how different data completeness definitions potentially bias patient-level demographic representations. The authors found that increasingly stringent data completeness standards resulted in datasets that skewed older, more female, and toward higher inpatient disease burden. Biased data were evaluated by comparing them to the gold standard of demographics found in the originating EHR and claims datasets.
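
To make the effect of a completeness filter concrete, the following minimal sketch compares the demographics of a filtered cohort with those of the full cohort. It is not the procedure of Weber et al: the `completeness` score, the threshold, the field names, and the toy records are all assumptions made for illustration.

```python
from statistics import mean


def demographic_summary(patients):
    """Summarize a cohort by mean age and fraction recorded as female."""
    return {
        "mean_age": mean(p["age"] for p in patients),
        "frac_female": mean(1.0 if p["sex"] == "F" else 0.0 for p in patients),
    }


def completeness_shift(patients, min_completeness):
    """Return the filtered-minus-full difference in each demographic summary.

    `completeness` is a hypothetical per-patient score in [0, 1]; a real study
    would substitute its own completeness definition.
    """
    kept = [p for p in patients if p["completeness"] >= min_completeness]
    if not kept:
        raise ValueError("no patients satisfy the completeness threshold")
    full = demographic_summary(patients)
    filtered = demographic_summary(kept)
    return {key: filtered[key] - full[key] for key in full}


# Toy records for illustration only.
toy_cohort = [
    {"age": 30, "sex": "M", "completeness": 0.2},
    {"age": 50, "sex": "F", "completeness": 0.6},
    {"age": 70, "sex": "F", "completeness": 0.9},
    {"age": 40, "sex": "M", "completeness": 0.5},
]

shift = completeness_shift(toy_cohort, min_completeness=0.6)
print(shift)
```

A positive shift in `mean_age` or `frac_female` indicates that the filtered cohort is older or more female than the cohort it was drawn from, which is the kind of skew the paragraph above describes.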

Data

Compared to structured data, certain patient information is more accurate or only captured in clinical notes. We identified 3 concerns in the data stage: group representation, feature representation, and biases in documentation practices. These concerns align with the notion of data justice. Taylor defines 3 pillars of data justice: visibility, engagement with technology, and nondiscrimination. Group and feature representation fall under the pillars of visibility and engagement with technology, as they deal with aspects of privacy and control of representation. Biased documentation practices fit under the pillar of nondiscrimination as it deals with methods and research that identify biases. Group representation and visibility straddle the design and data stages, as improving group representation can be a motivation for carrying out research, but data are ubiquitous throughout all NLP stages so we discuss them in the data stage. Visibility was discussed in 6 papers. Approaches to address poor group representation utilized keywords and structured data to identify LGBTQ+ patients. The impact of complete data filtering on group representation was explicitly explored by one article, while another article mentioned unbalanced group representation as a limitation. Data justice also raises the concern of accurate representation. Chen et al sought to accurately represent older adults by identifying geriatric syndromes with a deep learning model using clinical notes and structured data. The proposed model leveraged contextual information in the document to outperform baseline models.
Clinical NLP was used to explore biased documentation practices and differences in healthcare delivery through a variety of methods such as latent Dirichlet allocation (LDA), and differences in n-gram level or specific topic distributions. LDA, an unsupervised, statistical approach to discover topics from a collection of text, was used to explore group differences in psychiatric and intensive care unit (ICU) notes as well as differences between staff and residents in asthma-related discussion. Sohn et al found that residents were less likely than staff to enter the diagnosis in the EHR and that patients of residents had poorer outcomes than patients of staff. Overall, multiple works leveraged clinical NLP to explore differences in treatment related to data justice. Biased documentation practices were also partially due to a lack of clinician familiarity with LGBTQ+ communities, leading to omitted, inaccurate, and harmful documentation. One article used semistructured interviews to understand attitudes and perceptions about documenting LGBTQ+-specific health concerns. Though sexual health discussions did not occur as often for nonheterosexual patients as for heterosexual patients, building rapport somewhat alleviated this disparity. Only 2 works (a published article and dissertation both by the same author) discussed stakeholder interviews focusing on providers. Other community stakeholders were not included as suggested previously.

Algorithm

Algorithms can be biased for multiple reasons, including biased data and choices in experimental design (eg, choosing to model a proxy that reflects ingrained inequities or removing a racial group from analysis because of insufficient representation), as well as model selection (eg, backfilling missing gender information for patients with a model that perpetuates the harmful idea of binarized gender). A majority of the work identified under the algorithm phase focused on reducing or measuring bias in applications using NLP and addressed the ethical principles of explicability and justice. We did not analyze algorithm explicability unless the presence or absence of explicability resulted in ethical considerations or explicability was a motivating factor for algorithm selection. The reason for this is that while explicability can support bioethical principles, explicability’s effect on bias was not measured in any of the papers analyzed here. There are many methods to measure bias and no one approach seems to be the best for all contexts. We identified 8 studies that measure bias in clinical NLP models using 13 different measures (Table 2). In most cases, articles did not overlap with one another in how bias was measured.

Table 2.

Different measures for biased models discussed throughout the work identified in this scoping review

Bias measure	Description	Relevant article(s)
Parity gap	Positive prediction differences between 2 groups	Zhang et al
Recall gap	Recall difference between 2 groups	Zhang et al
Specificity gap	Specificity difference between 2 groups	Zhang et al
AUC gap	AUC difference between 2 groups	Tsui et al
Zero-one loss gap	Zero-one loss difference between 2 groups	Chen et al
Sentence log probability gap	Difference in a language model’s sentence log probability when swapping out demographic information (eg, discussion of race)	Zhang et al
Rank-turbulence divergence	Ranks occurrences of n-grams between 2 groups and takes into account how often rankings change	Minot et al
Conditional prediction parity	Fairness criteria that assess conditional independence between a model outcome and a demographics class. Encompasses notions of the parity gap.	Pfohl et al
Calibration fairness criteria	Measures model calibration across groups.	Pfohl et al
Cross-group ranking measures	Variation on AUC that measures how often positive instances in 1 group are ranked above negative instances in another	Pfohl et al
Sensitive attribute recovery	Measures how well a sensitive attribute (eg, gender) can be recovered	Minot et al
Demographic association with outcome	Significant association between patient demographics and model outcome using regression parameters	Wissel et al
Gold-standard bias comparison	Compare group representation to previous standard’s representation	Weber et al, Polling et al

Note: This does not include measures of bias for data or healthcare delivery.
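
The gap-style measures in Table 2 share one pattern: compute a metric separately for 2 groups and subtract. The sketch below illustrates that pattern for the parity, recall, specificity, and zero-one loss gaps on binary labels. It is an illustrative reading of the table’s descriptions rather than any cited article’s code; the function names, the sign convention (group A minus group B), and the toy labels are assumptions.

```python
def group_rates(y_true, y_pred):
    """Per-group rates for binary labels and predictions coded as 0/1."""
    if not y_true:
        raise ValueError("group has no examples")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    nan = float("nan")
    return {
        "parity": (tp + fp) / len(pairs),                      # positive prediction rate
        "recall": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "zero_one_loss": (fp + fn) / len(pairs),               # misclassification rate
    }


def gap_scores(y_true, y_pred, groups, group_a, group_b):
    """Return metric(group_a) - metric(group_b) for each metric (assumed sign convention)."""
    def subset(name):
        idx = [i for i, g in enumerate(groups) if g == name]
        return [y_true[i] for i in idx], [y_pred[i] for i in idx]

    rates_a = group_rates(*subset(group_a))
    rates_b = group_rates(*subset(group_b))
    return {f"{metric}_gap": rates_a[metric] - rates_b[metric] for metric in rates_a}


# Toy labels for illustration only.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gaps = gap_scores(y_true, y_pred, groups, "a", "b")
print(gaps)
```

In this toy case the zero-one loss gap is 0 while the recall and specificity gaps are not, a small illustration of how one measure may flag a difference between groups while another does not.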

Gap scores, zero-one loss, and outcome association measured model outcome differences between groups. Gap scores were produced in phenotyping and mortality prediction tasks by subtracting performance metrics between 2 groups, focusing on recall gap. Chen et al focused on zero-one loss in psychiatric readmission and ICU mortality prediction. Outcome association was used in a single article, where the authors trained a model to identify candidates for epilepsy surgery and measured bias through the model’s outputs and patient demographics using univariate and multivariate linear regressions. Model decisions were not found to be significantly associated with patient demographics. Two articles used the pretrained language model, BERT, which can be trained to perform new tasks in addition to the pretraining task. Zhang et al found significant differences in sentence probabilities discussing various clinical categories when switching out stereotypically masculine and feminine pronouns. Minot et al measured bias using the proxy of how well a trained BERT model could predict a patient’s gender from the clinical notes. Two articles used a gold-standard dataset or approach with an acceptable or normalized amount of bias, though what constitutes a gold-standard in this scenario depends on the setting. Whereas one article measured against claims data as the most complete kind of data available, another measured a model’s results against a manually extracted gold-standard. One article focused on understanding bias through the lens of confounding. Lynch et al compared the impacts of confounding when measuring smoking status through ICD-9 codes or using information extracted from clinical notes.
The authors found that when extracting smoking status as a confounder for an exposure–outcome relationship, NLP-based methods resulted in better ability to control for confounding than using ICD-9 codes for smoking status. Understanding how data sources affect confounding supports explicability. Bias mitigation relied on methods-based approaches that directly reduced bias through either adapting the training methods or the training data. Zhang et al attempted to reduce bias during the language pretraining phase. Minot et al were slightly more successful in reducing their measure of bias by removing highly gendered phrases during training. Minot et al was unique in that the authors both measured bias and developed an interpretable method to mitigate bias. In particular, words which were identified as biased towards one binarized gender or another could be selectively removed during the training step to reduce bias while balancing performance on the downstream task. Their bias measure did not evaluate the differences in performance on a downstream task, but rather how well the model could recover a patient’s gender from clinical notes. Pfohl et al also explored how different levels of bias mitigation could impact downstream task performance across a variety of measures. As a sidenote, 2 studies cited model interpretability as a motivation to reduce bias, but neither measured bias, so they were not included.

Critique

The critique stage concerns all other stages in the ML development process, as it requires researchers to examine how their positionality (ie, degrees of privilege through factors of race, class, educational attainment, income, ability, gender, and citizenship, among others) and broader, collective structural incentives might have affected choices made during research. Examining one’s positionality and reflection on social and structural factors are especially pertinent when studying bias. In light of this, we provide an overview of the main critique-related concerns and topics within each stage. The main topics identified in the design stage related to motivations for research and caution for well-intentioned research that may cause harm. The data stage was associated with justification for different measures of biased data and why certain biases in datasets were addressed or not. Finally, in the algorithm stage, justification for how to measure bias and interpretability were the main topics. Many studies included in this review were directly motivated by health disparities,,,,,, understanding data completeness, or improving information extraction with automation., Three papers discussed a lack of representation of LGBTQ+ patients, cited a dearth of applicable codes for accurate representation in structured data, and withholding information due to fear of discrimination and stigmatization. Sociopolitical and historical factors contribute to bias in datasets and authors often used this context to explain bias. One article dismissed a limitation of unbalanced race distribution in their datasets because their race distribution matched that of US census data, potentially perpetuating model bias for marginalized populations. Another article explained the differences in clinical note disease topics as representative of the medical literature. 
One resource explored race as a proxy for other information such as mistrust, and introduced 3 measures of mistrust to characterize disparities in end-of-life care. The implicit bias of clinicians can also be explored through clinical NLP and can be characterized through sentiment analysis. Finally, it was found that biased language associations can be explained by differences in the disease-demographic co-occurrence statistics of training corpora. An important aspect of bias in data that was not explored is the bias introduced in the selection of characteristics and dimensions of study. We recognize that “[d]ata are created and shaped by the assumptive determinations of their makers to collect some data and not others, to interrogate some objects over others and to investigate some variable relationships over others.” For example, investigating categories used to capture gender in clinical notes and impact of such choices are important questions in the study of biases this might perpetuate, and similar for race and ethnicity. Our proposed critique phase of ML design creates space for such inquiry. There are many methods for identifying bias and a lack of consensus on which is best.,, Five studies identified biases in corpora using word and topic distributions,,,,, while 2 papers examined n-gram level differences as a baseline analysis., Two articles applied LDA to learn topics in corpora, and 2 studies used keyword searches or NLP models to identify specific topic differences., Papers which identified biased models used multiple measurements with varied motivations. One article justified using recall gap to obtain fewer false negatives for diagnostic tools and motivated their approach to evaluate language modeling bias through sentence probability by citing the approach as state-of-the-art. 
Two articles also compared new approaches to a gold-standard approach to measure bias, and one study measured differences in group fairness using performance metrics originally motivated as clinically relevant. Only one article explicitly explored the outcomes and downstream impacts of multiple fairness measures. Three biased-model identification articles did not justify their measures. It is important to note that measuring bias against gold standards from EHR or claims data, which were not originally intended to support secondary use, comes with the assumption that these data and their potential biases are acceptable. This practice can perpetuate biases present in the gold standard.
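To make concrete how different bias measures can lead to different conclusions, the following is a minimal sketch in Python, not the method of any reviewed article. It computes a recall gap (the difference in recall between two groups) alongside a second, commonly used measure, the gap in positive-prediction rate. The function names (`group_rates`, `gap`), the group labels, and the toy labels and predictions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return per-group recall and positive-prediction rate.

    y_true and y_pred are sequences of 0/1 labels; groups holds one
    group identifier per example. All names here are illustrative.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "pred_pos": 0, "n": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1
        c["pred_pos"] += p
        if t == 1:
            # A true positive if the model predicted 1, otherwise a false negative.
            c["tp" if p == 1 else "fn"] += 1

    rates = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        rates[g] = {
            "recall": c["tp"] / positives if positives else float("nan"),
            "positive_rate": c["pred_pos"] / c["n"],
        }
    return rates


def gap(rates, metric, group_a, group_b):
    """Difference in a metric between two groups (group_a minus group_b)."""
    return rates[group_a][metric] - rates[group_b][metric]


# Toy data (invented): four patients in each of two hypothetical groups.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, groups)
recall_gap = gap(rates, "recall", "A", "B")
parity_gap = gap(rates, "positive_rate", "A", "B")

print(f"recall gap (A - B): {recall_gap}")
print(f"positive-rate gap (A - B): {parity_gap}")
```

In this toy example the model flags the same share of patients in both groups, so the positive-rate gap is zero, yet it misses half of the true cases in group B, so the recall gap is 0.5. Which measure is reported therefore changes whether bias is detected, and both inherit any errors in the labels they are computed against.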

DISCUSSION

Clinical NLP can be used to study bias in applications reliant on clinical text data and explore biases in the healthcare setting. In this way, clinical NLP reflects our own biases and serves as a basis for healthcare delivery and policy changes. However, clinical NLP itself and research that leverages clinical NLP are shaped, to varying degrees, by the same forces that shape the bias we hope to study, and even work that identifies or ameliorates biases has the potential for harm. We identified several themes in each phase of the ML development process and grouped them into 3 main areas: (1) ambiguity when selecting metrics that interrogate and promote fairness in models; (2) opportunities and risks of active research into ML and ethics (eg, identifying sensitive patient attributes); and (3) best practices in reconciling individual autonomy, leveraging patient data, and inferring and manipulating sensitive information of subgroups. Our review of the Design phase identified articles that explicitly focused on increasing the visibility of marginalized populations in research through clinical NLP, but none of these articles discussed the potential risks for harm due to increased visibility. Focusing on the Data phase of projects revealed that NLP can be helpful to identify and better characterize marginalized populations, while also offering a tool to shed light on biased data generation practices. Examination of the Algorithm phase of studies noted a plethora of methods to measure bias and demonstrated that mitigation can be difficult. Finally, the Critique phase illustrated that not all articles explicitly motivated their bias and explicability measures and that they often presented different justifications for biased data or algorithms. There are multiple bias metrics that align with bioethical principles to varying degrees, and for a given application some may identify bias while others do not.
However, similar to the potential for clinical NLP to address and/or contribute to biases, bias metrics risk prioritizing certain ideas of what constitutes bias or harm from the perspectives of researchers and institutions that may be incongruent with those who are experiencing the outcomes of biases. Similar to Metcalf et al, we recommend that researchers develop and motivate bias metrics not only with clinicians in mind, but also alongside the community expertise of those who are most affected by the outcomes of biases in informatics. Finally, we recommend researchers remain aware that fairness metrics, in their attempt to quantify nuanced and complex interactions, are not the gold standard to follow and reflect on, but rather a trigger for reflecting on the interactions themselves. We also found that clinical text may be leveraged to make visible those who were previously invisible within structured data. While improving representation can lead to more diverse research and quality care metrics, it may also expose patients to harm and discrimination. Understanding and addressing bias is not always a straightforward endeavor, and well-intentioned work may also give way to further ethical considerations. As an example, identifying gender and sexual minority patients within the EHR may violate autonomy by going against choices to withhold information due to fear of discrimination. Moreover, once these models are in place and providing information that patients chose to withhold, they can enable harm against these patients, whether accidental or intentional. When leveraging clinical NLP to address biases in representation, researchers should consider how such technology might be abused or misused once research is concluded and a system is in place. Conceptions of autonomy also need to be reframed and informed by communities historically excluded from decision making regarding the use of their data.
This concern is supported by the result that no articles discussed community stakeholder inclusion, besides that of providers, in their methods and only one article made the suggestion to include community stakeholders. This lack of community stakeholder engagement raises concerns about who is not included in driving research into biases that ultimately impact health outcomes for patients. Individual communities are well equipped to define group status and identify potential unintended consequences of clinical NLP leveraging their data. Indigenous data sovereignty provides an example for conceptualizing community autonomy in the age of big data. All of these concerns point to limitations of bioethics principles for addressing group-level harms that may emerge from biases in clinical NLP. Much like the existing bioethics principles we describe in this work, we note that existing ethics governing structures, like institutional review boards, also lack guidance for addressing these potential group-level harms. Thus, while the papers reviewed, and in fact most papers that use NLP and ML in healthcare, are approved by their institutional review boards, they lack critical considerations for groups, power dynamics, and systemic structures.
As scholars have argued, ethical frameworks that address issues of power, recognition of groups, and structural injustices are needed to mitigate the potential for further exacerbating and reproducing inequities through technology. Benjamin addresses this in the context of consent and suggests “informed refusal” as a method “to construct more reciprocal relationships between institutions and individuals” for groups to address concerns over the mining of sensitive patient information such as sex and gender through clinical NLP. This concept of refusal originated from Simpson in describing how consent was weaponized to dispossess Indigenous peoples throughout North America and Australia, and how “‘Refusal’ rather than recognition is an option for producing and maintaining alternative structures of thought, politics and traditions away from and in critical relationship to states.” A current example of refusal, and specifically informed refusal, is supporting Indigenous-led efforts to empower and train the next generation of Indigenous data scientists and geneticists, such as Indigidata and the Summer internship for INdigenous peoples in Genomics Consortium. These kinds of efforts can help community members support their community expertise with technical expertise to truly be empowered to guide, modify, and refuse research concerning their communities. Furthermore, Tsosie et al argue that a bias towards prioritizing the individual as primary in bioethics is “culturally incongruent with Indigenous communitarian ethics,” suggesting yet another way to reconceptualize how clinical NLP research could be conducted in consideration of its potential impact on specific communities. The findings of this scoping review should be considered in light of several limitations. First, due to our search strategy, studies that used free-text may have been missed during the screening process if the data source was not explicitly mentioned in the title or abstract.
Second, we acknowledge that some authors could have addressed our concerns but were unable to write this into the report for various reasons (audience, journal content policies, word count, etc). Either way, we cannot analyze content that did not make it into the report regardless of reason. Third, our definition of bias may differ from that of other authors. Lastly, while bioethical principles serve as the basis for oversight of biomedical research, we have identified limitations in applying these principles to guide research into fair clinical NLP and in creating our own reviewing methodology. In particular, we encountered a lack of guidance for evaluating how technologies interact with other technologies and assessing how sociocultural forces contribute to discrimination and other harmful outcomes. This raises challenges in the consideration of group harm due to the particular emphasis of bioethics on individual autonomy. To the best of our ability, we have incorporated these ideas into our analysis, but future work is needed to explore other ethical frameworks.

CONCLUSIONS

Clinical NLP is a rapidly advancing field, and assessing current approaches against ethical considerations can help the discipline use clinical NLP to explore both healthcare biases and equitable NLP applications. This scoping review mapped how recent works have both studied bias in clinical NLP and used the tools of clinical NLP to study bias in healthcare delivery. Leveraging a bioethics framework and clinical ML development process, we identified challenges and opportunities in studying the intersection of clinical NLP and bias. We also recognize the limits of such frameworks for addressing potential risks of bias for groups and communities. As such, new ethics frameworks that empower communities and recognize structural injustices will be essential to intervene on the potential for clinical NLP to further entrench inequities in clinical practice.

FUNDING

This work was supported by grants from the National Library of Medicine (OBDW, HRN: T15LM007079) and National Institute of General Medical Sciences (NE: R01GM114355). The study funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication. The content is solely the responsibility of the authors and does not necessarily represent the official views of a funder.

AUTHOR CONTRIBUTIONS

All authors had full access to the data in the study, and take responsibility for the integrity of the data and the accuracy of the analysis, and have approved the final manuscript. Study concept and design: OBDW and NE. Data analysis: OBDW, HRN, and SSL. Drafting and critical revision of the manuscript: all authors. Supervision: NE.

SUPPLEMENTARY MATERIAL

Supplementary material is available at JAMIA Open online.

46 in total

1. Gender disparities in clozapine prescription in a cohort of treatment-resistant schizophrenia in the South London and Maudsley case register.

Authors: Emma Wellesley Wesley; India Patel; Giouliana Kadra-Scalzo; Megan Pritchard; Hitesh Shetty; Matthew Broadbent; Aviv Segev; Rashmi Patel; Johnny Downs; James H MacCabe; Richard D Hayes; Daniela Fonseca de Freitas
Journal: Schizophr Res Date: 2021-05-19 Impact factor: 4.939

2. Developing and Validating a Computable Phenotype for the Identification of Transgender and Gender Nonconforming Individuals and Subgroups.

Authors: Yi Guo; Xing He; Tianchen Lyu; Hansi Zhang; Yonghui Wu; Xi Yang; Zhaoyi Chen; Merry Jennifer Markham; François Modave; Mengjun Xie; William Hogan; Christopher A Harle; Elizabeth A Shenkman; Jiang Bian
Journal: AMIA Annu Symp Proc Date: 2021-01-25

3. The battle for ethical AI at the world's biggest machine-learning conference.

Authors: Elizabeth Gibney
Journal: Nature Date: 2020-01 Impact factor: 49.962

4. Accuracy of race, ethnicity, and language preference in an electronic health record.

Authors: Elissa V Klinger; Sara V Carlini; Irina Gonzalez; Stella St Hubert; Jeffrey A Linder; Nancy A Rigotti; Emily Z Kontos; Elyse R Park; Lucas X Marinacci; Jennifer S Haas
Journal: J Gen Intern Med Date: 2014-12-20 Impact factor: 5.128

5. Five sources of bias in natural language processing.

Authors: Dirk Hovy; Shrimai Prabhumoye
Journal: Lang Linguist Compass Date: 2021-08-20

6. Informatics for sex- and gender-related health: understanding the problems, developing new methods, and designing new solutions.

Authors: Mary Regina Boland; Noémie Elhadad; Wanda Pratt
Journal: J Am Med Inform Assoc Date: 2022-01-12 Impact factor: 7.942

7. Using routine clinical and administrative data to produce a dataset of attendances at Emergency Departments following self-harm.

Authors: C Polling; A Tulloch; S Banerjee; S Cross; R Dutta; D M Wood; P I Dargan; M Hotopf
Journal: BMC Emerg Med Date: 2015-07-16

8. Prevalence of Financial Considerations Documented in Primary Care Encounters as Identified by Natural Language Processing Methods.

Authors: Meliha Skaljic; Ihsaan H Patel; Amelia M Pellegrini; Victor M Castro; Roy H Perlis; Deborah D Gordon
Journal: JAMA Netw Open Date: 2019-08-02

9. Topic choice contributes to the lower rate of NIH awards to African-American/black scientists.

Authors: Travis A Hoppe; Aviva Litovitz; Kristine A Willis; Rebecca A Meseroll; Matthew J Perkins; B Ian Hutchins; Alison F Davis; Michael S Lauer; Hannah A Valantine; James M Anderson; George M Santangelo
Journal: Sci Adv Date: 2019-10-09 Impact factor: 14.136

10. Gender differences in clinical presentation and illicit substance use during first episode psychosis: a natural language processing, electronic case register study.

Authors: Jessica Irving; Craig Colling; Hitesh Shetty; Megan Pritchard; Robert Stewart; Paolo Fusar-Poli; Philip McGuire; Rashmi Patel
Journal: BMJ Open Date: 2021-04-20 Impact factor: 2.692