
Five sources of bias in natural language processing.

Dirk Hovy, Shrimai Prabhumoye

Abstract

Recently, there has been an increased interest in demographically grounded bias in natural language processing (NLP) applications. Much of the recent work has focused on describing bias and providing an overview of bias in a larger context. Here, we provide a simple, actionable summary of this recent work. We outline five sources where bias can occur in NLP systems: (1) the data, (2) the annotation process, (3) the input representations, (4) the models, and finally (5) the research design (or how we conceptualize our research). We explore each of the bias sources in detail in this article, including examples and links to related work, as well as potential counter-measures.
© 2021 The Authors. Language and Linguistics Compass published by John Wiley & Sons Ltd.

Year:  2021        PMID: 35864931      PMCID: PMC9285808          DOI: 10.1111/lnc3.12432

Source DB:  PubMed          Journal:  Lang Linguist Compass        ISSN: 1749-818X


INTRODUCTION

A visitor who arrived at night in the London of 1878 would not have seen very much: cities were dark, only illuminated by candles and gas lighting. The advent of electricity changed that. This technology lit up cities everywhere and conferred a whole host of other benefits. From household appliances to the internet, electricity brought in a new era. However, as with every new technology, electricity had some unintended consequences. To provide the energy necessary to light up cities, run the appliances and fuel the internet, we required more power plants. Those plants contributed to pollution and ultimately to the phenomenon of global warming. Presumably, those consequences were far from the minds of the people who had just wanted to illuminate their cities. This dual nature of intended use and unintended consequences is common to all new technologies. And natural language processing (NLP) proves no exception. NLP has progressed relatively rapidly from a niche academic field to a topic of widespread industrial and political interest. Its economic impact is substantial: NLP‐related companies are predicted to be valued at $26.4 billion by 2024. We make daily use of machine translation (Wu et al., 2016), personal assistants like Siri or Alexa (Palakovich et al., 2017; Ram et al., 2018) and text‐based search engines (Brin & Page, 1998). NLP is used in industrial decision‐making processes for hiring, abusive language and threat detection on social media (Roberts et al., 2019), and mental health assessment and treatment (Benton et al., 2017; Coppersmith et al., 2015). An increasing volume of social science research uses NLP to generate insights into society and the human mind (Bhatia, 2017; Kozlowski et al., 2019). However, the interest in and use of NLP have grown much faster than an understanding of the unintended consequences. 
Some researchers have pointed out how NLP technologies can be used for harmful purposes, such as suppressing dissenters (Bamman et al., 2012; Zhang et al., 2014), compromising privacy/anonymity (Coavoux et al., 2018; Grouin et al., 2015), or profiling (Jurgens et al., 2017; Wang et al., 2018). Those applications might be unintended outcomes of systems developed for other purposes but could be deliberately developed by malicious actors. A much more widespread unintended negative consequence is the unfairness caused by demographic biases, such as unequal performance for different user groups (Tatman, 2017), misidentification of speakers and their needs (Criado Perez, 2019) or the proliferation of harmful stereotypes (e.g., Agarwal et al., 2019; Koolen & van Cranenburgh, 2017; Kiritchenko & Mohammad, 2018). In this work, we follow the definition of Shah et al. (2020) for ‘bias’, which focuses on the mismatch of ideal and actual distributions of labels and user attributes in training and application of a system. These biases are partially due to the rapid growth of the field and an inability to adapt to the new circumstances. Originally, machine learning and NLP were about solving toy problems on small data sets, promising to do it on more extensive data later. Any scepticism and worry about the power of Artificial Intelligence (AI) were primarily theoretical. In essence, there was not enough data or computational power for these systems to impact people's lives. With the recent availability of large amounts of data and the universal application of NLP, this point has now arrived. However, even though we now have the possibility, many models are still trained without regard for demographic aspects. Moreover, many applications are focussed solely on information content, without awareness or concern for those texts' authors and the social meaning of the message (Hovy & Yang, 2021). 
But today, NLP's reach and ubiquity do have a real impact on people's lives (Hovy & Spruit, 2016). Our tools, for better or for worse, are used in everyday life. The age of academic innocence is over: we need to be aware that our models affect people's lives, yet not always in the way we imagine (Ehni, 2008). The most glaring reason for this disconnect is bias at various steps of the research pipeline. The focus on applications has moved us away from models as a tool for understanding and towards predictive models. It has become clear that these tools produce excellent predictions but are much harder to analyse. They solve their intended task but also pick up on secondary aspects of language and potentially exploit them to fulfil the objective function. And language carries a lot of secondary information about the speaker, their self‐identification, and membership in socio‐demographic groups (Flek, 2020; Hovy & Spruit, 2016; Hovy & Yang, 2021). Whether I say ‘I am totally pumped’ or ‘I am very excited’ conveys information about me far beyond the actual meaning of the sentence. In a conversation, we actively use this information to pick up on a speaker's likely age, gender, regional origin or social class (Eckert, 2019; Labov, 1972). We know that the same sentence (‘That was a sick performance!’) can express either approval or disgust, based on whether a teenager or an octogenarian says it. In contrast, current NLP tools fail to incorporate demographic variation and instead expect all language to follow the ‘standard’ encoded in the training data. But the question is: whose standard (Eisenstein, 2013)? This approach is equivalent to expecting everyone to speak like the octogenarian from above: it leads to problems when encountering the teenager. As a consequence, NLP tools trained on one demographic sample perform worse on another sample (Garimella et al., 2019; Hovy & Søgaard, 2015; Jørgensen et al., 2015). 
This mismatch was known to affect text domains, but it also applies to socio‐demographic domains: people of, say, different age groups are linguistically as diverse as a text from a web blog and a newspaper (Johannsen et al., 2015). Incidentally, demographics like age and text‐domain can often be correlated (Hovy, 2015). Plank (2016) therefore suggests treating these aspects of language as parts of our understanding of ‘domain’. The consequences of these shortfalls range from an inconvenience to something much more insidious. In the most straightforward cases, systems fail and produce no output. This outcome is annoying and harms the user who cannot benefit from the service, but at least it is obvious enough for the user to see and respond to. In many cases, though, the effect is much harder to notice: the performance degrades, producing sub‐par output for some users. This difference only becomes evident in comparison and is not apparent to the individual user. This degradation is much harder to see but often systematic for a particular demographic group and creates a demographic bias in NLP applications. The problem of bias introduced by socio‐demographic differences in the target groups is not restricted to NLP, though, but occurs in all data sciences (O'Neil, 2016). For example, in speech recognition, there is a strong bias towards native speakers of any language (Lawson et al., 2003). But even for native speakers, there are barriers: dialect speakers or children can struggle to make themselves understood by a smart assistant (Harwell, 2018). Moreover, women and children—who speak in a higher register than the speakers in the predominantly male training sample—might not be processed correctly (or at all) in voice‐to‐text systems (Criado Perez, 2019). 
There have been several examples of computer vision bias, from an image captioning system labelling pictures of black people as ‘gorillas’ to cameras designed to detect whether a subject blinked, which malfunctioned if Asian people were in the picture (Howard & Borenstein, 2018). In a more abstract form, the correlation of socio‐demographics with variables of interest can cause problems, such as when ZIP code and income level can act as proxies for race (O'Neil, 2016). ProPublica reported that a machine learning system designed to predict bail decisions overfit on the defendants' skin colour: in this case, the social prejudices of the prior decisions became encoded in the data (Angwin et al., 2016). With great (predictive) power comes great responsibility, and several ethical questions arise when working with language. There are no hard and fast rules for everything, and the topic is still evolving, but several issues have emerged so far. While the overview here is necessarily incomplete, it is a starting point on the issue. It is based on recent work by Hovy and Spruit (2016), Shah et al. (2020) as well as the two workshops by the Association for Computational Linguistics on Ethics in NLP (Alfano et al., 2018; Hovy et al., 2017). See those sources for further in‐depth discussion.

OVERVIEW

This article is not the first attempt to comprehensively address demographic factors in NLP. General bias frameworks in Artificial Intelligence (AI) exist that lay the necessary groundwork for our approach. For example, Friedler et al. (2021) defined fairness in algorithms in terms of capturing all the latent features (i.e., demographics) in the data. Suresh and Guttag (2019) suggested a qualitative framework for bias in machine learning, defining bias as a ‘potential harmful property of the data’, though they leave out demographic and modelling aspects. Hovy and Spruit (2016) noted three qualitative sources of bias: data, modelling and research design, related to demographic bias, overgeneralization and topic exposure. In Shah et al. (2020), these and other frameworks are combined under a joint mathematical approach. Blodgett et al. (2020) provide an extensive survey of the way bias is studied in NLP. They point out weaknesses in research design and recommend grounding work analysing ‘bias’ in NLP systems in the relevant literature outside of NLP, understanding why system behaviours can be harmful and to whom, and engaging in a conversation with the communities affected by NLP systems. One thing to stress is that ‘bias’ per se is neither good nor bad: in a Bayesian framework, the prior P(X) serves as a bias, the expectation or base rate we should have for something before we see any further evidence. In real life, many of our reactions to everyday situations are biases that make our lives easier. Biases as a preset are not necessarily an issue: they only become problematic when they are kept even in the face of contradictory evidence or when applied to areas they were not meant for. Many of the biases we will discuss here can also represent a form of information: as a diagnostic tool about the state of society (Garg et al., 2018; Kozlowski et al., 2019), or as a way to regularize our models (Plank et al., 2014a; Uma et al., 2020). 
However, as input to predictive systems, these biases can have severe consequences and exacerbate existing inequalities between users. Figure 1 shows the five sources of bias we discuss in this article. The first entry point for bias in the NLP pipeline is the choice of data for experimentation: selection bias is introduced by the samples chosen for training or testing an NLP model. The labels chosen for training and the procedure used to annotate them introduce annotation bias. The third type of bias is introduced by the choice of representation used for the data. The choice of models or machine learning algorithms also introduces the issue of bias amplification. Finally, the entire research design process can introduce bias if researchers are not careful with their choices in the NLP pipeline. In what follows, we discuss each of these biases in detail and provide insights into how they occur and how to mitigate them.
FIGURE 1

Schematic of the five bias sources in the general natural language processing pipeline


Bias from data

NLP systems reflect biases in the language data used for training them. Many data sets are created from long‐established news sources (e.g., Wall Street Journal, Frankfurter Rundschau from the 1980s through the 1990s), a very codified domain predominantly produced by a small, homogeneous sample: typically white, middle‐aged, educated, upper‐middle‐class men (Garimella et al., 2019; Hovy & Søgaard, 2015). However, many syntactic analysis tools (taggers and parsers) are still trained on this newswire data from the 1980s and 1990s. Modern syntactic tools, therefore, expect everyone to speak like journalists from the 1980s. It should come as no surprise that most people today do not: language has evolved since then, and expressions that were ungrammatical then are acceptable today, ‘because internet’ (McCulloch, 2020). NLP is, therefore, unprepared to cope with this demographic variation. Models trained on these data sets treat language as if it resembles this restricted training data, creating demographic bias. For example, Hovy (2015) and Jørgensen et al. (2015) have shown that this bias leads to significantly decreased performance for people under 35 and ethnic minorities, even in simple NLP tasks like finding verbs and nouns (i.e., part‐of‐speech tagging). The results are ageist, racist or sexist models that are biased against the respective user groups. This is the issue of selection bias, which is rooted in the data. When choosing a text data set to work with, we are also making decisions about the demographic groups represented in the data. As a result of the demographic signal present in language, any data set carries a demographic bias, that is, latent information about the demographic groups present in it. As humans, we would not be surprised if someone who grew up hearing only their dialect would have trouble understanding other people. 
If our data set is dominated by the ‘dialect’ of a specific demographic group, we should not be surprised that our models have problems understanding others. Most data sets have some built‐in bias, and in many cases, it is benign. It becomes problematic when this bias negatively affects certain groups or disproportionately advantages others. On biased data sets, statistical models will overfit to the presence of specific linguistic signals that are particular to the dominant group. As a result, the model will work less well for other groups, that is, it excludes demographic groups. Hovy (2015) and Jørgensen et al. (2015) have shown the consequences of exclusion for various groups, for example, people under 35 and speakers of African‐American Vernacular English. Part‐of‐speech (POS) tagging models have a significantly lower accuracy for young people and ethnic minorities, vis‐à‐vis the dominant demographics in the training data. Apart from exclusion, these models will pose a problem for future research. Given that a large part of the world's population is currently under 30, such models will degrade even more over time and ultimately not meet their users' needs. This issue also has severe ramifications for the general applicability of any findings using these tools. In psychology, most studies are based on college students, a very specific demographic: Western, educated, industrialized, rich and democratic research participants (so‐called WEIRD; Henrich et al., 2010). The assumption that findings from this group would generalize to all other demographics has proven wrong and led to a heavily biased corpus of psychological data and research.

Counter‐measures

Potential counter‐measures to demographic selection bias can be simple. The most salient is undoubtedly to pay more attention to how data is collected and to clarify what went into the construction of the data set. Bender and Friedman (2018) proposed a framework to document these decisions in a Data Statement. This statement includes various aspects of the data collection process and the underlying demographics. It provides future researchers with a way to assess the effect of any bias they might notice when using the data. As a beneficial side effect, it also forces us to consider how our data is made up. For already existing data sets, post‐stratification is an option: the down‐sampling of over‐represented groups in the training data to even out the distribution until it reflects the actual population. Mohammady and Culotta (2014) have shown how existing demographic statistics can be used as supervision. In general, we can use measures that address overfitting or imbalanced data to correct for demographic bias in data. However, as various papers have pointed out (Bender et al., 2021; Hutchinson et al., 2021), addressing data bias is not a ‘one‐and‐done’ exercise but requires continual monitoring throughout a data set's lifecycle. Alternatively, we can also collect additional data to balance existing data sets to account for exclusions or misrepresentations. Webster et al. (2018) released a gender‐balanced data set for the co‐reference resolution task. Zhao et al. (2017) also explore balancing a data set with gender confounds for multi‐label object classification and visual semantic role labelling tasks. Data augmentation by controlling the gender attribute is an effective technique for mitigating gender bias in NLP systems (Dinan et al., 2020; Sun et al., 2019). Wei and Zou (2019) explore data augmentation techniques that improve performance on various text classification tasks.
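The gender-controlled augmentation idea can be sketched in a few lines. This is a minimal illustration, not the method of any cited paper: the word-pair list is a tiny, hypothetical sample, and a real system would use curated pair lists and handle casing and ambiguous forms (e.g., ‘her’ can map to either ‘him’ or ‘his’).

```python
# Sketch of counterfactual data augmentation by gender swapping,
# in the spirit of Dinan et al. (2020) and Sun et al. (2019).
# The pair list is illustrative and deliberately incomplete.
GENDER_PAIRS = {
    "he": "she", "she": "he",
    "him": "her", "her": "his",  # simplification: 'her' is ambiguous
    "his": "her",
    "man": "woman", "woman": "man",
}

def swap_gender(tokens):
    """Return a copy of the token list with gendered terms swapped."""
    return [GENDER_PAIRS.get(t.lower(), t) for t in tokens]

def augment(corpus):
    """Pair every sentence with its gender-swapped counterfactual."""
    augmented = []
    for sent in corpus:
        augmented.append(sent)
        swapped = swap_gender(sent)
        if swapped != sent:  # skip sentences without gendered terms
            augmented.append(swapped)
    return augmented
```

Training on the union of original and swapped sentences evens out gendered contexts without collecting new data, at the cost of occasionally producing awkward counterfactuals.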

Bias from annotations

Annotation can introduce bias in various forms through a mismatch of the annotator population with the data. This is the issue of label bias. Label and selection bias can, and most often do, interact, so it can be challenging to distinguish them. This interaction does, however, underscore how important it is to address them jointly. There are several ways in which annotations introduce bias. In its simplest form, bias arises because annotators are distracted, uninterested, or lazy about the annotation task. As a result, they choose the ‘wrong’ labels. More problematic is label bias from informed and well‐meaning annotators who systematically disagree. Plank et al. (2014b) have shown that this type of bias arises when there is more than one possible correct label. For example, the term ‘social media’ can be validly analysed as either a noun phrase composed of an adjective and a noun, or a noun compound, composed of two nouns. Which label an annotator chooses depends on their interpretation of how lexicalized the term ‘social media’ is. If they perceive it as fully lexicalized, they will choose a noun compound. If they believe the process is still ongoing, that is, the phrase is analytical, they will choose an ‘adjective plus noun’ construct. Two annotators with these opposing views will systematically label ‘social’ as an adjective or a noun, respectively. While we can spot the disagreement, we cannot discount either of them as wrong or malicious. Finally, label bias can result from a mismatch between authors' and annotators' linguistic and social norms. Sap et al. (2019) showed that annotations reflect social and demographic differences: for example, annotators rate the utterances of different ethnic groups differently, and they mistake innocuous banter for hate speech because they are unfamiliar with the communication norms of the original speakers. There has been a movement towards increasingly using annotations from crowdsourcing rather than trained expert annotators. 
While crowdsourcing is cheaper and (in theory) matches the quality of trained annotators (Snow et al., 2008), it does introduce a range of biases. For example, various works have shown that crowdsourced annotators' demographic makeup is not as representative as one might hope (Pavlick et al., 2014). On the one hand, crowdsourcing is easier to scale, potentially covering more diverse backgrounds than we would find in expert annotator groups. On the other hand, it is much harder to train and communicate with crowdsourced annotators, and their incentives might not align with the projects we care about. For example, suppose we ask crowd workers to annotate concepts like dogmatism, hate speech, or microaggressions. Their answers will inherently include their societal perspective on these concepts. This bias can be good or bad, depending on the sample of annotators: we may get multiple perspectives that approximate the population as a whole, or the annotations may be skewed due to the selection. However, we might also not want various perspectives if there is a theoretically motivated and well‐defined way in which we plan to annotate. Crowdsourcing and its costs raise several other ethical questions about worker payment and fairness (Fort et al., 2011). Malicious annotators are luckily relatively easy to spot and can be remedied by using multiple annotations per item and aggregating them with an annotation model (Hovy et al., 2013; Passonneau & Carpenter, 2014; Paun et al., 2018). These models help us find biased annotators and let us account for human disagreement between labels. A free online version of such a tool is available at https://mace.unibocconi.it/. They presuppose, however, that there is a single correct gold label for each data point and that annotations are simply corruptions of it. 
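The simplest version of the aggregation step just described is a majority vote over annotators. Models like MACE go further by weighting annotators according to their estimated reliability; the baseline below (function and variable names are ours, for illustration) shows the basic idea and also surfaces the per-item agreement rate, which flags items where annotators disagree.

```python
from collections import Counter

def aggregate(annotations):
    """Majority-vote aggregation over multiple annotations per item.

    `annotations` maps item id -> list of labels, one per annotator.
    Returns (gold, agreement): the winning label per item and the
    fraction of annotators who chose it. Low agreement can indicate
    either unreliable annotators or genuinely confusable categories.
    """
    gold, agreement = {}, {}
    for item, labels in annotations.items():
        label, n = Counter(labels).most_common(1)[0]
        gold[item] = label
        agreement[item] = n / len(labels)
    return gold, agreement
```

Items with low agreement are exactly those where discarding the minority votes risks erasing a valid alternative analysis, which motivates the disagreement-aware training described next.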
If there is more than one possible correct answer, we can use disagreement information in the update process of our models (Fornaciari et al., 2021; Plank et al., 2014a; Uma et al., 2020). That is, we can encourage the models to make smaller updates if human annotators easily confuse the categories with each other (say, adjectives and nouns in noun compounds like ‘social media’), and regular updates if the categories are mutually exclusive (such as verbs and nouns). The only way to address mismatched linguistic norms is to pay attention to selecting annotators (i.e., matching them to the author population in terms of linguistic norms) or to provide them with dedicated training. The latter should generally be considered: while annotator training is time‐intensive and potentially costly, it can be worth the effort in terms of better and less biased labels.
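One common way to realize this, sketched below with illustrative names rather than the exact formulation of the cited papers, is to turn annotator vote counts into soft target distributions and train with cross-entropy against them: a confidently wrong prediction is penalized less when the annotators themselves were split.

```python
import math

def soft_targets(vote_counts, label_set):
    """Turn raw annotator vote counts into a target distribution."""
    total = sum(vote_counts.get(l, 0) for l in label_set)
    return [vote_counts.get(l, 0) / total for l in label_set]

def soft_cross_entropy(pred_probs, target_probs, eps=1e-12):
    """Cross-entropy against soft targets. With split votes (e.g.
    'social' tagged ADJ by some annotators and NOUN by others), the
    gradient for a prediction matching either plausible label is
    smaller than it would be against a one-hot gold label."""
    return -sum(t * math.log(p + eps)
                for p, t in zip(pred_probs, target_probs))
```

For mutually exclusive categories, all votes land on one label, the target collapses to one-hot, and the loss reduces to standard cross-entropy, so the model makes regular updates there.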

Bias from input representations

Even balanced, well‐labelled data sets contain bias: the most common input representations in NLP systems, word embeddings (Mikolov et al., 2013), have been shown to pick up on racial and gender biases in the training data (Bolukbasi et al., 2016; Manzini et al., 2019). For example, ‘woman’ is associated with ‘homemaker’ in the same way ‘man’ is associated with ‘programmer’. There has been some justified scepticism over whether these analogy tasks are the best way to evaluate embedding models (Nissim et al., 2020), but there is plenty of evidence that (1) embeddings do capture societal attitudes (Bhatia, 2017; Garg et al., 2018; Kozlowski et al., 2019), and that (2) these societal biases are resistant to many correction methods (Gonen & Goldberg, 2019). This is the issue of semantic bias. These biases hold not just for word embeddings but also for the contextual representations of the big pre‐trained language models that are now widely used in NLP systems. As they are pre‐trained on almost the entire available internet, they are even more prone to societal biases. Several papers have shown that these models reproduce and thereby perpetuate these biases and stereotypes (Kurita et al., 2019; Tan & Celis, 2019). There exists a plethora of efforts for debiasing embeddings (Bolukbasi et al., 2016; Sun et al., 2019; Zhao et al., 2017, 2019). However, the impact and applicability of debiased embeddings on a wide range of downstream tasks remain unclear. As stated above, biases are usually masked, not entirely removed, by these methods. Even if it were possible to remove biases from the embeddings, it is not always clear whether doing so is useful (bias might carry information). A central issue is the language models' training objective: to predict the most likely next term, given the previous context (n‐grams). 
While this objective captures distributional semantic properties, it may not by itself contribute to building unbiased embeddings, as it represents the world as we find it, rather than as we would like it to be (a descriptive rather than normative view). In general, when using embeddings for downstream applications, it is good practice to be aware of their biases. This awareness helps to identify the applicability of such embeddings to your specific domains and tasks. For example, these models are not directly applicable to data sets that contain scientific articles or medical terminologies. Recent work has focussed on debiasing embeddings for specific downstream applications and groups of the population: for example, debiasing embeddings to reduce gender bias in text classification (Prost et al., 2019), dialogue generation (Dinan et al., 2020; Liu et al., 2020) and machine translation (Font & Costa‐jussà, 2019). Such efforts are more conscious of the effects of debiasing on the target application. Additional metrics, approaches and data sets have been proposed to measure the bias inherent in large language models and their sentence completions (Nangia et al., 2020; Nozza et al., 2021; Sheng et al., 2019).
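The analogy-style association above can be made concrete as a projection onto a gender direction, roughly in the style of Bolukbasi et al. (2016). The two-dimensional vectors below are fabricated purely for illustration, and the function names are ours; real audits use trained embeddings and curated word sets.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def bias_score(emb, word, a, b):
    """Projection of `word` onto the a -> b offset (e.g. 'man' -> 'woman').
    Positive scores mean the word leans towards the b end of the axis."""
    direction = [x - y for x, y in zip(emb[b], emb[a])]
    return cosine(emb[word], direction)

# Toy, hand-made vectors that mimic the reported associations.
emb = {
    "man": [1.0, 0.0],
    "woman": [0.0, 1.0],
    "programmer": [0.9, 0.1],
    "homemaker": [0.1, 0.9],
}
```

On these toy vectors, ‘homemaker’ scores positive (towards ‘woman’) and ‘programmer’ negative (towards ‘man’), mirroring the associations reported for real embeddings. Gonen and Goldberg (2019) show that zeroing out such a direction masks, but does not remove, the underlying bias.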

Bias from models

Simply using ‘better’ training data is not a feasible long‐term solution: languages evolve continuously, so even a representative sample can only capture a snapshot, at best a short‐lived solution (see Fromreide et al., 2014). These biases compound to create severe performance differences for different user groups. Zhao et al. (2017) demonstrated that systems trained on biased data exacerbate that bias even further when applied to new data, and Kiritchenko and Mohammad (2018) have shown that sentiment analysis tools pick up on societal prejudices, leading to different outcomes for different demographic groups. For example, by merely changing the gender of a pronoun, the systems classified the sentence differently. Hovy et al. (2020) found that machine translation systems changed the perceived user demographics to make samples sound older and more male in translation. This issue is overamplification, which is rooted in the models themselves. One of the sources of overamplification is the choice of loss objective used in training the models. These objectives usually correspond to improving the precision of the predictions. Models might exploit spurious correlations (e.g., all positive examples in the training data happened to come from female authors, so that gender can be used as a discriminative feature) or statistical irregularities in the data set to achieve higher precision (Gururangan et al., 2018; Poliak et al., 2018). In other words, they might give the correct answers for the wrong reasons. This behaviour is hard to track until we find a consistent case of bias. Another issue with the design of machine learning models is that they always make a prediction, even when they are unsure or when they cannot know the answer. The latter could be due to the test data point lying outside the training data distribution or the model's representation space. Prabhumoye et al. (2021) discuss this briefly in a case study of machine translation systems. 
If a machine translation tool translates the gender‐neutral Turkish ‘O bir doktor, o bir hemşire’ into ‘He is a doctor, she is a nurse’, it might provide us with an insight into societal expectations (Garg et al., 2018). Still, it also induces an incorrect result the user did not intend. Ideally, models should report to the user that they could not translate rather than produce a wrong translation. The susceptibility of models to all aspects of the training data makes it so important to test our systems on various held‐out data sets rather than a single, designated test set. Recent work has explored objectives other than recall, F1 and so on, for example, the performance stratified by subgroup present in the data. These metrics can lead to fairer predictions across subgroups (Chouldechova, 2017; Corbett‐Davies & Goel, 2018; Dixon et al., 2018), for example, if the metrics show that the performance for a specific group is much lower than for the rest. Moving away from pure performance metrics and looking at the robustness and behaviour of the model in suites of specially designed cases can add further insights (Ribeiro et al., 2020). Card and Smith (2020) explore constraints to be specified on outcomes of models. Specifically, these constraints ensure that the proportion of predicted labels should be the same or approximately the same for each user group. More generally, methods designed to probe and analyse the model can help us understand how it reached decisions. Neural features like attention (Bahdanau et al., 2015) can provide visualizations. Kennedy et al. (2020) propose a sampling‐based algorithm to explore the impact of individual words on classification. As policy changes put an increased focus on explainable AI (EU High‐Level Expert Group on AI, 2019), such methods will likely become useful for both bias spotting and legal recourse. 
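Stratifying performance by subgroup, as suggested above, is straightforward to compute. The sketch below (function and variable names are ours) reports per-group accuracy plus the gap between the best- and worst-served groups, a simple diagnostic in the spirit of the subgroup metrics cited here; real fairness audits use richer criteria than accuracy alone.

```python
def stratified_accuracy(y_true, y_pred, groups):
    """Accuracy per demographic subgroup, plus the gap between the
    best- and worst-served group as a simple fairness diagnostic.

    A low overall accuracy gap does not guarantee fairness, but a
    large gap is a clear red flag that one group is under-served."""
    counts = {}  # group -> (correct, total)
    for yt, yp, g in zip(y_true, y_pred, groups):
        correct, total = counts.get(g, (0, 0))
        counts[g] = (correct + (yt == yp), total + 1)
    accs = {g: c / n for g, (c, n) in counts.items()}
    gap = max(accs.values()) - min(accs.values())
    return accs, gap
```

Running such a breakdown on several held-out sets, rather than one designated test set, is what surfaces the systematic per-group degradation described earlier.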
Systems that explicitly model user demographics will help produce both more personalized and less biased translations (Font & Costa‐jussà, 2019; Mirkin et al., 2015; Mirkin & Meunier, 2015; Saunders & Byrne, 2020; Stanovsky et al., 2019).

Bias from research design

Despite a growing interest in multi‐ and cross‐lingual work, most NLP research is still in and on English. It generally focuses on Indo‐European data/text sources, rather than other language groups or smaller languages, for example, in Asia or Africa (Joshi et al., 2020). Even if there is a potential wealth of data available from other languages, most NLP tools skew towards English (Munro, 2013; Schnoebelen, 2013). This underexposure is a self‐fulfilling prophecy: researchers are less likely to work on those languages for which there are not many resources. Instead, they work on languages and tasks for which data is readily available, potentially generating more data in the process. Consequently, there is a severe shortage for some languages but an overabundance for others. In a random sample of Tweets from 2013, there were 31 different languages (Plank, 2016), but no treebanks for about two‐thirds of them and even fewer semantically annotated resources like WordNets. Note that the number of language speakers does not necessarily correlate with the number of available resources. These were not obscure languages with few speakers, but often languages with millions of speakers. The shortage of syntactic resources has since been addressed by the Universal Dependency Project (Nivre et al., 2020). However, a recent paper (Joshi et al., 2020) found that most conferences still focus on the well‐resourced languages and are less inclusive of less‐resourced ones. This dynamic makes new research on smaller languages more complicated, and it naturally directs new researchers towards the existing languages, first among them English. The existence of off‐the‐shelf tools for English makes it easy to try new ideas in English. The focus on English may therefore be self‐reinforcing and has created an overexposure of this variety. 
The overexposure to English (as well as to particular research areas or methods) creates a bias described by the availability heuristic (Tversky & Kahneman, 1973). If we are exposed to something more often, we can recall it more efficiently, and if we can recall things quickly, we infer that they must be more important, bigger, better, more dangerous and so on. For instance, people estimate the size of cities they recognize to be larger than that of unknown cities (Goldstein & Gigerenzer, 2002). The same holds for the languages, methods and topics we research. Exploring other languages requires a much higher start‐up cost in terms of data annotation, basic analysis models and other resources. Overexposure can also create or feed into existing biases, for example, the notion that English is the ‘default’ language, even though both the morphology and syntax of English are global outliers. It is questionable whether NLP would have focused on n‐gram models to the same extent had it instead been developed on a morphologically complex language (e.g., Finnish or German). However, because of the unique structure of English, n‐gram approaches worked well, spread to become the default approach and only ran into problems when faced with different languages. Lately, there has been renewed interest beyond English, as there are economic incentives for NLP groups to work on and in other languages. Concurrently, new neural methods have made multi‐lingual and cross‐lingual approaches more feasible. These methods include, for example, multi‐lingual representations (Devlin et al., 2019; Nozza et al., 2020) and the zero‐shot learning they enable (e.g., Bianchi et al., 2021; Jebbara & Cimiano, 2019; Liu et al., 2019, inter alia). However, English is still one of the most widely spoken languages and by far the biggest market for NLP tools, so there are still more commercial incentives to work on English than on other languages, perpetuating the overexposure.
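The morphology argument can be illustrated with a minimal sketch. An analytic language expresses grammatical relations with separate function words, so a small, reusable vocabulary covers most of a corpus; an agglutinative language multiplies stems by suffix combinations, so any fixed corpus covers a smaller fraction of possible word forms and n‐gram counts become sparser. The stems and suffixes below are loosely Finnish‐flavoured toy data invented for this illustration:

```python
from itertools import product

# Minimal sketch (invented toy data): why n-gram models favour English-like
# morphology. Each combination of stem + case ending + possessive suffix
# yields a distinct surface word form that an n-gram model must count
# separately, inflating the vocabulary and thinning the counts.
stems = ["talo", "auto", "kirja"]          # hypothetical noun stems
cases = ["", "ssa", "sta", "lla", "lta"]   # hypothetical case endings
possessives = ["", "ni", "si", "mme"]      # hypothetical possessive suffixes

agglutinative_forms = {s + c + p for s, c, p in product(stems, cases, possessives)}

# An analytic language covers the same meanings with a handful of reusable
# function words plus uninflected nouns.
analytic_forms = {"house", "car", "book", "in", "from", "on", "my", "your", "our"}

print(len(analytic_forms), "vs", len(agglutinative_forms), "distinct word forms")
```

Even this tiny inventory yields 60 agglutinative forms against 9 analytic ones; with realistic stem and suffix inventories the gap grows combinatorially, which is exactly where count‐based n‐gram models struggle.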
One reason for the linguistic and cultural skew in research is the makeup of research groups themselves. In many cases, these groups do not reflect the demographic composition of the user base, so marginalized communities or speaker groups do not have their voices represented proportionally. Initiatives like Widening NLP are beginning to address this problem, but there is still much room for improvement. Finally, not analysing the behaviour of models sufficiently, or not fully disclosing that behaviour, can be harmful (Bianchi & Hovy, 2021). These omissions are not necessarily due to ill will, but are often the result of a relentless pressure to publish. One resulting bias is a failure to fully understand the intended use of trained models and how they can be misused (i.e., their dual use). The introduction of ethical‐considerations sections and ethics reviews at NLP venues is a step towards giving these aspects more attention and encouraging reflection. An interesting framework for thinking about these issues is the suggestion by Van de Poel (2016) to treat new technology (such as NLP) as a large‐scale social experiment, one in which we are all engaged at a massive scale. As an experiment, however, it needs to respect specific guidelines and ground rules. Experiments in the social and medical sciences face detailed requirements to gain the approval of an ethics committee or IRB (institutional review board). These revolve around the safety of the subjects and involve beneficence (no harm to subjects, maximize benefits, minimize risk), respect for subjects' autonomy (informed consent) and justice (weighing benefits against harms, protecting vulnerable subjects). Not all of these categories translate easily to NLP as a large‐scale experiment. However, the framing can help us place our decisions within specific philosophical schools of thought, as outlined by Prabhumoye et al. (2021).
There are no easy solutions to design bias, which might only become apparent in hindsight. However, any activity or measure that increases the chance of reflecting on the project can help counter inherent biases. For example, Emily Bender has suggested making overexposure bias more apparent by stating explicitly which language we work on, ‘even if it is English’ (Bender, 2019). There is, of course, no issue with research on English, but it should be made explicit that the results might not automatically hold for all languages. It can help to ask ourselves counterfactuals: ‘Would I research this if the data wasn't as easily available? Would my finding still hold in another language?’ We can also try to assess whether the research direction of a project feeds into existing biases or whether it overexposes certain groups. A way forward is to use various evaluation settings and metrics (Ribeiro et al., 2020). Some conferences have also started suggesting guidelines for assessing the potential for ethical issues with a system (e.g., the NAACL 2021 Ethics FAQ and guidelines). Human intervention and thought are required at every stage of the NLP application design lifecycle to prioritize equity and stakeholders from marginalized groups (Costanza‐Chock, 2020). Recent work by Bird (2020) suggests new ways of collaborating with Indigenous communities in the form of open discussions and proposes a postcolonial approach to computational methods for supporting language vitality. Finally, Havens et al. (2020) discuss the need for a bias‐aware methodology in NLP and present a case study executing one. Researchers have to be mindful of the entire research design: the data sets they choose, the annotation schemes or labelling procedures they follow, how they decide to represent the data, the algorithms they choose for the task and how they evaluate the automated systems.
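One concrete instance of ‘various evaluation settings’ is a behavioural invariance test in the spirit of Ribeiro et al. (2020): a model's prediction should not change when demographically marked names are swapped in otherwise identical inputs. The `toy_sentiment` function below is a stand‐in model invented purely for this sketch, not a real system or the authors' method:

```python
# Minimal sketch of a CheckList-style invariance test (after Ribeiro et al.,
# 2020). `toy_sentiment` is a hypothetical stand-in classifier; a real audit
# would call the model under evaluation instead.
def toy_sentiment(text: str) -> str:
    return "positive" if "great" in text else "negative"

# Perturb only the name slot; the label should stay constant across names.
template = "{name} said the movie was great."
names = ["Emily", "Aisha", "Jamal", "Wei"]

predictions = {name: toy_sentiment(template.format(name=name)) for name in names}
invariant = len(set(predictions.values())) == 1
print("invariance test passed:", invariant)
```

A real test suite would run many templates and name lists and report the failure rate per demographic group, rather than a single pass/fail flag.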
Researchers need to be aware of the real‐world applications of their work and consciously choose to help marginalized communities via technology (Asad et al., 2019).

CONCLUSION

This article outlined five of the most common sources of bias in NLP models: data selection, annotation, representations, models and our own research design. However, we are not merely at the mercy of these biases: there exists a growing arsenal of algorithmic and methodological approaches to mitigate biases from all sources. The most difficult might be bias from research design, which requires introspection and systematic analysis of our own preconceived notions and blind spots.
REFERENCES  (10 in total)

1.  Models of ecological rationality: the recognition heuristic.

Authors:  Daniel G Goldstein; Gerd Gigerenzer
Journal:  Psychol Rev       Date:  2002-01       Impact factor: 8.934

2.  Dual use and the ethical responsibility of scientists.

Authors:  Hans-Jörg Ehni
Journal:  Arch Immunol Ther Exp (Warsz)       Date:  2008-05-30       Impact factor: 4.291

3.  Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.

Authors:  Alexandra Chouldechova
Journal:  Big Data       Date:  2017-06       Impact factor: 2.128

4.  Associative judgment and vector space semantics.

Authors:  Sudeep Bhatia
Journal:  Psychol Rev       Date:  2017-01       Impact factor: 8.934

5.  Word embeddings quantify 100 years of gender and ethnic stereotypes.

Authors:  Nikhil Garg; Londa Schiebinger; Dan Jurafsky; James Zou
Journal:  Proc Natl Acad Sci U S A       Date:  2018-04-03       Impact factor: 11.205

6.  The Ugly Truth About Ourselves and Our Robot Creations: The Problem of Bias and Social Inequity.

Authors:  Ayanna Howard; Jason Borenstein
Journal:  Sci Eng Ethics       Date:  2017-09-21       Impact factor: 3.525

7.  The weirdest people in the world?

Authors:  Joseph Henrich; Steven J Heine; Ara Norenzayan
Journal:  Behav Brain Sci       Date:  2010-06-15       Impact factor: 12.579

8.  Five sources of bias in natural language processing.

Authors:  Dirk Hovy; Shrimai Prabhumoye
Journal:  Lang Linguist Compass       Date:  2021-08-20

9.  On Consequentialism and Fairness. (Review)

Authors:  Dallas Card; Noah A Smith
Journal:  Front Artif Intell       Date:  2020-05-08

10.  An Ethical Framework for Evaluating Experimental Technology.

Authors:  Ibo van de Poel
Journal:  Sci Eng Ethics       Date:  2015-11-14       Impact factor: 3.525
