Kathleen C. Fraser, Svetlana Kiritchenko, Isar Nejadgholi.
Abstract
Stereotypes are encountered every day, in interpersonal communication as well as in entertainment, news stories, and on social media. In this study, we present a computational method to mine large, naturally occurring datasets of text for sentences that express perceptions of a social group of interest, and then map these sentences to the two-dimensional plane of perceived warmth and competence for comparison and interpretation. This framework is grounded in established social psychological theory, and validated against both expert annotation and crowd-sourced stereotype data. Additionally, we present two case studies of how the model might be used to answer questions using data "in-the-wild," by collecting Twitter data about women and older adults. Using the data about women, we are able to observe how sub-categories of women (e.g., Black women and white women) are described similarly and differently from each other, and from the superordinate group of women in general. Using the data about older adults, we show evidence that the terms people use to label a group (e.g., old people vs. senior citizens) are associated with different stereotype content. We propose that this model can be used by other researchers to explore questions of how stereotypes are expressed in various large text corpora.
Keywords: biased language; computational model; computational social science; natural language processing; sentence embeddings; social media analysis; stereotypes; text analysis
Year: 2022 PMID: 35514953 PMCID: PMC9063736 DOI: 10.3389/frai.2022.826207
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Testing the linguistic capabilities of each model.

| Test category | Metric | Example (with gold label) |
|---|---|---|
| Basic functionality | 1D accuracy | These people are always friendly (Label: warm) |
| Negation | 1D accuracy | These people are never friendly (Label: cold) |
| Semantic composition | 2D accuracy | These people are always friendly and smart (Label: warm and competent) |
| Syntactic variability | 2D accuracy | This group is known for being friendly as well as smart (Label: warm and competent) |
While the models always predict two values (warmth and competence) for each sentence, the lexicon data provide gold labels for only one dimension (warmth or competence). Therefore, in the first two cases, each test case has only one gold label, and so accuracy is measured by whether the model correctly assigns positive vs. negative warmth or competence. In the last two cases, each test case is associated with gold values for both warmth and competence dimensions, and so the accuracy is measured for both dimensions (i.e., the model must place the sentence in the correct quadrant).
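The 1D vs. 2D accuracy logic described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; it assumes the models output continuous scores whose sign indicates the predicted pole on each dimension.

```python
# Hedged sketch: sign-based 1D and 2D (quadrant) accuracy, as described
# in the text. Scores and labels below are illustrative, not real data.

def quadrant(warmth: float, competence: float) -> tuple:
    """Map continuous warmth/competence scores to a sign-based quadrant."""
    return (warmth >= 0, competence >= 0)

def accuracy_1d(preds, golds):
    """preds: predicted scores on one dimension; golds: True if positive pole."""
    return sum((p >= 0) == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_2d(preds, golds):
    """preds: (warmth, competence) pairs; golds: gold quadrants as sign pairs."""
    return sum(quadrant(*p) == g for p, g in zip(preds, golds)) / len(golds)

# "These people are always friendly and smart" -> warm and competent
print(accuracy_2d([(0.8, 0.6)], [(True, True)]))  # 1.0
```

A 2D prediction counts as correct only when both signs match the gold quadrant, which is why 2D accuracies in the table below are systematically lower than 1D accuracies.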
Mean accuracy (with standard deviation in parentheses) across folds for each combination of model, configuration, and functional test category. The configuration column headers were lost in extraction and are left blank; blank data cells indicate values lost in extraction.

| Test | Model | | | | | | |
|---|---|---|---|---|---|---|---|
| Basic | RoBERTa-STS | 94.5 (2.1) | 93.6 (1.8) | 95.4 (2.0) | 95.3 (2.0) | 95.3 (2.0) | 96.2 (0.8) |
| | RoBERTa-NLI | | | | | | |
| | RoBERTa-para | 92.7 (3.6) | 90.2 (3.2) | 92.3 (3.5) | 95.3 (0.8) | 95.3 (1.7) | 94.5 (1.5) |
| | GloVe-average | 90.2 (5.5) | 80.1 (5.6) | 90.2 (4.3) | 92.2 (3.2) | 91.4 (3.7) | 92.3 (2.9) |
| | MPNet-para | 92.3 (2.4) | 94.0 (3.6) | 95.3 (2.5) | 95.4 (2.0) | 94.0 (2.2) | 95.7 (0.2) |
| Negation | RoBERTa-STS | 91.4 (3.8) | 92.3 (2.4) | 94.0 (2.5) | 93.1 (2.9) | 92.3 (3.2) | 93.2 (3.2) |
| | RoBERTa-NLI | | | | | | |
| | RoBERTa-para | 91.1 (2.3) | 88.1 (2.3) | 91.5 (1.2) | 91.6 (4.5) | 92.4 (2.3) | 92.8 (2.0) |
| | GloVe-average | 9.8 (5.5) | 19.9 (5.6) | 9.8 (4.3) | 7.8 (3.2) | 8.6 (3.7) | 7.7 (2.9) |
| | MPNet-para | 94.0 (5.5) | 92.7 (6.9) | 91.2 (4.7) | 94.5 (3.9) | 94.9 (3.9) | 93.3 (3.2) |
| Semantic | RoBERTa-STS | | 76.8 (10.0) | 75.0 (8.8) | 76.8 (10.1) | 76.9 (12.0) | 78.7 (7.6) |
| | RoBERTa-NLI | | | | | | |
| | RoBERTa-para | 64.4 (7.2) | 61.4 (5.2) | 67.3 (9.3) | 57.8 (7.2) | 58.6 (4.4) | 57.8 (8.7) |
| | GloVe-average | 62.5 (7.9) | 51.1 (8.1) | 71.0 (7.5) | 67.3 (8.5) | 63.5 (8.2) | 65.3 (7.0) |
| | MPNet-para | 58.5 (11.5) | 62.4 (8.5) | 67.4 (10.7) | 57.9 (7.0) | 61.6 (2.0) | 62.6 (5.6) |
| Syntactic | RoBERTa-STS | 69.2 (10.3) | | | | | 75.1 (13.1) |
| | RoBERTa-NLI | | | 71.0 (12.3) | 72.1 (9.3) | 72.0 (11.4) | |
| | RoBERTa-para | 57.6 (7.6) | 57.7 (7.9) | 49.0 (2.5) | 51.0 (6.3) | 54.0 (4.6) | 52.0 (9.5) |
| | GloVe-average | 54.8 (6.8) | 41.6 (6.2) | 64.2 (6.8) | 51.0 (4.9) | 52.0 (7.9) | 61.4 (6.9) |
| | MPNet-para | 63.3 (9.0) | 63.3 (11.9) | 61.3 (18.1) | 62.6 (8.8) | 57.6 (10.1) | 59.7 (6.9) |
For simplicity, we use the following abbreviations for the pretrained model names: RoBERTa-STS, STS-RoBERTa-large; RoBERTa-NLI, RoBERTa-large-NLI-mean-tokens; RoBERTa-para, paraphrase-distilRoBERTa-base-v2; GloVe-average, average-word-embeddings-GloVe.840B.300d; MPNet-para, paraphrase-MPNet-base-v2. Boldface indicates the highest accuracy in each column within each set of experiments; italics indicate the highest accuracy overall in each set of experiments.
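A common way to turn sentence embeddings from models like these into a warmth or competence score (not necessarily the authors' exact procedure) is to project an embedding onto an axis defined by opposite-pole seed sentences. A minimal sketch, with toy 2-d vectors standing in for real sentence-transformer outputs:

```python
import numpy as np

def polar_score(sentence_vec, pos_vec, neg_vec):
    """Signed projection of a sentence embedding onto the pos-neg axis.
    Positive values fall on the positive-pole side (e.g., warm)."""
    axis = pos_vec - neg_vec
    midpoint = (pos_vec + neg_vec) / 2
    return float(np.dot(sentence_vec - midpoint, axis) / np.linalg.norm(axis))

# Toy embeddings; a real pipeline would produce these with a sentence
# encoder applied to seed sentences such as "These people are friendly."
warm = np.array([1.0, 0.2])
cold = np.array([-1.0, 0.1])
sent = np.array([0.8, 0.15])  # a warm-leaning toy sentence embedding
print(polar_score(sent, warm, cold) > 0)  # True: lands on the warm side
```

The same projection with competence-pole seeds yields the second coordinate, giving each sentence a position on the warmth-competence plane.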
Correlation between three human annotators (A1, A2, and A3), and between manual and automatic annotations, for warmth and competence scores.

| | Warmth | Competence |
|---|---|---|
| Between A1 and A2 | 0.915 | 0.884 |
| Between A1 and A3 | 0.890 | 0.830 |
| Between A2 and A3 | 0.852 | 0.839 |
| Between manual and automatic (cross-validation) | 0.870 | 0.858 |
| Between manual and automatic (full dataset) | 0.880 | 0.873 |
Social groups in StereoSet, along with available research findings from the social sciences literature on stereotyping.

| StereoSet group | Closest group in the literature | Prediction |
|---|---|---|
| Nurse | Nurse (Fiske and Dupree) | HW-HC |
| Psychologist | Psychologist (Brambilla et al.) | HW-HC |
| Researcher | Researcher (Fiske and Dupree) | LW-HC |
| Commander | Command-and-control leadership (Cuddy et al.) | LW-HC |
| Manager | Manager (male) (Eckes) | LW-HC |
| Entrepreneur | Entrepreneur (Cuddy et al.) | LW-HC |
| Mathematician, physicist, chemist | Scientist (Fiske and Dupree) | LW-HC |
| Engineer | Engineer (Fiske and Dupree) | LW-HC |
| Software developer | Technical experts (Fiske) | LW-HC |
| Grandfather | The elderly (Cuddy et al.) | HW-LC |
| Mommy, mother | Housewife (Eckes) | HW-LC |
| Schoolboy, schoolgirl | Children (Fiske) | HW-LC |
| Male, gentleman | Men (Glick et al.) | LW-HC |
| Japanese | Japanese (Lee and Fiske) | LW-HC |
| African | African (Fiske) | LW-LC |
| Hispanic | Latino (Lee and Fiske) | LW-LC |
| Arab | Arab (Fiske et al.) | LW-LC |
The Prediction column lists the expected SCM quadrant for each group, based on the literature. HW, high warmth; LW, low warmth; HC, high competence; LC, low competence.
Figure 1. (Left) Plotting the average of the StereoSet stereotypes using the proposed method. (Right) Plotting the stereotypes using the baseline method of McKee et al. (2021). Groups that are correctly categorized according to the predictions from the literature are shown in blue, while incorrectly categorized groups are shown in pink.
Figure 2. Analyzing stereotypical language “in-the-wild”: steps for collecting, filtering, and analyzing real-life data about social groups.
The number of extracted sentences matching the syntactic pattern, along with the average and total number of words (sequences of alphanumeric characters) in the sentences, for each women-related target group.

| Target group | Sentences | Avg. words per sentence | Total words |
|---|---|---|---|
| Women | 28,229 | 12.96 | 365,911 |
| Feminists | 862 | 14.85 | 12,804 |
| Moms | 1,906 | 10.04 | 19,135 |
| Black women | 2,423 | 12.69 | 30,737 |
| White women | 1,000 | 12.52 | 12,522 |
Figure 3. Areas of highest density for the group women.
Words associated with different clusters and paraphrased example contexts where the words appear for each women-related target group. Word lists and example contexts were lost in extraction and are left blank; a dash marks cells that were empty in the original.

| Cluster | Sentences | Location (W–C) | Associated words | Example context |
|---|---|---|---|---|
| Women | | | | |
| Cluster 1 | 1,886 | | | |
| Cluster 2 | 475 | | | |
| Cluster 3 | 305 | | | |
| Cluster 4 | 503 | | | |
| Cluster 5 | 314 | | | |
| Cluster 6 | 3,117 | | | |
| Cluster 7 | 1,150 | | | |
| Feminists | | | | |
| Cluster 1 | 47 | | | |
| Cluster 2 | 386 | | | |
| Moms | | | | |
| Cluster 1 | 444 | | | |
| Cluster 2 | 140 | | | |
| Cluster 3 | 434 | | | |
| Black women | | | | |
| Cluster 1 | 742 | | | |
| Cluster 2 | 188 | | | |
| Cluster 3 | 647 | | | |
| White women | | | | |
| Cluster 1 | 39 | – | – | |
| Cluster 2 | 112 | | | |
| Cluster 3 | 520 | | | |
Up to 10 words with the highest association with each cluster are shown. The location of each cluster on the warmth (W)–competence (C) plane is also indicated.
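The cluster-and-keyword analysis above can be sketched under two stated assumptions: k-means is used to cluster the sentence embeddings, and cluster-associated words are ranked by a simple frequency ratio (cluster frequency over corpus frequency). The paper's exact clustering algorithm and association measure may differ.

```python
from collections import Counter
from sklearn.cluster import KMeans

def cluster_and_keywords(embeddings, sentences, k=2, top_n=10):
    """Cluster sentence embeddings; list words over-represented per cluster."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    overall = Counter(w for s in sentences for w in s.lower().split())
    keywords = {}
    for c in range(k):
        in_cluster = Counter(
            w for s, l in zip(sentences, labels) if l == c for w in s.lower().split()
        )
        # Rank by cluster frequency relative to corpus frequency.
        ranked = sorted(in_cluster, key=lambda w: in_cluster[w] / overall[w],
                        reverse=True)
        keywords[c] = ranked[:top_n]
    return labels, keywords

# Toy 2-d embeddings and sentences; real inputs would be high-dimensional
# sentence-encoder outputs over the extracted tweets.
labels, kw = cluster_and_keywords(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]],
    ["warm kind", "warm caring", "smart capable", "smart skilled"],
)
```

Each cluster's centroid could then be scored on the warmth-competence plane to produce the locations reported in the table.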
Figure 4. Areas of highest density for the different groups of women.
The number of extracted sentences with the target as nominal subject and the average and total number of words (sequences of alpha-numeric characters) in the sentences for each age-related target group.
| Target group | Sentences | Avg. words per sentence | Total words |
|---|---|---|---|
| Elderly | 7,840 | 19.40 | 152,097 |
| Old folks | 2,126 | 16.03 | 34,076 |
| Old people | 19,812 | 15.07 | 298,499 |
| Senior citizens | 1,705 | 17.46 | 29,766 |
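The extraction step behind these counts (keeping sentences where the target group is the nominal subject) relies on syntactic parsing in the paper. A surface-pattern approximation is sketched below for illustration only, matching "target are ..." as a crude stand-in for a real dependency check.

```python
import re

def matches_target_subject(sentence: str, target: str) -> bool:
    """Crude stand-in for a dependency-based nominal-subject filter:
    keep sentences where the target phrase is followed by plural 'are'."""
    return re.search(rf"\b{re.escape(target)}\s+are\b", sentence,
                     flags=re.IGNORECASE) is not None

examples = ["Old people are wise.", "I saw old people at the market."]
print([matches_target_subject(s, "old people") for s in examples])  # [True, False]
```

A parser-based filter would also catch subjects separated from the verb by modifiers (e.g., "old people in my town are ..."), which this pattern misses.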
Figure 5. Areas of highest density for the different labels for older adults.
Words associated with different clusters and paraphrased example contexts where the words appear for each age-related target group. Word lists and example contexts were lost in extraction and are left blank.

| Cluster | Sentences | Location (W–C) | Associated words | Example context |
|---|---|---|---|---|
| Elderly | | | | |
| Cluster 1 | 1,440 | | | |
| Cluster 2 | 291 | | | |
| Old folks | | | | |
| Cluster 1 | 325 | | | |
| Cluster 2 | 269 | | | |
| Old people | | | | |
| Cluster 1 | 1,651 | | | |
| Cluster 2 | 508 | | | |
| Senior citizens | | | | |
| Cluster 1 | 376 | | | |
| Cluster 2 | 406 | | | |
The location of each cluster on the warmth (W)–competence (C) plane is also indicated.