| Literature DB >> 35231059 |
Diego Kozlowski, Dakota S Murray, Alexis Bell, Will Hulsey, Vincent Larivière, Thema Monroe-White, Cassidy R Sugimoto.
Abstract
Racial disparity in academia is a widely acknowledged problem. The quantitative understanding of race-based systemic inequalities is an important step towards a more equitable research system. However, because of the lack of robust information on authors' race, few large-scale analyses have been performed on this topic. Algorithmic approaches offer one solution, using known information about authors, such as their names, to infer their perceived race. As with any other algorithm, the process of racial inference can generate biases if it is not carefully considered. The goal of this article is to assess the extent to which algorithmic bias is introduced using different approaches for name-based racial inference. We use information from the U.S. Census and mortgage applications to infer the race of U.S.-affiliated authors in the Web of Science. We estimate the effects of using given and family names, thresholds or continuous distributions, and imputation. Our results demonstrate that the validity of name-based inference varies by race/ethnicity and that threshold approaches underestimate Black authors and overestimate White authors. We conclude with recommendations to avoid potential biases. This article lays the foundation for more systematic and less-biased investigations into racial disparities in science.
Entities:
Mesh:
Year: 2022 PMID: 35231059 PMCID: PMC8887775 DOI: 10.1371/journal.pone.0264270
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Sample of family names (U.S. Census) and given names (mortgage data).
| Type | Name | Asian | Black | Hispanic | White | N |
|---|---|---|---|---|---|---|
| Given | Juan | 1.5% | 0.5% | 93.4% | 4.5% | 4,019 |
| Given | Doris | 3.4% | 13.5% | 6.3% | 76.7% | 1,332 |
| Given | Andy | 38.8% | 1.6% | 6.4% | 53.2% | 555 |
| Family | Rodriguez | 0.6% | 0.5% | 94.1% | 4.8% | 1,094,924 |
| Family | Lee | 43.8% | 16.9% | 2.0% | 37.3% | 693,023 |
| Family | Washington | 0.3% | 91.6% | 2.7% | 5.4% | 177,386 |
Fig 1. Manual validation of racial categories.
Fig 2. Given names weight distribution by given and family name skewness.
Simulated data.
Racial representation of family names (U.S. Census) and given names (mortgage data).
| Race | U.S. Census (family names) | Mortgage data (given names) |
|---|---|---|
| Asian | 5.0% | 6.3% |
| Black | 12.4% | 4.2% |
| Hispanic | 16.5% | 6.9% |
| White | 66.1% | 82.6% |
Fig 3. Changes in group shares, and people retrieved, by threshold.
Census (family names) and mortgage (given names) datasets. Evolution across thresholds between 0 and 1 (A), and detail for thresholds between 0.9 and 1 (B).
Fig 4. Resulting distribution of different models with a 90% threshold.
Fractional counting on family names for comparison.
Fig 5. Retrieval of authors by race using different inference models for varying thresholds.
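The mechanism behind the threshold effects in Figs 3 and 5 can be reproduced on toy data: groups whose names are on average less racially informative (lower top probability) lose more people as the threshold rises. All distributions below are illustrative, not taken from the paper's datasets:

```python
# Illustrative per-person race distributions (not real data). The second
# person has an ambiguous name, so a high threshold drops them.
people = [
    {"Black": 0.92, "White": 0.08},  # highly informative name
    {"Black": 0.65, "White": 0.35},  # ambiguous name
    {"White": 0.95, "Black": 0.05},
    {"White": 0.93, "Black": 0.07},
]

def classified(people, threshold):
    """Keep only people whose top race probability reaches the threshold,
    returning the assigned labels."""
    out = []
    for probs in people:
        race, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            out.append(race)
    return out

for t in (0.5, 0.9):
    labels = classified(people, t)
    print(t, len(labels), labels.count("Black") / len(labels))
# At 0.5 all four people are kept (Black share 0.5); at 0.9 the ambiguous
# case is dropped and the estimated Black share falls to 1/3.
```

Raising the threshold therefore changes not just how many people are retrieved, but the apparent racial composition of those who remain.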
Racial distribution in U.S. Census and WoS U.S. Authors with known family names.
| Race | U.S. Census | WoS U.S. authors |  |
|---|---|---|---|
| Asian | 5.0% | 8.2% | 24.5% |
| Black | 12.4% | 8.8% | 7.2% |
| Hispanic | 16.5% | 14.1% | 5.4% |
| White | 66.1% | 68.8% | 59.4% |
Fig 6. Proportion of temporary visa holders by racial group.
General recommendations for implementing a name-based inference of race for U.S. authors.
| Do | Don't |
|---|---|
| Use only family names from the U.S. Census to avoid bias. | Do not use given names, except when the underlying distribution of your dataset matches that of the mortgage data. |
| Consider each person in your data as a distribution and adapt your summary statistics. | Do not use a threshold for categorical classification of each person, as this under-represents the Black population, due to the correlation between racial groups and name informativeness. |
| If needed, first calculate the aggregated distribution on your dataset, and use this for imputation of missing cases. Acknowledge the potential bias of imputation. | Do not use the census aggregate distribution for imputation, except when your target population matches the U.S. population. |
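The imputation recommendation above — impute missing cases from your dataset's own aggregate distribution rather than from the census — can be sketched as follows. The numbers are hypothetical and only show the mechanics:

```python
# People whose names yielded a race distribution; `missing` counts people
# whose names could not be matched to any reference dataset.
known = [
    {"Asian": 0.1, "Black": 0.2, "Hispanic": 0.3, "White": 0.4},
    {"Asian": 0.3, "Black": 0.1, "Hispanic": 0.1, "White": 0.5},
]
missing = 1

races = ["Asian", "Black", "Hispanic", "White"]

# Aggregate distribution of the known part of *this* dataset ...
agg = {r: sum(p[r] for p in known) / len(known) for r in races}

# ... used as the imputed distribution for each missing case.
totals = {r: sum(p[r] for p in known) + missing * agg[r] for r in races}

print(agg)     # dataset-level distribution used for imputation
print(totals)  # expected counts over all three people; sums to 3.0
```

Using the dataset's own aggregate keeps the imputed cases consistent with the population actually under study; as the table notes, census-based imputation is only appropriate when the target population mirrors the U.S. population.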