Abstract
We introduce a list that offers information on the relation between first names and race or ethnicity. Drawing information from mortgage applications, the list includes 4,250 first names and information on their respective counts and proportions across six mutually exclusive racial and Hispanic origin groups. These six categories are consistent with the categories used in the Census Bureau's list on surnames' demographic information. Also, just like the Census Bureau's list of surnames, the list of first names is highly aggregated, so as not to identify any specific individuals.
Year: 2018 PMID: 29509189 PMCID: PMC5839157 DOI: 10.1038/sdata.2018.25
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Classifying applicants into OMB categories based on race and ethnicity information in HMDA.
| OMB category | HMDA ethnicity | HMDA race |
|---|---|---|
| Hispanic or Latino | Hispanic or Latino | White |
| Hispanic or Latino | Hispanic or Latino | Black or African American |
| Hispanic or Latino | Hispanic or Latino | Asian |
| Hispanic or Latino | Hispanic or Latino | Native Hawaiian or Other Pacific Islander |
| Hispanic or Latino | Hispanic or Latino | American Indian or Alaska Native |
| Hispanic or Latino | Hispanic or Latino | Non-missing secondary race variable |
| White | Not Hispanic or Latino | White |
| Black | Not Hispanic or Latino | Black or African American |
| Asian/Nat. Haw./Other Pac. Isl. | Not Hispanic or Latino | Asian |
| Asian/Nat. Haw./Other Pac. Isl. | Not Hispanic or Latino | Native Hawaiian or Other Pacific Islander |
| American Indian/Alaska Native | Not Hispanic or Latino | American Indian or Alaska Native |
| Multi-race | Not Hispanic or Latino | Non-missing secondary race variable |
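The mapping in the table above can be sketched as a small lookup function. This is only an illustration of the table's rows, not the authors' code: the function name `classify`, the `has_secondary_race` flag, and the choice to return `None` for combinations the table does not list are all assumptions made here. In particular, checking the secondary race before the primary race for non-Hispanic applicants is an assumed precedence.

```python
from typing import Optional

# Primary race label -> category, for non-Hispanic applicants (rows of the table).
_RACE_TO_CATEGORY = {
    "White": "White",
    "Black or African American": "Black",
    "Asian": "Asian/Nat. Haw./Other Pac. Isl.",
    "Native Hawaiian or Other Pacific Islander": "Asian/Nat. Haw./Other Pac. Isl.",
    "American Indian or Alaska Native": "American Indian/Alaska Native",
}


def classify(ethnicity: str, race: Optional[str],
             has_secondary_race: bool = False) -> Optional[str]:
    """Map an (ethnicity, race) pair to a category from the table.

    Returns None for combinations the table does not list
    (an assumption of this sketch).
    """
    known_race = race in _RACE_TO_CATEGORY
    if ethnicity == "Hispanic or Latino":
        # Every listed Hispanic row maps to the same category.
        if known_race or has_secondary_race:
            return "Hispanic or Latino"
        return None
    if ethnicity == "Not Hispanic or Latino":
        # Assumed precedence: a non-missing secondary race means Multi-race.
        if has_secondary_race:
            return "Multi-race"
        return _RACE_TO_CATEGORY.get(race) if known_race else None
    return None


print(classify("Not Hispanic or Latino", "Asian"))
```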
Sample size, by name.
The table presents the number of observations used in calculating the proportions for race/ethnicity for the 4,250 first names and 11,299 surnames in our list. The minimum number of observations for each name is 30, except for cases in which the proportion is unity for a single category and it is based on 15-29 observations. The list of 4,250 first names is based on 2,449,240 first name observations, while the list of 11,299 surnames is based on 1,240,098 surname observations.

| Observations per name | First names | % of first names | Surnames | % of surnames |
|---|---|---|---|---|
| [15–29] | 376 | 8.8% | 3,522 | 31.2% |
| [30–49] | 1,159 | 27.3% | 3,060 | 27.1% |
| [50–99] | 1,006 | 23.7% | 2,373 | 21.0% |
| [100–249] | 739 | 17.4% | 1,453 | 12.8% |
| [250–499] | 337 | 7.9% | 511 | 4.5% |
| [500–999] | 225 | 5.3% | 225 | 2.0% |
| 1000+ | 408 | 9.6% | 155 | 1.4% |
| Total | 4,250 | 100.0% | 11,299 | 100.0% |
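The inclusion rule stated in the note to this table (at least 30 observations, or 15-29 observations when a single category has a proportion of unity) can be expressed as a short check. This is a minimal sketch under that reading of the rule; the function name `meets_inclusion_rule` and the input format (a list of per-category counts for one name) are hypothetical.

```python
def meets_inclusion_rule(category_counts):
    """Return True if a name qualifies for the list under the stated rule.

    category_counts: observation counts for one name, one entry per
    race/ethnicity category (hypothetical input format).
    """
    total = sum(category_counts)
    if total >= 30:
        return True
    # With 15-29 observations, the name is kept only when every
    # observation falls in a single category (proportion of unity).
    return 15 <= total <= 29 and max(category_counts) == total


print(meets_inclusion_rule([20, 0, 0, 0, 0, 0]))
```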
Description of fields.
| Field | Description |
|---|---|
| firstname | First name |
| obs | Number of occurrences in the combined mortgage datasets |
| pcthispanic | Percent Hispanic or Latino |
| pctwhite | Percent Non-Hispanic White |
| pctblack | Percent Non-Hispanic Black or African American |
| pctapi | Percent Non-Hispanic Asian or Native Hawaiian or Other Pacific Islander |
| pctaian | Percent Non-Hispanic American Indian or Alaska Native |
| pct2prace | Percent Non-Hispanic Two or More Races |
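As an illustration of how one record with these fields might be used, the sketch below picks the category with the largest share for a single row. The field names come from the table above; the helper `most_likely_category`, the row values, and the assumption that percentages are stored on a 0-100 scale are all hypothetical and not taken from the published list.

```python
# Percentage fields, in the order listed in the field description table.
PCT_FIELDS = (
    "pcthispanic", "pctwhite", "pctblack",
    "pctapi", "pctaian", "pct2prace",
)


def most_likely_category(row):
    """Return (field, value) for the largest percentage in one row.

    Ties resolve to the first field in PCT_FIELDS order.
    """
    best = max(PCT_FIELDS, key=lambda field: row[field])
    return best, row[best]


# Made-up values for illustration only; not a row from the actual list.
example_row = {
    "firstname": "EXAMPLE", "obs": 120,
    "pcthispanic": 10.0, "pctwhite": 70.0, "pctblack": 12.5,
    "pctapi": 5.0, "pctaian": 0.0, "pct2prace": 2.5,
}

print(most_likely_category(example_row))
```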
Composition of validation dataset in terms of race and ethnicity.
| Non-Hispanic | 10.9% | 10.5% | 66.8% | 0.8% | 89.0% |
| Hispanic | 0.3% | 0.3% | 10.0% | 0.4% | 11.0% |
| Total | 11.2% | 10.8% | 76.8% | 1.2% | 100.0% |
Coverage for first name and surname demographic information.
| | First name information present | First name information missing | Total |
|---|---|---|---|
| Surname information present | 15,159 | 2,131 | 17,290 |
| Surname information missing | 2,036 | 674 | 2,710 |
| Total | 17,195 | 2,805 | 20,000 |
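The coverage shares implied by this table follow directly from the four cell counts. The arithmetic below is a small worked example; the variable names are chosen here for illustration.

```python
# Cell counts from the coverage table.
both_present = 15_159   # first name and surname information present
first_only = 2_036      # first name present, surname missing
surname_only = 2_131    # surname present, first name missing
neither = 674           # both missing

total = both_present + first_only + surname_only + neither

share_both = both_present / total
share_first = (both_present + first_only) / total
share_surname = (both_present + surname_only) / total
share_any = (total - neither) / total

print(f"both: {share_both:.1%}, first name: {share_first:.1%}, "
      f"surname: {share_surname:.1%}, at least one: {share_any:.1%}")
```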
Figure 1. First Names -- Distribution of proportions across race/ethnicity categories.
These four categories (i.e., Hispanic; NH White; NH Black; NH Asian, Native Hawaiian or Other Pacific Islander) reflect 99 percent of the validation dataset. The empirical distributions are presented using horizontal box plots, with the x-axis denoting proportions. The line inside the box denotes the median, while the interior of the box denotes the interquartile range (IQR). In each box, the upper adjacent value is equal to the upper quartile plus 1.5*IQR, while the lower adjacent value is equal to the lower quartile minus 1.5*IQR. For presentation purposes, we exclude outliers.
Figure 2. Surnames -- Distribution of proportions across race/ethnicity categories.
These four categories (i.e., Hispanic; NH White; NH Black; NH Asian, Native Hawaiian or Other Pacific Islander) reflect 99 percent of the validation dataset. The empirical distributions are presented using horizontal box plots, with the x-axis denoting proportions. The line inside the box denotes the median, while the interior of the box denotes the interquartile range (IQR). In each box, the upper adjacent value is equal to the upper quartile plus 1.5*IQR, while the lower adjacent value is equal to the lower quartile minus 1.5*IQR. For presentation purposes, we exclude outliers.
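The box-plot quantities described in the two figure captions can be computed as follows. This is a sketch that follows the captions' formulas literally, using Python's `statistics.quantiles` with its default quartile method, which may differ from the method used to draw the figures. Some box-plot conventions instead place the adjacent values at the most extreme observations that fall within these limits. The function name `box_plot_summary` and the sample proportions are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Median, quartiles, IQR, and the 1.5*IQR limits from the captions."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": q1 - 1.5 * iqr,  # lower quartile minus 1.5*IQR
        "upper_limit": q3 + 1.5 * iqr,  # upper quartile plus 1.5*IQR
    }


# Made-up proportions for illustration only.
sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
summary = box_plot_summary(sample)
print(summary)
```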