Francis P Boscoe, Maria J Schymura, Xiuling Zhang, Rachel A Kramer.
Abstract
We compared several techniques for assigning Hispanic ethnicity to records in data systems where this information may be missing, variously making use of country of origin, surname, race, and county of residence. We considered an algorithm in use by the North American Association of Central Cancer Registries (NAACCR), a variation of this developed by the authors, a "fast and frugal" algorithm developed with the aid of recursive partitioning methods, and conventional logistic regression. With the exception of logistic regression, each approach was rule-based: if specific criteria were met, an ethnicity assignment was made; otherwise, the next criterion was considered, until all records were assigned. We evaluated the algorithms on a sample of over 500,000 female clients from the New York State Cancer Services Program for whom self-reported Hispanic ethnicity was known. We found that all approaches yielded similarly high accuracy, sensitivity, and positive predictive value in all parts of the state, from areas with very low to very high Hispanic populations. An advantage of the fast and frugal method is that it consists of a small number of easily remembered steps.
Year: 2013 PMID: 23405197 PMCID: PMC3566036 DOI: 10.1371/journal.pone.0055689
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Hispanic identification algorithms evaluated. Number of persons classified by each step given in parentheses.
| Algorithm Name | Description |
| NAACCR Hispanic Identification Algorithm (NHIA) | 1. Persons born in non-Spanish-speaking countries in South America and Europe and several other specified countries are coded as non-Hispanic (28,191). |
| | 2. Persons born in Spanish-speaking countries are coded as Hispanic (148,698). |
| | 3. Persons with American Indian, Asian, or Pacific Islander race are coded as non-Hispanic (51,063). |
| | 4. Female maiden names that are Hispanic among at least 75% of the population are coded as Hispanic (7,044). |
| | 5. Female maiden names that are Hispanic among less than 5% of the population are coded as non-Hispanic (68,459). |
| | 6. Female surnames that are Hispanic among at least 75% of the population are coded as Hispanic (14,977). |
| | 7. Remaining cases are coded as non-Hispanic (228,139). |
| Authors’ algorithm | 1. Persons with Asian race are coded as non-Hispanic (51,401). |
| | 2. Persons born in Spanish-speaking countries are coded as Hispanic (148,554). |
| | 3. Persons born in all remaining countries except U.S., Brazil, Portugal (including Cape Verde), and Belize are coded as non-Hispanic (53,219). |
| | 4. Surnames that are Hispanic among at least 75% of the population are coded as Hispanic (20,847). |
| | 5. Surnames that are Hispanic among less than 25% of the population are coded as non-Hispanic (268,497). |
| | 6. Persons from high-Hispanic counties (≥10% Hispanic in the 2000 U.S. census) are coded as Hispanic (1,330). |
| | 7. Persons from low-Hispanic counties (<5% Hispanic in the 2000 U.S. census) are coded as non-Hispanic (671). |
| | 8. Majority-Hispanic surnames are coded as Hispanic (1,194). |
| | 9. Remaining cases are coded as non-Hispanic (858). |
| Fast and frugal (3-step version) | 1. Persons born in Spanish-speaking countries are coded as Hispanic (148,719). |
| | 2. Majority-Hispanic surnames are coded as Hispanic (25,222). |
| | 3. Remaining cases are coded as non-Hispanic (372,630). |
| Fast and frugal (4-step version) | 1. Persons with Asian or Pacific Islander race are coded as non-Hispanic (51,401). |
| | 2. Persons born in Spanish-speaking countries are coded as Hispanic (148,554). |
| | 3. Majority-Hispanic surnames are coded as Hispanic (24,272). |
| | 4. Remaining cases are coded as non-Hispanic (322,344). |
| Logistic regression | Hispanic ethnicity is a function of country of birth, surname percent Hispanic (using the same categories as in |
For all but the NHIA algorithm, maiden names are used in place of surname when available.
This is a “female only” version of the published algorithm; a data set including males would require one additional step.
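The two fast and frugal variants listed in the table can be sketched as a short rule cascade. This is a minimal illustration, not the authors' implementation: the `Person` record, the `surname_pct` lookup, and the country set are hypothetical names introduced here, the country set is only an illustrative subset, and "majority-Hispanic" is assumed to mean that at least 50% of persons with the surname are Hispanic. Following the table note, the maiden name is used in place of the surname when available.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Illustrative subset only; the study's full list of Spanish-speaking
# countries is not reproduced in this document.
SPANISH_SPEAKING_COUNTRIES = frozenset({"Mexico", "Colombia", "Ecuador"})

# Assumption: "majority-Hispanic" means >= 50% of bearers are Hispanic.
MAJORITY_THRESHOLD = 50.0


@dataclass
class Person:
    """Hypothetical record layout used only for this sketch."""
    birth_country: Optional[str] = None
    surname: Optional[str] = None
    maiden_name: Optional[str] = None
    race: Optional[str] = None


def has_majority_hispanic_surname(person: Person,
                                  surname_pct: Mapping[str, float]) -> bool:
    """Look up the maiden name if present, otherwise the surname."""
    name = person.maiden_name or person.surname
    if not name:
        return False
    # Names missing from the lookup are treated as 0% Hispanic (assumption).
    return surname_pct.get(name.upper(), 0.0) >= MAJORITY_THRESHOLD


def fast_and_frugal(person: Person,
                    surname_pct: Mapping[str, float],
                    four_step: bool = False) -> bool:
    """Return True for Hispanic, False for non-Hispanic."""
    # Extra first step of the 4-step version: race exclusion.
    if four_step and person.race in {"Asian", "Pacific Islander"}:
        return False
    # Birthplace rule.
    if person.birth_country in SPANISH_SPEAKING_COUNTRIES:
        return True
    # Surname rule.
    if has_majority_hispanic_surname(person, surname_pct):
        return True
    # Final step: everyone remaining is coded non-Hispanic.
    return False


if __name__ == "__main__":
    # Placeholder names with made-up percentages, for illustration only.
    pct = {"ALPHA": 82.0, "BETA": 3.0}
    print(fast_and_frugal(Person(birth_country="Mexico", surname="Beta"), pct))
    print(fast_and_frugal(Person(surname="Beta", maiden_name="Alpha"), pct))
```

Each rule returns as soon as it fires, which mirrors the "assign or fall through to the next criterion" structure described in the abstract; the rule order is what distinguishes the 3-step from the 4-step version.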
Comparison of New York State and CSP populations, age 18 and above.
| | | New York State, 2000 Census (%) | CSP, 1994–2010 (%) |
| Race/ethnicity | Hispanic | 13.3 | 32.5 |
| | White | 61.7 | 39.9 |
| | Black | 14.7 | 15.9 |
| | Asian | 5.4 | 9.4 |
| | Other | 4.8 | 2.3 |
| Birthplace | Born in U.S. | 72.2 | 49.5 |
| | Born in Spanish-speaking country | 9.7 | 27.5 |
| | Other foreign-born | 18.1 | 23.0 |
| Age | 18–39 | 40.7 | 19.8 |
| | 40–49 | 19.4 | 33.4 |
| | 50–59 | 15.0 | 25.9 |
| | 60–69 | 10.2 | 13.3 |
| | 70–79 | 8.8 | 5.7 |
| | 80+ | 5.9 | 1.9 |
| Geography | New York City | 43.3 | 43.4 |
| | New York State Excluding New York City | 56.7 | 56.6 |
Excludes records with missing information, which ranged from 0 percent (age) to 3 percent (birthplace).
Hispanic Surname Classification Scheme.
| Designation | % of Persons | Number of Names, 2000 Census (%) | Number of Hispanics with this designation, 2000 Census (%) | Number of Hispanics with this designation, CSP data (%) |
| Heavily Hispanic | ≥75% | 6,020 (4.0) | 25,353,317 (71.8) | 129,839 (73.9) |
| Generally Hispanic | ≥50%–<75% | 1,774 (1.1) | 1,185,327 (3.4) | 9,627 (5.5) |
| Moderately Hispanic | ≥25%–<50% | 1,616 (1.1) | 429,309 (1.2) | 3,748 (2.1) |
| Occasionally Hispanic | ≥5%–<25% | 11,179 (7.4) | 547,786 (1.6) | 4,236 (2.4) |
| Rarely Hispanic | <5% | 131,082 (86.2) | 7,790,079 (22.1) | 28,166 (16.0) |
| Total | | 151,671 (100.0) | 35,305,818 (100.0) | 175,616 (100.0) |
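The cut-points in the classification scheme above translate directly into a small lookup function. The sketch below assumes the input is the percentage of persons with a given surname who are Hispanic; the function name `surname_designation` is introduced here for illustration and does not come from the source.

```python
def surname_designation(pct_hispanic: float) -> str:
    """Map a surname's percent-Hispanic value to its designation.

    Cut-points follow the table: >=75, >=50, >=25, >=5, and <5.
    """
    if not 0.0 <= pct_hispanic <= 100.0:
        raise ValueError("pct_hispanic must be between 0 and 100")
    if pct_hispanic >= 75.0:
        return "Heavily Hispanic"
    if pct_hispanic >= 50.0:
        return "Generally Hispanic"
    if pct_hispanic >= 25.0:
        return "Moderately Hispanic"
    if pct_hispanic >= 5.0:
        return "Occasionally Hispanic"
    return "Rarely Hispanic"
```

Checking the thresholds from highest to lowest keeps each range closed at its lower bound and open at its upper bound, as in the table.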
Hispanic Classification Results by Method and County Hispanic Prevalence.
| | | Self-Reported Value/Algorithm-Derived Value | | | | Quality Measure | | | | | |
| County Hispanic Prev. | Method | Hispanic/Hispanic (A) | Hispanic/Non-Hispanic (B) | Non-Hispanic/Non-Hispanic (C) | Non-Hispanic/Hispanic (D) | Acc | SN | SP | PPV | NPV | RB |
| All | NHIA | 163,175 | 12,441 | 363,411 | 7,544 | 96.3 | 92.9 | 98.0 | 95.6 | 96.7 | −2.8 |
| | Authors’ | 164,596 | 11,020 | 363,626 | 7,329 | 96.6 | 93.7 | 98.0 | 95.7 | 97.1 | −2.1 |
| | FF (4) | 164,800 | 10,816 | 362,929 | 8,026 | 96.6 | 93.8 | 97.8 | 95.4 | 97.1 | −1.6 |
| | FF (3) | 164,900 | 10,716 | 361,914 | 9,041 | 96.4 | 93.9 | 97.6 | 94.8 | 97.1 | −1.0 |
| | Regression | 163,694 | 11,922 | 363,193 | 7,762 | 96.4 | 93.2 | 97.9 | 95.5 | 96.8 | −2.4 |
| High | NHIA | 140,402 | 8,987 | 170,050 | 5,176 | 95.6 | 94.0 | 97.0 | 96.4 | 95.0 | −2.6 |
| | Authors’ | 141,323 | 8,066 | 170,188 | 5,038 | 96.0 | 94.6 | 97.1 | 96.6 | 95.5 | −2.0 |
| | FF (4) | 141,803 | 7,586 | 169,405 | 5,821 | 95.9 | 94.9 | 96.7 | 96.1 | 95.7 | −1.2 |
| | FF (3) | 141,839 | 7,550 | 168,593 | 6,633 | 95.6 | 94.9 | 96.2 | 95.5 | 95.7 | −0.6 |
| | Regression | 140,748 | 8,641 | 169,277 | 5,949 | 95.5 | 94.2 | 96.6 | 95.9 | 95.1 | −1.8 |
| Medium | NHIA | 16,600 | 1,666 | 46,664 | 982 | 96.0 | 90.9 | 97.9 | 94.4 | 96.6 | −3.7 |
| | Authors’ | 16,705 | 1,561 | 46,774 | 872 | 96.3 | 91.5 | 98.2 | 95.0 | 96.8 | −3.8 |
| | FF (4) | 16,768 | 1,498 | 46,692 | 954 | 96.3 | 91.8 | 98.0 | 94.6 | 96.9 | −3.0 |
| | FF (3) | 16,772 | 1,494 | 46,600 | 1,046 | 96.2 | 91.8 | 97.8 | 94.1 | 96.9 | −2.5 |
| | Regression | 17,165 | 1,101 | 46,443 | 1,203 | 96.5 | 94.0 | 97.5 | 93.5 | 97.7 | 0.6 |
| Low | NHIA | 5,199 | 1,621 | 145,137 | 1,347 | 98.1 | 76.2 | 99.1 | 79.4 | 98.9 | −4.0 |
| | Authors’ | 5,114 | 1,706 | 145,510 | 974 | 98.2 | 75.0 | 99.3 | 84.0 | 98.8 | −10.7 |
| | FF (4) | 5,268 | 1,552 | 145,220 | 1,264 | 98.2 | 77.2 | 99.1 | 80.6 | 98.9 | −4.2 |
| | FF (3) | 5,275 | 1,545 | 145,178 | 1,306 | 98.1 | 77.3 | 99.1 | 80.2 | 98.9 | −3.5 |
| | Regression | 4,714 | 2,106 | 146,073 | 411 | 98.4 | 69.1 | 99.7 | 92.0 | 98.6 | −24.9 |
County Hispanic prevalence: High: 9 counties with ≥10% Hispanic population according to the 2000 U.S. Census; Medium: 7 counties with 5–10%; Low: 46 counties with <5% Hispanic. In the CSP sample, 149,389 of 324,615 from high-Hispanic counties self-identified as Hispanic (46%). Corresponding values for medium-Hispanic counties were 18,266 of 65,912 (28%), and for low-Hispanic counties were 6,820 of 153,304 (4%). County was not known for 2,740 persons.
Quality measures: Acc: Accuracy (A+C)/(A+B+C+D); SN: Sensitivity (A)/(A+B); SP: Specificity (C)/(C+D); PPV: Positive Predictive Value (A)/(A+D); NPV: Negative Predictive Value (C)/(B+C); RB: Relative Bias [(A+D)/(A+B)]−1.
Sum of the two models, each containing approximately half of the observations.
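The quality measures defined in the table notes can be recomputed from the four cell counts A, B, C, and D. The sketch below is a direct transcription of those definitions, expressed as percentages; the function name `quality_measures` and the dictionary keys are choices made here for illustration. The usage example takes its counts from the NHIA row for all counties.

```python
def quality_measures(a: int, b: int, c: int, d: int) -> dict:
    """Compute the table's quality measures, in percent.

    a: self-reported Hispanic, algorithm Hispanic
    b: self-reported Hispanic, algorithm non-Hispanic
    c: self-reported non-Hispanic, algorithm non-Hispanic
    d: self-reported non-Hispanic, algorithm Hispanic
    """
    total = a + b + c + d
    return {
        "Acc": 100.0 * (a + c) / total,             # accuracy
        "SN": 100.0 * a / (a + b),                  # sensitivity
        "SP": 100.0 * c / (c + d),                  # specificity
        "PPV": 100.0 * a / (a + d),                 # positive predictive value
        "NPV": 100.0 * c / (b + c),                 # negative predictive value
        "RB": 100.0 * ((a + d) / (a + b) - 1.0),    # relative bias
    }


if __name__ == "__main__":
    # NHIA, all counties: A=163,175  B=12,441  C=363,411  D=7,544
    nhia_all = quality_measures(163_175, 12_441, 363_411, 7_544)
    for name, value in nhia_all.items():
        print(f"{name}: {value:.1f}")
```

A negative relative bias means the algorithm assigns fewer persons to the Hispanic category than self-report does, which is the case for nearly every row of the table.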