Ronan Ryan, Sally Vernon, Gill Lawrence, Sue Wilson.
Abstract
BACKGROUND: Information on ethnicity is commonly used by health services and researchers to plan services, ensure equality of access, and conduct epidemiological studies. In common with other important demographic and clinical data, it is often incompletely recorded. This paper presents a method for imputing missing data on the ethnicity of cancer patients, developed for a regional cancer registry in the UK.
Year: 2012 PMID: 22269985 PMCID: PMC3353229 DOI: 10.1186/1472-6947-12-3
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Characteristics of the Project Cohort
| Characteristic | Category | Cases | Cases with missing ethnicity following linkage with HES records (% of cases) | Chi-square statistic |
|---|---|---|---|---|
| Site | Lower GI | 24 446 | 4112 (17) | |
| | Breast | 28 795 | 6029 (21) | |
| | Lung | 24 060 | 5660 (24) | |
| | Prostate | 23 716 | 8814 (37) | |
| | Upper GI | 10 677 | 1727 (16) | 3500 (p < 0.001) |
| Year of diagnosis | 2001 | 15 102 | 4118 (27) | |
| | 2002 | 15 523 | 3840 (25) | |
| | 2003 | 15 731 | 3772 (24) | |
| | 2004 | 16 162 | 3681 (23) | |
| | 2005 | 16 317 | 3632 (22) | |
| | 2006 | 16 458 | 3470 (21) | |
| | 2007 | 16 401 | 3829 (23) | 206 (p < 0.001) |
| Deprivation (Income Domain of Index of Multiple Deprivation 2007) | 1 (most deprived) | 26 738 | 5149 (19) | |
| | 2 | 22 104 | 4856 (22) | |
| | 3 | 22 759 | 5425 (24) | |
| | 4 | 22 465 | 5942 (26) | |
| | 5 (least deprived) | 17 628 | 4970 (28) | 6300 (p < 0.001) |
| Age | < 40 | 1771 | 251 (14) | |
| | 40-49 | 5622 | 892 (16) | |
| | 50-59 | 15 338 | 2933 (19) | |
| | 60-69 | 27 759 | 5924 (21) | |
| | 70-79 | 35 258 | 8538 (24) | |
| | 80+ | 25 946 | 7804 (30) | 1100 (p < 0.001) |
| Sex | Male | 59 592 | 15 454 (26) | |
| | Female | 52 102 | 10 888 (21) | 391 (p < 0.001) |
| Death Certificate Only registration | No | 106 217 | 23 577 (22) | |
| | Yes | 5477 | 2765 (50) | 2300 (p < 0.001) |
| Ever seen privately (cancer was diagnosed or treated outside the free National Health Service on at least one occasion) | No | 106 566 | 23 113 (22) | |
| | Yes | 5128 | 3229 (63) | 4600 (p < 0.001) |
| Surgery | No | 58 875 | 18 344 (31) | |
| | Yes | 52 819 | 7998 (15) | 4000 (p < 0.001) |
| Radiotherapy | No | 73 520 | 19 009 (26) | |
| | Yes | 38 174 | 7333 (19) | 616 (p < 0.001) |
| Chemotherapy | No | 93 778 | 24 650 (26) | |
| | Yes | 17 916 | 1692 (9) | 2400 (p < 0.001) |
| Screen detected breast cancer* | No | 22 900 | 5012 (22) | |
| | Yes | 5895 | 1017 (17) | 61 (p < 0.001) |
| HES-linked | No | 19 694 | 19 694 (100) | |
| | Yes | 92 000 | 6648 (7) | |
| Number of admissions (includes non-cancer admissions) | 0 | 19 694 | 19 694 (100) | |
| | 1 | 8012 | 2071 (26) | |
| | 2 | 10 261 | 1414 (14) | |
| | 3 | 10 523 | 985 (9) | |
| | 4 | 9332 | 562 (6) | |
| | 5+ | 53 872 | 1616 (3) | 6300 (p < 0.001)** |
* Comparison of breast cancer cases who were and were not detected by population screening.
** Chi-square test excludes cases with no admissions as, by definition, none have ethnicity recorded.
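The chi-square statistics in this table compare the proportion of cases with missing ethnicity across the categories of each characteristic. As a rough illustration, the sketch below recomputes the statistic for sex from the counts in the table, assuming a Pearson chi-square test without continuity correction (the test variant is an assumption; it is not stated here). The function name `pearson_chi_square` is illustrative only.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: male, female. Columns: ethnicity missing, ethnicity recorded.
# Counts are taken from the "Sex" rows of the table above.
sex_table = [
    [15454, 59592 - 15454],
    [10888, 52102 - 10888],
]
chi2_sex = pearson_chi_square(sex_table)
print(f"Chi-square for sex: {chi2_sex:.1f}")
```

Under that assumption the result lands close to the value of 391 reported in the table.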
Sensitivity, Specificity and Positive Predictive Value of NBSS-derived Ethnicity for Breast Cancer Cases
| Ethnic group recorded in HES | Number of cases recorded in HES | NBSS sensitivity | NBSS specificity | NBSS positive predictive value |
|---|---|---|---|---|
| White | 5093 | 99.7% | 77.3% | 99.3% |
| South Asian | 82 | 90.0% | 99.8% | 87.1% |
| Black | 44 | 61.4% | 99.9% | 79.4% |
| Chinese/Other | 14 | 0.0% | 100.0% | |
| Mixed | 10 | 0.0% | 100.0% | |
Includes 5243 cases where ethnic group was recorded in both HES and NBSS datasets. Individual logistic models (positive outcome threshold: p ≥ 0.5).
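Sensitivity, specificity and positive predictive value for a single ethnic group can be derived by treating that group as the positive class and applying the 0.5 threshold to the predicted probability. The following is a minimal sketch of that calculation; the function name and the toy probabilities are hypothetical and are not taken from the study data.

```python
def threshold_metrics(is_group, probabilities, threshold=0.5):
    """Return (sensitivity, specificity, ppv) for one group versus all others.

    is_group: True where the reference source records the group.
    probabilities: predicted probability of membership for each case.
    A metric is None when its denominator is zero.
    """
    tp = fp = fn = tn = 0
    for observed, p in zip(is_group, probabilities):
        predicted = p >= threshold
        if predicted and observed:
            tp += 1
        elif predicted and not observed:
            fp += 1
        elif observed:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    ppv = tp / (tp + fp) if tp + fp else None
    return sensitivity, specificity, ppv


# Toy example: three cases in the group, five outside it.
is_group = [True, True, True, False, False, False, False, False]
probs = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
sens, spec, ppv = threshold_metrics(is_group, probs)

# If no case reaches the threshold, PPV has no denominator.
never_predicted = threshold_metrics(is_group, [0.1] * len(is_group))
```

When a group is never predicted, sensitivity is 0, specificity is 1 and the positive predictive value is undefined, which is presumably why that cell is blank for the Chinese/Other and Mixed rows.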
Ethnicity of Cases Following Linkage with HES and NBSS Datasets
| Ethnic group | Cases with ethnic group recorded in HES (%) | Cases with unknown ethnicity resolved by NBSS linkage | Ethnic breakdown of cohort following use of HES and NBSS (%) |
|---|---|---|---|
| White | 81 934 (73.4) | 1053 | 82 987 (74.3) |
| South Asian | 1545 (1.4) | 13 | 1558 (1.4) |
| Black | 1429 (1.3) | 11 | 1440 (1.3) |
| Chinese/Other | 303 (0.3) | 3 | 306 (0.3) |
| Mixed | 141 (0.1) | 2 | 143 (0.1) |
| Not known | 26 342 (23.6) | | 26 331 (22.6) |
| Total | 111 694 | 1082 | |
Sensitivity, Specificity and Positive Predictive Value of Name Recognition Software
| Name recognition software | Ethnic group | Sensitivity | Specificity | Positive predictive value |
|---|---|---|---|---|
| Onomap | White | 99.8% | 51.5% | 98.0% |
| | South Asian | 82.1% | 99.9% | 92.9% |
| | Black | 4.4% | 99.9% | 70.8% |
| | Chinese/Other | 0.0% | 100.0% | |
| Nam Pehchan* | South Asian | 71.1% | 99.9% | 94.5% |
| Onomap and Nam Pehchan combined | South Asian | 90.5% | 99.9% | 93.3% |
Includes 85 352 cases where a single ethnic group was recorded in the HES dataset. Individual logistic models (positive outcome threshold: p ≥ 0.5).
* Matched on forename and surname separately.
Sensitivity, Specificity and Positive Predictive Value of Census Data on Ethnicity
| Ethnic group | Sensitivity | Specificity | Positive predictive value |
|---|---|---|---|
| White | 99.3% | 21.4% | 96.8% |
| South Asian | 7.4% | 99.8% | 44.9% |
| Black | 2.3% | 99.9% | 34.4% |
| Chinese/Other | 0.0% | 100.0% | |
| Mixed | 0.0% | 100.0% | |
Includes 85 352 cases where a single ethnic group was recorded in the HES dataset. Census data used to predict ethnic group were: percentage of local population in South Asian, Black, Chinese/Other and Mixed ethnic group at last national census. Individual logistic models (positive outcome threshold: p ≥ 0.5).
Sensitivity, Specificity and Positive Predictive Value of Full Model
| Ethnic group | Sensitivity | Specificity | Positive predictive value |
|---|---|---|---|
| White | 99.7% | 56.0% | 98.2% |
| South Asian | 94.7% | 99.8% | 90.4% |
| Black | 20.4% | 99.8% | 63.6% |
| Chinese/Other | 21.0% | 99.9% | 57.6% |
| Mixed | 0% | 100% | |
A multinomial logistic regression model was used to predict ethnic group. The model was developed on a randomly selected 50% sample of the 85 352 cases whose ethnicity was recorded in the HES dataset. The remaining 50% of cases were used to validate the model and derive the above estimates. The predictors used in the model were: ethnicity derived from name recognition software; Census estimates of ethnic distribution of population; number of hospital admissions; year of diagnosis; patient seen outside the NHS (yes/no); screen-detected cancer (yes/no); death certificate only cancer registration (yes/no); cancer treatment type (surgery/radiotherapy/chemotherapy); deprivation score; gender; age at diagnosis; cancer site; death during follow-up period (all-cause and due to primary cancer separately); and time to death/censoring (Nelson-Aalen cumulative hazard).
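The development and validation procedure described above can be sketched as follows. This is an illustration only: it fits a multinomial logistic (softmax) model by plain gradient descent on synthetic two-feature data, then scores it on the held-out half. The software, fitting method and predictor coding used in the study are not given here, so every name, value and setting in the block is an assumption.

```python
import math
import random


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(weights, features):
    row = [1.0] + list(features)  # leading 1.0 is the intercept term
    return softmax([sum(w * v for w, v in zip(wk, row)) for wk in weights])


def fit_multinomial_logit(X, y, n_classes, learning_rate=0.1, epochs=500):
    """Fit one weight vector per class by batch gradient descent."""
    n_params = len(X[0]) + 1
    weights = [[0.0] * n_params for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_params for _ in range(n_classes)]
        for features, label in zip(X, y):
            row = [1.0] + list(features)
            probs = class_probabilities(weights, features)
            for k in range(n_classes):
                error = probs[k] - (1.0 if k == label else 0.0)
                for j, value in enumerate(row):
                    grad[k][j] += error * value
        for k in range(n_classes):
            for j in range(n_params):
                weights[k][j] -= learning_rate * grad[k][j] / len(X)
    return weights


def predict(weights, features):
    probs = class_probabilities(weights, features)
    return max(range(len(probs)), key=probs.__getitem__)


# Synthetic stand-in for cases with a known group: three well-separated clusters.
rng = random.Random(0)
centres = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
cases = [([rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)], label)
         for label, (cx, cy) in enumerate(centres) for _ in range(50)]

# Random 50% development sample; the remaining 50% is held out for validation.
rng.shuffle(cases)
half = len(cases) // 2
development, validation = cases[:half], cases[half:]

model = fit_multinomial_logit([x for x, _ in development],
                              [label for _, label in development], n_classes=3)
correct = sum(predict(model, x) == label for x, label in validation)
validation_accuracy = correct / len(validation)
print(f"Validation accuracy: {validation_accuracy:.2f}")
```

Scoring only on the held-out half keeps the performance estimates from being inflated by cases the model was fitted to.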
Comparison of Distribution of Ethnic Groups: Observed and Imputed
| Ethnic group | Observed % | Imputed* % | Total* % |
|---|---|---|---|
| White | 96.0 | 95.8 | 96.0 |
| South Asian | 1.8 | 1.7 | 1.8 |
| Black | 1.7 | 1.6 | 1.7 |
| Chinese/Other | 0.4 | 0.6 | 0.4 |
| Mixed | 0.2 | 0.3 | 0.2 |
| Total | 100 | 100 | 100 |
* Using all 23 imputations combined.
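One simple way to summarise the distribution across several imputed datasets is to stack them and compute the percentage in each group. The sketch below assumes that approach; how the 23 imputations were actually combined is not described here, and the function name and toy values are hypothetical.

```python
from collections import Counter


def pooled_distribution(imputed_datasets):
    """Percentage of cases in each group across all imputed datasets stacked together."""
    counts = Counter()
    for dataset in imputed_datasets:
        counts.update(dataset)
    total = sum(counts.values())
    return {group: 100.0 * n / total for group, n in counts.items()}


# Toy example: three imputations of the same four cases.
imputations = [
    ["White", "White", "White", "Black"],
    ["White", "White", "South Asian", "Black"],
    ["White", "White", "White", "White"],
]
pooled = pooled_distribution(imputations)
```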