Simon A Ruhnke, Fernando A Wilson, Jim P Stimpson.
Abstract
We describe a novel machine learning method for imputing the legal status of immigrants using nationally representative survey data from the Survey of Income and Program Participation (SIPP) and the National Health Interview Survey (NHIS). Two machine learning classifiers, K-Nearest Neighbor (KNN) and the Random Forest (RF) algorithm, are described as novel imputation methods and compared with established regression-based imputation. The imputation methods were validated using sensitivity, specificity, positive predictive value (PPV), and accuracy. Relative to regression-based imputation and KNN, the Random Forest algorithm was more accurate in identifying undocumented immigrants and minimized bias both in the socio-demographic variables included in the imputation and in unobserved health variables.

• We developed a new machine learning method of imputing legal status for immigrants that can be used with nationally representative, publicly available data.
• Our findings indicate that using machine learning to impute the legal status of immigrants, specifically the Random Forest algorithm, was more accurate in identifying undocumented immigrants and minimized bias relative to other imputation methods.
Keywords: Demography; Immigrant; Machine Learning; Population Health; Undocumented Immigrants; United States
Year: 2022 PMID: 36160111 PMCID: PMC9490167 DOI: 10.1016/j.mex.2022.101848
Source DB: PubMed Journal: MethodsX ISSN: 2215-0161
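The imputation idea summarized in the abstract (learn legal status in a sample where it is observed, then predict it in a sample where it is not) can be illustrated with a minimal K-Nearest Neighbor sketch. This is not the authors' code: the function name `knn_impute`, the two scaled predictors, the toy data, and the choice of `k = 3` are all assumptions made for illustration, and the article's preferred method is Random Forest rather than KNN.

```python
import math
from collections import Counter


def knn_impute(donor_X, donor_y, target_X, k=3):
    """Impute a label for each target row by majority vote among its
    k nearest donor rows (Euclidean distance on the predictor values)."""
    imputed = []
    for row in target_X:
        # Rank donor rows by distance to the current target row.
        nearest = sorted(
            range(len(donor_X)),
            key=lambda i: math.dist(donor_X[i], row),
        )[:k]
        # Majority vote over the labels of the k closest donors.
        votes = Counter(donor_y[i] for i in nearest)
        imputed.append(votes.most_common(1)[0][0])
    return imputed


# Hypothetical donor sample (status observed): two predictors scaled to [0, 1].
# Label 1 = undocumented, 0 = documented. All values are invented.
donor_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
           (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
donor_y = [1, 1, 1, 0, 0, 0]

# Hypothetical target sample (status not observed).
target_X = [(0.12, 0.18), (0.88, 0.9)]

imputed_status = knn_impute(donor_X, donor_y, target_X, k=3)
print(imputed_status)
```

In practice the predictors would need to be harmonized between the donor and target surveys and put on comparable scales, since Euclidean distance is sensitive to units.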
Average correlations between (Imputed) legal status and health variables using bootstrapped cross-validation.
| Correlation between: | Legal Status & Private Health Insurance, Pearson's Cor. Coef. (95% CI) | Legal Status & Poor/Fair Health, Pearson's Cor. Coef. (95% CI) |
|---|---|---|
| True relationship in full SIPP (N = 7998) | -0.2245 (-0.2431; -0.2056) | -0.0375 (-0.0594; -0.02) |
| True relationship in Test-SIPP (N = 984) | -0.16823 (-0.2283; -0.1316) | -0.01478 (-0.0851; 0.0557) |
| Logical | -0.276 (-0.2942; -0.2577) | -0.0611 (-0.0829; -0.04) |
| Logit | -0.27218 (-0.329; -0.2134) | -0.0495 (-0.1195; 0.021) |
| MI | -0.14931 (-0.2463; -0.0494) | -0.01666 (-0.1188; 0.0859) |
| KNN | -0.20708 (-0.2661; -0.1465) | -0.02859 (-0.0988; 0.0419) |
| RF | -0.17275 (-0.2327; -0.1115) | -0.04385 (-0.1139; 0.0266) |
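For readers who want to see how a Pearson coefficient and its 95% confidence interval of the kind tabulated above can be obtained, the sketch below uses the standard Fisher z-transformation. Whether the authors computed their intervals this way is an assumption; the function names `pearson_r` and `fisher_ci` are illustrative. As a plausibility check, plugging in the Logit row (r = -0.27218) with the test-sample size (N = 984) gives an interval that agrees with the tabulated one to roughly three decimals.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences
    (works for 0/1 indicators such as imputed status and insurance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z-transform:
    z = atanh(r), standard error 1 / sqrt(n - 3), back-transform with tanh."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Values taken from the Logit row of the table; the method is assumed.
low, high = fisher_ci(-0.27218, 984)
print(round(low, 3), round(high, 3))
```

Rows whose tabulated interval is not reproduced by this formula may reflect averaging over bootstrap replicates, which this sketch does not attempt to model.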
Average model performance metrics of logistic, K-Nearest neighbor and random forest using bootstrapped cross-validation.
| Metric | Logit | KNN | RF |
|---|---|---|---|
| Sensitivity | 0.68995 | 0.68478 | 0.71828 |
| Specificity | 0.57205 | 0.57092 | 0.61948 |
| PPV | 0.34462 | 0.29311 | 0.43952 |
| Accuracy | 0.86738 | 0.88600 | 0.86152 |
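The four validation statistics in this table follow from a 2x2 confusion matrix of true versus imputed status. The sketch below shows the standard definitions; the function name `classification_metrics`, the coding of undocumented status as 1, and the toy vectors are assumptions for illustration and are not taken from the article's data.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and accuracy for binary labels,
    where 1 is the positive class (assumed here: undocumented).
    Assumes both classes occur in y_true and at least one 1 in y_pred."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "sensitivity": tp / (tp + fn),        # share of true positives found
        "specificity": tn / (tn + fp),        # share of true negatives found
        "ppv": tp / (tp + fp),                # share of predicted positives that are correct
        "accuracy": (tp + tn) / len(pairs),   # share of all cases classified correctly
    }


# Invented example: 3 truly undocumented and 7 documented respondents.
true_status = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
imputed_status = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(true_status, imputed_status)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because undocumented immigrants are a minority of the sample, accuracy can be high even when PPV is modest, which is why the table reports all four statistics rather than accuracy alone.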
Average model performance metrics of logical, single logistic, K-Nearest neighbor and random forest using bootstrapped cross-validation, alternative specification including self-rated health as a predictor.
| Metric | Logit | KNN | RF |
|---|---|---|---|
| Sensitivity | 0.69401 | 0.68589 | 0.72328 |
| Specificity | 0.58148 | 0.56923 | 0.62773 |
| PPV | 0.35240 | 0.30720 | 0.45389 |
| Accuracy | 0.86954 | 0.88048 | 0.86169 |
Average Correlations Between (Imputed) Legal Status and Health Variables using bootstrapped Cross-Validation, alternative specification including self-rated health as a predictor.
| Correlation between: | Legal Status & Private Health Insurance, Pearson's Cor. Coef. (95% CI) | Legal Status & Poor/Fair Health, Pearson's Cor. Coef. (95% CI) |
|---|---|---|
| True relationship in full SIPP (N = 9845) | -0.2245 (-0.2431; -0.2056) | -0.0375 (-0.0594; -0.02) |
| True relationship in Test-SIPP (N = 984) | -0.16823 (-0.2283; -0.1316) | -0.01478 (-0.0851; 0.0557) |
| Logit | -0.26579 (-0.3229; -0.2067) | -0.0369 (-0.107; 0.0336) |
| MI | -0.1441 (-0.2378; -0.0477) | -0.02003 (-0.1237; 0.0841) |
| KNN | -0.19708 (-0.2564; -0.1363) | -0.04334 (-0.1134; 0.0272) |
| RF | -0.17478 (-0.2347; -0.1135) | -0.03421 (-0.1043; 0.0363) |
Socio-economic characteristics in the SIPP 2004, SIPP 2008, SIPP 2014, NHIS (2000-2018) using the Random Forest algorithm to impute documentation status.
| | SIPP 2004 | | | SIPP 2008 | | | SIPP 2014 | | | NHIS 2000-2018 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | US-born | Documented | Undocumented | US-born | Documented | Undocumented | US-born | Documented | Undocumented | US-born | Documented | Undocumented |
| | (%) | (%) | (%) | (%) | (%) | (%) | (%) | (%) | (%) | (%) | (%) | (%) |
| | 40.35 | 40.31 | 32.72 | 40.93 | 41.22 | 33.47 | 41.11 | 42.75 | 36.15 | 40.79 | 41.05 | 35.2 |
| Married | 55.22 | 63.75 | 60.22 | 52.91 | 63.47 | 53.45 | 48.86 | 64.83 | 53.77 | 52.92 | 64.53 | 60.90 |
| Widowed | 1.96 | 1.67 | 0.51 | 1.86 | 1.91 | 0.97 | 1.82 | 1.66 | 0.87 | 1.91 | 1.79 | 0.90 |
| Divorced | 11.93 | 8.07 | 2.45 | 11.82 | 8.14 | 3.86 | 12.03 | 8.92 | 4.95 | 11.80 | 7.73 | 4.38 |
| Separated | 2.12 | 3.40 | 3.03 | 1.95 | 2.99 | 3.29 | 2.20 | 2.76 | 3.48 | 2.32 | 3.40 | 3.51 |
| Never Married | 28.37 | 22.43 | 33.32 | 30.84 | 22.50 | 37.08 | 34.79 | 21.38 | 35.84 | 30.76 | 22.18 | 29.97 |
| Missing | 0.40 | 0.68 | 0.46 | 0.62 | 0.98 | 1.34 | 0.30 | 0.45 | 1.08 | 0.30 | 0.38 | 0.33 |
| White | 83.33 | 67.55 | 78.28 | 82.90 | 65.16 | 78.06 | 79.01 | 44.38 | 44.75 | 83.47 | 62.84 | 69.96 |
| Black | 12.59 | 9.78 | 7.22 | 12.69 | 11.42 | 7.93 | 13.49 | 9.17 | 5.79 | 13.48 | 10.27 | 7.06 |
| Asian | 0.94 | 19.25 | 13.28 | 1.19 | 20.05 | 10.97 | 1.33 | 25.25 | 16.54 | 1.42 | 23.29 | 19.73 |
| Other | 3.15 | 3.42 | 1.21 | 3.22 | 3.37 | 3.04 | 3.26 | 2.71 | 0.97 | 1.64 | 3.59 | 3.24 |
| Missing | 0 | 0 | 0 | 0 | 0 | 0 | 2.30 | 15.67 | 28.85 | 0 | 0 | 0 |
| No | 92.49 | 57.31 | 38.50 | 91.41 | 58.48 | 31.67 | 90.58 | 56.80 | 32.32 | 92.42 | 51.93 | 35.26 |
| Yes | 7.51 | 42.69 | 61.50 | 8.59 | 41.52 | 68.33 | 9.34 | 43.14 | 67.55 | 7.58 | 48.07 | 64.74 |
| No | 24.32 | 26.35 | 31.68 | 27.03 | 28.86 | 34.95 | 30.47 | 29.90 | 33.51 | 27.41 | 27.46 | 32.79 |
| Yes | 75.68 | 73.65 | 68.32 | 72.97 | 71.14 | 65.05 | 69.53 | 70.10 | 66.49 | 72.45 | 72.36 | 66.86 |
| Never Attended | 0.14 | 1.33 | 2.05 | 0.09 | 0.98 | 1.96 | 0.06 | 0.83 | 1.36 | 0.15 | 1.20 | 1.61 |
| 1-6 Grade | 0.51 | 11.46 | 22.43 | 0.39 | 9.25 | 18.32 | 0.17 | 6.68 | 17.87 | 0.40 | 9.70 | 17.12 |
| 7-12 Grade | 8.34 | 14.90 | 20.92 | 6.98 | 11.32 | 19.98 | 7.97 | 12.41 | 20.26 | 9.42 | 15.26 | 21.60 |
| Highschool | 26.76 | 20.25 | 20.05 | 26.23 | 23.92 | 28.42 | 28.96 | 23.11 | 21.37 | 27.45 | 21.34 | 20.48 |
| Some College | 38.18 | 25.67 | 13.07 | 38.03 | 26.13 | 12.49 | 32.30 | 22.25 | 13.76 | 33.36 | 21.38 | 14.34 |
| Undergrad | 17.10 | 15.32 | 11.08 | 18.39 | 17.19 | 8.49 | 19.62 | 20.26 | 11.82 | 19.28 | 17.94 | 13.28 |
| Graduate | 7.33 | 7.92 | 6.63 | 8.23 | 8.24 | 5.17 | 10.10 | 11.67 | 10.30 | 9.44 | 11.61 | 9.80 |
| Missing | 1.65 | 3.16 | 3.77 | 1.67 | 2.98 | 5.17 | 0.83 | 2.79 | 3.26 | 0.51 | 1.56 | 1.76 |
| Below 100% FPL | 11.28 | 14.92 | 23.12 | 13.24 | 17.49 | 32.19 | 15.03 | 16.98 | 29.81 | 11.74 | 17.92 | 26.88 |
| Below 200% FPL | 14.96 | 22.86 | 35.11 | 15.15 | 23.66 | 32.96 | 15.43 | 20.91 | 28.06 | 15.54 | 23.57 | 31.43 |
| Above 200% FPL | 73.76 | 62.23 | 41.78 | 71.62 | 58.85 | 34.85 | 69.53 | 62.11 | 42.14 | 72.72 | 58.51 | 41.69 |
| USA | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 | 100 | 0 | 0 |
| Central/South America | 0.00 | 56.73 | 74.92 | 0.00 | 51.86 | 79.35 | 0.00 | 51.50 | 70.72 | 0.00 | 53.81 | 67.73 |
| Europe | 0.00 | 13.08 | 4.55 | 0.00 | 12.15 | 3.44 | 0.00 | 12.51 | 5.33 | 0.00 | 12.36 | 5.42 |
| Africa | 0.00 | 3.19 | 4.26 | 0.00 | 4.08 | 3.22 | 0.00 | 4.17 | 3.98 | 0.00 | 4.34 | 3.36 |
| Asia | 0.00 | 23.84 | 14.77 | 0.00 | 24.25 | 11.97 | 0.00 | 31.06 | 18.77 | 0.00 | 25.87 | 20.30 |
| Other | 0.00 | 3.17 | 1.49 | 0.00 | 7.65 | 2.03 | 0.00 | 0.76 | 1.20 | 0.00 | 2.95 | 2.17 |
| < 5 years | 0.00 | 33.13 | 43.87 | 0.00 | 16.98 | 33.31 | 0.00 | 6.01 | 15.34 | 0.00 | 10.38 | 27.25 |
| 5-10 years | 0.00 | 14.22 | 38.21 | 0.00 | 23.28 | 32.87 | 0.00 | 10.36 | 18.76 | 0.00 | 12.52 | 24.28 |
| 10-15 years | 0.00 | 13.86 | 10.92 | 0.00 | 14.15 | 17.65 | 0.00 | 12.27 | 23.61 | 0.00 | 14.76 | 17.30 |
| 15+ years | 0.00 | 38.79 | 7.00 | 0.00 | 45.60 | 16.17 | 0.00 | 50.44 | 36.15 | 0.00 | 59.59 | 28.67 |
| Missing | 100.00 | 0 | 0 | 100.00 | 0 | 0 | 100.00 | 20.93 | 6.01 | 100.00 | 2.75 | 2.50 |
| No | 99.24 | 74.22 | 43.94 | 99.31 | 77.04 | 48.73 | 99.84 | 86.13 | 74.29 | 98.89 | 71.12 | 51.17 |
| Yes | 0.76 | 25.78 | 56.06 | 0.69 | 22.96 | 51.27 | 0.16 | 13.82 | 25.71 | 0.92 | 28.66 | 48.56 |
| | 0.77 | 1.05 | 1.18 | 0.74 | 1.03 | 1.2 | 0.73 | 1.18 | 1.28 | 0.78 | 1.17 | 1.14 |
| | 3.11 | 3.72 | 4.11 | 3.09 | 3.67 | 4.34 | 3 | 3.64 | 3.97 | 2.91 | 3.54 | 3.76 |
| No | 91.71 | 89.48 | 92.19 | 91.61 | 90.13 | 91.95 | 84.64 | 80.56 | 81.78 | 92.38 | 91.48 | 92.47 |
| Yes | 7.38 | 9.08 | 6.67 | 7.12 | 8.06 | 7.45 | 9.23 | 11.83 | 10.46 | 7.13 | 7.98 | 6.70 |
| No | 0.88 | 21.22 | 45.55 | 0.97 | 21.10 | 40.34 | 0.89 | 18.65 | 36.03 | 1.11 | 25.78 | 47.70 |
| Yes | 53.62 | 38.95 | 9.18 | 51.32 | 39.58 | 8.12 | 46.93 | 43.14 | 13.93 | 58.97 | 40.68 | 15.77 |
| Missing | 45.50 | 39.83 | 45.26 | 47.71 | 39.33 | 51.54 | 52.18 | 38.22 | 50.04 | 39.92 | 33.54 | 36.52 |
| Northeast | 18.12 | 22.91 | 18.05 | 17.91 | 20.50 | 12.40 | 17.11 | 22.01 | 16.50 | 17.05 | 21.77 | 16.75 |
| Midwest | 24.52 | 11.52 | 9.29 | 23.79 | 11.80 | 13.20 | 23.44 | 12.22 | 12.34 | 25.92 | 12.06 | 11.40 |
| South | 36.75 | 29.74 | 34.57 | 37.25 | 32.42 | 37.00 | 38.19 | 31.87 | 36.74 | 37.32 | 31.95 | 34.37 |
| West | 20.62 | 35.83 | 38.09 | 21.05 | 35.28 | 37.40 | 21.26 | 33.89 | 34.42 | 19.71 | 34.22 | 37.47 |
| No | 79.03 | 76.90 | 79.02 | 74.91 | 73.68 | 67.54 | 79.81 | 79.75 | 83.19 | 89.29 | 89.82 | 92.18 |
| Yes | 10.52 | 10.22 | 6.29 | 10.02 | 9.39 | 6.20 | 12.99 | 11.71 | 8.11 | 10.65 | 10.13 | 7.79 |
| Missing | 10.45 | 12.88 | 14.69 | 15.07 | 16.93 | 26.26 | 7.2 | 8.54 | 8.7 | 0.06 | 0.05 | 0.03 |
Results are expressed as weighted column percentages.
Source: United States Census: Survey of Income and Program Participation 2004, Waves 2 & 3; 2008, Waves 2 & 4; 2014, Wave 1; National Health Interview Survey 2000-2018.
FPL: Federal Poverty Line (Household Income).
SRH: Self-Rated Health.
| Subject area: | Economics and Finance |
| More specific subject area: | |
| Name of your method: | |
| Name and reference of original method: | Breiman, L. (2001). Random Forests. |
| Resource availability: |