Mayuri Mahendran, Daniel Lizotte, Greta R Bauer.
Abstract
Intersectionality recognizes that in the context of sociohistorically shaped structural power relations, an individual's multiple social positions or identities (e.g., gender, ethnicity) can interact to affect health-related outcomes. Despite limited methodological guidance, intersectionality frameworks have increasingly been incorporated into epidemiological studies, both to describe health disparities and to examine their causes. This study aimed to advance methods in intersectional estimation of binary outcomes in descriptive health disparities research through evaluation of 7 potentially intersectional data analysis methods: cross-classification, regression with interactions, multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA), and decision trees (CART, CTree, CHAID, random forest). Accuracy of estimated intersection-specific outcome prevalence was evaluated across 192 intersections using simulated data scenarios. For comparison we included a non-intersectional main effects regression. We additionally assessed variable selection performance amongst decision trees. Example analyses using National Health and Nutrition Examination Study data illustrated differences in results between methods. At larger sample sizes, all methods except for CART performed better than non-intersectional main effects regression. In smaller samples, MAIHDA was the most accurate method but showed no advantage over main effects regression, while random forest, cross-classification, and saturated regression were the least accurate, and CTree and CHAID performed moderately well. CART performed poorly for estimation and variable selection. Sensitivity analyses examining the bias-variance tradeoff suggest MAIHDA as the preferred unbiased method for accurate estimation of high-dimensional intersections at smaller sample sizes. Larger sample sizes are more imperative for other methods. Results support the adoption of an intersectional approach to descriptive epidemiology.
Keywords: Biostatistics; CART, classification and regression tree; CHAID, chi-square automatic interaction detector; CTree, conditional inference trees; Epidemiological studies; Health equity; Intersectionality; MAD, mean absolute deviation; MAIHDA, multilevel analysis of individual heterogeneity and discriminatory accuracy; NHANES, National Health and Nutrition Examination Study; Research design; SD, standard deviation; U.S., United States; VIM, variable importance measure
Year: 2022 PMID: 35118188 PMCID: PMC8800141 DOI: 10.1016/j.ssmph.2022.101032
Source DB: PubMed Journal: SSM Popul Health ISSN: 2352-8273
Description of input variables in the data generation models.
| Variable | Model 1 (categorical inputs): type | Model 1 (categorical inputs): distribution | Model 2 (mixed categorical and continuous inputs): type | Model 2 (mixed categorical and continuous inputs): distribution |
|---|---|---|---|---|
| X1 | Categorical | P(X1 = 0) = 0.25 | Continuous (split in quartiles to create intersections for prediction) | mean=0, variance=1 |
| X2 | Binary | P(X2=1) = 0.2 | Binary | P(X2=1) = 0.2 |
| X3 | Binary | P(X3=1) = 0.5 | Binary | P(X3=1) = 0.5 |
| X4 | Binary | Mediation: | Binary | Mediation: |
| X5 | Binary | P(X5=1) = 0.25 | Binary | P(X5=1) = 0.25 |
| X6 | Categorical | P(X6 = 0) = 0.33 | Continuous (split in tertiles to create intersections for prediction) | mean=0, variance=1 |
Each simulated model resulted in 192 intersections (4 × 2 × 2 × 2 × 2 × 3 = 192).
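The count of 192 follows directly from the number of levels of each input variable. A minimal sketch of that enumeration is given here; the integer level codes (0, 1, 2, ...) are an illustrative assumption, not the coding used by the authors.

```python
from itertools import product

# Number of levels per variable, taken from the table above:
# X1 has 4 levels (or quartiles), X2-X5 are binary, X6 has 3 levels (or tertiles).
# The integer codes themselves are hypothetical placeholders.
levels = {
    "X1": range(4),
    "X2": range(2),
    "X3": range(2),
    "X4": range(2),
    "X5": range(2),
    "X6": range(3),
}

# Each combination of one level per variable defines one intersection.
intersections = list(product(*levels.values()))

print(len(intersections))  # 192
```

Each tuple in `intersections` identifies one stratum, for example `(0, 1, 0, 1, 0, 2)`, whose outcome prevalence the compared methods attempt to estimate.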
Proportion of converged saturated regression models over 1000 iterations by sample size.
| Scenario (% of models converged) | N=2000 | N=5000 | N=50,000 | N=200,000 |
|---|---|---|---|---|
| Common binary outcome, categorical inputs | 16.7 | 83.0 | 100.0 | 100.0 |
| Common binary outcome, mixed inputs | 99.8 | 100.0 | 100.0 | 100.0 |
| Rare binary outcome, categorical inputs | 48.0 | 85.5 | 100.0 | 100.0 |
| Rare binary outcome, mixed inputs | 98.9 | 99.8 | 100.0 | 100.0 |
Fig. 1A to 1D. Boxplots of the mean absolute deviation (MAD) of intersection estimations for four different sample sizes (graph excludes outliers). 1A: Common outcome with categorical inputs. 1B: Rare outcome with categorical inputs. 1C: Common outcome with mixed inputs. 1D: Rare outcome with mixed inputs. Abbreviations: CART = classification and regression tree; CHAID = chi-square automatic interaction detector; CTree = conditional inference trees; MAIHDA = multilevel analysis of individual heterogeneity and discriminatory accuracy.
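The MAD plotted in Fig. 1 summarizes how far estimated intersection prevalences fall from their true values. The record does not give the formula, so the sketch below assumes the usual definition, the mean of absolute differences between estimated and true prevalence across intersections. The function name and example numbers are hypothetical.

```python
def mean_absolute_deviation(estimated, true):
    """Mean of |estimate - truth| across intersections (assumed definition)."""
    if len(estimated) != len(true):
        raise ValueError("estimated and true must have the same length")
    if not true:
        raise ValueError("at least one intersection is required")
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)


# Hypothetical prevalences for three intersections (proportions, not study data).
estimated_prev = [0.10, 0.20, 0.05]
true_prev = [0.12, 0.15, 0.05]

mad = mean_absolute_deviation(estimated_prev, true_prev)
print(round(mad, 4))
```

In a simulation, one MAD value would be obtained per iteration and method, and the boxplots would then show the distribution of those values over iterations.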
Fig. 2A to 2C. Prevalence of high blood pressure by intersection. Abbreviations: CART = classification and regression tree; CHAID = chi-square automatic interaction detector; CTree = conditional inference trees; MAIHDA = multilevel analysis of individual heterogeneity and discriminatory accuracy.
Splitting percentage (% of 1000 iterations) for each variable.
| Scenario | Variable | CART, N=2000 | CART, N=5000 | CART, N=50,000 | CART, N=200,000 | CTree, N=2000 | CTree, N=5000 | CTree, N=50,000 | CTree, N=200,000 | CHAID, N=2000 | CHAID, N=5000 | CHAID, N=50,000 | CHAID, N=200,000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rare binary outcome, categorical inputs | X1 | 0 | 0 | 0 | 0 | 8 | 22 | 99 | 100 | 19 | 37 | 99 | 100 |
| | X2 | 0 | 0 | 0 | 0 | 5 | 11 | 72 | 99 | 21 | 32 | 90 | 100 |
| | X3 | 0 | 0 | 0 | 0 | 56 | 78 | 100 | 100 | 73 | 90 | 100 | 100 |
| | X4 | 0 | 0 | 0 | 0 | 63 | 83 | 100 | 100 | 76 | 92 | 100 | 100 |
| | X5 | 0 | 0 | 0 | 0 | 27 | 65 | 100 | 100 | 53 | 80 | 100 | 100 |
| | X6 | 0 | 0 | 0 | 0 | 2 | 4 | 20 | 45 | 12 | 18 | 47 | 67 |
| Rare binary outcome, mixed inputs | X1 | 0.1 | 0 | 0 | 0 | 50 | 79 | 100 | 100 | – | – | – | – |
| | X2 | 0 | 0 | 0 | 0 | 10 | 22 | 94 | 100 | – | – | – | – |
| | X3 | 0.1 | 0 | 0 | 0 | 49 | 73 | 100 | 100 | – | – | – | – |
| | X4 | 0.1 | 0 | 0 | 0 | 52 | 75 | 100 | 100 | – | – | – | – |
| | X5 | 0 | 0 | 0 | 0 | 19 | 47 | 98 | 100 | – | – | – | – |
| | X6 | 0.1 | 0 | 0 | 0 | 3 | 5 | 23 | 50 | – | – | – | – |
| Common binary outcome, categorical inputs | X1 | 0.7 | 0.4 | 0 | 0 | 51 | 84 | 100 | 100 | 65 | 90 | 100 | 100 |
| | X2 | 0.7 | 0.4 | 0 | 0 | 24 | 52 | 98 | 100 | 50 | 76 | 100 | 100 |
| | X3 | 0.5 | 0.2 | 0 | 0 | 83 | 95 | 100 | 100 | 92 | 98 | 100 | 100 |
| | X4 | 0.6 | 0.2 | 0 | 0 | 86 | 94 | 100 | 100 | 93 | 97 | 100 | 100 |
| | X5 | 0.2 | 0.2 | 0 | 0 | 73 | 88 | 100 | 100 | 85 | 95 | 100 | 100 |
| | X6 | 0 | 0 | 0 | 0 | 7 | 12 | 49 | 77 | 23 | 35 | 68 | 84 |
| Common binary outcome, mixed inputs | X1 | 2 | 0.4 | 0 | 0 | 83 | 94 | 100 | 100 | – | – | – | – |
| | X2 | 0.5 | 0.1 | 0 | 0 | 38 | 64 | 99 | 100 | – | – | – | – |
| | X3 | 0.6 | 0.1 | 0 | 0 | 76 | 90 | 100 | 100 | – | – | – | – |
| | X4 | 1 | 0.2 | 0 | 0 | 80 | 91 | 100 | 100 | – | – | – | – |
| | X5 | 0.4 | 0.1 | 0 | 0 | 62 | 84 | 100 | 100 | – | – | – | – |
| | X6 | 0.7 | 0 | 0 | 0 | 6 | 13 | 42 | 64 | – | – | – | – |
Random forest average variable importance measure (VIM) (impurity-based: average over 1000 iterations).
| Scenario | N | X1 | X2 | X3 | X4 | X5 | X6 |
|---|---|---|---|---|---|---|---|
| Rare binary outcome, categorical inputs | 2000 | 4 | 2 | 2 | 2 | 2 | 3 |
| | 5000 | 4 | 2 | 3 | 3 | 2 | 3 |
| | 50,000 | 9 | 4 | 19 | 21 | 10 | 4 |
| | 200,000 | 25 | 10 | 77 | 82 | 39 | 4 |
| Rare binary outcome, mixed inputs | 2000 | 33 | 1 | 1 | 1 | 1 | 32 |
| | 5000 | 74 | 3 | 3 | 3 | 2 | 70 |
| | 50,000 | 376 | 10 | 15 | 16 | 11 | 352 |
| | 200,000 | 793 | 21 | 54 | 58 | 32 | 712 |
| Common binary outcome, categorical inputs | 2000 | 18 | 8 | 16 | 17 | 11 | 12 |
| | 5000 | 26 | 12 | 35 | 38 | 21 | 14 |
| | 50,000 | 124 | 51 | 308 | 338 | 166 | 17 |
| | 200,000 | 442 | 170 | 1294 | 1361 | 645 | 17 |
| Common binary outcome, mixed inputs | 2000 | 112 | 8 | 12 | 13 | 9 | 101 |
| | 5000 | 226 | 14 | 26 | 27 | 17 | 204 |
| | 50,000 | 938 | 61 | 211 | 223 | 124 | 782 |
| | 200,000 | 1902 | 184 | 822 | 879 | 435 | 1392 |
Random forest variable importance measure (VIM) (permutation-based: % of 200 iterations p-value is less than 0.05).
| Scenario | N | X1 | X2 | X3 | X4 | X5 | X6 |
|---|---|---|---|---|---|---|---|
| Rare binary outcome, categorical inputs | 2000 | 19.0 | 13.5 | 20.5 | 25.0 | 26.5 | 14.5 |
| | 5000 | 48.0 | 33.0 | 73.0 | 76.5 | 65.5 | 16.5 |
| | 50,000 | 100.0 | 99.0 | 100.0 | 100.0 | 100.0 | 44.0 |
| | 200,000 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 42.0 |
| Rare binary outcome, mixed inputs | 2000 | 20.5 | 7.5 | 8.5 | 7.5 | 8.0 | 2.0 |
| | 5000 | 26.0 | 11.0 | 7.0 | 12.0 | 14.0 | 3.5 |
| | 50,000 | 79.0 | 51.0 | 55.0 | 57.0 | 70.5 | 0.5 |
| | 200,000 | 100.0 | 90.0 | 82.0 | 81.0 | 95.5 | 0.5 |
| Common binary outcome, categorical inputs | 2000 | 56.5 | 36.0 | 69.5 | 65.0 | 63.0 | 7.5 |
| | 5000 | 96.0 | 81.0 | 89.5 | 89.5 | 85.0 | 12.0 |
| | 50,000 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 46.5 |
| | 200,000 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 45.0 |
| Common binary outcome, mixed inputs | 2000 | 61.5 | 30.5 | 55.5 | 60.0 | 49.0 | 1.5 |
| | 5000 | 83.5 | 59.0 | 73.0 | 73.0 | 74.0 | 2.5 |
| | 50,000 | 99.5 | 98.5 | 91.0 | 94.0 | 95.0 | 1.0 |
| | 200,000 | 100.0 | 99.5 | 95.0 | 96.5 | 100.0 | 0.0 |
Variable importance measures (VIM) for NHANES high blood pressure.
| Variable | CART: splitting variable (Yes/No) | CTree: splitting variable (Yes/No) | CHAID: splitting variable (Yes/No) | Random forest: impurity-based VIM | Random forest: permutation-based VIM | Random forest: permutation-based VIM |
|---|---|---|---|---|---|---|
| Age | Yes | Yes | Yes | 509.927473 | 0.0552 | 0.010 |
| Gender | No | Yes | Yes | 44.347381 | 0.005197 | 0.010 |
| Race | Yes | Yes | Yes | 39.163758 | 0.003238 | 0.010 |
| Income | No | No | Yes | 8.795415 | 0.000413 | 0.337 |
Fig. 3A to 3D. Boxplots of the MAD of intersection-specific estimations for two different small sample sizes, and a simulated outcome prevalence of 50% (graph excludes outliers). 3A: Categorical inputs. 3B: Categorical inputs with larger effect sizes only for the interaction effects. 3C: Mixed inputs. 3D: Mixed inputs with larger effect sizes only for the interaction effects. Abbreviations: CART = classification and regression tree; CHAID = chi-square automatic interaction detector; CTree = conditional inference trees; MAIHDA = multilevel analysis of individual heterogeneity and discriminatory accuracy.
Bias and variance of single simulation models at small sample sizes (median, minimum, and maximum values amongst the 192 intersections).
| Outcome prevalence | Method | Bias, N=2000 | Bias, N=5000 | Variance, N=2000 | Variance, N=5000 |
|---|---|---|---|---|---|
| Rare outcome prevalence | Correctly-specified regression | 0.02 (−0.1, 0.3) | −0.005 (−0.1, 0.1) | 3.06 (0.3, 29.1) | 1.09 (0.1, 11.0) |
| | Cross classification | 0.009 (−0.7, 1.1) | −0.002 (−0.5, 0.6) | 44.01 (3.1, 362.2) | 15.92 (1.2, 165.8) |
| | MAIHDA | 0.04 (−2.8, 2.8) | 0.06 (−2.8, 2.7) | 1.42 (0.2, 11.7) | 0.56 (0.1, 4.1) |
| | Main effects regression | 0.08 (−2.7, 3.0) | 0.07 (−2.8, 2.9) | 1.39 (0.2, 10.9) | 0.54 (0.1, 4.0) |
| Common outcome prevalence | Correctly-specified regression | −0.01 (−0.2, 0.4) | −0.03 (−0.3, 0.2) | 10.37 (1.2, 93.6) | 3.78 (0.5, 36.0) |
| | Cross classification | 0.01 (−2.4, 2.5) | −0.01 (−2.0, 1.1) | 170.72 (13.9, 1053.9) | 58.75 (5.3, 591.0) |
| | MAIHDA | −0.19 (−12.4, 11.1) | −0.26 (−11.7, 10.5) | 5.66 (0.5, 30.2) | 2.24 (0.2, 14.8) |
| | Main effects regression | −0.16 (−11.6, 11.8) | −0.20 (−11.4, 11.7) | 4.73 (0.5, 40.9) | 1.76 (0.2, 14.7) |
| 50% outcome prevalence | Correctly-specified regression | −0.59 (−28.9, 72.6) | −0.47 (−28.7, 73.0) | 28.53 (4.4, 121.7) | 10.60 (1.6, 49.2) |
| | Cross classification | −0.03 (−2.4, 2.2) | −0.007 (−1.3, 1.3) | 260.6 | 96.58 (3.5, 1065.0) |
| | MAIHDA | 1.24 (−22.4, 19.7) | 0.19 (−12.5, 17.5) | 21.94 (0.5, 36.3) | 20.98 (0.2, 43.5) |
| | Main effects regression | 2.74 (−36.4, 55.8) | 2.74 (−36.2, 56.5) | 16.42 (2.6, 74.3) | 6.33 (1.0, 28.7) |

Rare outcome prevalence was on average 4%.
Common outcome prevalence was on average 15%.
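The bias and variance reported above are properties of an estimator over repeated simulation iterations, computed separately for each intersection. The sketch below shows one conventional way to obtain them for a single intersection. The function name, the example estimates, and the choice of population (rather than sample) variance are assumptions; the record does not state which form the authors used.

```python
from statistics import mean, pvariance


def bias_and_variance(estimates, truth):
    """Return (bias, variance) of repeated estimates of a single true value.

    bias     = mean of the estimates minus the true value
    variance = population variance of the estimates around their own mean
    """
    avg = mean(estimates)
    return avg - truth, pvariance(estimates, mu=avg)


# Hypothetical prevalence estimates (in %) for one intersection over three
# simulation iterations, with a hypothetical true prevalence of 4.5%.
estimates = [4.0, 5.0, 6.0]
bias, variance = bias_and_variance(estimates, truth=4.5)
print(bias, variance)
```

Repeating this for each of the 192 intersections and taking the median, minimum, and maximum would give summaries of the kind shown in the table.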
Outputs of each method, assessed and not assessed in this study.
| Method | Estimation of binary outcomes | Variable selection | Outputs not assessed in this study |
|---|---|---|---|
| Regression with interactions | Recommended for large sample sizes | Not assessed | Estimation of first-order and interaction effects |
| Cross-classification | Recommended for large sample sizes | Not applicable | Tests of significance between groups (e.g. t-tests) |
| MAIHDA | Recommended for all sample sizes | Not assessed | Estimation of main and residual effects |
| CART, CTree, or CHAID | CART: Not recommended | CART: Not recommended | Comparability of variable splitting to interaction effects identified in traditional regression models |
| Random forest | Recommended for large sample sizes | Impurity-based: |