Universal adaptability: Target-independent inference that competes with propensity scoring
Michael P Kim1,2, Christoph Kern3, Shafi Goldwasser4,5, Frauke Kreuter6,7, Omer Reingold8.
Abstract
The gold-standard approaches for gleaning statistically valid conclusions from data involve random sampling from the population. Collecting properly randomized data, however, can be challenging, so modern statistical methods, including propensity score reweighting, aim to enable valid inferences when random sampling is not feasible. We put forth an approach for making inferences based on available data from a source population that may differ in composition in unknown ways from an eventual target population. Whereas propensity scoring requires a separate estimation procedure for each different target population, we show how to build a single estimator, based on source data alone, that allows for efficient and accurate estimates on any downstream target data. We demonstrate, theoretically and empirically, that our target-independent approach to inference, which we dub "universal adaptability," is competitive with target-specific approaches that rely on propensity scoring. Our approach builds on a surprising connection between the problem of inferences in unspecified target populations and the multicalibration problem, studied in the burgeoning field of algorithmic fairness. We show how the multicalibration framework can be employed to yield valid inferences from a single source population across a diverse set of target populations.
Keywords: algorithmic fairness; propensity scoring; statistical validity
Year: 2022 PMID: 35046023 PMCID: PMC8794832 DOI: 10.1073/pnas.2108097119
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
Comparing propensity scoring and universal adaptability
| Method | Required estimation | Inference procedure |
| Propensity scoring | Estimate a target-specific propensity score | Average the variable of interest over labeled source samples, reweighted by the propensity score |
| Universal adaptability | Estimate a target-independent prediction function | Average the prediction function over unlabeled target samples |
Required estimation: Propensity scoring (PS) requires estimating a target-specific propensity score from unlabeled samples drawn from both the source and the target; universal adaptability (UA) estimates a multicalibrated prediction function from labeled samples drawn from the source alone. Inference procedure: For each method, the inferences are empirical expectations of different variables over different distributions. To obtain the PS estimates, labeled samples from the source are reweighted by the propensity score and the variable of interest is averaged; to obtain the UA estimates, the prediction function stands in for the variable of interest and is averaged across unlabeled samples from the target. Importantly, universal adaptability via multicalibration requires access to the target only at inference time, so a single prediction function yields efficient inferences simultaneously for many target populations.
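The two inference procedures described above can be contrasted in a few lines of code. The sketch below is purely illustrative: the data-generating process, the closed-form propensity (density-ratio) function, and the stand-in prediction function are all assumptions for the example, not constructions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled source sample: covariate x, binary outcome y.
n = 10_000
x_src = rng.normal(size=n)
y_src = (rng.random(n) < 1 / (1 + np.exp(-x_src))).astype(float)

# Hypothetical unlabeled target sample with a shifted covariate distribution.
x_tgt = rng.normal(loc=0.5, size=n)

# --- Propensity scoring (target-specific) ---
# For illustration, the target/source density ratio is known in closed form:
# N(0.5, 1) over N(0, 1) gives exp(0.5 * x - 0.125).
def density_ratio(x):
    return np.exp(0.5 * x - 0.125)

# Reweight the labeled source outcomes, then average.
ipsw_estimate = np.mean(density_ratio(x_src) * y_src)

# --- Universal adaptability (target-independent) ---
# p(x) is a prediction function fit once on source data alone; here the true
# conditional mean stands in for a multicalibrated fit. Averaging it over the
# unlabeled target sample yields the estimate.
def predictor(x):
    return 1 / (1 + np.exp(-x))

ua_estimate = np.mean(predictor(x_tgt))

print(ipsw_estimate, ua_estimate)
```

Note that the PS path touches the target only through the density ratio, which must be re-estimated for every new target, while the UA path reuses the same predictor and needs only unlabeled target draws at inference time.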
Fig. 1. (A) Setting. We consider a single source of labeled data (covariates and outcomes), for example, from a hospital study. Our goal is to make inferences that generalize to different target distributions, for example, to inform patient care at other hospitals. (B) Propensity scoring. First, unlabeled samples from the source and target are employed to learn a propensity score. Then, target-specific estimates are computed on the reweighted (labeled) source samples. (C) Universal adaptability via multicalibration. The MCBoost algorithm iteratively performs regression over the source data, updating the prediction function and returning a multicalibrated predictor. The output predictor can be used to make estimates in any target distribution, with performance similar to that of the target-specific propensity score estimators.
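The iterative loop in panel C can be sketched with a simplified multicalibration-style update: repeatedly find a subgroup on which the current predictor is miscalibrated and shift its predictions toward that subgroup's outcome rate. This is a toy stand-in for MCBoost, which audits over a rich class of functions via regression; the data, the fixed subgroup collection, and the tolerance below are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical source data: two binary covariates and a binary outcome whose
# rate depends on their interaction (a subgroup a marginal fit would miss).
n = 20_000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
y = (rng.random(n) < 0.2 + 0.5 * (a & b)).astype(float)

# Collection of subgroups on which calibration is demanded.
subgroups = [np.ones(n, bool), a == 1, b == 1, (a == 1) & (b == 1)]

p = np.full(n, y.mean())  # start from the overall outcome rate
alpha = 0.01              # tolerance for calibration violations
for _ in range(100):
    # Find the subgroup with the largest calibration gap.
    worst, gap = None, alpha
    for g in subgroups:
        residual = abs(np.mean(y[g] - p[g]))
        if residual > gap:
            worst, gap = g, residual
    if worst is None:
        break  # every subgroup is calibrated to within alpha
    # Shift predictions on the violating subgroup toward its outcome rate.
    p[worst] += np.mean(y[worst] - p[worst])
    p = np.clip(p, 0.0, 1.0)

print(max(abs(np.mean(y[g] - p[g])) for g in subgroups))
```

Each correction makes the chosen subgroup exactly calibrated and strictly reduces squared error, so the loop terminates with all subgroup gaps below the tolerance; the resulting predictor's average over any mixture of these subgroups then tracks the true outcome rate.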
Source and target composition
| | Sample composition (%) | | Average mortality (%) | |
| Subgroup | NHANES | NHIS | NHANES | NHIS |
| Overall | | | 27.67 | 17.57 |
| Male | 46.75 | 47.74 | 30.56 | 18.77 |
| Female | 53.25 | 52.26 | 25.11 | 16.48 |
| Age 18 y to 24 y | 13.87 | 13.36 | 3.81 | 2.23 |
| Age 25 y to 44 y | 36.43 | 43.61 | 5.70 | 3.86 |
| Age 45 y to 64 y | 23.11 | 26.62 | 22.71 | 17.66 |
| Age 65 y to 69 y | 6.34 | 5.10 | 48.61 | 45.52 |
| Age 70 y to 74 y | 6.57 | 4.57 | 64.24 | 60.03 |
| Age 75+ y | 13.69 | 6.75 | 90.47 | 86.25 |
| White | 42.56 | 75.81 | 37.25 | 18.70 |
| Black | 27.30 | 11.19 | 23.08 | 18.94 |
| Hispanic | 28.59 | 9.01 | 18.38 | 10.18 |
| Other | 1.55 | 3.99 | 15.62 | 8.96 |
For NHANES and NHIS, each subpopulation is listed with its prevalence (percent) in the respective distribution and its average mortality rate (percent) in NHANES and NHIS.
Comparison of inference methods
| Subgroup | Naive | IPSW: Overall | IPSW: Subgroup | IPSW: Hybrid | RF: Naive | MC-Ridge |
| Overall | 10.10 (57.5) | 2.37 (13.5) | — | 1.11 (6.3) | ||
| Male | 11.80 (62.9) | 2.51 (13.4) | 0.91 (4.9) | –1.34 (7.1) | –0.34 (1.8) | |
| Female | 8.63 (52.4) | 2.40 (14.6) | 3.99 (24.2) | 1.89 (11.5) | 2.43 (14.8) | |
| Age 18 y to 24 y | 1.57 (70.5) | –0.39 (17.5) | 5.18 (232.1) | 6.03 (270.2) | 1.76 (79.0) | |
| Age 25 y to 44 y | 1.84 (47.6) | – | –0.41 (10.6) | 0.82 (21.2) | 0.66 (17.2) | |
| Age 45 y to 64 y | 5.05 (28.6) | –0.75 (4.2) | –0.41 (2.3) | 0.86 (4.8) | –0.29 (1.6) | |
| Age 65 y to 69 y | 3.09 (6.8) | –4.23 (9.3) | –5.23 (11.5) | –5.40 (11.9) | – | – |
| Age 70 y to 74 y | 4.21 (7.0) | –1.36 (2.3) | –4.07 (6.8) | –3.02 (5.0) | ||
| Age 75+ y | 4.22 (4.9) | 3.53 (4.1) | 2.85 (3.3) | – | 0.51 (0.6) | 2.19 (2.5) |
| White | 18.55 (99.2) | 3.53 (18.9) | 0.75 (4.0) | 1.03 (5.5) | 0.69 (3.7) | |
| Black | 4.14 (21.9) | –4.00 (21.1) | – | –1.30 (6.8) | – | – |
| Hispanic | 8.20 (80.5) | 1.73 (17.0) | 2.84 (27.9) | 2.91 (28.6) | 1.55 (15.2) | |
| Other | 6.66 (74.4) | – | –3.54 (39.5) | 2.44 (27.3) | 3.52 (39.3) | –2.06 (23.0) |
Shift-aware inferences: The estimation error in the inferred mortality rate for each technique on each subpopulation is shown (percent error in parentheses); dashes denote entries unavailable in this record. For each subgroup, the technique achieving the best performance (within a small tolerance) is in bold. Results highlight the universal adaptability of the multicalibrated prediction function (MC-Ridge).
Fig. 2. Relative error (percent) in inferred voting rates under synthetic shift of varying intensity q. Shifts are generated by three models of the propensity score: logistic with linear terms (Logit-linear), logistic with linear terms and pairwise interactions (Logit-interaction), and a decision tree (Tree). The error of the naive, IPSW, and RF-based inferences is plotted against the unbiased baseline.