Universal adaptability: Target-independent inference that competes with propensity scoring
Michael P Kim1,2, Christoph Kern3, Shafi Goldwasser4,5, Frauke Kreuter6,7, Omer Reingold8.
Abstract
The gold-standard approaches for gleaning statistically valid conclusions from data involve random sampling from the population. Collecting properly randomized data, however, can be challenging, so modern statistical methods, including propensity score reweighting, aim to enable valid inferences when random sampling is not feasible. We put forth an approach for making inferences based on available data from a source population that may differ in composition in unknown ways from an eventual target population. Whereas propensity scoring requires a separate estimation procedure for each different target population, we show how to build a single estimator, based on source data alone, that allows for efficient and accurate estimates on any downstream target data. We demonstrate, theoretically and empirically, that our target-independent approach to inference, which we dub "universal adaptability," is competitive with target-specific approaches that rely on propensity scoring. Our approach builds on a surprising connection between the problem of inferences in unspecified target populations and the multicalibration problem, studied in the burgeoning field of algorithmic fairness. We show how the multicalibration framework can be employed to yield valid inferences from a single source population across a diverse set of target populations.
Keywords: algorithmic fairness; propensity scoring; statistical validity
Year: 2022 PMID: 35046023 PMCID: PMC8794832 DOI: 10.1073/pnas.2108097119
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
Comparing propensity scoring and universal adaptability
| Method | Required estimation | Inference procedure |
| Propensity scoring | Estimate a target-specific propensity score | Average the variable of interest over labeled source samples, reweighted by the propensity score |
| Universal adaptability | Estimate a target-independent prediction function | Average the prediction function over unlabeled target samples |
Required estimation: Propensity scoring (PS) requires estimating a target-specific propensity score from unlabeled samples drawn from both the source and the target; universal adaptability (UA) estimates a multicalibrated prediction function from labeled samples drawn from the source alone. Inference procedure: For each method, the inferences are empirical expectations of different variables over different distributions. To obtain the PS estimates, labeled samples from the source are reweighted by the propensity score and the variable of interest is averaged; to obtain the UA estimates, the prediction function stands in for the variable of interest and is averaged across unlabeled samples from the target. Importantly, universal adaptability via multicalibration requires access to the target only at inference time, so a single prediction function yields efficient inferences simultaneously for many target populations.
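The two inference procedures described above can be contrasted in a few lines of code. The sketch below is purely illustrative: the data-generating process, the closed-form propensity (density-ratio) function, and the stand-in prediction function are all assumptions for the example, not constructions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled source sample: covariate x, binary outcome y.
n = 10_000
x_src = rng.normal(size=n)
y_src = (rng.random(n) < 1 / (1 + np.exp(-x_src))).astype(float)

# Hypothetical unlabeled target sample with a shifted covariate distribution.
x_tgt = rng.normal(loc=0.5, size=n)

# --- Propensity scoring (target-specific) ---
# For illustration, the target/source density ratio is known in closed form:
# N(0.5, 1) over N(0, 1) gives exp(0.5 * x - 0.125).
def density_ratio(x):
    return np.exp(0.5 * x - 0.125)

# Reweight the labeled source outcomes, then average.
ipsw_estimate = np.mean(density_ratio(x_src) * y_src)

# --- Universal adaptability (target-independent) ---
# p(x) is a prediction function fit once on source data alone; here the true
# conditional mean stands in for a multicalibrated fit. Averaging it over the
# unlabeled target sample yields the estimate.
def predictor(x):
    return 1 / (1 + np.exp(-x))

ua_estimate = np.mean(predictor(x_tgt))

print(ipsw_estimate, ua_estimate)
```

Note that the PS path touches the target only through the density ratio, which must be re-estimated for every new target, while the UA path reuses the same predictor and needs only unlabeled target draws at inference time.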
Fig. 1. (A) Setting. We consider a single source of labeled data (covariates and outcomes), for example, from a hospital study. Our goal is to make inferences that generalize to different target distributions, for example, to inform patient care at other hospitals. (B) Propensity scoring. First, unlabeled samples from the source and target are employed to learn a propensity score. Then, target-specific estimates are computed on the reweighted (labeled) source samples. (C) Universal adaptability via multicalibration. The MCBoost algorithm iteratively performs regression over the source data, updating the prediction function and returning a multicalibrated predictor. The output predictor can be used to make estimates in any target distribution, with performance similar to that of the target-specific propensity score estimators.
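The iterative loop in panel C can be sketched with a simplified multicalibration-style update: repeatedly find a subgroup on which the current predictor is miscalibrated and shift its predictions toward that subgroup's outcome rate. This is a toy stand-in for MCBoost, which audits over a rich class of functions via regression; the data, the fixed subgroup collection, and the tolerance below are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical source data: two binary covariates and a binary outcome whose
# rate depends on their interaction (a subgroup a marginal fit would miss).
n = 20_000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
y = (rng.random(n) < 0.2 + 0.5 * (a & b)).astype(float)

# Collection of subgroups on which calibration is demanded.
subgroups = [np.ones(n, bool), a == 1, b == 1, (a == 1) & (b == 1)]

p = np.full(n, y.mean())  # start from the overall outcome rate
alpha = 0.01              # tolerance for calibration violations
for _ in range(100):
    # Find the subgroup with the largest calibration gap.
    worst, gap = None, alpha
    for g in subgroups:
        residual = abs(np.mean(y[g] - p[g]))
        if residual > gap:
            worst, gap = g, residual
    if worst is None:
        break  # every subgroup is calibrated to within alpha
    # Shift predictions on the violating subgroup toward its outcome rate.
    p[worst] += np.mean(y[worst] - p[worst])
    p = np.clip(p, 0.0, 1.0)

print(max(abs(np.mean(y[g] - p[g])) for g in subgroups))
```

Each correction makes the chosen subgroup exactly calibrated and strictly reduces squared error, so the loop terminates with all subgroup gaps below the tolerance; the resulting predictor's average over any mixture of these subgroups then tracks the true outcome rate.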
Source and target composition
| | Sample composition (%) | | Average mortality (%) | |
| Subgroup | NHANES | NHIS | NHANES | NHIS |
| Overall | | | 27.67 | 17.57 |
| Male | 46.75 | 47.74 | 30.56 | 18.77 |
| Female | 53.25 | 52.26 | 25.11 | 16.48 |
| Age 18 y to 24 y | 13.87 | 13.36 | 3.81 | 2.23 |
| Age 25 y to 44 y | 36.43 | 43.61 | 5.70 | 3.86 |
| Age 45 y to 64 y | 23.11 | 26.62 | 22.71 | 17.66 |
| Age 65 y to 69 y | 6.34 | 5.10 | 48.61 | 45.52 |
| Age 70 y to 74 y | 6.57 | 4.57 | 64.24 | 60.03 |
| Age 75+ y | 13.69 | 6.75 | 90.47 | 86.25 |
| White | 42.56 | 75.81 | 37.25 | 18.70 |
| Black | 27.30 | 11.19 | 23.08 | 18.94 |
| Hispanic | 28.59 | 9.01 | 18.38 | 10.18 |
| Other | 1.55 | 3.99 | 15.62 | 8.96 |
For NHANES and NHIS, each subpopulation is listed with its prevalence (percent) in the respective distribution and its average mortality rate (percent) in NHANES and NHIS.
Comparison of inference methods
| Subgroup | Naive | IPSW: Overall | IPSW: Subgroup | IPSW: Hybrid | RF: Naive | MC-Ridge |
| Overall | 10.10 (57.5) | 2.37 (13.5) | — | 1.11 (6.3) | ||
| Male | 11.80 (62.9) | 2.51 (13.4) | 0.91 (4.9) | –1.34 (7.1) | –0.34 (1.8) | |
| Female | 8.63 (52.4) | 2.40 (14.6) | 3.99 (24.2) | 1.89 (11.5) | 2.43 (14.8) | |
| Age 18 y to 24 y | 1.57 (70.5) | –0.39 (17.5) | 5.18 (232.1) | 6.03 (270.2) | 1.76 (79.0) | |
| Age 25 y to 44 y | 1.84 (47.6) | – | –0.41 (10.6) | 0.82 (21.2) | 0.66 (17.2) | |
| Age 45 y to 64 y | 5.05 (28.6) | –0.75 (4.2) | –0.41 (2.3) | 0.86 (4.8) | –0.29 (1.6) | |
| Age 65 y to 69 y | 3.09 (6.8) | –4.23 (9.3) | –5.23 (11.5) | –5.40 (11.9) | – | – |
| Age 70 y to 74 y | 4.21 (7.0) | –1.36 (2.3) | –4.07 (6.8) | –3.02 (5.0) | ||
| Age 75+ y | 4.22 (4.9) | 3.53 (4.1) | 2.85 (3.3) | – | 0.51 (0.6) | 2.19 (2.5) |
| White | 18.55 (99.2) | 3.53 (18.9) | 0.75 (4.0) | 1.03 (5.5) | 0.69 (3.7) | |
| Black | 4.14 (21.9) | –4.00 (21.1) | – | –1.30 (6.8) | – | – |
| Hispanic | 8.20 (80.5) | 1.73 (17.0) | 2.84 (27.9) | 2.91 (28.6) | 1.55 (15.2) | |
| Other | 6.66 (74.4) | – | –3.54 (39.5) | 2.44 (27.3) | 3.52 (39.3) | –2.06 (23.0) |
Shift-aware inferences: The estimation error in the inferred mortality rate for each technique on each subpopulation is shown (percent error in parentheses); dashes denote entries unavailable in this record. For each subgroup, the technique achieving the best performance (within a small tolerance) is in bold. Results highlight the universal adaptability of the multicalibrated prediction function (MC-Ridge).
Fig. 2. Relative error (percent) in inferred voting rates under synthetic shift of varying intensity q. Shifts are generated by three models of the propensity score: logistic with linear terms (Logit-linear), logistic with linear terms and pairwise interactions (Logit-interaction), and a decision tree (Tree). The error of the naive, IPSW, and RF-based inferences is plotted against the unbiased baseline.