Robust empirical calibration of p-values using observational data.

Martijn J Schuemie^1,2, George Hripcsak^3,2, Patrick B Ryan^1,3,2, David Madigan^4,2, Marc A Suchard^5,2.

Abstract

Year: 2016 PMID: 27592566 PMCID: PMC5108459 DOI: 10.1002/sim.6977

Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373

In our previous paper [1], we proposed empirical calibration of p‐values as a strategy for mitigating risk of systematic error when estimating average treatment effects from observational studies. By estimating the effect of exposure on outcomes across a collection of settings where the exposure is not believed to cause the outcome (negative controls), one can estimate an empirical null distribution of the exposure effect and compute calibrated p‐values that take both random and systematic error into account.

Gruber and Tchetgen [2] recently published a simulation that is intended to demonstrate the theoretical scenario in which empirical calibration may not be recommended. We welcome a thoughtful debate about the theoretical and empirical underpinnings of empirical calibration and share an enthusiasm to develop practical solutions that can be made broadly applicable to all observational analyses as a means of generating more reliable evidence. However, as we explain in more detail later, we would like to highlight and challenge the premise of some of the concerns raised by Gruber and Tchetgen and demonstrate how our empirical findings support empirical calibration as a robust approach to Type I error control.

We believe their simulations are not realistic: the simulated estimates of negative controls showed severe bias with an odds ratio (OR) of 3, and perhaps more important, this bias was also simulated to be unusually homogeneous across the sample of negative controls. No one would continue their analysis after observing such estimates for their negative controls, and in all our real‐world experiments using calibration, we have never seen such bias or homogeneity of negative controls. They confirm our findings that nominal p‐values are strikingly and perhaps dangerously optimistic but recommend that we continue to use them. We believe that p‐value calibration should be carried out, and the results should be reported.

We first would like to make the following points of clarification:

Empirical calibration deals with bias in observational research estimates, which is the bias that remains after any measures taken to adjust for bias and confounding inherent in observational data, such as propensity score adjustment or using a self‐controlled design.

Empirical calibration does not require that the sources and structure of this bias are the same for all negative controls. Instead, the calibration approach assumes that the bias observed for each negative control draws from a common distribution of biases that can plague the specific analysis at hand. For example, if we suspect strong confounding by indication for our exposure‐outcome pair of interest, it is not necessary for all negative controls to have the same confounding by indication.

We believe bias inherent in an analysis can be approximated using a Gaussian distribution for our purposes.

Previously, we performed a leave‐one‐out validation study that provides strong support for these points. In this analysis, for each negative control, we computed the empirical null distribution using all other controls and calibrated the p‐value of the held‐out control using that distribution. Results showed good calibration properties (e.g. approximately 5% of the negative controls had calibrated p < 0.05 as expected) despite the fact that negative controls represented a wide range of drug‐outcome pairs, with potentially very different sources of bias. (Note that we recommend that people always perform this leave‐one‐out analysis, because in specific cases the previously mentioned assumptions could be violated.)
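The calibration idea described above can be illustrated with a minimal sketch. It assumes that each negative control estimate on the log scale is distributed as N(mu, sigma^2 + se_i^2), where mu and sigma describe the Gaussian distribution of bias and se_i is the estimate's own standard error. The function names, the grid-based profile likelihood fit, and the negative control values are all hypothetical and chosen for illustration only; this is not the implementation in the EmpiricalCalibration package.

```python
import math
from statistics import NormalDist


def fit_null(log_rr, se, sigma_max=3.0, steps=3000):
    """Fit a Gaussian empirical null N(mu, sigma^2) by profile likelihood.

    Assumed model: estimate_i ~ N(mu, sigma^2 + se_i^2). All se must be > 0.
    A simple grid over sigma is used; mu is solved in closed form per sigma.
    """
    best = None
    for k in range(steps + 1):
        sigma = sigma_max * k / steps
        var = [sigma ** 2 + s ** 2 for s in se]
        # For fixed sigma, the maximum-likelihood mu is the precision-weighted mean.
        mu = sum(y / v for y, v in zip(log_rr, var)) / sum(1.0 / v for v in var)
        loglik = sum(
            -0.5 * math.log(2 * math.pi * v) - 0.5 * (y - mu) ** 2 / v
            for y, v in zip(log_rr, var)
        )
        if best is None or loglik > best[0]:
            best = (loglik, mu, sigma)
    return best[1], best[2]


def traditional_p(log_rr, se):
    """Two-sided p-value against a null of no effect and no bias."""
    return 2 * NormalDist().cdf(-abs(log_rr / se))


def calibrated_p(log_rr, se, mu, sigma):
    """Two-sided p-value against the empirical null (random plus systematic error)."""
    z = (log_rr - mu) / math.sqrt(sigma ** 2 + se ** 2)
    return 2 * NormalDist().cdf(-abs(z))


# Hypothetical negative control estimates (log relative risks) and standard errors.
nc_log_rr = [0.1, 0.5, -0.2, 0.7, 0.3, 0.9, 0.0, 0.4, -0.1, 0.6]
nc_se = [0.1] * len(nc_log_rr)

mu, sigma = fit_null(nc_log_rr, nc_se)

# Hypothetical estimate for the outcome of interest.
p_traditional = traditional_p(0.4, 0.1)
p_calibrated = calibrated_p(0.4, 0.1, mu, sigma)
print(f"mu={mu:.3f} sigma={sigma:.3f} "
      f"traditional p={p_traditional:.5f} calibrated p={p_calibrated:.3f}")
```

In this made-up example the traditional p‐value falls below 0.05, while the calibrated p‐value does not, because an estimate of this size is unremarkable relative to the spread of the negative controls.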

Exchangeability

Empirical calibration assumes some level of exchangeability between the negative controls and the drug‐outcome hypothesis of interest. Our leave‐one‐out evaluation shows that there is exchangeability within the negative controls; for any negative control, the others can be used to calibrate its p‐value accurately. But we have no evidence that this exchangeability holds for drug‐outcome pairs outside the set of negative controls. Gruber and Tchetgen's main challenge is to this assumption. They argue that theoretically all negative controls could show strong bias in one direction, while the hypothesis of interest shows no bias or the other way around, in which case empirical calibration will fail. They use simulations to demonstrate this point. In their ‘medium’ simulation, all negative controls show strong bias, with a mean OR of 3, whereas the hypotheses of interest all have small bias, as shown in Figure 1.
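The failure mode in this argument can be made concrete with a small sketch. It assumes a Gaussian empirical null with mean mu and standard deviation sigma on the log scale; only the mean OR of 3 comes from the text, and every other number (the spread of the null, the estimate for the hypothesis of interest, its standard error) is hypothetical.

```python
import math
from statistics import NormalDist


def calibrated_p(log_rr, se, mu, sigma):
    """Two-sided p-value against a Gaussian empirical null N(mu, sigma^2)."""
    z = (log_rr - mu) / math.sqrt(sigma ** 2 + se ** 2)
    return 2 * NormalDist().cdf(-abs(z))


# Hypothesis of interest: an unbiased estimate of a truly null effect (hypothetical).
log_rr, se = 0.0, 0.1

# Homogeneous, strongly biased negative controls: mean OR of 3, small spread
# (the value of sigma is an assumption made for illustration).
p_homogeneous = calibrated_p(log_rr, se, mu=math.log(3), sigma=0.1)

# Wide, heterogeneous negative controls (hypothetical mean and spread).
p_heterogeneous = calibrated_p(log_rr, se, mu=0.3, sigma=0.5)

print(f"homogeneous null: p={p_homogeneous:.2e}; heterogeneous null: p={p_heterogeneous:.3f}")
```

Under the narrow null centred on OR = 3, the unbiased estimate is wrongly declared significant; under the wide null it is not, which is the contrast between the simulated and real‐world settings that the rest of this section relies on.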

Figure 1

Density plot of the point estimates generated in the ‘medium’ simulation by Gruber and Tchetgen, corresponding to their Study Ib.

Gruber and Tchetgen provide no evidence that such conditions happen in real applications. They do, however, motivate these conditions by noting that, when using negative exposure controls for a newly marketed drug, ‘for example, uptake of a newly marketed drug, prevalence of off‐label drug use, availability of alternative treatment options and physician prescribing behaviors typically changes in the years following drug approval.’ To frame this in their simulation, Gruber and Tchetgen assume that a particular study design (e.g. a new‐user cohort design using propensity score adjustment) would be almost unbiased for a newly marketed drug (the hypothesis of interest) but would systematically produce an OR of 3 for negative control drugs that have been on the market longer. They do not discuss the consequence that a reasonable researcher would become suspicious when observing OR = 3 for all negative controls and might decide to discard the analytic approach.

In practice, negative controls are inherently heterogeneous. Figure 2 shows the distribution of negative control estimates from a new‐user cohort study [3] comparing the risk of gastrointestinal (GI) bleeds in celecoxib versus diclofenac users, where we use negative outcome controls (i.e. outcomes not believed to be causally related to either drug) (see the Supporting information for details of the study). In this particular analysis, we make no adjustments for confounding, and we observe large bias as expected.

Figure 2

Estimates from a crude comparison between celecoxib and diclofenac users. Estimates below the dashed line (gray area) have p < 0.05 using traditional p‐value calculation. Estimates in the orange areas have p < 0.05 using the calibrated p‐value calculation. Blue dots indicate negative controls, and the yellow diamond indicates the outcome of interest: GI bleeds.

In this real‐world example the distribution of negative controls looks nothing like the simulation. Overall, negative controls are positively biased as one would expect; celecoxib tends to be prescribed to a frailer population than diclofenac. But much more pronounced is the large variability in bias, both in positive and negative direction. Figure 3 shows the results of the leave‐one‐out cross‐validation of the negative controls, demonstrating superior calibration for the empirically calibrated p‐value compared with the traditional p‐value. Even though negative controls are very different from one another (negative controls include ‘ingrowing nail’, ‘allergic rhinitis’ and also ‘obsessive compulsive disorder’), we achieve better calibration than when using traditional p‐value computation, suggesting the exchangeability assumption holds within the set of negative controls.

Figure 3

Calibration plot of the crude analysis. For the calibrated p‐value, we employ a leave‐one‐out design with the negative controls and plot the fraction of p‐values greater than nominal type I error rate α versus α.

We have no evidence that these performance characteristics apply for exposure‐outcome pairs outside the set of negative controls. However, given the large heterogeneity in the types of controls, ranging from infectious diseases to mental disorders and accompanying heterogeneity in underlying confounding, it is likely the bias of the hypothesis of interest is represented to some extent by the negative controls. Furthermore, the observed empirical null distribution is far wider than those simulated by Gruber and Tchetgen, making it more likely there is at least some overlap with the bias distribution for the hypotheses of interest. Even though the calibration performance may not be perfect, it is unlikely to be as bad as simulated by Gruber and Tchetgen, where the two distributions were far apart as shown in Figure 1.
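The leave‐one‐out evaluation behind such a calibration plot can be sketched as follows. Each negative control is held out in turn, a Gaussian null is fitted to the remaining controls, and the held‐out estimate is calibrated against it. The sketch then tallies the fraction of controls with p < α, which should be close to α for a well‐calibrated p‐value. The model (estimate_i ~ N(mu, sigma^2 + se_i^2)), the grid‐based fit, the function names and the data are assumptions made for illustration, not the authors' code or data.

```python
import math
from statistics import NormalDist


def fit_null(log_rr, se, sigma_max=3.0, steps=3000):
    """Profile-likelihood fit of N(mu, sigma^2); assumes all se > 0."""
    best = None
    for k in range(steps + 1):
        sigma = sigma_max * k / steps
        var = [sigma ** 2 + s ** 2 for s in se]
        mu = sum(y / v for y, v in zip(log_rr, var)) / sum(1.0 / v for v in var)
        loglik = sum(
            -0.5 * math.log(2 * math.pi * v) - 0.5 * (y - mu) ** 2 / v
            for y, v in zip(log_rr, var)
        )
        if best is None or loglik > best[0]:
            best = (loglik, mu, sigma)
    return best[1], best[2]


def two_sided_p(log_rr, se, mu=0.0, sigma=0.0):
    """Traditional p-value when mu = sigma = 0, calibrated p-value otherwise."""
    z = (log_rr - mu) / math.sqrt(sigma ** 2 + se ** 2)
    return 2 * NormalDist().cdf(-abs(z))


def leave_one_out_p(log_rr, se):
    """Calibrate each negative control using a null fitted to all the others."""
    p_values = []
    for i in range(len(log_rr)):
        mu, sigma = fit_null(log_rr[:i] + log_rr[i + 1:], se[:i] + se[i + 1:])
        p_values.append(two_sided_p(log_rr[i], se[i], mu, sigma))
    return p_values


def fraction_below(p_values, alpha):
    """Fraction of p-values smaller than the nominal type I error rate alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical negative control estimates (log relative risks) and standard errors.
nc_log_rr = [0.1, 0.5, -0.2, 0.7, 0.3, 0.9, 0.0, 0.4, -0.1, 0.6]
nc_se = [0.1] * len(nc_log_rr)

loo_p = leave_one_out_p(nc_log_rr, nc_se)
trad_p = [two_sided_p(y, s) for y, s in zip(nc_log_rr, nc_se)]

frac_calibrated = fraction_below(loo_p, 0.05)
frac_traditional = fraction_below(trad_p, 0.05)
print(f"fraction with p < 0.05: traditional={frac_traditional:.2f}, "
      f"calibrated (leave-one-out)={frac_calibrated:.2f}")
```

Repeating `fraction_below` over a range of α values gives the points of a calibration curve for each type of p‐value.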

Type II error

Gruber and Tchetgen also lament that correcting type I error by p‐value calibration can drastically increase type II error. As noted in our original paper: ‘The method proposed here aims to correct the type I error (erroneously rejecting the null hypothesis) level, most likely at the cost of vastly increasing the number of type II errors (erroneously rejecting the alternative hypothesis).’ [1] There is no free lunch: we cannot have both low type I and type II error in the face of a strongly biased estimator, such as the one depicted in Figure 2. In this plot, it is clear that no matter how large the sample size, no estimate between 0.6 and 2.1 will ever reach statistical significance when using p‐value calibration. Even when the null hypothesis is not true, we will most likely fail to reject it. This trade‐off is not a drawback of the calibration method, but instead exposes a limitation of the observational data and the analysis design being employed. With a less biased estimator, we could achieve both lower type I and type II error. For example, Figure 4 reports the same results as Figure 2 but now using an adjusted analysis using propensity score matching.
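The existence of a band of estimates that can never reach significance follows directly from a Gaussian empirical null. The calibrated test statistic divides by sqrt(sigma^2 + se^2), which cannot fall below sigma however small the standard error becomes. The sketch below computes that band. The values of mu and sigma are hypothetical, picked only so that the band roughly resembles the interval quoted above; they are not the fitted values from the celecoxib versus diclofenac study.

```python
import math
from statistics import NormalDist


def nonsignificant_band(mu, sigma, alpha=0.05):
    """Range of effect estimates that cannot reach calibrated p < alpha.

    With se -> 0 the calibrated z-score tends to (log_rr - mu) / sigma, so
    estimates within exp(mu +/- z * sigma) stay non-significant at any sample size.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.exp(mu - z * sigma), math.exp(mu + z * sigma)


def calibrated_p(log_rr, se, mu, sigma):
    """Two-sided p-value against a Gaussian empirical null N(mu, sigma^2)."""
    z = (log_rr - mu) / math.sqrt(sigma ** 2 + se ** 2)
    return 2 * NormalDist().cdf(-abs(z))


# Hypothetical empirical null on the log scale (illustrative values only).
mu, sigma = 0.12, 0.32
lower, upper = nonsignificant_band(mu, sigma)

# Even with a vanishingly small standard error, an estimate inside the band
# is not significant, while one outside the band is.
p_inside = calibrated_p(math.log(1.5), 1e-6, mu, sigma)
p_outside = calibrated_p(math.log(2.5), 1e-6, mu, sigma)
print(f"band: {lower:.2f} to {upper:.2f}; "
      f"p inside={p_inside:.3f}, p outside={p_outside:.3f}")
```

As sigma shrinks toward zero the band collapses, which matches the observation that a less biased design recovers power.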

Figure 4

Estimates from a propensity‐score matched comparison between celecoxib and diclofenac users. Estimates below the dashed line (gray area) have p < 0.05 using traditional p‐value calculation. Estimates in the orange areas have p < 0.05 using the calibrated p‐value calculation. Blue dots indicate negative controls, and the yellow diamond indicates the outcome of interest: GI bleeds.

In this analysis, there is little bias, and the calibrated p‐value does not differ much from the traditional p‐value. Figure 5 shows that in a leave‐one‐out validation, both p‐values are indeed comparable. It is worthwhile to note that without examining negative controls in the crude analysis, we would not have unearthed the limitations of that design.

Figure 5

Calibration plot of the adjusted analysis. For the calibrated p‐value, a leave‐one‐out design was used.

Discussion

Gruber and Tchetgen challenge the assumption of exchangeability, arguing that newly marketed drugs may have inherently different bias than exposure controls, which are typically older drugs. However, in their simulations, they underestimate the variability in the bias observed for negative controls in the real world. Given the width of the estimated empirical null distribution, it is likely that the bias of the hypothesis of interest is included in this distribution. Therefore, even if the exchangeability assumption is violated, and Gruber and Tchetgen have provided no evidence that this is the case, this most likely would mean the calibrated p‐value would err on the conservative side, leading to larger p‐values and making it harder to reject the null. As the mean and standard deviation of the empirical null distribution approach zero, as can be seen in Figure 4, we interpret this as meaning that the method is becoming insensitive to confounding in the data. A consequence is that the calibrated p‐value is becoming less sensitive to the exchangeability assumption. In other words, if large bias and variability in bias are observed for the negative controls, one should be aware there is a possibility one's calibrated p‐values are too conservative. If small or no bias is observed, we should feel confident the calibrated p‐value is approximately correct. It should also be noted that Gruber and Tchetgen's criticism only applies to exposure controls for new drugs, in which case one could elect to use outcome controls instead.

We fully agree with Gruber and Tchetgen that empirical calibration can lead to a large increase in type II error when using a strongly biased estimator. It seems Gruber and Tchetgen suggest that if one is truly concerned with only type II error, one should forgo calibration and accept that the p‐value does not represent any meaningful statistic. But we think that in this case a better approach would be to still use calibrated p‐values and increase one's nominal α to decrease type II error. In that case, at least the increase in type I error would be quantifiable. We argue that the best solution is to use designs that have small to no error, as achieved here in this example using a propensity score adjusted new‐user cohort design. For such designs the empirical calibration has little effect, and both type I and type II error are, in some sense, optimal.

We do not agree with Gruber and Tchetgen that our recommendation that ‘observational studies always include negative controls to derive an empirical null distribution and use these to compute calibrated p‐values’ is premature. Our findings so far suggest that the use of negative controls is a good way to evaluate the bias inherent in a study design, as demonstrated here by comparing a crude and adjusted analysis. The leave‐one‐out evaluations show that the calibrated p‐values provide superior (in the case of large bias) to equal (in the case of almost no bias) calibration properties compared with traditional p‐values. We also do not see viable alternatives. For example, the control outcome calibration approach proposed by Gruber and Tchetgen requires a strong exchangeability assumption that the structure and magnitude of the bias is exactly the same for the control and hypothesis of interest. Even if this assumption were to hold true, it remains unknowable to the researcher.

Given the large sample size in most observational studies, even small bias can ‘distort the p‐values beyond repair’ [4]. Interpretation of results of observational studies is impossible without quantifying the bias, and we currently see empirical calibration as the only viable means to do this. We believe empirical calibration should always be used and presented in observational studies. To facilitate this, we have created an R package called EmpiricalCalibration (available from the Comprehensive R Archive Network (CRAN)) and are working on a tool to quickly identify candidate negative controls.

Conflicts of interest and source of funding

Drs Schuemie and Ryan are employees of Janssen Research and Development, LLC. Drs Madigan and Suchard are partially supported by National Science Foundation grant IIS 1251151.

5 in total

1. Empirical performance of a new user cohort method: lessons for developing a risk identification and analysis system.

Authors: Patrick B Ryan; Martijn J Schuemie; Susan Gruber; Ivan Zorych; David Madigan
Journal: Drug Saf Date: 2013-10 Impact factor: 5.606

2. Limitations of empirical calibration of p-values using observational data.

Authors: Susan Gruber; Eric Tchetgen Tchetgen
Journal: Stat Med Date: 2016-03-10 Impact factor: 2.373

3. Interpreting observational studies: why empirical calibration is needed to correct p-values.

Authors: Martijn J Schuemie; Patrick B Ryan; William DuMouchel; Marc A Suchard; David Madigan
Journal: Stat Med Date: 2013-07-30 Impact factor: 2.373

4. Robust empirical calibration of p-values using observational data.

Authors: Martijn J Schuemie; George Hripcsak; Patrick B Ryan; David Madigan; Marc A Suchard
Journal: Stat Med Date: 2016-09-30 Impact factor: 2.373

5. p-Curve and p-Hacking in Observational Research.

Authors: Stephan B Bruns; John P A Ioannidis
Journal: PLoS One Date: 2016-02-17 Impact factor: 3.240
