Correcting AUC for Measurement Error.

Bernard Rosner^1,2, Shelley Tworoger^1,3, Weiliang Qiu¹.

Abstract

Keywords: AUC; Biomarkers; Non-normal distributions

Year: 2015 PMID: 28458954 PMCID: PMC5409172 DOI: 10.4172/2155-6180.1000270

Source DB: PubMed Journal: J Biom Biostat

Introduction

Diagnostic biomarkers are used frequently in epidemiologic and clinical work. The ability of a diagnostic biomarker to discriminate between subjects who develop disease (cases) and subjects who do not (controls) is often measured by the area under the receiver operating characteristic curve (AUC). The AUC can be interpreted as $\mathrm{AUC} = \Pr(X_{obs} > Y_{obs})$, where $X_{obs}$ is the value of the diagnostic biomarker for a randomly selected case and $Y_{obs}$ is the value of the diagnostic biomarker for a randomly selected control. AUC takes values between 0.5 and 1. AUC close to 0.5 indicates no diagnostic accuracy; AUC close to 1.0 indicates high diagnostic accuracy. Under the normality assumption that $X_i \sim N(\mu_X, \sigma_X^2)$ and $Y_j \sim N(\mu_Y, \sigma_Y^2)$, where $X_i$ and $Y_j$, i = 1,…,m, j = 1,…,n, are all independent, AUC is calculated as [1]: $\mathrm{AUC} = \Phi(\delta)$, where $\delta = (\mu_X - \mu_Y)/\sqrt{\sigma_X^2 + \sigma_Y^2}$ and Φ is the CDF of the standard normal distribution.

It is extensively documented in the medical literature that diagnostic biomarkers may be subject to errors of measurement [2], which may be attributed to variation in performance of laboratory equipment, variation between technicians, temporal changes, biologic variability, etc. It has been reported [1,2] that ignoring measurement error can cause biased estimation of AUC. In many cases, the biases can result in misleading interpretation of the efficacy of a diagnostic biomarker [3]. For example, not adjusting for measurement error can result in useful diagnostic biomarkers being overlooked. In general, an increase in measurement error moves the receiver operating characteristic (ROC) curve towards the diagonal (non-informative) line, and the value of the AUC is decreased [4,5]. The biases of estimators usually can be corrected by resampling methods (e.g., jackknife or bootstrap). However, resampling methods are not appropriate when biases are caused by non-sampling errors, such as measurement error [2]. Several methods [1-3,6] have been proposed in the literature to correct estimates of the AUC for measurement error.
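The two readings of the AUC given above, the probability that a case scores higher than a control and the closed form Φ(δ) under normality, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `mann_whitney_auc` and `binormal_auc` are hypothetical, and counting ties as one half is a common convention assumed here rather than taken from the source.

```python
from bisect import bisect_left, bisect_right
from math import sqrt
from statistics import NormalDist


def mann_whitney_auc(cases, controls):
    """Estimate Pr(X > Y) + 0.5 * Pr(X = Y) from two samples.

    Ties are counted as one half (an assumed convention).
    """
    sorted_controls = sorted(controls)
    total = 0.0
    for x in cases:
        below = bisect_left(sorted_controls, x)          # controls strictly below x
        tied = bisect_right(sorted_controls, x) - below  # controls equal to x
        total += below + 0.5 * tied
    return total / (len(cases) * len(controls))


def binormal_auc(mu_x, mu_y, var_x, var_y):
    """AUC = Phi(delta) for independent normal cases and controls."""
    delta = (mu_x - mu_y) / sqrt(var_x + var_y)
    return NormalDist().cdf(delta)
```

The rank-based estimate needs no distributional assumption, whereas the closed form holds only when both groups are normally distributed.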
Coffin and Sukhatme [1] and Coffin and Sukhatme [2] assumed the following measurement error model: $X_{i,obs} = X_{i,true} + \varepsilon_i$ and $Y_{j,obs} = Y_{j,true} + \xi_j$, with $X_{i,true} \sim F_{X,true}(\mu_X, \sigma_X^2)$, $Y_{j,true} \sim F_{Y,true}(\mu_Y, \sigma_Y^2)$, $\varepsilon_i \sim F_{\varepsilon}(0, \sigma_\varepsilon^2)$, and $\xi_j \sim F_{\xi}(0, \sigma_\xi^2)$, where F(a,b) is a cumulative distribution function (CDF) with mean a and variance b, and $X_{i,true}$, $Y_{j,true}$, $\varepsilon_i$, and $\xi_j$, i = 1,…,m, j = 1,…,n, are mutually independent. Coffin and Sukhatme [1] assumed that $F_{X,true}$, $F_{Y,true}$, $F_{\varepsilon}$, and $F_{\xi}$ are CDFs from an exponential family, derived an approximate bias C of the observed AUC due to measurement error, and then obtained the corrected AUC (AUCcorrected) by adding this bias term to the observed AUC (AUCobs). Coffin and Sukhatme's [1] Monte Carlo simulation studies showed that the bias of AUCcorrected is generally an order of magnitude smaller than the bias of AUCobs. The corrected estimate AUCcorrected also has mean square error (MSE) comparable to that of AUCobs. Coffin and Sukhatme [2] noted that the AUC estimated by the Mann-Whitney U statistic is also subject to measurement error. Paralleling Coffin and Sukhatme [1], Coffin and Sukhatme [2] used a non-parametric approach to derive an approximate bias C for the AUC estimated by the Mann-Whitney U statistic. The simulation studies in Coffin and Sukhatme [2] showed that for several families of distributions (normal, gamma, or t distributions), the bias-corrected AUC has much smaller bias than, and comparable MSE to, the AUC estimated by the Mann-Whitney U statistic. Faraggi [3] derived an exact relationship between the observed AUC and the true AUC by assuming that $F_{X,true}$, $F_{Y,true}$, $F_{\varepsilon}$, and $F_{\xi}$ are CDFs of normal distributions and by assuming equal variance (i.e., $\sigma_X^2 = \sigma_Y^2 = \sigma^2$ and $\sigma_\varepsilon^2 = \sigma_\xi^2 = \sigma_e^2$), whereby $\mathrm{AUC}_{true} = \Phi\{\Phi^{-1}(\mathrm{AUC}_{obs})\sqrt{1 + \theta^2}\}$, where $\theta^2 = \sigma_e^2/\sigma^2$. Faraggi [3] also derived a 95% confidence interval (CI) for AUCtrue when θ2 is known.
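To illustrate the equal-variance normal case attributed to Faraggi [3], the sketch below de-attenuates an observed AUC when θ2 is known. It assumes the relationship AUCtrue = Φ{Φ⁻¹(AUCobs)·√(1 + θ2)}, with θ2 the ratio of measurement error variance to true between-subject variance; that form is derived here from the stated model (normal true values and errors, equal variances) rather than quoted from the source, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def correct_auc_equal_variance(auc_obs, theta2):
    """De-attenuate an observed AUC under normality and equal variances.

    theta2 is the assumed-known ratio of error variance to true variance.
    """
    if not 0.0 < auc_obs < 1.0:
        raise ValueError("auc_obs must lie strictly between 0 and 1")
    if theta2 < 0.0:
        raise ValueError("theta2 must be non-negative")
    delta_obs = _STD_NORMAL.inv_cdf(auc_obs)       # observed AUC on the probit scale
    return _STD_NORMAL.cdf(delta_obs * sqrt(1.0 + theta2))
```

With θ2 = 0 the observed AUC is returned unchanged, and an observed AUC of 0.5 stays at 0.5 for any θ2, matching the statement that measurement error pulls the ROC curve towards the diagonal.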
Faraggi [3] showed numerically that not taking measurement error into account can give seriously misleading results that understate the diagnostic effectiveness (i.e., the coverage probability of the unadjusted confidence interval can be far from its nominal value when measurement error is present). The method proposed by Faraggi [3] requires that the ratio θ2 of intra-individual to inter-individual variation be accurately known (e.g., based on prior experience). If θ2 is unknown, either repeated measurements or an external validation study is required to estimate θ2. Reiser [6] generalized the formula for θ2 by allowing different variances and provided an estimate of θ2 based on repeated measurements $X_{ik,obs}$ and $Y_{j\ell,obs}$, where the subscripts k and ℓ indicate the k-th and ℓ-th replicates for the i-th case and the j-th control, respectively. The measurement error model that Reiser [6] assumed is $X_{ik,obs} = X_{i,true} + \varepsilon_{ik}$ and $Y_{j\ell,obs} = Y_{j,true} + \xi_{j\ell}$, where $X_{i,true} \sim N(\mu_X, \sigma_X^2)$, $Y_{j,true} \sim N(\mu_Y, \sigma_Y^2)$, $\varepsilon_{ik} \sim N(0, \sigma_\varepsilon^2)$, $\xi_{j\ell} \sim N(0, \sigma_\xi^2)$, and $X_{i,true}$, $Y_{j,true}$, $\varepsilon_{ik}$, and $\xi_{j\ell}$, i = 1,…,m, j = 1,…,n, with k and ℓ indexing the replicates, are mutually independent. Under this model, it follows that $X_{ik,obs} \sim N(\mu_X, \sigma_X^2 + \sigma_\varepsilon^2)$ and $Y_{j\ell,obs} \sim N(\mu_Y, \sigma_Y^2 + \sigma_\xi^2)$. The relationship between AUCtrue and AUCobs again has the form $\mathrm{AUC}_{true} = \Phi\{\Phi^{-1}(\mathrm{AUC}_{obs})\sqrt{1 + \theta^2}\}$, where $\theta^2 = (\sigma_\varepsilon^2 + \sigma_\xi^2)/(\sigma_X^2 + \sigma_Y^2)$. Reiser [6] used the delta method to obtain the approximate variance of the estimate δ̂, then obtained the 95% CI for δ and AUC = Φ(δ). Li et al. [7] provided an alternative method to obtain the variance of the estimate δ̂ by using the method of variance estimates recovery (MOVER), which allows the variance estimate to change with the underlying parameter values. Schisterman et al. [4] proposed an AUC correction method for the case where no repeated measurements are available, but an external validation data set is available. In addition to the normality assumption, Schisterman et al. [4] assumed that the distributions in the external validation data set are the same as those in the main study. The method of Li et al. [7] can also be used for the case where an external validation data set is available. Tosteson et al. [5] extended the additive measurement error model
by assuming that $F_{X,true}$ and $F_{Y,true}$ are CDFs of normal distributions, but the error terms ε and ξ have non-normal distributions. They derived the measurement error correction for sensitivity, specificity, and sensitivity at a given value of specificity, but not for AUC. Most of the aforementioned AUC measurement error correction methods require the normality assumption. However, the normality assumption is often violated in real data analysis. Some of these methods assumed the location-shift hypothesis: $F_{X,true}(x) = F_{Y,true}(x - \eta)$ for η ≠ 0, where $F_{X,true}$ and $F_{Y,true}$ are the cumulative distribution functions of the biomarker for cases and controls, respectively. The location-shift hypothesis is reasonable for symmetric distributions, but may not be ideal for skewed distributions, as the mean is no longer a good summary of the distribution center. In this paper, we aim to extend the method of Reiser [6] by relaxing the normality assumption. The paper is arranged as follows: In Section 2, we first present a measurement-error-correction method for AUC under the probit-shift hypothesis without requiring the normality assumption. We then construct confidence intervals for the corrected AUC. In Section 3, we present a simulation study. In Section 4, we present results from data analysis of a real example based on the Swiss Analgesic Study. Section 5 is a discussion.

Methods

AUC for non-normally distributed diagnostic biomarkers measured without error

We first consider how to handle the non-normality for a diagnostic biomarker M measured without error. We propose a probit-shift model $F_X(x) = \Phi\{\Phi^{-1}[F_Y(x)] - \mu\}$, or equivalently $\Phi^{-1}\{F_X(x)\} = \Phi^{-1}\{F_Y(x)\} - \mu$, where $F_X$ and $F_Y$ are the CDFs of the biomarker for cases and controls, respectively, and Φ is the CDF of the standard normal distribution. That is, after probit transformations, the distributions of cases and controls satisfy the location-shift property. Thus, the AUC is a function of μ. If we let w = H(x) ≡ Φ−1{F(x)} then based on (??) it follows that We can use a first order Taylor series approximation to approximate the above integration (cf. Online Supplementary Document Section A, Equation A1) and obtain:
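The probit transformation H(x) = Φ⁻¹{F(x)} can be sketched with an empirical CDF standing in for the unknown F. Several choices below are assumptions made for illustration only: the reference sample that defines F (which distribution the source uses at each step is not recoverable here), the plotting position (count + 0.5)/(n + 1) that keeps the argument of Φ⁻¹ strictly inside (0, 1), and the function name `probit_scores`.

```python
from bisect import bisect_right
from statistics import NormalDist


def probit_scores(values, reference):
    """Map each value to Phi^{-1}(F_hat(value)).

    F_hat is an empirical CDF built from `reference`, shifted to
    (count + 0.5) / (n + 1) so that it never equals 0 or 1.
    """
    std_normal = NormalDist()
    sorted_ref = sorted(reference)
    n = len(sorted_ref)
    scores = []
    for v in values:
        count = bisect_right(sorted_ref, v)   # reference values <= v
        scores.append(std_normal.inv_cdf((count + 0.5) / (n + 1)))
    return scores
```

Because the transformation is monotone, it preserves ranks, which is why a rank-based AUC estimate is unchanged by it.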

AUC for non-normally distributed diagnostic biomarkers measured with error

We assume the following measurement error model for probit-transformed data: where H (z) = Φ−1 {F (z)}, H (z) = Φ−1 {F (z)}, e is independent of H, and H and H are defined similarly. F (z), F (z), F (z), and F (z) are the cumulative distribution functions of the underlying true/observed values of the diagnostic biomarker M, respectively. We assume that e and e are independent. To derive the relationship between the true AUC and the observed AUC, we first consider the conditional observed AUC: Note that Hence, Note that Thus, Upon integration and use of the delta method (cf. Online Supplementary Document Equation (A7)) or equivalently, based on Online Supplementary Document Equation (A9), where ICC and ICC are intra-class correlations. We assume there exists at least one replicated observation for each subject in the data set or in a subset of the data set and that the replicates are distinguishable, so that we can determine unique probit scales for each subject and each replicate and then estimate the intra-class correlations ICC and ICC by using the variance components from a one-way ANOVA. We used the function ICCest of the package ICC [8] from the statistical software R [9] to calculate ICCs. Furthermore, because the probit transformation is a rank-invariant transformation, we can use the Mann-Whitney statistic to estimate AUC(μ) [10] (cf. Formula A13 in the Online Supplementary Document Section D.1). When we estimated AUC (μ), only the data in the main study were used (replicates were not used). Replicates were used only to estimate ICCs. The relationship (??) between AUC (μ) and AUC (μ) provides a method to correct measurement error for the observed AUC(μ). Hence, we also refer to AUC (μ) as the corrected AUC and denote it as AUCcorrected.
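The intra-class correlation from one-way ANOVA variance components can be sketched as below. This is not the ICCest function from the R package ICC; it is a minimal stand-in that assumes a balanced design (every subject has the same number k of replicates) and uses the textbook estimator (MSB − MSW) / (MSB + (k − 1)·MSW). The function name `icc_oneway` is hypothetical.

```python
from statistics import mean


def icc_oneway(replicates):
    """One-way ANOVA ICC for a balanced design.

    `replicates` is a list of per-subject lists, each holding the same
    number k >= 2 of (probit-scale) replicate measurements.
    """
    n = len(replicates)
    k = len(replicates[0])
    if n < 2 or k < 2 or any(len(r) != k for r in replicates):
        raise ValueError("need >= 2 subjects with the same k >= 2 replicates")
    grand_mean = mean(v for r in replicates for v in r)
    subject_means = [mean(r) for r in replicates]
    # Between-subject and within-subject mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum(
        (v - m) ** 2 for r, m in zip(replicates, subject_means) for v in r
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

In the setting described above, one such ICC would be computed for cases and one for controls, using the replicate data only.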

Confidence limits for AUC (μ)

We use the delta method to derive the variance of the true AUC. Denote We have An approximate 100(1−α)% CI for AUC (μ) is given by {Φ(c1), Φ(c2)}, where The detailed derivations of c1 and c2 are shown in the Online Supplementary Document Sections C and D.

A Simulation Study

To evaluate the performance of the proposed AUC estimate ÂUC (μ̂) that corrects for measurement error, we conducted 3 simulation studies. In each simulation study, we generated 1000 simulated data sets, each of which contains 100 cases and 100 controls. We then ran each simulation study 100 times to obtain the mean performance measure over the 100 simulations and to estimate the 95% confidence interval (CI) of the performance measures, such as bias, mean square error (MSE), and coverage. We also compared the performance of AUCcorrected in (??) with that proposed by Reiser [6] in equation (??). Both methods require the availability of replicate observations.

Simulation model I

In the first simulation study, we assumed that there are replicate observations for each subject and generated simulated data using Reiser's [6] model (c.f. Formula (??)). That is, X and e were generated from normal distributions. To generate replicates, we generated another set of error terms e and e, but kept the values of true observations X, Y, so that the 2 observations for the same subject would be dependent.
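The replicate-generating scheme described above (fresh error terms, same true values) can be sketched as follows. The unit variance for true values, the error variance passed in by the caller, and all function names are assumptions for illustration; the source's actual variance settings are not recoverable from the text.

```python
import random
from bisect import bisect_left
from math import sqrt


def simulate_model_one(n_cases, n_controls, mu_x, error_var, n_reps=2, seed=0):
    """Normal true values plus normal errors, with replicates per subject.

    True values have unit variance and controls have mean 0 (assumed).
    Replicates of one subject share the true value, so they are dependent.
    """
    rng = random.Random(seed)
    sd_error = sqrt(error_var)

    def one_group(size, mu):
        true_vals = [rng.gauss(mu, 1.0) for _ in range(size)]
        observed = [
            [t + rng.gauss(0.0, sd_error) for _ in range(n_reps)]
            for t in true_vals
        ]
        return true_vals, observed

    cases_true, cases_obs = one_group(n_cases, mu_x)
    controls_true, controls_obs = one_group(n_controls, 0.0)
    return cases_true, cases_obs, controls_true, controls_obs


def rank_auc(cases, controls):
    """Fraction of (case, control) pairs with case > control (no tie handling)."""
    sorted_controls = sorted(controls)
    wins = sum(bisect_left(sorted_controls, x) for x in cases)
    return wins / (len(cases) * len(controls))
```

Comparing `rank_auc` on the true values with `rank_auc` on the first replicate of each subject shows the attenuation that the correction methods aim to undo.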

Simulation model II

In the second simulation study, we assumed that X and Y were from log-normal distributions, while the error terms ε and ε were from normal distributions: To generate replicates, we generated another set of error terms ε and ε, but kept the values of true observations X, Y, so that the 2 observations for the same subject would be dependent.

Simulation model III

In the third simulation study, we assumed that X and Y and ε were all from log-normal distributions: To generate replicates, we generated another set of error terms ε and ε, but kept the values of true observations X so that the 2 observations for the same subject would be dependent.

Parameter settings

For Simulation Model I, the true AUC value is AUC = Φ(δ), where $\delta = (\mu_X - \mu_Y)/\sqrt{\sigma_X^2 + \sigma_Y^2}$. We set m = n = 100, m = n = 2, , μ_Y = 0, and μ_X = 0.25, 0.5, or 1. For Simulation Models II and III, we can show that (cf. Online Supplementary Document Section E) . We set m = n = 100, λ = 0, , μ_Y = 0, and μ_X = 0.25, 0.5, or 1. For Simulation Models I, II, and III, the true AUC values are 0.57 (for μ = μ_X − μ_Y = 0.25), 0.64 (for μ = μ_X − μ_Y = 0.5), and 0.76 (for μ = μ_X − μ_Y = 1), respectively. To evaluate the effects of sample size and unequal variance on the performance of the three methods, we also performed an additional set of simulations with m = n = 50 and and the same set of other parameters as above. To further evaluate the effect of the value of (i.e., the degree of measurement error), we performed another set of simulations with m = n = 50, , and .
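As a quick arithmetic check of the true AUC values quoted above, the sketch below evaluates Φ(δ) with δ = (μ_X − μ_Y)/√(σ_X² + σ_Y²). The unit variances are an assumption: the variance settings did not survive in the text, but unit variances give values that round to the quoted 0.57, 0.64, and 0.76. The function name `true_auc` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def true_auc(mu_x, mu_y=0.0, var_x=1.0, var_y=1.0):
    """Phi(delta) with delta = (mu_x - mu_y) / sqrt(var_x + var_y).

    Default variances of 1 are assumed, not taken from the source.
    """
    return NormalDist().cdf((mu_x - mu_y) / sqrt(var_x + var_y))


for mu_x in (0.25, 0.5, 1.0):
    print(f"mu_X = {mu_x}: AUC = {true_auc(mu_x):.3f}")
```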

Results of simulation studies

Tables 1-3 and online supplementary Figure 1 summarize the results of the three simulation studies. We observed that (1) the observed (i.e., uncorrected) AUC estimates AUCobs underestimated the true AUC for all 9 scenarios (i.e., the estimated biases were negative and the estimated coverages were less than the nominal value 0.95); (2) the MSE of AUCobs was much larger than those of the proposed method and Reiser's method when μ = 1; (3) as the value of μ increased, the absolute bias and MSE generally increased for all 3 types of AUC estimates; (4) for Simulation Study I (i.e., data were generated under Reiser's model), the probit method had similar performance to Reiser's method; (5) for Simulation Studies II and III (i.e., data were generated from non-normal distributions), the coverages estimated by the proposed method were close to the nominal value 0.95, while the coverages of the uncorrected AUC and the coverages of the corrected AUC estimated by Reiser's method were smaller than the nominal value, especially when the value of μ was large; (6) for Simulation Studies II and III, the proposed method had much smaller absolute bias than the other two methods.

Table 1

Bias, mean square error (MSE), and coverage for AUC (μ) from simulation I **.

λ	μ_Y	μ_X	AUC_true		MW*	R*	P*
0	0	0.25	0.570	Bias (×10³) 95% CI	-13 (-17, -9)	0 (-4, 4)	0 (-4, 5)
				MSE (×10⁴) 95% CI	18 (16, 20)	19 (18, 22)	25 (22, 28)
				Coverage (%) 95% CI	93.7 (91.8, 95.6)	95.1 (93.4, 96.9)	94.9 (93.0, 96.8)
0	0	0.5	0.638	Bias (×10³) 95% CI	-25 (-28, -21)	0 (-4, 4)	0 (-4, 5)
				MSE (×10⁴) 95% CI	22 (19, 24)	18 (16, 20)	23 (20, 26)
				Coverage (%) 95% CI	90.3 (87.6, 93.0)	95.2 (93.5, 96.8)	94.8 (92.9, 96.7)
0	0	1	0.760	Bias (×10³) 95% CI	-42 (-45, -39)	0 (-4, 3)	0 (-3, 4)
				MSE (×10⁴) 95% CI	31 (27, 34)	14 (12, 16)	17 (15, 20)
				Coverage (%) 95% CI	76.6 (73.0, 80.2)	95.2 (93.3, 97.0)	94.4 (92.4, 96.5)

* MW: Mann-Whitney estimate (i.e., AUC); R: Reiser's (2000) method; P: Probit method.

** Simulation I was run 100 times. Each time, we generated 1000 simulated data sets. Each data set consists of 100 cases and 100 controls. Each subject provides two replicate biomarker scores. Both true values and random errors are assumed to come from normal distributions with .

Table 3

Bias, mean square error (MSE), and coverage for AUC (μ) from simulation III**.

λ	μ_Y	μ_X	AUC_true		MW*	R*	P*
0	0	0.25	0.570	Bias (×10³) 95% CI	-26 (-29, -22)	-14 (-19, -10)	0 (-6, 5)
				MSE (×10⁴) 95% CI	23 (20, 26)	30 (26, 34)	41 (35, 46)
				Coverage (%) 95% CI	90.3 (87.5, 93.1)	94.5 (92.6, 96.5)	95.3 (93.5, 97.1)
0	0	0.5	0.638	Bias (×10³) 95% CI	-53 (-56, -51)	-31 (-34, -28)	2 (-2, 7)
				MSE (×10⁴) 95% CI	45 (40, 49)	38 (33, 43)	47 (40, 53)
				Coverage (%) 95% CI	72.5 (68.3, 76.8)	91.4 (88.8, 93.9)	95.7 (93.9, 97.6)
0	0	1	0.760	Bias (×10³) 95% CI	-111 (-114, -108)	-72 (-76, -67)	17 (11, 24)
				MSE (×10⁴) 95% CI	138 (131, 146)	83 (75, 90)	67 (58, 75)
				Coverage (%) 95% CI	12.9 (10.3, 15.6)	66.1 (61.8, 70.5)	96.9 (95.2, 98.7)

* MW: Mann-Whitney estimate (i.e., AUC); R: Reiser's (2000) method; P: Probit method.

** Simulation III was run 100 times. Each time, we generated 1000 simulated data sets. Each data set consists of 100 cases and 100 controls. Each subject provides two replicate biomarker scores. Both true values and random errors were generated from log-normal distributions with .

Tables S1, S2, and S3 in the online Supplementary Documents show the results for the simulations with smaller sample size m = n = 50 and with unequal variance and . The results are similar to those shown in Tables 1-3. If the degree of measurement error as characterized by θ2 is large (θ2 = 3, say), the bias of the probit method is smaller than that of the other two approaches. However, the coverages of Reiser's method and the probit-shift method tend to be somewhat larger than the nominal level 0.95 (cf. Tables S4, S5, S6 in the online Supplementary Documents).

Examples

In this section, we used a real data set (the Swiss Analgesic Study data set) to evaluate the performance of the proposed measurement error correction method for AUC estimation. Data collection for the Swiss Analgesic Study began in 1967/1968 [11]. A total of 1244 Swiss women participated in this study, the purpose of which was to evaluate the association of the use of phenacetin-containing analgesics with kidney function. NAPAP is a biomarker associated with recent use of phenacetin-containing analgesics. The NAPAP value was measured in a urine sample at the baseline clinic visit. There were additional follow-up collections of NAPAP values at home on 2 separate days within 1 week of the baseline clinic visit. In addition, serum creatinine was measured at the baseline clinic visit. We wish to investigate whether excessive recent intake of phenacetin-containing analgesics, as determined by the urinary NAPAP level, can be used as a screening test for identifying subjects with abnormal kidney function, as determined by elevated serum creatinine. For this purpose, we dichotomized the baseline serum creatinine level. If a woman had elevated baseline serum creatinine (i.e., serum creatinine ≥ 1.5 mg/dL), she was classified as a case; otherwise she was classified as a control. There were 1081 controls, 128 cases, and 35 subjects missing baseline serum creatinine. In the analysis, the 1209 women without missing values were used. We would like to assess whether NAPAP values could be used to discriminate between cases and controls. The AUC based on the NAPAP values measured at the clinic visit was used to measure the discrimination ability of the NAPAP assay. The 3 replicates were used to calculate ICC values. By examining the histograms of the NAPAP values for cases and controls, we found that the distribution of the NAPAP value is quite skewed in both cases and controls in all 3 measurements (Figure 1). Hence, the normality assumption was violated.

Figure 1

Histograms of the NAPAP values. The upper panel: cases (left) and controls (right) measured at the clinic visit; The middle panel: cases (left) and controls (right) measured at the first home collection; The bottom panel: cases (left) and controls (right) measured at the second home collection.

The estimated AUC and its 95% confidence interval (CI) are summarized in Table 4. The estimated AUC based on the Mann-Whitney U statistic (i.e., the uncorrected estimate of AUC) was 0.589 with 95% CI [0.537, 0.640]. The corrected AUC estimate based on Reiser's [6] method was 0.611 with 95% CI [0.557, 0.663]. The corrected AUC estimate based on the probit-shift method was 0.618 with 95% CI [0.549, 0.684]. In this example, 1193 women had replicated observations; the estimated ICC based on probit-transformed data was 0.648 for cases and 0.498 for controls. Hence, the corrected AUC is similar for Reiser's method and the probit-shift method, but the confidence limits are wider for the latter.

Table 4

Estimate of AUC and its 95% confidence interval for the NAPAP data.

	MW	R	P
ÂUC_true [95% CI]	0.589 [0.537, 0.640]	0.611 [0.557, 0.663]	0.618 [0.549, 0.684]

Discussion

In this article, we presented a method to correct AUC for measurement error without assuming normally distributed diagnostic biomarkers. Instead, we use the probit transformation to create a transformed diagnostic biomarker, which on the probit scale is approximately normally distributed. To implement our approach, one needs replicate data on at least a subsample of subjects to compute the intraclass correlation. The replicates should be close enough in time that the underlying mean diagnostic biomarker level can be assumed to be the same. Simulation studies support the validity of the methods based on moderate-sized samples of 100 cases and 100 controls. The simulation studies demonstrated that failing to correct for measurement error results in AUC estimates biased toward the null value (0.5). Under the normality assumption, the proposed method has performance similar to that of Reiser's method, which requires the normality assumption in measurement error modelling. When the normality assumption is violated, the proposed method performed much better than Reiser's method in terms of bias and coverage. The probit-shift model assumes equal variance . In the simulation studies, we evaluated the effects of unequal variance on the performance of the probit-shift model. The results were similar to Tables 1-3 if measurement error, as characterized by θ2, is small. If θ2 > 1, then the probit-shift model still has minimal bias, but has observed coverage greater than nominal coverage. In future work, we will extend the probit-shift model to allow for unequal variances, in which case the probit-shift model would have the following form: where c1 = σ / σ and c2 = (μ − μ) / σ. In the real data analysis, the AUC corrected by the proposed method was similar to the AUC corrected by Reiser's method, although the distributions of the biomarker in both cases and controls were highly skewed.
This is probably because the unknown true AUC is close to the null value 0.5. The three simulation studies also demonstrated this point. That is, when μ is close to 0 or equivalently when AUC is close to 0.5, the 3 AUC estimation methods gave similar results. However, confidence limits are wider with the probit-shift method. An implicit assumption of our approach is that the distribution of diagnostic biomarkers is continuous. If instead, risk is defined based on a limited number of categorical risk factors, then the diagnostic biomarker distribution will be discrete and the assumption that the probit transformation results in a normally distributed scale will only be approximately satisfied and needs to be studied in more detail. It is worth noting that several authors have developed measurement-error-correction approaches for estimating a variety of diagnostic performance measures other than AUC, including sensitivity, specificity, and the Youden index [12]. The probit-shift method may be useful in incorporating the effects of measurement error on these indices in the setting of non-normally distributed diagnostic biomarkers.

Table 2

Bias, mean square error (MSE), and coverage for AUC (μ) from simulation II**.

λ	μ_Y	μ_X	AUC_true		MW*	R*	P*
0	0	0.25	0.570	Bias (×10³) 95% CI	-23 (-26, -19)	-15 (-18, -11)	-4 (-9, 1)
				MSE (×10⁴) 95% CI	22 (19, 25)	20 (17, 22)	32 (28, 36)
				Coverage (%) 95% CI	91.2 (88.5, 93.9)	93.8 (91.6, 95.6)	94.8 (92.8, 96.8)
0	0	0.5	0.638	Bias (×10³) 95% CI	-49 (-52, -45)	-32 (-35, -28)	-7 (-12, -2)
				MSE (×10⁴) 95% CI	39 (35, 44)	26 (23, 29)	36 (31, 41)
				Coverage (%) 95% CI	76.3 (72.1, 80.5)	89.6 (86.9, 92.2)	94.7 (92.7, 96.8)
0	0	1	0.760	Bias (×10³) 95% CI	-104 (-107, -101)	-74 (-77, -70)	2 (-4, 8)
				MSE (×10⁴) 95% CI	122 (115, 130)	69 (64, 74)	53 (46, 61)
				Coverage (%) 95% CI	17.4 (14.6, 20.3)	52.3 (47.6, 56.9)	95.3 (93.1, 97.5)

* MW: Mann-Whitney estimate (i.e., AUC); R: Reiser's (2000) method; P: Probit method.

** Simulation II was run 100 times. Each time, we generated 1000 simulated data sets. Each data set consists of 100 cases and 100 controls. Each subject provides two replicate biomarker scores. True values were generated from log-normal distributions and random errors were generated from normal distributions with .

References

1. D Faraggi. The effect of random measurement error on receiver operating characteristic (ROC) curves. Stat Med. 2000-01-15.

2. B Reiser. Measuring the effectiveness of diagnostic markers in the presence of measurement error through the use of ROC curves. Stat Med. 2000-08-30.

3. E F Schisterman; D Faraggi; B Reiser; M Trevisan. Statistical inference for the area under the receiver operating characteristic curve in the presence of random measurement error. Am J Epidemiol. 2001-07-15.

7. U C Dubach; P S Levy; A Müller. Relationships between regular analgesic intake and urorenal disorders in a working female population of Switzerland. I. Initial results (1968). Am J Epidemiol. 1971-06.

8. J A Hanley; B J McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982-04.

9. Matthew T White; Sharon X Xie. Adjustment for measurement error in evaluating diagnostic biomarkers by using an internal reliability sample. Stat Med. 2013-06-14.