Literature DB >> 25530064

Covariate-adjusted measures of discrimination for survival data.

Abstract

MOTIVATION: Discrimination statistics describe the ability of a survival model to assign higher risks to individuals who experience earlier events: examples are Harrell's C-index and Royston and Sauerbrei's D, which we call the D-index. Prognostic covariates whose distributions are controlled by the study design (e.g. age and sex) influence discrimination and can make it difficult to compare model discrimination between studies. Although covariate adjustment is a standard procedure for quantifying disease-risk factor associations, there are no covariate adjustment methods for discrimination statistics in censored survival data.
OBJECTIVE: To develop extensions of the C-index and D-index that describe the prognostic ability of a model adjusted for one or more covariate(s).
METHOD: We define a covariate-adjusted C-index and D-index for censored survival data, propose several estimators, and investigate their performance in simulation studies and in data from a large individual participant data meta-analysis, the Emerging Risk Factors Collaboration.
RESULTS: The proposed methods perform well in simulations. In the Emerging Risk Factors Collaboration data, the age-adjusted C-index and D-index were substantially smaller than unadjusted values. The study-specific standard deviation of baseline age was strongly associated with the unadjusted C-index and D-index but not significantly associated with the age-adjusted indices.
CONCLUSIONS: The proposed estimators improve meta-analysis comparisons, are easy to implement and give a more meaningful clinical interpretation.

Entities: Chemical

Keywords: C-index; D-index; Discrimination

Mesh：

Year: 2014 PMID： 25530064 PMCID： PMC4666552 DOI： 10.1002/bimj.201400061

Source DB: PubMed Journal: Biom J ISSN： 0323-3847 Impact factor: 2.207

1 Introduction

A fundamental property of a prognostic marker is its ability to discriminate high from low risk patients (Hlatky et al., 2009). Markers that do not improve discrimination are also unlikely to improve other measures of clinical performance (Mihaescu et al., 2010). The discrimination performance of a marker and its incremental value can vary significantly across different studies. Variation beyond chance can be attributed to differences between studies either in the strength of association between the marker and the outcome or in the marker’s distribution (Pepe et al., 2004), or to a range of other possible biases that relate to the conduct and recording in a study (Lijmer et al., 2002). In practice, markers are often used in combination, so interest lies in evaluating the discrimination of a prognostic model including one or more markers together with demographic variables. Other important aspects of prognostic ability include calibration. For a binary outcome, the standard description of discrimination is the receiver operating characteristic (ROC) curve, which displays the trade-off between specificity (the probability that a control marker value is below the cut-off) and sensitivity (the probability that a case marker value is above the cut-off) for different marker cut-offs. The ROC curve is often summarized by the area under the curve (AUC), also called the C-statistic, which can be interpreted as the probability that the marker will correctly classify a randomly chosen pair of patients as case and noncase (Hanley and McNeil, 1982). The AUC ranges from 0 (when all predictions are wrong) and 1 (perfect predictions) with 0.5 representing the average discriminative ability of random predictions. Values below 0.5 are rarely seen other than due to small sample variation. For a survival outcome, many measures of prognostic ability have been proposed (Choodari-Oskooei et al., 2012a, 2012b). The C-statistic can be used to measure discrimination in this setting, taking the binary outcome to be survival to a fixed follow-up time (Chambless and Diao, 2006). The C-index (Harrell et al., 1982) extends the C-statistic and avoids specifying a fixed follow-up time: it estimates the probability that given two randomly drawn patients, the patient who has an event first is predicted a higher risk. Royston and Sauerbrei (2004) proposed an alternative measure, D, which we call the D-index: it is based on a proportional hazards model and has the interpretation of an average log hazard ratio between an individual in the upper half of the risk distribution and an individual in the lower half (Pennells et al., 2014). We use the C-index because it is the most widely used measure in practice (Mallett et al., 2010), and the D-index because it adapts well to the purposes of this paper; the two measures give similar conclusions when used to evaluate the discrimination added by a new marker (Fibrinogen Studies Collaboration, 2009). Covariate adjustment is necessary for correct assessment of disease-risk factor associations in observational studies, but its importance is rarely acknowledged in assessing discrimination. A particular issue is that covariates that form part of the study design such as age and sex can impact substantially on the prognostic ability of a model (Janes and Pepe, 2008; Kerr and Pepe, 2011): for example, the prognostic ability of a cardiovascular risk model (which includes age, sex, clinical covariates, and biomarkers) is likely to be substantially larger in a study that recruited men and women aged 40–80 than in a similar study that only recruited men aged 55–65. Janes and Pepe (2008), writing in the context of ROC curves, identify three cases in which covariates Z influence the discrimination performance of a risk score R: The covariate Z is associated both with R and with disease risk. A common example is when Z is age. Janes and Pepe (2008) showthat stratifying by categoricalZ reduces the discrimination of R. This reduction is greater if R and Z are highly correlated. A different issue arises if the covariates Z are associated with R but not with disease risk. Ignoring this type of covariate effect may underestimate discrimination (Janes and Pepe, 2008). In cardiovascular disease, R might be a function of C-reactive protein, a cardiovascular risk factor, and Z might be acute infection, which strongly raises C-reactive protein. Allowing for Z in the analysis could remove a source of noise in R and hence improve discrimination. The discrimination of R may vary across levels of Z, which is analogous to effect modification. This situation arises if the hazards associated with various levels of Z vary: for example, associations with blood pressure measurements are attenuated with increasing age (Prospective Studies Collaboration, 2002). It also arises when associations remain constant but the spread of the distribution of X varies with Z. Covariate adjustment for measures of discrimination has been tackled in the context of diagnostic tests using ROC curves based on binary outcomes (Janes et al., 2009). However, there are currently no methods to adjust the discrimination performance of prognostic markers for covariates for censored survival data. In this paper, we propose definitions of the C-index and D-index for censored survival data allowing adjustment for one or more binary or continuous covariates, and ways to estimate them. Although we primarily aim to adjust for age and sex, the methods are presented for a general covariate adjustment. The paper is arranged as follows. In Section 2 we describe the data which motivated our methods, and we define the model used to generate risk predictions. In Section 3, we review unadjusted measures of discrimination. In Section 4, we propose definitions of adjusted measures of discrimination, and our new estimators. In Section 5, we examine the performance of the proposed estimators in a simulation study. In Section 6 we apply the proposed methods to data from a large set of epidemiological cohort studies. We conclude in Section 7 with a discussion of our results, recommendations for when our proposed estimators might be appropriate, desirable extensions and limitations.

2 Data

The Emerging Risk Factors Collaboration (ERFC) has collated and harmonized individual participant data from population-based prospective studies of cardiovascular disease (CVD) (Emerging Risk Factors Collaboration, 2007). In May 2011 the data set comprised 1.9 million individuals in 108 studies with an average of 15.5 years follow-up. We used these data to model time to first fatal/nonfatal CVD event, which includes coronary heart disease and stroke. Our main dataset was restricted to prospective cohorts and clinical trials that provide information on Framingham risk factors, that is age, smoking status (current/ex vs. never), systolic blood pressure (SBP), total cholesterol (TC), high-density lipoprotein (HDL) cholesterol and history of diabetes at the baseline survey. Individual participant data were further restricted to subjects aged at least 40 years at baseline with the above risk factors recorded, no known history of CVD at the baseline survey, no recorded history of diabetes, and not known to be under statin treatment. Thus, our analysis data comprised 349,137 individuals from 82 studies (of which eight were clinical trials), of whom 24,369 experienced a CVD event. The model fitted to these data was the Cox proportional hazards model stratified by study and sex. Studies that were randomized trials were additionally stratified by trial arm. Thus, for individual i in stratum s, the hazard at time t is where x is the vector of covariates (age, smoking status, SBP, TC, and HDL cholesterol) for individual i, β is a vector of corresponding regression coefficients (assumed constant across strata), and h0(t) is the baseline hazard at time t for individuals of stratum s. Table 1 summarizes the data from these 82 studies and the fitted Cox model.

Table 1

ERFC data: variable summaries and selected model.

Variable	Mean	Within-	Between-	Fitted model

		studies SD	studies SD	Coef.	Std. Err.
Baseline variables
Age (years)	55.92	7.57	6.78	0.0771	0.0009
Smoking (0 = no, 1 = yes)	0.30	0.43	0.15	0.563	0.014
SBP (mm Hg)	133.30	18.51	7.70	0.0146	0.0003
Total cholesterol (mmol/L)	5.87	1.07	0.46	0.168	0.006
HDL cholesterol (mmol/L)	1.35	0.37	0.15	−0.512	0.020
Sex (0 = male, 1 = female)	0.42	0.40	0.29	(stratifier)
Outcome variables
Follow-up (years)	10.64	3.65	4.87
CVD event (0 = no, 1 = yes)	0.07	0.24	0.07

The within-study distribution of baseline age and sex is determined by a study design and hence differs between studies. This may affect measures of discrimination. Model (1) is stratified by sex, so standard calculations automatically stratify measures of discrimination for sex. We therefore focus on the effect of baseline age. The left-hand panel in Fig. 1 plots the C-index for each study (computed as described in Section 6) against the within-study standard deviation (SD) of baseline age. Studies with more variation in baseline age tend to have substantially larger C-indices. The age-adjusted C-index, introduced in Section 4 below, is plotted in the right-hand panel, and shows no association with variation in baseline age.

Figure 1

ERFC data: unadjusted and age-adjusted C-index for a model including baseline age, smoking, SBP, TC, and HDL, plotted for each study against the SD of baseline age in that study. Analyses are stratified by sex and trial arm. Each point represents one study.

3 Measures of discrimination in the absence of covariate adjustment

3.1 Notation

We initially work in a single dataset of n individuals. For each individual i = 1,…, n, we assume that the model covariates are x (scalar or vector), the true event time (in the absence of censoring) is , and censoring time is c, so the event indicator is and the observed time is . The observed data are (x). Censoring is assumed to be noninformative. We also assume that the risk score is (scalar) r(x), which is typically (but not necessarily) the linear predictor β̂x from fitting a survival model such as h(t) = h0(t) exp (βx) where β are coefficients and h0(t) is the baseline hazard. Our aim is to evaluate the discrimination of r(x).

3.2 C-index

Harrell et al. (1982) defined the C-index as a statistic measuring the degree to which sample pairs are concordant, where concordance occurs if the individual of higher predicted risk has the first event in the pair. This statistic is affected by censoring (see Section 3.2.1). The underlying estimand was defined by Heagerty and Zheng (2005) and Uno et al. (2011) as . Gonen and Heller (2005) instead stated the estimand . These estimands are equivalent in the absence of ties in r(x) (i.e., if all individuals have different values of r(x)). We believe that ties in r(x) are important, since poorly discriminating models may have many ties, so it is important to account for them. Heagerty and Zheng’s C counts pairs tied on r(x) as discordant, while K double-counts them (because they satisfy both r(x) ≥ r(x) and r(x) ≥ r(x)). Instead, we make the natural definition of the C-index as for a random pair (i, j), where 1(a) = 1 if a is true and 0 if a is false. Pairs with tied event times are excluded from all calculations based on C, so that the estimand becomes . For simplicity, however, we ignore tied event times in the notation throughout this article. We now consider various estimators of C in Eq. (2).

3.2.1 Harrell’s estimator

Estimation of C is complicated by the presence of censoring, because we do not know whether for pairs where the first event time is censored. Harrell et al. (1996) proposed estimating C as the mean of C over informative pairs, where pair (i, j) is informative if and d = 1 or and d = 1: that is, if the first event in the pair is observed. Harrell’s estimator is often written as where #concordant counts pairs with and r(x) > r(x), or and r(x) < r(x); #tied counts pairs with r(x) = r(x); and #discordant counts pairs with and r(x) < r(x), or and r(x) > r(x). However, the informative pairs are not representative of all pairs—for example, a pair of low-risk individuals is likely to have no event and hence be noninformative—and this can cause bias in Ĉ (Gonen and Heller, 2005).

3.2.2 Gonen and Heller’s estimator

Gonen and Heller (2005) proposed an alternative estimator to avoid bias due to censoring. To present the idea in greater generality, suppose r*(x) is a linear predictor from a correctly specified proportional hazards model. Then r*(x) − r*(x) represents the log hazard ratio between individuals i and j, and the probability that individuals i and j are concordant is expit {r*(x) − r*(x)} if r(x) > r(x) where expit(η) = 1/(1 + exp (−η)). (Similarly it is expit {r*(x) − r*(x)} if r(x) < r(x), and 0.5 if r(x) = r(x)). Then the estimator is the average of this concordance probability, which can be written as Gonen and Heller (2005) considered the special case r*(x) = r(x) (that is, they assumed that r(x) is a linear predictor from a correctly specified proportional hazards model) giving the simpler expression Ĉ is an indirect measure, since it does not use the event times and relies on correct model specification.

3.2.3 Restricted C-index

Let τ = max be the longest follow-up time observed. The study only gives information about discrimination at time t ≤ τ, and C can only be estimated by (implicitly) extrapolating to times t > τ: for example, Ĉ assumes that the proportional hazards model continues to hold at times beyond τ. To avoid extrapolation, Heagerty and Zheng (2005) proposed the restricted C-index which is estimable without extrapolation in a study with follow-up at least up to time τ. They and Uno et al. (2011) proposed estimators of C to account for censoring before time τ : that of Uno et al. (2011) involves a weighted mean of C over informative pairs, where the weight for pair (i, j) is Ĝ(min (t))−2 and G(t) = P(c ≥ t).

3.3 D-index

Royston and Sauerbrei (2004) proposed a measure, D, which we also call the D-index, with the interpretation of the log hazard ratio between two equal-sized prognostic groups. It is estimated in a two-stage procedure. In stage 1, the values of r(x) are ranked, converted to normal scores, and multiplied by . In stage 2, a proportional hazards regression is performed on the scaled normal scores, and D is the regression coefficient. As in the work of Harrell et al. (1982), the estimand is not immediately clear. A possible estimand is based on pairs: still assuming that r*(x) − r*(x) is the true log hazard ratio between individuals i and j, the estimand D can be defined as the average of this log hazard ratio when i is drawn randomly from the upper half of the risk distribution and j is drawn randomly from the lower half (Pennells et al., 2014): that is, where r̄ is the mean of the r(x). The algorithm above clearly estimates this estimand consistently when the proportional hazards model is correctly specified and r(x) is normally distributed, but it may be biased when r(x) is skewed (Choodari-Oskooei et al., 2012a).

4 Measures of discrimination with covariate adjustment

Let z be covariates, which may or may not form part of x. We aim to evaluate the risk score r(x) while adjusting for the covariates z. Conceptually, we want to estimate C and D if we had a sample with a common value of z, or by restricting attention to pairs with equal values of z. In the ERFC data of Section 2, r(x) is a cardiovascular risk prediction while z is age.

4.1 Adjusted C

4.1.1 Estimand

Covariate adjustment can be defined by considering pairs that match exactly on Z, so that is a z-specific C. For situations where C(z) is roughly constant over z, or where a summary measure of discrimination is required, the z-adjusted C is the natural measure when a risk model is stratified by z, and can be applied more widely. Note that for continuous z with density f(z), pairs matching on z have density proportional to f(z)2, so that It is natural to consider weighting by f(z) rather than f(z)2 in (6), so we also define although other choices of weights are also possible. Of course, C = C if C(z) is constant.

4.1.2 Direct estimation for categorical Z

We describe an estimator as direct (like Ĉ) if it uses actual event times, and indirect (like Ĉ) if instead it uses risks predicted under a model. Direct estimation is tricky with continuous z, as there may be few or no matching pairs (Section 4.1.3). We therefore first consider the case with categorical z. Simple estimators are Ĉ(z) = {Σ(C}/{Σ( 1}, Ĉ = {Σ( C}/{Σ( 1} and Ĉ = {Σ(z)Ĉ(z)}/{Σ(z)}, where f̂(z) = Σ 1. In the presence of censoring, these sums are restricted to informative pairs, and the weighting scheme of Uno et al. (2011) may be used to handle random censoring.

4.1.3 Direct estimation for continuous or multivariate Z

For some methods, it is helpful to decompose where m(x) is uncorrelated with z. This is easily done by fitting a suitable regression for r(x) on z, and defining r̂(z) = E[r(x)|z] as the fitted value and m(x) as the residual. Conceptually, we want to estimate the discrimination that is due to m(x). The methods proposed in this section do not assume that m(x) is independent of z, unlike the methods proposed in Section 4.1.4. We propose direct estimation by plotting C against r̂(z) − r̂(z) or |r̂(z) − r̂(z)|, fitting a suitable model (parametric or nonparametric), and taking as the fitted value at r̂(z) − r̂(z) = 0. In order to automate the procedure, we use a logistic regression of C on (r̂(z) − r̂(z))2 with weights where λ controls the amount of smoothing. is then the inverse logit of the estimated intercept. The procedure is illustrated in Supporting Information Fig. S1. Again, in the presence of censoring, the sums are restricted to informative pairs, and the weighting scheme of Uno et al. (2011) may be used to handle random censoring. We estimate Ĉ using the same logistic regression but with weights w1(z)w2(z) where since this weight approximates 1/f (z) when z ≈ z. Here, f̂(z) might be a kernel estimate of the density of z. The above method is based on C which represents whether two events occur in the order predicted by r(x). An alternative is to explore whether events occur in the order predicted by m(x, z). We define “m-concordance” by replacing conditions r(x) > r(x) etc. in Eq. (2) with m(x) > m(x) etc. Again, fitted values of the mean of at r̂(z) − r̂(z) = 0 give an estimator of Ĉ, which we denote by . Comparing the two estimators and may help to detect an unsuitable value of λ. Too small a value causes bias by giving too much weight to mismatched pairs, while too large a value causes large variance by reducing the effective number of pairs used. We used the ERFC data to compare with (Supporting Information Fig. S2) and to compare their standard errors (Supporting Information Fig. S3), for 0 ≤ λ ≤ 10. Values λ < 1 tended to give large differences between the two estimators, but values in the range 1–10 seemed broadly reasonable: later work uses λ = 3.

4.1.4 Indirect estimation of an approximate estimand

Now we use the correctly specified linear predictor r*(x), which we decompose as r*(x) = m*(x) + r̂*(z) as in (7). Recall that P(C = 1|x x) = expit {r*(x) − r*(x)} when r(x) > r(x), etc. Hence for pairs that match on z, P(C = 1|x x) = expit (m*(x) − m*(x)) when m(x) > m(x), etc. This suggests defining a new estimand C* = C if m(x) is independent of z. Appendix B and the simulation study demonstrate that C* ≠ C in general, but differences are not large. Analogous to (3), we propose the indirect estimator in the correctly specified case m*(x, z) = m(x, z): Like Ĉ, this estimator is unaffected by censoring, but requires correct model specification.

4.1.5 Recalibrating

To be useful in practice, a risk score must be well calibrated. Ideally, this is ensured by recalibrating the model in an external validation set. However, sometimes miscalibrated risk scores are evaluated, and in this case we want to be sure that the miscalibration does not distort the C-index. The advantage of a direct method is that it should give correct results if the risk score is miscalibrated. The indirect methods above are very susceptible to miscalibration. However, even the direct methods of Section 4.1.3 are slightly affected by miscalibration, because the weights in (8) are affected. We therefore propose preceding all the above methods, except for Harrell’s method (which is unaffected by miscalibration), by a recalibration step. For the unadjusted indirect method, we assume r*(x) = γ(x) and estimate γ by fitting the Cox model If r(x) is well calibrated, then γ̂ ≈ 1. The recalibrated estimate is For the adjusted methods, we assume m*(x, z) = γ(x, z) and r̂*(x) = γ(x) and estimate γ and γ by fitting the Cox model with fixed m(x, z) and r̂(z). The recalibrated estimate is Definitions (12) and (13) allow for negative values of γ̂ and γ̂, which could arise with a very poorly calibrated model, and would correctly give estimates less than 0.5. If r(x) is the linear predictor from fitting a Cox model to the data, then recalibration as proposed above is pointless: if done, it yields γ̂ = γ̂ = 1. However, the values of γ and γ in (12) and (13) could instead be estimated by shrinkage methods (Copas, 1983; van Houwelingen and Le Cessie, 1990).

4.2 Adjusted D

We define covariate-adjusted D by extending estimand (5) proposed in Section 3. First, z-specific D is recalling that r̂(z) is the z-specific mean of the r(x). We can also write D(z) as and so it is natural to define adjusted D as That is, D is the average log hazard ratio between individuals matched on z who are above-average and below-average for their value of z. We propose the following modification to the estimation algorithm for D given in Section 3.3. In stage 1, instead of ranking the r(x), we rank the m(x, z) across the whole sample, form normal scores, and scale by . In stage 2, the proportional hazards regression on the scaled normal scores is adjusted for z to avoid bias from omitting a prognostic covariate (Ford et al., 1995). D̂ is the coefficient of the scaled normal scores in the stage 2 model.

4.3 Stratification

A stratified version of C may be computed by restricting attention to pairs within strata. For stratified versions of C* and D we replace m(x, z) and r̂(z) above with m(x, z, s) and r̂(z, s) where s is the stratum of individual i, r̂(z, s) = E[r(x)|z, s] and r(x, s) = m(x, z, s) + r̂(z, s). This decomposition is performed by regressing r(x, s) on z within strata. In estimating D, the stage 2 proportional hazards regression is stratified by s.

5 Simulation study

We next explore the performance (bias and precision) of the proposed estimators as we vary the strengths of association of the outcome with x and z. We first consider an ideal setting where r(x) is the linear predictor from a correctly specified Cox model, and m(x, z) is independent of z so that C = C*. We then consider a nonideal setting where var(m(x, z)) depends on z, so that estimands C, C,, and C* potentially differ.

5.1 Data generating model

Data sets of size n = 1000 were generated with covariates x = (v, z). This relatively large sample size was designed to make optimism negligible without excessively increasing computing time for C (which is roughly proportional to n2). Covariate z represents age at baseline. A fraction 1 − ϕ of individuals belong to age group 1 and have z ~ U(40, 50), the uniform distribution from 40 to 50. The remaining fraction ϕ of individuals belong to age group 2 and have z ~ U(50, 60). Covariate v represents the biomarker of interest and was drawn as v = α(z − 50) + u with for an individual whose value of z places them in age group g. Settings for σ are given below. We chose α to make corr(v, z) = 0.25: changing corr(v, z) to 0 or 0.5 affected the unadjusted results for both C and D, but had negligible effect on the adjusted results (results not shown). Survival times were drawn from the Gompertz distribution where t is time in years from baseline. We took β = 0, 0.5, 1, and β = 0, 0.1, 0.2. Follow-up was for 15 years, and h0 was chosen to give 50–70% censoring. With this data generating model, r(x) = β + β(z − 50), r̂(z) = (β + β)(z − 50) and m(x, z) = β[v − α(z − 50)]. In simulation 1, we take ϕ = 0.5, so that z ~ U(40, 60) and weighting does not affect the estimands. We also take σ1 = σ2 = 1, so that r̂(z) and m(x, z) are independent, and the estimands C and C* are equal. In simulation 2, we take ϕ = 2/3, in order to explore weighting, and σ1 = 1 and σ2 = 2, so that var(m(x, z)|z) depends on z and the estimands differ.

5.2 Methods considered

For C, the unadjusted methods considered were Harrell’s Ĉ (“Harrell”) and Gonen and Heller’s Ĉ (“indirect”). The adjusted methods considered were (“smooth 1 unweighted”) and (“smooth 2 unweighted”) with weights w1(z, z); (“smooth 1 weighted”) and (“smooth 2 weighted”) with weights w1(z, z)w2(z, z); and (“indirect”). Weight w1(z, z) was computed using (8) with λ = 3, and w2(z, z) was computed using (9) and estimating f̂(z) in one-unit bins for z. Estimation was restricted to informative pairs without the weighting of Uno et al. (2011). For D, we used unadjusted D̂ and adjusted D̂. The methods are summarized in Table 2 and the top half of Table 3.

Table 2

Summary of methods for unadjusted measures of discrimination.

	C-index		D-index
	Harrell’s C	Indirect	D-index
Notation	Ĉ_Har	Ĉ_ind	D̂
Description	Mean concordance among informative pairs	Mean expected concordancea)	Cox model on scaled rankit of risk score
Quantity estimated	C	C	D
Recalibration	No impact	Needed	Implicit in method

Expected concordance is computed assuming that r(x) is the linear predictor in a correctly specified Cox model.

Table 3

Summary of methods for adjusted measures of discrimination.

	C-index			Adjusted D-index
	Smooth 1	Smooth 2	Indirect	Adjusted D-index
Notation	C^smooth1adj,C^smooth1adj,w	C^smooth2adj,C^smooth2adj,w	C^indadj	D^adj
Description	Mean concordance (Smooth 1) or m-concordance (Smooth 2) among informative pairs with similar values of adjustment variable	Mean expected concordancea) after removing difference in adjustment variable	Covariate-adjusted Cox model on scaled rankit of risk score
Options	Choice of weights; choice of smoothing parameter λ	–	–
Quantity estimated	C^adj or C^adjw	C^adj^*	D^adj
Recalibration	Little impact	Little impact	Needed	Implicit in method
Properties
Must lie between 0 and 1	✓	✓	✓	N/A
Has value 1 if all pairs are concordant	✓	✓	✕	N/A
Has value 0 if all pairs are discordant	✓	✓	✕	N/A
Reduces to unadjusted estimate if there is no Z	✓	✓	✓	✓
Direct – unaffected by miscalibration	✓	✓	✕	(✕)
Free of tuning parameter λ	✕	✕	✓	✓
Fast to compute	✕	✕	✕	✓
Unaffected by optimism	✕	✕	✕	✕
Unaffected by censoring	✕	✕	✓	✓
Unbiased in simulation after accounting for optimism and censoring	✕	✓	✓	✓

Expected concordance is computed assuming that r(x) is the linear predictor in a correctly specified Cox model. (✕) means only slightly affected.

5.3 Simulation scheme

A total of 1000 datasets were drawn for each combination of parameters. For each combination of parameters and each method, we computed C̄ and s, the mean and standard deviation of Ĉ, and we present a forest-type plot showing C̄ with an interval constructed as C̄ ± 1.96s. We computed the true values of C and D using the exact methods described in appendices C and D. Source code to reproduce the results is available as Supporting Information on the journal’s web page (http://onlinelibrary.wiley.com/doi/10.1002/bimj.201400061/suppinfo).

5.4 Results for simulation 1

Figure 2 shows results with corr(v, z) = 0.25: the five panels show different combinations of β and β.

Figure 2

Simulation study 1: comparison of various unadjusted and adjusted estimates of C. Intervals show C̄ ± 1.96s where C̄ and s are the mean and standard deviation of Ĉ. Vertical lines indicate the true value of adjusted C. Panels show simulated data with different values of β and β, but all have corr(v, z) = 0.25 (see text).

Considering the unadjusted results, Ĉ has a slightly larger mean than Ĉ in all panels, indicating small bias due to censoring. Comparing the unadjusted and adjusted estimates, we see that they are similar in the second panel where β = 0 (i.e., where there is no covariate effect to adjust for), but markedly different in the other panels where β > 0. We now compare the different adjustment methods in the first panel where β = 0 so that the true value is C = 0.5. The “smooth 1” methods (unweighted and weighted) show substantial bias. This is likely to have arisen because concordance C is strongly related to r̂(z) − r̂(z) and the smoothing method inadequately allows for this association. The other adjusted methods show small positive bias. This is attributable to optimism, since the models are fitted and evaluated in the same data; however, optimism is small because of our large sample size. We therefore suggest that “smooth 2” may be preferable to “smooth 1”. In the other panels where C > 0.5, the indirect estimator appeared unbiased (suggesting the bias from optimism is negligible), while the smoothing estimators had small positive bias, presumably due to censoring. Results for adjusted D similarly suggest small optimism when β = 0 and little or no bias elsewhere (Fig. 3).

Figure 3

Simulation study 1: comparison of various unadjusted and adjusted estimates of D. Intervals show D̄ ± 1.96s. Vertical lines indicate the true value of adjusted D.

5.5 Results for simulation 2

Results for C are shown in Fig. 4. When β = 0 (top panel), the results are very similar to simulation 1. In the other panels, the estimands C (estimated by the unweighted methods), C, (estimated by the weighted methods), and C* (estimated by the indirect method) are unequal and are shown by three vertical lines. These estimands differ by up to 0.014, with C, < C* < C.

Figure 4

Simulation study 2: comparison of various unadjusted and adjusted estimates of C. Vertical lines indicate (from L to R) true values of C,, C*, and C.

remains unbiased for C*. The smoothing estimators are all positively biased, with weighted estimators on average slightly smaller than unweighted estimators. In the second panel, where unadjusted and adjusted C-indices are equal, the bias in the smoothing estimators is slightly smaller than that in Harrell’s estimator: this suggests that all the bias observed is attributable to censoring. Corresponding results for D are shown in Fig. 5. Small positive bias is found for all parameter values. Because no bias was seen in simulation 1, this is likely to arise from model mis-specification.

Figure 5

Simulation study 2: comparison of various unadjusted and adjusted estimates of D. Vertical lines indicate the true value of adjusted D.

6 ERFC results

To illustrate the differences between methods and the effects of recalibration, we used the single Cox proportional hazards model fitted to all the ERFC studies, stratifying by study, sex, and trial arm, as displayed in Table 1. The linear predictor from the resulting model was evaluated using unweighted methods in each study separately, stratifying by sex and trial arm. We first explore the effect of recalibration. The model is guaranteed to be well calibrated in the whole ERFC data, but it is likely to be miscalibrated in individual studies. Figure 6 plots, for each method considered, the difference between the C-indices after recalibration and before recalibration against their mean, as proposed by Bland and Altman (1986). Harrell’s method is unaffected by recalibration and so is not shown in Fig. 6. We see that recalibration has a large impact for the indirect methods and very little impact for the smoothing methods.

Figure 6

ERFC data: Bland-Altman plots exploring the effects of recalibration on various methods for computing the adjusted C-index. Each point represents one study.

We next compare the different methods after recalibration. Figure 7 shows Bland-Altman plots comparing the two unadjusted methods and the three adjusted methods. The top panel shows that the indirect method tends to give lower results than Harrell’s method, probably due to censoring (Gonen and Heller, 2005). The four panels in the lower left-hand corner compare unadjusted and adjusted methods and show large differences. The three panels in the lower right-hand corner compare the adjusted methods. Again, the indirect method tends to give lower estimates than the other methods, while the two smoothing methods give very similar results.

Figure 7

ERFC data: Bland-Altman plots comparing different methods for computing the adjusted C-index, after recalibration. Each point represents one study.

Finally, we revisit Fig. 1, which used the indirect method with recalibration. The strong association of the unadjusted C-index with SD of baseline age (left-hand panel) is removed when we use the age-adjusted C-index (right-hand panel). Covariate adjustment reduces C by up to 0.21 in 77 of the 82 studies; increases C by up to 0.03 in four studies (all of which are small); and leaves C unchanged for one study where all participants have the same baseline age. Figure 8 shows the corresponding results for the D-index.

Figure 8

ERFC data: unadjusted and age-adjusted D-index for a model including baseline age, smoking, SBP, TC, and HDL, plotted for each study against the SD of baseline age in that study. Analyses are stratified by sex and trial arm. Each point represents one study.

Source code to analyze simulated data (like one ERFC cohort) is available as Supporting Information on the journal’s web page (http://onlinelibrary.wiley.com/doi/10.1002/bimj.201400061/suppinfo).

7 Discussion

We have proposed a number of methods for estimating an adjusted C-index. The lower part of Table 3 lists a number of desirable properties of an adjusted measure of discrimination, and evaluates the proposed adjusted C-indices and the adjusted D-index against these measures. Overall, the best adjusted C-index appears to be the indirect estimator, although we caution that it is sensitive to model mis-specification. The adjusted D-index is an excellent alternative which is easy to compute. We have not discussed computation of standard errors in this paper: for the C-index, bootstrapping could be used, while a standard error arises naturally in the calculation of the D-index. Optimism (overfitting) is an issue whenever models are estimated and evaluated on the same dataset (Harrell et al., 1996). It is not the focus of this paper, because the ERFC data set is large and optimism is likely to be negligible. However, there were signs of optimism in the simulation studies with β = 0. In general, our methods should be applied after recalibrating the risk score r(x) to allow for optimism, ideally in an external validation set, or otherwise by using internal corrections such as the bootstrap (Harrell et al., 1996). Censoring is another potential problem for our methods. It causes bias in direct methods if noninformative pairs are simply excluded. Our simulation results show that moderate biases due to censoring do occur, especially in larger C-indices (e.g., in the bottom two panels of Fig. 4). Typically, studies have both a limit τ to the length of follow-up and random censoring before that time. The method of Uno et al. (2011) can be used to correct for the random censoring, but it estimates C not C. Currently the only way to estimate C with data censored by end of follow-up is the indirect method. Model mis-specification is a further potential problem, especially with the indirect methods which assume a correctly specified proportional hazards model. We have proposed a recalibration step which should remove bias due to miscalibration, but not necessarily other forms of model mis-specification. The direct methods such as Smooth 1 and Smooth 2 should be much less sensitive to model mis-specification, since they depend on observed concordance, not model-predicted concordance. To reduce sensitivity to model mis-specification, we tried using the difference between the direct and indirect estimators of C (which may reflect the impact of model mis-specification) to correct the indirect estimator of C*, defining (or the equivalent on the logit scale). However, because the impact of censoring is greater in unadjusted than adjusted estimators (Fig. 2), the corrected estimator did not perform well in the simulation study and was not included in the results. Covariate adjustment could be considered for other measures of discrimination. The net reclassification index (NRI) is a popular measure of the difference in discrimination between two models (Pencina et al., 2008). Because the NRI is based on within-individual comparisons, it is neither necessary nor possible to adjust it for covariates, although a covariate-specific NRI could be a useful quantity. However, correct calibration is required to avoid misleading NRI results (Hilden and Gerds, 2014). Another way to evaluate the value of adding a new biomarker to a risk prediction model could be to evaluate the discrimination of the new model, adjusting for the covariates in the original model. Further extensions include ways to account for competing risks and to obtain time-dependent measures of discrimination (Wolbers et al., 2009). A common feature of these approaches is that the C-index needs to be computed for different time points which limit the comparability of different studies that have different length of follow-up. An open question is which metric is more appropriate for which data and whether these different approaches can produce different conclusions in some scenarios. The methodology presented here could be extended to incorporate these extensions. Our approach should not be confused with ROC regression (Tosteson and Begg, 1988). ROC regression methods model the accuracy of a diagnostic test as a function of covariates, not how the disease is associated with covariates. A possible use of such an approach might be to find subgroups where the marker should not be used or to find optimal cutoffs. Here we assume predictions that have been optimized with respect to disease risk and our aim is not (primarily) to explain which covariates affect accuracy but to adjust discrimination statistics for the confounding effect of the covariates that do. In summary, we have proposed covariate-adjusted measures of concordance. There are many benefits to such measures (Pepe et al., 2008). In the meta-analysis setting, they facilitate comparisons between studies with different covariate distributions (Figs. 1 and 8). They also enable matched case-control studies nested within cohort studies to be compared with standard cohort studies, since the former can only yield measures of discrimination adjusted for the matching variables. We advocate adjustment, at least for study design variables such as age, sex, and study centre, whenever measures of discrimination are to be compared between studies with different distributions of the design variables.

29 in total

1. Exploring sources of heterogeneity in systematic reviews of diagnostic tests.

Authors: Jeroen G Lijmer; Patrick M M Bossuyt; Siem H Heisterkamp
Journal: Stat Med Date: 2002-06-15 Impact factor: 2.373

2. A simulation study of predictive ability measures in a survival model II: explained randomness and predictive accuracy.

Authors: B Choodari-Oskooei; P Royston; Mahesh K B Parmar
Journal: Stat Med Date: 2012-07-05 Impact factor: 2.373

3. Improvement of risk prediction by genomic profiling: reclassification measures versus the area under the receiver operating characteristic curve.

Authors: Raluca Mihaescu; Moniek van Zitteren; Mandy van Hoek; Eric J G Sijbrands; André G Uitterlinden; Jacqueline C M Witteman; Albert Hofman; M G Myriam Hunink; Cornelia M van Duijn; A Cecile J W Janssens
Journal: Am J Epidemiol Date: 2010-06-18 Impact factor: 4.897

4. Integrating the predictiveness of a marker with its performance as a classifier.

Authors: Margaret S Pepe; Ziding Feng; Ying Huang; Gary Longton; Ross Prentice; Ian M Thompson; Yingye Zheng
Journal: Am J Epidemiol Date: 2007-11-02 Impact factor: 4.897

5. Prognostic models with competing risks: methods and application to coronary risk prediction.

Authors: Marcel Wolbers; Michael T Koller; Jacqueline C M Witteman; Ewout W Steyerberg
Journal: Epidemiology Date: 2009-07 Impact factor: 4.822

Review 6. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.

Authors: F E Harrell; K L Lee; D B Mark
Journal: Stat Med Date: 1996-02-28 Impact factor: 2.373

7. Statistical methods for assessing agreement between two methods of clinical measurement.

Authors: J M Bland; D G Altman
Journal: Lancet Date: 1986-02-08 Impact factor: 79.321

8. Evaluating the yield of medical tests.

Authors: F E Harrell; R M Califf; D B Pryor; K L Lee; R A Rosati
Journal: JAMA Date: 1982-05-14 Impact factor: 56.272

Review 9. Reporting performance of prognostic models in cancer: a review.

Authors: Susan Mallett; Patrick Royston; Rachel Waters; Susan Dutton; Douglas G Altman
Journal: BMC Med Date: 2010-03-30 Impact factor: 8.775

10. Measures to assess the prognostic ability of the stratified Cox proportional hazards model.

Authors:
Journal: Stat Med Date: 2009-02-01 Impact factor: 2.373

7 in total

1. A framework for meta-analysis of prediction model studies with binary and time-to-event outcomes.

Authors: Thomas Pa Debray; Johanna Aag Damen; Richard D Riley; Kym Snell; Johannes B Reitsma; Lotty Hooft; Gary S Collins; Karel Gm Moons
Journal: Stat Methods Med Res Date: 2018-07-23 Impact factor: 3.021

2. The use of repeated blood pressure measures for cardiovascular risk prediction: a comparison of statistical models in the ARIC study.

Authors: Michael J Sweeting; Jessica K Barrett; Simon G Thompson; Angela M Wood
Journal: Stat Med Date: 2016-10-11 Impact factor: 2.373

3. On hazard ratio estimators by proportional hazards models in matched-pair cohort studies.

Authors: Tomohiro Shinozaki; Mohammad Ali Mansournia; Yutaka Matsuyama
Journal: Emerg Themes Epidemiol Date: 2017-06-05

4. Empirical evidence of the impact of study characteristics on the performance of prediction models: a meta-epidemiological study.

Authors: Johanna A A G Damen; Thomas P A Debray; Romin Pajouheshnia; Johannes B Reitsma; Rob J P M Scholten; Karel G M Moons; Lotty Hooft
Journal: BMJ Open Date: 2019-04-01 Impact factor: 2.692

5. Overall Survival and Biomarker Analysis of Neoadjuvant Nivolumab Plus Chemotherapy in Operable Stage IIIA Non-Small-Cell Lung Cancer (NADIM phase II trial).

Authors: Mariano Provencio; Roberto Serna-Blasco; Ernest Nadal; Amelia Insa; M Rosario García-Campelo; Joaquín Casal Rubio; Manuel Dómine; Margarita Majem; Delvys Rodríguez-Abreu; Alex Martínez-Martí; Javier De Castro Carpeño; Manuel Cobo; Guillermo López Vivanco; Edel Del Barco; Reyes Bernabé Caro; Nuria Viñolas; Isidoro Barneto Aranda; Santiago Viteri; Eva Pereira; Ana Royuela; Virginia Calvo; Javier Martín-López; Francisco García-García; Marta Casarrubios; Fernando Franco; Estela Sánchez-Herrero; Bartomeu Massuti; Alberto Cruz-Bermúdez; Atocha Romero
Journal: J Clin Oncol Date: 2022-05-16 Impact factor: 50.717

6. Initiation of the SGLT2 inhibitor canagliflozin to prevent kidney and heart failure outcomes guided by HbA1c, albuminuria, and predicted risk of kidney failure.

Authors: Sok Cin Tye; Niels Jongs; Steven G Coca; Johan Sundström; Clare Arnott; Bruce Neal; Vlado Perkovic; Kenneth W Mahaffey; Priya Vart; Hiddo J L Heerspink
Journal: Cardiovasc Diabetol Date: 2022-09-23 Impact factor: 8.949

7. Landmark Models for Optimizing the Use of Repeated Measurements of Risk Factors in Electronic Health Records to Predict Future Disease Risk.

Authors: Ellie Paige; Jessica Barrett; David Stevens; Ruth H Keogh; Michael J Sweeting; Irwin Nazareth; Irene Petersen; Angela M Wood
Journal: Am J Epidemiol Date: 2018-07-01 Impact factor: 4.897

7 in total