Geneviève Cadieux, Robyn Tamblyn, David L Buckeridge, Nandini Dendukuri
*Dalla Lana School of Public Health, University of Toronto, Toronto, ON; †Department of Epidemiology, Biostatistics and Occupational Health, McGill University; ‡Direction de la Santé Publique de Montréal; §Department of Medicine, McGill University, Montreal, QC, Canada.
Abstract
OBJECTIVE: Valid measurement of outcomes such as disease prevalence using health care utilization data is fundamental to the implementation of a "learning health system." Definitions of such outcomes can be complex, based on multiple diagnostic codes. The literature on validating such data demonstrates a lack of awareness of the need for a stratified sampling design and corresponding statistical methods. We propose a method for validating the measurement of diagnostic groups that have: (1) different prevalences of diagnostic codes within the group; and (2) low prevalence. METHODS: We describe an estimation method whereby: (1) low-prevalence diagnostic codes are oversampled, and the positive predictive value (PPV) of the diagnostic group is estimated as a weighted average of the PPV of each diagnostic code; and (2) claims that fall within a low-prevalence diagnostic group are oversampled relative to claims that are not, and bias-adjusted estimators of sensitivity and specificity are generated. APPLICATION: We illustrate our proposed method using an example from population health surveillance in which diagnostic groups are applied to physician claims to identify cases of acute respiratory illness. CONCLUSIONS: Failure to account for the prevalence of each diagnostic code within a diagnostic group leads to the underestimation of the PPV, because low-prevalence diagnostic codes are more likely to be false positives. Failure to adjust for oversampling of claims that fall within the low-prevalence diagnostic group relative to those that do not leads to the overestimation of sensitivity and underestimation of specificity.
Recent interest toward developing a “learning health system,” in which new knowledge is generated as an integral byproduct of health care delivery,1 has prompted calls for improved capacity to draw timely inference from health care data.2–4 Major investments in health information technology, including the development and widespread implementation of electronic health records,5 have created new streams of health data and added richness and complexity to existing data streams. As this wealth of new data becomes available, methods to infer from these data important metrics, such as disease prevalence, must be validated before they are used to guide decisions within a “learning health system.”
In addition to playing a critical role in achieving the vision of a “learning health system,” the ability to draw sound inference from comprehensive and high-quality health care utilization data (ie, data generated as a corollary of health service delivery) is a cornerstone of quality improvement, pharmacosurveillance, public health practice and policymaking, and health-related research in a variety of disciplines. Diagnostic codes are among the most widely used data elements in health care utilization data; they are used to identify patient populations of interest,6–8 to assess the presence of risk factors,9,10 to adjust for comorbidity11–14 and case-mix,14–16 to monitor health outcomes in the population,17,18 and even to inform individual patient care at the time of the clinical encounter.19 However, because diagnostic codes in health care utilization data are generated as a byproduct of health services delivery (eg, to enable fee-for-service remuneration), before using them for another purpose, it is essential to determine whether these data are sufficiently sensitive and specific for that purpose. Typically, when these data are used to measure something about a particular condition, a group of diagnostic codes associated with the condition is defined and validated. Diagnostic groups are generally used because individual diagnostic codes are too “fine-grained” for most purposes.
The scientific literature is replete with validation studies of diagnostic groups in health care utilization data. However, the quality of their methodology is highly variable, ranging from cross-correlations between 2 time-series20,21 to record-level comparison with other data sources including registries,22,23 patient self-report,24,25 clinical information systems,26 medical record review,27–29 and clinical measurement.25,30,31 The ecological approach to validating diagnostic groups is inferior because it does not permit the estimation of the sensitivity, specificity, positive predictive value (PPV), or negative predictive value (NPV) of the diagnostic group. In contrast, validation studies based on the direct comparison of individual records identified from 2 different data sources are more informative; however, their design and analysis can pose important challenges. These challenges commonly arise because: (1) the diagnostic group is made up of several diagnostic codes that can have vastly different prevalences in the database; and/or (2) the diagnostic group itself has a low prevalence in the database.
DESCRIPTION OF THE CHALLENGES IN VALIDATING DIAGNOSTIC GROUPS
Challenge 1: Estimating the PPV and NPV of a Diagnostic Group Composed of Several Diagnostic Codes, Each With a Different Population Prevalence
One problem commonly encountered in validation studies of diagnostic groups composed of diagnostic codes in health care utilization data is that each of the diagnostic codes has a different prevalence in the population (example in Table 1). Because of this variation in prevalence, a simple random sample of health care utilization records with diagnostic codes belonging to the diagnostic group may fail to capture enough records with low-prevalence diagnostic codes to generate a reliable estimate of their accuracy. Knowledge of the accuracy of individual diagnostic codes within a diagnostic group is crucial to understanding how a given diagnostic group behaves in different situations and populations (eg, variation in diagnostic coding practices between physician specialty groups and care settings), and to “refine” or modify diagnostic groups to suit different purposes (eg, when a more sensitive vs. specific diagnostic group is needed for disease screening vs. confirmation).
TABLE 1
Example of a Diagnostic Group Made up of Diagnostic Codes, Each With a Different Population Prevalence: The CDC’s Diagnostic Group for “Respiratory Syndrome”
Furthermore, previous validation studies have found that low-prevalence diagnostic codes are more likely to be false positives than higher-prevalence ones.33,34 Therefore, estimation of the PPV for a diagnostic group must accurately reflect the population prevalence of the individual diagnostic codes in that diagnostic group; oversampling low-prevalence diagnostic codes and failing to adjust for such a sampling strategy in the analysis could lead to an underestimation of the PPV.
Challenge 2: Estimation of the Sensitivity and Specificity of a Diagnostic Group With a Low Population Prevalence
Another challenge in validating diagnostic groups based on diagnostic codes in health care utilization data arises when the prevalence of the diagnostic group itself is low. In such a situation, it may be prohibitively costly, and perhaps infeasible, to use a simple random sample of health care utilization data. With this approach, one would need to review a very large number of records to capture a sufficient number with a diagnostic code that belongs to the diagnostic group. To address this problem, investigators typically sample a larger proportion of claims with a diagnostic code belonging to the diagnostic group of interest than without. For example, a validation of the diagnostic group for “respiratory syndrome” used for surveillance by both the Centers for Disease Control and Prevention (CDC) and the US Department of Defense27 sampled 454 records positive for respiratory syndrome and 2020 records negative for respiratory syndrome, for a sample prevalence of respiratory syndrome of 22.5%; although the authors did not report the population prevalence for respiratory syndrome, we estimated it to be nearly half that of the sample prevalence, that is, 12.8%.33 Failure to account for such a stratified sampling strategy in the statistical analysis can lead to an overestimation of sensitivity and underestimation of specificity,35 a phenomenon known as verification bias in the context of diagnostic test evaluation.
A correction for verification bias was first described in the context of diagnostic test evaluation by Begg and Greenes.35 In that context, it occurs when patients first undergo test A (eg, fecal occult blood test), and then, based on the clinician’s interpretation of the results from test A in the context of other relevant clinical factors (eg, other signs, symptoms, family history), a subgroup of patients who underwent test A are selected to undergo test B (eg, colonoscopy). Verification bias arises in the estimation of the sensitivity and specificity of test A (the screening test) when a nonrepresentative sample of patients who underwent test A are selected to undergo test B (the “verification” test). Typically, a larger proportion of patients who tested positive on test A (the screening test) will be selected to undergo test B (the verification test), as compared with the fraction of patients who tested negative on test A who are selected to undergo test B. When estimating the sensitivity and specificity of test A, failure to account for the mechanism whereby patients are selected to undergo test B (the “verification” test) results in verification bias. Consequently, the sensitivity of test A is overestimated and its specificity is underestimated.
Although verification bias was first described in diagnostic test evaluation, it can arise in any validation study that uses a stratified random sampling strategy whereby the proportion of “test A positives” sampled is different from (typically larger than) the proportion of “test A negatives” sampled. In other words, verification bias can arise in any validation study where the prevalence of “test A positives” in the study sample is different from the prevalence of “test A positives” in the study population. Therefore, when we apply this principle to validation studies of diagnostic groups measured in health care utilization databases, we conclude that verification bias can arise whenever the sampled proportion of records with a diagnostic code belonging to the diagnostic group is different from the sampled proportion of records without such a diagnosis.
In the next sections of this paper, we propose a sequential 2-step approach for validating diagnostic groups based on health care utilization data: step 1, estimating the PPV and NPV of a diagnostic group composed of several diagnostic codes, each code with a different prevalence, and step 2, estimating the sensitivity and specificity of a diagnostic group with low population prevalence. Then, we illustrate our proposed method through application to data from a validation study33 of the CDC’s diagnostic group for surveillance of acute respiratory illness.32
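To make the direction of verification bias concrete, the following Python sketch uses a hypothetical population 2×2 table; the counts and sampling fractions are invented for illustration and are not taken from the study. It computes the expected sample counts when group-positive claims are verified at a higher rate than group-negative claims, then compares naive sensitivity and specificity with estimates re-weighted by the inverse of the sampling fractions.

```python
def verification_bias_demo(A, B, C, D, f_pos, f_neg):
    """Compare naive and re-weighted accuracy estimates.

    A, B: population counts of claim-positive records that are
          chart-positive (A) or chart-negative (B).
    C, D: population counts of claim-negative records that are
          chart-positive (C) or chart-negative (D).
    f_pos, f_neg: fractions of claim-positive and claim-negative
          records sampled for chart verification (hypothetical values).
    """
    # Expected cell counts in the verified (stratified) sample.
    a, b = A * f_pos, B * f_pos
    c, d = C * f_neg, D * f_neg

    # Naive estimates treat the sample as if it were a simple random sample.
    naive = {"sensitivity": a / (a + c), "specificity": d / (b + d)}

    # Re-weighting each cell by the inverse of its sampling fraction
    # recovers the population proportions.
    wa, wb = a / f_pos, b / f_pos
    wc, wd = c / f_neg, d / f_neg
    corrected = {"sensitivity": wa / (wa + wc), "specificity": wd / (wb + wd)}
    return naive, corrected


# Hypothetical population: 10% of 10,000 claims are group-positive,
# and positives are verified ten times as often as negatives.
naive, corrected = verification_bias_demo(
    A=900, B=100, C=600, D=8400, f_pos=0.5, f_neg=0.05
)
print("naive:", naive)
print("corrected:", corrected)
```

With these invented numbers, the naive sensitivity is about 0.94 against a population value of 0.60, and the naive specificity is about 0.89 against a population value of about 0.99, which matches the direction of bias described above.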
METHODS
Notation and study design: The notation we use is summarized in Table 2. For each diagnostic code, the entries in the 2×2 table of claims versus medical charts data are denoted using capital letters for the population level (A, B, C, D) and small letters for the stratified sample level (a, b, c, d). For each diagnostic group, at the population level, the number of positive claims is denoted by A+B and the number of negative claims by C+D. The A+B claims positive for the diagnostic group were first stratified by diagnostic code and then a random sample was drawn from each code for verification with the medical chart. A single random sample was also drawn from the C+D negative claims for verification; this sample was frequency-matched to the positive claims on month of visit to avoid seasonal bias. Note that each 2×2 table in Table 2 is created by combining the results of claims that are positive for an individual diagnostic code with the results of claims that are negative for the entire diagnostic group. The entries c, d are not mentioned explicitly as they are included within c, d.
TABLE 2
Notation Used in the Methods Section and Example of the Data Used in Our Application (Excerpted From Appendix A, Supplemental Digital Content 1, http://links.lww.com/MLR/A876)
Step 1: Estimating the PPV and NPV of a Diagnostic Group Composed of Several Diagnostic Codes, Each With a Different Population Prevalence
When a diagnostic group is composed of diagnostic codes that each have a different population prevalence, we propose stratifying the sample of claims with a diagnosis belonging to the diagnostic group by the individual diagnostic codes that make up the diagnostic group (ie, sampling at the level of the diagnostic code rather than at the level of the diagnostic group).
Using such a stratified sampling strategy, the PPV of individual diagnostic codes can be estimated directly from the 2×2 tables based on the stratified sample using the following usual formula:
$$\mathrm{PPV}_m = \frac{a_m}{a_m + b_m} \qquad (1)$$
where m denotes an individual diagnostic code within a diagnostic group.
As shown by Begg and Greenes, the PPV of individual diagnostic codes can be estimated without verification bias despite the stratified sampling strategy.35 It can be shown using Bayes theorem that the PPV of the diagnostic group is the weighted average of the PPV of each diagnostic code within the group, the weight being the estimated prevalence of each diagnostic code in the population:
$$\mathrm{PPV}_y = \sum_{m=1}^{n} w_m\,\mathrm{PPV}_m, \qquad w_m = \frac{A_m + B_m}{\sum_{k=1}^{n} (A_k + B_k)} \qquad (2)$$
where n denotes the total number of diagnostic codes in diagnostic group y. The weight was obtained as the number of visits with a given diagnostic code among the sampled claims, divided by the total number of visits positive for that diagnostic group among all the claims from participating physicians. Note that A and B are not observed separately due to the fact that not all claims are verified, but their sum, A+B, is observed.
Using standard statistical theory for stratified samples,36 the variance of the estimated PPV for the diagnostic group can be expressed as follows:
$$\widehat{\mathrm{Var}}(\mathrm{PPV}_y) = \sum_{m=1}^{n} w_m^{2}\,\widehat{\mathrm{Var}}(\mathrm{PPV}_m)$$
where $\widehat{\mathrm{Var}}(\mathrm{PPV}_m)$ is the estimated variance of the PPV within the stratum for diagnostic code m.
As explained earlier, the C+D “group-negative” records should be negative for all “group-positive” diagnostic codes, not only a single diagnostic code in the group; for example, “group-negative” records for the CDC’s diagnostic group for respiratory syndrome should not include “influenza” or any other acute respiratory infection. Therefore, unlike the PPV, the NPV and its variance are estimated at the level of the diagnostic group, not at the level of individual diagnostic codes, from the 2×2 table in Table 2 using the usual formula:
$$\mathrm{NPV}_y = \frac{d}{c + d} \qquad (3)$$
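The weighted-average estimator described in Step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `weighted_group_ppv`, the dictionary keys, and all counts are hypothetical, and the within-stratum variance is taken to be the simple binomial form PPV_m(1 − PPV_m)/(a_m + b_m) without a finite population correction, which may differ from the variance the authors used.

```python
def weighted_group_ppv(strata):
    """Estimate the PPV of a diagnostic group from a code-stratified sample.

    strata: list of dicts, one per diagnostic code, with keys
      "N": number of claims with this code in the population (A_m + B_m),
      "a": sampled claims confirmed by chart review,
      "b": sampled claims not confirmed by chart review.
    Returns (weighted PPV, variance of the weighted PPV).
    """
    total = sum(s["N"] for s in strata)
    ppv, var = 0.0, 0.0
    for s in strata:
        n_m = s["a"] + s["b"]      # verified claims in this stratum
        w_m = s["N"] / total       # population share of this code
        ppv_m = s["a"] / n_m       # code-level PPV
        ppv += w_m * ppv_m
        # Assumed binomial within-stratum variance (no finite population correction).
        var += w_m ** 2 * ppv_m * (1 - ppv_m) / n_m
    return ppv, var


# Hypothetical data: one common, accurate code and one rare, oversampled,
# less accurate code.
strata = [
    {"N": 9000, "a": 90, "b": 10},   # common code, PPV 0.90
    {"N": 1000, "a": 20, "b": 30},   # rare code, PPV 0.40
]
ppv, var = weighted_group_ppv(strata)

# Two estimators that ignore the population weights, for comparison.
pooled = sum(s["a"] for s in strata) / sum(s["a"] + s["b"] for s in strata)
unweighted = sum(s["a"] / (s["a"] + s["b"]) for s in strata) / len(strata)

print(f"weighted PPV = {ppv:.3f}, SE = {var ** 0.5:.3f}")
print(f"pooled PPV = {pooled:.3f}, unweighted mean PPV = {unweighted:.3f}")
```

In this invented example the weighted estimate (0.85) exceeds both the pooled estimate (about 0.73) and the unweighted mean (0.65), because the rare code is overrepresented in the sample relative to its share of the population.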
Step 2: Estimation of the Sensitivity and Specificity of a Diagnostic Group With a Low Population Prevalence
In most validation studies of low-prevalence diagnostic groups, a larger proportion of records with a diagnostic code belonging to the diagnostic group (ie, “group-positive” records) are sampled than records without any such diagnosis (ie, “group-negative” records); this is often done to maximize efficiency and minimize cost by validating fewer claims in total.37 However, failing to take this stratified random sampling strategy into account in the analysis can lead to verification bias: sensitivity is overestimated, specificity is underestimated, and the bias is typically larger for sensitivity than specificity.35
A method for correcting for verification bias was published by Begg and Greenes in 1983.35 It involves taking into account the relative difference in the sampled proportions between the group-positive records and the group-negative records in the estimation of sensitivity and specificity. When the validated claims were randomly sampled within group-positive and group-negative strata, estimation of sensitivity and specificity can be achieved by re-weighting for the different sampling fractions.35 We propose the following equations to estimate the sensitivity (Sn) and specificity (Sp) of a diagnostic group y using the PPV and NPV estimates derived in the previous section, and the proportion (p) of records with a diagnostic code belonging to the diagnostic group,38 while correcting for verification bias35:
$$p_y = \frac{A + B}{A + B + C + D} \qquad (4)$$
$$\mathrm{Sn}_y = \frac{\mathrm{PPV}_y\,p_y}{\mathrm{PPV}_y\,p_y + (1 - \mathrm{NPV}_y)(1 - p_y)} \qquad (5)$$
$$\mathrm{Sp}_y = \frac{\mathrm{NPV}_y\,(1 - p_y)}{\mathrm{NPV}_y\,(1 - p_y) + (1 - \mathrm{PPV}_y)\,p_y} \qquad (6)$$
To estimate the variance for sensitivity and specificity, we used the equations that appear in the original paper on verification bias by Begg and Greenes.35
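The step from PPV, NPV, and p to sensitivity and specificity can be made explicit. The short derivation below is a sketch that assumes the usual 2×2 layout (A and B for group-positive claims, C and D for group-negative claims, N = A + B + C + D); it expresses each population cell as a proportion and then forms the two ratios.

$$
\begin{aligned}
\frac{A}{N} &= \mathrm{PPV}_y\,p, &\qquad \frac{B}{N} &= (1 - \mathrm{PPV}_y)\,p,\\
\frac{C}{N} &= (1 - \mathrm{NPV}_y)(1 - p), &\qquad \frac{D}{N} &= \mathrm{NPV}_y\,(1 - p),
\end{aligned}
$$

$$
\mathrm{Sn}_y = \frac{A}{A + C} = \frac{\mathrm{PPV}_y\,p}{\mathrm{PPV}_y\,p + (1 - \mathrm{NPV}_y)(1 - p)},
\qquad
\mathrm{Sp}_y = \frac{D}{B + D} = \frac{\mathrm{NPV}_y\,(1 - p)}{\mathrm{NPV}_y\,(1 - p) + (1 - \mathrm{PPV}_y)\,p}.
$$

Because PPV and NPV are estimated within the group-positive and group-negative strata, they are unaffected by unequal sampling fractions; the population proportion p, taken from the full claims database rather than from the sample, restores the correct balance between the strata.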
APPLICATION
We illustrate the proposed methods by application to a validation study33 of the CDC’s diagnostic group for respiratory syndrome surveillance.32 In this study, we validated diagnostic codes recorded in reimbursement claims for primary care physician visits; specifically, we compared International Classification of Diseases 9th Revision (ICD-9) codes belonging to the CDC’s diagnostic group for respiratory syndrome against diagnoses obtained from chart review for the same patient visit.33 In brief, we initially selected a random sample of 3600 community-based primary care physicians practicing in the fee-for-service system in the province of Quebec, Canada. We then randomly selected 10 visits per physician from their claims, stratifying on syndrome type and presence, diagnosis, and month. Double-blinded chart reviews were conducted by telephone with consenting physicians to obtain information on patient diagnoses for each sampled visit. The sensitivity, specificity, and PPV of physician claims were estimated by comparison with chart review.
Our final study sample comprised 1098 (12.6%) participating primary care physicians and 10,529 of the 7,079,171 visits for which they submitted fee-for-service claims to the provincial health insurance program in the 2-year period from October 1, 2005 to September 30, 2007. The CDC’s diagnostic group for respiratory syndrome includes 171 individual ICD-9 codes; in our final study sample, the prevalence of these individual ICD-9 codes ranged from zero to 3152 per 100,000 primary care visits and the overall population prevalence of the CDC’s diagnostic group for respiratory syndrome was 128.3 per 1000 primary care visits.
Example of Step 1: Estimating the PPV and NPV of a Diagnostic Group Composed of Several Diagnostic Codes, Each With a Different Population Prevalence
In our example (see an excerpt of our data in Table 3 or the full data table in online Appendix A, Supplemental Digital Content 1, http://links.lww.com/MLR/A876), the diagnostic group for respiratory syndrome included diagnostic codes with high population prevalence (eg, 465.9, acute upper respiratory infection of unspecified or multiple sites, prevalence of 31.5 per 1000 primary care visits) and diagnostic codes with low population prevalence (eg, 487.0, influenza with pneumonia, prevalence of 0.04 per 1000 primary care visits). Given that these 2 prevalences differ by about 3 orders of magnitude ($10^3$), had we taken a simple random sample of 100 visits that met the diagnostic group for respiratory syndrome, we likely would have failed to capture any visit with a diagnosis code of 487.0 (influenza with pneumonia). However, the diagnostic code with the lowest prevalence may be more valuable to the investigators (in this case, 487.0 may be a more specific indicator of influenza infection than 465.9), so validating the low-prevalence diagnosis may be highly desirable. Therefore, to ensure that our sample contained a sufficient number of visits with rarely used diagnoses to generate stable estimates of those diagnoses’ PPV, we stratified our sample by individual diagnostic code.
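The claim that a simple random sample of 100 group-positive visits would probably miss code 487.0 can be checked with a short calculation. The sketch below uses the two prevalences quoted in the text (0.04 and 128.3 per 1000 visits) and assumes, for illustration only, that each group-positive visit carries a single diagnostic code and that draws are independent (a binomial approximation).

```python
# Prevalences per 1000 primary care visits, as quoted in the text.
code_prev = 0.04      # ICD-9 487.0, influenza with pneumonia
group_prev = 128.3    # CDC respiratory syndrome diagnostic group

# Assumed share of group-positive visits that carry code 487.0.
share = code_prev / group_prev

n_sampled = 100                                   # simple random sample size
expected_count = n_sampled * share                # expected visits with 487.0
p_at_least_one = 1 - (1 - share) ** n_sampled     # binomial approximation

print(f"share of group-positive visits with 487.0: {share:.5f}")
print(f"expected number in a sample of {n_sampled}: {expected_count:.3f}")
print(f"probability of capturing at least one: {p_at_least_one:.3f}")
```

Under these assumptions, the chance of capturing even one visit coded 487.0 in a sample of 100 is roughly 3%, which is why stratifying by diagnostic code is needed to estimate that code's PPV.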
TABLE 3
Example of Data from Our Validation Study33 of the CDC’s Diagnostic Group for “Respiratory Syndrome” (Excerpted From Supplemental Digital Content 1, http://links.lww.com/MLR/A876)
In the Methods section, we provided Eq. (2) to obtain an estimate of the PPV of a given diagnostic group when some of the diagnostic codes in the diagnostic group are oversampled relative to others. Solving Eq. (2) using the numbers in Supplemental Digital Content 1, http://links.lww.com/MLR/A876, we obtain the PPV estimate for the CDC’s diagnostic group for respiratory syndrome (PPVy) (where the subscript “y” denotes respiratory syndrome):
Had we ignored the stratified sampling at the diagnostic code level, and calculated the PPV at the level of the diagnostic group, using PPV=A/(A+B), our PPV estimate would have been underestimated at 0.77. Had we computed a simple (unweighted) average of the PPVs of each diagnostic code in the diagnostic group, using $\frac{1}{n}\sum_{m=1}^{n}\mathrm{PPV}_m$, our PPV estimate would have been even further underestimated at 0.63.
As mentioned in the Methods section, the NPV of the diagnostic group is conceptualized only at the level of the diagnostic group (not at the level of the diagnostic code) and therefore can be estimated directly from diagnostic group-level data. When we solve Eq. (3) using the numbers in Supplemental Digital Content 1, http://links.lww.com/MLR/A876, we obtain the following NPV estimate for the CDC’s diagnostic group for respiratory syndrome (where the subscript “y” denotes respiratory syndrome):
Example for Step 2: Estimation of the Sensitivity and Specificity of a Diagnostic Group With Low Population Prevalence
In the Methods section, we proposed 2 statistical equations to yield estimates of the overall sensitivity (Eq. (5)) and specificity (Eq. (6)) of a given diagnostic group when, due to low population prevalence, “test A positives” are oversampled relative to “test A negatives.” When we estimate sensitivity and specificity of respiratory syndrome using Eq. (4–6) together with the data in Supplemental Digital Content 1, http://links.lww.com/MLR/A876, we obtain the following prevalence, sensitivity, and specificity estimates for the CDC definition of respiratory syndrome (where the subscript “y” denotes respiratory syndrome):
In our study population, the prevalence of respiratory syndrome is 128 per 1000 visits.
Had we not adjusted for verification bias, sensitivity would have been overestimated, and specificity would have been underestimated:
It should be noted that the Begg and Greenes correction for verification bias35 is highly dependent on p, the proportion of records with a diagnostic code belonging to the diagnostic group; the following 2 simulations illustrate this point:
Simulation 1: Decreased Population Prevalence of Respiratory Syndrome
If the study population had been visits to psychiatrists instead of primary care physicians, the prevalence of the CDC’s diagnostic group for respiratory syndrome might have been as low as 10 per 1000 visits instead of 128.3 per 1000 visits. Under such conditions, the sensitivity would have been much lower, and the specificity higher:
Simulation 2: Increased Population Prevalence of Respiratory Syndrome
Conversely, if the study population had been visits to pediatric emergency departments, the prevalence of respiratory syndrome could have easily been as high as 200 per 1000 instead of 128.3 per 1000 visits. Under those conditions, the sensitivity would have been higher, and the specificity would have been slightly lower:
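The dependence on p illustrated by the two simulations can be sketched in Python. The function below applies Bayes' theorem to express sensitivity and specificity in terms of PPV, NPV, and p. The PPV and NPV values (0.90 and 0.95) are hypothetical placeholders rather than the study's estimates, and holding them fixed while p varies is an assumption made only for illustration; the three prevalences are those mentioned in the text.

```python
def sens_spec(ppv, npv, p):
    """Sensitivity and specificity from PPV, NPV, and the population
    proportion p of group-positive records (Bayes' theorem)."""
    tp = ppv * p                  # proportion of true positives
    fp = (1 - ppv) * p            # proportion of false positives
    fn = (1 - npv) * (1 - p)      # proportion of false negatives
    tn = npv * (1 - p)            # proportion of true negatives
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical accuracy values, not taken from the study.
PPV, NPV = 0.90, 0.95

# Prevalences from the text: 10, 128.3, and 200 per 1000 visits.
results = {p: sens_spec(PPV, NPV, p) for p in (0.010, 0.1283, 0.200)}
for p, (sn, sp) in results.items():
    print(f"p = {p:.4f}: sensitivity = {sn:.3f}, specificity = {sp:.3f}")
```

With these placeholder values, sensitivity rises from about 0.15 at 10 per 1000 to about 0.82 at 200 per 1000, while specificity falls only slightly, which is the same qualitative pattern as in Simulations 1 and 2.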
DISCUSSION
In this paper, we described 2 common challenges in validating diagnostic groups measured from health care utilization data: (1) the diagnostic group’s PPV can be underestimated when the underlying stratified sampling strategy is ignored; and (2) the diagnostic group’s sensitivity can be overestimated and its specificity underestimated when “group-positive” records are oversampled relative to “group-negative” records to improve the cost-efficiency of data collection. Next, we proposed a 2-step approach for validating diagnostic groups based on health care utilization data: step 1, estimating the PPV and NPV of a diagnostic group composed of several diagnostic codes, each code with a different prevalence, and step 2, estimating the sensitivity and specificity of a diagnostic group with low population prevalence. We then illustrated our proposed methodological approaches by application to a validation study33 of the CDC’s diagnostic group for respiratory syndrome32 surveillance, and showed how using a stratified sampling strategy without the corresponding statistical adjustments can lead to the underestimation of the PPV and specificity, and the overestimation of the sensitivity.
As we have shown in this paper, failing to recognize and account for challenges intrinsic to the validation of diagnostic groups from health care utilization data can have a substantial impact on inferences drawn from these data. In the context of quality improvement, the overestimation of the sensitivity of a diagnostic group can lead to the underestimation of the frequency or magnitude of the “problem.” For example, if 10 cases of complications from diabetes are detected in a given physician practice using a diagnostic group thought to have a sensitivity of 0.95, one can reasonably expect that 10 is close to the “true” number of cases of diabetic complications in that practice, and, on that basis, one may choose not to invest in interventions to improve diabetes management. However, if the “true” sensitivity of that diagnostic group is 0.30, then the “true” number of diabetic complications in that practice substantially exceeds 10, and an intervention may be desirable (eg, the “true” number of diabetic complications may now be above the threshold at which implementing a given intervention is considered to be cost-effective). Similarly, population surveillance may be greatly affected by the underestimation of the PPV of a diagnostic group secondary to overlooking large differences in the prevalence of individual diagnostic codes within a diagnostic group; because investigating false-positive alerts is very costly, a surveillance system wrongly thought to have a low PPV may not be implemented at all, or worse, an alert it generates may not be acted upon because of the perception that the alert is very likely to be a false positive. In this way, biases in the estimation of the sensitivity, specificity, and PPV of diagnostic groups based on diagnostic codes in health care utilization data can lead to the inefficient and ineffective allocation of limited resources. However, as we have illustrated, it is possible to obtain estimates of validity measures that are free of verification bias by adapting recognized statistical techniques that have been developed in other areas where selection biases arise due to using a cost-effective, stratified sampling strategy.
Supplemental Digital Content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's Website, www.lww-medicalcare.com.