The Impact of Diagnostic Code Misclassification on Optimizing the Experimental Design of Genetic Association Studies.

Abstract

Diagnostic codes within electronic health record systems can vary widely in accuracy. It has been noted that the accuracy of disease phenotype classification increases monotonically with the number of instances of a particular diagnostic code. As a growing number of health system databases become linked with genomic data, it is critically important to understand the effect of this misclassification on the power of genetic association studies. Here, I investigate the impact of diagnostic code misclassification on the power of genetic association studies, with the aim of better informing experimental designs that use health informatics data. The trade-off between (i) reduced misclassification rates from requiring additional instances of a diagnostic code per individual and (ii) the resulting smaller sample size is explored, and general rules are presented to improve experimental designs.

Year: 2017 PMID: 29181145 PMCID: PMC5664372 DOI: 10.1155/2017/7653071

Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682

1. Introduction

A wealth of important clinical information is contained within large electronic health record (EHR) systems. Such information can be an invaluable resource for measuring disease prevalence [1] and disease comorbidity [2], the association between birth month and disease susceptibility [3], the prediction of outcomes [4], the measurement of the economic impact of health care [5], and the discovery of etiological factors [6]. A key feature of these data is the diagnostic codes assigned to patient records by medical professionals. However, the accuracy of inferring disease phenotypes from electronic diagnostic codes can vary widely across diseases and is often subject to high degrees of error [7–10]. These studies have noted substantial misclassification effects from the use of electronic diagnostic code data, sufficient to undermine experiments in which cases and controls are defined by International Classification of Diseases (ICD) codes alone. The ICD coding system is instituted by the World Health Organization and has been adopted in the United States by the National Center for Health Statistics. More sophisticated approaches to disease classification, such as those using a variety of EHR data and machine learning methods, are difficult to generalize across all diseases and to implement in a high-throughput manner. That said, I anticipate that machine learning methods that predict phenotypes using EHR variables as features will eventually supplant the sole use of ICD code data. Until that time, ICD data may still have utility in initial screens, to be subsequently validated through methods with higher positive and negative predictive values.

2. Related Work

In a general setting, the effect of phenotypic misclassification on the statistical power of genetic association studies has been previously explored [11–14]. Edwards and colleagues characterized the noncentrality parameter in asymptotic power distributions given the presence of phenotypic misclassification [11]. The authors use cost functions to capture the effect of misclassification and show that the cost of misclassifying a control as a case becomes large and the cost of misclassifying a case individual as a control becomes small as the disease prevalence becomes small. Similarly, Ji et al. investigated the calculation of a noncentrality parameter capturing phenotype errors for subsequent use in a likelihood ratio test for genetic association studies [12]. Later, Gordon and colleagues showed how to incorporate misclassification error rates into a trend test for genetic association in case/control studies [13]. More recently, Manchia and colleagues investigated the impact of heterogeneity within a clinical phenotype on genetic association [14]. For ICD data with misclassification, the type I and type II error rates for genomic association studies were recently explored thoroughly by Duan et al. [15]. The Duan et al. study found little inflation in false-positive rates but considerable false-negative rates under certain allele frequency, effect size, and disease prevalence parameters. In the context of initial screens of ICD codes in EHR systems, several studies have investigated the relationship between the number of instances of particular ICD codes and measures of diagnostic utility [1, 16–18]. In general, the accuracy of diagnoses improves with the number of instances of the code; however, this comes at the expense of smaller sample sizes and therefore more false negatives. Hence, there is a trade-off between type I and type II error rates as the number of ICD code instances used to define a disease changes.
In this work, I investigate this trade-off and provide a framework for determining highly powered EHR-based experimental designs using diseases defined by different numbers of instances of ICD codes.

3. Materials and Methods

For a large genetic association scan using ICD data, define a simple disease classification scheme such that cases are those individuals with at least x instances of a particular ICD code. Consider a design where individuals with ambiguous numbers of instances (i) of the code (i.e., 0 < i < x) are excluded from the analysis. Further consider a comparison of well-defined cases (i.e., those with at least x instances) against a large, fixed set of controls. With regard to the genetics, restrict the methods to biallelic markers with minor alleles segregating in the population at a frequency of at least 1% (single-nucleotide polymorphisms, SNPs). Define the alleles at a SNP contributing to the susceptibility of the disease as A1 and A2. Let the relative risk of the minor allele, A2, be R, such that R = P(A2 | cases) / P(A2 | controls). Let the frequency of A2 in the general population be q. Accordingly, 1 − q is the frequency of A1. Define n as the number of cases obtained from the definition of having at least x instances of the ICD code being evaluated. Set the number of controls as m, such that m ≫ n. Assume that the A2 frequency in controls is approximately q. Model the decrease in the misclassification proportion within cases as x increases with a monotonic function f(x), such that the expected number of truly positive cases is n[1 − f(x)]. The form of f(x) may vary considerably for different ICD codes. Lastly, let α be the statistical threshold for declaring a positive finding, that is, a finding is positive when the p value < α. The statistical test of genetic association considered is the binomial test of proportions, which evaluates the null hypothesis of no correlation between the frequency of A2 and the disease status. Statistical power is used to evaluate the impact of increasing x on the resulting experimental design.
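The classification scheme above can be illustrated with a short sketch. It is a minimal example under stated assumptions, not code from the study: the function name `classify_by_instances`, the dictionary of per-individual code counts, and the choice to treat individuals with zero instances as the control pool are all hypothetical.

```python
def classify_by_instances(code_counts, x):
    """Partition individuals by their number of instances of one ICD code.

    code_counts: dict mapping a (hypothetical) individual ID to the count i.
    x: minimum number of instances required to call an individual a case.
    """
    if x < 1:
        raise ValueError("x must be at least 1")
    cases, controls, excluded = [], [], []
    for person, i in code_counts.items():
        if i >= x:
            cases.append(person)      # well-defined case: at least x instances
        elif i == 0:
            controls.append(person)   # assumed control pool: code never recorded
        else:
            excluded.append(person)   # ambiguous: 0 < i < x
    return cases, controls, excluded


# Toy data, invented for illustration only.
counts = {"p1": 0, "p2": 1, "p3": 3, "p4": 7, "p5": 0}
cases, controls, excluded = classify_by_instances(counts, x=3)
print(len(cases), len(controls), len(excluded))  # prints "2 2 1"
```

Raising x moves individuals from the case list into the excluded list, which is the sample-size cost that the power calculation has to weigh against the lower misclassification rate.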
Under the model specified above, the power to detect association at an autosomal SNP, 1 − β, is calculated by the approximation in (1), where Φ is the standard Gaussian cumulative distribution function, z is the inverse standard Gaussian score, N = 4nm/(n + m), and q and s are the A2 frequencies in controls and cases, respectively. Using Bayes' theorem, the expected frequency of A2 within cases under the misclassification model is given by (2). To model the decrease in the misclassification rate with increasing numbers of ICD code instances, consider the simple decay function for f(x) in (3), where δ is a parameter that can be estimated for each ICD code. Similarly, consider the form in (4) for n as a function of x, which models the reduction in the number of cases as increasing numbers of instances of an ICD code are required. Here ε is the parameter that captures the rate of decline in case numbers as the definition of case status becomes more stringent, and it can also be estimated for each ICD code. The machinery is now in place for the calculation of statistical power to detect disease association at a genetic marker using data from linked ICD coding systems.
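To make the power calculation concrete, here is a hedged Python sketch. The functional forms are assumptions chosen to match the verbal description, not confirmed forms from the text: an exponential decay f(x) = exp(−δx), an exponential decline n(x) = n·exp(−εx), a case allele frequency that mixes true cases (frequency R·q) with misclassified ones (frequency q), and a standard two-sided normal approximation for a test of two proportions using N = 4nm/(n + m). All function names are hypothetical, and the genome-wide threshold α = 5 × 10⁻⁸ in the usage line is likewise an assumed value.

```python
from math import exp, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian: mean 0, sd 1


def misclassified_fraction(x, delta):
    """Assumed decay of the misclassification proportion, f(x) = exp(-delta*x)."""
    return exp(-delta * x)


def expected_cases(x, n, eps):
    """Assumed decline in the number of cases, n(x) = n * exp(-eps*x)."""
    return n * exp(-eps * x)


def case_allele_freq(x, q, R, delta):
    """Assumed mixture: true cases carry A2 at R*q, misclassified ones at q."""
    f = misclassified_fraction(x, delta)
    return (1.0 - f) * R * q + f * q


def power(x, n, m, R, q, delta, eps, alpha):
    """Two-sided normal-approximation power for a test of two proportions."""
    n_x = expected_cases(x, n, eps)
    N = 4.0 * n_x * m / (n_x + m)          # effective allele count, as in the text
    s = case_allele_freq(x, q, R, delta)   # A2 frequency in (noisy) cases
    ncp = abs(s - q) * sqrt(N) / sqrt(s * (1.0 - s) + q * (1.0 - q))
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(ncp - z) + STD_NORMAL.cdf(-ncp - z)


# Example call with x = 7 instances; alpha is an assumed genome-wide threshold.
p = power(7, n=400, m=10_000, R=2.0, q=0.20, delta=0.15, eps=0.15, alpha=5e-8)
print(f"approximate power: {p:.3f}")
```

A useful sanity check on any such implementation is that with R = 1 (no effect) the returned power equals α, since the case and control allele frequencies then coincide.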

4. Results and Discussion

The above model is used to explore the impact of ICD code definitions on power. To obtain the value of x that maximizes the power to detect genetic association, one can numerically solve for x the equation obtained by setting the derivative of the power with respect to x to zero, d(1 − β)/dx = 0 (5). The solution to (5) can be obtained through standard numerical methods. The integer closest to the value of x that solves this continuous equation can be used to optimize the power for a given set of parameters. To exemplify the use of this approach, let m = 10,000, n = 400, R = 2, q = 0.20, δ = 0.15, and ε = 0.15. Call this set of parameters the baseline model. For the baseline model, x = 7.2265 solves (5). Therefore, using seven instances of an ICD code yields the optimal design, weighing the trade-off between the case sample size and the misclassification. Figure 1 shows the power curve for the baseline model.

Figure 1

Statistical power versus ICD code instances, baseline model. From the mathematical model specified, power was calculated using the set of parameters from the baseline model. The results show the trade-off between the sample size, misclassification rates, and statistical power to detect genetic association. For the baseline model, the peak of power occurs when the number of instances is 7.
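The numerical search for the power-maximizing number of instances can be sketched as follows. This is an illustration under assumed functional forms (f(x) = exp(−δx), n(x) = n·exp(−εx), and a mixture model for the case allele frequency), so it should not be expected to reproduce the reported 7.2265 exactly. Rather than differentiating the power itself, the sketch finds the stationary point of the noncentrality term, which has the same maximizer under the assumed normal approximation and does not require choosing α. The function names are hypothetical.

```python
from math import exp, sqrt


def noncentrality(x, n, m, R, q, delta, eps):
    """Assumed noncentrality of the two-proportion test at x instances."""
    f = exp(-delta * x)                    # assumed misclassification decay
    n_x = n * exp(-eps * x)                # assumed decline in case count
    N = 4.0 * n_x * m / (n_x + m)
    s = (1.0 - f) * R * q + f * q          # assumed case allele frequency
    return abs(s - q) * sqrt(N) / sqrt(s * (1.0 - s) + q * (1.0 - q))


def slope(x, h=1e-5, **params):
    """Central-difference derivative of the noncentrality with respect to x."""
    return (noncentrality(x + h, **params) - noncentrality(x - h, **params)) / (2.0 * h)


def optimal_instances(lo=1.0, hi=30.0, tol=1e-6, **params):
    """Bisection on the slope: find x in [lo, hi] where the derivative is zero."""
    if slope(lo, **params) <= 0 or slope(hi, **params) >= 0:
        raise ValueError("slope does not change sign on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slope(mid, **params) > 0:
            lo = mid                       # still climbing: optimum is to the right
        else:
            hi = mid                       # already descending: optimum is to the left
    return 0.5 * (lo + hi)


baseline = dict(n=400, m=10_000, R=2.0, q=0.20, delta=0.15, eps=0.15)
x_star = optimal_instances(**baseline)
print(f"continuous optimum: {x_star:.3f}, nearest integer: {round(x_star)}")
```

With the baseline values the continuous optimum under these assumptions falls a little above 7, so the nearest integer agrees with the seven instances reported for Figure 1.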

To investigate how the power curves depend on the baseline number of cases (n), the calculations were repeated with n varying from 100 to 800. Figure 2 shows the results. Visual inspection places the peak of power at approximately 7 instances.

Figure 2

Power versus ICD code instances, effect of varying n. The baseline model was used to generate this figure with the exception of n, which varied from 100 to 800, and the resulting power was calculated for each number of ICD instances.
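A back-of-the-envelope calculation suggests why the location of the peak is insensitive to n. It rests on assumptions that go beyond the text: exponential forms f(x) = e^{−δx} and n(x) = n e^{−εx}, a case allele frequency whose excess over q is proportional to 1 − f(x), m ≫ n so that N ≈ 4n(x), and a variance term treated as constant in x. Writing λ(x) for the resulting noncentrality term:

```latex
\begin{align*}
s(x) - q &\propto 1 - e^{-\delta x}, \qquad N(x) \approx 4\,n\,e^{-\varepsilon x},\\
\lambda(x) &\propto \left(1 - e^{-\delta x}\right)\sqrt{n}\;e^{-\varepsilon x/2},\\
\frac{d}{dx}\ln\lambda(x) &= \frac{\delta\,e^{-\delta x}}{1 - e^{-\delta x}} - \frac{\varepsilon}{2} = 0
\;\Longrightarrow\;
x^{*} \approx \frac{1}{\delta}\,\ln\!\left(1 + \frac{2\delta}{\varepsilon}\right).
\end{align*}
```

Under these simplifications n enters λ(x) only as a multiplicative constant and drops out of x*. For δ = ε = 0.15 the approximation gives x* ≈ ln(3)/0.15 ≈ 7.3, in line with the peak near 7 instances seen across the values of n.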

Next, to determine the effect of the δ and ε parameters on the power curves, each was varied in turn while the other parameters were held at their baseline values. Figures 3 and 4 display these results.

Figure 3

Power versus ICD code instances, effect of varying epsilon. The epsilon parameter in the baseline model was varied from 0.01 to 0.30, and the power to detect genetic association was calculated for each value.

Figure 4

Power versus ICD code instances, effect of varying delta. To explore the effect of the delta parameter on the power calculations, the baseline model was modified to include values of delta from 0.01 to 0.30. The power to detect genetic association was calculated across these delta parameter values.
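A small parameter sweep shows how such a sensitivity analysis might be set up. As before, this is a sketch under assumed functional forms (exponential decay for misclassification and for case numbers, and a mixture model for the case allele frequency), with hypothetical function names. It reports the integer number of instances that maximizes the assumed noncentrality term, which has the same maximizer as the approximate power.

```python
from math import exp, sqrt


def noncentrality(x, n, m, R, q, delta, eps):
    """Assumed noncentrality of the two-proportion test at x instances."""
    f = exp(-delta * x)                    # assumed misclassification decay
    n_x = n * exp(-eps * x)                # assumed decline in case count
    N = 4.0 * n_x * m / (n_x + m)
    s = (1.0 - f) * R * q + f * q          # assumed case allele frequency
    return abs(s - q) * sqrt(N) / sqrt(s * (1.0 - s) + q * (1.0 - q))


def best_integer_x(max_x=60, **params):
    """Integer x in 1..max_x with the largest assumed noncentrality."""
    return max(range(1, max_x + 1), key=lambda x: noncentrality(x, **params))


baseline = dict(n=400, m=10_000, R=2.0, q=0.20, delta=0.15, eps=0.15)

# Vary one parameter at a time, holding the others at baseline.
for name in ("delta", "eps"):
    for value in (0.01, 0.05, 0.15, 0.30):
        params = {**baseline, name: value}
        print(f"{name} = {value:.2f}: best x = {best_integer_x(**params)}")
```

In this sketch a faster decline in case numbers (larger ε) pushes the optimum toward fewer instances, and a faster decay of misclassification (larger δ) does the same, because accuracy saturates sooner.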

5. Conclusions

Genetic data linked to longitudinal electronic health records can serve as a very useful tool in modern disease genetics. However, misclassification present in ICD coding systems can severely hamper large-scale screens using those codes for the purpose of genetic association studies. This work has described a simple approach to better understand the impact of misclassification present in EHR systems for the purpose of optimizing experimental designs that screen numerous ICD codes in genetic association studies. Under the mathematical models considered, the methods offer an approach to select the number of instances of an ICD code for the purpose of defining cases and obtaining an optimal experimental design for the identification of genetic markers. Additional work is needed in this area to improve disease classification schemes for genetic association studies as well as for other investigations.

18 in total

1. Accuracy of mild traumatic brain injury case ascertainment using ICD-9 codes.

Authors: Jeffrey J Bazarian; Peter Veazie; Sohug Mookerjee; E Brooke Lerner
Journal: Acad Emerg Med Date: 2005-12-19 Impact factor: 3.451

2. Computing asymptotic power and sample size for case-control genetic association studies in the presence of phenotype and/or genotype misclassification errors.

Authors: Fei Ji; Yaning Yang; Chad Haynes; Stephen J Finch; Derek Gordon
Journal: Stat Appl Genet Mol Biol Date: 2006-01-04

3. Birth month affects lifetime disease risk: a phenome-wide method.

Authors: Mary Regina Boland; Zachary Shahn; David Madigan; George Hripcsak; Nicholas P Tatonetti
Journal: J Am Med Inform Assoc Date: 2015-06-02 Impact factor: 4.497

4. Contrasting Association Results between Existing PheWAS Phenotype Definition Methods and Five Validated Electronic Phenotypes.

Authors: Joseph B Leader; Sarah A Pendergrass; Anurag Verma; David J Carey; Dustin N Hartzel; Marylyn D Ritchie; H Lester Kirchner
Journal: AMIA Annu Symp Proc Date: 2015-11-05

5. Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors.

Authors: Elena Birman-Deych; Amy D Waterman; Yan Yan; David S Nilasena; Martha J Radford; Brian F Gage
Journal: Med Care Date: 2005-05 Impact factor: 2.983

6. Healthy People 2010 disease prevalence in the Marshfield Clinic Personalized Medicine Research Project cohort: opportunities for public health genomic research.

Authors: C A McCarty; B N Mukesh; P F Giampietro; R A Wilke
Journal: Per Med Date: 2007-05 Impact factor: 2.512

7. Accuracy of Veterans Administration databases for a diagnosis of rheumatoid arthritis.

Authors: Jasvinder A Singh; Aaron R Holmgren; Siamak Noorbaloochi
Journal: Arthritis Rheum Date: 2004-12-15

8. Linear trend tests for case-control genetic association that incorporate random phenotype and genotype misclassification error.

Authors: Derek Gordon; Chad Haynes; Yaning Yang; Patricia L Kramer; Stephen J Finch
Journal: Genet Epidemiol Date: 2007-12 Impact factor: 2.135