Steven J Schrodi (1, 2)
1. Center for Human Genetics, Marshfield Clinic Research Institute, Marshfield, WI, USA.
2. Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, WI, USA.
Abstract
Diagnostic codes within electronic health record systems can vary widely in accuracy. It has been noted that the accuracy of disease phenotype classification increases monotonically with the number of instances of a particular diagnostic code. As a growing number of health system databases become linked with genomic data, it is critically important to understand the effect of this misclassification on the power of genetic association studies. Here, I investigate the impact of diagnostic code misclassification on the power of genetic association studies, with the aim of better informing experimental designs that use health informatics data. The trade-off between (i) reduced misclassification rates from utilizing additional instances of a diagnostic code per individual and (ii) the resulting smaller sample size is explored, and general rules are presented to improve experimental designs.
A wealth of important clinical information is contained within large electronic health record (EHR) systems. Such information can be an invaluable resource for measuring disease prevalence [1] and disease comorbidity [2], the association between birth month and disease susceptibility [3], the prediction of outcomes [4], the measurement of economic impact of health care [5], and the discovery of etiological factors [6]. A key feature of these data is the set of diagnostic codes assigned by medical professionals to patient records. However, the accuracy of inferring disease phenotypes from electronic diagnostic codes can vary widely across diseases and is often subject to high degrees of error [7-10]. These studies have noted substantial misclassification effects from the use of electronic diagnostic code data, sufficient to undermine experiments utilizing cases and controls defined by the International Classification of Diseases (ICD) codes alone. The ICD coding system is instituted by the World Health Organization and has been adopted in the United States by the National Center for Health Statistics. More sophisticated approaches to disease classification, such as those using a variety of EHR data and machine learning methods, are difficult to generalize across all diseases and to implement in a high-throughput manner. That said, I anticipate that machine learning methods that predict phenotypes using EHR variables as features will eventually supplant the sole use of ICD code data. Until that time, ICD data may still have utility in initial screens, to be subsequently validated through methods with higher positive and negative predictive values.
2. Related Work
In a general setting, the effect of phenotypic misclassification on the statistical power of genetic association studies has been previously explored [11-14]. Edwards and colleagues characterized the noncentrality parameter in asymptotic power distributions given the presence of phenotypic misclassification [11]. The authors use cost functions to capture the effect of misclassification and show that, as the disease prevalence becomes small, the cost of misclassifying a control as a case becomes large and the cost of misclassifying a case as a control becomes small. Similarly, Ji et al. investigated the calculation of a noncentrality parameter capturing phenotype errors for subsequent use in a likelihood ratio test for genetic association studies [12]. Later, Gordon and colleagues showed how to incorporate misclassification error rates into a trend test for genetic association in case/control studies [13]. More recently, Manchia and colleagues investigated the impact of heterogeneity within a clinical phenotype on genetic association [14].
Considering ICD data with misclassification, the type I and type II error rates for genomic association studies were recently explored thoroughly by Duan et al. [15]. The Duan et al. study found little inflation in false-positive rates but considerable false-negative rates under certain allele frequency, effect size, and disease prevalence parameters. In the context of initial screens of ICD codes in EHR systems, several studies have investigated the relationship between the number of instances of particular ICD codes and measures of diagnostic utility [1, 16–18]. In general, the accuracy of diagnoses improves with the number of instances of the code; however, this comes at the expense of smaller sample sizes and increasing false negatives. Hence, there is a trade-off between type I and type II error rates as the number of ICD code instances used to define a disease changes.
In this work, I investigate this trade-off and provide a framework for determining highly powered EHR-based experimental designs using diseases defined by different numbers of instances of ICD codes.
3. Materials and Methods
For a large genetic association scan using ICD data, define a simple disease classification scheme such that cases are those individuals with at least x instances of a particular ICD code. Consider a design where individuals with ambiguous numbers of instances (i) of the code (i.e., 0 < i < x) are excluded from the analysis. Further consider a comparison of well-defined cases (i.e., those with at least x instances) against a large, fixed set of controls. With regard to the genetics, restrict the methods to biallelic markers with minor alleles segregating in the population at a frequency of at least 1%, that is, single-nucleotide polymorphisms (SNPs). Define the alleles at a SNP contributing to the susceptibility of the disease as A1 and A2. Let the relative risk of the minor allele, A2, be R, such that R = P(A2 | cases)/P(A2 | controls). Let the frequency of A2 in the general population be q. Accordingly, 1 − q is the frequency of A1. Define n as the number of cases obtained from the definition of having at least x instances of the ICD code being evaluated. Set the number of controls as m, such that m ≫ n. Assume that the A2 frequency in controls is approximately q. Model the decrease in the misclassification proportion within cases as x increases with a monotonic function f(x), such that the expected number of truly positive cases is n[1 − f(x)]. The form of f(x) may vary considerably for different ICD codes. Lastly, let α be the statistical threshold for determining a positive finding, that is, a finding is declared positive when the p value < α. The statistical test of genetic association considered is the binomial test of proportions, which evaluates the null hypothesis of no correlation between the frequency of A2 and the disease status.
Statistical power will be used to evaluate the impact of increasing x and the resulting experimental design. Under the model specified above, the power to detect association at an autosomal SNP, 1 − β, is calculated by the approximation as follows:
where Φ is the standard Gaussian cumulative distribution function, z is the inverse standard Gaussian score, N = 4nm/(n + m), and q and s are the A2 frequencies in controls and cases, respectively. Using Bayes' theorem, the expected frequency of A2 within cases under the misclassification model is given by
To model the decrease in the misclassification rate with increasing numbers of ICD code instances, consider the simple decay function for f(x):
where δ is a parameter that can be estimated for each ICD code. Similarly, consider the following form for n as a function of x to model the reduction in the number of cases defined by using increasing numbers of instances of an ICD code:
where ε is the parameter that captures the rate of decline in case numbers as the definition for case status becomes more stringent with the use of larger numbers of ICD code instances and can also be estimated for each ICD code. The machinery is now in place for the calculation of statistical power to detect disease association at a genetic marker using data from linked ICD coding systems.
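As a rough illustration of how these pieces fit together, the following Python sketch computes power as a function of x. The functional forms are assumptions chosen to be consistent with the definitions above, not necessarily the exact equations used in this work: f(x) = exp(−δx), n(x) = n·exp(−εx), a case allele frequency formed as a mixture of Rq (true cases) and q (misclassified individuals), and a pooled two-proportion normal approximation on allele counts. The threshold α = 5 × 10⁻⁸ and all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian: cdf is Phi, inv_cdf gives z scores


def misclassification(x, delta):
    """ASSUMED decay form f(x) = exp(-delta * x)."""
    return math.exp(-delta * x)


def case_count(x, n0, eps):
    """ASSUMED decline in case numbers n(x) = n0 * exp(-eps * x)."""
    return n0 * math.exp(-eps * x)


def case_allele_freq(x, q, R, delta):
    """ASSUMED mixture: true cases carry A2 at R*q, misclassified ones at q."""
    f = misclassification(x, delta)
    return (1.0 - f) * R * q + f * q


def power(x, n0=400, m=10_000, R=2.0, q=0.20, delta=0.15, eps=0.15, alpha=5e-8):
    """Approximate two-sided power of a test of two proportions on allele counts.

    alpha is a hypothetical genome-wide threshold; the text does not fix one.
    """
    n = case_count(x, n0, eps)
    s = case_allele_freq(x, q, R, delta)
    N = 4.0 * n * m / (n + m)              # effective sample size, as in the text
    p_bar = (n * s + m * q) / (n + m)      # pooled A2 frequency
    ncp = abs(s - q) * math.sqrt(N) / math.sqrt(2.0 * p_bar * (1.0 - p_bar))
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(ncp - z_crit)    # ignores the negligible opposite tail


if __name__ == "__main__":
    for x in range(1, 13):
        print(f"x = {x:2d}  cases = {case_count(x, 400, 0.15):6.1f}  power = {power(x):.4f}")
```

With x = 0 every case is treated as misclassified under the assumed f(x), so s = q and the computed power collapses to roughly α/2, which is a convenient sanity check on the implementation.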
4. Results and Discussion
The above model is used to conduct an exploration of the impact of ICD code definitions on power. To obtain a value of x which maximizes power to detect genetic association, one can numerically solve the following differential equation for x:
The solution to (5) can be obtained through standard numerical methods applied to solving
where
The closest integer value to the value of x that solves this continuous equation can be used to optimize the power for a given set of parameters. To exemplify the use of this approach, let m = 10,000, n = 400, R = 2, q = 0.20, δ = 0.15, and ε = 0.15. Call this set of parameters the baseline model. The value x = 7.2265 solves the differential equation. Therefore, using seven instances of an ICD code will yield the optimal design, weighing the trade-off between the case sample size and the misclassification. Figure 1 shows the power curve for this set of parameters.
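The numerical step can be sketched in Python as a direct maximization of power over x, which is equivalent to locating the zero of its derivative when the curve has a single peak. This is a minimal sketch under stated assumptions, not the author's implementation: it assumes f(x) = exp(−δx), n(x) = n·exp(−εx), a case allele frequency of (1 − f)Rq + fq, a pooled normal approximation for the test of proportions, and a hypothetical α = 5 × 10⁻⁸. Because these forms are assumed, the continuous optimum it returns is not expected to reproduce the reported value of x exactly.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(x, n0=400, m=10_000, R=2.0, q=0.20, delta=0.15, eps=0.15, alpha=5e-8):
    """Approximate power at x code instances (all functional forms ASSUMED)."""
    f = math.exp(-delta * x)                    # assumed misclassification decay
    n = n0 * math.exp(-eps * x)                 # assumed decline in case count
    s = (1.0 - f) * R * q + f * q               # assumed A2 frequency in cases
    N = 4.0 * n * m / (n + m)
    p_bar = (n * s + m * q) / (n + m)
    ncp = abs(s - q) * math.sqrt(N) / math.sqrt(2.0 * p_bar * (1.0 - p_bar))
    return STD_NORMAL.cdf(ncp - STD_NORMAL.inv_cdf(1.0 - alpha / 2.0))


def golden_section_max(func, lo, hi, tol=1e-4):
    """Locate the maximizer of a single-peaked function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) > func(d):
            b = d          # the peak lies in [a, d]
        else:
            a = c          # the peak lies in [c, b]
    return (a + b) / 2.0


def best_integer_x(x_max=30, **params):
    """Integer number of instances with the highest power, searched directly."""
    return max(range(1, x_max + 1), key=lambda x: power(x, **params))


if __name__ == "__main__":
    x_star = golden_section_max(power, 0.0, 30.0)
    print(f"continuous optimum: x = {x_star:.3f}")
    print(f"best integer design: x = {best_integer_x()}")
```

Searching the integers directly, rather than rounding the continuous solution, avoids any ambiguity when the power curve is asymmetric around its peak.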
Figure 1
Statistical power versus ICD code instances, baseline model. From the mathematical model specified, power was calculated using the set of parameters from the baseline model. The results show the trade-off between the sample size, misclassification rates, and statistical power to detect genetic association. For the baseline model, the peak of power occurs when the number of instances is 7.
To investigate how the power curves depend on the baseline number of cases (n), the calculations were repeated with n varied from 100 to 800. Figure 2 shows the results. Visual inspection shows the peak of power at approximately 7 instances.
Figure 2
Power versus ICD code instances, effect of varying n. The baseline model was used to generate this figure with the exception of n, which varied from 100 to 800, and the resulting power was calculated for each number of ICD instances.
Next, to determine the role of the δ and ε parameters on the power curves, the calculations were performed fixing the other parameters. Figures 3 and 4 display these results.
Figure 3
Power versus ICD code instances, effect of varying epsilon. The epsilon parameter varied in the baseline model from 0.01 to 0.30, and the power to detect genetic association was subsequently calculated.
Figure 4
Power versus ICD code instances, effect of varying delta. To explore the effect of the delta parameter on the power calculations, the baseline model was modified to include values of delta from 0.01 to 0.30. The power to detect genetic association was calculated across these delta parameter values.
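A parameter sweep of this kind can be sketched in Python as follows. The sketch varies δ and ε one at a time around the baseline model and reports the integer number of instances with the highest power. As before, the functional forms are assumptions made for illustration (f(x) = exp(−δx), n(x) = n·exp(−εx), case allele frequency (1 − f)Rq + fq, pooled normal approximation, hypothetical α = 5 × 10⁻⁸), so the printed values are not a reproduction of Figures 3 and 4.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(x, n0=400, m=10_000, R=2.0, q=0.20, delta=0.15, eps=0.15, alpha=5e-8):
    """Approximate power at x code instances (all functional forms ASSUMED)."""
    f = math.exp(-delta * x)                    # assumed misclassification decay
    n = n0 * math.exp(-eps * x)                 # assumed decline in case count
    s = (1.0 - f) * R * q + f * q               # assumed A2 frequency in cases
    N = 4.0 * n * m / (n + m)
    p_bar = (n * s + m * q) / (n + m)
    ncp = abs(s - q) * math.sqrt(N) / math.sqrt(2.0 * p_bar * (1.0 - p_bar))
    return STD_NORMAL.cdf(ncp - STD_NORMAL.inv_cdf(1.0 - alpha / 2.0))


def best_integer_x(delta=0.15, eps=0.15, x_max=40):
    """Integer x in 1..x_max with the highest power for the given delta, eps."""
    return max(range(1, x_max + 1), key=lambda x: power(x, delta=delta, eps=eps))


if __name__ == "__main__":
    grid = (0.01, 0.05, 0.15, 0.30)   # spans the 0.01 to 0.30 range of the figures
    for eps in grid:
        x = best_integer_x(eps=eps)
        print(f"eps   = {eps:.2f}: best x = {x:2d}, power = {power(x, eps=eps):.4f}")
    for delta in grid:
        x = best_integer_x(delta=delta)
        print(f"delta = {delta:.2f}: best x = {x:2d}, power = {power(x, delta=delta):.4f}")
```

Under these assumed forms, a slow decline in case numbers (small ε) favors demanding more code instances, whereas a fast decline favors fewer; estimating δ and ε from validation data for each ICD code would be needed before using such a sweep in a real design.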
5. Conclusions
Genetic data linked to longitudinal electronic health records can serve as a very useful tool in modern disease genetics. However, misclassification present in ICD coding systems can severely hamper large-scale screens using those codes for the purpose of genetic association studies. This work has described a simple approach to better understand the impact of misclassification present in EHR systems for the purpose of optimizing experimental designs that screen numerous ICD codes in genetic association studies. Under the mathematical models considered, the methods offer an approach to select the number of instances of an ICD code for the purpose of defining cases and obtaining an optimal experimental design for the identification of genetic markers. Additional work is needed in this area to improve disease classification schemes for genetic association studies as well as for other investigations.