Jacob J Hughey, Seth D Rhoades, Darwin Y Fu, Lisa Bastarache, Joshua C Denny, Qingxia Chen.
Abstract
BACKGROUND: The growth of DNA biobanks linked to data from electronic health records (EHRs) has enabled the discovery of numerous associations between genomic variants and clinical phenotypes. Nonetheless, although clinical data are generally longitudinal, standard approaches for detecting genotype-phenotype associations in such linked data, notably logistic regression, do not naturally account for variation in the period of follow-up or the time at which an event occurs. Here we explored the advantages of quantifying associations using Cox proportional hazards regression, which can account for the age at which a patient first visited the healthcare system (left truncation) and the age at which a patient either last visited the healthcare system or acquired a particular phenotype (right censoring).
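To make the roles of left truncation and right censoring concrete, the following is a minimal sketch of a Cox partial log-likelihood for a single covariate (allele count). It is an illustration under stated assumptions, not the authors' implementation: the record layout `(entry_age, exit_age, event, allele_count)` and the function name are hypothetical, and ties are handled in the simplest (Breslow-style) way. A person contributes to the risk set at a given age only if they had already entered the healthcare system (left truncation) and had not yet been diagnosed or lost to follow-up (right censoring).

```python
import math

def cox_partial_loglik(beta, records):
    """Cox partial log-likelihood with delayed entry (hypothetical layout).

    records: list of (entry_age, exit_age, event, allele_count), where
    entry_age is the age at first visit (left truncation), exit_age is the
    age at diagnosis or last visit, and event is True if diagnosed.
    """
    loglik = 0.0
    for entry_i, exit_i, event_i, x_i in records:
        if not event_i:
            continue  # right-censored persons add no event term
        # Risk set: persons already under observation and still at risk
        # at the age when person i was diagnosed.
        risk = [x for entry, exit_, _, x in records if entry < exit_i <= exit_]
        loglik += beta * x_i - math.log(sum(math.exp(beta * x) for x in risk))
    return loglik
```

With `beta = 0`, every person in a risk set is weighted equally, so the log-likelihood reduces to minus the sum of the log risk-set sizes over the observed events, which is a convenient sanity check.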
Keywords: Cox regression; Electronic health record; GWAS; Time-to-event modeling
Year: 2019 PMID: 31684865 PMCID: PMC6829851 DOI: 10.1186/s12864-019-6192-1
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Comparing logistic regression and Cox regression on data simulated from either a logistic model or a Cox model (1000 simulations each). Each simulation included 100 risk alleles and 799,900 alleles not associated with the phenotype. True positive rate was calculated as the fraction of risk alleles having Bonferroni-adjusted p-value less than the given cutoff. (a) Boxplots of true positive rate for logistic regression, Cox regression, and the sequential strategy, across simulations from each simulation model. The sequential strategy used the p-value from Cox regression if the unadjusted p-value from logistic regression was ≤ 10⁻⁴. For ease of visualization, outliers are not shown. (b) 95% confidence intervals of the difference between the true positive rates of Cox and logistic regression
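The true positive rate and the sequential strategy described in this caption can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, and the fallback to the logistic p-value when the screening threshold is not met is an assumption, since the caption only states what happens when the threshold is met.

```python
def sequential_p(p_logistic, p_cox, screen=1e-4):
    """Return the Cox p-value if the unadjusted logistic p-value passes the
    screen; otherwise keep the logistic p-value (assumed fallback)."""
    return p_cox if p_logistic <= screen else p_logistic

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)

def true_positive_rate(risk_allele_p_values, n_tests, cutoff):
    """Fraction of risk alleles whose Bonferroni-adjusted p-value is
    below the cutoff."""
    hits = sum(1 for p in risk_allele_p_values if bonferroni(p, n_tests) < cutoff)
    return hits / len(risk_allele_p_values)
```

In the simulations described above, `n_tests` would be 800,000 (100 risk alleles plus 799,900 null alleles), and `risk_allele_p_values` would hold the p-values of the 100 risk alleles from one simulation.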
Fig. 2 Manhattan plots of GWAS results using Cox and logistic regression for four phenotypes (phecode in parentheses). For each phenotype, only associations having mean(−log10(P)) ≥ 2 are shown. Dark green lines correspond to P = 5·10⁻⁸ and light green lines correspond to P = 10⁻⁵
Fig. 3 Comparing Cox regression and logistic regression for the ability to detect known genotype-phenotype associations for the 50 phenotypes analyzed. Known significant associations (P ≤ 5·10⁻⁸) were curated from the NHGRI-EBI GWAS Catalog and aggregated by LD for each phenotype. (a) Sensitivity of each method, i.e., fraction of known and tested associations that gave a p-value less than or equal to the specified cutoff. The sequential strategy used the p-value from Cox regression if the unadjusted p-value from logistic regression was ≤ 10⁻⁴. The sequential line overlaps the Cox line. (b) Relative change in sensitivity between logistic and Cox regression, i.e., difference between the sensitivities for Cox and logistic, divided by the sensitivity for logistic. The gray line corresponds to the raw value at each cutoff, while the black line corresponds to the smoothed value according to a penalized cubic regression spline in a generalized additive model
Fig. 4 Kaplan-Meier curves for three phenotype-SNP pairs, showing the fraction of at-risk persons still undiagnosed as a function of age and allele count. For each phenotype, the corresponding phecode is in parentheses. As in the GWAS, diagnosis was defined as the second date on which a person received the given phecode. The curves do not account for sex or principal components of genetic ancestry, and thus are not exactly equivalent to the Cox regression used for the GWAS
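A Kaplan-Meier estimate of the kind plotted here can be sketched in a few lines. The version below is a minimal illustration, not the authors' code: the record layout `(entry_age, exit_age, event)` and the function name are hypothetical, and allowing delayed entry in the at-risk count is an assumption made for consistency with the left truncation described in the abstract. One curve per allele count would be obtained by calling the function on each genotype group separately.

```python
def kaplan_meier(records):
    """Kaplan-Meier estimate of the fraction still undiagnosed, by age.

    records: list of (entry_age, exit_age, event) with entry_age < exit_age;
    event is True if exit_age is the age at diagnosis, False if it is the
    age at last visit (right censoring).
    Returns a list of (age, fraction_undiagnosed) at each diagnosis age.
    """
    event_ages = sorted({exit_ for _, exit_, event in records if event})
    surv = 1.0
    curve = []
    for age in event_ages:
        # Persons who have entered observation and are still at risk at this age.
        at_risk = sum(1 for entry, exit_, _ in records if entry < age <= exit_)
        diagnosed = sum(1 for _, exit_, event in records if event and exit_ == age)
        surv *= 1.0 - diagnosed / at_risk
        curve.append((age, surv))
    return curve
```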