
LabWAS: Novel findings and study design recommendations from a meta-analysis of clinical labs in two independent biobanks.

Jeffery A Goldstein, Joshua S Weinstock, Lisa A Bastarache, Daniel B Larach, Lars G Fritsche, Ellen M Schmidt, Chad M Brummett, Sachin Kheterpal, Goncalo R Abecasis, Joshua C Denny, Matthew Zawistowski.

Abstract

Phenotypes extracted from Electronic Health Records (EHRs) are increasingly prevalent in genetic studies. EHRs contain hundreds of distinct clinical laboratory test results, providing a trove of health data beyond diagnoses. Such lab data is complex and lacks a ubiquitous coding scheme, making it more challenging to analyze than diagnosis data. Here we describe the first large-scale cross-health system genome-wide association study (GWAS) of EHR-based quantitative laboratory-derived phenotypes. We meta-analyzed 70 lab traits matched between the BioVU cohort from the Vanderbilt University Health System and the Michigan Genomics Initiative (MGI) cohort from Michigan Medicine. We show high replication of known associations for these traits, validating EHR-based measurements as high-quality phenotypes for genetic analysis. Notably, our analysis provides the first replication for 699 previous GWAS associations across 46 different traits. We discovered 31 novel associations at genome-wide significance for 22 distinct traits, including the first reported associations for two lab-based traits. We replicated 22 of these novel associations in an independent tranche of BioVU samples. The summary statistics for all association tests are freely available to benefit other researchers. Finally, we performed mirrored analyses in BioVU and MGI to assess competing analytic practices for EHR lab traits. We find that using the mean of all available lab measurements provides a robust summary value, but alternate summarizations can improve power in certain circumstances. This study provides a proof-of-principle for cross-health system GWAS and is a framework for future studies of quantitative EHR lab traits.


Year:  2020        PMID: 33175840      PMCID: PMC7682892          DOI: 10.1371/journal.pgen.1009077

Source DB:  PubMed          Journal:  PLoS Genet        ISSN: 1553-7390            Impact factor:   5.917


Introduction

Laboratory testing is a key component of modern medicine. Laboratory measurements provide a glimpse into the functioning of the human body, allowing clinicians to diagnose and monitor disease. In most health systems, lab measurements are routinely captured in patient Electronic Health Records (EHRs) alongside disease diagnoses, free text notes and medical procedures to provide a detailed, longitudinal health history [1]. EHRs present exciting research potential by providing broad phenotyping on large cohorts with minimal cost [2,3]. Several large-scale genetic studies have already leveraged biobanks linked to EHRs, such as the UK Biobank [4], Japan Biobank [5], FinnGen [6] and HUNT [7], as sources of phenotypes for Genome-wide Association Studies (GWAS) [4-7]. The phenotypes are typically based on International Classification of Diseases (ICD) codes mapped to dichotomous traits [8]. Although disease is often thought of in all-or-nothing binary states, many diseases exist on a continuum with the ultimate clinical diagnosis occurring once a relevant quantitative laboratory measurement exceeds a pre-determined threshold. For example, hypercholesterolemia, diabetes mellitus and chronic kidney disease are each diagnosed almost entirely on measurements of low density lipoprotein (LDL), glycated hemoglobin (or glucose) and creatinine, respectively. Laboratory measurements can therefore be a more sensitive measure of underlying health than diagnosis and may provide a more powerful outcome for analysis. As an example, the hypercholesterolemia and coronary artery disease risk locus PCSK9 was initially discovered based on quantitative LDL measurement rather than clinical diagnosis [9,10]. In contrast to binary disease phenotypes, there are fewer examples of genetic analyses of EHR-derived quantitative lab values [11-13]. 
Hereafter, we use the term lab traits to refer to quantitative biomarkers assayed through clinical laboratory testing (e.g., “creatinine”, “LDL cholesterol”), and the term lab measurements to refer to realized values of these tests stored in patient EHRs. The rich source of quantitative lab measurements in EHR cohorts comes with unique concerns. Quantitative traits collected specifically for research purposes typically use a controlled experimental design to ensure consistency among samples. In contrast, lab measurements contained in EHRs are a historical record of medical care. As such, patients may have hundreds of lab measurements for some traits and none for others, depending on their specific health history and utilization of the health system. The measurements can be collected in times of sickness or good health leading to substantial variation in measurements for the same lab. Lab measurements can also be artificially modified by prescription medication, such as statin use for lowering LDL cholesterol. Moreover, recruitment mechanisms and health system demographics can dramatically shape the overall health of the biobank cohort, which in turn dictates lab measurements available for analysis. The broad impact of using such “real world” measurements for genetic association studies is unclear. Questions remain over the effect and robustness of analytic choices made when analyzing EHR-based lab traits including how best to summarize complicated, longitudinal lab measurements and whether comorbid diseases highly correlated with lab measurements must be considered. Prior studies are not consistent in addressing these concerns. For example, GWAS of EHR-derived quantitative traits in Biobank Japan enrolled patients with at least 1 of 47 diagnoses and controlled for all 47 diagnoses while testing each lab [11]. In contrast, an analysis of labs within the Geisinger EHR did not control for underlying disease states [14]. 
The variety of methods to summarize lab measurements and models to test for genetic association indicates that the question of how to analyze these data remains unsettled. In this paper we explore strategies for analyzing quantitative lab measurements extracted from EHRs and describe the first large-scale meta-analysis of EHR-derived lab traits across independent health systems. We used lab measurements and genetic data from two academic health systems: the BioVU cohort from Vanderbilt University [15] and the Michigan Genomics Initiative (MGI) from Michigan Medicine [16]. Meta-analysis offers a mechanism to increase sample size and power for detecting genetic risk variants but comes with distinct challenges for EHR lab traits, particularly matching lab traits between health systems and determining specific analysis protocols. The cohorts differ dramatically in their recruitment mechanisms, patient composition and recording format for lab measurements: MGI was predominantly recruited through inpatient surgical encounters at Michigan Medicine whereas BioVU recruitment required outpatient appointments at Vanderbilt University Medical Center. As a result, MGI is enriched for diseases treated surgically, such as solid tumors [16]. This heterogeneity reflects the reality of EHR-based phenotyping, and strategies must be developed for future collaborative work on the growing number of EHR-linked biobanks. Our initial challenge was identifying which labs to meta-analyze between the health systems. Accurately matching labs is complicated by the fact that no standardized coding scheme exists for lab measurements. Dichotomous disease traits are readily matched between health systems using the ubiquitous ICD coding system for disease diagnoses [17]. Although the Logical Observation Identifiers Names and Codes (LOINC) system offers the promise of interoperability for lab traits, it is cumbersome and maps poorly onto other ontologies [18]. 
For example, there are 21 distinct codes for blood glucose which might not be used consistently between institutions. Moreover, health systems may adopt their own idiosyncratic internal terminology for electronic recording of lab results. Based on a methodical manual review of EHR text descriptions and lab measurements, we identified 70 lab traits between BioVU and MGI that could be matched with high confidence. We extracted previously identified variants for these lab traits from the GWAS catalog to serve as true positive variants for assessing subsequent analyses. Our meta-analysis replicated nearly 75% of these true positive variants, validating both the accuracy of lab matches across health systems and the overall quality of the EHR lab data. Further, we discovered 31 novel lab-associated variants across 22 labs, including the first reported associations for the saliva and pancreatic enzyme amylase and bicarbonate CO2, a gaseous waste product from metabolism carried in the blood. We immediately replicated 22 (71%) of these novel associations using an independent second set of BioVU samples. The meta-analysis required several strategic choices regarding data preparation and statistical analysis. We explored the consequences of various analytic choices using a series of mirrored analyses performed in MGI and BioVU. In particular, we varied the summary statistic for lab measurements and the inclusion of covariates to control for comorbid diseases in the GWAS. We compared the results between the independent biobank cohorts to assess consistency of effects. We hypothesized that alternative summary statistics to the basic mean could provide more powerful genetic analyses. 
We considered: the median lab measurement due to robustness against data recording errors and extreme measurements, the first available lab measurement to mitigate the effects of prescription drugs on modifiable lab traits, and the maximum recorded measurement to magnify variation in extreme measurements. The comorbidity analysis compared GWAS results from models that included indicator covariates for a wide array of diseases to models that did not. The complete set of GWAS summary statistics from this analysis is broadly available to the research community. We encourage others to use this data to replicate their own GWAS findings and perform hypothesis-driven lookups on specific SNPs or lab traits of interest. Our results are viewable through an interactive PheWeb web browser [19] at http://pheweb.sph.umich.edu/mgi-biovu-labs and available for bulk download at https://phewascatalog.org/labwas and ftp://share.sph.umich.edu/mgi_biovu_labwas/.

Methods

Datasets

We analyzed data from two university hospital biobanks that link electronic health records with genetic data: BioVU from Vanderbilt University and the Michigan Genomics Initiative (MGI) from Michigan Medicine. We restricted our analysis to unrelated patients of European ancestry because of insufficient patient sample sizes and a paucity of known variants in non-European populations. The BioVU cohort has been described previously [15]. Briefly, DNA was extracted from surplus blood samples and genotyping data was linked to de-identified EHR data. For this study, we used a cohort of 20,515 individuals genotyped on the Multi-Ethnic Genotyping Array (MEGA) from Illumina and estimated to be of European ancestry by admixture [20]. We included 843,242 SNPs that passed standard marker QC filters and had a minor allele frequency >1%. We retrieved all available lab measurements in this cohort that occurred when the subject was at least 18 years of age. The MGI cohort has also been described previously [16]. Briefly, MGI samples were recruited primarily through surgical encounters at Michigan Medicine and provided consent for linking of their EHRs and genetic data for research purposes. MGI samples were genotyped on customized Illumina HumanCoreExome v12.1 bead arrays. European samples were identified using Principal Component Analysis. We used a data freeze consisting of 37,354 unrelated European individuals for this analysis. MGI samples were imputed to the Haplotype Reference Consortium using the Michigan Imputation Server [21], providing ~14 million SNPs with a minimac imputation quality R2>0.3 and an allele frequency greater than 1e-6. We analyzed the set of ~800K overlapping SNPs between the MGI imputed genotypes and the BioVU MEGA array for this study.

Harmonization of labs between health systems and the GWAS Catalog

We extracted all available clinical lab measurements and metadata from the electronic health records of MGI samples and BioVU samples. We collapsed distinct labs when obvious duplications were present (e.g., “Eosinophils” and “EOSINOPHILS”). Available metadata differed slightly between the health systems but included brief text descriptions, units of measurement, and ranges of normal values. We excluded individual lab measurements that were taken outside the health system and labelled as “External.” In cases where multiple tests examined the same analyte, e.g. blood glucose, we removed point of care (POC) tests, which are more susceptible to technical artifacts and tend to be deployed in intensive care or emergency settings where acute disease or treatment effects override the determinants of the underlying baseline [22,23]. Lab traits were matched between the Vanderbilt and Michigan health systems based on manual curation of the metadata including recorded lab names, clinical descriptions, measurement units, range of measurements, and patient count.
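
As an illustration of the duplicate-collapsing step only, the following is a minimal sketch that groups lab names differing solely in letter case or spacing. The function name and the normalization rule are assumptions made for this example; the matching between health systems described above relied on manual curation, which this sketch does not replace.

```python
def collapse_lab_names(names):
    """Group raw lab names that differ only by case or whitespace.

    Returns a dict mapping a normalized key to the list of raw names
    that share it. The normalization rule is an illustrative assumption.
    """
    groups = {}
    for name in names:
        # Squeeze runs of whitespace, then compare case-insensitively.
        key = " ".join(name.split()).casefold()
        groups.setdefault(key, []).append(name)
    return groups


groups = collapse_lab_names(["Eosinophils", "EOSINOPHILS", "Glucose"])
```

Here "Eosinophils" and "EOSINOPHILS" fall into one group, while "Glucose" stays on its own. Names that differ in more than case or spacing would still need manual review.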

Disease phenotypes

In order to study the effect of underlying health conditions we extracted ICD9 and ICD10 diagnosis codes from the EHR of the BioVU and MGI cohorts. We searched for diagnoses of 42 diseases with the potential to alter a clinical lab measurement (S1 Table). We started with the disease list used in the BioBank Japan lab analysis [11] and removed diseases which do not occur in our population (e.g. febrile seizures of infancy) and those expected to have minimal effect on labs (e.g. cataracts). We supplemented their list with chronic diseases expected to have a large impact on labs due to their prevalence (e.g. hypertension). We created an indicator variable for each disease (1 if the sample had at least one qualifying ICD code for the specific disease and 0 otherwise) to include as covariates in GWAS regression analyses.
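
The indicator coding can be sketched as follows. The two-disease mapping and the prefix-matching rule are hypothetical placeholders for illustration; the actual 42 diseases and their qualifying codes are those listed in S1 Table, not the ones shown here.

```python
# Hypothetical disease -> ICD code prefixes (illustration only, not S1 Table).
DISEASE_CODE_PREFIXES = {
    "hypertension": ("401", "I10"),
    "diabetes": ("250", "E11"),
}


def disease_indicators(patient_codes, disease_prefixes=DISEASE_CODE_PREFIXES):
    """Return {disease: 1 or 0} for one patient's list of ICD9/ICD10 codes.

    A disease is coded 1 if at least one of the patient's codes starts with
    one of its prefixes, and 0 otherwise.
    """
    indicators = {}
    for disease, prefixes in disease_prefixes.items():
        has_code = any(code.startswith(prefixes) for code in patient_codes)
        indicators[disease] = 1 if has_code else 0
    return indicators


example = disease_indicators(["401.9", "V70.0"])
```

Each resulting 0/1 value would enter the regression as one covariate per disease.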

Statistical analysis

Intra-cohort Genome-wide Association Studies

We first performed GWAS analysis of each lab trait separately in the MGI and BioVU cohorts. We performed multiple GWAS for each lab, varying the statistic used to summarize the longitudinal lab measurements for each sample (mean, median, first available measurement and maximum available measurement) and the inclusion of binary indicators for diagnosed comorbid diseases in the GWAS regression. For each GWAS, the distribution of lab summary statistics was inverse normalized separately within the MGI and BioVU cohorts prior to regression analysis. In a separate analysis of the BioVU cohort, we determined that inverse normalization of lab values performed better than applying no transformation, or a log or square root transformation, for controlling GWAS type I error. Genome-wide association tests were performed on the inverse normalized traits using additive linear regression models containing age, sex and four principal components as covariates. The comorbidity model controlled for disease status by inclusion of an additional 42 covariates for the binary disease phenotypes. The regression analyses were performed in the BioVU cohort using PLINK [24] and in the MGI cohort using EPACTS 3.3.0 [25].
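
The two preprocessing steps above (summarizing each sample's longitudinal measurements, then inverse normalizing within a cohort) can be sketched as below. This is a minimal illustration under stated assumptions: the text does not specify the rank offset or tie handling used, so the Blom offset of 3/8 and average ranks for ties are choices made for this example, and the function names are hypothetical.

```python
from statistics import NormalDist, mean, median


def summarize(measurements, how="mean"):
    """Collapse one sample's (date, value) pairs into a single value."""
    values = [v for _, v in sorted(measurements)]  # order by date
    if how == "mean":
        return mean(values)
    if how == "median":
        return median(values)
    if how == "first":
        return values[0]  # earliest recorded measurement
    if how == "max":
        return max(values)
    raise ValueError(f"unknown summary statistic: {how}")


def inverse_normal_transform(values, offset=0.375):
    """Rank-based inverse normal transform (offset 3/8 is an assumption)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank for tied values
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]


labs = [("2015-01-02", 5.0), ("2014-06-01", 3.0), ("2016-03-03", 4.0)]
z = inverse_normal_transform([3.0, 1.0, 2.0])
```

The transform would be applied to the per-sample summaries separately within each cohort, after which the transformed values serve as the regression outcome.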

Comparison of p-values across cohorts

We treated the GWAS of mean trait value with no disease covariates as the default. We quantified the impact of each alternate analysis strategy relative to the default analysis by computing the negative log fold change in p-value between the alternative and default analysis for each analyzed SNP. That is, for each SNP we compute the quantity $\Delta = -\log\left(p_{\mathrm{alternative}} / p_{\mathrm{default}}\right)$ for the MGI analysis and the BioVU analysis separately. A positive value of Δ indicates a SNP that increases in significance (smaller p-value) for the alternate analysis. A negative value of Δ indicates a decrease in significance for the alternate analysis. Scatterplots of Δ computed in MGI and BioVU summarize the magnitude and consistency of change in p-value significance between the cohorts (Fig 1, S1 Fig). We performed LD-pruning on non-catalog SNPs to simplify the scatterplots. Since most SNPs are not associated with the lab trait of interest, alternative summarizations simply result in independent noise between the two cohorts, resulting in a diamond-shaped pattern centered at the origin.
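
A minimal sketch of the per-SNP quantity Δ follows. The base of the logarithm is not stated in the text, so base 10 is an assumption here, and the function name is hypothetical.

```python
import math


def delta(p_default, p_alternative):
    """Negative log10 fold change in p-value, alternative vs. default.

    Positive when the alternative analysis gives the smaller p-value.
    Computed as a difference of logs to avoid forming a tiny ratio.
    """
    return math.log10(p_default) - math.log10(p_alternative)


more_significant = delta(1e-4, 1e-6)   # alternative p-value is smaller
less_significant = delta(1e-6, 1e-4)   # alternative p-value is larger
```

Computing Δ once in MGI and once in BioVU for the same SNP gives the two coordinates of that SNP in the scatterplots.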
Fig 1

Scatterplot of Δ in MGI and BioVU when using the first available measure rather than the mean measurement in a GWAS of Cholesterol level.

Δ is the -log fold change in p-value at a SNP for using an alternate analysis, in this case the first available lab measurement. Each dot is a SNP, with red dots indicating GWAS catalog SNPs for the specific lab trait. The white diamond contains 99.9% of SNPs and is used to identify SNPs with the largest changes in p-value due to the alternate analysis. SNPs outside the bounding diamond in the top right (green) quadrant show a concordant increase in significance in both MGI and BioVU, that is, SNPs for which the alternative strategy increases significance in both cohorts. Conversely, SNPs in the bottom left (blue) quadrant show a concordant decrease in significance in both MGI and BioVU. SNPs in either the top left or bottom right (yellow) quadrants have a discordant effect, indicating a large increase in p-value in one cohort but a large decrease in p-value in the second cohort. In this example, one catalog SNP showed a concordant increase in significance when using the first available lab measure, 11 catalog SNPs had a concordant decrease in significance and one SNP had discordant effects. The complete set of scatterplots for each analyzed lab and alternative analysis strategy (summary statistic and comorbidity model) are included in the S1 Fig. Tables 3 and 4 summarize the movement of catalog SNPs for each lab and analysis strategy.

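Table 3

Classification of catalog SNPs for alternate summary statistics.

| Lab | Testable Catalog SNPs | Median: Concordant Increased Significance | Median: Concordant Decreased Significance | Median: Discordant Effect | First Available: Concordant Increased Significance | First Available: Concordant Decreased Significance | First Available: Discordant Effect | Maximum: Concordant Increased Significance | Maximum: Concordant Decreased Significance | Maximum: Discordant Effect |
|---|---|---|---|---|---|---|---|---|---|---|
| Chol | 91 | 0 | 12 | 0 | 1 | 11 | 1 | 2 | 4 | 6 |
| Creat | 36 | 2 | 0 | 0 | 2 | 2 | 1 | 0 | 8 | 1 |
| EoAB | 31 | 0 | 6 | 0 | 0 | 9 | 0 | 0 | 2 | 1 |
| EoRE | 28 | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 |
| HCT | 36 | 0 | 0 | 0 | 4 | 0 | 1 | 15 | 0 | 1 |
| HDL | 101 | 0 | 6 | 3 | 0 | 15 | 1 | 0 | 27 | 5 |
| Hgb | 34 | 0 | 0 | 0 | 5 | 0 | 0 | 12 | 0 | 0 |
| LDL | 84 | 0 | 9 | 1 | 0 | 9 | 4 | 2 | 2 | 6 |
| LymphAB | 35 | 0 | 0 | 0 | 0 | 3 | 1 | 5 | 1 | 2 |
| LymphRE | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MCHC | 20 | 0 | 1 | 0 | 2 | 5 | 3 | 2 | 5 | 1 |
| MCH | 64 | 0 | 16 | 27 | 0 | 33 | 8 | 0 | 33 | 7 |
| MCV | 77 | 1 | 5 | 7 | 0 | 19 | 13 | 0 | 30 | 6 |
| MonoAB | 43 | 2 | 3 | 0 | 0 | 9 | 0 | 0 | 13 | 1 |
| MPV | 84 | 0 | 11 | 9 | 0 | 39 | 9 | 5 | 20 | 17 |
| PLT | 102 | 0 | 0 | 1 | 7 | 7 | 1 | 0 | 19 | 5 |
| PMNAB | 35 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 3 | 1 |
| PMNRE | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| RBC | 50 | 0 | 4 | 4 | 13 | 0 | 1 | 21 | 0 | 0 |
| RDW | 29 | 0 | 1 | 2 | 0 | 1 | 4 | 0 | 7 | 0 |
| Trigs | 73 | 0 | 7 | 0 | 1 | 15 | 1 | 0 | 22 | 0 |
| WBC | 33 | 0 | 4 | 5 | 0 | 7 | 1 | 0 | 9 | 0 |
| Total | 1127 | 5 (0.4%) | 86 (7.6%) | 59 (5.2%) | 35 (3.1%) | 190 (16.9%) | 51 (4.5%) | 64 (5.6%) | 206 (18.3%) | 62 (5.5%) |
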
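Table 4

Classification of catalog SNPs for the comorbidity model, which includes covariates for various lab-altering diseases.

| Lab | Testable Catalog SNPs | Concordant Increased Significance | Concordant Decreased Significance | Discordant Effect |
|---|---|---|---|---|
| Chol | 91 | 2 | 5 | 2 |
| Creat | 36 | 1 | 3 | 2 |
| EoAB | 31 | 0 | 0 | 0 |
| EoRE | 28 | 0 | 0 | 1 |
| HCT | 36 | 2 | 0 | 2 |
| HDL | 101 | 15 | 2 | 2 |
| Hgb | 34 | 1 | 0 | 0 |
| LDL | 84 | 0 | 7 | 2 |
| LymphAB | 35 | 2 | 0 | 4 |
| LymphRE | 20 | 0 | 0 | 0 |
| MCHC | 20 | 2 | 0 | 2 |
| MCH | 64 | 1 | 7 | 26 |
| MCV | 77 | 9 | 1 | 4 |
| MonoAB | 43 | 5 | 0 | 1 |
| MPV | 84 | 18 | 0 | 5 |
| PLT | 102 | 5 | 1 | 4 |
| PMNAB | 35 | 0 | 2 | 1 |
| PMNRE | 21 | 0 | 0 | 2 |
| RBC | 50 | 2 | 0 | 5 |
| RDW | 29 | 0 | 1 | 3 |
| Trigs | 73 | 3 | 3 | 7 |
| WBC | 33 | 2 | 2 | 2 |
| Total | 1127 | 70 (6.2%) | 34 (3.0%) | 77 (6.8%) |
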
We implemented a heuristic to formally distinguish the SNPs with the largest changes in p-value between the alternative and default analysis methods from those whose movement is due simply to random noise. The heuristic uses simulated annealing to determine the coordinates of a bounding quadrilateral around the diamond-shaped cluster of points, such that the polygon contains 99.9% of all SNPs. We defined SNPs outside the boundaries of the polygon as those with the largest simultaneous changes in p-values in both cohorts. Catalog SNPs located outside the bounding polygon were classified as having either a concordant increased effect if p-value significance increased in both MGI and BioVU, a concordant decreased effect if p-value significance decreased in both MGI and BioVU, or a discordant effect if the p-value increased in significance in one cohort but decreased in the other.
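
The classification logic can be sketched as follows. Note the substitution: instead of fitting a general quadrilateral by simulated annealing, this example assumes a symmetric diamond |x| + |y| <= r centered at the origin and picks r as an empirical quantile. That simplification, the function names and the handling of points lying exactly on an axis are all assumptions made for illustration.

```python
import math


def diamond_radius(points, coverage=0.999):
    """Smallest r such that at least `coverage` of points have |x| + |y| <= r."""
    norms = sorted(abs(x) + abs(y) for x, y in points)
    k = math.ceil(coverage * len(norms))  # number of points that must be inside
    return norms[k - 1]


def classify(delta_mgi, delta_biovu, radius):
    """Label one SNP by its position relative to the bounding diamond."""
    if abs(delta_mgi) + abs(delta_biovu) <= radius:
        return "inside"  # change is indistinguishable from noise
    if delta_mgi > 0 and delta_biovu > 0:
        return "concordant_increase"
    if delta_mgi < 0 and delta_biovu < 0:
        return "concordant_decrease"
    return "discordant"  # includes points lying exactly on an axis


points = [(0.1, 0.1), (-0.2, 0.1), (0.5, -0.5), (3.0, 3.0)]
r = diamond_radius(points, coverage=0.75)
labels = [classify(x, y, r) for x, y in points]
```

With a fitted quadrilateral in place of the symmetric diamond, only the "inside" test would change; the quadrant labels follow directly from the signs of Δ in the two cohorts.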

Meta-analysis

We meta-analyzed the GWAS results from the MGI and BioVU default analysis (mean trait value, no disease covariates). The meta-analysis was performed using METAL by combining study-specific GWAS effect size estimates and standard errors [26]. We computed genomic control inflation factors (λGC) on a set of LD-pruned SNPs for each meta-analyzed lab.
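
To make the two computations concrete, here is a minimal sketch of a fixed-effect inverse-variance weighted combination of per-cohort effect sizes and standard errors, and of a genomic control inflation factor computed from p-values. It is an illustration of the standard formulas, not METAL's code, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, median

STD_NORMAL = NormalDist()


def inverse_variance_meta(betas, ses):
    """Combine per-study effect sizes using inverse-variance weights.

    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    p = 2.0 * STD_NORMAL.cdf(-abs(z))
    return pooled_beta, pooled_se, p


def genomic_control_lambda(p_values):
    """Median observed 1-df chi-square statistic over its expected median.

    p_values must lie in (0, 1]; ideally taken from LD-pruned SNPs.
    """
    chi2 = [STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    expected_median = STD_NORMAL.inv_cdf(0.75) ** 2  # about 0.455
    return median(chi2) / expected_median


beta, se, p = inverse_variance_meta([0.1, 0.1], [0.02, 0.02])
lam = genomic_control_lambda([0.5])
```

A λGC near 1 indicates that the bulk of test statistics follows the null distribution, while values well above 1 suggest inflation from confounding or, for highly polygenic traits, from many true small effects.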

GWAS catalog variants

We created a list of previously identified genetic associations for each analyzed lab trait using the GWAS catalog [27] (downloaded 9/27/2017). We searched the catalog for quantitative phenotypes matching our analyzed labs using pattern matching in the DISEASE_TRAIT, MAPPED_TRAIT, and P_VALUE_TEXT columns. We searched for each lab using multiple potential string patterns, for example “AST”, “aspartate aminotransferase”, “SGOT”, and “serum glutamine oxaloacetic aminotransferase”. For purposes of replication, we limited our catalog search to studies of European cohorts performed on adults of both sexes without disease-based sampling (e.g. glucose measurements in type 2 diabetes samples) and required a reported p-value of 5e-8 or smaller. We considered a catalog association replicated if the meta-analysis p-value for our corresponding lab was < 0.05 and the BioVU and MGI studies had the same direction of effect.
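
The replication rule stated above can be written as a short check. This is a sketch with hypothetical function names; it assumes per-cohort effect estimates are available for each catalog SNP and treats an effect estimate of exactly zero as not concordant.

```python
def is_replicated(meta_p, beta_mgi, beta_biovu, alpha=0.05):
    """True if meta-analysis p < alpha and both cohorts agree in direction."""
    same_direction = beta_mgi * beta_biovu > 0
    return meta_p < alpha and same_direction


def replication_rate(results):
    """Percent of (meta_p, beta_mgi, beta_biovu) tuples that replicate.

    Returns None when there are no testable catalog SNPs (shown as N/A).
    """
    if not results:
        return None
    replicated = sum(is_replicated(*r) for r in results)
    return 100.0 * replicated / len(results)


rate = replication_rate([
    (0.01, 0.2, 0.1),      # replicated
    (0.2, 0.1, 0.1),       # p-value too large
    (0.001, -0.1, 0.1),    # opposite directions
    (1e-9, -0.3, -0.2),    # replicated
])
```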

Definition of novelty

We report novel lab-SNP associations as those reaching genome-wide significance that have not been previously reported in European populations and are not reasonably expected based on existing SNP-lab associations in similar labs. We used the following criteria: meta-analysis p-value <5e-8, consistent direction of effect between MGI and BioVU and at least 1 megabase from any previously reported SNP for the given lab or a related lab in the GWAS catalog. Here, we define related labs as those which are commonly ordered as part of a panel of correlated tests (e.g. AST and ALT for liver function) or arithmetically-dependent traits (e.g. LDL and total cholesterol), and therefore likely to indicate the same biological association. We report the “peak” or most significant SNP when a group of novel SNPs are in linkage disequilibrium.
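
The three novelty criteria can be combined into a single predicate, sketched below. The function name and the input layout (a list of chromosome and position pairs for catalog SNPs of the given lab and its related labs) are assumptions made for this example.

```python
GENOME_WIDE_P = 5e-8
MIN_DISTANCE_BP = 1_000_000  # 1 megabase


def is_novel(chrom, pos, meta_p, beta_mgi, beta_biovu, known_positions):
    """Apply the novelty criteria to one SNP.

    known_positions: iterable of (chrom, pos) for previously reported SNPs
    for this lab or a related lab.
    """
    significant = meta_p < GENOME_WIDE_P
    same_direction = beta_mgi * beta_biovu > 0
    far_from_known = all(
        known_chrom != chrom or abs(known_pos - pos) >= MIN_DISTANCE_BP
        for known_chrom, known_pos in known_positions
    )
    return significant and same_direction and far_from_known


known = [("1", 55_000_000), ("19", 11_200_000)]
candidate = is_novel("1", 57_500_000, 1e-9, 0.05, 0.04, known)
too_close = is_novel("1", 55_400_000, 1e-9, 0.05, 0.04, known)
```

Selecting the peak SNP among novel SNPs in linkage disequilibrium would be a separate step applied after this filter.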

Replication of novel associations

We performed a replication analysis of novel associations identified in the meta-analysis using an independent cohort of BioVU samples that became available after the original meta-analysis was performed. This replication cohort consisted of 29,043 European ancestry adult individuals with extant lab data recruited using the same procedure as the initial BioVU cohort, genotyped on the same MEGA genotyping array, and subjected to the same data QC procedure. We declared a novel association to be replicated if the replication p-value was <0.05 and the direction of effect was consistent with that from the meta-analysis.

Ethics statement

Data were collected according to Declaration of Helsinki principles. MGI study participants’ consent forms and protocols were reviewed and approved by the University of Michigan Medical School Institutional Review Board (IRB ID HUM00099605 and HUM00155849). Opt-in written informed consent was obtained for each MGI participant. BioVU is Vanderbilt University's biobank of DNA extracted from leftover and otherwise discarded clinical blood specimens. BioVU operates as a consented biorepository; all individuals must sign the BioVU consent form in order to donate future specimens.

Results

We extracted all available clinical lab measurements from the electronic health records (EHRs) for genotyped samples in two academic biobank cohorts: the Michigan Genomics Initiative [16] (MGI) at Michigan Medicine and the BioVU [15] at Vanderbilt University. In total, this consisted of 35,785,074 lab measurements in 50,743 MGI samples, and 28,929,660 lab measurements in 61,378 BioVU samples. We focused on samples of European ancestry in both cohorts due to insufficient sample sizes in other ancestry groups. Genetic analyses were performed on the set of ~800K overlapping SNPs between the MGI imputed genotypes and the BioVU MEGA array genotypes. We analyzed 70 labs matched with high confidence between the health systems and having at least 1,000 samples with the lab measured in each health system (Table 1). We searched the GWAS catalog for known genetic associations among the 70 lab traits to serve as “true positive” variants to validate the data and assess competing analysis strategies (S2 Table). We identified 4,140 such associations, of which 1,313 (32%) across 48 different traits were in the set of overlapping markers tested in the meta-analysis. Many lab traits have been well studied [28,29] and provided many testable catalog SNPs. LDL, for example, had 84 catalog SNPs that could be directly tested in our meta-analysis. In contrast, several labs had relatively few or no catalog SNPs, including labs for which either no variant was reported in the catalog or the catalog variants were not typed in at least one of our cohorts.
Table 1

Summary of clinical lab traits tested, including meta-analysis sample size, number of testable GWAS catalog SNPs, number of replicated catalog SNPs and replication rate.

| Lab Name | Category | Description | Meta-Analysis Sample Size | Number of Testable GWAS Catalog SNPs | Number of Catalog SNPs Replicated in Meta-Analysis | Replication Rate (%) |
|---|---|---|---|---|---|---|
| Alb | Liver function | Albumin, most abundant blood protein | 39,513 | 5 | 4 | 80 |
| AlkP | Liver function | Alkaline phosphatase, bile duct and bone enzyme released by damage | 39,809 | 3 | 1 | 33 |
| ALT | Liver function | ALanine aminoTransferase, liver enzyme released by damage | 40,116 | 0 | 0 | N/A |
| Amyl | Pancreas | Amylase, digestive pancreas enzyme released by damage | 10,368 | 0 | 0 | N/A |
| AST | Liver function | ASpartate aminoTransferase, liver enzyme released by damage | 40,176 | 0 | 0 | N/A |
| BasoAB | Differential | Basophils, white blood cell type (absolute number) | 29,653 | 19 | 12 | 63 |
| BasoRE | Differential | Basophils, white blood cell type (relative proportion) | 32,578 | 11 | 7 | 64 |
| BEAR | Blood gas | Base Excess ARterial, acid-base measure of metabolic acidosis or alkalosis | 8,895 | 0 | 0 | N/A |
| Bili | Liver function | Total Bilirubin, heme byproduct excreted by liver | 38,416 | 4 | 4 | 100 |
| BNP | Heart failure | Brain Natriuretic Protein, signaling protein from heart under stress | 9,369 | 1 | 1 | 100 |
| BUN | Renal function | Blood Urea Nitrogen, protein byproduct excreted by kidneys | 45,922 | 0 | 0 | N/A |
| Ca | Electrolytes | Calcium, blood electrolyte | 46,100 | 9 | 7 | 78 |
| Chol | Lipid panel | Total cholesterol | 23,642 | 91 | 60 | 66 |
| CKMBRe | Cardiac markers | Creatine Kinase Muscle Brain isoform, relative, enzyme in heart released by damage | 10,964 | 0 | 0 | N/A |
| Cl | Electrolytes | Chloride, blood electrolyte | 45,920 | 0 | 0 | N/A |
| CPK | Cardiac markers | Creatine PhosphoKinase, enzyme in skeletal and cardiac muscle released by damage | 15,150 | 0 | 0 | N/A |
| Creat | Renal function | Creatinine, creatine byproduct excreted by kidneys | 46,027 | 36 | 29 | 81 |
| CRP | Inflammatory | C-reactive protein, marker of inflammation | 12,447 | 16 | 7 | 44 |
| EoAB | Differential | Eosinophils, white blood cell type (absolute count) | 29,912 | 31 | 25 | 81 |
| EoRE | Differential | Eosinophils, white blood cell type (relative proportion) | 26,980 | 28 | 18 | 64 |
| Ferrit | Iron | Ferritin, iron storage protein | 11,744 | 6 | 1 | 17 |
| FT4 | Thyroid function | Free tetraiodothyronine, active thyroid hormone | 15,868 | 0 | 0 | N/A |
| Gluc | Metabolic | Blood glucose | 46,027 | 18 | 16 | 89 |
| HCO3 (CO2) | Blood gas | Bicarbonate, main blood pH buffer | 45,932 | 0 | 0 | N/A |
| HCT | Complete blood count | Hematocrit, measure of blood oxygen carrying capacity | 46,382 | 36 | 20 | 56 |
| HDL | Lipid panel | High density lipoprotein cholesterol | 23,318 | 101 | 84 | 83 |
| Hgb | Complete blood count | Hemoglobin, oxygen carrying protein | 46,159 | 34 | 18 | 53 |
| HgbA1C | Metabolic | Hemoglobin A1C, measure of blood glucose over previous 90 days | 17,407 | 11 | 10 | 91 |
| IGranAB | Differential | Immature granulocytes, immature white blood cell type (absolute count) | 30,744 | 0 | 0 | N/A |
| IGranRE | Differential | Immature granulocytes, immature white blood cell type (relative proportion) | 30,683 | 0 | 0 | N/A |
| INR | Coagulation | International Normalized Ratio, derivative of PT used to dose anticoagulants | 33,695 | 0 | 0 | N/A |
| Iron | Iron | Iron | 11,317 | 4 | 3 | 75 |
| K | Electrolytes | Potassium, blood electrolyte | 45,941 | 0 | 0 | N/A |
| LAC | Blood gas | Lactic acid, marker of tissue hypoxia | 8,792 | 0 | 0 | N/A |
| LDH | Tumor markers | Lactate dehydrogenase, enzyme found in many cell types released by damage | 9,734 | 0 | 0 | N/A |
| LDL | Lipid panel | Low density lipoprotein cholesterol | 22,896 | 84 | 58 | 69 |
| Lipase | Pancreas | Lipase, digestive pancreas enzyme released by damage | 12,649 | 2 | 2 | 100 |
| LymphAB | Differential | Lymphocytes, white blood cell type (absolute count) | 32,548 | 35 | 22 | 63 |
| LymphRE | Differential | Lymphocytes, white blood cell type (relative proportion) | 32,553 | 20 | 10 | 50 |
| MCH | Red cell indices | Mean corpuscular hemoglobin, used to differentiate causes of anemia | 46,159 | 64 | 57 | 89 |
| MCHC | Red cell indices | Mean corpuscular hemoglobin concentration, used to differentiate causes of anemia | 46,157 | 20 | 19 | 95 |
| MCV | Red cell indices | Mean corpuscular volume, used to differentiate causes of anemia | 46,153 | 77 | 68 | 88 |
| Mg | Electrolytes | Magnesium, blood electrolyte | 22,773 | 4 | 4 | 100 |
| MonoAB | Differential | Monocytes, white blood cell type (absolute count) | 32,587 | 43 | 32 | 74 |
| MonoRE | Differential | Monocytes, white blood cell type (relative proportion) | 32,594 | 15 | 12 | 80 |
| MPV | Coagulation | Mean platelet volume | 40,058 | 84 | 73 | 87 |
| Na | Electrolytes | Sodium, blood electrolyte | 45,933 | 0 | 0 | N/A |
| pCO2 | Blood gas | Arterial partial pressure of CO2, measure of ventilation | 9,516 | 0 | 0 | N/A |
| pH | Blood gas | Arterial pH | 10,279 | 0 | 0 | N/A |
| Phos | Electrolyte | Phosphorus, blood electrolyte | 21,618 | 5 | 4 | 80 |
| PLT | Complete blood count | Platelet count, clot forming measure | 46,145 | 102 | 84 | 82 |
| PMNAB | Differential | Neutrophils, white blood cell type (absolute count) | 32,595 | 35 | 15 | 43 |
| PMNRE | Differential | Neutrophils, white blood cell type (relative proportion) | 29,435 | 21 | 7 | 33 |
| pO2 | Blood gas | Arterial partial pressure of oxygen, measure of oxygenation | 9,557 | 0 | 0 | N/A |
| PT | Coagulation panel | Prothrombin time, clot forming measure | 33,671 | 1 | 1 | 100 |
| PTT | Coagulation panel | Partial Thromboplastin Time, clot forming measure | 30,972 | 9 | 6 | 67 |
| RBC | Complete blood count | Red Blood Cell count, measure of blood oxygen carrying capacity | 46,158 | 50 | 31 | 62 |
| RDW | Red cell indices | Red cell Distribution Width, measure of variability in MCV, used to differentiate causes of anemia | 44,281 | 29 | 21 | 72 |
| %SAT | Iron | Transferrin saturation, measure of available iron transport capacity | 10,180 | 4 | 3 | 75 |
| SedRat | Inflammatory markers | Erythrocyte Sedimentation Rate (ESR), non-specific marker of inflammation | 13,945 | 5 | 5 | 100 |
| TIBC | Iron | Total Iron Binding Capacity, measure of iron transport capacity, used to calculate transferrin saturation | 10,397 | 1 | 1 | 100 |
| TProt | Liver function | Total Protein in blood | 38,352 | 2 | 2 | 100 |
| Trigs | Lipid panel | Triglycerides, tested as part of cholesterol panels | 23,963 | 73 | 63 | 86 |
| Troponin | Cardiac markers | Troponin I, heart protein released by damage | 10,106 | 0 | 0 | N/A |
| TSH | Thyroid function | Thyroid Stimulating Hormone, test of thyroid function and feedback | 27,441 | 1 | 1 | 100 |
| UCrea | Renal function | Urine creatinine, measure of kidney function | 10,522 | 0 | 0 | N/A |
| UricA | Gout | Uric acid, nucleotide breakdown product elevated in gout | 7,429 | 17 | 14 | 82 |
| Vit-B12 | Nutrition | Vitamin B12, used in DNA synthesis | 12,506 | 7 | 7 | 100 |
| Vit-D | Nutrition | Vitamin D storage form, regulates calcium and phosphorus | 12,250 | 6 | 6 | 100 |
| WBC | Complete blood count | White Blood Cell count | 46,100 | 33 | 27 | 82 |
| TOTAL | | | | 1313 | 982 | 74.8 |

Meta-analysis of Labs in MGI and BioVU

The 70 EHR-derived lab traits were first analyzed separately in the cohorts using the mean measurement as the individual-level outcome. The meta-analysis sample size differed between labs, ranging from 7,429 for uric acid to 46,382 for hematocrit (Fig 2), reflecting the frequency with which different labs are administered in health systems. Several labs have previously been studied in much larger cohorts, including the differential panel of 10 white blood cell measures, analyzed in >170K samples in the UK Biobank [29]. However, this meta-analysis provides the largest sample size for 34 labs, including 14 clinical lab traits with no previously reported study in the GWAS catalog at the time of our analysis. Genomic control lambda values (λGC) confirmed the meta-analyses were well-controlled [30]. The mean λGC across all labs was 1.035, ranging between 0.995 and 1.103. Consistent with polygenicity [31], traits with larger numbers of catalog variants had, on average, larger λGC values. The mean λGC for labs with zero testable catalog SNPs was 1.020. Labs with one to twenty testable catalog SNPs had a mean λGC of 1.028, and labs with more than 20 testable catalog SNPs had a mean λGC of 1.066.
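As an illustration of the λGC values quoted above, the following is a minimal sketch of the standard genomic control calculation: each two-sided p-value is converted back to a 1-df chi-square statistic, and the median of those statistics is divided by the median expected under the null. The function name and the toy p-values are our own assumptions for illustration; this is not the pipeline used in the study.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (~0.4549),
# obtained as the square of the 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_control_lambda(p_values):
    """Return lambda_GC for two-sided association p-values in (0, 1]."""
    chi2_stats = []
    for p in p_values:
        # A two-sided p-value corresponds to |z| = -inv_cdf(p / 2);
        # squaring z gives the 1-df chi-square statistic.
        z = _STD_NORMAL.inv_cdf(p / 2.0)
        chi2_stats.append(z * z)
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical input: evenly spaced p-values mimic a GWAS with no inflation,
# so lambda_GC should be very close to 1.
null_p = [(i + 0.5) / 1001 for i in range(1001)]
lambda_null = genomic_control_lambda(null_p)

# Shifting every p-value downward (here by squaring) inflates the statistic.
lambda_inflated = genomic_control_lambda([p * p for p in null_p])
```

Values slightly above 1, as reported here for the well-studied labs, are expected for polygenic traits even without confounding.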
Fig 2

Sample sizes for 70 clinical lab traits from the meta-analysis of BioVU and MGI EHRs (red triangles) and the previous largest reported GWAS in a European cohort (black circles). Our meta-analysis provides the largest GWAS for 34 lab traits, including the first for 14. Asterisks along the bottom row indicate labs for which we identified a novel genetic association.


Replication of GWAS Catalog SNPs

We first performed a replication analysis of the 1,313 GWAS catalog SNPs to validate the EHR-derived lab traits. Overall, we replicated 982 of the GWAS catalog SNPs, giving an overall replication rate of 74.8% (Table 1). Replication rates varied across the individual labs; however, we did replicate at least one catalog SNP for each of the 48 traits with a testable catalog SNP. Replication rates were high for several previously well-studied traits, including red blood cell indices (MCHC, MCH, MCV), metabolic measures (glucose and HgbA1C) and creatinine. The lowest replication rates occurred for the differential panel of white blood cell traits (neutrophils, lymphocytes), which included catalog SNPs discovered in the much larger UK Biobank cohort [4]. Interestingly, replication rates differed among the well-studied lipid panel traits. We replicated a lower percentage of catalog SNPs for LDL cholesterol and total cholesterol compared to triglycerides and HDL cholesterol. Several factors influenced our ability to replicate individual catalog SNPs (Fig 3), each consistent with statistical power, rather than inadequate matching of labs, being the primary limiting factor. Replication increased sharply with the number of publications reporting the association, as quantified using the PMID citation count from the GWAS catalog (Fig 3A). Associations reported only once in the catalog are a mix of true unreplicated associations and false positives, whereas associations reported more than once have already been replicated and are likely real. We replicated 70% (699 of 1000) of associations reported only a single time. That rate increased to 77% (196 of 256) for associations reported twice, 91% for associations reported three times and nearly 100% (56 of 57) for associations reported four or more times.
Importantly, this analysis provides the first replication for 699 previously reported quantitative lab trait associations, increasing the likelihood that these are true genotype-phenotype associations (S2 Table).
Fig 3

Replication rates for GWAS catalog SNPs of clinical labs increased with (A) the number of times an association was reported in the GWAS catalog, (B) the most significant p-value previously reported for the association, and (C) the ratio of sample size in our meta-analysis to that of the previous largest study.

Replication rate was also dependent on both the best previously reported p-value for the association and the sample size of the study reporting the association (Fig 3B & 3C). Our replication rate was lowest, between 55% and 65%, for associations whose best reported p-value only just passed the genome-wide significance threshold of 5e-8, but increased sharply thereafter. We replicated ~85% of catalog SNPs with best reported p-value <1e-15 and over 90% of catalog SNPs with best p-value <1e-20. Replication rate increased with the relative size of our meta-analysis compared to the largest reported study. We replicated approximately 90% of catalog SNPs for which our meta-analysis was at least as large as prior studies reporting the association.
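To make the power argument concrete, the sketch below uses a textbook normal approximation for the power of a two-sided single-SNP test. It assumes a trait standardized to unit variance, an additive model under Hardy-Weinberg equilibrium, and a nominal replication threshold of 0.05. The function name, the default threshold and the example effect size are hypothetical choices for illustration and are not taken from the study's methods.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def replication_power(n, beta, maf, alpha=0.05):
    """Approximate power to detect a SNP effect at two-sided level alpha.

    n     : sample size of the replication analysis
    beta  : per-allele effect in standard-deviation units (assumed)
    maf   : minor allele frequency
    alpha : significance threshold used to declare replication (assumed)
    """
    # Fraction of trait variance explained by the SNP (additive model, HWE).
    var_explained = 2.0 * maf * (1.0 - maf) * beta ** 2
    # Expected value of the z statistic, i.e. the root of the non-centrality.
    expected_z = (n * var_explained) ** 0.5
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the z statistic falls beyond either critical value.
    return _STD_NORMAL.cdf(expected_z - z_crit) + _STD_NORMAL.cdf(-expected_z - z_crit)


# Same hypothetical SNP tested at roughly the smallest and largest
# meta-analysis sample sizes reported above.
power_small = replication_power(n=7_429, beta=0.03, maf=0.3)
power_large = replication_power(n=46_382, beta=0.03, maf=0.3)
```

Under these assumptions a small effect that is comfortably detectable in the largest labs has much lower power in the smallest ones, which is the pattern seen in Fig 3C.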

Novel associations

We identified 264 SNP-lab trait pairs representing potentially novel associations. Based on visual inspection, these SNPs corresponded to 31 distinct peaks for which we report the lead SNP having the strongest association signal at each peak (Table 2).
Table 2

Summary of Novel findings.

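Columns 6-8: MGI-BioVU Meta-Analysis. Columns 9-11: BioVU Replication Cohort.
| Lab | SNP | Chr:Pos | Allele 1 | Allele 2 | N | Beta | P-Value | N | Beta | P-Value | Replicated |
| AlkP | rs3843738 | 17:43739194 | A | G | 39,809 | 0.04 | 2.51E-08 | 22,920 | 0.01 | 3.58E-01 | No |
| AlkP | rs73004933 | 19:19675696 | T | C | 39,809 | 0.08 | 4.47E-09 | 22,730 | 0.05 | 7.14E-03 | Yes |
| ALT | rs112574791 | 8:145730221 | A | G | 40,116 | 0.18 | 3.02E-08 | 23,007 | 0.15 | 5.80E-04 | Yes |
| Amyl | rs1930212 | 1:104324819 | A | G | 10,368 | -0.25 | 1.48E-45 | 3,573 | -0.18 | 4.69E-09 | Yes |
| Amyl | rs8051363 | 16:75255217 | A | G | 10,368 | 0.10 | 1.07E-10 | 3,564 | 0.09 | 4.51E-04 | Yes |
| BasoRE | rs386785158 | 15:70744437 | T | C | 29,653 | 0.06 | 7.94E-13 | 16,191 | 0.04 | 2.10E-04 | Yes |
| Bili | rs855791 | 22:37462936 | A | G | 39,890 | 0.04 | 2.34E-08 | 22,918 | 0.04 | 1.00E-05 | Yes |
| BUN | rs10516957 | 4:95949206 | T | C | 45,922 | -0.06 | 1.35E-08 | 25,245 | 0.01 | 6.11E-01 | No |
| Ca | rs6727384 | 2:97400324 | A | G | 46,100 | -0.04 | 5.13E-10 | 25,200 | -0.05 | 2.06E-07 | Yes |
| Ca | rs2839899 | 9:80350999 | A | G | 46,100 | 0.04 | 6.76E-09 | 25,194 | 0.03 | 9.47E-03 | Yes |
| Cl | rs1030025 | 2:103105611 | A | T | 45,920 | 0.05 | 4.68E-10 | 25,204 | 0.02 | 9.16E-02 | No |
| FT4 | rs10122824 | 9:139109861 | T | G | 15,868 | 0.07 | 1.00E-09 | 9,721 | 0.07 | 7.28E-07 | Yes |
| Glucose | rs7607980 | 2:165551201 | T | C | 46,027 | -0.05 | 4.27E-09 | 25,312 | -0.04 | 2.09E-03 | Yes |
| Glucose | rs896854 | 8:95960511 | T | C | 46,027 | -0.04 | 1.55E-09 | 25,311 | 0.01 | 3.64E-01 | No |
| Glucose | rs9273364 | 6:32626302 | T | G | 46,027 | 0.05 | 2.63E-11 | 24,801 | 0.05 | 3.10E-06 | Yes |
| HgbA1C | rs3130628 | 6:31609272 | T | C | 17,407 | -0.08 | 1.23E-08 | 7,340 | 0.03 | 3.79E-02 | No |
| HCO3 (CO2) | rs1799913 | 11:18047255 | T | G | 45,932 | -0.04 | 5.89E-09 | 25,219 | -0.04 | 7.82E-07 | Yes |
| HCO3 (CO2) | rs77375846 | 2:103155075 | T | C | 45,932 | -0.10 | 9.33E-25 | 25,217 | -0.06 | 2.78E-05 | Yes |
| IGranRE | rs13284665 | 9:131513370 | A | G | 30,683 | 0.22 | 6.61E-74 | QC Fail | N/A | N/A | No |
| IGranAB | rs13284665 | 9:131513370 | A | G | 30,744 | 0.13 | 6.76E-35 | QC Fail | N/A | N/A | No |
| K | rs10039139 | 5:137164863 | T | G | 45,941 | 0.07 | 8.32E-16 | 25,211 | 0.06 | 1.83E-06 | Yes |
| Lipase | rs9377343 | 6:96512220 | A | G | 12,649 | -0.10 | 4.79E-14 | 5,564 | -0.08 | 3.60E-05 | Yes |
| Lipase | rs8051363 | 16:75255217 | A | G | 12,649 | 0.13 | 2.00E-20 | 5,549 | 0.07 | 8.39E-04 | Yes |
| MCHC | rs12352830 | 9:80041132 | C | G | 46,157 | -0.04 | 4.37E-08 | 26,243 | -0.04 | 5.77E-05 | Yes |
| MonoRE | rs117358683 | 12:44145965 | A | G | 32,594 | -0.23 | 2.69E-08 | 16,185 | 0.04 | 4.07E-01 | No |
| MPV | rs11212635 | 11:108310702 | A | T | 40,058 | 0.04 | 9.55E-09 | 17,333 | -0.01 | 3.68E-01 | No |
| TProt | rs8022180 | 14:103263020 | A | G | 38,352 | 0.04 | 7.24E-10 | 19,665 | 0.03 | 2.63E-03 | Yes |
| Trigs | rs6847598 | 4:76750356 | T | C | 23,963 | -0.05 | 1.58E-08 | 12,526 | -0.03 | 1.48E-02 | Yes |
| TSH | rs12590163 | 14:105223525 | T | C | 27,441 | -0.05 | 4.68E-08 | 17,042 | -0.04 | 6.76E-04 | Yes |
| TSH | rs310766 | 3:12233482 | A | G | 27,441 | -0.06 | 1.66E-08 | 17,079 | -0.05 | 1.42E-05 | Yes |
| TSH | rs9275141 | 6:32651117 | T | G | 27,441 | 0.05 | 3.47E-09 | 17,054 | 0.04 | 8.64E-04 | Yes |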
We performed a replication analysis of the 31 lead SNPs using an independent cohort of 29,043 BioVU patients that became available after the initiation of our primary analysis. One SNP, potentially novel for both immature granulocyte measures, failed QC filtering in the replication cohort and could not be tested for replication. In total, we replicated 22 of the 31 (71%) novel associations (Table 2). Among the 22 replicated novel associations are the first associations for amylase (Amyl) and bicarbonate (CO2). We identified and replicated additional associations for alanine aminotransferase (ALT), alkaline phosphatase (AlkP), relative count of basophils (BasoR), total bilirubin (Bili), calcium (Ca), creatine phosphokinase (CPK), glucose (gluc), mean corpuscular hemoglobin concentration (MCHC), lipase, and thyroid stimulating hormone (TSH). Several of our novel findings are supported by biological or previously published evidence. Three of the associations have recently been identified for the same lab in non-European cohorts. rs855791, a missense variant in TMPRSS6 (transmembrane serine protease 6), and rs8022180, an intronic variant in TRAF3, were shown to be associated with bilirubin and serum total protein level, respectively, in a Japanese population [11]. rs112574791 is in the glutamic-pyruvic transaminase gene GPT, a gene associated with alanine aminotransferase levels in the Korea Biobank [32]. Our results confirm these prior findings and suggest a cross-ethnic effect in European populations. The intronic variant rs8051363 in CTRB1 was associated with both amylase and lipase, clinical assays of pancreas function used to diagnose pancreatitis. While the SNP itself has previously been linked to blood protein measurements [33], the CTRB1 gene encodes chymotrypsin, a digestive enzyme secreted by the pancreas, and was previously shown to be associated with alcoholic chronic pancreatitis [34].
A second novel SNP for lipase, rs9377343, is an intronic variant in FUT9, a gene that showed association with diabetic neuropathy in a trans-ethnic meta-analysis [35]. The amylase-associated SNP rs1930212 resides near three amylase genes (AMY2B, AMY2A and AMY1) on chromosome 1, each of which encodes an enzyme that digests starch into sugar [36]. Copy number variation for amylase genes is hypothesized to have been subject to selective sweeps corresponding to starch content in human diets [37]. The rs1930212 SNP tags a known deletion of AMY2A, which encodes a pancreatic amylase enzyme; the deletion is most common in populations historically lacking starch-rich diets [37]. One of our novel results for calcium, rs2839899, is an intronic variant in GNAQ (G protein subunit alpha q), a signaling protein involved in response to various hormones. Variation in GNAQ is associated with Sturge-Weber syndrome [38], a vascular malformation syndrome that can lead to deposits of calcium (calcification) in the brain. Three SNPs showed associations with glucose. rs7607980 is a missense variant in COBLL1 previously linked to fasting blood insulin and Type 2 diabetes (T2D) [39-41]. rs9273364 is located near HLA-DQB1-AS1, a gene associated with T2D [42]. And, although it did not replicate in our analysis, rs896854, a variant mapping to both NDUFAF6 and TP53INP1, has recent associations with T2D [43] and eosinophil count [44] among UK Biobank participants. We note that several associations occurred within the HLA region on chromosome 6, notably for glucose, hemoglobin A1C, and TSH. These variants are likely segregating with HLA types, which are strongly associated with various autoimmune diseases, including diabetes and autoimmune thyroiditis, that have strong effects on these particular labs.

Genetic correlation of clinical labs

We computed the genetic correlation between pairs of labs using LD score regression [45] to learn about the shared genetic basis of these traits (Fig 4). We restricted analysis to the 50 lab traits with heritability of at least 7%, based on recommendations by the developers of LD score regression that estimation of genetic correlation can be unreliable when one of the traits has heritability close to zero. We observe strong positive correlations among lab traits of similar function. For example, the liver enzymes alanine aminotransferase (ALT) and aspartate aminotransferase (AST) were strongly correlated, as were the measures of renal function Blood Urea Nitrogen (BUN) and creatinine (Creat). Prothrombin time (PT), a measure of clot formation time, and its derivative measure, International Normalized Ratio (INR), were positively correlated as expected. More surprisingly, INR was also positively correlated with vitamin D. While vitamin K is known to be required for the formation of prothrombin, the correlation with vitamin D suggests a potential covariance in nutrition or nutrient absorption.
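For reference, the genetic correlation estimated by LD score regression is conventionally defined as the genetic covariance between two traits scaled by their SNP heritabilities. The notation below is ours rather than the paper's and is given only to show why a heritability filter is needed.

```latex
r_g = \frac{\rho_g}{\sqrt{h_1^2 \, h_2^2}}
```

Here \rho_g is the genetic covariance between the two labs and h_1^2, h_2^2 are their SNP heritabilities. As either heritability approaches zero the denominator approaches zero, so small errors in the heritability estimate produce large swings in r_g, which motivates excluding labs with heritability below 7%.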
Fig 4

Pairwise genetic correlation of clinical lab traits.

We restricted to labs with heritability of at least 7%. Squares are colored only for correlations having a p-value <0.05 for the null hypothesis of correlation equal to zero.


A prominent cluster of labs (top right corner of the heatmap) contains primarily white blood cell traits including measures of immature granulocytes, lymphocytes, monocytes and neutrophils. The immature granulocytes also showed a strong correlation with ferritin (ferrit), an iron storage and acute phase protein. Ferritin and immature granulocytes can both be elevated during severe acute inflammation, explaining this correlation. As expected, HgbA1C and glucose were strongly correlated. More interestingly, they also clustered with Red cell Distribution Width (RDW) and Erythrocyte Sedimentation Rate (SedRat). This cluster of labs showed negative associations with high density lipoprotein (HDL), mean cell hemoglobin concentration (MCHC), and mean cell hemoglobin (MCH). This supports a pathophysiology where the metabolic syndrome (obesity, elevated glucose, low HDL) is linked by complex mechanisms to persistent low-level inflammation (elevated SedRat), and anemia of chronic disease (elevated RDW, low MCH, low MCHC). We identified a cluster containing the red cell indices (mean cell hemoglobin concentration (MCHC), mean cell hemoglobin (MCH), and mean cell volume (MCV)) together with total bilirubin (Bili) and transferrin saturation (%SAT). This reflects the biology of hemoglobin: iron is carried to red cell precursors by transferrin and incorporated into heme and thence hemoglobin, red cells are filled with hemoglobin, and at the end of the red cell lifecycle, heme is broken down into bilirubin. Additional clusters include (1) calcium (Ca), albumin (Alb) and total protein in blood (TProt), (2) thyroid stimulating hormone (TSH) and lactate dehydrogenase (LDH), and (3) hematocrit (HCT), red blood cell count (RBC) and hemoglobin (Hgb) with free tetraiodothyronine (FT4).
Albumin (Alb) is the major blood protein, so Alb levels are unsurprisingly correlated with total blood protein (TProt). Calcium homeostasis is driven by free calcium, while albumin acts as a calcium sink; calcium (Ca) levels would therefore reasonably be expected to correlate with Alb [46]. Hematocrit (HCT), red blood cell count (RBC) and hemoglobin (Hgb) are interrelated measures of oxygen carrying capacity in blood and are unsurprisingly correlated. In our study, they are also correlated with free tetraiodothyronine (FT4). Anemia (low HCT, RBC, and Hgb) may be a feature of hypothyroidism (low FT4), and tetraiodothyronine (thyroid hormone) has been reported to play a role in red cell maturation [47,48]. A final cluster was identified linking thyroid stimulating hormone (TSH) to lactate dehydrogenase (LDH). Muscle breakdown, manifesting as weakness, is a feature of hypothyroidism, and therefore other laboratory anomalies seen in hypothyroidism include release of muscle enzymes including LDH [47,49].

Analytic strategies for EHR-derived lab traits

We explored the impact of analytic choices on downstream analysis by performing parallel GWAS analyses in the MGI and BioVU cohorts with one of the analytic steps perturbed from our original analysis: either the individual-level statistic used to summarize longitudinal lab measurements (median, maximum measurement, first available measurement) or the inclusion of covariates for underlying comorbid health conditions. We performed these analyses on the 22 lab traits for which there were at least 20 testable GWAS catalog SNPs, using the catalog SNPs to interpret the effect of each analytic strategy on true risk variants.
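The four individual-level summaries compared here can be sketched as follows for one patient and one lab. The function name, the (date, value) input layout and the example hemoglobin values are hypothetical; any quality control, outlier handling or trait transformation applied in the actual analysis is omitted.

```python
from statistics import mean, median


def summarize_lab(measurements):
    """Collapse one patient's (iso_date, value) pairs for a single lab.

    Returns the default summary (mean) and the three alternatives
    (median, maximum, first available measurement).
    """
    # ISO-formatted date strings sort chronologically as plain text.
    ordered = sorted(measurements, key=lambda m: m[0])
    values = [value for _, value in ordered]
    return {
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
        "first": values[0],
    }


# Hypothetical hemoglobin measurements (g/dL), deliberately out of order.
example = [("2015-03-02", 13.1), ("2014-11-20", 14.2), ("2016-07-15", 11.8)]
summary = summarize_lab(example)
```

Each summary value would then be used in turn as the phenotype in an otherwise identical GWAS, so that differences in catalog SNP p-values can be attributed to the choice of summary.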

Summary statistic

Overall, 13.3% of testable catalog SNPs showed a major change in significance when using the median as opposed to the mean value for the summary statistic (Table 3). The median rarely resulted in a consistent improvement for both MGI and BioVU. Only 0.4% of catalog SNPs had a concordant increase in significance, compared to 7.6% with a concordant decrease and 5.2% with a discordant effect. Creatinine was the sole lab for which using the median lab value produced more catalog SNPs with concordant increased significance than with concordant decreased significance. Even here the effect was small: only two of the 36 catalog SNPs had a concordant increase in significance. In comparison, the first available measurement and the maximum measurement had a greater impact on association p-values for catalog SNPs. In both cases, the alternate summary statistic was most likely to cause a concordant decrease in significance. Using the first available measurement resulted in a concordant increase for only 3.1% of catalog SNPs, whereas 16.9% of catalog SNPs had a concordant decrease and 4.5% had discordant changes in significance. Using the maximum available measure had similar performance (5.6% concordant increase, 18.3% concordant decrease, 5.5% discordant). Despite an overall trend of reducing significance of known risk variants, several related labs for blood oxygen carrying capacity did benefit from using the first available or maximum measurements. Red blood cell count (RBC), hematocrit (HCT) and hemoglobin (Hgb) each showed a concordant increase in significance for several of their respective catalog SNPs without negatively impacting the remaining catalog SNPs. This likely reflects red cell biology. Conditions that decrease oxygen carrying capacity, such as blood loss or iron deficiency, are far more common than those that increase it, such as polycythemia vera or severe obstructive sleep apnea.
Thus, maximum measurement of an individual’s oxygen carrying capacity more likely represents the genetically determined set point.
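The concordant and discordant categories above can be illustrated with a short sketch based on the log10 fold change in p-value between the default (mean) analysis and an alternative analysis in each cohort. The fixed cutoff of one log10 unit is a hypothetical stand-in for the study's own definition of a "major change", and the treatment of SNPs that change in only one cohort is likewise our assumption.

```python
from math import log10


def log_fold_change(p_mean, p_alt):
    """Positive when the alternative statistic gives the smaller p-value."""
    return log10(p_mean) - log10(p_alt)


def classify_snp(mgi_mean_p, mgi_alt_p, biovu_mean_p, biovu_alt_p, threshold=1.0):
    """Label one SNP by how its significance changes in the two cohorts."""
    mgi = log_fold_change(mgi_mean_p, mgi_alt_p)
    biovu = log_fold_change(biovu_mean_p, biovu_alt_p)
    if mgi >= threshold and biovu >= threshold:
        return "concordant increase"
    if mgi <= -threshold and biovu <= -threshold:
        return "concordant decrease"
    if (mgi >= threshold and biovu <= -threshold) or (
        mgi <= -threshold and biovu >= threshold
    ):
        return "discordant"
    # Assumption: a change seen in only one cohort is not counted as major.
    return "no major change"


# Hypothetical p-values: both cohorts gain two orders of magnitude.
label = classify_snp(1e-8, 1e-10, 1e-6, 1e-8)
```

Requiring the same direction of change in both cohorts is what distinguishes a portable analytic improvement from a cohort-specific artifact.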

Controlling for comorbid disease

The comorbidity model, containing binary covariates for 42 comorbid diseases with the potential to alter lab values, produced the largest proportion of catalog SNPs (6.2%) with concordant increased significance in MGI and BioVU among the alternate analysis strategies considered (Table 4). Despite this, a roughly equal number of catalog SNPs had discordant effects (6.8%) between the two cohorts. The clearest example of a substantial and consistent effect on catalog SNPs between MGI and BioVU was for HDL and Mean platelet volume (MPV). Interestingly, in contrast to this result for HDL, LDL had no catalog SNPs with concordant increase in significance and seven catalog SNPs with concordant decrease.

Discussion

This study represents the first cross-health system study of EHR-derived lab traits at large scale. We performed meta-analysis GWAS of 70 lab traits and have made these association results easily accessible to the research community. Thoroughly dissecting each lab-SNP combination is a daunting task. Here, we focused on replication of GWAS catalog variants to validate our data and highlighted novel genetic associations. We anticipate that our full results, including those which do not reach genome-wide significance will be useful in replicating future novel results, in studies which synthesize findings across multiple SNPs, or in hypothesis-driven studies which require less stringent thresholds. Our study serves as a proof-of-principle for performing cross-health-system genetic analysis of EHR-derived lab traits. The high replication rate for known GWAS variants indicates that EHR lab traits can be well-matched between discordant health systems and that measurements taken during real-life medical interactions sufficiently reflect those taken under more idealized experimental conditions. Moreover, this implies that mechanisms underlying variation in lab traits among healthy populations also act in a health system population with diseased individuals, strengthening their clinical relevance. By comparing various analytic strategies, we show that there is no optimal strategy that holds across all lab traits. In fact, we observed many instances in which the alternate analysis simultaneously increased significance for some risk variants and decreased significance for others. Thus, even within a given lab trait, an optimal strategy for variant discovery might not exist. We also considered a summary statistic based on Area Under the Curve for the longitudinal lab data [50,51]. Analysis in the MGI cohort showed that this measure performed consistently worse than the mean lab measurement (S1 Text, S3 Table). 
A potential area of future research would be determining if multiple versions of a lab trait can be combined into an omnibus test that simultaneously increases power across all risk variants. We encourage researchers to use our results across the various analysis strategies to guide decisions about how best to analyze their traits of interest.

The primary strength of our study was the access to two independent biobank cohorts. Using two cohorts provides an increase in sample size and power over analyzing and reporting on each cohort separately. In addition, the two-cohort design adds a built-in internal consistency check to our results by requiring effect sizes to be in the same direction in both cohorts. This additional requirement reduced the potential for unknown biases in the health system cohorts to create spurious results when replicating GWAS catalog SNPs or discovering novel associations. Further, the independent cohorts provided the means to rigorously examine the portability of analytic strategies between biobanks. A similar analysis performed in a single cohort could produce recommendations overfitted to one specific context. Use of multiple sites increases the generalizability of our recommendations. This study was further strengthened by the fortuitous availability of an independent tranche of BioVU samples that provided an immediate replication cohort for the novel findings of our meta-analysis.

Our study has implications for the design and analysis of similar studies in the future. Matching and analyzing lab data between health systems is difficult and requires substantial content knowledge. This study benefited from a multi-disciplinary team consisting of clinical experts to lead the categorization of the raw lab data extracts and statistical geneticists to guide analytic strategies. We leaned heavily on GWAS catalog SNPs to serve as positive controls.
We recommend that researchers incorporate an explicit replication step to validate lab data prior to testing novel hypotheses. Summarizing the longitudinal measurements simply using the mean proved relatively robust across labs but was by no means optimal in all scenarios. Future studies can benefit from considering a summary statistic suited to the specific lab trait being evaluated. Our analysis also highlights that close attention must be paid to differences in the preparation and analysis of EHR phenotypes, particularly longitudinal lab measurements. Failing to replicate a prior finding can be due to lack of a true effect but also to a variety of differences between biobank cohorts and analytic procedures.

We were motivated to examine the effect of controlling for disease status because of its use in the analysis of lab traits in BioBank Japan [11]. Controlling for diseases or risk factors such as tobacco use is a common practice [29]. We considered testing the effect of each disease individually but discarded this approach as cumbersome. Our strategy reflects a broad-spectrum approach in which diagnoses that are rare or have limited effect on a lab can be rationalized as not causing harm by remaining in the model. The effect of controlling for comorbid diseases can be unpredictable. For example, within the components of a lipid panel, controlling for disease status led to a net improvement for HDL catalog SNPs, a net worsening for LDL catalog SNPs, and had cohort-specific impact on triglycerides. From a methodological standpoint, this argues for careful consideration of comorbid disease covariates. From a practical standpoint, the absence of diagnostic data should not be seen as precluding use of clinical lab data.

A limitation of studying clinical labs in real-life cohorts is that some measurements will be affected by medication. We were unable to formally address the effect of medication because of unreliable measurements of medication.
However, it remains an important consideration for future EHR-based lab studies and requires further study. There was indication that in situations where a disease diagnosis is likely to be accompanied by medication, for example a diagnosis of dyslipidemia with lipid labs, controlling for disease status serves as a reasonable proxy for treatment status. As research interest in EHR phenotypes increases, we anticipate that improved capture of prescription data will facilitate the identification of medication effects.

A further limitation of this study is the number of analyzed genetic variants. The study was restricted to ~800K SNPs because BioVU imputed genotypes were unavailable at the time of analysis. Although this limited our ability to discover novel variation, the number of SNPs was more than sufficient to serve the primary purpose of the paper, a proof-of-principle replication analysis across a broad range of clinical labs and analytic strategies. However, there are likely many loci remaining to be discovered for these labs, particularly the understudied traits.

In conclusion, we report the first lab-wide genome-wide association study linking data between two independent EHR-based cohorts. We achieved a high degree of replication of prior associations and report a modest number of new associations. In melding these data sets, we addressed key questions in design and analysis of ‘real world’ data that are increasingly relevant.

List of ICD-10 codes used for defining binary trait comorbidities in MGI and BioVU participants for the comorbidity GWAS model.


Table of 1,313 SNPs extracted from the GWAS Catalog based on prior associations with the lab traits and SNPs considered in this study.

These associations have been reported at least once in a mixed-sex, adult, European-predominant population not selected for the presence of any disease.

Comparison of GWAS results based on the Area Under the Curve (AUC) summary statistic and the default mean value summary statistic.


Methodological description of the GWAS analysis of lab traits using a summary statistic based on Area Under the Curve (AUC).


The following set of scatterplots show the -log10 fold changes in p-value at individual SNPs when comparing GWAS of our default summary statistic (mean) to GWAS based on an alternative statistic (median, maximum or first available).

Please refer to the Methods section for a complete description. The x-axis corresponds to the fold changes for the SNP in MGI and the y-axis corresponds to the fold changes for BioVU. Positive log-fold changes indicate that the alternative statistic yielded a smaller (more significant) p-value than using the mean as a summary statistic. The upper-right (green) quadrant plots SNPs that decreased in p-value in both cohorts for the alternative statistic. The lower-left (blue) quadrant plots SNPs that increased in p-value in both cohorts. The two remaining quadrants indicate SNPs with discordant changes in p-value between the cohorts. GWAS catalog SNPs are plotted in red, novel SNPs for a given lab (if applicable) are plotted in purple, and the remaining SNPs are LD-pruned (for plotting convenience) and plotted in black. The white diamond displays an empirical null distribution of fold changes for non-associated SNPs. Each of the first 22 pages displays the three alternative summary statistics (maximum value, median value, and first available measurement) for a single lab. The following six pages contain the analogous plots showing log fold change in p-values for the comorbidity model, which includes binary covariates for various comorbid diseases with the potential to impact lab measures, relative to a default analysis that does not account for comorbidities.
