Literature DB >> 29666432

Genetic validation of bipolar disorder identified by automated phenotyping using electronic health records.

Chia-Yen Chen^1,2,3,4,5, Phil H Lee^1,3,4,5, Victor M Castro^1,6,7, Jessica Minnier⁸, Alexander W Charney^9,10,11, Eli A Stahl^9,10, Douglas M Ruderfer¹², Shawn N Murphy^7,13,14, Vivian Gainer⁷, Tianxi Cai¹⁵, Ian Jones¹⁶, Carlos N Pato¹⁷, Michele T Pato¹⁷, Mikael Landén^18,19, Pamela Sklar^9,10,11, Roy H Perlis^1,3,6, Jordan W Smoller^20,21,22.

Abstract

Bipolar disorder (BD) is a heritable mood disorder characterized by episodes of mania and depression. Although genomewide association studies (GWAS) have successfully identified genetic loci contributing to BD risk, sample size has become a rate-limiting obstacle to genetic discovery. Electronic health records (EHRs) represent a vast but relatively untapped resource for high-throughput phenotyping. As part of the International Cohort Collection for Bipolar Disorder (ICCBD), we previously validated automated EHR-based phenotyping algorithms for BD against in-person diagnostic interviews (Castro et al. Am J Psychiatry 172:363-372, 2015). Here, we establish the genetic validity of these phenotypes by determining their genetic correlation with traditionally ascertained samples. Case and control algorithms were derived from structured and narrative text in the Partners Healthcare system comprising more than 4.6 million patients over 20 years. Genomewide genotype data for 3330 BD cases and 3952 controls of European ancestry were used to estimate SNP-based heritability (h2g) and genetic correlation (rg) between EHR-based phenotype definitions and traditionally ascertained BD cases in GWAS by the ICCBD and Psychiatric Genomics Consortium (PGC) using LD score regression. We evaluated BD cases identified using 4 EHR-based algorithms: an NLP-based algorithm (95-NLP) and three rule-based algorithms using codified EHR with decreasing levels of stringency-"coded-strict", "coded-broad", and "coded-broad based on a single clinical encounter" (coded-broad-SV). The analytic sample comprised 862 95-NLP, 1968 coded-strict, 2581 coded-broad, 408 coded-broad-SV BD cases, and 3 952 controls. The estimated h2g were 0.24 (p = 0.015), 0.09 (p = 0.064), 0.13 (p = 0.003), 0.00 (p = 0.591) for 95-NLP, coded-strict, coded-broad and coded-broad-SV BD, respectively. The h2g for all EHR-based cases combined except coded-broad-SV (excluded due to 0 h2g) was 0.12 (p = 0.004). These h2g were lower or similar to the h2g observed by the ICCBD + PGCBD (0.23, p = 3.17E-80, total N = 33,181). However, the rg between ICCBD + PGCBD and the EHR-based cases were high for 95-NLP (0.66, p = 3.69 × 10-5), coded-strict (1.00, p = 2.40 × 10-4), and coded-broad (0.74, p = 8.11 × 10-7). The rg between EHR-based BD definitions ranged from 0.90 to 0.98. These results provide the first genetic validation of automated EHR-based phenotyping for BD and suggest that this approach identifies cases that are highly genetically correlated with those ascertained through conventional methods. High throughput phenotyping using the large data resources available in EHRs represents a viable method for accelerating psychiatric genetic research.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Year: 2018 PMID： 29666432 PMCID： PMC5904248 DOI： 10.1038/s41398-018-0133-7

Source DB: PubMed Journal: Transl Psychiatry ISSN： 2158-3188 Impact factor: 6.222

Introduction

Although twin studies first documented the high heritability of bipolar disorder (BD) decades ago, only recently have robustly associated genetic risk loci been identified through genomewide association studies (GWAS)[1-8]. At present, the major rate-limiting step for GWAS of BD is the need for ever-larger sample sizes to detect both common modest-effect variants and rarer large effect variants. In recent years, the widespread adoption of longitudinal electronic health records (EHRs) has provided a vast and growing repository of phenotypic data that can be leveraged for psychiatric research[9]. In particular, when linked to sample collections through biobanks and other efforts, EHR data provide a relatively untapped opportunity to enhance the power of genetic research. Nevertheless, establishing the validity of EHR-derived phenotypes remains an important pre-requisite for leveraging these resources. In an effort to rapidly increase available samples for genomewide studies of BD, we established the International Cohort Collection for Bipolar Disorder (ICCBD) through which we applied high-throughput phenotyping methods at sites in the United States (US), United Kingdom (UK), and Sweden[7]. At the US site (Partners Healthcare), we developed and applied EHR phenotyping algorithms to identify approximately 4500 cases and 5000 controls for whom DNA was obtained from discarded blood samples. The use of EHR data to define valid phenotypes is particularly challenging for psychiatric disorders. Because there are no pathognomonic laboratory or pathologic findings, psychiatric diagnosis has traditionally relied on self-reported symptoms, behavioral observations, and clinical judgment. Thus, genomic studies have typically utilized structured or semi-structured diagnostic interviews as the gold-standard method to establish case and control status. EHR data, on the other hand, are limited to information (e.g., billing codes, medication lists, and narrative notes) collected in the course of clinical care rather than for research purposes. Recognizing this, we have undertaken systematic efforts to evaluate the validity of our EHR-based phenotyping algorithms. In an earlier report[10], we described the development of our automated phenotyping algorithms for BD cases and controls. Briefly, we developed four case definitions, one of which included natural language processing of narrative EHR notes and three based on structured coded data using rule-based classifiers that differed in their stringency. Another rule-based algorithm was developed to identify controls. To establish the clinical validity of these algorithms, we conducted an in-person diagnostic validation study (N = 190) in which algorithm diagnoses were compared to diagnoses made by blinded expert clinicians using a gold-standard in-person diagnostic interview (SCID-IV). Three of the four case definitions achieved high positive predictive value (PPV) compared with diagnostic interviews (up to 0.86) and the PPV for the control algorithm was 1.0. Thus, we demonstrated that automated EHR-based phenotyping can be used to identify clinically valid case and control definitions for BD. However, an important remaining question is whether these case and control sets are genetically comparable to traditionally ascertained samples that have been used in most genomic studies of BD. This is an important issue in evaluating whether EHR-based samples can be combined (e.g., through meta-analyses) with data from other ongoing genomic studies (e.g., by consortia such as the Psychiatric Genomics Consortium) to enhance gene discovery. Here, we report genetic validation of our EHR phenotyping algorithms by using genomewide data to estimate their SNP-based heritability (h2g) and genetic correlation (rg) with other large-scale traditionally ascertained BD GWAS samples. We further examined genetic correlations with other phenotypes of interest and performed genome-wide heterogeneity testing to validate the consistency of genome-wide association results. Our results demonstrate that automated EHR phenotyping can be used to assemble case/control cohorts that are both clinically and genetically comparable to traditionally ascertained samples and thus represent a valuable tool for accelerating psychiatric genetic research.

Materials and methods

Study subjects

Cases and controls were collected as part of the International Cohort Collection for Bipolar Disorder (ICCBD), a US, UK, and Swedish consortium established to accelerate genomic studies of BD by applying high throughput phenotyping methods[7,10]. The Massachusetts General Hospital site of the ICCBD aimed to collect DNA from 4500 cases and 4500 controls by linking discarded blood samples to de-identified EHR data. As described in detail elsewhere[10], cases and controls were identified by deriving EHR-based phenotyping algorithms applied to the Partners Healthcare Research Patient Data Registry (RPDR), which spans more than 20 years of data from 4.6 million patients. As the first step in building our BD phenotyping algorithms, we created a “datamart” of 52,235 individuals by filtering medical records to identify patients seen at Massachusetts General Hospital, Brigham and Women’s Hospital, or McLean Hospital who had at least one diagnosis of bipolar disorder (ICD- 9 and DSM-IV-TR codes 296.4*–296.8*) or manic disorder (ICD 296.0*–296.1*). Next, four phenotyping algorithms were developed to identify cases and one algorithm to identify controls. The development and clinical validation of case and control algorithms described here is adapted from Castro et al. [10]. The five phenotyping algorithms developed comprised the following: 95-NLP: This BD case algorithm incorporated natural language processing (NLP) of narrative notes using the i2b2 suite of software[11]. Expert clinicians manually reviewed 612 notes from 209 randomly selected patients to identify gold-standard cases and to extract relevant features from narrative notes to be processed by NLP. We trained a model based on 414 features to predict the probability of BD using a logistic regression classifier with the adaptive least absolute shrinkage and selection operator (LASSO) procedure. The final model, comprising 13 features, achieved an area under the receiver operating curve (AUC) of 0.93, with a sensitivity of 0.53 when the specificity was set to 0.95. Coded-strict: This algorithm was a rule-based classifier that required at least three ICD codes for BD, a predominance of BD diagnoses in the longitudinal record, and either (a) treatment with lithium or valproate within a year of BD diagnosis or (b) treatment at a bipolar specialty clinic. Coded-broad: This algorithm required at least two ICD codes for BD, a predominance of BD diagnoses, and treatment with at least two bipolar medications (lithium, valproate, carbamazepine, or an atypical antipsychotic). Coded-broad-SV: This algorithm was the same as “Coded-broad” except that two or more BD diagnoses were allowed to occur during the same inpatient or outpatient episode of illness. Controls: This algorithm defined controls as those age 30 years or older with no ICD-9 codes or history of medications related to a psychiatric or neurological condition. As reported earlier, we conducted a direct-interview study to examine the predictive validity of these algorithms. Patients in the Partners Healthcare system who were identified by each algorithm as BD cases or controls were invited by mail to participate. Participants in the validation study were independent of the patient samples on which the algorithms were trained. After informed consent was obtained, participants underwent semistructured diagnostic interviews (SCID-IV) conducted by experienced doctoral-level clinicians blinded to classifier diagnosis. To further preserve clinician blinding, we recruited individuals from MGH clinics who reported a previous diagnosis of schizophrenia or major depressive disorder, disorders commonly considered in the differential diagnosis of BD. The protocol was approved by the Partners HealthCare system institutional review board. A total of 190 participants were interviewed and PPVs for each algorithm were calculated as the proportion of algorithm defined BD cases (or controls) who received a concordant diagnosis by SCID interview. The PPVs for each algorithm using a non-hierarchical approach (where each case was assigned to any algorithm for which they satisfied inclusion criteria) are shown in Table 1 and reported in Castro et al.[10].

Table 1

SNP-based heritability (h2g) for EHR-based bipolar disorder from the Partners Healthcare Research Patient Data Registry

Bipolar disorder algorithms	h²_g (SE)				Sample size
	Liability scale	Observed scale	P-value^b	PPV	cases	controls
95-NLP	0.24 (0.10)	0.25 (0.10)	0.015	0.86	862	3952
Coded-strict	0.09 (0.05)	0.15 (0.08)	0.064	0.84	1968	3952
Coded-broad	0.13 (0.04)	0.22 (0.08)	0.003	0.80	2581	3952
Coded-broad-SV	0.00 (0.11)	0.00 (0.18)	0.591	0.50	408	3952
All except coded-broad-SV	0.12 (0.04)	0.21 (0.07)	0.004	0.83	3013	3952
ICCBD + PGCBD^a	0.23 (0.01)	0.41 (0.02)	3.17 × 10⁻⁸⁰	NA	13902	19279

SNP-based heritability on liability scale was converted from observed scale based on population prevalence of 1%

aICCBD + PGCBD: Bipolar disorder genome-wide association study from the ICCBD and PGC1 with cases ascertained by traditional methods (Charney et al.[7])

bTest for h2g different from 0. PPV: positive predictive values from clinical validation (Castro et al.[10]). 95-NLP: probabilistic algorithm with 95% specificity based on natural language processing. Coded-strict, Coded-broad, Coded-broad-SV: coded rule-based algorithms with decreasing stringency. SV: single visit. SE: standard error

SNP-based heritability (h2g) for EHR-based bipolar disorder from the Partners Healthcare Research Patient Data Registry SNP-based heritability on liability scale was converted from observed scale based on population prevalence of 1% aICCBD + PGCBD: Bipolar disorder genome-wide association study from the ICCBD and PGC1 with cases ascertained by traditional methods (Charney et al.[7]) bTest for h2g different from 0. PPV: positive predictive values from clinical validation (Castro et al.[10]). 95-NLP: probabilistic algorithm with 95% specificity based on natural language processing. Coded-strict, Coded-broad, Coded-broad-SV: coded rule-based algorithms with decreasing stringency. SV: single visit. SE: standard error

DNA sample collection and genotyping

The phenotyping algorithms were applied to the Partners Healthcare system to ascertain case and control DNA samples by linking phenotypic data to discarded blood samples as previously described[10-12]. In brief, case and control medical record numbers are submitted to the Partners HealthCare Crimson system, which acts as an “honest broker” to match deidentified phenotypic data to discarded blood samples under an approved Institutional Review Board (IRB) protocol. Genotyping was performed in five batches that included case and control samples at the Broad Institute of MIT and Harvard using the Illumina PsychChip, a genomewide array that includes ~250,000 common variants as a backbone for GWAS imputation, ~250,000 rare variants, and ~50,000 additional markers focusing on psychiatrically relevant loci.

Genotype quality control (QC) and imputation

A total of 3772 BD cases and 4141 controls with genomewide data were available for this analysis. The number of genotyped BD cases by algorithm and controls are listed in Table 1 and described in Supplementary Figure 1. We performed QC on each genotyping batch separately with the following steps: (1) we removed single nucleotide polymorphisms (SNPs) with genotype missing rate >0.05 before sample-based QC; (2) we excluded samples with genotype missing rate >0.02, absolute value of heterozygosity >0.2, or failed sex checks; (3) after sample-based QC, we then removed SNPs with missing rate >0.02, with differential missing rate between cases and controls >0.02, or failed Hardy–Weinberg equilibrium test (p-value <1.0 × 10−6 in controls and p-value <1.0 × 10−10 in cases). To merge genotyping batches for imputation and analyses, we performed batch QC by removing SNPs with differential missing rate >0.005 between batches or significant batch association (p-value <5.0 × 10−8 between controls form different batches). All QC were conducted using PLINK v1.9[13]. The BD cases and controls included individuals from diverse populations. To control for population stratification and ensure the comparability between the current sample and previous European ancestry BD GWAS, we extracted samples with European ancestry for imputation and analyses. We used the International Haplotype Map project (HapMap3) samples as a population reference panel and performed principal component analysis (PCA) with the study samples and HapMap3 samples combined[14]. We calculated the distance between each study sample and the average European population samples in HapMap3 using PC1 and PC2. We selected the study samples with distance to average European HapMap3 samples <0.01 (Supplementary Figure 2–4)[14]. We also removed one sample from each pair of related or duplicate samples (> 0.2). The final analytic dataset comprised 3330 BD cases (862 95-NLP, 1968 coded-strict, 2581 coded-broad, and 408 coded-broad-SV) and 3952 controls. The sum of the individual cases groups exceeds 3330 due to the non-hierarchical design in which cases were assigned to each phenotype for which they met inclusion criteria. We performed two-step genotype imputation with Eagle2 software for pre-phasing and IMPUTE2 on the European population study samples[15,16].

Statistical analysis

To assess whether our EHR-based phenotypes capture heritable components of BD, we used LD score regression (LDSC)[7,17,18] to estimate SNP-based heritability (h2g) for each EHR-based BD cohort. We then examined the degree to which heritable influences on our BD phenotypes overlap with those traditionally ascertained BD cases in other large-scale GWAS samples. To do this, we used LDSC to compute the genetic correlation (rg) between EHR-based BD samples and previously published BD GWAS by other ICCBD cohorts and the PGC (ICCBD + PGCBD)[7,17,18]. The LDSC requires association summary statistics for genome-wide SNPs to estimate h2g and rg. To obtain these summary statistics, we first performed GWAS for each of the four EHR-based BD definitions separately and for our combined BD case–control sample. We used a BD prevalence of 1% to obtain liability-scale h2g from LDSC[19-22]. Prior studies have documented substantial genetic correlation between BD and other psychiatric disorder phenotypes, most notably schizophrenia (SCZ) and major depressive disorder (MDD)[20]. To examine the genetic relationship between EHR-based BD samples and related phenotypes, we used LD Hub[23] to estimate rg with schizophrenia (SCZ), major depressive disorder (MDD), subjective well-being, and, as a negative control, mean platelet volume (MPV). Finally, we performed genome-wide Cochran’s Q-test to look for heterogeneity between association summary statistics from the EHR-based BD samples and the ICCBD + PGCBD samples at single variant level, using SNPs with association p-value <0.001 in the ICCBD + PGCBD GWAS.

Results

We first estimated SNP-based heritability (h2g) for the four EHR-based BD samples (Table 1). The liability-scale h2g estimates were largest for the 95-NLP BD algorithm (0.24, p = 0.015) and smallest for the coded-broad-SV algorithm (0.0, p = 0.59), with intermediate but statistically significant values for the coded-strict and coded-broad algorithms. The h2g of BD in the ICCBD + PGCBD sample was 0.23, which matches the h2g for the 95-NLP algorithm but is greater than that of the rule-based algorithms. Of note, the coded-broad-SV case set had the least power with only 408 cases. As shown in Table 1, this distribution of heritability estimates mirrors the relative PPVs obtained in our clinical validation study. To maximize the BD case–control sample size, we combined the BD case–control samples across algorithms into a single case–control dataset. Since the coded-broad-SV had no evidence of heritability, we created two combined BD datasets; one included all BD cases and one included all but the coded-broad-SV cases). The h2g was 0.11 (p-value = 0.006) for all algorithms combined BD and 0.12 (p-value = 0.004) for all algorithms excluding coded-broad-SV. We next estimated the SNP-based genetic correlation (rg) between the EHR-based BD samples and the ICCBD + PGCBD samples (Table 2). The rg estimates were 95-NLP (0.66), coded-strict (1.0), and coded-broad (0.74) were all statistically significant. (Note that rg could not be estimated for coded-broad-SV given its h2g of 0). The rg for all algorithms excluding coded-broad-SV was 0.83 (p = 7.19 × 10−7). Adding coded-broad-SV BD cases to the combined case set did not substantially change the rg estimate although the standard error (SE) increased and p-value rose to 2.88 × 10-6. We also estimated the pairwise rg between the EHR-based BD case–control samples and the final combined BD samples. As expected, the rg estimates are high and ranged from 0.90 to 0.98 between algorithms, and were 1.00 between each algorithm and the combined sample (excluding coded-broad-SV). Finally, the rg between ICCBD and PGCBD was 1.00 (SE = 0.065, p-value = 1.45 × 10-74).

Table 2

SNP-based genetic correlation (rg) between EHR-based bipolar disorder and bipolar disorder ascertained by traditional methods from ICCBD + PGCBD

	r_g (SE)	p-value^a
95-NLP	0.66 (0.16)	3.69 × 10⁻⁵
Coded-strict	1.00 (0.29)	2.40 × 10⁻⁴
Coded-broad	0.74 (0.15)	8.11 × 10⁻⁷
All except coded-broad-SV	0.83 (0.17)	7.19 × 10⁻⁷

Genetic correlation was not estimated for coded-broad-SV due to SNP-based heritability estimate of 0

SE standard error

aTest for different from 0

SNP-based genetic correlation (rg) between EHR-based bipolar disorder and bipolar disorder ascertained by traditional methods from ICCBD + PGCBD Genetic correlation was not estimated for coded-broad-SV due to SNP-based heritability estimate of 0 SE standard error aTest for different from 0 Given prior evidence that traditionally ascertained BD GWAS show significant positive genetic correlations with SCZ and MDD[18,20] and significant negative genetic correlation with subjective well-being[24], we examined these correlations using our EHR-based algorithms as another index of their genetic validity. As a negative control, we also examined their genetic correlation with mean platelet volume (MPV), a phenotype for which we would not expect significant genetic correlation. (Fig. 1; Supplementary Table 1). For comparison, we also examined these patterns of correlation with BD in traditionally ascertained samples (ICCBD + PGCBD). As expected based on prior data[18,20,24], EHR-based BD was positively genetically correlated with SCZ and MDD, negatively correlated with subjective well-being, and uncorrelated with MPV (Fig. 1). Of note, the BD-MDD genetic correlation was higher than the BD-SCZ genetic correlation using EHR-based BD (left side, Fig. 1) while the opposite was true using the ICCBD + PGCBD cohort (right side, Fig. 1). We hypothesized that this difference in rg patterns might be related to differences in the proportions of BD subtypes among the EHR-based BD cases and those included in the traditionally ascertained BD samples. To investigate this, we calculated the percentage of BD case subtypes, including bipolar I disorder (BD1), bipolar II disorder (BD2), schizoaffective disorder bipolar type (SAB), and bipolar disorder not otherwise specified (NOS) for the EHR-based BD cases and the ICCBD cases (the subtypes of PGCBD cases were not available). We found that the EHR-based BD cases comprised a lower proportion of SAB subtype cases (0.6–1.6%) compared with the ICCBD samples (9.1%) (Supplementary Figure 5). This difference would be consistent with a relatively smaller BD-SCZ genetic correlation seen in the EHR-based samples compared with the ICCBD samples.

Fig. 1

SNP-based genetic correlation (with 95% confidence interval) between bipolar disorder based on different ascertainment methods and other traits

SNP-based genetic correlation (with 95% confidence interval) between bipolar disorder based on different ascertainment methods and other traits Finally, we performed Cochran’s Q-test to identify potential heterogeneity of the association summary statistics between EHR-based BD samples and the ICCBD + PGCBD samples. This analysis was restricted to SNPs with association p < 0.001 in the ICCBD + PGCBD GWAS in order to exclude SNPs with weak association results whose directionality might be less robust. We identified a single locus with significant heterogeneity across the genome after Bonferroni correction (SNP N = 28,320) for both coded-broad and for the combined EHR-BD sample (excluding coded-broad-SV) (Fig. 2). This locus on chromosome 22 (peak Q-test p-value at rs196065 = 3.34 × 10−7), showed modest association with BD (p-value = 5.78 × 10−5 in ICCBD + PGCBD) and did not overlap with any previously reported BD-associated loci. Thus, we found negligible evidence of heterogeneity of genomewide association results between EHR-based BD and traditionally ascertained BD.

Fig. 2

Genome-wide Cochran’s Q-test for heterogeneity of SNP effects between ICCBD + PGCBD and EHR-based bipolar disorder.

Red line shows the Bonferroni-corrected significance level for the Q-test. SNPs are selected with association p-value threshold of 0.001 based on ICCBD + PGCBD analysis (total number of SNPs = 28,320)

Genome-wide Cochran’s Q-test for heterogeneity of SNP effects between ICCBD + PGCBD and EHR-based bipolar disorder.

Discussion

As an ever-growing longitudinal repository of the clinical phenome, EHRs represent a new and powerful resource for psychiatric research[9]. Nevertheless, their utility depends on the validity of the clinical and phenotypic data that can be extracted. We have previously demonstrated the feasibility of deriving diagnoses with high predictive value compared with a gold standard of clinician-administered diagnostic interviews[10]. However, in the context of psychiatric genetic research, establishing the genetic validity of these phenotypes is crucial. In the present study, using genomewide genotype data for more than 7000 cases and controls, we demonstrate that EHR-based algorithms can be used to ascertain BD phenotypes that are heritable and genetically comparable to traditionally ascertained samples. Automated algorithm-based phenotyping linked to biospecimens provides substantial efficiencies in terms of the time and costs involved in assembling large-scale samples for genetic research. Prior simulations have documented up to a ten-fold reduction in the cost associated with phenotyping and sample collection[11]. Using our case/control BD definitions linked to discarded blood samples, we were able to collect approximately 5000 controls over 10 weeks and more than 4000 cases over 3 years. As described below, three sets of findings from our analyses are particularly noteworthy. First, our results document that EHR-based diagnostic algorithms can be used to ascertain BD phenotypes that yield SNP-based heritability comparable to that observed in GWAS that have relied on more time-intensive, cost-intensive, and labor-intensive recruitment and diagnostic evaluation. The highest heritability (0.24) was seen with our 95-NLP algorithm which combined NLP of narrative test features and coded EHR data. This estimated heritability was nearly identical to that derived from GWAS of the larger traditionally ascertained cohorts of the international ICCBD and PGC (h2g = 0.23 for 13,902 cases and 19,279 controls). The 95-NLP algorithm also achieved the highest positive predictive value in our previous clinical validation study. For two of the remaining three algorithms which involved rule-based algorithms of structured EHR data, we also observed significant, though relatively lower, heritability estimates (h2g = 0.09–0.12). The least restrictive algorithm (coded-broad-SV) did not exhibit significant heritability, though the small sample size of this subgroup limited the power of our analyses. Of note, this last algorithm also performed poorly in our prior clinical validation study (PPV = 0.5). Nevertheless, the overall heritability of our EHR-based BD was 0.12 (p = 0.004), dropping slightly to 0.11 (p = 0.006) when the coded-broad-SV was included. In addition, the EHR-based BD definitions were nearly perfectly genetically correlated. Pairwise genetic correlations between the phenotypes ranged from 0.98 to 1.0 except for 95-NLP and coded-broad-SV (rg = 0.90). Second, we found that our cohorts ascertained by automated EHR phenotyping exhibited substantial genetic correlations (rg) with the large ICCBD + PGCBD samples. Overall, the rg between our EHR-based BD case/control samples and the ICCBD + PGCBD samples was 0.83 (p = 2.88 × 10−6), demonstrating that our approach captures genetic influences that strongly overlap with those acting on BD in traditionally ascertained samples. In addition to providing further genetic validation of EHR-derived phenotypes, these results indicate that such samples can be combined with other existing samples to enhance the power of genetic discovery. Finally, we demonstrate that our phenotyping approach replicates patterns of cross-disorder genetic overlap that have previously been reported in genetic studies of BD[7,25]. In particular, EHR-based BD exhibited positive genetic correlations with SCZ and MDD and negative correlations with subjective well-being. Once again, this supports the genetic validity of our algorithm-defined BD phenotype. Unexpectedly, the genetic correlation with SCZ was less than that seen with MDD, a finding that may be attributable to the relatively low frequency of SAB cases in our sample. We acknowledge that our results have certain limitations. First, our sample size, while substantial, is smaller than that of some other existing samples (e.g., ICCBD and PGCBD), which may have limited the power and precision of our heritability and genetic correlation analyses. Second, the portability of our specific phenotyping algorithms to other healthcare settings remains to be determined. Notably, however, our results demonstrate that a range of algorithms—with and without NLP and using diagnostic rules of varying stringency—yield phenotypes that are clinically and genetically comparable to those obtained by in-person standardized diagnostic assessments. With the exception of the 95-NLP algorithm, each of our case and control phenotypes are based on structured diagnostic codes and medication data that are widely available and standardized in EHR systems, facilitating their application in other systems. In the case of the 95-NLP algorithm, the narrative features comprise nine features (enumerated in Castro et al.[10]) that refer to either a diagnosis, medication, or visit type—none of which we would expect to be unique to our healthcare system. There is now a growing literature demonstrating the portability of EHR-based diagnostic algorithms[26]. In particular, the Electronic Medical Records and Genomics (eMERGE) Network has created a catalog of EHR-based diagnostic phenotypes (PheKB) that includes NLP-based algorithms developed in one healthcare system and applied in others[27]. Finally, diagnostic misclassification could bias or reduce the power of genetic analyses[28,29]. However, this caveat applies to any phenotyping method, and we note that our clinical validation studies[10] demonstrate relatively low levels of diagnostic misclassification in comparison to a gold-standard clinician-administered diagnostic interview. In summary, the current study provides the first genetic validation of EHR-based phenotyping for BD and suggests that automated phenotyping algorithms can identify samples that are highly genetically correlated with those ascertained through conventional methods. Taken together, the present results and those of our prior clinical validation study, suggest that the use of any or all three of the heritable EHR-based algorithms we derived (i.e., 95-NLP, coded-strict, and coded-broad) can facilitate genetic studies of bipolar disorder. High throughput phenotyping using the large data resources available in the EHR database represents a viable method for accelerating psychiatric genetic research.

29 in total

1. A better coefficient of determination for genetic profile analysis.

Authors: Sang Hong Lee; Michael E Goddard; Naomi R Wray; Peter M Visscher
Journal: Genet Epidemiol Date: 2012-04 Impact factor: 2.135

2. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.

Authors: Brendan K Bulik-Sullivan; Po-Ru Loh; Hilary K Finucane; Stephan Ripke; Jian Yang; Nick Patterson; Mark J Daly; Alkes L Price; Benjamin M Neale
Journal: Nat Genet Date: 2015-02-02 Impact factor: 38.330

3. Instrumenting the health care enterprise for discovery research in the genomic era.

Authors: Shawn Murphy; Susanne Churchill; Lynn Bry; Henry Chueh; Scott Weiss; Ross Lazarus; Qing Zeng; Anil Dubey; Vivian Gainer; Michael Mendis; John Glaser; Isaac Kohane
Journal: Genome Res Date: 2009-07-14 Impact factor: 9.043

4. An Empirical Study for Impacts of Measurement Errors on EHR based Association Studies.

Authors: Rui Duan; Ming Cao; Yonghui Wu; Jing Huang; Joshua C Denny; Hua Xu; Yong Chen
Journal: AMIA Annu Symp Proc Date: 2017-02-10

5. Genetic basis of autoantibody positive and negative rheumatoid arthritis risk in a multi-ethnic cohort derived from electronic health records.

Authors: Fina Kurreeman; Katherine Liao; Lori Chibnik; Brendan Hickey; Eli Stahl; Vivian Gainer; Gang Li; Lynn Bry; Scott Mahan; Kristin Ardlie; Brian Thomson; Peter Szolovits; Susanne Churchill; Shawn N Murphy; Tianxi Cai; Soumya Raychaudhuri; Isaac Kohane; Elizabeth Karlson; Robert M Plenge
Journal: Am J Hum Genet Date: 2011-01-07 Impact factor: 11.025

6. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis.

Authors:
Journal: Lancet Date: 2013-02-28 Impact factor: 79.321

7. Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus near ODZ4.

Authors:
Journal: Nat Genet Date: 2011-09-18 Impact factor: 38.330

8. Validation of electronic health record phenotyping of bipolar disorder cases and controls.

Authors: Victor M Castro; Jessica Minnier; Shawn N Murphy; Isaac Kohane; Susanne E Churchill; Vivian Gainer; Tianxi Cai; Alison G Hoffnagle; Yael Dai; Stefanie Block; Sydney R Weill; Mireya Nadal-Vicens; Alisha R Pollastri; J Niels Rosenquist; Sergey Goryachev; Dost Ongur; Pamela Sklar; Roy H Perlis; Jordan W Smoller
Journal: Am J Psychiatry Date: 2014-12-12 Impact factor: 18.112

9. Second-generation PLINK: rising to the challenge of larger and richer datasets.

Authors: Christopher C Chang; Carson C Chow; Laurent Cam Tellier; Shashaank Vattikuti; Shaun M Purcell; James J Lee
Journal: Gigascience Date: 2015-02-25 Impact factor: 6.524

10. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs.

Authors: S Hong Lee; Stephan Ripke; Benjamin M Neale; Stephen V Faraone; Shaun M Purcell; Roy H Perlis; Bryan J Mowry; Anita Thapar; Michael E Goddard; John S Witte; Devin Absher; Ingrid Agartz; Huda Akil; Farooq Amin; Ole A Andreassen; Adebayo Anjorin; Richard Anney; Verneri Anttila; Dan E Arking; Philip Asherson; Maria H Azevedo; Lena Backlund; Judith A Badner; Anthony J Bailey; Tobias Banaschewski; Jack D Barchas; Michael R Barnes; Thomas B Barrett; Nicholas Bass; Agatino Battaglia; Michael Bauer; Mònica Bayés; Frank Bellivier; Sarah E Bergen; Wade Berrettini; Catalina Betancur; Thomas Bettecken; Joseph Biederman; Elisabeth B Binder; Donald W Black; Douglas H R Blackwood; Cinnamon S Bloss; Michael Boehnke; Dorret I Boomsma; Gerome Breen; René Breuer; Richard Bruggeman; Paul Cormican; Nancy G Buccola; Jan K Buitelaar; William E Bunney; Joseph D Buxbaum; William F Byerley; Enda M Byrne; Sian Caesar; Wiepke Cahn; Rita M Cantor; Miguel Casas; Aravinda Chakravarti; Kimberly Chambert; Khalid Choudhury; Sven Cichon; C Robert Cloninger; David A Collier; Edwin H Cook; Hilary Coon; Bru Cormand; Aiden Corvin; William H Coryell; David W Craig; Ian W Craig; Jennifer Crosbie; Michael L Cuccaro; David Curtis; Darina Czamara; Susmita Datta; Geraldine Dawson; Richard Day; Eco J De Geus; Franziska Degenhardt; Srdjan Djurovic; Gary J Donohoe; Alysa E Doyle; Jubao Duan; Frank Dudbridge; Eftichia Duketis; Richard P Ebstein; Howard J Edenberg; Josephine Elia; Sean Ennis; Bruno Etain; Ayman Fanous; Anne E Farmer; I Nicol Ferrier; Matthew Flickinger; Eric Fombonne; Tatiana Foroud; Josef Frank; Barbara Franke; Christine Fraser; Robert Freedman; Nelson B Freimer; Christine M Freitag; Marion Friedl; Louise Frisén; Louise Gallagher; Pablo V Gejman; Lyudmila Georgieva; Elliot S Gershon; Daniel H Geschwind; Ina Giegling; Michael Gill; Scott D Gordon; Katherine Gordon-Smith; Elaine K Green; Tiffany A Greenwood; Dorothy E Grice; Magdalena Gross; Detelina Grozeva; Weihua Guan; Hugh Gurling; Lieuwe De Haan; Jonathan L Haines; Hakon Hakonarson; Joachim Hallmayer; Steven P Hamilton; Marian L Hamshere; Thomas F Hansen; Annette M Hartmann; Martin Hautzinger; Andrew C Heath; Anjali K Henders; Stefan Herms; Ian B Hickie; Maria Hipolito; Susanne Hoefels; Peter A Holmans; Florian Holsboer; Witte J Hoogendijk; Jouke-Jan Hottenga; Christina M Hultman; Vanessa Hus; Andrés Ingason; Marcus Ising; Stéphane Jamain; Edward G Jones; Ian Jones; Lisa Jones; Jung-Ying Tzeng; Anna K Kähler; René S Kahn; Radhika Kandaswamy; Matthew C Keller; James L Kennedy; Elaine Kenny; Lindsey Kent; Yunjung Kim; George K Kirov; Sabine M Klauck; Lambertus Klei; James A Knowles; Martin A Kohli; Daniel L Koller; Bettina Konte; Ania Korszun; Lydia Krabbendam; Robert Krasucki; Jonna Kuntsi; Phoenix Kwan; Mikael Landén; Niklas Långström; Mark Lathrop; Jacob Lawrence; William B Lawson; Marion Leboyer; David H Ledbetter; Phil H Lee; Todd Lencz; Klaus-Peter Lesch; Douglas F Levinson; Cathryn M Lewis; Jun Li; Paul Lichtenstein; Jeffrey A Lieberman; Dan-Yu Lin; Don H Linszen; Chunyu Liu; Falk W Lohoff; Sandra K Loo; Catherine Lord; Jennifer K Lowe; Susanne Lucae; Donald J MacIntyre; Pamela A F Madden; Elena Maestrini; Patrik K E Magnusson; Pamela B Mahon; Wolfgang Maier; Anil K Malhotra; Shrikant M Mane; Christa L Martin; Nicholas G Martin; Manuel Mattheisen; Keith Matthews; Morten Mattingsdal; Steven A McCarroll; Kevin A McGhee; James J McGough; Patrick J McGrath; Peter McGuffin; Melvin G McInnis; Andrew McIntosh; Rebecca McKinney; Alan W McLean; Francis J McMahon; William M McMahon; Andrew McQuillin; Helena Medeiros; Sarah E Medland; Sandra Meier; Ingrid Melle; Fan Meng; Jobst Meyer; Christel M Middeldorp; Lefkos Middleton; Vihra Milanova; Ana Miranda; Anthony P Monaco; Grant W Montgomery; Jennifer L Moran; Daniel Moreno-De-Luca; Gunnar Morken; Derek W Morris; Eric M Morrow; Valentina Moskvina; Pierandrea Muglia; Thomas W Mühleisen; Walter J Muir; Bertram Müller-Myhsok; Michael Murtha; Richard M Myers; Inez Myin-Germeys; Michael C Neale; Stan F Nelson; Caroline M Nievergelt; Ivan Nikolov; Vishwajit Nimgaonkar; Willem A Nolen; Markus M Nöthen; John I Nurnberger; Evaristus A Nwulia; Dale R Nyholt; Colm O'Dushlaine; Robert D Oades; Ann Olincy; Guiomar Oliveira; Line Olsen; Roel A Ophoff; Urban Osby; Michael J Owen; Aarno Palotie; Jeremy R Parr; Andrew D Paterson; Carlos N Pato; Michele T Pato; Brenda W Penninx; Michele L Pergadia; Margaret A Pericak-Vance; Benjamin S Pickard; Jonathan Pimm; Joseph Piven; Danielle Posthuma; James B Potash; Fritz Poustka; Peter Propping; Vinay Puri; Digby J Quested; Emma M Quinn; Josep Antoni Ramos-Quiroga; Henrik B Rasmussen; Soumya Raychaudhuri; Karola Rehnström; Andreas Reif; Marta Ribasés; John P Rice; Marcella Rietschel; Kathryn Roeder; Herbert Roeyers; Lizzy Rossin; Aribert Rothenberger; Guy Rouleau; Douglas Ruderfer; Dan Rujescu; Alan R Sanders; Stephan J Sanders; Susan L Santangelo; Joseph A Sergeant; Russell Schachar; Martin Schalling; Alan F Schatzberg; William A Scheftner; Gerard D Schellenberg; Stephen W Scherer; Nicholas J Schork; Thomas G Schulze; Johannes Schumacher; Markus Schwarz; Edward Scolnick; Laura J Scott; Jianxin Shi; Paul D Shilling; Stanley I Shyn; Jeremy M Silverman; Susan L Slager; Susan L Smalley; Johannes H Smit; Erin N Smith; Edmund J S Sonuga-Barke; David St Clair; Matthew State; Michael Steffens; Hans-Christoph Steinhausen; John S Strauss; Jana Strohmaier; T Scott Stroup; James S Sutcliffe; Peter Szatmari; Szabocls Szelinger; Srinivasa Thirumalai; Robert C Thompson; Alexandre A Todorov; Federica Tozzi; Jens Treutlein; Manfred Uhr; Edwin J C G van den Oord; Gerard Van Grootheest; Jim Van Os; Astrid M Vicente; Veronica J Vieland; John B Vincent; Peter M Visscher; Christopher A Walsh; Thomas H Wassink; Stanley J Watson; Myrna M Weissman; Thomas Werge; Thomas F Wienker; Ellen M Wijsman; Gonneke Willemsen; Nigel Williams; A Jeremy Willsey; Stephanie H Witt; Wei Xu; Allan H Young; Timothy W Yu; Stanley Zammit; Peter P Zandi; Peng Zhang; Frans G Zitman; Sebastian Zöllner; Bernie Devlin; John R Kelsoe; Pamela Sklar; Mark J Daly; Michael C O'Donovan; Nicholas Craddock; Patrick F Sullivan; Jordan W Smoller; Kenneth S Kendler; Naomi R Wray
Journal: Nat Genet Date: 2013-08-11 Impact factor: 38.330

9 in total

1. Diagnostic Algorithms to Study Post-Concussion Syndrome Using Electronic Health Records: Validating a Method to Capture an Important Patient Population.

Authors: Jessica Dennis; Aaron M Yengo-Kahn; Paul Kirby; Gary S Solomon; Nancy J Cox; Scott L Zuckerman
Journal: J Neurotrauma Date: 2019-03-28 Impact factor: 5.269

2. Electronic Health Records Are the Next Frontier for the Genetics of Substance Use Disorders.

Authors: Sandra Sanchez-Roige; Abraham A Palmer
Journal: Trends Genet Date: 2019-02-21 Impact factor: 11.639

Review 3. Using Phecodes for Research with the Electronic Health Record: From PheWAS to PheRS.

Authors: Lisa Bastarache
Journal: Annu Rev Biomed Data Sci Date: 2021-07-20

4. The importance of social activity to risk of major depression in older adults.

Authors: Euijung Ryu; Gregory D Jenkins; Yanshan Wang; Mark Olfson; Ardesheer Talati; Lauren Lepow; Brandon J Coombes; Alexander W Charney; Benjamin S Glicksberg; J John Mann; Myrna M Weissman; Priya Wickramaratne; Jyotishman Pathak; Joanna M Biernacka
Journal: Psychol Med Date: 2021-11-12 Impact factor: 10.592

5. Exome sequencing in bipolar disorder identifies AKAP11 as a risk gene shared with schizophrenia.

Authors: Duncan S Palmer; Daniel P Howrigan; Sinéad B Chapman; Rolf Adolfsson; Nick Bass; Douglas Blackwood; Marco P M Boks; Chia-Yen Chen; Claire Churchhouse; Aiden P Corvin; Nicholas Craddock; David Curtis; Arianna Di Florio; Faith Dickerson; Nelson B Freimer; Fernando S Goes; Xiaoming Jia; Ian Jones; Lisa Jones; Lina Jonsson; Rene S Kahn; Mikael Landén; Adam E Locke; Andrew M McIntosh; Andrew McQuillin; Derek W Morris; Michael C O'Donovan; Roel A Ophoff; Michael J Owen; Nancy L Pedersen; Danielle Posthuma; Andreas Reif; Neil Risch; Catherine Schaefer; Laura Scott; Tarjinder Singh; Jordan W Smoller; Matthew Solomonson; David St Clair; Eli A Stahl; Annabel Vreeker; James T R Walters; Weiqing Wang; Nicholas A Watts; Robert Yolken; Peter P Zandi; Benjamin M Neale
Journal: Nat Genet Date: 2022-04-11 Impact factor: 41.307

6. The Role of Genetics in Bipolar Disorder.

Authors: Chiara Fabbri
Journal: Curr Top Behav Neurosci Date: 2021

7. Using whole genome scores to compare three clinical phenotyping methods in complex diseases.

Authors: Wenyu Song; Hailiang Huang; Cheng-Zhong Zhang; David W Bates; Adam Wright
Journal: Sci Rep Date: 2018-07-27 Impact factor: 4.379

8. Defining Major Depressive Disorder Cohorts Using the EHR: Multiple Phenotypes Based on ICD-9 Codes and Medication Orders.

Authors: Wendy Marie Ingram; Anna M Baker; Christopher R Bauer; Jason P Brown; Fernando S Goes; Sharon Larson; Peter P Zandi
Journal: Neurol Psychiatry Brain Res Date: 2020-02-21

9. Automated Phenotyping Tool for Identifying Developmental Language Disorder Cases in Health Systems Data (APT-DLD): A New Research Algorithm for Deployment in Large-Scale Electronic Health Record Systems.

Authors: Courtney E Walters; Rachana Nitin; Katherine Margulis; Olivia Boorom; Daniel E Gustavson; Catherine T Bush; Lea K Davis; Jennifer E Below; Nancy J Cox; Stephen M Camarata; Reyna L Gordon
Journal: J Speech Lang Hear Res Date: 2020-08-11 Impact factor: 2.297

9 in total