Marc P Maurits, Ilya Korsunsky, Soumya Raychaudhuri, Shawn N Murphy, Jordan W Smoller, Scott T Weiss, Thomas W J Huizinga, Marcel J T Reinders, Elizabeth W Karlson, Erik B van den Akker, Rachel Knevel.
Abstract
OBJECTIVE: To facilitate the identification of patient disease subsets and risk factors by constructing a pipeline that is generalizable, provides easily interpretable results, and allows replication by overcoming batch effects in electronic health records (EHRs).
Keywords: ICD; PhenoGraph; clustering; eMERGE; electronic health records; electronic medical records
Year: 2022 PMID: 35139533 PMCID: PMC9122640 DOI: 10.1093/jamia/ocac008
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 7.942
Figure 1. Full pipeline flowchart. Overview of the full pipeline described in this manuscript. As indicated by the legend, steps in the green field are part of Harmony and those in the purple field are part of PhenoGraph.
Figure 2. The effects of dataset harmonization. Overview showing (A) the t-SNE embedding of all 102,880 individuals colored by LISI score prior to harmonization with Harmony and (B) the same embedding post-harmonization, as well as (C) an example showing that relevant structure is maintained by (D) not forcing dataset mixing where local structure is best represented by a small selection of datasets (developmental disorders in children’s hospitals, green arrows in A and B).
Figure 3. Prostate cancer cluster PheSpec composition. PheSpec composition showing the phenotypic profile (= profile of medical events) of one of the identified clusters as captured by the methodology described in this paper. The main PheSpec graph (A) is a representation of the harmonized dataset in which the frequency (y-axis) reflects the proportion of total cluster members with each of the RSP-filtered top 500 phenotypic codes (PheCodes) (x-axis); the 3 most prevalent codes are labeled. In the main PheSpec, all PheCodes are grouped and colored by ICD chapter (Supplementary Material 1). Table (B) shows the prevalence of the 10 codes most predictive of cluster membership (selected by elastic net); a white background indicates a positive predictor, a gray background a negative one. The miniatures (C) show the replication of the prostate cancer phenotype cluster across the separate cohorts, obtained by splitting the cluster into its individual centers. Correlations of the cluster’s phenotypic profile between centers are shown as a heatmap (D). Localization of the cluster in t-SNE space is shown in purple (E).
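The profile and between-center comparison described in this caption can be illustrated with a minimal sketch. It assumes each cluster member is represented as a set of PheCodes, and it uses the Pearson correlation, which is an assumption: the caption does not state which correlation measure was used. The function names `phenotypic_profile`, `pearson`, and `center_correlations` are hypothetical and not taken from the paper.

```python
from math import sqrt


def phenotypic_profile(members, codes):
    """Proportion of cluster members carrying each code, aligned with `codes`.

    members: non-empty list of sets of PheCodes, one set per patient.
    """
    n = len(members)
    return [sum(1 for m in members if c in m) / n for c in codes]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumption: neither sequence is constant (otherwise the
    denominator is zero and a ZeroDivisionError is raised).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def center_correlations(members_by_center, codes):
    """Pairwise correlation of one cluster's profile between centers.

    members_by_center: dict mapping a center name to the list of
    code sets of the cluster members seen at that center.
    """
    profiles = {
        center: phenotypic_profile(members, codes)
        for center, members in members_by_center.items()
    }
    centers = sorted(profiles)
    return {(a, b): pearson(profiles[a], profiles[b])
            for a in centers for b in centers}
```

The resulting dictionary of pairwise values is the kind of matrix that could be drawn as a heatmap; the per-center profiles correspond to the miniatures.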
Figure 4. Prevalence-rank plot “other headache syndromes.” Clusters of interest for PheCode “Other headache syndromes.” The prevalence-rank plot depicts the proportion of patients in the cluster with the code of interest on the y-axis and the prevalence rank of the code within the cluster on the x-axis. The prevalence of PheCode 339 (“Other headache syndromes”) in the entire set was 0.27 (dotted line). We labeled the clusters (arbitrary cluster identifier) in which the prevalence of this code was higher than its overall prevalence and in which the code was among the cluster's 10 most prevalent codes.
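The labeling rule in this caption (prevalence above the overall prevalence, and the code within the cluster's top 10) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: patients are given as (cluster identifier, set of PheCodes) pairs, tied codes share the best rank, and the function names `prevalence_rank` and `clusters_of_interest` are hypothetical.

```python
from collections import Counter, defaultdict


def prevalence_rank(patients, code):
    """Per cluster, return (prevalence of `code`, its prevalence rank).

    patients: iterable of (cluster_id, set_of_codes) pairs.
    Rank 1 is the most prevalent code in the cluster; tied codes
    share the best rank (an assumption, not stated in the source).
    """
    cluster_sizes = Counter()
    code_counts = defaultdict(Counter)
    for cluster_id, codes in patients:
        cluster_sizes[cluster_id] += 1
        code_counts[cluster_id].update(set(codes))

    result = {}
    for cluster_id, size in cluster_sizes.items():
        counts = code_counts[cluster_id]
        n_with_code = counts.get(code, 0)
        # Rank = 1 + number of codes that are strictly more prevalent.
        rank = 1 + sum(1 for c in counts.values() if c > n_with_code)
        result[cluster_id] = (n_with_code / size, rank)
    return result


def clusters_of_interest(patients, code, top_n=10):
    """Clusters where `code` exceeds its overall prevalence and
    ranks within the cluster's `top_n` most prevalent codes."""
    patients = list(patients)
    overall = sum(1 for _, codes in patients if code in codes) / len(patients)
    stats = prevalence_rank(patients, code)
    return sorted(
        cluster_id
        for cluster_id, (prevalence, rank) in stats.items()
        if prevalence > overall and rank <= top_n
    )
```

Plotting each cluster's (rank, prevalence) pair with a horizontal line at the overall prevalence would give a plot of the kind described in the caption.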
Figure 5. Overview of 6 headache subgroups. N is the number of patients located in the cluster. Location of clusters characterized by “other headache syndromes” in t-SNE space (A) and their corresponding phenotypic profiles (B). The frequency (y-axis) reflects the proportion of total cluster members with each of the RSP-filtered top 500 codes (x-axis). The graphs summarize the data from all cohorts together. For complete PheSpec compositions, see Supplementary Figure S5.