
Comparative analyses of population-scale phenomic data in electronic medical records reveal race-specific disease networks.

Benjamin S Glicksberg¹, Li Li¹, Marcus A Badgeley¹, Khader Shameer¹, Roman Kosoy¹, Noam D Beckmann², Nam Pho³, Jörg Hakenberg², Meng Ma², Kristin L Ayers², Gabriel E Hoffman², Shuyu Dan Li², Eric E Schadt², Chirag J Patel³, Rong Chen², Joel T Dudley⁴.

Abstract

MOTIVATION: Underrepresentation of racial groups represents an important challenge and major gap in phenomics research. Most of the current human phenomics research is based primarily on European populations; hence it is an important challenge to expand it to consider other population groups. One approach is to utilize data from EMR databases that contain patient data from diverse demographics and ancestries. The implications of this racial underrepresentation of data can be profound regarding effects on the healthcare delivery and actionability. To the best of our knowledge, our work is the first attempt to perform comparative, population-scale analyses of disease networks across three different populations, namely Caucasian (EA), African American (AA) and Hispanic/Latino (HL).
RESULTS: We compared susceptibility profiles and temporal connectivity patterns for 1988 diseases and 37 282 disease pairs represented in a clinical population of 1 025 573 patients. Accordingly, we revealed appreciable differences in disease susceptibility, temporal patterns, network structure and underlying disease connections between EA, AA and HL populations. We found 2158 significantly comorbid diseases for the EA cohort, 3265 for AA and 672 for HL. We further outlined key disease pair associations unique to each population as well as categorical enrichments of these pairs. Finally, we identified 51 key 'hub' diseases that are the focal points in the race-centric networks and of particular clinical importance. Incorporating race-specific disease comorbidity patterns will produce a more accurate and complete picture of the disease landscape overall and could support more precise understanding of disease relationships and patient management towards improved clinical outcomes. CONTACTS: rong.chen@mssm.edu or joel.dudley@mssm.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2016 PMID: 27307606 PMCID: PMC4908366 DOI: 10.1093/bioinformatics/btw282

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Design, comparison and analytics of disease networks can inform epidemiology and disease biology (Barabasi and Oltvai, 2004; Feldman ; Zanzoni ). Comparative network analyses and network inference have helped in understanding the relative risk of various diseases and characterizing their shared disease architectures (Barabasi ; Cassidy-Bushrow ; Goh ; Lee ; Li , 2014, 2015b; Zhou et al., 2014). Global disease network analyses utilizing biological databases and patient data from electronic medical records (EMR) have emerged as a powerful modality for understanding the complexity of disease relationships (Jensen ; Shameer ). Findings from disease networks have been used to inform drug repurposing (Dudley ), develop therapeutics (Schadt ) and improve patient safety (Stewart ). Phenomics (Bilder ) aims to map and understand the system of phenotypes and their interactions; in clinical studies, a phenotype can include a trait (e.g. height), lab test (e.g. cholesterol levels) or disease (e.g. rheumatoid arthritis). The catalog of phenome-wide associations, which evaluate phenomic correlations of genotypes, is rapidly growing and is currently being leveraged for drug development and drug repositioning (Denny ; Hall ; Namjou ). We recently used EMR-wide phenomic information to identify shared genetic architectures of various diseases (Glicksberg ; Li ; Suthram ), sub-types of type-2 diabetes (Li ), drug repurposing for various indications (Dudley ; Shameer ), disease progression patterns through data stream visualization (Badgeley ; Shameer ), disease risk estimations (Nead ), and genomics-informed, personalized therapy (Dudley , 2015). Similar to the current situation in genomics research, racial groups and related factors remain understudied in phenomics. Most of the current human phenomics research is based primarily on populations of European background. 
Thus, compiling and analyzing data from EMR databases that contain patient data from diverse demographics and racial groups remains a priority. It is clear that racial background represents an overt source of variability in disease risk and mortality (Trepka ). Traditionally, clinicians are required to ‘bridge the inferential gap’, or make clinical decisions for one racial group based on data from another, due to lack of knowledge. Accordingly, the implications of this racial underrepresentation of data can be profound with regard to healthcare delivery and actionability. For example, a previous study found that African American women were twice as likely, and Hispanic women were 50% as likely, to be readmitted to the hospital within 30 days of vaginal or cesarean delivery, even when controlling for socioeconomic status (Aseltine ). Systematic analysis of phenomic data represented in a racially and demographically diverse patient population could reveal precise patterns and further understanding of disease relationships, risk and comorbidity. Previous studies put forth several approaches for the phenomic study of clinical populations. Blair et al. utilized data from the Centers for Medicare and Medicaid Services (CMS) Databases, multiple hospitals across the United States, and the population registry of Denmark (n = 110 million) to discover comorbidity patterns across complex and Mendelian diseases. This work, however, was mainly focused on comparing certain types of diseases (i.e. Mendelian and complex diseases) and did not fully investigate the disease space. Hidalgo et al. created a more expansive phenotype disease network (CMS data, n = 30 million) that incorporated demographic factors, such as sex and race, into the analytics. The authors revealed disparate disease patterns and network connectivity that were due to race, but only between Caucasian and African American populations. 
Jensen et al. extended the field of disease network research by using timescale data to define temporal disease trajectories in a Danish clinical cohort (n = 6.2 million). The researchers identified clusters of diseases that consistently manifested in a particular order (i.e. disease trajectories). These trajectories, however, were built specifically on European population data and may not extend to other racial groups. These studies, while powerful and pioneering, did not sufficiently address the issue of racial diversity in their disease networks, partially due to limitations of their datasets. As such, a particular concern is a lack of representation of Mexican Americans and other Hispanic Americans in healthcare analytics (López-Candales ). As indicated by Hidalgo et al., there is a significant disparity in network structure between racial populations. However, prior phenomic studies have not evaluated racially diverse populations in depth. In the current study, we propose to combine many of the powerful approaches developed in the previous studies and leverage a racially diverse hospital population to compare disease network structure and connectivity between Caucasian, African American and Hispanic/Latino populations. To the best of our knowledge, our work is the first attempt to perform comparative, population-scale analyses of disease networks across three different populations within the same hospital cohort.

2 Methods

We present a schematic of our study design and approach in Figure 1.

Fig. 1.

Workflow of the current study. We outline steps taken in our study from data organization and statistical methodologies to network analytics

2.1 Data sources

2.1.1 Clinical cohort

We performed disease-related analyses on patients from the Mount Sinai Hospital (MSH) located in New York City, NY. The unique location of MSH engenders a racially diverse patient population. The Mount Sinai Data Warehouse, which houses all the clinical data, currently has 4 034 924 unique patients (as of February 2015), over 16 million patient visits recorded, over 1.7 billion patient encounters, and over 46 million International Classification of Diseases (ICD)-9 code cases documented. We performed the following pre-processing and filtering steps on the clinical data from the EMR:

1. We excluded individuals who had not had a healthcare visit since 2003, when the EMR was implemented in the MSH system.
2. We included only individuals with a reported sex and age.
3. We included only individuals with self-identified races of Caucasian (White) [EA], African American (Black) [AA] or Hispanic/Latino [HL].
4. Among individuals with a recorded death, we excluded those without an age of death. For the remaining deceased individuals, we used their age at death as their current age so as not to confound subsequent analyses.
5. In compliance with Protected Health Information (PHI) and Health Insurance Portability and Accountability Act (HIPAA) requirements, we censored the ages of individuals <18 or >90 years old to those limits.

After these filtering steps, a total of 1 025 573 individuals remained for analysis. The mean age within the population is 47.19 ± 24.3 years. The population contained 443 816 (43.27%) males and 581 757 (56.73%) females. The race breakdown of the population is as follows: 621 827 (60.63%) EA, 223 915 (21.83%) AA and 179 831 (17.53%) HL.

2.1.2 Disease classification sources

At the time of this analysis, the MSH EMR system used ICD-9 codes for billing and recording diagnoses. As the ICD-9-CM classification system is fraught with challenges (Hazlewood, 2003), particularly when dealing with rare and/or recently discovered diseases, we utilized a curated ontology of established and documented mappings for clinical studies. Disease Ontology (Schriml ) (DO; July 15th, 2015 release) is an open-source repository that integrates phenotype information relating to human diseases. The Healthcare Cost and Utilization Project (HCUP) has developed Clinical Classifications Software (CCS) (HCUP Clinical Classifications Software (CCS) for ICD-9-CM, 2006–2009), which we used to characterize the individually mapped diseases into broader categories. In the current study, we used the ‘Single-Level Diagnosis’ terms for categorization, which has 202 different categories. For enrichment analyses, we further only kept categories that contained at least 5 diseases, which left 93 categories. The full list of diseases, their respective ICD-9 codes, classification categories, as well as frequencies in our population can be found in the Supplementary materials. We present the disease frequencies (A) and category composition (B) in Figure 2.

Fig. 2.

Disease and category frequency. We show for A disease counts (log10) overall and by EA, AA and HL cohorts. We show for B the distribution of the number of diseases encompassed within each of the 93 used CCS disease categories

2.1.3 Disease filtration

The primary focus of the current study is to compare temporal disease connection patterns across races. Accordingly, we performed several filtering steps on the raw list of 6545 disease terms to prioritize particular diseases of interest best suited for the analysis:

1. We only included diseases that mapped to at least one ICD-9 code.
2. We removed all diseases that were top-level, parent disease categories (e.g. endocrine system disease).
3. We only kept diseases if there were ≥10 affected individuals from each racial group in our cohort.

These filtering steps resulted in a list of 1198 diseases. To assess connectivity between diseases, we then compiled pairs of diseases from this list, which underwent further curation steps. We filtered the raw list of all possible 759 528 disease-pair combinations as follows:

4. We kept a disease pair only if there were ≥10 individuals from each racial group with both diseases in our cohort.
5. We removed disease pairs in which one disease was a complete subset of another.

There were 37 282 disease pairs remaining after these filtering steps.

2.2 Statistical analyses

2.2.1 Deriving race-specific disease dynamics

For each disease of interest, we assessed whether and to what extent demographic factors, namely race, play a role in defining morbidity when controlling for other potentially associated factors. Specifically, we ran a logistic regression adjusting for sex, age and race and assessed whether any demographic covariate was significantly associated with disease risk (Eq. 1). In Eq. 1, βr is categorical piecewise (Caucasian, African American, Hispanic/Latino), βa is a continuous constant per year and βs is binary piecewise (Female/Male). As such, for each disease we compared the effect of race using the EA population as a baseline for disease susceptibility.
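Eq. 1 itself is not reproduced in this text. Based on the coefficient definitions above, one plausible form is the standard logistic regression below. The intercept term β0 and the written-out covariate names are assumptions and are not taken from the source; race is read as a categorical covariate with EA as the reference level.

```latex
\operatorname{logit}\bigl(P(\text{disease})\bigr)
  = \beta_0 + \beta_r \cdot \text{race} + \beta_a \cdot \text{age} + \beta_s \cdot \text{sex}
```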

2.2.2 Disease pair temporal patterns

With the filtered disease pairs compiled, we then sought to determine whether each disease pair had temporal directionality, that is, whether one disease consistently preceded the other. For the overlapping individuals afflicted with both diseases in each pair, we tabulated per patient the ordering of their pathogeneses. Specifically, we counted the patient instances in which one disease preceded the other, and vice versa, or in which both were recorded during the same encounter. We took the earliest recorded date for each disease. If one disease more frequently predated the other, we calculated the cumulative binomial probability that the precedence occurred significantly more often than by chance (Eq. 2). For each disease pair, we assumed a 50% chance that one disease occurs before the other. In Eq. 2, n is the number of individuals with both diseases, r is the number of instances where one disease predates the other, p is the probability of success (0.5) and q is the probability of failure (0.5).
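The directionality test described above can be sketched as an upper-tail binomial probability. This is a minimal illustration, not the authors' implementation: it assumes Eq. 2 is P(X ≥ r) for X ~ Binomial(n, 0.5), the function names are hypothetical, and the default threshold is the P < 1.42 × 10−06 cutoff reported in the Results.

```python
from math import comb


def binomial_upper_tail_half(n: int, r: int) -> float:
    """P(X >= r) for X ~ Binomial(n, 0.5).

    n: number of individuals with both diseases (as defined in the text).
    r: number of individuals in whom one disease predates the other.
    With p = q = 0.5 every ordering pattern has probability 1 / 2**n,
    so the tail is a count of favourable patterns divided by 2**n.
    """
    favourable = sum(comb(n, k) for k in range(r, n + 1))
    return favourable / 2**n


def has_temporal_direction(n: int, r: int, alpha: float = 1.42e-06) -> bool:
    """True if the observed precedence is rarer than alpha under the 50/50 null."""
    return binomial_upper_tail_half(n, r) < alpha
```

Using integer counts before the final division avoids the floating-point underflow that multiplying many 0.5 factors would cause for large n.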

2.2.3 Comorbidity calculation

While one disease may statistically precede another, this does not necessarily mean they have a direct relationship. Accordingly, for each disease pair with significant directionality identified by the previous step, we next determined whether there was significant comorbidity in the clinical population. Specifically, for each of these 37 282 disease pairs, we performed a logistic regression estimating the contribution of the prior disease (i.e. the ‘predictor’ disease) to the risk of developing subsequent sequelae (i.e. the ‘response’ disease) (Eq. 3) for each population separately. In Eq. 3, βd is binary (Yes/No), βa is a continuous constant per year and βs is binary piecewise (Female/Male).
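Eq. 3 is likewise not reproduced in this text. Under the same assumptions as for Eq. 1 (an intercept β0, which is not stated in the source, and written-out covariate names), the comorbidity model fitted within each population would plausibly take the form:

```latex
\operatorname{logit}\bigl(P(\text{response disease})\bigr)
  = \beta_0 + \beta_d \cdot \text{predictor disease} + \beta_a \cdot \text{age} + \beta_s \cdot \text{sex}
```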

2.3 Disease network construction and comparative analytics

Using results from the previous analyses, we generated population-specific disease networks using the Cytoscape (Shannon ) platform (v3.3.0). These networks are composed of directed connections between source and target disease pairs found to be significant in terms of both temporal directionality and connectivity for individual race populations, using β as edge weight. We then performed network metric analyses for each population network using the NetworkAnalyzer (Doncheva ) plugin for Cytoscape. Using these metric statistics, we compared the population networks to determine network structure concordance via metrics (e.g. closeness centrality). Specifically, we performed a one-way analysis of variance (ANOVA) between the metrics for race-cohort networks. We then performed a Tukey HSD test on significant results to determine which race networks differed.

2.3.1 Disease hub identification and categorical enrichments

For each population, we identified ‘hubs’ of connectivity, which are focal points in the network that have many outgoing connections. We defined hubs as diseases that have at least 10 outgoing disease connections: specifically, any predictor disease in a pair (i.e. the disease that predates the other) that is significantly connected to at least 10 diseases within a population. We then evaluated how the composition of the results differed between populations using the 93 different categories of diseases. We first determined whether the identified hub diseases for each population were enriched for any of these categories. We then performed the same analysis on predictor (i.e. earlier) and response (i.e. later) diseases in the significant disease pairs in each population. Specifically, for hub, source and target diseases significant for each population, we performed a one-sided Fisher’s exact test comparing the amount of overlap with diseases of each category.
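The hub definition and the enrichment test above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the edge-list input format are hypothetical, and the Fisher's exact test is written as an upper-tail hypergeometric probability (testing for over-representation), which is one common reading of a one-sided enrichment test.

```python
from collections import defaultdict
from math import comb


def find_hubs(edges, min_out_degree=10):
    """Return diseases with at least min_out_degree distinct outgoing connections.

    edges: iterable of (source, target) pairs, one per significant directed
    disease pair within a single population (hypothetical input format).
    """
    targets = defaultdict(set)
    for source, target in edges:
        targets[source].add(target)
    return {d for d, t in targets.items() if len(t) >= min_out_degree}


def fisher_enrichment_p(hits_in_category, n_hits, category_size, n_total):
    """Upper-tail Fisher's exact P for category over-representation.

    hits_in_category: selected diseases (e.g. hubs) that fall in the category.
    n_hits: number of selected diseases.
    category_size: number of diseases in the category.
    n_total: number of diseases in the background set.
    """
    max_overlap = min(n_hits, category_size)
    favourable = sum(
        comb(category_size, k) * comb(n_total - category_size, n_hits - k)
        for k in range(hits_in_category, max_overlap + 1)
    )
    return favourable / comb(n_total, n_hits)
```

For example, `find_hubs` applied to the significant pairs of one population gives the hub set, and `fisher_enrichment_p` can then be evaluated once per CCS category against the full disease list.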

3 Results

For the current study, we calculated disease connectivity patterns for 1198 diseases in a large, ethnically diverse EMR cohort with 3 well-represented populations and compared them across race-specific networks.

3.1 Effect of race on disease susceptibility prediction

Using the EA cohort as a baseline, we first determined how AA and HL race affects susceptibility to each of the 1198 diseases while controlling for age and sex. In total, we found that a large portion, 968 (81%), of these diseases had some race contribution (Bonferroni corrected P < 4.2 × 10−05) to pathogenesis (Eq. 1). The corresponding trends of race association with disease risk along with selected examples are displayed in Figure 3.
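The reported threshold is consistent with a Bonferroni correction over the 1198 tested diseases. The short check below assumes a family-wise significance level of 0.05, which is not stated explicitly in the text; the function name is illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Assumed family-wise alpha of 0.05 across the 1198 diseases tested.
threshold = bonferroni_threshold(0.05, 1198)
print(f"{threshold:.1e}")  # 4.2e-05
```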

Fig. 3.

Disease susceptibility profiles based on racial group. We present here the distribution of diseases (with highlighted examples) that have statistically significant (Bonferroni corrected P < 4.2 × 10−05) differences in risk profiles for AA and HL cohorts compared to EA. The race beta values refer to the effect size of race when controlling for age and sex, with positive values indicating increased risk compared to EA and vice versa.

We found 731 diseases (61%) for which EA and AA individuals had significantly different risks of affliction, 369 of which were not associated with the HL population. Effect sizes, in terms of β, ranged from −3.70 to 4.12, with positive values indicating increased risk for AA individuals and vice versa. Our data suggest that the AA population is more susceptible to disease acquisition overall: of the significant associations, a large proportion, 580 (79%), were positively associated with AA. Compared to the AA population, there were fewer diseases significantly associated with altered risk profiles for HL individuals. Only 599 (50%) of the diseases were associated with the HL cohort, 237 of which were not associated with AA risk. The effect sizes ranged from −3.55 to 2.66, with fewer diseases, 182 (30%), at increased prevalence in HL, which is the opposite of the trend for the AA population.

3.2 Directionality of race-specific temporal disease pairs

We first determined (Eq. 2) which of the 37 282 disease pairs had significant temporal directionality (i.e. a pair in which one disease significantly precedes the other) for EA, AA and HL populations separately (P < 1.42 × 10−06). For EA, we found 2333 (6.61%) significant temporally related disease pairs, 3311 (9.38%) for AA and 691 (1.96%) for HL. In total, across all populations, we found 6336 (5.99%) disease pairs that were significantly related temporally.

3.3 Race-specific disease pair connectivity patterns

Within each population, for each disease pair that we determined to have significant directionality, we then evaluated (Eq. 3) whether and to what extent the two diseases were connected (P < 1.42 × 10−06). We present the relative distribution of significant disease pairs between each population in Figure 4. We also highlight select pairs unique to each race in Table 1.

Fig. 4.

Distribution of significantly connected disease pairs by racial cohort. We show the number of disease pairs that were significantly temporally related and comorbid for all racial groups (P < 1.42 × 10−06 criteria for both).

Table 1.

Temporal directionality and connectivity significance of selected disease pairs unique to each race cohort

Pop.	Disease 1	Disease 2	P-val	β
EA	Thyroid cancer	Postsurgical hypothyroidism	<6.4E−324	5.25
EA	Lymphosarcoma	Aplastic anemia	<6.4E−324	3.42
EA	Ulcerative colitis	Intestinal obstruction	<6.4E−324	3.27
EA	Toxic diffuse goiter	Postsurgical hypothyroidism	1.3E−153	3.11
EA	Familial hypercholesterolemia	Acute cystitis	<6.4E−324	3.09
AA	Diabetes mellitus, type 2	Diabetic cataract	2.1E−16	5.73
AA	Hyperthyroidism	Toxic diffuse goiter	<6.4E−324	5.10
AA	Chronic ulcer of skin	Osteomyelitis	1.4E−235	4.96
AA	Hypertension	IgA glomerulonephritis	6.4E−75	4.09
AA	HIV disease	Esophageal candidiasis	<6.4E−324	3.87
HL	Diabetes mellitus, type 1	Clostridium difficile colitis	3.3E−73	2.51
HL	Benign essential hypertension	Phobic disorder	5.1E−28	2.25
HL	Coronary artery disease	ARDS	1.7E−61	1.89
HL	Generalized anxiety disorder	Anemia	3.1E−64	1.72
HL	Major depressive disorder	Decubitus ulcer	2.1E−42	1.67

For each population, we determined which temporally related disease pairs had Bonferroni-corrected significant connectivity (P < 1.42 × 10−06). We present particular disease pairs of interest from among the top-25 associations for each population, ranked by effect size. Effect size, or β, can be interpreted as the log odds ratio of disease 2 occurring given disease 1, holding age and sex constant.

We further determined the relative timescale of the latencies between disease pairs across all populations. Specifically, for all significantly comorbid disease pairs common among all racial groups (n = 464), we determined the average latency from the pathogenesis of the first disease to the development of the second within each racial group. The average latency between diseases was 1.67 ± 0.62 years for EA, 2.35 ± 0.94 years for AA and 1.75 ± 0.76 years for HL.
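Because the β values in Table 1 come from logistic regressions, they sit on the log-odds scale, and exponentiating them gives the corresponding odds ratios. The helper below is a minimal sketch with a hypothetical function name. For the largest EA effect in Table 1 (β = 5.25) it yields an odds ratio of roughly 190.

```python
from math import exp


def beta_to_odds_ratio(beta: float) -> float:
    """Convert a logistic regression coefficient (log odds ratio) to an odds ratio."""
    return exp(beta)


# Example with a beta value taken from Table 1 (EA: thyroid cancer -> postsurgical hypothyroidism).
odds_ratio = beta_to_odds_ratio(5.25)
```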

3.4 Race-specific network dynamics

Using the results from the previous sections, we generated unique disease networks for each race cohort as displayed in Figure 5A/B/C. In addition to the varying disease patterns across the cohorts, there were also significant differences in the composition of the networks, which we show in Table 2. Full descriptions of these metrics can be found in the documentation for the Cytoscape NetworkAnalyzer package.

Fig. 5.

Network structure patterns for each racial cohort and hub connectivity. We provide race-specific networks for EA (A), AA (B) and HL (C) populations for disease pairs that were significantly temporally related and comorbid for each group (P < 1.42 × 10−06 criteria for both). Effect size, shown as edge weight, is the increased risk of developing the target disease when having the source, controlling for sex and age. Node size reflects the number of directed, outgoing connections. The larger text refers to diseases identified as hubs for the population.

Table 2.

Metric statistic results across race-specific networks

Metric	EA	AA	HL	P-value	EA/AA (p)	EA/HL (p)	AA/HL (p)	Trend
Closeness centrality	0.27±0.38	0.21±0.35	0.16±0.37	2.0E−03	0.04	2.00E−03	0.28
Clustering coefficient	0.05±0.09	0.08±0.1	0.01±0.05	1.07E−14	1.00E−03	2.10E−06	4.94E−324
Eccentricity	0.78±1.18	0.69±1.21	0.21±0.50	1.80E−08	0.42	1.37E−08	9.95E−07
Edge count	11.34±23.94	13.3±33.54	6.86±15.06	2.30E−02	0.55	0.16	0.02
In-degree	5.67±7.89	6.65±8.16	3.43±3.1	1.98E−06	0.13	1.73E−03	9.03E−07
Neighborhood connectivity	109.22±66.37	289.76±111.47	69.08±34.72	2.97E−77	4.94E−324	5.16E−07	4.94E−324
Out-degree	5.67±23.46	6.65±33.31	3.43±15.48	0.38	–	–	–	–
Stress	8.55±43.08	13.13±64.38	0.29±1.7	1.1E−02	0.38	0.15	8.00E−03

We determined significant differences in network structure across EA, AA and HL networks using a one-way ANOVA to compare average metric statistics for race-cohort networks (P < 0.05). We then performed a Tukey HSD test on significant results to determine specifically which races differed from one another (P < 0.05).

3.5 Population-specific disease hubs

In total, across all populations, we identified 51 unique diseases that were hubs. Many were hubs in multiple populations, with 9 being hubs in all 3 populations. We found 7 diseases that were hubs only in the EA population, 24 only in the AA population and none that were unique to the HL population. We present the sub-network of hub diseases significant to each population, along with their first neighbor connections, in Figure 5D.

3.6 Disease categorical enrichment of connectivity results between populations

From our network connectivity results, we determined whether hub, source (i.e. predictor) and target (i.e. response) diseases significant to each population were enriched for any of the 93 disease categories.

3.6.1 Hub disease categorical enrichment

In total, we found 14 nominally significant (p < 0.05) disease category-hub enrichments. The hubs of all 3 cohorts were most highly enriched for ‘Diabetes mellitus with complications’ (EA: P = 7.0 × 10−04, odds ratio = 23.42; AA: P = 8.0 × 10−04, OR = 28.2; HL: P = 7.0 × 10−03, OR = 84.9). Furthermore, the EA hubs were enriched for ‘Mood disorders’ (P = 0.01, OR = 18.72), ‘Esophageal disorders’ (P = 0.01, OR = 18.7) and ‘Thyroid disorders’ (P = 0.04, OR = 7.7). While the AA hubs were similarly enriched for ‘Mood disorders’ (P = 0.03, OR = 11.0) and ‘Esophageal disorders’ (P = 0.03, OR = 11.0), they were also enriched for ‘Asthma disorders’ (P = 0.01, OR = 11.0), ‘Allergic reactions’ (P = 0.03, OR = 9.1), ‘Anxiety disorders’ (P = 0.03, OR = 9.1) and ‘Other gastrointestinal disorders’ (P = 0.04, OR = 7.8). The HL hubs only were enriched for ‘Asthma disorders’ (P = 0.04, OR = 37.1) and ‘Complications of surgical procedures or medical care’ (P = 0.04, OR = 37.1).

3.6.2 Source disease categorical enrichment

Next, we determined categorical enrichment for source diseases in significant pairs in each race population. Within the EA disease network, there were 136 significant source diseases, 144 for AA and 33 for HL. The source diseases of each race were significantly enriched for ‘Diabetes mellitus with complications’ (EA: P = 0.02, odds ratio = 8.0; AA: P = 0.02, OR = 15.1; HL: P = 4.0 × 10−04, OR = 38.9) and ‘Mood disorders’ (EA: P = 0.04, odds ratio = 6.0; AA: P = 5.0 × 10−03, OR = 10.1; HL: P = 6.0 × 10−04, OR = 29.2). For the EA cohort, source diseases were also enriched for ‘Diseases of white blood cells’ (P = 0.02, OR = 8.0), ‘Esophageal disorders’ (P = 0.04, OR = 6.0) and ‘Thyroid disorders’ (P = 0.02, OR = 4.5). The source diseases of the AA population were likewise enriched for ‘Thyroid disorders’ (P = 3.4 × 10−03, OR = 5.8) but also for ‘Epilepsy/Convulsions’ (P = 5.0 × 10−03, OR = 10.1), ‘Mycoses’ (P = 0.04, OR = 3.15) and ‘Pulmonary heart diseases’ (P = 0.01, OR = 11.3). Like the EA cohort, HL source diseases were enriched for ‘Esophageal disorders’ (P = 0.01, OR = 15.0). Additionally, we found enrichment for ‘Asthma diseases’ (P = 7.0 × 10−03, OR = 25.1) and ‘Other inflammatory skin conditions’ (P = 0.04, OR = 7.5).

3.6.3 Target disease categorical enrichment

Finally, we analyzed categorical enrichment of target diseases, which are the direct connections from source diseases. Overall there were more target diseases than source diseases: 319 target diseases for EA, 454 for AA and 178 for HL. The only disease category significantly enriched in target diseases of all races was ‘Mycoses’ (EA: P = 7.0 × 10−04, odds ratio = 23.42; AA: P = 8.0 × 10−04, OR = 28.2; HL: P = 7.0 × 10−03, OR = 84.9). We found that ‘Diseases of white blood cells’ (P = 4.5 × 10−02, OR = 5.6) and ‘Retinal detachments/defects/vascular occlusions/retinopathies’ (P = 0.01, OR = 2.7) were the only other categories enriched for EA target diseases. For AA, we discovered that ‘Retinal detachments/defects/vascular occlusions/retinopathies’ (P = 3.0 × 10−04, OR = 4.6) was also significantly enriched, along with ‘Cataract diseases’ (P = 0.03, OR = 8.4), ‘Glaucoma diseases’ (P = 7.0 × 10−03, OR = 6.8) and ‘Other diseases of kidney and ureters’ (P = 0.02, OR = 4.5). The HL population had a similar target disease enrichment profile to AA, with categorical enrichments of ‘Cataract diseases’ (P = 4.5 × 10−02, OR = 5.8) and ‘Glaucoma diseases’ (P = 8.8 × 10−03, OR = 5.9). The HL cohort also had enrichments in ‘Diabetes mellitus with complications’ (P = 4.5 × 10−02, OR = 5.8), ‘Gastritis and duodenitis’ (P = 0.03, OR = 8.8) and ‘Hypertension with complications and secondary hypertension’ (P = 0.01, OR = 7.9).

4 Discussion

The results from the current study provide illustrative examples of the extent to which disease susceptibility and connectivity patterns differ between race cohorts, formalizing the need for race-specific risk assessment. Overall, the cross-race individual disease profiles are consistent with known data and expectations (Fig. 3), which is important for implications that can inform follow-up studies. More importantly, our results are in line with findings from related studies. In particular, our findings for the disease temporal patterns in the EA cohort are consistent with the disease trajectories identified by Jensen et al. Direct comparison of results between our study and theirs is difficult, however, namely due to the use of non-identical statistical methodologies and ontological disease mappings (ICD-9 versus ICD-10). Regardless, several similarities are apparent. Firstly, the raw number of disease pairs with significant temporal directionality is consistent between the two studies: there were 4014 such pairs in their study and 2333 in ours, a small difference that can partially be explained by sample size discrepancies. Furthermore, the authors identified related clusters of trajectories that are akin to the hubs of the current study. While there are discrepancies, many focal disease points overlap: Type 2 Diabetes (T2D), for instance, was involved in many trajectories in their study and was a central hub in our EA cohort with 101 Bonferroni-corrected sequelae. Outcome diseases in the Jensen et al. T2D network included ‘retinal disorder’, which corresponds to many target diseases found in ours, including dry-eye syndrome, retinal drusen, peripheral retinal degeneration, and retinal edema. 
‘Chronic renal failure’ and ‘Unspecified renal failure’ were also outcome diseases, which overlap with target diseases such as impaired renal function disease, benign hypertensive renal disease and secondary hyperparathyroidism of renal origin found in our network. Chronic obstructive pulmonary disease, another cluster disease, was found to have similar temporal patterns as well, including ‘Angina’ as a predictor disease (angina pectoris in our results) and ‘Unspecified chronic bronchitis’ as an outcome disease (bronchitis in ours). Comparing our results to those of the Hidalgo et al. study using network approaches to study human phenotype serves as a source to validate our AA and EA networks. In their study, they found certain disease combinations that were differentially comorbid in black (AA) and white (EA) populations, many of which are validated in our findings. They demonstrate that heart diseases, including ‘mitral valve disorders’ and ‘mitral and aortic valve stenosis’ were more comorbid in white males than black males. Similarly, we found both aortic valve disease and mitral valve disease to be hubs in only the EA network. Interestingly, they show that ‘other peripheral vascular disease’ was connected to diseases across networks for both races and we also identified that the same disease is a hub for both these populations. Hidalgo et al. further demonstrate that ‘diabetes’ and ‘hypertension’ were more comorbid in black males than white males. While, in our study, both diseases were found to be hubs for both EA and AA networks, they were more highly connected (diabetes mellitus: 102 connections for EA versus 187 for AA and hypertension: 233 connections for EA versus 377 for AA) in our AA cohort than the EA network. ‘Respiratory abnormality’ was also more comorbid for black males in their study, which can be seen as corresponding to asthma-related disorder categorical enrichments in our identified AA hubs. 
Taken together, the concordance of results between these two studies and ours is extremely encouraging and provides support for the methodologies employed in the current paper and the ensuing HL cohort network discovery.

4.1 Race-centric disease connectivity and network composition disparities

As shown in Table 2, there are noticeable differences between the disease networks, not only between EA and AA or HL but also between AA and HL. Differences in metrics such as average clustering coefficient and eccentricity (which reflects the maximum length between one node and its connections) between all races reaffirm that disease patterns vary considerably across racial backgrounds, regardless of population size. Significant differences between AA and HL network composition emphasize that it is not enough to merely compare EA versus ‘other’ races: the disease network of each racial group requires substantial, individualized investigation. Another interesting component of the comparative analyses of these three networks is the identification of which diseases are common sequelae for each race. We reveal categories of diseases that are enriched only for the HL population. Of particular interest is ‘gastritis and duodenitis’, which is known to have a higher incidence in Hispanic populations, possibly due to increased rates of H. pylori infection (Dehesa ). As gastritis can lead to gastric carcinoma, the diseases that precede it can serve as early warning signs.
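To make the two metrics named above concrete, the following is a minimal sketch of how an average clustering coefficient and a node eccentricity can be computed. It is not the pipeline used in the study: the networks here are treated as undirected for simplicity (the study's temporal networks are directed), the two toy networks are invented for illustration, and all function names are hypothetical.

```python
from collections import deque


def build_undirected(edges):
    """Build an adjacency map {node: set(neighbors)} from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = sorted(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1 for i, u in enumerate(nbrs) for v in nbrs[i + 1:] if v in adj[u]
    )
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def eccentricity(adj, source):
    """Longest shortest-path distance from source to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


# Toy networks (illustrative only, not study data).
dense = build_undirected(
    [("T2D", "HTN"), ("T2D", "CKD"), ("HTN", "CKD"), ("CKD", "retinopathy")]
)
sparse = build_undirected(
    [("T2D", "HTN"), ("HTN", "CKD"), ("CKD", "retinopathy")]
)

for name, net in (("dense", dense), ("sparse", sparse)):
    print(name, round(average_clustering(net), 3), eccentricity(net, "T2D"))
```

In this toy case the denser network has a higher average clustering coefficient and a lower eccentricity for the same starting disease, which is the kind of contrast the network-level comparison in Table 2 summarizes.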

4.2 Impact on healthcare delivery

Identifying diseases that are hubs within a network, especially those that are specific to certain racial groups, can highlight focal areas that warrant particular clinical attention. We show that while diseases may be abundant in multiple populations (Fig. 5D), some diseases are hubs only for certain populations. The AA population clearly has the most hub diseases, both overall and unique to the population, which reflects an increased disease burden. We found, for example, that Type 1 Diabetes (T1D) is a hub disease for only the AA cohort, although there were some interesting associations in the other populations (e.g. T1D to Clostridium difficile colitis in HL). Extensive research on the impact of T1D on the AA population has shown that the AA population indeed has higher incidence rates (Mayer-David ). Results from our hub analysis can extend beyond the simple observation of increased T1D risk in AA to illuminate the subsequent disease pathogeneses specific to AA, including many eye-related diseases such as background diabetic retinopathy, blindness, borderline glaucoma, dry eye syndrome, retinal edema and senile cataract. Knowledge of these associations can be passed along to patients in this group so they can be aware of the increased risk of such complications. Furthermore, there are several AA-specific disease hubs that are not as well established in the literature: diaper rash, constipation and diarrhea are all seemingly mild conditions but, as we show, can lead to a number of disorders, particularly in the AA population. Findings from these network analytics can suggest clinical practice considerations. The AA network, for example, has significantly higher local interconnectivity scores (e.g. clustering coefficient, neighborhood connectivity) compared to both EA and HL. Furthermore, we have shown that AA individuals have, on average, relatively longer latencies between disease comorbidity onsets.
While one could attribute the longer latencies between diseases to slower progression or to less frequent patient visits, these findings nonetheless could inform clinical treatment strategies: if an AA patient is diagnosed with a highly interconnected disease, such as Hepatitis C, the clinician might strongly urge proximate follow-up visits for active screening of comorbid diseases and consideration of preventative or prophylactic treatment. Similar practices are already implemented in the clinic for certain diseases: the 2016 American Diabetes Association Guidelines (American Diabetes Association, 2016) recommend recurrent T2D screening visits for individuals who have hypertension and/or are of a ‘high-risk race/ethnicity’, even if they are completely asymptomatic. The HL network, on the other hand, consistently has lower scores relating to connectivity and clustering (e.g. clustering coefficient, in-degree, neighborhood connectivity and stress). While this pattern may reflect a unique, sparser phenotypic landscape and lower overall disease burden, it may also serve as a reminder of clinical underrepresentation and the need for community outreach (see ‘Limitations’).
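The idea of a hub that is specific to one population can be illustrated with a short sketch. The code below is a simplified illustration under stated assumptions, not the study's method: a hub is defined here as any disease with at least `min_connections` outgoing edges, which is a hypothetical threshold chosen for the toy data, and the edge lists are invented examples rather than results.

```python
from collections import Counter


def out_degrees(edges):
    """Count downstream (target) diseases per source disease."""
    return Counter(source for source, _target in edges)


def hubs(edges, min_connections):
    """Diseases with at least min_connections outgoing edges."""
    return {d for d, n in out_degrees(edges).items() if n >= min_connections}


def cohort_specific_hubs(edges_by_cohort, min_connections):
    """For each cohort, hubs that are not hubs in any other cohort."""
    hub_sets = {
        cohort: hubs(edges, min_connections)
        for cohort, edges in edges_by_cohort.items()
    }
    specific = {}
    for cohort, own in hub_sets.items():
        others = set().union(
            *(h for c, h in hub_sets.items() if c != cohort)
        )
        specific[cohort] = own - others
    return specific


# Toy directed edges (source disease, later disease); illustrative only.
toy_edges = {
    "EA": [("T2D", "retinal edema"), ("T2D", "dry eye syndrome"),
           ("hypertension", "renal disease"), ("hypertension", "angina")],
    "AA": [("T2D", "retinal edema"), ("T2D", "dry eye syndrome"),
           ("hypertension", "renal disease"), ("hypertension", "angina"),
           ("T1D", "retinal edema"), ("T1D", "senile cataract")],
    "HL": [("T2D", "retinal edema")],
}

result = cohort_specific_hubs(toy_edges, min_connections=2)
print(result)
```

With this toy input, T2D and hypertension are hubs in both EA and AA and so are specific to neither, while T1D appears as a hub only in the AA edge list.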

4.3 Limitations

An obvious limitation of the current study is the relative size of the HL population in our analysis. Although the AA cohort was not much larger (21.8% versus 17.5%), it is clear that the AA population was more represented in the disease space (Fig. 2). Another limitation is the type of information available in the EMR data. Our results highlight differences in population disease risk patterns, which in some cases are likely indicative of other potentially confounding factors not captured in the EMR data, such as language barriers, access to healthcare or important environmental or socioeconomic factors. Another possible reason for the sparser HL disease network is higher heterogeneity of the underlying HL population structure. Studies have indeed shown that while Hispanic/Latino populations are traditionally combined into a single ethnic group (as in the current study), there is extensive diversity in terms of cultural backgrounds and genetic ancestry (Gonzalez ), which might be masking associations in our networks.

4.4 Future directions

The discovery of unique disease sequelae between race populations is a promising start, but it reveals how much more has to be done to generate a broader understanding of disease susceptibility patterns across diverse populations. An obvious, urgent need is to introduce and incorporate data from population groups that are almost completely absent in the phenomics space, such as Native Americans and Pacific Islanders. As illustrated by the aforementioned example of HL population diversity, there is a clear need to better stratify potentially overgeneralized cohorts. This can be facilitated by increased sample sizes, more accurate demographic reporting in the EMR and the incorporation of genetic ancestry. Many other population-scale factors would be important to compare across race-centric disease networks. One particular direction warranting further investigation is an examination of the latencies between disease pairs across populations beyond the overall average. Accordingly, by including encounter information and visit frequency, we might be able to identify factors underlying racial group latency discrepancies for each particular pair of comorbid diseases, which may help inform clinical practices. Furthermore, researchers (Patel ) recently demonstrated intricate links between socioeconomic factors, health outcomes and disease risk. Environment-Wide Association Studies (EWAS) (Patel ) have shown the dynamic relationship between disease risk, environmental exposures and genetic profiles. Combining phenomics, subtleties of phenocopies, disease genetics and environmental exposures by zip code within the current dataset can bring us further towards a framework for establishing stratified, precise comorbidity networks for personalized medicine.
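As a rough illustration of examining latencies per disease pair rather than as an overall average, the sketch below groups hypothetical latency records by disease pair and cohort and reports the median for each. The record layout `(cohort, first_disease, second_disease, latency_days)`, the function name and the values are all assumptions made for this example; they are not drawn from the study's data.

```python
from collections import defaultdict
from statistics import median


def median_latency_by_pair(records):
    """Return {(first, second): {cohort: median latency in days}}."""
    days_by_key = defaultdict(list)
    for cohort, first, second, days in records:
        days_by_key[(first, second, cohort)].append(days)

    result = defaultdict(dict)
    for (first, second, cohort), values in days_by_key.items():
        result[(first, second)][cohort] = median(values)
    return dict(result)


# Hypothetical records: days between first and second diagnosis.
toy_records = [
    ("EA", "T2D", "retinal edema", 100),
    ("EA", "T2D", "retinal edema", 200),
    ("EA", "T2D", "retinal edema", 300),
    ("AA", "T2D", "retinal edema", 250),
    ("AA", "T2D", "retinal edema", 350),
]

latencies = median_latency_by_pair(toy_records)
for pair, by_cohort in latencies.items():
    print(pair, by_cohort)
```

A per-pair table of this kind could then be joined with encounter counts or visit frequency, as suggested above, to separate slower progression from less frequent observation.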

44 in total

Review 1. Network biology: understanding the cell's functional organization.

Authors: Albert-László Barabási; Zoltán N Oltvai
Journal: Nat Rev Genet Date: 2004-02 Impact factor: 53.242

2. Latino populations: a unique opportunity for the study of race, genetics, and social environment in epidemiological research.

Authors: Esteban González Burchard; Luisa N Borrell; Shweta Choudhry; Mariam Naqvi; Hui-Ju Tsai; Jose R Rodriguez-Santana; Rocio Chapela; Scott D Rogers; Rui Mei; William Rodriguez-Cintron; Jose F Arena; Rick Kittles; Eliseo J Perez-Stable; Elad Ziv; Neil Risch
Journal: Am J Public Health Date: 2005-10-27 Impact factor: 9.308

3. The human disease network.

Authors: Kwang-Il Goh; Michael E Cusick; David Valle; Barton Childs; Marc Vidal; Albert-László Barabási
Journal: Proc Natl Acad Sci U S A Date: 2007-05-14 Impact factor: 11.205

4. Matching cancer genomes to established cell lines for personalized oncology.

Authors: Joel T Dudley; Rong Chen; Atul J Butte
Journal: Pac Symp Biocomput Date: 2011

5. Personalized medicine: from genotypes, molecular phenotypes and the quantified self, towards improved medicine.

Authors: Joel T Dudley; Jennifer Listgarten; Oliver Stegle; Steven E Brenner; Leopold Parts
Journal: Pac Symp Biocomput Date: 2015

6. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations.

Authors: Joshua C Denny; Marylyn D Ritchie; Melissa A Basford; Jill M Pulley; Lisa Bastarache; Kristin Brown-Gentry; Deede Wang; Dan R Masys; Dan M Roden; Dana C Crawford
Journal: Bioinformatics Date: 2010-03-24 Impact factor: 6.937

7. An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus.

Authors: Chirag J Patel; Jayanta Bhattacharya; Atul J Butte
Journal: PLoS One Date: 2010-05-20 Impact factor: 3.240

8. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets.

Authors: Silpa Suthram; Joel T Dudley; Annie P Chiang; Rong Chen; Trevor J Hastie; Atul J Butte
Journal: PLoS Comput Biol Date: 2010-02-05 Impact factor: 4.475

9. The Racial, Cultural and Social Makeup of Hispanics as a potential Profile Risk for Intensifying the Need for Including this Ethnic Group in Clinical Trials.

Authors: Angel López-Candales; Jaime Aponte Rodríguez; David Harris
Journal: Bol Asoc Med P R Date: 2015 Jul-Sep

10. Detection of pleiotropy through a Phenome-wide association study (PheWAS) of epidemiologic data as part of the Environmental Architecture for Genes Linked to Environment (EAGLE) study.

Authors: Molly A Hall; Anurag Verma; Kristin D Brown-Gentry; Robert Goodloe; Jonathan Boston; Sarah Wilson; Bob McClellan; Cara Sutcliffe; Holly H Dilks; Nila B Gillani; Hailing Jin; Ping Mayo; Melissa Allen; Nathalie Schnetz-Boutaud; Dana C Crawford; Marylyn D Ritchie; Sarah A Pendergrass
Journal: PLoS Genet Date: 2014-12-04 Impact factor: 5.917
