Justin Reese, Hannah Blau, Timothy Bergquist, Johanna J Loomba, Tiffany Callahan, Bryan Laraway, Corneliu Antonescu, Elena Casiraghi, Ben Coleman, Michael Gargano, Kenneth Wilkins, Luca Cappelletti, Tommaso Fontana, Nariman Ammar, Blessy Antony, T M Murali, Guy Karlebach, Julie A McMurry, Andrew Williams, Richard Moffitt, Jineta Banerjee, Anthony E Solomonides, Hannah Davis, Kristin Kostka, Giorgio Valentini, David Sahner, Christopher G Chute, Charisse Madlock-Brown, Melissa A Haendel, Peter N Robinson.
Abstract
Accurate stratification of patients with post-acute sequelae of SARS-CoV-2 infection (PASC, or long COVID) would allow precision clinical management strategies and could enable more focused investigation of the molecular pathogenetic mechanisms of this disease. However, the natural history of long COVID is incompletely understood and characterized by an extremely wide range of manifestations that are difficult to analyze computationally. In addition, the generalizability of machine learning classification of COVID-19 clinical outcomes has rarely been tested. We present a method for computationally modeling long COVID phenotype data based on electronic healthcare records (EHRs) and for assessing pairwise phenotypic similarity between patients using semantic similarity. Using unsupervised machine learning (k-means clustering), we found six distinct clusters of long COVID patients, each with distinct profiles of phenotypic abnormalities with enrichments in pulmonary, cardiovascular, neuropsychiatric, and constitutional symptoms such as fatigue and fever. There was a highly significant association of cluster membership with a range of pre-existing conditions and with measures of severity during acute COVID-19. We show that the clusters we identified in one hospital system were generalizable across different hospital systems. Semantic phenotypic clustering can provide a foundation for assigning patients to stratified subgroups for natural history or therapy studies on long COVID.
Year: 2022 PMID: 35665012 PMCID: PMC9164456 DOI: 10.1101/2022.05.24.22275398
Source DB: PubMed Journal: medRxiv
Fig. 1. Cohort construction.
Patients with long COVID (U09.9 diagnosis) were extracted from the much larger dataset of the N3C. Long COVID patients were selected from the five data partners that provided data for at least 300 U09.9 patients and had an average of at least 7 long COVID HPO terms per patient. The data partner with the most U09.9 patients (data partner 1) was chosen for clustering, and additional U09.9 patients from four other data partners (data partners 2–5) were chosen to assess generalizability.
Fig. 2. Calculating patient semantic similarity based on HPO phenotypes.
A) HPO terms are arranged in a directed acyclic graph in which specific terms such as Bradycardia (HP:0001662) are related to more general terms (here: Arrhythmia; HP:0011675) by subtype relations. An excerpt of the entire ontology (15,247 terms) is shown. B) Example showing a pair of patients with relatively high phenotypic similarity; for each of the HPO terms in patient 1, the best match is sought in patient 2. If an exact match is not found, the algorithm searches for the most informative common ancestor (MICA) in the ontology; the information content (a measure of specificity) of the exactly matching term or of the MICA is calculated to quantify the specificity of the match. For instance, Visual hallucinations (HP:0002367) and Auditory hallucinations (HP:0008765) are not an exact match, so the information content of their MICA, Hallucinations (HP:0000738), is chosen. Hallucinations (HP:0000738) is still relatively specific (and shown in gray), while the MICA of Angina pectoris (HP:0001681) and Hypotension (HP:0002615) is more general (shown in red) and contributes less to the matching score. C) Example of a pair of patients with relatively lower similarity owing to fewer exact (specific) matches and one unmatched term. The pairwise similarity is calculated in this way for all pairs of patients to construct the similarity matrix that is used for clustering (Fig. 3).
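The matching procedure described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy hierarchy contains only a few HPO labels named in the caption, the term frequencies are invented, information content is assumed to be the negative log of term frequency (the usual Resnik definition), and averaging the best-match scores symmetrically over both patients is an assumption about how per-term scores are combined.

```python
import math

# Toy excerpt of an is-a hierarchy (child -> parents). Labels are HPO terms
# named in the caption; the root label and all frequencies are illustrative.
PARENTS = {
    "Visual hallucinations": ["Hallucinations"],
    "Auditory hallucinations": ["Hallucinations"],
    "Hallucinations": ["Phenotypic abnormality"],
    "Bradycardia": ["Arrhythmia"],
    "Arrhythmia": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}

# Hypothetical fraction of patients annotated with each term or a descendant.
FREQ = {
    "Visual hallucinations": 0.02,
    "Auditory hallucinations": 0.03,
    "Hallucinations": 0.05,
    "Bradycardia": 0.10,
    "Arrhythmia": 0.20,
    "Phenotypic abnormality": 1.0,
}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content (assumed definition): -log of the term frequency."""
    return -math.log(FREQ[term])


def mica_ic(term_a, term_b):
    """Information content of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(ic(t) for t in common)


def directed_similarity(patient_1, patient_2):
    """Mean, over terms of patient 1, of the best match found in patient 2."""
    return sum(max(mica_ic(a, b) for b in patient_2) for a in patient_1) / len(patient_1)


def patient_similarity(patient_1, patient_2):
    """Symmetric score: average of the two directed similarities (assumption)."""
    return 0.5 * (directed_similarity(patient_1, patient_2)
                  + directed_similarity(patient_2, patient_1))


p1 = ["Visual hallucinations", "Bradycardia"]
p2 = ["Auditory hallucinations", "Bradycardia"]
print(round(patient_similarity(p1, p2), 3))
```

In this toy example the exact match on Bradycardia contributes its own information content, while the two hallucination terms contribute the lower information content of their shared ancestor, Hallucinations.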
Figure 3. Patient similarity matrix illustrating long COVID subtypes in data partner 1.
A heatmap representing the 6 clusters created by k-means clustering is shown. Cluster hierarchy was calculated using the nearest point algorithm and Euclidean distance.
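The "nearest point algorithm" is another name for single linkage, in which the distance between two groups is the smallest Euclidean distance between any pair of their members. The following Python sketch shows that rule on made-up two-dimensional points; the caption does not state which vectors were compared, so the inputs, the cluster names, and the `merge_order` helper are purely illustrative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_linkage_distance(group_a, group_b):
    """Nearest point rule: the smallest distance between any pair of members."""
    return min(euclidean(u, v) for u in group_a for v in group_b)


def merge_order(groups):
    """Repeatedly merge the two closest groups; return the merges in order."""
    groups = {name: list(points) for name, points in groups.items()}
    merges = []
    while len(groups) > 1:
        names = sorted(groups)
        pairs = ((x, y) for i, x in enumerate(names) for y in names[i + 1:])
        a, b = min(
            pairs,
            key=lambda pair: single_linkage_distance(groups[pair[0]], groups[pair[1]]),
        )
        merges.append((a, b))
        groups[a + "+" + b] = groups.pop(a) + groups.pop(b)
    return merges


# Hypothetical points standing in for three clusters.
toy = {"c1": [(0, 0), (0, 1)], "c2": [(0, 2)], "c3": [(10, 10)]}
print(merge_order(toy))
```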
Characteristics of the study population in data partner 1.
For the overall study population and for each cluster, age, gender, and race/ethnicity are shown. Data for characteristics for which there were fewer than 20 patients, and data about race/ethnicities for which there were fewer than 20 patients overall (Other Non-Hispanic, Native Hawaiian or Other Pacific Islander Non-Hispanic, Asian Non-Hispanic) are not shown to reduce the risk of patient re-identification.
| Characteristic | Overall | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|---|---|
| n | 1233 | 276 | 301 | 195 | 70 | 148 | 243 |
| Acute COVID-19 Inpatient | 424 (34.6%) | 203 (74.1%) | 21 (7.0%) | <20 | 0 | <20 | 170 (70.0%) |
| age - mean (SD) | 51.9 (16.5) | 58.7 (17.6) | 50.0 (15.3) | 48.5 (15.2) | 47.0 (16.4) | 44.6 (13.4) | 55.0 (16.3) |
| Female | 714 (58.2%) | 112 (40.9%) | 182 (60.7%) | 127 (65.5%) | 48 (69.6%) | 104 (70.7%) | 141 (58.0%) |
| Black or African American Non-Hispanic | 60 (4.9%) | <20 | <20 | <20 | <20 | <20 | <20 |
| White Non-Hispanic | 882 (71.9%) | 186 (67.9%) | 228 (76.0%) | 153 (78.9%) | 54 (78.3%) | 107 (72.8%) | 154 (63.4%) |
| Hispanic or Latino Any Race | 202 (16.5%) | 52 (19.0%) | 42 (14.0%) | <20 | <20 | 26 (17.7%) | 53 (21.8%) |
| Unknown race/ethnicity | 58 (4.7%) | <20 | <20 | <20 | <20 | <20 | <20 |
p < 0.001, p < 0.05 by one-way ANOVA (age) or chi-squared test (all others).
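The small-cell masking used in this table (counts below 20 shown as "<20") can be sketched in a few lines of Python. The function name `format_cell` is hypothetical, and showing a true zero as "0" is inferred from the table rather than from a stated rule.

```python
def format_cell(count, total, threshold=20):
    """Format a table cell as 'n (p%)', masking small non-zero counts."""
    if count == 0:
        return "0"  # zeros appear unmasked in the table (inferred)
    if count < threshold:
        return f"<{threshold}"  # mask to limit re-identification risk
    return f"{count} ({100 * count / total:.1f}%)"


print(format_cell(21, 301))
print(format_cell(7, 195))
```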
Figure 4. Phenotypically characterizing long COVID subtype clusters.
Shown are the most frequently co-occurring high-level HPO categories for patients in the overall cohort (A) and for each of the 6 clusters (B). For the overall population of patients in data partner 1 and for each cluster, the frequency of each category of long COVID HPO terms (left) and the frequency of the three most common combinations of HPO categories (top) are shown. Notably, most clusters contain some widely shared features, but also distinguishing features such as symptoms in the pulmonary, neuropsychiatric, and cardiovascular systems. Data are shown as UpSet plots, which visualize set intersections in a matrix layout and show the counts of patients with the combination indicated by the black dots as bars above the matrix.[28] The most commonly occurring HPO category in each cluster is highlighted. HPO term combinations that occur fewer than 20 times are masked to limit the risk of patient re-identification.
Fig. 5. Summary of phenotypic feature distribution in the six clusters. HPO terms are shown if Pearson’s chi-squared test on the numbers of patients in each category with the feature was significant and if at least 20% of patients in at least one cluster had the feature. Terms are grouped in categories shown on the left in this order: laboratory, constitutional, neuropsychiatric, cardiovascular, gastrointestinal, pulmonary, ENT, endocrine/metabolism, and immunological.
Clinical features of patients before acute COVID-19 infection by cluster.
The 13 of 35 clinical features present before COVID-19 infection (Supplemental Table S12) that were significantly overrepresented in clusters (chi-squared p < 0.001 after Bonferroni correction) and the percent of patients in each cluster with each clinical feature are shown.
| Pre-existing Clinical Feature | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|---|
| chronic lung disease | 37.2% | 20.0% | 21.6% | 20.3% | 16.3% | 37.4% |
| peripheral vascular disease | 7.3% | 1.0% | 1.5% | 1.4% | 3.4% | 11.1% |
| systemic corticosteroids | 61.3% | 49.3% | 48.5% | 37.7% | 41.5% | 71.6% |
| kidney disease | 27.0% | 3.0% | 4.6% | 4.3% | 2.7% | 22.6% |
| obesity | 58.8% | 44.3% | 48.5% | 39.1% | 37.4% | 66.3% |
| diabetes (uncomplicated) | 29.9% | 12.0% | 8.8% | 7.2% | 4.8% | 28.8% |
| coronary artery disease | 15.0% | 2.3% | 4.1% | 1.4% | 5.4% | 11.9% |
| diabetes (complicated) | 23.7% | 4.3% | 6.2% | 5.8% | 2.0% | 23.0% |
| hypertension | 46.7% | 25.0% | 28.9% | 21.7% | 17.0% | 49.8% |
| congestive heart failure | 8.8% | 2.0% | 1.0% | 0.0% | 0.7% | 7.8% |
| heart failure | 11.7% | 2.0% | 1.5% | 1.4% | 2.0% | 10.3% |
| depression | 16.4% | 16.0% | 35.6% | 15.9% | 15.0% | 29.2% |
| AKI | 22.6% | 0.7% | 2.6% | 1.4% | 0.7% | 14.0% |
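The selection rule behind this table, a chi-squared test across the six clusters followed by Bonferroni correction over the 35 candidate features, might look roughly like the Python sketch below. The counts are invented for illustration, and `chi2_sf_df5` uses the closed-form upper tail that applies only to 5 degrees of freedom, which is what a 2 × 6 (with/without feature × cluster) table yields; a statistics library would normally supply the p-value.

```python
import math


def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf_df5(x):
    """Upper-tail probability of a chi-squared variable with exactly 5 d.o.f."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3))


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: patients with / without one feature in clusters 1-6.
counts = [
    [130, 75, 55, 15, 25, 120],
    [145, 225, 140, 55, 122, 122],
]
stat = chi_squared_statistic(counts)
adjusted = bonferroni(chi2_sf_df5(stat), n_tests=35)
print(f"chi2 = {stat:.1f}, adjusted p = {adjusted:.3g}, reported = {adjusted < 0.001}")
```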
Clinical features of patients during acute COVID-19 infection by cluster.
The 6 of 9 clinical features present during COVID-19 infection (Supplemental Table S13) that were significantly overrepresented in clusters (chi-squared p < 0.001 after Bonferroni correction) and the percent of patients in each cluster with each clinical feature are shown.
| Clinical Feature during COVID-19 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|---|
| AKI | 20.8% | 0.0% | 0.5% | 0.0% | 0.0% | 12.8% |
| vasopressors | 19.7% | 0.7% | 0.5% | 0.0% | 1.4% | 22.6% |
| IMV | 14.6% | 0.0% | 0.0% | 0.0% | 0.7% | 18.9% |
| remdesivir | 44.2% | 1.7% | 1.5% | 0.0% | 1.4% | 30.5% |
| sepsis | 17.2% | 0.0% | 0.0% | 0.0% | 0.7% | 15.2% |
| corticosteroids | 65.3% | 3.0% | 5.7% | 0.0% | 6.1% | 55.1% |
Generalizability of clusters in patients from new data partners.
The similarity of patients from test data partners 2–5 to patients from clusters generated from data partner 1, and to patients from randomly permuted clusters, was measured as in Fig. 2. For patients from each test data partner, the average similarity to the best matching randomly permuted cluster and to the best matching cluster from data partner 1 is shown, together with the Z-score and p-value. The empirical p-value reflects the number of times that the similarity of a permuted dataset was higher than that of the observed clusters (this never occurred).
| Test data partner | Similarity to permuted clusters | Observed mean similarity | Z-score | Empirical p-value |
|---|---|---|---|---|
| 2 | 0.179±0.000351 | 0.270 | 261.0 | < 0.001 |
| 3 | 0.179±0.000387 | 0.271 | 236.3 | < 0.001 |
| 4 | 0.180±0.000355 | 0.274 | 266.0 | < 0.001 |
| 5 | 0.182±0.000787 | 0.300 | 149.7 | < 0.001 |
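The Z-score and empirical p-value columns can be related to the permutation results with a short Python sketch. The permuted similarity values below are made up, and the number of permutations and the use of the sample standard deviation are assumptions; the source states only that no permuted dataset exceeded the observed similarity.

```python
import statistics


def permutation_summary(observed, permuted):
    """Summarize an observed mean similarity against permuted-cluster values."""
    mean = statistics.mean(permuted)
    sd = statistics.stdev(permuted)  # sample standard deviation (assumption)
    z_score = (observed - mean) / sd
    # Count permutations whose similarity was higher than the observed value.
    n_higher = sum(1 for value in permuted if value > observed)
    if n_higher == 0:
        p_text = f"< {1 / len(permuted):g}"  # never exceeded: report a bound
    else:
        p_text = f"{n_higher / len(permuted):g}"
    return mean, sd, z_score, n_higher, p_text


# Hypothetical similarities to the best matching permuted cluster.
permuted_values = [0.178, 0.179, 0.180, 0.179]
mean, sd, z, n_higher, p_text = permutation_summary(0.270, permuted_values)
print(f"permuted mean = {mean:.3f}, Z = {z:.1f}, empirical p {p_text}")
```

With 1,000 permutations and no exceedances, this convention would print a bound of "< 0.001", the form reported in the table.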