Maria Pikoula1,2, Jennifer Kathleen Quint3,4,5, Francis Nissen3,5, Harry Hemingway6,3, Liam Smeeth3,5, Spiros Denaxas6,3.
Abstract
BACKGROUND: COPD is a highly heterogeneous disease comprising phenotypes with distinct aetiological and prognostic profiles, and current classification systems do not fully capture this heterogeneity. In this study we sought to discover, describe and validate COPD subtypes using cluster analysis on data derived from electronic health records.
Keywords: COPD epidemiology; COPD exacerbations; Cluster analysis; Electronic health records
Year: 2019 PMID: 30999919 PMCID: PMC6472089 DOI: 10.1186/s12911-019-0805-0
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Main experiment steps: (1) Split cohort into Training and Test sets; (2) Apply multiple correspondence analysis (MCA) to the Training set using all 15 potential cluster-generating features, resulting in 3 components; (3) Use the 3 components derived in Step 2 in the k-means algorithm, resulting in k = 5 clusters; (4) Split the Training set into a decision tree classifier (DTC) Training set and a DTC Test set to predict the cluster labels obtained from the k-means algorithm; (5) Train and validate the DTC; (6) Apply the DTC to the Test set to predict cluster labels; (7) Apply MCA to the Test set as in Step 2, resulting in 3 components; (8) Use the 3 components derived in Step 7 in the k-means algorithm, resulting in k = 5 clusters; (9) Compare the cluster assignments in the Test set from Steps 6 and 8 by calculating the Jaccard Index (% of patients overlapping in the same cluster between the two solutions)
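Step 9 compares two labelings of the same Test-set patients. Because k-means numbers its clusters arbitrarily, cluster 2 in one solution need not correspond to cluster 2 in the other, so the labels have to be aligned before overlap can be measured. The sketch below shows one way this could be done and is not necessarily the authors' procedure: for each cluster in the first solution it reports the highest Jaccard index against any cluster in the second. The function names and the toy labels are hypothetical and are not taken from the study.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def members_by_cluster(labels):
    """Map each cluster label to the set of patient positions carrying it."""
    groups = {}
    for patient, label in enumerate(labels):
        groups.setdefault(label, set()).add(patient)
    return groups


def best_match_jaccard(labels_a, labels_b):
    """For every cluster in solution A, the best Jaccard index against any cluster in B."""
    groups_a = members_by_cluster(labels_a)
    groups_b = members_by_cluster(labels_b)
    return {
        label_a: max(jaccard(members_a, members_b) for members_b in groups_b.values())
        for label_a, members_a in groups_a.items()
    }


# Toy labels for eight hypothetical patients (illustrative only, not study data).
dtc_labels = [0, 0, 0, 1, 1, 2, 2, 2]      # e.g. labels predicted by the classifier
kmeans_labels = [1, 1, 1, 0, 0, 2, 2, 0]   # e.g. labels from re-clustering
scores = best_match_jaccard(dtc_labels, kmeans_labels)
```

Here the first cluster is reproduced exactly under a different label number, while the other two each lose or gain one patient, which lowers their index below 1.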
Fig. 2 Patient flow diagram. The top-level exclusion counts are not mutually exclusive; the second-level exclusion counts are given as applied sequentially
Characteristics that were used in the analysis: all patients (Entire cohort) and split by training and testing datasets^a
| Covariate | Level | Entire cohort | Training cohort | Test cohort |
|---|---|---|---|---|
| n | | 30,961 | 23,275 | 7686 |
| Sex (male) | n (%) | 16,885 (54.54) | 12,723 (54.66) | 4163 (54.15) |
| BMI | < 18.5 | 1305 (4.21) | 978 (4.2) | 327 (4.25) |
| | ≥ 18.5, < 25 | 9926 (32.06) | 7461 (32.06) | 2465 (32.07) |
| | ≥ 25, < 30 | 10,358 (33.45) | 7758 (33.33) | 2600 (33.83) |
| | ≥ 30 | 9372 (30.27) | 7078 (30.41) | 2294 (29.85) |
| CRS | | 590 (1.91) | 445 (1.91) | 145 (1.89) |
| Anxiety | | 3123 (10.09) | 2375 (10.2) | 748 (9.73) |
| Atopy | | 3809 (12.3) | 2868 (12.32) | 941 (12.24) |
| Depression | | 3413 (11.02) | 2605 (11.19) | 808 (10.51) |
| Diabetes | | 5001 (16.15) | 3789 (16.28) | 1212 (15.77) |
| Eosinophils > 2% | | 20,363 (65.77) | 15,299 (65.73) | 5064 (65.89) |
| GERD | | 2759 (8.91) | 2108 (9.06) | 651 (8.47) |
| GOLD | 1 | 8077 (26.09) | 6017 (25.85) | 2060 (26.8) |
| | 2 | 15,536 (50.18) | 11,749 (50.48) | 3787 (49.27) |
| | 3 | 6322 (20.42) | 4730 (20.32) | 1592 (20.71) |
| | 4 | 1026 (3.31) | 779 (3.35) | 247 (3.21) |
| Heart failure | | 4685 (15.13) | 3579 (15.38) | 1106 (14.39) |
| Hypertension | | 10,515 (33.96) | 7906 (33.97) | 2609 (33.94) |
| IHD | | 7134 (23.04) | 5379 (23.11) | 1755 (22.83) |
| Smoking | ex | 14,447 (46.66) | 10,920 (46.92) | 3527 (45.89) |
| | current | 16,514 (53.34) | 12,355 (53.08) | 4159 (54.11) |
| Therapy type | none | 11,621 (37.53) | 8775 (37.7) | 2846 (37.03) |
| | mono | 4071 (13.15) | 3018 (12.97) | 1053 (13.7) |
| | dual | 10,261 (33.14) | 7722 (33.18) | 2539 (33.03) |
| | triple | 5008 (16.18) | 3760 (16.15) | 1248 (16.24) |
^a BMI: body mass index; CRS: chronic rhinosinusitis; GERD: gastroesophageal reflux disease; IHD: ischaemic heart disease; GOLD: Global Initiative for Chronic Obstructive Lung Disease
Fig. 3 Silhouette plot of all samples resulting from the (a) HC 5 and (b) k-means cluster solutions. The dotted line represents the average silhouette score. Clusters are not annotated with specific labels at this stage
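For reference, the silhouette score plotted here is conventionally defined per sample as follows. The symbols are introduced only for illustration, and the distance measure the authors used is not stated in this record. Let $a(i)$ be the mean distance from sample $i$ to the other members of its own cluster, and $b(i)$ the smallest mean distance from sample $i$ to the members of any other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1, \qquad \bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i)
$$

Values of $s(i)$ near 1 indicate a sample that sits well inside its cluster, values near 0 a sample on the boundary between two clusters, and negative values a sample that is on average closer to another cluster. The dotted line in the plot corresponds to the average $\bar{s}$ over all $n$ samples.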
Characteristics of the 5 clusters identified by k-means clustering. Dark and light shading indicate higher and lower proportions, respectively, relative to the entire cohort
Variables not included as input to the cluster analysis: comparison between clusters. Higher IMD scores indicate greater social deprivation (the 5th quintile is the most deprived). Dark and light shading indicate higher and lower proportions, respectively, relative to the entire cohort
Fig. 4 3D scatter plot of the three MCA components, colour-coded by cluster assignment
Fig. 5 Simplified example output of the decision tree classifier trained with a maximum depth of three
Mortality and AECOPD outcomes: comparison between clusters. Dark and light shading indicate higher and lower proportions, respectively, relative to the entire cohort
Fig. 6 Cumulative AECOPD episodes by subgroup: (a) in primary care and (b) hospital admissions
Age-adjusted Cox regression with regard to CVD- and respiratory-related mortality
| Characteristic | Hazard ratio |
|---|---|
| Age | 1.08 [1.07–1.08] |
| Cluster | |
| Not comorbid | 1 |
| Anxiety / depression | 1.28 [1.13–1.46] |
| CVD / diabetes | 1.49 [1.38–1.60] |
| Severe COPD / frail | 1.30 [1.20–1.40] |
| Atopy / obesity | 1.15 [1.03–1.30] |
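To help read the table, the sketch below converts hazard ratios into more familiar quantities. It relies on two assumptions that this record does not confirm: that the age hazard ratio is expressed per one-year increase, and that the usual log-linear Cox model holds, so that ratios multiply across units. The function names are hypothetical; the input values are the point estimates from the table above.

```python
import math


def hazard_ratio_over(hr_per_unit, units):
    """Hazard ratio for a difference of `units`, assuming a log-linear Cox model."""
    return math.exp(math.log(hr_per_unit) * units)


def percent_change_in_hazard(hr):
    """Relative change in hazard versus the reference group, in percent."""
    return (hr - 1.0) * 100.0


hr_age = 1.08           # point estimate for age (assumed to be per year)
hr_cvd_diabetes = 1.49  # CVD / diabetes cluster versus the "Not comorbid" reference

hr_decade = hazard_ratio_over(hr_age, 10)                   # ten-year age difference
cvd_increase = percent_change_in_hazard(hr_cvd_diabetes)    # percent above reference
```

Under these assumptions, a ten-year age difference corresponds to roughly a doubling of the hazard, and the CVD / diabetes cluster has a hazard about 49% higher than the reference cluster at the same age.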