Wei Shao, Xiao Luo, Zuoyi Zhang, Zhi Han, Vasu Chandrasekaran, Vladimir Turzhitsky, Vishal Bali, Anna R Roberts, Megan Metzger, Jarod Baker, Carmen La Rosa, Jessica Weaver, Paul Dexter, Kun Huang.
Abstract
BACKGROUND: Chronic cough affects approximately 10% of adults. The lack of ICD codes for chronic cough makes it challenging to apply supervised learning methods to predict the characteristics of chronic cough patients, so these patients must be identified by other mechanisms. We developed a deep clustering algorithm with auto-encoder embedding (DCAE) to identify clusters of chronic cough patients in a large cohort of 264,146 patients from the Electronic Medical Records (EMR) system. We constructed features from the diagnoses recorded in the EMR, then built a clustering-oriented loss function directly on the embedded features of the deep autoencoder to jointly perform feature refinement and cluster assignment. Lastly, we performed statistical analysis on the identified clusters to characterize the chronic cough patients relative to the non-chronic cough patients.
Keywords: Chronic cough; Deep clustering; EMR data; Unsupervised learning
Year: 2022 PMID: 35439945 PMCID: PMC9019947 DOI: 10.1186/s12859-022-04680-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Summary of the overall cohort
| Category | Subcategory | Non-CC (N = 238,265) | CC (N = 25,881) |
|---|---|---|---|
| Age | Mean (SD) | 45.29 (17.81) | 54.7 (16.35) |
| Gender | Male | 91,768 (38.52%) | 8985 (34.72%) |
| | Female | 146,491 (61.48%) | 16,896 (65.28%) |
| | Unknown | 6 | |
| Race | Black | 48,864 (20.51%) | 4674 (18.06%) |
| | Other | 43,606 (18.3%) | 1144 (4.42%) |
| | White | 145,795 (61.19%) | 20,063 (77.52%) |
| Urbanicity | Rural | 21,181 (9.34%) | 2848 (12.13%) |
| | Urban | 205,484 (90.66%) | 20,635 (87.87%) |
| | Unknown | 11,600 | 2398 |
Fig. 1 Purity and silhouette values of the different methods as the number of clusters k is varied over {2, 5, 9, 13, 17, 21}
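Purity, one of the two scores named in the Fig. 1 caption, can be illustrated with a minimal sketch. It assumes the standard definition (the share of samples that carry the majority label of their cluster), since this page does not spell out the formula; the function name and the toy data are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples whose label matches the majority label of their cluster.

    Assumes the standard purity definition; the source does not state the formula.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one sample is required")

    # Count how often each label occurs inside each cluster.
    label_counts = {}
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts.setdefault(cluster, Counter())[label] += 1

    # Sum the size of the majority label in every cluster.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Toy data (hypothetical): two clusters of three patients each.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["CC", "CC", "Non-CC", "Non-CC", "Non-CC", "CC"]
toy_purity = cluster_purity(toy_clusters, toy_labels)
```

Each toy cluster has two patients with the majority label, so four of the six samples count toward purity. Silhouette is not sketched here because it needs pairwise distances between the embedded features.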
Fig. 2 Visualization of the clustering results by t-SNE [18]. Different colors represent different clusters. CC and Non-CC clusters correspond to clusters dominated by CC and Non-CC patients, respectively
The number of CC dominated clusters (nCCD) for our DCAE method
| Cluster number | K = 5 | K = 9 | K = 13 | K = 17 | K = 21 |
|---|---|---|---|---|---|
| nCCD | 3 | 7 | 11 | 14 | 18 |
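The nCCD count in the table above can be sketched as follows. The page does not define when a cluster counts as CC dominated, so this sketch assumes a simple majority rule (more than half of the cluster's patients are CC); the function name, the `threshold` parameter, and the toy data are hypothetical.

```python
from collections import defaultdict


def count_cc_dominated(cluster_ids, is_cc, threshold=0.5):
    """Count clusters whose share of CC patients exceeds `threshold`.

    The majority rule (threshold=0.5) is an assumption, not taken from the source.
    """
    totals = defaultdict(int)     # patients per cluster
    cc_counts = defaultdict(int)  # CC patients per cluster
    for cluster, flag in zip(cluster_ids, is_cc):
        totals[cluster] += 1
        if flag:
            cc_counts[cluster] += 1
    return sum(1 for c in totals if cc_counts[c] / totals[c] > threshold)


# Toy data (hypothetical): three clusters, two of them mostly CC.
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_is_cc = [True, True, False, False, False, True, True, True]
toy_nccd = count_cc_dominated(toy_clusters, toy_is_cc)
```

With a different dominance rule, for example one relative to the overall CC prevalence in the cohort, the counts would change, so the threshold is kept as an explicit argument.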
Fig. 3 The influence of the parameter in the DCAE model
Univariate analysis of the categorized diagnosis, medication, and lab data of the non-CC and CC clusters
| | Non-CC (N = 8588) | CC-1 (N = 4658) | CC-2 (N = 1737) | CC-3 (N = 154) | p value |
|---|---|---|---|---|---|
| Respiratory | 2003 (23.32%) | 4300 (92.31%) | 1593 (91.71%) | 140 (90.91%) | < .0001 |
| Endocrine metabolic | 1268 (14.76%) | 3514 (75.44%) | 1026 (59.07%) | 88 (57.14%) | < .0001 |
| Circulatory system | 1687 (19.64%) | 3735 (80.18%) | 1110 (63.9%) | 91 (59.09%) | < .0001 |
| Mental disorder | 1020 (11.88%) | 3133 (67.26%) | 857 (49.34%) | 56 (36.36%) | < .0001 |
| Neurological | 805 (9.37%) | 2381 (51.12%) | 579 (33.33%) | 38 (24.68%) | < .0001 |
| Digestive | 1419 (16.52%) | 2889 (62.02%) | 767 (44.16%) | 60 (38.96%) | < .0001 |
| Symptoms | 2344 (27.29%) | 3335 (71.6%) | 943 (54.29%) | 71 (46.1%) | < .0001 |
| Hematopoietic | 336 (3.91%) | 2353 (50.52%) | 484 (27.86%) | 20 (12.99%) | < .0001 |
| Antiasthmatic bronchodilator | 1099 (12.8%) | 2777 (59.62%) | 898 (51.7%) | 79 (51.3%) | < .0001 |
| Minerals electrolytes | 1164 (13.55%) | 2791 (59.92%) | 704 (40.53%) | 52 (33.77%) | < .0001 |
| Corticosteroids | 1035 (12.05%) | 2529 (54.29%) | 728 (41.91%) | 61 (39.61%) | < .0001 |
| Ulcer drugs | 1714 (19.96%) | 2789 (59.88%) | 769 (44.27%) | 55 (35.71%) | < .0001 |
| Blood count | 3437 (40.02%) | 4078 (87.55%) | 1287 (74.09%) | 105 (68.18%) | < .0001 |
The p values for the comparisons are given in parentheses in the last column. A single < .0001 is listed when all p values were < .0001
Univariate analysis on respiratory diagnosis between CC and non-CC clusters
| | Non-CC (N = 8588) | CC-1 (N = 4658) | CC-2 (N = 1737) | CC-3 (N = 154) | p value |
|---|---|---|---|---|---|
| Chronic airway obstruction | 82 (0.95%) | 1376 (29.54%) | 318 (18.31%) | 25 (16.23%) | < .0001 |
| Obstructive chronic bronchitis | 41 (0.48%) | 741 (15.91%) | 178 (10.25%) | 17 (11.04%) | < .0001 |
| Cough | 323 (3.76%) | 2503 (53.74%) | 1001 (57.63%) | 79 (51.3%) | < .0001 |
| Pneumonia | 99 (1.15%) | 1143 (24.54%) | 279 (16.06%) | 13 (8.44%) | < .0001 |
| Shortness of breath | 121 (1.41%) | 1156 (24.82%) | 323 (18.6%) | 23 (14.94%) | < .0001 |
| Other dyspnea | 148 (1.72%) | 1296 (27.82%) | 295 (16.98%) | 24 (15.58%) | < .0001 |
| Asthma | 230 (2.68%) | 716 (15.37%) | 284 (16.35%) | 21 (13.64%) | < .0001 |
| Other diseases of lung | 55 (0.64%) | 559 (12%) | 173 (9.96%) | 7 (4.55%) | < .0001 |
| Acute bronchitis and bronchiolitis | 99 (1.15%) | 597 (12.82%) | 248 (14.28%) | 19 (12.34%) | < .0001 |
| Respiratory Failure | 18 (0.21%) | 568 (12.19%) | 86 (4.95%) | 3 (1.95%) | < .0001 |
| Pleurisy pleural effusion | 27 (0.31%) | 533 (11.44%) | 102 (5.87%) | 6 (3.9%) | < .0001 |
Univariate analysis on endocrine and metabolic diagnosis between CC and non-CC clusters
| | Non-CC (N = 8588) | CC-1 (N = 4658) | CC-2 (N = 1737) | CC-3 (N = 154) | p value |
|---|---|---|---|---|---|
| Obesity | 99 (1.15%) | 491 (10.54%) | 103 (5.93%) | 7 (4.55%) | < .0001 |
| Type 2 diabetes | 352 (4.1%) | 1272 (27.31%) | 294 (16.93%) | 21 (13.64%) | < .0001 |
| Hyperlipidemia | 196 (2.28%) | 1650 (35.42%) | 423 (24.35%) | 23 (14.94%) | < .0001 |
| Hypothyroidism | 153 (1.78%) | 713 (15.31%) | 184 (10.59%) | 12 (7.79%) | < .0001 |
| Hypovolemia | 51 (0.59%) | 569 (12.22%) | 78 (4.49%) | 1 (0.65%) | (< .0001, < .0001, 0.9293) |
The p values are shown in the last column. A single < .0001 is listed when all three p values were < .0001
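The tables do not name the test behind these p values. As a minimal sketch, the code below assumes each p value compares one CC cluster with the non-CC cluster using Pearson's chi-square test on a 2x2 table, without continuity correction. Both the choice of test and the pairing of columns are assumptions, and the function name is hypothetical. Under these assumptions the Hypovolemia counts for CC-3 versus non-CC give a p value close to the reported 0.9293, which is consistent with this reading but does not confirm it.

```python
import math


def chi2_2x2_pvalue(events_a, total_a, events_b, total_b):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    Compares the event proportion in group A with that in group B.
    Returns (chi-square statistic, p value).
    """
    a, b = events_a, total_a - events_a  # group A: with / without the diagnosis
    c, d = events_b, total_b - events_b  # group B: with / without the diagnosis
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the 2x2 table sums to zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts taken from the tables on this page.
chi2_hypo, p_hypo = chi2_2x2_pvalue(1, 154, 51, 8588)        # Hypovolemia: CC-3 vs non-CC
chi2_resp, p_resp = chi2_2x2_pvalue(4300, 4658, 2003, 8588)  # Respiratory: CC-1 vs non-CC
```

The expected count of Hypovolemia cases in CC-3 is below 1, so the chi-square approximation is weak for that row; an exact test such as Fisher's would be a reasonable alternative there.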
Univariate analysis on circulatory system diagnosis between CC and non-CC clusters
| | Non-CC (N = 8588) | CC-1 (N = 4658) | CC-2 (N = 1737) | CC-3 (N = 154) | p value |
|---|---|---|---|---|---|
| Essential hypertension | 735 (8.56%) | 2400 (51.52%) | 647 (37.25%) | 43 (27.92%) | < .0001 |
| Nonspecific chest pain | 498 (5.8%) | 1201 (25.78%) | 330 (19%) | 35 (22.73%) | < .0001 |
| Long term or current use of aspirin | 8 (0.09%) | 694 (14.9%) | 84 (4.84%) | 4 (2.6%) | < .0001 |
| Congestive heart failure | 27 (0.31%) | 674 (14.47%) | 88 (5.07%) | 4 (2.6%) | < .0001 |
| Atrial fibrillation | 25 (0.29%) | 572 (12.28%) | 79 (4.55%) | 2 (1.3%) | (< .0001, < .0001, 0.0255) |
| Hypertensive chronic kidney disease | 14 (0.16%) | 537 (11.53%) | 59 (3.4%) | 4 (2.6%) | < .0001 |
Fig. 4 The framework for deep AutoEncoders (DAE)
Fig. 5 The framework for deep clustering with auto-encoder embedding (DCAE)
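The abstract describes a clustering-oriented loss built directly on the embedded features of the autoencoder, so that feature refinement and cluster assignment happen jointly. The exact loss is not given on this page. The sketch below therefore substitutes a generic stand-in: the reconstruction error plus a k-means-style distance from each embedding to its nearest centroid. The function names, the weight `lam`, and the toy numbers are all hypothetical, and this should not be read as the authors' DCAE objective.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def joint_loss(inputs, reconstructions, embeddings, centroids, lam=0.1):
    """Illustrative joint objective: reconstruction + lam * clustering term.

    The clustering term is a k-means-style stand-in (an assumption),
    not the loss used in the source.
    Returns (total loss, list of nearest-centroid assignments).
    """
    n = len(inputs)
    # Mean squared reconstruction error of the autoencoder.
    recon = sum(squared_distance(x, x_hat)
                for x, x_hat in zip(inputs, reconstructions)) / n

    # Assign each embedding to its nearest centroid and accumulate the distance.
    assignments = []
    cluster_term = 0.0
    for z in embeddings:
        dists = [squared_distance(z, c) for c in centroids]
        k = min(range(len(centroids)), key=dists.__getitem__)
        assignments.append(k)
        cluster_term += dists[k]
    cluster_term /= n

    return recon + lam * cluster_term, assignments


# Toy example (hypothetical numbers): two patients, 2-D inputs, 1-D embeddings.
toy_loss, toy_assignments = joint_loss(
    inputs=[[1.0, 0.0], [0.0, 1.0]],
    reconstructions=[[1.0, 0.0], [0.0, 0.0]],
    embeddings=[[0.0], [2.0]],
    centroids=[[0.0], [3.0]],
    lam=0.1,
)
```

In a trainable version the encoder, decoder, and centroids would all be updated by gradient descent on this combined loss; the sketch only evaluates the objective for fixed values to show how the two terms combine.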