Hongxia Lu, Louis Ehwerhemuepha, Cyril Rakovski.
Abstract
BACKGROUND: Discharge medical notes written by physicians contain important information about patients' health conditions. Many deep learning algorithms have been successfully applied to extract important information from unstructured medical notes, which can lead to actionable results in the medical domain. This study explores the performance of various deep learning algorithms in text classification tasks on medical notes under different disease class imbalance scenarios.
Keywords: BERT; CNN; Deep learning; Embedding; Medical notes; Text classification; Transformer encoder
Year: 2022 PMID: 35780100 PMCID: PMC9250736 DOI: 10.1186/s12874-022-01665-y
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.612
Disease prevalence (N = 1,237)
| Disease | Disease Prevalence | Disease | Disease Prevalence |
|---|---|---|---|
| Hypertriglyceridemia | 5% | GERD* | 20% |
| Venous Insufficiency | 7% | Depression | 20% |
| Asthma | 13% | Obesity | 40% |
| Gout | 13% | CHF* | 43% |
| OSA* | 14% | Hypercholesterolemia | 47% |
| PVD* | 15% | CAD* | 55% |
| Gallstones | 15% | Diabetes | 66% |
| OA* | 18% | Hypertension | 73% |
OSA* obstructive sleep apnea, PVD* peripheral vascular disease, OA* osteoarthritis, GERD* gastroesophageal reflux disease, CHF* congestive heart failure, CAD* coronary artery disease
Descriptive statistics
| Descriptive Statistics | Number of Words | | Number of Characters | |
|---|---|---|---|---|
| Minimum | 146 | 50 | 903 | 410 |
| 25% Percentile | 819 | 391 | 4798 | 3089 |
| Median | 1084 | 517 | 6391 | 4098 |
| Mean | 1170 | 557 | 6870 | 4429 |
| 75% Percentile | 1425 | 687 | 8404 | 5420 |
| Maximum | 4280 | 2098 | 25,842 | 16,976 |
| Standard Deviation | 506 | 242 | 2960 | 1931 |
Model architectures
| Model | Number of Filters/Units/Encoders | Embedding Dimension | Max Sequence Length | Dropout | Activation Function | Optimizer | Total Parameters |
|---|---|---|---|---|---|---|---|
| CNN | 8 | 200 | 557 | 0.3 | ReLU | Adam | 5.51 M |
| RNN | 8 | 200 | 557 | 0.3 | ReLU | Adam | 5.50 M |
| GRU | 8 | 200 | 557 | 0.3 | ReLU | Adam | 5.50 M |
| LSTM | 8 | 200 | 557 | 0.3 | ReLU | Adam | 5.50 M |
| Bi-LSTM | 8 | 200 | 557 | 0.3 | ReLU | Adam | 5.51 M |
| Transformer Encoder | 1 encoder (2 heads) | 200 | 557 | 0.3 | ReLU | Adam | 5.94 M |
| BERT-Base | 12 encoders (12 heads) | 768 | 512 | 0.3 (fine-tune layer) | ReLU (fine-tune layer) | Adam (fine-tune layer) | 110 M |
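The Max Sequence Length column implies that every note is brought to a fixed length before it reaches the model. A minimal sketch of that step follows, assuming integer token ids, truncation at the end of the note, and right-padding with a pad id of 0. The function name `pad_or_truncate`, the pad id, and the choice to cut and pad at the end are illustrative assumptions, not details given in the source.

```python
def pad_or_truncate(token_ids, max_len=557, pad_id=0):
    """Return exactly max_len ids: cut long inputs, right-pad short ones.

    Truncating and padding at the end are assumptions made for illustration;
    the source does not say which side was used.
    """
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


short_note = [12, 7, 93]            # hypothetical token ids for a very short note
long_note = list(range(1, 2001))    # hypothetical note with 2,000 tokens

fixed_short = pad_or_truncate(short_note)              # padded up to 557
fixed_long = pad_or_truncate(long_note)                # cut down to 557
fixed_bert = pad_or_truncate(long_note, max_len=512)   # BERT-Base row uses 512

print(len(fixed_short), len(fixed_long), len(fixed_bert))
```

Under these assumptions, any note longer than the limit loses its tail, which is worth keeping in mind when comparing the 557-token models with the 512-token BERT-Base row.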
Fig. 1 Model performance. (c. F1 Score* and d. Balanced Accuracy*: some points in these graphs are missing because of NA values resulting from zero True Positives in the highly imbalanced datasets)
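To illustrate how a zero True Positive count can leave a metric undefined, the sketch below computes F1 from precision and recall. It assumes the common convention that a 0/0 ratio is reported as NaN; the helper names `safe_div` and `f1_from_counts` are hypothetical, and the source does not state which software or convention produced the missing points.

```python
import math


def safe_div(num, den):
    """Divide, returning NaN instead of raising when the denominator is zero."""
    return num / den if den else math.nan


def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, NaN if undefined."""
    precision = safe_div(tp, tp + fp)   # 0/0 when nothing is predicted positive
    recall = safe_div(tp, tp + fn)
    return safe_div(2 * precision * recall, precision + recall)


# A classifier that finds some positives: F1 is well defined.
f1_ok = f1_from_counts(tp=10, fp=5, fn=10)

# A classifier that predicts "absent" for every note of a rare disease:
# TP = 0 and FP = 0, so precision is 0/0 and F1 cannot be computed.
f1_missing = f1_from_counts(tp=0, fp=0, fn=17)

print(f1_ok, f1_missing)
```

Some libraries substitute 0 for the undefined value instead of NaN, so whether a point is missing or plotted at zero depends on the tooling.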
Number of samples in each class in training and test sets
| Disease | Prevalence | Training Set: Disease Presence | Training Set: Disease Absence | Test Set: Disease Presence | Test Set: Disease Absence |
|---|---|---|---|---|---|
| Hypertriglyceridemia | 5% | 50 | 878 | 17 | 292 |
| Venous Insufficiency | 7% | 62 | 865 | 21 | 289 |
| Asthma | 13% | 123 | 805 | 41 | 268 |
| Gout | 13% | 120 | 808 | 40 | 269 |
| OSA | 14% | 129 | 799 | 43 | 266 |
| PVD | 15% | 135 | 793 | 45 | 264 |
| Gallstones | 15% | 141 | 787 | 47 | 262 |
| OA | 18% | 168 | 760 | 56 | 253 |
| GERD | 20% | 184 | 743 | 62 | 248 |
| Depression | 20% | 187 | 741 | 62 | 247 |
| Obesity | 40% | 374 | 554 | 125 | 184 |
| CHF | 43% | 402 | 526 | 134 | 175 |
| Hypercholesterolemia | 47% | 432 | 496 | 144 | 165 |
| CAD | 55% | 512 | 416 | 170 | 139 |
| Diabetes | 66% | 616 | 312 | 205 | 104 |
| Hypertension | 73% | 677 | 250 | 226 | 84 |
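The counts in this table can be checked with a few lines of arithmetic. The sketch below takes three rows from the table and computes the prevalence in each set and the share of notes held out for testing. The counts appear consistent with a split of roughly 75% training and 25% test that keeps prevalence similar in both sets, but that reading is an inference from the numbers, since this section does not describe how the split was made.

```python
# (train present, train absent, test present, test absent), copied from the table
counts = {
    "Hypertriglyceridemia": (50, 878, 17, 292),
    "Obesity": (374, 554, 125, 184),
    "Hypertension": (677, 250, 226, 84),
}


def prevalence(present, absent):
    """Fraction of notes in which the disease is present."""
    return present / (present + absent)


summary = {}
for disease, (tr_pos, tr_neg, te_pos, te_neg) in counts.items():
    total = tr_pos + tr_neg + te_pos + te_neg
    summary[disease] = {
        "total": total,
        "train_prevalence": prevalence(tr_pos, tr_neg),
        "test_prevalence": prevalence(te_pos, te_neg),
        "test_share": (te_pos + te_neg) / total,
    }
    row = summary[disease]
    print(
        f"{disease}: train {row['train_prevalence']:.1%}, "
        f"test {row['test_prevalence']:.1%}, "
        f"test share {row['test_share']:.1%}"
    )
```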
Fig. 2 Average training time
Fig. 3 Model performance with and without pre-trained word embeddings. (c. F1 Score* and d. Balanced Accuracy*: some points in these graphs are missing because of NaN values resulting from zero True Positives in the highly imbalanced datasets)
Fig. 4 Average training time with and without pre-trained word embeddings