Chen Lin, Elizabeth W Karlson, Helena Canhao, Timothy A Miller, Dmitriy Dligach, Pei Jun Chen, Raul Natanael Guzman Perez, Yuanyan Shen, Michael E Weinblatt, Nancy A Shadick, Robert M Plenge, Guergana K Savova.
Abstract
OBJECTIVE: We aimed to mine the data in the Electronic Medical Record to automatically discover patients' Rheumatoid Arthritis disease activity at discrete rheumatology clinic visits. We cast the problem as a document classification task where the feature space includes concepts from the clinical narrative and lab values as stored in the Electronic Medical Record.
Year: 2013 PMID: 23976944 PMCID: PMC3745469 DOI: 10.1371/journal.pone.0069932
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Dataset characteristics.
| Disease activity category | Training Set | Test Set 1 | Test Set 2 |
| High Disease Activity | 506 notes | 190 notes | |
| Moderate Disease Activity | 966 notes | 610 notes | |
| Aggregate High/Moderate Disease Activity | 1472 notes | 800 notes | 133 notes |
| Low Disease Activity | 369 notes | 312 notes | |
| Remission Disease Activity | 951 notes | 637 notes | |
| Aggregate Low/Remission Disease Activity | 1320 notes | 949 notes | 211 notes |
| Total | 2792 notes | 1749 notes | 344 notes |
| Agreement | MD/DAS28: 0.81 | MD/DAS28: 0.87 | Inter-annotator agreement: 0.87 |
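The aggregate rows in the table can be checked directly from the four per-category counts. The sketch below does this for the first two numeric columns, which are assumed here to be the training set and Test Set 1 (the column names are an inference, since the header row did not survive extraction). The dictionary keys are illustrative names, not identifiers from the paper.

```python
# Note counts copied from the table above. The column names are an assumption:
# the original header row is missing from this record.
counts = {
    "training": {"high": 506, "moderate": 966, "low": 369, "remission": 951},
    "test_set_1": {"high": 190, "moderate": 610, "low": 312, "remission": 637},
}


def aggregate(column):
    """Collapse the four disease activity categories into the two aggregate classes."""
    mh = column["high"] + column["moderate"]    # Aggregate High/Moderate
    lr = column["low"] + column["remission"]    # Aggregate Low/Remission
    return {"MH": mh, "LR": lr, "total": mh + lr}


summary = {name: aggregate(column) for name, column in counts.items()}

for name, row in summary.items():
    share = row["MH"] / row["total"]
    print(f"{name}: MH={row['MH']} LR={row['LR']} total={row['total']} MH share={share:.3f}")
```

The computed aggregates match the "Aggregate" and "Total" rows as printed, which suggests the per-category counts in this record were extracted intact.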
Figure 1. Representation of the processing flow for automatic disease activity labeling.
Abbreviations: CUI – Unified Medical Language System Concept Unique Identifier; cTAKES – clinical Text Analysis and Knowledge Extraction System; LR – Low/Remission disease activity; MH – Moderate/High disease activity; EMR – Electronic Medical Record.
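One plausible way to picture the feature space described in the abstract (concepts from the narrative plus lab values) is a per-note dictionary of CUI features with one extra numeric entry. The sketch below is an illustration only. It assumes that concept extraction has already happened upstream, that negated concepts are marked with a leading "-", and that features are stored as counts. The function name, the `LAB_VALUE` key, and the CUI strings are hypothetical placeholders, and no cTAKES API is called.

```python
from collections import Counter


def build_features(concepts, lab_value):
    """Build one note's feature dictionary.

    concepts  -- list of (cui, negated) pairs, assumed to come from an
                 upstream concept extractor (not shown here)
    lab_value -- a single numeric lab value, or None if unavailable
    """
    features = Counter()
    for cui, negated in concepts:
        # Assumption: a negated mention becomes a separate feature with a "-" prefix.
        key = f"-{cui}" if negated else cui
        features[key] += 1

    vector = dict(features)
    if lab_value is not None:
        vector["LAB_VALUE"] = lab_value   # hypothetical feature name
    return vector


# Placeholder identifiers, not real UMLS concepts.
note_concepts = [("C0000001", False), ("C0000002", True), ("C0000001", False)]
example = build_features(note_concepts, lab_value=12.5)
print(example)
```

Whether the original system used counts, binary indicators, or another weighting is not stated in this record, so the count-based encoding is a design choice of the sketch.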
Figure 2. Lab value and the 20 top-ranked CUIs.
Each bar shows a feature's chi-square value; longer bars suggest higher impact. A minus sign ("-") before a CUI indicates that the concept was negated (CUI – Unified Medical Language System Concept Unique Identifier).
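For readers unfamiliar with chi-square feature ranking, the following is a minimal sketch of the statistic for a binary feature against a two-class label, computed from a 2x2 contingency table. It assumes the plain Pearson statistic without continuity correction. The exact variant used to produce Figure 2 is not stated here, and the feature names and counts below are invented for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table (no continuity correction).

    a -- notes with the feature, class MH
    b -- notes with the feature, class LR
    c -- notes without the feature, class MH
    d -- notes without the feature, class LR
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no association information
    return n * (a * d - b * c) ** 2 / denominator


def rank_features(tables):
    """Return (feature, score) pairs sorted from highest to lowest chi-square."""
    scores = {name: chi_square_2x2(*cells) for name, cells in tables.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented counts for two placeholder features.
toy_tables = {
    "C0000001": (30, 10, 20, 40),   # unevenly distributed across classes
    "-C0000002": (25, 25, 25, 25),  # evenly distributed, so no association
}
ranking = rank_features(toy_tables)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A feature spread evenly across both classes scores zero, while one concentrated in a single class scores higher, which is the sense in which a longer bar indicates a more informative feature.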
Figure 3. Histogram of DAS28 scores for 25 discordant cases.
In these cases the DAS28-derived label and the domain expert's label disagreed. They come from 93 notes randomly sampled from the Training Set (the remaining 68 cases were concordant).
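The counts in this caption give a raw concordance rate for the sampled notes. The short calculation below makes it explicit. It is simple percent agreement and is not necessarily the statistic behind the "Agreement" row of the dataset table, whose definition is not given in this record.

```python
sampled = 93        # notes randomly sampled from the Training Set
discordant = 25     # DAS28-derived label and expert label disagree

concordant = sampled - discordant
concordance_rate = concordant / sampled

print(f"concordant: {concordant}, rate: {concordance_rate:.3f}")
```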
Corpus selection effect on Test Set 1 using a linear-kernel SVM model.
| UMLS CUIs after feature selection and lab values | High and Low Disease Activity labels from Training Set | Aggregate High/Moderate and Low/Remission Disease Activity labels from Test Set 1 (10-fold cross-validation) | 0.789±0.0445 | |
| UMLS CUIs after feature selection and lab values | Aggregate High/Moderate and Low/Remission Disease Activity labels from Training Set | Aggregate High/Moderate and Low/Remission Disease Activity labels from Test Set 1 (10-fold cross-validation) | 0.747±0.0316 | 0.810±0.0297 |
| Baseline 1: Bag-of-words | Aggregate High/Moderate and Low/Remission Disease Activity labels from Training Set | Aggregate High/Moderate and Low/Remission Disease Activity labels from Test Set 1 (10-fold cross-validation) | 0.737±0.0331 | 0.732±0.0348 |
| Baseline 2: Bag-of-words and lab values | Aggregate High/Moderate and Low/Remission Disease Activity labels from Training Set | Aggregate High/Moderate and Low/Remission Disease Activity labels from Test Set 1 (10-fold cross-validation) | 0.750±0.0265 | 0.758±0.0291 |
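The table reports each result as a value ± a spread over 10-fold cross-validation. As a rough sketch of the bookkeeping involved, the code below splits note indices into 10 folds and summarizes per-fold scores as mean and standard deviation. It assumes that the ± term is a sample standard deviation and that folds are formed by a simple shuffled split. Neither detail is confirmed by this record, no classifier is trained, and the per-fold scores are made-up numbers.

```python
import random
import statistics


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# 1749 is the Test Set 1 note total from the dataset table.
splits = list(k_fold_indices(1749, k=10))
fold_sizes = [len(test) for _, test in splits]
print("fold sizes:", fold_sizes)

# Hypothetical per-fold scores, only to show the "mean±std" formatting.
mean, std = summarize([0.8, 0.7, 0.9])
print(f"{mean:.3f}±{std:.4f}")
```

With two-class clinical data, stratified folds that preserve the class balance are a common alternative; the plain split here is kept only for brevity.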
Figure 4. Error analysis of the best performing classifier.
Out of 429 misclassified cases (using DAS28-derived dichotomous labels as the gold standard), the majority are from the Moderate and Low disease activity categories.
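To make the "DAS28-derived dichotomous labels" concrete, the sketch below maps a DAS28 score to the four categories and then to the two aggregate classes. It assumes the commonly used cutoffs of 2.6, 3.2, and 5.1. This record does not state which thresholds the authors applied, so the cutoffs and function names are assumptions.

```python
def das28_category(score):
    """Map a DAS28 score to a four-level category (assumed standard cutoffs)."""
    if score < 2.6:
        return "Remission"
    if score <= 3.2:
        return "Low"
    if score <= 5.1:
        return "Moderate"
    return "High"


def dichotomize(score):
    """Collapse the four categories into LR (Low/Remission) or MH (Moderate/High)."""
    return "LR" if das28_category(score) in ("Remission", "Low") else "MH"


for value in (2.0, 3.1, 3.3, 5.5):
    print(value, das28_category(value), dichotomize(value))
```

Under these cutoffs the Low and Moderate categories sit on either side of the 3.2 boundary between the two aggregate classes, which is one possible reading of why most misclassified notes fall into those two categories.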
Feature contribution.
| UMLS CUIs | 0.740±0.039 | 0.775±0.036 | 0.722±0.0602 | 0.669±0.0641 |
| Lab Values | 0.736±0.0393 | 0.748±0.0300 | 0.704±0.0419 | 0.679±0.0337 |
| UMLS CUIs and Lab Values | 0.789±0.0445 | | 0.74±0.0447 | 0.714±0.0505 |
Portability testing.
| UMLS CUIs after feature selection and lab values | High and Low Disease Activity labels from Training Set | Aggregate High/Moderate and Low/Remission Disease Activity labels from Test Set 2 (10-fold cross-validation) | 0.761±0.0553 | 0.785±0.0599 |
| UMLS CUIs after feature selection and lab values | Aggregate High/Moderate and Low/Remission Disease Activity labels from Training Set | Aggregate High/Moderate and Low/Remission Disease Activity labels from Test Set 2 (10-fold cross-validation) | 0.646±0.0863 | 0.748±0.0944 |
Figure 5. Scatter plot of DAS28 scores and log-transformed lab values.
(Left) Scatter plot of DAS28 scores and log-transformed lab values for 1320 correctly classified notes. (Right) Scatter plot of DAS28 scores and log-transformed lab values for 429 misclassified notes. The lines are the regression lines.
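The regression lines in such a plot can be obtained with ordinary least squares on the log-transformed lab values. The sketch below assumes a natural logarithm and a regression of DAS28 score on log lab value. The choice of axes, the log base, and the toy data points are all assumptions, since the figure itself is not part of this record.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data constructed so that DAS28 = log(lab value) + 2 exactly.
lab_values = [1.0, math.e, math.e ** 2, math.e ** 3]
das28_scores = [2.0, 3.0, 4.0, 5.0]

log_labs = [math.log(value) for value in lab_values]   # natural log (assumed)
slope, intercept = fit_line(log_labs, das28_scores)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Fitting the line separately for correctly classified and misclassified notes, as the two panels do, allows the relationship between lab values and DAS28 to be compared across the two groups.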
Figure 6. Ranges of lab values.
(Left) Range of lab values for Moderate/High (MH) disease activity cases vs. range of lab values for Low/Remission (LR) disease activity cases among 1320 correctly classified notes. (Right) Range of lab values for Moderate/High (MH) disease activity cases vs. range of lab values for Low/Remission (LR) disease activity cases among 429 misclassified notes.