Clayton A Turner, Alexander D Jacobs, Cassios K Marques, James C Oates, Diane L Kamen, Paul E Anderson, Jihad S Obeid.
Abstract
BACKGROUND: Identifying patients who meet certain clinical criteria through manual chart review of doctors' notes is a daunting task, given the massive volume of text notes in electronic health records (EHR). This task can be automated using text classifiers based on Natural Language Processing (NLP) techniques along with pattern recognition machine learning (ML) algorithms. The aim of this research is to evaluate the performance of traditional classifiers for identifying patients with Systemic Lupus Erythematosus (SLE) in comparison with a newer Bayesian word vector method.
Keywords: Machine learning; Natural language processing; Systemic lupus erythematosus
Year: 2017 PMID: 28830409 PMCID: PMC5568290 DOI: 10.1186/s12911-017-0518-1
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
1997 Update of the 1982 ACR revised criteria for the classification of SLE [13, 14]
| Short description |
|---|
| 1. Malar rash |
| 2. Discoid rash |
| 3. Photosensitivity |
| 4. Oral ulcers |
| 5. Nonerosive arthritis |
| 6. Pleuritis or Pericarditis |
| 7. Renal disorder |
| 8. Neurologic disorder |
| 9. Hematologic disorder |
| 10. Immunologic disorder |
| 11. Positive antinuclear antibody |
Fig. 1 Data Flow and Feature Engineering Pipeline. This figure shows how data flows through our clinical pipeline: the data is bootstrapped, subsetted, and filtered to obtain higher-quality notes for the subsequent feature selection and classification. If CUIs are used, the cTAKES/YTEX pipeline creates the initial features via the Collection Processing Engine in the cTAKES suite. This data is output as a sparse matrix, which we convert to match the format of the other feature engineering techniques so that each algorithm can be used independently of the data it is given. If CUIs are not used, the text is stemmed, as stemming is required for both BOWs and the inversion method. For BOWs, punctuation and stop words are additionally removed to reduce bias in the dataset. For the inversion method, we use Word2Vec to create two Word2Vec models, each fine-tuned to the phenotype it represents. All feature sets are normalized and undergo feature selection using the variable importance from scikit-learn’s ExtraTreesClassifier to prepare them for the classifiers [22].
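The BOWs branch of this pipeline can be illustrated with a minimal sketch. It is not the authors' code: the toy notes and labels are invented for illustration, stemming (done with nltk's SnowballStemmer in the study) is omitted, and the L2 normalization and the mean-importance threshold are assumptions, since the caption does not specify either.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import normalize

# Toy notes and labels (1 = SLE, 0 = not SLE); illustrative only, not study data.
notes = [
    "patient with sle, low c3 and c4, started plaquenil",
    "lupus nephritis confirmed on kidney biopsy, low c3",
    "malar rash and positive ana, sle suspected",
    "sle flare with fatigue and low c4",
    "ankle sprain after a fall, no rash",
    "seasonal allergies, otherwise healthy",
    "routine visit, blood pressure controlled",
    "knee pain after running, no swelling",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Bag-of-words counts: the default tokenizer drops punctuation, and
# stop_words="english" removes common stop words. Stemming is omitted here.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(notes)

# Assumed normalization: scale each note's count vector to unit L2 norm.
features = normalize(counts)

# Fit an ExtraTreesClassifier and read off its variable importances.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(features, labels)
importances = forest.feature_importances_

# Assumed selection rule: keep features with at least mean importance.
selector = SelectFromModel(forest, prefit=True, threshold="mean")
selected = selector.transform(features)

# Recover the vocabulary in column order and print the top-ranked terms.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for idx in np.argsort(importances)[::-1][:5]:
    print(f"{vocab[idx]}: {importances[idx]:.4f}")
```

The same selection step would apply unchanged to a CUI count matrix, which is presumably why the pipeline converts the cTAKES/YTEX output into the same matrix layout.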
Table showing each machine learning technique and its 5-fold Cross-Validation accuracy and its test set accuracy with each NLP classifier
| Technique | Data form | CV Acc. | CV CI | Test Acc. |
|---|---|---|---|---|
| ICD-9 billing codes | N/A | 89.655 | N/A | 90.00 |
| Word2Vec inversion | N/A | 89.653 | [89.281, 90.025] | 90.039 |
| Neural network | BOWs | 84.138 | [80.887, 87.630] | 87.100 |
| Neural network | CUIs | 94.138 | [89.539, 92.358] | 92.10 |
| Random forests | BOWs | 95.172 | [93.875, 94.539] | 95.250 |
| Random forests | CUIs | 95.345 | [94.889, 95.318] | 95.00 |
| Naïve Bayes | BOWs | 85.000 | [80.141, 83.859] | 82.000 |
| Naïve Bayes | CUIs | 81.207 | [76.087, 79.013] | 77.55 |
| Support vector machines | BOWs | 86.724 | [83.031, 86.469] | 84.750 |
| Support vector machines | CUIs | 90.862 | [90.470, 92.230] | 91.35 |
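As a rough illustration of how a cross-validation mean and its confidence interval can be derived from per-run scores, the sketch below uses a normal approximation at the 95% level. Both the interval method and the confidence level are assumptions, as this record does not state how the tabulated intervals were computed, and the accuracies in the example are hypothetical rather than taken from the study.

```python
import math
import statistics


def mean_confidence_interval(scores, z=1.96):
    """Return (mean, lower, upper) using a normal approximation.

    z=1.96 corresponds to an assumed 95% confidence level.
    """
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * sem, mean + z * sem


# Hypothetical per-fold accuracies (%), not values from the study.
cv_accuracies = [94.8, 95.5, 95.1, 94.9, 95.6]
mean, lower, upper = mean_confidence_interval(cv_accuracies)
print(f"{mean:.3f} [{lower:.3f}, {upper:.3f}]")
```

With only a handful of folds, a Student's t critical value would give a wider and arguably more appropriate interval than z = 1.96; the normal approximation is used here only to keep the sketch dependency-free.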
Table showing each machine learning technique and its AUC from the test set and its respective 20x repeated Cross-Validation AUC and confidence interval with each NLP classifier
| Technique | Data form | CV AUC | CV CI | Test AUC |
|---|---|---|---|---|
| ICD-9 billing codes | N/A | 0.897 | N/A | 0.900 |
| Word2Vec inversion | N/A | 0.963 | [0.956, 0.971] | 0.905 |
| Neural network | BOWs | 0.902 | [0.897, 0.908] | 0.925 |
| Neural network | CUIs | 0.960 | [0.957, 0.964] | 0.974 |
| Random forests | BOWs | 0.981 | [0.979, 0.984] | 0.987 |
| Random forests | CUIs | 0.987 | [0.985, 0.989] | 0.988 |
| Naïve Bayes | BOWs | 0.841 | [0.815, 0.868] | 0.841 |
| Naïve Bayes | CUIs | 0.805 | [0.777, 0.833] | 0.805 |
| Support vector machines | BOWs | 0.923 | [0.911, 0.934] | 0.923 |
| Support vector machines | CUIs | 0.980 | [0.975, 0.985] | 0.980 |
Fig. 2 External AUC Curves. This graph depicts the AUC of each technique as it performed on the external testing set, generated using the pROC package [42].
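For readers unfamiliar with the metric, the AUC can be read as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one. The sketch below computes it directly from that definition in plain Python. It is not the pROC implementation used for the figure, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for six patients (1 = SLE, 0 = not SLE).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of notes, which is fine for illustration; a rank-based formulation would be preferable for large test sets.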
Top 25 word stems for BOWs according to the variable importance extracted from scikit-learn’s ExtraTreesClassifier and stemmed using nltk’s SnowballStemmer [17, 22]
| Rank | Word | VIMP |
|---|---|---|
| 1 | C3 | 0.0311 |
| 2 | Sle | 0.0225 |
| 3 | Graviti | 0.0172 |
| 4 | Sole | 0.0126 |
| 5 | Phurin | 0.0113 |
| 6 | Epitheli | 0.0084 |
| 7 | C4 | 0.0065 |
| 8 | Yet | 0.0065 |
| 9 | Lymph | 0.0063 |
| 10 | Hemlymph | 0.0059 |
| 11 | Educ | 0.0059 |
| 12 | Resolv | 0.0054 |
| 13 | 912 | 0.0054 |
| 14 | Fatigu | 0.0050 |
| 15 | Thrombocytopenia | 0.0047 |
| 16 | 2500 | 0.0047 |
| 17 | Need | 0.0047 |
| 18 | Naugl | 0.0047 |
| 19 | Clot | 0.0043 |
| 20 | Screen | 0.0042 |
| 21 | Antidoubl | 0.0040 |
| 22 | Beat | 0.0040 |
| 23 | Acut | 0.0038 |
| 24 | 843identificationremov | 0.0038 |
| 25 | Pregnanc | 0.0036 |
A graph of the degradation of variable importance for these word stems can be found in Fig. 3
Top 25 CUIs according to the variable importance extracted from scikit-learn’s ExtraTreesClassifier [22]
| Rank | CUI | Description | VIMP |
|---|---|---|---|
| 1 | C0042014 | Laboratory: Urine Examination | 0.0307 |
| 2 | C0699177 | Plaquenil | 0.0258 |
| 3 | C0024141 | Systemic Lupus Erythematosus | 0.0236 |
| 4 | C0194073 | Kidney Biopsy | 0.0208 |
| 5 | C0024204 | Lymph Node | 0.0179 |
| 6 | C0008031 | Nonspecific Chest Pain | 0.0166 |
| 7 | C0018966 | Heme | 0.0158 |
| 8 | C2711450 | Enlargement (Morphological Anomaly) | 0.01502 |
| 9 | C0014597 | Epithelial Cell | 0.0111 |
| 10 | C0023516 | Leukocytes | 0.0100 |
| 11 | C0003243 | Antinuclear Antibody (ANA) | 0.0094 |
| 12 | C0002170 | Alopecia | 0.0089 |
| 13 | C0024202 | Lymph | 0.0085 |
| 14 | C1267547 | Entire Mouth Region | 0.0084 |
| 15 | C0009780 | Connective Tissue | 0.0083 |
| 16 | C0229671 | Serum | 0.0068 |
| 17 | C0042036 | Urine | 0.0065 |
| 18 | C0014060 | St. Louis Encephalitis | 0.0062 |
| 19 | C0038999 | Swelling | 0.0061 |
| 20 | C1269549 | Entire Zygoma | 0.0060 |
| 21 | C0036749 | Serositis | 0.0060 |
| 22 | C0033684 | Proteins | 0.0059 |
| 23 | C0014239 | Endoplasmic Reticulum | 0.0059 |
| 24 | C0009782 | Connective Tissue Disorder | 0.0058 |
| 25 | C0024143 | Lupus Nephritis | 0.0058 |
CUI descriptions were extracted from MetamorphoSys [7]. A graph of the degradation of variable importance for these CUIs can be found in Fig. 4
Fig. 3 Variable Importance by Word Stem. This graph depicts the degradation of variable importance from the top 50 word stems.
Fig. 4 Variable Importance by CUI. This graph depicts the degradation of variable importance from the top 50 CUIs.