Christian M Rochefort, Aman D Verma, Tewodros Eguale, Todd C Lee, David L Buckeridge.
Abstract
BACKGROUND: Venous thromboembolisms (VTEs), which include deep vein thrombosis (DVT) and pulmonary embolism (PE), are associated with significant mortality, morbidity, and cost in hospitalized patients. To evaluate the success of preventive measures, accurate and efficient methods for monitoring VTE rates are needed. Therefore, we sought to determine the accuracy of statistical natural language processing (NLP) for identifying DVT and PE from electronic health record data.
Keywords: acute care hospital; automated text classification; deep vein thrombosis; natural language processing; pulmonary embolism; support vector machines
Year: 2014 PMID: 25332356 PMCID: PMC4433368 DOI: 10.1136/amiajnl-2014-002768
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
The top 30* most informative unigrams and bigrams for deep vein thrombosis and pulmonary embolism identification according to the Pearson's correlation statistic
| Deep vein thrombosis | | Pulmonary embolism | |
|---|---|---|---|
| Unigrams | Bigrams | Unigrams | Bigrams |
| Vein | Length of | Filling | Pulmonary artery |
| Thrombus | The thrombus | Segmental | Filling defect |
| Occlusive | Thrombosis involving | Artery | Lower lobe |
| Peroneal | Popliteal vein | Lobe | Defect in |
| Length | Over a | Pulmonary | Pulmonary emboli |
| Patent | Non occlusive | Defect | A filling |
| Popliteal | Posterior tibial | Subsegmental | There are |
| Over | A length | Strain | With pulmonary |
| Femoral | Is deep | Emboli | Upper lobe |
| Tibial | Peroneal vein | Chest | Main pulmonary |
| Thrombosis | The mid | Lung | Segmental branch |
| Veins | Basilic vein | Main | Defects are |
| Involving | Femoral vein | Branches | Segmental branches |
| Entire | Entire length | Basal | Embolus in |
| Thrombosed | Reminder of | Small | Multiple filling |
*Features were ranked using their Pearson correlation coefficient (ρ) and the top 30 unigrams and bigrams were selected.
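The ranking described in the footnote above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the whitespace tokenizer, the use of binary presence indicators rather than counts, the ranking by absolute correlation, the function names, and the toy reports are all assumptions made for the example.

```python
import math


def ngrams(text, n):
    """Return the word n-grams of a report (naive whitespace tokenizer, an assumption)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_features(reports, labels, n, top_k):
    """Rank n-grams by |correlation| between their presence (0/1) and the label (0/1)."""
    grams_per_report = [set(ngrams(r, n)) for r in reports]
    vocab = sorted(set().union(*grams_per_report))
    scored = [
        (g, pearson([1 if g in grams else 0 for grams in grams_per_report], labels))
        for g in vocab
    ]
    # Sort by absolute correlation (strongest first), then alphabetically for stable ties.
    scored.sort(key=lambda item: (-abs(item[1]), item[0]))
    return scored[:top_k]


# Hypothetical toy reports: 1 = thrombosis reported, 0 = not reported.
toy_reports = [
    "occlusive thrombus in the popliteal vein",
    "thrombus in the femoral vein",
    "the popliteal vein is patent",
    "the femoral vein is patent",
]
toy_labels = [1, 1, 0, 0]

for gram, rho in rank_features(toy_reports, toy_labels, n=1, top_k=4):
    print(f"{gram}: {rho:+.2f}")
```

Ranking by absolute value lets terms that argue against the diagnosis (such as "patent" in this toy example) score as highly as terms that argue for it.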
Characteristics of the 1751 hospitalizations that contributed the 2000 narrative radiology reports
| Hospitalization characteristics | Statistics (n = 1751) |
|---|---|
| Demographic characteristics | |
| Sex | |
| Male, n (%) | 892 (50.9) |
| Female, n (%) | 859 (49.1) |
| Age, mean ± SD | 66.7 ± 16.1 |
| Nursing unit at the time of the examination | |
| Internal medicine, n (%) | 643 (36.7) |
| Surgery, n (%) | 537 (30.7) |
| Intensive care unit, n (%) | 332 (19.0) |
| Other (e.g., geriatrics, neurology, short stay), n (%) | 239 (13.6) |
| Length of hospital stay (days), median (IQR) | 15 (28) |
| Number of radiology reports contributed to the analyses | |
| One radiology report, n (%) | 1544 (88.2) |
| Two radiology reports, n (%) | 173 (9.9) |
| Three to five radiology reports, n (%) | 34 (1.9) |
Average accuracy estimates of the SVM models for identifying deep vein thrombosis (DVT)
| | Average estimates (95% CI)† | | | | |
|---|---|---|---|---|---|
| SVM models | Sensitivity | Specificity | PPV | NPV | AUC |
| Unigram only, linear kernel, no tuning | 0.65 (0.58 to 0.71) | 0.97 (0.96 to 0.98) | 0.82 (0.77 to 0.87) | 0.93 (0.90 to 0.96) | 0.94 (0.92 to 0.96)‡ |
| Unigram only, RBF kernel, no tuning | 0.57 (0.45 to 0.68) | 0.96 (0.95 to 0.97) | 0.72 (0.66 to 0.79) | 0.92 (0.89 to 0.95) | 0.93 (0.91 to 0.96)‡ |
| Unigram only, linear kernel, with tuning | 0.69 (0.64 to 0.73) | 0.97 (0.96 to 0.98) | 0.82 (0.77 to 0.87) | 0.94 (0.91 to 0.97) | 0.95 (0.94 to 0.97)‡ |
| Unigram only, RBF kernel, with tuning | 0.70 (0.65 to 0.74) | 0.97 (0.96 to 0.98) | 0.81 (0.74 to 0.88) | 0.94 (0.92 to 0.97) | 0.95 (0.92 to 0.97)‡ |
| Uni + bigrams, linear kernel, no tuning | 0.67 (0.63 to 0.73) | 0.98 (0.97 to 0.99) | 0.87 (0.82 to 0.93) | 0.94 (0.90 to 0.97) | 0.95 (0.93 to 0.98) |
| Uni + bigrams, RBF kernel, no tuning | 0.59 (0.49 to 0.69) | 0.96 (0.95 to 0.97) | 0.74 (0.66 to 0.82) | 0.92 (0.89 to 0.96) | 0.94 (0.92 to 0.96)‡ |
| Uni + bigrams, linear kernel, tuning | 0.70 (0.63 to 0.77) | 0.98 (0.97 to 0.99) | 0.85 (0.81 to 0.90) | 0.94 (0.91 to 0.97) | 0.96 (0.93 to 0.98) |
| Uni + bigrams, RBF kernel, tuning | 0.77 (0.72 to 0.84) | 0.97 (0.97 to 0.98) | 0.84 (0.79 to 0.89) | 0.96 (0.92 to 0.99) | 0.97 (0.94 to 0.99) |
| Uni + bigrams, RBF kernel, tuning, all features | 0.79 (0.74 to 0.84) | 0.98 (0.97 to 0.99) | 0.89 (0.84 to 0.94) | 0.96 (0.93 to 0.98) | 0.98 (0.97 to 0.99) |
Bold typeface is used to highlight the characteristics of the best performing SVM model. *p<0.001; statistically significant difference in performance compared to alternative SVM models.
†Averages correspond to the mean accuracy estimates obtained after 10 rounds of cross-validation.
‡Statistically significantly different compared to the best performing SVM model (i.e., Uni + bigrams, RBF kernel, tuning, all features).
AUC, area under the curve; NPV, negative predictive value; PPV, positive predictive value; RBF, radial basis function kernel; SVM, support vector machine.
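For readers less familiar with these measures, the sketch below shows how sensitivity, specificity, PPV, and NPV follow from a confusion matrix, and how per-fold estimates might be summarized as a mean with a 95% interval. The record does not state how the confidence intervals were computed; the normal approximation (z = 1.96) used here, the function names, and the example numbers are assumptions for illustration only.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Standard accuracy measures from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard for empty classes).
    """
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all actual positives
        "specificity": tn / (tn + fp),  # true negatives among all actual negatives
        "ppv": tp / (tp + fp),          # true positives among predicted positives
        "npv": tn / (tn + fn),          # true negatives among predicted negatives
    }


def mean_with_ci(values, z=1.96):
    """Mean of per-fold estimates with a normal-approximation interval (an assumption)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    half_width = z * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width


# Hypothetical counts for a single cross-validation fold.
print(confusion_metrics(tp=8, fp=2, tn=88, fn=2))

# Hypothetical per-fold sensitivities from 10 folds.
fold_sensitivity = [0.75, 0.80, 0.78, 0.82, 0.77, 0.81, 0.79, 0.76, 0.83, 0.79]
mean, low, high = mean_with_ci(fold_sensitivity)
print(f"{mean:.2f} ({low:.2f} to {high:.2f})")
```

With only 10 folds, a t-based critical value would give a somewhat wider interval than z = 1.96; the choice is left as a parameter.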
Figure 1: Average receiver operating characteristic (ROC) curve associated with the deep vein thrombosis (DVT) model. The average ROC curve for the DVT model was estimated by vertically averaging the 10 ROC curves generated during 10-fold cross-validation. The best performance was achieved using an SVM model trained on the whole feature set with an RBF kernel for which both the cost and γ parameters were optimized.
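Vertical averaging, as mentioned in the caption above, can be sketched as follows: each fold's ROC curve is sampled at a common grid of false positive rates, and the true positive rates are averaged at each grid point. The grid size, the linear interpolation, the function names, and the two toy curves are assumptions for illustration; the record does not give these details.

```python
def tpr_at(fpr, tpr, target_fpr):
    """Interpolate the true positive rate of one ROC curve at a given false positive rate.

    Assumes fpr is non-decreasing, starts at 0.0 and ends at 1.0. On vertical
    segments (repeated fpr values) the highest point of the segment is used.
    """
    i = max(k for k, f in enumerate(fpr) if f <= target_fpr)
    if i == len(fpr) - 1:
        return tpr[i]
    span = fpr[i + 1] - fpr[i]  # strictly positive by the choice of i
    return tpr[i] + (tpr[i + 1] - tpr[i]) * (target_fpr - fpr[i]) / span


def vertical_average(curves, n_points=11):
    """Average several ROC curves at fixed false positive rates.

    curves is a list of (fpr_list, tpr_list) pairs, one per cross-validation fold.
    """
    grid = [k / (n_points - 1) for k in range(n_points)]
    mean_tpr = []
    for g in grid:
        values = [tpr_at(fpr, tpr, g) for fpr, tpr in curves]
        mean_tpr.append(sum(values) / len(values))
    return grid, mean_tpr


# Two hypothetical fold-level curves: a perfect classifier and a chance-level one.
toy_curves = [
    ([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]),
    ([0.0, 1.0], [0.0, 1.0]),
]
grid, mean_tpr = vertical_average(toy_curves, n_points=3)
print(list(zip(grid, mean_tpr)))
```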
Average accuracy estimates of the SVM models for identifying pulmonary embolism (PE)
| | Average estimates (95% CI)‡ | | | | |
|---|---|---|---|---|---|
| SVM models | Sensitivity | Specificity | PPV | NPV | AUC |
| Unigram only, linear kernel, no tuning | 0.51 (0.34 to 0.68) | 0.99 (0.98 to 1.00) | 0.75 (0.62 to 0.88) | 0.96 (0.95 to 0.98) | 0.92 (0.87 to 0.96)† |
| Unigram only, RBF kernel, no tuning | 0.39 (0.32 to 0.46) | 0.98 (0.97 to 0.99) | 0.63 (0.45 to 0.81) | 0.95 (0.94 to 0.96) | 0.93 (0.92 to 0.95)† |
| Unigram only, linear kernel, with tuning | 0.53 (0.36 to 0.70) | 0.98 (0.97 to 0.99) | 0.75 (0.65 to 0.85) | 0.96 (0.94 to 0.98) | 0.92 (0.88 to 0.96)† |
| Unigram only, RBF kernel, with tuning | 0.55 (0.42 to 0.67) | 0.98 (0.97 to 0.99) | 0.70 (0.54 to 0.86) | 0.96 (0.95 to 0.98) | 0.95 (0.93 to 0.97)† |
| Uni + bigrams, linear kernel, no tuning | 0.60 (0.45 to 0.75) | 0.99 (0.98 to 1.00) | 0.85 (0.77 to 0.93) | 0.97 (0.95 to 0.98) | 0.95 (0.90 to 1.00) |
| Uni + bigrams, RBF kernel, no tuning | 0.40 (0.33 to 0.47) | 0.98 (0.97 to 0.99) | 0.67 (0.51 to 0.83) | 0.95 (0.94 to 0.96) | 0.95 (0.93 to 0.96)† |
| Uni + bigrams, linear kernel, tuning | 0.61 (0.46 to 0.76) | 0.99 (0.98 to 1.00) | 0.84 (0.76 to 0.92) | 0.97 (0.95 to 0.98) | 0.95 (0.90 to 1.00) |
| Uni + bigrams, RBF kernel, tuning | 0.66 (0.49 to 0.83) | 0.99 (0.98 to 1.00) | 0.80 (0.68 to 0.93) | 0.97 (0.96 to 0.99) | 0.96 (0.92 to 1.00) |
| Uni + bigrams, RBF kernel, tuning, all features | 0.78 (0.72 to 0.85) | 0.99 (0.98 to 0.99) | 0.84 (0.76 to 0.91) | 0.98 (0.98 to 0.99) | 0.99 (0.98 to 1.00) |
Bold typeface is used to highlight the characteristics of the best performing SVM model. *p<0.001; statistically significant difference in performance compared to alternative SVM models.
†Statistically significantly different compared to the best performing SVM model (i.e., Uni + bigrams, RBF kernel, tuning, all features).
‡Averages correspond to the mean accuracy estimates obtained after 10 rounds of cross-validation.
AUC, area under the curve; NPV, negative predictive value; PPV, positive predictive value; RBF, radial basis function kernel; SVM, support vector machine.
Figure 2: Average receiver operating characteristic (ROC) curve associated with the pulmonary embolism (PE) model. The average ROC curve for the PE model was estimated by vertically averaging the 10 ROC curves generated during 10-fold cross-validation. The best performance was achieved using an SVM model trained on the whole feature set with an RBF kernel for which both the cost and γ parameters were optimized.
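The AUC values reported in the tables summarize ROC curves like the one described above. As a hedged sketch, one common way to obtain a ROC curve and its AUC from classifier scores is to sweep the decision threshold and integrate with the trapezoidal rule. The function names and toy scores are hypothetical, and the record does not say which AUC estimator was actually used.

```python
def roc_points(scores, labels):
    """Build ROC points by sweeping the threshold from the highest score downward.

    labels are 1 (event present) or 0 (absent); both classes must occur.
    Reports sharing the same score are added together before a point is emitted.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last report with this score.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under a ROC curve using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area


# Hypothetical decision scores for four reports.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
fpr, tpr = roc_points(toy_scores, toy_labels)
print(auc_trapezoid(fpr, tpr))
```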