Daniel A da Silva, Carla S Ten Caten, Rodrigo P Dos Santos, Flavio S Fogliatto, Juliana Hsuan.
Abstract
In this study we propose using text mining and machine learning methods to predict and detect Surgical Site Infections (SSIs) from textual descriptions of surgeries and post-operative patient records, mined from the database of a high-complexity university hospital. SSIs are among the most common adverse events experienced by hospitalized patients; preventing such events is fundamental to ensuring patient safety. Knowledge of SSI occurrence rates may also be useful in preventing future episodes. We analyzed 15,479 surgery descriptions and post-operative records, testing different preprocessing strategies and the following machine learning algorithms: Linear SVC, Logistic Regression, Multinomial Naive Bayes, Nearest Centroid, Random Forest, Stochastic Gradient Descent, and Support Vector Classification (SVC). For prediction purposes, the best result was obtained using the Stochastic Gradient Descent method (79.7% ROC-AUC); for detection, Logistic Regression yielded the best performance (80.6% ROC-AUC).
Year: 2019 PMID: 31834905 PMCID: PMC6910696 DOI: 10.1371/journal.pone.0226272
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
List of surgical specialties and associated post-operative length-of-stay.
| Surgical specialty | | Post-operative length-of-stay |
|---|---|---|
| 1. Pediatrics | 2,963 | 58.5 |
| 2. Colorectal | 2,516 | 32.8 |
| 3. Neurosurgery | 2,392 | 27.1 |
| 4. Digestive System | 9,930 | 26.6 |
| 5. Urology | 11,136 | 22.0 |
| 6. Vascular | 3,435 | 21.9 |
| 7. Plastic | 1,663 | 21.8 |
| 8. Thoracic | 3,213 | 20.6 |
| 9. General | 9,294 | 15.7 |
| 10. Orthopedics and Traumatology | 6,468 | 13.0 |
| 11. Gynecology and Obstetrics | 5,398 | 8.8 |
| 12. Otorhino | 5,652 | 8.5 |
| 13. Oral and Maxillofacial | 278 | 7.5 |
| 14. Mastology | 2,034 | 3.0 |
Fig 1. Overview of the proposed method.
SSI database analyzed in this study.
| | Prediction: infected surgeries | Prediction: clean surgeries | Prediction: total | Detection: infected surgeries | Detection: clean surgeries | Detection: total |
|---|---|---|---|---|---|---|
| Initial sample | 247 | 27,401 | 27,648 | 233 | 15,481 | 15,714 |
| Empty records | -29 | -12,103 | -12,132 | -2 | -3,037 | -3,039 |
| Records of patients that had more than one surgery, one of which was reported clean | -4 | -7 | -11 | -3 | -9 | -12 |
| Infections reported more than 30 days after surgery | -26 | 0 | -26 | -26 | 0 | -26 |
| Records used in the study (final sample) | 188 (1.21%) | 15,291 (98.79%) | 15,479 (100%) | 202 (1.6%) | 12,435 (98.4%) | 12,637 (100%) |
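The final sample sizes follow from subtracting the three exclusion rows from the initial sample. A minimal Python sketch of that arithmetic, using only the counts in the table above; the dictionary layout and the function name are illustrative and not taken from the study:

```python
# Counts copied from the table: (infected, clean) per task.
# The dict layout and the helper name are illustrative assumptions.
SAMPLES = {
    "prediction": {
        "initial": (247, 27_401),
        "exclusions": [(29, 12_103), (4, 7), (26, 0)],
    },
    "detection": {
        "initial": (233, 15_481),
        "exclusions": [(2, 3_037), (3, 9), (26, 0)],
    },
}


def final_sample(task):
    """Return (infected, clean, total, infected share in %) after exclusions."""
    infected, clean = SAMPLES[task]["initial"]
    for drop_infected, drop_clean in SAMPLES[task]["exclusions"]:
        infected -= drop_infected
        clean -= drop_clean
    total = infected + clean
    return infected, clean, total, 100 * infected / total


for task in SAMPLES:
    infected, clean, total, share = final_sample(task)
    print(f"{task}: {infected} infected, {clean} clean, {total} total ({share:.2f}% infected)")
```

Both final samples contain fewer than 2% infected surgeries.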
Descriptive view of the dataset.
| Characteristic | Value |
|---|---|
| Number of patients | 12,483 |
| Mean (and SD) of patients’ age | 48.31 (22.03) |
| Average number of surgeries per patient | 1.24 |
| Female patients | 7,107 |
| Mean (and SD) of female patients’ age | 47.13 (20.44) |
| Male patients | 5,376 |
| Mean (and SD) of male patients’ age | 49.88 (23.87) |
| Number of surgical procedures | 18,062 |
| Elective procedures | 13,027 |
| Urgent procedures | 3,239 |
| Emergency procedures | 1,796 |
| Average size (and SD) of surgical team | 5.84 (2.81) |
Algorithms’ performance in predicting SSI.
| Method | T | P | FS | N | CW | ROC-AUC mean | ROC-AUC SD |
|---|---|---|---|---|---|---|---|
| Random Forest (RF) | TF-IDF | 85% | F | | 1/0.01 | 76.3% | 3.3% |
| Logistic Regression (LR) | TF-IDF | 55% | F | | 1/0.005 | 75.9% | 2.5% |
| Linear SVC (LSVC) | TF | 85% | F | | 1/0.005 | 79.0% | 4.7% |
| SVC | TF | 10% | F | | 1/0.001 | 75.3% | 4.6% |
| Nearest Centroid (NC) | TF | 20% | | - | | 78.2% | 4.3% |
| SGD | TF | 75% | | - | 1/0.01 | 79.7% | 3.3% |
| M-Naive Bayes (MNB) | TF | 40% | | - | 20/80 | 75.0% | 4.6% |
T: Transformation; P: Percentile; FS: Feature Selection; N: Normalization; CW: Class_Weight / Prior Probability; SD: Standard Deviation
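The T column distinguishes term-frequency (TF) features from TF-IDF features. As a rough illustration of the difference, the sketch below computes both for a toy corpus in plain Python. The example sentences, the function names, and the unsmoothed `log(N / df)` inverse document frequency are assumptions made for illustration; the study's exact weighting scheme is not stated in this record.

```python
import math
from collections import Counter

# Toy surgery descriptions, invented for illustration only.
DOCS = [
    "incision drained wound",
    "wound closed clean",
    "clean incision closed",
]


def term_frequencies(doc):
    """Raw term counts (TF) for one whitespace-tokenized document."""
    return Counter(doc.split())


def tf_idf(docs):
    """TF-IDF with the textbook idf = log(N / df); libraries often smooth this."""
    tfs = [term_frequencies(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tf in tfs for term in tf)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        for tf in tfs
    ]


weights = tf_idf(DOCS)
print(term_frequencies(DOCS[0]))
print(weights[0])
```

In the first toy document, `drained` and `wound` each occur once, but `drained` gets the larger TF-IDF weight because it appears in no other document.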
Fig 2. ROC-AUC performance of algorithms in predicting SSIs.
Fig 3. Precision-recall percentages and boxplots for surgical descriptions.
Fig 4. Precision-recall curves of methods tested for predicting SSIs.
Algorithms’ performance in detecting SSI.
| Method | T | P | FS | N | CW | ROC-AUC mean | ROC-AUC SD |
|---|---|---|---|---|---|---|---|
| Random Forest (RF) | TF | 20% | | | 1/0.01 | 76.1% | 3.4% |
| Logistic Regression (LR) | TF-IDF | 40% | | | 1/0.01 | 80.6% | 2.4% |
| Linear SVC (LSVC) | TF-IDF | 45% | | | 1/0.01 | 78.1% | 2.5% |
| SVC | TF | 25% | F | | 1/0.1 | 61.0% | 6.3% |
| Nearest Centroid (NC) | TF-IDF | 80% | | - | | 76.4% | 2.9% |
| SGD | TF | 55% | F | - | 1/0.05 | 63.6% | 1.0% |
| M-Naive Bayes (MNB) | TF-IDF | 80% | F | - | 20/80 | 64.1% | 6.5% |
T: Transformation; P: Percentile; FS: Feature Selection; N: Normalization; CW: Class_Weight / Prior Probability; SD: Standard Deviation
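Both performance tables report the mean and standard deviation of ROC-AUC. A minimal sketch of the metric itself follows, computed as the probability that a randomly chosen infected case is scored above a randomly chosen clean one, with ties counting one half. The labels, scores, and per-fold values are made up for illustration, and the study's resampling scheme is not described in this record.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: 1 = infected, 0 = clean.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.3, 0.2, 0.1]
print(roc_auc(labels, scores))

# Hypothetical per-fold AUCs, summarized as in the mean and SD columns.
fold_aucs = [0.78, 0.82, 0.80]
print(mean(fold_aucs), stdev(fold_aucs))
```

The pairwise loop is quadratic in the number of cases, which is fine for a toy example; a rank-based formulation would be the usual choice for samples of the size reported here.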
Fig 5. ROC-AUC performance of algorithms in detecting SSIs.
Fig 6. Precision-recall percentages and boxplots for post-operative notes.
Fig 7. Precision-recall curves of methods tested for detecting SSIs.
SGD method for prediction–confusion matrix.
| | Predicted clean | Predicted infected | Total |
|---|---|---|---|
| Actual clean | True negatives (TN): 9,939 | False positives (FP): 5,352 | 15,291 |
| Actual infected | False negatives (FN): 20 | True positives (TP): 168 | 188 |
| Total | 9,959 | 5,520 | 15,479 |
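The confusion matrix can be turned into the usual rates. The sketch below takes its counts from the table and derives the true negatives from the clean-row total (15,291) minus the false positives, so that the row and column totals are respected; the variable names are illustrative.

```python
# Counts from the SGD prediction confusion matrix.
actual_clean_total = 15_291
fp = 5_352
fn = 20
tp = 168
# True negatives derived from the clean-row total rather than typed in.
tn = actual_clean_total - fp

recall = tp / (tp + fn)        # sensitivity: share of infected cases flagged
specificity = tn / (tn + fp)   # share of clean cases left unflagged
precision = tp / (tp + fp)     # share of flagged cases that were infected

print(f"TN = {tn}")
print(f"recall = {recall:.3f}, specificity = {specificity:.3f}, precision = {precision:.3f}")
```

At this operating point the model flags about 89% of infected surgeries, but only about 3% of the flagged surgeries are infected, which is consistent with the roughly 1.2% share of infected surgeries in the prediction sample.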