Abstract
OBJECTIVES: We aim to extract a subset of social factors from clinical notes using common text classification methods.
Keywords: biotechnology & bioinformatics; health informatics; history (see medical history); social medicine
Year: 2022 PMID: 35042703 PMCID: PMC8768909 DOI: 10.1136/bmjopen-2020-048397
Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 2.692
Figure 1. High-level overview of the workflow process.
Figure 2. Text extraction, classification and scoring workflow. ED, emergency department.
Figure 3. Text extraction and cleaning process. Additional steps were performed for notes when classifying text related to tobacco and alcohol use to extract negative sentiment doubles or triples. ROS, Review of Systems.
Population demographics
| Race (n=43 798) | n (%) |
| White or Caucasian | 31 575 (72.1) |
| Black or African American | 4812 (11.0) |
| Asian | 3174 (7.2) |
| American Indian or Alaska Native | 1165 (2.7) |
| Native Hawaiian or other Pacific Islander | 524 (1.2) |
| Multiple races | 3 (0) |
| Unavailable, unknown or missing | 2545 (5.8) |
Extracted data amounts for housing status
| Level of extraction | Rows (n) | Unique patients (n) | Unique encounters (n) | Social history entries (n/unique) |
| ED and admit notes | 49 955 | 3233 | 15 664 | 21 876/21 334 |
| Housing, tobacco, alcohol information | 6000 | 218 | 1995 | 2408/2211 |
| Remove nulls/missing data | Housing: 1785 | Housing: 200 | 1361 | 1785/1684 |
ED, emergency department.
Word or phrase importance ranking
| Social factor (classifier) | Top 20 weighted words |
| Housing stability (support vector machine, n=1) | ['friends' 'motel' 'stay' 'cigs' 'found' 'street' 'stays' 'streets' 'van' |
| No tobacco use (logistic regression, n=1,2) | ['use denies' 'deneis' 'lives' 'tobacco drug' 'seattle denies' |
| No alcohol use (logistic regression, n=1,2) | ['care' 'ppd' 'tobacco' 'smoking' 'etoh tobacco' 'history cocaine' |
Accuracies among text classifiers
| Classifier | n=1 | n=1–2 |
| Multinomial naïve Bayes | Housing: 91.62% | Housing: 91.43% |
| Support vector machine | Housing: | Housing: 91.99% |
| Logistic regression | Housing: 84.36% | Housing: 90.13% |
| Random forest | Housing: 90.50% | Housing: 91.25% |
Bold values indicate the highest performance for each SDBH.
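The column labels n=1 and n=1–2 refer to the word n-gram ranges used as features: unigrams only, or unigrams plus bigrams. The following is a minimal sketch of that feature extraction, not the authors' implementation. The function name `word_ngrams`, the lowercasing, the whitespace tokenization and the example phrase are all assumptions made for illustration.

```python
def word_ngrams(text, n_min=1, n_max=1):
    """Return word n-grams of length n_min..n_max from a text.

    Assumption: tokens are lowercased and split on whitespace; the source
    does not describe the tokenizer that was actually used.
    """
    tokens = text.lower().split()
    features = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


if __name__ == "__main__":
    note = "Denies tobacco use"  # hypothetical note fragment
    print(word_ngrams(note, 1, 1))  # unigrams only (n=1)
    print(word_ngrams(note, 1, 2))  # unigrams and bigrams (n=1-2)
```

With the wider range, two-word phrases become features in their own right, which is consistent with bigrams such as 'use denies' and 'etoh tobacco' appearing among the top weighted terms for the n=1,2 classifiers.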
Best-performing classifier detailed metrics
| Social factor | Classifier | Accuracy | Recall | Precision | F1 |
| Housing status* | Support vector machine (n=1) | 0.92 | 0.93/0.91 (0/1) | 0.94/0.90 | 0.93/0.91 |
| Tobacco use† | Logistic regression (n=1–2) | 0.85 | 0.82/0.95/0.86 | 0.96/0.43/0.87 | 0.88/0.60/0.87 |
| Alcohol use† | Logistic regression (n=1–2) | 0.83 | 0.86/0.73/0.81 | 0.93/0.44/0.88 | 0.89/0.55/0.84 |
*0: no use, 1: current use.
†0: no use, 1: rare/occasional/history, 2: current use.
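As a consistency check on the per-class metrics, F1 is the harmonic mean of precision and recall, F1 = 2PR/(P + R). The sketch below recomputes F1 from the rounded recall and precision values in the table and compares each result with the reported F1. The helper names `f1_score` and `max_f1_gap` are hypothetical. Because the inputs are already rounded to two decimals, small differences from the reported values are expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class (recall, precision, reported F1), copied from the table above.
reported = {
    "Housing status": [(0.93, 0.94, 0.93), (0.91, 0.90, 0.91)],
    "Tobacco use": [(0.82, 0.96, 0.88), (0.95, 0.43, 0.60), (0.86, 0.87, 0.87)],
    "Alcohol use": [(0.86, 0.93, 0.89), (0.73, 0.44, 0.55), (0.81, 0.88, 0.84)],
}


def max_f1_gap(table):
    """Largest absolute gap between recomputed and reported F1."""
    return max(
        abs(f1_score(precision, recall) - f1)
        for rows in table.values()
        for recall, precision, f1 in rows
    )


if __name__ == "__main__":
    for factor, rows in reported.items():
        for label, (recall, precision, f1) in enumerate(rows):
            recomputed = f1_score(precision, recall)
            print(f"{factor} class {label}: recomputed {recomputed:.3f}, reported {f1:.2f}")
```

Every recomputed value falls within 0.01 of the reported F1, so the table is internally consistent up to rounding. The low precision for class 1 of tobacco and alcohol use (0.43 and 0.44) is what pulls those F1 scores well below the recall.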