Yuqing Yang, Xin Wang, Yu Huang, Ning Chen, Juhong Shi, Ting Chen.
Abstract
BACKGROUND: The Padua linear model is widely used for risk assessment of venous thromboembolism (VTE), a common but preventable complication in inpatients. However, genetic and environmental differences between Western and Chinese populations limit the validity of the Padua model in Chinese patients. Medical records, which contain rich information about disease progression, are useful for mining new risk factors related to Chinese VTE patients. Furthermore, machine learning (ML) methods provide new opportunities to build precise risk prediction models by automatically selecting risk factors from original medical records.
Keywords: Machine learning (ML); Medical record; Natural language processing (NLP); Risk assessment; Venous thromboembolism (VTE)
Year: 2019 PMID: 31391095 PMCID: PMC6686216 DOI: 10.1186/s12911-019-0856-2
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Padua risk assessment model
| Risk Factors | Score |
|---|---|
| Active malignant cancer/chemotherapy | 3 |
| Previous VTE | 3 |
| Reduced mobility | 3 |
| Thrombophilic condition | 3 |
| Recent trauma/surgery | 2 |
| Age ≥ 70 | 1 |
| Heart/respiratory failure | 1 |
| Acute myocardial infarction/ischemic stroke | 1 |
| Acute Infection/rheumatologic disorder | 1 |
| BMI ≥ 30 kg/m² | 1 |
| Ongoing glucocorticoid treatment | 1 |
A Padua score ≥ 4 is classified as high risk.
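The scoring rule in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the weights and the threshold of 4 come from the table, while the factor keys (for example `reduced_mobility`) and the function names are hypothetical names introduced here.

```python
# Weights copied from the Padua risk assessment table.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
PADUA_WEIGHTS = {
    "active_cancer_or_chemotherapy": 3,
    "previous_vte": 3,
    "reduced_mobility": 3,
    "thrombophilic_condition": 3,
    "recent_trauma_or_surgery": 2,
    "age_ge_70": 1,
    "heart_or_respiratory_failure": 1,
    "acute_mi_or_ischemic_stroke": 1,
    "acute_infection_or_rheumatologic_disorder": 1,
    "bmi_ge_30": 1,
    "ongoing_glucocorticoid_treatment": 1,
}
HIGH_RISK_THRESHOLD = 4  # a score >= 4 is classified as high risk


def padua_score(present_factors):
    """Sum the weights of the risk factors present in one patient."""
    factors = set(present_factors)  # each factor counts at most once
    unknown = factors - set(PADUA_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown risk factors: {sorted(unknown)}")
    return sum(PADUA_WEIGHTS[f] for f in factors)


def is_high_risk(present_factors):
    """Apply the high-risk threshold to the Padua score."""
    return padua_score(present_factors) >= HIGH_RISK_THRESHOLD


example = ["reduced_mobility", "age_ge_70"]
print(padua_score(example), is_high_risk(example))
```

A patient with reduced mobility (3) who is aged 70 or older (1) reaches a score of 4 and is therefore classified as high risk, whereas either factor alone would not cross the threshold.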
Fig. 1 The workflow of VTE prediction model construction from medical records
Fig. 2 One example of the greedy section match algorithm
Fig. 3 Neighbors of one term when calculating first order neighbors entropy
Fig. 4 Section evaluation process of VTE risk assessment model
Demographic characteristics of inpatients
| | VTE | non-VTE | P value | Total |
|---|---|---|---|---|
| N | 224 | 2882 | – | 3106 |
| Gender (Male) | 119 (53.13%) | 1587 (55.07%) | > 0.05 | 1673 |
| Age (Year) | 55.81 ± 16.45 | 52.75 ± 16.24 | < 0.05 | 52.97 ± 16.28 |
| BMI (kg/m²) | 23.99 ± 4.13 | 23.47 ± 4.27 | > 0.05 | 23.51 ± 4.26 |
| Hospital stay (Day) | 22 (14, 35) | 11 (6, 18) | < 0.05 | 12 (6, 19) |
| Padua Score | 5.88 ± 2.46 | 2.89 ± 2.45 | < 0.05 | 3.10 ± 2.57 |
| High Risk (Padua model) | 194 (86.61%) | 1078 (37.40%) | < 0.05 | 1272 |
| Active malignant cancer/chemotherapy | 70 (31.25%) | 807 (28.00%) | > 0.05 | 877 |
| Previous VTE | 34 (15.28%) | 17 (0.59%) | < 0.05 | 51 |
| Reduced mobility | 158 (70.54%) | 1030 (35.74%) | < 0.05 | 1188 |
| Thrombophilic condition | 25 (15.63%) | 53 (1.84%) | < 0.05 | 88 |
| Recent trauma/surgery | 14 (6.25%) | 87 (3.02%) | > 0.05 | 101 |
| Age ≥ 70 | 45 (20.09%) | 418 (14.50%) | < 0.05 | 463 |
| Heart/respiratory failure | 65 (29.12%) | 112 (3.89%) | < 0.05 | 177 |
| Acute myocardial infarction/ischemic stroke | 8 (3.57%) | 44 (1.53%) | > 0.05 | 52 |
| Acute Infection/rheumatologic disorder | 131 (58.48%) | 710 (24.64%) | < 0.05 | 841 |
| BMI ≥ 30 kg/m² | 13 (5.80%) | 165 (5.73%) | > 0.05 | 178 |
| Ongoing glucocorticoid treatment | 136 (60.71%) | 972 (33.73%) | < 0.05 | 1108 |
Hospital stay is expressed as ‘Median (lower quartile, upper quartile)’. Age, BMI and Padua score are expressed as ‘Mean ± Standard Deviation’
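The ‘High Risk (Padua model)’ row can be read as a 2 × 2 classification result for the Padua model on this cohort. The short calculation below is a sketch using only the counts in the table (224 VTE and 2882 non-VTE inpatients, of whom 194 and 1078 were flagged as high risk); the variable names are chosen here for illustration, and sensitivity and specificity are derived quantities, not figures quoted from the table.

```python
# Counts taken from the demographic table.
vte_total, non_vte_total = 224, 2882
vte_high_risk, non_vte_high_risk = 194, 1078

# Sensitivity: share of VTE patients flagged as high risk by the Padua model.
sensitivity = vte_high_risk / vte_total

# Specificity: share of non-VTE patients NOT flagged as high risk.
specificity = (non_vte_total - non_vte_high_risk) / non_vte_total

print(f"sensitivity = {sensitivity:.2%}")  # 86.61%
print(f"specificity = {specificity:.2%}")  # 62.60%
```

The sensitivity reproduces the 86.61% shown in the table, and the specificity is the complement of the 37.40% of non-VTE patients flagged as high risk.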
Number of terms of ontologies within different sections
| Section Name | Non-VTE Term | Non-VTE Neighbor | VTE Term | VTE Neighbor |
|---|---|---|---|---|
| Chief Complaint | 44 | 13 (6, 28) | 53 | 2 (1, 4) |
| Present History | 1162 | 37 (22, 75) | 1244 | 4 (2, 6) |
| Previous History | 197 | 32 (18, 53) | 224 | 4 (2, 6) |
| Personal History | 20 | 16 (9, 32) | 18 | 3 (2, 5) |
| Family History | 28 | 22 (9, 33) | 26 | 4 (1, 6) |
| Physical Examination | 391 | 16 (7, 30) | 380 | 3 (2, 6) |
| Laboratory Examination | 385 | 29 (18, 49) | 344 | 3 (2, 6) |
| Admitting Diagnosis | 175 | 42 (29, 67) | 211 | 5 (3, 8) |
| Progress Note | 733 | 148 (92, 267) | 811 | 31 (17, 58) |
‘Neighbor’ is the number of distinct words occurring around a term within a window of length 5, expressed as ‘Median (lower quartile, upper quartile)’
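The neighbor count described in this footnote can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the window is assumed to span 5 tokens on each side of the term (the source only says ‘window length 5’), the text is assumed to be already tokenized, and the function name `distinct_neighbors` is hypothetical.

```python
from collections import defaultdict


def distinct_neighbors(tokens, terms, window=5):
    """Count the distinct words seen within `window` tokens of each term.

    Assumption: the window extends `window` tokens to the left and to the
    right of every occurrence of a term.
    """
    neighbors = defaultdict(set)
    for i, token in enumerate(tokens):
        if token not in terms:
            continue
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # the term occurrence itself is not a neighbor
                neighbors[token].add(tokens[j])
    return {term: len(words) for term, words in neighbors.items()}


tokens = ("patient reports swelling of left lower limb "
          "with thrombus in femoral vein").split()
print(distinct_neighbors(tokens, {"thrombus"}))
```

Counts obtained this way for every term in a section could then be summarized as a median with lower and upper quartiles, as in the table.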
Fig. 5 AUC scores of four ML methods using top K = 100, 200, 300 and 400 terms
Comparison of AUC scores of four ML models using word vectors with different dimensions
| Dimension | 10 | 20 | 40 | 80 | 100 | 120 | 150 | 200 |
|---|---|---|---|---|---|---|---|---|
| AUC (GBDT) | 0.863 ± 0.024 | 0.891 ± 0.023 | 0.916 ± 0.026 | 0.916 ± 0.017 | 0.929 ± 0.014 | 0.929 ± 0.015 | 0.927 ± 0.020 | 0.927 ± 0.020 |
| AUC (RF) | 0.852 ± 0.022 | 0.871 ± 0.030 | 0.883 ± 0.023 | 0.881 ± 0.029 | 0.884 ± 0.021 | 0.893 ± 0.020 | 0.897 ± 0.018 | 0.884 ± 0.019 |
| AUC (LR) | 0.851 ± 0.023 | 0.884 ± 0.019 | 0.912 ± 0.019 | 0.921 ± 0.020 | 0.926 ± 0.022 | 0.923 ± 0.022 | 0.920 ± 0.026 | 0.923 ± 0.014 |
| AUC (SVM) | 0.862 ± 0.019 | 0.873 ± 0.034 | 0.861 ± 0.031 | 0.879 ± 0.021 | 0.869 ± 0.025 | 0.880 ± 0.031 | 0.867 ± 0.023 | 0.830 ± 0.034 |
AUC scores are reported as ‘Mean value ± Standard deviation’. All terms are used to train the models
AUC scores of GBDT models using only, or excluding, a single section
| Section Name | AUC (Only) | AUC (Exclusion) |
|---|---|---|
| Chief Complaint | 0.635 ± 0.029 | 0.930 ± 0.015 |
| Present History | 0.748 ± 0.023 | 0.933 ± 0.021 |
| Previous History | 0.610 ± 0.033 | 0.927 ± 0.019 |
| Personal History | 0.564 ± 0.019 | 0.929 ± 0.017 |
| Family History | 0.605 ± 0.029 | 0.926 ± 0.023 |
| Physical Examination | 0.711 ± 0.019 | 0.928 ± 0.021 |
| Laboratory Examination | 0.638 ± 0.036 | 0.940 ± 0.016 |
| Admitting Diagnosis | 0.754 ± 0.034 | 0.923 ± 0.017 |
| Progress Note | 0.939 ± 0.018 | 0.784 ± 0.035 |
| ALL | 0.929 ± 0.014 | |
AUC scores are reported as ‘Mean value ± Standard deviation’. ‘AUC (Only)’ denotes scores obtained using terms from that section alone, and ‘AUC (Exclusion)’ denotes scores obtained with that section excluded. ‘ALL’ means that terms from all sections are used
Fig. 6 Unique terms within progress notes and admitting diagnosis only from VTE or non-VTE patients
Fig. 7 Relationships between predictive validity of two models, GBDT and Padua, and the number of terms
Classification of typical terms proposed by ML model
| Classification Name | Typical Terms |
|---|---|
| Predilection site and clinical manifestation | Thrombus, Lower limb vein, Left lower limb, Posterior tibial vein, Thrombosis, Femoral vein, Embolism, Shank, Hemoptysis, Pulmonary artery |
| Treatment | Warfarin |
| Tumor | Paclitaxel, Lymphoma, Lung adenocarcinoma |
| Surgery, Trauma, Invasive Operation | Resection, Peritoneal dialysis, Cholecystectomy, Stenting |
| Rheumatic Disease | Rheumatoid arthritis, Prednisone, Hyperuricemia, Lupus nephritis |
| Acute Infection | Bacteria, Pneumonia, Antibiotics, Septic shock, Vancomycin, Soft tissue infection |
| Mechanical Ventilation | Mask |
Terms’ average positions of ontology enrichment and TF-IDF methods
| Top@K | 10 | 30 | 50 | 70 | 90 | 110 |
|---|---|---|---|---|---|---|
| Ontology Enrichment | 146.6 | 214.1 | 218.9 | 218.7 | 215.6 | 218.7 |
| TF-IDF | 161.4 | 238.0 | 248.4 | 242.9 | 237.2 | 235.8 |
‘Top@K’ denotes the top K terms among the 110 terms proposed by the new VTE risk assessment model
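One plausible reading of this table is that, for the top K model-proposed terms, each term is looked up in the ranking produced by a term-weighting method (ontology enrichment or TF-IDF) and the resulting positions are averaged, so that a lower value means the method ranks those terms closer to the top. The sketch below implements that reading only; the interpretation, the function name `average_position`, and the toy term lists are assumptions, and terms absent from a ranking are simply skipped.

```python
def average_position(model_terms, method_ranking, k):
    """Mean 1-based position, in `method_ranking`, of the first k model terms.

    Assumption: terms that do not appear in `method_ranking` are ignored.
    """
    position = {term: rank
                for rank, term in enumerate(method_ranking, start=1)}
    ranks = [position[t] for t in model_terms[:k] if t in position]
    if not ranks:
        raise ValueError("none of the top-k terms appear in the ranking")
    return sum(ranks) / len(ranks)


# Toy data for illustration only; not taken from the study.
model_terms = ["thrombus", "warfarin", "pneumonia"]
method_ranking = ["pneumonia", "thrombus", "mask", "warfarin"]
print(average_position(model_terms, method_ranking, k=2))  # 3.0
```

Under this reading, the consistently lower values in the ‘Ontology Enrichment’ row would indicate that it places the model-proposed terms nearer the top of its ranking than TF-IDF does.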