Yanqun Huang1,2, Zhimin Zheng1,2, Moxuan Ma1,2, Xin Xin1,2, Honglei Liu1,2, Xiaolu Fei3, Lan Wei3, Hui Chen1,2.
Abstract
BACKGROUND: The widespread secondary use of electronic medical records (EMRs) promotes health care quality improvement. Representation learning that can automatically extract hidden information from EMR data has gained increasing attention.
Keywords: acute myocardial infarction; feature association strengths; feature importance; mortality risk prediction; representation learning; skip-gram
Year: 2022 PMID: 35921141 PMCID: PMC9386580 DOI: 10.2196/37486
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 7.076
Figure 1. Overview of the proposed representation learning method for patients’ mortality risk prediction. First, feature representations were learned by the skip-gram algorithm using an adaptive context window. Then, patient representations were constructed based on feature representations weighted by the feature importance. Finally, the proposed patient representation was applied to mortality risk prediction for inpatients with acute myocardial infarction from a public data set and a private data set and compared with reference methods.
Figure 2. An illustration of context concept selection for the skip-gram algorithm using association strengths. All records are composed of 10 concepts (C1, C2, ..., C10). In the confidence matrix, element Cij is the confidence of the association rule with Cj as antecedent and Ci as consequent. For patient 1, who has 6 concepts (C1, C3, C6, C7, C8, and C10), the concepts in C1’s 4-concept context window were selected from the remaining 5 candidate concepts, whose confidences as antecedents were 0.66 (C10), 0.62 (C3), 0.55 (C6), 0.53 (C8), and 0.46 (C7). Therefore, C10, C3, C6, and C8 were selected to construct the context window for C1.
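The selection rule illustrated in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `confidence` and `select_context` are hypothetical, and confidence is assumed to follow the standard association-rule definition (the number of records containing both concepts divided by the number containing the antecedent). The confidences for patient 1 are taken from the caption.

```python
def confidence(records, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent over a list of concept sets."""
    n_antecedent = sum(1 for r in records if antecedent in r)
    if n_antecedent == 0:
        return 0.0
    n_both = sum(1 for r in records if antecedent in r and consequent in r)
    return n_both / n_antecedent


def select_context(target, patient_concepts, conf_with_target, window_size):
    """Keep the window_size co-occurring concepts with the highest confidence."""
    candidates = [c for c in patient_concepts if c != target]
    ranked = sorted(candidates, key=lambda c: conf_with_target[c], reverse=True)
    return ranked[:window_size]


# Confidences from the Figure 2 caption (candidate concept as antecedent, C1 as consequent).
fig2_conf = {"C10": 0.66, "C3": 0.62, "C6": 0.55, "C8": 0.53, "C7": 0.46}
patient_1 = ["C1", "C3", "C6", "C7", "C8", "C10"]

context = select_context("C1", patient_1, fig2_conf, window_size=4)
print(context)  # ['C10', 'C3', 'C6', 'C8']
```

The resulting (target, context) pairs would then be fed to a skip-gram model in place of pairs drawn from a fixed positional window.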
Concepts and feature groups of both the public and private data sets.
| Feature category | Public data set: feature groups (n=104), n | Public data set: concepts (n=3326), n | Private data set: feature groups (n=108), n | Private data set: concepts (n=1073), n | Concept examples |
| Age | 1 | 2 | 1 | 2 | >60 years and ≤60 years |
| Gender | 1 | 2 | 1 | 2 | Male and female |
| Laboratory tests | 19 | 38 | 40 | 80 | Abnormal serum triglyceride and normal serum creatinine |
| Radiological features | 34 | 34 | 36 | 36 | Cardiac image enlargement and sharp costophrenic angle |
| Disease diagnoses | 24 | 2600 | 15 | 739 | Hypertension and brainstem infarction |
| Procedures | 18 | 643 | 8 | 207 | Coronary stenting and pericardiocentesis |
| Medications | 7 | 7 | 7 | 7 | Angiotensin-converting enzyme inhibitor and heparin |
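As a rough illustration of how a raw record could be mapped to concepts like those in the table above, the sketch below turns one patient record into a list of concept tokens. Everything specific here is an assumption made for illustration: the function name `to_concepts`, the record layout, the token spellings, and the reference range for serum creatinine are hypothetical and are not taken from the study.

```python
def to_concepts(record, reference_ranges):
    """Map a raw patient record to a list of concept tokens."""
    concepts = [
        "age>60" if record["age"] > 60 else "age<=60",
        "gender=" + record["gender"],
    ]
    # Each laboratory test becomes a "normal" or "abnormal" concept.
    for test, value in record["labs"].items():
        low, high = reference_ranges[test]
        status = "normal" if low <= value <= high else "abnormal"
        concepts.append(f"{status} {test}")
    # Diagnoses, procedures, and medications are already categorical.
    concepts.extend(record["diagnoses"])
    return concepts


# Hypothetical record and reference range, for illustration only.
record = {
    "age": 67,
    "gender": "male",
    "labs": {"serum creatinine": 80.0},
    "diagnoses": ["hypertension"],
}
ranges = {"serum creatinine": (50.0, 110.0)}

tokens = to_concepts(record, ranges)
print(tokens)  # ['age>60', 'gender=male', 'normal serum creatinine', 'hypertension']
```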
Descriptions of the proposed and reference representation methods.
| Representation method | Descriptions | Representation examples |
| Mixture | The mixture of discretization codes for original discrete features and original values for continuous features. Missing values in the laboratory tests were imputed using the mean of the corresponding laboratory test. | (0,1,1,0,0,0,1,12,8.5,3,8) for a patient with 11 features |
| Discretization | The 0-1 vector where the digit 1 represented the patient having the specific disease, procedure, radiological feature, or medication, and 0 otherwise. Age of 1 meant >60 years and 0 meant ≤60 years, gender of 1 meant male and 0 meant female, and a laboratory test item of 1 meant abnormal and 0 meant normal. Missing values for laboratory tests were imputed using the corresponding mode. | (0,1,1,0,0,0,1,1,0,1,1) for a patient with 11 discretization features |
| DIS_FSa | The discretization representation restricted to the selected features, that is, those that differed statistically between patients with and without the label “death.” | (0,0,1,0,0,1,0,1) for a patient with 8 selected features |
| DIS_AEb | The hidden-layer vector of a 3-layer autoencoder with discretization vectors as inputs and outputs. The dimension of the hidden layer was set to 64. | (0.7,1.9,0.5,−1,−3.1,2.4) for a patient with a 6-dimensional vector |
| RAN_EM_AVEc | The average of feature embedding vectors learned from the skip-gram algorithm using the random selection method to determine the context window. | (1.6,−0.5,1.1,0.1,−1.3,0.6) for a patient with a 6-dimensional embedding vector |
| RAN_EM_WGTd | The weighted sum of the feature embedding vectors learned from the skip-gram algorithm using the random selection method to determine the context window. | (1.2,−0.9,1.3,0.4,−1.9,1.0) for a patient with a 6-dimensional embedding vector |
| ANT_EM_AVEe | The average of the feature embedding vectors learned from the skip-gram algorithm using the confidence with the target concept as the antecedent. | (0.9,−0.6,1.2,1.4,−1.9,0.6) for a patient with a 6-dimensional embedding vector |
| ANT_EM_WGTf | The weighted sum of the feature embedding vectors learned from the skip-gram algorithm using the confidence with the target concept as the antecedent. | (1.2,−1.5,1.1,0.1,−0.6,0.6) for a patient with a 6-dimensional embedding vector |
| CON_EM_AVEg | The average of the feature embedding vectors learned from the skip-gram algorithm using the confidence with the target concept as the consequent. | (1.6,−0.8,2.1,1.6,−1.4,1.5) for a patient with a 6-dimensional embedding vector |
| CON_EM_WGTh | The weighted sum of the feature embedding vectors learned from the skip-gram algorithm using the confidence with the target concept as the consequent. | (1.1,−0.4,−0.7,1.6,−0.3,0.9) for a patient with a 6-dimensional embedding vector |
aDIS_FS: discretization representations with feature selection.
bDIS_AE: hidden vector of an autoencoder-based representation.
cRAN_EM_AVE: average of the random selection–based embedding representation.
dRAN_EM_WGT: weighted sum of the random selection–based embedding representation.
eANT_EM_AVE: average of the antecedent-based embedding representation.
fANT_EM_WGT: weighted sum of the antecedent-based embedding representation.
gCON_EM_AVE: average of the consequent-based embedding representation.
hCON_EM_WGT: weighted sum of the consequent-based embedding representation.
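The difference between the `_AVE` and `_WGT` representations can be made concrete with a small sketch. It assumes that each patient feature already has an embedding vector and an importance weight; the function names, the 2-dimensional embeddings, and the weights below are hypothetical, and whether the study normalizes the weights is not stated here, so the raw weighted sum is used.

```python
def average_embedding(vectors):
    """Element-wise mean of the feature embedding vectors (the *_AVE variants)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def weighted_sum_embedding(vectors, weights):
    """Importance-weighted sum of the feature embedding vectors (the *_WGT variants)."""
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


# Hypothetical embeddings and importance weights for a patient with two features.
embeddings = [[1.0, 0.0], [0.0, 2.0]]
importance = [0.5, 0.25]

print(average_embedding(embeddings))                   # [0.5, 1.0]
print(weighted_sum_embedding(embeddings, importance))  # [0.5, 0.5]
```

With equal weights of 1/n the weighted sum reduces to the average, so the two families differ only in how much each feature is allowed to contribute.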
Figure 3. Visualization of the laboratory test embeddings learned with different selection schemes for contextual concepts in the skip-gram algorithm (projected with the t-distributed stochastic neighbor embedding algorithm). Dots in red and green represent abnormal and normal laboratory test results, respectively. A to C show the public data set: the contextual concepts of a target concept consist of its consequent concepts (A) or antecedent concepts (B) in association rules, or randomly selected concepts (C). D to F are the counterparts of A to C for the private data set.
Predictive performance of patient representation methods on the private data set.
| Feature set and representation methods | AUROCa, mean (95% CI) | AUPRCb, mean (95% CI) | F1-score, mean (95% CI) |
| CON_EM_WGTc | 0.973 (0.951-0.995) | 0.505 (0.278-0.732) | 0.674 (0.468-0.880) |
| CON_EM_AVEd | 0.957 (0.933-0.981) | 0.312 (0.159-0.465) | 0.479 (0.301-0.657) |
| ANT_EM_WGTe | 0.972 (0.948-0.996) | 0.489 (0.258-0.720) | 0.658 (0.442-0.874) |
| ANT_EM_AVEf | 0.953 (0.929-0.977) | 0.310 (0.185-0.435) | 0.478 (0.329-0.627) |
| RAN_EM_WGTg | 0.967 (0.942-0.992) | 0.486 (0.263-0.709) | 0.660 (0.460-0.860) |
| RAN_EM_AVEh | 0.948 (0.923-0.973) | 0.287 (0.167-0.407) | 0.451 (0.306-0.596) |
| DIS_AEi | 0.884 (0.845-0.923) | 0.207 (0.144-0.270) | 0.361 (0.279-0.443) |
| DIS_FSj | 0.938 (0.907-0.969) | 0.283 (0.167-0.399) | 0.452 (0.309-0.595) |
| Discretization | 0.939 (0.908-0.970) | 0.283 (0.165-0.401) | 0.454 (0.307-0.601) |
| Mixture | 0.904 (0.849-0.959) | 0.251 (0.135-0.367) | 0.417 (0.264-0.570) |
| | | | |
| CON_EM_WGT | 0.926 (0.883-0.969) | 0.282 (0.139-0.425) | 0.456 (0.282-0.630) |
| CON_EM_AVE | 0.915 (0.876-0.954) | 0.248 (0.156-0.340) | 0.413 (0.297-0.529) |
| ANT_EM_WGT | 0.919 (0.874-0.964) | 0.278 (0.133-0.423) | 0.455 (0.275-0.635) |
| ANT_EM_AVE | 0.912 (0.869-0.955) | 0.256 (0.162-0.350) | 0.423 (0.307-0.539) |
| RAN_EM_WGT | 0.915 (0.868-0.962) | 0.248 (0.119-0.377) | 0.416 (0.238-0.594) |
| RAN_EM_AVE | 0.897 (0.850-0.944) | 0.225 (0.133-0.317) | 0.385 (0.265-0.505) |
| DIS_AE | 0.884 (0.845-0.923) | 0.207 (0.144-0.270) | 0.361 (0.279-0.443) |
| DIS_FS | 0.903 (0.862-0.944) | 0.214 (0.124-0.304) | 0.367 (0.236-0.498) |
| Discretization | 0.905 (0.862-0.948) | 0.224 (0.122-0.326) | 0.381 (0.238-0.524) |
| Mixture | 0.867 (0.806-0.928) | 0.202 (0.116-0.288) | 0.356 (0.227-0.485) |
aAUROC: area under the receiver operating characteristic curve.
bAUPRC: area under the precision-recall curve.
cCON_EM_WGT: weighted sum of the consequent-based embedding representation.
dCON_EM_AVE: average of the consequent-based embedding representation.
eANT_EM_WGT: weighted sum of the antecedent-based embedding representation.
fANT_EM_AVE: average of the antecedent-based embedding representation.
gRAN_EM_WGT: weighted sum of the random selection–based embedding representation.
hRAN_EM_AVE: average of the random selection–based embedding representation.
iDIS_AE: hidden vector of an autoencoder-based representation.
jDIS_FS: discretization representations with feature selection.
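For readers less familiar with the three metrics reported in the table, the sketch below computes each from first principles on a toy example. It is only an illustration under stated assumptions: the labels and scores are made up, the 0.5 decision threshold is arbitrary, and AUPRC is approximated by average precision, which is one common estimator and not necessarily the one used in the study.

```python
def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive case."""
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy labels (1 = death) and predicted risks, for illustration only.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(round(auroc(y_true, scores), 3))              # 0.833
print(round(average_precision(y_true, scores), 3))  # 0.833
print(round(f1_score(y_true, y_pred), 3))           # 0.8
```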
Figure 4. The mean absolute Shapley additive explanations (SHAP) values of the top 20 features of the private data set within the entire feature set (A) and the treatment-free feature set (B).
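The ranking shown in Figure 4 amounts to averaging the absolute SHAP value of each feature over all patients and sorting the result. A minimal sketch follows, assuming the per-patient SHAP values have already been computed elsewhere; the function name `rank_by_mean_abs_shap`, the feature names, and the values are hypothetical.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names, top_k=20):
    """Rank features by mean |SHAP| across patients (rows), largest first."""
    n = len(shap_rows)
    means = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical SHAP values: 2 patients (rows) by 2 features (columns).
shap_rows = [[0.5, -0.25], [-1.5, 0.25]]
names = ["feature_a", "feature_b"]

ranking = rank_by_mean_abs_shap(shap_rows, names)
print(ranking)  # [('feature_a', 1.0), ('feature_b', 0.25)]
```

Taking the absolute value before averaging matters: a feature that pushes some patients strongly toward death and others strongly away from it would otherwise average out near zero.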
Figure 5. Shapley additive explanations (SHAP) values for a patient who died during the hospital stay (A and C) and another patient who did not die (B and D). Both patients were selected from the private data set with the entire feature set. A and B show all features with their SHAP values. C and D show the 20 features with the greatest absolute SHAP values. Features in blue tend to reduce the probability of a patient being classified as positive (death in this study), while features in red do the opposite. The meaning of each abbreviated feature name can be found in Multimedia Appendix 1.