Jingfeng Chen1,2, Chonghui Guo2, Menglin Lu2, Suying Ding1.
Abstract
OBJECTIVE: The reasonable classification of a large number of distinct diagnosis codes can clarify patient diagnostic information and help clinicians to improve their ability to assign and target treatment for primary diseases. Our objective is to identify and predict a unifying diagnosis (UD) from electronic medical records (EMRs).
Keywords: clustering; disease ontology structure; electronic medical records; set similarity measure; unifying diagnosis
Year: 2022 PMID: 35127624 PMCID: PMC8811031 DOI: 10.3389/fpubh.2021.793801
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
Figure 1. Dataset selection of sepsis patients from the MIMIC-III database.
Feature information of the health condition of sepsis patients.
| Category | Feature | Value (type) |
|---|---|---|
| Demographic information | Admission type | Emergency, elective, urgent (Nominal) |
| | Gender | Female, male (Nominal) |
| | Age | [18, 89] (Numeric) |
| Laboratory examination information | Potassium level, PO2, serum bicarbonate level, temperature, sodium level, urine out foley, urea nitrogen, WBC, bilirubin level, GCSmotor, GCSeyes, HR, GCSverbal, NBP, RR, SPO2, hemoglobin, platelet count, creatinine | Minimum, maximum, median, mean, and variance value (Numeric) |
| Symptom information | Fever, abdominal pain, shortness of breath, nausea and vomiting, weakness, diarrhea, dizziness, palpitation, cough, fatigue, discomfort, dysuria, shock, weight change, loss of appetite, and night sweating | 0, 1 (Nominal) |
| Related indicators | AIDS, hematologic malignancy, metastatic cancer | 0, 1 (Nominal) |
| | SOFA, SAPS, and SAPS-II | Integer (Numeric) |
Figure 2. Research framework for applying the proposed UDIPM to EMRs.
Figure 3. Local ontology structure of ICD-9 codes.
Figure 4. Example of LCA generation in the ICD-9 ontology structure. (A) denotes the ICD-9 ontology structure, and (B) denotes the diagnosis codes of two patients.
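The idea of a least common ancestor (LCA) in a code hierarchy can be illustrated with a minimal sketch. The parent map below is a toy tree with illustrative node names; it is an assumption made for this example and does not reproduce the ICD-9 ontology structure used in the paper. The helper names `path_to_root` and `least_common_ancestor` are likewise hypothetical.

```python
# Toy parent map (child -> parent). Node names are illustrative only and
# are NOT the actual ICD-9 ontology from the paper.
PARENT = {
    "518.81": "518.8",
    "518.82": "518.8",
    "518.8": "518",
    "518.0": "518",
    "518": "root",
    "486": "root",
}


def path_to_root(code, parent=PARENT):
    """Return the nodes from `code` up to the root, inclusive."""
    path = [code]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def least_common_ancestor(a, b, parent=PARENT):
    """Return the deepest node shared by the root paths of `a` and `b`."""
    ancestors_of_a = set(path_to_root(a, parent))
    # Walk upward from b; the first node also above a is the LCA.
    for node in path_to_root(b, parent):
        if node in ancestors_of_a:
            return node
    return None


if __name__ == "__main__":
    print(least_common_ancestor("518.81", "518.82"))
    print(least_common_ancestor("518.81", "486"))
```

Two codes that share a deep ancestor sit close together in the hierarchy, which is the intuition an ontology-based code similarity would build on.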
Patient similarity measure method.
Patient clustering algorithm.
TDCCoP extraction method.
Figure 5. Proposed UD identification method.
UD identification method.
Figure 6. Proposed UD prediction method.
Figure 7. Distribution of the number of clusters for different values of p.
Figure 8. Distribution of TDCs for 800 core patients.
Figure 9. Co-occurrence relation and AOrd of all TDCs. (A) Co-occurrence relation. (B) AOrd.
Detailed description of three TDCs.
| TDCCoP | No. | ICD-9 code | Disease | Occurrence probability | AOrd | New order |
|---|---|---|---|---|---|---|
| TDCCoP1 (1391) | 1 | 518.81 | Acute respiratory failure | 0.604 | 4.145 | 3 |
| | 2 | 38.9 | Septicemia | 0.769 | 2.411 | 1 |
| | 4 | 785.52 | Septic shock | 0.669 | 4.090 | 2 |
| | 5 | 584.9 | Acute kidney failure | 0.534 | 4.956 | 4 |
| | 14 | 995.92 | Severe sepsis | 0.824 | 7.816 | 5 |
| TDCCoP2 (3027) | 1 | 518.81 | Acute respiratory failure | 0.526 | 7.665 | 3 |
| | 2 | 38.9 | Septicemia | 0.608 | 7.545 | 2 |
| | 4 | 785.52 | Septic shock | 0.729 | 7.813 | 8 |
| | 5 | 584.9 | Acute kidney failure | 0.554 | 7.377 | 1 |
| | 12 | 427.31 | Atrial fibrillation | 0.593 | 8.038 | 11 |
| | 14 | 995.92 | Severe sepsis | 0.941 | 8.031 | 10 |
| | 30 | 428.0 | Congestive heart failure | 0.729 | 7.703 | 5 |
| | 46 | 486.0 | Pneumonia organism | 0.334 | 7.805 | 6 |
| | 58 | 599.0 | Urinary tract infection | 0.389 | 7.701 | 4 |
| | 62 | 401.9 | Essential hypertension | 0.343 | 7.807 | 7 |
| | 63 | 276.2 | Acidosis | 0.360 | 7.875 | 9 |
| | 77 | 250.0 | Diabetes mellitus without complication | 0.383 | 8.062 | 12 |
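In the table above, the last column is consistent with ranking the codes of each group by the second-to-last column (the average order, AOrd) in ascending order. The sketch below reproduces that ranking for the first group. Whether the paper derives the new order exactly this way is an assumption, and the function name `new_order` is hypothetical.

```python
# AOrd values of the first group, copied from the table above.
AORD_TDCCOP1 = {
    "518.81": 4.145,
    "38.9": 2.411,
    "785.52": 4.090,
    "584.9": 4.956,
    "995.92": 7.816,
}


def new_order(aord):
    """Rank codes by ascending average order; rank 1 is the smallest AOrd."""
    ranked = sorted(aord, key=aord.get)
    return {code: rank for rank, code in enumerate(ranked, start=1)}


if __name__ == "__main__":
    for code, rank in sorted(new_order(AORD_TDCCOP1).items(), key=lambda kv: kv[1]):
        print(rank, code)
```

Applied to the first group, this yields ranks 1 to 5 for 38.9, 785.52, 518.81, 584.9, and 995.92, matching the tabulated values.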
Figure 10. LCoP2 identified using the visualization of TDCoP3 in the ontology structure.
CCoM2 of the LCoP2.
Values in brackets are the orders of the seven diseases. Bold values on the main diagonal denote the occurrence probabilities of the seven diseases, and values in red and blue are conditional probabilities for distinguishing between primary diseases and complications.
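A matrix of this shape, with occurrence probabilities on the diagonal and conditional probabilities elsewhere, can be sketched from patient code sets as follows. The orientation (entry (i, j) holds P(j | i)), the function name `conditional_cooccurrence`, and the toy data are assumptions for illustration; the paper's exact definition of the CCoM is not given in this excerpt.

```python
def conditional_cooccurrence(patients, codes):
    """Build a dict-based matrix over `codes`.

    Diagonal entry (i, i): share of patients whose code set contains i.
    Off-diagonal entry (i, j): share of patients with i who also have j,
    i.e. an estimate of P(j | i). This orientation is an assumption.
    """
    n = len(patients)
    count = {c: sum(1 for p in patients if c in p) for c in codes}
    matrix = {}
    for i in codes:
        for j in codes:
            if i == j:
                matrix[i, j] = count[i] / n
            else:
                both = sum(1 for p in patients if i in p and j in p)
                matrix[i, j] = both / count[i] if count[i] else 0.0
    return matrix


if __name__ == "__main__":
    # Hypothetical patients, each represented by a set of diagnosis codes.
    toy_patients = [{"A", "B"}, {"A"}, {"A", "B", "C"}, {"B"}]
    m = conditional_cooccurrence(toy_patients, ["A", "B", "C"])
    print(m["A", "A"], m["C", "A"], m["A", "C"])
```

An asymmetric pair such as a high P(A | C) with a low P(C | A) is the kind of signal that could help separate a primary disease from a complication.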
Figure 11. Classification performance of the proposed UDIPM. (A) AUC. (B) Acc, Pre, Rec, and F1.
Figure 12. Ten most important features using the random forest method.
Evaluation methods and metrics used in our experiment.
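| Method | Similarity measure | Clustering | Classification |
|---|---|---|---|
| The proposed method (UDIPM) | Set similarity based on ontology | AP | Logistic regression |
| Fusion method 1 (FM1) | $\mathrm{Dice} = 2\lvert A \cap B\rvert / (\lvert A\rvert + \lvert B\rvert)$ | | Decision tree |
| Fusion method 2 (FM2) | $\mathrm{Jaccard} = \lvert A \cap B\rvert / \lvert A \cup B\rvert$ | | Random forest |
| Fusion method 3 (FM3) | $\mathrm{Cosine} = \lvert A \cap B\rvert / \sqrt{\lvert A\rvert \, \lvert B\rvert}$ | | SVM |
| Fusion method 4 (FM4) | $\mathrm{Overlap} = \lvert A \cap B\rvert / \min(\lvert A\rvert, \lvert B\rvert)$ | | XGBoost |
| Metric | AUC | | |
| | Acc = (TP + TN)/N | | |
| | Pre = TP/(TP + FP) | | |
| | Rec = TP/(TP + FN) | | |
| | F1 = 2 · Pre · Rec/(Pre + Rec) | | |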
A and B are the diagnosis code sets of two patients. The Dice method is the same as the proposed UDIPM when we do not consider the disease ontology structure and replace the code similarity with .
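The four set-based baselines named above can be sketched directly on Python sets. The sketch uses the standard textbook definitions of the Dice, Jaccard, cosine, and overlap coefficients, which are assumed to match the paper's intent; the example code sets are hypothetical, and both sets are assumed to be non-empty.

```python
import math


def dice(a, b):
    """Dice coefficient: 2|A n B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard index: |A n B| / |A u B|."""
    return len(a & b) / len(a | b)


def cosine(a, b):
    """Set cosine similarity: |A n B| / sqrt(|A| * |B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def overlap(a, b):
    """Overlap coefficient: |A n B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    # Hypothetical diagnosis code sets of two patients.
    patient_a = {"518.81", "38.9", "785.52"}
    patient_b = {"38.9", "785.52", "584.9", "995.92"}
    for measure in (dice, jaccard, cosine, overlap):
        print(measure.__name__, round(measure(patient_a, patient_b), 3))
```

All four measures treat codes as either identical or different, so two closely related codes contribute nothing to the score. That is the gap an ontology-aware similarity is meant to fill.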
Figure 13. Similarity measure and clustering results of different fusion methods.
Classification results of different fusion methods.
| Method | Classifier | Acc | Pre | Rec | F1 | AUC |
|---|---|---|---|---|---|---|
| FM1 (Dice) | Logistic regression | 0.725 | 0.739 | 0.725 | 0.721 | 0.782 |
| | Decision tree | 0.682 | 0.683 | 0.682 | 0.682 | 0.682 |
| FM2 (Jaccard) | Random forest | 0.779 | 0.782 | 0.779 | 0.778 | 0.851 |
| | SVM | 0.722 | 0.763 | 0.722 | 0.711 | 0.778 |
| | XGBoost | 0.804 | 0.818 | 0.804 | 0.802 | 0.860 |
| FM3 (Cosine) | Logistic regression | 0.734 | 0.743 | 0.734 | 0.732 | 0.804 |
| | Decision tree | 0.682 | 0.683 | 0.682 | 0.682 | 0.682 |
| | Random forest | 0.786 | 0.790 | 0.786 | 0.785 | 0.859 |
| | SVM | 0.736 | 0.752 | 0.736 | 0.732 | 0.801 |
| | XGBoost | 0.813 | 0.821 | 0.813 | 0.812 | 0.884 |
| FM4 (Overlap) | Logistic regression | 0.465 | 0.437 | 0.421 | 0.411 | 0.628 |
| | Decision tree | 0.388 | 0.370 | 0.371 | 0.369 | 0.529 |
| | Random forest | 0.467 | 0.434 | 0.400 | 0.371 | 0.620 |
| | SVM | 0.471 | 0.384 | 0.404 | 0.350 | 0.626 |
| | XGBoost | 0.481 | 0.451 | 0.423 | 0.404 | 0.629 |
| UDIPM | Logistic regression | 0.733 | 0.740 | 0.733 | 0.732 | 0.806 |
| | Decision tree | 0.662 | 0.663 | 0.662 | 0.662 | 0.662 |
| | Random forest | | | | | |
| | SVM | 0.734 | 0.743 | 0.734 | 0.732 | 0.800 |
| | XGBoost | | | | | |
Bold values denote the first and second-highest performance using the UDIPM.
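For reference, the Acc, Pre, Rec, and F1 definitions used in the evaluation can be computed from confusion-matrix counts as in the sketch below. It covers the binary case only. How the paper aggregates these metrics across several UD classes is not stated in this excerpt, so the sketch should not be read as the authors' evaluation code. The function name and the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for binary counts."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return {"Acc": acc, "Pre": pre, "Rec": rec, "F1": f1}


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    for name, value in classification_metrics(tp=40, fp=10, fn=20, tn=30).items():
        print(name, round(value, 3))
```

F1 is the harmonic mean of Pre and Rec, so in the binary case it always lies between the two.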
| Symbol | Description |
|---|---|
| | Order of |
| | Least common ancestor of |
| | Similarity of diagnosis code |
| | Similarity of diagnostic information of patients |
| | Patient similarity matrix based on diagnostic information |
| | Number of clusters |
| | Exemplar of cluster |
| | Core zone and the number of patients in |
| | Occurrence probability of the diagnosis code |
| | Average order of the typical diagnosis code |
| | New order of the typical diagnosis code |
| | Conditional co-occurrence matrix for all diseases in |
| | Information gain of feature |
| | Average error using |