He Xu1,2,3,4,5, Qunli Zheng1, Jingshu Zhu1, Zuoling Xie6, Haitao Cheng1,2,3,4,5, Peng Li1,2,3,4,5, Yimu Ji1,3,4,5.
Abstract
Deep learning methods have become very effective for a range of disease prediction tasks and can even surpass human experts. However, their lack of interpretability and of embedded medical expertise limits their clinical application. This paper combines knowledge representation learning with deep learning to construct a disease prediction model. The model first builds a relationship graph between physical examination indicators and their test values, based on the normal range of each indicator. A knowledge representation learning model then encodes the indicators and test values. Next, each patient's physical examination data is represented as a vector and fed into a deep learning model built from a self-attention mechanism and a convolutional neural network to perform disease prediction. Experimental results show that, for diabetes prediction, the model achieves an accuracy of 97.18% and a recall of 87.55%, outperforming other machine learning methods (e.g., lasso, ridge, support vector machine, random forest, and XGBoost). Compared with the best-performing baseline, random forest, recall increases by 5.34%. It can therefore be concluded that injecting medical knowledge into deep learning through knowledge representation learning can support diabetes prediction for early detection and assisted diagnosis.
Year: 2022 PMID: 35990251 PMCID: PMC9391170 DOI: 10.1155/2022/7593750
Source DB: PubMed Journal: Dis Markers ISSN: 0278-0240 Impact factor: 3.464
Figure 1. The models of different knowledge representation learning.
Figure 2. Physical examination indicator entity and test value entity to vector.
Figure 3. TH-SAC: architecture diagram of a disease prediction model integrating knowledge representation and deep learning.
Figure 4. Self-attention mechanism.
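
As a rough illustration of the self-attention mechanism named in Figure 4, the following is a minimal sketch of standard scaled dot-product self-attention in plain Python. It is an assumption that the paper uses this standard form; the sketch also omits the learned query, key, and value projections (it sets Q = K = V = x), and the function names `softmax` and `self_attention` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    x is a list of n vectors, each a list of d floats.
    Returns (outputs, weights); weights is an n x n matrix whose rows sum to 1.
    Learned projection matrices are deliberately left out of this sketch.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    weights = []
    for q in x:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights
```

Under this formulation, each row of the weight matrix is a probability distribution over the input positions, which is what a heat map of attention weights over indicators would display.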
The reference range of detection value of some physical examination indicators.
| Medical examination indicator | Reference range |
|---|---|
| Serum alanine aminotransferase | 9–50 IU/L |
| Serum aspartate aminotransferase | 15–40 IU/L |
| Albumin | 40.0–55.0 g/L |
| Total bilirubin | 2.0–20.0 |
| Blood urea nitrogen | 3.6–9.5 mmol/L |
| Total cholesterol | 2.86–6.10 mmol/L |
| Triglycerides | 0.45–1.81 mmol/L |
| Low-density lipoprotein | 0.00–3.37 mmol/L |
| High-density lipoprotein | 1.16–1.42 mmol/L |
| …… | …… |
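
To make the idea of relating an indicator to a test value via its reference range concrete, here is a minimal sketch. It uses only a coarse three-way status (low, normal, high); the thresholds that separate finer grades such as slightly, generally, and severely abnormal are not given in this record, so they are not modelled. The names `REFERENCE_RANGES`, `range_status`, and `to_triple` are hypothetical, and the (indicator, relation, value) ordering of the triple is an assumption.

```python
# Reference ranges copied from the table above (three rows only).
REFERENCE_RANGES = {
    "Albumin": (40.0, 55.0),            # g/L
    "Total cholesterol": (2.86, 6.10),  # mmol/L
    "Triglycerides": (0.45, 1.81),      # mmol/L
}

def range_status(indicator, value):
    """Return 'low', 'normal', or 'high' relative to the reference range."""
    low, high = REFERENCE_RANGES[indicator]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def to_triple(indicator, value):
    """Build an (indicator, relation, value) triple for a knowledge graph.

    The triple layout is an illustrative assumption, not the paper's schema.
    """
    return (indicator, range_status(indicator, value), value)
```

For example, `to_triple("Triglycerides", 1.62)` gives `("Triglycerides", "normal", 1.62)`, since 1.62 mmol/L lies inside 0.45–1.81 mmol/L.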
Entity type and quantity.
| Type of entity | Example | Number of entities |
|---|---|---|
| Medical examination indicator | Triglycerides | 16 |
| Detection value | 1.62 mmol/L | 5499 |
| Abnormally high | <HIGHEST> | 1 |
| Abnormally low | <LOWEST> | 1 |
| Unknown | <UNK> | 1 |
Relationship type and quantity.
| Relationship type | Number of relationships |
|---|---|
| Severely low | 337 |
| Generally low | 343 |
| Slightly low | 457 |
| Normal | 1558 |
| Slightly high | 2663 |
| Generally high | 2005 |
| Severely high | 2017 |
The distribution of physical examination dataset.
| Disease label | Training set | Test set |
|---|---|---|
| Diabetes | 3815 | 954 |
| Nondiabetes | 35294 | 8824 |
| Total | 39109 | 9778 |
Model parameter settings.
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Batch_size | 32 |
| Epoch | 100 |
| Dropout | 0.5 |
| Learning rate | 0.0002 |
| Entity vector dimension of physical examination data | 256 |
| Size of convolution filter window | 2, 3, 4 |
| Number of convolution filters per window size | 100 |
| Number of layers of self-attention | 2 |
MR and Hit@10 of different knowledge representation model.
| Model | MR | Hit@10 (%) |
|---|---|---|
| TransE | 623.0 | 44.9 |
| TransH | 711.6 | 47.9 |
| TransR | 897.8 | 19.0 |
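
MR (mean rank) and Hit@10 are standard link-prediction metrics: each test triple's correct entity is ranked among candidates by score, MR averages those ranks (lower is better), and Hit@10 is the percentage of ranks that fall in the top 10 (higher is better). The sketch below follows those usual definitions; the evaluation protocol used for the table above (for example, raw versus filtered ranking) is not stated here, and the function names are hypothetical.

```python
def rank_of_true(scores, true_entity):
    """Rank (1-based) of the true entity when lower score means better.

    scores maps candidate entity -> distance-style score.
    Ties are resolved optimistically: only strictly better candidates count.
    """
    true_score = scores[true_entity]
    return 1 + sum(1 for s in scores.values() if s < true_score)

def mean_rank(ranks):
    """Average rank of the correct entity over all test triples."""
    return sum(ranks) / len(ranks)

def hits_at_k(ranks, k=10):
    """Percentage of test triples whose correct entity ranks in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)
```

With ranks `[1, 3, 12, 250]`, `mean_rank` returns 66.5 and `hits_at_k` returns 50.0, which shows how a few very poor ranks can inflate MR without changing Hit@10.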
Accuracy and recall rates of different knowledge representation models.
| Model | Accuracy (%) | Recall (%) | F1 |
|---|---|---|---|
| TransE-SAC | 97.11 | 87.16 | 0.8300 |
| TH-SAC | 97.18 | 87.55 | 0.8351 |
| TransR-SAC | 97.03 | 86.89 | 0.8295 |
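
The prefixes in the table above refer to the translation-based embedding models TransE, TransH (TH), and TransR. As a hedged sketch of how two of them score a triple (h, r, t): TransE measures the distance between h + r and t, while TransH first projects h and t onto a relation-specific hyperplane with normal vector w_r and then applies a translation d_r within that hyperplane. The L2 norm is an assumption (L1 is also common), training details are omitted, and the function names are hypothetical.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def transe_score(h, r, t):
    """TransE distance ||h + r - t||_2; lower means more plausible."""
    return _norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)])

def _project(v, w_unit):
    """Project v onto the hyperplane whose unit normal is w_unit."""
    s = _dot(w_unit, v)
    return [vi - s * wi for vi, wi in zip(v, w_unit)]

def transh_score(h, d_r, w_r, t):
    """TransH distance ||h_perp + d_r - t_perp||_2; lower means more plausible.

    w_r is the hyperplane normal (normalised here), d_r the translation
    vector lying in that hyperplane.
    """
    n = _norm(w_r)
    w_unit = [x / n for x in w_r]
    h_perp = _project(h, w_unit)
    t_perp = _project(t, w_unit)
    return _norm([a + b - c for a, b, c in zip(h_perp, d_r, t_perp)])
```

For h = [1, 2, 0], t = [2, 2, 5], d_r = [1, 0, 0], and w_r = [0, 0, 1], TransE gives a distance of 5.0 while TransH gives 0.0, because the projection discards the component along w_r in which h and t differ.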
Accuracy and recall rates of different diabetes prediction models.
| Model | Accuracy (%) | Recall (%) | F1 |
|---|---|---|---|
| LR | 90.29 | 49.90 | 0.1731 |
| SVM | 90.59 | 51.60 | 0.0631 |
| NB | 87.48 | 53.94 | 0.1604 |
| RF | 96.37 | 82.11 | 0.8295 |
| XGBoost | 92.42 | 61.64 | 0.3746 |
| DNN | 90.21 | 58.62 | 0.1373 |
| TH-SA | 96.05 | 86.15 | 0.7892 |
| TH-CNN | 96.26 | 87.03 | 0.8307 |
| TH-SAC | 97.18 | 87.55 | 0.8351 |
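
For reference, the sketch below computes accuracy, recall, precision, and F1 from binary confusion-matrix counts using the usual definitions. Whether the recall and F1 values in the table above are per-class or averaged across classes is not stated in this record, so the sketch shows only the positive-class form; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, recall, precision, and F1 for the positive class.

    tp/fp/fn/tn are true-positive, false-positive, false-negative,
    and true-negative counts. Undefined ratios fall back to 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}
```

On an imbalanced dataset such as this one, a classifier can reach high accuracy while missing most positive cases, which is why recall and F1 are reported alongside accuracy.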
Distribution of sampled physical examination dataset.
| Disease label | Training set | Test set |
|---|---|---|
| Diabetes | 35294 | 8824 |
| Nondiabetes | 35294 | 8824 |
| Total | 70588 | 17648 |
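
The table above reports only the class counts after resampling; the resampling method itself is not described in this record. One simple way to arrive at equal class sizes is random oversampling of the minority class with replacement, sketched below as an assumption rather than as the authors' procedure. The function name `oversample_minority` is hypothetical.

```python
import random

def oversample_minority(minority, majority, seed=0):
    """Randomly duplicate minority samples until both classes are equal in size.

    Sampling is with replacement and seeded for reproducibility.
    If the minority is already at least as large, it is returned unchanged.
    """
    rng = random.Random(seed)
    deficit = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(deficit)]
    return minority + extra, majority
```

Applied to 3815 diabetes and 35294 nondiabetes training records, this yields 35294 records per class and 70588 in total, matching the training-set totals in the table.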
Figure 5. Comparison of accuracy of each model before and after resampling.
Figure 6. Comparison of F1 values for each model before and after resampling.
Accuracy and recall rates of different diabetes prediction models.
| Model | Accuracy (%) | Recall (%) | F1 |
|---|---|---|---|
| Embedding-SAC | 96.98 | 86.32 | 0.8253 |
| TH-SAC | 97.18 | 87.55 | 0.8351 |
Figure 7. Training loss values of the original dataset.
Figure 8. Test loss values of the original dataset.
Figure 9. Training loss values of the resampled dataset.
Figure 10. Test loss values of the resampled dataset.
Figure 11. Accuracy of representation vectors in different dimensions.
Figure 12. Recall of representation vectors in different dimensions.
Figure 13. Visualization of the weight of the self-attentive layer.
The numerical values of the weight of the self-attentive layer.
| | BMI | ALT | AST | ALB | TBIL | Cr | BUN | CHOL | TG | LDL | HDL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BMI | 0.044682 | 0.0063637 | 0.082328 | 0.043529 | 0.0024994 | 0.0017869 | 0.12636 | 0.18994 | 0.50183 | 0.00024223 | 0.00043629 |
| ALT | 4.254 | 0.95567 | 0.013733 | 0.0014261 | 0.012931 | 0.014677 | 8.8271 | 0.0003062 | 6.7363 | 0.00067861 | 0.00044125 |
| AST | 0.0024338 | 0.27391 | 0.57118 | 0.036222 | 0.0088331 | 0.045473 | 0.0044836 | 0.00071565 | 1.6739 | 0.025368 | 0.031365 |
| ALB | 0.001296 | 0.06182 | 0.17392 | 0.70461 | 0.16075 | 0.013515 | 0.0104 | 0.0032933 | 0.00039018 | 0.014664 | 0.011861 |
| TBIL | 1.9193 | 2.5626 | 4.1367 | 2.6728 | 0.25978 | 0.74022 | 2.3709 | 8.6627 | 4.7918 | 5.7109 | 1.6839 |
| Cr | 8.3536 | 9.8131 | 5.6186 | 9.2363 | 2.0324 | 1.0 | 9.2502 | 8.219 | 4.5131 | 3.925 | 4.0831 |
| BUN | 2.4488 | 3.7696 | 3.0095 | 0.000663 | 6.1397 | 2.97775 | 0.99901 | 3.3287 | 3.4574 | 5.81 | 8.1024 |
| CHOL | 4.7872 | 2.9578 | 2.7738 | 7.6755 | 8.7449 | 1.0327 | 1.8206 | 0.99996 | 1.4392 | 4.8578 | 3.181 |
| TG | 3.7574 | 7.568 | 5.1691 | 1.1795 | 2.7954 | 0.023429 | 2.0787 | 0.00045578 | 0.97609 | 4.4345 | 1.37772 |
| LDL | 0.0024032 | 0.018078 | 0.001621 | 0.0015858 | 0.077337 | 0.0070292 | 0.00026765 | 0.00030051 | 9.7969 | 0.66945 | 0.22183 |
| HDL | 0.0038618 | 0.0082498 | 0.0018781 | 0.0011091 | 0.011362 | 0.00010898 | 0.00043675 | 0.00049594 | 8.7666 | 0.34158 | 0.63083 |