Zheyu Wang, Haoce Huang, Liping Cui, Juan Chen, Jiye An, Huilong Duan, Huiqing Ge, Ning Deng.
Abstract
BACKGROUND: Health education has emerged as an important intervention for improving the awareness and self-management abilities of patients with chronic diseases. The development of information technologies has changed the form of patient educational materials from traditional paper materials to electronic materials. The volume of patient educational materials on the internet is now tremendous and their quality varies, which makes it hard for individuals without a medical background to identify the most valuable materials.
Keywords: chronic disease; health education; natural language processing; ontology; recommender system
Year: 2020 PMID: 32324148 PMCID: PMC7206519 DOI: 10.2196/17642
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. Overall study design.
Figure 2. Chronic Disease Patient Education Ontology construction steps.
Figure 3. Recommendation generation steps.
Summary of the three strategies for improving keyword extraction performance.
| Strategy | Description | Effect |
| Weight assignment | Assign weights of 3, 1.2, and 0.8 to title words, nouns, and verbs, respectively, when performing keyword extraction. | Nouns and title words are more likely to be keywords, and verbs are less likely to be keywords. |
| Compound word identification | Use several filter conditions to generate a user-defined dictionary of compound words in the educational materials for word segmentation. | Compound words that meet the filter conditions are identified and are more likely than atomic words to be keywords. |
| Synonym elimination | Remove shorter keywords with similar Chinese characters based on the cosine similarity between their character compositions. | Two or more keyword candidates with similar character composition are merged into one keyword to avoid redundancy. |
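The synonym elimination strategy in the table can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `char_cosine` and `eliminate_synonyms`, the 0.7 threshold, and the assumption that candidates arrive sorted by score are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def char_cosine(a: str, b: str) -> float:
    """Cosine similarity between the character-count vectors of two strings."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def eliminate_synonyms(candidates: list[str], threshold: float = 0.7) -> list[str]:
    """Merge candidates with similar character composition, keeping the longer one.

    Assumes `candidates` is ordered by keyword score, best first.
    The threshold value is an illustrative assumption, not taken from the source.
    """
    kept: list[str] = []
    for word in candidates:
        merged = False
        for i, existing in enumerate(kept):
            if char_cosine(word, existing) >= threshold:
                # Keep the longer form in the slot of the higher-ranked candidate.
                if len(word) > len(existing):
                    kept[i] = word
                merged = True
                break
        if not merged:
            kept.append(word)
    return kept


if __name__ == "__main__":
    print(eliminate_synonyms(["高血压", "高血压病", "血糖"]))
```

Here the two candidates that share three of their characters are merged and the shorter one is dropped, while the unrelated candidate is kept as a separate keyword.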
Figure 4. Concrete calculation process of the text vector.
Figure 5. Inner product of the patient vector and text vector.
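As a rough illustration of how an inner product between a patient vector and text vectors could drive ranking, the following Python sketch scores each material and returns the highest-scoring ones. The three-dimensional toy vectors, the material identifiers, and the function names `inner_product` and `rank_materials` are assumptions for illustration only; the source does not specify how the vectors are populated.

```python
def inner_product(patient_vec: list[float], text_vec: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(patient_vec) != len(text_vec):
        raise ValueError("vectors must have the same length")
    return sum(p * t for p, t in zip(patient_vec, text_vec))


def rank_materials(
    patient_vec: list[float],
    text_vecs: dict[str, list[float]],
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Score every material against the patient and return the top_n by score."""
    scores = {
        material_id: inner_product(patient_vec, vec)
        for material_id, vec in text_vecs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Hypothetical toy data: each dimension stands for one ontology item.
    patient = [1.0, 0.0, 1.0]
    materials = {
        "material_a": [0.5, 0.9, 0.0],
        "material_b": [0.2, 0.0, 0.8],
        "material_c": [0.0, 1.0, 0.0],
    }
    for material_id, score in rank_materials(patient, materials, top_n=2):
        print(material_id, round(score, 3))
```

A material scores highly only when it has weight on the same dimensions as the patient, which is why `material_c` is ranked last in this toy example.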
Figure 6. Evaluation metrics of keyword extraction performance.
Figure 7. Evaluation metrics of recommendation performance.
Patient characteristics from the collected data (n=50).
| Patient characteristics | Value, n (%) unless otherwise noted |
| Female | 23 (46) |
| Male | 27 (54) |
| Age in years, mean (SD) | 57 (0.57) |
| Body mass index: normal [a] | 16 (32) |
| Body mass index: overweight | 34 (68) |
| Pregnant | 0 (0) |
| Nonpregnant | 50 (100) |
| Hypertension | 50 (100) |
| Diabetes | 6 (12) |
| Stroke | 4 (8) |
| Hyperlipidemia | 12 (24) |
| Coronary artery disease | 3 (6) |
| Chronic obstructive pulmonary disease | 2 (4) |
| Other diseases | 17 (34) |
| Blood glucose (normal) [b] | 36 (72) |
| Total cholesterol (normal) [c] | 36 (72) |
| Triglyceride (normal) [d] | 29 (58) |
| High density lipoprotein (normal) [e] | 43 (86) |
| Low density lipoprotein (normal) [f] | 40 (80) |
| Uric acid (normal) [g] | 39 (78) |
| Blood pressure: normal [h] | 23 (46) |
| Blood pressure: abnormal | 27 (54) |
| Smoking | 7 (14) |
| Drinking | 9 (18) |
| Good | 19 (38) |
| Medium | 27 (54) |
| Poor | 4 (8) |
| Antihypertensive drugs | 50 (100) |
| Hypoglycemic drugs | 3 (6) |
| Hypolipidemic drugs | 12 (24) |
| Minimal depression | 33 (66) |
| Mild depression | 12 (24) |
| Moderate depression | 3 (6) |
| Moderately severe depression | 2 (4) |
| Severe depression | 0 (0) |
| High physical activity level | 18 (36) |
| Moderate physical activity level | 23 (46) |
| Low physical activity level | 9 (18) |
[a] Reference range of body mass index: 18.5-23.9 kg/m² for Chinese patients.
[b] Reference range of blood glucose: 3.9-6.1 mmol/L.
[c] Reference range of total cholesterol: 2.9-5.2 mmol/L.
[d] Reference range of triglyceride: 0.56-1.70 mmol/L.
[e] Reference range of high density lipoprotein: 1.20-1.68 mmol/L.
[f] Reference range of low density lipoprotein: 2.07-3.12 mmol/L.
[g] Reference range of uric acid: 149-416 μmol/L (for men under 60), 89-357 μmol/L (for women under 60), 250-476 μmol/L (for men over 60), 190-434 μmol/L (for women over 60).
[h] Reference range of blood pressure: 90-119 mm Hg for systolic BP, 60-79 mm Hg for diastolic BP.
Overview of the entire corpus and the test collection.
| Corpus | Number | Total word count | Word count, mean (SD) | Unique word count |
| Entire corpus | 88,746 | 40,797,062 | 490 (387) | 270,591 |
| Test collection | 100 | 71,905 | 719 (462) | 10,707 |
Figure 8. Topic distribution of the test collection.
Figure 9. Class diagram of the Chronic Disease Patient Education Ontology’s main core.
Figure 10. Word2Vec embedding visualization of the 33 ontology vector items.
Figure 11. Complete scenario for the recommendation generation process.
Figure 12. Structure of the system.
Results for automatic keyword extraction using different algorithms.
| Method | Automatic extraction, total | Automatic extraction, mean | Correct keywords, total | Correct keywords, mean | Precision (%) |
| Improved TextRank | 500 | 5 | 266 | 2.66 | 53.2 |
| Original TextRank | 500 | 5 | 151 | 1.51 | 30.2 |
| Improved TF-IDF [a] | 500 | 5 | 133 | 1.33 | 26.6 |
| Original TF-IDF | 500 | 5 | 206 | 2.06 | 41.2 |
[a] TF-IDF: term frequency–inverse document frequency.
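The precision column follows directly from the counts in the table: correct keywords divided by automatically extracted keywords. A short Python check, using only the table's numbers (the helper name `precision` is an illustrative choice), reproduces the reported percentages:

```python
def precision(correct: int, extracted: int) -> float:
    """Fraction of automatically extracted keywords that were judged correct."""
    if extracted <= 0:
        raise ValueError("extracted must be positive")
    return correct / extracted


# (correct keywords, automatically extracted keywords) as reported in the table.
counts = {
    "Improved TextRank": (266, 500),
    "Original TextRank": (151, 500),
    "Improved TF-IDF": (133, 500),
    "Original TF-IDF": (206, 500),
}

if __name__ == "__main__":
    for method, (correct, extracted) in counts.items():
        print(f"{method}: {precision(correct, extracted):.1%}")
```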
Performance comparison among different keyword extraction algorithms for n=50 evaluations (patients).
| Method | MAP [a] | Macro P@1 | Macro P@2 | Macro P@3 | Macro P@4 | Macro P@5 | Macro P@10 | Macro P@15 | Macro P@20 | Macro P@25 | Macro P@30 |
| Improved TextRank | 0.622 | 0.810 | 0.730 | 0.713 | 0.705 | 0.710 | 0.717 | 0.701 | 0.688 | 0.673 | 0.645 |
| Original TextRank | 0.585 | 0.610 | 0.600 | 0.650 | 0.648 | 0.642 | 0.661 | 0.665 | 0.662 | 0.646 | 0.641 |
| Improved TF-IDF [b] | 0.620 | 0.970 | 0.920 | 0.880 | 0.845 | 0.822 | 0.741 | 0.715 | 0.677 | 0.651 | 0.632 |
| Original TF-IDF | 0.628 | 0.930 | 0.875 | 0.867 | 0.853 | 0.836 | 0.772 | 0.723 | 0.680 | 0.660 | 0.634 |
| Manual Annotation | 0.635 | 0.650 | 0.720 | 0.740 | 0.753 | 0.740 | 0.726 | 0.707 | 0.697 | 0.681 | 0.660 |
[a] MAP: mean average precision.
[b] TF-IDF: term frequency–inverse document frequency.
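For readers unfamiliar with the metrics in this table, the sketch below shows one common way to compute precision at rank k, macro precision across patients, and mean average precision from binary relevance judgments. The exact definitions used in the study are not given here, so this is an assumption: in particular, average precision is normalized by the number of relevant items in the ranked list, and all function names and sample judgments are hypothetical.

```python
def precision_at_k(relevance: list[int], k: int) -> float:
    """Share of relevant items among the top k recommendations (1 = relevant)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def average_precision(relevance: list[int]) -> float:
    """Mean of the precision values taken at each rank holding a relevant item."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def macro_precision_at_k(all_relevance: list[list[int]], k: int) -> float:
    """Average P@k over patients, each patient weighted equally."""
    return sum(precision_at_k(r, k) for r in all_relevance) / len(all_relevance)


def mean_average_precision(all_relevance: list[list[int]]) -> float:
    """Average of per-patient average precision."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)


if __name__ == "__main__":
    # Hypothetical relevance judgments for two patients' ranked recommendations.
    judgments = [[1, 0, 1, 1, 0], [0, 1, 0, 0, 0]]
    print("Macro P@2:", macro_precision_at_k(judgments, 2))
    print("MAP:", round(mean_average_precision(judgments), 3))
```

Macro averaging gives each patient equal weight regardless of how many relevant materials that patient has, which matches the per-patient (n=50) framing of the table.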
Figure 13. Macro precisions at ranks 1 to 30 of different keyword extraction algorithms for n=50 evaluations (patients).