Xuedong Li, Yue Wang, Dongwu Wang, Walter Yuan, Dezhong Peng, Qiaozhu Mei.
Abstract
BACKGROUND: Accurately recognizing rare diseases from symptom descriptions is an important task in patient triage, early risk stratification, and targeted therapies. However, by the very nature of rare diseases, the lack of historical data poses a great challenge to machine learning-based approaches. On the other hand, medical knowledge in automatically constructed knowledge graphs (KGs) has the potential to compensate for the lack of labeled training examples. This work aims to develop a rare disease classification algorithm that makes effective use of a knowledge graph, even when the graph is imperfect.
Keywords: Extremely imbalanced data; Knowledge graph; Machine learning; Rare disease diagnosis; Text classification
Year: 2019 PMID: 31801534 PMCID: PMC6894101 DOI: 10.1186/s12911-019-0938-1
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Corpora statistics
| | HaoDaiFu | ChinaRe |
|---|---|---|
| # of documents | 51,374 | 86,663 |
| # of classes (diseases) | 805 | 44 |
| Vocabulary size | 59,879 | 41,087 |
| Average # of words/doc | 26.7 | 29.7 |
| Average # of knowledge terms/doc | 10.8 | 4.0 |
A “knowledge term” is a term that appears in the medical knowledge graph (see the “Acquiring knowledge features from KG entities” section)
Fig. 1 Zipf’s plots of disease frequency in the two corpora. The x-axis is the disease frequency rank; the y-axis is the disease frequency (number of documents in the disease category). Common diseases appear on the left; rare diseases correspond to the long tail on the right. We annotate cutoff ranks above which the diseases are rarer than the specified percentage
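The ranking and percentage cutoffs described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `rank_and_bin` and the input format (one disease label per document) are assumptions, and the bin edges are simply the percentage bins used in the performance tables.

```python
from collections import Counter

# Upper edges of the percentage bins used in the performance tables, as fractions:
# (0, 0.02%], (0.02%, 0.05%], (0.05%, 0.1%], (0.1%, 0.5%], (0.5%, 1%]
BIN_EDGES = [0.0002, 0.0005, 0.001, 0.005, 0.01]


def rank_and_bin(labels):
    """Rank diseases by document count and assign each to a frequency bin.

    `labels` holds one disease label per document (hypothetical input format).
    Returns a list of (rank, disease, count, bin_index) tuples sorted from the
    most to the least frequent disease. `bin_index` is None for diseases that
    cover more than 1% of the corpus.
    """
    counts = Counter(labels)
    total = len(labels)
    rows = []
    for rank, (disease, count) in enumerate(counts.most_common(), start=1):
        share = count / total
        # Index of the first bin whose upper edge is at or above this share.
        bin_index = next(
            (i for i, edge in enumerate(BIN_EDGES) if share <= edge), None
        )
        rows.append((rank, disease, count, bin_index))
    return rows
```

Plotting `count` against `rank` on log-log axes would give a Zipf-style plot like the one the caption describes; the rank at which `bin_index` changes marks a cutoff.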
Fig. 2 An illustrative example of two disease entities and some of their attributes in a knowledge graph
Fig. 3 An illustrative example of using a knowledge graph to “emphasize” features (words) in a document. This is an ideal case, where the highlighted features are relevant to the diagnosis. In practice, all features that appear in the diagnosis-related part of the KG will be highlighted
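As a rough illustration of the highlighting step in this caption, the sketch below flags which tokens of a document also appear in a set of KG terms. It is an assumption-laden toy: the function names, the pre-tokenized input, and the plain set lookup are all hypothetical, and the sample terms are taken from the syringomyelia feature-weight table rather than from a real knowledge graph.

```python
def highlight_knowledge_terms(tokens, kg_terms):
    """Pair each token with a flag saying whether it is a knowledge term.

    `tokens` is a document already split into terms and `kg_terms` is a set of
    terms drawn from the diagnosis-related part of a knowledge graph (both are
    hypothetical inputs for this sketch).
    """
    return [(token, token in kg_terms) for token in tokens]


def knowledge_term_count(tokens, kg_terms):
    """Count the tokens in one document that are knowledge terms."""
    return sum(1 for token in tokens if token in kg_terms)


# Toy example using the terms listed in the syringomyelia feature-weight table.
doc = ["syrinx", "numb", "temperature sensation", "tremble"]
kg = {"syrinx", "temperature sensation"}
flags = highlight_knowledge_terms(doc, kg)
```

Averaging `knowledge_term_count` over all documents would give a statistic comparable to the "Average # of knowledge terms/doc" row in the corpora table, under the same assumptions.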
Rare disease classification performance on HaoDaiFu corpus
| | (0, 0.02%] | | (0.02%, 0.05%] | | (0.05%, 0.1%] | | (0.1%, 0.5%] | | (0.5%, 1%] | |
|---|---|---|---|---|---|---|---|---|---|---|
| | 89 diseases | | 277 diseases | | 205 diseases | | 194 diseases | | 32 diseases | |
| Method | F1 | MRR | F1 | MRR | F1 | MRR | F1 | MRR | F1 | MRR |
| BOW | 34.10 | 45.86 | 40.80 | 49.91 | 49.48 | 58.81 | ||||
| LSTM | 0.00 | 0.41 | 0.01 | 1.07 | 0.38 | 5.91 | 12.29 | 27.23 | 40.07 | 53.04 |
| UpSample | 35.17 ∗ | 47.10 ∗ | 40.69 | 50.43 ∗ | 47.63 | 57.63 | 49.85 | 59.75 | 58.6 | 68.95 |
| 34.04 | 46.75 ∗ | 40.81 ∗ | 50.66 ∗ | 49.15 | 58.53 | 51.74 | 61.38 | 61.55 | 74.05 | |
| BOW+ | 34.56 | 47.25 ∗ | 42.41 | 73.97 | ||||||
| KG1 | 33.66 | 44.98 | 38.25 | 47.45 | 45.17 | 53.97 | 48.07 | 57.55 | 59.21 | 71.29 |
| KG12 | 33.51 | 44.92 | 39.08 | 48.07 | 45.23 | 54.55 | 48.66 | 58.00 | 59.2 | 71.43 |
| BOW+KG | 31.91 | 42.81 | 37.51 | 46.08 | 44.08 | 53.22 | 47.01 | 56.94 | 55.91 | 69.47 |
| BOW+KG | 34.87 ∗ | 46.14 ∗ | 41.74 ∗ | 50.14 ∗ | 49.31 | 57.94 | 52.56 | 61.59 | 61.65 | |
| BOW+KG | 33.33 | 45.42 | 38.41 | 48.68 | 47.15 | 56.39 | 51.13 | 60.18 | 61.42 | 73.30 |
| BOW+KG | 52.86 | 61.90 | 61.90 | 73.57 | ||||||
| BOW+KG | 51.40 ∗ | 49.66 | 58.62 | 52.60 | 61.51 | 61.47 | 73.23 | |||
Higher F1 and MRR are better. Each column’s highest number is shown in boldface and the second highest is underlined. The left three percentage bins are rare disease bins; the right two bins are included for comparison. “∗” denotes results significantly higher than BOW (randomization test, significance level α=0.05)
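For readers unfamiliar with MRR (mean reciprocal rank), the sketch below shows the standard definition: the average, over documents, of one over the rank at which the true disease appears in the classifier's ranked predictions. The function name and input format are assumptions, and this section does not state the authors' exact evaluation protocol (for example, how ties or missing labels are handled), so treat it as illustrative only.

```python
def mean_reciprocal_rank(ranked_predictions, true_labels):
    """Standard MRR over a set of documents.

    `ranked_predictions[i]` is the list of candidate diseases for document i,
    ordered from most to least likely; `true_labels[i]` is its gold disease.
    A document whose gold label is absent from the ranking contributes 0
    (an assumption of this sketch).
    """
    if not true_labels:
        raise ValueError("at least one document is required")
    total = 0.0
    for ranking, truth in zip(ranked_predictions, true_labels):
        if truth in ranking:
            # list.index is 0-based, ranks are 1-based.
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(true_labels)
```

The tables report scores on a 0 to 100 scale, so a value from this function would be multiplied by 100 to be comparable.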
Rare disease classification performance on ChinaRe corpus
| | (0, 0.02%] | | (0.02%, 0.05%] | | (0.05%, 0.1%] | | (0.1%, 0.5%] | | (0.5%, 1%] | |
|---|---|---|---|---|---|---|---|---|---|---|
| | 5 diseases | | 3 diseases | | 2 diseases | | 7 diseases | | 9 diseases | |
| Method | F1 | MRR | F1 | MRR | F1 | MRR | F1 | MRR | F1 | MRR |
| BOW | 91.58 | 93.36 | 29.76 | 53.97 | 90.49 | 93.49 | 88.69 | 92.64 | 92.6 | 95.09 |
| LSTM | 0.00 | 4.03 | 0.00 | 4.75 | 0.00 | 9.64 | 22.38 | 44.68 | 85.86 | 93.55 |
| UpSample | 88.36 | 94.81 | 90.11 | 93.06 | 89.36 | 94.27 | 92.62 | 95.76 | ||
| 91.38 | 95.83 ∗ | 47.97 | 65.12 | 90.40 | 93.68 | |||||
| BOW+ | 97.55 ∗ | 42.14 ∗ | 62.80 ∗ | |||||||
| KG1 | 91.06 | 97.47 ∗ | 22.63 | 43.64 | 48.52 | 48.11 | 80.54 | 86.67 | 74.32 | 77.33 |
| KG12 | 92.26 ∗ | 31.20 | 43.91 | 85.61 | 91.42 | 83.71 | 87.96 | 80.05 | 83.18 | |
| BOW+KG | 75.68 | 82.49 | 34.86 | 52.08 | 83.20 | 87.84 | 78.79 | 85.57 | 88.34 | 91.86 |
| BOW+KG | 88.14 | 91.02 | 30.04 ∗ | 52.62 | 89.02 | 93.61 | 85.54 | 88.64 | 90.8 | 93.34 |
| BOW+KG | 89.01 | 95.41 ∗ | 29.76 | 48.8 | 68.63 | 70.80 | 86.18 | 89.65 | 86.89 | 86.21 |
| BOW+KG | 92.30 ∗ | 90.27 | 92.54 | 91.00 | 95.05 | 93.59 | 95.92 | |||
| BOW+KG | 97.13 | 47.78 | 62.04 | 90.70 | 94.49 | 93.46 | 95.70 | |||
See the footnote below Table 2 for details
Example feature weights of the rare disease syringomyelia
| Feature | BOW | BOW+KG |
|---|---|---|
| Syrinx | 1.19 | 1.34 |
| Temperature sensation | 0.52 | 0.82 |
| Numb | 0.76 | 0.45 |
| Tremble | 0.82 | 0.75 |
Our method BOW+KG learned to place larger weights on knowledge features (“syrinx” and “temperature sensation”) and smaller weights on non-knowledge features (“numb” and “tremble”)