Abstract
Wikipedia contains rich biomedical information that can support medical informatics studies and applications. Identifying the subset of Wikipedia articles that are medical has many benefits, such as facilitating medical knowledge extraction, serving as a corpus for language modeling, or simply reducing the data to a manageable size. However, because medical articles make up an extremely small fraction of Wikipedia, the articles identified by generic text classifiers would be bloated with irrelevant pages. To control the false discovery rate while maintaining high recall, we developed a mechanism that leverages the rich page elements and the connected nature of Wikipedia and uses a crawling classification strategy to achieve accurate classification. Structured assertional knowledge in the Infoboxes and Wikidata items associated with the identified medical articles was also extracted. This automatic mechanism is intended to run periodically so that the results can be updated and shared with the informatics community.
Keywords: Crawling classification; False discovery control; Text classification; Wikipedia
Year: 2020 PMID: 32693245 PMCID: PMC7357526 DOI: 10.1016/j.ijmedinf.2020.104234
Source DB: PubMed Journal: Int J Med Inform ISSN: 1386-5056 Impact factor: 4.046
Composition of UMLS matched articles.
| SemGroup | ANAT | CHEM | DEVI | DISO | LIVB | PHYS | PROC | NULL |
|---|---|---|---|---|---|---|---|---|
| Count | 3,111 | 10,849 | 343 | 6,799 | 5,805 | 817 | 1,289 | 11,843 |
Fig. 1. Elements of a Wikipedia article (title, main body, Infobox, section titles, categories, and links).
Fig. 2. The two-step classification workflow.
The number of medical articles identified by the proposed mechanism (Proposed), NaïveB, RM-TF-IDF, and TextCNN.
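| Method | ANAT | CHEM | DEVI | DISO | LIVB | PHYS | PROC | Total |
|---|---|---|---|---|---|---|---|---|
| Proposed | 6,863 | 35,026 | 1,502 | 14,145 | 28,524 | 2,948 | 4,412 | 93,420 |
| NaïveB | 12,544 | 62,524 | 18,764 | 18,191 | 261,697 | 16,680 | 15,068 | 405,468 |
| RM-TF-IDF | 9,058 | 46,293 | 1,911 | 18,274 | 40,841 | 2,899 | 4,610 | 123,886 |
| TextCNN | 10,719 | 55,095 | 1,806 | 33,586 | 52,317 | 4,909 | 4,862 | 163,294 |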
Precision (P), recall (R), and F-score (F) evaluated using the reversed articles with automatic labels.
| Metric | Method | ANAT | CHEM | DEVI | DISO | LIVB | PHYS | PROC | NULL |
|---|---|---|---|---|---|---|---|---|---|
| P | Proposed | 94.44% | 95.64% | 83.33% | 94.50% | 96.85% | 88.22% | 60.53% | 80.90% |
| P | NaïveB | 77.16% | 83.18% | 31.18% | 82.60% | 70.69% | 94.70% | 24.69% | 45.36% |
| P | RM-TF-IDF | 92.24% | 91.92% | 69.57% | 90.92% | 94.15% | 91.28% | 48.00% | 76.47% |
| P | TextCNN | 93.62% | 90.52% | 64.29% | 87.51% | 93.33% | 90.66% | 35.43% | 66.48% |
| R | Proposed | 90.63% | 92.16% | 49.30% | 86.62% | 71.59% | 97.62% | 46.62% | 64.92% |
| R | NaïveB | 86.75% | 91.60% | 40.85% | 77.68% | 95.16% | 82.18% | 40.54% | 53.23% |
| R | RM-TF-IDF | 86.43% | 91.88% | 22.54% | 85.02% | 91.71% | 95.86% | 32.43% | 62.90% |
| R | TextCNN | 80.61% | 91.32% | 12.68% | 83.03% | 91.80% | 95.29% | 30.41% | 47.98% |
| F | Proposed | 92.50% | 93.87% | | 90.39% | 82.32% | 92.68% | 52.67% | 72.04% |
| F | NaïveB | 81.67% | 87.19% | 35.37% | 80.06% | 81.12% | 88.00% | 30.69% | 48.98% |
| F | RM-TF-IDF | 89.24% | 91.90% | 34.04% | 87.87% | 92.91% | 93.52% | 38.71% | 69.03% |
| F | TextCNN | 86.63% | 90.92% | 21.18% | 85.21% | 92.56% | 92.91% | 32.73% | 55.74% |
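
The F-scores reported above are consistent with the harmonic mean of the corresponding precision and recall values. A minimal Python sketch, assuming the standard F1 definition (the helper name `f_score` is illustrative and does not come from the source), checks this for two precision/recall pairs taken from the table:

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition, assumed here)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the table above, in percent.
pairs = [(94.44, 90.63), (95.64, 92.16)]

for p, r in pairs:
    print(f"P={p:.2f}% R={r:.2f}% -> F={f_score(p, r):.2f}%")
```

Both results round to the F-scores the table reports for these pairs, 92.50% and 93.87%.
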
Accuracy of positive predictions, false discovery rate, the estimated number of medical articles with correct classification, and the estimated number of identified articles with wrong classifications by the proposed mechanism and the baselines.
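| Method | Accuracy of positive predictions | False discovery rate | Estimated correct | Estimated wrong |
|---|---|---|---|---|
| Proposed | | | 85,946 | 7,474 |
| NaïveB | 0.26 | 0.72 | | 300,046 |
| RM-TF-IDF | 0.69 | 0.26 | 85,481 | 38,405 |
| TextCNN | 0.53 | 0.39 | 86,546 | 76,748 |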
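
The estimated counts in this table appear to follow from scaling the total number of identified articles by the accuracy of positive predictions. The sketch below assumes that relation, which is inferred from the reported figures and is not stated in the source; `estimate_counts` is a hypothetical helper, and the totals 123,886 and 163,294 are taken from the table of identified articles.

```python
def estimate_counts(total_identified: int, accuracy: float) -> tuple:
    """Split a total into estimated correct and wrong positives.

    Assumes correct = total * accuracy (rounded) and wrong = total - correct.
    This relation is an inference from the reported numbers, not the
    authors' stated procedure.
    """
    correct = round(total_identified * accuracy)
    wrong = total_identified - correct
    return correct, wrong


# Totals and accuracies copied from the tables above.
print(estimate_counts(123886, 0.69))
print(estimate_counts(163294, 0.53))
```

Under this assumption the two calls give (85481, 38405) and (86546, 76748), matching the rows that report both an accuracy and both counts.
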
Fig. 3. The number of Wikidata items identifiable using each method.
Fig. 4. Number of concepts that each relation covered that were unique in the UMLS, unique in Infobox/Wikidata, or common in both.
Relation name mapping for May cause, Caused by, Treatment, Differential diagnosis, and Site. Words in the table show the relation names used by each source; the meanings of Wikidata properties are in parentheses; ‘NA’: unavailable.
| Relation | UMLS relation names | Infobox relation names | Wikidata relation names |
|---|---|---|---|
| May cause | has_manifestation, has_definitional_manifestation | Symptoms, Complications | P780 (symptoms), P1542 (has effect) |
| Caused by | has_causative_agent, cause_of | Causes, Risk factors | P828 (has cause), P5642 (risk factor) |
| Treatment | may_be_treated_by | Treatment, Medication | P2176 (drug used for treatment), P924 (possible treatment) |
| Differential diagnosis | ddx | Differential diagnosis | NA |
| Site | disease_has_primary_anatomic_site, disease_has_associated_anatomic_site | NA | P689 (afflicts), P927 (anatomical location) |
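
For readers who want to use this mapping programmatically, one possible encoding is a plain dictionary keyed by the unified relation name. This is only a sketch: the names `RELATION_MAP` and `sources_covering` are hypothetical, and an empty list stands in for the table's ‘NA’ entries.

```python
# Relation name mapping transcribed from the table above.
# An empty list represents 'NA' (the source has no matching relation).
RELATION_MAP = {
    "May cause": {
        "umls": ["has_manifestation", "has_definitional_manifestation"],
        "infobox": ["Symptoms", "Complications"],
        "wikidata": ["P780", "P1542"],
    },
    "Caused by": {
        "umls": ["has_causative_agent", "cause_of"],
        "infobox": ["Causes", "Risk factors"],
        "wikidata": ["P828", "P5642"],
    },
    "Treatment": {
        "umls": ["may_be_treated_by"],
        "infobox": ["Treatment", "Medication"],
        "wikidata": ["P2176", "P924"],
    },
    "Differential diagnosis": {
        "umls": ["ddx"],
        "infobox": ["Differential diagnosis"],
        "wikidata": [],
    },
    "Site": {
        "umls": [
            "disease_has_primary_anatomic_site",
            "disease_has_associated_anatomic_site",
        ],
        "infobox": [],
        "wikidata": ["P689", "P927"],
    },
}


def sources_covering(relation: str) -> list:
    """Return the sources that define at least one name for the relation."""
    return [src for src, names in RELATION_MAP[relation].items() if names]


print(sources_covering("Differential diagnosis"))
print(sources_covering("Site"))
```

Keeping the three sources side by side under one key makes it straightforward to compare which concepts each source contributes for a given relation.
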