Patrick Ernst, Amy Siu, Gerhard Weikum.
Abstract
BACKGROUND: Biomedical knowledge bases (KBs) have become important assets in life sciences. Prior work on KB construction has three major limitations. First, most biomedical KBs are manually built and curated, and cannot keep up with the rate at which new findings are published. Second, for automatic information extraction (IE), the text genre of choice has been scientific publications, neglecting sources like health portals and online communities. Third, most prior work on IE has focused on the molecular level or chemogenomics only, like protein-protein interactions or gene-drug relationships, or solely addresses highly specific topics such as drug effects.
Year: 2015 PMID: 25971816 PMCID: PMC4448285 DOI: 10.1186/s12859-015-0549-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of the KnowLife KB and processing pipeline.
KnowLife relations, their type signatures, and number of seeds
| Relation | Left-hand type | Right-hand type | Seeds |
|---|---|---|---|
| Affects | Disease | Organ | 23 |
| Aggravates | Ecofactor | Disease | 21 |
| Alleviates | Drug | Disease | 18 |
| Causes | Disease | Disease | 70 |
| ComplicationOf | Disease | Disease | 5 |
| Contraindicates | Drug | Disease | 26 |
| CreatesRisk | Ecofactor | Disease | 103 |
| Diagnoses | Device | Disease | 29 |
| Interacts | Drug | Drug | 9 |
| IsSymptom | Symptom or Disease | Disease | 69 |
| ReducesRisk | Drug or Behavior | Disease | 24 |
| SideEffect | Symptom or Disease | Drug | 12 |
| Treats | Drug | Disease | 58 |
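The type signatures in this table can be read as constraints on candidate facts. The following is a minimal sketch of such a check, assuming the two type columns are the left-hand and right-hand argument types of each relation. The names `SIGNATURES` and `fits_signature` are hypothetical and not taken from KnowLife itself.

```python
# Relation signatures copied from the table above.
# Each entry maps a relation to (allowed left-hand types, allowed right-hand types).
# Assumption: "Symptom or Disease" means either type is accepted.
SIGNATURES = {
    "Affects": ({"Disease"}, {"Organ"}),
    "Aggravates": ({"Ecofactor"}, {"Disease"}),
    "Alleviates": ({"Drug"}, {"Disease"}),
    "Causes": ({"Disease"}, {"Disease"}),
    "ComplicationOf": ({"Disease"}, {"Disease"}),
    "Contraindicates": ({"Drug"}, {"Disease"}),
    "CreatesRisk": ({"Ecofactor"}, {"Disease"}),
    "Diagnoses": ({"Device"}, {"Disease"}),
    "Interacts": ({"Drug"}, {"Drug"}),
    "IsSymptom": ({"Symptom", "Disease"}, {"Disease"}),
    "ReducesRisk": ({"Drug", "Behavior"}, {"Disease"}),
    "SideEffect": ({"Symptom", "Disease"}, {"Drug"}),
    "Treats": ({"Drug"}, {"Disease"}),
}


def fits_signature(relation, left_type, right_type):
    """Return True if the argument types match the relation's signature."""
    left_allowed, right_allowed = SIGNATURES[relation]
    return left_type in left_allowed and right_type in right_allowed


# A drug treating a disease matches the signature; the swapped order does not.
print(fits_signature("Treats", "Drug", "Disease"))
print(fits_signature("Treats", "Disease", "Drug"))
```

A check of this kind would reject candidates whose arguments appear in the wrong order, which is one of the error classes listed in the error analysis.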
Overview of KnowLife’s input corpus
| Genre | Source | Documents | Sentences |
|---|---|---|---|
| Scientific Publications | PubMed Medline | 580,892 | 5,875,006 |
| | PubMed Central | 12,532 | 2,765,580 |
| Encyclopedic Articles | Drugs.com | 31,837 | 7,586,236 |
| | Mayo Clinic | 2,166 | 570,325 |
| | Medline Plus | 3,076 | 197,055 |
| | RxList | 2,515 | 1,102,791 |
| | Wikipedia Health | 20,893 | 787,148 |
| Social Sources | Healthboards.com | 752,778 | 37,270,371 |
| | Patient.co.uk | 44,610 | 1,081,420 |
Figure 2. Pattern gathering in KnowLife. (a) Sentence-level pattern: Dependency graph of a sentence with recognized entities anemia and sarcoidosis. By computing the shortest path (bold lines) between the two entities, the word sequence "symptom of" is extracted. This sequence is extended by an adjectival modifier (amod), which results in the extracted pattern "common symptom of". (b) Document-structure pattern: The entity Diclofenac is found within the document title and Belching within an
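The sentence-level step in panel (a) can be illustrated with a small sketch: find the shortest path between the two entities in the dependency graph, take the words strictly between them, and add any adjectival modifiers (amod) attached to those words. The example sentence, its hand-written dependency edges, and the function names are assumptions made for illustration. A real pipeline would obtain the edges from a dependency parser.

```python
from collections import deque

# Hypothetical sentence; tokens are addressed by their position.
tokens = ["Anemia", "is", "a", "common", "symptom", "of", "sarcoidosis"]

# Hand-written dependency edges as (head index, dependent index, label).
edges = [
    (4, 0, "nsubj"),
    (4, 1, "cop"),
    (4, 2, "det"),
    (4, 3, "amod"),
    (4, 5, "prep"),
    (5, 6, "pobj"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search over the undirected dependency graph."""
    neighbors = {}
    for head, dep, _ in edges:
        neighbors.setdefault(head, []).append(dep)
        neighbors.setdefault(dep, []).append(head)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in neighbors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    if goal not in previous:
        return []
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


def extract_pattern(tokens, edges, left, right):
    """Words on the shortest path between two entities, plus amod modifiers."""
    path = shortest_path(edges, left, right)
    inner = set(path[1:-1])  # drop the two entity tokens themselves
    for head, dep, label in edges:
        if label == "amod" and head in inner:
            inner.add(dep)
    return " ".join(tokens[i] for i in sorted(inner))


print(extract_pattern(tokens, edges, 0, 6))  # common symptom of
```

Sorting the selected token positions restores sentence order, so the modifier ends up in front of the noun it attaches to.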
Examples of seed facts and seed patterns as well as automatically acquired patterns and facts
| Seed pattern | Relation | Weight | Acquired patterns |
|---|---|---|---|
| progress | createsRisk | 0.5 | which progresses to |
| | causes | 0.5 | still progressing to |
| risk factor | createsRisk | 1.0 | children risk factors; have risk factors; known risk factors |
| occur | affects | 0.67 | occurs anywhere |
| | isSymptom | 0.33 | occurs patients |
Evaluation of different text genres
| Relation | Precision | Precision | Precision | Facts | Facts | Facts |
|---|---|---|---|---|---|---|
| Affects | 0.855 ±0.047 | 0.762 ±0.049 | 0.767 ±0.048 | 1,278 | 450 | 5,053 |
| Aggravates | 0.810 ±0.041 | 0.459 ±0.044 | 0.785 ±0.049 | 130 | 371 | 708 |
| Alleviates | 0.953 ±0.039 | 0.735 ±0.048 | 0.736 ±0.048 | 903 | 4,433 | 6,790 |
| Causes | 0.904 ±0.039 | 0.674 ±0.049 | 0.792 ±0.049 | 28,119 | 19,203 | 62,407 |
| Complication | 0.917 ±0.039 | 0.397 ±0.049 | 0.869 ±0.046 | 1,011 | 1,475 | 1,566 |
| Contraindicates | 0.874 ±0.048 | 0.710 ±0.000 | 0.908 ±0.048 | 512 | 49 | 1,831 |
| CreatesRisk | 0.878 ±0.047 | 0.569 ±0.049 | 0.620 ±0.049 | 4,407 | 24,695 | 32,211 |
| Diagnoses | 0.964 ±0.035 | 0.839 ±0.049 | 0.840 ±0.047 | 813 | 5,920 | 9,743 |
| Interacts | 0.964 ±0.035 | 0.709 ±0.000 | 0.957 ±0.034 | 164,912 | 103 | 164,912 |
| IsSymptom | 0.891 ±0.042 | 0.482 ±0.050 | 0.694 ±0.048 | 4,878 | 2,320 | 11,017 |
| ReducesRisk | 0.797 ±0.045 | 0.637 ±0.046 | 0.751 ±0.049 | 1,712 | 4,684 | 5,865 |
| SideEffect | 0.956 ±0.038 | 0.826 ±0.000 | 0.971 ±0.026 | 270,600 | 139 | 271,416 |
| Treats | 0.850 ±0.048 | 0.581 ±0.045 | 0.566 ±0.048 | 11,915 | 9,318 | 35,803 |
| Aggregated* | 0.951 | 0.630 | 0.892 | 491,190 | 73,160 | 609,322 |
*Precision values are averaged and numbers of harvested facts are summed.
Evaluation of the impact of different components
| Relation | Precision | Precision | Precision | Precision | Facts | Facts | Facts | Facts |
|---|---|---|---|---|---|---|---|---|
| Affects | 0.825 ±0.047 | 0.882 ±0.044 | 0.821 ±0.048 | 0.171 ±0.051 | 2,388 | 2,350 | 4,088 | 29,477 |
| Aggravates | 0.829 ±0.049 | 0.833 ±0.036 | 0.598 ±0.049 | 0.592 ±0.053 | 432 | 431 | 592 | 1,730 |
| Alleviates | 0.786 ±0.046 | 0.778 ±0.050 | 0.320 ±0.049 | 0.289 ±0.062 | 4,530 | 4,387 | 18,142 | 16,943 |
| Causes | 0.801 ±0.049 | 0.800 ±0.046 | 0.631 ±0.048 | 0.490 ±0.069 | 47,463 | 30,563 | 66,833 | 91,784 |
| Complication | 0.897 ±0.041 | 0.781 ±0.048 | 0.376 ±0.050 | 0.739 ±0.050 | 1,524 | 700 | 4,812 | 2,955 |
| Contraindicates | 0.961 ±0.030 | 0.914 ±0.043 | 0.122 ±0.049 | 0.630 ±0.059 | 1,808 | 365 | 26,298 | 15,279 |
| CreatesRisk | 0.720 ±0.040 | 0.750 ±0.044 | 0.386 ±0.047 | 0.406 ±0.067 | 18,508 | 17,282 | 77,158 | 48,159 |
| Diagnoses | 0.860 ±0.048 | 0.887 ±0.044 | 0.802 ±0.049 | 0.303 ±0.063 | 4,832 | 4,002 | 7,467 | 35,326 |
| Interacts | 0.965 ±0.034 | 0.858 ±0.046 | 0.953 ±0.047 | 0.941 ±0.049 | 164,912 | 392 | 200,935 | 187,201 |
| IsSymptom | 0.858 ±0.048 | 0.691 ±0.050 | 0.625 ±0.049 | 0.328 ±0.064 | 6,395 | 2,920 | 9,543 | 29,776 |
| ReducesRisk | 0.762 ±0.048 | 0.729 ±0.050 | 0.228 ±0.046 | 0.406 ±0.067 | 4,489 | 4,043 | 11,023 | 14,729 |
| SideEffect | 0.964 ±0.035 | 0.938 ±0.048 | 0.941 ±0.046 | 0.879 ±0.050 | 270,709 | 924 | 270,427 | 338,645 |
| Treats | 0.898 ±0.041 | 0.784 ±0.050 | 0.549 ±0.050 | 0.402 ±0.067 | 14,699 | 14,057 | 23,473 | 45,439 |
| Aggregated* | 0.933 | 0.784 | 0.777 | 0.707 | 542,689 | 82,416 | 720,791 | 857,443 |
*Precision values are averaged and numbers of harvested facts are summed.
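The aggregated row can be checked against the per-relation rows. The sketch below uses the first precision column and the first fact-count column of the table above, copied by hand. It compares a plain mean with a mean weighted by the number of facts per relation. The weighted mean comes out at roughly 0.93, in line with the reported 0.933, while the plain mean is noticeably lower, so "averaged" appears to mean fact-weighted here. That reading is an inference from the numbers, not a statement in the source.

```python
# (relation, precision, number of facts) from the first precision and
# first fact-count columns of the component evaluation table.
rows = [
    ("Affects", 0.825, 2388),
    ("Aggravates", 0.829, 432),
    ("Alleviates", 0.786, 4530),
    ("Causes", 0.801, 47463),
    ("Complication", 0.897, 1524),
    ("Contraindicates", 0.961, 1808),
    ("CreatesRisk", 0.720, 18508),
    ("Diagnoses", 0.860, 4832),
    ("Interacts", 0.965, 164912),
    ("IsSymptom", 0.858, 6395),
    ("ReducesRisk", 0.762, 4489),
    ("SideEffect", 0.964, 270709),
    ("Treats", 0.898, 14699),
]

# Summed facts, as stated in the table footnote.
total_facts = sum(facts for _, _, facts in rows)

# Plain mean: every relation counts equally.
plain_mean = sum(precision for _, precision, _ in rows) / len(rows)

# Weighted mean: every relation counts in proportion to its number of facts.
weighted_mean = sum(precision * facts for _, precision, facts in rows) / total_facts

print(total_facts)  # 542689
print(round(plain_mean, 3))
print(round(weighted_mean, 3))
```

Because SideEffect and Interacts together account for most of the facts, their high precision dominates the weighted figure.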
Number of fact occurrences in text sources
| Genre | Source | Fact occurrences |
|---|---|---|
| Scientific Publications | PubMed Medline | 39,266 |
| | PubMed Central | 6,979 |
| Encyclopedic Articles | Drugs.com | 461,130 |
| | Mayo Clinic | 35,300 |
| | Medline Plus | 6,559 |
| | RxList | 5,818 |
| | Wikipedia Health | 17,588 |
Error analysis (number of facts in brackets)
| Share of all errors | Error type | | | |
|---|---|---|---|---|
| 8.16% (62) | Preprocessing | 38.71% (24) | 3.23% (2) | 58.06% (36) |
| 27.24% (207) | Entity Recognition | 13.04% (27) | 45.41% (94) | 41.55% (86) |
| 32.11% (244) | Entity Disambiguation | 12.30% (30) | 26.23% (64) | 61.48% (150) |
| 1.97% (15) | Coreferencing | 13.33% (2) | 13.33% (2) | 73.33% (11) |
| 13.68% (104) | Nonexistent Relation | 23.08% (24) | 29.81% (31) | 47.12% (49) |
| 9.21% (70) | Pattern Relation Duality | 24.29% (17) | 27.14% (19) | 48.57% (34) |
| 3.29% (25) | Swapped left and right-hand entity | 28.00% (7) | 24.00% (6) | 48.00% (12) |
| 3.03% (23) | Negation | 17.39% (4) | 21.74% (5) | 60.87% (14) |
| 1.32% (10) | Factually Wrong | 40.00% (4) | 10.00% (1) | 50.00% (5) |
Top-20 pairs of inter-connected biomedical areas within KnowLife
| Area | Area | Facts |
|---|---|---|
| Disorders | Chemicals | 310482 |
| Chemicals | Chemicals | 190160 |
| Disorders | Disorders | 36677 |
| Disorders | Procedures | 14169 |
| Chemicals | Physiology | 5397 |
| Disorders | Genes | 3831 |
| Disorders | Living Beings | 2539 |
| Chemicals | Drugs | 2455 |
| Disorders | Anatomy | 2895 |
| Disorders | Devices | 792 |
| Disorders | Activities | 592 |
| Disorders | Drugs | 511 |
| Disorders | Objects | 505 |
| Chemicals | Procedures | 544 |
| Disorders | Physiology | 370 |
| Procedures | Physiology | 123 |
| Procedures | Living Beings | 99 |
| Disorders | Geographical Areas | 82 |
| Genes | Physiology | 51 |
| Disorders | Phenomena | 50 |