Judita Preiss, Mark Stevenson, Robert Gaizauskas.
Abstract
OBJECTIVE: Literature-based discovery (LBD) aims to identify "hidden knowledge" in the medical literature by: (1) analyzing documents to identify pairs of explicitly related concepts (terms), then (2) hypothesizing novel relations between pairs of unrelated concepts that are implicitly related via a shared concept to which both are explicitly related. Many LBD approaches use simple techniques to identify semantically weak relations between concepts, for example, document co-occurrence. These generate huge numbers of hypotheses, which are difficult for humans to assess. More complex techniques rely on linguistic analysis, for example, shallow parsing, to identify semantically stronger relations. Such approaches generate fewer hypotheses, but may miss hidden knowledge. The authors investigate this trade-off in detail, comparing techniques for identifying related concepts to discover which are most suitable for LBD.
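The two steps described in the abstract can be sketched in a few lines of Python. This is a minimal illustration that assumes document co-occurrence as the relation and a hand-made toy corpus; the function names `cooccurrence_relations` and `hidden_connections` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_relations(documents):
    """Step 1: treat two concepts as explicitly related if they share a document."""
    related = defaultdict(set)
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            related[a].add(b)
            related[b].add(a)
    return dict(related)


def hidden_connections(related):
    """Step 2: map each unrelated pair (a, c) to the linking terms b they share."""
    hidden = defaultdict(set)
    for b, neighbours in related.items():
        for a, c in combinations(sorted(neighbours), 2):
            if c not in related[a]:  # a and c are not explicitly related
                hidden[(a, c)].add(b)
    return dict(hidden)


# Toy corpus (illustrative only): each document is the set of concepts it mentions.
documents = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "Raynaud disease"},
    {"fish oil", "platelet aggregation"},
    {"platelet aggregation", "Raynaud disease"},
]

related = cooccurrence_relations(documents)
hidden = hidden_connections(related)
linking_terms = hidden[("Raynaud disease", "fish oil")]
print(len(linking_terms), sorted(linking_terms))
```

Even on this tiny corpus the co-occurrence relation also proposes a second, less interesting pair (the two linking terms themselves), which hints at why weak relations produce so many hypotheses on a real collection.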
Keywords: knowledge discovery; literature based discovery; natural language processing; text mining
Year: 2015 PMID: 25971437 PMCID: PMC4986660 DOI: 10.1093/jamia/ocv002
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1: A small-scale example illustrating the difference between the co-occurrence based relations.
Number of linking terms for replication of existing discoveries with synonym merging and semantic type filtering
| | SemRep | ReVerb | Stanford |
|---|---|---|---|
| RD – fish oil | 4 | 0 | 1 |
| Somatomedin C – Arg | 130 | 22 | 27 |
| Migraine – Mg | 47 | 3 | 13 |
| Mg deficiency – ND | 43 | 5 | 0 |
| AD – estrogen | 331 | 64 | 76 |
| AD – INN | 234 | 47 | 49 |
| Schizophrenia – Ca2+iPLA2 | 13 | 0 | 0 |
Timeslice evaluation pre-slice 2000–2005, new knowledge generated from 2006 to 2010, merging synonyms, filtering semantic types.
| | Hidden knowledge | Union | | Majority | | Intersection | |
|---|---|---|---|---|---|---|---|
| | | Correct | | Correct | | Correct | |
| c-doc | 14 601 340 987 | 762 474 | 1.04e-04 | 25 089 | 3.44e-06 | 954 | 1.31e-07 |
| c-sent | 5 697 603 946 | 1 104 869 | 3.88e-04 | 41 147 | 1.44e-05 | 1485 | 5.41e-07 |
| c-title | 786 977 001 | 1 392 441 | 3.53e-03 | 68 393 | 1.74e-04 | 2808 | 7.14e-06 |
| SemRep | 197 590 213 | 1 268 934 | 1.27e-02 | 74 508 | 7.54e-04 | 3781 | 3.83e-05 |
| ReVerb | 91 950 221 | 1 068 498 | 2.28e-02 | 66 070 | | 3314 | 7.21e-05 |
| Stanford | 74 442 449 | 885 203 | | 60 120 | 1.61e-03 | 3049 | |
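A timeslice evaluation of this kind can be sketched as follows. The sketch rests on two assumptions that the text here does not confirm: that "union", "majority" and "intersection" describe how a gold standard is assembled from the relations several methods find in the later slice, and that a hypothesis counts as correct when it appears in that gold standard. The helper names (`pair`, `gold_standard`, `evaluate`) and the toy data are hypothetical, and the ratio computed here is not claimed to match the table's second column.

```python
from collections import Counter


def pair(a, b):
    """Order-independent representation of a concept pair."""
    return frozenset((a, b))


def gold_standard(relation_sets, mode):
    """Combine the relations found by several methods in the later time slice."""
    counts = Counter(p for relations in relation_sets for p in set(relations))
    n = len(relation_sets)
    thresholds = {"union": 1, "majority": n // 2 + 1, "intersection": n}
    if mode not in thresholds:
        raise ValueError(f"unknown mode: {mode}")
    return {p for p, k in counts.items() if k >= thresholds[mode]}


def evaluate(hypotheses, gold):
    """Return (number of correct hypotheses, fraction of hypotheses that are correct)."""
    correct = hypotheses & gold
    fraction = len(correct) / len(hypotheses) if hypotheses else 0.0
    return len(correct), fraction


# Toy data (illustrative only): relations found after the cut-off by three methods.
post_slice = [
    {pair("a", "b"), pair("a", "c"), pair("c", "d")},
    {pair("a", "b"), pair("a", "c")},
    {pair("a", "b")},
]
# Hypotheses generated from the pre-slice literature.
hypotheses = {pair("a", "b"), pair("a", "c"), pair("c", "d"), pair("x", "y")}

results = {
    mode: evaluate(hypotheses, gold_standard(post_slice, mode))
    for mode in ("union", "majority", "intersection")
}
print(results)
```

Stricter gold standards shrink the set of hypotheses that can count as correct, which is consistent with the falling counts from the Union to the Intersection columns.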
Hidden connection breakdown (with synonym merging and semantic type filtering).
| | No. of pairs | Terms | Mean | Median | Mode |
|---|---|---|---|---|---|
| 2000–2005 | | | | | |
| c-doc | 29 202 681 794 | 233 446 | 60 145 | 117 127 | 78 405 |
| c-sent | 11 395 207 892 | 227 869 | 50 007 | 35 071 | 10 987 |
| c-title | 1 573 954 002 | 138 622 | 11 354 | 5679 | 3 |
| SemRep | 395 180 426 | 88 525 | 4464 | 1734 | 1 |
| ReVerb | 183 900 442 | 90 742 | 2027 | 662 | 1 |
| Stanford | 148 884 898 | 71 389 | 2086 | 685 | 1 |