Sampo Pyysalo1, Simon Baker1, Imran Ali2, Stefan Haselwimmer1, Tejas Shah1, Andrew Young3, Yufan Guo1, Johan Högberg2, Ulla Stenius2, Masashi Narita3, Anna Korhonen1.
Abstract
MOTIVATION: The overwhelming size and rapid growth of the biomedical literature make it impossible for scientists to read all studies related to their work, potentially leading to missed connections and wasted time and resources. Literature-based discovery (LBD) aims to alleviate these issues by identifying implicit links between disjoint parts of the literature. While LBD has been studied in depth since its introduction three decades ago, there has been limited work making use of recent advances in biomedical text processing methods in LBD.
Year: 2019 PMID: 30304355 PMCID: PMC6499247 DOI: 10.1093/bioinformatics/bty845
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Illustration of closed and open discovery settings. In closed discovery, the goal is to identify nodes (b1, b2,…) connecting a given start and end node (a1 and c1). In open discovery, only a start node (a1) is given, and the aim is to find indirectly connected nodes (c1, c2,…). Identified candidate nodes are ranked based on the edge weights w.
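The two settings in Fig. 1 can be sketched over a small weighted graph. The following is a minimal illustration, not the LION LBD implementation: the toy edges, the function names (`build_graph`, `closed_discovery`, `open_discovery`) and the choice to score a path by its weaker edge are all assumptions made for the example.

```python
def build_graph(weighted_edges):
    """Build an undirected adjacency map {node: {neighbour: weight}}."""
    graph = {}
    for u, v, w in weighted_edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def closed_discovery(graph, a, c):
    """Return intermediate nodes b linked to both a and c, best first."""
    neighbours_a = graph.get(a, {})
    neighbours_c = graph.get(c, {})
    shared = set(neighbours_a) & set(neighbours_c)
    # Assumption: a path a-b-c is only as strong as its weaker edge.
    scored = {b: min(neighbours_a[b], neighbours_c[b]) for b in shared}
    return sorted(scored, key=scored.get, reverse=True)


def open_discovery(graph, a):
    """Return nodes c two hops from a that have no direct edge to a."""
    neighbours_a = graph.get(a, {})
    scored = {}
    for b, w_ab in neighbours_a.items():
        for c, w_bc in graph[b].items():
            if c == a or c in neighbours_a:
                continue  # skip the start node and directly linked nodes
            path_score = min(w_ab, w_bc)
            # Assumption: keep the best single path leading to each c.
            scored[c] = max(scored.get(c, path_score), path_score)
    return sorted(scored, key=scored.get, reverse=True)


# Toy graph using the node names from Fig. 1 (weights are made up).
toy = build_graph([
    ("a1", "b1", 0.9), ("b1", "c1", 0.8),
    ("a1", "b2", 0.4), ("b2", "c1", 0.7), ("b2", "c2", 0.6),
])
print(closed_discovery(toy, "a1", "c1"))  # ['b1', 'b2']
print(open_discovery(toy, "a1"))          # ['c1', 'c2']
```

Other ways of combining edge weights along a path, and of combining paths that reach the same candidate, can be swapped in for `min` and `max` here.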
Fig. 2. LION LBD system components. Users interact with the system through text-based queries (A) that are mapped to ontology identifiers (B) used to search the entity-level graph (C). Mentions of entities in context can be retrieved from a separate database (D).
Fig. 3. LION LBD system build process. Source data (1–5) is processed by creating a merged identifier mapping (6), metadata extraction (7), text classification (8) and identifier mapping (9). Following mention co-occurrence analysis (10), entity-level data and metrics are aggregated from mention-level data (11) and the two layers of information stored in separate databases (12 and 13).
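Steps 10 and 11 of Fig. 3 move from mention-level co-occurrences to entity-level counts. A rough sketch of that kind of aggregation is shown below. It is an assumption-laden illustration only: the function name `aggregate_cooccurrences` is hypothetical, the unit of co-occurrence (sentence, abstract or other) is not specified here, and the three toy units are invented; only the identifiers are taken from the evaluation tables.

```python
from collections import Counter
from itertools import combinations


def aggregate_cooccurrences(units):
    """Aggregate grounded mentions into entity and entity-pair counts.

    units: iterable of lists of entity identifiers mentioned together
    in one text unit (what counts as a unit is an assumption here).
    """
    entity_counts = Counter()
    pair_counts = Counter()
    for ids in units:
        # Count each entity once per unit; sort so pairs have a stable order.
        unique = sorted(set(ids))
        entity_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return entity_counts, pair_counts


# Invented toy input: Nrf2, ROS and pancreatic cancer identifiers.
units = [
    ["PR:000011170", "CHEBI:26523"],
    ["CHEBI:26523", "MESH:D010190", "CHEBI:26523"],
    ["PR:000011170", "CHEBI:26523"],
]
entities, pairs = aggregate_cooccurrences(units)
print(entities["CHEBI:26523"])                   # 3
print(pairs[("CHEBI:26523", "PR:000011170")])    # 2
```

The resulting counts are the kind of input that association metrics over entity pairs are computed from.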
Fig. 4. LION LBD user interface. The user query (p53) is shown together with controls switching between different discovery modes above the result graph, where nodes represent related concepts and edges their associations.
Evaluation dataset from cancer research discoveries
| A (id) | B (id) | C (id) | Reference |
|---|---|---|---|
| NF- | Bcl-2 (PR:000002307) | Adenoma (MESH:D000236) | |
| NOTCH1 (PR:000011331) | senescence (HOC:42) | C/EBP | |
| IL-17 (PR:000001138) | p38 | MKP-1 (PR:000006736) | |
| Nrf2 (PR:000011170) | ROS (CHEBI:26523) | pancreatic cancer (MESH:D010190) | |
| CXCL12 (PR:000006066) | senescence (HOC:42) | thyroid cancer (MESH:D013964) | |
Evaluation dataset from Swanson’s discoveries
| A (id) | C (id) | Reference |
|---|---|---|
| Migraine (MESH:D008881) | Magnesium (MESH:D008274) | |
| Somatomedin C (PR:000009182) | Arginine (CHEBI:29016) | |
| Alzheimer’s disease (MESH:D000544) | Estrogen (MESH:D004967) | |
| Alzheimer’s disease (MESH:D000544) | Indomethacin (MESH:D007213) | |
| Schizophrenia (MESH:D012559) | Calcium Independent Phospholipase A2 (PR:000012942) | |
Data statistics
| Type | Mentions (Grounded) | Entities |
|---|---|---|
| Disease | 81 993 034 (72 352 890) | 9849 |
| Chemical | 68 839 682 (46 691 595) | 110 024 |
| Species | 52 902 078 (45 937 366) | 9765 |
| Gene | 31 545 993 (24 581 542) | 27 089 |
| Hallmark | 26 769 779 (26 769 779) | 37 |
| Mutation | 1 062 702 (174 531) | 37 929 |
| Total | 263 113 268 (216 507 703) | 194 693 |
Closed discovery evaluation results for cancer discoveries
| Metric | Aggregation function | | |
|---|---|---|---|
| | min | avg | max |
| NPMI | | | |
| SCP | | 196 | 299 |
| | | 196 | 270 |
| | | 136 | 261 |
| LLR | | 163 | 264 |
| Jaccard | | 213 | 282 |
| Count | 245 | | 245 |
| Doc-count | 231 | | 222 |
Note: Best result in each row underlined, best in column in bold.
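Several of the metric names in the table have widely used definitions over co-occurrence counts. The sketch below assumes the usual textbook forms of NPMI, SCP and Jaccard; the exact variants and counting units used by the system are not stated here, and the function names and argument order are hypothetical.

```python
import math


def npmi(n_xy, n_x, n_y, n):
    """Normalized pointwise mutual information, in [-1, 1].

    n_xy: units containing both x and y; n_x, n_y: units containing
    each one; n: total number of units.
    """
    if n_xy == 0:
        return -1.0  # limit value when x and y never co-occur
    p_xy, p_x, p_y = n_xy / n, n_x / n, n_y / n
    if p_xy == 1.0:
        return 1.0  # convention chosen here to avoid dividing by zero
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


def scp(n_xy, n_x, n_y):
    """Symmetric conditional probability: P(x|y) * P(y|x)."""
    return (n_xy * n_xy) / (n_x * n_y)


def jaccard(n_xy, n_x, n_y):
    """Jaccard index: shared units over units containing either one."""
    return n_xy / (n_x + n_y - n_xy)


# Statistically independent pair: 25 joint units out of 100, 50 each.
print(npmi(25, 50, 50, 100))   # 0.0
print(scp(25, 50, 50))         # 0.25
print(jaccard(25, 50, 50))     # 0.3333333333333333
```

Under this reading, each metric assigns a weight to a single edge, and the min, avg and max columns describe how the two edge weights on an A-B-C path are combined.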
Open discovery evaluation results for cancer discoveries
| Metric | Accumulation function | | | | | |
|---|---|---|---|---|---|---|
| | sum (min) | max (min) | sum (avg) | max (avg) | sum (max) | max (max) |
| NPMI | 98 698 | 15 476 | 121 | 5897 | | 2268 |
| SCP | | 926 | 400 | 1176 | 399 | 727 |
| | 547 | 3582 | | 1159 | | 1159 |
| | 118 751 | | 98 406 | 325 | 125 | 176 |
| LLR | 98 677 | | 344 | 646 | 319 | 645 |
| Jaccard | | 1089 | 78 | 962 | 93 | 1122 |
| Count | | 1005 | | | 62 | |
| Doc-count | | 738 | 72 | 68 | 74 | 68 |
Note: Best result in each row underlined, best in column in bold.
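The column labels pair an accumulation function (sum or max) with an aggregation function (min, avg or max). One plausible reading, sketched below, is that the aggregation function combines the two edge weights on each A-B-C path and the accumulation function then combines all paths reaching the same C. This reading, the function name `score_candidates` and the toy weights are assumptions, not details confirmed by the text above.

```python
from statistics import mean

AGGREGATORS = {"min": min, "avg": mean, "max": max}
ACCUMULATORS = {"sum": sum, "max": max}


def score_candidates(paths, accumulate="sum", aggregate="min"):
    """Score each candidate c from the paths that reach it.

    paths: dict mapping c to a list of (w_ab, w_bc) edge-weight pairs,
    one pair per intermediate node b.
    """
    agg = AGGREGATORS[aggregate]
    acc = ACCUMULATORS[accumulate]
    return {c: acc([agg(pair) for pair in pairs]) for c, pairs in paths.items()}


# Toy weights: c1 is reached through two intermediate nodes, c2 through one.
paths = {"c1": [(0.75, 0.5), (0.25, 1.0)], "c2": [(0.5, 0.5)]}
for accumulate in ACCUMULATORS:
    for aggregate in AGGREGATORS:
        scores = score_candidates(paths, accumulate, aggregate)
        print(f"{accumulate} ({aggregate}):", scores)
```

With `sum`, a candidate supported by many intermediate nodes can outscore one supported by a single strong path; with `max`, only the best path counts.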
Open discovery evaluation results for Swanson’s discoveries
| Metric | Accumulation function | | | | | |
|---|---|---|---|---|---|---|
| | sum (min) | max (min) | sum (avg) | max (avg) | sum (max) | max (max) |
| NPMI | 41 837 | 8869 | 16 714 | 9715 | | 5545 |
| SCP | | 427 | 154 | 250 | 154 | 250 |
| | 37 827 | 7820 | 156 | 263 | | 263 |
| | 40 103 | 1808 | 37 368 | 116 | | 105 |
| LLR | 37 820 | 3404 | | 45 | 10 | |
| Jaccard | | 1075 | | 237 | 9 | 240 |
| Count | | 43 | 20 | | 21 | 261 |
| Doc-count | | | 20 | 31 | 21 | 237 |
Note: Best result in each row underlined, best in column in bold.