Anna Kirkpatrick, Chidozie Onyeze, David Kartchner, Stephen Allegri, Davi Nakajima An, Kevin McCoy, Evie Davalbhakta, Cassie S Mitchell.
Abstract
Literature-based discovery (LBD) summarizes information and generates insight from large text corpora. The SemNet framework uses a large heterogeneous information network, or "knowledge graph", of nodes and edges to compute relatedness and rank concepts pertinent to a user-specified target. SemNet provides a way to perform multi-factorial and multi-scalar analysis of complex disease etiology and therapeutic identification using the 33+ million articles in PubMed. The present work improves the efficacy and efficiency of LBD for end users by augmenting SemNet to create SemNet 2.0. A custom Python data structure replaced reliance on Neo4j, improving knowledge graph query times by several orders of magnitude. Additionally, two randomized algorithms were built to optimize the HeteSim metric calculation for computing metapath similarity. The unsupervised learning algorithm for rank aggregation (ULARA), which ranks concepts with respect to the user-specified target, was reconstructed using derived mathematical proofs of correctness and probabilistic performance guarantees for optimization. The upgraded ULARA is generalizable to other rank aggregation problems outside of SemNet. In summary, SemNet 2.0 is comprehensive open-source software that provides a significantly faster, more effective, and more user-friendly means of automated biomedical LBD. An example case ranks relationships between Alzheimer's disease and metabolic co-morbidities.
Keywords: Alzheimer’s disease; HeteSim; SemNet; ULARA; biomedical knowledge graph; machine learning; natural language processing; rank aggregation; relatedness; text mining
Year: 2022; PMID: 35936510; PMCID: PMC9351549; DOI: 10.3390/bdcc6010027
Source DB: PubMed; Journal: Big Data Cogn Comput; ISSN: 2504-2289
Figure 1. Example graph, metapath, and HeteSim computation.
Figure 2. Overview of SemNet version 1 HeteSim implementation. Speed ratio is computed as (SemNet 1 time)/(SemNet 2 time) and is given for source node insulin and target node Alzheimer’s disease. In SemNet 2, the approximate mean HeteSim algorithm is used with approximation parameters ϵ = 0.1 and r = 0.9.
SemNet version 1 HeteSim computation times for all metapaths between each of the three source nodes and Alzheimer’s disease.
| Source Node | Insulin | Hypothyroidism | Amyloid |
|---|---|---|---|
| Number of metapaths | 4873 | 2148 | 3095 |
| Total computation time (min) | 93.7 | 39.7 | 55.4 |
| Computation time per metapath (s) (±std) | 46.0 ± 6.1 | 44.2 ± 3.8 | 42.8 ± 4.2 |
| Neo4j query time, per metapath (s) (±std) | 44.9 ± 4.6 | 43.2 ± 2.8 | 42.1 ± 3.5 |
| Time per metapath, excluding query time (s) (±std) | 1.1 ± 3.1 | 1.0 ± 2.3 | 0.8 ± 1.9 |
Figure 3. Distribution of SemNet version 1 HeteSim computation times for all metapaths joining the given source node and Alzheimer’s disease. (a) Insulin; (b) Hypothyroidism; (c) Amyloid.
Figure 4. Distribution of Neo4j query times in SemNet version 1 HeteSim computation for all metapaths joining the given source node and Alzheimer’s disease. (a) Insulin; (b) Hypothyroidism; (c) Amyloid.
Figure 5. Overview of SemNet version 2 approximate mean HeteSim implementation. Speed ratio is (SemNet 1 time)/(SemNet 2 time) and is given for source node insulin and target node Alzheimer’s disease. SemNet version 2 used approximation parameters ϵ = 0.1 and r = 0.9.
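As a rough illustration of what approximation parameters such as ϵ = 0.1 and r = 0.9 can mean in practice, the sketch below estimates a mean score by uniform sampling and sizes the sample with Hoeffding's inequality. It assumes that ϵ is an additive error tolerance, that r is the probability of meeting it, and that every score lies in [0, 1]. The function names and the sampling rule are illustrative assumptions and may differ from the rule SemNet 2.0 actually uses.

```python
import math
import random


def hoeffding_sample_size(eps, r):
    """Number of samples so that |estimate - true mean| <= eps with
    probability >= r, for values bounded in [0, 1] (Hoeffding's inequality).
    Assumption: this is a generic bound, not necessarily SemNet's rule."""
    return math.ceil(math.log(2.0 / (1.0 - r)) / (2.0 * eps ** 2))


def approximate_mean(scores, eps, r, rng):
    """Estimate the mean of `scores` by sampling uniformly with replacement."""
    n = hoeffding_sample_size(eps, r)
    sample = [rng.choice(scores) for _ in range(n)]
    return sum(sample) / n


# Hypothetical per-metapath scores, each in [0, 1].
scores = [0.2, 0.4, 0.6, 0.8]
estimate = approximate_mean(scores, eps=0.1, r=0.9, rng=random.Random(0))
print(hoeffding_sample_size(0.1, 0.9), estimate)
```

Under these assumptions the sample size depends only on ϵ and r, not on the number of metapaths, which is one way a randomized estimate can avoid scoring every metapath.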
Figure 6. An example knowledge graph. Here, we use the convention that nodes are organized by type into vertical columns in the order that they appear in the metapath. We also only show edges that may appear in some metapath instance. This example has m₁ − 1 dead-end nodes on the left and m₂ − 1 dead-end nodes on the right. The HeteSim score of s and t with respect to the metapath is 1 for all values of m₁ and m₂.
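To see why dead ends can leave the score unchanged in a graph of this shape, here is a minimal Python sketch. It assumes the normalized HeteSim definition: the cosine similarity between the source's forward and the target's backward reachable-probability vectors at the metapath midpoint. It also assumes a hypothetical length-4 layout in which s has m1 neighbors, only one of which reaches the middle node, and likewise for t with m2 neighbors. The node names, the `step`, `hetesim`, and `dead_end_graph` helpers, and the layout itself are illustrative assumptions, not the figure's exact graph or SemNet's API.

```python
import math
from collections import defaultdict


def step(dist, edges):
    """Advance a probability distribution by one metapath step.
    `edges` maps a node to its neighbors in the next column; probability
    mass sitting on a node with no such neighbors (a dead end) is dropped."""
    nxt = defaultdict(float)
    for node, p in dist.items():
        nbrs = edges.get(node, [])
        for n in nbrs:
            nxt[n] += p / len(nbrs)
    return dict(nxt)


def hetesim(left_layers, right_layers, s, t):
    """Cosine similarity of the two midpoint distributions."""
    left, right = {s: 1.0}, {t: 1.0}
    for edges in left_layers:
        left = step(left, edges)
    for edges in right_layers:
        right = step(right, edges)
    dot = sum(p * right.get(n, 0.0) for n, p in left.items())
    norm = (math.sqrt(sum(p * p for p in left.values()))
            * math.sqrt(sum(p * p for p in right.values())))
    return dot / norm if norm else 0.0


def dead_end_graph(m1, m2):
    """Hypothetical graph: s -> a1..a_m1 and t -> b1..b_m2, where only a1
    and b1 connect to the single middle node; all others are dead ends."""
    left = [{"s": [f"a{i}" for i in range(1, m1 + 1)]}, {"a1": ["mid"]}]
    right = [{"t": [f"b{i}" for i in range(1, m2 + 1)]}, {"b1": ["mid"]}]
    return left, right


for m1, m2 in [(1, 1), (3, 5), (10, 2)]:
    left, right = dead_end_graph(m1, m2)
    print(m1, m2, round(hetesim(left, right, "s", "t"), 6))
```

In this sketch the dead ends only shrink the midpoint vectors to 1/m1 and 1/m2 on the same single node, and cosine similarity ignores vector length, so the score stays at 1 for every choice of m1 and m2.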
Figure 7. An example metapath and knowledge graph, drawn with the same conventions as in Figure 6. Note that, in this example, the removal of dead ends does change the HeteSim score.
Figure 8. Computed randomized pruned HeteSim (RPH) scores for each of the three test graphs. (a) Test graph 1; (b) Test graph 2; (c) Test graph 3.
Mean and standard deviation of runtimes for both SemNet version 1 and the approximate mean HeteSim algorithm from SemNet version 2, broken down by step as in Figure 5.
| Source Node | Insulin | Hypothyroidism | Amyloid |
|---|---|---|---|
| Num metapaths (SemNet 1) | 4873 | 2148 | 3095 |
| SemNet 1: Step 1 (s) | 81 ± 5.3 | 35 ± 2.4 | 84 ± 5.3 |
| SemNet 1: Step 2 (s) | 220,000 ± 2300 | 96,000 ± 270 | 220,000 ± 2700 |
| SemNet 1: Step 3 (s) | 0.80 ± 0.0021 | 0.39 ± 0.0093 | 0.80 ± 0.014 |
| Num metapaths (SemNet 2) | 4521 | 2130 | 3060 |
| SemNet 2: Step 1 (s) | 1.2 ± 0.0093 | 0.19 ± 0.0015 | 0.41 ± 0.0024 |
| SemNet 2: Step 2 (s) | 0.0047 ± 0.00097 | 0.0026 ± 0.00060 | 0.0027 ± 0.00061 |
| SemNet 2: Step 3 (s) | 8.3 × 10⁻⁵ ± 1.4 × 10⁻⁶ | 8.3 × 10⁻⁵ ± 1.9 × 10⁻⁶ | 8.3 × 10⁻⁵ ± 1.2 × 10⁻⁶ |
| Runtime ratio: Step 1 | 68 | 184 | 200 |
| Runtime ratio: Step 2 | 4.7 × 10⁸ | 3.6 × 10⁷ | 8.1 × 10⁷ |
| Runtime ratio: Step 3 | 9600 | 470 | 9600 |
Mean and standard deviation of runtimes for the mean exact HeteSim and approximate mean HeteSim algorithms.
| Algorithm | Runtime (s) |
|---|---|
| Mean exact HeteSim | 4.1 ± 0.060 |
| Approximate mean HeteSim | 3.9 ± 0.015 |
Mean and standard deviation of runtimes for both the deterministic HeteSim and randomized pruned HeteSim algorithms on the top 20 individual length 2 metapaths.
| Source Node | Deterministic HeteSim runtime (s) | Randomized Pruned HeteSim runtime (s) |
|---|---|---|
| Insulin | 2.0 × 10⁻³ ± 1.2 × 10⁻³ | 3500 ± 3400 |
| Hypothyroidism | 7.2 × 10⁻⁴ ± 3.4 × 10⁻⁴ | 440 ± 650 |
| Amyloid | 9.9 × 10⁻⁴ ± 6.4 × 10⁻⁴ | 1200 ± 1200 |
Computation details for the randomized pruned HeteSim algorithm on the top 20 individual length 2 metapaths.
| Source Node | Insulin | Hypothyroidism | Amyloid |
|---|---|---|---|
| Max iterations | 28,019,926 | 8,547,987 | 12,790,378 |
| Min iterations | 5,308,942 | 1,666,564 | 3,229,242 |
| Mean iterations | 10,068,473 | 2,632,969 | 5,206,723 |
| Max runtime (s) | 14,588 | 3138 | 5052 |
| Min runtime (s) | 420 | 99 | 247 |
| Mean runtime (s) | 3491 | 438 | 1193 |
| Max metapath instances | 488 | 167 | 240 |
| Min metapath instances | 109 | 39 | 70 |
Figure 9. HeteSim computation times per metapath for all metapaths of length 2 from the given source node to Alzheimer’s disease, using the deterministic HeteSim implementation from SemNet version 2. (a) Insulin; (b) Hypothyroidism; (c) Amyloid.
Maximum, minimum, and mean runtimes (with standard deviation) for the SemNet version 2 deterministic HeteSim algorithm on the top 20 individual length 4 metapaths.
| Source Node | Insulin | Hypothyroidism | Amyloid |
|---|---|---|---|
| Max runtime (s) | 0.21 | 0.022 | 0.033 |
| Min runtime (s) | 0.032 | 0.0029 | 0.0070 |
| Mean runtime (s) (±std) | 0.11 ± 0.039 | 0.011 ± 0.0056 | 0.015 ± 0.0075 |