Masaki Asada, Nallappan Gunasekaran, Makoto Miwa, Yutaka Sasaki.
Abstract
We deal with a heterogeneous pharmaceutical knowledge graph containing textual information built from several databases. The knowledge graph is a heterogeneous graph that includes a wide variety of concepts and attributes, some of which are provided as pieces of text that have not been targeted in conventional graph completion tasks. To investigate the utility of textual information for knowledge graph completion, we generate embeddings from the textual descriptions given to heterogeneous items, such as drugs and proteins, while learning knowledge graph embeddings. We evaluate the obtained graph embeddings on the link prediction task for knowledge graph completion, which can be used for drug discovery and repurposing. We also compare the results with existing methods and discuss the utility of the textual information.
Keywords: drug database; heterogeneous networks; knowledge graph completion; knowledge graph embedding; textual information
Year: 2021 PMID: 34278204 PMCID: PMC8281808 DOI: 10.3389/frma.2021.670206
Source DB: PubMed Journal: Front Res Metr Anal ISSN: 2504-0537
FIGURE 1 Illustration of the heterogeneous pharmaceutical knowledge graph.
Statistics of heterogeneous pharmaceutical KG entities.
| Entity type | # |
|---|---|
| Drug (DrugBank-ID) | 11,516 |
| Protein (Uniprot-ID) | 5,339 |
| Pathway (SMPDB-ID) | 874 |
| Category (MESH-ID) | 2,166 |
| ATC (ATC-code) | 1,093 |
| Total | 20,988 |
Statistics of heterogeneous pharmaceutical KG edges for each relation type.
| Relation type | All | Train | Valid | Test |
|---|---|---|---|---|
| Category | 60,459 | 54,419 | 3,020 | 3,020 |
| ATC | 16,341 | 14,711 | 815 | 815 |
| Pathway | 18,707 | 16,847 | 930 | 920 |
| Interact | 2,682,142 | 2,413,932 | 134,105 | 134,105 |
| Target | 18,467 | 16,627 | 920 | 920 |
| Enzyme | 5,206 | 4,686 | 260 | 260 |
| Carrier | 815 | 735 | 40 | 40 |
| Transporter | 3,093 | 2,793 | 150 | 150 |
| Total | 2,750,228 | 2,525,829 | 140,240 | 140,240 |
FIGURE 2 Overview of methods: (A) Initializing node embeddings (Initialization), (B) Aligning entity embeddings and textual embeddings (Alignment), and (C) Augmenting KG embeddings (Augmentation).
The percentage of nodes that have each type of text.
| Entity type | Name (%) | Description (%) | Synonyms (%) |
|---|---|---|---|
| Drug | 100 | 53.72 | 48.50 |
| Protein | 100 | 96.17 | 100 |
| Category | 100 | 94.01 | 91.42 |
| ATC | 100 | — | — |
| Pathway | 100 | 100 | — |
Nodes in all databases have Name text. While most proteins and categories also have Description and Synonyms text, only about half of the drugs have them (53.72% and 48.50%, respectively).
Comparison of MRR performance for each method.
| Relation | No text | Initialization (Name) | Initialization (Desc) | Initialization (Syn) | Alignment (Name) | Alignment (Desc) | Alignment (Syn) | Augmentation |
|---|---|---|---|---|---|---|---|---|
| TransE | | | | | | | | |
| Category | 0.1978 | 0.2117 | 0.2120 | 0.2231 | 0.2136 | 0.2059 | 0.1913 | |
| ATC | 0.2929 | 0.2695 | 0.2571 | 0.2495 | | 0.2973 | 0.2872 | 0.2571 |
| Pathway | | 0.5674 | 0.5792 | 0.5793 | 0.6694 | 0.6711 | 0.6713 | 0.5473 |
| Interact | | 0.2867 | 0.2845 | 0.2843 | 0.3103 | 0.3106 | 0.3108 | 0.2644 |
| Target | 0.0802 | 0.0748 | 0.0811 | 0.0808 | 0.0808 | 0.0780 | 0.0821 | |
| Enzyme | 0.3262 | 0.3067 | 0.3314 | 0.3090 | 0.3474 | 0.3564 | **0.3590** | 0.2926 |
| Carrier | **0.4155** | 0.3078 | 0.2843 | 0.3679 | 0.4037 | 0.4044 | 0.4010 | 0.3459 |
| Transporter | **0.3576** | 0.3194 | 0.3104 | 0.3182 | 0.3444 | 0.3308 | 0.3391 | 0.2866 |
| Avg. (macro) | 0.3319 | 0.2930 | 0.2950 | 0.3015 | **0.3337** | 0.3318 | 0.3302 | 0.2883 |
| DistMult | | | | | | | | |
| Category | 0.2539 | | 0.2797 | 0.2753 | 0.2679 | 0.2661 | 0.2586 | 0.2649 |
| ATC | 0.2428 | 0.2674 | | 0.2639 | 0.2612 | 0.2617 | 0.2698 | 0.2904 |
| Pathway | | 0.5424 | 0.5542 | 0.6002 | 0.6524 | 0.6711 | 0.6615 | 0.4997 |
| Interact | 0.7730 | 0.6338 | 0.5895 | 0.6302 | 0.7911 | 0.7868 | | 0.6113 |
| Target | 0.0738 | 0.0866 | 0.0864 | 0.0947 | 0.0778 | 0.0758 | 0.0734 | |
| Enzyme | 0.2501 | 0.2516 | 0.2358 | | 0.2247 | 0.2334 | 0.2140 | 0.2183 |
| Carrier | 0.2023 | | 0.1640 | 0.1622 | 0.1369 | 0.1311 | 0.1649 | 0.2134 |
| Transporter | 0.2293 | | 0.1840 | 0.2190 | 0.1969 | 0.1939 | 0.1708 | 0.2062 |
| Avg. (macro) | **0.3380** | 0.3206 | 0.2979 | 0.3162 | 0.3261 | 0.3274 | 0.3265 | 0.3011 |
| ComplEx | | | | | | | | |
| Category | 0.0905 | 0.3455 | | 0.3386 | 0.3302 | 0.0577 | 0.0611 | 0.3420 |
| ATC | 0.3326 | 0.3463 | 0.3623 | 0.3485 | 0.3271 | 0.3425 | 0.3407 | |
| Pathway | 0.6956 | 0.6856 | 0.7051 | 0.7157 | 0.7220 | 0.6963 | | 0.6820 |
| Interact | **0.8678** | 0.7632 | 0.7166 | 0.7802 | 0.8578 | 0.8189 | 0.8497 | 0.8230 |
| Target | 0.0496 | 0.1116 | 0.1093 | 0.1153 | 0.0859 | 0.0640 | 0.0740 | **0.1243** |
| Enzyme | 0.2103 | 0.2256 | 0.2512 | | 0.2245 | 0.1907 | 0.2097 | 0.2073 |
| Carrier | 0.1533 | 0.1557 | 0.1817 | 0.1423 | 0.0994 | 0.1750 | 0.1462 | |
| Transporter | 0.1942 | | 0.2667 | 0.2593 | 0.2151 | 0.2076 | 0.2362 | 0.2801 |
| Avg. (macro) | 0.3242 | 0.3681 | 0.3678 | 0.3692 | 0.3577 | 0.3190 | 0.3312 | **0.3771** |
| SimplE | | | | | | | | |
| Category | 0.0461 | 0.3591 | 0.3536 | **0.3668** | 0.0520 | 0.3263 | 0.2619 | 0.3367 |
| ATC | 0.3278 | **0.3820** | 0.3617 | 0.3732 | 0.3644 | 0.3410 | 0.3425 | 0.3475 |
| Pathway | **0.7513** | 0.7164 | 0.7299 | 0.7180 | 0.7336 | 0.7428 | 0.7448 | 0.7189 |
| Interact | 0.6215 | 0.7229 | | 0.7338 | 0.6488 | 0.6242 | 0.6602 | 0.7230 |
| Target | 0.0815 | 0.1128 | 0.1169 | | 0.0971 | 0.0918 | 0.0873 | 0.1163 |
| Enzyme | 0.1903 | 0.2442 | 0.2143 | | 0.2499 | 0.2031 | 0.1977 | 0.2304 |
| Carrier | 0.1358 | | 0.2441 | 0.2526 | 0.1881 | 0.1766 | 0.1266 | 0.1493 |
| Transporter | 0.2242 | | 0.2189 | 0.2543 | 0.2396 | 0.2042 | 0.2417 | 0.2173 |
| Avg. (macro) | 0.2973 | 0.3829 | 0.3705 | **0.3839** | 0.3216 | 0.3387 | 0.3328 | 0.3549 |
We summarized the MRR for each relational triple and calculated the macro-averaged MRR. The highest score for each node row is shown in bold.
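
The two-step averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the toy scores, and the tie handling (the true candidate is ranked ahead of candidates with an equal score) are all assumptions made for the example.

```python
from collections import defaultdict

def reciprocal_rank(scores, true_index):
    """Return 1 / rank of the true candidate, where rank 1 is the highest score."""
    true_score = scores[true_index]
    # Assumed tie handling: only strictly higher scores push the rank down.
    rank = 1 + sum(1 for s in scores if s > true_score)
    return 1.0 / rank

def macro_mrr(examples):
    """examples: iterable of (relation, candidate_scores, true_index) tuples."""
    per_relation = defaultdict(list)
    for relation, scores, true_index in examples:
        per_relation[relation].append(reciprocal_rank(scores, true_index))
    # Step 1: MRR within each relation type.
    mrr = {rel: sum(rr) / len(rr) for rel, rr in per_relation.items()}
    # Step 2: unweighted (macro) mean over relation types.
    macro = sum(mrr.values()) / len(mrr)
    return mrr, macro

# Toy test triples (hypothetical scores, not taken from the paper).
examples = [
    ("target", [0.1, 0.9, 0.3], 1),         # true candidate ranked 1st
    ("target", [0.8, 0.2, 0.5], 2),         # true candidate ranked 2nd
    ("interact", [0.4, 0.7, 0.6, 0.1], 0),  # true candidate ranked 3rd
]
mrr, macro = macro_mrr(examples)
print(mrr, macro)
```

Because the macro average weights every relation type equally, a small relation such as Carrier (40 test edges) counts as much as Interact (134,105 test edges).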
Summary of the best settings for each relation.
| Relation | MRR | Method | Text information |
|---|---|---|---|
| Category | 0.3668 | SimplE | Initialization + Synonyms |
| ATC | 0.3820 | SimplE | Initialization + Name |
| Pathway | 0.7513 | SimplE | No text |
| Interact | 0.8678 | ComplEx | No text |
| Target | 0.1243 | ComplEx | Augmentation |
| Enzyme | 0.3590 | TransE | Alignment + Synonyms |
| Carrier | 0.4155 | TransE | No text |
| Transporter | 0.3576 | TransE | No text |
| Avg. (Macro) | 0.3839 | SimplE | Initialization + Synonyms |
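
For readers unfamiliar with the methods named in the tables, the following sketch shows the commonly published score functions for TransE, DistMult, and ComplEx in plain Python. It is an illustration under assumptions: the record does not state which norm was used for TransE (L2 is assumed here), the function names and toy vectors are hypothetical, and SimplE is omitted for brevity.

```python
import math

def transe_score(h, r, t):
    """TransE: negative L2 distance between h + r and t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def distmult_score(h, r, t):
    """DistMult: trilinear product; symmetric in head and tail."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def complex_score(h, r, t):
    """ComplEx: real part of the trilinear product with the conjugated tail."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real

# Toy embeddings (hypothetical values).
h, r, t = [1.0, 2.0], [0.5, -1.0], [1.5, 1.0]
print(transe_score(h, r, t))    # h + r equals t, so the distance is zero
print(distmult_score(h, r, t))  # 1*0.5*1.5 + 2*(-1)*1 = -1.25
print(complex_score([1 + 1j], [1j], [1 - 1j]))  # -2.0
```

In link prediction, every candidate tail (or head) is scored with one of these functions and the candidates are sorted by score; the rank of the true entity then feeds the MRR.
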
FIGURE 3 The distribution of the frequency of nodes in the training data set. The frequencies of category nodes linked by the category relation are highly imbalanced, while the frequencies of drug nodes linked by the interact relation are less imbalanced.
Ablation study of text information on Augmentation method (ComplEx score function).
| Setting | Averaged MRR |
|---|---|
| Full text nodes | **0.3771** |
| - Description | 0.3626 |
| - Synonyms | 0.3655 |
| - Indication (drug) | 0.3761 |
| - Pharmacodynamics (drug) | 0.3727 |
| - Mechanism-of-action (drug) | 0.3646 |
| - Metabolism (drug) | 0.3689 |
| - Gene-name (protein) | 0.3754 |
Comparison of averaged MRR performance with and without entity type filtering.
| Method | Type filtering | No text | Initialization (Name) | Initialization (Desc) | Initialization (Syn) | Alignment (Name) | Alignment (Desc) | Alignment (Syn) | Augmentation |
|---|---|---|---|---|---|---|---|---|---|
| TransE | w/ type filtering | 0.3319 | 0.2930 | 0.2950 | 0.3015 | 0.3337 | 0.3318 | 0.3302 | 0.2883 |
| | w/o type filtering | 0.2759 | 0.2373 | 0.2265 | 0.2429 | 0.2738 | 0.2722 | 0.2753 | 0.1932 |
| DistMult | w/ type filtering | 0.3380 | 0.3206 | 0.2979 | 0.3162 | 0.3261 | 0.3274 | 0.3265 | 0.3011 |
| | w/o type filtering | 0.2217 | 0.2285 | 0.2423 | 0.2547 | 0.2462 | 0.2219 | 0.2464 | 0.1449 |
| ComplEx | w/ type filtering | 0.3242 | 0.3681 | 0.3678 | 0.3692 | 0.3577 | 0.3190 | 0.3312 | 0.3771 |
| | w/o type filtering | 0.2848 | 0.2906 | 0.2887 | 0.2981 | 0.2931 | 0.3052 | 0.3095 | 0.2373 |
| SimplE | w/ type filtering | 0.2973 | 0.3829 | 0.3705 | 0.3839 | 0.3216 | 0.3387 | 0.3328 | 0.3549 |
| | w/o type filtering | 0.2848 | 0.2906 | 0.2887 | 0.2981 | 0.2931 | 0.3052 | 0.3095 | 0.2373 |
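
One plausible reading of entity type filtering is that, when ranking candidate tails for a triple, only entities whose type is admissible for the relation are kept as candidates. The sketch below illustrates that reading only; the record does not spell out the procedure, and the `TAIL_TYPE` mapping, the function name, and the toy entity identifiers are hypothetical.

```python
# Hypothetical mapping from relation to the entity type allowed as its tail.
TAIL_TYPE = {"target": "protein", "category": "category", "interact": "drug"}

def rank_of_true_tail(scores, entity_types, relation, true_tail, type_filter=True):
    """Rank of the true tail among candidate tails (1 is best).

    scores: dict mapping entity id -> model score for (head, relation, entity).
    entity_types: dict mapping entity id -> entity type string.
    """
    candidates = list(scores)
    if type_filter:
        # Keep only candidates whose type matches what the relation admits.
        allowed = TAIL_TYPE[relation]
        candidates = [e for e in candidates if entity_types[e] == allowed]
    true_score = scores[true_tail]
    return 1 + sum(1 for e in candidates if scores[e] > true_score)

# Toy data (hypothetical identifiers and scores).
entity_types = {"drug_a": "drug", "prot_a": "protein",
                "prot_b": "protein", "path_a": "pathway"}
scores = {"drug_a": 0.9, "prot_a": 0.5, "prot_b": 0.7, "path_a": 0.8}

with_filter = rank_of_true_tail(scores, entity_types, "target", "prot_a")
without_filter = rank_of_true_tail(scores, entity_types, "target", "prot_a",
                                   type_filter=False)
print(with_filter, without_filter)  # 2 4
```

Under this reading, filtering can only remove competitors, so the rank of the true entity never gets worse, which is in line with the higher averaged MRR reported for the filtered rows.
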
Text content of the examples where the difference between the rank given by the textual model and the rank given by the non-textual model is largest for each relation type. The SimplE score function was used for the category, ATC, and pathway relations, and ComplEx was used for the interact relation. The Augmentation model was selected as the model with textual information. Bold marks the mentions common to the head and tail entities. The Description and Synonyms are partly excerpted due to space limitations.
| Examples where textual information is helpful and the gap between ranks is largest for each relation type | | |
|---|---|---|
| (a) | Relation: | |
| Head | ID | DB13746 (drug entity) |
| | Name | |
| | Desc. | |
| | Syn. | |
| Tail | ID | D013237 (category entity) |
| | Name | |
| | Desc. | |
| | Syn. | |
| (b) | Relation: | |
| Head | ID | DB00369 (drug entity) |
| | Name | |
| | Desc. | |
| | Syn. | |
| Tail | ID | J05A (ATC entity) |
| | Name | |
| | Desc. | ATC entity has no description |
| | Syn. | ATC entity has no synonyms |
| Examples where textual information is | | |
| (c) | Relation: | |
| Head | ID | P51589 (protein entity) |
| | Name | |
| | Desc. | |
| | Syn. | |
| Tail | ID | SMP0000695 (pathway entity) |
| | Name | |
| | Desc. | |
| | Syn. | pathway entity has no synonyms |
| (d) | Relation: | |
| Head | ID | DB08893 (drug entity) |
| | Name | |
| | Desc. | |
| | Syn. | |
| Tail | ID | DB00937 (drug entity) |
| | Name | |
| | Desc. | |
| | Syn. | |