Juncheng Ding, Wei Jin.
Abstract
The embedding of Medical Subject Headings (MeSH) terms has become a foundation for many downstream bioinformatics tasks. Recent studies employ different data sources, such as the corpus (in which each document is indexed by a set of MeSH terms), the MeSH term ontology, and the semantic predications between MeSH terms (extracted by SemMedDB), to learn their embeddings. While each of these data sources contributes to learning the MeSH term embeddings, current approaches fail to incorporate all of them in the learning process. The challenge is that the structured relationships between MeSH terms differ across the data sources, and no existing approach fuses such heterogeneous data when learning MeSH term embeddings. In this paper, we study the problem of incorporating corpus, ontology, and semantic predications to learn the embeddings of MeSH terms. We propose a novel framework, Corpus, Ontology, and Semantic predications-based MeSH term embedding (COS), to generate high-quality MeSH term embeddings. COS converts the corpus, ontology, and semantic predications into MeSH term sequences, merges these sequences, and learns MeSH term embeddings from the merged sequences. Extensive experiments on different datasets show that COS outperforms various baseline embeddings and traditional non-embedding-based baselines.
Year: 2021 PMID: 33945566 PMCID: PMC8096083 DOI: 10.1371/journal.pone.0251094
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Problem definition.
COS aims to learn MeSH term embeddings based on three data sources: corpus (the green block), ontology (the orange block), and semantic predications (the blue block). The structured relationships between MeSH terms are different across the data sources. The learned MeSH term embeddings should contain the information from all data sources.
Summary of related work.
| Methods | Corpus | Ontology | Semantic predications |
|---|---|---|---|
| [ | ✔ | | |
| [ | ✔ | | |
| [ | ✔ | ✔ | |
| [ | ✔ | | |
| COS (proposed) | ✔ | ✔ | ✔ |
Fig 2. Our proposed solution.
COS first generates MeSH term sequences from each data source. It then samples each group of generated sequences to the same number of sequences and merges the groups into one set of MeSH term sequences. Finally, COS learns the MeSH term embeddings from this merged sequence set.
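The sample-and-merge step described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "sampling to the same number" means down-sampling larger groups without replacement and up-sampling smaller groups by repeating randomly chosen sequences. The function names, the `target` size, and the toy MeSH-like sequences are hypothetical.

```python
import random

def resample(seqs, target, rng):
    """Return exactly `target` sequences drawn from `seqs`.

    Assumption: larger groups are down-sampled without replacement and
    smaller groups are up-sampled by appending random repeats.
    """
    if len(seqs) >= target:
        return rng.sample(seqs, target)
    extra = [rng.choice(seqs) for _ in range(target - len(seqs))]
    return list(seqs) + extra

def sample_and_merge(groups, target, seed=0):
    """Resample every group to `target` sequences, then merge and shuffle."""
    rng = random.Random(seed)
    merged = []
    for seqs in groups.values():
        merged.extend(resample(seqs, target, rng))
    rng.shuffle(merged)
    return merged

# Toy sequences; the terms and group sizes are made up for illustration.
groups = {
    "corpus": [["Aspirin", "Headache"], ["Insulin", "Diabetes Mellitus"], ["Aspirin", "Fever"]],
    "ontology": [["Analgesics", "Aspirin"]],
    "predications": [["Aspirin", "Pain"], ["Insulin", "Hyperglycemia"]],
}
merged = sample_and_merge(groups, target=2)
```

Each data source contributes the same number of sequences to `merged`, so no single source dominates training purely because of its size. The merged list could then be handed to any sequence-based embedding trainer.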
Important notation.
| Symbol | Definition |
|---|---|
| | a MeSH term in the MeSH vocabulary |
| | the set containing all |
| | a document in PubMed which is a set containing a variable number of MeSH terms |
| | the PubMed corpus containing all publications |
| | the set containing all the edges in the ontology DAG |
| | the ontology data source as a directed graph |
| | the set of a specific type of semantic predications between MeSH terms in SemMedDB |
| | the semantic predications data source as a directed graph |
| | the unnormalized transition probability between nodes |
| | the normalizing constant |
| | parameters of random walks interpolating between BFS and DFS |
| | the search bias parameter defined by |
| | the weight of edge |
| | the number of walks per node and the walk length in random walks |
| | a sequence of MeSH terms, generated from |
| | the set of sequences from the corpus sampled from |
| | the set of sequences from the ontology graph |
| | the set of sequences from the semantic predications graph |
| | the set of sequences after sampling and merging all sequences |
| | the dimension of the embedding vector and the context window size in the optimization |
| | the embedding mapping function, can be parameterized as a matrix of size |
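The random-walk entries in the notation (an unnormalized transition probability, a normalizing constant, two parameters interpolating between BFS and DFS, a search bias, and edge weights) match the vocabulary of node2vec-style biased walks. A minimal sketch of such a walk is given here, assuming the standard node2vec bias rule: weight 1/p for returning to the previous node, 1 for a neighbor of the previous node, and 1/q otherwise. The function names, the toy graph, and the default values of `p` and `q` are illustrative assumptions and are not taken from the paper.

```python
import random

def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One second-order random walk in the style of node2vec.

    `adj` maps node -> {successor: edge weight}. `p` is the return
    parameter and `q` the in-out parameter.
    """
    if rng is None:
        rng = random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = list(adj.get(cur, {}))
        if not nbrs:
            break  # dead end, e.g. a sink node in a directed graph
        if len(walk) == 1:
            # First step: no previous node, so use plain edge weights.
            weights = [adj[cur][x] for x in nbrs]
        else:
            prev = walk[-2]
            weights = []
            for x in nbrs:
                if x == prev:
                    bias = 1.0 / p      # step back to the previous node
                elif x in adj.get(prev, {}):
                    bias = 1.0          # stay close to the previous node
                else:
                    bias = 1.0 / q      # move further away
                weights.append(bias * adj[cur][x])
        # random.choices normalizes the weights internally.
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

def generate_walks(adj, num_walks, walk_length, p=1.0, q=1.0, seed=0):
    """Start `num_walks` walks of up to `walk_length` nodes from every node."""
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(biased_walk(adj, node, walk_length, p, q, rng))
    return walks

# Toy directed graph with made-up MeSH-like nodes and unit weights.
adj = {
    "Analgesics": {"Aspirin": 1.0, "Ibuprofen": 1.0},
    "Aspirin": {"Analgesics": 1.0},
    "Ibuprofen": {"Analgesics": 1.0},
}
walks = generate_walks(adj, num_walks=2, walk_length=4, p=0.5, q=2.0)
```

Each walk is a sequence of graph nodes, which is the same shape as a MeSH term sequence, so walks from the ontology graph and from a semantic predications graph can be pooled with the corpus sequences.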
The four semantic predications datasets and their statistics.
| Dataset | treat | interact | cause | affect |
|---|---|---|---|---|
| MeSH term count | 9,277 | 8,336 | 10,393 | 12,020 |
| Predications count | 178,406 | 260,762 | 148,378 | 261,677 |
Edge prediction results for treat.
| Data source | Method | P(%) | R(%) | F1(%) | MAP(%) | AUROC(%) | AUPRC(%) |
|---|---|---|---|---|---|---|---|
| | Jaccard | 49.83 | 49.83 | 33.26 | 50.00 | 49.83 | 25.00 |
| | preferential attachment | 52.72 | 52.72 | 39.71 | 51.40 | 52.72 | 75.50 |
| | Adamic-Adar | 48.86 | 48.86 | 33.04 | 49.90 | 48.86 | 29.56 |
| | common neighbors | 44.58 | 44.58 | 33.58 | 48.87 | 44.58 | 36.41 |
| corpus | word2vec | 83.01 | 83.01 | 83.00 | 77.98 | 83.02 | 87.55 |
| ontology | DeepWalk | 74.62 | 74.62 | 74.59 | 68.17 | 74.62 | 81.02 |
| ontology | LINE | 66.48 | 66.48 | 66.19 | 61.53 | 66.48 | 74.51 |
| ontology | Node2vec | 82.44 | 82.44 | 82.43 | 76.45 | 82.44 | 86.75 |
| ontology | SDNE | 50.76 | 50.76 | 36.19 | 50.47 | 50.76 | 67.17 |
| ontology | Struc2vec | 66.85 | 66.85 | 66.72 | 61.24 | 66.85 | 75.42 |
| semantic predications | DeepWalk | 81.09 | 81.20 | 81.09 | 77.00 | 81.20 | 86.76 |
| semantic predications | LINE | 83.13 | 83.23 | 83.13 | 79.27 | 83.23 | 88.26 |
| semantic predications | Node2vec | 81.05 | 81.06 | 81.05 | 76.20 | 81.06 | 86.38 |
| semantic predications | SDNE | 87.35 | 87.39 | 87.35 | 83.71 | 87.39 | 91.09 |
| semantic predications | Struc2vec | 85.90 | 85.87 | 85.88 | 81.31 | 85.87 | 89.73 |
| merged | DeepWalk | 81.24 | 81.23 | 81.22 | 76.33 | 81.23 | 86.49 |
| merged | LINE | 83.36 | 83.38 | 83.36 | 78.85 | 83.38 | 88.10 |
| merged | Node2vec | 84.83 | 84.80 | 84.81 | 80.15 | 84.80 | 89.00 |
| merged | SDNE | 88.35 | 88.35 | 88.34 | 84.59 | 88.35 | 91.65 |
| merged | Struc2vec | 83.98 | 83.96 | 83.97 | 79.16 | 83.96 | 88.38 |
| | COS (proposed) | | | | | | |
* denotes statistically significant difference (p < 0.001) compared to all the above baselines.
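The four non-embedding baselines in this table (Jaccard, preferential attachment, Adamic-Adar, common neighbors) are classical neighborhood-based link prediction scores. A minimal sketch of their textbook definitions on an undirected graph follows. The graph they were computed on and the way scores were turned into edge predictions are not stated in this record, so the toy graph and the function name are assumptions.

```python
import math

def neighbor_scores(adj, u, v):
    """Classical link prediction scores for the node pair (u, v).

    `adj` maps each node to the set of its neighbors (undirected graph).
    """
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Shared neighbors with few links count more than hubs.
        # Degree-1 nodes are skipped because log(1) = 0.
        "adamic_adar": sum(
            1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1
        ),
        "preferential_attachment": len(nu) * len(nv),
    }

# Toy undirected graph with placeholder node names.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
scores = neighbor_scores(adj, "a", "b")
```

All four scores depend only on the immediate neighborhoods of the two nodes, which is the main contrast with the embedding-based methods listed in the rest of the table.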
Edge prediction results for affect.
| Data source | Method | P(%) | R(%) | F1(%) | MAP(%) | AUROC(%) | AUPRC(%) |
|---|---|---|---|---|---|---|---|
| | Jaccard | 49.96 | 49.96 | 33.32 | 50.00 | 49.96 | 25.00 |
| | preferential attachment | 53.13 | 53.13 | 40.52 | 51.62 | 53.13 | 75.61 |
| | Adamic-Adar | 75.84 | 75.84 | 74.67 | 74.64 | 75.84 | 86.26 |
| | common neighbors | 80.58 | 80.58 | 80.57 | 74.84 | 80.58 | 85.50 |
| corpus | word2vec | 83.96 | 83.97 | 83.96 | 79.05 | 83.97 | 88.24 |
| ontology | DeepWalk | 76.24 | 76.24 | 76.23 | 70.10 | 76.24 | 82.21 |
| ontology | LINE | 71.08 | 71.08 | 70.90 | 65.79 | 71.08 | 78.33 |
| ontology | Node2vec | 83.25 | 83.25 | 83.24 | 77.34 | 83.25 | 87.35 |
| ontology | SDNE | 52.26 | 52.26 | 39.54 | 51.73 | 52.26 | 67.53 |
| ontology | Struc2vec | 71.71 | 71.71 | 71.65 | 65.91 | 71.71 | 78.77 |
| semantic predications | DeepWalk | 76.24 | 76.24 | 76.23 | 70.10 | 76.24 | 82.21 |
| semantic predications | LINE | 71.08 | 71.08 | 70.90 | 65.79 | 71.08 | 78.33 |
| semantic predications | Node2vec | 83.25 | 83.25 | 83.24 | 77.34 | 83.25 | 87.35 |
| semantic predications | SDNE | 52.26 | 52.26 | 39.54 | 51.73 | 52.26 | 67.53 |
| semantic predications | Struc2vec | 71.71 | 71.71 | 71.65 | 65.91 | 71.71 | 78.77 |
| merged | DeepWalk | 83.64 | 83.61 | 83.61 | 78.89 | 83.61 | 88.20 |
| merged | LINE | 85.52 | 85.48 | 85.49 | 80.96 | 85.48 | 89.52 |
| merged | Node2vec | 85.66 | 85.64 | 85.64 | 81.27 | 85.64 | 89.68 |
| merged | SDNE | 88.76 | 88.74 | 88.75 | 84.93 | 88.74 | 91.88 |
| merged | Struc2vec | 86.05 | 86.00 | 86.03 | 81.46 | 86.00 | 89.85 |
| | COS (proposed) | | | | | | |
* denotes statistically significant difference (p < 0.001) compared to all the above baselines.
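For readers less familiar with the column headers, the sketch below computes precision (P), recall (R), F1, and AUROC for a binary edge prediction task from their standard definitions. The averaging scheme, decision threshold, and negative-sampling procedure behind the reported numbers are not given in this record, so the 0.5 threshold and the toy labels and scores are assumptions for illustration only.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties count as half a win. Requires at least one positive and one
    negative example.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg
    )
    return wins / (len(pos) * len(neg))

# Toy example: 1 = true edge, 0 = non-edge; scores are made up.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
p, r, f1 = precision_recall_f1(y_true, y_pred)
auc = auroc(y_true, scores)
```

Precision, recall, and F1 depend on the chosen threshold, whereas AUROC is computed from the ranking of the scores alone.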
The edge prediction results of COS using different sampling strategies.
| Relation | Sampling | P(%) | R(%) | F1(%) | MAP(%) | AUROC(%) | AUPRC(%) |
|---|---|---|---|---|---|---|---|
| treat | no | 90.07 | 90.07 | 90.07 | 86.19 | 90.07 | 92.61 |
| treat | up&down | 90.44 | 90.44 | 90.44 | 87.04 | 90.44 | 93.06 |
| treat | up | | | | | | |
| interact | no | 89.68 | 89.68 | 89.68 | 85.45 | 89.68 | 92.21 |
| interact | up&down | | | | | | |
| interact | up | 90.13 | 90.13 | 90.13 | 86.34 | 90.13 | 92.69 |
| cause | no | 89.17 | 89.17 | 89.17 | 84.98 | 89.17 | 91.92 |
| cause | up&down | | | | | | |
| cause | up | 90.09 | 90.09 | 90.08 | 86.55 | 90.09 | 92.79 |
| affect | no | 90.42 | 90.42 | 90.41 | 86.63 | 90.42 | 92.86 |
| affect | up&down | 91.36 | 91.36 | 91.36 | 88.09 | 91.36 | 93.67 |
| affect | up | | | | | | |
* denotes statistically significant difference (p < 0.001) compared to no sampling.
† denotes up’s statistically significant difference (p < 0.001) compared to up&down.
The edge prediction results of COS and COS without using the semantic predications data source.
| Relation | Data Source | P(%) | R(%) | F1(%) | MAP(%) | AUROC(%) | AUPRC(%) |
|---|---|---|---|---|---|---|---|
| treat | C&O | 90.37 | 90.37 | 90.37 | 86.45 | 90.37 | 92.76 |
| treat | C&O&S | | | | | | |
| interact | C&O | 88.93 | 88.93 | 88.92 | 84.44 | 88.93 | 91.63 |
| interact | C&O&S | | | | | | |
| cause | C&O | 88.79 | 88.79 | 88.78 | 84.44 | 88.79 | 91.61 |
| cause | C&O&S | | | | | | |
| affect | C&O | 90.59 | 90.59 | 90.59 | 86.59 | 90.59 | 92.86 |
| affect | C&O&S | | | | | | |
* denotes statistically significant difference (p < 0.001) compared to C&O.
Edge prediction results for interact.
| Data source | Method | P(%) | R(%) | F1(%) | MAP(%) | AUROC(%) | AUPRC(%) |
|---|---|---|---|---|---|---|---|
| | Jaccard | 49.97 | 49.97 | 33.33 | 50.00 | 49.97 | 29.17 |
| | preferential attachment | 52.44 | 52.44 | 38.91 | 51.25 | 52.44 | 75.51 |
| | Adamic-Adar | 88.51 | 88.51 | 88.47 | 86.00 | 88.51 | 92.44 |
| | common neighbors | 86.53 | 86.53 | 86.42 | 79.63 | 86.53 | 89.36 |
| corpus | word2vec | 85.65 | 85.65 | 85.64 | 80.65 | 85.65 | 89.30 |
| ontology | DeepWalk | 72.60 | 72.60 | 72.57 | 66.43 | 72.60 | 79.48 |
| ontology | LINE | 69.55 | 69.55 | 69.31 | 64.36 | 69.55 | 77.09 |
| ontology | Node2vec | 78.59 | 78.59 | 78.54 | 72.09 | 78.59 | 83.96 |
| ontology | SDNE | 52.16 | 52.16 | 41.39 | 51.41 | 52.16 | 59.89 |
| ontology | Struc2vec | 70.54 | 70.54 | 70.45 | 64.84 | 70.53 | 77.88 |
| semantic predications | DeepWalk | 84.55 | 84.59 | 84.55 | 80.45 | 84.59 | 89.07 |
| semantic predications | LINE | 84.37 | 84.47 | 84.37 | 80.75 | 84.47 | 89.20 |
| semantic predications | Node2vec | 82.94 | 82.93 | 82.92 | 78.00 | 82.93 | 87.63 |
| semantic predications | SDNE | 87.24 | 87.28 | 87.24 | 83.51 | 87.28 | 90.97 |
| semantic predications | Struc2vec | 86.04 | 86.02 | 86.02 | 81.50 | 86.02 | 89.83 |
| merged | DeepWalk | 84.69 | 84.67 | 84.67 | 79.91 | 84.67 | 88.85 |
| merged | LINE | 85.32 | 85.31 | 85.31 | 80.74 | 85.31 | 89.35 |
| merged | Node2vec | 86.34 | 86.30 | 86.32 | 81.69 | 86.30 | 89.98 |
| merged | SDNE | 88.62 | 88.61 | 88.61 | 84.64 | 88.61 | 91.71 |
| merged | Struc2vec | 85.42 | 85.40 | 85.40 | 80.74 | 85.40 | 89.37 |
| | COS (proposed) | | | | | | |
* denotes statistically significant difference (p < 0.001) compared to all the above baselines.
Edge prediction results for cause.
| Data source | Method | P(%) | R(%) | F1(%) | MAP(%) | AUROC(%) | AUPRC(%) |
|---|---|---|---|---|---|---|---|
| | Jaccard | 49.96 | 49.96 | 33.32 | 50.00 | 49.96 | 25.00 |
| | preferential attachment | 52.95 | 52.95 | 40.58 | 51.52 | 52.95 | 75.42 |
| | Adamic-Adar | 69.68 | 69.68 | 66.88 | 69.10 | 69.68 | 83.68 |
| | common neighbors | 79.56 | 79.56 | 79.45 | 74.95 | 79.56 | 85.33 |
| corpus | word2vec | 83.23 | 83.24 | 83.23 | 78.22 | 83.24 | 87.70 |
| ontology | DeepWalk | 75.29 | 75.29 | 75.27 | 68.94 | 75.29 | 81.50 |
| ontology | LINE | 68.58 | 68.58 | 68.45 | 63.23 | 68.58 | 76.23 |
| ontology | Node2vec | 81.64 | 81.64 | 81.60 | 75.15 | 81.64 | 86.10 |
| ontology | SDNE | 50.80 | 50.80 | 36.32 | 50.50 | 50.80 | 64.63 |
| ontology | Struc2vec | 69.27 | 69.27 | 68.97 | 64.07 | 69.27 | 77.01 |
| semantic predications | DeepWalk | 81.57 | 81.75 | 81.55 | 78.20 | 81.75 | 87.50 |
| semantic predications | LINE | 82.67 | 82.74 | 82.65 | 78.54 | 82.74 | 87.86 |
| semantic predications | Node2vec | 79.21 | 79.05 | 79.09 | 73.34 | 79.05 | 84.89 |
| semantic predications | SDNE | 87.82 | 87.81 | 87.81 | 83.81 | 87.81 | 91.21 |
| semantic predications | Struc2vec | 87.62 | 87.59 | 87.61 | 83.44 | 87.59 | 91.01 |
| merged | DeepWalk | 82.17 | 82.15 | 82.15 | 77.36 | 82.15 | 87.18 |
| merged | LINE | 84.06 | 84.05 | 84.05 | 79.48 | 84.05 | 88.54 |
| merged | Node2vec | 84.62 | 84.62 | 84.61 | 80.18 | 84.62 | 88.97 |
| merged | SDNE | 88.44 | 88.44 | 88.43 | 84.79 | 88.44 | 91.76 |
| merged | Struc2vec | 85.13 | 85.11 | 85.11 | 80.60 | 85.11 | 89.27 |
| | COS (proposed) | | | | | | |
* denotes statistically significant difference (p < 0.001) compared to all the above baselines.