Meijing Li, Xianhe Zhou, Keun Ho Ryu, Nipon Theera-Umpon.
Abstract
As the volume of published biomedical literature grows, fast and effective retrieval of literature on the sequence, structure, and function of biological entities has become essential to progress in biology and medicine. To capture the semantic information in biomedical literature more effectively when biomedical documents are clustered, we propose a new multi-evidence-based method for calculating semantic text similarity. The method uses two semantic similarities, a MeSH-based semantic similarity and a word embedding-based semantic similarity, together with one content similarity. After the three similarities between biomedical documents are calculated separately, a feedforward neural network integrates the two semantic similarities. A weighted linear combination then integrates the resulting semantic similarity with the content similarity. In the evaluation, the proposed method outperforms the existing related methods it is compared with. Based on these results, the method can be used not only in practical biological or medical experiments, such as protein sequence or function analysis, but also in biological and medical research more broadly, where it helps to provide, use, and understand thematically consistent documents.
Year: 2022 PMID: 36065380 PMCID: PMC9440839 DOI: 10.1155/2022/8238432
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.809
Figure 1. Workflow of the proposed method.
The details of two embedded models.
| Model | Dimension | Vocabulary size | Corpus |
|---|---|---|---|
| Wiki_W2V | 300 | 3,000,000 | Wikipedia |
| MEDLINE_W2V | 300 | 2,000,000 | MEDLINE |
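As a rough illustration of how pretrained word vectors such as those listed above can yield a document-level similarity, the sketch below averages the vectors of a document's tokens and compares two documents by cosine similarity. Averaging is a common baseline and is an assumption here, not necessarily the paper's ASV calculation; the three-dimensional `TOY_EMBEDDINGS` table and the function names are hypothetical stand-ins for a real 300-dimensional model.

```python
import math

# Hypothetical toy vectors; a real model (e.g. 300-dimensional Word2Vec) would be loaded instead.
TOY_EMBEDDINGS = {
    "gene": [1.0, 0.0, 0.0],
    "dna": [0.8, 0.6, 0.0],
    "tumor": [0.0, 1.0, 0.0],
    "cell": [0.0, 0.0, 1.0],
}


def doc_vector(tokens, embeddings=TOY_EMBEDDINGS):
    """Average the vectors of all in-vocabulary tokens (out-of-vocabulary tokens are skipped)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token of the document is in the vocabulary")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def doc_similarity(tokens_a, tokens_b):
    """Word embedding-based similarity of two tokenized documents."""
    return cosine(doc_vector(tokens_a), doc_vector(tokens_b))
```

With a real model, only the embedding lookup changes; the averaging and cosine steps stay the same.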
Algorithm 1. Algorithm of ASV calculation.
A MEDLINE document and its related MeSH terms.
| PMID | MeSH term |
|---|---|
| 12625756 | Animals |
| | DNA |
| | Drug delivery system |
| | Electroporation |
| | Gene transfer techniques |
| | Humans |
| | Neoplasms |
Two MeSH terms and their tree numbers.
| MeSH term | MeSH tree number |
|---|---|
| Melanosomes | A11.284.430.214.190.500.560 |
| | A11.284.430.214.190.875.190.190.560 |
| | A11.409.750.560 |
| | A11.436.265.531.560 |
| | A11.436.613.560 |
| Sarcomeres | A10.690.552.875.700 |
| | A11.284.430.214.190.875.820 |
| | A11.620.249.850.700 |
| | A11.620.500.500.700 |
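To show how tree numbers like these can drive a path-based measure, here is a minimal sketch of a Wu-Palmer-style similarity (WP is one of the measures named in the result tables). Several choices are assumptions made for illustration and may differ from the paper: depth is counted as the number of dot-separated components, the common ancestor is the shared prefix, and a term with several tree numbers is scored by its best-matching pair.

```python
def wu_palmer(tree_a: str, tree_b: str) -> float:
    """Wu-Palmer-style score 2*depth(common prefix) / (depth(a) + depth(b))."""
    parts_a, parts_b = tree_a.split("."), tree_b.split(".")
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    return 2 * common / (len(parts_a) + len(parts_b))


def term_similarity(trees_a, trees_b) -> float:
    """Best score over all tree-number pairs of two MeSH terms (an assumed aggregation)."""
    return max(wu_palmer(a, b) for a in trees_a for b in trees_b)


# Tree numbers copied from the table above.
MELANOSOMES = [
    "A11.284.430.214.190.500.560",
    "A11.284.430.214.190.875.190.190.560",
    "A11.409.750.560",
    "A11.436.265.531.560",
    "A11.436.613.560",
]
SARCOMERES = [
    "A10.690.552.875.700",
    "A11.284.430.214.190.875.820",
    "A11.620.249.850.700",
    "A11.620.500.500.700",
]

print(term_similarity(MELANOSOMES, SARCOMERES))
```

Under these assumptions the best pair is `A11.284.430.214.190.875.190.190.560` and `A11.284.430.214.190.875.820`, which share a six-component prefix. Information-content measures such as Lin or JC would additionally need term frequencies from a corpus, which this sketch does not cover.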
Figure 2. Illustration of FNN_sem.
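The abstract states that a feedforward neural network integrates the MeSH-based and word embedding-based similarities. The sketch below shows only the forward pass of such a network with two inputs, one hidden layer, and a sigmoid output. The layer sizes, the activation, and every weight value are hypothetical and untrained; this record does not give the paper's actual architecture or training procedure.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fnn_sem(sim_mesh, sim_we, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: two similarity inputs -> hidden layer -> one fused similarity in (0, 1)."""
    inputs = (sim_mesh, sim_we)
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_w, hidden_b)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)


# Hypothetical, untrained parameters for a 2-3-1 network (illustration only).
HIDDEN_W = [[1.5, 0.5], [0.5, 1.5], [1.0, 1.0]]
HIDDEN_B = [-1.0, -1.0, -1.0]
OUT_W = [1.0, 1.0, 1.0]
OUT_B = -1.5

fused = fnn_sem(0.8, 0.6, HIDDEN_W, HIDDEN_B, OUT_W, OUT_B)
```

In practice the parameters would be learned from labeled document pairs rather than set by hand.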
Description of the data set.
| | cluster_num | doc_num_cluster | doc_num_data |
|---|---|---|---|
| Min | 3 | 10 | 84 |
| Max | 12 | 385 | 1541 |
| Average | 6.9 | 88.4 | 609.4 |
Note: cluster_num represents the number of clusters in the 100 data sets, doc_num_cluster represents the number of documents contained in each cluster, and doc_num_data represents the number of documents contained in each data set.
Metrics (average ± SD) of all 100 data sets for spectral clustering based on word embedding.
| Method | Purity | ARI | NMI | FMI |
|---|---|---|---|---|
| WE_M | 0.810 ± 0.081 | 0.499 ± 0.185 | 0.600 ± 0.128 | 0.634 ± 0.149 |
| WE_W | 0.705 ± 0.093 | 0.306 ± 0.112 | 0.412 ± 0.114 | 0.493 ± 0.144 |
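For readers unfamiliar with the reported metrics, the following sketch computes two of them, Purity and the adjusted Rand index (ARI), from a list of true labels and a list of predicted cluster labels using their standard definitions. The function names are illustrative; library implementations (for example in scikit-learn) would normally be used for ARI, NMI, and FMI.

```python
from collections import Counter
from math import comb


def purity(true_labels, pred_labels):
    """Fraction of documents that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(true_labels)


def adjusted_rand_index(true_labels, pred_labels):
    """Rand index corrected for chance, computed from pair counts."""
    n = len(true_labels)
    if n < 2:
        return 1.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
print(purity(truth, predicted), adjusted_rand_index(truth, predicted))
```

Purity rewards clusters dominated by one class but does not penalize over-splitting, which is one reason it is usually reported alongside chance-corrected scores such as ARI.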
Metrics (average ± SD) of all 100 data sets for spectral clustering based on MeSH semantic similarity, where JC_i denotes JC with λ = i.
| Method | Purity | ARI | NMI | FMI |
|---|---|---|---|---|
| LC | 0.788 ± 0.085 | 0.453 ± 0.156 | 0.536 ± 0.112 | 0.599 ± 0.139 |
| WP | 0.794 ± 0.088 | 0.472 ± 0.167 | 0.554 ± 0.117 | 0.613 ± 0.145 |
| Lin | 0.830 ± 0.083 | 0.570 ± 0.184 | 0.629 ± 0.130 | 0.683 ± 0.149 |
| JC_1 | | | | |
| JC_2 | 0.824 ± 0.081 | 0.561 ± 0.178 | 0.617 ± 0.128 | 0.681 ± 0.141 |
| JC_3 | 0.821 ± 0.077 | 0.545 ± 0.168 | 0.605 ± 0.118 | 0.669 ± 0.136 |
| JC_4 | 0.812 ± 0.080 | 0.513 ± 0.163 | 0.584 ± 0.117 | 0.645 ± 0.135 |
| JC_5 | 0.801 ± 0.082 | 0.492 ± 0.153 | 0.563 ± 0.110 | 0.629 ± 0.134 |
Figure 3. Mean and standard deviation of evaluation metrics. (a) Clustering results based on two word embedding-based similarities; (b) clustering results based on three different similarities.
Figure 4. Statistical significance test of cluster results. (a) Purity scores; (b) ARI scores; (c) NMI scores; (d) FMI scores.
Comparison of metrics (average ± SD) for spectral clustering in all 100 data sets based on the similarity of different features.
| Method | Purity | ARI | NMI | FMI |
|---|---|---|---|---|
| WE_M | 0.810 ± 0.081 | 0.499 ± 0.185 | 0.600 ± 0.128 | 0.634 ± 0.149 |
| JC_1 | | 0.668 ± 0.181 | 0.670 ± 0.135 | |
| Con | 0.813 ± 0.087 | | | 0.724 ± 0.162 |
Metrics (average ± SD) of all 100 data sets for spectral clustering based on 16 integrated semantic similarities.
| Method | Purity (WE_M) | Purity (WE_W) | ARI (WE_M) | ARI (WE_W) | NMI (WE_M) | NMI (WE_W) | FMI (WE_M) | FMI (WE_W) |
|---|---|---|---|---|---|---|---|---|
| LC | 0.859 ± 0.079 | 0.834 ± 0.081 | 0.622 ± 0.193 | 0.572 ± 0.189 | 0.702 ± 0.127 | 0.642 ± 0.131 | 0.727 ± 0.145 | 0.692 ± 0.147 |
| WP | 0.844 ± 0.091 | 0.808 ± 0.092 | 0.612 ± 0.197 | 0.529 ± 0.168 | 0.690 ± 0.129 | 0.606 ± 0.126 | 0.723 ± 0.149 | 0.665 ± 0.149 |
| Lin | 0.860 ± 0.096 | 0.849 ± 0.111 | 0.668 ± 0.209 | 0.672 ± 0.217 | 0.725 ± 0.137 | 0.708 ± 0.150 | 0.765 ± 0.153 | 0.770 ± 0.157 |
| JC_1 | | 0.868 ± 0.097 | | 0.706 ± 0.197 | | 0.734 ± 0.143 | | 0.794 ± 0.141 |
| JC_2 | 0.868 ± 0.096 | 0.846 ± 0.105 | 0.686 ± 0.209 | 0.643 ± 0.213 | 0.736 ± 0.138 | 0.684 ± 0.147 | 0.776 ± 0.154 | 0.745 ± 0.160 |
| JC_3 | 0.863 ± 0.087 | 0.853 ± 0.097 | 0.662 ± 0.199 | 0.651 ± 0.208 | 0.720 ± 0.131 | 0.695 ± 0.144 | 0.756 ± 0.150 | 0.749 ± 0.159 |
| JC_4 | 0.854 ± 0.085 | 0.843 ± 0.088 | 0.635 ± 0.196 | 0.642 ± 0.188 | 0.700 ± 0.128 | 0.692 ± 0.117 | 0.739 ± 0.146 | 0.724 ± 0.135 |
| JC_5 | 0.850 ± 0.084 | 0.818 ± 0.079 | 0.623 ± 0.196 | 0.549 ± 0.179 | 0.691 ± 0.129 | 0.614 ± 0.122 | 0.730 ± 0.145 | 0.676 ± 0.142 |
Metrics (average ± SD) of all 100 data sets for spectral clustering based on the final similarity for different values of w.
| w | Purity | ARI | NMI | FMI |
|---|---|---|---|---|
| 0.1 | 0.904 ± 0.067 | 0.749 ± 0.177 | 0.790 ± 0.117 | 0.820 ± 0.131 |
| 0.2 | 0.903 ± 0.075 | 0.743 ± 0.188 | 0.793 ± 0.121 | 0.815 ± 0.137 |
| 0.3 | 0.913 ± 0.063 | 0.760 ± 0.179 | 0.806 ± 0.116 | 0.826 ± 0.131 |
| 0.4 | 0.915 ± 0.068 | 0.757 ± 0.184 | 0.814 ± 0.115 | 0.822 ± 0.137 |
| 0.5 | 0.920 ± 0.067 | 0.767 ± 0.183 | 0.825 ± 0.112 | 0.830 ± 0.137 |
| 0.6 | 0.927 ± 0.064 | 0.785 ± 0.176 | 0.840 ± 0.107 | 0.843 ± 0.131 |
| 0.7 | | | | |
| 0.8 | 0.941 ± 0.057 | 0.809 ± 0.174 | 0.869 ± 0.100 | 0.860 ± 0.130 |
| 0.9 | 0.933 ± 0.063 | 0.791 ± 0.180 | 0.852 ± 0.106 | 0.866 ± 0.133 |
Figure 5. Metrics for spectral clustering based on the final similarity for different values of w. (a) Average value of cluster metrics; (b) standard deviation of cluster metrics.
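The weighted linear combination of the semantic and content similarities, controlled by w, can be sketched as follows. This record does not say which term w multiplies, so the form `w * semantic + (1 - w) * content` is an assumption, as are the function name and the toy matrices.

```python
def combine_similarities(semantic, content, w):
    """Element-wise w * semantic + (1 - w) * content for two equally sized similarity matrices."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [
        [w * s + (1.0 - w) * c for s, c in zip(sem_row, con_row)]
        for sem_row, con_row in zip(semantic, content)
    ]


# Toy 2x2 document-by-document similarity matrices (made-up values).
semantic = [[1.0, 0.5], [0.5, 1.0]]
content = [[1.0, 0.1], [0.1, 1.0]]

final = combine_similarities(semantic, content, w=0.7)
```

The combined matrix stays symmetric with a unit diagonal whenever both inputs are, so it can be passed directly to a clustering routine that accepts a precomputed affinity matrix.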
Comparison of metrics (average ± SD) between the similarities proposed in this paper and the method of Zhu et al. in spectral clustering.
| Method | Purity | ARI | NMI | FMI |
|---|---|---|---|---|
| Proposed method | | | | |
| Zhu et al. | 0.924 ± 0.048 | 0.797 ± 0.179 | 0.866 ± 0.104 | 0.805 ± 0.136 |