Meijing Li, Tianjie Chen, Keun Ho Ryu, Cheng Hao Jin.
Abstract
Semantic mining of big biomedical text data remains a challenge. Ontologies have been widely validated and used to extract semantic information. However, ontology-based semantic similarity calculation is so computationally expensive that it cannot scale to big text data. To solve this problem, we propose a parallelized semantic similarity measurement method for big text data based on Hadoop MapReduce. First, we preprocess the documents and extract their semantic features. Then, we calculate document semantic similarity based on the ontology network structure under the MapReduce framework. Finally, document clusters are generated from the resulting semantic similarities via clustering algorithms. To validate the effectiveness, we use two kinds of open datasets. The experimental results show that the traditional method can hardly handle more than ten thousand biomedical documents, whereas the proposed method remains efficient and accurate on large datasets and offers high parallelism and scalability.
Year: 2021 | PMID: 34795792 | PMCID: PMC8594978 | DOI: 10.1155/2021/7937573
Source DB: PubMed | Journal: Comput Math Methods Med | ISSN: 1748-670X | Impact factor: 2.238
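The abstract describes a three-stage workflow: semantic feature extraction, ontology-based document similarity computation, and clustering. The sketch below mirrors those stages on a single machine in Python; all function names (`extract_mesh_headings`, `term_similarity`, `document_similarity`) are hypothetical placeholders, the trivial term similarity is a stand-in for the ontology-based measures used in the paper, and the actual similarity stage runs as a Hadoop MapReduce job rather than a local loop.

```python
# Minimal single-machine sketch of the three-stage workflow described in the
# abstract: (1) extract MeSH-based semantic features, (2) build a document
# similarity matrix from an ontology-based measure, (3) cluster documents.
# All names are illustrative placeholders, not the authors' implementation.
import numpy as np
from sklearn.cluster import SpectralClustering

def extract_mesh_headings(document):
    """Placeholder: return the set of MeSH headings annotated on a document."""
    return set(document.get("mesh_headings", []))

def term_similarity(h1, h2):
    """Placeholder for an ontology-based term similarity (e.g. Wu-Palmer)."""
    return 1.0 if h1 == h2 else 0.0  # trivial stand-in

def document_similarity(headings_a, headings_b):
    """Average best-match similarity between two sets of MeSH headings."""
    if not headings_a or not headings_b:
        return 0.0
    best_ab = [max(term_similarity(a, b) for b in headings_b) for a in headings_a]
    best_ba = [max(term_similarity(b, a) for a in headings_a) for b in headings_b]
    return (sum(best_ab) + sum(best_ba)) / (len(best_ab) + len(best_ba))

def cluster_documents(documents, n_clusters):
    """Build the pairwise similarity matrix and cluster on it."""
    features = [extract_mesh_headings(d) for d in documents]
    n = len(features)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = document_similarity(features[i], features[j])
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(sim)
```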
An example of a biomedical document (PMID: 10496010) with its corresponding MeSH headings and tree numbers.
| MeSH heading | Tree number |
|---|---|
| DNA repair | G02.111.222; G05.219 |
| Genetic diseases, inborn | C16.320 |
| Humans | B01.050.150.900.649.313.988.400.112.400.400 |
Figure 1The workflow of the proposed method.
Figure 2An example of MapReduce-based data transformation and semantic similarity calculation.
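Figure 2 depicts how the pairwise similarity computation is decomposed into map, shuffle, and reduce steps. As a rough in-process illustration of that decomposition (not the authors' Hadoop job), a mapper can emit one record per document pair and a reducer can compute one similarity per pair; the Jaccard measure here is only a stand-in for the ontology-based measures.

```python
# Toy in-process simulation of the MapReduce decomposition sketched in Figure 2:
# the map phase emits (pair-key, features) records for every document pair, the
# shuffle groups records by key, and the reduce phase computes one similarity
# per pair. Illustrative only; the paper runs this on a Hadoop cluster.
from itertools import combinations

def map_phase(doc_features):
    """Emit ((doc_i, doc_j), (features_i, features_j)) for every document pair."""
    for (i, fi), (j, fj) in combinations(doc_features.items(), 2):
        yield (i, j), (fi, fj)

def reduce_phase(grouped, similarity):
    """Compute one similarity value per document pair."""
    return {pair: similarity(fi, fj) for pair, (fi, fj) in grouped.items()}

def run_job(doc_features, similarity):
    # "Shuffle": group mapper output by key (trivial here, one record per key).
    grouped = dict(map_phase(doc_features))
    return reduce_phase(grouped, similarity)

# Usage with a trivial Jaccard stand-in for the ontology-based measure.
docs = {"d1": {"Humans", "DNA repair"}, "d2": {"Humans"}, "d3": {"DNA repair"}}
jaccard = lambda a, b: len(a & b) / len(a | b)
print(run_job(docs, jaccard))
# {('d1', 'd2'): 0.5, ('d1', 'd3'): 0.5, ('d2', 'd3'): 0.0}
```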
Summary of the dataset SL.
| | Documents | Classes | Unique MeSH headings | Total MeSH headings |
|---|---|---|---|---|
| Min | 51 | 3 | 387 | 1619 |
| Max | 1619 | 12 | 2502 | 25631 |
| Mean | 689 | 7.5 | 1458 | 2502 |
Summary of the dataset LUs.
| | Documents | Unique MeSH headings | Total MeSH headings |
|---|---|---|---|
| Min | 10000 | 14499 | 123347 |
| Max | 60000 | 25742 | 731089 |
| Mean | 35000 | 21540 | 427731 |
Figure 3The result of MapReduce job optimization.
Computation time (minutes) of the traditional and proposed methods with different semantic measures ("/" indicates the method could not complete).
| Method | SL: SP | SL: WP | SL: LC | SL: Res | SL: Lin | SL: Sch | LUs-10000: SP | LUs-10000: WP | LUs-10000: LC | LUs-10000: Res | LUs-10000: Lin | LUs-10000: Sch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Traditional | 68.9 | 66.35 | 67.9 | 61.15 | 64.8 | 66.25 | / | / | / | / | / | / |
| Proposed | 2.87 | 1.02 | 1.42 | 1.15 | 0.95 | 1.17 | 10.21 | 3.43 | 5.02 | 2.31 | 3.30 | 3.77 |
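The measure abbreviations (SP, WP, LC, Res, Lin, Sch) correspond to standard ontology-based similarity measures: shortest path, Wu-Palmer, Leacock-Chodorow, Resnik, Lin, and Schlicker. Their conventional definitions are listed below for reference, where len(c1, c2) is the shortest-path length between two concepts, depth(c) is the depth in the MeSH tree, D is the maximum tree depth, lcs is the lowest common subsumer, and IC(c) = -log p(c) is the information content; the paper's exact normalizations may differ.

```latex
\begin{align*}
\mathrm{sim}_{SP}(c_1,c_2)  &= \frac{1}{1+\mathrm{len}(c_1,c_2)} \\
\mathrm{sim}_{WP}(c_1,c_2)  &= \frac{2\,\mathrm{depth}(\mathrm{lcs})}{\mathrm{depth}(c_1)+\mathrm{depth}(c_2)} \\
\mathrm{sim}_{LC}(c_1,c_2)  &= -\log\frac{\mathrm{len}(c_1,c_2)}{2D} \\
\mathrm{sim}_{Res}(c_1,c_2) &= \mathrm{IC}(\mathrm{lcs}) \\
\mathrm{sim}_{Lin}(c_1,c_2) &= \frac{2\,\mathrm{IC}(\mathrm{lcs})}{\mathrm{IC}(c_1)+\mathrm{IC}(c_2)} \\
\mathrm{sim}_{Sch}(c_1,c_2) &= \frac{2\,\mathrm{IC}(\mathrm{lcs})}{\mathrm{IC}(c_1)+\mathrm{IC}(c_2)}\bigl(1-p(\mathrm{lcs})\bigr)
\end{align*}
```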
NMI (average ± standard deviation) on dataset SL with different clustering algorithms and semantic measures.
| Cluster algorithm | SP | WP | LC | Resnik | Lin | Sch |
|---|---|---|---|---|---|---|
| Spectral clustering | 0.579 ± 0.126 | 0.549 ± 0.132 | 0.574 ± 0.130 | 0.647 ± 0.124 | 0.617 ± 0.111 | 0.527 ± 0.124 |
| | 0.490 ± 0.140 | 0.495 ± 0.131 | 0.491 ± 0.134 | 0.511 ± 0.176 | 0.526 ± 0.158 | 0.520 ± 0.123 |
| Agglomerative clustering | 0.524 ± 0.123 | 0.551 ± 0.145 | 0.533 ± 0.124 | 0.591 ± 0.116 | 0.582 ± 0.135 | 0.523 ± 0.126 |
| Zhu | / | 0.568 ± 0.165 | 0.565 ± 0.169 | / | 0.620 ± 0.161 | / |
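NMI compares the generated clusters against the known class labels of dataset SL. A minimal sketch of this evaluation on a precomputed similarity matrix is shown below; the random matrix and labels are hypothetical placeholders for the paper's document similarities and ground-truth classes.

```python
# Evaluation sketch: cluster a precomputed similarity matrix and score the
# result against ground-truth classes with NMI, as reported in the table above.
# `sim` and `labels_true` are synthetic placeholders, not the paper's data.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
sim = rng.random((100, 100))
sim = (sim + sim.T) / 2          # symmetrize the similarity matrix
np.fill_diagonal(sim, 1.0)       # self-similarity is maximal
labels_true = rng.integers(0, 5, size=100)

labels_pred = SpectralClustering(n_clusters=5, affinity="precomputed",
                                 random_state=0).fit_predict(sim)
print(normalized_mutual_info_score(labels_true, labels_pred))
```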
Figure 4The trend of speedup and computation time with increasing cluster nodes.
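Assuming the usual definition, the speedup in Figure 4 relates the single-node running time to the running time on n nodes:

```latex
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}
```

where T(n) is the computation time on n cluster nodes and E(n) is the corresponding parallel efficiency.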
Figure 5Computation time of map, shuffle, and reduce on dataset LUs.
Configuration of computers.
| Configuration | Details |
|---|---|
| OS | CentOS 7 |
| CPU | Intel Core i5-6500, 3.2 GHz |
| HDD | 1 TB, 7200 rpm |
| RAM | 8 GB |
| Hadoop | Hadoop 3.1.3 |
| JDK | 1.8.0_252 |