Junaid Rashid, Jungeun Kim, Amir Hussain, Usman Naseem, Sapna Juneja.
Abstract
BACKGROUND: Text mining in the biomedical field has received much attention and is regarded as an important research area because much biomedical data is in text format. Topic modeling is one of the most popular text mining techniques for discovering hidden semantic structures, so-called topics. However, discovering topics from biomedical data is challenging because of its sparsity, redundancy, and unstructured format.
Keywords: Classification; Clustering; MKFTM; Medical data; Multiple kernel fuzzy topic modeling; Topic modeling
Year: 2022 PMID: 35820793 PMCID: PMC9277941 DOI: 10.1186/s12859-022-04780-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Dataset statistics
| Dataset | Documents (preprocessed) | Words | Unique words |
|---|---|---|---|
| MuchMore Springer | 1527 | 19,835 | 5008 |
| Ohsumed | 2092 | 22,669 | 13,238 |
| Genia | 2000 | 21,560 | 17,834 |
| Biotext | 40 | 25,921 | 10,267 |
| | 58,927 | 395,636 | 25,309 |
| WSJ | 1300 | 680 K | 36 K |
Classification results (MuchMore Springer bilingual corpus)
| Method | Accuracy (%) | Precision | Recall | F1-Score | K |
|---|---|---|---|---|---|
| LSA | 57.65 | 0.6667 | 0.7221 | 0.6933 | 50 |
| LDA | 60.95 | 0.6938 | 0.7356 | 0.7141 | 50 |
| FKLSA(Entropy) | 97.66 | 0.955 | 0.9554 | 0.977 | 50 |
| FKLSA(IDF) | 95.90 | 0.937 | 0.935 | 0.959 | 50 |
| FKLSA(Normal) | 91.22 | 0.890 | 0.894 | 0.912 | 50 |
| FKLSA(ProbIDF) | 97.66 | 0.954 | 0.953 | 0.977 | 50 |
| FKTM | 98.29 | 0.9880 | 0.9883 | 0.9880 | 50 |
| LSA | 56.19 | 0.6676 | 0.6791 | 0.6733 | 100 |
| LDA | 58.85 | 0.6854 | 0.7011 | 0.6932 | 100 |
| FKLSA(Entropy) | 96.49 | 0.943 | 0.942 | 0.965 | 100 |
| FKLSA(IDF) | 98.24 | 0.961 | 0.960 | 0.982 | 100 |
| FKLSA(Normal) | 92.39 | 0.902 | 0.900 | 0.924 | 100 |
| FKLSA(ProbIDF) | 97.66 | 0.955 | 0.952 | 0.977 | 100 |
| FKTM | 98.87 | 0.9879 | 0.9841 | 0.9844 | 100 |
| LSA | 62.67 | 0.7091 | 0.7536 | 0.7285 | 150 |
| LDA | 59.23 | 0.6991 | 0.6791 | 0.6890 | 150 |
| FKLSA(Entropy) | 95.90 | 0.937 | 0.935 | 0.959 | 150 |
| FKLSA(IDF) | 97.66 | 0.955 | 0.952 | 0.977 | 150 |
| FKLSA(Normal) | 95.32 | 0.932 | 0.931 | 0.953 | 150 |
| FKLSA(ProbIDF) | 97.07 | 0.950 | 0.952 | 0.971 | 150 |
| FKTM | 98.97 | 0.9822 | 0.9882 | 0.9886 | 150 |
| LSA | 60.00 | 0.6980 | 0.7020 | 0.7000 | 200 |
| LDA | 63.42 | 0.7039 | 0.7765 | 0.7384 | 200 |
| FKLSA(Entropy) | 97.07 | 0.950 | 0.9501 | 0.971 | 200 |
| FKLSA(IDF) | 97.66 | 0.955 | 0.9553 | 0.977 | 200 |
| FKLSA(Normal) | 92.39 | 0.901 | 0.902 | 0.924 | 200 |
| FKLSA(ProbIDF) | 97.66 | 0.955 | 0.950 | 0.977 | 200 |
| FKTM | 98.86 | 0.9883 | 0.9870 | | 200 |
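The record does not say how the F1-Score column was computed. As a rough sanity check, the sketch below assumes the conventional definition (the harmonic mean of precision and recall) and applies it to the LSA and LDA rows at K = 50 from the table above. The function name `f1_score` is illustrative only, and rows for other methods need not follow this formula.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (conventional F1 definition, assumed here)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the K = 50 rows of the table above.
rows = {
    "LSA": (0.6667, 0.7221),
    "LDA": (0.6938, 0.7356),
}

for method, (precision, recall) in rows.items():
    print(method, round(f1_score(precision, recall), 4))
```

Under this assumption the two computed values round to 0.6933 and 0.7141, matching the reported F1-Score entries for those rows.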
Classification results (Ohsumed collection dataset)
| Method | Accuracy (%) | Precision | Recall | F1-Score | K |
|---|---|---|---|---|---|
| LSA | 48.36 | 0.4146 | 0.4224 | 0.4185 | 50 |
| LDA | 54.10 | 0.4789 | 0.5155 | 0.4970 | 50 |
| FKLSA(Entropy) | 75.21 | 0.720 | 0.722 | 0.746 | 50 |
| FKLSA(IDF) | 75.90 | 0.722 | 0.723 | 0.746 | 50 |
| FKLSA(Normal) | 71.25 | 0.6551 | 0.654 | 0.677 | 50 |
| FKLSA(ProbIDF) | 74.87 | 0.715 | 0.714 | 0.735 | 50 |
| FKTM | 92.35 | 0.9236 | 0.9006 | 0.9119 | 50 |
| LSA | 51.37 | 0.4430 | 0.4099 | 0.4258 | 100 |
| LDA | 54.92 | 0.4873 | 0.4783 | 0.4828 | 100 |
| FKLSA(Entropy) | 76.24 | 0.727 | 0.726 | 0.747 | 100 |
| FKLSA(IDF) | 74.35 | 0.701 | 0.703 | 0.726 | 100 |
| FKLSA(Normal) | 71.08 | 0.670 | 0.674 | 0.694 | 100 |
| FKLSA(ProbIDF) | 74.52 | 0.702 | 0.704 | 0.724 | 100 |
| FKTM | 87.70 | 0.8867 | 0.8261 | 0.8553 | 100 |
| LSA | 52.73 | 0.4651 | 0.4969 | 0.4805 | 150 |
| LDA | 57.10 | 0.5123 | 0.5155 | 0.5139 | 150 |
| FKLSA(Entropy) | 74.87 | 0.715 | 0.714 | 0.735 | 150 |
| FKLSA(IDF) | 76.59 | 0.732 | 0.731 | 0.752 | 150 |
| FKLSA(Normal) | 72.46 | 0.671 | 0.673 | 0.691 | 150 |
| FKLSA(ProbIDF) | 75.04 | 0.715 | 0.712 | 0.735 | 150 |
| FKTM | 90.16 | 0.8788 | 0.9006 | 0.8896 | 150 |
| LSA | 49.73 | 0.4303 | 0.4410 | 0.4356 | 200 |
| LDA | 54.37 | 0.4819 | 0.4969 | 0.4893 | 200 |
| FKLSA(Entropy) | 75.21 | 0.720 | 0.721 | 0.740 | 200 |
| FKLSA(IDF) | 74.18 | 0.705 | 0.704 | 0.725 | 200 |
| FKLSA(Normal) | 71.94 | 0.671 | 0.673 | 0.683 | 200 |
| FKLSA(ProbIDF) | 74.87 | 0.701 | 0.702 | 0.729 | 200 |
| FKTM | 88.25 | 0.8986 | 0.8261 | 0.8608 | 200 |
Fig. 1 CH-index results for Genia datasets with K = 50
Fig. 2 CH-index results for Genia datasets with K = 100
Fig. 3 CH-index results for Genia datasets with K = 150
Fig. 4 CH-index results for Genia datasets with K = 200
Fig. 5 CH-index results for Biotext datasets with K = 50
Fig. 6 CH-index results for Biotext datasets with K = 100
Fig. 7 CH-index results for Biotext datasets with K = 150
Fig. 8 CH-index results for Biotext datasets with K = 200
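The record does not expand "CH-index"; the sketch below assumes it refers to the Calinski-Harabasz index, a clustering quality score defined as the ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom (higher is better). The function name, the toy points, and the labels are all hypothetical and are not taken from the paper's data or pipeline.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k)).

    B is the between-cluster dispersion, W the within-cluster dispersion,
    n the number of points and k the number of clusters.
    """
    n = len(points)
    dim = len(points[0])
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of points")

    def mean(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)

    if within == 0.0:
        # Every point sits on its centroid; returning infinity is a choice of this sketch.
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: two tight groups far apart along the x axis.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # mixes the two groups

print(calinski_harabasz(points, good_labels))
print(calinski_harabasz(points, bad_labels))
```

On this toy data the well-separated labeling scores far higher than the mixed one, which is the sense in which a higher value in such a comparison would indicate more compact, better-separated clusters.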
Comparison of log-likelihood on the WSJ corpus
| Topic model | Log-likelihood | No. of topics |
|---|---|---|
| LDA | -824,000 | 50 |
| RedLDA | -810,000 | 50 |
| FKTM | -789,000 | 50 |
| MKFTM | -773,000 | 50 |
| LDA | -814,000 | 100 |
| RedLDA | -805,000 | 100 |
| FKTM | -789,500 | 100 |
| MKFTM | -773,600 | 100 |
| LDA | -815,000 | 150 |
| RedLDA | -809,000 | 150 |
| FKTM | -789,200 | 150 |
| MKFTM | -773,700 | 150 |
| LDA | -816,000 | 200 |
| RedLDA | -800,000 | 200 |
| FKTM | -789,000 | 200 |
| MKFTM | -773,900 | 200 |
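Log-likelihood is a higher-is-better score, so the least negative value in each topic-count group marks the best-fitting model. The sketch below simply re-reads the table above: it picks the best model per number of topics and reports its margin over LDA. The helper names `best_model` and `gain_over` are illustrative, and choosing LDA as the baseline is an assumption made for this example.

```python
# Log-likelihood values copied from the table above, keyed by number of topics.
log_likelihood = {
    50: {"LDA": -824_000, "RedLDA": -810_000, "FKTM": -789_000, "MKFTM": -773_000},
    100: {"LDA": -814_000, "RedLDA": -805_000, "FKTM": -789_500, "MKFTM": -773_600},
    150: {"LDA": -815_000, "RedLDA": -809_000, "FKTM": -789_200, "MKFTM": -773_700},
    200: {"LDA": -816_000, "RedLDA": -800_000, "FKTM": -789_000, "MKFTM": -773_900},
}


def best_model(scores):
    """Return the model with the highest (least negative) log-likelihood."""
    return max(scores, key=scores.get)


def gain_over(scores, model, baseline="LDA"):
    """Difference in log-likelihood between a model and the baseline."""
    return scores[model] - scores[baseline]


for topics, scores in sorted(log_likelihood.items()):
    winner = best_model(scores)
    print(topics, winner, gain_over(scores, winner))
```

With the tabulated values, MKFTM has the highest log-likelihood at every topic count, with a margin over LDA of 51,000 at 50 topics and between 40,400 and 42,100 at the other settings.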
Fig. 9 Comparison of execution times on the health tweet dataset