Muhammad Taimoor Khan, Nouman Azam, Shehzad Khalid, Furqan Aziz.
Abstract
Topic models extract latent concepts from texts in the form of topics. Lifelong topic models extend topic models by learning topics continuously, drawing on knowledge accumulated from the past that is updated as new information becomes available. Hierarchical topic models extend topic models by organizing the extracted topics into a hierarchical structure. In this study, we combine the two and introduce hierarchical lifelong topic models. Hierarchical lifelong topic models not only allow the topics to be examined at different levels of granularity but also allow the granularity of the topics to be adjusted continuously as more information becomes available. A fundamental issue in hierarchical lifelong topic modeling is the extraction of rules that preserve the hierarchical structural information among themselves and are updated continuously as new information arrives. To address this issue, we introduce a network-communities-based rule-mining approach for hierarchical lifelong topic models (NHLTM). The proposed approach extracts hierarchical structural information among the rules by representing textual documents as graphs and analyzing the underlying communities in the graphs. Experimental results indicate improvement of the hierarchical topic structures in terms of topic coherence, which increases from general to specific topics.
Year: 2022 PMID: 35239700 PMCID: PMC8893656 DOI: 10.1371/journal.pone.0264481
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
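The pipeline the abstract describes (represent documents as a word co-occurrence graph, then mine communities in that graph to obtain topic groupings) can be sketched in miniature. The toy corpus below and the use of connected components as a stand-in for the paper's graph-cut-based community detection are illustrative assumptions, not the authors' implementation:

```python
from itertools import combinations
from collections import defaultdict

def build_cooccurrence_graph(docs):
    """Weighted word co-occurrence graph: an edge connects two
    words whenever they appear together in a document."""
    graph = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            graph[u][v] += 1
            graph[v][u] += 1
    return graph

def communities(graph):
    """Crude stand-in for community detection: connected
    components found by depth-first search (the paper instead
    mines hierarchical communities via graph cuts)."""
    seen, comps = set(), []
    for start in list(graph):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node])
        seen |= comp
        comps.append(comp)
    return comps

# Toy corpus: two disconnected vocabularies, hence two word groups.
docs = [["screen", "battery"], ["battery", "charger"],
        ["road", "traffic"], ["traffic", "city"]]
g = build_cooccurrence_graph(docs)
print(sorted(sorted(c) for c in communities(g)))
# [['battery', 'charger', 'screen'], ['city', 'road', 'traffic']]
```

Each resulting word group plays the role of one topic; the paper's contribution is splitting such communities recursively so the topics form a hierarchy that can be refined over time.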
Table 1. Sample dataset with 10 documents containing 9 words.

| Doc | w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 | w9 |
|---|---|---|---|---|---|---|---|---|---|
| D1 | - | - | - | - | | | | | |
| D2 | - | - | - | - | - | - | - | | |
| D3 | - | - | - | - | | | | | |
| D4 | - | - | - | - | - | - | | | |
| D5 | - | - | - | - | - | - | | | |
| D6 | - | - | - | - | | | | | |
| D7 | - | - | - | - | - | - | - | - | |
| D8 | - | - | - | - | | | | | |
| D9 | - | - | - | - | - | | | | |
| D10 | - | - | - | | | | | | |

A dash (-) indicates the presence of a word in a document.
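A term-term matrix like the one in Table 2 counts, for each pair of words, the number of documents in which both occur. A minimal sketch, starting from a binary document-word presence matrix of the kind shown in Table 1 (the 3x4 example data here is illustrative, not Table 1's rows):

```python
def term_term_from_presence(presence):
    """presence[d][w] is 1 if word w occurs in document d.
    Returns the upper-triangular term-term co-occurrence matrix."""
    n = len(presence[0])
    matrix = [[0] * n for _ in range(n)]
    for row in presence:
        for i in range(n):
            for j in range(i + 1, n):
                if row[i] and row[j]:
                    matrix[i][j] += 1
    return matrix

# 3 documents over 4 words (illustrative data)
presence = [[1, 1, 0, 1],
            [1, 0, 1, 1],
            [0, 1, 1, 1]]
m = term_term_from_presence(presence)
print(m[0][1], m[0][3], m[2][3])  # 1 2 2
```

Only the upper triangle is filled because co-occurrence is symmetric, which matches the triangular layout of Table 2.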
Table 2. Term-term matrix based on the data in Table 1.

| | w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 | w9 |
|---|---|---|---|---|---|---|---|---|---|
| w1 | - | 6 | 4 | 4 | 3 | 4 | 4 | 3 | 3 |
| w2 | | - | 4 | 4 | 2 | 3 | 3 | 3 | 3 |
| w3 | | | - | 6 | 3 | 3 | 3 | 3 | 3 |
| w4 | | | | - | 3 | 3 | 3 | 3 | 3 |
| w5 | | | | | - | 1 | 1 | 2 | 5 |
| w6 | | | | | | - | 5 | 3 | 1 |
| w7 | | | | | | | - | 3 | 1 |
| w8 | | | | | | | | - | 3 |
| w9 | | | | | | | | | - |
Table 3. Values of association among the rules based on Eq 1.

| | r2 | r3 | r4 |
|---|---|---|---|
| r1 | 4 | 3 | 3.5 |
| r2 | | 3 | 3 |
| r3 | | | 1 |
Fig 1. Architecture of the NHLTM approach.
Table 4. Graph properties before and after pruning weak nodes and edges from each dataset.

| Dataset | Property | Before Filtering | After Filtering |
|---|---|---|---|
| Electronic (Chen 2014) | Word Nodes | 5,574 | 1,228 |
| | Co-occurrence Edges | 655,261 | 346,687 |
| | Per-Edge Weight | 3.43 | 4.84 |
| Non-Electronic (Chen 2014) | Word Nodes | 7,391 | 2,572 |
| | Co-occurrence Edges | 583,020 | 424,066 |
| | Per-Edge Weight | 2.19 | 2.49 |
| Reuters R21578 | Word Nodes | 5,193 | 2,470 |
| | Co-occurrence Edges | 974,793 | 673,193 |
| | Per-Edge Weight | 1.62 | 1.82 |
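The before/after figures in Table 4 come from dropping weak edges and the nodes they leave isolated. The dict-based edge representation and the weight threshold below are assumptions for illustration; this extract does not state the paper's exact filtering criteria:

```python
def prune(edges, min_weight):
    """Drop co-occurrence edges whose weight is below min_weight.
    `edges` maps a frozenset word pair to its co-occurrence count;
    nodes left without any edge disappear implicitly."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}

def graph_stats(edges):
    """Node count, edge count, and mean per-edge weight,
    mirroring the properties reported in Table 4."""
    nodes = set().union(*edges) if edges else set()
    mean_w = sum(edges.values()) / len(edges) if edges else 0.0
    return len(nodes), len(edges), mean_w

edges = {frozenset({"screen", "battery"}): 5,
         frozenset({"battery", "charger"}): 1,
         frozenset({"screen", "hinge"}): 2}
print(graph_stats(edges))            # nodes=4, edges=3, mean weight ~2.67
print(graph_stats(prune(edges, 2)))  # nodes=3, edges=2, mean weight 3.5
```

Note the effect visible in Table 4 as well: filtering removes many edges but raises the mean per-edge weight, since only strong co-occurrences survive.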
Table 5. Sample communities with their codes, top words, and neighboring communities.
| ID (Hex Code) | Top Words | Neighboring Community ID |
|---|---|---|
| 9 | Adaptor, cam, filter, stand, indicator | C |
| C | Exception, weird, superior, break | 9 |
| 56 | Field, section, flat | 8A |
| 8A | Storage, airport, convenient | 56 |
| 21E | Environment, rest, world, answer | 23D |
| 23D | Folder, label, credit | 21E |
| 439 | Profile, background, scene, rock | 43B |
| 43B | Tray, blueray, manual, gripe | 439 |
| 10A9 | Alarm, longer, beep | 43B |
| 2146 | Police, ticket, state, ka | 2151 |
| 2151 | Dead, transmitter, awful | 2146 |
| 8528 | Hard, top, lap, scroll, pad | 2151 |
| 10A63 | Outstanding, playback, release, model | 10A70 |
| 10A70 | Gamer, thrive, bigger, hub | 10A63 |
| 10B3E | Toshiba, lifetime, replacement, warranty | 10A70 |
| 21091 | Foot, pace, distance, run | 10B3E |
| 79A0D | Heavy, metal, cover, bottom | 21091 |
| 10C0C3 | Traffic, highway, city, road | 79A0D |
| 214C93 | Bluetooth, reception, works, feedback | 214C90 |
| 214C90 | Bright, display, control | 214C93 |
Fig 2. Hierarchical breakdown of the graph into communities for the first 5 levels, showing the nodes per community and their respective graph-cut weights.
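Fig 2 and Table 5 show each community carrying a hierarchical code. How the paper derives its hex codes is not specified in this extract; the sketch below illustrates one hypothetical scheme, recursively splitting a node list (halving stands in for the paper's graph-cut split) and appending one hex digit of the branch index per level:

```python
def breakdown(nodes, levels, code=""):
    """Recursively split `nodes` for `levels` levels, labeling each
    resulting community with a code made of one hex digit per
    level. Halving is a stand-in for a real graph-cut split."""
    if levels == 0 or len(nodes) < 2:
        return {code or "0": nodes}
    mid = len(nodes) // 2
    out = {}
    for i, part in enumerate((nodes[:mid], nodes[mid:])):
        out.update(breakdown(part, levels - 1, code + format(i, "X")))
    return out

comms = breakdown(list(range(8)), 3)
print(sorted(comms))  # ['000', '001', '010', '011', '100', '101', '110', '111']
```

Under such a scheme a community's code encodes its full path from the root, so longer codes (as in Table 5) correspond to deeper, more specific communities.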
Table 6. Evaluation scores of the communities shown in Table 5.

| Comm Sr. No. | Size | Entropy | Lin's Approx. | | Avg. Node Degree |
|---|---|---|---|---|---|
| 1 | 7 | 2.787 | 1.8 | 7.073 | 56.893 |
| 2 | 13 | 3.816 | 3.249 | 0.236 | 12.153 |
| 3 | 5 | 1.108 | 1.249 | 0.020 | 2 |
| 4 | 5 | 1.130 | 1.249 | 0.040 | 4.6 |
| 5 | 6 | 1.440 | 1.499 | 0.040 | 3.83 |
| 6 | 5 | 1.055 | 1.249 | 0.030 | 2.4 |
| 7 | 8 | 2.087 | 1.999 | 0.061 | 4.125 |
| 8 | 5 | 1.085 | 1.249 | 0.054 | 6.6 |
| 9 | 6 | 1.425 | 1.499 | 0.86 | 7.83 |
| 10 | 5 | 1.135 | 1.249 | 0.336 | 19.2 |
| 11 | 6 | 1.449 | 1.499 | 0.036 | 3.33 |
| 12 | 28 | 1.745 | 1.749 | 0.0833 | 6.85 |
| 13 | 6 | 1.488 | 1.499 | 0.0923 | 10.83 |
| 14 | 5 | 1.109 | 1.249 | 0.0414 | 3.8 |
| 15 | 5 | 1.071 | 1.249 | 0.3227 | 48.6 |
| 16 | 7 | 1.698 | 1.749 | 1.538 | 103.285 |
| 17 | 5 | 1.129 | 1.249 | 0.348 | 52.8 |
| 18 | 7 | 1.817 | 1.749 | 0.829 | 52.85 |
| 19 | 5 | 1.141 | 1.249 | 0.062 | 6 |
| 20 | 4 | 1.746 | 1.448 | 0.085 | 75.4 |
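Two of the measures in Table 6 are standard: Shannon entropy and average node degree. Which distribution the paper takes the entropy over, and the exact form of Lin's approximation, are not given in this extract, so the sketch below uses a generic frequency distribution and the undirected-graph degree formula:

```python
import math

def shannon_entropy(freqs):
    """Shannon entropy (base 2) of a frequency distribution."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total)
                for f in freqs if f)

def avg_node_degree(num_edges, num_nodes):
    """Average degree of an undirected graph: 2|E| / |V|."""
    return 2 * num_edges / num_nodes

print(shannon_entropy([1, 1, 1, 1]))              # 2.0 (uniform over 4)
print(avg_node_degree(num_edges=3, num_nodes=3))  # 2.0 (a triangle)
```

Lower entropy suggests a community dominated by a few words (a tighter topic), while a high average node degree indicates a densely connected community.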
Fig 3. Comparison of NHLTM with existing approaches.
Fig 4. Comparison of NHLTM and HLDA with an increasing number of topics and levels of hierarchy for the Alarm Clock dataset in large-scale data.
Fig 5. Topic hierarchy of the Alarm Clock dataset.
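The comparisons in Figs 3 and 4 are in terms of topic coherence. This extract does not say which coherence metric the paper uses; one common choice is UMass coherence, which scores a topic's ranked top words by how often lower-ranked words co-occur in documents with higher-ranked ones:

```python
import math

def umass_coherence(top_words, docs):
    """UMass topic coherence: for each ordered pair of top words,
    add log((co-document frequency + 1) / document frequency of
    the higher-ranked word). `docs` is a list of word sets."""
    def df(*words):
        return sum(all(w in d for w in words) for d in docs)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            score += math.log((df(wi, wj) + 1) / df(wj))
    return score

docs = [{"battery", "screen"}, {"battery", "charger"},
        {"screen", "charger"}]
print(umass_coherence(["battery", "screen"], docs))  # 0.0
```

Higher (less negative) scores indicate more coherent topics; the abstract's claim that coherence increases from general to specific topics would show up as rising scores at deeper hierarchy levels.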