Qiming Du, Nan Li, Wenfu Liu, Daozhu Sun, Shudan Yang, Feng Yue.
Abstract
Topic recognition technology is commonly applied to identify the categories of news topics in the vast amount of web information, and it has broad application prospects in online public opinion monitoring, news recommendation, and related fields. However, it is challenging to effectively use key feature information in the text, such as syntax and semantics, to improve topic recognition accuracy. Some researchers have proposed combining a topic model with a word embedding model, and their results show that this approach can enrich text representation and benefit downstream natural language processing tasks. However, for topic recognition of news texts, there is currently no standard way of combining a topic model with a word embedding model. In addition, some existing approaches of this kind are complex and do not consider the fusion of topic distributions of different granularity with word embedding information. Therefore, this paper proposes a novel text representation method based on word embedding enhancement and builds on it a full-process topic recognition framework for news text. In contrast to traditional topic recognition methods, this framework uses the probabilistic topic model LDA and the word embedding models Word2vec and Glove to extract and integrate the topic distribution, semantic knowledge, and syntactic relationships of the text, and then uses popular classifiers to automatically recognize the topic categories of news from the resulting text representation vectors. As a result, the proposed framework can exploit both the document-topic relationship and the context information, which improves expressive ability and reduces dimensionality.
Experimental results on two benchmark datasets, 20NewsGroup and BBC News, verify the effectiveness and superiority of the proposed method based on word embedding enhancement for news topic recognition.
Year: 2022 PMID: 35222628 PMCID: PMC8865979 DOI: 10.1155/2022/4582480
Source DB: PubMed Journal: Comput Intell Neurosci
The definition and description of symbols involved in this paper.
| Symbol definition | Description |
|---|---|
| | The number of tokens in a document |
| | The vocabulary, including all words of the corpus |
| | The number of topics used when training the LDA model |
| TM | LDA model obtained by training on the corpus |
| | The dimension of the word vectors used when training the word embedding model |
| EM | Word embedding model obtained by training on the corpus |
| DTD | Document-level topic distribution for a document |
| WTD | Word-level topic distribution for a document |
| DTE | Text representation based on the word embedding model and the document-level topic distribution |
| WTE | Text representation based on the word embedding model and the word-level topic distribution |
| | Dimension of the document representation vector |
| | Topic distribution |
| ⊕ | Concatenation of vectors |
| ⊙ | Summation of vectors |
| | Corpus set |
Figure 1. An example of document-topic distribution and topic-word distribution in LDA.
Figure 2. Architecture of DTE and WTE text representation models.
Figure 3. The architecture of news topic recognition framework.
The information of the 20NewsGroup dataset.
| Topic category | Len | Size | Topic category | Len | Size |
|---|---|---|---|---|---|
| comp.graphics | 163.0 | 973 | rec.autos | 126.3 | 990 |
| comp.os.ms.windows.misc | 160.6 | 985 | rec.motorcycles | 118.4 | 996 |
| comp.sys.ibm.pc.hardware | 116.7 | 982 | rec.sport.baseball | 131.2 | 994 |
| comp.sys.mac.hardware | 109.0 | 963 | rec.sport.hockey | 155.3 | 999 |
| comp.windows.x | 174.3 | 988 | misc.forsale | 95.1 | 975 |
| talk.politics.misc | 232.7 | 775 | sci.crypt | 189.6 | 991 |
| talk.politics.guns | 189.9 | 910 | sci.electronics | 121.8 | 984 |
| talk.politics.mideast | 269.4 | 940 | sci.med | 173.8 | 990 |
| alt.atheism | 183.0 | 799 | sci.space | 174.9 | 987 |
| soc.religion.Christian | 194.2 | 997 | talk.religion.misc | 192.1 | 628 |
The information of the BBC news dataset.
| Topic category | Len | Size |
|---|---|---|
| Tech | 507.4 | 401 |
| Business | 334.2 | 510 |
| Sport | 336.3 | 511 |
| Entertainment | 337.7 | 386 |
| Politics | 461.2 | 417 |
Figure 4. The structure of the summation “⊙” and concatenation “⊕” operations on representation vectors.
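The two operations named in the Figure 4 caption can be illustrated with a minimal sketch. It assumes that summation is element-wise addition of two equal-length vectors and that concatenation appends one vector to the other; the function names and toy vectors are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def summation(a: Vector, b: Vector) -> Vector:
    """Element-wise sum; both vectors must have the same dimension."""
    if len(a) != len(b):
        raise ValueError("summation needs vectors of equal length")
    return [x + y for x, y in zip(a, b)]


def concatenation(a: Vector, b: Vector) -> Vector:
    """Append b to a; the result has len(a) + len(b) dimensions."""
    return a + b


# Toy 4-dimensional vectors standing in for a topic-based and an
# embedding-based document representation (values are made up).
topic_vec = [0.5, 0.25, 0.25, 0.0]
embed_vec = [1.0, -2.0, 0.5, 4.0]

summed = summation(topic_vec, embed_vec)      # stays 4-dimensional
joined = concatenation(topic_vec, embed_vec)  # becomes 8-dimensional
print(len(summed), len(joined))
```

Under this reading, concatenating two d-dimensional vectors yields 2d dimensions, which would be consistent with the 200 to 1000 dimensions listed for CGL and CWL in the 20NewsGroup results table, against 100 to 500 for the other models.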
Results of the 20NewsGroup in 20 classes for 7532 texts by SVM and LR (average 5-fold micro-F1 score of different dimensions).
| Model | Dim | SVM | LR | Dim | SVM | LR | Dim | SVM | LR | Dim | SVM | LR | Dim | SVM | LR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TF-IDF | 100 | 0.373 | 0.372 | 200 | 0.48 | 0.475 | 300 | 0.586 | 0.580 | 400 | 0.637 | 0.631 | 500 | 0.671 | 0.663 |
| LDA | 100 | 0.686 | 0.682 | 200 | 0.692 | 0.689 | 300 | 0.712 | 0.711 | 400 | 0.715 | 0.714 | 500 | 0.721 | 0.723 |
| Glove | 100 | 0.724 | 0.710 | 200 | 0.767 | 0.754 | 300 | 0.784 | 0.771 | 400 | 0.794 | 0.780 | 500 | 0.799 | 0.788 |
| SGL | 100 | 0.745 | 0.732 | 200 | 0.779 | 0.767 | 300 | 0.792 | 0.780 | 400 | 0.807 | 0.794 | 500 | 0.813 | 0.802 |
| CGL | 200 | | 0.772 | 400 | | | 600 | | | 800 | | | 1000 | | |
| Word2vec | 100 | 0.740 | 0.733 | 200 | 0.765 | 0.756 | 300 | 0.766 | 0.755 | 400 | 0.769 | 0.758 | 500 | 0.765 | 0.755 |
| SWL | 100 | 0.747 | 0.743 | 200 | 0.777 | 0.769 | 300 | 0.782 | 0.771 | 400 | 0.787 | 0.777 | 500 | 0.787 | 0.777 |
| CWL | 200 | 0.780 | | 400 | 0.795 | 0.787 | 600 | 0.793 | 0.782 | 800 | 0.793 | 0.783 | 1000 | 0.796 | 0.785 |
Bold indicates that values are the optimal results.
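The micro-F1 score used in the table above pools true positives, false positives, and false negatives over all classes before computing precision and recall. The sketch below shows that computation for single-label predictions and averages it over folds. The function names and toy labels are hypothetical, and the source does not show the paper's own evaluation code.

```python
from typing import List, Sequence


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Micro-averaged F1 over all classes for single-label predictions."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool the counts over every class before computing the ratios.
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_over_folds(fold_scores: List[float]) -> float:
    """Average the per-fold scores, as in k-fold cross-validation."""
    return sum(fold_scores) / len(fold_scores)


# Two toy folds with made-up labels (not data from the paper).
fold_1 = micro_f1(["sport", "tech", "tech", "politics"],
                  ["sport", "tech", "sport", "politics"])
fold_2 = micro_f1(["sport", "tech", "business", "tech"],
                  ["sport", "tech", "business", "tech"])
average = mean_over_folds([fold_1, fold_2])
print(fold_1, fold_2, average)
```

For single-label multi-class predictions, micro-F1 equals accuracy, so `fold_1` here is simply the fraction of correctly labeled documents in that fold.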
Figure 5. Different representation vectors of the 20 news groups embedded into two dimensions using UMAP.
Overview of different fusion models.
| Model | Dim | Submodels | Model | Dim | Submodels |
|---|---|---|---|---|---|
| FPW | 300 | TF-IDF, LDA, Word2vec | TDE | 600 | TF-IDF, LDA, Word2vec |
| FPW | 300 | TF-IDF, LDA, Glove | TDE | 600 | TF-IDF, LDA, Glove |
| FPC | 600 | TF-IDF, LDA, Word2vec | DRIWL | 300 | LDA, Word2vec |
| FPC | 600 | TF-IDF, LDA, Glove | DRIWL | 300 | LDA, Glove |
| FP2 | 300 | TF-IDF, LDA, Word2vec | DTE | 300 + | LDA, Glove |
| FP2 | 300 | TF-IDF, LDA, Glove | WTE | 300 + | LDA, Glove |
A model name marked with “” in the upper right corner uses Glove as the word embedding model. Dim is the dimension of the final document representation vector.
The quality of the LDA model with different topic numbers based on the 20NewsGroup.
| Evaluation indicators | 10 | 15 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 25 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PERP | | 371.873 | | 371.723 | 372.899 | | 371.960 | 377.523 | 389.457 | 388.428 | 395.543 |
| | | 0.557 | 0.591 | 0.577 | 0.595 | | 0.585 | | 0.586 | 0.576 | 0.583 |
Figure 6. The topic detection results of the 20NewsGroup for different methods by SVM and LR.
The quality of the LDA model with different topic numbers based on BBC News.
| Evaluation indicators | 5 | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45 | 50 | 55 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PERP | | | 249.625 | | 249.677 | 255.310 | 256.739 | 254.678 | 259.439 | 265.335 | 261.736 |
| | 0.448 | | 0.478 | | 0.454 | 0.445 | | 0.462 | 0.469 | 0.442 | 0.462 |
For the above experimental values, we mark the best three results under different evaluation indicators.
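PERP in the two LDA quality tables is read here as perplexity. The source does not state the formula, so the standard definition for topic models is given below as an assumption. The symbols $D$ (corpus), $M$ (number of documents), $\mathbf{w}_d$ (word sequence of document $d$), and $N_d$ (number of tokens in document $d$) are chosen for illustration and may differ from the paper's notation.

```latex
\mathrm{PERP}(D) = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```

Under this definition, a lower value means the model assigns higher probability to the text, so lower perplexity indicates a better fit.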
Figure 7. The topic detection results of BBC News for different methods by SVM and LR.