Wenfu Liu, Jianmin Pang, Qiming Du, Nan Li, Shudan Yang.
Abstract
Short text representation is a basic and key task in NLP. The traditional approach simply merges the bag-of-words model with a topic model, which can make semantic information ambiguous and leave topic information sparse. We propose an unsupervised text representation method that fuses word embeddings with extended topic information. Two strategies are designed for fusing weighted word embeddings (WWE) and extended topic information (ETI): static linear fusion and dynamic fusion. The method highlights important semantic information, fuses topic information flexibly, and improves short text representation. We verify its effectiveness on classification and prediction tasks, and the test results show that the method is valid.
Keywords: information fusion; short text representation; topic information; word embeddings
Year: 2022 PMID: 35161808 PMCID: PMC8839561 DOI: 10.3390/s22031066
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
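The abstract names weighted word embeddings (WWE), extended topic information (ETI), and a static linear fusion of the two, but this record does not give the formulas. The sketch below is only one plausible reading: the per-word weights, the fixed coefficient `lam`, and the choice to concatenate the two scaled parts are all assumptions made for illustration, not the authors' definitions. Dynamic fusion is not sketched because nothing in the record describes it.

```python
from typing import Dict, List, Sequence


def weighted_word_embedding(
    words: Sequence[str],
    vectors: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of the vectors of the in-vocabulary words.

    `weights` is a hypothetical per-word importance score (for example a
    TF-IDF value); words without a score default to weight 1.0.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        if word not in vectors:
            continue  # skip out-of-vocabulary words
        w = weights.get(word, 1.0)
        weight_sum += w
        for i, x in enumerate(vectors[word]):
            total[i] += w * x
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [x / weight_sum for x in total]


def static_linear_fusion(
    wwe: List[float], topic: List[float], lam: float = 0.5
) -> List[float]:
    """Assumed fusion: scale each part by a fixed coefficient, then concatenate."""
    return [lam * x for x in wwe] + [(1.0 - lam) * t for t in topic]


# Toy inputs, invented for this example only.
vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
weights = {"good": 3.0, "movie": 1.0}
topic_vector = [0.2, 0.8]  # stand-in for a topic distribution of the text

wwe = weighted_word_embedding(["good", "movie", "unknown"], vectors, weights)
fused = static_linear_fusion(wwe, topic_vector, lam=0.5)
print(wwe)
print(fused)
```

With these toy inputs the weighted average leans toward "good" because its weight is three times that of "movie", and the fused vector has the combined length of both parts (four components here).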
Notation description in this section.
| Notation | Description |
|---|---|
| | Number of words in corpus |
| | Number of words in short text |
| | Number of words in extended short text |
| | Word vector dictionary of corpus |
| | Word embeddings of short text |
| | WWE of short text |
| | Topic vectors of extended short text |
| | The |
| | The |
| | Word distribution of topic, where |
| | Topic distribution of the |
| | Prior hyperparameter, generally set as |
| | Word list of short text |
| | Word list of short text |
| | Fusion of WWE and ETI for representing short text |
| | Denotes multiplication cross of matrix |
Figure 1. Architecture of short text representation.
Figure 2. Model of the WWE.
The length distribution of original text in IMDB and 20 Newsgroups.
| Datasets | | | 100 ≤ | | 200 ≤ | | 300 ≤ | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| IMDB | 2950 | 11.8% | 11680 | 46.7% | 4665 | 18.7% | 2402 | 9.6% | 3303 | 13.2% |
| 20 Newsgroups | 2027 | 17.9% | 4147 | 36.7% | 2301 | 20.3% | 1124 | 9.9% | 1715 | 15.2% |
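As a quick consistency check on the length distribution, the percentage in each cell can be recomputed from the counts in the same row. The snippet below is a small sketch using only the numbers printed in the table; the helper name `shares` is made up for this example.

```python
# Counts per length partition, copied from the table (five partitions per dataset).
counts = {
    "IMDB": [2950, 11680, 4665, 2402, 3303],
    "20 Newsgroups": [2027, 4147, 2301, 1124, 1715],
}


def shares(bucket_counts):
    """Percentage of texts in each partition, rounded to one decimal place."""
    total = sum(bucket_counts)
    return [round(100.0 * c / total, 1) for c in bucket_counts]


for name, bucket_counts in counts.items():
    print(name, sum(bucket_counts), shares(bucket_counts))
```

The row totals come to 25000 texts for IMDB and 11314 for 20 Newsgroups, and the recomputed shares agree with the percentages listed in the table.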
Figure 3. Confusion matrix.
Results of the influence of different text lengths on different representation models (IMDB).
| Model | Classifier | | | 100 ≤ | | 200 ≤ | | 300 ≤ | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| W2V | KNN | 0.9461 | 0.9442 | 0.9485 | 0.9454 | 0.9432 | 0.9411 | 0.9453 | 0.9421 | 0.9405 | 0.9399 |
| | SVM | 0.9525 | 0.9501 | 0.9392 | 0.9388 | 0.9475 | 0.9444 | 0.9436 | 0.9417 | 0.9423 | 0.9403 |
| GLV | KNN | 0.9510 | 0.9493 | 0.9441 | 0.9413 | 0.9442 | 0.9426 | 0.9408 | 0.9383 | 0.9284 | 0.9265 |
| | SVM | 0.9491 | 0.9475 | 0.9504 | 0.9487 | 0.9452 | 0.9430 | 0.9393 | 0.9364 | 0.9275 | 0.9255 |
| WWE | KNN | 0.9531 | 0.9525 | 0.9557 | 0.9532 | 0.9555 | 0.9543 | 0.9504 | 0.9487 | 0.9491 | 0.9472 |
| | SVM | 0.9504 | 0.9483 | 0.9545 | 0.9504 | 0.9534 | 0.9513 | 0.9537 | 0.9501 | 0.9463 | 0.9404 |
| LDA | KNN | 0.8602 | 0.8565 | 0.8624 | 0.8607 | 0.8653 | 0.8631 | 0.8766 | 0.8744 | 0.8732 | 0.8713 |
| | SVM | 0.8581 | 0.8572 | 0.8584 | 0.8573 | 0.8651 | 0.8647 | 0.8772 | 0.8754 | 0.8781 | 0.8745 |
| ETI | KNN | 0.8774 | 0.8752 | 0.8806 | 0.8787 | 0.8822 | 0.8815 | 0.8876 | 0.8757 | 0.8821 | 0.8808 |
| | SVM | 0.8797 | 0.8765 | 0.8814 | 0.8796 | 0.8798 | 0.8761 | 0.8856 | 0.8834 | 0.8846 | 0.8822 |
Results of the influence of different text lengths on different representation models (20 Newsgroups).
| Model | Classifier | | | 100 ≤ | | 200 ≤ | | 300 ≤ | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| W2V | KNN | 0.7313 | 0.7291 | 0.7354 | 0.7332 | 0.7154 | 0.7133 | 0.7192 | 0.7181 | 0.7119 | 0.7092 |
| | SVM | 0.7182 | 0.7164 | 0.7091 | 0.7075 | 0.7112 | 0.7102 | 0.7063 | 0.7055 | 0.6892 | 0.6872 |
| GLV | KNN | 0.7285 | 0.7257 | 0.7212 | 0.7196 | 0.7223 | 0.7195 | 0.7156 | 0.7132 | 0.7083 | 0.7064 |
| | SVM | 0.7196 | 0.7183 | 0.7154 | 0.7142 | 0.7116 | 0.7109 | 0.7081 | 0.7067 | 0.7054 | 0.7035 |
| WWE | KNN | 0.7398 | 0.7374 | 0.7473 | 0.7454 | 0.7515 | 0.7494 | 0.7364 | 0.7345 | 0.7281 | 0.7262 |
| | SVM | 0.7297 | 0.7275 | 0.7385 | 0.7363 | 0.7354 | 0.7337 | 0.7319 | 0.7301 | 0.7252 | 0.7231 |
| LDA | KNN | 0.6542 | 0.6521 | 0.6592 | 0.6571 | 0.6662 | 0.6635 | 0.6692 | 0.6672 | 0.6675 | 0.6654 |
| | SVM | 0.6616 | 0.6596 | 0.6581 | 0.6562 | 0.6691 | 0.6672 | 0.7013 | 0.7015 | 0.7133 | 0.7119 |
| ETI | KNN | 0.6827 | 0.6794 | 0.6876 | 0.6851 | 0.6915 | 0.6891 | 0.6832 | 0.6816 | 0.6881 | 0.6862 |
| | SVM | 0.6929 | 0.6903 | 0.6954 | 0.6939 | 0.6909 | 0.6886 | 0.6997 | 0.6974 | 0.6945 | 0.6935 |
Comparison of SFM results with different by SVM classifier (IMDB dataset).
| Text Partition Strategy | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | 0.9416 | 0.9394 | 0.9452 | 0.9434 | 0.9581 | 0.9563 | 0.9577 | 0.9556 | 0.9634 | 0.9616 |
| 100 ≤ | 0.9553 | 0.9535 | 0.9643 | 0.9622 | 0.9712 | 0.9692 | 0.9704 | 0.9681 | 0.9692 | 0.9672 |
| 200 ≤ | 0.9742 | 0.9721 | 0.9715 | 0.9705 | 0.9784 | 0.9765 | 0.9752 | 0.9731 | 0.9771 | 0.9753 |
| 300 ≤ | 0.9735 | 0.9724 | 0.9774 | 0.9752 | 0.9705 | 0.9682 | 0.9665 | 0.9643 | 0.9695 | 0.9672 |
| | 0.9687 | 0.9662 | 0.9753 | 0.9732 | 0.9612 | 0.9593 | 0.9531 | 0.9515 | 0.9572 | 0.9551 |
| All | 0.9551 | 0.9538 | 0.9581 | 0.9563 | 0.9661 | 0.9641 | 0.9574 | 0.9552 | 0.9653 | 0.9632 |
Comparison of SFM results with different by SVM classifier (20 Newsgroups dataset).
| Text Partition Strategy | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | 0.7295 | 0.7271 | 0.7162 | 0.7154 | 0.7381 | 0.7365 | 0.7454 | 0.7432 | 0.7415 | 0.7391 |
| 100 ≤ | 0.7352 | 0.7332 | 0.7325 | 0.7302 | 0.7493 | 0.7472 | 0.7516 | 0.7492 | 0.7473 | 0.7463 |
| 200 ≤ | 0.7514 | 0.7492 | 0.7653 | 0.7635 | 0.7685 | 0.7663 | 0.7612 | 0.7591 | 0.7581 | 0.7562 |
| 300 ≤ | 0.7693 | 0.7674 | 0.7781 | 0.7762 | 0.7743 | 0.7754 | 0.7735 | 0.7715 | 0.7642 | 0.7625 |
| | 0.7442 | 0.7425 | 0.7472 | 0.7453 | 0.7392 | 0.7372 | 0.7364 | 0.7345 | 0.7316 | 0.7291 |
| All | 0.7415 | 0.7403 | 0.7393 | 0.7372 | 0.7454 | 0.7432 | 0.7553 | 0.7532 | 0.7481 | 0.7465 |
Comparison of DFM results with different by SVM classifier (IMDB dataset).
| Text Partition Strategy | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | 0.9441 | 0.9423 | 0.9536 | 0.9516 | 0.9591 | 0.9572 | 0.9675 | 0.9651 | 0.9624 | 0.9604 |
| 100 ≤ | 0.9716 | 0.9692 | 0.9773 | 0.9754 | 0.9775 | 0.9754 | 0.9846 | 0.9824 | 0.9775 | 0.9762 |
| 200 ≤ | 0.9792 | 0.9775 | 0.9832 | 0.9815 | 0.9733 | 0.9713 | 0.9783 | 0.9767 | 0.9692 | 0.9674 |
| 300 ≤ | 0.9713 | 0.9694 | 0.9784 | 0.9762 | 0.9768 | 0.9748 | 0.9802 | 0.9796 | 0.9851 | 0.9831 |
| | 0.9754 | 0.9747 | 0.9731 | 0.9713 | 0.9854 | 0.9847 | 0.9819 | 0.9798 | 0.9803 | 0.9792 |
| All | 0.9742 | 0.9736 | 0.9709 | 0.9694 | 0.9776 | 0.9765 | 0.9737 | 0.9713 | 0.9752 | 0.9746 |
Comparison of DFM results with different by SVM classifier (20 Newsgroups dataset).
| Text Partition Strategy | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | 0.7313 | 0.7305 | 0.7335 | 0.7315 | 0.7382 | 0.7372 | 0.7402 | 0.7390 | 0.7431 | 0.7427 |
| 100 ≤ | 0.7462 | 0.7454 | 0.7521 | 0.7516 | 0.7595 | 0.7572 | 0.7481 | 0.7473 | 0.7552 | 0.7542 |
| 200 ≤ | 0.7775 | 0.7763 | 0.7736 | 0.7711 | 0.7689 | 0.7664 | 0.7634 | 0.7625 | 0.7693 | 0.7675 |
| 300 ≤ | 0.7654 | 0.7642 | 0.7714 | 0.7695 | 0.7674 | 0.7669 | 0.7670 | 0.7654 | 0.7635 | 0.9618 |
| | 0.7532 | 0.7519 | 0.7598 | 0.7584 | 0.7560 | 0.7551 | 0.7515 | 0.7493 | 0.7543 | 0.7534 |
| All | 0.7571 | 0.7564 | 0.7606 | 0.7593 | 0.7585 | 0.7576 | 0.7552 | 0.7541 | 0.7587 | 0.7569 |
Comparison of experimental results between different methods (IMDB dataset).
| Classifier | Metrics | LDA | W2V | GLV | FPW | ETI | WWE | SFM | DFM |
|---|---|---|---|---|---|---|---|---|---|
| KNN | | 0.8625 | 0.9471 | 0.9368 | 0.9583 | 0.8863 | 0.9568 | 0.9685 | 0.9713 |
| | | 0.8619 | 0.9465 | 0.9354 | 0.9526 | 0.8847 | 0.9551 | 0.9673 | 0.9708 |
| SVM | | 0.8674 | 0.9580 | 0.9476 | 0.9632 | 0.8796 | 0.9587 | 0.9661 | 0.9776 |
| | | 0.8623 | 0.9563 | 0.9461 | 0.9618 | 0.8772 | 0.9575 | 0.9641 | 0.9765 |
Comparison of experimental results between different methods (20 Newsgroups dataset).
| Classifier | Metrics | LDA | W2V | GLV | FPW | ETI | WWE | SFM | DFM |
|---|---|---|---|---|---|---|---|---|---|
| KNN | | 0.6635 | 0.7384 | 0.7236 | 0.7523 | 0.6812 | 0.7316 | 0.7529 | 0.7517 |
| | | 0.6618 | 0.7371 | 0.7219 | 0.7508 | 0.6708 | 0.7209 | 0.7516 | 0.7506 |
| SVM | | 0.6724 | 0.7129 | 0.7196 | 0.7462 | 0.6951 | 0.7311 | 0.7553 | 0.7606 |
| | | 0.6708 | 0.7113 | 0.7183 | 0.7450 | 0.6938 | 0.7294 | 0.7532 | 0.7593 |