Wanting Zhou, Hanbin Wang, Hongguang Sun, Tieli Sun.
Abstract
Text representation is a key task in natural language processing (NLP). Traditional feature extraction and weighting methods often use the bag-of-words (BoW) model, which loses semantic information and produces high-dimensional, sparse vectors. A popular way to address these problems is to use deep learning methods. In this paper, feature weighting, word embedding, and topic models are combined to propose an unsupervised text representation method named the feature, probability, and word embedding (FPW) method. The main idea is to obtain word vectors with the word embedding technique Word2Vec and then combine them with the feature weighting scheme TF-IDF (term frequency-inverse document frequency) and the topic model LDA (latent Dirichlet allocation). Compared with traditional feature engineering, the proposed method not only increases the expressive power of the vector space model but also reduces the dimensionality of the document vector, addressing the insufficient information, high dimensionality, and high sparsity of BoW. We apply the proposed method to text categorization and verify its validity.
Keywords: feature weighting; latent Dirichlet allocation; text representation; word embedding
Year: 2019 PMID: 31466389 PMCID: PMC6749449 DOI: 10.3390/s19173728
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Basic idea of the feature and word embedding (FW) model and the topic probability and word embedding (PW) model. BoW: bag-of-words model.
Figure 2. Assuming that there are 6 documents (texts) in a corpus, the (a) document-topic distribution and (b) topic-word distribution are as depicted.
Figure 3. The feature, probability, and word embedding (FPW) model and the FW and PW conjunction (FPC) model.
Figure 4. Second kind of feature probability (FP2) model.
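The abstract's core idea, combining Word2Vec word vectors with TF-IDF feature weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `tfidf_weights` and `weighted_doc_vector`, the toy two-dimensional embeddings, and the choice of a weight-normalized sum are all assumptions made for illustration, and the LDA topic-probability component is omitted.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def weighted_doc_vector(doc_weights, embeddings, dim):
    """Sum word vectors scaled by their weights, normalized by the total weight."""
    vec = [0.0] * dim
    total = 0.0
    for word, weight in doc_weights.items():
        if word not in embeddings:
            continue  # skip words without an embedding
        total += weight
        for i, x in enumerate(embeddings[word]):
            vec[i] += weight * x
    return [v / total for v in vec] if total > 0 else vec


# Hypothetical toy embeddings; real Word2Vec vectors have hundreds of dimensions.
embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "laptop": [0.5, 0.5]}
docs = [["good", "laptop"], ["bad", "laptop"]]

weights = tfidf_weights(docs)
doc_vectors = [weighted_doc_vector(w, embeddings, dim=2) for w in weights]
print(doc_vectors)
```

Because "laptop" occurs in both toy documents, its inverse document frequency is zero, so each document vector is determined by its distinguishing word. The resulting vector has the embedding dimension rather than the vocabulary size, which is the dimensionality reduction the abstract refers to.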
Results of Amazon_6 in two classes for 4000 texts by k-nearest neighbor (KNN): accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.924 | 0.9239 | 0.923 | 0.9229 | 0.9125 | 0.9124 | 0.9003 | 0.9001 |
| P2 | 0.9078 | 0.9075 | 0.9075 | 0.9073 | 0.9078 | 0.9075 | 0.9083 | 0.908 |
| PW | 0.9555 | 0.9554 | 0.9543 | 0.9542 | 0.9548 | 0.9547 | 0.955 | 0.9549 |
| FP2 | 0.9335 | 0.9334 | 0.9273 | 0.9272 | 0.9208 | 0.9207 | 0.907 | 0.9069 |
| FPW |  |  | 0.9555 | 0.9549 |  |  |  |  |
| FPC | 0.9565 | 0.9564 |  |  | 0.9557 | 0.9557 | 0.9575 | 0.9574 |

| Model | Acc | F1 |
|---|---|---|
|  | 0.7798 | 0.7683 |
| LDA | 0.954 | 0.9539 |
| Word2Vec | 0.9538 | 0.9537 |
Results of Amazon_6 in two classes for 4000 texts by support vector machine (SVM): accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.89 | 0.8888 | 0.9038 | 0.9030 | 0.9218 | 0.9196 | 0.914 | 0.9168 |
| P2 | 0.6845 | 0.6831 | 0.6853 | 0.6839 | 0.6853 | 0.6838 | 0.6853 | 0.6837 |
| PW | 0.959 | 0.9589 | 0.96 | 0.96 | 0.9595 | 0.9594 | 0.9605 | 0.9604 |
| FP2 | 0.91 | 0.9003 | 0.9105 | 0.9099 | 0.9285 | 0.9282 | 0.9218 | 0.9218 |
| FPW |  |  |  |  |  |  |  |  |
| FPC | 0.967 | 0.968 | 0.9685 | 0.9687 | 0.9705 | 0.9717 | 0.9715 | 0.9732 |

| Model | Acc | F1 |
|---|---|---|
|  | 0.955 | 0.9607 |
| LDA | 0.9488 | 0.9487 |
| Word2Vec | 0.9587 | 0.967 |
Results of Amazon_6 in four classes for 5705 texts by KNN: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.9063 | 0.8818 | 0.8964 | 0.8721 | 0.9051 | 0.8817 | 0.9018 | 0.8786 |
| P2 | 0.8676 | 0.8504 | 0.8678 | 0.8509 | 0.8676 | 0.8509 | 0.8826 | 0.851 |
| PW |  |  |  |  |  |  |  |  |
| FP2 | 0.9167 | 0.8976 | 0.9078 | 0.8823 | 0.9167 | 0.8972 | 0.9125 | 0.8918 |
| FPW | 0.9427 | 0.931 | 0.9427 | 0.9312 | 0.9435 | 0.9319 | 0.9433 | 0.9326 |
| FPC | 0.9407 | 0.9283 | 0.9411 | 0.0295 | 0.9395 | 0.9276 | 0.9417 | 0.9304 |

| Model | Acc | F1 |
|---|---|---|
|  | 0.8292 | 0.8064 |
| LDA | 0.9253 | 0.9103 |
| Word2Vec | 0.9404 | 0.9243 |
Results of Amazon_6 in four classes for 5705 texts by SVM: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.8038 | 0.7662 | 0.8068 | 0.7733 | 0.8334 | 0.8044 | 0.858 | 0.809 |
| P2 | 0.6206 | 0.531 | 0.6205 | 0.5309 | 0.609 | 0.5348 | 0.6196 | 0.5294 |
| PW | 0.9468 | 0.9343 | 0.9442 | 0.9336 | 0.9465 | 0.9346 | 0.9472 | 0.9352 |
| FP2 | 0.835 | 0.7993 | 0.8329 | 0.7969 | 0.8502 | 0.8213 | 0.8448 | 0.8197 |
| FPW |  |  |  |  |  |  |  |  |
| FPC | 0.9488 | 0.9366 | 0.9493 | 0.9375 | 0.9497 | 0.9383 | 0.95 | 0.939 |

| Model | Acc | F1 |
|---|---|---|
|  | 0.94 | 0.9366 |
| LDA | 0.936 | 0.9215 |
| Word2Vec | 0.9349 | 0.919 |
Results of FudanNLP in three classes for 1500 texts by KNN: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.94 | 0.9383 | 0.9353 | 0.934 | 0.93 | 0.9267 | 0.9253 | 0.9221 |
| P2 | 0.9127 | 0.9119 | 0.9227 | 0.9214 | 0.9147 | 0.9139 | 0.9153 | 0.9145 |
| PW | 0.96 |  | 0.9527 | 0.9524 |  |  |  |  |
| FP2 | 0.934 | 0.9383 | 0.934 | 0.9326 | 0.9307 | 0.9284 | 0.926 | 0.9228 |
| FPW | 0.9493 | 0.9489 |  | 0.9598 | 0.95 | 0.9496 | 0.9513 | 0.951 |
| FPC |  | 0.9482 | 0.9527 |  |  | 0.9516 |  | 0.9494 |

| Model | Acc | F1 |
|---|---|---|
|  | 0.5047 | 0.581 |
| LDA | 0.9586 | 0.9584 |
| Word2Vec | 0.9467 | 0.9457 |
Results of FudanNLP in three classes for 1500 texts by SVM: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.9393 | 0.9368 | 0.9387 | 0.9358 | 0.9387 | 0.9465 | 0.9473 | 0.9486 |
| P2 | 0.602 | 0.5736 | 0.5647 | 0.5188 | 0.604 | 0.5774 | 0.6053 | 0.578 |
| PW | 0.9573 | 0.9569 | 0.9587 | 0.9585 | 0.9573 | 0.9568 | 0.9567 | 0.9561 |
| FP2 | 0.938 | 0.9353 | 0.9393 | 0.9359 | 0.9493 | 0.9451 | 0.9513 | 0.9486 |
| FPW | 0.964 | 0.9635 | 0.9693 | 0.9685 | 0.9653 | 0.9649 | 0.9653 | 0.9649 |
| FPC |  |  |  |  |  |  |  |  |

| Model | Acc | F1 |
|---|---|---|
|  | 0.962 | 0.9613 |
| LDA | 0.9406 | 0.9397 |
| Word2Vec | 0.9567 | 0.9564 |
Results of FudanNLP in 17 classes for 4117 texts by KNN: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.7971 | 0.532 | 0.793 | 0.5306 | 0.8022 | 0.5328 | 0.7996 | 0.5316 |
| P2 | 0.7566 | 0.3719 | 0.7575 | 0.3703 | 0.7583 | 0.3717 | 0.7585 | 0.3726 |
| PW | 0.796 | 0.4382 | 0.7964 | 0.4486 | 0.7928 | 0.4355 | 0.7921 | 0.4329 |
| FP2 | 0.7973 | 0.5246 | 0.7938 | 0.5213 | 0.8023 | 0.5366 | 0.8001 | 0.5343 |
| FPW | 0.8303 |  |  |  | 0.8337 |  |  |  |
| FPC |  | 0.5761 | 0.8351 | 0.517 |  | 0.5612 | 0.8321 | 0.5308 |

| Model | Acc | F1 |
|---|---|---|
|  | 0.6408 | 0.4968 |
| LDA | 0.7979 | 0.4211 |
| Word2Vec | 0.8153 | 0.5749 |
Results of FudanNLP in 17 classes for 4117 texts by SVM: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.7931 | 0.5369 | 0.8037 | 0.5552 | 0.8091 | 0.5583 | 0.8156 | 0.5856 |
| P2 | 0.5317 | 0.3603 | 0.5486 | 0.3662 | 0.5563 | 0.3672 | 0.5108 | 0.3524 |
| PW | 0.8012 | 0.4436 | 0.8002 | 0.4431 | 0.8011 | 0.4437 | 0.8005 | 0.4434 |
| FP2 | 0.7931 | 0.5356 | 0.804 | 0.5355 | 0.8094 | 0.5707 | 0.8152 | 0.5815 |
| FPW | 0.8323 |  | 0.8396 |  | 0.8432 | 0.6334 |  | 0.6494 |
| FPC |  | 0.6081 |  | 0.6230 |  |  | 0.8491 |  |

| Model | Acc | F1 |
|---|---|---|
|  | 0.7628 | 0.6325 |
| LDA | 0.7346 | 0.432 |
| Word2Vec | 0.828 | 0.606 |
Results of laptops in ChnSentiCorp by KNN: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.8083 | 0.8079 | 0.813 | 0.8104 | 0.7833 | 0.7829 | 0.7633 | 0.7632 |
| P2 | 0.7605 | 0.7602 | 0.7685 | 0.768 | 0.7623 | 0.762 | 0.7623 | 0.7619 |
| PW | 0.816 | 0.8157 | 0.8153 | 0.8147 | 0.8108 | 0.8105 | 0.8148 | 0.8144 |
| FP2 | 0.81 | 0.8096 | 0.811 | 0.8083 | 0.7868 | 0.7864 | 0.7675 | 0.7675 |
| FPW |  |  |  |  |  |  |  |  |
| FPC | 0.8288 | 0.8284 | 0.8343 | 0.8326 | 0.8108 | 0.8102 | 0.8095 | 0.8092 |

| Model | Acc | F1 |
|---|---|---|
|  | 0.5973 | 0.5391 |
| LDA | 0.8065 | 0.8059 |
| Word2Vec | 0.8178 | 0.8173 |
Results of laptops in ChnSentiCorp by SVM: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW | 0.8575 | 0.8574 | 0.8805 | 0.8804 | 0.8548 | 0.8541 | 0.8598 | 0.8637 |
| P2 | 0.6413 | 0.6405 | 0.6187 | 0.6184 | 0.6415 | 0.6407 | 0.6415 | 0.6335 |
| PW | 0.8305 | 0.8303 | 0.8425 | 0.8422 | 0.8288 | 0.8286 | 0.8308 | 0.8306 |
| FP2 | 0.8562 | 0.8559 | 0.8825 | 0.8824 | 0.8545 | 0.8574 | 0.8608 | 0.8603 |
| FPW | 0.869 | 0.8694 | 0.8878 | 0.8886 |  |  | 0.871 | 0.8694 |
| FPC |  |  |  |  | 0.868 | 0.8679 |  |  |

| Model | Acc | F1 |
|---|---|---|
|  | 0.8555 | 0.8552 |
| LDA | 0.8195 | 0.8185 |
| Word2Vec | 0.822 | 0.8235 |
Results of books in ChnSentiCorp by KNN: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW |  |  |  |  | 0.8975 | 0.8974 | 0.9 | 0.8999 |
| P2 | 0.7953 | 0.7932 | 0.77 | 0.7679 | 0.7945 | 0.7925 | 0.794 | 0.792 |
| PW | 0.8125 | 0.811 | 0.8208 | 0.8195 | 0.811 | 0.8097 | 0.8088 | 0.8074 |
| FP2 | 0.902 | 0.9019 | 0.8975 | 0.8975 |  |  |  |  |
| FPW | 0.8975 | 0.8974 | 0.8925 | 0.8924 | 0.8878 | 0.8877 | 0.8855 | 0.8853 |
| FPC | 0.8958 | 0.8957 | 0.8938 | 0.8937 | 0.8928 | 0.8927 | 0.8928 | 0.8927 |

| Model | Acc | F1 |
|---|---|---|
|  | 0.5008 | 0.5004 |
| LDA | 0.7833 | 0.7778 |
| Word2Vec | 0.8825 | 0.8823 |
Results of books in ChnSentiCorp by SVM: accuracy (Acc) and F1 at different dimensions.
| Model | 200 Acc | 200 F1 | 300 Acc | 300 F1 | 500 Acc | 500 F1 | 800 Acc | 800 F1 |
|---|---|---|---|---|---|---|---|---|
| FW |  |  |  |  | 0.9175 | 0.9174 | 0.9163 |  |
| P2 | 0.621 | 0.6204 | 0.5918 | 0.5912 | 0.6213 | 0.6204 | 0.6213 | 0.6204 |
| PW | 0.7783 | 0.778 | 0.785 | 0.7845 | 0.7778 | 0.7775 | 0.7778 | 0.7775 |
| FP2 | 0.9195 | 0.9194 | 0.9168 | 0.9167 |  |  |  |  |
| FPW | 0.9115 | 0.9114 | 0.9095 | 0.9094 | 0.9113 | 0.9112 | 0.9143 | 0.9145 |
| FPC | 0.91 | 0.9099 | 0.9153 | 0.9152 | 0.912 | 0.9117 | 0.9153 | 0.9152 |

| Model | Acc | F1 |
|---|---|---|
|  | 0.9058 | 0.9056 |
| LDA | 0.7815 | 0.7812 |
| Word2Vec | 0.89 | 0.8123 |
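
The tables report accuracy alongside F1, and the two can diverge sharply on multi-class data (as in the 17-class FudanNLP results). The following minimal sketch shows how both metrics could be computed. The source does not state which F1 averaging was used; macro-averaging is an assumption here, and the function names and toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, not taken from the paper's datasets.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(acc)           # 0.75
print(round(f1, 4))  # 0.7333
```

In this toy case the minority class "b" has a lower per-class F1 (2/3) than class "a" (0.8), which pulls the macro average below the accuracy. The same effect grows as the number of classes and the class imbalance increase.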