Abstract
In this paper we review a wide spectrum of techniques that have been proposed in the literature to enable acceptable recognition of language and text by machines. We discuss many of the term weighting techniques proposed by researchers and explore the mathematical foundations of these methods. Term weighting schemes are broadly classified as supervised or statistical, and we present numerous examples from both categories to highlight the differences between the two approaches. We pay particular attention to the Vector Space Model and its variants, which form the basis of many of the other methods discussed in the paper.
Keywords: Term weighting; Term weighting techniques; Word embedding
Year: 2022 PMID: 35437420 PMCID: PMC9007265 DOI: 10.1007/s11042-022-12538-3
Source DB: PubMed Journal: Multimed Tools Appl ISSN: 1380-7501 Impact factor: 2.577
Fig. 1 The graph allows us to understand the vector relation between countries
Fig. 2 Softmax probability calculation in skip-gram
Fig. 3 CBOW probability calculation in skip-gram
Analysis of one-gram words with respect to ranking in a corpus with different documents
| Algorithms | Cacic | Krapvin2008 | Schutz2008 | Wicc | WikiNews |
|---|---|---|---|---|---|
| tf | | | | | |
| tf-idf | 0.22 | 0.25 | 0.21 | 0.14 | 0.12 |
| mtf-idf | 0.08 | 0.02 | 0.04 | 0.05 | 0.12 |
| tf-midf | 0.13 | 0.31 | 0.07 | 0.15 | 0.12 |
| mtf-midf | 0.07 | 0.021 | 0.04 | 0.05 | 0.12 |
| YAKE | 0.23 | 0.23 | 0.2 | 0.32 | 0.43 |
Measuring accuracy of different models for classification
| Algorithms | NewsGroup | BBC news |
|---|---|---|
| tf-idf | | |
| tf-pf | 0.82 | 0.965 |
| tf-rf | 0.815 | 0.965 |
| tf-icf | 0.827 | 0.974 |
| tf-chi2 | 0.814 | 0.966 |
| tf-binicf | 0.79 | 0.964 |
| tf-rrf | 0.816 | 0.971 |
| CEW | 0.59 ± 0.01 | — |
Average cosine similarity between different labels in the newsgroup corpus
| Algorithms | Datasets | rec.sport.baseball | talk.politics.mideast | talk.politics.guns | rec.sport.hockey |
|---|---|---|---|---|---|
| tf-idf | rec.sport.baseball | 0.0194 | 0.016 | 0.014 | 0.016 |
| | talk.politics.mideast | 0.016 | 0.0272 | 0.018 | 0.0175 |
| | talk.politics.guns | 0.014 | 0.018 | 0.022 | 0.163 |
| | rec.sport.hockey | 0.016 | 0.017 | 0.016 | 0.023 |
| mtf-idf | rec.sport.baseball | 0.11 | 0.082 | 0.09 | 0.101 |
| | talk.politics.mideast | 0.08 | 0.09 | 0.08 | 0.07 |
| | talk.politics.guns | 0.09 | 0.08 | 0.1 | 0.08 |
| | rec.sport.hockey | 0.101 | 0.07 | 0.08 | 0.109 |
| tf-midf | rec.sport.baseball | 0.31 | 0.34 | 0.34 | 0.306 |
| | talk.politics.mideast | 0.34 | 0.42 | 0.4 | 0.346 |
| | talk.politics.guns | 0.34 | 0.4 | 0.407 | 0.342 |
| | rec.sport.hockey | 0.306 | 0.346 | 0.342 | 0.316 |
| mtf-midf | rec.sport.baseball | 0.44 | 0.467 | 0.483 | 0.449 |
| | talk.politics.mideast | 0.467 | 0.536 | 0.533 | 0.474 |
| | talk.politics.guns | 0.483 | 0.533 | 0.533 | 0.489 |
| | rec.sport.hockey | 0.449 | 0.474 | 0.489 | 0.464 |
| BM25 score | rec.sport.baseball | 9.47 | 9.125 | 7.098 | 7.856 |
| | talk.politics.mideast | 6.982 | 27.97 | 12.62 | 7.002 |
| | talk.politics.guns | 6.55 | 15.105 | 15.38 | 6.499 |
| | rec.sport.hockey | 6.876 | 8.9 | 6.832 | 11.97 |
| word2vec skip-gram | rec.sport.baseball | 0.209 | 0.074 | -0.004 | 0.139 |
| | talk.politics.mideast | 0.074 | 0.207 | 0.168 | 0.142 |
| | talk.politics.guns | -0.004 | 0.168 | 0.222 | 0.103 |
| | rec.sport.hockey | 0.139 | 0.142 | 0.103 | 0.197 |
| word2vec CBOW | rec.sport.baseball | 0.369 | 0.33 | 0.299 | 0.253 |
| | talk.politics.mideast | 0.33 | 0.365 | 0.33 | 0.266 |
| | talk.politics.guns | 0.299 | 0.332 | 0.407 | 0.328 |
| | rec.sport.hockey | 0.253 | 0.266 | 0.328 | 0.335 |
For BM25 we measure the score instead of cosine similarity
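To make the quantity reported in the table above concrete, the following is a minimal sketch of how an average cosine similarity between two labels could be computed from tf-idf vectors. It is not the paper's code: the toy corpus, the function names (`tfidf_vectors`, `cosine`, `average_similarity`), the particular tf-idf variant (length-normalised term frequency times `log(N / df)`), and the choice to skip a document's comparison with itself are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_similarity(group_a, group_b):
    """Mean cosine similarity over all document pairs drawn from two label groups.

    Assumption: a document is never compared with itself.
    """
    pairs = [(u, v) for u, v in product(group_a, group_b) if u is not v]
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy corpus standing in for two newsgroup labels.
corpus = {
    "baseball": [["pitcher", "threw", "ball"], ["ball", "hit", "bat"]],
    "hockey": [["puck", "hit", "ice"], ["goalie", "stopped", "puck"]],
}

# Weight all documents together so idf is computed over the whole corpus.
all_docs = [doc for docs in corpus.values() for doc in docs]
all_vecs = tfidf_vectors(all_docs)
by_label = {"baseball": all_vecs[:2], "hockey": all_vecs[2:]}

for a, b in product(by_label, repeat=2):
    print(a, b, round(average_similarity(by_label[a], by_label[b]), 4))
```

Each cell of a similarity table of this kind would then correspond to one call of `average_similarity` for a row label and a column label. For BM25, which produces a ranking score rather than a normalised similarity, the `cosine` call would be replaced by the score itself, which is why those rows are on a different scale.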