Abstract
The rapid growth of Internet technologies has led to an enormous increase in the number of electronic documents used worldwide. To organize and manage big data of unstructured documents effectively and efficiently, text categorization has been employed in recent decades. For text categorization tasks, documents are usually represented using the bag-of-words model, owing to its simplicity. In this representation, feature selection becomes essential because the full vocabulary induces an enormous feature space for the documents. In this paper, we propose a new feature selection method that considers term similarity to avoid selecting redundant terms. Term similarity is measured using a general method such as mutual information, and serves as a second measure in feature selection in addition to term ranking. To balance term ranking and term similarity in feature selection, we use a quadratic programming-based numerical optimization approach. Experimental results demonstrate that considering term similarity is effective and yields higher accuracy than conventional methods.
Keywords: chi-square statistic; information gain; mutual information; quadratic programming; text categorization
Year: 2020 PMID: 33286170 PMCID: PMC7516869 DOI: 10.3390/e22040395
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
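The balance between term ranking and term similarity described in the abstract can be illustrated with a small quadratic program. The following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the commonly used objective of minimizing (1 - alpha)/2 · xᵀQx - alpha · fᵀx over nonnegative weights x that sum to 1, where f holds term relevance scores and Q holds pairwise term similarities. The function names, the toy scores, the value of `alpha`, and the projected-gradient solver are all illustrative choices.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:  # holds for a prefix; the last hit gives the threshold
            theta = t
    return [max(v_i - theta, 0.0) for v_i in v]


def qp_term_weights(relevance, similarity, alpha=0.5, step=0.5, iters=2000):
    """Projected gradient descent on the assumed QP objective.

    relevance:  list of per-term scores (hypothetical values here).
    similarity: symmetric positive semidefinite matrix as nested lists.
    alpha:      trade-off between relevance and redundancy (assumed).
    """
    n = len(relevance)
    x = [1.0 / n] * n
    for _ in range(iters):
        grad = [
            (1.0 - alpha) * sum(similarity[i][j] * x[j] for j in range(n))
            - alpha * relevance[i]
            for i in range(n)
        ]
        x = project_to_simplex([x[i] - step * grad[i] for i in range(n)])
    return x


# Toy data: terms 0 and 1 are both relevant but nearly redundant.
relevance = [0.9, 0.85, 0.5, 0.1]
similarity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
weights = qp_term_weights(relevance, similarity)
ranking = sorted(range(len(weights)), key=lambda i: -weights[i])
print(ranking, [round(w, 4) for w in weights])
```

In this toy case, ranking by relevance alone would place term 1 second, whereas the similarity penalty moves the less redundant term 2 ahead of it. Selecting the top-k terms by weight is one way to use such a solution.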
Datasets used in the experiments.
| Datasets | Documents | Terms | Topics | Average Terms in Each Document | Maximum Terms in a Document |
|---|---|---|---|---|---|
| 20NG | 18,774 | 11,745 | 20 | 131.6 | 6216 |
| Reuters10 | 7285 | 5204 | 10 | 48.57 | 464 |
| TDT10 | 7456 | 12,867 | 10 | 174.1 | 1392 |
Figure 1. Experimental comparison result of naive Bayes classifier for 20NG dataset.
Figure 2. Experimental comparison result of naive Bayes classifier for Reuters10 dataset.
Figure 3. Experimental comparison result of naive Bayes classifier for TDT10 dataset.
Figure 4. Experimental comparison result of naive Bayes classifier with conventional feature transform methods and the proposed method.
Figure 5. Experimental comparison result of naive Bayes classifier with conventional feature selection method.
Experimental micro-F1 results of naive Bayes classifier when the number of selected terms is 300.
| Datasets | IGFSS | DFS | t-test (MAX) | Proposed |
|---|---|---|---|---|
| 20NG | 0.3757 | 0.5505 | 0.1901 | |
| Reuters10 | 0.8246 | 0.8904 | 0.6293 | |
| TDT10 | 0.8103 | 0.9411 | 0.3576 | |
Experimental macro-F1 results of naive Bayes classifier when the number of selected terms is 300.
| Datasets | IGFSS | DFS | t-test (MAX) | Proposed |
|---|---|---|---|---|
| 20NG | 0.3846 | 0.5665 | 0.1825 | |
| Reuters10 | 0.6220 | 0.7696 | 0.3085 | |
| TDT10 | 0.7502 | 0.9271 | 0.3122 | |
Type I and II errors of the proposed method on the 20NG dataset.
| Topic Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Type I error | 211 | 346 | 371 | 344 | 184 | 192 | 227 | 106 | 80 | 123 |
| Type II error | 112 | 178 | 101 | 142 | 106 | 122 | 109 | 124 | 51 | 70 |
| True Positive | 206 | 211 | 290 | 250 | 277 | 268 | 273 | 271 | 346 | 327 |
| True Negative | 6770 | 6743 | 6769 | 6938 | 6923 | 6896 | 7004 | 7028 | 6985 | 6988 |
Type I and II errors of the proposed method on the Reuters10 dataset.
| Topic Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Type I error | 15 | 31 | 34 | 64 | 76 | 67 | 26 | 32 | 35 | 25 |
| Type II error | 25 | 17 | 6 | 1 | 4 | 3 | 4 | 1 | 1 | 0 |
| True Positive | 1015 | 603 | 92 | 72 | 65 | 54 | 31 | 23 | 20 | 20 |
| True Negative | 1002 | 1406 | 1925 | 1920 | 1912 | 1933 | 1996 | 2001 | 2001 | 2012 |
Type I and II errors of the proposed method on the TDT10 dataset.
| Topic Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Type I error | 73 | 56 | 5 | 3 | 2 | 1 | 65 | 19 | 1 | 7 |
| Type II error | 27 | 7 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| True Positive | 529 | 572 | 348 | 239 | 132 | 127 | 85 | 67 | 66 | 53 |
| True Negative | 1628 | 1602 | 1881 | 1995 | 2103 | 2108 | 2086 | 2151 | 2170 | 2177 |
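Per-topic counts such as those in the error tables above can be converted into micro- and macro-averaged F1 scores. The sketch below is a hedged illustration rather than the paper's evaluation code: it assumes that a Type I error is a false positive and a Type II error is a false negative, uses the TDT10 counts as listed, and introduces the hypothetical helper `f1_scores`. It is not claimed to reproduce the scores reported in the paper.

```python
def f1_scores(tp, fp, fn):
    """Return (micro_f1, macro_f1) from per-topic count lists."""
    # Micro-averaging pools the counts over all topics before computing F1.
    micro = 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))
    # Macro-averaging computes F1 per topic, then takes the unweighted mean.
    per_topic = [
        2 * t / (2 * t + p + n) if (2 * t + p + n) > 0 else 0.0
        for t, p, n in zip(tp, fp, fn)
    ]
    macro = sum(per_topic) / len(per_topic)
    return micro, macro


# TDT10 counts copied from the table (Type I -> FP, Type II -> FN, assumed).
tdt10_fp = [73, 56, 5, 3, 2, 1, 65, 19, 1, 7]
tdt10_fn = [27, 7, 3, 0, 0, 1, 1, 0, 0, 0]
tdt10_tp = [529, 572, 348, 239, 132, 127, 85, 67, 66, 53]

micro_f1, macro_f1 = f1_scores(tdt10_tp, tdt10_fp, tdt10_fn)
print(f"micro-F1 = {micro_f1:.4f}, macro-F1 = {macro_f1:.4f}")
```

Because macro-averaging weights every topic equally, small topics with many false positives (topic 7 here) pull the macro score down more than the micro score.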