Guozhong Feng, Baiguo An, Fengqin Yang, Han Wang, Libiao Zhang.
Abstract
Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. Traditional feature selection methods such as information gain and chi-square often use the number of documents that contain a particular term (i.e. the document frequency). However, the frequency with which a given term appears in each document has not been fully investigated, even though it is a promising feature for producing accurate classifications. In this paper, we propose a new feature selection scheme based on a term event multinomial naive Bayes probabilistic model. Under the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. We then derive a feature selection measurement for each term by replacing the inner parameters with their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), numerical experiments with two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms the representative feature selection methods.
Year: 2017 PMID: 28379986 PMCID: PMC5381872 DOI: 10.1371/journal.pone.0174341
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The term frequencies of all the distinct words.
| What | do | you | at | work | I | answer | telephones | and | some | typing |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
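The difference between term frequency (TF) and document frequency (DF) can be sketched with the counts in the table above. The two example sentences below are an assumption, reconstructed so that they reproduce the tabulated counts; they are not quoted from the paper, and the variable names are illustrative only.

```python
from collections import Counter

# Assumed example documents, chosen to reproduce the counts in the table.
docs = [
    "What do you do at work",
    "I answer telephones and do some typing",
]

# Column order taken from the table header.
vocab = ["What", "do", "you", "at", "work", "I",
         "answer", "telephones", "and", "some", "typing"]

# Term frequency: how many times each word occurs in each document.
tf = [Counter(doc.split()) for doc in docs]
tf_rows = [[counts[w] for w in vocab] for counts in tf]

# Document frequency: how many documents contain each word at least once.
df = [sum(1 for counts in tf if counts[w] > 0) for w in vocab]

for row in tf_rows:
    print(row)
print(df)
```

Here "do" occurs twice in the first document, but DF counts that document only once, which is the information a DF-based measure discards.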
The numbers of the documents.
| | Class: positive | Class: negative |
|---|---|---|
| occur | | |
| not occur | | |
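A 2x2 table of this shape is the usual input to DF-based measures such as chi-square. The following is a minimal sketch of the standard chi-square statistic for a term and a binary class label; the cell names `a`, `b`, `c`, `d` and the example counts are hypothetical, since the cell symbols of the original table are not available here.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the positive class where the term occurs
    b: documents in the negative class where the term occurs
    c: documents in the positive class where the term does not occur
    d: documents in the negative class where the term does not occur
    (All four names are hypothetical labels for the table cells.)
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the term carries no usable signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts, for illustration only.
score = chi_square_2x2(30, 10, 20, 40)
print(round(score, 4))
```

A larger value indicates stronger dependence between the term's occurrence and the class label.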
Fig 1. Graphic model representing the term event model with the NB assumption.
Fig 2. Block diagram of RP.
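The factorization mentioned in the abstract can be illustrated with the standard multinomial naive Bayes model, in which the log prediction probability ratio splits into a prior term plus one contribution per term. This is a sketch of that general property only, not the paper's RP measure; the function names and the toy parameters `theta_pos` and `theta_neg` are assumptions made for illustration.

```python
import math


def log_ratio_terms(tf, theta_pos, theta_neg, prior_pos=0.5):
    """Split log P(pos|d)/P(neg|d) into a prior part and per-term parts."""
    prior = math.log(prior_pos / (1.0 - prior_pos))
    contrib = {t: n * math.log(theta_pos[t] / theta_neg[t])
               for t, n in tf.items()}
    return prior, contrib


def log_ratio_direct(tf, theta_pos, theta_neg, prior_pos=0.5):
    """Same ratio computed from the two class log-likelihoods.

    The multinomial coefficient is identical for both classes and cancels.
    """
    log_pos = math.log(prior_pos) + sum(
        n * math.log(theta_pos[t]) for t, n in tf.items())
    log_neg = math.log(1.0 - prior_pos) + sum(
        n * math.log(theta_neg[t]) for t, n in tf.items())
    return log_pos - log_neg


# Toy term probabilities per class (hypothetical values).
theta_pos = {"do": 0.5, "work": 0.3, "typing": 0.2}
theta_neg = {"do": 0.4, "work": 0.1, "typing": 0.5}
tf = {"do": 2, "work": 1}  # term frequencies of one document

prior, contrib = log_ratio_terms(tf, theta_pos, theta_neg)
total = prior + sum(contrib.values())
print(contrib, total)
```

Because each term contributes an independent summand weighted by its term frequency, a per-term score can be read off the ratio, which is the kind of structure a TF-based selection measure can exploit.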
Summary of the feature selection methods.
CHI and IG are based on document frequency (DF); the others are based on term frequency (TF).
| Method | Description |
|---|---|
| CHI | measures the dependence between a term and the document label |
| IG | the number of bits of information obtained for label prediction given a feature |
| RP | our newly proposed scheme based on the term event model and the Gini coefficient |
| WCP | the Gini coefficient of the within-class probability |
| TT | the diversity of the distributions of a term between the specific class and the entire corpus, based on the t-test |
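As an illustration of the IG entry, the sketch below computes the textbook information gain (in bits) of a binary term-occurrence feature for a binary label from document counts. The argument names and example counts are hypothetical, and the paper's exact multi-class formulation may differ.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(occ_pos, occ_neg, absent_pos, absent_neg):
    """Information gain of term occurrence for a binary class label.

    occ_pos / occ_neg: documents containing the term, per class
    absent_pos / absent_neg: documents not containing the term, per class
    (Hypothetical argument names for the four document counts.)
    """
    n = occ_pos + occ_neg + absent_pos + absent_neg
    h_class = entropy([occ_pos + absent_pos, occ_neg + absent_neg])
    h_given_term = (
        (occ_pos + occ_neg) / n * entropy([occ_pos, occ_neg])
        + (absent_pos + absent_neg) / n * entropy([absent_pos, absent_neg])
    )
    return h_class - h_given_term


# Hypothetical counts, for illustration only.
ig = information_gain(30, 10, 20, 40)
print(round(ig, 4))
```

Only whether a term occurs in a document enters this computation, which is why IG is listed among the DF-based methods.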
MPH-20: The categories of the appeal call text records.
| Chaoyang District Government | Dehui Government | City Development and Reform Commission |
| Nanguan District Government | Jiutai District Government | Municipal Public Security Bureau |
| Kuancheng District Government | Nongan Government | Municipal Environmental Protection Bureau |
| Erdao District Government | Jingyue Development Zone | City Water Group |
| Shuangyang District Government | Economic Development Zone | Changchun Gas |
| Lvyuan District Government | Hi-tech Development Zone | City Transit Administration Bureau |
| Yushu Government | Automobile Development Zone | |
Statistical information of the two corpora.
| Corpus | | | | | | |
|---|---|---|---|---|---|---|
| MPH-20 | 20,000 | 24,772 | 43.46 | 32.51 | 10,095 | 9,905 |
| 20 Newsgroups | 18,774 | 61,188 | 243.01 | 489.38 | 9,511 | 9,263 |
MPH-20: Top 20 Chinese terms using each feature selection method.
| RP | WCP | TT | IG | CHI |
|---|---|---|---|---|
| Take an exam | Yushu city | Shuangyang district | Shuangyang district | Shuangyang district |
| Chauffeured car | Shuangyang district | Yushu city | Yushu city | Nongan county |
| Boshuo road | Dehui city | Nongan county | Nongan county | Yushu city |
| Heilin town | Jiutai city | Dehui city | Dehui city | Dehui city |
| Daqing | Nongan county | Jingyue development zone | Erdao district | Jiutai city |
| Shuangde township | Gas corporation | Automobile development zone | Kuancheng district | Automobile development zone |
| Suitcase | Automobile development zone | Nanguan district | Nanguan district | Jingyue development zone |
| Operate | Gas | Chaoyang district | Chaoyang district | Gas |
| Yunshan | Jingyue development zone | Erdao district | Jingyue development zone | Erdao district |
| Cremation | High-tech development zone | Jiutai city | Lvyuan district | Economic development zone |
| Wanjinta township | Economic development zone | Kuancheng district | Automobile development zone | High-tech development zone |
| Kaoshan town | Water group | Lvyuan district | Jiutai city | Lvyuan district |
| Gongpeng town | Driver | Economic development zone | Economic development zone | Nanguan district |
| Gong | Erdao district | High-tech development zone | Gas | Kuancheng district |
| Yuxi street | Taxi | Village | High-tech development zone | Chaoyang district |
| Rename | Jiutai | Gas | Villager | Gas corporation |
| Longjia town | Switch on | Villager | Citizen | Water group |
| Shanghewan | Chaoyang district | Citizen | Village | Water pause |
| Gaming machine | Nanguan district | Water pause | Water pause | Charge |
| Festival | Lvyuan district | Water group | Gas corporation | Taxi |
Fig 3. MPH-20: The classification accuracy values of the five feature selection methods when using the Multinomial NB classifier.
Fig 4. MPH-20: The classification accuracy values of the five feature selection methods when using the SVM classifier.
Fig 5. 20 Newsgroups: The classification accuracy values of the five feature selection methods when using the Multinomial NB classifier.
Fig 6. 20 Newsgroups: The classification accuracy values of the five feature selection methods when using the SVM classifier.
MPH-20: The classification accuracy values (A) and the numbers of features included (N) for the five feature selection methods.
The largest accuracy value and the smallest number of features are highlighted in bold for each classifier.
| RP | WCP | TT | IG | CHI | ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Classifier | ||||||||||
| Multinomial NB | 0.8114 | 0.7848 | 0.7821 | 0.8195 | ||||||
| SVM | 17,242 | 0.8851 | 0.8794 | 13,410 | 0.8796 | 15,326 | 0.8818 | 13,410 | ||
20 Newsgroups: The classification accuracy values (A) and the numbers of features included (N) for the five feature selection methods.
The largest accuracy value and the smallest number of features are highlighted in bold for each classifier.
| RP | WCP | TT | IG | CHI | ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Classifier | ||||||||||
| Multinomial NB | 32,326 | 0.8517 | 0.8459 | 26,938 | 0.8451 | 0.8376 | ||||
| SVM | 26,938 | 0.7547 | 0.7014 | 48,489 | 0.7016 | 48,489 | 0.7007 | 48,489 | ||