Subhajit Dey Sarkar, Saptarsi Goswami, Aman Agarwal, Javed Aktar.
Abstract
With the proliferation of unstructured data, text classification (or text categorization) has found many applications, including topic classification, sentiment analysis, authorship identification, and spam detection. Many classification algorithms are available, and naïve Bayes remains one of the oldest and most popular. It is simple to implement and requires less training data. However, the literature shows that naïve Bayes performs poorly compared to other classifiers in text classification, which makes the classifier unusable in practice despite the simplicity and intuitiveness of the model. In this paper, we propose a two-step feature selection method: a univariate feature selection step first reduces the search space, and a feature clustering step then selects relatively independent feature sets. We demonstrate the effectiveness of the method through a thorough evaluation and comparison over 13 datasets. The resulting performance improvement makes naïve Bayes comparable or superior to other classifiers. The proposed algorithm is also shown to outperform traditional methods such as the greedy search based wrapper and CFS.
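
The two-step idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which clustering algorithm is used, so a greedy correlation-threshold grouping stands in for the clustering step, and the names `chi2_score`, `pearson`, `two_step_select`, and the `max_corr` parameter are hypothetical.

```python
from math import sqrt

def chi2_score(column, labels):
    """Chi-squared statistic between term presence (count > 0) and class label."""
    n = len(labels)
    present = [value > 0 for value in column]
    score = 0.0
    for cls in set(labels):
        for flag in (True, False):
            observed = sum(1 for p, y in zip(present, labels) if p == flag and y == cls)
            expected = present.count(flag) * labels.count(cls) / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score

def pearson(a, b):
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0

def two_step_select(X, labels, top_k, max_corr=0.8):
    """X: list of documents, each a list of term counts. Returns kept term indices."""
    columns = [list(col) for col in zip(*X)]
    # Step 1 (univariate filter): keep the top_k terms by chi-squared score.
    scores = [chi2_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:top_k]
    # Step 2 (assumed stand-in for clustering): a term is kept only if it is not
    # strongly correlated with any term that has already been kept.
    selected = []
    for j in ranked:
        if all(abs(pearson(columns[j], columns[s])) < max_corr for s in selected):
            selected.append(j)
    return selected

# Toy data: terms 0 and 1 are exact duplicates, term 3 carries no class signal.
X = [
    [2, 2, 1, 1],
    [1, 1, 1, 0],
    [3, 3, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
print(two_step_select(X, labels, top_k=3))  # [0, 2]
```

On the toy data, step 1 drops the uninformative term 3 and step 2 drops term 1 because it duplicates term 0, which is the kind of redundancy that violates the independence assumption of naïve Bayes.
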
Year: 2014 PMID: 27433512 PMCID: PMC4897287 DOI: 10.1155/2014/717092
Source DB: PubMed Journal: Int Sch Res Notices ISSN: 2356-7872
Comparison of classifiers based on classification accuracy.
| Classification algorithms | CNAE-9 | SPAMHAM | Hotel dataset |
|---|---|---|---|
| Decision tree | 46% | 87% | 47% |
| SVM | 83% | 90% | 60% |
| Naive Bayes | 19.8% | 13% | 40% |
| k-NN | 77% | 90% | 53% |
Term document matrix example.
| Document | | | | |
|---|---|---|---|---|
| | 1.1 | 0 | 2 | 3 |
| | 2.3 | 0 | 0 | .5 |
| | 1.1 | 0 | 0 | .25 |
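
As a rough illustration of how a term document matrix is built, the sketch below counts whitespace-separated tokens per document. It uses raw counts only, so it does not reproduce the fractional entries in the table above, and both the function name `term_document_matrix` and the sample documents are made up for the example.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Hypothetical sample documents, one row per document in the result.
docs = ["cheap pills cheap", "meeting at noon", "cheap meeting"]
vocab, matrix = term_document_matrix(docs)
print(vocab)   # ['at', 'cheap', 'meeting', 'noon', 'pills']
print(matrix)  # [[0, 2, 0, 0, 1], [1, 0, 1, 1, 0], [0, 1, 1, 0, 0]]
```
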
Characteristics of the datasets.
| Datasets | Number of documents | Number of terms | Number of classes/categories |
|---|---|---|---|
| CNAE-9 | 1080 | 856 | 9 |
| Hotel | 50 | 3360 | 2 |
| Gender | 3232 | 100 | 2 |
| Prosncons | 2000 | 1493 | 2 |
| CookWare | 50 | 2370 | 2 |
| MyMail | 194 | 4466 | 2 |
| Reuters* | 279 | 3170 | 3 |
| Computers | 50 | 3358 | 2 |
| Flipkart | 400 | 3043 | 2 |
| SpamHam | 5572 | 6631 | 2 |
| Books | 50 | 3300 | 2 |
| DBWorld | 64 | 3723 | 2 |
| NYdtm | 3104 | 5587 | 27 |
Most of the datasets can be found at [11, 12].
*Data for three classes have been used for Reuters.
Classification accuracy rate of naïve Bayes during the three phases of experiment.
| Datasets | Naïve Bayes | Chi-squared | FS-CHICLUST |
|---|---|---|---|
| CNAE-9 | 19% | 53% | 68% |
| Hotel | 40% | 60% | 87% |
| Gender | 57% | 64% | 69% |
| Prosncons | 50% | 70% | 76% |
| CookWare | 46% | 67% | 80% |
| MyMail | 48% | 86% | 93% |
| Reuters-21578 | 12% | 60% | 76% |
| Computers | 38% | 53% | 93% |
| Flipkart | 39% | 75% | 84% |
| SpamHam | 13% | 92% | 95% |
| Books | 40% | 67% | 93% |
| DBWorld | 55% | 85% | 90% |
| NYdtm | .4% | 3% | |
Figure 1: Improvement FS-CHICLUST.
Feature reduction of naïve Bayes after the three phases of experiment.
| Datasets | Total features | Features using chi-square | Using FS-CHICLUST |
|---|---|---|---|
| CNAE-9 | 856 | 80 | 32 |
| Hotel | 3360 | 6 | 3 |
| Gender | 100 | 26 | 12 |
| Prosncons | 1493 | 23 | 11 |
| CookWare | 2370 | 6 | 3 |
| MyMail | 4466 | 25 | 14 |
| Reuters-21578 | 3170 | 38 | 11 |
| Computers | 3358 | 8 | 4 |
| Flipkart | 3043 | 25 | 14 |
| SpamHam | 6631 | 150 | 30 |
| Books | 3300 | 4 | 2 |
| DBWorld | 3723 | 5 | 2 |
| NYdtm | 5587 | 44 | 6 |
Summary of feature reduction and classification accuracy improvement.
| Datasets | % reduction | % improvement in classification accuracy |
|---|---|---|
| CNAE-9 | 96.3% | 258% |
| Hotel | 99.9% | 118% |
| Gender | 88% | 21% |
| Prosncons | 99.3% | 52% |
| CookWare | 99.9% | 74% |
| MyMail | 99.7% | 94% |
| Reuters-21578 | 99.7% | 533% |
| Computers | 99.9% | 145% |
| Flipkart | 99.5% | 115% |
| SpamHam | 99.5% | 631% |
| Books | 99.9% | 133% |
| DBWorld | 99.9% | 66% |
| NYdtm | 99.2% | 1000% |
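
The source does not spell out how the two percentages are computed. The sketch below assumes the usual relative-change formulas, which reproduce the CNAE-9 row when fed the figures reported elsewhere in this record (856 total features, 32 selected, 19% baseline accuracy, 68% accuracy for NB + FS-CHICLUST). The function names are illustrative.

```python
def percent_reduction(total_features, selected_features):
    """Share of the original features that were removed, in percent."""
    return 100.0 * (total_features - selected_features) / total_features

def percent_improvement(baseline_accuracy, new_accuracy):
    """Relative accuracy gain over the baseline, in percent."""
    return 100.0 * (new_accuracy - baseline_accuracy) / baseline_accuracy

# CNAE-9: 856 -> 32 features, accuracy 19% -> 68%.
print(round(percent_reduction(856, 32), 1))  # 96.3
print(round(percent_improvement(19, 68)))    # 258
```

Because the improvement is relative to the baseline, datasets where plain naïve Bayes starts very low (for example 13% on SpamHam) show very large percentages even when the absolute gain is comparable to other datasets.
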
Comparison of proposed method with other classifiers.
| Datasets | NB + FS-CHICLUST | Naïve Bayes | DT | SVM | kNN |
|---|---|---|---|---|---|
| CNAE-9 | 68% | 19% | 46% | 83% | 77% |
| Hotel | 87% | 40% | 47% | 60% | 53% |
| Gender | 69% | 57% | 59% | 67% | 62% |
| Prosncons | 76% | 50% | 65% | 72% | 71% |
| CookWare | 80% | 46% | 47% | 73% | 53% |
| MyMail | 93% | 48% | 78% | 88% | 85% |
| Reuters | 76% | 12% | 73% | 73% | 56% |
| Computers | 93% | 38% | 47% | 67% | 60% |
| Flipkart | 84% | 39% | 63% | 71% | 67% |
| SpamHam | 95% | 13% | 87% | 90% | 90% |
| Books | 93% | 40% | 40% | 53% | 47% |
| DBWorld | 90% | 55% | 75% | 85% | 65% |
| NYdtm | | .4% | 24% | 6% | 25% |
Figure 2: FS-CHICLUST with other classifiers.
Comparison of proposed method with other classifiers (mean rank).
| Algorithms | Mean rank |
|---|---|
| FS-CHICLUST | 1.18 |
| SVM | 2 |
| kNN | 2.77 |
| DT | 3.91 |
| NB | 4.95 |
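
The record does not state how the mean ranks were obtained. A common approach, assumed here, is to rank the classifiers within each dataset by accuracy (1 = best, ties sharing the average rank) and then average the ranks over datasets. The sketch uses only three datasets from the accuracy figures reported in this record, so its output is not expected to match the table above; the function names are hypothetical.

```python
def average_ranks(scores):
    """Rank scores in descending order (1 = best); ties share the average rank."""
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for name in order[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks

def mean_ranks(table):
    """Average each classifier's per-dataset rank over all datasets in the table."""
    totals = {}
    for scores in table.values():
        for name, rank in average_ranks(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(table) for name, total in totals.items()}

# Accuracy (%) for three of the datasets, as reported in this record.
accuracy = {
    "CNAE-9": {"FS-CHICLUST": 68, "NB": 19, "DT": 46, "SVM": 83, "kNN": 77},
    "Hotel": {"FS-CHICLUST": 87, "NB": 40, "DT": 47, "SVM": 60, "kNN": 53},
    "Gender": {"FS-CHICLUST": 69, "NB": 57, "DT": 59, "SVM": 67, "kNN": 62},
}
result = mean_ranks(accuracy)
for name in sorted(result, key=result.get):
    print(name, round(result[name], 2))
```
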
P value corresponding to classification accuracy and execution time.
| Metric | | |
|---|---|---|
| Classification accuracy | 0.0008321 | 0.004 |
| Execution time | 0.000581 | 0.008 |

| Datasets | Wrapper forward search greedy: execution time | Wrapper forward search greedy: classification accuracy | NB + FS-CHICLUST: execution time | NB + FS-CHICLUST: classification accuracy |
|---|---|---|---|---|
| CNAE-9 | 53.76 min | 54% | 0.81 min | 68% |
| Hotel | 44.89 min | 60% | 3.92 min | 87% |
| Gender | 3.10 min | 59% | 0.06 min | 69% |
| Prosncons | 71.82 min | 51% | 1.70 min | 76% |
| CookWare | 48.19 min | 53% | 3 min | 80% |
| MyMail | 120.41 min | 85% | 5.10 min | 93% |
| Reuters | 119.30 min | 19% | 3.80 min | 76% |
| Computers | 65.75 min | 40% | 4.25 min | 93% |
| Flipkart | 105.48 min | 67% | 4.12 min | 84% |
| SpamHam | 191.41 min | 14% | 5.15 min | 95% |
| Books | 42 min | 33% | 3.85 min | 93% |
| DBWorld | 50.52 min | 60% | 8.1 min | 90% |

| Datasets | Filter (CFS): execution time | Filter (CFS): classification accuracy | NB + FS-CHICLUST: execution time | NB + FS-CHICLUST: classification accuracy |
|---|---|---|---|---|
| CNAE-9 | 85.80 min | 68% | 0.81 min | 68% |
| Gender | 22.02 min | 55% | 0.06 min | 69% |
| Prosncons | 293.40 min | 70% | 1.70 min | 76% |
| CookWare | 308.34 min | 73% | 3 min | 80% |
| DBWorld | 24 min | 75% | 8.1 min | 90% |
| Books | 348 min | 53% | 3.85 min | 93% |
| Hotel | 382.4 min | 60% | 3.92 min | 87% |
| Computers | 392 min | 67% | 4.25 min | 93% |