Kelin Shen, Peinan Hao, Ran Li.
Abstract
Text classification plays an important role in many big-data applications by automatically classifying massive collections of text documents. However, the high dimensionality and sparsity of text features make efficient classification challenging. In this paper, we propose a compressive sensing (CS)-based model to speed up text classification. By using CS to reduce the size of the feature space, our model achieves low time and space complexity when training a text classifier, and the restricted isometry property (RIP) of CS ensures that pairwise distances between text features are well preserved during dimensionality reduction. In particular, by using structural random matrices (SRMs), CS avoids the computation and memory limitations of constructing random projections. Experimental results demonstrate that CS effectively accelerates text classification with hardly any loss of accuracy.
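For reference, the distance-preservation claim can be written out with the usual textbook statement of the RIP. The symbols below ($\Phi$ for the $M \times N$ sensing matrix, $\delta_{2K}$ for the isometry constant, $K$ for the sparsity level) are standard notation assumed here; they do not appear in the abstract itself.

```latex
% RIP of order 2K applied to the difference of two K-sparse feature vectors u and v:
% u - v is at most 2K-sparse, so the squared distance is distorted by at most
% a factor of (1 \pm \delta_{2K}).
(1 - \delta_{2K})\,\lVert u - v \rVert_2^2
  \;\le\; \lVert \Phi u - \Phi v \rVert_2^2
  \;\le\; (1 + \delta_{2K})\,\lVert u - v \rVert_2^2,
\qquad 0 \le \delta_{2K} < 1 .
```

A small $\delta_{2K}$ means distances between compressed features stay close to distances between the original features, which is what distance-based classifiers rely on.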
Year: 2020 PMID: 32831821 PMCID: PMC7428956 DOI: 10.1155/2020/8879795
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. A typical flow of text classification.
Figure 2. Framework of CS-based text classification.
Algorithm 1. Flow of SRM sensing algorithm.
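The body of Algorithm 1 is not reproduced in this record, so the following is only a minimal sketch of how SRM-style sensing is commonly formulated: randomize the input with sign flips, apply a fast orthonormal transform, then keep a random subset of coefficients. It assumes a Walsh-Hadamard transform over the whole vector (not the blockwise variants listed in the tables), and the names `fwht` and `srm_sense` are hypothetical.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def srm_sense(x, subrate, seed=0):
    """Compress feature vector x to round(subrate * N) measurements.

    Assumption: the same seed is used for every document, so all documents
    are projected with the same implicit sensing matrix.
    """
    rng = random.Random(seed)
    n = len(x)
    m = max(1, round(subrate * n))
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]    # pre-randomization
    coeffs = fwht([s * v for s, v in zip(signs, x)])       # fast transform
    keep = sorted(rng.sample(range(n), m))                 # random subsampling
    scale = math.sqrt(n / m)                               # energy normalization
    return [scale * coeffs[i] for i in keep]


# Toy bag-of-words vector of length 8, compressed at subrate 0.5.
bow = [3.0, 0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 0.0]
cs_feature = srm_sense(bow, subrate=0.5, seed=1)
```

Because the transform is applied in place by a fast algorithm, no dense random matrix is ever stored, which is the memory advantage the abstract attributes to SRMs.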
Figure 3. Statistics of the Twitter sentiment and weather report datasets after balancing. (a) Twitter sentiment dataset. (b) Weather report dataset.
Figure 4. Average distances between any BOW or CS feature and the others on the multiclass classification dataset at different subrates when using the Block DCT matrix. BOW denotes the average distances between any BOW feature and the others. The average distances are sorted in descending order.
MSEs between average distances of CS and BOW features on multiclass classification dataset when using different subrates and SRMs.
| Subrate | DCT | FFT | Block DCT | Block WHT | Block Gaussian |
|---|---|---|---|---|---|
| 0.1 | 18.78 | 18.81 | 18.38 | 18.48 | 18.62 |
| 0.2 | 14.37 | 14.40 | 14.38 | 14.08 | 14.51 |
| 0.3 | 11.39 | 11.39 | 11.17 | 11.11 | 11.78 |
| 0.4 | 9.14 | 9.12 | 9.018 | 9.01 | 9.33 |
| 0.5 | 7.36 | 7.34 | 7.19 | 7.21 | 7.33 |
| 0.6 | 5.92 | 5.91 | 5.96 | 5.90 | 6.03 |
| Avg. | 11.16 | 11.16 | 11.02 | 10.96 | 11.27 |
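The quantity reported in the table above can be illustrated with a short sketch: compute, for each feature, its average Euclidean distance to all other features, then take the MSE between the BOW and CS versions. How the paper pairs the two sets of averages (per document or after sorting) is not stated in this record, so the per-document pairing below is an assumption, as are the helper names.

```python
import math


def average_distances(features):
    """Average Euclidean distance from each feature vector to all the others."""
    n = len(features)
    averages = []
    for i, fi in enumerate(features):
        total = sum(math.dist(fi, fj) for j, fj in enumerate(features) if j != i)
        averages.append(total / (n - 1))
    return averages


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy stand-ins for original features and their compressed counterparts.
original = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
compressed = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
distortion = mse(average_distances(original), average_distances(compressed))
```

A lower value means the compressed features reproduce the distance structure of the original features more faithfully.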
Accuracies of SVM classifier associated with different SRMs on binary and multiclass classification datasets at different subrates.
| Subrate | DCT | FFT | Block DCT | Block WHT | Block Gaussian |
|---|---|---|---|---|---|
| Binary classification | | | | | |
| 0.1 | 0.6955 | 0.7220 | 0.6975 | 0.6880 | 0.6930 |
| 0.2 | 0.7185 | 0.7135 | 0.7135 | 0.7200 | 0.7055 |
| 0.3 | 0.7195 | 0.7140 | 0.7285 | 0.7215 | 0.7125 |
| 0.4 | 0.7285 | 0.7190 | 0.7265 | 0.7170 | 0.7185 |
| 0.5 | 0.7235 | 0.7195 | 0.7290 | 0.7270 | 0.7145 |
| 0.6 | 0.7255 | 0.7290 | 0.7265 | 0.7280 | 0.7285 |
| Avg. | 0.7185 | 0.7195 | 0.7203 | 0.7169 | 0.7121 |
| Multiclass classification | | | | | |
| 0.1 | 0.8590 | 0.8575 | 0.8358 | 0.8444 | 0.8227 |
| 0.2 | 0.8616 | 0.8606 | 0.8651 | 0.8636 | 0.8585 |
| 0.3 | 0.8651 | 0.8737 | 0.8666 | 0.8737 | 0.8606 |
| 0.4 | 0.8686 | 0.8702 | 0.8712 | 0.8747 | 0.8712 |
| 0.5 | 0.8712 | 0.8732 | 0.8767 | 0.8691 | 0.8757 |
| 0.6 | 0.8747 | 0.8782 | 0.8803 | 0.8732 | 0.8762 |
| Avg. | 0.8668 | 0.8689 | 0.8660 | 0.8665 | 0.8609 |
Accuracies of different classifiers driven by BOW and CS features on binary and multiclass classification datasets when SRM is Block DCT.
| Classifier | BOW feature | Subrate 0.1 | Subrate 0.2 | Subrate 0.3 | Subrate 0.4 | Subrate 0.5 | Subrate 0.6 |
|---|---|---|---|---|---|---|---|
| Binary classification | | | | | | | |
| SVM | 0.7220 | 0.6975 | 0.7135 | 0.7285 | 0.7265 | 0.7290 | 0.7265 |
| Decision tree | 0.6235 | 0.6365 | 0.6395 | 0.6460 | 0.6355 | 0.6465 | 0.6485 |
| AdaBoost | 0.7060 | 0.7020 | 0.6975 | 0.7075 | 0.7035 | 0.7020 | 0.7110 |
| KNN | 0.6040 | 0.5955 | 0.6120 | 0.6200 | 0.6140 | 0.6145 | 0.6125 |
| Naïve Bayes | 0.7275 | 0.7035 | 0.7130 | 0.7125 | 0.7170 | 0.7200 | 0.7150 |
| Avg. | 0.6766 | 0.6670 | 0.6751 | 0.6829 | 0.6793 | 0.6824 | 0.6827 |
| Multiclass classification | | | | | | | |
| SVM | 0.8732 | 0.8358 | 0.8651 | 0.8666 | 0.8712 | 0.8767 | 0.8803 |
| Decision tree | 0.8560 | 0.8454 | 0.8434 | 0.8510 | 0.8520 | 0.8525 | 0.8530 |
| AdaBoost | 0.7777 | 0.7535 | 0.7737 | 0.7732 | 0.7813 | 0.7808 | 0.7818 |
| KNN | 0.8252 | 0.8080 | 0.8146 | 0.8207 | 0.8242 | 0.8257 | 0.8252 |
| Naïve Bayes | 0.7737 | 0.7373 | 0.7404 | 0.7464 | 0.7429 | 0.7424 | 0.7454 |
| Avg. | 0.8212 | 0.7960 | 0.8074 | 0.8116 | 0.8143 | 0.8156 | 0.8171 |
Figure 5. Average training time (s) on all classifiers driven by BOW and CS features for binary and multiclass classification tasks when the SRM is Block DCT. (a) Binary classification. (b) Multiclass classification.
Average accuracy, precision, recall, and F1 on all classifiers for binary classification dataset when SRM is Block DCT.
| Metrics | BOW feature | Subrate 0.1 | Subrate 0.2 | Subrate 0.3 | Subrate 0.4 | Subrate 0.5 | Subrate 0.6 |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.6766 | 0.6670 | 0.6751 | 0.6829 | 0.6793 | 0.6824 | 0.6827 |
| Precision | 0.6564 | 0.6658 | 0.6674 | 0.6722 | 0.6694 | 0.6670 | 0.6694 |
| Recall | 0.6817 | 0.6671 | 0.6775 | 0.6866 | 0.6824 | 0.6871 | 0.6864 |
| F1 | 0.6679 | 0.6664 | 0.6723 | 0.6790 | 0.6756 | 0.6766 | 0.6774 |
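As a reminder of how the four metrics in the table above relate, here is a minimal sketch that derives them from the confusion-matrix counts of a single binary classifier. The function name and the example counts are hypothetical and are not taken from the paper. Because the table averages each metric over several classifiers, its F1 row need not equal the harmonic mean of its precision and recall rows.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```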
Average accuracies of all classifiers for binary and multiclass classification datasets when using different DR methods.
| DR method | Subrate 0.1 | Subrate 0.2 | Subrate 0.3 | Subrate 0.4 | Subrate 0.5 | Subrate 0.6 |
|---|---|---|---|---|---|---|
| Binary classification | | | | | | |
| PCA | 0.6221 | 0.6236 | 0.6206 | 0.6154 | 0.6222 | 0.6091 |
| ICA | 0.5754 | 0.5830 | 0.5862 | 0.5974 | 0.5903 | 0.6009 |
| NMF | 0.5926 | 0.6127 | 0.6193 | 0.6067 | 0.6157 | 0.6000 |
| CS | 0.6670 | 0.6751 | 0.6829 | 0.6793 | 0.6824 | 0.6827 |
| Multiclass classification | | | | | | |
| PCA | 0.7253 | 0.7213 | 0.7019 | 0.6845 | 0.6822 | 0.6726 |
| ICA | 0.4938 | 0.5170 | 0.5305 | 0.5448 | 0.5455 | 0.5479 |
| NMF | 0.7112 | 0.7080 | 0.7123 | 0.7123 | 0.7096 | 0.7063 |
| CS | 0.7960 | 0.8074 | 0.8116 | 0.8143 | 0.8156 | 0.8171 |
Note that the SRM used in CS is Block DCT.
Execution time (s) of different DR methods on binary and multiclass classification datasets when using different subrates.
| DR method | Subrate 0.1 | Subrate 0.2 | Subrate 0.3 | Subrate 0.4 | Subrate 0.5 | Subrate 0.6 |
|---|---|---|---|---|---|---|
| Binary classification | | | | | | |
| PCA | 384.75 | 384.75 | 384.75 | 384.75 | 384.75 | 384.75 |
| ICA | 369.72 | 3094.00 | 17259.27 | 34511.16 | 35281.73 | 50355.25 |
| NMF | 187.33 | 456.67 | 1169.65 | 1873.32 | 2481.44 | 2201.12 |
| CS | 3.32 | 3.64 | 3.92 | 4.19 | 4.58 | 4.63 |
| Multiclass classification | | | | | | |
| PCA | 275.49 | 275.49 | 275.49 | 275.49 | 275.49 | 275.49 |
| ICA | 188.77 | 382.82 | 990.19 | 6592.11 | 10829.64 | 20559.64 |
| NMF | 159.21 | 327.14 | 652.83 | 1239.35 | 1529.07 | 2358.88 |
| CS | 3.10 | 3.77 | 3.94 | 4.03 | 4.25 | 4.41 |
Note that the SRM used in CS is Block DCT.
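To put the execution times above in relative terms, the sketch below divides each method's time by the CS time at the same subrate. The numbers are copied from the binary-classification rows of the table; the variable and function names are illustrative only.

```python
# Execution times in seconds, binary classification, subrates 0.1 to 0.6.
subrates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
binary_times = {
    "PCA": [384.75, 384.75, 384.75, 384.75, 384.75, 384.75],
    "ICA": [369.72, 3094.00, 17259.27, 34511.16, 35281.73, 50355.25],
    "NMF": [187.33, 456.67, 1169.65, 1873.32, 2481.44, 2201.12],
    "CS": [3.32, 3.64, 3.92, 4.19, 4.58, 4.63],
}


def speedup_over_cs(times, method):
    """Ratio of a method's execution time to the CS time at each subrate."""
    return [t / cs for t, cs in zip(times[method], times["CS"])]


for method in ("PCA", "ICA", "NMF"):
    ratios = speedup_over_cs(binary_times, method)
    row = ", ".join(f"{s}: {r:.1f}x" for s, r in zip(subrates, ratios))
    print(f"{method} vs CS -> {row}")
```

On these figures, CS is roughly two orders of magnitude faster than PCA at every subrate, and the gap to ICA widens sharply as the subrate grows.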