Wuying Liu, Lin Wang, Mianzhu Yi.
Abstract
Multiclass text classification (MTC) is a challenging problem, and MTC algorithms are used in many applications. In the era of big data, the space-time overhead of these algorithms is a major concern. By investigating the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token level memory that stores labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. Experimental results on the TanCorp data set show that the SRSMTC algorithm can achieve state-of-the-art performance at greatly reduced space-time requirements.
Year: 2014 PMID: 24778587 PMCID: PMC3977423 DOI: 10.1155/2014/517498
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1: Token frequency as a function of the token's rank with the word-level 2-grams token model in the TanCorp collection.
Trendline (y = a x + b) coefficients in TanCorp collection.
| Coefficient | 1-grams | 2-grams | 3-grams | 4-grams |
|---|---|---|---|---|
| a | −1.766 | −0.849 | −0.460 | −0.280 |
| b | 20.129 | 12.022 | 6.879 | 4.242 |
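A power law appears as a straight line on log-log axes, so the trendline y = a x + b is presumably fitted with x as the logarithm of a token's rank and y as the logarithm of its frequency. The slope a is then the power-law exponent. The sketch below shows such a least-squares fit on synthetic data. The function name `fit_loglog_trendline`, the natural-log base, and the synthetic counts are assumptions made for illustration; they are not the paper's code or data.

```python
import math

def fit_loglog_trendline(frequencies):
    """Least-squares fit of log(frequency) = a * log(rank) + b.

    Assumes at least two strictly positive counts. Counts are sorted in
    descending order so that rank 1 is the most frequent token.
    """
    counts = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope: the power-law exponent
    b = mean_y - a * mean_x    # intercept: log of the rank-1 frequency
    return a, b

# Synthetic counts following an exact power law: frequency = 1000 / rank.
synthetic = [1000.0 / rank for rank in range(1, 201)]
a, b = fit_loglog_trendline(synthetic)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On this synthetic input the fit recovers a slope close to −1 and an intercept close to ln(1000). Under the same reading, the shallower slopes for longer n-grams in the table indicate flatter frequency distributions.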
Feature number and uselessness rate in TanCorp collection.
| Feature number (num) | | | | |
|---|---|---|---|---|
| 1,316,422 | 325,834 | 2,087,815 | 63 | 79 |
Figure 2: Token level memory.
Figure 3: SRS sketch.
Algorithm 1: Pseudocode of the SRSMTC algorithm.
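The abstract describes three ingredients: a token level memory holding labeled documents, simple random sampling of the training documents, and classification treated as retrieval. Without claiming to match Algorithm 1, these can be sketched roughly as follows. The class name `TokenMemoryClassifier`, the whitespace tokenizer, the `sampling_rate` parameter, and the vote-by-token-overlap scoring are all assumptions made for illustration; the paper's actual data structures and scoring rule may differ.

```python
import random
from collections import Counter, defaultdict

class TokenMemoryClassifier:
    """Hypothetical sketch: inverted index from token to labeled document ids."""

    def __init__(self, sampling_rate=1.0, seed=0):
        self.sampling_rate = sampling_rate   # fraction of training documents kept
        self.rng = random.Random(seed)
        self.index = defaultdict(set)        # token -> ids of stored documents
        self.labels = {}                     # document id -> class label

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase whitespace tokens stand in for word-level n-grams.
        return text.lower().split()

    def fit(self, documents):
        """`documents` is a list of (text, label) pairs."""
        keep = round(len(documents) * self.sampling_rate)
        # Simple random sampling without replacement.
        sample = self.rng.sample(documents, keep)
        for doc_id, (text, label) in enumerate(sample):
            self.labels[doc_id] = label
            for token in set(self.tokenize(text)):
                self.index[token].add(doc_id)
        return self

    def predict(self, text):
        """Retrieve stored documents sharing tokens; vote by overlap count."""
        votes = Counter()
        for token in set(self.tokenize(text)):
            for doc_id in self.index.get(token, ()):
                votes[self.labels[doc_id]] += 1
        return votes.most_common(1)[0][0] if votes else None

# Toy data, invented for this example.
training = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sports"),
    ("player scores winning goal", "sports"),
]
clf = TokenMemoryClassifier(sampling_rate=1.0).fit(training)
print(clf.predict("football team scores"))
```

Lowering `sampling_rate` stores fewer documents and therefore fewer index entries, which is the space-time saving the abstract refers to, at the cost of fewer matches at query time.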
MacroF1 and MicroF1 results.
| | MacroF1 | MicroF1 |
|---|---|---|
| | 0.9172 | 0.9483 |
| | 0.8696 | 0.9126 |
| | 0.8632 | 0.9053 |
| | 0.8478 | 0.9035 |
| | 0.7587 | 0.8645 |
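MacroF1 averages the per-class F1 scores with equal weight per class, while MicroF1 pools true positives, false positives, and false negatives over all classes before computing F1. A minimal sketch of these standard definitions for single-label classification follows. The function name `macro_micro_f1` and the toy labels are illustrative and not taken from the paper, and the macro average here is taken over every class that appears in either the true or the predicted labels, which is one common convention.

```python
from collections import Counter

def macro_micro_f1(true_labels, predicted_labels):
    """Return (MacroF1, MicroF1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    classes = set(true_labels) | set(predicted_labels)
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Toy labels, invented for this example.
truth = ["a", "a", "a", "b", "b", "c"]
guess = ["a", "a", "b", "b", "b", "a"]
macro, micro = macro_micro_f1(truth, guess)
print(f"MacroF1 = {macro:.4f}, MicroF1 = {micro:.4f}")
```

In the single-label setting MicroF1 equals accuracy, while MacroF1 is pulled down by poorly recognized small classes. This is consistent with every MacroF1 value in the table being lower than the MicroF1 value beside it.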
Random sampling rate, token compressing rate, and performances.
| Random sampling rate (%) | Token compressing rate (%) | MacroF1 | MicroF1 |
|---|---|---|---|
| 100 | 100 | 0.8696 | 0.9126 |
| 90 | 93 | 0.8715 | 0.9136 |
| 80 | 86 | 0.8677 | 0.9119 |
| 70 | 79 | 0.8663 | 0.9113 |
| 60 | 71 | 0.8657 | 0.9114 |
| 50 | 63 | 0.8653 | 0.9103 |
| 40 | 53 | 0.8609 | 0.9083 |
| 30 | 43 | 0.8570 | 0.9051 |
| 20 | 32 | 0.8517 | 0.9010 |
| 10 | 19 | 0.8345 | 0.8921 |
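To make the trade-off in this table concrete, the short sketch below computes how much of the full-data MicroF1 is retained at a few sampling rates. The rows are copied from the table; the variable names are illustrative.

```python
# Rows copied from the table:
# (random sampling rate, token compressing rate, MacroF1, MicroF1).
rows = [
    (100, 100, 0.8696, 0.9126),
    (50, 63, 0.8653, 0.9103),
    (10, 19, 0.8345, 0.8921),
]

baseline_micro = rows[0][3]  # MicroF1 with all training documents kept
retained = {rate: micro / baseline_micro for rate, _, _, micro in rows}

for rate, share in retained.items():
    print(f"sampling rate {rate}: {share:.1%} of full-data MicroF1 retained")
```

At a sampling rate of 10, the classifier keeps roughly 97.8% of the full-data MicroF1.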