Depei Wang, Zhuowei Wang, Lianglun Cheng, Weiwen Zhang.
Abstract
Meta-learning frameworks have been proposed to generalize machine learning models for domain adaptation without sufficient labeled data in computer vision. However, text classification with meta-learning is less investigated. In this paper, we propose SumFS, which finds globally top-ranked sentences through extractive summarization and improves the local, category-specific vocabulary features. SumFS consists of three modules: (1) an unsupervised text summarizer that removes redundant information; (2) a weighting generator that associates feature words with attention scores to weight the lexical representations of words; (3) a regular meta-learning framework that trains with limited labeled data using a ridge regression classifier. In addition, a marine news dataset with limited labeled data was established. The performance of the algorithm was tested on the THUCnews, Fudan, and marine news datasets. Experiments show that SumFS can maintain or even improve accuracy while reducing input features. Moreover, the training time of each epoch is reduced by more than 50%.
Keywords: feature selection; few-shot learning; news categorization; text classification
Year: 2022 PMID: 35746202 PMCID: PMC9229404 DOI: 10.3390/s22124420
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. The SumFS framework. The blue arrow denotes global feature processing, the black arrow denotes the training process, and the red arrow denotes the testing process.
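The abstract names a ridge regression classifier as the learner inside the meta-learning framework. The sketch below shows only the generic closed-form ridge fit on one few-shot support set, written in the dual form that is convenient when there are few examples. It is not the authors' implementation: the function names (`solve`, `ridge_fit`, `ridge_predict`), the fixed `lam`, and the toy two-dimensional features are all assumptions for illustration, and any learned scaling or learned regularization strength is omitted.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    a is an n x n matrix, b is an n x c matrix (lists of lists).
    """
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def ridge_fit(support_x, support_y, n_classes, lam=1.0):
    """Return W = X^T (X X^T + lam * I)^-1 Y for one-hot targets Y."""
    n = len(support_x)
    dim = len(support_x[0])
    # Gram matrix of the support set with lam added on the diagonal.
    gram = [
        [sum(u * v for u, v in zip(xi, xj)) + (lam if i == j else 0.0)
         for j, xj in enumerate(support_x)]
        for i, xi in enumerate(support_x)
    ]
    onehot = [[1.0 if c == y else 0.0 for c in range(n_classes)]
              for y in support_y]
    alpha = solve(gram, onehot)  # n x n_classes
    return [
        [sum(support_x[i][k] * alpha[i][c] for i in range(n))
         for c in range(n_classes)]
        for k in range(dim)
    ]


def ridge_predict(w, x):
    """Return the class index with the highest linear score."""
    scores = [sum(x[k] * w[k][c] for k in range(len(x)))
              for c in range(len(w[0]))]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-way 1-shot support set (hypothetical features).
weights = ridge_fit([[1.0, 0.0], [0.0, 1.0]], [0, 1], n_classes=2, lam=0.1)
```

Because the linear system has one row per support example, its size depends on the number of shots rather than on the feature dimension, which is why the dual form suits few-shot episodes.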
Dataset division. The training and validation categories were used to train the SumFS model. The unseen target categories were used to evaluate the model. The category numbers were derived from the order in Table 2.
| Dataset | Training Category | Validation Category | Target Category |
|---|---|---|---|
| THUCnews | 0, 1, 7, 8, 9, 10 | 4, 5, 6, 12 | 2, 3, 11, 13 |
| Fudan news | 0, 2, 4, 6, 7, 9, 11, 17 | 1, 3, 10, 13, 18 | 5, 8, 12, 14, 15, 16, 19 |
| Marine news | 0, 1, 7, 8, 9 | 4, 5, 10 | 2, 3, 6 |
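To make the split above concrete, the following sketch draws one evaluation episode from the target categories only, so that no evaluated class was seen during training. It assumes the usual few-shot reading of an episode as N classes with K labeled support documents each; the helper `sample_episode`, the query size, and the placeholder documents are hypothetical and are not taken from the paper.

```python
import random

# Target (unseen) category indices copied from the dataset division table.
TARGET_CATEGORIES = {
    "THUCnews": [2, 3, 11, 13],
    "Fudan news": [5, 8, 12, 14, 15, 16, 19],
    "Marine news": [2, 3, 6],
}


def sample_episode(docs_by_class, classes, n_way, k_shot, n_query, rng):
    """Sample an N-way K-shot episode as (support, query) lists of (doc, label)."""
    chosen = rng.sample(classes, n_way)
    support, query = [], []
    for label in chosen:
        # Draw support and query documents together so they never overlap.
        docs = rng.sample(docs_by_class[label], k_shot + n_query)
        support += [(d, label) for d in docs[:k_shot]]
        query += [(d, label) for d in docs[k_shot:]]
    return support, query


# Placeholder corpus: ten dummy documents per Marine news target category.
marine_classes = TARGET_CATEGORIES["Marine news"]
toy_docs = {c: [f"doc{c}_{i}" for i in range(10)] for c in marine_classes}
support, query = sample_episode(
    toy_docs, marine_classes, n_way=3, k_shot=1, n_query=5,
    rng=random.Random(0),
)
```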
Dataset categories.
| Dataset | Category | Average Number of Sentences |
|---|---|---|
| THUCnews | Sports, Entertainment, Home, Lottery, ticket, Estate, | 50.36 |
| Fudan news | Agriculture, Art, Communication, Computer science, | 92.97 |
| Marine news | Marine Equipment, The Blue Economy, | 40.78 |
Figure 2. CH index under several Sum-H. The number of sentences corresponding to the red points was chosen.
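As a rough illustration of the quantity plotted in Figure 2, the sketch below computes a cluster-separation score, assuming that "CH index" refers to the Calinski-Harabasz index (between-class dispersion divided by within-class dispersion, each normalized by its degrees of freedom). How the paper builds the document representations for each candidate number of sentences is not described in this record, so the points and labels here are toy values and `ch_index` is a hypothetical helper.

```python
def ch_index(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    n, k = len(points), len(groups)

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    # B: size-weighted squared distance of each class mean to the overall mean.
    between = sum(len(rows) * sqdist(mean(rows), overall)
                  for rows in groups.values())
    # W: squared distance of each point to its own class mean.
    within = sum(sqdist(p, mean(rows))
                 for rows in groups.values() for p in rows)
    return (between / (k - 1)) / (within / (n - k))


toy_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
well_separated = ch_index(toy_points, [0, 0, 1, 1])
poorly_separated = ch_index(toy_points, [0, 1, 0, 1])
```

A higher score means the labeled classes are more compact and further apart, which is one plausible reason to prefer the summary length at which the score peaks.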
Figure 3. t-SNE visualization of the input representations for Sum-H and raw data on the three datasets: THUCnews, Fudan news, and Marine news. The Sum-H distributions are shown in (a), (c), and (e), respectively, and the raw-data distributions are shown in (b), (d), and (f), respectively. Each color/marker pair corresponds to a specific label. The class numbers are listed in Table 1.
The accuracies of different models on Sum-H.
| Method | Marine Sum13 (3-1) | Marine Sum13 (3-3) | THUCnews Sum8 (4-1) | THUCnews Sum8 (4-4) | Fudan Sum15 (5-1) | Fudan Sum15 (5-5) |
|---|---|---|---|---|---|---|
| MAML | 36.02 | 39.06 | 28.01 | 35.68 | 21.45 | 21.14 |
| Proto | 62.02 | 78.03 | 47.90 | 64.90 | 51.64 | 65.33 |
| R2D2 | 69.08 | | 59.26 | 80.50 | 56.43 | 83.00 |
| MLADA | 69.07 | 82.44 | 54.16 | 76.91 | 53.08 | 80.76 |
| Ours | | 83.35 | | | | |
Figure 4. Comparison of IDF-IWF-ATT results with different numbers of sentences.
Figure 5. The accuracy trends of training and testing using the ATT-IDF-IWF weighted strategy. According to the results, Sum-H preserves the original text information well.
The accuracies of different weighting methods on Sum-H.
| Dataset | N-K | IDF-ATT | IWF-ATT | IDF-IWF-ATT | TF-IDF | IDF-IWF |
|---|---|---|---|---|---|---|
| THU-Sum8 | 2-1 | 73.73 | 73.85 | 73.83 | | 73.8 |
| THU-Sum8 | 2-2 | 82.65 | 83.17 | 82.92 | | 82.51 |
| THU-Sum8 | 4-1 | 59.09 | | 58.96 | 59.09 | 58.7 |
| THU-Sum8 | 4-4 | 80.37 | | 80.27 | 80.31 | 79.96 |
| Fudan-Sum15 | 2-1 | 78.73 | 78.35 | | 78.34 | 78.77 |
| Fudan-Sum15 | 2-2 | 86.24 | 86.21 | | 86.09 | 86.25 |
| Fudan-Sum15 | 5-1 | 56.67 | 56.42 | | 56.21 | 56.53 |
| Fudan-Sum15 | 5-5 | 83.13 | 83.06 | | 83.14 | 83.16 |
| Marine-Sum13 | 2-1 | | 79.31 | 79.20 | 79.2 | 79.35 |
| Marine-Sum13 | 2-2 | 87.04 | 87.00 | 87.03 | 86.73 | |
| Marine-Sum13 | 3-1 | 69.22 | | 69.17 | 68.81 | 69.33 |
| Marine-Sum13 | 3-3 | | 83.23 | 83.33 | 83.16 | 83.33 |
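The weight generators compared above are built from corpus statistics such as IDF and IWF. This record does not give the paper's formulas, so the sketch below uses the commonly cited definitions as an assumption: IDF as the log of the number of documents over the number of documents containing the word, and IWF as the log of the total token count over the word's own count. The attention component (ATT) and the way the terms are combined are not reproduced; `idf_iwf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def idf_iwf(docs):
    """Return (idf, iwf) dictionaries for a list of tokenized documents."""
    n_docs = len(docs)
    # Number of documents that contain each word at least once.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    # Total number of occurrences of each word across the corpus.
    word_freq = Counter(w for doc in docs for w in doc)
    total_tokens = sum(word_freq.values())
    idf = {w: math.log(n_docs / doc_freq[w]) for w in doc_freq}
    iwf = {w: math.log(total_tokens / word_freq[w]) for w in word_freq}
    return idf, iwf


toy_corpus = [["sea", "port", "sea"], ["sea", "ship"], ["port", "trade"]]
idf, iwf = idf_iwf(toy_corpus)
```

Under both statistics, rarer words such as "ship" receive larger weights than frequent ones such as "sea", which is the behavior a category-discriminative weighting relies on.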
Figure 6. Comparison of the standard deviations during the testing steps.
Figure 7. Comparison of results with the optimal solution strategy.
The statistical parameters for input text.
| Dataset | Number of Sentences (Sum-H) | Number of Sentences (Raw) | Average Text Length (Sum-H) | Average Text Length (Raw) |
|---|---|---|---|---|
| THUCnews | 8 | 50.36 | 240 | 382 |
| Fudan news | 15 | 92.97 | 390 | 941 |
| Marine news | 13 | 40.78 | 151 | 360 |
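The input reduction implied by the table can be worked out directly from the average text lengths. The short calculation below uses only the figures in the table; the helper name `length_reduction` is illustrative, and the result is the relative shrinkage of the input, not a measurement of training time.

```python
# (Sum-H average length, raw average length) copied from the table above.
TEXT_LENGTHS = {
    "THUCnews": (240, 382),
    "Fudan news": (390, 941),
    "Marine news": (151, 360),
}


def length_reduction(sum_h_length, raw_length):
    """Fraction of the raw average text length removed by summarization."""
    return 1 - sum_h_length / raw_length


reductions = {name: length_reduction(*pair)
              for name, pair in TEXT_LENGTHS.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1%} shorter than raw")
```

By this calculation the summarized inputs are roughly 37% shorter for THUCnews and roughly 58 to 59% shorter for Fudan news and Marine news.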
Figure 8. Comparison of time consumption between Raw and Sum-H. These results were estimated using the average time consumption for each epoch.