Sebastian Jamroziński, Urszula Markowska-Kaczmar.
Abstract
Some machine learning applications do not allow for data augmentation, or are applied to modalities where augmentation is difficult to define. Our study aimed to develop a new semi-supervised learning (SSL) method applicable to various data modalities (images, sound, text), especially when augmentation is hard or impossible to define, e.g., for medical images. Assuming that all samples, labeled and unlabeled, come from the same data distribution, the labeled and unlabeled data sets used in a semi-supervised learning task are similar. Based on this observation, the data embeddings created by the classifier should also be similar for both sets. In our method, finding these embeddings is achieved with two models, a classifier and an auxiliary discriminator, inspired by the Generative Adversarial Network (GAN) learning process. The classifier is trained to build embeddings of the labeled and unlabeled datasets that fool the discriminator, which tries to recognize whether an embedding comes from the labeled or the unlabeled dataset. We named the method DGSSC, for Discriminator Guided Semi-Supervised Classifier. The experimental research aimed at evaluating the proposed method on the classification task in combination with the teacher–student approach, and at comparing it with other SSL methods. In most experiments, training the networks with the DGSSC method improves accuracy over the teacher–student approach alone; it does not deteriorate accuracy in any experiment.
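The adversarial objective described in the abstract can be sketched as follows. This is our illustrative paraphrase, not the authors' code: the function names and the exact form of the losses are our own. The discriminator learns to tell labeled-set embeddings from unlabeled-set embeddings, while the classifier's adversarial term flips the targets so that the origin of an embedding becomes unrecoverable:

```python
import numpy as np

def bce(p, y):
    # Binary cross-entropy between discriminator outputs p and targets y.
    eps = 1e-9
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def discriminator_loss(d_labeled, d_unlabeled):
    # D learns to output 1 for embeddings of labeled samples
    # and 0 for embeddings of unlabeled samples.
    return (bce(d_labeled, np.ones_like(d_labeled))
            + bce(d_unlabeled, np.zeros_like(d_unlabeled)))

def classifier_adversarial_loss(d_labeled, d_unlabeled):
    # C tries to fool D, so its adversarial term uses flipped targets;
    # in full training this term is added to the usual cross-entropy
    # classification loss on the labeled set.
    return (bce(d_labeled, np.zeros_like(d_labeled))
            + bce(d_unlabeled, np.ones_like(d_unlabeled)))
```

When D cannot do better than chance (outputs near 0.5 for both sets), both losses approach 2 ln 2 and the embeddings of the two sets are indistinguishable, which is the stated goal of the method.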
Year: 2022 PMID: 36038620 PMCID: PMC9424248 DOI: 10.1038/s41598-022-18947-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Data flow in the DGSSC method, with the set of labeled samples, the set of unlabeled samples, the classifier network C, and the discriminator network D. The classifier is trained in two modes: in supervised training it predicts the output, and it also takes part in adversarial training to produce embeddings of samples from both sets.
Figure 2. The diagram of model training in teacher–student mode. The training comprises four main steps, starting from the one presented at the top: the DGSSC procedure, followed by pseudo-labeling the unlabeled set, knowledge transfer to a newly initialized model, and fine-tuning on the labeled set. The objects on the blue arrows represent artifacts created by a given step and used in the following step. The grey color denotes newly initialized models. T denotes the classifier–teacher, D the discriminator, S the classifier–student, H the cross-entropy loss, and MSE the mean squared error loss.
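The four steps in Figure 2 can be summarized as a training loop. The sketch below is our paraphrase of the figure, not the authors' implementation: each step is a caller-supplied function, and the cycle count matches the "cycle 0" / "last cycle" terminology used in the results:

```python
def teacher_student_pipeline(train_dgssc, pseudo_label, distill, fine_tune,
                             labeled, unlabeled, cycles=1):
    """Sketch of the four-step teacher-student loop (our paraphrase).

    Caller-supplied functions:
      train_dgssc(labeled, unlabeled) -> teacher      # DGSSC procedure
      pseudo_label(teacher, unlabeled) -> pseudo_set  # label the unlabeled set
      distill(teacher, labeled, pseudo_set) -> student  # MSE knowledge transfer
                                                        # to a fresh model
      fine_tune(student, labeled) -> student          # cross-entropy on labeled
    """
    teacher = train_dgssc(labeled, unlabeled)
    for _ in range(cycles):
        pseudo = pseudo_label(teacher, unlabeled)
        student = distill(teacher, labeled, pseudo)
        student = fine_tune(student, labeled)
        teacher = student  # the student becomes the next cycle's teacher
    return teacher
```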
Characteristics of the datasets used in the experiments. The subsequent columns give the splits into labeled, unlabeled, development and test sets.

| Dataset | No. of classes | Labeled | Unlabeled | Dev | Test set | Architecture |
|---|---|---|---|---|---|---|
| AG News | 4 | 800 | 114,200 | 5000 | 7600 | BERT |
| IMDB | 2 | 400 | 19,600 | 5000 | 25,000 | BERT |
| Speech Commands | 35 | 4000 | 80,843 | 5000 | 11,005 | M5 |
| FindSounds | 7 | 4000 | 6000 | 5000 | 1930 | M5 |
| CIFAR-10 | 10 | 4000 | 41,000 | 5000 | 10,000 | CNN13 |
| SVHN | 10 | 4000 | 64,257 | 5000 | 26,032 | CNN13 |
The last column gives the model architecture chosen for the corresponding dataset. All samples in the dev and test sets are labeled, and predictions for these points are compared with their true labels to calculate the accuracy metric.
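As a sanity check of the split sizes above (our own arithmetic), the labeled, unlabeled and dev counts add up to the official training-set sizes of CIFAR-10 and SVHN:

```python
# Labeled + unlabeled + dev splits from the table above.
cifar10 = 4000 + 41_000 + 5000
svhn = 4000 + 64_257 + 5000
assert cifar10 == 50_000  # CIFAR-10 training set size
assert svhn == 73_257     # SVHN training set size
```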
BERT architecture.
| Layer description | Output size | #Neurons |
|---|---|---|
| Input—token Ids (T) | T | |
| Encoder embeddings—words 30,522, pos. 512, tokens 2, dim. 768 | T×768 | 23,837,184 |
| Encoder—12 transformer layers | T×768 | 85,054,464 |
| AvgPool | 768 | |
| Linear—128 | 128 | 98,432 |
| Linear—num. classes | C | 128C |
| Softmax | C |
T denotes the maximum length of a sample for the given task, and C indicates the number of classes, i.e. C = 2 for the IMDB and C = 4 for the AG News dataset.
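The parameter counts in the table are consistent with the standard BERT-base configuration (12 transformer layers, hidden size 768, feed-forward size 3072, vocabulary 30,522). A quick arithmetic check of our own:

```python
# Worked check of the BERT parameter counts in the table above
# (our own arithmetic, assuming the standard BERT-base configuration).
dim, ffn = 768, 3072
attn = 4 * (dim * dim + dim)               # Q, K, V and output projections
mlp = (dim * ffn + ffn) + (ffn * dim + dim)
norms = 2 * 2 * dim                        # two LayerNorms (scale + bias)
per_layer = attn + mlp + norms
assert 12 * per_layer == 85_054_464        # the 85,054,464 encoder row

vocab, pos, seg = 30_522, 512, 2
emb = (vocab + pos + seg) * dim + 2 * dim  # embeddings + embedding LayerNorm
assert emb == 23_837_184                   # the embeddings row
assert dim * 128 + 128 == 98_432           # Linear—128 row (weights + bias)
```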
M5 architecture.
| Layer description | Output size (Speech Commands or FindSounds) | #Neurons |
|---|---|---|
| Input (16 kHz)—one channel, length L | 16,000 or 40,000 | |
| 1Dconv—32, kernel 80, stride 16 + batch norm | 32 | 2656 |
| MaxPool1D—kernel 4 | 32 | |
| 1Dconv—32, kernel 3, stride 1 + batch norm | 32 | 3168 |
| MaxPool1D—kernel 4 | 32 | |
| 1Dconv—64, kernel 3, stride 1 + batch norm | 64 | 6336 |
| MaxPool1D—kernel 4 | 64 | |
| 1Dconv—64, kernel 3, stride 1 + batch norm | 64 | 12,480 |
| MaxPool1D—kernel 4 | 64 | |
| AvgPool1D | 64 | |
| Linear—num. classes | 35 or 7 | 2275 or 455 |
| Softmax | 35 or 7 |
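The #Neurons column of the M5 table can be reproduced by counting convolution weights and biases plus the batch-norm scale/shift parameters; a quick check of our own:

```python
def conv1d_params(c_in, c_out, k):
    # 1D conv weights + bias, plus batch-norm scale/shift per output channel
    return c_out * c_in * k + c_out + 2 * c_out

assert conv1d_params(1, 32, 80) == 2656   # first conv row
assert conv1d_params(32, 32, 3) == 3168
assert conv1d_params(32, 64, 3) == 6336
assert conv1d_params(64, 64, 3) == 12_480
assert 35 * 64 + 35 == 2275               # Linear—35 (Speech Commands)
assert 7 * 64 + 7 == 455                  # Linear—7 (FindSounds)
```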
CNN13 architecture. The weights of every layer are stored in normalized form, as proposed in [54], i.e. each weight is decomposed into magnitude and direction components.
| Name | Layer description | Output size | #Neurons |
|---|---|---|---|
| | Input—32×32 image | 3 | |
| | 3×3 conv—128 + batch norm | 128 | 3968 |
| | 3×3 conv—128 + batch norm | 128 | 147,968 |
| | 3×3 conv—128 + batch norm | 128 | 147,968 |
| | 2×2 max pool | 128 | |
| | 3×3 conv—256 + batch norm | 256 | 295,936 |
| | 3×3 conv—256 + batch norm | 256 | 590,848 |
| | 3×3 conv—256 + batch norm | 256 | 590,848 |
| | 2×2 max pool | 256 | |
| | 3×3 conv—512 + batch norm | 512 | 1,181,696 |
| | 1×1 conv—256 + batch norm | 256 | 132,096 |
| | 1×1 conv—128 + batch norm | 128 | 33,280 |
| | 6×6 avg pool | 128 | |
| | Linear—10 | 10 | 1300 |
| | Softmax | 10 | |
The Name column gives the layer name used in this paper, where activations of different layers are analyzed.
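The #Neurons entries of the CNN13 table are consistent with each convolution carrying, per output channel, a bias, one weight-norm magnitude (from the normalized-form storage noted above) and batch-norm scale/shift parameters; a quick arithmetic check of our own:

```python
def conv2d_params(c_in, c_out, k):
    # 2D conv weights, plus 4 extra parameters per output channel:
    # bias + weight-norm magnitude + batch-norm scale and shift
    # (our reading of the table's counts).
    return c_out * c_in * k * k + 4 * c_out

assert conv2d_params(3, 128, 3) == 3968
assert conv2d_params(128, 128, 3) == 147_968
assert conv2d_params(128, 256, 3) == 295_936
assert conv2d_params(256, 256, 3) == 590_848
assert conv2d_params(256, 512, 3) == 1_181_696
assert conv2d_params(512, 256, 1) == 132_096
assert conv2d_params(256, 128, 1) == 33_280
# Linear—10: weights + bias + weight-norm magnitude per class
assert 128 * 10 + 10 + 10 == 1300
```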
Figure 3. t-SNE projection of activations of hidden layers for labeled and unlabeled samples, for the baseline classifier trained on the CIFAR-10 dataset. From the top-left, shown are the predictions, the logits, and the other intermediate layers, down to the projection of activations of the first convolutional layer (bottom-right). Labeled and unlabeled samples come from the same distribution; therefore their activations should be indistinguishable, which is not satisfied in this case.
Figure 4. The two largest components of activations for labeled (orange) and unlabeled (blue) samples, for the baseline classifier trained on the CIFAR-10 dataset. Labeled samples are drawn on top of unlabeled ones.
Figure 5. t-SNE projection of activations of hidden layers for labeled and unlabeled samples, for the classifier trained with the DGSSC method on the CIFAR-10 dataset. From the top-left, shown are the predictions, the logits, and the other intermediate layers, down to the projection of activations of the first convolutional layer (bottom-right).
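The indistinguishability illustrated in Figures 3 and 5 can also be quantified: fit a simple probe to separate the two embedding sets and check whether it beats chance. A minimal sketch of our own, using synthetic Gaussian "embeddings" rather than the paper's activations:

```python
import numpy as np

def probe_accuracy(emb_lab, emb_unlab, steps=500, lr=0.1):
    # Fit a logistic-regression probe (plain gradient descent) to tell
    # labeled embeddings (target 1) from unlabeled ones (target 0).
    # Training accuracy near 0.5 means the two sets are indistinguishable.
    X = np.vstack([emb_lab, emb_unlab])
    y = np.r_[np.ones(len(emb_lab)), np.zeros(len(emb_unlab))]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return float(np.mean((p > 0.5) == y))
```

Embeddings drawn from the same distribution keep the probe near chance level, while a mean-shifted set is separated almost perfectly.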
Figure 6. Experimental results of the proposed method and the reference baselines. The lines represent the average over the runs of each experiment, and the lighter regions represent the standard deviation of the aggregated runs.
Classification accuracy of the proposed method under different configurations, reference baselines and reference from other works.
| Method | AG News | IMDB | Speech Commands | FindSounds | SVHN | CIFAR-10 | CIFAR-10 + augm. |
|---|---|---|---|---|---|---|---|
| DGSSC (no teacher–student) | 88.93±0.15 | 85.56±0.67 | 82.75±0.61 | 59.02±1.48 | 94.75±0.17 | 77.00±0.47 | 83.32±0.35 |
| DGSSC (cycle 0) | 89.61±0.16 | 86.90±0.53 | 82.75±0.61 | 59.02±1.48 | 95.08±0.10 | 79.86±0.42 | 84.02±0.41 |
| DGSSC (last cycle) | 89.61±0.36 | 86.90±0.89 | 83.49±0.27 | 55.04±2.31 | 95.71±0.14 | 81.93±0.67 | 85.62±0.58 |
| Supervised baseline | 88.36±0.25 | 85.76±0.75 | 71.53±1.10 | 54.72±2.34 | 90.90±0.31 | 74.22±1.27 | 79.45±0.39 |
| Teacher–student only (without DGSSC) | 89.81±0.69 | 87.39±1.11 | 81.84±0.56 | 56.63±0.84 | 94.76±0.16 | 81.50±0.54 | 84.18±0.59 |
| MixText | 89.2 | 89.4 | | | | | |
| MixMatch | | | | | 97.11±0.06 | | 95.05±0.08 |
| ICT | | | | | | | 92.71±0.02 |
| FixMatch | | | | | | | 95.74±0.05 |
Each result of our experiments is reported as the mean and standard deviation calculated from 3 experiment runs for the AG News and IMDB datasets, and from 5 runs for the other experiments.
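The aggregation behind each table entry is simply a mean ± standard deviation over runs; a minimal sketch with hypothetical run accuracies (illustrative values, not the paper's raw data):

```python
import numpy as np

# Three hypothetical run accuracies for one experiment (not the paper's data).
runs = np.array([77.3, 76.5, 77.2])
# Population standard deviation (np.std default, ddof=0).
mean, std = runs.mean(), runs.std()
summary = f"{mean:.2f}\u00b1{std:.2f}"  # e.g. the "mean±std" table format
```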