| Literature DB >> 35808537 |
Marek Hrúz, Ivan Gruber, Jakub Kanis, Matyáš Boháček, Miroslav Hlaváč, Zdeněk Krňoul.
Abstract
In this paper, we dive into sign language recognition, focusing on the recognition of isolated signs. The task is defined as a classification problem, where a sequence of frames (i.e., images) is recognized as one of the given sign language glosses. We analyze two appearance-based approaches, I3D and TimeSformer, and one pose-based approach, SPOTER. The appearance-based approaches are trained on a few different data modalities, whereas the performance of SPOTER is evaluated on different types of preprocessing. All the methods are tested on two publicly available datasets: AUTSL and WLASL300. We experiment with ensemble techniques to achieve new state-of-the-art results of 73.84% accuracy on the WLASL300 dataset by using the CMA-ES optimization method to find the best ensemble weight parameters. Furthermore, we present an ensembling technique based on the Transformer model, which we call Neural Ensembler.
Keywords: CNN; Transformer; ensemble; sign language recognition
Year: 2022 PMID: 35808537 PMCID: PMC9269724 DOI: 10.3390/s22135043
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
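The weighted-ensemble idea from the abstract (per-model softmax outputs fused with weights found by an optimizer) can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the toy data and function names are assumptions, and a greedy random search is substituted for the CMA-ES optimizer the paper actually uses.

```python
import random

def ensemble_accuracy(weights, model_probs, labels):
    """Accuracy of a weighted average of per-model softmax outputs.

    model_probs[m][i] is model m's class-probability vector for sample i.
    """
    correct = 0
    for i, label in enumerate(labels):
        n_classes = len(model_probs[0][i])
        fused = [sum(w * probs[i][c] for w, probs in zip(weights, model_probs))
                 for c in range(n_classes)]
        if max(range(n_classes), key=fused.__getitem__) == label:
            correct += 1
    return correct / len(labels)

def optimize_weights(model_probs, labels, iters=200, seed=0):
    """Greedy random search over ensemble weights -- a simple stand-in
    for the CMA-ES optimizer used in the paper."""
    rng = random.Random(seed)
    n_models = len(model_probs)
    best_w = [1.0 / n_models] * n_models  # start from the equal-weight ensemble
    best_acc = ensemble_accuracy(best_w, model_probs, labels)
    for _ in range(iters):
        cand = [w + rng.gauss(0.0, 0.1) for w in best_w]  # perturb current best
        acc = ensemble_accuracy(cand, model_probs, labels)
        if acc > best_acc:
            best_w, best_acc = cand, acc
    return best_w, best_acc

# Toy validation set: two classes, two models with complementary strengths.
probs_a = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7]]
probs_b = [[0.2, 0.8], [0.9, 0.1], [0.1, 0.9], [0.6, 0.4]]
labels = [0, 1, 0, 1]

eq_acc = ensemble_accuracy([0.5, 0.5], [probs_a, probs_b], labels)  # equal weights
opt_w, opt_acc = optimize_weights([probs_a, probs_b], labels)       # never worse than eq_acc
```

Because the search starts from the equal-weight ensemble and only accepts improvements, the optimized accuracy can never fall below the equally weighted one; CMA-ES plays the same role in the paper, but with a principled covariance-adapting search instead of blind perturbation.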
Overview of the datasets. Column “Mean” refers to the average number of video instances per gloss (class).
| Dataset | Language | Sensor | Gloss | Videos | Mean | Signers |
|---|---|---|---|---|---|---|
| WLASL300 | ASL | RGB | 300 | 5117 | 17.1 | 109 |
| AUTSL | TSL | RGB+D | 226 | 38,336 | 169.6 | 43 |
Figure 1. Top row: AUTSL dataset; bottom row: WLASL300 dataset. Left: positions of joints and salient points on the face in a source RGB video frame of a given size. Right: data normalization and preprocessing: (a) the Crop&Resize data variant, (b) the Masked data variant, and (c) the OptFlow data variant.
Figure 2. Different types of neural ensemblers. The backbone is a standard Transformer encoder with an optional class token. The rest of the input sequence consists of the embedded softmax vectors of the models to be ensembled for a given video sequence. The output of the encoder has the same length as the input sequence. The BERT-like decoder uses a class head to predict the final class from the class token. The BERT-like weighter uses the class token to decode the weights for the individual models. The model weighter, in contrast, does not use a class token but decodes each sequence element into a scalar weight that is then used to compute the weighted average.
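The fusion step of the "model weighter" variant described in the Figure 2 caption can be sketched as follows. The Transformer encoder itself is omitted; `encoded`, `weight_head`, and the toy inputs are hypothetical stand-ins for the learned components, not the paper's code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def model_weighter_fuse(encoded, model_probs, weight_head):
    """Fusion step of the 'model weighter' variant: each encoder output
    (one per ensembled model, no class token) is decoded to a scalar,
    the scalars are normalized, and the models' softmax vectors are
    combined as a weighted average."""
    raw = [weight_head(h) for h in encoded]  # one scalar per model
    weights = softmax(raw)                   # normalize to a convex combination
    n_classes = len(model_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, model_probs))
            for c in range(n_classes)]

# Hypothetical example: two models; the encoder output for the first
# yields the larger scalar, so its prediction dominates the fused vector.
fused = model_weighter_fuse(
    encoded=[[2.0], [0.0]],                  # stand-in encoder outputs
    model_probs=[[1.0, 0.0], [0.0, 1.0]],    # per-model softmax vectors
    weight_head=lambda h: h[0],              # stand-in for the learned head
)
```

Softmax-normalizing the decoded scalars keeps the fused vector a valid probability distribution, which is what lets the weighter variants be trained with an ordinary classification loss.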
Optimal settings for the neural ensemblers.
| Model | Data | Dim_ff | Layers | Heads | Per_head | Lr | Optim. | Aug. |
|---|---|---|---|---|---|---|---|---|
| BERT-like | AUTSL | 504 | 4 | 8 | 18 | | SGD | |
| BERT-like | | 648 | 5 | 9 | 52 | | | ✓ |
| W-BERT | | 472 | 5 | 5 | 38 | | | |
| W-BERT | | 572 | 5 | 7 | 20 | | | ✓ |
| W-Model | | 886 | 4 | 6 | 45 | | | |
| W-Model | | 973 | 6 | 7 | 37 | | | ✓ |
| BERT-like | WLASL | 248 | 5 | 7 | 64 | | SGD | |
| BERT-like | | 504 | 4 | 8 | 18 | | | ✓ |
| W-BERT | | 937 | 4 | 8 | 29 | | | ✓ |
| W-Model | | 596 | 6 | 8 | 33 | | | ✓ |
AUTSL results. Model Picker = percentage of samples that were recognized correctly by at least one model. Ens. EQ = equally weighted ensemble, Ens. OPT = ensemble with optimized weights, Logits = using logit outputs instead of softmax, Neural Ens. = neural ensembler, Bert = BERT-like transformer, W-Bert = BERT-like weighter, W-Model = model weighter. w_VAL and w_TEST = weights of individual models in the CMA-ES ensemble.
| Method | Data | VAL [%] | TEST [%] | w_VAL | w_TEST |
|---|---|---|---|---|---|
| I3D | | | 92.84 | 0.10 | 0.05 |
| | | 91.90 | 91.07 | 0.08 | 0.08 |
| | | 92.51 | | 0.08 | 0.15 |
| | | 91.51 | 91.45 | 0.13 | 0.20 |
| | | 88.68 | 88.00 | 0.12 | 0.15 |
| TimeSformer | | 83.39 | 81.08 | 0.03 | −0.01 |
| | | 87.55 | 85.89 | 0.12 | 0.09 |
| | | 85.60 | 82.82 | 0.10 | 0.05 |
| SPOTER | MMPose | 85.31 | 84.90 | 0.23 | 0.10 |
| | MMPose no Face | 76.73 | 77.15 | −0.14 | −0.07 |
| | MMPose no iHands | 83.77 | 83.51 | −0.0004 | 0.11 |
| | OpenPose | 80.04 | 78.89 | 0.01 | 0.14 |
| | OpenPose no Face | 79.49 | 76.62 | 0.12 | −0.03 |
| | OpenPose no iHands | 78.59 | 76.96 | 0.02 | −0.01 |
| Ens. EQ | | 95.07 | 95.80 | | |
| Ens. OPT | | 95.84 | | | |
| Ens. Logits EQ | | 95.04 | 95.99 | | |
| Ens. Logits OPT | | | 95.83 | | |
| Neural Ens. (Bert) | | 97.22 | | | |
| Neural Ens. (W-Bert) | | 95.22 | 95.41 | | |
| Neural Ens. (W-Model) | | 95.31 | 96.29 | | |
| Neural Ens. (Bert) + Aug | | | 96.29 | | |
| Neural Ens. (W-Bert) + Aug | | 95.13 | 96.21 | | |
| Neural Ens. (W-Model) + Aug | | 95.38 | 96.21 | | |
| Model Picker | | 98.96 | — | | |
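The Model Picker row is an oracle upper bound: the share of samples that at least one ensembled model classifies correctly. A minimal sketch of how such a number can be computed from per-model predictions (the function name and toy data are assumptions, not from the paper):

```python
def model_picker_accuracy(model_preds, labels):
    """Oracle upper bound: fraction of samples that at least one
    of the ensembled models classifies correctly."""
    hits = sum(
        any(preds[i] == label for preds in model_preds)
        for i, label in enumerate(labels)
    )
    return hits / len(labels)

# Two toy models with complementary errors cover all three samples.
oracle = model_picker_accuracy([[0, 1, 0], [0, 0, 1]], [0, 1, 1])  # 1.0
```

No weighting scheme can exceed this bound, which is why the gap between the best ensemble (96.29% test) and Model Picker (98.96% validation) indicates remaining headroom for fusion methods.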
WLASL300 results. Only MMPose is used during the preprocessing step. For fine-tuned models, the results are reported as (WLASL only/AUTSL pretrain). w_VAL and w_TEST = weights of individual models in the CMA-ES ensemble.
| Method | Data | VAL [%] | TEST [%] | w_VAL | w_TEST |
|---|---|---|---|---|---|
| I3D | | 63.44/66.78 | 55.26/60.66 | 0.10/0.12 | 0.07/0.12 |
| | | 61.67/63.00 | 55.11/57.06 | 0.06/0.08 | 0.003/0.12 |
| | | 45.56/50.56 | 36.19/39.49 | 0.05/0.04 | 0.03/0.11 |
| TimeSformer | | 43.67/59.00 | 36.79/52.55 | −0.10/0.14 | −0.08/0.28 |
| | | 46.44/58.33 | 40.84/50.75 | 0.07/0.10 | −0.03/0.08 |
| | | 26.78/48.67 | 22.37/37.99 | −0.16/0.32 | −0.26/0.23 |
| SPOTER | MMPose | 57.30 | 53.75 | 0.19 | 0.33 |
| Ens. EQ | | 75.78 | 66.52 | | |
| Ens. OPT | | 80.33 | 70.72 | | |
| Ens. Logits EQ | | 77.11 | 69.52 | | |
| Ens. Logits OPT | | | 73.84 | | |
| Neural Ens. (Bert) | | | 68.92 | | |
| NE (W-Bert) + Aug | | 75.56 | 69.22 | | |
| NE (W-Model) + Aug | | 75.56 | | | |
Difference in accuracy between the original models and the transferred models (models with the AUTSL pretraining) on WLASL300.
| Method | Data | VAL | TEST |
|---|---|---|---|
| I3D | | +3.34 | +5.40 |
| | | +1.33 | +1.95 |
| | | +5.00 | +3.30 |
| TimeSformer | | +15.33 | +15.76 |
| | | +11.89 | +9.91 |
| | | +21.89 | +15.62 |
Ablation of SPOTER performance on AUTSL depending on included joints. The experiments were performed with two pose estimation libraries (MMPose and OpenPose).
| Joint Configuration | Framework | VAL | TEST |
|---|---|---|---|
| All joints and keypoints | MMPose | 85.31 | 84.90 |
| OpenPose | 80.04 | 78.89 | |
| No metacarpal joints | MMPose | 83.77 | 83.51 |
| OpenPose | 78.59 | 76.96 | |
| No face keypoints | MMPose | 76.73 | 77.15 |
| OpenPose | 79.49 | 76.62 |