Ilias Papastratis, Kosmas Dimitropoulos, Petros Daras.
Abstract
Continuous sign language recognition is a weakly supervised task that identifies continuous sign gestures in video sequences without any prior knowledge of the temporal boundaries between consecutive signs. Most existing methods focus on extracting spatio-temporal visual features and do not exploit text or contextual information to further improve recognition accuracy. Moreover, the ability of deep generative models to model data distributions has not yet been investigated in the field of sign language recognition. To this end, a novel approach for context-aware continuous sign language recognition using a generative adversarial network architecture, named Sign Language Recognition Generative Adversarial Network (SLRGAN), is introduced. The proposed network architecture consists of a generator that recognizes sign language glosses by extracting spatial and temporal features from video sequences, and a discriminator that evaluates the quality of the generator's predictions by modeling text information at the sentence and gloss levels. The paper also investigates the importance of contextual information in sign language conversations for both Deaf-to-Deaf and Deaf-to-hearing communication. Contextual information, in the form of hidden states extracted from the previous sentence, is fed into the bidirectional long short-term memory module of the generator to improve the recognition accuracy of the network. At the final stage, sign language translation is performed by a transformer network, which converts sign language glosses to natural language text. The proposed method achieved word error rates of 23.4%, 2.1% and 2.26% on the RWTH-Phoenix-Weather-2014, Chinese Sign Language (CSL) and Greek Sign Language (GSL) Signer Independent (SI) datasets, respectively.
Keywords: continuous sign language recognition; generative adversarial networks; sign language translation
Year: 2021 PMID: 33916231 PMCID: PMC8038055 DOI: 10.3390/s21072437
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Overview of the proposed framework that performs continuous sign language recognition and sign language translation.
Figure 2. The proposed generator extracts spatio-temporal features from a video and predicts the signed gloss sequences.
Figure 3. The proposed discriminator aims to distinguish between the ground truth and predicted glosses by modeling text information at both the gloss and sentence levels.
Figure 4. Context modeling on Deaf-to-hearing and Deaf-to-Deaf conversations. In the first case, the previous sentence (text) is passed through a word embedding and a Bidirectional Long Short-Term Memory (BLSTM) layer. In the Deaf-to-Deaf setting, the previous hidden state of the generator is passed through a mapper network. In both cases, the produced hidden state is fed into the BLSTM layer of the generator.
Figure 5. Overall architecture of the sign language translation method. The proposed generator is extended by a transformer network to perform translation of the predicted glosses.
Performance of the SLRGAN generator with different configurations on the RWTH-Phoenix-Weather-2014 dataset, measured with the word error rate (WER).
| SLRGAN (Generator Only) | Validation | Test |
|---|---|---|
| 2D-CNN+TCL (without BLSTM) | 30.1 | 29.8 |
| 2D-CNN+BLSTM (without TCL) | 27.9 | 27.7 |
| 2D-CNN+TCL+LSTM | 26.0 | 25.8 |
| 2D-CNN+TCL+BLSTM | | |
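The WER reported in these tables is the standard edit-distance metric: the number of substituted, deleted, and inserted glosses needed to turn the prediction into the reference, divided by the reference length. The following is a minimal sketch of that computation for a single sentence, not the evaluation script used by the authors. The function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the gloss strings are borrowed from the qualitative examples in this document, with the first one assumed to be the reference.

```python
def word_error_rate(reference, hypothesis):
    """Sentence-level WER: (substitutions + deletions + insertions) / reference length.

    Assumption: glosses are separated by whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one gloss")

    # prev[j] is the edit distance between the reference prefix processed so far
    # and the first j glosses of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_gloss in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_gloss in enumerate(hyp, start=1):
            cost = 0 if ref_gloss == hyp_gloss else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference gloss missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra gloss in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Two glosses (YOU, HOW) are missing from the prediction: 2 errors over 6 reference glosses.
wer = word_error_rate("HELLO I CAN HELP YOU HOW", "HELLO I CAN HELP")
print(f"{100 * wer:.1f}%")
```

The function returns a fraction; the tables report the value as a percentage. Dataset-level scores are usually obtained by summing errors and reference lengths over all sentences rather than averaging per-sentence values, which this sketch does not do.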
SLRGAN performance with different discriminator settings on the RWTH-Phoenix-Weather-2014 dataset, measured with the WER.
| Method | Validation | Test |
|---|---|---|
| SLRGAN (generator only) | 25.1 | 25.0 |
| SLRGAN (gloss-level) | 23.8 | 23.9 |
| SLRGAN (sentence-level) | 23.9 | 24.0 |
| | | |
Comparison of the Continuous Sign Language Recognition (CSLR) approaches on the RWTH-Phoenix-Weather-2014 dataset, measured with the WER.
| Method | Validation | Test |
|---|---|---|
| Staged-Opt | 39.4 | 38.7 |
| CNN-Hybrid | 38.3 | 38.8 |
| Dilated | 38.0 | 37.3 |
| Align-iOpt | 37.1 | 36.7 |
| DenseTCN | 35.9 | 36.5 |
| SF-Net | 35.6 | 34.9 |
| DPD | 35.6 | 34.5 |
| Fully-Inception Networks | 31.7 | 31.3 |
| Re-Sign | 27.1 | 26.8 |
| CNN-TEMP-RNN (RGB) | 23.8 | 24.4 |
| CrossModal | 23.9 | 24.0 |
| Fully-Conv-Net | 23.7 | 23.9 |
| | | |
Evaluation comparison on the Chinese Sign Language (CSL) dataset, measured with the WER.
| Method | Test |
|---|---|
| LS-HAN | 17.3 |
| DenseTCN | 14.3 |
| CTF | 11.2 |
| Align-iOpt | 6.1 |
| DPD | 4.7 |
| SF-Net | 3.8 |
| Fully-Conv-Net | 3.0 |
| CrossModal | 2.4 |
| | |
Comparison of Sign Language Recognition (SLR) methods on the Greek Sign Language (GSL) SI and SD datasets.
| Method | GSL SI Validation | GSL SI Test | GSL SD Validation | GSL SD Test |
|---|---|---|---|---|
| CrossModal | 3.56 | 3.52 | 38.21 | 41.98 |
| | 2.87 | 2.98 | 36.91 | 37.11 |
|
|
| 2.86 |
| 36.68 |
|
| 2.72 |
| 34.52 |
|
Figure 6. CSLR performance comparison on a sign language conversation. It was observed that the context-aware SLRGAN performs better during recognition of a sign language conversation.
Reported results on sign language translation.
| Method | GSL SI Test BLEU-4 | GSL SI Test METEOR | GSL SD Test BLEU-4 | GSL SD Test METEOR |
|---|---|---|---|---|
| Ground Truth | 85.17 | 85.89 | 21.89 | 28.47 |
| SLRGAN+Transformer | 84.24 | 84.58 | 19.34 | 25.90 |
| Deaf-to-hearing SLRGAN+Transformer | 84.91 | 85.26 | 20.26 | 26.71 |
| Deaf-to-Deaf SLRGAN+Transformer | 84.96 | 85.48 | 20.33 | 26.42 |
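BLEU-4 in the table above scores a translation by how many of its 1- to 4-word sequences also occur in the reference, with a penalty for outputs shorter than the reference. The following is a minimal, unsmoothed, single-reference, sentence-level sketch of that idea. It is an illustration under stated assumptions rather than the scoring tool behind the reported numbers, which this record does not specify and which is typically computed at corpus level. The names `ngram_counts` and `bleu4` are hypothetical, and lowercased whitespace tokenization is an assumption.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(reference, hypothesis):
    """Unsmoothed sentence-level BLEU-4 on a 0-100 scale (single reference)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not hyp:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, 5):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precision_sum += math.log(matched / total)

    # Brevity penalty: only hypotheses shorter than the reference are penalized.
    if len(hyp) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(ref) / len(hyp))
    # Geometric mean of the four precisions, scaled by the brevity penalty.
    return 100.0 * brevity * math.exp(log_precision_sum / 4)


print(bleu4("hello how can i help you", "hello how can i help you"))
print(bleu4("a b c d e", "a b c d"))
```

Because no smoothing is applied, short sentences with no matching 4-gram score zero; published evaluations usually avoid this by aggregating n-gram counts over the whole test set before taking the geometric mean.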
Qualitative sign language translation results.
| Method | Glosses | Translation |
|---|---|---|
| | HELLO I CAN HELP YOU HOW | Hello, how can I help you? |
| SLRGAN+Transformer | HELLO I CAN HELP | Hello, can I help? |
| Deaf-to-hearing SLRGAN+Transformer | HELLO I CAN HELP YOU | Hello, can I help you? |
| Deaf-to-Deaf SLRGAN+Transformer | HELLO I CAN HELP YOU HOW | Hello, can I help you how? |
| | YOU_GIVE_MY PAPER APPROVAL | The secretariat will give you the opinion. |
| SLRGAN+Transformer | ME PAPER APPROVAL DOCTOR | Medical opinion. |
| Deaf-to-hearing SLRGAN+Transformer | YOU_GIVE_MY PAPER APPROVAL DOCTOR | Secretariat will give you the opinion. |
| Deaf-to-Deaf SLRGAN+Transformer | YOU_GIVE_MY PAPER APPROVAL DOCTOR | The secretariat will give you the opinion |
| | YOU HAVE A CERTIFICATE BOSS | You have an employment certificate. |
| SLRGAN+Transformer | YOU HAVE CERTIFICATE DOCTOR OWNER | You have a national team certificate |
| Deaf-to-hearing SLRGAN+Transformer | YOU HAVE CERTIFICATE BOSS | You have employer certificate. |
| Deaf-to-Deaf SLRGAN+Transformer | YOU HAVE CERTIFICATE | You have a certificate. |