
FluentSigners-50: A signer independent benchmark dataset for sign language processing.

Medet Mukushev1, Aidyn Ubingazhibov2, Aigerim Kydyrbekova1, Alfarabi Imashev1, Vadim Kimmelman3, Anara Sandygulova1.   

Abstract

This paper presents a new large-scale signer-independent dataset for Kazakh-Russian Sign Language (KRSL) for the purposes of Sign Language Processing. We envision it serving as a new benchmark dataset for performance evaluations of Continuous Sign Language Recognition (CSLR) and Translation (CSLT) tasks. The proposed FluentSigners-50 dataset consists of 173 sentences performed by 50 KRSL signers, resulting in 43,250 video samples. Dataset contributors recorded videos in real-life settings, against a wide variety of backgrounds, using various devices such as smartphones and web cameras. Therefore, distance to the camera, camera angles and aspect ratios, video quality, and frame rates varied for each dataset contributor. Additionally, the proposed dataset contains a high degree of linguistic and inter-signer variability and is thus a better training set for recognizing real-life sign language. The FluentSigners-50 baseline is established using two state-of-the-art methods, Stochastic CSLR and TSPNet. To this end, we carefully prepared three benchmark train-test splits for model evaluation in terms of signer independence, age independence, and unseen sentences. FluentSigners-50 is publicly available at https://krslproject.github.io/FluentSigners-50/.


Year:  2022        PMID: 36094924      PMCID: PMC9467305          DOI: 10.1371/journal.pone.0273649

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.752


Introduction

Sign languages are natural languages used primarily by deaf communities around the world. Speech in sign languages is realized as sequences of gestures that include movements of the hands, body, arms, head, and face. Similar to spoken languages, sign languages have different levels of linguistic structure, including phonology, morphology, syntax, semantics, and pragmatics. Recently, there has been a substantial increase in interest in automatic sign language recognition [1]. Sign Language Processing (SLP) integrates three related research and development directions, namely automated sign language recognition, generation, and translation, which aim to create technological solutions that help break communication barriers for the Deaf community and sign language users [2]. In order to advance these disciplines, it is necessary to create large sign language corpora for data-driven approaches to learn from.

As rightly pointed out by Bragg et al. [2], one of the main challenges of SLP is related to significant shortcomings of public sign language datasets that limit the power and generalizability of recognition systems trained on them. Apart from apparent limitations such as small vocabulary size, primarily due to expensive recording and annotation processes, most datasets contain only isolated signs (such as MS-ASL [3] and Devisign [4]), which are not appropriate for most real-world use cases: these involve natural signing (continuous and spontaneous) and require training on complete sentences and longer utterances [2]. As a result, Continuous Sign Language Recognition (CSLR) is a more complex problem than Isolated Sign Recognition (ISR), i.e., the recognition of glosses (a written form used to represent signs [2]) per frame or time-step in the video. CSLR is a sequence-to-sequence problem, where the input is the video, i.e., a sequence of images, and the output is an ordered list of glosses. Sign Language Translation (SLT), on the other hand, outputs an ordered list of words that corresponds to the translation into a spoken language (e.g., English). Most works on CSLR and SLT exploit a commonly used benchmark dataset, RWTH-PHOENIX-Weather 2014 [5], which contains continuous signing performed by nine interpreters recorded in professional studio conditions.

Another limitation of most currently used datasets is the lack of environmental variability: they are typically recorded in the same setting(s) and cover a single vocabulary domain, which leads to overfitting in architecturally more complex models [6]. As a result, many CSLR approaches focus only on cropped hands to simplify the problem [2]. This discards information expressed by body movements, facial expressions, and mouthing, which carry essential linguistic and grammatical information. Additionally, many sign language datasets rely on novice or non-native contributors (e.g., students), slower signing, and simplified style and vocabulary, which makes the computer vision problem easier but of little real value [2].

This paper proposes a new large-scale Kazakh-Russian Sign Language (KRSL) dataset, FluentSigners-50, as a new CSLR benchmark. FluentSigners-50 addresses three shortcomings of commonly used datasets identified by Bragg et al. [2]: continuous signing, signer variety, and native signers.
FluentSigners-50's main advantage is its large signer variety: age (ranging from 8 to 57 years old), gender (18 male and 32 female), clothing, skin tone, body proportions, disability (deaf or hard of hearing), and fluency. Additionally, since the dataset was crowd-sourced and participants used a variety of their own recording devices (such as smartphones and web cameras), it features a large variety of backgrounds, lighting conditions, camera quality, frame rates, camera aspect ratios, and angles. Finally, FluentSigners-50 contains recordings of 50 contributors who use sign language on a daily basis: deaf, hard of hearing, hearing CODA (Child of Deaf Adults), or hearing SODA (Sibling of a Deaf Adult). As a result, the dataset contains a high degree of linguistic variability, including phonetic, phonological, lexical, and syntactic variations, and is thus a better training set for the recognition of natural signing. While FluentSigners-50 directly contributes to SLP research related to KRSL, it can also be used to test how well a model generalizes to unseen signers. Fig 1 shows ten participants, illustrating signer variety as well as video-related differences.
Fig 1

Signers showing the sign HI.

The main purpose of this dataset is to serve as a benchmark for sign language recognition and translation architectures. It can help researchers determine whether their proposed models perform well and generalize to unseen signers and different sign languages. Additionally, the dataset can be of interest to sign language linguists, as it contains real-life linguistic and inter-signer variability. Sentence types include statements, polar questions, wh-questions, and requests, allowing linguists to analyze the data with respect to sentence type or non-manual features. Additionally, this paper provides baseline performance of two state-of-the-art architectures for the problems of CSLR and SLT, Stochastic CSLR (SCSLR) [7] and TSPNet [8], respectively. Stochastic CSLR is an end-to-end trainable state-of-the-art model based on the transformer encoder and a Connectionist Temporal Classification (CTC) [9] decoder. It achieves a Word Error Rate (WER) of 25.3 on the RWTH-PHOENIX-Weather 2014 dataset [5] and WERs of 24.9 ± 6.2 on Split 1 (5-fold), 31.7 on Split 1 (one fold), 47.1 on Split 2, 52.0 ± 4.68 on Split 3 (5-fold), and 48.7 on Split 3 (one fold) of the FluentSigners-50 dataset. TSPNet is a novel hierarchical video feature learning method based on a temporal semantic pyramid network. It achieves a BLEU-4 score of 13.41 on the RWTH-PHOENIX-Weather 2014 dataset [5] and 16.0 ± 0.8 on Split 1 (5-fold), 15.7 on Split 1 (one fold), 10.5 on Split 2, 2.0 ± 1.1 on Split 3 (5-fold), and 2.2 on Split 3 (one fold) of the FluentSigners-50 dataset. The remainder of this paper discusses related work, followed by descriptions of the data collection and the three splits. We then briefly introduce the baseline methods used for the experiments detailed in the Experiments section, and end with the Conclusion.

Related work

This section discusses related work on sign language datasets and state of the art in Sign Language Processing.

Sign language datasets

Sign language datasets are of great importance for advancing the tasks of SLR and SLT. Several different types of datasets exist: 1) motion-capture data collected using sensors attached to various body locations (e.g., [10, 11]), 2) RGB-D data collected with the help of depth cameras (e.g., Kinect [12, 13]), and 3) RGB data, which are the most popular due to their direct utility in real-life situations. Such datasets contain videos of either isolated signs or continuous signing. Table 1 presents an overview of the most commonly used sign language datasets that are appropriate for the problem of CSLR, together with the proposed FluentSigners-50.
Table 1

Datasets used for continuous sign language recognition.

This list excludes datasets of isolated signs. The 'Deaf' column indicates whether deaf signers contributed to the dataset. The 'In the wild' column indicates whether the recording settings varied; 'No' means that the settings were the same for all samples.

Datasets | Language | Signers | Deaf | Vocabulary | Samples | In the wild
The SIGNUM (2007) [14] | DGS | 25 | Yes | 780 | 780 | No
The RWTH-BOSTON-400 (2008) [15] | ASL | 4 | Yes | 483 | 843 | No
The RWTH-PHOENIX-Weather 2014T [16] | DGS | 9 | No | 2887 | 8257 | No
Video-Based CSL (2018) [17] | CSL | 50 | No | 178 | 25000 | No
The BSL-1K (2020) [18] | BSL | 40 | Yes | 1064 | - | No
The How2Sign (2020) [19] | ASL | 11 | Yes | 16000 | 35000 | No
FluentSigners-50 | KRSL | 50 | Yes | 278 | 43250 | Yes

The high performance of deep learning methods for sign language recognition and translation tasks requires thousands of training samples. Bragg et al. [2] highlight that only a few publicly available and large-scale sign language corpora exist. Furthermore, they identify the main concerns of existing datasets: a relatively small vocabulary size, absence of spontaneous (real-life) signing, novice signers and interpreters (e.g., students), and a lack of signer variety. Because of the importance of fluency and the naturalness of signing, we should distinguish between datasets containing contributors whose experience in sign language is unknown (e.g., those who learned a few gestures for the sake of dataset collection) and signers who use sign language as their first language. Many datasets record professional interpreters performing interpreting tasks. The nature of the task might lead them to use calque (loan translation), or the signed version of the spoken language, instead of natural sign language. Furthermore, many interpreters acquire sign language as adults, which also has important consequences for their sign language production. Additionally, a distinction should be drawn between datasets with prompted content and "real-life" (i.e., self-generated) signing [2], and between datasets collected in controlled conditions and in the wild (i.e., with varying recording settings and devices).

RWTH-PHOENIX-Weather 2014 [5] is a German Sign Language (DGS) dataset used as a benchmark for most recent works in SLP. It features nine signers who performed sign language translations of the weather forecast on TV broadcasts. RWTH-BOSTON-400 [15] is one of the first CSLR benchmark datasets for American Sign Language (ASL), but it has only four signers present in the videos. In contrast, Video-Based CSL (Chinese Sign Language) [17] involves a large number of participants (n = 50). At the same time, they were all recorded under the same recording settings, and most participants seem to be unfamiliar with sign language, as they sign in slow and artificial ways without involving any facial expressions. SIGNUM [14] is a signer-independent CSLR dataset of DGS whose participants are all fluent in DGS and are either deaf or hard of hearing. However, all videos were shot with a single RGB camera in a supervised condition with the same lighting and a uniform black background. These concerns of existing datasets limit the accuracy and robustness of the models developed for SLR and their contribution to the challenges of real-world signing.

More recent datasets aim to address most challenges of the previous datasets: BSL-1K [18] provides the largest number of annotated signs, while How2Sign [19] provides the largest vocabulary size. Similar to older datasets, however, they were either recorded in a controlled lab environment or extracted from TV broadcasts. From this perspective, FluentSigners-50 is the first sign language dataset that includes 1) a large signer variety recorded in various environmental conditions and 2) fluent sign language contributors (deaf, hard of hearing, CODA, or SODA). Future SLR and SLT models can now be benchmarked on more than one dataset, which will help build more reliable recognition and translation systems.

Sign language recognition

As stated previously, CSLR is a more complex problem than ISR, as it deals with long temporal dependencies. Recently, alignment proposal optimization has been a focus of SLR methods, and deep neural networks, reinforcement learning, and recurrent neural networks have been widely used to advance the field. The works mentioned in this section were evaluated on the RWTH-PHOENIX-Weather 2014 [5] and RWTH-PHOENIX-Weather 2014T [16] datasets, which are used as a community benchmark [1, 2].

Zhang et al. [20] proposed an approach that applies an encoder-decoder structure to reinforcement learning. It was one of the first models to deploy the Transformer [21] for sequence learning in CSLR. The Transformer's attention mechanism was particularly useful for distinguishing an effective sign language signal from a sequence of video clip features. Their method achieved results comparable with other methods, with a WER of 38.3%.

Temporal segmentation creates additional challenges for CSLR. To address this issue, Huang et al. [22] proposed the Hierarchical Attention Network with Latent Space (LS-HAN). This framework eliminated the pre-processing step of temporal segmentation and achieved an accuracy of 0.617. The advantage of this method is that it removed both the error-prone temporal segmentation in the pre-processing phase and the sentence synthesis in the post-processing phase. Compared to other algorithms that exploit a convolutional neural network's feature learning capabilities as well as a recurrent neural network's temporal sequence modeling capabilities, their method uses a similar approach but goes a step further by bridging the semantic gap with a latent space and then applying a Hierarchical Attention Network to hypothesize semantic sentences.

Alternatively, Zhou et al. [23] proposed an I3D-TEM-CTC framework with iterative optimization for CSLR. In this work, they designed a dynamic pseudo-label decoding approach that uses dynamic programming to identify an acceptable alignment path. In contrast to approaches that choose labels with the highest posterior probability from the entire lexicon, or utilize probability distributions directly as pseudo labels, their method filters out clearly incorrect labels and provides pseudo labels that follow the natural word order of sign language. By increasing the quality of pseudo labels, the system's final performance was improved, achieving a WER of 34.5%.

However, the most promising results were achieved by combining different modalities. For example, Koller et al. [24] presented an approach that achieved state-of-the-art results by focusing on sequential parallelism to learn a sign language, a mouth shape, and a hand shape classifier. They improved the WER to 26.0%. This clearly shows that a combination of manual and non-manual features, such as the inclusion of mouth shape, can significantly enhance the performance of recognition systems.

Stochastic CSLR [7] is an end-to-end trainable state-of-the-art model based on the transformer encoder and a Connectionist Temporal Classification (CTC) [9] decoder. Each sign gloss is represented with several states, with the number of states being a categorical random variable that follows a learned probability distribution, resulting in stochastic fine-grained labels for training the CTC decoder. In addition, the authors suggest a stochastic frame dropping mechanism and a gradient stopping approach to address the severe overfitting problem that arises when training the transformer model with the CTC loss. These two approaches also greatly reduce the training cost, both in time and in memory. The model achieves a WER of 25.3 on the RWTH-PHOENIX-Weather 2014 dataset [5], outperforming the result of Koller et al. [24] by 0.7%.

Sign language translation

The main difference between the recognition and translation tasks is that SLT requires learning the sequence of words and their order. Camgoz et al. [16] introduced the RWTH-PHOENIX-Weather 2014T dataset with spoken language annotation as a benchmark for SLT. They used attention-based encoder-decoder models to extract gloss-level features from video frames and applied a sequence-to-sequence model to translate German Sign Language into German. Alternative approaches, such as the work of Ko et al. [25], utilized human keypoint estimation [26] for SLT. They argue that extracting high-level features from sign language video with a sufficiently lower dimension is essential. They trained a sign language translation system based on OpenPose human keypoints [26] and achieved 55.28% accuracy on the test set of the KETI Sign Language Dataset. Orbay and Akarun [27] proposed a pre-processing step called tokenization by multi-task learning and showed that a model can achieve a higher translation score without laborious gloss annotation.

Recently, transformer networks have been applied to SLT and have shown encouraging results. For example, Camgoz et al. [28] applied multi-task transformers to both recognition and translation in an end-to-end manner, achieving a BLEU-4 score of 21.32. This is accomplished by combining the recognition and translation challenges into a single unified architecture utilizing a Connectionist Temporal Classification (CTC) loss. This method eliminates the need for ground-truth timing information while simultaneously addressing two interconnected sequence-to-sequence learning tasks, resulting in a considerable performance increase. They encode each frame individually using pre-trained spatial embeddings from Koller et al. [24], which depend on the gloss annotations.

Li et al. [8] introduced TSPNet, a hierarchical sign video segment representation, and achieved state-of-the-art results for video-to-text translation. TSPNet is a hierarchical video feature learning method built on a temporal semantic pyramid network. It achieves a BLEU-4 score of 13.41 on the RWTH-PHOENIX-Weather 2014T dataset [16]. Although Camgoz et al. [16] obtained a higher BLEU-4 score on the RWTH-PHOENIX-Weather 2014T dataset, TSPNet is the state-of-the-art method for the video-to-text task: Camgoz et al. [16] use an intermediate step of segmenting the video into glosses and then translating them into text, while TSPNet proposes a novel sign video segment representation that alleviates the need for accurate segmentation into glosses. We believe that this approach has broader applicability, as not every dataset has gloss-based annotations. Despite its currently lower performance compared to Camgoz et al. [16], TSPNet can accommodate more datasets that have only spoken language translations of sign language videos.

FluentSigners-50 dataset

Given the importance of signer independence and signer variety, we involved the local Deaf community in FluentSigners-50 data collection.

The data

The FluentSigners-50 dataset consists of everyday conversational phrases and sentences in KRSL, the sign language used in the Republic of Kazakhstan. KRSL is closely related to Russian Sign Language (RSL) and some other sign languages of the ex-Soviet Union. While no official research comparing KRSL with RSL exists, our observations based on our experience researching both languages are that they show a substantial lexical overlap and are entirely mutually intelligible [29]. The issue of dialectal variation of KRSL in Kazakhstan has not been studied at all yet. (See Kimmelman et al. [30] for a discussion of lexical variation in RSL in Russia.) As we discuss below, the signers in the dataset come from different regions of Kazakhstan, so some regional variation is possibly represented in the dataset. However, it was not created for the purposes of studying such variation. The sentences and phrases of FluentSigners-50 represent the following sentence types: statements, polar questions, wh-questions, and requests.

Dataset collection

At first, we invited six professional sign language interpreters who work for national television. They were born and grew up in families with at least one deaf parent (i.e., hearing CODA). We worked together to compose the phrases and sentences for the dataset that are commonly used in the Deaf community. These six interpreters translated the sentences into KRSL and recorded five repetitions per sentence using a Logitech C920 Pro web camera. In addition, we asked one of them to record an instruction video with a welcoming address, an explanation of the task, and one exemplary performance of the 173 sentences in KRSL. We later distributed this video to all other contributors of the dataset. Thus, for all participants except the CODA interpreters, the task was to repeat the KRSL sentences they saw in the recording. This was a tradeoff that we had to make in order for the dataset to contain the same KRSL sentences. Although this approach does not necessarily result in constructions that would typically be produced by native signers, we had to rely on the interpreters' opinions and translations. In addition, the contributors were allowed to make changes to the original interpretation, which often included changing the order of signs, omitting or replacing signs, adding extra signs, and varying the way the signs were performed. As a result of this process, the dataset contains a high degree of linguistic variability. A summary of the FluentSigners-50 dataset is presented in Table 2.
Table 2

Statistics of the FluentSigners-50 dataset.

Video resolution | Range
Number of signers | 50
Repetitions | 5
Number of sentences | 173
Video duration (seconds) | 2∼11
Body joints | Upper body involved
Mean number of signs per sentence/phrase | 4
Vocabulary size | 278
Total number of videos | 43,250
Total number of hours | 43.9 (∼150 raw)
The other contributors to the dataset were friends and relatives of the six interpreters; they participated voluntarily and signed an informed consent form approved by the Ethical Committee of Nazarbayev University. All contributors received monetary compensation for their participation and agreed to have their data shared as a dataset. Note that the informed consent form was translated into KRSL to enable full accessibility. The individuals in this manuscript have given written informed consent (as outlined in the PLOS consent form) to publish their photographs (Fig 1).

All FluentSigners-50 contributors use sign language on a daily basis. Not all of them can be defined as native signers, that is, signers who have acquired KRSL from birth from their parents; note that this notion is in general complicated and questionable [31]. Instead of trying to divide them into native and non-native, we collected data on their hearing status, daily use of KRSL, where they acquired KRSL, and the preferred language of their family when they were growing up (adapted from Allen (2015) [32]). The results of this survey can be seen in Table 3. Concerning hearing status, the participants are deaf (N = 32), hard of hearing (N = 6), hearing SODA (N = 3), or hearing CODA (N = 9). In total, 30 participants (deaf and hearing) had at least one signing deaf parent; 34 of the participants acquired KRSL from birth, 4 in kindergarten, 9 at school, and 3 in adulthood. They come from various regions of Kazakhstan and belong to different age and gender groups. Therefore, the dataset represents a diverse population of signers, mostly deaf and some hearing, with a majority of signers who acquired KRSL early, but also with some signers who acquired it later. Fig 2 shows the demographics of the participants. This dataset is thus more representative of the diversity of the whole signing population than most other comparable datasets for other sign languages.
Table 3

Survey results with KRSL status for participants of FluentSigners-50 dataset.

1. To the best of your memory, or from what your parents have told you, which of the following best describes your use of sign language in your home during your early childhood?
   We only signed and used no spoken language. | 30 (60%)
   We mostly signed, but we used some spoken language as well. | 11 (22%)
   We signed and spoke in roughly equal amounts. | 4 (8%)
   We mostly spoke, but used some sign language too. | 3 (6%)
   We only spoke and used no sign language. | 0 (0%)
   We rarely spoke or signed, but relied on gestures to communicate. | 2 (4%)
2. Do you use sign language on a daily basis?
   Yes | 49 (98%)
   No | 1 (2%)
3. When did you learn sign language?
   From birth | 34 (68%)
   In kindergarten | 4 (8%)
   In school | 9 (18%)
   In adulthood | 3 (6%)
Fig 2

Distribution of FluentSigners-50 contributors’ demographics such as city, age, parents and status (deaf, hard of hearing, hearing SODA or CODA).

The participants were asked to watch the pre-recorded sentences one by one and to record themselves repeating each sentence five times. Such a data collection process did not require the presence of a researcher or an interpreter. Even though the signers were asked to repeat the pre-recorded KRSL sentences, many of them added minor corrections of their own. They performed the sentences in their own way, relying on their own communication experience, manner of interpretation, etc. The collected videos differ in quality and resolution, since participants used their own mobile phones or web cameras, with varying backgrounds, illumination conditions, and camera angles, making the FluentSigners-50 dataset diverse and realistic compared to other CSLR datasets. The filming process took each contributor about 3.5 hours. The duration of all raw videos is more than 150 hours. Each video was carefully validated and annotated, resulting in a total of 43 hours of labeled, trimmed material. Fig 3 shows one frame from the videos of each participant.
Fig 3

Diversity of video resolutions, camera angles, lighting conditions and backgrounds present in FluentSigners-50.

Linguistic properties of the data

In addition to the real-life variability in recording conditions, the dataset also contains a high degree of linguistic variability. It is thus a better training set for recognition of real-life sign language. The phrases recorded mainly include everyday conversational phrases, such as greetings, simple questions, and answers that are part of daily conversations related to age, living, family, food, weather, work, and other categories. The phrases represent the following sentence types: statements, polar questions, wh-questions, and requests. In addition, some phrases are single-sign utterances, such as "Hello!" and "Goodbye!". Some of the statements and requests express negative polarity (contain negation). Many sentences contain 1st or 2nd person pronouns, but no 3rd person pronouns appear in the dataset.

While a complete linguistic analysis of the dataset is yet to be conducted, we can already observe a large amount of variation at different levels, as would also be expected in naturalistic sign language production [33]. Phonetic and phonological variations are observed in many signs. For example, the sign HELLO is produced with 2, 3, or 4 movement repetitions by different signers, which is most likely phonetic variation. The sign TWO.OF.US is produced either with the thumb-and-index handshape or with the index-and-middle handshape (phonological variation). An interesting case is the sign DEAF, which can be produced with movement either from the ear to the mouth or in the opposite direction; in addition, the ear location can be lowered to the cheek; finally, the index finger might touch the initial and final locations, or only approach them (see [34] for similar variation in American Sign Language).

Lexical variation is also found in the dataset, where different lexical signs for the same concept are chosen by different signers. For example, in sentence 109, different lexical signs for 'adore' are used by different signers. On the boundary between lexical and grammatical, variation can be observed in the first person singular pronoun 'I,' which occurs either as pointing to oneself with the index finger, or as touching the chest with a flat hand, or with the handshape representing the Russian letter 'Я'. This letter is used as the first person pronoun in Russian, so the third variant of the pronoun is an instance of borrowing.

Syntactic variation can be observed as well. Word order varies between signers: one pattern concerns the position of wh-signs; for instance, in sentence 65, 'Where were you born?', the wh-sign WHERE occurs either in the initial or the medial position, as also described for some other sign languages [35]. In addition, in some sentences, some signers produce more signs than others, e.g., producing or omitting the past tense marker.

Finally, manual and non-manual prosody also vary in the dataset. Some signers produce phrases much more fluently and fluidly, while others make more considerable pauses between signs. Concerning non-manuals, the eyebrow raise associated with polar question marking is much more pronounced in some signers than in others. Negative facial expression and headshake also vary in degree, but also in scope: in sentence 126, 'No, I do not eat fish', some signers only accompany the negative signs NO and NOT with the non-manuals, while others also mark the subject and the verb; similar variability has been reported for other sign languages [36, 37].

Suggested splits

In contrast to random train-test splits, we propose to set baseline performance for our dataset using three splits: signer-independent, age-independent, and unseen-vocabulary splits. The distributions of each split are shown in Fig 4. We suggest conducting 5-fold cross-validation on Split 1 and Split 3 to report mean and standard deviation results. Split 2 has only one fold, as it does not have enough data points to create five folds that satisfy the split's definition (age independence).
Fig 4

Distribution of the number of frames over sentence-level clips in training, validation and test sets for each split: Split 1 (left), Split 2 (middle), Split 3 (right).

Split I: Signer independence

Signer independence is one of the main challenges that must be addressed for SLR models to have real-life value. Our dataset can help address this challenge, as it provides both visual differences between signers, such as variability in postures, distance to the camera, lighting, backgrounds, camera aspect ratios and angles, frame rates, and quality, as well as linguistic differences between signers, such as phonetic, phonological, lexical, and syntactic variations. In order for implemented architectures to generalize well to unseen signers, the training, validation, and testing splits should contain different signers. It is important for proposed architectures to perform well on signers that are not seen during model training. For this reason, we take all sentences of 40 signers as the training set, and the remaining 10 signers are divided into validation and testing sets (5 signers in the test set and 5 signers in the validation set). We also ensure that the validation and testing sets contain both CODA and non-CODA signers. This is needed to test a model against variation in the participants' signing fluency. As discussed above, the dataset contains variation in fluency between signers. Also, some signs are performed differently by different participants, which brings additional challenges. To this end, the training set has 34,600 samples, the validation set has 4,325 samples, and the testing set has 4,325 samples.
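To make the split construction concrete, below is a minimal sketch of a signer-independent split under assumed field names (the "signer_id" key and the list-of-dicts layout are hypothetical, not the dataset's actual schema):

```python
import random
from collections import defaultdict

def signer_independent_split(samples, n_val=5, n_test=5, seed=0):
    """Assign whole signers to train/val/test so no signer appears in two subsets.

    `samples` is a list of dicts with a hypothetical "signer_id" key.
    """
    by_signer = defaultdict(list)
    for sample in samples:
        by_signer[sample["signer_id"]].append(sample)

    signer_ids = sorted(by_signer)
    random.Random(seed).shuffle(signer_ids)

    test_ids = set(signer_ids[:n_test])
    val_ids = set(signer_ids[n_test:n_test + n_val])

    split = {"train": [], "val": [], "test": []}
    for signer_id, videos in by_signer.items():
        if signer_id in test_ids:
            split["test"].extend(videos)
        elif signer_id in val_ids:
            split["val"].extend(videos)
        else:
            split["train"].extend(videos)
    return split
```

Balancing CODA and non-CODA signers across the validation and test sets, as done in the paper, would additionally require stratifying the shuffled signer list by that attribute.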

Split II: Age independence

An additional advantage of our dataset is the presence of child contributors; we noticed that they increase the difficulty and variability of the dataset due to the differences in signing fluency between child and adult signers. A similar strategy is applied for the second split. However, in this case, we exclude child signers from the training set and explicitly create an age-independent split in which signers aged 9 to 18 years old appear only in the validation and testing sets. Furthermore, in this case, the training set contains videos recorded on web cameras or mobile phone cameras, whereas the testing set contains only videos recorded on mobile phone cameras (as all children used their mobile phones). This brings additional complexity for model testing, as the videos come from a diverse sample space. In this split, the training set has 37,195 samples, the validation set has 1,730 samples, and the testing set has 4,325 samples.

Split III: Unseen sentences and signer independence

As previously discussed, sign languages exhibit several kinds of linguistic variability. FluentSigners-50 is rich in linguistic variability, as contributors often did not strictly follow the exemplary video and interpreted the sentences in the way they are used to. Apart from that, FluentSigners-50's sentences were composed so that there are multiple instances of the same glosses used in several sentences, surrounded by different glosses and in a different order (e.g., I love coffee (S112). I don't drink coffee, but I like tea (S113). Let's have some tea (S117).). Such a split can be used to see how well a proposed architecture can recognize and translate glosses when they are used in other contexts and gloss orders. As the overall vocabulary is limited, we decided to have only training and testing sets. We select ten sentences as the testing set and the remaining 163 sentences as the training set. To this end, the testing set includes signs used in unseen sentences; each sign varies with respect to its position in a sentence and the context of the sentence. Additionally, the test set does not include the same signers as the training set.
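A hedged sketch of how such a doubly disjoint split could be built, keeping both the held-out sentences and the held-out signers out of the training set (the "sentence_id" and "signer_id" keys are illustrative assumptions, not the dataset's actual schema, and the handling of mixed samples is one possible policy):

```python
def unseen_sentence_split(samples, held_out_sentences, held_out_signers):
    """Build train/test sets with disjoint sentence IDs and disjoint signers."""
    train, test = [], []
    for sample in samples:
        unseen_sentence = sample["sentence_id"] in held_out_sentences
        unseen_signer = sample["signer_id"] in held_out_signers
        if unseen_sentence and unseen_signer:
            test.append(sample)
        elif not unseen_sentence and not unseen_signer:
            train.append(sample)
        # Samples that mix a held-out sentence with a training signer (or
        # vice versa) are discarded so both constraints stay strict.
    return train, test
```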

Baseline method

This section focuses on the model architectures and reasoning behind decisions for the model design.

SLR baseline: Stochastic CSLR

Stochastic CSLR [7] is an end-to-end trainable state-of-the-art model based on the transformer encoder [21] and a CTC decoder. Three stochastic components are proposed within this architecture: a stochastic frame dropping mechanism, a stochastic gradient stopping method, and a stochastic fine-grained labeling method. The stochastic frame dropping (SFD) mechanism discards a fixed proportion of the frames in a video, sampled uniformly without replacement. Not only does SFD alleviate overfitting, but it also reduces the memory footprint and training time. The stochastic gradient stopping (SGS) method stops gradient back-propagation through the visual feature extractor for a fixed proportion of frames, again sampled uniformly without replacement. This method is also a measure against overfitting and, at the same time, reduces memory footprint and training time. Stochastic fine-grained labeling (SFL) assigns multiple states to each gloss, the number of which is a categorical random variable. Intuitively, each gloss spans multiple frames, and modeling multiple states helps the network learn discriminative features. The SFL method allows the network to learn the number of states for each gloss: different numbers of states per gloss are sampled in order to learn the probability of the state-number sequences that produce a lower CTC loss. A ResNet-18 [38] model pretrained on ImageNet [39] was used as the visual model for Stochastic CSLR.
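As an illustration of the frame dropping idea described above, here is a minimal sketch (the 50% drop ratio matches the training setup reported later; the tensor layout is an assumption, and this is not the authors' exact implementation):

```python
import torch

def stochastic_frame_drop(video: torch.Tensor, drop_ratio: float = 0.5) -> torch.Tensor:
    """Discard a fixed proportion of frames, sampled uniformly without
    replacement, keeping the surviving frames in temporal order.

    video: tensor of shape (T, C, H, W), where T is the number of frames.
    """
    num_frames = video.shape[0]
    num_keep = max(1, int(round(num_frames * (1.0 - drop_ratio))))
    keep = torch.randperm(num_frames)[:num_keep].sort().values
    return video[keep]
```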

SLT baseline: TSPNet

TSPNet [8] is a state-of-the-art model for the video-to-text sign language translation task. TSPNet introduces inter-scale attention to evaluate and enhance the local semantic consistency of sign segments, and intra-scale attention to resolve semantic ambiguity using non-local video context. It employs an encoder-decoder architecture with two main components. Multi-scale segment representation alleviates the influence of imprecise sign video segmentation by employing a sliding window approach to create video segments with multiple window widths. To reduce the substantial ambiguity in gesture semantics, the authors develop a hierarchical feature learning method that utilizes local temporal structure to enforce semantic consistency with non-local video context. Hierarchical video feature learning is used by the encoder to learn discriminative sign video representations by exploiting the semantic hierarchical structure among video segments. The output of the encoder is fed to a Transformer decoder to produce the translation. In this way, TSPNet develops a segment representation for sign videos and learns both the spatial and the temporal semantics of sign gestures.
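A minimal sketch of the multi-scale sliding-window segmentation idea (the window widths and stride below are illustrative assumptions, not necessarily TSPNet's exact settings):

```python
def multiscale_segments(num_frames, window_sizes=(8, 12, 16), stride=2):
    """Return (start, end) frame indices for sliding windows of several widths,
    producing the multi-scale video segments the encoder attends over."""
    segments = []
    for width in window_sizes:
        last_start = max(num_frames - width, 0)
        for start in range(0, last_start + 1, stride):
            segments.append((start, min(start + width, num_frames)))
    return segments

# Example: a 32-frame clip yields overlapping segments at three temporal scales.
print(len(multiscale_segments(32)))
```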

Results and discussion

Metrics

The following two metrics are commonly used to evaluate model performance on the SLR and SLT tasks.

The Word Error Rate (WER) [5] metric is reported for the SLR baseline:

WER = \frac{S + D + I}{N},

where S is the number of substitutions, D the number of deletions, and I the number of insertions in the minimum-cost alignment that converts the prediction into the reference, and N is the number of words in the reference phrase.

The Bilingual Evaluation Understudy (BLEU) [40] metric is reported for the SLT baseline. An n-gram is a sequence of n words appearing together. The n-gram precision score p_n is calculated as follows:

p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \mathrm{Count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{candidate}} \mathrm{Count}(\text{n-gram})},

where Count(n-gram) is the number of occurrences of a given n-gram in the candidate (predicted) phrase, and Count_clip(n-gram) is that count clipped by the maximum number of occurrences of the n-gram in any of the reference phrases. To penalize short candidates, the brevity penalty is used:

BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases},

where c is the length of the candidate and r is the length of the reference. The final BLEU score is computed as follows:

BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right).

We consider n ∈ {1, 2, 3, 4} with equal weights w_n for our baseline.
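For reference, a minimal WER implementation matching the definition above, using a standard edit-distance dynamic program (a sketch; the implementations used for the reported scores may differ in tokenization details):

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N over the minimum-cost alignment of word lists."""
    n, m = len(reference), len(hypothesis)
    # d[i][j]: minimum edits to turn hypothesis[:j] into reference[:i]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # deletions only
    for j in range(1, m + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + substitution)
    return d[n][m] / max(n, 1)

# Example: one substitution in a five-word reference -> WER = 0.2
print(word_error_rate("i love to drink coffee".split(), "i love to drink tea".split()))
```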

Experiments

Training details

Both Stochastic CSLR [7] and TSPNet [8] are implemented in the PyTorch [41] framework, and a Tesla V100 GPU was used to train the models. For Stochastic CSLR, the video frames are resized to 256 × 256, and random crops of size 224 × 224 are extracted. The model is trained with the Adam optimizer with a batch size of 8 for 30 epochs. The learning rate follows an epoch-dependent schedule with an initial value of η0 = 1 × 10^-4. 50% of the frames are dropped randomly from each video, and the gradient stopping parameter is 75%. TSPNet is developed using the FAIRSEQ [42] framework in PyTorch [41]. Video features are extracted with an I3D network [43] pre-trained on the Kinetics dataset. The model is trained with the Adam optimizer with an initial learning rate of 10^-4. The network trains for a maximum of 200 epochs; the learning rate reduction factor is 0.5 with a patience of 8 epochs. Label smoothing [44] with a weight factor of 0.1 and a weight decay parameter of 10^-4 are applied to regularize the model.
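A hedged sketch of an optimizer and scheduler setup consistent with the TSPNet hyperparameters listed above (Adam, initial learning rate 1e-4, reduce-on-plateau with factor 0.5 and patience 8, weight decay 1e-4, label smoothing 0.1); the placeholder model and the exact scheduler type are assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 512)  # placeholder for the actual translation network

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=8)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# After each epoch, step the scheduler on the validation loss, e.g.:
# scheduler.step(val_loss)
```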

SLR results

Table 4 shows the SLR task results of Stochastic CSLR [7] on FluentSigners-50 and RWTH-PHOENIX-Weather 2014T [16] in terms of the WER score, where lower is better. For the RWTH-PHOENIX-Weather 2014T [16] dataset, we report the numbers from the Stochastic CSLR [7] paper directly.
Table 4

SLR results of Stochastic CSLR [7] on RWTH-PHOENIX-Weather 2014T [16] and different splits of FluentSigners-50.

Dataset | val (WER) | test (WER)
FluentSigners-50: Split 1 | 25.4 ± 2.8 | 24.9 ± 6.2
FluentSigners-50: Split 1 (one fold) | 21.8 | 31.7
FluentSigners-50: Split 2 | 10.6 | 47.1
FluentSigners-50: Split 3 | - | 52.0 ± 4.68
FluentSigners-50: Split 3 (one fold) | - | 48.7
RWTH-PHOENIX-Weather 2014T | 25.1 | 26.1
We first report the results obtained from 5-fold cross-validation in order to avoid a possibly biased test set. Since our dataset is relatively large and each sample is a video, 5-fold cross-validation is computationally intensive; we therefore also present results for a single fold, similarly to the RWTH-PHOENIX-Weather 2014T [16] and Chinese Sign Language (CSL) [17] datasets, which have one fixed test set shared with the community.

The results on Split 2 of FluentSigners-50 demonstrate model generalization across age. The reported results on the testing set suggest that this split was much more difficult than Split 1. This could be explained by the difficulty of generalizing to child signers when the model is trained and validated on adult signers. We noticed that many children were less confident on camera, which might have caused the lower performance of the model compared to Split 1. An interesting observation is that Split 2's WER is 10.6 on the validation set, where the training set does not include child signers, while Split 1's WER is 26.9, where child signers add noise to the training process. However, on the test set, Split 2's WER is 47.1 compared to only 27.3 for Split 1.

The results on Split 3 of FluentSigners-50 show how well the model generalizes to unseen sentences. The reported results show that Split 3 is much more challenging than Split 1 due to the relatively small number of sentences (N = 173): each sentence in the testing set consists of glosses that the model encountered only a few times per epoch during training, and only in sentences different from those in the testing set. Table 5 presents several examples of how the model failed in its predictions. For example, in some cases (S005, S021, S041), the model confused signs with the wrong ones, while in others the model skipped some signs (S081, S134, S159).
Table 5

Ground-truth (GT) and predictions in Split 3 of FluentSigners-50 for SLR.

Sentence ID | Ground-truth | Prediction
S005 | 'у меня все хорошо' | 'у меня все ПЛОХО'
S021 | 'у меня есть новости' | 'у меня ДЛЯ ТЕБЯ есть новости'
S041 | 'желаю хорошо день' | 'КАК ПРОШЕЛ день'
S081 | 'у меня СЕГОДНЯ день рождения' | 'у меня день рождения'
S134 | 'ты мне НЕ нравится' | 'ты мне нравится'
S159 | 'ты глухой' | 'У ТЕБЯ ЕСТЬ ТЕЛЕФОН'
S159 | 'ты глухой' | 'У ВАМ глухой'

SLT results

Table 6 shows the SLT task results of TSPNet [8] on FluentSigners-50 and RWTH-PHOENIX-Weather 2014T [16] across four BLEU scores, where higher is better. For the RWTH-PHOENIX-Weather 2014T [16] dataset, we report the numbers from the TSPNet [8] paper directly.
Table 6

SLT results of TSPNet [8] on RWTH-PHOENIX-Weather 2014T [16] and different splits of FluentSigners-50.

Dataset | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
FluentSigners-50: Split 1 | 20.3 ± 1.0 | 17.8 ± 0.9 | 16.6 ± 0.8 | 16.0 ± 0.8
FluentSigners-50: Split 1 (one fold) | 20.7 | 18.0 | 16.7 | 15.7
FluentSigners-50: Split 2 | 14.2 | 12.0 | 11.0 | 10.5
FluentSigners-50: Split 3 | 5.1 ± 0.45 | 3.9 ± 0.53 | 3.1 ± 0.78 | 2.0 ± 1.1
FluentSigners-50: Split 3 (one fold) | 5.1 | 4.1 | 3.0 | 2.2
RWTH-PHOENIX-Weather 2014T | 36.1 | 23.1 | 16.9 | 13.4
As BLEU-4 is the most demanding of the four scores and the one most commonly used for comparison between models, the BLEU-4 results indicate that Split 1 is the easiest split, with 16.0 compared to 10.5 and 2.0 for Splits 2 and 3, respectively. This is consistent with the Stochastic CSLR results on the SLR task.

Discussion

The results on Split 1 of the FluentSigners-50 dataset demonstrate how a model generalizes to unseen signers and to variability in both camera and environmental settings. In contrast, RWTH-PHOENIX-Weather 2014T has only 9 signers recorded in professional studio conditions, leading to a lack of diversity in signers and recording environments. Thus, models trained on it might overfit to these attributes and perform poorly on unseen signers recorded in various conditions. FluentSigners-50 aims to address these potential limitations of RWTH-PHOENIX-Weather 2014T by providing an opportunity to learn from a large signer variety and in-the-wild recording conditions. This will allow for more robust and applicable solutions. We encourage researchers to conduct performance tests of their proposed models and to report results on both RWTH-PHOENIX-Weather 2014T and Split 1 of FluentSigners-50 to account for this problem. Similarly to the SLR results, the TSPNet results on FluentSigners-50 show the same ordering of split difficulty, with Split 1 being the easiest, followed by Split 2, and Split 3 being the most challenging.

Conclusion

This paper presents the FluentSigners-50 dataset, a new large-scale Kazakh-Russian Sign Language dataset that aims to contribute to the development of continuous sign language recognition by introducing a new large-scale multi-signer benchmark. FluentSigners-50 consists of 173 sentences performed by 50 contributors, with more than 43 hours of video (43,250 samples). The main difference from other sign language datasets is its large number of sign language contributors who are deaf, hard of hearing, hearing CODA, or hearing SODA. Every video was recorded in a different setting, with varying backgrounds and lighting, using the contributors' own web or mobile phone cameras, which resulted in considerable variability in video resolution and frame rate. Additionally, the FluentSigners-50 dataset contains a high degree of linguistic and inter-signer variability and is thus a better training set for recognition of real-life sign language. We establish benchmark performances using state-of-the-art models (Stochastic CSLR and TSPNet) on three splits: signer independence, age independence, and unseen sentences combined with signer independence. The dataset is fully open and available online at https://krslproject.github.io/FluentSigners-50.

This table contains the sentences used in this study.

(PDF) Click here for additional data file. 7 Dec 2021
PONE-D-21-21778
NativeSigners-50: a signer independent benchmark dataset for Sign Language Processing
PLOS ONE Dear Dr. Sandygulova, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Jan 21 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Aaron Jon Newman Academic Editor PLOS ONE Journal requirements: 1. When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. We note that Figure 1 includes an image of a [patients / participants / in the study]. As per the PLOS ONE policy (http://journals.plos.org/plosone/s/submission-guidelines#loc-human-subjects-research) on papers that include identifying, or potentially identifying, information, the individual(s) or parent(s)/guardian(s) must be informed of the terms of the PLOS open-access (CC-BY) license and provide specific permission for publication of these details under the terms of this license. Please download the Consent Form for Publication in a PLOS Journal (http://journals.plos.org/plosone/s/file?id=8ce6/plos-consent-form-english.pdf). The signed consent form should not be submitted with the manuscript, but should be securely filed in the individual's case notes. Please amend the methods section and ethics statement of the manuscript to explicitly state that the patient/participant has provided consent for publication: “The individual in this manuscript has given written informed consent (as outlined in PLOS consent form) to publish these case details”. If you are unable to obtain consent from the subject of the photograph, you will need to remove the figure and any other textual identifying information or case descriptions for this individual. 3. Thank you for stating the following in the Acknowledgments Section of your manuscript: “We would like to thank the dataset contributors for agreeing to participate in data 451 collection. 
This work was supported by the Nazarbayev University Faculty Development 452 Competitive Research Grant Program 2019-2021 “Kazakh Sign Language Automatic 453 Recognition System (K-SLARS)”. Award number is 110119FD4545.” We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: “A.S. was awarded the funding by Nazarbayev University Faculty Development Competitive Research Grant Program 2019-2021 for the project "Kazakh Sign Language Automatic Recognition System (K-SLARS)". Award number is 110119FD4545. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.” Please include your amended statements within your cover letter; we will change the online submission form on your behalf. 4. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. Additional Editor Comments: I am sorry that it has taken so long to reach a decision. It has been very hard to find a second reviewer for this paper. As a result, I have included the one review I received, along with my own review of the paper (below). I agree with all of R1's comments, especially the concern that the data set is not actually available for review. - The motivation for this work is not clear. Please state more clearly the goals of: (a) automated sign language (SL) recognition; (b) automated SL translation; (c) developing corpora to train these. With regard to the last point, is the goal of developing a KRSL database to support the field of SLP generally, or more narrowly SLP related to KRSL? What kind of generalizability is desirable? - Related to the above question, please provide a better rationale for a corpus that includes a wide range of cameras, backgrounds, and other recording details. If the purpose of a corpus is to train SLR/SLT systems to achieve high accuracy, then wouldn't more uniform recording conditions be preferable? - There are a few questions around "native signers". It is true that, like all natural human languages, native signers should be the standard in determining what "correct" SL is, and that non-native signers are likely to produce more variable and nonstandard signing. However, it is not clear that restricting training coprora for automated SLR and SLT tasks to native signers is appropriate. This would seem to limit their generalizability, and indeed due to the variability of when people become deaf, and when, relative to becoming deaf, they learn a SL, and the historic patterns of suppression of SLs in educational systems (e.g., the residential school systems in the US and Canada), there are many more non-native than native signers in most SL communities. 
As a result, there is greater variability amongst signers in how signs are produced. As such, one would predict that a system trained only on native signing would generalize quite badly to much natural SL input. The question/request here is, please provide a discussion of why restricting a corpus like this to native signers is appropriate, and what the limitations are. - p 3 contains the sentence, "Many datasets record professional interpreters who are often not native signers (i.e., CODA)". Being a CODA is not an acceptable definition of "native signer". Firstly, children of deaf adults may not be native signers because their parents don't sign. Secondly, one could be a native signer by virtue of being born to hearing parents who do sign (e.g., the child of two CODAs). Please revise this sentence, and provide an accurate definition of "native signer". - The text of the manuscript is not clear on a very important point - the "native" status of the signers who contributed. Again, being born to a deaf parent is not a guarantee that the child is a native signer, since the parent may not sign. Thus on p. 5 the description, "They are native to KRSL since they were born and grown up in families with at least one deaf parent." is not an acceptable definition of a native signer. Please provide detail on how it was determined/confirmed that each contributor was a "native" signer. - Related to the above, the sentences, "According to this distinction, NativeSigners-50 has 30 CODA contributors (including nine hearing signers) and 20 who are not CODA (16 deaf, one hard of hearing, and three hearing SODA). Nevertheless, we decided to name our dataset “NativeSigners-50” because all of our contributors use sign language daily, and it is their primary language of communication." are problematic. If you believe that 20/30 contributors are not native signers, then it is misleading and frankly incorrect to title your database "NativeSigners-50". It would be more appropriate to remove "native" from the title or use a numeral that accurately reflects the number of native signers who contributed. - It seems problematic that you apparently created signs in a written language, then translated them into KRSL. SLs have their own grammars and so translations from spoken languages can result in constructions that would not be produced by native signers. Please clarify how this was addressed. - Text makes statements that are unsupported by references. For example, p 3 states, "Many datasets record professional interpreters who are often not native signers (i.e., CODA)." But fails to provide citations to clarify what "many" is or what datasets are in question here. - the description of approaches to SLR on p. 4 is too short, and as a result, confusing. It conflates feature extraction methods with training approaches, and reads more like a list of papers and what algorithm each used, than an explanation of the approaches, their advantages, and their limitations. This section should be revised to be more clear and detailed. - The scope and significance of the sentence on p. 4, "All the evaluations were performed on the RWTH-PHOENIX-Weather 2014 [5] and RWTH-PHOENIX-Weather 2014T [15] datasets.", is unclear. What do you mean by "all the evaluations"? Do you mean all of the papers that you cited in this section, or only the Koller et al reference. And, what is the point of stating this? Are you implying that these data sets are considered a benchmark in the field? 
- Related to the previous point, please provide a rationale for using the RWTH-PHOENIX-Weather 2014 data set as a benchmark.
- Another limitation of the SLR and SLT sections on pp. 4-5 is that they do not provide the reader with any information on what benchmark levels of performance are, or what is considered "state of the art" in this field. As a result, when you describe your own methods, it is completely unclear why you chose the Stochastic CSLR and TSPNet approaches - indeed, you didn't even introduce these approaches in the "related work" section, so the reader has no context as to what these are, why they were chosen, or how they compare to other approaches.
- Typically in ML, one defines the proportion of items assigned to the training, validation, and test sets a priori as percentages of the total items (e.g., 80-10-10). In the present case (Splits 1 and 2), it does appear that your test set comprises 10% of your samples (although this is never expressed as a percentage); however, your validation set comprises only 4% of the data. Please provide a rationale for this low number.
- Please clarify how the RWTH-PHOENIX-Weather 2014[T] data set was used. Did you divide it into training, test, and validation sets? If so, how and in what proportions?
- I agree with R1's concerns regarding using a single split of each type. This raises significant concerns about generalizability, because your results in the present case are entirely dependent on the choice of what items were assigned to which set. By using a number of different folds you would increase generalizability, as well as the reader's confidence that your results were robust.
- Please provide more details on your methods. For instance, it is insufficient to say that "Before passing through the model, each raw image is preprocessed." What preprocessing steps were performed?
- Please provide a citation for the "end2end architecture".
- Please ensure that all acronyms and abbreviations are defined, and do so the first time they are used (e.g., WER, CTC).
- I am unclear on the reason for discussing the fact that "n-gram BLEU scores rapidly increase as n decreases from 4 to 1". This seems intuitive, since any training set will naturally include fewer repetitions of any 4-gram than 3-gram, etc. - and, as noted, this is a property of the results on the RWTH-PHOENIX-Weather 2014T data set as well. Thus, your point seems irrelevant to your paper or your dataset (although it doesn't drop as dramatically in Splits 1 or 2). Perhaps your point has to do with the ability to generalize to signs/phrases not seen in the training set; however, if so, this needs to be discussed more clearly and explicitly.
- I also disagree with the statement that "the results of TSPNet on RWTH-PHOENIX-Weather 2014T demonstrate similar behavior of rapid increase (13.41 increasing to 16.88 to 23.12 to 36.10) but with a much better BLEU-4 score (13.41 vs. 3.08)." - because you are selectively reporting only the NativeSigners-50 result for the poorest-performing split; the performance on Split 1 is actually higher than for RWTH-PHOENIX-Weather 2014T.
- Please provide some discussion of how the SLR and SLT results compare between your data set and the RWTH-PHOENIX-Weather 2014[T] one. You present comparisons of these in tables, but the significance of these is never discussed. Given that you refer to "potential limitations" of RWTH-PHOENIX-Weather 2014[T], how do you reconcile the fact that SLR/SLT performance was in general very similar between the two data sets?
What would the benefit be to researchers of employing your suggestion that they report results of training applied to both your and this other data set?
- Please correct the text on lines 126-127 of p. 4, which appears to be a mistaken pasting of text, or some sort of formatted article generation error.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
Reviewer #1: No
4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
5. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: The paper presents a new dataset of Kazakh-Russian Sign Language from 50 native signers. The dataset contains native signers, different ages, and different devices. Hence, it is challenging. However, the dataset is not available. The link provided is empty. I would like to inspect the dataset, the annotations, etc. The annotation protocol is not provided. The dataset has 3 splits: signer independence, age, and unseen sentences. Split 1 has 2 validation subjects, 5 test subjects, and the rest is training. Is it just one fold or several folds? If so, how many? 1 fold is not enough because the selected 5 subjects may be too easy or too hard. Ideally, one must have, say, 10 folds, with a random 5 subjects in test each time. Then, one can report mean performance figures and variance. The paper is nicely written. However, figures could be of better quality with informative captions. It would be good to have successful and failing examples in split 3 as a figure. The term "signed speech" should be replaced with "sign language". You have used it once on page 1 and then at the end of page 5.
Signed speech is signing each word in the order of spoken language, simultaneously with spoken language. This dataset is sign language, not signed speech or signed language. The paper uses the RWTH-PHOENIX-Weather-2014T dataset as a sign language recognition and translation benchmark. Stochastic CSLR is used as a benchmark for continuous SLR; 26.1 WER is obtained on Phoenix. TSPNet is used as a segmentation benchmark for sign language translation using multi-scale attention; a 13.41 BLEU-4 score is obtained. The results for the RWTH-PHOENIX-Weather-2014T dataset are the same numerical values as in the respective papers. Did you report the numbers from the authors' papers directly, or did you train these methods yourselves and confirm these numbers? Camgoz et al. (Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation) obtain 24.5 WER and a 21.8 BLEU-4 score on the same 2014T dataset in 2020. You already have the reference in your literature section, but omitted the results. Why does split 3 obtain better BLEU-1 to BLEU-3 scores than split 1? WER scores from the first method do not correlate with these results. Is signer dependence the cause of this? If so, would removing sentences from the same user make split 3 more realistic?
Some minor errors to be corrected:
- recognizing a real-life signed speech --> recognizing real-life sign language
- were born and grown up --> were born and grew up
- 173 translations --> 173 sentences
- too many uses of "etc."
6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
1 Mar 2022
Dear editor Aaron Jon Newman and the reviewer,
We are grateful for your time and reviews. We carefully addressed all the raised points and highlighted all the implemented changes with blue color in the marked-up copy of the manuscript. We respond to questions and comments in the Response letter attached with this submission. Thank you for your time and consideration.
Submitted filename: Response to Reviewers_ krsl dataset PLOS ONE .pdf
8 Jun 2022
PONE-D-21-21778R1
FluentSigners-50: a signer independent benchmark dataset for Sign Language Processing
PLOS ONE
Dear Dr. Sandygulova,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
I have included reviews from one previous and one new reviewer. Please address their comments. Although R3 suggested "major revisions", I feel their points can be addressed with minor revisions. I think some additional context for the data set and its possible applications will strengthen the paper; however, I do not think this necessitates much in the way of added text. I do agree that some additional figures would enhance the paper as well.
Please submit your revised manuscript by Jul 23 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
- A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
- A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
- An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.
We look forward to receiving your revised manuscript.
Kind regards,
Aaron Jon Newman
Academic Editor
PLOS ONE
Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #2: All comments have been addressed
Reviewer #3: (No Response)
2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #2: Yes
Reviewer #3: Partly
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #2: I Don't Know
Reviewer #3: N/A
4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file).
The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
Reviewer #2: No
Reviewer #3: Yes
5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #2: Yes
Reviewer #3: Yes
6. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #2:
1. In Paper [1], did they mention 300 distinct sign languages?
2. Comparison with the RWTH-PHOENIX-Weather database is not convincing?
3. Paper formatting needs attention.
Reviewer #3: The paper presents a dataset of Russian sign language. This is a valuable resource, as we need to recognize, document, and study different sign languages around the world. The paper needs major revisions before it’s ready for publication. It needs clear arguments and explanations for how it can be used. What are the kinds of questions that people can answer with this dataset? Is this going to be good for linguistics and cognitive science studies, for coreference resolution for example? Is the quality of the facial gestures good enough for simultaneous sign modeling? Is it only good for certain computer vision modeling tasks? The paper needs lots of screenshots, figures, and examples so the reader can learn about the context, use cases, environment of the data collection, etc. In its current form, it’s really hard for readers to understand the context and details of the dataset. The anonymity issue is not discussed. If the dataset is going to be available publicly, how are you anonymizing it? What kinds of consent forms do researchers need to sign in order for them to be able to work with the facial gestures? How many different Russian sign language dialects exist? How diverse is your dataset? What are the limitations?
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #2: No
Reviewer #3: No
11 Jul 2022
Dear editor and the reviewers,
We are grateful for your time and reviews. Below we address the points raised by the reviewers and describe the changes we implemented. All the implemented changes are highlighted with blue text color in the marked-up copy of the manuscript. We address each comment in detail below.
Reviewer #2:
1. In Paper [1], did they mention 300 distinct sign languages?
Thank you for pointing this out. Yes, they analyzed 300 works on sign language recognition, not 300 sign languages. We corrected this text in the paper.
2. Comparison with the RWTH-PHOENIX-Weather database is not convincing?
According to [1], the RWTH-PHOENIX-Weather dataset is referred to as “the only resource for large vocabulary continuous sign language recognition benchmarking world wide”, although it has its own limitations. As the goal of our work is to provide an alternative dataset that could be used for sign language recognition tasks, a comparison to RWTH-PHOENIX-Weather is required. This way, researchers can test their models’ performance on unseen signers and new sign languages.
3. Paper formatting needs attention.
Thank you. We tried to address the formatting issues.
Reviewer #3: The paper presents a dataset of Russian sign language. This is a valuable resource, as we need to recognize, document, and study different sign languages around the world. The paper needs major revisions before it’s ready for publication. It needs clear arguments and explanations for how it can be used. What are the kinds of questions that people can answer with this dataset? Is this going to be good for linguistics and cognitive science studies, for coreference resolution for example? Is the quality of the facial gestures good enough for simultaneous sign modeling? Is it only good for certain computer vision modeling tasks?
The main purpose of this dataset is to be used as a benchmark for sign language recognition/translation architectures. It can help researchers find out if their proposed models perform well and can generalize on unseen signers and different sign languages. Additionally, the dataset can be of interest to sign language linguists as it has real-life, linguistic, and inter-signer variability. Sentence types include statements, polar questions, wh-questions, and requests. This can allow linguists to analyze the data regarding its sentence type or non-manual features. Although the dataset has its own shortcomings, we believe that it is necessary to provide alternative datasets for ML purposes to be used by the community working on sign language recognition and translation tasks. We added these details to the paper.
The paper needs lots of screenshots, figures, and examples so the reader can learn about the context, use cases, environment of the data collection, etc. In its current form, it’s really hard for readers to understand the context and details of the dataset.
Thank you for your suggestion. We have updated the figures and added more visual examples from the dataset.
The anonymity issue is not discussed. If the dataset is going to be available publicly, how are you anonymizing it? What kinds of consent forms do researchers need to sign in order for them to be able to work with the facial gestures?
The contributors agreed and signed consent forms for the dataset to be released publicly. It is freely available to be used for research and development purposes.
How many different Russian sign language dialects exist? How diverse is your dataset? What are the limitations?
Russian Sign Language (RSL) is the signed language used in the Russian Federation; it has dialectal variation even though it is understudied at the moment (see e.g. https://www.frontiersin.org/articles/10.3389/fpsyg.2021.740734/full). The signed language used in Kazakhstan, which is under investigation in the current study, is, as discussed, very close to RSL lexically and most likely grammatically, which is why we use the name KRSL for it. Some other ex-Soviet countries also have sign languages that were heavily influenced by or closely related to RSL, but this is not relevant for the analysis of KRSL, so we do not discuss it. Finally, the issue of dialectal variation of KRSL in Kazakhstan has not been studied at all. As mentioned in the paper, the signers in the dataset come from different regions of Kazakhstan, so some regional variation is possibly represented in the dataset. However, it was not created for the purposes of studying dialectal variation. We added this explanation to the paper. Thank you for your question.
Submitted filename: Response to Reviewers 2_ krsl dataset PLOS ONE .pdf
15 Aug 2022
FluentSigners-50: a signer independent benchmark dataset for sign language processing
PONE-D-21-21778R2
Dear Dr. Sandygulova,
We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Aaron Jon Newman
Academic Editor
PLOS ONE
Additional Editor Comments (optional): Thank you for your revisions. Although R2 made reference to unclear figures and captions in their most recent review, I believe they may not have downloaded the high-resolution TIF versions, because I find those to be quite clear and readable. Therefore, I am pleased to recommend your manuscript for publication.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #2: All comments have been addressed
2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #2: Yes
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #2: Yes
4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
Reviewer #2: Yes
5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #2: Yes
6. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #2: I am satisfied with the revision, but the legends of the figures are still not clear/readable. They need to be clear so that they can be read easily.
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #2: No
2 Sep 2022
PONE-D-21-21778R2
FluentSigners-50: a signer independent benchmark dataset for sign language processing
Dear Dr. Sandygulova:
I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.
If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.
If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access.
Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Aaron Jon Newman
Academic Editor
PLOS ONE
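
The first-round comments above question the 10%/4% test/validation proportions and the reliance on a single signer-independent split, with Reviewer #1 suggesting roughly ten folds of five randomly chosen test signers each, reported as a mean and variance. The sketch below illustrates only that suggested protocol; the signer IDs, fold counts, and the evaluate_wer placeholder are hypothetical and do not reflect the authors' actual training or evaluation code.

```python
# Hypothetical repeated signer-independent evaluation, as suggested in review.
# evaluate_wer is a stand-in for "train a CSLR model and return its test WER".
import random
import statistics

signer_ids = list(range(50))          # 50 contributors, as in FluentSigners-50
n_folds, n_test, n_val = 10, 5, 2     # reviewer's suggestion: ~10 folds, 5 test signers

def evaluate_wer(train_signers, val_signers, test_signers):
    """Placeholder: train on train_signers, tune on val_signers, return test WER."""
    return random.uniform(30, 50)     # illustrative value only

wers = []
for fold in range(n_folds):
    shuffled = random.sample(signer_ids, len(signer_ids))   # fresh random assignment
    test_s  = shuffled[:n_test]
    val_s   = shuffled[n_test:n_test + n_val]
    train_s = shuffled[n_test + n_val:]
    wers.append(evaluate_wer(train_s, val_s, test_s))

print(f"WER over {n_folds} signer-independent folds: "
      f"mean={statistics.mean(wers):.2f}, stdev={statistics.stdev(wers):.2f}")
```

Reporting the mean and standard deviation across such folds is what would address the robustness concern raised by both the editor and Reviewer #1, since a single held-out signer set may be unusually easy or hard.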
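The review discussion also turns on how cumulative n-gram BLEU behaves as n drops from 4 to 1 (e.g., TSPNet's 13.41 rising to 36.10 on RWTH-PHOENIX-Weather 2014T). The short sketch below, using NLTK's corpus_bleu on made-up token sequences rather than any dataset translations, shows why BLEU-1 is typically higher than BLEU-4: longer n-grams find matches in the reference far less often.

```python
# Illustration of cumulative BLEU-1..BLEU-4 on a toy hypothesis/reference pair.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["the", "doctor", "will", "see", "you", "tomorrow", "morning"]]]
hypotheses = [["the", "doctor", "sees", "you", "tomorrow", "morning"]]

smooth = SmoothingFunction().method1   # avoids zero scores when no 4-gram matches
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights over orders 1..n
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```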
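Word Error Rate (WER), the CSLR metric quoted throughout the reviews (26.1 for Stochastic CSLR on PHOENIX-2014, 24.5 reported by Camgoz et al.), is the Levenshtein edit distance between the predicted and reference gloss sequences, normalised by the reference length. A generic implementation, not taken from the paper's evaluation code, might look like this:

```python
# Generic WER: word-level edit distance / reference length, in percent.
def wer(reference, hypothesis):
    # Dynamic-programming Levenshtein distance over word (gloss) sequences.
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hypothesis) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or match
    return 100.0 * d[len(reference)][len(hypothesis)] / len(reference)

print(wer("doctor see you tomorrow".split(), "doctor you tomorrow".split()))  # 25.0
```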
References (5 in total)

1.  Weakly Supervised Learning with Multi-Stream CNN-LSTM-HMMs to Discover Sequential Parallelism in Sign Language Videos.

Authors:  Oscar Koller; Necati Cihan Camgoz; Hermann Ney; Richard Bowden
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2019-04-15       Impact factor: 6.226

2.  OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields.

Authors:  Zhe Cao; Gines Hidalgo Martinez; Tomas Simon; Shih-En Wei; Yaser A Sheikh
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2019-07-17       Impact factor: 6.226

3.  On the Reliability of the Notion of Native Signer and Its Risks.

Authors:  Giorgia Zorzi; Beatrice Giustolisi; Valentina Aristodemo; Carlo Cecchetto; Charlotte Hauser; Josep Quer; Jordina Sánchez Amat; Caterina Donati
Journal:  Front Psychol       Date:  2022-03-14

4.  Eyebrow position in grammatical and emotional expressions in Kazakh-Russian Sign Language: A quantitative study.

Authors:  Vadim Kimmelman; Alfarabi Imashev; Medet Mukushev; Anara Sandygulova
Journal:  PLoS One       Date:  2020-06-02       Impact factor: 3.240

5.  Exploring Networks of Lexical Variation in Russian Sign Language.

Authors:  Vadim Kimmelman; Anna Komarova; Lyudmila Luchkova; Valeria Vinogradova; Oksana Alekseeva
Journal:  Front Psychol       Date:  2022-01-05
