Sunusi Bala Abdullahi1,2, Kosin Chamnongthai3.
Abstract
Complex hand gesture interactions among dynamic sign words may lead to misclassification, which lowers the recognition accuracy of a ubiquitous sign language recognition system. This paper proposes to augment the feature vector of dynamic sign words with knowledge of hand dynamics as a proxy and to classify dynamic sign words using motion patterns based on the extracted feature vector. Some double-hand dynamic sign words have ambiguous or similar features across a hand motion trajectory, which leads to classification errors. The similar or ambiguous hand motion trajectory is therefore determined from an approximation of a probability density function over a time frame. The extracted features are then enhanced by a transformation using maximal information correlation. These enhanced features of 3D skeletal videos captured by a leap motion controller are fed as a state transition pattern to a classifier for sign word classification. To evaluate the proposed method, an experiment was performed with 10 participants on 40 double-hand dynamic ASL words, yielding 97.98% accuracy. The method is further evaluated on the challenging ASL, SHREC, and LMDHG data sets and outperforms conventional methods by 1.47%, 1.56%, and 0.37%, respectively.
Keywords: American sign language words; bidirectional long short-term memory; computer vision; deep learning; dynamic hand gestures; leap motion controller sensor; sign language recognition; ubiquitous system; video processing
Year: 2022 PMID: 35214309 PMCID: PMC8963088 DOI: 10.3390/s22041406
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Table of Related works.
| Algorithm Name | Methodology | Results (%) | Limitations |
|---|---|---|---|
| 1. DEEP LEARNING-BASED SYSTEM | |||
| Konstantinidis et al. [ | Meta-learner + stacked LSTMs | 99.84 and 69.33 | Computational complexity |
| Ye et al. [ | 3DRCNN + FC-RNN | 69.2 for 27 ASL words | annotation and labeling is required |
| Parelli et al. [ | Attention-based CNN | 94.56 and 91.38 | large-scale data set required |
| Asl-3dcnn [ | cascaded 3-D CNN | 96.0, 97.1, 96.4 | spatial transformation |
| B3D ResNet [ | 3D-ResConVNet + BLSTM | 89.8 and 86.9 | Misclassification + large data required |
| Rastgoo et al. [ | SSD + 2DCNN + 3DCNN + LSTM | 99.80 and 91.12 | complex background + large data |
| 2. FEATURE-BASED SYSTEM | |||
| 2.1 Static sign words | |||
| Mohandes et al. [ | MLP + Naïve Bayes | 98 and 99 | Not very useful for daily interaction |
| Nguyen and Do [ | HOG + LBP + SVM | 98.36 | Not very useful for daily interaction |
| 2.2 Dynamic Sign words | |||
| 2.2.1 single-hand dynamic sign words recognition | |||
| Naglot and Kulkarni [ | LMC + MLP | 96.15 | Misclassification |
| Chopuk et al. [ | LMC + polygon region + Decision tree | 96.1 | Misclassification |
| Chong and Lee [ | LMC + SVM + DNN | 80.30 and 93.81 | Occlusion and similarity |
| Shin et al. [ | SVM + GBM | Massey 99.39, ASL alphabet 87.60, FingerspellingA 98.45 | error in estimated 3D coordinates |
| Vaitkevicius et al. [ | SOAP + MS SQL + HMC | 86.1 ± 8.2 | Misclassification |
| Lee et al. [ | LSTM + k-NN | 99.44, at 5-fold 91.82 | limited extensibility due to tracking |
| Chophuk and Kosin [ | BiLSTM | 96.07 | limited extensibility to double hand |
| 2.2.2 double-hand dynamic sign words recognition | |||
| Igari and Fukumura [ | minimum jerk trajectory + DP-matching + Via-points + CC | 98 | Cumbersome + limited number of features |
| Dutta and GS [ | Minimum EigenValue + COM Server | Text + Speech | Poor extensibility to word |
| Demircioglu et al. [ | Heuristics + RF + MLP | 99.03, 93.59, and 96.67 | insufficient hand features |
| Deriche [ | CyberGlove + SVM | 99.6 | Cumbersome + intrusive |
| DLMC-based ArSLRs [ | LDA + GMM bayes classifier | 94.63 | sensor fusion complexity + separate feature learning |
| Deriche et al. [ | Dempster-Shafer + LDA + GMM | 92 | Misclassification and not mobile |
| Haque et al. [ | Eigenimage + PCA + k-NN | 77.8846 | complex segmentation + few features |
| Raghuveera et al. [ | SURF + HOG + LBP + SVM | 71.85 | non-scalable features + segmentation |
| Mittal et al. [ | CNN + LSTM | Word 89.5 | low accuracy due to weak learning |
| Katilmis and Karakuzu [ | LDA + SVM + RF | 93, 95 and 98 | letters are not very useful for daily communication |
| Karaci et al. [ | LR + k-NN + RF + DNN + ANN | cascade voting achieves 98.97 | letters are not very useful for daily communication; fails to track double hands |
| Kam and Kose [ | Expert systems + LMC + WinApp | SLR App | Mobility is not actualized |
| Katilmis and Karakuzu [ | ELM + ML-KELM | 96 and 99 | complex feature extension may fall into over-fitting |
| Hisham and Hamouda [ | Latte Panda + Ada-Boosting | double hand accuracy 93 | similarity due to tracking issues |
| Avola et al. [ | Multi-stacked LSTM | 96 | insufficient hand features |
Figure 1. Procedures of the proposed method.
Figure 2. Skeletal palm joints with angle and joint length primes captured by LMC [43].
Figure 3. 3D hand motion trajectories across video frames using EKF.
Results of correlational analysis.
| | Shape | Motion | Position | Angles | Pause | Relative trajectories |
|---|---|---|---|---|---|---|
| Shape | 1 | |||||
| Motion | 0.9444 | 1 | ||||
| Position | 0.8781 | 0.9351 | 1 | |||
| Angles | 0.86728 | 0.93985 | 0.84453 | 1 | ||
| Pause | 0.87361 | 0.71931 | 0.90719 | 0.89857 | 1 | |
| Relative trajectories | 0.95351 | 0.94203 | 0.89075 | 0.90681 | 0.81375 | 1 |
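The correlation table can be read more easily once the lower triangle is expanded into a full symmetric matrix and the feature pairs are ranked. The following is a minimal sketch that only re-uses the values printed in the table; the helper names `full_matrix` and `ranked_pairs` are hypothetical and are not part of the authors' pipeline.

```python
# Lower-triangular correlation values copied from the table above.
FEATURES = ["Shape", "Motion", "Position", "Angles", "Pause", "Relative trajectories"]
LOWER = [
    [1.0],
    [0.9444, 1.0],
    [0.8781, 0.9351, 1.0],
    [0.86728, 0.93985, 0.84453, 1.0],
    [0.87361, 0.71931, 0.90719, 0.89857, 1.0],
    [0.95351, 0.94203, 0.89075, 0.90681, 0.81375, 1.0],
]


def full_matrix(lower):
    """Mirror a lower-triangular list of rows into a full symmetric matrix."""
    n = len(lower)
    matrix = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(lower):
        for j, value in enumerate(row):
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix


def ranked_pairs(lower, names):
    """Return (correlation, feature_a, feature_b) for every off-diagonal pair, ascending."""
    pairs = [
        (lower[i][j], names[j], names[i])
        for i in range(len(lower))
        for j in range(i)
    ]
    return sorted(pairs)


MATRIX = full_matrix(LOWER)
PAIRS = ranked_pairs(LOWER, FEATURES)
weakest, strongest = PAIRS[0], PAIRS[-1]
print("weakest pair:", weakest)
print("strongest pair:", strongest)
```

With the printed values, the weakest pair is Motion and Pause (0.71931) and the strongest is Shape and Relative trajectories (0.95351).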
Figure 4. Cumulative match characteristics curve of the features from MIC.
Figure 5. Proposed architecture of multi-stack deep BiLSTM.
Extracted features.
| Features | Point of Interest | Description |
|---|---|---|
| Angle points | Pitch, yaw and roll | Hand orientation; 44 skeletal hand joints |
| Relative trajectories | Motion | Hand motion trajectories, frame speed and velocity. |
| Positions | Direction | Arm, palm, wrist and five fingers (thumb, index, middle, ring and pinky). |
| Finger joints | Coordinates | Coordinates of the five fingers' tips, metacarpal, proximal, distal and interdistal. |
| Hand pause | Motion | Hand preparation and retraction. |
Figure 6. Diagram of transfer deep LSTM network.
Network parameter selection.
| Network Layers | Parameter Options | Selection |
|---|---|---|
| Input layer | Sequence length | Longest |
| | Batch size | 27 |
| | Input per sequence | 170 |
| | Feature vector | 1 dimension |
| Hidden layer | 3 Bi-LSTM layers | Longest |
| | Hidden state | 200 |
| | Activation function | Softmax |
| Output layer | LSTM model | Many to one |
| | Number of classes | 40 |
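To get a feel for the size of the network these parameters describe, the trainable weights can be counted by hand. The sketch below is a rough estimate under several assumptions that the table does not confirm: the 170 "input per sequence" is treated as the per-time-step feature dimension, each LSTM gate has a single bias vector (some frameworks keep two), and the many-to-one output is a dense softmax layer over the concatenated forward and backward states. The function names are hypothetical.

```python
def lstm_params(input_size, hidden_size):
    """Weights plus one bias vector for the four gates of a single-direction LSTM."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def bilstm_stack_params(input_size, hidden_size, num_layers, num_classes):
    """Parameter count of stacked bidirectional LSTM layers followed by a dense layer."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        total += 2 * lstm_params(layer_input, hidden_size)  # forward + backward direction
        layer_input = 2 * hidden_size  # the next layer sees both directions concatenated
    total += layer_input * num_classes + num_classes  # dense output layer (weights + bias)
    return total


# Values taken from the table; their interpretation is an assumption (see lead-in).
TOTAL = bilstm_stack_params(input_size=170, hidden_size=200, num_layers=3, num_classes=40)
print(f"approximate trainable parameters: {TOTAL:,}")
```

Under these assumptions the stack has roughly 2.5 million parameters, most of them in the second and third BiLSTM layers, whose input is the 400-dimensional concatenated state.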
Execution environment.
| Deployment | Descriptions |
|---|---|
| PC | Dell |
| | CPU: Intel Core i7-9th Gen |
| | Memory Size: 8 GB DDR4 |
| | SSD: 500 GB |
| LMC sensor | Frame rate: 64 fps |
| | Weight: 32 g |
| | Infrared camera: 2 × 640 × 240 |
| | Range: 80 cm |
| Participants | Ten |
| Selections | 40 ASL words |
| | 10 times number of occurrence |
Figure 7. Experimental setup.
Dataset description.
| Classes | Amount |
|---|---|
| Frames | 524,000 |
| Samples | 4000 |
| Duration (sec) | 8200 |
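The dataset figures can be cross-checked against the execution environment with simple arithmetic. The sketch below assumes that "10 times number of occurrence" means each participant signed each word ten times; the variable names are illustrative only.

```python
# Figures copied from the dataset and execution-environment tables.
FRAMES = 524_000
SAMPLES = 4_000
DURATION_S = 8_200
PARTICIPANTS, WORDS, REPETITIONS = 10, 40, 10  # repetitions per word is an assumption

expected_samples = PARTICIPANTS * WORDS * REPETITIONS
frames_per_sample = FRAMES / SAMPLES
seconds_per_sample = DURATION_S / SAMPLES
effective_fps = FRAMES / DURATION_S

print("expected samples:", expected_samples)
print("frames per sample:", frames_per_sample)
print("seconds per sample:", seconds_per_sample)
print("effective frame rate:", round(effective_fps, 1))
```

Under that reading the sample count matches (10 × 40 × 10 = 4000), each sample averages 131 frames over about 2.05 s, and the implied frame rate of about 63.9 fps is consistent with the 64 fps listed for the LMC sensor.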
Proposed models combination.
| Features Combination | Model Depth | Epochs | Execution Time (s) |
|---|---|---|---|
| Shape + Motion + Position + Angles + Pause + Relative trajectory | 3-BiLSTM layers | 300 | 1.05 |
| Shape + Motion + Position + Angles | 3-BiLSTM layers | 300 | 1.01 |
Figure 8. Training performance of Deep BiLSTM network on Model-1.
Figure 9. Training performance of Deep BiLSTM network on Model-2.
Performance evaluation of multi-stack deep BiLSTM network using leave-one-subject-out cross-validation.
| Word Classes | | | | | | | |
|---|---|---|---|---|---|---|---|
| Again | 0.98 | 0.7 | 1 | 0.98 | 0.992361 | 0.972361 | 0.494949 |
| Angry | 0.92 | 0.707721 | 0.842701 | 0.888889 | 0.956522 | 0.845411 | 0.510638 |
| Available | 0.986486 | 0.582301 | 0.970193 | 0.970874 | 0.994819 | 0.965692 | 0.341297 |
| Bad | 0.99 | 0.703562 | 0.959462 | 0.99 | 0.985622 | 0.975622 | 0.497487 |
| Bicycle | 0.993197 | 0.579324 | 0.984886 | 0.98 | 0.972569 | 0.952569 | 0.335616 |
| Big | 0.984772 | 0.707251 | 0.969581 | 0.99 | 0.979381 | 0.969381 | 0.505102 |
| But | 0.980198 | 0.989899 | 0.489899 | 0.989899 | 0.5 | 0.489899 | 0.98 |
| Car | 1 | 1 | 0.899994 | 1 | 0.989457 | 0.989457 | 1 |
| Cheap | 0.975 | 0.701793 | 0.943489 | 0.970297 | 0.999673 | 0.96997 | 0.497462 |
| Clothes | 0.97 | 0.703599 | 1 | 0.960784 | 0.986117 | 0.946902 | 0.5 |
| Cold | 0.994898 | 0.716115 | 0.989841 | 0.990099 | 1 | 0.990099 | 0.512821 |
| Come | 1 | 1 | 0.959284 | 1 | 0.98884 | 0.98884 | 1 |
| Dance | 0.975 | 0.701793 | 0.963497 | 0.970297 | 0.989916 | 0.960213 | 0.497462 |
| Embarrassed | 0.955 | 0.709103 | 0.976592 | 0.933333 | 0.979465 | 0.912798 | 0.507772 |
| Empty | 0.98 | 0.7 | 0.965484 | 0.98 | 0.969476 | 0.949476 | 0.494949 |
| Excuse | 0.98 | 0.7 | 1 | 0.98 | 0.981238 | 0.961238 | 0.494949 |
| Expensive | 0.98 | 0.7 | 0.974596 | 0.98 | 0.961296 | 0.941296 | 0.494949 |
| Finish | 1 | 1 | 0.958249 | 1 | 0.952948 | 0.952948 | 1 |
| Fork | 0.975 | 0.701793 | 0.976222 | 0.970297 | 0.983176 | 0.953473 | 0.497462 |
| Friendly | 0.993103 | 0.571305 | 0.984467 | 0.979167 | 1 | 0.979167 | 0.326389 |
| Funeral | 0.985 | 0.698221 | 0.965785 | 0.989899 | 0.976355 | 0.966254 | 0.492462 |
| Go | 1 | 1 | 1 | 1 | 0.954389 | 0.954389 | 1 |
| Good | 1 | 1 | 0.969829 | 1 | 0.965481 | 0.965481 | 1 |
| Happy | 0.975 | 0.701793 | 0.949985 | 0.970297 | 0.968367 | 0.938664 | 0.497462 |
| Help | 0.98 | 0.7 | 0.947892 | 0.98 | 0.983923 | 0.963923 | 0.494949 |
| Jump | 0.975 | 0.701793 | 0.928965 | 0.970297 | 0.976583 | 0.94688 | 0.497462 |
| March | 0.994898 | 0.716115 | 0.989841 | 0.990099 | 1 | 0.990099 | 0.512821 |
| Money | 0.975 | 0.701793 | 0.939785 | 0.970297 | 0.954873 | 0.92517 | 0.497462 |
| Please | 0.912458 | 0.485965 | 0.801818 | 0.930233 | 0.905213 | 0.835446 | 0.274914 |
| Pray | 0.984694 | 0.705419 | 0.969432 | 0.989899 | 0.979381 | 0.96928 | 0.502564 |
| Read | 0.98 | 0.7 | 0.985672 | 0.98 | 0.972345 | 0.952345 | 0.494949 |
| Request | 0.98 | 0.7 | 0.925498 | 0.98 | 0.991438 | 0.971438 | 0.494949 |
| Sad | 0.979899 | 0.712525 | 0.960582 | 0.961165 | 1 | 0.961165 | 0.507692 |
| Small | 0.989712 | 0.444416 | 0.968427 | 0.95 | 1 | 0.95 | 0.197505 |
| Soup | 0.935 | 0.694709 | 0.977349 | 0.92233 | 0.965481 | 0.887811 | 0.494792 |
| Spoon | 0.979452 | 0.573573 | 0.954375 | 0.97 | 0.984375 | 0.954375 | 0.33564 |
| Time | 0.98 | 0.7 | 0.982396 | 0.98 | 0.974928 | 0.954928 | 0.494949 |
| Want | 0.994819 | 0.714435 | 0.989686 | 0.989899 | 1 | 0.989899 | 0.510417 |
| What | 0.994819 | 0.714435 | 0.989686 | 0.989899 | 1 | 0.989899 | 0.510417 |
| With | 0.984694 | 0.697926 | 0.969432 | 0.979381 | 0.989899 | 0.96928 | 0.489691 |
| AVERAGE | 0.979827 | 0.723467 | 0.949372 | 0.974941 | 0.967648 | 0.942588 | 0.54476 |
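Leave-one-subject-out cross-validation holds out all samples of one participant per fold and trains on the rest. The following is a minimal sketch of the split logic only; the data layout (ten subjects with 400 samples each) is a hypothetical arrangement derived from the dataset size, and `leave_one_subject_out` is not the authors' code.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) once per distinct subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical layout: 10 subjects, each contributing 400 samples (40 words x 10 repetitions).
subject_ids = [subject for subject in range(10) for _ in range(400)]
folds = list(leave_one_subject_out(subject_ids))

for held_out, train, test in folds[:2]:
    print(f"subject {held_out}: train on {len(train)} samples, test on {len(test)}")
```

Under this layout each of the ten folds trains on 3600 samples and tests on the 400 samples of the unseen participant, so the reported scores reflect generalization to a new signer.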
Performance accuracy of the developed models.
| Network Models | Top-1 | Top-2 | Top-3 | Features Combination | Model Depth |
|---|---|---|---|---|---|
| Deep BiLSTM | 0.954 | 0.971 | 0.989 | Shape + Motion + Position + Angles + Pause + Relative trajectory | 3-BiLSTM layers |
| Deep BiLSTM | 0.912 | 0.929 | 0.945 | Shape + Motion + Position + Angles | 3-BiLSTM layers |
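Top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes, which is why the Top-3 column is never lower than Top-1. The sketch below illustrates the metric on made-up scores; the numbers are toy data, not results from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy class scores for 4 samples over 3 classes (made-up numbers).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [0, 1, 1, 0]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(scores, labels, k)}")
```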
Comparison of adopted Deep Bi-LSTM with a state-of-the-art method.
| Methods | Type of Deep Learning | No. of Epochs | Depth of LSTM | Convergence Rate | Execution Time |
|---|---|---|---|---|---|
| Avola et al. [ | Deep Bi-LSTM | 800 | 4 units | 100,000 iter | not reported |
| Our proposal | Deep Bi-LSTM | 300 | 3 units | 10,000 iter | GPU 1002 |
Figure 10. Confusion matrix of the recognition performance of double-hand dynamic ASL words with Adam optimization.
Figure 11. Confusion matrix of the recognition performance of double-hand dynamic ASL words with SGD optimization.
Figure 12. Confusion matrix of the recognition performance of double-hand dynamic ASL words with Adagrad optimization.
Performance validation of the multi-stack deep BiLSTM under each adopted optimization scheme.
| Optimization Scheme | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| AdaGrad | 94.701 | 94.006 | 94.869 | 94.003 |
| SGD | 95.011 | 94.998 | 95.01 | 94.786 |
| Adam | 97.983 | 96.765 | 97.494 | 96.968 |
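Accuracy, precision, recall and F1-score of this kind are typically derived from a confusion matrix such as those in Figures 10 to 12. The sketch below shows one common way to do so with macro averaging; whether the authors used macro, micro or weighted averaging is not stated here, so that choice is an assumption, and the matrix is toy data.

```python
def macro_metrics(confusion):
    """Macro-averaged metrics; confusion[i][j] counts true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])  # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy 2-class confusion matrix (made-up counts, not from the paper).
TOY = [[8, 2], [1, 9]]
example = macro_metrics(TOY)
print({name: round(value, 4) for name, value in example.items()})
```

Because the macro F1 is the mean of per-class F1 scores, it is generally not equal to the harmonic mean of the macro precision and macro recall.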
Performance comparison of the multi-stacked BiLSTM network with the method in [32].
| Approach | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| ASL data set | | | | |
| Avola et al. [ | 96.4102 | 96.6434 | 96.4102 | 96.3717 |
| Ours | 97.881 | 98.007 | 97.879 | 97.998 |
| LMDHG data set | | | | |
| Avola et al. [ | 97.62 | | | |
| Ours | 97.99 | | | |
| SHREC data set | Accuracy (%), 14 Hand Gestures | Accuracy (%), 28 Hand Gestures | | |
| Avola et al. [ | 97.62 | 91.43 | | |
| Ours | 96.99 | 92.99 | | |
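The improvement margins quoted in the abstract (1.47%, 1.56% and 0.37%) can be reproduced from this comparison as plain differences in accuracy. The sketch below assumes that the second and third blocks of the table refer to the LMDHG and SHREC data sets respectively, which the "Ours" figures in the SHREC and LMDHG tables suggest; the dictionary keys are labels chosen for illustration.

```python
# (ours, Avola et al.) accuracies in percent, copied from the comparison table.
# The data set labels for the last three entries are an assumption (see lead-in).
COMPARISON = {
    "ASL": (97.881, 96.4102),
    "SHREC, 28 gestures": (92.99, 91.43),
    "LMDHG": (97.99, 97.62),
    "SHREC, 14 gestures": (96.99, 97.62),
}

# Margin in percentage points, rounded to two decimals as in the abstract.
margins = {
    name: round(ours - baseline, 2)
    for name, (ours, baseline) in COMPARISON.items()
}

for name, margin in margins.items():
    print(f"{name}: {margin:+.2f}")
```

The three positive margins match the abstract. The SHREC figure corresponds to the 28-gesture setting; on the 14-gesture setting the tabulated accuracy is 0.63 points below that of Avola et al.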
Performance of the multi-stacked BiLSTM network initialized with data-driven optimization applied to SHREC data set.
| Approach | Algorithms | Accuracy (%), 14 Hand Gestures | Accuracy (%), 28 Hand Gestures |
|---|---|---|---|
| De Smedt et al. [ | SVM | 88.62 | 81.9 |
| Boulahia et al. [ | SVM | 90.5 | 80.5 |
| Ohn-Bar and Trivedi [ | SVM | 83.85 | 76.53 |
| HON4D [ | SVM | 78.53 | 74.03 |
| Devanne et al. [ | KNN | 79.61 | 62 |
| Hou et al. [ | Attention-Res-CNN | 93.6 | 90.7 |
| MFA-Net [ | MFA-Net, LSTM | 91.31 | 86.55 |
| Caputo et al. [ | NN | 89.52 | - |
| DeepGRU [ | DeepGRU | 94.5 | 91.4 |
| Liu et al. [ | CNN | 94.88 | 92.26 |
| Li et al. [ | 2D-CNN | 96.31 | 93.33 |
| Ours | Multi-stack deep BiLSTM | 96.99 | 92.99 |
Performance of the multi-stacked BiLSTM network initialized with data-driven optimization applied to LMDHG data set.
| Approach | Algorithms | Accuracy (%) |
|---|---|---|
| Boulahia et al. [ | SVM | 84.78 |
| Devanne et al. [ | KNN | 79.61 |
| Lupinetti et al. [ | CNN-ResNet50 | 92 |
| Hisham and Hamouda [ | Ada-boosting | 91.2 |
| Ours | Multi-stack deep BiLSTM | 97.99 |
Figure 13. Misclassified words.
Figure 14. CMC evaluation.