Seyed Ramezan Hosseini, Alireza Taheri, Minoo Alemi, Ali Meghdari.
Abstract
This paper addresses the lack of suitable Learning from Demonstration (LfD) architectures for Sign Language-based Human-Robot Interaction, a gap that limits how easily such interactions can be extended. It proposes and implements an LfD structure for teaching new Iranian Sign Language signs to RASA, a teacher assistant social robot. The architecture uses one-shot learning techniques and a Convolutional Neural Network to learn to recognize and imitate a sign after seeing a single demonstration performed with a data glove. Despite the small, low-diversity data set (~500 signs in 16 categories), the recognition module reached a promising 4-way accuracy of 70% on the test data and showed good potential for extending the sign vocabulary in sign language-based human-robot interactions. The main achievements of this study are the extensibility and the promising results obtained by applying one-shot Learning from Demonstration in social Human-Robot Interaction.
Keywords: Convolutional Neural Network (CNN); Human–Robot Interaction (HRI); One-shot Learning; Sign Language; Social Robotics
Year: 2021 PMID: 34394771 PMCID: PMC8352758 DOI: 10.1007/s12369-021-00818-1
Source DB: PubMed Journal: Int J Soc Robot ISSN: 1875-4791 Impact factor: 3.802
Fig. 1 The RASA robot
Fig. 2 General overview of RASA's cognitive architecture [46]
Fig. 3 An overview of the LfD architecture proposed in this study. A user gives a sign's name to the robot and performs that sign using a data glove. The architecture converts the performed sign to a state image and feeds it to its recognition and regeneration modules. The recognition module passes the incoming state image through a pre-trained convolutional neural network and stores the network's output vector alongside the given name in the robot's Long Term Memory (see Fig. 2). The regeneration module maps the state image to the robot's kinematics and stores the resulting trajectory in the memory. This completes the learning process: the robot can later retrieve the embedded vector to recognize that sign, or the trajectory to regenerate it.
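The store-and-retrieve flow described in Fig. 3 can be illustrated with a minimal sketch. Everything named here is hypothetical: `SignMemory` stands in for the robot's Long Term Memory, the embedding vectors are assumed to come from the pre-trained CNN, and matching a new sign to the nearest stored embedding by Euclidean distance is an assumption, since the caption does not state how stored vectors are compared.

```python
import math


class SignMemory:
    """Toy stand-in for the robot's Long Term Memory (hypothetical)."""

    def __init__(self):
        self.embeddings = {}    # sign name -> CNN output vector
        self.trajectories = {}  # sign name -> joint trajectory for regeneration

    def learn(self, name, embedding, trajectory):
        # One demonstration is enough: store both representations under the name.
        self.embeddings[name] = list(embedding)
        self.trajectories[name] = list(trajectory)

    def recognize(self, embedding):
        # Assumed matching rule: the stored sign with the closest embedding.
        return min(
            self.embeddings,
            key=lambda name: math.dist(self.embeddings[name], embedding),
        )

    def regenerate(self, name):
        # Retrieve the trajectory stored during learning.
        return self.trajectories[name]


memory = SignMemory()
memory.learn("orange", [1.0, 0.0], [[0.0, 0.1], [0.2, 0.3]])
memory.learn("apple", [0.0, 1.0], [[0.5, 0.5]])
recognized = memory.recognize([0.9, 0.2])
```

In this sketch adding a new sign only adds one entry to the memory and leaves the network untouched, which is the sense in which a one-shot scheme keeps the vocabulary extensible.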
Fig. 4 Iranian signs selected for the HRI [3]
Fig. 5 State Image. (a) The structure of the State Images. (b) State Image sample of the sign "Orange" [54]. In the "Orange" sign the handshape and palm direction are fixed, so the last 9 rows of the image do not change significantly over time. The thumb and index finger are closed, so the 7th and 8th rows (from the top) are slightly darker than the rows beneath them. The hand moves in a circle in the x–y plane, so the sinusoidal movement is clear in the 1st and 2nd rows, while the 3rd row (associated with the z-axis) does not change.
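As a rough illustration of the State Image idea in Fig. 5, the sketch below stacks glove channels as rows and time steps as columns, then scales each reading to a gray level. The function name `to_state_image`, the per-channel calibration ranges, the 0 to 255 scaling, and the four-channel toy signal are all assumptions made for this example; the actual channel layout and normalization are not given in the caption.

```python
import math


def to_state_image(samples, ranges):
    """Build a rows-by-time gray-level image from glove samples.

    samples: list of time steps, each a list of channel readings.
    ranges:  per-channel (low, high) calibration bounds (hypothetical values).
    Returns a list of rows, one per channel, with values in 0..255.
    """
    image = []
    for channel, (low, high) in enumerate(ranges):
        row = []
        for sample in samples:
            fraction = (sample[channel] - low) / (high - low)
            fraction = min(1.0, max(0.0, fraction))  # clip out-of-range readings
            row.append(round(fraction * 255))
        image.append(row)
    return image


# Toy signal: a circle in the x-y plane, fixed z, and one nearly closed finger.
steps = 8
samples = [
    [math.cos(2 * math.pi * t / steps), math.sin(2 * math.pi * t / steps), 0.0, 0.1]
    for t in range(steps)
]
ranges = [(-1, 1), (-1, 1), (-1, 1), (0, 1)]
image = to_state_image(samples, ranges)
```

With this toy input the first two rows vary sinusoidally, the z row stays constant, and the finger row stays dark, mirroring the pattern the caption describes for the "Orange" sign.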
Fig. 6 The chosen CNN architecture [55]
Summary of the recognition module's accuracy (percentage), reported as the mean ± standard deviation over all trials (for more details, see Table 2 in the Appendix)
| Data | 4-way | 8-way | 12-way | 16-way |
|---|---|---|---|---|
| Test | 70.83 ± 8.02 | 57.81 ± 9.16 | 50.54 ± 9.69 | 45.89 ± 9.84 |
| Train | 71.66 ± 2.06 | 58.25 ± 2.33 | 51.55 ± 2.93 | 46.75 ± 2.88 |
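To make the N-way figures in the table easier to interpret, the following sketch shows one common way an N-way one-shot accuracy and its mean ± standard deviation could be computed. It is not the evaluation protocol of the study, which is not described here: the nearest-embedding decision rule, the function names, the trial sampling, and the use of the sample standard deviation are all assumptions. For reference, uniform random guessing in an N-way trial gives 100/N percent, that is 25% for 4-way and 6.25% for 16-way.

```python
import math
import random
import statistics


def n_way_trial(data, n, rng):
    """Run one N-way one-shot trial and return True if the query is recognized.

    data: dict mapping class name -> list of embedding vectors (2+ per class).
    """
    classes = rng.sample(sorted(data), n)
    target = classes[0]
    # Two different samples of the target: one query, one stored example.
    query, target_support = rng.sample(data[target], 2)
    support = {target: target_support}
    for name in classes[1:]:
        support[name] = rng.choice(data[name])
    # Assumed decision rule: nearest stored embedding wins.
    predicted = min(support, key=lambda name: math.dist(support[name], query))
    return predicted == target


def n_way_accuracy(data, n, trials, rng):
    """Percentage of correct trials."""
    hits = sum(n_way_trial(data, n, rng) for _ in range(trials))
    return 100.0 * hits / trials


def summarize(accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Synthetic, well-separated embeddings (illustrative only, not the study's data).
toy_data = {
    f"sign{i}": [[10.0 * i, 0.0], [10.0 * i + 0.1, 0.0], [10.0 * i - 0.1, 0.0]]
    for i in range(4)
}
rng = random.Random(0)
toy_accuracy = n_way_accuracy(toy_data, n=4, trials=50, rng=rng)
```

Calling `summarize` on the accuracies of several such runs would yield a "mean ± standard deviation" pair in the same form as the table entries.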
Recognition module's accuracy (percentage) in detail for all the trials