Can a robot laugh with you?: Shared laughter generation for empathetic spoken dialogue
Koji Inoue, Divesh Lala, Tatsuya Kawahara
Abstract
Spoken dialogue systems must be able to express empathy to achieve natural interaction with human users. However, laughter generation requires a high level of dialogue understanding, so implementing laughter in existing systems, such as conversational robots, has been challenging. As a first step toward solving this problem, rather than generating laughter from the user's dialogue content, we focus on "shared laughter," where the user laughs first, with either a solo or speech laugh (the initial laugh), and the system laughs in turn (the response laugh). The proposed system consists of three models: 1) initial laugh detection, 2) shared laughter prediction, and 3) laugh type selection. We trained each model on a human-robot speed dating dialogue corpus. For the first model, a recurrent neural network was applied, and detection reached an F1 score of 82.6%. The second model used acoustic and prosodic features of the initial laugh and predicted shared laughter with accuracy above the random baseline. The third model selects the type of the system's response laugh, social or mirthful, based on the same features of the initial laugh. We then implemented the full shared laughter generation system in an attentive listening dialogue system and conducted a dialogue listening experiment. The proposed system improved impressions of the dialogue system, such as perceived empathy, compared with a naive baseline without laughter and a reactive system that always responded with only social laughs. We propose that our system can be used for situated robot interaction and emphasize the need to integrate appropriate empathetic laughs into conversational robots and agents.
Keywords: android robot; empathy; laughter generation; laughter type; shared laughter; spoken dialogue system
Year: 2022 PMID: 36185977 PMCID: PMC9522467 DOI: 10.3389/frobt.2022.933261
Source DB: PubMed Journal: Front Robot AI ISSN: 2296-9144
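The abstract describes a three-stage decision pipeline: detect an initial user laugh, decide whether to share it, and choose the response laugh type. The following is a minimal sketch of that control flow only; the stage functions below are stand-in threshold stubs, not the paper's trained models, and all names, features, and thresholds are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LaughFeatures:
    """Acoustic/prosodic features of the user's initial laugh (assumed set)."""
    duration: float  # laugh duration in seconds
    mean_f0: float   # mean fundamental frequency in Hz
    energy: float    # mean frame energy

def detect_initial_laugh(frame_probs, threshold=0.5):
    """Stage 1 stub: the paper uses an RNN detector over audio frames;
    here we simply threshold per-frame laugh probabilities."""
    return any(p > threshold for p in frame_probs)

def predict_shared_laughter(feats: LaughFeatures) -> bool:
    """Stage 2 stub: the paper trains a classifier on acoustic and prosodic
    features; this toy rule shares longer, louder laughs."""
    return feats.duration > 0.5 and feats.energy > 0.1

def select_laugh_type(feats: LaughFeatures) -> str:
    """Stage 3 stub: a toy rule; the real model is learned from data."""
    return "mirthful" if feats.mean_f0 > 250 else "social"

def respond(frame_probs, feats: LaughFeatures) -> Optional[str]:
    """Run the full pipeline: return the response laugh type, or None."""
    if not detect_initial_laugh(frame_probs):
        return None                      # no initial laugh detected
    if not predict_shared_laughter(feats):
        return None                      # laugh detected, but stay silent
    return select_laugh_type(feats)      # "social" or "mirthful"

print(respond([0.1, 0.8, 0.9], LaughFeatures(duration=1.2, mean_f0=280, energy=0.3)))
# -> mirthful
```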
FIGURE 1. Example of shared laughter by spoken dialogue systems.
FIGURE 2. Snapshot of dialogue recording.
FIGURE 3. Diagram of annotated labels (number of extracted samples).
Distribution of annotated samples by laughter type (Social: number of annotators who labeled the sample as social; Mirthful: number of annotators who labeled the sample as mirthful; #Sample: number of samples with that vote split).
| Social | Mirthful | #Sample |
|---|---|---|
| 5 | 0 | 86 |
| 4 | 1 | 60 |
| 3 | 2 | 40 |
| 2 | 3 | 22 |
| 1 | 4 | 30 |
| 0 | 5 | 30 |
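The aggregation rule is not stated here, but a 3-of-5 majority vote over the annotations above yields 186 social and 82 mirthful samples out of 268, i.e., a 69.4%/30.6% split that matches the Random baseline in the laughter type selection table below. A small sketch of that arithmetic:

```python
# (Social votes, Mirthful votes, #Sample) rows copied from the table above.
rows = [(5, 0, 86), (4, 1, 60), (3, 2, 40), (2, 3, 22), (1, 4, 30), (0, 5, 30)]

social = sum(n for s, m, n in rows if s >= 3)     # 86 + 60 + 40 = 186
mirthful = sum(n for s, m, n in rows if m >= 3)   # 22 + 30 + 30 = 82
total = social + mirthful                         # 268

print(f"social={social} ({social/total:.1%}), "
      f"mirthful={mirthful} ({mirthful/total:.1%})")
# social=186 (69.4%), mirthful=82 (30.6%)
```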
FIGURE 4. Architecture of the proposed system.
FIGURE 5. Recurrent neural network for laughter detection (FC: fully connected layer).
Laughter detection performance (%).
| Model | Precision | Recall | F1 score |
|---|---|---|---|
| BiGRU | 82.4 | 75.8 | 79.0 |
| + SpecAugment | 85.3 | 78.7 | 81.8 |
| + Up-sampling | 78.2 | 87.6 | 82.6 |
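FIGURE 5 names the detector's components: a bidirectional GRU over acoustic frames followed by a fully connected (FC) layer. Below is a minimal PyTorch sketch of that shape; the input feature type and all layer sizes are assumptions, and the SpecAugment and up-sampling rows above are training-time measures not shown here.

```python
import torch
import torch.nn as nn

class LaughterDetector(nn.Module):
    def __init__(self, n_feats: int = 80, hidden: int = 128):
        super().__init__()
        # Bidirectional GRU reads the frame sequence in both directions.
        self.bigru = nn.GRU(n_feats, hidden, batch_first=True, bidirectional=True)
        # FC layer maps each frame's hidden state to a laugh/non-laugh logit.
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_feats), e.g., log-mel frames (assumed input).
        h, _ = self.bigru(x)           # (batch, time, 2 * hidden)
        return self.fc(h).squeeze(-1)  # per-frame logits: (batch, time)

# Example: per-frame laugh probabilities for a 2 s clip at 100 frames/s.
model = LaughterDetector()
probs = torch.sigmoid(model(torch.randn(1, 200, 80)))
```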
Shared laughter prediction performance (%).
| Feature | Precision | Recall | F1 score |
|---|---|---|---|
| Random | 16.2 | 16.2 | 16.2 |
| Acoustic (A) | 20.3 | 52.2 | 29.2 |
| Prosodic (P) | 17.8 | 46.3 | 25.7 |
| A + P | 21.2 | 53.4 | 30.3 |
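The abstract does not name the stage-2 classifier, so the scikit-learn sketch below, using logistic regression on synthetic features, is purely illustrative. The synthetic labels only mirror the roughly 16.2% positive rate implied by the Random row; a balanced class weight is one plausible way to obtain the high-recall, low-precision pattern visible in the table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # stand-in acoustic (A) + prosodic (P) features
y = rng.random(500) < 0.162     # ~16.2% positives, as in the Random row

# With rare positives, a balanced class weight trades precision for recall.
clf = LogisticRegression(class_weight="balanced").fit(X[:400], y[:400])

p, r, f1, _ = precision_recall_fscore_support(
    y[400:], clf.predict(X[400:]), average="binary", zero_division=0)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```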
Laughter type selection performance (%).
| Feature | Mirthful Prec. | Mirthful Rec. | Mirthful F1 | Social Prec. | Social Rec. | Social F1 | Macro F1 |
|---|---|---|---|---|---|---|---|
| Random | 30.6 | 30.6 | 30.6 | 69.4 | 69.4 | 69.4 | 50.0 |
| Acoustic (A) | 48.5 | 61.0 | 54.1 | 80.6 | 71.5 | 75.8 | 64.9 |
| Prosodic (P) | 54.9 | 68.2 | 60.8 | 84.3 | 75.2 | 79.5 | 70.2 |
| A + P | 52.6 | 61.0 | 56.5 | 81.5 | 75.8 | 78.6 | 67.5 |
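Note that the Random row's 30.6/69.4 split matches the class prior recoverable from the annotation table above (82 of 268 majority-mirthful samples). As with stage 2, the classifier is not named in the abstract; the SVM in this sketch is an assumption, and it uses only prosodic features since the table shows them performing best on their own.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X_pros = rng.normal(size=(268, 6))  # stand-in prosodic features (dim. assumed)
y_mirth = rng.random(268) < 0.306   # ~30.6% mirthful, matching the class prior

clf = SVC(class_weight="balanced").fit(X_pros[:200], y_mirth[:200])
macro_f1 = f1_score(y_mirth[200:], clf.predict(X_pros[200:]),
                    average="macro", zero_division=0)
print(f"macro F1 = {macro_f1:.3f}")
```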
Dialogue scenarios used in the subjective evaluation (#User laughs: number of user laughs in the scenario; #Shared laughter: number of system response laughs generated, broken down by type).
| Scenario | Speaker (user) | Length | #User laughs | #Shared laughter (#mirthful/#social) | #Evaluators |
|---|---|---|---|---|---|
| 1 | A | 2 min 26 s | 6 | 3 (0/3) | 41 |
| 2 | B | 1 min 21 s | 3 | 2 (2/0) | 31 |
| 3 | A | 2 min 2 s | 5 | 2 (2/0) | 30 |
| 4 | A | 3 min 17 s | 12 | 7 (4/3) | 30 |
Mean evaluation scores in the subjective experiment (Emp: empathy; Nat: naturalness; Hum: human likeness; Und: understanding), for the proposed system (Prop.), the no-laugh baseline 1 (NoL), and the reactive baseline 2 (React.).
| Scenario | Prop. Emp | Prop. Nat | Prop. Hum | Prop. Und | NoL Emp | NoL Nat | NoL Hum | NoL Und | React. Emp | React. Nat | React. Hum | React. Und |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.54 | 3.88 | 4.12 | 4.24 | 4.39 | 3.98 | 3.83 | 4.20 | 4.00 | 3.20 | 3.37 | 3.66 |
| 2 | 4.39 | 3.48 | 3.87 | 4.10 | 3.97 | 3.42 | 3.81 | 3.58 | 4.29 | 3.68 | 4.23 | 4.13 |
| 3 | 4.30 | 3.77 | 4.07 | 4.30 | 4.10 | 3.87 | 3.97 | 3.90 | 4.93 | 4.03 | 4.60 | 4.57 |
| 4 | 5.23 | 4.97 | 5.47 | 5.00 | 4.70 | 4.30 | 4.43 | 3.93 | 5.10 | 4.63 | 4.73 | 4.83 |
| Mean | 4.61 | 4.01 | 4.36 | 4.39 | 4.30 | 3.89 | 3.99 | 3.92 | 4.53 | 3.83 | 4.16 | 4.24 |