Shamima Najnin, Bonny Banerjee.
Abstract
Cross-situational learning and social pragmatic theories are prominent mechanisms for learning word meanings (i.e., word-object pairs). In this paper, the role of reinforcement is investigated for early word-learning by an artificial agent. When exposed to a group of speakers, the agent comes to understand an initial set of vocabulary items belonging to the language used by the group. Both cross-situational learning and social pragmatic theory are taken into account. As social cues, joint attention and prosodic cues in the caregiver's speech are considered. During agent-caregiver interaction, the agent selects a word from the caregiver's utterance and learns the relations between that word and the objects in its visual environment. The "novel words to novel objects" language-specific constraint is assumed for computing rewards. The models are learned by maximizing the expected reward using reinforcement learning algorithms (table-based algorithms: Q-learning, SARSA, and SARSA-λ; neural network-based algorithms: Q-learning for neural network (Q-NN), neural-fitted Q-network (NFQ), and deep Q-network (DQN)). Neural network-based reinforcement learning models are chosen over table-based models for better generalization and quicker convergence. Simulations are carried out using a mother-infant interaction dataset from CHILDES for learning word-object pairings. Reinforcement is modeled in two cross-situational learning cases: (1) with joint attention (Attentional models), and (2) with joint attention and prosodic cues (Attentional-prosodic models). Attentional-prosodic models outperform Attentional models on the word-learning task, and the Attentional-prosodic DQN outperforms existing word-learning models for the same task.
Keywords: Q-learning; cross-situational learning; deep reinforcement learning; joint attention; neural network; prosodic cue
Year: 2018 PMID: 29441027 PMCID: PMC5797660 DOI: 10.3389/fpsyg.2018.00005
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Figure 1. A taxonomy tree of existing word-learning models, including the models proposed in this paper.
Figure 2. Agent-environment interaction in reinforcement learning. The environment state is perceived by the agent as s through some process f, and R represents the process of reward computation.
Figure 3. For each iteration, reward is computed and plotted for Q-learning, SARSA, and SARSA-λ, respectively.
Figure 4. For Q-NN, NFQ, and DQN: (A) the maximum rewards obtained for different numbers of hidden units, plotted to find the optimal structure of the Q-network; (B) the reward computed and plotted at each iteration with the optimal number of hidden units (200).
Figure 5. Confusion matrices for (A) Attentional Q-learning, (B) Attentional SARSA, (C) Attentional SARSA-λ, (D) Attentional Q-NN, (E) Attentional NFQ, and (F) Attentional DQN, respectively.
Gold-standard words incorrectly associated with gold objects using Q-learning, SARSA, SARSA-λ, Q-NN, NFQ, and DQN.

| Q-learning | SARSA | SARSA-λ | Q-NN | NFQ | DQN |
| --- | --- | --- | --- | --- | --- |
| “bunny” | “bigbirds” | “moocows” | “bird” | “bird” | “bird” |
| “cows” | “bunnyrabbit” | “duck” | | | |
| “moocows” | “cows” | “duckie” | | | |
| “duck” | “moocows” | “kittycats” | | | |
| “duckie” | “duckie” | “lambie” | | | |
| “kitty” | “kitty” | “bird” | | | |
| “mirror” | “kittycats” | “hiphop” | | | |
| “piggies” | “lamb” | | | | |
| “ring” | “lambie” | | | | |
| “bunnies” | “bunnies” | | | | |
| “bird” | “bird” | | | | |
| “hiphop” | “oink” | | | | |
| “meow” | | | | | |
| “oink” | | | | | |
Figure 6. ROC for Q-learning, SARSA, SARSA-λ, Q-NN, NFQ, and DQN, respectively.
F-score, precision, and recall values using the learned lexicon from Attentional reinforcement learning models.

| Model | F-score | Precision | Recall |
| --- | --- | --- | --- |
| Attentional Q-learning | 0.5667 | 0.7391 | 0.4595 |
| Attentional SARSA | 0.5205 | 0.5278 | 0.5135 |
| Attentional SARSA-λ | 0.5763 | 0.7727 | 0.4595 |
| Attentional Q-NN | 0.6667 | | 0.5135 |
| Attentional NFQ | 0.7097 | 0.88 | 0.5946 |
| Attentional DQN | 0.7213 | 0.9167 | 0.5946 |
The bold values in each column indicate the highest score across the algorithms.
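For reference, the F-score is the harmonic mean of precision and recall, F = 2PR/(P + R). The following snippet is only a sanity check of the first row above, not code from the paper:

```python
# F-score as the harmonic mean of precision and recall (standard definition).
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Attentional Q-learning row: precision 0.7391, recall 0.4595
print(round(f_score(0.7391, 0.4595), 4))  # -> 0.5667, matching the table
```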
Figure 7. Confusion matrices for (A) Attentional-prosodic Q-NN, (B) Attentional-prosodic NFQ, and (C) Attentional-prosodic DQN, respectively.
Figure 8. ROC for Attentional Q-NN, NFQ, and DQN; Attentional-prosodic Q-NN, NFQ, and DQN; and the existing Beagle, Beagle+PMI, COOC, and Bayesian CSL models.
Comparison of the proposed Attentional and Attentional-prosodic reinforcement learning models with existing models.

| Model | F-score | Precision | Recall |
| --- | --- | --- | --- |
| Attentional Q-NN | 0.6667 | | 0.5135 |
| Attentional NFQ | 0.7097 | 0.88 | 0.5946 |
| Attentional DQN | 0.7213 | 0.9167 | 0.5946 |
| Attentional-prosodic Q-NN | 0.77419 | | 0.6486 |
| Attentional-prosodic NFQ | 0.78378 | 0.7838 | 0.7838 |
| Attentional-prosodic DQN | 0.8421 | | |
| COOC | 0.53 | 0.7578 | 0.4012 |
| Bayesian CSL model | 0.54 | 0.64 | 0.47 |
| Beagle model | 0.55 | 0.58 | 0.525 |
| Beagle+prosodic cue model | 0.6629 | 0.71 | 0.525 |
| Beagle+PMI model | 0.83 | 0.86 | 0.81 |
| MSG model | 0.64 | NA | NA |
| Attentive MSG | 0.7 | NA | NA |
| AttentiveSocial MSG | 0.73 | NA | NA |
The bold values in each column indicate the highest score across the algorithms.
Learned best lexicon (word-object pairs) using Attentional-prosodic DQN.

| Word | Object | Word | Object | Word | Object | Word | Object |
| --- | --- | --- | --- | --- | --- | --- | --- |
| “ahhah” | eyes | “bunnies” | bunny | “hiphop” | bunny | “pig” | pig |
| “ahhah” | rattle | “bunny” | bunny | “david” | mirror | “piggie” | pig |
| “baby” | baby | “bunnyrabbit” | bunny | “kitty” | kitty | “piggies” | pig |
| “bear” | bear | “cow” | cow | “kittycat” | kitty | “rattle” | rattle |
| “big” | bunny | “duck” | duck | “kittycats” | kitty | “ring” | ring |
| “bigbird” | bird | “duckie” | duck | “lamb” | lamb | “rings” | ring |
| “bird” | bird | “eyes” | eyes | “lambie” | lamb | “sheep” | sheep |
| “birdie” | duck | “hand” | hand | “meow” | kitty | “through” | bunny |
| “book” | book | “hat” | hat | “mirror” | mirror | | |
| “books” | book | “he” | duck | “moocow” | cow | | |
Q-learning Algorithm for Word-Learning
[Algorithm box; only the outline is recoverable: inputs are the current state s, the current action a, and the next state s′; the action-value function and the Q-matrix for word-object pairs are initialized; at each iteration the agent observes the current state, chooses an action, observes the next state, and receives a reward of r = 100 when the selected word-object association is accepted and r = −1 otherwise, after which the Q-values are updated.]
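As a companion to the box above, here is a minimal, self-contained sketch of tabular Q-learning on a toy word-object task. The reward values (r = 100 for the gold pairing, r = −1 otherwise) follow the recoverable fragments; the toy vocabulary, gold pairing, hyperparameters, and episode structure are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of tabular Q-learning for word-object association.
import random

import numpy as np

WORDS = ["bunny", "duck", "kitty", "pig"]            # states: words heard
OBJECTS = ["bunny", "duck", "kitty", "pig"]          # actions: objects attended to
TRUE_PAIR = dict(zip(WORDS, OBJECTS))                # hypothetical gold pairing

alpha, gamma, epsilon = 0.1, 0.9, 0.2                # assumed hyperparameters
Q = np.zeros((len(WORDS), len(OBJECTS)))             # Q-matrix over word-object pairs

def reward(word, obj):
    """r = 100 for the gold word-object pair, r = -1 otherwise."""
    return 100.0 if TRUE_PAIR[word] == obj else -1.0

for episode in range(2000):
    s = random.randrange(len(WORDS))                 # observe current state (a word)
    # epsilon-greedy action selection over objects
    a = random.randrange(len(OBJECTS)) if random.random() < epsilon else int(Q[s].argmax())
    r = reward(WORDS[s], OBJECTS[a])
    s_next = random.randrange(len(WORDS))            # next word arrives independently (simplification)
    # Q-learning update: bootstrap from the greedy value of the next state
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# learned lexicon: the object with the highest Q-value for each word
print({WORDS[s]: OBJECTS[int(Q[s].argmax())] for s in range(len(WORDS))})
```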
SARSA Algorithm for Word-Learning
[Algorithm box; only the outline is recoverable: the states, the action-value function, and the Q-matrix for word-object pairs are initialized; at each iteration the agent observes the current state, chooses an action, observes the next state, receives a reward of r = 100 or r = −1, and updates the Q-values with the on-policy (SARSA) rule.]
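A corresponding sketch of on-policy SARSA under the same toy assumptions as the Q-learning sketch; the only substantive difference is that the update bootstraps from the action actually selected in the next state rather than the greedy one.

```python
# Minimal sketch of SARSA (on-policy) for the word-object task.
import random

import numpy as np

WORDS = ["bunny", "duck", "kitty", "pig"]
OBJECTS = ["bunny", "duck", "kitty", "pig"]
TRUE_PAIR = dict(zip(WORDS, OBJECTS))

alpha, gamma, epsilon = 0.1, 0.9, 0.2                # assumed hyperparameters
Q = np.zeros((len(WORDS), len(OBJECTS)))

def reward(word, obj):
    return 100.0 if TRUE_PAIR[word] == obj else -1.0

def policy(s):
    """Epsilon-greedy object choice for word index s."""
    if random.random() < epsilon:
        return random.randrange(len(OBJECTS))
    return int(Q[s].argmax())

s = random.randrange(len(WORDS))
a = policy(s)
for step in range(5000):
    r = reward(WORDS[s], OBJECTS[a])
    s_next = random.randrange(len(WORDS))            # next word arrives independently (simplification)
    a_next = policy(s_next)                          # SARSA: pick the next action first
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    s, a = s_next, a_next

print({WORDS[i]: OBJECTS[int(Q[i].argmax())] for i in range(len(WORDS))})
```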
SARSA-λ Algorithm for Word-Learning
[Algorithm box; only the outline is recoverable: the states, the action-value function, and the Q-matrix for word-object pairs are initialized; at each iteration the agent observes the current state, chooses an action, observes the next state, receives a reward of r = 100 or r = −1, computes the TD error δ, and updates the Q-values together with eligibility traces.]
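A sketch of SARSA(λ) with accumulating eligibility traces under the same toy assumptions; the trace-decay parameter λ = 0.8 and the remaining hyperparameters are illustrative, not taken from the paper.

```python
# Minimal sketch of SARSA(lambda) with accumulating eligibility traces.
import random

import numpy as np

WORDS = ["bunny", "duck", "kitty", "pig"]
OBJECTS = ["bunny", "duck", "kitty", "pig"]
TRUE_PAIR = dict(zip(WORDS, OBJECTS))

alpha, gamma, lam, epsilon = 0.1, 0.9, 0.8, 0.2      # assumed hyperparameters
Q = np.zeros((len(WORDS), len(OBJECTS)))
E = np.zeros_like(Q)                                 # eligibility traces

def reward(word, obj):
    return 100.0 if TRUE_PAIR[word] == obj else -1.0

def policy(s):
    if random.random() < epsilon:
        return random.randrange(len(OBJECTS))
    return int(Q[s].argmax())

s = random.randrange(len(WORDS))
a = policy(s)
for step in range(5000):
    r = reward(WORDS[s], OBJECTS[a])
    s_next = random.randrange(len(WORDS))
    a_next = policy(s_next)
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error (the delta in the box)
    E[s, a] += 1.0                                   # accumulate trace for the visited pair
    Q += alpha * delta * E                           # update all traced word-object pairs
    E *= gamma * lam                                 # decay traces
    s, a = s_next, a_next

print({WORDS[i]: OBJECTS[int(Q[i].argmax())] for i in range(len(WORDS))})
```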
Q-learning with Neural Network for Word-Learning
[Algorithm box; only the outline is recoverable: the Q-matrix for word-object pairs and the action-value network are initialized with learning rate α = 0.001, discount rate γ = 0.99, and ϵ = 0.99 for ϵ-greedy exploration; at each iteration the agent observes the current state, computes the network's Q-values, chooses an action, observes the next state, computes the reward, sets the target Q-value, performs a gradient descent step toward that target, and decreases ϵ linearly.]
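A sketch of online Q-learning with a small neural network (Q-NN), using PyTorch for the function approximator. The values α = 0.001, γ = 0.99, and the linearly decayed ϵ starting at 0.99 follow the box above; the network size, optimizer, toy vocabulary, and decay floor are illustrative assumptions.

```python
# Minimal sketch of Q-learning with a neural network (Q-NN).
import random

import torch
import torch.nn as nn

WORDS = ["bunny", "duck", "kitty", "pig"]
OBJECTS = ["bunny", "duck", "kitty", "pig"]
TRUE_PAIR = dict(zip(WORDS, OBJECTS))

def one_hot(i, n=len(WORDS)):
    x = torch.zeros(n)
    x[i] = 1.0
    return x

def reward(word, obj):
    return 100.0 if TRUE_PAIR[word] == obj else -1.0

# Small MLP mapping a one-hot word to Q-values over objects (size is an assumption).
qnet = nn.Sequential(nn.Linear(len(WORDS), 32), nn.ReLU(), nn.Linear(32, len(OBJECTS)))
opt = torch.optim.Adam(qnet.parameters(), lr=0.001)   # alpha = 0.001
gamma, eps, steps = 0.99, 0.99, 3000

for t in range(steps):
    s = random.randrange(len(WORDS))
    with torch.no_grad():
        q_s = qnet(one_hot(s))
    a = random.randrange(len(OBJECTS)) if random.random() < eps else int(q_s.argmax())
    r = reward(WORDS[s], OBJECTS[a])
    s_next = random.randrange(len(WORDS))
    with torch.no_grad():                             # TD target uses the next-state maximum
        target = r + gamma * qnet(one_hot(s_next)).max()
    loss = (qnet(one_hot(s))[a] - target) ** 2        # squared TD error
    opt.zero_grad()
    loss.backward()
    opt.step()
    eps = max(0.05, eps - (0.99 - 0.05) / steps)      # decrease epsilon linearly

with torch.no_grad():
    print({w: OBJECTS[int(qnet(one_hot(i)).argmax())] for i, w in enumerate(WORDS)})
```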
NFQ Algorithm for Word-Learning
[Algorithm box; only the outline is recoverable: the Q-matrix for word-object pairs, a set of transition samples, and the action-value network are initialized with learning rate α = 0.001, discount rate γ = 0.99, and ϵ = 0.99 for ϵ-greedy exploration; at each iteration the agent observes the current state, computes Q-values, chooses an action, observes the next state, computes the reward, stores the transition, sets the target Q-value for the current state, performs a gradient descent step, and decreases ϵ linearly.]
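A sketch of neural-fitted Q iteration (NFQ): transitions are first collected into a batch, and the Q-network is then repeatedly refit on targets computed over the whole batch. The values α = 0.001 and γ = 0.99 follow the box; the batch size, number of fitting iterations, optimizer, and toy data are illustrative assumptions.

```python
# Minimal sketch of neural-fitted Q iteration (NFQ) for the word-object task.
import random

import torch
import torch.nn as nn

WORDS = ["bunny", "duck", "kitty", "pig"]
OBJECTS = ["bunny", "duck", "kitty", "pig"]
TRUE_PAIR = dict(zip(WORDS, OBJECTS))

def one_hot(i, n=len(WORDS)):
    x = torch.zeros(n)
    x[i] = 1.0
    return x

def reward(word, obj):
    return 100.0 if TRUE_PAIR[word] == obj else -1.0

qnet = nn.Sequential(nn.Linear(len(WORDS), 32), nn.ReLU(), nn.Linear(32, len(OBJECTS)))
opt = torch.optim.Adam(qnet.parameters(), lr=0.001)   # alpha = 0.001
gamma = 0.99

# 1) Collect a set of transition samples (s, a, r, s') with a random policy.
transitions = []
for _ in range(500):
    s = random.randrange(len(WORDS))
    a = random.randrange(len(OBJECTS))
    r = reward(WORDS[s], OBJECTS[a])
    s_next = random.randrange(len(WORDS))
    transitions.append((s, a, r, s_next))

# 2) Fitted Q iterations: recompute targets over the whole set, then refit the network.
for it in range(50):
    S = torch.stack([one_hot(s) for s, _, _, _ in transitions])
    S_next = torch.stack([one_hot(sn) for _, _, _, sn in transitions])
    A = torch.tensor([a for _, a, _, _ in transitions])
    R = torch.tensor([r for _, _, r, _ in transitions])
    with torch.no_grad():
        targets = R + gamma * qnet(S_next).max(dim=1).values
    for epoch in range(20):                           # refit Q-network on the fixed targets
        pred = qnet(S).gather(1, A.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

with torch.no_grad():
    print({w: OBJECTS[int(qnet(one_hot(i)).argmax())] for i, w in enumerate(WORDS)})
```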
Double Deep Q-Learning for Optimal Control
[Algorithm box; only the outline is recoverable: the experience replay memory, the action-value network, and the target action-value network are initialized with learning rate α = 0.001, discount rate γ = 0.99, and ϵ = 0.99 for ϵ-greedy exploration; at each iteration the agent observes the current state, computes Q-values, selects a random action with probability ϵ and the greedy action otherwise, observes the next state, computes the reward, stores the transition in the replay memory, samples a minibatch of transitions, sets the targets, performs a gradient descent step, periodically refreshes the target network, and decreases ϵ linearly.]
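A sketch of (double) deep Q-learning with experience replay and a target network, mirroring the structure of the box above: store transitions, sample minibatches, compute targets with a periodically refreshed target network, and decay ϵ linearly. The network size, replay capacity, minibatch size, and synchronization interval are illustrative assumptions.

```python
# Minimal sketch of double deep Q-learning (DQN) with experience replay.
import copy
import random

import torch
import torch.nn as nn

WORDS = ["bunny", "duck", "kitty", "pig"]
OBJECTS = ["bunny", "duck", "kitty", "pig"]
TRUE_PAIR = dict(zip(WORDS, OBJECTS))

def one_hot(i, n=len(WORDS)):
    x = torch.zeros(n)
    x[i] = 1.0
    return x

def reward(word, obj):
    return 100.0 if TRUE_PAIR[word] == obj else -1.0

qnet = nn.Sequential(nn.Linear(len(WORDS), 32), nn.ReLU(), nn.Linear(32, len(OBJECTS)))
target_net = copy.deepcopy(qnet)                      # target action-value network
opt = torch.optim.Adam(qnet.parameters(), lr=0.001)   # alpha = 0.001
gamma, eps, steps = 0.99, 0.99, 3000
memory, batch_size, sync_every = [], 32, 100          # replay settings (assumptions)

for t in range(steps):
    s = random.randrange(len(WORDS))
    with torch.no_grad():
        q_s = qnet(one_hot(s))
    a = random.randrange(len(OBJECTS)) if random.random() < eps else int(q_s.argmax())
    r = reward(WORDS[s], OBJECTS[a])
    s_next = random.randrange(len(WORDS))
    memory.append((s, a, r, s_next))                  # store transition in replay memory
    memory = memory[-5000:]                           # bounded replay capacity

    if len(memory) >= batch_size:
        batch = random.sample(memory, batch_size)     # sample a minibatch of transitions
        S = torch.stack([one_hot(b[0]) for b in batch])
        A = torch.tensor([b[1] for b in batch])
        R = torch.tensor([b[2] for b in batch])
        S_next = torch.stack([one_hot(b[3]) for b in batch])
        with torch.no_grad():
            # Double-Q target: the online net selects the action, the target net evaluates it.
            a_star = qnet(S_next).argmax(dim=1, keepdim=True)
            targets = R + gamma * target_net(S_next).gather(1, a_star).squeeze(1)
        pred = qnet(S).gather(1, A.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

    if t % sync_every == 0:
        target_net.load_state_dict(qnet.state_dict()) # refresh the target network
    eps = max(0.05, eps - (0.99 - 0.05) / steps)      # decrease epsilon linearly

with torch.no_grad():
    print({w: OBJECTS[int(qnet(one_hot(i)).argmax())] for i, w in enumerate(WORDS)})
```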