| Literature DB >> 25610457 |
Amin Mousavi, Babak Nadjar Araabi, Majid Nili Ahmadabadi.
Abstract
This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined here, means knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different state or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) onto which all the agents' MDPs can be mapped, a requirement formalized using the notion of MDP homomorphism. The learning framework is Q-learning. To transfer knowledge between these tasks, the feature space is used as a translator, expressed as a partial mapping between the state-action spaces of the different tasks. The Q-values learned while solving the source tasks are mapped to sets of Q-values for the target task; these transferred Q-values are merged and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task.
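The abstract describes mapping source-task Q-values through a partial state-action mapping and merging them with an interval-based rule before initializing the target task. The paper's exact merge rule is not given in this record; the following is a minimal sketch under the assumption that the interval is the min-max span of the transferred values and that its midpoint seeds the target Q-table. All names (`merge_transferred_q`, `sa_maps`, etc.) are illustrative, not the paper's.

```python
# Hypothetical sketch of interval-based Q-value transfer (not the paper's code).
def merge_transferred_q(q_sources, sa_maps, target_pairs):
    """Merge Q-values transferred from several source tasks into per-pair
    intervals, then initialize the target Q-table.

    q_sources    -- list of source Q-tables: {(state, action): value}
    sa_maps      -- list of partial mappings from target (state, action)
                    pairs to source pairs (absent keys = mapping undefined)
    target_pairs -- iterable of target (state, action) pairs
    """
    q_target = {}
    for pair in target_pairs:
        transferred = []
        for q_src, sa_map in zip(q_sources, sa_maps):
            src_pair = sa_map.get(pair)          # partial mapping: may be None
            if src_pair is not None and src_pair in q_src:
                transferred.append(q_src[src_pair])
        if transferred:
            lo, hi = min(transferred), max(transferred)   # interval merge
            q_target[pair] = (lo + hi) / 2.0     # one plausible initial value
        else:
            q_target[pair] = 0.0                 # no transferred knowledge
    return q_target
```

After this initialization, ordinary Q-learning proceeds on the target task; the benefit reported in the abstract comes entirely from the better starting point.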
Year: 2014 PMID: 25610457 PMCID: PMC4293791 DOI: 10.1155/2014/428567
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. A 10 × 10 grid as a farm with three crops and three harvesting robots.
Robot 1: Sensor modules: GPS, color & weight sensor. S_1 = {(x, y, k) | 1 ≤ x, y ≤ 10, k ∈ {RL, GL, GH, YH, 0}}, where x: column number, y: row number, R: Red, G: Green, Y: Yellow, L: Light, H: Heavy, 0: Nothing. A_1 = {N, S, E, W, 0, P, D}, where N: Move North, S: Move South, E: Move East, W: Move West, 0: Nothing, P: Pickup, D: Dropoff.
Robot 2: Sensor modules: GPS, compass, B&W camera. S_2 = {(x, y, d, c) | 1 ≤ x, y ≤ 10, d ∈ {N, S, E, W}, c ∈ {SG, T, BG, 0}}, where x, y are as for robot 1, d: direction, SG: Small Globe, T: Rod, BG: Big Globe, 0: Nothing. A_2 = {F, B, L, R, LF, RF, 0, P, D}, where F: Move Forward, B: Move Backward, L: Turn Left, R: Turn Right, LF: Turn Left & Move Forward, RF: Turn Right & Move Forward, 0: Nothing, P: Pickup, D: Dropoff.
Robot 3: Sensor modules: beams' signal distance indicators, compass, color & weight sensor. S_3 = {(b_1, b_2, d, k) | 1 ≤ b_1, b_2 ≤ 20, d ∈ {N, S, E, W}, k ∈ {RL, GL, GH, YH, 0}}, where b_i: 1-norm distance to beam i, d is as for robot 2, and k as for robot 1. A_3 = A_2.
The output of the sensor module for different kinds of crops.
| Crops | B&W camera | Color | Weight |
|---|---|---|---|
| Tomato | Small globe | Red | Light |
| Cucumber | Rod | Green | Light |
| Watermelon | Big globe | Green or yellow | Heavy |
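The table above shows how the same crop produces different readings on robot 2's B&W camera than on the color & weight sensor of robots 1 and 3. The crop identity is the shared feature that links the two sensor vocabularies, which is the role the abstract assigns to the feature space as a "translator." A minimal sketch of that idea, with illustrative names of my own choosing:

```python
# Illustrative only: the crop identity acts as the shared feature linking
# the two robots' otherwise incompatible sensor readings.
SHAPE_TO_CROP = {             # robot 2: B&W camera output
    "small_globe": "tomato",
    "rod": "cucumber",
    "big_globe": "watermelon",
}
COLOR_WEIGHT_TO_CROP = {      # robots 1 and 3: color & weight sensor output
    ("red", "light"): "tomato",
    ("green", "light"): "cucumber",
    ("green", "heavy"): "watermelon",
    ("yellow", "heavy"): "watermelon",
}

def same_crop(shape, color, weight):
    """True when the two sensor modules are observing the same crop,
    i.e. the two readings map to the same point in the feature space."""
    return SHAPE_TO_CROP.get(shape) == COLOR_WEIGHT_TO_CROP.get((color, weight))
```

Note that the mapping is many-to-one (green-heavy and yellow-heavy both mean watermelon), which is why the translation between state spaces is a partial mapping rather than a bijection.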
Figure 2. The process of context transfer between the source task and the target task, in which all mappings are known except the Q-values.
Figure 3. The comparison of average reward of learning for the four cases of transfer: without transfer, with transfer from robot 1, with transfer from robot 2, and with transfer from both robots.
Figure 4. The comparison of regret of learning for the four cases of transfer: without transfer, with transfer from robot 1, with transfer from robot 2, and with transfer from both robots.
Figure 5. Crossroad traffic controller.
Old system: Sensors: distance sensor. S_old = {(x, y, d) | 1 ≤ x, y ≤ 10, d ∈ {V, H}}, where x: distance to the first car in the vertical lane, y: distance to the first car in the horizontal lane, V: vertical lane is green, H: horizontal lane is green. A_old = {GV, GH, N}, where GV: change the vertical lane to green, GH: change the horizontal lane to green, N: no action.
New system: Sensors: camera. S_new = {(x, y, d) | 0 ≤ x, y ≤ 1023, d ∈ {V, H}}, where x: cars' existence coding in the first ten squares of the vertical lane, y: cars' existence coding in the first ten squares of the horizontal lane, V: vertical lane is green, H: horizontal lane is green. A_new = {C, N}, where C: change the light, N: no action.
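On my reading of the caption, the new system's x ∈ {0, …, 1023} is a 10-bit occupancy mask over the first ten squares of a lane, while the old system's x ∈ {1, …, 10} is the distance to the first car, so the old state is recoverable from the new one. A sketch under that assumption (not code from the paper; the fallback value for an empty lane is my guess):

```python
# Hypothetical encodings for the two controllers' lane states in Figure 5.
def encode_occupancy(squares):
    """Pack ten booleans (nearest square first) into an integer 0..1023,
    as the new system's camera-based state component."""
    code = 0
    for i, occupied in enumerate(squares):
        if occupied:
            code |= 1 << i
    return code

def distance_to_first_car(code):
    """Old-system reading derived from the new-system code: 1-based index
    of the nearest occupied square; assumed 10 when the lane is empty,
    since the old state is bounded by 1 <= x <= 10."""
    for i in range(10):
        if code & (1 << i):
            return i + 1
    return 10
```

This projection would give a partial mapping from the new state space onto the old one, which is what the transfer in Figures 6 and 7 relies on.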
Figure 6. The comparison of average reward of learning with and without transfer for the crossroad traffic controller.
Figure 7. The comparison of regret of learning with and without transfer for the crossroad traffic controller.