Mohammad Salimibeni, Arash Mohammadi, Parvin Malekzadeh, Konstantinos N. Plataniotis.
Abstract
Development of distributed Multi-Agent Reinforcement Learning (MARL) algorithms has attracted an increasing surge of interest lately. Generally speaking, conventional Model-Based (MB) or Model-Free (MF) RL algorithms are not directly applicable to the MARL problems due to utilization of a fixed reward model for learning the underlying value function. While Deep Neural Network (DNN)-based solutions perform well, they are still prone to overfitting, high sensitivity to parameter selection, and sample inefficiency. In this paper, an adaptive Kalman Filter (KF)-based framework is introduced as an efficient alternative to address the aforementioned problems by capitalizing on unique characteristics of KF such as uncertainty modeling and online second order learning. More specifically, the paper proposes the Multi-Agent Adaptive Kalman Temporal Difference (MAK-TD) framework and its Successor Representation-based variant, referred to as the MAK-SR. The proposed MAK-TD/SR frameworks consider the continuous nature of the action-space that is associated with high dimensional multi-agent environments and exploit Kalman Temporal Difference (KTD) to address the parameter uncertainty. The proposed MAK-TD/SR frameworks are evaluated via several experiments, which are implemented through the OpenAI Gym MARL benchmarks. In these experiments, different number of agents in cooperative, competitive, and mixed (cooperative-competitive) scenarios are utilized. The experimental results illustrate superior performance of the proposed MAK-TD/SR frameworks compared to their state-of-the-art counterparts.Entities:
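As a rough illustration of the Kalman Temporal Difference idea the abstract builds on, the sketch below treats linear value-function weights as the hidden state of a Kalman filter and the TD relation as a noisy scalar observation, which yields the uncertainty-aware, second-order update the abstract refers to. This is a minimal sketch under assumed names and hyper-parameters (the `LinearKTD` class, `process_noise`, `obs_noise`), not the paper's MAK-TD/SR equations.

```python
import numpy as np

class LinearKTD:
    """Minimal single-agent, linear KTD sketch (illustrative, not the paper's method)."""

    def __init__(self, feature_dim, gamma=0.95, process_noise=1e-4, obs_noise=1.0):
        self.gamma = gamma
        self.theta = np.zeros(feature_dim)            # value-function weights (filter state)
        self.P = np.eye(feature_dim)                  # weight-uncertainty covariance
        self.Q = process_noise * np.eye(feature_dim)  # process-noise covariance
        self.R = obs_noise                            # observation-noise variance

    def update(self, phi_s, phi_next, reward, done):
        # The TD relation r ~ (phi(s) - gamma * phi(s'))^T theta is treated as a
        # scalar observation of the hidden weights.
        h = phi_s - (0.0 if done else self.gamma) * phi_next
        self.P = self.P + self.Q                      # predict step (random-walk weights)
        innovation = reward - h @ self.theta          # TD error acts as the innovation
        s = h @ self.P @ h + self.R                   # innovation variance
        K = self.P @ h / s                            # Kalman gain
        self.theta = self.theta + K * innovation      # second-order weight update
        self.P = self.P - np.outer(K, h @ self.P)     # covariance (uncertainty) update
        return innovation
```

The MAK-SR variant additionally learns a Successor Representation and decomposes the value function through it; the same filter-style update can in principle be applied to the SR and reward weights, but that decomposition is not shown here.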
Keywords: Kalman Temporal Difference; Multi-Agent Reinforcement Learning; Multiple Model Adaptive Estimation; Successor Representation
Year: 2022 PMID: 35214293 PMCID: PMC8962978 DOI: 10.3390/s22041393
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Different multi-agent scenarios implemented within the OpenAI Gym: (a) Cooperation scenario, (b) Competition scenario, (c) Predator–Prey 2v1, and (d) Predator–Prey 1v2.
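For context on the benchmark scenarios in Figure 1, the snippet below sketches how a comparable Predator–Prey particle environment can be instantiated. It assumes a recent PettingZoo release (the maintained port of the OpenAI multi-agent particle environments, where `simple_tag` is the predator–prey task) and is a stand-in for illustration, not the authors' exact benchmark code.

```python
# Illustrative only: a PettingZoo MPE predator-prey environment as a stand-in
# for the paper's OpenAI Gym multi-agent scenarios.
from pettingzoo.mpe import simple_tag_v3

env = simple_tag_v3.parallel_env(num_good=1, num_adversaries=2, max_cycles=25)
observations, infos = env.reset(seed=0)

while env.agents:
    # Random-policy placeholder; MAK-TD/SR would instead select actions
    # from the learned value estimates of each agent.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()
```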
Total loss averaged across all episodes, for each of the four implemented scenarios.
| Environment | MAK-SR | MAK-TD | MADDPG | DDPG | DQN |
|---|---|---|---|---|---|
| Cooperation | 8.93 | 2.4088 | 9,649.84 | 10,561.16 | 10.93 |
| Competition | 0.43 | 4.9301 | 10,158.18 | 10,710.37 | 107.39 |
| Predator–Prey 1v2 | 0.005 | 1.9374 | 6,816.34 | 6,884.33 | 8.21 |
| Predator–Prey 2v1 | 8.87 | 1.2421 | 7,390.18 | 6,882.20 | 10.24 |
Total reward received by the agents, averaged for each of the four implemented scenarios.
| Environment | MAK-SR | MAK-TD | MADDPG | DDPG | DQN |
|---|---|---|---|---|---|
| Cooperation | −16.0113 | −23.0113 | −69.28 | −66.29 | −39.96 |
| Competition | −0.778 | −13.358 | −63.30 | −61.34 | −14.49 |
| Predator–Prey 1v2 | −0.0916 | −13.432 | −46.17 | −20.53 | −23.451 |
| Predator–Prey 2v1 | −0.081 | −17.0058 | −55.69 | −49.41 | −44.32 |
Average number of steps taken by the agents per episode in each environment, for the implemented algorithms.
| Environment | MAK-SR | MAK-TD | MADDPG | DDPG | DQN |
|---|---|---|---|---|---|
| Cooperation | 14.03 | 12.064 | 7.377 | 7.369 | 15.142 |
| Competition | 17.59 | 17.48 | 7.36 | 7.18 | 11.98 |
| Predator–Prey 1v2 | 14.78 | 12.36 | 6.21 | 7.69 | 10.02 |
| Predator–Prey 2v1 | 9.94 | 9.773 | 6.25 | 7.12 | 8.46 |
Figure 2. The Predator–Prey environment: (a) Loss. (b) Received rewards.
Figure 3. Cumulative distance walked by the agents in the four different environments for the five implemented algorithms: (a) Cooperation, (b) Competition, (c) Predator–Prey 2v1, and (d) Predator–Prey 1v2.
Figure 4. Normalized loss results for all the agents for the four algorithms in the four different environments: (a) Cooperation, (b) Competition, (c) Predator–Prey 2v1, and (d) Predator–Prey 1v2.
Figure 5. Reward results for all the agents for the five algorithms in the four different environments: (a) Cooperation, (b) Competition, (c) Predator–Prey 2v1, and (d) Predator–Prey 1v2.
Figure 6. The mean (solid lines) and standard deviation (shaded regions) of cumulative episode reward for the four algorithms in the four different environments: (a) Cooperation, (b) Competition, (c) Predator–Prey 2v1, and (d) Predator–Prey 1v2.