Entropic Regularization of Markov Decision Processes
Boris Belousov, Jan Peters.
Abstract
An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss, measured by the Kullback-Leibler (KL) divergence, at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of f-divergences, and more concretely α-divergences, which inherit the beneficial property of providing the policy improvement step in closed form while at the same time yielding a corresponding dual objective for policy evaluation. This entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, the common pairing of least-squares value function estimation with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson χ²-divergence penalty. Other actor-critic pairs arise for other choices of the penalty-generating function f. On a concrete instantiation of our framework with the α-divergence, we carry out an asymptotic analysis of the solutions for different values of α and demonstrate the effects of the divergence choice on standard reinforcement learning problems.
Keywords: KL control; actor-critic methods; f-divergence; maximum entropy reinforcement learning
Year: 2019 PMID: 33267388 PMCID: PMC7515171 DOI: 10.3390/e21070674
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Empirical policy evaluation and policy improvement objectives for the KL divergence (α = 1) and the Pearson χ² divergence (α = 2).

| | KL divergence (α = 1) | Pearson χ² divergence (α = 2) |
|---|---|---|
| Policy evaluation | log-sum-exp (soft-max) dual objective | least-squares value function estimation |
| Policy improvement | exponentiated-advantage-weighted maximum likelihood | advantage-weighted maximum likelihood |
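As an illustration of the two actor-critic pairings above, here is a minimal numpy sketch of the corresponding closed-form policy improvement steps on a discrete action set; the reference policy `pi0`, advantage estimates `A`, and temperature `eta` are hypothetical placeholders rather than quantities from the paper:

```python
import numpy as np

def kl_policy_update(pi0, A, eta):
    """KL penalty: exponentiated-advantage weights (every action keeps nonzero mass)."""
    w = pi0 * np.exp(A / eta)
    return w / w.sum()

def pearson_policy_update(pi0, A, eta):
    """Pearson chi^2 penalty: linear advantage weights, clipped at zero."""
    w = pi0 * np.maximum(1.0 + A / eta, 0.0)
    return w / w.sum()

pi0 = np.full(4, 0.25)                 # uniform reference policy
A = np.array([1.0, 0.2, -0.5, -2.0])   # hypothetical advantage estimates
print(kl_policy_update(pi0, A, eta=1.0))
print(pearson_policy_update(pi0, A, eta=1.0))
```

Note the qualitative difference: the KL update keeps every action's probability strictly positive, whereas the Pearson update can clip sufficiently bad actions to exactly zero.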
Figure 1. Effects of α on policy improvement. Each row corresponds to a fixed α. The first four iterations of policy improvement, together with a later iteration, are shown in each row. Large positive α's eliminate bad actions one by one, keeping the exploration level equal among the rest. Small α's weigh actions according to their values; actions with low value get zero probability for α > 1 but remain possible with small probability for α ≤ 1. Large negative α's focus on the best action, exploring the remaining actions with equal probability.
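The behaviors described in the caption follow from the closed-form proximal update. Below is a sketch assuming the standard α-divergence update π ∝ π₀ [1 + (α − 1)A/η]₊^(1/(α−1)), with the KL (α → 1) and reverse-KL (α → 0) limits handled separately; the advantages and temperature are again hypothetical:

```python
import numpy as np

def alpha_policy_update(pi0, A, eta, alpha):
    """Closed-form alpha-divergence proximal policy update (sketch)."""
    if np.isclose(alpha, 1.0):            # KL limit: exponential weights
        w = pi0 * np.exp(A / eta)
    elif np.isclose(alpha, 0.0):          # reverse-KL limit
        w = pi0 / np.maximum(1.0 - A / eta, 1e-12)
    else:
        base = 1.0 + (alpha - 1.0) * A / eta
        if alpha > 1.0:
            base = np.maximum(base, 0.0)    # clips bad actions to zero mass
        else:
            base = np.maximum(base, 1e-12)  # keeps the negative exponent finite
        w = pi0 * base ** (1.0 / (alpha - 1.0))
    return w / w.sum()

pi0 = np.full(4, 0.25)
A = np.array([1.0, 0.2, -0.5, -2.0])      # hypothetical advantages
for alpha in (-4.0, 0.0, 0.5, 1.0, 2.0, 4.0):
    print(alpha, np.round(alpha_policy_update(pi0, A, 2.0, alpha), 3))
```

For α > 1 the bracket can hit zero and eliminate actions, for α ≤ 1 the weights stay positive, and for large negative α the inverted exponent concentrates nearly all mass on the best action, matching the three regimes in the figure.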
Figure 2. Average regret for various values of α.
Figure 3. Regret after a fixed time as a function of α.
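A toy Gaussian-bandit simulation in the spirit of Figures 2 and 3, reusing `alpha_policy_update` from the sketch above (entirely illustrative: the arm distributions, horizon, and η are assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(alpha, eta=2.0, n_arms=5, horizon=500):
    """One run of a proximal bandit: sample an arm, update running mean
    estimates, and re-weight the policy with the alpha-divergence update."""
    mu = rng.normal(0.0, 1.0, n_arms)          # true (unknown) arm means
    counts, means = np.zeros(n_arms), np.zeros(n_arms)
    pi = np.full(n_arms, 1.0 / n_arms)
    regret = 0.0
    for _ in range(horizon):
        a = rng.choice(n_arms, p=pi)
        r = rng.normal(mu[a], 1.0)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # running reward estimate
        pi = alpha_policy_update(pi, means - pi @ means, eta, alpha)
        regret += mu.max() - mu[a]
    return regret

for alpha in (-2.0, 0.5, 2.0):
    print(alpha, np.mean([run_bandit(alpha) for _ in range(20)]))
```

In this setup an α > 1 run can permanently clip the best arm after an early pessimistic estimate, which is exactly the failure mode the figure captions describe.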
Figure 4. Effects of the α-divergence on policy iteration. Each row corresponds to a given environment. Results for different values of α are split into three subplots within each row, from the more extreme α's on the left to the more refined values on the right. In all cases, more negative values initially show faster improvement because they immediately jump to the mode and keep the exploration level low; however, after a certain number of iterations they are overtaken by moderate values that weigh advantage estimates more evenly. Positive α's demonstrate high variance in the learning dynamics because they clamp the probability of good actions to zero when the advantage estimates are overly pessimistic, never being able to recover from such a mistake. Large positive α's may even fail to reach the optimum altogether, as exemplified in the plots. The most stable and reliable α-divergences lie between the reverse KL (α = 0) and the KL (α = 1), with the Hellinger distance (α = 1/2) outperforming both on the FrozenLake environment.
Function f_α, its convex conjugate f_α*, and their derivatives for some values of α, using the generator f_α(x) = (x^α − αx + α − 1)/(α(α − 1)); the KL and reverse KL cases arise as the limits α → 1 and α → 0.

| Divergence | α | f_α(x) | f_α′(x) | f_α*(y) | (f_α*)′(y) |
|---|---|---|---|---|---|
| KL | 1 | x log x − x + 1 | log x | e^y − 1 | e^y |
| Reverse KL | 0 | −log x + x − 1 | 1 − 1/x | −log(1 − y) | 1/(1 − y) |
| Pearson χ² | 2 | (x − 1)²/2 | x − 1 | y + y²/2 | 1 + y |
| Neyman χ² | −1 | (x − 1)²/(2x) | (1 − 1/x²)/2 | 1 − √(1 − 2y) | (1 − 2y)^(−1/2) |
| Hellinger | 1/2 | 2(√x − 1)² | 2 − 2/√x | 2y/(2 − y) | (1 − y/2)^(−2) |
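The conjugate pairs in the table can be sanity-checked numerically by brute-force maximization of xy − f_α(x); a short sketch assuming the generator normalization stated in the caption:

```python
import numpy as np

def f_alpha(x, alpha):
    """alpha-divergence generator (assumed normalization from the caption)."""
    return (x**alpha - alpha * x + alpha - 1) / (alpha * (alpha - 1))

def f_star_closed(y, alpha):
    """Closed-form convex conjugate from the table."""
    return ((1 + (alpha - 1) * y) ** (alpha / (alpha - 1)) - 1) / alpha

def f_star_numeric(y, alpha):
    """Brute-force sup_x {x*y - f_alpha(x)} on a dense grid."""
    xs = np.linspace(1e-6, 50.0, 200_000)
    return np.max(xs * y - f_alpha(xs, alpha))

for alpha in (2.0, -1.0, 0.5):   # Pearson, Neyman, Hellinger rows
    y = 0.3
    print(alpha, f_star_closed(y, alpha), f_star_numeric(y, alpha))
```

The closed-form and brute-force values agree to several decimals, e.g. 0.345 for the Pearson row at y = 0.3.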
Chain environment.
| Parameter | Value |
|---|---|
| Number of states | 8 |
| Action success probability | 0.9 |
| Small and large rewards | (2.0, 10.0) |
| Number of runs | 10 |
| Number of iterations | 30 |
| Number of samples | 800 |
| Temperature parameters | (15.0, 0.9) |
CliffWalking environment.
| Parameter | Value |
|---|---|
| Punishment for falling from the cliff | −100 |
| Reward for reaching the goal | 100 |
| Number of runs | 10 |
| Number of iterations | 40 |
| Number of samples | 1500 |
| Temperature parameters | (50.0, 0.9) |
FrozenLake environment.
| Parameter | Value |
|---|---|
| Action success probability | 0.8 |
| Number of runs | 10 |
| Number of iterations | 50 |
| Number of samples | 2000 |
| Temperature parameters | (1.0, 0.8) |
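For a reproduction attempt, the three experiment configurations above could be collected in plain dictionaries; a sketch in which all key names are hypothetical, not taken from the paper's code:

```python
# Hypothetical experiment configs mirroring the tables above.
EXPERIMENTS = {
    "Chain": {
        "n_states": 8,
        "action_success_prob": 0.9,
        "rewards": (2.0, 10.0),       # (small, large)
        "n_runs": 10,
        "n_iterations": 30,
        "n_samples": 800,
        "temperature": (15.0, 0.9),   # (initial eta, decay) -- assumed meaning
    },
    "CliffWalking": {
        "cliff_punishment": -100,     # assumed standard CliffWalking penalty
        "goal_reward": 100,
        "n_runs": 10,
        "n_iterations": 40,
        "n_samples": 1500,
        "temperature": (50.0, 0.9),
    },
    "FrozenLake": {
        "action_success_prob": 0.8,
        "n_runs": 10,
        "n_iterations": 50,
        "n_samples": 2000,
        "temperature": (1.0, 0.8),
    },
}
```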