Reinforcement Learning Theory
Origin

Definition of RL
\[\underbrace{ a_t }_{\text{agent action}}
\sim
\underbrace{ \pi_{
\underbrace{ \theta }_{\text{NN}}
}
}_{\text{stochastic policy parameterized by NN}}
(\cdot |
\underbrace{ s_t }_{\text{game state}}
)\]
\[s'\sim P \Longleftrightarrow s'\equiv s_{t+1} \sim \underbrace{ P }_{\text{natural law}}(\cdot | s_t, a_t)\]
\[\underbrace{ \tau = (s_0, a_0, s_1, a_1, ...) }_{\text{trajectory or episode or rollout}}\]
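A minimal rollout sketch in Python (assuming a Gymnasium-style discrete-action environment; PolicyNet, env, and every name below are illustrative, not from any particular library's API):

    import torch
    import torch.nn as nn

    class PolicyNet(nn.Module):
        """pi_theta: maps a state s_t to a categorical distribution over actions."""
        def __init__(self, obs_dim, n_actions):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                      nn.Linear(64, n_actions))
        def forward(self, s):
            return torch.distributions.Categorical(logits=self.body(s))

    def rollout(env, policy, max_steps=1000):
        """Collect one trajectory tau = [(s_0, a_0, r_0), (s_1, a_1, r_1), ...]."""
        tau = []
        s, _ = env.reset()
        for _ in range(max_steps):
            dist = policy(torch.as_tensor(s, dtype=torch.float32))
            a = dist.sample()                                          # a_t ~ pi_theta(.|s_t)
            s_next, r, terminated, truncated, _ = env.step(a.item())   # s_{t+1} ~ P(.|s_t, a_t)
            tau.append((s, a.item(), r))
            s = s_next
            if terminated or truncated:
                break
        return tau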
\[\begin{split}\text{entropy: } H(P) = \underset{x\sim P}{E}[-\log P(x)] \\
\begin{cases}
\text{reward} & r_t=r(s_t, a_t) \\
\text{reward-to-go} & \hat{R}_t = \sum\limits_{t'=t}^{T} r(s_{t'}, a_{t'}) \\
\text{entropy-regularized reward} & R_\tau =\sum\limits_{t=0}^{\infty} (\underbrace{ \gamma }_{\text{discount}} )^t (r_t+\alpha H(\pi(\cdot|s_t)))
\end{cases}\end{split}\]
\[\begin{split}J(\pi) =
\underbrace{ \underset{\tau\sim\pi}{E} [R_\tau] }_{\text{expectation over }\tau \text{ generated by }\pi } \\
\text{RL: find } \pi^* = \arg\max_\pi J(\pi)\end{split}\]
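Both the reward-to-go and the entropy are cheap to compute from samples; a small sketch in plain Python/NumPy (names are illustrative):

    import numpy as np

    def reward_to_go(rewards):
        """R_hat_t = sum_{t'=t}^{T} r_{t'}, computed in one backward pass."""
        rtg = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running += rewards[t]
            rtg[t] = running
        return rtg

    def entropy(p):
        """H(P) = E_{x~P}[-log P(x)] for a discrete distribution p."""
        p = np.asarray(p, dtype=float)
        return float(-(p * np.log(p + 1e-12)).sum())

    print(reward_to_go([1.0, 0.0, 2.0]))   # [3. 2. 2.]
    print(entropy([0.5, 0.5]))             # log 2 ~ 0.693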
Bellman equation: a self-consistency equation, just as in density-functional theory
\[\begin{split}\begin{align}
\text{AVF} &\quad \text{Action-Value Function} \\
\text{on-policy AVF} &\quad Q^\pi(s,a) =
\underbrace{ \underset{\tau\sim\pi}{E} [R_\tau | s_0=s, a_0=a] }_{\text{expectation over }\tau \text{ generated by }\pi } \\
\text{Bellman eqn} &\quad Q^\pi(s,a) = \underset{s'\sim P}{E}\left[
r(s,a) + \gamma \;
\underbrace{ \underset{a'\sim\pi}{E}[Q^\pi(s',a') ] }_{V^\pi(s')}
\right] \\
\text{optimal AVF} &\quad Q^* = \max_\pi Q^\pi \\
\text{Bellman eqn} &\quad Q^*(s,a) = \underset{s'\sim P}{E}\left[
r(s,a) + \gamma \;
\underbrace{ \max_{a'}[Q^*(s',a') ] }_{V^*(s')}
\right] \\
\end{align}\end{split}\]
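The self-consistency can be checked numerically: iterating the optimal Bellman backup converges to Q* (fixed-point / Q-value iteration). A toy sketch on a made-up 2-state, 2-action MDP (the P and r tables below are illustrative only):

    import numpy as np

    nS, nA, gamma = 2, 2, 0.9
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[s, a, s'] = P(s'|s, a)
                  [[0.5, 0.5], [0.0, 1.0]]])
    r = np.array([[1.0, 0.0],                  # r[s, a]
                  [0.0, 2.0]])

    Q = np.zeros((nS, nA))
    for _ in range(1000):
        # Q*(s,a) <- E_{s'~P}[ r(s,a) + gamma * max_{a'} Q*(s',a') ]
        Q_new = r + gamma * P @ Q.max(axis=1)
        converged = np.max(np.abs(Q_new - Q)) < 1e-10
        Q = Q_new
        if converged:
            break
    print(Q)    # converged Q*; the greedy policy takes argmax over actions in each row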
\[\begin{split}\text{replay buffer: } D = \{ (s,a,r,s', d),... \} \quad d=\begin{cases} 1& s'\text{ is done} \\ 0& \text{otherwise} \end{cases} \\
L(\phi, D) = \underset{(s,a,r,s', d) \sim D}{E} \left[ \left( Q_\phi(s,a) -
\underbrace{ (r+\gamma(1-d)
\overbrace{ \max_{a'} Q_{\phi_{\text{targ}}}(s',a') }^{ \text{approx by } Q_{\phi_{\text{targ}}}(s', \; \pi_{\theta_{\text{targ}}}(s') ) }
) }_{\text{target}}
\right)^2 \right] \\
\phi_{\text{targ}} \underset{\text{every k steps}}{\longleftarrow}
\underbrace{ \rho }_{\text{polyak, } \to 1}
\phi_{\text{targ}} + (1-\rho)\phi \\
\theta = \arg\max_\theta \underset{s\sim D}{E}[Q_\phi(s, \; \pi_\theta(s))], \qquad \theta_{\text{targ}} \underset{\text{every k steps}}{\longleftarrow} \rho\,\theta_{\text{targ}} + (1-\rho)\theta\end{split}\]
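A sketch of one critic update plus the polyak step in PyTorch, following the DDPG-style loss above (q_net, q_targ, pi_targ, and the sampled batch are assumed to exist; none of these names come from a specific library):

    import torch

    def q_loss(batch, q_net, q_targ, pi_targ, gamma=0.99):
        """Mean-squared Bellman error L(phi, D) on one sampled batch."""
        s, a, r, s2, d = batch                  # tensors sampled from the replay buffer D
        with torch.no_grad():                   # the target is treated as a constant
            a2 = pi_targ(s2)                    # approximates argmax_{a'} via the target policy
            target = r + gamma * (1 - d) * q_targ(s2, a2)
        return ((q_net(s, a) - target) ** 2).mean()

    @torch.no_grad()
    def polyak_update(targ_net, net, rho=0.995):
        """phi_targ <- rho * phi_targ + (1 - rho) * phi."""
        for p_t, p in zip(targ_net.parameters(), net.parameters()):
            p_t.mul_(rho).add_((1 - rho) * p)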
\[\text{advantage function: } A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]
policy gradient
\[\underbrace{ \nabla_\theta J(\pi_\theta) }_{\text{policy gradient}}
= \underset{\tau\sim\pi_\theta}{E} \left[ \sum_{t=0}^T \boxed{ \Phi_t } \nabla_\theta \log \pi_\theta(a_t|s_t) \right]\]
\[\begin{split}\text{lemma: } \underset{a_t\sim\pi_\theta}{E} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t)b(s_t)\right] =0 \text{ for any } \underset{\text{baseline}}{ \boxed{ b(s_t) } } \\
\Phi_t = \begin{cases}
\hat{R}_t - b(s_t) \\
Q^{\pi_\theta}(s_t,a_t) - b(s_t)
\end{cases} \\
b(s_t) = V^{\pi_\theta}(s_t) \quad \text{(common choice)}\end{split}\]
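One policy-gradient step with \(\Phi_t = \hat{R}_t - V_\phi(s_t)\) (the value baseline), sketched in PyTorch; the policy returns a distribution as in the rollout sketch above, and all names are illustrative:

    import torch

    def pg_loss(policy, value_net, states, actions, rewards_to_go):
        """Surrogate loss whose gradient estimates -grad_theta J(pi_theta)."""
        logp = policy(states).log_prob(actions)                  # log pi_theta(a_t|s_t)
        with torch.no_grad():                                    # Phi_t must not carry gradients
            phi = rewards_to_go - value_net(states).squeeze(-1)  # Phi_t = R_hat_t - b(s_t)
        return -(logp * phi).mean()      # minimizing this performs gradient ascent on J(pi_theta)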
Soft Actor-Critic
\[\hat{Q}^\pi(s,a) = Q^\pi(s,a) -
\underbrace{ \gamma^0 \alpha H(\pi(\cdot|s)) }_{\text{entropy when }t=0}\]
\[\begin{split}\text{Bellman eqn} \quad Q^\pi(s,a) =
\underbrace{ \alpha H(\pi(\cdot|s)) }_{t=0\text{ entropy}} + \underset{s'\sim P}{E}\left[
r(s,a) + \gamma \;
\underbrace{ \underset{a'\sim\pi}{E}[Q^\pi(s',a') ] }_{V^\pi(s')}
\right] \\
= \alpha H(\pi(\cdot|s)) + \underset{\substack{s'\sim P \\ a'\sim\pi}}{E}\left[
r(s,a) + \gamma \; Q^\pi(s',a')
\right] \\
= \alpha H(\pi(\cdot|s)) + \underset{\substack{s'\sim P \\ a'\sim\pi}}{E}\left[
r(s,a) + \gamma \left( \hat{Q}^\pi(s',a')+ \alpha H(\pi(\cdot|s')) \right)
\right] \\
\text{subtract the } t{=}0 \text{ entropy from both sides to get the soft Bellman eqn:} \\
\hat{Q}^\pi(s,a) = \underset{\substack{s'\sim P \\ a'\sim\pi}}{E}\left[
r(s,a) + \gamma \left( \hat{Q}^\pi(s',a')+ \alpha H(\pi(\cdot|s')) \right)
\right]\end{split}\]
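In practice the critic target replaces \(\alpha H(\pi(\cdot|s'))\) with its one-sample estimate \(-\alpha\log\pi(a'|s')\). A sketch (discrete actions, single target critic; real SAC additionally takes a min over two target Q networks; all names are assumed):

    import torch

    def sac_target(batch, q_targ, policy, gamma=0.99, alpha=0.2):
        """Sample-based target for the soft Bellman equation above."""
        s, a, r, s2, d = batch
        with torch.no_grad():
            dist = policy(s2)
            a2 = dist.sample()              # a' ~ pi(.|s'), a fresh sample (not from the buffer)
            logp_a2 = dist.log_prob(a2)     # -alpha * logp_a2 estimates alpha * H(pi(.|s'))
            return r + gamma * (1 - d) * (q_targ(s2, a2) - alpha * logp_a2)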
Policy Optimization vs Q-Learning
Policy Optimization:
learns the policy \(\pi_\theta(a|s)\) directly with NN \(\theta\); also learns \(V^\pi_\phi(s)\) with NN \(\phi\) in order to update the policy
updates by gradient ascent on \(J(\pi_\theta)\), or by maximizing a local approximation of it
on-policy: each update only uses data collected with the latest policy
stable and reliable, but sample inefficient
Q-Learning:
learns \(Q^*_\theta(s,a)\) with NN \(\theta\), which indirectly gives the policy \(a^*(s) = \arg\max\limits_a Q^*_\theta(s,a)\)
uses a cost function based on the Bellman equation
off-policy: each update can reuse data collected at any earlier point in training
less stable, but sample efficient
hybrid: algorithms such as DDPG and SAC sit between the two extremes, learning both a policy and a Q-function from off-policy data
use a NN to approximate the value function
\[\begin{split}k: \text{ epoch index} \\
\underbrace{ \phi_k }_{\text{NN to learn }V} = \arg\min_\phi \underset{s_t,\hat{R}_t\sim \pi_k }{E}\left[ \left( V_\phi(s_t) - \hat{R}_t \right)^2 \right]\end{split}\]
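The fit is a plain regression on the \((s_t, \hat{R}_t)\) pairs collected with \(\pi_k\); a sketch in PyTorch (value_net and optimizer are assumed to exist):

    import torch

    def fit_value_fn(value_net, optimizer, states, rewards_to_go, iters=80):
        """A few gradient steps of MSE regression: V_phi(s_t) -> R_hat_t."""
        for _ in range(iters):
            loss = ((value_net(states).squeeze(-1) - rewards_to_go) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return loss.item()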
Algorithms
2015 Trust Region Policy Optimization (not in comparisons)
John Schulman (OpenAI, 16524, 31)
Sergey Levine (UC Berkeley Assistant Prof, 27200, 76)
Philipp Moritz
Michael I. Jordan (UC Berkeley Prof, 91071, 118)
Pieter Abbeel (UC Berkeley Prof, 37231, 91)
2017 Proximal Policy Optimization
John Schulman
Filip Wolski (OpenAI, 3220, 6)
Prafulla Dhariwal (OpenAI, 4443, 8)
Alec Radford (OpenAI, 16501, 14)
Oleg Klimov (OpenAI, 3481, 8)
2016 Deep Deterministic Policy Gradient
Timothy P. Lillicrap (DeepMind, 29763, 43)
Jonathan J. Hunt (DeepMind, 4458, 10)
Alexander Pritzel (DeepMind, 6236, 14)
Nicolas Heess (DeepMind, 13047, 35)
Tom Erez (DeepMind, 7435, 21)
Yuval Tassa (DeepMind, 8087, 23)
David Silver (DeepMind, 49958, 58)
Daan Wierstra (DeepMind, 31959, 40)
notation: in the following \(\mu_\theta = \pi_\theta\)
2018 Twin Delayed DDPG
Scott Fujimoto (McGill University, 581, 7)
Herke van Hoof (University of Amsterdam, 1185, 15)
David Meger (McGill University, 1620, 14)
2018 Soft Actor-Critic
Tuomas Haarnoja (DeepMind, 1599, 7)
Aurick Zhou (UC Berkeley, 1093, 5)
Pieter Abbeel
Sergey Levine