Reinforcement Learning Theory
Origin

Definition of RL
\[\underbrace{ a_t }_{\text{agent action}}
\sim
\underbrace{ \pi_{
\underbrace{ \theta }_{\text{NN}}
}
}_{\text{stochastic policy parameterized by NN}}
(\cdot |
\underbrace{ s_t }_{\text{game state}}
)\]
\[s'\sim P \Longleftrightarrow s'\equiv s_{t+1} \sim \underbrace{ P }_{\text{natural law}}(\cdot | s_t, a_t)\]
\[\underbrace{ \tau = (s_0, a_0, s_1, a_1, ...) }_{\text{trajectory or episode or rollout}}\]
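A minimal rollout sketch in Python (assuming a Gymnasium-style discrete-action environment; PolicyNet, env, and every name below are illustrative, not from any particular library's API):

    import torch
    import torch.nn as nn

    class PolicyNet(nn.Module):
        """pi_theta: maps a state s_t to a categorical distribution over actions."""
        def __init__(self, obs_dim, n_actions):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                      nn.Linear(64, n_actions))
        def forward(self, s):
            return torch.distributions.Categorical(logits=self.body(s))

    def rollout(env, policy, max_steps=1000):
        """Collect one trajectory tau = [(s_0, a_0, r_0), (s_1, a_1, r_1), ...]."""
        tau = []
        s, _ = env.reset()
        for _ in range(max_steps):
            dist = policy(torch.as_tensor(s, dtype=torch.float32))
            a = dist.sample()                                          # a_t ~ pi_theta(.|s_t)
            s_next, r, terminated, truncated, _ = env.step(a.item())   # s_{t+1} ~ P(.|s_t, a_t)
            tau.append((s, a.item(), r))
            s = s_next
            if terminated or truncated:
                break
        return tau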
\[\begin{split}\text{entropy: } H(P) = \underset{x\sim P}{E}[-\log P(x)] \\
\begin{cases}
\text{reward} & r_t=r(s_t, a_t) \\
\text{reward-to-go} & \hat{R}_t = \sum\limits_{t'=t}^{T} r(s_{t'}, a_{t'}) \\
\text{entropy-regularized reward} & R_\tau =\sum\limits_{t=0}^{\infty} (\underbrace{ \gamma }_{\text{discount}} )^t (r_t+\alpha H(\pi(\cdot|s_t)))
\end{cases}\end{split}\]
\[\begin{split}J(\pi) =
\underbrace{ \underset{\tau\sim\pi}{E} [R_\tau] }_{\text{expectation over }\tau \text{ generated by }\pi } \\
\text{RL: find } \pi^* = \arg\max_\pi J(\pi)\end{split}\]
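Both the reward-to-go and the entropy are cheap to compute from samples; a small sketch in plain Python/NumPy (names are illustrative):

    import numpy as np

    def reward_to_go(rewards):
        """R_hat_t = sum_{t'=t}^{T} r_{t'}, computed in one backward pass."""
        rtg = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running += rewards[t]
            rtg[t] = running
        return rtg

    def entropy(p):
        """H(P) = E_{x~P}[-log P(x)] for a discrete distribution p."""
        p = np.asarray(p, dtype=float)
        return float(-(p * np.log(p + 1e-12)).sum())

    print(reward_to_go([1.0, 0.0, 2.0]))   # [3. 2. 2.]
    print(entropy([0.5, 0.5]))             # log 2 ~ 0.693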
Bellman equation: a self-consistency equation, just as in density-functional theory
\[\begin{split}\begin{align}
\text{AVF} &\quad \text{Action-Value Function} \\
\text{on-policy AVF} &\quad Q^\pi(s,a) =
\underbrace{ \underset{\tau\sim\pi}{E} [R_\tau | s_0=s, a_0=a] }_{\text{expectation over }\tau \text{ generated by }\pi } \\
\text{Bellman eqn} &\quad Q^\pi(s,a) = \underset{s'\sim P}{E}\left[
r(s,a) + \gamma \;
\underbrace{ \underset{a'\sim\pi}{E}[Q^\pi(s',a') ] }_{V^\pi(s')}
\right] \\
\text{optimal AVF} &\quad Q^* = \max_\pi Q^\pi \\
\text{Bellman eqn} &\quad Q^*(s,a) = \underset{s'\sim P}{E}\left[
r(s,a) + \gamma \;
\underbrace{ \max_{a'}[Q^*(s',a') ] }_{V^*(s')}
\right] \\
\end{align}\end{split}\]
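The self-consistency can be checked numerically: iterating the optimal Bellman backup converges to Q* (fixed-point / Q-value iteration). A toy sketch on a made-up 2-state, 2-action MDP (the P and r tables below are illustrative only):

    import numpy as np

    nS, nA, gamma = 2, 2, 0.9
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[s, a, s'] = P(s'|s, a)
                  [[0.5, 0.5], [0.0, 1.0]]])
    r = np.array([[1.0, 0.0],                  # r[s, a]
                  [0.0, 2.0]])

    Q = np.zeros((nS, nA))
    for _ in range(1000):
        # Q*(s,a) <- E_{s'~P}[ r(s,a) + gamma * max_{a'} Q*(s',a') ]
        Q_new = r + gamma * P @ Q.max(axis=1)
        converged = np.max(np.abs(Q_new - Q)) < 1e-10
        Q = Q_new
        if converged:
            break
    print(Q)    # converged Q*; the greedy policy takes argmax over actions in each row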
\[\begin{split}\text{replay buffer: } D = \{ (s,a,r,s', d),... \} \quad d=\begin{cases} 1& s'\text{ is done} \\ 0& \text{otherwise} \end{cases} \\
L(\phi, D) = \underset{(s,a,r,s', d) \sim D}{E} \left[ \left( Q_\phi(s,a) -
\underbrace{ (r+\gamma(1-d)
\overbrace{ \max_{a'} Q_{\phi_{\text{targ}}}(s',a') }^{ \text{approx by } Q_{\phi_{\text{targ}}}(s', \; \pi_{\theta_{\text{targ}}}(s') ) }
) }_{\text{target}}
\right)^2 \right] \\
\phi_{\text{targ}} \underset{\text{every k steps}}{\longleftarrow}
\underbrace{ \rho }_{\text{polyak, } \to 1}
\phi_{\text{targ}} + (1-\rho)\phi \\
\theta = \arg\max_\theta \underset{s\sim D}{E}[Q_\phi(s, \; \pi_\theta(s))], \qquad \theta_{\text{targ}} \underset{\text{every k steps}}{\longleftarrow} \rho\,\theta_{\text{targ}} + (1-\rho)\theta\end{split}\]
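A sketch of one critic update plus the polyak step in PyTorch, following the DDPG-style loss above (q_net, q_targ, pi_targ, and the sampled batch are assumed to exist; none of these names come from a specific library):

    import torch

    def q_loss(batch, q_net, q_targ, pi_targ, gamma=0.99):
        """Mean-squared Bellman error L(phi, D) on one sampled batch."""
        s, a, r, s2, d = batch                  # tensors sampled from the replay buffer D
        with torch.no_grad():                   # the target is treated as a constant
            a2 = pi_targ(s2)                    # approximates argmax_{a'} via the target policy
            target = r + gamma * (1 - d) * q_targ(s2, a2)
        return ((q_net(s, a) - target) ** 2).mean()

    @torch.no_grad()
    def polyak_update(targ_net, net, rho=0.995):
        """phi_targ <- rho * phi_targ + (1 - rho) * phi."""
        for p_t, p in zip(targ_net.parameters(), net.parameters()):
            p_t.mul_(rho).add_((1 - rho) * p)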
\[\text{advantage function: } A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]
policy gradient
\[\underbrace{ \nabla_\theta J(\pi_\theta) }_{\text{policy gradient}}
= \underset{\tau\sim\pi_\theta}{E} \left[ \sum_{t=0}^T \boxed{ \Phi_t } \nabla_\theta \log \pi_\theta(a_t|s_t) \right]\]
\[\begin{split}\text{lemma: } \underset{a_t\sim\pi_\theta}{E} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t)b(s_t)\right] =0 \text{ for any } \underset{\text{baseline}}{ \boxed{ b(s_t) } } \\
\Phi_t = \begin{cases}
\hat{R}_t - b(s_t) \\
Q^{\pi_\theta}(s_t,a_t) - b(s_t)
\end{cases} \\
b(s_t) = V^{\pi_\theta}(s_t) \quad \text{(common choice)}\end{split}\]
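One policy-gradient step with \(\Phi_t = \hat{R}_t - V_\phi(s_t)\) (the value baseline), sketched in PyTorch; the policy returns a distribution as in the rollout sketch above, and all names are illustrative:

    import torch

    def pg_loss(policy, value_net, states, actions, rewards_to_go):
        """Surrogate loss whose gradient estimates -grad_theta J(pi_theta)."""
        logp = policy(states).log_prob(actions)                  # log pi_theta(a_t|s_t)
        with torch.no_grad():                                    # Phi_t must not carry gradients
            phi = rewards_to_go - value_net(states).squeeze(-1)  # Phi_t = R_hat_t - b(s_t)
        return -(logp * phi).mean()      # minimizing this performs gradient ascent on J(pi_theta)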
Soft Actor-Critic
\[\hat{Q}^\pi(s,a) = Q^\pi(s,a) -
\underbrace{ \gamma^0 \alpha H(\pi(\cdot|s)) }_{\text{entropy when }t=0}\]
\[\begin{split}\text{Bellman eqn} \quad Q^\pi(s,a) =
\underbrace{ \alpha H(\pi(\cdot|s)) }_{t=0\text{ entropy}} + \underset{s'\sim P}{E}\left[
r(s,a) + \gamma \;
\underbrace{ \underset{a'\sim\pi}{E}[Q^\pi(s',a') ] }_{V^\pi(s')}
\right] \\
= \alpha H(\pi(\cdot|s)) + \underset{\substack{s'\sim P \\ a'\sim\pi}}{E}\left[
r(s,a) + \gamma \; Q^\pi(s',a')
\right] \\
= \alpha H(\pi(\cdot|s)) + \underset{\substack{s'\sim P \\ a'\sim\pi}}{E}\left[
r(s,a) + \gamma \left( \hat{Q}^\pi(s',a')+ \alpha H(\pi(\cdot|s')) \right)
\right] \\
\text{subtract the } t{=}0 \text{ entropy from both sides to get the soft Bellman eqn:} \\
\hat{Q}^\pi(s,a) = \underset{\substack{s'\sim P \\ a'\sim\pi}}{E}\left[
r(s,a) + \gamma \left( \hat{Q}^\pi(s',a')+ \alpha H(\pi(\cdot|s')) \right)
\right]\end{split}\]
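In practice the critic target replaces \(\alpha H(\pi(\cdot|s'))\) with its one-sample estimate \(-\alpha\log\pi(a'|s')\). A sketch (discrete actions, single target critic; real SAC additionally takes a min over two target Q networks; all names are assumed):

    import torch

    def sac_target(batch, q_targ, policy, gamma=0.99, alpha=0.2):
        """Sample-based target for the soft Bellman equation above."""
        s, a, r, s2, d = batch
        with torch.no_grad():
            dist = policy(s2)
            a2 = dist.sample()              # a' ~ pi(.|s'), a fresh sample (not from the buffer)
            logp_a2 = dist.log_prob(a2)     # -alpha * logp_a2 estimates alpha * H(pi(.|s'))
            return r + gamma * (1 - d) * (q_targ(s2, a2) - alpha * logp_a2)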
Policy Optimization vs Q-Learning
Policy Optimization:
learns the policy \(\pi_\theta(a|s)\) directly with NN \(\theta\); also learns \(V^\pi_\phi(s)\) with NN \(\phi\) in order to update the policy
updates by gradient ascent on \(J(\pi_\theta)\), or by maximizing a local approximation of it
on-policy: each update only uses data collected with the latest policy
stable and reliable, but sample inefficient
Q-Learning:
learns \(Q^*_\theta(s,a)\) with NN \(\theta\), which indirectly gives the policy \(a^*(s) = \arg\max\limits_a Q^*_\theta(s,a)\)
uses a cost function based on the Bellman equation
off-policy: each update can reuse data collected at any earlier point in training
less stable, but sample efficient
hybrid: algorithms such as DDPG and SAC sit between the two extremes, learning both a policy and a Q-function from off-policy data
use a NN to approximate the value function
\[\begin{split}k: \text{ epoch index} \\
\underbrace{ \phi_k }_{\text{NN to learn }V} = \arg\min_\phi \underset{s_t,\hat{R}_t\sim \pi_k }{E}\left[ \left( V_\phi(s_t) - \hat{R}_t \right)^2 \right]\end{split}\]
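The fit is a plain regression on the \((s_t, \hat{R}_t)\) pairs collected with \(\pi_k\); a sketch in PyTorch (value_net and optimizer are assumed to exist):

    import torch

    def fit_value_fn(value_net, optimizer, states, rewards_to_go, iters=80):
        """A few gradient steps of MSE regression: V_phi(s_t) -> R_hat_t."""
        for _ in range(iters):
            loss = ((value_net(states).squeeze(-1) - rewards_to_go) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return loss.item()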
Algorithms
2015 Trust Region Policy Optimization (not in comparisons)
John Schulman (OpenAI, 16524, 31)
Sergey Levine (UC Berkeley Assistant Prof, 27200, 76)
Philipp Moritz
Michael I. Jordan (UC Berkeley Prof, 91071, 118)
Pieter Abbeel (UC Berkeley Prof, 37231, 91)
2017 Proximal Policy Optimization
John Schulman
Filip Wolski (OpenAI, 3220, 6)
Prafulla Dhariwal (OpenAI, 4443, 8)
Alec Radford (OpenAI, 16501, 14)
Oleg Klimov (OpenAI, 3481, 8)
2016 Deep Deterministic Policy Gradient
Timothy P. Lillicrap (DeepMind, 29763, 43)
Jonathan J. Hunt (DeepMind, 4458, 10)
Alexander Pritzel (DeepMind, 6236, 14)
Nicolas Heess (DeepMind, 13047, 35)
Tom Erez (DeepMind, 7435, 21)
Yuval Tassa (DeepMind, 8087, 23)
David Silver (DeepMind, 49958, 58)
Daan Wierstra (DeepMind, 31959, 40)
notation: in the following \(\mu_\theta = \pi_\theta\)
2018 Twin Delayed DDPG
Scott Fujimoto (McGill University, 581, 7)
Herke van Hoof (University of Amsterdam, 1185, 15)
David Meger (McGill University, 1620, 14)
2018 Soft Actor-Critic
Tuomas Haarnoja (DeepMind, 1599, 7)
Aurick Zhou (UC Berkeley, 1093, 5)
Pieter Abbeel
Sergey Levine