Reinforcement Learning Theory

Origin

  • Richard E. Bellman (1957). Dynamic Programming.

  • Richard S. Sutton (1984). Temporal Credit Assignment in Reinforcement Learning.


Definition of RL

  • the policy

\[\underbrace{ a_t }_{\text{agent action}} \sim \underbrace{ \pi_{ \underbrace{ \theta }_{\text{NN}} } }_{\text{stochastic policy parameterized by NN}} (\cdot | \underbrace{ s_t }_{\text{game state}} )\]
  • the environment (natural law)

\[s'\sim P \Longleftrightarrow s'\equiv s_{t+1} \sim \underbrace{ P }_{\text{natural law}}(\cdot | s_t, a_t)\]
  • definition: trajectory

\[\underbrace{ \tau = (s_0, a_0, s_1, a_1, ...) }_{\text{trajectory or episode or rollout}}\]
  • definition: rewards

\[\begin{split}\text{entropy: } H(P) = \underset{x\sim P}{E}[-\log P(x)] \\ \begin{cases} \text{reward} & r_t=r(s_t, a_t) \\ \text{reward-to-go} & \hat{R}_t = \sum\limits_{t'=t}^{T} r(s_{t'}, a_{t'}) \\ \text{entropy-regularized return} & R_\tau =\sum\limits_{t=0}^{\infty} (\underbrace{ \gamma }_{\text{discount}} )^t \left(r_t+\alpha H(\pi(\cdot|s_t))\right) \quad (\alpha=0 \text{ gives the plain discounted return}) \end{cases}\end{split}\]
  • what is RL ?

\[\begin{split}J(\pi) = \underbrace{ \underset{\tau\sim\pi}{E} [R_\tau] }_{\text{expectation over }\tau \text{ generated by }\pi } \\ \text{RL: find } \pi^* = \arg\max_\pi J(\pi)\end{split}\]
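
A minimal Python sketch of these definitions (illustrative only, not from any library): it samples trajectories \(\tau\) from a softmax policy \(\pi_\theta\) in a small random tabular MDP and estimates \(J(\pi)=E_{\tau\sim\pi}[R_\tau]\) by Monte Carlo. The names nS, nA, P, r, T and the random tables are assumptions for the sketch:

    # Sample trajectories tau from a softmax policy pi_theta in a random tabular
    # MDP and estimate J(pi) = E_{tau~pi}[R_tau] by Monte Carlo (sketch only).
    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA = 5, 3                  # number of states / actions
    gamma, alpha = 0.99, 0.1       # discount, entropy temperature
    T = 200                        # truncation horizon for each rollout

    # the environment ("natural law" P and reward r) -- here just random tables
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = distribution over s'
    r = rng.normal(size=(nS, nA))                   # r(s, a)

    # stochastic policy pi_theta(a|s): softmax over a table of logits theta
    theta = rng.normal(size=(nS, nA))
    def pi(s):
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    def rollout():
        """Sample tau = (s0, a0, s1, a1, ...) and accumulate R_tau."""
        s, R = rng.integers(nS), 0.0
        for t in range(T):
            p = pi(s)
            a = rng.choice(nA, p=p)                 # a_t ~ pi_theta(.|s_t)
            H = -(p * np.log(p)).sum()              # entropy H(pi(.|s_t))
            R += gamma**t * (r[s, a] + alpha * H)   # entropy-regularized return
            s = rng.choice(nS, p=P[s, a])           # s_{t+1} ~ P(.|s_t, a_t)
        return R

    # J(pi) = E_{tau~pi}[R_tau], estimated by averaging many rollouts
    print("Monte-Carlo estimate of J(pi):", np.mean([rollout() for _ in range(500)]))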

Bellman equation: a self-consistency condition, much like the self-consistent equations of density-functional theory

  • Action-Value Function

\[\begin{split}\begin{align} \text{AVF} &\quad \text{Action-Value Function} \\ \text{on-policy AVF} &\quad Q^\pi(s,a) = \underbrace{ \underset{\tau\sim\pi}{E} [R_\tau | s_0=s, a_0=a] }_{\text{expectation over }\tau \text{ generated by }\pi } \\ \text{Bellman eqn} &\quad Q^\pi(s,a) = \underset{s'\sim P}{E}\left[ r(s,a) + \gamma \; \underbrace{ \underset{a'\sim\pi}{E}[Q^\pi(s',a') ] }_{V^\pi(s')} \right] \\ \text{optimal AVF} &\quad Q^* = \max_\pi Q^\pi \\ \text{Bellman eqn} &\quad Q^*(s,a) = \underset{s'\sim P}{E}\left[ r(s,a) + \gamma \; \underbrace{ \max_{a'}[Q^*(s',a') ] }_{V^*(s')} \right] \\ \end{align}\end{split}\]
  • mean-squared Bellman error (a numerical sketch follows this list)

\[\begin{split}\text{replay buffer: } D = \{ (s,a,r,s', d),... \} \quad d=\begin{cases} 1& s'\text{ is done} \\ 0& \text{otherwise} \end{cases} \\ L(\phi, D) = \underset{(s,a,r,s', d) \sim D}{E} \left[ \left( Q_\phi(s,a) - \underbrace{ (r+\gamma(1-d) \overbrace{ \max_{a'} Q_{\phi_{\text{targ}}}(s',a') }^{ \text{approx by } Q_{\phi_{\text{targ}}}(s', \; \pi_{\theta_{\text{targ}}}(s') ) } ) }_{\text{target}} \right)^2 \right] \\ \phi_{\text{targ}} \underset{\text{every }k\text{ steps}}{\longleftarrow} \underbrace{ \rho }_{\text{polyak, } \to 1} \phi_{\text{targ}} + (1-\rho)\phi, \qquad \theta_{\text{targ}} \longleftarrow \rho\,\theta_{\text{targ}} + (1-\rho)\theta \\ \text{policy update: } \theta = \arg\max_\theta \underset{s\sim D}{E}[Q_\phi(s, \; \pi_\theta(s))]\end{split}\]
  • advantage function

\[\text{advantage function: } A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]
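
A minimal numerical sketch of the mean-squared Bellman error update above, assuming a tabular Q table in place of the neural networks \(Q_\phi, Q_{\phi_{\text{targ}}}\) and a replay buffer filled with synthetic transitions; the helper names (replay, msbe_update, ...) are illustrative, not from any library:

    # One mean-squared-Bellman-error step from a replay buffer, with a
    # polyak-averaged target table (tabular stand-in for Q_phi, Q_phi_targ).
    import numpy as np

    rng = np.random.default_rng(1)
    nS, nA = 5, 3
    gamma, rho, lr = 0.99, 0.995, 0.1

    Q      = np.zeros((nS, nA))        # Q_phi
    Q_targ = Q.copy()                  # Q_{phi_targ}

    # replay buffer D = {(s, a, r, s', d), ...}; synthetic transitions, just to
    # exercise the update (not sampled from a real task)
    replay = [(rng.integers(nS), rng.integers(nA), rng.normal(),
               rng.integers(nS), rng.integers(2)) for _ in range(10_000)]

    def msbe_update(batch_size=64):
        """One gradient step on L(phi, D), then a polyak update of the target."""
        global Q_targ
        idx = rng.integers(len(replay), size=batch_size)
        for i in idx:
            s, a, r, s2, d = replay[i]
            target = r + gamma * (1 - d) * Q_targ[s2].max()  # max_a' Q_targ(s', a')
            Q[s, a] -= lr * (Q[s, a] - target)   # descend the squared error (factor 2 folded into lr)
        Q_targ = rho * Q_targ + (1 - rho) * Q    # polyak averaging

    for _ in range(1000):
        msbe_update()

    # advantage A(s, a) = Q(s, a) - V(s); here V(s) is the greedy value max_a Q(s, a)
    A = Q - Q.max(axis=1, keepdims=True)
    print(A)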

policy gradient

\[\underbrace{ \nabla_\theta J(\pi_\theta) }_{\text{policy gradient}} = \underset{\tau\sim\pi_\theta}{E} \left[ \sum_{t=0}^T \boxed{ \Phi_t } \nabla_\theta \log \pi_\theta(a_t|s_t) \right]\]
\[\begin{split}\text{lemma: } \underset{a_t\sim\pi_\theta}{E} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t)b(s_t)\right] =0 \text{ for any } \underset{\text{baseline}}{ \boxed{ b(s_t) } } \\ \Phi_t = \begin{cases} \hat{R}_t - b(s_t) \\ Q^{\pi_\theta}(s_t,a_t) - b(s_t) \end{cases} \\ b(s_t) = \begin{cases} V^\pi(s_t) \end{cases}\end{split}\]
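
A sketch of the policy-gradient estimator with \(\Phi_t = \hat{R}_t - b(s_t)\), assuming a tabular softmax policy (so \(\nabla_\theta \log\pi_\theta(a|s)\) has the closed form \(\mathrm{onehot}(a)-\pi(\cdot|s)\)) and a crude running-average baseline; the environment and all names are illustrative, and the reward-to-go is discounted (set gamma=1 for the undiscounted \(\hat{R}_t\) above):

    # Vanilla policy gradient with a baseline on a random tabular MDP (sketch).
    import numpy as np

    rng = np.random.default_rng(2)
    nS, nA = 5, 3
    gamma, lr, T = 0.99, 0.01, 50

    P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition tables
    r = rng.normal(size=(nS, nA))                   # reward table

    theta = np.zeros((nS, nA))          # policy logits
    V = np.zeros(nS)                    # baseline b(s), a running estimate of V^pi(s)

    def pi(s):
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    for epoch in range(2000):
        # collect one trajectory under the current policy (on-policy)
        s, traj = rng.integers(nS), []
        for t in range(T):
            p = pi(s)
            a = rng.choice(nA, p=p)
            traj.append((s, a, r[s, a]))
            s = rng.choice(nS, p=P[s, a])

        # reward-to-go R_hat_t, then grad = sum_t Phi_t * grad_theta log pi(a_t|s_t)
        grad, R_hat = np.zeros_like(theta), 0.0
        for s, a, rew in reversed(traj):
            R_hat = rew + gamma * R_hat
            Phi = R_hat - V[s]                    # baseline-subtracted weight
            glogp = -pi(s); glogp[a] += 1.0       # grad_theta[s] log pi(a|s) for softmax
            grad[s] += Phi * glogp
            V[s] += 0.1 * (R_hat - V[s])          # crude baseline fit
        theta += lr * grad                        # gradient ascent on J(pi_theta)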

Soft Actor-Critic

  • rewrite: strip the \(t=0\) entropy term from \(Q^\pi\)

\[\hat{Q}^\pi(s,a) = Q^\pi(s,a) - \underbrace{ \gamma^0 \alpha H(\pi(\cdot|s)) }_{\text{entropy when }t=0}\]
  • Bellman eqn: because \(Q^\pi\) contains the \(t=0\) entropy term, the entropy-regularized Bellman equation carries an extra \(\alpha H(\pi(\cdot|s))\), which cancels once everything is rewritten in terms of \(\hat{Q}^\pi\)

\[\begin{split}\text{Bellman eqn} \quad Q^\pi(s,a) = \underset{s'\sim P}{E}\left[ r(s,a) + \alpha H(\pi(\cdot|s)) + \gamma \; \underbrace{ \underset{a'\sim\pi}{E}[Q^\pi(s',a') ] }_{V^\pi(s')} \right] \\ = \underset{\substack{s'\sim P \\ a'\sim\pi}}{E}\left[ r(s,a) + \alpha H(\pi(\cdot|s)) + \gamma \; Q^\pi(s',a') \right] \\ = \underset{\substack{s'\sim P \\ a'\sim\pi}}{E}\left[ r(s,a) + \alpha H(\pi(\cdot|s)) + \gamma \left( \hat{Q}^\pi(s',a')+ \alpha H(\pi(\cdot|s')) \right) \right] \\ \Longrightarrow \quad \hat{Q}^\pi(s,a) = \underset{\substack{s'\sim P \\ a'\sim\pi}}{E}\left[ r(s,a) + \gamma \left( \hat{Q}^\pi(s',a')+ \boxed{ \alpha H(\pi(\cdot|s')) } \right) \right]\end{split}\]
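
Viewed as a self-consistency condition, the last equation can be iterated to a fixed point. A minimal sketch, assuming a small random tabular MDP and a fixed stochastic policy \(\pi\) (all names illustrative):

    # Iterate the entropy-regularized backup for Q_hat under a fixed policy pi.
    import numpy as np

    rng = np.random.default_rng(3)
    nS, nA = 4, 2
    gamma, alpha = 0.9, 0.2

    P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = distribution over s'
    r = rng.normal(size=(nS, nA))                   # r(s, a)
    pi = rng.dirichlet(np.ones(nA), size=nS)        # fixed stochastic policy pi(a|s)
    H = -(pi * np.log(pi)).sum(axis=1)              # entropy H(pi(.|s)) per state

    Q_hat = np.zeros((nS, nA))
    for _ in range(500):
        # soft value: V(s') = E_{a'~pi}[Q_hat(s', a')] + alpha * H(pi(.|s'))
        V = (pi * Q_hat).sum(axis=1) + alpha * H
        # backup: Q_hat(s, a) = r(s, a) + gamma * E_{s'~P}[V(s')]
        Q_hat = r + gamma * P @ V

    print(Q_hat)   # fixed point of the soft Bellman backup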

Policy Optimization vs Q-Learning

  • Policy Optimization:

    • learns the policy \(\pi_\theta(a|s)\) directly with an NN \(\theta\); it also learns \(V^\pi_\phi(s)\) with an NN \(\phi\) in order to update the policy

    • updates by gradient ascent on \(J(\pi_\theta)\), or by maximizing a local approximation of it

    • on-policy: each update only uses data collected with the latest policy

    • stable and reliable, but sample inefficient

  • Q-Learning:

    • learns \(Q^*_\theta(s,a)\) with NN \(\theta\), which indirectly gives the policy \(a^*(s) = \arg\max\limits_a Q^*_\theta(s,a)\)

    • uses a cost function based on the Bellman equation (the mean-squared Bellman error above)

    • off-policy: each update can reuse data collected at any point during training

    • less stable, but sample efficient

  • hybrid

    • Deep Deterministic Policy Gradient

      • off-policy

      • a continuous-action-space version of Q-learning

      • differs from Q-learning in how the \(\arg\max\) over actions is obtained: a learned deterministic policy is trained by gradient ascent on \(Q_\phi(s,\pi_\theta(s))\) instead of an explicit max (a toy sketch follows the DDPG notation note in the list below)

  • use an NN to approximate the value function

\[\begin{split}k: \text{ epoch index} \\ \underbrace{ \phi_k }_{\text{NN to learn }V} = \arg\min_\phi \underset{s_t,\hat{R}_t\sim \pi_k }{E}\left[ \left( V_\phi(s_t) - \hat{R}_t \right)^2 \right]\end{split}\]
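
A sketch of this value-function regression, assuming a tabular/linear \(V_\phi\) standing in for the NN and placeholder \((s_t,\hat{R}_t)\) pairs; in practice the pairs would come from rollouts of \(\pi_k\) (for example the rollout code earlier in these notes):

    # Fit V_phi by regressing on reward-to-go targets (sketch with placeholder data).
    import numpy as np

    rng = np.random.default_rng(4)
    nS, lr = 5, 0.1

    phi = np.zeros(nS)                       # parameters of V_phi (tabular/linear)

    # placeholder (s_t, R_hat_t) pairs; real targets come from on-policy rollouts
    data = [(rng.integers(nS), rng.normal()) for _ in range(1000)]

    for _ in range(100):                     # gradient descent on the squared error
        for s, R_hat in data:
            phi[s] -= lr * (phi[s] - R_hat)  # d/dphi[s] of (V_phi(s) - R_hat)^2 / 2

    print("V_phi:", phi)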

Algorithms

Vanilla Policy Gradient

2015 Trust Region Policy Optimization (not in comparisons)

  • John Schulman (OpenAI, 16524, 31)

  • Sergey Levine (UC Berkeley Assist Prof, 27200, 76)

  • Philipp Moritz

  • Michael I. Jordan (UC Berkeley Prof, 91071, 118)

  • Pieter Abbeel (UC Berkeley Prof, 37231, 91)

2017 Proximal Policy Optimization

  • John Schulman

  • Filip Wolski (OpenAI, 3220, 6)

  • Prafulla Dhariwal (OpenAI, 4443, 8)

  • Alec Radford (OpenAI, 16501, 14)

  • Oleg Klimov (OpenAI, 3481, 8)

2016 Deep Deterministic Policy Gradient

  • Timothy P. Lillicrap (DeepMind, 29763, 43)

  • Jonathan J. Hunt (DeepMind, 4458, 10)

  • Alexander Pritzel (DeepMind, 6236, 14)

  • Nicolas Heess (DeepMind, 13047, 35)

  • Tom Erez (DeepMind, 7435, 21)

  • Yuval Tassa (DeepMind, 8087, 23)

  • David Silver (DeepMind, 49958, 58)

  • Daan Wierstra (DeepMind, 31959, 40)

notation: in the following, the deterministic policy written \(\mu_\theta\) in DDPG/TD3 plays the role of \(\pi_\theta\), i.e. \(\mu_\theta = \pi_\theta\)
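
A toy sketch of why \(\mu_\theta\) replaces the explicit \(\arg\max\) over a continuous action: here \(Q\) is a known quadratic (an assumption purely for illustration), and gradient ascent on \(Q(s,\mu_\theta(s))\) drives \(\mu_\theta(s)\) toward the maximizing action.

    # Continuous-action argmax via gradient ascent on a deterministic policy (sketch).
    import numpy as np

    rng = np.random.default_rng(5)
    nS, lr = 4, 0.1
    a_star = rng.normal(size=nS)            # per-state maximizer of Q

    def Q(s, a):                            # Q(s, a) = -(a - a_star(s))^2
        return -(a - a_star[s])**2

    def dQ_da(s, a):                        # derivative of Q w.r.t. the action
        return -2.0 * (a - a_star[s])

    theta = np.zeros(nS)                    # mu_theta(s) = theta[s]

    for _ in range(200):                    # ascend E_s[Q(s, mu_theta(s))]
        for s in range(nS):
            theta[s] += lr * dQ_da(s, theta[s])   # chain rule, d mu / d theta[s] = 1

    print(np.allclose(theta, a_star, atol=1e-3))  # mu_theta(s) ~= argmax_a Q(s, a)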

2018 Twin Delayed DDPG

  • Scott Fujimoto (McGill University, 581, 7)

  • Herke van Hoof (University of Amsterdam, 1185, 15)

  • David Meger (McGill University, 1620, 14)

2018 Soft Actor-Critic

  • Tuomas Haarnoja (DeepMind, 1599, 7)

  • Aurick Zhou (UC Berkeley, 1093, 5)

  • Pieter Abbeel

  • Sergey Levine