Deep Reinforcement Learning - II


CSE 891: Deep Learning

Vishnu Boddeti

Monday December 06, 2021

Overview

  • Previous Lecture: Policy Gradient (REINFORCE)
    • Optimize a policy directly, without modeling anything about the environment
  • Today's Lecture: $Q$-Learning
    • Learn an action-value function that predicts future returns

Problem Setting

  • Agent interacts with an environment. Environment is treated as a black box.
  • One can only access the environment through an API since it is external to the agent.
    • we are not "allowed" to inspect the internal transition probabilities, reward distributions, etc.

Recap: Markov Decision Processes

  • The environment is represented as a Markov Decision Process $\mathcal{M}$.
  • Markov assumption: all relevant information is encapsulated in the current state; i.e. the policy, reward, and transitions are all independent of past states given the current state.
  • Components of an MDP:
    • initial state distribution $p(\mathbf{s}_0)$
    • policy $\pi_{\mathbf{\theta}}(\mathbf{a}_t|\mathbf{s}_t)$
    • transition distribution $p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$
    • reward function $r(\mathbf{s}_t, \mathbf{a}_t)$
  • Assume a fully observable environment, i.e. $\mathbf{s}_t$ can be observed directly

Rollout Horizon

  • Previous class: finite horizon MDPs
    • Fixed number of time steps $T$ per episode, maximize expected return $R=\mathbb{E}_{p(\tau)}[r(\tau)]$
  • Today: consider infinite horizon
    • Practically, the undiscounted sum of rewards over an infinite horizon may diverge, so we discount future rewards to keep the return finite.
      • One billion today is worth more than one billion 10 years from now.
    • Discounted Returns: \begin{equation} R_t = r_t + \gamma r_{t+1} + \gamma ^2 r_{t+2} + \dots \nonumber \end{equation}
    • $0 \leq \gamma < 1$ is the discounting factor.
      • small $\gamma$: myopic, large $\gamma$: far-sighted
    • Choose an action that maximizes the expected discounted return (a small numerical sketch follows this list).
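
To make the discounted return above concrete, here is a minimal Python sketch (the reward sequence and the value of $\gamma$ are made up for illustration) that accumulates $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$ backwards over a finite rollout:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite rollout."""
    # Work backwards: R_t = r_t + gamma * R_{t+1}
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

rewards = [1.0, 0.0, 0.0, 5.0]                # hypothetical rewards from one rollout
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9**3 * 5 = 4.645
```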

Optimal Quantities


  • The value (utility) of a state $s$:
    • $V^*(s) =$ expected utility starting in $s$ and acting optimally

  • The value (utility) of a $q$-state $(s,a)$:
    • $Q^*(s,a) = $ expected utility starting out having taken action $a$ from state $s$ and (thereafter) acting optimally

  • The optimal policy:
    • $\pi^*(s) = $ optimal action from state $s$

Value Function

  • Value Function $V^{\pi}(\mathbf{s})$: expected discounted return starting at state $\mathbf{s}$ and following policy $\pi$. \begin{equation} \begin{aligned} V^{\pi}(\mathbf{s}) &= \mathbb{E}[R_t | \mathbf{s}_t=\mathbf{s}] \\ &= \mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^i r_{t+i} | \mathbf{s}_t=\mathbf{s} \right] \nonumber \end{aligned} \end{equation}
  • Computing the value function is generally impractical, but we can try to approximate (learn) it.
  • The benefit of the value function is credit assignment: we can see directly how an action affects future returns, rather than waiting for full rollouts.

Value to Action

  • Can we use the value function to choose actions? \begin{equation} \underset{\mathbf{a}}{\operatorname{arg max}} \mbox{ } \left(r(\mathbf{s}_t,\mathbf{a}) + \gamma \mathbb{E}_{p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a})}[V^{\pi}(\mathbf{s}_{t+1})]\right) \end{equation}
  • Problem: this requires taking the expectation with respect to the environment’s dynamics, which we don’t have direct access to.

Action-Value Function

  • Instead, learn an action-value function, or $Q$-function: the expected discounted return if you take action $\mathbf{a}$ in state $\mathbf{s}$ and then follow policy $\pi$ \begin{equation} Q^{\pi}(\mathbf{s},\mathbf{a}) = \mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^i r_{t+i}|\mathbf{s}_t=\mathbf{s}, \mathbf{a}_t=\mathbf{a}\right] \nonumber \end{equation}
  • Relationship: \begin{equation} V^{\pi}(\mathbf{s}) = \sum_{\mathbf{a}} \pi(\mathbf{a}|\mathbf{s})Q^{\pi}(\mathbf{s},\mathbf{a}) \nonumber \end{equation}
  • Optimal Action (both relationships are illustrated numerically in the sketch below): \begin{equation} \underset{\mathbf{a}}{\operatorname{arg max}} \mbox{ } Q^{\pi}(\mathbf{s},\mathbf{a}) \nonumber \end{equation}
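
To make the two relationships above concrete, here is a tiny NumPy sketch with made-up numbers: a row of $Q^{\pi}(\mathbf{s},\cdot)$ for one state, a policy distribution over three actions, the implied $V^{\pi}(\mathbf{s})$, and the greedy action:

```python
import numpy as np

q_values = np.array([1.0, 3.0, 2.0])  # hypothetical Q^pi(s, a) for 3 actions in state s
pi = np.array([0.2, 0.5, 0.3])        # pi(a | s), sums to 1

v = np.dot(pi, q_values)              # V^pi(s) = sum_a pi(a|s) Q^pi(s, a) = 2.3
greedy_action = np.argmax(q_values)   # arg max_a Q^pi(s, a) -> action 1

print(v, greedy_action)
```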

Bellman Equation

  • The Bellman Equation is a recursive formula for the action-value function: \begin{equation} Q^{\pi}(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{p(\mathbf{s}'|\mathbf{s},\mathbf{a})\pi(\mathbf{a}'|\mathbf{s}')}[Q^{\pi}(\mathbf{s}',\mathbf{a}')] \nonumber \end{equation}
  • There are various Bellman equations, and most RL algorithms are based on repeatedly applying one of them.

Optimal Bellman Equation

  • The optimal policy $\pi^*$ is the one that maximizes the expected discounted return, and the optimal action-value function $Q^*$ is the action-value function for $\pi^*$.
  • The Optimal Bellman Equation gives a recursive formula for $Q^*$: \begin{equation} Q^*(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{p(\mathbf{s}'|\mathbf{s},\mathbf{a})}\left[\max_{\mathbf{a}'}Q^*(\mathbf{s}',\mathbf{a}')\right] \nonumber \end{equation}
  • This system of equations characterizes the optimal action-value function. So maybe we can approximate $Q^*$ by trying to solve the optimal Bellman equation.

$Q$-Learning

  • Let $Q$ be an action-value function which hopefully approximates $Q^*$.
  • The Bellman error is the update to our expected return when we observe the next state $\mathbf{s}_{t+1}$: \begin{equation} e = r(\mathbf{s}_t,\mathbf{a}_t) + \gamma \max_{\mathbf{a}} Q(\mathbf{s}_{t+1}, \mathbf{a}) - Q(\mathbf{s}_{t}, \mathbf{a}_t) \nonumber \end{equation}
  • The Bellman equation says the Bellman error is 0 in expectation
  • $Q$-Learning is an algorithm that repeatedly adjusts Q to minimize the Bellman error
  • Each time we sample consecutive states and actions $(\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})$, update (a minimal tabular implementation is sketched below): \begin{equation} Q(\mathbf{s}_t, \mathbf{a}_t) \leftarrow Q(\mathbf{s}_t, \mathbf{a}_t) + \alpha \left[r(\mathbf{s}_t, \mathbf{a}_t) + \gamma\max_{\mathbf{a}}Q(\mathbf{s}_{t+1},\mathbf{a})-Q(\mathbf{s}_t, \mathbf{a}_t)\right] \nonumber \end{equation}
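
A minimal tabular sketch of the update rule above, assuming a toy discrete environment whose sizes (`n_states`, `n_actions`) and hyperparameters are made up for illustration:

```python
import numpy as np

n_states, n_actions = 10, 4          # assumed sizes of a toy discrete MDP
Q = np.zeros((n_states, n_actions))  # tabular action-value function
alpha, gamma = 0.1, 0.99             # learning rate and discount factor

def q_learning_update(s, a, r, s_next):
    """One Q-learning step on the observed transition (s, a, r, s_next)."""
    bellman_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * bellman_error
```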

Exploration-Exploitation Tradeoff

  • Note: $Q$-Learning only learns about the states and actions it visits.
  • Exploration-exploitation Tradeoff: the agent should sometimes pick suboptimal actions in order to visit new states and actions.
  • Simple solution: $\epsilon$-greedy policy (a short code sketch follows this list)
    • With probability $1-\epsilon$, choose the optimal action according to $Q$.
    • With probability $\epsilon$, choose a random action
  • Believe it or not, $\epsilon$-greedy is still used today.
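
A one-function sketch of the $\epsilon$-greedy rule, reusing the tabular `Q` array from the previous sketch:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)  # explore
    return int(np.argmax(Q[s]))              # exploit
```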

Exploration-Exploitation Tradeoff

  • You cannot use an $\epsilon$-greedy strategy with policy gradient because it is an on-policy algorithm: the agent can only learn about the policy it is actually following.
  • $Q$-learning is an off-policy algorithm: the agent can learn $Q$ regardless of whether it is actually following the optimal policy.
  • Hence, $Q$-learning is typically done with an $\epsilon$-greedy policy, or some other policy that encourages exploration.

Function Approximation

  • So far, we assumed a tabular representation of $Q$: one entry for every state/action pair.
  • This is impractical to store for all but the simplest problems, and does not share structure between related states.
  • Solution: approximate $Q$ using a parametric function e.g.,
    • linear function approximation: $Q(\mathbf{s},\mathbf{a})=\mathbf{w}^T\psi(\mathbf{s},\mathbf{a})$
    • compute $Q$ with a neural network
  • Update $Q_{\mathbf{\theta}}$ using backpropagation (a PyTorch-style sketch follows this list): \begin{equation} \begin{aligned} d &\leftarrow r(\mathbf{s}_t, \mathbf{a}_t) + \gamma \max_{\mathbf{a}}Q_{\mathbf{\theta}}(\mathbf{s}_{t+1}, \mathbf{a}) \\ \mathbf{\theta} &\leftarrow \mathbf{\theta} + \alpha\left(d-Q_{\mathbf{\theta}}(\mathbf{s}_t,\mathbf{a}_t)\right)\frac{\partial Q_{\mathbf{\theta}}(\mathbf{s}_t,\mathbf{a}_t)}{\partial \mathbf{\theta}} \nonumber \end{aligned} \end{equation}
  • Problem: Nonstationary! The "target" for $Q_{\mathbf{\theta}}(s, a)$ depends on the current weights $\mathbf{\theta}$!
  • Problem: How to sample batches of data for training?
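
A hedged PyTorch-style sketch of the update above; the network architecture and the environment dimensions are assumptions for illustration. The target $d$ is computed under `no_grad` so that, as in the update rule, no gradient flows through it:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2                      # assumed environment dimensions
q_net = nn.Sequential(nn.Linear(state_dim, 64),  # small illustrative Q-network
                      nn.ReLU(),
                      nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

def q_update(s, a, r, s_next):
    """One semi-gradient Q-learning step; s and s_next are 1-D float tensors."""
    with torch.no_grad():                        # target d is treated as a constant
        d = r + gamma * q_net(s_next).max()
    loss = (d - q_net(s)[a]) ** 2                # squared Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```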

Function Approximation

  • Approximating $Q$ with a neural net is a decades-old idea, but DeepMind got it to work really well on Atari games in 2013 ("deep Q-learning")
  • They used a very small network by today's standards
  • Technical Innovation: store experience in a replay buffer, and perform $Q$-learning using stored experience (a minimal buffer sketch follows this list).
    • Gains sample efficiency by separating environment interaction from optimization — don't need new experience for every SGD update.
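
A minimal sketch of an experience replay buffer (the capacity and field layout are assumptions): transitions are stored as they are collected, and training samples random minibatches from the buffer instead of using only the most recent step:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```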

Atari

  • Mnih et al., Nature 2015. Human-level control through deep reinforcement learning
  • Network was given raw pixels as observations
  • Same architecture shared between all games
  • Assume fully observable environment, even though that's not the case
  • After about a day of training on a particular game, often beat "human-level" performance (number of points within 5 minutes of play)
    • Did very well on reactive games, poorly on ones that require planning (e.g. Montezuma's Revenge)

Deep Q-Networks


Faulty Reward Functions

  • If rats have a lever that causes an electrode to stimulate certain "reward centers" in their brain, they'll keep pressing the lever at the expense of sleep, food, etc.
  • RL algorithms show this behavior if the reward function is not designed carefully.

Policy Gradient vs $Q$-Learning

  • Policy gradient and $Q$-learning use two very different choices of representation: policies and value functions.
  • Advantage of both methods: do not need to model the environment

Policy Gradient

  • Pros/cons of policy gradient:
    • Pro: unbiased estimate of gradient of expected return
    • Pro: can handle a large space of actions (since you only need to sample one)
    • Con: high variance updates (implies poor sample efficiency)
    • Con: does not do credit assignment

$Q$-Learning

  • Pros/cons of $Q$-Learning:
    • Pro: lower variance updates, more sample efficient
    • Pro: does credit assignment
    • Con: biased updates since $Q$ function is approximate
    • Con: hard to handle many actions (since you need to take the max)

So Far: $Q$-Learning and Policy Gradients

  • Q-Learning: Train network $Q_{\mathbf{\theta}}(\mathbf{s},\mathbf{a})$ to estimate future rewards for every (state, action) pair. Use Bellman Equation to define loss function for training $Q_{\mathbf{\theta}}$: \begin{equation} d_{\theta}(\mathbf{s},\mathbf{a}) = \mathbb{E}\left[r + \gamma \max_{\mathbf{a}'}Q_{\mathbf{\theta}}(\mathbf{s}',\mathbf{a}')\right] \mbox{ and } L(\mathbf{s},\mathbf{a}) = (d_{\theta}(\mathbf{s},\mathbf{a}) - Q_{\mathbf{\theta}}(\mathbf{s},\mathbf{a}))^2 \end{equation}
  • Policy Gradients: Train a network $\pi_{\mathbf{\theta}}(\mathbf{a}|\mathbf{s})$ that takes state as input, gives distribution over which action to take in that state. Use REINFORCE Rule for computing gradients: \begin{equation} R = \mathbb{E}_{p_{\theta}(\tau)}[r(\tau)] \mbox{ and } \frac{\partial R}{\partial \theta} = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[r(\tau)\sum_{t=0}^T\frac{\partial}{\partial \theta}\log \pi_{\theta}(\mathbf{a}_t|\mathbf{s}_t)\right] \end{equation}
  • Sutton and Barto, “Reinforcement Learning: An Introduction”, 1998; Degris et al, “Model-free reinforcement learning with continuous action in practice”, 2012; Mnih et al, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016

Other Approaches: Model Based RL

  • Model-Based: Learn a model of the world's state transition function $p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$ and then use planning through the model to make decisions
  • Imitation Learning: Gather data about how experts perform in the environment, learn a function to imitate what they do (supervised learning approach)
  • Inverse Reinforcement Learning: Gather data of experts performing in environment; learn a reward function that they seem to be optimizing, then use RL on that reward function
    • Ng et al, “Algorithms for Inverse Reinforcement Learning”, ICML 2000
  • Adversarial Learning: Learn to fool a discriminator that classifies actions as real/fake
    • Ho and Ermon, “Generative Adversarial Imitation Learning”, NeurIPS 2016

Actor-Critic Methods

  • Actor-Critic methods combine the best of both worlds:
    • Fit both a policy network (the "actor") and a value network (the "critic").
    • Repeatedly update the value network to estimate $V^{\pi}$
    • Unroll for only a few steps, then compute the REINFORCE policy update using the expected returns estimated by the value network (see the sketch after this list)
    • The two networks adapt to each other, much like GAN training
    • Modern version: Asynchronous Advantage Actor-Critic (A3C)
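
A hedged sketch of one actor-critic update under stated assumptions: the actor (`policy_net`, outputting action logits) and critic (`value_net`, outputting a scalar value) are hypothetical networks supplied by the caller, a short rollout has already been collected, and the critic's value of the last state bootstraps the remaining return. This is a simplification of the idea above, not the full A3C algorithm:

```python
import torch

def actor_critic_update(policy_net, value_net, states, actions, rewards,
                        last_state, actor_opt, critic_opt, gamma=0.99):
    """One update from a short rollout (lists of state tensors, ints, floats)."""
    # Bootstrapped discounted returns: R_t = r_t + gamma * R_{t+1},
    # with the return after the last step estimated by the critic.
    with torch.no_grad():
        R = value_net(last_state).squeeze()
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns = torch.stack(list(reversed(returns)))

    values = torch.stack([value_net(s).squeeze() for s in states])
    advantages = returns - values

    # Critic: regress V(s) toward the bootstrapped returns.
    critic_loss = advantages.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: REINFORCE, using the critic's estimate as a baseline (advantage).
    log_probs = torch.stack([
        torch.distributions.Categorical(logits=policy_net(s)).log_prob(torch.tensor(a))
        for s, a in zip(states, actions)
    ])
    actor_loss = -(log_probs * advantages.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```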

Playing Games

  • AlphaGo: (January 2016)
    • Used imitation learning + tree search + RL
    • Beat 18-time world champion Lee Sedol
  • AlphaGo Zero (October 2017)
    • Simplified version of AlphaGo
    • No longer using imitation learning
    • Beat (at the time) #1 ranked Ke Jie
  • Alpha Zero (December 2018)
    • Generalized to other games: Chess and Shogi
  • MuZero (November 2019)
    • Plans through a learned model of the game
  • Silver et al, “Mastering the game of Go with deep neural networks and tree search”, Nature 2016
  • Silver et al, “Mastering the game of Go without human knowledge”, Nature 2017
  • Silver et al, “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play”, Science 2018
  • Schrittwieser et al, “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, arXiv 2019

More Complex Games

  • StarCraft II: AlphaStar (October 2019). Vinyals et al, "Grandmaster level in StarCraft II using multi-agent reinforcement learning", Nature 2019
  • Dota 2: OpenAI Five (April 2019) No paper, only a blog post: OpenAI Dota 2

Real-World Applications of Deep RL