Deep Reinforcement Learning - II


CSE 891: Deep Learning

Vishnu Boddeti

Wednesday December 07, 2020

Overview

  • Previous Lecture: Policy Gradient (REINFORCE)
    • Optimize a policy directly, without modeling anything about the environment
  • Today's Lecture: $Q$-Learning
    • Learn an action-value function that predicts future returns

Problem Setting

  • Agent interacts with an environment. Environment is treated as a black box.
  • One can only access the environment through an API since it is external to the agent.
    • we are not "allowed" to inspect the internal transition probabilities, reward distributions, etc.

Recap: Markov Decision Processes

  • The environment is represented as a Markov Decision Process $\mathcal{M}$.
  • Markov assumption: all relevant information is encapsulated in the current state; i.e. the policy, reward, and transitions are all independent of past states given the current state.
  • Components of an MDP:
    • initial state distribution $p(\mathbf{s}_0)$
    • policy $\pi_{\mathbf{\theta}}(\mathbf{a}_t|\mathbf{s}_t)$
    • transition distribution $p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$
    • reward function $r(\mathbf{s}_t, \mathbf{a}_t)$
  • Assume a fully observable environment, i.e. $\mathbf{s}_t$ can be observed directly

Rollout Horizon

  • Previous class: finite horizon MDPs
    • Fixed number of time steps $T$ per episode, maximize expected return $R=\mathbb{E}_{p(\tau)}[r(\tau)]$
  • Today: consider infinite horizon
    • Practically, we cannot sum up the rewards for an infinite horizon, so we need to discount future rewards.
      • One billion today is worth more than one billion 10 years from now.
    • Discounted Returns: \begin{equation} R_t = r_t + \gamma r_{t+1} + \gamma ^2 r_{t+2} + \dots \nonumber \end{equation}
    • $0 \leq \gamma < 1$ is the discounting factor.
      • small $\gamma$: myopic, large $\gamma$: far-sighted
    • Choose an action that maximizes the expected discounted return (a small numerical sketch follows this list).
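
As a concrete illustration, here is a minimal sketch of how a discounted return is accumulated; the reward sequence and the choice $\gamma=0.9$ are hypothetical, used only for the arithmetic.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Hypothetical reward sequence: -1 per step, +10 upon reaching a goal.
print(discounted_return([-1, -1, -1, 10], gamma=0.9))  # -1 - 0.9 - 0.81 + 7.29 = 4.58
```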

Optimal Quantities


  • The value (utility) of a state $s$:
    • $V^*(s) =$ expected utility starting in $s$ and acting optimally

  • The value (utility) of a $q$-state $(s,a)$:
    • $Q^*(s,a) =$ expected utility of taking action $a$ in state $s$ and (thereafter) acting optimally

  • The optimal policy:
    • $\pi^*(s) = $ optimal action from state $s$

Value Function

  • Value Function $V^{\pi}(\mathbf{s})$: expected discounted return starting at state $\mathbf{s}$ and following policy $\pi$. \begin{equation} \begin{aligned} V^{\pi}(\mathbf{s}) &= \mathbb{E}[R_t | \mathbf{s}_t=\mathbf{s}] \\ &= \mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^i r_{t+i} | \mathbf{s}_t=\mathbf{s} \right] \nonumber \end{aligned} \end{equation}
  • Computing the value function exactly is generally impractical, but we can try to approximate (learn) it (see the Monte Carlo sketch below).
  • The benefit of a value function is credit assignment: we can see directly how an action affects future returns, rather than waiting for full rollouts.
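
A sketch of the simplest such approximation: estimate $V^{\pi}(\mathbf{s})$ by averaging discounted returns over sampled rollouts (Monte Carlo estimation). Here `env` and `policy` are hypothetical stand-ins for a black-box environment with a gym-style `reset`/`step` API and a stochastic policy; the truncation horizon is likewise an assumption.

```python
import numpy as np

def mc_value_estimate(env, policy, state, gamma=0.99, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of V^pi(s): average the discounted returns of rollouts from s."""
    returns = []
    for _ in range(n_rollouts):
        s = env.reset(state)                 # hypothetical: reset the environment to state s
        total, discount = 0.0, 1.0
        for _ in range(horizon):             # truncate the infinite horizon
            a = policy(s)                    # sample a ~ pi(.|s)
            s, r, done = env.step(a)         # the environment remains a black box
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```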

Value Function

  • Rewards: -1 per time step
  • Undiscounted ($\gamma=1$)
  • Actions: N, E, W, S
  • State: current location

Value Function

Value to Action

  • Can we use the value function to choose actions? \begin{equation} \underset{\mathbf{a}}{\operatorname{arg max}} \mbox{ } \left(r(\mathbf{s}_t,\mathbf{a}) + \gamma \mathbb{E}_{p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a})}[V^{\pi}(\mathbf{s}_{t+1})]\right) \nonumber \end{equation}
  • Problem: this requires taking the expectation with respect to the environment's dynamics, which we do not have direct access to (see the sketch below).
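
To see the problem concretely, here is what acting greedily with respect to $V^{\pi}$ would require in the tabular case: an explicit transition model `P[s, a, s']`, which the black-box setting does not provide. The arrays `V`, `R`, and `P` are hypothetical tabular quantities used only for illustration.

```python
import numpy as np

def greedy_action_from_V(V, R, P, s, gamma=0.99):
    """argmax_a [ r(s,a) + gamma * E_{s' ~ p(.|s,a)} V(s') ].

    P[s, a, s'] is the transition model -- exactly the quantity we are not given.
    """
    expected_next_value = P[s] @ V          # shape: (n_actions,)
    return int(np.argmax(R[s] + gamma * expected_next_value))
```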

Action-Value Function

  • Instead, learn an action-value function, or Q-function: the expected discounted return if you take action $\mathbf{a}$ in state $\mathbf{s}$ and then follow your policy \begin{equation} Q^{\pi}(\mathbf{s},\mathbf{a}) = \mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^i r_{t+i}|\mathbf{s}_t=\mathbf{s}, \mathbf{a}_t=\mathbf{a}\right] \nonumber \end{equation}
  • Relationship: \begin{equation} V^{\pi}(\mathbf{s}) = \sum_{\mathbf{a}} \pi(\mathbf{a}|\mathbf{s})Q^{\pi}(\mathbf{s},\mathbf{a}) \nonumber \end{equation}
  • Optimal action (see the sketch below): \begin{equation} \underset{\mathbf{a}}{\operatorname{arg max}} \mbox{ } Q^{\pi}(\mathbf{s},\mathbf{a}) \nonumber \end{equation}
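
A small tabular sketch of the two identities above, assuming hypothetical arrays `Q[s, a]` (action values) and `pi[s, a]` (policy probabilities):

```python
import numpy as np

def value_from_q(Q, pi, s):
    """V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)."""
    return float(np.dot(pi[s], Q[s]))

def greedy_action(Q, s):
    """argmax_a Q(s, a): no transition model or dynamics expectation needed."""
    return int(np.argmax(Q[s]))
```

The second function is the key practical difference from the previous slide: choosing an action from $Q^{\pi}$ requires no expectation over the environment's dynamics.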

Bellman Equation

  • The Bellman Equation is a recursive formula for the action-value function: \begin{equation} Q^{\pi}(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{p(\mathbf{s}'|\mathbf{s},\mathbf{a})\pi(\mathbf{a}'|\mathbf{s}')}[Q^{\pi}(\mathbf{s}',\mathbf{a}')] \nonumber \end{equation}
  • There are various Bellman equations, and most RL algorithms are based on repeatedly applying one of them.

Optimal Bellman Equation

  • The optimal policy $\pi^*$ is the one that maximizes the expected discounted return, and the optimal action-value function $Q^*$ is the action-value function for $\pi^*$.
  • The Optimal Bellman Equation gives a recursive formula for $Q^*$: \begin{equation} Q^*(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{p(\mathbf{s}'|\mathbf{s},\mathbf{a})}\left[\max_{\mathbf{a}'}Q^*(\mathbf{s}',\mathbf{a}')\right] \nonumber \end{equation}
  • This system of equations characterizes the optimal action-value function, so we can try to approximate $Q^*$ by (approximately) solving the optimal Bellman equation (see the value-iteration sketch below).
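
To make this concrete: if we hypothetically did know the tabular model (`P` for transitions, `R` for rewards), repeatedly applying the optimal Bellman backup converges to $Q^*$; this is value iteration. $Q$-learning, introduced next, performs essentially the same backup from sampled transitions instead of the model. A minimal sketch under those assumptions:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, n_iters=500):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                    # V*(s') = max_a' Q*(s', a')
        Q = R + gamma * (P @ V)              # backup: r(s,a) + gamma * E_{s'}[V*(s')]
    return Q
```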

$Q$-Learning

  • Let $Q$ be an action-value function which hopefully approximates $Q^*$.
  • The Bellman error is the update to our expected return when we observe the next state $\mathbf{s}_{t+1}$: \begin{equation} e = r(\mathbf{s}_t,\mathbf{a}_t) + \gamma \max_{\mathbf{a}} Q(\mathbf{s}_{t+1}, \mathbf{a}) - Q(\mathbf{s}_{t}, \mathbf{a}_t) \nonumber \end{equation}
  • The Bellman equation says the Bellman error is 0 in expectation
  • $Q$-Learning is an algorithm that repeatedly adjusts Q to minimize the Bellman error
  • Each time we sample a transition $(\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})$, update (see the sketch below): \begin{equation} Q(\mathbf{s}_t, \mathbf{a}_t) \leftarrow Q(\mathbf{s}_t, \mathbf{a}_t) + \alpha \left[r(\mathbf{s}_t, \mathbf{a}_t) + \gamma\max_{\mathbf{a}}Q(\mathbf{s}_{t+1},\mathbf{a})-Q(\mathbf{s}_t, \mathbf{a}_t)\right] \nonumber \end{equation}
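
A minimal sketch of this update for a tabular `Q` array, applied to a single observed transition (the step size and discount are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step from a single observed transition (s, a, r, s')."""
    td_target = r + gamma * np.max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target
    return Q
```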

Exploration-Exploitation Tradeoff

  • Note: $Q$-Learning only learns about the states and actions it visits.
  • Exploration-exploitation Tradeoff: the agent should sometimes pick suboptimal actions in order to visit new states and actions.
  • Simple solution: $\epsilon$-greedy policy
    • With probability $1-\epsilon$, choose the optimal action according to $Q$.
    • With probability $\epsilon$, choose an action uniformly at random (see the sketch below).
  • Believe it or not, $\epsilon$-greedy is still used today.
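
A sketch of the $\epsilon$-greedy rule for a tabular `Q` (the default $\epsilon$ is an arbitrary illustrative value):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """With probability 1 - epsilon act greedily w.r.t. Q; otherwise act uniformly at random."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # exploratory random action
    return int(np.argmax(Q[s]))                # greedy (exploiting) action
```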

Exploration-Exploitation Tradeoff

  • You cannot use an $\epsilon$-greedy strategy with policy gradient because it is an on-policy algorithm: the agent can only learn about the policy it is actually following.
  • $Q$-learning is an off-policy algorithm: the agent can learn $Q$ regardless of whether it is actually following the optimal policy.
  • Hence, $Q$-learning is typically done with an $\epsilon$-greedy policy, or some other policy that encourages exploration.

$Q$-Learning

Function Approximation

  • So far, we assumed a tabular representation of $Q$: one entry for every state/action pair.
  • This is impractical to store for all but the simplest problems, and does not share structure between related states.
  • Solution: approximate $Q$ using a parametric function, e.g.:
    • linear function approximation: $Q(\mathbf{s},\mathbf{a})=\mathbf{w}^T\psi(\mathbf{s},\mathbf{a})$
    • compute $Q$ with a neural network
  • Update $Q_{\mathbf{\theta}}$ using backpropagation (a linear-approximation sketch follows this list): \begin{equation} \begin{aligned} d &\leftarrow r(\mathbf{s}_t, \mathbf{a}_t) + \gamma \max_{\mathbf{a}}Q_{\mathbf{\theta}}(\mathbf{s}_{t+1}, \mathbf{a}) \\ \mathbf{\theta} &\leftarrow \mathbf{\theta} + \alpha(d-Q_{\mathbf{\theta}}(\mathbf{s}_t,\mathbf{a}_t))\frac{\partial Q_{\mathbf{\theta}}(\mathbf{s}_t,\mathbf{a}_t)}{\partial \mathbf{\theta}} \nonumber \end{aligned} \end{equation}
  • Problem: Nonstationary! The "target" $d$ for $Q_{\mathbf{\theta}}(\mathbf{s}, \mathbf{a})$ depends on the current weights $\mathbf{\theta}$!
  • Problem: How to sample batches of data for training?
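
A sketch of this update for the linear case $Q_{\mathbf{\theta}}(\mathbf{s},\mathbf{a})=\mathbf{w}^T\psi(\mathbf{s},\mathbf{a})$, where $\partial Q_{\mathbf{\theta}}/\partial \mathbf{w} = \psi(\mathbf{s},\mathbf{a})$. The feature map `psi` and the action set are hypothetical; note that the target `d` is treated as a constant (a "semi-gradient" step), which is exactly the nonstationarity flagged above.

```python
import numpy as np

def linear_q_update(w, psi, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    """Semi-gradient Q-learning step for Q_theta(s, a) = w . psi(s, a)."""
    q_next = max(np.dot(w, psi(s_next, a2)) for a2 in actions)
    d = r + gamma * q_next                     # target d, treated as a constant
    q_sa = np.dot(w, psi(s, a))
    return w + alpha * (d - q_sa) * psi(s, a)  # dQ_theta/dw = psi(s, a)
```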

Function Approximation

  • Approximating $Q$ with a neural net is a decades-old idea, but DeepMind got it to work really well on Atari games in 2013 ("deep Q-learning")
  • They used a very small network by today's standards
  • Technical innovation: store experience in a replay buffer, and perform Q-learning updates on transitions sampled from it (see the sketch below).
    • Gains sample efficiency by separating environment interaction from optimization: we don't need fresh experience for every SGD update.
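
A minimal replay-buffer sketch in the spirit of that idea (the class name, capacity, and batch size are illustrative, not DeepMind's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions; sample random minibatches for Q-learning updates."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest experience is evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```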

Atari

  • Mnih et al., Nature 2015. Human-level control through deep reinforcement learning
  • Network was given raw pixels as observations
  • Same architecture shared between all games
  • Assumes a fully observable environment, even though that's not strictly the case
  • After about a day of training on a particular game, often beat "human-level" performance (number of points within 5 minutes of play)
    • Did very well on reactive games, poorly on ones that require planning (e.g. Montezuma's Revenge)

Deep Q-Networks


Faulty Reward Functions

  • If rats have a lever that causes an electrode to stimulate certain "reward centers" in their brain, they'll keep pressing the lever at the expense of sleep, food, etc.
  • RL algorithms show this behavior if the reward function is not designed carefully.

Policy Gradient vs $Q$-Learning

  • Policy gradient and $Q$-learning use two very different choices of representation: policies and value functions.
  • Advantage of both methods: do not need to model the environment

Policy Gradient

  • Pros/cons of policy gradient:
    • Pro: unbiased estimate of gradient of expected return
    • Pro: can handle a large space of actions (since you only need to sample one)
    • Con: high variance updates (implies poor sample efficiency)
    • Con: does not do credit assignment

$Q$-Learning

  • Pros/cons of $Q$-Learning:
    • Pro: lower variance updates, more sample efficient
    • Pro: does credit assignment
    • Con: biased updates since $Q$ function is approximate
    • Con: hard to handle many actions (since you need to take the max)

So Far: $Q$-Learning and Policy Gradients

    • Q-Learning: Train a network $Q_{\mathbf{\theta}}(\mathbf{s},\mathbf{a})$ to estimate future rewards for every (state, action) pair. Use the Bellman equation to define a loss function for training $Q_{\mathbf{\theta}}$: \begin{equation} d_{\theta}(\mathbf{s},\mathbf{a}) = \mathbb{E}\left[r + \gamma \max_{\mathbf{a}'}Q_{\mathbf{\theta}}(\mathbf{s}',\mathbf{a}')\right] \mbox{ and } L(\mathbf{s},\mathbf{a}) = (d_{\theta}(\mathbf{s},\mathbf{a}) - Q_{\mathbf{\theta}}(\mathbf{s},\mathbf{a}))^2 \nonumber \end{equation}
    • Policy Gradients: Train a network $\pi_{\mathbf{\theta}}(\mathbf{a}|\mathbf{s})$ that takes the state as input and outputs a distribution over actions. Use the REINFORCE rule for computing gradients: \begin{equation} R = \mathbb{E}_{p_{\theta}(\tau)}[r(\tau)] \mbox{ and } \frac{\partial R}{\partial \theta} = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[r(\tau)\sum_{t=0}^T\frac{\partial}{\partial \theta}\log \pi_{\theta}(\mathbf{a}_t|\mathbf{s}_t)\right] \nonumber \end{equation}
  • Sutton and Barto, “Reinforcement Learning: An Introduction”, 1998; Degris et al, “Model-free reinforcement learning with continuous action in practice”, 2012; Mnih et al, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016

Other Approaches: Model Based RL

  • Model Based: Learn a model of the world's state transition function $p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$ and then use planning through the model to make decisions
  • Imitation Learning: Gather data about how experts perform in the environment, learn a function to imitate what they do (supervised learning approach)
  • Inverse Reinforcement Learning: Gather data of experts performing in environment; learn a reward function that they seem to be optimizing, then use RL on that reward function
    • Ng et al, “Algorithms for Inverse Reinforcement Learning”, ICML 2000
  • Adversarial Learning: Learn to fool a discriminator that classifies actions as real/fake
    • Ho and Ermon, “Generative Adversarial Imitation Learning”, NeurIPS 2016

Actor-Critic Methods

  • Actor-Critic methods combine the best of both worlds:
    • Fit both a policy network (the "actor") and a value network (the "critic").
    • Repeatedly update the value network to estimate $V^{\pi}$
    • Unroll for only a few steps, then compute the REINFORCE policy update using the returns estimated by the value network (see the sketch after this list)
    • The two networks adapt to each other, much like GAN training
    • Modern version: Asynchronous Advantage Actor-Critic (A3C)
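
A hedged sketch of one such update in PyTorch, assuming hypothetical `actor` and `critic` networks where the actor returns a `torch.distributions` object; the critic's value estimate serves as a baseline for the REINFORCE term:

```python
import torch

def actor_critic_losses(actor, critic, states, actions, returns):
    """One A2C-style update; `returns` are short n-step bootstrapped return estimates."""
    values = critic(states).squeeze(-1)            # critic's estimate of V^pi(s)
    advantages = (returns - values).detach()       # stop gradients flowing into the actor's target
    log_probs = actor(states).log_prob(actions)    # assumes the actor outputs a distribution over actions
    actor_loss = -(advantages * log_probs).mean()  # REINFORCE with a learned baseline
    critic_loss = (returns - values).pow(2).mean() # regress V toward the observed returns
    return actor_loss, critic_loss
```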

Playing Games

  • AlphaGo: (January 2016)
    • Used imitation learning + tree search + RL
    • Beat 18-time world champion Lee Sedol
  • AlphaGo Zero (October 2017)
    • Simplified version of AlphaGo
    • No longer using imitation learning
    • Beat (at the time) #1 ranked Ke Jie
  • Alpha Zero (December 2018)
    • Generalized to other games: Chess and Shogi
  • MuZero (November 2019)
    • Plans through a learned model of the game
  • Silver et al, “Mastering the game of Go with deep neural networks and tree search”, Nature 2016
  • Silver et al, “Mastering the game of Go without human knowledge”, Nature 2017
  • Silver et al, “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play”, Science 2018
  • Schrittwieser et al, “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, arXiv 2019

More Complex Games

  • StarCraft II: AlphaStar (October 2019). Vinyals et al, "Grandmaster level in StarCraft II using multi-agent reinforcement learning", Nature 2019
  • Dota 2: OpenAI Five (April 2019) No paper, only a blog post: OpenAI Dota 2

Real-World Applications of Deep RL