Optimize a policy directly, without modeling anything about the environment
Today's Lecture: $Q$-Learning
Learn an action-value function that predicts future returns
Problem Setting
Agent interacts with an environment. Environment is treated as a black box.
We can only access the environment through an API, since it is external to the agent:
we are not "allowed" to inspect the internal transition probabilities, reward distributions, etc.
Recap: Markov Decision Processes
The environment is represented as a Markov Decision Process $\mathcal{M}$.
Markov assumption: all relevant information is encapsulated in the current state; i.e. the policy, reward, and transitions are all independent of past states given the current state.
small $\gamma$: myopic, large $\gamma$: far-sighted
Choose an action that maximizes the expected discounted return.
Optimal Quantities
The value (utility) of a state $s$:
$V^*(s) =$ expected utility starting in $s$ and acting optimally
The value (utility) of a $q$-state $(s,a)$:
$Q^*(s,a) = $ expected utility of taking action $a$ in state $s$ and (thereafter) acting optimally
The optimal policy:
$\pi^*(s) = $ optimal action from state $s$
Value Function
Value Function $V^{\pi}(\mathbf{s})$: expected discounted return starting at state $\mathbf{s}$ and following policy $\pi$.
\begin{equation}
\begin{aligned}
V^{\pi}(\mathbf{s}) &= \mathbb{E}[R_t | \mathbf{s}_t=\mathbf{s}] \\
&= \mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^i r_{t+i} | \mathbf{s}_t=\mathbf{s} \right] \nonumber
\end{aligned}
\end{equation}
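As a concrete illustration, the discounted return inside this expectation can be computed for a single rollout as follows (a minimal sketch; the reward sequence and $\gamma$ are hypothetical):
\begin{verbatim}
def discounted_return(rewards, gamma=0.99):
    """Compute sum_i gamma^i * rewards[i] for one sampled rollout."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r   # add the discounted reward at this step
        discount *= gamma       # decay the discount for the next step
    return total

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62
\end{verbatim}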
Computing the value function is generally impractical, but we can try to approximate (learn) it.
The benefit of a value function is credit assignment: we can see directly how an action affects future returns, rather than waiting for full rollouts.
Value Function
Rewards: -1 per time step
Undiscounted ($\gamma=1$)
Actions: N, E, W, S
State: current location
Value to Action
Can we use the value function to choose actions?
\begin{equation}
\underset{\mathbf{a}_t}{\operatorname{arg max}} \mbox{ } \left(r(\mathbf{s}_t,\mathbf{a}_t) + \gamma \mathbb{E}_{p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)}[V^{\pi}(\mathbf{s}_{t+1})]\right)
\end{equation}
Problem: this requires taking the expectation with respect to the environment’s dynamics, which we don’t have direct access to.
Action-Value Function
Instead, learn an action-value function, or $Q$-function: the expected return if you take action $\mathbf{a}$ in state $\mathbf{s}$ and then follow your policy:
\begin{equation}
Q^{\pi}(\mathbf{s},\mathbf{a}) = \mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^i r_{t+i}|\mathbf{s}_t=\mathbf{s}, \mathbf{a}_t=\mathbf{a}\right] \nonumber
\end{equation}
The Bellman Equation is a recursive formula for the action-value function:
\begin{equation}
Q^{\pi}(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{p(\mathbf{s}'|\mathbf{s},\mathbf{a})\pi(\mathbf{a}'|\mathbf{s}')}[Q^{\pi}(\mathbf{s}',\mathbf{a}')] \nonumber
\end{equation}
There are various Bellman equations, and most RL algorithms are based on repeatedly applying one of them.
Optimal Bellman Equation
The optimal policy $\pi^*$ is the one that maximizes the expected discounted return, and the optimal action-value function $Q^*$ is the action-value function for $\pi^*$.
The Optimal Bellman Equation gives a recursive formula for $Q^*$:
\begin{equation}
Q^*(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \gamma\mathbb{E}_{p(\mathbf{s}'|\mathbf{s},\mathbf{a})}[\max_{\mathbf{a}'}Q^*(\mathbf{s}',\mathbf{a}')] \nonumber
\end{equation}
This system of equations characterizes the optimal action-value function. So maybe we can approximate $Q^*$ by trying to solve the optimal Bellman equation.
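The black-box setting rules out using the dynamics directly, but as an illustration of what "solving the optimal Bellman equation" means, here is a minimal sketch that repeatedly applies it to a small tabular MDP with known dynamics (the arrays P[s, a, s'] and R[s, a] are hypothetical inputs):
\begin{verbatim}
import numpy as np

def q_value_iteration(P, R, gamma=0.9, iters=500):
    """P: (S, A, S') transition probabilities, R: (S, A) rewards."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        # Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
        Q = R + gamma * P @ Q.max(axis=1)
    return Q
\end{verbatim}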
$Q$-Learning
Let $Q$ be an action-value function which hopefully approximates $Q^*$.
The Bellman error is the update to our estimate of the expected return when we observe the next state $\mathbf{s}_{t+1}$:
\begin{equation}
e = r(\mathbf{s}_t,\mathbf{a}_t) + \gamma \max_{\mathbf{a}} Q(\mathbf{s}_{t+1}, \mathbf{a}) - Q(\mathbf{s}_{t}, \mathbf{a}_t) \nonumber
\end{equation}
The Bellman equation says the Bellman error is 0 in expectation
$Q$-Learning is an algorithm that repeatedly adjusts Q to minimize the Bellman error
Each time we sample a transition $(\mathbf{s}_t,\mathbf{a}_t,\mathbf{s}_{t+1})$ of consecutive states and actions, apply the update:
\begin{equation}
Q(\mathbf{s}_t, \mathbf{a}_t) \leftarrow Q(\mathbf{s}_t, \mathbf{a}_t) + \alpha \left[r(\mathbf{s}_t, \mathbf{a}_t) + \gamma\max_{\mathbf{a}}Q(\mathbf{s}_{t+1},\mathbf{a})-Q(\mathbf{s}_t, \mathbf{a}_t)\right] \nonumber
\end{equation}
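In code, one such update for a single observed transition might look like the following minimal sketch (the tabular array Q and the indices are hypothetical):
\begin{verbatim}
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update for the transition (s, a, r, s_next)."""
    target = r + gamma * np.max(Q[s_next])   # Bellman target
    Q[s, a] += alpha * (target - Q[s, a])    # move Q(s,a) toward the target
    return Q
\end{verbatim}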
Exploration-Exploitation Tradeoff
Note: $Q$-Learning only learns about the states and actions it visits.
Exploration-exploitation Tradeoff: the agent should sometimes pick suboptimal actions in order to visit new states and actions.
Simple solution: $\epsilon$-greedy policy
With probability $1-\epsilon$, choose the optimal action according to $Q$.
With probability $\epsilon$, choose a random action
Believe it or not, $\epsilon$-greedy is still used today.
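A minimal sketch of $\epsilon$-greedy action selection over a row of a (hypothetical) tabular $Q$:
\begin{verbatim}
import numpy as np

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore: uniformly random action
    return int(np.argmax(Q[s]))               # exploit: greedy action under Q
\end{verbatim}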
Exploration-Exploitation Tradeoff
You cannot use an $\epsilon$-greedy strategy with policy gradient because it is an on-policy algorithm: the agent can only learn about the policy it is actually following.
$Q$-learning is an off-policy algorithm: the agent can learn $Q$ regardless of whether it is actually following the optimal policy.
Hence, $Q$-learning is typically done with an $\epsilon$-greedy policy, or some other policy that encourages exploration.
$Q$-Learning
Function Approximation
So far, we assumed a tabular representation of $Q$: one entry for every state/action pair.
This is impractical to store for all but the simplest problems, and does not share structure between related states.
Solution: approximate $Q$ using a parametric function e.g.,
linear function approximation: $Q(\mathbf{s},\mathbf{a})=\mathbf{w}^T\psi(\mathbf{s},\mathbf{a})$
Problem: Nonstationary! The "target" for $Q_{\mathbf{\theta}}(\mathbf{s}, \mathbf{a})$ depends on the current weights $\mathbf{\theta}$!
Problem: How to sample batches of data for training?
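A minimal sketch of one semi-gradient $Q$-learning step under linear function approximation; the feature map psi is a hypothetical assumption, and the comment marks where the nonstationarity noted above comes from:
\begin{verbatim}
import numpy as np

def linear_q_step(w, psi, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    """One semi-gradient update for Q(s,a) = w^T psi(s,a)."""
    q_sa = w @ psi(s, a)
    # The target is computed with the *current* weights, so it moves as w
    # is updated -- this is the nonstationarity problem.
    target = r + gamma * max(w @ psi(s_next, a2) for a2 in actions)
    w += alpha * (target - q_sa) * psi(s, a)   # semi-gradient step
    return w
\end{verbatim}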
Function Approximation
Approximating $Q$ with a neural net is a decades-old idea, but DeepMind got it to work really well on Atari games in 2013 ("deep Q-learning")
They used a very small network by today's standards
Technical Innovation: store experience into a replay buffer, and perform Q-learning using stored experience.
Gains sample efficiency by separating environment interaction from optimization — don't need new experience for every SGD update.
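A minimal sketch of such a replay buffer (capacity and batch size are illustrative choices):
\begin{verbatim}
import random
from collections import deque

class ReplayBuffer:
    """Store transitions during interaction; sample random minibatches for SGD."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        # transpose a list of transitions into tuples of (states, actions, ...)
        return tuple(zip(*batch))
\end{verbatim}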
Atari
Mnih et al., Nature 2015. Human-level control through deep reinforcement learning
Network was given raw pixels as observations
Same architecture shared between all games
Assumed a fully observable environment, even though that's not the case
After about a day of training on a particular game, often beat "human-level" performance (number of points within 5 minutes of play)
Did very well on reactive games, poorly on ones that require planning (e.g. Montezuma's Revenge)
Deep Q-Networks
Faulty Reward Functions
If rats have a lever that causes an electrode to stimulate certain "reward centers" in their brain, they'll keep pressing the lever at the expense of sleep, food, etc.
RL algorithms show this behavior if the reward function is not designed carefully.
Policy Gradient vs $Q$-Learning
Policy gradient and $Q$-learning use two very different choices of representation: policies and value functions.
Advantage of both methods: do not need to model the environment
Policy Gradient
Pros/cons of policy gradient:
Pro: unbiased estimate of gradient of expected return
Pro: can handle a large space of actions (since you only need to sample one)
Con: high variance updates (implies poor sample efficiency)
Con: does not do credit assignment
$Q$-Learning
Pros/cons of $Q$-Learning:
Pro: lower variance updates, more sample efficient
Pro: does credit assignment
Con: biased updates since $Q$ function is approximate
Con: hard to handle many actions (since you need to take the max)
So Far: $Q$-Learning and Policy Gradients
Q-Learning: Train network $Q_{\mathbf{\theta}}(\mathbf{s},\mathbf{a})$ to estimate future rewards for every (state, action) pair. Use Bellman Equation to define loss function for training $Q_{\mathbf{\theta}}$:
\begin{equation}
d_{\theta}(\mathbf{s},\mathbf{a}) = \mathbb{E}\left[r + \gamma \max_{\mathbf{a}'}Q_{\mathbf{\theta}}(\mathbf{s}',\mathbf{a}')\right] \mbox{ and } L(\mathbf{s},\mathbf{a}) = (d_{\theta}(\mathbf{s},\mathbf{a}) - Q_{\mathbf{\theta}}(\mathbf{s},\mathbf{a}))^2
\end{equation}
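A corresponding loss computation might look like this sketch, written with PyTorch and a hypothetical network q_net that maps a batch of states to per-action values (s, a, r, s_next, done are assumed minibatch tensors, with done a 0/1 float mask):
\begin{verbatim}
import torch
import torch.nn.functional as F

def dqn_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta(s, a)
    with torch.no_grad():                                  # target is held fixed
        target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
    return F.mse_loss(q_sa, target)
\end{verbatim}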
Policy Gradients: Train a network $\pi_{\mathbf{\theta}}(\mathbf{a}|\mathbf{s})$ that takes state as input, gives distribution over which action to take in that state. Use REINFORCE Rule for computing gradients:
\begin{equation}
R = \mathbb{E}_{p_{\theta}(\tau)}[r(\tau)] \mbox{ and } \frac{\partial R}{\partial \theta} = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[r(\tau)\sum_{t=0}^T\frac{\partial}{\partial \theta}\log \pi_{\theta}(\mathbf{a}_t|\mathbf{s}_t)\right]
\end{equation}
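A sketch of the corresponding surrogate loss for a single sampled trajectory, assuming a hypothetical policy network that returns a torch.distributions.Categorical over actions:
\begin{verbatim}
import torch

def reinforce_loss(policy, states, actions, trajectory_return):
    """Negative of r(tau) * sum_t log pi_theta(a_t | s_t); minimizing it
    performs gradient ascent on the expected return."""
    log_probs = torch.stack(
        [policy(s).log_prob(a) for s, a in zip(states, actions)])
    return -(trajectory_return * log_probs.sum())
\end{verbatim}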
Sutton and Barto, “Reinforcement Learning: An Introduction”, 1998; Degris et al, “Model-free reinforcement learning with continuous action in practice”, 2012; Mnih et al, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016
Other Approaches: Model Based RL
Model Based: Learn a model of the world's state transition function $p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$ and then use planning through the model to make decisions (a one-step lookahead is sketched after this list)
Imitation Learning: Gather data about how experts perform in the environment, learn a function to imitate what they do (supervised learning approach)
Inverse Reinforcement Learning: Gather data of experts performing in environment; learn a reward function that they seem to be optimizing, then use RL on that reward function
Ng et al, “Algorithms for Inverse Reinforcement Learning”, ICML 2000
Adversarial Learning: Learn to fool a discriminator that classifies actions as real/fake
Ho and Ermon, “Generative Adversarial Imitation Learning”, NeurIPS 2016
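As an illustration of planning through a learned model, a minimal one-step lookahead sketch (model, reward_fn, and value_fn stand in for hypothetical learned components):
\begin{verbatim}
def plan_one_step(s, actions, model, reward_fn, value_fn, gamma=0.99):
    """Pick the action whose predicted successor state looks best."""
    def score(a):
        s_next = model(s, a)                       # predicted next state
        return reward_fn(s, a) + gamma * value_fn(s_next)
    return max(actions, key=score)
\end{verbatim}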
Actor-Critic Methods
Actor-Critic methods combine the best of both worlds:
Fit both a policy network (the "actor") and a value network (the "critic").
Repeatedly update the value network to estimate $V^{\pi}$
Unroll for only a few steps, then compute the REINFORCE policy update using the expected returns estimated by the value network (see the sketch after this list)
The two networks adapt to each other, much like GAN training
Modern version: Asynchronous Advantage Actor-Critic (A3C)
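A minimal sketch of one actor-critic update, assuming a hypothetical actor that returns a torch.distributions.Categorical and a critic that returns a scalar value estimate:
\begin{verbatim}
import torch

def actor_critic_losses(actor, critic, s, a, r, s_next, done, gamma=0.99):
    v = critic(s)                                        # critic's estimate V(s)
    with torch.no_grad():
        target = r + gamma * critic(s_next) * (1 - done) # bootstrapped target
        advantage = target - v                           # better/worse than expected?
    critic_loss = (target - v).pow(2).mean()             # regress V toward target
    actor_loss = -(advantage * actor(s).log_prob(a)).mean()  # REINFORCE with baseline
    return actor_loss, critic_loss
\end{verbatim}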
Playing Games
AlphaGo (January 2016)
Used imitation learning + tree search + RL
Beat 18-time world champion Lee Sedol
AlphaGo Zero (October 2017)
Simplified version of AlphaGo
No longer using imitation learning
Beat (at the time) #1 ranked Ke Jie
AlphaZero (December 2018)
Generalized to other games: Chess and Shogi
MuZero (November 2019)
Plans through a learned model of the game
Silver et al, “Mastering the game of Go with deep neural networks and tree search”, Nature 2016
Silver et al, “Mastering the game of Go without human knowledge”, Nature 2017
Silver et al, “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play”, Science 2018
Schrittwieser et al, “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, arXiv 2019
More Complex Games
StarCraft II: AlphaStar (October 2019) Vinyals et al, "Grandmaster level in StarCraft II using multi-agent reinforcement learning", Nature 2019
Dota 2: OpenAI Five (April 2019) No paper, only a blog post: OpenAI Dota 2