Most of this course was about supervised and unsupervised learning.
We will look at another machine learning topic, Reinforcement Learning.
Conceptually, it sits somewhere in between supervised and unsupervised learning.
An agent acts in an environment and receives a reward signal.
Today: policy gradient (directly do SGD over a stochastic policy using trial-and-error)
Reinforcement Learning
An agent interacts with an environment.
In each time step $t$,
the agent receives observations which give it information about the state $\mathbf{s}_t$ (e.g. positions of the ball and paddle)
the agent picks an action $\mathbf{a}_t$ which affects the state
The agent periodically receives a reward $r(\mathbf{s}_t, \mathbf{a}_t)$, which depends on the state and action
The agent wants to learn a policy $\pi_{\mathbf{\theta}}(\mathbf{a}_t | \mathbf{s}_t)$
Distribution over actions depending on the current state and parameters $\mathbf{\theta}$
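As a concrete picture of this interaction loop, here is a minimal Python sketch of collecting one rollout; the env object with its reset/step methods and the policy(theta, s) function are hypothetical stand-ins for an environment and a learned policy, not part of the lecture material.
\begin{verbatim}
import numpy as np

def run_episode(env, policy, theta, max_steps=1000):
    """Collect one rollout (s_0, a_0, r_0, s_1, a_1, r_1, ...).

    env is a hypothetical environment with reset() / step(a) methods,
    and policy(theta, s) returns a probability vector over discrete
    actions, i.e. pi_theta(a | s).
    """
    trajectory = []
    s = env.reset()
    for t in range(max_steps):
        probs = policy(theta, s)                    # pi_theta(. | s_t)
        a = np.random.choice(len(probs), p=probs)   # sample action a_t
        s_next, r, done = env.step(a)               # transition + reward
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory
\end{verbatim}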
Markov Decision Processes
The environment is represented as a Markov Decision Process $\mathcal{M}$.
Markov assumption: all relevant information is encapsulated in the current state; i.e. the policy, reward, and transitions are all independent of past states given the current state
You can show that estimating the gradient from sampled rollouts, rather than computing the full expectation over trajectories, still gives unbiased gradient estimates.
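For reference, since later slides refer to the REINFORCE update without restating it, the policy gradient estimator (written in the trajectory notation $\tau$, $r(\tau)$ that also appears in the baseline derivation below) is
\begin{equation}
\frac{\partial}{\partial\mathbf{\theta}}\mathbb{E}_{p(\tau)}\left[r(\tau)\right]
= \mathbb{E}_{p(\tau)}\left[r(\tau)\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau)\right]
\approx \frac{1}{N}\sum_{i=1}^{N} r(\tau^{(i)})\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau^{(i)}),
\qquad \tau^{(i)} \sim p(\tau) \nonumber
\end{equation}
where each $\tau^{(i)}$ is a rollout obtained by running the current policy in the environment, and only the $\log\pi_{\mathbf{\theta}}(\mathbf{a}_t|\mathbf{s}_t)$ terms inside $\log p(\tau)$ depend on $\mathbf{\theta}$.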
Optimizing Discontinuous Objectives
Edge case of RL: handwritten digit classification, but maximizing accuracy (or minimizing 0–1 loss)
Gradient descent completely fails if the cost function is discontinuous:
Supervised learning solution: use a surrogate loss function, e.g. logistic-cross-entropy
RL formulation: in each episode, the agent is shown an image, guesses a digit class, and receives a reward of 1 if it is correct or 0 if it is wrong.
In practice, we would never actually do it this way, but it gives us an interesting comparison with backprop.
Optimizing Discontinuous Objectives
RL Formulation:
each episode consists of a single time step
state $\mathbf{s}$: an image
action $\mathbf{a}$: a digit class
reward $r(\mathbf{s},\mathbf{a})$: 1 if correct and 0 if wrong
policy $\pi_{\mathbf{\theta}}(\mathbf{a}|\mathbf{s})$: a distribution over classes
policy network: can be implemented as an MLP with softmax outputs
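A minimal NumPy sketch of such a policy network and the act/reward step; the input size of 784 matches 28x28 digit images, while the single 256-unit hidden layer is an arbitrary choice made for this sketch.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 784-dimensional images, 10 digit classes.
D, H, K = 784, 256, 10
params = {
    "W1": rng.normal(0.0, 0.01, (H, D)), "b1": np.zeros(H),
    "W2": rng.normal(0.0, 0.01, (K, H)), "b2": np.zeros(K),
}

def policy(params, x):
    """pi_theta(a | s): an MLP with softmax outputs over K classes."""
    h = np.maximum(0.0, params["W1"] @ x + params["b1"])  # ReLU hidden layer
    z = params["W2"] @ h + params["b2"]                   # logits z_k
    z = z - z.max()                                       # numerical stability
    return np.exp(z) / np.exp(z).sum()                    # softmax output y_k

def act(params, x, t):
    """Sample a guess from the policy and receive the 0/1 reward."""
    y = policy(params, x)
    a = rng.choice(K, p=y)          # sample a digit class
    r = 1.0 if a == t else 0.0      # reward: 1 if correct, 0 if wrong
    return a, r, y

# Example with a random "image" and target class 3:
a, r, y = act(params, rng.standard_normal(D), t=3)
\end{verbatim}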
Optimizing Discontinuous Objectives
Let $t$ denote the integer target, $\mathbf{t}_k$ the one-hot target representation, $\mathbf{z}_k$ the logits, and $\mathbf{y}_k$ the softmax output.
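To see where the behaviors below come from, it helps to write out the single-step REINFORCE update for this softmax policy; here $\mathbf{a}_k$ denotes the one-hot encoding of the sampled action $a$ (a notational convenience introduced for this derivation, parallel to $\mathbf{t}_k$).
\begin{equation}
\begin{aligned}
\log\pi_{\mathbf{\theta}}(a|\mathbf{s}) &= \mathbf{z}_a - \log\sum_{k'} e^{\mathbf{z}_{k'}} \\
\frac{\partial}{\partial\mathbf{z}_k}\log\pi_{\mathbf{\theta}}(a|\mathbf{s}) &= \mathbf{a}_k - \mathbf{y}_k \\
\text{REINFORCE update:}\quad \Delta\mathbf{z}_k &\propto r(\mathbf{s}, a)\,(\mathbf{a}_k - \mathbf{y}_k)
\end{aligned}
\end{equation}
The sampled action's logit is pushed up by $r\,(1 - \mathbf{y}_a)$ and every other logit is pushed down by $r\,\mathbf{y}_k$, so whether and how strongly the sampled action is reinforced is determined entirely by the sign and magnitude of $r$.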
A constant offset can be added to the reward without changing the optimization problem (the optimal policy is the same), yet it changes how the REINFORCE update behaves.
Behavior if $r = 0$ for wrong answers and $r = 1$ for correct answers
wrong: do nothing, correct: make the action more likely
If $r = 10$ for wrong answers and $r = 11$ for correct answers
wrong: make the action more likely, correct: make the action even more likely
If $r = -10$ for wrong answers and $r = -9$ for correct answers
wrong: make the action less likely, correct: make the action less likely, but not as much
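A few lines of NumPy make the effect of these offsets concrete; the softmax probabilities below are made-up numbers, and we assume the sampled action was a wrong answer.
\begin{verbatim}
import numpy as np

y = np.array([0.2, 0.5, 0.3])        # softmax output over 3 classes (made up)
a = 1                                # sampled action; assume it was *wrong*
a_onehot = np.eye(3)[a]

for r_wrong in (0.0, 10.0, -10.0):   # reward received for this wrong answer
    grad_z = r_wrong * (a_onehot - y)    # REINFORCE ascent direction on logits
    print(r_wrong, grad_z[a])            # change to the sampled action's logit
# r = 0.0   -> 0.0   do nothing
# r = 10.0  -> +5.0  the wrong action is made *more* likely
# r = -10.0 -> -5.0  the wrong action is made less likely
\end{verbatim}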
Reward Functions
Problem: REINFORCE update depends on arbitrary constant factors added to the reward.
Observation: we can subtract a baseline $b$ from the reward without biasing the gradient.
\begin{equation}
\begin{aligned}
\mathbb{E}_{p(\tau)}\left[(r(\tau)-b)\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau)\right] &= \mathbb{E}_{p(\tau)}\left[r(\tau)\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau)\right] - b\mathbb{E}_{p(\tau)}\left[\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau)\right] \\
&= \mathbb{E}_{p(\tau)}\left[r(\tau)\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau)\right] - b\sum_{\tau}p(\tau)\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau) \\
&= \mathbb{E}_{p(\tau)}\left[r(\tau)\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau)\right] - b\sum_{\tau}\frac{\partial}{\partial\mathbf{\theta}}p(\tau) \\
&= \mathbb{E}_{p(\tau)}\left[r(\tau)\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau)\right] - b\,\frac{\partial}{\partial\mathbf{\theta}}\underbrace{\sum_{\tau}p(\tau)}_{=\,1} \\
&= \mathbb{E}_{p(\tau)}\left[r(\tau)\frac{\partial}{\partial\mathbf{\theta}}\log p(\tau)\right] - 0
\end{aligned}
\end{equation}
Pick a baseline $b$ such that good rewards are positive and bad ones are negative.
$\mathbb{E}[r(\tau)]$ is a good choice of baseline, but cannot always be computed easily. Lots of research on trying to approximate it.
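A sketch of how such a baseline typically enters the update, using the mean return of the sampled rollouts as a simple estimate of $\mathbb{E}[r(\tau)]$; the rollouts list and the grad_log_prob function are hypothetical interfaces standing in for rollout collection and backprop through the policy network.
\begin{verbatim}
import numpy as np

def reinforce_with_baseline(theta, rollouts, grad_log_prob, lr=0.01):
    """One REINFORCE update with the mean return as a baseline.

    rollouts is a list of (trajectory, total_return) pairs, and
    grad_log_prob(theta, trajectory) is a hypothetical function returning
    d/dtheta log p(trajectory) under the current policy.
    """
    returns = np.array([ret for _, ret in rollouts])
    b = returns.mean()                     # simple estimate of E[r(tau)]
    grad = np.zeros_like(theta)
    for trajectory, ret in rollouts:
        grad += (ret - b) * grad_log_prob(theta, trajectory)
    grad /= len(rollouts)
    return theta + lr * grad               # gradient *ascent* on expected return
\end{verbatim}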
Getting Policy Gradient to Work
Evaluate each action using only the future rewards (the reward-to-go), since an action has no influence on past rewards. It can be shown this still gives unbiased gradients; the resulting estimator is written out after this list.
Natural policy gradient corrects for the geometry of the space of policies, preventing the policy from changing too quickly.
Rather than use the actual return, evaluate actions based on estimates of future returns. This is a class of methods known as actor-critic.
Trust region policy optimization (TRPO) and proximal policy optimization (PPO) are modern policy gradient algorithms which are very effective for continuous control problems.
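Written out for a single rollout, the reward-to-go form of the estimator mentioned in the first bullet above is
\begin{equation}
\frac{\partial}{\partial\mathbf{\theta}}\mathbb{E}_{p(\tau)}\left[r(\tau)\right]
\approx \sum_{t=0}^{T}\frac{\partial}{\partial\mathbf{\theta}}\log\pi_{\mathbf{\theta}}(\mathbf{a}_t|\mathbf{s}_t)\sum_{t'\geq t} r(\mathbf{s}_{t'},\mathbf{a}_{t'}) \nonumber
\end{equation}
so each action's log-probability is weighted only by the rewards received at or after the time step when the action was taken.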
Backpropagation vs REINFORCE
Backpropagation: Does credit assignment – it tells you exactly which activations and parameters should be adjusted upwards or downwards to decrease the loss on some training example.
REINFORCE: Does not do credit assignment. If a rollout happens to be good, all the actions get reinforced, even if some of them were bad.
Reinforcing all the actions as a group leads to random walk behavior.
Why Policy Gradients
Can handle discontinuous cost functions.
Does not need an explicit model of the environment, i.e. environment rewards and dynamics are treated as black boxes.
Policy gradient is an example of model-free reinforcement learning, since the agent doesn’t try to fit a model of the environment.
Almost everyone thinks model-based approaches are needed for AI, but nobody has a clue how to get them to work.
Evolution Strategies
REINFORCE can handle discontinuous dynamics and reward functions, but needs a differentiable network since it computes $\frac{\partial}{\partial \mathbf{\theta}}\log \pi_{\theta}(\mathbf{a}_t|\mathbf{s}_t)$.
Evolution strategies (ES) take the policy gradient idea a step further, and avoid backprop entirely.
ES can use deterministic policies. It randomizes over the choice of policy rather than over the choice of actions.
sample a random policy from a distribution $p_{\eta}(\mathbf{\theta})$ parameterized by $\eta$ and apply the policy gradient trick
\begin{equation}
\frac{\partial}{\partial \eta}\mathbb{E}_{\mathbf{\theta}\sim p_{\eta}}[r(\tau(\mathbf{\theta}))] = \mathbb{E}_{\mathbf{\theta}\sim p_{\eta}}\left[r(\tau(\mathbf{\theta}))\frac{\partial}{\partial \eta}\log p_{\eta}(\mathbf{\theta})\right] \nonumber
\end{equation}
The policy network itself can be discontinuous.
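A minimal NumPy sketch of this estimator, assuming (as in Salimans et al.) that $p_{\eta}(\mathbf{\theta})$ is an isotropic Gaussian $\mathcal{N}(\eta, \sigma^2 I)$ over the policy parameters; episode_return(theta) is a hypothetical black-box function that runs a rollout with parameters $\mathbf{\theta}$ and returns $r(\tau(\mathbf{\theta}))$.
\begin{verbatim}
import numpy as np

def es_gradient_step(eta, episode_return, n=50, sigma=0.1, lr=0.01, rng=None):
    """One evolution strategies update on the policy parameters eta.

    Samples n perturbed parameter vectors theta_i = eta + sigma * eps_i,
    evaluates each with the black-box episode_return, and forms the
    score-function gradient estimate (1 / (n * sigma)) * sum_i r_i * eps_i.
    No backprop through the policy network is needed.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n, eta.size))     # eps_i ~ N(0, I)
    returns = np.array([episode_return(eta + sigma * e) for e in eps])
    grad = (returns[:, None] * eps).mean(axis=0) / sigma
    return eta + lr * grad                       # gradient ascent on E[r(tau(theta))]
\end{verbatim}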
Evolution Strategies
Salimans et al., Evolution Strategies as a Scalable Alternative to Reinforcement Learning, arXiv:1703.03864
Evolution Strategies: IEEE Floating Point
The IEEE floating point standard is nonlinear, since small enough numbers get truncated to zero.
This acts as a discontinuous activation function, which ES is able to handle.
ES was able to train a good MNIST classifier using a "linear" activation function.