Reinforcement Learning - I


CSE 440: Introduction to Artificial Intelligence

Vishnu Boddeti

October 27, 2020


Content Credits: CMU AI, http://ai.berkeley.edu

Today

  • Reinforcement Learning
  • Model-Based RL
  • Model-Free RL

Reinforcement Learning

  • Receive feedback in the form of rewards.
  • Agent's utility is defined by the reward function.
  • Must (learn to) act so as to maximize expected rewards.
  • All learning is based on observed samples of outcomes!

Learning to Walk

[Videos: Initial, While Learning, Finished Learning (Kohl and Stone, ICRA 2004)]

Learning to Walk: Initial

Learning to Walk: Training

Learning to Walk: Finished

Example: Sidewinding

Example: Toddler Robot

Example: The Biped

Reinforcement Learning

  • Still assume a Markov decision process (MDP):
    • A set of states $s \in S$
    • A set of actions per state $A$
    • A model $T(s,a,s')$
    • A reward function $R(s,a,s')$
  • Still looking for a policy $\pi(s)$
  • New twist: do not know $T$ or $R$
    • we do not know which states are good
    • we do not know what the actions do
    • must try out actions and states to learn

Off-Line (MDP) vs On-Line (RL)

  • MDP (off-line): imagine what it would be like to take an action and learn from the model.
  • RL (on-line): actually take an action in the world and learn from the outcome.

Model-Based Learning

Model-Based Learning

  • Model-Based Idea:
    • Learn an approximate model based on experiences
    • Solve for values as if the learned model were correct
  • Step 1: Learn empirical MDP model
    • Count outcomes $s'$ for each $s$, $a$
    • Normalize to give an estimate of $\hat{T}(s,a,s')$
    • Discover each $\hat{R}(s,a,s')$ when we experience $(s, a, s')$
  • Step 2: Solve the learned MDP
    • For example, use value iteration, as before
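
A minimal sketch of these two steps, assuming episodes are logged as lists of $(s, a, r, s')$ transitions; the function names and data layout are illustrative assumptions, not part of the slides.

```python
# Sketch of model-based learning, assuming episodes are lists of (s, a, r, s') tuples.
from collections import defaultdict

def learn_model(episodes):
    """Step 1: learn the empirical MDP by counting outcomes and normalizing."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = #occurrences
    R_hat = {}                                      # R_hat[(s, a, s')] = observed reward
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)][s_next] += 1
            R_hat[(s, a, s_next)] = r               # discovered when (s, a, s') is experienced
    T_hat = {sa: {s2: n / sum(outcomes.values()) for s2, n in outcomes.items()}
             for sa, outcomes in counts.items()}    # normalize counts into probabilities
    return T_hat, R_hat

def value_iteration(T_hat, R_hat, gamma=1.0, iters=100):
    """Step 2: solve the learned MDP as if the estimated model were correct."""
    actions = defaultdict(list)                     # actions observed in each state
    for s, a in T_hat:
        actions[s].append(a)
    states = set(actions) | {s2 for outcomes in T_hat.values() for s2 in outcomes}
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max((sum(p * (R_hat[(s, a, s2)] + gamma * V[s2])
                         for s2, p in T_hat[(s, a)].items())
                     for a in actions[s]), default=0.0)
             for s in states}
    return V
```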

Example: Model-Based Learning



[Figure: input policy $\pi$ (assume $\gamma=1$), observed training episodes, and the learned model $\hat{T}(s,a,s')$, $\hat{R}(s,a,s')$]

Example: Expected Age

Goal: Compute expected age of CSE 440 students
$$\begin{equation} E(A) = \sum_a P(a)\cdot a = 0.35 \times 21 + \dots \end{equation}$$
Without knowing $P(A)$, instead collect samples $[a_1, a_2, \dots, a_N]$
$$\begin{equation} \begin{aligned} \hat{P}(a) &= \frac{\text{num}(a)}{N} \\ E(A) &\approx \sum_{a} \hat{P}(a)\cdot a \end{aligned} \end{equation}$$
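
A small numerical sketch of this estimate; the sample ages below are made up purely for illustration.

```python
# Estimating E[A] from samples when P(A) is unknown; the ages are invented for illustration.
from collections import Counter

samples = [21, 22, 21, 23, 21, 22, 24, 21]   # hypothetical samples a_1, ..., a_N
N = len(samples)

# Estimate P_hat(a) = num(a) / N, then take the expectation under the estimated distribution.
P_hat = {a: count / N for a, count in Counter(samples).items()}
E_from_model = sum(P_hat[a] * a for a in P_hat)

# The same number falls out of averaging the samples directly (no distribution estimate needed).
E_from_samples = sum(samples) / N

print(E_from_model, E_from_samples)          # both give the sample mean, 21.875
```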

Model-Free Learning

Passive Reinforcement Learning

  • Simplified task: policy evaluation
    • Input: a fixed policy $\pi(s)$
    • You do not know the transitions $T(s,a,s')$
    • You do not know the rewards $R(s,a,s')$
    • Goal: learn the state values
  • In this case:
    • Learner is "along for the ride"
    • No choice about what actions to take
    • Just execute the policy and learn from experience
    • This is NOT offline planning! You actually take actions in the world.



Direct Evaluation

  • Goal: Compute values for each state under $\pi$
  • Idea: Average together observed sample values
    • Act according to $\pi$
    • Every time you visit a state, write down what the sum of discounted rewards turned out to be
    • Average those samples

  • This is called direct evaluation
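
A minimal sketch of direct evaluation, assuming each observed episode is logged as a sequence of (state, reward received on leaving that state) pairs generated by following $\pi$; this data layout is an assumption for illustration.

```python
# Direct evaluation: average the observed discounted returns from every visit to each state.
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """episodes: list of [(state, reward_on_leaving_state), ...] generated by following pi."""
    return_sums = defaultdict(float)   # sum of sampled returns per state
    visit_counts = defaultdict(int)    # number of visits per state
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G is the discounted return from each visit onward.
        for s, r in reversed(episode):
            G = r + gamma * G
            return_sums[s] += G
            visit_counts[s] += 1
    return {s: return_sums[s] / visit_counts[s] for s in return_sums}
```

With $\gamma=1$, each state's estimate is simply the mean of the returns observed from its visits.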

Example: Direct Evaluation



[Figure: input policy $\pi$ (assume $\gamma=1$), observed training episodes, and the resulting output values]

Problems with Direct Evaluation

  • What is good about direct evaluation?
    • It is easy to understand
    • It does not require any knowledge of $T$, $R$
    • It eventually computes the correct average values, using just sample transitions
  • What is bad about it?
    • It wastes information about state connections
    • Each state must be learned separately
    • So, it takes a long time to learn

Why Not Use Policy Evaluation?

  • Simplified Bellman updates calculate $V$ for a fixed policy:
    • Each round, replace $V$ with a one-step-look-ahead layer over $V$
    • \begin{equation} \begin{aligned} V_0^{\pi}(s) &= 0 \\ V_{k+1}^{\pi}(s) &\leftarrow \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_k^{\pi}(s')] \end{aligned} \end{equation}
    • This approach fully exploits the connections between the states
    • Unfortunately, we need $T$ and $R$ to do it.
  • Key question: how can we do this update to $V$ without knowing $T$ and $R$?
    • In other words, how do we take a weighted average without knowing the weights?
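
For contrast, a sketch of what the exact update would look like if the model were available; `T` and `R` below are hypothetical callables standing in for the known transition and reward functions.

```python
# One round of exact policy evaluation -- only possible when the model T, R is known.
def policy_evaluation_step(V, states, pi, T, R, gamma=1.0):
    """Replace V with a one-step look-ahead under the fixed policy pi.
    T(s, a) is assumed to return a dict {s': probability}; R(s, a, s') a scalar reward."""
    return {s: sum(p * (R(s, pi(s), s2) + gamma * V[s2])
                   for s2, p in T(s, pi(s)).items())
            for s in states}
```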


Sample-Based Policy Evaluation

  • We want to improve our estimate of $V$ by computing these averages:
  • \begin{equation} V_{k+1}^{\pi}(s) \leftarrow \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_k^{\pi}(s')] \nonumber \end{equation}
  • Idea: Take samples of outcomes $s'$ (by doing the action) and average
$$\begin{equation} \begin{aligned} sample_1 &= R(s,\pi(s),s_1') + \gamma V_k^{\pi}(s'_1) \\ sample_2 &= R(s,\pi(s),s_2') + \gamma V_k^{\pi}(s'_2) \\ &\;\;\vdots \\ sample_n &= R(s,\pi(s),s_n') + \gamma V_k^{\pi}(s'_n) \end{aligned} \end{equation}$$ $$\begin{equation} V_{k+1}^{\pi}(s) \leftarrow \frac{1}{n}\sum_{i} sample_i \end{equation}$$
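
A sketch of this sample-based average, assuming a hypothetical `step(s, a)` function that actually executes the action in the environment and returns a (reward, next state) pair.

```python
# Sample-based estimate of V_{k+1}(s): execute pi(s) n times and average the outcomes.
# step(s, a) is a hypothetical environment/simulator returning (reward, next_state).
def sample_based_update(s, pi, V, step, gamma=1.0, n=10):
    samples = []
    for _ in range(n):
        r, s_next = step(s, pi(s))        # sample_i = R(s, pi(s), s'_i) + gamma * V(s'_i)
        samples.append(r + gamma * V[s_next])
    return sum(samples) / n               # V_{k+1}(s) <- (1/n) * sum_i sample_i
```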