Reinforcement Learning - I


CSE 440: Introduction to Artificial Intelligence

Vishnu Boddeti


Content Credits: CMU AI, http://ai.berkeley.edu

Today

  • Reinforcement Learning
  • Model-Based RL
  • Model-Free RL

Reinforcement Learning

  • Receive feedback in the form of rewards.
  • Agent's utility is defined by the reward function.
  • Must (learn to) act so as to maximize expected rewards.
  • All learning is based on observed samples of outcomes!

Learning to Walk

[Videos: initial, while learning, and finished learning (Kohl and Stone, ICRA 2004)]

Learning to Walk: Initial

Learning to Walk: Training

Learning to Walk: Finished

Example: Sidewinding

Example: Toddler Robot

Example: The Biped

Reinforcement Learning

  • Still assume a Markov decision process (MDP):
    • A set of states $s \in S$
    • A set of actions per state $A$
    • A model $T(s,a,s')$
    • A reward function $R(s,a,s')$
  • Still looking for a policy $\pi(s)$
  • New twist: do not know $T$ or $R$
    • we do not know which states are good
    • we do not know what the actions do
    • must try out actions and states to learn

Offline (MDP) vs. Online (RL)

  • MDP (offline): imagine what it would be like to take an action, and learn.
  • RL (online): actually take an action, and learn.

Model-Based Learning

Model-Based Learning

  • Model-Based Idea:
    • Learn an approximate model based on experiences
    • Solve for values as if the learned model were correct
  • Step 1: Learn empirical MDP model (a minimal sketch follows this list)
    • Count outcomes $s'$ for each $s$, $a$
    • Normalize to give an estimate of $\hat{T}(s,a,s')$
    • Discover each $\hat{R}(s,a,s')$ when we experience $(s, a, s')$
  • Step 2: Solve the learned MDP
    • For example, use value iteration, as before
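
A minimal sketch of Step 1, assuming experience arrives as a list of episodes, each a list of `(s, a, r, s_next)` tuples (this interface is an illustration, not a prescribed API):

```python
from collections import defaultdict

def learn_empirical_mdp(episodes):
    """Step 1: estimate T-hat(s,a,s') and R-hat(s,a,s') from observed transitions."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = times s' followed (s, a)
    R_hat = {}                                      # reward observed on each (s, a, s') transition

    for episode in episodes:
        for (s, a, r, s_next) in episode:
            counts[(s, a)][s_next] += 1
            R_hat[(s, a, s_next)] = r               # assumes the reward on (s, a, s') is deterministic

    # Normalize the outcome counts to get the transition estimate T-hat(s, a, s').
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T_hat[(s, a, s_next)] = n / total

    return T_hat, R_hat
```

Step 2 then treats $\hat{T}$ and $\hat{R}$ as if they were the true model and solves the MDP with value iteration, exactly as in planning.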

Example: Model-Based Learning



[Figure: input policy $\pi$ (assume $\gamma=1$), observed training episodes, and the learned model $\hat{T}$, $\hat{R}$]

Example: Expected Age

Goal: Compute expected age of CSE 440 students
$$\begin{equation} E(A) = \sum_a P(a)\cdot a = 0.35 \times 21 + \dots \end{equation}$$
Without knowing $P(A)$, instead collect samples $[a_1, a_2, \dots, a_N]$
$$\begin{equation} \begin{aligned} \hat{P}(a) &= \frac{\text{num}(a)}{N} \\ E(A) &\approx \sum_{a} \hat{P}(a)\cdot a \end{aligned} \end{equation}$$
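
A tiny numerical sketch of the same idea; the sample ages below are made up for illustration:

```python
from collections import Counter

samples = [21, 22, 21, 20, 23, 21, 22, 21]        # hypothetical observed ages
N = len(samples)

# "Model-based" estimate: build P-hat(a) from counts, then take the weighted sum.
P_hat = {a: n / N for a, n in Counter(samples).items()}
E_model_based = sum(P_hat[a] * a for a in P_hat)

# "Model-free" estimate: just average the samples; no P-hat is ever built.
E_model_free = sum(samples) / N

print(E_model_based, E_model_free)                # both are 21.375
```

Both estimates converge to $E(A)$ as $N$ grows; the second one never builds an explicit $\hat{P}$, which is the spirit of model-free learning.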

Model-Free Learning

Passive Reinforcement Learning

  • Simplified task: policy evaluation
    • Input: a fixed policy $\pi(s)$
    • You do not know the transitions $T(s,a,s')$
    • You do not know the rewards $R(s,a,s')$
    • Goal: learn the state values
  • In this case:
    • Learner is "along for the ride"
    • No choice about what actions to take
    • Just execute the policy and learn from experience
    • This is NOT offline planning! You actually take actions in the world.



Direct Evaluation

  • Goal: Compute values for each state under $\pi$
  • Idea: Average together observed sample values
    • Act according to $\pi$
    • Every time you visit a state, write down what the sum of discounted rewards turned out to be
    • Average those samples

  • This is called direct evaluation (a minimal sketch follows below)
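
A minimal sketch of direct evaluation, assuming (as above) that experience is a list of episodes of `(s, a, r, s_next)` tuples gathered while following $\pi$:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Estimate V^pi(s) as the average of the returns observed from each visit to s."""
    returns = defaultdict(list)

    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G is the sum of discounted rewards from each step onward.
        for (s, a, r, s_next) in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)

    # The value estimate for each visited state is the average of its observed returns.
    return {s: sum(Gs) / len(Gs) for s, Gs in returns.items()}
```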

Example: Direct Evaluation



[Figure: input policy $\pi$ (assume $\gamma=1$), observed training episodes, and the resulting output values]

Problems with Direct Evaluation

  • What is good about direct evaluation?
    • It is easy to understand
    • It does not require any knowledge of $T$, $R$
    • It eventually computes the correct average values, using just sample transitions
  • What is bad about it?
    • It wastes information about state connections
    • Each state must be learned separately
    • So, it takes a long time to learn

Why Not Use Policy Evaluation?

  • Simplified Bellman updates calculate $V$ for a fixed policy:
    • Each round, replace $V$ with a one-step look-ahead layer over $V$
    • \begin{equation} \begin{aligned} V_0^{\pi}(s) &= 0 \\ V_{k+1}^{\pi}(s) &\leftarrow \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_k^{\pi}(s')] \end{aligned} \end{equation}
    • This approach fully exploits the connections between the states
    • Unfortunately, we need $T$ and $R$ to do it (see the sketch after this list).
  • Key question: how can we do this update to $V$ without knowing $T$ and $R$?
    • In other words, how do we take a weighted average without knowing the weights?
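
For contrast, here is a sketch of that update when $T$ and $R$ are known, with the model stored as dictionaries keyed by $(s, a, s')$ (a representation chosen just for this illustration):

```python
def exact_policy_evaluation(states, pi, T, R, gamma=1.0, rounds=100):
    """Repeated one-step look-ahead for a fixed policy pi; requires the full model T and R."""
    V = {s: 0.0 for s in states}                   # V_0^pi(s) = 0 everywhere
    for _ in range(rounds):
        V_new = {}
        for s in states:
            a = pi[s]
            # Weighted average over successors: the weights are exactly T(s, pi(s), s').
            V_new[s] = sum(
                T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                for s2 in states
            )
        V = V_new
    return V
```

An RL agent cannot run this loop, because the weights $T(s,\pi(s),s')$ are precisely what it does not know.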


Sample-Based Policy Evaluation

  • We want to improve our estimate of $V$ by computing these averages:
  • \begin{equation} V_{k+1}^{\pi}(s) \leftarrow \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_k^{\pi}(s')] \nonumber \end{equation}
  • Idea: Take samples of outcomes $s'$ (by doing the action) and average them, as sketched after the equations below
$$\begin{equation} \begin{aligned} sample_1 &= R(s,\pi(s),s'_1) + \gamma V_k^{\pi}(s'_1) \\ sample_2 &= R(s,\pi(s),s'_2) + \gamma V_k^{\pi}(s'_2) \\ &\ \ \vdots \\ sample_n &= R(s,\pi(s),s'_n) + \gamma V_k^{\pi}(s'_n) \end{aligned} \end{equation}$$ $$\begin{equation} V_{k+1}^{\pi}(s) \leftarrow \frac{1}{n}\sum_{i=1}^{n} sample_i \end{equation}$$
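
A sketch of the sample-based version, assuming a hypothetical simulator `env.step(s, a)` that returns one reward and one successor state drawn from the true (unknown) dynamics:

```python
def sample_based_estimate(env, s, pi, V, gamma=1.0, n=10):
    """Estimate V^pi(s) by averaging n sampled one-step look-aheads instead of a T-weighted sum."""
    a = pi[s]
    samples = []
    for _ in range(n):
        r, s_next = env.step(s, a)                 # take the action once, observe one outcome s'
        samples.append(r + gamma * V[s_next])      # sample_i = R(s, pi(s), s'_i) + gamma * V_k(s'_i)
    return sum(samples) / n                        # the average stands in for the weighted sum
```

Note the sketch assumes we can repeatedly resample outcomes from the same state $s$; an agent acting in the real world ends up in $s'$ after one try, so it cannot simply rerun the experiment at will.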

Q & A

[Image: XKCD comic]