Reinforcement Learning - I
CSE 440: Introduction to Artificial Intelligence
Vishnu Boddeti
Content Credits: CMU AI, http://ai.berkeley.edu
Today
- Reinforcement Learning
- Model-Based RL
- Model-Free RL
Reinforcement Learning
- Receive feedback in the form of rewards.
- Agent's utility is defined by the reward function
- Must (learn to) act so as to maximize expected rewards.
- All learning is based on observed samples of outcomes!
Learning to Walk: Initial
Learning to Walk: Training
Learning to Walk: Finished
Reinforcement Learning
- Still assume a Markov decision process (MDP):
- A set of states $s \in S$
- A set of actions per state $A$
- A model $T(s,a,s')$
- A reward function $R(s,a,s')$
- Still looking for a policy $\pi(s)$
- New twist: do not know $T$ or $R$
- we do not know which states are good
- we do not know what the actions do
- must try out actions and states to learn
Off-Line (MDP) vs On-Line (RL)
MDP
Imagine what it would be like to take an action, and learn from the model.
RL
Actually take an action in the world, and learn from what happens.
Model-Based Learning
- Model-Based Idea:
- Learn an approximate model based on experiences
- Solve for values as if the learned model were correct
- Step 1: Learn empirical MDP model
- Count outcomes $s'$ for each $s$, $a$
- Normalize to give an estimate of $\hat{T}(s,a,s')$
- Discover each $\hat{R}(s,a,s')$ when we experience $(s, a, s')$
- Step 2: Solve the learned MDP
- For example, use value iteration, as before
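As a concrete sketch of these two steps, the snippet below (Python, with a hypothetical data format: each episode is assumed to be a list of $(s, a, r, s')$ tuples) counts outcomes to estimate $\hat{T}$ and $\hat{R}$, then runs value iteration on the learned model.

```python
from collections import defaultdict

def learn_model(episodes):
    """Step 1: estimate T-hat and R-hat from observed (s, a, r, s') transitions."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
    rewards = {}                                     # rewards[(s, a, s')] = observed reward
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)][s_next] += 1
            rewards[(s, a, s_next)] = r              # rewards are observed directly
    # Normalize the counts to get transition-probability estimates T-hat(s, a, s').
    t_hat = {sa: {s_next: n / sum(outcomes.values()) for s_next, n in outcomes.items()}
             for sa, outcomes in counts.items()}
    return t_hat, rewards

def value_iteration(t_hat, r_hat, states, actions, gamma=0.9, iters=100):
    """Step 2: solve the learned MDP as if the estimated model were correct."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max((sum(p * (r_hat[(s, a, s2)] + gamma * V.get(s2, 0.0))
                         for s2, p in t_hat[(s, a)].items())
                     for a in actions if (s, a) in t_hat),
                    default=0.0)                     # states with no observed actions keep value 0
             for s in states}
    return V
```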
Example: Model-Based Learning
Input Policy $\pi$
Assume: $\gamma=1$
Observed Episodes (Training)
Learned Model
Example: Expected Age
Goal: Compute expected age of CSE 440 students
$$\begin{equation}
E(A) = \sum_a P(a)\cdot a = 0.35 \times 21 + \dots
\end{equation}$$
Without $P(A)$, instead collect samples $[a_1, a_2, \dots, a_N]$
Model-based: estimate $\hat{P}(A)$ from the samples, then compute the expectation under it
$$\begin{equation}
\begin{aligned}
\hat{P}(a) &= \frac{num(a)}{N} \\
E(A) &\approx \sum_{a} \hat{P}(a)\cdot a
\end{aligned}
\end{equation}$$
Model-free: average the samples directly
$$\begin{equation}
E(A) \approx \frac{1}{N}\sum_{i}a_i \nonumber
\end{equation}$$
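As a small numerical illustration (with made-up sample ages), both estimates can be computed from the same samples; they coincide here because the expectation under $\hat{P}$ is exactly the sample mean.

```python
import random
from collections import Counter

random.seed(0)
samples = [random.choice([20, 21, 22, 23]) for _ in range(1000)]  # hypothetical student ages
N = len(samples)

# Model-based: estimate P-hat(a) from counts, then take the expectation under it.
p_hat = {a: count / N for a, count in Counter(samples).items()}
model_based = sum(p * a for a, p in p_hat.items())

# Model-free: average the samples directly, never forming P-hat.
model_free = sum(samples) / N

print(model_based, model_free)   # identical here; both converge to E[A] as N grows
```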
Passive Reinforcement Learning
- Simplified task: policy evaluation
- Input: a fixed policy $\pi(s)$
- You do not know the transitions $T(s,a,s')$
- You do not know the rewards $R(s,a,s')$
- Goal: learn the state values
- In this case:
- Learner is "along for the ride"
- No choice about what actions to take
- Just execute the policy and learn from experience
- This is NOT offline planning! You actually take actions in the world.
Direct Evaluation
- Goal: Compute values for each state under $\pi$
- Idea: Average together observed sample values
- Act according to $\pi$
- Every time you visit a state, write down what the sum of discounted rewards turned out to be
- Average those samples
- This is called direct evaluation
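A minimal sketch of direct evaluation, assuming each episode is a list of (state, reward) pairs collected while executing the fixed policy $\pi$:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted returns from each state visit."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so the return from each visit accumulates easily.
        for s, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    # The value estimate for each state is the average of its observed returns.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```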
Example: Direct Evaluation
Input Policy $\pi$
Assume: $\gamma=1$
Observed Episodes (Training)
Output Values
Problems with Direct Evaluation
- What is good about direct evaluation?
- It is easy to understand
- It does not require any knowledge of $T$, $R$
- It eventually computes the correct average values, using just sample transitions
- What is bad about it?
- It wastes information about state connections
- Each state must be learned separately
- So, it takes a long time to learn
Output Values
If B and E both go to C under this policy, how can their values be different?
Why Not Use Policy Evaluation?
- Simplified Bellman updates calculate V for a fixed policy:
- Each round, replace V with a one-step-look-ahead layer over V
\begin{equation}
\begin{aligned}
V_0^{\pi}(s) &= 0 \\
V_{k+1}^{\pi}(s) &\leftarrow \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_k^{\pi}(s')]
\end{aligned}
\end{equation}
- This approach fully exploits the connections between the states
- Unfortunately, we need T and R to do it.
- Key question: how can we do this update to V without knowing T and R?
- In other words, how do we take a weighted average without knowing the weights?
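For contrast, here is a sketch of the exact update above; it only works when $T$ and $R$ are available (assumed here as dictionaries keyed by $(s,a)$ and $(s,a,s')$ respectively):

```python
def policy_evaluation(pi, T, R, states, gamma=1.0, iters=100):
    """Exact policy evaluation: needs the full model T and R.

    T[(s, a)] is assumed to map each successor s' to its probability,
    and R[(s, a, s_next)] to give the transition reward.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: sum(p * (R[(s, pi[s], s2)] + gamma * V.get(s2, 0.0))
                    for s2, p in T.get((s, pi[s]), {}).items())
             for s in states}
    return V
```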
Sample-Based Policy Evaluation
- We want to improve our estimate of $V$ by computing these averages:
\begin{equation}
V_{k+1}^{\pi}(s) \leftarrow \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_k^{\pi}(s')] \nonumber
\end{equation}
- Idea: Take samples of outcomes $s'$ (by doing the action) and average
$$\begin{equation}
\begin{aligned}
sample_1 &= R(s,\pi(s),s_1') + \gamma V_k^{\pi}(s_1') \\
sample_2 &= R(s,\pi(s),s_2') + \gamma V_k^{\pi}(s_2') \\
&\;\;\vdots \\
sample_n &= R(s,\pi(s),s_n') + \gamma V_k^{\pi}(s_n')
\end{aligned}
\end{equation}$$
$$\begin{equation}
V_{k+1}^{\pi}(s) \leftarrow \frac{1}{n}\sum_{i} sample_i
\end{equation}$$
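A sketch of this sample-based update, assuming a hypothetical `step(s, a)` function that executes action a (in the world or a simulator) and returns the observed reward and next state:

```python
def sample_based_update(V_k, s, pi, step, n, gamma=1.0):
    """Estimate V_{k+1}^pi(s) by averaging n sampled one-step backups."""
    total = 0.0
    for _ in range(n):
        r, s_next = step(s, pi[s])                   # take the action, observe the outcome
        total += r + gamma * V_k.get(s_next, 0.0)    # one sample of the Bellman backup
    return total / n                                 # average replaces the weighted sum over s'
```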