Reinforcement Learning - II


CSE 440: Introduction to Artificial Intelligence

Vishnu Boddeti


Content Credits: CMU AI, http://ai.berkeley.edu

Today

  • Temporal-Difference Learning
  • Active Reinforcement Learning
  • Exploration vs Exploitation

Recap: Reinforcement Learning

  • Still assume a Markov decision process (MDP):
    • A set of states $s \in S$
    • A set of actions per state $A$
    • A model $T(s,a,s')$
    • A reward function $R(s,a,s')$
  • Still looking for a policy $\pi(s)$
  • New twist: do not know $T$ or $R$
  • Big idea: Compute all averages over $T$ using sample outcomes

Temporal-Difference Learning

Temporal-Difference Learning

  • Big idea: learn from every experience
    • Update $V(s)$ each time we experience a transition $(s, a, s', r)$
    • Likely outcomes $s'$ will contribute updates more often
  • Temporal difference learning of values
    • Policy still fixed, still doing evaluation
    • Move values toward value of whatever successor occurs: running average

$$\begin{equation} \mbox{Sample of } V(s)\mbox{: } sample = R(s,\pi(s),s') + \gamma V^{\pi}(s') \end{equation}$$
$$\begin{equation} \mbox{Update to } V(s)\mbox{: } V^{\pi}(s) \leftarrow (1-\alpha)V^{\pi}(s) + (\alpha) sample \end{equation}$$
$$\begin{equation} \mbox{Equivalent form: } V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha(sample-V^{\pi}(s)) \end{equation}$$
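
A minimal Python sketch of the TD(0) update above, for evaluating a fixed policy. The `env` interface (`reset()` returning a state, `step(a)` returning `(s', r, done)`) and the `policy` lookup are assumed for illustration and are not part of the slides.

```python
from collections import defaultdict

def td_policy_evaluation(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0) evaluation of a fixed policy: update V(s) after every transition."""
    V = defaultdict(float)  # V(s) starts at 0 for all states
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy[s]                       # policy is fixed: a = pi(s)
            s_next, r, done = env.step(a)       # experience transition (s, a, s', r)
            # sample of V(s): R(s, pi(s), s') + gamma * V(s')
            sample = r + (0.0 if done else gamma * V[s_next])
            # running average: move V(s) toward the sample
            V[s] += alpha * (sample - V[s])
            s = s_next
    return V
```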

Exponential Moving Average

  • Exponential moving average
    • The running interpolation update: $$\begin{equation} \bar{x}_n = (1-\alpha)\cdot \bar{x}_{n-1} + \alpha \cdot x_n \nonumber \end{equation}$$
    • Makes recent samples more important: $$\begin{equation} \bar{x}_n = \frac{x_n + (1-\alpha)\cdot x_{n-1} + (1-\alpha)^2 \cdot x_{n-2} + \dots}{1+(1-\alpha)+(1-\alpha)^2 + \dots} \nonumber \end{equation}$$
    • Forgets about the past (distant past values were wrong anyway)
  • Decreasing learning rate ($\alpha$) can give converging averages
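
A tiny sketch of the running interpolation update, showing how the exponential moving average weights recent samples more heavily:

```python
def exponential_moving_average(samples, alpha=0.5):
    """Running interpolation: x_bar <- (1 - alpha) * x_bar + alpha * x."""
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x
    return x_bar

# Recent samples dominate: with alpha = 0.5 the last sample carries half the weight.
print(exponential_moving_average([0.0, 0.0, 10.0]))  # 5.0
```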

Example: Temporal-Difference Learning

Problems with TD Value Learning

  • TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
  • However, if we want to turn values into a (new) policy, we are sunk:
$$\begin{equation} \begin{aligned} \pi(s) &= \underset{a}{\operatorname{arg\,max}}\ Q(s, a) \nonumber \\ Q(s,a) &= \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V(s')] \nonumber \end{aligned} \end{equation}$$
  • Idea: learn $Q$-values, not values
  • Makes action selection model-free too

Active Reinforcement Learning

Active Reinforcement Learning

  • Full reinforcement learning: optimal policies (like value iteration)
    • You do not know the transitions $T(s,a,s')$
    • You do not know the rewards $R(s,a,s')$
    • You choose the actions now
    • Goal: learn the optimal policy / values
  • In this case:
    • Learner makes choices.
    • Fundamental trade-off: exploration vs. exploitation
    • This is NOT off-line planning. You actually take actions in the world and find out what happens $\dots$

Detour: $Q$-Value Iteration

  • Value iteration: find successive (depth-limited) values
    • Start with $V_0(s) = 0$, which we know is right
    • Given $V_k$, calculate the depth $k+1$ values for all states:
    • $$\begin{equation} V_{k+1}(s) \leftarrow \max_{a} \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma V_k(s')] \nonumber \end{equation}$$
  • But $Q$-values are more useful, so compute them instead
    • Start with $Q_0(s,a) = 0$ which we know is right
    • Given $Q_k$, calculate the depth $k+1$ $q$-values for all $q$-states:
    • $$\begin{equation} Q_{k+1}(s,a) \leftarrow \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma \max_{a'}Q_k(s',a')] \nonumber \end{equation}$$
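
For contrast with the model-free methods that follow, here is a minimal sketch of $Q$-value iteration when the model is known. The representations are assumptions for illustration: `T[(s, a)]` is a list of `(s', prob)` pairs and `R[(s, a, s')]` is the corresponding reward.

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    """Q-value iteration with a known model T and R."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0(s, a) = 0
    for _ in range(iterations):
        Q_new = {}
        for s in states:
            for a in actions:
                # expected value of acting optimally from each successor s'
                Q_new[(s, a)] = sum(
                    p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T[(s, a)]
                )
        Q = Q_new
    return Q
```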

$Q$-Learning

  • $Q$-Learning: sample-based $Q$-value iteration
  • $$\begin{equation} Q_{k+1}(s,a) \leftarrow \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma \max_{a'}Q_k(s',a')] \end{equation}$$
  • Learn $Q(s,a)$ values as you go
    • Receive a sample $(s,a,s',r)$
    • Consider your old estimate: $Q(s,a)$
    • Consider your new sample estimate:
    • $$\begin{equation} sample = R(s,a,s') + \gamma \max_{a'}Q(s',a') \end{equation}$$
    • Incorporate the new estimate into a running average:
    • $$\begin{equation} Q(s,a) \leftarrow (1-\alpha)Q(s,a) + (\alpha) sample \end{equation}$$
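
Putting the pieces together, a minimal $Q$-learning sketch. The `env` interface (`reset()`, `step(a)` returning `(s', r, done)`) is an assumed stand-in for whatever environment supplies the samples; actions are chosen uniformly at random here to emphasize that $Q$-learning is off-policy (see the next slide).

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.9):
    """Sample-based Q-value iteration: no model of T or R is needed."""
    Q = defaultdict(float)                      # Q(s, a) starts at 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(actions)          # any sufficiently exploratory behavior works
            s_next, r, done = env.step(a)       # receive sample (s, a, s', r)
            # new sample estimate: R(s, a, s') + gamma * max_a' Q(s', a')
            sample = r + (0.0 if done else
                          gamma * max(Q[(s_next, a2)] for a2 in actions))
            # incorporate the new estimate into a running average
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s_next
    return Q
```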

$Q$-Learning Properties

  • Amazing result: $Q$-learning converges to optimal policy -- even if you are acting sub-optimally.
  • This is called off-policy learning.
  • Caveats:
    • You have to explore enough
    • You have to eventually make the learning rate small enough
    • $\dots$ but not decrease it too quickly
    • Basically, in the limit, it does not matter how you select actions.

The Story So Far: MDPs vs RL

Exploration vs Exploitation

How to Explore?

  • Several schemes for forcing exploration
    • Simplest: random actions ($\epsilon$-greedy)
      • Every time step, flip a coin
      • With (small) probability $\epsilon$, act randomly
      • With (large) probability $1-\epsilon$, act on current policy
    • Problems with random actions?
      • You do eventually explore the space, but keep thrashing around once learning is done
      • One solution: lower $\epsilon$ over time
      • Another solution: exploration functions
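
A minimal sketch of $\epsilon$-greedy action selection, assuming `Q` is a dictionary keyed by `(state, action)`:

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit current estimates

# To stop thrashing once learning is done, decay epsilon over time, e.g.:
# epsilon_t = max(0.01, 1.0 / (1 + t))
```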

Exploration Functions

  • When to explore?
    • Random actions: explore a fixed amount
    • Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
  • Exploration function
    • Takes a value estimate $u$ and a visit count $n$, and returns an optimistic utility, e.g. $\color{cyan}{f(u,n)=u+\frac{k}{n+1}}$
    • $$\begin{equation} \mbox{Regular } Q\mbox{-Update: } Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha[R(s,a,s')+\gamma \max_{a'}Q(s',a')] \end{equation}$$
      $$\begin{equation} \mbox{Modified } Q\mbox{-Update: } Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha[R(s,a,s')+\gamma \max_{a'}f(Q(s',a'),N(s',a'))] \end{equation}$$
    • Note: this propagates the "bonus" back to states that lead to unknown states as well.
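
A sketch of the modified update, using the exploration function $f(u,n)=u+\frac{k}{n+1}$ from above. It assumes `Q` and `N` are dictionaries (e.g., `defaultdict`) keyed by `(state, action)`, with `N` holding visit counts; these representations are illustrative choices.

```python
def f(u, n, k=1.0):
    """Exploration function: optimistic utility u + k / (n + 1)."""
    return u + k / (n + 1)

def modified_q_update(Q, N, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One modified Q-update: apply f to the successor's Q-values."""
    N[(s, a)] += 1                                    # visit count for (s, a)
    # successor value under optimistic utilities: max_a' f(Q(s', a'), N(s', a'))
    target = r + gamma * max(
        f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions
    )
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```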

Q & A

[Image: XKCD comic]