Reinforcement Learning - II
CSE 440: Introduction to Artificial Intelligence
Vishnu Boddeti
October 29, 2020
Content Credits: CMU AI, http://ai.berkeley.edu
Today
- Temporal-Difference Learning
- Active Reinforcement Learning
- Exploration vs Exploitation
Recap: Reinforcement Learning
- Still assume a Markov decision process (MDP):
- A set of states $s \in S$
- A set of actions per state $A$
- A model $T(s,a,s')$
- A reward function $R(s,a,s')$
- Still looking for a policy $\pi(s)$
- New twist: do not know $T$ or $R$
- Big idea: Compute all averages over $T$ using sample outcomes
Temporal-Difference Learning
Temporal-Difference Learning
- Big idea: learn from every experience
- Update $V(s)$ each time we experience a transition $(s, a, s', r)$
- Likely outcomes $s'$ will contribute updates more often
- Temporal difference learning of values
- Policy still fixed, still doing evaluation
- Move values toward value of whatever successor occurs: running average
$$\begin{equation}
\mbox{Sample of } V(s)\mbox{: } sample = R(s,\pi(s),s') + \gamma V^{\pi}(s')
\end{equation}$$
$$\begin{equation}
\mbox{Update to } V(s)\mbox{: } V^{\pi}(s) \leftarrow (1-\alpha)V^{\pi}(s) + \alpha \cdot sample
\end{equation}$$
$$\begin{equation}
\mbox{Same update, rewritten: } V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\left(sample - V^{\pi}(s)\right)
\end{equation}$$
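A minimal sketch of this update in Python, assuming the values live in a dictionary `V` keyed by state (the names `td_update`, `alpha`, `gamma` are illustrative, not from any course codebase):

```python
# TD(0) policy evaluation: blend each observed sample into a running average of V(s).
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move V(s) toward the one-step sample r + gamma * V(s')."""
    sample = r + gamma * V.get(s_next, 0.0)               # sample of V(s)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # running average
    return V[s]
```

Each observed transition $(s, \pi(s), s', r)$ triggers one such update; the transition model $T$ is never consulted.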
Exponential Moving Average
- Exponential moving average
- The running interpolation update:
$$\begin{equation}
\bar{x}_n = (1-\alpha)\cdot \bar{x}_{n-1} + \alpha \cdot x_n \nonumber
\end{equation}$$
- Makes recent samples more important:
$$\begin{equation}
\bar{x}_n = \frac{x_n + (1-\alpha)\cdot x_{n-1} + (1-\alpha)^2 \cdot x_{n-2} + \dots}{1+(1-\alpha)+(1-\alpha)^2 + \dots} \nonumber
\end{equation}$$
- Forgets about the past (distant past values were wrong anyway)
- Decreasing learning rate ($\alpha$) can give converging averages
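A quick numeric check of the last point, using arbitrary made-up samples: with a learning rate that decreases as $1/n$, the running interpolation update recovers the ordinary sample mean.

```python
# With alpha_n = 1/n, the running interpolation update equals the exact sample mean.
xs = [10.0, 0.0, 4.0, 6.0]   # arbitrary sample sequence

xbar = 0.0
for n, x in enumerate(xs, start=1):
    alpha = 1.0 / n                        # decreasing learning rate
    xbar = (1 - alpha) * xbar + alpha * x  # running interpolation update

print(xbar, sum(xs) / len(xs))  # both print 5.0
```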
Example: Temporal-Difference Learning
Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
- However, if we want to turn values into a (new) policy, we are sunk:
$$\begin{equation}
\begin{aligned}
\pi(s) &= \underset{a}{\operatorname{arg max}}\ Q(s, a) \nonumber \\
Q(s,a) &= \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V(s')] \nonumber
\end{aligned}
\end{equation}$$
- Idea: learn $Q$-values, not values
- Makes action selection model-free too
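A minimal sketch of the fix, assuming a hypothetical dict-of-dicts Q-table `Q[s][a]`: once $Q$-values are learned, greedy action selection never touches $T$ or $R$.

```python
# Greedy policy extraction from learned Q-values: no model T or R needed.
def greedy_policy(Q, s):
    """Return arg max_a Q(s, a), with Q stored as a dict of dicts Q[s][a]."""
    return max(Q[s], key=Q[s].get)

# By contrast, extracting a policy from V(s) alone would require summing over
# T(s, a, s') [R(s, a, s') + gamma * V(s')], i.e. a known model.
```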
Active Reinforcement Learning
Active Reinforcement Learning
- Full reinforcement learning: optimal policies (like value iteration)
- You do not know the transitions $T(s,a,s')$
- You do not know the rewards $R(s,a,s')$
- You choose the actions now
- Goal: learn the optimal policy / values
- In this case:
- Learner makes choices.
- Fundamental trade-off: exploration vs. exploitation
- This is NOT off-line planning. You actually take actions in the world and find out what happens $\dots$
Detour: $Q$-Value Iteration
- Value iteration: find successive (depth-limited) values
- Start with $V_0(s) = 0$, which we know is right
- Given $V_k$, calculate the depth $k+1$ values for all states:
$$\begin{equation}
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma V_k(s')] \nonumber
\end{equation}$$
- But $Q$-values are more useful, so compute them instead
- Start with $Q_0(s,a) = 0$ which we know is right
- Given $Q_k$, calculate the depth $k+1$ $Q$-values for all $Q$-states:
$$\begin{equation}
Q_{k+1}(s,a) \leftarrow \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma \max_{a'}Q_k(s',a')] \nonumber
\end{equation}$$
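A minimal sketch of one sweep of this iteration, assuming the model is available as hypothetical dictionaries `T[(s, a)]` (a list of successor/probability pairs) and `R[(s, a, s')]`; this is only possible offline, when the MDP is known:

```python
# One sweep of Q-value iteration over a known MDP.
# T[(s, a)] is a list of (s_next, prob) pairs; R[(s, a, s_next)] is the reward.
def q_iteration_step(Q, states, actions, T, R, gamma=0.9):
    Q_next = {}
    for s in states:
        for a in actions(s):
            Q_next[(s, a)] = sum(
                prob * (R[(s, a, s2)]
                        + gamma * max((Q[(s2, a2)] for a2 in actions(s2)),
                                      default=0.0))
                for s2, prob in T[(s, a)]
            )
    return Q_next
```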
$Q$-Learning
- $Q$-Learning: sample-based $Q$-value iteration (approximate the update below using observed samples)
$$\begin{equation}
Q_{k+1}(s,a) \leftarrow \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma \max_{a'}Q_k(s',a')]
\end{equation}$$
- Learn $Q(s,a)$ values as you go
- Receive a sample $(s,a,s',r)$
- Consider your old estimate: $Q(s,a)$
- Consider your new sample estimate:
$$\begin{equation}
sample = R(s,a,s') + \gamma \max_{a'}Q(s',a')
\end{equation}$$
- Incorporate the new estimate into a running average:
$$\begin{equation}
Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha \cdot sample
\end{equation}$$
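A minimal sketch of this update as code, assuming a flat Q-table keyed by `(state, action)` and a hypothetical `legal_actions` helper:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q(s, a) starts at 0 for every (s, a)

def q_learning_update(s, a, r, s_next, legal_actions, alpha=0.1, gamma=0.9):
    """Fold one observed sample (s, a, s', r) into the running average Q(s, a)."""
    # Sample estimate: reward plus discounted value of the best next action.
    best_next = max((Q[(s_next, a2)] for a2 in legal_actions(s_next)), default=0.0)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```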
$Q$-Learning Properties
- Amazing result: $Q$-learning converges to the optimal policy -- even if you are acting sub-optimally.
- This is called off-policy learning.
- Caveats:
- You have to explore enough
- You have to eventually make the learning rate small enough
- $\dots$ but not decrease it too quickly
- Basically, in the limit, it does not matter how you select actions.
The Story So Far: MDPs vs RL
Exploration vs Exploitation
How to Explore?
- Several schemes for forcing exploration
- Simplest: random actions ($\epsilon$-greedy)
- Every time step, flip a coin
- With (small) probability $\epsilon$, act randomly
- With (large) probability $1-\epsilon$, act on current policy
- Problems with random actions?
- You do eventually explore the space, but keep thrashing around once learning is done
- One solution: lower $\epsilon$ over time
- Another solution: exploration functions
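A minimal sketch of $\epsilon$-greedy action selection over a Q-table (hypothetical names; the decay schedule is just one way to lower $\epsilon$ over time):

```python
import random

def epsilon_greedy(Q, s, legal_actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act on the current Q estimates."""
    actions = legal_actions(s)
    if random.random() < epsilon:
        return random.choice(actions)             # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit

# One way to lower epsilon over time: multiplicative decay with a floor.
def decayed(epsilon, rate=0.99, floor=0.01):
    return max(floor, epsilon * rate)
```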
Exploration Functions
- When to explore?
- Random actions: explore a fixed amount
- Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
- Exploration function
- Takes a value estimate $u$ and a visit count $n$, and returns an optimistic utility, e.g. $\color{cyan}{f(u,n)=u+\frac{k}{n+1}}$
$$\begin{equation}
\mbox{Regular } Q\mbox{-Update: } Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha[R(s,a,s')+\gamma \max_{a'}Q(s',a')]
\end{equation}$$
$$\begin{equation}
\mbox{Modified } Q\mbox{-Update: } Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha[R(s,a,s')+\gamma \max_{a'}f(Q(s',a'),N(s',a'))]
\end{equation}$$
- Note: this propagates the ``bonus'' back to states that lead to unknown states as well.
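A minimal sketch of the modified update, assuming visit counts `N[(s, a)]` are kept alongside the Q-table and using the exploration function shown above:

```python
# Q-learning with an exploration function f(u, n) = u + k / (n + 1).
def f(u, n, k=1.0):
    """Optimistic utility: value estimate plus a bonus that shrinks with visits."""
    return u + k / (n + 1)

def modified_q_update(Q, N, s, a, r, s_next, legal_actions, alpha=0.1, gamma=0.9):
    N[(s, a)] = N.get((s, a), 0) + 1   # count this visit to (s, a)
    # Backing up optimistic values propagates the exploration bonus to
    # states that merely lead to under-explored states.
    best = max((f(Q.get((s_next, a2), 0.0), N.get((s_next, a2), 0))
                for a2 in legal_actions(s_next)), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best)
```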
Announcements
- No class on Tuesday (3rd Nov. 2020)
- Mid-Term exam solution discussion on Thursday (5th Nov. 2020)