Reinforcement Learning - II


CSE 440: Introduction to Artificial Intelligence

Vishnu Boddeti

October 29, 2020


Content Credits: CMU AI, http://ai.berkeley.edu

Today

  • Temporal-Difference Learning
  • Active Reinforcement Learning
  • Exploration vs Exploitation

Recap: Reinforcement Learning

  • Still assume a Markov decision process (MDP):
    • A set of states $s \in S$
    • A set of actions per state $A$
    • A model $T(s,a,s')$
    • A reward function $R(s,a,s')$
  • Still looking for a policy $\pi(s)$
  • New twist: do not know $T$ or $R$
  • Big idea: Compute all averages over $T$ using sample outcomes

Temporal-Difference Learning

Temporal-Difference Learning

  • Big idea: learn from every experience
    • Update $V(s)$ each time we experience a transition $(s, a, s', r)$
    • Likely outcomes $s'$ will contribute updates more often
  • Temporal difference learning of values
    • Policy still fixed, still doing evaluation
    • Move values toward value of whatever successor occurs: running average

$$\begin{equation} \begin{aligned} \mbox{Sample of } V(s)\mbox{: } & \; sample = R(s,\pi(s),s') + \gamma V^{\pi}(s') \nonumber \\ \mbox{Update to } V(s)\mbox{: } & \; V^{\pi}(s) \leftarrow (1-\alpha)V^{\pi}(s) + \alpha \cdot sample \nonumber \\ \mbox{Same update: } & \; V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha(sample - V^{\pi}(s)) \nonumber \end{aligned} \end{equation}$$
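
As a minimal sketch of this update in Python (the value table `V`, learning rate `alpha`, and discount `gamma` are illustrative assumptions, not code from the lecture):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update of the value estimate for a fixed policy."""
    sample = r + gamma * V[s_next]               # sample = R(s, pi(s), s') + gamma * V(s')
    V[s] = (1 - alpha) * V[s] + alpha * sample   # running average, moving V(s) toward the sample
    return V
```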

Exponential Moving Average

  • Exponential moving average
    • The running interpolation update:
    • $$\begin{equation} \bar{x}_n = (1-\alpha)\cdot \bar{x}_{n-1} + \alpha \cdot x_n \nonumber \end{equation}$$
    • Makes recent samples more important:
    • $$\begin{equation} \bar{x}_n = \frac{x_n + (1-\alpha)\cdot x_{n-1} + (1-\alpha)^2 \cdot x_{n-2} + \dots}{1+(1-\alpha)+(1-\alpha)^2 + \dots} \nonumber \end{equation}$$
    • Forgets about the past (distant past values were wrong anyway)
  • Decreasing learning rate ($\alpha$) can give converging averages
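
A hedged illustration of the running interpolation (the initial estimate of 0.0 and the sample stream are placeholders chosen for the example):

```python
def exponential_moving_average(samples, alpha=0.5):
    """Running interpolation: x_bar_n = (1 - alpha) * x_bar_{n-1} + alpha * x_n."""
    x_bar = 0.0                       # initial estimate (assumption for illustration)
    trace = []
    for x in samples:
        x_bar = (1 - alpha) * x_bar + alpha * x
        trace.append(x_bar)
    return trace

# Recent samples dominate: the estimate quickly tracks the jump from 0 to 10.
print(exponential_moving_average([0, 0, 0, 10, 10, 10]))   # [0.0, 0.0, 0.0, 5.0, 7.5, 8.75]
```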

Example: Temporal-Difference Learning

Problems with TD Value Learning

  • TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
  • However, if we want to turn values into a (new) policy, we are sunk:
$$\begin{equation} \begin{aligned} \pi(s) &= \underset{a}{\operatorname{arg max}}\, Q(s, a) \nonumber \\ Q(s,a) &= \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V(s')] \nonumber \end{aligned} \end{equation}$$
  • Idea: learn $Q$-values, not values
  • Makes action selection model-free too
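
Once $Q$-values are available, acting greedily needs no model at all. A small sketch, assuming $Q$ is stored as a dictionary keyed by `(state, action)` and `actions(s)` is a hypothetical helper returning the legal actions in `s`:

```python
def greedy_policy(Q, s, actions):
    """pi(s) = argmax_a Q(s, a): no transition model or reward model needed."""
    return max(actions(s), key=lambda a: Q[(s, a)])
```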

Active Reinforcement Learning

Active Reinforcement Learning

  • Full reinforcement learning: optimal policies (like value iteration)
    • You do not know the transitions $T(s,a,s')$
    • You do not know the rewards $R(s,a,s')$
    • You choose the actions now
    • Goal: learn the optimal policy / values

  • In this case:
    • Learner makes choices.
    • Fundamental trade-off: exploration vs. exploitation
    • This is NOT off-line planning. You actually take actions in the world and find out what happens $\dots$

Detour: $Q$-Value Iteration

  • Value iteration: find successive (depth-limited) values
    • Start with $V_0(s) = 0$, which we know is right
    • Given $V_k$, calculate the depth $k+1$ values for all states:
    • $$\begin{equation} V_{k+1}(s) \leftarrow \max_{a} \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma V_k(s')] \nonumber \end{equation}$$
  • But $Q$-values are more useful, so compute them instead
    • Start with $Q_0(s,a) = 0$, which we know is right
    • Given $Q_k$, calculate the depth $k+1$ $Q$-values for all $Q$-states:
    • $$\begin{equation} Q_{k+1}(s,a) \leftarrow \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma \max_{a'}Q_k(s',a')] \nonumber \end{equation}$$
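
A minimal sketch of one such sweep, assuming the MDP is given explicitly: `T[(s, a)]` lists `(successor, probability)` pairs and `R[(s, a, s2)]` gives rewards (these names and data structures are assumptions for illustration):

```python
def q_iteration_step(Q, states, actions, T, R, gamma=0.9):
    """One synchronous Q-value iteration update over all q-states (s, a)."""
    Q_next = {}
    for s in states:
        for a in actions(s):
            Q_next[(s, a)] = sum(
                p * (R[(s, a, s2)] + gamma * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0))
                for s2, p in T[(s, a)]       # expectation over successors s'
            )
    return Q_next
```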

$Q$-Learning

  • $Q$-Learning: sample-based $Q$-value iteration
  • $$\begin{equation} Q_{k+1}(s,a) \leftarrow \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma \max_{a'}Q_k(s',a')] \end{equation}$$
  • Learn $Q(s,a)$ values as you go
    • Receive a sample $(s,a,s',r)$
    • Consider your old estimate: $Q(s,a)$
    • Consider your new sample estimate:
    • $$\begin{equation} sample = R(s,a,s') + \gamma \max_{a'}Q(s',a') \end{equation}$$
    • Incorporate the new estimate into a running average:
    • $$\begin{equation} Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha \cdot sample \end{equation}$$
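
A hedged sketch of this sample-based update (the $Q$ table keyed by `(state, action)`, the `actions` helper, `alpha`, and `gamma` are illustrative assumptions):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Fold one experienced transition (s, a, s', r) into the running average Q(s, a)."""
    sample = r + gamma * max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q
```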

$Q$-Learning Properties

  • Amazing result: $Q$-learning converges to optimal policy -- even if you are acting sub-optimally.
  • This is called off-policy learning.
  • Caveats:
    • You have to explore enough
    • You have to eventually make the learning rate small enough
    • $\dots$ but not decrease it too quickly
  • Basically, in the limit, it does not matter how you select actions.

The Story So Far: MDPs vs RL

Exploration vs Exploitation

How to Explore?

  • Several schemes for forcing exploration
    • Simplest: random actions ($\epsilon$-greedy)
      • Every time step, flip a coin
      • With (small) probability $\epsilon$, act randomly
      • With (large) probability $1-\epsilon$, act on current policy
    • Problems with random actions?
      • You do eventually explore the space, but keep thrashing around once learning is done
      • One solution: lower $\epsilon$ over time
      • Another solution: exploration functions
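
An $\epsilon$-greedy action selector as a small sketch (the `Q` table and `actions(s)` helper returning a list of legal actions are assumptions carried over from the earlier sketches):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily on the current Q."""
    if random.random() < epsilon:
        return random.choice(actions(s))
    return max(actions(s), key=lambda a: Q[(s, a)])
```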

Exploration Functions

  • When to explore?
    • Random actions: explore a fixed amount
    • Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
  • Exploration function
    • Takes a value estimate $u$ and a visit count $n$, and returns an optimistic utility, e.g., $f(u,n)=u+\frac{k}{n+1}$
    • $$\begin{equation} \mbox{Regular } Q\mbox{-Update: } Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha[R(s,a,s')+\gamma \max_{a'}Q(s',a')] \end{equation}$$ $$\begin{equation} \mbox{Modified } Q\mbox{-Update: } Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha[R(s,a,s')+\gamma \max_{a'}f(Q(s',a'),N(s',a'))] \end{equation}$$
    • Note: this propagates the "bonus" back to states that lead to unknown states as well, as in the sketch below.
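
A sketch of the modified update with $f(u,n)=u+\frac{k}{n+1}$; the visit-count table `N`, the constant `k`, and the other names are illustrative assumptions, not the lecture's code:

```python
def explored_q_update(Q, N, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, k=1.0):
    """Modified Q-update: back up f(Q(s', a'), N(s', a')) instead of Q(s', a')."""
    def f(u, n):
        return u + k / (n + 1)        # optimistic bonus that shrinks as the visit count grows
    sample = r + gamma * max((f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions(s_next)),
                             default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    N[(s, a)] = N.get((s, a), 0) + 1  # record one more visit to the q-state (s, a)
    return Q, N
```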

Announcements

  • No class on Tuesday (3rd Nov. 2020)
  • Mid-Term exam solution discussion on Thursday (5th Nov. 2020)