Markov Decision Process - II


CSE 440: Introduction to Artificial Intelligence

Vishnu Boddeti

October 20, 2020


Content Credits: CMU AI, http://ai.berkeley.edu

Today

  • Value Iteration
  • Policy Iteration
    • Policy Evaluation
    • Policy Extraction

Example: Grid World

  • A maze-like problem
    • The agent lives in a grid
    • Walls block the agent's path
  • Noisy movement: actions do not always go as planned
    • 80% of the time, the action North takes the agent North (if there is no wall there)
    • 10% of the time, North takes the agent West; 10% East
    • If there is a wall in the direction the agent would have been taken, the agent stays put
  • The agent receives rewards each time step
    • Small "living" reward each step (can be negative)
    • Big rewards come at the end (good or bad)
  • Goal: maximize sum of rewards
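
To make the noisy dynamics concrete, here is a minimal sketch of how this transition model could be encoded. It is an illustration, not the course's code; the coordinate convention, the `walls` set, and the function name are assumptions.

```python
# Illustrative sketch (not course code) of the noisy Grid World dynamics:
# 80% intended direction, 10% each perpendicular direction, walls block movement.
NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)
PERPENDICULAR = {NORTH: (WEST, EAST), SOUTH: (EAST, WEST),
                 EAST: (NORTH, SOUTH), WEST: (SOUTH, NORTH)}

def transition_probs(state, action, walls):
    """Return {next_state: probability} for the noisy movement model."""
    probs = {}
    left, right = PERPENDICULAR[action]
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        nxt = (state[0] + direction[0], state[1] + direction[1])
        if nxt in walls:                      # bumping a wall: stay put
            nxt = state
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```

For example, `transition_probs((1, 1), NORTH, walls={(1, 2)})` puts 0.8 of the probability mass back on `(1, 1)` (the move north is blocked) and 0.1 on each of the two side cells.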

Recap: Defining MDPs

  • Markov decision processes:
    • Set of states $S$
    • Start state $s_0$
    • Set of actions $A$
    • Transitions $P(s'|s,a)$ (or $T(s,a,s')$)
    • Rewards $R(s,a,s')$ (and discount $\gamma$)

  • MDP quantities so far:
    • Policy = Choice of action for each state
    • Utility = sum of (discounted) rewards

Solving MDPs

Optimal Quantities

  • The value (utility) of a state $s$: $V^*(s) =$ expected utility starting in $s$ and acting optimally

  • The value (utility) of a $q$-state $(s,a)$: $Q^*(s,a) =$ expected utility starting out having taken action $a$ from state $s$ and (thereafter) acting optimally

  • The optimal policy: $\pi^*(s) = $ optimal action from state $s$

Values of States

  • Fundamental operation: compute the (expectimax) value of a state
    • Expected utility under optimal action
    • Average sum of (discounted) rewards
    • This is just what expectimax computed
  • Recursive definition of value:
  • $$\begin{eqnarray} V^*(s) &=& \max_{a} Q^*(s,a) \nonumber \\ Q^*(s,a) &=& \sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V^*(s')] \nonumber \\ V^*(s) &=& \max_{a} \sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V^*(s')] \nonumber \end{eqnarray}$$
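
These coupled equations map directly onto one-step backup functions. The sketch below is illustrative: it assumes the MDP is stored as a table `P[(s, a)]` of `(prob, next_state, reward)` triples, a representation chosen here for convenience rather than taken from the slides.

```python
# One-step Bellman backups, assuming the MDP is stored as a table
#   P[(s, a)] = [(prob, next_state, reward), ...]   (illustrative representation).
def q_value(P, V, s, a, gamma):
    """Q*(s,a) = sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

def state_value(P, V, s, legal_actions, gamma):
    """V*(s) = max over a of Q*(s,a)."""
    return max(q_value(P, V, s, a, gamma) for a in legal_actions(s))
```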

Racing Search Tree

  • We are doing way too much work with expectimax.

  • Problem: States are repeated
    • Idea: Only compute needed quantities once

  • Problem: Tree goes on forever
    • Idea: Do a depth-limited computation, but with increasing depths until change is small
    • Note: deep parts of the tree eventually do not matter if $\gamma < 1$

Time-Limited Values

  • Key idea: time-limited values
  • Define $V_k(s)$ to be the optimal value of $s$ if the game ends in $k$ more time steps
    • Equivalently, it is what a depth-$k$ expectimax would give from $s$

Value Iteration

Value Iteration

  • Start with $V_0(s) = 0$: no time steps left means an expected reward sum of zero
  • Given vector of $V_k(s)$ values, do one ply of expectimax from each state:
  • \[V_{k+1}(s) \leftarrow \max_{a} \sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V_k(s')]\]
  • Repeat until convergence
  • Complexity of each iteration: $\mathcal{O}(S^2A)$
  • Theorem: will converge to unique optimal values
    • Basic idea: approximations get refined towards optimal values
    • Policy may converge long before values do
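
A minimal value-iteration sketch of the update above, reusing the `q_value` helper from the earlier sketch and the same assumed `P[(s, a)]` table (all names are illustrative, not course code):

```python
def value_iteration(states, legal_actions, P, gamma=0.9, tol=1e-6):
    """Repeat V_{k+1}(s) = max_a sum_s' T(s,a,s')[R(s,a,s') + gamma V_k(s')]
    until the largest change across states falls below tol."""
    V = {s: 0.0 for s in states}                          # V_0(s) = 0
    while True:
        V_new = {
            s: max((q_value(P, V, s, a, gamma) for a in legal_actions(s)),
                   default=0.0)                           # terminal states stay at 0
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```

Once the values have converged, the policy is read off with one more step of lookahead (policy extraction, later in this lecture).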

Example Value Iteration

            Cool   Warm   Overheated
  $V_2$     3.5    2.5    0
  $V_1$     2      1      0
  $V_0$     0      0      0
$$\begin{equation} V_{k+1}(s) \leftarrow \max_{a} \sum_{s'}T(s,a,s')[R(s,a,s') + V_k(s')] \nonumber \end{equation}$$
  • Assume no discount
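
To see where the table entries come from, here is the $k=2$ backup written out, assuming the usual racing-car parameters (not restated on this slide): slow earns +1, fast earns +2, overheating earns $-10$ and ends the game; fast from Cool reaches Cool or Warm with probability 0.5 each, slow from Cool stays Cool, slow from Warm reaches Cool or Warm with probability 0.5 each, and fast from Warm overheats.

$$\begin{eqnarray} V_2(\text{Cool}) &=& \max\left(1 + V_1(\text{Cool}),\; 0.5[2 + V_1(\text{Cool})] + 0.5[2 + V_1(\text{Warm})]\right) = \max(3,\, 3.5) = 3.5 \nonumber \\ V_2(\text{Warm}) &=& \max\left(0.5[1 + V_1(\text{Cool})] + 0.5[1 + V_1(\text{Warm})],\; -10 + V_1(\text{Overheated})\right) = \max(2.5,\, -10) = 2.5 \nonumber \end{eqnarray}$$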

Convergence

  • How do we know the $V_k$ vectors are going to converge?
  • Case 1: If the tree has maximum depth $M$, then $V_M$ holds the actual untruncated values
  • Case 2: If the discount is less than 1
    • Sketch: For any state, $V_k$ and $V_{k+1}$ can both be viewed as depth-$(k+1)$ expectimax results computed on nearly identical search trees
    • The difference is that on the bottom layer, $V_{k+1}$ has actual rewards while $V_k$ has zeros
    • That last layer is at best all $R_{MAX}$
    • It is at worst $R_{MIN}$
    • But everything is discounted by $\gamma^k$ that far out
    • So $V_k$ and $V_{k+1}$ are at most $\gamma^k \max|R|$ different
    • So as $k$ increases, the values converge
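
Stated compactly, the argument above gives, for every state $s$,

$$\begin{equation} \left|V_{k+1}(s) - V_k(s)\right| \leq \gamma^{k} \max|R| \nonumber \end{equation}$$

which goes to zero as $k$ grows whenever $\gamma < 1$.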

Policy Evaluation

Fixed Policies

  • Expectimax trees max over all actions to compute the optimal values
  • If we fixed some policy $\pi(s)$, then the tree would be simpler – only one action per state
    • $\dots$ though the tree's value would depend on which policy we fixed

Utilities for a Fixed Policy

  • Another basic operation: compute the utility of a state $s$ under a fixed (generally non-optimal) policy
  • Define the utility of a state $s$, under a fixed policy $\pi$:
  • $V^{\pi}(s) = $ expected total discounted rewards starting in $s$ and following $\pi$
  • Recursive relation (one-step look-ahead / Bellman equation):
  • $$\begin{equation} V^{\pi}(s) = \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V^{\pi}(s')] \nonumber \end{equation}$$

Policy Evaluation

  • How do we calculate the $V$'s for a fixed policy $\pi$?
  • Idea 1: Turn recursive Bellman equations into updates (like value iteration)
  • $$\begin{eqnarray} V_0^{\pi}(s) &=& 0\\ V_{k+1}^{\pi}(s) &=& \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_k^{\pi}(s')] \end{eqnarray}$$
  • Efficiency: $\mathcal{O}(S^2)$ per iteration
  • Idea 2: Without the maxes, the Bellman equations are just a linear system
    • Solve with Matlab (or your favorite linear system solver)
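
Both ideas are easy to sketch under the same assumed `P[(s, a)]` representation used earlier. The code below is illustrative rather than course-provided; Idea 2 solves the linear system $(I - \gamma T^{\pi})V^{\pi} = R^{\pi}$ with NumPy instead of Matlab.

```python
import numpy as np

def policy_evaluation_iterative(states, policy, P, gamma=0.9, tol=1e-8):
    """Idea 1: iterate V_{k+1}(s) = sum_s' T(s,pi(s),s')[R + gamma V_k(s')].
    States absent from the policy (e.g. terminals) are held at value 0."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, policy[s])])
               if s in policy else 0.0
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def policy_evaluation_linear(states, policy, P, gamma=0.9):
    """Idea 2: without the max the Bellman equations are linear,
    so solve (I - gamma T_pi) V = R_pi directly."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T_pi = np.zeros((n, n))                    # T_pi[i, j] = P(s_j | s_i, pi(s_i))
    R_pi = np.zeros(n)                         # expected immediate reward under pi
    for s in states:
        if s not in policy:                    # terminal state: value 0
            continue
        for p, s2, r in P[(s, policy[s])]:
            T_pi[idx[s], idx[s2]] += p
            R_pi[idx[s]] += p * r
    v = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: v[idx[s]] for s in states}
```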

Policy Extraction

Computing Values from Actions

  • Imagine we have the optimal values $V^*(s)$
  • How should we act?
    • It is not obvious
  • We need to do a mini-expectimax (one step)
  • $$\begin{equation} \pi^*(s) = \underset{a}{\operatorname{arg max}} \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V^*(s')] \nonumber \end{equation}$$
  • This is called policy extraction, since it gets the policy implied by the values.
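
A sketch of policy extraction under the same assumed representation (illustrative names; terminal states are simply left out of the returned policy):

```python
def extract_policy(states, legal_actions, V, P, gamma=0.9):
    """pi(s) = argmax_a sum_s' T(s,a,s')[R(s,a,s') + gamma V(s')]."""
    return {
        s: max(legal_actions(s),
               key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
        for s in states if legal_actions(s)    # terminal states have no action
    }
```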

Computing Actions from Q-Values

  • Imagine we have the optimal q-values: $Q^*(s,a)$

  • How should we act?
    • Completely trivial to decide !!
    • $$\begin{equation} \pi^*(s) = \underset{a}{\operatorname{arg max}} Q^*(s,a) \nonumber \end{equation}$$

  • Important lesson: actions are easier to select from q-values than values !!
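
By contrast, acting from a (hypothetical) Q-table is a pure lookup, with no transition model and no one-step expectimax:

```python
def policy_from_q(Q, states, legal_actions):
    """pi(s) = argmax_a Q(s,a): a table lookup, no model or lookahead needed."""
    return {s: max(legal_actions(s), key=lambda a: Q[(s, a)])
            for s in states if legal_actions(s)}
```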

Policy Iteration

Problems with Value Iteration

  • Value iteration repeats the Bellman updates:
  • $$\begin{equation} V_{k+1}(s) \leftarrow \max_{a}\sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V_k(s')] \nonumber \end{equation}$$
  • Problem 1: It is slow – $\mathcal{O}(S^2A)$ per iteration

  • Problem 2: The "max" at each state rarely changes

  • Problem 3: The policy often converges long before the values

Policy Iteration

  • Alternative approach for optimal values:
    • Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities) until convergence
    • Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal) utilities as future values
    • Repeat steps until policy converges

  • This is policy iteration
    • It is still optimal !!
    • Can converge (much) faster under some conditions

Policy Iteration

  • Evaluation: For fixed current policy $\pi$, find values with policy evaluation:
    • Iterate until values converge:
    • $$\begin{equation} V_{k+1}^{\pi_i}(s) = \sum_{s'}T(s,\pi_i(s),s')[R(s,\pi_i(s),s')+\gamma V_k^{\pi_i}(s')] \nonumber \end{equation}$$
  • Improvement: For fixed values, get a better policy using policy extraction
    • One-step look-ahead:
    • $$\begin{equation} \pi_{i+1}(s) = \underset{a}{\operatorname{arg max}}\sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V^{\pi_i}(s')] \nonumber \end{equation}$$
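
Putting the two steps together, a compact sketch of policy iteration that reuses the illustrative `policy_evaluation_iterative` and `extract_policy` helpers from the earlier sketches (all names are assumptions; `legal_actions(s)` is assumed to return a list, empty for terminal states):

```python
def policy_iteration(states, legal_actions, P, gamma=0.9):
    """Alternate policy evaluation and one-step-lookahead improvement
    until the policy stops changing."""
    # Start from an arbitrary policy over non-terminal states.
    policy = {s: legal_actions(s)[0] for s in states if legal_actions(s)}
    while True:
        V = policy_evaluation_iterative(states, policy, P, gamma)        # Step 1
        new_policy = extract_policy(states, legal_actions, V, P, gamma)  # Step 2
        if new_policy == policy:               # policy stable => done
            return policy, V
        policy = new_policy
```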

Comparison

  • Both value iteration and policy iteration compute the same thing (all optimal values)
  • In value iteration:
    • Every iteration updates both the values and (implicitly) the policy
    • We do not track the policy, but taking the max over actions implicitly recomputes it.
  • In policy iteration:
    • We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
    • After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
    • The new policy will be better (or we are done)
  • Both are dynamic programs for solving MDPs

Summary: MDP Algorithms

  • So you want to $\dots$
    • Compute optimal values: use value iteration or policy iteration
    • Compute values for a particular policy: use policy evaluation
    • Turn your values into a policy: use policy extraction (one-step lookahead)

  • These all look the same $\dots$
    • They basically are – they are all variations of Bellman updates
    • They all use one-step lookahead expectimax fragments
    • They differ only in whether we plug in a fixed policy or max over actions