Markov Decision Process - II


CSE 440: Introduction to Artificial Intelligence

Vishnu Boddeti

October 20, 2020


Content Credits: CMU AI, http://ai.berkeley.edu

Today

  • Value Iteration
  • Policy Iteration
    • Policy Evaluation
    • Policy Extraction

Example: Grid World

  • A maze-like problem
    • The agent lives in a grid
    • Walls block the agent's path
  • Noisy movement: actions do not always go as planned
    • 80% of the time, the action North takes the agent North (if there is no wall there)
    • 10% of the time, North takes the agent West; 10% East
    • If there is a wall in the direction the agent would have been taken, the agent stays put
  • The agent receives rewards each time step
    • Small "living" reward each step (can be negative)
    • Big rewards come at the end (good or bad)
  • Goal: maximize sum of rewards
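
To make the noisy dynamics concrete, here is a minimal sketch of how this transition model could be encoded. It is an illustration, not the course's code; the coordinate convention, the `walls` set, and the function name are assumptions.

```python
# Illustrative sketch (not course code) of the noisy Grid World dynamics:
# 80% intended direction, 10% each perpendicular direction, walls block movement.
NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)
PERPENDICULAR = {NORTH: (WEST, EAST), SOUTH: (EAST, WEST),
                 EAST: (NORTH, SOUTH), WEST: (SOUTH, NORTH)}

def transition_probs(state, action, walls):
    """Return {next_state: probability} for the noisy movement model."""
    probs = {}
    left, right = PERPENDICULAR[action]
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        nxt = (state[0] + direction[0], state[1] + direction[1])
        if nxt in walls:                      # bumping a wall: stay put
            nxt = state
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```

For example, `transition_probs((1, 1), NORTH, walls={(1, 2)})` puts 0.8 of the probability mass back on `(1, 1)` (the move north is blocked) and 0.1 on each of the two side cells.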

Recap: Defining MDPs

  • Markov decision processes:
    • Set of states $S$
    • Start state $s_0$
    • Set of actions $A$
    • Transitions $P(s'|s,a)$ (or $T(s,a,s')$)
    • Rewards $R(s,a,s')$ (and discount $\gamma$)

  • MDP quantities so far:
    • Policy = Choice of action for each state
    • Utility = sum of (discounted) rewards

Solving MDPs

Optimal Quantities

  • The value (utility) of a state $s$: $V^*(s) =$ expected utility starting in $s$ and acting optimally

  • The value (utility) of a $q$-state $(s,a)$: $Q^*(s,a) =$ expected utility starting out having taken action $a$ from state $s$ and (thereafter) acting optimally

  • The optimal policy: $\pi^*(s) = $ optimal action from state $s$

Values of States

  • Fundamental operation: compute the (expectimax) value of a state
    • Expected utility under optimal action
    • Average sum of (discounted) rewards
    • This is just what expectimax computed
  • Recursive definition of value:
  • $$\begin{eqnarray} V^*(s) &=& \max_{a} Q^*(s,a) \nonumber \\ Q^*(s,a) &=& \sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V^*(s')] \nonumber \\ V^*(s) &=& \max_{a} \sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V^*(s')] \nonumber \end{eqnarray}$$
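
These coupled equations map directly onto one-step backup functions. The sketch below is illustrative: it assumes the MDP is stored as a table `P[(s, a)]` of `(prob, next_state, reward)` triples, a representation chosen here for convenience rather than taken from the slides.

```python
# One-step Bellman backups, assuming the MDP is stored as a table
#   P[(s, a)] = [(prob, next_state, reward), ...]   (illustrative representation).
def q_value(P, V, s, a, gamma):
    """Q*(s,a) = sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

def state_value(P, V, s, legal_actions, gamma):
    """V*(s) = max over a of Q*(s,a)."""
    return max(q_value(P, V, s, a, gamma) for a in legal_actions(s))
```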

Racing Search Tree

  • We are doing way too much work with expectimax.

  • Problem: States are repeated
    • Idea: Only compute needed quantities once

  • Problem: Tree goes on forever
    • Idea: Do a depth-limited computation, but with increasing depths until change is small
    • Note: deep parts of the tree eventually do not matter if $\gamma < 1$

Time-Limited Values

  • Key idea: time-limited values
  • Define $V_k(s)$ to be the optimal value of $s$ if the game ends in $k$ more time steps
    • Equivalently, it is what a depth-$k$ expectimax would give from $s$

Value Iteration

Value Iteration

  • Start with $V_0(s) = 0$: no time steps left means an expected reward sum of zero
  • Given vector of $V_k(s)$ values, do one ply of expectimax from each state:
  • \[V_{k+1}(s) \leftarrow \max_{a} \sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V_k(s')]\]
  • Repeat until convergence
  • Complexity of each iteration: $\mathcal{O}(S^2A)$
  • Theorem: will converge to unique optimal values
    • Basic idea: approximations get refined towards optimal values
    • Policy may converge long before values do
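
A minimal value-iteration sketch of the update above, reusing the `q_value` helper from the earlier sketch and the same assumed `P[(s, a)]` table (all names are illustrative, not course code):

```python
def value_iteration(states, legal_actions, P, gamma=0.9, tol=1e-6):
    """Repeat V_{k+1}(s) = max_a sum_s' T(s,a,s')[R(s,a,s') + gamma V_k(s')]
    until the largest change across states falls below tol."""
    V = {s: 0.0 for s in states}                          # V_0(s) = 0
    while True:
        V_new = {
            s: max((q_value(P, V, s, a, gamma) for a in legal_actions(s)),
                   default=0.0)                           # terminal states stay at 0
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```

Once the values have converged, the policy is read off with one more step of lookahead (policy extraction, later in this lecture).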

Example Value Iteration

            Cool   Warm   Overheated
  $V_2$     3.5    2.5    0
  $V_1$     2      1      0
  $V_0$     0      0      0
$$\begin{equation} V_{k+1}(s) \leftarrow \max_{a} \sum_{s'}T(s,a,s')[R(s,a,s') + V_k(s')] \nonumber \end{equation}$$
  • Assume no discount
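
To see where the table entries come from, here is the $k=2$ backup written out, assuming the usual racing-car parameters (not restated on this slide): slow earns +1, fast earns +2, overheating earns $-10$ and ends the game; fast from Cool reaches Cool or Warm with probability 0.5 each, slow from Cool stays Cool, slow from Warm reaches Cool or Warm with probability 0.5 each, and fast from Warm overheats.

$$\begin{eqnarray} V_2(\text{Cool}) &=& \max\left(1 + V_1(\text{Cool}),\; 0.5[2 + V_1(\text{Cool})] + 0.5[2 + V_1(\text{Warm})]\right) = \max(3,\, 3.5) = 3.5 \nonumber \\ V_2(\text{Warm}) &=& \max\left(0.5[1 + V_1(\text{Cool})] + 0.5[1 + V_1(\text{Warm})],\; -10 + V_1(\text{Overheated})\right) = \max(2.5,\, -10) = 2.5 \nonumber \end{eqnarray}$$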

Convergence

  • How do we know the $V_k$ vectors are going to converge?
  • Case 1: If the tree has maximum depth $M$, then $V_M$ holds the actual untruncated values
  • Case 2: If the discount is less than 1
    • Sketch: For any state, $V_k$ and $V_{k+1}$ can both be viewed as depth-$(k+1)$ expectimax results computed on nearly identical search trees
    • The difference is that on the bottom layer, $V_{k+1}$ has actual rewards while $V_k$ has zeros
    • That last layer is at best all $R_{MAX}$
    • It is at worst $R_{MIN}$
    • But everything is discounted by $\gamma^k$ that far out
    • So $V_k$ and $V_{k+1}$ are at most $\gamma^k \max|R|$ different
    • So as $k$ increases, the values converge
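
Stated compactly, the argument above gives, for every state $s$,

$$\begin{equation} \left|V_{k+1}(s) - V_k(s)\right| \leq \gamma^{k} \max|R| \nonumber \end{equation}$$

which goes to zero as $k$ grows whenever $\gamma < 1$.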

Policy Evaluation

Fixed Policies

  • Expectimax trees max over all actions to compute the optimal values
  • If we fixed some policy $\pi(s)$, then the tree would be simpler – only one action per state
    • $\dots$ though the tree's value would depend on which policy we fixed

Utilities for a Fixed Policy

  • Another basic operation: compute the utility of a state $s$ under a fixed (generally non-optimal) policy
  • Define the utility of a state $s$, under a fixed policy $\pi$:
  • $V^{\pi}(s) = $ expected total discounted rewards starting in $s$ and following $\pi$
  • Recursive relation (one-step look-ahead / Bellman equation):
  • $$\begin{equation} V^{\pi}(s) = \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V^{\pi}(s')] \nonumber \end{equation}$$

Policy Evaluation

  • How do we calculate the $V$'s for a fixed policy $\pi$?
  • Idea 1: Turn recursive Bellman equations into updates (like value iteration)
  • $$\begin{eqnarray} V_0^{\pi}(s) &=& 0\\ V_{k+1}^{\pi}(s) &=& \sum_{s'}T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_k^{\pi}(s')] \end{eqnarray}$$
  • Efficiency: $\mathcal{O}(S^2)$ per iteration
  • Idea 2: Without the maxes, the Bellman equations are just a linear system
    • Solve with Matlab (or your favorite linear system solver)
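
Both ideas are easy to sketch under the same assumed `P[(s, a)]` representation used earlier. The code below is illustrative rather than course-provided; Idea 2 solves the linear system $(I - \gamma T^{\pi})V^{\pi} = R^{\pi}$ with NumPy instead of Matlab.

```python
import numpy as np

def policy_evaluation_iterative(states, policy, P, gamma=0.9, tol=1e-8):
    """Idea 1: iterate V_{k+1}(s) = sum_s' T(s,pi(s),s')[R + gamma V_k(s')].
    States absent from the policy (e.g. terminals) are held at value 0."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, policy[s])])
               if s in policy else 0.0
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def policy_evaluation_linear(states, policy, P, gamma=0.9):
    """Idea 2: without the max the Bellman equations are linear,
    so solve (I - gamma T_pi) V = R_pi directly."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T_pi = np.zeros((n, n))                    # T_pi[i, j] = P(s_j | s_i, pi(s_i))
    R_pi = np.zeros(n)                         # expected immediate reward under pi
    for s in states:
        if s not in policy:                    # terminal state: value 0
            continue
        for p, s2, r in P[(s, policy[s])]:
            T_pi[idx[s], idx[s2]] += p
            R_pi[idx[s]] += p * r
    v = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: v[idx[s]] for s in states}
```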

Policy Extraction

Computing Values from Actions

  • Imagine we have the optimal values $V^*(s)$
  • How should we act?
    • It is not obvious
  • We need to do a mini-expectimax (one step)
  • $$\begin{equation} \pi^*(s) = \underset{a}{\operatorname{arg max}} \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V^*(s')] \nonumber \end{equation}$$
  • This is called policy extraction, since it gets the policy implied by the values.
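
A sketch of policy extraction under the same assumed representation (illustrative names; terminal states are simply left out of the returned policy):

```python
def extract_policy(states, legal_actions, V, P, gamma=0.9):
    """pi(s) = argmax_a sum_s' T(s,a,s')[R(s,a,s') + gamma V(s')]."""
    return {
        s: max(legal_actions(s),
               key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
        for s in states if legal_actions(s)    # terminal states have no action
    }
```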

Computing Actions from Q-Values

  • Imagine we have the optimal q-values: $Q^*(s,a)$

  • How should we act?
    • Completely trivial to decide !!
    • $$\begin{equation} \pi^*(s) = \underset{a}{\operatorname{arg max}} Q^*(s,a) \nonumber \end{equation}$$

  • Important lesson: actions are easier to select from q-values than values !!
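
By contrast, acting from a (hypothetical) Q-table is a pure lookup, with no transition model and no one-step expectimax:

```python
def policy_from_q(Q, states, legal_actions):
    """pi(s) = argmax_a Q(s,a): a table lookup, no model or lookahead needed."""
    return {s: max(legal_actions(s), key=lambda a: Q[(s, a)])
            for s in states if legal_actions(s)}
```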

Policy Iteration

Problems with Value Iteration

  • Value iteration repeats the Bellman updates:
  • $$\begin{equation} V_{k+1}(s) \leftarrow \max_{a}\sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V_k(s')] \nonumber \end{equation}$$
  • Problem 1: It is slow – $\mathcal{O}(S^2A)$ per iteration

  • Problem 2: The "max" at each state rarely changes

  • Problem 3: The policy often converges long before the values

Policy Iteration

  • Alternative approach for optimal values:
    • Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities) until convergence
    • Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal) utilities as future values
    • Repeat steps until policy converges

  • This is policy iteration
    • It is still optimal !!
    • Can converge (much) faster under some conditions

Policy Iteration

  • Evaluation: For fixed current policy $\pi$, find values with policy evaluation:
    • Iterate until values converge:
    • $$\begin{equation} V_{k+1}^{\pi_i}(s) = \sum_{s'}T(s,\pi_i(s),s')[R(s,\pi_i(s),s')+\gamma V_k^{\pi_i}(s')] \nonumber \end{equation}$$
  • Improvement: For fixed values, get a better policy using policy extraction
    • One-step look-ahead:
    • $$\begin{equation} \pi_{i+1}(s) = \underset{a}{\operatorname{arg max}}\sum_{s'}T(s,a,s')[R(s,a,s')+\gamma V^{\pi_i}(s')] \nonumber \end{equation}$$
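
Putting the two steps together, a compact sketch of policy iteration that reuses the illustrative `policy_evaluation_iterative` and `extract_policy` helpers from the earlier sketches (all names are assumptions; `legal_actions(s)` is assumed to return a list, empty for terminal states):

```python
def policy_iteration(states, legal_actions, P, gamma=0.9):
    """Alternate policy evaluation and one-step-lookahead improvement
    until the policy stops changing."""
    # Start from an arbitrary policy over non-terminal states.
    policy = {s: legal_actions(s)[0] for s in states if legal_actions(s)}
    while True:
        V = policy_evaluation_iterative(states, policy, P, gamma)        # Step 1
        new_policy = extract_policy(states, legal_actions, V, P, gamma)  # Step 2
        if new_policy == policy:               # policy stable => done
            return policy, V
        policy = new_policy
```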

Comparison

  • Both value iteration and policy iteration compute the same thing (all optimal values)
  • In value iteration:
    • Every iteration updates both the values and (implicitly) the policy
    • We do not track the policy, but taking the max over actions implicitly recomputes it.
  • In policy iteration:
    • We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
    • After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
    • The new policy will be better (or we are done)
  • Both are dynamic programs for solving MDPs

Summary: MDP Algorithms

  • So you want to $\dots$
    • Compute optimal values: use value iteration or policy iteration
    • Compute values for a particular policy: use policy evaluation
    • Turn your values into a policy: use policy extraction (one-step lookahead)

  • These all look the same $\dots$
    • They basically are – they are all variations of Bellman updates
    • They all use one-step lookahead expectimax fragments
    • They differ only in whether we plug in a fixed policy or max over actions