Markov Decision Process - I
CSE 440: Introduction to Artificial Intelligence
Vishnu Boddeti
Content Credits: CMU AI, http://ai.berkeley.edu
Today
- Markov Decision Processes (MDPs)
Example: Grid World
- A maze-like problem
- The agent lives in a grid
- Walls block the agent's path
- Noisy movement: actions do not always go as planned
- 80% of the time, the action North takes the agent North (if there is no wall there)
- 10% of the time, North takes the agent West; 10% East
- If there is a wall in the direction the agent would have moved, the agent stays put
- The agent receives rewards each time step
- Small "living" reward each step (can be negative)
- Big rewards come at the end (good or bad)
- Goal: maximize sum of rewards
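As a concrete illustration, here is a minimal Python sketch of the noisy movement rule above; the grid dimensions, wall set, coordinate convention, and function name are illustrative assumptions, not part of the slide.

```python
import random

# Noise model from the slide: 80% intended direction, 10% each perpendicular direction.
NOISE = {
    "North": [("North", 0.8), ("West", 0.1), ("East", 0.1)],
    "South": [("South", 0.8), ("East", 0.1), ("West", 0.1)],
    "East":  [("East", 0.8), ("North", 0.1), ("South", 0.1)],
    "West":  [("West", 0.8), ("South", 0.1), ("North", 0.1)],
}
# Assumed coordinate convention: x grows to the East, y grows to the North.
MOVES = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0)}

def sample_next_state(state, action, walls, width, height):
    """Sample s' given s and a: usually the intended direction, sometimes sideways;
    if the resulting cell is a wall or off the grid, the agent stays put."""
    directions = [d for d, _ in NOISE[action]]
    weights = [p for _, p in NOISE[action]]
    actual = random.choices(directions, weights=weights)[0]
    dx, dy = MOVES[actual]
    nx, ny = state[0] + dx, state[1] + dy
    if (nx, ny) in walls or not (0 <= nx < width and 0 <= ny < height):
        return state  # blocked by a wall or the grid boundary: stay put
    return (nx, ny)

# Example: one noisy step from (1, 0) on a 4x3 grid with an assumed wall at (1, 1)
# next_state = sample_next_state((1, 0), "North", walls={(1, 1)}, width=4, height=3)
```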
Markov Decision Processes
- An MDP is defined by:
- A set of states $s \in S$
- A set of actions $a \in A$
- A transition function $T(s, a, s')$
- Probability that $a$ from $s$ leads to $s'$, i.e., $P(s'| s, a)$
- Also called the model or the dynamics
- A reward function $R(s, a, s')$
- Sometimes just $R(s)$ or $R(s')$
- A start state, $s_0$
- Maybe a terminal state
- MDPs are non-deterministic search problems
- One way to solve them is with expectimax search
- We will have a new tool soon
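To make the definition concrete, here is a minimal Python sketch of the $(S, A, T, R, s_0)$ tuple, together with a depth-limited expectimax computation as one way to solve the MDP; the dataclass layout and all names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class MDP:
    states: list                                                        # S
    actions: Callable[[object], list]                                   # A(s): legal actions in s
    transition: Callable[[object, object], List[Tuple[object, float]]]  # T(s, a) -> [(s', P(s'|s,a)), ...]
    reward: Callable[[object, object, object], float]                   # R(s, a, s')
    start: object                                                       # s0
    terminals: set = field(default_factory=set)                         # optional terminal states

def expectimax_value(mdp: MDP, s, depth: int) -> float:
    """Depth-limited expectimax over the (s, a, s') tree:
    max over actions of the expected reward plus the value of the successor."""
    if s in mdp.terminals or depth == 0:
        return 0.0
    return max(
        sum(p * (mdp.reward(s, a, s2) + expectimax_value(mdp, s2, depth - 1))
            for s2, p in mdp.transition(s, a))
        for a in mdp.actions(s)
    )
```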
What is Markov about MDPs?
- "Markov" generally means that given the present state, the future and the past are independent
- For Markov decision processes, "Markov" means action outcomes depend only on the current state
$$\begin{aligned}
& P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t, S_{t-1}=s_{t-1}, A_{t-1}=a_{t-1}, \dots, S_0=s_0) \\
&= P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)
\end{aligned}$$
- This is just like search, where the successor function could only depend on the current state (not the history)
Andrey Markov (1856-1922)
Policies
- In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
- For MDPs, we want an optimal policy $\pi^*:S\rightarrow A$
- A policy $\pi$ gives an action for each state
- An optimal policy is one that maximizes expected utility if followed
- An explicit policy defines a reflex agent
- Expectimax did not compute entire policies
- It computed the action for a single state only
- Optimal policy when $R(s, a, s') = -0.4$ for all non-terminals $s$.
Optimal Policies
- Four grid-world panels show how the optimal policy changes with the living reward: $R(s) = -0.01$, $R(s) = -0.03$, $R(s) = -0.4$, and $R(s) = -2.0$
Example: Car Racing
- A robot car wants to travel far, quickly
- Three states: Cool, Warm, Overheated
- Two actions: Slow, Fast
- Going faster gets double reward
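The slide fixes the states, the actions, and the "double reward for going fast" rule; the specific transition probabilities and the overheating penalty in the sketch below are illustrative assumptions in the same spirit.

```python
# Racing MDP sketch: (state, action) -> [(next_state, probability, reward), ...].
# Probabilities and the overheating penalty are assumed for illustration;
# the slide only fixes the states, the actions, and "fast earns double reward".
RACING_T = {
    ("Cool", "Slow"): [("Cool", 1.0, 1.0)],
    ("Cool", "Fast"): [("Cool", 0.5, 2.0), ("Warm", 0.5, 2.0)],
    ("Warm", "Slow"): [("Cool", 0.5, 1.0), ("Warm", 0.5, 1.0)],
    ("Warm", "Fast"): [("Overheated", 1.0, -10.0)],  # overheating ends the episode
}
RACING_TERMINALS = {"Overheated"}
RACING_START = "Cool"
```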
Racing Search Tree
MDP Search Tree
- Each MDP state projects an expectimax-like search tree
Utility of Sequences
- What preferences should an agent have over reward sequences?
- More or less? [1,2,2] or [2,3,4]
- Now or later? [0,0,1] or [1,0,0]
Discounting
- It is reasonable to maximize the sum of rewards
- It is also reasonable to prefer rewards now to rewards later
- One solution: values of rewards decay exponentially
- A reward is worth 1 now, $\gamma$ one time step later, and $\gamma^2$ two time steps later
Discounting
- How to discount?
- Each time we descend a level, we multiply in the discount once
- Why discount?
- Sooner rewards probably do have higher utility than later rewards
- Also helps our algorithms converge
- Example: discount of 0.5
- $U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3$
- $U([1,2,3]) < U([3,2,1])$
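A tiny helper makes the example easy to check (the function name is an illustrative choice):

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With a discount of 0.5, as in the example above:
print(discounted_utility([1, 2, 3], 0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 3 + 0.5*2 + 0.25*1 = 4.25
```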
Stationary Preferences
- Theorem: if we assume stationary preferences:
$$\begin{aligned}
[a_1, a_2, \dots] &\succ [b_1, b_2, \dots] \\
&\Updownarrow \\
[r, a_1, a_2, \dots] &\succ [r, b_1, b_2, \dots]
\end{aligned}$$
- Then: there are only two ways to define utilities
- Additive utility: \[U([r_0, r_1, r_2, \dots]) = r_0 + r_1 + r_2 + \dots\]
- Discounted utility: \[U([r_0, r_1, r_2, \dots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots\]
Infinite Utilities
- Problem: What if the game lasts forever? Do we get infinite rewards?
- Solutions:
- Finite horizon (similar to depth-limited search)
- Terminate episodes after a fixed number of steps $T$ (e.g., life)
- Gives non-stationary policies ($\pi$ depends on time left)
- Discounting: use $0 < \gamma < 1$
\[U([r_0,\dots,r_{\infty}]) = \sum_{t=0}^{\infty}\gamma^tr_t \leq \frac{R_{max}}{1-\gamma}\]
- Smaller $\gamma$ means smaller "horizon" – shorter term focus
- Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "overheated" for racing)
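The bound in the discounting case follows from a geometric series, assuming each reward satisfies $r_t \leq R_{max}$:
\[\sum_{t=0}^{\infty}\gamma^t r_t \leq R_{max}\sum_{t=0}^{\infty}\gamma^t = \frac{R_{max}}{1-\gamma}, \qquad 0 < \gamma < 1\]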
Recap: Defining MDPs
- Markov decision processes:
- Set of states $S$
- Start state $s_0$
- Set of actions $A$
- Transitions $P(s'|s,a)$ (or $T(s,a,s')$)
- Rewards $R(s,a,s')$ (and discount $\gamma$)
- MDP quantities so far:
- Policy = Choice of action for each state
- Utility = sum of (discounted) rewards
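To tie the recap together, here is a sketch that estimates the (discounted) utility of following a fixed policy by sampling episodes; the Monte Carlo approach and the function names are illustrative, not from the slides.

```python
import random

def rollout_return(transition, policy, start, terminals, gamma, horizon=100):
    """Sample one episode under `policy` and return its discounted sum of rewards.
    `transition` maps (state, action) -> [(next_state, probability, reward), ...]."""
    s, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        if s in terminals:
            break
        outcomes = transition[(s, policy[s])]
        weights = [p for _, p, _ in outcomes]
        s2, _, r = random.choices(outcomes, weights=weights)[0]
        total += discount * r
        discount *= gamma
        s = s2
    return total

def estimate_utility(transition, policy, start, terminals, gamma, episodes=10_000):
    """Monte Carlo estimate of the expected utility of following `policy` from `start`."""
    return sum(rollout_return(transition, policy, start, terminals, gamma)
               for _ in range(episodes)) / episodes

# Example usage with the racing MDP sketched earlier:
# estimate_utility(RACING_T, {"Cool": "Fast", "Warm": "Slow"},
#                  RACING_START, RACING_TERMINALS, gamma=0.9)
```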