Noisy movement: actions do not always go as planned
80% of the time, the action North takes the agent North (if there is no wall there)
10% of the time, North takes the agent West; 10% East
If there is a wall in the direction the agent would have been taken, the agent stays put
The agent receives rewards each time step
Small "living" reward each step (can be negative)
Big rewards come at the end (good or bad)
Goal: maximize sum of rewards
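A minimal Python sketch of this noisy transition model. The noise level matches the 80/10/10 split above, but the `LIVING_REWARD` value, direction names, and `walls` set are illustrative assumptions, not fixed by the slides:

```python
import random

NOISE = 0.2            # 80% intended direction, 10% each perpendicular slip
LIVING_REWARD = -0.04  # small per-step "living" reward (negative here, by assumption)

# Directions as (drow, dcol); the two perpendicular slips for each action
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step(state, action, walls, rng=random):
    """Sample the next state: 80% intended direction, 10% each perpendicular;
    if the resulting square is a wall, the agent stays put."""
    r = rng.random()
    if r < 1 - NOISE:
        direction = action
    elif r < 1 - NOISE / 2:
        direction = SLIPS[action][0]
    else:
        direction = SLIPS[action][1]
    dr, dc = MOVES[direction]
    nxt = (state[0] + dr, state[1] + dc)
    return state if nxt in walls else nxt
```

For example, `step((2, 1), "N", walls=set())` returns the square to the north 80% of the time and one of the two perpendicular neighbors otherwise.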
Recap: Defining MDPs
Markov decision processes:
Set of states $S$
Start state $s_0$
Set of actions $A$
Transitions $P(s'|s,a)$ (or $T(s,a,s')$)
Rewards $R(s,a,s')$ (and discount $\gamma$)
MDP quantities so far:
Policy = Choice of action for each state
Utility = sum of (discounted) rewards
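As a concrete anchor for these definitions, here is a minimal sketch of a finite MDP written out as plain Python data; the states, actions, and the probability and reward entries are made up for illustration:

```python
from typing import Dict, Tuple

State, Action = str, str  # any hashable types work

# Components of the MDP (all entries below are illustrative)
states = {"s0", "s1", "s2"}          # S
start: State = "s0"                  # s_0
actions = {"a", "b"}                 # A
# T(s, a, s') = P(s' | s, a); triples not listed have probability 0
T: Dict[Tuple[State, Action, State], float] = {
    ("s0", "a", "s1"): 0.8,
    ("s0", "a", "s0"): 0.2,
}
# R(s, a, s')
R: Dict[Tuple[State, Action, State], float] = {
    ("s0", "a", "s1"): 1.0,
    ("s0", "a", "s0"): 0.0,
}
gamma = 0.9                          # discount

# A policy picks one action per state; the utility of an episode
# s_0, a_0, s_1, ... is sum_t gamma**t * R[(s_t, a_t, s_{t+1})].
policy: Dict[State, Action] = {"s0": "a", "s1": "a", "s2": "b"}
```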
Solving MDPs
Optimal Quantities
The value (utility) of a state $s$: $V^*(s) =$ expected utility starting in $s$ and acting optimally
The value (utility) of a $q$-state $(s,a)$: $Q^*(s,a) =$ expected utility starting out having taken action $a$ from state $s$ and (thereafter) acting optimally
The optimal policy: $\pi^*(s) = $ optimal action from state $s$
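These three quantities are related by the standard Bellman optimality equations, written here with the $T$, $R$, and $\gamma$ from the recap above:

$$V^*(s) = \max_a Q^*(s,a)$$
$$Q^*(s,a) = \sum_{s'} T(s,a,s')\,\big[R(s,a,s') + \gamma\, V^*(s')\big]$$
$$\pi^*(s) = \arg\max_a Q^*(s,a)$$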
Values of States
Fundamental operation: compute the (expectimax) value of a state
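A minimal value-iteration sketch of that operation, assuming finite `states`, `actions`, and the `T`, `R`, `gamma` dictionaries from the recap sketch above; the function names and iteration count are illustrative:

```python
def q_value(V, s, a, T, R, gamma):
    """Q(s, a) = sum over s' of T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (R.get((s, a, sp), 0.0) + gamma * V[sp])
               for (s0, a0, sp), p in T.items() if (s0, a0) == (s, a))

def value_iteration(states, actions, T, R, gamma, iters=100):
    """Repeatedly apply the Bellman backup V(s) <- max_a Q(s, a)."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(q_value(V, s, a, T, R, gamma) for a in actions)
             for s in states}
    return V
```

Running `value_iteration(states, actions, T, R, gamma)` on the toy MDP from the recap sketch returns a dictionary mapping each state to its estimated $V^*(s)$.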