Hidden Markov Models

CSE 440: Introduction to Artificial Intelligence

Vishnu Boddeti

Content Credits: CMU AI, http://ai.berkeley.edu

Today

Markov Models
Hidden Markov Models

Probability Recap

Conditional Probability: $P(x|y) = \frac{P(x,y)}{P(y)}$
Product Rule: $P(x,y)=P(x|y)P(y)$
Chain Rule: \begin{equation} \begin{aligned} P(x_1,\dots,x_n) &= P(x_1)P(x_2|x_1)P(x_3|x_2,x_1)\dots \nonumber \\ &= \prod_{i=1}^n P(x_i|x_1,\dots,x_{i-1}) \nonumber \end{aligned} \end{equation}
$X$,$Y$ independent if and only if: $\forall x,y \mbox{ : } P(x,y)=P(x)P(y)$
$X$ and $Y$ are conditionally independent given $Z$ if and only if: $\forall x,y,z \mbox{ : } P(x,y|z)=P(x|z)P(y|z)$

Reasoning over Time and Space

Often, we want to reason about a sequence of observations

Speech recognition
Robot localization
User attention
Medical monitoring

Need to introduce time (or space) into our models

Markov Models

Value of $X$ at a given time is called the state

Parameters: called transition probabilities or dynamics, specify how the state evolves over time (also, initial state probabilities)
Stationarity assumption: transition probabilities the same at all times
Same as MDP transition model, but no choice of action

Conditional Independence

Basic conditional independence:

Past and future independent given the present
Each time step only depends on the previous
This is called the (first order) Markov property
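
As a worked step, the chain rule from the recap collapses under the first-order Markov property, because each conditioning set shrinks to the previous state. Here $T$ is an assumed (hypothetical) chain length, not a symbol from the slides:

\begin{equation} \begin{aligned} P(x_1,\dots,x_T) &= P(x_1)\prod_{t=2}^T P(x_t|x_1,\dots,x_{t-1}) \nonumber \\ &= P(x_1)\prod_{t=2}^T P(x_t|x_{t-1}) \nonumber \end{aligned} \end{equation}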

Note that the chain is just a (growable) BN

We can always use generic BN reasoning on it if we truncate the chain at a fixed length

Example Markov Chain: Weather

States: $X = \{rain, sun\}$
Initial distribution: 1.0 sun
CPT $P(X_t|X_{t-1})$:

$X_{t-1}$	$X_t$	$P(X_t\|X_{t-1})$
sun	sun	0.9
sun	rain	0.1
rain	sun	0.3
rain	rain	0.7

Two new ways of representing the same CPT

Example Markov Chain: Weather

Initial distribution: 1.0 sun
What is the probability distribution after one step? \begin{equation} \begin{aligned} P(X_2=sun) &= P(X_2=sun|X_1=sun)P(X_1=sun) \nonumber \\ & + P(X_2=sun|X_1=rain)P(X_1=rain) \nonumber \\ &= 0.9 * 1.0 + 0.3 * 0.0 = 0.9 \nonumber \end{aligned} \end{equation}

Mini Forward Algorithm

Question: What is $P(X)$ on some day $t$?

\begin{equation} \begin{aligned} P(x_1) &= known \nonumber \\ P(x_t) &= \sum_{x_{t-1}} P(x_{t-1},x_t) \nonumber \\ &= \sum_{x_{t-1}} P(x_t|x_{t-1})P(x_{t-1}) \nonumber \end{aligned} \end{equation}
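
The recursion above can be sketched in a few lines of Python. This is a minimal illustration on the weather chain, not a reference implementation; the names `mini_forward_step` and `marginal_at` are hypothetical, and the transition table is stored as `T[previous][current]`.

```python
# Transition model P(X_t | X_{t-1}) from the weather example,
# stored as T[previous][current].
T = {
    "sun": {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}


def mini_forward_step(prior, transition):
    """One update: P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})."""
    states = list(transition)
    return {
        x: sum(transition[prev][x] * prior[prev] for prev in states)
        for x in states
    }


def marginal_at(t, initial, transition):
    """Distribution of X_t, where `initial` is the distribution of X_1."""
    dist = dict(initial)
    for _ in range(t - 1):
        dist = mini_forward_step(dist, transition)
    return dist


p1 = {"sun": 1.0, "rain": 0.0}   # initial distribution: 1.0 sun
p2 = marginal_at(2, p1, T)       # matches the one-step example: P(sun) = 0.9
p3 = marginal_at(3, p1, T)
print(p2)
print(p3)
```

Each step costs time proportional to the square of the number of states, since every current state sums over every previous state.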

Example Run of Mini-Forward Algorithm

From initial observation of sun

From initial observation of rain

From yet another initial distribution $P(X_1)$:
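
The three runs can be reproduced numerically. The sketch below assumes the weather transition table from the earlier example; the third starting distribution (0.4 sun, 0.6 rain) is a made-up value chosen only for illustration, and `step` and `run` are hypothetical helper names.

```python
# Weather transition model, stored as T[previous][current].
T = {
    "sun": {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}


def step(dist):
    """Push a distribution through the transition model once."""
    return {x: sum(T[prev][x] * dist[prev] for prev in T) for x in T}


def run(initial, n_steps):
    """Return [P(X_1), P(X_2), ..., P(X_{n_steps + 1})]."""
    history = [dict(initial)]
    for _ in range(n_steps):
        history.append(step(history[-1]))
    return history


starts = {
    "from sun": {"sun": 1.0, "rain": 0.0},
    "from rain": {"sun": 0.0, "rain": 1.0},
    "from other": {"sun": 0.4, "rain": 0.6},  # hypothetical starting point
}

final = {}
for name, initial in starts.items():
    history = run(initial, 50)
    final[name] = history[-1]
    first_few = [round(h["sun"], 3) for h in history[:4]]
    print(name, "P(sun):", first_few, "...", round(history[-1]["sun"], 3))
```

All three runs drift toward the same values of $P(sun)$ and $P(rain)$, whatever the starting point.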

Stationary Distributions

For most chains:

Influence of the initial distribution gets less and less over time.
The distribution we end up in is independent of the initial distribution

Stationary distribution:

The distribution we end up with is called the stationary distribution $P_{\infty}$ of the chain
It satisfies \begin{equation} \begin{aligned} P_{\infty}(X) &= P_{\infty+1}(X) \nonumber \\ &= \sum_{x} P(X|x)P_{\infty}(x) \nonumber \end{aligned} \end{equation}

Example: Stationary Distributions

Question: What is $P(X)$ at time $t = \infty$?

\begin{equation} \begin{aligned} P_{\infty+1}(sun) &= P(sun|sun)P_{\infty}(sun) + P(sun|rain)P_{\infty}(rain) \nonumber \\ P_{\infty+1}(rain) &= P(rain|sun)P_{\infty}(sun) + P(rain|rain)P_{\infty}(rain) \nonumber \end{aligned} \end{equation} \begin{equation} \begin{aligned} P_{\infty}(sun) &= 0.9\times P_{\infty}(sun) + 0.3\times P_{\infty}(rain) \nonumber \\ P_{\infty}(rain) &= 0.1\times P_{\infty}(sun) + 0.7\times P_{\infty}(rain) \nonumber \end{aligned} \end{equation} \begin{equation} \begin{aligned} P_{\infty}(sun) &= 3P_{\infty}(rain)\nonumber \\ P_{\infty}(rain) &= \frac{1}{3}P_{\infty}(sun) \nonumber \end{aligned} \end{equation} \begin{equation} \begin{aligned} P_{\infty}(sun) + P_{\infty}(rain) &= 1 \nonumber \\ P_{\infty}(sun) = 3/4, \quad P_{\infty}(rain) &= 1/4 \nonumber \end{aligned} \end{equation}

$X_{t-1}$	$X_t$	$P(X_t\|X_{t-1})$
sun	sun	0.9
sun	rain	0.1
rain	sun	0.3
rain	rain	0.7
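
The algebra above can be checked with exact arithmetic. This is a minimal sketch for the two-state case only, assuming at least one of the two states can be left (otherwise the denominator is zero); `two_state_stationary` is a hypothetical helper name.

```python
from fractions import Fraction


def two_state_stationary(p_stay_a, p_stay_b):
    """Stationary distribution (pi_a, pi_b) of a two-state chain.

    p_stay_a = P(a | a) and p_stay_b = P(b | b).
    Solves pi_a = P(a | a) pi_a + P(a | b) pi_b together with pi_a + pi_b = 1.
    """
    leave_a = 1 - p_stay_a  # P(b | a)
    leave_b = 1 - p_stay_b  # P(a | b)
    pi_a = leave_b / (leave_a + leave_b)
    return pi_a, 1 - pi_a


# Weather chain: P(sun | sun) = 0.9, P(rain | rain) = 0.7.
sun, rain = two_state_stationary(Fraction(9, 10), Fraction(7, 10))
print(sun, rain)

# The result is a fixed point of the transition model.
assert sun == Fraction(9, 10) * sun + Fraction(3, 10) * rain
```

Using `Fraction` avoids floating-point rounding, so the fixed-point check holds exactly.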

Stationary Distribution: Web Link Analysis

PageRank over a web graph

Each web page is a state
Initial distribution: uniform over pages
Transitions:

With prob. $c$, uniform jump to a random page
With prob. $1-c$, follow a random outlink

Stationary distribution

Will spend more time on highly reachable pages
E.g. many ways to get to the Acrobat Reader download page
Somewhat robust to link spam
Google 1.0 returned the set of pages containing all your keywords in decreasing rank. Now all search engines use link analysis along with many other factors (rank is actually getting less important over time)
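
The random-surfer chain described above can be sketched as repeated application of the transition model. Everything specific here is an assumption for illustration: the four-page graph, the jump probability `c = 0.15`, the iteration count, and the rule that a page with no outlinks jumps uniformly. It is not a description of any production ranking system.

```python
def pagerank(outlinks, c=0.15, n_iters=100):
    """Approximate the stationary distribution of the random-surfer chain.

    outlinks: dict mapping each page to the list of pages it links to.
    c: probability of a uniform jump to a random page (assumed value).
    """
    pages = list(outlinks)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # initial distribution: uniform
    for _ in range(n_iters):
        # With prob. c, jump uniformly to any page.
        new_rank = {p: c / n for p in pages}
        for p in pages:
            targets = outlinks[p]
            if targets:
                # With prob. 1 - c, follow a random outlink.
                share = (1 - c) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page without outlinks jumps uniformly.
                for q in pages:
                    new_rank[q] += (1 - c) * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical web graph: several pages point at the download page.
web = {
    "home": ["docs", "blog", "download"],
    "docs": ["download"],
    "blog": ["download"],
    "download": ["home"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph the download page collects the most probability mass because three pages link to it.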

Stationary Distribution: Gibbs Sampling

Each joint instantiation over all hidden and query variables is a state: $\{X_1, \dots, X_n\} = H \cup Q$
Transitions:

With probability $1/n$ resample variable $X_j$ according to \begin{equation} P(X_j|x_1,x_2,\dots,x_{j-1},x_{j+1},\dots,x_n,e_1,\dots, e_m) \nonumber \end{equation}

Stationary distribution:

Conditional distribution $P(X_1, X_2 , \dots, X_n|e_1, \dots, e_m)$
Means that when running Gibbs sampling long enough we get a sample from the desired distribution
Requires some proof to show this is true.
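
A small simulation illustrates the claim without proving it. The sketch below makes a simplifying assumption: the target $P(X_1, X_2|e)$ for two binary variables is given directly as a made-up table, and each conditional is read off that table. A real Gibbs sampler would instead compute the conditional from the Bayes net's local CPTs. The names `target`, `resample`, and `gibbs` are hypothetical.

```python
import random

# Hypothetical target distribution P(X_1, X_2 | e) over two binary variables.
# All entries are positive, so every state is reachable from every other.
target = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def resample(state, j, rng):
    """Resample variable j from P(X_j | all other variables, e)."""
    candidates = []
    for value in (0, 1):
        s = list(state)
        s[j] = value
        candidates.append(tuple(s))
    # The conditional is proportional to the joint with the others held fixed.
    weights = [target[s] for s in candidates]
    return rng.choices(candidates, weights=weights)[0]


def gibbs(n_samples, seed=0):
    """Run the chain and return the empirical frequency of each state."""
    rng = random.Random(seed)
    state = (0, 0)
    counts = {s: 0 for s in target}
    for _ in range(n_samples):
        j = rng.randrange(len(state))  # each variable chosen with prob. 1/n
        state = resample(state, j, rng)
        counts[state] += 1
    return {s: k / n_samples for s, k in counts.items()}


estimate = gibbs(100_000)
for s in sorted(target):
    print(s, "target", target[s], "estimate", round(estimate[s], 3))
```

The empirical frequencies should land close to the target table, which is what a stationary distribution equal to the conditional predicts.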

Hidden Markov Models

Markov chains not so useful for most agents

Need observations to update your beliefs

Hidden Markov models (HMMs)

Underlying Markov chain over states $X$
You observe outputs (effects) at each time step

Example: Weather HMM

An HMM is defined by:

Initial distribution: $P(X_1)$
Transitions: $P(X_t|X_{t-1})$
Emissions: $P(E_t|X_{t})$

$R_t$	$U_t$	$P(U_t\|R_t)$
+r	+u	0.9
+r	-u	0.1
-r	+u	0.2
-r	-u	0.8

$R_{t-1}$	$R_{t}$	$P(R_t\|R_{t-1})$
+r	+r	0.7
+r	-r	0.3
-r	+r	0.3
-r	-r	0.7

Conditional Independence

HMMs have two important independence properties:

Markov hidden process: future depends on past via the present
Current observation independent of all else given current state

Quiz: does this mean that evidence variables are guaranteed to be independent?

No, they tend to be correlated through the hidden state
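
The quiz answer can be checked numerically on the weather HMM. The sketch below uses the transition and emission tables from the example and assumes a uniform initial distribution over rain; `joint_evidence` and `marginal` are hypothetical helper names.

```python
from itertools import product

prior = {"+r": 0.5, "-r": 0.5}  # assumed uniform P(R_1)
trans = {"+r": {"+r": 0.7, "-r": 0.3}, "-r": {"+r": 0.3, "-r": 0.7}}
emit = {"+r": {"+u": 0.9, "-u": 0.1}, "-r": {"+u": 0.2, "-u": 0.8}}
UMBRELLA = ("+u", "-u")


def joint_evidence(u1, u2):
    """P(U_1 = u1, U_2 = u2), summing out the hidden states R_1 and R_2."""
    return sum(
        prior[r1] * emit[r1][u1] * trans[r1][r2] * emit[r2][u2]
        for r1, r2 in product(prior, repeat=2)
    )


def marginal(position, u):
    """P(U_1 = u) if position == 1, else P(U_2 = u)."""
    if position == 1:
        return sum(joint_evidence(u, other) for other in UMBRELLA)
    return sum(joint_evidence(other, u) for other in UMBRELLA)


p_joint = joint_evidence("+u", "+u")
p_product = marginal(1, "+u") * marginal(2, "+u")
print("P(+u1, +u2)    =", round(p_joint, 4))
print("P(+u1) P(+u2)  =", round(p_product, 4))
```

The joint probability of two umbrella sightings is larger than the product of the marginals, so $U_1$ and $U_2$ are not independent: seeing an umbrella raises the probability of rain, which in turn raises the probability of an umbrella the next day.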

Real HMM Examples

Speech recognition HMMs:

Observations are acoustic signals (continuous valued)
States are specific positions in specific words (so, tens of thousands)

Machine translation HMMs:

Observations are words (tens of thousands)
States are translation options

Robot tracking:

Observations are range readings (continuous)
States are positions on a map (continuous)

Filtering or Monitoring

Filtering, or monitoring, is the task of tracking the distribution $B_t(X) = P(X_t|e_{1:t})$ (the belief state) over time.
We start with $B_1(X)$ in an initial setting, usually uniform
As time passes, or we get observations, we update $B(X)$
The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program

Inference: Base Cases

$P(X_1|e_1)$: \begin{equation} \begin{aligned} P(x_1|e_1) &= \frac{P(x_1,e_1)}{P(e_1)} \nonumber \\ &\propto_X P(x_1,e_1) \nonumber \\ &= P(x_1)P(e_1|x_1) \nonumber \end{aligned} \end{equation}

$P(X_2)$: \begin{equation} \begin{aligned} P(x_2) &= \sum_{x_1} P(x_1,x_2) \nonumber \\ &= \sum_{x_1} P(x_1)P(x_2|x_1) \nonumber \end{aligned} \end{equation}

Passage of Time

Assume we have current belief $P(X|\mbox{evidence to date})$
Then, after one time step passes: \begin{equation} \begin{aligned} P(X_{t+1}|e_{1:t}) &= \sum_{x_t} P(X_{t+1}, x_t|e_{1:t}) \nonumber \\ &= \sum_{x_t} P(X_{t+1}|x_t,e_{1:t})P(x_t|e_{1:t}) \nonumber \\ &= \sum_{x_t} P(X_{t+1}|x_t)P(x_t|e_{1:t}) \nonumber \end{aligned} \end{equation}

Or compactly: \begin{equation} B'(X_{t+1}) = \sum_{x_t} P(X_{t+1}|x_t)B(x_t) \nonumber \end{equation}

Basic idea: beliefs get "pushed" through the transitions

With the "B" notation, we have to be careful about what time step $t$ the belief is about, and what evidence it includes.
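
The time update can be sketched as a single function. This is a minimal illustration using the rain transition table from the weather HMM; `elapse_time` is a hypothetical name and the starting belief (0.8, 0.2) is a made-up value.

```python
def elapse_time(belief, transition):
    """B'(X_{t+1}) = sum over x_t of P(X_{t+1} | x_t) B(x_t).

    transition is stored as transition[current][next].
    """
    states = list(belief)
    return {
        nxt: sum(transition[cur][nxt] * belief[cur] for cur in states)
        for nxt in states
    }


trans = {"+r": {"+r": 0.7, "-r": 0.3}, "-r": {"+r": 0.3, "-r": 0.7}}

b = {"+r": 0.8, "-r": 0.2}        # hypothetical current belief B(X_t)
b_prime = elapse_time(b, trans)   # belief about X_{t+1}, same evidence
print(b_prime)
```

No normalization is needed here: each row of the transition table sums to one, so the pushed-forward belief still sums to one.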

Observation

Assume we have current belief $P(X|\mbox{previous evidence})$:
Then, after evidence comes in: \begin{equation} \begin{aligned} P(X_{t+1}|e_{1:t+1}) &= P(X_{t+1},e_{t+1}|e_{1:t})/P(e_{t+1}|e_{1:t}) \nonumber \\ &\propto_{X_{t+1}} P(X_{t+1},e_{t+1}|e_{1:t}) \nonumber \\ &= P(e_{t+1}|e_{1:t},X_{t+1})P(X_{t+1}|e_{1:t}) \nonumber \\ &= P(e_{t+1}|X_{t+1})P(X_{t+1}|e_{1:t}) \nonumber \end{aligned} \end{equation}
Or, compactly: \[B(X_{t+1}) \propto_{X_{t+1}} P(e_{t+1}|X_{t+1})B'(X_{t+1})\]
Basic idea: beliefs "reweighted" by likelihood of evidence
Unlike passage of time, we have to renormalize
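
The observation update can be sketched the same way. This is a minimal illustration using the umbrella emission table from the weather HMM; `observe` is a hypothetical name, and the uniform predicted belief is chosen only as an example input.

```python
def observe(predicted, emission, evidence):
    """B(X) is proportional to P(e | X) B'(X); renormalize so it sums to one.

    emission is stored as emission[state][observation].
    """
    weighted = {x: emission[x][evidence] * predicted[x] for x in predicted}
    total = sum(weighted.values())  # this is P(e_{t+1} | e_{1:t})
    if total == 0:
        raise ValueError("evidence has zero probability under this belief")
    return {x: w / total for x, w in weighted.items()}


emit = {"+r": {"+u": 0.9, "-u": 0.1}, "-r": {"+u": 0.2, "-u": 0.8}}

b_prime = {"+r": 0.5, "-r": 0.5}      # example predicted belief B'(X)
b = observe(b_prime, emit, "+u")      # umbrella observed
print(b)
```

The normalizing constant is the probability of the new evidence given the earlier evidence, which is why the weighted values do not sum to one on their own.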

Example: Observation

As we get observations, beliefs get reweighted, uncertainty "decreases"

Example: Weather HMM

Beliefs

$B(+r)=0.5$ and $B(-r)=0.5$
$B'(+r)=0.5$ and $B'(-r)=0.5$
$B(+r)=0.818$ and $B(-r)=0.182$
$B'(+r)=0.627$ and $B'(-r)=0.373$
$B(+r)=0.883$ and $B(-r)=0.117$

$R_{t-1}$	$R_t$	$P(R_t\|R_{t-1})$
+r	+r	0.7
+r	-r	0.3
-r	+r	0.3
-r	-r	0.7

$R_t$	$U_t$	$P(U_t\|R_t)$
+r	+u	0.9
+r	-u	0.1
-r	+u	0.2
-r	-u	0.8
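
The belief sequence listed above can be reproduced by alternating the two updates. The sketch assumes the umbrella is observed on both days, an assumption that is consistent with the listed numbers; `filter_step` is a hypothetical name.

```python
trans = {"+r": {"+r": 0.7, "-r": 0.3}, "-r": {"+r": 0.3, "-r": 0.7}}
emit = {"+r": {"+u": 0.9, "-u": 0.1}, "-r": {"+u": 0.2, "-u": 0.8}}


def filter_step(belief, evidence):
    """One time update followed by one observation update.

    Returns (B', B): the predicted belief and the belief after evidence.
    """
    predicted = {
        x: sum(trans[prev][x] * belief[prev] for prev in belief)
        for x in belief
    }
    weighted = {x: emit[x][evidence] * predicted[x] for x in predicted}
    z = sum(weighted.values())
    return predicted, {x: w / z for x, w in weighted.items()}


belief = {"+r": 0.5, "-r": 0.5}   # initial B
trace = []
for e in ["+u", "+u"]:            # assumed evidence sequence
    predicted, belief = filter_step(belief, e)
    trace.append((predicted["+r"], belief["+r"]))
    print("B'(+r) =", round(predicted["+r"], 3), " B(+r) =", round(belief["+r"], 3))
```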

The Forward Algorithm

We are given evidence at each time and want to know $B_t(X) = P(X_t|e_{1:t})$
We can derive the following updates \begin{equation} \begin{aligned} P(x_t|e_{1:t}) &\propto_{X} P(x_t,e_{1:t}) \nonumber \\ &= \sum_{x_{t-1}} P(x_{t-1}, x_t, e_{1:t}) \nonumber \\ &= \sum_{x_{t-1}} P(x_{t-1}, e_{1:t-1})P(x_t|x_{t-1})P(e_t|x_t) \nonumber \\ &= P(e_t|x_t) \sum_{x_{t-1}} P(x_t|x_{t-1}) P(x_{t-1},e_{1:t-1}) \nonumber \end{aligned} \end{equation}
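
The last line of the derivation is a recursion on the unnormalized quantity $P(x_t, e_{1:t})$. A minimal sketch on the weather HMM follows, assuming a uniform prior on $X_1$ and two umbrella observations; `forward` and `alpha` are hypothetical names.

```python
def forward(prior, trans, emit, evidence):
    """Return alpha(x) = P(x_t, e_{1:t}) for the final time step t.

    Base case: alpha(x_1) = P(x_1) P(e_1 | x_1).
    Recursion: alpha(x_t) = P(e_t | x_t) * sum over x_{t-1} of
               P(x_t | x_{t-1}) alpha(x_{t-1}).
    """
    alpha = {x: prior[x] * emit[x][evidence[0]] for x in prior}
    for e in evidence[1:]:
        alpha = {
            x: emit[x][e] * sum(trans[prev][x] * alpha[prev] for prev in alpha)
            for x in alpha
        }
    return alpha


prior = {"+r": 0.5, "-r": 0.5}  # assumed uniform P(X_1)
trans = {"+r": {"+r": 0.7, "-r": 0.3}, "-r": {"+r": 0.3, "-r": 0.7}}
emit = {"+r": {"+u": 0.9, "-u": 0.1}, "-r": {"+u": 0.2, "-u": 0.8}}

alpha = forward(prior, trans, emit, ["+u", "+u"])
likelihood = sum(alpha.values())                  # P(e_{1:t})
posterior = {x: a / likelihood for x, a in alpha.items()}
print(alpha)
print(posterior)
```

Normalizing only at the end gives the filtered belief, and the normalizer itself is the probability of the whole evidence sequence.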

Online Belief Updates

Every time step, we start with current $P(X|\mbox{evidence})$
We update for time: \begin{equation} P(x_t|e_{1:t-1}) = \sum_{x_{t-1}} P(x_{t-1}|e_{1:t-1})\cdot P(x_t|x_{t-1}) \nonumber \end{equation}
We update for evidence: \begin{equation} P(x_t|e_{1:t}) \propto_{X} P(x_t|e_{1:t-1})\cdot P(e_t|x_t) \nonumber \end{equation}
The forward algorithm does both at once (and does not normalize)

Q & A