Bayesian Networks: Sampling
CSE 440: Introduction to Artificial Intelligence
Vishnu Boddeti
Content Credits: CMU AI, http://ai.berkeley.edu
Today
- Prior Sampling
- Rejection Sampling
- Likelihood Weighting
- Gibbs Sampling
Recap: Bayes Net Representation
- A directed, acyclic graph, one node per random variable
- A conditional probability table (CPT) for each node
- A collection of distributions over $X$, one for each combination of parents values
\[P(X|a_1,\dots,a_n)\]
- Bayes nets implicitly encode joint distributions
- As a product of local conditional distributions
- To see what probability a BN gives to a full assignment, multiply all the relevant conditionals together:
\[P(x_1,\dots,x_n)=\prod_{i=1}^nP(x_i|parents(x_i))\]
Recap: Variable Elimination
- Interleave joining and marginalizing
- $d^k$ entries computed for a factor over $k$ variables with domain sizes $d$
- Ordering of elimination of hidden variables can affect size of factors generated
- Worst case: running time exponential in the size of the Bayes net
Approximate Inference: Sampling
Sampling
- Sampling is a lot like repeated simulation
- Predicting the weather, basketball games, $\dots$
- Basic idea
- Draw $N$ samples from a sampling distribution $S$
- Compute an approximate posterior probability
- Show this converges to the true probability $P$
- Why sample?
- Learning: get samples from a distribution you don’t know
- Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)
Sampling
- Sampling from given distribution
- Step 1: Get sample $u$ from uniform distribution over $[0, 1)$
- e.g. $random()$ in python
- Step 2: Convert this sample u into an outcome for the given distribution by having each target outcome associated with a sub-interval of $[0,1)$ with sub-interval size equal to probability of the outcome
- Example:
- If $random()$ returns $u = 0.83$, then our sample is $C = blue$.
Prior Sampling
Prior Sampling
Prior Sampling
- This process generates samples with probability:
\begin{equation}
S_{PS}(x_1,\dots,x_n) = \prod_{i=1}^n P(x_i|Parents(X_i)) =P(x_1,\dots,x_n) \nonumber
\end{equation}
i.e., the BN's joint probability.
- Let the number of samples of an event be $N_{PS}(x_1,\dots,x_n)$
- Then,
\begin{equation}
\begin{aligned}
\lim_{N\rightarrow \infty} \hat{P}(x_1,\dots,x_n) &= \lim_{N\rightarrow \infty} \frac{N_{PS}(x_1,\dots,x_n)}{N} \nonumber \\
&= S_{PS}(x_1,\dots,x_n) \nonumber \\
&= P(x_1,\dots,x_n)
\end{aligned}
\end{equation}
- i.e., the sampling procedure is consistent
Example: Prior Sampling
- We will get a bunch of samples from the BN:
- +c, -s, +r, +w
- +c, +s, +r, +w
- -c, +s, +r, -w
- +c, -s, +r, +w
- -c, -s, -r, +w
- If we want to know $P(W)$
- We have counts $<+w:4, -w:1>$
- Normalize to get $P(W) = <+w:0.8, -w:0.2>$
- This will get closer to the true distribution with more samples
- Can estimate anything else, too
- What about $P(C | +w)$? $P(C | +r, +w)$? $P(C | -r, -w)$?
- Fast: can use fewer samples if less time (what is the drawback?)
Rejection Sampling
- Let us say we want $P(C)$
- No point keeping all samples around
- Just tally counts of $C$ as we go
- Let us say we want $P(C|+s)$
- Same thing: tally $C$ outcomes, but ignore (reject) samples which do not have $S=+s$
- This is called rejection sampling
- It is also consistent for conditional probabilities (i.e., correct in the limit)
Rejection Sampling
Likelihood Weighting
- Problem with rejection sampling:
- If evidence is unlikely, rejects lots of samples
- Evidence not exploited as you sample
- Consider $P(Shape|blue)$
- Idea: fix evidence variables and sample the rest
- Problem: sample distribution not consistent
- Solution: weight by probability of evidence given parents
Likelihood Weighting: Example
Likelihood Weighting: Algorithm
Likelihood Weighting
- Sampling distribution if $z$ sampled and $e$ fixed evidence
\[S_{WS}(\mathbf{z},\mathbf{e})=\prod_{i=1}^l P(z_i|Parents(Z_i))\]
- Now, samples have weights
\[w(\mathbf{z},\mathbf{e})=\prod_{i=1}^m P(e_i|Parents(E_i))\]
- Together, weighted sampling distribution is consistent
\begin{equation}
\begin{aligned}
S_{WS}(z,e)\cdot w(z,e) &= \prod_{i=1}^l P(z_i|Parents(z_i))\prod_{i=1}^l P(e_i|Parents(e_i)) \nonumber \\
&= P(\mathbf{z},\mathbf{e}) \nonumber
\end{aligned}
\end{equation}
Likelihood Weighting
- Likelihood weighting is good
- We have taken evidence into account as we generate the sample
- E.g. here, $W$'s value will get picked based on the evidence values of $S$, $R$
- More of our samples will reflect the state of the world suggested by the evidence
- Likelihood weighting does not solve all our problems
- Evidence influences the choice of downstream variables, but not upstream ones ($C$ is not more likely to get a value matching the evidence)
- We would like to consider evidence when we sample every variable (leads to Gibbs sampling)
Gibbs Sampling
- Procedure: keep track of a full instantiation $x_1, x_2, \dots, x_n$. Start with an arbitrary instantiation consistent with the evidence. Sample one variable at a time, conditioned on all the rest, but keep evidence fixed. Keep repeating this for a long time.
- Property: in the limit of repeating this infinitely many times the resulting samples come from the correct distribution (i.e. conditioned on evidence) variable.
- Rationale: both upstream and downstream variables condition on evidence
- In contrast: likelihood weighting only conditions on upstream evidence, and hence weights obtained in likelihood weighting can sometimes be very small. Sum of weights over all samples is indicative of how many ``effective" samples were obtained, so we want high weight.
Gibbs Sampling Example: $P(S|+r)$
- Step 1: Fix evidence
- Step 2: Initialize other variables
- Step 3: Repeat
- Choose a non-evidence variable $X$
- Resample $X$ from $P(X|\mbox{all other variables})$
Efficient Resampling of One Variable
- Sample from $P(S|+c,+r,-w)$
\begin{equation}
\begin{aligned}
P(S|+c,+r,-w) &= \frac{P(S,+c,+r,-w)}{P(+c,+r,-w)}\nonumber \\
&= \frac{P(S,+c,+r,-w)}{\sum_s P(s,+c,+r,-w)} \nonumber \\
&= \frac{P(+c)(S|+c)P(+r|+c)P(-w|S,+r)}{\sum_s P(+c)(S|+c)P(+r|+c)P(-w|s,+r)} \nonumber \\
&= \frac{P(S|+c)P(-w|S,+r)}{\sum_s P(S|+c)P(-w|s,+r)} \nonumber
\end{aligned}
\end{equation}
- Many things cancel out – only CPTs with $S$ remain
- More generally: only CPTs that have resampled variable need to be considered, and joined together
Summary: Bayes Net Sampling
- Prior Sampling $P(Q)$
- Rejection Sampling $P(Q|e)$
- Likelihood Weighting $P(Q|e)$
- Gibbs Sampling $P(Q|e)$
Beyond Gibbs Sampling
- Gibbs sampling produces sample from the query distribution $P(Q|e)$ in limit of re-sampling infinitely often
- Gibbs sampling is a special case of more general methods called Markov chain Monte Carlo (MCMC) methods
- Metropolis-Hastings is one of the more famous MCMC methods (in fact, Gibbs sampling is a special case of Metropolis-Hastings)
- You may read about Monte Carlo methods – they are just sampling