Bayesian Networks: Sampling


CSE 440: Introduction to Artificial Intelligence

Vishnu Boddeti


Content Credits: CMU AI, http://ai.berkeley.edu

Today

  • Prior Sampling
  • Rejection Sampling
  • Likelihood Weighting
  • Gibbs Sampling

Recap: Bayes Net Representation

  • A directed, acyclic graph, one node per random variable
  • A conditional probability table (CPT) for each node
    • A collection of distributions over $X$, one for each combination of parents values
    • \[P(X|a_1,\dots,a_n)\]
  • Bayes nets implicitly encode joint distributions
    • As a product of local conditional distributions
    • To see what probability a BN gives to a full assignment, multiply all the relevant conditionals together:
    • \[P(x_1,\dots,x_n)=\prod_{i=1}^nP(x_i|parents(x_i))\]

Recap: Variable Elimination

  • Interleave joining and marginalizing
  • $d^k$ entries computed for a factor over $k$ variables with domain sizes $d$
  • Ordering of elimination of hidden variables can affect size of factors generated
  • Worst case: running time exponential in the size of the Bayes net

Approximate Inference: Sampling

Sampling

  • Sampling is a lot like repeated simulation
    • Predicting the weather, basketball games, $\dots$
  • Basic idea
    • Draw $N$ samples from a sampling distribution $S$
    • Compute an approximate posterior probability
    • Show this converges to the true probability $P$
  • Why sample?
    • Learning: get samples from a distribution you don’t know
    • Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)

Sampling

  • Sampling from given distribution
    • Step 1: Get sample $u$ from uniform distribution over $[0, 1)$
      • e.g. $random()$ in python
    • Step 2: Convert this sample u into an outcome for the given distribution by having each target outcome associated with a sub-interval of $[0,1)$ with sub-interval size equal to probability of the outcome
  • Example:
    • If $random()$ returns $u = 0.83$, then our sample is $C = blue$.

Prior Sampling

Prior Sampling

Prior Sampling

Prior Sampling

  • This process generates samples with probability:
  • \begin{equation} S_{PS}(x_1,\dots,x_n) = \prod_{i=1}^n P(x_i|Parents(X_i)) =P(x_1,\dots,x_n) \nonumber \end{equation} i.e., the BN's joint probability.
  • Let the number of samples of an event be $N_{PS}(x_1,\dots,x_n)$
  • Then, \begin{equation} \begin{aligned} \lim_{N\rightarrow \infty} \hat{P}(x_1,\dots,x_n) &= \lim_{N\rightarrow \infty} \frac{N_{PS}(x_1,\dots,x_n)}{N} \nonumber \\ &= S_{PS}(x_1,\dots,x_n) \nonumber \\ &= P(x_1,\dots,x_n) \end{aligned} \end{equation}
  • i.e., the sampling procedure is consistent

Example: Prior Sampling

  • We will get a bunch of samples from the BN:
    • +c, -s, +r, +w
    • +c, +s, +r, +w
    • -c, +s, +r, -w
    • +c, -s, +r, +w
    • -c, -s, -r, +w
  • If we want to know $P(W)$
    • We have counts $<+w:4, -w:1>$
    • Normalize to get $P(W) = <+w:0.8, -w:0.2>$
    • This will get closer to the true distribution with more samples
    • Can estimate anything else, too
    • What about $P(C | +w)$? $P(C | +r, +w)$? $P(C | -r, -w)$?
    • Fast: can use fewer samples if less time (what is the drawback?)

Rejection Sampling

Rejection Sampling

  • Let us say we want $P(C)$
    • No point keeping all samples around
    • Just tally counts of $C$ as we go
  • Let us say we want $P(C|+s)$
    • Same thing: tally $C$ outcomes, but ignore (reject) samples which do not have $S=+s$
    • This is called rejection sampling
    • It is also consistent for conditional probabilities (i.e., correct in the limit)

Rejection Sampling

Likelihood Weighting

Likelihood Weighting

  • Problem with rejection sampling:
    • If evidence is unlikely, rejects lots of samples
    • Evidence not exploited as you sample
    • Consider $P(Shape|blue)$
  • Idea: fix evidence variables and sample the rest
    • Problem: sample distribution not consistent
    • Solution: weight by probability of evidence given parents

Likelihood Weighting: Example

Likelihood Weighting: Algorithm

Likelihood Weighting

  • Sampling distribution if $z$ sampled and $e$ fixed evidence
  • \[S_{WS}(\mathbf{z},\mathbf{e})=\prod_{i=1}^l P(z_i|Parents(Z_i))\]
  • Now, samples have weights \[w(\mathbf{z},\mathbf{e})=\prod_{i=1}^m P(e_i|Parents(E_i))\]
  • Together, weighted sampling distribution is consistent \begin{equation} \begin{aligned} S_{WS}(z,e)\cdot w(z,e) &= \prod_{i=1}^l P(z_i|Parents(z_i))\prod_{i=1}^l P(e_i|Parents(e_i)) \nonumber \\ &= P(\mathbf{z},\mathbf{e}) \nonumber \end{aligned} \end{equation}

Likelihood Weighting

  • Likelihood weighting is good
    • We have taken evidence into account as we generate the sample
    • E.g. here, $W$'s value will get picked based on the evidence values of $S$, $R$
    • More of our samples will reflect the state of the world suggested by the evidence
  • Likelihood weighting does not solve all our problems
    • Evidence influences the choice of downstream variables, but not upstream ones ($C$ is not more likely to get a value matching the evidence)
  • We would like to consider evidence when we sample every variable (leads to Gibbs sampling)

Gibbs Sampling

Gibbs Sampling

  • Procedure: keep track of a full instantiation $x_1, x_2, \dots, x_n$. Start with an arbitrary instantiation consistent with the evidence. Sample one variable at a time, conditioned on all the rest, but keep evidence fixed. Keep repeating this for a long time.
  • Property: in the limit of repeating this infinitely many times the resulting samples come from the correct distribution (i.e. conditioned on evidence) variable.
  • Rationale: both upstream and downstream variables condition on evidence
  • In contrast: likelihood weighting only conditions on upstream evidence, and hence weights obtained in likelihood weighting can sometimes be very small. Sum of weights over all samples is indicative of how many ``effective" samples were obtained, so we want high weight.

Gibbs Sampling Example: $P(S|+r)$

  • Step 1: Fix evidence
    • $R=+r$
  • Step 2: Initialize other variables
    • randomly
  • Step 3: Repeat
    • Choose a non-evidence variable $X$
    • Resample $X$ from $P(X|\mbox{all other variables})$

Efficient Resampling of One Variable

  • Sample from $P(S|+c,+r,-w)$
  • \begin{equation} \begin{aligned} P(S|+c,+r,-w) &= \frac{P(S,+c,+r,-w)}{P(+c,+r,-w)}\nonumber \\ &= \frac{P(S,+c,+r,-w)}{\sum_s P(s,+c,+r,-w)} \nonumber \\ &= \frac{P(+c)(S|+c)P(+r|+c)P(-w|S,+r)}{\sum_s P(+c)(S|+c)P(+r|+c)P(-w|s,+r)} \nonumber \\ &= \frac{P(S|+c)P(-w|S,+r)}{\sum_s P(S|+c)P(-w|s,+r)} \nonumber \end{aligned} \end{equation}
  • Many things cancel out – only CPTs with $S$ remain
  • More generally: only CPTs that have resampled variable need to be considered, and joined together

Summary: Bayes Net Sampling

  • Prior Sampling $P(Q)$

  • Rejection Sampling $P(Q|e)$

  • Likelihood Weighting $P(Q|e)$

  • Gibbs Sampling $P(Q|e)$

Beyond Gibbs Sampling

  • Gibbs sampling produces sample from the query distribution $P(Q|e)$ in limit of re-sampling infinitely often
  • Gibbs sampling is a special case of more general methods called Markov chain Monte Carlo (MCMC) methods
    • Metropolis-Hastings is one of the more famous MCMC methods (in fact, Gibbs sampling is a special case of Metropolis-Hastings)
  • You may read about Monte Carlo methods – they are just sampling

Q & A

Image
XKCD