Variational Autoencoders
CSE 891: Deep Learning
Vishnu Boddeti
Monday November 22, 2021
Latent Variable Model
Goal: modeling $p_{data}$
Autoregressive models
All random variables are observed
Latent Variable Models (LVMs)
Some random variables are hidden; we do not get to observe them
Fully Observed Models
Latent Variable Models (observation noise)
Why Latent Variable Models?
Simpler, lower-dimensional representations of data are often possible
Latent variable models hold the promise of automatically identifying those hidden representations
Why Latent Variable Models?
Autoregressive models are slow to sample from because every pixel (observation dimension) is assumed to depend on the ones before it
We can make parts of the observation space conditionally independent given some latent variables
Latent variable models can therefore sample faster by exploiting this statistical structure
Latent Variable Models
Sometimes, it is possible to design a latent variable model with an understanding of the causal process that generates data
In general, we do not know what the latent variables are or how they interact with the observations
Most popular models make few assumptions about what the latent variables are
The best way to specify latent variables is still an active area of research
Inferential Problems
Evidence Estimation
\begin{eqnarray}
p(\mathbf{x}) = \int p(\mathbf{x},\mathbf{z})d\mathbf{z} \nonumber
\end{eqnarray}
Moment Computation
\begin{eqnarray}
\mathbb{E}[f(\mathbf{x})|\mathbf{z}] = \int f(\mathbf{x})p(\mathbf{x}|\mathbf{z})d\mathbf{x} \nonumber
\end{eqnarray}
Prediction
\begin{eqnarray}
p(\mathbf{x}_{t+1}) = \int p(\mathbf{x}_{t+1}|\mathbf{x}_t)p(\mathbf{x}_t)d\mathbf{x}_t \nonumber
\end{eqnarray}
Hypothesis Testing
\begin{eqnarray}
\mathcal{B} = \log p(\mathbf{x}|H_1) - \log p(\mathbf{x}|H_2) \nonumber
\end{eqnarray}
Example Latent Variable Model
\begin{eqnarray}
z &=& (z_1,z_2,\dots,z_K)\sim p(z;\beta)=\prod_{k=1}^K \beta_k^{z_k}(1-\beta_k)^{1-z_k} \\
x &=& (x_1,x_2,\dots,x_L)\sim p_{\theta}(x|z), \quad x_i \sim \mbox{Bernoulli}\big(\mathrm{DNN}(z)_i\big)
\end{eqnarray}
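To make the generative process concrete, here is a minimal sampling sketch. The dimensions $K$ and $L$, the uniform prior parameters, and the small MLP standing in for $\mathrm{DNN}(z)$ are all illustrative assumptions, not the lecture's model.

```python
# Ancestral sampling from the example latent variable model:
# K Bernoulli latents z with prior parameters beta, and a small MLP ("DNN(z)")
# producing the L Bernoulli means of x. All sizes and layers are assumptions.
import torch
import torch.nn as nn

K, L = 8, 16                          # latent / observation dimensions (assumed)
beta = torch.full((K,), 0.5)          # prior Bernoulli parameters beta_k
decoder = nn.Sequential(              # stand-in for DNN(z)
    nn.Linear(K, 32), nn.ReLU(),
    nn.Linear(32, L), nn.Sigmoid()    # per-dimension Bernoulli means for x
)

z = torch.bernoulli(beta)             # z ~ p(z; beta)
x = torch.bernoulli(decoder(z))       # x ~ p_theta(x | z)
```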
Latent Variable Model
Sample:
\begin{eqnarray}
z &\sim& p(z) \\
x &\sim& p_{\theta}(x|z)
\end{eqnarray}
Evaluate Likelihood
\begin{equation}p_{\theta}(x)=\sum_z p_Z(z)p_{\theta}(x|z)\end{equation}
Train
\begin{equation}\max_{\theta}\sum_i \log p_{\theta}(x^{(i)})=\sum_i\log\left(\sum_z p_Z(z)p_{\theta}(x^{(i)}|z)\right)\end{equation}
Representation: $x \rightarrow z$
Training Latent Variable Model
Objective:
\begin{equation}\max_{\theta}\sum_i \log p_{\theta}(x^{(i)})=\sum_i\log\left(\sum_z p_Z(z)p_{\theta}(x^{(i)}|z)\right)\end{equation}
Scenario 1: $z$ takes on only a small number of values $\rightarrow$ the exact objective is tractable (a sketch of this case follows below)
Scenario 2: $z$ takes on impractically many values (or is continuous) $\rightarrow$ the objective must be approximated
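A sketch of Scenario 1, reusing the illustrative `beta` and `decoder` from the sampling example above; everything here is assumed for illustration.

```python
# Scenario 1: for small K, enumerate all 2^K binary latents and compute
# log p(x) = log sum_z p(z; beta) p_theta(x | z) exactly.
import itertools
import torch

def exact_log_marginal(x, beta, decoder):
    log_terms = []
    for bits in itertools.product([0.0, 1.0], repeat=beta.shape[0]):
        z = torch.tensor(bits)
        log_pz = (z * beta.log() + (1 - z) * (1 - beta).log()).sum()
        probs = decoder(z)                                    # Bernoulli means of x given z
        log_px_given_z = (x * probs.log() + (1 - x) * (1 - probs).log()).sum()
        log_terms.append(log_pz + log_px_given_z)
    return torch.logsumexp(torch.stack(log_terms), dim=0)    # log p(x)
```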
Bayesian Model Evidence
Learning Principle: Model Evidence
\[p(\mathbf{x}) = \int p(\mathbf{x},\mathbf{z})d\mathbf{z}\]
\[\mathbf{x} = f(\mathbf{z})\]
Improve the model evidence using data
The integral is intractable in general
Idea: transform the integral into an expectation over a simple, known distribution
Importance Sampling
$q(\mathbf{z}|\mathbf{x})>0$ whenever $p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\neq 0$
$q(\mathbf{z}|\mathbf{x})$ is easy to sample from
\begin{eqnarray}
p(\mathbf{x}) &=& \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z} \nonumber \\
&=& \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\frac{q(\mathbf{z}|\mathbf{x})}{q(\mathbf{z}|\mathbf{x})}d\mathbf{z} \nonumber \\
&=& \int p(\mathbf{x}|\mathbf{z})\frac{p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}q(\mathbf{z}|\mathbf{x})d\mathbf{z} \nonumber \\
&& w^{(s)} = \frac{p(\mathbf{z}^{(s)})}{q(\mathbf{z}^{(s)}|\mathbf{x})}, \hspace{10pt} \mathbf{z}^{(s)} \sim q(\mathbf{z}|\mathbf{x}) \nonumber \\
p(\mathbf{x}) &\approx& \frac{1}{S}\sum_{s=1}^{S} w^{(s)} p(\mathbf{x}|\mathbf{z}^{(s)}) \nonumber
\end{eqnarray}
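A sketch of the resulting estimator, assuming a user-supplied proposal `q_dist` (any torch distribution whose `log_prob` returns one value per sample) and log-density helpers `log_prior` and `log_likelihood`; all names are illustrative, and the computation is done in log space for numerical stability.

```python
# Importance-sampling estimate of the evidence, in log space:
# log p(x) ~= logsumexp_s [ log p(z_s) + log p(x|z_s) - log q(z_s|x) ] - log S
import math
import torch

def log_evidence_is(x, q_dist, log_prior, log_likelihood, S=1000):
    # q_dist.log_prob must give one value per sample, e.g. Independent(Normal(...), 1)
    z = q_dist.sample((S,))                           # z^(s) ~ q(z|x)
    log_w = log_prior(z) - q_dist.log_prob(z)         # log importance weights w^(s)
    return torch.logsumexp(log_w + log_likelihood(x, z), dim=0) - math.log(S)
```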
Importance Sampling to Variational Inference
Jensen's Inequality:
\[\log \left(\int p(x)g(x)dx\right) \geq \int p(x)\log g(x)dx\]
Variational Lower Bound:
\[\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]-KL[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})]\]
Integral Problem:
\begin{eqnarray}
p(\mathbf{x}) &=& \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z} \nonumber \\
&=& \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})\frac{q(\mathbf{z}|\mathbf{x})}{q(\mathbf{z}|\mathbf{x})}d\mathbf{z} \nonumber \\
&=& \int p(\mathbf{x}|\mathbf{z})\frac{p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}q(\mathbf{z}|\mathbf{x})d\mathbf{z} \nonumber \\
\log p(\mathbf{x}) &\geq& \int q(\mathbf{z}|\mathbf{x})\log\left(p(\mathbf{x}|\mathbf{z})\frac{p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\right)d\mathbf{z} \nonumber \\
&=& \int q(\mathbf{z}|\mathbf{x})\log p(\mathbf{x}|\mathbf{z})d\mathbf{z} - \int q(\mathbf{z}|\mathbf{x})\log \frac{q(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})}d\mathbf{z} \nonumber
\end{eqnarray}
Variational Free Energy
\begin{equation}
\mathcal{F}(\mathbf{x},q) = \underbrace{\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]}_{\text{Reconstruction}}-\underbrace{KL[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})]}_{\text{Penalty}} \nonumber
\end{equation}
Interpreting the Lower Bound:
Approximate posterior distribution $q(\mathbf{z}|\mathbf{x})$: Best match to true posterior $p(\mathbf{z}|\mathbf{x})$, one of the unknown inferential quantities of interest to us.
Reconstruction Cost: The expected log-likelihood measures how well samples from $q(\mathbf{z}|\mathbf{x})$ are able to explain the data $\mathbf{x}$.
Penalty: Ensures that the explanation of the data, $q(\mathbf{z}|\mathbf{x})$, does not deviate too far from our prior beliefs $p(\mathbf{z})$. A mechanism for realizing Ockham's razor (a numerical sketch of both terms follows below).
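A small numerical sketch of the two terms, assuming a factorized Gaussian $q(\mathbf{z}|\mathbf{x})$, a standard normal prior, and a random linear map standing in for the decoder; none of these choices are from the lecture.

```python
# Reconstruction and penalty terms of the free energy, estimated by Monte Carlo.
import torch
from torch.distributions import Normal, Bernoulli, kl_divergence

x = torch.bernoulli(torch.full((16,), 0.5))       # toy binary observation
q = Normal(torch.zeros(4), torch.ones(4))         # q(z|x), assumed factorized Gaussian
prior = Normal(torch.zeros(4), torch.ones(4))     # p(z) = N(0, I)

z = q.rsample((32,))                              # samples z ~ q(z|x)
W = torch.randn(4, 16)                            # stand-in decoder weights
px_given_z = Bernoulli(logits=z @ W)              # p(x|z)

reconstruction = px_given_z.log_prob(x).sum(-1).mean()   # E_q[log p(x|z)]
penalty = kl_divergence(q, prior).sum()                  # KL[q(z|x) || p(z)]
free_energy = reconstruction - penalty
```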
Other Families of Variational Bounds
Variational Free Energy
\begin{equation}
\mathcal{F}(\mathbf{x},q) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]-KL[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})] \nonumber
\end{equation}
Multi-Sample Variational Objective
\begin{equation}
\mathcal{F}(\mathbf{x},q) = \mathbb{E}_{\mathbf{z}^{(1:S)}\sim q(\mathbf{z}|\mathbf{x})}\left[\log \frac{1}{S}\sum_{s=1}^{S} \frac{p(\mathbf{z}^{(s)})}{q(\mathbf{z}^{(s)}|\mathbf{x})}p(\mathbf{x}|\mathbf{z}^{(s)})\right] \nonumber
\end{equation}
Renyi Divergence
\begin{equation}
\mathcal{F}_{\alpha}(\mathbf{x},q) = \frac{1}{1-\alpha}\log\, \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\left(\frac{p(\mathbf{z})\,p(\mathbf{x}|\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\right)^{1-\alpha}\right] \nonumber
\end{equation}
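A sketch of estimating the multi-sample and Renyi objectives from $S$ samples of $q$, reusing the same illustrative `q_dist`, `log_prior`, and `log_likelihood` helpers assumed in the importance-sampling sketch above.

```python
# Multi-sample (importance-weighted) bound and Renyi-alpha bound, both computed
# from the same log weights: log w_s = log p(z_s) + log p(x|z_s) - log q(z_s|x).
import math
import torch

def multi_sample_and_renyi(x, q_dist, log_prior, log_likelihood, S=10, alpha=0.5):
    z = q_dist.rsample((S,))                                  # S samples from q(z|x)
    log_w = log_prior(z) + log_likelihood(x, z) - q_dist.log_prob(z)
    multi_sample = torch.logsumexp(log_w, dim=0) - math.log(S)
    renyi = (torch.logsumexp((1 - alpha) * log_w, dim=0) - math.log(S)) / (1 - alpha)
    return multi_sample, renyi
```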
Learning: Variational EM
\begin{equation}
\mathcal{F}(\mathbf{x},q) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]-KL[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})] \nonumber
\end{equation}
Alternating Optimization
Repeat:
E-Step: $\Delta\phi \propto \nabla_{\phi}\mathcal{F}(\mathbf{x},q)$ (update variational parameters)
M-Step: $\Delta\theta \propto \nabla_{\theta}\mathcal{F}(\mathbf{x},q)$ (update model parameters)
Until convergence
Stochastic Approximation
\begin{equation}
\mathcal{F}(\mathbf{x},q) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]-KL[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})] \nonumber
\end{equation}
Optimize using a stochastic gradient based on a mini-batch of data
The mini-batch of $N$ data points is sampled with replacement from the full dataset (a sketch of one such update follows below).
E-Step (compute $q$): Inference
\begin{eqnarray}
\text{For } n&=&1,\dots,N \nonumber \\
&& \Delta\phi \propto \nabla_{\phi}\left(\mathbb{E}_{q_{\phi}(\mathbf{z}_n|\mathbf{x}_n)}[\log p_{\theta}(\mathbf{x}_n|\mathbf{z}_n)]-KL[q_{\phi}(\mathbf{z}_n|\mathbf{x}_n)\|p(\mathbf{z})]\right) \nonumber
\end{eqnarray}
M-Step: Parameter Learning
\[\Delta\theta \propto \frac{1}{N}\sum_{n} \mathbb{E}_{q_{\phi}(\mathbf{z}_n|\mathbf{x}_n)}[\nabla_{\theta}\log p_{\theta}(\mathbf{x}_n|\mathbf{z}_n)]\]
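A sketch of one such stochastic update, assuming per-datapoint Gaussian variational parameters (`mu`, `log_std` indexed by dataset row), a decoder that returns a torch distribution, and two separate optimizers; every name here is illustrative.

```python
# One stochastic variational-EM update: a few E-step gradient steps on the
# mini-batch's variational parameters phi_n, then one M-step on theta.
import torch
from torch.distributions import Normal, kl_divergence

def vem_update(x, idx, mu, log_std, decoder, prior, opt_phi, opt_theta, e_steps=5):
    # opt_phi optimizes only (mu, log_std); opt_theta optimizes only the decoder.
    def free_energy():
        q = Normal(mu[idx], log_std[idx].exp())            # q(z_n | x_n) for the batch
        z = q.rsample()                                    # reparameterized samples
        recon = decoder(z).log_prob(x).sum(-1)             # one-sample E_q[log p(x_n|z_n)]
        kl = kl_divergence(q, prior).sum(-1)               # KL[q(z_n|x_n) || p(z)]
        return (recon - kl).mean()

    for _ in range(e_steps):                               # E-step: ascend F w.r.t. phi
        opt_phi.zero_grad()
        (-free_energy()).backward()
        opt_phi.step()

    opt_theta.zero_grad()                                  # M-step: ascend F w.r.t. theta
    (-free_energy()).backward()
    opt_theta.step()
```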
Memoryless Inference
\begin{equation}
\mathcal{F}(\mathbf{x},q) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]-KL[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})] \nonumber
\end{equation}
E-step does not reuse any previous computation.
Memoryless: Any inference computations are discarded after the M-step update.
E-Step (compute $q$): Inference
\begin{eqnarray}
\text{For } n&=&1,\dots,N \nonumber \\
&& \Delta\phi \propto \nabla_{\phi}\left(\mathbb{E}_{q_{\phi}(\mathbf{z}_n|\mathbf{x}_n)}[\log p_{\theta}(\mathbf{x}_n|\mathbf{z}_n)]-KL[q_{\phi}(\mathbf{z}_n|\mathbf{x}_n)\|p(\mathbf{z})]\right) \nonumber
\end{eqnarray}
M-Step: Parameter Learning
\[\Delta\theta \propto \frac{1}{N}\sum_{n} \mathbb{E}_{q_{\phi}(\mathbf{z}_n|\mathbf{x}_n)}[\nabla_{\theta}\log p_{\theta}(\mathbf{x}_n|\mathbf{z}_n)]\]
Amortized Inference
Instead of solving E-step for every observation, amortize using a model.
Inference Network: $q$ is an encoder, an inverse model, or a recognition model.
The parameters of $q$ are now a set of global parameters used for inference over all data points, both train and test.
Amortize (spread) the cost of inference over all data.
Joint optimization of variational and model parameters.
Amortized Variational Inference
\begin{equation}
\mathcal{F}(\mathbf{x},q) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]-KL[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})] \nonumber
\end{equation}
Variational Auto-Encoder: a specific combination of a latent variable model with amortized variational inference using an inference network
Model (Decoder): likelihood $p(\mathbf{x}|\mathbf{z})$
Inference (Encoder): variational distribution $q(\mathbf{z}|\mathbf{x})$
Transforms an auto-encoder into a generative model.
Latent Gaussian VAE
\begin{eqnarray}
p(\mathbf{z}) &=& \mathcal{N}(\mathbf{0},\mathbf{I}) \nonumber \\
p_{\theta}(\mathbf{x}|\mathbf{z}) &=& \mathcal{N}(\mathbf{\mu}_{\theta}(\mathbf{z}),\mathbf{\Sigma}_{\theta}(\mathbf{z})) \nonumber \\
q_{\phi}(\mathbf{z}|\mathbf{x}) &=& \mathcal{N}(\mathbf{\mu}_{\phi}(\mathbf{x}),\mathbf{\Sigma}_{\phi}(\mathbf{x})) \nonumber
\end{eqnarray}
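A minimal PyTorch sketch of this latent Gaussian VAE with a diagonal covariance, a unit-variance Gaussian likelihood, and arbitrary layer sizes; these choices are illustrative assumptions, not the lecture's architecture.

```python
# Latent-Gaussian VAE: Gaussian encoder q_phi(z|x), Gaussian decoder p_theta(x|z),
# standard-normal prior, and the free energy F = reconstruction - KL.
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)        # mu_phi(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log of diagonal Sigma_phi(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))   # mu_theta(z)

    def free_energy(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)      # reparameterization
        x_mu = self.dec(z)
        recon = -0.5 * ((x - x_mu) ** 2).sum(dim=1)                  # log N(x; mu_theta(z), I) + const
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1) # KL[q(z|x) || N(0, I)]
        return (recon - kl).mean()
```

Training then maximizes the free energy over mini-batches with any stochastic optimizer, e.g. `loss = -model.free_energy(x); loss.backward()`.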
$KL(p\|q)$ vs $KL(q\|p)$
Reverse KL: Zero-Forcing/Mode-Seeking
Forward KL: Mass-Covering/Mean-Seeking
Variational Auto-Encoders in General
\begin{equation}
\mathcal{F}(q) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]-KL[q_{\phi}(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})] \nonumber
\end{equation}
Design Choices:
Prior on latent variables: continuous, discrete, Gaussian, Bernoulli, mixture
Likelihood Function: iid (static), sequential, temporal, spatial
Approximating Posterior: distribution, sequential, spatial
Scalability and Ease of Implementation:
stochastic gradient estimation
stochastic gradient descent (and variants)
Minimum Description Length
\begin{equation}
\mathcal{F}(\mathbf{x},q) = \underbrace{\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]}_{\text{Data code-length}}-\underbrace{KL[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})]}_{\text{Hypothesis code}} \nonumber
\end{equation}
Compressibility: Regularity in our data can be explained with latent variables.
Inference is a problem of compression.
Minimum Description Length (MDL):
We must find the ideal shortest message for our data $\mathbf{x}$: the marginal likelihood.
We must introduce an approximation to this ideal message.
Learning: Stochastic Backpropagation
Common gradient problem
\begin{equation}
\nabla_{\phi}\mathbb{E}_{q_{\phi}(\mathbf{z})}[f_{\theta}(\mathbf{z})] = \nabla_{\phi} \int q_{\phi}(\mathbf{z})f_{\theta}(\mathbf{z})d\mathbf{z} \nonumber
\end{equation}
Reparameterization: instead of sampling $\mathbf{z} \sim q_{\phi}(\mathbf{z})$ directly, write
\[\mathbf{z} = g(\epsilon,\phi), \quad \epsilon \sim p(\epsilon)\]
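A small sketch of why this helps: once $\mathbf{z}$ is written as $g(\epsilon,\phi)$, a Monte Carlo estimate of $\mathbb{E}_{q_{\phi}}[f(\mathbf{z})]$ is differentiable in $\phi$ by ordinary backpropagation. The choice $f(\mathbf{z})=\|\mathbf{z}\|^2$ below is purely illustrative.

```python
# Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so gradients of
# the Monte Carlo objective flow back into the variational parameters (mu, log_std).
import torch

mu = torch.zeros(3, requires_grad=True)
log_std = torch.zeros(3, requires_grad=True)

eps = torch.randn(10_000, 3)              # eps ~ p(eps), independent of phi
z = mu + log_std.exp() * eps              # z = g(eps, phi)
objective = (z ** 2).sum(dim=1).mean()    # Monte Carlo estimate of E_q[f(z)], f(z) = ||z||^2
objective.backward()                      # gradients w.r.t. mu and log_std
print(mu.grad, log_std.grad)
```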
Generating Complex Distributions
Generate complex distributions from simple distributions.
Replace the random variable with a deterministic function of a simpler random variable.
Transformation Models
\begin{tabular}{lll}
Target density & Base noise & Transformation \\
$\mathcal{N}(0,1)$ & $\epsilon_1,\epsilon_2 \sim \mathcal{U}[0,1]$ & $\sqrt{2\ln\left(\frac{1}{\epsilon_1}\right)}\cos(2\pi\epsilon_2)$ \\
$\mathcal{N}(\mathbf{\mu},\mathbf{R}\mathbf{R}^T)$ & $\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ & $\mathbf{\mu} + \mathbf{R}\mathbf{\epsilon}$ \\
$\exp(-x),\ x > 0$ & $\epsilon \sim \mathcal{U}[0,1]$ & $\ln\left(\frac{1}{\epsilon}\right)$ \\
$\frac{1}{\pi(1+x^2)}$ & $\epsilon \sim \mathcal{U}[0,1]$ & $\tan(\pi\epsilon)$ \\
$\frac{1}{2}\exp(-|x|)$ & $\epsilon_1,\epsilon_2 \sim \mathcal{U}[0,1]$ & $\ln\left(\frac{\epsilon_1}{\epsilon_2}\right)$ \\
\end{tabular}
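A quick numerical check of these transformations (illustrative only), drawing uniform noise and applying each map.

```python
# Draw uniform noise and apply the transformations from the table above.
import math
import torch

torch.manual_seed(0)
e1, e2 = torch.rand(100_000), torch.rand(100_000)
e1, e2 = e1.clamp_min(1e-12), e2.clamp_min(1e-12)   # avoid log(0) / division by zero

gaussian = torch.sqrt(2 * torch.log(1 / e1)) * torch.cos(2 * math.pi * e2)  # Box-Muller -> N(0, 1)
exponential = torch.log(1 / e1)                                             # -> Exp(1)
cauchy = torch.tan(math.pi * e1)                                            # -> standard Cauchy
laplace = torch.log(e1 / e2)                                                # -> Laplace(0, 1)

print(gaussian.mean().item(), gaussian.std().item())   # approximately 0 and 1
```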
Implementing Variational Algorithms
Ideally, we want probabilistic programming with variational inference: variational inference turns integration into optimization.
Automatic differentiation: PyTorch, TensorFlow
Stochastic gradient descent and other (preconditioned) optimizers
Same code can run on GPUs and distributed clusters
Probabilistic models are modular and can be easily combined
Visualizing Latent Space
Interpolating Latent Space
VQ-VAE
\begin{equation}
L = \log p(x|z_q(x)) + \|sg[z_e(x)]-e\|_2^2 + \beta\|z_e(x)-sg[e]\|_2^2
\end{equation}
Neural Discrete Representation Learning, NeurIPS 2017
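A sketch of the quantization step and the three loss terms, where the stop-gradient $sg[\cdot]$ is `detach()` and the straight-through estimator copies decoder gradients from $z_q$ back to the encoder output; shapes and names are illustrative, not the paper's code.

```python
# VQ-VAE loss: reconstruction + codebook loss + beta * commitment loss.
import torch
import torch.nn.functional as F

def vq_vae_loss(z_e, codebook, x, decoder, beta=0.25):
    # Nearest-neighbour lookup: z_q(x) = e_k with k = argmin_j ||z_e(x) - e_j||
    codes = torch.cdist(z_e, codebook).argmin(dim=1)
    z_q = codebook[codes]
    # Straight-through estimator: forward uses z_q, backward passes gradients to z_e
    z_q_st = z_e + (z_q - z_e).detach()
    recon = F.mse_loss(decoder(z_q_st), x)          # stand-in for -log p(x | z_q(x))
    codebook_loss = F.mse_loss(z_q, z_e.detach())   # || sg[z_e(x)] - e ||^2
    commitment = F.mse_loss(z_e, z_q.detach())      # || z_e(x) - sg[e] ||^2
    return recon + codebook_loss + beta * commitment
```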
VQ-VAE-2
Generating Diverse High-Fidelity Images with VQ-VAE-2, NeurIPS 2019
Learning Disentangled Representations
\begin{equation}
\mathcal{F}(\mathbf{x},q) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})]-\beta KL[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})] \nonumber
\end{equation}
$\beta$-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework, ICLR 2017