Score Matching and Diffusion Models - I


CSE 849: Deep Learning

Vishnu Boddeti

A Probabilistic Viewpoint

  • Goal: modeling $p_{data}$
  • Families of generative models:
    • Fully observed models
    • Transformation models (likelihood-free)
    • Modern generative models

Core Idea

  • Interpolating between two distributions:
    • Density: [figure: interpolated densities between the two distributions]
    • Samples: [figure: samples along the interpolation]

Core Idea

  • Interpolating between two distributions:
    • The data distribution is denoted $p_{data} \in \mathcal{P}(\mathbb{R}^d)$.
    • The easy-to-sample distribution is denoted $p_{ref} \in \mathcal{P}(\mathbb{R}^d)$.
    • $p_{ref}$ is usually the standard multivariate Gaussian.
  • Going from the data to the easy-to-sample distribution: noising process.
  • Going from the easy-to-sample to the data distribution: generative process.
  • How to invert the forward noising process?

Ancestral Sampling: Forward

    • Let $N \in \mathbb{N}$ with $N > 0$ and consider $p$ a density on $(\mathbb{R}^d)^{N+1}$ such that for any $\mathbf{x}_{0:N} = \{\mathbf{x}_k\}_{k=0}^N$ we have $$p(\mathbf{x}_{0:N}) = p_0(\mathbf{x}_0)\prod_{k=0}^{N-1} p_{k+1|k}(\mathbf{x}_{k+1}|\mathbf{x}_k)$$
    • This is the forward decomposition of $p$.
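A minimal sketch of forward ancestral sampling, assuming (as an illustrative choice, not from the slides) Gaussian transition kernels $p_{k+1|k}(\mathbf{x}_{k+1}|\mathbf{x}_k) = \mathcal{N}(\mathbf{x}_k, \sigma^2\mathbf{I}_d)$:

```python
import numpy as np

def forward_ancestral_sampling(x0, n_steps, sigma=0.1):
    """Sample a trajectory (x_0, ..., x_N) from the forward decomposition
    p(x_{0:N}) = p_0(x_0) * prod_k p_{k+1|k}(x_{k+1} | x_k),
    here with hypothetical Gaussian kernels p_{k+1|k} = N(x_k, sigma^2 I)."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        # Each step conditions only on the previous state (Markov property).
        xs.append(xs[-1] + sigma * np.random.randn(*xs[-1].shape))
    return np.stack(xs)  # shape (N + 1, d)

# x_0 drawn from a toy data distribution p_0 (standard Gaussian here)
trajectory = forward_ancestral_sampling(np.random.randn(2), n_steps=100)
print(trajectory.shape)  # (101, 2)
```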

Ancestral Sampling: Backward

    • For any $k \in \{0,\dots,N-1\}$, define the marginal $p_{k+1}$ by, for any $\mathbf{x}_{k+1}\in \mathbb{R}^d$, $$p_{k+1}(\mathbf{x}_{k+1}) = \int_{\mathbb{R}^d} p_k(\mathbf{x}_k)p_{k+1|k}(\mathbf{x}_{k+1}|\mathbf{x}_k)d\mathbf{x}_k$$
    • Assume that for any $k \in \{0,\dots,N\}, p_k > 0$ and define $p_{k|k+1}$ for any $\mathbf{x}_k, \mathbf{x}_{k+1} \in \mathbb{R}^d$ $$p_{k|k+1}(\mathbf{x}_k|\mathbf{x}_{k+1})=\frac{p_{k+1|k}(\mathbf{x}_{k+1}|\mathbf{x}_k)p_{k}(\mathbf{x}_{k})}{p_{k+1}(\mathbf{x}_{k+1})}$$
    • We obtain the backward decomposition $$p(\mathbf{x}_{0:N}) = p_N(\mathbf{x}_N)\prod_{k=0}^{N-1} p_{k|k+1}(\mathbf{x}_{k}|\mathbf{x}_{k+1})$$
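As a concrete sanity check (a one-dimensional Gaussian example, not from the slides): take $p_0 = \mathcal{N}(0, s_0)$ and $p_{k+1|k}(x_{k+1}|x_k) = \mathcal{N}(x_{k+1}; x_k, \sigma^2)$. Then $p_k = \mathcal{N}(0, s_k)$ with $s_k = s_0 + k\sigma^2$, and Bayes' rule yields a Gaussian backward kernel
$$p_{k|k+1}(x_k|x_{k+1}) = \mathcal{N}\left(x_k;\ \frac{s_k}{s_k+\sigma^2}x_{k+1},\ \frac{s_k\sigma^2}{s_k+\sigma^2}\right)$$
which shrinks $x_{k+1}$ toward the data mean: the backward kernel denoises.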
Background Knowledge

Adding noise to images

  • We are given an input dataset $$\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^N \subset \mathbb{R}^d$$
  • We assume that these images are independent samples of a common distribution $p_0$ over $\mathbb{R}^d$.
  • Consider the random process that consists of adding noise to images: $$\mathbf{x}_t = \mathbf{x}_0 + \mathbf{w}_t, \quad t \in [0,T]$$ where $\mathbf{x}_0 \sim p_0$ is a sample image and $\mathbf{w}_t$ is a Brownian motion (also called Wiener process).

Brownian Motion

  • Real-valued: A standard (real-valued) Brownian motion (also called a Wiener process) is a stochastic process $(w_t)_{t \geq 0}$ such that
    • $w_0=0$
    • With probability one, the function $t \mapsto w_t$ is continuous.
    • The process $(w_t)_{t \geq 0}$ has stationary, independent increments.
    • $w_t \sim \mathcal{N}(0,t)$
  • Direct consequences:
    • For $s < t$, $w_s$ and $w_t-w_s$ are independent and $w_t - w_s \sim \mathcal{N}(0, t-s)$
    • The process is Markovian.
  • $\mathbb{R}^d$-valued: A standard $\mathbb{R}^d$-valued Brownian motion $(\mathbf{w}_t)_{t \geq 0}$ is made of $d$ independent real-valued Brownian motions $$ \mathbf{w}_t = (w_{t,1}, \cdots, w_{t,d}) \in \mathbb{R}^d $$
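A minimal simulation sketch (uniform time grid and all names are illustrative choices): since increments are independent with $w_{t+h} - w_t \sim \mathcal{N}(0, h)$, a discretized path is a cumulative sum of Gaussian steps.

```python
import numpy as np

def brownian_paths(n_paths, n_steps, T=1.0, d=1):
    """Simulate n_paths discretized d-dimensional Brownian motions on [0, T]."""
    h = T / n_steps
    # Independent stationary increments: w_{t_{k+1}} - w_{t_k} ~ N(0, h I_d)
    increments = np.sqrt(h) * np.random.randn(n_paths, n_steps, d)
    # Prepend w_0 = 0, then accumulate the increments along time.
    return np.concatenate(
        [np.zeros((n_paths, 1, d)), np.cumsum(increments, axis=1)], axis=1
    )

w = brownian_paths(n_paths=10_000, n_steps=100)
print(w[:, -1, 0].var())  # ~ 1.0, consistent with w_T ~ N(0, T) for T = 1
```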

Adding noise to images

    • Adding noise to images: $\mathbf{x}_t = \mathbf{x}_0 + \mathbf{w}_t, t\in[0,T].$
    • This corresponds to the stochastic differential equation (SDE): $$d\mathbf{x}_t = d\mathbf{w}_t, \text{ with initial condition } \mathbf{x}_0 \sim p_0.$$
    • We denote by $p_t$ the distribution of $\mathbf{x}_t$ at time $t \in [0, T]$. What is $p_t$? $$p_t = p_0 \ast \mathcal{N}(\mathbf{0}, t\mathbf{I}_d)$$
    • This corresponds to applying the heat equation starting from $p_0$: $$\partial_tp_t(\mathbf{x})=\frac{1}{2}\nabla^2_{\mathbf{x}}p_t(\mathbf{x}) \text{ with } p_{t=0}=p_0$$
    • This PDE is called the Fokker-Planck equation associated with the SDE.
    • This is an example of a diffusion equation.
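A quick numerical check of the convolution identity (a sketch; the bimodal toy $p_0$ is an illustrative choice): since $\mathbf{x}_t = \mathbf{x}_0 + \mathbf{w}_t$ with $\mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, t\mathbf{I}_d)$ independent of $\mathbf{x}_0$, sampling $p_t$ only requires adding Gaussian noise of variance $t$.

```python
import numpy as np

# Toy 1D data distribution p_0: equal point masses at -1 and +1 (Var = 1).
n = 100_000
x0 = np.random.choice([-1.0, 1.0], size=n)

t = 0.5
xt = x0 + np.sqrt(t) * np.random.randn(n)  # samples of p_t = p_0 * N(0, t)

# Analytic check: Var(x_t) = Var(x_0) + t = 1 + 0.5
print(xt.var())  # ~ 1.5
```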

Diffusion SDE and Fokker-Planck equation

    • More generally, we will consider diffusion SDEs of the form (Song et al., 2021b): $$d\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)dt + g(t)d\mathbf{w}_t$$
      • $\mathbf{f}:\mathbb{R}^d\times[0,T]\rightarrow \mathbb{R}^d$ is called the drift: an external deterministic force that drives $\mathbf{x}_t$ in the direction of $\mathbf{f}(\mathbf{x}_t, t)$.
      • $g:[0,T]\rightarrow [0,+\infty)$ is the diffusion coefficient
    • The corresponding Fokker-Planck equation is: $$ \partial_tp_t(\mathbf{x})= - \nabla_{\mathbf{x}}\cdot(\mathbf{f}(\mathbf{x},t)p_t(\mathbf{x})) + \frac{1}{2}g(t)^2\nabla^2_{\mathbf{x}}p_t(\mathbf{x})$$

Diffusion SDE: Two examples

  • Ornstein-Uhlenbeck process: $$d\mathbf{x}_t = -\mathbf{x}_t dt + \sqrt{2}d\mathbf{w}_t$$
  • Geometric Brownian motion: $$d\mathbf{x}_t = \mathbf{x}_t dt + \mathbf{x}_t d\mathbf{w}_t$$
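The Ornstein-Uhlenbeck SDE can be solved in closed form (a standard fact, stated here for reference): using the integrating factor $e^t$,
$$\mathbf{x}_t = e^{-t}\mathbf{x}_0 + \sqrt{2}\int_0^t e^{s-t}d\mathbf{w}_s, \qquad Var(\mathbf{x}_t) = e^{-2t}Var(\mathbf{x}_0) + 1 - e^{-2t},$$
so $\mathbf{x}_t$ converges to $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ as $t \to \infty$; this is the VP-SDE of the next slide with constant schedule $\beta_t \equiv 1$.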

Diffusion SDE: Two examples

$$ d\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)dt + g(t)d\mathbf{w}_t$$
  • Example 1: Variance exploding diffusion (VE-SDE) $$ \begin{aligned} \text{SDE:} & \quad d\mathbf{x}_t = d\mathbf{w}_t \\ \text{Solution:} & \quad \mathbf{x}_t = \mathbf{x}_0 + \mathbf{w}_t \\ \text{Variance:} & \quad Var(\mathbf{x}_t) = Var(\mathbf{x}_0) + t \end{aligned} $$
  • Example 2: Variance preserving diffusion (VP-SDE) $$ \begin{aligned} \text{SDE:} & \quad d\mathbf{x}_t = -\beta_t\mathbf{x}_tdt + \sqrt{2\beta_t}d\mathbf{w}_t \\ \text{Solution:} & \quad \mathbf{x}_t = e^{-B_t}\mathbf{x}_0 + \int_{0}^te^{B_s-B_t}\sqrt{2\beta_s}d\mathbf{w}_s \text{ with } B_t=\int_{0}^t\beta_sds \\ \text{Variance:} & \quad Var(\mathbf{x}_t) = e^{-2B_t}Var(\mathbf{x}_0) + 1 - e^{-2B_t} \end{aligned} $$
  • Both variants have the form $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \mathbf{z}_t$ with $\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$: $\mathbf{x}_t$ is a rescaled noisy version of $\mathbf{x}_0$, and the noise becomes more and more predominant as time grows.
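A sketch comparing the two marginals by direct sampling, using the closed forms above (the constant schedule $\beta_t \equiv 1$ and all variable names are illustrative): $\mathbf{x}_t^{VE} = \mathbf{x}_0 + \sqrt{t}\,\mathbf{z}$ and $\mathbf{x}_t^{VP} = e^{-B_t}\mathbf{x}_0 + \sqrt{1-e^{-2B_t}}\,\mathbf{z}$.

```python
import numpy as np

n, t, beta = 100_000, 2.0, 1.0
x0 = 3.0 * np.random.randn(n)  # p_0 with Var(x_0) = 9
z = np.random.randn(n)

# VE: variance explodes linearly, Var(x_t) = Var(x_0) + t
x_ve = x0 + np.sqrt(t) * z
# VP with constant beta, so B_t = beta * t: Var(x_t) -> 1 as t grows
B = beta * t
x_vp = np.exp(-B) * x0 + np.sqrt(1.0 - np.exp(-2.0 * B)) * z

print(x_ve.var())  # ~ 9 + 2 = 11
print(x_vp.var())  # ~ e^{-2B} * 9 + 1 - e^{-2B} ≈ 1.15
```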
Diffusion SDE

Diffusion SDE: Forward

$$d\mathbf{x}_t = -\beta_t\mathbf{x}_tdt + \sqrt{2\beta_t}d\mathbf{w}_t$$

Numerical scheme for diffusion SDE

$$ d\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)dt + g(t)d\mathbf{w}_t$$
    • In general we do not have a closed-form formula for $\mathbf{x}_t$.
    • Diffusion SDEs can be approximately simulated using numerical schemes such as the Euler-Maruyama scheme:
      • Using the time step $h = T/N$ with $N + 1$ times $t_n=nh, n \in \{0, \dots, N\}$, define $\mathbf{X}_0 = \mathbf{x}_0$ and $$\mathbf{X}_{n+1} = \mathbf{X}_n + h\mathbf{f}(\mathbf{X}_n, t_n) + g(t_n)(\mathbf{w}_{t_{n+1}}-\mathbf{w}_{t_n}), \quad n \in \{0,\dots,N-1\}.$$
      • Remark that $\mathbf{w}_{t_{n+1}}- \mathbf{w}_{t_n} \sim \mathcal{N}(\mathbf{0}, h\mathbf{I}_d)$ and is independent of $\mathbf{X}_n$: $$\mathbf{X}_{n+1} = \mathbf{X}_n + h\mathbf{f}(\mathbf{X}_n, t_n) + \sqrt{h}g(t_n)\mathbf{Z}_n, \text{ with } \mathbf{Z}_n\sim\mathcal{N}(\mathbf{0}, \mathbf{I}_d), \quad n \in \{0,\dots,N-1\}.$$
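A minimal Euler-Maruyama sketch for the general SDE above (the function names and the Ornstein-Uhlenbeck test case are illustrative choices):

```python
import numpy as np

def euler_maruyama(x0, f, g, T=1.0, N=1000):
    """Simulate dX_t = f(X_t, t) dt + g(t) dW_t on [0, T] with N steps."""
    h = T / N
    x = x0.copy()
    for n in range(N):
        t_n = n * h
        z = np.random.randn(*x.shape)  # W_{t_{n+1}} - W_{t_n} = sqrt(h) * z
        x = x + h * f(x, t_n) + np.sqrt(h) * g(t_n) * z
    return x

# Example: Ornstein-Uhlenbeck process dx_t = -x_t dt + sqrt(2) dw_t
xT = euler_maruyama(
    x0=3.0 * np.random.randn(10_000),
    f=lambda x, t: -x,
    g=lambda t: np.sqrt(2.0),
    T=5.0,
)
print(xT.var())  # ~ 1, the OU stationary variance
```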

Reversed Diffusion

    • For diffusion SDEs, as $t$ grows, $p_t$ gets closer and closer to a normal distribution.
    • We take the final time $t=T$ large enough that $p_T$ can be treated as a normal distribution.
    • For generative modeling, we want to reverse the process:
      • Start by generating $\mathbf{x}_T \sim p_T \approx \mathcal{N}(\mathbf{0}, \sigma_T^2 \mathbf{I}_d)$.
      • Simulate $(\mathbf{x}_{T-t})_{t \in [0,T]}$ such that $\mathbf{x}_{T-t} \sim p_{T-t}$.

Reversed Diffusion Continued...

  • Reversed diffusion: what is the SDE satisfied by $\mathbf{x}_{T-t}$?
  • The forward SDE $$ d\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)dt + g(t)d\mathbf{w}_t$$
  • has the associated Fokker-Planck equation $$ \partial_tp_t(\mathbf{x})= -\nabla_{\mathbf{x}}\cdot(\mathbf{f}(\mathbf{x},t)p_t(\mathbf{x})) + \frac{1}{2}g(t)^2\nabla^2_{\mathbf{x}}p_t(\mathbf{x})$$
  • Let us derive the Fokker-Planck equation for $q_t = p_{T-t}$, the distribution of $\mathbf{y}_t = \mathbf{x}_{T-t}$.
  • $$ \begin{aligned} \partial_tq_t(\mathbf{x}) & = -\partial_t p_{T-t}(\mathbf{x}) \\ & = \nabla_{\mathbf{x}}\cdot(\mathbf{f}(\mathbf{x},T-t)p_{T-t}(\mathbf{x})) - \frac{1}{2}g(T-t)^2\nabla^2_{\mathbf{x}}p_{T-t}(\mathbf{x}) \\ & = \nabla_{\mathbf{x}}\cdot(\mathbf{f}(\mathbf{x},T-t)q_{t}(\mathbf{x})) - \frac{1}{2}g(T-t)^2\nabla^2_{\mathbf{x}}q_{t}(\mathbf{x}) \\ & = \nabla_{\mathbf{x}}\cdot(\mathbf{f}(\mathbf{x},T-t)q_{t}(\mathbf{x})) + \color{orange}{\left(-1 + \frac{1}{2}\right)} g(T-t)^2\nabla^2_{\mathbf{x}}q_{t}(\mathbf{x}) \end{aligned} $$

Reversed Diffusion Continued...

$$ \begin{aligned} & \partial_tq_t(\mathbf{x}) \\ & = \nabla_{\mathbf{x}}\cdot(\mathbf{f}(\mathbf{x},T-t)q_{t}(\mathbf{x})) + \color{orange}{\left(-1 + \frac{1}{2}\right)} g(T-t)^2\nabla^2_{\mathbf{x}}q_{t}(\mathbf{x}) \\ & = \nabla_{\mathbf{x}}\cdot\left(\mathbf{f}(\mathbf{x},T-t)q_{t}(\mathbf{x}) - g(T-t)^2\nabla_{\mathbf{x}} q_t(\mathbf{x})\right) + \frac{1}{2}g(T-t)^2\nabla^2_{\mathbf{x}}q_{t}(\mathbf{x}) \\ & = \nabla_{\mathbf{x}}\cdot\left(\left[\mathbf{f}(\mathbf{x},T-t) - g(T-t)^2\frac{\nabla_{\mathbf{x}} q_t(\mathbf{x})}{q_t(\mathbf{x})}\right]q_{t}(\mathbf{x})\right) + \frac{1}{2}g(T-t)^2\nabla^2_{\mathbf{x}}q_{t}(\mathbf{x}) \\ & = -\nabla_{\mathbf{x}}\cdot\left(\left[-\mathbf{f}(\mathbf{x},T-t) + g(T-t)^2\nabla_{\mathbf{x}} \log q_t(\mathbf{x})\right]q_{t}(\mathbf{x})\right) + \frac{1}{2}g(T-t)^2\nabla^2_{\mathbf{x}}q_{t}(\mathbf{x}) \\ \end{aligned} $$
  • This is the Fokker-Planck equation associated with the diffusion SDE: $$d\mathbf{y}_t = [-\mathbf{f}(\mathbf{y}_t,T-t)+g(T-t)^2\nabla_{\mathbf{x}}\log p_{T-t}(\mathbf{y}_t)]dt + g(T-t)d\mathbf{w}_t$$

Reversed Diffusion Continued...

  • Forward diffusion: $$d\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)dt + g(t)d\mathbf{w}_t$$
  • Backward diffusion: $\mathbf{y}_t = \mathbf{x}_{T-t}$ $$d\mathbf{y}_t = [-\mathbf{f}(\mathbf{y}_t,T-t)+g(T-t)^2\nabla_{\mathbf{x}}\log p_{T-t}(\mathbf{y}_t)]dt + g(T-t)d\mathbf{w}_t$$
    • Same diffusion coefficient.
    • Opposite drift term with an additional correction: $$g(T-t)^2 \nabla_{\mathbf{x}}\log p_{T-t}(\mathbf{y}_t)$$ drives the diffusion toward regions of high $p_{T-t}$ probability.
    • $\mathbf{x} \mapsto \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is called the (Stein) score of the distribution.
  • Can we simulate this backward diffusion using Euler-Maruyama? $$\mathbf{Y}_{n+1} = \mathbf{Y}_n + h[-\mathbf{f}(\mathbf{Y}_n,T-t_n)+g(T-t_n)^2\nabla_{\mathbf{x}}\log p_{T-t_n}(\mathbf{Y}_n)] + \sqrt{h}g(T-t_n)\mathbf{Z}_n$$
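This update is implementable only when the score $\nabla_{\mathbf{x}}\log p_{T-t_n}$ can be evaluated, which is what score matching will provide. As a toy sanity check (a sketch with an analytic score, not the learned setting): for the VE-SDE ($\mathbf{f}=\mathbf{0}$, $g=1$) with $p_0 = \mathcal{N}(0, \sigma_0^2)$, we have $p_t = \mathcal{N}(0, \sigma_0^2 + t)$, so the score is available in closed form.

```python
import numpy as np

# Toy backward VE diffusion: f = 0, g = 1, p_0 = N(0, sigma0_sq),
# so p_t = N(0, sigma0_sq + t) and the score is analytic.
sigma0_sq, T, N = 1.0, 5.0, 1000
h = T / N

def score(x, t):
    # (Stein) score of p_t = N(0, sigma0_sq + t)
    return -x / (sigma0_sq + t)

y = np.sqrt(sigma0_sq + T) * np.random.randn(100_000)  # Y_0 ~ p_T
for n in range(N):
    t_n = n * h
    z = np.random.randn(*y.shape)
    # Backward Euler-Maruyama step: drift = -f + g^2 * score = score here
    y = y + h * score(y, T - t_n) + np.sqrt(h) * z

print(y.var())  # ~ sigma0_sq = 1: samples approximately follow p_0
```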