Let $\{\mathbf{x}_1,\dots,\mathbf{x}_N\}$ be samples from an unknown distribution $p(\mathbf{x})$.
Let $f_{\mathbf{\theta}}(\mathbf{x}) \in \mathbb{R}$ be a real-valued function (an energy) with parameters $\mathbf{\theta}$, and define the density
$$
p_{\mathbf{\theta}}(\mathbf{x}) = \frac{e^{-f_{\mathbf{\theta}}(\mathbf{x})}}{Z_{\mathbf{\theta}}}
$$
where $Z_{\mathbf{\theta}}$ is a normalizing constant that depends on $\mathbf{\theta}$ such that $\int p_{\mathbf{\theta}}(\mathbf{x})d\mathbf{x} = 1$.
Train $p_{\mathbf{\theta}}(\mathbf{x})$ by maximizing the log-likelihood of the data,
$$
\max_{\mathbf{\theta}}\sum_{i=1}^N \log p_{\mathbf{\theta}}(\mathbf{x}_i) = \max_{\mathbf{\theta}}\sum_{i=1}^N \left(-f_{\mathbf{\theta}}(\mathbf{x}_i) - \log Z_{\mathbf{\theta}}\right)
$$
Evaluating the normalizing constant $Z_{\mathbf{\theta}}$ is typically intractable for any general $f_{\mathbf{\theta}}(\mathbf{x})$.
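To make this concrete, here is a minimal sketch in PyTorch (not from the original slides) of an energy-based parameterization: `EnergyNet` is a hypothetical small MLP for $f_{\mathbf{\theta}}$. The unnormalized log-density $-f_{\mathbf{\theta}}(\mathbf{x})$ is cheap to evaluate, but $\log Z_{\mathbf{\theta}}$ would require integrating $e^{-f_{\mathbf{\theta}}(\mathbf{x})}$ over all of $\mathbb{R}^d$.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Hypothetical energy function f_theta: R^d -> R."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 128), nn.SiLU(),
            nn.Linear(128, 128), nn.SiLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # f_theta(x), one scalar per sample

f_theta = EnergyNet(dim=2)
x = torch.randn(16, 2)

# The unnormalized log-density -f_theta(x) is easy to evaluate, but
# log Z_theta = log \int exp(-f_theta(x)) dx has no closed form for a general
# neural network, so exact maximum likelihood is intractable.
unnormalized_log_p = -f_theta(x)
```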
Score Function
The score function of $p(\mathbf{x})$ is defined as
$$
\nabla_{\mathbf{x}} \log p(\mathbf{x})
$$
Learn a score-based model $s_{\mathbf{\theta}}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$.
There is no need to worry about the normalizing constant, since its gradient with respect to $\mathbf{x}$ vanishes:
$$
s_{\mathbf{\theta}}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\nabla_{\mathbf{x}}f_{\mathbf{\theta}}(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}}\log Z_{\mathbf{\theta}}}_{0} = -\nabla_{\mathbf{x}}f_{\mathbf{\theta}}(\mathbf{x})
$$
Image Credit: Yang Song
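As a minimal illustration (assuming PyTorch and reusing the hypothetical `EnergyNet` above), the score of an energy-based model can be obtained by automatic differentiation; in practice, $s_{\mathbf{\theta}}$ is more commonly parameterized directly as a vector-valued network.

```python
import torch

def score_from_energy(f_theta, x):
    """s_theta(x) = -grad_x f_theta(x), computed with autograd."""
    x = x.detach().requires_grad_(True)
    energy = f_theta(x).sum()                   # sum over the batch; per-sample gradients are unaffected
    grad_x = torch.autograd.grad(energy, x)[0]  # d f_theta / d x, same shape as x
    return -grad_x

# Example (with the hypothetical EnergyNet above):
# s = score_from_energy(f_theta, torch.randn(16, 2))
```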
Learning: Score Matching
Minimize the Fisher divergence between the model and the data distributions.
$$
\min_{\mathbf{\theta}} \mathbb{E}_{p(\mathbf{x})}\left[\|\nabla_{\mathbf{x}}\log p(\mathbf{x}) - s_{\mathbf{\theta}}(\mathbf{x})\|_2^2\right]
$$
Problem: this requires access to the unknown score of the data, $\nabla_{\mathbf{x}}\log p(\mathbf{x})$.
Solution: score matching algorithms minimize the Fisher divergence without the ground-truth data score.
$$
\mathbb{E}_{p_{\text{data}}}\left[\mathrm{tr}\!\left(\nabla^2_{\mathbf{x}}\log p_{\mathbf{\theta}}(\mathbf{x})\right) + \frac{1}{2}\|\nabla_{\mathbf{x}}\log p_{\mathbf{\theta}}(\mathbf{x})\|_2^2\right]
$$
Practical realizations:
Sliced Score Matching (Song et al., 2019): $\mathbb{E}_{p_{\mathbf{v}}}\mathbb{E}_{p_{\text{data}}}\left[\mathbf{v}^T\nabla^2_{\mathbf{x}}\log p_{\mathbf{\theta}}(\mathbf{x})\mathbf{v} + \frac{1}{2}\left(\mathbf{v}^T\nabla_{\mathbf{x}}\log p_{\mathbf{\theta}}(\mathbf{x})\right)^2\right]$ (a sketch follows below)
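A minimal sketch of the sliced score matching objective for a score network $s_{\mathbf{\theta}}$ that directly outputs $\nabla_{\mathbf{x}}\log p_{\mathbf{\theta}}(\mathbf{x})$, assuming PyTorch; `score_net` is a placeholder, and the quadratic-form term $\mathbf{v}^T\nabla_{\mathbf{x}}s_{\mathbf{\theta}}(\mathbf{x})\mathbf{v}$ is computed with a second backward pass.

```python
import torch

def sliced_score_matching_loss(score_net, x, n_projections=1):
    """E_v E_x [ v^T (d s_theta/d x) v + 0.5 * (v^T s_theta(x))^2 ] with v ~ N(0, I)."""
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                              # (batch, dim)
    loss = 0.0
    for _ in range(n_projections):
        v = torch.randn_like(x)                                   # random projection directions
        sv = (s * v).sum()                                        # sum_b v_b^T s_theta(x_b)
        grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]
        hvp_term = (grad_sv * v).sum(dim=-1)                      # v^T (d s_theta/d x) v
        sq_term = 0.5 * (s * v).sum(dim=-1) ** 2                  # 0.5 * (v^T s_theta(x))^2
        loss = loss + (hvp_term + sq_term).mean()
    return loss / n_projections

# loss = sliced_score_matching_loss(score_net, x_batch); loss.backward()
```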
The score function, $s_{\mathbf{\theta}}(\mathbf{x}) \approx \nabla_{\mathbf{x}}\log p(\mathbf{x})$, can be used to draw samples from $p(\mathbf{x})$ through an iterative procedure called Langevin dynamics.
Langevin dynamics provides an MCMC procedure to sample from $p(\mathbf{x})$ using only its score function $\nabla_{\mathbf{x}}\log p(\mathbf{x})$.
$$\mathbf{x}_0 \sim \pi(\mathbf{x})$$
$$\mathbf{x}_{i+1} \leftarrow \mathbf{x}_i + \epsilon\nabla_{\mathbf{x}}\log p(\mathbf{x}_i) + \sqrt{2\epsilon}\mathbf{z}_i, \quad \mathbf{z}_i \sim \mathcal{N}(\mathbf{0},\mathbf{I}), \quad i=0,1,\dots,K$$
Langevin dynamics accesses $p(\mathbf{x})$ only through $\nabla_{\mathbf{x}}\log p(\mathbf{x})$.
Since $s_{\mathbf{\theta}}(\mathbf{x}) \approx \nabla_{\mathbf{x}}\log p(\mathbf{x})$, we can produce samples by plugging the learned score into this procedure, as in the sketch below.
Image Credit: Yang Song
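A minimal sketch of this sampler, assuming PyTorch; `score_net`, the prior, the step size, and the number of steps are all placeholder choices.

```python
import torch

@torch.no_grad()
def langevin_dynamics(score_net, x0, eps=1e-4, n_steps=1000):
    """x_{i+1} = x_i + eps * s_theta(x_i) + sqrt(2 * eps) * z_i,  z_i ~ N(0, I)."""
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + eps * score_net(x) + (2 * eps) ** 0.5 * z
    return x

# Initialize from an arbitrary prior pi(x), e.g. a standard Gaussian:
# samples = langevin_dynamics(score_net, torch.randn(64, 2))
```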
Summary: Score-Based Generative Models
Image Credit: Yang Song
Pitfalls of Naive Score-Based Models
Inaccurate score predictions will derail Langevin dynamics and lead to poor-quality generated samples.
Image Credit: Yang Song
Generative Modeling with Multiple Noise Perturbations
Problem: How do we overcome the difficulty of score estimation in low-density regions?
Solution: Train score-based models on perturbations of data.
Perturbing with additive noise is equivalent to convolving the density with the noise density:
If $\mathbf{z} = \mathbf{x} + \mathbf{y}$ with $\mathbf{x}$ and $\mathbf{y}$ independent, then $p_{\mathbf{z}} = p_{\mathbf{x}} \ast p_{\mathbf{y}}$.
How do we choose the appropriate noise scale?
Image Credit: Yang Song
Generative Modeling with Multiple Noise Perturbations
Larger noise covers more low density regions, but it alters the original distribution significantly.
Smaller noise does not alter the original distribution, but it does not cover low density regions.
Solution: Use multiple scales of noise perturbations simultaneously.
For $L$ increasing standard deviations $\sigma_1 < \sigma_2 < \cdots < \sigma_L$, perturb data with Gaussian noise $\mathcal{N}(\mathbf{0}, \sigma_i^2\mathbf{I})$ i.e.,
$$
p_{\sigma_i}(\mathbf{x}) = \int p(\mathbf{y})\mathcal{N}(\mathbf{x};\mathbf{y},\sigma_i^2\mathbf{I})d\mathbf{y}
$$
Train a noise-conditional score-based model $\mathbf{s}_{\mathbf{\theta}}(\mathbf{x},i) \approx \nabla_{\mathbf{x}}\log p_{\sigma_i}(\mathbf{x})$ by minimizing a weighted sum of Fisher divergences across noise levels,
$$
\sum_{i=1}^L \lambda(i)\,\mathbb{E}_{p_{\sigma_i}(\mathbf{x})}\left[\|\nabla_{\mathbf{x}}\log p_{\sigma_i}(\mathbf{x}) - \mathbf{s}_{\mathbf{\theta}}(\mathbf{x},i)\|_2^2\right]
$$
where $\lambda(i) \in \mathbb{R}_{> 0}$ is a weighting function, typically $\lambda(i) = \sigma_i^2$.
Sample with annealed Langevin dynamics: run Langevin dynamics with $\mathbf{s}_{\mathbf{\theta}}(\mathbf{x},i)$ for each noise level in turn, $i=L, L-1, \dots, 1$ (see the sketch below).
Image Credit: Yang Song
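A minimal sketch, assuming PyTorch, of both pieces: the weighted training objective estimated via denoising score matching (where the score of the Gaussian perturbation kernel, $-(\tilde{\mathbf{x}}-\mathbf{x})/\sigma_i^2$, serves as the regression target) and annealed Langevin dynamics that runs the sampler at each noise scale from $\sigma_L$ down to $\sigma_1$. The signature `score_net(x, i)` and the step-size schedule are placeholder choices.

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    """Weighted denoising score matching over noise levels, with lambda(i) = sigma_i^2.

    For x_tilde = x + sigma_i * z, the score of the Gaussian perturbation kernel is
    -(x_tilde - x) / sigma_i^2 = -z / sigma_i, which serves as the regression target.
    """
    loss = 0.0
    for i, sigma in enumerate(sigmas):
        z = torch.randn_like(x)
        x_tilde = x + sigma * z
        target = -z / sigma
        s = score_net(x_tilde, i)
        loss = loss + (sigma ** 2) * ((s - target) ** 2).sum(dim=-1).mean()
    return loss / len(sigmas)

@torch.no_grad()
def annealed_langevin(score_net, x, sigmas, eps=2e-5, n_steps=100):
    """Run Langevin dynamics at each noise level, from sigma_L down to sigma_1."""
    for i in reversed(range(len(sigmas))):
        step = eps * (sigmas[i] / sigmas[-1]) ** 2   # smaller steps at smaller noise levels
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + step * score_net(x, i) + (2 * step) ** 0.5 * z
    return x
```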
Noise Conditional Score-Based Model
Practical implementation design choices:
Choose $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ as a geometric progression, with $L$ on the order of hundreds to thousands.
$\mathbf{s}_{\mathbf{\theta}}(\mathbf{x},i)$ is modeled by a U-Net architecture with skip connections.
Apply an exponential moving average (EMA) to the weights of the score-based model for use at test time.
Image Credit: Yang Song
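A minimal sketch, assuming PyTorch, of two of these choices: a geometric noise schedule and an exponential-moving-average copy of the model weights for test-time use. The concrete values of `sigma_min`, `sigma_max`, `L`, and `decay` are placeholders, and the U-Net itself is left abstract.

```python
import copy
import math
import torch

def geometric_sigmas(sigma_min=0.01, sigma_max=50.0, L=1000):
    """Noise levels sigma_1 < ... < sigma_L in geometric progression."""
    return torch.exp(torch.linspace(math.log(sigma_min), math.log(sigma_max), L))

class EMA:
    """Exponential moving average of model weights, used at test time."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for p_ema, p in zip(self.shadow.parameters(), model.parameters()):
            p_ema.mul_(self.decay).add_(p, alpha=1 - self.decay)

# After each optimizer step: ema.update(score_net); sample with ema.shadow at test time.
```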
Perturbing Data with Stochastic Differential Equations
When $L \rightarrow \infty$, we perturb the data with continuously growing levels of noise, so the perturbed data follows a stochastic process.
Such stochastic processes are typically solutions of stochastic differential equations (SDEs) of the form
$$
d\mathbf{x} = f(\mathbf{x},t)dt + g(t)d\mathbf{w}
$$
$f(\cdot,t):\mathbb{R}^d \rightarrow \mathbb{R}^d$ is a vector-valued function called the drift coefficient.
$g(t) \in \mathbb{R}$ is a real-valued function called the diffusion coefficient.
$\mathbf{w}$ is standard Brownian motion, and $d\mathbf{w}$ is infinitesimal white noise.
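A minimal sketch of simulating this forward SDE with the Euler-Maruyama discretization, assuming PyTorch; `f` and `g` are placeholder callables for the drift and diffusion coefficients.

```python
import torch

def euler_maruyama_forward(x, f, g, t0=0.0, t1=1.0, n_steps=1000):
    """Simulate dx = f(x, t) dt + g(t) dw with the Euler-Maruyama scheme."""
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        dw = torch.randn_like(x) * dt ** 0.5   # Brownian increment ~ N(0, dt * I)
        x = x + f(x, t) * dt + g(t) * dw
        t += dt
    return x

# A drift-free choice f(x, t) = 0 with a g(t) that grows in t perturbs the data
# with continuously increasing noise, as in the discussion above.
```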
Sample Generation by Reversing Stochastic Differential Equation
Any SDE has a corresponding reverse SDE.
$$
d\mathbf{x} = [f(\mathbf{x},t) - g^2(t)\nabla_{\mathbf{x}}\log p_t(\mathbf{x})]dt + g(t)d\bar{\mathbf{w}}
$$
$dt$ is a negative infinitesimal time step, and $\bar{\mathbf{w}}$ is Brownian motion running backward in time.
$\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ is exactly the score function.
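A minimal sketch of sample generation by integrating the reverse SDE from $t=1$ to $t=0$ with Euler-Maruyama, assuming PyTorch; the learned score model $\mathbf{s}_{\mathbf{\theta}}(\mathbf{x},t)$ stands in for $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$, and `score_net`, `f`, and `g` are placeholders.

```python
import torch

@torch.no_grad()
def reverse_sde_sampler(score_net, x, f, g, n_steps=1000):
    """Integrate dx = [f(x, t) - g(t)^2 * s_theta(x, t)] dt + g(t) dw_bar from t = 1 to t = 0."""
    dt = -1.0 / n_steps                            # negative time step
    t = 1.0
    for _ in range(n_steps):
        drift = f(x, t) - g(t) ** 2 * score_net(x, t)
        dw = torch.randn_like(x) * abs(dt) ** 0.5  # reverse-time Brownian increment
        x = x + drift * dt + g(t) * dw
        t += dt
    return x

# x is initialized from the prior p_1(x), e.g. a wide Gaussian for a variance-exploding SDE.
```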
Summary: Stochastic Differential Equations
How does it perform?
| Method | FID ($\downarrow$) | Inception Score ($\uparrow$) |
|---|---|---|
| StyleGAN2 + ADA | 2.92 | 9.83 |
| Score-Based SDE | 2.20 | 9.89 |
Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021
Example Images
Controllable Generation for Solving Inverse Problems
Inverse problems involve estimating a sample $\mathbf{x}$ given observed information $\mathbf{y}$.
By Bayes' rule, $p(\mathbf{x}|\mathbf{y}) \propto p(\mathbf{x})\,p(\mathbf{y}|\mathbf{x})$, so
$$
\nabla_{\mathbf{x}}\log p(\mathbf{x}|\mathbf{y})=\nabla_{\mathbf{x}}\log p(\mathbf{x}) + \nabla_{\mathbf{x}}\log p(\mathbf{y}|\mathbf{x})
$$
$p(\mathbf{y}|\mathbf{x})$ is a classifier. All we need to turn an unconditional diffusion model into a conditional one is a classifier!
An additional scaling factor $\gamma$ is usually applied to the classifier gradient, which is equivalent to adjusting the temperature of $p(\mathbf{y}|\mathbf{x})$ (with $\gamma > 1$ sharpening it).
$$
\nabla_{\mathbf{x}}\log p(\mathbf{x}|\mathbf{y})=\nabla_{\mathbf{x}}\log p(\mathbf{x}) + \gamma\nabla_{\mathbf{x}}\log p(\mathbf{y}|\mathbf{x})
$$
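A minimal sketch of combining an unconditional score model with a classifier gradient, assuming PyTorch; `score_net` and `classifier` are placeholders, `classifier(x)` is assumed to return class logits, and the noise-level conditioning argument is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def guided_score(score_net, classifier, x, y, gamma=1.0):
    """grad_x log p(x|y) = grad_x log p(x) + gamma * grad_x log p(y|x)."""
    uncond = score_net(x)                                    # unconditional score
    x_in = x.detach().requires_grad_(True)
    log_probs = F.log_softmax(classifier(x_in), dim=-1)      # classifier log p(y|x)
    selected = log_probs[torch.arange(x.shape[0]), y].sum()  # pick the desired label per sample
    grad_log_py = torch.autograd.grad(selected, x_in)[0]
    return uncond + gamma * grad_log_py
```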
Controllable Generation for Solving Inverse Problems
Problem: It is difficult to obtain a classifier that is robust to the noise added during the diffusion process.