The easy-to-sample distribution is denoted $p_{\mathrm{ref}} \in \mathcal{P}(\mathbb{R}^d)$.
$p_{\mathrm{ref}}$ is usually the standard multivariate Gaussian.
Going from the data distribution to the easy-to-sample distribution: the noising process.
Going from the easy-to-sample distribution to the data distribution: the generative process.
How do we invert the forward noising process?
Denoising Diffusion Probabilistic Models
Denoising Diffusion Probabilistic Models (DDPM; Ho et al., 2020) is a discrete model that performs diffusion over a fixed number of $T = 10^3$ steps.
Forward model: discrete variance-preserving diffusion (warning: change of notation).
Distribution of samples: $q(x_0)$.
Conditional Gaussian noise: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(\sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I_d\big)$, i.e. $x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z_t$,
where the variance schedule $(\beta_t)_{1 \le t \le T}$ is fixed.
One-step noising $q(x_t \mid x_0)$: with $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, we have $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, z_t$, where $z_t$ is standard Gaussian.
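As a concrete illustration, the one-step noising formula can be sampled directly. This is a minimal NumPy sketch; the linear schedule values and the function name `q_sample` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Illustrative linear variance schedule (the exact values are an assumption).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # \bar{alpha}_t = prod_{s <= t} alpha_s

def q_sample(x0, t, rng=np.random.default_rng()):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I); t is 0-indexed."""
    z = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * z
    return x_t, z
```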
Diffusion Forward
Diffusion Backward
Diffusion parameters are designed such that $q(x_T) \approx \mathcal{N}(0, I_d)$.
In general, $q(x_{t-1} \mid x_t) \propto q(x_{t-1})\, q(x_t \mid x_{t-1})$ is intractable.
Denoising Diffusion Probabilistic Models
We consider the diffusion as a fixed stochastic encoder.
We want to learn a stochastic decoder $p_\theta$:
$$p_\theta(x_{0:T}) = \underbrace{p(x_T)}_{\text{fixed latent prior}} \prod_{t=1}^{T} \underbrace{p_\theta(x_{t-1} \mid x_t)}_{\text{learnable backward transitions}}$$
with $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(\mu_\theta(x_t, t),\, \beta_t I_d\big)$.
Compare with: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(\sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I_d\big)$.
Note that the diffusion coefficient is the same; only the backward drift is to be learned.
This is an oversimplified version of (Ho et al., 2020); there are also ways to learn the variance for each pixel, see (Nichol and Dhariwal, 2021).
We then train the decoder by maximizing an ELBO.
The $L_1$ term is handled differently (to account for the discretization of $x_0$).
(Ho et al., 2020) propose a simplified loss (dropping the weighting constants):
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\Big[\big\| \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\, t\big) - \epsilon \big\|^2\Big]$$
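As a hedged sketch, a single Monte Carlo draw of this objective can be written as follows, reusing `q_sample` and the schedule from the forward-noising sketch above; `eps_model(x_t, t)` is a hypothetical name standing for any noise-prediction network.

```python
def simple_loss(eps_model, x0, rng=np.random.default_rng()):
    """One Monte Carlo draw of L_simple = ||eps_theta(x_t, t) - eps||^2."""
    t = int(rng.integers(0, T))       # timestep sampled uniformly (0-indexed)
    x_t, eps = q_sample(x0, t, rng)   # forward-noise the clean sample
    eps_hat = eps_model(x_t, t)       # predict the injected noise
    return float(np.mean((eps_hat - eps) ** 2))
```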
DDPM: Training and sampling
$\sigma_t = \sqrt{\beta_t}$ here.
DDPM: Denoiser
The U-Net $\epsilon_\theta(x_t, t)$ is a (residual) denoiser that gives an estimate of the noise $\epsilon$ from
$x_t(x_0, \epsilon) = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$.
We get the associated estimate of $x_0$:
$$\hat{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}}\, x_t - \sqrt{\frac{1}{\bar\alpha_t} - 1}\; \epsilon_\theta(x_t, t).$$
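In code, this estimate is a one-liner (a sketch using the illustrative schedule and the hypothetical `eps_model` from the earlier snippets):

```python
def estimate_x0(eps_model, x_t, t):
    """Estimate x_0 from x_t: remove the predicted noise, then rescale by 1/sqrt(abar_t)."""
    abar = alpha_bars[t]
    return (x_t - np.sqrt(1.0 - abar) * eps_model(x_t, t)) / np.sqrt(abar)
```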
DDPM: Sampling
$\sigma_t = \sqrt{\beta_t}$ here.
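A minimal sketch of the ancestral sampling loop with $\sigma_t = \sqrt{\beta_t}$, again reusing the illustrative schedule and the hypothetical `eps_model`; skipping the added noise at the final step is a common convention assumed here.

```python
def ddpm_sample(eps_model, shape, rng=np.random.default_rng()):
    """Ancestral sampling x_T -> x_0 with sigma_t = sqrt(beta_t)."""
    x = rng.standard_normal(shape)    # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):    # t = T-1, ..., 0 (0-indexed steps)
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x
```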
Score Matching vs DDPM
Score Matching and DDPM
Diffusion model via SDE: (Song et al., 2021).
Diffusion model via Denoising Diffusion Probabilistic Models (DDPM): (Ho et al., 2020), a discrete model with a fixed number of $T = 10^3$ steps.
Score Matching vs DDPM
Forward SDE: $dx_t = f(x_t, t)\, dt + g(t)\, dw_t$
Example 1: Variance exploding diffusion (VE-SDE)
SDE: $dx_t = dw_t$
Solution: $x_t = x_0 + w_t$
Variance: $\mathrm{Var}(x_t) = \mathrm{Var}(x_0) + t$
Example 2: Variance preserving diffusion (VP-SDE)
SDE: $dx_t = -\beta_t x_t\, dt + \sqrt{2\beta_t}\, dw_t$
Solution: $x_t = e^{-B_t} x_0 + \int_0^t e^{B_s - B_t} \sqrt{2\beta_s}\, dw_s$ with $B_t = \int_0^t \beta_s\, ds$
Variance: $\mathrm{Var}(x_t) = e^{-2B_t}\, \mathrm{Var}(x_0) + 1 - e^{-2B_t}$
Both variants have the form $x_t = \alpha_t x_0 + \beta_t Z_t$: $x_t$ is a rescaled noisy version of $x_0$, and the noise becomes more and more predominant as time grows.
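A quick numerical check of the VP behaviour, simulating the forward SDE in the slide's convention ($dx_t = -\beta_t x_t\,dt + \sqrt{2\beta_t}\,dw_t$) with Euler-Maruyama; the constant schedule, horizon, and sample size are toy assumptions.

```python
import numpy as np

def vp_forward_em(x0, beta, t_max, n_steps, rng=np.random.default_rng()):
    """Euler-Maruyama simulation of dx = -beta(t) x dt + sqrt(2 beta(t)) dw."""
    h = t_max / n_steps
    x = np.array(x0, dtype=float)
    for n in range(n_steps):
        t = n * h
        z = rng.standard_normal(x.shape)
        x = x - beta(t) * x * h + np.sqrt(2.0 * beta(t) * h) * z
    return x

# Start from a wide initial distribution; the empirical variance should end up near 1,
# as predicted by Var(x_t) = e^{-2 B_t} Var(x_0) + 1 - e^{-2 B_t}.
x0 = 5.0 * np.random.default_rng(0).standard_normal(10_000)
xT = vp_forward_em(x0, beta=lambda t: 1.0, t_max=5.0, n_steps=500)
print(xT.var())   # approximately 1
```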
Learn the score by denoising score matching:
$$\theta^* = \arg\min_\theta\; \mathbb{E}_t\Big( \lambda_t\, \mathbb{E}_{(x_0, x_t)} \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_{t|0}(x_t \mid x_0) \big\|^2 \Big) \quad \text{with } t \sim \mathcal{U}([0, T])$$
Generate samples with a discretization scheme for the SDE (e.g. Euler-Maruyama):
$$Y_{n-1} = Y_n - h\, f(Y_n, t_n) + h\, g(t_n)^2\, s_\theta(Y_n, t_n) + g(t_n)\sqrt{h}\, Z_n \quad \text{with } Z_n \sim \mathcal{N}(0, I_d)$$
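A hedged sketch of this sampler; `score`, `f`, and `g` are placeholders for the learned score $s_\theta$ and the forward drift and diffusion coefficients, and the uniform step size is an assumption.

```python
import numpy as np

def reverse_sde_sample(score, f, g, t_max, n_steps, shape, rng=np.random.default_rng()):
    """Backward Euler-Maruyama pass: Y_{n-1} = Y_n - h f + h g^2 s_theta + g sqrt(h) Z_n."""
    h = t_max / n_steps
    y = rng.standard_normal(shape)    # start from the easy-to-sample reference distribution
    for n in range(n_steps, 0, -1):
        t_n = n * h
        z = rng.standard_normal(shape)
        y = y - h * f(y, t_n) + h * g(t_n) ** 2 * score(y, t_n) + g(t_n) * np.sqrt(h) * z
    return y
```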
Associated deterministic probability flow:
$$dy_t = \Big[ -f(y_t, T - t) + \tfrac{1}{2}\, g(T - t)^2\, \nabla_x \log p_{T-t}(y_t) \Big]\, dt$$
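The deterministic counterpart only changes the update rule: integrate the probability-flow ODE, e.g. with explicit Euler (a sketch under the same placeholder names; the integrator choice is an assumption).

```python
import numpy as np

def probability_flow_sample(score, f, g, t_max, n_steps, shape, rng=np.random.default_rng()):
    """Explicit Euler on dy = [-f(y, T - t) + 0.5 g(T - t)^2 score(y, T - t)] dt; no noise injected."""
    h = t_max / n_steps
    y = rng.standard_normal(shape)    # y_0 ~ reference distribution
    for n in range(n_steps):
        s = t_max - n * h             # corresponding forward time T - t
        y = y + h * (-f(y, s) + 0.5 * g(s) ** 2 * score(y, s))
    return y
```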
Denoising Diffusion Probabilistic Models (DDPM)
Forward Diffusion:
$$q(x_{0:T}) = \underbrace{q(x_0)}_{\text{data distribution}} \prod_{t=1}^{T} \underbrace{q(x_t \mid x_{t-1})}_{\text{fixed forward transitions}} \quad \text{with } q(x_t \mid x_{t-1}) = \mathcal{N}\big(\sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I_d\big)$$
Backward diffusion: stochastic decoder $p_\theta$:
$$p_\theta(x_{0:T}) = \underbrace{p(x_T)}_{\text{fixed latent prior}} \prod_{t=1}^{T} \underbrace{p_\theta(x_{t-1} \mid x_t)}_{\text{learnable backward transitions}} \quad \text{with } \underbrace{p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(\mu_\theta(x_t, t),\, \beta_t I_d\big)}_{\text{Gaussian approximation of } q(x_{t-1} \mid x_t)}$$
Denoising Diffusion Probabilistic Models (DDPM)
Learn the score by maximizing the ELBO (as for a VAE): this boils down to denoising the diffusion iterates $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$:
$$\theta^* = \arg\min_\theta\; \sum_{t=1}^{T} \frac{\beta_t}{1 - \bar\alpha_t}\, \mathbb{E}_q\Big[ \big\| \epsilon_\theta(x_t, t) - \epsilon \big\|^2 \Big] + C$$
Sampling through the stochastic decoder with
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \Big)$$
DDPM Training and Score Matching
Posterior-mean training: recall that $\mu_\theta(x_t, t)$ minimizes
$$\mathbb{E}_q\Big[ D_{\mathrm{KL}}\big( q(x_{t-1} \mid x_t, x_0) \,\big\|\, p_\theta(x_{t-1} \mid x_t) \big) \Big] = \frac{1}{\beta_t}\, \mathbb{E}_q\Big[ \big\| \mu_\theta(x_t, t) - \tilde\mu(x_t, x_0) \big\|^2 \Big] + C$$
where $\tilde\mu(x_t, x_0)$ is the mean of $q(x_{t-1} \mid x_t, x_0)$. Hence, ideally,
$$\mu_\theta(x_t, t) = \mathbb{E}\big[ \tilde\mu(x_t, x_0) \,\big|\, x_t \big] = \mathbb{E}\big[ \mathbb{E}[x_{t-1} \mid x_t, x_0] \,\big|\, x_t \big] = \mathbb{E}[x_{t-1} \mid x_t]$$
Noise-prediction training: $\epsilon_\theta(x_t, t)$ minimizes
$$\mathbb{E}\Big[ \big\| \epsilon_\theta(x_t, t) - \epsilon \big\|^2 \Big]$$
where $\epsilon$ is a function of $(x_t, x_0)$ since $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$. Hence, ideally,
$$\epsilon_\theta(x_t, t) = \mathbb{E}[\epsilon \mid x_t]$$
If $Y = aX + \sigma Z$ with $Z \sim \mathcal{N}(0, I_d)$ independent of $X$, $a > 0$, $\sigma > 0$, then
Tweedie denoiser: $\mathbb{E}[X \mid Y] = \frac{1}{a}\big( Y + \sigma^2 \nabla_Y \log p_Y(Y) \big)$
Tweedie noise predictor: $\mathbb{E}[Z \mid Y] = -\sigma\, \nabla_Y \log p_Y(Y)$
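As a quick sanity check (a toy 1-D Gaussian example, not from the slides), the Tweedie denoiser can be compared against the closed-form Gaussian conditional mean, since both are available analytically when $X$ is Gaussian.

```python
import numpy as np

# X ~ N(m, s^2), Y = a X + sigma Z with Z ~ N(0, 1) independent of X.
a, sigma, m, s = 0.8, 0.5, 1.0, 2.0
var_y = a**2 * s**2 + sigma**2            # Y ~ N(a m, var_y)

def score_y(y):
    """d/dy log p_Y(y) for the Gaussian marginal of Y."""
    return -(y - a * m) / var_y

y = 1.3
tweedie = (y + sigma**2 * score_y(y)) / a            # Tweedie denoiser E[X | Y = y]
closed_form = m + (a * s**2 / var_y) * (y - a * m)   # Gaussian conditional mean E[X | Y = y]
print(tweedie, closed_form)                          # the two values agree
```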
Tweedie for noise prediction: predict the noise $\epsilon$ from $x_t$:
$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon \;\Rightarrow\; \mathbb{E}[\epsilon \mid x_t] = -\sqrt{1 - \bar\alpha_t}\, \nabla_{x_t} \log p_t(x_t)$$
Tweedie for one-step denoising: predict $x_{t-1}$ from $x_t$:
$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{\beta_t}\, z_t \;\Rightarrow\; \mathbb{E}[x_{t-1} \mid x_t] = \frac{1}{\sqrt{\alpha_t}}\big( x_t + \beta_t \nabla_{x_t} \log p_t(x_t) \big)$$
$$\mathbb{E}[x_{t-1} \mid x_t] = \frac{1}{\sqrt{\alpha_t}}\Big( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \mathbb{E}[\epsilon \mid x_t] \Big)$$
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \Big)$$
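These identities translate directly into code: a score model yields both a noise predictor and the posterior mean $\mu_\theta$. A sketch, reusing the illustrative schedule from the DDPM snippets above; `score(x_t, t)` is a placeholder for $\nabla_{x_t} \log p_t(x_t)$.

```python
def eps_from_score(score, x_t, t):
    """Noise-prediction identity: E[eps | x_t] = -sqrt(1 - abar_t) * score(x_t, t)."""
    return -np.sqrt(1.0 - alpha_bars[t]) * score(x_t, t)

def mu_from_score(score, x_t, t):
    """One-step denoising identity: E[x_{t-1} | x_t] = (x_t + beta_t * score(x_t, t)) / sqrt(alpha_t)."""
    return (x_t + betas[t] * score(x_t, t)) / np.sqrt(alphas[t])
```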
Summary of Diffusion Models
The two training strategies are the same (up to weighting constants).
The only difference between the continuous SDE model and the discrete DDPM model is the set of time values: $t \in [0, T]$ vs. $t = 1, \ldots, T = 10^3$.
Good news: we can train a DDPM and use it with a deterministic probability-flow ODE (this is what the DDIM model does (Song et al., 2021)).
Q & A
Score Matching and Diffusion Models - III CSE 849: Deep Learning Vishnu Boddeti