$\nabla_{x_t}\log p_{t|0}(x_t|x_0)$ is explicit (forward transition): for $x_t \mid x_0 \sim \mathcal{N}(\alpha_t x_0,\, \beta_t^2 I_d)$,
$$\nabla_{x_t}\log p_{t|0}(x_t|x_0) = \nabla_{x_t}\Big[{-\tfrac{1}{2\beta_t^2}}\|x_t-\alpha_t x_0\|^2 + C\Big] = -\tfrac{1}{\beta_t^2}\,(x_t-\alpha_t x_0) = -\tfrac{1}{\beta_t}\, Z_t.$$
But the distribution $p_{0|t}(x_0|x_t)$ is not explicit (backward conditional)!
$$\mathbb{E}\big[\nabla_{x_t}\log p_{t|0}(x_t|x_0)\,\big|\,x_t\big] = -\tfrac{1}{\beta_t^2}\big(x_t-\alpha_t\,\mathbb{E}[x_0|x_t]\big)$$
$\mathbb{E}[x_0|x_t]$ is the best estimate of the initial noise-free $x_0$ given its noisy version $x_t$.
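This conditional expectation is exactly the marginal score: a standard identity, obtained by differentiating $p_t(x_t) = \int p_{t|0}(x_t|x_0)\,p_0(x_0)\,\mathrm{d}x_0$ and using Bayes' rule, gives
$$\nabla_{x_t}\log p_t(x_t) = \frac{\int \nabla_{x_t} p_{t|0}(x_t|x_0)\,p_0(x_0)\,\mathrm{d}x_0}{p_t(x_t)} = \int \nabla_{x_t}\log p_{t|0}(x_t|x_0)\,p_{0|t}(x_0|x_t)\,\mathrm{d}x_0 = \mathbb{E}\big[\nabla_{x_t}\log p_{t|0}(x_t|x_0)\,\big|\,x_t\big],$$
so regressing onto the explicit conditional score is, in expectation, regressing onto the intractable marginal score.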
Learning the score function: Denoising score matching continued...
$f:\mathbb{R}^d\to\mathbb{R}^d$ will be approximated with a neural network such as a (complex) U-Net (Ho et al., 2020).
But we need an approximation of $\nabla_{x_t}\log p_t$ for all times $t$ (at least for the times $t_n$ of our Euler-Maruyama scheme).
In practice we share the same network architecture for all times $t$: one learns a network $s_\theta(x,t)$ such that
$$s_\theta(x,t)\approx\nabla_x\log p_t(x),\qquad x\in\mathbb{R}^d,\; t\in[0,T].$$
Final loss for denoising score matching (Song et al., 2021b):
$$\theta^* = \arg\min_\theta\; \mathbb{E}_t\Big[\lambda_t\,\mathbb{E}_{(x_0,x_t)}\big\|s_\theta(x_t,t)-\nabla_{x_t}\log p_{t|0}(x_t|x_0)\big\|^2\Big]$$
where $t$ is chosen uniformly in $[0,T]$ and $t\mapsto\lambda_t$ is a weighting term to balance the importance of each $t$.
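As a concrete illustration, here is a minimal PyTorch-style sketch of one Monte Carlo estimate of this loss; the callables score_net, alpha and beta are placeholders for the network and the forward schedules, and $\lambda_t = \beta_t^2$ is only one common weighting choice, not the one prescribed here.

```python
import torch

def dsm_loss(score_net, x0, alpha, beta, T=1.0):
    """One Monte Carlo estimate of the denoising score matching loss.

    Assumes x_t | x_0 ~ N(alpha(t) * x_0, beta(t)^2 * I), so that the
    explicit conditional score target is -Z_t / beta(t).
    """
    batch = x0.shape[0]
    t = torch.rand(batch, device=x0.device) * T               # t ~ Uniform([0, T])
    a = alpha(t).view(batch, *([1] * (x0.dim() - 1)))          # broadcast to data shape
    b = beta(t).view(batch, *([1] * (x0.dim() - 1)))
    z = torch.randn_like(x0)                                   # Z_t ~ N(0, I)
    xt = a * x0 + b * z                                        # sample from p_{t|0}(. | x_0)
    target = -z / b                                            # explicit conditional score
    lam = b.view(batch) ** 2                                   # weighting lambda_t = beta_t^2
    per_sample = ((score_net(xt, t) - target) ** 2).flatten(1).sum(dim=1)
    return (lam * per_sample).mean()
```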
$s_\theta:\mathbb{R}^d\times[0,T]\to\mathbb{R}^d$ is a (complex) U-Net (Ronneberger et al., 2015); e.g., in (Ho et al., 2020): "All models have two convolutional residual blocks per resolution level and self-attention blocks at the 16x16 resolution between the convolutional blocks".
Diffusion time t is specified by adding the Transformer sinusoidal position embedding into each residual block (Vaswani et al., 2017).
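A small sketch of such a sinusoidal time embedding (the dimension and max_period defaults below are illustrative choices, not values taken from the cited papers):

```python
import math
import torch

def timestep_embedding(t, dim=128, max_period=10000.0):
    """Map diffusion times t (shape (batch,)) to sinusoidal features in R^dim,
    following the Transformer positional-encoding construction."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None, :]                     # (batch, half)
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)   # (batch, dim), dim even
```

This embedding is typically passed through a small MLP and then added inside each residual block.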
Exponential moving average
Several choices for $t\mapsto\lambda_t$ (Kingma and Gao, 2023).
Training uses the Adam algorithm (Kingma and Ba, 2015), but remains unstable.
To regularize: Exponential Moving Average (EMA) of weights.
$$\bar\theta_{n+1} = (1-m)\,\bar\theta_n + m\,\theta_{n+1}$$
Typically $m=10^{-4}$ (more than $10^4$ iterations are averaged).
The final averaged parameters ˉθK are used at sampling.
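A minimal sketch of this EMA update in PyTorch (ema_model is assumed to be a deep copy of the model, e.g. via copy.deepcopy, made before training starts):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, m=1e-4):
    """One EMA step: bar_theta_{n+1} = (1 - m) * bar_theta_n + m * theta_{n+1}."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(1.0 - m).add_(p, alpha=m)

# Call ema_update(ema_model, model) after every optimizer step;
# use ema_model (the averaged weights) for sampling at the end of training.
```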
Sampling Strategy
The score function of a distribution is generally used for Langevin sampling.
$$X_{n+1} = X_n + \gamma\,\nabla_x\log p(X_n) + \sqrt{2\gamma}\,Z_n$$
(Song et al., 2021b) propose to add one step of Langevin dynamics (at the same time $t=t_n$) after each Euler-Maruyama step (from $t_n$ to $t_{n+1}$).
This means that we jump from one trajectory to another, but we correct some of the defects of the Euler scheme.
This is called a Predictor-Corrector sampler.
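A minimal sketch of such a Predictor-Corrector loop, assuming the forward SDE coefficients drift_f and g, the learned score network, and a fixed Langevin step size (Song et al. (2021b) tune this step size adaptively; the fixed value here is purely illustrative):

```python
import torch

@torch.no_grad()
def pc_sampler(score_net, shape, times, drift_f, g, n_corrector=1, langevin_step=1e-4):
    """Predictor-Corrector sampling of the reverse-time SDE.

    Predictor: one Euler-Maruyama step of the reverse SDE.
    Corrector: n_corrector Langevin steps using the learned score.
    times is a decreasing sequence of floats from T to (near) 0.
    """
    x = torch.randn(shape)                                    # start from the Gaussian prior
    for t, t_next in zip(times[:-1], times[1:]):
        dt = t - t_next                                       # positive step size
        score = score_net(x, t * torch.ones(shape[0]))
        # Predictor: Euler-Maruyama step of the reverse SDE (time running backwards).
        x = x - (drift_f(x, t) - g(t) ** 2 * score) * dt + g(t) * dt ** 0.5 * torch.randn_like(x)
        # Corrector: Langevin steps targeting the marginal at the new time.
        for _ in range(n_corrector):
            score = score_net(x, t_next * torch.ones(shape[0]))
            x = x + langevin_step * score + (2 * langevin_step) ** 0.5 * torch.randn_like(x)
    return x
```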
Results
(Song et al., 2021b) achieved SOTA in terms of FID for CIFAR-10 unconditional sampling.
Very good results for 1024x1024 portrait images.
See also "Diffusion Models Beat GANs on Image Synthesis" (Dhariwal and Nichol, 2021).
Many approximations
Many approximations in the full generative pipeline:
The final distribution $p_T$ is not exactly a normal distribution.
The learnt U-Net model $s_\theta$ is far from being the exact score function:
Sample-based training, limitations from the architecture...
The score function may behave badly near $t=0$ (the density is irregular under the manifold hypothesis).
But we do have theoretical guarantees if everything is well controlled.
Theorem (Convergence guarantees (De Bortoli, 2022))
Let $p_0$ be the data distribution, supported on a compact manifold, and let $q_T$ be the distribution generated by the reversed diffusion. Under suitable hypotheses, the 1-Wasserstein distance $W_1(p_0,q_T)$ can be explicitly bounded and tends to zero as all the parameters are refined (more Euler steps, better score learning, etc.).
Probability Flow ODE
Sampling via an ODE
We derived the Fokker-Planck equation for $q_t = p_{T-t}$, the law of the reversed diffusion $y_t = x_{T-t}$.
It is also the Fokker-Planck equation associated with the following dynamics:
$$\mathrm{d}y_t = \Big[-f(y_t, T-t) + \tfrac{1}{2}\, g(T-t)^2\, \nabla_x \log p_{T-t}(y_t)\Big]\,\mathrm{d}t$$
which is an Ordinary Differential Equation (ODE) (no stochastic term).
Reverse Diffusion via an ODE
Probability flow ODE: $\mathrm{d}y_t = \big[-f(y_t, T-t) + \tfrac{1}{2}\, g(T-t)^2\, \nabla_x \log p_{T-t}(y_t)\big]\,\mathrm{d}t$
We get a deterministic mapping between initial noise and generated images.
We do not simulate the (chaotic) path of the stochastic diffusion, but we still have the same marginal distribution $p_t$.
We can use any ODE solver, including schemes of higher order than Euler's.
From (Karras et al., 2022) "Through extensive tests, we have found Heun's 2nd order method (a.k.a. improved Euler, trapezoidal rule) [...] to provide an excellent tradeoff between truncation error and NFE."
Requires far fewer NFEs (number of function evaluations) than stochastic samplers (e.g., around 50 steps instead of 1000); see also Denoising Diffusion Implicit Models (DDIM) (Song et al., 2021a) for a deterministic approach.
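As an illustration, here is a minimal sketch of a Heun (improved Euler / trapezoidal) integrator for the probability flow ODE, reusing the assumed drift_f, g and score_net callables from the sampler sketch above; it is a generic second-order scheme, not the specific parameterization of Karras et al. (2022):

```python
import torch

@torch.no_grad()
def probability_flow_heun(score_net, shape, times, drift_f, g):
    """Heun (2nd order) integration of the probability flow ODE.

    In reverse time, the ODE drift is v(y, t) = -f(y, t) + 0.5 * g(t)^2 * score(y, t);
    times is a decreasing sequence of floats from T to (near) 0.
    """
    def v(y, t):
        return -drift_f(y, t) + 0.5 * g(t) ** 2 * score_net(y, t * torch.ones(shape[0]))

    y = torch.randn(shape)                        # start from the Gaussian prior at time T
    for t, t_next in zip(times[:-1], times[1:]):
        h = t - t_next                            # positive step size in reverse time
        d1 = v(y, t)                              # Euler slope at the current point
        y_euler = y + h * d1                      # Euler predictor
        d2 = v(y_euler, t_next)                   # slope at the predicted point
        y = y + 0.5 * h * (d1 + d2)               # trapezoidal (Heun) correction
    return y
```

One Heun step costs two score evaluations, which is how it trades truncation error against NFE as noted above.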