Diffusion Posterior Sampling (DPS) for general noisy inverse problems.
Conditional Sampling
Let $A$ be a linear operator for an inverse problem (masking operator for inpainting, blur operator for deblurring, subsampling for SR, etc.).
The observation model is,
$$\mathbf{y} = A\mathbf{x}_{\text{unknown}} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2I).$$
We would like to sample,
$$p_0(\mathbf{x}_0|A\mathbf{x}_0 + \mathbf{\epsilon} = \mathbf{y}) = p_0(\mathbf{x}_0|\mathbf{y})$$
to estimate $\mathbf{x}_{\text{unknown}}$, using the generative model as a prior.
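As a concrete example, here is a minimal PyTorch sketch of the observation model for inpainting (the image, mask, and noise level are illustrative placeholders, not from a specific dataset or codebase):

```python
import torch

# Toy inpainting setup: the ground-truth image (here random) has shape [3, 64, 64].
x_unknown = torch.rand(3, 64, 64)

# Masking operator A: element-wise multiplication by a binary mask
# that hides the central 32x32 square.
mask = torch.ones(3, 64, 64)
mask[:, 16:48, 16:48] = 0.0

def A(x):
    """Linear masking operator for inpainting."""
    return mask * x

# Observation y = A x + eps, with eps ~ N(0, sigma^2 I).
sigma = 0.05
y = A(x_unknown) + sigma * torch.randn_like(x_unknown)
```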
Conditional Sampling Continued...
From (Song et al., 2021), we can sample the conditional distribution $p_0(\mathbf{x}_0|\mathbf{y})$ by running the reverse SDE with the conditional score $\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t|\mathbf{y})$ in place of the unconditional one.
For clarity, by Bayes' rule,
$$\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t|\mathbf{y}) = \nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t) + \nabla_{\mathbf{x}_t}\log p_t(\mathbf{y}|\mathbf{x}_t),$$
so the only missing ingredient is the likelihood term $\nabla_{\mathbf{x}_t}\log p_t(\mathbf{y}|\mathbf{x}_t)$.
Conditional Sampling Continued...
(Chung et al., 2023) propose the following approximation:
$$\log p_t(\mathbf{y}|\mathbf{x}_t) \approx \log p\left(\mathbf{y}\,|\,\mathbf{x}_0=\hat{\mathbf{x}}_0(\mathbf{x}_t,t)\right)$$
with $\hat{\mathbf{x}}_0(\mathbf{x}_t,t) = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_{\theta}(\mathbf{x}_t,t)\right)$ the network's estimate of the original image.
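A minimal sketch of this estimate in PyTorch (`eps_model` and the tensor of cumulative products `alpha_bar` are assumed placeholders, not a specific library's API):

```python
import torch

def x0_hat(eps_model, x_t, t, alpha_bar):
    """Estimate of the clean image from x_t:
    (x_t - sqrt(1 - alpha_bar_t) * eps_theta(x_t, t)) / sqrt(alpha_bar_t)."""
    a_bar = alpha_bar[t]
    return (x_t - torch.sqrt(1.0 - a_bar) * eps_model(x_t, t)) / torch.sqrt(a_bar)
```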
Since
$$p(\mathbf{y}|\mathbf{x}_0) = \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}}\exp\left(-\frac{\|\mathbf{y}-A\mathbf{x}_0\|^2}{2\sigma^2}\right)$$
we finally approximate
$$\nabla_{\mathbf{x}_t}\log p_t(\mathbf{y}|\mathbf{x}_t) \approx -\frac{1}{2\sigma^2}\nabla_{\mathbf{x}_t}\|\mathbf{y}-A\hat{\mathbf{x}}_0(\mathbf{x}_t,t)\|^2$$
Computing $\nabla_{\mathbf{x}_t}\|\mathbf{y} - A\hat{\mathbf{x}}_0(\mathbf{x}_t,t)\|^2$ involves backpropagation through the UNet.
So each conditional sampling step needs both a forward and a backward pass, making conditional sampling roughly twice as expensive as unconditional sampling.
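A minimal autograd sketch of this gradient (reusing the placeholder `eps_model`, `alpha_bar`, and operator `A` from above):

```python
import torch

def likelihood_grad(eps_model, x_t, t, y, A, alpha_bar):
    """Gradient of ||y - A x0_hat(x_t, t)||^2 with respect to x_t.
    Differentiating through x0_hat costs one extra backward pass through the UNet."""
    x_t = x_t.detach().requires_grad_(True)
    a_bar = alpha_bar[t]
    x0 = (x_t - torch.sqrt(1.0 - a_bar) * eps_model(x_t, t)) / torch.sqrt(a_bar)
    residual = torch.linalg.vector_norm(y - A(x0))
    grad = torch.autograd.grad(residual ** 2, x_t)[0]
    return grad, residual.detach()
```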
Diffusion Posterior Sampling
Usual DDPM sampling (notation with $\hat{\mathbf{x}}_0(\mathbf{x}_t,t)$ instead of $\epsilon_{\theta}(\mathbf{x}_t,t)$)
$$\mu_{\theta}(\mathbf{x}_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_{\theta}(\mathbf{x}_t,t)\right) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\hat{\mathbf{x}}_0(\mathbf{x}_t,t)$$
Add a correction term to the DDPM update to drive $A\hat{\mathbf{x}}_0(\mathbf{x}_t,t)$ close to $\mathbf{y}$:
$$\mathbf{x}_{t-1} = \mu_{\theta}(\mathbf{x}_t,t) + \sigma_t\mathbf{z} - \zeta_t\,\nabla_{\mathbf{x}_t}\|\mathbf{y}-A\hat{\mathbf{x}}_0(\mathbf{x}_t,t)\|^2, \quad \mathbf{z}\sim\mathcal{N}(0,I).$$
In practice, $\zeta_i=\zeta_t=\|\mathbf{y}-A\hat{\mathbf{x}}_0(\mathbf{x}_t,t)\|^{-1}$.
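Putting the pieces together, a sketch of one DPS sampling step under the same assumptions (the schedule tensors `alpha`, `alpha_bar`, `beta` and the helper `likelihood_grad` above are illustrative placeholders; $\sigma_t^2=\beta_t$ is one common variance choice, not the only one):

```python
import torch

def dps_step(eps_model, x_t, t, y, A, alpha, alpha_bar, beta):
    """One DPS update: standard DDPM ancestral step plus the likelihood correction."""
    grad, residual = likelihood_grad(eps_model, x_t, t, y, A, alpha_bar)

    # DDPM posterior mean mu_theta(x_t, t).
    with torch.no_grad():
        eps = eps_model(x_t, t)
    mu = (x_t - beta[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alpha[t])

    # Ancestral noise (none at the final step t = 0), with sigma_t^2 = beta_t.
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    x_prev = mu + torch.sqrt(beta[t]) * z

    # Correction term with step size zeta_t = 1 / ||y - A x0_hat||.
    zeta_t = 1.0 / (residual + 1e-8)
    return x_prev - zeta_t * grad
```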
Diffusion Posterior Sampling: Inpainting
Very good results in terms of the LPIPS perceptual metric.
Reconstructions can lack symmetry.
It can sometimes fail badly, though!
RePaint
RePaint (Lugmayr et al., 2022) is an alternative diffusion-based inpainting method: it keeps an unconditional DDPM and, at every reverse step, overwrites the known pixels with a forward-noised copy of the observation, resampling each step several times to harmonize the known and generated regions.
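A minimal sketch of RePaint's core merging step under stated assumptions (the mask `m` equals 1 on known pixels, `alpha_bar` is the placeholder schedule from above, and the resampling loop of the full algorithm is omitted):

```python
import torch

def repaint_merge(x_prev_generated, y_known, m, t, alpha_bar):
    """Combine the reverse-process sample with a forward-noised copy of the
    known pixels: x_{t-1} = m * x_known + (1 - m) * x_generated."""
    a_bar = alpha_bar[t - 1] if t > 0 else torch.ones(())
    # Forward-diffuse the known region to the current noise level.
    x_known = torch.sqrt(a_bar) * y_known + torch.sqrt(1.0 - a_bar) * torch.randn_like(y_known)
    return m * x_known + (1.0 - m) * x_prev_generated
```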
Conditional DDPM for Super-Resolution
Super-resolution is often used to improve the quality of generated images.
One can train a specific DDPM for this task by conditioning the UNet on the low-resolution image: $\epsilon_{\theta}(\mathbf{x}_t,\mathbf{y}_{LR},t)$.
(Saharia et al., 2023) To condition the model on the input $\mathbf{y}_{LR}$, we upsample the low-resolution image to the target resolution using bicubic interpolation. The result is concatenated with $\mathbf{x}_t$ along the channel dimension.
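A sketch of this conditioning in PyTorch (`unet` is a placeholder network whose first convolution accepts the concatenated channels; inputs are assumed batched `[N, C, H, W]` tensors):

```python
import torch
import torch.nn.functional as F

def sr_eps(unet, x_t, y_lr, t):
    """SR3-style conditioning: bicubic upsampling of y_LR to the target
    resolution, then channel-wise concatenation with x_t."""
    y_up = F.interpolate(y_lr, size=x_t.shape[-2:], mode="bicubic", align_corners=False)
    return unet(torch.cat([x_t, y_up], dim=1), t)  # eps_theta(x_t, y_LR, t)
```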
Conditional DDPM for Super-Resolution
Imagen pipeline: Text conditioning & Conditional super-resolution via DDPM.