Normalizing Flows
CSE 891: Deep Learning
Vishnu Boddeti
Monday November 17, 2021
A Probabilistic Viewpoint
- Fully Observed Models
- Transformation Models (likelihood-free)
- Latent Variable Models (observation noise)
- Undirected Latent Variable Models (hidden factors)
Overview
- How do we fit a density model p_\theta(x) with continuous data x \in \mathbb{R}^n?
- What do we want from this model?
- Good fit to the training data (really, the underlying distribution!)
- For new x, ability to evaluate p_\theta(x)
- Ability to sample from p_\theta(x)
- And, ideally, a latent representation that is meaningful
Recap: Probability Density Models
P(x \in [a, b]) = \int_a^b p(x)\, dx
Recap: How to fit a density model?
- Maximum Likelihood:
\max_\theta \sum_i \log p_\theta(x^{(i)})
- Equivalently:
\min_\theta \mathbb{E}_x\left[-\log p_\theta(x)\right]
Example: Mixture of Gaussians
p_\theta(x) = \sum_{i=1}^{k} \pi_i \, \mathcal{N}(x; \mu_i, \sigma_i^2)
- Parameters: means and variances of components, mixture weights
\theta = (\pi_1, \ldots, \pi_k, \mu_1, \ldots, \mu_k, \sigma_1, \ldots, \sigma_k)
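- As a rough illustration (my own sketch, not part of the lecture), the mixture density and its maximum-likelihood fit can be written in a few lines of PyTorch; the component count, toy data, and optimizer settings below are arbitrary choices:

```python
import torch

# Hypothetical setup: a 1-D mixture of k Gaussians with learnable weights, means, stds.
k = 5
logits = torch.zeros(k, requires_grad=True)     # unnormalized mixture weights pi_i
means = torch.randn(k, requires_grad=True)      # component means mu_i
log_stds = torch.zeros(k, requires_grad=True)   # log standard deviations log sigma_i

def log_prob(x):
    """log p_theta(x) for a batch x of shape (N,)."""
    log_pi = torch.log_softmax(logits, dim=0)
    comps = torch.distributions.Normal(means, log_stds.exp())
    # log sum_i pi_i N(x; mu_i, sigma_i^2), computed stably with logsumexp
    return torch.logsumexp(log_pi + comps.log_prob(x[:, None]), dim=1)

# Toy bimodal data; maximizing sum_i log p_theta(x^(i)) == minimizing the average NLL.
data = torch.cat([torch.randn(500) - 2.0, torch.randn(500) + 2.0])
opt = torch.optim.Adam([logits, means, log_stds], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = -log_prob(data).mean()
    loss.backward()
    opt.step()
```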
How to fit a general density model?
- How to ensure proper distribution?
\int_{-\infty}^{\infty} p_\theta(x)\, dx = 1, \qquad p_\theta(x) \ge 0 \;\; \forall x
- How to sample?
- Latent representation?
Flows: Main Idea
- Generally: z \sim p_Z(z)
- Normalizing Flow: z \sim \mathcal{N}(0, 1)
- Key questions:
- How to train?
- How to evaluate p_\theta(x)?
- How to sample?
Flows: Training
\max_\theta \sum_i \log p_\theta(x^{(i)})
- Need f(\cdot) to be invertible and differentiable
Change of Variables Formula
- Let f^{-1} denote a differentiable, bijective mapping from space Z to space X, i.e., it must be one-to-one and cover all of X.
- Since f^{-1} defines a one-to-one correspondence between values z \in Z and x \in X, we can think of it as a change-of-variables transformation.
- Change-of-Variables Formula from probability theory: if z = f(x), then
p_X(x) = p_Z(z) \left| \det\left( \frac{\partial z}{\partial x} \right) \right|
- Intuition for the Jacobian term: |\det(\partial z / \partial x)| measures how much f locally stretches or compresses volume around x, and rescaling the base density by this factor keeps the total probability mass equal to 1; a quick numeric check follows below.
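- A numeric sanity check of the formula (my own example), using an assumed affine map z = f(x) = (x - b)/a with a standard-normal base; the values of a, b, and x are arbitrary:

```python
import torch

# Hypothetical 1-D example: z = f(x) = (x - b) / a with base z ~ N(0, 1),
# so x = a*z + b should be distributed as N(b, a^2).
a, b = 2.0, 1.0
base = torch.distributions.Normal(0.0, 1.0)

x = torch.tensor(3.0)
z = (x - b) / a                                        # z = f(x)
log_det = torch.log(torch.abs(torch.tensor(1.0 / a)))  # log |dz/dx| = log(1/a)
log_px = base.log_prob(z) + log_det                    # change-of-variables formula

ref = torch.distributions.Normal(b, a).log_prob(x)     # density of N(b, a^2) directly
print(log_px.item(), ref.item())                       # the two values agree
```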
Flows: Training
maxθ∑ilogpθ(x(i))
z = f_\theta(x), \qquad p_\theta(x) = p_Z(z) \left| \det\left( \frac{\partial z}{\partial x} \right) \right| = p_Z(f_\theta(x)) \left| \det\left( \frac{\partial f_\theta(x)}{\partial x} \right) \right|
\max_\theta \sum_i \log p_\theta(x^{(i)}) = \max_\theta \sum_i \left[ \log p_Z(f_\theta(x^{(i)})) + \log\left| \det\left( \frac{\partial f_\theta(x)}{\partial x} \Big|_{x = x^{(i)}} \right) \right| \right]
- Assuming we have an expression for p_Z, we can optimize this objective with stochastic gradient descent (a minimal sketch follows below)
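- For concreteness, a minimal training sketch in PyTorch, assuming a 1-D affine flow z = f_\theta(x) = (x - \mu)\exp(-\log s) and a standard-normal base; the toy data and hyperparameters are made up:

```python
import torch

# Hypothetical flow: z = f_theta(x) = (x - mu) * exp(-log_s), base p_Z = N(0, 1).
mu = torch.zeros(1, requires_grad=True)
log_s = torch.zeros(1, requires_grad=True)
base = torch.distributions.Normal(0.0, 1.0)

def log_prob(x):
    z = (x - mu) * torch.exp(-log_s)     # z = f_theta(x)
    log_det = -log_s                     # log |d f_theta / dx| = -log_s (constant in x)
    return base.log_prob(z) + log_det    # log p_Z(f_theta(x)) + log |det|

data = 3.0 * torch.randn(1000) + 5.0     # toy data, roughly N(5, 9)
opt = torch.optim.Adam([mu, log_s], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = -log_prob(data).mean()        # minimize E_x[-log p_theta(x)]
    loss.backward()
    opt.step()
# mu and exp(log_s) should move toward the data mean and standard deviation (about 5 and 3).
```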
Flows: Sampling
- Step 1: sample z \sim p_Z(z)
- Step 2: x = f_\theta^{-1}(z)
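- Continuing the toy affine flow from the training sketch (the "trained" parameter values below are hypothetical), sampling is exactly these two steps:

```python
import torch

base = torch.distributions.Normal(0.0, 1.0)
mu, log_s = torch.tensor(5.0), torch.tensor(1.1)   # hypothetical trained parameters

z = base.sample((10,))              # step 1: z ~ p_Z(z)
x = z * torch.exp(log_s) + mu       # step 2: x = f_theta^{-1}(z) for the affine flow
```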
Change of Variables Formula
- Problems?
- The mapping f needs to be invertible, with an easy-to-compute inverse.
- It needs to be differentiable, so that the Jacobian \frac{\partial z}{\partial x} is defined.
- We need to be able to compute the (log) determinant.
Example: Flow to Uniform z
Example: Flow to Beta(5,5) z
Example: Flow to Gaussian z
2-D Autoregressive Flow
z_1 = f_\theta(x_1), \qquad z_2 = f_\theta(x_1, x_2)
2-D Autoregressive Flow: Two Moons
- Architecture:
- Base distribution: \text{Uniform}[0,1]^2
- x_1: mixture of 5 Gaussians
- x_2: mixture of 5 Gaussians, conditioned on x_1 (a code sketch follows below)
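- The slides specify this architecture only at a high level; the sketch below is my own compact instantiation: each dimension is mapped through the CDF of a 5-component Gaussian mixture (giving the Uniform[0,1] base), and the mixture parameters for x_2 are produced from x_1 by a small MLP whose size is an arbitrary choice.

```python
import torch
import torch.nn as nn

K = 5  # mixture components per dimension

class MixtureCDF1D(nn.Module):
    """Maps a scalar x to z = CDF_mixture(x) in (0, 1); also returns log |dz/dx|."""
    def forward(self, x, logits, means, log_stds):
        w = torch.softmax(logits, dim=-1)                            # (N, K) mixture weights
        comps = torch.distributions.Normal(means, log_stds.exp())    # K Gaussians per sample
        cdf = (w * comps.cdf(x.unsqueeze(-1))).sum(-1)                # z in (0, 1)
        pdf = (w * comps.log_prob(x.unsqueeze(-1)).exp()).sum(-1)     # dz/dx = mixture density
        return cdf, torch.log(pdf + 1e-12)

class AutoregressiveFlow2D(nn.Module):
    def __init__(self):
        super().__init__()
        # Unconditional mixture parameters for x1; an MLP produces them for x2 given x1.
        self.params1 = nn.Parameter(0.1 * torch.randn(3 * K))
        self.net2 = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 3 * K))
        self.cdf = MixtureCDF1D()

    def log_prob(self, x):                                   # x: (N, 2)
        p1 = self.params1.expand(x.shape[0], -1)
        _, ld1 = self.cdf(x[:, 0], *p1.chunk(3, dim=-1))     # z1 = f(x1)
        p2 = self.net2(x[:, :1])
        _, ld2 = self.cdf(x[:, 1], *p2.chunk(3, dim=-1))     # z2 = f(x1, x2)
        # The base is Uniform[0,1]^2, so log p_Z(z) = 0 and only the log-dets remain.
        return ld1 + ld2
```

Training then maximizes log_prob over the data with stochastic gradient descent, exactly as in the earlier training sketch.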
2-D Autoregressive Flow: Face
- Architecture:
- Base distribution: \text{Uniform}[0,1]^2
- x_1: mixture of 5 Gaussians
- x_2: mixture of 5 Gaussians, conditioned on x_1
High-Dimensional Data
Constructing Flows: Composition
- Flows can be composed
x \xrightarrow{f_1} \cdot \xrightarrow{f_2} \cdots \xrightarrow{f_k} z
z = f_k \circ \cdots \circ f_1(x), \qquad x = f_1^{-1} \circ \cdots \circ f_k^{-1}(z)
\log p_\theta(x) = \log p_Z(z) + \sum_{i=1}^{k} \log\left| \det \frac{\partial f_i}{\partial f_{i-1}} \right|
- Easy way to increase expressiveness
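- A minimal sketch of composition (my own, assuming each component flow exposes a forward(x) method returning (z, log_det) and an inverse(z) method): the forward maps are chained and the per-flow log-determinants simply add.

```python
import torch

class ComposedFlow:
    def __init__(self, flows):
        self.flows = flows                         # flows applied in order f_1, ..., f_k

    def forward(self, x):
        total_log_det = torch.zeros(x.shape[0])
        for f in self.flows:
            x, log_det = f.forward(x)              # z_i = f_i(z_{i-1})
            total_log_det = total_log_det + log_det  # log-dets of composed maps add up
        return x, total_log_det                    # final z and sum_i log |det df_i/df_{i-1}|

    def inverse(self, z):
        for f in reversed(self.flows):             # x = f_1^{-1} o ... o f_k^{-1}(z)
            z = f.inverse(z)
        return z
```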
Affine Flows
- An affine flow with a Gaussian base is just another name for a multivariate Gaussian model.
- Parameters: an invertible matrix A and a vector b
- f(x) = A^{-1}(x - b)
- Sampling: x = Az + b, where z \sim \mathcal{N}(0, I)
- Log-likelihood is expensive when the dimension is large (see the sketch below):
- The Jacobian of f is A^{-1}
- The log-likelihood involves computing \det(A)
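- A sketch of the affine flow under a standard-normal base (my own code; the dimension and the way A is initialized are arbitrary), which makes the determinant cost explicit:

```python
import torch

# Hypothetical setup: dimension d, an (assumed) invertible matrix A, and an offset b.
d = 10
A = torch.eye(d) + 0.1 * torch.randn(d, d)
b = torch.randn(d)
base = torch.distributions.MultivariateNormal(torch.zeros(d), torch.eye(d))

def log_prob(x):                                # x: (N, d)
    z = torch.linalg.solve(A, (x - b).T).T      # z = f(x) = A^{-1}(x - b)
    # |det df/dx| = |det A^{-1}| = 1 / |det A|; slogdet costs O(d^3),
    # which is what makes the log-likelihood expensive in high dimension.
    _, logabsdet = torch.linalg.slogdet(A)
    return base.log_prob(z) - logabsdet

def sample(n):
    z = base.sample((n,))                       # z ~ N(0, I)
    return z @ A.T + b                          # x = A z + b
```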
Elementwise Flows
f_\theta(x_1, x_2, \ldots, x_d) = (f_\theta(x_1), \ldots, f_\theta(x_d))
- Lots of freedom in elementwise flow
- Can use elementwise affine functions or CDF flows.
- The Jacobian is diagonal, so the determinant is easy to evaluate.
\frac{\partial z}{\partial x} = \mathrm{diag}\left(f'_\theta(x_1), \ldots, f'_\theta(x_d)\right), \qquad \det\left( \frac{\partial z}{\partial x} \right) = \prod_{i=1}^{d} f'_\theta(x_i)
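- A sketch of an elementwise affine flow (my own example; per-coordinate scale and shift), where the diagonal Jacobian makes the log-determinant a simple sum:

```python
import torch

# Per-coordinate scale and shift: z_i = (x_i - t_i) * exp(-log_s_i).
d = 4
log_s = torch.zeros(d, requires_grad=True)
t = torch.zeros(d, requires_grad=True)

def forward(x):                                     # x: (N, d)
    z = (x - t) * torch.exp(-log_s)                 # elementwise transform
    # Diagonal Jacobian: log prod_i |dz_i/dx_i| = -sum_i log_s_i (same for every sample).
    log_det = -log_s.sum() * torch.ones(x.shape[0])
    return z, log_det
```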
Flows with Neural Networks?
- Requirement: Invertible and Differentiable
- Neural Network
- If each layer is a flow, then the composition of layers is also a flow
- Each layer: ReLU, Sigmoid, Tanh?
Reversible Blocks
- Now let us define a reversible block which is invertible and has a tractable determinant.
- Such blocks can be composed.
- Inversion: f^{-1} = f_1^{-1} \circ \cdots \circ f_k^{-1}
- Determinants: \left| \frac{\partial x_k}{\partial z} \right| = \left| \frac{\partial x_k}{\partial x_{k-1}} \right| \cdots \left| \frac{\partial x_2}{\partial x_1} \right| \left| \frac{\partial x_1}{\partial z} \right|
Reversible Blocks
- Recall the residual blocks:
y=x+F(x)
- Reversible blocks are a variant of residual blocks. Divide the units into two groups, x1 and x2.
y_1 = x_1 + F(x_2), \qquad y_2 = x_2
- Inverting the reversible block:
x_2 = y_2, \qquad x_1 = y_1 - F(x_2)
Reversible Blocks
- Composition of two reversible blocks, but with x1 and x2 swapped:
- Forward:
y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1)
- Backward:
x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2)
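- A sketch of this pair of coupled reversible blocks, with F and G instantiated as small MLPs (the slides leave F and G unspecified; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

def mlp(d_half):
    return nn.Sequential(nn.Linear(d_half, 64), nn.ReLU(), nn.Linear(64, d_half))

class ReversibleBlock(nn.Module):
    """Additive coupling pair: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    def __init__(self, d_half):
        super().__init__()
        self.F = mlp(d_half)
        self.G = mlp(d_half)

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2            # log |det Jacobian| = 0: the block is volume preserving

    def inverse(self, y1, y2):
        x2 = y2 - self.G(y1)     # invert by subtracting the residual functions
        x1 = y1 - self.F(x2)
        return x1, x2
```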
Volume Preservation
- We still need to compute the log determinant of the Jacobian.
- The Jacobian of the reversible block:
y_1 = x_1 + F(x_2), \qquad y_2 = x_2
\frac{\partial y}{\partial x} = \begin{bmatrix} I & \frac{\partial F}{\partial x_2} \\ 0 & I \end{bmatrix}
- This is a block upper-triangular matrix. Its determinant is the product of the determinants of the diagonal blocks, which here is \det(I) \cdot \det(I) = 1.
- Since the determinant is 1, the mapping is said to be volume preserving.
Nonlinear Independent Components Estimation
- We just defined the reversible block.
- Easy to invert by subtracting rather than adding the residual function.
- The determinant of the Jacobian is 1.
- Nonlinear Independent Components Estimation (NICE) trains a network f, built as a composition of many such reversible blocks, that maps data x to code z; its inverse f^{-1} acts as the generator.
- We can compute the likelihood function using the change-of-variables formula:
p_X(x) = p_Z(z) \left| \det\left( \frac{\partial x}{\partial z} \right) \right|^{-1} = p_Z(z)
- We can train this model using maximum likelihood, i.e., given a dataset \{x^{(1)}, \ldots, x^{(N)}\}, we maximize the likelihood:
\prod_{i=1}^{N} p_X(x^{(i)}) = \prod_{i=1}^{N} p_Z(f(x^{(i)}))
Nonlinear Independent Components Estimation
- Likelihood:
p_X(x) = p_Z(z) = p_Z(f(x))
- Remember, pZ is a simple, fixed distribution (e.g. independent Gaussians)
- Intuition: train the network such that f(\cdot) maps each data point to a high-density region of the code vector space Z.
- Without constraints on f(\cdot), it could map everything to 0, and this likelihood objective would make no sense.
- But it cannot do this because it is volume preserving.
Nonlinear Independent Components Estimation
RealNVP
- Reversible Model:
- Forward Function:
y_1 = x_1, \qquad y_2 = x_2 \odot \exp(F_1(x_1)) + F_2(x_1)
- Inverse Function:
x_1 = y_1, \qquad x_2 = (y_2 - F_2(x_1)) \odot \exp(-F_1(x_1))
- Jacobian:
\frac{\partial y}{\partial x} = \begin{bmatrix} I & 0 \\ \frac{\partial y_2}{\partial x_1} & \mathrm{diag}(\exp(F_1(x_1))) \end{bmatrix}
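- A sketch of this affine coupling layer with F_1 (log-scale) and F_2 (shift) as small MLPs; this is my own minimal instantiation, not the RealNVP reference code:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y1 = x1;  y2 = x2 * exp(F1(x1)) + F2(x1)."""
    def __init__(self, d_half):
        super().__init__()
        self.F1 = nn.Sequential(nn.Linear(d_half, 64), nn.ReLU(), nn.Linear(64, d_half))  # log-scale
        self.F2 = nn.Sequential(nn.Linear(d_half, 64), nn.ReLU(), nn.Linear(64, d_half))  # shift

    def forward(self, x1, x2):
        s = self.F1(x1)
        y1, y2 = x1, x2 * torch.exp(s) + self.F2(x1)
        log_det = s.sum(dim=1)   # det of diag(exp(F1(x1))) => sum of the log-scales
        return y1, y2, log_det

    def inverse(self, y1, y2):
        x1 = y1
        x2 = (y2 - self.F2(x1)) * torch.exp(-self.F1(x1))
        return x1, x2
```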
How to Partition Variables?
Good vs Bad Partitioning
- Checkerboard masking, channel squeeze, channel-wise processing, channel unsqueeze, checkerboard masking
- Masking the top half, bottom half, left half, or right half of the variables
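- For illustration (my own code, not from the slides), the two mask types above can be built as follows for a (C, H, W) image tensor:

```python
import torch

def checkerboard_mask(h, w):
    """(H, W) mask that is 1 where (row + col) is even, 0 elsewhere."""
    rows = torch.arange(h).unsqueeze(1)
    cols = torch.arange(w).unsqueeze(0)
    return ((rows + cols) % 2 == 0).float()

def channel_mask(c):
    """(C,) mask selecting the first half of the channels."""
    m = torch.zeros(c)
    m[: c // 2] = 1.0
    return m
```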
Nonlinear Independent Components Estimation
- Samples produced by RealNVP, a model based on NICE.
Other Classes of Flows
- Glow
- Invertible 1×1 convolutions
- Large-scale training
- Continuous-time flows (FFJORD)
- Allow unrestricted architectures; invertibility and fast log-probability computation are guaranteed.
Summary
- The ultimate goal: a likelihood-based model with
- fast sampling
- fast inference
- fast training
- good samples
- good compression
- Flows seem to meet some of these criteria.
- Open question: How exactly do we design and compose flows for great performance?