Working with $3 \times I \times I$ image volumes preserves spatial structure.
Convolve the filter with the image i.e., "slide over the image spatially, computing dot products."
Filters always extend the full depth of the input volume.
Convolution Layer
1 number: the result of taking a dot product between the filter and a small $3\times W \times W$ chunk of the image, i.e., $\mathbf{w}^T\mathbf{x} + b$
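As a minimal sketch of this sliding dot product (assuming NumPy; the function name `conv2d_naive` and its arguments are illustrative, not from the lecture):

```python
import numpy as np

def conv2d_naive(x, w, b):
    """Slide one filter over one image, computing a dot product at each position.

    x: input volume (C, H, W), e.g. (3, 32, 32)
    w: one filter (C, K, K) -- the filter extends the full input depth
    b: scalar bias
    Returns an activation map of shape (H - K + 1, W - K + 1) (stride 1, no padding).
    """
    C, H, W = x.shape
    _, K, _ = w.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            chunk = x[:, i:i + K, j:j + K]     # small C x K x K chunk of the image
            out[i, j] = np.sum(w * chunk) + b  # one number: w^T x + b
    return out
```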
Stacking Convolutions
What happens if we stack two convolutional layers?
Problem: the composition of two convolutions is itself just another convolution (still a linear operator).
Solution: add a non-linear layer (e.g., ReLU) between any two linear layers; see the sketch below.
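A minimal sketch of the fix (assuming PyTorch; the channel counts and kernel sizes here are arbitrary, not from the lecture):

```python
import torch
import torch.nn as nn

# Two conv layers with a non-linearity (ReLU) between them, so the stack
# does not collapse into a single equivalent convolution.
stack = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # first linear (conv) layer
    nn.ReLU(),                                    # non-linear layer in between
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # second linear (conv) layer
)

x = torch.randn(1, 3, 32, 32)   # a batch with one 3x32x32 image
print(stack(x).shape)           # torch.Size([1, 32, 32, 32])
```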
Learned Convolution Filters
Strided Convolution
Input: $7\times 7$
Filter: $3\times 3$
Stride: $2\times 2$
Output: $3\times 3$
In general:
Input: $W$
Filter: $K$
Padding: $P$
Stride: $S$
Output: $\frac{W-K+2P}{S}+1$
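The formula is easy to check in code (a hypothetical helper, assuming Python):

```python
def conv_output_size(W, K, P, S):
    """Spatial output size of a convolution: (W - K + 2P) / S + 1."""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(W=7, K=3, P=0, S=2))   # 3  (the 7x7 example above)
print(conv_output_size(W=32, K=5, P=2, S=1))  # 32 (the complexity example below)
```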
Convolution Complexity
Input Volume: $3\times 32\times 32$
Weights: 10 $5\times 5$ filters with stride 1, pad 2
Output Volume: $10\times 32\times 32$
Number of learnable parameters: $10\times(3\times 5\times 5+1)=760$
Number of multiply-add operations: 768,000
$10\times 32\times 32=10,240$ outputs
each output is the inner product of two $3\times 5\times 5$ tensors (75 multiply-adds)
total $=75\times 10,240=768,000$
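The same counts, reproduced in a short Python sketch (values taken from the example above):

```python
C_in, H, W = 3, 32, 32                       # input volume: 3 x 32 x 32
num_filters, K, P, S = 10, 5, 2, 1           # 10 filters, 5x5, pad 2, stride 1

H_out = (H - K + 2 * P) // S + 1             # 32
params = num_filters * (C_in * K * K + 1)    # (75 weights + 1 bias) per filter
outputs = num_filters * H_out * H_out        # 10 x 32 x 32 output elements
madds = outputs * (C_in * K * K)             # 75 multiply-adds per output
print(params, outputs, madds)                # 760 10240 768000
```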
Example: $1\times 1$ Convolution
Stacking $1\times 1$ conv layers is equivalent to an MLP operating independently at each input position.
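A quick numerical check of this equivalence (assuming PyTorch; the channel counts and spatial size are arbitrary):

```python
import torch
import torch.nn as nn

conv1x1 = nn.Conv2d(64, 32, kernel_size=1)            # 1x1 convolution
fc = nn.Linear(64, 32)                                 # the "MLP" layer
fc.weight.data = conv1x1.weight.data.view(32, 64)      # copy the same weights...
fc.bias.data = conv1x1.bias.data                       # ...and the same bias

x = torch.randn(1, 64, 56, 56)                         # N x C x H x W
y_conv = conv1x1(x)
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # apply the FC layer at each position
print(torch.allclose(y_conv, y_fc, atol=1e-5))         # True
```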
Normalizing a batch of activations (subtracting the per-channel mean and dividing by the per-channel standard deviation) is a differentiable function, so we can use it as an operator in our networks and backprop through it.
Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015
Batch Normalization: Training
Input $x:N\times D$
Learnable scale and shift parameters $\gamma,\beta:D$
Learning $\gamma=\sigma$, $\beta=\mu$ will recover the identity function.
$$
\begin{eqnarray}
\mu_j &=& \frac{1}{N}\sum_{i=1}^N x_{i,j} \mbox{ per-channel mean, shape is } D\\
\sigma^2_j &=& \frac{1}{N}\sum_{i=1}^N (x_{i,j}-\mu_j)^2 \mbox{ per-channel variance, shape is } D\\
\hat{x}_{i,j} &=& \frac{x_{i,j}-\mu_j}{\sqrt{\sigma^2_j+\epsilon}} \mbox{ normalized x, shape is } N\times D\\
y_{i,j} &=& \gamma_j\hat{x}_{i,j}+\beta_j \mbox{ output, shape is } N\times D
\end{eqnarray}
$$
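The training-time equations above, as a minimal NumPy sketch (the function name `batchnorm_train` is illustrative; the running-average bookkeeping needed for test time is omitted):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Batch normalization, training mode. x: (N, D); gamma, beta: (D,)."""
    mu = x.mean(axis=0)                     # per-channel mean, shape (D,)
    var = x.var(axis=0)                     # per-channel variance, shape (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized x, shape (N, D)
    return gamma * x_hat + beta             # scale and shift, shape (N, D)
```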
Batch Normalization: Testing
Input $x:N\times D$
Estimates of $\mu_j$ and $\sigma_j$ depend on the minibatch.
Problem: we can't do this at test time!
Solution: use running averages of the values computed during training.
At test time batchnorm becomes a linear operator!
It can be fused with the preceding FC or conv layer.
$$
\begin{eqnarray}
\mu_j &=& \mbox{running average of values seen during training} \\
\sigma^2_j &=& \mbox{running average of values seen during training} \\
\hat{x}_{i,j} &=& \frac{x_{i,j}-\mu_j}{\sqrt{\sigma^2_j+\epsilon}} \mbox{ normalized x, shape is } N\times D\\
y_{i,j} &=& \gamma_j\hat{x}_{i,j}+\beta_j \mbox{ output, shape is } N\times D
\end{eqnarray}
$$
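At test time the statistics are fixed constants, so the whole operation reduces to a per-channel scale and shift; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def batchnorm_test(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """Batch normalization, test mode: a linear operator y = scale * x + shift."""
    scale = gamma / np.sqrt(running_var + eps)   # fixed per-channel scale
    shift = beta - running_mu * scale            # fixed per-channel shift
    return x * scale + shift                     # linear in x, so it can be fused
                                                 # into the preceding FC/conv layer
```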