CNN Architectures
CSE 891: Deep Learning
Vishnu Boddeti
Wednesday September 29, 2021
Last Time: CNNs
Convolutional Layers
Pooling Layers
Fully-Connected Layers
Activation Function
Normalization
$$\hat{x}^{(k)}=\frac{x^{(k)}-E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$
Today
Convolutional Layer Variations
CNN Architectures
Convolutional Layer Variants
Transposed Convolutions
Upsample and Convolution in a single operation.
Useful for dense prediction tasks (segmentation, pose estimation, etc.).
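A minimal PyTorch sketch of the idea (not from the lecture; channel counts and kernel size are chosen only for illustration): a stride-2 transposed convolution upsamples and convolves in one operation.

```python
import torch
import torch.nn as nn

# Stride-2 transposed convolution: doubles the spatial size while convolving.
# Channel counts and kernel size are illustrative, not from the lecture.
upconv = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                            kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 16, 16)   # N x C x H x W feature map
y = upconv(x)
print(y.shape)                   # torch.Size([1, 32, 32, 32]) -- spatial size doubled
```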
Dilated Convolutions
Increase receptive field of convolutional filters.
No increase in the number of parameters to learn.
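A short sketch of this point in PyTorch (sizes are illustrative): with dilation 2, a $3\times 3$ kernel covers a $5\times 5$ region while keeping the same number of weights.

```python
import torch
import torch.nn as nn

# A 3x3 conv with dilation=2 covers a 5x5 receptive field,
# but still has only 3*3*C_in*C_out weights.
conv = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2, bias=False)  # padding=2 keeps H, W

out = conv(torch.randn(1, 64, 32, 32))
print(out.shape)                                   # torch.Size([1, 64, 32, 32])
print(sum(p.numel() for p in conv.parameters()))   # 36864, same as an undilated 3x3 conv
```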
Binary Convolutions
Convolution is the costliest operation (typically) in CNNs.
Improve efficiency of each convolution operation.
Inference: Binarization, Quantization, Sparsification.
Learning: Quantization, Sparsification, Randomization (unoptimized).
Juefei-Xu et al. "Local Binary Convolutional Neural Networks" CVPR 2017
Juefei-Xu et al. "Perturbative Neural Networks" CVPR 2018
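As a generic illustration of weight binarization (a BinaryConnect/XNOR-Net-style scheme, not the local binary / perturbative method from the cited papers), one can replace full-precision weights with a scaled sign at inference time:

```python
import torch
import torch.nn.functional as F

def binary_conv2d(x, weight, padding=1):
    """Convolve with binarized weights: W ~ alpha * sign(W).
    Illustrative only -- a generic weight-binarization scheme,
    not the LBCNN/PNN method from the cited papers."""
    alpha = weight.abs().mean()            # per-layer scaling factor
    w_bin = alpha * torch.sign(weight)     # weights restricted to {-alpha, +alpha}
    return F.conv2d(x, w_bin, padding=padding)

x = torch.randn(1, 16, 8, 8)
w = torch.randn(32, 16, 3, 3)              # full-precision weights (e.g. from training)
print(binary_conv2d(x, w).shape)           # torch.Size([1, 32, 8, 8])
```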
CNN Architectures
ImageNet Classification Challenge
Primarily driven by skilled practitioners and elaborate design,
a.k.a. "Graduate Student Design"
AlexNet
5 convolutional layers, 3 fully connected layers
60 million parameters
GPU-based training
Established that deep learning works for computer vision!
Krizhevsky et al. "ImageNet Classification with Deep Convolutional Neural Networks" NeurIPS 2012
VGG: Deeper Networks
19 layers deep, 3 fully connected layers
144 million parameters
$3 \times 3$ convolutional filters with stride 1
$2 \times 2$ max-pooling layers with stride 2
Established that smaller filters are more parameter-efficient and that deeper networks perform better!
Two stacked $3\times 3$ conv layers have the same receptive field as a single $5\times 5$ conv layer, but with fewer parameters and less computation (see the comparison below).
Simonyan et al. "Very Deep Convolutional Networks for Large-Scale Image Recognition" ICLR 2015
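A quick check of this claim, assuming $C$ input and $C$ output channels and ignoring biases:

$$ \begin{eqnarray} \mbox{Two stacked } 3\times 3 \mbox{ convs: } && 2\times(3\cdot 3\cdot C\cdot C) = 18C^2 \mbox{ parameters} \\ \mbox{One } 5\times 5 \mbox{ conv: } && 5\cdot 5\cdot C\cdot C = 25C^2 \mbox{ parameters} \end{eqnarray} $$

The compute per output position scales the same way, and the two-layer stack adds an extra nonlinearity in between.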
GoogLeNet: Focus on Efficiency
22 layers, introduced the "inception" module
efficient architecture - in terms of computation
compact model (about 5 million parameters)
computational budget - 1.5 billion multiply-adds
Szegedy et al. "Going deeper with convolutions" CVPR 2015
GoogLeNet: Inception Module
Inception module: local unit with parallel branches
Local structure repeated many times throughout the network
Use $1\times 1$ "Bottleneck" layers to reduce channel dimension before expensive conv (we will revisit this with ResNet!)
Szegedy et al. "Going deeper with convolutions" CVPR 2015
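A minimal sketch of a parallel-branch module with $1\times 1$ bottlenecks (channel counts are illustrative, not GoogLeNet's actual configuration):

```python
import torch
import torch.nn as nn

class InceptionLikeBlock(nn.Module):
    """Parallel branches with 1x1 bottlenecks before the expensive convs.
    Channel sizes are illustrative, not GoogLeNet's exact configuration."""
    def __init__(self, c_in):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, 32, 1)                     # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(c_in, 16, 1),      # 1x1 bottleneck
                                nn.Conv2d(16, 32, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(c_in, 8, 1),       # 1x1 bottleneck
                                nn.Conv2d(8, 16, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, 16, 1))      # pool projection

    def forward(self, x):
        # Concatenate branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

block = InceptionLikeBlock(64)
print(block(torch.randn(1, 64, 28, 28)).shape)   # torch.Size([1, 96, 28, 28])
```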
GoogLeNet: Auxiliary Classifiers
Training using loss at the end of the network didn't work well: Network is too deep, gradients don't propagate cleanly
As a hack, attach "auxiliary classifiers" at several intermediate points in the network that also try to classify the image and receive loss
GoogLeNet predates batch normalization! With BatchNorm, this trick is no longer needed.
Deeper Networks
Stack More Layers?
Once we have Batch Normalization, we can train networks with 10+ layers. What happens as we go deeper?
Deeper model does worse than shallow model!
Initial guess: Deep model is overfitting since it is much bigger than the other model
In fact, the deeper model also performs worse than the shallow model on the training set: it is underfitting, not overfitting.
Training Deeper Networks
A deeper model can emulate a shallower model: copy layers from shallower model, set extra layers to identity
Thus deeper models should do at least as well as shallow models
Hypothesis: This is an optimization problem. Deeper models are harder to optimize, and in particular don’t learn identity functions to emulate shallow models
Solution: Change the network so learning identity functions with extra layers is easy!
Residual Units
Solution: Change the network so learning identity functions with extra layers is easy!
Standard Block
Residual Block
Identity Shortcuts
Forward Pass:
$$ \begin{eqnarray} \mathbf{y}_k &=& h(\mathbf{x}_k) + \mathcal{F}(\mathbf{x}_k,\mathbf{W}_k) \\ \mathbf{x}_{k+1} &=& f(\mathbf{y}_k) \end{eqnarray} $$
Backward Pass (assuming identity mappings $h(\mathbf{x}_k)=\mathbf{x}_k$ and $f(\mathbf{y}_k)=\mathbf{y}_k$, so that $\mathbf{x}_K = \mathbf{x}_k + \sum_{i=k}^{K-1}\mathcal{F}(\mathbf{x}_i,\mathbf{W}_i)$ for any deeper layer $K$):
$$ \begin{eqnarray} \frac{\partial L}{\partial \mathbf{x}_k} = \frac{\partial L}{\partial \mathbf{x}_K}\left(1+\frac{\partial}{\partial \mathbf{x}_k}\sum_{i=k}^{K-1}\mathcal{F}(\mathbf{x}_i,\mathbf{W}_i)\right) \end{eqnarray} $$
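A minimal PyTorch sketch of a basic residual block with an identity shortcut (layer sizes are illustrative; stride and projection shortcuts are omitted for clarity):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = x + F(x): identity shortcut around two 3x3 convs.
    A minimal sketch; stride/projection shortcuts are omitted."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(x + residual)   # identity shortcut + post-activation

block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```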
Residual Networks
A residual network is a stack of many residual blocks
Regular design, like VGG: each residual block has two $3 \times 3$ conv layers
Network is divided into stages: the first block of each stage halves the resolution (with stride-2 conv) and doubles the number of channels
Residual Blocks
Residual Block: Basic. Total FLOPs: $18HWC^2$
Residual Block: Bottleneck. Total FLOPs: $17HWC^2$
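These totals can be recovered by counting one multiply-add per weight per output position over an $H\times W$ feature map, assuming the standard bottleneck design that takes $4C$ input channels, reduces to $C$ with a $1\times 1$ conv, applies a $3\times 3$ conv, and expands back to $4C$:

$$ \begin{eqnarray} \mbox{Basic: } && 2\times(3\cdot 3\cdot C\cdot C)\cdot HW = 18HWC^2 \\ \mbox{Bottleneck: } && (4C\cdot C + 3\cdot 3\cdot C\cdot C + C\cdot 4C)\cdot HW = 17HWC^2 \end{eqnarray} $$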
How Does It Do?
Able to train very deep networks
Deeper networks do better than shallow networks (as expected)
Swept 1st place in all ILSVRC and COCO 2015 competitions
Still widely used today!
MSRA @ILSVRC & COCO 2015 Competitions
1st place in all five main tracks:
ImageNet Classification: 152-layer network
ImageNet Detection: 16% better than 2nd
ImageNet Localization: 27% better than 2nd
COCO Detection: 11% better than 2nd
COCO Segmentation: 12% better than 2nd
Improving Residual Block Design
Note ReLU after residual: cannot actually learn identity function since outputs are nonnegative
Note ReLU inside residual: can learn true identity function by setting Conv weights to zero
Slight improvement in accuracy
ResNet-152: 21.3 vs 21.1
ResNet-200: 21.8 vs 20.7
Not actually used that much in practice
Comparing Complexity
Canziani et al. "An analysis of deep neural network models for practical applications" 2017
Model Ensembles
Multi-scale ensemble of Inception, Inception-Resnet, Resnet, Wide Resnet models
Err (%) by model:
Inception-v3: 4.20
Inception-v4: 4.01
Inception-Resnet-v2: 3.52
Resnet-200: 4.26
Wrn-68-3: 4.65
Fusion (Val): 2.92 (-0.6)
Fusion (Test): 2.99
Shao et al. 2016
Improving ResNets: ResNeXt
Residual Block: Bottleneck. Total FLOPs: $17HWC^2$
ResNeXt Block: Parallel Pathways ($G$ groups of internal width $c$). Total FLOPs: $(8Cc+9c^2)HWG$
Xie et al. "Aggregated residual transformations for deep neural networks" CVPR 2017
Grouped Convolution
ResNeXt Block: Parallel Pathways
Residual Block: Grouped Convolution
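A hedged sketch of the grouped-convolution form of a ResNeXt-style block (channel counts, group count, and group width are illustrative): the grouped $3\times 3$ conv with $G$ groups is equivalent to $G$ parallel pathways of width $c$.

```python
import torch
import torch.nn as nn

class ResNeXtLikeBlock(nn.Module):
    """Bottleneck block whose 3x3 conv is split into G groups,
    equivalent to G parallel pathways of width c. Sizes are illustrative."""
    def __init__(self, channels=256, groups=32, group_width=4):
        super().__init__()
        inner = groups * group_width                          # total width of the grouped conv
        self.reduce = nn.Conv2d(channels, inner, 1, bias=False)
        self.grouped = nn.Conv2d(inner, inner, 3, padding=1,
                                 groups=groups, bias=False)   # G independent pathways
        self.expand = nn.Conv2d(inner, channels, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.grouped(out))
        out = self.expand(out)
        return self.relu(x + out)                             # identity shortcut

block = ResNeXtLikeBlock()
print(block(torch.randn(1, 256, 14, 14)).shape)               # torch.Size([1, 256, 14, 14])
```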
Squeeze-and-Excitation Networks
Adds a "squeeze-and-excite" branch to each residual block that performs global pooling and fully-connected layers, then multiplies the result back onto the feature map.
Adds global context to each residual block!
Hu et al. "Squeeze-and-Excitation Networks" CVPR 2018
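A minimal sketch of the squeeze-and-excite branch (the reduction ratio and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excite: global average pool -> two FC layers -> per-channel
    gates in [0, 1] multiplied back onto the feature map. Sizes illustrative."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                                  # squeeze: global pool -> N x C
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))    # excite: channel gates
        return x * s.view(n, c, 1, 1)                           # rescale each channel

se = SEBlock(64)
print(se(torch.randn(1, 64, 28, 28)).shape)                     # torch.Size([1, 64, 28, 28])
```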
Dense Connections
Dense blocks where each layer is connected to every other layer in a feed-forward fashion.
Alleviates vanishing gradients, strengthens feature propagation, and encourages feature reuse.
Huang et al. "Densely Connected Convolutional Networks" CVPR 2017
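A hedged sketch of a dense block, where each layer takes the concatenation of the block input and all previous layers' outputs (growth rate and depth are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each 3x3 conv sees the concatenation of the block input and all
    previous layers' outputs. Growth rate and depth are illustrative."""
    def __init__(self, c_in, growth=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(c_in + i * growth, growth, 3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(torch.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)             # c_in + num_layers * growth channels

block = DenseBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)         # torch.Size([1, 64, 32, 32])
```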
MobileNets: Tiny Networks
Standard Convolution Block
Total Cost: $9C^2HW$
Depthwise Separable Convolution
Total Cost: $(9C+C^2)HW$
$$ \begin{eqnarray} \mbox{Speedup } = \frac{9C^2HW}{(9C+C^2)HW} = \frac{9C}{9+C} \approx 9 \mbox{ for large } C \end{eqnarray} $$
Howard et al. "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" arXiv 2017
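A minimal sketch of the depthwise separable factorization (a per-channel $3\times 3$ conv via groups=C, followed by a $1\times 1$ pointwise conv; the channel count is illustrative):

```python
import torch
import torch.nn as nn

C = 64
# Standard 3x3 conv: 9*C*C weights.
standard = nn.Conv2d(C, C, 3, padding=1, bias=False)

# Depthwise separable: a per-channel 3x3 conv (groups=C, 9*C weights)
# followed by a 1x1 pointwise conv (C*C weights).
depthwise_separable = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),   # depthwise
    nn.Conv2d(C, C, 1, bias=False),                        # pointwise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise_separable))          # 36864 vs 4672 (~8x fewer)
```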
Neural Architecture Search
Designing neural network architectures is hard, so let us automate it.
Map neural networks to hyperparameters.
Define a search space over hyperparameters.
Search for the networks with the best hyperparameters.
Search Methods:
Reinforcement Learning
Evolutionary Algorithms
Gradient descent on continuous relaxation
Neural Architecture Search
Lu et al. "NSGA-NET: Neural Architecture Search using Multi-Objective Genetic Algorithm" GECCO 2019
Lu et al. "MUXConv: Information Multiplexing in Convolutional Neural Networks" CVPR 2020
Lu et al. "NSGANetV2: Evolutionary Multi-Objective Surrogate-Assisted Neural Architecture Search" ECCV 2020
Lu et al. "Multi-Objective Evolutionary Design of Deep Convolutional Neural Networks for Image Classification" IEEE TEVC 2020
Lu et al. "Neural Architecture Transfer" IEEE TPAMI 2021
Neural Architecture Search
Lu et al. "Neural Architecture Transfer" IEEE TPAMI 2021
CNN Progression
Lu et al. "MUXConv: Information Multiplexing in Convolutional Neural Networks" CVPR 2020
CNN Architectures Summary
Early work (AlexNet $\rightarrow$ ZFNet $\rightarrow$ VGG) shows that bigger networks work better
GoogLeNet was one of the first to focus on efficiency (aggressive stem, 1x1 bottleneck convolutions, global average pooling instead of FC layers)
ResNet showed us how to train extremely deep networks – limited only by GPU memory. Started to show diminishing returns as networks got bigger
After ResNet:
Efficient networks became central: how can we improve the accuracy without increasing the complexity?
Lots of tiny networks aimed at mobile devices: MobileNet, ShuffleNet, MUXNet, etc.
Neural Architecture Search promises to automate architecture design
Which Architecture should I Use?
Don't be a hero.
For most problems you should use an off-the-shelf architecture; don't try to design your own!
If you just care about accuracy, ResNet-50 or ResNet-101 are great choices.
If you want an efficient network (real-time, run on mobile, etc.), try MobileNets, ShuffleNets, and NSGANets.