CNN Architectures
CSE 891: Deep Learning
Vishnu Boddeti
Wednesday September 29, 2021
Last Time: CNNs
Convolutional Layers
Pooling Layers
Fully-Connected Layers
Activation Function
Normalization
$$\hat{x}^{(k)}=\frac{x^{(k)}-E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$
Today
Convolutional Layer Variations
CNN Architectures
Convolutional Layer Variants
Transposed Convolutions
Upsample and Convolution in a single operation.
Useful for dense prediction tasks (segmentation, pose estimation, etc.).
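A minimal PyTorch sketch of the idea (not from the lecture; channel counts and kernel size are chosen only for illustration): a stride-2 transposed convolution upsamples and convolves in one operation.

```python
import torch
import torch.nn as nn

# Stride-2 transposed convolution: doubles the spatial size while convolving.
# Channel counts and kernel size are illustrative, not from the lecture.
upconv = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                            kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 16, 16)   # N x C x H x W feature map
y = upconv(x)
print(y.shape)                   # torch.Size([1, 32, 32, 32]) -- spatial size doubled
```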
Dilated Convolutions
Increase receptive field of convolutional filters.
No increase in the number of parameters to learn.
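A short sketch of this point in PyTorch (sizes are illustrative): with dilation 2, a $3\times 3$ kernel covers a $5\times 5$ region while keeping the same number of weights.

```python
import torch
import torch.nn as nn

# A 3x3 conv with dilation=2 covers a 5x5 receptive field,
# but still has only 3*3*C_in*C_out weights.
conv = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2, bias=False)  # padding=2 keeps H, W

out = conv(torch.randn(1, 64, 32, 32))
print(out.shape)                                   # torch.Size([1, 64, 32, 32])
print(sum(p.numel() for p in conv.parameters()))   # 36864, same as an undilated 3x3 conv
```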
Binary Convolutions
Convolution is the costliest operation (typically) in CNNs.
Improve efficiency of each convolution operation.
Inference: Binarization, Quantization, Sparsification.
Learning: Quantization, Sparsification, Randomization (unoptimized).
Juefei-Xu et al. "Local Binary Convolutional Neural Networks" CVPR 2017
Juefei-Xu et al. "Perturbative Neural Networks" CVPR 2018
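As a generic illustration of weight binarization (a BinaryConnect/XNOR-Net-style scheme, not the local binary / perturbative method from the cited papers), one can replace full-precision weights with a scaled sign at inference time:

```python
import torch
import torch.nn.functional as F

def binary_conv2d(x, weight, padding=1):
    """Convolve with binarized weights: W ~ alpha * sign(W).
    Illustrative only -- a generic weight-binarization scheme,
    not the LBCNN/PNN method from the cited papers."""
    alpha = weight.abs().mean()            # per-layer scaling factor
    w_bin = alpha * torch.sign(weight)     # weights restricted to {-alpha, +alpha}
    return F.conv2d(x, w_bin, padding=padding)

x = torch.randn(1, 16, 8, 8)
w = torch.randn(32, 16, 3, 3)              # full-precision weights (e.g. from training)
print(binary_conv2d(x, w).shape)           # torch.Size([1, 32, 8, 8])
```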
CNN Architectures
ImageNet Classification Challenge
Primarily driven by skilled practitioners and elaborate design,
a.k.a. "Graduate Student Design"
AlexNet
5 convolutional layers, 3 fully connected layers
60 million parameters
GPU-based training
Established that deep learning works for computer vision!
Krizhevsky et al. "ImageNet Classification with Deep Convolutional Neural Networks" NeurIPS 2012
VGG: Deeper Networks
19 layers deep, 3 fully connected layers
144 million parameters
$3 \times 3$ convolutional filters with stride 1
$2 \times 2$ max-pooling layers with stride 2
Established that smaller filters are more parameter-efficient and that deeper networks perform better!
Two stacked $3\times 3$ conv layers have the same receptive field as a single $5\times 5$ conv layer, but with fewer parameters and less computation (see the comparison below).
Simonyan et al. "Very Deep Convolutional Networks for Large-Scale Image Recognition" ICLR 2015
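A quick check of this claim, assuming $C$ input and $C$ output channels and ignoring biases:

$$ \begin{eqnarray} \mbox{Two stacked } 3\times 3 \mbox{ convs: } && 2\times(3\cdot 3\cdot C\cdot C) = 18C^2 \mbox{ parameters} \\ \mbox{One } 5\times 5 \mbox{ conv: } && 5\cdot 5\cdot C\cdot C = 25C^2 \mbox{ parameters} \end{eqnarray} $$

The compute per output position scales the same way, and the two-layer stack adds an extra nonlinearity in between.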
GoogLeNet: Focus on Efficiency
22 layers, introduced the "inception" module
efficient architecture - in terms of computation
compact model (about 5 million parameters)
computational budget - 1.5 billion multiply-adds
Szegedy et al. "Going deeper with convolutions" CVPR 2015
GoogLeNet: Inception Module
Inception module: local unit with parallel branches
Local structure repeated many times throughout the network
Use $1\times 1$ "Bottleneck" layers to reduce channel dimension before expensive conv (we will revisit this with ResNet!)
Szegedy et al. "Going deeper with convolutions" CVPR 2015
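A minimal sketch of a parallel-branch module with $1\times 1$ bottlenecks (channel counts are illustrative, not GoogLeNet's actual configuration):

```python
import torch
import torch.nn as nn

class InceptionLikeBlock(nn.Module):
    """Parallel branches with 1x1 bottlenecks before the expensive convs.
    Channel sizes are illustrative, not GoogLeNet's exact configuration."""
    def __init__(self, c_in):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, 32, 1)                     # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(c_in, 16, 1),      # 1x1 bottleneck
                                nn.Conv2d(16, 32, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(c_in, 8, 1),       # 1x1 bottleneck
                                nn.Conv2d(8, 16, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, 16, 1))      # pool projection

    def forward(self, x):
        # Concatenate branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

block = InceptionLikeBlock(64)
print(block(torch.randn(1, 64, 28, 28)).shape)   # torch.Size([1, 96, 28, 28])
```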
GoogLeNet: Auxiliary Classifiers
Training using loss at the end of the network didn't work well: Network is too deep, gradients don't propagate cleanly
As a hack, attach "auxiliary classifiers" at several intermediate points in the network that also try to classify the image and receive loss
GoogLeNet predates batch normalization! With BatchNorm, this trick is no longer needed.
Deeper Networks
Stack More Layers?
Once we have Batch Normalization, we can train networks with 10+ layers. What happens as we go deeper?
Deeper model does worse than shallow model!
Initial guess: Deep model is overfitting since it is much bigger than the other model
In fact, the deeper model also performs worse than the shallow model on the training set: it is underfitting, not overfitting.
Training Deeper Networks
A deeper model can emulate a shallower model: copy layers from shallower model, set extra layers to identity
Thus deeper models should do at least as well as shallow models
Hypothesis: This is an optimization problem. Deeper models are harder to optimize, and in particular don’t learn identity functions to emulate shallow models
Solution: Change the network so learning identity functions with extra layers is easy!
Residual Units
Solution: Change the network so learning identity functions with extra layers is easy!
Standard Block
Residual Block
Identity Shortcuts
Forward Pass:
$$ \begin{eqnarray} \mathbf{y}_k &=& h(\mathbf{x}_k) + \mathcal{F}(\mathbf{x}_k,\mathbf{W}_k) \\ \mathbf{x}_{k+1} &=& f(\mathbf{y}_k) \end{eqnarray} $$
Backward Pass (assuming identity mappings $h(\mathbf{x}_k)=\mathbf{x}_k$ and $f(\mathbf{y}_k)=\mathbf{y}_k$, so that $\mathbf{x}_K = \mathbf{x}_k + \sum_{i=k}^{K-1}\mathcal{F}(\mathbf{x}_i,\mathbf{W}_i)$ for any deeper layer $K$):
$$ \begin{eqnarray} \frac{\partial L}{\partial \mathbf{x}_k} = \frac{\partial L}{\partial \mathbf{x}_K}\left(1+\frac{\partial}{\partial \mathbf{x}_k}\sum_{i=k}^{K-1}\mathcal{F}(\mathbf{x}_i,\mathbf{W}_i)\right) \end{eqnarray} $$
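A minimal PyTorch sketch of a basic residual block with an identity shortcut (layer sizes are illustrative; stride and projection shortcuts are omitted for clarity):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = x + F(x): identity shortcut around two 3x3 convs.
    A minimal sketch; stride/projection shortcuts are omitted."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(x + residual)   # identity shortcut + post-activation

block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```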
Residual Networks
A residual network is a stack of many residual blocks
Regular design, like VGG: each residual block has two $3 \times 3$ conv layers
Network is divided into stages: the first block of each stage halves the resolution (with stride-2 conv) and doubles the number of channels
Residual Blocks
Residual Block: Basic. Total FLOPs: $18HWC^2$
Residual Block: Bottleneck. Total FLOPs: $17HWC^2$
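These totals can be recovered by counting one multiply-add per weight per output position over an $H\times W$ feature map, assuming the standard bottleneck design that takes $4C$ input channels, reduces to $C$ with a $1\times 1$ conv, applies a $3\times 3$ conv, and expands back to $4C$:

$$ \begin{eqnarray} \mbox{Basic: } && 2\times(3\cdot 3\cdot C\cdot C)\cdot HW = 18HWC^2 \\ \mbox{Bottleneck: } && (4C\cdot C + 3\cdot 3\cdot C\cdot C + C\cdot 4C)\cdot HW = 17HWC^2 \end{eqnarray} $$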
How Does It Do?
Able to train very deep networks
Deeper networks do better than shallow networks (as expected)
Swept 1st place in all ILSVRC and COCO 2015 competitions
Still widely used today!
MSRA @ILSVRC & COCO 2015 Competitions
1st place in all five main tracks:
ImageNet Classification: 152-layer network
ImageNet Detection: 16% better than 2nd
ImageNet Localization: 27% better than 2nd
COCO Detection: 11% better than 2nd
COCO Segmentation: 12% better than 2nd
Improving Residual Block Design
Note ReLU after residual: cannot actually learn identity function since outputs are nonnegative
Note ReLU inside residual: can learn true identity function by setting Conv weights to zero
Slight improvement in accuracy
ResNet-152: 21.3 vs 21.1
ResNet-200: 21.8 vs 20.7
Not actually used that much in practice
Comparing Complexity
Canziani et al. "An analysis of deep neural network models for practical applications" 2017
Model Ensembles
Multi-scale ensemble of Inception, Inception-Resnet, Resnet, Wide Resnet models
Err (%) by model:
Inception-v3: 4.20
Inception-v4: 4.01
Inception-Resnet-v2: 3.52
Resnet-200: 4.26
Wrn-68-3: 4.65
Fusion (Val): 2.92 (-0.6)
Fusion (Test): 2.99
Shao et al. 2016
Improving ResNets: ResNeXt
Residual Block: Bottleneck. Total FLOPs: $17HWC^2$
ResNeXt Block: Parallel Pathways ($G$ groups of internal width $c$). Total FLOPs: $(8Cc+9c^2)HWG$
Xie et al. "Aggregated residual transformations for deep neural networks" CVPR 2017
Grouped Convolution
ResNeXt Block: Parallel Pathways
Residual Block: Grouped Convolution
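A hedged sketch of the grouped-convolution form of a ResNeXt-style block (channel counts, group count, and group width are illustrative): the grouped $3\times 3$ conv with $G$ groups is equivalent to $G$ parallel pathways of width $c$.

```python
import torch
import torch.nn as nn

class ResNeXtLikeBlock(nn.Module):
    """Bottleneck block whose 3x3 conv is split into G groups,
    equivalent to G parallel pathways of width c. Sizes are illustrative."""
    def __init__(self, channels=256, groups=32, group_width=4):
        super().__init__()
        inner = groups * group_width                          # total width of the grouped conv
        self.reduce = nn.Conv2d(channels, inner, 1, bias=False)
        self.grouped = nn.Conv2d(inner, inner, 3, padding=1,
                                 groups=groups, bias=False)   # G independent pathways
        self.expand = nn.Conv2d(inner, channels, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.grouped(out))
        out = self.expand(out)
        return self.relu(x + out)                             # identity shortcut

block = ResNeXtLikeBlock()
print(block(torch.randn(1, 256, 14, 14)).shape)               # torch.Size([1, 256, 14, 14])
```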
Squeeze-and-Excitation Networks
Adds a "squeeze-and-excite" branch to each residual block that performs global pooling and fully-connected layers, then multiplies the result back onto the feature map.
Adds global context to each residual block!
Hu et al. "Squeeze-and-Excitation Networks" CVPR 2018
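A minimal sketch of the squeeze-and-excite branch (the reduction ratio and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excite: global average pool -> two FC layers -> per-channel
    gates in [0, 1] multiplied back onto the feature map. Sizes illustrative."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                                  # squeeze: global pool -> N x C
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))    # excite: channel gates
        return x * s.view(n, c, 1, 1)                           # rescale each channel

se = SEBlock(64)
print(se(torch.randn(1, 64, 28, 28)).shape)                     # torch.Size([1, 64, 28, 28])
```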
Dense Connections
Dense blocks where each layer is connected to every other layer in a feed-forward fashion.
Alleviates vanishing gradients, strengthens feature propagation, and encourages feature reuse.
Huang et al. "Densely Connected Convolutional Networks" CVPR 2017
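A hedged sketch of a dense block, where each layer takes the concatenation of the block input and all previous layers' outputs (growth rate and depth are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each 3x3 conv sees the concatenation of the block input and all
    previous layers' outputs. Growth rate and depth are illustrative."""
    def __init__(self, c_in, growth=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(c_in + i * growth, growth, 3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(torch.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)             # c_in + num_layers * growth channels

block = DenseBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)         # torch.Size([1, 64, 32, 32])
```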
MobileNets: Tiny Networks
Standard Convolution Block
Total Cost: $9C^2HW$
Depthwise Separable Convolution
Total Cost: $(9C+C^2)HW$
$$ \begin{eqnarray} \mbox{Speedup } = \frac{9C^2HW}{(9C+C^2)HW} = \frac{9C}{9+C} \approx 9 \mbox{ for large } C \end{eqnarray} $$
Howard et al. "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" arXiv 2017
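A minimal sketch of the depthwise separable factorization (a per-channel $3\times 3$ conv via groups=C, followed by a $1\times 1$ pointwise conv; the channel count is illustrative):

```python
import torch
import torch.nn as nn

C = 64
# Standard 3x3 conv: 9*C*C weights.
standard = nn.Conv2d(C, C, 3, padding=1, bias=False)

# Depthwise separable: a per-channel 3x3 conv (groups=C, 9*C weights)
# followed by a 1x1 pointwise conv (C*C weights).
depthwise_separable = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),   # depthwise
    nn.Conv2d(C, C, 1, bias=False),                        # pointwise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise_separable))          # 36864 vs 4672 (~8x fewer)
```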
Neural Architecture Search
Designing neural network architectures is hard, so let us automate it.
Map neural networks to hyperparameters.
Define a search space over hyperparameters.
Search for the networks with the best hyperparameters.
Search Methods:
Reinforcement Learning
Evolutionary Algorithms
Gradient descent on continuous relaxation
Neural Architecture Search
Lu et al. "NSGA-NET: Neural Architecture Search using Multi-Objective Genetic Algorithm" GECCO 2019
Lu et al. "MUXConv: Information Multiplexing in Convolutional Neural Networks" CVPR 2020
Lu et al. "NSGANetV2: Evolutionary Multi-Objective Surrogate-Assisted Neural Architecture Search" ECCV 2020
Lu et al. "Multi-Objective Evolutionary Design of Deep Convolutional Neural Networks for Image Classification" IEEE TEVC 2020
Lu et al. "Neural Architecture Transfer" IEEE TPAMI 2021
Neural Architecture Search
Lu et al. "Neural Architecture Transfer" IEEE TPAMI 2021
CNN Progression
Lu et al. "MUXConv: Information Multiplexing in Convolutional Neural Networks" CVPR 2020
CNN Architectures Summary
Early work (AlexNet $\rightarrow$ ZFNet $\rightarrow$ VGG) shows that bigger networks work better
GoogLeNet was one of the first to focus on efficiency (aggressive stem, 1x1 bottleneck convolutions, global average pooling instead of FC layers)
ResNet showed us how to train extremely deep networks – limited only by GPU memory. Started to show diminishing returns as networks got bigger
After ResNet:
Efficient networks became central: how can we improve the accuracy without increasing the complexity?
Lots of tiny networks aimed at mobile devices: MobileNet, ShuffleNet, MUXNet, etc.
Neural Architecture Search promises to automate architecture design
Which Architecture should I Use?
Don't be a hero.
For most problems you should use an off-the-shelf architecture; don't try to design your own!
If you just care about accuracy, ResNet-50 or ResNet-101 are great choices.
If you want an efficient network (real-time, run on mobile, etc.), try MobileNets, ShuffleNets, and NSGANets.