CNN Architectures


CSE 891: Deep Learning

Vishnu Boddeti

Wednesday September 30, 2020

Last Time: CNNs

Convolutional Layers
Pooling Layers
Activation Function

Today

  • Convolutional Layer Variations
  • CNN Architectures

Convolutional Layer Variants

Transposed Convolutions

  • Upsampling and convolution in a single operation.
  • Useful for dense prediction tasks (e.g., segmentation, pose estimation); see the sketch below.
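
A minimal PyTorch sketch (not from the slides; the channel counts and sizes are arbitrary) of upsampling a feature map with a transposed convolution:

```python
# Sketch: a transposed convolution that doubles spatial resolution while convolving.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)             # N x C x H x W feature map

# kernel_size=4, stride=2, padding=1 doubles the spatial resolution
up = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                        kernel_size=4, stride=2, padding=1)
y = up(x)
print(y.shape)                              # torch.Size([1, 32, 32, 32])
```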

Dilated Convolutions

  • Increase receptive field of convolutional filters.
  • No increase in the number of parameters to learn.
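
A short sketch (with illustrative sizes) showing that dilation grows the receptive field of a $3\times 3$ filter while the parameter count stays fixed:

```python
# Sketch: same 3x3 kernel, larger receptive field as dilation increases.
import torch.nn as nn

conv_d1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)  # 3x3 receptive field
conv_d2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 5x5 receptive field
conv_d4 = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)  # 9x9 receptive field

n_params = lambda m: sum(p.numel() for p in m.parameters())
assert n_params(conv_d1) == n_params(conv_d2) == n_params(conv_d4)  # all 36,928
```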

Binary Convolutions

  • Convolution is (typically) the costliest operation in CNNs.
  • Improve the efficiency of each convolution operation.
    • Inference: binarization, quantization, sparsification.
    • Learning: quantization, sparsification, randomization (unoptimized).
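
A hedged sketch of weight binarization in the spirit of XNOR-Net-style methods; the per-channel scaling and shapes below are illustrative, not the exact scheme of any particular paper:

```python
# Sketch: binarize full-precision "latent" weights with sign() plus a scale factor.
import torch
import torch.nn.functional as F

def binary_conv2d(x, weight, stride=1, padding=1):
    # Per-output-channel scale: mean absolute value of the real-valued weights.
    alpha = weight.abs().mean(dim=(1, 2, 3), keepdim=True)
    w_bin = torch.sign(weight) * alpha        # weights restricted to {-alpha, +alpha}
    return F.conv2d(x, w_bin, stride=stride, padding=padding)

x = torch.randn(1, 64, 32, 32)
w = torch.randn(128, 64, 3, 3)                # full-precision latent weights
y = binary_conv2d(x, w)                       # same output shape as a normal conv
```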

CNN Architectures

ImageNet Classification Challenge

AlexNet

  • 5 convolutional layers, 3 fully connected layers
  • 60 million parameters
  • GPU-based training
  • Established that deep learning works for computer vision!!
  • Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", NeurIPS 2012

VGG: Deeper Networks

  • 19 layers deep, 3 fully connected layers
  • 144 million parameters
  • $3 \times 3$ convolutional filters with stride 1
  • $2 \times 2$ max-pooling layers with stride 2
  • Established that smaller filters (more parameter-efficient) and deeper networks are better!!
  • Two stacked $3\times 3$ conv layers have the same receptive field as a single $5\times 5$ layer, but with fewer parameters and less computation (see the comparison below).
  • Simonyan et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition", ICLR 2015
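
A quick check of this claim (the channel count $C$ is arbitrary here), counting parameters of one $5\times 5$ conv versus two stacked $3\times 3$ convs:

```python
# Sketch: one 5x5 conv vs. two 3x3 convs, same 5x5 receptive field, fewer parameters.
import torch.nn as nn

C = 64
one_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)
two_3x3 = nn.Sequential(nn.Conv2d(C, C, 3, padding=1),
                        nn.Conv2d(C, C, 3, padding=1))

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(one_5x5))   # 25*C*C + C     = 102,464
print(n_params(two_3x3))   # 2*(9*C*C + C)  =  73,856
```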

GoogLeNet: Focus on Efficiency

  • 22 layers, introduced the "inception" module
  • Efficient architecture in terms of computation
  • Compact model (about 5 million parameters)
  • Computational budget: 1.5 billion multiply-adds
  • Szegedy et al., "Going deeper with convolutions", CVPR 2015

GoogLeNet: Inception Module

  • Inception module: local unit with parallel branches
  • Local structure repeated many times throughout the network
  • Use $1\times 1$ "Bottleneck" layers to reduce channel dimension before expensive conv (we will revisit this with ResNet!)
  • Szegedy et al., "Going deeper with convolutions", CVPR 2015
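
A minimal sketch of an Inception-style module with $1\times 1$ bottlenecks before the expensive $3\times 3$ and $5\times 5$ convolutions; the branch widths below are illustrative, and BatchNorm/ReLU are omitted for brevity:

```python
# Sketch: four parallel branches whose outputs are concatenated along channels.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                        # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1),          # 1x1 bottleneck
                                nn.Conv2d(96, 128, 3, padding=1))  # then 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),           # 1x1 bottleneck
                                nn.Conv2d(16, 32, 5, padding=2))   # then 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))           # pool projection

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

y = InceptionModule(192)(torch.randn(1, 192, 28, 28))   # -> (1, 256, 28, 28)
```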

GoogLeNet: Auxiliary Classifiers

  • Training using only a loss at the end of the network didn't work well: the network is too deep, and gradients don't propagate cleanly
  • As a hack, attach "auxiliary classifiers" at several intermediate points in the network; these also try to classify the image and receive their own loss (sketched below)
  • GoogLeNet predates batch normalization! With BatchNorm this trick is no longer needed
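
A hedged sketch of the auxiliary-classifier trick; the head design, the stage names, and the 0.3 loss weights below are assumptions for illustration, not GoogLeNet's exact heads:

```python
# Sketch: small classifiers attached to intermediate features, with losses summed.
import torch.nn as nn
import torch.nn.functional as F

class AuxHead(nn.Module):
    """Small classifier attached to an intermediate feature map."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, feats):
        return self.fc(self.pool(feats).flatten(1))

# Training-loop fragment (backbone split into three assumed stages):
#   f1 = stage1(images); f2 = stage2(f1); logits = stage3(f2)
#   loss = F.cross_entropy(logits, labels) \
#          + 0.3 * F.cross_entropy(aux1(f1), labels) \
#          + 0.3 * F.cross_entropy(aux2(f2), labels)
```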

Deeper Networks

Stack More Layers?

  • Once we have Batch Normalization, we can train networks with 10+ layers. What happens as we go deeper?
  • Deeper model does worse than shallow model!
  • Initial guess: the deep model is overfitting since it is much bigger than the shallow model
  • In fact the deep model is underfitting: it also performs worse than the shallow model on the training set!

Training Deeper Networks

  • A deeper model can emulate a shallower model: copy layers from shallower model, set extra layers to identity
  • Thus deeper models should do at least as well as shallow models
  • Hypothesis: This is an optimization problem. Deeper models are harder to optimize, and in particular don’t learn identity functions to emulate shallow models
  • Solution: Change the network so learning identity functions with extra layers is easy!

Residual Units

  • Solution: Change the network so learning identity functions with extra layers is easy!
  • Standard Block

Identity Shortcuts

  • Forward Pass:
  • $$ \begin{eqnarray} \mathbf{y}_k &=& h(\mathbf{x}_k) + \mathcal{F}(\mathbf{x}_k,\mathbf{W}_k) \\ \mathbf{x}_{k+1} &=& f(\mathbf{y}_k) \end{eqnarray} $$
  • Backward Pass (with identity shortcuts, i.e., $h(\mathbf{x}_k)=\mathbf{x}_k$ and $f$ the identity):
  • $$ \frac{\partial L}{\partial \mathbf{x}_k} = \frac{\partial L}{\partial \mathbf{x}_L}\left(1+\frac{\partial}{\partial \mathbf{x}_k}\sum_{i=k}^{L-1}\mathcal{F}(\mathbf{x}_i,\mathbf{W}_i)\right) $$

Residual Networks

  • A residual network is a stack of many residual blocks
  • Regular design, like VGG: each residual block has two $3 \times 3$ conv layers
  • Network is divided into stages: the first block of each stage halves the resolution (with stride-2 conv) and doubles the number of channels

Residual Blocks

Residual Block: Basic
  • Two $3\times 3$ conv layers on $C$ channels
  • Total FLOPs: $18HWC^2$
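
A sketch of the basic residual block in PyTorch, matching the FLOP count above (two $3\times 3$ convs on $C$ channels plus an identity shortcut):

```python
# Sketch: basic residual block. Each 3x3 conv does 9*C*C multiply-adds per spatial
# location, so two of them give 2 * 9HWC^2 = 18HWC^2.
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, C):
        super().__init__()
        self.conv1 = nn.Conv2d(C, C, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(C)
        self.conv2 = nn.Conv2d(C, C, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(C)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)            # identity shortcut
```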

How Does It Do?

  • Able to train very deep networks
  • Deeper networks do better than shallow networks (as expected)
  • Swept 1st place in all ILSVRC and COCO 2015 competitions
  • Still widely used today!

Improving Residual Block Design

  • Note: with a ReLU after the residual addition, the block cannot actually learn the identity function, since its outputs are nonnegative
  • Moving the ReLU inside the residual branch lets the block learn a true identity function by setting the conv weights to zero (sketched below)
  • Slight improvement in accuracy
    • ResNet-152: 21.3 vs 21.1
    • ResNet-200: 21.8 vs 20.7
  • Not actually used that much in practice
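
A short sketch of this pre-activation ordering, where BatchNorm and ReLU move inside the residual branch so the block can represent an exact identity:

```python
# Sketch: pre-activation residual block (BN -> ReLU -> conv, no ReLU after the add).
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    def __init__(self, C):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(C)
        self.conv1 = nn.Conv2d(C, C, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(C)
        self.conv2 = nn.Conv2d(C, C, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return out + x                    # no ReLU after the addition
```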

Comparing Complexity

  • Canziani et al., "An analysis of deep neural network models for practical applications", 2017

Model Ensembles

  • Multi-scale ensemble of Inception, Inception-ResNet, ResNet, and Wide ResNet models

    Model:    Inception-v3  Inception-v4  Inception-ResNet-v2  ResNet-200  WRN-68-3  Fusion (Val)  Fusion (Test)
    Err (%):      4.20          4.01              3.52             4.26       4.65    2.92 (-0.6)       2.99

  • Shao et al., 2016

Improving ResNets: ResNeXt

Residual Block: Bottleneck
  • $1\times 1$ conv ($4C \to C$), $3\times 3$ conv ($C \to C$), $1\times 1$ conv ($C \to 4C$)
  • Total FLOPs: $17HWC^2$
  • Xie et al., "Aggregated Residual Transformations for Deep Neural Networks", CVPR 2017

Grouped Convolution

ResNeXt Block: Parallel Pathways
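
A sketch of a ResNeXt-style bottleneck where the parallel pathways are implemented as a single grouped $3\times 3$ convolution; the width and group count below are illustrative:

```python
# Sketch: ResNeXt-style block with a grouped 3x3 conv between 1x1 reduce/expand.
import torch.nn as nn
import torch.nn.functional as F

class ResNeXtBlock(nn.Module):
    def __init__(self, C, groups=32, width=128):
        super().__init__()
        self.conv1 = nn.Conv2d(C, width, 1, bias=False)              # 1x1 reduce
        self.conv2 = nn.Conv2d(width, width, 3, padding=1,
                               groups=groups, bias=False)            # grouped 3x3
        self.conv3 = nn.Conv2d(width, C, 1, bias=False)              # 1x1 expand
        self.bn1 = nn.BatchNorm2d(width)
        self.bn2 = nn.BatchNorm2d(width)
        self.bn3 = nn.BatchNorm2d(C)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        return F.relu(self.bn3(self.conv3(out)) + x)                 # identity shortcut
```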

Squeeze-and-Excitation Networks

  • Adds a "Squeeze-and-excite" branch to each residual block that performs global pooling, full-connected layers, and multiplies back onto feature map
  • Adds global context to each residual block!
  • Hu et.al. "Squeeze-and-Excitation Networks" CVPR 2018
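
A sketch of the squeeze-and-excite branch (the reduction ratio of 16 follows the paper; the rest is a minimal illustration):

```python
# Sketch: global pooling -> small MLP -> per-channel sigmoid gate -> reweight features.
import torch.nn as nn

class SEBranch(nn.Module):
    def __init__(self, C, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # squeeze: N x C x 1 x 1
        self.fc = nn.Sequential(nn.Linear(C, C // reduction),
                                nn.ReLU(inplace=True),
                                nn.Linear(C // reduction, C),
                                nn.Sigmoid())                 # excite: per-channel gate

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                          # multiply back onto features
```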

Dense Connections

  • Dense blocks where each layer is connected to every other layer in a feed-forward fashion.
  • Alleviates vanishing gradients, strengthens feature propagation, encourages feature reuse.
  • Huang et al., "Densely Connected Convolutional Networks", CVPR 2017
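
A minimal sketch of a dense block; the growth rate and depth below are illustrative, and the BatchNorm/bottleneck layers of the full DenseNet design are omitted:

```python
# Sketch: each layer takes the concatenation of all previous feature maps as input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_ch + i * growth, growth, 3, padding=1)
            for i in range(n_layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(F.relu(layer(torch.cat(feats, dim=1))))
        return torch.cat(feats, dim=1)       # channels: in_ch + n_layers * growth
```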

MobileNets: Tiny Networks

Standard Convolution Block
  • Total cost: $9C^2HW$
Depthwise Separable Block
  • Depthwise $3\times 3$ conv: $9CHW$; pointwise $1\times 1$ conv: $C^2HW$; total cost: $(9C + C^2)HW$
$$ \mbox{Speedup } = \frac{9C^2HW}{(9C + C^2)HW} = \frac{9C}{9 + C} \approx 9 \mbox{ for large } C $$
  • Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", arXiv 2017
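
A sketch of the depthwise-separable block that replaces a standard $3\times 3$ convolution (assuming equal input and output channel counts $C$):

```python
# Sketch: depthwise 3x3 conv (groups=C, cost 9CHW) + pointwise 1x1 conv (cost C^2 HW),
# versus 9 C^2 HW for a standard 3x3 conv -> speedup ~ 9 for large C.
import torch.nn as nn

def separable_block(C):
    return nn.Sequential(
        nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),   # depthwise
        nn.BatchNorm2d(C), nn.ReLU(inplace=True),
        nn.Conv2d(C, C, 1, bias=False),                         # pointwise
        nn.BatchNorm2d(C), nn.ReLU(inplace=True))
```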

Neural Architecture Search

  • Designing neural network architectures is hard, so let us automate it.
    • Map neural networks to hyperparameters.
    • Define a search space over hyperparameters.
    • Search for the networks with the best hyperparameters.
    • Search Methods:
      • Reinforcement Learning
      • Evolutionary Algorithms
      • Gradient descent on continuous relaxation
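
A toy illustration of the search-space idea, using random sampling as a stand-in for the search methods above; the hyperparameter names and choices below are made up:

```python
# Sketch: encode an architecture as a few hyperparameters and sample the search space.
# A real NAS system would train/evaluate each candidate and drive the search with
# reinforcement learning, evolution, or a gradient-based relaxation.
import random

SEARCH_SPACE = {
    "depth":       [10, 18, 34, 50],
    "width":       [16, 32, 64],
    "kernel_size": [3, 5, 7],
    "use_se":      [False, True],
}

def sample_architecture():
    return {name: random.choice(choices) for name, choices in SEARCH_SPACE.items()}

candidates = [sample_architecture() for _ in range(10)]
# best = max(candidates, key=evaluate)   # evaluate() would train and score a model
```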

Neural Architecture Search

  • Lu et.al. "NSGA-NET: Neural Architecture Search using Multi-Objective Genetic Algorithm" GECCO 2019
  • Lu et.al. "MUXConv: Information Multiplexing in Convolutional Neural Networks" CVPR 2020
  • Lu et.al. "NSGANetV2:Evolutionary Multi-Objective Surrogate-Assisted Neural Architecture Search" ECCV 2020
  • Lu et.al. "Multi-Objective Evolutionary Design of Deep Convolutional Neural Networks for Image Classification" IEEE TEVC 2020
  • Lu et.al. "Neural Architecture Transfer" Arxiv 2020

Neural Architecture Search

  • Lu et.al. "Neural Architecture Transfer" Arxiv 2020

CNN Progression

  • Lu et.al. "MUXConv: Information Multiplexing in Convolutional Neural Networks" CVPR 2020

CNN Architectures Summary

  • Early work (AlexNet$\rightarrow$ZFNet$\rightarrow$VGG) shows that bigger networks work better
  • GoogLeNet one of the first to focus on efficiency (aggressive stem, 1x1 bottleneck convolutions, global avg pool instead of FC layers)
  • ResNet showed us how to train extremely deep networks – limited only by GPU memory. Started to show diminishing returns as networks got bigger
  • After ResNet: Efficient networks became central: how can we improve the accuracy without increasing the complexity?
  • Lots of tiny networks aimed at mobile devices: MobileNet, ShuffleNet, MUXNet, etc.
  • Neural Architecture Search promises to automate architecture design

Which Architecture should I Use?

  • Don't be a hero. For most problems you should use an off-the-shelf architecture; don't try to design your own!!
  • If you just care about accuracy, ResNet-50 or ResNet-101 are great choices.
  • If you want an efficient network (real-time, run on mobile, etc.) try MobileNets, ShuffleNets and NSGANets