Self-Supervised Learning


CSE 891: Deep Learning

Vishnu Boddeti

Success story of supervision: Pre-training

  • Features from networks pre-trained on ImageNet can be used for a variety of downstream tasks
    • Pre-train on a large supervised dataset: collect labeled ("supervised") images and train a ConvNet on them
    • Transfer the learned features to downstream tasks

The promise of "alternative" supervision

  • Getting "real" labels is difficult and expensive
    • ImageNet with 14M images took 22 human years.
  • Obtain labels using a "semi-automatic" process
    • Hashtags
    • GPS locations
    • Using the data itself: "self"-supervised

Can we get labels for all data?

  • Datasets we have:
    • Bounding Boxes: $10^6$
    • Image Level: $10^7$
    • Internet Photos: $10^{13}$
    • Real World: $10^{20}$?
  • What about complex concepts?
    • Video?
  • Labelling cannot scale to the size of the data we generate

"Self"-Supervision

  • Key Idea:
    • Obtain "labels" from the data itself by using a "semi-automatic" process
    • Predict part of the data from other parts
      • Observed Data $\rightarrow$ Hidden data
      • Observed Data $\rightarrow$ Hidden property of data
  • Why self-supervision?
    • Helps us learn using observations and interactions
    • Does not require exhaustive annotation of concepts
    • Leverage multiple modalities or structure in the domain

Pretext Task

  • Self-supervised task used for learning representations
  • Often, not the "real" task (like image classification) we care about
  • What kind of pretext tasks?
    • Using images
    • Using video
    • Using video and sound
    • $\dots$
  • Doersch et al., Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015

Images: Relative Position of Patches

  • Doersch et al., Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015
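
A minimal PyTorch sketch of the idea (not the paper's architecture): crop a central patch and one of its 8 neighbors, and train a shared-weight trunk plus a small head to predict the neighbor's relative position. The patch size, the tiny trunk, and the fixed center location are illustrative assumptions.

```python
# Context-prediction pretext task: classify the relative position of a neighbor patch.
import torch
import torch.nn as nn

def sample_patch_pair(img, patch=32):
    """img: (3, H, W). Pick a center patch and one of its 8 neighbors; label = neighbor index."""
    _, H, W = img.shape
    cy, cx = H // 2, W // 2                      # fixed center location, for brevity
    offsets = [(-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1)]
    label = torch.randint(len(offsets), (1,)).item()
    dy, dx = offsets[label]
    center   = img[:, cy:cy+patch, cx:cx+patch]
    neighbor = img[:, cy+dy*patch:cy+(dy+1)*patch, cx+dx*patch:cx+(dx+1)*patch]
    return center, neighbor, label

trunk = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())   # shared between both patches
head = nn.Linear(64, 8)                                        # 8 possible relative positions

img = torch.rand(3, 224, 224)                                  # stand-in for an unlabeled image
c, n, y = sample_patch_pair(img)
logits = head(torch.cat([trunk(c.unsqueeze(0)), trunk(n.unsqueeze(0))], dim=1))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([y]))
loss.backward()
```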

Images: Relative Position: Nearest Neighbors in features

  • Doersch et al., Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015

Images: Predicting Rotations

  • Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations, ICLR 2018
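
A minimal sketch of the rotation pretext task: rotate each unlabeled image by 0/90/180/270 degrees and train a 4-way classifier to predict the rotation. The tiny ConvNet and image sizes below are placeholders, not the setup from the paper.

```python
# Rotation-prediction pretext task: the rotation applied to the image is the "label".
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """images: (B, 3, H, W) -> rotated copies and their rotation labels in {0, 1, 2, 3}."""
    rotated, labels = [], []
    for k in range(4):                                   # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 4),                                    # 4-way rotation classification
)

images = torch.rand(8, 3, 64, 64)                        # stand-in for an unlabeled batch
x, y = make_rotation_batch(images)
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
```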

Image: Colorization

  • Zhang et al., Colorful Image Colorization, ECCV 2016
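
A much-simplified sketch of colorization as self-supervision. Zhang et al. work in Lab space and treat colorization as classification over quantized ab bins; purely for illustration, the code below regresses RGB values from a grayscale input with an L2 loss.

```python
# Colorization pretext task (simplified): the image's own colors are the "label".
import torch
import torch.nn as nn

to_gray = lambda rgb: rgb.mean(dim=1, keepdim=True)      # crude grayscale conversion

colorizer = nn.Sequential(                               # tiny encoder-decoder placeholder
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),        # predict RGB in [0, 1]
)

rgb = torch.rand(4, 3, 64, 64)                           # unlabeled color images
pred = colorizer(to_gray(rgb))                           # predict color from grayscale
loss = nn.functional.mse_loss(pred, rgb)
loss.backward()
```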

Image: Fill in the Blanks

  • Pathak et al., Context Encoders: Feature Learning by Inpainting, CVPR 2016
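
A minimal sketch of inpainting as self-supervision: hide a central region and train the network to reconstruct it from the surrounding context. The mask size and tiny network are illustrative; the original Context Encoders also add an adversarial loss, omitted here.

```python
# Inpainting pretext task: predict the hidden region from the observed context.
import torch
import torch.nn as nn

def mask_center(images, size=16):
    masked = images.clone()
    _, _, H, W = images.shape
    y0, x0 = (H - size) // 2, (W - size) // 2
    masked[:, :, y0:y0+size, x0:x0+size] = 0.0           # drop the central region
    return masked, (y0, x0, size)

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

images = torch.rand(4, 3, 64, 64)
masked, (y0, x0, s) = mask_center(images)
recon = net(masked)
# Reconstruction loss only on the hidden region ("observed data -> hidden data")
loss = nn.functional.mse_loss(recon[:, :, y0:y0+s, x0:x0+s],
                              images[:, :, y0:y0+s, x0:x0+s])
loss.backward()
```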

Video

  • Video is a "sequence" of frames
  • How to get "self-supervision"?
    • Predict order of frames
    • Fill in the blanks
    • Track objects and predict their position

Videos: Shuffle and Learn

  • Misra et al., Shuffle and Learn: Unsupervised Learning using Temporal Order Verification, ECCV 2016
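
A minimal sketch of temporal order verification in this spirit: sample a frame triplet from an unlabeled clip, keep or shuffle its order, and train a binary classifier to tell the two apart. The frame encoder and the sampling scheme are simplified assumptions, not the paper's.

```python
# Temporal order verification: is this frame triplet in the correct order?
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())        # per-frame embedding
classifier = nn.Linear(3 * 16, 2)                                     # in-order vs shuffled

def make_example(clip):
    """clip: (T, 3, H, W) frames. Returns a frame triplet and a 0/1 order label."""
    T = clip.size(0)
    idx = torch.sort(torch.randperm(T)[:3]).values                    # temporally ordered triplet
    label = torch.randint(2, (1,)).item()
    if label == 0:                                                     # 0 = shuffled
        idx = idx[torch.randperm(3)]    # (may occasionally stay ordered; ignored in this sketch)
    return clip[idx], label

clip = torch.rand(16, 3, 64, 64)                                       # unlabeled video clip
frames, y = make_example(clip)
feats = torch.cat([encoder(f.unsqueeze(0)) for f in frames], dim=1)    # concat frame embeddings
loss = nn.CrossEntropyLoss()(classifier(feats), torch.tensor([y]))
loss.backward()
```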

Videos: Shuffle and Learn

  • Misra et al., Shuffle and Learn: Unsupervised Learning using Temporal Order Verification, ECCV 2016

Videos: Odd-one-Out Networks

  • Fernando et al., Self-Supervised Video Representation Learning With Odd-One-Out Networks, CVPR 2017

Audio-Video Co-Supervision

  • Train a network to predict if "image" and "audio clip" correspond
  • Arandjelović, Objects that Sound, ECCV 2018
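
A minimal sketch of this correspondence task, assuming precomputed log-mel spectrograms: an image tower and an audio tower produce embeddings, and a small head classifies whether the pair came from the same video. Both towers and all shapes are illustrative placeholders.

```python
# Audio-visual correspondence: do this image and this audio clip belong together?
import torch
import torch.nn as nn

image_tower = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten())     # (B, 16)
audio_tower = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten())     # (B, 16)
head = nn.Linear(32, 2)                                                 # correspond? yes / no

images = torch.rand(8, 3, 64, 64)           # frames from 8 videos
spects = torch.rand(8, 1, 64, 128)          # spectrograms from the same 8 videos

# Positives: aligned pairs. Negatives: pair each image with audio from another video.
pos = head(torch.cat([image_tower(images), audio_tower(spects)], dim=1))
neg = head(torch.cat([image_tower(images), audio_tower(spects.roll(1, dims=0))], dim=1))
logits = torch.cat([pos, neg])
labels = torch.cat([torch.ones(8, dtype=torch.long), torch.zeros(8, dtype=torch.long)])
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()
```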

Audio-Video Co-Supervision

  • Train a network to predict if "image" and "audio clip" correspond
  • Arandjelović, Objects that Sound, ECCV 2018

Audio-Video Co-Supervision

  • What would make this sound?
    • No video (motion) information is used.
  • Arandjelović, Objects that Sound, ECCV 2018

Information predicted: varies across tasks

Scaling Self-Supervised Learning

  • Doersch et al., Multi-task Self-Supervised Visual Learning, ICCV 2017

Scaling Self-Supervised Learning

  • Jigsaw pretext task: use $N=9$ patches per image
  • In practice, use only a subset of the $9! = 362{,}880$ possible permutations (e.g., 100)
  • Each patch is processed independently by an $N$-way ConvNet with shared parameters
  • Problem complexity is controlled by the size of the permutation subset (see the sketch below)
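
A minimal sketch of the setup above. The 100-permutation subset is sampled randomly here purely for illustration (in practice the subset is chosen to be maximally distinct), and the shared trunk, patch sizes, and image size are toy stand-ins.

```python
# Jigsaw pretext task: shuffle 3x3 patches with a known permutation, predict which one.
import torch
import torch.nn as nn

N, NUM_PERMS, PATCH = 9, 100, 21                              # 3x3 grid of 21x21 patches
perms = torch.stack([torch.randperm(N) for _ in range(NUM_PERMS)])    # fixed subset of 9! orders

trunk = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())          # shared across patches
head = nn.Linear(N * 16, NUM_PERMS)                                   # which permutation was used?

def to_patches(img):                                                  # img: (3, 63, 63)
    tiles = [img[:, r*PATCH:(r+1)*PATCH, c*PATCH:(c+1)*PATCH]
             for r in range(3) for c in range(3)]
    return torch.stack(tiles)                                         # (9, 3, 21, 21)

img = torch.rand(3, 63, 63)
label = torch.randint(NUM_PERMS, (1,))
shuffled = to_patches(img)[perms[label.item()]]                       # apply the chosen permutation
feats = torch.cat([trunk(p.unsqueeze(0)) for p in shuffled], dim=1)   # process patches independently
loss = nn.CrossEntropyLoss()(head(feats), label)
loss.backward()
```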

Evaluation - fine-tuning vs. linear classifier

  • A good representation transfers with little training
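
A minimal sketch of the two protocols, assuming some pre-trained backbone; a torchvision ResNet-18 stands in for the SSL-pre-trained encoder, and `num_classes` for the downstream task.

```python
# Linear evaluation vs. fine-tuning of a pre-trained backbone.
import torch.nn as nn
from torchvision.models import resnet18

num_classes = 20
backbone = resnet18()                        # stand-in for an SSL-pre-trained encoder
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)   # new head for the downstream task

# (a) Linear evaluation: freeze all pre-trained weights, train only the new linear head.
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith("fc.")

# (b) Fine-tuning: train everything end-to-end (typically with a smaller learning rate).
# for p in backbone.parameters():
#     p.requires_grad = True
```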

Evaluation - Many Tasks

Evaluation: Image Classification

  • Extract "fixed" features
  • Train a Linear SVM on fixed features
  • Use VOC 2007 image classification tasks
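
A minimal sketch of this protocol, with random tensors standing in for VOC images and one class's present/absent labels; the frozen encoder is a toy placeholder.

```python
# Linear SVM on fixed (frozen) features.
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())   # frozen feature extractor
encoder.eval()

images = torch.rand(100, 3, 64, 64)                # stand-in for VOC images
labels = torch.randint(2, (100,)).numpy()          # stand-in for one VOC class (present/absent)

with torch.no_grad():                              # "fixed" features: no gradient updates
    feats = encoder(images).numpy()

clf = LinearSVC().fit(feats, labels)               # in practice: one linear SVM per VOC class
print("train accuracy:", clf.score(feats, labels))
```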

Evaluation: Object Detection

Initialization      | Train set: PASCAL VOC 2007 | Train set: PASCAL VOC 2007+2012
--------------------|----------------------------|--------------------------------
ImageNet Supervised | 70.5                       | 76.2
Jigsaw ImageNet 14M | 69.2                       | 75.4

(Numbers are detection mAP.)

Evaluation: Surface Normal Estimation

Initialization      | Median Error (degrees) | % correct within $11.25^\circ$
--------------------|------------------------|-------------------------------
ImageNet Supervised | 17.1                   | 36.1
Jigsaw Flickr 100M  | 13.1                   | 44.6

Evaluation: Few-Shot Learning

What should pre-trained features learn?

  • Represent how images relate to one another
  • Be robust to "nuisance factors" -- Invariance
    • e.g., exact location of objects, lighting, exact color
  • Clustering and Contrastive Learning are two ways to achieve the above.

Clustering

  • Boosting Knowledge (Noroozi et al., 2018); DeepCluster (Caron et al., 2018); DeeperCluster (Caron et al., 2019); ClusterFit (Yan et al., 2020)
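
A minimal sketch of the alternating recipe these methods share (DeepCluster-style): cluster the current features with k-means, treat cluster assignments as pseudo-labels, train the network to predict them, and repeat. The encoder, image sizes, and number of clusters are illustrative.

```python
# Clustering-based SSL: k-means cluster assignments become pseudo-labels.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 10                                                        # number of clusters
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
classifier = nn.Linear(32, K)

images = torch.rand(256, 3, 32, 32)                           # unlabeled images

# Step 1: cluster current features -> pseudo-labels
with torch.no_grad():
    feats = encoder(images).numpy()
pseudo_labels = torch.as_tensor(KMeans(n_clusters=K, n_init=10).fit_predict(feats),
                                dtype=torch.long)

# Step 2: train encoder + classifier to predict the pseudo-labels (then repeat from Step 1)
loss = nn.CrossEntropyLoss()(classifier(encoder(images)), pseudo_labels)
loss.backward()
```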

Evaluation: Synthetic Noise

Evaluation: Self-Supervised Images

Which SSL method is the best?

  • Supervised methods still outperform self-supervised methods.

Self-Supervised Learning for Natural Language

  • Transformer-based language models are typically learned through self-supervision.
  • Can scale to very large datasets, and give extremely powerful features that transfer to downstream tasks.
  • Very successful: the dream of SSL made real. Larger models and larger datasets give better features, improving performance on many downstream NLP tasks.
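
A minimal sketch of the objective underlying most of these models: next-token prediction, where the "labels" are simply the input sequence shifted by one position. A small LSTM stands in for the Transformer purely to keep the example short.

```python
# Self-supervision for language: predict the next token from the preceding ones.
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64
embed = nn.Embedding(vocab_size, dim)
seq_model = nn.LSTM(dim, dim, batch_first=True)   # placeholder for a Transformer
to_logits = nn.Linear(dim, vocab_size)

tokens = torch.randint(vocab_size, (8, 33))       # a batch of unlabeled token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # target at position t is the token at t+1

hidden, _ = seq_model(embed(inputs))
loss = nn.CrossEntropyLoss()(to_logits(hidden).reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```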

Contrastive Learning

  • Hadsell et al., Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006

Contrastive Learning for Self-Supervised Learning

  • How to define what images are "related" and "unrelated"?
    • Nearby patches vs. distant patches of an Image

Contrastive Learning for Self-Supervised Learning

  • How to define what images are "related" and "unrelated"?
    • Patches of an image vs. patches of other images

Contrastive Learning for Self-Supervised Learning

  • How to define what images are "related" and "unrelated"?
    • Data Augmentations of each patch
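
A minimal sketch of a SimCLR-style contrastive (InfoNCE) loss using the last option: two augmentations of the same image are "related" (a positive pair), everything else in the batch is "unrelated" (negatives). The encoder, the noise-based augmentation, and the temperature are illustrative placeholders.

```python
# Contrastive learning over augmented views with an InfoNCE-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())

def augment(x):                                    # stand-in for random crops, color jitter, ...
    return x + 0.1 * torch.randn_like(x)

images = torch.rand(16, 3, 32, 32)
z1 = F.normalize(encoder(augment(images)), dim=1)  # view-1 embeddings, unit norm
z2 = F.normalize(encoder(augment(images)), dim=1)  # view-2 embeddings

tau = 0.1                                          # temperature
logits = z1 @ z2.t() / tau                         # similarity of every view-1 to every view-2
labels = torch.arange(len(images))                 # the positive for image i is its other view
loss = F.cross_entropy(logits, labels)             # pull positives together, push negatives apart
loss.backward()
```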

Pretext-Invariant Representation Learning (PIRL)

Barlow Twins

Contrastive Learning Gives Huge Improvements

Contrastive Learning Gives Huge Improvements

Masked Autoencoders (MAE)

  • A denoising autoencoder built on a Vision Transformer: mask a large fraction of image patches and reconstruct the missing ones
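
A minimal sketch of the masking idea (a heavy simplification of MAE): patchify the image, keep a random 25% of the patches, encode only those, and reconstruct the hidden patches with an L2 loss. The MLP encoder/decoder and the crude decoding step stand in for the ViT machinery.

```python
# MAE-style masking: reconstruct hidden patches from a small set of visible ones.
import torch
import torch.nn as nn

patch, keep_ratio = 8, 0.25
img = torch.rand(3, 32, 32)

# Patchify into (num_patches, patch_dim)
patches = img.unfold(1, patch, patch).unfold(2, patch, patch)          # (3, 4, 4, 8, 8)
patches = patches.permute(1, 2, 0, 3, 4).reshape(16, -1)               # (16, 192)

num_keep = int(keep_ratio * patches.size(0))
perm = torch.randperm(patches.size(0))
visible, masked = perm[:num_keep], perm[num_keep:]

encoder = nn.Sequential(nn.Linear(patches.size(1), 64), nn.ReLU())     # sees visible patches only
decoder = nn.Linear(64, patches.size(1))                               # predicts pixel values

latent = encoder(patches[visible])
# A real MAE decodes mask tokens with positional information; here we simply average the
# visible latents and predict every masked patch from that summary (a gross simplification).
pred = decoder(latent.mean(dim=0, keepdim=True)).expand(len(masked), -1)
loss = nn.functional.mse_loss(pred, patches[masked])
loss.backward()
```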

SSL Pre-Training, then finetuning for ImageNet Classification

Current Status on Image Based SSL

  • The motivation of SSL is scaling to large data that can't be labeled.
  • Most papers pre-train on (unlabeled) ImageNet, then evaluate on ImageNet !!
  • Unlabeled ImageNet is still curated: single object per image, balanced classes.
  • SSL on larger image datasets has not been as successful as it has been in NLP.

Multimodal Self-Supervised Learning

  • Don't learn from isolated images, use images together with some context
    • Video: image together with adjacent video frames
    • Sound: image with audio track from video
    • 3D: image with depth map or point cloud
    • Language: image with natural-language text

Matching Images and Text: CLIP

  • Contrastive Loss: each image predicts which caption matches
  • Large-scale training on 400M (image, text) pairs from the internet.
  • Very strong performance on many downstream vision problems.
  • Performance continues to improve with larger models.
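
A minimal sketch of a CLIP-style symmetric contrastive objective: embed images and their captions, and train so that matching (image, text) pairs are the most similar within the batch. Both encoders are tiny placeholders for CLIP's vision and text Transformers, and the token-id "captions" are illustrative.

```python
# CLIP-style contrastive loss over a batch of (image, caption) pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten())       # (B, 64)
text_encoder = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(1),        # toy bag-of-tokens text model
                             nn.Linear(16 * 64, 64))                       # (B, 64)

images = torch.rand(8, 3, 64, 64)              # batch of images
texts = torch.randint(1000, (8, 16))           # their matching captions, as token ids

img_emb = F.normalize(image_encoder(images), dim=1)
txt_emb = F.normalize(text_encoder(texts), dim=1)

tau = 0.07                                     # temperature
logits = img_emb @ txt_emb.t() / tau           # (8, 8) image-text similarities
labels = torch.arange(8)                       # pair i matches caption i
loss = (F.cross_entropy(logits, labels) +      # each image picks its caption ...
        F.cross_entropy(logits.t(), labels)) / 2   # ... and each caption picks its image
loss.backward()
```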

Summary

  • Self-supervised learning (SSL) aims to scale up to larger datasets without human annotation
  • First train for a pretext task, then transfer to downstream tasks
  • Many pretext tasks: context prediction, jigsaw, colorization, clustering, rotation
  • SSL has been wildly successful for language
  • Intense research on SSL in vision; current best are contrastive, masked autoencoding
  • Multimodal SSL uses images together with additional context
  • Multimodal SSL with vision + language has been very successful; seems very promising !!