A good representation transfers with little training
Evaluation: Many Tasks
Evaluation: Image Classification
Extract "fixed" features
Train a Linear SVM on fixed features
Use the PASCAL VOC 2007 image classification task (sketched below)
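A minimal sketch of this linear-evaluation protocol in PyTorch + scikit-learn is shown below. The checkpoint path and `loader` are hypothetical placeholders, and since VOC 2007 classification is multi-label, in practice one binary SVM is trained per class; a single `LinearSVC` is shown for brevity.

```python
import torch
import torchvision
from sklearn.svm import LinearSVC

# Backbone initialized from the pretext task; the checkpoint path is hypothetical.
backbone = torchvision.models.resnet50(weights=None)
backbone.load_state_dict(torch.load("pretext_pretrained.pth"))
backbone.fc = torch.nn.Identity()        # keep pooled features, drop the classifier
backbone.eval()

features, labels = [], []
with torch.no_grad():                    # "fixed" features: no gradient updates
    for images, targets in loader:       # `loader`: hypothetical VOC 2007 DataLoader
        features.append(backbone(images))
        labels.append(targets)

X = torch.cat(features).numpy()
y = torch.cat(labels).numpy()

clf = LinearSVC(C=1.0)                   # linear SVM on the frozen features
clf.fit(X, y)
```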
Evaluation: Object Detection
Detection mAP (higher is better):
Initialization      | Train set: PASCAL VOC 2007 | Train set: PASCAL VOC 2007+2012
ImageNet Supervised | 70.5                       | 76.2
Jigsaw ImageNet 14M | 69.2                       | 75.4
Evaluation: Surface Normal Estimation
Initialization      | Median Error (lower is better) | % correct within $11.25^\circ$ (higher is better)
ImageNet Supervised | 17.1                           | 36.1
Jigsaw Flickr 100M  | 13.1                           | 44.6
Evaluation: Few-Shot Learning
What should pre-trained features learn?
Represent how images relate to one another
Be robust to "nuisance factors" -- Invariance
e.g., exact location of objects, lighting, exact color
Clustering and Contrastive Learning are two ways to achieve the above.
Clustering
Boosting Knowledge (Noroozi et al., 2018); DeepCluster (Caron et al., 2018); DeeperCluster (Caron et al., 2019); ClusterFit (Yan et al., 2020)
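To make the clustering idea concrete, here is a minimal DeepCluster-style loop under simplifying assumptions: `backbone` and a non-shuffled `loader` are hypothetical, scikit-learn k-means stands in for the paper's implementation, and practical details (re-initializing the classifier each round, balancing clusters, reducing feature dimensionality) are omitted.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

num_clusters, feature_dim, num_rounds = 100, 2048, 10      # illustrative values
classifier = nn.Linear(feature_dim, num_clusters)
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(classifier.parameters()), lr=0.01)

for _ in range(num_rounds):
    # Step 1: run the current backbone over the data and cluster the features
    # into pseudo-labels.
    with torch.no_grad():
        feats = torch.cat([backbone(x) for x, _ in loader])
    pseudo_labels = torch.as_tensor(
        KMeans(n_clusters=num_clusters).fit_predict(feats.numpy()),
        dtype=torch.long)

    # Step 2: train backbone + classifier to predict the cluster assignments.
    for i, (x, _) in enumerate(loader):
        logits = classifier(backbone(x))
        y = pseudo_labels[i * x.size(0):(i + 1) * x.size(0)]
        loss = nn.functional.cross_entropy(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```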
Evaluation: Synthetic Noise
Evaluation: Self-Supervised Images
Which SSL method is the best?
Supervised methods still outperform self-supervised methods.
Self-Supervised Learning for Natural Language
Transformer-based language models are typically learned through self-supervision.
Can scale to very large datasets and give extremely powerful features that transfer to downstream tasks.
Very successful; in many ways the dream of SSL made real: larger models and larger datasets give better features that improve performance on many downstream NLP tasks.
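For example, the standard self-supervised objective for such language models is next-token prediction: the training targets are simply the input sequence shifted by one position, so no human labels are needed. A minimal sketch, with a hypothetical `model` that returns per-token logits over the vocabulary:

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    # Targets are the inputs shifted by one position: predict the next token.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                       # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```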
Contrastive Learning
Hadsell et al., Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006
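The loss introduced in that paper is a margin-based contrastive loss over pairs of embeddings; a minimal sketch is below, using $y = 1$ for a "related" pair (the paper uses the opposite label convention).

```python
import torch

def contrastive_loss(z1, z2, y, margin=1.0):
    """Hadsell et al.-style margin loss; y = 1 for related pairs, 0 otherwise."""
    d = torch.norm(z1 - z2, dim=1)                          # Euclidean distance
    pos = y * d.pow(2)                                      # pull related pairs together
    neg = (1 - y) * torch.clamp(margin - d, min=0).pow(2)   # push unrelated pairs apart
    return 0.5 * (pos + neg).mean()
```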
Contrastive Learning for Self-Supervised Learning
How to define which images are "related" and "unrelated"?
Nearby patches vs. distant patches of the same image
Patches of an image vs. patches of other images
Data augmentations of each patch (see the sketch below)
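A sketch of the third option in SimCLR-style form: two random augmentations of the same image form a "related" (positive) pair. The particular transforms and parameters are illustrative, and `image` is a hypothetical PIL image.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

view1, view2 = augment(image), augment(image)   # a "related" (positive) pair
```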
Pretext-Invariant Representation Learning (PIRL)
Be invariant to $\mathbf{t}$
Representation contains no information about $\mathbf{t}$
Use a contrastive loss to enforce similarity between the features of $\mathbf{I}$ and of its transformed version $\mathbf{I}^t$:
\begin{equation}
L_{\text{contrastive}}\left(\mathbf{v}_{\mathbf{I}}, \mathbf{v}_{\mathbf{I}^t}\right)
\end{equation}
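A minimal InfoNCE-style implementation of such a contrastive loss is sketched below; it follows the spirit of PIRL's NCE objective but uses in-batch negatives instead of the paper's memory bank. `v_i` and `v_it` are batches of features for the original images $\mathbf{I}$ and their transformed versions $\mathbf{I}^t$.

```python
import torch
import torch.nn.functional as F

def info_nce(v_i, v_it, temperature=0.1):
    v_i = F.normalize(v_i, dim=1)
    v_it = F.normalize(v_it, dim=1)
    logits = v_i @ v_it.t() / temperature                      # similarity of all pairs
    targets = torch.arange(v_i.size(0), device=v_i.device)     # matching pair = positive
    return F.cross_entropy(logits, targets)                    # other images = negatives
```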
Barlow Twins
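The slide gives only the method name; as a reminder, here is a minimal sketch of the Barlow Twins objective, which pushes the cross-correlation matrix between the embeddings of two augmented views toward the identity (decorrelating dimensions while keeping the two views aligned).

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)           # standardize each embedding dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.t() @ z2 / n                          # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # diagonal -> 1: views agree
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # off-diagonal -> 0
    return on_diag + lam * off_diag
```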
Contrastive Learning Gives Huge Improvements
Masked Autoencoders (MAE)
Denoising autoencoder with Vision Transformer
SSL pre-training, then fine-tuning for ImageNet classification
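A highly simplified sketch of the MAE pretext step: randomly mask most patches, encode only the visible ones, and reconstruct the pixels of the masked patches. `encoder` and `decoder` (including the `decoder(latent, masked)` signature) are hypothetical ViT-style modules, not the paper's API.

```python
import torch
import torch.nn.functional as F

def mae_step(patches, encoder, decoder, mask_ratio=0.75):
    n, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))
    idx = torch.rand(n, num_patches).argsort(dim=1)    # random patch order per image
    keep, masked = idx[:, :num_keep], idx[:, num_keep:]

    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, dim))
    latent = encoder(visible)                          # encoder sees only visible patches
    pred = decoder(latent, masked)                     # decoder predicts the masked patches
    target = torch.gather(patches, 1, masked.unsqueeze(-1).expand(-1, -1, dim))
    return F.mse_loss(pred, target)                    # loss only on masked patches
```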
Current Status on Image Based SSL
The motivation of SSL is scaling to large data that can't be labeled.
Most papers pre-train on (unlabeled) ImageNet, then evaluate on ImageNet!
Unlabeled ImageNet is still curated: single object per image, balanced classes.
SSL on larger image datasets has not been as successful as it has been in NLP.
Multimodal Self-Supervised Learning
Don't learn from isolated images; use images together with some context:
Video: image together with adjacent video frames
Sound: image with audio track from video
3D: image with depth map or point cloud
Language: image with natural-language text
Matching Images and Text: CLIP
Contrastive Loss: each image predicts which caption matches it (sketched below)
Large-scale training on 400M (image, text) pairs from the internet.
Very strong performance on many downstream vision problems.
Performance continues to improve with larger models.
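A minimal sketch of that symmetric contrastive loss: for a batch of N (image, text) pairs, each image must pick out its own caption among the N captions, and each caption its own image. The encoders producing `image_features` and `text_features` are assumed, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    img = F.normalize(image_features, dim=1)
    txt = F.normalize(text_features, dim=1)
    logits = img @ txt.t() / temperature                        # N x N similarities
    targets = torch.arange(img.size(0), device=img.device)      # matches on the diagonal
    loss_i = F.cross_entropy(logits, targets)                   # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)               # text -> image
    return 0.5 * (loss_i + loss_t)
```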
Summary
Self-supervised learning (SSL) aims to scale up to larger datasets without human annotation
First train for a pretext task, then transfer to downstream tasks
Many pretext tasks: context prediction, jigsaw, colorization, clustering, rotation
SSL has been wildly successful for language
Intense research on SSL in vision; the current best approaches use contrastive learning and masked autoencoding
Multimodal SSL uses images together with additional context
Multimodal SSL with vision + language has been very successful and seems very promising!