A good representation transfers with little training
Evaluation: Many Tasks
Evaluation: Image Classification
Extract "fixed" features
Train a Linear SVM on fixed features
Use the PASCAL VOC 2007 image classification task (sketched below)
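A minimal sketch of this linear-evaluation protocol in PyTorch + scikit-learn is shown below. The checkpoint path and `loader` are hypothetical placeholders, and since VOC 2007 classification is multi-label, in practice one binary SVM is trained per class; a single `LinearSVC` is shown for brevity.

```python
import torch
import torchvision
from sklearn.svm import LinearSVC

# Backbone initialized from the pretext task; the checkpoint path is hypothetical.
backbone = torchvision.models.resnet50(weights=None)
backbone.load_state_dict(torch.load("pretext_pretrained.pth"))
backbone.fc = torch.nn.Identity()        # keep pooled features, drop the classifier
backbone.eval()

features, labels = [], []
with torch.no_grad():                    # "fixed" features: no gradient updates
    for images, targets in loader:       # `loader`: hypothetical VOC 2007 DataLoader
        features.append(backbone(images))
        labels.append(targets)

X = torch.cat(features).numpy()
y = torch.cat(labels).numpy()

clf = LinearSVC(C=1.0)                   # linear SVM on the frozen features
clf.fit(X, y)
```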
Evaluation: Object Detection
Detection mAP (higher is better):
Initialization      | Train set: PASCAL VOC 2007 | Train set: PASCAL VOC 2007+2012
ImageNet Supervised | 70.5                       | 76.2
Jigsaw ImageNet 14M | 69.2                       | 75.4
Evaluation: Surface Normal Estimation
Initialization      | Median Error (lower is better) | % correct within $11.25^\circ$ (higher is better)
ImageNet Supervised | 17.1                           | 36.1
Jigsaw Flickr 100M  | 13.1                           | 44.6
Evaluation: Few-Shot Learning
What should pre-trained features learn?
Represent how images relate to one another
Be robust to "nuisance factors" -- Invariance
e.g., exact location of objects, lighting, exact color
Clustering and Contrastive Learning are two ways to achieve the above.
Clustering
Boosting Knowledge (Noroozi et al., 2018); DeepCluster (Caron et al., 2018); DeeperCluster (Caron et al., 2019); ClusterFit (Yan et al., 2020)
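To make the clustering idea concrete, here is a minimal DeepCluster-style loop under simplifying assumptions: `backbone` and a non-shuffled `loader` are hypothetical, scikit-learn k-means stands in for the paper's implementation, and practical details (re-initializing the classifier each round, balancing clusters, reducing feature dimensionality) are omitted.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

num_clusters, feature_dim, num_rounds = 100, 2048, 10      # illustrative values
classifier = nn.Linear(feature_dim, num_clusters)
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(classifier.parameters()), lr=0.01)

for _ in range(num_rounds):
    # Step 1: run the current backbone over the data and cluster the features
    # into pseudo-labels.
    with torch.no_grad():
        feats = torch.cat([backbone(x) for x, _ in loader])
    pseudo_labels = torch.as_tensor(
        KMeans(n_clusters=num_clusters).fit_predict(feats.numpy()),
        dtype=torch.long)

    # Step 2: train backbone + classifier to predict the cluster assignments.
    for i, (x, _) in enumerate(loader):
        logits = classifier(backbone(x))
        y = pseudo_labels[i * x.size(0):(i + 1) * x.size(0)]
        loss = nn.functional.cross_entropy(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```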
Evaluation: Synthetic Noise
Evaluation: Self-Supervised Images
Which SSL method is the best?
Supervised methods still outperform self-supervised methods.
Self-Supervised Learning for Natural Language
Transformer-based language models are typically learned through self-supervision.
Can scale to very large datasets and give extremely powerful features that transfer to downstream tasks.
Very successful; in many ways the dream of SSL made real: larger models and larger datasets give better features that improve performance on many downstream NLP tasks.
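For example, the standard self-supervised objective for such language models is next-token prediction: the training targets are simply the input sequence shifted by one position, so no human labels are needed. A minimal sketch, with a hypothetical `model` that returns per-token logits over the vocabulary:

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    # Targets are the inputs shifted by one position: predict the next token.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                       # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```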
Contrastive Learning
Hadsell et al., Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006
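The loss introduced in that paper is a margin-based contrastive loss over pairs of embeddings; a minimal sketch is below, using $y = 1$ for a "related" pair (the paper uses the opposite label convention).

```python
import torch

def contrastive_loss(z1, z2, y, margin=1.0):
    """Hadsell et al.-style margin loss; y = 1 for related pairs, 0 otherwise."""
    d = torch.norm(z1 - z2, dim=1)                          # Euclidean distance
    pos = y * d.pow(2)                                      # pull related pairs together
    neg = (1 - y) * torch.clamp(margin - d, min=0).pow(2)   # push unrelated pairs apart
    return 0.5 * (pos + neg).mean()
```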
Contrastive Learning for Self-Supervised Learning
How to define which images are "related" and "unrelated"?
Nearby patches vs. distant patches of the same image
Patches of an image vs. patches of other images
Data augmentations of each patch (see the sketch below)
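A sketch of the third option in SimCLR-style form: two random augmentations of the same image form a "related" (positive) pair. The particular transforms and parameters are illustrative, and `image` is a hypothetical PIL image.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

view1, view2 = augment(image), augment(image)   # a "related" (positive) pair
```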
Pretext-Invariant Representation Learning (PIRL)
Be invariant to $\mathbf{t}$
Representation contains no information about $\mathbf{t}$
Use a contrastive loss to enforce similarity between the features of $\mathbf{I}$ and of its transformed version $\mathbf{I}^t$:
\begin{equation}
L_{\text{contrastive}}\left(\mathbf{v}_{\mathbf{I}}, \mathbf{v}_{\mathbf{I}^t}\right)
\end{equation}
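A minimal InfoNCE-style implementation of such a contrastive loss is sketched below; it follows the spirit of PIRL's NCE objective but uses in-batch negatives instead of the paper's memory bank. `v_i` and `v_it` are batches of features for the original images $\mathbf{I}$ and their transformed versions $\mathbf{I}^t$.

```python
import torch
import torch.nn.functional as F

def info_nce(v_i, v_it, temperature=0.1):
    v_i = F.normalize(v_i, dim=1)
    v_it = F.normalize(v_it, dim=1)
    logits = v_i @ v_it.t() / temperature                      # similarity of all pairs
    targets = torch.arange(v_i.size(0), device=v_i.device)     # matching pair = positive
    return F.cross_entropy(logits, targets)                    # other images = negatives
```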
Barlow Twins
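The slide gives only the method name; as a reminder, here is a minimal sketch of the Barlow Twins objective, which pushes the cross-correlation matrix between the embeddings of two augmented views toward the identity (decorrelating dimensions while keeping the two views aligned).

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)           # standardize each embedding dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.t() @ z2 / n                          # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # diagonal -> 1: views agree
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # off-diagonal -> 0
    return on_diag + lam * off_diag
```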
Contrastive Learning Gives Huge Improvements
Masked Autoencoders (MAE)
Denoising autoencoder with Vision Transformer
SSL pre-training, then fine-tuning for ImageNet classification
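A highly simplified sketch of the MAE pretext step: randomly mask most patches, encode only the visible ones, and reconstruct the pixels of the masked patches. `encoder` and `decoder` (including the `decoder(latent, masked)` signature) are hypothetical ViT-style modules, not the paper's API.

```python
import torch
import torch.nn.functional as F

def mae_step(patches, encoder, decoder, mask_ratio=0.75):
    n, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))
    idx = torch.rand(n, num_patches).argsort(dim=1)    # random patch order per image
    keep, masked = idx[:, :num_keep], idx[:, num_keep:]

    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, dim))
    latent = encoder(visible)                          # encoder sees only visible patches
    pred = decoder(latent, masked)                     # decoder predicts the masked patches
    target = torch.gather(patches, 1, masked.unsqueeze(-1).expand(-1, -1, dim))
    return F.mse_loss(pred, target)                    # loss only on masked patches
```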
Current Status on Image Based SSL
The motivation of SSL is scaling to large data that can't be labeled.
Most papers pre-train on (unlabeled) ImageNet, then evaluate on ImageNet!
Unlabeled ImageNet is still curated: single object per image, balanced classes.
SSL on larger image datasets has not been as successful as it has been in NLP.
Multimodal Self-Supervised Learning
Don't learn from isolated images; use images together with some context:
Video: image together with adjacent video frames
Sound: image with audio track from video
3D: image with depth map or point cloud
Language: image with natural-language text
Matching Images and Text: CLIP
Contrastive Loss: each image predicts which caption matches it (sketched below)
Large-scale training on 400M (image, text) pairs from the internet.
Very strong performance on many downstream vision problems.
Performance continues to improve with larger models.
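A minimal sketch of that symmetric contrastive loss: for a batch of N (image, text) pairs, each image must pick out its own caption among the N captions, and each caption its own image. The encoders producing `image_features` and `text_features` are assumed, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    img = F.normalize(image_features, dim=1)
    txt = F.normalize(text_features, dim=1)
    logits = img @ txt.t() / temperature                        # N x N similarities
    targets = torch.arange(img.size(0), device=img.device)      # matches on the diagonal
    loss_i = F.cross_entropy(logits, targets)                   # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)               # text -> image
    return 0.5 * (loss_i + loss_t)
```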
Summary
Self-supervised learning (SSL) aims to scale up to larger datasets without human annotation
First train for a pretext task, then transfer to downstream tasks
Many pretext tasks: context prediction, jigsaw, colorization, clustering, rotation
SSL has been wildly successful for language
Intense research on SSL in vision; the current best approaches use contrastive learning and masked autoencoding
Multimodal SSL uses images together with additional context
Multimodal SSL with vision + language has been very successful and seems very promising!