Emerging Properties in Self-Supervised Vision Transformers
Self-supervised learning is often described as the final frontier of representation learning: obtaining useful features without any labels.
Facebook AI's new system, DINO, combines advances in self-supervised learning for computer vision with the new Vision Transformer (ViT) architecture and achieves impressive results without any labels. Its attention maps can be read directly as segmentation maps, and the learned representations can be used as-is, without fine-tuning, for image retrieval and k-nearest-neighbor (k-NN) classification.
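To make the k-NN idea concrete, here is a minimal sketch of classifying images from frozen features by cosine similarity. The function name knn_classify, the toy 2-D "features", and the plain majority vote are illustrative assumptions, not DINO's actual evaluation code; in practice the features would come from a pretrained backbone.

```python
import numpy as np

def knn_classify(train_feats, train_labels, query_feats, k=3):
    """Label each query by majority vote among its k most similar training features.

    Hypothetical helper for illustration; similarity is cosine similarity.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    query = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = query @ train.T  # shape: (num_queries, num_train)

    # Indices of the k most similar training items for each query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    neighbor_labels = train_labels[topk]  # shape: (num_queries, k)

    # Count votes per class and pick the most frequent label.
    num_classes = int(train_labels.max()) + 1
    votes = np.stack([np.bincount(row, minlength=num_classes)
                      for row in neighbor_labels])
    return votes.argmax(axis=1)

# Toy stand-ins for features from a frozen, pretrained network (assumed data).
train_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = np.array([0, 0, 1, 1])
query_feats = np.array([[1.0, 0.2], [0.2, 1.0]])

predictions = knn_classify(train_feats, train_labels, query_feats, k=3)
print(predictions)
```

No weights are trained here: the labels are used only at lookup time, which is why k-NN accuracy is a handy check on the quality of the features themselves.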
You can find the paper here.
There is a blog post with more info here.
The PyTorch code can be found here.
Yannic Kilcher will run us through his astute analysis of the system.