DINOv2: Self-Supervised Computer Vision Models by Meta AI

Unleashing the Potential of Computer Vision with DINOv2: A Groundbreaking Self-Supervised Model by Meta AI.




 

Meta AI has just released DINOv2, a family of open-source computer vision models and, according to Meta, the first self-supervised training method to produce features that match or surpass the standard approaches and models used in the field.

The models achieve strong performance without any fine-tuning, which makes them a perfect choice for many different computer vision tasks and applications. Thanks to the self-supervised training method, DINOv2 can learn from virtually any collection of images and can capture properties such as depth estimation without explicit training for them.
 

Figure 1: DINOv2: Self-Supervised Computer Vision Models by Meta AI

 

1. The Need for Self-Supervised Learning

 

1.1. No fine-tuning is required

 

Self-supervised learning is a powerful method for training machine learning models without the need for large amounts of labeled data. DINOv2 models can be trained on any image corpus, without requiring associated metadata, specific hashtags, or image captions. Unlike several recent self-supervised learning approaches, DINOv2 models do not require fine-tuning, and thus produce high-performance features for a wide range of computer vision applications out of the box.

 

1.2. Overcoming human annotation limitations

 

Over the past few years, image-text pre-training has become the predominant method for various computer vision applications. However, because it depends on human-written captions to learn the semantic meaning of images, this approach often overlooks important information that is not explicitly included in those captions. For example, a human-written caption for a picture of a red table in a yellow room might be “A red wooden table”. This caption misses important information about the background, the position, and the size of the table, which leads to a weaker understanding of local information and poor performance on tasks that require detailed localization.

In addition, the need for human labels and annotations limits the amount of data that can be collected to train the models. This becomes much harder in certain applications: annotating cellular imagery, for example, requires a level of human expertise that is not available at the required scale. Using a self-supervised training approach on cellular imagery opens the way for more foundational models and, as a result, better biological discovery. The same applies to other specialized fields, such as estimating animal density.

Moving from DINO to DINOv2 required overcoming several challenges, such as:

  • Creating a large and curated training dataset
  • Improving the training algorithm and implementation
  • Designing a functional distillation pipeline.

 

2. From DINO to DINOv2

 

Figure 2: DINO v1 vs. v2 comparison of segmentation precision

 

2.1. Creating a large, curated, and diverse image dataset

 

One of the main steps in building DINOv2 was to train larger architectures and models to improve performance. However, larger models require larger datasets to be trained efficiently. Since no existing large dataset met the requirements, the researchers leveraged publicly crawled web data and built a pipeline to select only useful data, as in LASER.

However, two main tasks had to be carried out before this data could be used:

  • Balance the data across different concepts and tasks
  • Remove irrelevant images

As this task cannot be accomplished manually, they curated a set of seed images from approximately 25 third-party datasets and expanded it by retrieving images that are closely related to those seed images. This approach allowed them to produce a pretraining dataset of 142 million images out of 1.2 billion crawled images.
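The expansion step can be pictured as an embedding-based nearest-neighbor search: embed every image, then keep the crawled images that are close to the curated seeds. The snippet below is a minimal sketch of that general idea using FAISS, with random vectors standing in for real image embeddings; it is an illustration only, not Meta's actual curation pipeline.

import faiss
import numpy as np

# Minimal sketch of embedding-based retrieval for dataset curation
# (illustrative only; random vectors stand in for real image embeddings).
rng = np.random.default_rng(0)
seed_emb = rng.standard_normal((1_000, 384)).astype("float32")     # curated seed images
crawl_emb = rng.standard_normal((100_000, 384)).astype("float32")  # uncurated web images

# Normalize so that inner product equals cosine similarity
faiss.normalize_L2(seed_emb)
faiss.normalize_L2(crawl_emb)

index = faiss.IndexFlatIP(seed_emb.shape[1])   # exact inner-product index over the seeds
index.add(seed_emb)
sims, _ = index.search(crawl_emb, 4)           # top-4 most similar seeds per crawled image

# Keep crawled images whose best seed similarity clears a threshold
selected = np.flatnonzero(sims.max(axis=1) > 0.5)
print(f"kept {selected.size} of {crawl_emb.shape[0]} crawled images")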

 

2.2. Algorithmic and technical improvements

 

Although using larger models and datasets leads to better results, it comes with major challenges, two of the main ones being potential instability and keeping training tractable. To make training more stable, DINOv2 includes additional regularization methods inspired by the similarity search and classification literature.

The training process of DINOv2 integrates the latest mixed-precision and distributed training implementations provided by PyTorch 2. This made the code considerably faster: on the same hardware used to train the original DINO models, training ran about twice as fast while using a third of the memory, which allowed scaling in both data and model size.
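As a rough illustration of what mixed-precision training looks like with the PyTorch AMP utilities, here is a generic sketch. It is not the actual DINOv2 training loop: the model, data, and loss are placeholders, and the real training additionally runs distributed across many GPUs.

import torch
import torch.nn as nn

# Minimal sketch of mixed-precision training with PyTorch AMP
# (illustrative only; model, data, and loss are placeholders).
device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for _ in range(10):                                     # dummy training steps
    images = torch.randn(8, 3, 64, 64, device=device)   # stand-in for a data loader
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = model(images).pow(2).mean()               # placeholder loss
    scaler.scale(loss).backward()                        # scale gradients to avoid fp16 underflow
    scaler.step(optimizer)                               # unscale and apply the update
    scaler.update()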

 

2.3. Decreasing inference time using model distillation

 

Running large models at inference time requires powerful hardware, which limits the practical use of the method in many use cases. To overcome this problem, the researchers used model distillation to compress the knowledge of the large models into smaller ones. By utilizing this approach, they were able to condense high-performance architectures into smaller ones at a negligible cost in performance. This resulted in strong ViT-Small, ViT-Base, and ViT-Large models.
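Conceptually, distillation trains a small "student" network to reproduce the outputs of a frozen, larger "teacher". The sketch below illustrates that general idea with a simple feature-matching loss on random data; it is not the actual distillation pipeline used for DINOv2, and the tiny MLPs merely stand in for the large and small ViT backbones.

import torch
import torch.nn as nn

# Illustrative feature-distillation sketch (not the actual DINOv2 pipeline):
# a small student network is trained to match the embeddings of a frozen,
# larger teacher network, here on random images standing in for real data.
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.GELU(), nn.Linear(512, 256)).eval()
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.GELU(), nn.Linear(128, 256))
for p in teacher.parameters():
    p.requires_grad_(False)          # the teacher stays frozen during distillation

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
for _ in range(100):                 # dummy training loop
    images = torch.randn(16, 3, 32, 32)
    with torch.no_grad():
        targets = teacher(images)    # teacher embeddings used as regression targets
    loss = nn.functional.mse_loss(student(images), targets)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()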

 

3. Getting Started with DINOv2 

 

The training and evaluation code requires PyTorch 2.0 and xFormers 0.0.18, as well as a number of other third-party packages, and it expects a Linux environment. The following instructions outline how to set up all the necessary dependencies for training and evaluation:

  • Install PyTorch by following the instructions here. It is advised to install PyTorch with CUDA support.
  • Download and install Conda.
  • Clone the DINOv2 repository using the following command:



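Assuming Git is available, cloning the official facebookresearch/dinov2 repository looks like this:

git clone https://github.com/facebookresearch/dinov2.git
cd dinov2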

  • Proceed to create and activate a Conda environment named "dinov2" using the provided environment definition:



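Assuming the environment definition file shipped with the repository is named conda.yaml, as in the official README, the commands are:

conda env create -f conda.yaml
conda activate dinov2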

  • To install the dependencies required for this project, use the provided requirements.txt file:



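From the root of the cloned repository, the pinned dependencies can be installed with pip:

pip install -r requirements.txt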

  • Finally, you can load the models using the code below:



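The snippet below loads the four distilled backbones through the repository's PyTorch Hub entry points and then runs a dummy forward pass with the smallest model to extract a global image embedding; it assumes PyTorch is installed and an internet connection is available to download the weights, and the 384-dimensional output noted in the comment applies to the ViT-S/14 variant.

import torch

# Load the pretrained DINOv2 backbones from PyTorch Hub
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

# Example: extract a global image embedding with the ViT-S/14 backbone.
# Input height and width should be multiples of the patch size (14).
dinov2_vits14.eval()
with torch.no_grad():
    dummy = torch.randn(1, 3, 224, 224)
    features = dinov2_vits14(dummy)   # CLS-token embedding, e.g. (1, 384) for ViT-S/14
print(features.shape)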

In conclusion, the release of the DINOv2 models by Meta AI marks a significant milestone. The self-supervised learning approach used by DINOv2 provides a powerful way to train machine learning models without the need for large amounts of labeled data. With the ability to achieve high accuracy without fine-tuning, these models are suitable for a wide range of computer vision tasks and applications. Moreover, DINOv2 can learn from many different collections of images and can capture properties such as depth estimation without explicit training. The availability of DINOv2 as an open-source model opens the door for researchers and developers to explore new possibilities in computer vision tasks and applications.

 

 
Youssef Rafaat is a computer vision researcher & data scientist. His research focuses on developing real-time computer vision algorithms for healthcare applications. He has also worked as a data scientist for more than three years in the marketing, finance, and healthcare domains.