The Vision Transformer (ViT) has risen to prominence as a key architecture for a variety of computer vision tasks. In ViT, the input image is divided into patch tokens and processed through a series of self-attention blocks. However, unlike Convolutional Neural Networks (CNNs), ViT's simple architecture lacks informative inductive biases such as locality. As a result, ViT requires a substantial amount of data for pre-training.
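To make the tokenization step concrete, below is a minimal sketch (not the authors' code; module names and dimensions are illustrative) of how a ViT cuts an image into patches, projects each patch to an embedding, and feeds the resulting tokens to self-attention blocks.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is a standard way to cut and project patches in one step.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, dim)

embed = PatchEmbed()
tokens = embed(torch.randn(2, 3, 224, 224))   # -> (2, 196, 384)

# These patch tokens (plus learnable class/distillation tokens and position
# embeddings, omitted here) are then processed by a stack of self-attention blocks:
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True),
    num_layers=12,
)
out = encoder(tokens)                          # (2, 196, 384)
```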

Several data-efficient approaches, such as DeiT, have been proposed to train ViT effectively on balanced datasets. However, there is limited literature on applying ViT to datasets with long-tailed imbalance. To address this gap, DeiT-LT has been introduced to tackle the challenge of training ViTs from scratch on long-tailed datasets.

DeiT-LT introduces an efficient and effective way of distilling knowledge from a CNN teacher via a distillation (DIST) token. This is achieved by distilling on out-of-distribution images and re-weighting the distillation loss to place greater emphasis on tail classes. As a result, the early ViT blocks learn local, CNN-like features, which improves generalization on tail classes.
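The sketch below illustrates the re-weighted distillation idea in hedged form: it is not the exact DeiT-LT loss, and the inverse-frequency weighting, the function name, and the tensors used are assumptions for illustration. The CNN teacher's hard predictions on strongly augmented (out-of-distribution) images supervise the ViT's DIST-token logits, with per-class weights that up-weight tail classes.

```python
import torch
import torch.nn.functional as F

def reweighted_hard_distillation_loss(dist_logits, teacher_logits, class_counts):
    """dist_logits: ViT DIST-token logits on the augmented (OOD) batch.
    teacher_logits: CNN teacher logits on the same images.
    class_counts: number of training samples per class, shape [C]."""
    teacher_labels = teacher_logits.argmax(dim=1)            # hard teacher targets
    weights = 1.0 / class_counts.float()                     # illustrative inverse-frequency weights
    weights = weights / weights.sum() * len(class_counts)    # normalize around 1
    return F.cross_entropy(dist_logits, teacher_labels, weight=weights)

# Example usage with random tensors standing in for real model outputs
# and a long-tailed class distribution.
C = 10
dist_logits = torch.randn(8, C)
teacher_logits = torch.randn(8, C)
class_counts = torch.tensor([5000, 2997, 1796, 1077, 645, 387, 232, 139, 83, 50])
loss = reweighted_hard_distillation_loss(dist_logits, teacher_logits, class_counts)
```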

To mitigate overfitting, DeiT-LT further distills from a flat CNN teacher, which leads to low-rank, generalizable DIST-token features across all ViT blocks. Under the proposed scheme, the distillation DIST token becomes an expert on tail classes, while the classifier CLS token becomes an expert on head classes. These two experts allow features for both majority and minority classes to be learned effectively within the same ViT architecture, using distinct sets of tokens.
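A minimal sketch of this two-expert idea follows (illustrative only, not the released DeiT-LT code; the class name, head dimensions, and the simple averaging at the end are assumptions). The same backbone produces a CLS token and a DIST token, each with its own classifier head: the CLS head is trained on ground-truth labels and specializes in head classes, the DIST head is trained with the distillation loss and specializes in tail classes, and their predictions are combined at inference.

```python
import torch
import torch.nn as nn

class TwoTokenClassifier(nn.Module):
    """Separate classifier heads on the CLS and DIST tokens of a shared ViT backbone."""
    def __init__(self, dim=384, num_classes=10):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)   # supervised head: expert on head classes
        self.dist_head = nn.Linear(dim, num_classes)  # distillation head: expert on tail classes

    def forward(self, cls_token, dist_token):
        cls_logits = self.cls_head(cls_token)
        dist_logits = self.dist_head(dist_token)
        # During training, cls_logits receive the supervised loss and dist_logits
        # the distillation loss; at test time the two experts are combined,
        # here simply by averaging their logits.
        return cls_logits, dist_logits, (cls_logits + dist_logits) / 2

heads = TwoTokenClassifier()
cls_tok, dist_tok = torch.randn(4, 384), torch.randn(4, 384)
cls_logits, dist_logits, combined = heads(cls_tok, dist_tok)
```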

The effectiveness of DeiT-LT for training ViT from scratch has been demonstrated on datasets ranging from the small-scale CIFAR-10 LT to the large-scale iNaturalist-2018, showing its potential for handling datasets with long-tailed imbalance.