A groundbreaking paradigm, Visual Autoregressive modeling (VAR), has been introduced. This paradigm rethinks autoregressive learning on images as “next-scale prediction” or “next-resolution prediction”, a departure from the traditional “next-token prediction” approach. This simple, intuitive method allows autoregressive (AR) transformers to learn visual distributions quickly and generalize effectively. Remarkably, VAR enables AR models to outperform diffusion transformers in image generation for the first time. A rough sketch of the coarse-to-fine generation loop follows.
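The sketch below is only a minimal illustration of the next-scale idea: each autoregressive step emits an entire token map at the next resolution, conditioned on all coarser maps, instead of emitting one token at a time. The scale schedule, vocabulary size, and the `predict_next_scale` stand-in are assumptions for illustration; the actual VAR model uses a multi-scale VQ tokenizer and a GPT-style transformer not shown here.

```python
import torch

scales = [1, 2, 4, 8, 16]   # assumed side lengths of successive token maps
vocab_size = 4096           # assumed codebook size

def predict_next_scale(context_tokens: torch.Tensor, side: int) -> torch.Tensor:
    """Stand-in for the transformer: returns logits for an entire side x side
    token map in one step, conditioned on all coarser-scale tokens."""
    n = side * side
    return torch.randn(n, vocab_size)  # placeholder logits, not a trained model

context = torch.empty(0, dtype=torch.long)  # flattened tokens from coarser scales
for side in scales:
    logits = predict_next_scale(context, side)            # one AR step per scale
    tokens = torch.distributions.Categorical(logits=logits).sample()
    context = torch.cat([context, tokens])                # grow the multi-scale context

print(f"generated {context.numel()} tokens in {len(scales)} autoregressive steps")
```

Because each step produces a whole token map rather than a single token, the number of sequential forward passes equals the number of scales, which is the intuition behind the large inference speedup over token-by-token AR decoding.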

On the ImageNet 256×256 benchmark, VAR significantly improves the AR baseline, reducing the Fréchet inception distance (FID) from 18.65 to 1.80 and boosting the inception score (IS) from 80.4 to 356.4, while delivering roughly 20× faster inference. Empirical evidence confirms that VAR outperforms the Diffusion Transformer (DiT) in several aspects, including image quality, inference speed, data efficiency, and scalability.

When scaled up, VAR models exhibit clear power-law scaling laws, similar to those observed in LLMs, with linear correlation coefficients near -0.998 in log-log space providing strong evidence of the power-law relationship. Furthermore, VAR demonstrates zero-shot generalization in downstream tasks such as image in-painting, out-painting, and editing. These findings suggest that VAR has successfully emulated two key properties of LLMs: scaling laws and zero-shot task generalization. All models and code have been made publicly available to foster further exploration of AR/VAR models for visual generation and unified learning.
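As a brief illustration of how such a correlation coefficient is obtained, a power law L(N) = a · N^(-b) becomes a straight line in log-log space, so the fit quality can be measured with the Pearson correlation of log N versus log L. The data below are synthetic placeholders, not the paper's measurements.

```python
import numpy as np

# Hypothetical (model size, test loss) pairs following an assumed power law
# L(N) = a * N**(-b) with small multiplicative noise; illustrative only.
rng = np.random.default_rng(0)
N = np.array([3e8, 6e8, 1e9, 2e9])
L = 2.1 * N ** -0.15 * np.exp(rng.normal(0.0, 0.01, size=N.size))

logN, logL = np.log(N), np.log(L)
slope, intercept = np.polyfit(logN, logL, 1)   # linear fit in log-log space
r = np.corrcoef(logN, logL)[0, 1]              # Pearson correlation coefficient

print(f"fitted exponent ~ {slope:.3f}, correlation r = {r:.3f}")  # r close to -1
```

A correlation near -0.998, as reported for VAR, indicates that the log-log relationship between scale and loss is almost exactly linear, i.e. the power law holds tightly across the measured range.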