Training-Free Consistent Text-to-Image Generation
- Published on May 10th, 2024 6:11 am
- Editor: Yuvraj Singh
- Author(s): Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon
ConsiStory is a training-free approach for portraying the same subject consistently across diverse prompts in text-to-image models. These models offer great creative flexibility by letting users guide image generation through natural language, but keeping a subject consistent from one generated image to the next has remained difficult.
Existing methods, such as fine-tuning the model to learn new words that describe user-provided subjects or adding image conditioning, have limitations. They require either time-consuming per-subject optimization or extensive pre-training, and they often struggle to align generated images with text prompts, particularly when portraying multiple subjects.
ConsiStory overcomes these challenges by sharing the internal activations of the pretrained model. The approach introduces a subject-driven shared attention block and correspondence-based feature injection, which promote subject consistency between images. Additionally, ConsiStory incorporates strategies to encourage layout diversity while maintaining subject consistency.
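To make the idea of a subject-driven shared attention block more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `shared_subject_attention`, the nested-list tensor layout, and the assumption that a boolean subject mask per image is already available are all hypothetical. The sketch lets each image attend to all of its own patches plus only the subject patches of the other images in the batch. It omits correspondence-based feature injection and the layout-diversity strategies.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_subject_attention(q, k, v, subject_mask):
    """Hypothetical sketch of attention shared across a batch of images.

    q, k, v:       nested lists indexed as [image][patch][channel]
    subject_mask:  nested list [image][patch] of bools, True where a patch
                   is assumed to belong to the subject
    Returns a nested list with the same layout as v.
    """
    num_images = len(q)
    scale = 1.0 / math.sqrt(len(q[0][0]))
    out_channels = len(v[0][0])
    out = []
    for i in range(num_images):
        # Keys/values visible to image i: every patch of image i itself,
        # plus only the subject patches of every other image.
        visible = [
            (j, p)
            for j in range(num_images)
            for p in range(len(k[j]))
            if j == i or subject_mask[j][p]
        ]
        image_out = []
        for query in q[i]:
            # Scaled dot-product scores against the visible keys.
            scores = [
                scale * sum(a * b for a, b in zip(query, k[j][p]))
                for j, p in visible
            ]
            weights = softmax(scores)
            # Weighted sum of the visible values, channel by channel.
            image_out.append([
                sum(w * v[j][p][c] for w, (j, p) in zip(weights, visible))
                for c in range(out_channels)
            ])
        out.append(image_out)
    return out
```

When no patch is marked as subject, this reduces to ordinary per-image self-attention, so the mask alone controls how much information flows between images. Restricting the shared keys and values to subject patches is what leaves backgrounds and layouts free to differ in this sketch.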
Comparative evaluations against various baselines demonstrate ConsiStory’s state-of-the-art performance in subject consistency and text alignment, without requiring any optimization steps. The approach handles single-subject scenarios and extends naturally to multi-subject ones. ConsiStory also enables training-free personalization for common objects.
By generating consistent subjects without any training, ConsiStory removes the per-subject optimization and pre-training costs of earlier methods. Because it keeps subjects consistent across diverse prompts while staying aligned with the text descriptions, it opens up new possibilities for creative applications and personalized image generation.