• Author(s): Cusuh Ham, Matthew Fisher, James Hays, Nicholas Kolkin, Yuchen Liu, Richard Zhang, Tobias Hinz

This paper introduces a novel approach for efficient concept-driven generation with text-to-image diffusion models, combining personalized residuals and localized attention-guided sampling. The method represents a concept by freezing the weights of a pretrained text-conditioned diffusion model and learning low-rank residuals for a small subset of the model's layers.
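The following is a minimal sketch of what learning such low-rank residuals on top of a frozen model could look like in PyTorch. The wrapper class `LowRankResidual`, the helper `attach_residuals`, and the `select` name filter are illustrative names, and the choice of which layers to wrap is only a placeholder for the subset used in the paper.

```python
import torch
import torch.nn as nn


class LowRankResidual(nn.Module):
    """A frozen linear layer plus a learnable low-rank residual: y = W x + B A x."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # pretrained weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors; B is zero-initialized so the residual starts inactive.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.use_residual = True  # toggled off to recover the original model

    def forward(self, x):
        y = self.base(x)
        if self.use_residual:
            y = y + nn.functional.linear(nn.functional.linear(x, self.A), self.B)
        return y


def attach_residuals(model: nn.Module, select, rank: int = 4):
    """Wrap every linear layer whose qualified name passes `select` (hypothetical filter)."""
    for name, module in list(model.named_modules()):
        for child_name, child in list(module.named_children()):
            full_name = f"{name}.{child_name}" if name else child_name
            if isinstance(child, nn.Linear) and select(full_name):
                setattr(module, child_name, LowRankResidual(child, rank))
    return model
```

Only the residual parameters are trainable, so optimization touches a small fraction of the model's weights and the original behavior can be recovered by disabling the wrappers.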

The residual-based formulation directly enables the proposed sampling technique, which applies the learned residuals only in regions where the concept is localized via cross-attention and uses the original diffusion weights everywhere else. Localized sampling thus combines the learned identity of the concept with the existing generative prior of the underlying diffusion model. The results demonstrate that personalized residuals capture a concept's identity in roughly three minutes on a single GPU, without regularization images and with fewer parameters than prior methods.
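One way to picture a single step of localized sampling is sketched below, assuming a diffusers-style UNet call signature and the `LowRankResidual` wrappers from the earlier sketch. The `concept_mask` is assumed to already be derived from the cross-attention maps for the concept token and resized to the latent resolution; the helper `set_residuals` is a hypothetical toggle.

```python
import torch


def set_residuals(model, enabled: bool):
    """Toggle all low-rank residual wrappers on or off."""
    for m in model.modules():
        if hasattr(m, "use_residual"):
            m.use_residual = enabled


@torch.no_grad()
def localized_noise_prediction(unet, x_t, t, text_emb, concept_mask):
    """Blend personalized and original noise predictions with a concept mask in [0, 1]."""
    # Prediction with the learned residuals enabled (personalized model).
    set_residuals(unet, enabled=True)
    eps_personal = unet(x_t, t, encoder_hidden_states=text_emb).sample

    # Prediction with residuals disabled (original pretrained model).
    set_residuals(unet, enabled=False)
    eps_original = unet(x_t, t, encoder_hidden_states=text_emb).sample

    # Use the personalized prediction only where the concept is localized;
    # everywhere else, the original model's prior drives the generation.
    return concept_mask * eps_personal + (1.0 - concept_mask) * eps_original
```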

Furthermore, localized sampling lets the original model serve as a strong prior over large portions of the image. Together, these components offer a promising direction for efficient and effective concept-driven generation with text-to-image diffusion models, balancing the preservation of a learned concept's identity against the generative capabilities of the underlying model.