• Author(s): Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica

For generating high-fidelity, customized images, fine-tuned adapters have become a cheaper alternative to scaling base models with more data or parameters. The open-source community has adopted adapters widely, producing a database of over 100,000 adapters, many of which are highly customized and lack adequate descriptions. This paper examines the problem of selecting the adapters relevant to a given prompt, building on recent work that highlights the benefits of composing adapters.

The authors introduce Stylus, an approach that efficiently selects and automatically composes task-specific adapters based on the keywords in a given prompt. Stylus follows a three-stage methodology: first, it summarizes adapters with improved descriptions and embeddings; second, it retrieves the adapters relevant to the prompt; and third, it assembles the retrieved adapters according to how well they fit the prompt's keywords.
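The three stages can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the adapter catalog, the bag-of-words `embed` function (standing in for a learned embedding), and the equal-weight `compose` rule are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical adapter catalog; names and descriptions are invented for illustration.
ADAPTERS = {
    "watercolor_style": "soft watercolor painting style with pastel colors",
    "cyberpunk_city": "neon cyberpunk city street at night",
    "corgi_dog": "photo of a corgi dog",
}

def embed(text):
    """Stage 1 stand-in: a bag-of-words vector instead of a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(prompt, index, top_k=2):
    """Stage 2: rank adapters by similarity between prompt and description."""
    query = embed(prompt)
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:top_k]

def compose(prompt, candidates, index):
    """Stage 3 stand-in: keep candidates sharing a word with the prompt, weighted equally."""
    words = set(prompt.lower().split())
    kept = [name for name in candidates if words & set(index[name])]
    return {name: 1.0 / len(kept) for name in kept} if kept else {}

# Pre-compute embeddings once, then select adapters for a prompt.
index = {name: embed(desc) for name, desc in ADAPTERS.items()}
prompt = "a corgi dog in watercolor style"
selected = compose(prompt, retrieve(prompt, index), index)
```

Here `selected` maps each chosen adapter name to a mixing weight. A real system would replace the word-overlap checks with learned embeddings and a more careful compatibility test.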

To assess the effectiveness of Stylus, the authors developed StylusDocs, a curated dataset of 75,000 adapters with pre-computed adapter embeddings. Evaluations on popular Stable Diffusion checkpoints show that Stylus achieves better CLIP-FID Pareto efficiency. Furthermore, both human evaluators and multimodal models prefer its outputs twice as often as those of the base model.
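To make "CLIP-FID Pareto efficiency" concrete: a configuration is Pareto-efficient if no other configuration is at least as good on both metrics and strictly better on one, where a higher CLIP score and a lower FID are better. The sketch below assumes that convention. The function name `pareto_front` and the numbers are invented for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return the names of configurations that no other configuration dominates.

    Each value is (clip_score, fid): higher CLIP is better, lower FID is better.
    """
    front = []
    for name, (clip, fid) in points.items():
        dominated = any(
            (c >= clip and f <= fid) and (c > clip or f < fid)
            for other, (c, f) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Made-up values for illustration only; not taken from the paper.
points = {
    "config_a": (0.30, 25.0),
    "config_b": (0.32, 27.0),
    "config_c": (0.29, 26.0),
}
front = pareto_front(points)
```

In this toy data, `config_c` is dominated by `config_a` (lower CLIP and higher FID), while `config_a` and `config_b` trade one metric against the other and both remain on the front.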

In summary, Stylus offers an efficient way to select and compose task-specific adapters for diffusion models, with better CLIP-FID trade-offs and higher evaluator preference than the base model.