Text-to-image diffusion models have shown impressive results in generating personalized images of a single subject from just a few reference images. However, these models often struggle to generate images containing multiple subjects, mixing their identities and blending attributes from different individuals.

To address this issue, the authors introduce MuDI, a framework that personalizes multiple subjects in a single image by effectively separating their identities. The key idea behind MuDI is to use segmented subjects produced by the Segment Anything Model (SAM) during both training and inference.
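
As a concrete illustration, the snippet below shows one way to obtain segmented subjects with the publicly released `segment-anything` package. The checkpoint path, file names, and the "largest mask = subject" heuristic are illustrative assumptions, not MuDI's exact preprocessing.

```python
# Sketch: extracting a segmented subject from a reference image with SAM.
# Assumes the official `segment-anything` package and a downloaded ViT-H checkpoint;
# paths and the largest-mask heuristic are illustrative only.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.array(Image.open("reference_subject.jpg").convert("RGB"))
masks = mask_generator.generate(image)  # list of dicts: "segmentation", "area", "bbox", ...

# Use the largest mask as a rough proxy for the foreground subject.
subject_mask = max(masks, key=lambda m: m["area"])["segmentation"]  # HxW bool array
segmented_subject = image * subject_mask[..., None]                 # background zeroed out
```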

During training, MuDI uses the segmented subjects for data augmentation, helping the model learn to distinguish between different subjects and their attributes. At inference time, MuDI initializes the generation process with the segmented subjects, so that each subject's identity remains distinct throughout sampling (see the sketch below).
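
A minimal sketch of this idea: segmented subjects are pasted onto a shared canvas at random scales and positions, producing the kind of composite described above. The function name, canvas size, scale range, and placement policy are assumptions for illustration, not MuDI's exact augmentation settings.

```python
# Sketch: composing segmented subjects into a single image so they appear clearly
# separated. Canvas size, scale range, and placement are illustrative assumptions.
import random
import numpy as np
from PIL import Image

def compose_segmented_subjects(subjects, masks, canvas_size=(1024, 1024),
                               scale_range=(0.4, 0.7)):
    """subjects: list of HxWx3 uint8 arrays; masks: matching list of HxW bool arrays."""
    canvas = np.zeros((*canvas_size, 3), dtype=np.uint8)
    for subject, mask in zip(subjects, masks):
        if not mask.any():
            continue
        ys, xs = np.where(mask)
        crop = subject[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        crop_mask = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

        # Randomly rescale the cropped subject, clamped to fit the canvas.
        s = random.uniform(*scale_range)
        new_h = min(canvas_size[0], max(1, int(crop.shape[0] * s)))
        new_w = min(canvas_size[1], max(1, int(crop.shape[1] * s)))
        crop = np.array(Image.fromarray(crop).resize((new_w, new_h)))
        crop_mask = np.array(
            Image.fromarray(crop_mask.astype(np.uint8) * 255).resize((new_w, new_h))
        ) > 127

        # Paste at a random location; later subjects may partially occlude earlier ones.
        y0 = random.randint(0, canvas_size[0] - new_h)
        x0 = random.randint(0, canvas_size[1] - new_w)
        region = canvas[y0:y0 + new_h, x0:x0 + new_w]
        region[crop_mask] = crop[crop_mask]
    return canvas
```

Composites like this correspond to the two uses described above: extra augmented examples during fine-tuning, and a starting point for generation so the subjects begin spatially separated.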

Their experiments show that MuDI can generate high-quality personalized images of multiple subjects without mixing their identities, even when the subjects are highly similar. In human evaluations, MuDI personalized multiple subjects without identity mixing twice as often as existing baselines, and it was preferred over the strongest baseline in more than 70% of comparisons.