• Author(s): Zehuan Huang, Hongxing Fan, Lipeng Wang, Lu Sheng

Recent advances in controllable human image generation enable zero-shot generation conditioned either on structural signals, such as pose or depth, or on facial appearance. Generating human images conditioned on multiple parts of human appearance at once, however, remains a significant challenge.

To address this challenge, the researchers introduce Parts to Whole, a novel framework designed for generating customized portraits from multiple reference images, including pose images and various aspects of human appearance. The framework incorporates two key components to achieve this goal.

First, a semantic-aware appearance encoder retains the details of different human parts. It processes each reference image according to its textual label and produces a series of multi-scale feature maps rather than a single image token. Because the feature maps preserve the image dimension, the fine details of each part remain available to the generator instead of being compressed into one vector.
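
The contrast between a single image token and multi-scale feature maps can be illustrated with a toy NumPy sketch. This is not the paper's encoder: the function names (`avg_pool2x`, `multi_scale_maps`, `single_token`) are hypothetical, 2x2 average pooling stands in for a learned network, and the textual-label conditioning is omitted. The sketch only shows that the maps keep a spatial layout while the token does not.

```python
import numpy as np

def avg_pool2x(x):
    """Halve height and width of an (H, W, C) array with 2x2 average pooling."""
    h, w, c = x.shape
    h2, w2 = h // 2, w // 2
    x = x[: h2 * 2, : w2 * 2]  # drop an odd trailing row/column if present
    return x.reshape(h2, 2, w2, 2, c).mean(axis=(1, 3))

def multi_scale_maps(features, num_scales=3):
    """Return a list of feature maps, each half the resolution of the previous.

    Assumption: pooling is a stand-in for the learned multi-scale encoder.
    """
    maps = [features]
    for _ in range(num_scales - 1):
        maps.append(avg_pool2x(maps[-1]))
    return maps

def single_token(features):
    """Collapse an (H, W, C) map to one C-dimensional vector (spatial layout is lost)."""
    return features.mean(axis=(0, 1))

if __name__ == "__main__":
    feats = np.random.default_rng(0).normal(size=(8, 8, 4))
    for m in multi_scale_maps(feats):
        print("feature map shape:", m.shape)
    print("single token shape:", single_token(feats).shape)
```

Every map in the list still has height and width axes, so a downstream attention layer can address individual locations of a reference part; the single token has only a channel axis.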

Second, the framework supports multi-image conditioned generation through a shared self-attention mechanism that operates across reference and target features during the diffusion process. The mechanism also incorporates mask information from the reference human images, so that any desired part can be selected precisely. Together, these allow the generated image to combine multiple aspects of appearance, such as facial features, hairstyle, and clothing.
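
A minimal sketch of mask-guided shared self-attention follows, written in NumPy under stated assumptions: a single head, flattened token sequences, one reference image, and projection matrices passed in explicitly. The function name `shared_self_attention` and its parameters are hypothetical and do not come from the authors' code. Target queries attend over the concatenation of target and reference tokens, and reference tokens outside the mask are excluded before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_self_attention(target, reference, ref_mask, w_q, w_k, w_v):
    """Single-head attention where target queries see target and reference tokens.

    target:    (n_t, d) target features
    reference: (n_r, d) reference features
    ref_mask:  (n_r,) boolean, True for reference tokens inside the selected part
    w_q, w_k, w_v: (d, d) projection matrices (assumed; learned in practice)
    """
    q = target @ w_q
    kv_in = np.concatenate([target, reference], axis=0)
    k = kv_in @ w_k
    v = kv_in @ w_v

    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Target tokens are always visible; reference tokens only where the mask is True.
    keep = np.concatenate([np.ones(len(target), dtype=bool), ref_mask.astype(bool)])
    scores = np.where(keep[None, :], scores, -np.inf)

    weights = softmax(scores, axis=-1)
    return weights @ v, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    target = rng.normal(size=(4, d))
    reference = rng.normal(size=(6, d))
    ref_mask = np.array([True, True, False, False, True, False])
    w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
    out, weights = shared_self_attention(target, reference, ref_mask, w_q, w_k, w_v)
    print(out.shape, weights.shape)
```

Masking the scores, rather than zeroing the reference features, gives masked-out tokens exactly zero attention weight, so regions outside the selected part cannot leak into the target.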

According to the authors, extensive experiments show that Parts to Whole outperforms existing alternatives and offers stronger capabilities for multi-part controllable human image customization. Potential applications of this kind of personalized human image generation include virtual and augmented reality, gaming, and the creative industries.