Generating human-human motion interactions conditioned on textual descriptions has wide-ranging applications in fields such as robotics, gaming, animation, and the metaverse. However, modeling the high-dimensional interpersonal dynamics and capturing the intra-personal diversity of interactions pose significant challenges.

Current methods for generating human-human motion interactions often produce limited diversity in intra-personal dynamics due to the constraints of available datasets and conditioning strategies. To address this issue, a novel diffusion model called in2IN has been introduced.

in2IN is a human-human motion generation model conditioned not only on the textual description of the overall interaction but also on individual descriptions of the actions performed by each person involved. To train this model, a large language model is used to extend the InterHuman dataset with individual descriptions. As a result, in2IN achieves state-of-the-art performance on the InterHuman dataset.
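To make the conditioning scheme concrete, the sketch below shows one plausible way a diffusion denoiser could consume both condition levels: a shared interaction-level text embedding plus one individual-level embedding per person. All module names, dimensions, and the fusion-by-addition design are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class DualConditionDenoiser(nn.Module):
    """Toy two-person denoiser conditioned on an interaction-level text
    embedding and one individual-level text embedding per person.
    Hypothetical sketch; not the in2IN architecture."""

    def __init__(self, motion_dim=262, text_dim=512, hidden=512):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, hidden)
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        self.inter_proj = nn.Linear(text_dim, hidden)  # shared interaction description
        self.indiv_proj = nn.Linear(text_dim, hidden)  # per-person descriptions
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, x_t, t, c_inter, c_indiv):
        # x_t: (B, 2, T, D) noisy motions for both people; t: (B,) timesteps
        # c_inter: (B, E) interaction text embedding; c_indiv: (B, 2, E)
        B, P, T, D = x_t.shape
        h = self.motion_proj(x_t)                              # (B, 2, T, H)
        cond = (self.time_embed(t.float()[:, None])[:, None]   # (B, 1, H)
                + self.inter_proj(c_inter)[:, None]            # (B, 1, H)
                + self.indiv_proj(c_indiv))                    # (B, 2, H)
        h = h + cond[:, :, None]                               # broadcast over frames
        h = self.backbone(h.reshape(B, P * T, -1))             # joint attention over both people
        return self.out(h).reshape(B, P, T, D)                 # predicted clean motion (or noise)

model = DualConditionDenoiser()
x_t = torch.randn(4, 2, 60, 262)   # 4 samples, 2 people, 60 frames (illustrative sizes)
t = torch.randint(0, 1000, (4,))
c_inter = torch.randn(4, 512)      # e.g., a text-encoder embedding of the interaction text
c_indiv = torch.randn(4, 2, 512)   # one embedding per person
pred = model(x_t, t, c_inter, c_indiv)  # (4, 2, 60, 262)
```

In practice, a multi-condition model like this would typically be trained with classifier-free guidance, randomly dropping the interaction-level and individual-level conditions independently so their guidance strengths can be weighted separately at sampling time; whether in2IN follows exactly this scheme is not detailed here.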

In addition to in2IN, a model composition technique called DualMDM has been proposed to increase intra-personal diversity beyond what existing interaction datasets provide. DualMDM combines the motions generated by in2IN with the motions generated by a single-person motion prior pre-trained on HumanML3D. This technique produces motions with higher individual diversity and improves control over intra-personal dynamics while maintaining interpersonal coherence.
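The composition can be pictured as blending the two models' predictions at each denoising step. The sketch below illustrates this with a simple linear blend; the function names, signatures, and scalar weight w are assumptions for illustration and may differ from the exact DualMDM formulation.

```python
import torch

@torch.no_grad()
def dual_denoise_step(x_t, t, interaction_model, individual_prior, c_inter, c_indiv, w):
    """Blend an interaction model's prediction with a single-person prior's.
    w in [0, 1] trades interpersonal coherence (w -> 1) against individual
    diversity (w -> 0). Hypothetical sketch of the model-composition idea."""
    pred_inter = interaction_model(x_t, t, c_inter, c_indiv)  # (B, 2, T, D)
    # The single-person prior denoises each person's motion independently,
    # conditioned only on that person's individual description.
    pred_indiv = torch.stack(
        [individual_prior(x_t[:, p], t, c_indiv[:, p]) for p in range(2)],
        dim=1,
    )
    return w * pred_inter + (1.0 - w) * pred_indiv
```

One natural extension is to schedule w across denoising steps, letting the interaction model dominate early steps (global interpersonal structure) and the individual prior dominate later steps (local per-person detail); this is one way the stated control over intra-personal dynamics could be realized, though the exact schedule used by DualMDM is not specified here.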

In conclusion, the in2IN diffusion model and the DualMDM model composition technique represent significant advances in human-human motion generation. By conditioning on both overall and individual textual descriptions, in2IN achieves state-of-the-art performance on the InterHuman dataset. Meanwhile, DualMDM increases the intra-personal diversity of generated interactions, yielding motions with higher individual diversity and improved control over intra-personal dynamics.