Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Xu He¹,   Qiaochu Huang¹,   Zhensong Zhang²,   Zhiwei Lin¹,   Zhiyong Wu¹,⁴,
Sicheng Yang¹,   Minglei Li³,   Zhiyi Chen³,   Songcen Xu²,   Xiaofei Wu²
¹Shenzhen International Graduate School, Tsinghua University    ²Huawei Noah's Ark Lab    ³Huawei Cloud Computing Technologies Co., Ltd    ⁴The Chinese University of Hong Kong
[Teaser figure]

Abstract

Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) a suitable motion feature is needed to describe complex human movements with crucial appearance information; 2) gestures and speech exhibit inherent dependencies and should be temporally aligned, even for sequences of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features that preserve essential appearance information. A transformer-based diffusion model is then proposed to learn the temporal correlation between gestures and speech and to perform generation in the latent motion space, followed by an optimal motion selection module that produces long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details in certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion- and video-related evaluations.
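As a concrete illustration of the motion-decoupling idea, the sketch below shows how a thin-plate spline (TPS) warp can be fitted to a small set of control keypoints and applied as a dense warping field in PyTorch. This is a minimal sketch of standard TPS warping, not the paper's released code; the coordinate conventions, tensor shapes, and function names are our own assumptions.

# Minimal TPS warping sketch (assumption-based, not the authors' implementation).
import torch
import torch.nn.functional as F


def tps_radial(r2: torch.Tensor) -> torch.Tensor:
    """TPS radial basis U(r) = r^2 * log(r^2), with U(0) defined as 0."""
    return r2 * torch.log(r2.clamp(min=1e-9))


def solve_tps(ctrl_dst: torch.Tensor, ctrl_src: torch.Tensor):
    """Fit a TPS mapping driving-frame control points to source-frame points.

    ctrl_dst, ctrl_src: (N, 2) keypoints in normalized [-1, 1] coordinates.
    Returns the affine part A (3, 2) and RBF weights W (N, 2).
    """
    n = ctrl_dst.shape[0]
    K = tps_radial(torch.cdist(ctrl_dst, ctrl_dst).pow(2))      # (N, N)
    P = torch.cat([torch.ones(n, 1), ctrl_dst], dim=1)          # (N, 3)
    # Standard TPS linear system [[K, P], [P^T, 0]] x = [ctrl_src; 0].
    top = torch.cat([K, P], dim=1)                              # (N, N+3)
    bottom = torch.cat([P.t(), torch.zeros(3, 3)], dim=1)       # (3, N+3)
    L = torch.cat([top, bottom], dim=0)                         # (N+3, N+3)
    rhs = torch.cat([ctrl_src, torch.zeros(3, 2)], dim=0)       # (N+3, 2)
    sol = torch.linalg.solve(L, rhs)
    return sol[n:], sol[:n]                                     # A (3,2), W (N,2)


def warp_with_tps(source: torch.Tensor, ctrl_dst, ctrl_src) -> torch.Tensor:
    """Warp a (1, C, H, W) source frame so its keypoints follow the driving pose."""
    _, _, h, w = source.shape
    A, W = solve_tps(ctrl_dst, ctrl_src)
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)         # (H*W, 2)
    rbf = tps_radial(torch.cdist(grid, ctrl_dst).pow(2))        # (H*W, N)
    mapped = torch.cat([torch.ones(grid.shape[0], 1), grid], 1) @ A + rbf @ W
    # Backward warping: sample the source at the mapped source-frame coordinates.
    return F.grid_sample(source, mapped.reshape(1, h, w, 2), align_corners=True)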

Video

Pipeline

[Pipeline overview figure]

The gesture video generation pipeline of our proposed framework is composed of three core components: 1) the motion decoupling module (green) extracts latent motion features from videos with TPS transformations and synthesizes image frames; 2) the latent motion diffusion model (pink) generates motion features conditioned on the speech; 3) the refinement network (blue) restores missing details and produces the final fine-grained video.
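To make the figure concrete, the following sketch outlines how the three stages could be orchestrated at inference time. The component interfaces (motion_encoder, diffusion.sample, frame_generator, refiner) are hypothetical placeholders standing in for the learned networks described above, not the released API.

# Assumption-based orchestration sketch of the three-stage pipeline.
import torch


@torch.no_grad()
def generate_gesture_video(audio_feat, source_frame,
                           motion_encoder, diffusion, frame_generator, refiner):
    """End-to-end inference: speech features + one source frame -> video frames.

    audio_feat:    (T, D_a) speech features.
    source_frame:  (1, 3, H, W) reference image of the speaker.
    Returns a list of refined (1, 3, H, W) frames.
    """
    # 1) Motion decoupling: describe the source frame's pose as latent motion
    #    features (e.g. TPS control keypoints), separated from appearance.
    source_motion = motion_encoder(source_frame)                  # (1, D_m)

    # 2) Latent motion diffusion: sample a temporally aligned motion sequence
    #    conditioned on the speech features and the source motion.
    motion_seq = diffusion.sample(cond=audio_feat,
                                  init_motion=source_motion)      # (T, D_m)

    # 3) Render each frame by warping the source appearance with the sampled
    #    motion, then refine missing details for the final fine-grained video.
    frames = []
    for t in range(motion_seq.shape[0]):
        coarse = frame_generator(source_frame, motion_seq[t:t + 1])
        frames.append(refiner(coarse))
    return frames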

Long-Term Generation
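The abstract mentions an optimal motion selection module for producing long, coherent videos. One plausible reading, sketched below under our own assumptions, is to sample each audio window several times and keep the candidate motion chunk whose leading frames best match the tail of the previously selected chunk. This is illustrative only, not the paper's verified algorithm, and it reuses the hypothetical diffusion.sample interface from the pipeline sketch above; audio_chunks are assumed to overlap their predecessor by `overlap` frames.

# Assumption-based sketch of window-by-window long-term generation.
import torch


def generate_long_motion(audio_chunks, diffusion, source_motion,
                         n_candidates: int = 4, overlap: int = 8):
    """Stitch per-window motion samples into one long, coherent sequence."""
    full_motion = []
    prev_tail = None
    for cond in audio_chunks:                      # each cond: (T_w, D_a)
        candidates = [diffusion.sample(cond=cond, init_motion=source_motion)
                      for _ in range(n_candidates)]
        if prev_tail is None:
            best = candidates[0]
            full_motion.append(best)
        else:
            # Keep the candidate whose leading frames deviate least from the
            # trailing frames of the previously selected chunk.
            costs = [torch.norm(c[:overlap] - prev_tail) for c in candidates]
            best = candidates[int(torch.argmin(torch.stack(costs)))]
            full_motion.append(best[overlap:])
        prev_tail = best[-overlap:]
    return torch.cat(full_motion, dim=0)           # (T_total, D_m)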

Acknowledgements

We are grateful to the authors of the following repositories for their open research and exploration, which helped us significantly in this work:

Thin-Plate Spline Motion Model for Image Animation,

EDGE: Editable Dance Generation From Music,

Image Inpainting with Local and Global Refinement.

BibTeX

@inproceedings{he2024co,
  title={Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model},
  author={He, Xu and Huang, Qiaochu and Zhang, Zhensong and Lin, Zhiwei and Wu, Zhiyong and Yang, Sicheng and Li, Minglei and Chen, Zhiyi and Xu, Songcen and Wu, Xiaofei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2263--2273},
  year={2024}
}