HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation

CVPR 2025

¹Northwest A&F University    ²Fudan University    ³Monash University

* equal contribution   corresponding authors  

Teaser Image

Abstract

Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication, and they have attracted increasing attention in multimodal research. While existing methods have made strides in gesture accuracy, challenges remain in generating diverse and coherent gestures, as most approaches assume independence among multimodal inputs and lack explicit modeling of their interactions. In this work, we propose a novel multimodal learning method named HOP for co-speech gesture generation that captures the heterogeneous entanglement between gesture motion, audio rhythm, and text semantics, enabling the generation of coordinated gestures. By leveraging spatiotemporal graph modeling, we align audio and motion. Moreover, to enhance modality coherence, we build an audio-text semantic representation based on a reprogramming module, which benefits cross-modality adaptation. Our approach enables the three modalities to learn from one another and represents their interactions in the form of topological entanglement. Extensive experiments demonstrate that HOP achieves state-of-the-art performance, offering more natural and expressive co-speech gesture generation.

Framework

Framework Image

Overview of the proposed framework for multimodal gesture generation with heterogeneous topology entanglement. Given the input speech text and the Mel-spectrogram obtained through audio preprocessing, we treat the audio sequence as a bridge that links the text and action sequences, which have distinct topologies. To connect text and audio, we apply a reprogramming layer that aligns data from these different modalities, using a language model to extract embedded semantic information. To link action and audio, we employ the Graph-WaveNet approach to extract action and audio features separately. The entangled multimodal representations are then fed into the gesture generator through topological fusion, resulting in the generation of co-speech gestures.
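To make the text-audio alignment step concrete, below is a minimal PyTorch sketch of a reprogramming-style cross-attention layer in which audio frames act as queries over token embeddings produced by a frozen language model. This is an illustrative sketch, not the authors' released code: the class name, dimensions, and single-head attention are assumptions made for clarity.

# Minimal sketch (not the authors' code) of a reprogramming-style
# cross-attention layer that projects audio features into the embedding
# space of a frozen language model. All names and sizes are illustrative.
import torch
import torch.nn as nn

class ReprogrammingLayer(nn.Module):
    def __init__(self, audio_dim: int, text_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Audio frames act as queries; text-token embeddings from the
        # frozen language model act as keys and values.
        self.q_proj = nn.Linear(audio_dim, hidden_dim)
        self.k_proj = nn.Linear(text_dim, hidden_dim)
        self.v_proj = nn.Linear(text_dim, hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, text_dim)

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T_audio, audio_dim), e.g. Mel-spectrogram frames
        # text_embeds: (batch, T_text, text_dim), e.g. frozen LM token embeddings
        q = self.q_proj(audio_feats)                     # (B, T_a, H)
        k = self.k_proj(text_embeds)                     # (B, T_t, H)
        v = self.v_proj(text_embeds)                     # (B, T_t, H)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        aligned = attn @ v                               # (B, T_a, H)
        # Map back into the language-model embedding space so the aligned
        # audio-text representation can later be fused with action features.
        return self.out_proj(aligned)

# Toy usage (80 Mel bins, 768-dim LM embeddings are assumed values).
if __name__ == "__main__":
    layer = ReprogrammingLayer(audio_dim=80, text_dim=768)
    audio = torch.randn(2, 120, 80)    # 2 clips, 120 audio frames
    text = torch.randn(2, 32, 768)     # 2 sentences, 32 token embeddings
    print(layer(audio, text).shape)    # torch.Size([2, 120, 768])

The key design point this sketch illustrates is that the audio stream is re-expressed in the language model's embedding space, so the audio sequence can serve as the bridge between the text and action topologies described above.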

Visualization

Visualization Image

Visualization of generated gestures. The gestures generated by our method capture the semantic information in the text more effectively, exhibiting a greater range of movement and rhythm in the highlighted sections. We highlight the text and its corresponding gesture actions with red and yellow shading, respectively.

Co-Speech Gesture Generation Demo

TED Expressive Demo


TED Demo

BibTeX

@article{cheng2025hop,
  title     = {HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation},
  author    = {Cheng, Hongye and Wang, Tianyu and Shi, Guangsi and Zhao, Zexing and Fu, Yanwei},
  journal   = {arXiv preprint arXiv:2503.01175},
  year      = {2025}
}