Monocular and Generalizable Gaussian Talking Head Animation

CVPR 2025

Shengjie Gong1, Haojie Li1, Jiapeng Tang2, Dongming Hu1, Shuangping Huang1,3*, Hao Chen1,
Tianshui Chen4, Zhuoman Liu5
1South China University of Technology, 2Technical University of Munich, 3Pazhou Laboratory
4Guangdong University of Technology, 5The Hong Kong Polytechnic University

Abstract

In this work, we introduce Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which requires only monocular data and generalizes to unseen identities without personalized re-training. Compared with previous 3D Gaussian Splatting (3DGS) methods that require elusive multi-view datasets or tedious personalized learning/inference, MGGTalk enables more practical and broader applications. However, in the absence of multi-view and personalized training data, the incompleteness of geometric and appearance information poses a significant challenge. To address this challenge, MGGTalk explores depth information to enhance geometric features and exploits facial symmetry to supplement both geometric and appearance features. Initially, based on the pixel-wise geometric information obtained from depth estimation, we incorporate symmetry operations and point cloud filtering to ensure complete and precise position parameters for 3DGS. Subsequently, we adopt a two-stage strategy with symmetric priors to predict the remaining 3DGS parameters: we first predict Gaussian parameters for the visible facial regions of the source image, and then use these parameters to improve the prediction of Gaussian parameters for the non-visible regions. Extensive experiments demonstrate that MGGTalk surpasses previous state-of-the-art methods, achieving superior performance across various metrics.
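As a rough illustration of the depth-and-symmetry idea, the sketch below unprojects a depth map, mirrors the resulting point cloud across the facial symmetry plane, and filters redundant mirrored points. This is a minimal NumPy sketch of our own, not the paper's implementation; the pinhole intrinsics K, the symmetry-plane parameters, and the filter radius are assumptions.

import numpy as np
from scipy.spatial import cKDTree

def unproject_depth(depth, K):
    """Lift an (H, W) depth map to a camera-space point cloud of shape (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def mirror_across_plane(points, n, p0):
    """Reflect points across the facial symmetry plane (normal n through p0)
    to hallucinate geometry for the non-visible half of the face."""
    n = n / np.linalg.norm(n)
    d = (points - p0) @ n
    return points - 2.0 * d[:, None] * n

def filter_duplicates(visible, mirrored, radius=2e-3):
    """Keep only mirrored points that add new geometry: discard any mirrored
    point that already has a visible neighbor within `radius` (hypothetical
    stand-in for the paper's point cloud filtering)."""
    dist, _ = cKDTree(visible).query(mirrored, k=1)
    return mirrored[dist > radius]

Under these assumptions, the union of the visible cloud and the filtered mirrored cloud would then serve as the complete position parameters for the Gaussians.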



Overall Pipeline

Pipeline overview of MGGTalk.

Given a source image Is, we first use semantic parsing to extract the head region Ish and the torso-background Isbg. The DGSR module generates point clouds [Pf ; Pfs] for the visible and non-visible regions from Ish. Expression features extracted from the driving image or audio are used by the Deformation Network to deform the point cloud, yielding [Pd ; Pds]. The SGP module then takes the identity encoding F from Ish and the deformed point cloud [Pd ; Pds] to predict the complete Gaussian parameters Gden. Finally, Gden is rendered and composited with the torso-background Isbg to obtain the target image It.
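Read as pseudocode, the pipeline reduces to the sketch below. This is our illustration only; the callable interfaces, argument order, and the alpha-composite at the end are assumptions beyond what the figure specifies.

def mggtalk_forward(I_s, driving, parse, dgsr, expr_enc, deform_net, id_enc, sgp, render):
    """One forward pass in the figure's notation; every module is supplied
    as a callable, and all interfaces here are hypothetical."""
    I_sh, I_sbg = parse(I_s)                 # head region / torso-background
    P_f, P_fs = dgsr(I_sh)                   # visible + symmetrized point clouds
    e = expr_enc(driving)                    # expression features (image or audio)
    P_d, P_ds = deform_net((P_f, P_fs), e)   # deformed point clouds
    F = id_enc(I_sh)                         # identity encoding
    G_den = sgp(F, (P_d, P_ds))              # complete Gaussian parameters
    head_rgb, head_alpha = render(G_den)     # 3DGS rendering of the head
    a = head_alpha[..., None]
    return a * head_rgb + (1.0 - a) * I_sbg  # composite -> target image I_t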


Demo: Monocular and Generalizable Gaussian Talking Head Animation

To showcase the overall performance of MGGTalk, we provide the following demos.

Explore Expressions


Example A


Example B


Example C


Example D


Novel View Rendering (Head Region)


Example A


Example B


Example C


Example D


BibTeX

@article{gong2025monocular,
  title={Monocular and Generalizable Gaussian Talking Head Animation},
  author={Gong, Shengjie and Li, Haojie and Tang, Jiapeng and Hu, Dongming and Huang, Shuangping and Chen, Hao and Chen, Tianshui and Liu, Zhuoman},
  journal={arXiv preprint arXiv:2504.00665},
  year={2025}
}