Emotion-Aware Speech-Driven 3D Facial Animation with Diffusion Models

Authors

  • Zhen Wang, Qingdao University of Computer Science and Technology, Qingdao 266000, China

DOI:

https://doi.org/10.63313/JCSFT.9059

Keywords:

Speech-driven, diffusion models, 3D facial animation

Abstract

Speech-driven 3D facial animation aims to generate realistic lip-sync and expressive facial motion from input audio. While existing methods achieve plausible lip movements, they often fail to capture subtle emotional variations, producing averaged expressions and unnatural motion. In this paper, we propose EmoFaceDiffusion, a novel framework based on denoising diffusion probabilistic models (DDPMs) that explicitly incorporates emotion awareness. Our key innovations are: (1) a multi-modal emotion encoder that extracts continuous emotion features (valence, arousal, dominance) directly from raw speech, enabling fine-grained emotional expression synthesis; (2) a conditional diffusion model with cross-modal attention that fuses audio and emotion embeddings; and (3) a temporal consistency module based on graph convolutions that ensures smooth, coherent motion sequences. Extensive experiments on the BIWI and IEMOCAP datasets demonstrate that EmoFaceDiffusion achieves state-of-the-art performance in lip-sync accuracy (LMD: 2.31 vs. 2.87), emotion expressiveness (classification accuracy: 78.5% vs. 65.2%), and user preference (MOS: 4.21/5). Ablation studies validate the contribution of each component. Our work takes a significant step toward expressive, emotionally aware digital avatars.
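To make the pipeline described in the abstract concrete, the sketch below shows one plausible way to implement the core conditional denoising step: a network whose motion queries attend, via cross-modal attention, to frame-level audio features (e.g., wav2vec 2.0 embeddings) and a continuous valence-arousal-dominance vector, trained with the standard DDPM noise-prediction objective. This is an illustrative PyTorch sketch, not the authors' implementation; the class CrossModalDenoiser, all dimensions, and the toy noise schedule are hypothetical placeholders.

import torch
import torch.nn as nn

class CrossModalDenoiser(nn.Module):
    def __init__(self, motion_dim=70, audio_dim=768, emo_dim=3, d_model=256, n_heads=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.emo_in = nn.Linear(emo_dim, d_model)       # valence / arousal / dominance
        self.time_in = nn.Linear(1, d_model)            # crude diffusion-step embedding
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, t, audio_feat, emo_feat):
        # noisy_motion: (B, T, motion_dim), audio_feat: (B, T, audio_dim), emo_feat: (B, emo_dim)
        q = self.motion_in(noisy_motion) + self.time_in(t.float().view(-1, 1, 1) / 1000.0)
        kv = self.audio_in(audio_feat) + self.emo_in(emo_feat).unsqueeze(1)
        fused, _ = self.cross_attn(q, kv, kv)           # motion queries attend to audio + emotion
        return self.out(fused)                          # predicted noise epsilon_theta

# One DDPM training step: noise clean motion at a random step t, then regress the injected noise.
model = CrossModalDenoiser()
x0 = torch.randn(2, 100, 70)                            # clean motion sequence (placeholder dims)
audio = torch.randn(2, 100, 768)                        # e.g., wav2vec 2.0 frame features
emo = torch.rand(2, 3)                                  # continuous valence / arousal / dominance
t = torch.randint(0, 1000, (2,))
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2   # toy cosine schedule (assumption)
noise = torch.randn_like(x0)
x_t = alpha_bar.sqrt().view(-1, 1, 1) * x0 + (1 - alpha_bar).sqrt().view(-1, 1, 1) * noise
loss = nn.functional.mse_loss(model(x_t, t, audio, emo), noise)
loss.backward()

At inference time the same network would be applied iteratively to denoise a Gaussian-initialized motion sequence, optionally with classifier-free guidance on the emotion condition; the graph-convolutional temporal consistency module mentioned in the abstract is omitted here for brevity.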



Published

2026-04-02

Issue

Vol. 3 No. 1 (2026)

Section

Articles

How to Cite

Wang, Z. (2026). Emotion-Aware Speech-Driven 3D Facial Animation with Diffusion Models. Journal of Computer Science and Frontier Technologies, 3(1), 44–54. https://doi.org/10.63313/JCSFT.9059