Shallow Diffusion Motion Model for Talking Face Generation from Speech

The framework of the talking face motion model with the shallow diffusion model

Abstract

Talking face generation synthesizes a lip-synchronized talking face video from an arbitrary face image and audio clips. People naturally produce spontaneous head motions to enhance their speech while giving talks. Generating head motion from speech is inherently difficult due to the nondeterministic mapping between speech and head motion. Most existing works map speech to motion deterministically by conditioning on certain styles, leading to suboptimal results. In this paper, we decompose speech-driven motion into two complementary parts: pose modes and rhythmic dynamics. Accordingly, we introduce a shallow diffusion motion model (SDM) equipped with a two-stream architecture, i.e., a pose mode branch for primary posture generation and a rhythmic motion branch for rhythmic dynamics synthesis. On the one hand, diverse pose modes are generated by conditional sampling in a latent space, guided by speech semantics. On the other hand, rhythmic dynamics are synchronized with the speech prosody. Extensive experiments demonstrate superior performance over several baselines in terms of fidelity, similarity, and synchronization with speech.
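To make the two-stream idea concrete, below is a minimal PyTorch sketch of how such a model could be wired together. It is an illustration under stated assumptions, not the paper's implementation: the class name `ShallowDiffusionMotionModel`, all layer sizes, the Gaussian latent for the pose-mode branch, and the simplified schedule-free denoising update in the rhythmic branch are hypothetical. It only shows the structure the abstract describes: conditional sampling of a pose mode from speech semantics, plus a prosody-conditioned "shallow" denoising loop that starts from a noised version of the pose prior rather than from pure noise.

```python
import torch
import torch.nn as nn


class ShallowDiffusionMotionModel(nn.Module):
    """Hypothetical two-stream sketch of an SDM-style head motion model:
    a pose-mode branch samples a primary posture from a speech-semantics-
    conditioned latent, and a rhythmic branch denoises rhythmic dynamics
    starting from a shallow diffusion step near the pose prior."""

    def __init__(self, audio_dim=128, latent_dim=64, motion_dim=6, shallow_step=50):
        super().__init__()
        self.shallow_step = shallow_step  # denoise from step k, not from step T
        # Pose-mode branch: speech semantics -> Gaussian latent -> posture.
        self.sem_to_latent = nn.Linear(audio_dim, 2 * latent_dim)
        self.pose_decoder = nn.Linear(latent_dim, motion_dim)
        # Rhythmic branch: prosody- and step-conditioned denoiser.
        self.denoiser = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, semantics, prosody):
        # semantics, prosody: (batch, frames, audio_dim) audio features.
        mu, logvar = self.sem_to_latent(semantics).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # conditional sampling
        pose = self.pose_decoder(z)  # primary posture sequence

        # Shallow diffusion: instead of denoising from pure noise, perturb the
        # pose prior to a shallow step k and denoise from there, so the result
        # stays close to the sampled pose mode while gaining rhythmic detail.
        x = pose + 0.5 * torch.randn_like(pose)
        for t in reversed(range(1, self.shallow_step + 1)):
            t_emb = torch.full_like(x[..., :1], t / self.shallow_step)
            eps = self.denoiser(torch.cat([x, prosody, t_emb], dim=-1))
            x = x - eps / self.shallow_step  # simplified Euler-style update
        return x  # head motion = posture refined with rhythmic dynamics


if __name__ == "__main__":
    model = ShallowDiffusionMotionModel()
    sem = torch.randn(2, 50, 128)   # toy speech-semantics features
    pro = torch.randn(2, 50, 128)   # toy prosody features
    motion = model(sem, pro)        # (2, 50, 6) head pose sequence
    print(motion.shape)
```

Starting the denoising loop at a shallow step near the pose prior, rather than at pure noise, is what keeps the two branches complementary in this sketch: the pose-mode branch fixes the primary posture, and the diffusion loop only has to fill in prosody-driven dynamics around it.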

Type
Conference paper
Publication
In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data