Synthesizing realistic videos from a given speech signal remains an open challenge. Previous works have been plagued by issues such as inaccurate lip shapes and poor image quality. The key reason is that the input speech mainly drives motion and appearance only in limited facial areas (e.g., the lip region). Directly learning a mapping from speech to the entire head image is therefore prone to ambiguity, particularly when training on a short video. We thus propose a decomposition-synthesis-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance to facilitate effective learning from limited training data, resulting in the generation of natural-looking videos. First, given a fixed head pose (i.e., a canonical space), we present a speech-driven implicit model for lip image generation that concentrates on learning speech-sensitive motion and appearance. Next, to model the major speech-insensitive motion (i.e., head movement), we introduce a geometry-aware mutual explicit mapping (GAMEM) module that establishes geometric mappings between different head poses. This allows us to paste lip images generated in the canonical space onto head images with arbitrary poses and synthesize talking videos with natural head movements. In addition, a Blend-Net and a contrastive sync loss are introduced to enhance the overall synthesis performance. Quantitative and qualitative results on three benchmarks demonstrate that our model can be trained on a video just a few minutes in length and achieves state-of-the-art performance in both visual quality and speech-visual synchronization.
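The contrastive sync loss mentioned above can be understood as an InfoNCE-style objective that pulls matched audio and lip-image embeddings together while pushing mismatched pairs within a batch apart. A minimal NumPy sketch under that assumption (the function name, embedding shapes, and temperature value are illustrative, not the paper's actual implementation):

```python
import numpy as np

def contrastive_sync_loss(audio_emb, lip_emb, temperature=0.1):
    """InfoNCE-style sync loss: audio/lip pairs at the same batch index
    are positives; every other pairing in the batch is a negative."""
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = lip_emb / np.linalg.norm(lip_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matched pairs) as targets.
    return -np.mean(np.diag(log_probs))
```

With perfectly matched embeddings the loss approaches zero, while swapping the pairing within a batch drives it up, which is what encourages speech-visual synchronization during training.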
Figure 1. Given a speech as input, our model generates high-quality talking-head videos and supports pose-controllable synthesis. The decomposition and synthesis modules make learning from a short video more effective, and the composition module enables us to synthesize high-fidelity videos.