Extension of Wav2Lip repository for processing high-quality videos.

1105135335/wav2lip-hq

 
 

Wav2Lip-HQ: high quality lip-sync

This is an unofficial extension of the Wav2Lip: Accurately Lip-syncing Videos In The Wild repository. We use image super-resolution and face segmentation to improve the visual quality of lip-synced videos.

Acknowledgements

Our work is based to a great extent on the code from the following repositories:

  1. The Wav2Lip repository, which provides the core model of our algorithm and performs the lip-sync.
  2. The face-parsing.PyTorch repository, which provides the model for face segmentation.
  3. The BasicSR repository, which we use for super-resolution.
  4. The face_alignment repository, which Wav2Lip depends on heavily for face detection.

The algorithm

Our algorithm consists of the following steps:

  1. Pretrain ESRGAN on a video containing speech of the target person.
  2. Apply the Wav2Lip model to the source video and target audio, as in the official Wav2Lip repository.
  3. Upsample the output of Wav2Lip with ESRGAN.
  4. Use BiSeNet to change only the relevant pixels in the video.
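
Step 4 amounts to compositing: pixels inside the segmented face region are taken from the upsampled frame, and all other pixels are kept from the original frame. The following is a minimal sketch of that idea using NumPy only. The function name `blend_with_mask` and the soft-mask convention (values in [0, 1]) are assumptions made for illustration; they are not taken from this repository's code.

```python
import numpy as np


def blend_with_mask(original, enhanced, mask):
    """Paste `enhanced` pixels into `original` wherever `mask` is set.

    original, enhanced: uint8 arrays of shape (H, W, 3).
    mask: array of shape (H, W) with values in [0, 1];
          1 means "take the pixel from `enhanced`".
    (Hypothetical helper, not part of the repository.)
    """
    if original.shape != enhanced.shape:
        raise ValueError("frames must have the same shape")
    if mask.shape != original.shape[:2]:
        raise ValueError("mask must match the frame height and width")
    # Add a channel axis so the mask broadcasts over the colour channels.
    alpha = mask.astype(np.float32)[..., None]
    blended = (alpha * enhanced.astype(np.float32)
               + (1.0 - alpha) * original.astype(np.float32))
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)
```

With a binary mask this replaces exactly the masked pixels; a mask with soft edges would fade between the two frames at the boundary.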

You can learn more about the method in this article (in Russian).

Results

Our approach is definitely not flawless, and some of the frames it produces contain artifacts or odd mistakes. However, it can be used to lip-sync high-quality videos with plausible output.

comparison

Running the model

The simplest way is to use our Google Colab demo. However, if you want to test the algorithm on your own machine, run the following commands. Note that you need Python 3 and CUDA installed.

  1. Clone this repository and install requirements:

    git clone https://github.com/Markfryazino/wav2lip-hq.git
    cd wav2lip-hq
    pip3 install -r requirements.txt
    
  2. Download all the .pth files from here and place them in the checkpoints folder.

    In addition, download the face detection model checkpoint:

    wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth"
    
  3. Run the inference script:

    python inference.py \
        --checkpoint_path "checkpoints/wav2lip_gan.pth" \
        --segmentation_path "checkpoints/face_segmentation.pth" \
        --sr_path "checkpoints/esrgan_yunying.pth" \
        --face <path to source video> \
        --audio <path to source audio> \
        --outfile <desired path to output>
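
If you prefer to launch inference from Python, the command above can be assembled as an argument list and checked before it runs. The sketch below uses only the flags and checkpoint file names shown in the command; the helper names `build_inference_command`, `missing_checkpoints`, and `run_inference` are hypothetical and do not exist in the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(face, audio, outfile, checkpoints_dir="checkpoints"):
    """Build the argument list for inference.py (hypothetical helper)."""
    ckpt = Path(checkpoints_dir)
    return [
        sys.executable, "inference.py",
        "--checkpoint_path", str(ckpt / "wav2lip_gan.pth"),
        "--segmentation_path", str(ckpt / "face_segmentation.pth"),
        "--sr_path", str(ckpt / "esrgan_yunying.pth"),
        "--face", str(face),
        "--audio", str(audio),
        "--outfile", str(outfile),
    ]


def missing_checkpoints(command):
    """Return the .pth paths in `command` that are not present on disk."""
    return [arg for arg in command
            if arg.endswith(".pth") and not Path(arg).is_file()]


def run_inference(face, audio, outfile):
    """Check the checkpoints first, then run inference.py from the repo root."""
    command = build_inference_command(face, audio, outfile)
    missing = missing_checkpoints(command)
    if missing:
        raise FileNotFoundError("Missing checkpoint files: " + ", ".join(missing))
    subprocess.run(command, check=True)
```

Calling `run_inference("input.mp4", "speech.wav", "result.mp4")` from the repository root would then fail early with a readable message if step 2 was skipped; the three file names here are placeholders.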
    

Finetuning the super-resolution model

Although we provide a checkpoint of a pre-trained ESRGAN, its training dataset was quite modest, so the results may be unsatisfactory. It can therefore be useful to finetune the model on your target video. One or two minutes of speech is usually enough.

To simplify finetuning, we provide a Colab notebook. You can also run the commands listed there on your own machine: download the models, run inference while saving all the frames on the fly, resize the frames, and train ESRGAN.
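
Super-resolution training typically needs pairs of low- and high-resolution images. One plausible reading of the resize step is that each saved frame is downscaled to obtain its low-resolution counterpart. The sketch below illustrates this with a simple box filter in NumPy. The function name `make_training_pair`, the scale factor of 4, and the choice of filter are all assumptions for illustration; the notebook may resize differently.

```python
import numpy as np


def make_training_pair(frame, factor=4):
    """Return (low_res, high_res) arrays for one frame.

    frame: uint8 array of shape (H, W, C).
    factor: integer downscale factor (4 is an assumed value).
    The frame is cropped so that H and W are multiples of `factor`,
    then every factor x factor block is averaged into one pixel.
    (Hypothetical helper, not part of the repository.)
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    h, w, c = frame.shape
    h_crop, w_crop = h - h % factor, w - w % factor
    if h_crop == 0 or w_crop == 0:
        raise ValueError("frame is smaller than the downscale factor")
    high_res = frame[:h_crop, :w_crop]
    # Split each spatial axis into (blocks, pixels per block) and average per block.
    blocks = high_res.astype(np.float32).reshape(
        h_crop // factor, factor, w_crop // factor, factor, c)
    low_res = np.rint(blocks.mean(axis=(1, 3))).astype(np.uint8)
    return low_res, high_res
```

Cropping before averaging keeps the high-resolution frame exactly `factor` times larger than the low-resolution one in both dimensions.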

Bear in mind that the procedure is quite time- and memory-consuming.
