Yang Zhou, Zichong Chen, Hui Huang
Shenzhen University
[project page] [paper] [supplementary]
This paper addresses the complex issue of one-shot face stylization, focusing on the simultaneous consideration of appearance and structure, where previous methods have fallen short. We explore deformation-aware face stylization that diverges from traditional single-image style reference, opting for a real-style image pair instead. The cornerstone of our method is the utilization of a self-supervised vision transformer, specifically DINO-ViT, to establish a robust and consistent facial structure representation across both real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformers (STN). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space, and ii) a relative structural consistency constraint based on DINO token self-similarities, ensuring diverse generation. Additionally, style-mixing is employed to align the color generation with the reference, minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization, achieving notable efficiency with a fine-tuning duration of approximately 10 minutes. Extensive qualitative and quantitative comparisons demonstrate the superiority of our approach over existing state-of-the-art one-shot face stylization methods.
Given a single real-style paired reference, we fine-tune a deformation-aware generator that captures both the appearance and the structural deformation of the style exemplar.
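For intuition, the two DINO-guided constraints described above can be read roughly as follows. This is a conceptual sketch only, not the repository's implementation: the `feat` tensors are assumed to be DINO-ViT token features of shape (num_tokens, dim), and the exact losses, weightings, and token selection in the paper may differ.

```python
import torch
import torch.nn.functional as F

def self_similarity(feat):
    # cosine similarity between all pairs of DINO tokens: (num_tokens, num_tokens)
    f = F.normalize(feat, dim=-1)
    return f @ f.t()

def structural_consistency_loss(feat_real, feat_stylized):
    # preserve the real face's relative token-to-token structure in the stylized output
    return F.mse_loss(self_similarity(feat_stylized), self_similarity(feat_real))

def directional_deformation_loss(feat_src_ref, feat_tgt_ref, feat_real, feat_stylized):
    # align the real->stylized direction with the reference pair's direction in DINO space
    ref_dir = F.normalize((feat_tgt_ref - feat_src_ref).flatten(), dim=0)
    cur_dir = F.normalize((feat_stylized - feat_real).flatten(), dim=0)
    return 1.0 - torch.dot(ref_dir, cur_dir)
```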
We have tested on:
- Both Linux and Windows
- NVIDIA GPU + CUDA 11.6
- Python 3.9
- PyTorch 1.13.0
- torchvision 0.14.0
Install all the required libraries through:
pip install -r requirements.txt
Please download the pre-trained models from Google Drive.
Model | Description |
---|---|
StyleGANv2 | StyleGANv2 model pretrained on FFHQ with 1024x1024 output resolution. |
e4e_ffhq_encode | FFHQ e4e encoder. |
alexnet | Pretrained alexnet, alex.pth and alexnet-owt-7be5be79.pth. |
shape_predictor_68_face_landmarks | Landmark model for extracting 68 facial landmarks, used for face alignment. |
style1 | Generator with STNs trained on one-shot paired data source1.png and target1.png. |
style2 | Generator with STNs trained on one-shot paired data source2.png and target2.png. |
style3 | Generator with STNs trained on one-shot paired data source3.png and target3.png. |
style4 | Generator with STNs trained on one-shot paired data source4.png and target4.png. |
style5 | Generator with STNs trained on one-shot paired data source5.png and target5.png. |
style6 | Generator with STNs trained on one-shot paired data source6.png and target6.png. |
style7 | Generator with STNs trained on one-shot paired data source7.png and target7.png. |
style8 | Generator with STNs trained on one-shot paired data source8.png and target8.png. |
By default, we assume that all models are downloaded and saved to the directory ./checkpoints.
Transfer the pretrained style onto a given image. Results are saved in the ./outputs/inference folder by default.
python inference.py --style=style3 --input_image=./data/test_inputs/002.png --alpha=0.8
Note: We use the pretrained e4e encoder for input-image inversion; make sure the pretrained e4e model has been downloaded and placed in ./checkpoints. Although using e4e saves inference time, the final results may sometimes differ from the input images.
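Conceptually, inference inverts the input face with e4e and then decodes it with the fine-tuned generator. The sketch below is a simplified view, not the actual inference.py API; `e4e_encoder` and `stylized_generator` are placeholders for the pretrained e4e model and the fine-tuned STN-augmented generator.

```python
import torch

@torch.no_grad()
def stylize_face(aligned_face, e4e_encoder, stylized_generator):
    latents = e4e_encoder(aligned_face)     # invert the real face into StyleGAN latents
    return stylized_generator(latents)      # decode with the style-adapted generator
```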
Generate random face images using pretrained styles. Results are saved in the ./outputs/generate folder by default.
python generate_samples.py --style=style1 --seed=2024 --alpha=0.8
Generate random face images using pretrained styles with deformation control of different degrees. Results are saved in the ./outputs/control folder by default.
python deformation_control.py --style=style1 --alpha0=-0.5 --alpha1=1.
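The two alpha values sweep the strength of the learned deformation. As a rough mental model only (an assumption about what alpha controls, not the repo's actual mechanism), one can think of scaling a spatial transformer's predicted sampling offsets by alpha before warping, so alpha=0 leaves the face undeformed and negative values reverse the deformation:

```python
import torch
import torch.nn.functional as F

def warp_with_scaled_offset(img, offset, alpha):
    """img: (N, C, H, W); offset: (N, H, W, 2) normalized sampling offsets, e.g. from an STN."""
    n, _, h, w = img.shape
    # identity sampling grid in [-1, 1], matching grid_sample's convention
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    grid = identity + alpha * offset   # alpha=0: identity; alpha<0: deformation reversed
    return F.grid_sample(img, grid, align_corners=True)
```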
Prepare your (aligned) paired images as real-style samples and place them in the ./data/style_images_aligned folder.
Make sure the pretrained StyleGANv2 model stylegan2-ffhq-config-f.pt and the AlexNet weights alex.pth and alexnet-owt-7be5be79.pth have been downloaded and placed in ./checkpoints.
To start training on your own style images, run:
python train.py --style=[STYLE_NAME] --source=[REAL_IMAGE_PATH] --target=[TARGET_IMAGE_PATH]
For example,
python train.py --style=style1 --source=source1.png --target=target1.png
Note:
- If your face images are not aligned, check that the face landmark model shape_predictor_68_face_landmarks.dat has been downloaded and placed in ./checkpoints, then run the following command for face alignment (see also the landmark sketch after these notes): python face_align.py --path=[YOUR_IMAGE_PATH] --output=[PATH_TO_SAVE]
- DINO-ViT will be downloaded automatically. We use dino_vitb8 in our experiments.
- Training requires ~22 GB of VRAM and takes about 13 minutes on average, tested on a single NVIDIA RTX 3090.
- The trained generator will be saved in ./outputs/models.
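For reference, this is roughly how the 68-point landmark model is used. This is a minimal dlib sketch; the repo's face_align.py additionally performs the actual crop-and-align step from these landmarks, and the checkpoint path below assumes the default ./checkpoints layout.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("./checkpoints/shape_predictor_68_face_landmarks.dat")

def get_landmarks(image):
    """image: H x W x 3 uint8 RGB array; returns a (68, 2) array of (x, y) landmarks."""
    faces = detector(image, 1)  # upsample once to help with smaller faces
    if len(faces) == 0:
        raise RuntimeError("no face detected")
    shape = predictor(image, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()])
```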
@inproceedings{zhou2024deformable,
title = {Deformable One-shot Face Stylization via DINO Semantic Guidance},
author = {Yang Zhou and Zichong Chen and Hui Huang},
booktitle = {CVPR},
year = {2024}}
The StyleGANv2 code is borrowed from this PyTorch implementation by @rosinality. The implementation of e4e projection is also heavily borrowed from encoder4editing. This code also contains submodules inspired by Splice, few-shot-gan-adaptation, and tps_stn_pytorch.