CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

Updates

2025/02/24: We have released both the 256 and 512 model weights, along with inference scripts. The weights are available in our HuggingFace repo.
2025/01/20: Our paper has been published on arXiv.

Overview

CatV2TON is a diffusion transformer (DiT) based method for vision-based virtual try-on (V2TON) that temporally concatenates the video frames with the garment condition.

Evaluation

Evaluation for Image Try-On

We provide an evaluation script for the VITONHD and DressCode datasets. You can download our generated VITONHD and DressCode results to evaluate the performance of our method, or generate your own results by following the Inference section. Your own results may differ slightly from ours because the inference process is random.

CUDA_VISIBLE_DEVICES=0 python eval_image_metrics.py \
--gt_folder YOUR_GT_FOLDER \
--pred_folder YOUR_PRED_FOLDER \
--batch_size 16 \
--num_workers 16 \
--paired

Evaluation for Video Try-On

We provide an evaluation script for the ViViD-S-Test and VVT-Test datasets. You can download our generated ViViD-S-Test and VVT-Test results to evaluate the performance of our method, or generate your own results by following the Inference section. Your own results may differ slightly from ours because the inference process is random.

CUDA_VISIBLE_DEVICES=0 python eval_video_metrics.py \
--gt_folder YOUR_GT_FOLDER \
--pred_folder YOUR_PRED_FOLDER \
--num_workers 16 \
--paired

YOUR_GT_FOLDER is the path to the ground-truth video folder and YOUR_PRED_FOLDER is the path to the predicted video folder. Each folder should contain only mp4 files.
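
Because both folders are expected to hold only mp4 files, it can help to check them before starting a long evaluation run. The following is a minimal sketch, not part of the repository: the function name `check_video_folders` is hypothetical, and matching ground-truth and predicted videos by identical file name is an assumption about how paired evaluation lines them up.

```python
from pathlib import Path


def check_video_folders(gt_folder: str, pred_folder: str) -> list:
    """Return a list of problems found; an empty list means the folders look usable."""
    problems = []
    names = {}
    for label, folder in (("gt", gt_folder), ("pred", pred_folder)):
        files = sorted(p for p in Path(folder).iterdir() if p.is_file())
        # Both folders are expected to include only mp4 files.
        for p in files:
            if p.suffix.lower() != ".mp4":
                problems.append(f"{label}: non-mp4 file {p.name}")
        names[label] = {p.name for p in files if p.suffix.lower() == ".mp4"}
    # Assumption: paired evaluation matches videos by identical file name.
    for name in sorted(names["gt"] - names["pred"]):
        problems.append(f"pred: missing {name}")
    for name in sorted(names["pred"] - names["gt"]):
        problems.append(f"gt: missing {name}")
    return problems
```

Calling `check_video_folders("YOUR_GT_FOLDER", "YOUR_PRED_FOLDER")` and printing the returned list before launching eval_video_metrics.py gives a quick view of stray files or unmatched videos.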

Inference

Inference for Image Try-On

We provide an inference script for the VITONHD and DressCode datasets, which can be downloaded from VITONHD and DressCode. Run the following command, adjusting the parameters to your own setup.

# Set --dataset to either vitonhd or dresscode.
CUDA_VISIBLE_DEVICES=0 python eval_image_try_on.py \
--dataset vitonhd \
--data_root_path YOUR_DATASET_PATH \
--output_dir OUTPUT_DIR_TO_SAVE_RESULTS \
--dataloader_num_workers 8 \
--batch_size 8 \
--seed 42 \
--mixed_precision bf16 \
--allow_tf32 \
--repaint \
--eval_pair

Inference for Video Try-On

We provide two video try-on test datasets: ViViD-S-Test and VVT. Run the following command, adjusting the parameters to your own setup.

# Set --dataset to either vivid or vvt.
CUDA_VISIBLE_DEVICES=0 python eval_video_try_on.py \
--dataset vivid \
--data_root_path YOUR_DATASET_PATH \
--output_dir OUTPUT_DIR_TO_SAVE_RESULTS \
--dataloader_num_workers 8 \
--batch_size 8 \
--seed 42 \
--mixed_precision bf16 \
--allow_tf32 \
--repaint \
--eval_pair
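
The image and video inference commands differ only in the script name and the accepted dataset names, and each run takes exactly one dataset value. A small wrapper can make that explicit when scripting several runs. This is a sketch under stated assumptions: `TASKS` and `build_inference_command` are hypothetical helpers, not part of the repository, and the flags are copied from the commands above without checking how the scripts parse them.

```python
import shlex

# Script name and accepted --dataset values, as listed in the commands above.
TASKS = {
    "image": ("eval_image_try_on.py", ("vitonhd", "dresscode")),
    "video": ("eval_video_try_on.py", ("vivid", "vvt")),
}


def build_inference_command(task, dataset, data_root_path, output_dir, seed=42):
    """Build the argument list for one inference run (hypothetical helper)."""
    script, datasets = TASKS[task]
    if dataset not in datasets:
        raise ValueError(f"{task} try-on expects one of {datasets}, got {dataset!r}")
    return [
        "python", script,
        "--dataset", dataset,
        "--data_root_path", data_root_path,
        "--output_dir", output_dir,
        "--dataloader_num_workers", "8",
        "--batch_size", "8",
        "--seed", str(seed),
        "--mixed_precision", "bf16",
        "--allow_tf32",
        "--repaint",
        "--eval_pair",
    ]


if __name__ == "__main__":
    cmd = build_inference_command(
        "video", "vvt", "YOUR_DATASET_PATH", "OUTPUT_DIR_TO_SAVE_RESULTS"
    )
    # Print the command for inspection instead of executing it.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run`, with `CUDA_VISIBLE_DEVICES` set through its `env` argument. Keeping the seed as a parameter makes it easy to repeat a run with the same value.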
