CatV2TON is a lightweight DiT-based visual virtual try-on model, capable of supporting try-on for both images and videos.

Zheng-Chong/CatV2TON

CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

Updates

  • 2025/02/24: We have released both 256 and 512 model weights, and provided inference scripts. Check out our HuggingFace repo for the weights.
  • 2025/01/20: Our paper has been published on arXiv.

Overview

CatV2TON is a DiT-based method for vision-based virtual try-on (V2TON) that uses temporal concatenation of the video frames and the garment condition.

Evaluation

Evaluation for Image Try-On

We provide an evaluation script for the VITONHD and DressCode datasets. You can download our generated VITONHD and DressCode results to evaluate the performance of our method. Alternatively, you can generate your own results by following the Inference section; these may differ slightly from ours because the inference process is stochastic.

CUDA_VISIBLE_DEVICES=0 python eval_image_metrics.py \
--gt_folder YOUR_GT_FOLDER \
--pred_folder YOUR_PRED_FOLDER \
--batch_size 16 \
--num_workers 16 \
--paired

Evaluation for Video Try-On

We provide an evaluation script for the ViViD-S-Test and VVT-Test datasets. You can download our generated ViViD-S-Test and VVT-Test results to evaluate the performance of our method. Alternatively, you can generate your own results by following the Inference section; these may differ slightly from ours because the inference process is stochastic.

CUDA_VISIBLE_DEVICES=0 python eval_video_metrics.py \
--gt_folder YOUR_GT_FOLDER \
--pred_folder YOUR_PRED_FOLDER \
--num_workers 16 \
--paired

YOUR_GT_FOLDER is the path to the folder of ground-truth videos, and YOUR_PRED_FOLDER is the path to the folder of predicted videos. Both folders must contain only mp4 files.
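Before running the video evaluation, it can help to confirm that both folders meet this requirement. The following is a minimal sketch and not part of the repository: `check_video_folders` is a hypothetical helper, and matching ground-truth and predicted videos by identical file name is an assumption about how `--paired` pairs them.

```python
from pathlib import Path


def check_video_folders(gt_folder, pred_folder):
    """Return (non_mp4, missing_in_pred, missing_in_gt) as sorted lists.

    Hypothetical helper: pairing by identical file name is an assumption,
    not confirmed behavior of eval_video_metrics.py.
    """
    gt_files = [p for p in Path(gt_folder).iterdir() if p.is_file()]
    pred_files = [p for p in Path(pred_folder).iterdir() if p.is_file()]

    # Files that violate the "only mp4 files" requirement.
    non_mp4 = sorted(
        str(p) for p in gt_files + pred_files if p.suffix.lower() != ".mp4"
    )

    gt_names = {p.name for p in gt_files if p.suffix.lower() == ".mp4"}
    pred_names = {p.name for p in pred_files if p.suffix.lower() == ".mp4"}

    # Videos present on one side only (under the file-name assumption).
    missing_in_pred = sorted(gt_names - pred_names)
    missing_in_gt = sorted(pred_names - gt_names)
    return non_mp4, missing_in_pred, missing_in_gt
```

If all three returned lists are empty, the folders contain only mp4 files and every video has a same-named counterpart on the other side.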

Inference

Inference for Image Try-On

We provide the inference script for VITONHD and DressCode datasets.
The datasets can be downloaded from VITONHD and DressCode. Run the following command to perform inference, adjusting the parameters to match your own setup.

# Set --dataset to either vitonhd or dresscode.
CUDA_VISIBLE_DEVICES=0 python eval_image_try_on.py \
--dataset vitonhd \
--data_root_path YOUR_DATASET_PATH \
--output_dir OUTPUT_DIR_TO_SAVE_RESULTS \
--dataloader_num_workers 8 \
--batch_size 8 \
--seed 42 \
--mixed_precision bf16 \
--allow_tf32 \
--repaint \
--eval_pair  

Inference for Video Try-On

We provide the video try-on test datasets ViViD-S-Test and VVT. Run the following command to perform inference, adjusting the parameters to match your own setup.

# Set --dataset to either vivid or vvt.
CUDA_VISIBLE_DEVICES=0 python eval_video_try_on.py \
--dataset vivid \
--data_root_path YOUR_DATASET_PATH \
--output_dir OUTPUT_DIR_TO_SAVE_RESULTS \
--dataloader_num_workers 8 \
--batch_size 8 \
--seed 42 \
--mixed_precision bf16 \
--allow_tf32 \
--repaint \
--eval_pair  
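To run inference on several datasets in one go, the command can be assembled programmatically. The following is a rough sketch under stated assumptions: `build_try_on_command` and `run_try_on` are hypothetical helpers that are not part of the repository, the flags and values are copied from the commands in this README, and writing each dataset to its own sub-folder of the output directory is a layout choice of this sketch.

```python
import os
import shlex
import subprocess


def build_try_on_command(script, dataset, data_root_path, output_dir):
    """Build the argument list for one inference run.

    Hypothetical helper: flags and values mirror the README commands.
    """
    return [
        "python", script,
        "--dataset", dataset,
        "--data_root_path", data_root_path,
        "--output_dir", output_dir,
        "--dataloader_num_workers", "8",
        "--batch_size", "8",
        "--seed", "42",
        "--mixed_precision", "bf16",
        "--allow_tf32",
        "--repaint",
        "--eval_pair",
    ]


def run_try_on(script, dataset_roots, output_root, dry_run=True):
    """Print one command per dataset, and execute it unless dry_run is set."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
    commands = []
    for dataset, data_root_path in dataset_roots.items():
        # One output sub-folder per dataset (an assumption of this sketch).
        output_dir = os.path.join(output_root, dataset)
        cmd = build_try_on_command(script, dataset, data_root_path, output_dir)
        commands.append(cmd)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True, env=env)
    return commands
```

For example, `run_try_on("eval_video_try_on.py", {"vivid": "YOUR_VIVID_PATH", "vvt": "YOUR_VVT_PATH"}, "OUTPUT_DIR_TO_SAVE_RESULTS")` prints both commands without executing them; pass `dry_run=False` to launch the runs. The same helper can be pointed at `eval_image_try_on.py` with the `vitonhd` and `dresscode` datasets.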
