1 University of Macau
2 Intellindust
3 Chongqing University of Posts and Telecommunications
4 Tencent AI Lab
CVPR 2024
The code is implemented with Python 3.6 and PyTorch 1.10.0. You can install all the requirements via:
pip install -r requirements.txt
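After installing, a quick sanity check can confirm the environment matches the pinned versions. This is a minimal sketch and not part of the original scripts:

```python
# Quick environment check against the pinned dependency versions.
import sys
import torch

print("Python:", sys.version.split()[0])      # expect 3.6.x
print("PyTorch:", torch.__version__)          # expect 1.10.0
print("CUDA available:", torch.cuda.is_available())
```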
- Download the YouTube-VOS, DAVIS-16, FBMS, Long-Videos, MCL, and SegTrackV2 datasets. You can get the processed data provided by HFAN. The depth maps are obtained with MonoDepth2; we also provide the processed data here. A pairing sanity check is sketched below.
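After downloading, it is worth checking that every RGB frame has a matching depth map, since path mistakes are easy to make. This is a minimal sketch; the directory names (`JPEGImages`, `Depth`) and the dataset root are assumptions, not the repository's verified layout:

```python
# Verify frame/depth pairing under an assumed layout:
#   <root>/JPEGImages/<video>/<frame>.jpg  and  <root>/Depth/<video>/<frame>.png
import os

def check_pairs(root):
    missing = []
    for video in os.listdir(os.path.join(root, "JPEGImages")):
        for frame in os.listdir(os.path.join(root, "JPEGImages", video)):
            name = os.path.splitext(frame)[0] + ".png"
            if not os.path.exists(os.path.join(root, "Depth", video, name)):
                missing.append((video, frame))
    return missing

print(check_pairs("data/DAVIS-16"))  # hypothetical dataset root
```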
- Download the pre-trained MiT-b1 or Swin-Tiny backbone. A quick checkpoint inspection is sketched below.
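A small snippet can confirm the downloaded backbone checkpoint loads cleanly before training starts. The file path below is hypothetical:

```python
import torch

# Inspect the backbone checkpoint; the path/file name is hypothetical.
state = torch.load("pretrained/mit_b1.pth", map_location="cpu")
state = state.get("state_dict", state)  # some checkpoints nest the weights
print(len(state), "parameter tensors, e.g.:", next(iter(state)))
```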
- Training:
python train.py --config ./configs/train_sample.yaml
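The training entry point reads a YAML config. A minimal sketch of how such a config can be loaded and inspected; the keys it exposes are not shown here, and nothing below reflects the repository's actual schema:

```python
import yaml

# Load the training config and list its top-level keys.
with open("configs/train_sample.yaml") as f:
    cfg = yaml.safe_load(f)
print(sorted(cfg))  # e.g. dataset paths, backbone choice, learning rate, ...
```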
- Evaluation:
python ttt_demo.py --config configs/test_sample.yaml --model model.pth --eval_type base
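Zero-shot VOS results are typically reported as region similarity J (mask IoU) and boundary accuracy F. Independent of the repository's own evaluator, a minimal sketch of J:

```python
import numpy as np

def region_similarity_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:              # both masks empty: define J = 1
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)
```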
- Test-time training:
python ttt_demo.py --config configs/test_sample.yaml --model model.pth --eval_type TTT-MWI
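The idea behind depth-aware test-time training is to adapt the model to each test video with a self-supervised depth objective before predicting masks. The sketch below illustrates that loop under assumed interfaces (a two-head model returning mask logits and depth); it is a generic illustration, not the repository's exact `TTT-MWI` procedure:

```python
import copy
import torch
import torch.nn.functional as F

def test_time_train(model, frames, depths, steps=10, lr=1e-5):
    """Adapt a per-video copy of the model by regressing MonoDepth2 depth.

    Assumed interface: model(frame) -> (mask_logits, pred_depth).
    """
    adapted = copy.deepcopy(model)              # keep the source weights intact
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    adapted.train()
    for _ in range(steps):
        for frame, depth in zip(frames, depths):
            _, pred_depth = adapted(frame)
            loss = F.l1_loss(pred_depth, depth)  # self-supervised depth loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    adapted.eval()
    return adapted
```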
We provide our checkpoints here.
If you find this useful in your research, please consider citing:
@inproceedings{liu2024depth,
  title={Depth-aware Test-Time Training for Zero-shot Video Object Segmentation},
  author={Liu, Weihuang and Shen, Xi and Li, Haolun and Bi, Xiuli and Liu, Bo and Pun, Chi-Man and Cun, Xiaodong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
Our code borrows heavily from EVP, Swin, and SegFormer. We thank the authors for sharing their wonderful code.