Reliability-guided Hierarchical Memory Network for Scribble-Supervised Video Object Segmentation
Zikun Zhou, Kaige Mao, Wenjie Pei, Hongpeng Wang, Yaowei Wang, Zhenyu He. [Paper]

This paper aims to solve the video object segmentation (VOS) task in a scribble-supervised manner, in which VOS models are not only trained with sparse scribble annotations but also initialized with a sparse target scribble for inference. Thus, the annotation burden for both training and initialization is substantially lightened. The difficulty of scribble-supervised VOS is two-fold: 1) it requires a strong capability to learn to predict dense masks from sparse scribble annotations during training; 2) it demands strong reasoning during inference given only a sparse initial target scribble. In this work, we propose a Reliability-guided Hierarchical Memory Network (RHMNet) that predicts the target mask via a step-wise expanding strategy w.r.t. the memory reliability level. Specifically, RHMNet first uses only the high-reliability memory to locate the high-reliability region of the target, i.e., the region highly similar to the initial target scribble. It then expands the located high-reliability region to the entire target, conditioned on the region itself and the memories at all reliability levels. In addition, we propose a scribble-supervised learning mechanism to facilitate learning to predict dense results: it mines the pixel-level relations within a single frame and the instance-level variations across multiple frames to take full advantage of the scribble annotations in sequence samples. The favorable performance on two popular benchmarks demonstrates that our method is promising.
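As background for the training difficulty mentioned above: scribble supervision only provides labels at a sparse set of pixels. A common baseline ingredient for learning dense prediction from such labels is a partial cross-entropy applied only at annotated pixels; the minimal sketch below illustrates this generic idea and is not the paper's full mechanism, which additionally mines pixel-level relations and instance-level variations.

```python
import torch
import torch.nn.functional as F

def partial_cross_entropy(logits, scribble):
    """logits: B x C x H x W class scores; scribble: B x H x W labels, -1 = unlabeled."""
    labeled = scribble >= 0
    if not labeled.any():
        return logits.sum() * 0.0  # no annotated pixel in this batch
    return F.cross_entropy(
        logits.permute(0, 2, 3, 1)[labeled],  # N_labeled x C
        scribble[labeled],                    # N_labeled
    )

# usage with toy tensors: only the handful of scribbled pixels carry the loss
logits = torch.randn(2, 2, 64, 64)                        # background/target scores
scribble = torch.full((2, 64, 64), -1, dtype=torch.long)  # -1 = unlabeled
scribble[:, 30:34, 20:40] = 1                             # a few target scribble pixels
loss = partial_cross_entropy(logits, scribble)
```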
- 2023.04.05: Uploaded the missing annotation file COCO_scribbles.json. [GoogleDriver] [BaiduYun(code:v3ua)]
Our Reliability-guided Hierarchical Memory Network consists of the reliability-hierarchical memory bank, the feature extractor, the memory encoding module, the matching module, and the segmentation head. To process a new frame, the proposed method first captures the reliable region, i.e., the region highly similar to the initial target scribble region, and then segments the entire target accordingly. In each expanding step, only the historical information at the corresponding or a higher reliability level is used as the reference for memory matching.
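For intuition, here is a minimal runnable sketch of this reliability-guided two-step flow. All names, shapes, the toy affinity-based matching, and the conditioning on the reliable region are hypothetical simplifications, not RHMNet's actual modules.

```python
import torch

HIGH, LOW = 0, 1  # reliability levels of memorized predictions

class ReliabilityHierarchicalMemory:
    """Toy stand-in for the reliability-hierarchical memory bank:
    stores (feature, mask) pairs separately per reliability level."""
    def __init__(self):
        self.bank = {HIGH: [], LOW: []}

    def write(self, feat, mask, level):
        self.bank[level].append((feat, mask))

    def read(self, levels):
        # Only the requested (corresponding or higher) levels are returned.
        return [pair for level in levels for pair in self.bank[level]]

def match(query_feat, references):
    """Toy memory matching: propagate reference masks by feature affinity."""
    b, c, h, w = query_feat.shape
    q = query_feat.flatten(2)                                # b x c x hw
    out = torch.zeros(b, 1, h, w)
    for ref_feat, ref_mask in references:
        k = ref_feat.flatten(2)                              # b x c x hw
        v = ref_mask.flatten(2).transpose(1, 2)              # b x hw x 1
        affinity = torch.softmax(q.transpose(1, 2) @ k, -1)  # b x hw x hw
        out = out + (affinity @ v).transpose(1, 2).view(b, 1, h, w)
    return out / max(len(references), 1)

def predict_frame(frame_feat, memory):
    # Step 1: match only against high-reliability memory to capture the
    # reliable region, i.e., the region similar to the initial scribble.
    reliable_region = match(frame_feat, memory.read([HIGH]))
    # Step 2: expand to the entire target using memories at all levels,
    # conditioned on the reliable region located in step 1.
    full_mask = match(frame_feat, memory.read([HIGH, LOW]))
    full_mask = full_mask * reliable_region.sigmoid()
    # Memorize both predictions at their respective reliability levels.
    memory.write(frame_feat, reliable_region, HIGH)
    memory.write(frame_feat, full_mask, LOW)
    return full_mask

# usage with toy tensors: the initial scribble seeds the high-reliability level
memory = ReliabilityHierarchicalMemory()
memory.write(torch.randn(1, 8, 16, 16), (torch.rand(1, 1, 16, 16) > 0.95).float(), HIGH)
mask = predict_frame(torch.randn(1, 8, 16, 16), memory)
```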
Synthesized training scribbles
Manually drawn evaluation scribbles
The synthesized scribble annotations used for training can be downloaded from here: [GoogleDriver] [BaiduYun(code:o2ey)]
The manually drawn initial scribbles for the validation sets of DAVIS and YouTube-VOS can be downloaded from here: [GoogleDriver] [BaiduYun(code:yhdj)]
The downloaded scribbles should be organized as follows:
/datasets
|——————COCO
|        |——————scribbles
|        |        |——————annotation1.bmp
|        |        |——————annotation2.bmp
|        |        |——————......
|        |——————COCO_scribbles.json
|        |——————...
|——————DAVIS
|        |——————train_scribbles
|        |        |——————video1
|        |        |——————video2
|        |        |——————......
|        |——————valid_scribbles0
|        |        |——————video1
|        |        |——————video2
|        |        |——————......
|        |——————valid_scribbles1
|        |——————valid_scribbles2
|        |——————valid_scribbles3
|        |——————valid_scribbles4
|        |——————...
|——————Youtube-VOS
         |——————train_scribbles
         |        |——————video1
         |        |——————video2
         |        |——————......
         |——————valid_scribbles
         |        |——————video1
         |        |——————video2
         |        |——————......
         |——————...
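Once downloaded, the scribble maps are ordinary image files. The snippet below is a quick sanity check; the top-level schema of COCO_scribbles.json is not documented here, so it only inspects the file without assuming any keys.

```python
import json
from PIL import Image

# Load one synthesized scribble map (an ordinary image file).
scribble = Image.open("datasets/COCO/scribbles/annotation1.bmp")
print(scribble.size, scribble.mode)

# Inspect the COCO scribble annotation file without assuming its schema.
with open("datasets/COCO/COCO_scribbles.json") as f:
    coco_scribbles = json.load(f)
print(type(coco_scribbles), len(coco_scribbles))
```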
Our code is implemented based on Python 3.7, PyTorch 1.10.0, and CUDA 11.
Create the environment for RHMNet:
conda env create -f requirments.yaml
Then run conda activate RHMNet to activate the environment.
# pre-training on coco
/opt/conda/bin/python -m torch.distributed.launch --nproc_per_node 8 train_script_pretrain.py --config pre_training
# video training on davis and youtube-vos
/opt/conda/bin/python -m torch.distributed.launch --nproc_per_node 4 train_script_video_training.py --config video_training
The pre-trained models can be downloaded from here: [GoogleDriver] [BaiduYun(code:ghbg)], and should be placed under ./checkpoints/pretrain_weights_video_training.
# evaluation on davis 2016 validation set
/opt/conda/bin/python eval_davis.py --gpu 0 --set val --year 16 --config video_training --ckpt_num 69
# evaluation on davis 2017 validation set
/opt/conda/bin/python eval_davis.py --gpu 0 --set val --year 17 --config video_training --ckpt_num 78
# evaluation on youtube-vos 2018/2019 validation set
/opt/conda/bin/python eval_ytb.py --gpu 0 --set val --config video_training --ckpt_num 71 --size 1000