[NeurIPS 2023] The official implementation of SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

RobertLuo1/NeurIPS2023_SOC

SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Zhuoyan Luo*, Yicheng Xiao*, Yong Liu*, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang

Tsinghua University Intelligent Interaction Group

📢 Updates

  • Jan. 1, 2024: We released the code for the ICCV 2023 Workshop: The 5th Large-scale Video Object Segmentation Challenge.
  • Oct. 29, 2023: The code is released.
  • Sep. 22, 2023: Our paper has been accepted by NeurIPS 2023!

📖 Abstract

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct a well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations.

📗 Framework

Visualization Result

Text expressions with temporal variations

(a) and (b) are the segmentation results of our SOC and ReferFormer, respectively. For more details, please refer to the paper.

🛠️ Environment Setup

  • Install PyTorch:
    pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
  • Install the other dependencies:
    pip install h5py opencv-python protobuf av einops ruamel.yaml timm joblib pandas matplotlib cython scipy
  • Install transformers and numpy:
    pip install transformers==4.24.0
    pip install numpy==1.23.5
  • Install pycocotools:
    pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
  • Build MultiScaleDeformableAttention:
    cd ./models/ops
    python setup.py build install
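
After installation, it can help to confirm that the pinned versions were actually picked up. The following is a minimal sketch using only the Python standard library; the helper names `matches` and `check_pins` are hypothetical and not part of this repository, and the rule of ignoring a local suffix such as `+cu113` when the pin has none is an assumption made for convenience.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above.
PINNED = {
    "torch": "1.11.0+cu113",
    "torchvision": "0.12.0+cu113",
    "torchaudio": "0.11.0",
    "transformers": "4.24.0",
    "numpy": "1.23.5",
}


def matches(expected, found):
    """Compare versions; ignore a local suffix like '+cu113' if the pin has none."""
    if "+" not in expected:
        found = found.split("+")[0]
    return found == expected


def check_pins(pins):
    """Return {package: (expected, found)} for every pin that is not met.

    `found` is None when the package is not installed at all.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        if found is None or not matches(expected, found):
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result from `check_pins(PINNED)` means every pinned package matches; it does not check that the MultiScaleDeformableAttention extension was built.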
    

Data Preparation

The overall data layout is shown below. We store rvosdata under /mnt/data_16TB/lzy23/rvosdata; please change this to xxx/rvosdata to match your own path.

rvosdata
├── a2d_sentences/
│   ├── Release/
│   │   ├── videoset.csv  (videos metadata file)
│   │   └── CLIPS320/
│   │       └── *.mp4     (video files)
│   └── text_annotations/
│       ├── a2d_annotation.txt  (actual text annotations)
│       ├── a2d_missed_videos.txt
│       └── a2d_annotation_with_instances/
│           └── */ (video folders)
│               └── *.h5 (annotation files)
├── refer_youtube_vos/
│   ├── train/
│   │   ├── JPEGImages/
│   │   │   └── */ (video folders)
│   │   │       └── *.jpg (frame image files)
│   │   └── Annotations/
│   │       └── */ (video folders)
│   │           └── *.png (mask annotation files)
│   ├── valid/
│   │   └── JPEGImages/
│   │       └── */ (video folders)
│   │           └── *.jpg (frame image files)
│   └── meta_expressions/
│       ├── train/
│       │   └── meta_expressions.json  (text annotations)
│       └── valid/
│           └── meta_expressions.json  (text annotations)
└── coco/
    ├── train2014/
    ├── refcoco/
    │   ├── instances_refcoco_train.json
    │   └── instances_refcoco_val.json
    ├── refcoco+/
    │   ├── instances_refcoco+_train.json
    │   └── instances_refcoco+_val.json
    └── refcocog/
        ├── instances_refcocog_train.json
        └── instances_refcocog_val.json
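
Before launching training, the layout above can be checked with a small script. This is only a sketch: `missing_entries` is a hypothetical helper, not something shipped in the repository, and it only tests that the fixed files and folders exist, not that the videos, frames, or masks inside them are complete.

```python
from pathlib import Path

# Relative paths taken from the directory tree above (wildcard entries are skipped).
EXPECTED = [
    "a2d_sentences/Release/videoset.csv",
    "a2d_sentences/Release/CLIPS320",
    "a2d_sentences/text_annotations/a2d_annotation.txt",
    "a2d_sentences/text_annotations/a2d_missed_videos.txt",
    "a2d_sentences/text_annotations/a2d_annotation_with_instances",
    "refer_youtube_vos/train/JPEGImages",
    "refer_youtube_vos/train/Annotations",
    "refer_youtube_vos/valid/JPEGImages",
    "refer_youtube_vos/meta_expressions/train/meta_expressions.json",
    "refer_youtube_vos/meta_expressions/valid/meta_expressions.json",
    "coco/train2014",
    "coco/refcoco/instances_refcoco_train.json",
    "coco/refcoco/instances_refcoco_val.json",
    "coco/refcoco+/instances_refcoco+_train.json",
    "coco/refcoco+/instances_refcoco+_val.json",
    "coco/refcocog/instances_refcocog_train.json",
    "coco/refcocog/instances_refcocog_val.json",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "xxx/rvosdata" is the placeholder used in this README; replace it with your path.
    for rel in missing_entries("xxx/rvosdata"):
        print("missing:", rel)
```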

Pretrained Model

We keep all pretrained models in one folder, /mnt/data_16TB/lzy23/pretrained; please change this to xxx/pretrained to match your own path.

pretrained
├── pretrained_swin_transformer
└── pretrained_roberta
  • For the pretrained_swin_transformer folder, download Video-Swin-Base.
  • For the pretrained_roberta folder, download config.json, pytorch_model.bin, tokenizer.json, and vocab.json from Hugging Face (roberta-base).

Model Zoo

The checkpoints are as follows:

Setting            Backbone      Checkpoint
a2d_from_scratch   Video-Swin-T  Model
a2d_with_pretrain  Video-Swin-T  Model
a2d_with_pretrain  Video-Swin-B  Model
ytb_from_scratch   Video-Swin-T  Model
ytb_with_pretrain  Video-Swin-T  Model
ytb_with_pretrain  Video-Swin-B  Model
ytb_joint_train    Video-Swin-T  Model
ytb_joint_train    Video-Swin-B  Model

Output Dir

We put all outputs under a single directory. We use /mnt/data_16TB/lzy23/SOC as the output directory, so please change it to xxx/SOC.
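
The data, pretrained-model, and output paths all share the prefix /mnt/data_16TB/lzy23, so one way to adapt them is to rewrite that prefix in the scripts and configs. The sketch below is a hypothetical helper (`rewrite_prefix` is not part of the repository), and which files actually contain the prefix is an assumption; it defaults to a dry run so the matches can be reviewed first.

```python
from pathlib import Path

# Prefix used throughout this README for data, pretrained models, and outputs.
OLD_PREFIX = "/mnt/data_16TB/lzy23"


def rewrite_prefix(files, new_prefix, old_prefix=OLD_PREFIX, dry_run=True):
    """Replace `old_prefix` with `new_prefix` in each file.

    Returns {file path: number of occurrences found}. With dry_run=True
    (the default) nothing is written; files are only scanned.
    """
    changed = {}
    for path in map(Path, files):
        text = path.read_text(encoding="utf-8")
        count = text.count(old_prefix)
        if count:
            changed[str(path)] = count
            if not dry_run:
                path.write_text(text.replace(old_prefix, new_prefix), encoding="utf-8")
    return changed


if __name__ == "__main__":
    # Assumed locations: the shell scripts and YAML configs named in this README.
    candidates = list(Path("scripts").glob("*.sh")) + list(Path("configs").glob("*.yaml"))
    for name, count in rewrite_prefix(candidates, "xxx").items():
        print(f"{name}: {count} occurrence(s)")
```

Once the dry-run report looks right, calling the function again with `dry_run=False` and your real prefix applies the change.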

🚀 Training

From scratch

When training from scratch, we only use Video-Swin-T as the backbone for training and evaluation.

  • A2D: Run the script ./scripts/train_a2d.sh, and make sure to change the path /mnt/data_16TB/lzy23 to your own path (the same applies to the scripts below).

    bash ./scripts/train_a2d.sh
    

    The key parameters are as follows; set them in ./configs/a2d_sentences.yaml:

    lr    backbone_lr  bs  GPU_num  Epoch  lr_drop
    5e-5  5e-6         2   2        40     15(0.2)
  • Ref-Youtube-VOS: Run the script ./scripts/train_ytb.sh.

    bash ./scripts/train_ytb.sh
    

    The main parameters are as follows:

    lr    backbone_lr  bs  num_class  GPU_num  freeze_text_encoder  lr_drop  Epoch
    1e-4  1e-5         1   65         8        true                 20(0.1)  30

    Please change ./configs/refer_youtube_vos.yaml according to these settings.

    Change the dataset_path in ./datasets/refer_youtube_vos/refer_youtube_vos_dataset.py according to your own path.
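
The lr_drop entries in the tables of this section, such as 15(0.2) or 20(0.1), are not explained in this README. One plausible reading, assumed here and not confirmed by the source, is a step schedule: at the listed epoch the learning rate is multiplied by the factor in parentheses, similar in spirit to a multi-step schedule. The function `lr_at_epoch` below is a hypothetical illustration of that reading with zero-indexed epochs, not the repository's scheduler.

```python
def lr_at_epoch(base_lr, epoch, drop_epochs, gamma):
    """Step decay under the assumed reading of lr_drop.

    The learning rate is multiplied by `gamma` once for every drop epoch
    that has already been reached (epochs are zero-indexed here).
    """
    drops = sum(1 for e in drop_epochs if epoch >= e)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    # A2D from scratch: lr 5e-5, lr_drop 15(0.2), 40 epochs.
    for epoch in (0, 14, 15, 39):
        print(epoch, lr_at_epoch(5e-5, epoch, [15], 0.2))
```

Under this reading, the A2D from-scratch run would train at 5e-5 until epoch 15 and at one fifth of that rate afterwards.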

With Pretrain

We pretrain and then finetune on the A2D-Sentences and Ref-Youtube-VOS datasets using Video-Swin-Tiny and Video-Swin-Base. Following previous work, we first pretrain on the RefCOCO dataset and then finetune.

  • Pretrain

    The key parameters for pretraining are as follows. When pretraining, please specify the corresponding backbone (Video-Swin-T or Video-Swin-B).

    lr    backbone_lr  text_encoder_lr  bs  num_class  GPU_num  freeze_text_encoder  lr_drop     Epoch
    1e-4  1e-5         5e-6             8   1          8        False                15 20(0.1)  30
  • Ref-Youtube-VOS

    We finetune the pretrained weights using the following key parameters:

    lr    backbone_lr  text_encoder_lr  bs  num_class  GPU_num  freeze_text_encoder  lr_drop  Epoch
    1e-4  1e-5         5e-6             8   1          8        False                10(0.1)  25
  • A2D-Sentences

    We finetune the pretrained weights on A2D-Sentences using the following key parameters:

    lr    backbone_lr  text_encoder_lr  bs  num_class  GPU_num  freeze_text_encoder  lr_drop  Epoch
    3e-5  3e-6         1e-6             1   1          8        true                 -        20

Joint training

We only perform joint training on the Ref-Youtube-VOS dataset, with Video-Swin-Tiny and Video-Swin-Base.

  • Ref-Youtube-VOS

    Run the script ./scripts/train_joint.sh. Remember to change the path and the backbone name before running.

    The main parameters (Tiny and Base) are as follows:

    lr    backbone_lr  bs  num_class  GPU_num  freeze_text_encoder  lr_drop  Epoch
    1e-4  1e-5         1   1          8        true                 20(0.1)  30

Evaluation

  • A2D-Sentences: Run the script ./scripts/eval_a2d.sh, and remember to specify the checkpoint_path in the config file.

  • JHMDB-Sentences: Please refer to Link to prepare the dataset, and specify the checkpoint path in the yaml file. Following the previous setting, we directly test with the checkpoint trained on A2D-Sentences.

  • Ref-Youtube-VOS

    bash ./scripts/infer_ref_ytb.sh
    

    Remember to specify the checkpoint_path and the video backbone name.

  • Ref-DAVIS2017: Please refer to Link to prepare the DAVIS dataset. We provide infer_davis.sh for evaluation. Remember to specify the checkpoint_path and the video backbone name.

Inference

We provide an interface for inference:

bash ./scripts/demo_video.sh

Acknowledgement

Code in this repository is built upon several public repositories. Thanks to the authors of ReferFormer and MTTR for their wonderful work.

Citations

If you find this work useful for your research, please cite:

@inproceedings{SOC,
  author       = {Zhuoyan Luo and
                  Yicheng Xiao and
                  Yong Liu and
                  Shuyan Li and
                  Yitong Wang and
                  Yansong Tang and
                  Xiu Li and
                  Yujiu Yang},
  title        = {{SOC:} Semantic-Assisted Object Cluster for Referring Video Object
                  Segmentation},
  booktitle    = {NeurIPS},
  year         = {2023},
}
