This paper addresses universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, their limited adaptation to both image and video scenarios, as well as to complex reasoning segmentation, makes it difficult for them to handle challenging instructions and to accurately capture fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, covering generic segmentation tasks as well as more complex reasoning perception tasks that require powerful reasoning abilities and world knowledge. In addition, to fully leverage the recognition capabilities of VLLMs and fine-grained visual information, HyperSeg incorporates hybrid entity recognition and a fine-grained visual perceiver module for the various segmentation tasks. Combined with a temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our design in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks.
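The components mentioned above can be pictured roughly as follows. This is only a conceptual sketch: every class and argument name (`HyperSegSketch`, `perceiver`, `temporal_adapter`, and so on) is a placeholder invented for illustration, not an identifier from the HyperSeg codebase.

```python
import torch.nn as nn


class HyperSegSketch(nn.Module):
    """Conceptual sketch of the pipeline described above (all names are hypothetical)."""

    def __init__(self, vllm, perceiver, temporal_adapter, seg_predictor):
        super().__init__()
        self.vllm = vllm                          # Visual Large Language Model backbone
        self.perceiver = perceiver                # fine-grained visual perceiver
        self.temporal_adapter = temporal_adapter  # fuses information across video frames
        self.seg_predictor = seg_predictor        # mask decoder (e.g. Mask2Former-style)

    def forward(self, frames, instruction_tokens):
        # frames: (T, C, H, W) video clip, or T = 1 for a single image
        visual_tokens = self.perceiver(frames)                    # fine-grained visual features
        visual_tokens = self.temporal_adapter(visual_tokens)      # temporal fusion for video input
        llm_out = self.vllm(instruction_tokens, visual_tokens)    # reasoning over text + vision
        return self.seg_predictor(visual_tokens, llm_out)         # pixel-level masks
```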
Install the required packages:
```bash
conda create -n HyperSeg python=3.10.13
conda activate HyperSeg
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 -c pytorch -c conda-forge -y
pip install -r requirements.txt
```
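After installation, a quick sanity check (not part of the official setup steps) is to confirm that PyTorch and torchvision import correctly and that CUDA is visible:

```python
# Quick environment sanity check: prints installed versions and GPU visibility.
import torch
import torchvision

print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```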
- HyperSeg needs to load the pre-trained Mipha-3B weights: Mipha-3B.
- Our Vanilla Encoder needs to load the pre-trained SigLIP-SO weights: SigLIP-SO.
- The Segmentation Predictor needs to load the Mask2Former Swin-B weights: Mask2Former.
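A minimal sketch for fetching one of these checkpoints from the HuggingFace Hub is shown below. The `repo_id` and `local_dir` values are assumptions chosen for illustration, not a layout required by HyperSeg; the Mipha-3B and Mask2Former Swin-B checkpoints should likewise be obtained from their official releases and placed wherever your configuration expects them.

```python
# Sketch: download the SigLIP-SO backbone with huggingface_hub.
# repo_id and local_dir are assumptions for illustration only.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/siglip-so400m-patch14-384",    # assumed SigLIP-SO checkpoint
    local_dir="pretrained_weights/siglip-so400m",  # assumed local directory layout
)
```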
See Preparing Datasets for HyperSeg.
See Running Inference with HyperSeg.
If you find this project useful in your research, please consider citing:
```bibtex
@article{wei2024hyperseg,
  title={HyperSeg: Towards Universal Visual Segmentation with Large Language Model},
  author={Wei, Cong and Zhong, Yujie and Tan, Haoxian and Liu, Yong and Zhao, Zheng and Hu, Jie and Yang, Yujiu},
  journal={arXiv preprint arXiv:2411.17606},
  year={2024}
}
```
- Thanks to the great works of Mask2Former, PSALM, and Mipha. Our code is based on them.