Updates

(2022/08/09) Application examples for polyp segmentation (polyp-pvt) and vision-language modeling.
(2021/06/21) Code of PVTv2 is released! PVTv2 substantially improves on PVTv1 and works better than Swin Transformer with ImageNet-1K pre-training.

Pyramid Vision Transformer

The image is from Transformers: Revenge of the Fallen.

This repository contains the official implementation of PVTv1 & PVTv2 in image classification, object detection, and semantic segmentation tasks.

Model Zoo

Image Classification

For classification configs and weights, see >>>here<<<.

PVTv2 on ImageNet-1K

Method	Size	Acc@1	#Params (M)
PVTv2-B0	224	70.5	3.7
PVTv2-B1	224	78.7	14.0
PVTv2-B2-Linear	224	82.1	22.6
PVTv2-B2	224	82.0	25.4
PVTv2-B3	224	83.1	45.2
PVTv2-B4	224	83.6	62.6
PVTv2-B5	224	83.8	82.0

PVTv1 on ImageNet-1K

Method	Size	Acc@1	#Params (M)
PVT-Tiny	224	75.1	13.2
PVT-Small	224	79.8	24.5
PVT-Medium	224	81.2	44.2
PVT-Large	224	81.7	61.4
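To make the comparison between the two ImageNet-1K tables above easier to read, here is a minimal Python sketch that pairs each PVTv1 model with the PVTv2 model closest in parameter count and reports the Acc@1 difference. The numbers are copied from the tables. The pairing rule and the helper names (`closest_by_params`, `compare`) are illustrative assumptions, not part of this repository.

```python
# Figures copied from the two ImageNet-1K tables above: (Acc@1, #Params in M).
PVTV2 = {
    "PVTv2-B0": (70.5, 3.7),
    "PVTv2-B1": (78.7, 14.0),
    "PVTv2-B2-Linear": (82.1, 22.6),
    "PVTv2-B2": (82.0, 25.4),
    "PVTv2-B3": (83.1, 45.2),
    "PVTv2-B4": (83.6, 62.6),
    "PVTv2-B5": (83.8, 82.0),
}
PVTV1 = {
    "PVT-Tiny": (75.1, 13.2),
    "PVT-Small": (79.8, 24.5),
    "PVT-Medium": (81.2, 44.2),
    "PVT-Large": (81.7, 61.4),
}


def closest_by_params(target_params, candidates):
    """Return the name of the candidate whose parameter count is nearest."""
    return min(candidates, key=lambda name: abs(candidates[name][1] - target_params))


def compare(v1=PVTV1, v2=PVTV2):
    """Pair each PVTv1 model with the nearest-sized PVTv2 model.

    The nearest-size pairing is a hypothetical rule chosen for illustration.
    Returns tuples of (v1 name, v2 name, Acc@1 gain, extra params in M).
    """
    rows = []
    for v1_name, (v1_acc, v1_params) in v1.items():
        v2_name = closest_by_params(v1_params, v2)
        v2_acc, v2_params = v2[v2_name]
        rows.append(
            (v1_name, v2_name, round(v2_acc - v1_acc, 1), round(v2_params - v1_params, 1))
        )
    return rows


if __name__ == "__main__":
    for v1_name, v2_name, d_acc, d_params in compare():
        print(f"{v1_name:>10} -> {v2_name:<16} Acc@1 {d_acc:+.1f}  params {d_params:+.1f}M")
```

Under this pairing, each PVTv2 model is about 1M parameters larger than its PVTv1 counterpart and gains between 1.9 and 3.6 points of Acc@1.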

Object Detection

For detection configs and weights, see >>>here<<<.

PVTv2 on COCO

Baseline Detectors

Method	Backbone	Pretrain	Lr schd	Aug	box AP	mask AP
RetinaNet	PVTv2-b0	ImageNet-1K	1x	No	37.2	-
RetinaNet	PVTv2-b1	ImageNet-1K	1x	No	41.2	-
RetinaNet	PVTv2-b2	ImageNet-1K	1x	No	44.6	-
RetinaNet	PVTv2-b3	ImageNet-1K	1x	No	45.9	-
RetinaNet	PVTv2-b4	ImageNet-1K	1x	No	46.1	-
RetinaNet	PVTv2-b5	ImageNet-1K	1x	No	46.2	-
Mask R-CNN	PVTv2-b0	ImageNet-1K	1x	No	38.2	36.2
Mask R-CNN	PVTv2-b1	ImageNet-1K	1x	No	41.8	38.8
Mask R-CNN	PVTv2-b2	ImageNet-1K	1x	No	45.3	41.2
Mask R-CNN	PVTv2-b3	ImageNet-1K	1x	No	47.0	42.5
Mask R-CNN	PVTv2-b4	ImageNet-1K	1x	No	47.5	42.7
Mask R-CNN	PVTv2-b5	ImageNet-1K	1x	No	47.4	42.5

Advanced Detectors

Method	Backbone	Pretrain	Lr schd	Aug	box AP	mask AP
Cascade Mask R-CNN	PVTv2-b2-Linear	ImageNet-1K	3x	Yes	50.9	44.0
Cascade Mask R-CNN	PVTv2-b2	ImageNet-1K	3x	Yes	51.1	44.4
ATSS	PVTv2-b2-Linear	ImageNet-1K	3x	Yes	48.9	-
ATSS	PVTv2-b2	ImageNet-1K	3x	Yes	49.9	-
GFL	PVTv2-b2-Linear	ImageNet-1K	3x	Yes	49.2	-
GFL	PVTv2-b2	ImageNet-1K	3x	Yes	50.2	-
Sparse R-CNN	PVTv2-b2-Linear	ImageNet-1K	3x	Yes	48.9	-
Sparse R-CNN	PVTv2-b2	ImageNet-1K	3x	Yes	50.1	-

PVTv1 on COCO

Detector	Backbone	Pretrain	Lr schd	box AP	mask AP
RetinaNet	PVT-Tiny	ImageNet-1K	1x	36.7	-
RetinaNet	PVT-Small	ImageNet-1K	1x	40.4	-
Mask R-CNN	PVT-Tiny	ImageNet-1K	1x	36.7	35.1
Mask R-CNN	PVT-Small	ImageNet-1K	1x	40.4	37.8
DETR	PVT-Small	ImageNet-1K	50ep	34.7	-

Semantic Segmentation

For segmentation configs and weights, see >>>here<<<.

For PVTv2 + segmentation, see >>>here<<<.

PVTv1 on ADE20K

Method	Backbone	Pretrain	Iters	mIoU
Semantic FPN	PVT-Tiny	ImageNet-1K	40K	35.7
Semantic FPN	PVT-Small	ImageNet-1K	40K	39.8
Semantic FPN	PVT-Medium	ImageNet-1K	40K	41.6
Semantic FPN	PVT-Large	ImageNet-1K	40K	42.1

Polyp Segmentation

Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers. pdf | code

Vision-Language Modeling

Masked Vision-Language Transformer in Fashion. pdf | code

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Citation

If you use this code for a paper, please cite:

PVTv1

@inproceedings{wang2021pyramid,
  title={Pyramid vision transformer: A versatile backbone for dense prediction without convolutions},
  author={Wang, Wenhai and Xie, Enze and Li, Xiang and Fan, Deng-Ping and Song, Kaitao and Liang, Ding and Lu, Tong and Luo, Ping and Shao, Ling},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={568--578},
  year={2021}
}

PVTv2

@article{wang2021pvtv2,
  title={Pvtv2: Improved baselines with pyramid vision transformer},
  author={Wang, Wenhai and Xie, Enze and Li, Xiang and Fan, Deng-Ping and Song, Kaitao and Liang, Ding and Lu, Tong and Luo, Ping and Shao, Ling},
  journal={Computational Visual Media},
  volume={8},
  number={3},
  pages={1--10},
  year={2022},
  publisher={Springer}
}

Contact

This repo is currently maintained by Wenhai Wang (@whai362), Enze Xie (@xieenze), and Zhe Chen (@czczup).
