
v4.22.0: Swin Transformer v2, VideoMAE, Donut, Pegasus-X, X-CLIP, ERNIE

@LysandreJik released this 14 Sep 18:57

Swin Transformer v2

The Swin Transformer V2 model was proposed in Swin Transformer V2: Scaling Up Capacity and Resolution by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.

Swin Transformer v2 improves the original Swin Transformer using three main techniques: 1) a residual post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained on low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images.
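
As a minimal sketch, image classification with Swin Transformer v2 follows the usual vision workflow; the checkpoint name and image path below are assumptions for illustration, not part of this release note:

import torch
from PIL import Image
from transformers import AutoFeatureExtractor, Swinv2ForImageClassification

checkpoint = "microsoft/swinv2-tiny-patch4-window8-256"  # assumed example checkpoint
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Swinv2ForImageClassification.from_pretrained(checkpoint)

image = Image.open("cat.png")  # placeholder image path
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])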

VideoMAE

The VideoMAE model was proposed in VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. VideoMAE extends masked autoencoders (MAE) to video, claiming state-of-the-art performance on several video classification benchmarks.

VideoMAE is an extension of ViTMAE for video.
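
A rough sketch of video classification with VideoMAE is shown below; the checkpoint name is an assumption and the video is a dummy 16-frame clip:

import numpy as np
import torch
from transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification

checkpoint = "MCG-NJU/videomae-base-finetuned-kinetics"  # assumed example checkpoint
feature_extractor = VideoMAEFeatureExtractor.from_pretrained(checkpoint)
model = VideoMAEForVideoClassification.from_pretrained(checkpoint)

video = list(np.random.randn(16, 3, 224, 224))  # dummy video: 16 frames of 3x224x224
inputs = feature_extractor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])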

Donut

The Donut model was proposed in OCR-free Document Understanding Transformer by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. Donut consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform document understanding tasks such as document image classification, form understanding and visual question answering.
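
Since Donut is a vision encoder-decoder model, document visual question answering reduces to conditional generation. The sketch below is an assumption-laden example: the checkpoint name, the task prompt format, and the image path are taken as illustrative rather than definitive:

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-docvqa"  # assumed example checkpoint
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("invoice.png").convert("RGB")  # placeholder document image
question = "What is the total amount?"
prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"  # assumed task prompt format

pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids
outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
print(processor.batch_decode(outputs)[0])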

Pegasus-X

The PEGASUS-X model was proposed in Investigating Efficiently Extending Transformers for Long Input Summarization by Jason Phang, Yao Zhao and Peter J. Liu.

PEGASUS-X (PEGASUS eXtended) extends the PEGASUS models for long input summarization through additional long input pretraining and using staggered block-local attention with global tokens in the encoder.
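
A minimal sketch of long-input summarization with PEGASUS-X follows; the checkpoint name is an assumption, and the base checkpoints are not fine-tuned for a specific summarization dataset, so outputs are only meaningful after fine-tuning:

from transformers import AutoTokenizer, PegasusXForConditionalGeneration

checkpoint = "google/pegasus-x-base"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = PegasusXForConditionalGeneration.from_pretrained(checkpoint)

long_document = "A very long report. " * 2000  # placeholder long input
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=16384)
summary_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])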

X-CLIP

The X-CLIP model was proposed in Expanding Language-Image Pretrained Models for General Video Recognition by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. X-CLIP is a minimal extension of CLIP for video. The model consists of a text encoder, a cross-frame vision encoder, a multi-frame integration Transformer, and a video-specific prompt generator.

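Zero-shot video classification with X-CLIP can be sketched as follows; the checkpoint name and the 8-frame dummy clip are assumptions for illustration:

import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

checkpoint = "microsoft/xclip-base-patch32"  # assumed example checkpoint
processor = XCLIPProcessor.from_pretrained(checkpoint)
model = XCLIPModel.from_pretrained(checkpoint)

video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))  # dummy 8-frame clip
texts = ["playing guitar", "riding a bike", "cooking"]
inputs = processor(text=texts, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_video.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))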

ERNIE

ERNIE is a series of powerful models proposed by Baidu, with a particular focus on Chinese tasks, including ERNIE 1.0, ERNIE 2.0, ERNIE 3.0, ERNIE-Gram, ERNIE-Health, etc.
These models were contributed by @nghuyong, and the official code can be found in PaddleNLP (in PaddlePaddle).
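
As a sketch, the ERNIE checkpoints contributed by @nghuyong load through the standard auto classes; the checkpoint name below is an assumption:

from transformers import AutoTokenizer, AutoModel

checkpoint = "nghuyong/ernie-3.0-base-zh"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("欢迎使用 ERNIE", return_tensors="pt")  # "Welcome to ERNIE"
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)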

TensorFlow models

MobileViT and LayoutLMv3 are now available in TensorFlow.
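
A minimal loading sketch for the new TensorFlow ports, assuming the usual TF naming convention and example checkpoints (from_pt=True converts PyTorch weights on the fly when a repository has no TF weights):

from transformers import TFMobileViTForImageClassification, TFLayoutLMv3Model

mobilevit = TFMobileViTForImageClassification.from_pretrained("apple/mobilevit-small", from_pt=True)
layoutlmv3 = TFLayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base", from_pt=True)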

New task-specific architectures

A new question answering head was added for the LayoutLM model.
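
A minimal sketch of loading the new head, in both PyTorch and TensorFlow; the checkpoint name is an assumption, and the question answering head is randomly initialized until fine-tuned:

from transformers import LayoutLMForQuestionAnswering, TFLayoutLMForQuestionAnswering

checkpoint = "microsoft/layoutlm-base-uncased"  # assumed example checkpoint
pt_model = LayoutLMForQuestionAnswering.from_pretrained(checkpoint)
tf_model = TFLayoutLMForQuestionAnswering.from_pretrained(checkpoint, from_pt=True)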

New pipelines

Two new pipelines are available in transformers: a document question answering pipeline and an image-to-text generation pipeline.
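
A rough sketch of both pipelines; the image paths are placeholders, and the default checkpoints are downloaded automatically when no model is specified:

from transformers import pipeline

# Document question answering (an OCR backend such as pytesseract is needed
# for plain images without precomputed word boxes).
doc_qa = pipeline("document-question-answering")
print(doc_qa(image="invoice.png", question="What is the invoice number?"))

# Image-to-text generation (image captioning style output).
image_to_text = pipeline("image-to-text")
print(image_to_text("photo.png"))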

M1 support

transformers now supports Apple M1 chips when using PyTorch, both in pipelines and in the Trainer.
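
A minimal sketch of running a model on Apple silicon through PyTorch's "mps" backend (this assumes a recent PyTorch build with MPS support; the checkpoint is an example):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)

inputs = tokenizer("transformers now runs on M1!", return_tensors="pt").to(device)
with torch.no_grad():
    print(model(**inputs).logits.softmax(-1))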

Backend version compatibility

Starting with version v4.22.0, we officially support PyTorch and TensorFlow versions released up to two years ago.
Versions older than two years will not be supported going forward.

We're making this change as we begin actively testing transformers compatibility on older versions.
This project can be followed here.

Generate method updates

The generate method now enforces stronger validation to ensure proper usage.

  • Generate: validate model_kwargs (and catch typos in generate arguments) by @gante in #18261
  • Generate: validate model_kwargs on TF (and catch typos in generate arguments) by @gante in #18651
  • Generate: add model class validation by @gante in #18902
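
As a sketch of what the stricter validation means in practice: unused or misspelled keyword arguments passed to generate now raise an error instead of being silently ignored (the exact exception type and message may differ):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello", return_tensors="pt")

try:
    # note the misspelled second kwarg: previously it was silently dropped
    model.generate(**inputs, max_new_tokens=5, max_new_token=5)
except Exception as err:
    print(f"generate rejected the unknown argument: {err}")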

API changes

The as_target_tokenizer and as_target_processor context managers have been deprecated. The new API is to call the tokenizer/processor directly with dedicated keyword arguments. For instance:

with tokenizer.as_target_tokenizer():
    encoded_labels = tokenizer(labels, padding=True)

becomes

encoded_labels = tokenizer(text_target=labels, padding=True)

  • Replace as_target context managers by direct calls by @sgugger in #18325

Bits and bytes integration

The bitsandbytes library is now integrated within transformers. This feature can reduce the size of large models by up to a factor of 2, with little loss in precision.
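
A minimal loading sketch: 8-bit weights require the bitsandbytes and accelerate packages and a CUDA GPU; the checkpoint name is an example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",  # example checkpoint
    device_map="auto",        # dispatch weights automatically via accelerate
    load_in_8bit=True,        # quantize linear layers to int8 with bitsandbytes
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")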

Large model support

Models that have sharded checkpoints in PyTorch can be loaded in Flax.
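
A sketch of the corresponding call, using a placeholder repository name standing in for any model whose PyTorch weights are sharded across several files:

from transformers import FlaxAutoModel

# "some-org/some-sharded-pytorch-checkpoint" is a hypothetical placeholder
model = FlaxAutoModel.from_pretrained("some-org/some-sharded-pytorch-checkpoint", from_pt=True)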

TensorFlow improvements

The TensorFlow examples have been rewritten to support all recent features developed in the past months.

DeBERTa-v2 is now trainable with XLA.
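
A hedged sketch of XLA training with Keras, assuming a TensorFlow release where Model.compile accepts the jit_compile flag; the checkpoint and dataset are placeholders:

import tensorflow as tf
from transformers import TFDebertaV2ForSequenceClassification

model = TFDebertaV2ForSequenceClassification.from_pretrained("microsoft/deberta-v2-xlarge", from_pt=True)
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5), jit_compile=True)  # compile with XLA
# model.fit(train_dataset, epochs=1)  # train_dataset: a tf.data.Dataset of tokenized batches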

Documentation changes

Improvements and bugfixes

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @nandwalritik
    • Add swin transformer v2 (#17469)
    • Update no trainer scripts for language modeling and image classification examples (#18443)
  • @ankrgyl
    • Include tensorflow-aarch64 as a candidate (#18345)
    • Specify en in doc-builder README example (#18526)
    • Add LayoutLMForQuestionAnswering model (#18407)
    • Pin revision for LayoutLMForQuestionAnswering and TFLayoutLMForQuestionAnswering tests (#18854)
    • Add DocumentQuestionAnswering pipeline (#18414)
    • Update default revision for document-question-answering (#18938)
  • @ikuyamada
    • Adding fine-tuning models to LUKE (#18353)
  • @duongna21
    • Add Flax BART pretraining script (#18297)
    • Fix incomplete outputs of FlaxBert (#18772)
  • @donelianc
    • Add Spanish translation of run_scripts.mdx (#18415)
    • Add Spanish translation of converting_tensorflow_models.mdx (#18512)
    • Add type hints for ViLT models (#18577)
  • @sayakpaul
    • fix: keras fit tests for segformer tf and minor refactors. (#18412)
    • TensorFlow MobileViT (#18555)
  • @flozi00
    • german docs translation (#18544)
    • Update longt5.mdx (#18634)
    • Create pipeline_tutorial.mdx german docs (#18625)
  • @stancld
    • Add TF implementation of XGLMModel (#16543)
  • @ChrisFugl
    • [LayoutLMv3] Add TensorFlow implementation (#18678)
  • @zphang
  • @nghuyong
    • add task_type_id to BERT to support ERNIE-2.0 and ERNIE-3.0 models (#18686)