Efficient Training Library for Transformer-based Models


Chinese version of the training module introduction

LightSeq now supports fast training for models in the Transformer family!

We provide highly optimized custom operators for PyTorch and TensorFlow, which cover the entire training process for Transformer-based models. Users of LightSeq can use these operators to build their own models with efficient computation.

In addition, we integrate our custom operators into popular training libraries such as Fairseq, Hugging Face, and NeurST, which enables a 1.5x-3x end-to-end speedup compared to the native versions.

With only a few lines of code, you can enjoy the excellent performance provided by LightSeq. Try it now!

Features

  • High performance. On the WMT14 English-to-German dataset, LightSeq provides a 1.53x speedup over Fairseq with Apex for the Transformer-big model on an NVIDIA Ampere A100 with a batch size of 4096.
  • Comprehensive operators. LightSeq provides comprehensive, efficient custom operators for PyTorch and TensorFlow, including the embedding, encoder layer, decoder layer, criterion, and optimizer. To the best of our knowledge, LightSeq is the first open-source project that covers the entire training process for Transformer-based models; in contrast, DeepSpeed provides only the encoder layer.
  • Simple and multi-level usage. In addition to using the custom layers directly in model code, users can enable LightSeq in popular training libraries without modifying their code. For example, we register efficient versions of tasks and models in Fairseq.
  • Rich secondary development tools. LightSeq provides complete unit tests and debugging tools, which help users develop their own custom layers.

The following is a support matrix of LightSeq compared with DeepSpeed.

[Feature support matrix: LightSeq vs. DeepSpeed]

Performance

Detailed experimental results are available here. Below are the experimental results on the WMT14 English-to-German task.

We train Transformer models of different sizes on eight NVIDIA Tesla V100 or NVIDIA Ampere A100 GPUs with data parallelism and fp16 mixed precision. Fairseq with Apex is chosen as the baseline.

Speedup for a single training step

We compute the speedup at different batch sizes using the WPS (real words per second) metric.
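Concretely, the speedup at a given batch size is the ratio of LightSeq's WPS to the baseline's WPS. The helper below is only an illustrative sketch of how WPS can be measured (train_step and batches are hypothetical stand-ins), not part of LightSeq:

import time

def words_per_second(train_step, batches, total_words):
    # Run one training step per batch and report real words processed per second.
    start = time.time()
    for batch in batches:
        train_step(batch)
    return total_words / (time.time() - start)

# speedup = words_per_second(lightseq_step, batches, n_words) / words_per_second(baseline_step, batches, n_words)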

End-to-end wall-clock training time

Requirements and Installation

PyTorch

  • PyTorch with a supported CUDA version
  • Python version >= 3.6

To install the LightSeq training library:

pip install lightseq

or install it in development mode:

git clone https://github.com/bytedance/lightseq.git
cd lightseq
pip install -e .
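A quick way to verify the installation is to import one of the operators used later in this README; the snippet below is just a suggested sanity check, assuming a CUDA-capable GPU is present:

import torch
from lightseq.training import LSTransformerEncoderLayer  # should import without error

# LightSeq's custom CUDA kernels require an NVIDIA GPU at runtime.
print("CUDA available:", torch.cuda.is_available())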

TensorFlow

  • TensorFlow version = 2.4
  • Python version = 3.7
  • CUDA version = 11.0

To install the LightSeq training library:

pip install http://sf3-ttcdn-tos.pstatp.com/obj/nlp-opensource/lightseq/tensorflow/lightseq_tf-2.0.1-cp37-cp37m-linux_x86_64.whl

Usage

Quick start for different training libraries

LightSeq integrates its custom operators into popular training libraries, so users of these libraries can enable LightSeq without modifying their model code:

Building models from scratch

You can also use LightSeq operators directly in your code to build your own models. To simplify the use of individual operators, LightSeq provides a simple and self-contained interface.

For example, if you want to use the encoder layer, you first need to generate a config containing all the arguments of the model and training. Then you can initialize the LightSeq encoder layer with the config and integrate it into your model.

from lightseq.training import LSTransformerEncoderLayer

config = LSTransformerEncoderLayer.get_config(
    max_batch_tokens=4096,
    max_seq_len=256,
    hidden_size=1024,
    intermediate_size=4096,
    nhead=16,
    attn_prob_dropout_ratio=0.1,
    activation_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    pre_layer_norm=True,
    fp16=True,
    local_rank=0,
)
enc_layer = LSTransformerEncoderLayer(config)
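The layer then behaves like a regular PyTorch module. The forward call below is only a sketch: the (batch, sequence, hidden) input layout and the padding-mask argument are assumptions, so check the operator documentation for the exact signature and mask convention.

import torch

batch_size, seq_len = 8, 256
# Assumed input layout: (batch, sequence, hidden), fp16 tensors on the configured GPU.
hidden_states = torch.randn(
    batch_size, seq_len, config.hidden_size, dtype=torch.half, device="cuda"
)
# Assumed mask convention: nonzero marks padded positions (no padding here).
padding_mask = torch.zeros(batch_size, seq_len, device="cuda")

out = enc_layer(hidden_states, padding_mask)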

Or you can use the default config by specifying the model architecture.

from lightseq.training import LSTransformerEncoderLayer

config = LSTransformerEncoderLayer.get_config(
    model="transformer-big",
    max_batch_tokens=4096,
    max_seq_len=256,
    fp16=True,
    local_rank=0,
)
enc_layer = LSTransformerEncoderLayer(config)

Currently, LightSeq supports the separate use of five operators: embedding, encoder layer, decoder layer, criterion, and optimizer. Besides, LightSeq also provides a whole-Transformer model interface for convenient usage. You can check out the lightseq/training/ops/pytorch and lightseq/training/ops/tensorflow directories for details.
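The other operators follow the same config-then-construct pattern. The decoder-layer sketch below only illustrates that pattern: the class name mirrors the encoder example above, but its exact get_config arguments are assumptions (the decoder layer may require additional ones), so consult the operator source for the definitive signature.

from lightseq.training import LSTransformerDecoderLayer

# Hypothetical sketch mirroring the encoder-layer example; the argument list is assumed.
dec_config = LSTransformerDecoderLayer.get_config(
    model="transformer-big",
    max_batch_tokens=4096,
    max_seq_len=256,
    fp16=True,
    local_rank=0,
)
dec_layer = LSTransformerDecoderLayer(dec_config)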

We provide a simple example to show how to build the whole Transformer model and train it successfully. Details are illustrated here.

Limitations and Future Plans

  • Training with 8-bit integers.