LightSeq now supports fast training for models in the Transformer family!
We provide highly optimized custom operators for PyTorch and TensorFlow, which cover the entire training process for Transformer-based models. Users of LightSeq can use these operators to build their own models with efficient computation.
In addition, we integrate our custom operators into popular training libraries such as Fairseq, Hugging Face, and NeurST, which enables a 1.5x-3x end-to-end speedup compared to the native versions.
With only a few lines of code, you can enjoy the excellent performance provided by LightSeq. Try it now!
- High performance. On the WMT14 English to German dataset, LightSeq provides a 1.53x speedup over Fairseq with Apex for the Transformer-big model on an NVIDIA Ampere A100 with a batch size of 4096.
- Comprehensive operators. LightSeq provides efficient custom operators for PyTorch and TensorFlow, including embedding, encoder layer, decoder layer, criterion and optimizer. To the best of our knowledge, LightSeq is the first open-source project that covers the entire training process for Transformer-based models. In contrast, DeepSpeed only provides the encoder layer.
- Simple and multi-level usage. In addition to using the custom layers directly in model code, users can also use LightSeq in popular training libraries without any extra code changes. For example, we register efficient versions of tasks and models in Fairseq.
- Rich secondary development tools. LightSeq provides complete unit tests and debug tools, which help users develop their own custom layers.
The following is a support matrix of LightSeq compared with DeepSpeed.
Detailed experimental results are available here. Below are the results on the WMT14 English to German task.
We train Transformer models of different sizes on eight NVIDIA Tesla V100 or NVIDIA Ampere A100 GPUs with data parallelism and fp16 mixed precision. Fairseq with Apex is chosen as the baseline.
We compute the speedup at different batch sizes using the WPS (real words per second) metric.
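Concretely, the reported speedup for each setting is just the WPS ratio between LightSeq and the baseline. The snippet below is a hypothetical illustration of this calculation, not code from the benchmark scripts, and the numbers in it are made up:

# Speedup is the ratio of training throughput, measured in real words per second (WPS).
def compute_speedup(lightseq_wps, baseline_wps):
    return lightseq_wps / baseline_wps

# Hypothetical numbers for illustration only.
print(compute_speedup(153000.0, 100000.0))  # 1.53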
- PyTorch version with CUDA support
- Python version >= 3.6
To install the LightSeq training library:
pip install lightseq
or install it in develop mode:
git clone https://github.com/bytedance/lightseq.git
cd lightseq
pip install -e .
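After installation, you can run a quick sanity check to confirm that the custom layers can be imported (this assumes a CUDA-capable environment, since the operators are GPU kernels):

# Import one of the custom layers to verify the installation.
from lightseq.training import LSTransformerEncoderLayer
print(LSTransformerEncoderLayer)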
- TensorFlow version = 2.4
- Python version = 3.7
- CUDA version = 11.0
- To install the LightSeq training library:
pip install http://sf3-ttcdn-tos.pstatp.com/obj/nlp-opensource/lightseq/tensorflow/lightseq_tf-2.0.1-cp37-cp37m-linux_x86_64.whl
LightSeq integrates its custom operators into popular training libraries. Users of these libraries can benefit from LightSeq without modifying their model code:
You can also use LightSeq operators directly in your code to build your own models. To simplify the use of individual operators, LightSeq provides a simple and self-contained interface.
For example, if you want to use the encoder layers, you first need to generate a config containing all the arguments of the model and training. Then you can initialize the LightSeq encoder layer using the config and integrate it into your models.
from lightseq.training import LSTransformerEncoderLayer
config = LSTransformerEncoderLayer.get_config(
max_batch_tokens=4096,
max_seq_len=256,
hidden_size=1024,
intermediate_size=4096,
nhead=16,
attn_prob_dropout_ratio=0.1,
activation_dropout_ratio=0.1,
hidden_dropout_ratio=0.1,
pre_layer_norm=True,
fp16=True,
local_rank=0,
)
enc_layer = LSTransformerEncoderLayer(config)
Or you can use the default config by specifying the model architecture.
from lightseq.training import LSTransformerEncoderLayer
config = LSTransformerEncoderLayer.get_config(
model="transformer-big",
max_batch_tokens=4096,
max_seq_len=256,
fp16=True,
local_rank=0,
)
enc_layer = LSTransformerEncoderLayer(config)
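The layer then behaves like a normal PyTorch module. Below is a minimal sketch of a forward pass; the input layout (batch, sequence length, hidden size) and the padding mask convention (nonzero for padded positions) are assumptions and may differ slightly between versions:

import torch

# Inputs must be on the GPU and in fp16 to match the fp16=True config above.
hidden_states = torch.randn(8, 256, 1024, dtype=torch.half, device="cuda")
# Assumed convention: nonzero entries mark padded positions.
padding_mask = torch.zeros(8, 256, dtype=torch.half, device="cuda")

output = enc_layer(hidden_states, padding_mask)
print(output.shape)  # expected: torch.Size([8, 256, 1024])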
Currently, LightSeq supports the separate use of five operators: embedding, encoder layer, decoder layer, criterion and optimizer. In addition, LightSeq provides a whole Transformer model interface for convenient usage. You can check out the lightseq/training/ops/pytorch
and lightseq/training/ops/tensorflow
directories for details.
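As an illustration of building your own model from these operators, the encoder layer can be composed inside an ordinary PyTorch module. The class below is only a sketch, not part of the LightSeq API, and the layer's forward signature (hidden states plus padding mask) is an assumption:

import torch.nn as nn
from lightseq.training import LSTransformerEncoderLayer

class MyEncoder(nn.Module):
    """Toy encoder stack built from LightSeq encoder layers (illustrative only)."""

    def __init__(self, num_layers=6):
        super().__init__()

        def build_layer():
            config = LSTransformerEncoderLayer.get_config(
                model="transformer-big",
                max_batch_tokens=4096,
                max_seq_len=256,
                fp16=True,
                local_rank=0,
            )
            return LSTransformerEncoderLayer(config)

        self.layers = nn.ModuleList([build_layer() for _ in range(num_layers)])

    def forward(self, hidden_states, padding_mask):
        # Each layer preserves the (batch, seq_len, hidden_size) shape, so layers chain directly.
        for layer in self.layers:
            hidden_states = layer(hidden_states, padding_mask)
        return hidden_states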
We provide a simple example that shows how to build a whole Transformer model and train it successfully. Details are illustrated here.
- Training with 8-bit integers.