Update TensorRT-LLM #2562

kaiyux · 2024-12-11T07:43:56Z

Features
- The LLM API
  - Added lookahead decoding support.
  - Added DeepSeek V1 support.
  - Added Medusa support.
- Added support for LogN scaling for Qwen models.
- Added quantization support for RecurrentGemma. Refer to examples/recurrentgemma/README.md.
- Added AutoAWQ checkpoints support for Qwen. Refer to the “INT4-AWQ” section in examples/qwen/README.md.
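
For readers unfamiliar with the LogN scaling item above, here is a minimal sketch of the idea as it is commonly described for Qwen: queries at positions beyond the training sequence length are multiplied by log(position) / log(training length), while earlier positions are left unchanged. This is an illustration only, not the TensorRT-LLM implementation; the function names `logn_scale` and `scale_query` and the example training length are hypothetical.

```python
import math


def logn_scale(position: int, train_seq_len: int) -> float:
    """Return the LogN factor for a 1-based token position.

    Positions within the training length keep a factor of 1.0; later
    positions are scaled by log(position) / log(train_seq_len).
    """
    if position <= train_seq_len:
        return 1.0
    return math.log(position) / math.log(train_seq_len)


def scale_query(query: list[float], position: int, train_seq_len: int) -> list[float]:
    """Apply the LogN factor to one query vector (hypothetical helper)."""
    factor = logn_scale(position, train_seq_len)
    return [q * factor for q in query]


if __name__ == "__main__":
    train_len = 8192  # assumed training length, for illustration only
    for pos in (1024, 8192, 16384, 32768):
        print(pos, logn_scale(pos, train_len))
```

The factor grows slowly with position, so the adjustment only matters for sequences well past the training length.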
API
- [BREAKING CHANGE] Chunked context is now enabled by default on non-RNN-based models when KV cache and paged context FMHA are enabled.
- [BREAKING CHANGE] Embedding sharing is now enabled automatically when possible, and the --use_embedding_sharing flag is removed from the checkpoint conversion scripts.
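
As background for the chunked context change above: chunked context means the prompt (context phase) is processed in several pieces rather than in one pass. The sketch below only shows the splitting step on a plain list of token IDs. It is a conceptual illustration, not TensorRT-LLM code; `split_into_chunks` and the chunk size are hypothetical.

```python
def split_into_chunks(token_ids: list[int], chunk_size: int) -> list[list[int]]:
    """Split a prompt into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        token_ids[start:start + chunk_size]
        for start in range(0, len(token_ids), chunk_size)
    ]


if __name__ == "__main__":
    prompt = list(range(10))  # stand-in for real token IDs
    for chunk in split_into_chunks(prompt, chunk_size=4):
        # A real engine would run the context phase on each chunk in
        # order, reusing the KV cache filled by the earlier chunks.
        print(chunk)
```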
Bug fixes
- Fixed the in-place clamp operation usage in smooth quant. Thanks to @StarrickLiu for the contribution in #2485 ("The clamp in-place operation cannot modify the weight_scales tensor directly").
Infra
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.11-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.11-py3.
- The dependent TensorRT version is updated to 10.7.
- The dependent CUDA version is updated to 12.6.3.
- Starting from the latest release, TensorRT-LLM Python wheels available on PyPI support both Python 3.10 and Python 3.12.
Known Issues
- The Windows build is currently broken; the team is working on a fix.

open source 0d6fea855fdf304673e7f9f660bb4319e480bb89

bd19024

kaiyux force-pushed the preview/main branch from 17f2b9b to bd19024 on December 11, 2024 08:21

Shixiaowei02 approved these changes Dec 11, 2024

kaiyux merged commit aaacc9b into main Dec 11, 2024

kaiyux deleted the preview/main branch December 11, 2024 08:31

Njuapp mentioned this pull request Dec 11, 2024

[feature request] qwen model's query logn-scaling attn #836

Closed