Skip to content

Intel® Extension for PyTorch* v2.4.0+cpu Release Notes

Compare
Choose a tag to compare
@jingxu10 jingxu10 released this 14 Aug 22:19
· 4 commits to release/2.4 since this release
5d871a0

We are excited to announce the release of Intel® Extension for PyTorch* 2.4.0+cpu which accompanies PyTorch 2.4. This release mainly brings you the support for Llama3.1, basic support for LLM serving frameworks like vLLM/TGI, and a set of optimization to push better performance for LLM models. This release also extends the list of optimized LLM models to a broader level and includes a set of bug fixing and small optimizations. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and feedback as to improve further on this product.

Highlights

  • Llama 3.1 support

Meta has newly released Llama 3.1 with new features like longer context length (128K) support. Intel® Extension for PyTorch* provides support of Llama 3.1 since its launch date with early release version, and now support with this official release.

  • Serving framework support

Typical LLM serving frameworks including vLLM, TGI can co-work with Intel® Extension for PyTorch* now which provides optimized performance for Xeon® Scalable CPUs. Besides the integration of LLM serving frameworks with ipex.llm module level APIs, we also continue optimizing the performance and quality of underneath Intel® Extension for PyTorch* operators such as paged attention and flash attention. We also provide new support in ipex.llm module level APIs for 4bits AWQ quantization based on weight only quantization, and distributed communications with shared memory optimization.

  • Large Language Model (LLM) optimization:

Intel® Extension for PyTorch* further optimized the performance of the weight only quantization kernels, enabled more fusion pattern variants for LLMs and extended the optimized models to include whisper, falcon-11b, Qwen2, and definitely Llama 3.1, etc. A full list of optimized models can be found at LLM optimization.

  • Bug fixing and other optimization

    • Fixed the quantization with auto-mixed-precision (AMP) mode of Qwen-7b #3030

    • Fixed the illegal memory access issue in the Flash Attention kernel #2987

    • Re-structured the paths of LLM example scripts #3080

    • Upgraded oneDNN to v3.5.3 #3160

    • Misc fix and enhancement #3079 #3116

Full Changelog: v2.3.0+cpu...v2.4.0+cpu