LLM example path re-structure (release 2.4) (#3080)
* LLM example files restructure

* update

* update path in docs

* symlink

* cherry-pick the typo fix (#3083)

* fix path in quant script

---------

Co-authored-by: WeizhuoZhang-intel <weizhuo.zhang@intel.com>
ZailiWang and WeizhuoZhang-intel authored Jul 17, 2024
1 parent f3b57ef · commit bee4a42
Showing 64 changed files with 370 additions and 323 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -5,14 +5,14 @@ Intel® Extension for PyTorch\*

</div>

- **CPU** [💻main branch](https://github.com/intel/intel-extension-for-pytorch/tree/main)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[🌱Quick Start](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/getting_started.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[📖Documentations](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[🏃Installation](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=cpu&version=v2.4.0%2Bcpu)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[💻LLM Example](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm) <br>
+ **CPU** [💻main branch](https://github.com/intel/intel-extension-for-pytorch/tree/main)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[🌱Quick Start](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/getting_started.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[📖Documentations](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[🏃Installation](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=cpu&version=v2.4.0%2Bcpu)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[💻LLM Example](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/llm) <br>
**GPU** [💻main branch](https://github.com/intel/intel-extension-for-pytorch/tree/xpu-main)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[🌱Quick Start](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/getting_started.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[📖Documentations](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[🏃Installation](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[💻LLM Example](https://github.com/intel/intel-extension-for-pytorch/tree/xpu-main/examples/gpu/inference/python/llm)<br>

Intel® Extension for PyTorch\* extends PyTorch\* with up-to-date feature optimizations for an extra performance boost on Intel hardware. Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel X<sup>e</sup> Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, Intel® Extension for PyTorch\* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch\* xpu device.
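To make the drop-in usage this describes concrete, here is a minimal sketch (the model and input are arbitrary placeholders; it mirrors the documented `ipex.optimize` inference flow):

```python
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

# Any eager-mode model works; weights are skipped to keep the sketch self-contained
model = models.resnet50(weights=None).eval()

# A single call applies the CPU optimizations (AVX-512/AMX kernels are chosen automatically)
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast():
    output = model(torch.randn(1, 3, 224, 224))
```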

## ipex.llm - Large Language Models (LLMs) Optimization

- In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain LLM models are introduced in the Intel® Extension for PyTorch\*. Check [**LLM optimizations**](./examples/cpu/inference/python/llm) for details.
+ In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain LLM models are introduced in the Intel® Extension for PyTorch\*. Check [**LLM optimizations**](./examples/cpu/llm) for details.

### Optimized Model List

2 changes: 1 addition & 1 deletion docs/tutorials/examples.md
@@ -240,7 +240,7 @@ generate results for the input prompt.
[//]: # (marker_llm_optimize_woq)
[//]: # (marker_llm_optimize_woq)

- **Note:** Please check [LLM Best Known Practice Page](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm)
+ **Note:** Please check [LLM Best Known Practice Page](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/llm)
for detailed environment setup and LLM workload running instructions.

## C++
2 changes: 1 addition & 1 deletion docs/tutorials/features/int8_recipe_tuning_api.md
@@ -10,7 +10,7 @@ Users need to provide a fp32 model and some parameters required for tuning. The
Please refer to [static_quant example](../../../examples/cpu/features/int8_recipe_tuning/imagenet_autotune.py).

- Smooth Quantization
-   Please refer to [llm sq example](../../../examples/cpu/inference/python/llm/single_instance/run_generation.py).
+   Please refer to [LLM SmoothQuant example](../../../examples/cpu/llm/inference/single_instance/run_generation.py).
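For orientation, a minimal sketch of an `autotune` call follows (the toy model, dataloader, and eval function are stand-ins, and the tuning arguments shown are assumptions to be verified against the [autotune API](../api_doc.html#ipex.quantization.autotune) documentation):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import intel_extension_for_pytorch as ipex

# Toy FP32 model and calibration data; substitute your real model and dataset
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2)).eval()
calib_dataloader = DataLoader(TensorDataset(torch.randn(64, 8)), batch_size=8)

def eval_func(m):
    # A real eval_func returns task accuracy; a constant keeps the sketch self-contained
    with torch.no_grad():
        m(torch.randn(4, 8))
    return 1.0

tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func=eval_func,
    sampling_sizes=[32],                    # calibration sample counts to try
    accuracy_criterion={"relative": 0.01},  # allow up to 1% relative accuracy drop
    tuning_time=0,                          # 0 means no time limit
)
```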

## Smooth Quantization Autotune
### Algorithm: Auto-tuning of $\alpha$.
5 changes: 3 additions & 2 deletions docs/tutorials/features/sq_recipe_tuning_api.md
@@ -1,7 +1,8 @@
Smooth Quant Recipe Tuning API (Prototype)
=============================================

- Smooth Quantization is a popular method to improve the accuracy of int8 quantization. The [autotune API](../api_doc.html#ipex.quantization.autotune) allows automatic global alpha tuning, and automatic layer-by-layer alpha tuning provided by Intel® Neural Compressor for the best INT8 accuracy.
+ Smooth Quantization is a popular method to improve the accuracy of int8 quantization.
+ The [autotune API](../api_doc.html#ipex.quantization.autotune) allows automatic global alpha tuning, and automatic layer-by-layer alpha tuning provided by Intel® Neural Compressor for the best INT8 accuracy.

SmoothQuant introduces an alpha parameter to balance the rescaling between inputs and weights and reduce quantization error. SmoothQuant arguments are as below:

@@ -15,6 +16,6 @@ SmoothQuant will introduce alpha to calculate the ratio of input and weight upda
| shared_criterion | "mean" | ["min", "mean","max"] | criterion for input LayerNorm op of a transformer block. |
| enable_blockwise_loss | False | [True, False] | whether to enable block-wise auto-tuning |

- For LLM examples, please refer to [example](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/inference/python/llm).
+ Please refer to the [LLM examples](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm) for complete examples.

**Note**: When defining dataloaders for calibration, please follow INC's dataloader [format](https://github.com/intel/neural-compressor/blob/master/docs/source/dataloader.md).
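Tying the table together, here is a sketch of passing these knobs through `smoothquant_args` (the `model`, `calib_dataloader`, and `eval_func` placeholders are as in the earlier autotune sketch; the nested argument names and values are illustrative assumptions, not documented defaults):

```python
import intel_extension_for_pytorch as ipex

# Layer-by-layer alpha auto-tuning, driven by the arguments from the table above
# (values are illustrative; check the API docs for the real defaults)
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "mean",
        "enable_blockwise_loss": False,
    },
}

# A fixed global sweep is the alternative, e.g. {"alpha": [0.1, 0.2, 0.5, 0.9]}
tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func=eval_func,
    smoothquant_args=smoothquant_args,
)
```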
2 changes: 1 addition & 1 deletion docs/tutorials/getting_started.md
@@ -157,4 +157,4 @@ with torch.inference_mode(), torch.cpu.amp.autocast(enabled=amp_enabled):
print(gen_text, total_new_tokens, flush=True)
```

- More LLM examples, including usage of low precision data types are available in the [LLM Examples](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm) section.
+ More LLM examples, including usage of low precision data types are available in the [LLM Examples](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/llm) section.
2 changes: 1 addition & 1 deletion docs/tutorials/installation.md
@@ -5,4 +5,4 @@ Select your preferences and follow the installation instructions provided on the

After successful installation, refer to the [Quick Start](getting_started.md) and [Examples](examples.md) sections to start using the extension in your code.

- **NOTE:** For detailed instructions on installing and setting up the environment for Large Language Models (LLM), as well as example scripts, refer to the [LLM best practices](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/inference/python/llm).
+ **NOTE:** For detailed instructions on installing and setting up the environment for Large Language Models (LLM), as well as example scripts, refer to the [LLM best practices](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm).
4 changes: 2 additions & 2 deletions docs/tutorials/llm.rst
@@ -13,7 +13,7 @@ These LLM-specific optimizations can be automatically applied with a single fron

llm/llm_optimize

- `ipex.llm` Optimized Model List
+ `ipex.llm` Optimized Model List for Inference
-------------------------------

Verified for single instance mode
@@ -30,7 +30,7 @@ Verified for distributed inference mode via DeepSpeed

*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). Work is in progress to better support the models in the tables with various data types. In addition, more models will be optimized in the future.

- Please check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm>`_ for instructions to install/setup environment and example scripts.
+ Please check `LLM best known practice <https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/llm>`_ for instructions to install/setup environment and example scripts.

Module Level Optimization API for customized LLM (Prototype)
------------------------------------------------------------
15 changes: 10 additions & 5 deletions docs/tutorials/llm/llm_optimize.md
@@ -1,15 +1,20 @@
- Transformers Optimization Frontend API
+ LLM Optimizations Frontend API
======================================

- The new API function, `ipex.llm.optimize`, is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. You just need to invoke the `ipex.llm.optimize` function instead of the `ipex.optimize` function to apply all optimizations transparently.
+ The new API function, `ipex.llm.optimize`, is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs).
+ It provides optimizations for both model-wise and content-generation-wise.
+ You just need to invoke the `ipex.llm.optimize` function instead of the `ipex.optimize` function to apply all optimizations transparently.

- This API currently works for inference workloads. Support for training is undergoing. Currently, this API supports certain models. Supported model list can be found at [Overview](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html#ipexllm-optimized-model-list).
+ This API currently works for inference workloads.
+ Currently, this API supports certain models. Supported model list can be found at [this page](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html#ipexllm-optimized-model-list-for-inference).
+ For LLM fine-tuning, please check the [LLM fine-tuning tutorial](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm/fine-tuning).

API documentation is available at [API Docs page](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/api_doc.html#ipex.llm.optimize).

## Pseudocode of Common Usage Scenarios

- The following sections show pseudocode snippets to invoke Intel® Extension for PyTorch\* APIs to work with LLM models. Complete examples can be found at [the Example directory](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/inference/python/llm).
+ The following sections show pseudocode snippets to invoke Intel® Extension for PyTorch\* APIs to work with LLM models.
+ Complete examples can be found at [the Example directory](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm/inference).

### FP32/BF16
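A minimal sketch of this scenario (the model id, prompt, and generation parameters are placeholder assumptions):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # any model from the verified list
model = AutoModelForCausalLM.from_pretrained(model_id).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# dtype=torch.bfloat16 selects BF16; use torch.float32 (or omit dtype) for FP32
model = ipex.llm.optimize(model, dtype=torch.bfloat16)

with torch.inference_mode(), torch.cpu.amp.autocast(enabled=True):
    inputs = tokenizer("What is Intel AMX?", return_tensors="pt")
    tokens = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```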

@@ -98,7 +103,7 @@ model = ipex.llm.optimize(model, quantization_config=qconfig, low_precision_chec

Distributed inference can be performed with `DeepSpeed`. Based on original Intel® Extension for PyTorch\* scripts, the following code changes are required.

- Check [LLM distributed inference examples](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/inference/python/llm/distributed) for complete codes.
+ Check [LLM distributed inference examples](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm/inference/distributed) for complete codes.

``` python
import torch
```
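A rough sketch of the overall shape such a script takes (the model id, DeepSpeed argument names, and exact call sequence are assumptions here; the linked distributed examples are authoritative):

```python
import os
import torch
import deepspeed
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

# Placeholder model id; any supported causal LM works
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", torch_dtype=torch.bfloat16
).eval()

# Shard the model across ranks with DeepSpeed AutoTP
world_size = int(os.environ.get("WORLD_SIZE", "1"))
model = deepspeed.init_inference(
    model, mp_size=world_size, dtype=torch.bfloat16, replace_with_kernel_inject=False
)

# Then apply the LLM optimizations on top of the sharded module
model = ipex.llm.optimize(model.module, dtype=torch.bfloat16)
```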
27 changes: 0 additions & 27 deletions examples/cpu/inference/python/llm/tools/env_activate.sh

This file was deleted.

@@ -39,7 +39,7 @@ ENV PATH=/root/.local/bin:${PATH}
FROM base AS dev
ARG COMPILE
COPY . ./intel-extension-for-pytorch
- RUN cd intel-extension-for-pytorch/examples/cpu/inference/python/llm && \
+ RUN cd intel-extension-for-pytorch/examples/cpu/llm && \
export CC=gcc && export CXX=g++ && \
if [ -z ${COMPILE} ]; then bash tools/env_setup.sh 6; else bash tools/env_setup.sh 2; fi && \
unset CC && unset CXX
@@ -53,7 +53,7 @@ RUN apt update && \
apt clean && \
rm -rf /var/lib/apt/lists/* && \
if [ -f /etc/apt/apt.conf.d/proxy.conf ]; then rm /etc/apt/apt.conf.d/proxy.conf; fi
- COPY --from=dev /root/intel-extension-for-pytorch/examples/cpu/inference/python/llm ./llm
+ COPY --from=dev /root/intel-extension-for-pytorch/examples/cpu/llm ./llm
COPY --from=dev /root/intel-extension-for-pytorch/tools/get_libstdcpp_lib.sh ./llm/tools
RUN cd /usr/lib/x86_64-linux-gnu/ && ln -s libtcmalloc.so.4 libtcmalloc.so && cd && \
echo "echo \"**Note:** For better performance, please consider to launch workloads with command 'ipexrun'.\"" >> ./.bashrc && \
@@ -62,8 +62,7 @@ RUN cd /usr/lib/x86_64-linux-gnu/ && ln -s libtcmalloc.so.4 libtcmalloc.so && cd
python -m pip cache purge && \
mv ./oneCCL_release /opt/oneCCL && \
chown -R root:root /opt/oneCCL && \
- sed -i "s|ONECCL_PATH=.*|ONECCL_PATH=/opt/oneCCL|" ./tools/env_activate.sh && \
- LN=$(grep "Conda environment is not available." -n ./tools/env_activate.sh | cut -d ":" -f 1) && sed -i "${LN}s|.*| export LD_PRELOAD=\${LD_PRELOAD}:/usr/lib/x86_64-linux-gnu/libtcmalloc.so:/usr/local/lib/libiomp5.so|" ./tools/env_activate.sh
+ sed -i "s|ONECCL_PATH=.*|ONECCL_PATH=/opt/oneCCL|" ./tools/env_activate.sh
ARG PORT_SSH=22
RUN mkdir /var/run/sshd && \
sed -i "s/#Port.*/Port ${PORT_SSH}/" /etc/ssh/sshd_config && \
133 changes: 133 additions & 0 deletions examples/cpu/llm/README.md
@@ -0,0 +1,133 @@
# 1. LLM Optimization Overview

`ipex.llm` provides dedicated optimizations for running Large Language Models (LLMs) faster, including techniques such as paged attention and ROPE fusion.
A set of data types is supported for various scenarios, including FP32, BF16, Smooth Quantization INT8, and Weight Only Quantization INT8/INT4 (prototype).

<br>

# 2. Environment Setup

Several environment setup methods are provided. You can choose one of them according to your usage scenario. The Docker-based ones are recommended.

## 2.1 [RECOMMENDED] Docker-based environment setup with pre-built wheels

```bash
# Get the Intel® Extension for PyTorch\* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout v2.4.0+cpu
git submodule sync
git submodule update --init --recursive

# Build an image with the provided Dockerfile by installing from Intel® Extension for PyTorch\* prebuilt wheel files
# To use a custom SSH server port for multi-node runs, add --build-arg PORT_SSH=<CUSTOM_PORT> (e.g. 2345); otherwise the default SSH port 22 is used
DOCKER_BUILDKIT=1 docker build -f examples/cpu/llm/Dockerfile --build-arg PORT_SSH=2345 -t ipex-llm:2.4.0 .

# Run the container with the command below
docker run --rm -it --privileged -v /dev/shm:/dev/shm ipex-llm:2.4.0 bash

# Once the command prompt shows up inside the docker container, enter the llm examples directory
cd llm

# Activate environment variables
# set bash script argument to "inference" or "fine-tuning" for different usages
source ./tools/env_activate.sh [inference|fine-tuning]
```

## 2.2 Conda-based environment setup with pre-built wheels

```bash
# Get the Intel® Extension for PyTorch\* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout v2.4.0+cpu
git submodule sync
git submodule update --init --recursive

# GCC 12.3 is required. Installation can be taken care of by the environment configuration script.
# Create a conda environment
conda create -n llm python=3.10 -y
conda activate llm

# Setup the environment with the provided script
cd examples/cpu/llm
bash ./tools/env_setup.sh 7

# Activate environment variables
# set bash script argument to "inference" or "fine-tuning" for different usages
source ./tools/env_activate.sh [inference|fine-tuning]
```

## 2.3 Docker-based environment setup with compilation from source

```bash
# Get the Intel® Extension for PyTorch\* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout v2.4.0+cpu
git submodule sync
git submodule update --init --recursive

# Build an image with the provided Dockerfile by compiling Intel® Extension for PyTorch\* from source
# To use a custom SSH server port for multi-node runs, add --build-arg PORT_SSH=<CUSTOM_PORT> (e.g. 2345); otherwise the default SSH port 22 is used
docker build -f examples/cpu/llm/Dockerfile --build-arg COMPILE=ON --build-arg PORT_SSH=2345 -t ipex-llm:2.4.0 .

# Run the container with the command below
docker run --rm -it --privileged -v /dev/shm:/dev/shm ipex-llm:2.4.0 bash

# Once the command prompt shows up inside the docker container, enter the llm examples directory
cd llm

# Activate environment variables
# set bash script argument to "inference" or "fine-tuning" for different usages
source ./tools/env_activate.sh [inference|fine-tuning]
```

## 2.4 Conda-based environment setup with compilation from source

```bash
# Get the Intel® Extension for PyTorch\* source code
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout v2.4.0+cpu
git submodule sync
git submodule update --init --recursive

# GCC 12.3 is required. Installation can be taken care of by the environment configuration script.
# Create a conda environment
conda create -n llm python=3.10 -y
conda activate llm

# Setup the environment with the provided script
cd examples/cpu/llm
bash ./tools/env_setup.sh

# Activate environment variables
# set bash script argument to "inference" or "fine-tuning" for different usages
source ./tools/env_activate.sh [inference|fine-tuning]
```

<br>

*Note*: The `env_activate.sh` script downloads a `prompt.json` file, which provides prompt samples with pre-defined input token lengths for benchmarking.
For benchmarking **Llama-3 models**, download the dedicated `prompt.json` file below, overwriting the original one.

```bash
wget -O prompt.json https://intel-extension-for-pytorch.s3.amazonaws.com/miscellaneous/llm/prompt-3.json
```

The original `prompt.json` file can be restored from the repository if needed.

```bash
wget https://intel-extension-for-pytorch.s3.amazonaws.com/miscellaneous/llm/prompt.json
```

<br>

# 3. How To Run LLM with ipex.llm

Inference and fine-tuning are supported in their respective directories.

For inference example scripts, visit the [inference](./inference/) directory.

For fine-tuning example scripts, visit the [fine-tuning](./fine-tuning/) directory.
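
As an illustration, a single-instance generation run can be launched along these lines (the script name follows the SmoothQuant example referenced earlier, and the flags are assumptions; the inference README documents the exact options):

```bash
cd inference
# The model id is only an example; any model from the verified list works
python single_instance/run_generation.py --benchmark -m meta-llama/Llama-2-7b-hf --dtype bfloat16 --ipex
```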