diff --git a/xpu/2.1.10+xpu/_sources/tutorials/features/torch_compile_gpu.md.txt b/xpu/2.1.10+xpu/_sources/tutorials/features/torch_compile_gpu.md.txt
index 232f20f99..3ddc683db 100644
--- a/xpu/2.1.10+xpu/_sources/tutorials/features/torch_compile_gpu.md.txt
+++ b/xpu/2.1.10+xpu/_sources/tutorials/features/torch_compile_gpu.md.txt
@@ -1,11 +1,34 @@
-torch.compile for GPU
-=====================
+torch.compile for GPU (Experimental)
+====================================
## Introduction
Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph compilation capabilities for optimal PyTorch model performance on Intel GPU via the flagship [torch.compile](https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile) API through the default "inductor" backend ([TorchInductor](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747/1)). The Triton compiler has been the core of the Inductor codegen supporting various accelerator devices. Intel has extended TorchInductor by adding Intel GPU support to Triton. Additionally, post-op fusions for convolution and matrix multiplication, facilitated by oneDNN fusion kernels, contribute to enhanced efficiency for computationally intensive operations. Leveraging these features is as simple as using the default "inductor" backend, making it easier than ever to unlock the full potential of your PyTorch models on Intel GPU platforms.
-**Note**: `torch.compile` for GPU is an experimental feature and available from 2.1.10. So far, the feature is functional on Intel® GPU Max Series.
+**Note**: `torch.compile` for GPU is an experimental feature, available from 2.1.10. So far, the feature is functional on Intel® Data Center GPU Max Series.
+
+## Required Dependencies
+
+**Verified versions**:
+- `torch` : v2.1.0
+- `intel_extension_for_pytorch` : v2.1.10
+- `triton` : [v2.1.0](https://github.com/intel/intel-xpu-backend-for-triton/releases/tag/v2.1.0) with the Intel® XPU Backend for Triton* enabled.
+
+First, follow [Intel® Extension for PyTorch* Installation](https://intel.github.io/intel-extension-for-pytorch/xpu/2.1.10+xpu/tutorials/installation.html) to install `torch` and `intel_extension_for_pytorch`.
+
+Then install the [Intel® XPU Backend for Triton*](https://github.com/intel/intel-xpu-backend-for-triton), which provides the `triton` package. You can install it from a prebuilt wheel package or build it from source. We recommend installing the prebuilt package:
+
+- Download the wheel package from the [release page](https://github.com/intel/intel-xpu-backend-for-triton/releases). Note that you don't need to install the LLVM release manually.
+- Install the wheel package with `pip install`. Note that this wheel package is a `triton` package with Intel GPU support, so you don't need to run `pip install triton` separately.
+
+```Bash
+python -m pip install --force-reinstall triton-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
+```
+
+Please follow the [Intel® XPU Backend for Triton* Installation](https://github.com/intel/intel-xpu-backend-for-triton?tab=readme-ov-file#setup-guide) for more detailed installation steps.
+
+Note that if you install `triton` using the `make triton` command inside the PyTorch\* repo, the installed `triton` is not compiled with Intel GPU support by default; you need to manually set `TRITON_CODEGEN_INTEL_XPU_BACKEND=1` to enable Intel GPU support. In addition, when building from source via the `triton` [repo](https://github.com/openai/triton.git), the commit needs to be pinned to a tested [triton commit](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/triton_hash.txt). Please follow the [Intel® XPU Backend for Triton* Installation #build from the source](https://github.com/intel/intel-xpu-backend-for-triton?tab=readme-ov-file#option-2-build-from-the-source) section for more information about building the `triton` package from source.
+
### Inference with torch.compile
diff --git a/xpu/2.1.10+xpu/genindex.html b/xpu/2.1.10+xpu/genindex.html
index 596305a51..e6341702e 100644
--- a/xpu/2.1.10+xpu/genindex.html
+++ b/xpu/2.1.10+xpu/genindex.html
@@ -446,7 +446,7 @@
X
Built with Sphinx using a
theme
provided by Read the Docs.
-
+
diff --git a/xpu/2.1.10+xpu/index.html b/xpu/2.1.10+xpu/index.html
index 1f74f9c3d..127ea1d81 100644
--- a/xpu/2.1.10+xpu/index.html
+++ b/xpu/2.1.10+xpu/index.html
@@ -182,7 +182,7 @@ Support using a
theme
provided by Read the Docs.
-
+
diff --git a/xpu/2.1.10+xpu/objects.inv b/xpu/2.1.10+xpu/objects.inv
index 9b4d8d3bb..b9c2451ae 100644
Binary files a/xpu/2.1.10+xpu/objects.inv and b/xpu/2.1.10+xpu/objects.inv differ
diff --git a/xpu/2.1.10+xpu/py-modindex.html b/xpu/2.1.10+xpu/py-modindex.html
index f83fe8aa7..e754a9f25 100644
--- a/xpu/2.1.10+xpu/py-modindex.html
+++ b/xpu/2.1.10+xpu/py-modindex.html
@@ -150,7 +150,7 @@ Python Module Index
Built with Sphinx using a
theme
provided by Read the Docs.
-
+
diff --git a/xpu/2.1.10+xpu/search.html b/xpu/2.1.10+xpu/search.html
index f4f895039..e27dfaa5b 100644
--- a/xpu/2.1.10+xpu/search.html
+++ b/xpu/2.1.10+xpu/search.html
@@ -133,7 +133,7 @@
Built with Sphinx using a
theme
provided by Read the Docs.
-
+
diff --git a/xpu/2.1.10+xpu/searchindex.js b/xpu/2.1.10+xpu/searchindex.js
index ecb765c28..831e9353c 100644
--- a/xpu/2.1.10+xpu/searchindex.js
+++ b/xpu/2.1.10+xpu/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"docnames": ["index", "tutorials/api_doc", "tutorials/blogs_publications", "tutorials/cheat_sheet", "tutorials/contribution", "tutorials/examples", "tutorials/features", "tutorials/features/DDP", "tutorials/features/DLPack", "tutorials/features/DPC++_Extension", "tutorials/features/FSDP", "tutorials/features/advanced_configuration", "tutorials/features/amp_cpu", "tutorials/features/amp_gpu", "tutorials/features/auto_channels_last", "tutorials/features/codeless_optimization", "tutorials/features/compute_engine", "tutorials/features/float8", "tutorials/features/graph_capture", "tutorials/features/horovod", "tutorials/features/hypertune", "tutorials/features/int4", "tutorials/features/int8_overview", "tutorials/features/int8_overview_xpu", "tutorials/features/int8_recipe_tuning_api", "tutorials/features/nhwc", "tutorials/features/profiler_kineto", "tutorials/features/profiler_legacy", "tutorials/features/runtime_extension", "tutorials/features/simple_trace", "tutorials/features/torch_compile_gpu", "tutorials/getting_started", "tutorials/installation", "tutorials/introduction", "tutorials/license", "tutorials/llm", "tutorials/llm/llm_optimize_transformers", "tutorials/performance_tuning", "tutorials/performance_tuning/known_issues", "tutorials/performance_tuning/launch_script", "tutorials/performance_tuning/torchserve", "tutorials/performance_tuning/tuning_guide", "tutorials/releases", "tutorials/technical_details", "tutorials/technical_details/AOT", "tutorials/technical_details/graph_optimization", "tutorials/technical_details/isa_dynamic_dispatch", "tutorials/technical_details/memory_management", "tutorials/technical_details/optimizer_fusion_cpu", "tutorials/technical_details/optimizer_fusion_gpu", "tutorials/technical_details/split_sgd"], "filenames": ["index.rst", "tutorials/api_doc.rst", "tutorials/blogs_publications.md", "tutorials/cheat_sheet.md", "tutorials/contribution.md", "tutorials/examples.md", "tutorials/features.rst", 
"tutorials/features/DDP.md", "tutorials/features/DLPack.md", "tutorials/features/DPC++_Extension.md", "tutorials/features/FSDP.md", "tutorials/features/advanced_configuration.md", "tutorials/features/amp_cpu.md", "tutorials/features/amp_gpu.md", "tutorials/features/auto_channels_last.md", "tutorials/features/codeless_optimization.md", "tutorials/features/compute_engine.md", "tutorials/features/float8.md", "tutorials/features/graph_capture.md", "tutorials/features/horovod.md", "tutorials/features/hypertune.md", "tutorials/features/int4.md", "tutorials/features/int8_overview.md", "tutorials/features/int8_overview_xpu.md", "tutorials/features/int8_recipe_tuning_api.md", "tutorials/features/nhwc.md", "tutorials/features/profiler_kineto.md", "tutorials/features/profiler_legacy.md", "tutorials/features/runtime_extension.md", "tutorials/features/simple_trace.md", "tutorials/features/torch_compile_gpu.md", "tutorials/getting_started.md", "tutorials/installation.rst", "tutorials/introduction.rst", "tutorials/license.md", "tutorials/llm.rst", "tutorials/llm/llm_optimize_transformers.md", "tutorials/performance_tuning.rst", "tutorials/performance_tuning/known_issues.md", "tutorials/performance_tuning/launch_script.md", "tutorials/performance_tuning/torchserve.md", "tutorials/performance_tuning/tuning_guide.md", "tutorials/releases.md", "tutorials/technical_details.rst", "tutorials/technical_details/AOT.md", "tutorials/technical_details/graph_optimization.md", "tutorials/technical_details/isa_dynamic_dispatch.md", "tutorials/technical_details/memory_management.rst", "tutorials/technical_details/optimizer_fusion_cpu.md", "tutorials/technical_details/optimizer_fusion_gpu.md", "tutorials/technical_details/split_sgd.rst"], "titles": ["Intel\u00ae Extension for PyTorch*", "API Documentation", "Blogs & Publications", "Cheat Sheet", "Contribution", "Examples", "Features", "DistributedDataParallel (DDP)", "DLPack Solution", "DPC++ Extension", "Fully Sharded Data Parallel (FSDP)", 
"Advanced Configuration", "Auto Mixed Precision (AMP) on CPU", "Auto Mixed Precision (AMP) on GPU", "Auto Channels Last", "Codeless Optimization (Experimental)", "Compute Engine (Experimental feature for debug)", "Float8 Data Type Support [GPU] (Experimental)", "Graph Capture (Experimental)", "Horovod with PyTorch (Experimental)", "HyperTune (Experimental)", "INT4 inference [GPU] (Experimental)", "Intel\u00ae Extension for PyTorch* optimizations for quantization [CPU]", "Intel\u00ae Extension for PyTorch* Optimizations for Quantization [GPU]", "INT8 Recipe Tuning API (Experimental) [CPU]", "Channels Last", "Kineto Supported Profiler Tool (Experimental)", "Legacy Profiler Tool (Experimental)", "Runtime Extension", "Simple Trace Tool (Experimental)", "torch.compile for GPU", "Quick Start", "Installation", "Introduction", "License", "Large Language Models (LLM) Optimizations Overview", "Transformers Optimization Frontend API", "Performance Tuning Guide", "Troubleshooting", "Launch Script Usage Guide", "TorchServe with Intel\u00ae Extension for PyTorch*", "Performance Tuning Guide", "Releases", "Technical Details", "Ahead of Time (AOT) Compilation", "Graph Optimization", "ISA Dynamic Dispatching", "Memory Management", "Optimizer Fusion on CPU", "Optimizer Fusion on GPU", "Split SGD"], "terms": {"intel optim": 0, "intel\u00ae extension for pytorch*": 0, "gpu": [0, 2, 3, 4, 5, 9, 11, 16, 19, 26, 31, 33, 35, 44, 47], "discrete gpu": 0, "intel discrete gpu": 0, "extend": [0, 6, 8, 30, 33, 35, 41, 42], "latest": [0, 7, 8, 19, 31, 33, 35, 38], "perform": [0, 1, 2, 3, 5, 6, 9, 12, 13, 14, 15, 16, 20, 21, 22, 23, 25, 30, 33, 35, 36, 38, 39, 42, 43, 45, 48, 49, 50], "optim": [0, 1, 2, 3, 7, 10, 12, 13, 14, 16, 18, 19, 20, 24, 25, 28, 30, 31, 33, 38, 39, 40, 41, 42, 50], "hardwar": [0, 2, 6, 33, 35, 37, 40, 42, 46], "take": [0, 1, 9, 12, 13, 15, 18, 20, 25, 33, 39, 41, 42, 43, 45, 50], "advantag": [0, 1, 14, 18, 25, 33, 39, 41, 42, 43, 50], "advanc": [0, 1, 9, 24, 31, 42, 43, 
47], "vector": [0, 1, 5, 9, 25, 42], "512": [0, 5, 24, 39, 42], "avx": [0, 42, 46], "neural": [0, 2, 6, 16, 17, 24, 41, 42], "network": [0, 2, 6, 12, 13, 16, 17, 28, 41, 42], "instruct": [0, 4, 5, 6, 12, 31, 32, 33, 35, 38, 41, 42, 43, 50], "vnni": [0, 22, 42], "matrix": [0, 6, 30, 33, 42], "amx": [0, 2, 6, 42, 46], "cpu": [0, 2, 3, 5, 7, 15, 20, 26, 27, 28, 31, 39, 40, 42], "well": [0, 1, 4, 5, 6, 21, 28, 35, 37, 40, 41, 42, 50], "x": [0, 9, 10, 12, 13, 15, 22, 24, 25, 28, 33, 38, 44, 45, 50], "e": [0, 1, 5, 9, 12, 13, 18, 23, 25, 33, 35, 38, 39, 41, 42, 43, 44], "xmx": [0, 33, 42], "ai": [0, 2, 11, 33, 35, 38, 42], "engin": [0, 5, 25, 33, 41, 42], "discret": [0, 33, 42], "moreov": [0, 1, 35, 42], "provid": [0, 1, 4, 5, 6, 7, 9, 10, 12, 13, 16, 18, 20, 24, 28, 32, 33, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 49], "easi": [0, 2, 19, 33, 37, 42, 50], "acceler": [0, 1, 2, 6, 17, 30, 33, 42, 45], "through": [0, 1, 5, 6, 9, 12, 13, 18, 30, 33, 41, 42], "xpu": [0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 13, 16, 17, 19, 23, 27, 29, 30, 31, 33, 36, 38], "devic": [0, 5, 7, 8, 9, 10, 11, 16, 19, 22, 25, 27, 30, 33, 35, 36, 38, 39, 42, 43, 44], "In": [0, 1, 5, 6, 9, 12, 13, 16, 18, 24, 25, 26, 27, 29, 35, 39, 40, 41, 42, 48, 50], "current": [0, 1, 4, 6, 8, 10, 16, 17, 20, 21, 22, 23, 24, 26, 28, 29, 35, 36, 38, 44, 45, 46, 48, 49], "technolog": [0, 35], "landscap": [0, 35], "gener": [0, 4, 5, 6, 7, 8, 9, 15, 16, 18, 21, 23, 24, 25, 35, 36, 37, 39, 40, 41, 42, 43, 44, 46, 50], "genai": [0, 35], "workload": [0, 5, 6, 12, 13, 15, 18, 23, 35, 36, 38, 39, 41, 42, 43, 50], "model": [0, 1, 2, 3, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 30, 31, 36, 38, 41, 42], "have": [0, 1, 4, 5, 7, 8, 9, 14, 16, 20, 23, 25, 26, 27, 28, 29, 31, 34, 35, 38, 39, 40, 41, 44, 46, 50], "gain": [0, 6, 35, 38], "widespread": [0, 35], "attent": [0, 35], "popular": [0, 6, 8, 35], "larg": [0, 1, 6, 10, 21, 36, 38, 41, 42, 48, 49], "languag": [0, 9, 36, 38, 42], "llm": [0, 36, 38, 42], "emerg": [0, 
35], "domin": [0, 35], "drive": [0, 35], "applic": [0, 1, 5, 28, 35, 40, 41, 43, 44, 47], "start": [0, 1, 2, 3, 4, 5, 7, 15, 19, 26, 28, 29, 32, 38], "from": [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 13, 15, 17, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 50], "2": [0, 1, 2, 5, 6, 7, 8, 9, 10, 12, 13, 15, 25, 26, 28, 29, 30, 34, 35, 38, 39, 41, 43, 44, 46, 50], "1": [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 18, 24, 25, 26, 28, 29, 30, 35, 38, 39, 41, 45, 46, 48, 49, 50], "0": [0, 1, 3, 4, 5, 7, 9, 10, 11, 12, 13, 15, 19, 24, 26, 27, 28, 29, 34, 38, 39, 40, 41, 45, 48, 49, 50], "specif": [0, 5, 8, 11, 16, 18, 19, 25, 28, 35, 39, 41, 42], "certain": [0, 1, 6, 36, 38, 39, 41], "ar": [0, 1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 15, 17, 19, 20, 21, 23, 25, 26, 27, 28, 29, 31, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49, 50], "introduc": [0, 2, 9, 22, 25, 39, 41, 42, 50], "For": [0, 1, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 20, 22, 25, 26, 27, 28, 31, 33, 35, 38, 39, 40, 41, 42, 43, 44, 45, 47, 48, 49, 50], "more": [0, 1, 4, 6, 7, 9, 10, 11, 12, 13, 15, 23, 24, 26, 27, 28, 29, 31, 35, 38, 40, 41, 42, 43, 44, 45, 47, 48, 49, 50], "inform": [0, 1, 4, 6, 7, 8, 9, 10, 20, 23, 25, 26, 27, 31, 35, 39, 40, 41, 42, 43], "refer": [0, 1, 7, 9, 10, 11, 14, 16, 20, 24, 25, 26, 27, 28, 31, 32, 33, 40, 42, 44, 45], "section": [0, 5, 12, 13, 20, 23, 28, 32, 33, 36, 40, 41], "The": [0, 1, 4, 5, 6, 7, 8, 9, 11, 12, 13, 16, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49, 50], "can": [0, 1, 4, 5, 6, 7, 8, 9, 11, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50], "load": [0, 1, 5, 9, 22, 26, 27, 38, 40, 42, 45], "python": [0, 1, 3, 4, 7, 8, 9, 10, 11, 15, 19, 20, 26, 27, 28, 31, 35, 36, 38, 39, 40, 41, 42, 46], "modul": [0, 1, 5, 6, 7, 9, 10, 12, 13, 23, 24, 25, 36, 38, 39, 42, 45, 46], 
"program": [0, 1, 6, 28, 39, 41], "link": [0, 5, 46], "c": [0, 6, 7, 9, 12, 13, 24, 28, 29, 38, 39, 40, 41, 46], "librari": [0, 1, 5, 6, 7, 8, 9, 10, 11, 16, 26, 27, 28, 29, 40, 41, 42], "script": [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 13, 15, 19, 20, 23, 28, 29, 31, 35, 36, 37, 38, 40, 41, 43], "user": [0, 1, 5, 6, 7, 11, 14, 15, 16, 18, 22, 24, 25, 27, 28, 30, 37, 38, 39, 40, 41, 42, 43, 44, 45, 47], "enabl": [0, 1, 2, 3, 5, 6, 7, 11, 12, 13, 15, 17, 23, 25, 26, 27, 28, 31, 35, 38, 39, 40, 41, 42, 43, 44, 45], "dynam": [0, 3, 5, 17, 28, 40, 41], "import": [0, 1, 3, 4, 5, 7, 9, 10, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 35, 36, 38, 40, 41, 42, 45, 46, 50], "intel_extension_for_pytorch": [0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 15, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 36, 38, 40, 42, 45, 46], "featur": [0, 1, 2, 4, 5, 12, 13, 15, 20, 25, 28, 30, 33, 38, 39, 40, 41, 42, 43, 44, 45], "includ": [0, 1, 5, 6, 7, 9, 15, 20, 22, 26, 27, 31, 34, 35, 38, 42], "onli": [0, 1, 4, 5, 6, 8, 9, 11, 12, 13, 15, 16, 17, 19, 20, 22, 23, 25, 26, 28, 31, 35, 38, 39, 40, 42, 45, 46, 50], "packag": [0, 5, 7, 9, 10, 15, 31, 38, 40, 41, 42], "mai": [0, 1, 2, 4, 8, 9, 12, 13, 14, 16, 23, 25, 26, 27, 28, 38, 39, 40, 41, 42, 43], "newer": [0, 41], "code": [0, 1, 4, 6, 9, 10, 11, 15, 16, 18, 19, 25, 26, 27, 29, 31, 32, 34, 36, 38, 41, 42, 43, 44, 45, 47, 48, 49, 50], "base": [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 15, 16, 19, 28, 31, 35, 36, 38, 40, 41, 42, 46, 50], "due": [0, 12, 13, 15, 23, 28, 38, 42], "differ": [0, 1, 5, 6, 7, 8, 22, 25, 26, 27, 28, 35, 39, 40, 41], "develop": [0, 2, 3, 5, 9, 38, 41, 43, 44], "schedul": [0, 1, 10, 26, 28, 39, 41, 45], "ha": [0, 1, 5, 6, 8, 9, 15, 16, 20, 25, 28, 30, 38, 39, 41, 42, 44, 50], "been": [0, 1, 5, 6, 9, 15, 25, 30, 39, 41, 42, 46], "releas": [0, 1, 7, 14, 25, 38, 41, 44, 47], "an": [0, 1, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 19, 20, 23, 25, 26, 27, 28, 30, 31, 35, 37, 38, 39, 40, 41, 42, 45, 46, 48, 49, 50], "open": [0, 24, 
38, 41, 42], "sourc": [0, 4, 9, 11, 19, 26, 27, 29, 31, 34, 38, 41, 44], "project": [0, 5, 9], "github": [0, 4, 6, 7, 8, 10, 12, 13], "you": [0, 1, 4, 5, 6, 7, 9, 10, 12, 13, 19, 20, 21, 22, 25, 26, 27, 28, 29, 31, 35, 36, 38, 39, 41, 42, 43, 44, 45, 47], "find": [0, 1, 5, 8, 9, 20, 26, 27, 38, 39, 42, 43], "how": [0, 1, 5, 6, 7, 8, 9, 15, 22, 25, 31, 40, 41], "get": [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 15, 22, 26, 27, 28, 35, 38, 39, 41, 42, 50], "main": [0, 4, 5, 10, 20, 28, 30, 39, 40], "branch": [0, 6], "quick": [0, 28, 32, 33], "about": [0, 1, 4, 7, 9, 10, 24, 40, 41, 45], "product": [0, 6, 20, 35, 42, 43, 44], "i": [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 26, 27, 29, 30, 31, 34, 35, 36, 38, 40, 41, 42, 43, 44, 45, 46, 48, 49, 50], "structur": [0, 1, 6, 8, 21, 39], "shown": [0, 1, 5, 7, 25, 26, 27, 29, 35, 39, 40], "follow": [0, 1, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 16, 17, 19, 20, 22, 23, 25, 26, 27, 29, 31, 32, 34, 35, 36, 38, 39, 40, 41, 42, 50], "figur": [0, 8, 35, 41, 50], "eager": [0, 18, 23, 40, 42, 43], "mode": [0, 1, 4, 6, 11, 15, 18, 25, 28, 31, 38, 40, 42, 43], "frontend": [0, 1, 6, 28, 35, 42], "custom": [0, 1, 6, 9, 11, 16, 26, 35, 38, 42], "fusion": [0, 1, 5, 15, 21, 23, 30, 36, 42, 43, 50], "int8": [0, 1, 2, 3, 6, 23, 28, 36, 42], "quantiz": [0, 2, 3, 5, 24, 35, 38, 40, 42, 45], "api": [0, 2, 5, 9, 10, 11, 15, 22, 23, 26, 28, 30, 35, 38, 41, 42], "further": [0, 1, 5, 6, 25, 28, 35, 37, 41, 43], "improv": [0, 2, 6, 12, 13, 17, 21, 28, 35, 40, 41, 42, 45], "achiev": [0, 1, 5, 41], "convert": [0, 1, 3, 5, 6, 8, 12, 13, 14, 15, 17, 21, 23, 24, 25, 28, 36, 38, 40, 42, 45], "graph": [0, 1, 3, 12, 13, 15, 23, 30, 38, 39, 42], "us": [0, 1, 2, 3, 4, 7, 10, 11, 17, 19, 20, 21, 22, 23, 24, 25, 30, 31, 32, 34, 35, 37, 38, 40, 41, 42, 43, 46, 47, 48, 50], "pass": [0, 1, 4, 5, 9, 15, 26, 27, 28, 38, 40], "reduc": [0, 1, 6, 10, 17, 21, 22, 26, 28, 35, 38, 41, 42, 43, 48, 49, 50], "oper": [0, 1, 5, 8, 9, 11, 12, 13, 22, 23, 
26, 27, 29, 30, 38, 40, 41, 42, 43, 45, 50], "kernel": [0, 1, 6, 9, 11, 16, 26, 28, 30, 35, 38, 41, 42, 46], "invoc": [0, 38, 42], "overhead": [0, 1, 6, 9, 15, 27, 28, 35, 38, 41, 42, 43, 48, 49], "result": [0, 1, 9, 15, 18, 20, 25, 28, 39, 40, 41, 50], "compar": [0, 1, 6, 25, 38, 39, 41, 43, 45, 49, 50], "normal": [0, 1, 5, 10, 19, 26, 28, 35, 41, 43], "yield": [0, 37, 41, 43], "better": [0, 1, 6, 16, 22, 23, 25, 28, 35, 39, 40, 41, 42, 43, 49], "techniqu": [0, 1, 9, 18, 35], "like": [0, 1, 2, 4, 6, 9, 12, 20, 21, 23, 26, 27, 35, 38, 39, 41, 42, 48, 50], "amplifi": 0, "them": [0, 4, 11, 19, 25, 26, 27, 38, 39, 41, 42, 48, 49], "comprehens": [0, 47], "both": [0, 1, 5, 6, 8, 17, 23, 25, 36, 39, 40, 41, 42, 44, 48, 49, 50], "torchscript": [0, 1, 6, 15, 18, 31, 38, 40, 43, 48, 49], "torchdynamo": [0, 6, 18], "With": [0, 1, 6, 7, 9, 15, 19, 23, 26, 27, 28, 29, 39], "we": [0, 1, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 20, 21, 22, 23, 24, 25, 28, 35, 38, 40, 41, 42, 43, 47, 48, 49, 50], "recommend": [0, 5, 6, 7, 14, 15, 16, 22, 23, 24, 28, 31, 38, 39, 41, 42, 43], "torch": [0, 1, 3, 6, 7, 8, 9, 10, 12, 13, 15, 16, 17, 18, 19, 22, 23, 24, 25, 26, 27, 28, 29, 31, 36, 38, 40, 41, 42, 43, 45, 49], "jit": [0, 1, 5, 12, 13, 22, 23, 24, 25, 28, 31, 36, 38, 40, 42, 43, 44, 45], "trace": [0, 5, 11, 12, 13, 18, 22, 23, 24, 28, 31, 36, 38, 40, 43, 45], "your": [0, 4, 5, 7, 9, 10, 12, 13, 15, 19, 20, 22, 26, 27, 28, 29, 30, 31, 32, 34, 38, 43, 44, 47], "prefer": [0, 16, 22, 32], "option": [0, 1, 7, 11, 15, 17, 20, 22, 26, 27, 30, 39, 44], "wider": [0, 5], "rang": [0, 7, 9, 10, 17, 19, 22, 23, 24, 26, 36, 38, 39, 40, 48, 50], "ipex": [0, 1, 2, 3, 5, 6, 9, 14, 18, 21, 22, 24, 28, 29, 31, 35, 36, 38, 39, 40, 41, 42, 45], "backend": [0, 1, 2, 6, 7, 8, 9, 10, 16, 18, 19, 24, 30, 35, 38, 39, 41, 42, 44, 45], "avail": [0, 1, 5, 6, 7, 9, 11, 16, 28, 30, 31, 36, 39, 41, 42, 43, 46, 47], "good": [0, 1, 4, 6, 18, 25, 35, 41, 48, 49], "On": [0, 1, 6, 17, 21, 25, 35, 41], "automat": [0, 1, 5, 
6, 14, 15, 17, 18, 22, 23, 25, 26, 29, 35, 39, 40, 41, 42, 43, 44, 45], "dispatch": [0, 9, 42], "underli": [0, 6, 35, 46, 47], "detect": [0, 5, 18, 38, 41, 46], "set": [0, 1, 3, 4, 5, 6, 7, 9, 10, 11, 12, 16, 19, 20, 22, 27, 29, 31, 38, 39, 40, 41, 42, 43, 44, 50], "isa": 0, "leverag": [0, 6, 30, 40], "unit": [0, 9, 41], "runtim": [0, 8, 9, 10, 12, 13, 26, 27, 31, 39, 41, 42, 44, 45], "offer": [0, 4, 26, 41, 47], "finer": [0, 6, 28], "grain": [0, 2, 6, 28], "thread": [0, 1, 5, 6, 9, 28, 29, 38, 39, 40, 41], "control": [0, 1, 6, 26, 27, 28, 29, 38, 39, 41], "weight": [0, 1, 5, 7, 9, 15, 17, 18, 19, 22, 23, 25, 28, 35, 42, 43], "share": [0, 4, 6, 7, 8, 9, 28, 38, 40, 41, 42], "increas": [0, 1, 2, 19, 26, 27, 35, 38, 41, 42, 43, 44, 47, 50], "effici": [0, 7, 9, 10, 17, 21, 28, 30, 35, 39, 41, 42, 48, 49], "implement": [0, 4, 5, 6, 7, 8, 9, 10, 25, 35, 38, 41, 42, 48, 49], "regist": [0, 11, 15, 42], "mechan": [0, 6, 9, 42, 46, 50], "These": [0, 5, 6, 12, 13, 17, 35, 42, 45], "nativ": [0, 6, 12, 13, 38, 42, 48, 49, 50], "calcul": [0, 1, 9, 12, 13, 26, 27, 42, 50], "util": [0, 5, 6, 7, 8, 9, 10, 15, 16, 17, 19, 22, 24, 25, 35, 38, 39, 41, 44, 50], "dpc": [0, 8, 11, 38, 42], "compil": [0, 4, 5, 6, 7, 11, 26, 27, 31, 38, 41, 42], "sycl": [0, 1, 6, 8, 11, 16, 42, 43], "standard": [0, 9, 35], "also": [0, 1, 5, 6, 8, 9, 11, 15, 20, 23, 25, 26, 27, 35, 36, 38, 39, 41, 42, 43, 44, 45, 47, 48, 49], "number": [0, 4, 5, 6, 7, 9, 10, 19, 20, 26, 27, 28, 29, 38, 40, 42, 48, 49, 50], "which": [0, 1, 5, 6, 7, 8, 9, 11, 12, 13, 15, 17, 20, 21, 22, 25, 26, 28, 29, 35, 38, 39, 40, 41, 42, 43, 44, 47], "found": [0, 5, 20, 23, 25, 36, 39, 40, 41, 42, 43], "doc": [0, 4, 23, 36], "directori": [0, 4, 9, 20, 31, 36, 38, 39, 40, 42], "team": [0, 4], "track": [0, 1], "bug": [0, 4, 43, 44], "enhanc": [0, 2, 30, 35], "request": [0, 1, 4, 28, 40, 43], "issu": [0, 1, 4, 12, 13, 37, 41, 50], "befor": [0, 1, 4, 5, 6, 11, 20, 23, 25, 26, 27, 28, 29, 38, 39, 41, 43, 44, 45], "submit": [0, 1, 4, 6, 9, 
28], "suggest": [0, 1, 22, 25, 26, 28, 41], "report": [0, 38], "search": [0, 3, 4, 6, 35, 39], "exist": [0, 4, 23, 26, 38, 39, 41, 42, 45], "see": [0, 1, 4, 9, 12, 13, 17, 20, 25, 26, 27, 29, 38, 42, 44], "alreadi": [0, 4, 5, 19, 25, 35, 41, 43], "dtype": [1, 3, 5, 12, 13, 15, 16, 17, 22, 23, 24, 26, 27, 30, 31, 36, 38, 39, 42, 45], "none": [1, 7, 10, 39, 49], "level": [1, 6, 9, 11, 15, 25, 26, 28, 35, 38, 41, 42, 44, 45, 50], "o1": [1, 38], "inplac": [1, 3, 22, 23, 24, 25, 36, 40, 45], "fals": [1, 3, 5, 10, 12, 13, 20, 22, 23, 24, 25, 26, 27, 28, 29, 31, 36, 38, 39, 40, 45, 46], "conv_bn_fold": [1, 38], "linear_bn_fold": 1, "weights_prepack": [1, 6, 38], "replace_dropout_with_ident": 1, "optimize_lstm": 1, "split_master_weight_for_bf16": 1, "fuse_update_step": 1, "auto_kernel_select": [1, 6], "sample_input": [1, 14], "graph_mod": [1, 3, 6, 18], "concat_linear": 1, "appli": [1, 5, 6, 12, 13, 18, 19, 21, 25, 31, 35, 36, 38, 39, 42, 45, 48, 49, 50], "given": [1, 20, 21, 35, 45], "nn": [1, 5, 6, 7, 10, 12, 13, 15, 16, 17, 22, 24, 25, 28, 38, 45], "If": [1, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 20, 22, 23, 24, 25, 28, 38, 39, 40, 41, 42, 43, 44, 45], "train": [1, 2, 3, 7, 10, 17, 19, 21, 22, 23, 24, 25, 31, 36, 39, 42, 43, 50], "otherwis": [1, 10, 11, 28], "infer": [1, 2, 3, 6, 15, 17, 18, 22, 23, 25, 28, 30, 31, 38, 41, 42, 50], "conv": [1, 12, 13, 15, 22, 24, 28, 38, 45], "bn": [1, 15, 22, 24, 38], "fold": [1, 15, 22, 24, 38], "prepack": [1, 15, 25, 35], "so": [1, 5, 6, 8, 12, 13, 19, 22, 25, 28, 29, 30, 38, 39, 40, 41, 42, 47, 49], "onednn": [1, 2, 6, 11, 16, 30, 35, 38, 42, 45], "order": [1, 8, 16, 17, 25, 27, 29, 38, 39, 41, 50], "cach": [1, 4, 9, 11, 28, 42, 43, 47, 48, 49], "reus": [1, 41], "layout": [1, 6, 38], "call": [1, 6, 7, 9, 12, 13, 25, 26, 27, 29, 38, 40, 41, 45, 47, 50], "block": [1, 4, 28, 41, 42, 43], "although": [1, 41], "itself": [1, 25, 26, 27], "fast": [1, 3, 9, 18, 19, 41, 43], "enough": [1, 38, 48], "usag": [1, 6, 8, 12, 13, 16, 23, 25, 26, 27, 
31, 33, 37, 40, 41, 42], "perspect": [1, 25, 39, 41, 45, 50], "drawback": [1, 50], "run": [1, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 15, 18, 20, 24, 26, 27, 28, 29, 31, 38, 39, 40, 41, 42, 43, 44], "split": [1, 9, 11, 28, 38, 43, 48], "one": [1, 4, 7, 8, 9, 11, 16, 18, 19, 20, 23, 25, 26, 28, 36, 38, 39, 41, 42, 45, 48, 49], "sever": [1, 6, 15, 26, 27, 37, 38, 39, 48, 49], "dimens": [1, 9, 16, 25], "data": [1, 3, 5, 7, 9, 12, 13, 14, 15, 18, 19, 23, 24, 25, 28, 31, 36, 38, 39, 40, 42, 44, 48, 49, 50], "fix": [1, 4, 6, 21, 38], "size": [1, 5, 6, 7, 8, 9, 10, 19, 22, 24, 25, 35, 38, 40, 41, 43, 44], "each": [1, 6, 7, 9, 10, 11, 12, 13, 16, 19, 20, 26, 27, 28, 29, 38, 39, 40, 41, 48, 50], "time": [1, 4, 6, 7, 9, 10, 20, 25, 26, 27, 35, 38, 41, 45, 48, 49], "execut": [1, 3, 5, 6, 8, 9, 11, 12, 13, 15, 16, 18, 20, 21, 24, 26, 28, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49], "detail": [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16, 23, 25, 31, 33, 35, 38, 40, 41, 42, 45, 46], "mermori": 1, "format": [1, 4, 6, 7, 8, 10, 14, 16, 17, 19, 20, 26, 27, 29, 39, 41, 42], "manual": [1, 6, 15, 16, 20, 25, 28], "To": [1, 4, 5, 6, 7, 10, 15, 19, 22, 25, 26, 27, 28, 29, 31, 35, 37, 40, 41, 42, 43, 45, 50], "thi": [1, 4, 5, 6, 7, 8, 9, 10, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 34, 35, 36, 37, 38, 39, 42, 43, 44, 45, 46, 48, 49, 50], "predefin": 1, "shape": [1, 6, 8, 16, 24, 26, 27, 28, 35, 41], "prior": [1, 31], "match": [1, 12, 13, 39], "requir": [1, 4, 5, 7, 8, 9, 12, 13, 15, 16, 17, 23, 24, 25, 31, 35, 36, 38, 39, 40, 42, 50], "won": [1, 6, 12, 13, 26, 27, 29], "t": [1, 4, 6, 8, 9, 12, 13, 20, 22, 24, 25, 26, 27, 28, 29, 38, 40, 42], "convers": [1, 5, 12, 13, 23, 42, 45], "directli": [1, 9, 21, 41], "go": [1, 4, 5, 9, 12, 13, 43, 44], "methodologi": [1, 5, 6, 9, 41, 43, 48, 49], "possibl": [1, 5, 8, 16, 20, 22, 38, 41, 48], "avoid": [1, 4, 9, 15, 28, 38, 39, 40, 41, 50], "thu": [1, 9, 12, 13, 15, 16, 25, 28, 39, 40, 41, 42, 50], "paramet": [1, 5, 6, 7, 10, 12, 13, 15, 19, 
24, 26, 27, 28, 30, 35, 38, 39, 41, 48, 49, 50], "work": [1, 4, 5, 6, 7, 9, 10, 11, 20, 22, 25, 26, 27, 28, 31, 35, 36, 38, 39, 41, 43, 46], "bfloat16": [1, 2, 3, 6, 15, 16, 17, 30, 31, 39, 42], "half": [1, 43, 50], "k": 1, "float16": [1, 6, 12, 21, 30, 31, 36, 42], "cast": [1, 12, 13, 50], "accord": [1, 35, 41, 45], "default": [1, 3, 5, 6, 7, 8, 10, 11, 15, 18, 22, 24, 26, 27, 28, 29, 30, 37, 38, 40, 41, 42, 45, 46], "valu": [1, 6, 11, 15, 17, 20, 28, 35, 38, 39, 40, 41, 46, 48, 50], "mean": [1, 25, 28, 29, 35, 38, 42], "do": [1, 4, 6, 9, 12, 13, 24, 25, 26, 27, 28, 35, 38, 39, 40, 41, 50], "noth": 1, "note": [1, 2, 4, 7, 8, 9, 10, 14, 19, 21, 22, 25, 28, 30, 35, 38, 39, 40, 41, 42, 44, 46], "type": [1, 3, 4, 5, 6, 8, 9, 10, 15, 23, 24, 25, 28, 31, 38, 39, 40, 42, 43, 44, 50], "conv2d": [1, 10, 12, 13, 15, 25, 28, 38, 42, 43, 45], "linear": [1, 7, 10, 12, 13, 16, 17, 21, 22, 24, 25, 38, 41, 42, 43, 45], "convtranspose2d": [1, 45], "case": [1, 5, 6, 9, 10, 11, 14, 18, 25, 37, 38, 39, 41], "addit": [1, 5, 42, 43, 44, 46, 50], "embed": [1, 35], "lstm": [1, 15, 16, 22], "sgd": [1, 5, 7, 12, 13, 19, 24, 30, 42, 43, 48, 49], "string": [1, 10, 39], "o0": [1, 38], "No": [1, 25, 29, 38, 42], "function": [1, 4, 5, 6, 9, 10, 12, 13, 15, 18, 20, 22, 23, 26, 28, 29, 30, 31, 35, 36, 38, 39, 41, 42, 43, 46, 49, 50], "just": [1, 9, 20, 36, 42, 43, 44], "return": [1, 5, 7, 9, 10, 12, 13, 15, 24, 25, 26, 28, 38], "origin": [1, 8, 17, 18, 19, 22, 28, 36, 45, 49], "dropout": [1, 10, 15, 42], "remov": [1, 4, 26, 42, 50], "inferenc": 1, "master": [1, 6, 7, 39, 43, 50], "fuse": [1, 35, 42, 45, 48, 49], "updat": [1, 4, 7, 10, 42, 48, 49, 50], "step": [1, 4, 5, 7, 9, 10, 12, 13, 16, 19, 20, 24, 26, 29, 30, 40, 48, 50], "overridden": [1, 46], "explicitli": [1, 5, 9, 11, 12, 13, 26, 28, 38, 39], "bool": [1, 20], "whether": [1, 12, 13, 25, 26, 27, 41], "conv_bn": 1, "It": [1, 6, 7, 8, 9, 10, 12, 15, 17, 21, 25, 28, 29, 31, 36, 38, 39, 41, 42, 43, 44, 45, 46, 50], "knob": [1, 3, 18, 39], 
"overwrit": [1, 39], "configur": [1, 3, 5, 9, 20, 22, 23, 26, 31, 37, 39, 40, 44, 46], "linear_bn": 1, "convolut": [1, 6, 13, 28, 30, 41, 45], "reorder": [1, 25, 35], "replac": [1, 4, 6, 7, 10, 15, 23, 38], "ident": [1, 5, 15, 25], "aten": [1, 6, 8, 9], "opportunit": 1, "bf16": [1, 2, 31, 38, 42, 43, 48, 49, 50], "save": [1, 4, 5, 10, 17, 19, 20, 21, 22, 24, 25, 35, 40, 42, 45, 50], "solut": [1, 10, 35, 38, 42], "doesn": [1, 22, 24, 25, 26, 38], "support": [1, 4, 5, 7, 8, 9, 11, 16, 22, 23, 24, 27, 28, 30, 33, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49, 50], "all": [1, 4, 5, 7, 9, 10, 11, 12, 13, 16, 19, 20, 26, 27, 28, 29, 35, 36, 40, 41, 42, 45, 47, 48, 49], "param": [1, 39, 48, 49], "tupl": [1, 7, 28], "tensor": [1, 5, 6, 8, 9, 12, 13, 16, 22, 23, 26, 27, 28, 35, 38, 40, 42, 47], "feed": [1, 14, 25], "sampl": [1, 7, 14, 20, 41], "input": [1, 5, 6, 7, 9, 10, 14, 15, 16, 22, 23, 25, 26, 27, 30, 38, 40, 41, 45], "impact": [1, 28, 38], "pack": [1, 8, 28], "intel": [1, 2, 3, 6, 8, 9, 10, 11, 12, 14, 15, 16, 19, 20, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 42, 43, 44, 45, 46, 48, 49, 50], "extens": [1, 2, 3, 5, 8, 10, 11, 14, 15, 20, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 39, 41, 42, 43, 44, 45, 46, 48, 49], "pytorch": [1, 2, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 20, 24, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49], "per": [1, 8, 9, 11, 15, 19, 22, 23, 28, 39, 40, 41], "some": [1, 4, 6, 9, 10, 12, 13, 16, 24, 25, 26, 28, 31, 38, 39, 40, 41, 45], "heurist": [1, 28], "real": [1, 5, 6, 7, 20, 22], "best": [1, 6, 12, 13, 20, 23, 35, 41], "try": [1, 4, 5, 6, 7, 18, 20, 24, 38, 39, 41], "select": [1, 5, 6, 32, 42, 45], "true": [1, 3, 5, 6, 7, 9, 10, 15, 16, 17, 18, 20, 22, 23, 24, 27, 30, 31, 36, 39, 40, 41, 45, 46], "might": [1, 25, 38, 41, 49], "cost": [1, 6, 9, 26, 41], "extra": [1, 7, 15, 28, 39, 40, 42], "auto": [1, 5, 9, 15, 25, 35, 38, 39, 41, 42], "experiment": [1, 3, 4, 28, 30, 38, 42, 
44, 45], "combin": [1, 18, 20, 23, 39], "method": [1, 8, 9, 12, 13, 21, 22, 26, 27, 29, 38, 41], "multipl": [1, 4, 6, 7, 12, 13, 25, 30, 35, 38, 40, 41, 42, 44], "subgraph": 1, "modifi": [1, 4, 19], "other": [1, 6, 8, 10, 12, 13, 16, 17, 19, 20, 23, 25, 26, 31, 35, 38, 39, 41, 42, 47, 48, 49], "place": [1, 7, 12, 13, 35, 41], "scenario": [1, 8, 23, 38, 41], "convolutuon": 1, "counterpart": [1, 6, 42], "pleas": [1, 4, 7, 10, 21, 26, 27, 31, 38, 39, 42, 44], "invok": [1, 5, 9, 12, 13, 15, 28, 31, 36, 38, 42, 45], "ddp": [1, 6, 10, 42], "distribut": [1, 2, 7, 10, 19, 38, 39, 40, 41, 42, 43, 44], "deepcopi": 1, "rather": [1, 25], "than": [1, 6, 11, 16, 23, 25, 26, 27, 28, 30, 38, 41, 42, 43, 50], "allreduc": [1, 7, 19, 38], "caus": [1, 35, 38, 39, 41, 42, 44, 50], "unpredict": 1, "accuraci": [1, 2, 6, 10, 12, 13, 21, 22, 24, 35, 42, 50], "loss": [1, 5, 7, 10, 12, 13, 19, 24, 25, 30, 50], "exampl": [1, 4, 6, 11, 12, 13, 16, 19, 25, 26, 27, 29, 31, 32, 33, 35, 36, 38, 40, 41, 45, 48, 49, 50], "load_state_dict": 1, "path": [1, 5, 9, 10, 11, 20, 25, 28, 38, 39, 41], "eval": [1, 3, 5, 10, 12, 13, 15, 18, 21, 22, 23, 24, 28, 31, 36, 38, 40, 45], "optimized_model": 1, "evalu": [1, 24, 31, 42], "optimized_optim": 1, "altern": [1, 5, 6, 7, 25], "motiv": [1, 5, 28], "ad": [1, 5, 6, 7, 15, 26, 30, 41], "alia": [1, 5, 9], "unifi": [1, 5, 39], "style": [1, 4, 5, 9, 27], "modular": [1, 5], "optimize_transform": [1, 35, 36, 42], "float32": [1, 26, 27, 31, 39, 45, 50], "quantization_config": 1, "qconfig_summary_fil": 1, "low_precision_checkpoint": 1, "deployment_mod": 1, "transform": [1, 2, 3, 5, 10, 15, 21, 24, 25, 40, 41, 42], "focu": [1, 15, 25, 36, 42], "especi": [1, 4, 9, 35], "task": [1, 6, 35, 38, 39, 41], "famili": [1, 35, 41], "llama": [1, 2, 35], "gpt": [1, 21, 35], "j": [1, 21, 35], "neox": 1, "opt": [1, 7, 31, 35, 46], "falcon": 1, "now": [1, 6, 9, 22, 25, 30, 40, 41], "float": [1, 5, 6, 9, 10, 12, 13, 17, 20, 22, 24, 50], "when": [1, 4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 19, 
20, 25, 28, 29, 35, 38, 39, 40, 41, 42, 43, 44, 48, 49, 50], "mix": [1, 5, 9, 38, 42, 45], "str": [1, 7, 20, 26, 39], "curentlti": 1, "object": [1, 5, 20, 28, 38, 41, 42, 46], "defin": [1, 6, 8, 9, 10, 12, 13, 15, 17, 19, 23, 24, 25, 26, 27, 36, 40, 42, 45], "recip": [1, 3, 6, 17, 22, 45], "quant": 1, "static": [1, 3, 23, 39, 40, 41], "onc": [1, 4, 6, 16, 20, 25, 26, 27, 28, 40, 41, 44, 50], "quantizat": 1, "config": [1, 5, 23, 39, 40], "json": [1, 22, 24, 26, 27, 40], "file": [1, 3, 4, 5, 7, 9, 12, 13, 20, 22, 24, 25, 26, 27, 38, 39, 42, 44, 46], "under": [1, 6, 8, 12, 13, 25, 28, 34, 38, 39, 42], "need": [1, 4, 5, 6, 7, 9, 10, 15, 19, 20, 23, 24, 25, 26, 27, 28, 29, 31, 36, 38, 39, 40, 41, 42, 43, 44, 45, 48, 49, 50], "calibr": [1, 5, 21, 23, 36, 38, 40], "dict": 1, "int4": [1, 6, 35, 36, 42], "": [1, 2, 7, 9, 10, 12, 13, 15, 20, 22, 25, 26, 27, 28, 38, 39, 40, 41, 48, 49, 50], "should": [1, 4, 9, 10, 12, 13, 22, 27, 28, 29, 35, 37, 38, 39, 41], "state_dict": [1, 5, 10, 19], "checkpoint": [1, 5, 19, 38], "pt": [1, 5, 10, 20, 21, 22, 36, 40, 45], "gptq": [1, 21], "etc": [1, 4], "where": [1, 4, 6, 7, 8, 10, 41, 50], "specifi": [1, 9, 10, 20, 28, 39, 41, 44], "kei": [1, 6, 7, 35, 42, 43], "group": [1, 7, 9, 10, 28, 41, 48], "chang": [1, 4, 5, 6, 7, 9, 10, 12, 13, 15, 18, 19, 22, 25, 28, 31, 36, 38, 39, 42, 46], "make": [1, 4, 5, 6, 7, 9, 10, 19, 20, 22, 26, 30, 31, 35, 40, 41, 44, 46, 50], "n": [1, 5, 6, 7, 9, 10, 24, 25, 28, 38, 40, 41, 48], "thei": [1, 6, 9, 12, 13, 25, 35, 39, 41, 42], "uint4": 1, "compress": [1, 17, 21], "along": [1, 4, 29, 41, 50], "store": [1, 25, 35, 39, 40, 41, 48, 49, 50], "int32": 1, "zero": [1, 6, 10, 11, 22, 26, 38, 44], "point": [1, 6, 8, 9, 12, 13, 17, 21, 22, 41, 50], "scale": [1, 2, 6, 10, 17, 19, 22, 35, 42], "bia": [1, 9, 12, 13, 25, 28], "state": [1, 6, 9, 10, 19, 22, 35, 47, 48], "channel": [1, 2, 11, 15, 22, 23, 42], "automaticlli": 1, "deploy": [1, 5, 6, 45], "torchscirpt": 1, "workabl": 1, "forward": [1, 5, 7, 9, 10, 12, 13, 
16, 24, 25, 28, 38, 40, 41, 45, 50], "after": [1, 4, 5, 9, 23, 26, 27, 28, 29, 31, 32, 38, 40, 41, 45, 49, 50], "deepspe": [1, 35], "parallel": [1, 7, 35, 38, 41, 42], "get_fp32_math_mod": 1, "fpmath_mod": 1, "fpmath": 1, "fp32mathmod": 1, "fp32": [1, 3, 11, 16, 23, 31, 38, 42, 48, 49, 50], "bf32": [1, 11], "tf32": [1, 11], "disabl": [1, 6, 10, 11, 38, 39, 41, 45], "implicit": 1, "set_fp32_math_mod": 1, "class": [1, 6, 7, 10, 12, 13, 15, 24, 25, 28, 38], "verbos": [1, 3, 6, 9, 11, 29, 39], "demand": [1, 6], "easier": [1, 25, 30, 50], "debug": [1, 11, 23, 29, 36, 39, 46], "dump": [1, 23, 39], "messag": [1, 6, 9, 15, 18, 25, 29, 38, 39, 42], "contain": [1, 4, 7, 8, 16, 38, 39, 40, 41, 45], "durat": [1, 50], "while": [1, 11, 12, 13, 18, 23, 25, 26, 27, 35, 38, 40, 41, 42, 50], "via": [1, 4, 6, 8, 9, 11, 23, 26, 27, 28, 30, 35, 39, 41, 44, 47], "environ": [1, 4, 6, 7, 10, 11, 19, 28, 31, 35, 38, 39, 40, 41, 46], "variabl": [1, 4, 6, 11, 19, 31, 38, 39, 40, 41, 46], "name": [1, 9, 16, 20, 26, 27, 29, 35, 39, 40, 41], "dnnl_verbos": 1, "howev": [1, 4, 6, 9, 11, 12, 13, 14, 25, 28, 35, 38, 39, 41, 47], "those": [1, 9, 19, 22, 26, 35, 41, 47], "amount": [1, 38, 41, 47], "investig": [1, 38, 39], "singl": [1, 10, 19, 20, 28, 35, 40, 45, 48, 49], "iter": [1, 7, 26, 35, 50], "scope": [1, 6, 12, 13, 50], "out": [1, 5, 6, 8, 9, 12, 13, 15, 24, 26, 27, 28, 29, 38, 39, 41, 48, 49], "second": [1, 9, 15, 19, 26, 35, 38, 40, 41], "verbose_on": 1, "verbose_off": 1, "verbose_on_cr": 1, "creation": 1, "current_devic": 1, "int": [1, 5, 6, 7, 9, 10, 20, 38, 39], "index": [1, 4, 7, 8, 9, 10, 25, 29, 35, 41], "current_stream": [1, 9], "ani": [1, 4, 12, 13, 15, 25, 38, 40, 42, 43, 46], "context": [1, 4, 6, 8, 9, 11, 12, 13, 28, 29, 35, 41], "wrapper": [1, 7, 9, 29], "encapsul": [1, 9], "op": [1, 4, 10, 16, 22, 29, 30, 35], "argument": [1, 7, 10, 16, 26, 39], "neg": [1, 38, 50], "integ": [1, 35], "device_count": [1, 10, 26], "device_of": 1, "obj": 1, "storag": [1, 17, 48, 49], "alloc": [1, 8, 
15, 19, 28, 35, 40, 43, 47], "get_device_nam": 1, "get_device_properti": 1, "properti": [1, 5, 9, 40], "_deviceproperti": 1, "init": [1, 4, 7, 19, 22, 24], "initi": [1, 7, 10, 19, 28, 40], "lazi": 1, "until": [1, 4, 28, 29, 41, 50], "first": [1, 2, 4, 5, 7, 9, 14, 15, 18, 19, 23, 24, 26, 28, 38, 39, 40, 41, 42, 43, 45, 48, 50], "access": [1, 8, 9, 25, 35, 40, 42, 48, 49], "veri": [1, 4, 9, 22, 25, 26, 27, 35, 38], "rare": 1, "sinc": [1, 6, 8, 9, 25, 28, 38, 41, 43, 44, 48, 50], "could": [1, 6, 7, 11, 23, 25, 36, 38, 40, 41, 45], "doe": [1, 8, 9, 25, 28, 38, 42, 45], "repeatedli": [1, 4], "is_avail": 1, "indic": [1, 25, 35], "is_initi": 1, "set_devic": [1, 7, 10, 19], "discourag": 1, "favor": 1, "most": [1, 6, 11, 16, 23, 35, 38, 40, 41, 42, 43, 45, 50], "xpu_visible_devic": 1, "environment": 1, "streamcontext": 1, "around": [1, 9, 39], "synchron": [1, 8, 9, 11, 19, 26, 28, 38], "wait": [1, 8, 26, 28, 41], "complet": [1, 4, 5, 20, 25, 26, 36, 41, 47], "fp8": [1, 6, 42], "fp8_autocast": [1, 17], "fp8_recip": [1, 17], "delayedsc": [1, 17], "inp": 1, "_gptq": [1, 21, 36], "dataset": [1, 5, 7, 10, 19, 21, 24, 36, 41], "quantized_ckpt": 1, "wbit": [1, 21, 36], "4": [1, 7, 9, 20, 21, 25, 28, 29, 35, 36, 39, 41, 42], "perchannel": [1, 23, 36], "symmetr": [1, 22, 23], "group_siz": 1, "pack_dtyp": 1, "uint8": [1, 23], "param_dtyp": 1, "list": [1, 4, 5, 11, 12, 13, 20, 25, 26, 33, 35, 36, 39, 40, 41, 42, 44, 45], "bloom": [1, 35], "3": [1, 4, 5, 6, 7, 9, 10, 12, 13, 15, 16, 18, 20, 24, 25, 26, 28, 29, 30, 39, 41, 42, 45, 46, 50], "bit": [1, 17, 21, 35, 50], "calib": 1, "batch": [1, 5, 6, 9, 10, 19, 24, 25, 28, 38, 40], "int2": 1, "int3": 1, "granular": [1, 39, 40, 41], "scheme": [1, 40], "determin": [1, 41, 50], "except": [1, 7, 35, 39], "huggingfac": [1, 21, 35, 38, 40], "guarante": [1, 16], "gptjforcausallm": [1, 21, 36], "model_path": [1, 21, 36], "from_pretrain": [1, 3, 5, 21, 36, 40], "quantized_weight": [1, 21, 36], "get_rng_stat": 1, "bytetensor": 1, "rng": 1, 
"eagerli": 1, "get_rng_state_al": 1, "repres": [1, 4, 6, 7, 42, 50], "set_rng_stat": 1, "new_stat": 1, "desir": [1, 5, 23, 24, 39], "set_rng_state_al": 1, "manual_se": [1, 7, 10], "seed": [1, 7, 10], "safe": [1, 8], "silent": 1, "ignor": 1, "multi": [1, 6, 7, 20, 28, 38, 39, 41, 44], "insuffici": 1, "manual_seed_al": 1, "seed_al": 1, "initial_se": 1, "prioriti": [1, 16], "kwarg": [1, 26], "record_ev": 1, "record": [1, 10, 20, 26, 27, 40], "new": [1, 2, 4, 17, 18, 24, 25, 28, 36, 41, 42, 43], "sycl_queu": [1, 5, 9], "pycapsul": [1, 9], "queue": [1, 5, 6, 11], "correspond": [1, 7, 28, 38, 39, 42], "void": [1, 9], "pointer": [1, 9, 38], "address": [1, 7, 25, 39, 40, 41], "Its": 1, "capsul": 1, "self": [1, 7, 10, 12, 13, 15, 24, 25, 26, 27, 28, 38], "wait_ev": 1, "futur": [1, 4, 14], "wait_stream": 1, "anoth": [1, 20, 39, 41], "without": [1, 6, 7, 8, 12, 13, 15, 24, 26, 27, 28, 29, 38, 40, 42, 43, 50], "enqueu": 1, "affect": [1, 39], "elapsed_tim": [1, 10], "end_ev": 1, "elaps": [1, 10, 41], "millisecond": [1, 41], "wa": [1, 5, 8, 9, 27, 29, 38, 39, 40, 41, 42], "queri": [1, 25], "check": [1, 5, 6, 7, 9, 16, 25, 26, 31, 35, 36, 39, 43, 45], "captur": [1, 3, 47], "A": [1, 4, 5, 6, 15, 23, 25, 38, 39, 41, 42, 44], "boolean": [1, 6], "prevent": [1, 19, 48, 49], "proceed": 1, "empty_cach": [1, 47], "unoccupi": 1, "held": 1, "visibl": [1, 42], "sysman": 1, "toolkit": [1, 7, 31], "help": [1, 4, 5, 9, 10, 16, 35, 39, 41, 43, 44, 47], "fragment": [1, 41], "memory_stat": [1, 47], "dictionari": 1, "statist": [1, 5, 6, 23, 36], "non": [1, 4, 12, 13, 25, 40, 45], "core": [1, 6, 19, 20, 30, 38, 41, 46], "large_pool": 1, "small_pool": 1, "peak": 1, "freed": [1, 47], "receiv": [1, 38, 42, 50], "allocated_byt": 1, "segment": [1, 6, 26, 27, 42], "reserv": [1, 41, 43], "xpumalloc": 1, "reserved_byt": 1, "activ": [1, 5, 17, 19, 22, 23, 26, 28, 31, 35, 36, 39, 41], "active_byt": 1, "inactive_split": 1, "inact": 1, "inactive_split_byt": 1, "broken": 1, "down": [1, 38, 40], "pool": [1, 28, 
43], "across": [1, 6, 7, 10, 39], "octob": 1, "2019": [1, 2], "1mb": [1, 41], "small": [1, 6, 23, 41, 48], "metric": 1, "maximum": [1, 46], "histor": 1, "total": [1, 9, 26, 27, 41, 47], "decreas": [1, 38], "simpl": [1, 9, 11, 12, 13, 25, 30, 31, 41, 42], "counter": 1, "num_alloc_retri": 1, "fail": [1, 15, 38, 42], "flush": 1, "retri": 1, "num_oom": 1, "error": [1, 4, 5, 9, 10, 15, 24, 25, 27, 38, 42, 50], "thrown": [1, 38], "memory_summari": 1, "abbrevi": 1, "human": 1, "readabl": [1, 9], "printout": 1, "displai": 1, "period": [1, 41], "dure": [1, 3, 4, 5, 15, 35, 38, 39, 41, 44, 45, 50], "handl": [1, 5, 8, 25, 41], "summari": 1, "memory_snapshot": [1, 47], "snapshot": [1, 47], "interpret": [1, 9, 39], "output": [1, 5, 6, 10, 12, 13, 16, 17, 19, 20, 25, 26, 29, 30, 38, 45], "familiar": [1, 9], "intern": [1, 11, 25, 28, 40, 46], "memory_alloc": [1, 47], "occupi": [1, 38, 47], "byte": 1, "less": [1, 6, 12, 13, 28, 38, 42], "unus": [1, 41, 47], "creat": [1, 4, 5, 8, 9, 11, 21, 23, 24, 28, 30, 36, 41, 42], "max_memory_alloc": [1, 47], "By": [1, 6, 39, 41, 46], "begin": [1, 4, 9, 29], "reset_peak_stat": 1, "reset": 1, "two": [1, 6, 8, 9, 17, 20, 28, 35, 40, 41, 43, 50], "measur": 1, "loop": [1, 4, 11, 26, 50], "memory_reserv": [1, 47], "max_memory_reserv": [1, 47], "reset_peak_memory_stat": 1, "stat": 1, "individu": [1, 4], "memory_stats_as_nested_dict": 1, "nest": [1, 29], "reset_accumulated_memory_stat": 1, "accumul": 1, "enum": 1, "fp32_math_mod": 1, "dpccp": 1, "packet": 1, "enumer": [1, 5, 8, 10, 19, 24], "math": [1, 6, 9, 11, 16], "fp32_math_mode_max": 1, "comput": [1, 5, 7, 17, 19, 21, 22, 24, 25, 28, 30, 35, 39, 40, 41, 43, 44, 45, 50], "primit": [1, 6, 11, 28, 38], "attribut": [1, 25], "descript": [1, 3, 6, 10, 11, 25, 28, 33, 41], "definit": [1, 9, 50], "numer": [1, 12, 13, 41], "behavior": [1, 16, 28, 29, 39, 41], "get_queue_from_stream": [1, 5, 9], "c10": [1, 5], "dpcpp": [1, 5, 7, 9, 38, 42], "enable_onednn_fus": [1, 45], "prepar": [1, 3, 23, 24, 36, 38, 
40, 45], "example_input": [1, 3, 22, 23, 24, 36, 40, 45], "bn_fold": 1, "example_kwarg_input": 1, "qconfig": [1, 3, 5, 23, 24, 36, 38, 40, 45], "observ": [1, 5, 14, 22, 23, 36, 45], "insert": [1, 5, 23, 36], "fake": 1, "introduct": [1, 35, 37, 41], "avaiabl": 1, "page": [1, 5, 7, 10, 26, 27, 28, 32, 36, 37, 41, 45], "autotun": [1, 3, 24], "prepared_model": [1, 3, 22, 24, 38, 45], "calib_dataload": 1, "eval_func": 1, "sampling_s": [1, 3, 24], "accuracy_criterion": [1, 3, 24], "tuning_tim": [1, 3, 24], "driven": 1, "tune": [1, 2, 3, 6, 12, 13, 22, 28, 39, 40], "quickli": 1, "dataload": [1, 5, 7, 10, 15, 19, 24, 26, 28], "entir": [1, 9, 35], "process": [1, 5, 6, 7, 9, 10, 16, 17, 18, 19, 20, 28, 29, 38, 39, 40, 41, 48, 50], "scalar": [1, 42], "higher": [1, 6, 25, 35, 45], "algorithm": [1, 17, 21, 25, 45], "would": [1, 4, 5, 6, 9, 16, 20, 23, 25, 38, 39, 40, 41, 42, 46], "explor": 1, "100": [1, 3, 7, 10, 19, 20, 24, 26, 29, 40, 42, 46], "accuracy_criterion_typ": 1, "rel": [1, 3, 24, 39], "absolut": [1, 5, 39], "accuracy_criterion_valu": 1, "allow": [1, 5, 12, 13, 20, 38, 39, 41, 43, 44], "either": [1, 6, 7, 38, 39], "01": [1, 3, 24, 39, 40], "timeout": [1, 4, 50], "earli": 1, "stop": [1, 26, 38, 41], "is_runtime_ext_en": 1, "helper": [1, 9], "exetens": 1, "openmp": [1, 6, 28, 38, 40], "preload": [1, 38, 39, 41, 42], "cpupool": [1, 28], "core_id": [1, 28, 39], "node_id": [1, 28, 39, 40], "abstract": [1, 9, 28], "intra": 1, "id": [1, 7, 8, 10, 27, 29, 35, 39, 40], "numa": [1, 28, 39, 40], "node": [1, 28, 40, 41], "pin": [1, 19, 28], "cpu_pool": [1, 28], "region": [1, 12, 13, 41], "def": [1, 7, 9, 10, 12, 13, 15, 24, 25, 26, 28, 38, 42, 45], "design": [1, 4, 10, 12, 13, 16, 25, 36, 42, 50], "decor": 1, "multistreammodulehint": [1, 28], "arg": [1, 3, 6, 7, 10, 19, 20, 26, 39, 40, 45, 48, 49], "hint": [1, 28], "multistreammodul": [1, 6, 28, 38], "concat": [1, 16, 25, 28, 35, 38], "its": [1, 5, 7, 8, 12, 13, 16, 20, 26, 27, 31, 37, 39, 40, 41, 50], "dim": [1, 5, 8, 9, 10, 
16, 25], "length": [1, 4, 20, 38, 50], "arbitrari": 1, "keyword": 1, "num_stream": [1, 28], "concat_output": 1, "input_split_hint": [1, 28], "multi_stream": 1, "output_concat_hint": [1, 28], "throughput": [1, 2, 28, 35, 38, 42], "insid": [1, 4, 9, 28, 39, 43], "divis": [1, 16, 28], "equal": [1, 11, 22, 28, 38, 40, 41], "remaind": [1, 28], "divisor": [1, 28], "batchsiz": [1, 28], "larger": [1, 28, 41], "piec": [1, 6, 28, 29], "mini": [1, 28, 38], "don": [1, 4, 9, 12, 13, 20, 26], "want": [1, 4, 9, 11, 20, 22, 25, 26, 27, 28, 31, 39], "num": [1, 28, 40, 41], "leav": [1, 28, 41], "scriptmodul": [1, 23, 28, 36, 45], "union": 1, "instanc": [1, 6, 15, 20, 40], "usual": [1, 25, 28, 41], "reason": [1, 15, 25, 28], "still": [1, 4, 6, 9, 12, 13, 24, 25, 26, 37, 38, 42, 44, 45, 50], "flag": [1, 6, 28, 39], "concaten": [1, 16, 35, 50], "raw": 1, "asynchron": [1, 6], "get_core_list_of_node_id": 1, "softwar": [2, 34], "jul": 2, "2023": [2, 42], "deep": [2, 6, 7, 8, 10, 12, 13, 16, 17, 19, 20, 21, 41, 45, 50], "learn": [2, 6, 7, 8, 10, 12, 13, 17, 19, 20, 21, 39, 41, 45, 50], "boost": [2, 5, 6, 14, 39, 41, 42, 50], "dl": [2, 6], "hug": 2, "face": 2, "bert": [2, 3, 15, 17], "googl": [2, 4], "cloud": 2, "platform": [2, 5, 8, 25, 30, 38, 40, 41, 42, 44], "gcp": 2, "technologi": [2, 6], "guid": [2, 6, 7, 40], "apr": 2, "mar": [2, 40], "x86": 2, "sapphir": [2, 6], "rapid": [2, 6], "part": [2, 5, 9, 12, 13, 25, 26, 27, 38, 41, 43, 50], "jan": 2, "secur": 2, "torchserv": [2, 37], "confer": 2, "dec": 2, "2022": [2, 39, 40, 42], "what": [2, 6, 12, 13, 26, 27, 29, 42, 43, 44], "pyg": 2, "stabl": [2, 3, 6, 7, 8, 12, 13, 38], "diffus": 2, "arc": [2, 38, 42, 44], "nov": [2, 42], "13": [2, 15, 29, 39, 40, 41], "potenti": [2, 30], "fine": [2, 9, 28, 39, 40, 41], "fx": [2, 15, 38], "sep": 2, "empow": [2, 6, 30], "xeon": [2, 6, 20, 40, 41, 50], "scalabl": [2, 6, 41, 50], "processor": [2, 6, 41, 48, 49, 50], "aug": 2, "vision": [2, 5], "last": [2, 9, 11, 15, 42, 50], "One": [2, 8, 17, 25, 39, 41, 
48, 49], "click": 2, "compressor": [2, 6, 24], "4x": 2, "jun": 2, "grokk": 2, "principl": [2, 19, 25], "kt": 2, "person": 2, "text": [2, 35, 38, 41, 44], "speech": [2, 41], "2021": [2, 7, 39, 40], "up": [2, 6, 7, 8, 9, 10, 28, 35, 41, 42, 43], "modern": 2, "naver": 2, "low": [2, 3, 6, 7, 9, 31, 38, 39, 41, 42, 49, 50], "latenc": [2, 20, 35, 40, 42], "machin": [2, 4, 7, 20, 38, 39, 40, 41, 43], "feb": [2, 42], "dlrm": [2, 6, 38], "oneccl": [2, 6, 10, 38, 39], "mention": [2, 7, 9, 15, 28, 50], "deprec": [2, 14], "facebook": [2, 35], "3rd": [2, 6, 50], "gen": 2, "capabl": [2, 6, 23, 30, 42, 47], "2020": [2, 8], "collabor": 2, "caff": 2, "2017": 2, "command": [3, 4, 5, 7, 9, 19, 20, 38, 39, 40, 41, 42], "basic": [3, 16, 24, 41, 42, 50], "instal": [3, 4, 5, 10, 11, 26, 27, 31, 33, 35, 38, 41, 42, 44], "m": [3, 7, 9, 10, 19, 20, 28, 38, 39, 40, 41], "pip": [3, 4, 7, 19], "lt": [3, 42], "version": [3, 5, 8, 9, 34, 38, 40, 41, 42, 49], "gt": [3, 20, 41, 42], "f": [3, 4, 5, 10, 19, 24], "http": [3, 4, 7, 8, 9, 10, 24, 38], "com": [3, 4, 7, 10, 38], "whl": [3, 7, 38], "xpupip": 3, "log": [3, 10, 29, 39, 40, 42, 45], "prompt": [3, 35], "export": [3, 7, 11, 29, 38, 39, 41, 42], "onednn_verbos": 3, "precis": [3, 5, 7, 17, 21, 31, 38, 42, 45, 50], "no_grad": [3, 5, 10, 15, 18, 22, 24, 28, 30, 31, 38, 40, 45], "amp": [3, 5, 15, 17, 30, 31, 38, 42], "autocast": [3, 5, 15, 17, 30, 31], "bertmodelmodel": 3, "bertmodel": [3, 5, 40], "uncas": [3, 5, 15, 40], "fast_bert": 3, "launch": [3, 9, 11, 28, 35, 37, 40, 42], "autom": [3, 6, 12, 13, 20, 39, 40], "ipexrun": [3, 15, 39], "your_pytorch_script": [3, 39], "hypertun": 3, "hyperparamet": [3, 6], "conf": [3, 20, 39, 45], "your_conf_fil": 3, "your_python_script": 3, "post": [3, 4, 22, 23, 30, 35], "default_static_qconfigprepared_model": 3, "anyplac": 3, "d": [3, 4, 5, 6, 9, 12, 13, 38], "calibration_data_load": [3, 5, 45], "converted_model": [3, 38], "default_dynamic_qconfigprepared_model": 3, "tuned_model": [3, 24], "eval_funct": 3, 
"convert_model": [3, 22, 24, 45], "thank": 4, "interest": 4, "intent": 4, "propos": [4, 25, 50], "intend": 4, "shall": [4, 25], "discuss": [4, 9, 25, 41], "agre": 4, "plan": [4, 15], "look": [4, 5, 9, 20, 25], "ahead": [4, 9, 26, 27, 29], "outstand": 4, "pick": 4, "comment": [4, 20], "particular": [4, 12, 13, 36, 38, 42], "ask": 4, "pull": 4, "full": [4, 9, 30, 40, 41], "here": [4, 7, 8, 9, 10, 12, 13, 15, 25, 26, 27, 28, 29, 38, 40, 41, 45], "uninstal": 4, "ll": [4, 9, 26, 27, 29, 40, 41], "know": [4, 8, 9, 43, 44], "fulli": [4, 22, 23, 36, 41, 42, 50], "warn": [4, 18, 39, 40], "skip": [4, 5, 25, 29, 39, 43, 44, 46], "few": [4, 6, 14, 25, 40, 45], "alwai": [4, 12, 13, 16, 25, 38, 39, 41, 42], "ye": 4, "clone": [4, 7, 29, 49], "copi": [4, 6, 8, 25], "git": [4, 7], "b": [4, 6, 7, 12, 13, 29], "cd": [4, 5, 7], "rebas": 4, "submodul": [4, 7], "sync": [4, 7, 28], "recurs": [4, 7], "job": [4, 38], "setup": [4, 7, 9, 10, 19, 26, 27, 35], "py": [4, 7, 9, 10, 11, 15, 20, 26, 27, 28, 38, 39, 40], "symlink": 4, "tree": 4, "reinstal": [4, 38], "again": [4, 40, 48, 49], "__init__": [4, 7, 10, 12, 13, 15, 24, 25, 28, 38], "interfac": [4, 5, 9, 25, 35, 38], "pyi": 4, "cpp": [4, 5, 9, 41, 46], "h": [4, 5, 6, 9, 24, 25, 38, 39, 40], "sure": [4, 7, 10, 19, 20, 22, 26, 40], "Then": [4, 7, 23, 38, 40, 42], "clean": [4, 38, 42], "our": [4, 5, 9, 16, 23, 35, 41, 48], "6": [4, 5, 6, 7, 20, 28, 29, 39, 40, 41], "binari": [4, 5, 12, 13, 25, 43, 44], "folder": 4, "mani": [4, 6, 9, 20, 26, 27, 39, 41], "wai": [4, 9, 15, 25, 35, 49], "next": [4, 6, 9, 16], "re": [4, 7, 9, 12, 13, 40, 41], "rm": 4, "rf": 4, "toplevel": 4, "over": [4, 6, 9, 12, 13, 14, 25, 38, 39, 42], "made": [4, 8, 42], "edit": [4, 38], "repo": [4, 6, 7], "commit": 4, "keep": [4, 6, 18, 25, 26, 40, 41, 50], "realli": [4, 9], "untrack": 4, "deinit": 4, "xdf": 4, "within": [4, 36, 41, 42, 50], "experi": [4, 15, 16, 18, 25, 41], "env_key1": 4, "env_val1": 4, "env_key2": 4, "env_val2": 4, "suit": 4, "locat": 4, "test_": 4, 
"sub_fold": 4, "filenam": 4, "wish": [4, 9, 25, 26, 27, 43], "port": [4, 7, 39], "stock": [4, 8, 25, 38, 42, 45], "10": [4, 8, 10, 20, 24, 25, 26, 29, 30, 38, 39, 40, 41, 46, 50], "regress": [4, 14, 38], "offici": [4, 23, 26, 27, 40, 41, 42], "read": [4, 9, 48, 49], "readm": 4, "md": [4, 25], "docstr": 4, "line": [4, 9, 15, 25, 26, 27, 29, 39, 40, 41, 45], "must": [4, 9, 20, 26, 44, 48, 49], "limit": [4, 12, 13, 15, 25, 28, 38, 40, 41], "80": [4, 39], "charact": 4, "fit": [4, 41, 43], "jupyt": 4, "popup": 4, "abov": [4, 7, 8, 10, 11, 15, 25, 26, 27, 29, 35, 39, 40, 48, 49], "prerequisit": [4, 5], "r": [4, 6, 20, 40, 41], "txt": [4, 5, 9, 40], "html": [4, 8, 9, 24], "_build": 4, "rst": 4, "live": 4, "tutori": [4, 5, 7, 9, 10, 22, 24, 26, 27, 37], "autofunct": 4, "autoclass": 4, "direct": [4, 45], "shorten": 4, "sphinx": 4, "produc": [4, 8, 9, 12, 13, 47], "miss": 4, "torchvis": [5, 10, 15, 18, 24, 40, 45], "demonstr": [5, 8, 16, 25, 38, 40], "box": [5, 15, 41], "benefit": [5, 6, 12, 13, 15, 23, 28, 40, 41, 50], "against": 5, "criterion": [5, 12, 13], "below": [5, 7, 9, 12, 13, 15, 16, 20, 25, 26, 27, 28, 29, 31, 38, 39, 40, 41, 42, 44, 48, 49, 50], "lr": [5, 7, 10, 12, 13, 24, 30, 48, 49], "001": [5, 7, 12, 13], "download": [5, 10, 24, 38], "cifar10": 5, "compos": [5, 10], "resiz": 5, "224": [5, 12, 13, 15, 18, 30, 40, 45], "totensor": [5, 10, 24], "5": [5, 7, 10, 15, 16, 20, 23, 24, 25, 28, 29, 36, 38, 39, 40, 41, 48, 50], "train_dataset": [5, 7, 19], "root": [5, 7, 11, 24, 35, 38, 42, 46], "train_load": [5, 7, 10, 12, 13, 19], "batch_siz": [5, 7, 9, 10, 19, 24, 25, 40, 45], "128": [5, 10, 12, 13, 15, 28], "crossentropyloss": [5, 24], "momentum": [5, 15, 30, 49, 50], "9": [5, 20, 29, 39, 40, 42, 46], "batch_idx": [5, 10, 19], "target": [5, 9, 10, 15, 19, 20, 42, 43, 44, 46], "zero_grad": [5, 10, 19, 24, 30], "backward": [5, 7, 9, 10, 12, 13, 19, 24, 30, 50], "print": [5, 6, 7, 10, 18, 19, 20, 23, 24, 25, 26, 27, 29, 36, 39, 45, 46], "model_state_dict": 5, 
"optimizer_state_dict": 5, "pth": 5, "finish": [5, 16, 18, 24, 28, 38], "nlp": [5, 6, 38], "resnet50_weight": [5, 18], "rand": [5, 12, 13, 18, 25, 28, 30, 38], "vocab_s": [5, 40], "seq_length": [5, 40], "randint": [5, 40], "freez": [5, 12, 13, 15, 22, 24, 28, 30, 31, 38, 40, 45], "strict": [5, 40], "becaus": [5, 12, 13, 25, 35, 38, 41, 44, 50], "prepare_jit": [5, 23, 36], "convert_jit": [5, 23, 36], "separ": [5, 7, 9, 11, 34, 38, 41, 48], "collect": [5, 6, 7, 10, 23, 40, 41], "o": [5, 7, 10, 38, 42, 46], "_recurs": 5, "wrap_cpp_modul": 5, "quantize_jit": [5, 23, 36], "modeljit": [5, 23, 36], "minmaxobserv": [5, 22, 23, 36], "with_arg": [5, 22, 23, 36], "qscheme": [5, 22, 23, 36], "per_tensor_symmetr": [5, 22, 23, 36], "reduce_rang": [5, 22, 23, 36], "quint8": [5, 22], "default_weight_observ": [5, 23, 36], "len": [5, 10, 19, 24, 26], "memory_format": [5, 6, 25], "channels_last": [5, 6, 25, 41], "libtorch": [5, 42], "own": [5, 9, 22], "servic": [5, 41], "regular": [5, 50], "unlik": [5, 6, 10], "cmake": [5, 6, 42, 46], "cppsdk": 5, "ensur": [5, 19, 28, 40, 48], "app": 5, "iostream": 5, "memori": [5, 6, 8, 9, 10, 12, 13, 14, 15, 17, 21, 28, 35, 38, 40, 42, 45, 48, 49, 50], "argc": 5, "const": [5, 9], "char": 5, "argv": 5, "catch": [5, 23], "std": [5, 9, 48], "cerr": 5, "kxpu": 5, "ivalu": 5, "push_back": 5, "cout": 5, "slice": [5, 9, 25], "end": [5, 26, 28, 29, 38, 43, 44, 45], "endl": 5, "cmakelist": [5, 9], "cmake_minimum_requir": [5, 9], "fatal_error": [5, 9], "find_packag": [5, 9], "add_execut": 5, "target_link_librari": [5, 9], "torch_ipex_librari": [5, 9], "set_properti": [5, 9], "cxx_standard": [5, 9], "17": [5, 9, 29, 39, 40], "mkdir": 5, "build": [5, 6, 7, 19, 29, 38, 41, 42, 44], "cc": [5, 46], "icx": [5, 9], "cxx": [5, 46], "icpx": [5, 9], "dcmake_prefix_path": [5, 9], "libpytorch_path": 5, "libpytorch": 5, "_": [5, 7, 8, 9, 22, 23, 25, 28, 29, 38, 39, 40, 41, 42, 45], "verifi": [5, 6, 21, 35, 38], "linux": [5, 9, 38, 39, 41, 42], "ldd": 5, "workspac": 5, 
"identif": [5, 46], "intelllvm": 5, "2024": [5, 38], "abi": [5, 42, 46], "info": [5, 23, 38, 39, 40, 46], "done": [5, 10, 15, 24, 38, 41, 46], "oneapi": [5, 6, 7, 8, 10, 16, 19, 31, 38, 41, 44], "bin": [5, 38, 39, 40, 42, 46], "pthread": [5, 28], "test": [5, 10, 24, 29, 42, 43, 44, 46], "cmake_have_libc_pthread": 5, "success": [5, 15, 32], "lib": [5, 38, 39, 40, 42], "libintel": 5, "ext": 5, "written": [5, 42, 46], "0x00007fd5bb927000": 5, "libc10": 5, "0x00007fd5bb895000": 5, "libtorch_cpu": 5, "0x00007fd5a44d8000": 5, "0x00007fd5a1a1b000": 5, "0x00007fd5862b0000": 5, "libmkl_intel_lp64": [5, 38, 42], "mkl": [5, 31, 38, 42], "intel64": [5, 38, 42], "0x00007fd584ab0000": 5, "libmkl_cor": [5, 38, 42], "0x00007fd5806cc000": 5, "libmkl_gnu_thread": [5, 38], "0x00007fd57eb1d000": 5, "libmkl_sycl": [5, 38, 42], "0x00007fd55512c000": 5, "libopencl": 5, "0x00007fd55511d000": 5, "libsvml": 5, "intel64_lin": 5, "0x00007fd553b11000": 5, "libirng": 5, "0x00007fd553600000": 5, "libimf": 5, "0x00007fd55321b000": 5, "libintlc": 5, "0x00007fd553a9c000": 5, "libsycl": 5, "0x00007fd552f36000": 5, "show": [5, 7, 8, 9, 12, 13, 26, 27, 29, 36, 37, 39, 40, 41, 50], "fsycl": [5, 9, 44], "cmake_cxx_flag": 5, "usm": [5, 8], "cl": 5, "hpp": 5, "namespac": [5, 12, 13], "fetch": 5, "stream": [5, 6, 11, 28, 38], "device_typ": [5, 9], "devicetyp": [5, 9], "impl": [5, 9], "virtualguardimpl": [5, 9], "xpu_stream": 5, "getstream": [5, 9], "input_ptr": 5, "malloc_devic": 5, "fromusm": 5, "scalartyp": 5, "nullopt": 5, "output_tensor": 5, "append": 5, "former": [5, 9], "zoo": 5, "benchmark": [5, 38, 39, 47], "mark": [5, 26, 27], "document": [5, 6, 9, 11, 28, 36, 42, 46], "column": [5, 9, 26, 27], "simpli": [5, 9, 38, 39], "guidanc": 6, "nchw": [6, 41], "nhwc": [6, 41, 42], "anymor": 6, "center": [6, 38, 42, 44], "flex": [6, 38, 42, 44], "seri": [6, 30, 38, 41, 42, 44], "choos": [6, 12, 13, 16, 26, 28, 39, 41, 42], "typic": [6, 8, 15, 19, 26, 41, 42], "speed": [6, 7, 9, 35, 41, 42, 43, 48, 49], 
"furthermor": 6, "aka": [6, 25], "cooper": 6, "lake": 6, "4th": 6, "bfloat": 6, "16": [6, 28, 29, 39, 40, 50], "matmul": [6, 12, 13, 38, 42, 45], "partial": [6, 10], "upstream": [6, 25], "land": 6, "pr": [6, 25, 38], "being": [6, 21, 29, 41], "review": [6, 38], "side": [6, 8, 22, 41], "respect": [6, 20, 39], "built": [6, 7, 9, 28, 29, 38, 42, 44, 46], "deliv": [6, 23, 35], "cnn": [6, 25, 41], "top": [6, 15, 42, 50], "power": [6, 9, 17, 21, 41], "meet": [6, 17, 41, 50], "commun": [6, 7, 8, 10, 38, 39, 40, 41, 42], "bind": [6, 9, 10, 38, 39, 40, 41], "formerli": [6, 7, 10, 41], "known": [6, 7, 10, 15, 35, 37], "torch_ccl": [6, 7], "horovod": [6, 38, 42], "among": [6, 8, 19, 39, 40, 41], "framework": [6, 8, 11, 19], "interopar": 6, "particularli": [6, 8], "describ": [6, 7, 12, 13, 25, 38, 40, 41, 45, 50], "write": [6, 26, 27], "practic": [6, 9, 35, 41, 50], "setuptool": 6, "suffici": [6, 11], "driver": [6, 44], "ze_flat_device_hierarchi": [6, 11], "hierarchi": 6, "expos": [6, 12, 13], "tile": [6, 7, 11, 35], "industri": [6, 10, 42], "grade": [6, 10, 42], "worker": [6, 7, 10, 19, 28, 39], "maintain": [6, 7, 9, 10, 12, 13], "replica": [6, 7, 10], "gradient": [6, 7, 10, 17, 19], "rank": [6, 7, 10, 19, 39], "footprint": [6, 10, 17, 21, 35, 42, 43, 50], "feasibl": [6, 10, 15], "seamlessli": [6, 30], "har": [6, 30], "flagship": [6, 30], "torchinductor": [6, 30], "field": [6, 26, 27, 29], "statement": [6, 20, 26, 27, 46], "let": [6, 9, 15, 25, 28, 29, 48, 49, 50], "stack": [6, 12, 13, 29], "indent": [6, 26, 27, 29], "distinguish": [6, 29], "capac": [6, 16, 41, 50], "registr": 6, "topologi": [6, 38, 39, 41, 48, 49], "roialign": 6, "nm": 6, "mask": [6, 38], "frozenbatchnorm2d": 6, "num_featur": 6, "ep": [6, 15, 48], "1e": [6, 15, 24], "05": [6, 15, 39], "batchnorm2d": [6, 15, 38], "affin": [6, 15, 22, 28, 39, 40, 41], "expect": [6, 25, 38], "w": [6, 24, 25, 40, 50], "same": [6, 7, 9, 10, 15, 22, 25, 28, 35, 38, 39, 40, 41, 46, 50], "interact": 6, "beyond": 6, "kind": 6, 
"gender": 6, "hobbi": 6, "dot": [6, 13, 25, 35], "between": [6, 7, 8, 12, 13, 28, 29, 38, 41, 46], "man": [6, 41], "plai": [6, 41], "footbal": 6, "gemm": [6, 25, 35, 38, 42], "onemkl": [6, 11, 16, 38, 42], "circumst": [6, 12, 13], "faster": [6, 12, 13, 41], "abl": [6, 22], "aim": [6, 15, 41], "broad": [6, 14], "toggl": 6, "switch": [6, 10, 26, 27, 39, 41], "weights_preack": 6, "concern": 6, "major": 6, "spawn": [6, 7, 10, 28], "stage": [6, 15, 17, 21, 23, 28, 38, 41, 48, 49], "subject": [6, 28, 34, 46], "hopefulli": 6, "eas": [6, 9, 25], "though": 6, "instead": [6, 20, 28, 36, 38, 39, 40, 41, 42, 48, 49], "turn": [6, 29], "off": [6, 11, 12, 13, 26, 27, 29, 35, 38, 42, 50], "variou": [6, 8, 20, 26, 27, 30, 41, 42], "area": [6, 20], "extrem": [6, 20, 41], "situat": [6, 8, 20], "space": [6, 25, 41], "huge": [6, 20, 41], "impract": [6, 20], "consum": [6, 8, 20, 26, 27], "launcher": [6, 7, 39, 41, 45], "replic": 7, "everi": [7, 29, 35], "fed": 7, "overlap": [7, 40], "c10d": [7, 10], "ccl": [7, 10, 19, 39], "processgroup": [7, 10], "hold": [7, 10, 25, 41], "allgath": [7, 10, 19], "alltoal": [7, 19], "successfulli": 7, "v2": [7, 42], "oneccl_bindings_for_pytorch": [7, 10], "third": [7, 48], "parti": 7, "compute_backend": 7, "system": [7, 9, 38, 41, 42], "apt": 7, "yum": 7, "dnf": 7, "sudo": 7, "devel": 7, "11": [7, 29, 39, 40, 46], "inteloneapiroot": 7, "use_system_oneccl": 7, "ON": [7, 11, 26, 27, 29], "repositori": 7, "repo_url": 7, "u": [7, 9, 40], "holder": 7, "url": [7, 40], "oneccl_bind_pt": 7, "cwd": 7, "env": [7, 19, 31], "setvar": 7, "sh": [7, 19, 31], "var": [7, 19, 31], "basekit": [7, 19, 38], "oneapi_root": 7, "manag": [7, 9, 12, 13, 28, 35, 39, 45], "modif": [7, 10, 19, 23], "necessari": [7, 10, 19, 25, 26, 27, 29], "dist": [7, 10, 13, 38], "init_process_group": [7, 10], "exclus": [7, 10, 11, 39], "local": [7, 10, 19, 28, 39, 40, 41], "local_rank": [7, 10, 19], "wrap": [7, 10, 19], "device_id": [7, 8, 10, 26], "exactli": [7, 9, 50], "resid": 7, "seed_numb": 
7, "illustr": [7, 10, 23, 25, 39, 41, 48, 50], "Or": [7, 38], "example_ddp": 7, "super": [7, 10, 12, 13, 15, 24, 25, 28, 38], "__name__": [7, 10, 38], "__main__": [7, 10, 38, 39, 40], "123": 7, "mpi_world_s": 7, "pmi_siz": 7, "mpi_rank": 7, "pmi_rank": 7, "world_siz": [7, 10], "els": [7, 20, 25, 26, 49], "world": 7, "master_addr": [7, 10], "127": [7, 39], "master_port": [7, 10], "29500": [7, 39], "global": [7, 26, 28], "get_rank": 7, "get_world_s": 7, "loss_fn": [7, 24], "mseloss": 7, "rune": 7, "randn": [7, 15, 16, 24, 25, 26, 27, 29, 40, 45], "label": [7, 12, 13, 17], "l": 7, "mpirun": 7, "card": [7, 25, 35, 38, 42], "regard": [7, 25, 45], "explicit": [7, 28, 29, 41], "minor": 7, "single_card": 7, "single_card_dist": 7, "importerror": [7, 38, 42], "rais": [7, 15, 27, 38], "multiprocess": [7, 10], "multi_process_spawn": 7, "main_work": 7, "put": [7, 8, 10, 26, 41], "train_sampl": [7, 19], "epoch": [7, 10, 19, 24], "set_epoch": [7, 10], "adjust": 7, "warp": 7, "sampler": [7, 10, 19], "loader": [7, 24], "shuffl": [7, 10], "num_work": [7, 10], "pin_memori": [7, 10], "wide": [8, 21, 50], "adopt": [8, 35, 42], "numpi": 8, "domain": [8, 17, 21], "interoper": 8, "v0": 8, "7": [8, 10, 15, 20, 28, 29, 39, 40, 50], "relat": [8, 10, 23, 26, 39, 41, 45], "extern": 8, "from_dlpack": 8, "t2": 8, "empti": [8, 25, 29, 39], "capsule2": 8, "to_dlpack": 8, "dlmanagedtensor": 8, "stride": [8, 12, 13, 15, 28], "pars": [8, 10], "extract": 8, "data_ptr": 8, "respons": [8, 23, 29, 35], "atendlmtensor": 8, "ndim": 8, "dmlc": 8, "io": 8, "spec": 8, "dldevicetyp": 8, "kdloneapi": 8, "kdlsycl": 8, "reli": [8, 25, 28], "filter": 8, "selector": 8, "actual": [8, 9, 25, 38, 42, 50], "parent": 8, "get_devic": 8, "valid": [8, 10, 11, 50], "three": [8, 35], "host": [8, 26, 27], "far": [8, 30], "recogn": 8, "probabl": [8, 10, 38], "hard": [8, 25, 38], "monitor": [8, 47], "flow": [8, 38, 45], "readi": 8, "highli": [9, 16, 23, 31, 35, 41, 42], "org": [9, 24, 38, 43], "walk": 9, "come": [9, 41], 
"flavor": 9, "aot": [9, 11], "cpp_extens": 9, "approach": [9, 35, 38], "latter": 9, "afterward": [9, 39, 41], "besid": [9, 26, 35, 41, 42], "long": [9, 25, 35, 38, 50], "term": [9, 34], "lltm": 9, "dpcppextens": 9, "dpcppbuildextens": 9, "ext_modul": 9, "lltm_xpu": 9, "lltm_xpu_kernel": 9, "cmdclass": 9, "build_ext": 9, "conveni": [9, 12, 13], "correct": [9, 10, 25], "equival": [9, 38, 42, 49], "vanilla": 9, "include_dir": 9, "include_path": 9, "And": [9, 22, 28, 40, 42], "goe": 9, "plug": 9, "previous": [9, 40], "were": [9, 39, 40, 41], "elabor": 9, "fly": 9, "background": [9, 41], "temporari": 9, "tmp": [9, 15, 26, 40], "torch_extens": 9, "ver": 9, "_xpu": 9, "emit": 9, "ninja": 9, "fact": [9, 25, 41], "home": [9, 19, 38, 39, 40], "user_nam": 9, "ones": [9, 23], "complic": [9, 29, 39, 41], "increment": 9, "reload": 9, "18": [9, 29, 39, 40], "compon": [9, 22, 34, 35], "set_source_files_properti": 9, "compile_flag": 9, "add_librari": 9, "torch_librari": 9, "target_include_directori": 9, "public": [9, 42], "python_include_dir": 9, "torch_ipex_include_dir": 9, "prefix": [9, 39], "cmake_prefix_path": 9, "dcmake_c_compil": 9, "dcmake_cxx_compil": 9, "aval": 9, "c10_stream": 9, "associ": [9, 43], "subsequ": [9, 25, 41], "yourself": 9, "strategi": [9, 20, 41], "pybind11": 9, "ultim": 9, "care": [9, 29, 40], "consid": 9, "cuda": [9, 10, 26, 42], "declar": 9, "lltm_xpu_forward": 9, "old_h": 9, "old_cel": 9, "lltm_xpu_backward": 9, "grad_h": 9, "grad_cel": 9, "new_cel": 9, "input_g": 9, "output_g": 9, "candidate_cel": 9, "gate_weight": 9, "check_xpu": 9, "torch_check": 9, "is_xpu": 9, "check_contigu": 9, "is_contigu": [9, 25], "contigu": [9, 25, 35, 41, 42, 45], "check_input": 9, "lltm_forward": 9, "lltm_backward": 9, "pybind11_modul": 9, "torch_extension_nam": 9, "bridg": 9, "natur": [9, 25, 50], "templat": [9, 16, 35], "typenam": 9, "scalar_t": 9, "sigmoid": [9, 42, 45], "z": 9, "0f": 9, "exp": [9, 42, 45], "At": [9, 35], "header": 9, "essenti": 9, "d_sigmoid": 9, 
"d_tanh": 9, "tanh": [9, 42, 45], "elu": [9, 42, 45], "alpha": [9, 48, 49], "fmax": 9, "fmin": 9, "d_elu": 9, "d_relu": 9, "hand": 9, "cat": [9, 12, 13, 16, 39, 40], "gate": 9, "addmm": [9, 12, 13, 42], "transpos": [9, 42, 45], "state_s": 9, "new_h": 9, "zeros_lik": 9, "at_dispatch_floating_typ": 9, "lltm_forward_xpu": 9, "lltm_xpu_forward_kernel": 9, "purpos": [9, 39, 40, 41, 46], "lambda": 9, "As": [9, 15, 23, 28, 35, 38, 39, 40, 41, 48, 49], "instanti": 9, "retriev": [9, 41], "doubl": 9, "at_dispatch_all_typ": 9, "size_t": 9, "1024": [9, 26, 27, 41], "work_group": 9, "cgf": 9, "handler": [9, 17, 26, 40], "cgh": 9, "kfn": 9, "nd_item": 9, "item": [9, 10, 19, 24], "get_group": 9, "get_group_rang": 9, "get_local_id": 9, "gates_row": 9, "parallel_for": 9, "nd_rang": 9, "grid": [9, 20], "fill": 9, "matric": 9, "2048": 9, "8": [9, 17, 20, 29, 39, 40, 41], "introductori": 9, "underlai": 9, "right": [9, 31, 35, 50], "inde": [9, 38], "high": [9, 21, 41, 48, 49, 50], "agnost": 9, "ineffici": 9, "dimension": 9, "much": [9, 22, 25, 41, 49, 50], "pattern": [9, 23, 25, 36, 42, 43, 47], "packedtensoraccessor32": 9, "lltm_xpu_backward_kernel": 9, "d_old_cel": 9, "d_gate": 9, "d_gates_": 9, "d_old_cell_": 9, "d_output_g": 9, "d_tanh_new_cel": 9, "d_new_cel": 9, "d_candidate_cel": 9, "d_input_g": 9, "lltm_backward_xpu": 9, "packed_accessor32": 9, "d_gate_weight": 9, "reshap": 9, "d_weight": 9, "mm": [9, 12, 13], "d_bia": 9, "sum": [9, 10, 24, 25, 42, 45, 48], "keepdim": [9, 10], "d_x": 9, "d_old_h": 9, "d_input": 9, "similar": [10, 22, 26, 27, 38, 41, 42, 46], "reducescatt": 10, "align": [10, 26, 27, 42, 50], "convent": 10, "fullyshardeddataparallel": 10, "trigger": [10, 18, 23, 36, 38, 42, 44], "throw": 10, "argpars": 10, "functool": 10, "lr_schedul": 10, "steplr": 10, "mp": 10, "distributeddataparallel": [10, 42], "distributedsampl": [10, 19], "fully_sharded_data_parallel": 10, "cpuoffload": 10, "backwardprefetch": 10, "size_based_auto_wrap_polici": 10, "enable_wrap": 10, 
"localhost": 10, "12355": 10, "cleanup": [10, 38], "destroy_process_group": [10, 38], "toi": 10, "handwritten": 10, "digit": [10, 50], "classif": [10, 38], "net": 10, "conv1": 10, "32": [10, 25, 39, 40, 50], "conv2": [10, 28], "64": [10, 12, 13, 15, 24, 28, 30, 39], "dropout1": 10, "25": [10, 39, 40], "dropout2": 10, "fc1": 10, "9216": 10, "fc2": 10, "relu": [10, 24, 25, 38, 42, 43, 45], "max_pool2d": 10, "flatten": [10, 24, 28], "log_softmax": [10, 13], "logic": [10, 20, 25, 29, 38, 40, 41], "ddp_loss": 10, "nll_loss": [10, 12, 13, 19], "reduct": 10, "all_reduc": 10, "reduceop": 10, "tloss": [10, 19], "6f": 10, "test_load": 10, "pred": [10, 24], "argmax": [10, 24], "max": [10, 30, 38, 42, 44], "eq": [10, 42], "view_a": 10, "test_loss": 10, "averag": [10, 19, 26, 27], "4f": 10, "2f": 10, "fsdp_main": 10, "1307": 10, "3081": 10, "dataset1": 10, "mnist": 10, "dataset2": 10, "sampler1": 10, "num_replica": [10, 19], "sampler2": 10, "train_kwarg": 10, "test_kwarg": 10, "test_batch_s": 10, "xpu_kwarg": 10, "my_auto_wrap_polici": 10, "min_num_param": 10, "init_start_ev": 10, "event": 10, "enable_tim": 10, "init_end_ev": 10, "adadelta": 10, "step_siz": 10, "gamma": 10, "1000": 10, "sec": 10, "save_model": 10, "barrier": [10, 38], "mnist_cnn": 10, "final": [10, 38, 43, 44], "parser": 10, "argumentpars": 10, "add_argu": 10, "metavar": 10, "14": [10, 29, 39, 40], "rate": [10, 19, 50], "action": 10, "store_tru": 10, "random": [10, 19, 20, 38], "parse_arg": 10, "nproc": [10, 39], "join": [10, 41], "snippet": [10, 15, 16, 26, 27, 36], "fsdp_mnist_xpu": 10, "who": [11, 15, 42, 44], "overrid": [11, 22], "defaultvalu": 11, "use_onemkl": [11, 38, 42], "bla": 11, "use_channels_last_1d": 11, "1d": 11, "use_persist_stream": 11, "persist": 11, "use_scratchpad_mod": 11, "scratchpad": 11, "use_primitive_cach": 11, "use_queue_barri": 11, "submit_barri": 11, "dummi": [11, 40], "use_multi_context": 11, "use_profil": 11, "legaci": [11, 38], "profil": [11, 38, 42], "use_kineto": [11, 26], 
"kineto": [11, 38, 42], "use_sycl_assert": 11, "assert": [11, 26], "use_itt_annot": 11, "itt": 11, "annot": 11, "use_split_fp64_loop": 11, "fp64": [11, 38, 42], "element": [11, 25, 48, 49], "wise": [11, 36, 42, 48, 49], "use_xetla": 11, "xetla": [11, 16], "build_by_per_kernel": 11, "per_kernel": 11, "use_aot_devlist": [11, 44], "build_internal_debug": 11, "build_separate_op": 11, "build_simple_trac": 11, "build_opt_level": 11, "add": [11, 12, 13, 19, 20, 25, 27, 29, 38, 40, 42, 45, 46, 48, 49, 50], "ox": 11, "accept": 11, "optioncpu": 11, "ipex_fp32_math_mod": 11, "optiongpu": 11, "ipex_verbos": 11, "ipex_xpu_sync_mod": 11, "enforc": 11, "ipex_tile_as_devic": 11, "partit": [11, 19, 41, 45], "map": [11, 25], "composit": 11, "optionexperiment": 11, "ipex_simple_trac": [11, 29], "ipex_ze_trac": [11, 26], "resnet50": [11, 18, 20, 26, 39, 41, 45], "lower": [12, 13, 23, 35, 42, 50], "lighter": [12, 13], "smaller": [12, 13, 42], "sacrif": [12, 13], "trade": [12, 13, 35, 42], "slower": [12, 13, 41], "accur": [12, 13, 38], "primarili": 12, "speedup": [12, 13, 16, 35, 42], "simplenet": [12, 13, 30], "pad": [12, 13, 15, 25, 28, 42], "y": [12, 13, 22, 24, 28, 50], "chosen": [12, 13, 16, 20], "categori": [12, 13], "imag": [12, 13, 25, 38, 41, 45], "float64": [12, 13], "variant": [12, 13], "suppli": [12, 13, 25], "addmm_": [12, 13], "cannot": [12, 13, 25, 38, 42, 48], "stabil": [12, 13], "regardless": [12, 13], "unlist": [12, 13], "downstream": [12, 13], "assum": [12, 13, 31, 40, 41], "believ": [12, 13, 25], "unstabl": [12, 13], "conv1d": [12, 13, 25, 45], "conv3d": [12, 13, 42, 45], "conv_transpose1d": [12, 13], "conv_transpose2d": 12, "conv_transpose3d": [12, 13], "bmm": [12, 13], "baddbmm": [12, 13], "addbmm": [12, 13], "conv_tbc": [12, 13], "group_norm": 12, "_native_multi_head_attent": 12, "avg_pool3d": 12, "binary_cross_entropi": [12, 13], "grid_sampl": [12, 13], "polar": 12, "prod": 12, "quantil": 12, "nanquantil": 12, "stft": 12, "cdist": [12, 13], "view_as_complex": 12, 
"choleski": 12, "cholesky_invers": 12, "cholesky_solv": 12, "invers": 12, "lu_solv": 12, "matrix_rank": 12, "orgqr": 12, "ormqr": 12, "pinvers": 12, "max_unpool2d": 12, "max_unpool3d": 12, "adaptive_avg_pool3d": 12, "reflection_pad1d": 12, "reflection_pad2d": 12, "replication_pad1d": 12, "replication_pad2d": 12, "replication_pad3d": 12, "mse_loss": [12, 13], "cosine_embedding_loss": [12, 13], "nll_loss2d": [12, 13], "hinge_embedding_loss": [12, 13], "poisson_nll_loss": [12, 13], "smooth_l1_loss": [12, 13], "cross_entropy_loss": [12, 13], "l1_loss": [12, 13], "huber_loss": [12, 13], "margin_ranking_loss": [12, 13], "soft_margin_loss": [12, 13], "triplet_margin_loss": [12, 13], "multi_margin_loss": [12, 13], "ctc_loss": 12, "kl_div": [12, 13], "multilabel_margin_loss": [12, 13], "binary_cross_entropy_with_logit": [12, 13], "fft_fft": [12, 13], "fft_ifft": [12, 13], "fft_fft2": [12, 13], "fft_ifft2": [12, 13], "fft_fftn": [12, 13], "fft_ifftn": [12, 13], "fft_rfft": [12, 13], "fft_irfft": [12, 13], "fft_rfft2": [12, 13], "fft_irfft2": [12, 13], "fft_rfftn": [12, 13], "fft_irfftn": [12, 13], "fft_hfft": [12, 13], "fft_ihfft": [12, 13], "linalg_cond": 12, "linalg_matrix_rank": 12, "linalg_solv": 12, "linalg_choleski": 12, "linalg_svdv": 12, "linalg_eigv": 12, "linalg_eigvalsh": 12, "linalg_inv": 12, "linalg_householder_product": 12, "linalg_tensorinv": 12, "linalg_tensorsolv": 12, "fake_quantize_per_tensor_affin": 12, "eig": 12, "geqrf": 12, "lstsq": 12, "_lu_with_info": 12, "qr": 12, "svd": 12, "symeig": 12, "triangular_solv": 12, "fractional_max_pool2d": 12, "fractional_max_pool3d": 12, "adaptive_max_pool3d": 12, "multilabel_margin_loss_forward": 12, "linalg_qr": 12, "linalg_cholesky_ex": 12, "linalg_svd": 12, "linalg_eig": 12, "linalg_eigh": 12, "linalg_lstsq": 12, "linalg_inv_ex": 12, "index_copi": 12, "g": [12, 13, 23, 25, 35, 38, 42, 43, 44], "intervent": [12, 13], "mixtur": [12, 13], "_convolut": 13, "prelu": 13, "addmv": 13, "addr": [13, 39], "mv": 13, 
"chain_matmul": 13, "linalg_multi_dot": 13, "_thnn_fused_gru_cel": 13, "gru_cel": 13, "nll_loss_nd": 13, "reciproc": 13, "pow": [13, 42, 45], "frobenius_norm": 13, "nuclear_norm": 13, "cosine_similar": 13, "pdist": 13, "renorm": 13, "addcdiv": 13, "addcmul": 13, "atan2": 13, "bilinear": 13, "cross": [13, 38, 39, 40, 41], "index_put": 13, "tensordot": 13, "scatter_add": 13, "enable_auto_channels_last": 14, "disable_auto_channels_last": 14, "bring": [14, 22, 23, 24, 35, 38, 39, 41, 43, 50], "oob": [15, 42], "easili": [15, 22], "inevit": 15, "simplifi": 15, "optimum": 15, "impot": 15, "claus": [15, 48, 49], "monkei": 15, "patch": 15, "embedding_bag": 15, "qa": 15, "clear": 15, "ninstanc": [15, 20, 39], "ncore": [15, 39], "28": [15, 20, 24, 39, 40, 41], "run_qa": 15, "model_name_or_path": [15, 36], "dataset_nam": 15, "squad": 15, "do_ev": 15, "per_device_train_batch_s": 15, "12": [15, 20, 29, 39, 40, 46], "learning_r": 15, "3e": 15, "num_train_epoch": 15, "max_seq_length": 15, "384": [15, 40], "doc_strid": 15, "output_dir": [15, 20], "debug_squad": 15, "dummymodul": 15, "input1": 15, "kernel_s": [15, 25], "track_running_stat": 15, "customized_forward": 15, "method1": 15, "method2": 15, "unabl": [15, 38, 43], "hook": 15, "behaviour": 15, "repeat": [15, 25, 26, 50], "traced_model": [15, 22, 24, 38, 45], "special": [16, 35], "empir": 16, "ideal": 16, "xe": [16, 35, 41, 42], "algebra": [16, 35], "compute_eng": 16, "xpucomputeeng": 16, "x1": [16, 28], "20": [16, 25, 29, 38, 39, 40, 42], "x2": [16, 28], "onednn_layout": 16, "highest": 16, "upsampl": [16, 25], "align_corn": 16, "step2": 16, "continu": [16, 29, 38, 40, 42], "step3": 16, "step4": 16, "fall": [16, 18], "back": [16, 18, 25, 38, 50], "averagepool2d": 16, "maxpool2d": [16, 45], "maxpool3d": 16, "layernorm": [16, 45], "permutecontigu": 16, "softmax": [16, 42, 45], "greater": [16, 38], "fp16": [16, 31, 35, 42], "upsampleblinear2d": 16, "upsamplenearest": 16, "dnn": [17, 21], "e4m3": 17, "sign": [17, 26, 27, 50], 
"expon": [17, 50], "mantissa": [17, 50], "e5m2": 17, "FOR": 17, "onlin": 17, "decompress": 17, "delai": 17, "quantizaiton": 17, "showcas": 17, "_fp8_convert": 17, "convert_fp8_model": 17, "optimize_dtyp": 17, "fp8_autocas": 17, "input_id": 17, "token_type_id": 17, "segment_id": 17, "attention_mask": 17, "input_mask": 17, "masked_lm_label": 17, "next_sentence_label": 17, "tri": 18, "failur": [18, 38], "incorrect": [18, 38], "meanwhil": [18, 41], "noqa": [18, 24], "f401": [18, 24], "tensorflow": [19, 25], "kera": 19, "apach": [19, 34, 40], "mxnet": 19, "goal": 19, "mpi": [19, 38, 39], "concept": [19, 25, 41], "broadcast": 19, "hvd": [19, 38], "server": [19, 38, 40, 41], "forth": 19, "devid": 19, "effect": [19, 23, 40, 41, 46, 50], "compens": 19, "distributedoptim": 19, "deleg": [19, 43], "broadcast_paramet": 19, "root_rank": 19, "broadcast_optimizer_st": 19, "consist": [19, 35, 41], "restor": 19, "corrupt": 19, "accomplish": 19, "guard": 19, "named_paramet": 19, "log_interv": 19, "There": [20, 26, 27, 28, 31, 35, 41], "thing": [20, 38, 41], "yaml": 20, "togeth": [20, 28, 41, 44], "max_trial": 20, "trial": 20, "histori": [20, 35], "csv": 20, "hyperparam": 20, "mandatori": 20, "hp": 20, "ncores_per_inst": 20, "all_physical_cor": 20, "ncore_per_inst": 20, "all_logical_cor": 20, "use_all_nod": 20, "num_nod": 20, "use_logical_cor": [20, 40], "is_hyperthreading_en": 20, "disable_numactl": [20, 40], "disable_iomp": [20, 40], "malloc": [20, 39, 41], "tc": 20, "je": 20, "previou": [20, 25, 41], "hyperparamt": 20, "minim": [20, 41, 46], "maxim": 20, "higher_is_bett": 20, "target_v": 20, "inf": 20, "minimum": [20, 25], "suppos": [20, 31, 41], "platinum": [20, 40, 41], "8180m": [20, 41], "socket": [20, 40, 41], "physic": [20, 28, 40, 41], "conf_fil": 20, "hypertune_directori": 20, "termin": [20, 38], "15": [20, 29, 39, 40], "339081764221191": 20, "gave": 20, "offlin": 21, "woq": [21, 35], "pre": [21, 35, 44], "q": 21, "langugu": 21, "mm_qkv_int4": 21, "mm_bias_int4": 21, 
"mm_silu_int4": 21, "mm_resmul_int4": 21, "mm_bias_gelu_int4": 21, "mm_bias_resadd_resadd_int4": 21, "firstli": [21, 35], "present": [21, 40], "6b": [21, 35], "intens": [21, 30], "decid": [22, 28, 35], "satisfi": [22, 37], "tradeoff": 22, "default_static_qconfig": [22, 24, 40, 45], "histogramobserv": 22, "perchannelminmaxobserv": 22, "qint8": 22, "per_channel_symmetr": 22, "ao": 22, "per_tensor_affin": 22, "methond": 22, "obsev": 22, "sete": 22, "skylak": 22, "quant_stat": [22, 24], "user_model": [22, 45], "calibration_data_set": 22, "qparam": 22, "achang": 22, "save_qconf_summari": [22, 24], "qconf_summari": [22, 24], "load_qconf_summari": 22, "quantized_model": [22, 45], "dynamic_qconfig": 22, "default_dynamic_qconfig": [22, 40], "placeholderobserv": 22, "compute_dtyp": 22, "gru": 22, "lstmcell": 22, "rnncell": 22, "grucel": 22, "workflow": 23, "overal": [23, 41], "view": [23, 25, 26, 28, 42, 45, 50], "therefor": [23, 41], "move": [23, 25, 31, 41], "conv_relu": 23, "modelimp": [23, 36], "quantwrapp": [23, 36], "obtain": [23, 36], "calib_dataset": [23, 36], "inference_data": [23, 36], "asymmetr": [23, 42], "zero_point": 23, "swap": [23, 38], "Be": 23, "free": [23, 39], "warmup": [23, 26, 36], "warmup_data": [23, 36], "graph_for": [23, 36, 45], "inference_dta": [23, 36], "whole": [23, 28, 42, 48], "conv_unari": 23, "conv_binari": 23, "linear_unari": 23, "conv_sum_relu": 23, "henc": [23, 42], "consider": 23, "analysi": [23, 41], "bother": 24, "receip": [24, 28], "portion": 24, "beginn": 24, "quickstart_tutori": 24, "training_data": 24, "fashionmnist": 24, "test_data": 24, "train_dataload": 24, "test_dataload": 24, "break": 24, "neuralnetwork": 24, "linear_relu_stack": 24, "sequenti": [24, 25], "logit": 24, "predict": 24, "backpropag": 24, "7f": 24, "5d": 24, "inc": 24, "accu": 24, "tuned_conf": 24, "represent": 25, "multidimension": 25, "arrai": 25, "nd": 25, "semant": 25, "dens": 25, "spars": [25, 38], "coo": 25, "canon": 25, "assign": [25, 26, 40, 41], "2d": 25, 
"height": 25, "width": [25, 35], "bmp": 25, "contiguous_format": [25, 41], "close": [25, 39, 41], "difficult": 25, "manipul": 25, "to_dens": 25, "Will": 25, "secret": 25, "ingredi": 25, "cover": [25, 35, 39, 45], "almost": 25, "foundat": [25, 41], "upper": [25, 41], "expens": 25, "sequenc": [25, 26, 35, 38, 50], "benefici": 25, "nb": 25, "me": 25, "roughli": 25, "50": [25, 39, 40], "perf": 25, "mkldnn": 25, "mkldnn_util": 25, "to_mkldnn": 25, "explain": [25, 46, 50], "diagram": [25, 41], "conclus": 25, "But": 25, "neglig": 25, "organ": 25, "question": 25, "reinterpret": 25, "answer": 25, "chw": 25, "hw": [25, 44], "offset": [25, 35], "stride_n": 25, "stride_c": 25, "stride_h": 25, "stride_w": 25, "merit": 25, "express": 25, "noncontigu": 25, "big": 25, "n1": 25, "n2": 25, "mind": [25, 40], "someth": 25, "rfc": 25, "hwc": 25, "wc": 25, "chwn": 25, "hwn": 25, "wn": 25, "outplac": 25, "_appli": 25, "spontan": 25, "tell": [25, 28, 41], "NOT": [25, 39], "compris": 25, "depend": [25, 41, 42], "guidelin": 25, "awar": [25, 28, 39, 40], "my": 25, "recent": 25, "cudnn": 25, "accommod": 25, "hidden": [25, 35], "ideep": 25, "format_tag": 25, "src_md": 25, "desc": 25, "data_typ": 25, "f32": 25, "src_mem": 25, "src_data_ptr": 25, "hwio": 25, "avx512": [25, 40, 42, 46], "3d": 25, "batchnorm1d": 25, "maxpool1d": 25, "div": [25, 42, 45], "nearest": 25, "sycl_devic": 25, "test_input": 25, "test_input_xpu": 25, "to_channels_last_1d": 25, "tenor": 25, "xpu_r": 25, "is_contiguous_channels_last_1d": 25, "input_xpu": 25, "meta": [25, 35], "invalid": [25, 38, 41], "corrspond": 25, "prebuilt": [26, 27, 38, 44], "wheel": [26, 27, 38, 44], "affili": 26, "use_onetrac": 26, "onetrac": 26, "layer": [26, 28, 35, 42], "profileract": 26, "input_tensor": [26, 27], "prof": [26, 27], "proper": [26, 27], "output_tensor_1": [26, 27], "nonzero": [26, 27], "output_tensor_2": [26, 27], "uniqu": [26, 27, 29], "tabl": [26, 27, 44], "key_averag": [26, 27], "my_schedul": 26, "skip_first": 26, "trace_handl": 
26, "p": 26, "sort_bi": [26, 27], "self_xpu_time_tot": 26, "row_limit": 26, "trace_": 26, "step_num": 26, "outsid": [26, 28], "on_trace_readi": 26, "forget": 26, "record_shap": [26, 27], "rememb": 26, "effort": 26, "contextlib": 26, "profiler_setup": 26, "nullcontext": 26, "should_profil": 26, "profileact": 26, "unset": 26, "involv": [26, 50], "Such": 26, "a_0": 26, "a_1": 26, "b_0": 26, "b_1": 26, "export_chrome_trac": [26, 27], "trace_example_on_multi_devic": 26, "consol": [26, 27, 29], "exclud": [26, 27], "children": [26, 27], "percentag": [26, 27], "propot": [26, 27], "percentasg": [26, 27], "avg": [26, 27], "consumpt": [26, 27], "sonsumpt": [26, 27], "viewer": [26, 27], "perfetto": 26, "ui": 26, "dev": 26, "trace_fil": [26, 27], "examin": [26, 49], "build_profil": 27, "autograd": 27, "profiler_legaci": 27, "use_xpu": 27, "temporarili": 27, "sort": 27, "revers": 27, "coupl": [28, 41], "omp": [28, 38, 39, 40, 41], "ld_preload": [28, 38, 39, 40, 41, 42], "libiomp5": [28, 39, 40, 41], "model_script": 28, "examplenet": 28, "examplenet1": 28, "start_dim": 28, "examplenet2": 28, "y1": 28, "y2": 28, "model1": 28, "traced_model1": 28, "model2": 28, "traced_model2": 28, "multi_stream_model": 28, "datatyp": [28, 42], "receipt": 28, "steam": 28, "input_hint": 28, "output_hint": 28, "async": 28, "wake": 28, "imper": 28, "suffer": 28, "gil": 28, "hurt": 28, "mitig": 28, "omp_num_thread": [28, 38, 39, 40], "phase": [28, 35, 38], "s1": 28, "c1": 28, "numactl": [28, 39, 40], "resourc": [28, 37, 40, 41, 45], "superset": 28, "undefin": [28, 38, 42], "gb": 28, "simultan": 28, "cpu_pool1": 28, "cpu_pool2": 28, "task1": 28, "task2": 28, "y1_futur": 28, "y2_futur": 28, "y_runtim": 28, "kmp_": 28, "fulfil": 28, "bound": [28, 35, 41, 48, 49], "serv": 28, "sub": [28, 41, 42], "futuretensor": 28, "didn": 28, "dlopen": 28, "symbol": [28, 38, 42], "screen": 29, "bracket": 29, "enable_simple_trac": 29, "disable_simple_trac": 29, "using_simple_trac": 29, "unintention": 29, "exmapl": 29, 
"262618": 29, "wrapper__empty_strid": 29, "atenipextypexpu": 29, "empty_strid": 29, "wrapper__copy_": 29, "copy_": 29, "wrapper___unique2": 29, "_unique2": 29, "wrapper__clon": 29, "wrapper___reshape_alia": 29, "_reshape_alia": 29, "wrapper_memory_format_empti": 29, "wrapper__as_strid": 29, "as_strid": 29, "wrapper___local_scalar_dens": 29, "_local_scalar_dens": 29, "wrapper__resize_": 29, "resize_": 29, "19": [29, 39, 40], "pid": 29, "tid": 29, "name1": 29, "name2": 29, "arrow": 29, "relationship": 29, "child": 29, "gdb": 29, "inductor": [30, 42], "triton": [30, 42], "codegen": 30, "addition": [30, 40], "facilit": 30, "contribut": [30, 39], "ever": 30, "unlock": 30, "compiled_model": 30, "weight_decai": [30, 48, 49], "loss_funct": 30, "demostr": 31, "cache_en": 31, "bash": [31, 38], "copyright": 34, "notic": [34, 39, 40], "condit": [34, 38], "architectur": [35, 41], "decod": 35, "multiheadattent": 35, "feedforward": 35, "kv_cach": 35, "lot": [35, 38, 42, 43, 44], "smoothquant": 35, "hub": 35, "7b": 35, "hf": 35, "13b": 35, "70b": 35, "eleutherai": 35, "30b": 35, "3b": 35, "bigscienc": 35, "7b1": 35, "quantzat": 35, "codellama": 35, "indirect": 35, "rope": 35, "tpp": 35, "progress": [35, 38], "expand": 35, "brief": 35, "xelta": 35, "rotari": 35, "posit": 35, "squar": [35, 42, 45], "rmsnorm": 35, "beam": 35, "idx": [35, 39], "reorder_cach": 35, "bottleneck": 35, "kept": [35, 50], "buffer": 35, "wast": 35, "prefil": 35, "influenc": [35, 39, 41], "left": [35, 40, 50], "timestamp": 35, "elimin": 35, "sdpa": 35, "shard": [35, 42], "lead": 35, "significantli": 35, "heavier": 35, "becom": [35, 41], "bandwidth": 35, "token": 35, "content": [36, 42], "transpar": [36, 41, 42, 43], "undergo": 36, "overview": [36, 42], "automodelforcausallm": 36, "amp_dtyp": 36, "squeez": 37, "tool": [37, 38, 41, 46], "problem": [38, 40, 41, 48, 49], "unsupport": [38, 42], "improp": 38, "unload": 38, "conda": [38, 41, 42], "encount": [38, 43, 44], "ship": 38, "libstdc": 38, "conflict": 38, 
"_glibcxx_use_cxx11_abi": 38, "_znk5torch8autograd4node4nameb5cxx11ev": [38, 42], "appear": [38, 42], "glibcxx_use_cxx11_abi": 38, "bad": 38, "rn50": [38, 45], "friendli": [38, 41], "ungracefulli": 38, "997": 38, "170": [38, 42, 44], "wsl2": [38, 42], "ram": 38, "killer": 38, "dmesg": 38, "oom": 38, "had": [38, 41], "kill": [38, 40], "max_job": 38, "conserv": 38, "slow": 38, "cl_device_not_found": 38, "tdr": 38, "window": [38, 42], "tdrdelai": 38, "registri": 38, "reboot": 38, "tsan": 38, "compat": [38, 50], "workaround": 38, "omp_tool": 38, "unblock": 38, "soon": 38, "sometim": [38, 39, 41], "ur_l0_in_order_barrier_by_sign": 38, "converg": 38, "24": [38, 39, 40], "hour": 38, "divid": [38, 40, 41, 45], "hang": 38, "1550": 38, "race": 38, "happen": 38, "torch_llm_allreduc": 38, "pcie": 38, "xelink": 38, "usr": [38, 39, 40, 42, 46], "ld": [38, 39, 41, 42], "lmkl_sycl": [38, 42], "lmkl_intel_ilp64": [38, 42], "lmkl_core": [38, 42], "lmkl_tbb_thread": [38, 42], "linker": [38, 42], "exit": [38, 39, 42], "v": [38, 42], "occur": [38, 42], "resolv": [38, 42], "mkl_dpcpp_root": [38, 42], "mkl_lapack_dspevd": 38, "fatal": [38, 42], "libmkl_vml_avx512": 38, "libmkl": [38, 42], "vml": [38, 42], "incorrectli": [38, 42], "oserror": [38, 42], "wrong": [38, 42], "libmkl_intel_ilp64": [38, 42], "suffix": [38, 42], "test_weight_norm": 38, "testnnmethod": 38, "test_weight_norm_differnt_typ": 38, "a770": [38, 42], "graphic": [38, 41, 42, 44], "test_foreach": 38, "testtorchmethod": 38, "test_foreach_co": 38, "test_foreach_sin": 38, "test_polar": 38, "test_polar_float": 38, "test_special_op": 38, "test_special_spherical_bessel_j0": 38, "test_transducer_loss": 38, "test_vallina_transducer_loss": 38, "pypi": 38, "remark": [38, 41], "intel_pytorch_extens": 38, "112": [38, 41], "poor": 38, "xlm": 38, "roberta": 38, "casual": 38, "gpt2": 38, "summar": 38, "t5": 38, "allenai": 38, "longform": 38, "409": 38, "_c": [38, 46], "_jit_set_texpr_fuser_en": 38, "csrc": [38, 46], "tensorexpr_fus": 38, 
"settensorexprfuseren": 38, "integr": [38, 41, 42, 44], "runtimeerror": 38, "overflow": 38, "unpack": 38, "min": [38, 42], "exce": [38, 41], "quantize_per_tensor": 38, "pseudocod": 38, "omp_num_threa": 38, "prototyp": 38, "set_num_thread": 38, "freezed_model": 38, "run_benchmark": 38, "embeddingbag": 38, "bag": 38, "abnorm": 38, "avx2": [38, 46], "batchnorm": [38, 45], "rnnt": 38, "joint_net": 38, "caller": 38, "yet": 38, "pend": 38, "merg": [38, 42], "factor": 39, "properli": 39, "themselv": 39, "common": [39, 41, 50], "mainli": 39, "dir": [39, 46], "choic": [39, 50], "taskset": 39, "malloc_conf": [39, 41], "crash": [39, 41], "nnode": 39, "count": 39, "ip": 39, "hostnam": 39, "proc": [39, 41], "hostfil": 39, "mpiexec": 39, "hydra": 39, "np": 39, "ppn": 39, "genv": 39, "i_mpi_pin_domain": 39, "codeless": 39, "ut": 39, "mutual": 39, "favorit": 39, "kmp": [39, 41], "compact": [39, 40, 41], "stdout": 39, "undesir": 39, "_timestamp_inst": 39, "_timestamp_instance_": 39, "_core": 39, "run_20210712212258_inst": 39, "run_20210712212258_instance_0_cores_0": 39, "43": [39, 40], "gif": 39, "07": 39, "21": [39, 40], "22": [39, 40], "58": 39, "764": 39, "conda_prefix": [39, 40], "virtual_env": [39, 40], "lib64": [39, 40], "drop": [39, 40], "44": [39, 40], "kmp_affin": [39, 40, 41], "kmp_blocktim": [39, 40, 41], "23": [39, 40, 50], "26": [39, 40], "27": [39, 40, 41], "29": [39, 40], "30": [39, 40], "31": [39, 40], "33": [39, 40, 46], "34": [39, 40], "35": [39, 40], "36": [39, 40], "37": [39, 40], "38": [39, 40], "39": [39, 40], "40": [39, 40], "41": [39, 40], "42": [39, 40], "tee": 39, "run_20210712223308_inst": 39, "run_20210712223308_instance_0_cores_0": 39, "87": 39, "08": 39, "117": 39, "88": 39, "118": 39, "45": [39, 40], "46": [39, 40], "47": [39, 40], "48": [39, 40], "49": [39, 40], "51": [39, 40], "52": [39, 40], "53": [39, 40], "54": [39, 40], "55": [39, 40, 41], "56": [39, 40, 41], "57": 39, "59": 39, "60": 39, "61": 39, "62": 39, "63": 39, "65": 39, "66": [39, 46], 
"67": 39, "68": 39, "69": 39, "70": 39, "71": 39, "72": 39, "73": 39, "74": 39, "75": 39, "76": 39, "77": 39, "78": 39, "79": 39, "81": 39, "82": 39, "83": [39, 41], "84": [39, 41], "85": 39, "86": 39, "run_20210712214504_inst": 39, "run_20210712214504_instance_0_cores_22": 39, "04": [39, 42], "513": 39, "run_20210712220928_inst": 39, "run_20210712220928_instance_0_cores_0": 39, "09": 39, "355": 39, "356": 39, "deduct": 39, "run_20210712221615_inst": 39, "run_20210712221615_instance_0_cores_11": 39, "591": 39, "run_20210712221150_inst": 39, "run_20210712221150_instance_0_cores_0": 39, "run_20210712221150_instance_1_cores_22": 39, "233": 39, "236": 39, "run_20210712221415_inst": 39, "run_20210712221415_instance_0_cores_0": 39, "run_20210712221415_instance_1_cores_4": 39, "run_20210712221415_instance_2_cores_8": 39, "run_20210712221415_instance_3_cores_12": 39, "run_20210712221415_instance_4_cores_16": 39, "run_20210712221415_instance_5_cores_20": 39, "run_20210712221415_instance_6_cores_24": 39, "run_20210712221415_instance_7_cores_28": 39, "run_20210712221415_instance_8_cores_32": 39, "run_20210712221415_instance_9_cores_36": 39, "run_20210712221415_instance_10_cores_40": 39, "140": 39, "143": 39, "146": 39, "149": 39, "151": 39, "154": 39, "157": 39, "159": 39, "162": 39, "164": 39, "167": 39, "run_20210712221305_inst": 39, "run_20210712221305_instance_0_cores_0": 39, "run_20210712221305_instance_1_cores_11": 39, "run_20210712221305_instance_2_cores_22": 39, "run_20210712221305_instance_3_cores_33": 39, "470": 39, "471": 39, "473": 39, "476": 39, "479": 39, "instance_idx": 39, "independ": 39, "confirm": 39, "06": [39, 40], "175": 39, "176": 39, "177": 39, "run_20220106130151_instance_0_cores_0": 39, "235": 39, "jemallocl": 39, "oversize_threshold": [39, 41], "background_thread": [39, 41], "metadata_thp": [39, 41], "dirty_decay_m": [39, 41], "9000000000": [39, 41], "muzzy_decay_m": [39, 41], "libjemalloc": 39, "run_20210713153048_instance_0_cores_0": 39, "654": 39, 
"libtcmalloc": [39, 40], "655": 39, "run_20210713153333_instance_0_cores_0": 39, "784": 39, "run_20210713153659_instance_0_cores_0": 39, "blocktim": [39, 41], "00": 39, "760": [39, 40], "761": [39, 40], "omp_schedul": [39, 41], "omp_proc_bind": [39, 41], "run_20210713152500_instance_0_cores_0": 39, "give": 40, "ipex_en": 40, "procedur": 40, "tunin": 40, "dramat": [40, 41], "cpu_launcher_en": 40, "cpu_launcher_arg": 40, "hyperthread": 40, "ital": 40, "ptmalloc": 40, "use_default_alloc": 40, "tcmalloc": 40, "enable_tcmalloc": 40, "jemalloc": 40, "enable_jemalloc": 40, "gnu": [40, 46], "nth": [40, 41], "uniform": 40, "tunabl": 40, "signficantli": 40, "8180": 40, "affinit": 40, "unutil": 40, "restart": 40, "remain": [40, 49], "aliv": 40, "taken": 40, "worri": 40, "interrupt": 40, "dummy_tensor": 40, "check_trac": [40, 45], "bert_int8_jit": 40, "pretrain": [40, 45], "n_iter": 40, "rn50_int8_jit": 40, "usus": 40, "rn50_ipex_int8": 40, "image_classifi": 40, "similarli": 40, "bert_ipex_int8": 40, "transformer_handler_gener": 40, "setup_config": 40, "seq_classification_artifact": 40, "index_to_nam": 40, "nc": 40, "model_stor": 40, "rest": 40, "model_log": 40, "096": 40, "8375c": 40, "02": 40, "03": 40, "981": 40, "982": 40, "cases": 40, "ab": [40, 42, 45], "223": 40, "site": 40, "model_service_work": 40, "sock": 40, "unix": 40, "9000": 40, "762": 40, "763": 40, "9001": 40, "274": 40, "9002": 40, "975": 40, "9003": 40, "bench": 40, "amazon": 40, "ec2": 40, "m6i": 40, "24xlarg": 40, "reproduc": 40, "modelurl": 40, "inputpath": 40, "concurr": [40, 41], "huggingface_transform": 40, "sample_text_captum_input": 40, "articl": 41, "briefli": 41, "understand": [41, 47, 50], "knowledg": 41, "c620": 41, "chipset": 41, "purlei": 41, "chip": 41, "inclus": 41, "l2": 41, "2666": 41, "mhz": 41, "ddr4": 41, "six": 41, "ultra": 41, "interconnect": 41, "upi": 41, "microarchitectur": 41, "connect": 41, "transfer": 41, "equip": 41, "motherboard": 41, "attach": 41, "remot": 41, "asu": 41, 
"z11pa": 41, "d8": 41, "competit": [41, 42], "stall": 41, "busi": 41, "visit": 41, "uma": 41, "lscpu": 41, "onboard": [41, 48], "hyper": 41, "111": 41, "50ghz": 41, "node0": 41, "node1": 41, "sophist": 41, "brought": 41, "polici": [41, 42], "later": 41, "sysctl": 41, "balanc": 41, "great": 41, "placement": 41, "idea": [41, 50], "cpunodebind": 41, "membind": 41, "wikipedia": [41, 45], "multithread": 41, "primari": [41, 42], "consecut": 41, "fork": [41, 46], "libgomp": 41, "libiomp": 41, "commonli": 41, "gomp": 41, "comma": [41, 44], "gomp_cpu_affin": 41, "thrash": 41, "did": 41, "compet": 41, "proclist": 41, "sleep": 41, "200m": 41, "appropri": [41, 43], "sole": 41, "penal": 41, "role": 41, "unnecessari": 41, "destruct": 41, "emphas": 41, "mmuzzy_decay_m": 41, "straight": [41, 45], "forg": 41, "even": 41, "dealloc": [41, 43], "costli": 41, "gpertool": 41, "plu": 41, "pretti": 41, "nifti": 41, "gperftool": 41, "solv": [41, 48, 49], "set_flush_denorm": 41, "warm": 41, "threshold": 41, "usuali": 41, "maskrcnn": 41, "wav2vec2": 41, "recognit": 41, "onednn_primitive_cache_capac": 41, "65536": 41, "voic": 41, "4096": 41, "date": 42, "hbm": 42, "kv": 42, "quit": 42, "reach": 42, "verif": 42, "vehicl": 42, "emul": 42, "fsdp": 42, "publicli": 42, "oct": 42, "focus": [42, 45], "broader": 42, "webpag": 42, "ubuntu": 42, "v1": 42, "unaryop": 42, "sqrt": [42, 45, 48], "round": [42, 45, 50], "log_sigmoid": 42, "hardswish": [42, 45], "hardsigmoid": 42, "silu": [42, 45], "hardtanh": [42, 45], "leaky_relu": [42, 45], "binaryop": 42, "mul": [42, 45], "ne": 42, "ge": 42, "le": 42, "gelu": [42, 45], "mish": [42, 45], "concret": 42, "adamw": [42, 49], "permut": 42, "dequant": [42, 45], "pixelshuffl": 42, "leaki": [42, 45], "softplu": 42, "critic": 42, "xxx": 42, "glibcxx": 42, "cxx11": 42, "gcc": [42, 46], "path_to_your_onemkl": 42, "__release_lnx": 42, "lapack": 42, "dspevd": 42, "lp64": 42, "libmkl_sequenti": 42, "adapt": 43, "frequent": 43, "websit": 43, "splitsgd": [43, 50], 
"lifecycl": [43, 44], "beforehand": [43, 44], "benifit": [43, 44], "qualiti": [43, 44], "deliveri": [43, 44], "disadvantag": [43, 44, 50], "500mb": [43, 44], "5gb": [43, 44], "attempt": 43, "smallest": 43, "delimit": 44, "ats": 44, "m150": 44, "pvc": 44, "seper": 44, "opencl": 44, "spir64_gen": 44, "dag": 45, "acycl": 45, "constant": 45, "__dict__": 45, "front": 45, "propag": [45, 50], "convrelu": 45, "convsumrelu": 45, "mymodel": 45, "construct": 45, "convtranspose3d": 45, "clamp": 45, "___": 45, "_____": 45, "owner": 45, "otheriws": 45, "compuat": 45, "avx512_vnni": 46, "avx512_bf16": 46, "avx2_vnni": 46, "impli": 46, "findavx": 46, "aten_cpu_cap": 46, "_get_current_isa_level": 46, "addtion": 46, "subfold": 46, "rh": 46, "toolset": 46, "cmakefil": 46, "cpu_featur": 46, "cpu_feature_main": 46, "xcr0": 46, "00000000000602e7": 46, "mmx": 46, "sse": 46, "sse2": 46, "sse3": 46, "ssse3": 46, "sse4_1": 46, "sse4_2": 46, "aes_ni": 46, "sha": 46, "xsave": 46, "fma": 46, "f16c": 46, "avx_vnni": 46, "avx512_f": 46, "avx512_cd": 46, "avx512_pf": 46, "avx512_er": 46, "avx512_vl": 46, "avx512_bw": 46, "avx512_dq": 46, "avx512_ifma": 46, "avx512_vbmi": 46, "avx512_vpopcntdq": 46, "avx512_4fmap": 46, "avx512_4vnniw": 46, "avx512_vbmi2": 46, "avx512_vpclmul": 46, "avx512_bitalg": 46, "avx512_fp16": 46, "avx512_vp2intersect": 46, "amx_bf16": 46, "amx_til": 46, "amx_int8": 46, "prefetchw": 46, "prefetchwt1": 46, "lamb": [48, 49, 50], "adagrad": [48, 50], "grad": [48, 49], "clr": 48, "lr_decai": 48, "state_sum": 48, "addcmul_": 48, "add_": [48, 49], "addcdiv_": 48, "bottl": [48, 49], "neck": [48, 49], "pseudo": [48, 49, 50], "adagrad_fused_step": 48, "grad0": 48, "grad1": 48, "grad_n": 48, "param_n": 48, "state_sum_n": 48, "adagrad_step": 48, "grad_i": 48, "param_i": 48, "state_sum_i": 48, "other_arg": 48, "adam": 49, "lar": 49, "buf": 49, "momentum_buffer_list": 49, "detach": 49, "mul_": 49, "dampen": 49, "nesterov": 49, "sgd_fused_step": 49, "bottom": 50, "shorter": 50, "fewer": 
50, "shift": 50, "lose": 50, "decim": 50, "1234500000": 50, "0000012345": 50, "1234512345": 50, "sens": 50, "fraction": 50, "12345": 50, "00000": 50, "signific": 50, "bui": 50, "ground": 50, "truth": 50, "chain": 50, "rule": 50, "formula": 50, "\u03b1": 50, "gw": 50, "denot": 50, "earlier": 50, "inaccur": 50, "halv": 50, "recov": 50, "fp32_w": 50, "concat_fp32_from_bf16": 50, "bf16_w": 50, "trail": 50, "fp32_gw": 50, "bf16_gw": 50, "weight_dacai": 50, "split_bf16_from_fp32": 50}, "objects": {"": [[1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4BF32E", "xpu::BF32"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4FP32E", "xpu::FP32"], [1, 1, 1, "_CPPv4N3xpu14FP32_MATH_MODEE", "xpu::FP32_MATH_MODE"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4BF32E", "xpu::FP32_MATH_MODE::BF32"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4FP32E", "xpu::FP32_MATH_MODE::FP32"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MAXE", "xpu::FP32_MATH_MODE::FP32_MATH_MODE_MAX"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4TF32E", "xpu::FP32_MATH_MODE::TF32"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MAXE", "xpu::FP32_MATH_MODE_MAX"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4TF32E", "xpu::TF32"], [1, 2, 1, "_CPPv4N3xpu21get_queue_from_streamEN3c106StreamE", "xpu::get_queue_from_stream"], [1, 3, 1, "_CPPv4N3xpu21get_queue_from_streamEN3c106StreamE", "xpu::get_queue_from_stream::stream"], [1, 2, 1, "_CPPv4N3xpu18set_fp32_math_modeE14FP32_MATH_MODE", "xpu::set_fp32_math_mode"], [1, 3, 1, "_CPPv4N3xpu18set_fp32_math_modeE14FP32_MATH_MODE", "xpu::set_fp32_math_mode::mode"]], "intel_extension_for_pytorch.cpu": [[1, 4, 0, "-", "runtime"]], "intel_extension_for_pytorch.cpu.runtime": [[1, 5, 1, "", "CPUPool"], [1, 5, 1, "", "MultiStreamModule"], [1, 5, 1, "", "MultiStreamModuleHint"], [1, 5, 1, "", "Task"], [1, 6, 1, "", "get_core_list_of_node_id"], [1, 6, 1, "", "is_runtime_ext_enabled"], [1, 5, 1, "", "pin"]], "intel_extension_for_pytorch": [[1, 6, 1, "", "enable_onednn_fusion"], [1, 6, 1, "", 
"get_fp32_math_mode"], [1, 6, 1, "", "optimize"], [1, 6, 1, "", "optimize_transformers"], [1, 4, 0, "-", "quantization"], [1, 6, 1, "", "set_fp32_math_mode"], [1, 5, 1, "", "verbose"]], "intel_extension_for_pytorch.nn": [[6, 5, 1, "", "FrozenBatchNorm2d"]], "intel_extension_for_pytorch.nn.functional": [[6, 6, 1, "", "interaction"]], "intel_extension_for_pytorch.quantization": [[1, 6, 1, "", "_gptq"], [1, 6, 1, "", "autotune"], [1, 6, 1, "", "convert"], [1, 6, 1, "", "prepare"]], "intel_extension_for_pytorch.xpu": [[1, 5, 1, "", "Event"], [1, 5, 1, "", "Stream"], [1, 6, 1, "", "current_device"], [1, 6, 1, "", "current_stream"], [1, 5, 1, "", "device"], [1, 6, 1, "", "device_count"], [1, 5, 1, "", "device_of"], [1, 6, 1, "", "empty_cache"], [1, 6, 1, "", "get_device_name"], [1, 6, 1, "", "get_device_properties"], [1, 6, 1, "", "get_rng_state"], [1, 6, 1, "", "get_rng_state_all"], [1, 6, 1, "", "init"], [1, 6, 1, "", "initial_seed"], [1, 6, 1, "", "is_available"], [1, 6, 1, "", "is_initialized"], [1, 6, 1, "", "manual_seed"], [1, 6, 1, "", "manual_seed_all"], [1, 6, 1, "", "max_memory_allocated"], [1, 6, 1, "", "max_memory_reserved"], [1, 6, 1, "", "memory_allocated"], [1, 6, 1, "", "memory_reserved"], [1, 6, 1, "", "memory_snapshot"], [1, 6, 1, "", "memory_stats"], [1, 6, 1, "", "memory_stats_as_nested_dict"], [1, 6, 1, "", "memory_summary"], [1, 6, 1, "", "reset_accumulated_memory_stats"], [1, 6, 1, "", "reset_peak_memory_stats"], [1, 6, 1, "", "seed"], [1, 6, 1, "", "seed_all"], [1, 6, 1, "", "set_device"], [1, 6, 1, "", "set_rng_state"], [1, 6, 1, "", "set_rng_state_all"], [1, 6, 1, "", "stream"], [1, 6, 1, "", "synchronize"]], "intel_extension_for_pytorch.xpu.Event": [[1, 7, 1, "", "elapsed_time"], [1, 7, 1, "", "query"], [1, 7, 1, "", "record"], [1, 7, 1, "", "synchronize"], [1, 7, 1, "", "wait"]], "intel_extension_for_pytorch.xpu.Stream": [[1, 7, 1, "", "record_event"], [1, 8, 1, "", "sycl_queue"], [1, 7, 1, "", "synchronize"], [1, 7, 1, "", "wait_event"], [1, 
7, 1, "", "wait_stream"]], "intel_extension_for_pytorch.xpu.fp8.fp8": [[1, 6, 1, "", "fp8_autocast"]]}, "objtypes": {"0": "cpp:enumerator", "1": "cpp:enum", "2": "cpp:function", "3": "cpp:functionParam", "4": "py:module", "5": "py:class", "6": "py:function", "7": "py:method", "8": "py:property"}, "objnames": {"0": ["cpp", "enumerator", "C++ enumerator"], "1": ["cpp", "enum", "C++ enum"], "2": ["cpp", "function", "C++ function"], "3": ["cpp", "functionParam", "C++ function parameter"], "4": ["py", "module", "Python module"], "5": ["py", "class", "Python class"], "6": ["py", "function", "Python function"], "7": ["py", "method", "Python method"], "8": ["py", "property", "Python property"]}, "titleterms": {"intel": [0, 4, 5, 7, 22, 23, 39, 40, 41], "extens": [0, 4, 6, 7, 9, 22, 23, 28, 38, 40], "pytorch": [0, 4, 7, 19, 22, 23, 25, 40], "architectur": 0, "support": [0, 6, 12, 13, 15, 17, 21, 25, 26], "api": [1, 6, 7, 14, 24, 25, 33, 36, 45], "document": [1, 4, 33, 40, 41], "devic": [1, 6, 26], "agnost": [1, 6], "gpu": [1, 6, 7, 10, 13, 17, 21, 23, 30, 38, 42, 43, 49], "specif": [1, 6, 12, 13, 38], "miscellan": 1, "random": 1, "number": [1, 39, 41], "gener": [1, 38], "stream": [1, 9], "event": 1, "memori": [1, 25, 39, 41, 43, 47], "manag": [1, 43, 47], "c": [1, 5, 25], "cpu": [1, 6, 12, 22, 24, 25, 38, 41, 43, 46, 48], "quantiz": [1, 6, 17, 21, 22, 23, 36], "runtim": [1, 6, 7, 11, 28, 38], "blog": 2, "public": 2, "cheat": 3, "sheet": 3, "contribut": 4, "develop": 4, "xpu": [4, 5, 25, 26, 42], "tip": 4, "debug": [4, 6, 16], "unit": [4, 38], "test": [4, 38], "better": 4, "local": 4, "pytest": 4, "write": [4, 9, 25], "build": [4, 9, 11, 26, 27, 46], "exampl": [5, 7, 8, 9, 10, 15, 17, 18, 20, 21, 24, 28, 39, 46], "python": [5, 6], "train": [5, 6, 12, 13, 30, 38], "singl": [5, 7, 39], "instanc": [5, 39], "float32": [5, 12, 13, 38], "bfloat16": [5, 12, 13, 38, 50], "infer": [5, 12, 13, 21, 35, 36, 39, 40], "imper": [5, 12, 13, 23, 36], "mode": [5, 17, 21, 23, 36, 39], 
"resnet50": [5, 40], "bert": [5, 40], "torchscript": [5, 12, 13, 23, 36], "float16": [5, 13], "int8": [5, 24, 38, 40, 45], "torch": [5, 30], "optim": [5, 6, 15, 22, 23, 35, 36, 43, 45, 48, 49], "basic": [5, 28], "usag": [5, 7, 10, 15, 17, 18, 19, 20, 21, 24, 28, 36, 38, 39], "us": [5, 6, 8, 9, 12, 13, 14, 15, 16, 26, 27, 28, 29, 39, 44, 45], "sycl": [5, 9], "code": 5, "custom": 5, "dpc": [5, 6, 9], "kernel": [5, 25], "ai": 5, "refer": [5, 12, 13], "model": [5, 22, 25, 26, 27, 28, 29, 35, 40, 45], "featur": [6, 16, 18, 46], "easi": 6, "channel": [6, 14, 25, 41], "last": [6, 14, 25, 41], "auto": [6, 12, 13, 14, 28], "mix": [6, 12, 13], "precis": [6, 12, 13, 35], "amp": [6, 12, 13], "distribut": [6, 35, 36], "dlpack": [6, 8], "solut": [6, 8], "advanc": [6, 11], "configur": [6, 11, 28, 41], "fulli": [6, 10], "shard": [6, 10], "data": [6, 8, 10, 17, 21, 35], "parallel": [6, 10], "fsdp": [6, 10], "inductor": 6, "legaci": [6, 27], "profil": [6, 26, 27], "tool": [6, 21, 26, 27, 29], "experiment": [6, 15, 16, 17, 18, 19, 20, 21, 24, 26, 27, 29], "simpl": [6, 29], "trace": [6, 15, 26, 27, 29], "kineto": [6, 26], "comput": [6, 16], "engin": [6, 16], "oper": [6, 16, 17, 21, 25, 35, 48, 49], "codeless": [6, 15], "new": 6, "1": [6, 20, 40, 42], "13": [6, 42], "graph": [6, 18, 43, 45], "captur": [6, 18], "0": [6, 42], "hypertun": [6, 20], "distributeddataparallel": 7, "ddp": 7, "introduct": [7, 8, 9, 10, 12, 13, 16, 26, 27, 29, 30, 33, 44, 48, 49], "instal": [7, 19, 32, 40], "oneccl": 7, "bind": [7, 28], "from": 7, "sourc": 7, "prebuilt": 7, "wheel": 7, "dynam": [7, 22, 38, 43, 46], "link": 7, "mpi": 7, "launch": [7, 15, 39], "node": [7, 39], "scale": [7, 40], "onli": [7, 10, 21, 36], "case": [8, 12, 13, 15, 16, 26, 27, 28, 29, 44], "design": [8, 28, 39], "import": 8, "capsul": 8, "export": [8, 26, 27, 40], "dldevic": 8, "pointer": 8, "asynchron": [8, 28], "program": 8, "motiv": [9, 15], "setuptool": 9, "jit": [9, 15], "compil": [9, 30, 43, 44, 46], "cmake": 9, "request": 9, 
"current": 9, "c10": 9, "fetch": 9, "correspond": 9, "queue": 9, "op": [9, 12, 13], "accessor": 9, "time": [11, 43, 44], "default": [12, 13, 14, 20, 25, 39], "path": [12, 13], "autocast": [12, 13], "elig": [12, 13], "behavior": [12, 13], "can": [12, 13], "promot": [12, 13], "widest": [12, 13], "input": [12, 13, 28], "type": [12, 13, 17, 21, 35], "eas": [14, 45], "enabl": [14, 29], "disabl": [14, 26, 27, 29], "known": [14, 28, 42], "issu": [14, 28, 38, 42], "huggingfac": 15, "The": 15, "origin": 15, "command": 15, "ipex": 15, "appli": 15, "fp32": [15, 45], "bf16": [15, 45], "modul": [15, 28], "forward": 15, "method": 15, "explicitli": 15, "instead": 15, "__call__": 15, "attr": 15, "alreadi": 15, "select": [16, 46], "polici": [16, 35], "multipl": [16, 39], "implement": [16, 28], "float8": 17, "fp8": 17, "run": [17, 21], "descript": 18, "horovod": 19, "your_conf_fil": 20, "hyperparamet": 20, "launcher": [20, 40], "defin": [20, 22], "search": 20, "space": 20, "tune": [20, 24, 37, 41], "2": [20, 40, 42], "user": 20, "your_python_script": 20, "int4": 21, "weight": [21, 36], "static": 22, "qconfig": 22, "prepar": 22, "do": 22, "calibr": 22, "convert": 22, "deploi": [22, 40], "recip": [24, 28], "what": 25, "i": [25, 28, 39], "format": 25, "all": [25, 39], "That": 25, "matter": 25, "nchw": 25, "b": 25, "nhwc": 25, "block": 25, "nchw16c": 25, "stride": 25, "layout": 25, "tensor": 25, "creation": 25, "convers": 25, "d": 25, "coverag": 25, "regist": [25, 40], "aten": 25, "nativ": 25, "manner": 25, "onednn": [25, 41], "creat": [25, 40], "convolut": 25, "primit": [25, 41], "1d": 25, "determin": 25, "set": [26, 28], "environ": 26, "variabl": 26, "add": 26, "Into": 26, "script": [26, 27, 39], "partli": 26, "backend": 26, "multi": [26, 40], "applic": 26, "result": [26, 27, 29, 38], "chrome": [26, 27], "requir": [28, 44, 46], "multistream": 28, "examples1": 28, "examples2": 28, "examples3": 28, "structur": [28, 41], "output": 28, "perform": [28, 37, 40, 41], "task": 28, "core": [28, 
39, 40], "detail": [28, 43], "how": 28, "iomp": 28, "preload": 28, "load": 28, "dure": 28, "inferenec": 30, "quick": 31, "start": [31, 33, 40], "execut": 31, "get": 33, "licens": 34, "larg": 35, "languag": 35, "llm": 35, "overview": [35, 39, 41, 46], "methodologi": [35, 45], "linear": 35, "deep": 35, "fusion": [35, 45, 48, 49], "segment": 35, "kv": 35, "cach": [35, 41], "low": 35, "transform": 36, "frontend": 36, "pseudocod": 36, "common": 36, "scenario": 36, "fp16": 36, "smoothquant": 36, "woq": 36, "deepspe": 36, "guid": [37, 39, 41], "troubleshoot": 38, "librari": [38, 39], "depend": 38, "torchdynamo": 38, "shape": 38, "correct": 38, "physic": 39, "ii": 39, "includ": 39, "logic": 39, "iii": 39, "iv": 39, "your": 39, "v": 39, "throughput": 39, "vi": 39, "latenc": 39, "vii": 39, "viii": 39, "index": 39, "jemalloc": [39, 41], "tcmalloc": [39, 41], "alloc": [39, 41], "openmp": [39, 41], "gnu": [39, 41], "torchserv": 40, "content": [40, 41], "thi": [40, 41], "serv": 40, "pin": 40, "boost": 40, "worker": 40, "serial": 40, "file": 40, "archiv": 40, "3": 40, "4": 40, "benchmark": 40, "hardwar": 41, "non": 41, "uniform": 41, "access": 41, "numa": 41, "softwar": 41, "numactl": 41, "omp_num_thread": 41, "denorm": 41, "releas": 42, "10": 42, "highlight": 42, "110": 42, "120": 42, "200": 42, "technic": 43, "isa": [43, 46], "dispatch": [43, 46], "ahead": [43, 44], "aot": [43, 44], "pattern": 45, "fold": 45, "level": 46, "check": 46, "split": 50, "sgd": 50, "stochast": 50, "gradient": 50, "descent": 50}, "envversion": {"sphinx.domains.c": 3, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 9, "sphinx.domains.index": 1, "sphinx.domains.javascript": 3, "sphinx.domains.math": 2, "sphinx.domains.python": 4, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx": 58}, "alltitles": {"Intel\u00ae Extension for PyTorch*": [[0, "intel-extension-for-pytorch"]], "Architecture": [[0, "architecture"]], "Support": [[0, "support"]], "API Documentation": 
[[1, "api-documentation"], [33, "api-documentation"]], "Device-Agnostic": [[1, "device-agnostic"], [6, "device-agnostic"]], "GPU-Specific": [[1, "gpu-specific"], [6, "gpu-specific"]], "Miscellaneous": [[1, "miscellaneous"], [1, "id1"]], "Random Number Generator": [[1, "random-number-generator"]], "Streams and events": [[1, "streams-and-events"]], "Memory management": [[1, "memory-management"]], "C++ API": [[1, "c-api"]], "CPU-Specific": [[1, "cpu-specific"], [6, "cpu-specific"]], "Quantization": [[1, "module-intel_extension_for_pytorch.quantization"], [6, "quantization"]], "CPU Runtime": [[1, "module-intel_extension_for_pytorch.cpu.runtime"]], "Blogs & Publications": [[2, "blogs-publications"]], "Cheat Sheet": [[3, "cheat-sheet"]], "Contribution": [[4, "contribution"]], "Contributing to Intel\u00ae Extension for PyTorch*": [[4, "contributing-to-intel-extension-for-pytorch"]], "Developing Intel\u00ae Extension for PyTorch* on XPU": [[4, "developing-intel-extension-for-pytorch-on-xpu"]], "Tips and Debugging": [[4, "tips-and-debugging"]], "Unit testing": [[4, "unit-testing"]], "Better local unit tests with pytest": [[4, "better-local-unit-tests-with-pytest"]], "Writing documentation": [[4, "writing-documentation"]], "Building documentation": [[4, "building-documentation"]], "Tips": [[4, "tips"]], "Examples": [[5, "examples"]], "Python": [[5, "python"]], "Training": [[5, "training"]], "Single-Instance Training": [[5, "single-instance-training"]], "Float32": [[5, "float32"], [5, "id1"]], "BFloat16": [[5, "bfloat16"], [5, "id4"], [38, "bfloat16"], [50, "bfloat16"]], "Inference": [[5, "inference"]], "Imperative Mode": [[5, "imperative-mode"], [5, "id5"], [5, "id11"], [23, "imperative-mode"]], "Resnet50": [[5, "resnet50"], [5, "id2"], [5, "id6"], [5, "id9"], [5, "id12"], [5, "id15"]], "BERT": [[5, "bert"], [5, "id3"], [5, "id7"], [5, "id10"], [5, "id13"], [5, "id16"], [40, "bert"]], "TorchScript Mode": [[5, "torchscript-mode"], [5, "id8"], [5, "id14"], [23, 
"torchscript-mode"], [36, "torchscript-mode"]], "Float16": [[5, "float16"]], "INT8": [[5, "int8"], [38, "int8"]], "torch.xpu.optimize": [[5, "torch-xpu-optimize"]], "C++": [[5, "c"]], "Basic Usage": [[5, "basic-usage"]], "Use SYCL code": [[5, "use-sycl-code"]], "Customize DPC++ kernels": [[5, "customize-dpc-kernels"]], "Intel\u00ae AI Reference Models": [[5, "intel-ai-reference-models"]], "Features": [[6, "features"]], "Easy-to-use Python API": [[6, "easy-to-use-python-api"]], "Channels Last": [[6, "channels-last"], [25, "channels-last"], [41, "channels-last"]], "Auto Mixed Precision (AMP)": [[6, "auto-mixed-precision-amp"]], "Distributed Training": [[6, "distributed-training"]], "DLPack Solution": [[6, "dlpack-solution"], [8, "dlpack-solution"]], "DPC++ Extension": [[6, "dpc-extension"], [9, "dpc-extension"]], "Advanced Configuration": [[6, "advanced-configuration"], [11, "advanced-configuration"]], "Fully Sharded Data Parallel (FSDP)": [[6, "fully-sharded-data-parallel-fsdp"], [10, "fully-sharded-data-parallel-fsdp"]], "Inductor": [[6, "inductor"]], "Legacy Profiler Tool (Experimental)": [[6, "legacy-profiler-tool-experimental"], [27, "legacy-profiler-tool-experimental"]], "Simple Trace Tool (Experimental)": [[6, "simple-trace-tool-experimental"], [29, "simple-trace-tool-experimental"]], "Kineto Supported Profiler Tool (Experimental)": [[6, "kineto-supported-profiler-tool-experimental"], [26, "kineto-supported-profiler-tool-experimental"]], "Compute Engine (Experimental feature for debug)": [[6, "compute-engine-experimental-feature-for-debug"], [16, "compute-engine-experimental-feature-for-debug"]], "Operator Optimization": [[6, "operator-optimization"]], "Runtime Extension": [[6, "runtime-extension"], [28, "runtime-extension"], [38, "runtime-extension"]], "Codeless Optimization (Experimental, NEW feature in 1.13.*)": [[6, "codeless-optimization-experimental-new-feature-in-1-13"]], "Graph Capture (Experimental, NEW feature in 1.13.0*)": [[6, 
"graph-capture-experimental-new-feature-in-1-13-0"]], "HyperTune (Experimental, NEW feature in 1.13.0*)": [[6, "hypertune-experimental-new-feature-in-1-13-0"]], "DistributedDataParallel (DDP)": [[7, "distributeddataparallel-ddp"]], "Introduction": [[7, "introduction"], [8, "introduction"], [9, "introduction"], [10, "introduction"], [12, "introduction"], [13, "introduction"], [16, "introduction"], [26, "introduction"], [27, "introduction"], [29, "introduction"], [30, "introduction"], [33, "introduction"], [44, "introduction"], [48, "introduction"], [49, "introduction"]], "Installation of Intel\u00ae oneCCL Bindings for Pytorch*": [[7, "installation-of-intel-oneccl-bindings-for-pytorch"]], "Install PyTorch and Intel\u00ae Extension for PyTorch*": [[7, "install-pytorch-and-intel-extension-for-pytorch"]], "Install Intel\u00ae oneCCL Bindings for Pytorch*": [[7, "install-intel-oneccl-bindings-for-pytorch"]], "Install from source:": [[7, "install-from-source"]], "Install from prebuilt wheel:": [[7, "install-from-prebuilt-wheel"]], "Runtime Dynamic Linking": [[7, "runtime-dynamic-linking"]], "DDP Usage": [[7, "ddp-usage"]], "Example Usage (MPI launch for single node):": [[7, "example-usage-mpi-launch-for-single-node"]], "DDP scaling API (GPU Only)": [[7, "ddp-scaling-api-gpu-only"]], "Usage of DDP scaling API": [[7, "usage-of-ddp-scaling-api"]], "Use Case": [[8, "use-case"], [12, "use-case"], [13, "use-case"], [16, "use-case"], [26, "use-case"], [27, "use-case"], [29, "use-case"]], "Design": [[8, "design"]], "Import DLPack Capsule": [[8, "import-dlpack-capsule"]], "Export DLPack Capsule": [[8, "export-dlpack-capsule"]], "DLDevice and data pointer": [[8, "dldevice-and-data-pointer"]], "Asynchronous Programming": [[8, "asynchronous-programming"]], "Example Case": [[8, "example-case"]], "Motivation and Example": [[9, "motivation-and-example"]], "Writing a DPC++ Extension": [[9, "writing-a-dpc-extension"]], "Building with setuptools": [[9, "building-with-setuptools"]], "JIT 
Compiling Extensions": [[9, "jit-compiling-extensions"]], "Building with CMake": [[9, "building-with-cmake"]], "Requesting the current c10::Stream": [[9, "requesting-the-current-c10-stream"]], "Fetching the corresponding sycl::queue": [[9, "fetching-the-corresponding-sycl-queue"]], "Writing the DPC++ Op": [[9, "writing-the-dpc-op"]], "Using accessors": [[9, "using-accessors"]], "FSDP Usage (GPU only)": [[10, "fsdp-usage-gpu-only"]], "Example": [[10, "example"]], "Build Time Configuration": [[11, "build-time-configuration"]], "Runtime Configuration": [[11, "runtime-configuration"]], "Auto Mixed Precision (AMP) on CPU": [[12, "auto-mixed-precision-amp-on-cpu"]], "Default Precision": [[12, "default-precision"], [13, "default-precision"]], "Inference with Imperative Path": [[12, "inference-with-imperative-path"], [13, "inference-with-imperative-path"]], "Inference with TorchScript Path": [[12, "inference-with-torchscript-path"], [13, "inference-with-torchscript-path"]], "Training Support": [[12, "training-support"], [13, "training-support"]], "Autocast Op Reference": [[12, "autocast-op-reference"], [13, "autocast-op-reference"]], "Op Eligibility": [[12, "op-eligibility"], [13, "op-eligibility"]], "Op-Specific Behavior": [[12, "op-specific-behavior"], [13, "op-specific-behavior"]], "Ops that can autocast to bfloat16": [[12, "ops-that-can-autocast-to-bfloat16"], [13, "ops-that-can-autocast-to-bfloat16"]], "Ops that can autocast to float32": [[12, "ops-that-can-autocast-to-float32"], [13, "ops-that-can-autocast-to-float32"]], "Ops that promote to the widest input type": [[12, "ops-that-promote-to-the-widest-input-type"], [13, "ops-that-promote-to-the-widest-input-type"]], "Auto Mixed Precision (AMP) on GPU": [[13, "auto-mixed-precision-amp-on-gpu"]], "Ops that can autocast to float16": [[13, "ops-that-can-autocast-to-float16"]], "Auto Channels Last": [[14, "auto-channels-last"]], "Ease-of-use auto channels last API": [[14, "ease-of-use-auto-channels-last-api"]], 
"default": [[14, "default"]], "enable": [[14, "enable"]], "disable": [[14, "disable"]], "Known issue": [[14, "known-issue"]], "Codeless Optimization (Experimental)": [[15, "codeless-optimization-experimental"]], "Motivation": [[15, "motivation"]], "Example Usage with HuggingFace": [[15, "example-usage-with-huggingface"]], "The origin command with ipex launch": [[15, "the-origin-command-with-ipex-launch"]], "Command to apply ipex optimization for FP32": [[15, "command-to-apply-ipex-optimization-for-fp32"]], "Command to apply ipex optimization for BF16": [[15, "command-to-apply-ipex-optimization-for-bf16"]], "Use Case not supported": [[15, "use-case-not-supported"]], "Module uses forward method explicitly instead of the __call__ attr": [[15, "module-uses-forward-method-explicitly-instead-of-the-call-attr"]], "Already using ipex.optimize": [[15, "already-using-ipex-optimize"]], "Already using Jit Trace": [[15, "already-using-jit-trace"]], "Engine Selection Policy": [[16, "engine-selection-policy"]], "Multiple Implementations Operators and Engines": [[16, "multiple-implementations-operators-and-engines"]], "Float8 Data Type Support [GPU] (Experimental)": [[17, "float8-data-type-support-gpu-experimental"]], "Float8 Data Type": [[17, "float8-data-type"]], "FP8 Quantization": [[17, "fp8-quantization"]], "Supported running mode": [[17, "supported-running-mode"], [21, "supported-running-mode"]], "Supported operators": [[17, "supported-operators"], [21, "supported-operators"]], "FP8 usage example": [[17, "fp8-usage-example"]], "Graph Capture (Experimental)": [[18, "graph-capture-experimental"]], "Feature Description": [[18, "feature-description"]], "Usage Example": [[18, "usage-example"], [24, "usage-example"]], "Horovod with PyTorch (Experimental)": [[19, "horovod-with-pytorch-experimental"]], "Install Horovod with PyTorch": [[19, "install-horovod-with-pytorch"]], "Horovod with PyTorch Usage": [[19, "horovod-with-pytorch-usage"]], "HyperTune (Experimental)": [[20, 
"hypertune-experimental"]], "Usage of Hypertune": [[20, "usage-of-hypertune"]], "your_conf_file": [[20, "your-conf-file"]], "Hyperparameters": [[20, "hyperparameters"]], "Launcher Hyperparameters": [[20, "launcher-hyperparameters"]], "Defining hyperparameters and their search spaces": [[20, "defining-hyperparameters-and-their-search-spaces"]], "1. Defining hyperparameters to tune:": [[20, "defining-hyperparameters-to-tune"]], "2. Defining the search spaces of the hyperparameters:": [[20, "defining-the-search-spaces-of-the-hyperparameters"]], "Default search space": [[20, "default-search-space"]], "User defined search space": [[20, "user-defined-search-space"]], "": [[20, "your-python-script"]], "Usage Examples": [[20, "usage-examples"], [39, "usage-examples"]], "INT4 inference [GPU] (Experimental)": [[21, "int4-inference-gpu-experimental"]], "INT4 Data Type": [[21, "int4-data-type"]], "INT4 Quantization": [[21, "int4-quantization"]], "INT4 usage example": [[21, "int4-usage-example"]], "Weight Only Quantization Tool": [[21, "weight-only-quantization-tool"]], "Intel\u00ae Extension for PyTorch* optimizations for quantization [CPU]": [[22, "intel-extension-for-pytorch-optimizations-for-quantization-cpu"]], "Static Quantization": [[22, "static-quantization"]], "Define qconfig": [[22, "define-qconfig"]], "Prepare Model and Do Calibration": [[22, "prepare-model-and-do-calibration"]], "Convert to Static Quantized Model and Deploy": [[22, "convert-to-static-quantized-model-and-deploy"]], "Dynamic Quantization": [[22, "dynamic-quantization"]], "Define QConfig": [[22, "id1"]], "Prepare Model": [[22, "prepare-model"]], "Convert to Dynamic Quantized Model and Deploy": [[22, "convert-to-dynamic-quantized-model-and-deploy"]], "Intel\u00ae Extension for PyTorch* Optimizations for Quantization [GPU]": [[23, "intel-extension-for-pytorch-optimizations-for-quantization-gpu"]], "INT8 Recipe Tuning API (Experimental) [CPU]": [[24, "int8-recipe-tuning-api-experimental-cpu"]], "What is 
Channels Last": [[25, "what-is-channels-last"]], "Memory Format Is All That Matters": [[25, "memory-format-is-all-that-matters"]], "a. NCHW (default)": [[25, "a-nchw-default"]], "b. NHWC": [[25, "b-nhwc"]], "c. Blocked (nChw16c, on CPU)": [[25, "c-blocked-nchw16c-on-cpu"]], "PyTorch Strided Layout": [[25, "pytorch-strided-layout"]], "Channels Last Memory Format APIs": [[25, "channels-last-memory-format-apis"]], "a. tensor creation": [[25, "a-tensor-creation"]], "b. tensor conversion": [[25, "b-tensor-conversion"]], "c. model conversion": [[25, "c-model-conversion"]], "d. operator coverage in PyTorch": [[25, "d-operator-coverage-in-pytorch"]], "Writing Channels Last Kernels on CPU": [[25, "writing-channels-last-kernels-on-cpu"]], "a. Register Channels Last Kernel in ATen Native Manner": [[25, "a-register-channels-last-kernel-in-aten-native-manner"]], "b. Register oneDNN Kernel on Channels Last": [[25, "b-register-onednn-kernel-on-channels-last"]], "oneDNN NHWC APIs": [[25, "onednn-nhwc-apis"]], "a. Create NHWC Memory": [[25, "a-create-nhwc-memory"]], "b. Create Convolution Primitive": [[25, "b-create-convolution-primitive"]], "Channels Last 1D support on XPU": [[25, "channels-last-1d-support-on-xpu"]], "a. tensor conversion with Channels Last 1D": [[25, "a-tensor-conversion-with-channels-last-1d"]], "b. model conversion with Channels Last 1D": [[25, "b-model-conversion-with-channels-last-1d"]], "c. 
determine if in Channels Last 1D memory format": [[25, "c-determine-if-in-channels-last-1d-memory-format"]], "Build Tool": [[26, "build-tool"], [27, "build-tool"]], "Use Tool": [[26, "use-tool"], [27, "use-tool"]], "Set Environment Variable": [[26, "set-environment-variable"]], "Add Profiler Into Script": [[26, "add-profiler-into-script"]], "Disable Tool in Model Script": [[26, "disable-tool-in-model-script"], [27, "disable-tool-in-model-script"]], "Disable Tool Partly for XPU Backend": [[26, "disable-tool-partly-for-xpu-backend"]], "Profile on Multi-device Application": [[26, "profile-on-multi-device-application"]], "Result": [[26, "result"]], "Export to Chrome Trace": [[26, "export-to-chrome-trace"], [27, "export-to-chrome-trace"]], "Results": [[27, "results"], [29, "results"]], "Requirements": [[28, "requirements"]], "Use Cases": [[28, "use-cases"]], "Example of MultiStream Module": [[28, "example-of-multistream-module"]], "Examples1: Basic Usage": [[28, "examples1-basic-usage"]], "Examples2: Usage with \u201cAUTO\u201d setting": [[28, "examples2-usage-with-auto-setting"]], "Examples3: Usage for models with structure inputs/outputs": [[28, "examples3-usage-for-models-with-structure-inputs-outputs"]], "Performance recipes": [[28, "performance-recipes"]], "Known issues": [[28, "known-issues"]], "Example of asynchronous task": [[28, "example-of-asynchronous-task"]], "Example of configuring core binding": [[28, "example-of-configuring-core-binding"]], "Detail Design": [[28, "detail-design"]], "How the core binding is implemented": [[28, "how-the-core-binding-is-implemented"]], "Design of Task": [[28, "design-of-task"]], "IOMP preload or load during the runtime": [[28, "iomp-preload-or-load-during-the-runtime"]], "Enable and Disable Tool": [[29, "enable-and-disable-tool"]], "Use Simple Trace in Model": [[29, "use-simple-trace-in-model"]], "torch.compile for GPU": [[30, "torch-compile-for-gpu"]], "Inferenece with torch.compile": [[30, 
"inferenece-with-torch-compile"]], "Training with torch.compile": [[30, "training-with-torch-compile"]], "Quick Start": [[31, "quick-start"]], "Execution": [[31, "execution"]], "Installation": [[32, "installation"]], "Get Started": [[33, "get-started"]], "License": [[34, "license"]], "Large Language Models (LLM) Optimizations Overview": [[35, "large-language-models-llm-optimizations-overview"]], "Optimized Models": [[35, "optimized-models"]], "Optimization Methodologies": [[35, "optimization-methodologies"]], "Linear Operator Optimization": [[35, "linear-operator-optimization"]], "Deep Fusion Policy": [[35, "deep-fusion-policy"]], "Segment KV Cache": [[35, "segment-kv-cache"]], "Distributed Inference": [[35, "distributed-inference"]], "Low Precision Data Types": [[35, "low-precision-data-types"]], "Transformers Optimization Frontend API": [[36, "transformers-optimization-frontend-api"]], "Pseudocode of Common Usage Scenarios": [[36, "pseudocode-of-common-usage-scenarios"]], "FP16": [[36, "fp16"]], "SmoothQuant": [[36, "smoothquant"]], "Imperative mode": [[36, "imperative-mode"]], "Weight Only Quantization (WOQ)": [[36, "weight-only-quantization-woq"]], "Distributed Inference with DeepSpeed": [[36, "distributed-inference-with-deepspeed"]], "Performance Tuning Guide": [[37, "performance-tuning-guide"], [41, "performance-tuning-guide"]], "Troubleshooting": [[38, "troubleshooting"]], "GPU-specific Issues": [[38, "gpu-specific-issues"]], "General Usage": [[38, "general-usage"], [38, "id1"]], "Library Dependencies": [[38, "library-dependencies"]], "Unit Test": [[38, "unit-test"]], "CPU-specific issues": [[38, "cpu-specific-issues"]], "TorchDynamo": [[38, "torchdynamo"]], "Dynamic Shape": [[38, "dynamic-shape"]], "Result Correctness": [[38, "result-correctness"]], "Float32 Training": [[38, "float32-training"]], "Launch Script Usage Guide": [[39, "launch-script-usage-guide"]], "Overview": [[39, "overview"], [41, "overview"], [46, "overview"]], "Usage of launch script": 
[[39, "usage-of-launch-script"]], "Single instance for inference": [[39, "single-instance-for-inference"]], "I. Use all physical cores": [[39, "i-use-all-physical-cores"]], "II. Use all cores including logical cores": [[39, "ii-use-all-cores-including-logical-cores"]], "III. Use physical cores on designated nodes": [[39, "iii-use-physical-cores-on-designated-nodes"]], "IV. Use your designated number of cores": [[39, "iv-use-your-designated-number-of-cores"]], "Multiple instances for inference": [[39, "multiple-instances-for-inference"]], "V. Throughput mode": [[39, "v-throughput-mode"]], "VI. Latency mode": [[39, "vi-latency-mode"]], "VII. Your designated number of instances": [[39, "vii-your-designated-number-of-instances"]], "VIII. Your designated number of instances and instance index": [[39, "viii-your-designated-number-of-instances-and-instance-index"]], "Usage of Jemalloc/TCMalloc/Default memory allocator": [[39, "usage-of-jemalloc-tcmalloc-default-memory-allocator"]], "Jemalloc": [[39, "jemalloc"], [41, "jemalloc"]], "TCMalloc": [[39, "tcmalloc"], [41, "tcmalloc"]], "Default memory allocator": [[39, "default-memory-allocator"]], "Usage of OpenMP library": [[39, "usage-of-openmp-library"]], "Intel OpenMP Library": [[39, "intel-openmp-library"]], "GNU OpenMP Library": [[39, "gnu-openmp-library"]], "TorchServe with Intel\u00ae Extension for PyTorch*": [[40, "torchserve-with-intel-extension-for-pytorch"]], "Contents of this Document": [[40, "contents-of-this-document"], [41, "contents-of-this-document"]], "Install Intel\u00ae Extension for PyTorch*": [[40, "install-intel-extension-for-pytorch"]], "Serving model with Intel\u00ae Extension for PyTorch*": [[40, "serving-model-with-intel-extension-for-pytorch"]], "TorchServe with Launcher": [[40, "torchserve-with-launcher"]], "Launcher Core Pinning to Boost Performance of TorchServe Multi Worker Inference": [[40, "launcher-core-pinning-to-boost-performance-of-torchserve-multi-worker-inference"]], "Scaling workers": 
[[40, "scaling-workers"]], "Creating and Exporting INT8 model for Intel\u00ae Extension for PyTorch*": [[40, "creating-and-exporting-int8-model-for-intel-extension-for-pytorch"]], "1. Creating a serialized file": [[40, "creating-a-serialized-file"]], "ResNet50": [[40, "resnet50"]], "2. Creating a Model Archive": [[40, "creating-a-model-archive"]], "3. Start TorchServe to serve the model": [[40, "start-torchserve-to-serve-the-model"]], "4. Registering and Deploying model": [[40, "registering-and-deploying-model"]], "Benchmarking with Launcher": [[40, "benchmarking-with-launcher"]], "Benchmarking with Launcher Core Pinning": [[40, "benchmarking-with-launcher-core-pinning"]], "Performance Boost with Intel\u00ae Extension for PyTorch* and Launcher": [[40, "performance-boost-with-intel-extension-for-pytorch-and-launcher"]], "Hardware Configuration": [[41, "hardware-configuration"]], "Intel CPU Structure": [[41, "intel-cpu-structure"]], "Non-Uniform Memory Access (NUMA)": [[41, "non-uniform-memory-access-numa"]], "Software Configuration": [[41, "software-configuration"]], "Numactl": [[41, "numactl"]], "OpenMP": [[41, "openmp"]], "OMP_NUM_THREADS": [[41, "omp-num-threads"]], "GNU OpenMP": [[41, "gnu-openmp"]], "Intel OpenMP": [[41, "intel-openmp"]], "Memory Allocator": [[41, "memory-allocator"]], "Denormal Number": [[41, "denormal-number"]], "OneDNN primitive cache": [[41, "onednn-primitive-cache"]], "Releases": [[42, "releases"]], "2.1.10+xpu": [[42, "xpu"]], "Highlights": [[42, "highlights"], [42, "id2"], [42, "id5"], [42, "id8"], [42, "id10"]], "Known Issues": [[42, "known-issues"], [42, "id3"], [42, "id6"], [42, "id9"], [42, "id11"]], "2.0.110+xpu": [[42, "id1"]], "1.13.120+xpu": [[42, "id4"]], "1.13.10+xpu": [[42, "id7"]], "1.10.200+gpu": [[42, "gpu"]], "Technical Details": [[43, "technical-details"]], "ISA Dynamic Dispatching [CPU]": [[43, "isa-dynamic-dispatching-cpu"]], "Graph Optimization [CPU]": [[43, "graph-optimization-cpu"]], "Optimizer Optimization [CPU, 
GPU]": [[43, "optimizer-optimization-cpu-gpu"]], "Ahead of Time Compilation (AOT) [GPU]": [[43, "ahead-of-time-compilation-aot-gpu"]], "Memory Management [GPU]": [[43, "memory-management-gpu"]], "Ahead of Time (AOT) Compilation": [[44, "ahead-of-time-aot-compilation"]], "Use case": [[44, "use-case"]], "Requirement": [[44, "requirement"]], "Graph Optimization": [[45, "graph-optimization"]], "Ease-of-use graph optimization API": [[45, "ease-of-use-graph-optimization-api"]], "FP32 and BF16 models": [[45, "fp32-and-bf16-models"]], "INT8 models": [[45, "int8-models"]], "Methodology": [[45, "methodology"]], "Fusion": [[45, "fusion"]], "FP32 and BF16 fusion patterns": [[45, "fp32-and-bf16-fusion-patterns"]], "INT8 fusion patterns": [[45, "int8-fusion-patterns"]], "Folding": [[45, "folding"]], "ISA Dynamic Dispatching": [[46, "isa-dynamic-dispatching"]], "CPU ISA build compiler requirement": [[46, "cpu-isa-build-compiler-requirement"]], "Select ISA Level": [[46, "select-isa-level"]], "Example:": [[46, "example"]], "CPU feature check": [[46, "cpu-feature-check"]], "Memory Management": [[47, "memory-management"]], "Optimizer Fusion on CPU": [[48, "optimizer-fusion-on-cpu"]], "Operation Fusion": [[48, "operation-fusion"], [49, "operation-fusion"]], "Optimizer Fusion on GPU": [[49, "optimizer-fusion-on-gpu"]], "Split SGD": [[50, "split-sgd"], [50, "id2"]], "Stochastic Gradient Descent (SGD)": [[50, "stochastic-gradient-descent-sgd"]]}, "indexentries": {"cpupool (class in intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.CPUPool"]], "event (class in intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.Event"]], "multistreammodule (class in intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModule"]], "multistreammodulehint (class in intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModuleHint"]], "stream (class in 
intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.Stream"]], "task (class in intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.Task"]], "_gptq() (in module intel_extension_for_pytorch.quantization)": [[1, "intel_extension_for_pytorch.quantization._gptq"]], "autotune() (in module intel_extension_for_pytorch.quantization)": [[1, "intel_extension_for_pytorch.quantization.autotune"]], "convert() (in module intel_extension_for_pytorch.quantization)": [[1, "intel_extension_for_pytorch.quantization.convert"]], "current_device() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.current_device"]], "current_stream() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.current_stream"]], "device (class in intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.device"]], "device_count() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.device_count"]], "device_of (class in intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.device_of"]], "elapsed_time() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.elapsed_time"]], "empty_cache() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.empty_cache"]], "enable_onednn_fusion() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.enable_onednn_fusion"]], "fp8_autocast() (in module intel_extension_for_pytorch.xpu.fp8.fp8)": [[1, "intel_extension_for_pytorch.xpu.fp8.fp8.fp8_autocast"]], "get_core_list_of_node_id() (in module intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.get_core_list_of_node_id"]], "get_device_name() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.get_device_name"]], "get_device_properties() (in module intel_extension_for_pytorch.xpu)": [[1, 
"intel_extension_for_pytorch.xpu.get_device_properties"]], "get_fp32_math_mode() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.get_fp32_math_mode"]], "get_rng_state() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.get_rng_state"]], "get_rng_state_all() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.get_rng_state_all"]], "init() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.init"]], "initial_seed() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.initial_seed"]], "intel_extension_for_pytorch.cpu.runtime": [[1, "module-intel_extension_for_pytorch.cpu.runtime"]], "intel_extension_for_pytorch.quantization": [[1, "module-intel_extension_for_pytorch.quantization"]], "is_available() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.is_available"]], "is_initialized() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.is_initialized"]], "is_runtime_ext_enabled() (in module intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.is_runtime_ext_enabled"]], "manual_seed() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.manual_seed"]], "manual_seed_all() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.manual_seed_all"]], "max_memory_allocated() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.max_memory_allocated"]], "max_memory_reserved() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.max_memory_reserved"]], "memory_allocated() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_allocated"]], "memory_reserved() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_reserved"]], 
"memory_snapshot() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_snapshot"]], "memory_stats() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_stats"]], "memory_stats_as_nested_dict() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_stats_as_nested_dict"]], "memory_summary() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_summary"]], "module": [[1, "module-intel_extension_for_pytorch.cpu.runtime"], [1, "module-intel_extension_for_pytorch.quantization"]], "optimize() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.optimize"]], "optimize_transformers() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.optimize_transformers"]], "pin (class in intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.pin"]], "prepare() (in module intel_extension_for_pytorch.quantization)": [[1, "intel_extension_for_pytorch.quantization.prepare"]], "query() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.query"]], "record() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.record"]], "record_event() (intel_extension_for_pytorch.xpu.stream method)": [[1, "intel_extension_for_pytorch.xpu.Stream.record_event"]], "reset_accumulated_memory_stats() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.reset_accumulated_memory_stats"]], "reset_peak_memory_stats() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.reset_peak_memory_stats"]], "seed() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.seed"]], "seed_all() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.seed_all"]], "set_device() (in module 
intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.set_device"]], "set_fp32_math_mode() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.set_fp32_math_mode"]], "set_rng_state() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.set_rng_state"]], "set_rng_state_all() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.set_rng_state_all"]], "stream() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.stream"]], "sycl_queue (intel_extension_for_pytorch.xpu.stream property)": [[1, "intel_extension_for_pytorch.xpu.Stream.sycl_queue"]], "synchronize() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.synchronize"]], "synchronize() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.synchronize"]], "synchronize() (intel_extension_for_pytorch.xpu.stream method)": [[1, "intel_extension_for_pytorch.xpu.Stream.synchronize"]], "verbose (class in intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.verbose"]], "wait() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.wait"]], "wait_event() (intel_extension_for_pytorch.xpu.stream method)": [[1, "intel_extension_for_pytorch.xpu.Stream.wait_event"]], "wait_stream() (intel_extension_for_pytorch.xpu.stream method)": [[1, "intel_extension_for_pytorch.xpu.Stream.wait_stream"]], "xpu::fp32_math_mode (c++ enum)": [[1, "_CPPv4N3xpu14FP32_MATH_MODEE"]], "xpu::fp32_math_mode::bf32 (c++ enumerator)": [[1, "_CPPv4N3xpu14FP32_MATH_MODE4BF32E"]], "xpu::fp32_math_mode::fp32 (c++ enumerator)": [[1, "_CPPv4N3xpu14FP32_MATH_MODE4FP32E"]], "xpu::fp32_math_mode::fp32_math_mode_max (c++ enumerator)": [[1, "_CPPv4N3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MAXE"]], "xpu::fp32_math_mode::tf32 (c++ enumerator)": [[1, "_CPPv4N3xpu14FP32_MATH_MODE4TF32E"]], "xpu::get_queue_from_stream (c++ 
function)": [[1, "_CPPv4N3xpu21get_queue_from_streamEN3c106StreamE"]], "xpu::set_fp32_math_mode (c++ function)": [[1, "_CPPv4N3xpu18set_fp32_math_modeE14FP32_MATH_MODE"]], "frozenbatchnorm2d (class in intel_extension_for_pytorch.nn)": [[6, "intel_extension_for_pytorch.nn.FrozenBatchNorm2d"]], "interaction() (in module intel_extension_for_pytorch.nn.functional)": [[6, "intel_extension_for_pytorch.nn.functional.interaction"]]}})
\ No newline at end of file
+Search.setIndex({"docnames": ["index", "tutorials/api_doc", "tutorials/blogs_publications", "tutorials/cheat_sheet", "tutorials/contribution", "tutorials/examples", "tutorials/features", "tutorials/features/DDP", "tutorials/features/DLPack", "tutorials/features/DPC++_Extension", "tutorials/features/FSDP", "tutorials/features/advanced_configuration", "tutorials/features/amp_cpu", "tutorials/features/amp_gpu", "tutorials/features/auto_channels_last", "tutorials/features/codeless_optimization", "tutorials/features/compute_engine", "tutorials/features/float8", "tutorials/features/graph_capture", "tutorials/features/horovod", "tutorials/features/hypertune", "tutorials/features/int4", "tutorials/features/int8_overview", "tutorials/features/int8_overview_xpu", "tutorials/features/int8_recipe_tuning_api", "tutorials/features/nhwc", "tutorials/features/profiler_kineto", "tutorials/features/profiler_legacy", "tutorials/features/runtime_extension", "tutorials/features/simple_trace", "tutorials/features/torch_compile_gpu", "tutorials/getting_started", "tutorials/installation", "tutorials/introduction", "tutorials/license", "tutorials/llm", "tutorials/llm/llm_optimize_transformers", "tutorials/performance_tuning", "tutorials/performance_tuning/known_issues", "tutorials/performance_tuning/launch_script", "tutorials/performance_tuning/torchserve", "tutorials/performance_tuning/tuning_guide", "tutorials/releases", "tutorials/technical_details", "tutorials/technical_details/AOT", "tutorials/technical_details/graph_optimization", "tutorials/technical_details/isa_dynamic_dispatch", "tutorials/technical_details/memory_management", "tutorials/technical_details/optimizer_fusion_cpu", "tutorials/technical_details/optimizer_fusion_gpu", "tutorials/technical_details/split_sgd"], "filenames": ["index.rst", "tutorials/api_doc.rst", "tutorials/blogs_publications.md", "tutorials/cheat_sheet.md", "tutorials/contribution.md", "tutorials/examples.md", "tutorials/features.rst", 
"tutorials/features/DDP.md", "tutorials/features/DLPack.md", "tutorials/features/DPC++_Extension.md", "tutorials/features/FSDP.md", "tutorials/features/advanced_configuration.md", "tutorials/features/amp_cpu.md", "tutorials/features/amp_gpu.md", "tutorials/features/auto_channels_last.md", "tutorials/features/codeless_optimization.md", "tutorials/features/compute_engine.md", "tutorials/features/float8.md", "tutorials/features/graph_capture.md", "tutorials/features/horovod.md", "tutorials/features/hypertune.md", "tutorials/features/int4.md", "tutorials/features/int8_overview.md", "tutorials/features/int8_overview_xpu.md", "tutorials/features/int8_recipe_tuning_api.md", "tutorials/features/nhwc.md", "tutorials/features/profiler_kineto.md", "tutorials/features/profiler_legacy.md", "tutorials/features/runtime_extension.md", "tutorials/features/simple_trace.md", "tutorials/features/torch_compile_gpu.md", "tutorials/getting_started.md", "tutorials/installation.rst", "tutorials/introduction.rst", "tutorials/license.md", "tutorials/llm.rst", "tutorials/llm/llm_optimize_transformers.md", "tutorials/performance_tuning.rst", "tutorials/performance_tuning/known_issues.md", "tutorials/performance_tuning/launch_script.md", "tutorials/performance_tuning/torchserve.md", "tutorials/performance_tuning/tuning_guide.md", "tutorials/releases.md", "tutorials/technical_details.rst", "tutorials/technical_details/AOT.md", "tutorials/technical_details/graph_optimization.md", "tutorials/technical_details/isa_dynamic_dispatch.md", "tutorials/technical_details/memory_management.rst", "tutorials/technical_details/optimizer_fusion_cpu.md", "tutorials/technical_details/optimizer_fusion_gpu.md", "tutorials/technical_details/split_sgd.rst"], "titles": ["Intel\u00ae Extension for PyTorch*", "API Documentation", "Blogs & Publications", "Cheat Sheet", "Contribution", "Examples", "Features", "DistributedDataParallel (DDP)", "DLPack Solution", "DPC++ Extension", "Fully Sharded Data Parallel (FSDP)", 
"Advanced Configuration", "Auto Mixed Precision (AMP) on CPU", "Auto Mixed Precision (AMP) on GPU", "Auto Channels Last", "Codeless Optimization (Experimental)", "Compute Engine (Experimental feature for debug)", "Float8 Data Type Support [GPU] (Experimental)", "Graph Capture (Experimental)", "Horovod with PyTorch (Experimental)", "HyperTune (Experimental)", "INT4 inference [GPU] (Experimental)", "Intel\u00ae Extension for PyTorch* optimizations for quantization [CPU]", "Intel\u00ae Extension for PyTorch* Optimizations for Quantization [GPU]", "INT8 Recipe Tuning API (Experimental) [CPU]", "Channels Last", "Kineto Supported Profiler Tool (Experimental)", "Legacy Profiler Tool (Experimental)", "Runtime Extension", "Simple Trace Tool (Experimental)", "torch.compile for GPU (Experimental)", "Quick Start", "Installation", "Introduction", "License", "Large Language Models (LLM) Optimizations Overview", "Transformers Optimization Frontend API", "Performance Tuning Guide", "Troubleshooting", "Launch Script Usage Guide", "TorchServe with Intel\u00ae Extension for PyTorch*", "Performance Tuning Guide", "Releases", "Technical Details", "Ahead of Time (AOT) Compilation", "Graph Optimization", "ISA Dynamic Dispatching", "Memory Management", "Optimizer Fusion on CPU", "Optimizer Fusion on GPU", "Split SGD"], "terms": {"intel optim": 0, "intel\u00ae extension for pytorch*": 0, "gpu": [0, 2, 3, 4, 5, 9, 11, 16, 19, 26, 31, 33, 35, 44, 47], "discrete gpu": 0, "intel discrete gpu": 0, "extend": [0, 6, 8, 30, 33, 35, 41, 42], "latest": [0, 7, 8, 19, 31, 33, 35, 38], "perform": [0, 1, 2, 3, 5, 6, 9, 12, 13, 14, 15, 16, 20, 21, 22, 23, 25, 30, 33, 35, 36, 38, 39, 42, 43, 45, 48, 49, 50], "optim": [0, 1, 2, 3, 7, 10, 12, 13, 14, 16, 18, 19, 20, 24, 25, 28, 30, 31, 33, 38, 39, 40, 41, 42, 50], "hardwar": [0, 2, 6, 33, 35, 37, 40, 42, 46], "take": [0, 1, 9, 12, 13, 15, 18, 20, 25, 33, 39, 41, 42, 43, 45, 50], "advantag": [0, 1, 14, 18, 25, 33, 39, 41, 42, 43, 50], "advanc": [0, 1, 9, 24, 
31, 42, 43, 47], "vector": [0, 1, 5, 9, 25, 42], "512": [0, 5, 24, 39, 42], "avx": [0, 42, 46], "neural": [0, 2, 6, 16, 17, 24, 41, 42], "network": [0, 2, 6, 12, 13, 16, 17, 28, 41, 42], "instruct": [0, 4, 5, 6, 12, 31, 32, 33, 35, 38, 41, 42, 43, 50], "vnni": [0, 22, 42], "matrix": [0, 6, 30, 33, 42], "amx": [0, 2, 6, 42, 46], "cpu": [0, 2, 3, 5, 7, 15, 20, 26, 27, 28, 31, 39, 40, 42], "well": [0, 1, 4, 5, 6, 21, 28, 35, 37, 40, 41, 42, 50], "x": [0, 9, 10, 12, 13, 15, 22, 24, 25, 28, 33, 38, 44, 45, 50], "e": [0, 1, 5, 9, 12, 13, 18, 23, 25, 33, 35, 38, 39, 41, 42, 43, 44], "xmx": [0, 33, 42], "ai": [0, 2, 11, 33, 35, 38, 42], "engin": [0, 5, 25, 33, 41, 42], "discret": [0, 33, 42], "moreov": [0, 1, 35, 42], "provid": [0, 1, 4, 5, 6, 7, 9, 10, 12, 13, 16, 18, 20, 24, 28, 32, 33, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 49], "easi": [0, 2, 19, 33, 37, 42, 50], "acceler": [0, 1, 2, 6, 17, 30, 33, 42, 45], "through": [0, 1, 5, 6, 9, 12, 13, 18, 30, 33, 41, 42], "xpu": [0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 13, 16, 17, 19, 23, 27, 29, 30, 31, 33, 36, 38], "devic": [0, 5, 7, 8, 9, 10, 11, 16, 19, 22, 25, 27, 30, 33, 35, 36, 38, 39, 42, 43, 44], "In": [0, 1, 5, 6, 9, 12, 13, 16, 18, 24, 25, 26, 27, 29, 30, 35, 39, 40, 41, 42, 48, 50], "current": [0, 1, 4, 6, 8, 10, 16, 17, 20, 21, 22, 23, 24, 26, 28, 29, 35, 36, 38, 44, 45, 46, 48, 49], "technolog": [0, 35], "landscap": [0, 35], "gener": [0, 4, 5, 6, 7, 8, 9, 15, 16, 18, 21, 23, 24, 25, 35, 36, 37, 39, 40, 41, 42, 43, 44, 46, 50], "genai": [0, 35], "workload": [0, 5, 6, 12, 13, 15, 18, 23, 35, 36, 38, 39, 41, 42, 43, 50], "model": [0, 1, 2, 3, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 30, 31, 36, 38, 41, 42], "have": [0, 1, 4, 5, 7, 8, 9, 14, 16, 20, 23, 25, 26, 27, 28, 29, 31, 34, 35, 38, 39, 40, 41, 44, 46, 50], "gain": [0, 6, 35, 38], "widespread": [0, 35], "attent": [0, 35], "popular": [0, 6, 8, 35], "larg": [0, 1, 6, 10, 21, 36, 38, 41, 42, 48, 49], "languag": [0, 9, 36, 38, 42], "llm": [0, 36, 38, 
42], "emerg": [0, 35], "domin": [0, 35], "drive": [0, 35], "applic": [0, 1, 5, 28, 35, 40, 41, 43, 44, 47], "start": [0, 1, 2, 3, 4, 5, 7, 15, 19, 26, 28, 29, 32, 38], "from": [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 13, 15, 17, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 50], "2": [0, 1, 2, 5, 6, 7, 8, 9, 10, 12, 13, 15, 25, 26, 28, 29, 30, 34, 35, 38, 39, 41, 43, 44, 46, 50], "1": [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 18, 24, 25, 26, 28, 29, 30, 35, 38, 39, 41, 45, 46, 48, 49, 50], "0": [0, 1, 3, 4, 5, 7, 9, 10, 11, 12, 13, 15, 19, 24, 26, 27, 28, 29, 30, 34, 38, 39, 40, 41, 45, 48, 49, 50], "specif": [0, 5, 8, 11, 16, 18, 19, 25, 28, 35, 39, 41, 42], "certain": [0, 1, 6, 36, 38, 39, 41], "ar": [0, 1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 15, 17, 19, 20, 21, 23, 25, 26, 27, 28, 29, 31, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49, 50], "introduc": [0, 2, 9, 22, 25, 39, 41, 42, 50], "For": [0, 1, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 20, 22, 25, 26, 27, 28, 31, 33, 35, 38, 39, 40, 41, 42, 43, 44, 45, 47, 48, 49, 50], "more": [0, 1, 4, 6, 7, 9, 10, 11, 12, 13, 15, 23, 24, 26, 27, 28, 29, 30, 31, 35, 38, 40, 41, 42, 43, 44, 45, 47, 48, 49, 50], "inform": [0, 1, 4, 6, 7, 8, 9, 10, 20, 23, 25, 26, 27, 30, 31, 35, 39, 40, 41, 42, 43], "refer": [0, 1, 7, 9, 10, 11, 14, 16, 20, 24, 25, 26, 27, 28, 31, 32, 33, 40, 42, 44, 45], "section": [0, 5, 12, 13, 20, 23, 28, 30, 32, 33, 36, 40, 41], "The": [0, 1, 4, 5, 6, 7, 8, 9, 11, 12, 13, 16, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49, 50], "can": [0, 1, 4, 5, 6, 7, 8, 9, 11, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50], "load": [0, 1, 5, 9, 22, 26, 27, 38, 40, 42, 45], "python": [0, 1, 3, 4, 7, 8, 9, 10, 11, 15, 19, 20, 26, 27, 28, 30, 31, 35, 36, 38, 39, 40, 41, 42, 46], "modul": [0, 1, 5, 6, 7, 9, 10, 12, 13, 23, 
24, 25, 36, 38, 39, 42, 45, 46], "program": [0, 1, 6, 28, 39, 41], "link": [0, 5, 46], "c": [0, 6, 7, 9, 12, 13, 24, 28, 29, 38, 39, 40, 41, 46], "librari": [0, 1, 5, 6, 7, 8, 9, 10, 11, 16, 26, 27, 28, 29, 40, 41, 42], "script": [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 13, 15, 19, 20, 23, 28, 29, 31, 35, 36, 37, 38, 40, 41, 43], "user": [0, 1, 5, 6, 7, 11, 14, 15, 16, 18, 22, 24, 25, 27, 28, 30, 37, 38, 39, 40, 41, 42, 43, 44, 45, 47], "enabl": [0, 1, 2, 3, 5, 6, 7, 11, 12, 13, 15, 17, 23, 25, 26, 27, 28, 30, 31, 35, 38, 39, 40, 41, 42, 43, 44, 45], "dynam": [0, 3, 5, 17, 28, 40, 41], "import": [0, 1, 3, 4, 5, 7, 9, 10, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 35, 36, 38, 40, 41, 42, 45, 46, 50], "intel_extension_for_pytorch": [0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 15, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 36, 38, 40, 42, 45, 46], "featur": [0, 1, 2, 4, 5, 12, 13, 15, 20, 25, 28, 30, 33, 38, 39, 40, 41, 42, 43, 44, 45], "includ": [0, 1, 5, 6, 7, 9, 15, 20, 22, 26, 27, 31, 34, 35, 38, 42], "onli": [0, 1, 4, 5, 6, 8, 9, 11, 12, 13, 15, 16, 17, 19, 20, 22, 23, 25, 26, 28, 31, 35, 38, 39, 40, 42, 45, 46, 50], "packag": [0, 5, 7, 9, 10, 15, 30, 31, 38, 40, 41, 42], "mai": [0, 1, 2, 4, 8, 9, 12, 13, 14, 16, 23, 25, 26, 27, 28, 30, 38, 39, 40, 41, 42, 43], "newer": [0, 41], "code": [0, 1, 4, 6, 9, 10, 11, 15, 16, 18, 19, 25, 26, 27, 29, 31, 32, 34, 36, 38, 41, 42, 43, 44, 45, 47, 48, 49, 50], "base": [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 15, 16, 19, 28, 31, 35, 36, 38, 40, 41, 42, 46, 50], "due": [0, 12, 13, 15, 23, 28, 38, 42], "differ": [0, 1, 5, 6, 7, 8, 22, 25, 26, 27, 28, 35, 39, 40, 41], "develop": [0, 2, 3, 5, 9, 38, 41, 43, 44], "schedul": [0, 1, 10, 26, 28, 39, 41, 45], "ha": [0, 1, 5, 6, 8, 9, 15, 16, 20, 25, 28, 30, 38, 39, 41, 42, 44, 50], "been": [0, 1, 5, 6, 9, 15, 25, 30, 39, 41, 42, 46], "releas": [0, 1, 7, 14, 25, 30, 38, 41, 44, 47], "an": [0, 1, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 19, 20, 23, 25, 26, 27, 28, 30, 31, 35, 37, 38, 39, 
40, 41, 42, 45, 46, 48, 49, 50], "open": [0, 24, 38, 41, 42], "sourc": [0, 4, 9, 11, 19, 26, 27, 29, 30, 31, 34, 38, 41, 44], "project": [0, 5, 9], "github": [0, 4, 6, 7, 8, 10, 12, 13], "you": [0, 1, 4, 5, 6, 7, 9, 10, 12, 13, 19, 20, 21, 22, 25, 26, 27, 28, 29, 30, 31, 35, 36, 38, 39, 41, 42, 43, 44, 45, 47], "find": [0, 1, 5, 8, 9, 20, 26, 27, 38, 39, 42, 43], "how": [0, 1, 5, 6, 7, 8, 9, 15, 22, 25, 31, 40, 41], "get": [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 15, 22, 26, 27, 28, 35, 38, 39, 41, 42, 50], "main": [0, 4, 5, 10, 20, 28, 30, 39, 40], "branch": [0, 6], "quick": [0, 28, 32, 33], "about": [0, 1, 4, 7, 9, 10, 24, 30, 40, 41, 45], "product": [0, 6, 20, 35, 42, 43, 44], "i": [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 26, 27, 29, 30, 31, 34, 35, 36, 38, 40, 41, 42, 43, 44, 45, 46, 48, 49, 50], "structur": [0, 1, 6, 8, 21, 39], "shown": [0, 1, 5, 7, 25, 26, 27, 29, 35, 39, 40], "follow": [0, 1, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 16, 17, 19, 20, 22, 23, 25, 26, 27, 29, 30, 31, 32, 34, 35, 36, 38, 39, 40, 41, 42, 50], "figur": [0, 8, 35, 41, 50], "eager": [0, 18, 23, 40, 42, 43], "mode": [0, 1, 4, 6, 11, 15, 18, 25, 28, 31, 38, 40, 42, 43], "frontend": [0, 1, 6, 28, 35, 42], "custom": [0, 1, 6, 9, 11, 16, 26, 35, 38, 42], "fusion": [0, 1, 5, 15, 21, 23, 30, 36, 42, 43, 50], "int8": [0, 1, 2, 3, 6, 23, 28, 36, 42], "quantiz": [0, 2, 3, 5, 24, 35, 38, 40, 42, 45], "api": [0, 2, 5, 9, 10, 11, 15, 22, 23, 26, 28, 30, 35, 38, 41, 42], "further": [0, 1, 5, 6, 25, 28, 35, 37, 41, 43], "improv": [0, 2, 6, 12, 13, 17, 21, 28, 35, 40, 41, 42, 45], "achiev": [0, 1, 5, 41], "convert": [0, 1, 3, 5, 6, 8, 12, 13, 14, 15, 17, 21, 23, 24, 25, 28, 36, 38, 40, 42, 45], "graph": [0, 1, 3, 12, 13, 15, 23, 30, 38, 39, 42], "us": [0, 1, 2, 3, 4, 7, 10, 11, 17, 19, 20, 21, 22, 23, 24, 25, 30, 31, 32, 34, 35, 37, 38, 40, 41, 42, 43, 46, 47, 48, 50], "pass": [0, 1, 4, 5, 9, 15, 26, 27, 28, 38, 40], "reduc": [0, 1, 6, 10, 17, 21, 22, 26, 28, 35, 38, 41, 
42, 43, 48, 49, 50], "oper": [0, 1, 5, 8, 9, 11, 12, 13, 22, 23, 26, 27, 29, 30, 38, 40, 41, 42, 43, 45, 50], "kernel": [0, 1, 6, 9, 11, 16, 26, 28, 30, 35, 38, 41, 42, 46], "invoc": [0, 38, 42], "overhead": [0, 1, 6, 9, 15, 27, 28, 35, 38, 41, 42, 43, 48, 49], "result": [0, 1, 9, 15, 18, 20, 25, 28, 39, 40, 41, 50], "compar": [0, 1, 6, 25, 38, 39, 41, 43, 45, 49, 50], "normal": [0, 1, 5, 10, 19, 26, 28, 35, 41, 43], "yield": [0, 37, 41, 43], "better": [0, 1, 6, 16, 22, 23, 25, 28, 35, 39, 40, 41, 42, 43, 49], "techniqu": [0, 1, 9, 18, 35], "like": [0, 1, 2, 4, 6, 9, 12, 20, 21, 23, 26, 27, 35, 38, 39, 41, 42, 48, 50], "amplifi": 0, "them": [0, 4, 11, 19, 25, 26, 27, 38, 39, 41, 42, 48, 49], "comprehens": [0, 47], "both": [0, 1, 5, 6, 8, 17, 23, 25, 36, 39, 40, 41, 42, 44, 48, 49, 50], "torchscript": [0, 1, 6, 15, 18, 31, 38, 40, 43, 48, 49], "torchdynamo": [0, 6, 18], "With": [0, 1, 6, 7, 9, 15, 19, 23, 26, 27, 28, 29, 39], "we": [0, 1, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 20, 21, 22, 23, 24, 25, 28, 30, 35, 38, 40, 41, 42, 43, 47, 48, 49, 50], "recommend": [0, 5, 6, 7, 14, 15, 16, 22, 23, 24, 28, 30, 31, 38, 39, 41, 42, 43], "torch": [0, 1, 3, 6, 7, 8, 9, 10, 12, 13, 15, 16, 17, 18, 19, 22, 23, 24, 25, 26, 27, 28, 29, 31, 36, 38, 40, 41, 42, 43, 45, 49], "jit": [0, 1, 5, 12, 13, 22, 23, 24, 25, 28, 31, 36, 38, 40, 42, 43, 44, 45], "trace": [0, 5, 11, 12, 13, 18, 22, 23, 24, 28, 31, 36, 38, 40, 43, 45], "your": [0, 4, 5, 7, 9, 10, 12, 13, 15, 19, 20, 22, 26, 27, 28, 29, 30, 31, 32, 34, 38, 43, 44, 47], "prefer": [0, 16, 22, 32], "option": [0, 1, 7, 11, 15, 17, 20, 22, 26, 27, 30, 39, 44], "wider": [0, 5], "rang": [0, 7, 9, 10, 17, 19, 22, 23, 24, 26, 36, 38, 39, 40, 48, 50], "ipex": [0, 1, 2, 3, 5, 6, 9, 14, 18, 21, 22, 24, 28, 29, 31, 35, 36, 38, 39, 40, 41, 42, 45], "backend": [0, 1, 2, 6, 7, 8, 9, 10, 16, 18, 19, 24, 30, 35, 38, 39, 41, 42, 44, 45], "avail": [0, 1, 5, 6, 7, 9, 11, 16, 28, 30, 31, 36, 39, 41, 42, 43, 46, 47], "good": [0, 1, 4, 6, 18, 25, 
35, 41, 48, 49], "On": [0, 1, 6, 17, 21, 25, 35, 41], "automat": [0, 1, 5, 6, 14, 15, 17, 18, 22, 23, 25, 26, 29, 35, 39, 40, 41, 42, 43, 44, 45], "dispatch": [0, 9, 42], "underli": [0, 6, 35, 46, 47], "detect": [0, 5, 18, 38, 41, 46], "set": [0, 1, 3, 4, 5, 6, 7, 9, 10, 11, 12, 16, 19, 20, 22, 27, 29, 30, 31, 38, 39, 40, 41, 42, 43, 44, 50], "isa": 0, "leverag": [0, 6, 30, 40], "unit": [0, 9, 41], "runtim": [0, 8, 9, 10, 12, 13, 26, 27, 31, 39, 41, 42, 44, 45], "offer": [0, 4, 26, 41, 47], "finer": [0, 6, 28], "grain": [0, 2, 6, 28], "thread": [0, 1, 5, 6, 9, 28, 29, 38, 39, 40, 41], "control": [0, 1, 6, 26, 27, 28, 29, 38, 39, 41], "weight": [0, 1, 5, 7, 9, 15, 17, 18, 19, 22, 23, 25, 28, 35, 42, 43], "share": [0, 4, 6, 7, 8, 9, 28, 38, 40, 41, 42], "increas": [0, 1, 2, 19, 26, 27, 35, 38, 41, 42, 43, 44, 47, 50], "effici": [0, 7, 9, 10, 17, 21, 28, 30, 35, 39, 41, 42, 48, 49], "implement": [0, 4, 5, 6, 7, 8, 9, 10, 25, 35, 38, 41, 42, 48, 49], "regist": [0, 11, 15, 42], "mechan": [0, 6, 9, 42, 46, 50], "These": [0, 5, 6, 12, 13, 17, 35, 42, 45], "nativ": [0, 6, 12, 13, 38, 42, 48, 49, 50], "calcul": [0, 1, 9, 12, 13, 26, 27, 42, 50], "util": [0, 5, 6, 7, 8, 9, 10, 15, 16, 17, 19, 22, 24, 25, 35, 38, 39, 41, 44, 50], "dpc": [0, 8, 11, 38, 42], "compil": [0, 4, 5, 6, 7, 11, 26, 27, 31, 38, 41, 42], "sycl": [0, 1, 6, 8, 11, 16, 42, 43], "standard": [0, 9, 35], "also": [0, 1, 5, 6, 8, 9, 11, 15, 20, 23, 25, 26, 27, 35, 36, 38, 39, 41, 42, 43, 44, 45, 47, 48, 49], "number": [0, 4, 5, 6, 7, 9, 10, 19, 20, 26, 27, 28, 29, 38, 40, 42, 48, 49, 50], "which": [0, 1, 5, 6, 7, 8, 9, 11, 12, 13, 15, 17, 20, 21, 22, 25, 26, 28, 29, 35, 38, 39, 40, 41, 42, 43, 44, 47], "found": [0, 5, 20, 23, 25, 36, 39, 40, 41, 42, 43], "doc": [0, 4, 23, 36], "directori": [0, 4, 9, 20, 31, 36, 38, 39, 40, 42], "team": [0, 4], "track": [0, 1], "bug": [0, 4, 43, 44], "enhanc": [0, 2, 30, 35], "request": [0, 1, 4, 28, 40, 43], "issu": [0, 1, 4, 12, 13, 37, 41, 50], "befor": [0, 1, 4, 5, 6, 11, 
20, 23, 25, 26, 27, 28, 29, 38, 39, 41, 43, 44, 45], "submit": [0, 1, 4, 6, 9, 28], "suggest": [0, 1, 22, 25, 26, 28, 41], "report": [0, 38], "search": [0, 3, 4, 6, 35, 39], "exist": [0, 4, 23, 26, 38, 39, 41, 42, 45], "see": [0, 1, 4, 9, 12, 13, 17, 20, 25, 26, 27, 29, 38, 42, 44], "alreadi": [0, 4, 5, 19, 25, 35, 41, 43], "dtype": [1, 3, 5, 12, 13, 15, 16, 17, 22, 23, 24, 26, 27, 30, 31, 36, 38, 39, 42, 45], "none": [1, 7, 10, 39, 49], "level": [1, 6, 9, 11, 15, 25, 26, 28, 35, 38, 41, 42, 44, 45, 50], "o1": [1, 38], "inplac": [1, 3, 22, 23, 24, 25, 36, 40, 45], "fals": [1, 3, 5, 10, 12, 13, 20, 22, 23, 24, 25, 26, 27, 28, 29, 31, 36, 38, 39, 40, 45, 46], "conv_bn_fold": [1, 38], "linear_bn_fold": 1, "weights_prepack": [1, 6, 38], "replace_dropout_with_ident": 1, "optimize_lstm": 1, "split_master_weight_for_bf16": 1, "fuse_update_step": 1, "auto_kernel_select": [1, 6], "sample_input": [1, 14], "graph_mod": [1, 3, 6, 18], "concat_linear": 1, "appli": [1, 5, 6, 12, 13, 18, 19, 21, 25, 31, 35, 36, 38, 39, 42, 45, 48, 49, 50], "given": [1, 20, 21, 35, 45], "nn": [1, 5, 6, 7, 10, 12, 13, 15, 16, 17, 22, 24, 25, 28, 38, 45], "If": [1, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 20, 22, 23, 24, 25, 28, 38, 39, 40, 41, 42, 43, 44, 45], "train": [1, 2, 3, 7, 10, 17, 19, 21, 22, 23, 24, 25, 31, 36, 39, 42, 43, 50], "otherwis": [1, 10, 11, 28], "infer": [1, 2, 3, 6, 15, 17, 18, 22, 23, 25, 28, 30, 31, 38, 41, 42, 50], "conv": [1, 12, 13, 15, 22, 24, 28, 38, 45], "bn": [1, 15, 22, 24, 38], "fold": [1, 15, 22, 24, 38], "prepack": [1, 15, 25, 35], "so": [1, 5, 6, 8, 12, 13, 19, 22, 25, 28, 29, 30, 38, 39, 40, 41, 42, 47, 49], "onednn": [1, 2, 6, 11, 16, 30, 35, 38, 42, 45], "order": [1, 8, 16, 17, 25, 27, 29, 38, 39, 41, 50], "cach": [1, 4, 9, 11, 28, 42, 43, 47, 48, 49], "reus": [1, 41], "layout": [1, 6, 38], "call": [1, 6, 7, 9, 12, 13, 25, 26, 27, 29, 38, 40, 41, 45, 47, 50], "block": [1, 4, 28, 41, 42, 43], "although": [1, 41], "itself": [1, 25, 26, 27], "fast": [1, 3, 9, 18, 
19, 41, 43], "enough": [1, 38, 48], "usag": [1, 6, 8, 12, 13, 16, 23, 25, 26, 27, 31, 33, 37, 40, 41, 42], "perspect": [1, 25, 39, 41, 45, 50], "drawback": [1, 50], "run": [1, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 15, 18, 20, 24, 26, 27, 28, 29, 31, 38, 39, 40, 41, 42, 43, 44], "split": [1, 9, 11, 28, 38, 43, 48], "one": [1, 4, 7, 8, 9, 11, 16, 18, 19, 20, 23, 25, 26, 28, 36, 38, 39, 41, 42, 45, 48, 49], "sever": [1, 6, 15, 26, 27, 37, 38, 39, 48, 49], "dimens": [1, 9, 16, 25], "data": [1, 3, 5, 7, 9, 12, 13, 14, 15, 18, 19, 23, 24, 25, 28, 30, 31, 36, 38, 39, 40, 42, 44, 48, 49, 50], "fix": [1, 4, 6, 21, 38], "size": [1, 5, 6, 7, 8, 9, 10, 19, 22, 24, 25, 35, 38, 40, 41, 43, 44], "each": [1, 6, 7, 9, 10, 11, 12, 13, 16, 19, 20, 26, 27, 28, 29, 38, 39, 40, 41, 48, 50], "time": [1, 4, 6, 7, 9, 10, 20, 25, 26, 27, 35, 38, 41, 45, 48, 49], "execut": [1, 3, 5, 6, 8, 9, 11, 12, 13, 15, 16, 18, 20, 21, 24, 26, 28, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49], "detail": [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16, 23, 25, 30, 31, 33, 35, 38, 40, 41, 42, 45, 46], "mermori": 1, "format": [1, 4, 6, 7, 8, 10, 14, 16, 17, 19, 20, 26, 27, 29, 39, 41, 42], "manual": [1, 6, 15, 16, 20, 25, 28, 30], "To": [1, 4, 5, 6, 7, 10, 15, 19, 22, 25, 26, 27, 28, 29, 31, 35, 37, 40, 41, 42, 43, 45, 50], "thi": [1, 4, 5, 6, 7, 8, 9, 10, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 34, 35, 36, 37, 38, 39, 42, 43, 44, 45, 46, 48, 49, 50], "predefin": 1, "shape": [1, 6, 8, 16, 24, 26, 27, 28, 35, 41], "prior": [1, 31], "match": [1, 12, 13, 39], "requir": [1, 4, 5, 7, 8, 9, 12, 13, 15, 16, 17, 23, 24, 25, 31, 35, 36, 38, 39, 40, 42, 50], "won": [1, 6, 12, 13, 26, 27, 29], "t": [1, 4, 6, 8, 9, 12, 13, 20, 22, 24, 25, 26, 27, 28, 29, 30, 38, 40, 42], "convers": [1, 5, 12, 13, 23, 42, 45], "directli": [1, 9, 21, 41], "go": [1, 4, 5, 9, 12, 13, 43, 44], "methodologi": [1, 5, 6, 9, 41, 43, 48, 49], "possibl": [1, 5, 8, 16, 20, 22, 38, 41, 48], "avoid": [1, 4, 9, 15, 28, 38, 39, 40, 41, 50], 
"thu": [1, 9, 12, 13, 15, 16, 25, 28, 39, 40, 41, 42, 50], "paramet": [1, 5, 6, 7, 10, 12, 13, 15, 19, 24, 26, 27, 28, 30, 35, 38, 39, 41, 48, 49, 50], "work": [1, 4, 5, 6, 7, 9, 10, 11, 20, 22, 25, 26, 27, 28, 31, 35, 36, 38, 39, 41, 43, 46], "bfloat16": [1, 2, 3, 6, 15, 16, 17, 30, 31, 39, 42], "half": [1, 43, 50], "k": 1, "float16": [1, 6, 12, 21, 30, 31, 36, 42], "cast": [1, 12, 13, 50], "accord": [1, 35, 41, 45], "default": [1, 3, 5, 6, 7, 8, 10, 11, 15, 18, 22, 24, 26, 27, 28, 29, 30, 37, 38, 40, 41, 42, 45, 46], "valu": [1, 6, 11, 15, 17, 20, 28, 35, 38, 39, 40, 41, 46, 48, 50], "mean": [1, 25, 28, 29, 35, 38, 42], "do": [1, 4, 6, 9, 12, 13, 24, 25, 26, 27, 28, 35, 38, 39, 40, 41, 50], "noth": 1, "note": [1, 2, 4, 7, 8, 9, 10, 14, 19, 21, 22, 25, 28, 30, 35, 38, 39, 40, 41, 42, 44, 46], "type": [1, 3, 4, 5, 6, 8, 9, 10, 15, 23, 24, 25, 28, 31, 38, 39, 40, 42, 43, 44, 50], "conv2d": [1, 10, 12, 13, 15, 25, 28, 38, 42, 43, 45], "linear": [1, 7, 10, 12, 13, 16, 17, 21, 22, 24, 25, 38, 41, 42, 43, 45], "convtranspose2d": [1, 45], "case": [1, 5, 6, 9, 10, 11, 14, 18, 25, 37, 38, 39, 41], "addit": [1, 5, 30, 42, 43, 44, 46, 50], "embed": [1, 35], "lstm": [1, 15, 16, 22], "sgd": [1, 5, 7, 12, 13, 19, 24, 30, 42, 43, 48, 49], "string": [1, 10, 39], "o0": [1, 38], "No": [1, 25, 29, 38, 42], "function": [1, 4, 5, 6, 9, 10, 12, 13, 15, 18, 20, 22, 23, 26, 28, 29, 30, 31, 35, 36, 38, 39, 41, 42, 43, 46, 49, 50], "just": [1, 9, 20, 36, 42, 43, 44], "return": [1, 5, 7, 9, 10, 12, 13, 15, 24, 25, 26, 28, 38], "origin": [1, 8, 17, 18, 19, 22, 28, 36, 45, 49], "dropout": [1, 10, 15, 42], "remov": [1, 4, 26, 42, 50], "inferenc": 1, "master": [1, 6, 7, 39, 43, 50], "fuse": [1, 35, 42, 45, 48, 49], "updat": [1, 4, 7, 10, 42, 48, 49, 50], "step": [1, 4, 5, 7, 9, 10, 12, 13, 16, 19, 20, 24, 26, 29, 30, 40, 48, 50], "overridden": [1, 46], "explicitli": [1, 5, 9, 11, 12, 13, 26, 28, 38, 39], "bool": [1, 20], "whether": [1, 12, 13, 25, 26, 27, 41], "conv_bn": 1, "It": [1, 6, 7, 8, 
9, 10, 12, 15, 17, 21, 25, 28, 29, 31, 36, 38, 39, 41, 42, 43, 44, 45, 46, 50], "knob": [1, 3, 18, 39], "overwrit": [1, 39], "configur": [1, 3, 5, 9, 20, 22, 23, 26, 31, 37, 39, 40, 44, 46], "linear_bn": 1, "convolut": [1, 6, 13, 28, 30, 41, 45], "reorder": [1, 25, 35], "replac": [1, 4, 6, 7, 10, 15, 23, 38], "ident": [1, 5, 15, 25], "aten": [1, 6, 8, 9], "opportunit": 1, "bf16": [1, 2, 31, 38, 42, 43, 48, 49, 50], "save": [1, 4, 5, 10, 17, 19, 20, 21, 22, 24, 25, 35, 40, 42, 45, 50], "solut": [1, 10, 35, 38, 42], "doesn": [1, 22, 24, 25, 26, 38], "support": [1, 4, 5, 7, 8, 9, 11, 16, 22, 23, 24, 27, 28, 30, 33, 35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 48, 49, 50], "all": [1, 4, 5, 7, 9, 10, 11, 12, 13, 16, 19, 20, 26, 27, 28, 29, 35, 36, 40, 41, 42, 45, 47, 48, 49], "param": [1, 39, 48, 49], "tupl": [1, 7, 28], "tensor": [1, 5, 6, 8, 9, 12, 13, 16, 22, 23, 26, 27, 28, 35, 38, 40, 42, 47], "feed": [1, 14, 25], "sampl": [1, 7, 14, 20, 41], "input": [1, 5, 6, 7, 9, 10, 14, 15, 16, 22, 23, 25, 26, 27, 30, 38, 40, 41, 45], "impact": [1, 28, 38], "pack": [1, 8, 28], "intel": [1, 2, 3, 6, 8, 9, 10, 11, 12, 14, 15, 16, 19, 20, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 42, 43, 44, 45, 46, 48, 49, 50], "extens": [1, 2, 3, 5, 8, 10, 11, 14, 15, 20, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 39, 41, 42, 43, 44, 45, 46, 48, 49], "pytorch": [1, 2, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 20, 24, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49], "per": [1, 8, 9, 11, 15, 19, 22, 23, 28, 39, 40, 41], "some": [1, 4, 6, 9, 10, 12, 13, 16, 24, 25, 26, 28, 31, 38, 39, 40, 41, 45], "heurist": [1, 28], "real": [1, 5, 6, 7, 20, 22], "best": [1, 6, 12, 13, 20, 23, 35, 41], "try": [1, 4, 5, 6, 7, 18, 20, 24, 38, 39, 41], "select": [1, 5, 6, 32, 42, 45], "true": [1, 3, 5, 6, 7, 9, 10, 15, 16, 17, 18, 20, 22, 23, 24, 27, 30, 31, 36, 39, 40, 41, 45, 46], "might": [1, 25, 38, 41, 49], "cost": [1, 6, 9, 26, 41], "extra": [1, 7, 15, 
28, 39, 40, 42], "auto": [1, 5, 9, 15, 25, 35, 38, 39, 41, 42], "experiment": [1, 3, 4, 28, 38, 42, 44, 45], "combin": [1, 18, 20, 23, 39], "method": [1, 8, 9, 12, 13, 21, 22, 26, 27, 29, 38, 41], "multipl": [1, 4, 6, 7, 12, 13, 25, 30, 35, 38, 40, 41, 42, 44], "subgraph": 1, "modifi": [1, 4, 19], "other": [1, 6, 8, 10, 12, 13, 16, 17, 19, 20, 23, 25, 26, 31, 35, 38, 39, 41, 42, 47, 48, 49], "place": [1, 7, 12, 13, 35, 41], "scenario": [1, 8, 23, 38, 41], "convolutuon": 1, "counterpart": [1, 6, 42], "pleas": [1, 4, 7, 10, 21, 26, 27, 30, 31, 38, 39, 42, 44], "invok": [1, 5, 9, 12, 13, 15, 28, 31, 36, 38, 42, 45], "ddp": [1, 6, 10, 42], "distribut": [1, 2, 7, 10, 19, 38, 39, 40, 41, 42, 43, 44], "deepcopi": 1, "rather": [1, 25], "than": [1, 6, 11, 16, 23, 25, 26, 27, 28, 30, 38, 41, 42, 43, 50], "allreduc": [1, 7, 19, 38], "caus": [1, 35, 38, 39, 41, 42, 44, 50], "unpredict": 1, "accuraci": [1, 2, 6, 10, 12, 13, 21, 22, 24, 35, 42, 50], "loss": [1, 5, 7, 10, 12, 13, 19, 24, 25, 30, 50], "exampl": [1, 4, 6, 11, 12, 13, 16, 19, 25, 26, 27, 29, 31, 32, 33, 35, 36, 38, 40, 41, 45, 48, 49, 50], "load_state_dict": 1, "path": [1, 5, 9, 10, 11, 20, 25, 28, 38, 39, 41], "eval": [1, 3, 5, 10, 12, 13, 15, 18, 21, 22, 23, 24, 28, 31, 36, 38, 40, 45], "optimized_model": 1, "evalu": [1, 24, 31, 42], "optimized_optim": 1, "altern": [1, 5, 6, 7, 25], "motiv": [1, 5, 28], "ad": [1, 5, 6, 7, 15, 26, 30, 41], "alia": [1, 5, 9], "unifi": [1, 5, 39], "style": [1, 4, 5, 9, 27], "modular": [1, 5], "optimize_transform": [1, 35, 36, 42], "float32": [1, 26, 27, 31, 39, 45, 50], "quantization_config": 1, "qconfig_summary_fil": 1, "low_precision_checkpoint": 1, "deployment_mod": 1, "transform": [1, 2, 3, 5, 10, 15, 21, 24, 25, 40, 41, 42], "focu": [1, 15, 25, 36, 42], "especi": [1, 4, 9, 35], "task": [1, 6, 35, 38, 39, 41], "famili": [1, 35, 41], "llama": [1, 2, 35], "gpt": [1, 21, 35], "j": [1, 21, 35], "neox": 1, "opt": [1, 7, 31, 35, 46], "falcon": 1, "now": [1, 6, 9, 22, 25, 30, 40, 41], 
"float": [1, 5, 6, 9, 10, 12, 13, 17, 20, 22, 24, 50], "when": [1, 4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 19, 20, 25, 28, 29, 35, 38, 39, 40, 41, 42, 43, 44, 48, 49, 50], "mix": [1, 5, 9, 38, 42, 45], "str": [1, 7, 20, 26, 39], "curentlti": 1, "object": [1, 5, 20, 28, 38, 41, 42, 46], "defin": [1, 6, 8, 9, 10, 12, 13, 15, 17, 19, 23, 24, 25, 26, 27, 36, 40, 42, 45], "recip": [1, 3, 6, 17, 22, 45], "quant": 1, "static": [1, 3, 23, 39, 40, 41], "onc": [1, 4, 6, 16, 20, 25, 26, 27, 28, 40, 41, 44, 50], "quantizat": 1, "config": [1, 5, 23, 39, 40], "json": [1, 22, 24, 26, 27, 40], "file": [1, 3, 4, 5, 7, 9, 12, 13, 20, 22, 24, 25, 26, 27, 38, 39, 42, 44, 46], "under": [1, 6, 8, 12, 13, 25, 28, 34, 38, 39, 42], "need": [1, 4, 5, 6, 7, 9, 10, 15, 19, 20, 23, 24, 25, 26, 27, 28, 29, 30, 31, 36, 38, 39, 40, 41, 42, 43, 44, 45, 48, 49, 50], "calibr": [1, 5, 21, 23, 36, 38, 40], "dict": 1, "int4": [1, 6, 35, 36, 42], "": [1, 2, 7, 9, 10, 12, 13, 15, 20, 22, 25, 26, 27, 28, 38, 39, 40, 41, 48, 49, 50], "should": [1, 4, 9, 10, 12, 13, 22, 27, 28, 29, 35, 37, 38, 39, 41], "state_dict": [1, 5, 10, 19], "checkpoint": [1, 5, 19, 38], "pt": [1, 5, 10, 20, 21, 22, 36, 40, 45], "gptq": [1, 21], "etc": [1, 4], "where": [1, 4, 6, 7, 8, 10, 41, 50], "specifi": [1, 9, 10, 20, 28, 39, 41, 44], "kei": [1, 6, 7, 35, 42, 43], "group": [1, 7, 9, 10, 28, 41, 48], "chang": [1, 4, 5, 6, 7, 9, 10, 12, 13, 15, 18, 19, 22, 25, 28, 31, 36, 38, 39, 42, 46], "make": [1, 4, 5, 6, 7, 9, 10, 19, 20, 22, 26, 30, 31, 35, 40, 41, 44, 46, 50], "n": [1, 5, 6, 7, 9, 10, 24, 25, 28, 38, 40, 41, 48], "thei": [1, 6, 9, 12, 13, 25, 35, 39, 41, 42], "uint4": 1, "compress": [1, 17, 21], "along": [1, 4, 29, 41, 50], "store": [1, 25, 35, 39, 40, 41, 48, 49, 50], "int32": 1, "zero": [1, 6, 10, 11, 22, 26, 38, 44], "point": [1, 6, 8, 9, 12, 13, 17, 21, 22, 41, 50], "scale": [1, 2, 6, 10, 17, 19, 22, 35, 42], "bia": [1, 9, 12, 13, 25, 28], "state": [1, 6, 9, 10, 19, 22, 35, 47, 48], "channel": [1, 2, 11, 15, 22, 23, 42], 
"automaticlli": 1, "deploy": [1, 5, 6, 45], "torchscirpt": 1, "workabl": 1, "forward": [1, 5, 7, 9, 10, 12, 13, 16, 24, 25, 28, 38, 40, 41, 45, 50], "after": [1, 4, 5, 9, 23, 26, 27, 28, 29, 31, 32, 38, 40, 41, 45, 49, 50], "deepspe": [1, 35], "parallel": [1, 7, 35, 38, 41, 42], "get_fp32_math_mod": 1, "fpmath_mod": 1, "fpmath": 1, "fp32mathmod": 1, "fp32": [1, 3, 11, 16, 23, 31, 38, 42, 48, 49, 50], "bf32": [1, 11], "tf32": [1, 11], "disabl": [1, 6, 10, 11, 38, 39, 41, 45], "implicit": 1, "set_fp32_math_mod": 1, "class": [1, 6, 7, 10, 12, 13, 15, 24, 25, 28, 38], "verbos": [1, 3, 6, 9, 11, 29, 39], "demand": [1, 6], "easier": [1, 25, 30, 50], "debug": [1, 11, 23, 29, 36, 39, 46], "dump": [1, 23, 39], "messag": [1, 6, 9, 15, 18, 25, 29, 38, 39, 42], "contain": [1, 4, 7, 8, 16, 38, 39, 40, 41, 45], "durat": [1, 50], "while": [1, 11, 12, 13, 18, 23, 25, 26, 27, 35, 38, 40, 41, 42, 50], "via": [1, 4, 6, 8, 9, 11, 23, 26, 27, 28, 30, 35, 39, 41, 44, 47], "environ": [1, 4, 6, 7, 10, 11, 19, 28, 31, 35, 38, 39, 40, 41, 46], "variabl": [1, 4, 6, 11, 19, 31, 38, 39, 40, 41, 46], "name": [1, 9, 16, 20, 26, 27, 29, 35, 39, 40, 41], "dnnl_verbos": 1, "howev": [1, 4, 6, 9, 11, 12, 13, 14, 25, 28, 35, 38, 39, 41, 47], "those": [1, 9, 19, 22, 26, 35, 41, 47], "amount": [1, 38, 41, 47], "investig": [1, 38, 39], "singl": [1, 10, 19, 20, 28, 35, 40, 45, 48, 49], "iter": [1, 7, 26, 35, 50], "scope": [1, 6, 12, 13, 50], "out": [1, 5, 6, 8, 9, 12, 13, 15, 24, 26, 27, 28, 29, 38, 39, 41, 48, 49], "second": [1, 9, 15, 19, 26, 35, 38, 40, 41], "verbose_on": 1, "verbose_off": 1, "verbose_on_cr": 1, "creation": 1, "current_devic": 1, "int": [1, 5, 6, 7, 9, 10, 20, 38, 39], "index": [1, 4, 7, 8, 9, 10, 25, 29, 35, 41], "current_stream": [1, 9], "ani": [1, 4, 12, 13, 15, 25, 38, 40, 42, 43, 46], "context": [1, 4, 6, 8, 9, 11, 12, 13, 28, 29, 35, 41], "wrapper": [1, 7, 9, 29], "encapsul": [1, 9], "op": [1, 4, 10, 16, 22, 29, 30, 35], "argument": [1, 7, 10, 16, 26, 39], "neg": [1, 38, 50], 
"integ": [1, 35], "device_count": [1, 10, 26], "device_of": 1, "obj": 1, "storag": [1, 17, 48, 49], "alloc": [1, 8, 15, 19, 28, 35, 40, 43, 47], "get_device_nam": 1, "get_device_properti": 1, "properti": [1, 5, 9, 40], "_deviceproperti": 1, "init": [1, 4, 7, 19, 22, 24], "initi": [1, 7, 10, 19, 28, 40], "lazi": 1, "until": [1, 4, 28, 29, 41, 50], "first": [1, 2, 4, 5, 7, 9, 14, 15, 18, 19, 23, 24, 26, 28, 38, 39, 40, 41, 42, 43, 45, 48, 50], "access": [1, 8, 9, 25, 35, 40, 42, 48, 49], "veri": [1, 4, 9, 22, 25, 26, 27, 35, 38], "rare": 1, "sinc": [1, 6, 8, 9, 25, 28, 38, 41, 43, 44, 48, 50], "could": [1, 6, 7, 11, 23, 25, 36, 38, 40, 41, 45], "doe": [1, 8, 9, 25, 28, 30, 38, 42, 45], "repeatedli": [1, 4], "is_avail": 1, "indic": [1, 25, 35], "is_initi": 1, "set_devic": [1, 7, 10, 19], "discourag": 1, "favor": 1, "most": [1, 6, 11, 16, 23, 35, 38, 40, 41, 42, 43, 45, 50], "xpu_visible_devic": 1, "environment": 1, "streamcontext": 1, "around": [1, 9, 39], "synchron": [1, 8, 9, 11, 19, 26, 28, 38], "wait": [1, 8, 26, 28, 41], "complet": [1, 4, 5, 20, 25, 26, 36, 41, 47], "fp8": [1, 6, 42], "fp8_autocast": [1, 17], "fp8_recip": [1, 17], "delayedsc": [1, 17], "inp": 1, "_gptq": [1, 21, 36], "dataset": [1, 5, 7, 10, 19, 21, 24, 36, 41], "quantized_ckpt": 1, "wbit": [1, 21, 36], "4": [1, 7, 9, 20, 21, 25, 28, 29, 35, 36, 39, 41, 42], "perchannel": [1, 23, 36], "symmetr": [1, 22, 23], "group_siz": 1, "pack_dtyp": 1, "uint8": [1, 23], "param_dtyp": 1, "list": [1, 4, 5, 11, 12, 13, 20, 25, 26, 33, 35, 36, 39, 40, 41, 42, 44, 45], "bloom": [1, 35], "3": [1, 4, 5, 6, 7, 9, 10, 12, 13, 15, 16, 18, 20, 24, 25, 26, 28, 29, 30, 39, 41, 42, 45, 46, 50], "bit": [1, 17, 21, 35, 50], "calib": 1, "batch": [1, 5, 6, 9, 10, 19, 24, 25, 28, 38, 40], "int2": 1, "int3": 1, "granular": [1, 39, 40, 41], "scheme": [1, 40], "determin": [1, 41, 50], "except": [1, 7, 35, 39], "huggingfac": [1, 21, 35, 38, 40], "guarante": [1, 16], "gptjforcausallm": [1, 21, 36], "model_path": [1, 21, 36], 
"from_pretrain": [1, 3, 5, 21, 36, 40], "quantized_weight": [1, 21, 36], "get_rng_stat": 1, "bytetensor": 1, "rng": 1, "eagerli": 1, "get_rng_state_al": 1, "repres": [1, 4, 6, 7, 42, 50], "set_rng_stat": 1, "new_stat": 1, "desir": [1, 5, 23, 24, 39], "set_rng_state_al": 1, "manual_se": [1, 7, 10], "seed": [1, 7, 10], "safe": [1, 8], "silent": 1, "ignor": 1, "multi": [1, 6, 7, 20, 28, 38, 39, 41, 44], "insuffici": 1, "manual_seed_al": 1, "seed_al": 1, "initial_se": 1, "prioriti": [1, 16], "kwarg": [1, 26], "record_ev": 1, "record": [1, 10, 20, 26, 27, 40], "new": [1, 2, 4, 17, 18, 24, 25, 28, 36, 41, 42, 43], "sycl_queu": [1, 5, 9], "pycapsul": [1, 9], "queue": [1, 5, 6, 11], "correspond": [1, 7, 28, 38, 39, 42], "void": [1, 9], "pointer": [1, 9, 38], "address": [1, 7, 25, 39, 40, 41], "Its": 1, "capsul": 1, "self": [1, 7, 10, 12, 13, 15, 24, 25, 26, 27, 28, 38], "wait_ev": 1, "futur": [1, 4, 14], "wait_stream": 1, "anoth": [1, 20, 39, 41], "without": [1, 6, 7, 8, 12, 13, 15, 24, 26, 27, 28, 29, 38, 40, 42, 43, 50], "enqueu": 1, "affect": [1, 39], "elapsed_tim": [1, 10], "end_ev": 1, "elaps": [1, 10, 41], "millisecond": [1, 41], "wa": [1, 5, 8, 9, 27, 29, 38, 39, 40, 41, 42], "queri": [1, 25], "check": [1, 5, 6, 7, 9, 16, 25, 26, 31, 35, 36, 39, 43, 45], "captur": [1, 3, 47], "A": [1, 4, 5, 6, 15, 23, 25, 38, 39, 41, 42, 44], "boolean": [1, 6], "prevent": [1, 19, 48, 49], "proceed": 1, "empty_cach": [1, 47], "unoccupi": 1, "held": 1, "visibl": [1, 42], "sysman": 1, "toolkit": [1, 7, 31], "help": [1, 4, 5, 9, 10, 16, 35, 39, 41, 43, 44, 47], "fragment": [1, 41], "memory_stat": [1, 47], "dictionari": 1, "statist": [1, 5, 6, 23, 36], "non": [1, 4, 12, 13, 25, 40, 45], "core": [1, 6, 19, 20, 30, 38, 41, 46], "large_pool": 1, "small_pool": 1, "peak": 1, "freed": [1, 47], "receiv": [1, 38, 42, 50], "allocated_byt": 1, "segment": [1, 6, 26, 27, 42], "reserv": [1, 41, 43], "xpumalloc": 1, "reserved_byt": 1, "activ": [1, 5, 17, 19, 22, 23, 26, 28, 31, 35, 36, 39, 41], 
"active_byt": 1, "inactive_split": 1, "inact": 1, "inactive_split_byt": 1, "broken": 1, "down": [1, 38, 40], "pool": [1, 28, 43], "across": [1, 6, 7, 10, 39], "octob": 1, "2019": [1, 2], "1mb": [1, 41], "small": [1, 6, 23, 41, 48], "metric": 1, "maximum": [1, 46], "histor": 1, "total": [1, 9, 26, 27, 41, 47], "decreas": [1, 38], "simpl": [1, 9, 11, 12, 13, 25, 30, 31, 41, 42], "counter": 1, "num_alloc_retri": 1, "fail": [1, 15, 38, 42], "flush": 1, "retri": 1, "num_oom": 1, "error": [1, 4, 5, 9, 10, 15, 24, 25, 27, 38, 42, 50], "thrown": [1, 38], "memory_summari": 1, "abbrevi": 1, "human": 1, "readabl": [1, 9], "printout": 1, "displai": 1, "period": [1, 41], "dure": [1, 3, 4, 5, 15, 35, 38, 39, 41, 44, 45, 50], "handl": [1, 5, 8, 25, 41], "summari": 1, "memory_snapshot": [1, 47], "snapshot": [1, 47], "interpret": [1, 9, 39], "output": [1, 5, 6, 10, 12, 13, 16, 17, 19, 20, 25, 26, 29, 30, 38, 45], "familiar": [1, 9], "intern": [1, 11, 25, 28, 40, 46], "memory_alloc": [1, 47], "occupi": [1, 38, 47], "byte": 1, "less": [1, 6, 12, 13, 28, 38, 42], "unus": [1, 41, 47], "creat": [1, 4, 5, 8, 9, 11, 21, 23, 24, 28, 30, 36, 41, 42], "max_memory_alloc": [1, 47], "By": [1, 6, 39, 41, 46], "begin": [1, 4, 9, 29], "reset_peak_stat": 1, "reset": 1, "two": [1, 6, 8, 9, 17, 20, 28, 35, 40, 41, 43, 50], "measur": 1, "loop": [1, 4, 11, 26, 50], "memory_reserv": [1, 47], "max_memory_reserv": [1, 47], "reset_peak_memory_stat": 1, "stat": 1, "individu": [1, 4], "memory_stats_as_nested_dict": 1, "nest": [1, 29], "reset_accumulated_memory_stat": 1, "accumul": 1, "enum": 1, "fp32_math_mod": 1, "dpccp": 1, "packet": 1, "enumer": [1, 5, 8, 10, 19, 24], "math": [1, 6, 9, 11, 16], "fp32_math_mode_max": 1, "comput": [1, 5, 7, 17, 19, 21, 22, 24, 25, 28, 30, 35, 39, 40, 41, 43, 44, 45, 50], "primit": [1, 6, 11, 28, 38], "attribut": [1, 25], "descript": [1, 3, 6, 10, 11, 25, 28, 33, 41], "definit": [1, 9, 50], "numer": [1, 12, 13, 41], "behavior": [1, 16, 28, 29, 39, 41], 
"get_queue_from_stream": [1, 5, 9], "c10": [1, 5], "dpcpp": [1, 5, 7, 9, 38, 42], "enable_onednn_fus": [1, 45], "prepar": [1, 3, 23, 24, 36, 38, 40, 45], "example_input": [1, 3, 22, 23, 24, 36, 40, 45], "bn_fold": 1, "example_kwarg_input": 1, "qconfig": [1, 3, 5, 23, 24, 36, 38, 40, 45], "observ": [1, 5, 14, 22, 23, 36, 45], "insert": [1, 5, 23, 36], "fake": 1, "introduct": [1, 35, 37, 41], "avaiabl": 1, "page": [1, 5, 7, 10, 26, 27, 28, 30, 32, 36, 37, 41, 45], "autotun": [1, 3, 24], "prepared_model": [1, 3, 22, 24, 38, 45], "calib_dataload": 1, "eval_func": 1, "sampling_s": [1, 3, 24], "accuracy_criterion": [1, 3, 24], "tuning_tim": [1, 3, 24], "driven": 1, "tune": [1, 2, 3, 6, 12, 13, 22, 28, 39, 40], "quickli": 1, "dataload": [1, 5, 7, 10, 15, 19, 24, 26, 28], "entir": [1, 9, 35], "process": [1, 5, 6, 7, 9, 10, 16, 17, 18, 19, 20, 28, 29, 38, 39, 40, 41, 48, 50], "scalar": [1, 42], "higher": [1, 6, 25, 35, 45], "algorithm": [1, 17, 21, 25, 45], "would": [1, 4, 5, 6, 9, 16, 20, 23, 25, 38, 39, 40, 41, 42, 46], "explor": 1, "100": [1, 3, 7, 10, 19, 20, 24, 26, 29, 40, 42, 46], "accuracy_criterion_typ": 1, "rel": [1, 3, 24, 39], "absolut": [1, 5, 39], "accuracy_criterion_valu": 1, "allow": [1, 5, 12, 13, 20, 38, 39, 41, 43, 44], "either": [1, 6, 7, 38, 39], "01": [1, 3, 24, 39, 40], "timeout": [1, 4, 50], "earli": 1, "stop": [1, 26, 38, 41], "is_runtime_ext_en": 1, "helper": [1, 9], "exetens": 1, "openmp": [1, 6, 28, 38, 40], "preload": [1, 38, 39, 41, 42], "cpupool": [1, 28], "core_id": [1, 28, 39], "node_id": [1, 28, 39, 40], "abstract": [1, 9, 28], "intra": 1, "id": [1, 7, 8, 10, 27, 29, 35, 39, 40], "numa": [1, 28, 39, 40], "node": [1, 28, 40, 41], "pin": [1, 19, 28, 30], "cpu_pool": [1, 28], "region": [1, 12, 13, 41], "def": [1, 7, 9, 10, 12, 13, 15, 24, 25, 26, 28, 38, 42, 45], "design": [1, 4, 10, 12, 13, 16, 25, 36, 42, 50], "decor": 1, "multistreammodulehint": [1, 28], "arg": [1, 3, 6, 7, 10, 19, 20, 26, 39, 40, 45, 48, 49], "hint": [1, 28], 
"multistreammodul": [1, 6, 28, 38], "concat": [1, 16, 25, 28, 35, 38], "its": [1, 5, 7, 8, 12, 13, 16, 20, 26, 27, 31, 37, 39, 40, 41, 50], "dim": [1, 5, 8, 9, 10, 16, 25], "length": [1, 4, 20, 38, 50], "arbitrari": 1, "keyword": 1, "num_stream": [1, 28], "concat_output": 1, "input_split_hint": [1, 28], "multi_stream": 1, "output_concat_hint": [1, 28], "throughput": [1, 2, 28, 35, 38, 42], "insid": [1, 4, 9, 28, 30, 39, 43], "divis": [1, 16, 28], "equal": [1, 11, 22, 28, 38, 40, 41], "remaind": [1, 28], "divisor": [1, 28], "batchsiz": [1, 28], "larger": [1, 28, 41], "piec": [1, 6, 28, 29], "mini": [1, 28, 38], "don": [1, 4, 9, 12, 13, 20, 26, 30], "want": [1, 4, 9, 11, 20, 22, 25, 26, 27, 28, 31, 39], "num": [1, 28, 40, 41], "leav": [1, 28, 41], "scriptmodul": [1, 23, 28, 36, 45], "union": 1, "instanc": [1, 6, 15, 20, 40], "usual": [1, 25, 28, 41], "reason": [1, 15, 25, 28], "still": [1, 4, 6, 9, 12, 13, 24, 25, 26, 37, 38, 42, 44, 45, 50], "flag": [1, 6, 28, 39], "concaten": [1, 16, 35, 50], "raw": 1, "asynchron": [1, 6], "get_core_list_of_node_id": 1, "softwar": [2, 34], "jul": 2, "2023": [2, 42], "deep": [2, 6, 7, 8, 10, 12, 13, 16, 17, 19, 20, 21, 41, 45, 50], "learn": [2, 6, 7, 8, 10, 12, 13, 17, 19, 20, 21, 39, 41, 45, 50], "boost": [2, 5, 6, 14, 39, 41, 42, 50], "dl": [2, 6], "hug": 2, "face": 2, "bert": [2, 3, 15, 17], "googl": [2, 4], "cloud": 2, "platform": [2, 5, 8, 25, 30, 38, 40, 41, 42, 44], "gcp": 2, "technologi": [2, 6], "guid": [2, 6, 7, 40], "apr": 2, "mar": [2, 40], "x86": 2, "sapphir": [2, 6], "rapid": [2, 6], "part": [2, 5, 9, 12, 13, 25, 26, 27, 38, 41, 43, 50], "jan": 2, "secur": 2, "torchserv": [2, 37], "confer": 2, "dec": 2, "2022": [2, 39, 40, 42], "what": [2, 6, 12, 13, 26, 27, 29, 42, 43, 44], "pyg": 2, "stabl": [2, 3, 6, 7, 8, 12, 13, 38], "diffus": 2, "arc": [2, 38, 42, 44], "nov": [2, 42], "13": [2, 15, 29, 39, 40, 41], "potenti": [2, 30], "fine": [2, 9, 28, 39, 40, 41], "fx": [2, 15, 38], "sep": 2, "empow": [2, 6, 30], "xeon": [2, 6, 
20, 40, 41, 50], "scalabl": [2, 6, 41, 50], "processor": [2, 6, 41, 48, 49, 50], "aug": 2, "vision": [2, 5], "last": [2, 9, 11, 15, 42, 50], "One": [2, 8, 17, 25, 39, 41, 48, 49], "click": 2, "compressor": [2, 6, 24], "4x": 2, "jun": 2, "grokk": 2, "principl": [2, 19, 25], "kt": 2, "person": 2, "text": [2, 35, 38, 41, 44], "speech": [2, 41], "2021": [2, 7, 39, 40], "up": [2, 6, 7, 8, 9, 10, 28, 35, 41, 42, 43], "modern": 2, "naver": 2, "low": [2, 3, 6, 7, 9, 31, 38, 39, 41, 42, 49, 50], "latenc": [2, 20, 35, 40, 42], "machin": [2, 4, 7, 20, 38, 39, 40, 41, 43], "feb": [2, 42], "dlrm": [2, 6, 38], "oneccl": [2, 6, 10, 38, 39], "mention": [2, 7, 9, 15, 28, 50], "deprec": [2, 14], "facebook": [2, 35], "3rd": [2, 6, 50], "gen": 2, "capabl": [2, 6, 23, 30, 42, 47], "2020": [2, 8], "collabor": 2, "caff": 2, "2017": 2, "command": [3, 4, 5, 7, 9, 19, 20, 30, 38, 39, 40, 41, 42], "basic": [3, 16, 24, 41, 42, 50], "instal": [3, 4, 5, 10, 11, 26, 27, 30, 31, 33, 35, 38, 41, 42, 44], "m": [3, 7, 9, 10, 19, 20, 28, 30, 38, 39, 40, 41], "pip": [3, 4, 7, 19, 30], "lt": [3, 42], "version": [3, 5, 8, 9, 30, 34, 38, 40, 41, 42, 49], "gt": [3, 20, 41, 42], "f": [3, 4, 5, 10, 19, 24], "http": [3, 4, 7, 8, 9, 10, 24, 38], "com": [3, 4, 7, 10, 38], "whl": [3, 7, 30, 38], "xpupip": 3, "log": [3, 10, 29, 39, 40, 42, 45], "prompt": [3, 35], "export": [3, 7, 11, 29, 38, 39, 41, 42], "onednn_verbos": 3, "precis": [3, 5, 7, 17, 21, 31, 38, 42, 45, 50], "no_grad": [3, 5, 10, 15, 18, 22, 24, 28, 30, 31, 38, 40, 45], "amp": [3, 5, 15, 17, 30, 31, 38, 42], "autocast": [3, 5, 15, 17, 30, 31], "bertmodelmodel": 3, "bertmodel": [3, 5, 40], "uncas": [3, 5, 15, 40], "fast_bert": 3, "launch": [3, 9, 11, 28, 35, 37, 40, 42], "autom": [3, 6, 12, 13, 20, 39, 40], "ipexrun": [3, 15, 39], "your_pytorch_script": [3, 39], "hypertun": 3, "hyperparamet": [3, 6], "conf": [3, 20, 39, 45], "your_conf_fil": 3, "your_python_script": 3, "post": [3, 4, 22, 23, 30, 35], "default_static_qconfigprepared_model": 3, 
"anyplac": 3, "d": [3, 4, 5, 6, 9, 12, 13, 38], "calibration_data_load": [3, 5, 45], "converted_model": [3, 38], "default_dynamic_qconfigprepared_model": 3, "tuned_model": [3, 24], "eval_funct": 3, "convert_model": [3, 22, 24, 45], "thank": 4, "interest": 4, "intent": 4, "propos": [4, 25, 50], "intend": 4, "shall": [4, 25], "discuss": [4, 9, 25, 41], "agre": 4, "plan": [4, 15], "look": [4, 5, 9, 20, 25], "ahead": [4, 9, 26, 27, 29], "outstand": 4, "pick": 4, "comment": [4, 20], "particular": [4, 12, 13, 36, 38, 42], "ask": 4, "pull": 4, "full": [4, 9, 30, 40, 41], "here": [4, 7, 8, 9, 10, 12, 13, 15, 25, 26, 27, 28, 29, 38, 40, 41, 45], "uninstal": 4, "ll": [4, 9, 26, 27, 29, 40, 41], "know": [4, 8, 9, 43, 44], "fulli": [4, 22, 23, 36, 41, 42, 50], "warn": [4, 18, 39, 40], "skip": [4, 5, 25, 29, 39, 43, 44, 46], "few": [4, 6, 14, 25, 40, 45], "alwai": [4, 12, 13, 16, 25, 38, 39, 41, 42], "ye": 4, "clone": [4, 7, 29, 49], "copi": [4, 6, 8, 25], "git": [4, 7], "b": [4, 6, 7, 12, 13, 29], "cd": [4, 5, 7], "rebas": 4, "submodul": [4, 7], "sync": [4, 7, 28], "recurs": [4, 7], "job": [4, 38], "setup": [4, 7, 9, 10, 19, 26, 27, 35], "py": [4, 7, 9, 10, 11, 15, 20, 26, 27, 28, 38, 39, 40], "symlink": 4, "tree": 4, "reinstal": [4, 30, 38], "again": [4, 30, 40, 48, 49], "__init__": [4, 7, 10, 12, 13, 15, 24, 25, 28, 38], "interfac": [4, 5, 9, 25, 35, 38], "pyi": 4, "cpp": [4, 5, 9, 41, 46], "h": [4, 5, 6, 9, 24, 25, 38, 39, 40], "sure": [4, 7, 10, 19, 20, 22, 26, 40], "Then": [4, 7, 23, 30, 38, 40, 42], "clean": [4, 38, 42], "our": [4, 5, 9, 16, 23, 35, 41, 48], "6": [4, 5, 6, 7, 20, 28, 29, 39, 40, 41], "binari": [4, 5, 12, 13, 25, 43, 44], "folder": 4, "mani": [4, 6, 9, 20, 26, 27, 39, 41], "wai": [4, 9, 15, 25, 35, 49], "next": [4, 6, 9, 16], "re": [4, 7, 9, 12, 13, 40, 41], "rm": 4, "rf": 4, "toplevel": 4, "over": [4, 6, 9, 12, 13, 14, 25, 38, 39, 42], "made": [4, 8, 42], "edit": [4, 38], "repo": [4, 6, 7, 30], "commit": [4, 30], "keep": [4, 6, 18, 25, 26, 40, 41, 50], 
"realli": [4, 9], "untrack": 4, "deinit": 4, "xdf": 4, "within": [4, 36, 41, 42, 50], "experi": [4, 15, 16, 18, 25, 41], "env_key1": 4, "env_val1": 4, "env_key2": 4, "env_val2": 4, "suit": 4, "locat": 4, "test_": 4, "sub_fold": 4, "filenam": 4, "wish": [4, 9, 25, 26, 27, 43], "port": [4, 7, 39], "stock": [4, 8, 25, 38, 42, 45], "10": [4, 8, 10, 20, 24, 25, 26, 29, 30, 38, 39, 40, 41, 46, 50], "regress": [4, 14, 38], "offici": [4, 23, 26, 27, 40, 41, 42], "read": [4, 9, 48, 49], "readm": 4, "md": [4, 25], "docstr": 4, "line": [4, 9, 15, 25, 26, 27, 29, 39, 40, 41, 45], "must": [4, 9, 20, 26, 44, 48, 49], "limit": [4, 12, 13, 15, 25, 28, 38, 40, 41], "80": [4, 39], "charact": 4, "fit": [4, 41, 43], "jupyt": 4, "popup": 4, "abov": [4, 7, 8, 10, 11, 15, 25, 26, 27, 29, 35, 39, 40, 48, 49], "prerequisit": [4, 5], "r": [4, 6, 20, 40, 41], "txt": [4, 5, 9, 40], "html": [4, 8, 9, 24], "_build": 4, "rst": 4, "live": 4, "tutori": [4, 5, 7, 9, 10, 22, 24, 26, 27, 37], "autofunct": 4, "autoclass": 4, "direct": [4, 45], "shorten": 4, "sphinx": 4, "produc": [4, 8, 9, 12, 13, 47], "miss": 4, "torchvis": [5, 10, 15, 18, 24, 40, 45], "demonstr": [5, 8, 16, 25, 38, 40], "box": [5, 15, 41], "benefit": [5, 6, 12, 13, 15, 23, 28, 40, 41, 50], "against": 5, "criterion": [5, 12, 13], "below": [5, 7, 9, 12, 13, 15, 16, 20, 25, 26, 27, 28, 29, 31, 38, 39, 40, 41, 42, 44, 48, 49, 50], "lr": [5, 7, 10, 12, 13, 24, 30, 48, 49], "001": [5, 7, 12, 13], "download": [5, 10, 24, 30, 38], "cifar10": 5, "compos": [5, 10], "resiz": 5, "224": [5, 12, 13, 15, 18, 30, 40, 45], "totensor": [5, 10, 24], "5": [5, 7, 10, 15, 16, 20, 23, 24, 25, 28, 29, 36, 38, 39, 40, 41, 48, 50], "train_dataset": [5, 7, 19], "root": [5, 7, 11, 24, 35, 38, 42, 46], "train_load": [5, 7, 10, 12, 13, 19], "batch_siz": [5, 7, 9, 10, 19, 24, 25, 40, 45], "128": [5, 10, 12, 13, 15, 28], "crossentropyloss": [5, 24], "momentum": [5, 15, 30, 49, 50], "9": [5, 20, 29, 39, 40, 42, 46], "batch_idx": [5, 10, 19], "target": [5, 9, 10, 
15, 19, 20, 42, 43, 44, 46], "zero_grad": [5, 10, 19, 24, 30], "backward": [5, 7, 9, 10, 12, 13, 19, 24, 30, 50], "print": [5, 6, 7, 10, 18, 19, 20, 23, 24, 25, 26, 27, 29, 36, 39, 45, 46], "model_state_dict": 5, "optimizer_state_dict": 5, "pth": 5, "finish": [5, 16, 18, 24, 28, 38], "nlp": [5, 6, 38], "resnet50_weight": [5, 18], "rand": [5, 12, 13, 18, 25, 28, 30, 38], "vocab_s": [5, 40], "seq_length": [5, 40], "randint": [5, 40], "freez": [5, 12, 13, 15, 22, 24, 28, 30, 31, 38, 40, 45], "strict": [5, 40], "becaus": [5, 12, 13, 25, 35, 38, 41, 44, 50], "prepare_jit": [5, 23, 36], "convert_jit": [5, 23, 36], "separ": [5, 7, 9, 11, 34, 38, 41, 48], "collect": [5, 6, 7, 10, 23, 40, 41], "o": [5, 7, 10, 38, 42, 46], "_recurs": 5, "wrap_cpp_modul": 5, "quantize_jit": [5, 23, 36], "modeljit": [5, 23, 36], "minmaxobserv": [5, 22, 23, 36], "with_arg": [5, 22, 23, 36], "qscheme": [5, 22, 23, 36], "per_tensor_symmetr": [5, 22, 23, 36], "reduce_rang": [5, 22, 23, 36], "quint8": [5, 22], "default_weight_observ": [5, 23, 36], "len": [5, 10, 19, 24, 26], "memory_format": [5, 6, 25], "channels_last": [5, 6, 25, 41], "libtorch": [5, 42], "own": [5, 9, 22], "servic": [5, 41], "regular": [5, 50], "unlik": [5, 6, 10], "cmake": [5, 6, 42, 46], "cppsdk": 5, "ensur": [5, 19, 28, 40, 48], "app": 5, "iostream": 5, "memori": [5, 6, 8, 9, 10, 12, 13, 14, 15, 17, 21, 28, 35, 38, 40, 42, 45, 48, 49, 50], "argc": 5, "const": [5, 9], "char": 5, "argv": 5, "catch": [5, 23], "std": [5, 9, 48], "cerr": 5, "kxpu": 5, "ivalu": 5, "push_back": 5, "cout": 5, "slice": [5, 9, 25], "end": [5, 26, 28, 29, 38, 43, 44, 45], "endl": 5, "cmakelist": [5, 9], "cmake_minimum_requir": [5, 9], "fatal_error": [5, 9], "find_packag": [5, 9], "add_execut": 5, "target_link_librari": [5, 9], "torch_ipex_librari": [5, 9], "set_properti": [5, 9], "cxx_standard": [5, 9], "17": [5, 9, 29, 39, 40], "mkdir": 5, "build": [5, 6, 7, 19, 29, 30, 38, 41, 42, 44], "cc": [5, 46], "icx": [5, 9], "cxx": [5, 46], "icpx": [5, 9], 
"dcmake_prefix_path": [5, 9], "libpytorch_path": 5, "libpytorch": 5, "_": [5, 7, 8, 9, 22, 23, 25, 28, 29, 38, 39, 40, 41, 42, 45], "verifi": [5, 6, 21, 30, 35, 38], "linux": [5, 9, 38, 39, 41, 42], "ldd": 5, "workspac": 5, "identif": [5, 46], "intelllvm": 5, "2024": [5, 38], "abi": [5, 42, 46], "info": [5, 23, 38, 39, 40, 46], "done": [5, 10, 15, 24, 38, 41, 46], "oneapi": [5, 6, 7, 8, 10, 16, 19, 31, 38, 41, 44], "bin": [5, 38, 39, 40, 42, 46], "pthread": [5, 28], "test": [5, 10, 24, 29, 30, 42, 43, 44, 46], "cmake_have_libc_pthread": 5, "success": [5, 15, 32], "lib": [5, 38, 39, 40, 42], "libintel": 5, "ext": 5, "written": [5, 42, 46], "0x00007fd5bb927000": 5, "libc10": 5, "0x00007fd5bb895000": 5, "libtorch_cpu": 5, "0x00007fd5a44d8000": 5, "0x00007fd5a1a1b000": 5, "0x00007fd5862b0000": 5, "libmkl_intel_lp64": [5, 38, 42], "mkl": [5, 31, 38, 42], "intel64": [5, 38, 42], "0x00007fd584ab0000": 5, "libmkl_cor": [5, 38, 42], "0x00007fd5806cc000": 5, "libmkl_gnu_thread": [5, 38], "0x00007fd57eb1d000": 5, "libmkl_sycl": [5, 38, 42], "0x00007fd55512c000": 5, "libopencl": 5, "0x00007fd55511d000": 5, "libsvml": 5, "intel64_lin": 5, "0x00007fd553b11000": 5, "libirng": 5, "0x00007fd553600000": 5, "libimf": 5, "0x00007fd55321b000": 5, "libintlc": 5, "0x00007fd553a9c000": 5, "libsycl": 5, "0x00007fd552f36000": 5, "show": [5, 7, 8, 9, 12, 13, 26, 27, 29, 36, 37, 39, 40, 41, 50], "fsycl": [5, 9, 44], "cmake_cxx_flag": 5, "usm": [5, 8], "cl": 5, "hpp": 5, "namespac": [5, 12, 13], "fetch": 5, "stream": [5, 6, 11, 28, 38], "device_typ": [5, 9], "devicetyp": [5, 9], "impl": [5, 9], "virtualguardimpl": [5, 9], "xpu_stream": 5, "getstream": [5, 9], "input_ptr": 5, "malloc_devic": 5, "fromusm": 5, "scalartyp": 5, "nullopt": 5, "output_tensor": 5, "append": 5, "former": [5, 9], "zoo": 5, "benchmark": [5, 38, 39, 47], "mark": [5, 26, 27], "document": [5, 6, 9, 11, 28, 36, 42, 46], "column": [5, 9, 26, 27], "simpli": [5, 9, 38, 39], "guidanc": 6, "nchw": [6, 41], "nhwc": [6, 41, 42], 
"anymor": 6, "center": [6, 30, 38, 42, 44], "flex": [6, 38, 42, 44], "seri": [6, 30, 38, 41, 42, 44], "choos": [6, 12, 13, 16, 26, 28, 39, 41, 42], "typic": [6, 8, 15, 19, 26, 41, 42], "speed": [6, 7, 9, 35, 41, 42, 43, 48, 49], "furthermor": 6, "aka": [6, 25], "cooper": 6, "lake": 6, "4th": 6, "bfloat": 6, "16": [6, 28, 29, 39, 40, 50], "matmul": [6, 12, 13, 38, 42, 45], "partial": [6, 10], "upstream": [6, 25], "land": 6, "pr": [6, 25, 38], "being": [6, 21, 29, 41], "review": [6, 38], "side": [6, 8, 22, 41], "respect": [6, 20, 39], "built": [6, 7, 9, 28, 29, 38, 42, 44, 46], "deliv": [6, 23, 35], "cnn": [6, 25, 41], "top": [6, 15, 42, 50], "power": [6, 9, 17, 21, 41], "meet": [6, 17, 41, 50], "commun": [6, 7, 8, 10, 38, 39, 40, 41, 42], "bind": [6, 9, 10, 38, 39, 40, 41], "formerli": [6, 7, 10, 41], "known": [6, 7, 10, 15, 35, 37], "torch_ccl": [6, 7], "horovod": [6, 38, 42], "among": [6, 8, 19, 39, 40, 41], "framework": [6, 8, 11, 19], "interopar": 6, "particularli": [6, 8], "describ": [6, 7, 12, 13, 25, 38, 40, 41, 45, 50], "write": [6, 26, 27], "practic": [6, 9, 35, 41, 50], "setuptool": 6, "suffici": [6, 11], "driver": [6, 44], "ze_flat_device_hierarchi": [6, 11], "hierarchi": 6, "expos": [6, 12, 13], "tile": [6, 7, 11, 35], "industri": [6, 10, 42], "grade": [6, 10, 42], "worker": [6, 7, 10, 19, 28, 39], "maintain": [6, 7, 9, 10, 12, 13], "replica": [6, 7, 10], "gradient": [6, 7, 10, 17, 19], "rank": [6, 7, 10, 19, 39], "footprint": [6, 10, 17, 21, 35, 42, 43, 50], "feasibl": [6, 10, 15], "seamlessli": [6, 30], "har": [6, 30], "flagship": [6, 30], "torchinductor": [6, 30], "field": [6, 26, 27, 29], "statement": [6, 20, 26, 27, 46], "let": [6, 9, 15, 25, 28, 29, 48, 49, 50], "stack": [6, 12, 13, 29], "indent": [6, 26, 27, 29], "distinguish": [6, 29], "capac": [6, 16, 41, 50], "registr": 6, "topologi": [6, 38, 39, 41, 48, 49], "roialign": 6, "nm": 6, "mask": [6, 38], "frozenbatchnorm2d": 6, "num_featur": 6, "ep": [6, 15, 48], "1e": [6, 15, 24], "05": [6, 15, 
39], "batchnorm2d": [6, 15, 38], "affin": [6, 15, 22, 28, 39, 40, 41], "expect": [6, 25, 38], "w": [6, 24, 25, 40, 50], "same": [6, 7, 9, 10, 15, 22, 25, 28, 35, 38, 39, 40, 41, 46, 50], "interact": 6, "beyond": 6, "kind": 6, "gender": 6, "hobbi": 6, "dot": [6, 13, 25, 35], "between": [6, 7, 8, 12, 13, 28, 29, 38, 41, 46], "man": [6, 41], "plai": [6, 41], "footbal": 6, "gemm": [6, 25, 35, 38, 42], "onemkl": [6, 11, 16, 38, 42], "circumst": [6, 12, 13], "faster": [6, 12, 13, 41], "abl": [6, 22], "aim": [6, 15, 41], "broad": [6, 14], "toggl": 6, "switch": [6, 10, 26, 27, 39, 41], "weights_preack": 6, "concern": 6, "major": 6, "spawn": [6, 7, 10, 28], "stage": [6, 15, 17, 21, 23, 28, 38, 41, 48, 49], "subject": [6, 28, 34, 46], "hopefulli": 6, "eas": [6, 9, 25], "though": 6, "instead": [6, 20, 28, 36, 38, 39, 40, 41, 42, 48, 49], "turn": [6, 29], "off": [6, 11, 12, 13, 26, 27, 29, 35, 38, 42, 50], "variou": [6, 8, 20, 26, 27, 30, 41, 42], "area": [6, 20], "extrem": [6, 20, 41], "situat": [6, 8, 20], "space": [6, 25, 41], "huge": [6, 20, 41], "impract": [6, 20], "consum": [6, 8, 20, 26, 27], "launcher": [6, 7, 39, 41, 45], "replic": 7, "everi": [7, 29, 35], "fed": 7, "overlap": [7, 40], "c10d": [7, 10], "ccl": [7, 10, 19, 39], "processgroup": [7, 10], "hold": [7, 10, 25, 41], "allgath": [7, 10, 19], "alltoal": [7, 19], "successfulli": 7, "v2": [7, 30, 42], "oneccl_bindings_for_pytorch": [7, 10], "third": [7, 48], "parti": 7, "compute_backend": 7, "system": [7, 9, 38, 41, 42], "apt": 7, "yum": 7, "dnf": 7, "sudo": 7, "devel": 7, "11": [7, 29, 39, 40, 46], "inteloneapiroot": 7, "use_system_oneccl": 7, "ON": [7, 11, 26, 27, 29], "repositori": 7, "repo_url": 7, "u": [7, 9, 40], "holder": 7, "url": [7, 40], "oneccl_bind_pt": 7, "cwd": 7, "env": [7, 19, 31], "setvar": 7, "sh": [7, 19, 31], "var": [7, 19, 31], "basekit": [7, 19, 38], "oneapi_root": 7, "manag": [7, 9, 12, 13, 28, 35, 39, 45], "modif": [7, 10, 19, 23], "necessari": [7, 10, 19, 25, 26, 27, 29], "dist": [7, 10, 
13, 38], "init_process_group": [7, 10], "exclus": [7, 10, 11, 39], "local": [7, 10, 19, 28, 39, 40, 41], "local_rank": [7, 10, 19], "wrap": [7, 10, 19], "device_id": [7, 8, 10, 26], "exactli": [7, 9, 50], "resid": 7, "seed_numb": 7, "illustr": [7, 10, 23, 25, 39, 41, 48, 50], "Or": [7, 38], "example_ddp": 7, "super": [7, 10, 12, 13, 15, 24, 25, 28, 38], "__name__": [7, 10, 38], "__main__": [7, 10, 38, 39, 40], "123": 7, "mpi_world_s": 7, "pmi_siz": 7, "mpi_rank": 7, "pmi_rank": 7, "world_siz": [7, 10], "els": [7, 20, 25, 26, 49], "world": 7, "master_addr": [7, 10], "127": [7, 39], "master_port": [7, 10], "29500": [7, 39], "global": [7, 26, 28], "get_rank": 7, "get_world_s": 7, "loss_fn": [7, 24], "mseloss": 7, "rune": 7, "randn": [7, 15, 16, 24, 25, 26, 27, 29, 40, 45], "label": [7, 12, 13, 17], "l": 7, "mpirun": 7, "card": [7, 25, 35, 38, 42], "regard": [7, 25, 45], "explicit": [7, 28, 29, 41], "minor": 7, "single_card": 7, "single_card_dist": 7, "importerror": [7, 38, 42], "rais": [7, 15, 27, 38], "multiprocess": [7, 10], "multi_process_spawn": 7, "main_work": 7, "put": [7, 8, 10, 26, 41], "train_sampl": [7, 19], "epoch": [7, 10, 19, 24], "set_epoch": [7, 10], "adjust": 7, "warp": 7, "sampler": [7, 10, 19], "loader": [7, 24], "shuffl": [7, 10], "num_work": [7, 10], "pin_memori": [7, 10], "wide": [8, 21, 50], "adopt": [8, 35, 42], "numpi": 8, "domain": [8, 17, 21], "interoper": 8, "v0": 8, "7": [8, 10, 15, 20, 28, 29, 39, 40, 50], "relat": [8, 10, 23, 26, 39, 41, 45], "extern": 8, "from_dlpack": 8, "t2": 8, "empti": [8, 25, 29, 39], "capsule2": 8, "to_dlpack": 8, "dlmanagedtensor": 8, "stride": [8, 12, 13, 15, 28], "pars": [8, 10], "extract": 8, "data_ptr": 8, "respons": [8, 23, 29, 35], "atendlmtensor": 8, "ndim": 8, "dmlc": 8, "io": 8, "spec": 8, "dldevicetyp": 8, "kdloneapi": 8, "kdlsycl": 8, "reli": [8, 25, 28], "filter": 8, "selector": 8, "actual": [8, 9, 25, 38, 42, 50], "parent": 8, "get_devic": 8, "valid": [8, 10, 11, 50], "three": [8, 35], "host": [8, 26, 
27], "far": [8, 30], "recogn": 8, "probabl": [8, 10, 38], "hard": [8, 25, 38], "monitor": [8, 47], "flow": [8, 38, 45], "readi": 8, "highli": [9, 16, 23, 31, 35, 41, 42], "org": [9, 24, 38, 43], "walk": 9, "come": [9, 41], "flavor": 9, "aot": [9, 11], "cpp_extens": 9, "approach": [9, 35, 38], "latter": 9, "afterward": [9, 39, 41], "besid": [9, 26, 35, 41, 42], "long": [9, 25, 35, 38, 50], "term": [9, 34], "lltm": 9, "dpcppextens": 9, "dpcppbuildextens": 9, "ext_modul": 9, "lltm_xpu": 9, "lltm_xpu_kernel": 9, "cmdclass": 9, "build_ext": 9, "conveni": [9, 12, 13], "correct": [9, 10, 25], "equival": [9, 38, 42, 49], "vanilla": 9, "include_dir": 9, "include_path": 9, "And": [9, 22, 28, 40, 42], "goe": 9, "plug": 9, "previous": [9, 40], "were": [9, 39, 40, 41], "elabor": 9, "fly": 9, "background": [9, 41], "temporari": 9, "tmp": [9, 15, 26, 40], "torch_extens": 9, "ver": 9, "_xpu": 9, "emit": 9, "ninja": 9, "fact": [9, 25, 41], "home": [9, 19, 38, 39, 40], "user_nam": 9, "ones": [9, 23], "complic": [9, 29, 39, 41], "increment": 9, "reload": 9, "18": [9, 29, 39, 40], "compon": [9, 22, 34, 35], "set_source_files_properti": 9, "compile_flag": 9, "add_librari": 9, "torch_librari": 9, "target_include_directori": 9, "public": [9, 42], "python_include_dir": 9, "torch_ipex_include_dir": 9, "prefix": [9, 39], "cmake_prefix_path": 9, "dcmake_c_compil": 9, "dcmake_cxx_compil": 9, "aval": 9, "c10_stream": 9, "associ": [9, 43], "subsequ": [9, 25, 41], "yourself": 9, "strategi": [9, 20, 41], "pybind11": 9, "ultim": 9, "care": [9, 29, 40], "consid": 9, "cuda": [9, 10, 26, 42], "declar": 9, "lltm_xpu_forward": 9, "old_h": 9, "old_cel": 9, "lltm_xpu_backward": 9, "grad_h": 9, "grad_cel": 9, "new_cel": 9, "input_g": 9, "output_g": 9, "candidate_cel": 9, "gate_weight": 9, "check_xpu": 9, "torch_check": 9, "is_xpu": 9, "check_contigu": 9, "is_contigu": [9, 25], "contigu": [9, 25, 35, 41, 42, 45], "check_input": 9, "lltm_forward": 9, "lltm_backward": 9, "pybind11_modul": 9, 
"torch_extension_nam": 9, "bridg": 9, "natur": [9, 25, 50], "templat": [9, 16, 35], "typenam": 9, "scalar_t": 9, "sigmoid": [9, 42, 45], "z": 9, "0f": 9, "exp": [9, 42, 45], "At": [9, 35], "header": 9, "essenti": 9, "d_sigmoid": 9, "d_tanh": 9, "tanh": [9, 42, 45], "elu": [9, 42, 45], "alpha": [9, 48, 49], "fmax": 9, "fmin": 9, "d_elu": 9, "d_relu": 9, "hand": 9, "cat": [9, 12, 13, 16, 39, 40], "gate": 9, "addmm": [9, 12, 13, 42], "transpos": [9, 42, 45], "state_s": 9, "new_h": 9, "zeros_lik": 9, "at_dispatch_floating_typ": 9, "lltm_forward_xpu": 9, "lltm_xpu_forward_kernel": 9, "purpos": [9, 39, 40, 41, 46], "lambda": 9, "As": [9, 15, 23, 28, 35, 38, 39, 40, 41, 48, 49], "instanti": 9, "retriev": [9, 41], "doubl": 9, "at_dispatch_all_typ": 9, "size_t": 9, "1024": [9, 26, 27, 41], "work_group": 9, "cgf": 9, "handler": [9, 17, 26, 40], "cgh": 9, "kfn": 9, "nd_item": 9, "item": [9, 10, 19, 24], "get_group": 9, "get_group_rang": 9, "get_local_id": 9, "gates_row": 9, "parallel_for": 9, "nd_rang": 9, "grid": [9, 20], "fill": 9, "matric": 9, "2048": 9, "8": [9, 17, 20, 29, 39, 40, 41], "introductori": 9, "underlai": 9, "right": [9, 31, 35, 50], "inde": [9, 38], "high": [9, 21, 41, 48, 49, 50], "agnost": 9, "ineffici": 9, "dimension": 9, "much": [9, 22, 25, 41, 49, 50], "pattern": [9, 23, 25, 36, 42, 43, 47], "packedtensoraccessor32": 9, "lltm_xpu_backward_kernel": 9, "d_old_cel": 9, "d_gate": 9, "d_gates_": 9, "d_old_cell_": 9, "d_output_g": 9, "d_tanh_new_cel": 9, "d_new_cel": 9, "d_candidate_cel": 9, "d_input_g": 9, "lltm_backward_xpu": 9, "packed_accessor32": 9, "d_gate_weight": 9, "reshap": 9, "d_weight": 9, "mm": [9, 12, 13], "d_bia": 9, "sum": [9, 10, 24, 25, 42, 45, 48], "keepdim": [9, 10], "d_x": 9, "d_old_h": 9, "d_input": 9, "similar": [10, 22, 26, 27, 38, 41, 42, 46], "reducescatt": 10, "align": [10, 26, 27, 42, 50], "convent": 10, "fullyshardeddataparallel": 10, "trigger": [10, 18, 23, 36, 38, 42, 44], "throw": 10, "argpars": 10, "functool": 10, "lr_schedul": 
10, "steplr": 10, "mp": 10, "distributeddataparallel": [10, 42], "distributedsampl": [10, 19], "fully_sharded_data_parallel": 10, "cpuoffload": 10, "backwardprefetch": 10, "size_based_auto_wrap_polici": 10, "enable_wrap": 10, "localhost": 10, "12355": 10, "cleanup": [10, 38], "destroy_process_group": [10, 38], "toi": 10, "handwritten": 10, "digit": [10, 50], "classif": [10, 38], "net": 10, "conv1": 10, "32": [10, 25, 39, 40, 50], "conv2": [10, 28], "64": [10, 12, 13, 15, 24, 28, 30, 39], "dropout1": 10, "25": [10, 39, 40], "dropout2": 10, "fc1": 10, "9216": 10, "fc2": 10, "relu": [10, 24, 25, 38, 42, 43, 45], "max_pool2d": 10, "flatten": [10, 24, 28], "log_softmax": [10, 13], "logic": [10, 20, 25, 29, 38, 40, 41], "ddp_loss": 10, "nll_loss": [10, 12, 13, 19], "reduct": 10, "all_reduc": 10, "reduceop": 10, "tloss": [10, 19], "6f": 10, "test_load": 10, "pred": [10, 24], "argmax": [10, 24], "max": [10, 30, 38, 42, 44], "eq": [10, 42], "view_a": 10, "test_loss": 10, "averag": [10, 19, 26, 27], "4f": 10, "2f": 10, "fsdp_main": 10, "1307": 10, "3081": 10, "dataset1": 10, "mnist": 10, "dataset2": 10, "sampler1": 10, "num_replica": [10, 19], "sampler2": 10, "train_kwarg": 10, "test_kwarg": 10, "test_batch_s": 10, "xpu_kwarg": 10, "my_auto_wrap_polici": 10, "min_num_param": 10, "init_start_ev": 10, "event": 10, "enable_tim": 10, "init_end_ev": 10, "adadelta": 10, "step_siz": 10, "gamma": 10, "1000": 10, "sec": 10, "save_model": 10, "barrier": [10, 38], "mnist_cnn": 10, "final": [10, 38, 43, 44], "parser": 10, "argumentpars": 10, "add_argu": 10, "metavar": 10, "14": [10, 29, 39, 40], "rate": [10, 19, 50], "action": 10, "store_tru": 10, "random": [10, 19, 20, 38], "parse_arg": 10, "nproc": [10, 39], "join": [10, 41], "snippet": [10, 15, 16, 26, 27, 36], "fsdp_mnist_xpu": 10, "who": [11, 15, 42, 44], "overrid": [11, 22], "defaultvalu": 11, "use_onemkl": [11, 38, 42], "bla": 11, "use_channels_last_1d": 11, "1d": 11, "use_persist_stream": 11, "persist": 11, "use_scratchpad_mod": 
11, "scratchpad": 11, "use_primitive_cach": 11, "use_queue_barri": 11, "submit_barri": 11, "dummi": [11, 40], "use_multi_context": 11, "use_profil": 11, "legaci": [11, 38], "profil": [11, 38, 42], "use_kineto": [11, 26], "kineto": [11, 38, 42], "use_sycl_assert": 11, "assert": [11, 26], "use_itt_annot": 11, "itt": 11, "annot": 11, "use_split_fp64_loop": 11, "fp64": [11, 38, 42], "element": [11, 25, 48, 49], "wise": [11, 36, 42, 48, 49], "use_xetla": 11, "xetla": [11, 16], "build_by_per_kernel": 11, "per_kernel": 11, "use_aot_devlist": [11, 44], "build_internal_debug": 11, "build_separate_op": 11, "build_simple_trac": 11, "build_opt_level": 11, "add": [11, 12, 13, 19, 20, 25, 27, 29, 38, 40, 42, 45, 46, 48, 49, 50], "ox": 11, "accept": 11, "optioncpu": 11, "ipex_fp32_math_mod": 11, "optiongpu": 11, "ipex_verbos": 11, "ipex_xpu_sync_mod": 11, "enforc": 11, "ipex_tile_as_devic": 11, "partit": [11, 19, 41, 45], "map": [11, 25], "composit": 11, "optionexperiment": 11, "ipex_simple_trac": [11, 29], "ipex_ze_trac": [11, 26], "resnet50": [11, 18, 20, 26, 39, 41, 45], "lower": [12, 13, 23, 35, 42, 50], "lighter": [12, 13], "smaller": [12, 13, 42], "sacrif": [12, 13], "trade": [12, 13, 35, 42], "slower": [12, 13, 41], "accur": [12, 13, 38], "primarili": 12, "speedup": [12, 13, 16, 35, 42], "simplenet": [12, 13, 30], "pad": [12, 13, 15, 25, 28, 42], "y": [12, 13, 22, 24, 28, 50], "chosen": [12, 13, 16, 20], "categori": [12, 13], "imag": [12, 13, 25, 38, 41, 45], "float64": [12, 13], "variant": [12, 13], "suppli": [12, 13, 25], "addmm_": [12, 13], "cannot": [12, 13, 25, 38, 42, 48], "stabil": [12, 13], "regardless": [12, 13], "unlist": [12, 13], "downstream": [12, 13], "assum": [12, 13, 31, 40, 41], "believ": [12, 13, 25], "unstabl": [12, 13], "conv1d": [12, 13, 25, 45], "conv3d": [12, 13, 42, 45], "conv_transpose1d": [12, 13], "conv_transpose2d": 12, "conv_transpose3d": [12, 13], "bmm": [12, 13], "baddbmm": [12, 13], "addbmm": [12, 13], "conv_tbc": [12, 13], "group_norm": 12, 
"_native_multi_head_attent": 12, "avg_pool3d": 12, "binary_cross_entropi": [12, 13], "grid_sampl": [12, 13], "polar": 12, "prod": 12, "quantil": 12, "nanquantil": 12, "stft": 12, "cdist": [12, 13], "view_as_complex": 12, "choleski": 12, "cholesky_invers": 12, "cholesky_solv": 12, "invers": 12, "lu_solv": 12, "matrix_rank": 12, "orgqr": 12, "ormqr": 12, "pinvers": 12, "max_unpool2d": 12, "max_unpool3d": 12, "adaptive_avg_pool3d": 12, "reflection_pad1d": 12, "reflection_pad2d": 12, "replication_pad1d": 12, "replication_pad2d": 12, "replication_pad3d": 12, "mse_loss": [12, 13], "cosine_embedding_loss": [12, 13], "nll_loss2d": [12, 13], "hinge_embedding_loss": [12, 13], "poisson_nll_loss": [12, 13], "smooth_l1_loss": [12, 13], "cross_entropy_loss": [12, 13], "l1_loss": [12, 13], "huber_loss": [12, 13], "margin_ranking_loss": [12, 13], "soft_margin_loss": [12, 13], "triplet_margin_loss": [12, 13], "multi_margin_loss": [12, 13], "ctc_loss": 12, "kl_div": [12, 13], "multilabel_margin_loss": [12, 13], "binary_cross_entropy_with_logit": [12, 13], "fft_fft": [12, 13], "fft_ifft": [12, 13], "fft_fft2": [12, 13], "fft_ifft2": [12, 13], "fft_fftn": [12, 13], "fft_ifftn": [12, 13], "fft_rfft": [12, 13], "fft_irfft": [12, 13], "fft_rfft2": [12, 13], "fft_irfft2": [12, 13], "fft_rfftn": [12, 13], "fft_irfftn": [12, 13], "fft_hfft": [12, 13], "fft_ihfft": [12, 13], "linalg_cond": 12, "linalg_matrix_rank": 12, "linalg_solv": 12, "linalg_choleski": 12, "linalg_svdv": 12, "linalg_eigv": 12, "linalg_eigvalsh": 12, "linalg_inv": 12, "linalg_householder_product": 12, "linalg_tensorinv": 12, "linalg_tensorsolv": 12, "fake_quantize_per_tensor_affin": 12, "eig": 12, "geqrf": 12, "lstsq": 12, "_lu_with_info": 12, "qr": 12, "svd": 12, "symeig": 12, "triangular_solv": 12, "fractional_max_pool2d": 12, "fractional_max_pool3d": 12, "adaptive_max_pool3d": 12, "multilabel_margin_loss_forward": 12, "linalg_qr": 12, "linalg_cholesky_ex": 12, "linalg_svd": 12, "linalg_eig": 12, "linalg_eigh": 12, 
"linalg_lstsq": 12, "linalg_inv_ex": 12, "index_copi": 12, "g": [12, 13, 23, 25, 35, 38, 42, 43, 44], "intervent": [12, 13], "mixtur": [12, 13], "_convolut": 13, "prelu": 13, "addmv": 13, "addr": [13, 39], "mv": 13, "chain_matmul": 13, "linalg_multi_dot": 13, "_thnn_fused_gru_cel": 13, "gru_cel": 13, "nll_loss_nd": 13, "reciproc": 13, "pow": [13, 42, 45], "frobenius_norm": 13, "nuclear_norm": 13, "cosine_similar": 13, "pdist": 13, "renorm": 13, "addcdiv": 13, "addcmul": 13, "atan2": 13, "bilinear": 13, "cross": [13, 38, 39, 40, 41], "index_put": 13, "tensordot": 13, "scatter_add": 13, "enable_auto_channels_last": 14, "disable_auto_channels_last": 14, "bring": [14, 22, 23, 24, 35, 38, 39, 41, 43, 50], "oob": [15, 42], "easili": [15, 22], "inevit": 15, "simplifi": 15, "optimum": 15, "impot": 15, "claus": [15, 48, 49], "monkei": 15, "patch": 15, "embedding_bag": 15, "qa": 15, "clear": 15, "ninstanc": [15, 20, 39], "ncore": [15, 39], "28": [15, 20, 24, 39, 40, 41], "run_qa": 15, "model_name_or_path": [15, 36], "dataset_nam": 15, "squad": 15, "do_ev": 15, "per_device_train_batch_s": 15, "12": [15, 20, 29, 39, 40, 46], "learning_r": 15, "3e": 15, "num_train_epoch": 15, "max_seq_length": 15, "384": [15, 40], "doc_strid": 15, "output_dir": [15, 20], "debug_squad": 15, "dummymodul": 15, "input1": 15, "kernel_s": [15, 25], "track_running_stat": 15, "customized_forward": 15, "method1": 15, "method2": 15, "unabl": [15, 38, 43], "hook": 15, "behaviour": 15, "repeat": [15, 25, 26, 50], "traced_model": [15, 22, 24, 38, 45], "special": [16, 35], "empir": 16, "ideal": 16, "xe": [16, 35, 41, 42], "algebra": [16, 35], "compute_eng": 16, "xpucomputeeng": 16, "x1": [16, 28], "20": [16, 25, 29, 38, 39, 40, 42], "x2": [16, 28], "onednn_layout": 16, "highest": 16, "upsampl": [16, 25], "align_corn": 16, "step2": 16, "continu": [16, 29, 38, 40, 42], "step3": 16, "step4": 16, "fall": [16, 18], "back": [16, 18, 25, 38, 50], "averagepool2d": 16, "maxpool2d": [16, 45], "maxpool3d": 16, 
"layernorm": [16, 45], "permutecontigu": 16, "softmax": [16, 42, 45], "greater": [16, 38], "fp16": [16, 31, 35, 42], "upsampleblinear2d": 16, "upsamplenearest": 16, "dnn": [17, 21], "e4m3": 17, "sign": [17, 26, 27, 50], "expon": [17, 50], "mantissa": [17, 50], "e5m2": 17, "FOR": 17, "onlin": 17, "decompress": 17, "delai": 17, "quantizaiton": 17, "showcas": 17, "_fp8_convert": 17, "convert_fp8_model": 17, "optimize_dtyp": 17, "fp8_autocas": 17, "input_id": 17, "token_type_id": 17, "segment_id": 17, "attention_mask": 17, "input_mask": 17, "masked_lm_label": 17, "next_sentence_label": 17, "tri": 18, "failur": [18, 38], "incorrect": [18, 38], "meanwhil": [18, 41], "noqa": [18, 24], "f401": [18, 24], "tensorflow": [19, 25], "kera": 19, "apach": [19, 34, 40], "mxnet": 19, "goal": 19, "mpi": [19, 38, 39], "concept": [19, 25, 41], "broadcast": 19, "hvd": [19, 38], "server": [19, 38, 40, 41], "forth": 19, "devid": 19, "effect": [19, 23, 40, 41, 46, 50], "compens": 19, "distributedoptim": 19, "deleg": [19, 43], "broadcast_paramet": 19, "root_rank": 19, "broadcast_optimizer_st": 19, "consist": [19, 35, 41], "restor": 19, "corrupt": 19, "accomplish": 19, "guard": 19, "named_paramet": 19, "log_interv": 19, "There": [20, 26, 27, 28, 31, 35, 41], "thing": [20, 38, 41], "yaml": 20, "togeth": [20, 28, 41, 44], "max_trial": 20, "trial": 20, "histori": [20, 35], "csv": 20, "hyperparam": 20, "mandatori": 20, "hp": 20, "ncores_per_inst": 20, "all_physical_cor": 20, "ncore_per_inst": 20, "all_logical_cor": 20, "use_all_nod": 20, "num_nod": 20, "use_logical_cor": [20, 40], "is_hyperthreading_en": 20, "disable_numactl": [20, 40], "disable_iomp": [20, 40], "malloc": [20, 39, 41], "tc": 20, "je": 20, "previou": [20, 25, 41], "hyperparamt": 20, "minim": [20, 41, 46], "maxim": 20, "higher_is_bett": 20, "target_v": 20, "inf": 20, "minimum": [20, 25], "suppos": [20, 31, 41], "platinum": [20, 40, 41], "8180m": [20, 41], "socket": [20, 40, 41], "physic": [20, 28, 40, 41], "conf_fil": 20, 
"hypertune_directori": 20, "termin": [20, 38], "15": [20, 29, 39, 40], "339081764221191": 20, "gave": 20, "offlin": 21, "woq": [21, 35], "pre": [21, 35, 44], "q": 21, "langugu": 21, "mm_qkv_int4": 21, "mm_bias_int4": 21, "mm_silu_int4": 21, "mm_resmul_int4": 21, "mm_bias_gelu_int4": 21, "mm_bias_resadd_resadd_int4": 21, "firstli": [21, 30, 35], "present": [21, 40], "6b": [21, 35], "intens": [21, 30], "decid": [22, 28, 35], "satisfi": [22, 37], "tradeoff": 22, "default_static_qconfig": [22, 24, 40, 45], "histogramobserv": 22, "perchannelminmaxobserv": 22, "qint8": 22, "per_channel_symmetr": 22, "ao": 22, "per_tensor_affin": 22, "methond": 22, "obsev": 22, "sete": 22, "skylak": 22, "quant_stat": [22, 24], "user_model": [22, 45], "calibration_data_set": 22, "qparam": 22, "achang": 22, "save_qconf_summari": [22, 24], "qconf_summari": [22, 24], "load_qconf_summari": 22, "quantized_model": [22, 45], "dynamic_qconfig": 22, "default_dynamic_qconfig": [22, 40], "placeholderobserv": 22, "compute_dtyp": 22, "gru": 22, "lstmcell": 22, "rnncell": 22, "grucel": 22, "workflow": 23, "overal": [23, 41], "view": [23, 25, 26, 28, 42, 45, 50], "therefor": [23, 41], "move": [23, 25, 31, 41], "conv_relu": 23, "modelimp": [23, 36], "quantwrapp": [23, 36], "obtain": [23, 36], "calib_dataset": [23, 36], "inference_data": [23, 36], "asymmetr": [23, 42], "zero_point": 23, "swap": [23, 38], "Be": 23, "free": [23, 39], "warmup": [23, 26, 36], "warmup_data": [23, 36], "graph_for": [23, 36, 45], "inference_dta": [23, 36], "whole": [23, 28, 42, 48], "conv_unari": 23, "conv_binari": 23, "linear_unari": 23, "conv_sum_relu": 23, "henc": [23, 42], "consider": 23, "analysi": [23, 41], "bother": 24, "receip": [24, 28], "portion": 24, "beginn": 24, "quickstart_tutori": 24, "training_data": 24, "fashionmnist": 24, "test_data": 24, "train_dataload": 24, "test_dataload": 24, "break": 24, "neuralnetwork": 24, "linear_relu_stack": 24, "sequenti": [24, 25], "logit": 24, "predict": 24, "backpropag": 24, "7f": 
24, "5d": 24, "inc": 24, "accu": 24, "tuned_conf": 24, "represent": 25, "multidimension": 25, "arrai": 25, "nd": 25, "semant": 25, "dens": 25, "spars": [25, 38], "coo": 25, "canon": 25, "assign": [25, 26, 40, 41], "2d": 25, "height": 25, "width": [25, 35], "bmp": 25, "contiguous_format": [25, 41], "close": [25, 39, 41], "difficult": 25, "manipul": 25, "to_dens": 25, "Will": 25, "secret": 25, "ingredi": 25, "cover": [25, 35, 39, 45], "almost": 25, "foundat": [25, 41], "upper": [25, 41], "expens": 25, "sequenc": [25, 26, 35, 38, 50], "benefici": 25, "nb": 25, "me": 25, "roughli": 25, "50": [25, 39, 40], "perf": 25, "mkldnn": 25, "mkldnn_util": 25, "to_mkldnn": 25, "explain": [25, 46, 50], "diagram": [25, 41], "conclus": 25, "But": 25, "neglig": 25, "organ": 25, "question": 25, "reinterpret": 25, "answer": 25, "chw": 25, "hw": [25, 44], "offset": [25, 35], "stride_n": 25, "stride_c": 25, "stride_h": 25, "stride_w": 25, "merit": 25, "express": 25, "noncontigu": 25, "big": 25, "n1": 25, "n2": 25, "mind": [25, 40], "someth": 25, "rfc": 25, "hwc": 25, "wc": 25, "chwn": 25, "hwn": 25, "wn": 25, "outplac": 25, "_appli": 25, "spontan": 25, "tell": [25, 28, 41], "NOT": [25, 39], "compris": 25, "depend": [25, 41, 42], "guidelin": 25, "awar": [25, 28, 39, 40], "my": 25, "recent": 25, "cudnn": 25, "accommod": 25, "hidden": [25, 35], "ideep": 25, "format_tag": 25, "src_md": 25, "desc": 25, "data_typ": 25, "f32": 25, "src_mem": 25, "src_data_ptr": 25, "hwio": 25, "avx512": [25, 40, 42, 46], "3d": 25, "batchnorm1d": 25, "maxpool1d": 25, "div": [25, 42, 45], "nearest": 25, "sycl_devic": 25, "test_input": 25, "test_input_xpu": 25, "to_channels_last_1d": 25, "tenor": 25, "xpu_r": 25, "is_contiguous_channels_last_1d": 25, "input_xpu": 25, "meta": [25, 35], "invalid": [25, 38, 41], "corrspond": 25, "prebuilt": [26, 27, 30, 38, 44], "wheel": [26, 27, 30, 38, 44], "affili": 26, "use_onetrac": 26, "onetrac": 26, "layer": [26, 28, 35, 42], "profileract": 26, "input_tensor": [26, 27], 
"prof": [26, 27], "proper": [26, 27], "output_tensor_1": [26, 27], "nonzero": [26, 27], "output_tensor_2": [26, 27], "uniqu": [26, 27, 29], "tabl": [26, 27, 44], "key_averag": [26, 27], "my_schedul": 26, "skip_first": 26, "trace_handl": 26, "p": 26, "sort_bi": [26, 27], "self_xpu_time_tot": 26, "row_limit": 26, "trace_": 26, "step_num": 26, "outsid": [26, 28], "on_trace_readi": 26, "forget": 26, "record_shap": [26, 27], "rememb": 26, "effort": 26, "contextlib": 26, "profiler_setup": 26, "nullcontext": 26, "should_profil": 26, "profileact": 26, "unset": 26, "involv": [26, 50], "Such": 26, "a_0": 26, "a_1": 26, "b_0": 26, "b_1": 26, "export_chrome_trac": [26, 27], "trace_example_on_multi_devic": 26, "consol": [26, 27, 29], "exclud": [26, 27], "children": [26, 27], "percentag": [26, 27], "propot": [26, 27], "percentasg": [26, 27], "avg": [26, 27], "consumpt": [26, 27], "sonsumpt": [26, 27], "viewer": [26, 27], "perfetto": 26, "ui": 26, "dev": 26, "trace_fil": [26, 27], "examin": [26, 49], "build_profil": 27, "autograd": 27, "profiler_legaci": 27, "use_xpu": 27, "temporarili": 27, "sort": 27, "revers": 27, "coupl": [28, 41], "omp": [28, 38, 39, 40, 41], "ld_preload": [28, 38, 39, 40, 41, 42], "libiomp5": [28, 39, 40, 41], "model_script": 28, "examplenet": 28, "examplenet1": 28, "start_dim": 28, "examplenet2": 28, "y1": 28, "y2": 28, "model1": 28, "traced_model1": 28, "model2": 28, "traced_model2": 28, "multi_stream_model": 28, "datatyp": [28, 42], "receipt": 28, "steam": 28, "input_hint": 28, "output_hint": 28, "async": 28, "wake": 28, "imper": 28, "suffer": 28, "gil": 28, "hurt": 28, "mitig": 28, "omp_num_thread": [28, 38, 39, 40], "phase": [28, 35, 38], "s1": 28, "c1": 28, "numactl": [28, 39, 40], "resourc": [28, 37, 40, 41, 45], "superset": 28, "undefin": [28, 38, 42], "gb": 28, "simultan": 28, "cpu_pool1": 28, "cpu_pool2": 28, "task1": 28, "task2": 28, "y1_futur": 28, "y2_futur": 28, "y_runtim": 28, "kmp_": 28, "fulfil": 28, "bound": [28, 35, 41, 48, 49], "serv": 
28, "sub": [28, 41, 42], "futuretensor": 28, "didn": 28, "dlopen": 28, "symbol": [28, 38, 42], "screen": 29, "bracket": 29, "enable_simple_trac": 29, "disable_simple_trac": 29, "using_simple_trac": 29, "unintention": 29, "exmapl": 29, "262618": 29, "wrapper__empty_strid": 29, "atenipextypexpu": 29, "empty_strid": 29, "wrapper__copy_": 29, "copy_": 29, "wrapper___unique2": 29, "_unique2": 29, "wrapper__clon": 29, "wrapper___reshape_alia": 29, "_reshape_alia": 29, "wrapper_memory_format_empti": 29, "wrapper__as_strid": 29, "as_strid": 29, "wrapper___local_scalar_dens": 29, "_local_scalar_dens": 29, "wrapper__resize_": 29, "resize_": 29, "19": [29, 39, 40], "pid": 29, "tid": 29, "name1": 29, "name2": 29, "arrow": 29, "relationship": 29, "child": 29, "gdb": 29, "inductor": [30, 42], "triton": [30, 42], "codegen": 30, "addition": [30, 40], "facilit": 30, "contribut": [30, 39], "ever": 30, "unlock": 30, "llvm": 30, "forc": 30, "cp310": 30, "manylinux_2_17_x86_64": 30, "manylinux2014_x86_64": 30, "triton_codegen_intel_xpu_backend": 30, "compiled_model": 30, "weight_decai": [30, 48, 49], "loss_funct": 30, "demostr": 31, "cache_en": 31, "bash": [31, 38], "copyright": 34, "notic": [34, 39, 40], "condit": [34, 38], "architectur": [35, 41], "decod": 35, "multiheadattent": 35, "feedforward": 35, "kv_cach": 35, "lot": [35, 38, 42, 43, 44], "smoothquant": 35, "hub": 35, "7b": 35, "hf": 35, "13b": 35, "70b": 35, "eleutherai": 35, "30b": 35, "3b": 35, "bigscienc": 35, "7b1": 35, "quantzat": 35, "codellama": 35, "indirect": 35, "rope": 35, "tpp": 35, "progress": [35, 38], "expand": 35, "brief": 35, "xelta": 35, "rotari": 35, "posit": 35, "squar": [35, 42, 45], "rmsnorm": 35, "beam": 35, "idx": [35, 39], "reorder_cach": 35, "bottleneck": 35, "kept": [35, 50], "buffer": 35, "wast": 35, "prefil": 35, "influenc": [35, 39, 41], "left": [35, 40, 50], "timestamp": 35, "elimin": 35, "sdpa": 35, "shard": [35, 42], "lead": 35, "significantli": 35, "heavier": 35, "becom": [35, 41], 
"bandwidth": 35, "token": 35, "content": [36, 42], "transpar": [36, 41, 42, 43], "undergo": 36, "overview": [36, 42], "automodelforcausallm": 36, "amp_dtyp": 36, "squeez": 37, "tool": [37, 38, 41, 46], "problem": [38, 40, 41, 48, 49], "unsupport": [38, 42], "improp": 38, "unload": 38, "conda": [38, 41, 42], "encount": [38, 43, 44], "ship": 38, "libstdc": 38, "conflict": 38, "_glibcxx_use_cxx11_abi": 38, "_znk5torch8autograd4node4nameb5cxx11ev": [38, 42], "appear": [38, 42], "glibcxx_use_cxx11_abi": 38, "bad": 38, "rn50": [38, 45], "friendli": [38, 41], "ungracefulli": 38, "997": 38, "170": [38, 42, 44], "wsl2": [38, 42], "ram": 38, "killer": 38, "dmesg": 38, "oom": 38, "had": [38, 41], "kill": [38, 40], "max_job": 38, "conserv": 38, "slow": 38, "cl_device_not_found": 38, "tdr": 38, "window": [38, 42], "tdrdelai": 38, "registri": 38, "reboot": 38, "tsan": 38, "compat": [38, 50], "workaround": 38, "omp_tool": 38, "unblock": 38, "soon": 38, "sometim": [38, 39, 41], "ur_l0_in_order_barrier_by_sign": 38, "converg": 38, "24": [38, 39, 40], "hour": 38, "divid": [38, 40, 41, 45], "hang": 38, "1550": 38, "race": 38, "happen": 38, "torch_llm_allreduc": 38, "pcie": 38, "xelink": 38, "usr": [38, 39, 40, 42, 46], "ld": [38, 39, 41, 42], "lmkl_sycl": [38, 42], "lmkl_intel_ilp64": [38, 42], "lmkl_core": [38, 42], "lmkl_tbb_thread": [38, 42], "linker": [38, 42], "exit": [38, 39, 42], "v": [38, 42], "occur": [38, 42], "resolv": [38, 42], "mkl_dpcpp_root": [38, 42], "mkl_lapack_dspevd": 38, "fatal": [38, 42], "libmkl_vml_avx512": 38, "libmkl": [38, 42], "vml": [38, 42], "incorrectli": [38, 42], "oserror": [38, 42], "wrong": [38, 42], "libmkl_intel_ilp64": [38, 42], "suffix": [38, 42], "test_weight_norm": 38, "testnnmethod": 38, "test_weight_norm_differnt_typ": 38, "a770": [38, 42], "graphic": [38, 41, 42, 44], "test_foreach": 38, "testtorchmethod": 38, "test_foreach_co": 38, "test_foreach_sin": 38, "test_polar": 38, "test_polar_float": 38, "test_special_op": 38, 
"test_special_spherical_bessel_j0": 38, "test_transducer_loss": 38, "test_vallina_transducer_loss": 38, "pypi": 38, "remark": [38, 41], "intel_pytorch_extens": 38, "112": [38, 41], "poor": 38, "xlm": 38, "roberta": 38, "casual": 38, "gpt2": 38, "summar": 38, "t5": 38, "allenai": 38, "longform": 38, "409": 38, "_c": [38, 46], "_jit_set_texpr_fuser_en": 38, "csrc": [38, 46], "tensorexpr_fus": 38, "settensorexprfuseren": 38, "integr": [38, 41, 42, 44], "runtimeerror": 38, "overflow": 38, "unpack": 38, "min": [38, 42], "exce": [38, 41], "quantize_per_tensor": 38, "pseudocod": 38, "omp_num_threa": 38, "prototyp": 38, "set_num_thread": 38, "freezed_model": 38, "run_benchmark": 38, "embeddingbag": 38, "bag": 38, "abnorm": 38, "avx2": [38, 46], "batchnorm": [38, 45], "rnnt": 38, "joint_net": 38, "caller": 38, "yet": 38, "pend": 38, "merg": [38, 42], "factor": 39, "properli": 39, "themselv": 39, "common": [39, 41, 50], "mainli": 39, "dir": [39, 46], "choic": [39, 50], "taskset": 39, "malloc_conf": [39, 41], "crash": [39, 41], "nnode": 39, "count": 39, "ip": 39, "hostnam": 39, "proc": [39, 41], "hostfil": 39, "mpiexec": 39, "hydra": 39, "np": 39, "ppn": 39, "genv": 39, "i_mpi_pin_domain": 39, "codeless": 39, "ut": 39, "mutual": 39, "favorit": 39, "kmp": [39, 41], "compact": [39, 40, 41], "stdout": 39, "undesir": 39, "_timestamp_inst": 39, "_timestamp_instance_": 39, "_core": 39, "run_20210712212258_inst": 39, "run_20210712212258_instance_0_cores_0": 39, "43": [39, 40], "gif": 39, "07": 39, "21": [39, 40], "22": [39, 40], "58": 39, "764": 39, "conda_prefix": [39, 40], "virtual_env": [39, 40], "lib64": [39, 40], "drop": [39, 40], "44": [39, 40], "kmp_affin": [39, 40, 41], "kmp_blocktim": [39, 40, 41], "23": [39, 40, 50], "26": [39, 40], "27": [39, 40, 41], "29": [39, 40], "30": [39, 40], "31": [39, 40], "33": [39, 40, 46], "34": [39, 40], "35": [39, 40], "36": [39, 40], "37": [39, 40], "38": [39, 40], "39": [39, 40], "40": [39, 40], "41": [39, 40], "42": [39, 40], "tee": 39, 
"run_20210712223308_inst": 39, "run_20210712223308_instance_0_cores_0": 39, "87": 39, "08": 39, "117": 39, "88": 39, "118": 39, "45": [39, 40], "46": [39, 40], "47": [39, 40], "48": [39, 40], "49": [39, 40], "51": [39, 40], "52": [39, 40], "53": [39, 40], "54": [39, 40], "55": [39, 40, 41], "56": [39, 40, 41], "57": 39, "59": 39, "60": 39, "61": 39, "62": 39, "63": 39, "65": 39, "66": [39, 46], "67": 39, "68": 39, "69": 39, "70": 39, "71": 39, "72": 39, "73": 39, "74": 39, "75": 39, "76": 39, "77": 39, "78": 39, "79": 39, "81": 39, "82": 39, "83": [39, 41], "84": [39, 41], "85": 39, "86": 39, "run_20210712214504_inst": 39, "run_20210712214504_instance_0_cores_22": 39, "04": [39, 42], "513": 39, "run_20210712220928_inst": 39, "run_20210712220928_instance_0_cores_0": 39, "09": 39, "355": 39, "356": 39, "deduct": 39, "run_20210712221615_inst": 39, "run_20210712221615_instance_0_cores_11": 39, "591": 39, "run_20210712221150_inst": 39, "run_20210712221150_instance_0_cores_0": 39, "run_20210712221150_instance_1_cores_22": 39, "233": 39, "236": 39, "run_20210712221415_inst": 39, "run_20210712221415_instance_0_cores_0": 39, "run_20210712221415_instance_1_cores_4": 39, "run_20210712221415_instance_2_cores_8": 39, "run_20210712221415_instance_3_cores_12": 39, "run_20210712221415_instance_4_cores_16": 39, "run_20210712221415_instance_5_cores_20": 39, "run_20210712221415_instance_6_cores_24": 39, "run_20210712221415_instance_7_cores_28": 39, "run_20210712221415_instance_8_cores_32": 39, "run_20210712221415_instance_9_cores_36": 39, "run_20210712221415_instance_10_cores_40": 39, "140": 39, "143": 39, "146": 39, "149": 39, "151": 39, "154": 39, "157": 39, "159": 39, "162": 39, "164": 39, "167": 39, "run_20210712221305_inst": 39, "run_20210712221305_instance_0_cores_0": 39, "run_20210712221305_instance_1_cores_11": 39, "run_20210712221305_instance_2_cores_22": 39, "run_20210712221305_instance_3_cores_33": 39, "470": 39, "471": 39, "473": 39, "476": 39, "479": 39, "instance_idx": 
39, "independ": 39, "confirm": 39, "06": [39, 40], "175": 39, "176": 39, "177": 39, "run_20220106130151_instance_0_cores_0": 39, "235": 39, "jemallocl": 39, "oversize_threshold": [39, 41], "background_thread": [39, 41], "metadata_thp": [39, 41], "dirty_decay_m": [39, 41], "9000000000": [39, 41], "muzzy_decay_m": [39, 41], "libjemalloc": 39, "run_20210713153048_instance_0_cores_0": 39, "654": 39, "libtcmalloc": [39, 40], "655": 39, "run_20210713153333_instance_0_cores_0": 39, "784": 39, "run_20210713153659_instance_0_cores_0": 39, "blocktim": [39, 41], "00": 39, "760": [39, 40], "761": [39, 40], "omp_schedul": [39, 41], "omp_proc_bind": [39, 41], "run_20210713152500_instance_0_cores_0": 39, "give": 40, "ipex_en": 40, "procedur": 40, "tunin": 40, "dramat": [40, 41], "cpu_launcher_en": 40, "cpu_launcher_arg": 40, "hyperthread": 40, "ital": 40, "ptmalloc": 40, "use_default_alloc": 40, "tcmalloc": 40, "enable_tcmalloc": 40, "jemalloc": 40, "enable_jemalloc": 40, "gnu": [40, 46], "nth": [40, 41], "uniform": 40, "tunabl": 40, "signficantli": 40, "8180": 40, "affinit": 40, "unutil": 40, "restart": 40, "remain": [40, 49], "aliv": 40, "taken": 40, "worri": 40, "interrupt": 40, "dummy_tensor": 40, "check_trac": [40, 45], "bert_int8_jit": 40, "pretrain": [40, 45], "n_iter": 40, "rn50_int8_jit": 40, "usus": 40, "rn50_ipex_int8": 40, "image_classifi": 40, "similarli": 40, "bert_ipex_int8": 40, "transformer_handler_gener": 40, "setup_config": 40, "seq_classification_artifact": 40, "index_to_nam": 40, "nc": 40, "model_stor": 40, "rest": 40, "model_log": 40, "096": 40, "8375c": 40, "02": 40, "03": 40, "981": 40, "982": 40, "cases": 40, "ab": [40, 42, 45], "223": 40, "site": 40, "model_service_work": 40, "sock": 40, "unix": 40, "9000": 40, "762": 40, "763": 40, "9001": 40, "274": 40, "9002": 40, "975": 40, "9003": 40, "bench": 40, "amazon": 40, "ec2": 40, "m6i": 40, "24xlarg": 40, "reproduc": 40, "modelurl": 40, "inputpath": 40, "concurr": [40, 41], "huggingface_transform": 40, 
"sample_text_captum_input": 40, "articl": 41, "briefli": 41, "understand": [41, 47, 50], "knowledg": 41, "c620": 41, "chipset": 41, "purlei": 41, "chip": 41, "inclus": 41, "l2": 41, "2666": 41, "mhz": 41, "ddr4": 41, "six": 41, "ultra": 41, "interconnect": 41, "upi": 41, "microarchitectur": 41, "connect": 41, "transfer": 41, "equip": 41, "motherboard": 41, "attach": 41, "remot": 41, "asu": 41, "z11pa": 41, "d8": 41, "competit": [41, 42], "stall": 41, "busi": 41, "visit": 41, "uma": 41, "lscpu": 41, "onboard": [41, 48], "hyper": 41, "111": 41, "50ghz": 41, "node0": 41, "node1": 41, "sophist": 41, "brought": 41, "polici": [41, 42], "later": 41, "sysctl": 41, "balanc": 41, "great": 41, "placement": 41, "idea": [41, 50], "cpunodebind": 41, "membind": 41, "wikipedia": [41, 45], "multithread": 41, "primari": [41, 42], "consecut": 41, "fork": [41, 46], "libgomp": 41, "libiomp": 41, "commonli": 41, "gomp": 41, "comma": [41, 44], "gomp_cpu_affin": 41, "thrash": 41, "did": 41, "compet": 41, "proclist": 41, "sleep": 41, "200m": 41, "appropri": [41, 43], "sole": 41, "penal": 41, "role": 41, "unnecessari": 41, "destruct": 41, "emphas": 41, "mmuzzy_decay_m": 41, "straight": [41, 45], "forg": 41, "even": 41, "dealloc": [41, 43], "costli": 41, "gpertool": 41, "plu": 41, "pretti": 41, "nifti": 41, "gperftool": 41, "solv": [41, 48, 49], "set_flush_denorm": 41, "warm": 41, "threshold": 41, "usuali": 41, "maskrcnn": 41, "wav2vec2": 41, "recognit": 41, "onednn_primitive_cache_capac": 41, "65536": 41, "voic": 41, "4096": 41, "date": 42, "hbm": 42, "kv": 42, "quit": 42, "reach": 42, "verif": 42, "vehicl": 42, "emul": 42, "fsdp": 42, "publicli": 42, "oct": 42, "focus": [42, 45], "broader": 42, "webpag": 42, "ubuntu": 42, "v1": 42, "unaryop": 42, "sqrt": [42, 45, 48], "round": [42, 45, 50], "log_sigmoid": 42, "hardswish": [42, 45], "hardsigmoid": 42, "silu": [42, 45], "hardtanh": [42, 45], "leaky_relu": [42, 45], "binaryop": 42, "mul": [42, 45], "ne": 42, "ge": 42, "le": 42, "gelu": [42, 
45], "mish": [42, 45], "concret": 42, "adamw": [42, 49], "permut": 42, "dequant": [42, 45], "pixelshuffl": 42, "leaki": [42, 45], "softplu": 42, "critic": 42, "xxx": 42, "glibcxx": 42, "cxx11": 42, "gcc": [42, 46], "path_to_your_onemkl": 42, "__release_lnx": 42, "lapack": 42, "dspevd": 42, "lp64": 42, "libmkl_sequenti": 42, "adapt": 43, "frequent": 43, "websit": 43, "splitsgd": [43, 50], "lifecycl": [43, 44], "beforehand": [43, 44], "benifit": [43, 44], "qualiti": [43, 44], "deliveri": [43, 44], "disadvantag": [43, 44, 50], "500mb": [43, 44], "5gb": [43, 44], "attempt": 43, "smallest": 43, "delimit": 44, "ats": 44, "m150": 44, "pvc": 44, "seper": 44, "opencl": 44, "spir64_gen": 44, "dag": 45, "acycl": 45, "constant": 45, "__dict__": 45, "front": 45, "propag": [45, 50], "convrelu": 45, "convsumrelu": 45, "mymodel": 45, "construct": 45, "convtranspose3d": 45, "clamp": 45, "___": 45, "_____": 45, "owner": 45, "otheriws": 45, "compuat": 45, "avx512_vnni": 46, "avx512_bf16": 46, "avx2_vnni": 46, "impli": 46, "findavx": 46, "aten_cpu_cap": 46, "_get_current_isa_level": 46, "addtion": 46, "subfold": 46, "rh": 46, "toolset": 46, "cmakefil": 46, "cpu_featur": 46, "cpu_feature_main": 46, "xcr0": 46, "00000000000602e7": 46, "mmx": 46, "sse": 46, "sse2": 46, "sse3": 46, "ssse3": 46, "sse4_1": 46, "sse4_2": 46, "aes_ni": 46, "sha": 46, "xsave": 46, "fma": 46, "f16c": 46, "avx_vnni": 46, "avx512_f": 46, "avx512_cd": 46, "avx512_pf": 46, "avx512_er": 46, "avx512_vl": 46, "avx512_bw": 46, "avx512_dq": 46, "avx512_ifma": 46, "avx512_vbmi": 46, "avx512_vpopcntdq": 46, "avx512_4fmap": 46, "avx512_4vnniw": 46, "avx512_vbmi2": 46, "avx512_vpclmul": 46, "avx512_bitalg": 46, "avx512_fp16": 46, "avx512_vp2intersect": 46, "amx_bf16": 46, "amx_til": 46, "amx_int8": 46, "prefetchw": 46, "prefetchwt1": 46, "lamb": [48, 49, 50], "adagrad": [48, 50], "grad": [48, 49], "clr": 48, "lr_decai": 48, "state_sum": 48, "addcmul_": 48, "add_": [48, 49], "addcdiv_": 48, "bottl": [48, 49], "neck": [48, 
49], "pseudo": [48, 49, 50], "adagrad_fused_step": 48, "grad0": 48, "grad1": 48, "grad_n": 48, "param_n": 48, "state_sum_n": 48, "adagrad_step": 48, "grad_i": 48, "param_i": 48, "state_sum_i": 48, "other_arg": 48, "adam": 49, "lar": 49, "buf": 49, "momentum_buffer_list": 49, "detach": 49, "mul_": 49, "dampen": 49, "nesterov": 49, "sgd_fused_step": 49, "bottom": 50, "shorter": 50, "fewer": 50, "shift": 50, "lose": 50, "decim": 50, "1234500000": 50, "0000012345": 50, "1234512345": 50, "sens": 50, "fraction": 50, "12345": 50, "00000": 50, "signific": 50, "bui": 50, "ground": 50, "truth": 50, "chain": 50, "rule": 50, "formula": 50, "\u03b1": 50, "gw": 50, "denot": 50, "earlier": 50, "inaccur": 50, "halv": 50, "recov": 50, "fp32_w": 50, "concat_fp32_from_bf16": 50, "bf16_w": 50, "trail": 50, "fp32_gw": 50, "bf16_gw": 50, "weight_dacai": 50, "split_bf16_from_fp32": 50}, "objects": {"": [[1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4BF32E", "xpu::BF32"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4FP32E", "xpu::FP32"], [1, 1, 1, "_CPPv4N3xpu14FP32_MATH_MODEE", "xpu::FP32_MATH_MODE"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4BF32E", "xpu::FP32_MATH_MODE::BF32"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4FP32E", "xpu::FP32_MATH_MODE::FP32"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MAXE", "xpu::FP32_MATH_MODE::FP32_MATH_MODE_MAX"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4TF32E", "xpu::FP32_MATH_MODE::TF32"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MAXE", "xpu::FP32_MATH_MODE_MAX"], [1, 0, 1, "_CPPv4N3xpu14FP32_MATH_MODE4TF32E", "xpu::TF32"], [1, 2, 1, "_CPPv4N3xpu21get_queue_from_streamEN3c106StreamE", "xpu::get_queue_from_stream"], [1, 3, 1, "_CPPv4N3xpu21get_queue_from_streamEN3c106StreamE", "xpu::get_queue_from_stream::stream"], [1, 2, 1, "_CPPv4N3xpu18set_fp32_math_modeE14FP32_MATH_MODE", "xpu::set_fp32_math_mode"], [1, 3, 1, "_CPPv4N3xpu18set_fp32_math_modeE14FP32_MATH_MODE", "xpu::set_fp32_math_mode::mode"]], "intel_extension_for_pytorch.cpu": [[1, 4, 0, 
"-", "runtime"]], "intel_extension_for_pytorch.cpu.runtime": [[1, 5, 1, "", "CPUPool"], [1, 5, 1, "", "MultiStreamModule"], [1, 5, 1, "", "MultiStreamModuleHint"], [1, 5, 1, "", "Task"], [1, 6, 1, "", "get_core_list_of_node_id"], [1, 6, 1, "", "is_runtime_ext_enabled"], [1, 5, 1, "", "pin"]], "intel_extension_for_pytorch": [[1, 6, 1, "", "enable_onednn_fusion"], [1, 6, 1, "", "get_fp32_math_mode"], [1, 6, 1, "", "optimize"], [1, 6, 1, "", "optimize_transformers"], [1, 4, 0, "-", "quantization"], [1, 6, 1, "", "set_fp32_math_mode"], [1, 5, 1, "", "verbose"]], "intel_extension_for_pytorch.nn": [[6, 5, 1, "", "FrozenBatchNorm2d"]], "intel_extension_for_pytorch.nn.functional": [[6, 6, 1, "", "interaction"]], "intel_extension_for_pytorch.quantization": [[1, 6, 1, "", "_gptq"], [1, 6, 1, "", "autotune"], [1, 6, 1, "", "convert"], [1, 6, 1, "", "prepare"]], "intel_extension_for_pytorch.xpu": [[1, 5, 1, "", "Event"], [1, 5, 1, "", "Stream"], [1, 6, 1, "", "current_device"], [1, 6, 1, "", "current_stream"], [1, 5, 1, "", "device"], [1, 6, 1, "", "device_count"], [1, 5, 1, "", "device_of"], [1, 6, 1, "", "empty_cache"], [1, 6, 1, "", "get_device_name"], [1, 6, 1, "", "get_device_properties"], [1, 6, 1, "", "get_rng_state"], [1, 6, 1, "", "get_rng_state_all"], [1, 6, 1, "", "init"], [1, 6, 1, "", "initial_seed"], [1, 6, 1, "", "is_available"], [1, 6, 1, "", "is_initialized"], [1, 6, 1, "", "manual_seed"], [1, 6, 1, "", "manual_seed_all"], [1, 6, 1, "", "max_memory_allocated"], [1, 6, 1, "", "max_memory_reserved"], [1, 6, 1, "", "memory_allocated"], [1, 6, 1, "", "memory_reserved"], [1, 6, 1, "", "memory_snapshot"], [1, 6, 1, "", "memory_stats"], [1, 6, 1, "", "memory_stats_as_nested_dict"], [1, 6, 1, "", "memory_summary"], [1, 6, 1, "", "reset_accumulated_memory_stats"], [1, 6, 1, "", "reset_peak_memory_stats"], [1, 6, 1, "", "seed"], [1, 6, 1, "", "seed_all"], [1, 6, 1, "", "set_device"], [1, 6, 1, "", "set_rng_state"], [1, 6, 1, "", "set_rng_state_all"], [1, 6, 1, "", 
"stream"], [1, 6, 1, "", "synchronize"]], "intel_extension_for_pytorch.xpu.Event": [[1, 7, 1, "", "elapsed_time"], [1, 7, 1, "", "query"], [1, 7, 1, "", "record"], [1, 7, 1, "", "synchronize"], [1, 7, 1, "", "wait"]], "intel_extension_for_pytorch.xpu.Stream": [[1, 7, 1, "", "record_event"], [1, 8, 1, "", "sycl_queue"], [1, 7, 1, "", "synchronize"], [1, 7, 1, "", "wait_event"], [1, 7, 1, "", "wait_stream"]], "intel_extension_for_pytorch.xpu.fp8.fp8": [[1, 6, 1, "", "fp8_autocast"]]}, "objtypes": {"0": "cpp:enumerator", "1": "cpp:enum", "2": "cpp:function", "3": "cpp:functionParam", "4": "py:module", "5": "py:class", "6": "py:function", "7": "py:method", "8": "py:property"}, "objnames": {"0": ["cpp", "enumerator", "C++ enumerator"], "1": ["cpp", "enum", "C++ enum"], "2": ["cpp", "function", "C++ function"], "3": ["cpp", "functionParam", "C++ function parameter"], "4": ["py", "module", "Python module"], "5": ["py", "class", "Python class"], "6": ["py", "function", "Python function"], "7": ["py", "method", "Python method"], "8": ["py", "property", "Python property"]}, "titleterms": {"intel": [0, 4, 5, 7, 22, 23, 39, 40, 41], "extens": [0, 4, 6, 7, 9, 22, 23, 28, 38, 40], "pytorch": [0, 4, 7, 19, 22, 23, 25, 40], "architectur": 0, "support": [0, 6, 12, 13, 15, 17, 21, 25, 26], "api": [1, 6, 7, 14, 24, 25, 33, 36, 45], "document": [1, 4, 33, 40, 41], "devic": [1, 6, 26], "agnost": [1, 6], "gpu": [1, 6, 7, 10, 13, 17, 21, 23, 30, 38, 42, 43, 49], "specif": [1, 6, 12, 13, 38], "miscellan": 1, "random": 1, "number": [1, 39, 41], "gener": [1, 38], "stream": [1, 9], "event": 1, "memori": [1, 25, 39, 41, 43, 47], "manag": [1, 43, 47], "c": [1, 5, 25], "cpu": [1, 6, 12, 22, 24, 25, 38, 41, 43, 46, 48], "quantiz": [1, 6, 17, 21, 22, 23, 36], "runtim": [1, 6, 7, 11, 28, 38], "blog": 2, "public": 2, "cheat": 3, "sheet": 3, "contribut": 4, "develop": 4, "xpu": [4, 5, 25, 26, 42], "tip": 4, "debug": [4, 6, 16], "unit": [4, 38], "test": [4, 38], "better": 4, "local": 4, "pytest": 4, 
"write": [4, 9, 25], "build": [4, 9, 11, 26, 27, 46], "exampl": [5, 7, 8, 9, 10, 15, 17, 18, 20, 21, 24, 28, 39, 46], "python": [5, 6], "train": [5, 6, 12, 13, 30, 38], "singl": [5, 7, 39], "instanc": [5, 39], "float32": [5, 12, 13, 38], "bfloat16": [5, 12, 13, 38, 50], "infer": [5, 12, 13, 21, 35, 36, 39, 40], "imper": [5, 12, 13, 23, 36], "mode": [5, 17, 21, 23, 36, 39], "resnet50": [5, 40], "bert": [5, 40], "torchscript": [5, 12, 13, 23, 36], "float16": [5, 13], "int8": [5, 24, 38, 40, 45], "torch": [5, 30], "optim": [5, 6, 15, 22, 23, 35, 36, 43, 45, 48, 49], "basic": [5, 28], "usag": [5, 7, 10, 15, 17, 18, 19, 20, 21, 24, 28, 36, 38, 39], "us": [5, 6, 8, 9, 12, 13, 14, 15, 16, 26, 27, 28, 29, 39, 44, 45], "sycl": [5, 9], "code": 5, "custom": 5, "dpc": [5, 6, 9], "kernel": [5, 25], "ai": 5, "refer": [5, 12, 13], "model": [5, 22, 25, 26, 27, 28, 29, 35, 40, 45], "featur": [6, 16, 18, 46], "easi": 6, "channel": [6, 14, 25, 41], "last": [6, 14, 25, 41], "auto": [6, 12, 13, 14, 28], "mix": [6, 12, 13], "precis": [6, 12, 13, 35], "amp": [6, 12, 13], "distribut": [6, 35, 36], "dlpack": [6, 8], "solut": [6, 8], "advanc": [6, 11], "configur": [6, 11, 28, 41], "fulli": [6, 10], "shard": [6, 10], "data": [6, 8, 10, 17, 21, 35], "parallel": [6, 10], "fsdp": [6, 10], "inductor": 6, "legaci": [6, 27], "profil": [6, 26, 27], "tool": [6, 21, 26, 27, 29], "experiment": [6, 15, 16, 17, 18, 19, 20, 21, 24, 26, 27, 29, 30], "simpl": [6, 29], "trace": [6, 15, 26, 27, 29], "kineto": [6, 26], "comput": [6, 16], "engin": [6, 16], "oper": [6, 16, 17, 21, 25, 35, 48, 49], "codeless": [6, 15], "new": 6, "1": [6, 20, 40, 42], "13": [6, 42], "graph": [6, 18, 43, 45], "captur": [6, 18], "0": [6, 42], "hypertun": [6, 20], "distributeddataparallel": 7, "ddp": 7, "introduct": [7, 8, 9, 10, 12, 13, 16, 26, 27, 29, 30, 33, 44, 48, 49], "instal": [7, 19, 32, 40], "oneccl": 7, "bind": [7, 28], "from": 7, "sourc": 7, "prebuilt": 7, "wheel": 7, "dynam": [7, 22, 38, 43, 46], "link": 7, "mpi": 7, 
"launch": [7, 15, 39], "node": [7, 39], "scale": [7, 40], "onli": [7, 10, 21, 36], "case": [8, 12, 13, 15, 16, 26, 27, 28, 29, 44], "design": [8, 28, 39], "import": 8, "capsul": 8, "export": [8, 26, 27, 40], "dldevic": 8, "pointer": 8, "asynchron": [8, 28], "program": 8, "motiv": [9, 15], "setuptool": 9, "jit": [9, 15], "compil": [9, 30, 43, 44, 46], "cmake": 9, "request": 9, "current": 9, "c10": 9, "fetch": 9, "correspond": 9, "queue": 9, "op": [9, 12, 13], "accessor": 9, "time": [11, 43, 44], "default": [12, 13, 14, 20, 25, 39], "path": [12, 13], "autocast": [12, 13], "elig": [12, 13], "behavior": [12, 13], "can": [12, 13], "promot": [12, 13], "widest": [12, 13], "input": [12, 13, 28], "type": [12, 13, 17, 21, 35], "eas": [14, 45], "enabl": [14, 29], "disabl": [14, 26, 27, 29], "known": [14, 28, 42], "issu": [14, 28, 38, 42], "huggingfac": 15, "The": 15, "origin": 15, "command": 15, "ipex": 15, "appli": 15, "fp32": [15, 45], "bf16": [15, 45], "modul": [15, 28], "forward": 15, "method": 15, "explicitli": 15, "instead": 15, "__call__": 15, "attr": 15, "alreadi": 15, "select": [16, 46], "polici": [16, 35], "multipl": [16, 39], "implement": [16, 28], "float8": 17, "fp8": 17, "run": [17, 21], "descript": 18, "horovod": 19, "your_conf_fil": 20, "hyperparamet": 20, "launcher": [20, 40], "defin": [20, 22], "search": 20, "space": 20, "tune": [20, 24, 37, 41], "2": [20, 40, 42], "user": 20, "your_python_script": 20, "int4": 21, "weight": [21, 36], "static": 22, "qconfig": 22, "prepar": 22, "do": 22, "calibr": 22, "convert": 22, "deploi": [22, 40], "recip": [24, 28], "what": 25, "i": [25, 28, 39], "format": 25, "all": [25, 39], "That": 25, "matter": 25, "nchw": 25, "b": 25, "nhwc": 25, "block": 25, "nchw16c": 25, "stride": 25, "layout": 25, "tensor": 25, "creation": 25, "convers": 25, "d": 25, "coverag": 25, "regist": [25, 40], "aten": 25, "nativ": 25, "manner": 25, "onednn": [25, 41], "creat": [25, 40], "convolut": 25, "primit": [25, 41], "1d": 25, "determin": 25, "set": 
[26, 28], "environ": 26, "variabl": 26, "add": 26, "Into": 26, "script": [26, 27, 39], "partli": 26, "backend": 26, "multi": [26, 40], "applic": 26, "result": [26, 27, 29, 38], "chrome": [26, 27], "requir": [28, 30, 44, 46], "multistream": 28, "examples1": 28, "examples2": 28, "examples3": 28, "structur": [28, 41], "output": 28, "perform": [28, 37, 40, 41], "task": 28, "core": [28, 39, 40], "detail": [28, 43], "how": 28, "iomp": 28, "preload": 28, "load": 28, "dure": 28, "depend": [30, 38], "inferenec": 30, "quick": 31, "start": [31, 33, 40], "execut": 31, "get": 33, "licens": 34, "larg": 35, "languag": 35, "llm": 35, "overview": [35, 39, 41, 46], "methodologi": [35, 45], "linear": 35, "deep": 35, "fusion": [35, 45, 48, 49], "segment": 35, "kv": 35, "cach": [35, 41], "low": 35, "transform": 36, "frontend": 36, "pseudocod": 36, "common": 36, "scenario": 36, "fp16": 36, "smoothquant": 36, "woq": 36, "deepspe": 36, "guid": [37, 39, 41], "troubleshoot": 38, "librari": [38, 39], "torchdynamo": 38, "shape": 38, "correct": 38, "physic": 39, "ii": 39, "includ": 39, "logic": 39, "iii": 39, "iv": 39, "your": 39, "v": 39, "throughput": 39, "vi": 39, "latenc": 39, "vii": 39, "viii": 39, "index": 39, "jemalloc": [39, 41], "tcmalloc": [39, 41], "alloc": [39, 41], "openmp": [39, 41], "gnu": [39, 41], "torchserv": 40, "content": [40, 41], "thi": [40, 41], "serv": 40, "pin": 40, "boost": 40, "worker": 40, "serial": 40, "file": 40, "archiv": 40, "3": 40, "4": 40, "benchmark": 40, "hardwar": 41, "non": 41, "uniform": 41, "access": 41, "numa": 41, "softwar": 41, "numactl": 41, "omp_num_thread": 41, "denorm": 41, "releas": 42, "10": 42, "highlight": 42, "110": 42, "120": 42, "200": 42, "technic": 43, "isa": [43, 46], "dispatch": [43, 46], "ahead": [43, 44], "aot": [43, 44], "pattern": 45, "fold": 45, "level": 46, "check": 46, "split": 50, "sgd": 50, "stochast": 50, "gradient": 50, "descent": 50}, "envversion": {"sphinx.domains.c": 3, "sphinx.domains.changeset": 1, 
"sphinx.domains.citation": 1, "sphinx.domains.cpp": 9, "sphinx.domains.index": 1, "sphinx.domains.javascript": 3, "sphinx.domains.math": 2, "sphinx.domains.python": 4, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx": 58}, "alltitles": {"Intel\u00ae Extension for PyTorch*": [[0, "intel-extension-for-pytorch"]], "Architecture": [[0, "architecture"]], "Support": [[0, "support"]], "API Documentation": [[1, "api-documentation"], [33, "api-documentation"]], "Device-Agnostic": [[1, "device-agnostic"], [6, "device-agnostic"]], "GPU-Specific": [[1, "gpu-specific"], [6, "gpu-specific"]], "Miscellaneous": [[1, "miscellaneous"], [1, "id1"]], "Random Number Generator": [[1, "random-number-generator"]], "Streams and events": [[1, "streams-and-events"]], "Memory management": [[1, "memory-management"]], "C++ API": [[1, "c-api"]], "CPU-Specific": [[1, "cpu-specific"], [6, "cpu-specific"]], "Quantization": [[1, "module-intel_extension_for_pytorch.quantization"], [6, "quantization"]], "CPU Runtime": [[1, "module-intel_extension_for_pytorch.cpu.runtime"]], "Blogs & Publications": [[2, "blogs-publications"]], "Cheat Sheet": [[3, "cheat-sheet"]], "Contribution": [[4, "contribution"]], "Contributing to Intel\u00ae Extension for PyTorch*": [[4, "contributing-to-intel-extension-for-pytorch"]], "Developing Intel\u00ae Extension for PyTorch* on XPU": [[4, "developing-intel-extension-for-pytorch-on-xpu"]], "Tips and Debugging": [[4, "tips-and-debugging"]], "Unit testing": [[4, "unit-testing"]], "Better local unit tests with pytest": [[4, "better-local-unit-tests-with-pytest"]], "Writing documentation": [[4, "writing-documentation"]], "Building documentation": [[4, "building-documentation"]], "Tips": [[4, "tips"]], "Examples": [[5, "examples"]], "Python": [[5, "python"]], "Training": [[5, "training"]], "Single-Instance Training": [[5, "single-instance-training"]], "Float32": [[5, "float32"], [5, "id1"]], "BFloat16": [[5, "bfloat16"], [5, "id4"], [38, "bfloat16"], [50, "bfloat16"]], 
"Inference": [[5, "inference"]], "Imperative Mode": [[5, "imperative-mode"], [5, "id5"], [5, "id11"], [23, "imperative-mode"]], "Resnet50": [[5, "resnet50"], [5, "id2"], [5, "id6"], [5, "id9"], [5, "id12"], [5, "id15"]], "BERT": [[5, "bert"], [5, "id3"], [5, "id7"], [5, "id10"], [5, "id13"], [5, "id16"], [40, "bert"]], "TorchScript Mode": [[5, "torchscript-mode"], [5, "id8"], [5, "id14"], [23, "torchscript-mode"], [36, "torchscript-mode"]], "Float16": [[5, "float16"]], "INT8": [[5, "int8"], [38, "int8"]], "torch.xpu.optimize": [[5, "torch-xpu-optimize"]], "C++": [[5, "c"]], "Basic Usage": [[5, "basic-usage"]], "Use SYCL code": [[5, "use-sycl-code"]], "Customize DPC++ kernels": [[5, "customize-dpc-kernels"]], "Intel\u00ae AI Reference Models": [[5, "intel-ai-reference-models"]], "Features": [[6, "features"]], "Easy-to-use Python API": [[6, "easy-to-use-python-api"]], "Channels Last": [[6, "channels-last"], [25, "channels-last"], [41, "channels-last"]], "Auto Mixed Precision (AMP)": [[6, "auto-mixed-precision-amp"]], "Distributed Training": [[6, "distributed-training"]], "DLPack Solution": [[6, "dlpack-solution"], [8, "dlpack-solution"]], "DPC++ Extension": [[6, "dpc-extension"], [9, "dpc-extension"]], "Advanced Configuration": [[6, "advanced-configuration"], [11, "advanced-configuration"]], "Fully Sharded Data Parallel (FSDP)": [[6, "fully-sharded-data-parallel-fsdp"], [10, "fully-sharded-data-parallel-fsdp"]], "Inductor": [[6, "inductor"]], "Legacy Profiler Tool (Experimental)": [[6, "legacy-profiler-tool-experimental"], [27, "legacy-profiler-tool-experimental"]], "Simple Trace Tool (Experimental)": [[6, "simple-trace-tool-experimental"], [29, "simple-trace-tool-experimental"]], "Kineto Supported Profiler Tool (Experimental)": [[6, "kineto-supported-profiler-tool-experimental"], [26, "kineto-supported-profiler-tool-experimental"]], "Compute Engine (Experimental feature for debug)": [[6, "compute-engine-experimental-feature-for-debug"], [16, 
"compute-engine-experimental-feature-for-debug"]], "Operator Optimization": [[6, "operator-optimization"]], "Runtime Extension": [[6, "runtime-extension"], [28, "runtime-extension"], [38, "runtime-extension"]], "Codeless Optimization (Experimental, NEW feature in 1.13.*)": [[6, "codeless-optimization-experimental-new-feature-in-1-13"]], "Graph Capture (Experimental, NEW feature in 1.13.0*)": [[6, "graph-capture-experimental-new-feature-in-1-13-0"]], "HyperTune (Experimental, NEW feature in 1.13.0*)": [[6, "hypertune-experimental-new-feature-in-1-13-0"]], "DistributedDataParallel (DDP)": [[7, "distributeddataparallel-ddp"]], "Introduction": [[7, "introduction"], [8, "introduction"], [9, "introduction"], [10, "introduction"], [12, "introduction"], [13, "introduction"], [16, "introduction"], [26, "introduction"], [27, "introduction"], [29, "introduction"], [30, "introduction"], [33, "introduction"], [44, "introduction"], [48, "introduction"], [49, "introduction"]], "Installation of Intel\u00ae oneCCL Bindings for Pytorch*": [[7, "installation-of-intel-oneccl-bindings-for-pytorch"]], "Install PyTorch and Intel\u00ae Extension for PyTorch*": [[7, "install-pytorch-and-intel-extension-for-pytorch"]], "Install Intel\u00ae oneCCL Bindings for Pytorch*": [[7, "install-intel-oneccl-bindings-for-pytorch"]], "Install from source:": [[7, "install-from-source"]], "Install from prebuilt wheel:": [[7, "install-from-prebuilt-wheel"]], "Runtime Dynamic Linking": [[7, "runtime-dynamic-linking"]], "DDP Usage": [[7, "ddp-usage"]], "Example Usage (MPI launch for single node):": [[7, "example-usage-mpi-launch-for-single-node"]], "DDP scaling API (GPU Only)": [[7, "ddp-scaling-api-gpu-only"]], "Usage of DDP scaling API": [[7, "usage-of-ddp-scaling-api"]], "Use Case": [[8, "use-case"], [12, "use-case"], [13, "use-case"], [16, "use-case"], [26, "use-case"], [27, "use-case"], [29, "use-case"]], "Design": [[8, "design"]], "Import DLPack Capsule": [[8, "import-dlpack-capsule"]], "Export DLPack 
Capsule": [[8, "export-dlpack-capsule"]], "DLDevice and data pointer": [[8, "dldevice-and-data-pointer"]], "Asynchronous Programming": [[8, "asynchronous-programming"]], "Example Case": [[8, "example-case"]], "Motivation and Example": [[9, "motivation-and-example"]], "Writing a DPC++ Extension": [[9, "writing-a-dpc-extension"]], "Building with setuptools": [[9, "building-with-setuptools"]], "JIT Compiling Extensions": [[9, "jit-compiling-extensions"]], "Building with CMake": [[9, "building-with-cmake"]], "Requesting the current c10::Stream": [[9, "requesting-the-current-c10-stream"]], "Fetching the corresponding sycl::queue": [[9, "fetching-the-corresponding-sycl-queue"]], "Writing the DPC++ Op": [[9, "writing-the-dpc-op"]], "Using accessors": [[9, "using-accessors"]], "FSDP Usage (GPU only)": [[10, "fsdp-usage-gpu-only"]], "Example": [[10, "example"]], "Build Time Configuration": [[11, "build-time-configuration"]], "Runtime Configuration": [[11, "runtime-configuration"]], "Auto Mixed Precision (AMP) on CPU": [[12, "auto-mixed-precision-amp-on-cpu"]], "Default Precision": [[12, "default-precision"], [13, "default-precision"]], "Inference with Imperative Path": [[12, "inference-with-imperative-path"], [13, "inference-with-imperative-path"]], "Inference with TorchScript Path": [[12, "inference-with-torchscript-path"], [13, "inference-with-torchscript-path"]], "Training Support": [[12, "training-support"], [13, "training-support"]], "Autocast Op Reference": [[12, "autocast-op-reference"], [13, "autocast-op-reference"]], "Op Eligibility": [[12, "op-eligibility"], [13, "op-eligibility"]], "Op-Specific Behavior": [[12, "op-specific-behavior"], [13, "op-specific-behavior"]], "Ops that can autocast to bfloat16": [[12, "ops-that-can-autocast-to-bfloat16"], [13, "ops-that-can-autocast-to-bfloat16"]], "Ops that can autocast to float32": [[12, "ops-that-can-autocast-to-float32"], [13, "ops-that-can-autocast-to-float32"]], "Ops that promote to the widest input type": [[12, 
"ops-that-promote-to-the-widest-input-type"], [13, "ops-that-promote-to-the-widest-input-type"]], "Auto Mixed Precision (AMP) on GPU": [[13, "auto-mixed-precision-amp-on-gpu"]], "Ops that can autocast to float16": [[13, "ops-that-can-autocast-to-float16"]], "Auto Channels Last": [[14, "auto-channels-last"]], "Ease-of-use auto channels last API": [[14, "ease-of-use-auto-channels-last-api"]], "default": [[14, "default"]], "enable": [[14, "enable"]], "disable": [[14, "disable"]], "Known issue": [[14, "known-issue"]], "Codeless Optimization (Experimental)": [[15, "codeless-optimization-experimental"]], "Motivation": [[15, "motivation"]], "Example Usage with HuggingFace": [[15, "example-usage-with-huggingface"]], "The origin command with ipex launch": [[15, "the-origin-command-with-ipex-launch"]], "Command to apply ipex optimization for FP32": [[15, "command-to-apply-ipex-optimization-for-fp32"]], "Command to apply ipex optimization for BF16": [[15, "command-to-apply-ipex-optimization-for-bf16"]], "Use Case not supported": [[15, "use-case-not-supported"]], "Module uses forward method explicitly instead of the __call__ attr": [[15, "module-uses-forward-method-explicitly-instead-of-the-call-attr"]], "Already using ipex.optimize": [[15, "already-using-ipex-optimize"]], "Already using Jit Trace": [[15, "already-using-jit-trace"]], "Engine Selection Policy": [[16, "engine-selection-policy"]], "Multiple Implementations Operators and Engines": [[16, "multiple-implementations-operators-and-engines"]], "Float8 Data Type Support [GPU] (Experimental)": [[17, "float8-data-type-support-gpu-experimental"]], "Float8 Data Type": [[17, "float8-data-type"]], "FP8 Quantization": [[17, "fp8-quantization"]], "Supported running mode": [[17, "supported-running-mode"], [21, "supported-running-mode"]], "Supported operators": [[17, "supported-operators"], [21, "supported-operators"]], "FP8 usage example": [[17, "fp8-usage-example"]], "Graph Capture (Experimental)": [[18, 
"graph-capture-experimental"]], "Feature Description": [[18, "feature-description"]], "Usage Example": [[18, "usage-example"], [24, "usage-example"]], "Horovod with PyTorch (Experimental)": [[19, "horovod-with-pytorch-experimental"]], "Install Horovod with PyTorch": [[19, "install-horovod-with-pytorch"]], "Horovod with PyTorch Usage": [[19, "horovod-with-pytorch-usage"]], "HyperTune (Experimental)": [[20, "hypertune-experimental"]], "Usage of Hypertune": [[20, "usage-of-hypertune"]], "your_conf_file": [[20, "your-conf-file"]], "Hyperparameters": [[20, "hyperparameters"]], "Launcher Hyperparameters": [[20, "launcher-hyperparameters"]], "Defining hyperparameters and their search spaces": [[20, "defining-hyperparameters-and-their-search-spaces"]], "1. Defining hyperparameters to tune:": [[20, "defining-hyperparameters-to-tune"]], "2. Defining the search spaces of the hyperparameters:": [[20, "defining-the-search-spaces-of-the-hyperparameters"]], "Default search space": [[20, "default-search-space"]], "User defined search space": [[20, "user-defined-search-space"]], "": [[20, "your-python-script"]], "Usage Examples": [[20, "usage-examples"], [39, "usage-examples"]], "INT4 inference [GPU] (Experimental)": [[21, "int4-inference-gpu-experimental"]], "INT4 Data Type": [[21, "int4-data-type"]], "INT4 Quantization": [[21, "int4-quantization"]], "INT4 usage example": [[21, "int4-usage-example"]], "Weight Only Quantization Tool": [[21, "weight-only-quantization-tool"]], "Intel\u00ae Extension for PyTorch* optimizations for quantization [CPU]": [[22, "intel-extension-for-pytorch-optimizations-for-quantization-cpu"]], "Static Quantization": [[22, "static-quantization"]], "Define qconfig": [[22, "define-qconfig"]], "Prepare Model and Do Calibration": [[22, "prepare-model-and-do-calibration"]], "Convert to Static Quantized Model and Deploy": [[22, "convert-to-static-quantized-model-and-deploy"]], "Dynamic Quantization": [[22, "dynamic-quantization"]], "Define QConfig": [[22, 
"id1"]], "Prepare Model": [[22, "prepare-model"]], "Convert to Dynamic Quantized Model and Deploy": [[22, "convert-to-dynamic-quantized-model-and-deploy"]], "Intel\u00ae Extension for PyTorch* Optimizations for Quantization [GPU]": [[23, "intel-extension-for-pytorch-optimizations-for-quantization-gpu"]], "INT8 Recipe Tuning API (Experimental) [CPU]": [[24, "int8-recipe-tuning-api-experimental-cpu"]], "What is Channels Last": [[25, "what-is-channels-last"]], "Memory Format Is All That Matters": [[25, "memory-format-is-all-that-matters"]], "a. NCHW (default)": [[25, "a-nchw-default"]], "b. NHWC": [[25, "b-nhwc"]], "c. Blocked (nChw16c, on CPU)": [[25, "c-blocked-nchw16c-on-cpu"]], "PyTorch Strided Layout": [[25, "pytorch-strided-layout"]], "Channels Last Memory Format APIs": [[25, "channels-last-memory-format-apis"]], "a. tensor creation": [[25, "a-tensor-creation"]], "b. tensor conversion": [[25, "b-tensor-conversion"]], "c. model conversion": [[25, "c-model-conversion"]], "d. operator coverage in PyTorch": [[25, "d-operator-coverage-in-pytorch"]], "Writing Channels Last Kernels on CPU": [[25, "writing-channels-last-kernels-on-cpu"]], "a. Register Channels Last Kernel in ATen Native Manner": [[25, "a-register-channels-last-kernel-in-aten-native-manner"]], "b. Register oneDNN Kernel on Channels Last": [[25, "b-register-onednn-kernel-on-channels-last"]], "oneDNN NHWC APIs": [[25, "onednn-nhwc-apis"]], "a. Create NHWC Memory": [[25, "a-create-nhwc-memory"]], "b. Create Convolution Primitive": [[25, "b-create-convolution-primitive"]], "Channels Last 1D support on XPU": [[25, "channels-last-1d-support-on-xpu"]], "a. tensor conversion with Channels Last 1D": [[25, "a-tensor-conversion-with-channels-last-1d"]], "b. model conversion with Channels Last 1D": [[25, "b-model-conversion-with-channels-last-1d"]], "c. 
determine if in Channels Last 1D memory format": [[25, "c-determine-if-in-channels-last-1d-memory-format"]], "Build Tool": [[26, "build-tool"], [27, "build-tool"]], "Use Tool": [[26, "use-tool"], [27, "use-tool"]], "Set Environment Variable": [[26, "set-environment-variable"]], "Add Profiler Into Script": [[26, "add-profiler-into-script"]], "Disable Tool in Model Script": [[26, "disable-tool-in-model-script"], [27, "disable-tool-in-model-script"]], "Disable Tool Partly for XPU Backend": [[26, "disable-tool-partly-for-xpu-backend"]], "Profile on Multi-device Application": [[26, "profile-on-multi-device-application"]], "Result": [[26, "result"]], "Export to Chrome Trace": [[26, "export-to-chrome-trace"], [27, "export-to-chrome-trace"]], "Results": [[27, "results"], [29, "results"]], "Requirements": [[28, "requirements"]], "Use Cases": [[28, "use-cases"]], "Example of MultiStream Module": [[28, "example-of-multistream-module"]], "Examples1: Basic Usage": [[28, "examples1-basic-usage"]], "Examples2: Usage with \u201cAUTO\u201d setting": [[28, "examples2-usage-with-auto-setting"]], "Examples3: Usage for models with structure inputs/outputs": [[28, "examples3-usage-for-models-with-structure-inputs-outputs"]], "Performance recipes": [[28, "performance-recipes"]], "Known issues": [[28, "known-issues"]], "Example of asynchronous task": [[28, "example-of-asynchronous-task"]], "Example of configuring core binding": [[28, "example-of-configuring-core-binding"]], "Detail Design": [[28, "detail-design"]], "How the core binding is implemented": [[28, "how-the-core-binding-is-implemented"]], "Design of Task": [[28, "design-of-task"]], "IOMP preload or load during the runtime": [[28, "iomp-preload-or-load-during-the-runtime"]], "Enable and Disable Tool": [[29, "enable-and-disable-tool"]], "Use Simple Trace in Model": [[29, "use-simple-trace-in-model"]], "torch.compile for GPU (Experimental)": [[30, "torch-compile-for-gpu-experimental"]], "Required Dependencies": [[30, 
"required-dependencies"]], "Inferenece with torch.compile": [[30, "inferenece-with-torch-compile"]], "Training with torch.compile": [[30, "training-with-torch-compile"]], "Quick Start": [[31, "quick-start"]], "Execution": [[31, "execution"]], "Installation": [[32, "installation"]], "Get Started": [[33, "get-started"]], "License": [[34, "license"]], "Large Language Models (LLM) Optimizations Overview": [[35, "large-language-models-llm-optimizations-overview"]], "Optimized Models": [[35, "optimized-models"]], "Optimization Methodologies": [[35, "optimization-methodologies"]], "Linear Operator Optimization": [[35, "linear-operator-optimization"]], "Deep Fusion Policy": [[35, "deep-fusion-policy"]], "Segment KV Cache": [[35, "segment-kv-cache"]], "Distributed Inference": [[35, "distributed-inference"]], "Low Precision Data Types": [[35, "low-precision-data-types"]], "Transformers Optimization Frontend API": [[36, "transformers-optimization-frontend-api"]], "Pseudocode of Common Usage Scenarios": [[36, "pseudocode-of-common-usage-scenarios"]], "FP16": [[36, "fp16"]], "SmoothQuant": [[36, "smoothquant"]], "Imperative mode": [[36, "imperative-mode"]], "Weight Only Quantization (WOQ)": [[36, "weight-only-quantization-woq"]], "Distributed Inference with DeepSpeed": [[36, "distributed-inference-with-deepspeed"]], "Performance Tuning Guide": [[37, "performance-tuning-guide"], [41, "performance-tuning-guide"]], "Troubleshooting": [[38, "troubleshooting"]], "GPU-specific Issues": [[38, "gpu-specific-issues"]], "General Usage": [[38, "general-usage"], [38, "id1"]], "Library Dependencies": [[38, "library-dependencies"]], "Unit Test": [[38, "unit-test"]], "CPU-specific issues": [[38, "cpu-specific-issues"]], "TorchDynamo": [[38, "torchdynamo"]], "Dynamic Shape": [[38, "dynamic-shape"]], "Result Correctness": [[38, "result-correctness"]], "Float32 Training": [[38, "float32-training"]], "Launch Script Usage Guide": [[39, "launch-script-usage-guide"]], "Overview": [[39, "overview"], 
[41, "overview"], [46, "overview"]], "Usage of launch script": [[39, "usage-of-launch-script"]], "Single instance for inference": [[39, "single-instance-for-inference"]], "I. Use all physical cores": [[39, "i-use-all-physical-cores"]], "II. Use all cores including logical cores": [[39, "ii-use-all-cores-including-logical-cores"]], "III. Use physical cores on designated nodes": [[39, "iii-use-physical-cores-on-designated-nodes"]], "IV. Use your designated number of cores": [[39, "iv-use-your-designated-number-of-cores"]], "Multiple instances for inference": [[39, "multiple-instances-for-inference"]], "V. Throughput mode": [[39, "v-throughput-mode"]], "VI. Latency mode": [[39, "vi-latency-mode"]], "VII. Your designated number of instances": [[39, "vii-your-designated-number-of-instances"]], "VIII. Your designated number of instances and instance index": [[39, "viii-your-designated-number-of-instances-and-instance-index"]], "Usage of Jemalloc/TCMalloc/Default memory allocator": [[39, "usage-of-jemalloc-tcmalloc-default-memory-allocator"]], "Jemalloc": [[39, "jemalloc"], [41, "jemalloc"]], "TCMalloc": [[39, "tcmalloc"], [41, "tcmalloc"]], "Default memory allocator": [[39, "default-memory-allocator"]], "Usage of OpenMP library": [[39, "usage-of-openmp-library"]], "Intel OpenMP Library": [[39, "intel-openmp-library"]], "GNU OpenMP Library": [[39, "gnu-openmp-library"]], "TorchServe with Intel\u00ae Extension for PyTorch*": [[40, "torchserve-with-intel-extension-for-pytorch"]], "Contents of this Document": [[40, "contents-of-this-document"], [41, "contents-of-this-document"]], "Install Intel\u00ae Extension for PyTorch*": [[40, "install-intel-extension-for-pytorch"]], "Serving model with Intel\u00ae Extension for PyTorch*": [[40, "serving-model-with-intel-extension-for-pytorch"]], "TorchServe with Launcher": [[40, "torchserve-with-launcher"]], "Launcher Core Pinning to Boost Performance of TorchServe Multi Worker Inference": [[40, 
"launcher-core-pinning-to-boost-performance-of-torchserve-multi-worker-inference"]], "Scaling workers": [[40, "scaling-workers"]], "Creating and Exporting INT8 model for Intel\u00ae Extension for PyTorch*": [[40, "creating-and-exporting-int8-model-for-intel-extension-for-pytorch"]], "1. Creating a serialized file": [[40, "creating-a-serialized-file"]], "ResNet50": [[40, "resnet50"]], "2. Creating a Model Archive": [[40, "creating-a-model-archive"]], "3. Start TorchServe to serve the model": [[40, "start-torchserve-to-serve-the-model"]], "4. Registering and Deploying model": [[40, "registering-and-deploying-model"]], "Benchmarking with Launcher": [[40, "benchmarking-with-launcher"]], "Benchmarking with Launcher Core Pinning": [[40, "benchmarking-with-launcher-core-pinning"]], "Performance Boost with Intel\u00ae Extension for PyTorch* and Launcher": [[40, "performance-boost-with-intel-extension-for-pytorch-and-launcher"]], "Hardware Configuration": [[41, "hardware-configuration"]], "Intel CPU Structure": [[41, "intel-cpu-structure"]], "Non-Uniform Memory Access (NUMA)": [[41, "non-uniform-memory-access-numa"]], "Software Configuration": [[41, "software-configuration"]], "Numactl": [[41, "numactl"]], "OpenMP": [[41, "openmp"]], "OMP_NUM_THREADS": [[41, "omp-num-threads"]], "GNU OpenMP": [[41, "gnu-openmp"]], "Intel OpenMP": [[41, "intel-openmp"]], "Memory Allocator": [[41, "memory-allocator"]], "Denormal Number": [[41, "denormal-number"]], "OneDNN primitive cache": [[41, "onednn-primitive-cache"]], "Releases": [[42, "releases"]], "2.1.10+xpu": [[42, "xpu"]], "Highlights": [[42, "highlights"], [42, "id2"], [42, "id5"], [42, "id8"], [42, "id10"]], "Known Issues": [[42, "known-issues"], [42, "id3"], [42, "id6"], [42, "id9"], [42, "id11"]], "2.0.110+xpu": [[42, "id1"]], "1.13.120+xpu": [[42, "id4"]], "1.13.10+xpu": [[42, "id7"]], "1.10.200+gpu": [[42, "gpu"]], "Technical Details": [[43, "technical-details"]], "ISA Dynamic Dispatching [CPU]": [[43, 
"isa-dynamic-dispatching-cpu"]], "Graph Optimization [CPU]": [[43, "graph-optimization-cpu"]], "Optimizer Optimization [CPU, GPU]": [[43, "optimizer-optimization-cpu-gpu"]], "Ahead of Time Compilation (AOT) [GPU]": [[43, "ahead-of-time-compilation-aot-gpu"]], "Memory Management [GPU]": [[43, "memory-management-gpu"]], "Ahead of Time (AOT) Compilation": [[44, "ahead-of-time-aot-compilation"]], "Use case": [[44, "use-case"]], "Requirement": [[44, "requirement"]], "Graph Optimization": [[45, "graph-optimization"]], "Ease-of-use graph optimization API": [[45, "ease-of-use-graph-optimization-api"]], "FP32 and BF16 models": [[45, "fp32-and-bf16-models"]], "INT8 models": [[45, "int8-models"]], "Methodology": [[45, "methodology"]], "Fusion": [[45, "fusion"]], "FP32 and BF16 fusion patterns": [[45, "fp32-and-bf16-fusion-patterns"]], "INT8 fusion patterns": [[45, "int8-fusion-patterns"]], "Folding": [[45, "folding"]], "ISA Dynamic Dispatching": [[46, "isa-dynamic-dispatching"]], "CPU ISA build compiler requirement": [[46, "cpu-isa-build-compiler-requirement"]], "Select ISA Level": [[46, "select-isa-level"]], "Example:": [[46, "example"]], "CPU feature check": [[46, "cpu-feature-check"]], "Memory Management": [[47, "memory-management"]], "Optimizer Fusion on CPU": [[48, "optimizer-fusion-on-cpu"]], "Operation Fusion": [[48, "operation-fusion"], [49, "operation-fusion"]], "Optimizer Fusion on GPU": [[49, "optimizer-fusion-on-gpu"]], "Split SGD": [[50, "split-sgd"], [50, "id2"]], "Stochastic Gradient Descent (SGD)": [[50, "stochastic-gradient-descent-sgd"]]}, "indexentries": {"cpupool (class in intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.CPUPool"]], "event (class in intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.Event"]], "multistreammodule (class in intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModule"]], "multistreammodulehint (class in 
intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModuleHint"]], "stream (class in intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.Stream"]], "task (class in intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.Task"]], "_gptq() (in module intel_extension_for_pytorch.quantization)": [[1, "intel_extension_for_pytorch.quantization._gptq"]], "autotune() (in module intel_extension_for_pytorch.quantization)": [[1, "intel_extension_for_pytorch.quantization.autotune"]], "convert() (in module intel_extension_for_pytorch.quantization)": [[1, "intel_extension_for_pytorch.quantization.convert"]], "current_device() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.current_device"]], "current_stream() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.current_stream"]], "device (class in intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.device"]], "device_count() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.device_count"]], "device_of (class in intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.device_of"]], "elapsed_time() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.elapsed_time"]], "empty_cache() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.empty_cache"]], "enable_onednn_fusion() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.enable_onednn_fusion"]], "fp8_autocast() (in module intel_extension_for_pytorch.xpu.fp8.fp8)": [[1, "intel_extension_for_pytorch.xpu.fp8.fp8.fp8_autocast"]], "get_core_list_of_node_id() (in module intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.get_core_list_of_node_id"]], "get_device_name() (in module intel_extension_for_pytorch.xpu)": [[1, 
"intel_extension_for_pytorch.xpu.get_device_name"]], "get_device_properties() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.get_device_properties"]], "get_fp32_math_mode() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.get_fp32_math_mode"]], "get_rng_state() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.get_rng_state"]], "get_rng_state_all() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.get_rng_state_all"]], "init() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.init"]], "initial_seed() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.initial_seed"]], "intel_extension_for_pytorch.cpu.runtime": [[1, "module-intel_extension_for_pytorch.cpu.runtime"]], "intel_extension_for_pytorch.quantization": [[1, "module-intel_extension_for_pytorch.quantization"]], "is_available() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.is_available"]], "is_initialized() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.is_initialized"]], "is_runtime_ext_enabled() (in module intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.is_runtime_ext_enabled"]], "manual_seed() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.manual_seed"]], "manual_seed_all() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.manual_seed_all"]], "max_memory_allocated() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.max_memory_allocated"]], "max_memory_reserved() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.max_memory_reserved"]], "memory_allocated() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_allocated"]], 
"memory_reserved() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_reserved"]], "memory_snapshot() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_snapshot"]], "memory_stats() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_stats"]], "memory_stats_as_nested_dict() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_stats_as_nested_dict"]], "memory_summary() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.memory_summary"]], "module": [[1, "module-intel_extension_for_pytorch.cpu.runtime"], [1, "module-intel_extension_for_pytorch.quantization"]], "optimize() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.optimize"]], "optimize_transformers() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.optimize_transformers"]], "pin (class in intel_extension_for_pytorch.cpu.runtime)": [[1, "intel_extension_for_pytorch.cpu.runtime.pin"]], "prepare() (in module intel_extension_for_pytorch.quantization)": [[1, "intel_extension_for_pytorch.quantization.prepare"]], "query() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.query"]], "record() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.record"]], "record_event() (intel_extension_for_pytorch.xpu.stream method)": [[1, "intel_extension_for_pytorch.xpu.Stream.record_event"]], "reset_accumulated_memory_stats() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.reset_accumulated_memory_stats"]], "reset_peak_memory_stats() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.reset_peak_memory_stats"]], "seed() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.seed"]], "seed_all() (in module 
intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.seed_all"]], "set_device() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.set_device"]], "set_fp32_math_mode() (in module intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.set_fp32_math_mode"]], "set_rng_state() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.set_rng_state"]], "set_rng_state_all() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.set_rng_state_all"]], "stream() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.stream"]], "sycl_queue (intel_extension_for_pytorch.xpu.stream property)": [[1, "intel_extension_for_pytorch.xpu.Stream.sycl_queue"]], "synchronize() (in module intel_extension_for_pytorch.xpu)": [[1, "intel_extension_for_pytorch.xpu.synchronize"]], "synchronize() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.synchronize"]], "synchronize() (intel_extension_for_pytorch.xpu.stream method)": [[1, "intel_extension_for_pytorch.xpu.Stream.synchronize"]], "verbose (class in intel_extension_for_pytorch)": [[1, "intel_extension_for_pytorch.verbose"]], "wait() (intel_extension_for_pytorch.xpu.event method)": [[1, "intel_extension_for_pytorch.xpu.Event.wait"]], "wait_event() (intel_extension_for_pytorch.xpu.stream method)": [[1, "intel_extension_for_pytorch.xpu.Stream.wait_event"]], "wait_stream() (intel_extension_for_pytorch.xpu.stream method)": [[1, "intel_extension_for_pytorch.xpu.Stream.wait_stream"]], "xpu::fp32_math_mode (c++ enum)": [[1, "_CPPv4N3xpu14FP32_MATH_MODEE"]], "xpu::fp32_math_mode::bf32 (c++ enumerator)": [[1, "_CPPv4N3xpu14FP32_MATH_MODE4BF32E"]], "xpu::fp32_math_mode::fp32 (c++ enumerator)": [[1, "_CPPv4N3xpu14FP32_MATH_MODE4FP32E"]], "xpu::fp32_math_mode::fp32_math_mode_max (c++ enumerator)": [[1, "_CPPv4N3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MAXE"]], 
"xpu::fp32_math_mode::tf32 (c++ enumerator)": [[1, "_CPPv4N3xpu14FP32_MATH_MODE4TF32E"]], "xpu::get_queue_from_stream (c++ function)": [[1, "_CPPv4N3xpu21get_queue_from_streamEN3c106StreamE"]], "xpu::set_fp32_math_mode (c++ function)": [[1, "_CPPv4N3xpu18set_fp32_math_modeE14FP32_MATH_MODE"]], "frozenbatchnorm2d (class in intel_extension_for_pytorch.nn)": [[6, "intel_extension_for_pytorch.nn.FrozenBatchNorm2d"]], "interaction() (in module intel_extension_for_pytorch.nn.functional)": [[6, "intel_extension_for_pytorch.nn.functional.interaction"]]}})
\ No newline at end of file
diff --git a/xpu/2.1.10+xpu/tutorials/api_doc.html b/xpu/2.1.10+xpu/tutorials/api_doc.html
index e326ea740..a7053f806 100644
--- a/xpu/2.1.10+xpu/tutorials/api_doc.html
+++ b/xpu/2.1.10+xpu/tutorials/api_doc.html
@@ -373,7 +373,7 @@ Device-Agnostic