From 7b809d2fa255310766c637f2e3ac855d18e82ae0 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Mon, 17 Jun 2024 20:29:28 -0700 Subject: [PATCH 01/19] New README --- README.md | 213 ++++++++++++++++++++++++++---------------------------- 1 file changed, 104 insertions(+), 109 deletions(-) diff --git a/README.md b/README.md index 6058a2ccac..9c9a27e8cb 100644 --- a/README.md +++ b/README.md @@ -2,10 +2,91 @@ [![](https://dcbadge.vercel.app/api/server/cudamode?style=flat)](https://discord.gg/cudamode) -This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues) ## Introduction -`torchao` is a PyTorch library for quantization and sparsity. + +torchao is a library which makes it easy to integrate and create high performance kernels with custom data types and layouts with up to +* **30% speedups** for training +* **2x speedups** with **65%** less VRAM for inference + +All with no intrusive code changes and minimal accuracy degradation. + +## Benchmarks + +### Training + +We've added support for semi-structured 2:4 sparsity with over 30% speedups on ViT-L + +The code change is a 1 liner with the full example available [here](torchao/sparsity/training/) + + +```python +swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear}) +``` + +For VIT-L MLP shapes on a NVIDIA A100 we see the following results: +``` +[------------------------------------------------ mlpfwbw -------------------------------------------------] + | act24 | dense | w24 | s24_inp_sparsify24 | s24_inp_clone +1 threads: ------------------------------------------------------------------------------------------------- + f16 (44160,1024,4096,1024) | 11881.0 | 11534.3 | 9204.7 | 255.1 | 125.8 + +Times are in microseconds (us). +``` + + +### Inference + +#### Without intrusive code changes + +Quantizing your own models is as simple as the below and this should work on any model with `nn.Linear`. You can find a more comprehensive usage example [here](torchao/quantization/) + +```python +from torchao.quantization.quant_api import quantize +m = quantize(m, "int4wo") +``` + +Benchmarks are run on a machine with a single A100 GPU using the script in `_models/llama` which generates text in a latency optimized way (batchsize=1) + +The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama-3-8B`. + +| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) | +| ----------- | ------------------ | ------------------- | ------------- | ----------------------- | ---------------- | --------------- | +| Llama-2-7B | Base (bfloat16) | 12.212 | 105.02 | 1387.78 | 13.21 | 13.90 | +| | int8dq | 12.262 | 9.40 | 62.26 | 6.62 | 8.61 | +| | int8wo | 12.204 | 147.03 | 973.54 | 6.62 | 8.95 | +| | int4wo-64 | 12.843 | 199.81 | 746.45 | 3.74 | 4.75 | +| | int4wo-64-GPTQ | 12.489 | 199.81 | 746.45 | 3.74 | 4.75 | +| Llama-3-8B | Base (bfloat16) | N/A | 94.91 | 1424.58 | 15.01 | 16.43 | +| | int8dq | N/A | 8.41 | 63.23 | 7.52 | 9.24 | +| | int8wo | N/A | 136.75 | 1028.38 | 7.52 | 10.42 | +| | int4wo-64 | N/A | 179.41 | 757.45 | 4.22 | 6.88 | + +note: Int8 dynamic quantization works best on compute bound models like [SAM](https://github.com/pytorch-labs/segment-anything-fast) whereas Llama with batchsize=1 tends to be memory bound, thus the rather low performance. 
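If you want to try the other rows of the table with the same one-liner, a rough sketch is below. Treat everything except the `"int4wo"` call as an assumption: we are guessing that the technique labels from the table (`"int8wo"`, `"int8dq"`, `"int4wo-64"`) are accepted as string arguments in the same way, and the toy model here just stands in for Llama, which the published numbers were measured on via `_models/llama`.

```python
import copy
import torch
from torchao.quantization.quant_api import quantize

# Hypothetical stand-in for "any model with nn.Linear"; the published numbers used Llama.
base = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).cuda().to(torch.bfloat16)
x = torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16)  # batchsize=1, as in the table

for technique in ("int8wo", "int8dq", "int4wo-64"):  # table labels, assumed to be valid keys
    m = quantize(copy.deepcopy(base), technique)     # same call shape as the int4wo example
    m = torch.compile(m, mode="max-autotune")
    m(x)  # warm up once, then time repeated calls to estimate tokens/s and peak memory
```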
+ +And a quick crash course on inference quantization to help parse the above table. Int4 quantization is actually an ambiguous term because there's the dtype in which a layer is represented and then the dtype in which the computation is done. For example if you're using Weight-Only (wo) int4 quantization that means that the layer will be upcasted to a larger dtype like fp16 so an int4 matrix multiplication is defined as `F.linear(input, weight.to(input.dtype))` wheras if it's possible to perform the computation using the smaller dtype directly pending support by a hardware vendor then that means you can perform `F.linear(input, weight)` directly and this is what we refer to as Dynamic-Quantization (dq). Naive quantization algorithms are also notoriously sensitive to outliers so we also typically set a group size that applies a scale factor per group of 64 elements in the case of `int4wo64`. + + +#### With intrusive code changes + +In some cases we rewrote popular GenAI models to be significantly faster in native PyTorch as in no C++/CUDA to achieve at the time SOTA inference performance. These involve more intrusive code changes. + +* 8x speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai) +* 10x speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2) +* 3x speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3) + +## Newer dtypes + +[MX](https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats) implementing training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. This work is prototype as the hardware support is not available yet. + +[nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) one of the most popular finetuning algorithms without writing custom Triton or CUDA code. Accessible talk [here](https://x.com/HamelHusain/status/1800315287574847701) + +## Composability + +A key design principle for us is composability as in any new dtype or layout we provide needs to work with `torch.compile()` and it needs to work with `FSDP`. It shouldn't matter if the kernels are written are pure PyTorch, CUDA, C++ or Triton - things should just work! And here has been our current strategy +1. Write the dtype, layout or bit packing logic in pure PyTorch and codegenerate efficient kernels with torch.compile. You can inspect those kernels with `TORCH_LOGS="output_code" python your_code.py` and check if a single kernel is being generated and if any unecessary buffers are being created +2. However once you get a kernel, how do you know how good it is? The best way is to benchmark the codegenerated code with the best kernel on the market. But packaging custom CPP/CUDA kernels that work on multiple devices is tedious but we've abstracted all the tedium from you with our [custom ops support](./torchao/csrc/) so if you love writing kernels but hate packaging, we'd love to accept contributions for your custom ops. One key benefit is a kernel written as a custom op will just work with no graph breaks with `torch.compile()`. 
Compilers are great at optimizations like fusions and overhead reduction but it's challenging for compilers to rewrite the math of an algorithm such that's it's faster but also numerically stable so we are betting on both compilers and custom ops +3. Finally while historically most quantization has been done for inference there is now a thriving area of research combining lower dtypes and sharding. One popular example is [NF4](torchao/dtypes/nf4tensor.py) which is used to create the QLoRA algorithm and you can define the semantics for how custom tensors should be sharded over multiple devices. We gave an accessible talk on [how to do this](https://x.com/HamelHusain/status/1800315287574847701). ## Get Started @@ -14,135 +95,49 @@ This repository is currently under heavy development - if you have suggestions o Stable Release ```Shell -pip install torchao +pip install torchao --extra-index-url https://download.pytorch.org/whl/test/cu121 # full options are cpu/cu118/cu121/cu124 ``` Nightly Release ```Shell -pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cpu # CPU only builds -pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cu118 # CUDA 11.8 -pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cu121 # CUDA 12.1 -pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cu124 # CUDA 12.4 - +pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cu121 # full options are cpu/cu118/cu121/cu124 ``` -From source +## Community Contributions + +* [jeromeku](https://github.com/jeromeku) has implemented + * [GaLore](torchao/prototype/galore/) a drop for the Adam Optimizer that allows you to finetune llama 7b on a single 4090 card with up to 70% speedups relative to eager PyTorch + * [DoRA](torchao/prototype/dora) a newer replacement for QLoRA with more promising convergence characteristics + * [Fused int4/fp16 Quant Matmul](torchao/prototype/hqq) which is particularly useful for compute bound kernels showing 4x speedups over tinygemm for larger batch sizes such as 512 +* [gau-nernst](https://github.com/gau-nernst) fp6 kernels that are 4x faster than fp16 [torchao/prototype/fp6_llm](torchao/prototype/fp6_llm) +* [vayuda](https://github.com/vayuda) with generic bitpacking kernels that were codegenerated using pure PyTorch [prototype/common](torchao/prototype/common) + +## How to contribute + +This repository is currently under heavy development +* If you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues) +* If you'd like to co-develop the library with us please join us on #torchao on [discord.gg/cudamode](https://discord.gg/cudamode) - there's a lot of dtypes out there and we could use a lot more hands to make them go brrr + +Installation instructions ```Shell git clone https://github.com/pytorch/ao cd ao -python setup.py install +python setup.py install ``` -If you plan to be developing the library run: +If you're contributing a feature ao ```Shell pip install -r dev-requirements.txt python setup.py develop ``` -** Note: -If you are running into any issues while building `ao` cpp extensions you can instead build using +For *most* developers you probably want to skip building custom C++/CUDA extensions for faster iteration cycles ```shell USE_CPP=0 python setup.py install ``` -### Quantization - -```python -import torch -import torchao - -# inductor settings 
which improve torch.compile performance for quantized modules -torch._inductor.config.force_fuse_int_mm_with_mul = True -torch._inductor.config.use_mixed_mm = True - -# Plug in your model and example input -model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16) -input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda') - -# perform autoquantization and compilation -q_model = torchao.autoquant(torch.compile(model, mode='max-autotune')) -q_model(input) -``` - -### Sparsity - -```python -import torch -from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor -from torch.ao.pruning import WeightNormSparsifier - -# bfloat16 CUDA model -model = torch.nn.Sequential(torch.nn.Linear(64, 64)).cuda().to(torch.bfloat16) - -# Accuracy: Finding a sparse subnetwork -sparse_config = [] -for name, mod in model.named_modules(): - if isinstance(mod, torch.nn.Linear): - sparse_config.append({"tensor_fqn": f"{name}.weight"}) - -sparsifier = WeightNormSparsifier(sparsity_level=1.0, - sparse_block_shape=(1,4), - zeros_per_block=2) - -# attach FakeSparsity -sparsifier.prepare(model, sparse_config) -sparsifier.step() -sparsifier.squash_mask() -# now we have dense model with sparse weights - -# Performance: Accelerated sparse inference -for name, mod in model.named_modules(): - if isinstance(mod, torch.nn.Linear): - mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight)) -``` - -To learn more try out our APIs, you can check out API examples in -* [quantization](./torchao/quantization) -* [sparsity](./torchao/sparsity) -* [dtypes](./torchao/dtypes) - - -## Supported Features -1. [Quantization algorithms](./torchao/quantization) - - [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization - - [Int4 weight-only](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu) quantization - - [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py) and [Smoothquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py) for low latency inference - - High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs -2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks -3. Support for lower precision [dtypes](./torchao/dtypes) such as - - [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code - - [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py) - - [MX](https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats) implementing training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. This work is prototype as the hardware support is not available yet. -4. 
[Bleeding Edge Kernels](./torchao/prototype/) for experimental kernels without backwards compatibility guarantees - - [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning - - [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute bound workloads - - [FP6-LLM](torchao/prototype/fp6_llm) mixed matmul FP16 x FP6 kernel for io bound workloads - -## Our Goals - -* Composability with `torch.compile`: We rely heavily on `torch.compile` to write pure PyTorch code and codegen efficient kernels. There are however limits to what a compiler can do so we don't shy away from writing our custom CUDA/Triton kernels -* Composability with `FSDP`: The new support for FSDP per parameter sharding means engineers and researchers alike can experiment with different quantization and distributed strategies concurrently. -* Performance: We measure our performance on every commit using an A10G. We also regularly run performance benchmarks on the [torchbench](https://github.com/pytorch/benchmark) suite -* Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based server (w/ torch.compile) and mobile backends (w/ ExecuTorch). -* Packaging kernels should be easy: We support custom [CUDA and Triton extensions](./torchao/csrc/) so you can focus on writing your kernels and we'll ensure that they work on most operating systems and devices - -## Integrations - -torchao has been integrated with other libraries including - -* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) leverages our 8 and 4 bit weight-only quantization techniques with optional support for GPTQ -* [Executorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) leverages our GPTQ implementation for both 8da4w (int8 dynamic activation with int4 weight) and int4 weight-only quantization. -* [HQQ](https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py) leverages our int4mm kernel for low latency inference - -## Success stories -Our kernels have been used to achieve SOTA inference performance on - -* Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai) -* Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2) -* Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3) - ## License `torchao` is released under the [BSD 3](https://github.com/pytorch-labs/ao/blob/main/LICENSE) license. From af10fe9ce242888eaf7507a29772a539cc813881 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Mon, 17 Jun 2024 20:34:13 -0700 Subject: [PATCH 02/19] yolo --- README.md | 41 +++++++++++++++++++---------------------- 1 file changed, 19 insertions(+), 22 deletions(-) diff --git a/README.md b/README.md index 9c9a27e8cb..89335c9e85 100644 --- a/README.md +++ b/README.md @@ -13,28 +13,6 @@ All with no intrusive code changes and minimal accuracy degradation. 
## Benchmarks -### Training - -We've added support for semi-structured 2:4 sparsity with over 30% speedups on ViT-L - -The code change is a 1 liner with the full example available [here](torchao/sparsity/training/) - - -```python -swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear}) -``` - -For VIT-L MLP shapes on a NVIDIA A100 we see the following results: -``` -[------------------------------------------------ mlpfwbw -------------------------------------------------] - | act24 | dense | w24 | s24_inp_sparsify24 | s24_inp_clone -1 threads: ------------------------------------------------------------------------------------------------- - f16 (44160,1024,4096,1024) | 11881.0 | 11534.3 | 9204.7 | 255.1 | 125.8 - -Times are in microseconds (us). -``` - - ### Inference #### Without intrusive code changes @@ -75,6 +53,25 @@ In some cases we rewrote popular GenAI models to be significantly faster in nati * 10x speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2) * 3x speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3) +### Training + +We've added support for semi-structured 2:4 sparsity with over 30% speedups on ViT-L + +The code change is a 1 liner with the full example available [here](torchao/sparsity/training/) + + +```python +swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear}) +``` + +For VIT-L MLP shapes on a NVIDIA A100 we see the following results: + +| | act24 | dense | w24 | s24_inp_sparsify24 | s24_inp_clone | +|---------------------|-----------|-----------|----------|--------------------|---------------| +| f16 (44160,1024,4096,1024) | 11881.0 | 11534.3 | 9204.7 | 255.1 | 125.8 | + +Times are in microseconds (us). + ## Newer dtypes [MX](https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats) implementing training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. This work is prototype as the hardware support is not available yet. From 5a14271844a7d3d0e767637902e8f6598a2252f3 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Mon, 17 Jun 2024 22:43:05 -0700 Subject: [PATCH 03/19] yolo --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 89335c9e85..ed7d7a11ad 100644 --- a/README.md +++ b/README.md @@ -78,6 +78,8 @@ Times are in microseconds (us). [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) one of the most popular finetuning algorithms without writing custom Triton or CUDA code. Accessible talk [here](https://x.com/HamelHusain/status/1800315287574847701) +[tinygemm](https://github.com/pytorch/ao/blob/cb3bd8c674f2123af232a0231b5e38ddafa756a8/torchao/dtypes/aqt.py#L526) we make heavy use of `torch.ops.aten._weight_int4pack_mm` to bitpack into a layout optimized for tensor cores + ## Composability A key design principle for us is composability as in any new dtype or layout we provide needs to work with `torch.compile()` and it needs to work with `FSDP`. It shouldn't matter if the kernels are written are pure PyTorch, CUDA, C++ or Triton - things should just work! 
And here has been our current strategy @@ -108,6 +110,7 @@ pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/n * [Fused int4/fp16 Quant Matmul](torchao/prototype/hqq) which is particularly useful for compute bound kernels showing 4x speedups over tinygemm for larger batch sizes such as 512 * [gau-nernst](https://github.com/gau-nernst) fp6 kernels that are 4x faster than fp16 [torchao/prototype/fp6_llm](torchao/prototype/fp6_llm) * [vayuda](https://github.com/vayuda) with generic bitpacking kernels that were codegenerated using pure PyTorch [prototype/common](torchao/prototype/common) +* [andreaskopf](https://github.com/andreaskoepf) and [melvinebenezer](https://github.com/melvinebenezer) with [bitnet tensors](torchao/prototype/dtypes) ## How to contribute From 6c5471923e394628e067c280e382294c5425b637 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Mon, 17 Jun 2024 22:51:27 -0700 Subject: [PATCH 04/19] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index ed7d7a11ad..6c5db453d4 100644 --- a/README.md +++ b/README.md @@ -109,8 +109,8 @@ pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/n * [DoRA](torchao/prototype/dora) a newer replacement for QLoRA with more promising convergence characteristics * [Fused int4/fp16 Quant Matmul](torchao/prototype/hqq) which is particularly useful for compute bound kernels showing 4x speedups over tinygemm for larger batch sizes such as 512 * [gau-nernst](https://github.com/gau-nernst) fp6 kernels that are 4x faster than fp16 [torchao/prototype/fp6_llm](torchao/prototype/fp6_llm) -* [vayuda](https://github.com/vayuda) with generic bitpacking kernels that were codegenerated using pure PyTorch [prototype/common](torchao/prototype/common) -* [andreaskopf](https://github.com/andreaskoepf) and [melvinebenezer](https://github.com/melvinebenezer) with [bitnet tensors](torchao/prototype/dtypes) +* [vayuda](https://github.com/vayuda) with generic bitpacking kernels that were code generated using pure PyTorch [prototype/common](torchao/prototype/common) +* [andreaskopf](https://github.com/andreaskoepf) and [melvinebenezer](https://github.com/melvinebenezer) with [1 bit LLMs](torchao/prototype/dtypes) Bitnet 1.58 bitpacked into uin2 and fully code-generated with torch.compile ## How to contribute From 68e6471ca8102d91971ee51a1e9e9debfd71335e Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Mon, 17 Jun 2024 23:12:36 -0700 Subject: [PATCH 05/19] Update README.md --- README.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index 6c5db453d4..f44f85b898 100644 --- a/README.md +++ b/README.md @@ -17,14 +17,14 @@ All with no intrusive code changes and minimal accuracy degradation. #### Without intrusive code changes -Quantizing your own models is as simple as the below and this should work on any model with `nn.Linear`. You can find a more comprehensive usage example [here](torchao/quantization/) +Quantizing your models is a 1 liner that should work on any model with `nn.Linear` including your favorite HuggingFace model. 
You can find a more comprehensive usage example [here](torchao/quantization/) ```python from torchao.quantization.quant_api import quantize m = quantize(m, "int4wo") ``` -Benchmarks are run on a machine with a single A100 GPU using the script in `_models/llama` which generates text in a latency optimized way (batchsize=1) +Benchmarks are run on a machine with a single A100 GPU using the script in `_models/llama` which generates text in a latency-optimized way (batchsize=1) The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama-3-8B`. @@ -42,7 +42,7 @@ The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama- note: Int8 dynamic quantization works best on compute bound models like [SAM](https://github.com/pytorch-labs/segment-anything-fast) whereas Llama with batchsize=1 tends to be memory bound, thus the rather low performance. -And a quick crash course on inference quantization to help parse the above table. Int4 quantization is actually an ambiguous term because there's the dtype in which a layer is represented and then the dtype in which the computation is done. For example if you're using Weight-Only (wo) int4 quantization that means that the layer will be upcasted to a larger dtype like fp16 so an int4 matrix multiplication is defined as `F.linear(input, weight.to(input.dtype))` wheras if it's possible to perform the computation using the smaller dtype directly pending support by a hardware vendor then that means you can perform `F.linear(input, weight)` directly and this is what we refer to as Dynamic-Quantization (dq). Naive quantization algorithms are also notoriously sensitive to outliers so we also typically set a group size that applies a scale factor per group of 64 elements in the case of `int4wo64`. +And a quick crash course on inference quantization to help parse the above table. Int4 quantization is an ambiguous term because there's the dtype in which a layer is represented and then the dtype in which the computation is done. For example, if you're using Weight-Only (wo) int4 quantization that means that the layer will be upcasted to a larger dtype like fp16 so an int4 matrix multiplication is defined as `F.linear(input, weight.to(input.dtype))` whereas if it's possible to perform the computation using the smaller dtype directly pending support by a hardware vendor then that means you can perform `F.linear(input, weight)` directly and this is what we refer to as Dynamic-Quantization (dq). Naive quantization algorithms are also notoriously sensitive to outliers so we also typically set a group size that applies a scale factor per group of 64 elements in the case of `int4wo64`. #### With intrusive code changes @@ -82,15 +82,15 @@ Times are in microseconds (us). ## Composability -A key design principle for us is composability as in any new dtype or layout we provide needs to work with `torch.compile()` and it needs to work with `FSDP`. It shouldn't matter if the kernels are written are pure PyTorch, CUDA, C++ or Triton - things should just work! And here has been our current strategy -1. Write the dtype, layout or bit packing logic in pure PyTorch and codegenerate efficient kernels with torch.compile. You can inspect those kernels with `TORCH_LOGS="output_code" python your_code.py` and check if a single kernel is being generated and if any unecessary buffers are being created -2. However once you get a kernel, how do you know how good it is? The best way is to benchmark the codegenerated code with the best kernel on the market. 
But packaging custom CPP/CUDA kernels that work on multiple devices is tedious but we've abstracted all the tedium from you with our [custom ops support](./torchao/csrc/) so if you love writing kernels but hate packaging, we'd love to accept contributions for your custom ops. One key benefit is a kernel written as a custom op will just work with no graph breaks with `torch.compile()`. Compilers are great at optimizations like fusions and overhead reduction but it's challenging for compilers to rewrite the math of an algorithm such that's it's faster but also numerically stable so we are betting on both compilers and custom ops +A key design principle for us is composability as in any new dtype or layout we provide needs to work with `torch.compile()` and needs to work with `FSDP`. It shouldn't matter if the kernels are written are pure PyTorch, CUDA, C++, or Triton - things should just work! And here is our current strategy +1. Write the dtype, layout or bit packing logic in pure PyTorch and code-generate efficient kernels with torch.compile. You can inspect those kernels with `TORCH_LOGS="output_code" python your_code.py` and check if a single kernel is being generated and if any unnecessary buffers are being created +2. However once you get a kernel, how do you know how good it is? The best way is to benchmark the code-generated code with the best kernel on the market. But packaging custom CPP/CUDA kernels that work on multiple devices is tedious but we've abstracted all the tedium from you with our [custom ops support](./torchao/csrc/) so if you love writing kernels but hate packaging, we'd love to accept contributions for your custom ops. One key benefit is a kernel written as a custom op will just work with no graph breaks with `torch.compile()`. Compilers are great at optimizations like fusions and overhead reduction but it's challenging for compilers to rewrite the math of an algorithm such that it's faster but also numerically stable so we are betting on both compilers and custom ops 3. Finally while historically most quantization has been done for inference there is now a thriving area of research combining lower dtypes and sharding. One popular example is [NF4](torchao/dtypes/nf4tensor.py) which is used to create the QLoRA algorithm and you can define the semantics for how custom tensors should be sharded over multiple devices. We gave an accessible talk on [how to do this](https://x.com/HamelHusain/status/1800315287574847701). ## Get Started ### Installation -`torchao` makes liberal use of several new features in pytorch, it's recommended to use it with the current nightly or latest stable version of PyTorch. +`torchao` makes liberal use of several new features in Pytorch, it's recommended to use it with the current nightly or latest stable version of PyTorch. 
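Before picking one of the install commands below it's worth confirming which PyTorch build you are on, since everything here leans on `torch.compile`. A minimal check (no specific minimum version is pinned here, the recommendation is simply nightly or latest stable):

```python
import torch

print(torch.__version__, torch.version.cuda)  # e.g. something like "2.3.1+cu121", "12.1"
assert hasattr(torch, "compile"), "torchao is best used with a torch.compile-capable build"
```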
Stable Release ```Shell @@ -115,8 +115,8 @@ pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/n ## How to contribute This repository is currently under heavy development -* If you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues) -* If you'd like to co-develop the library with us please join us on #torchao on [discord.gg/cudamode](https://discord.gg/cudamode) - there's a lot of dtypes out there and we could use a lot more hands to make them go brrr +* If you have suggestions on the API or use cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues) +* If you'd like to co-develop the library with us please join us on #torchao on [discord.gg/cudamode](https://discord.gg/cudamode) - there are a lot of dtypes out there and we could use a lot more hands to make them go brrr Installation instructions From 0d589a58feb963cebcdc95acfab2fb45bbce625f Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 08:14:42 -0700 Subject: [PATCH 06/19] Trigger CI From 874f27b8c68fe82cadaa79d3046827702c8ee638 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 08:14:54 -0700 Subject: [PATCH 07/19] Trigger CI From 5fa7f0eb97f0ecab2a3b5fd2986e18ed48439ce5 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 08:15:07 -0700 Subject: [PATCH 08/19] Trigger CI From 93f03769bd84fd2e0d4aad9e1bf5c78c5c3135f3 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 08:40:30 -0700 Subject: [PATCH 09/19] push --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index f44f85b898..e415bfd6d9 100644 --- a/README.md +++ b/README.md @@ -6,8 +6,8 @@ ## Introduction torchao is a library which makes it easy to integrate and create high performance kernels with custom data types and layouts with up to -* **30% speedups** for training -* **2x speedups** with **65%** less VRAM for inference +* **30% speedups** for [training](#training) +* **2x speedups** with **65%** less VRAM for [inference](#inference) All with no intrusive code changes and minimal accuracy degradation. From 927952e396a988f0e1d7382d25ac1e796ac845b5 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 09:23:43 -0700 Subject: [PATCH 10/19] push --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index e415bfd6d9..35e7c0b051 100644 --- a/README.md +++ b/README.md @@ -35,10 +35,10 @@ The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama- | | int8wo | 12.204 | 147.03 | 973.54 | 6.62 | 8.95 | | | int4wo-64 | 12.843 | 199.81 | 746.45 | 3.74 | 4.75 | | | int4wo-64-GPTQ | 12.489 | 199.81 | 746.45 | 3.74 | 4.75 | -| Llama-3-8B | Base (bfloat16) | N/A | 94.91 | 1424.58 | 15.01 | 16.43 | -| | int8dq | N/A | 8.41 | 63.23 | 7.52 | 9.24 | -| | int8wo | N/A | 136.75 | 1028.38 | 7.52 | 10.42 | -| | int4wo-64 | N/A | 179.41 | 757.45 | 4.22 | 6.88 | +| Llama-3-8B | Base (bfloat16) | | 94.91 | 1424.58 | 15.01 | 16.43 | +| | int8dq | | 8.41 | 63.23 | 7.52 | 9.24 | +| | int8wo | | 136.75 | 1028.38 | 7.52 | 10.42 | +| | int4wo-64 | | 179.41 | 757.45 | 4.22 | 6.88 | note: Int8 dynamic quantization works best on compute bound models like [SAM](https://github.com/pytorch-labs/segment-anything-fast) whereas Llama with batchsize=1 tends to be memory bound, thus the rather low performance. 
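One way to sanity check the table above, under the memory bound assumption: the bandwidth column is roughly tokens/s multiplied by the peak memory column, i.e. the bytes that have to be streamed for every generated token. This is a back-of-the-envelope reading rather than an exact accounting, but it shows why shrinking the weights translates almost directly into tokens/s.

```python
# Rough check against the Llama-2-7B rows above: bandwidth ~= tokens/s * peak memory (GB).
rows = {"bf16": (105.02, 13.21), "int8wo": (147.03, 6.62), "int4wo-64": (199.81, 3.74)}
for name, (tok_per_s, peak_gb) in rows.items():
    print(name, round(tok_per_s * peak_gb, 1), "GB/s")  # ~1387, ~973, ~747 vs the table's
                                                        # 1387.78, 973.54, 746.45
```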
From 04ced081b9223f3ccfa1e3f31cedd73a0c85c971 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 09:30:49 -0700 Subject: [PATCH 11/19] push --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 35e7c0b051..a5ca1847e3 100644 --- a/README.md +++ b/README.md @@ -42,6 +42,8 @@ The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama- note: Int8 dynamic quantization works best on compute bound models like [SAM](https://github.com/pytorch-labs/segment-anything-fast) whereas Llama with batchsize=1 tends to be memory bound, thus the rather low performance. +For int4 we make heavy use of [tinygemm](https://github.com/pytorch/ao/blob/cb3bd8c674f2123af232a0231b5e38ddafa756a8/torchao/dtypes/aqt.py#L526) of `torch.ops.aten._weight_int4pack_mm` to bitpack into a layout optimized for tensor cores + And a quick crash course on inference quantization to help parse the above table. Int4 quantization is an ambiguous term because there's the dtype in which a layer is represented and then the dtype in which the computation is done. For example, if you're using Weight-Only (wo) int4 quantization that means that the layer will be upcasted to a larger dtype like fp16 so an int4 matrix multiplication is defined as `F.linear(input, weight.to(input.dtype))` whereas if it's possible to perform the computation using the smaller dtype directly pending support by a hardware vendor then that means you can perform `F.linear(input, weight)` directly and this is what we refer to as Dynamic-Quantization (dq). Naive quantization algorithms are also notoriously sensitive to outliers so we also typically set a group size that applies a scale factor per group of 64 elements in the case of `int4wo64`. @@ -78,8 +80,6 @@ Times are in microseconds (us). [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) one of the most popular finetuning algorithms without writing custom Triton or CUDA code. Accessible talk [here](https://x.com/HamelHusain/status/1800315287574847701) -[tinygemm](https://github.com/pytorch/ao/blob/cb3bd8c674f2123af232a0231b5e38ddafa756a8/torchao/dtypes/aqt.py#L526) we make heavy use of `torch.ops.aten._weight_int4pack_mm` to bitpack into a layout optimized for tensor cores - ## Composability A key design principle for us is composability as in any new dtype or layout we provide needs to work with `torch.compile()` and needs to work with `FSDP`. It shouldn't matter if the kernels are written are pure PyTorch, CUDA, C++, or Triton - things should just work! And here is our current strategy From e9c934791f6ad9c53d0aa6f7c710762e04429516 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 10:11:21 -0700 Subject: [PATCH 12/19] push --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a5ca1847e3..ce5fc90b78 100644 --- a/README.md +++ b/README.md @@ -44,7 +44,7 @@ note: Int8 dynamic quantization works best on compute bound models like [SAM](ht For int4 we make heavy use of [tinygemm](https://github.com/pytorch/ao/blob/cb3bd8c674f2123af232a0231b5e38ddafa756a8/torchao/dtypes/aqt.py#L526) of `torch.ops.aten._weight_int4pack_mm` to bitpack into a layout optimized for tensor cores -And a quick crash course on inference quantization to help parse the above table. 
Int4 quantization is an ambiguous term because there's the dtype in which a layer is represented and then the dtype in which the computation is done. For example, if you're using Weight-Only (wo) int4 quantization that means that the layer will be upcasted to a larger dtype like fp16 so an int4 matrix multiplication is defined as `F.linear(input, weight.to(input.dtype))` whereas if it's possible to perform the computation using the smaller dtype directly pending support by a hardware vendor then that means you can perform `F.linear(input, weight)` directly and this is what we refer to as Dynamic-Quantization (dq). Naive quantization algorithms are also notoriously sensitive to outliers so we also typically set a group size that applies a scale factor per group of 64 elements in the case of `int4wo64`. +And a quick crash course on inference quantization to help parse the above table. Int4 quantization is an ambiguous term because there's the dtype in which a layer is represented and then the dtype in which the computation is done. For example, if you're using Weight-Only (wo) int4 quantization that means that the layer will be upcasted to a larger dtype like fp16 so an int4 matrix multiplication is defined as `F.linear(input, weight.to(input.dtype))`. Dynamic quantization (DQ) primarily targets activations, enabling on-the-fly quantization from higher precision formats like bf16 to lower precision formats such as int8. This process, when supported by hardware, allows for direct computation, such as performing `F.linear(input, weight)`. Naive quantization algorithms are also notoriously sensitive to outliers so we also typically set a group size that applies a scale factor per group of 64 elements in the case of `int4wo64`. #### With intrusive code changes From d04589e867918ce57dcd0b1dd73bad3780e9ece5 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 10:36:45 -0700 Subject: [PATCH 13/19] push --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ce5fc90b78..37439abf81 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,7 @@ All with no intrusive code changes and minimal accuracy degradation. #### Without intrusive code changes -Quantizing your models is a 1 liner that should work on any model with `nn.Linear` including your favorite HuggingFace model. You can find a more comprehensive usage example [here](torchao/quantization/) +Quantizing your models is a 1 liner that should work on any model with `nn.Linear` including your favorite HuggingFace model. 
You can find more comprehensive usage instructions [here](torchao/quantization/) and a Hugging Face inference example [here](scripts/hf_eval.py)

```python
from torchao.quantization.quant_api import quantize
m = quantize(m, "int4wo")
```

From 5c1ab3b348327ea92fc9208056cfaca21adbc597 Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Tue, 18 Jun 2024 11:24:06 -0700
Subject: [PATCH 14/19] push

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 37439abf81..e861f4c6e5 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@ ## Introduction
-torchao is a library which makes it easy to integrate and create high performance kernels with custom data types and layouts with up to
+torchao is a library to create and integrate high-performance custom data types, layouts and kernels into their PyTorch workflows with up to
* **30% speedups** for [training](#training)
* **2x speedups** with **65%** less VRAM for [inference](#inference)

From ea338a3f0a8ea368a54618523baea305fe5105c1 Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Tue, 18 Jun 2024 12:24:54 -0700
Subject: [PATCH 15/19] push

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index e861f4c6e5..a749711662 100644
--- a/README.md
+++ b/README.md
@@ -76,9 +76,9 @@ Times are in microseconds (us).

## Newer dtypes

-[MX](https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats) implementing training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. This work is prototype as the hardware support is not available yet.
-
-[nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) one of the most popular finetuning algorithms without writing custom Triton or CUDA code. Accessible talk [here](https://x.com/HamelHusain/status/1800315287574847701)
+* [MX](torchao/prototype/mx_formats) implementing training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. This work is prototype as the hardware support is not available yet.
+* [nf4](torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) one of the most popular finetuning algorithms without writing custom Triton or CUDA code.
Accessible talk [here](https://x.com/HamelHusain/status/1800315287574847701) +* [fp6](torchao/prototype/fp6_llm/) for 2x faster inference over fp16 with an easy to use wrapper api `convert_fp6_llm(model)` ## Composability From cd089dbe4851326327c6675ae377b94a5c44a028 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 12:32:10 -0700 Subject: [PATCH 16/19] push --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a749711662..eddbc96e92 100644 --- a/README.md +++ b/README.md @@ -40,7 +40,7 @@ The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama- | | int8wo | | 136.75 | 1028.38 | 7.52 | 10.42 | | | int4wo-64 | | 179.41 | 757.45 | 4.22 | 6.88 | -note: Int8 dynamic quantization works best on compute bound models like [SAM](https://github.com/pytorch-labs/segment-anything-fast) whereas Llama with batchsize=1 tends to be memory bound, thus the rather low performance. +note: Int8 dynamic quantization works best on compute bound as opposed to memory bound models. Some relatable examples might be [SAM](https://github.com/pytorch-labs/segment-anything-fast) which is memory bound vs Llama at batchsize=1 which is memory bound. For int4 we make heavy use of [tinygemm](https://github.com/pytorch/ao/blob/cb3bd8c674f2123af232a0231b5e38ddafa756a8/torchao/dtypes/aqt.py#L526) of `torch.ops.aten._weight_int4pack_mm` to bitpack into a layout optimized for tensor cores From f56fe7ec2a53f70169b20ad5178a97e1a7879586 Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 13:10:12 -0700 Subject: [PATCH 17/19] push --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index eddbc96e92..3a9007564c 100644 --- a/README.md +++ b/README.md @@ -40,7 +40,7 @@ The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama- | | int8wo | | 136.75 | 1028.38 | 7.52 | 10.42 | | | int4wo-64 | | 179.41 | 757.45 | 4.22 | 6.88 | -note: Int8 dynamic quantization works best on compute bound as opposed to memory bound models. Some relatable examples might be [SAM](https://github.com/pytorch-labs/segment-anything-fast) which is memory bound vs Llama at batchsize=1 which is memory bound. +note: Int8 dynamic quantization works best on compute bound as opposed to memory bound models. Some relatable examples might be [SAM](https://github.com/pytorch-labs/segment-anything-fast) which is compute bound vs Llama at batchsize=1 which is memory bound. 
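To make the weight-only vs dynamic distinction from the crash course concrete, here is a tiny pure-PyTorch illustration. It is only the bookkeeping, not what the torchao kernels actually do, and it uses int8 and float32 so it runs anywhere; in the README's setting the compute dtype would be bf16/fp16 and the packed dtype int4.

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 128)    # activations for an nn.Linear(128, 256)
w = torch.randn(256, 128)  # its weight

# Weight-only (wo): quantize the weight offline with one scale per group of 64 elements,
# then upcast it back at matmul time, i.e. F.linear(input, weight.to(input.dtype)).
group_size = 64
w_grouped = w.reshape(w.shape[0], -1, group_size)
w_scales = w_grouped.abs().amax(dim=-1, keepdim=True) / 127
w_int = torch.clamp(torch.round(w_grouped / w_scales), -127, 127).to(torch.int8)
w_upcast = (w_int * w_scales).reshape_as(w).to(x.dtype)
y_wo = F.linear(x, w_upcast)

# Dynamic quantization (dq): additionally quantize the activations on the fly so that,
# where hardware support exists, the matmul itself can run in the low precision dtype
# (done by dedicated int kernels in practice, not by plain F.linear on int8 tensors).
x_scales = x.abs().amax(dim=-1, keepdim=True) / 127
x_int = torch.clamp(torch.round(x / x_scales), -127, 127).to(torch.int8)
```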
For int4 we make heavy use of [tinygemm](https://github.com/pytorch/ao/blob/cb3bd8c674f2123af232a0231b5e38ddafa756a8/torchao/dtypes/aqt.py#L526) of `torch.ops.aten._weight_int4pack_mm` to bitpack into a layout optimized for tensor cores From 836b117bcb9eaa6ffccd710138c86d1eb0d447db Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 17:23:56 -0700 Subject: [PATCH 18/19] push --- README.md | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/README.md b/README.md index 3a9007564c..813c33524c 100644 --- a/README.md +++ b/README.md @@ -5,9 +5,7 @@ ## Introduction -torchao is a library to create and integrate high-performance custom data types, layouts and kernels into their PyTorch workflows with up to -* **30% speedups** for [training](#training) -* **2x speedups** with **65%** less VRAM for [inference](#inference) +torchao is a library to create and integrate high-performance custom data types, layouts and kernels into their PyTorch workflows with up to **2x speedups** with **65%** less VRAM for [inference](#inference) and support for [training](#training) All with no intrusive code changes and minimal accuracy degradation. @@ -66,13 +64,6 @@ The code change is a 1 liner with the full example available [here](torchao/spar swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear}) ``` -For VIT-L MLP shapes on a NVIDIA A100 we see the following results: - -| | act24 | dense | w24 | s24_inp_sparsify24 | s24_inp_clone | -|---------------------|-----------|-----------|----------|--------------------|---------------| -| f16 (44160,1024,4096,1024) | 11881.0 | 11534.3 | 9204.7 | 255.1 | 125.8 | - -Times are in microseconds (us). ## Newer dtypes From 7464e5556c881b7ad4d34e8430d6da66a7d0ac0f Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 18 Jun 2024 17:24:51 -0700 Subject: [PATCH 19/19] push --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 813c33524c..d0bf2ca7a1 100644 --- a/README.md +++ b/README.md @@ -55,7 +55,7 @@ In some cases we rewrote popular GenAI models to be significantly faster in nati ### Training -We've added support for semi-structured 2:4 sparsity with over 30% speedups on ViT-L +We've added support for semi-structured 2:4 sparsity with 6% end to end speedups on ViT-L The code change is a 1 liner with the full example available [here](torchao/sparsity/training/)
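If you want to see the sparsity training one-liner from the hunks above in context, here is a rough sketch. The import path and the module names in the config dict are assumptions based on the linked `torchao/sparsity/training/` example (the keys are module FQNs, like `"seq.0"` above), and the 2:4 kernels also expect fp16/bf16 weights on an Ampere or newer GPU.

```python
import torch
from torchao.sparsity.training import (  # assumed export path, per the linked example dir
    SemiSparseLinear,
    swap_linear_with_semi_sparse_linear,
)

# ViT-L style MLP block; swap the named Linear layers for their 2:4 semi-sparse versions.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda().to(torch.bfloat16)
swap_linear_with_semi_sparse_linear(model, {"0": SemiSparseLinear, "2": SemiSparseLinear})

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(128, 1024, device="cuda", dtype=torch.bfloat16)
loss = model(x).pow(2).mean()  # dummy loss; weights are sparsified on the fly in fwd/bwd
loss.backward()
opt.step()
```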