v0.2.0

@cpuhrsch cpuhrsch released this 20 May 20:52
· 708 commits to main since this release
f0f00ce

What's Changed

Highlights

Custom CPU/CUDA extension to ship CPU/CUDA binaries.

PyTorch core recently shipped a new custom op registration mechanism in torch.library. Its main benefit is that custom ops compose with as many PyTorch subsystems as possible, most notably by NOT graph breaking under torch.compile().
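As a rough illustration of what registering an op through torch.library looks like, here is a minimal Python-only sketch. The namespace `release_notes_demo`, the op `scaled_add`, and the helper `call_op` are hypothetical names made up for this example. Kernels shipped in torchao are written in C++/CUDA, so treat the Python function below as a stand-in for a real kernel rather than the torchao implementation.

```python
import torch

# Hypothetical namespace and op name, chosen only for this illustration.
lib = torch.library.Library("release_notes_demo", "DEF")

# Declare the op schema so the dispatcher knows its signature.
lib.define("scaled_add(Tensor a, Tensor b, float alpha) -> Tensor")


def scaled_add_impl(a, b, alpha):
    # Stand-in for a real CPU/CUDA kernel: plain tensor arithmetic.
    return a + alpha * b


# Register a single implementation that is used for every backend.
lib.impl("scaled_add", scaled_add_impl, "CompositeExplicitAutograd")


def call_op(a, b):
    # Registered ops are exposed under torch.ops.<namespace>.<op_name>.
    return torch.ops.release_notes_demo.scaled_add(a, b, 0.5)
```

Calling `call_op(torch.ones(3), torch.ones(3))` goes through the dispatcher like any built-in op. Because the op is registered with a schema, it is the kind of call that torch.compile() is expected to trace through instead of graph breaking, although this sketch does not exercise torch.compile() itself.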

We've added documentation on how to register your own custom ops at https://github.com/pytorch/ao/tree/main/torchao/csrc. If you learn better by example, you can follow PR #135 to add your own custom ops to torchao.

Most notably, @gau-nernst used these instructions to integrate new custom ops for fp6 support (#223).

One key benefit of integrating your kernels directly in torchao is that, thanks to our manylinux GPU support, we can ensure that the CPU/CUDA kernels you add will work on as many devices and CUDA versions as possible (#176).

A lot of prototype and community contributions

@jeromeku was our community champion, merging support for:

  1. GaLore, our first pretraining kernel, which allows you to finetune llama 7b on a single 4090 card with up to 70% speedups relative to eager PyTorch
  2. DoRA, which has been shown to yield better fine-tuning accuracy than QLoRA. This is an area where the community can help us benchmark more thoroughly https://github.com/pytorch/ao/tree/main/torchao/prototype/dora
  3. Fused int4/fp16 quantized matmul, which is particularly useful for compute-bound workloads, showing 4x speedups over tinygemm at larger batch sizes such as 512 https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq

@gau-nernst merged fp6 support, showing up to 8x speedups over an fp16 baseline for small batch size inference (#223)

NF4 support for upcoming FSDP2

@weifengpy merged support for composing FSDP2 with NF4, which makes it easy to implement algorithms like QLoRA + FSDP without writing any CUDA or C++ code. This work also provides a blueprint for composing smaller dtypes with FSDP (#150), most notably by implementing torch.chunk(). We hope the broader community uses this work to experiment more heavily at the intersection of distributed and quantization research, and that it inspires many more studies such as the one done by Answer.ai: https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
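To give a feel for why chunking matters when sharding a quantized tensor, here is a small pure-Python sketch. It is not the NF4Tensor implementation. `BlockQuantized`, `quantize`, `dequantize`, and `chunk` are hypothetical names, and the uniform 16-level codebook is a stand-in for the real NF4 codebook. The point it tries to show is that a block-quantized tensor carries per-block scales next to its 4-bit codes, so splitting it into shards has to cut both on block boundaries.

```python
from dataclasses import dataclass
from typing import List

# Uniform 16-level codebook in [-1, 1]: a stand-in, NOT the real NF4 codebook.
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


@dataclass
class BlockQuantized:
    codes: List[int]     # one 4-bit code (0..15) per element
    scales: List[float]  # one absmax scale per block
    block_size: int


def quantize(values: List[float], block_size: int) -> BlockQuantized:
    assert len(values) % block_size == 0, "length must be a multiple of block_size"
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        for v in block:
            x = v / scale
            # Pick the index of the nearest codebook level.
            codes.append(min(range(16), key=lambda i: abs(LEVELS[i] - x)))
    return BlockQuantized(codes, scales, block_size)


def dequantize(q: BlockQuantized) -> List[float]:
    return [LEVELS[c] * q.scales[i // q.block_size] for i, c in enumerate(q.codes)]


def chunk(q: BlockQuantized, num_chunks: int) -> List[BlockQuantized]:
    # Split on block boundaries so every shard keeps the scales of its own blocks.
    num_blocks = len(q.scales)
    assert num_blocks % num_chunks == 0, "this sketch only handles even splits"
    per_chunk = num_blocks // num_chunks
    shards = []
    for k in range(num_chunks):
        b0, b1 = k * per_chunk, (k + 1) * per_chunk
        shards.append(BlockQuantized(
            codes=q.codes[b0 * q.block_size:b1 * q.block_size],
            scales=q.scales[b0:b1],
            block_size=q.block_size,
        ))
    return shards
```

Dequantizing the shards and concatenating the results gives the same values as dequantizing the unsharded tensor. That is the property a sharding scheme needs if each rank is to hold only its own slice of the quantized weights.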

BC breaking

Deprecations

New Features

Improvements

  • FSDP2 support for NF4Tensor (#118, #150, #207)
  • Add save/load of int8 weight only quantized model (#122)
  • Add int_scaled_mm on CPU (#121)
  • Add cpu and gpu in int4wo and int4wo-gptq quantizer (#131)
  • Add torch.export support to int8_dq, int8_wo, int4_wo subclasses (#146, #226, #213)
  • Remove is_gpt_fast specialization from GPTQ (#172)
  • Common benchmark and profile utils (#238)

Bug fixes

  • Fix padding in GPTQ (#119, #120)
  • Fix Int8DynActInt4WeightLinear module swap (#151)
  • Fix NF4Tensor.to to use device kwarg (#158)
  • Fix quantize_activation_per_token_absmax perf regression (#253)

Performance

  • Chunk NF4Tensor construction to reduce memory spike (#196)
  • Fix intmm benchmark script (#141)

Docs

CI

Not user facing

Security

Untopiced

  • Version bumps (#125, #234)
  • Don't import _C in fbcode (#218)

New Contributors

Full Changelog: v0.2.0...v0.2.1

We were able to close about half of the tasks planned for 0.2.0; the remaining tasks will spill over into upcoming releases. We will post a list for 0.3.0 next, which we aim to release at the end of May 2024. We plan to follow a monthly release cadence until further notice.