Add segment-anything-fast perf/acc benchmarks to torchao (#457)
This PR adds in segment-anything-fast evaluation to torchao, and also adds benchmarks for int8 quantization + 2:4 sparsity. 
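
For readers unfamiliar with the 2:4 pattern, here is a minimal sketch of what "2:4 sparsity" means for a row of weights: in every contiguous group of four values, at most two are nonzero. Choosing the two largest-magnitude values to keep is an assumption made for illustration; this commit does not say how the mask is selected, and `prune_2_of_4` is a hypothetical helper, not a torchao function.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each contiguous group of four.

    Illustrative only: magnitude-based selection is an assumption,
    not necessarily how the benchmarked models were pruned.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
sparse = prune_2_of_4(weights)
print(sparse)
```

Every group of four in `sparse` has exactly two zeros, which is the structure that hardware-accelerated 2:4 kernels rely on.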

With this we can run combined perf/accuracy benchmarks for segment-anything. This should give us a starting point for the relative perf vs relative acc graph for PTC. 

| Model Type | Technique                                                                                            | img/s | memory (MiB) | mIoU   | relative speedup | relative accuracy |
|------------|------------------------------------------------------------------------------------------------------|-------|--------------|--------|------------------|-------------------|
| ViT-h      | baseline (bfloat16, max-autotune)                                                                    | 22.75 | 15172        | 0.5811 |                  |                   |
|            | int8 dynamic quant (attn + mlp)                                                                      | 24.91 | 15154        | 0.5822 | **1.09x**        | **100.19%**       |
|            | 2:4 sparsity (mlp only)                                                                              | 24.81 | 15632        | 0.5672 | **1.09x**        | **97.61%**        |
|            | 2:4 sparsity (attn + mlp)                                                                            | 24.30 | 13429        | 0.5306 | **1.07x**        | **91.31%**        |
|            | int8 dynamic quant (attn)<br>int8 dynamic quant + 2:4 sparsity (mlp lin1)<br>2:4 sparsity (mlp lin2) | 26.46 | 14865        | 0.5668 | **1.16x**        | **97.54%**        |
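
The two relative columns can be recomputed from the img/s and mIoU columns. The sketch below assumes relative speedup is img/s divided by the baseline img/s and relative accuracy is mIoU divided by the baseline mIoU; the numbers are copied from the table above, and the short row label `"int8 + 2:4 composed"` is shorthand for the last row, not a name used by the scripts.

```python
# Baseline row: ViT-h, bfloat16, max-autotune.
BASELINE = {"img_s": 22.75, "miou": 0.5811}

# (img/s, mIoU) for each technique, taken from the table above.
ROWS = {
    "int8 dynamic quant (attn + mlp)": (24.91, 0.5822),
    "2:4 sparsity (mlp only)": (24.81, 0.5672),
    "2:4 sparsity (attn + mlp)": (24.30, 0.5306),
    "int8 + 2:4 composed": (26.46, 0.5668),
}


def relative_metrics(img_s, miou, baseline=BASELINE):
    """Return (speedup factor, accuracy as a percentage of baseline)."""
    speedup = img_s / baseline["img_s"]
    accuracy = 100.0 * miou / baseline["miou"]
    return speedup, accuracy


results = {name: relative_metrics(*vals) for name, vals in ROWS.items()}
for name, (speedup, accuracy) in results.items():
    print(f"{name}: {speedup:.2f}x, {accuracy:.2f}%")
```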

This just copies over the evaluation scripts. Eventually I think we should move over the modeling code too, but plan to do that in a subsequent PR.
jcaip authored Jul 2, 2024
1 parent 5d22ad2 commit f22e8e8
Showing 14 changed files with 877 additions and 159 deletions.
15 changes: 14 additions & 1 deletion README.md
@@ -52,7 +52,20 @@ And a quick crash course on inference quantization to help parse the above table

In some cases we rewrote popular GenAI models to be significantly faster in native PyTorch (that is, with no C++/CUDA), achieving what was at the time SOTA inference performance. These involve more intrusive code changes.

* 8x speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* 9.5x speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai) compared to vanilla [sam](https://github.com/facebookresearch/segment-anything).
* 1.16x speedup when composing int8 quantization with 2:4 sparsity against the accelerated baseline (`bfloat16` dtype with `torch.compile(mode="max-autotune")`).

| Model Type | Technique | img/s | memory (MiB) | mIoU (coco2017 val) | relative speedup | relative accuracy |
|------------|------------------------------------------------------------------------------------------------------|-------|--------------|---------------------|------------------|-------------------|
| ViT-h | sam (float32, eager) | 2.78 | 28806 | 0.58 | baseline | baseline |
| | sam (bfloat16, eager) | 14.85 | 14424 | 0.58 | **5.34x** | **100%** |
| | sam-fast (bfloat16, max-autotune) | 22.75 | 15172 | 0.58 | **8.18x** | **100%** |
| | int8 dynamic quant (attn + mlp) | 24.91 | 15154 | 0.58 | **8.96x** | **100%** |
| | 2:4 sparsity (mlp only) | 24.81 | 15632 | 0.57 | **8.92x** | **98%** |
| | int8 dynamic quant (attn)<br>int8 dynamic quant + 2:4 sparsity (mlp lin1)<br>2:4 sparsity (mlp lin2) | 26.46 | 14865 | 0.57 | **9.52x** | **98%** |

The relative speedup is measured purely across the image encoder (ViT) of the model, where we apply our model optimizations. Benchmarks were run on an NVIDIA A100 80GB GPU with batch_size=32.

* 10x speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* 3x speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)

137 changes: 0 additions & 137 deletions benchmarks/benchmark_sam.py

This file was deleted.

3 changes: 3 additions & 0 deletions scripts/sam/.gitignore
@@ -0,0 +1,3 @@
tmp
checkpoints
datasets
21 changes: 21 additions & 0 deletions scripts/sam/README.md
@@ -0,0 +1,21 @@
# benchmarking instructions:

Set up your environment with:
```
conda create -n saf-ao python=3.10
conda activate saf-ao
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
pip3 install git+https://github.com/pytorch-labs/segment-anything-fast.git
pip3 install tqdm fire pandas
cd ../.. && python setup.py install
```

Then download data and models by running
```
sh setup.sh
```

Finally, you can run benchmarks with
```
sh benchmark.sh
```
11 changes: 11 additions & 0 deletions scripts/sam/benchmark.sh
@@ -0,0 +1,11 @@
# baseline
python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --print_header True
# int8 dynamic quant (all)
python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --compress int8_dynamic_quant
# 2:4 sparsity (mlp only)
python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --compress sparse_mlp_only
# 2:4 sparsity (all)
python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --compress sparse
# int8 dynamic quant + 2:4 sparsity (attn: int8, mlp lin1: int8+2:4 fuse mul, mlp lin2: 2:4 sparse)
python eval_combo.py --coco_root_dir datasets/coco2017 --coco_slice_name val2017 --sam_checkpoint_base_path checkpoints --sam_model_type vit_h --point_sampling_cache_dir tmp/sam_coco_mask_center_cache --mask_debug_out_dir tmp/sam_eval_masks_out --batch_size 32 --num_workers 32 --use_compile max-autotune --use_half bfloat16 --device cuda --compress int8_dynamic_quant_sparse
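
The five invocations above differ only in the `--compress` value (and `--print_header` on the first run). As a rough sketch, the same command lines could be generated from one shared argument list; the flag names and values are copied from `benchmark.sh`, while `build_command` and `COMPRESS_MODES` are hypothetical names introduced here. The snippet only prints the commands and does not run them.

```python
import shlex

# Arguments shared by every run in benchmark.sh.
COMMON_ARGS = [
    "--coco_root_dir", "datasets/coco2017",
    "--coco_slice_name", "val2017",
    "--sam_checkpoint_base_path", "checkpoints",
    "--sam_model_type", "vit_h",
    "--point_sampling_cache_dir", "tmp/sam_coco_mask_center_cache",
    "--mask_debug_out_dir", "tmp/sam_eval_masks_out",
    "--batch_size", "32",
    "--num_workers", "32",
    "--use_compile", "max-autotune",
    "--use_half", "bfloat16",
    "--device", "cuda",
]

# None is the uncompressed baseline; the rest are the --compress values used above.
COMPRESS_MODES = [
    None,
    "int8_dynamic_quant",
    "sparse_mlp_only",
    "sparse",
    "int8_dynamic_quant_sparse",
]


def build_command(compress=None, print_header=False):
    """Assemble one eval_combo.py invocation as an argv list."""
    cmd = ["python", "eval_combo.py", *COMMON_ARGS]
    if print_header:
        cmd += ["--print_header", "True"]
    if compress is not None:
        cmd += ["--compress", compress]
    return cmd


# Only the first (baseline) run prints the header, as in benchmark.sh.
commands = [
    shlex.join(build_command(mode, print_header=(i == 0)))
    for i, mode in enumerate(COMPRESS_MODES)
]
for line in commands:
    print(line)
```

Keeping the shared flags in one place makes it harder for a comment and its `--compress` value to drift apart when new configurations are added.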
