Update README.md benchmarks (#438)
Correcting the old numbers in the main README with the correct numbers from the quantization README.
HDCharles authored Jun 25, 2024
1 parent 70aef5d commit 505edc1
Showing 1 changed file with 11 additions and 9 deletions.
README.md: 20 changes (11 additions & 9 deletions)
@@ -29,15 +29,17 @@ The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama-

| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ----------- | ------------------ | ------------------- | ------------- | ----------------------- | ---------------- | --------------- |
-| Llama-2-7B | Base (bfloat16) | 12.212 | 105.02 | 1387.78 | 13.21 | 13.90 |
-| | int8dq | 12.262 | 9.40 | 62.26 | 6.62 | 8.61 |
-| | int8wo | 12.204 | 147.03 | 973.54 | 6.62 | 8.95 |
-| | int4wo-64 | 12.843 | 199.81 | 746.45 | 3.74 | 4.75 |
-| | int4wo-64-GPTQ | 12.489 | 199.81 | 746.45 | 3.74 | 4.75 |
-| Llama-3-8B | Base (bfloat16) | | 94.91 | 1424.58 | 15.01 | 16.43 |
-| | int8dq | | 8.41 | 63.23 | 7.52 | 9.24 |
-| | int8wo | | 136.75 | 1028.38 | 7.52 | 10.42 |
-| | int4wo-64 | | 179.41 | 757.45 | 4.22 | 6.88 |
+| Llama-2-7B | Base (bfloat16) | 12.212 | 105.14 | 1389.35 | 13.88 | 13.21 |
+| | int8dq | 12.262 | 9.20 | 60.93 | 8.33 | 6.62 |
+| | int8wo | 12.204 | 150.18 | 994.40 | 8.95 | 6.62 |
+| | int4wo-64 | 12.843 | 199.86 | 746.66 | 4.50 | 3.74 |
+| | int4wo-64-GPTQ | 12.489 | 199.86 | 746.66 | 4.50 | 3.74 |
+| | autoquant | 12.204 | 159.22 | 1069.87 | 8.91 | 6.72 |
+| Llama-3-8B | Base (bfloat16) | N/A | 94.97 | 1425.55 | 16.43 | 15.01 |
+| | int8dq | N/A | 8.44 | 63.45 | 8.98 | 7.52 |
+| | int8wo | N/A | 139.76 | 1051.02 | 10.42 | 7.52 |
+| | int4wo-64 | N/A | 179.44 | 757.60 | 6.62 | 4.22 |
+| | autoquant | N/A | 137.71 | 1037.74 | 11.08 | 7.54 |
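
For context, the technique labels in the table correspond to torchao's one-line quantization entry points. Below is a minimal sketch of how a model might be quantized before benchmarking, assuming torchao's `quantize_` API; the import paths and function names here are assumptions and have shifted across torchao releases, so treat it as illustrative rather than the exact benchmark script.

```python
# Illustrative sketch only: quantize_, int8_weight_only, and int4_weight_only
# are assumed torchao entry points; exact names vary between releases.
import torch
from torchao.quantization import quantize_, int8_weight_only, int4_weight_only

# Stand-in module for a bfloat16 model such as meta-llama/Llama-2-7b-chat-hf
# (the "Base (bfloat16)" rows).
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

quantize_(model, int8_weight_only())                 # "int8wo" rows
# quantize_(model, int4_weight_only(group_size=64))  # "int4wo-64" rows
```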

note: Int8 dynamic quantization works best on compute-bound models, as opposed to memory-bound ones. A relatable example is [SAM](https://github.com/pytorch-labs/segment-anything-fast), which is compute bound, versus Llama at batch size 1, which is memory bound.
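
The throughput numbers track that explanation: in the memory-bound regime each decoded token streams roughly the full set of weights, so tokens/second is approximately memory bandwidth divided by model size. A quick back-of-the-envelope check against the Llama-2-7B rows (my arithmetic, not part of the commit):

```python
# Memory-bound rule of thumb: tokens/s ~= achieved bandwidth / model size,
# since each decoded token reads (roughly) every weight once.
# Bandwidth and size figures are taken from the table above.
for name, bandwidth_gbs, size_gb in [
    ("bfloat16", 1389.35, 13.21),
    ("int8wo", 994.40, 6.62),
    ("int4wo-64", 746.66, 3.74),
]:
    print(f"{name}: ~{bandwidth_gbs / size_gb:.1f} tokens/s")
# bfloat16: ~105.2, int8wo: ~150.2, int4wo-64: ~199.6 tokens/s,
# close to the measured 105.14, 150.18, and 199.86.
```

This is also why the weight-only rows speed up roughly in proportion to the shrink in model size, while int8dq, which quantizes activations dynamically at runtime, can end up slower in this memory-bound setting.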

