Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA: int8 tensor core matrix multiplication #4801

Closed

Conversation

JohannesGaessler
Copy link
Collaborator

@JohannesGaessler JohannesGaessler commented Jan 6, 2024

After much trial and error I finally have a working prototype with competitive performance for CUDA matrix matrix multiplication using int8 tensor core operations. At first I tried to do the implementation in the same way as with MMQ: load quantized data blocks, do the int8 arithmetic, then scale the result with the block scales. The problem with tensor cores however, is that loading the results from them is very slow. So to get good performance you need large blocks of integers with as few floating point scales as possible. However, as discussed in #4755 , large block sizes seem to lead to potentially bad results for the hidden state. The solution that I implemented for this is to use only a single scale for a row of the hidden state but to quantize it to "double precision", i.e. to use two int8 values to represent the floating point value. The total precision is 15 bits (16-1 because you need two sign bits). With double precision twice as many int8 tensor core operations need to be performed but this seems to fix the precision issues. The weights are always transformed to single precision int8 values with a single scale per row.

Currently I get the following results in terms of t/s and perplexity for 7b using my RTX 3090:

Metric Method q4_0 q4_1 q5_0 q5_1 q8_0 q2_K q3_K_S q4_K_S q5_K_S q6_K
t/s pp512 cuBLAS FP16 GEMM 3473 3406 3303 3299 3350 3633 3592 3726 3692 3657
t/s pp512 mul_mat_q 2356 1954 2059 1799 2246 1668 1829 2128 1947 1943
t/s pp512 mul_mat_i8, 8 bit prec. 2434 2421 2393 2392 3635 2503 2495 2507 2495 2465
t/s pp512 mul_mat_i8, 15 bit prec. 2132 2126 2090 2092 2973 2181 2172 2182 2163 2153
PPL cuBLAS FP16 GEMM 5.9634 6.0013 5.8282 5.8465 5.7985 6.4201 6.2956 5.8805 5.8187 5.8085
PPL mul_mat_q 5.9685 6.0033 5.8295 5.8533 5.8002 6.4209 6.2961 5.8873 5.8243 5.8095
PPL mul_mat_i8, 8 bit prec. 6.0392 6.0647 5.9045 5.9198 6.0254 6.5537 6.4204 5.9705 5.8989 5.8892
PPL mul_mat_i8, 15 bit prec. 5.9668 6.0056 5.8330 5.8513 5.9282 6.4301 6.3021 5.8849 5.8214 5.8072

Note: the MMI8 results for all quantization formats other than q8_0 are not representative in terms of performance because only q8_0 has a fast kernel for q8_0 -> i8 conversion. For q8_0 single precision MMI8_8 is 9% faster than cuBLAS, MMI8_15 is 11% slower. Perplexity with MMI8_8 is bad, with MMI8_15 I think it's okay (other than for q8_0 where I think there is something wrong with the conversion kernel; previously it was roughly +0.002 compared to cuBLAS). Performance for q8_0 scales as follows when the batch size is varied:

cublas_vs_mmq_q8_0

MMI8_8 is generally the fastest, at small batch sizes MMI8_15 is slightly faster than cuBLAS, at large batch sizes it's a little slower. VRAM usage for 7b q8_0 with a batch size of 512 is 7798 MiB for cuBLAS and 7642 MiB for MMI8_15.

Overall I think this PR in its current state is not worth merging. For one MMI8_8 is simply not precise enough (for transformers at least). And wile there are some cases where MMI8_15 is slightly faster than cuBLAS but currently not enough to warrant the additional complexity. I think there is still potential for performance optimization though: the MMI8_15 kernel achieves only 28% tensor core utilization compared to 45% utilization for the equivalent kernel used by cuBLAS.

@slaren
Copy link
Collaborator

slaren commented Jan 6, 2024

I don't know enough about this to comment on the implementation, but if the issue is scaling the result, maybe dequantizing to fp16 and multiplying as fp16 could work. It should still be faster than cuBLAS, and it would reduce the memory requirements.

@JohannesGaessler
Copy link
Collaborator Author

I don't know enough about this to comment on the implementation, but if the issue is scaling the result, maybe dequantizing to fp16 and multiplying as fp16 could work.

I have not actually tried this but I do not expect this to work well:

  • 2 FP16 values require 33% more shared memory than 3 int8 values.
  • Integer arithmetic is faster than floating point arithmetic.
  • int8 -> FP16 conversions are pretty slow; data types with 4 bytes are in my experience the fastest. For dequantize_mul_mat_vec this doesn't matter as much because you're I/O bound anyways.
  • Adding complicated instructions for data loading increases register pressure, which is already a major problem.
  • Ampere and newer can do asynchronous memcpys from global to shared memory. This would allow the same warps to do computations and data copies at the same time but only if the data does not need to be altered upon load (currently not used).

Not directly the same but in a previous version I did the q8_0 -> int8 conversion in the matrix multiplication kernel. This alone took up ~25% of the kernel's runtime.

On a more general note, now that I think about it my kernel is mostly just doing regular int8 matrix multiplication with some scaling at the end. But maybe it'd be possible to somehow use cuBLAS GEMM and add a small kernel for scaling afterwards; I'll investigate.

@JohannesGaessler
Copy link
Collaborator Author

Okay, so apparently cuBLAS allows you to do (int8, int8) -> int32 matrix multiplications. But it seems that will not be of use to us. You can do two individual matrix multiplications to replace the kernel I wrote but that is slower even without an extra kernel for scaling afterwards. What we would need is an operation $C = A (B_\mathrm{low} + 128 B_\mathrm{high})$ but I don't see a corresponding cuBLAS routine. There is batched matrix multiplication but it does not support per-matrix alpha values and the matrices $C_i$ must explicitly not overlap.

@cmp-nct
Copy link
Contributor

cmp-nct commented Jan 7, 2024

Quite a stunning speedup, looks like for anything below RTX 40 series this is a winner already.

In case you are interested, a few weeks ago (#4310) I had outlined to gg how to do native FP8 (which is twice the speed of FP16) in llama.cpp by using wrappers to transparently "force" it into ggml-cuda.cu while using the new cublasLt API which is required for fp8.
It's not meant as solution but kernels and code all work, the wrapper approach allows an easy test without changing too much of the core code (given the different cuda API)

@sorasoras
Copy link

I wonder what would this work for rdna3 wmma. Would like to take advantage of wmma

@JohannesGaessler
Copy link
Collaborator Author

After the bugfix in #4809 prompt processing on master has become faster, particularly so for cuBLAS. This has raised the bar that this PR will need to meet:

cublas_vs_mmq_q8_0

(I think I did something wrong when I measured MMQ performance for the previous plot; in this one the performance is much better.)

@Dampfinchen
Copy link

Well speedwise, according to that graph, MMI8_8 appears to be almost on par with cuBLAS, which makes it very worth it in my opinion as it does save quite a bit of memory. Sadly though, perplexity seems noticeably higher than the other methods.

@ggerganov ggerganov added the demo Demonstrate some concept or idea, not intended to be merged label Jan 13, 2024
@JohannesGaessler
Copy link
Collaborator Author

I'm slowly making progress. Square tensor core fragments seem to work better than the thin fragments I was using before. Also asychronous data copies seem to indeed be faster than regular data copies (even without interleaving compute and memory I/O). Unfortunately for best performance it seems to be necessary to copy a block of at least 512 bytes though (currently 128 bytes in this PR).

I did some more performance optimizations for cuBLAS on master but I don't expect there to be more so cuBLAS performance should now stay constant (unless full FP16 model evaluation is implemented). Currently MMI8_8 is again slightly faster than cuBLAS while MMI8_15 is about the same speed as cuBLAS prior to the performance optimizations:

cublas_vs_mmq_q8_0

@JohannesGaessler
Copy link
Collaborator Author

I think I've cracked the code. The problem with tensor cores is that while each fragment is technically just a struct with individual values you have no guarantees regarding what those values are; which specific values each thread receives varies by GPU architecture. But you can apply scalar operations to these values and the effect will be the same for all GPU architectures because each value is treated the same. So you can (in addition to the scales per row/column) define scales per 16x16 or 32x8 tensor core fragment. Then you only have to apply these scales per fragment without any expensive calls to store_matrix_sync.

I have a prototype for single precision mmi8 with per-fragment scales that gets 5.8107 PPL with q8_0 (vs. 5.7985 PPL with cuBLAS). However, among the repeating weight tensors attn_k, attn_k, attn_output, ffn_gate, and ffn_down seem to do fine with single precision; if I use single precision only for those I get 5.8007 PPL.

@jammm
Copy link
Contributor

jammm commented Jan 17, 2024

Hey @JohannesGaessler,
For AMD support of mma.h, I would recommend using rocWMMA https://github.com/ROCm/rocWMMA
It offers the same interface as mma.h so you won't have to modify your code much, but it'll also enable WMMA support for RDNA3 GPUs. Generally you just have to alias the namespace nvcuda or rocwmma to something common like wmma and use it to be cross-vendor compatible between AMD and NVIDIA. CC @cgmillette

For example:

#if defined( __CUDACC__ )
#include <mma.h>
namespace wmma = nvcuda::wmma;
#elif defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__) || defined(__gfx1103__) // all RDNA3 GPUs
#include <rocwmma/rocwmma.hpp>
namespace wmma = rocwmma;
#endif

Another option is to use LLVM builtins directly for RDNA3 WMMA support. There's generally a 1:1 mapping between mma.h /rocWMMA and LLVM builtins (except the loading/storing part, which can get complicated depending on how much performance you want to squeeze out of it). I have sample code for that on my blog
(just for fp16 tho) https://gpuopen.com/learn/wmma_on_rdna3/ though IMO I'd recommend using rocWMMA as you only really have to change a couple lines of code to get it working. And it also does optimized loads/stores for you.

ggml-cuda.cu Outdated
typedef nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, 16, 16, 16, int8_t, nvcuda::wmma::col_major> frag_thin_b;
typedef nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16, 16, 16, int> frag_thin_c;

typedef cuda::barrier<cuda::thread_scope::thread_scope_block> cuda_barrier;
Copy link
Contributor

@jammm jammm Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may break compatibility with ROCm in case rocWMMA is used for AMD RDNA3 GPU support. Could there be a portable workaround using pure C++ device code w/ builtin sync primitives rather than something libcudacxx specific?

@JohannesGaessler
Copy link
Collaborator Author

I can try to make it so that AMD tensor cores can be used but I can't guarantee that the performance would be good enough to make it worthwhile.

@sorasoras
Copy link

Good, can't wait to test that on my 7900xtx

@jammm
Copy link
Contributor

jammm commented Jan 17, 2024

I can try to make it so that AMD tensor cores can be used but I can't guarantee that the performance would be good enough to make it worthwhile.

Thanks! It's still worth keeping the support, perhaps behind a feature flag that's set to false by default in case this isn't fast enough for the time being.

@JohannesGaessler
Copy link
Collaborator Author

@jammm one thing that I had to do to make the implementation is using cuda::memcpy_async. This allows you to copy data directly from global memory to shared memory without going through registers, thus reducing the number of registers used by the kernel. Does ROCm have an equivalent?

@sorasoras
Copy link

sorasoras commented Jan 17, 2024

@jammm one thing that I had to do to make the implementation is using cuda::memcpy_async. This allows you to copy data directly from global memory to shared memory without going through registers, thus reducing the number of registers used by the kernel. Does ROCm have an equivalent?

I think I found it,
it's called
hipMemcpyAsync

https://docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html#gaccf359cb35ce1887e6250c09e115e9a2

@JohannesGaessler
Copy link
Collaborator Author

No, that seems to be for RAM<->VRAM or VRAM<->VRAM copies, what I need is an VRAM->SRAM copy.

@JohannesGaessler
Copy link
Collaborator Author

I've pushed a prototype that should perform at least as well as cuBLAS:

cublas_vs_mmq_q8_0

What's very annoying is that I've gone back to thin tiles in order to get more fragment scales per number of hidden state elements but that makes it much harder to write a kernel for (4096x4096) * (4096x512) matrix multiplication that does not massively suffer from tail effects. My 3090 currently does 1.04 waves for each such matrix multiplication which means it's effectively idling 50% of the time.

@sorasoras
Copy link

No, that seems to be for RAM<->VRAM or VRAM<->VRAM copies, what I need is an VRAM->SRAM copy.

Not really, cuda​Memcpy​Async direct replacement is hipmemcpyasync according to the documentation
https://rocmdocs.amd.com/projects/HIPIFY/en/latest/tables/CUDA_Runtime_API_functions_supported_by_HIP.html
It should be a drop in replacement.

@JohannesGaessler
Copy link
Collaborator Author

Yes, but I'm not asking about a replacement for cudaMemcpyAsync but for a replacement for cuda::memcpy_async which are two entire different things.

@JohannesGaessler
Copy link
Collaborator Author

It was very tedious but I managed to tune the kernel in such a way that prompt processing with a batch size of 512 or more is fast:

cublas_vs_mmq_q8_0

The implementation is kind of awkward with each of the 8 warps working on 13 tensor core fragments but with this number you get up to 98% GPU utilization on an RTX 3090. What's annoying is that strictly speaking you would need to tune this on a GPU by GPU basis. So I think I'll make versions of the kernel where the dimensions are just powers of 2 as the baseline and some specialized kernels that run well specifically on an RTX 3090 (ti). Then at runtime evaluate a heuristic to estimate which kernel would have the best performance for the given matrix dimensions.

Current performance:

GPU Model Batch size Test t/s master t/s cuda-mmqf-22 Speedup
RTX 3090 llama 7B Q8_0 512 pp2048 3466.34 3814.23 1.10
RTX 3090 llama 7B Q8_0 1024 pp2048 3905.44 3993.58 1.02
RTX 3090 llama 7B Q8_0 2048 pp2048 3808.43 3876.39 1.02

@jammm
Copy link
Contributor

jammm commented Jan 17, 2024

@jammm one thing that I had to do to make the implementation is using cuda::memcpy_async. This allows you to copy data directly from global memory to shared memory without going through registers, thus reducing the number of registers used by the kernel. Does ROCm have an equivalent?

I believe (at least for RDNA GPUs) there's no HW support for that. Copying data from global memory to LDS (AMD term for shared memory) without using intermediate VGPRs (registers) isn't supported. So as a HIP specific workaround you may want to just copy manually using pointers or perhaps use a plain memcpy.

Yes, but I'm not asking about a replacement for cudaMemcpyAsync but for a replacement for cuda::memcpy_async which are two entire different things.

Right, it seems you're looking for a function that can be called from within your device kernel to copy from global memory to LDS and not a HIP runtime function hipMemcpyAsync that dispatches a separate kernel for asynchronous copies. Sadly, AFAIK, we don't have hardware support for the former at the moment.

EDIT: MI300 ISA does indeed have support for direct loads from global to LDS (section 9.1.9 https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf)
For RDNA GPUs however, this doesn't seem to be mentioned. The RDNA3 ISA doc section 12.1.1 mentions that some loads do support direct loads from global memory to LDS but it's not clear which load it refers to. https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf and AFAIK, I'm not aware of any way to do such a load using HIP on RDNA.

ggml-cuda.cu Outdated
const int nrows_y, const int nrows_dst) {

// #if __CUDA_ARCH__ >= CC_VOLTA && !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__))
typedef nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, 32, 8, 16, int8_t, nvcuda::wmma::row_major> frag_thin_a;
Copy link
Contributor

@jammm jammm Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re. AMD RDNA3 support - only 16x16x16 sizes are supported for all of matrix_a, matrix_b and accumulator.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a problem. I'm specifically using 32x8 fragments because that way I can define more integer scales for the b fragments so the precision is better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Could 32x16 work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I previously had an implementation with 16x16 fragments. It does work but the precision is worse. The increase in perplexity with single precision MMI8 compared to cuBLAS FP16 GEMM was ~40% higher.

Copy link
Contributor

@jammm jammm Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noob question - so the higher perplexity is because 16 scales don't provide enough precision compared to 32 scales? if so, then I wonder if 32x16 is worth a try (two 16x16's somehow, reusing the accumulator from the first to second). But I'm unsure of how it would affect the perplexity score.

Copy link
Collaborator Author

@JohannesGaessler JohannesGaessler Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trick to making the current single-precision implementation work is to calculate additional scales for the "b" fragments, i.e. the fragments of the hidden state. This makes it so that large values in the hidden state only affect the precision of a few values rather than all of the values in a column. So it is better to use small b fragments because you can scale the values with a finer granularity. 32x16 and 16x8 is better than 16x16 and 16x16 because the b fragments are smaller (the a fragment size does not matter). So 32x16 would in fact be worse than 16x16.

ggml-cuda.cu Outdated
const int j_tile = j + threadIdx.x/4;
const int k_tile = WARP_SIZE + threadIdx.x % 4;

cuda::memcpy_async(&tile_y_qs[j_tile*(WARP_SIZE + MMI8_PADDING_Y) + k_tile],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re. RDNA3 support - use normal memcpy or raw pointer copy

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On AMD cards, how many registers are available per streaming multiprocessor (or whatever the equivalent is)? I'm asking because memcpy_async is crucial for reducing register pressure by copying data directly from VRAM to shared memory without going through registers.

Copy link
Contributor

@jammm jammm Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming a SM is equivalent to an AMD compute unit (CU), each CU has 2 SIMD32's and each SIMD32 has 1536 VGPRs (registers) for navi31 and navi32 (gfx1100/gfx1101). So each CU will have 3072 VGPRs. Each shader can have a max. of 256 VGPRs allocated to it. Assuming CU mode (which is the default for RDNA when compiling via. hipcc), for full occupancy of 16 waves per SIMD32, you'd want to be at or below 96 VGPRs. For more info see section 3.3.2.1 of RDNA3 ISA docs https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf

FWIW, navi31/32 of RDNA3 have 50% more registers than navi2 of RDNA2. gfx1102, gfx1103 (Radeon 780m) has 1024 VGPRs per SIMD, similar to navi2.

You can compile using --save-temps to get the disassembly in .s file(s) that contain metrics for register pressure, LDS usage, scratch usage etc.

ggml-cuda.cu Outdated
nvcuda::wmma::mma_sync(fc_tmp, fa, fb, fc_tmp);
#pragma unroll
for (int l = 0; l < 32*8/WARP_SIZE; ++l) {
fc[j/8].x[l] += bs * fc_tmp.x[l];
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jammm here I'm directly accessing the data that each thread holds and scaling it. Is this something that you can do on AMD?

Copy link
Contributor

@jammm jammm Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if you use rocWMMA, this should just work. See https://github.com/ROCm/rocWMMA/blob/7f00ff71f1d4536a86c94a963d7224a2ca9516f6/library/include/rocwmma/rocwmma.hpp#L247-L251

But I would recommend using fc.num_elements instead of 32*8/WARP_SIZE as the number of matrix elements per fragment can vary across vendors and archs. FWIW, make sure you only exchange data between fragments of the same type. Here both fc and fc_tmp are of accumulator type, so it should be fine.

@JohannesGaessler
Copy link
Collaborator Author

I did some more performance optimizations and added a few kernel variants with smaller tile sizes that work better for small batch sizes:

cublas_vs_mmq_q8_0

The performance is universally better than cuBLAS. I will now focus on getting this PR into a state where it can be merged. In particular:

  • Fast conversion to int8 for quantization formats other than q8_0.
  • Add some way to configure the matrix multiplication algorithm at runtime. I think it would make sense to offer something like mmi8_mixed where the algorithm choice is made based on how sensitive the tensor is to hidden state precision.
  • Maybe add an option LLAMA_FAST_COMPILE that only compiles a single kernel version so the compilation time is faster (but of course performance is worse). On my system teh compilation time has increased from ~20 s to ~40 s.
  • I will not implement double precision MMI8 in this PR.
  • I will not implement AMD support. I would need to write the kernel quite differently to make it compatible with AMD and the precision loss would be worse because AMD does not support 16x8 fragments. Also without access to compatible AMD hardware I will not be able to tune the performance in any meaningful way so it would probably be slower than regular BLAS anyways.

@slaren
Copy link
Collaborator

slaren commented Jan 20, 2024

  • On my system teh compilation time has increased from ~20 s to ~40 s.

This is definitely a problem. We need to be more careful about not instantiating too many combinations of the same function, in many cases the performance difference is meaningless.

@JohannesGaessler JohannesGaessler mentioned this pull request Jan 22, 2024
8 tasks
@Dampfinchen
Copy link

Dampfinchen commented Jan 22, 2024

  • On my system teh compilation time has increased from ~20 s to ~40 s.

This is definitely a problem. We need to be more careful about not instantiating too many combinations of the same function, in many cases the performance difference is meaningless.

The amount of VRAM and general bloat saved by Johannes Matmul kernels in comparison to cuBLAS is not meanigless however. This benefit far outweights longer compilation time in my opinion.

@slaren
Copy link
Collaborator

slaren commented Jan 22, 2024

This benefit far outweights longer compilation time in my opinion.

Easy to say when you are not the one working on the code. Anyway, that's not really relevant to this issue.

@JohannesGaessler
Copy link
Collaborator Author

I ran some tests using the new KL divergence code added in #5076 :

Model ln PPL ratio cuBLAS ln PPL ratio MMI8_8 KLD cuBLAS KLD MMI8_8
LLaMA 2 q4_0 0.02857 +- 0.00067 0.02966 +- 0.00068 0.02884 +- 0.00030 0.02986 +- 0.00031
LLaMA 2 q4_1 0.03492 +- 0.00067 0.03620 +- 0.00068 0.03131 +- 0.00019 0.03253 +- 0.00019
LLaMA 2 q5_0 0.00580 +- 0.00040 0.00754 +- 0.00044 0.00950 +- 0.00022 0.01065 +- 0.00026
LLaMA 2 q5_1 0.00891 +- 0.00034 0.01073 +- 0.00039 0.00755 +- 0.00007 0.00931 +- 0.00015
LLaMA 2 q8_0 0.00062 +- 0.00011 0.00269 +- 0.00023 0.00038 +- 0.00001 0.00261 +- 0.00017
LLaMA 2 q2_K 0.10241 +- 0.00129 0.10414 +- 0.00129 0.10888 +- 0.00064 0.11002 +- 0.00065
LLaMA 2 q3_K_S 0.08283 +- 0.00112 0.08412 +- 0.00112 0.08627 +- 0.00050 0.08714 +- 0.00050
LLaMA 2 q4_K_S 0.01462 +- 0.00051 0.01629 +- 0.00053 0.01782 +- 0.00011 0.01925 +- 0.00017
LLaMA 2 q5_K_S 0.00410 +- 0.00033 0.00556 +- 0.00036 0.00704 +- 0.00006 0.00844 +- 0.00012
LLaMA 2 q6_K 0.00241 +- 0.00019 0.00356 +- 0.00027 0.00210 +- 0.00001 0.00406 +- 0.00016

Measured in KL divergence using MMI8_8 for all tensors has roughly the same precision loss as using q6_K instead of FP16 (but I don't know if you can just treat it as additive). For q5_K_S and below the precision loss from MMI8_8 seems to already be small compared to the precision loss from quantization.

@JohannesGaessler JohannesGaessler mentioned this pull request Jan 23, 2024
@sorasoras
Copy link

The incoming RDNA4 gpu got most of stuff for INT8 matrix multiplications.
This could be interesting to you I guess.
https://chipsandcheese.com/2024/01/28/examining-amds-rdna-4-changes-in-llvm/

@JohannesGaessler
Copy link
Collaborator Author

This PR has become obsolete.

@Djip007
Copy link
Contributor

Djip007 commented Oct 25, 2024

The incoming RDNA4 gpu got most of stuff for INT8 matrix multiplications. This could be interesting to you I guess. https://chipsandcheese.com/2024/01/28/examining-amds-rdna-4-changes-in-llvm/

We have it on RDNA3 to https://gpuopen.com/learn/wmma_on_rdna3/ (INT8/INT4/FP16/BF16 ...)

@JohannesGaessler or other
what "replace" this PR?

@JohannesGaessler
Copy link
Collaborator Author

This PR has been obsoleted by the kernels in ggml/src/ggml-cuda/mmq.cuh making use of the primitives in ggml/src/ggml-cuda/mma.cuh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
demo Demonstrate some concept or idea, not intended to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants