
Another speed gain for Q4_0 and Q4_1 on Metal #2375

Merged · 2 commits into master · Jul 25, 2023

Conversation

ikawrakow (Contributor)

In #2279 there was some debate on whether it is better to have each thread in a Metal SIMD group process whole Q4_0 or Q4_1 blocks, or have them process half blocks.

On my M2 Max with a 30-core GPU, having threads process half a block is definitely faster; see the table below. I'm curious to see whether this change also improves (or worsens) performance on the M1 Pro/Max.

TG-128 in ms/token on a 30-core M2 Max. The command used was:

./main -m $model -p "I believe the meaning of life is" --ignore-eos -n 128 -s 1234 --no-mmap -t 8 -ngl 1

| Quantization | t/s Master | t/s PR | speedup |
|---|---|---|---|
| Q4_0 - 7B  | 18.6  | 17.6  | 5.7%  |
| Q4_1 - 7B  | 20.0  | 18.9  | 5.8%  |
| Q4_0 - 13B | 32.5  | 30.1  | 8.0%  |
| Q4_1 - 13B | 33.7  | 32.2  | 4.7%  |
| Q4_0 - 33B | 76.5  | 67.2  | 11.4% |
| Q4_1 - 33B | 82.3  | 75.3  | 9.3%  |
| Q4_0 - 65B | 146.8 | 133.4 | 10.0% |
| Q4_1 - 65B | 152.8 | 145.4 | 5.1%  |
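
For context on the granularity being discussed, below is a minimal plain-C sketch of the ggml Q4_0 block layout (32 weights per block: one scale plus 16 bytes of packed 4-bit quants) and of a dot product split into two independent 16-weight halves, which is the per-thread unit of work this PR argues for. The plain float scale (ggml stores fp16), the function names, and the particular byte-wise half split are illustrative assumptions, not the actual Metal kernel code.

```c
#include <stdint.h>
#include <stdio.h>

#define QK4_0 32

typedef struct {
    float   d;              /* per-block scale (fp16 in ggml; plain float here for brevity) */
    uint8_t qs[QK4_0 / 2];  /* byte j packs quant j (low nibble) and quant j + 16 (high nibble) */
} block_q4_0;

/* Partial dot product of one half of a block (8 bytes = 16 weights) with the
 * matching entries of y. half = 0 covers bytes 0..7, half = 1 covers bytes 8..15. */
static float dot_q4_0_half(const block_q4_0 *b, const float *y, int half) {
    const int off = 8 * half;
    float sum = 0.0f;
    for (int j = 0; j < 8; ++j) {
        const uint8_t byte = b->qs[off + j];
        const int qlo = (byte & 0x0F) - 8;  /* weight at index off + j      */
        const int qhi = (byte >> 4)   - 8;  /* weight at index off + j + 16 */
        sum += qlo * y[off + j] + qhi * y[off + j + 16];
    }
    return sum * b->d;
}

int main(void) {
    block_q4_0 b = { .d = 0.25f };
    float y[QK4_0];
    for (int i = 0; i < QK4_0; ++i) y[i] = 1.0f;
    for (int j = 0; j < QK4_0 / 2; ++j) b.qs[j] = 0x9A;  /* arbitrary packed nibbles */

    /* Two "threads" each take one half block; in the Metal kernel the per-thread
     * partial sums are then combined with a SIMD-group reduction. */
    const float partial0 = dot_q4_0_half(&b, y, 0);
    const float partial1 = dot_q4_0_half(&b, y, 1);
    printf("dot = %.3f\n", partial0 + partial1);
    return 0;
}
```

The half split chosen here (bytes 0..7 vs. 8..15 of qs) is only one possible partition; the point is just that two threads can each handle 16 weights of the same block independently and combine their partial sums afterwards.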

@lshzh-ww (Collaborator)

Test on an M1 Max with a 32-core GPU:

| Quantization | t/s Master | t/s PR | speedup |
|---|---|---|---|
| Q4_0 - 7B  | 19.18 | 18.07 | 6.1% |
| Q4_0 - 33B | 73.9  | 70.9  | 4.2% |

The performance improvement is solid. I am ashamed that I didn't test intensively last time; we would have lost quite a bit of performance if it weren't for you.

@ikawrakow (Contributor, Author)

> I am ashamed that I didn't test intensively last time; we would have lost quite a bit of performance

@lshzh-ww That doesn't make sense. Your contributions in #2188 and #2248 were absolutely essential for improving Metal performance. Before you came along with #2188 we were basically stuck at 21.5 ms/t on M2 Max for the 7B model. This kind of collaborative progress is the whole point of open source!

//Note: This is a template, but strictly speaking it only applies to
// quantizations where the block size is 32. It also does not
// guard against the number of rows not being divisible by
// N_DST, so this is another explicit assumption of the implementation.
Collaborator

I think this template also works on matrices whose number of rows is not divisible by N_DST. At the very end of the template we have `if (tiisg == 0 && first_row + row < ne01)`, which guarantees that we don't write into another tensor's region when there are fewer than N_DST rows left.

I checked using a WizardLM model, which has 32001 rows. This template generated exactly the same results as llama.cpp before #2188 was merged.

Contributor (Author)

Well, yes, there is a guard against writing outside the bounds of the result. But there is no guard against reading outside the bounds of the quantized tensor. We don't get a segfault because we only step out slightly and the address being accessed is within the process address space, but there is no guarantee that this will always be the case.
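
To make the write-guard versus read-guard distinction concrete, here is a hedged plain-C sketch (not the Metal template itself) of a row loop that guards its writes the way `if (first_row + row < ne01)` does, but leaves its reads unguarded. The value N_DST = 4, the function and array names, and the padded buffer in `main` are assumptions for illustration only.

```c
#include <stdio.h>

#define N_DST 4  /* rows handled together per group (the value 4 is assumed here) */

/* Each call mimics one group: it starts at first_row and walks N_DST rows.
 * Writes are guarded like `first_row + row < ne01`; reads are not. */
static void process_group(const float *src, float *dst, int first_row, int nrows, int row_len) {
    for (int row = 0; row < N_DST; ++row) {
        float sum = 0.0f;
        /* Unguarded reads: when first_row + row >= nrows this walks past the
         * end of the source tensor, which is the concern described above. */
        for (int i = 0; i < row_len; ++i) {
            sum += src[(first_row + row) * row_len + i];
        }
        /* Guarded write: the result tensor itself is never corrupted. */
        if (first_row + row < nrows) {
            dst[first_row + row] = sum;
        }
    }
}

int main(void) {
    const int nrows = 6, row_len = 8;            /* 6 is not a multiple of N_DST = 4 */
    /* Padding keeps this demo in bounds; a real tensor has no such padding. */
    float src[(6 + N_DST) * 8] = {0};
    float dst[6] = {0};
    process_group(src, dst, 4, nrows, row_len);  /* covers rows 4..7, but rows 6 and 7 do not exist */
    printf("dst[4] = %.1f, dst[5] = %.1f\n", dst[4], dst[5]);
    return 0;
}
```

The demo stays in bounds only because of the deliberate padding; without it, rows 6 and 7 would be read from whatever memory happens to follow the tensor, which is exactly the risk pointed out in the comment above.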

Collaborator

I see. Thank you for answering my question.

@ggerganov (Owner) left a comment

M1 Pro

| Quantization | ms/t Master | ms/t PR |
|---|---|---|
| Q4_0 - 7B  | 30.2 | 28.7 |
| Q4_1 - 7B  | 32.9 | 31.2 |
| Q4_0 - 13B | 53.5 | 50.7 |
| Q4_1 - 13B | 58.1 | 55.6 |

@ggerganov (Owner)

@ikawrakow and @lshzh-ww

I assume your tables show ms/t instead of t/s, correct?

@ikawrakow (Contributor, Author)

> @ikawrakow and @lshzh-ww
>
> I assume your tables show ms/t instead of t/s, correct?

Yes.

ikawrakow merged commit 9a08eaf into master on Jul 25, 2023 · 4 checks passed
ikawrakow deleted the ik/metal_q4_0_1_new branch on Jul 25, 2023 at 10:48