
Another speed gain for Q4_0 and Q4_1 on Metal #2375

Merged · 2 commits into master · Jul 25, 2023

Conversation

ikawrakow (Contributor)

In #2279 there was some debate on whether it is better to have each thread in a Metal SIMD group process whole Q4_0 or Q4_1 blocks, or have them process half blocks.

On my M2 Max with a 30-core GPU, having threads process half a block is definitely faster; see the table below. I'm curious to see whether this change also improves (or worsens) performance on the M1 Pro/Max.

TG-128 in ms/token on a 30-core M2 Max. The command used was:

./main -m $model -p "I believe the meaning of life is" --ignore-eos -n 128 -s 1234 --no-mmap -t 8 -ngl 1

| Quantization | t/s Master | t/s PR | speedup |
|---|---|---|---|
| Q4_0 - 7B  | 18.6  | 17.6  | 5.7%  |
| Q4_1 - 7B  | 20.0  | 18.9  | 5.8%  |
| Q4_0 - 13B | 32.5  | 30.1  | 8.0%  |
| Q4_1 - 13B | 33.7  | 32.2  | 4.7%  |
| Q4_0 - 33B | 76.5  | 67.2  | 11.4% |
| Q4_1 - 33B | 82.3  | 75.3  | 9.3%  |
| Q4_0 - 65B | 146.8 | 133.4 | 10.0% |
| Q4_1 - 65B | 152.8 | 145.4 | 5.1%  |
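
For context on the granularity being discussed, below is a minimal plain-C sketch of the ggml Q4_0 block layout (32 weights per block: one scale plus 16 bytes of packed 4-bit quants) and of a dot product split into two independent 16-weight halves, which is the per-thread unit of work this PR argues for. The plain float scale (ggml stores fp16), the function names, and the particular byte-wise half split are illustrative assumptions, not the actual Metal kernel code.

```c
#include <stdint.h>
#include <stdio.h>

#define QK4_0 32

typedef struct {
    float   d;              /* per-block scale (fp16 in ggml; plain float here for brevity) */
    uint8_t qs[QK4_0 / 2];  /* byte j packs quant j (low nibble) and quant j + 16 (high nibble) */
} block_q4_0;

/* Partial dot product of one half of a block (8 bytes = 16 weights) with the
 * matching entries of y. half = 0 covers bytes 0..7, half = 1 covers bytes 8..15. */
static float dot_q4_0_half(const block_q4_0 *b, const float *y, int half) {
    const int off = 8 * half;
    float sum = 0.0f;
    for (int j = 0; j < 8; ++j) {
        const uint8_t byte = b->qs[off + j];
        const int qlo = (byte & 0x0F) - 8;  /* weight at index off + j      */
        const int qhi = (byte >> 4)   - 8;  /* weight at index off + j + 16 */
        sum += qlo * y[off + j] + qhi * y[off + j + 16];
    }
    return sum * b->d;
}

int main(void) {
    block_q4_0 b = { .d = 0.25f };
    float y[QK4_0];
    for (int i = 0; i < QK4_0; ++i) y[i] = 1.0f;
    for (int j = 0; j < QK4_0 / 2; ++j) b.qs[j] = 0x9A;  /* arbitrary packed nibbles */

    /* Two "threads" each take one half block; in the Metal kernel the per-thread
     * partial sums are then combined with a SIMD-group reduction. */
    const float partial0 = dot_q4_0_half(&b, y, 0);
    const float partial1 = dot_q4_0_half(&b, y, 1);
    printf("dot = %.3f\n", partial0 + partial1);
    return 0;
}
```

The half split chosen here (bytes 0..7 vs. 8..15 of qs) is only one possible partition; the point is just that two threads can each handle 16 weights of the same block independently and combine their partial sums afterwards.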

@lshzh-ww (Collaborator)

Test on an M1 Max with a 32-core GPU:

| Quantization | t/s Master | t/s PR | speedup |
|---|---|---|---|
| Q4_0 - 7B  | 19.18 | 18.07 | 6.1% |
| Q4_0 - 33B | 73.9  | 70.9  | 4.2% |

The performance improvement is solid. I am ashamed that I didn't test intensively last time; we would have lost quite a bit of performance if it weren't for you.

@ikawrakow (Contributor, Author)

> I am ashamed that I didn't test intensively last time; we would have lost quite a bit of performance

@lshzh-ww That doesn't make sense. Your contributions in #2188 and #2248 were absolutely essential for improving Metal performance. Before you came along with #2188 we were basically stuck at 21.5 ms/t on M2 Max for the 7B model. This kind of collaborative progress is the whole point of open source!

//Note: This is a template, but strictly speaking it only applies to
// quantizations where the block size is 32. It also does not
// guard against the number of rows not being divisible by
// N_DST, so this is another explicit assumption of the implementation.
Collaborator

I think this template also works on matrices whose number of rows is not divisible by N_DST. At the very end of the template we have `if (tiisg == 0 && first_row + row < ne01)`, which guarantees that we don't write into another tensor's region when there are fewer than N_DST rows left.

I checked using a WizardLM model, which has 32001 rows. This template generated exactly the same results as llama.cpp before #2188 was merged.

Contributor (Author)

Well, yes, there is a guard against writing outside the bounds of the result. But there is no guard against reading outside the bounds of the quantized tensor. We don't get a segfault because we only step out slightly and the address being accessed is within the process address space, but there is no guarantee that this will always be the case.
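
To make the write-guard versus read-guard distinction concrete, here is a hedged plain-C sketch (not the Metal template itself) of a row loop that guards its writes the way `if (first_row + row < ne01)` does, but leaves its reads unguarded. The value N_DST = 4, the function and array names, and the padded buffer in `main` are assumptions for illustration only.

```c
#include <stdio.h>

#define N_DST 4  /* rows handled together per group (the value 4 is assumed here) */

/* Each call mimics one group: it starts at first_row and walks N_DST rows.
 * Writes are guarded like `first_row + row < ne01`; reads are not. */
static void process_group(const float *src, float *dst, int first_row, int nrows, int row_len) {
    for (int row = 0; row < N_DST; ++row) {
        float sum = 0.0f;
        /* Unguarded reads: when first_row + row >= nrows this walks past the
         * end of the source tensor, which is the concern described above. */
        for (int i = 0; i < row_len; ++i) {
            sum += src[(first_row + row) * row_len + i];
        }
        /* Guarded write: the result tensor itself is never corrupted. */
        if (first_row + row < nrows) {
            dst[first_row + row] = sum;
        }
    }
}

int main(void) {
    const int nrows = 6, row_len = 8;            /* 6 is not a multiple of N_DST = 4 */
    /* Padding keeps this demo in bounds; a real tensor has no such padding. */
    float src[(6 + N_DST) * 8] = {0};
    float dst[6] = {0};
    process_group(src, dst, 4, nrows, row_len);  /* covers rows 4..7, but rows 6 and 7 do not exist */
    printf("dst[4] = %.1f, dst[5] = %.1f\n", dst[4], dst[5]);
    return 0;
}
```

The demo stays in bounds only because of the deliberate padding; without it, rows 6 and 7 would be read from whatever memory happens to follow the tensor, which is exactly the risk pointed out in the comment above.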

Collaborator

I see. Thank you for answering my question.

@ggerganov (Owner) left a comment

M1 Pro

| Quantization | ms/t Master | ms/t PR |
|---|---|---|
| Q4_0 - 7B  | 30.2 | 28.7 |
| Q4_1 - 7B  | 32.9 | 31.2 |
| Q4_0 - 13B | 53.5 | 50.7 |
| Q4_1 - 13B | 58.1 | 55.6 |

@ggerganov (Owner)

@ikawrakow and @lshzh-ww

I assume your tables show ms/t instead of t/s, correct?

@ikawrakow (Contributor, Author)

> @ikawrakow and @lshzh-ww
>
> I assume your tables show ms/t instead of t/s, correct?

Yes.

ikawrakow merged commit 9a08eaf into master on Jul 25, 2023 · 4 checks passed
ikawrakow deleted the ik/metal_q4_0_1_new branch on Jul 25, 2023 at 10:48