
Q8: use int8_t, AVX/AVX2 optimizations #972

Merged: 2 commits into ggerganov:q8_0 from sw:mulmat-q8 on Apr 14, 2023

Conversation

sw (Collaborator) commented on Apr 14, 2023

Note: this is not destined for master, but for the branch in PR #951

Change block_q8_0 to use int8_t and remove the additions and subtractions of 128. I can't test on ARM NEON and might have broken something. quantize_row_q8_0 NEON had +128.5 which was intended as a roundf substitute, but I think that's wrong for negative numbers.
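
A small plain-C illustration of that rounding concern (not the NEON path itself; it just shows that adding 0.5 and letting the float-to-int conversion truncate towards zero only matches roundf for non-negative inputs):

#include <math.h>
#include <stdio.h>

int main(void) {
    float x = -0.7f;
    printf("%d\n", (int)(x + 0.5f)); // prints  0: -0.2f truncates towards zero
    printf("%d\n", (int)roundf(x));  // prints -1: round to nearest
    return 0;
}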

Apart from that, this is mainly AVX and AVX2 optimizations.

With AVX2, text generation is roughly the same speed as current master.
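
For context, a minimal sketch of the kind of signed 8-bit dot product that AVX2 makes cheap (illustrative only, not the code from this PR; the helper name is made up). _mm256_maddubs_epi16 requires its first operand to be unsigned, so a common trick is to feed it |x| and move the sign of x onto y with _mm256_sign_epi8:

#include <immintrin.h>
#include <stdint.h>

// Dot product of 32 signed 8-bit values using AVX2 (illustrative sketch).
static int32_t dot32_i8_avx2(const int8_t *x, const int8_t *y) {
    const __m256i vx = _mm256_loadu_si256((const __m256i *) x);
    const __m256i vy = _mm256_loadu_si256((const __m256i *) y);

    const __m256i ax  = _mm256_sign_epi8(vx, vx);       // |x|, usable as unsigned
    const __m256i sy  = _mm256_sign_epi8(vy, vx);       // y with the sign of x
    const __m256i p16 = _mm256_maddubs_epi16(ax, sy);   // pairwise x*y, summed to int16
    const __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1)); // widen to int32

    // horizontal sum of the eight int32 lanes
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(p32),
                              _mm256_extracti128_si256(p32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

Since the first operand is limited to |x| <= 128, the int16 pair sums stay within range and the saturation in _mm256_maddubs_epi16 is never hit.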

BTW there seems to be something wrong with the Makefile dependencies. Switching between this branch and master with just make instead of make clean makes main crash with a FPE.

ggml.c (outdated)
- const float32x4_t vf = vaddq_f32(v, vdupq_n_f32(128.5f));
- const int32x4_t vi = vcvtq_s32_f32(vf);
+ //TODO: rounding
+ const int32x4_t vi = vcvtq_s32_f32(v);
ggerganov (Owner) replied:

Yes, vcvtq_s32_f32 uses round towards zero, so we have to find a workaround
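
A minimal sketch of one possible workaround, assuming ARMv8 NEON (the intrinsics below are not available on older ARMv7 targets, and this is not necessarily the fix that was eventually adopted):

#include <arm_neon.h>

// Hypothetical helper: float-to-int conversion with round-to-nearest,
// sidestepping the truncation of a bare vcvtq_s32_f32.
static inline int32x4_t cvt_round_nearest_s32(float32x4_t v) {
    const float32x4_t vr = vrndnq_f32(v); // round to nearest, ties to even
    return vcvtq_s32_f32(vr);             // truncation is now harmless
}

// On AArch64 the same can be done in one step:
// const int32x4_t vi = vcvtnq_s32_f32(v);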

sw (Collaborator, Author) commented on Apr 14, 2023

I'm assuming #407 made perplexity speed not comparable with default settings? It used to be faster with Q8; now it's slightly slower.

Edit: I meant to say that master is now faster than the Q8 branch, which doesn't yet have #407.

ggerganov (Owner) commented on Apr 14, 2023

Hmm, I didn't do timing tests, but if anything I would expect #407 to run a bit faster with default settings, since we were previously using a batch size of 511 and now use 512, which should be slightly more suitable for vectorization.

However, I did notice something strange when computing the "with BLAS" perplexity today in my PR: it estimated the time at 1.93 hours, but the run took just 83m45s, i.e. much faster than estimated.

Was this time estimation ever correct? I haven't done enough perplexity computations to have an impression.
Maybe #407 somehow breaks the estimation?

ggerganov merged commit 7c6c079 into ggerganov:q8_0 on Apr 14, 2023.
sw deleted the mulmat-q8 branch on Apr 14, 2023 at 18:59.