
Q8: use int8_t, AVX/AVX2 optimizations #972

Merged: 2 commits into ggerganov:q8_0 from sw:mulmat-q8 on Apr 14, 2023

Conversation

sw (Collaborator) commented on Apr 14, 2023

Note: this is not destined for master, but for the branch in PR #951

Change block_q8_0 to use int8_t and remove the additions and subtractions of 128. I can't test on ARM NEON and might have broken something. quantize_row_q8_0 NEON had +128.5 which was intended as a roundf substitute, but I think that's wrong for negative numbers.
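
A small plain-C illustration of that rounding concern (not the NEON path itself; it just shows that adding 0.5 and letting the float-to-int conversion truncate towards zero only matches roundf for non-negative inputs):

#include <math.h>
#include <stdio.h>

int main(void) {
    float x = -0.7f;
    printf("%d\n", (int)(x + 0.5f)); // prints  0: -0.2f truncates towards zero
    printf("%d\n", (int)roundf(x));  // prints -1: round to nearest
    return 0;
}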

Apart from that, this is mainly AVX and AVX2 optimizations.

With AVX2, text generation is roughly the same speed as current master.
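
For context, a minimal sketch of the kind of signed 8-bit dot product that AVX2 makes cheap (illustrative only, not the code from this PR; the helper name is made up). _mm256_maddubs_epi16 requires its first operand to be unsigned, so a common trick is to feed it |x| and move the sign of x onto y with _mm256_sign_epi8:

#include <immintrin.h>
#include <stdint.h>

// Dot product of 32 signed 8-bit values using AVX2 (illustrative sketch).
static int32_t dot32_i8_avx2(const int8_t *x, const int8_t *y) {
    const __m256i vx = _mm256_loadu_si256((const __m256i *) x);
    const __m256i vy = _mm256_loadu_si256((const __m256i *) y);

    const __m256i ax  = _mm256_sign_epi8(vx, vx);       // |x|, usable as unsigned
    const __m256i sy  = _mm256_sign_epi8(vy, vx);       // y with the sign of x
    const __m256i p16 = _mm256_maddubs_epi16(ax, sy);   // pairwise x*y, summed to int16
    const __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1)); // widen to int32

    // horizontal sum of the eight int32 lanes
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(p32),
                              _mm256_extracti128_si256(p32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

Since the first operand is limited to |x| <= 128, the int16 pair sums stay within range and the saturation in _mm256_maddubs_epi16 is never hit.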

BTW there seems to be something wrong with the Makefile dependencies. Switching between this branch and master with just make instead of make clean makes main crash with a FPE.

ggml.c (outdated)
- const float32x4_t vf = vaddq_f32(v, vdupq_n_f32(128.5f));
- const int32x4_t vi = vcvtq_s32_f32(vf);
+ //TODO: rounding
+ const int32x4_t vi = vcvtq_s32_f32(v);
ggerganov (Owner) replied:

Yes, vcvtq_s32_f32 uses round towards zero, so we have to find a workaround
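
A minimal sketch of one possible workaround, assuming ARMv8 NEON (the intrinsics below are not available on older ARMv7 targets, and this is not necessarily the fix that was eventually adopted):

#include <arm_neon.h>

// Hypothetical helper: float-to-int conversion with round-to-nearest,
// sidestepping the truncation of a bare vcvtq_s32_f32.
static inline int32x4_t cvt_round_nearest_s32(float32x4_t v) {
    const float32x4_t vr = vrndnq_f32(v); // round to nearest, ties to even
    return vcvtq_s32_f32(vr);             // truncation is now harmless
}

// On AArch64 the same can be done in one step:
// const int32x4_t vi = vcvtnq_s32_f32(v);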

sw (Collaborator, Author) commented on Apr 14, 2023

I'm assuming #407 made perplexity speed not comparable with default settings? It used to be faster with Q8; now it's slightly slower.

Edit: I meant to say that master is now faster than the Q8 branch, which doesn't yet have #407.

ggerganov (Owner) commented on Apr 14, 2023

Hmm, I didn't do timing tests, but if anything I would expect #407 to run a bit faster with default settings, since we were previously using a batch size of 511 and now use 512, which should be slightly more suitable for vectorization.

However, I did notice something strange when computing the "with BLAS" perplexity today in my PR: it estimated the time at 1.93 hours, but the run took just 83m45s, i.e. much faster than estimated.

Was this time estimation ever correct? I haven't done enough perplexity computations to have an impression.
Maybe #407 somehow breaks the estimation?

ggerganov merged commit 7c6c079 into ggerganov:q8_0 on Apr 14, 2023.
sw deleted the mulmat-q8 branch on Apr 14, 2023 at 18:59.