
Faster q3_0 implementation, using two planes #1

Merged · 6 commits merged into sw:q2q3 · Apr 17, 2023
Conversation

@pubby (Author) commented Apr 16, 2023

This reimplements block_q3_0 as:

typedef struct {
    ggml_fp16_t d;
    uint16_t qhi;
    uint32_t qlo;
} block_q3_0;

For each 3-bit number, the lowest 2 bits are packed into qlo, while the highest bit is packed into qhi.

To convert this representation to SIMD vectors, qlo is unpacked exactly like q2_0, while qhi is expanded through a lookup table whose results are then OR'd in.

This representation is both faster and simpler on the SIMD side, as its implementation shares most code with q2_0. I'm seeing an improvement from 66.07 seconds per pass to 42.12 seconds per pass on AVX2.

Pros:

  • Simpler SIMD implementation
  • Faster
  • Shares code with q2_0

Cons:

  • The data representation is unusual
  • Requires a lookup table
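For illustration, a minimal scalar sketch of the decode this layout implies (not code from the PR; it assumes the struct above, ggml's GGML_FP16_TO_FP32 macro, QK3_0 == 16, and the subtract-4 recentering discussed later in the thread):

// Decode one block: crumb k of qlo holds the low 2 bits of value k, and
// bit k of qhi holds its high bit; q in [0, 7] is recentered to [-4, 3].
static void dequantize_block_q3_0_ref(const block_q3_0 * b, float * out) {
    const float d = GGML_FP16_TO_FP32(b->d);   // per-block scale factor
    for (int k = 0; k < 16; k++) {             // 16 values (= QK3_0)
        const int lo = (b->qlo >> (2*k)) & 3;  // plane 1: low 2 bits
        const int hi = (b->qhi >> k) & 1;      // plane 2: high bit
        out[k] = (float)((lo | (hi << 2)) - 4) * d;
    }
}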

@sw (Owner) commented Apr 16, 2023

This lets the preprocessor generate the look-up table:

#define B0(n) 0x ## n
#define B1(n) B0(n ## FC), B0(n ## 00)
#define B2(n) B1(n ## FC), B1(n ## 00)
#define B3(n) B2(n ## FC), B2(n ## 00)
#define B4(n) B3(n ## FC), B3(n ## 00)
#define B5(n) B4(n ## FC), B4(n ## 00)
#define B6(n) B5(n ## FC), B5(n ## 00)
#define B7(n) B6(n ## FC), B6(n ## 00)
#define B8( ) B7(     FC), B7(     00)
static const uint64_t ggml_q3_table[256] = { B8() };

It's using token pasting, which is a bit silly. I'll see if I can find a more arithmetic way. Actually, let's not bother; this is easier to debug by running it through cpp.
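For reference, what this expands to: byte k of ggml_q3_table[i] is 0xFC when bit k of i is clear and 0x00 when it is set (counting from the least significant byte). Since 0xFC is -4 as a signed byte, OR-ing the looked-up qword over the unpacked 2-bit crumbs yields q - 4 directly, with no separate subtraction. A scalar sketch of the same construction (illustrative, not from the PR):

#include <stdint.h>

// Reference construction of one table entry: 0xFC per clear bit of i.
static uint64_t q3_table_ref(uint8_t i) {
    uint64_t r = 0;
    for (int k = 0; k < 8; k++)
        if (!((i >> k) & 1))
            r |= (uint64_t)0xFC << (8*k);
    return r;
}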

@pubby (Author) commented Apr 16, 2023

Made the changes you talked about. Now getting 38.77 seconds per pass.

Unfortunately my processor doesn't support vpdpbssd. Still, overall your changes are faster for me.

@sw (Owner) replied:

I don't have vpdpbssd support either - just wanted to future-proof things.

@pubby (Author) replied:

It's possible to implement the qhi conversion without the lookup table, but the table performs better on my system. Here's the code I used to test that (shift the unpacked bits into place, OR them onto the 2-bit crumbs, then subtract 4 element-wise; see the combining sketch after the function):

// Unpack 32 1-bit fields into 32 bytes
// The output vector contains 32 bytes, each one in [ 0 .. 1 ] interval
static inline __m256i bytesFromBits(uint16_t packed_hi, uint16_t packed_lo) {
    // Broadcast each 16-bit word twice per dword: the original bits in the
    // low half and the bits shifted down by one in the high half, so that
    // odd-indexed bits become reachable at even bit positions.
    __m128i bx_hi = _mm_set1_epi32(packed_hi | ((packed_hi >> 1) << 16));
    __m128i bx_lo = _mm_set1_epi32(packed_lo | ((packed_lo >> 1) << 16));
    __m256i bx = _mm256_set_m128i(bx_hi, bx_lo);

    // shift counts to get all bit pairs in lowest position of each byte
    const __m256i shift256 = _mm256_set_epi32(6, 4, 2, 0,
                                              6, 4, 2, 0);
    bx = _mm256_srlv_epi32(bx, shift256); // TODO: It would be better to use srlv_epi16, if available

    // Gather, in order, the bytes whose lowest bit now holds each input bit.
    const __m256i shufmask = _mm256_set_epi8(15, 13, 11,  9,
                                              7,  5,  3,  1,
                                             14, 12, 10,  8,
                                              6,  4,  2,  0,
                                             15, 13, 11,  9,
                                              7,  5,  3,  1,
                                             14, 12, 10,  8,
                                              6,  4,  2,  0);
    bx = _mm256_shuffle_epi8(bx, shufmask);

    // Keep only the lowest bit of each byte.
    const __m256i mask = _mm256_set1_epi8(1);
    bx = _mm256_and_si256(mask, bx);

    return bx;
}
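And for completeness, a sketch of the combining step from the parenthetical above (the function name combine_q3 and the operand names hi_bits and crumbs are mine; crumbs is assumed to hold the q2_0-style low 2 bits in each byte):

// Combine the 0/1 bytes from bytesFromBits with the low 2-bit crumbs.
static inline __m256i combine_q3(__m256i hi_bits, __m256i crumbs) {
    // AVX2 has no 8-bit shift; a 16-bit shift is safe here because each
    // byte holds only 0 or 1, so no bits cross byte boundaries.
    const __m256i hi = _mm256_slli_epi16(hi_bits, 2);
    const __m256i q  = _mm256_or_si256(hi, crumbs);   // q in [0, 7]
    return _mm256_sub_epi8(q, _mm256_set1_epi8(4));   // recenter to [-4, 3]
}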

I'm guessing other SIMD architectures could use code like the above if tables aren't an option. I'll look into it more - I'm mostly familiar with x64.

Review thread on ggml.c (outdated diff):
uint64_t qs = y[i].qs;
for (int l = 0; l < QK3_0; l++) {
const int8_t vi = qs & 7;
for (int i = 0; i < 16; i++) {
@sw (Owner) commented:

@pubby : I'm confused by this... why should we have two nested loops that go up to 16?

@pubby (Author) replied:

It's getting late for me, possibly you're right about this. But then q2 would also be wrong? I'll look into it tomorrow.

@sw (Owner) replied:

Also, you changed QK3_0 to 16 in various places, was there a specific reason? With the block_q3_0 definition as it is, no other value will work obviously, but I think it's nice to have a consistent name, just to keep the code understandable.

@pubby (Author) replied:

I think that line's a mistake. I believe it should be i < nb (and then add back nb = k / QK3_0), as in the sketch below.
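For concreteness, a sketch of the corrected structure (the names y, qs, vi, and QK3_0 come from the excerpt above; nb, d, dst, and the single packed qs qword layout are assumed from context):

const int nb = k / QK3_0;
for (int i = 0; i < nb; i++) {                  // over blocks, not a fixed 16
    const float d = GGML_FP16_TO_FP32(y[i].d);  // assumed per-block scale
    uint64_t qs = y[i].qs;                      // 16 x 3 bits = 48 bits used
    for (int l = 0; l < QK3_0; l++) {
        const int8_t vi = qs & 7;               // one 3-bit quant, q in [0, 7]
        dst[i*QK3_0 + l] = (float)(vi - 4) * d; // recenter to [-4, 3], scale
        qs >>= 3;
    }
}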

The QK3_0 -> 16 change was made when I was testing a customizable version of block_q3_0 that looked like this:

typedef struct {
    ggml_fp16_t d;
    uint16_t qhi[QK3_0 / 16];
    uint32_t qlo[QK3_0 / 16];
} block_q3_0;

The number 16 referred to the bit count per qhi element.
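For example (hypothetical, since only 16 was ever used), with QK3_0 defined as 32 that layout would have expanded to:

typedef struct {
    ggml_fp16_t d;
    uint16_t qhi[2];   // 32 high bits, 16 per element
    uint32_t qlo[2];   // 64 bits, one 2-bit crumb per value
} block_q3_0;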

I scrapped that version though, so I'll revert the changes and use QK3_0 again.

@sw merged commit de542b3 into sw:q2q3 on Apr 17, 2023