
More accurate Q4_0 and Q4_1 quantizations #896

Closed
wants to merge 12 commits

Commits on Apr 11, 2023

  1. Use better conversion to ints

    in quantize_row_q4_0_reference and quantize_row_q4_1_reference.
    This narrows the gap to the vectorized versions to ~10% for
    quantize_row_q4_0 and <15% for quantize_row_q4_1 on the two
    CPUs I have tried (Ryzen 7950X and M2 Max). (A sketch of the
    rounding trick follows this commit list.)
    Kawrakow committed Apr 11, 2023 · 126b984
  2. 0c9a967
  3. 8b3d1f9
  4. 92408cd
  5. 709d235
  6. Reverting the round() change so we can pass tests

    But we should eventually switch back to nearestInt() and adapt the test.
    Kawrakow committed Apr 11, 2023 · b6df974
  7. Improve Q4_0 MSE

    Somehow I had it hard-wired in my brain that the quants need to
    lie in -7...7 to be comparable to the original Q4_0.
    
    But this is clearly not the case, and if we relax this requirement,
    this simple change brings the rmse down to 0.001966 at the expense
    of a somewhat longer computation (~67 seconds vs 49 seconds for the
    7B model on M2 Max). (A sketch of such a scale search follows this
    commit list.)
    
    The perplexity test is still running, but it looks like the
    improvement over the previous version will be quite modest (~0.03)
    despite the significant improvement in MSE.
    
    The change does not affect Q4_1, as there we already use the full
    range of 16 possible int values.
    Kawrakow committed Apr 11, 2023 · 931ae36
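For the "Use better conversion to ints" commit (and the nearestInt() mentioned in the revert commit): a minimal sketch of the kind of round-to-nearest conversion involved, assuming the classic magic-number trick; this is an illustrative reconstruction, not the exact code from the diff.

```c
#include <assert.h>
#include <string.h>

// Round a float to the nearest int without calling roundf()/lrintf().
// Adding 1.5 * 2^23 forces the rounded integer into the mantissa bits,
// which we then extract. Valid for |fval| < ~2^22, which comfortably
// covers the x/scale values produced during quantization.
static inline int nearest_int(float fval) {
    assert(fval <= 4194303.f);
    float val = fval + 12582912.f;         // 1.5 * 2^23
    int i;
    memcpy(&i, &val, sizeof(int));         // reinterpret the bits
    return (i & 0x007fffff) - 0x00400000;  // mantissa minus the bias
}
```

Unlike a plain (int)(x + 0.5f), this rounds negative values correctly and avoids a libm call in the inner loop. Note that the float addition rounds halves to even, one plausible reason why exact-match tests against round() would need adapting, as the revert commit notes.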
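And for "Improve Q4_0 MSE": a minimal sketch of a scale search that exploits the full signed 4-bit range -8...7 rather than the symmetric -7...7. The helper name best_scale_q4_0 and the candidate grid are hypothetical, chosen only to illustrate why such a search costs extra time (~67 s vs 49 s above).

```c
#include <math.h>

// Hypothetical helper: pick the scale d minimizing the reconstruction
// error sum((x[i] - d*q[i])^2), with q[i] the nearest int clamped to
// the full 4-bit range [-8, 7] instead of the symmetric [-7, 7].
static float best_scale_q4_0(const float *x, int n) {
    float amax = 0.f;
    for (int i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(x[i]));
    if (amax == 0.f) return 0.f;

    float best_d   = amax / 7.f;           // the original Q4_0 scale
    float best_err = INFINITY;
    for (int is = 0; is <= 20; ++is) {     // small grid of candidate scales
        const float d = amax / (6.f + 0.1f * is);
        float err = 0.f;
        for (int i = 0; i < n; ++i) {
            int q = (int)lroundf(x[i] / d);
            if (q < -8) q = -8; else if (q > 7) q = 7;
            const float diff = x[i] - d * (float)q;
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }
    return best_d;
}
```

For Q4_0 this would be called once per block of n = 32 weights, so the inner loop runs ~20x more often than the single pass of the naive max-based scale.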

Commits on Apr 12, 2023

  1. Further improve Q4_0 MSE

    The RMSE of the 7B model becomes 0.00185228.
    It looks like the perplexity will end up being around 6.27-6.28.
    Kawrakow committed Apr 12, 2023 · 6bfb00a
  2. 29b83e5
  3. POC: Even lower rmse 4-bit Q4_0 quantization

    Basically, we use two Q4_0 quantizations, each covering 16 weights,
    to quantize a set of 32 weights. We get two separate scaling
    factors, which we store as fp16, ending up with the exact same
    5 bits per weight as the current Q4_0. (The storage arithmetic is
    sketched after this commit list.)
    
    We end up with an rmse of ~0.00159, so basically the same as the
    improved Q4_1. But this should run faster than `Q4_1`
    (unless the fp16 -> fp32 conversion is somehow very slow).
    Kawrakow committed Apr 12, 2023 · 679e1cb
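The storage arithmetic behind the "same 5 bits per weight" claim in the POC above: stock Q4_0 spends one fp32 scale plus 32 4-bit quants per block, i.e. (32 + 32×4)/32 = 5 bits per weight; two fp16 scales for two groups of 16 cost (2×16 + 32×4)/32 = 5 bits per weight as well. A minimal layout sketch, assuming ggml's ggml_fp16_t half-precision storage type; the struct name block_q4_0_2x16 is hypothetical:

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;  // half-precision storage type, as in ggml

// Hypothetical layout: two independent Q4_0 groups of 16 weights sharing
// one 32-weight block. 2*16 bits of scales + 32*4 bits of quants
// = 160 bits per 32 weights = 5 bits per weight, same as stock Q4_0.
typedef struct {
    ggml_fp16_t d[2];    // one fp16 scale per group of 16 weights
    uint8_t     qs[16];  // 32 x 4-bit quants, two per byte
} block_q4_0_2x16;

_Static_assert(sizeof(block_q4_0_2x16) == 20, "160 bits per 32 weights");
```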

Commits on Apr 13, 2023

  1. POC: Q4_1 for groups of 16 weights

    As in the last commit, but with the Q4_1 type, using the same
    memory as the existing Q4_1 via fp16.
    
    We end up with
    rmse 0.00125125, maxerr 0.11657715, 95pct<0.0024, median<0.0010
    after a quantize-dequantize roundtrip.
    
    This is quite a bit better than Q4_1 with groups of 32 weights,
    but still far from the 5-bit quantization that uses the same
    amount of memory, where we had
    rmse 0.00076131, maxerr 0.05273438, 95pct<0.0016, median<0.0006.
    (The memory accounting is worked out after this commit list.)
    Kawrakow committed Apr 13, 2023 · 6f34961
  2. POC: Measure rmse of 8-bit quantization

    q8_0 : rmse 0.00010729, maxerr 0.01030385, 95pct<0.0002, median<0.0002
    (A sketch of such a measurement follows this commit list.)
    Kawrakow committed Apr 13, 2023 · 97d7ac7
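The memory accounting for the Q4_1 group-of-16 POC works out the same way as for the Q4_0 one above. Stock Q4_1 stores an fp32 scale and an fp32 min plus 32 4-bit quants: (2×32 + 32×4)/32 = 6 bits per weight. The POC stores an fp16 scale and an fp16 min per group of 16: (2×16 + 16×4)/16 = 6 bits per weight, so the footprint is unchanged, exactly as the commit message claims.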
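The rmse/maxerr/95pct/median figures quoted throughout come from a quantize-dequantize roundtrip over the model weights. A minimal sketch of how the q8_0 measurement could look (one fp32 scale per block of 32 weights, int8 quants); this is an illustrative harness, not the PR's actual statistics code:

```c
#include <math.h>

#define QK 32  // quantization block size, as in ggml

// Quantize one block to 8 bits, dequantize, and accumulate the squared
// and maximum roundtrip errors; rmse = sqrt(total sqerr / total weights).
static void q8_0_roundtrip_err(const float *x, double *sqerr, float *maxerr) {
    float amax = 0.f;
    for (int i = 0; i < QK; ++i) amax = fmaxf(amax, fabsf(x[i]));
    const float d  = amax / 127.f;             // fp32 scale for this block
    const float id = d > 0.f ? 1.f / d : 0.f;
    for (int i = 0; i < QK; ++i) {
        const int   q   = (int)lroundf(x[i] * id);  // int8 quant in [-127, 127]
        const float err = fabsf(x[i] - d * (float)q);
        *sqerr += (double)err * err;
        if (err > *maxerr) *maxerr = err;
    }
}
```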