
llamafile : improve moe prompt eval speed on cpu #6840

Open
wants to merge 1 commit into master

Commits on Jun 28, 2024

  1. llamafile : improve moe prompt eval speed on cpu

    This change introduces a llamafile_mixmul() API that allows tinyBLAS to
    speed up "Mixture of Experts" models. On my Threadripper, Mixtral 8x7b
    F16 weights now process prompts 2x faster, and I am seeing a 60 percent
    improvement with Mixtral 8x22b Q4_0. Q8_0 is supported as well, since
    tinyBLAS also handles it. MoE models spend most of their time in
    MUL_MAT_ID rather than MUL_MAT, which is why llamafile_sgemm() was not
    able to help them before. The new code works by decomposing the mixmul
    operation into fast 2d llamafile_sgemm() calls (see the sketch after
    the commit details below). This also adds BF16 support to tinyBLAS.
    jart committed Jun 28, 2024
    Commit 2dd5d1f
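
The decomposition the commit describes can be sketched in a few lines. This is a minimal illustration, not the actual llama.cpp code: sgemm2d() is a hypothetical stand-in for llamafile_sgemm(), plain row-major float arrays stand in for ggml tensors, quantized formats (Q4_0, Q8_0, BF16) are ignored, and each token routes to a single expert for brevity (real MoE routing selects the top-k experts per token). The idea is to bucket token rows by their routed expert, then replace many per-token vector products with one dense 2D GEMM per expert:

```cpp
// Minimal sketch of decomposing a MUL_MAT_ID-style "mixmul" into per-expert
// 2D GEMMs. Hypothetical illustration only: sgemm2d() stands in for
// llamafile_sgemm(), and row-major float arrays stand in for ggml tensors.
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical dense 2D matmul: C[m x n] += A[m x k] * B[k x n], row-major.
static void sgemm2d(const float *A, const float *B, float *C,
                    size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t l = 0; l < k; ++l)
                sum += A[i * k + l] * B[l * n + j];
            C[i * n + j] += sum;
        }
}

// Each token row routes to one expert. Instead of a slow vector-matrix
// product per token, bucket the rows by expert id, gather them into a
// contiguous matrix, and run one fast 2D GEMM per expert.
void mixmul(const std::vector<const float *> &expert_weights, // each [k x n]
            const float *input,    // [tokens x k] activations
            const int *expert_ids, // [tokens] routed expert per row
            float *output,         // [tokens x n], zero-initialized
            size_t tokens, size_t k, size_t n) {
    std::vector<std::vector<size_t>> buckets(expert_weights.size());
    for (size_t t = 0; t < tokens; ++t)
        buckets[expert_ids[t]].push_back(t);

    std::vector<float> gathered, result;
    for (size_t e = 0; e < expert_weights.size(); ++e) {
        const std::vector<size_t> &rows = buckets[e];
        if (rows.empty())
            continue;
        gathered.assign(rows.size() * k, 0.0f);
        result.assign(rows.size() * n, 0.0f);
        for (size_t r = 0; r < rows.size(); ++r) // gather rows for expert e
            std::copy_n(input + rows[r] * k, k, &gathered[r * k]);
        sgemm2d(gathered.data(), expert_weights[e], result.data(),
                rows.size(), n, k);
        for (size_t r = 0; r < rows.size(); ++r) // scatter back to token order
            std::copy_n(&result[r * n], n, output + rows[r] * n);
    }
}
```

Batching rows this way means the MUL_MAT_ID path can presumably reuse the same tiled 2D kernels that make ordinary MUL_MAT fast, with only a cheap gather/scatter on either side of each GEMM.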