
Releases: Mozilla-Ocho/llamafile

llamafile v0.8.7

24 Jun 15:00
b2f587c

This release includes important performance enhancements for quants.

  • 293a528 Performance improvements on Arm for legacy and k-quants (#453)
  • c38feb4 Optimized matrix multiplications for i-quants on __aarch64__ (#464)

This release also fixes bugs. Most notably, we're now using a brand new
memory manager, which is believed to support platforms like Android whose
virtual address space has fewer than 47 bits. This release also restores
our prebuilt Windows AMD GPU support, thanks to tinyBLAS.

In future releases, we plan to introduce a new server for llamafile,
designed for performance and production-worthiness. It's not included in
this release, since it currently only supports a tokenization endpoint.
However, that endpoint is capable of handling 2 million requests per
second, whereas the most we've ever seen from the current server is a
few thousand.

  • e0656ea Introduce new llamafile server

llamafile v0.8.6

25 May 14:27
81cfbcf

Two minor issues are fixed with this release.

  • 69c2dd3 Don't print special tokens for now (improve shell scriptability)
  • 866a129 Upgrade to Cosmopolitan v3.3.8

See the llamafile v0.8.5 release notes for further details. For driver-only prebuilt AMD GPU support on Windows, please use llamafile v0.8.4 for the next few weeks, until ggerganov/llama.cpp#7156 is resolved.

llamafile v0.8.5

25 May 09:06
b79ecf4

This release fixes bugs and introduces @Kawrakow's latest quant
performance enhancements (a feature exclusive to llamafile). As of #435,
the K-quants now run consistently 2x faster than upstream llama.cpp. On
big CPUs like Threadripper we've doubled the performance of tiny models,
for both prompt processing and token generation (see the benchmarks
below). The llamafile-bench and llamafile-upgrade-engine commands have
also been introduced.

Note: Please use llamafile v0.8.4 if you need prebuilt (driver-only) AMD GPU support on Windows,
at least for the next few weeks, until ggerganov/llama.cpp#7156 is resolved.

Binaries run on Linux, Windows, macOS, FreeBSD, OpenBSD, and NetBSD for
AMD64 and ARM64. Supported GPU backends are CUDA, ROCm, and Metal. Prebuilt
GPU binaries are provided for CUDA/ROCm on Linux, and CUDA on Windows. To
install this release on systems with a POSIX-style shell:

sudo -s
cd /usr/local
wget https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.5/llamafile-0.8.5.zip
unzip llamafile-0.8.5.zip
exit
llamafile --help

To upgrade your old llamafiles without needing to redownload, run:

llamafile-upgrade-engine old.llamafile new.llamafile

Prebuilt llamafiles that have the LLM weights included are available at:

Here are some tutorials:

Here are some performance benchmarks for various quantization formats, on the world's flagship CPUs. See https://justine.lol/matmul/ to compare these numbers to where we were two months ago, in March.

| cpu_info | model_filename | size | test | t/s |
| --- | --- | --- | --- | --- |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.BF16 | 86.99 GiB | pp512 | 447.01 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.BF16 | 86.99 GiB | tg16 | 11.35 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.F16 | 86.99 GiB | pp512 | 340.63 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.F16 | 86.99 GiB | tg16 | 11.01 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q8_0 | 46.22 GiB | pp512 | 288.16 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q8_0 | 46.22 GiB | tg16 | 15.82 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q6_K | 35.74 GiB | pp512 | 431.51 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q6_K | 35.74 GiB | tg16 | 22.73 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q5_K_M | 30.95 GiB | pp512 | 427.71 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q5_K_M | 30.95 GiB | tg16 | 24.90 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_K_M | 26.49 GiB | pp512 | 440.03 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_K_M | 26.49 GiB | tg16 | 27.31 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_0 | 24.63 GiB | pp512 | 287.51 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q4_0 | 24.63 GiB | tg16 | 18.92 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_M | 21.00 GiB | pp512 | 433.89 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_M | 21.00 GiB | tg16 | 30.30 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_S | 19.03 GiB | pp512 | 432.36 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q3_K_S | 19.03 GiB | tg16 | 31.34 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q2_K | 16.12 GiB | pp512 | 449.64 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | mixtral-8x7b-instruct-v0.1.Q2_K | 16.12 GiB | tg16 | 33.71 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F32 | 4.10 GiB | pp512 | 2103.25 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F32 | 4.10 GiB | tg16 | 57.34 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.BF16 | 2.05 GiB | pp512 | 2603.84 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.BF16 | 2.05 GiB | tg16 | 77.18 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | pp512 | 2038.64 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | tg16 | 80.23 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 2203.77 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 100.78 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 2838.05 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 135.27 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_1 | 791.50 MiB | pp512 | 2328.06 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_1 | 791.50 MiB | tg16 | 138.15 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 2676.14 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 143.58 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 2281.44 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 145.02 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_S | 729.84 MiB | pp512 | 2757.59 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q5_K_S | 729.84 MiB | tg16 | 143.59 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_1 | 668.18 MiB | pp512 | 2444.11 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_1 | 668.18 MiB | tg16 | 148.50 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_M | 636.18 MiB | pp512 | 2758.90 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_M | 636.18 MiB | tg16 | 149.92 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_S | 609.53 MiB | pp512 | 2847.95 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_K_S | 609.53 MiB | tg16 | 150.84 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | pp512 | 2420.58 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | tg16 | 154.27 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q3_K_L | 563.42 MiB | pp512 | 2743.74 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q3_K_L | 563.42 MiB | tg16 | 155.29 |
| AMD Ryzen Threadripper PRO 7995WX (znver4) | TinyLlama-1.1B-Chat-v1.0.Q3_K_M | 522.30 MiB | … | … |

llamafile v0.8.4

10 May 09:30
30cdd9c

This release fixes underflows and overflows.

  • A memory bug in the grammar parser has been fixed that caused commands like ./llamafile -m foo.gguf -p bar --grammar 'root::="' (which fail to specify a closing quote) to crash. Anyone using the server as a public-facing endpoint (despite our previous recommendations) is strongly encouraged to upgrade. See 22aba95 and 3fe045f. Credit for discovering (and, most importantly, reporting) this issue goes to Eclypsium security researcher Richard Johnson. We earlier reported incorrectly that this fix was incorporated into the v0.8.2 release; you need the v0.8.4 release. This bug fix was upstreamed in ggerganov/llama.cpp#7194

  • Our new vectorized expf() implementation now handles underflow by producing subnormals rather than flushing to zero. b5c6df6

See these instructions for how to put the latest llamafile software into your old weights, without having to redownload. #24 (comment)

llamafile v0.8.2

09 May 23:20
4ee1e39

[line drawing of llama animal head in front of slightly open manila folder filled with files]

llamafile lets you distribute and run LLMs with a single file

llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and binary portability to the stock installs of six OSes, with no installation required. It features the best of llama.cpp and Cosmopolitan Libc, while aiming to stay ahead of the curve with the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI-API-compatible server, and a shell-scriptable CLI interface, which together put you in control of artificial intelligence.

  • This release introduces faster AVX2 prompt processing for K-quants and IQ4_XS (#394). This was contributed to llamafile by @ikawrakow who originally invented K quants last year: ggerganov/llama.cpp@99009e7. In prior releases we recommended the legacy Q4_0 quant since it was the simplest and most intuitive to get working with recent matmul optimizations. Thanks to Iwan Kawrakow's efforts, the best quants (e.g. Q5_K_M) will now go the fastest (on modern x86 systems).

  • Text generation (or prediction) should now go slightly faster too, thanks to development work on the matmul kernels and enhancements to thread synchronization (see 89c189e), which should be most noticeable on many-core CPUs running smaller models. macOS ARM users running on CPU rather than Metal can expect the biggest boost, now that llamafile knows how to utilize all cores (see 6c45e3e).

  • Bugs in the server /embedding endpoint have been fixed (see 0e2845a and 7900294). You can also now pass llamafile --embedding -m model -p prompt to have embeddings printed to standard output (see 42bd9b8).
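A minimal shell invocation of the new embedding mode might look like the following; the model path here is a placeholder, not a file shipped with this release:

```shell
# Print the embedding vector for a prompt to standard output.
# "model.gguf" is illustrative; substitute your own GGUF or llamafile.
llamafile --embedding -m model.gguf -p "hello world"
```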

  • This release synchronizes with the upstream llama.cpp project as of May 7th in 94d0940, which improves tokenization for Command-R, Refact, Olmo, and StarCoder. There's a new flash attention op that may be enabled for many models by passing the -fa flag. We haven't been able to include this in our prebuilt cuda/rocm binaries yet, so you may need to pass the llamafile --recompile flag for GPU.

  • This release introduces the --precise, --fast, and --trap flags, which control how math is executed. The --precise flag can slightly enhance the thinking of LLMs at the cost of some performance (see 2af3b88 and 9540b43). The --fast flag is included since it's unspecified which mode llamafile will use in any given situation (see bbae0f6 and b749326). The --trap flag can help you pinpoint the exact moment any NaNs appear (on CPUs that support this, e.g. most x86), which is useful for troubleshooting. Additionally, a new vectorized expf() function has been introduced that lets llamafile compute the exponential function faster and at full quality (see e2b3cb2). This matters because it's the function that powers SiLU and SoftMax, which are used by most of today's premier public models.
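As a sketch of how these math flags might be used from a shell (the model path and prompts are illustrative, not files shipped with this release):

```shell
# Favor numerical accuracy over raw speed:
llamafile --precise -m model.gguf -p "Explain entropy."

# Trap on the first NaN to localize numerical bugs (supported on most x86 CPUs):
llamafile --trap -m model.gguf -p "Explain entropy."
```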

  • Most of the CPU code in the GGML library now has optimal performance across different hardware architectures, thanks to new build system techniques. Features, options, or models that underperformed before may do better now (see 0bdea60 and c9d7393).

Additional fixes:

  • a2d159e Fix server multimodal statistics (#392)
  • aa8c01a Revert moondream vision language model support
  • eecbf89 More conservative strong/em markdown matcher (#352)
  • 38311f2 CUDA: CUDART < 11.7 workaround for __hmax, __hmax2
  • 58d2ca0 Use qsort and set linkage to static for internal functions used for offload-arch-fix (#375)
  • 4ee1e39 The PDF documentation in llamafile-0.8.2.zip is now fixed
  • 4ee1e39 Remove warnings from cuda build

Additional notes:

  • We're experiencing some instability with our Windows AMD GPU support. If you encounter crashes using the -ngl 999 flag on Windows, then try using the previous 0.8.1 release. Please also consider filing an issue to report if it doesn't work, or better yet, file an issue if it does work, since we otherwise have no way of knowing that (llamafile has no telemetry, because maximally respecting the user's privacy on their local machine is one of the stated goals of the project). You can also share details about your experience with us on the Mozilla AI Discord server.

See these instructions for how to put the latest llamafile software into your old weights, without having to redownload. #24 (comment)

llamafile v0.8.1

26 Apr 20:33
2095d50
  • Support for Phi-3 Mini 4k has been introduced
  • A bug causing GPU module crashes on some systems has been resolved
  • Support for Command-R Plus has now been vetted with proper 64-bit indexing
  • We now support more AMD GPU architectures thanks to better detection of offload archs (#368)
  • We now ship prebuilt NVIDIA and ROCm modules for both Windows and Linux users. They link tinyBLAS, a libre math library that only depends on the graphics driver being installed. Since it's slower, llamafile will automatically build a native module for your system if the CUDA or ROCm SDKs are installed. You can control this behavior using --nocompile or --recompile. Yes, our LLaVA llamafile still manages to squeak under the Windows 4GB file size limit!
  • An assertion error has been fixed that happened when using llamafile-quantize to create K quants from an F32 GGUF file
  • A new llamafile-tokenize command line tool has been introduced. For example, if you want to count how many "tokens" are in a text file, you can say cat file.txt | llamafile-tokenize -m model.llamafile | wc -l, since it prints each token on its own line.
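The token-counting recipe above can also be written with a plain redirect instead of cat (the file and model names are illustrative):

```shell
# llamafile-tokenize prints one token per line,
# so wc -l yields the token count of file.txt.
llamafile-tokenize -m model.llamafile < file.txt | wc -l
```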

llamafile v0.8

24 Apr 22:05
82f87bd

[line drawing of llama animal head in front of slightly open manila folder filled with files]

llamafile lets you distribute and run LLMs with a single file

llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023, which offers superior performance and binary portability to the stock installs of six OSes without needing to be installed. llamafile goes 2x faster than llama.cpp and 25x faster than ollama for some use cases like CPU prompt evaluation. It has a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.

This release further improves performance and introduces support for new models.

  • Support for LLaMA3 is now available
  • Support for Grok has been introduced
  • Support for Mixtral 8x22b has been introduced
  • Support for Command-R models has been introduced
  • MoE models (e.g. Mixtral, Grok) now go 2-5x faster on CPU 4db03a1
  • F16 is now 20% faster on Raspberry Pi 5 (TinyLLaMA 1.1b prompt eval improved 62 -> 75 tok/sec)
  • F16 is now 30% faster on Skylake (TinyLLaMA 1.1b prompt eval improved 171 -> 219 tok/sec)
  • F16 is now 60% faster on Apple M2 (Mistral 7b prompt eval improved 79 -> 128 tok/sec)
  • Add ability to override chat template in web gui when creating llamafiles da5cbe4
  • Improve markdown and syntax highlighting in server (#88)
  • CPU feature detection has been improved

Downloads

You can download prebuilt llamafiles from:

Errata

  • The new web gui chat template override feature isn't working as intended. If you want to use LLaMA3 8B then you need to manually copy and paste the chat templates from our README into the llamafile web GUI.
  • The llamafile-quantize program may fail with an assertion error when K-quantizing weights from an F32 converted file. You can work around this by asking llama.cpp's convert.py script to output an FP16 GGUF file, and then running llamafile-quantize on that instead.

llamafile v0.7.4

24 Apr 17:08
73bf13d
  • Display prompt eval tokens per second in web gui e4d97b2
  • Add ability to override chat template in web gui ebd096e
  • Simplify and optimize the sgemm code more ef1c524

llamafile v0.7.3

19 Apr 22:55
8ecb0ae
  • Improve markdown and syntax highlighting in server (#88) re-fixes #68

llamafile v0.7.2

19 Apr 20:45
cfae06f
  • Fix stop token bug with meta llama3 70b instruct da4d780
  • Fix LLaVA shell scriptability regression ff9decc (#346)