Calling llama-bench from llama.cpp to benchmark the speed of prompt processing and text generation, using different models and different numbers of tokens.
Each benchmark scenario is repeated 5 times and run on its own so that a timeout can be enforced. The timeout is calculated from the model size (to be loaded into memory/VRAM, assuming a conservative 250 MB/sec read speed), the number of tokens tested, and the expected minimum tokens/sec, requiring faster inference for longer token counts as per the tables below.
Prompt processing performance targets:

| Tokens | Expected tokens/sec |
|---|---|
| 16 | 2 |
| 128 | 10 |
| 512 | 25 |
| 1024 | 50 |
| 4096 | 250 |
| 16384 | 1000 |
Text generation performance targets:

| Tokens | Expected tokens/sec |
|---|---|
| 16 | 1 |
| 128 | 5 |
| 512 | 25 |
| 1024 | 50 |
| 4096 | 250 |
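As a rough illustration of how such a timeout might be derived (a sketch only; the exact formula used by the benchmark may differ, and the model size below is a made-up example):

```sh
# Hypothetical timeout estimate for one scenario: a 4 GB model and a
# 1024-token text generation run (expected minimum of 50 tokens/sec).
MODEL_MB=4000   # assumed model size in MB
TOKENS=1024     # tokens tested in this scenario
MIN_TPS=50      # expected tokens/sec for 1024 tokens (see table above)
# Load time at 250 MB/sec plus inference time at the minimum speed:
echo $(( MODEL_MB / 250 + TOKENS / MIN_TPS ))   # ~36 seconds
```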
So running the benchmark on hardware that can generate 512 tokens at only 22 tokens/sec will not test 1024 and larger token lengths, stopping early to save compute resources. If you want to allow longer runs, use the `--benchmark-timeout-scale` flag to increase the timeouts.
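For example, to give each scenario more time (assuming the flag takes a multiplier applied to the computed timeouts; check the container's `--help` output for the exact semantics):

```sh
# Assumed usage: scale all computed timeouts by a factor of 2.
docker run --rm --init ghcr.io/sparecores/benchmark-llm:main --benchmark-timeout-scale 2
```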
CPU-only (available on AMD64 and ARM64):

```sh
docker run --rm --init ghcr.io/sparecores/benchmark-llm:main
```

CUDA with GPUs (available only on AMD64):

```sh
docker run --gpus all --rm --init ghcr.io/sparecores/benchmark-llm:main
```
JSON lines output is printed to stdout, which can be piped to a file for later processing, e.g.:

```sh
docker run --rm --init ghcr.io/sparecores/benchmark-llm:main | tee -a results.jsonl
```
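The collected file can then be inspected with standard JSON Lines tooling, for example (the field names inside the records depend on the benchmark version, so none are assumed here):

```sh
# Pretty-print each benchmark record; jq reads JSON Lines natively.
jq '.' results.jsonl
# Count the number of recorded results (one JSON object per line).
wc -l results.jsonl
```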
The default list of models to download and benchmark is:
You can override the default list of models by passing the `--model-urls` flag, but note that the models should be GGUF files, ordered by size (starting with the smallest).
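A sketch of overriding the model list (the URLs below are placeholders rather than the actual defaults, and the flag is assumed to accept multiple URLs; check the container's `--help` output for the exact syntax):

```sh
# Placeholder GGUF URLs, listed from smallest to largest model.
docker run --rm --init ghcr.io/sparecores/benchmark-llm:main \
  --model-urls \
    https://example.com/models/small-model.Q4_K_M.gguf \
    https://example.com/models/larger-model.Q4_K_M.gguf
```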
The models are cached in the `/models` directory by default, which is a temporary Docker volume, but you can override this by passing the `--models-dir` flag. If you need to rerun the benchmark multiple times, you might want to set a different models directory or attach an external location to avoid re-downloading the same models.
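For example, to keep the downloaded models on the host between runs (a sketch; the host path is an arbitrary example and `/cache` is just an illustrative container path):

```sh
# Mount a host directory into the container and point the benchmark at it.
docker run --rm --init \
  -v /srv/llm-models:/cache \
  ghcr.io/sparecores/benchmark-llm:main --models-dir /cache
```

Mounting a host directory over the default `/models` path should work as well, without needing the extra flag.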