Benchmarking LLM inference speed

This benchmark calls llama-bench from llama.cpp to measure the speed of prompt processing and text generation across different models and token counts.

Each benchmark scenario is repeated 5 times and run separately so that a timeout can be enforced. The timeout is calculated from the model size (the time needed to load the model into memory/VRAM, assuming a conservative read speed of 250 MB/sec), the number of tokens tested, and the expected minimum tokens/sec. Longer token lengths require faster inference, as listed below.
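As a rough illustration of how such a timeout could be derived, the sketch below adds the model load time to the time the repeated runs would take at the minimum expected speed. The exact formula used by the benchmark is not stated here, so the function name, its parameters, and the way the terms are combined are assumptions.

```python
def estimate_timeout_seconds(
    model_size_mb: float,
    n_tokens: int,
    min_tokens_per_sec: float,
    repetitions: int = 5,
    read_speed_mb_per_sec: float = 250.0,
) -> float:
    """Hypothetical timeout estimate; not the benchmark's actual formula."""
    # Time to read the model from disk at the conservative read speed.
    load_seconds = model_size_mb / read_speed_mb_per_sec
    # Time for all repetitions if inference runs exactly at the minimum speed.
    run_seconds = repetitions * n_tokens / min_tokens_per_sec
    return load_seconds + run_seconds


# Example: a 4000 MB model, 512 tokens, 25 tokens/sec expected.
# Load time is 16 s and the five runs take 102.4 s, about 118.4 s in total.
timeout = estimate_timeout_seconds(4000, 512, 25)
```

Under these assumptions, a larger model only extends the load-time term, while a higher token count extends the run-time term.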

Prompt processing performance targets:

| Tokens | Expected tokens/sec |
| -----: | ------------------: |
|     16 |                   2 |
|    128 |                  10 |
|    512 |                  25 |
|   1024 |                  50 |
|   4096 |                 250 |
|  16384 |                1000 |

Text generation performance targets:

| Tokens | Expected tokens/sec |
| -----: | ------------------: |
|     16 |                   1 |
|    128 |                   5 |
|    512 |                  25 |
|   1024 |                  50 |
|   4096 |                 250 |

So running the benchmark on hardware that generates 512 tokens at only 22 tokens/sec will stop early and skip 1024 and larger token lengths, saving compute resources. To allow longer runs, use the --benchmark-timeout-scale flag to increase the timeouts.

Usage

CPU-only (available on AMD64 and ARM64):

docker run --rm --init ghcr.io/sparecores/benchmark-llm:main

CUDA with GPUs (available only on AMD64):

docker run --gpus all --rm --init ghcr.io/sparecores/benchmark-llm:main

JSON lines output is printed to stdout, which can be piped to a file for processing later, e.g.:

docker run --rm --init ghcr.io/sparecores/benchmark-llm:main | tee -a results.jsonl
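For later processing, a file collected this way can be parsed one line at a time. The sketch below makes no assumption about the field names in each record; it only keeps lines that parse as JSON objects and skips anything else (for example, blank lines). The function name is hypothetical.

```python
import json
from typing import Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse JSON lines, skipping blank lines and lines that are not JSON objects."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Not valid JSON; ignore it.
            continue
        if isinstance(record, dict):
            records.append(record)
    return records


# Usage with the file written by the command above:
#     with open("results.jsonl", encoding="utf-8") as fh:
#         records = parse_jsonl(fh)
```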

Models

The default list of models to download and benchmark is:

You can override the default list of models by passing the --model-urls flag. Note that the models should be GGUF files and should be ordered by size, smallest first.

The models are cached in the /models directory by default, which is a temporary Docker volume, but you can override this by passing the --models-dir flag. If you expect to rerun the benchmark, set a different models directory or attach an external location to avoid re-downloading the same models.
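One way to attach an external location is to bind-mount a host directory over the default /models path with Docker's `-v` option. The sketch below only assembles such a command line; the helper name and the host path are hypothetical, and whether a bind mount suits your setup is an assumption.

```python
import shlex
from typing import List

IMAGE = "ghcr.io/sparecores/benchmark-llm:main"


def build_docker_command(
    host_models_dir: str,
    container_models_dir: str = "/models",
    use_gpu: bool = False,
) -> List[str]:
    """Build a docker run command that keeps downloaded models on the host."""
    cmd = ["docker", "run", "--rm", "--init"]
    if use_gpu:
        cmd += ["--gpus", "all"]
    # Bind-mount the host directory over the container's models directory.
    cmd += ["-v", f"{host_models_dir}:{container_models_dir}"]
    cmd.append(IMAGE)
    return cmd


# Render the command as a single shell-safe string for copy and paste.
command_line = shlex.join(build_docker_command("/data/models"))
```

The resulting list can also be passed directly to `subprocess.run` if you prefer to launch the benchmark from Python.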