Calling llama-bench from llama.cpp to benchmark the speed of prompt processing and text generation, using different models and different numbers of tokens.
Each benchmark scenario is repeated 5 times and run on its own so that a timeout can be enforced. The timeout is calculated from the model size (to be loaded into memory/VRAM, assuming a conservative 250 MB/sec read speed), the number of tokens tested, and the expected minimum tokens/sec, requiring faster inference for longer token counts as per the tables below.
Prompt processing performance targets:

| Tokens | Expected tokens/sec |
|---|---|
| 16 | 2 |
| 128 | 10 |
| 512 | 25 |
| 1024 | 50 |
| 4096 | 250 |
| 16384 | 1000 |
Text generation performance targets:

| Tokens | Expected tokens/sec |
|---|---|
| 16 | 1 |
| 128 | 5 |
| 512 | 25 |
| 1024 | 50 |
| 4096 | 250 |
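As a rough illustration of how such a timeout might be derived (a sketch only; the exact formula used by the benchmark may differ, and the model size below is a made-up example):

```sh
# Hypothetical timeout estimate for one scenario: a 4 GB model and a
# 1024-token text generation run (expected minimum of 50 tokens/sec).
MODEL_MB=4000   # assumed model size in MB
TOKENS=1024     # tokens tested in this scenario
MIN_TPS=50      # expected tokens/sec for 1024 tokens (see table above)
# Load time at 250 MB/sec plus inference time at the minimum speed:
echo $(( MODEL_MB / 250 + TOKENS / MIN_TPS ))   # ~36 seconds
```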
So running the benchmark on hardware that can generate 512 tokens at only 22 tokens/sec will not test 1024 and larger token lengths, stopping early to save compute resources. If you want to allow longer runs, use the `--benchmark-timeout-scale` flag to increase the timeouts.
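For example, to give each scenario more time (assuming the flag takes a multiplier applied to the computed timeouts; check the container's `--help` output for the exact semantics):

```sh
# Assumed usage: scale all computed timeouts by a factor of 2.
docker run --rm --init ghcr.io/sparecores/benchmark-llm:main --benchmark-timeout-scale 2
```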
CPU-only (available on AMD64 and ARM64):

```sh
docker run --rm --init ghcr.io/sparecores/benchmark-llm:main
```

CUDA with GPUs (available only on AMD64):

```sh
docker run --gpus all --rm --init ghcr.io/sparecores/benchmark-llm:main
```
JSON lines output is printed to stdout, which can be piped to a file for later processing, e.g.:

```sh
docker run --rm --init ghcr.io/sparecores/benchmark-llm:main | tee -a results.jsonl
```
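The collected file can then be inspected with standard JSON Lines tooling, for example (the field names inside the records depend on the benchmark version, so none are assumed here):

```sh
# Pretty-print each benchmark record; jq reads JSON Lines natively.
jq '.' results.jsonl
# Count the number of recorded results (one JSON object per line).
wc -l results.jsonl
```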
The default list of models to download and benchmark is:
You can override the default list of models by passing the `--model-urls` flag, but note that the models should be GGUF files, ordered by size (starting with the smallest).
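A sketch of overriding the model list (the URLs below are placeholders rather than the actual defaults, and the flag is assumed to accept multiple URLs; check the container's `--help` output for the exact syntax):

```sh
# Placeholder GGUF URLs, listed from smallest to largest model.
docker run --rm --init ghcr.io/sparecores/benchmark-llm:main \
  --model-urls \
    https://example.com/models/small-model.Q4_K_M.gguf \
    https://example.com/models/larger-model.Q4_K_M.gguf
```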
The models are cached in the `/models` directory by default, which is a temporary Docker volume, but you can override this by passing the `--models-dir` flag. If you need to rerun the benchmark multiple times, you might want to set a different models directory or attach an external location to avoid re-downloading the same models.
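For example, to keep the downloaded models on the host between runs (a sketch; the host path is an arbitrary example and `/cache` is just an illustrative container path):

```sh
# Mount a host directory into the container and point the benchmark at it.
docker run --rm --init \
  -v /srv/llm-models:/cache \
  ghcr.io/sparecores/benchmark-llm:main --models-dir /cache
```

Mounting a host directory over the default `/models` path should work as well, without needing the extra flag.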