Skip to content

Commit

Permalink
readme : add Metal instructions
Browse files Browse the repository at this point in the history
  • Loading branch information
ggerganov committed Jun 4, 2023
1 parent db3db9e commit e33002d
Showing 1 changed file with 24 additions and 3 deletions.
27 changes: 24 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,11 +51,10 @@ Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quantization on a MacBook

- Plain C/C++ implementation without dependencies
- Apple silicon first-class citizen - optimized via ARM NEON and Accelerate framework
- Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- Mixed F16 / F32 precision
- 4-bit, 5-bit and 8-bit integer quantization support
- Runs on the CPU
- Supports OpenBLAS/Apple BLAS/ARM Performance Lib/ATLAS/BLIS/Intel MKL/NVHPC/ACML/SCSL/SGIMATH and [more](https://cmake.org/cmake/help/latest/module/FindBLAS.html#blas-lapack-vendors) in BLAS
- cuBLAS and CLBlast support

Expand Down Expand Up @@ -236,6 +235,28 @@ In order to build llama.cpp you have three different options.
zig build -Drelease-fast
```

### Metal Build

Using Metal allows the computation to be executed on the GPU for Apple devices:

- Using `make`:

```bash
LLAMA_METAL=1 make
```

- Using `CMake`:

```bash
mkdir build-metal
cd build-metal
cmake -DLLAMA_METAL=ON ..
cmake --build . --config Release
```

When built with Metal support, you can enable GPU inference with the `--gpu-layers|-ngl` command-line argument.
Any value larger than 0 will offload the computation to the GPU.

### BLAS Build

Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance. There are currently three different implementations of it:
Expand Down Expand Up @@ -367,7 +388,7 @@ Building the program with BLAS support may lead to some performance improvements
Running:
The CLBlast build supports `--gpu-layers|-ngl` like the CUDA version does.
The CLBlast build supports `--gpu-layers|-ngl` like the CUDA version does.
To select the correct platform (driver) and device (GPU), you can use the environment variables `GGML_OPENCL_PLATFORM` and `GGML_OPENCL_DEVICE`.
The selection can be a number (starting from 0) or a text string to search:
Expand Down

0 comments on commit e33002d

Please sign in to comment.