
Feature Request: Add TPU/Hardware Accelerator Support (e.g., Google Coral, Hailo) to llama.cpp #11603

FixeQyt opened this issue Feb 2, 2025 · 0 comments
Labels: enhancement (New feature or request)

FixeQyt commented Feb 2, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I propose adding hardware-acceleration support to llama.cpp for dedicated AI chips such as the Google Coral Edge TPU and Hailo accelerators. This would let users leverage dedicated AI hardware for faster LLM inference (e.g., LLaMA) on edge devices like the Raspberry Pi and other low-power setups.

Motivation

  • Current Limitation: llama.cpp relies heavily on CPU/GPU, which limits performance on resource-constrained devices.
  • TPUs and Hailo: These accelerators are designed for efficient tensor operations and could drastically reduce inference latency/power consumption.
  • Community Impact: Many developers use devices like Raspberry Pi with TPU/Hailo add-ons – this integration would unlock new use cases.

Possible Implementation

1. Google Coral (Edge TPU) Integration

  • Libraries: Use libedgetpu (GitHub), Google's open-source library for interacting with Edge TPUs.
  • Model Conversion:
    • Convert GGUF/GGML models to TensorFlow Lite format using existing tools in llama.cpp.
    • Compile TFLite models for TPU compatibility using the edgetpu_compiler tool.
  • Inference Workflow:
    • Offload matrix operations (e.g., tensor contractions) to the TPU via libedgetpu APIs.
    • Implement TPU-specific quantization (e.g., int8) to maximize performance.
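The Coral workflow above can be sketched with the Edge TPU delegate for TensorFlow Lite. This is a minimal illustration, not llama.cpp code: it assumes the Coral runtime (libedgetpu) and the tflite_runtime Python package are installed, and the model path is hypothetical. The imports are kept inside the function so the file loads on machines without the Coral runtime.

```python
def run_on_edgetpu(model_path, input_array):
    """Run one inference pass on the Edge TPU and return the output tensor.

    Assumes `model_path` points at a model already compiled with
    edgetpu_compiler (e.g. "model_edgetpu.tflite", a hypothetical name).
    """
    # Local imports: these require the Coral runtime to be installed.
    from tflite_runtime.interpreter import Interpreter, load_delegate

    interpreter = Interpreter(
        model_path=model_path,
        # The Edge TPU delegate routes supported ops to the accelerator;
        # ops the TPU cannot run stay on the CPU.
        experimental_delegates=[load_delegate("libedgetpu.so.1")],
    )
    interpreter.allocate_tensors()

    input_detail = interpreter.get_input_details()[0]
    output_detail = interpreter.get_output_details()[0]
    interpreter.set_tensor(input_detail["index"], input_array)
    interpreter.invoke()
    return interpreter.get_tensor(output_detail["index"])
```

Note that the Edge TPU only executes fully int8-quantized models, which is why the quantization step above is mandatory rather than optional.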

2. Hailo Integration

  • Libraries: Leverage hailort (GitHub), Hailo's runtime library for deploying models on Hailo accelerators.
  • Model Conversion:
    • Convert models to Hailo's native HEF format using the Hailo Dataflow Compiler.
    • Use intermediate formats like ONNX for compatibility with Hailo's tools.
  • Inference Workflow:
    • Load HEF models via hailort and manage inference pipelines for low-latency execution.
    • Optimize model layers using Hailo's profiling tools to balance compute between CPU and Hailo.
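For the Hailo side, the HEF-loading workflow might look like the sketch below, using HailoRT's Python bindings (the hailo_platform package). The class names are written from memory of that API and should be treated as assumptions to verify against the HailoRT documentation; imports are local so the file loads without the Hailo runtime installed.

```python
def run_on_hailo(hef_path, input_frames):
    """Sketch: run inference on a Hailo device via the HailoRT Python API.

    All hailo_platform names below are assumptions based on HailoRT's
    published Python bindings, not verified llama.cpp integration code.
    """
    # Local imports: these require the hailort runtime and its Python package.
    from hailo_platform import (HEF, VDevice, ConfigureParams,
                                HailoStreamInterface, InferVStreams,
                                InputVStreamParams, OutputVStreamParams)

    hef = HEF(hef_path)  # compiled model from the Hailo Dataflow Compiler
    with VDevice() as device:
        params = ConfigureParams.create_from_hef(
            hef, interface=HailoStreamInterface.PCIe)
        network_group = device.configure(hef, params)[0]
        in_params = InputVStreamParams.make(network_group)
        out_params = OutputVStreamParams.make(network_group)
        # Activate the network and stream frames through the inference pipeline.
        with network_group.activate():
            with InferVStreams(network_group, in_params, out_params) as pipeline:
                return pipeline.infer(input_frames)
```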

3. Unified Hardware Abstraction

  • Design a modular backend system in llama.cpp to support multiple accelerators (TPU, Hailo, GPU).
  • Add configuration flags (e.g., --tpu, --hailo) to let users select the accelerator at runtime.
  • Provide clear error handling for unsupported operations (e.g., fallback to CPU).
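The abstraction described above, including the CPU fallback, can be illustrated with a small backend registry. Everything here is hypothetical scaffolding (llama.cpp's actual backend system is in C/C++); the point is only the dispatch-and-fallback shape.

```python
class BackendUnavailable(Exception):
    """Raised by an accelerator backend when its device is missing."""

# Maps backend name ("cpu", "tpu", ...) to an implementation function.
BACKENDS = {}

def register_backend(name):
    def wrap(fn):
        BACKENDS[name] = fn
        return fn
    return wrap

@register_backend("cpu")
def cpu_matmul(a, b):
    # Naive reference matmul; always available as the fallback path.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

@register_backend("tpu")
def tpu_matmul(a, b):
    # Stand-in for an Edge TPU path; raises when no device is present.
    raise BackendUnavailable("no Edge TPU found")

def matmul(a, b, backend="cpu"):
    """Dispatch to the requested backend; fall back to CPU on failure."""
    try:
        return BACKENDS[backend](a, b)
    except BackendUnavailable as err:
        print(f"warning: {backend} unavailable ({err}); falling back to cpu")
        return BACKENDS["cpu"](a, b)
```

With no TPU attached, `matmul([[1, 2]], [[3], [4]], backend="tpu")` prints the warning and still returns `[[11]]` via the CPU path, which is exactly the graceful degradation the flags proposal calls for.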

4. Cross-Platform Support

  • Raspberry Pi: Document driver installation and library dependencies for both Coral TPU and Hailo.
  • Quantization Tools: Extend llama.cpp's quantization scripts to generate accelerator-optimized models (e.g., TPU-int8, Hailo-16bit).
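As background for the quantization-tooling point, the affine int8 scheme that Edge TPUs consume maps a float range onto [-128, 127] with a scale and zero point. A minimal self-contained sketch (not llama.cpp's quantization code):

```python
def quantize_int8(values):
    """Affine-quantize a list of floats to int8; return (q, scale, zero_point)."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # the range must include zero
    scale = (hi - lo) / 255.0 or 1.0      # guard against an all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map int8 values back to floats: x ≈ (q - zero_point) * scale."""
    return [(qi - zero_point) * scale for qi in q]
```

The round trip loses at most half a quantization step per value, which is the accuracy trade the TPU-int8 models mentioned above would make in exchange for throughput.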

Use Case Examples

  • Raspberry Pi + Hailo-8L: Local AI chatbot with real-time response.
  • Google Coral + LLaMA-7B: Energy-efficient inference for IoT devices.

Testing Availability

I will soon acquire the Raspberry Pi AI Kit with Hailo-8L and can act as a tester for the Hailo integration. I should be able to start testing within a few weeks. My setup will include a Raspberry Pi 5 with 8 GB (or even 16 GB) RAM, and I plan to test models like LLaMA and DeepSeek for tasks such as text generation and chatbot applications.
