lora-fast

Minimal repository to demonstrate fast LoRA inference with Flux.1-dev using different settings that can help with speed or memory efficiency. Please check the accompanying blog post at this URL.

The included benchmark script lets you experiment with:

  • FlashAttention3
  • torch.compile
  • Quantization
  • LoRA hot-swapping
  • CPU offloading
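To see why hot-swapping pairs well with torch.compile, it helps to look at the LoRA update itself: the adapted layer computes y = x(W + scale · AB), and if A and B are padded to a fixed maximum rank, a new adapter can be copied into the existing buffers without changing any tensor shapes, so a shape-specialized compiled graph never needs recompiling. The sketch below illustrates this in plain Python; the class and helper names are illustrative, not the repo's actual code (which uses diffusers and torch).

```python
# Illustrative sketch of why LoRA hot-swapping avoids recompilation.
# Names here are hypothetical, not the repo's actual code.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

class LoraLinear:
    """Computes y = x @ (W + scale * A @ B), with A/B padded to a fixed max rank.

    Because A and B keep the same shapes across adapters, swapping in a new
    LoRA only copies numbers in place -- a compiler that specialized on tensor
    shapes (like torch.compile) never sees a change and never recompiles.
    """
    def __init__(self, W, max_rank):
        d_in, d_out = len(W), len(W[0])
        self.W = W
        self.scale = 1.0
        self.A = [[0.0] * max_rank for _ in range(d_in)]   # (d_in, max_rank)
        self.B = [[0.0] * d_out for _ in range(max_rank)]  # (max_rank, d_out)

    def hotswap(self, A_new, B_new, scale=1.0):
        # Copy in place, zero-padding up to max_rank: shapes never change.
        for i, row in enumerate(self.A):
            for j in range(len(row)):
                row[j] = A_new[i][j] if j < len(A_new[0]) else 0.0
        for i, row in enumerate(self.B):
            for j in range(len(row)):
                row[j] = B_new[i][j] if i < len(B_new) else 0.0
        self.scale = scale

    def forward(self, x):
        delta = matmul(self.A, self.B)
        W_eff = [[w + self.scale * d for w, d in zip(rw, rd)]
                 for rw, rd in zip(self.W, delta)]
        return matmul(x, W_eff)
```

For example, a rank-1 adapter can be swapped into buffers sized for rank 2; the unused rows stay zero and the output is unchanged until the new weights are copied in.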

Key results

| Option | Time (s) ⬇️ | Speedup (vs. baseline) ⬆️ | Notes |
|---|---|---|---|
| baseline | 7.8910 | – | Baseline |
| optimized | 3.5464 | 2.23× | Hot-swapping + compilation without recompilation hiccups (FP8 on by default) |
| no_fp8 | 4.3520 | 1.81× | Same as optimized, but with FP8 quantization disabled |
| no_fa3 | 4.3020 | 1.84× | Disable FA3 (flash-attention v3) |
| baseline + compile | 5.0920 | 1.55× | Compilation on, but suffers from intermittent recompilation stalls |
| no_fa3_fp8 | 5.0850 | 1.55× | Disable FA3 and FP8 |
| no_compile_fp8 | 7.5190 | 1.05× | Disable FP8 quantization and compilation |
| no_compile | 10.4340 | 0.76× | Disable compilation: the slowest setting |
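The FP8 rows above trade a small amount of numerical precision for speed and memory. The idea behind weight quantization is to store weights in a low-bit format plus a scale factor, reconstructing approximate floats at compute time. The sketch below uses int8 with a single per-tensor scale purely as a stand-in to illustrate the trade-off; it is not the FP8 scheme the benchmark actually uses.

```python
# Minimal sketch of per-tensor symmetric weight quantization, using int8
# as a stand-in for FP8. Purely illustrative of the precision/memory
# trade-off; not the actual quantization code used by the benchmark.

def quantize(weights, n_bits=8):
    """Map floats to signed integers with a single per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1                  # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.127, -0.254, 0.0635]
q, s = quantize(w)
w_hat = dequantize(q, s)
# Round-to-nearest bounds the reconstruction error by scale / 2 per weight.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, w_hat))
```

Each weight now occupies one byte instead of four (for FP32), at the cost of a bounded rounding error; real FP8 formats additionally keep a floating-point exponent per value, which handles wide dynamic ranges better than this fixed-scale integer sketch.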

Installation

The requirements for this repository are listed in requirements.txt. Ensure they are installed in your Python environment, e.g. by running:

python -m pip install -r requirements.txt

FlashAttention3

Optionally, use FlashAttention3 for even better performance. This requires a Hopper GPU (e.g. H100). Follow the install instructions here.
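FlashAttention3 does not change what attention computes, only how: it fuses the whole computation into a single IO-aware kernel tuned for Hopper. For reference, the math it accelerates is ordinary scaled dot-product attention, softmax(QKᵀ/√d)V, shown below as an unfused, single-head plain-Python sketch (illustrative only, nothing like the real kernel).

```python
# Unfused reference for the math that FlashAttention3 accelerates:
# softmax(Q K^T / sqrt(d)) V, single head, plain Python. Illustrative
# only -- the real kernel fuses all of this and never materializes
# the full score matrix.
import math

def attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                     # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Convex combination of value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```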

Running the benchmarks

Run the benchmarks using the provided run_benchmark.py script. To check the available arguments, run:

python run_benchmark.py --help

If you want to run a battery of different settings, some shell scripts are provided to achieve that. Use run_experiments.sh if you have a server GPU like an H100. Use run_exps_rtx_4090.sh if you have a consumer GPU with 24 GB of memory, like an RTX 4090. The benchmark data and sample images are stored by default in the results/ directory.

Standalone script

The inference_lora.py script implements the optimizations in sequence and is geared towards an H100. It is a simpler reference than run_benchmark.py; refer to it if you want to run inference with the optimizations applied but are not interested in benchmarking.