lora-fast

Minimal repository to demonstrate fast LoRA inference with Flux.1-dev using different settings that can help with speed or memory efficiency. Please check the accompanying blog post at this URL.

The included benchmark script lets you experiment with:

  • FlashAttention3
  • torch.compile
  • Quantization
  • LoRA hot-swapping
  • CPU offloading
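To see why hot-swapping pairs well with torch.compile, it helps to look at the LoRA update itself: the adapted layer computes y = x(W + scale · AB), and if A and B are padded to a fixed maximum rank, a new adapter can be copied into the existing buffers without changing any tensor shapes, so a shape-specialized compiled graph never needs recompiling. The sketch below illustrates this in plain Python; the class and helper names are illustrative, not the repo's actual code (which uses diffusers and torch).

```python
# Illustrative sketch of why LoRA hot-swapping avoids recompilation.
# Names here are hypothetical, not the repo's actual code.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

class LoraLinear:
    """Computes y = x @ (W + scale * A @ B), with A/B padded to a fixed max rank.

    Because A and B keep the same shapes across adapters, swapping in a new
    LoRA only copies numbers in place -- a compiler that specialized on tensor
    shapes (like torch.compile) never sees a change and never recompiles.
    """
    def __init__(self, W, max_rank):
        d_in, d_out = len(W), len(W[0])
        self.W = W
        self.scale = 1.0
        self.A = [[0.0] * max_rank for _ in range(d_in)]   # (d_in, max_rank)
        self.B = [[0.0] * d_out for _ in range(max_rank)]  # (max_rank, d_out)

    def hotswap(self, A_new, B_new, scale=1.0):
        # Copy in place, zero-padding up to max_rank: shapes never change.
        for i, row in enumerate(self.A):
            for j in range(len(row)):
                row[j] = A_new[i][j] if j < len(A_new[0]) else 0.0
        for i, row in enumerate(self.B):
            for j in range(len(row)):
                row[j] = B_new[i][j] if i < len(B_new) else 0.0
        self.scale = scale

    def forward(self, x):
        delta = matmul(self.A, self.B)
        W_eff = [[w + self.scale * d for w, d in zip(rw, rd)]
                 for rw, rd in zip(self.W, delta)]
        return matmul(x, W_eff)
```

For example, a rank-1 adapter can be swapped into buffers sized for rank 2; the unused rows stay zero and the output is unchanged until the new weights are copied in.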

Key results

| Option | Time (s) ⬇️ | Speedup (vs. baseline) ⬆️ | Notes |
|---|---|---|---|
| baseline | 7.8910 | – | Baseline |
| optimized | 3.5464 | 2.23× | Hot-swapping + compilation without recompilation hiccups (FP8 on by default) |
| no_fp8 | 4.3520 | 1.81× | Same as optimized, but with FP8 quantization disabled |
| no_fa3 | 4.3020 | 1.84× | Disable FA3 (flash-attention v3) |
| baseline + compile | 5.0920 | 1.55× | Compilation on, but suffers from intermittent recompilation stalls |
| no_fa3_fp8 | 5.0850 | 1.55× | Disable FA3 and FP8 |
| no_compile_fp8 | 7.5190 | 1.05× | Disable FP8 quantization and compilation |
| no_compile | 10.4340 | 0.76× | Disable compilation: the slowest setting |
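The FP8 rows above trade a small amount of numerical precision for speed and memory. The idea behind weight quantization is to store weights in a low-bit format plus a scale factor, reconstructing approximate floats at compute time. The sketch below uses int8 with a single per-tensor scale purely as a stand-in to illustrate the trade-off; it is not the FP8 scheme the benchmark actually uses.

```python
# Minimal sketch of per-tensor symmetric weight quantization, using int8
# as a stand-in for FP8. Purely illustrative of the precision/memory
# trade-off; not the actual quantization code used by the benchmark.

def quantize(weights, n_bits=8):
    """Map floats to signed integers with a single per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1                  # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.127, -0.254, 0.0635]
q, s = quantize(w)
w_hat = dequantize(q, s)
# Round-to-nearest bounds the reconstruction error by scale / 2 per weight.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, w_hat))
```

Each weight now occupies one byte instead of four (for FP32), at the cost of a bounded rounding error; real FP8 formats additionally keep a floating-point exponent per value, which handles wide dynamic ranges better than this fixed-scale integer sketch.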

Installation

The requirements for this repository are listed in requirements.txt. Ensure they are installed in your Python environment, e.g. by running:

python -m pip install -r requirements.txt

FlashAttention3

Optionally, use FlashAttention3 for even better performance. This requires a Hopper GPU (e.g. H100). Follow the install instructions here.
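FlashAttention3 does not change what attention computes, only how: it fuses the whole computation into a single IO-aware kernel tuned for Hopper. For reference, the math it accelerates is ordinary scaled dot-product attention, softmax(QKᵀ/√d)V, shown below as an unfused, single-head plain-Python sketch (illustrative only, nothing like the real kernel).

```python
# Unfused reference for the math that FlashAttention3 accelerates:
# softmax(Q K^T / sqrt(d)) V, single head, plain Python. Illustrative
# only -- the real kernel fuses all of this and never materializes
# the full score matrix.
import math

def attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                     # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Convex combination of value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```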

Running the benchmarks

Run the benchmarks using the provided run_benchmark.py script. To check the available arguments, run:

python run_benchmark.py --help

If you want to run a battery of different settings, some shell scripts are provided to achieve that. Use run_experiments.sh if you have a server GPU like an H100. Use run_exps_rtx_4090.sh if you have a consumer GPU with 24 GB of memory, like an RTX 4090. The benchmark data and sample images are stored by default in the results/ directory.

Standalone script

The inference_lora.py script implements the optimizations in sequence and is geared towards an H100. It is a simpler reference than run_benchmark.py; refer to it if you want to run inference with the optimizations applied but are not interested in benchmarking.