
[Feature] Support LoRA path renaming and add LoRA serving benchmarks #1433

Merged: 2 commits into main from lora_bench, Sep 15, 2024

Conversation

@Ying1123 (Member) commented Sep 15, 2024

This PR supports LoRA path renaming. See example below:

# launch server
python -m sglang.launch_server --model mistralai/Mistral-7B-Instruct-v0.3 --lora-paths /home/ying/test_lora lora1=/home/ying/test_lora_1 lora2=/home/ying/test_lora_2 --disable-radix --disable-cuda-graph --max-loras-per-batch 4

# send requests
# lora_path[i] specifies the LoRA adapter used for text[i], so the two lists must have the same length
# use None for a base-model-only prompt, e.g. "lora_path": [None, "/home/ying/test_lora"]
import json
import requests

url = "http://127.0.0.1:30000"
json_data = {
    "text": ["prompt 1", "prompt 2", "prompt 3", "prompt 4", "prompt 5", "prompt 6", "prompt 7"],
    "sampling_params": {"max_new_tokens": 32},
    "lora_path": ["/home/ying/test_lora", "lora1", "lora2", "lora1", "lora2", None, None],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(json.dumps(response.json()))
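For a batched request like the one above, /generate returns one result per input prompt. A minimal sketch of reading the results back, assuming each result object carries its generation under a "text" key (the field name is an assumption, not confirmed by this PR):

# Pair each prompt with its adapter and generated text.
# Assumption: the batched response is a list of dicts with a "text" field.
results = response.json()
for prompt, lora, result in zip(json_data["text"], json_data["lora_path"], results):
    adapter = lora if lora is not None else "base model"
    print(f"[{adapter}] {prompt!r} -> {result['text']!r}")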

What has been done:

  • Initial LoRA support ([Feature] Initial support for multi-LoRA serving #1307)
    That PR added initial multi-LoRA serving support. Currently, it supports LoRA on the attention (qkvo) and MLP (gate, up, down) linear layers. It supports dynamic loading and offloading, but not unified memory. The memory pool for LoRA adapters is pre-allocated, so please use a smaller --mem-frac to launch the server with a larger --max-loras-per-batch (see the sketch after this list).
  • Misc: path renaming (this PR)
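As a hypothetical illustration of the --mem-frac guidance above (the value 0.7 and the adapter paths are placeholders; tune them for your GPU and adapter sizes):

# Hypothetical launch: lower --mem-frac to leave room for a larger pre-allocated LoRA pool.
python -m sglang.launch_server --model mistralai/Mistral-7B-Instruct-v0.3 \
    --lora-paths lora1=/home/ying/test_lora_1 lora2=/home/ying/test_lora_2 \
    --max-loras-per-batch 8 --mem-frac 0.7 \
    --disable-radix --disable-cuda-graph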

You can expect the items below in follow-up PRs.

  • OpenAI-compatible API
  • Compatibility with CUDA graph
  • Compatibility with radix attention
  • Fully sharded LoRAs with tensor parallelism
  • Performance optimization
  • Memory optimization
  • Support for LoRAs with different ranks
  • Triton backend
  • Enhanced test cases

References:
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Punica: Multi-Tenant LoRA Serving

@Ying1123 changed the title Add LoRA serving benchmark → Support LoRA path renaming and add LoRA serving benchmarks on Sep 15, 2024
@Ying1123 changed the title Support LoRA path renaming and add LoRA serving benchmarks → [Feature] Support LoRA path renaming and add LoRA serving benchmarks on Sep 15, 2024
@Ying1123 merged commit 3796339 into main on Sep 15, 2024
11 checks passed
@Ying1123 deleted the lora_bench branch on September 15, 2024 at 19:46