[RFC] Add an LLM engine #1127
Thanks for your contribution! It looks good at a high level. My review is still in progress, and other committers will also take a look.
4423ba7
to
a65ceec
Compare
@JianyuZhan Could you fully verify locally before committing for now? I'm currently troubleshooting CI issues, and that work will be affected.
0bd2a60
to
b779199
Compare
Hi @Ying1123, @zhyncs, this PR is now complete and passes all CI tests. (The previous e2e-test failure was due to a missing PYTHONPATH setting, so I added one in the
This PR makes the following modifications:
b33164f
to
1d0edcd
Compare
I have rebased onto the latest upstream/main branch, and it now passes all CI tests.
38a6146
to
43e01f7
Compare
@JianyuZhan Hi, I tried to test with your repo. I cloned the main branch and ran
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu118/torch2.4/
but when I run the following test
from sglang import LLM, SamplingParams
it fails with
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
Cell In[1], line 1
----> 1 from sglang import LLM, SamplingParams
ImportError: cannot import name 'LLM' from 'sglang' (unknown location)
How can I fix this? Looking forward to your help!
@DragonFive , Add "LLM" and "SamplingParams" in |
It works fine for me, thanks for your contribution! |
Running into the following issues (surfaced via
|
@jischein , Thanks for testing. I don't have this multi-GPU environment to test. Per my analysis, your error looks like the |
This PR raises "AttributeError: 'Engine' object has no attribute 'tp_procs'" when running inference on a single GPU; self.tp_procs = None needs to be set in Engine.startup.
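The failure mode reported here can be sketched as follows: an attribute that is only assigned on the multi-GPU branch is missing when a later method reads it on a single GPU. The class below is a hypothetical stand-in, not the PR's actual Engine; only the names tp_procs and startup come from the comment above, and everything else (tp_size, shutdown, the placeholder worker strings) is an assumption for illustration.

```python
class Engine:
    """Hypothetical stand-in; only `tp_procs` and `startup` are names from the thread."""

    def __init__(self, tp_size: int = 1):
        self.tp_size = tp_size
        self.startup()

    def startup(self):
        # Initialise unconditionally so the attribute exists on every code path.
        self.tp_procs = None
        if self.tp_size > 1:
            # Placeholder for launching one worker per additional tensor-parallel rank.
            self.tp_procs = [f"worker-{rank}" for rank in range(1, self.tp_size)]

    def shutdown(self):
        # Safe on a single GPU: tp_procs is None instead of missing.
        if self.tp_procs is not None:
            self.tp_procs.clear()
```

Without the unconditional assignment in startup, shutdown would raise the AttributeError quoted above whenever tp_size is 1.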
60efeb7
to
581f436
Compare
@@ -195,9 +195,9 @@ def load_model(self):
 monkey_patch_vllm_qvk_linear_loader()

 self.dtype = self.vllm_model_config.dtype
-if self.model_config.model_override_args is not None:
+if self.model_config.model_overide_args is not None:
This should be model_override_args; there is a typo, as this doesn't match the fn signature in get_config. I tried compiling the engine with tp=8 and got:
>>> from sglang import LLM, SamplingParams
>>> llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/sglang/python/sglang/srt/serving/engine.py", line 190, in __init__
self.llm_engine = Engine(engine_args)
File "/home/ubuntu/sglang/python/sglang/srt/serving/engine.py", line 50, in __init__
self.startup()
File "/home/ubuntu/sglang/python/sglang/srt/serving/engine.py", line 89, in startup
self.tokenizer_manager = TokenizerManager(self.engine_args)
File "/home/ubuntu/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 93, in __init__
self.hf_config = get_config(
TypeError: get_config() got an unexpected keyword argument 'model_overide_args'
>>>
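The traceback above is the generic Python behaviour for a misspelled keyword argument. A minimal sketch that reproduces it, assuming a simplified get_config whose signature is a hypothetical stand-in (only the function name and the keyword model_override_args are taken from this thread):

```python
def get_config(model, trust_remote_code, model_override_args=None):
    """Hypothetical stand-in using the keyword name the reviewers cite."""
    return {
        "model": model,
        "trust_remote_code": trust_remote_code,
        "overrides": model_override_args or {},
    }


def call_with_typo():
    """Call get_config with the misspelled keyword and return the error text."""
    try:
        get_config("some-model", trust_remote_code=False, model_overide_args={"x": 1})
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(call_with_typo())
```

Because the callee does not accept arbitrary keyword arguments, the typo fails at call time instead of being silently ignored.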
-trust_remote_code=server_args.trust_remote_code,
-model_override_args=model_override_args,
+trust_remote_code=engine_args.trust_remote_code,
+model_overide_args=engine_args.model_override_args,
Same here; this should be model_override_args to match the fn signature in get_config.
-self.port_args = port_args
-self.model_override_args = model_override_args
+self.engine_args = engine_args
+self.model_overide_args = engine_args.model_override_args
here too
engine_args.model_path,
engine_args.trust_remote_code,
context_length=engine_args.context_length,
model_overide_args=engine_args.model_override_args,
here as well
@JianyuZhan unfortunately still running into errors after cleaning up the typos
|
JianyuZhan#1 — @JianyuZhan this compiles / addresses the typo |
Changing every 'model_overide_args' to 'model_override_args' across the repo makes it work.
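A repo-wide rename like this can be done with any search-and-replace tool. As one possible sketch in Python (the helper name replace_in_tree and its parameters are made up for illustration and are not part of sglang):

```python
from pathlib import Path


def replace_in_tree(root, old, new, pattern="*.py"):
    """Replace `old` with `new` in every matching file under `root`.

    Returns the list of files that were modified.
    """
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed


if __name__ == "__main__":
    # Example: run from the repository root.
    for path in replace_in_tree("python", "model_overide_args", "model_override_args"):
        print(f"fixed {path}")
```

This is a plain substring replacement, so it is worth reviewing the resulting diff before committing.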
a00e992
to
688cb2c
Compare
@JianyuZhan It ran well on llama3.1-8b before I upgraded sglang to v0.3.0; after that I hit a confusing error:
10:46:19.553 [10:46:19 TP0] Exception in ControllerSingle:
10:46:19.553 Traceback (most recent call last):
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 157, in start_controller_process
10:46:19.553 controller.loop_for_forward()
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 98, in loop_for_forward
10:46:19.553 out_pyobjs = self.tp_server.exposed_step(recv_reqs)
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 243, in exposed_step
10:46:19.553 self.forward_step()
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553 return func(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 259, in forward_step
10:46:19.553 self.forward_prefill_batch(new_batch)
10:46:19.553 File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 506, in forward_prefill_batch
10:46:19.553 sample_output, logits_output = self.model_runner.forward(
10:46:19.553 File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 591, in forward
10:46:19.553 return self.forward_extend(batch)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553 return func(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 555, in forward_extend
10:46:19.553 return self.model.forward(
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553 return func(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/models/llama.py", line 317, in forward
10:46:19.553 hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553 return self._call_impl(*args, **kwargs)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553 return forward_call(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/models/llama.py", line 282, in forward
10:46:19.553 hidden_states, residual = layer(
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553 return self._call_impl(*args, **kwargs)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553 return forward_call(*args, **kwargs)
10:46:19.553 File "/github_sglang/python/sglang/srt/models/llama.py", line 232, in forward
10:46:19.553 hidden_states = self.self_attn(
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553 return self._call_impl(*args, **kwargs)
10:46:19.553 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553 return forward_call(*args, **kwargs)
10:46:19.554 File "/github_sglang/python/sglang/srt/models/llama.py", line 168, in forward
10:46:19.554 q, k = self.rotary_emb(positions, q, k)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.554 return self._call_impl(*args, **kwargs)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.554 return forward_call(*args, **kwargs)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/custom_op.py", line 14, in forward
10:46:19.554 return self._forward_method(*args, **kwargs)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 216, in forward_cuda
10:46:19.554 ops.rotary_embedding(positions, query, key, self.head_size,
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 37, in wrapper
10:46:19.554 raise e
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 28, in wrapper
10:46:19.554 return fn(*args, **kwargs)
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 138, in rotary_embedding
10:46:19.554 torch.ops._C.rotary_embedding(positions, query, key, head_size,
10:46:19.554 File "/usr/local/lib/python3.9/site-packages/torch/_ops.py", line 1170, in __getattr__
10:46:19.554 raise AttributeError(
10:46:19.554 AttributeError: '_OpNamespace' '_C' object has no attribute 'rotary_embedding' |
@DragonFive I think this is because the vllm dependency was upgraded; you should update the dependencies in your local installation as well: |
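To check for such a mismatch before reinstalling, a minimal sketch that lists the installed versions of the relevant packages (this is only a diagnostic, not the update command referred to above; the package list is an assumption based on the names mentioned in this thread):

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("sglang", "vllm", "torch", "flashinfer")):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing this output against the versions pinned by the sglang release in use shows whether the local environment is stale.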
Hi, thank you for this PR. I'm looking forward to trying it out. I'm wondering if there is a plan to support asynchronous operations, similar to vllm.AsyncLLMEngine. |
Hi @yangky11 Maybe you can try this sglang/python/sglang/srt/server.py Line 562 in 05bea68
|
@JianyuZhan @zhyncs is this close to being merged? Would love to start using |
moved to #1567 |
Although this PR was closed, we still appreciate @JianyuZhan 's contribution. Thanks! |
Motivation
Edited 8/18: it is now complete; see the conversation for the new PR description.
This is not complete work, just a PoC and request for comment.
This PR adds an LLM engine, addressing the Roadmap item "Add APIs for using the inference engine in a single script without launching a separate server."
The demo usage is in examples/usage/llm_engine.py.
Modification
It adds:
- Engine, which wraps the core logic of what the current server.launch_server() does, with the addition of shutdown logic to gracefully bring down the ZMQ sockets in the TokenizerManager when the job finishes.
- class SamplingParams, which is now exposed as an API. (GenerateReqInput has a dict version of SamplingParams, while the internal logic uses the class version. Exposing the class as an API therefore requires a circuitous transform from class to dict and back to class, which needs to be fixed later.)
- EngineArgs, with the new ServerArgs made a thin wrapper of it (see sglang/srt/serving/engine_args.py and sglang/srt/serving/server_args.py in the commit), plus some config objects built from EngineArgs, such as ModelConfig, ScheduleConfig, ParallelConfig, etc., mimicking vllm. This opens up an opportunity to clean up the internal passing of ServerArgs around many functions, and to draw cleaner APIs for the different sub-components. I have not made this modification yet (these files are added, but have no effect on the server code logic for now), because it is quite intrusive to the current code base; hence this PR is an RFC.
Checklist
pre-commit run --all-files or other linting tools are used to fix potential lint issues.
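The EngineArgs / ServerArgs split described under Modification can be sketched roughly as below. This is an illustration under stated assumptions, not the code in the PR: the field names model_path, trust_remote_code, context_length and model_override_args appear in the diff snippets in this thread, while tp_size, host, port, create_model_config and to_engine_args are hypothetical, and modelling the "thin wrapper" as a dataclass subclass is only one possible reading.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ModelConfig:
    """Config object derived from EngineArgs (fields chosen for illustration)."""
    model_path: str
    trust_remote_code: bool
    context_length: Optional[int]
    model_override_args: Optional[dict]


@dataclass
class EngineArgs:
    """Arguments needed to run the engine without any HTTP server."""
    model_path: str
    trust_remote_code: bool = False
    context_length: Optional[int] = None
    model_override_args: Optional[dict] = None
    tp_size: int = 1  # hypothetical field

    def create_model_config(self) -> ModelConfig:  # hypothetical helper
        return ModelConfig(
            self.model_path,
            self.trust_remote_code,
            self.context_length,
            self.model_override_args,
        )


@dataclass
class ServerArgs(EngineArgs):
    """Thin wrapper: engine arguments plus server-only settings."""
    host: str = "127.0.0.1"  # hypothetical field
    port: int = 30000  # hypothetical field

    def to_engine_args(self) -> EngineArgs:  # hypothetical helper
        # Copy only the fields that EngineArgs declares.
        values = {f.name: getattr(self, f.name) for f in fields(EngineArgs)}
        return EngineArgs(**values)
```

With a split like this, sub-components can accept a narrow config object such as ModelConfig instead of the whole ServerArgs, which is the cleanup the description proposes.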