[BugFix] fix num_lookahead_slots missing in async executor #4165
Conversation
cc @cadedaniel
Good catch. @cadedaniel is there any relevant test that could be updated to test it?
Sorry I missed this. The fix looks good. To get it merged:
Hi @cadedaniel, this issue is triggered while I am testing spec infer with OpenAI API serving.
For the test case, I'm thinking whether we could add a new interface in vllm/entrypoints/llm.py so that requests could be made async? Or we may need to implement something similar in the tests folder.
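For reference, here is a minimal sketch of how a test could drive a single async request directly through AsyncLLMEngine rather than adding a new interface to vllm/entrypoints/llm.py (this assumes the AsyncLLMEngine/AsyncEngineArgs API of this vLLM version; the model name and sampling settings are just placeholders):

import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def generate_one(prompt: str) -> str:
    # Placeholder model; a spec-decode test would also pass speculative args.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="JackFram/llama-68m"))
    params = SamplingParams(max_tokens=32)
    final = None
    # generate() yields incremental RequestOutputs; keep the last one.
    async for output in engine.generate(prompt, params, request_id="test-0"):
        final = output
    return final.outputs[0].text


if __name__ == "__main__":
    print(asyncio.run(generate_one("Hello, my name is")))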
Exactly. In the ideal case we can run an async llm entrypoint, but that might be a lot of work -- it's ok to have a hackier version in a conftest somewhere that allows us to get coverage of this codepath.
I tried adding an AsyncEngine-mode LLM implementation in conftest, but found it would not free the CUDA memory after deletion, so it breaks the tests that follow.
(force-pushed from 5873816 to 93fa26b)
@leiwen83 The CPU executor added async support recently; that needs to be fixed too.
Some comments regarding tests
tests/spec_decode/e2e/conftest.py (Outdated)
""" | ||
|
||
|
||
class AsyncLLM: |
Consider supporting GPU cleanup using the cleanup method in conftest.py?
I think you can make it a conftest and delete the instance & call cleanup() in the destructor, similar to
Line 275 in 468d761:
def __del__(self):
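For example, something along these lines (a rough sketch only; cleanup stands for the existing GPU-cleanup helper in the test conftest, its import path is assumed, and the engine-driving body is elided):

from tests.conftest import cleanup  # assumed location of the GPU-cleanup helper


class AsyncLLM:
    """Sketch of a conftest helper wrapping the async engine; only the
    cleanup lifecycle is shown."""

    def __init__(self, **engine_kwargs):
        self.engine_kwargs = engine_kwargs

    def generate(self, prompts, sampling_params):
        ...  # drive AsyncLLMEngine.generate() and collect the outputs

    def __del__(self):
        # Same pattern as the existing LLM fixture: reclaim GPU memory when
        # the helper is garbage-collected between tests.
        cleanup()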
with Manager() as manager:
    result = manager.dict()
    p = Process(target=_test_spec_decode_e2e_with_detokenization_async,
This is prone to leaking. Consider:
try:
    p = Process(target=_test_spec_decode_e2e_with_detokenization_async,
                args=(request, common_llm_kwargs,
                      per_test_common_llm_kwargs, test_llm_kwargs,
                      batch_size, seed, result))
    p.start()
    p.join()
finally:
    p.terminate()
Or I'd recommend using Ray (and calling ray.shutdown() in the finally):
import ray

try:
    ray.init()

    @ray.remote
    def run():
        ...  # call your thing

    ray.get(run.remote())
finally:
    ray.shutdown()
print(f"{actual_token_ids=}") | ||
assert actual_tokens.strip() == expected_tokens.strip() | ||
|
||
del llm |
Can you clean up as I suggested above ^?
@rkooo567's suggestions are good. Another way to do it is to look at the spec decode e2e tests for inspiration: vllm/tests/spec_decode/e2e/conftest.py, lines 27 to 52 in 7923dca.
Simply deleting the llm and cleaning up would not help reduce GPU memory usage here; I think it is because async mode creates a looping thread, which is not destroyed even when the llm itself is deleted. I will try the Ray method.
Thanks for the reminder; the CPU executor is also fixed now.
(force-pushed from c83d82d to 4a4efda)
@rkooo567 The test case is now reworked with the Ray method. @cadedaniel Async mode in the e2e case is now compatible with the original parameter-passing approach.
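Roughly, it relies on the same isolation trick sketched above: run the GPU work in a Ray worker so its CUDA context dies with the worker process. A self-contained illustration of the pattern (function names are illustrative, not the actual conftest code):

import ray
import torch


@ray.remote(num_gpus=1)
def _run_async_spec_decode_case():
    # Stand-in for the real async e2e body; any CUDA state created here is
    # torn down when the Ray worker process exits, keeping the driver clean.
    return torch.cuda.is_available()


def run_isolated():
    try:
        ray.init()
        return ray.get(_run_async_spec_decode_case.remote())
    finally:
        ray.shutdown()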
I think the current one is okay, but I have the impression we can just reuse engine_use_ray (lmk if this is wrong!)
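For reference, a hedged illustration of that idea: engine_use_ray moves the engine into a separate Ray worker, so the test process itself never owns the CUDA context (the model choice is a placeholder):

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# engine_use_ray=True runs the engine loop in a Ray worker process.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="JackFram/llama-68m", engine_use_ray=True))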
Code looks good -- thanks!
what's up with the CI failure?
(force-pushed from 8b9486e to 4f5c2c5)
No idea... I rebased the code, which retriggered the CI; it still seems to have a similar issue, which looks like the Ray env cannot allocate the required {'CPU': 1.0, 'GPU': 0.9} resources for
@cadedaniel I found the root cause of the test case failure in CI. It is because test_spec_decode_xfail_ray in tests/spec_decode/e2e/test_compatibility.py initializes the Ray cluster, which claims the GPU resource but never releases it, causing the later Ray call to fail to reclaim the GPU since there is only one GPU in the env.
OK, let's skip that test. We can follow up and fix it later. @pytest.mark.skip("Ray does not release GPU resources in this test")
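Applied, that would look roughly like this (sketch only; the real test's fixtures and parametrization are omitted):

import pytest


@pytest.mark.skip("Ray does not release GPU resources in this test")
def test_spec_decode_xfail_ray():
    ...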
oh that works too. thanks!
This may have introduced a failure in Dockerfile.cpu scenarios, cc @cadedaniel. I suppose the CPU path simply needs to be updated to accept the new parameter?
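Something like this, perhaps (a hypothetical sketch, not the actual vllm code; the parameter names are assumptions about what the executor interface passes through):

from typing import Any, Dict, List


class CPUExecutorAsyncSketch:
    """Hypothetical stand-in for the CPU executor's async entry point: it only
    shows that the method must now accept num_lookahead_slots, even if unused."""

    async def execute_model_async(
        self,
        seq_group_metadata_list: List[Any],
        blocks_to_swap_in: Dict[int, int],
        blocks_to_swap_out: Dict[int, int],
        blocks_to_copy: Dict[int, List[int]],
        num_lookahead_slots: int,
    ) -> List[Any]:
        # The CPU path does no speculative lookahead, so the value is ignored.
        return []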
@andysalerno do you have a repro script?
err, why is this not showing up in CI?
This fixes the following stacktrace: