[core][misc] improve free_finished_seq_groups #6865
@mzusman I actually don't get the current (before this PR) logic here.
If I understand correctly, the order of execution is `schedule` -> `get_and_reset_finished_requests_ids` -> model execution -> `free_finished_seq_groups`. Hence, the requests freed from `self.swapped` or `self.waiting` were not actually added to `finished_requests_ids` before the model execution. This means `finished_requests_ids` holds stale information lagged by 1 step. Is my understanding correct?
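The one-step lag described above can be sketched with a toy scheduler. This is a minimal illustration, not the vLLM code: `TinyScheduler` and `step` are hypothetical names, and only the call order (`schedule` -> `get_and_reset_finished_requests_ids` -> model execution -> `free_finished_seq_groups`) is taken from the discussion. A request freed after model execution in step N is only reported in step N+1.

```python
from collections import deque

class TinyScheduler:
    """Toy stand-in for the scheduler; tracks finished request ids."""

    def __init__(self):
        self.running = deque()
        self._finished_requests_ids = []

    def add(self, request_id):
        self.running.append({"id": request_id, "finished": False})

    def schedule(self):
        # Would build the batch for this step; elided in this sketch.
        pass

    def get_and_reset_finished_requests_ids(self):
        ids, self._finished_requests_ids = self._finished_requests_ids, []
        return ids

    def free_finished_seq_groups(self):
        # Runs AFTER model execution, so ids collected here are only
        # visible to the NEXT step's get_and_reset_finished_requests_ids().
        for sg in list(self.running):
            if sg["finished"]:
                self._finished_requests_ids.append(sg["id"])
                self.running.remove(sg)

def step(sched, finish_ids=()):
    sched.schedule()
    reported = sched.get_and_reset_finished_requests_ids()  # read BEFORE freeing
    # "Model execution" happens here; afterwards some requests finish.
    for sg in sched.running:
        if sg["id"] in finish_ids:
            sg["finished"] = True
    sched.free_finished_seq_groups()
    return reported

sched = TinyScheduler()
sched.add("req-0")
print(step(sched, finish_ids={"req-0"}))  # [] -- req-0 not yet reported
print(step(sched))                        # ['req-0'] -- reported one step late
```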
Ok, to my understanding, `finished_requests_ids` doesn't have to include the requests rejected in `self.waiting`, since they (`ignored_seq_groups`) are never passed into the model runner. `finished_requests_ids` must include the requests rejected in `self.swapped` in the current step, which means the current code (before this PR) has a bug. However, the bug has not been noticed because this case is very rare.
Yeah, that's correct, `finished_requests_ids` holds information that is lagged by 1 step; finished requests will always be released upon the next step.
AFAIU, `self.waiting` also includes preempted requests that got rescheduled. Preempted requests did previously pass into the model runner and are already registered in the Mamba cache. If those requests get aborted, then we do add them to `finished_requests_ids` and release them through here, though.
BTW, just to be sure: by "requests rejected", do you mean aborted requests?
So actually, yeah, checking for finished requests in `self.waiting` and `self.swapped` in `free_finished_seq_groups` is not necessary.
@mzusman Thanks for the explanation! However, it's a bit unclear to me how it handles preemption, both recomputation and swapping. Could you elaborate more on that?
@WoosukKwon Sure! Upon recomputation we actually keep the Mamba cache as it is and do not evict it (since the Mamba cache is quite small); the request id persists during preemption, so we can still use its corresponding cache upon recomputation.
RE swapping - we do not handle swapping at the moment, as it occurs fairly rarely and it's quite complicated to implement for the Mamba cache atm.
Therefore there's no reason to run through `self.swapped` in search of finished requests.
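The behavior described above can be sketched as a request-id-keyed cache. This is a hypothetical illustration of the policy, not vLLM's Mamba cache implementation: `RequestKeyedCache` and its methods are invented names. Preemption (recomputation) leaves the entry in place so the rescheduled request reuses it; entries are freed only when their id is reported as finished.

```python
class RequestKeyedCache:
    """Toy cache keyed by request id, mimicking the described Mamba-cache policy."""

    def __init__(self):
        self._cache = {}  # request_id -> state (placeholder for real Mamba state)

    def get_or_create(self, request_id):
        return self._cache.setdefault(request_id, {"state": f"state-for-{request_id}"})

    def preempt(self, request_id):
        # Recomputation-style preemption: intentionally keep the cached state,
        # since the request id persists and the state is reused on reschedule.
        pass

    def release_finished(self, finished_requests_ids):
        # Free entries only for requests reported as finished (or aborted).
        for rid in finished_requests_ids:
            self._cache.pop(rid, None)

cache = RequestKeyedCache()
first = cache.get_or_create("req-0")
cache.preempt("req-0")                      # entry survives preemption
print(cache.get_or_create("req-0") is first)  # True -- same state reused
cache.release_finished(["req-0"])           # freed only once reported finished
print("req-0" in cache._cache)              # False
```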
@mzusman Thanks for the detailed explanation! It is super helpful for understanding the implementation.
I think we should revisit this design decision in the near future. Currently, it's a bit confusing for code readers.
@youkaichao how about wrapping this in
to avoid rebuilding the deque every time, since requests will finish relatively rarely.
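One way to realize this suggestion (an assumption about the elided snippet, not the merged code) is to guard the rebuild: only reconstruct the running deque when at least one seq group has actually finished, so the common no-finish step avoids copying the deque.

```python
from collections import deque

def free_finished_seq_groups(running: deque) -> deque:
    """Return `running` unchanged on the fast path; rebuild only if needed."""
    if not any(sg["finished"] for sg in running):
        return running  # fast path: nothing finished, keep the deque as-is
    # Slow path (rare): rebuild the deque without the finished seq groups.
    return deque(sg for sg in running if not sg["finished"])

running = deque([{"id": 0, "finished": False}, {"id": 1, "finished": True}])
running = free_finished_seq_groups(running)
print([sg["id"] for sg in running])  # [0]
```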
I plan to refactor this into a dictionary in the future, so that we can easily delete requests.
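The planned refactor could look like the sketch below (an assumption about the future change, not existing code): keep running seq groups in a dict keyed by request id. Python dicts preserve insertion order, so iteration order matches the old deque, and each finished request is deleted in O(1) without rebuilding a container.

```python
# Hypothetical dict-keyed store for running seq groups.
running: dict[str, dict] = {}

def add(request_id: str) -> None:
    running[request_id] = {"id": request_id, "finished": False}

def free_finished() -> list[str]:
    """Delete finished seq groups in O(1) each and return their ids."""
    finished = [rid for rid, sg in running.items() if sg["finished"]]
    for rid in finished:
        del running[rid]  # no deque rebuild needed
    return finished

add("req-0")
add("req-1")
running["req-0"]["finished"] = True
print(free_finished())  # ['req-0']
print(list(running))    # ['req-1'] -- insertion order preserved
```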