-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(ai): update ai-video selection suspension #3033
fix(ai): update ai-video selection suspension #3033
Conversation
494b5d9
to
2504355
Compare
2504355
to
959ae10
Compare
187dcd4
to
d94d62b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel like I don't have context to officially approve this, but left some comments. Only nits tho, the implementation makes sense for the PR description.
server/ai_session.go
Outdated
// penalty needs to consider the current suspender count to set the penalty | ||
last_count, ok := pool.suspender.list[sess.Transcoder()] | ||
if ok { | ||
penalty = pool.suspender.count - last_count + pool.penalty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little lost with this suspension logic. So, I see that:
- the
pool.suspender.count
is increased every timesignalRefresh()
is called pool.penalty
is always set to3
last_count
is always set tosuspender.count + 3
So, that logic would mean that we're not taking the suspended orchestrator until 3 times the signalRefresh()
is called. Is this the idea of this suspension mechanism? That we don't allow the given O to get selected in the 3 refresh sessions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes thats right. A small penalty to take the orchestrator out of the working set for short period. With transcoding the refresh count starts at 0 for each new session. For AI the session pools are used across requests and are not reset. If we don't add the current suspender refresh count it wont suspend any orchestrators.
// if there are no orchestrators in the pools | ||
clog.Infof(ctx, "refreshing sessions, no orchestrators in pools") | ||
for i := 0; i < sel.penalty; i++ { | ||
sel.suspender.signalRefresh() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
release all orchestrators
shouldn't we then just remove them from the suspender.list()
rather than calling signalRefresh()
? My understanding is that if penalty = 3
, then we would need to call signalRefresh()
3 times in order to "release all orchestrators from suspension".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way thinking that there could be more than 3 orchestrators suspended so it would be less loops to just signalRefresh()
3 times. An alternative would be to just create a new suspender for the selector to clear it or kick all the orchs out of the suspended list (will require new function to do second option).
@leszko Are we still planning to merge these fixes? They help alleviate some of the orchestrator stickiness issues we've observed over the past few months. |
Well...if it helps Orchestrators, then yes. Why not? 🙃 Please address the PR review comments and re-request review :) |
6dae336
to
9eeacac
Compare
@rickstaa i have updated this PR. I also think adding something like this would be beneficial for Gateway operators to see the current AI session pool state. WYDT? ad-astra-video@9d42251 EDIT: we included the /getAISessionsPoolInfo endpoint on the gateway cli webserver to provide data about AI pools. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3033 +/- ##
===================================================
- Coverage 32.18835% 32.15960% -0.02875%
===================================================
Files 147 147
Lines 40670 40753 +83
===================================================
+ Hits 13091 13106 +15
- Misses 26807 26874 +67
- Partials 772 773 +1
Continue to review full report in Codecov by Sentry.
|
cfa53c5
to
80b7531
Compare
@rickstaa this is ready for review. Titan did some testing and the suspension was working as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added two comments. My suggestion is also to maybe split this PR to make it smaller, because I think it includes some unrelated changes.
- Wrt reserving capacity inside
CheckAICapacity()
, it looks counter-intuitive, but I may not have the full context. It raises a lot of questions in my head, like what if we reserved the capacity, but the request failed at some point. Won't O stop working? Maybe we need to have some timeout to release it? - Wrt: penalty mechanism, any chance it's possible to send this change in its own separate PR (or is it related to reserve capacity change?).
@ad-astra-video I agree with @leszko if we can have the suspense fix in a seperate pull request we can move quicker since that part is more battle tested and reviewd 🙏🏻. |
e478dd0
to
b97c3b8
Compare
b97c3b8
to
7636d8e
Compare
I have split the PR. If want split another way please tell me specifically which parts want in separate PRs. To me, all the remaining changes are needed to stablize orchestrator pools for batch jobs so I have left in this PR. I retested all batch pipelines after splitting PR. |
Where is the other PR? Could you share the link? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@rickstaa do you want to review or should I merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ad-astra-video, I made some small code improvements in ad-astra-video#32. Once those are merged, we're good to go. I thoroughly battle-tested it, trying to break it, but couldn't. 🚀
* refactor: some minor code changes * refactor: fix transient error naming * refactor: make isRetryableError case insensitive
What does this pull request do? Explain your changes. (required)
Suspension was not working because the
penalty
was always3
. This logic was a carryover from transcoding where the suspender always started at a refresh count of 0 because a new session manager was created with each stream. For AI, we are reusing the session manager and the suspender so the refresh count does not reset between requests. The fix to suspension is to consider the current refresh count when calculating the penalty so it is 3 more than the current refresh count in the suspender.There was also an issue where the
discoveryPoolSize
was always 100 and with limited orchestrators providing models a refresh of sessions was being done with every request. I added aninitialPoolSize
field to track the last refresh pool size to use with theshouldRefreshSessions
logic rather than 100. This stabilizes the suspender to allow more orchestrators to be tried with eachSelect
call.Last update was moving the
signalRefresh()
for the suspender that increments the refresh counter in the suspender to theRefresh
function makes it more stable that every time we refresh sessions we add to the suspender refresh countHappy to segregate some of these changes to separate PRs. The suspension fixes can be added separately without dependency on ai-worker PR.
Specific updates (required)
How did you test each of these updates (required)
I have been running these updates on my gateway. Tested 1-200 requests with 5-10 workers sending to gateway. All completed with 1-2 orchestrators providing Bytedance model.
Does this pull request close any open issues?
Checklist:
make
runs successfully./test.sh
pass