fix(ai): update ai-video selection suspension #3033

ad-astra-video · 2024-04-28T14:22:56Z

What does this pull request do? Explain your changes. (required)

Suspension was not working because the penalty was always 3. This logic was a carryover from transcoding where the suspender always started at a refresh count of 0 because a new session manager was created with each stream. For AI, we are reusing the session manager and the suspender so the refresh count does not reset between requests. The fix to suspension is to consider the current refresh count when calculating the penalty so it is 3 more than the current refresh count in the suspender.
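To illustrate why a fixed penalty of 3 stops working once the refresh count has grown, here is a minimal sketch. The `suspender` type below is a simplified, hypothetical stand-in rather than the gateway's actual implementation, and `adjustedPenalty` is an illustrative helper name. The sketch assumes the suspender stores, per orchestrator, the refresh count at which the suspension ends.

```go
package main

import "fmt"

// suspender is a simplified, hypothetical model of the gateway's suspender.
// list maps an orchestrator to the refresh count at which its suspension ends.
type suspender struct {
	list  map[string]int
	count int // refreshes signalled so far; never reset when the selector is reused
}

func newSuspender() *suspender { return &suspender{list: make(map[string]int)} }

func (s *suspender) signalRefresh() { s.count++ }

// suspend pushes the orchestrator's release point out by penalty refreshes.
func (s *suspender) suspend(orch string, penalty int) { s.list[orch] += penalty }

// suspended returns how many more refreshes the orchestrator must sit out.
func (s *suspender) suspended(orch string) int {
	remaining := s.list[orch] - s.count
	if remaining <= 0 {
		delete(s.list, orch)
		return 0
	}
	return remaining
}

// adjustedPenalty (hypothetical helper) offsets the base penalty by the current
// refresh count, so the suspension always ends base refreshes from now.
func adjustedPenalty(s *suspender, orch string, base int) int {
	return s.count - s.list[orch] + base
}

func main() {
	s := newSuspender()
	for i := 0; i < 10; i++ {
		s.signalRefresh() // a long-lived AI selector has already refreshed 10 times
	}

	s.suspend("orch-a", 3) // fixed penalty: release point 3 is already in the past
	fmt.Println("fixed penalty, remaining:", s.suspended("orch-a"))

	s.suspend("orch-b", adjustedPenalty(s, "orch-b", 3)) // release point is 10 + 3
	fmt.Println("adjusted penalty, remaining:", s.suspended("orch-b"))
}
```

With the counter at 10, the fixed penalty leaves nothing remaining (0), while the adjusted penalty leaves 3 refreshes of suspension.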

There was also an issue where discoveryPoolSize was always 100, so with only a few orchestrators providing a model, sessions were refreshed on every request. I added an initialPoolSize field that tracks the pool size from the last refresh and is used in the shouldRefreshSessions logic instead of 100. This stabilizes the suspender and allows more orchestrators to be tried with each Select call.
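The effect of the reference pool size can be sketched as follows. The half-of-reference threshold and the `shouldRefresh` helper are assumptions made for illustration; the actual shouldRefreshSessions rule may differ.

```go
package main

import "fmt"

// shouldRefresh reports whether the pools look depleted relative to a
// reference pool size. The "fewer than half of the reference" threshold is an
// assumption for illustration, not the gateway's actual rule.
func shouldRefresh(available, referenceSize int) bool {
	threshold := (referenceSize + 1) / 2 // integer ceil(referenceSize / 2)
	return available < threshold
}

func main() {
	const discoveryPoolSize = 100 // configured maximum, rarely reached for AI models
	initialPoolSize := 2          // orchestrators actually found on the last refresh

	for available := 2; available >= 0; available-- {
		fmt.Printf("available=%d fixed=%v tracked=%v\n",
			available,
			shouldRefresh(available, discoveryPoolSize),
			shouldRefresh(available, initialPoolSize))
	}
}
```

Under this assumed threshold, comparing against 100 asks for a refresh even when both orchestrators are available, while comparing against the tracked size of 2 asks for a refresh only once the pools are empty.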

The last update moves the signalRefresh() call, which increments the suspender's refresh counter, into the Refresh function. This is more stable because the suspender's refresh count now increases every time we refresh sessions.

Happy to split some of these changes into separate PRs. The suspension fixes can be added separately, with no dependency on the ai-worker PR.

Specific updates (required)

Updates the selector to use the suspender's current refresh count when suspending.
Moves penalty to the AISessionSelector so it is easier to update and is available when calculating the suspension needed.
Releases all Os when there are none in the warm and cold pools.
Adds an option to not use managed containers.

How did you test each of these updates (required)

I have been running these updates on my gateway. Tested 1-200 requests with 5-10 workers sending to gateway. All completed with 1-2 orchestrators providing Bytedance model.

Does this pull request close any open issues?

Checklist:

Read the contribution guide
make runs successfully
All tests in ./test.sh pass
README and other documentation updated
Pending changelog updated

victorges

Feel like I don't have context to officially approve this, but left some comments. Only nits tho, the implementation makes sense for the PR description.

server/ai_session.go

leszko · 2024-09-23T11:22:40Z

server/ai_session.go

+	// penalty needs to consider the current suspender count to set the penalty
+	last_count, ok := pool.suspender.list[sess.Transcoder()]
+	if ok {
+		penalty = pool.suspender.count - last_count + pool.penalty


I'm a little lost with this suspension logic. So, I see that:

the pool.suspender.count is increased every time signalRefresh() is called

pool.penalty is always set to 3

last_count is always set to suspender.count + 3

So this logic means we won't select the suspended orchestrator until signalRefresh() has been called 3 times. Is that the idea of this suspension mechanism: that the given O can't be selected during the next 3 session refreshes?

ad-astra-video · 2025-01-14

Yes, that's right. It is a small penalty that takes the orchestrator out of the working set for a short period. With transcoding, the refresh count starts at 0 for each new session. For AI, the session pools are used across requests and are not reset. If we don't add the current suspender refresh count, it won't suspend any orchestrators.

leszko · 2024-09-23T11:25:10Z

server/ai_session.go

+			// if there are no orchestrators in the pools
+			clog.Infof(ctx, "refreshing sessions, no orchestrators in pools")
+			for i := 0; i < sel.penalty; i++ {
+				sel.suspender.signalRefresh()


release all orchestrators

shouldn't we then just remove them from the suspender.list() rather than calling signalRefresh()? My understanding is that if penalty = 3, then we would need to call signalRefresh() 3 times in order to "release all orchestrators from suspension".

ad-astra-video · 2025-01-14

I did it this way thinking that there could be more than 3 orchestrators suspended, so it would take fewer loops to just call signalRefresh() 3 times. An alternative would be to create a new suspender for the selector to clear it, or to kick all the orchs out of the suspended list (the second option would require a new function).
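The two release strategies discussed in this thread can be compared with a small sketch. The `suspender` type is a simplified, hypothetical model, and `releaseByClearing` stands in for the "new function" mentioned above; neither is taken from the actual code.

```go
package main

import "fmt"

// suspender is a simplified, hypothetical model: list maps an orchestrator to
// the refresh count at which its suspension ends.
type suspender struct {
	list  map[string]int
	count int
}

func (s *suspender) signalRefresh() { s.count++ }

func (s *suspender) suspended(orch string) bool { return s.list[orch] > s.count }

// releaseBySignalling advances the refresh counter penalty times. It costs
// penalty iterations regardless of how many orchestrators are suspended, but
// only frees entries whose release point is at most penalty refreshes away.
func releaseBySignalling(s *suspender, penalty int) {
	for i := 0; i < penalty; i++ {
		s.signalRefresh()
	}
}

// releaseByClearing (hypothetical helper) drops every entry, costing one
// iteration per suspended orchestrator and leaving the counter untouched.
func releaseByClearing(s *suspender) {
	for orch := range s.list {
		delete(s.list, orch)
	}
}

func main() {
	a := &suspender{count: 5, list: map[string]int{"o1": 8, "o2": 7, "o3": 6, "o4": 8}}
	releaseBySignalling(a, 3)
	fmt.Println("signalling:", a.suspended("o1"), a.suspended("o4"), "count", a.count)

	b := &suspender{count: 5, list: map[string]int{"o1": 8, "o2": 7, "o3": 6, "o4": 8}}
	releaseByClearing(b)
	fmt.Println("clearing:", b.suspended("o1"), b.suspended("o4"), "count", b.count)
}
```

Both approaches leave no orchestrator suspended here. Signalling relies on every suspension ending within the penalty window, which holds as long as the release point is always set to the current count plus the penalty.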

rickstaa · 2024-11-13T21:47:24Z

@leszko Are we still planning to merge these fixes? They help alleviate some of the orchestrator stickiness issues we've observed over the past few months.

leszko · 2024-11-14T08:14:40Z

> @leszko Are we still planning to merge these fixes? They help alleviate some of the orchestrator stickiness issues we've observed over the past few months.

Well...if it helps Orchestrators, then yes. Why not? 🙃 Please address the PR review comments and re-request review :)

ad-astra-video · 2025-01-14T19:17:54Z

@rickstaa I have updated this PR.

I also think adding something like this would be beneficial for Gateway operators to see the current AI session pool state. WDYT? ad-astra-video@9d42251

EDIT: we included the /getAISessionsPoolInfo endpoint on the gateway CLI webserver to provide data about AI pools.
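A pool-info handler of this kind could look roughly like the sketch below. Only the /getAISessionsPoolInfo path comes from this thread; the `poolInfo` fields, the handler wiring, and the "text-to-image" key are hypothetical and the real endpoint's response shape may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// poolInfo is a hypothetical response shape for one AI session pool.
type poolInfo struct {
	Warm      int            `json:"warm"`      // sessions in the warm pool
	Cold      int            `json:"cold"`      // sessions in the cold pool
	Suspended map[string]int `json:"suspended"` // orchestrator -> remaining penalty
}

// poolInfoHandler serves a JSON snapshot of the pools, keyed by an
// illustrative pool name.
func poolInfoHandler(snapshot func() map[string]poolInfo) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(snapshot()); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	}
}

// render calls the handler in-process and returns the response body.
func render(info map[string]poolInfo) string {
	h := poolInfoHandler(func() map[string]poolInfo { return info })
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/getAISessionsPoolInfo", nil))
	return rec.Body.String()
}

func main() {
	fmt.Print(render(map[string]poolInfo{
		"text-to-image": {Warm: 1, Cold: 1, Suspended: map[string]int{"orch-a": 3}},
	}))
}
```

Taking the snapshot through a function keeps the handler independent of how the selector stores its pools, which makes it easy to exercise in isolation.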

codecov · 2025-01-14T19:20:10Z

Codecov Report

Attention: Patch coverage is 15.62500% with 81 lines in your changes missing coverage. Please review.

Project coverage is 32.15960%. Comparing base (390af43) to head (af7434f).
Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/handlers.go | 2.27273% | 42 Missing and 1 partial ⚠️ |
| server/ai_session.go | 0.00000% | 35 Missing ⚠️ |
| server/ai_process.go | 81.25000% | 3 Missing ⚠️ |

Additional details and impacted files

@@                 Coverage Diff                 @@
##              master       #3033         +/-   ##
===================================================
- Coverage   32.18835%   32.15960%   -0.02875%     
===================================================
  Files            147         147                 
  Lines          40670       40753         +83     
===================================================
+ Hits           13091       13106         +15     
- Misses         26807       26874         +67     
- Partials         772         773          +1

| Files with missing lines | Coverage Δ |
|---|---|
| server/webserver.go | `95.87629% <100.00000%> (+0.04296%)` ⬆️ |
| server/ai_process.go | `1.66945% <81.25000%> (+1.07723%)` ⬆️ |
| server/ai_session.go | `2.33333% <0.00000%> (-0.18466%)` ⬇️ |
| server/handlers.go | `52.19092% <2.27273%> (-1.77991%)` ⬇️ |

ad-astra-video · 2025-02-04T07:53:03Z

@rickstaa this is ready for review. Titan did some testing and the suspension was working as expected.

pool_info_1000.json

pool_info_1_orch.json

leszko

Added two comments. My suggestion is also to maybe split this PR to make it smaller, because I think it includes some unrelated changes.

Wrt reserving capacity inside CheckAICapacity(), it looks counter-intuitive, but I may not have the full context. It raises a lot of questions in my head, like what if we reserved the capacity, but the request failed at some point. Won't O stop working? Maybe we need to have some timeout to release it?
Wrt: penalty mechanism, any chance it's possible to send this change in its own separate PR (or is it related to reserve capacity change?).
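The concern about reserved capacity never being returned when a request fails can be illustrated with a sketch. The `capacity` type and `runJob` helper are hypothetical and are not how CheckAICapacity() works; the sketch only covers the case where reservation and processing happen on the same call path, so a reservation made for a request that never arrives would still need something like a timeout.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errNoCapacity = errors.New("insufficient capacity")
	errJobFailed  = errors.New("job failed") // illustrative failure
)

// capacity is a hypothetical counter of free worker slots for one model.
type capacity struct {
	mu   sync.Mutex
	free int
}

// reserve takes one slot and reports whether a slot was available.
func (c *capacity) reserve() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.free == 0 {
		return false
	}
	c.free--
	return true
}

// release returns one slot.
func (c *capacity) release() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.free++
}

// available returns the number of free slots.
func (c *capacity) available() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.free
}

// runJob reserves a slot, runs the job, and always returns the slot, so a
// failed job cannot leave the capacity permanently reserved.
func runJob(c *capacity, job func() error) error {
	if !c.reserve() {
		return errNoCapacity
	}
	defer c.release()
	return job()
}

func main() {
	c := &capacity{free: 1}
	err := runJob(c, func() error { return errJobFailed })
	fmt.Println("error:", err, "free slots after failure:", c.available())
}
```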

core/ai_orchestrator.go

rickstaa · 2025-02-05T13:14:24Z

> Added two comments. My suggestion is also to maybe split this PR to make it smaller, because I think it includes some unrelated changes.
>
> Wrt reserving capacity inside CheckAICapacity(), it looks counter-intuitive, but I may not have the full context. It raises a lot of questions in my head, like what if we reserved the capacity, but the request failed at some point. Won't O stop working? Maybe we need to have some timeout to release it?
>
> Wrt: penalty mechanism, any chance it's possible to send this change in its own separate PR (or is it related to reserve capacity change?).

@ad-astra-video I agree with @leszko. If we can have the suspension fix in a separate pull request, we can move quicker since that part is more battle-tested and reviewed 🙏🏻.

ad-astra-video · 2025-02-05T16:16:30Z

I have split the PR. If you want it split another way, please tell me specifically which parts you want in separate PRs. To me, all the remaining changes are needed to stabilize orchestrator pools for batch jobs, so I have left them in this PR.

I retested all batch pipelines after splitting PR.

leszko · 2025-02-06T12:38:58Z

> I have split the PR. If you want it split another way, please tell me specifically which parts you want in separate PRs. To me, all the remaining changes are needed to stabilize orchestrator pools for batch jobs, so I have left them in this PR.
>
> I retested all batch pipelines after splitting PR.

Where is the other PR? Could you share the link?

leszko

LGTM

leszko · 2025-02-06T13:23:49Z

@rickstaa do you want to review or should I merge?

server/ai_http.go

rickstaa

@ad-astra-video, I made some small code improvements in ad-astra-video#32. Once those are merged, we're good to go. I thoroughly battle-tested it, trying to break it, but couldn't. 🚀

server/ai_session.go

github-actions bot added the AI label (Issues and PR related to the AI-video branch.) Apr 28, 2024

This was referenced Apr 30, 2024

Allow Gateways to Specify the Selection retry Timeout #3037

Closed

Fix external containers livepeer/ai-runner#72

Closed

ad-astra-video force-pushed the ai-video-fix-selection-pr branch 2 times, most recently from 494b5d9 to 2504355 Compare May 7, 2024 11:24

ad-astra-video force-pushed the ai-video-fix-selection-pr branch from 2504355 to 959ae10 Compare July 20, 2024 10:54

ad-astra-video marked this pull request as ready for review July 22, 2024 12:19

ad-astra-video requested a review from rickstaa as a code owner July 22, 2024 12:19

ad-astra-video force-pushed the ai-video-fix-selection-pr branch from 187dcd4 to d94d62b Compare July 22, 2024 12:23

ad-astra-video changed the title ~~Ai video fix selection pr~~ fix(ai): update ai-video selection suspension Aug 27, 2024

rickstaa mentioned this pull request Aug 29, 2024

Call-01 Agenda - 2024-08-29 livepeer/project-management#73

Open

victorges reviewed Sep 18, 2024

server/ai_session.go (outdated, resolved)

server/ai_session.go (outdated, resolved)

server/ai_session.go (resolved)

leszko reviewed Sep 23, 2024

rickstaa force-pushed the ai-video branch from 4a66b22 to 2c50134 Compare October 21, 2024 09:13

leszko deleted the branch livepeer:master November 7, 2024 08:26

leszko closed this Nov 7, 2024

rickstaa reopened this Nov 13, 2024

rickstaa changed the base branch from ai-video to master November 13, 2024 21:47

ad-astra-video force-pushed the ai-video-fix-selection-pr branch from 6dae336 to 9eeacac Compare January 14, 2025 18:55

ad-astra-video requested a review from leszko January 29, 2025 05:50

ad-astra-video force-pushed the ai-video-fix-selection-pr branch 2 times, most recently from cfa53c5 to 80b7531 Compare February 2, 2025 18:12

ad-astra-video mentioned this pull request Feb 2, 2025

Ai video cli endpoint for pool info #3036

Closed

leszko reviewed Feb 4, 2025

core/ai_orchestrator.go (outdated, resolved)

core/ai_orchestrator.go (outdated, resolved)

ad-astra-video force-pushed the ai-video-fix-selection-pr branch from e478dd0 to b97c3b8 Compare February 5, 2025 16:05

ad-astra-video and others added 10 commits February 5, 2025 10:10

move signalRefresh() to Refresh

894e55c

add log line for session selected

b914210

fix suspension

0586602

fix penalty def and comment

d42df5e

fix variable naming to camelCase

79435d4

do not suspend orchestrators for certain errors

c484417

add cli webserver handler to get AI pools info

c553737

do not add suspended orchestrators to pool

45d0c70

fix typo

acda01a

fix insufficient capacity error text to not suspend orchestrators

7636d8e

ad-astra-video force-pushed the ai-video-fix-selection-pr branch from b97c3b8 to 7636d8e Compare February 5, 2025 16:11

leszko approved these changes Feb 6, 2025

rickstaa reviewed Feb 10, 2025

server/ai_http.go (outdated, resolved)

rickstaa approved these changes Feb 12, 2025

rickstaa reviewed Feb 12, 2025

server/ai_session.go (outdated, resolved)

rickstaa and others added 3 commits February 12, 2025 19:37

refactor: some minor code changes (#32)

8091750

* refactor: some minor code changes * refactor: fix transient error naming * refactor: make isRetryableError case insensitive

Merge branch 'master' into ai-video-fix-selection-pr

77b3923

Update ai_session.go

af7434f
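The commits above mention not suspending orchestrators for certain errors and making isRetryableError case insensitive. A minimal sketch of such a check follows. Only the function name and the "insufficient capacity" text come from this PR; the substring list, the `matchesRetryable` helper, and the matching strategy are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// retryableSubstrings is an illustrative list of error fragments that should
// lead to a retry on another orchestrator rather than a suspension.
var retryableSubstrings = []string{"insufficient capacity"}

// matchesRetryable (hypothetical helper) compares lowercased text, so the
// match does not depend on how the orchestrator capitalizes its message.
func matchesRetryable(msg string) bool {
	msg = strings.ToLower(msg)
	for _, sub := range retryableSubstrings {
		if strings.Contains(msg, strings.ToLower(sub)) {
			return true
		}
	}
	return false
}

// isRetryableError reports whether err looks transient; nil is never retryable.
func isRetryableError(err error) bool {
	if err == nil {
		return false
	}
	return matchesRetryable(err.Error())
}

func main() {
	fmt.Println(isRetryableError(errors.New("Insufficient Capacity for model")))
	fmt.Println(isRetryableError(errors.New("invalid ticket")))
}
```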

ad-astra-video merged commit 1c1c280 into livepeer:master Feb 13, 2025
16 of 17 checks passed

ad-astra-video deleted the ai-video-fix-selection-pr branch February 13, 2025 03:14

leszko added a commit that referenced this pull request Feb 14, 2025

Revert "fix(ai): update ai-video selection suspension (#3033)"

a47b02b

This reverts commit 1c1c280.

leszko added a commit that referenced this pull request Feb 14, 2025

Revert "fix(ai): update ai-video selection suspension (#3033)" (#3392)

33e1bf8

This reverts commit 1c1c280.

ad-astra-video restored the ai-video-fix-selection-pr branch February 14, 2025 16:19

ad-astra-video mentioned this pull request Feb 14, 2025

fix(ai): fix orchestrator suspension for AI jobs #3393

Merged
