Crashing revision does not scale to 0 until ProgressDeadline is reached #14656
It can be tested using the file here:
A possible solution can be found here (WIP): https://github.com/knative/serving/pull/14607/files
Some details to consider:
Should the latest revision scale to 0 if it is crashing? If the revision is not the latest (its routingState is reserve), should it scale to 0, assuming it is not in the traffic config? I think that if the route is active the revision should be allowed to continue, in case it is able to stabilize, and it should not scale to 0 if the target is receiving traffic. This is a strong constraint because it affects availability, and there are a few safeguards to verify it; one of them is the metrics: with no metrics you are not able to verify that the revision is not receiving traffic. Currently this is the price to pay for availability. So the goal would be to give the PA a way to scale to 0 with no metrics, but only if it is in a crashing loop and not behind an active route.
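A minimal sketch of that proposed rule (a hypothetical helper for illustration only, not the actual autoscaler code; the linked PR is the real work in progress):

```go
package main

import "fmt"

// revisionSignals bundles the inputs discussed above; field names are illustrative.
type revisionSignals struct {
	hasMetrics   bool // the scraper has produced metrics for this revision
	crashLooping bool // the revision's pods are in a crash loop
	routeActive  bool // an active route still targets this revision
}

// canScaleToZeroWithoutMetrics captures the proposal: without metrics we cannot
// prove the revision is idle, so only allow scale-to-zero for a crash-looping
// revision that no active route points at.
func canScaleToZeroWithoutMetrics(s revisionSignals) bool {
	if s.hasMetrics {
		return false // the normal metrics-based scale-to-zero path applies instead
	}
	return s.crashLooping && !s.routeActive
}

func main() {
	fmt.Println(canScaleToZeroWithoutMetrics(revisionSignals{crashLooping: true}))                    // true
	fmt.Println(canScaleToZeroWithoutMetrics(revisionSignals{crashLooping: true, routeActive: true})) // false
}
```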
Something to explore is what's happening with the Knative resources and controllers - since if our labeler marked the older revision as reserve then it would scale down accordingly. I think this is an interesting edge case because the Knative Service has the following traffic block:

```yaml
traffic:
- latestRevision: true
  percent: 100
```

But the Configuration doesn't have a ready revision, so it's empty - showing the latest configuration (see serving/pkg/apis/serving/v1/configuration_types.go, lines 85 to 88 at f939498).
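For illustration only (names and values here are hypothetical), the Configuration status in that state reports a latest created revision but no ready one:

```yaml
status:
  latestCreatedRevisionName: my-service-00001
  # latestReadyRevisionName is unset because no revision has ever become Ready
```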
The revisions are initialized as Pending and then set to Active (probably because the config has the route label); see serving/pkg/reconciler/configuration/resources/revision.go, lines 44 to 48 at f939498.
The routing state for the first revision should become 'reserve', so this might be better fixed not by changing the autoscaler but by fixing the labeler: https://github.com/knative/serving/tree/main/pkg/reconciler/labeler
Added this to the current milestone since I know you're actively working on it 💪
I think the other thing to note is that typically we want the revision to scale up to 1 (initial-scale) to ensure it's working. And in case you're not aware, we have a knob that lets people adjust this progress deadline - #12743
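If I read that knob correctly (hedging here: the annotation key below reflects my understanding of #12743, and the names and values are illustrative), it can be set per revision like this:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service                         # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Fail the rollout sooner than the cluster default if the revision
        # cannot become Ready within this window.
        serving.knative.dev/progress-deadline: "120s"
    spec:
      containers:
      - image: example.registry/app:latest  # illustrative image
```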
Hey @dprotaso, |
/assign @andrew-delph
Coming back to this with fresh eyes - repeating the original problem statement
I'm wondering now why not just encourage users to lower the progress deadline? For new revisions rolling out I think we want to honour the progressDeadline. If we scaled revision-01 to zero sooner just because a newer revision (revision-02) rolled out, then we'd lose some knowledge of whether revision-01 is a safe revision to roll back to.
I see two different scenarios here:
A) Wait for progressDeadline
B) Kill once the new revision is healthy
I'm playing a bit of devil's advocate here and speaking without full consideration of current implementation details. Now speaking on the feature to block rollbacks to revisions which were never healthy: this is good, because I am capable of shooting myself in the foot.
For B - I think it's hard to know if the failures are transient or permanent. E.g. even if a pod is pending because of cluster limits, automation could kick in and add more capacity, and then the revision would be marked Ready=True after a few minutes.
So IBM added a feature. Thus, I'm not sure what else we should do here. One extreme is available - don't scale the revision when rolling out. Then we also have knobs (progressDeadline) so things fail sooner. The options are there and users have the ability to control it per Knative Service.
You'd use the
Description
Revisions trapped in a continuous crash loop fail to scale down, leading to the depletion of cluster resources. Crashing revisions only scale down once the progress deadline is reached.
To Reproduce
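The reporter's original test file is not reproduced here; as a hedged stand-in, any Service whose container exits immediately (names and image below are hypothetical) would produce the crash loop described above:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: crashing-service          # illustrative name
spec:
  template:
    spec:
      containers:
      - image: busybox            # illustrative image
        # Exiting immediately keeps the revision from ever becoming Ready,
        # so its pods crash-loop until the progress deadline expires.
        command: ["sh", "-c", "exit 1"]
```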
Expected Behavior