Crashing revision does not scale to 0 until ProgressDeadline is reached #14656

Closed
andrew-delph opened this issue Nov 22, 2023 · 10 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

andrew-delph commented Nov 22, 2023

Description

Revisions trapped in a continuous crash loop fail to scale down, depleting cluster resources. A crashing revision only scales down once the progress deadline is reached.

To Reproduce

  1. Create a service which crashes on startup (rev-00001)
  2. Update the service so that it reaches a stable state (rev-00002)
  3. rev-00001 will enter a CrashLoopBackOff loop
  4. See that rev-00001 will not scale down until progressDeadline is reached

Expected Behavior

  1. rev-00001 should scale to 0 soon after rev-00002 is created.
  2. Scale-down behavior should be similar to that of a stable revision.
@andrew-delph added the kind/bug label Nov 22, 2023

andrew-delph commented Nov 22, 2023

It can be reproduced using the files in this gist (a rough sketch of main.go is shown below the steps):
https://gist.github.com/andrew-delph/cb77ddb6fce475433c9754227b61aa8c

  1. ko apply -f hello.yaml
  2. Remove the os.Exit(0) call from main.go
  3. ko apply -f hello.yaml
  4. See that the first revision does not scale to 0
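
A minimal sketch of that main.go (modeled on the gist above; the actual file may differ):

    package main

    import (
        "fmt"
        "net/http"
        "os"
    )

    func main() {
        // Exit immediately on startup so the pod ends up in CrashLoopBackOff.
        // Removing this line (step 2 above) produces a healthy revision.
        os.Exit(0)

        // A healthy revision would serve this handler instead.
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "Hello!")
        })
        http.ListenAndServe(":8080", nil)
    }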

@andrew-delph (Author)

A possible solution (WIP) can be found here: https://github.com/knative/serving/pull/14607/files


andrew-delph commented Nov 22, 2023

Some details to consider.

  • The decider will generate a desiredScale of -1 if it cannot generate metrics
  • Metrics are generated by probing the revision
  • A crashing revision will stay in the Unknown state, and Unknown will not scale to 0 (ref)
  • An Inactive revision with no metrics will also not scale to 0 (ref)
  • A Kubernetes Deployment will not publish an event for a CrashLoopBackOff, but the Pod will
  • The PA does not know whether it is in an active state, but the revision does

Should the latest revision scale to 0 if it is crashing?
A healthy revision is able to do so if it is configured correctly and has no traffic.
Scaling an unhealthy revision to 0 would give it no chance to become healthy (e.g. when the db connection comes back online).

If the revision is not the latest (routingState is reserve) and is not in the traffic config, should it scale to 0?
A healthy revision will scale to 0 once traffic stops.
A crashing revision will wait for the timeout.

So, I think that if the route is active, the revision should be allowed to continue, in case it is able to stabilize.
A crashing revision whose routing state has moved on already serves no purpose.
Scaling it to 0 would also remove its chance to come back online, though, and a future traffic configuration including it would be rejected.

It should not scale to 0 if the target is receiving traffic. This is a strong constraint because it affects availability, and there are a few safeguards to verify it; one of them is the metrics. With no metrics, you are not able to verify that the revision is not receiving traffic. Currently this is the price to pay for availability.

So, the goal would be to give the PA a way to scale to 0 with no metrics, but only if it is in a crash loop and not part of an active route.
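
To make the idea concrete, here is a rough sketch of that decision (purely illustrative; the names below are hypothetical, not existing Knative functions):

    // Hypothetical sketch only, not actual Knative code.
    package sketch

    // desiredScaleWithoutMetrics mirrors the behavior described above:
    // -1 is the existing "no decision" value the decider returns when it
    // has no metrics.
    func desiredScaleWithoutMetrics(isCrashLooping, routeIsActive bool) int32 {
        if isCrashLooping && !routeIsActive {
            // The revision cannot become ready and no active route points at
            // it, so scaling it to 0 cannot hurt availability.
            return 0
        }
        // Otherwise keep today's behavior: wait for metrics or for the
        // progress deadline to expire.
        return -1
    }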

@dprotaso (Member)

Something to explore is what's happening with the Knative resources and controllers - if our labeler marked the older revision as reserve, then it would scale down accordingly.

I think this is an interesting edge case because the Knative Service has the following traffic block:

    traffic:
    - latestRevision: true
      percent: 100

But the Configuration doesn't have a ready revision, so its LatestReadyRevisionName is empty:

// LatestReadyRevisionName holds the name of the latest Revision stamped out
// from this Configuration that has had its "Ready" condition become "True".
// +optional
LatestReadyRevisionName string `json:"latestReadyRevisionName,omitempty"`

The revisions are initialized as pending and then transition to active (probably because the config has the route label):

// Pending tells the labeler that we have not processed this revision.
rev.SetRoutingState(v1.RoutingStatePending, tm)
updateRevisionLabels(rev, configuration)
updateRevisionAnnotations(rev, configuration, tm)

The routing state for the first revision should become 'reserve' - so this might be better fixed by not changing the autoscaler but instead fixing the labeler

https://github.com/knative/serving/tree/main/pkg/reconciler/labeler
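
In other words, a rough sketch of the labeler-side fix being suggested (assuming the existing reserve routing state; this is not the actual diff):

    // Sketch only: once a newer revision is ready and no route references
    // this revision, the labeler would mark it as reserve instead of leaving
    // it active, which lets the autoscaler scale it to zero.
    rev.SetRoutingState(v1.RoutingStateReserve, tm)
    updateRevisionLabels(rev, configuration)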

@dprotaso added this to the v1.13.0 milestone Nov 22, 2023

dprotaso commented Nov 22, 2023

Added this to the current milestone since I know you're actively working on it 💪

@andrew-delph changed the title from "Crashing revision does not scale to 0 until ProgressDeadline is reached." to "Crashing revision does not scale to 0 until ProgressDeadline is reached" Nov 23, 2023
@dprotaso (Member)

I think the other thing to note is that typically we want the revision to scale up to 1 (initial-scale) to ensure it's working.

And in case you're not aware, we have a knob that lets people adjust this progress deadline - #12743

@andrew-delph (Author)

Hey @dprotaso,
I looked into initial-scale. If the revision is scaled to 0 early, the initial-scale condition will be false, and future traffic configs that include the revision will be rejected.

@dprotaso linked a pull request Feb 17, 2024 that will close this issue
@dprotaso (Member)

/assign @andrew-delph

Coming back to this with fresh eyes - repeating the original problem statement

See that rev-00001 will not scale down until progressDeadline is reached

I'm wondering now: why not just encourage users to lower the progress deadline? For new revisions rolling out, I think we want to honour the progressDeadline. If we scaled revision-01 to zero sooner just because a newer revision (revision-02) rolled out, then we'd lose some knowledge of whether revision-01 is a safe revision to roll back to.

@andrew-delph (Author)

I see two different scenarios here:

A) Wait for progressDeadline
Pros: possibly achieves safe-rollback status
Cons: consumes resources

B) Kill once the new revision is healthy
Pros: saves resources
Cons: early rejection of safe-rollback status

I'm playing a bit of devil's advocate here and speaking without full consideration of current implementation details.
If we assume the goal is to automate as much as possible, then when I make a new revision and it becomes healthy, I don't want to provide resources for my old configuration. I think generally this is true.

Now, speaking of the feature to block rollbacks to revisions which were never healthy: this is good, because I am capable of shooting myself in the foot.
Since we are not sure whether the revision is safe or not, is it possible to allow the rollback by taking off the safety? Possibly with a prompt?
I'm not fully aware of how the rollback is used.


dprotaso commented Mar 4, 2024

I see two different scenarios here:

For B - I think it's hard to know whether the failures are transient or permanent. E.g. even if a pod is pending because of cluster limits, automation could kick in and add more capacity; then the revision would be marked Ready=True after a few minutes.

Then when I make a new revision and it becomes healthy, I don't want to provide resources for my old configuration. I think generally this is true. ... Now, speaking of the feature to block rollbacks to revisions which were never healthy: this is good, because I am capable of shooting myself in the foot.

So IBM added a feature, Initial Scale Zero, for this reason. It's a foot-gun because you lose the validation that the revision started at least once.

Thus - I'm not sure what else we should do here. One extreme is available: don't scale the revision when rolling out. Then we also have knobs (progressDeadline) so things fail sooner. The options are there, and users have the ability to control this per Knative Service.

I'm not fully aware of how the rollback is used.

You'd use the traffic block to point to an older revision.

@dprotaso dprotaso removed this from the v1.13.0 milestone Mar 5, 2024