Crashing revision does not scale to 0 until ProgressDeadline is reached #14656

Closed
andrew-delph opened this issue Nov 22, 2023 · 10 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

andrew-delph commented Nov 22, 2023

Description

Revisions trapped in a continuous crash loop fail to scale down, depleting cluster resources. A crashing revision only scales down once the progress deadline is reached.

To Reproduce

  1. Create a service which crashes on startup (rev-00001)
  2. Update the service so that it reaches a stable state (rev-00002)
  3. rev-00001 will enter a CrashLoopBackOff loop
  4. See that rev-00001 will not scale down until progressDeadline is reached

Expected Behavior

  1. rev-00001 should scale to 0 soon after rev-00002 is created.
  2. Scale-down behavior should be similar to that of a stable revision.
@andrew-delph added the kind/bug label Nov 22, 2023

andrew-delph commented Nov 22, 2023

It can be reproduced using the files in this gist (a rough sketch of main.go is shown below the steps):
https://gist.github.com/andrew-delph/cb77ddb6fce475433c9754227b61aa8c

  1. ko apply -f hello.yaml
  2. Remove the os.Exit(0) call from main.go
  3. ko apply -f hello.yaml
  4. See that the first revision does not scale to 0
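
A minimal sketch of that main.go (modeled on the gist above; the actual file may differ):

    package main

    import (
        "fmt"
        "net/http"
        "os"
    )

    func main() {
        // Exit immediately on startup so the pod ends up in CrashLoopBackOff.
        // Removing this line (step 2 above) produces a healthy revision.
        os.Exit(0)

        // A healthy revision would serve this handler instead.
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "Hello!")
        })
        http.ListenAndServe(":8080", nil)
    }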

@andrew-delph (Author)

A possible solution (WIP) can be found here: https://github.com/knative/serving/pull/14607/files


andrew-delph commented Nov 22, 2023

Some details to consider.

  • The decider will generate a desiredScale of -1 if it cannot generate metrics
  • Metrics are generated by probing the revision
  • A crashing revision will stay in the Unknown state, and Unknown will not scale to 0 (ref)
  • An Inactive revision with no metrics will also not scale to 0 (ref)
  • A Kubernetes Deployment will not publish an event for a CrashLoopBackOff, but the Pod will
  • The PA does not know whether it is in an active state, but the revision does

Should the latest revision scale to 0 if it is crashing?
A healthy revision is able to do so if it is configured correctly and has no traffic.
Scaling an unhealthy revision to 0 would give it no chance to become healthy (e.g. when the db connection comes back online).

If the revision is not the latest (routingState is reserve) and is not in the traffic config, should it scale to 0?
A healthy revision will scale to 0 once traffic stops.
A crashing revision will wait for the timeout.

So, I think that if the route is active, the revision should be allowed to continue, in case it is able to stabilize.
A crashing revision whose routing state has moved on already serves no purpose.
Scaling it to 0 would also remove its chance to come back online, though, and a future traffic configuration including it would be rejected.

It should not scale to 0 if the target is receiving traffic. This is a strong constraint because it affects availability, and there are a few safeguards to verify it; one of them is the metrics. With no metrics, you are not able to verify that the revision is not receiving traffic. Currently this is the price to pay for availability.

So, the goal would be to give the PA a way to scale to 0 with no metrics, but only if it is in a crash loop and not part of an active route.
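
To make the idea concrete, here is a rough sketch of that decision (purely illustrative; the names below are hypothetical, not existing Knative functions):

    // Hypothetical sketch only, not actual Knative code.
    package sketch

    // desiredScaleWithoutMetrics mirrors the behavior described above:
    // -1 is the existing "no decision" value the decider returns when it
    // has no metrics.
    func desiredScaleWithoutMetrics(isCrashLooping, routeIsActive bool) int32 {
        if isCrashLooping && !routeIsActive {
            // The revision cannot become ready and no active route points at
            // it, so scaling it to 0 cannot hurt availability.
            return 0
        }
        // Otherwise keep today's behavior: wait for metrics or for the
        // progress deadline to expire.
        return -1
    }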

@dprotaso (Member)

Something to explore is what's happening with the Knative resources and controllers - if our labeler marked the older revision as reserve, then it would scale down accordingly.

I think this is an interesting edge case because the Knative Service has the following traffic block:

    traffic:
    - latestRevision: true
      percent: 100

But the Configuration doesn't have a ready revision, so its LatestReadyRevisionName is empty:

// LatestReadyRevisionName holds the name of the latest Revision stamped out
// from this Configuration that has had its "Ready" condition become "True".
// +optional
LatestReadyRevisionName string `json:"latestReadyRevisionName,omitempty"`

The revisions are initialized as pending and then transition to active (probably because the config has the route label):

// Pending tells the labeler that we have not processed this revision.
rev.SetRoutingState(v1.RoutingStatePending, tm)
updateRevisionLabels(rev, configuration)
updateRevisionAnnotations(rev, configuration, tm)

The routing state for the first revision should become 'reserve' - so this might be better fixed by not changing the autoscaler but instead fixing the labeler

https://github.com/knative/serving/tree/main/pkg/reconciler/labeler
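
In other words, a rough sketch of the labeler-side fix being suggested (assuming the existing reserve routing state; this is not the actual diff):

    // Sketch only: once a newer revision is ready and no route references
    // this revision, the labeler would mark it as reserve instead of leaving
    // it active, which lets the autoscaler scale it to zero.
    rev.SetRoutingState(v1.RoutingStateReserve, tm)
    updateRevisionLabels(rev, configuration)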

@dprotaso added this to the v1.13.0 milestone Nov 22, 2023

dprotaso commented Nov 22, 2023

Added this to the current milestone since I know you're actively working on it 💪

@andrew-delph changed the title from "Crashing revision does not scale to 0 until ProgressDeadline is reached." to "Crashing revision does not scale to 0 until ProgressDeadline is reached" Nov 23, 2023
@dprotaso (Member)

I think the other thing to note is that typically we want the revision to scale up to 1 (initial-scale) to ensure it's working.

And in case you're not aware, we have a knob that lets people adjust this progress deadline - #12743

@andrew-delph (Author)

Hey @dprotaso,
I looked into initial-scale. If the revision is scaled to 0 early, the initial-scale condition will be false, and future traffic configs that include the revision will be rejected.

@dprotaso linked a pull request Feb 17, 2024 that will close this issue
@dprotaso (Member)

/assign @andrew-delph

Coming back to this with fresh eyes - repeating the original problem statement

See that rev-00001 will not scale down until progressDeadline is reached

I'm wondering now: why not just encourage users to lower the progress deadline? For new revisions rolling out, I think we want to honour the progressDeadline. If we scaled revision-01 to zero sooner just because a newer revision (revision-02) rolled out, then we'd lose some knowledge of whether revision-01 is a safe revision to roll back to.

@andrew-delph (Author)

I see two different scenarios here:

A) Wait for progressDeadline
Pros: possibly achieves safe-rollback status
Cons: consumes resources

B) Kill once the new revision is healthy
Pros: saves resources
Cons: early rejection of safe-rollback status

I'm playing a bit of devil's advocate here and speaking without full consideration of current implementation details.
If we assume the goal is to automate as much as possible, then when I make a new revision and it becomes healthy, I don't want to provide resources for my old configuration. I think generally this is true.

Now, speaking of the feature to block rollbacks to revisions which were never healthy: this is good, because I am capable of shooting myself in the foot.
Since we are not sure whether the revision is safe or not, is it possible to allow the rollback by taking off the safety? Possibly with a prompt?
I'm not fully aware of how the rollback is used.


dprotaso commented Mar 4, 2024

I see two different scenarios here:

For B - I think it's hard to know whether the failures are transient or permanent. E.g. even if a pod is pending because of cluster limits, automation could kick in and add more capacity; then the revision would be marked Ready=True after a few minutes.

Then when I make a new revision and it becomes healthy, I don't want to provide resources for my old configuration. I think generally this is true. ... Now, speaking of the feature to block rollbacks to revisions which were never healthy: this is good, because I am capable of shooting myself in the foot.

So IBM added a feature, Initial Scale Zero, for this reason. It's a foot-gun because you lose the validation that the revision started at least once.

Thus - I'm not sure what else we should do here. One extreme is available: don't scale the revision when rolling out. Then we also have knobs (progressDeadline) so things fail sooner. The options are there, and users have the ability to control this per Knative Service.

I'm not fully aware of how the rollback is used.

You'd use the traffic block to point to an older revision.

@dprotaso dprotaso removed this from the v1.13.0 milestone Mar 5, 2024