
Add "Reconciled" condition for RouteGatewayStatus #642

Closed
robscott opened this issue Apr 30, 2021 · 22 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@robscott
Member

What would you like to be added:
Currently, RouteGatewayStatus has only one condition defined in the API: "Admitted". It's not clear how that should be handled in the following use cases:

  • Route has been processed by Gateway controller for the specified Gateway but reconciliation has temporarily failed
  • Route is already being served by a Gateway but an incremental update failed

It seems wrong to set Admitted to false in these cases, but it also feels like we need a way to communicate this in status.

One potential option would be to add a Reconciled condition that could be used to indicate if the most recent reconciliation cycle for the Route was successful.
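
As a rough sketch of what that might look like with the generic condition helpers (the "Reconciled" type and the reasons here are hypothetical, not part of the API today):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	var conditions []metav1.Condition

	// The Route is valid and the controller accepts responsibility for it.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type:               "Admitted",
		Status:             metav1.ConditionTrue,
		Reason:             "RouteAdmitted",
		ObservedGeneration: 3, // generation of the Route spec this reflects
	})

	// Hypothetical "Reconciled": the latest reconciliation cycle failed,
	// but Admitted stays true because the previously applied config may
	// still be serving traffic.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type:               "Reconciled",
		Status:             metav1.ConditionFalse,
		Reason:             "ReconciliationFailed",
		Message:            "latest config not applied; previous config still in use",
		ObservedGeneration: 3,
	})

	for _, c := range conditions {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```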

Why is this needed:
To improve user experience.

@robscott robscott added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 30, 2021
@youngnick
Contributor

I agree that we need to do something with Route status, but I think that porting some of the Listener conditions to the Route would probably work better. (Here, the example I'm thinking of is DegradedRoutes on the listener, which seems like it should translate cleanly into a Degraded condition on the Route.)

I'd also argue that while Admitted is a good shortcut for "this object is valid", there's probably some space for a Ready-like rollup condition that would serve as the indicator both that the config is accepted and that the Route should be passing traffic. All other Conditions should be abnormal-true error conditions (like Degraded), which are only status: True when there is a problem.

@jpeach
Contributor

jpeach commented May 11, 2021

Here, the example I'm thinking of is DegradedRoutes on the listener, which seems like it should translate cleanly into a Degraded condition on the Route.

A route can be degraded in the context of one gateway and perfectly OK in the context of another (really, same for reconciled).

@youngnick
Contributor

youngnick commented May 11, 2021

The status on the routes is in the context of a Gateway already though. That is, each Condition set also includes a backlink to the relevant Gateway. So bringing the Degraded condition down to the Route just means one less lookup for the user, and one more status update for the controller.
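
For anyone following along, a simplified sketch of that shape (field names approximate, not copied verbatim from the v1alpha1 types):

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// GatewayReference is the backlink: which Gateway this status entry is about.
type GatewayReference struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}

// RouteGatewayStatus scopes a set of Conditions to a single Gateway, so a
// Route can be, say, Degraded for one Gateway and healthy for another.
type RouteGatewayStatus struct {
	GatewayRef GatewayReference   `json:"gatewayRef"`
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```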

@jpeach
Contributor

jpeach commented May 11, 2021

The status on the routes is in the context of a Gateway already though.

Oh, I forgot 🤕

@robscott
Member Author

there's probably some space for a Ready-like rollup condition that would serve as the indicator both that the config is accepted and that the Route should be passing traffic

This may be unique to our implementation, but we have a slightly different distinction. We may still be serving traffic with the older config if the new config cannot be reconciled. So in our case, a distinction between "this config is valid" and "the latest config has actually been reconciled" is important. That is fairly similar in concept to "Ready", but slightly different. Maybe both are valuable?

@youngnick
Contributor

youngnick commented May 19, 2021

Yes, having two Conditions makes sense to me: "config is valid" and "config is being served". It seems to me like they should probably be Admitted and Ready respectively.

I note that Contour currently can't guarantee Ready, sadly, because there's no feedback loop from Envoy. We can guess with reasonable confidence, but can't guarantee. No reason not to go ahead with this distinction though.

@jpeach, thoughts?

@jpeach
Contributor

jpeach commented May 19, 2021

Ready means it is working at some version. Actually specifying what that version is, for an arbitrary tree of resources that can all change at any time, doesn't seem straightforward to me. It's hard to pin down what the semantics of "the valid config" would be.

Route has been processed by Gateway controller for the specified Gateway but reconciliation has temporarily failed

Depending on the failure mode, I'd suggest a Gateway condition here.

Route is already being served by a Gateway but an incremental update failed

Could be a ResolvedRefs/DegradedRoutes condition on the Gateway.

@youngnick
Contributor

Well, a Condition does have a version, the observedGeneration, which consumers should definitely be checking.

I must admit to being a bit confused about your reluctance here @jpeach. Is it that you think that the Route should have a minimal set of Conditions, and we should teach people to look at owning Gateways for the system's current state? Personally, I don't feel like that's a great user experience for Route owners.

@jpeach
Contributor

jpeach commented May 21, 2021

I must admit to being a bit confused about your reluctance here @jpeach. Is it that you think that the Route should have a minimal set of Conditions, and we should teach people to look at owning Gateways for the system's current state? Personally, I don't feel like that's a great user experience for Route owners.

So the route is part of an object graph. For the proxy configuration to be "current", all the configuration in all the resources in the graph needs to be current on the server. When you look at a "reconciled" condition on the route, is it really telling you that the config is up to date, or only that this specific route hasn't changed since you last looked at it?

For the conditions that Rob listed at the start of this issue, explicit error conditions seem preferable to me over a general "are we current?" one.

@youngnick
Contributor

Okay, so I'm going to try to summarise what I understand you're saying.

You're in support of having the already-existing Admitted condition, plus additional ones to cover specific errors, but not a Reconciled or Ready or similar condition.

In this case, Admitted means, essentially, "no syntactic or other simple errors found; this is okay to be reconciled, and that should happen at some point".

Finding out when it has happened requires looking at the Listener.

Specific errors on the Route could surface the same information as Listener conditions like DegradedRoutes, so a Degraded condition, for example.

Adding Reconciled doesn't make sense because the condition may change with no change to the underlying object?

Just want to make sure I've got what you're saying before I think further about where I'm at. :)

@jpeach
Contributor

jpeach commented Jul 1, 2021

In this case, Admitted means, essentially, "no syntactic or other simple errors found; this is okay to be reconciled, and that should happen at some point".

I understood it that way, in the sense of "there is an ingress controller that accepts responsibility for this object". Once that happens, then of course the controller will reconcile it.

@robscott
Member Author

robscott commented Jul 1, 2021

then of course the controller will reconcile it

Although this is likely true, in some cases it can take a matter of minutes to spin up the needed resources to reconcile. I realize that's not a distinction worth making for many implementations though, so maybe "Reconciled" is a concept that only makes sense for some.

Adding Reconciled doesn't make sense because the condition may change with no change to the underlying object?

Yeah, this is definitely a concern, but I think it could be used reasonably effectively when combined with the ObservedGeneration field:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1419
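
A minimal consumer-side sketch of that check (the "Reconciled" condition type is still hypothetical here):

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// latestSpecReconciled reports whether the most recent generation of the
// object's spec was successfully reconciled, not just some earlier one.
func latestSpecReconciled(obj metav1.Object, conditions []metav1.Condition) bool {
	c := meta.FindStatusCondition(conditions, "Reconciled") // hypothetical type
	if c == nil {
		return false
	}
	// A condition recorded against an older generation says nothing about
	// the current spec, so require the generations to match.
	return c.Status == metav1.ConditionTrue &&
		c.ObservedGeneration == obj.GetGeneration()
}
```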

@jpeach
Contributor

jpeach commented Jul 4, 2021

Although this is likely true, in some cases it can take a matter of minutes to spin up the needed resources to reconcile.

This is getting close to the ListenerReasonPending condition. As described here, the use of "Reconciled" sounds a lot like "Ready" to me.

@stevesloka
Contributor

I don't want to diverge the conversation, but I think it's relevant: how does the already-existing "Scheduled" condition fit into this? @youngnick & I had a conversation about this last week, and I don't fully understand what "Scheduled" means in the context of a Gateway.

A gateway starts out with Scheduled: false, Waiting for controller. To me, that's Admitted: false. Once a controller takes ownership of the Gateway, then it's Admitted: true, and then it waits for the Reconciled status. I don't see how Scheduled: false fits. It was explained to me as similar to how a pod gets scheduled to a node, but we don't schedule gateways to different controllers; there's one that should pick it up, and it's not managed by a "scheduler" per se.

If we step back, what is it that we want to convey to users, without trying to attach names just yet? From the conversation it sounds like folks want to convey:

  1. A controller has seen the Gateway, is responsible, and will eventually serve traffic for it, but it's doing work to get "ready"
  2. The gateway is ready to serve some sort of traffic

That leads me to Admitted & Reconciled as @robscott proposed, leaving the more specific errors to the Listeners, where they will come and go as they arise & are fixed.

@jpeach
Contributor

jpeach commented Jul 5, 2021

Gateways are logical configurations. There might not be enough underlying load balancing capacity to configure in the requested configuration. In that case "Admitted" is true (the controller admits that it ought to be reconciling the Gateway) but "Scheduled" is false (there's no underlying capacity it can use to stand up the Gateway).

It was explained to me as similar to how a pod gets scheduled to a node, but we don't schedule gateways to different controllers; there's one that should pick it up, and it's not managed by a "scheduler" per se.

Pods are scheduled to kubelets, and gateways are scheduled to whatever the underlying proxy infrastructure unit is. For example, a controller could spin up a VM instance for each gateway but find that it's out of VM quota. It would not be able to schedule new gateways, though it could admit them.

@stevesloka
Contributor

Your example @jpeach sounds like Provisioned fits better, since you're still not scheduling resources, you're requesting them. If you hit a VM quota, run out of resources for a cloud load balancer, hit a permissions error, etc., then Provisioned: false tells me that something tried to spin up but failed, since it's not provisioned.

In my head, the "Scheduled" condition doesn't fit my mental model, but curious what others think.

@jpeach
Contributor

jpeach commented Jul 5, 2021

Your example @jpeach sounds like Provisioned fits better, since you're still not scheduling resources, you're requesting them. If you hit a VM quota, run out of resources for a cloud load balancer, hit a permissions error, etc., then Provisioned: false tells me that something tried to spin up but failed, since it's not provisioned.

That's just an example. There could be any number of reasons and the spec doesn't tell you how to do it.

@youngnick
Contributor

I think the key part here in this specific example is that Scheduled and Provisioned are the same thing from different sides. Let me explain.

A Gateway is a thing that sits at a logical edge of the cluster in some way. What does that mean? The spec deliberately does not define exactly what the "edge" of the cluster is, so that it can be applicable to lots of different use cases.

In the fully general case, on the happy path, the Gateway, per the spec, passes through the following Conditions:

  • Admitted. The controller recognizes the object and accepts responsibility for reconciling it. This makes no statement about anything else.
  • Scheduled. The controller has looked at the object, determined it's valid, and checked that there are resources to provision it. It sends a request to provision it, and flips this condition to true.
  • Ready. The controller has determined that all the infrastructure is provisioned and working correctly, and the object is ready to forward traffic.

Right now, the Ready condition includes the Provisioned condition @stevesloka mentioned. Steve, it seems like you don't see the value in the Scheduled condition vs having a Provisioned condition before Ready?
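
To make the happy path above concrete, a sketch using the generic condition helpers (the reasons are illustrative, not taken from the spec):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	var conditions []metav1.Condition

	// 1. Admitted: the controller recognizes the Gateway and accepts
	//    responsibility for reconciling it.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type: "Admitted", Status: metav1.ConditionTrue, Reason: "Admitted",
	})

	// 2. Scheduled: the config is valid, underlying capacity exists, and a
	//    provisioning request has been sent. In @jpeach's quota example
	//    above, this would instead be False with a reason like "NoResources".
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type: "Scheduled", Status: metav1.ConditionTrue, Reason: "ResourcesAvailable",
	})

	// 3. Ready: the infrastructure is provisioned and working, and the
	//    Gateway can forward traffic.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type: "Ready", Status: metav1.ConditionTrue, Reason: "Provisioned",
	})

	for _, c := range conditions {
		fmt.Printf("%-10s %s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```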

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 4, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 3, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
