
Invalid resourceVersion error in upstream discovery #8635

Closed
jenshu opened this issue Sep 1, 2023 · 11 comments
Labels
Area: Discovery · post-plan/1.18 · Prioritized (indicating issue prioritized to be worked on in RFE stream) · Type: Bug (something isn't working)

Comments

@jenshu
Contributor

jenshu commented Sep 1, 2023

Gloo Edge Product

Enterprise

Gloo Edge Version

v1.14.x

Kubernetes Version

1.27

Describe the bug

A customer reported that gloo discovery is not creating all the upstreams that they expect. Deleting and recreating the services doesn't seem to help.

When checking the discovery pod logs, they saw errors like:
{"level":"error","ts":"2023-08-30T12:43:51.281Z","logger":"uds.v1.event_loop.uds","caller":"syncer/setup_syncer.go:117","msg":"error: event_loop.uds: error in uds plugin : reconciling resource default-xxx-80: updating kube resource default-xxx-80: (want 2398796740): upstreams.gloo.solo.io \"default-xxx-80\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/uds/syncer.RunUDS.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.14.12/projects/discovery/pkg/uds/syncer/setup_syncer.go:117"}

Deleting the erroring upstreams didn't seem to resolve the issue either (the upstreams would get recreated, with the same error in the logs).

Note: I have also seen a similar error (metadata.resourceVersion: Invalid value: 0x0: must be specified for an update) on a clean install, but in my case it didn't seem to affect upstream creation.

Recommended steps:

  • work with the customer to reproduce and figure out under what circumstances the error occurs
  • confirm whether and how it affects upstream discovery
  • if needed, share any short-term workarounds while the customer is awaiting a fix

Expected Behavior

The resourceVersion error should not occur (or should resolve itself with retries), and the expected upstreams should be created

Steps to reproduce the bug

See above

Additional Environment Detail

No response

Additional Context

No response


@jenshu jenshu added the Type: Bug label Sep 1, 2023
@DuncanDoyle
Contributor

@jenshu Any update on this issue? Without a reproducer this is not actionable. Did "the customer" ever come back to you? If not, I'm going to close this ticket (and can create a new one, or re-open it if this issue surfaces again).

@jenshu
Contributor Author

jenshu commented Feb 9, 2024

@DuncanDoyle - responded offline with context

@DuncanDoyle
Contributor

Closing, as there is no reproducer. Can re-open if this surfaces again.

@ShaunPTK

Steps to reproduce:

  1. Check the discovery pod logs to make sure there are no errors being streamed
  2. Configure the Gloo Settings to only auto-discover services if this label is present:
spec:
  ...
  discovery:
    ...
    udsOptions:
      watchLabels:
        expose_api: "true"
  3. Remove any existing auto-discovered Upstreams:
kubectl -n gloo-system delete upstream -l discovered_by=kubernetesplugin
  4. Configure a service with the label expose_api: "true" - Gloo will/should create an upstream in the gloo-system namespace for it
  5. Check the discovery pod logs - you should see errors relating to this service
  6. Remove the watchLabels setting above - the errors should cease

@jenshu jenshu reopened this Aug 21, 2024
@DuncanDoyle DuncanDoyle added the Prioritized label Sep 9, 2024
@sam-heilbron
Contributor

See #9968 (comment) for inspiration, as the underlying problems are the same (different code paths, however)

@jenshu
Contributor Author

jenshu commented Sep 16, 2024

This doesn't seem related to persistProxySpec. It's happening with upstreams, and with persistProxySpec=false.

Here are some steps to reproduce from a clean cluster:

  1. Create a cluster:
kind create cluster
  2. Install Gloo Gateway with uds watch labels:
helm install -n gloo-system gloo gloo/gloo --create-namespace --version 1.18.0-beta21 -f - <<EOF
discovery:
  udsOptions:
    watchLabels:
      aaa: bbb
EOF
  3. Create a service with the uds watch labels:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: example-svc
  namespace: default
  labels:
    aaa: bbb
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 8080
  selector:
    ccc: ddd
EOF
  4. Verify that the Upstream was created, with the expected spec:
kubectl get us -n gloo-system default-example-svc-8000 -oyaml
  5. Check the discovery logs:
kubectl logs -f -n gloo-system deploy/discovery

At this point, the upstream got created correctly, but it seems like something then tried to update the upstream, resulting in a bunch of these errors (note: these errors may not happen every time on create, but they do consistently happen later when we update the service, in step 6 below):
{"level":"error","ts":"2024-09-12T18:56:44.056Z","logger":"uds.v1.event_loop.uds","caller":"discovery/discovery.go:150","msg":"failed reconciling upstreams","version":"1.18.0-beta21","discovered_by":"kubernetesplugin","upstreams":1,"error":"reconciling resource default-example-svc-8000: updating kube resource default-example-svc-8000: (want 792): upstreams.gloo.solo.io \"default-example-svc-8000\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update","errorVerbose":"upstreams.gloo.solo.io \"default-example-svc-8000\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update\nupdating kube resource default-example-svc-8000: (want 792)\ngithub.com/solo-io/solo-kit/pkg/errors.Wrapf\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:15\ngithub.com/solo-io/solo-kit/pkg/api/v1/clients/kube.(*ResourceClient).Write\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/clients/kube/resource_client.go:232\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.writeDesiredResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:105\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.attemptSyncResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:72\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).syncResource.func1\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:66\nk8s.io/client-go/util/retry.OnError.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.29.2/util/retry/util.go:51\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/backoff.go:461\nk8s.io/client-go/util/retry.OnError\n\t/go/pkg/mod/k8s.io/client-go@v0.29.2/util/retry/util.go:50\ngithub.com/solo-io/solo-kit/pkg/errors.RetryOnConflict\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:117\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).syncResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:64\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).Reconcile\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:44\ngithub.com/solo-io/gloo/projects/gloo/pkg/api/v1.(*upstreamReconciler).Reconcile\n\t/workspace/gloo/projects/gloo/pkg/api/v1/upstream_reconciler.sk.go:46\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:146\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nreconciling resource 
default-example-svc-8000\ngithub.com/solo-io/solo-kit/pkg/errors.Wrapf\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:15\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).Reconcile\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:45\ngithub.com/solo-io/gloo/projects/gloo/pkg/api/v1.(*upstreamReconciler).Reconcile\n\t/workspace/gloo/projects/gloo/pkg/api/v1/upstream_reconciler.sk.go:46\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:146\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:150\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117"}

  6. Make a change to the service (e.g. add more labels):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: example-svc
  namespace: default
  labels:
    aaa: bbb
    xxx: yyy
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 8080
  selector:
    ccc: ddd
EOF

More resourceVersion errors are seen in the logs, and the upstream does not pick up the changes (it still shows the old labels in discoveryMetadata). This is a problem because further updates to the service do not get reflected in the upstream.

Note: making any change to the Service spec.selector does not cause the Upstream to get updated, regardless of whether uds watch labels are being used (and with no errors in the logs). It looks like there is already an issue for that: #8227

@nfuden
Contributor

nfuden commented Sep 24, 2024

@DuncanDoyle which versions does it make sense to target this for?

@bewebi
Contributor

bewebi commented Sep 25, 2024

I believe I've gotten to the root cause of this

In a nutshell:

  • Discovery does not propagate labels from Services to discovered Upstreams' metadata
  • When watchLabels are set, Discovery includes those labels in the selector passed to Reconcile()
    • See labels added here from d.extraSelectorLabels
    • extraSelectorLabels is set to watchOpts.Selector in StartUDS() here
    • watchOpts.Selector gets the watchLabels directly from Settings here
  • When solo-kit Reconcile() attempts to get the original resource(s) via a List() call here, none are returned because the generated Upstream(s) do not have the labels specified in the selector
  • Because we do not find the "original" Upstream, we do not propagate its ResourceVersion, effectively treating the Update as a Create
    • This would have happened here
  • This results in the observed error, as the write operation attempts to update an existing resource without specifying the resourceVersion (a minimal sketch of this failure mode follows after the list below)

The key question from here is: should Upstreams created by discovery contain the watchLabels, or perhaps all labels from the Service?

  • If yes, the solution should mainly be a matter of setting those when creating Upstreams
  • If no, the solution should mainly be a matter of not including watchLabels in the selector when syncing Upstreams
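
For concreteness, here is a minimal, self-contained Go sketch of the failure mode described above. The type and function names are illustrative assumptions, not the actual solo-kit implementation; it only demonstrates how a List() filtered by a selector that includes the watchLabels misses the discovered Upstream, so the subsequent write carries no resourceVersion:

package main

import "fmt"

// resource is a stand-in for a discovered Upstream; illustrative only.
type resource struct {
	name            string
	labels          map[string]string
	resourceVersion string
}

// list mimics a List() call filtered by a label selector: a resource matches
// only if it carries every label in the selector.
func list(store []resource, selector map[string]string) []resource {
	var out []resource
	for _, r := range store {
		matches := true
		for k, v := range selector {
			if r.labels[k] != v {
				matches = false
				break
			}
		}
		if matches {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// The Upstream previously written by discovery: it carries the
	// discovered_by marker label but NOT the user's watchLabels.
	store := []resource{{
		name:            "default-example-svc-8000",
		labels:          map[string]string{"discovered_by": "kubernetesplugin"},
		resourceVersion: "792",
	}}

	// During a resync the "originals" are listed with a selector that also
	// includes the watchLabels (aaa: bbb), so the existing Upstream is missed.
	selector := map[string]string{
		"discovered_by": "kubernetesplugin",
		"aaa":           "bbb",
	}
	originals := list(store, selector)
	fmt.Println("originals found:", len(originals)) // prints: originals found: 0

	// With no original found, the desired Upstream keeps an empty
	// resourceVersion, and the apiserver rejects the write with:
	// metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
	desired := resource{name: "default-example-svc-8000"}
	fmt.Printf("writing update with resourceVersion=%q\n", desired.resourceVersion)
}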

@bewebi
Contributor

bewebi commented Sep 25, 2024

Having toyed with both options, I found that both seem viable in the sense that they allow resyncs to occur successfully

Removing labels from the UpstreamSelector seems to be the better option for a couple of reasons:

  • Adding labels to Upstreams that don't already have them is a user-facing change in behavior
    • This may or may not be considered a breaking change, but it is likely undesirable to change behavior if it's not requested or necessary
  • Even if we add labels to new Upstreams, errors will continue to occur for existing Upstreams at the time of upgrade, so resolving them would require deleting all pre-existing Upstreams and allowing Discovery to re-populate
    • Even if we provided a script/tool to automate this, I am not sure this would be doable without the potential for downtime

This behavior has been present since this repo was first split out from solo-projects, so there is little context around why this decision was made, and I'm not clear whether the selector was actually ever populated in the opts passed to StartUDS() at that point

Created upstreams do get the discovered_by: kubernetesplugin label, which is selected on and should limit any performance impact incurred by expanding the scope of the selector.
It is possible, though unlikely, that Upstreams not created by Discovery would have that label; but even in this case there would be no additional calls to etcd, just a larger return from a List() call and additional elements to iterate over (which happens once, when converting the list to a map), so any performance impact should be negligible.

Therefore I'm comfortable treating this behavior as a bug and "fixing" it accordingly
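
As a rough sketch of the chosen fix direction, reusing the illustrative resource type from the sketch above (again, hypothetical names, not the actual solo-kit code): the selector used during a resync keeps only the discovery marker label, so the original Upstream is found and its resourceVersion can be propagated onto the desired resource before the write.

// reconcileSelector returns the selector used to list "originals" during a
// resync. After the fix it contains only the discovery marker label; the
// user's watchLabels are no longer mixed in.
func reconcileSelector() map[string]string {
	return map[string]string{"discovered_by": "kubernetesplugin"}
}

// propagateResourceVersion copies the original's resourceVersion onto the
// desired resource so the apiserver accepts the write as a valid update
// rather than rejecting it as a create-shaped update.
func propagateResourceVersion(original, desired resource) resource {
	desired.resourceVersion = original.resourceVersion
	return desired
}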

@bewebi
Contributor

bewebi commented Oct 1, 2024

A fix for this has merged to OSS main and will be backported to all supported LTS branches (i.e. down to 1.14)

@DuncanDoyle, please advise if there is any particular version that will require an OSS and/or EE release ASAP, or if it's sufficient to have this bugfix roll out whenever the next release(s) otherwise occur

@bewebi
Contributor

bewebi commented Oct 4, 2024

A fix for this has now been released in all LTS versions for OSS and EE, as well as the OSS beta:
OSS: 1.18.0-beta24, 1.17.11, 1.16.21, 1.15.32, 1.14.32
EE: 1.17.3, 1.16.15, 1.15.22, 1.14.23
Additionally, this fix will be in EE 1.18.0-beta2.

@bewebi bewebi closed this as completed Oct 4, 2024