
Invalid resourceVersion error in upstream discovery #8635

Closed
jenshu opened this issue Sep 1, 2023 · 11 comments
Labels
Area: Discovery · post-plan/1.18 · Prioritized (indicating issue prioritized to be worked on in RFE stream) · Type: Bug (something isn't working)

Comments

@jenshu
Contributor

jenshu commented Sep 1, 2023

Gloo Edge Product

Enterprise

Gloo Edge Version

v1.14.x

Kubernetes Version

1.27

Describe the bug

A customer reported that gloo discovery is not creating all the upstreams that they expect. Deleting and recreating the services doesn't seem to help.

When checking the discovery pod logs, they saw errors like:
{"level":"error","ts":"2023-08-30T12:43:51.281Z","logger":"uds.v1.event_loop.uds","caller":"syncer/setup_syncer.go:117","msg":"error: event_loop.uds: error in uds plugin : reconciling resource default-xxx-80: updating kube resource default-xxx-80: (want 2398796740): upstreams.gloo.solo.io \"default-xxx-80\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update","version":"undefined","stacktrace":"github.com/solo-io/gloo/projects/discovery/pkg/uds/syncer.RunUDS.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.14.12/projects/discovery/pkg/uds/syncer/setup_syncer.go:117"}

Deleting the erroring upstreams didn't seem to resolve the issue either (the upstreams would get recreated, with the same error in the logs).

Note: I have also seen a similar error (metadata.resourceVersion: Invalid value: 0x0: must be specified for an update) on a clean install, but in my case it didn't seem to affect upstream creation.

Recommended steps:

  • work with the customer to reproduce and figure out under what circumstances the error occurs
  • confirm whether and how it affects upstream discovery
  • if needed, share any short-term workarounds while the customer is awaiting a fix

Expected Behavior

The resourceVersion error should not occur (or should resolve itself with retries), and the expected upstreams should be created

Steps to reproduce the bug

See above

Additional Environment Detail

No response

Additional Context

No response


@jenshu jenshu added the Type: Bug label Sep 1, 2023
@DuncanDoyle
Contributor

@jenshu Any update on this issue? Without a reproducer this is not actionable. Did "the customer" ever come back to you? If not, I'm going to close this ticket (and can create a new one, or re-open it if this issue surfaces again).

@jenshu
Contributor Author

jenshu commented Feb 9, 2024

@DuncanDoyle - responded offline with context

@DuncanDoyle
Contributor

Closing, as there is no reproducer. Can re-open if this surfaces again.

@ShaunPTK

Steps to reproduce:

  1. Check the discovery pod logs to make sure there are no errors being streamed
  2. Configure the Gloo Settings to only auto-discover services if this label is present:
spec:
  ...
  discovery:
    ...
    udsOptions:
      watchLabels:
        expose_api: "true"
  3. Remove any existing auto-discovered Upstreams:
kubectl -n gloo-system delete upstream -l discovered_by=kubernetesplugin
  4. Configure a service with the label expose_api: "true" - Gloo will/should create an upstream in the gloo-system namespace for it
  5. Check the discovery pod logs - you should see errors relating to this service
  6. Remove the watchLabels setting above - the errors should cease

@jenshu jenshu reopened this Aug 21, 2024
@DuncanDoyle DuncanDoyle added the Prioritized label Sep 9, 2024
@sam-heilbron
Contributor

See #9968 (comment) for inspiration, as the underlying problems are the same (different code paths, however)

@jenshu
Contributor Author

jenshu commented Sep 16, 2024

This doesn't seem related to persistProxySpec. It's happening with upstreams, and with persistProxySpec=false.

Here are some steps to reproduce from a clean cluster:

  1. Create a cluster:
kind create cluster
  2. Install Gloo Gateway with uds watch labels:
helm install -n gloo-system gloo gloo/gloo --create-namespace --version 1.18.0-beta21 -f - <<EOF
discovery:
  udsOptions:
    watchLabels:
      aaa: bbb
EOF
  3. Create a service with the uds watch labels:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: example-svc
  namespace: default
  labels:
    aaa: bbb
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 8080
  selector:
    ccc: ddd
EOF
  4. Verify that the Upstream was created, with the expected spec:
kubectl get us -n gloo-system default-example-svc-8000 -oyaml
  5. Check the discovery logs:
kubectl logs -f -n gloo-system deploy/discovery

At this point, the upstream got created correctly, but it seems like something then tried to update the upstream, resulting in a bunch of these errors (note: these errors may not happen every time on create, but they do consistently happen later when we update the service, in step 6 below):
{"level":"error","ts":"2024-09-12T18:56:44.056Z","logger":"uds.v1.event_loop.uds","caller":"discovery/discovery.go:150","msg":"failed reconciling upstreams","version":"1.18.0-beta21","discovered_by":"kubernetesplugin","upstreams":1,"error":"reconciling resource default-example-svc-8000: updating kube resource default-example-svc-8000: (want 792): upstreams.gloo.solo.io \"default-example-svc-8000\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update","errorVerbose":"upstreams.gloo.solo.io \"default-example-svc-8000\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update\nupdating kube resource default-example-svc-8000: (want 792)\ngithub.com/solo-io/solo-kit/pkg/errors.Wrapf\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:15\ngithub.com/solo-io/solo-kit/pkg/api/v1/clients/kube.(*ResourceClient).Write\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/clients/kube/resource_client.go:232\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.writeDesiredResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:105\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.attemptSyncResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:72\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).syncResource.func1\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:66\nk8s.io/client-go/util/retry.OnError.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.29.2/util/retry/util.go:51\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/wait.go:145\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/go/pkg/mod/k8s.io/apimachinery@v0.29.2/pkg/util/wait/backoff.go:461\nk8s.io/client-go/util/retry.OnError\n\t/go/pkg/mod/k8s.io/client-go@v0.29.2/util/retry/util.go:50\ngithub.com/solo-io/solo-kit/pkg/errors.RetryOnConflict\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:117\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).syncResource\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:64\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).Reconcile\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:44\ngithub.com/solo-io/gloo/projects/gloo/pkg/api/v1.(*upstreamReconciler).Reconcile\n\t/workspace/gloo/projects/gloo/pkg/api/v1/upstream_reconciler.sk.go:46\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:146\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nreconciling resource 
default-example-svc-8000\ngithub.com/solo-io/solo-kit/pkg/errors.Wrapf\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/errors/errors.go:15\ngithub.com/solo-io/solo-kit/pkg/api/v1/reconcile.(*reconciler).Reconcile\n\t/go/pkg/mod/github.com/solo-io/solo-kit@v0.35.3/pkg/api/v1/reconcile/reconciler.go:45\ngithub.com/solo-io/gloo/projects/gloo/pkg/api/v1.(*upstreamReconciler).Reconcile\n\t/workspace/gloo/projects/gloo/pkg/api/v1/upstream_reconciler.sk.go:46\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:146\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).Resync\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:150\ngithub.com/solo-io/gloo/projects/gloo/pkg/discovery.(*UpstreamDiscovery).StartUds.func1\n\t/workspace/gloo/projects/gloo/pkg/discovery/discovery.go:117"}

  6. Make a change to the service (e.g. add more labels):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: example-svc
  namespace: default
  labels:
    aaa: bbb
    xxx: yyy
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 8080
  selector:
    ccc: ddd
EOF

More resourceVersion errors are seen in the logs, and the upstream does not pick up the changes (it still shows the old labels in discoveryMetadata). This is a problem because further updates to the service do not get reflected in the upstream.

Note: making any change to the Service spec.selector does not cause the Upstream to get updated, regardless of whether uds watch labels are being used (and with no errors in the logs). It looks like there is already an issue for that: #8227

@nfuden
Contributor

nfuden commented Sep 24, 2024

@DuncanDoyle which versions does it make sense to target this for?

@bewebi
Contributor

bewebi commented Sep 25, 2024

I believe I've gotten to the root cause of this

In a nutshell:

  • Discovery does not propagate labels from Services to discovered Upstreams' metadata
  • When watchLabels are set, Discovery includes those labels in the selector passed to Reconcile()
    • See labels added here from d.extraSelectorLabels
    • extraSelectorLabels is set to watchOpts.Selector in StartUDS() here
    • watchOpts.Selector gets the watchLabels directly from Settings here
  • When solo-kit Reconcile() attempts to get the original resource(s) via a List() call here, none are returned because the generated Upstream(s) do not have the labels specified in the selector
  • Because we do not find the "original" Upstream, we do not propagate its ResourceVersion, effectively treating the Update as a Create
    • This would have happened here
  • This results in the observed error, as the write operation attempts to update an existing resource without specifying the resourceVersion (a minimal sketch of this failure mode follows after the list below)

The key question from here is: should Upstreams created by discovery contain the watchLabels, or perhaps all labels from the Service?

  • If yes, the solution should mainly be a matter of setting those when creating Upstreams
  • If no, the solution should mainly be a matter of not including watchLabels in the selector when syncing Upstreams
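
For concreteness, here is a minimal, self-contained Go sketch of the failure mode described above. The type and function names are illustrative assumptions, not the actual solo-kit implementation; it only demonstrates how a List() filtered by a selector that includes the watchLabels misses the discovered Upstream, so the subsequent write carries no resourceVersion:

package main

import "fmt"

// resource is a stand-in for a discovered Upstream; illustrative only.
type resource struct {
	name            string
	labels          map[string]string
	resourceVersion string
}

// list mimics a List() call filtered by a label selector: a resource matches
// only if it carries every label in the selector.
func list(store []resource, selector map[string]string) []resource {
	var out []resource
	for _, r := range store {
		matches := true
		for k, v := range selector {
			if r.labels[k] != v {
				matches = false
				break
			}
		}
		if matches {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// The Upstream previously written by discovery: it carries the
	// discovered_by marker label but NOT the user's watchLabels.
	store := []resource{{
		name:            "default-example-svc-8000",
		labels:          map[string]string{"discovered_by": "kubernetesplugin"},
		resourceVersion: "792",
	}}

	// During a resync the "originals" are listed with a selector that also
	// includes the watchLabels (aaa: bbb), so the existing Upstream is missed.
	selector := map[string]string{
		"discovered_by": "kubernetesplugin",
		"aaa":           "bbb",
	}
	originals := list(store, selector)
	fmt.Println("originals found:", len(originals)) // prints: originals found: 0

	// With no original found, the desired Upstream keeps an empty
	// resourceVersion, and the apiserver rejects the write with:
	// metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
	desired := resource{name: "default-example-svc-8000"}
	fmt.Printf("writing update with resourceVersion=%q\n", desired.resourceVersion)
}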

@bewebi
Contributor

bewebi commented Sep 25, 2024

Having toyed with both options, I found that both seem viable in the sense that they allow resyncs to occur successfully

Removing labels from the UpstreamSelector seems to be the better option for a couple of reasons:

  • Adding labels to Upstreams that don't already have them is a user-facing change in behavior
    • This may or may not be considered a breaking change, but it is likely undesirable to change behavior if it's not requested or necessary
  • Even if we add labels to new Upstreams, errors will continue to occur for existing Upstreams at the time of upgrade, so resolving them would require deleting all pre-existing Upstreams and allowing Discovery to re-populate
    • Even if we provided a script/tool to automate this, I am not sure this would be doable without the potential for downtime

This behavior has been present since this repo was first split out from solo-projects, so there is little context around why this decision was made, and I'm not clear whether the selector was actually ever populated in the opts passed to StartUDS() at that point

Created upstreams do get the discovered_by: kubernetesplugin label, which is selected on and should limit any performance impact incurred by expanding the scope of the selector.
It is possible, though unlikely, that Upstreams not created by Discovery would have that label; but even in this case there would be no additional calls to etcd, just a larger return from a List() call and additional elements to iterate over (which happens once, when converting the list to a map), so any performance impact should be negligible.

Therefore I'm comfortable treating this behavior as a bug and "fixing" it accordingly
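
As a rough sketch of the chosen fix direction, reusing the illustrative resource type from the sketch above (again, hypothetical names, not the actual solo-kit code): the selector used during a resync keeps only the discovery marker label, so the original Upstream is found and its resourceVersion can be propagated onto the desired resource before the write.

// reconcileSelector returns the selector used to list "originals" during a
// resync. After the fix it contains only the discovery marker label; the
// user's watchLabels are no longer mixed in.
func reconcileSelector() map[string]string {
	return map[string]string{"discovered_by": "kubernetesplugin"}
}

// propagateResourceVersion copies the original's resourceVersion onto the
// desired resource so the apiserver accepts the write as a valid update
// rather than rejecting it as a create-shaped update.
func propagateResourceVersion(original, desired resource) resource {
	desired.resourceVersion = original.resourceVersion
	return desired
}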

@bewebi
Contributor

bewebi commented Oct 1, 2024

A fix for this has merged to OSS main and will be backported to all supported LTS branches (i.e. down to 1.14)

@DuncanDoyle, please advise if there is any particular version that will require an OSS and/or EE release ASAP, or if it's sufficient to have this bugfix roll out whenever the next release(s) otherwise occur

@bewebi
Contributor

bewebi commented Oct 4, 2024

A fix for this has now been released in all LTS versions for OSS and EE, as well as the OSS beta:
OSS: 1.18.0-beta24, 1.17.11, 1.16.21, 1.15.32, 1.14.32
EE: 1.17.3, 1.16.15, 1.15.22, 1.14.23
Additionally, this fix will be in EE 1.18.0-beta2.

@bewebi bewebi closed this as completed Oct 4, 2024