Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(discovery): configure sharding every time MetricsHandler.Run runs #2478

Merged

Conversation

wallee94
Copy link
Contributor

@wallee94 wallee94 commented Aug 16, 2024

What this PR does / why we need it:

I see the following issue in a ksm deployment with custom CRD config enabled. Whenever a new CRD event is added to the cache, ksm stops watching and metrics stop reflecting new changes.

image

In the graph, ksm recovers after updating the Statefulset with any change. The metrics are rate(kube_state_metrics_watch_total) and the counter kube_state_metrics_custom_resource_state_add_events_total without rate.

Looking into the code, the problem seems to be the validation shardingUnchanged in AddFunc (here).

Without a CRD config, MetricsHandler.Run runs only once, and the vars m.curShard and m.curTotalShards are initially nil, which makes shardingUnchanged = false (here).

However, if a CRD config is present, discovery runs MetricsHandler.Run every time a CRD event is detected (here). If the Statefulset number of replicas/shards didn't change, the new CRD event will cancel the old metrics handler, but won't initiate a new one because shardingUnchanged = true in AddFunc.

This change removes the check shardingUnchanged in the AddFunc event handler. I don't think it's necessary because, in most cases, it's only called when the informer is synced at the end of MetricsHandler.Run.

This change updates CRDiscoverer.PollForCacheUpdates to rebuild the metrics writers in the already running metrics handler, instead of running a new one every time a CRD event occurs.

How does this change affect the cardinality of KSM:

No change in cardinality.

Which issue(s) this PR fixes:

Fixes #2372

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 16, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @wallee94!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 16, 2024
Signed-off-by: Walther Lee <walthere.lee@gmail.com>
@wallee94 wallee94 force-pushed the remove-shard-validation-on-sts-create branch from 102f391 to dec7a1b Compare August 16, 2024 17:16
@wallee94 wallee94 changed the title fix(discovery) configure sharding every time MetricsHandler.Run runs fix(discovery): configure sharding every time MetricsHandler.Run runs Aug 16, 2024
@logicalhan
Copy link
Member

/assign @CatherineF-dev
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 22, 2024
@CatherineF-dev
Copy link
Contributor

CatherineF-dev commented Aug 24, 2024

Hello, this code has been around for five years. Why is it only now experiencing this issue? Could there be a another underlying problem? If we can figure out how to only Run once inside https://github.com/kubernetes/kube-state-metrics/pull/1851/files, it will be the fix.

			m.mtx.RLock()
			shardingUnchanged := m.curShard == shard && m.curTotalShards == totalShards
			m.mtx.RUnlock()

			if shardingUnchanged {
				return
			}

@wallee94
Copy link
Contributor Author

Sorry, I mentioned it in a thread in the kube-state-metrics Slack channel and forgot to put it here.

The bug isn't exactly in metrics_handler if Run runs only once, which is how it works without CR. The issue is that discovery started running it multiple times in this PR if there's a CR config.

Removing the validation in AddFunc made sense to me to reconfigure every time the index informer is initially synced. I don't think it should depend on the previous m.curShard and m.curTotalShards values of earlier runs.

@CatherineF-dev
Copy link
Contributor

CatherineF-dev commented Aug 24, 2024

The issue is that discovery started running it multiple times in this PR if there's a CR config.

Thanks for spotting this! I am thinking whether we should run it only once.

@wallee94
Copy link
Contributor Author

wallee94 commented Aug 24, 2024

That's a good point, I can look into that. I think m.Run could probably be just m.ConfigureSharding, but discovery doesn't have access to the Statefulset or k8s client to calculate shard and totalShards.

On the other hand, if Run runs only once, the validation in this change is usually false because m.curShard and m.curTotalShards are initially 0.

@wallee94
Copy link
Contributor Author

I've made some changes to use m.ConfigureSharding in discovery and to run m.Run only once. I'm testing it on a test cluster and so far it's working fine. I'll let it run for a few more hours and push it tomorrow if everything looks good

Signed-off-by: Walther Lee <walthere.lee@gmail.com>
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 27, 2024
@wallee94
Copy link
Contributor Author

wallee94 commented Aug 27, 2024

@CatherineF-dev I added new changes to run m.Run only once when the pod starts, and to reconfigure sharding in discovery instead of recreating the IndexInformer. I also removed some of the contexts in PollForCacheUpdates because m.ConfigureSharding already cancels the previous context.

I deployed the change to a few clusters and it seems to be working. This is the watch rate after adding a new CRD to the cluster:
image

I see the event, then a brief drop, which is when ksm is populating the cache after the reconfigure, and then it comes back to normal. All the metrics look good as well.

Signed-off-by: Walther Lee <walthere.lee@gmail.com>
@wallee94 wallee94 force-pushed the remove-shard-validation-on-sts-create branch from 0616572 to 4aced25 Compare September 5, 2024 21:02
Signed-off-by: Walther Lee <walthere.lee@gmail.com>
Signed-off-by: Walther Lee <walthere.lee@gmail.com>
Signed-off-by: Walther Lee <walthere.lee@gmail.com>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 13, 2024
@wallee94 wallee94 requested a review from dgrisonnet September 13, 2024 19:40
@wallee94
Copy link
Contributor Author

wallee94 commented Sep 27, 2024

A summary of changes per file to help with the review:

  • internal/discovery/discovery.go: Use m.BuildWriters(ctx) instead of m.Run(olderContext) to rebuild writers instead of recreating the whole handler. BuildWriters cancels the old context, so we no longer need the cancelations in this file.
  • pkg/metricshandler/metrics_handler.go: adds a function BuildWriters that rebuilds the metrics writers. We use it in discovery.go when the resources in the StoreBuilder have changed.
  • tests/e2e/discovery_test.go: Update to test custom metrics after updating a CRD. I broke this into subfunctions because it failed the cyclomatic complexity linter.

internal/store/builder.go Outdated Show resolved Hide resolved
@wallee94 wallee94 requested a review from mrueg September 30, 2024 16:23
Copy link
Member

@mrueg mrueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small typo in the comments, otherwise looks good to me. Thanks for the contribution!

/lgtm
and
/hold
for others to review.

pkg/metricshandler/metrics_handler.go Outdated Show resolved Hide resolved
pkg/metricshandler/metrics_handler.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2024
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 30, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 30, 2024
@mal-berbatov-ci
Copy link

Is there an ETA for when this will get merged?

@mrueg mrueg added this to the v2.14.0 milestone Oct 8, 2024
@mrueg
Copy link
Member

mrueg commented Oct 8, 2024

/lgtm

I'll still ping the other maintainers to review, currently everyone seems to be busy.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, wallee94

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mrueg
Copy link
Member

mrueg commented Oct 15, 2024

/hold cancel

continuing here as no other reviews came in.

Thanks for the debugging and your contribution!

@mrueg
Copy link
Member

mrueg commented Oct 15, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 15, 2024
@k8s-ci-robot k8s-ci-robot merged commit f3b7593 into kubernetes:main Oct 15, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some kube-state-metrics shards are serving up stale metrics
7 participants