Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: Support fetching LFC metrics (but don't use them yet) #895

Merged
merged 4 commits into from
Jun 10, 2024

Conversation

sharnoff
Copy link
Member

@sharnoff sharnoff commented Apr 6, 2024

Probably best to review each commit separately.

Most of the commits here are just repositioning to make the final commit, which adds LFC metrics collection, as simple as possible.

Builds on #892, must not be merged before then.

@sharnoff sharnoff force-pushed the sharnoff/agent-refactor-metrics branch from 4e7edce to bb282a5 Compare April 11, 2024 15:16
Base automatically changed from sharnoff/agent-refactor-metrics to main April 11, 2024 15:26
@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 8f2eb8b to d3fa7a6 Compare April 11, 2024 15:39
@sharnoff sharnoff marked this pull request as ready for review April 11, 2024 15:39
@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from f6217dd to a3f9156 Compare April 11, 2024 16:17
@sharnoff sharnoff requested review from Omrigan and skyzh April 11, 2024 16:37
@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from a3f9156 to 4c461f8 Compare April 14, 2024 22:56
@sharnoff sharnoff changed the base branch from main to sharnoff/rm-custom-Format()s April 14, 2024 23:45
Base automatically changed from sharnoff/rm-custom-Format()s to main April 15, 2024 14:42
@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 4c461f8 to 288f86e Compare April 15, 2024 14:43
Copy link
Contributor

@Omrigan Omrigan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This takes a little bit of time to review for me, as I am not used to this many generics usages. Usually, I would advocate against it in Go, relaying on dynamic dispatch/code duplication. But since we have Rust in the picture, maybe I should get used to it.

LMK if you come up the a way to make it less generic

ctx context.Context,
logger *zap.Logger,
newMetrics func(metrics core.Metrics, withLock func()),
kind string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you have it as a part of an interface on C?

Also, C is metrics names, right, maybe should be Names?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you have it as a part of an interface on C?

Not quite sure what you're referring to here — is it kind?

Also, C is metrics names, right, maybe should be Names?

I was originally thinking we'd want some per-VM configuration for metrics for feature-gating new behavior, but upon reflection, I think implementing that properly is actually more effort than it's worth.

At the same time... I think the semantics are clearer if the interface defines it as "config" for metrics parsing, and then something external provides that "config" - even if it's always just the metric names in practice.

For now, I've added some comments to FromPrometheus (f47097a). LMK what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure what you're referring to here — is it kind?

Yes. Can it be a method returning constant?

I've added some comments to FromPrometheus

But the implementation is actually called SystemMetricNames. IMO, either both type parameter and implementations needs to be called Names, or both Config, even if config only contains names.

UPD: or is it not? It seems I got myself confused in the hierarchy of types, and which ones are generic, and which are not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in MetricsSourceConfig[C] below, C is actually metric names, as a type parameters to a metrics config.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 5dc98a9 satisfies this (in addition to the changes from 4b06d09..0bb2b59 ?

@skyzh
Copy link
Member

skyzh commented Apr 22, 2024

Discussion from the 1-1 meeting: scraping sql_exporter from the autoscaling-agent is the right way to go. I'll add a separate sql_exporter solely for autoscaling metrics. For now, we can just scrape the current one with a reasonable interval.

@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch 3 times, most recently from fe8737b to 24f5c94 Compare April 23, 2024 23:28
@Omrigan Omrigan self-requested a review April 24, 2024 21:09
ctx context.Context,
logger *zap.Logger,
newMetrics func(metrics core.Metrics, withLock func()),
kind string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure what you're referring to here — is it kind?

Yes. Can it be a method returning constant?

I've added some comments to FromPrometheus

But the implementation is actually called SystemMetricNames. IMO, either both type parameter and implementations needs to be called Names, or both Config, even if config only contains names.

UPD: or is it not? It seems I got myself confused in the hierarchy of types, and which ones are generic, and which are not.

skyzh added a commit to neondatabase/neon that referenced this pull request May 2, 2024
As discussed in neondatabase/autoscaling#895, we
want to have a separate sql_exporter for simple metrics to avoid
overload the database because the autoscaling agent needs to scrape at a
higher interval. The new exporter is exposed at port 9499. I didn't do
any testing for this pull request but given it's just a configuration
change I assume this works.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Copy link
Contributor

@Omrigan Omrigan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mostly nits are left, we should discuss them verbally

ctx context.Context,
logger *zap.Logger,
newMetrics func(metrics core.Metrics, withLock func()),
kind string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in MetricsSourceConfig[C] below, C is actually metric names, as a type parameters to a metrics config.

Comment on lines 190 to 197
ec.Add(c.Metrics.LFC.Names.Validate())
erc.Whenf(ec, c.Metrics.LFC.RequestTimeoutSeconds == 0, zeroTmpl, ".metrics.lfc.requestTimeoutSeconds")
erc.Whenf(ec, c.Metrics.LFC.SecondsBetweenRequests == 0, zeroTmpl, ".metrics.lfc.secondsBetweenRequests")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to deduplicate it similar to how I did it in my billing PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - implemented in 86cb9f4 and c1ae8d3 (kept separate so that they can be properly squashed later).

conradludgate pushed a commit to neondatabase/neon that referenced this pull request May 8, 2024
As discussed in neondatabase/autoscaling#895, we
want to have a separate sql_exporter for simple metrics to avoid
overload the database because the autoscaling agent needs to scrape at a
higher interval. The new exporter is exposed at port 9499. I didn't do
any testing for this pull request but given it's just a configuration
change I assume this works.

Signed-off-by: Alex Chi Z <chi@neon.tech>
@sharnoff
Copy link
Member Author

Discussion with @Omrigan : It's fine to just hard-code the metric names, and that makes it a lot simpler. Some discussion about whether to encapsulate the callbacks for metrics collection into a single struct with func(...) fields or an interface.

sharnoff added a commit that referenced this pull request May 24, 2024
There's a lot of places where we have ad-hoc ways to construct pointers
to thigns - either via temporary variables, referencing the first
element in a slice, or by a local function to produce pointers.

We can replace those with lo.ToPtr (like what the control plane already
uses) to help with readability and consistency.

Originally discussed here:
github.com//pull/895#discussion_r1576977602
@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 24f5c94 to 4b06d09 Compare May 27, 2024 03:05
sharnoff added a commit that referenced this pull request May 27, 2024
@sharnoff
Copy link
Member Author

sharnoff commented May 27, 2024

New changes:

  1. Move metric names into the code (rather than config) and remove the generics it required. Diff here: 4b06d09..0bb2b59.
  2. Fixed a bug where we'd always use the "system" metrics port, even for LFC metrics: da48a4c (NOTE: this was from reading the code; I haven't yet tested it)
  3. Switch from unnamed callbacks to a new metricsMgr[M] struct to contain them and the kind string: 5dc98a9
  4. Switch from the emptyMetrics() constructor callback to a generics-based approach: 031cc2d
    • cc @Omrigan particularly want your feedback on this one. I could go either way on this -- left it as a separate commit so it's easy to revert.
  5. Deduplicate metrics config parsing: 86cb9f4 + c1ae8d3

edit: And the first commit has been spun off into a separate PR: #948.

sharnoff added a commit that referenced this pull request May 28, 2024
There's a lot of places where we have ad-hoc ways to construct pointers
to things - either via temporary variables, referencing the first
element in a slice, or by a local function to produce pointers.

We can replace those with lo.ToPtr (like what the control plane already
uses) to help with readability and consistency.

Originally discussed here: #895 (comment)
@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch 2 times, most recently from 666e11f to 34afd60 Compare May 28, 2024 23:52
sharnoff added a commit that referenced this pull request May 29, 2024
Specifically, where we spawn background workers in (*Runner).Run(),
this commit renames the inner function's context and logger:

* 'c' → 'ctx2'
* 'l' → 'logger2'

ref #895 (comment)
sharnoff added a commit that referenced this pull request May 29, 2024
ref #895 (comment)

Co-authored-by: Oleg Vasilev <oleg@neon.tech>
sharnoff added a commit that referenced this pull request May 29, 2024
@sharnoff
Copy link
Member Author

sharnoff commented May 29, 2024

Continuing with the changelog approach, if only because it helps me manage the rebasing:

  1. Reverted 031cc2d ("tmp: Switch from emptyMetrics() callback to generics"): 60f835b
  2. Squashed everything back down: 60f835b..666e11f
  3. Rebased onto main: 666e11f -> 34afd60
  4. Renamed newMetrics -> updateMetrics: cb14daf+b9e3ebd1ea51a7d3730b965d3ecd77c0318776f4
  5. Fix outdated comment on FromPrometheus: 662613d
  6. Renamed the c/l context/logger variables to ctx2/logger2: efdda7c
  7. Remove 5-minute load average from the system metrics names: 589f6b7
  8. Rework system metrics string constants: 66b9289

Plus a couple more changes to #948 based on review here.

Copy link
Contributor

@Omrigan Omrigan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending comments and conflicts resolution.

sharnoff added a commit that referenced this pull request Jun 7, 2024
Specifically, where we spawn background workers in (*Runner).Run(),
this commit renames the inner function's context and logger:

* 'c' → 'ctx2'
* 'l' → 'logger2'

ref #895 (comment)
@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 66b9289 to 20bd2ba Compare June 7, 2024 18:48
sharnoff added a commit that referenced this pull request Jun 7, 2024
Specifically, where we spawn background workers in (*Runner).Run(),
this commit renames the inner function's context and logger:

* 'c' → 'ctx2'
* 'l' → 'logger2'

ref #895 (comment)
@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 20bd2ba to fb2f910 Compare June 7, 2024 18:50
@sharnoff
Copy link
Member Author

sharnoff commented Jun 9, 2024

Final changes:

  1. Squash previous commits. Diff here (empty): 66b9289..20bd2ba
  2. Rebase onto main (dropping api,agent/core: Make required ScalingConfig fields optional per-VM #948 from this branch). Diff here: 20bd2ba..fb2f910
  3. Fix compile error from rebase: 689e18d
  4. Change metricsMgr[M any] -> metricsMgr[M core.FromPrometheus]: d0991a9

Waiting on some issues with latest release. Will hold off on merging for now to avoid accidentally adding much larger changes alongside a small fix.

sharnoff added 4 commits June 10, 2024 10:09
Also, switch from using custom parsing to the proper parser provided by
the prometheus package(s).

Also also, change the global config from specifying LoadPrefix to
instead providing the names of all of the system metrics we need.
Another "make it generic" commit. Changes here are:

1. Add indirection in global config - instead of .Metrics.*, it's now
   .Metrics.System.* (referring to "system" metrics).
2. Rename pkg/agent/core.{Metrics => SystemMetrics} -- and all the
   associated functions that just referred to "metrics", like
   core.State's UpdateMetrics => UpdateSystemMetrics.
3. Make (*agent.Runner).getMetricsLoop() generic over the metrics and
   its configuration -- and because of that, it can't be a method.
It is conditionally enabled as part of api.ScalingConfig, defaulting to
disabled with the current config.

Currently the metrics don't go anywhere - this commit *just* sets up the
piping so we can use it in the scaling algorithm later.
Specifically, where we spawn background workers in (*Runner).Run(),
this commit renames the inner function's context and logger:

* 'c' → 'ctx2'
* 'l' → 'logger2'

ref #895 (comment)
@sharnoff sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from d0991a9 to ca2850a Compare June 10, 2024 17:10
@sharnoff sharnoff enabled auto-merge (rebase) June 10, 2024 17:11
@sharnoff sharnoff merged commit 5a54f6b into main Jun 10, 2024
15 checks passed
@sharnoff sharnoff deleted the sharnoff/agent-lfc-metrics branch June 10, 2024 17:22
sharnoff added a commit that referenced this pull request Jun 10, 2024
Also, switch from using custom parsing to the proper parser provided by
the prometheus package(s).

Also also, change the global config from specifying LoadPrefix to
instead providing the names of all of the system metrics we need.
sharnoff added a commit that referenced this pull request Jun 10, 2024
Another "make it generic" commit. Changes here are:

1. Add indirection in global config - instead of .Metrics.*, it's now
   .Metrics.System.* (referring to "system" metrics).
2. Rename pkg/agent/core.{Metrics => SystemMetrics} -- and all the
   associated functions that just referred to "metrics", like
   core.State's UpdateMetrics => UpdateSystemMetrics.
3. Make (*agent.Runner).getMetricsLoop() generic over the metrics and
   its configuration -- and because of that, it can't be a method.
sharnoff added a commit that referenced this pull request Jun 10, 2024
It is conditionally enabled as part of api.ScalingConfig, defaulting to
disabled with the current config.

Currently the metrics don't go anywhere - this commit *just* sets up the
piping so we can use it in the scaling algorithm later.
sharnoff added a commit that referenced this pull request Jul 7, 2024
Must've forgotten to update #895 after #946 was merged.
Omrigan pushed a commit that referenced this pull request Jul 8, 2024
Must've forgotten to update #895 after #946 was merged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants