agent: Support fetching LFC metrics (but don't use them yet) #895

sharnoff · 2024-04-06T03:45:59Z

Probably best to review each commit separately.

Most of the commits here are just repositioning to make the final commit, which adds LFC metrics collection, as simple as possible.

~~Builds on #892, must not be merged before then.~~

Omrigan

This takes a little bit of time to review for me, as I am not used to this many generics usages. Usually, I would advocate against it in Go, relaying on dynamic dispatch/code duplication. But since we have Rust in the picture, maybe I should get used to it.

LMK if you come up the a way to make it less generic

pkg/agent/core/metrics.go

pkg/api/vminfo.go

pkg/agent/runner.go

Omrigan · 2024-04-15T19:42:40Z

pkg/agent/runner.go

 	ctx context.Context,
 	logger *zap.Logger,
-	newMetrics func(metrics core.Metrics, withLock func()),
+	kind string,


Can you have it as a part of an interface on C?

Also, C is metrics names, right, maybe should be Names?

Can you have it as a part of an interface on C?

Not quite sure what you're referring to here — is it kind?

Also, C is metrics names, right, maybe should be Names?

I was originally thinking we'd want some per-VM configuration for metrics for feature-gating new behavior, but upon reflection, I think implementing that properly is actually more effort than it's worth.

At the same time... I think the semantics are clearer if the interface defines it as "config" for metrics parsing, and then something external provides that "config" - even if it's always just the metric names in practice.

For now, I've added some comments to FromPrometheus (f47097a). LMK what you think.

Not quite sure what you're referring to here — is it kind?

Yes. Can it be a method returning constant?

I've added some comments to FromPrometheus

But the implementation is actually called SystemMetricNames. IMO, either both type parameter and implementations needs to be called Names, or both Config, even if config only contains names.

UPD: or is it not? It seems I got myself confused in the hierarchy of types, and which ones are generic, and which are not.

Yes, in MetricsSourceConfig[C] below, C is actually metric names, as a type parameters to a metrics config.

I think 5dc98a9 satisfies this (in addition to the changes from 4b06d09..0bb2b59 ?

pkg/agent/runner.go

skyzh · 2024-04-22T17:47:57Z

Discussion from the 1-1 meeting: scraping sql_exporter from the autoscaling-agent is the right way to go. I'll add a separate sql_exporter solely for autoscaling metrics. For now, we can just scrape the current one with a reasonable interval.

pkg/agent/runner.go

Omrigan · 2024-04-26T10:43:06Z

pkg/agent/runner.go

 	ctx context.Context,
 	logger *zap.Logger,
-	newMetrics func(metrics core.Metrics, withLock func()),
+	kind string,


Not quite sure what you're referring to here — is it kind?

Yes. Can it be a method returning constant?

I've added some comments to FromPrometheus

But the implementation is actually called SystemMetricNames. IMO, either both type parameter and implementations needs to be called Names, or both Config, even if config only contains names.

UPD: or is it not? It seems I got myself confused in the hierarchy of types, and which ones are generic, and which are not.

pkg/api/vminfo.go

pkg/agent/runner.go

pkg/agent/config.go

As discussed in neondatabase/autoscaling#895, we want to have a separate sql_exporter for simple metrics to avoid overload the database because the autoscaling agent needs to scrape at a higher interval. The new exporter is exposed at port 9499. I didn't do any testing for this pull request but given it's just a configuration change I assume this works. Signed-off-by: Alex Chi Z <chi@neon.tech>

Omrigan

I think mostly nits are left, we should discuss them verbally

Omrigan · 2024-05-05T14:54:43Z

pkg/agent/runner.go

 	ctx context.Context,
 	logger *zap.Logger,
-	newMetrics func(metrics core.Metrics, withLock func()),
+	kind string,


Yes, in MetricsSourceConfig[C] below, C is actually metric names, as a type parameters to a metrics config.

pkg/agent/config.go

Omrigan · 2024-05-05T14:59:05Z

pkg/agent/config.go

+	ec.Add(c.Metrics.LFC.Names.Validate())
+	erc.Whenf(ec, c.Metrics.LFC.RequestTimeoutSeconds == 0, zeroTmpl, ".metrics.lfc.requestTimeoutSeconds")
+	erc.Whenf(ec, c.Metrics.LFC.SecondsBetweenRequests == 0, zeroTmpl, ".metrics.lfc.secondsBetweenRequests")


Do you want to deduplicate it similar to how I did it in my billing PR?

Yep - implemented in 86cb9f4 and c1ae8d3 (kept separate so that they can be properly squashed later).

As discussed in neondatabase/autoscaling#895, we want to have a separate sql_exporter for simple metrics to avoid overload the database because the autoscaling agent needs to scrape at a higher interval. The new exporter is exposed at port 9499. I didn't do any testing for this pull request but given it's just a configuration change I assume this works. Signed-off-by: Alex Chi Z <chi@neon.tech>

sharnoff · 2024-05-22T14:43:58Z

Discussion with @Omrigan : It's fine to just hard-code the metric names, and that makes it a lot simpler. Some discussion about whether to encapsulate the callbacks for metrics collection into a single struct with func(...) fields or an interface.

There's a lot of places where we have ad-hoc ways to construct pointers to thigns - either via temporary variables, referencing the first element in a slice, or by a local function to produce pointers. We can replace those with lo.ToPtr (like what the control plane already uses) to help with readability and consistency. Originally discussed here: github.com//pull/895#discussion_r1576977602

ref #895 (comment)

sharnoff · 2024-05-27T04:27:59Z

New changes:

Move metric names into the code (rather than config) and remove the generics it required. Diff here: 4b06d09..0bb2b59.
Fixed a bug where we'd always use the "system" metrics port, even for LFC metrics: da48a4c (NOTE: this was from reading the code; I haven't yet tested it)
Switch from unnamed callbacks to a new metricsMgr[M] struct to contain them and the kind string: 5dc98a9
Switch from the emptyMetrics() constructor callback to a generics-based approach: 031cc2d
- cc @Omrigan particularly want your feedback on this one. I could go either way on this -- left it as a separate commit so it's easy to revert.
Deduplicate metrics config parsing: 86cb9f4 + c1ae8d3

edit: And the first commit has been spun off into a separate PR: #948.

pkg/agent/runner.go

pkg/api/vminfo.go

pkg/agent/runner.go

pkg/agent/core/metrics.go

There's a lot of places where we have ad-hoc ways to construct pointers to things - either via temporary variables, referencing the first element in a slice, or by a local function to produce pointers. We can replace those with lo.ToPtr (like what the control plane already uses) to help with readability and consistency. Originally discussed here: #895 (comment)

Specifically, where we spawn background workers in (*Runner).Run(), this commit renames the inner function's context and logger: * 'c' → 'ctx2' * 'l' → 'logger2' ref #895 (comment)

ref #895 (comment) Co-authored-by: Oleg Vasilev <oleg@neon.tech>

ref #895 (comment)

sharnoff · 2024-05-29T00:19:43Z

Continuing with the changelog approach, if only because it helps me manage the rebasing:

Reverted 031cc2d ("tmp: Switch from emptyMetrics() callback to generics"): 60f835b
Squashed everything back down: 60f835b..666e11f
Rebased onto main: 666e11f -> 34afd60
Renamed newMetrics -> updateMetrics: cb14daf+b9e3ebd1ea51a7d3730b965d3ecd77c0318776f4
Fix outdated comment on FromPrometheus: 662613d
Renamed the c/l context/logger variables to ctx2/logger2: efdda7c
Remove 5-minute load average from the system metrics names: 589f6b7
Rework system metrics string constants: 66b9289

Plus a couple more changes to #948 based on review here.

Omrigan

LGTM, pending comments and conflicts resolution.

pkg/agent/core/metrics.go

pkg/agent/runner.go

Specifically, where we spawn background workers in (*Runner).Run(), this commit renames the inner function's context and logger: * 'c' → 'ctx2' * 'l' → 'logger2' ref #895 (comment)

sharnoff · 2024-06-09T03:54:25Z

Final changes:

Squash previous commits. Diff here (empty): 66b9289..20bd2ba
Rebase onto main (dropping api,agent/core: Make required ScalingConfig fields optional per-VM #948 from this branch). Diff here: 20bd2ba..fb2f910
Fix compile error from rebase: 689e18d
Change metricsMgr[M any] -> metricsMgr[M core.FromPrometheus]: d0991a9

Waiting on some issues with latest release. Will hold off on merging for now to avoid accidentally adding much larger changes alongside a small fix.

Also, switch from using custom parsing to the proper parser provided by the prometheus package(s). Also also, change the global config from specifying LoadPrefix to instead providing the names of all of the system metrics we need.

Another "make it generic" commit. Changes here are: 1. Add indirection in global config - instead of .Metrics.*, it's now .Metrics.System.* (referring to "system" metrics). 2. Rename pkg/agent/core.{Metrics => SystemMetrics} -- and all the associated functions that just referred to "metrics", like core.State's UpdateMetrics => UpdateSystemMetrics. 3. Make (*agent.Runner).getMetricsLoop() generic over the metrics and its configuration -- and because of that, it can't be a method.

It is conditionally enabled as part of api.ScalingConfig, defaulting to disabled with the current config. Currently the metrics don't go anywhere - this commit *just* sets up the piping so we can use it in the scaling algorithm later.

Specifically, where we spawn background workers in (*Runner).Run(), this commit renames the inner function's context and logger: * 'c' → 'ctx2' * 'l' → 'logger2' ref #895 (comment)

Also, switch from using custom parsing to the proper parser provided by the prometheus package(s). Also also, change the global config from specifying LoadPrefix to instead providing the names of all of the system metrics we need.

Another "make it generic" commit. Changes here are: 1. Add indirection in global config - instead of .Metrics.*, it's now .Metrics.System.* (referring to "system" metrics). 2. Rename pkg/agent/core.{Metrics => SystemMetrics} -- and all the associated functions that just referred to "metrics", like core.State's UpdateMetrics => UpdateSystemMetrics. 3. Make (*agent.Runner).getMetricsLoop() generic over the metrics and its configuration -- and because of that, it can't be a method.

It is conditionally enabled as part of api.ScalingConfig, defaulting to disabled with the current config. Currently the metrics don't go anywhere - this commit *just* sets up the piping so we can use it in the scaling algorithm later.

Must've forgotten to update #895 after #946 was merged.

sharnoff mentioned this pull request Apr 6, 2024

Make metrics LFC number of hits, number of misses and working set size available to autoscaling algorithm #872

Closed

sharnoff force-pushed the sharnoff/agent-refactor-metrics branch from 4e7edce to bb282a5 Compare April 11, 2024 15:16

Base automatically changed from sharnoff/agent-refactor-metrics to main April 11, 2024 15:26

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 8f2eb8b to d3fa7a6 Compare April 11, 2024 15:39

sharnoff marked this pull request as ready for review April 11, 2024 15:39

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from f6217dd to a3f9156 Compare April 11, 2024 16:17

sharnoff requested review from Omrigan and skyzh April 11, 2024 16:37

sharnoff mentioned this pull request Apr 14, 2024

agent,api: Remove custom Format()s for complex types #902

Merged

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from a3f9156 to 4c461f8 Compare April 14, 2024 22:56

sharnoff changed the base branch from main to sharnoff/rm-custom-Format()s April 14, 2024 23:45

Base automatically changed from sharnoff/rm-custom-Format()s to main April 15, 2024 14:42

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 4c461f8 to 288f86e Compare April 15, 2024 14:43

Omrigan reviewed Apr 15, 2024

View reviewed changes

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch 3 times, most recently from fe8737b to 24f5c94 Compare April 23, 2024 23:28

Omrigan self-requested a review April 24, 2024 21:09

skyzh mentioned this pull request Apr 25, 2024

vm-image: add sqlexporter for autoscaling metrics neondatabase/neon#7514

Merged

5 tasks

Omrigan reviewed Apr 26, 2024

View reviewed changes

hlinnaka mentioned this pull request May 5, 2024

Expose LSN and replication delay as metrics neondatabase/neon#7610

Merged

Omrigan reviewed May 5, 2024

View reviewed changes

sharnoff mentioned this pull request May 24, 2024

neonvm-controller: introduce unit tests #935

Merged

sharnoff mentioned this pull request May 24, 2024

Use github.com/samber/lo.ToPtr everywhere #946

Merged

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 24f5c94 to 4b06d09 Compare May 27, 2024 03:05

sharnoff added a commit that referenced this pull request May 27, 2024

fix capitalization

bba199c

ref #895 (comment)

Omrigan reviewed May 27, 2024

View reviewed changes

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch 2 times, most recently from 666e11f to 34afd60 Compare May 28, 2024 23:52

sharnoff added a commit that referenced this pull request May 29, 2024

s/requireAllRequiredFields/requireAll/

d77b97c

ref #895 (comment) Co-authored-by: Oleg Vasilev <oleg@neon.tech>

sharnoff added a commit that referenced this pull request May 29, 2024

use lo.ToPtr

f9ed5d2

ref #895 (comment)

Omrigan approved these changes Jun 1, 2024

View reviewed changes

pkg/agent/core/metrics.go Show resolved Hide resolved

pkg/agent/runner.go Show resolved Hide resolved

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 66b9289 to 20bd2ba Compare June 7, 2024 18:48

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from 20bd2ba to fb2f910 Compare June 7, 2024 18:50

sharnoff added 4 commits June 10, 2024 10:09

agent: Make metrics parsing generic (#895)

a214f2e

Also, switch from using custom parsing to the proper parser provided by the prometheus package(s). Also also, change the global config from specifying LoadPrefix to instead providing the names of all of the system metrics we need.

agent/runner: Rename single-letter context/logger variables (#895)

ca2850a

Specifically, where we spawn background workers in (*Runner).Run(), this commit renames the inner function's context and logger: * 'c' → 'ctx2' * 'l' → 'logger2' ref #895 (comment)

sharnoff force-pushed the sharnoff/agent-lfc-metrics branch from d0991a9 to ca2850a Compare June 10, 2024 17:10

sharnoff enabled auto-merge (rebase) June 10, 2024 17:11

sharnoff merged commit 5a54f6b into main Jun 10, 2024
15 checks passed

sharnoff deleted the sharnoff/agent-lfc-metrics branch June 10, 2024 17:22

sharnoff added a commit that referenced this pull request Jul 7, 2024

api: Use lo.ToPtr instead of &[]bool{...}[0]

461d8b2

Must've forgotten to update #895 after #946 was merged.

sharnoff mentioned this pull request Jul 7, 2024

api: Use lo.ToPtr instead of &[]bool{...}[0] #1002

Merged

Omrigan pushed a commit that referenced this pull request Jul 8, 2024

api: Use lo.ToPtr instead of &[]bool{...}[0] (#1002)

cc56e75

Must've forgotten to update #895 after #946 was merged.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agent: Support fetching LFC metrics (but don't use them yet) #895

agent: Support fetching LFC metrics (but don't use them yet) #895

sharnoff commented Apr 6, 2024 •

edited

Loading

Omrigan left a comment

Omrigan Apr 15, 2024

sharnoff Apr 23, 2024

Omrigan Apr 26, 2024

Omrigan May 5, 2024

sharnoff May 27, 2024

skyzh commented Apr 22, 2024

Omrigan Apr 26, 2024

Omrigan left a comment

Omrigan May 5, 2024

Omrigan May 5, 2024

sharnoff May 27, 2024

sharnoff commented May 22, 2024

sharnoff commented May 27, 2024 •

edited

Loading

sharnoff commented May 29, 2024 •

edited

Loading

Omrigan left a comment

sharnoff commented Jun 9, 2024

agent: Support fetching LFC metrics (but don't use them yet) #895

agent: Support fetching LFC metrics (but don't use them yet) #895

Conversation

sharnoff commented Apr 6, 2024 • edited Loading

Omrigan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

skyzh commented Apr 22, 2024

Choose a reason for hiding this comment

Omrigan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sharnoff commented May 22, 2024

sharnoff commented May 27, 2024 • edited Loading

sharnoff commented May 29, 2024 • edited Loading

Omrigan left a comment

Choose a reason for hiding this comment

sharnoff commented Jun 9, 2024

sharnoff commented Apr 6, 2024 •

edited

Loading

sharnoff commented May 27, 2024 •

edited

Loading

sharnoff commented May 29, 2024 •

edited

Loading