Make metrics LFC number of hits, number of misses and working set size available to autoscaling algorithm #872

stradig · 2024-03-22T09:14:25Z

We want to be able to experiment with the algorithm to see which of those values can improve performance for autoscaled computes.

Tasks

agent: Support fetching LFC metrics (but don't use them yet) #895
vm-image: add sqlexporter for autoscaling metrics neon#7514
neondatabase/cloud#14245
LFC: using rolling hyperloglog for correct working set estimation neon#7466

c/compute t/feature
vm-image: Expose new LFC working set size metrics neon#8298
agent/core: Implement LFC-aware scaling #1003
neondatabase/cloud#14929

skyzh · 2024-03-25T18:41:29Z

Need to investigate how to export data using SQL statements. This does not seem to be supported by vector.dev.

sharnoff · 2024-03-25T18:58:37Z

IIRC the existing metrics are exposed by sql-exporter — I think vector could just pull from there, if we want to expose it via vector.

skyzh · 2024-03-25T19:04:54Z

yep, I found https://vector.dev/docs/reference/configuration/sources/prometheus_scrape/ that directly scrapes exporter data.

ref neondatabase/autoscaling#878
ref neondatabase/autoscaling#872

Add `approximate_working_set_size` to sql exporter so that autoscaling can use it in the future.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Peter Bendel <peterbendel@neon.tech>

Omrigan · 2024-04-11T13:36:37Z

So we have 4 possible ways to go forward:

1. Fetch from vector (vm-builder: add SQL exporter to vector #878)
   - Disadvantage: adds an additional delay between sql-exporter and vector
2. Fetch from sql-exporter (agent: Support fetching LFC metrics (but don't use them yet) #895)
   - Disadvantage: sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s
3. Fetch from vm-monitor (vm-monitor: collect lfc stats from vm-monitor neon#7302 (comment))
   - Disadvantage: one more place to implement working with metrics
4. Fetch directly from postgres
   - Disadvantage: breaks abstraction layers, and somehow needs credentials to be put into the autoscaler-agent

@skyzh @sharnoff sounds correct? Which ones do you prefer?

sharnoff · 2024-04-11T15:09:38Z

My thoughts — I want to avoid adding tech debt by linking together components that weren't previously linked.

Fetch from vector — modifies vector here to support sql-exporter in neondatabase/neon, adding a new link. Also has the downside of repeating metric values because the autoscaler-agent fetch frequency would be greater than vector's refresh frequency.
Fetch from sql-exporter — mostly doesn't add a new link beyond what's required for this issue; the autoscaler-agent already fetches prometheus metrics from the VM. That's why I went with this approach.
Fetch from vm-monitor — adds a new responsibility to vm-monitor, and would also require additional support in the autoscaler-agent. All work done on the autoscaler-agent <-> vm-monitor protocol should be approached with hazmat suits for now. It does what we need it to, but it needs a lot of work, and I'm hesitant to add more responsibilities to it until after some refactoring has taken place.
Fetch directly from postgres — adds a new link between autoscaler-agent and postgres, like you said @Omrigan. And yeah, credentials would be quite tricky, requiring help from other components we don't currently rely on.

re:

> sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s

The current state of #895 is to have a configurable port and frequency — we can fetch as slow as we need to. For the ext-metrics datasources, we already do query every 15s (or maybe even more frequently?). Once a secondary sql-exporter is added with just the cheap metrics, we can e.g. add support for gradual rollout of fetching from a different port, faster, eventually switching everything over once old VMs restart.
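To make the "configurable port and frequency" idea concrete, here is a minimal sketch, not the code from #895. It polls a Prometheus-style `/metrics` endpoint on a ticker and keeps only a few unlabeled samples. The `Config` type, the function names, the `/metrics` path, and the metric names `lfc_hits` and `lfc_misses` are assumptions for illustration; only `lfc_approximate_working_set_size` is named in this thread.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// Config is a hypothetical shape for the configurable port and frequency.
type Config struct {
	Port     uint16
	Interval time.Duration // must be > 0, otherwise time.NewTicker panics
}

// parseMetrics reads Prometheus text exposition format and returns the
// unlabeled samples whose names are in wanted. Labeled series such as
// name{label="x"} are skipped, because their first field is not a bare name.
func parseMetrics(body string, wanted map[string]bool) (map[string]float64, error) {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank lines and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || !wanted[fields[0]] {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return nil, fmt.Errorf("metric %s: %w", fields[0], err)
		}
		out[fields[0]] = v
	}
	return out, sc.Err()
}

// fetch performs one scrape of http://host:port/metrics.
func fetch(ctx context.Context, c *http.Client, host string, port uint16) (string, error) {
	url := fmt.Sprintf("http://%s:%d/metrics", host, port)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

// poll scrapes every cfg.Interval until ctx is cancelled.
func poll(ctx context.Context, c *http.Client, host string, cfg Config,
	wanted map[string]bool, handle func(map[string]float64)) {
	ticker := time.NewTicker(cfg.Interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			body, err := fetch(ctx, c, host, cfg.Port)
			if err != nil {
				continue // a real agent would log and count failures
			}
			if m, err := parseMetrics(body, wanted); err == nil {
				handle(m)
			}
		}
	}
}

func main() {
	// Parse a canned response so the example runs without a live exporter.
	sample := "# TYPE lfc_hits counter\nlfc_hits 1200\nlfc_misses 300\n" +
		"lfc_approximate_working_set_size 4096\nunrelated_metric 1\n"
	wanted := map[string]bool{
		"lfc_hits": true, "lfc_misses": true,
		"lfc_approximate_working_set_size": true,
	}
	m, err := parseMetrics(sample, wanted)
	if err != nil {
		panic(err)
	}
	fmt.Println(m)
}
```

Keeping the interval in configuration rather than in code is what allows fetching "as slow as we need to" from the shared exporter today, and faster once a cheaper exporter exists.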

Omrigan · 2024-04-16T06:24:09Z

@skyzh Can you share your opinion on options 2 vs 3?

skyzh · 2024-04-16T06:52:53Z

If we want to have a second sql-exporter, I'm fine with either option 2 or 3. Otherwise, there needs to be a place to fetch these metrics, and that is easier to do in vm-monitor.

skyzh · 2024-04-16T07:30:00Z

...to be specific, I assume that autoscaling agent will at some point scrape these data at a high frequency, and I don't want these SQLs to be executed when we scrape sql-exporter:

https://github.com/neondatabase/neon/blob/2d5a8462c8093fb7db7e15cea68c6d740818c39c/vm-image-spec.yaml#L161-L188

Therefore, I'm proposing that autoscaling metrics not go into the normal metrics sql-exporter.

sharnoff · 2024-04-22T17:50:22Z

Discussed briefly with @skyzh — tl;dr:

Medium-term, we want to avoid having the autoscaler-agent pull LFC metrics from the main sql-exporter
Short-term:
1. We can have the autoscaler-agent pull metrics from the existing sql-exporter, just with a low frequency so we don't overload postgres
2. We can set up a second sql-exporter to just report LFC metrics
Then, we can have control plane set some annotation on new VMs to tell the autoscaler-agent to fetch LFC metrics with a higher frequency from the new port — giving the desired end state while retaining support for older VMs.

sharnoff · 2024-06-10T15:17:43Z

Status:

agent: Support fetching LFC metrics (but don't use them yet) #895 is ready to merge, just was waiting to avoid interfering with patch release
We found out the new metrics weren't exposed. PR to fix is neondatabase/cloud#14245
Remaining work after that is actually using the metrics (design + implementation of new scaling algorithm, maybe?)

In general, rename `lfc_approximate_working_set_size` to `lfc_approximate_working_set_size_seconds`.

For the "main" metrics that are actually scraped and used internally, the old one is just marked as deprecated. For the "autoscaling" metrics, we're not currently using the old one, so we can get away with just replacing it.

Also, for the user-visible metrics we'll only store & expose a few different time windows, to avoid making the UI overly busy or bloating our internal metrics storage. But for the autoscaling-related scraper, we aren't storing the metrics, and it's useful to be able to programmatically operate on the trendline of how WSS increases (or doesn't!) with window size. So there, we can just output datapoints for each minute.

Part of neondatabase/autoscaling#872. See also #7466.

Part of #872. This builds on the metrics that will be exposed by neondatabase/neon#8298. For now, we only look at the working set size metrics over various time windows. The algorithm is somewhat straightforward to implement (see wss.go), but unfortunately seems to be difficult to understand *why* it's expected to work. See also: https://www.notion.so/neondatabase/874ef1cc942a4e6592434dbe9e609350

Part of #872. This builds on the metrics that will be exposed by neondatabase/neon#8298. For now, we only look at the working set size metrics over various evenly-spaced windows (all 1 minute apart). The algorithm is somewhat straightforward to implement (see wss.go), but unfortunately seems to be difficult to understand *why* it's expected to work. For more context, refer to the RFC here: https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6

In general, replace `lfc_approximate_working_set_size` with `lfc_approximate_working_set_size_windows`.

For the "main" metrics that are actually scraped and used internally, the old one is just marked as deprecated. For the "autoscaling" metrics, we're not currently using the old one, so we can get away with just replacing it.

Also, for the user-visible metrics we'll only store & expose a few different time windows, to avoid making the UI overly busy or bloating our internal metrics storage. But for the autoscaling-related scraper, we aren't storing the metrics, and it's useful to be able to programmatically operate on the trendline of how WSS increases (or doesn't!) with window size. So there, we can just output datapoints for each minute.

Part of neondatabase/autoscaling#872. See also https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
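As an illustration of what "operating on the trendline" can mean, the sketch below takes one WSS estimate per window (1 minute, 2 minutes, and so on) and looks for the point where growing the window stops growing the estimate. This is not the algorithm in wss.go; the function names and the threshold rule are assumptions chosen only to show the shape of the data.

```go
package main

import "fmt"

// growth returns the increase in estimated working set size between each
// pair of consecutive windows: growth[i] = wss[i+1] - wss[i].
func growth(wss []float64) []float64 {
	if len(wss) < 2 {
		return nil
	}
	d := make([]float64, len(wss)-1)
	for i := range d {
		d[i] = wss[i+1] - wss[i]
	}
	return d
}

// flatFrom returns the smallest index i such that every step from wss[i]
// onward grows by at most threshold. It returns -1 when the estimate is
// still growing at the largest window, or when there are fewer than two
// datapoints.
func flatFrom(wss []float64, threshold float64) int {
	d := growth(wss)
	idx := -1
	for i := len(d) - 1; i >= 0; i-- {
		if d[i] > threshold {
			break
		}
		idx = i
	}
	return idx
}

func main() {
	// Hypothetical datapoints: wss[i] is the estimate over an (i+1)-minute
	// window, in whatever unit the metric reports.
	wss := []float64{100, 180, 200, 202, 203}

	fmt.Println(growth(wss))
	fmt.Println(flatFrom(wss, 5))
}
```

Per-minute datapoints make this kind of analysis possible, whereas the handful of windows kept for the user-visible metrics would be too coarse to locate where the curve flattens.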

Part of #872. This builds on the metrics exposed by neondatabase/neon#8298. For now, we only look at the working set size metrics over various time windows. The algorithm is somewhat straightforward to implement (see wss.go), but unfortunately seems to be difficult to understand *why* it's expected to work. See also: https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6

sharnoff · 2024-08-09T00:12:54Z

Earlier this week, LFC-aware scaling was completely rolled out to all regions. Closing this :)

This was referenced Mar 25, 2024

vm-builder: add SQL exporter to vector #878

Closed

neonvm: add LFC approximate working set size to metrics neondatabase/neon#7252

Merged

Omrigan assigned skyzh and sharnoff Apr 16, 2024

sharnoff mentioned this issue May 22, 2024

agent/core: Dependency-inject ScalingAlgorithm #737

Closed

This was referenced Jul 3, 2024

[DO NOT MERGE] Temporary PR for testing on top of #8068 neondatabase/neon#8243

Closed

vm-image: Expose new LFC working set size metrics neondatabase/neon#8298

Merged

sharnoff mentioned this issue Jul 7, 2024

agent/core: Implement LFC-aware scaling #1003

Merged

sharnoff mentioned this issue Jul 18, 2024

Epic: LFC-aware scaling follow-ups #1011

Open

sharnoff closed this as completed Aug 9, 2024
