Make LFC metrics (number of hits, number of misses, and working set size) available to the autoscaling algorithm #872
Comments
Need to investigate how to export data using SQL statements. This does not seem to be supported by vector.dev.
IIRC the existing metrics are exposed by sql-exporter. I think vector could just pull from there, if we want to expose it via vector.
yep, I found https://vector.dev/docs/reference/configuration/sources/prometheus_scrape/ that directly scrapes exporter data.
ref neondatabase/autoscaling#878, ref neondatabase/autoscaling#872. Add `approximate_working_set_size` to sql exporter so that autoscaling can use it in the future. Signed-off-by: Alex Chi Z <chi@neon.tech>. Co-authored-by: Peter Bendel <peterbendel@neon.tech>.
So we have 4 possible ways to go forward:
My thoughts — I want to avoid adding tech debt by linking together components that weren't previously linked.
re:
The current state of #895 is to have a configurable port and frequency, so we can fetch as slowly as we need to. For the ext-metrics datasources, we already query every 15s (or maybe even more frequently?). Once a secondary sql-exporter is added with just the cheap metrics, we can e.g. add support for a gradual rollout of fetching from a different port at a higher frequency, eventually switching everything over once old VMs restart.
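A minimal sketch of what "configurable port and frequency" could look like on the scraping side. This is not the code from #895: `ScrapeConfig`, `scrapeOnce`, `scrapeLoop`, the `/metrics` path, and the port number in `main` are all hypothetical names and values chosen for illustration.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"strconv"
	"time"
)

// ScrapeConfig is a hypothetical config type; the field names are assumptions.
type ScrapeConfig struct {
	Port     uint16        // port the (secondary) sql-exporter listens on
	Interval time.Duration // how often to fetch; must be > 0 for scrapeLoop
}

// URL builds the scrape URL for a VM. The "/metrics" path is an assumption.
func (c ScrapeConfig) URL(host string) string {
	return fmt.Sprintf("http://%s/metrics", net.JoinHostPort(host, strconv.Itoa(int(c.Port))))
}

// scrapeOnce performs a single GET and returns the response body as text.
func scrapeOnce(ctx context.Context, client *http.Client, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// scrapeLoop fetches on every tick until ctx is cancelled and passes each
// result (or error) to handle.
func scrapeLoop(ctx context.Context, cfg ScrapeConfig, host string, handle func(body string, err error)) {
	client := &http.Client{Timeout: cfg.Interval}
	ticker := time.NewTicker(cfg.Interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			handle(scrapeOnce(ctx, client, cfg.URL(host)))
		}
	}
}

func main() {
	// Example values only; the real port and interval would come from config.
	cfg := ScrapeConfig{Port: 9499, Interval: 5 * time.Second}
	fmt.Println(cfg.URL("10.0.0.1"))
}
```

With both values in config, a gradual rollout could amount to handing different `ScrapeConfig` values to different VMs, though how that is wired up is left open here.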
@skyzh Can you share your opinion on options 2 vs 3?
If we want to have a second sql-exporter, I'm fine with either option 2 or 3. Otherwise, there needs to be a place to fetch these metrics, and it is easier for that to happen in vm-monitor.
...to be specific, I assume that the autoscaling agent will at some point scrape this data at a high frequency, and I don't want these SQL queries to be executed when we scrape sql-exporter. Therefore, I'm proposing that the autoscaling metrics not go into the normal metrics sql-exporter.
Discussed briefly with @skyzh — tl;dr:
Status:
In general, rename lfc_approximate_working_set_size to lfc_approximate_working_set_size_seconds. For the "main" metrics that are actually scraped and used internally, the old one is just marked as deprecated. For the "autoscaling" metrics, we're not currently using the old one, so we can get away with just replacing it. Also, for the user-visible metrics we'll only store & expose a few different time windows, to avoid making the UI overly busy or bloating our internal metrics storage. But for the autoscaling-related scraper, we aren't storing the metrics, and it's useful to be able to programmatically operate on the trendline of how WSS increases (or doesn't!) with window size. So there, we can just output datapoints for each minute. Part of neondatabase/autoscaling#872. See also #7466.
Part of #872. This builds on the metrics that will be exposed by neondatabase/neon#8298. For now, we only look at the working set size metrics over various time windows. The algorithm is somewhat straightforward to implement (see wss.go), but unfortunately it seems difficult to understand *why* it's expected to work. See also: https://www.notion.so/neondatabase/874ef1cc942a4e6592434dbe9e609350
Part of #872. This builds on the metrics that will be exposed by neondatabase/neon#8298. For now, we only look at the working set size metrics over various evenly-spaced windows (all 1 minute apart). The algorithm is somewhat straightforward to implement (see wss.go), but unfortunately it seems difficult to understand *why* it's expected to work. For more context, refer to the RFC here: https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
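To make the idea of operating on evenly-spaced working set size datapoints more concrete, here is a deliberately simple plateau heuristic. It is not the algorithm in wss.go, whose details are not described in this thread; `estimateWSS`, `window`, and `maxIncreaseFactor` are hypothetical names, and the approach (return the first value after which the series stops growing meaningfully) is only one possible way to read such a trendline.

```go
package main

import "fmt"

// estimateWSS scans a series of working set sizes, one per evenly-spaced
// time window (e.g. 1 minute apart), and returns the first value series[i]
// for which the value `window` datapoints later has grown by at most
// maxIncreaseFactor. If the series never flattens, it falls back to the
// last (largest-window) value. An empty series yields 0.
//
// Illustrative heuristic only; not the implementation from wss.go.
func estimateWSS(series []float64, window int, maxIncreaseFactor float64) float64 {
	if len(series) == 0 {
		return 0
	}
	if window < 1 {
		window = 1
	}
	for i := 0; i+window < len(series); i++ {
		if series[i+window] <= series[i]*maxIncreaseFactor {
			return series[i]
		}
	}
	return series[len(series)-1]
}

func main() {
	// Made-up example values (in pages): fast growth, then a plateau.
	series := []float64{10, 50, 90, 100, 101, 102, 102}
	fmt.Println(estimateWSS(series, 2, 1.1))
}
```

The intuition behind a heuristic like this is that widening the time window past the true working set should add few new pages, so the curve flattens. Whether that matches the reasoning in the RFC is not something this sketch claims.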
In general, replace 'lfc_approximate_working_set_size' with 'lfc_approximate_working_set_size_windows'. For the "main" metrics that are actually scraped and used internally, the old one is just marked as deprecated. For the "autoscaling" metrics, we're not currently using the old one, so we can get away with just replacing it. Also, for the user-visible metrics we'll only store & expose a few different time windows, to avoid making the UI overly busy or bloating our internal metrics storage. But for the autoscaling-related scraper, we aren't storing the metrics, and it's useful to be able to programmatically operate on the trendline of how WSS increases (or doesn't!) with window size. So there, we can just output datapoints for each minute. Part of neondatabase/autoscaling#872. See also https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
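As a rough illustration of consuming per-minute datapoints on the scraper side, the sketch below pulls the windows metric out of Prometheus text-format output and orders the values by window size. The metric name comes from the message above; the `duration_seconds` label name, the sample values, and the `parseWSSWindows` helper are assumptions made for this example, and the label parsing is intentionally naive (it does not handle commas inside label values).

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

const metricName = "lfc_approximate_working_set_size_windows"

// windowLabel is a hypothetical label name used only for this sketch.
const windowLabel = "duration_seconds"

// parseWSSWindows extracts the metric's samples from Prometheus text-format
// output and returns the values sorted by ascending window size.
func parseWSSWindows(body string) ([]float64, error) {
	byWindow := map[int]float64{}
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, metricName+"{") {
			continue // comments, blank lines, and other metrics
		}
		closeIdx := strings.Index(line, "}")
		if closeIdx < 0 {
			return nil, fmt.Errorf("malformed line: %q", line)
		}
		window := -1
		labels := line[len(metricName)+1 : closeIdx]
		for _, kv := range strings.Split(labels, ",") {
			k, v, ok := strings.Cut(strings.TrimSpace(kv), "=")
			if !ok || k != windowLabel {
				continue
			}
			n, err := strconv.Atoi(strings.Trim(v, `"`))
			if err != nil {
				return nil, fmt.Errorf("bad window %q: %w", v, err)
			}
			window = n
		}
		if window < 0 {
			return nil, fmt.Errorf("missing %s label: %q", windowLabel, line)
		}
		fields := strings.Fields(line[closeIdx+1:])
		if len(fields) == 0 {
			return nil, fmt.Errorf("missing value: %q", line)
		}
		val, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value %q: %w", fields[0], err)
		}
		byWindow[window] = val
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	windows := make([]int, 0, len(byWindow))
	for w := range byWindow {
		windows = append(windows, w)
	}
	sort.Ints(windows)
	series := make([]float64, len(windows))
	for i, w := range windows {
		series[i] = byWindow[w]
	}
	return series, nil
}

func main() {
	// Made-up sample output; real values and labels may differ.
	sample := metricName + `{duration_seconds="120"} 150` + "\n" +
		metricName + `{duration_seconds="60"} 100` + "\n"
	series, err := parseWSSWindows(sample)
	fmt.Println(series, err)
}
```

Because nothing is stored, the resulting slice can be handed directly to whatever trendline logic the scraper applies.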
Part of #872. This builds on the metrics exposed by neondatabase/neon#8298. For now, we only look at the working set size metrics over various time windows. The algorithm is somewhat straightforward to implement (see wss.go), but unfortunately it seems difficult to understand *why* it's expected to work. See also: https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
Earlier this week, LFC-aware scaling was completely rolled out to all regions. Closing this :)
We want to be able to experiment with the algorithm to see which of those values can improve performance for autoscaled computes.