vm-builder: add SQL exporter to vector #878
Conversation
Signed-off-by: Alex Chi Z <chi@neon.tech>
mark ready for review
@skyzh please also add the working set size in pages to the sql_exporter as a metric: `select approximate_working_set_size(false);`
`select approximate_working_set_size(false);` is missing from the sql_exporter metrics.
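For reference, a minimal sketch of how such a metric might be declared in a sql_exporter collector file; the collector name, metric name, and help text are illustrative guesses, not the actual compute configuration:

```yaml
# Sketch only: a hypothetical sql_exporter collector entry for this metric.
# Names and help text are illustrative, not the real neon config.
collector_name: neon_collector
metrics:
  - metric_name: lfc_approximate_working_set_size
    type: gauge
    help: 'Approximate working set size of the local file cache, in pages'
    values: [approximate_working_set_size]
    query: |
      select approximate_working_set_size(false) as approximate_working_set_size;
```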
Do we really require a 1 sec interval for those metrics?
For working set size and hit ratios a much longer interval, like 1 minute or so, would be sufficient.
Scraping so frequently uses resources not just in compute but also in our victoriametrics database.
I'm using the default scrape interval for sql exporters, which is usually 15 secs (the 1 sec interval is only for host information).
I will have a separate pull request on the compute side and make it into the next release.
These metrics are exclusively consumed by the autoscaler-agent and not persisted anywhere. The autoscaler-agent fetches metrics (and potentially makes a scaling decision) every 5 seconds. IMO it's worth not having to worry about whether those metrics are stale; I don't immediately see a reason not to fetch sql-exporter metrics every 5 seconds (or even every 1 second); presumably it's pretty cheap?
There are some aggregations and joins on the system catalog, so if there are a lot of tables, the queries might be slow to execute. A better approach is to split this into two sql exporters: one with metrics that are cheap to scrape, and one with the expensive ones. All LFC metrics are basically O(1) operations when retrieved from the database.
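As an illustration of that split (a sketch only; source names, ports, and intervals are made up, and this is not the actual vm-builder config), vector could scrape the two exporters at different rates:

```yaml
# Sketch: two prometheus_scrape sources in vector, one per sql_exporter
# instance. Source names, ports, and intervals are made up.
sources:
  sql_exporter_cheap:
    type: prometheus_scrape
    endpoints: ["http://localhost:9399/metrics"]  # O(1) LFC metrics
    scrape_interval_secs: 5
  sql_exporter_expensive:
    type: prometheus_scrape
    endpoints: ["http://localhost:9499/metrics"]  # catalog aggregations/joins
    scrape_interval_secs: 60
```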
This pull request will be merged once the compute + console release is done next week. There is still some leftover work from upgrading the compute, and I will ensure that by the time this pull request gets merged, the metrics I put in this vector config will be available from all neon user projects.
thanks for adding the working set size
ref neondatabase/autoscaling#878
ref neondatabase/autoscaling#872

Add `approximate_working_set_size` to sql exporter so that autoscaling can use it in the future.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
Apologies for the delayed review. Took a while to think this over.
High-level question:
I think this can also be implemented by telling the autoscaler-agent another port to fetch metrics from (per VM). AFAICT the trade-offs would be:
- More implementation work required in autoscaler-agent to pass the data around
- The simplest implementation would probably pull every 5 seconds (although it wouldn't be too tricky to have a different scrape interval...)
- Rollout would depend on autoscaling -> cplane release (to add per-VM setting), instead of autoscaling -> compute image release
- We wouldn't have spooky action at a distance w.r.t. sql-exporter split across this repo and neondatabase/neon
What do you think?
Separately, could you prefix "vm-builder: " to the start of this PR title for consistent naming? Helps with organization 🙏
# The data might be delayed as we use the default scrape interval. Some SQLs are not
# trivial to execute, and therefore we don't want them to run every second.
How much work would it be to run this at the same scrape interval? (or at least, <5s?)
I ask because it may make the scaling algorithm tricky (specifically: expecting counters to remain the same, not because nothing's happening, but because metrics are delayed). I'm sure it can be worked around, but it'd be good to weigh the options for tech debt like this.
I think we just need to split this into two sql_exporters (one for quick scrapes, and one for slow scrapes).
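A related option that avoids running two processes (a sketch only, assuming the exporter follows the free/sql_exporter collector format, where `min_interval` caches query results between scrapes; the collector, metric, and query below are hypothetical stand-ins):

```yaml
# Sketch: min_interval rate-limits the underlying SQL, so the exporter can be
# scraped every few seconds while this query runs at most once per minute.
# Collector, metric, and query are hypothetical stand-ins.
collector_name: catalog_stats
min_interval: 60s
metrics:
  - metric_name: db_relation_count
    type: gauge
    help: 'Number of entries in pg_class (stands in for an expensive query)'
    values: [relation_count]
    query: |
      select count(*) as relation_count from pg_class;
```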
Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
Do you mean that the autoscaler-agent can directly query LFC data by logging in to the database using the Postgres protocol + cloud_admin? In that case, do we need to store that data somewhere inside cplane? I thought the reason we have vector.dev is to keep some history of the data and avoid having the autoscaler-agent store it itself, but I could be wrong...
Okay, I just re-read https://github.com/neondatabase/autoscaling/blob/main/ARCHITECTURE.md and I think I had some misunderstanding there. So I feel the best approach is:
can we make this into vm-monitor?
Having the autoscaler-agent fetch the data via SQL is an interesting idea; I hadn't thought of that. I was thinking more along the lines of having it fetch metrics from sql-exporter via prometheus (http). That's already exposed from every VM, so the change would be entirely within the autoscaler-agent.
That feels like smashing abstraction layers. Does it yield any practical advantage over the option below?
This also, but less so than the above. I now see a practical advantage (no buffering) of bypassing vector. Also, the @neondatabase/billing folks say vector is unreliable 🙂 so I guess this is my preference.
I can open a pull request on the compute side to have vm-monitor directly expose the metrics.
I have a demo pull request ready but not tested yet; feel free to leave comments before we finalize the code + the method of exposing the metrics. neondatabase/neon#7302
ref neondatabase/neon#5949
close #872
This pull request makes the LFC metrics available at the vector metrics endpoint. We use the default scrape interval to avoid overloading the database.
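For context on what "available at the vector metrics endpoint" means in practice, a sketch of how vector can re-expose scraped metrics via its prometheus_exporter sink (the sink name and port are made up, and the input names continue the illustrative sources from the earlier sketch):

```yaml
# Sketch: a prometheus_exporter sink re-exposes the scraped metrics on a
# single endpoint. Sink name, port, and input names are illustrative.
sinks:
  metrics_out:
    type: prometheus_exporter
    inputs: ["sql_exporter_cheap", "sql_exporter_expensive"]
    address: "0.0.0.0:9100"
```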