Varnish 4 backend UUID causing metric churn #8
Looks like I'm splitting the UUID quite naively here (as noted by the comment :) https://github.com/jonnenauha/prometheus_varnish_exporter/blob/master/prometheus.go#L188-L204 It's assuming the ident always splits cleanly into a server and a backend part.

When that fails it will fall back to putting a generic `id` label on the metric. I can fix this so that you get the `backend` and `server` labels instead. But I don't think this fixes the churn itself: the old backend names are still reported by varnishstat, and I believe their values (at least some) are nuked to zero. This might cause Prometheus to say the data is inconsistent (e.g. some attribute that can only increment suddenly goes to zero). Maybe I need to somehow detect the ghost backends and drop them on the floor.

What do you mean by "churn"? The fact that it logs these errors in Prometheus? Could you paste here, or send me in a private message, the scrape output?
Indeed, also in this case I don't think the server/backend split would be possible anyway, since only the backend part is there. I'm not quite sure why that's the case though.
Yeah I think so too. Now that I think of it, it might be related to the VCL reload.
The reset to zero should be fine for Prometheus to handle though, e.g. for counters it is supposed to happen.
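For context, Prometheus compensates for counter resets inside rate windows: any decrease in a counter is treated as a reset. A hedged example query (the metric name here is an assumption, not taken from this exporter):

```
# rate() treats a drop to zero as a counter reset and keeps the
# per-second rate continuous across it
rate(varnish_main_client_req[5m])
```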
I'm referring to the fact that Prometheus will create new metrics each time the UUID changes even though they refer to the same backend underneath. IOW we're not interested in tracking the UUID as part of the label value as that would cause too much cardinality for the backend metrics.
For sure, you can find two dumps of varnish_exporter for two different varnish instances on the same machine: https://phabricator.wikimedia.org/P4478 and https://phabricator.wikimedia.org/P4477.
Ok, I think I got it now. https://github.com/prometheus/client_golang/blob/3fb8ace93bc4ccddea55af62320c2fd109252880/prometheus/registry.go#L704-L707 This code hashes the metric name and labels; if the label set changes, it will find a previously scraped hash for the same metric name with different labels and flag the scrape as inconsistent.

Now, looking at your link https://phabricator.wikimedia.org/T150479 that has the errors, I noticed that both label variants are present in them: metrics with `backend` + `server` labels and metrics with the generic `id` label.
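A loose sketch of that hashing idea (this is not the actual client_golang code, just an illustration of why two different label sets for one metric name collide):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// metricHash mimics, in spirit, what the client_golang registry does:
// it hashes the metric name together with its label names, so the same
// metric name with a different label set produces a different hash and
// gets flagged as inconsistent at scrape time.
func metricHash(name string, labelNames []string) uint64 {
	sort.Strings(labelNames) // label order must not matter
	h := fnv.New64a()
	h.Write([]byte(name))
	for _, ln := range labelNames {
		h.Write([]byte{0}) // separator between name and labels
		h.Write([]byte(ln))
	}
	return h.Sum64()
}

func main() {
	a := metricHash("varnish_backend_beresp_bodybytes", []string{"backend", "server"})
	b := metricHash("varnish_backend_beresp_bodybytes", []string{"id"})
	fmt.Println(a != b) // different label sets hash differently
}
```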
My code is able to get the `backend` + `server` labels if there is no prefix before the UUID. The prefix looks to be something that varnish injects, and for you I believe the idents have such a prefix, which my parsing did not handle.

Fixes: I already made the backend+server label detection more robust, handling the injected prefix before the UUID. If neither of the forms known to me can be detected, it will still inject both labels, to keep the label set consistent between scrapes: I will simply make the whole ident string the `backend` label and set `server` to `unknown`.

The churn issue is a separate one. I will look into it once I'm done with this. As Varnish does report the old backends (forever, if not restarted?), even with the correct detection you will be getting a lot of noise. I'm thinking that I could add a startup param to drop the UUID part from the labels. But if the second label is dropped, I believe the old and the new backend (after VCL reload) would generate identical metrics, and this might also be an error on the varnish side. If this is the case I'll need to somehow detect "ghost" backends and drop them out completely. I'll try to get back to you on this.
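As a sketch of the fallback behaviour described above (hypothetical helper names; the ident shapes are assumptions based on this thread, not the exporter's actual code):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// uuidRe matches an RFC 4122 style UUID, which Varnish 4.1 uses as the
// VCL-name part of a backend identifier after a VCL reload.
var uuidRe = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// parseBackendIdent is a hypothetical sketch of the label detection:
// split on the first dot into server + backend; if the ident does not
// have that shape, use the whole string as 'backend' with server set
// to 'unknown' so the label set stays consistent between scrapes.
func parseBackendIdent(ident string) (server, backend string) {
	parts := strings.SplitN(ident, ".", 2)
	if len(parts) == 2 && parts[0] != "" && parts[1] != "" {
		return parts[0], parts[1]
	}
	return "unknown", ident
}

// isGhost reports whether the server part looks like a reload UUID,
// i.e. a candidate for the churn discussed in this issue.
func isGhost(server string) bool {
	return uuidRe.MatchString(server)
}

func main() {
	for _, id := range []string{"boot.default", "c6b41eb5-0000-4d7d-9d29-d64ba51e4168.default", "default"} {
		s, b := parseBackendIdent(id)
		fmt.Printf("%s -> server=%s backend=%s ghost=%v\n", id, s, b, isGhost(s))
	}
}
```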
Yeah, it seems varnish keeps some of the old instances in the stats. It does seem to drop them after a while, but I'm not sure if it did that only because my test backend was not up prior to doing the reloads. I don't have a health check enabled either. It looks like we could use `varnishadm vcl.list` to see which VCLs are still around.
Edit: I'm now confident that it will automatically drop the "cold" VCLs after a while; as I was writing this and ran the VCL list again, the old one was gone. I think the label inconsistencies fix should resolve the errors. You'll still see churn for a while after you reload VCLs, but it will go away (it might depend on the health check frequency etc., but again I did not have health checks enabled at all and it still went away). Do you need an official release to test this out? I will make one, but if you want to test before I do that, that works out too.
… parse certain UUID-based identifiers (prefixed ones). This caused a fallback to the 'id' label. If the scrape contained both 'backend' + 'server' labeled and 'id' labeled metrics, the exporter reports errors: for a given metric name, all metrics need to have the same number of identically named labels. Now, if the server label cannot be parsed, it will be set to 'unknown' and the full ident used as 'backend'. Fixes #5 #8
Thanks @jonnenauha ! Yes, the current master works for me and I'm getting results like https://phabricator.wikimedia.org/P4504. I'll keep it running for 1-2 days and let you know.
Right, seems there are two warm VCLs. If my observations are correct, the other one should go away. It will be interesting to see if they keep piling up, or go to the cold state and stop showing up. How often do the VCLs reload in this instance? On each commit, on pushes to stable, or similar?
I suppose these kinds of metrics might confuse certain sums. Let's say you show total connections in: I think in this case the numbers would lie, because the inactive warm entry is still in the stats, so the same backend is there twice. It should not affect any rate/sum/avg graphs that use a time frame like [5m], because the inactive warm metric's numbers should stop moving.

I could still improve this by tracking the backend+server label combos. Every time I get a new combo, I could check the varnishadm VCL list to verify the server label (the UUID) is an active one. Or maybe just make it query the list on each scrape; the tool looks to be very simple and fast, so it should not be much overhead. This could eliminate any cold/inactive backends from the metrics.
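A hedged PromQL sketch of the aggregation point above (the metric name is an assumption): summing without the UUID-bearing label collapses old and new series into one, and the rate window ages out the stale one:

```
# Per-backend rate, ignoring the 'server' (UUID) label; series that
# stopped moving contribute 0 after the 5m window passes.
sum without (server) (rate(varnish_backend_beresp_bodybytes[5m]))
```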
The list of backends is in etcd in this case and generated via confd, it changes approximately "every now and then" :)
It seems simpler to query `varnishadm` on each scrape then, though I'm not sure what privileges that needs.
I believe you'll need sudo for varnishstat to work if your varnish is running with sudo. Usually you'll be using a listening port that requires sudo anyway? varnishstat will try to find the varnish instance name/files with the current account; if you run it without sudo it will time out as it can't find the instance. This is why I take the instance variables as startup params, and why I do a test scrape at the start, so I can fatal out if it fails.

For most users the case would be that varnishadm should run fine. It could be a progressive enhancement: we could check at startup whether varnishadm works and, if so, enable that behaviour (removing non-active backends); if it does not work, we log a warning on startup to tell the user. Something like that :) I'll need to look further into how varnishadm works without sudo against a standard running ubuntu varnish service.
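If the exporter did query varnishadm, the parsing side might look roughly like this (hypothetical helper; the `vcl.list` column layout below is an assumption, and it differs between Varnish versions):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// activeVCLs is a hypothetical parser for `varnishadm vcl.list` output,
// assuming lines of the form "<status> <state/temp> <busy> <name>".
// It returns the names of VCLs that are not cold, i.e. the ones whose
// backends should still be exported.
func activeVCLs(out string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 {
			continue // not a VCL entry line
		}
		if strings.Contains(fields[1], "cold") {
			continue // cold VCL: its backends are ghosts
		}
		names = append(names, fields[len(fields)-1])
	}
	return names
}

func main() {
	sample := `active      auto/warm          0 boot
available   auto/warm          0 reload_20161107_121314
available   auto/cold          0 reload_20161001_080910`
	fmt.Println(activeVCLs(sample)) // prints the warm/active VCL names
}
```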
It ultimately depends on the shared memory file permissions AFAICS. Anyways, what I meant wrt sudo is that `varnishadm` would likely need the same kind of access as varnishstat.
Yeah that seems like a good idea! I'm happy to open another issue summing up this discussion, also because this issue is technically closed :)
Yeah you can do that, so I don't forget what I have promised :)
Thanks @jonnenauha ! I gave this some more thought and summed it up in #12. I think in this case the recording rules would be good enough; parsing VCL state would add a fair bit of complexity to the exporter, and I'm not sure it's worth the gains. HTH!
Yeah, you are right in that this feature would only marginally benefit users whose VCLs are frequently reloaded. I trust your suggestion in #12 will work for those folks, as you are a user like that :)
Hi,
at Wikimedia Foundation we are trying varnish_exporter and came across some metric churn with Varnish 4, particularly from the backend identity changing when reloading VCL.
The way we've worked around it is via Prometheus metric relabeling in the form of dropping the UUID altogether:
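The relabeling in question might look roughly like this (a sketch only: it assumes the UUID ends up in a `server` label, which may not match the actual config used):

```yaml
metric_relabel_configs:
  # Drop the UUID-bearing label so a VCL reload does not mint a new
  # time series per backend (the label name is an assumption).
  - action: labeldrop
    regex: server
```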
This isn't really an issue per-se in varnish_exporter, though it might be useful to point out to others if they come across the same. There's also more context here: https://phabricator.wikimedia.org/T150479
Thanks!