
Varnish 4 backend UUID causing metric churn #8

Closed
filippog opened this issue Nov 18, 2016 · 12 comments

@filippog
Contributor

Hi,
at Wikimedia Foundation we are trying varnish_exporter and came across some metric churn with Varnish 4, specifically the backend identity changing when reloading VCL.
The way we've worked around it is with Prometheus metric relabeling that drops the UUID altogether:

    metric_relabel_configs: 
      - regex: "varnish_backend_.+;root:[-a-f0-9]+\x5c.(.*)"
        source_labels: 
          - "__name__"
          - id
        target_label: id
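
Roughly, what this rule does (a purely illustrative Go sketch, not the exporter's or Prometheus' code): Prometheus joins the source labels (__name__ and id) with the default ";" separator, matches the regex (the \x5c. is just an escaped dot), and writes the capture group back into the id label, dropping the UUID. The sample value is taken from the error output quoted further down in this issue.

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // same expression as in the relabel rule above, with \x5c. written as \.
        re := regexp.MustCompile(`varnish_backend_.+;root:[-a-f0-9]+\.(.*)`)
        // Prometheus concatenates __name__ and id with ";" before matching
        joined := "varnish_backend_bereq_bodybytes;root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4001_ulsfo_wmnet"
        if m := re.FindStringSubmatch(joined); m != nil {
            fmt.Println(m[1]) // prints: be_cp4001_ulsfo_wmnet
        }
    }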

This isn't really an issue per se in varnish_exporter, though it might be useful to point out to others if they come across the same. There's also more context here: https://phabricator.wikimedia.org/T150479

Thanks!

@jonnenauha
Owner

jonnenauha commented Nov 18, 2016

Looks like I'm splitting the UUID quite naively here (as noted by the comment :) https://github.com/jonnenauha/prometheus_varnish_exporter/blob/master/prometheus.go#L188-L204

It's assuming the VBE value to be <uuid>.<name> while yours is root:<uuid>.<name>. I had not encountered this before (my VCLs are quite simple, I would need more complex ones to run into these kinds of things).

VBE.root:464fe99d-b7d7-4795-9809-48b32f0acd0b.be_analytics1027_eqiad_wmnet.happy will fail to attach server and backend labels to the metric as this fails vIdentifier[8] == '-'.

This will fall back to putting a generic id label with the full value here https://github.com/jonnenauha/prometheus_varnish_exporter/blob/master/prometheus.go#L205-L211
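
Roughly, the split being described looks like this (a sketch of the idea only, not the exporter's actual code; the splitIdent helper is made up for illustration):

    package main

    import (
        "fmt"
        "strings"
    )

    // splitIdent treats everything before the first '.' as the UUID (server)
    // and the rest as the backend name, using the ident[8] == '-' check as a
    // sanity test for the UUID.
    func splitIdent(ident string) (server, backend string, ok bool) {
        i := strings.Index(ident, ".")
        if i < 9 {
            return "", "", false
        }
        server, backend = ident[:i], ident[i+1:]
        if server[8] != '-' { // fails for "root:<uuid>.<name>" because of the prefix
            return "", "", false
        }
        return server, backend, true
    }

    func main() {
        // plain <uuid>.<name> parses into server + backend labels
        fmt.Println(splitIdent("464fe99d-b7d7-4795-9809-48b32f0acd0b.be_analytics1027_eqiad_wmnet"))
        // the prefixed form fails the check, so everything falls back to a single id label
        fmt.Println(splitIdent("root:464fe99d-b7d7-4795-9809-48b32f0acd0b.be_analytics1027_eqiad_wmnet"))
    }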

I can fix this so that you get the server and backend labels; these are quite handy when building your Grafana dashboards.

But I don't think this fixes the "has label dimensions inconsistent with previously collected metrics in the same metric family" issue. I don't understand what that error means. If the UUID changes on each VCL reload (https://phabricator.wikimedia.org/T150479) it should be unique.

The old backend names are still reported by varnishstat. But I believe their values (at least some) are nuked to zero. This might cause Prometheus to say the data is inconsistent (e.g. some attribute that can only increment suddenly goes to zero). Maybe I need to somehow detect the ghost backends and drop them on the floor.


What do you mean by "churn"? The fact that it logs these errors in Prometheus?

Could you paste here, or send me in a private message, the prometheus_varnish_exporter scraped metrics? Your VCL file might also help, but I understand if that is too sensitive to be sending around.

@filippog
Contributor Author

filippog commented Nov 18, 2016

Looks like I'm splitting the UUID quite naively here (as noted by the comment :) https://github.com/jonnenauha/prometheus_varnish_exporter/blob/master/prometheus.go#L188-L204

It's assuming the VBE value to be <uuid>.<name> while yours is root:<uuid>.<name>. I had not encountered this before (my VCLs are quite simple, I would need more complex ones to run into these kinds of things).

VBE.root:464fe99d-b7d7-4795-9809-48b32f0acd0b.be_analytics1027_eqiad_wmnet.happy will fail to attach server and backend labels to the metric as this fails vIdentifier[8] == '-'.

Indeed. Also, in this case I don't think the server/backend split would be possible anyway, since only the backend is there; I'm not quite sure why that's the case though. The varnishstat -j output looks like this:

  "VBE.root:aca6de25-6790-43ed-bc4d-14b89df43c2d.be_analytics1027_eqiad_wmnet.happy": {
    "description": "Happy health probes",
    "type": "VBE", "ident": "root:aca6de25-6790-43ed-bc4d-14b89df43c2d.be_analytics1027_eqiad_wmnet", "flag": "b", "format": "b",
    "value": 0
  },

This will fall back to putting a generic id label with the full value here https://github.com/jonnenauha/prometheus_varnish_exporter/blob/master/prometheus.go#L205-L211

I can fix this so that you get the server and backend labels; these are quite handy when building your Grafana dashboards.

But I don't think this fixes the "has label dimensions inconsistent with previously collected metrics in the same metric family" issue. I don't understand what that error means. If the UUID changes on each VCL reload (https://phabricator.wikimedia.org/T150479) it should be unique.

Yeah, I think so too. Now that I think of it, it might be the id vs server/backend fallback that causes the issues in #5, where parsing temporarily fails and the fallback kicks in?

The old backend names are still reported by varnishstat. But I believe their values (at least some) are nuked to zero. This might cause Prometheus to say the data is inconsistent (e.g. some attribute that can only increment suddenly goes to zero). Maybe I need to somehow detect the ghost backends and drop them on the floor.

The reset to zero should be fine for Prometheus to handle though; for counters it is expected to happen.

What do you mean by "churn"? The fact that it logs these errors in Prometheus?

I'm referring to the fact that Prometheus will create new time series each time the UUID changes, even though they refer to the same backend underneath. IOW we're not interested in tracking the UUID as part of the label value, as that would cause too much cardinality for the backend metrics.

Could you paste here, or send me in a private message, the prometheus_varnish_exporter scraped metrics? Your VCL file might also help, but I understand if that is too sensitive to be sending around.

For sure, you can find two dumps of varnish_exporter for two different varnish instances on the same machine, https://phabricator.wikimedia.org/P4478 and https://phabricator.wikimedia.org/P4477.
Re: VCL, I think we can share that too; the public templatized version is at https://github.com/wikimedia/operations-puppet/tree/production/modules/varnish/templates but possibly hard to follow. We could probably share the "runtime" version too, cc @ema @blblack

@jonnenauha
Owner

jonnenauha commented Nov 18, 2016

Ok, I think I got it now. https://github.com/prometheus/client_golang/blob/3fb8ace93bc4ccddea55af62320c2fd109252880/prometheus/registry.go#L704-L707

This code hashes the metric's name and labels, hash A. Then it does this:

    if dimHash, ok := dimHashes[metricFamily.GetName()]; ok {
        if dimHash != dh {

It will find a previously scraped hash B. They should match. If my backend and server labels were found before a VCL reload but not after it, the exporter will fall back to the id label. Then the metric name is the same but there are two types of label combos inside the scrape. This will report an error that the label counts or names have changed for this particular metric.
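
In other words, the registry remembers one "dimension hash" per metric family, computed from the label names, and complains when a later metric in the same family hashes differently. A stripped-down illustration of that check (my own sketch, not client_golang's actual code):

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // dimHash stands in for the per-family hash of the sorted label names.
    func dimHash(labelNames []string) string {
        names := append([]string(nil), labelNames...)
        sort.Strings(names)
        return strings.Join(names, "\xff")
    }

    func main() {
        dimHashes := map[string]string{}
        samples := []struct {
            family string
            labels []string
        }{
            {"varnish_backend_pipe_hdrbytes", []string{"backend", "server"}}, // parsed form
            {"varnish_backend_pipe_hdrbytes", []string{"id"}},                // fallback form
        }
        for _, s := range samples {
            h := dimHash(s.labels)
            if prev, ok := dimHashes[s.family]; ok && prev != h {
                fmt.Println(s.family, "has label dimensions inconsistent with previously collected metrics")
                continue
            }
            dimHashes[s.family] = h
        }
    }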

Now, looking at your link https://phabricator.wikimedia.org/T150479 that has the errors, I noticed that both label forms are present in the errors:

  • collected metric varnish_backend_pipe_hdrbytes label:<name:"backend" value:"be_cp4001_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:933922 > has label dimensions inconsistent with previously collected metrics in the same metric family
  • collected metric varnish_backend_bereq_bodybytes label:<name:"id" value:"root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4001_ulsfo_wmnet" > gauge:<value:205725 > has label dimensions inconsistent with previously collected metrics in the same metric family

My code is able to get the backend+server labels if there is no root:xxx prefix. I also saw a boot.xxx prefix on my own Varnish (I don't have a boot prefix in the name, it has been injected there). I would guess you have not put root in front either.

This looks to be something that Varnish injects. For you, I believe the root: is not there at some point, so my backend+server labels are present; at some point the root: gets injected and it falls back to id.

Fixes

I already made the backend+server label detection more robust, handling the injected prefix before the UUID.

If neither of the forms known to me can be detected, it will still inject both labels, to keep things consistent between scrapes. I will simply make the whole ident string the backend and set server to "unknown" (I believe an empty label value might also be considered bad by the client).
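
Something along these lines (a sketch of the intended behaviour only; the regex and the labelsFor helper are mine, not necessarily what the actual fix does):

    package main

    import (
        "fmt"
        "regexp"
    )

    // uuidIdent allows an optional injected prefix such as "root:" or "boot."
    // in front of the 36-character UUID.
    var uuidIdent = regexp.MustCompile(`^(?:[a-zA-Z]+[.:])?([0-9a-f-]{36})\.(.+)$`)

    func labelsFor(ident string) (server, backend string) {
        if m := uuidIdent.FindStringSubmatch(ident); m != nil {
            return m[1], m[2]
        }
        // unknown shape: still emit both labels so the label set stays consistent
        return "unknown", ident
    }

    func main() {
        fmt.Println(labelsFor("root:464fe99d-b7d7-4795-9809-48b32f0acd0b.be_analytics1027_eqiad_wmnet"))
        fmt.Println(labelsFor("something_unexpected"))
    }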

The churn issue is a separate one. I will look into it once I'm done with this. As Varnish does report the old backends (forever if not restarted?), even with the correct detection you will be getting a lot of noise.

I'm thinking that I could make a startup param -no-backend-id to drop the unique identifiers (the backend label). I believe this is what you are doing with your id regex renamer thing. You don't want the UUID.

If the second label is dropped, I believe the old and the new backend (after a VCL reload) would generate identical metrics, and this might also be an error on the varnish side. If this is the case I'll need to somehow detect "ghost" backends and drop them out completely.

I'll try to get back to you on this.

@jonnenauha
Owner

Yeah, it seems Varnish keeps some of the old instances in the stats. It does seem to drop them after a while, but I'm not sure if it did that because my test backend was not up prior to doing ops. I don't have a health check enabled either.

It looks like we could use the varnishadm process to scan which backends are alive. I found two useful commands:

> varnishadm backend.list
200        
Backend name                   Admin      Probe
cb5be46f-ce9c-40b0-aac0-23325fefbed7.default probe      Healthy (no probe)
> varnishadm vcl.list
200        
available  auto/cold          0 boot
available  auto/cold          0 ce19737f-72b5-4f4b-9d39-3d8c2d28240b
available  auto/cold          0 33df656c-442a-42f9-b0e7-643ae77df952
available  auto/warm          0 4d7ebc25-1f6d-4ac0-b454-73b0a6716698
active     auto/warm          0 cb5be46f-ce9c-40b0-aac0-23325fefbed7

Edit: I'm now confident that it will automatically drop the "cold" VCLs after a while. As I was writing this I ran the VCL list again, and 4d7ebc25-1f6d-4ac0-b454-73b0a6716698 went to "cold". Now varnishstat -j no longer reports those VBEs.

I think the label inconsistency fix should resolve the errors. You'll still see the churn for a while after you reload VCLs, but it will go away (it might depend on the health check frequency etc., but again I did not have that enabled at all and it still went away).

Do you need an official release to test this out? I will make one, but if you want to test before I do that, that works out too.

jonnenauha added a commit that referenced this issue Nov 19, 2016
… parse certain UUID based identifiers (prefixed ones).

This caused a fallback to the 'id' label. If the scrape contained both 'backend' + 'server' labeled and 'id' labeled metrics, the exporter reports errors. For a metric name, all metrics need to have the same number of labels, with the same names.
Now if the server label cannot be parsed it will be set to 'unknown' and the full ident set as 'backend'.

Fixes #5 #8
@filippog
Contributor Author

Thanks @jonnenauha ! Yes the current master works for me and I'm getting results like https://phabricator.wikimedia.org/P4504. I'll keep it running for 1-2 days and let you know.

@jonnenauha
Owner

Right, seems there are two warm VCLs. If my observations are correct the other one should go away. It will be interesting to see if they keep piling up, or go to the cold state and not show up anymore.

How often do the VCLs reload in this instance? On each commit, on pushes to stable, or similar?

varnish_backend_conn{backend="be_cp2006_codfw_wmnet",server="02610319-7c2e-4154-8298-b991fc0b1042"} 18998
varnish_backend_conn{backend="be_cp2006_codfw_wmnet",server="ab7a9bed-d96b-439f-a776-a99d182ee326"} 33529

I suppose these kinds of metrics might confuse certain sums. Let's say you show total incoming connections; I think in this case the numbers would lie, because the inactive warm entry is still in the stats, so the same backend appears twice.

It should not affect any rate/sum/avg graphs that use a time frame like [5m], because the inactive warm metric numbers should stop moving.

I could still improve this by tracking the backend+server labels. Every time I get a new combo, I could check the varnishadm VCL list to verify the server label (the UUID) is an active one. Or maybe just query the list on each scrape; the tool looks to be very simple and fast, so it should not be much overhead. This could eliminate any cold/inactive backends from the metrics.

@filippog
Contributor Author

Right, seems there are two warm VCLs. If my observations are correct the other one should go away. It will be interesting to see if they keep piling up, or go to the cold state and not show up anymore.

How often do the VCLs reload in this instance? On each commit, on pushes to stable, or similar?

The list of backends is in etcd in this case and generated via confd, it changes approximately "every now and then" :)

varnish_backend_conn{backend="be_cp2006_codfw_wmnet",server="02610319-7c2e-4154-8298-b991fc0b1042"} 18998
varnish_backend_conn{backend="be_cp2006_codfw_wmnet",server="ab7a9bed-d96b-439f-a776-a99d182ee326"} 33529
I suppose these kinds of metrics might confuse certain sums. Let's say you show total incoming connections; I think in this case the numbers would lie, because the inactive warm entry is still in the stats, so the same backend appears twice.

It should not affect any rate/sum/avg graphs that use a time frame like [5m], because the inactive warm metric numbers should stop moving.

I could still improve this by tracking the backend+server labels. Every time I get a new combo, I could check the varnishadm VCL list to verify the server label (the UUID) is an active one. Or maybe just query the list on each scrape; the tool looks to be very simple and fast, so it should not be much overhead. This could eliminate any cold/inactive backends from the metrics.

It seems simpler to query varnishadm for the list of active VCLs on each scrape, keeping the exporter stateless, rather than tracking state across scrapes. A related problem is that varnishadm is naturally more privileged than varnishstat, so it would need some sort of privilege elevation (e.g. sudo). Once the UUID <-> VCL state is known, it could also be reported alongside the metrics so no information is lost. Admittedly this could also be a feature request for varnishstat (report VCL status).
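
A rough sketch of what that per-scrape query could look like (the command is the one shown above; the line parsing and any sudo/secret-file handling are assumptions on my part):

    package main

    import (
        "bufio"
        "bytes"
        "fmt"
        "os/exec"
        "strings"
    )

    // activeVCLs runs `varnishadm vcl.list` and returns the names (UUIDs) of
    // VCLs that are active or warm, based on the output format pasted earlier.
    func activeVCLs() (map[string]bool, error) {
        out, err := exec.Command("varnishadm", "vcl.list").Output()
        if err != nil {
            return nil, err
        }
        keep := map[string]bool{}
        sc := bufio.NewScanner(bytes.NewReader(out))
        for sc.Scan() {
            // expected shape: "active     auto/warm          0 cb5be46f-..."
            f := strings.Fields(sc.Text())
            if len(f) == 4 && (f[0] == "active" || strings.HasSuffix(f[1], "/warm")) {
                keep[f[3]] = true
            }
        }
        return keep, sc.Err()
    }

    func main() {
        vcls, err := activeVCLs()
        if err != nil {
            fmt.Println("varnishadm not usable:", err)
            return
        }
        // backend metrics whose server (UUID) label is not in this set could be dropped
        fmt.Println("active/warm VCLs:", vcls)
    }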

@jonnenauha
Owner

jonnenauha commented Nov 30, 2016

I believe you'll need sudo for varnishstat to work if your varnish is running with sudo. Usually you'll be using a listening port that requires sudo anyway?

varnishstat will try to find the varnish instance name/files with the current account. If you run it without sudo it will time out, as it can't find the instance. This is why I take the instance variables as startup params, and why I do a test scrape at the start, so I can fatal out if it fails.

For most users, varnishadm should run fine. It could be a progressive enhancement: we could check at startup if varnishadm works and add those stats in (and remove non-active backends). If it does not work, we can log a warning on startup to tell the user.

Something like that :) I'll need to look further into how varnishadm works without sudo against a standard running Ubuntu Varnish service.

@filippog
Contributor Author

filippog commented Dec 6, 2016

I believe you'll need sudo for varnishstat to work if your varnish is running with sudo. Usually you'll be using a listening port that requires sudo anyway?

It ultimately depends on the shared memory file permissions AFAICS. Anyway, what I meant w.r.t. sudo is that varnishstat is naturally less privileged than varnishadm, as it doesn't need the secret file and is read-only. That's why varnishadm will generally require more privileges than a world-readable VSM file that varnishstat can read.

varnishstat will try to find the varnish instance name/files with the current account. If you run it without sudo it will time out, as it can't find the instance. This is why I take the instance variables as startup params, and why I do a test scrape at the start, so I can fatal out if it fails.

For most users, varnishadm should run fine. It could be a progressive enhancement: we could check at startup if varnishadm works and add those stats in (and remove non-active backends). If it does not work, we can log a warning on startup to tell the user.

Yeah that seems like a good idea! I'm happy to open another issue summing up this discussion, also because this issue is technically closed :)

@jonnenauha
Owner

Yeah you can do that, so I don't forget what I have promised :)

@filippog
Contributor Author

Thanks @jonnenauha! I gave this some more thought and summed it up in #12. I think in this case recording rules would be good enough; parsing VCL state would add a fair bit of complexity to the exporter which I'm not sure is worth the gains. HTH!

@jonnenauha
Owner

Yeah, you are right in that this feature would only marginally benefit users whose VCLs are frequently reloaded. I trust your suggestion in #12 will work for those folks, as you are such a user :)
