
Expose metric for zipkin spans that could not be parsed #2264

Closed
Stono opened this issue May 28, 2020 · 8 comments · Fixed by #2664
Labels: good first issue (Good for beginners), help wanted (Features that maintainers are willing to accept but do not have cycles to implement)

Comments

@Stono commented May 28, 2020

Requirement - what kind of business use case are you trying to solve?

I would like to detect spans that are getting rejected with a 400 Bad Request (see istio/istio#24177) using Prometheus metrics.

There is currently nothing on :14269/metrics which captures such a failure (in this case, a span tag that was not a string).

Problem - what in Jaeger blocks you from solving the requirement?

We build a platform that our product teams debug using Jaeger; we would like to detect problems before they are raised to us as missing spans in the UI.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Expose a prometheus metric which tracks failed zipkin span reports, so we're able to alert on it.

Any open questions to address

@ghost added the needs-triage label May 28, 2020
@yurishkuro added the help wanted and good first issue labels and removed the needs-triage label May 28, 2020
@Stono (Author) commented May 31, 2020

Thinking about it more, this might be a bug, as this counter should really be incremented:

jaeger_spans_rejected_total{debug="false",format="zipkin",svc="other-services",transport="http"} 0

@objectiser (Contributor) commented

This metric can only be used once the span data has been correctly parsed, as it needs to identify the service. So either a new metric should be defined, or the service label would need a specific value to indicate the failure occurred on deserialization.
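For illustration, a minimal sketch of the second option (a sentinel value in the service label), written directly against the Prometheus Go client rather than Jaeger's internal metrics factory; the package, function name and the "unparseable" label value are made up here, not taken from the Jaeger codebase:

```go
package zipkinmetrics

import "github.com/prometheus/client_golang/prometheus"

// spansRejected mirrors the shape of the existing rejection counter, but is
// only a sketch; the real Jaeger metric is created elsewhere in the collector.
var spansRejected = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "jaeger_spans_rejected_total",
		Help: "Spans rejected by the collector, by format, transport and service.",
	},
	[]string{"format", "transport", "svc"},
)

func init() {
	prometheus.MustRegister(spansRejected)
}

// countUnparseable records a span payload that failed deserialization. The
// service name cannot be read from an unparseable payload, so a sentinel
// value is used for the svc label instead.
func countUnparseable() {
	spansRejected.WithLabelValues("zipkin", "http", "unparseable").Inc()
}
```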

@Stono (Author) commented May 31, 2020

Ahhh good point and makes sense, thanks for clarifying

@dimitarvdimitrov (Contributor) commented

I'd like to try doing this one. Is it still relevant, @yurishkuro, @Stono?

I managed to have a quick look but will need some guidance since I haven't worked on Jaeger before:

  • My understanding is that we want a way to detect when and if invalid spans are sent, but not necessarily who is sending them. Is this correct, @Stono?
  • From what I saw, we can achieve this in four ways; option 4 looks the simplest to me
    1. count failed deserializations in the collector's zipkin handler using a counter instantiated there; this will duplicate some of the code in collector/app/metrics.go
    2. count failed deserializations in the collector's zipkin handler, but have the counter injected by the processor when creating the handler
    3. the zipkin handler would call back into the processor on failed deserializations and the processor would count them; no duplication, but it seems a little convoluted
    4. report an HTTP status code counter for all requests/endpoints in a gorilla mux middleware function (see the sketch after this list); in theory this touches the critical path, but the overhead shouldn't be noticeable, and it also gives more observability over the collector
  • It'll be difficult to count incorrectly formatted spans because we can't parse them in the first place in order to count them; is it ok to just count failed deserialization?
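For reference, a rough sketch of what option 4 could look like with gorilla/mux and the Prometheus Go client; the metric name, package and helper names are hypothetical, and a real implementation would presumably go through Jaeger's own metrics factory instead:

```go
package httpmetrics

import (
	"net/http"
	"strconv"

	"github.com/gorilla/mux"
	"github.com/prometheus/client_golang/prometheus"
)

// httpRequests counts responses by route template, method and status code.
var httpRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "jaeger_collector_http_requests_total",
		Help: "HTTP requests handled by the collector, labeled by status code.",
	},
	[]string{"route", "method", "code"},
)

func init() {
	prometheus.MustRegister(httpRequests)
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// CountStatusCodes is middleware that increments a counter for every
// response, so a 400 from the zipkin handler becomes visible as code="400"
// without touching the handler itself.
func CountStatusCodes(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)

		route := req.URL.Path
		if cur := mux.CurrentRoute(req); cur != nil {
			if tpl, err := cur.GetPathTemplate(); err == nil {
				route = tpl
			}
		}
		httpRequests.WithLabelValues(route, req.Method, strconv.Itoa(rec.status)).Inc()
	})
}
```

Wiring it up would then be a single r.Use(CountStatusCodes) on the collector's mux router.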

@yurishkuro (Member) commented

If we do not have any metrics on the HTTP endpoints, this is where I would start, i.e. emit a classic RED set where E(rrors) are labeled by the HTTP status code.

The OP does not ask for metrics by the originating service, and I think that's OK.
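One way to read that suggestion, again sketched with made-up names against the Prometheus Go client: a single duration histogram labeled by endpoint and status code yields the whole RED set, with Rate from the sample count, Errors from the 4xx/5xx code labels, and Duration from the buckets.

```go
package httpmetrics

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// httpRequestDuration covers the full RED set: rate from the sample count,
// errors by filtering on the code label, duration from the buckets.
var httpRequestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "jaeger_collector_http_request_duration_seconds",
		Help:    "Duration of HTTP requests handled by the collector.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"endpoint", "code"},
)

func init() {
	prometheus.MustRegister(httpRequestDuration)
}

// observe is called once per request with the matched endpoint, the response
// status code and the time the handler took.
func observe(endpoint string, code int, elapsed time.Duration) {
	httpRequestDuration.
		WithLabelValues(endpoint, strconv.Itoa(code)).
		Observe(elapsed.Seconds())
}
```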

@dimitarvdimitrov (Contributor) commented Nov 23, 2020 via email

@yurishkuro (Member) commented

query & agent do not have HTTP endpoints receiving spans. The UDP endpoint in the agent already emit metrics.

@Stono (Author) commented Nov 23, 2020

Hello!
Yes, it's still relevant. As the requests are going to be "bad", you're not going to be able to parse them to get any app information, so you wouldn't be able to get that as a dimension anyway.

At the moment, spans which contain invalid data are rejected with a 400 Bad Request, but there is no associated metric we can monitor and alert on. We're simply looking for that basic metric to tell us bad data is arriving, so we can then debug further.
