
Expose metric for zipkin spans that could not be parsed #2264

Closed
Stono opened this issue May 28, 2020 · 8 comments · Fixed by #2664
Labels: good first issue (Good for beginners), help wanted (Features that maintainers are willing to accept but do not have cycles to implement)

Comments

@Stono commented May 28, 2020

Requirement - what kind of business use case are you trying to solve?

I would like to detect spans that are getting rejected with a 400 Bad Request (see istio/istio#24177) using Prometheus metrics.

There is currently nothing on :14269/metrics which captures such a failure (in this case, a span tag that was not a string).

Problem - what in Jaeger blocks you from solving the requirement?

We build a platform that our product teams debug using Jaeger; we would like to detect problems before they are raised to us as missing spans in the UI.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Expose a prometheus metric which tracks failed zipkin span reports, so we're able to alert on it.

Any open questions to address

@ghost added the needs-triage label May 28, 2020
@yurishkuro added the help wanted and good first issue labels and removed the needs-triage label May 28, 2020
@Stono (Author) commented May 31, 2020

Thinking about it more, this might be a bug, as this counter should really be incremented:

jaeger_spans_rejected_total{debug="false",format="zipkin",svc="other-services",transport="http"} 0

@objectiser (Contributor) commented

This metric can only be used once the span data has been correctly parsed, as it needs to identify the service. So either a new metric should be defined, or the service label would need a specific value to indicate the failure occurred on deserialization.
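For illustration, a minimal sketch of the second option (a sentinel value in the service label), written directly against the Prometheus Go client rather than Jaeger's internal metrics factory; the package, function name and the "unparseable" label value are made up here, not taken from the Jaeger codebase:

```go
package zipkinmetrics

import "github.com/prometheus/client_golang/prometheus"

// spansRejected mirrors the shape of the existing rejection counter, but is
// only a sketch; the real Jaeger metric is created elsewhere in the collector.
var spansRejected = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "jaeger_spans_rejected_total",
		Help: "Spans rejected by the collector, by format, transport and service.",
	},
	[]string{"format", "transport", "svc"},
)

func init() {
	prometheus.MustRegister(spansRejected)
}

// countUnparseable records a span payload that failed deserialization. The
// service name cannot be read from an unparseable payload, so a sentinel
// value is used for the svc label instead.
func countUnparseable() {
	spansRejected.WithLabelValues("zipkin", "http", "unparseable").Inc()
}
```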

@Stono (Author) commented May 31, 2020

Ahhh good point and makes sense, thanks for clarifying

@dimitarvdimitrov (Contributor) commented

I'd like to try doing this one. Is it still relevant, @yurishkuro, @Stono?

I managed to have a quick look but will need some guidance since I haven't worked on Jaeger before:

  • My understanding is that we want a way to detect when and if invalid spans are sent, but not necessarily who is sending them. Is this correct, @Stono?
  • From what I saw, we can achieve this in four ways; option 4 looks the simplest to me
    1. count failed deserializations in the collector's zipkin handler using a counter instantiated there; this will duplicate some of the code in collector/app/metrics.go
    2. count failed deserializations in the collector's zipkin handler, but have the counter injected by the processor when creating the handler
    3. the zipkin handler would call back into the processor on failed deserializations and the processor would count them; no duplication, but it seems a little convoluted
    4. report an HTTP status code counter for all requests/endpoints in a gorilla mux middleware function (see the sketch after this list); in theory this touches the critical path, but the overhead shouldn't be noticeable, and it also gives more observability over the collector
  • It'll be difficult to count incorrectly formatted spans because we can't parse them in the first place in order to count them; is it ok to just count failed deserialization?
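For reference, a rough sketch of what option 4 could look like with gorilla/mux and the Prometheus Go client; the metric name, package and helper names are hypothetical, and a real implementation would presumably go through Jaeger's own metrics factory instead:

```go
package httpmetrics

import (
	"net/http"
	"strconv"

	"github.com/gorilla/mux"
	"github.com/prometheus/client_golang/prometheus"
)

// httpRequests counts responses by route template, method and status code.
var httpRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "jaeger_collector_http_requests_total",
		Help: "HTTP requests handled by the collector, labeled by status code.",
	},
	[]string{"route", "method", "code"},
)

func init() {
	prometheus.MustRegister(httpRequests)
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// CountStatusCodes is middleware that increments a counter for every
// response, so a 400 from the zipkin handler becomes visible as code="400"
// without touching the handler itself.
func CountStatusCodes(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)

		route := req.URL.Path
		if cur := mux.CurrentRoute(req); cur != nil {
			if tpl, err := cur.GetPathTemplate(); err == nil {
				route = tpl
			}
		}
		httpRequests.WithLabelValues(route, req.Method, strconv.Itoa(rec.status)).Inc()
	})
}
```

Wiring it up would then be a single r.Use(CountStatusCodes) on the collector's mux router.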

@yurishkuro (Member) commented

If we do not have any metrics on the HTTP endpoints, this is where I would start, i.e. emit a classic RED set where E(rrors) are labeled by the HTTP status code.

The OP does not ask for metrics by the originating service, and I think that's OK.
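One way to read that suggestion, again sketched with made-up names against the Prometheus Go client: a single duration histogram labeled by endpoint and status code yields the whole RED set, with Rate from the sample count, Errors from the 4xx/5xx code labels, and Duration from the buckets.

```go
package httpmetrics

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// httpRequestDuration covers the full RED set: rate from the sample count,
// errors by filtering on the code label, duration from the buckets.
var httpRequestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "jaeger_collector_http_request_duration_seconds",
		Help:    "Duration of HTTP requests handled by the collector.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"endpoint", "code"},
)

func init() {
	prometheus.MustRegister(httpRequestDuration)
}

// observe is called once per request with the matched endpoint, the response
// status code and the time the handler took.
func observe(endpoint string, code int, elapsed time.Duration) {
	httpRequestDuration.
		WithLabelValues(endpoint, strconv.Itoa(code)).
		Observe(elapsed.Seconds())
}
```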

@dimitarvdimitrov (Contributor) commented Nov 23, 2020 via email

@yurishkuro (Member) commented

query & agent do not have HTTP endpoints receiving spans. The UDP endpoint in the agent already emit metrics.

@Stono (Author) commented Nov 23, 2020

Hello!
Yes, it's still relevant. As the requests are going to be "bad", you're not going to be able to parse them to get any app information, so you wouldn't be able to get that as a dimension anyway.

At the moment, spans which contain invalid data are rejected with a 400 Bad Request, but there is no associated metric we can monitor and alert on. We're simply looking for that basic metric to tell us bad data is arriving, so we can then debug further.
