Discussion for Aggregated Trace Metrics (ATM) #2574
Comments
Correction: I suggested it be in the Jaeger collector, not necessarily in the OTEL project. The reason for the collector is that it is the backend component closest to the source of data. In an architecture that supports tail-based sampling, for example, the collectors have access to the highest-fidelity / least-sampled data compared to all other stages of backend processing. Building this on top of Kafka means we'd always have to pay the disk serialization cost, whereas extracting metrics from spans is a lightweight operation that could easily be done before that.
I still don't follow why this is a problem. A TraceConsumer just takes trace data and does something with it. If that something happens to be extracting metrics and sending them somewhere, what does that have to do with processing pipelines of the OTEL collector? As far as those pipelines are concerned this consumer is a leaf.
My thoughts exactly: the processor can just capture the metrics based on the traces from the pipeline and export them as metrics data points.
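To make the pass-through idea concrete, here is a minimal Go sketch. The types (Span, Traces, TracesConsumer, MetricsRecorder) are simplified stand-ins for illustration, not the actual OTEL Collector interfaces:

```go
package spanmetrics

import (
	"context"
	"time"
)

// Span and Traces are simplified stand-ins for the collector's trace data model.
type Span struct {
	Service   string
	Operation string
	Duration  time.Duration
	IsError   bool
}

type Traces struct {
	Spans []Span
}

// TracesConsumer is the downstream stage of the trace pipeline (hypothetical interface).
type TracesConsumer interface {
	ConsumeTraces(ctx context.Context, td Traces) error
}

// MetricsRecorder abstracts wherever the derived data points are sent.
type MetricsRecorder interface {
	CountCall(service, operation string, isError bool)
	RecordLatency(service, operation string, d time.Duration)
}

// Processor derives RED-style metrics from every span it sees and then
// forwards the traces unchanged, so it is a pass-through as far as the
// trace pipeline is concerned.
type Processor struct {
	next    TracesConsumer
	metrics MetricsRecorder
}

func NewProcessor(next TracesConsumer, metrics MetricsRecorder) *Processor {
	return &Processor{next: next, metrics: metrics}
}

func (p *Processor) ConsumeTraces(ctx context.Context, td Traces) error {
	for _, span := range td.Spans {
		p.metrics.CountCall(span.Service, span.Operation, span.IsError)
		p.metrics.RecordLatency(span.Service, span.Operation, span.Duration)
	}
	// Pass the trace data through untouched.
	return p.next.ConsumeTraces(ctx, td)
}
```

From the trace pipeline's point of view the processor is transparent; only the derived data points leave through a separate path.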
I tried to draw this up (please let me know if it's inaccurate):
I like this because it's very simple and we should be able to get something working pretty quickly with the existing OTEL Collector design. It's certainly feasible, especially since (as @yurishkuro mentioned) extracting metrics from spans is lightweight, and the network serde and I/O could be done out of band in a separate goroutine, taking the cost away from the trace-processing hot path. However, I feel it diverges from the design philosophy of OTEL collectors: an exporter sends data to other systems/backends, and processors simply pre-process data or help with reliable completion of pipelines. Why this could be an issue is:
A question that comes to mind, though (and it's on me to research/learn more about OTEL Collectors), is: how do we monitor the processors? Do processors need to instrument themselves and expose/ship their own metrics? If so, we could perhaps piggyback off that precedent.
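To illustrate the out-of-band export mentioned above, a sketch (again with hypothetical names) where the hot path only pushes onto a buffered channel and a background goroutine does the batching and flushing:

```go
package spanmetrics

import "time"

// metricUpdate is one span's contribution to the derived metrics
// (types and names here are illustrative only).
type metricUpdate struct {
	Service   string
	Operation string
	Duration  time.Duration
	IsError   bool
}

// asyncRecorder buffers updates and flushes them from a background
// goroutine, keeping serialization and network I/O off the
// span-processing hot path.
type asyncRecorder struct {
	updates chan metricUpdate
}

func newAsyncRecorder(buffer int, interval time.Duration, flush func([]metricUpdate)) *asyncRecorder {
	r := &asyncRecorder{updates: make(chan metricUpdate, buffer)}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		var batch []metricUpdate
		for {
			select {
			case u := <-r.updates:
				batch = append(batch, u)
			case <-ticker.C:
				if len(batch) > 0 {
					flush(batch) // expensive serde / network I/O happens here, off the hot path
					batch = nil
				}
			}
		}
	}()
	return r
}

// Record is called on the hot path; it never blocks, dropping updates
// if the buffer is full rather than stalling trace processing.
func (r *asyncRecorder) Record(u metricUpdate) {
	select {
	case r.updates <- u:
	default:
	}
}
```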
I think the definition in the readme only covers the case of processors acting on data of the same type. Using traces to generate metrics is acceptable, IMO, but you might want to ask explicitly in the Collector SIG.
I had a similar concern with the routingprocessor and with the loadbalancingexporter. In this case, I would just allow OTLP endpoints to be on the receiving end of these metrics. Users can then configure their metrics pipeline to export this to their preferred backend. Processors are indeed able to instrument themselves. Take a look at the |
Given the volume of metrics that could be created, I think Prometheus remote_write would be the right choice here over exposing a page to scrape. remote_write is well supported among TSDB backends, so it would give users options for where to store their data.
The downside is that we can't remote_write into Prometheus itself, or maybe we can and I'm just not aware of it. To process real-time metrics (i.e. Alertmanager) you'd want to scrape as part of your ingest pipeline. I could see both methods being valid for users depending on their specific requirements. If we want to use these metrics inside the Jaeger UI, we also need to build a service and determine how we support multiple backends, but at least PromQL would be the query language.
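For the scrape-based option, a minimal sketch using the Prometheus Go client to expose the derived metrics on a /metrics page (metric and label names are just placeholders):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	callCount = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "traces_span_calls_total",
		Help: "Number of spans observed, by service and operation.",
	}, []string{"service", "operation", "status"})

	latency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "traces_span_latency_seconds",
		Help:    "Span duration, by service and operation.",
		Buckets: prometheus.DefBuckets,
	}, []string{"service", "operation"})
)

func main() {
	prometheus.MustRegister(callCount, latency)

	// The trace-processing code would call callCount.WithLabelValues(...).Inc()
	// and latency.WithLabelValues(...).Observe(...) as spans flow through.

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8889", nil))
}
```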
Why not export to OTLP metrics, and let users configure their exporters the way they need it? |
@yurishkuro I was under the impression that the Jaeger collector would be replaced by the Otel Collector, hence we should be focusing efforts on Otel rather than Jaeger. Based on this discussion we should look at Otel and use OTLP to create metrics, which can then be sent to any backend (thanks for the great advice @jpkrohling). We just need to determine how to use this data within the Jaeger UI eventually. Thanks for the discussion.
Yes and no. The Jaeger collector will be based on the OTEL Collector, but I think it will always remain a custom build, especially because it includes implementations of direct-to-db exporters.
@jpkrohling, for my understanding, why specifically OTLP metrics? Could these metrics, theoretically, be shipped in any other format (opencensus, prometheus, etc.) so long as there is an OTEL receiver available to accept the data and normalise it to the OTEL metrics data model? Then users could, as you say, configure their exporters the way they need.
The |
Wanted to flag up my interest in this topic in general. My research group at MPI-SWS is currently doing some work on trace visualization and aggregate analysis, and our interests extend into the systems side of driving visualizations. Our starting point at this moment is mentioned in @jkowall's google doc writeup:
We've been looking at this in a few ways. We've been thinking about visualizations that (a) extend single-trace visualizations to provide aggregate context; and (b) visualizations for sets of traces in general. Metrics are a big part of this, and we are also interested in how to represent structural aggregations. We have also been thinking about the backends needed for driving the visualizations. In case anybody is interested, a PhD student in my group (Vaastav Anand) prototyped and wrote up some ideas for a class project earlier this year (link).

We've been working on some new ideas and we're planning to reach out to the community within the next few months to find participants for a user study. We'll also be contributing the visualizations back into Jaeger.

A question that has come up repeatedly over the years (I must sound like a broken record) is that of trace datasets. Currently we're working with pretty simple traces generated from DeathStarBench, which is a small microservices benchmark. This dataset is quite limited in terms of trace complexity and diversity, which means we end up making (often incorrect) assumptions as we design and develop visualizations. Would anybody reading this be interested in sharing some "real" trace datasets and examples of metrics you use in practice? We can sign NDAs where needed. If so, please reach out to me! (jcmace@mpi-sws.org)
I wanted to report that @albertteoh has a working POC for this using the Otel Collector. We will be submitting a PR for the new pipeline in the coming weeks, and I'll update this issue. The reason for the delay is competing deadlines, but we'll get back on it, since it's an area that is critical for making tracing operational.

@JonathanMace that sounds interesting. I'll be waiting to hear from you! We have some UX resources here who also want to work on improvements to the visualizations, and I can envision a more usable main page when you log in to Jaeger that leverages the work @albertteoh is doing on ATM. I've been getting resources for some design work aligned internally, and I'll keep the community updated as that progresses.
Requirement - what kind of business use case are you trying to solve?
We are trying to build a way to aggregate traces into metrics for operational use cases you often see in monitoring requirements.
Problem - what in Jaeger blocks you from solving the requirement?
There is no current feature to do this.
Proposal - what do you suggest to solve the problem or improve the existing situation?
As per this document: https://docs.google.com/document/d/1EqIhkm7FLTEJQiDlPDlwR8f2KbJvPx6HMUHKCxQpqLg/edit, which was discussed on the Jaeger community call on 10/16, I agreed to open this issue to track the outstanding issues and have a discussion.
Any open questions to address
We have a few ways to design this feature. The goal would be to export metrics derived from trace aggregation to a Prometheus-compatible backend (remote_write or a scrapable endpoint). In the discussion on the 10/16 call, @yurishkuro suggested this live in Otel rather than in a streaming system post-ingest. This is a good design IMO, but the challenge, after discussing with @albertteoh, is that Otel doesn't allow for mixed telemetry types: you can only export traces from traces, or metrics from metrics. This would not come until Otel v2, which is too far away for us.
We would like to propose modifying the Jaeger exporter to include metric export along with trace export (we would do the same with logz.io exporter).
As a second phase we'd figure out how to use these metrics from within the Jaeger UI for topology views and possibly a more useful homepage, but that will require building some type of MetricQuery service.
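Purely as a hypothetical illustration of that second phase, a MetricQuery service could look something like the following sketch; none of these names exist anywhere yet:

```go
package metricquery

import (
	"context"
	"time"
)

// Params narrows a query to a set of services/operations and a time window.
type Params struct {
	Services   []string
	Operations []string
	Start, End time.Time
	Step       time.Duration
}

// Point is a single (timestamp, value) sample in a series.
type Point struct {
	Timestamp time.Time
	Value     float64
}

// Series is a labelled time series, e.g. latency for one service/operation pair.
type Series struct {
	Labels map[string]string
	Points []Point
}

// Reader is what the Jaeger UI would query to drive topology views and a
// metrics-oriented homepage. Implementations would translate these calls
// into the query language of the chosen backend (e.g. PromQL).
type Reader interface {
	GetCallRates(ctx context.Context, p Params) ([]Series, error)
	GetErrorRates(ctx context.Context, p Params) ([]Series, error)
	GetLatencies(ctx context.Context, p Params, quantile float64) ([]Series, error)
}
```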
Discussion open.
Thanks!