
Discussion for Aggregated Trace Metrics (ATM) #2574

Open
jkowall opened this issue Oct 18, 2020 · 13 comments
Labels
feature vote (Proposed feature that needs 3+ users interested in it)

Comments

@jkowall
Contributor

jkowall commented Oct 18, 2020

Requirement - what kind of business use case are you trying to solve?

We are trying to build a way to aggregate traces into metrics for operational use cases you often see in monitoring requirements.

Problem - what in Jaeger blocks you from solving the requirement?

There is no current feature to do this.

Proposal - what do you suggest to solve the problem or improve the existing situation?

As per this document: https://docs.google.com/document/d/1EqIhkm7FLTEJQiDlPDlwR8f2KbJvPx6HMUHKCxQpqLg/edit, which was discussed on the Jaeger community call on 10/16, I agreed to open this issue to capture the open questions and start a discussion.

Any open questions to address

We have a few ways to design this feature. The goal is to export metrics aggregated from traces to a Prometheus-compatible backend (remote_write or a scrapable endpoint). On the 10/16 call, @yurishkuro suggested this be in Otel rather than in a streaming system post-ingest. This is a good design IMO, but after discussing with @albertteoh, the challenge is that Otel doesn't allow mixed telemetry types: you can only export traces from traces or metrics from metrics. That would not come until Otel v2, which is too far away for us.

We would like to propose modifying the Jaeger exporter to include metric export along with trace export (we would do the same with the logz.io exporter).

As a second phase, we'd figure out how to use these metrics from within the Jaeger UI for topology views and possibly a more useful homepage, but that will require building some type of MetricQuery service.

Discussion open.

Thanks!

@yurishkuro
Member

yurishkuro commented Oct 18, 2020

@yurishkuro suggested this be in Otel rather than in a streaming system post-ingest

Correction: I suggested it to be in the Jaeger collector, not necessarily in the OTEL project.

The reason for the collector is that it is the closest backend component to the source of data. In an architecture that supports tail-based sampling, for example, the collectors have access to the highest-fidelity / least-sampled data, compared to all other stages of the backend processing. Building this on top of Kafka means we'd always have to pay the disk serialization cost, whereas extracting metrics from spans is a lightweight operation that could easily be done before that.

Otel doesn't allow mixed telemetry types: you can only export traces from traces or metrics from metrics.

I still don't follow why this is a problem. A TraceConsumer just takes trace data and does something with it. If that something happens to be extracting metrics and sending them somewhere, what does that have to do with processing pipelines of the OTEL collector? As far as those pipelines are concerned this consumer is a leaf.
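
To make this concrete, here is a minimal sketch of such a leaf consumer. It is written against the collector's current Go pdata API, which postdates this discussion and is an assumption on my part (the 2020-era API named these types differently):

```go
package tracemetrics

import (
	"context"
	"sync"
	"time"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// key identifies one metric series extracted from spans.
type key struct {
	service   string
	operation string
}

// Consumer is a leaf in the trace pipeline: it reads spans, updates
// call counts and latency sums, and never mutates the trace data.
type Consumer struct {
	mu         sync.Mutex
	calls      map[key]int64
	latencySum map[key]time.Duration
}

func NewConsumer() *Consumer {
	return &Consumer{
		calls:      make(map[key]int64),
		latencySum: make(map[key]time.Duration),
	}
}

// ConsumeTraces implements the trace-consumer contract: take trace
// data, do something with it (here: aggregate metrics), and return.
func (c *Consumer) ConsumeTraces(_ context.Context, td ptrace.Traces) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		rs := rss.At(i)
		svc := "unknown"
		if v, ok := rs.Resource().Attributes().Get("service.name"); ok {
			svc = v.Str()
		}
		sss := rs.ScopeSpans()
		for j := 0; j < sss.Len(); j++ {
			spans := sss.At(j).Spans()
			for k := 0; k < spans.Len(); k++ {
				span := spans.At(k)
				id := key{service: svc, operation: span.Name()}
				c.calls[id]++
				c.latencySum[id] += span.EndTimestamp().AsTime().Sub(span.StartTimestamp().AsTime())
			}
		}
	}
	return nil
}
```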

@jpkrohling
Contributor

A TraceConsumer just takes trace data and does something with it.

My thoughts exactly: the processor can just capture the metrics based on the traces from the pipeline and export them as metrics data points.

@albertteoh
Contributor

albertteoh commented Oct 19, 2020

A TraceConsumer just takes trace data and does something with it.

My thoughts exactly: the processor can just capture the metrics based on the traces from the pipeline and export them as metrics data points.

I tried to draw this up (please let me know if it's inaccurate):

+-----------------------------------------------------------------------------------------------------------+ 
|                                         OTEL Collector Pipeline                                           | 
|                            +---------------------+     +----------------------+                           | 
|     +---------------+      |                     |     |     (optional)       |      +---------------+    | 
|     |               |      | agg trace metrics   |     |other trace processors|      |               |    | 
|     |trace receiver |----->|(passthrough traces) |---->|like tail-based       |----->|trace exporter |    | 
|     |               |      |                     |     |sampler               |      |               |    | 
|     +---------------+      +---------------------+     +----------------------+      +---------------+    | 
|                                       |                                                                   | 
+-----------------------------------------------------------------------------------------------------------+ 
                                        v                                                                     
                           +--------------------------+                                                       
                           |                          |                                                       
                           |    a metrics consumer    |                                                       
                           |  (m3, prometheus, etc.)  |                                                       
                           |                          |                                                       
                           +--------------------------+                                                                                        

I like this because it's very simple and we should be able to get something working pretty quickly with the existing OTEL Collector design. It's certainly feasible, especially since (as @yurishkuro mentioned) extracting metrics from spans is lightweight, and the network serde and I/O could be done out of band in a separate goroutine, taking that cost off the trace-processing hot path.
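
For illustration, a minimal sketch of that out-of-band export; the types here are hypothetical, not from any existing Jaeger or OTEL code:

```go
package tracemetrics

import "log"

// dataPoint is a hypothetical aggregated-metric sample extracted from spans.
type dataPoint struct {
	Service, Operation string
	CallCount          int64
	LatencySumNanos    int64
}

// asyncExporter decouples metric shipping from trace processing:
// the hot path only does a non-blocking channel send.
type asyncExporter struct {
	queue chan dataPoint
}

func newAsyncExporter(buffer int, ship func(dataPoint) error) *asyncExporter {
	e := &asyncExporter{queue: make(chan dataPoint, buffer)}
	go func() {
		// Serialization and network I/O happen here, out of band.
		for dp := range e.queue {
			if err := ship(dp); err != nil {
				log.Printf("metric export failed: %v", err)
			}
		}
	}()
	return e
}

// enqueue is called from the trace-processing hot path; it drops the
// point rather than block if the exporter cannot keep up.
func (e *asyncExporter) enqueue(dp dataPoint) {
	select {
	case e.queue <- dp:
	default: // back-pressure policy: drop, never stall trace processing
	}
}
```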

However, I feel it diverges from the design philosophy of OTEL Collectors: an exporter sends data to other systems/backends, and processors simply pre-process data or help with reliable completion of pipelines.

Why this could be an issue:

  • (Usability) It would be unexpected for users to need host:port configuration on a processor when this should be on receiver or exporter config.
  • How will those metrics be sent to the backend? Would we need to implement support for multiple metrics shippers (e.g. a Prometheus scrape endpoint, Prometheus remote_write, Metricbeat, etc.), and if so, would this duplicate work that is already provided (or might be in future) by the open-source community through OTEL Collector exporters?

A question that comes to mind, though (and it's on me to research/learn more about OTEL Collectors), is how we monitor the processors themselves. Do processors need to instrument themselves and expose/ship their metrics? If so, we could perhaps piggyback off that precedent.

@jpkrohling
Contributor

processors simply pre-process data or help with reliable completion of pipelines.

I think the definition in the readme only covers the case of processors acting on data of the same type. Using traces to generate metrics is acceptable, IMO, but you might want to explicitly ask in the Collector SIG.

It would be unexpected for users to need host:port configuration on a processor when this should be on receiver or exporter config.

I had a similar concern with the routingprocessor and with the loadbalancingexporter. In this case, I would just allow OTLP endpoints to be on the receiving end of these metrics. Users can then configure their metrics pipeline to export this to their preferred backend.
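
A sketch of that shape, using today's OTel Go SDK and OTLP gRPC metrics exporter; these APIs postdate this discussion and the endpoint/metric names are assumptions for illustration:

```go
package tracemetrics

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newCallCounter wires extracted span counts to an OTLP endpoint; the
// receiving collector's metrics pipeline decides the final backend.
func newCallCounter(ctx context.Context, otlpEndpoint string) (metric.Int64Counter, error) {
	exp, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint(otlpEndpoint), // e.g. "otel-collector:4317"
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	// A periodic reader pushes accumulated points over OTLP in the background.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
	)
	return provider.Meter("jaeger/atm").Int64Counter("calls_total")
}

// countSpan would be called from the trace pipeline for each span seen.
func countSpan(ctx context.Context, c metric.Int64Counter, service, operation string) {
	c.Add(ctx, 1, metric.WithAttributes(
		attribute.String("service", service),
		attribute.String("operation", operation),
	))
}
```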

Processors are indeed able to instrument themselves. Take a look at the loadbalancingexporter or groupbyprocessor for examples.

@joe-elliott
Member

Do processors need to instrument themselves and expose/ship their metrics? If so, we could perhaps piggyback off that precedent.

Given the volume of metrics that could be created I think that Prometheus remote_write would be the right choice here over exposing a page to scrape. remote_write is well supported amongst tsdb backends and so it would give users options for where to store their data.
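
For reference, a rough sketch of a single remote_write push: the wire format is snappy-compressed protobuf. The package paths (github.com/prometheus/prometheus/prompb, github.com/gogo/protobuf/proto, github.com/golang/snappy) and the calls_total metric name are assumptions for illustration:

```go
package tracemetrics

import (
	"bytes"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// pushSample sends one aggregated sample to a remote_write endpoint,
// e.g. Cortex, Thanos receive, M3, or VictoriaMetrics.
func pushSample(endpoint, service, operation string, callCount float64) error {
	req := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "calls_total"},
				{Name: "service", Value: service},
				{Name: "operation", Value: operation},
			},
			Samples: []prompb.Sample{{
				Value:     callCount,
				Timestamp: time.Now().UnixNano() / int64(time.Millisecond),
			}},
		}},
	}
	raw, err := proto.Marshal(req)
	if err != nil {
		return err
	}
	// remote_write bodies are snappy-compressed protobuf.
	compressed := snappy.Encode(nil, raw)
	httpReq, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(compressed))
	if err != nil {
		return err
	}
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}
```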

@jkowall
Contributor Author

jkowall commented Oct 19, 2020

Given the volume of metrics that could be created I think that Prometheus remote_write would be the right choice here over exposing a page to scrape. remote_write is well supported amongst tsdb backends and so it would give users options for where to store their data.

The downside is that we can't remote_write into Prometheus itself, or maybe we can and I'm just not aware of it. To process metrics in real time (e.g. for Alertmanager), you'd want to scrape as part of your ingest pipeline. I could see both methods being valid depending on users' specific requirements.
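
To illustrate the scrape option, a minimal sketch using client_golang; the metric name, labels, and port are made up for illustration:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// latency holds span latencies aggregated per service and operation.
var latency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "span_latency_seconds",
		Help: "Latency of spans, aggregated per service and operation.",
	},
	[]string{"service", "operation"},
)

func main() {
	prometheus.MustRegister(latency)
	// The trace pipeline would call
	//   latency.WithLabelValues(svc, op).Observe(d.Seconds())
	// as spans pass through; Prometheus then scrapes /metrics, and
	// Alertmanager rules can fire on the result in near real time.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8889", nil))
}
```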

If we want to use these metrics inside the Jaeger UI, we also need to build a service and determine how to support multiple backends, but at least PromQL would be the query language.
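
A sketch of what such a MetricQuery service might do under the hood, assuming PromQL and the official Prometheus Go client; the query and metric names are hypothetical:

```go
package metricsquery

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// ErrorRate asks the metrics backend (anything speaking the Prometheus
// HTTP API) for a service's error rate over the last five minutes.
func ErrorRate(ctx context.Context, promAddr, service string) (string, error) {
	client, err := api.NewClient(api.Config{Address: promAddr})
	if err != nil {
		return "", err
	}
	q := fmt.Sprintf(
		`sum(rate(calls_total{service=%q,status="error"}[5m])) / sum(rate(calls_total{service=%q}[5m]))`,
		service, service,
	)
	// Query returns the evaluated PromQL result plus any warnings.
	result, _, err := v1.NewAPI(client).Query(ctx, q, time.Now())
	if err != nil {
		return "", err
	}
	return result.String(), nil
}
```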

@jpkrohling
Contributor

Given the volume of metrics that could be created I think that Prometheus remote_write would be the right choice here over exposing a page to scrape.

Why not export to OTLP metrics, and let users configure their exporters the way they need it?

@jkowall
Contributor Author

jkowall commented Oct 20, 2020

Correction: I suggested it to be in the Jaeger collector, not necessarily in the OTEL project.

@yurishkuro I was under the impression that the Jaeger collector would be replaced by the Otel Collector, hence we should be focusing our efforts on Otel rather than Jaeger.

Based on this discussion, we should look at Otel and utilize OTLP to create metrics, which can then be sent to any backend (thanks for the great advice, @jpkrohling). We just need to determine how to use this data within the Jaeger UI eventually.

Thanks for the discussion.

@yurishkuro
Member

the Jaeger collector would be replaced by the Otel Collector

Yes and no. The Jaeger collector will be based on the OTEL Collector, but I think it will always remain a custom build, especially because it includes implementations of direct-to-db exporters.

@albertteoh
Contributor

Why not export to OTLP metrics, and let users configure their exporters the way they need it?

@jpkrohling, for my understanding, why specifically OTLP metrics?

Could these metrics, theoretically, be shipped in any other format (OpenCensus, Prometheus, etc.), so long as there is an OTEL receiver available to accept the data and normalise it to the OTEL metrics data model? Then users can, as you say, configure their exporters the way they need.

@jpkrohling
Contributor

for my understanding, why specifically OTLP metrics?

The otlp receiver is more likely to exist in an OpenTelemetry Collector than other receivers. And if it's not there, I think it's more reasonable to require it to exist than, say, the OpenCensus receiver.

@JonathanMace

JonathanMace commented Oct 23, 2020

Wanted to flag up my interest in this topic in general. My research group at MPI-SWS is currently doing some work on trace visualization and aggregate analysis, and our interests extend into the systems side of driving visualizations. Our starting point at this moment is mentioned in @jkowall's google doc writeup:

Aside from getting ATMs into Prometheus or other metric systems, we would ideally like to include or overlay metrics on the dependency view, or maybe a more usable Jaeger homepage which shows operational status of the services being monitored

We've been looking at this in a few ways: (a) visualizations that extend single-trace visualizations to provide aggregate context; and (b) visualizations for sets of traces in general. Metrics are a big part of this, and we are also interested in how to represent structural aggregations. We have also been thinking about the backends needed to drive these visualizations. In case anybody is interested, a PhD student in my group (Vaastav Anand) prototyped and wrote up some ideas for a class project earlier this year (link)

We've been working on some new ideas and we're planning to reach out to the community within the next few months to find participants for a user study. We'll also be contributing the visualizations back into Jaeger.

A question that has come up repeatedly over the years (I must sound like a broken record) is that of trace datasets. Currently we're working with fairly simple traces generated from DeathStarBench, a small microservices benchmark. This dataset is quite limited in terms of trace complexity and diversity, which means we end up making (often incorrect) assumptions as we design and develop visualizations.

Would anybody reading this be interested in sharing some "real" trace datasets and examples of metrics you use in practice? We can sign NDAs where needed. If so please reach out to me! (jcmace@mpi-sws.org)

@jkowall
Contributor Author

jkowall commented Oct 27, 2020

I wanted to report that @albertteoh has a working POC for this using the Otel Collector. We will be submitting a PR for the new pipeline in the coming weeks, and I'll update this issue. The reason for the delay is competing deadlines, but we'll get back to it, since this area is critical for making tracing operational.

@JonathanMace that sounds interesting. I'll be waiting to hear from you!

We have some UX resources here who also want to work on improving the visualizations, and I can envision a more usable main page when you log in to Jaeger that leverages the work @albertteoh is doing on ATM. I've been getting resources aligned internally for some design work, and I'll keep the community updated as that progresses.

@yurishkuro added the feature vote (Proposed feature that needs 3+ users interested in it) label and removed the needs-triage label on Jan 10, 2021