Are the current set of collector metrics adequate? #2165

Closed
objectiser opened this issue Feb 12, 2020 · 25 comments · Fixed by #2431

Comments

@objectiser (Contributor)

Using the following OpenTelemetry collector config (with image built from master):

    receivers:
      jaeger:
        protocols:
          grpc:
            endpoint: "localhost:14250"

    processors:
      queued_retry:

    exporters:
      logging:

    service:
      pipelines:
        traces:
          receivers: [jaeger]
          processors: [queued_retry]
          exporters: [logging]

and using the business-application.yaml to create some test requests, the collector produced the following metrics:

# HELP otelcol_batches_dropped The number of span batches dropped.
# TYPE otelcol_batches_dropped counter
otelcol_batches_dropped{processor="",service="",source_format=""} 0
# HELP otelcol_batches_received The number of span batches received.
# TYPE otelcol_batches_received counter
otelcol_batches_received{processor="queued_retry",service="inventory",source_format="jaeger"} 9
otelcol_batches_received{processor="queued_retry",service="order",source_format="jaeger"} 9
# HELP otelcol_oc_io_process_cpu_seconds CPU seconds for this process
# TYPE otelcol_oc_io_process_cpu_seconds gauge
otelcol_oc_io_process_cpu_seconds 0
# HELP otelcol_oc_io_process_memory_alloc Number of bytes currently allocated in use
# TYPE otelcol_oc_io_process_memory_alloc gauge
otelcol_oc_io_process_memory_alloc 4.582904e+06
# HELP otelcol_oc_io_process_sys_memory_alloc Number of bytes given to the process to use in total
# TYPE otelcol_oc_io_process_sys_memory_alloc gauge
otelcol_oc_io_process_sys_memory_alloc 7.25486e+07
# HELP otelcol_oc_io_process_total_memory_alloc Number of allocations in total
# TYPE otelcol_oc_io_process_total_memory_alloc gauge
otelcol_oc_io_process_total_memory_alloc 6.415736e+06
# HELP otelcol_otelcol_exporter_dropped_spans Counts the number of spans received by the exporter
# TYPE otelcol_otelcol_exporter_dropped_spans counter
otelcol_otelcol_exporter_dropped_spans{otelsvc_exporter="logging",otelsvc_receiver=""} 0
# HELP otelcol_otelcol_exporter_received_spans Counts the number of spans received by the exporter
# TYPE otelcol_otelcol_exporter_received_spans counter
otelcol_otelcol_exporter_received_spans{otelsvc_exporter="logging",otelsvc_receiver=""} 252
# HELP otelcol_otelcol_receiver_dropped_spans Counts the number of spans dropped by the receiver
# TYPE otelcol_otelcol_receiver_dropped_spans counter
otelcol_otelcol_receiver_dropped_spans{otelsvc_receiver="jaeger-collector"} 0
# HELP otelcol_otelcol_receiver_received_spans Counts the number of spans received by the receiver
# TYPE otelcol_otelcol_receiver_received_spans counter
otelcol_otelcol_receiver_received_spans{otelsvc_receiver="jaeger-collector"} 252
# HELP otelcol_queue_latency The "in queue" latency of the successful send operations.
# TYPE otelcol_queue_latency histogram
otelcol_queue_latency_bucket{processor="queued_retry",le="10"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="25"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="50"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="75"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="100"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="250"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="500"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="750"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="1000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="2000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="3000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="4000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="5000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="10000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="20000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="30000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="50000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="+Inf"} 18
otelcol_queue_latency_sum{processor="queued_retry"} 0
otelcol_queue_latency_count{processor="queued_retry"} 18
# HELP otelcol_queue_length Current number of batches in the queued exporter
# TYPE otelcol_queue_length gauge
otelcol_queue_length{processor="queued_retry"} 0
# HELP otelcol_send_latency The latency of the successful send operations.
# TYPE otelcol_send_latency histogram
otelcol_send_latency_bucket{processor="queued_retry",le="10"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="25"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="50"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="75"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="100"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="250"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="500"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="750"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="1000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="2000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="3000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="4000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="5000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="10000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="20000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="30000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="50000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="+Inf"} 18
otelcol_send_latency_sum{processor="queued_retry"} 0
otelcol_send_latency_count{processor="queued_retry"} 18
# HELP otelcol_spans_dropped The number of spans dropped.
# TYPE otelcol_spans_dropped counter
otelcol_spans_dropped{processor="",service="",source_format=""} 0
# HELP otelcol_spans_received The number of spans received.
# TYPE otelcol_spans_received counter
otelcol_spans_received{processor="queued_retry",service="inventory",source_format="jaeger"} 112
otelcol_spans_received{processor="queued_retry",service="order",source_format="jaeger"} 140
# HELP otelcol_success_send The number of successful send operations performed by queued exporter
# TYPE otelcol_success_send counter
otelcol_success_send{processor="queued_retry",service="inventory",source_format="jaeger"} 9
otelcol_success_send{processor="queued_retry",service="order",source_format="jaeger"} 9
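For reference, these are the collector's own internal telemetry metrics; a minimal Prometheus scrape config to collect them might look like the following sketch (assuming the internal metrics are exposed on port 8888, the commonly used default, which is configurable):

    scrape_configs:
      - job_name: 'otelcol-internal'
        scrape_interval: 15s
        static_configs:
          # assumes the collector's default internal telemetry port
          - targets: ['localhost:8888']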
@objectiser (Contributor Author)

A couple of initial comments:

  1. Do we want source_format to be more specific, e.g. jaeger-grpc, jaeger-thrift-...? (A sketch follows below.)
  2. Receiver and exporter metrics don't seem to support the service and source_format labels; only the batches/spans received metrics (associated with the processor) have them - is that an issue?

Some naming issues also need to be sorted out - e.g. the otelcol_otelcol_receiver_... metric name and the otelsvc_receiver="jaeger-collector" label (i.e. consistent use of receiver vs source_format?).
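To illustrate question 1, a more specific source_format could surface as label values like these (hypothetical values, only to show the intent):

otelcol_spans_received{processor="queued_retry",service="order",source_format="jaeger-grpc"} 140
otelcol_spans_received{processor="queued_retry",service="order",source_format="jaeger-thrift-http"} 0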

@pavolloffay changed the title from "Are the current set of collector metrics are adequate?" to "Are the current set of collector metrics adequate?" Feb 12, 2020
@yurishkuro (Member)

I would prefer to create a Google spreadsheet listing all metrics available in Jaeger components and show how they map to OTel metrics. A GitHub ticket is not the best format for that analysis.

The general answer to the two questions above is yes, we want to keep the expressiveness of Jaeger metrics. All existing dimensions were added for a reason; in particular, being able to quantify the different sources and formats of inbound traffic is important for operating a production cluster.

@objectiser (Contributor Author)

Needs more work, but initial mapping is here.

Many of the mappings are not clear at the moment, so I will need to dig into the code a bit to see what they actually represent.

@objectiser (Contributor Author)

The first draft of the metrics comparison is now complete, with comments that need discussion: https://docs.google.com/spreadsheets/d/1W6mGt3w47BlCdxVelnMbc_fL_HE_GC1CzO3Zu6-i83I/edit?usp=sharing

@objectiser (Contributor Author) commented Mar 12, 2020

There are various issues with the metrics, so I want to tackle one specific set first - the jaeger_agent_reporter_(batches|spans)_(submitted|failures) metrics.

The closest equivalent metrics currently produced by OTC, otelcol_(success|fail)_send, are associated with the queued_retry processor. I don't think this is an issue, as we would want the OTC exporter (when used in place of the agent reporter) to be backed by a retry/queuing mechanism.

Assuming that is not a problem, the issues are:

  • only reported for batches currently (should be straightforward to fix)
  • otel metrics additionally identify service as a dimension
  • otel metrics don't identify the protocol - I was thinking the queued_retry processor associated with the pipeline could be named to include the protocol, e.g. processor="queued_retry/jaeger_grpc", so this would be set in the otel collector config if the end user wanted to differentiate the metrics by protocol (a config sketch follows below) - however it may be redundant if the only Jaeger exporter protocol is grpc :)
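A rough sketch of that per-protocol naming in collector config terms (assuming named processor instances of the form type/name; the /jaeger_grpc suffix is just a user-chosen name, not a built-in value):

    processors:
      # the name suffix only serves to differentiate the emitted metrics by protocol
      queued_retry/jaeger_grpc:

    service:
      pipelines:
        traces:
          receivers: [jaeger]
          processors: [queued_retry/jaeger_grpc]
          exporters: [logging]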

cc @jaegertracing/jaeger-maintainers

@yurishkuro (Member)

Doesn’t the protocol label in jaeger refer to the inbound span format?

@objectiser (Contributor Author)

@yurishkuro No, the reporter protocol was extracted from the metric name into a label in 1.9.0.

@yurishkuro (Member)

Yes, I was thinking of the receiver transport; that should be a different metric anyway.

@objectiser (Contributor Author)

@yurishkuro If those metrics seem ok for the agent reporter, I'll create some issues on the OTC repo to deal with the problems outlined?

@yurishkuro (Member)

@objectiser so there are a bunch of red cells in your spreadsheet. Some of them are specific to the jaeger client/agent integrations - what are your thoughts on those? I assume we can keep them out of scope, since OTel SDKs may not even have the same mechanisms.

For clear misses, yes let's file tickets in OTel.

@objectiser (Contributor Author) commented Mar 12, 2020

@yurishkuro I was going to deal with the collector metrics in a separate comment (probably next week) - I wanted to start with the agent reporter ones. I may also raise an issue in the OTC repo about an equivalent metric for jaeger_agent_reporter_batch_size, which would complete the set.

Regarding the jaeger_thrift_udp.... metrics - I wasn't sure about them; if some of them are relevant, then otel equivalents could be added to the jaeger receiver?

@objectiser (Contributor Author)

Reported agent related metrics here: open-telemetry/opentelemetry-collector#662

@pavolloffay transferred this issue from jaegertracing/jaeger-opentelemetry-collector Apr 6, 2020
@ghost added the needs-triage label Apr 6, 2020
@pavolloffay (Member) commented Apr 21, 2020

Adding example metrics recorded by Jaeger with hotrod:

And OTEL metrics receiving data via Jaeger thrift receiver and sending to Jaeger collector (agent mode): https://pastebin.com/X4n9uSJ8

OTEL metrics --legacy-metrics=false --new-metrics=true https://pastebin.com/HRqGJDva
OTEL metrics --legacy-metrics=true --new-metrics=true https://pastebin.com/ebfZ6YV9

@pavolloffay (Member)

Here is the set of new OTEL metrics:

Receiver metrics: accepted/refused

# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent"} 3
# HELP otelcol_receiver_refused_spans Number of spans that could not be pushed into the pipeline.
otelcol_receiver_refused_spans{receiver="jaeger",transport="agent"} 0

Exporter metrics: failed/sent

# HELP otelcol_exporter_send_failed_spans Number of spans in failed attempts to send to destination.
otelcol_exporter_send_failed_spans{exporter="jaeger"} 10
# HELP otelcol_exporter_sent_spans Number of spans successfully sent to destination.
otelcol_exporter_sent_spans{exporter="jaeger"} 3

Processor metrics: accepted spans/batches, dropped spans/batches, refused spans, queue length and latency, send fail, send latency, retry send

# HELP otelcol_processor_accepted_spans Number of spans successfully pushed into the next component in the pipeline.
otelcol_processor_accepted_spans{processor="queued_retry"} 3
# HELP otelcol_processor_batches_received The number of span batches received.
otelcol_processor_batches_received{processor="queued_retry"} 3
# HELP otelcol_processor_dropped_spans Number of spans that were dropped.
otelcol_processor_dropped_spans{processor="queued_retry"} 0
# HELP otelcol_processor_queued_retry_fail_send The number of failed send operations performed by queued_retry processor
otelcol_processor_queued_retry_fail_send{processor="queued_retry"} 10
# HELP otelcol_processor_queued_retry_queue_latency The "in queue" latency of the successful send operations.
otelcol_processor_queued_retry_queue_latency_bucket{processor="queued_retry",le="10"} 2
# HELP otelcol_processor_queued_retry_queue_length Current number of batches in the queue
otelcol_processor_queued_retry_queue_length{processor="queued_retry"} 0
# HELP otelcol_processor_queued_retry_send_latency The latency of the successful send operations.
otelcol_processor_queued_retry_send_latency_bucket{processor="queued_retry",le="10"} 3
# HELP otelcol_processor_queued_retry_success_send The number of successful send operations performed by queued_retry processor
otelcol_processor_queued_retry_success_send{processor="queued_retry"} 3
# HELP otelcol_processor_refused_spans Number of spans that were rejected by the next component in the pipeline.
otelcol_processor_refused_spans{processor="queued_retry"} 0
# HELP otelcol_processor_spans_dropped The number of spans dropped.
otelcol_processor_spans_dropped{processor="queued_retry"} 0
# HELP otelcol_processor_spans_received The number of spans received.
otelcol_processor_spans_received{processor="queued_retry"} 3
# HELP otelcol_processor_trace_batches_dropped The number of span batches dropped.
otelcol_processor_trace_batches_dropped{processor="queued_retry"} 0

@pavolloffay (Member) commented Apr 22, 2020

I have added a second tab to @objectiser's doc - https://docs.google.com/spreadsheets/d/1W6mGt3w47BlCdxVelnMbc_fL_HE_GC1CzO3Zu6-i83I/edit?usp=sharing. It contains a similar comparison, probably with more details.

Here are my findings:
The OTEL metrics look good; there is good coverage for all components. However, Jaeger provides better visibility into which services are reporting spans, and this is completely missing in OTEL.

We should address these things:

  1. Split receiver metrics by service. Jaeger exposes spans_received split by debug, format, service, transport. The transport we already have; the format is not needed as we use only a single format per transport, but we need service and maybe debug? cc @yurishkuro. Add service name dimension to trace metrics: open-telemetry/opentelemetry-collector#857
  2. Split storage metrics by service. Jaeger exposes spans_saved_by_svc split by debug, service, result.
  3. Span average size - exposed in the receiver and also at the exporter (storage), because the size can change. Add average span size metric: open-telemetry/opentelemetry-collector#856
  4. Make transport in receiver metrics more precise. For instance, our agent exposes two endpoints but it is always labeled with agent; this could be changed to agent_compact and agent_binary (see the sketch after this list). Split Jaeger's agent transport metric label: open-telemetry/opentelemetry-collector#859
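As a rough sketch (illustrative label names and values only, not a final design), items 1 and 4 together could make the receiver metrics look like:

otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent_compact",service="order"} 3
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent_binary",service="inventory"} 0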

@yurishkuro (Member)

Jaeger exposes spans_received split by debug,format,service,transport. The transport we already have, the format is not needed as we use only a single format per transport.

But the OTel collector accepts even more formats than Jaeger, so why is format not needed?

@pavolloffay (Member)

The receiver metrics are split by receiver type and transport. The idea here is that each transport supports only a single format.

# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent"} 3

@yurishkuro (Member)

transport="agent" is weird, we have udp vs. grpc, the actual transports

@pavolloffay (Member)

I think I might have a way to split it into two values. What about udp_thrift_compact and udp_thrift_binary?

@yurishkuro (Member)

that would be good & sufficient.

@pavolloffay (Member)

Here is the PR open-telemetry/opentelemetry-collector#859

@pavolloffay (Member) commented Apr 23, 2020

The Zipkin receiver has the same problem: the dimension is only http, but it could be http_json_v1, http_json_v2, http_thrift_v1, or http_proto.
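For example (illustrative values only), the split could surface as:

otelcol_receiver_accepted_spans{receiver="zipkin",transport="http_json_v2"} 5
otelcol_receiver_accepted_spans{receiver="zipkin",transport="http_proto"} 2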

PR to fix the Zipkin metrics open-telemetry/opentelemetry-collector#867

@yurishkuro (Member)

+1

@pavolloffay (Member) commented Jun 29, 2020

The remaining items here are:
