Are the current set of collector metrics adequate? #2165

Closed
objectiser opened this issue Feb 12, 2020 · 25 comments · Fixed by #2431

Comments

@objectiser (Contributor)

Using the following OpenTelemetry collector config (with image built from master):

    receivers:
      jaeger:
        protocols:
          grpc:
            endpoint: "localhost:14250"

    processors:
      queued_retry:

    exporters:
      logging:

    service:
      pipelines:
        traces:
          receivers: [jaeger]
          processors: [queued_retry]
          exporters: [logging]

and using the business-application.yaml to create some test requests, the collector produced the following metrics:

# HELP otelcol_batches_dropped The number of span batches dropped.
# TYPE otelcol_batches_dropped counter
otelcol_batches_dropped{processor="",service="",source_format=""} 0
# HELP otelcol_batches_received The number of span batches received.
# TYPE otelcol_batches_received counter
otelcol_batches_received{processor="queued_retry",service="inventory",source_format="jaeger"} 9
otelcol_batches_received{processor="queued_retry",service="order",source_format="jaeger"} 9
# HELP otelcol_oc_io_process_cpu_seconds CPU seconds for this process
# TYPE otelcol_oc_io_process_cpu_seconds gauge
otelcol_oc_io_process_cpu_seconds 0
# HELP otelcol_oc_io_process_memory_alloc Number of bytes currently allocated in use
# TYPE otelcol_oc_io_process_memory_alloc gauge
otelcol_oc_io_process_memory_alloc 4.582904e+06
# HELP otelcol_oc_io_process_sys_memory_alloc Number of bytes given to the process to use in total
# TYPE otelcol_oc_io_process_sys_memory_alloc gauge
otelcol_oc_io_process_sys_memory_alloc 7.25486e+07
# HELP otelcol_oc_io_process_total_memory_alloc Number of allocations in total
# TYPE otelcol_oc_io_process_total_memory_alloc gauge
otelcol_oc_io_process_total_memory_alloc 6.415736e+06
# HELP otelcol_otelcol_exporter_dropped_spans Counts the number of spans received by the exporter
# TYPE otelcol_otelcol_exporter_dropped_spans counter
otelcol_otelcol_exporter_dropped_spans{otelsvc_exporter="logging",otelsvc_receiver=""} 0
# HELP otelcol_otelcol_exporter_received_spans Counts the number of spans received by the exporter
# TYPE otelcol_otelcol_exporter_received_spans counter
otelcol_otelcol_exporter_received_spans{otelsvc_exporter="logging",otelsvc_receiver=""} 252
# HELP otelcol_otelcol_receiver_dropped_spans Counts the number of spans dropped by the receiver
# TYPE otelcol_otelcol_receiver_dropped_spans counter
otelcol_otelcol_receiver_dropped_spans{otelsvc_receiver="jaeger-collector"} 0
# HELP otelcol_otelcol_receiver_received_spans Counts the number of spans received by the receiver
# TYPE otelcol_otelcol_receiver_received_spans counter
otelcol_otelcol_receiver_received_spans{otelsvc_receiver="jaeger-collector"} 252
# HELP otelcol_queue_latency The "in queue" latency of the successful send operations.
# TYPE otelcol_queue_latency histogram
otelcol_queue_latency_bucket{processor="queued_retry",le="10"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="25"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="50"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="75"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="100"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="250"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="500"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="750"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="1000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="2000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="3000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="4000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="5000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="10000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="20000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="30000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="50000"} 18
otelcol_queue_latency_bucket{processor="queued_retry",le="+Inf"} 18
otelcol_queue_latency_sum{processor="queued_retry"} 0
otelcol_queue_latency_count{processor="queued_retry"} 18
# HELP otelcol_queue_length Current number of batches in the queued exporter
# TYPE otelcol_queue_length gauge
otelcol_queue_length{processor="queued_retry"} 0
# HELP otelcol_send_latency The latency of the successful send operations.
# TYPE otelcol_send_latency histogram
otelcol_send_latency_bucket{processor="queued_retry",le="10"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="25"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="50"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="75"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="100"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="250"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="500"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="750"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="1000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="2000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="3000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="4000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="5000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="10000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="20000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="30000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="50000"} 18
otelcol_send_latency_bucket{processor="queued_retry",le="+Inf"} 18
otelcol_send_latency_sum{processor="queued_retry"} 0
otelcol_send_latency_count{processor="queued_retry"} 18
# HELP otelcol_spans_dropped The number of spans dropped.
# TYPE otelcol_spans_dropped counter
otelcol_spans_dropped{processor="",service="",source_format=""} 0
# HELP otelcol_spans_received The number of spans received.
# TYPE otelcol_spans_received counter
otelcol_spans_received{processor="queued_retry",service="inventory",source_format="jaeger"} 112
otelcol_spans_received{processor="queued_retry",service="order",source_format="jaeger"} 140
# HELP otelcol_success_send The number of successful send operations performed by queued exporter
# TYPE otelcol_success_send counter
otelcol_success_send{processor="queued_retry",service="inventory",source_format="jaeger"} 9
otelcol_success_send{processor="queued_retry",service="order",source_format="jaeger"} 9
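For reference, these are the collector's own internal telemetry metrics; a minimal Prometheus scrape config to collect them might look like the following sketch (assuming the internal metrics are exposed on port 8888, the commonly used default, which is configurable):

    scrape_configs:
      - job_name: 'otelcol-internal'
        scrape_interval: 15s
        static_configs:
          # assumes the collector's default internal telemetry port
          - targets: ['localhost:8888']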
@objectiser (Contributor Author)

A couple of initial comments:

  1. Do we want source_format to be more specific, e.g. jaeger-grpc, jaeger-thrift-...? (A sketch follows below.)
  2. Receiver and exporter metrics don't seem to support the service and source_format labels; only the batches/spans received metrics (associated with the processor) have them - is that an issue?

Some naming issues also need to be sorted out - e.g. the otelcol_otelcol_receiver_... metric name and the otelsvc_receiver="jaeger-collector" label (i.e. consistent use of receiver vs source_format?).
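To illustrate question 1, a more specific source_format could surface as label values like these (hypothetical values, only to show the intent):

otelcol_spans_received{processor="queued_retry",service="order",source_format="jaeger-grpc"} 140
otelcol_spans_received{processor="queued_retry",service="order",source_format="jaeger-thrift-http"} 0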

@pavolloffay changed the title from "Are the current set of collector metrics are adequate?" to "Are the current set of collector metrics adequate?" Feb 12, 2020
@yurishkuro (Member)

I would prefer to create a Google spreadsheet listing all metrics available in Jaeger components and show how they map to OTel metrics. A GitHub ticket is not the best format for that analysis.

The general answer to the two questions above is yes, we want to keep the expressiveness of Jaeger metrics. All existing dimensions were added for a reason; in particular, being able to quantify the different sources and formats of inbound traffic is important for operating a production cluster.

@objectiser (Contributor Author)

Needs more work, but initial mapping is here.

Many of the mappings are not clear at the moment, so I will need to dig into the code a bit to see what they actually represent.

@objectiser (Contributor Author)

The first draft of the metrics comparison is now complete, with comments that need discussion: https://docs.google.com/spreadsheets/d/1W6mGt3w47BlCdxVelnMbc_fL_HE_GC1CzO3Zu6-i83I/edit?usp=sharing

@objectiser (Contributor Author) commented Mar 12, 2020

There are various issues with the metrics, so I want to tackle one specific set first - the jaeger_agent_reporter_(batches|spans)_(submitted|failures) metrics.

The closest equivalent metrics currently produced by OTC, otelcol_(success|fail)_send, are associated with the queued_retry processor. I don't think this is an issue, as we would want the OTC exporter (when used in place of the agent reporter) to be backed by a retry/queuing mechanism.

Assuming that is not a problem, the issues are:

  • only reported for batches currently (should be straightforward to fix)
  • otel metrics additionally identify service as a dimension
  • otel metrics don't identify the protocol - I was thinking the queued_retry processor associated with the pipeline could be named to include the protocol, e.g. processor="queued_retry/jaeger_grpc", so this would be set in the otel collector config if the end user wanted to differentiate the metrics by protocol (a config sketch follows below) - however it may be redundant if the only Jaeger exporter protocol is grpc :)
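A rough sketch of that per-protocol naming in collector config terms (assuming named processor instances of the form type/name; the /jaeger_grpc suffix is just a user-chosen name, not a built-in value):

    processors:
      # the name suffix only serves to differentiate the emitted metrics by protocol
      queued_retry/jaeger_grpc:

    service:
      pipelines:
        traces:
          receivers: [jaeger]
          processors: [queued_retry/jaeger_grpc]
          exporters: [logging]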

cc @jaegertracing/jaeger-maintainers

@yurishkuro (Member)

Doesn’t the protocol label in jaeger refer to the inbound span format?

@objectiser (Contributor Author)

@yurishkuro No, the reporter protocol was extracted from the metric name into a label in 1.9.0.

@yurishkuro (Member)

Yes, I was thinking of the receiver transport; that should be a different metric anyway.

@objectiser (Contributor Author)

@yurishkuro If those metrics seem ok for the agent reporter, I'll create some issues on the OTC repo to deal with the problems outlined?

@yurishkuro (Member)

@objectiser so there are a bunch of red cells in your spreadsheet. Some of them are specific to the jaeger client/agent integrations - what are your thoughts on those? I assume we can keep them out of scope, since OTel SDKs may not even have the same mechanisms.

For clear misses, yes let's file tickets in OTel.

@objectiser (Contributor Author) commented Mar 12, 2020

@yurishkuro I was going to deal with the collector metrics in a separate comment (probably next week) - I wanted to start with the agent reporter ones. I may also raise an issue in the OTC repo about an equivalent metric for jaeger_agent_reporter_batch_size, which would complete the set.

Regarding the jaeger_thrift_udp.... metrics - I wasn't sure about them; if some of them are relevant, then otel equivalents could be added to the jaeger receiver?

@objectiser (Contributor Author)

Reported agent related metrics here: open-telemetry/opentelemetry-collector#662

@pavolloffay transferred this issue from jaegertracing/jaeger-opentelemetry-collector Apr 6, 2020
@ghost added the needs-triage label Apr 6, 2020
@pavolloffay (Member) commented Apr 21, 2020

Adding example metrics recorded by Jaeger with hotrod:

And OTEL metrics receiving data via Jaeger thrift receiver and sending to Jaeger collector (agent mode): https://pastebin.com/X4n9uSJ8

OTEL metrics --legacy-metrics=false --new-metrics=true https://pastebin.com/HRqGJDva
OTEL metrics --legacy-metrics=true --new-metrics=true https://pastebin.com/ebfZ6YV9

@pavolloffay (Member)

Here is the set of new OTEL metrics:

Receiver metrics: accepted/refused

# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent"} 3
# HELP otelcol_receiver_refused_spans Number of spans that could not be pushed into the pipeline.
otelcol_receiver_refused_spans{receiver="jaeger",transport="agent"} 0

Exporter metrics: failed/sent

# HELP otelcol_exporter_send_failed_spans Number of spans in failed attempts to send to destination.
otelcol_exporter_send_failed_spans{exporter="jaeger"} 10
# HELP otelcol_exporter_sent_spans Number of spans successfully sent to destination.
otelcol_exporter_sent_spans{exporter="jaeger"} 3

Processor metrics: accepted spans/batches, dropped spans/batches, refused spans, queue length and latency, send fail, send latency, retry send

# HELP otelcol_processor_accepted_spans Number of spans successfully pushed into the next component in the pipeline.
otelcol_processor_accepted_spans{processor="queued_retry"} 3
# HELP otelcol_processor_batches_received The number of span batches received.
otelcol_processor_batches_received{processor="queued_retry"} 3
# HELP otelcol_processor_dropped_spans Number of spans that were dropped.
otelcol_processor_dropped_spans{processor="queued_retry"} 0
# HELP otelcol_processor_queued_retry_fail_send The number of failed send operations performed by queued_retry processor
otelcol_processor_queued_retry_fail_send{processor="queued_retry"} 10
# HELP otelcol_processor_queued_retry_queue_latency The "in queue" latency of the successful send operations.
otelcol_processor_queued_retry_queue_latency_bucket{processor="queued_retry",le="10"} 2
# HELP otelcol_processor_queued_retry_queue_length Current number of batches in the queue
otelcol_processor_queued_retry_queue_length{processor="queued_retry"} 0
# HELP otelcol_processor_queued_retry_send_latency The latency of the successful send operations.
otelcol_processor_queued_retry_send_latency_bucket{processor="queued_retry",le="10"} 3
# HELP otelcol_processor_queued_retry_success_send The number of successful send operations performed by queued_retry processor
otelcol_processor_queued_retry_success_send{processor="queued_retry"} 3
# HELP otelcol_processor_refused_spans Number of spans that were rejected by the next component in the pipeline.
otelcol_processor_refused_spans{processor="queued_retry"} 0
# HELP otelcol_processor_spans_dropped The number of spans dropped.
otelcol_processor_spans_dropped{processor="queued_retry"} 0
# HELP otelcol_processor_spans_received The number of spans received.
otelcol_processor_spans_received{processor="queued_retry"} 3
# HELP otelcol_processor_trace_batches_dropped The number of span batches dropped.
otelcol_processor_trace_batches_dropped{processor="queued_retry"} 0

@pavolloffay (Member) commented Apr 22, 2020

I have added a second tab to @objectiser's doc - https://docs.google.com/spreadsheets/d/1W6mGt3w47BlCdxVelnMbc_fL_HE_GC1CzO3Zu6-i83I/edit?usp=sharing. It contains a similar comparison, probably with more details.

Here are my findings:
The OTEL metrics look good; there is good coverage for all components. However, Jaeger provides better visibility into which services are reporting spans, and this is completely missing in OTEL.

We should address these things:

  1. Split receiver metrics by service. Jaeger exposes spans_received split by debug, format, service, transport. The transport we already have; the format is not needed as we use only a single format per transport, but we need service and maybe debug? cc @yurishkuro. Add service name dimension to trace metrics: open-telemetry/opentelemetry-collector#857
  2. Split storage metrics by service. Jaeger exposes spans_saved_by_svc split by debug, service, result.
  3. Span average size - exposed in the receiver and also at the exporter (storage), because the size can change. Add average span size metric: open-telemetry/opentelemetry-collector#856
  4. Make transport in receiver metrics more precise. For instance, our agent exposes two endpoints but it is always labeled with agent; this could be changed to agent_compact and agent_binary (see the sketch after this list). Split Jaeger's agent transport metric label: open-telemetry/opentelemetry-collector#859
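As a rough sketch (illustrative label names and values only, not a final design), items 1 and 4 together could make the receiver metrics look like:

otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent_compact",service="order"} 3
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent_binary",service="inventory"} 0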

@yurishkuro (Member)

Jaeger exposes spans_received split by debug,format,service,transport. The transport we already have, the format is not needed as we use only a single format per transport.

But the OTel collector accepts even more formats than Jaeger, so why is format not needed?

@pavolloffay (Member)

The receiver metrics are split by receiver type and transport. The idea here is that each transport supports only a single format.

# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
otelcol_receiver_accepted_spans{receiver="jaeger",transport="agent"} 3

@yurishkuro (Member)

transport="agent" is weird, we have udp vs. grpc, the actual transports

@pavolloffay (Member)

I think I might have a way to split it into two values. What about udp_thrift_compact and udp_thrift_binary?

@yurishkuro (Member)

that would be good & sufficient.

@pavolloffay (Member)

Here is the PR open-telemetry/opentelemetry-collector#859

@pavolloffay (Member) commented Apr 23, 2020

The Zipkin receiver has the same problem: the dimension is only http, but it could be http_json_v1, http_json_v2, http_thrift_v1, or http_proto.
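For example (illustrative values only), the split could surface as:

otelcol_receiver_accepted_spans{receiver="zipkin",transport="http_json_v2"} 5
otelcol_receiver_accepted_spans{receiver="zipkin",transport="http_proto"} 2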

PR to fix the Zipkin metrics open-telemetry/opentelemetry-collector#867

@yurishkuro (Member)

+1

@pavolloffay (Member) commented Jun 29, 2020

The remaining items here are:
