Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics SDK produces ValueError sometimes #4093

Closed
pmcollins opened this issue Jul 30, 2024 · 7 comments · Fixed by #4154
Closed

Metrics SDK produces ValueError sometimes #4093

pmcollins opened this issue Jul 30, 2024 · 7 comments · Fixed by #4154
Assignees

Comments

@pmcollins
Copy link
Member

pmcollins commented Jul 30, 2024

Describe your environment

OS: Intel Mac
Python version: 3.8
SDK version: latest main
API version: latest main

What happened?

When I use opentelemetry-instrument on a script that sends a single span, the metrics SDK throws a ValueError on around half the runs. The script just gets a tracer and does a single start_as_current_span, but I believe opentelemetry-system-metrics is what runs in the background collecting and sending metrics.

Steps to Reproduce

Use opentelemetry-instrument to run a script containing something like the following

tracer = trace.get_tracer("my-tracer")
  with tracer.start_as_current_span("my-span"):
    print("hello")      

Expected Result

No ValueError exceptions.

Actual Result

(.venv) otel@C02CT31XMD6R opentelemetry-python-contrib % opentelemetry-instrument python trace_loop_grpc.py
Exception while exporting metrics Value out of range: 13228496505721087519
Traceback (most recent call last):
  File "/Users/otel/github/open-telemetry/opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/export/__init__.py", line 541, in _receive_metrics
    self._exporter.export(
  File "/Users/otel/github/open-telemetry/opentelemetry-python/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/metric_exporter/__init__.py", line 160, in export
    return self._export(data=metrics_data)
  File "/Users/otel/github/open-telemetry/opentelemetry-python/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py", line 263, in _export
    request=self._translate_data(data),
  File "/Users/otel/github/open-telemetry/opentelemetry-python/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/metric_exporter/__init__.py", line 150, in _translate_data
    return encode_metrics(data)
  File "/Users/otel/github/open-telemetry/opentelemetry-python/exporter/opentelemetry-exporter-otlp-proto-common/src/opentelemetry/exporter/otlp/proto/common/_internal/metrics_encoder/__init__.py", line 260, in encode_metrics
    pt.as_int = data_point.value
ValueError: Value out of range: 13228496505721087519
(.venv) otel@C02CT31XMD6R opentelemetry-python-contrib % 

Additional context

On one run, I was able to print the metric before the ValueError and got:

Metric(name='process.runtime.cpython.context_switches', description='Runtime context switches', unit='switches', data=Sum(data_points=[NumberDataPoint(attributes={'type': 'involuntary'}, start_time_unix_nano=1722370176730749000, time_unix_nano=1722370176730845000, value=0), NumberDataPoint(attributes={'type': 'voluntary'}, start_time_unix_nano=1722370176730749000, time_unix_nano=1722370176730845000, value=10484362086929426074)], aggregation_temporality=<AggregationTemporality.CUMULATIVE: 2>, is_monotonic=True))

The value that caused the error in this case was 10484362086929426074, which is about 10^19.

Would you like to implement a fix?

None

@pmcollins pmcollins added the bug Something isn't working label Jul 30, 2024
@aabmass
Copy link
Member

aabmass commented Jul 31, 2024

Since the protobuf field is sfixed64, possibly the exporter should handle this case eagerly or somehow convert it to be in the correct range.

However, the metric process.runtime.cpython.context_switches is coming from here and that seems like an impossibly large number of context switches. Maybe this is a bug in psutil for MacOS?

@aabmass
Copy link
Member

aabmass commented Jul 31, 2024

@pmcollins any way you can check the output of psutil.Process(os.getpid()).num_ctx_switches() when this happens and see if it an SDK issue or psutil issue?

@pmcollins
Copy link
Member Author

pmcollins commented Jul 31, 2024

Just running it at the python console from my Intel Mac I get

>>> psutil.Process(os.getpid()).num_ctx_switches()
pctxsw(voluntary=3168756596733903550, involuntary=0)

So maybe a psutil issue (?), but also maybe an issue in the SDK with how we handle too big of numbers.

@aabmass
Copy link
Member

aabmass commented Aug 8, 2024

Ya that seems unexpected on the psutil side. For reference, on my ARM Mac:

$ python3 -c 'import psutil; import os; print(psutil.Process(os.getpid()).num_ctx_switches())'
pctxsw(voluntary=33, involuntary=0)

@emdneto
Copy link
Member

emdneto commented Aug 8, 2024

Same here ARM Mac

python3 -c 'import psutil; import os; print(psutil.Process(os.getpid()).num_ctx_switches())'
pctxsw(voluntary=26, involuntary=0)

@pmcollins pmcollins removed the bug Something isn't working label Aug 16, 2024
@pmcollins
Copy link
Member Author

I'd like to work on improving the logging for this scenario. I've removed the bug label because I'm not sure it is one.

@lzchen
Copy link
Contributor

lzchen commented Aug 16, 2024

@pmcollins

Would you like to be assigned to the issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment