OOM caused by span metrics connector #21290

Closed
alburthoffman opened this issue May 1, 2023 · 14 comments

@alburthoffman
Contributor

Component(s)

connector/spanmetrics

Describe the issue you're reporting

We tried to switch from the span metrics processor to the span metrics connector and found an OOM issue.

Below is the pod heap memory after using the span metrics connector. The pod traffic is around 20K spans per second:
[screenshot: pod heap memory with the span metrics connector]

Before this, the span metrics processor was quite stable:
[screenshot: pod heap memory with the span metrics processor]

The profile shows that pmap takes a lot of memory for the span metrics connector,
[screenshot: memory profile]

which is in createAttributes.
[screenshot: profile detail showing createAttributes]

alburthoffman added the needs triage label on May 1, 2023
@github-actions
Contributor

github-actions bot commented May 1, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@alburthoffman
Contributor Author

@albertteoh @kovrus please help

atoulme added the bug label and removed the needs triage label on May 2, 2023
@github-actions
Contributor

github-actions bot commented Jul 3, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Jul 3, 2023
@albertteoh
Contributor

@alburthoffman apologies for the slow response.

I suggest trying the following:

  • Inspect the metrics generated by the spanmetrics connector to see if there are any high-cardinality metrics.
  • Reduce the dimensions_cache_size.
  • It looks like the explicit histogram metrics are taking up most of the memory. Try reducing the number of dimensions and/or histogram buckets (a config sketch follows below).
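
For readers hitting the same problem, here is a minimal sketch of that kind of tuning; the cache size, bucket boundaries, and dimension name below are illustrative values, not recommendations:

```yaml
connectors:
  spanmetrics:
    # A smaller cache bounds how many dimension sets are kept in memory (default 1000).
    dimensions_cache_size: 500
    histogram:
      explicit:
        # Fewer explicit buckets means fewer data points per dimension set.
        buckets: [5ms, 50ms, 500ms, 5s]
    # Keep only low-cardinality dimensions.
    dimensions:
      - name: http.method
```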

github-actions bot removed the Stale label on Jul 8, 2023
@github-actions
Contributor

github-actions bot commented Sep 7, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Sep 7, 2023
@vjsamuel

@alburthoffman please try these and get back to @albertteoh

@alburthoffman
Contributor Author

@albertteoh the old span metrics processor can handle the same config without an OOM issue, so it should not be a config issue.

github-actions bot removed the Stale label on Sep 11, 2023
@albertteoh
Contributor

@albertteoh the old span metrics processor can handle the same config without an OOM issue, so it should not be a config issue.

Thanks for confirming, @alburthoffman. Is this something that can be reproduced locally, by any chance?

@aishyandapalli
Contributor

@albertteoh I have done a quick round of testing for this one. When we enable histogram metrics in the spanmetrics connector, memory usage is very high. I tried disabling the histogram metrics and memory usage was very low. Reducing the number of buckets lowered memory usage a bit but didn't help much.

Can you please help identify the root cause?
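
For context, a sketch of the histogram-off setup described in the test above; the disable flag under histogram is assumed from newer versions of the connector, so check it against the README of the release in use:

```yaml
connectors:
  spanmetrics:
    histogram:
      # Assumed flag: turns off histogram generation, leaving only the calls counter.
      disable: true
```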

@alburthoffman
Contributor Author

We did a new round of testing and still see this issue:

[screenshot: heap memory from the new round of testing]

@portertech
Contributor

@alburthoffman does this issue persist with the latest release?

Related: I wonder if further reducing/aggregating on fewer resource attributes would help; see #29711 (a sketch of one way to do this follows below).
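
One possible way to do that is to drop high-churn resource attributes in the traces pipeline before it feeds the connector. A sketch using the transform processor, where the attribute key is only an example and the receiver/exporter/connector definitions are elided:

```yaml
processors:
  transform:
    trace_statements:
      - context: resource
        statements:
          # Drop a high-churn resource attribute before spans reach the connector (example key).
          - delete_key(attributes, "process.pid")

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform]
      exporters: [spanmetrics]
```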

@github-actions

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Feb 19, 2024
@aishyandapalli
Contributor

This issue is resolved. We have added a config option to limit the number of exemplars added for sum metrics in the spanmetrics connector. It's merged and deployed. @alburthoffman, we can close this issue.
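
For anyone landing here later, a sketch of that setting; the max_per_data_point field name reflects the connector's exemplars config as I understand it, so verify it against the README of the release you run:

```yaml
connectors:
  spanmetrics:
    exemplars:
      enabled: true
      # Caps how many exemplars are retained per data point so they cannot grow without bound.
      max_per_data_point: 5
```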

@alburthoffman
Contributor Author

Thx @aishyandapalli
