Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributor: Add limits for Otel write request byte size #8574

Merged
merged 12 commits into from
Jul 4, 2024

Conversation

ying-jeanne
Copy link
Contributor

@ying-jeanne ying-jeanne commented Jul 1, 2024

What this PR does

Which issue(s) this PR fixes or relates to

The current distributor message limit is 100MB See doc, Which demonstrates too much for compressed Otel write request. This could cause problem for otel, end up in big decompressed message and timeout. Knowing that otlp collector could have different compression algorithm and also can also be configured to be no compression. See doc, introduce new uncompressed request byte size in order to reject abusive write request as early as possible.

Why not using existing distributor.max-recv-msg-size?

  • The remote write and otlp request doesn't write request with same format, doesn't contains same quantity samples when requests are the same size.
  • In the same container, we can have several users who use otel and remote write during the same time, they need to be limited with different values

About default value

  • The current big request is rejected by 413 error code, which is not retryable, just to be on the safe side, put 100M as limit, we saw 100MiB request end up executing 10s. This needs to be carefully tested/observed in order to find the final sweet default value.

Adding runbook dedicated to the error

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@ying-jeanne ying-jeanne force-pushed the max-massage-request-bytes branch 7 times, most recently from 40c6a2e to 2f4d3d9 Compare July 3, 2024 09:32
@ying-jeanne ying-jeanne marked this pull request as ready for review July 3, 2024 10:00
@ying-jeanne ying-jeanne requested review from tacole02 and a team as code owners July 3, 2024 10:00
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this.

pkg/distributor/distributor.go Outdated Show resolved Hide resolved
pkg/distributor/distributor.go Outdated Show resolved Hide resolved
pkg/distributor/distributor.go Outdated Show resolved Hide resolved
pkg/distributor/otel.go Outdated Show resolved Hide resolved
pkg/distributor/otel.go Outdated Show resolved Hide resolved
pkg/distributor/otel_test.go Outdated Show resolved Hide resolved
pkg/distributor/push.go Outdated Show resolved Hide resolved
pkg/distributor/push.go Outdated Show resolved Hide resolved
pkg/util/globalerror/user.go Outdated Show resolved Hide resolved
Copy link
Contributor

@aknuds1 aknuds1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see comments. I did a quick review since I saw some fundamental problems. Please ping me for another review when my comments have been addressed.

CHANGELOG.md Outdated Show resolved Hide resolved
docs/sources/mimir/configure/about-versioning.md Outdated Show resolved Hide resolved
pkg/distributor/distributor.go Outdated Show resolved Hide resolved
pkg/distributor/distributor.go Outdated Show resolved Hide resolved
pkg/distributor/otel.go Outdated Show resolved Hide resolved
pkg/distributor/otel_test.go Outdated Show resolved Hide resolved
pkg/distributor/otel_test.go Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
pkg/distributor/push.go Outdated Show resolved Hide resolved
Copy link
Contributor

@tacole02 tacole02 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating these docs!

CHANGELOG.md Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
@ying-jeanne ying-jeanne force-pushed the max-massage-request-bytes branch 4 times, most recently from 8534200 to 51e14cd Compare July 3, 2024 16:52

How it **works**:

- The distributor implements an upper limit on the message size of incoming OTel write requests after decompression regardless of the compression type. Refer to [OTLP collector compression details](https://github.com/open-telemetry/opentelemetry-collector/blob/main/config/configgrpc/README.md#client-configuration) for more information.
Copy link
Member

@pstibrany pstibrany Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referenced documentation seems irrelevant here, because it points to gRPC client config, but Mimir doesn't accept OTLP/gRPC requests.

I don't think we need to mention "compression" at all in this runbook. It doesn't do any harm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch, gracais

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated link to confighttp

pkg/distributor/push.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@@ -237,6 +240,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet, logger log.Logger) {
cfg.RetryConfig.RegisterFlags(f)

f.IntVar(&cfg.MaxRecvMsgSize, "distributor.max-recv-msg-size", 100<<20, "Max message size in bytes that the distributors will accept for incoming push requests to the remote write API. If exceeded, the request will be rejected.")
f.IntVar(&cfg.MaxOTLPRequestSize, maxOTLPRequestSizeFlag, 80<<20, "Maximum OTLP request size in bytes that the distributors accept. Requests exceeding this limit are rejected.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was the 80 MiB selected? Do we have some data about size of the request and processing time?

Intuitively 80 MiB still feels very large. With previous 100 MiB we sometimes see translation to Mimir request taking up to 10 seconds, so perhaps better value would be somewhere close to 10 MiB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned in the description of the PR, it is tentative. Since 413 error is not retryable, we need to be cautious. if 100MiB/10s, how about 2s of response time ~ 20 MiB

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we need to be careful when rolling this out. It may be safer to use original value of 100 MiB for now, so that this doesn't break existing setups, and then we can experiment with correct value in our setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me revert my reverted commit

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing my feedback. I left few more minor comments.

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

CHANGELOG.md Outdated Show resolved Hide resolved
docs/sources/mimir/configure/about-versioning.md Outdated Show resolved Hide resolved
@aknuds1
Copy link
Contributor

aknuds1 commented Jul 4, 2024

Reviewing again

Copy link
Contributor

@aknuds1 aknuds1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a couple of suggested clarifications wrt. compression.

docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
docs/sources/mimir/manage/mimir-runbooks/_index.md Outdated Show resolved Hide resolved
@ying-jeanne
Copy link
Contributor Author

ying-jeanne commented Jul 4, 2024

Thank you! @pstibrany @aknuds1 @tacole02. Let's merge and now set it lower along the time and reject some requests🤞

ying-jeanne and others added 12 commits July 4, 2024 15:34
Co-authored-by: Peter Štibraný <pstibrany@gmail.com>
Co-authored-by: Peter Štibraný <pstibrany@gmail.com>
Co-authored-by: Peter Štibraný <pstibrany@gmail.com>
Co-authored-by: Peter Štibraný <pstibrany@gmail.com>
Co-authored-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: Arve Knudsen <arve.knudsen@gmail.com>
@ying-jeanne ying-jeanne merged commit e7ab906 into main Jul 4, 2024
29 checks passed
@ying-jeanne ying-jeanne deleted the max-massage-request-bytes branch July 4, 2024 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants