Shipper output fails on large events/batches #34695
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Small correction: with the current code, the output will actually retry the batch forever on error, stalling the whole pipeline. Dropping on error is what the code actually intended; this is a separate bug that I've filed as #34700.
As a short-term fix we can increase the maximum message size on the shipper gRPC server: https://pkg.go.dev/google.golang.org/grpc#MaxMsgSize. Here is an example configuring this in the agent:

    server = grpc.NewServer(
        grpc.Creds(creds),
        grpc.MaxRecvMsgSize(m.grpcConfig.MaxMsgSize),
    )

We should default this to 100MB to match the default value of Elasticsearch's http.max_content_length setting.
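If the server-side limit is raised, the shipper's gRPC client presumably needs matching limits as well. Here is a minimal sketch of what that could look like using standard grpc-go dial options; the `dialShipper` helper, the credentials, and the 100MB constant are illustrative assumptions, not taken from the agent code:

```go
package shipperclient

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// maxMsgSize mirrors Elasticsearch's default http.max_content_length (100MB).
const maxMsgSize = 100 * 1024 * 1024

// dialShipper is a hypothetical helper showing how a client could raise its
// send and receive limits to match the server's MaxRecvMsgSize.
func dialShipper(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(
		target,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // placeholder; the real client uses TLS credentials
		grpc.WithDefaultCallOptions(
			grpc.MaxCallSendMsgSize(maxMsgSize),
			grpc.MaxCallRecvMsgSize(maxMsgSize),
		),
	)
}
```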
In the long term we should automatically split up the batch, hopefully using the solution from #29778. We likely want to keep the gRPC max message size set high enough that we are unlikely to incur the overhead of splitting the batch regularly.
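A rough sketch of the batch-splitting idea, with hypothetical names (`Event`, `publishBatch`, and `errMsgTooLarge` are placeholders, not the shipper's actual API): when a publish fails because the batch is too large, halve the batch and retry each half recursively, dropping an event only when it is too large on its own.

```go
package shipperclient

import (
	"errors"
	"fmt"
)

// Event and errMsgTooLarge are illustrative placeholders, not the shipper's real types.
type Event struct{ Fields map[string]any }

var errMsgTooLarge = errors.New("message larger than gRPC max")

// publishWithSplit retries a too-large batch by halving it recursively,
// dropping an event only when it is too large on its own.
func publishWithSplit(events []Event, publishBatch func([]Event) error) error {
	err := publishBatch(events)
	if err == nil || !errors.Is(err, errMsgTooLarge) {
		return err // success, or an unrelated error for the caller to handle
	}
	if len(events) == 1 {
		// A single event exceeds the limit; splitting cannot help.
		return fmt.Errorf("dropping oversized event: %w", err)
	}
	mid := len(events) / 2
	if firstErr := publishWithSplit(events[:mid], publishBatch); firstErr != nil {
		return firstErr
	}
	return publishWithSplit(events[mid:], publishBatch)
}
```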
When I wrote "this PR mitigates but does not fix #34695" in elastic/elastic-agent-shipper#281, GitHub apparently took that to mean that the PR did fix the issue, and closed it automatically. It is not fixed yet. A real fix is impending, though 😜
When event batches targeting the shipper exceed the RPC limit (currently 4MB), the shipper output drops all the events with an error indicating that the message is larger than the gRPC maximum message size.
There are multiple short-term mitigations (increase the RPC size limit; decrease the shipper output's batch size), but both of those approaches can still drop data unpredictably in the case of large events. As long as an individual event's size is within the supported limit, we should handle this error by splitting up the batch rather than permanently dropping its contents.
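For what it's worth, grpc-go reports an over-limit message with the `ResourceExhausted` status code, so an output could distinguish "batch too large, split and retry" from other failures roughly like this (a sketch, not the shipper's actual error handling):

```go
package shipperclient

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// batchTooLarge reports whether err carries gRPC's ResourceExhausted code,
// which is what grpc-go returns for messages exceeding the size limit.
// ResourceExhausted can have other causes too, so a real implementation
// may want to inspect the error message as well.
func batchTooLarge(err error) bool {
	return status.Code(err) == codes.ResourceExhausted
}
```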