Shipper output fails on large events/batches #34695
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Small correction: with the current code, the output will actually retry the batch forever on error, stalling the whole pipeline. Dropping on error is what the code actually intended; this is a separate bug that I've filed as #34700.
As a short-term fix we can increase the maximum message size on the shipper gRPC server: https://pkg.go.dev/google.golang.org/grpc#MaxMsgSize. Here is an example configuring this in the agent:

    server = grpc.NewServer(
        grpc.Creds(creds),
        grpc.MaxRecvMsgSize(m.grpcConfig.MaxMsgSize),
    )

We should default this to 100MB to match the default value of Elasticsearch's http.max_content_length setting.
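If the server-side limit is raised, the shipper's gRPC client presumably needs matching limits as well. Here is a minimal sketch of what that could look like using standard grpc-go dial options; the `dialShipper` helper, the credentials, and the 100MB constant are illustrative assumptions, not taken from the agent code:

```go
package shipperclient

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// maxMsgSize mirrors Elasticsearch's default http.max_content_length (100MB).
const maxMsgSize = 100 * 1024 * 1024

// dialShipper is a hypothetical helper showing how a client could raise its
// send and receive limits to match the server's MaxRecvMsgSize.
func dialShipper(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(
		target,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // placeholder; the real client uses TLS credentials
		grpc.WithDefaultCallOptions(
			grpc.MaxCallSendMsgSize(maxMsgSize),
			grpc.MaxCallRecvMsgSize(maxMsgSize),
		),
	)
}
```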
In the long term we should automatically split up the batch, hopefully using the solution from #29778. We likely want to keep the gRPC max message size set high enough that we are unlikely to incur the overhead of splitting the batch regularly.
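A rough sketch of the batch-splitting idea, with hypothetical names (`Event`, `publishBatch`, and `errMsgTooLarge` are placeholders, not the shipper's actual API): when a publish fails because the batch is too large, halve the batch and retry each half recursively, dropping an event only when it is too large on its own.

```go
package shipperclient

import (
	"errors"
	"fmt"
)

// Event and errMsgTooLarge are illustrative placeholders, not the shipper's real types.
type Event struct{ Fields map[string]any }

var errMsgTooLarge = errors.New("message larger than gRPC max")

// publishWithSplit retries a too-large batch by halving it recursively,
// dropping an event only when it is too large on its own.
func publishWithSplit(events []Event, publishBatch func([]Event) error) error {
	err := publishBatch(events)
	if err == nil || !errors.Is(err, errMsgTooLarge) {
		return err // success, or an unrelated error for the caller to handle
	}
	if len(events) == 1 {
		// A single event exceeds the limit; splitting cannot help.
		return fmt.Errorf("dropping oversized event: %w", err)
	}
	mid := len(events) / 2
	if firstErr := publishWithSplit(events[:mid], publishBatch); firstErr != nil {
		return firstErr
	}
	return publishWithSplit(events[mid:], publishBatch)
}
```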
When I wrote "this PR mitigates but does not fix #34695" in elastic/elastic-agent-shipper#281, GitHub apparently took that to mean that the PR did fix the issue, and closed it automatically. It is not fixed yet. A real fix is impending, though 😜
When event batches targeting the shipper exceed the RPC limit (currently 4MB), the shipper output drops all the events with an error indicating that the message is larger than the gRPC maximum message size.
There are multiple short-term mitigations (increase the RPC size limit; decrease the shipper output's batch size), but both of those approaches can still drop data unpredictably in the case of large events. As long as an individual event's size is within the supported limit, we should handle this error by splitting up the batch rather than permanently dropping its contents.
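For what it's worth, grpc-go reports an over-limit message with the `ResourceExhausted` status code, so an output could distinguish "batch too large, split and retry" from other failures roughly like this (a sketch, not the shipper's actual error handling):

```go
package shipperclient

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// batchTooLarge reports whether err carries gRPC's ResourceExhausted code,
// which is what grpc-go returns for messages exceeding the size limit.
// ResourceExhausted can have other causes too, so a real implementation
// may want to inspect the error message as well.
func batchTooLarge(err error) bool {
	return status.Code(err) == codes.ResourceExhausted
}
```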