[Proposal] Move batching to exporterhelper #4646

Open
bogdandrutu opened this issue Jan 5, 2022 · 18 comments

Labels
release:required-for-ga Must be resolved before GA release

Comments

@bogdandrutu (Member) commented Jan 5, 2022

Right now the batching capability is offered as a processor. This is extremely generic in our current pipeline model, which is a great thing, but it also comes with limitations:

  1. The batch processor uses a queue (implemented as a channel) that is not the same as the one used in the queued retry, so it can lose data when the binary crashes even after the client was informed that the data was accepted. There is already work underway to offer a persistent queue mechanism, which will not be available in the processor.
  2. If users want to use batching with the routing processor, we will need to fix Change batch processor to take client info into consideration #4544, AND this will batch data that will later be split again, which is very unnecessary.
  3. The SignalFx/SplunkHec receiver/exporter uses a "hack" that adds one header from the incoming request to the resource, batches things, then has to split based on that header. Batching inside the exporter helper can be configured with custom logic much more easily, by making the batching library accept a custom "batcher" func.
  4. Allow customization of the "size" function. There have been requests to support batching by serialized size; we can offer this much more cleanly in the exporter helper, since a custom sizer can work on the exporter's wire format, so batching can happen after data serialization.
  5. When multiple exporters are configured, they may have different batching requirements. This can be achieved today with multiple pipelines, but that causes configuration duplication between the pipelines.

This will also fix some more problems.

The proposal is to look into offering the "batching" capability similar to the timeout/queue/retry logic that we have in the exporterhelper.
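
To make the shape of this concrete, here is a rough sketch of what such an option could look like, by analogy with the existing exporterhelper options. This is only an illustration: `WithBatcher`, `BatcherSettings`, and `SizerFunc` are hypothetical names, not an existing API.

```go
// Hypothetical sketch only: WithBatcher, BatcherSettings and SizerFunc do not
// exist in exporterhelper; they mirror the shape of the existing
// timeout/retry/queue options.
package sketch

import (
	"time"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// SizerFunc lets an exporter define what "size" means for a batch, e.g. span
// count or the serialized bytes of its own wire format.
type SizerFunc func(td ptrace.Traces) int

// BatcherSettings would sit next to the timeout/retry/queue settings.
type BatcherSettings struct {
	Enabled      bool
	FlushTimeout time.Duration // flush a partial batch after this long
	MaxSize      int           // as measured by Sizer
	Sizer        SizerFunc
}

// An exporter would then opt in with something like:
//
//	exporterhelper.NewTracesExporter(ctx, set, cfg, pushTraces,
//	    exporterhelper.WithQueue(queueCfg),
//	    exporterhelper.WithRetry(retryCfg),
//	    exporterhelper.WithBatcher(BatcherSettings{ // hypothetical option
//	        Enabled:      true,
//	        FlushTimeout: 200 * time.Millisecond,
//	        MaxSize:      8192,
//	        Sizer:        func(td ptrace.Traces) int { return td.SpanCount() },
//	    }),
//	)
```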

@tigrannajaryan (Member)

A couple quick counterarguments:

  1. The dedicated batch processor keeps the batch in memory only once, even when there are multiple exporters in the pipeline. If we move an equivalently sized batch to the queued retry of every exporter, we will slightly increase memory requirements (not by much, since most of the pdata is going to be shared, assuming the exporters' output throughput more or less keeps up with the input).
  2. Virtually the same batching work needs to happen for every exporter, which adds some CPU work per exporter. It would be useful to benchmark how big this difference is when there is more than one exporter.
  3. Batching probably helps to improve the performance of subsequent processors in the pipeline. We haven't benchmarked this, but it is possible that there is a non-zero per-pdata cost that is amortized better if the batches are larger. This may be negligible, but I would like to see some benchmarks measuring it.

These are not necessarily strong enough to block the proposal, but I think they are worth thinking about.
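
For point 3, a minimal benchmark skeleton might look like the following. This is a sketch assuming the current pdata API; the no-op consumer stands in for whatever processor is under test.

```go
// Illustrative benchmark skeleton: pushes the same total number of spans in
// differently sized batches to expose any per-pdata overhead. The no-op
// consumer stands in for the processor under test.
package sketch

import (
	"context"
	"fmt"
	"testing"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

func makeTraces(spans int) ptrace.Traces {
	td := ptrace.NewTraces()
	ss := td.ResourceSpans().AppendEmpty().ScopeSpans().AppendEmpty()
	for i := 0; i < spans; i++ {
		ss.Spans().AppendEmpty().SetName("benchmark-span")
	}
	return td
}

func consumeNoop(_ context.Context, _ ptrace.Traces) error { return nil }

func BenchmarkBatchSizes(b *testing.B) {
	const totalSpans = 8192
	for _, batchSize := range []int{1, 64, 1024, 8192} {
		td := makeTraces(batchSize)
		b.Run(fmt.Sprintf("batch=%d", batchSize), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				for sent := 0; sent < totalSpans; sent += batchSize {
					// Replace consumeNoop with the processor's ConsumeTraces.
					_ = consumeNoop(context.Background(), td)
				}
			}
		})
	}
}
```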

There is already work underway to offer a persistent queue mechanism, which will not be available in the processor.

I think we discussed the possibility of adding persistence to the batch processor as well. Admittedly it may result in an awkward user experience, though, since you would then need to configure persistence in two different places.

@tigrannajaryan (Member)

Another benefit of eliminating the batch processor is that it is one less thing to configure. This will simplify the user experience.

@bogdandrutu (Member, Author)

While responding to your comments I realized that one possibility is to offer both; do you think that would be too confusing?

The dedicated batch processor keeps the batch in memory only once, even when there are multiple exporters in the pipeline. If we move an equivalently sized batch to the queued retry of every exporter, we will slightly increase memory requirements (not by much, since most of the pdata is going to be shared, assuming the exporters' output throughput more or less keeps up with the input).

This is a good argument for possibly keeping the processor as well, maybe using the same shared library that the exporterhelper uses. This comes with another downside that I hadn't thought of: if we want fast batching we need to mutate the pdata (move it into the batch), which causes N (number of exporters) copies to be needed even if the exporter may not need to mutate the data itself.
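
For illustration, here is roughly where those copies come from, using the existing consumer.Capabilities mechanism. The fan-out below is a simplified sketch, not the real implementation.

```go
// Simplified illustration of the existing capabilities mechanism: when a
// downstream consumer declares MutatesData, the fan-out in front of multiple
// exporters has to hand each one its own copy of the pdata.
package sketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

func fanOut(ctx context.Context, td ptrace.Traces, consumers []consumer.Traces) error {
	for _, c := range consumers {
		out := td
		if c.Capabilities().MutatesData {
			// A mutating consumer (e.g. one that moves spans into a batch)
			// gets a clone so it cannot corrupt the shared original.
			out = ptrace.NewTraces()
			td.CopyTo(out)
		}
		if err := c.ConsumeTraces(ctx, out); err != nil {
			return err
		}
	}
	return nil
}
```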

Virtually the same batching work needs to happen for every exporter, which adds some CPU work per exporter. It would be useful to benchmark how big this difference is when there is more than one exporter.

Indeed, though I expect the work to be minimal, we definitely need to think about it.

Batching probably helps to improve the performance of subsequent processors in the pipeline. We haven't benchmarked this, but it is possible that there is a non-zero per-pdata cost that is amortized better if the batches are larger. This may be negligible, but I would like to see some benchmarks measuring it.

We need to think about/propose 1-2 components to benchmark for this, maybe the new "transform" processor, since it will be one of the most used? Any other ideas?

@tigrannajaryan (Member)

While responding to your comments I realized that one possibility is to offer both; do you think that would be too confusing?

Yes, I think it will be confusing if we do batching in 2 places, with 2 different overlapping sets of configuration options available.

@tigrannajaryan (Member)

if the exporter may not need to mutate the data itself.

The exporters are currently not allowed to mutate data: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#data-ownership

@tigrannajaryan (Member)

We need to think about/propose 1-2 components to benchmark for this, maybe the new "transform" processor, since it will be one of the most used? Any other ideas?

Since "transform" doesn't exist yet perhaps test with a couple basic ones for now, like attributesprocessor or filterprocessor.

@bogdandrutu (Member, Author)

The exporters are currently not allowed to mutate data: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#data-ownership

You know that nobody reads documentation :)) One of them is the signalfx exporter, which uses https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/batchperresourceattr/batchperresourceattr.go. Luckily, after my change to move "capability" to Consumer, we now do the right thing for exporters, because they also have capabilities and we clone there if we need to as well.

So the documentation is unfortunately outdated.
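
For reference, the mechanism mentioned above is the exporterhelper WithCapabilities option. A minimal sketch follows; the exporter constructor call is elided since its signature has changed across Collector versions.

```go
// Sketch: an exporter that rearranges pdata, as the signalfx exporter does via
// batchperresourceattr, can declare that it mutates data so the pipeline
// clones for it. WithCapabilities is an existing exporterhelper option.
package sketch

import (
	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/exporter/exporterhelper"
)

// Passed to the exporterhelper constructor (call elided here), this makes the
// fan-out hand the exporter its own copy of the data instead of assuming
// exporters never modify shared pdata.
var mutatingExporterOption = exporterhelper.WithCapabilities(
	consumer.Capabilities{MutatesData: true},
)
```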

@mx-psi (Member) commented Jan 10, 2022

I think supporting both would be confusing. The 'customization of the "size" function' feature would be really helpful for us: we want batches to be below our intake's API max size, but currently we can't really give a precise number for the max batch size, since what we care about is the number of bytes in the payload we export, which can't be recovered from the number of points in a batch.
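
For illustration, a sizer of that kind could look like the following. The sizer hook itself is hypothetical (it is what this proposal would add); pmetric.ProtoMarshaler from the existing pdata API stands in here for whatever wire format the exporter actually sends.

```go
// Sketch of a serialized-size sizer: the hook is hypothetical (part of this
// proposal), while pmetric.ProtoMarshaler is the existing pdata API, standing
// in for the exporter's real wire format.
package sketch

import (
	"go.opentelemetry.io/collector/pdata/pmetric"
)

var marshaler = &pmetric.ProtoMarshaler{}

// serializedSize measures a batch by the bytes that would go over the wire,
// so batches can be kept under an intake API's maximum payload size instead
// of being estimated from the number of data points.
func serializedSize(md pmetric.Metrics) int {
	payload, err := marshaler.MarshalMetrics(md)
	if err != nil {
		return 0 // a real sizer should surface the error instead of hiding it
	}
	return len(payload)
}
```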

@bogdandrutu (Member, Author)

@dashpole it would be good if you could drive this to resolution; see open-telemetry/opentelemetry-collector-contrib#7134, which you also need :)

@pmm-sumo (Contributor)

Batching probably helps to improve the performance of subsequent processors in the pipeline. We haven't benchmarked this, but it is possible that there is a non-zero per-pdata cost that is amortized better if the batches are larger. This may be negligible, but I would like to see some benchmarks measuring it.

There's an edge case in which batch + groupbyattrs can be used for compacting traces, with a huge impact (as measured by @pkositsyn). Moving batching to exporterhelper would make this no longer possible, but I think we could also consider making it a more generic helper, which could be used in processors as well?

@swiatekm (Contributor)

  1. The batch processor uses a queue (implemented as a channel) that is not the same as the one used in the queued retry, so it can lose data when the binary crashes even after the client was informed that the data was accepted. There is already work underway to offer a persistent queue mechanism, which will not be available in the processor.

Correct me if I'm wrong, but we can lose data this way even without a crash: if we have an exporter with a queue and a batch processor, data can be dropped because the exporter queue is full, and the batch processor makes it impossible to propagate the error back to the receiver.

I don't know that I have an informed opinion on whether batching should happen in a separate processor or in an exporter, but it definitely shouldn't happen in both, for the above reason, unless we're willing to accept this kind of data loss and abandon any hope of applying backpressure to the receiver from any component after the batch processor.

@dashpole (Contributor)

For now, we've implemented batching within the gcp exporter. There is only one "correct" way to batch for GCM, so it seemed sensible to make it work correctly without requiring configuration. It is also needed to work around other quirks of the GCM API, such as #4442. open-telemetry/opentelemetry-collector-contrib#7134 isn't blocking us--we've implemented our conversion in our exporter. That proposal would only be valuable if other exporters that need summary splitting also wanted to use it.

I agree with others that supporting both processor and exporterhelper would be confusing. For now, I think it is reasonable to ask unusual use-cases like ours to re-implement batching themselves. Making it available as a library would be nice, but probably not worth the dependency unless we wanted to mark it stable.

@swiatekm (Contributor)

@bogdandrutu what do you think about making the batch processor conduct backpressure better in the meantime?

Currently, the batch processor simply ignores consumer errors. If a transient error happens later in the pipeline, the data is effectively dropped and this information isn't propagated backwards. For example, this means that if we have an exporter with a full queue at the end of the pipeline, we'll keep batching and dropping data, whereas we really should return a transient error to the receiver so it can make its own flow control decisions.

For example, we could:

  1. Retry transient consumer errors
  2. Reject incoming data if there's backpressure - could be via a timeout or by switching to a buffered channel

Right now, it's impossible to have both batching and reasonable error propagation. Fixing this should be possible today, and wouldn't prevent us from moving batching to exporterhelper later.
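
A rough sketch of that direction follows. It deliberately ignores the asynchronous channel that makes this hard in the real batch processor; consumererror is the existing package for distinguishing permanent from transient errors.

```go
// Simplified sketch of propagating errors instead of dropping data. The real
// batch processor flushes asynchronously via a channel, which is exactly what
// makes this hard; that part is ignored here.
package sketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

type propagatingBatcher struct {
	next consumer.Traces
}

func (p *propagatingBatcher) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	// Batching logic elided. The key change versus today's batch processor is
	// that a downstream error is returned rather than silently ignored.
	if err := p.next.ConsumeTraces(ctx, td); err != nil {
		if !consumererror.IsPermanent(err) {
			// Transient error, e.g. the exporter queue is full: surfacing it
			// lets the receiver reject or slow down its client (option 2);
			// a bounded retry (option 1) could also be attempted here first.
			return err
		}
		// Permanent errors will not be fixed by retrying; report them too.
		return err
	}
	return nil
}
```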

@tigrannajaryan (Member)

what do you think about making the batch processor conduct backpressure better in the meantime?

@swiatekm-sumo if you want to help with this, it would be great if you could do the benchmarking to prove that we don't need the separate batch processor to have reasonable performance. If we can demonstrate this, then we should eliminate the batch processor and make batching an exporterhelper feature. This would be preferable to spending time improving the batch processor, which we would end up retiring anyway.

@swiatekm (Contributor)

@tigrannajaryan the reason I'd prefer to start with improving the batch processor is that the change would be fairly straightforward and would improve the user experience right now, as opposed to several months from now, after we're done with a migration involving several breaking changes.

I'm also not entirely convinced this is the right thing to do. Even putting efficiency aside, removing the batch processor would have various knock-on effects - for example it would make https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/groupbyattrsprocessor worthless.

I can take up the benchmarking either way; I don't think these activities are mutually exclusive. But it seems like a much more open-ended project, as I'd need to come up with a bunch of representative configurations on my own.

@tigrannajaryan (Member)

@swiatekm-sumo Sure, if there is a maintainer/approver who wants to review the batch processor improvements please feel free to go ahead with what you suggest (I won't be able to help review the improvements).

If you can help with benchmarking it will be an added bonus :-)

@bogdandrutu (Member, Author)

@swiatekm-sumo I am a bit reluctant because:

For example, we could:

Retry transient consumer errors
Reject incoming data if there's backpressure - could be via a timeout or by switching to a buffered channel

These are already implemented in the exporter helper as "queue retry" :) That is yet another argument to unify them.
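
For reference, those existing options look roughly like this. The settings struct and field names reflect the exporterhelper API around the time of this discussion and have been renamed in places since.

```go
// The existing exporterhelper timeout/retry/queue options referenced above;
// names are approximately those in use at the time of this discussion.
package sketch

import (
	"time"

	"go.opentelemetry.io/collector/exporter/exporterhelper"
)

var opts = []exporterhelper.Option{
	exporterhelper.WithTimeout(exporterhelper.TimeoutSettings{Timeout: 5 * time.Second}),
	exporterhelper.WithRetry(exporterhelper.RetrySettings{Enabled: true}),
	exporterhelper.WithQueue(exporterhelper.QueueSettings{
		Enabled:      true,
		NumConsumers: 4,
		QueueSize:    1000,
	}),
}
```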

@dmitryax (Member) commented Feb 6, 2023

I like the idea. Another reason to do this is that the batch processor removes the client context, so it has to be put in a particular place in the pipeline to avoid causing problems for processors that use that context, which can easily be misconfigured.
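
As an example of what gets lost: client.FromContext is the existing API for reading per-request client info, and the "X-Tenant" header below is purely illustrative.

```go
// Example of client info a later processor might rely on. client.FromContext
// is the existing API; the "X-Tenant" header is purely illustrative. Because
// the batch processor emits batches without the original client context, this
// lookup returns nothing for components placed after it.
package sketch

import (
	"context"

	"go.opentelemetry.io/collector/client"
)

func tenantFromContext(ctx context.Context) string {
	info := client.FromContext(ctx)
	if values := info.Metadata.Get("X-Tenant"); len(values) > 0 {
		return values[0]
	}
	return ""
}
```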

@atoulme atoulme changed the title [Proposal] Move batching to exportehelper [Proposal] Move batching to exporterhelper Jul 28, 2023
@atoulme atoulme added the release:required-for-ga Must be resolved before GA release label Dec 18, 2023
@mx-psi mx-psi moved this to In Progress in Collector: v1 Apr 18, 2024