reenqueue may not be needed #8382

Closed
yutingcaicyt opened this issue Sep 11, 2023 · 10 comments · Fixed by #9090

Comments

@yutingcaicyt

Is your feature request related to a problem? Please describe.
While using the persistent queue, I noticed that the two queue types handle requests whose retries have failed differently: the persistent queue re-enqueues them, but the memory queue does not. Users can already configure the retry duration themselves, and intuitively a request should simply be discarded once that time runs out. With re-enqueueing, it is hard to know how long a request will keep being retried. So why not keep it simple and let the user decide how long a request may be retried at most?

Describe the solution you'd like
We could remove the re-enqueue feature to keep the behavior simple and intuitive, and let users control the retry duration through configuration (see the sketch below).
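
For illustration, this is roughly the retry configuration a user would tune; the values shown are examples (the documented defaults), not recommendations, and the endpoint is hypothetical:

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317  # hypothetical endpoint
    retry_on_failure:
      enabled: true
      initial_interval: 5s     # wait before the first retry attempt
      max_interval: 30s        # cap for the exponential backoff interval
      max_elapsed_time: 300s   # total retry budget; the request is dropped once this is exceeded
```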

@dmitryax
Member

I very much agree with the suggestion. I don't know why it was introduced in the first place. I think the only scenario when we need to re-enqueue is during shutdown.

@swiatekm-sumo @bogdandrutu WDYT?

@swiatekm
Contributor

Requeueing as implemented is how a lot of users intuitively expect the queue to work, in my experience. That is, they expect no data to be dropped if the queue still has room. And this behaviour is difficult to reproduce by adjusting retry settings.

I'd prefer to add it as an option to the in-memory queue, with a default that is consistent between the two queue types.

@yutingcaicyt
Author

So what would be a better way to handle this?

@swiatekm
Contributor

Like I said, add the configuration option so both queues behave the same way. Then you can decide which behaviour you prefer.
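
A purely hypothetical sketch of what such an option could look like; `requeue_on_failure` is an invented name used only for illustration, not an actual exporterhelper setting:

```yaml
exporters:
  otlp:
    sending_queue:
      enabled: true
      # hypothetical flag: put requests back into the queue after the retry budget
      # is exhausted, with the same default for the in-memory and persistent queues
      requeue_on_failure: true
```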

@yutingcaicyt
Author

With a re-enqueue option, users would have two ways to control how data is retried. Are these two ways alternatives to each other? If a user does not want to discard data, is it enough to simply increase the retry time?

@dmitryax
Member

dmitryax commented Dec 9, 2023

I'm trying to introduce this config option in the scope of #8987 and running into a couple of issues:

  1. The requeuing doesn't work well with the memory queue on shutdown. The memory queue channel is closed on shutdown to drain the queue, so if we introduce this option, there is a chance of running into [exporterhelper] boundedMemoryQueue may send on closed channel #7388. I tried to avoid requeuing on shutdown ([chore] [exporterhelper] Do not requeue on shutdown #9054), but that doesn't seem to be a bulletproof solution, so making sure we are not requeuing into a closed memory channel introduces unnecessary complexity.
  2. As pointed out in [exporterhelper] Make the re-enqueue behavior configurable #8993 (comment), introducing another "retry" option might be confusing to users.

We discussed this with @bogdandrutu, and we are not sure the re-enqueue capability is useful enough. If a request cannot be delivered right now, why put it back in the queue and try another one, instead of increasing the backoff timeout and retrying the same request?

@swiatekm-sumo can you please provide some real-world examples where requeuing would be needed on top of the backoff retry?

@swiatekm
Contributor

@dmitryax thinking about it more, re-enqueueing is logically equivalent to retrying forever. As long as we document that users who want that behaviour should instead set max_elapsed_time to some large value, I think it's OK to remove it, especially since it's complex to implement for the in-memory queue.

@dmitryax
Member

max_elapsed_time actually supports 0, which means retry forever.

If we remove the re-queue capability right away, there is a possibility of double publishing after shutdown if the retry sender holds partially failed requests. We would need to update the persistent storage state for partially delivered requests stuck in the retry sender. This should be easy to do after the recent refactoring; I can incorporate this change.
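
For reference, retrying forever through the existing retry settings would look roughly like this (a sketch, not a recommendation):

```yaml
exporters:
  otlp:
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0  # 0 removes the time limit, so the same request is retried indefinitely
```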

@splunkericl
Contributor

Functionally, these two options feel the same. But performance-wise, wouldn't removing re-enqueue be a big hit for persistent queue users?

If there is an outage in the destination:

  • The exporterhelper will hold the requests in memory instead of writing them to storage. Depending on the batch size and the number of exporters, there is a risk of running out of memory.
  • During a shutdown or configuration restart, does this mean the exporterhelper has to write all of this data to disk? Depending on how much data there is, that could take some time, which means a longer outage for the pipeline.

@dmitryax
Member

The exporterhelper will hold the requests in memory instead of writing them to storage. Depending on the batch size and the number of exporters, there is a risk of running out of memory.

The amount of data kept in memory is always bounded by the number of queue workers, and those workers are always busy; it makes no difference whether we keep the same data in a worker for a backoff retry or keep pushing it back and forth through the queue. In fact, the latter uses even more memory because of the constant marshaling/unmarshaling, with the GC having to catch up.

During a shutdown or configuration restart, does this mean the exporterhelper has to write all of this data to disk? Depending on how much data there is, that could take some time, which means a longer outage for the pipeline.

This already happens on every shutdown today. I'm suggesting we do it only for partially failed requests, which doesn't happen often.
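
For reference, the memory bound described above comes from the queue settings; a minimal sketch with purely illustrative values:

```yaml
exporters:
  otlp:
    sending_queue:
      enabled: true
      num_consumers: 10      # at most 10 requests are held by workers (in memory) at a time
      queue_size: 5000       # remaining requests wait in the queue, on disk when `storage` is set
      storage: file_storage  # persistent queue backed by the file_storage extension
```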

bogdandrutu pushed a commit that referenced this issue Dec 15, 2023
The current re-enqueuing behavior is not obvious and cannot be
configured. It takes place only for persistent queue and only if
`retry_on_failure::enabled=true` even if `retry_on_failure` is a setting
for a different backoff retry strategy.

This change removes the re-enqueuing behavior in favor of the
`retry_on_failure` option. Consider increasing
`retry_on_failure::max_elapsed_time` to reduce chances of data loss.

Resolves
#8382
sokoide pushed a commit to sokoide/opentelemetry-collector that referenced this issue Dec 18, 2023