reenqueue may not be needed #8382

Closed
yutingcaicyt opened this issue Sep 11, 2023 · 10 comments · Fixed by #9090

Comments

@yutingcaicyt

Is your feature request related to a problem? Please describe.
While using the persistent queue, I noticed that the two queue types handle requests whose retries have failed differently: the persistent queue re-enqueues them, but the memory queue does not. Users can already configure the retry duration themselves, and intuitively a request should simply be discarded once that time runs out. With re-enqueueing, it is hard to know how long a request will keep being retried. So why not keep it simple and let the user decide how long a request may be retried at most?

Describe the solution you'd like
We could remove the re-enqueue feature to keep the behavior simple and intuitive, and let users control the retry duration through configuration (see the sketch below).
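
For illustration, this is roughly the retry configuration a user would tune; the values shown are examples (the documented defaults), not recommendations, and the endpoint is hypothetical:

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317  # hypothetical endpoint
    retry_on_failure:
      enabled: true
      initial_interval: 5s     # wait before the first retry attempt
      max_interval: 30s        # cap for the exponential backoff interval
      max_elapsed_time: 300s   # total retry budget; the request is dropped once this is exceeded
```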

@dmitryax
Member

I very much agree with the suggestion. I don't know why it was introduced in the first place. I think the only scenario when we need to re-enqueue is during shutdown.

@swiatekm-sumo @bogdandrutu WDYT?

@swiatekm
Contributor

Requeueing as implemented is how a lot of users intuitively expect the queue to work, in my experience. That is, they expect no data to be dropped if the queue still has room. And this behaviour is difficult to reproduce by adjusting retry settings.

I'd prefer to add it as an option to the in-memory queue, with a default that is consistent between the two queue types.

@yutingcaicyt
Author

So what would be a better way to handle this?

@swiatekm
Contributor

Like I said, add the configuration option so both queues behave the same way. Then you can decide which behaviour you prefer.
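
A purely hypothetical sketch of what such an option could look like; `requeue_on_failure` is an invented name used only for illustration, not an actual exporterhelper setting:

```yaml
exporters:
  otlp:
    sending_queue:
      enabled: true
      # hypothetical flag: put requests back into the queue after the retry budget
      # is exhausted, with the same default for the in-memory and persistent queues
      requeue_on_failure: true
```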

@yutingcaicyt
Author

With a re-enqueue option, users would have two ways to control how data is retried. Are these two ways alternatives to each other? If a user does not want to discard data, is it enough to simply increase the retry time?

@dmitryax
Member

dmitryax commented Dec 9, 2023

I'm trying to introduce this config option in the scope of #8987 and running into a couple of issues:

  1. The requeuing doesn't work well with the memory queue on shutdown. The memory queue channel is closed on shutdown to drain the queue, so if we introduce this option, there is a chance of running into [exporterhelper] boundedMemoryQueue may send on closed channel #7388. I tried to avoid requeuing on shutdown ([chore] [exporterhelper] Do not requeue on shutdown #9054), but that doesn't seem to be a bulletproof solution, so making sure we are not requeuing into a closed memory channel introduces unnecessary complexity.
  2. As pointed out in [exporterhelper] Make the re-enqueue behavior configurable #8993 (comment), introducing another "retry" option might be confusing to users.

We discussed this with @bogdandrutu, and we are not sure the re-enqueue capability is useful enough. If a request cannot be delivered right now, why put it back in the queue and try another one, instead of increasing the backoff timeout and retrying the same request?

@swiatekm-sumo can you please provide some real-world examples where requeuing would be needed on top of the backoff retry?

@swiatekm
Contributor

@dmitryax thinking about it more, re-enqueueing is logically equivalent to retrying forever. As long as we document that users who want that behaviour should instead set max_elapsed_time to some large value, I think it's OK to remove it, especially since it's complex to implement for the in-memory queue.

@dmitryax
Member

max_elapsed_time actually supports 0, which means retry forever.

If we remove the re-queue capability right away, there is a possibility of double publishing after shutdown if the retry sender holds partially failed requests. We would need to update the persistent storage state for partially delivered requests stuck in the retry sender. This should be easy to do after the recent refactoring; I can incorporate this change.
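
For reference, retrying forever through the existing retry settings would look roughly like this (a sketch, not a recommendation):

```yaml
exporters:
  otlp:
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0  # 0 removes the time limit, so the same request is retried indefinitely
```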

@splunkericl
Contributor

Functionally, these two options feel the same. But performance-wise, wouldn't removing re-enqueue be a big hit for persistent queue users?

If there is an outage in the destination:

  • The exporterhelper will hold the requests in memory instead of writing them to storage. Depending on the batch size and the number of exporters, there is a risk of running out of memory.
  • During a shutdown or configuration restart, does this mean the exporterhelper has to write all of this data to disk? Depending on how much data there is, that could take some time, which means a longer outage for the pipeline.

@dmitryax
Member

The exporterhelper will hold the requests in memory instead of writing them to storage. Depending on the batch size and the number of exporters, there is a risk of running out of memory.

The amount of data kept in memory is always bounded by the number of queue workers, and those workers are always busy; it makes no difference whether we keep the same data in a worker for a backoff retry or keep pushing it back and forth through the queue. In fact, the latter uses even more memory because of the constant marshaling/unmarshaling, with the GC having to catch up.

During a shutdown or configuration restart, does this mean the exporterhelper has to write all of this data to disk? Depending on how much data there is, that could take some time, which means a longer outage for the pipeline.

This already happens on every shutdown today. I'm suggesting we do it only for partially failed requests, which doesn't happen often.
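
For reference, the memory bound described above comes from the queue settings; a minimal sketch with purely illustrative values:

```yaml
exporters:
  otlp:
    sending_queue:
      enabled: true
      num_consumers: 10      # at most 10 requests are held by workers (in memory) at a time
      queue_size: 5000       # remaining requests wait in the queue, on disk when `storage` is set
      storage: file_storage  # persistent queue backed by the file_storage extension
```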

bogdandrutu pushed a commit that referenced this issue Dec 15, 2023
The current re-enqueuing behavior is not obvious and cannot be
configured. It takes place only for persistent queue and only if
`retry_on_failure::enabled=true` even if `retry_on_failure` is a setting
for a different backoff retry strategy.

This change removes the re-enqueuing behavior in favor of the
`retry_on_failure` option. Consider increasing
`retry_on_failure::max_elapsed_time` to reduce chances of data loss.

Resolves
#8382
sokoide pushed a commit to sokoide/opentelemetry-collector that referenced this issue Dec 18, 2023