Producer.Retry.BackoffFunc is not applied when broker is down #1719

Closed
faec opened this issue Jun 10, 2020 · 1 comment
Labels
stale Issues and pull requests without any recent activity

faec commented Jun 10, 2020

Versions
Sarama: 1.24.1
Kafka: 2.1
Go: 1.13.10

(I believe the bug is still present at head but these are the versions I used to diagnose it.)

Configuration

The Kafka configuration is a default 3-node system, set up as described in https://kafka.apache.org/quickstart but omitting steps 3-5 and the topic replication setup.

The relevant Sarama configuration is that Producer.Retry.BackoffFunc is set to a callback that returns a backoff duration growing exponentially with the number of retries.

This bug arises from elastic/beats#19015, which may give relevant context.

Problem Description

If we shut down the lead broker, we observe Sarama calling our callback with retries set to zero every time. The callback therefore always returns the initial retry period, potentially causing overloads.

It is debatable whether the two types of error (a missing broker vs. a failure response from a live broker) should have the same backoff behavior, but for our use case we definitely need to be able to specify backoff when the broker is down. The code appears intended to allow this, so I suspect an oversight: the current error handling architecture increments retry counts when a response is received, so in the special case where there is no response the count never advances.

I have a candidate fix that works in our tests. It simply adds a separate local variable in partitionProducer.dispatch to track broker-refresh retries separately from the retries of a particular message.

ghost commented Mar 16, 2021

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur.
Please check if the master branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

ghost added the stale label on Mar 16, 2021
github-actions bot closed this as not planned on Aug 24, 2023