Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur.
Please check if the master branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.
ghost added the stale label on Mar 16, 2021.
Versions
(I believe the bug is still present at head but these are the versions I used to diagnose it.)
Configuration
The Kafka configuration is a default 3-node system, set up as described in https://kafka.apache.org/quickstart but omitting steps 3-5 and the topic replication setup.
The relevant Sarama configuration is that Producer.Retry.BackoffFunc is set to a callback that returns a backoff duration growing exponentially with the number of retries.

This bug arises from elastic/beats#19015, which may give relevant context.
Problem Description
If we shut down the lead broker, we observe Sarama's callback being called with retries set to zero every time, which makes it always use the initial retry period, potentially causing overloads.

There is some question whether the two types of error (missing broker vs. a failure response from a live broker) should have the same backoff behavior, but for our use case we definitely need to be able to specify backoff when the broker is down. The code appears intended to allow this, so I speculate it's an oversight: the current error handling architecture increments retry counts only when a response is received, which results in this bug in the special case where there is no response.
I have a candidate fix that works in our tests: it simply adds a local variable in partitionProducer.dispatch to track broker-refresh retries separately from retries of a particular message.