Kafka output backoff does not seem to grow as expected #19015

Closed
ycombinator opened this issue Jun 5, 2020 · 3 comments

ycombinator commented Jun 5, 2020

Version: master / 8.0.0.

In #17808 we implemented a stateless exponential-backoff-with-jitter for the Kafka output. We even added a unit test that we believed would exercise the growth of the backoff duration over time, and that test has been passing!

However, adding some debugging statements to the backoff function shows that the actual backoff duration doesn't grow as expected, i.e. exponentially with some jitter, up to a max value.
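
For reference, here is a minimal, self-contained sketch of how the backoff function is intended to behave (a simplification for illustration, based on the change in #17808 and the diff in step 4 below, not the exact Beats source; the real makeBackoffFunc lives in libbeat/outputs/kafka/config.go):

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    type backoffConfig struct {
        Init time.Duration
        Max  time.Duration
    }

    // makeBackoffFunc returns a stateless backoff function: the base duration
    // doubles with each retry (Init * 2^retries), is capped at Max, and the
    // returned value is jittered into the interval [dur/2, dur].
    func makeBackoffFunc(cfg backoffConfig) func(retries, maxRetries int) time.Duration {
        return func(retries, maxRetries int) time.Duration {
            dur := cfg.Init * time.Duration(1<<uint(retries))
            if dur <= 0 || dur > cfg.Max { // the <= 0 check guards against overflow
                dur = cfg.Max
            }
            limit := int64(dur / 2)
            jitter := rand.Int63n(limit + 1)
            return time.Duration(limit + jitter)
        }
    }

    func main() {
        backoff := makeBackoffFunc(backoffConfig{Init: 3 * time.Second, Max: 30 * time.Second})
        // With growing retries the durations should roughly double each time,
        // e.g. ~1.5-3s, ~3-6s, ~6-12s, ..., capping out at ~15-30s.
        for retries := 0; retries < 6; retries++ {
            fmt.Printf("retries=%d -> %v\n", retries, backoff(retries, 6))
        }
    }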

Steps to reproduce

To reproduce this issue, one needs to cause the partition leader node to go away, which should trigger the backoff. One also needs to subsequently bring the partition leader node back online to verify that the backoff duration is reset.

  1. Start a 3-node Kafka cluster. Follow the instructions through step 6 on https://kafka.apache.org/quickstart but skip steps 3, 4, and 5.

  2. Configure filebeat.test.yml (this could be any Beat, but Filebeat is easy to test with) like so:

    filebeat.inputs:
    - type: log
      enabled: true
      paths:
        - /tmp/logs/*.log
    
    output.kafka:
      enabled: true
      hosts: [ "localhost:9092", "localhost:9093", "localhost:9094" ]
      version: 2.1
      topic: foobar
      backoff:
        init: 3s
        max: 30s
    
  3. Start writing log lines to the test log file.

    mkdir /tmp/logs
    while true; do echo "$RANDOM new log entry" >> /tmp/logs/foo.log; sleep 0.1; done
    
  4. Modify the source of the backoff function to emit the computed backoff duration right before it is returned, like so:

    diff --git a/libbeat/outputs/kafka/config.go b/libbeat/outputs/kafka/config.go
    index fd3151b92..6f9fed429 100644
    --- a/libbeat/outputs/kafka/config.go
    +++ b/libbeat/outputs/kafka/config.go
    @@ -336,6 +336,8 @@ func makeBackoffFunc(cfg backoffConfig) func(retries, maxRetries int) time.Durat
                    // time falls into the interval [dur/2, dur]
                    limit := int64(dur / 2)
                    jitter := rand.Int63n(limit + 1)
    -               return time.Duration(limit + jitter)
    +               t := time.Duration(limit + jitter)
    +               fmt.Printf("backing off for %v (retries, %v, init: %v, max: %v)\n", t, retries, cfg.Init, cfg.Max)
    +               return t
            }
     }
    
  5. Build Filebeat with the modified source.

    mage build
    
  6. Start Filebeat.

    ./filebeat -c filebeat.test.yml -e
    
  7. Check that the foobar topic has been created in Kafka.

    bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic foobar
    

    This topic is expected to have only 1 partition, partition 0. Note this partition's leader node ID.

  8. Now kill the Kafka node corresponding to the partition leader's node ID.

  9. Re-run the kafka-topics.sh command from step 7 and verify that partition 0 has no leader.

  10. Check the Filebeat log. You should now see multiple entries like this:

backing off for 1.51030142s (retries, 0, init: 3s, max: 30s)

Note that the backoff duration reported in these entries seems to hover around the backoff.init value but never grows towards the backoff.max value.

ycombinator added the bug, libbeat, :Outputs, and Team:Integrations labels on Jun 5, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

ycombinator self-assigned this on Jun 5, 2020

ycombinator commented Jun 5, 2020

Here are multiple lines of the relevant output:

backing off for 1.792264421s (retries, 0, init: 3s, max: 30s)
backing off for 2.920651791s (retries, 0, init: 3s, max: 30s)
backing off for 1.623648064s (retries, 0, init: 3s, max: 30s)
backing off for 2.507094059s (retries, 0, init: 3s, max: 30s)
backing off for 2.110677972s (retries, 0, init: 3s, max: 30s)
backing off for 2.367617504s (retries, 0, init: 3s, max: 30s)
backing off for 2.910905813s (retries, 0, init: 3s, max: 30s)
backing off for 2.618339663s (retries, 0, init: 3s, max: 30s)
backing off for 2.516195213s (retries, 0, init: 3s, max: 30s)
backing off for 2.319671577s (retries, 0, init: 3s, max: 30s)
...
backing off for 2.881192726s (retries, 0, init: 3s, max: 30s)
backing off for 2.298502077s (retries, 0, init: 3s, max: 30s)
backing off for 2.490003244s (retries, 0, init: 3s, max: 30s)
backing off for 2.3601182s (retries, 0, init: 3s, max: 30s)
backing off for 1.515103433s (retries, 0, init: 3s, max: 30s)
backing off for 2.259946439s (retries, 0, init: 3s, max: 30s)

Notice how retries stays at 0. This explains why the backoff duration computed by the algorithm never grows towards the max. We expect sarama to pass an increasing value of retries with each retry but that doesn't seem to be the case. So that's what we need to investigate next.
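
For context, the backoff function is handed to sarama as a retry backoff callback. Roughly, the wiring looks like the sketch below (this reuses the backoffConfig/makeBackoffFunc sketch from the issue description and assumes a hypothetical helper name newSaramaConfig; the sarama fields Producer.Retry.BackoffFunc and Metadata.Retry.BackoffFunc do take a func(retries, maxRetries int) time.Duration):

    import "github.com/Shopify/sarama"

    // Sketch: sarama is expected to call the registered BackoffFunc with the
    // current retry count; that count is what should make the duration grow.
    func newSaramaConfig(backoff backoffConfig) *sarama.Config {
        k := sarama.NewConfig()
        backoffFn := makeBackoffFunc(backoff)
        k.Producer.Retry.BackoffFunc = backoffFn // backoff between publish retries
        k.Metadata.Retry.BackoffFunc = backoffFn // backoff between metadata refresh retries
        return k
    }

If the (forked) sarama library always invokes the callback with retries=0, the computed duration stays in the first interval, which would match the output above.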


kvch commented Sep 2, 2020

The updated version of the sarama fork has been released in 7.9. Can we close this?
