Kafka output backoff does not seem to grow as expected #19015

Closed
ycombinator opened this issue Jun 5, 2020 · 3 comments

ycombinator commented Jun 5, 2020

Version: master / 8.0.0.

In #17808 we implemented a stateless exponential-backoff-with-jitter for the Kafka output. We even added a unit test that we believed would exercise the growth of the backoff duration over time, and that test has been passing!

However, adding some debugging statements to the backoff function shows that the actual backoff duration doesn't grow as expected, i.e. exponentially with some jitter, up to a max value.
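
For reference, here is a minimal, self-contained sketch of how the backoff function is intended to behave (a simplification for illustration, based on the change in #17808 and the diff in step 4 below, not the exact Beats source; the real makeBackoffFunc lives in libbeat/outputs/kafka/config.go):

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    type backoffConfig struct {
        Init time.Duration
        Max  time.Duration
    }

    // makeBackoffFunc returns a stateless backoff function: the base duration
    // doubles with each retry (Init * 2^retries), is capped at Max, and the
    // returned value is jittered into the interval [dur/2, dur].
    func makeBackoffFunc(cfg backoffConfig) func(retries, maxRetries int) time.Duration {
        return func(retries, maxRetries int) time.Duration {
            dur := cfg.Init * time.Duration(1<<uint(retries))
            if dur <= 0 || dur > cfg.Max { // the <= 0 check guards against overflow
                dur = cfg.Max
            }
            limit := int64(dur / 2)
            jitter := rand.Int63n(limit + 1)
            return time.Duration(limit + jitter)
        }
    }

    func main() {
        backoff := makeBackoffFunc(backoffConfig{Init: 3 * time.Second, Max: 30 * time.Second})
        // With growing retries the durations should roughly double each time,
        // e.g. ~1.5-3s, ~3-6s, ~6-12s, ..., capping out at ~15-30s.
        for retries := 0; retries < 6; retries++ {
            fmt.Printf("retries=%d -> %v\n", retries, backoff(retries, 6))
        }
    }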

Steps to reproduce

To reproduce this issue, one needs to cause the partition leader node to go away, which should trigger the backoff. One also needs to subsequently bring the partition leader node back online to verify that the backoff duration is reset.

  1. Start a 3-node Kafka cluster. Follow the instructions through step 6 on https://kafka.apache.org/quickstart but skip steps 3, 4, and 5.

  2. Configure filebeat.test.yml (this could be any Beat, but Filebeat is easy to test with) like so:

    filebeat.inputs:
    - type: log
      enabled: true
      paths:
        - /tmp/logs/*.log
    
    output.kafka:
      enabled: true
      hosts: [ "localhost:9092", "localhost:9093", "localhost:9094" ]
      version: 2.1
      topic: foobar
      backoff:
        init: 3s
        max: 30s
    
  3. Start writing log lines to the test log file.

    mkdir /tmp/logs
    while true; do echo "$RANDOM new log entry" >> /tmp/logs/foo.log; sleep 0.1; done
    
  4. Modify the source of the backoff function to emit the computed backoff duration right before it is returned, like so:

    diff --git a/libbeat/outputs/kafka/config.go b/libbeat/outputs/kafka/config.go
    index fd3151b92..6f9fed429 100644
    --- a/libbeat/outputs/kafka/config.go
    +++ b/libbeat/outputs/kafka/config.go
    @@ -336,6 +336,8 @@ func makeBackoffFunc(cfg backoffConfig) func(retries, maxRetries int) time.Durat
                    // time falls into the interval [dur/2, dur]
                    limit := int64(dur / 2)
                    jitter := rand.Int63n(limit + 1)
    -               return time.Duration(limit + jitter)
    +               t := time.Duration(limit + jitter)
    +               fmt.Printf("backing off for %v (retries, %v, init: %v, max: %v)\n", t, retries, cfg.Init, cfg.Max)
    +               return t
            }
     }
    
  5. Build Filebeat with the modified source.

    mage build
    
  6. Start Filebeat.

    ./filebeat -c filebeat.test.yml -e
    
  7. Check that the foobar topic has been created in Kafka.

    bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic foobar
    

    This topic is expected to have only 1 partition, partition 0. Note this partition's leader node ID.

  8. Now kill the Kafka node corresponding to the partition leader's node ID.

  9. Re-run the kafka-topics.sh command from step 7 and verify that partition 0 has no leader.

  10. Check the Filebeat log. You should now see multiple entries like this:

backing off for 1.51030142s (retries, 0, init: 3s, max: 30s)

Note that the backoff duration reported in these entries seems to hover around the backoff.init value but never grows towards the backoff.max value.

ycombinator added the bug, libbeat, :Outputs, and Team:Integrations labels on Jun 5, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

ycombinator self-assigned this on Jun 5, 2020

ycombinator commented Jun 5, 2020

Here are multiple lines of the relevant output:

backing off for 1.792264421s (retries, 0, init: 3s, max: 30s)
backing off for 2.920651791s (retries, 0, init: 3s, max: 30s)
backing off for 1.623648064s (retries, 0, init: 3s, max: 30s)
backing off for 2.507094059s (retries, 0, init: 3s, max: 30s)
backing off for 2.110677972s (retries, 0, init: 3s, max: 30s)
backing off for 2.367617504s (retries, 0, init: 3s, max: 30s)
backing off for 2.910905813s (retries, 0, init: 3s, max: 30s)
backing off for 2.618339663s (retries, 0, init: 3s, max: 30s)
backing off for 2.516195213s (retries, 0, init: 3s, max: 30s)
backing off for 2.319671577s (retries, 0, init: 3s, max: 30s)
...
backing off for 2.881192726s (retries, 0, init: 3s, max: 30s)
backing off for 2.298502077s (retries, 0, init: 3s, max: 30s)
backing off for 2.490003244s (retries, 0, init: 3s, max: 30s)
backing off for 2.3601182s (retries, 0, init: 3s, max: 30s)
backing off for 1.515103433s (retries, 0, init: 3s, max: 30s)
backing off for 2.259946439s (retries, 0, init: 3s, max: 30s)

Notice how retries stays at 0. This explains why the backoff duration computed by the algorithm never grows towards the max. We expect sarama to pass an increasing value of retries with each retry but that doesn't seem to be the case. So that's what we need to investigate next.
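
For context, the backoff function is handed to sarama as a retry backoff callback. Roughly, the wiring looks like the sketch below (this reuses the backoffConfig/makeBackoffFunc sketch from the issue description and assumes a hypothetical helper name newSaramaConfig; the sarama fields Producer.Retry.BackoffFunc and Metadata.Retry.BackoffFunc do take a func(retries, maxRetries int) time.Duration):

    import "github.com/Shopify/sarama"

    // Sketch: sarama is expected to call the registered BackoffFunc with the
    // current retry count; that count is what should make the duration grow.
    func newSaramaConfig(backoff backoffConfig) *sarama.Config {
        k := sarama.NewConfig()
        backoffFn := makeBackoffFunc(backoff)
        k.Producer.Retry.BackoffFunc = backoffFn // backoff between publish retries
        k.Metadata.Retry.BackoffFunc = backoffFn // backoff between metadata refresh retries
        return k
    }

If the (forked) sarama library always invokes the callback with retries=0, the computed duration stays in the first interval, which would match the output above.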


kvch commented Sep 2, 2020

The updated version of the sarama fork has been released in 7.9. Can we close this?
