
OffsetForLeaderEpoch loop of failed requests with multiple leader changes #4425

Closed
emasab opened this issue Sep 7, 2023 · 5 comments · Fixed by #4433
Comments

@emasab
Collaborator

emasab commented Sep 7, 2023

Description

A loop of OffsetForLeaderEpoch requests occurs when multiple leader changes happen while an earlier OffsetForLeaderEpoch request is being retried because of an error.

How to reproduce

To reproduce it, there must be an initial partition leader change that triggers an OffsetForLeaderEpoch request. That request must fail while a second leader change happens. The corresponding current leader epoch isn't updated, leading to a loop of failing requests.
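A minimal consumer sketch (using librdkafka's C API) that can be left running while the two leader changes are triggered from the broker side (e.g. rolling restarts or a partition reassignment); the broker address, topic and group id are placeholders, and `debug=protocol,consumer,fetch` makes the repeated OffsetForLeaderEpoch requests visible in the client log:

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "repro-group",
                          errstr, sizeof(errstr));
        /* Surface the OffsetForLeaderEpoch traffic in the client log */
        rd_kafka_conf_set(conf, "debug", "protocol,consumer,fetch",
                          errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "Failed to create consumer: %s\n", errstr);
                return 1;
        }
        rd_kafka_poll_set_consumer(rk);

        rd_kafka_topic_partition_list_t *topics =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(topics, "test-topic",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);

        /* Keep consuming while the broker-side leader changes happen;
         * with an affected version the log fills with failing
         * OffsetForLeaderEpoch requests. */
        while (1) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 1000);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }
}
```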

Checklist

  • librdkafka version (release number or git tag): 2.1.0+
  • Apache Kafka version: 3.5.1
  • librdkafka client configuration: any
  • Operating system: any
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@scanterog

We've also hit this issue. Would it be feasible to have a 2.2.1 release as soon as the fix is ready?

@emasab
Collaborator Author

emasab commented Sep 19, 2023

@scanterog Sure, we're planning to include it in the next release.

@plachor

plachor commented Oct 23, 2023

Hi @emasab, we also struggle with this one. Once one of our instances is affected, it consumes its entire CPU limit and carries a lot of RX (1 GB/min) and TX (100 MB/min) traffic. The broker is also flooded with 400K+ OffsetForLeaderEpoch requests per minute.

We started observing this recently, after bumping from v1.9.0 to v2.2.0; previously it was working fine. We see it during maintenance actions:

  • rolling brokers due to updates (OS and Kafka)
  • scaling cluster - introducing new brokers and relocating partitions

I agree with @scanterog. Would it not be wise to cherry-pick this and release a hotfix for v2.2.0 and v2.1.0, so that affected users are not forced to pull a minor release that introduces new features just to get the fix into already deployed versions?

edit: I'll just add that we use the .NET wrapper confluent-kafka-dotnet, so this hotfix would need to be released there as well.

@rwkarg

rwkarg commented Oct 23, 2023

This continues to be an issue for us as well (using the same C# wrapper as @plachor; the OffsetForLeaderEpoch spam appears to come from inside librdkafka, not the .NET code).

Is there an estimated schedule for the next release that includes #4433?

The only resolution is to restart processes that get into this state, and we haven't yet found a way to automate that, as the issue isn't in the .NET code, and the monitoring we have on the broker side just shows a lot of this call being made but not which client(s) it is coming from. (A rough client-side detection idea based on librdkafka statistics is sketched below.)
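One possible way to spot the loop from the client side is librdkafka's statistics callback: the per-broker request counters in the emitted stats JSON should show the OffsetForLeaderEpoch count exploding. The sketch below is an assumption-laden illustration, not a verified recipe: the `"OffsetForLeaderEpoch"` key name, the threshold, and the crude string scan (instead of a real JSON parser) are all placeholders.

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <librdkafka/rdkafka.h>

static long prev_ofle_count = -1;

/* Stats callback: sum every "OffsetForLeaderEpoch": <n> occurrence
 * (assumed to appear once per broker in the stats JSON) and flag a
 * suspiciously large increase between two statistics intervals. */
static int stats_cb(rd_kafka_t *rk, char *json, size_t json_len,
                    void *opaque) {
        (void)rk; (void)json_len; (void)opaque;
        long total = 0;
        const char *p = json;

        while ((p = strstr(p, "\"OffsetForLeaderEpoch\":")) != NULL) {
                p += strlen("\"OffsetForLeaderEpoch\":");
                total += strtol(p, NULL, 10);
        }

        if (prev_ofle_count >= 0 && total - prev_ofle_count > 10000) {
                /* Thousands of OffsetForLeaderEpoch requests per interval
                 * suggests the retry loop: alert or recycle the consumer. */
                fprintf(stderr, "Possible OffsetForLeaderEpoch loop detected\n");
        }
        prev_ofle_count = total;
        return 0; /* returning 0 lets librdkafka free the json buffer */
}

/* Hook it up before creating the consumer:
 *   rd_kafka_conf_set_stats_cb(conf, stats_cb);
 *   rd_kafka_conf_set(conf, "statistics.interval.ms", "60000",
 *                     errstr, sizeof(errstr));
 */
```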

@ijuma
Member

ijuma commented Dec 2, 2023

This fix is included in 2.3.0 according to the release notes.
