
OffsetForLeaderEpoch loop of failed requests with multiple leader changes #4425

Closed
emasab opened this issue Sep 7, 2023 · 5 comments · Fixed by #4433
Comments

@emasab
Collaborator

emasab commented Sep 7, 2023

Description

A loop of OffsetForLeaderEpoch requests occurs when multiple leader changes happen while an earlier OffsetForLeaderEpoch request is being retried because of an error.

How to reproduce

To reproduce it, there must be an initial partition leader change that triggers an OffsetForLeaderEpoch request. That request must fail while a second leader change happens. The corresponding current leader epoch isn't updated, leading to a loop of failing requests.
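A minimal consumer sketch (using librdkafka's C API) that can be left running while the two leader changes are triggered from the broker side (e.g. rolling restarts or a partition reassignment); the broker address, topic and group id are placeholders, and `debug=protocol,consumer,fetch` makes the repeated OffsetForLeaderEpoch requests visible in the client log:

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "repro-group",
                          errstr, sizeof(errstr));
        /* Surface the OffsetForLeaderEpoch traffic in the client log */
        rd_kafka_conf_set(conf, "debug", "protocol,consumer,fetch",
                          errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "Failed to create consumer: %s\n", errstr);
                return 1;
        }
        rd_kafka_poll_set_consumer(rk);

        rd_kafka_topic_partition_list_t *topics =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(topics, "test-topic",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);

        /* Keep consuming while the broker-side leader changes happen;
         * with an affected version the log fills with failing
         * OffsetForLeaderEpoch requests. */
        while (1) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 1000);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }
}
```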

Checklist

  • librdkafka version (release number or git tag): 2.1.0+
  • Apache Kafka version: 3.5.1
  • librdkafka client configuration: any
  • Operating system: any
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@scanterog

We've also hit this issue. Would it be feasible to have a 2.2.1 release as soon as the fix is ready?

@emasab
Collaborator Author

emasab commented Sep 19, 2023

@scanterog Sure, we're planning to include it in the next release.

@plachor

plachor commented Oct 23, 2023

Hi @emasab, we also struggle with this one. Once one of our instances is affected, it consumes its entire CPU limit and carries a lot of RX (1 GB/min) and TX (100 MB/min) traffic. The broker is also flooded with 400K+ OffsetForLeaderEpoch requests per minute.

We started observing this recently, after bumping from v1.9.0 to v2.2.0; previously it was working fine. We see it during maintenance actions:

  • rolling brokers due to updates (OS and Kafka)
  • scaling cluster - introducing new brokers and relocating partitions

I agree with @scanterog. Would it not be wise to cherry-pick this and release a hotfix for v2.2.0 and v2.1.0, so that affected users are not forced to pull a minor release that introduces new features just to get the fix into already deployed versions?

edit: I'll just add that we use the .NET wrapper confluent-kafka-dotnet, so this hotfix would need to be released there as well.

@rwkarg

rwkarg commented Oct 23, 2023

This continues to be an issue for us as well (using the same C# wrapper as @plachor; the OffsetForLeaderEpoch spam appears to come from inside librdkafka, not the .NET code).

Is there an estimated schedule for the next release that includes #4433?

The only resolution is to restart processes that get into this state, and we haven't yet found a way to automate that, as the issue isn't in the .NET code, and the monitoring we have on the broker side just shows a lot of this call being made but not which client(s) it is coming from. (A rough client-side detection idea based on librdkafka statistics is sketched below.)
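One possible way to spot the loop from the client side is librdkafka's statistics callback: the per-broker request counters in the emitted stats JSON should show the OffsetForLeaderEpoch count exploding. The sketch below is an assumption-laden illustration, not a verified recipe: the `"OffsetForLeaderEpoch"` key name, the threshold, and the crude string scan (instead of a real JSON parser) are all placeholders.

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <librdkafka/rdkafka.h>

static long prev_ofle_count = -1;

/* Stats callback: sum every "OffsetForLeaderEpoch": <n> occurrence
 * (assumed to appear once per broker in the stats JSON) and flag a
 * suspiciously large increase between two statistics intervals. */
static int stats_cb(rd_kafka_t *rk, char *json, size_t json_len,
                    void *opaque) {
        (void)rk; (void)json_len; (void)opaque;
        long total = 0;
        const char *p = json;

        while ((p = strstr(p, "\"OffsetForLeaderEpoch\":")) != NULL) {
                p += strlen("\"OffsetForLeaderEpoch\":");
                total += strtol(p, NULL, 10);
        }

        if (prev_ofle_count >= 0 && total - prev_ofle_count > 10000) {
                /* Thousands of OffsetForLeaderEpoch requests per interval
                 * suggests the retry loop: alert or recycle the consumer. */
                fprintf(stderr, "Possible OffsetForLeaderEpoch loop detected\n");
        }
        prev_ofle_count = total;
        return 0; /* returning 0 lets librdkafka free the json buffer */
}

/* Hook it up before creating the consumer:
 *   rd_kafka_conf_set_stats_cb(conf, stats_cb);
 *   rd_kafka_conf_set(conf, "statistics.interval.ms", "60000",
 *                     errstr, sizeof(errstr));
 */
```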

@ijuma
Member

ijuma commented Dec 2, 2023

This fix is included in 2.3.0 according to the release notes.
