Race condition in partition rebalance. #62
nemosupremo added a commit to ChannelMeter/kafka that referenced this issue on Jul 5, 2015
wvanbergen added a commit that referenced this issue on Aug 17, 2015
Retry claiming partitions if the partition is already claimed. See #62
wvanbergen pushed a commit that referenced this issue on Aug 25, 2015
xq262144 pushed a commit to crask/kafka that referenced this issue on Nov 24, 2015
caihua-yin pushed a commit to caihua-yin/kafka that referenced this issue on Apr 19, 2016
This is a complementary fix for wvanbergen#68 (issue: wvanbergen#62), before the re-implementation (wvanbergen#72) is ready. In my use case, the message-consuming logic is sometimes time-consuming; even with the three retries from the fix in wvanbergen#68, it is still easy to hit issue #62. Further checking the current logic in consumer_group.go:partitionConsumer(), it may take as long as cg.config.Offsets.ProcessingTimeout to ReleasePartition so that the partition can be claimed by a new consumer during a rebalance. So simply set the maximum retry time to cg.config.Offsets.ProcessingTimeout, which is 60s by default. I verified a system including this fix with frequent rebalance operations, and the issue did not occur again.
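A minimal, self-contained sketch of the idea in this commit message: bound the claim retries by Offsets.ProcessingTimeout rather than a fixed retry count. Note that `claimWithRetry`, `claimPartition`, and `errPartitionClaimedByOther` are invented stand-ins for the library's internals, not its actual API:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errPartitionClaimedByOther stands in for the error returned when another
// consumer instance still owns the partition in Zookeeper.
var errPartitionClaimedByOther = errors.New("partition already claimed by another instance")

// attempts lets this toy claimPartition fail a couple of times before the
// "previous owner" has released, so the retry loop below is exercised.
var attempts int

// claimPartition is an invented placeholder for the real claim call, which
// would create the partition's ownership znode.
func claimPartition(topic string, partition int32) error {
	attempts++
	if attempts < 3 {
		return errPartitionClaimedByOther
	}
	return nil
}

// claimWithRetry keeps retrying a conflicting claim until processingTimeout
// elapses, mirroring the commit's reasoning: the previous owner needs at most
// Offsets.ProcessingTimeout to release the partition during a rebalance.
func claimWithRetry(topic string, partition int32, processingTimeout time.Duration) error {
	deadline := time.Now().Add(processingTimeout)
	for {
		err := claimPartition(topic, partition)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errPartitionClaimedByOther) || time.Now().After(deadline) {
			return err
		}
		time.Sleep(time.Second) // wait before trying again
	}
}

func main() {
	// The library's default ProcessingTimeout is 60s; a shorter value keeps
	// this example quick.
	err := claimWithRetry("events", 16, 5*time.Second)
	fmt.Println("claim result:", err) // succeeds once the owner has "released"
}
```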
wvanbergen added a commit that referenced this issue on Jun 15, 2016
Complementary fix for partition rebalance issue (#62)
(Moved from #61)
Actually, I was looking into this because I was having an issue where two of my nodes would stop accepting requests. I think this might be related: when my 9th node comes up, one node gives up all its partitions, and another node tries to claim those partitions and fails:
It looks like this might be a data race?
Node A tries to grab 16, 17, 18, 19 and fails.
Node B lets go of 16, 17, 18, 19, possibly after Node A tries to acquire them.
It looks like the naive thing to do would be to sleep for a second in topicListConsumer(). However, using something other than Sleep to solve this race condition might be better; unfortunately, I don't yet have a great understanding of how consumergroups work. Or, retry claiming a set number of times? (See the sketch below.)
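To make the suspected race concrete, here is a small self-contained Go simulation, not code from this library: the `owners` registry, node names, and timings are all invented (in the real consumergroup, ownership lives in Zookeeper znodes). A single claim attempt by Node A loses because Node B has not released yet, while retrying a set number of times succeeds:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// owners simulates the partition-ownership registry that Zookeeper provides
// in the real library.
type owners struct {
	mu    sync.Mutex
	owner map[int32]string
}

// claim succeeds only if no other node currently owns the partition.
func (o *owners) claim(p int32, node string) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	if _, taken := o.owner[p]; taken {
		return false
	}
	o.owner[p] = node
	return true
}

// release gives up ownership of a partition.
func (o *owners) release(p int32) {
	o.mu.Lock()
	defer o.mu.Unlock()
	delete(o.owner, p)
}

func main() {
	reg := &owners{owner: map[int32]string{16: "nodeB"}}

	// Node B releases partition 16 only after a short delay, as during a
	// rebalance where the old owner finishes in-flight work first.
	go func() {
		time.Sleep(500 * time.Millisecond)
		reg.release(16)
	}()

	// A single claim attempt by Node A loses the race...
	fmt.Println("single attempt:", reg.claim(16, "nodeA"))

	// ...but retrying a set number of times succeeds once B has let go.
	const maxRetries = 5
	for i := 0; i < maxRetries; i++ {
		if reg.claim(16, "nodeA") {
			fmt.Println("claimed on retry", i+1)
			return
		}
		time.Sleep(200 * time.Millisecond)
	}
	fmt.Println("gave up after", maxRetries, "retries")
}
```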