This repository has been archived by the owner on May 13, 2019. It is now read-only.

Race condition in partition rebalance. #62

Closed
nemosupremo opened this issue Jul 5, 2015 · 0 comments · Fixed by #68

Comments

@nemosupremo
Contributor

(Moved from #61)

Actually, I was looking into this because I was having an issue where two of my nodes would stop accepting requests. I think this might be related: when my 9th node comes up, one node gives up all of its partitions, and another node tries to claim those partitions and fails.

It looks like this might be a data race.
Node A tries to grab partitions 16, 17, 18, and 19, and fails:

[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] Triggering rebalance due to consumer list change
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user/14 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user/15 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user/12 :: Stopping partition consumer at offset 44
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user/13 :: Stopping partition consumer at offset 43
[Sarama] 2015/07/05 02:10:35 consumer/broker/40770 closed dead subscription to geard-user/13
[Sarama] 2015/07/05 02:10:35 consumer/broker/40770 closed dead subscription to geard-user/14
[Sarama] 2015/07/05 02:10:35 consumer/broker/40770 closed dead subscription to geard-user/15
[Sarama] 2015/07/05 02:10:35 consumer/broker/40770 closed dead subscription to geard-user/12
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user :: Stopped topic consumer
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] Currently registered consumers: 9
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user :: Started topic consumer
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user :: Claiming 4 of 32 partitions
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user/16 :: FAILED to claim the partition: Cannot claim partition: it is already claimed by another instance
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user/17 :: FAILED to claim the partition: Cannot claim partition: it is already claimed by another instance
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user/18 :: FAILED to claim the partition: Cannot claim partition: it is already claimed by another instance
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user/19 :: FAILED to claim the partition: Cannot claim partition: it is already claimed by another instance
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user :: Stopped topic consumer
[Sarama] 2015/07/05 02:18:46 client/metadata fetching metadata for all topics from broker 10.129.196.48:9092

Node B lets go of 16, 17, 18, and 19, possibly after Node A tries to acquire them:

[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] Triggering rebalance due to consumer list change
[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] geard-user/16 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] geard-user/17 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] geard-user/18 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] geard-user/19 :: Stopping partition consumer at offset 44
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 closed dead subscription to geard-user/18
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 closed dead subscription to geard-user/19
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 closed dead subscription to geard-user/16
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 closed dead subscription to geard-user/17
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user :: Stopped topic consumer
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] Currently registered consumers: 9
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user :: Started topic consumer
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user :: Claiming 4 of 32 partitions
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user/20 :: Partition consumer starting at offset 37.
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user/21 :: Partition consumer starting at offset 50.
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 added subscription to geard-user/20
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user/22 :: Partition consumer starting at offset 57.
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user/23 :: Partition consumer starting at offset 38.
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 added subscription to geard-user/21
[Sarama] 2015/07/05 02:10:37 consumer/broker/40770 added subscription to geard-user/23
[Sarama] 2015/07/05 02:10:37 consumer/broker/40770 added subscription to geard-user/22
[Sarama] 2015/07/05 02:18:42 client/metadata fetching metadata for all topics from broker 10.129.196.48:9092

The naive thing to do would be to sleep for a second in topicListConsumer(); however, using something other than Sleep to solve this race condition might be better. Unfortunately, I don't yet have a great understanding of how consumergroups work.

Or, retry claiming a set number of times?

wvanbergen added a commit that referenced this issue Aug 17, 2015
Retry claiming partitions if the partition is already claimed. See #62
caihua-yin pushed a commit to caihua-yin/kafka that referenced this issue Apr 19, 2016
This is a complementary fix for
wvanbergen#68
(issue: wvanbergen#62), to be used until the
re-implementation (wvanbergen#72) is ready.

In my use case, the message-consuming logic is sometimes slow;
even with the three retries added by the fix in pull #68, it's still easy
to hit issue #62. Further inspection of the current logic in
consumer_group.go:partitionConsumer() shows it may take
as long as cg.config.Offsets.ProcessingTimeout to ReleasePartition
so that the partition can be claimed by the new consumer during a rebalance.
So simply set the maximum retry time to
cg.config.Offsets.ProcessingTimeout, which is 60s by default.

Verified the system including this fix with frequent rebalance
operations; the issue does not occur again.
wvanbergen added a commit that referenced this issue Jun 15, 2016
Complementary fix of partition rebalance issue (#62)