This repository has been archived by the owner on Mar 17, 2024. It is now read-only.

Continue to calculate lag for inactive groups for a configurable timespan #66

Closed
seglo opened this issue Sep 18, 2019 · 6 comments · Fixed by #128
Labels
enhancement New feature or request

Comments

@seglo
Owner

seglo commented Sep 18, 2019

Inspired by discussion in #63

Add a feature that continues to calculate consumer group lag for a group after it's no longer active. Today, we immediately evict metrics for groups that no longer exist. We detect that a group has been removed by comparing the list of groups returned by the current poll to the list returned by the last poll.

Instead of removing metrics as soon as a group is no longer returned when we retrieve group metadata, we will continue to calculate lag for its last reported partition subscription. When a group is detected as removed, it will be added with a timestamp to a removal list that is cleaned up after each poll. If a group has been in the removal list longer than a configured time span, it will be removed. If the group becomes active again, it is taken off the removal list. A default of 30 minutes would be a good value to start with.
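
The removal list described above could look roughly like the following. This is only a sketch of the proposal, not code from Kafka Lag Exporter: `InactiveGroupTracker`, `poll`, and the field names are hypothetical, and partitions are represented as plain `(topic, partition)` tuples.

```python
from dataclasses import dataclass, field


@dataclass
class InactiveGroupTracker:
    """Hypothetical helper that keeps reporting groups for a while after they vanish."""

    retention_seconds: float = 30 * 60  # the 30 minute default suggested above
    last_assignment: dict = field(default_factory=dict)  # group -> partitions last reported
    removed_at: dict = field(default_factory=dict)  # group -> time it was first seen missing

    def poll(self, active: dict, now: float) -> dict:
        """Return the group -> partitions map to compute lag for in this poll.

        `active` maps each group returned by the current poll to its partitions;
        `now` is the poll time in seconds.
        """
        # Groups seen in an earlier poll but missing now go on the removal list.
        for group in self.last_assignment.keys() - active.keys():
            self.removed_at.setdefault(group, now)

        # Groups that are active again leave the removal list.
        for group in active:
            self.removed_at.pop(group, None)
        self.last_assignment.update(active)

        # Evict groups that have been missing longer than the retention window.
        for group, since in list(self.removed_at.items()):
            if now - since > self.retention_seconds:
                del self.removed_at[group]
                del self.last_assignment[group]

        return dict(self.last_assignment)
```

Calling `poll` once per collection cycle with the groups the cluster currently returns would keep a vanished group in the result until the retention window passes, and drop it afterwards.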

seglo added the enhancement label Sep 18, 2019
@graphex
Contributor

graphex commented Sep 21, 2019

Another potential direction here might be to have a flag for the collection of (earliest and) latest metrics for all topics, regardless of consumer state. This would make alerting for unconsumed topics/partitions possible, which is a good thing to do to prevent data loss.
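
As a rough illustration of the alerting idea, assuming earliest and latest offsets were collected for every partition: the function name and the dictionary shapes below are made up for this sketch and are not part of Kafka Lag Exporter.

```python
def unconsumed_partitions(earliest, latest, committed):
    """Return partitions that hold records but have no committed offset from any group.

    earliest / latest: {(topic, partition): offset} for every partition in the cluster.
    committed: {group: {(topic, partition): offset}} for the known groups.
    """
    # Every partition that at least one group has committed an offset for.
    consumed = {tp for offsets in committed.values() for tp in offsets}
    return sorted(
        tp
        for tp, end in latest.items()
        # A partition is non-empty when its latest offset is past its earliest offset.
        if tp not in consumed and end > earliest.get(tp, 0)
    )
```

An alert could then fire whenever this list is non-empty for longer than some threshold.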

@seglo
Owner Author

seglo commented Sep 21, 2019

That would be useful, but based on observations from #63 it seems that inactive groups aren't available when using AdminClient. It may be worth investigating this more: it's possible the group metadata is still accessible and is just not returned when getting a list of consumer groups. We use the list of consumer groups to determine which groups to retrieve metadata for.

@rkrage

rkrage commented Jan 31, 2020

So I think something else is actually going on here. The kafka-consumer-groups.sh script uses AdminClient and absolutely displays inactive consumer groups:

QA rkrage@log-kafka01.qa:~$
/usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
rkrage_test

QA rkrage@log-kafka01.qa:~$
/usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group rkrage_test --describe
Consumer group 'rkrage_test' has no active members.

TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
rkrage_test     0          4090            4090            0               -               -               -

QA rkrage@log-kafka01.qa:~$
/usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group rkrage_test --describe --state
Consumer group 'rkrage_test' has no active members.

COORDINATOR (ID)          ASSIGNMENT-STRATEGY       STATE                #MEMBERS
log-kafka02.qa:9092 (2)                             Empty                0

I believe this is the source code it's using to list all groups: https://github.com/apache/kafka/blob/6dc6f6a60ddf7a70c394c147fbed579749d2abcc/core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala#L181-L185
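
For reference, the LAG column in the output above is the log end offset minus the group's committed offset, so it can be computed for a group with no active members. A minimal sketch follows; the function name is hypothetical, and clamping at zero is a choice made for this example.

```python
def partition_lag(log_end_offset, committed_offset):
    """Lag for one partition, clamped at zero in case the two offsets were read at different times."""
    return max(log_end_offset - committed_offset, 0)


# Values from the rkrage_test row above: CURRENT-OFFSET 4090, LOG-END-OFFSET 4090.
lag = partition_lag(log_end_offset=4090, committed_offset=4090)
print(lag)  # prints 0, matching the LAG column
```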

@seglo
Owner Author

seglo commented Sep 2, 2020

I think this is the same issue as #126, where a group with no active members had its information inadvertently filtered out. @lilyevsky resolved this with #128, which was released in 0.6.2.

The kafka-consumer-groups.sh script uses AdminClient and absolutely displays inactive consumer groups:

/usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group rkrage_test --describe
Consumer group 'rkrage_test' has no active members.

TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
rkrage_test     0          4090            4090            0               -               -               -

This makes me think it might be the same issue.

I believe this is the source code it's using to list all groups: apache/kafka@6dc6f6a/core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala#L181-L185

We make the same call in KafkaClient.

https://github.com/lightbend/kafka-lag-exporter/blob/v0.6.3/src/main/scala/com/lightbend/kafkalagexporter/KafkaClient.scala#L113

Can anyone confirm with the latest version of Kafka Lag Exporter? (@rkrage)

@rkrage

rkrage commented Sep 14, 2020

@seglo, just upgraded to 0.6.3 today and this appears to be solved for us!

@seglo
Owner Author

seglo commented Sep 15, 2020

@rkrage Excellent! I'll close this ticket.

Fixed with #128
