
[kafka_consumer] Add kafka support for monitor_unlisted_consumer_groups #2730

Closed
wants to merge 1 commit into from

Conversation

jeffwidman
Contributor

@jeffwidman jeffwidman commented Dec 12, 2018

This adds support for dynamically discovering/fetching metrics for
unlisted consumer groups that store their offsets in Kafka.

Previously, `monitor_unlisted_consumer_groups` only worked for
consumer offsets stored in Zookeeper.

Adding this feature involves a number of edge cases; see the full
decision tree in `conf.yaml.example` to understand how the check decides
which offsets to fetch from which source.
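
The decision tree itself lives in `conf.yaml.example`. As a rough illustration of the kind of logic involved, here is a hypothetical sketch; the option names mirror the check's configuration, but this helper is illustrative only and not part of the PR:

```python
def pick_offset_sources(monitor_unlisted, kafka_offsets_enabled, zk_connect_str):
    """Return the offset stores the check should query, in priority order.

    monitor_unlisted      -- the monitor_unlisted_consumer_groups option
    kafka_offsets_enabled -- whether Kafka-stored offsets are enabled
    zk_connect_str        -- Zookeeper connection string, if any (legacy path)
    """
    sources = []
    if kafka_offsets_enabled:
        sources.append("kafka")          # new KafkaAdminClient-based path
    if zk_connect_str:
        sources.append("zookeeper")      # legacy path
    if monitor_unlisted and not sources:
        # No explicit source configured: fall back to Kafka, since
        # Zookeeper-based offsets are deprecated in the Kafka ecosystem.
        sources.append("kafka")
    return sources
```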

In order to accomplish this, I had to switch from
`kafka-python.KafkaClient` to the new `kafka-python.KafkaAdminClient`.
In addition to enabling new functionality, this also allowed
removing a lot of code that was now superfluous.


Note:
I suspect this will not work on clusters < 0.10.0, because upstream kafka-python bombs out if the `KafkaAdminClient` is used against older Kafka clusters. That may block merging unless upstream is patched to allow it.
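
For reference, group discovery with `kafka-python`'s `KafkaAdminClient` looks roughly like the sketch below, assuming kafka-python >= 1.4.5 and a reachable broker >= 0.10.0; the bootstrap address is a placeholder and the function names are illustrative, not the check's real API:

```python
def consumer_group_ids(listing):
    """KafkaAdminClient.list_consumer_groups() returns (group_id, protocol_type)
    tuples; regular consumers report protocol_type 'consumer'."""
    return [gid for gid, ptype in listing if ptype == "consumer"]

def fetch_unlisted_group_offsets(bootstrap_servers="localhost:9092"):
    # Imported here so the pure helper above stays usable without
    # kafka-python installed; this function needs a live broker.
    from kafka import KafkaAdminClient

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    offsets = {}
    for group in consumer_group_ids(admin.list_consumer_groups()):
        # Returns a {TopicPartition: OffsetAndMetadata} mapping per group.
        offsets[group] = admin.list_consumer_group_offsets(group)
    return offsets
```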

- another_zookeeper:2181
# zk_iteration_ival: 1 # how many seconds between ZK consumer offset
# collections. If kafka consumer offsets disabled
# this has no effect.
Contributor Author

@jeffwidman jeffwidman Dec 12, 2018


I removed this param as it will be natively supported at the instance level in v6 of the agent. Since it unnecessarily complicated the code + user experience when configuring the check, I proactively removed it. It is backwards compatible as it will simply be ignored if specified.

@jeffwidman
Contributor Author

Got this finished up and working on our clusters. Let's see what Travis has to say...

@jeffwidman
Contributor Author

Travis appears to be passing with Kafka clusters >= 0.10 and failing otherwise. This is because kafka-python's `KafkaAdminClient` currently does not support brokers < 0.10... so that needs to be fixed upstream before this can merge.
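
The version gate boils down to a simple tuple comparison. A minimal sketch, assuming broker versions are expressed as tuples the way kafka-python's `api_version` option expresses them (`KafkaAdminClient` itself raises `IncompatibleBrokerVersion` when the check fails):

```python
# Oldest broker version kafka-python's KafkaAdminClient will talk to.
MIN_ADMIN_CLIENT_VERSION = (0, 10, 0)

def admin_client_supported(api_version):
    """True if KafkaAdminClient can be used against a broker of this version."""
    return tuple(api_version) >= MIN_ADMIN_CLIENT_VERSION
```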

@gzussa
Contributor

gzussa commented Dec 24, 2018

Thanks so much for all this great work 💯 🥇
This is quite serious code :D.
Would it be possible to add tests around it?

@jeffwidman
Contributor Author

jeffwidman commented Dec 25, 2018

Thanks for giving it a look-see.

While I value tests, the bad news is that I will not have time to add them anytime soon.

Some projects have come up at work that will be sucking up all available time the next few months, and since this is working great on our clusters, I can't spend any more work time on it right now.

And my personal coding time is probably best spent fixing kafka-python's API handling so that it doesn't bomb out on brokers < 0.10.0, as that's the main thing preventing this from being merged.

No objections from me if you have the time/inclination to add tests yourself. Otherwise, when I eventually get the time I will add them, but it could be a while.

@gzussa
Contributor

gzussa commented Dec 26, 2018

Fair prioritization :) Let's leave this open for the time being.

@stale

stale bot commented Mar 6, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

@stale stale bot added the stale label Mar 6, 2019
@dabcoder
Contributor

dabcoder commented Apr 1, 2019

@jeffwidman just a quick ping re: this PR.
Do you think you'll have a chance to add some tests at some point? There are also some conflicting files (and the Travis build is failing) as you can see, so just wanted to make sure this does not fall through the cracks. Let us know either way, thanks again!

@stale stale bot removed the stale label Apr 1, 2019
@jeffwidman
Contributor Author

jeffwidman commented Apr 2, 2019

I am unlikely to be able to spend more time on this anytime soon, as it works great in our environment.

You guys might want to take a second look at this though, as it resolves several complex/possibly buggy bits in the current version of the check related to how group coordinators are identified.

I'm surprised none of your other customers have requested that `monitor_unlisted_consumer_groups` be extended to support Kafka-based consumer groups.

@gzussa
Contributor

gzussa commented Apr 6, 2019

Hi @jeffwidman We updated tests a little on origin/master to increase confidence in these changes (sorry it took us so long). Basically, we made sure we were asserting all metrics (#3468). Would you be able to resolve the conflicts? Then we should be able to merge :)

@jeffwidman
Contributor Author

jeffwidman commented Apr 25, 2019

@gzussa I'm happy to fix conflicts, but the problem will be that the kafka-python's API that's used here will break on brokers < 0.10.

The actual Kafka protocol calls still work on older brokers; it's just that a check within kafka-python's `KafkaAdminClient` will throw an error.

The current version of the check used in the Datadog Agent doesn't use the `KafkaAdminClient`. It instead hand-rolls the group coordinator lookup in a way that is difficult to follow in the code, and once I finally worked out how it operates, it turned out to be a bit buggy, light on error handling, and overly chatty at the network protocol level.

So it's a big win to switch to relying on the logic already built into kafka-python's KafkaAdminClient for handling group coordinator lookup/failover/error handling.

But in order to do so, either that upstream library needs patching to not throw the error for brokers < 0.10, or this check needs to drop support for clusters < 0.10...

It'd actually simplify the check greatly if you were willing to drop support for brokers < 0.10 (or even simpler, < 0.10.2), because in addition to the broker communication being simpler, you could also drop support for Zookeeper-based offsets, which were deprecated in the Kafka ecosystem years ago.

It'd be useful to know how many of your customers are running the old versions before deciding whether that's safe. My suspicion is your product manager won't be on board with dropping support for old versions, in which case it's better to patch kafka-python. That should be relatively straightforward; I just haven't had the time (or need, since we run newer clusters and this check as written works great for us).
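
The dual-code-path idea discussed here can be sketched as a simple dispatch: route brokers >= 0.10.2 to the `KafkaAdminClient`-based path, and older brokers (and Zookeeper-stored offsets) to the legacy path. The function name and return values below are illustrative, not the check's real API:

```python
# Oldest broker version that supports fetching all known offsets for a
# consumer group, which the new KafkaAdminClient-based path relies on.
NEW_PATH_MIN_VERSION = (0, 10, 2)

def choose_check_path(broker_version, zk_connect_str=None):
    """Pick which implementation of the check should handle this instance."""
    if zk_connect_str or tuple(broker_version) < NEW_PATH_MIN_VERSION:
        return "legacy"   # hand-rolled coordinator lookup, Zookeeper offsets
    return "new"          # kafka-python KafkaAdminClient path
```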

@gzussa
Contributor

gzussa commented May 6, 2019

Thanks for all these details. That is very useful! Maybe another approach would be to use both libraries behind an abstract factory pattern; a good example of this can be found in the openstack_controller integration. I will leave it to @DataDog/agent-integrations to decide which approach to take from here.

@stale

stale bot commented Jun 5, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

@stale stale bot added the stale label Jun 5, 2019
@jeffwidman
Contributor Author

Closing in favor of #3957, which offers the same functionality but has better error handling, cleaner separation between the legacy and new code paths, and also uses callbacks to do all the I/O in parallel for much faster querying.

@jeffwidman jeffwidman closed this Aug 17, 2019
ofek pushed a commit that referenced this pull request Oct 1, 2019
…_unlisted_consumer_groups` (#3957)

* Add Kafka support for monitor_unlisted_consumer_groups

This is a major refactor of `kafka_consumer` check with a new code path
that will exist alongside the legacy code path. The new code path:
1. Eliminates legacy cruft:
   * drops support for fetching consumer offsets from Zookeeper
   * drops support for brokers < 0.10.2, which do not support fetching
     all known offsets for a consumer group
   * migrates from the hand-rolled group-coordinator lookup code,
     retries, etc. to relying on `kafka-python`'s `KafkaAdminClient` to
     handle all that under the covers
2. Adds support for `monitor_unlisted_consumer_groups` to fetch consumer
   offsets from Kafka
3. Refactors from sync to async using callbacks for improved
   performance, especially with larger clusters and more consumer groups.

To clarify: Fetching consumer offsets from older brokers and from
zookeeper is still supported, it just uses the legacy code path instead
of this new code path.
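
The async refactor in point 3 amounts to firing all offset-fetch requests up front and collecting results via callbacks rather than blocking per group. A rough sketch of that shape using stdlib futures (the check itself uses kafka-python's own future/callback machinery, not `concurrent.futures`, and `fetch_one` here stands in for a per-group offset fetch):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all_offsets(groups, fetch_one):
    """Fan out fetch_one(group) across groups; gather results via callbacks."""
    results = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {group: pool.submit(fetch_one, group) for group in groups}
        for group, fut in futures.items():
            # Callback fires as soon as each fetch completes, so slow
            # groups don't serialize the whole collection.
            fut.add_done_callback(
                lambda f, g=group: results.__setitem__(g, f.result())
            )
    # Exiting the context manager waits for all futures (and callbacks).
    return results
```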

This has been under discussion in various forms for over two years, see
#423 and
#2730. The main blocker
was coming up with a clean way to support legacy features like fetching
from Zookeeper / older brokers while simultaneously migrating to
`kafka-python`'s new `KafkaAdminClient` which only supports
brokers >= `0.10.0`. After several rounds of discussion, the decision
was to move the legacy code into a separate file and start from scratch
with a new implementation of the check which would be the primary one
for new features moving forward.

* 1.4.7 was released

* Update agent_requirements.in

* for now..