
[kafka_consumer] Add kafka support for monitor_unlisted_consumer_groups #2730

Closed
wants to merge 1 commit into from

Conversation

jeffwidman
Contributor

@jeffwidman jeffwidman commented Dec 12, 2018

This adds support for dynamically discovering/fetching metrics for
unlisted consumer groups that store their offsets in Kafka.

Previously, `monitor_unlisted_consumer_groups` only worked for
consumer offsets stored in Zookeeper.

Adding this feature involves a number of edge cases; see the full
decision tree in `conf.yaml.example` to understand how the check decides
which offsets to fetch from which source.
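
The decision tree itself lives in `conf.yaml.example`. As a rough illustration of the kind of logic involved, here is a hypothetical sketch; the option names mirror the check's configuration, but this helper is illustrative only and not part of the PR:

```python
def pick_offset_sources(monitor_unlisted, kafka_offsets_enabled, zk_connect_str):
    """Return the offset stores the check should query, in priority order.

    monitor_unlisted      -- the monitor_unlisted_consumer_groups option
    kafka_offsets_enabled -- whether Kafka-stored offsets are enabled
    zk_connect_str        -- Zookeeper connection string, if any (legacy path)
    """
    sources = []
    if kafka_offsets_enabled:
        sources.append("kafka")          # new KafkaAdminClient-based path
    if zk_connect_str:
        sources.append("zookeeper")      # legacy path
    if monitor_unlisted and not sources:
        # No explicit source configured: fall back to Kafka, since
        # Zookeeper-based offsets are deprecated in the Kafka ecosystem.
        sources.append("kafka")
    return sources
```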

In order to accomplish this, I had to switch from
`kafka-python.KafkaClient` to the new `kafka-python.KafkaAdminClient`.
In addition to enabling new functionality, this also allowed
removing a lot of code that was now superfluous.


Note:
I suspect this will not work on clusters < 0.10.0, because upstream kafka-python bombs out if the `KafkaAdminClient` is used against older Kafka clusters. That may block merging unless upstream is patched to allow it.
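
For reference, group discovery with `kafka-python`'s `KafkaAdminClient` looks roughly like the sketch below, assuming kafka-python >= 1.4.5 and a reachable broker >= 0.10.0; the bootstrap address is a placeholder and the function names are illustrative, not the check's real API:

```python
def consumer_group_ids(listing):
    """KafkaAdminClient.list_consumer_groups() returns (group_id, protocol_type)
    tuples; regular consumers report protocol_type 'consumer'."""
    return [gid for gid, ptype in listing if ptype == "consumer"]

def fetch_unlisted_group_offsets(bootstrap_servers="localhost:9092"):
    # Imported here so the pure helper above stays usable without
    # kafka-python installed; this function needs a live broker.
    from kafka import KafkaAdminClient

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    offsets = {}
    for group in consumer_group_ids(admin.list_consumer_groups()):
        # Returns a {TopicPartition: OffsetAndMetadata} mapping per group.
        offsets[group] = admin.list_consumer_group_offsets(group)
    return offsets
```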

- another_zookeeper:2181
# zk_iteration_ival: 1 # how many seconds between ZK consumer offset
# collections. If kafka consumer offsets disabled
# this has no effect.
Contributor Author

@jeffwidman jeffwidman Dec 12, 2018


I removed this param as it will be natively supported at the instance level in v6 of the agent. Since it unnecessarily complicated the code + user experience when configuring the check, I proactively removed it. It is backwards compatible as it will simply be ignored if specified.

@jeffwidman
Contributor Author

Got this finished up and working on our clusters. Let's see what Travis has to say...

@jeffwidman
Contributor Author

Travis appears to be passing with Kafka clusters >= 0.10 and failing otherwise. This is because kafka-python's `KafkaAdminClient` currently does not support brokers < 0.10... so that needs to be fixed upstream before this can merge.
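
The version gate boils down to a simple tuple comparison. A minimal sketch, assuming broker versions are expressed as tuples the way kafka-python's `api_version` option expresses them (`KafkaAdminClient` itself raises `IncompatibleBrokerVersion` when the check fails):

```python
# Oldest broker version kafka-python's KafkaAdminClient will talk to.
MIN_ADMIN_CLIENT_VERSION = (0, 10, 0)

def admin_client_supported(api_version):
    """True if KafkaAdminClient can be used against a broker of this version."""
    return tuple(api_version) >= MIN_ADMIN_CLIENT_VERSION
```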

@gzussa
Contributor

gzussa commented Dec 24, 2018

Thanks so much for all this great work 💯 🥇
This is quite serious code :D.
Would it be possible to add tests around it?

@jeffwidman
Contributor Author

jeffwidman commented Dec 25, 2018

Thanks for giving it a look-see.

While I value tests, the bad news is that I will not have time to add them anytime soon.

Some projects have come up at work that will be sucking up all available time the next few months, and since this is working great on our clusters, I can't spend any more work time on it right now.

And my personal coding time is probably best spent fixing kafka-python's API handling so that it doesn't bomb out on brokers < 0.10.0, as that's the main thing preventing this from being merged.

No objections from me if you have the time/inclination to add tests yourself. Otherwise, when I eventually get the time I will add them, but it could be a while.

@gzussa
Contributor

gzussa commented Dec 26, 2018

Fair prioritization :) Let's leave this open for the time being.

@stale

stale bot commented Mar 6, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

@stale stale bot added the stale label Mar 6, 2019
@dabcoder
Contributor

dabcoder commented Apr 1, 2019

@jeffwidman just a quick ping re: this PR.
Do you think you'll have a chance to add some tests at some point? There are also some conflicting files (and the Travis build is failing) as you can see, so just wanted to make sure this does not fall through the cracks. Let us know either way, thanks again!

@stale stale bot removed the stale label Apr 1, 2019
@jeffwidman
Contributor Author

jeffwidman commented Apr 2, 2019

I am unlikely to be able to spend more time on this anytime soon, as it works great in our environment.

You guys might want to take a second look at this though, as it resolves several complex/possibly buggy bits in the current version of the check related to how group coordinators are identified.

I'm surprised none of your other customers have requested that `monitor_unlisted_consumer_groups` be extended to support Kafka-based consumer groups.

@gzussa
Contributor

gzussa commented Apr 6, 2019

Hi @jeffwidman We updated tests a little on origin/master to increase confidence in these changes (sorry it took us so long). Basically, we made sure we were asserting all metrics (#3468). Would you be able to resolve the conflicts? Then we should be able to merge :)

@jeffwidman
Contributor Author

jeffwidman commented Apr 25, 2019

@gzussa I'm happy to fix conflicts, but the problem will be that the kafka-python's API that's used here will break on brokers < 0.10.

The actual Kafka protocol calls still work on older brokers; it's just that a check within kafka-python's `KafkaAdminClient` will throw an error.

The current version of the check used in the Datadog Agent doesn't use the `KafkaAdminClient`. It instead hand-rolls the group coordinator lookup in a way that is difficult to follow in the code, and once I finally worked out how it operates, it turned out to be a bit buggy, light on error handling, and overly chatty at the network protocol level.

So it's a big win to switch to relying on the logic already built into kafka-python's KafkaAdminClient for handling group coordinator lookup/failover/error handling.

But in order to do so, either that upstream library needs patching to not throw the error for brokers < 0.10, or this check needs to drop support for clusters < 0.10...

It'd actually simplify the check greatly if you were willing to drop support for brokers < 0.10 (or even simpler, < 0.10.2), because in addition to the broker communication being simpler, you could also drop support for Zookeeper-based offsets, which were deprecated in the Kafka ecosystem years ago.

It'd be useful to know how many of your customers are running the old versions before deciding whether that's safe. My suspicion is your product manager won't be on board with dropping support for old versions, in which case it's better to patch kafka-python. That should be relatively straightforward; I just haven't had the time (or need, since we run newer clusters and this check as written works great for us).
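
The dual-code-path idea discussed here can be sketched as a simple dispatch: route brokers >= 0.10.2 to the `KafkaAdminClient`-based path, and older brokers (and Zookeeper-stored offsets) to the legacy path. The function name and return values below are illustrative, not the check's real API:

```python
# Oldest broker version that supports fetching all known offsets for a
# consumer group, which the new KafkaAdminClient-based path relies on.
NEW_PATH_MIN_VERSION = (0, 10, 2)

def choose_check_path(broker_version, zk_connect_str=None):
    """Pick which implementation of the check should handle this instance."""
    if zk_connect_str or tuple(broker_version) < NEW_PATH_MIN_VERSION:
        return "legacy"   # hand-rolled coordinator lookup, Zookeeper offsets
    return "new"          # kafka-python KafkaAdminClient path
```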

@gzussa
Contributor

gzussa commented May 6, 2019

Thanks for all these details. That is very useful! Maybe another approach would be to use both libraries behind an abstract factory pattern; a good example of this can be found in the openstack_controller integration. I will leave it to @DataDog/agent-integrations to decide which approach to take from here.

@stale

stale bot commented Jun 5, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

@stale stale bot added the stale label Jun 5, 2019
@jeffwidman
Contributor Author

Closing in favor of #3957, which offers the same functionality but has better error handling, cleaner separation between the legacy and new code paths, and also uses callbacks to do all the I/O in parallel for much faster querying.

@jeffwidman jeffwidman closed this Aug 17, 2019
ofek pushed a commit that referenced this pull request Oct 1, 2019
…_unlisted_consumer_groups` (#3957)

* Add Kafka support for monitor_unlisted_consumer_groups

This is a major refactor of `kafka_consumer` check with a new code path
that will exist alongside the legacy code path. The new code path:
1. Eliminates legacy cruft:
   * drops support for fetching consumer offsets from Zookeeper
   * drops support for brokers < 0.10.2, which do not support fetching
     all known offsets for a consumer group
   * migrates from the hand-rolled group-coordinator lookup code,
     retries, etc. to relying on `kafka-python`'s `KafkaAdminClient` to
     handle all that under the covers
2. Adds support for `monitor_unlisted_consumer_groups` to fetch consumer
   offsets from Kafka
3. Refactors from sync to async using callbacks for improved
   performance, especially with larger clusters and more consumer groups.

To clarify: Fetching consumer offsets from older brokers and from
zookeeper is still supported, it just uses the legacy code path instead
of this new code path.
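
The async refactor in point 3 amounts to firing all offset-fetch requests up front and collecting results via callbacks rather than blocking per group. A rough sketch of that shape using stdlib futures (the check itself uses kafka-python's own future/callback machinery, not `concurrent.futures`, and `fetch_one` here stands in for a per-group offset fetch):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all_offsets(groups, fetch_one):
    """Fan out fetch_one(group) across groups; gather results via callbacks."""
    results = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {group: pool.submit(fetch_one, group) for group in groups}
        for group, fut in futures.items():
            # Callback fires as soon as each fetch completes, so slow
            # groups don't serialize the whole collection.
            fut.add_done_callback(
                lambda f, g=group: results.__setitem__(g, f.result())
            )
    # Exiting the context manager waits for all futures (and callbacks).
    return results
```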

This has been under discussion in various forms for over two years, see
#423 and
#2730. The main blocker
was coming up with a clean way to support legacy features like fetching
from Zookeeper / older brokers while simultaneously migrating to
`kafka-python`'s new `KafkaAdminClient` which only supports
brokers >= `0.10.0`. After several rounds of discussion, the decision
was to move the legacy code into a separate file and start from scratch
with a new implementation of the check which would be the primary one
for new features moving forward.

* 1.4.7 was released

* Update agent_requirements.in

* for now..