add gRPC health check #8397
Conversation
I think this should be part of the gRPC server settings rather than an extension, to allow load balancers to properly observe whether the listener is ready.
Codecov report: all modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main    #8397   +/-  ##
======================================
  Coverage   90.47%   90.47%
======================================
  Files         303      303
  Lines       15950    15955    +5
======================================
+ Hits        14431    14436    +5
  Misses       1229     1229
  Partials      290      290
The change looks good. But I'm curious whether the healthcheck extension could be used here instead. With the work to have status reporting per component, the healthcheck extension could be modified to report the health of components, and the load balancer could point to that extension's endpoint. Would that address the same use case?
I'm also confused about the need for it to be on the same server as the traffic.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
imho, from the load balancer's perspective it is always better to health check the traffic port directly rather than another port that may or may not reflect the state of the underlying listener, especially if we are exposing multiple otel collector ports (e.g. otlp/grpc + otlp/http).
What is blocking this change? I don't see a way to make AWS ALB health checks work without it.
What is preventing the merge if all checks have passed and the changes are approved?
How can I help prioritize merging this?
Suren, I can't set up an AWS ALB for an OTel Collector gRPC endpoint without this feature. If you can merge this, I will greatly appreciate it. We are trying to standardize on OpenTelemetry. I have some other, less desirable solutions that bypass OpenTelemetry, but that would defeat the whole purpose.
Suren, I may be confused about the merging process, but it seems two reviewers with write access have approved. Can the PR be merged now? Can you push the button? Thanks!
@codeboten is your concern a blocker?
Can we please get the issue approved and merged?
The healthcheck extension as-is would not work: AWS load balancers require the health check to be a gRPC response, not an HTTP one, matching the protocol being served, and the expectation is that it should be on the same port as the traffic. You can override the port, but even so, it needs to be a gRPC response.
We also implemented receiver-specific health checks; see for example splunk_hec, which exposes a health check for specific HEC clients.
This PR now has approvals from three reviewers, one with a question that is not necessarily blocking, in my opinion. Is that sufficient to move this forward? @SurenNihalani, are you able to merge now? Hope everyone has a good holiday!
Thanks for your response @pellepelle3, I missed the notification of it. To be clear, the requirement here is to support a gRPC-specific response, regardless of the port it is on, correct?

This was discussed at this week's SIG (Dec 20). My opinion was that the best way to implement this would be to support a gRPC response in the healthcheck extension. This would spare users from having to check multiple ports for each component configured in a collector. Using the healthcheck extension would also provide a single source of truth; I would find it confusing if the healthcheck extension returned one status code and individual components returned different ones.

If this is acceptable as a solution, then I would suggest adding this as a feature of the healthcheck extension that @mwear has been rewriting (to leverage component status reporting).

One question that @kentquirk brought up in the SIG meeting was whether the port actually needs to be the same. In the case of the ALB mentioned in this PR, it is definitely possible to configure alternative ports (this is often done to avoid making health checks visible externally for services).
@codeboten I'm not familiar with the health check extension internals, but I wonder how it can reflect the healthiness of multiple receivers.
That's a great question. I think there should be a mechanism in the healthcheck extension to differentiate between the health of different pipelines. There could also be an aggregate status for the overall health of the collector, but in more specific use cases, like the one you mention, it would make sense to have a separate path (or port, but I feel like that would get confusing fast) for each pipeline.
@bpalermo, @codeboten, interesting discussion. I am planning to have different AWS ALB endpoints serve the http and grpc endpoints of an otel collector. The http endpoint is the only one working for me currently. Here is a sample of the configuration annotations that I lifted from our Thanos grpc endpoint for AWS ALB, and I am hoping the otel collector can deliver the same working mechanism: default = {
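For reference, ALB gRPC health checks for a Kubernetes service are typically configured with AWS Load Balancer Controller annotations along these lines. This is an illustrative sketch, not the poster's elided Thanos config; the values are assumptions:

```yaml
# Illustrative AWS Load Balancer Controller annotations (values are assumptions)
alb.ingress.kubernetes.io/backend-protocol-version: GRPC
alb.ingress.kubernetes.io/healthcheck-path: /grpc.health.v1.Health/Check
# For gRPC targets, success codes are gRPC status codes; 0 = OK
alb.ingress.kubernetes.io/success-codes: "0"
```

With `backend-protocol-version: GRPC`, the ALB expects the health check response itself to be gRPC, which is why a plain HTTP healthcheck endpoint does not satisfy it.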
If I may, I would be in favor of having a health check on the same port as each receiver, at least to start with, and maybe get fancier down the road.
One possibility could be to leverage the health check extension for this use case.
@codeboten, following up on your write-up about the SIG meeting and adding the feature to the healthcheck extension: is there an issue tracking the feature?
Hi @gjshao44. Here is the tracking issue: open-telemetry/opentelemetry-collector-contrib#26661. I am actively working on the new version of the healthcheck extension that is based on component status reporting. It has an implementation of the grpc.health.v1 service (https://github.com/grpc/grpc-proto/blob/master/grpc/health/v1/health.proto) as an option. I expect to have a PR up sometime next week.
Hello @mwear, thanks for working on this. Waiting for your commit; when can we expect the changes to be part of a release?
I have a PR open for the health check extension here: open-telemetry/opentelemetry-collector-contrib#30673. We will also need some version of #8684 or #8788 to provide the extension with status events for exporters. I'll revisit those PRs shortly. I'd appreciate any feedback on any of the PRs I've mentioned.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Any updates on this? I see the PR is closed, but multiple people are affected by this issue.
I am attempting to set up a centralized otel collector that serves via an ALB and gRPC. I find that my target group health checks are failing, and it seems this topic might be relevant to my goals. May I inquire about the status of this PR/issue?
I've written an alternative health check extension that supports both HTTP and gRPC health checks. It's in the process of being reviewed. The full version is here: open-telemetry/opentelemetry-collector-contrib#30673. I am slicing it up into smaller PRs to facilitate review. The current chunk is here: open-telemetry/opentelemetry-collector-contrib#33528. Based on this user's experience, open-telemetry/opentelemetry-collector-contrib#30673 (comment), I think it will work for your use case when the extension becomes available.
@mwear why is the healthcheck extension a replacement for this PR? The extension will run on a different port, which is very different from what was asked for in the original ticket.
@yurishkuro the healthcheck extension provides a gRPC health server that is based on component status reporting. Component status reporting, while still a work in progress, allows us to derive collector health from the health of the individual components. As you point out, it does not run on the same port. The need for the health check to be on the same port was debated in this PR, and it wasn't clear whether it was a hard requirement. I know that a user successfully used the extension behind an AWS ALB, which is one of the use cases discussed; see open-telemetry/opentelemetry-collector-contrib#30673 (comment).
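As a rough sketch of how the reworked extension is expected to be wired up in a collector config: the exact keys below are assumptions on my part; consult the extension's README and the linked PRs for the authoritative schema:

```yaml
extensions:
  health_check:
    # Hypothetical gRPC sub-config serving grpc.health.v1.Health
    # on a dedicated port, separate from the traffic ports.
    grpc:
      endpoint: 0.0.0.0:13132
service:
  extensions: [health_check]
```

The key design trade-off debated above: this serves health on its own port, derived from aggregated component status, rather than on the receiver's traffic port itself.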
Description: added an option to the gRPC server for serving the standard health check service grpc.health.v1.Health
Link to tracking issue: #3040
Testing: added a test to check that the gRPC service was registered
Documentation: added an entry in the readme for the healthcheck key
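For illustration, here is a sketch of how the proposed healthcheck key might have looked in a collector config had this PR merged. The key name comes from the PR description; its placement and value here are assumptions:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        # Hypothetical option from this (unmerged) PR: serve
        # grpc.health.v1.Health on the same port as the traffic.
        healthcheck: true
```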