Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add gRPC health check #8397

Closed
wants to merge 3 commits into from

Conversation

seankhliao
Copy link
Contributor

Description: added an option to gRPC server for serving the standard health check service grpc.health.v1.Health

Link to tracking Issue: #3040

Testing: added test to check that the gRPC service was registered

Documentation: added entry in readme for healthcheck key

@seankhliao seankhliao requested review from a team and dmitryax September 12, 2023 19:44
@seankhliao
Copy link
Contributor Author

I think this should be part of the gRPC server settings rather than an extension to allow load balancers to properly observe if the listener is ready.

@codecov
Copy link

codecov bot commented Sep 12, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.47%. Comparing base (f571691) to head (e0dd4d4).
Report is 507 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8397   +/-   ##
=======================================
  Coverage   90.47%   90.47%           
=======================================
  Files         303      303           
  Lines       15950    15955    +5     
=======================================
+ Hits        14431    14436    +5     
  Misses       1229     1229           
  Partials      290      290           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@codeboten codeboten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be part of the gRPC server settings rather than an extension to allow load balancers to properly observe if the listener is ready.

The change looks good. But I'm curious if the healthcheck extension could be used here instead. With the work to have status reporting per component, the healthcheck extension could be modified to report the health of components and the load balancer could point to that extension's endpoint. Would that address the same use-case?

@bogdandrutu
Copy link
Member

Also confused about the need to be on the same server with the traffic.

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Sep 29, 2023
@bpalermo
Copy link

bpalermo commented Oct 11, 2023

imho from the load balancer perspective is always better to health check the traffic port directly instead of another that may or may not reflect the state of the underlying listener. Specially if we are exposing multiple otel collector ports (e.g. otlp/grpc + otlp/http).

@github-actions github-actions bot removed the Stale label Oct 12, 2023
@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 27, 2023
@gjshao44
Copy link

gjshao44 commented Nov 2, 2023

What is blocking this change? I don't see a way to make AWS ALB health check work without this change

@gjshao44
Copy link

gjshao44 commented Nov 9, 2023

what is preventing the merge if all checks have passed and changes approved?

@SurenNihalani
Copy link

How can I help prioritize merge this?

@gjshao44
Copy link

Suren, i can't setup an AWS ALB for OTEL Collector GRPC endpoint without this feature. If you can merge this, I will greatly appreciate it.We are trying to standardize on OpenTelemetry. I have some other less disirable solutions by bypassing OpenTelemetry, but that will defeat the whole purpose.

@gjshao44
Copy link

Suren, I may be confused about the merging process, seems we have two reviewers with write access approved, can the PR be merged now? Can you push the button? Thanks!

@yurishkuro
Copy link
Member

@codeboten is your concern a blocker?

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 28, 2023
@gjshao44
Copy link

Can we please get the issue approved and merged?

@github-actions github-actions bot removed the Stale label Nov 29, 2023
@pellepelle3
Copy link

I think this should be part of the gRPC server settings rather than an extension to allow load balancers to properly observe if the listener is ready.

The change looks good. But I'm curious if the healthcheck extension could be used here instead. With the work to have status reporting per component, the healthcheck extension could be modified to report the health of components and the load balancer could point to that extension's endpoint. Would that address the same use-case?

The healthcheck extension as is would not work as in aws load balancers require it to be GRPC response not a http, the same as what is being served, the expectation is that should be on the same port as the traffic. You can override the port but even so, it needs to be a GRPC response.

@atoulme
Copy link
Contributor

atoulme commented Dec 18, 2023

We also implemented receiver specific health checks - see for example splunk_hec, which exposes a health check for specific HEC clients.

@gjshao44
Copy link

This issue now has 3 reviewers approved, one with a question, not necessarily a blocking one in my opinion. Is that sufficient to get this issue move forward, @SurenNihalani, are you able to merge now? Hope everyone has a good holiday!

@codeboten
Copy link
Contributor

The healthcheck extension as is would not work as in aws load balancers require it to be GRPC response not a http, the same as what is being served, the expectation is that should be on the same port as the traffic. You can override the port but even so, it needs to be a GRPC response.

Thanks for your response @pellepelle3, I missed the notification of your response. To be clear the requirement here is to support a gRPC specific response, regardless of the port this is on correct?

This was discussed at this week's SIG (Dec-20), my opinion was that the best way to implement this would be to support a gRPC response in the healthcheck extension. This would prevent users from having to check multiple ports for each component they have configured of a collector.

Using the healthcheck extension would also provide a single source of truth, as I would find it confusing if the healthcheck extension was to return one status code and individual components could return different ones. If this is acceptable as a solution, then I would suggest adding this as a feature of the healthcheck extension that @mwear has been re-writing (to leverage component status reporting)

One question that @kentquirk brought up in the SIG meeting was whether there was a need for the port to be the same, in the case of the ALB mentioned in this PR, it is definitely possible to configure alternative ports (this is often used to avoid making healthcheck visible externally for services)

@bpalermo
Copy link

@codeboten I'm not familiar with the health check extension internals, but I wonder how it can reflect the healthiness of multiple receivers.
Let's say I have otlp/grpc (4317) + otlp/http (4318), and one of them becomes unhealthy. I assume the extension will reporting the service as unhealthy and the load balancer will take this endpoint out of serving, although the other port would still be able to server.
Is my understanding correct?

@codeboten
Copy link
Contributor

@codeboten I'm not familiar with the health check extension internals, but I wonder how it can reflect the healthiness of multiple receivers. Let's say I have otlp/grpc (4317) + otlp/http (4318), and one of them becomes unhealthy. I assume the extension will reporting the service as unhealthy and the load balancer will take this endpoint out of serving, although the other port would still be able to server. Is my understanding correct?

That's a great question. I think there should be a mechanism in the healthcheck extension to differentiate between the health of different pipelines. There could also be an aggregate status for the overall health of the collector, but in more specific use-cases, like the one you mention, it would make sense to have a separate path (or port, but i feel like this would get confusing fast) for each pipeline

@gjshao44
Copy link

@bpalermo, @codeboten, interesting discussion. I am planning to have a different AWS ALB endpoint to serve http and grpc endpoint for an otel collector. The http endpoint is the only one that is working for me currently . Here is a sample configuration annotations that I lifted from our Thanos grpc endpoint for AWS ALB, and I am hoping the otel collector can deliver the same working mechanism:

default = {
"kubernetes.io/ingress.class" : "alb",
"alb.ingress.kubernetes.io/scheme" : "internal",
"alb.ingress.kubernetes.io/target-type" : "ip",
"alb.ingress.kubernetes.io/backend-protocol-version" : "GRPC",
"alb.ingress.kubernetes.io/healthcheck-path" : "/grpc.health.v1.Health.Check",
"alb.ingress.kubernetes.io/healthcheck-protocol" : "HTTP",
"alb.ingress.kubernetes.io/success-codes" : "1-99",
"alb.ingress.kubernetes.io/backend-protocol" : "HTTP",
"alb.ingress.kubernetes.io/listen-ports" : "[{"HTTPS": 443}]",
}

@bpalermo
Copy link

(or port, but i feel like this would get confusing fast) for each pipeline
Totally agree!

If I may, I would be in favor of having a heath check in the same port of each receiver, at least to start with. And maybe get fancier down the road.

There could also be an aggregate status for the overall health of the collector

One possibility could be to leverage to health check extension for this use case.

Copy link
Contributor

github-actions bot commented Jan 6, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added Stale and removed Stale labels Jan 6, 2024
@gjshao44
Copy link

@codeboten, follow up on your write-up about the SIG meeting and adding feature to the healthcheck extension, is there an issue tracking the feature?

@mwear
Copy link
Member

mwear commented Jan 11, 2024

Hi @gjshao44. Here is the tracking issue: open-telemetry/opentelemetry-collector-contrib#26661. I am actively working on the new version of the healthcheck extension that is based on component status reporting. It has an implementation of the grpc.health.v1 service: https://github.com/grpc/grpc-proto/blob/master/grpc/health/v1/health.proto as an option. I expect to have a PR up sometime next week.

@vynu
Copy link

vynu commented Jan 18, 2024

Hello @mwear thanks for working on this ... waiting for your commit .. when can we expect the changes to be part of the release ?

@mwear
Copy link
Member

mwear commented Jan 19, 2024

I have a PR open for the health check extension here: open-telemetry/opentelemetry-collector-contrib#30673. We will also need some version of #8684 or #8788 to provide the extension with status events for exporters. I'll revisit those PRs shortly. I'd appreciate any feedback on any of the PRs I've mentioned.

Copy link
Contributor

github-actions bot commented Feb 4, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added Stale and removed Stale labels Feb 4, 2024
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Copy link
Contributor

github-actions bot commented Mar 9, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Mar 9, 2024
@TRAD-Anthony-CKO
Copy link

Any updates on this? I see the PR as closed but multiple people are suffering from this issue.

@tmegow
Copy link

tmegow commented Jun 24, 2024

I am attempting to setup a centralized otel collector which serves via an ALB and gRPC. I find that my target group health checks are failing and it seems this topic might be relevant to my goals.

May I inquire about the status of this PR/issue?

@mwear
Copy link
Member

mwear commented Jun 24, 2024

I've written an alternative health check extension that supports both HTTP and gRPC health checks. It's in the process of being reviewed. The full version is here: open-telemetry/opentelemetry-collector-contrib#30673. I am slicing it up into smaller PRs to facilitate review. The current chunk is here: open-telemetry/opentelemetry-collector-contrib#33528. Based on this users experience: open-telemetry/opentelemetry-collector-contrib#30673 (comment), I think it will work for your use case when the extension becomes available.

@yurishkuro
Copy link
Member

@mwear why is healthcheck extension a replacement for this PR? Extension will run on a different port, which is very different from what was asked in the original ticket.

@mwear
Copy link
Member

mwear commented Aug 1, 2024

@yurishkuro the healthcheck extension provides a grpc health server that is based on component status reporting. Component status reporting, while still a work in progress, allows us to derive collector health based on the health of the individual components. As you point out, it does not run on the same port. The need for the health check to be on the same port was debated in this PR and it wasn't clear if it was a hard requirement. I know that a user successfully used the extension on an AWS ALB, which is one of the use cases discussed, see: open-telemetry/opentelemetry-collector-contrib#30673 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.