add gRPC health check #8397
Conversation
I think this should be part of the gRPC server settings rather than an extension, to allow load balancers to properly observe whether the listener is ready.
Codecov report: all modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main    #8397   +/-  ##
======================================
  Coverage   90.47%   90.47%
======================================
  Files         303      303
  Lines       15950    15955    +5
======================================
+ Hits        14431    14436    +5
  Misses       1229     1229
  Partials      290      290
The change looks good. But I'm curious whether the healthcheck extension could be used here instead. With the work to have status reporting per component, the healthcheck extension could be modified to report the health of components, and the load balancer could point to that extension's endpoint. Would that address the same use case?
I'm also confused about the need for it to be on the same server as the traffic.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
imho, from the load balancer's perspective it is always better to health check the traffic port directly rather than another port that may or may not reflect the state of the underlying listener, especially if we are exposing multiple otel collector ports (e.g. otlp/grpc + otlp/http).
What is blocking this change? I don't see a way to make AWS ALB health checks work without it.
What is preventing the merge if all checks have passed and the changes are approved?
How can I help prioritize merging this?
Suren, I can't set up an AWS ALB for an OTel Collector gRPC endpoint without this feature. If you can merge this, I will greatly appreciate it. We are trying to standardize on OpenTelemetry. I have some other, less desirable solutions that bypass OpenTelemetry, but that would defeat the whole purpose.
Suren, I may be confused about the merging process, but it seems two reviewers with write access have approved. Can the PR be merged now? Can you push the button? Thanks!
@codeboten is your concern a blocker?
Can we please get the issue approved and merged?
The healthcheck extension as-is would not work: AWS load balancers require the health check to be a gRPC response, not an HTTP one, matching the protocol being served, and the expectation is that it should be on the same port as the traffic. You can override the port, but even so, it needs to be a gRPC response.
We also implemented receiver-specific health checks; see for example splunk_hec, which exposes a health check for specific HEC clients.
This PR now has approvals from three reviewers, one with a question that is not necessarily blocking, in my opinion. Is that sufficient to move this forward? @SurenNihalani, are you able to merge now? Hope everyone has a good holiday!
Thanks for your response @pellepelle3, I missed the notification of it. To be clear, the requirement here is to support a gRPC-specific response, regardless of the port it is on, correct?

This was discussed at this week's SIG (Dec 20). My opinion was that the best way to implement this would be to support a gRPC response in the healthcheck extension. This would spare users from having to check multiple ports for each component configured in a collector. Using the healthcheck extension would also provide a single source of truth; I would find it confusing if the healthcheck extension returned one status code and individual components returned different ones.

If this is acceptable as a solution, then I would suggest adding this as a feature of the healthcheck extension that @mwear has been rewriting (to leverage component status reporting).

One question that @kentquirk brought up in the SIG meeting was whether the port actually needs to be the same. In the case of the ALB mentioned in this PR, it is definitely possible to configure alternative ports (this is often done to avoid making health checks visible externally for services).
@codeboten I'm not familiar with the health check extension internals, but I wonder how it can reflect the healthiness of multiple receivers.
That's a great question. I think there should be a mechanism in the healthcheck extension to differentiate between the health of different pipelines. There could also be an aggregate status for the overall health of the collector, but in more specific use cases, like the one you mention, it would make sense to have a separate path (or port, but I feel like that would get confusing fast) for each pipeline.
@bpalermo, @codeboten, interesting discussion. I am planning to have different AWS ALB endpoints serve the http and grpc endpoints of an otel collector. The http endpoint is the only one working for me currently. Here is a sample of the configuration annotations that I lifted from our Thanos grpc endpoint for AWS ALB, and I am hoping the otel collector can deliver the same working mechanism: default = {
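For reference, ALB gRPC health checks for a Kubernetes service are typically configured with AWS Load Balancer Controller annotations along these lines. This is an illustrative sketch, not the poster's elided Thanos config; the values are assumptions:

```yaml
# Illustrative AWS Load Balancer Controller annotations (values are assumptions)
alb.ingress.kubernetes.io/backend-protocol-version: GRPC
alb.ingress.kubernetes.io/healthcheck-path: /grpc.health.v1.Health/Check
# For gRPC targets, success codes are gRPC status codes; 0 = OK
alb.ingress.kubernetes.io/success-codes: "0"
```

With `backend-protocol-version: GRPC`, the ALB expects the health check response itself to be gRPC, which is why a plain HTTP healthcheck endpoint does not satisfy it.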
If I may, I would be in favor of having a health check on the same port as each receiver, at least to start with, and maybe get fancier down the road.
One possibility could be to leverage the health check extension for this use case.
@codeboten, following up on your write-up about the SIG meeting and adding the feature to the healthcheck extension: is there an issue tracking the feature?
Hi @gjshao44. Here is the tracking issue: open-telemetry/opentelemetry-collector-contrib#26661. I am actively working on the new version of the healthcheck extension that is based on component status reporting. It has an implementation of the grpc.health.v1 service (https://github.com/grpc/grpc-proto/blob/master/grpc/health/v1/health.proto) as an option. I expect to have a PR up sometime next week.
Hello @mwear, thanks for working on this. Waiting for your commit; when can we expect the changes to be part of a release?
I have a PR open for the health check extension here: open-telemetry/opentelemetry-collector-contrib#30673. We will also need some version of #8684 or #8788 to provide the extension with status events for exporters. I'll revisit those PRs shortly. I'd appreciate any feedback on any of the PRs I've mentioned.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Any updates on this? I see the PR is closed, but multiple people are affected by this issue.
I am attempting to set up a centralized otel collector that serves via an ALB and gRPC. I find that my target group health checks are failing, and it seems this topic might be relevant to my goals. May I inquire about the status of this PR/issue?
I've written an alternative health check extension that supports both HTTP and gRPC health checks. It's in the process of being reviewed. The full version is here: open-telemetry/opentelemetry-collector-contrib#30673. I am slicing it up into smaller PRs to facilitate review. The current chunk is here: open-telemetry/opentelemetry-collector-contrib#33528. Based on this user's experience, open-telemetry/opentelemetry-collector-contrib#30673 (comment), I think it will work for your use case when the extension becomes available.
@mwear why is the healthcheck extension a replacement for this PR? The extension will run on a different port, which is very different from what was asked for in the original ticket.
@yurishkuro the healthcheck extension provides a gRPC health server that is based on component status reporting. Component status reporting, while still a work in progress, allows us to derive collector health from the health of the individual components. As you point out, it does not run on the same port. The need for the health check to be on the same port was debated in this PR, and it wasn't clear whether it was a hard requirement. I know that a user successfully used the extension behind an AWS ALB, which is one of the use cases discussed; see open-telemetry/opentelemetry-collector-contrib#30673 (comment).
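As a rough sketch of how the reworked extension is expected to be wired up in a collector config: the exact keys below are assumptions on my part; consult the extension's README and the linked PRs for the authoritative schema:

```yaml
extensions:
  health_check:
    # Hypothetical gRPC sub-config serving grpc.health.v1.Health
    # on a dedicated port, separate from the traffic ports.
    grpc:
      endpoint: 0.0.0.0:13132
service:
  extensions: [health_check]
```

The key design trade-off debated above: this serves health on its own port, derived from aggregated component status, rather than on the receiver's traffic port itself.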
Description: added an option to the gRPC server for serving the standard health check service grpc.health.v1.Health
Link to tracking issue: #3040
Testing: added a test to check that the gRPC service was registered
Documentation: added an entry in the readme for the healthcheck key
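For illustration, here is a sketch of how the proposed healthcheck key might have looked in a collector config had this PR merged. The key name comes from the PR description; its placement and value here are assumptions:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        # Hypothetical option from this (unmerged) PR: serve
        # grpc.health.v1.Health on the same port as the traffic.
        healthcheck: true
```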