Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add livez and readyz for etcd #16651

Merged
merged 1 commit into from
Oct 17, 2023
Merged

Conversation

siyuanfoundation
Copy link
Contributor

@siyuanfoundation siyuanfoundation commented Sep 25, 2023

#16007

This is a prototype for adding livez/readyz support to etcd (design doc).
This pr is setting up the general structure for livez/ready checks, with only 2 simple checks implemented.

@siyuanfoundation siyuanfoundation marked this pull request as draft September 25, 2023 23:02
@siyuanfoundation
Copy link
Contributor Author

siyuanfoundation commented Sep 25, 2023

Tested with local server
curl '127.0.0.1:2379/readyz?verbose'
output:

[+]ping ok
[+]serializable_read ok
[+]data_corruption ok
[+]defrag_active ok
readyz check passed

curl '127.0.0.1:2379/livez?verbose'
output:

[+]ping ok
[+]serializable_read ok
livez check passed

@tjungblu
Copy link
Contributor

Cool stuff and thanks for the PR, I was wondering why you're adding this to the GRPC API? We already have a health handler on the http server: https://github.com/etcd-io/etcd/blob/main/server/embed/etcd.go#L746

Just FYI, we had huge issues in the past with running etcdctl commands as a health probes in openshift with regards to zombie processes. We're much happier with just letting kubelet hit an http endpoint instead.

server/etcdserver/healthz_test.go Outdated Show resolved Hide resolved
server/etcdserver/healthz.go Outdated Show resolved Hide resolved
@chaochn47
Copy link
Member

chaochn47 commented Sep 26, 2023

We already have a health handler on the http server: https://github.com/etcd-io/etcd/blob/main/server/embed/etcd.go#L746

I think we can start from http endpoint. I don’t think backporting gRPC changes to release-3.5 is feasible and accepted. It will literally change the customer facing API even if it is just maintenance service.

@siyuanfoundation
Copy link
Contributor Author

We already have a health handler on the http server: https://github.com/etcd-io/etcd/blob/main/server/embed/etcd.go#L746

I think we can start from http endpoint. I don’t think backporting gRPC changes to release-3.5 is feasible and accepted. It will literally change the customer facing API even if it is just maintenance service.

Removed gPRC change.

server/embed/etcd.go Outdated Show resolved Hide resolved
@siyuanfoundation siyuanfoundation marked this pull request as ready for review September 27, 2023 21:10
@wenjiaswe
Copy link
Contributor

@siyuanfoundation siyuanfoundation marked this pull request as draft September 28, 2023 15:34
server/etcdserver/api/v3health/healthz.go Outdated Show resolved Hide resolved
server/etcdserver/api/v3health/healthz.go Outdated Show resolved Hide resolved
server/etcdserver/api/v3health/healthz.go Outdated Show resolved Hide resolved
server/etcdserver/api/v3health/healthz.go Outdated Show resolved Hide resolved
server/etcdserver/api/v3health/healthz.go Outdated Show resolved Hide resolved
server/etcdserver/api/v3health/healthz.go Outdated Show resolved Hide resolved
server/etcdserver/api/v3health/healthz.go Outdated Show resolved Hide resolved
Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Nice work!

server/etcdserver/api/etcdhttp/health.go Show resolved Hide resolved
@ahrtr
Copy link
Member

ahrtr commented Oct 13, 2023

Please rebase this PR and resolve the left comment, thx

Comment on lines +293 to +296
if _, found := r.URL.Query()["verbose"]; found {
fmt.Fprint(w, h.Reason)
}
fmt.Fprint(w, "ok\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to implement verbose for per check endpoint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per check endpoint still print the detailed error message in verbose, and no details in non verbose. In theory, the user can get all the details from the root path verbose. But I think it still makes sense to follow the same paradigm.

@siyuanfoundation
Copy link
Contributor Author

@ahrtr Rebased the PR. Please review. Thanks!

Copy link

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
Add two separate probes, one for liveness and one for readiness. The liveness probe would check that the local individual node is up and running, or else restart the node, while the readiness probe would check that the cluster is ready to serve traffic. This would make etcd health-check fully Kubernetes API complient.

Signed-off-by: Siyuan Zhang <sizhang@google.com>
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for your first contribution, and great work!

Please also add a followup item to update the doc in etcd-io/website.

serathius
serathius previously approved these changes Oct 17, 2023
@serathius serathius dismissed their stale review October 17, 2023 09:39

Question about serializable read

@serathius
Copy link
Member

Please note that this PR doesn't implement the full design as it was presented. Livez works correct, however we are missing checks for readyz making it not very useful. Please see 80ab2ad

I think this is ok to merge, however until we finish implementing readyz, we shouldn't backport nor document the new endpoints.

@serathius serathius merged commit cba514e into etcd-io:main Oct 17, 2023
27 checks passed
@siyuanfoundation siyuanfoundation deleted the livez-pr branch October 17, 2023 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

8 participants