Skip to content
This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

Add fix for re-registration when Consul K8s rolling restarts agent nodes #227

Merged
merged 3 commits into from
Jun 16, 2022

Conversation

andrewstucki
Copy link
Contributor

@andrewstucki andrewstucki commented Jun 16, 2022

Changes proposed in this PR:

Currently K8s makes agent nodes ephemeral, when an agent node is upgraded it persists no local state because we don't mount any sort of persistent storage to the agent node containers. This means that we need to properly re-register our services when the agents restart. We currently have some re-registration logic within our registration code when it's detected that a service is non-existent, but it looks like our error casting code was broken due to having an extra pointer.

Unfortunately, this code path is quite tricky to hit without standing up an additional Consul cluster, and if we want to actually test the reproduction case, it'd require orchestrating an agent node flapping while dropping all of it's state. So, I figure manual testing might be better -- we have a mock Consul server that's able to unit test all the other code paths already.

How I've tested this PR:

In the process of ensuring that the deployments get re-registered when a rolling restart happens by performing one manually.

Validated via the learn tutorial with a custom image build and triggering a restart of the consul agent nodes. Here are the logs from the deployment (notice the successfully registered agent service message third from the bottom):

{"timestamp":"2022-06-16 13:54:59.229","thread":"19","level":"warning","name":"config","source":"./source/common/config/grpc_stream.h:160","message":"DeltaAggregatedResources gRPC config stream closed: 13, "}
{"@caller":"/go/pkg/mod/github.com/hashicorp/consul/api@v1.12.1-0.20220111183205-dcf1cd485363/watch/plan.go:95","@level":"error","@message":"Watch errored","@module":"consul-api-gateway-exec.cert-manager.watch","@timestamp":"2022-06-16T13:54:59.232953Z","error":"Unexpected response code: 500 (rpc error making call: EOF)","retry":5000000000,"type":"connect_roots"}
{"@caller":"/go/pkg/mod/github.com/hashicorp/consul/api@v1.12.1-0.20220111183205-dcf1cd485363/watch/plan.go:95","@level":"error","@message":"Watch errored","@module":"consul-api-gateway-exec.cert-manager.watch","@timestamp":"2022-06-16T13:55:00.240727Z","error":"Get \"https://172.18.0.2:8501/v1/agent/connect/ca/leaf/example-gateway\": dial tcp 172.18.0.2:8501: connect: connection refused","retry":5000000000,"type":"connect_leaf"}
{"@caller":"/build/internal/consul/registration.go:110","@level":"error","@message":"error fetching service","@module":"consul-api-gateway-exec.service-registry","@timestamp":"2022-06-16T13:55:02.724288Z","error":"Get \"https://172.18.0.2:8501/v1/agent/service/9eb6fa1e-43ee-474c-b55f-fc81e794ce81\": dial tcp 172.18.0.2:8501: connect: connection refused"}
{"@caller":"/go/pkg/mod/github.com/hashicorp/consul/api@v1.12.1-0.20220111183205-dcf1cd485363/watch/plan.go:95","@level":"error","@message":"Watch errored","@module":"consul-api-gateway-exec.cert-manager.watch","@timestamp":"2022-06-16T13:55:04.234505Z","error":"Get \"https://172.18.0.2:8501/v1/agent/connect/ca/roots\": dial tcp 172.18.0.2:8501: connect: connection refused","retry":20000000000,"type":"connect_roots"}
{"@caller":"/go/pkg/mod/github.com/hashicorp/consul/api@v1.12.1-0.20220111183205-dcf1cd485363/watch/plan.go:95","@level":"error","@message":"Watch errored","@module":"consul-api-gateway-exec.cert-manager.watch","@timestamp":"2022-06-16T13:55:05.242260Z","error":"Get \"https://172.18.0.2:8501/v1/agent/connect/ca/leaf/example-gateway\": dial tcp 172.18.0.2:8501: connect: connection refused","retry":20000000000,"type":"connect_leaf"}
{"@caller":"/build/internal/consul/registration.go:103","@level":"info","@message":"successfully registered agent service","@module":"consul-api-gateway-exec.service-registry","@timestamp":"2022-06-16T13:55:32.787988Z"}
{"timestamp":"2022-06-16 13:55:32.989","thread":"19","level":"info","name":"upstream","source":"source/common/upstream/cds_api_helper.cc:30","message":"cds: add 0 cluster(s), remove 0 cluster(s)"}
{"timestamp":"2022-06-16 13:55:32.990","thread":"19","level":"info","name":"upstream","source":"source/common/upstream/cds_api_helper.cc:67","message":"cds: added/updated 0 cluster(s), skipped 0 unmodified cluster(s)"}

Checklist:

  • CHANGELOG entry added

    Run make changelog-entry for guidance in authoring a changelog entry, and
    commit the resulting file, which should have a name matching your PR number.
    Entries should use imperative present tense (e.g. Add support for...)

@andrewstucki andrewstucki requested a review from a team June 16, 2022 13:59
Copy link
Member

@sarahalsmiller sarahalsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@andrewstucki andrewstucki merged commit 351cd58 into main Jun 16, 2022
@andrewstucki andrewstucki deleted the rolling-restart-fix branch June 16, 2022 14:30
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants