Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should HC activation be delayed until needed secrets are available? #12389

Closed
rgs1 opened this issue Jul 30, 2020 · 3 comments · Fixed by #13516
Closed

Should HC activation be delayed until needed secrets are available? #12389

rgs1 opened this issue Jul 30, 2020 · 3 comments · Fixed by #13516

Comments

@rgs1
Copy link
Member

rgs1 commented Jul 30, 2020

I did a cursory walk through the code in charge of activating HCs and I don't think it waits for secrets to be ready.

Furthermore, the initial_fetch_timeout broadly refers to the config gRPC stream becoming active (not necessarily to the actual payloads being received):

https://github.com/envoyproxy/envoy/blob/master/docs/root/intro/arch_overview/operations/init.rst

So for clusters where we don't set an initial_jitter we hit upstream_context_secrets_not_ready during hot restarts and we also see failed upstream health checks. I didn't test this for cold starts -- though it might be an issue there too.

I think we should either:

a) not activate HCs until the needed secrets are available

or

b) document the async nature of SDS and how setting initial_jitter might be necessary to workaround a race between the first HCs and secrets becoming available (we could add this to https://github.com/envoyproxy/envoy/blob/master/docs/root/intro/arch_overview/operations/init.rst
)

Thoughts?

cc: @mattklein123 @fishcakez

@mattklein123
Copy link
Member

Yeah I think we should fix this somehow. cc @snowp for thoughts.

@mattklein123
Copy link
Member

(Sorry I think we should do (a) somehow)

@snowp
Copy link
Contributor

snowp commented Jul 31, 2020

Yeah I agree, we should be sequencing the cluster warming so that we don't set/start the health checker for a cluster until its secrets have warmed.

mpuncel added a commit to mpuncel/envoy that referenced this issue Apr 29, 2021
This commit is a revival of
envoyproxy#13516 which aimed to solve
envoyproxy#12389. That PR was merged and
then reverted due to a deadlock bug, which is still present in this
commit.

The fix will be applied in a subsequent commit in order to make the
resolution clearer to PR reviewers.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment