Description
Adjust Max Backoff Delay to cap the gRPC reconnect interval, allowing clients to reconnect faster to the NSP and IPAM service backends.
With default settings, gRPC might wait up to 2 minutes before attempting to reconnect. The new default value is 5 seconds.
Clients can now reconnect within ~5 seconds even if the server was unavailable for minutes.
(Refer to: https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md)
Max Backoff Delay is not currently exposed through the Operator as a common environment variable; instead, it can be set individually in the templates.
Possible way to verify the improvement:
- Deploy Meridio (trench, conduit, etc.).
- Deploy example-target with 1 replica.
- Start tcpdump in the Target POD, filtering for port 7778 (NSP).
- Provoke NSP unavailability for more than 2 minutes: once all the PODs have reached readiness and the Target has opened the stream, edit the NSP POD's config (e.g. kubectl edit pod nsp-trench-a-0) and change the app label so that it no longer matches the related k8s service. Verify that no endpoints are linked to the service (kubectl get endpoints nsp-service-trench-a). Then kill the nsp process in the NSP container (kubectl exec nsp-trench-a-0 -- kill -2 1). Wait 2 minutes, then restore the app label by re-editing the NSP POD. Verify the service endpoints.
- TAPA: tcpdump should show reconnect attempts, i.e. TCP SYNs, at intervals of at most 5-6 seconds. Once the NSP endpoint reappears, TAPA should establish the NSP connection within 5-6 seconds.
- Other Meridio components that reflect NSP connectivity in their readiness probes should also recover within 5-6 seconds.
Issue link
NA
Checklist