NSP/IPAM reconnect improvements #430

Merged
merged 3 commits into from
Jun 21, 2023

Conversation

zolug
Collaborator

@zolug zolug commented Jun 14, 2023

Description

Adjust the gRPC Max Backoff Delay to limit the reconnect interval. This allows clients to reconnect faster to the NSP and IPAM service backends.
With default settings, gRPC might wait up to 2 minutes between reconnect attempts. The new default value is 5 seconds.
Clients can now reconnect within ~5 seconds even if the server was unavailable for minutes.
(Refer to: https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md)

Max Backoff Delay is not currently exposed through the Operator via a common env variable; instead, it can be set individually in each template.

Possible way to verify the improvement:
- Deploy Meridio (trench, conduit etc.)
- Deploy example-target with 1 replica
- Start tcpdump filtering for port 7778 (NSP) in the Target POD
- Provoke NSP unavailability for >2 minutes:
When all the PODs have reached readiness and the Target has opened the stream, edit the NSP POD's config (e.g. kubectl edit pod nsp-trench-a-0) and change the label app so that it no longer matches the related k8s service. Verify that no endpoints are linked to the service (kubectl get endpoints nsp-service-trench-a). Then kill the nsp process in the NSP container (kubectl exec nsp-trench-a-0 -- kill -2 1). Wait 2 minutes, then fix the NSP POD's label app by re-editing the NSP POD. Verify the service endpoints.
- TAPA: tcpdump should show reconnect attempts, i.e. TCP SYNs, at intervals of at most 5-6 seconds. Once the NSP endpoint re-appears, TAPA should establish the NSP connection within 5-6 seconds.
- Other Meridio components that reflect NSP connectivity as part of their readiness probes should recover within 5-6 seconds as well.

Issue link

NA

Checklist

  • Purpose
    • Bug fix
    • New functionality
    • Documentation
    • Refactoring
    • CI
  • Test
    • Unit test
    • E2E Test
    • Tested manually
  • Introduce a breaking change
    • Yes (description required)
    • No

zolug added 3 commits June 14, 2023 10:22
A timed-out open context forced the retry block (supposed to
re-open the connection) to bail out due to the cancelled context.

Adjust max backoff delay to limit gRPC reconnect interval.
Allows faster client reconnect to NSP and IPAM service backends.
With default settings gRPC might wait up to 2 minutes to attempt
reconnect. New default value is 5 seconds.
(refer to: https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md)

Make sure FE is connected with NSP ASAP.
@zolug zolug merged commit fb62940 into master Jun 21, 2023
@zolug zolug deleted the ezollug-tapa branch June 21, 2023 07:47