NSP/IPAM reconnect improvements #430

zolug · 2023-06-14T08:27:48Z

Description

Adjust the gRPC Max Backoff Delay to limit the reconnect interval. This allows clients to reconnect faster to the NSP and IPAM service backends.
With default settings, gRPC might wait up to 2 minutes before attempting to reconnect. The new default value is 5 seconds.
Clients can now reconnect within ~5 seconds even if the server was unavailable for minutes.
(Refer to: https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md)

Max Backoff Delay is not currently exposed through the Operator via a common env variable; instead, it can be set individually in each template.

Possible way to verify the improvement:
- Deploy Meridio (trench, conduit, etc.).
- Deploy example-target with 1 replica.
- Start tcpdump filtering for port 7778 (NSP) in the Target POD.
- Provoke NSP unavailability for >2 minutes:
  - Wait until all the PODs have reached readiness and the Target has opened the stream.
  - Edit the NSP POD's config (e.g. kubectl edit pod nsp-trench-a-0) and change the label app so that it no longer matches the related k8s service.
  - Verify that no endpoints are linked to the service (kubectl get endpoints nsp-service-trench-a).
  - Kill the nsp process in the NSP container (kubectl exec nsp-trench-a-0 -- kill -2 1).
  - Wait 2 minutes, then restore the label app by re-editing the NSP POD. Verify the service endpoints.
- TAPA: tcpdump should show reconnect attempts, i.e. TCP SYNs, at intervals of at most 5-6 seconds. Once the NSP endpoint reappears, TAPA should establish the NSP connection within 5-6 seconds.
- Other Meridio components that reflect NSP connectivity in their readiness probes should recover within 5-6 seconds as well.

Issue link

NA

Checklist

Purpose
- Bug fix
- New functionality
- Documentation
- Refactoring
- CI
Test
- Unit test
- E2E Test
- Tested manually
Introduce a breaking change
- Yes (description required)
- No

zolug added 3 commits June 14, 2023 10:22

TAPA; fix NSP connection re-open loop

1e010fb

A timed-out open context forced the retry block (which is supposed to re-open the connection) to bail out, because the context was already cancelled.

FE; improve external connectivity announcement

377e93f

Make sure the FE connects to NSP as soon as possible.

zolug requested a review from LionelJouin June 14, 2023 08:27

zolug added priority/medium area/networking labels Jun 14, 2023

zolug self-assigned this Jun 14, 2023

LionelJouin approved these changes Jun 20, 2023

zolug merged commit fb62940 into master Jun 21, 2023

zolug deleted the ezollug-tapa branch June 21, 2023 07:47
