multiple k8s-registries lead to NSE flapping #6779
Made a quick test where I modified the `Register` function of `dialNSEClient` so that it only calls `Dial` if there's no `ClientConn` or its state is not `Ready`. The register reshuffles seem to have gone away with it.

```go
func (c *dialNSEClient) Register(ctx context.Context, in *registry.NetworkServiceEndpoint, opts ...grpc.CallOption) (*registry.NetworkServiceEndpoint, error) {
	closeContextFunc := postpone.ContextWithValues(ctx)
	// If no clientURL, we have no work to do
	// call the next in the chain
	clientURL := clienturlctx.ClientURL(ctx)
	if clientURL == nil {
		return next.NetworkServiceEndpointRegistryClient(ctx).Register(ctx, in, opts...)
	}
	cc, _ := clientconn.LoadOrStore(ctx, newDialer(c.chainCtx, c.dialTimeout, c.dialOptions...))
	// If there's an existing grpc.ClientConnInterface and it's not ours, call the next in the chain
	di, ok := cc.(*dialer)
	if !ok {
		return next.NetworkServiceEndpointRegistryClient(ctx).Register(ctx, in, opts...)
	}
	// If our existing dialer has a different URL, close down the chain
	if di.clientURL != nil && di.clientURL.String() != clientURL.String() {
		closeCtx, closeCancel := closeContextFunc()
		defer closeCancel()
		err := di.Dial(closeCtx, di.clientURL)
		if err != nil {
			log.FromContext(ctx).Errorf("can not redial to %v, err %v. Deleting clientconn...", grpcutils.URLToTarget(di.clientURL), err)
			clientconn.Delete(ctx)
			return nil, err
		}
		_, _ = next.NetworkServiceEndpointRegistryClient(ctx).Unregister(clienturlctx.WithClientURL(closeCtx, di.clientURL), in, opts...)
	}
	if di.ClientConn == nil || di.ClientConn.GetState() != connectivity.Ready { // XXX: maybe also check grpc.WithBlock() option along with ready state
		err := di.Dial(ctx, clientURL)
		if err != nil {
			log.FromContext(ctx).Errorf("can not dial to %v, err %v. Deleting clientconn...", grpcutils.URLToTarget(clientURL), err)
			clientconn.Delete(ctx)
			return nil, err
		}
	}
	conn, err := next.NetworkServiceEndpointRegistryClient(ctx).Register(ctx, in, opts...)
	if err != nil {
		_ = di.Close()
		return nil, err
	}
	return conn, nil
}
```
/cc @edwarnicke, @glazychev-art, @NikitaSkrynnik
@zolug Hello! An endpoint can go to different registries. We don't cache connections because their TLS certs can expire and gRPC doesn't update them automatically. If a registry doesn't get a re-registration for a particular NSE and the NSE expires, the registry deletes this NSE from memory and also deletes all associated Custom Resources. At the moment we don't consider scenarios with multiple registries, so there might be a lot of instabilities.
We discussed the problem a bit with Ed. Notes from the discussion:
@zolug At this moment I wouldn't recommend using a multi-registry setup.
Hi! I see, thanks for the information. We will change our setup accordingly. |
@zolug The problem is solved. Multiple k8s registries can now be used with an NSM setup. Note: the memory registry cannot be scaled, because its storage is in-memory.
@denis-tingaikin Hi! Thanks a lot, I will take a look at the changes and will try to play around with it as well. |
When deploying NSM using the k8s-registry and running multiple registry instances, the NSE re-registration request for a particular NSE might end up on another registry over time. This causes the old registry to falsely assume the NSE has disappeared (no re-registration before the expiration time), so it removes the associated Custom Resource. This leads to xconnect disturbances.
It appears that the underlying TCP connections backing gRPC keep changing as well.
Apparently dialNSEClient re-dials the same clientURL even if the cached dialer of the NSE already has a connection (closing the old connection in the process). Is that intentional?
https://github.com/networkservicemesh/sdk/blob/b8209fc0b7b6a0be3eb709bb7323ea6f1bb56477/pkg/registry/common/dial/nse_client.go#L75
Changing the registry service to provide session affinity based on the client IPs seemingly fixes the problem:
But that might not be a lasting solution, as session affinity seemingly has a timeout value. (Maybe also depends on the mode kube-proxy was started in.)
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#clientipconfig-v1-core
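For reference, the workaround was along these lines (Service name, namespace, selector, and port are illustrative, not taken from the actual deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: registry-k8s        # illustrative name
  namespace: nsm-system     # illustrative namespace
spec:
  selector:
    app: registry           # illustrative selector
  ports:
    - port: 5002            # illustrative port
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # the default; affinity is dropped after this idle period
```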
I wonder what the recommendations are for providing resiliency and availability through k8s registries?
Reproduction:
- Deploy NSM with multiple (e.g. 2) k8s registries
- Start up multiple NSEs. For example, simply deploying the kernel NSE and scaling it up to 8-10 replicas should do it.
- Monitor the availability of NSE Custom Resources and their age
Note: Remote VLAN NSE is also affected (which bypasses NSMgr and directly contacts a registry).
NSM version: 1.4.0
Kind (2 workers + controller): kindest/node:v1.22.9@sha256:ad5b8404c4052781365a4e70bb7d17c5331e4177bd4a7cd214339316cd6193b6