fix: adding max retry count while trying to subscribe to a nms pod #66

Closed · wants to merge 2 commits
11 changes: 10 additions & 1 deletion proto/client/gClient.go
@@ -160,8 +160,9 @@ func (confClient *ConfigClient) subscribeToConfigPod(commChan chan *protos.Netwo
logger.GrpcLog.Infoln("subscribeToConfigPod ")
myid := os.Getenv("HOSTNAME")
var stream protos.ConfigService_NetworkSliceSubscribeClient
maxRetryCount := 10
for {
if stream == nil {
if stream == nil && maxRetryCount > 0 {
Contributor:
What happens if we fail? I thought unlimited retry is better, right?
What if webconsole does not come up within 50 seconds (10 retries × 5 s)?

Contributor Author (@gatici, Sep 27, 2024):

Hello,
The real problem here is that if SMF has a stale IP resolution for NMS (for example, after the NMS pod is restarted), the loop never exits, and without manually restarting the SMF pod, SMF cannot connect to Webui again.
Putting in a max retry count causes the SMF pod to restart, after which it picks up the correct gRPC server address.
This happens whether or not the gRPC server is up; the process repeats until the gRPC server is started.
That helps us recover 5G configuration synchronization.
Otherwise, we would need to detect every module that has trouble connecting to NMS and restart them manually.
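The failure mode described above can be sketched in isolation. This is a minimal, hypothetical Go sketch, not config5g code: `tryConnect` and `retryUntilExhausted` are made-up stand-ins for the real dial/subscribe attempt; the point is that a bounded retry loop eventually returns, the process exits, and the orchestrator (e.g. Kubernetes) restarts the pod, which re-resolves the NMS address on startup.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// tryConnect stands in for the real gRPC dial/subscribe attempt
// (hypothetical); here it always fails, simulating a stale NMS address.
func tryConnect() error {
	return errors.New("connection refused")
}

// retryUntilExhausted attempts tryConnect up to max times and returns the
// number of attempts made. When it returns without success, the caller is
// expected to exit so the orchestrator restarts the pod.
func retryUntilExhausted(max int) int {
	attempts := 0
	for max > 0 {
		attempts++
		if err := tryConnect(); err == nil {
			return attempts
		}
		max--
		time.Sleep(time.Millisecond) // 5 s in the real client
	}
	return attempts
}

func main() {
	n := retryUntilExhausted(10)
	fmt.Printf("gave up after %d attempts; exiting so the pod restarts\n", n)
}
```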

Contributor Author (@gatici, Sep 27, 2024):
Hello @thakurajayL, here are the full logs of NRF: https://pastebin.ubuntu.com/p/4cPQVvVdWv/

When NMS is restarted, NRF loses its connection to NMS. After 10 retries, NRF is restarted and the gRPC connection is recovered.

Contributor:

I understand what you are trying to solve. Could you point me to the code where SMF would restart after all MaxRetry attempts?

Contributor Author (@gatici, Sep 27, 2024):

In SMF, https://github.com/omec-project/smf/blob/master/factory/factory.go#L52 calls the PublishOnConfigChange function:

		roc := os.Getenv("MANAGED_BY_CONFIG_POD")
		if roc == "true" {
			gClient := ConnectToConfigServer(SmfConfig.Configuration.WebuiUri)
			commChannel := gClient.PublishOnConfigChange(false)
			go SmfConfig.updateConfig(commChannel)
		}

Then PublishOnConfigChange calls the subscribeToConfigPod function, which tries to connect to the gRPC server in an infinite loop. But this loop uses SmfConfig.Configuration.WebuiUri, which can resolve to a different IP address if the NMS pod restarts. The loop never exits, however, so manual intervention is required after every NMS pod restart.

func (confClient *ConfigClient) PublishOnConfigChange(mdataFlag bool) chan *protos.NetworkSliceResponse {
	confClient.MetadataRequested = mdataFlag
	commChan := make(chan *protos.NetworkSliceResponse)
	confClient.Channel = commChan
	go confClient.subscribeToConfigPod(commChan)
	return commChan
}

Contributor Author (@gatici):

As a result, it will restart on commChannel := gClient.PublishOnConfigChange(false).

Contributor:

I am completely lost. I understand the high-level reasoning of resolving the address again, but I am not able to connect it to the code the way you are explaining. Give me some time.

Contributor:

I guess we need to change the way the client is initialized.
Clubbing the below 2 lines into a single call will help in getting ... But I do not see how your change works, or why it works:

        gClient := client.ConnectToConfigServer(SmfConfig.Configuration.WebuiUri)
        commChannel := gClient.PublishOnConfigChange(false)
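The "club the two calls" idea can be sketched as follows. This is a hypothetical Go sketch, not the config5g API: `resolve` and `connectAndSubscribe` are made-up names, and the canned address list simulates DNS returning a new pod IP after an NMS restart. The point is that re-resolving the URI inside the retry loop means a stale address from an earlier resolution is never reused.

```go
package main

import (
	"errors"
	"fmt"
)

// resolved simulates successive DNS answers for the webui/NMS service name;
// the address changes when the NMS pod restarts (hypothetical data).
var resolved = []string{"10.0.0.5", "10.0.0.5", "10.0.0.9"}

// resolve stands in for DNS resolution on each attempt.
func resolve(call int) string { return resolved[call%len(resolved)] }

// connectAndSubscribe re-resolves the URI on every attempt, so a connection
// is only ever made to the most recently resolved address.
func connectAndSubscribe(uri string, maxRetry int) (string, error) {
	for i := 0; i < maxRetry; i++ {
		addr := resolve(i)
		if addr == "10.0.0.9" { // pretend only the restarted pod's IP accepts
			return addr, nil
		}
	}
	return "", errors.New("exhausted retries")
}

func main() {
	addr, err := connectAndSubscribe("webui-service:9876", 5)
	fmt.Println(addr, err)
}
```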

status := confClient.Conn.GetState()
var err error
if status == connectivity.Ready {
@@ -171,18 +172,26 @@ func (confClient *ConfigClient) subscribeToConfigPod(commChan chan *protos.Netwo
logger.GrpcLog.Errorf("Failed to subscribe: %v", err)
time.Sleep(time.Second * 5)
// Retry on failure
maxRetryCount--
continue
}
} else if status == connectivity.Idle {
logger.GrpcLog.Errorf("Connectivity status idle, trying to connect again")
time.Sleep(time.Second * 5)
maxRetryCount--
continue
} else {
logger.GrpcLog.Errorf("Connectivity status not ready")
time.Sleep(time.Second * 5)
maxRetryCount--
continue
}
}

if stream == nil {
break
}

rsp, err := stream.Recv()
Contributor:
First, could you please confirm that in your case SMF crashes at line 191 after 10 retries?
Second, I am not particularly in favor of restarting all network functions. Do you think we can just reinitialize the gRPC connection, reconnect, and get the configuration?
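The reinitialize-instead-of-restart alternative might look roughly like this. It is a hypothetical, self-contained Go sketch, not config5g code: `dial`, `subscribe`, and `subscribeLoop` are stand-ins, and `failures` simulates the stream breaking once when NMS restarts. Instead of exhausting retries and exiting, the loop throws away the broken client and re-dials with a freshly resolved address.

```go
package main

import (
	"errors"
	"fmt"
)

// client is a stand-in for the gRPC ClientConn/stream pair (hypothetical).
type client struct{ addr string }

func dial(addr string) (*client, error) {
	if addr == "" {
		return nil, errors.New("empty address")
	}
	return &client{addr: addr}, nil
}

// failures simulates the stream breaking once when the NMS pod restarts.
var failures = 1

func subscribe(c *client) error {
	if failures > 0 {
		failures--
		return errors.New("rpc error: Unavailable")
	}
	return nil
}

// subscribeLoop re-creates the client on failure instead of exiting the
// process; resolve is called on every attempt for a fresh address.
func subscribeLoop(resolve func() string) (int, error) {
	for attempt := 1; attempt <= 10; attempt++ {
		c, err := dial(resolve())
		if err != nil {
			continue
		}
		if err := subscribe(c); err == nil {
			return attempt, nil
		}
		// drop c and re-dial on the next iteration
	}
	return 0, errors.New("exhausted retries")
}

func main() {
	attempt, err := subscribeLoop(func() string { return "webui:9876" })
	fmt.Println(attempt, err)
}
```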

Contributor:

@gatici, I am seeing the same thing as @thakurajayL indicated here.
For me, SMF crashes after the webui pod gets restarted. Below is a snippet of the log/issue:

2024-09-28T02:18:44Z [ERRO][Config5g][GRPC] Failed to receive message: rpc error: code = Unavailable desc = error reading from server: EOF
2024-09-28T02:18:49Z [ERRO][Config5g][GRPC] Connectivity status idle, trying to connect again
github.com/omec-project/config5g/proto/client.(*ConfigClient).subscribeToConfigPod(0xc00057adc0, 0xc0000db3b0)
        /go/src/smf/vendor/github.com/omec-project/config5g/proto/client/gClient.go:191 +0x2d3 fp=0xc00088ffc0 sp=0xc00088fef0 pc=0xb5df33
github.com/omec-project/config5g/proto/client.(*ConfigClient).PublishOnConfigChange.gowrap1()
        /go/src/smf/vendor/github.com/omec-project/config5g/proto/client/gClient.go:79 +0x25 fp=0xc00088ffe0 sp=0xc00088ffc0 pc=0xb5d425
created by github.com/omec-project/config5g/proto/client.(*ConfigClient).PublishOnConfigChange in goroutine 1
        /go/src/smf/vendor/github.com/omec-project/config5g/proto/client/gClient.go:79 +0xa7

if err != nil {
logger.GrpcLog.Errorf("Failed to receive message: %v", err)