
fix: adding max retry count while trying to subscribe to a nms pod #66

Closed · wants to merge 2 commits

Conversation

gatici (Contributor) commented Sep 26, 2024

If the Webconsole is restarted, the other modules lose their gRPC connection to the Webconsole
and keep trying to reconnect for hours with no success.

2024-09-26T11:32:10.014Z [smf] 2024-09-26T11:32:10Z [ERRO][Config5g][GRPC] Connectivity status idle, trying to connect again.

We add a max retry count that breaks the loop once the retry limit is exceeded.
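For reviewers skimming the diff, here is a minimal, hedged sketch of the bounded-retry pattern this PR introduces. The function name subscribeWithRetry and the constants (10 retries, 5 seconds apart, taken from the discussion below) are assumptions for illustration, not the exact config5g code:

package main

import (
	"errors"
	"log"
	"time"
)

const (
	maxRetryCount = 10              // assumed retry budget discussed below
	retryInterval = 5 * time.Second // assumed delay between attempts
)

// subscribeWithRetry keeps calling subscribe until it succeeds or the
// retry budget is exhausted. Exiting instead of looping forever lets
// Kubernetes restart the pod, which re-resolves the webconsole address.
func subscribeWithRetry(subscribe func() error) error {
	for attempt := 1; attempt <= maxRetryCount; attempt++ {
		if err := subscribe(); err != nil {
			log.Printf("[Config5g][GRPC] attempt %d/%d failed: %v", attempt, maxRetryCount, err)
			time.Sleep(retryInterval)
			continue
		}
		return nil
	}
	return errors.New("max retry count exceeded while subscribing to the config pod")
}

func main() {
	// Stand-in for the real gRPC subscribe call.
	err := subscribeWithRetry(func() error { return errors.New("webconsole unreachable") })
	if err != nil {
		log.Fatal(err) // crash so the pod restarts and re-resolves DNS
	}
}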

chore: rebase from main

Signed-off-by: gatici <gulsum.atici@canonical.com>
for {
-	if stream == nil {
+	if stream == nil && maxRetryCount > 0 {
Contributor

What happens if we fail? I thought unlimited retries were better, right?
What if webconsole does not come up within 50 seconds (5 s × 10 retries)?

gatici (Contributor, Author) Sep 27, 2024

Hello,
The real problem here is that if SMF holds a stale IP resolution of NMS (because the NMS pod restarted), the loop never exits, and without manually restarting the SMF pod, SMF cannot connect to the Webui again.
Adding a max retry count causes the SMF pod to restart, after which it resolves the correct gRPC server address.
This happens whether or not the gRPC server is up; until the gRPC server starts, the process repeats.
That lets us recover 5G configuration synchronization.
Otherwise, we would need to detect every module that fails to connect to NMS and restart them manually.

gatici (Contributor, Author) Sep 27, 2024

Hello @thakurajayL, here are the full logs of NRF: https://pastebin.ubuntu.com/p/4cPQVvVdWv/

NMS is restarted, then NRF loses its connection to NMS. After 10 retries, NRF is restarted and the gRPC connection is recovered.

Contributor

I understand what you are trying to solve. Could you point me to the code where SMF would restart after all MaxRetry attempts?

gatici (Contributor, Author) Sep 27, 2024

In SMF, https://github.com/omec-project/smf/blob/master/factory/factory.go#L52 calls the PublishOnConfigChange function.

		roc := os.Getenv("MANAGED_BY_CONFIG_POD")
		if roc == "true" {
			gClient := ConnectToConfigServer(SmfConfig.Configuration.WebuiUri)
			commChannel := gClient.PublishOnConfigChange(false)
			go SmfConfig.updateConfig(commChannel)
		}

PublishOnConfigChange then calls the subscribeToConfigPod function, which tries to connect to the gRPC server in an infinite loop. This loop uses SmfConfig.Configuration.WebuiUri, which can resolve to a different IP address if the NMS pod restarts. The loop never exits, however, so manual intervention is required after every NMS pod restart.

func (confClient *ConfigClient) PublishOnConfigChange(mdataFlag bool) chan *protos.NetworkSliceResponse {
	confClient.MetadataRequested = mdataFlag
	commChan := make(chan *protos.NetworkSliceResponse)
	confClient.Channel = commChan
	go confClient.subscribeToConfigPod(commChan)
	return commChan
}
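To make the failure mode concrete, here is a simplified, hedged sketch of such an unbounded loop, illustrating the pattern behind the "Connectivity status idle" log lines above (the address "webui:9876" and the loop body are assumptions, not the vendored gClient.go code):

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The connection is dialed once from WebuiUri; whatever address it
	// resolved to is what the loop below keeps poking at.
	conn, err := grpc.Dial("webui:9876", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	for {
		if state := conn.GetState(); state != connectivity.Ready {
			// Matches the "Connectivity status idle, trying to connect
			// again" messages: the loop nudges the same ClientConn but
			// never re-dials, so on the premise described above a stale
			// address from a restarted NMS pod is never refreshed.
			log.Printf("[Config5g][GRPC] Connectivity status %v, trying to connect again", state)
			conn.Connect()
			time.Sleep(5 * time.Second)
			continue
		}
		// Connected: this is where the real client opens the config
		// stream and blocks on stream.Recv().
		time.Sleep(time.Second)
	}
}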

gatici (Contributor, Author)

As a result, it will restart at commChannel := gClient.PublishOnConfigChange(false).

Contributor

I am completely lost. I understand the high-level reasoning of resolving the address again, but I am not able to connect it to the code the way you are explaining. Give me some time.

Contributor

I guess we need to change the way the client is initialized.
Clubbing the two lines below into a single call would help in getting ... but I do not see how your change works, or why it works.

        gClient := client.ConnectToConfigServer(SmfConfig.Configuration.WebuiUri)
        commChannel := gClient.PublishOnConfigChange(false)
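For what it's worth, a hedged sketch of that clubbing; SubscribeToConfigServer is a hypothetical name, the nil check on ConnectToConfigServer's result is an assumption about its failure mode, and the snippet relies on the imports and functions quoted above:

// Hypothetical helper folding the two calls above into one, so a later
// reconnect path could re-run both the dial (and hence the address
// resolution) and the subscription together.
func SubscribeToConfigServer(webuiUri string) (chan *protos.NetworkSliceResponse, error) {
	gClient := ConnectToConfigServer(webuiUri)
	if gClient == nil {
		return nil, fmt.Errorf("could not connect to config server at %s", webuiUri)
	}
	return gClient.PublishOnConfigChange(false), nil
}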

Signed-off-by: gatici <gulsum.atici@canonical.com>
gatici requested a review from thakurajayL, September 27, 2024 15:08
thakurajayL (Contributor) commented Sep 28, 2024

continue
}
}

rsp, err := stream.Recv()

First, could you please confirm that in your case SMF crashes at line 191 after 10 retries?
Second, I am not really in favor of restarting all network functions. Do you think we can just reinitialize the gRPC connection, connect, and get the configuration?
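For reference, a minimal sketch of that reinitialize-and-reconnect idea (assumed names and dial options; this is not the eventual fix in #69):

package client

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// redial drops the connection bound to the possibly stale address and
// dials the webui URI again, forcing the resolver to run afresh, so the
// client can recover without restarting the whole network function.
func redial(old *grpc.ClientConn, webuiUri string) (*grpc.ClientConn, error) {
	if old != nil {
		_ = old.Close()
	}
	return grpc.Dial(webuiUri, grpc.WithTransportCredentials(insecure.NewCredentials()))
}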

Contributor

@gatici, I am seeing the same thing as @thakurajayL indicated here.
For me, the SMF crashes after the webui pod gets restarted. Below is a snippet of the log/issue:

2024-09-28T02:18:44Z [ERRO][Config5g][GRPC] Failed to receive message: rpc error: code = Unavailable desc = error reading from server: EOF
2024-09-28T02:18:49Z [ERRO][Config5g][GRPC] Connectivity status idle, trying to connect again
github.com/omec-project/config5g/proto/client.(*ConfigClient).subscribeToConfigPod(0xc00057adc0, 0xc0000db3b0)
        /go/src/smf/vendor/github.com/omec-project/config5g/proto/client/gClient.go:191 +0x2d3 fp=0xc00088ffc0 sp=0xc00088fef0 pc=0xb5df33
github.com/omec-project/config5g/proto/client.(*ConfigClient).PublishOnConfigChange.gowrap1()
        /go/src/smf/vendor/github.com/omec-project/config5g/proto/client/gClient.go:79 +0x25 fp=0xc00088ffe0 sp=0xc00088ffc0 pc=0xb5d425
created by github.com/omec-project/config5g/proto/client.(*ConfigClient).PublishOnConfigChange in goroutine 1
        /go/src/smf/vendor/github.com/omec-project/config5g/proto/client/gClient.go:79 +0xa7

gatici (Contributor, Author) commented Oct 2, 2024

The solution is provided in a different PR: #69

gatici closed this Oct 2, 2024