NetworkManager-wait-online can fail on slower machines #32

Open
phillxnet opened this issue Oct 6, 2020 · 6 comments

@phillxnet
Member

On some low-power devices (e.g. Pi4 / Ten64) and slower or older x86_64 machines, the default NetworkManager wait-online service leaves insufficient time before 'declaring' to its dependents that no online state is available. This false negative on online status can cause dependents, e.g. KVM installs or Hashicorp Vault instances, to fail to start because their dependency on an online state was never satisfied.

The proposed fix is to increase the default wait setting for the NetworkManager-wait-online service.

The service derives its timeout setting from the following parameter:

## Type:        int
## Default:     30
#
# When using NetworkManager you may define a timeout to wait for NetworkManager
# to connect in NetworkManager-wait-online.service.  Other network services
# may require the system to have a valid network setup in order to succeed.
#
# This variable has no effect if NetworkManager is disabled.
#
NM_ONLINE_TIMEOUT="30"

in /etc/sysconfig/network/config.

Some experimentation has indicated that a setting of 45 seconds looks to resolve the observed failures.

@phillxnet
Member Author

The current setting can be retrieved via:

# grep "NM_ONLINE_TIMEOUT" /etc/sysconfig/network/config 
NM_ONLINE_TIMEOUT="30"

Assuming the default is as expected, the following will change that specific setting to the proposed 45 seconds:

sed -i 's/NM_ONLINE_TIMEOUT="30"/NM_ONLINE_TIMEOUT="45"/g' /etc/sysconfig/network/config
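The sed call above only matches when the current value is exactly "30". As a minimal sketch, assuming GNU sed and a sysconfig-style file with an uncommented `NM_ONLINE_TIMEOUT=` line, a small helper could set the value regardless of what is there now. The function name `set_nm_online_timeout` is hypothetical and not part of Rockstor or openSUSE:

```shell
#!/bin/sh
# Hypothetical helper: set NM_ONLINE_TIMEOUT whatever its current value is.
# Usage: set_nm_online_timeout SECONDS [FILE]
set_nm_online_timeout() {
    seconds=$1
    file=${2:-/etc/sysconfig/network/config}

    # Refuse anything that is not a plain non-negative integer.
    case $seconds in
        ''|*[!0-9]*) echo "timeout must be an integer" >&2; return 1 ;;
    esac

    # Fail loudly if the key is missing rather than silently doing nothing.
    grep -q '^NM_ONLINE_TIMEOUT=' "$file" || {
        echo "NM_ONLINE_TIMEOUT not found in $file" >&2
        return 1
    }

    # Rewrite the uncommented assignment line; commented lines are untouched.
    sed -i "s/^NM_ONLINE_TIMEOUT=.*/NM_ONLINE_TIMEOUT=\"$seconds\"/" "$file"
}
```

Run as root, `set_nm_online_timeout 45` would then edit the default file, and the grep shown earlier can confirm the result.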

yast can also configure this setting via:

sudo yast sysconfig set NM_ONLINE_TIMEOUT="45"

But as we are akin to a JeOS install, a regular Rockstor system will not have yast configured and is not, as yet, yast compatible.

@phillxnet
Member Author

The failed state of NetworkManager-wait-online can be checked via:

systemctl status NetworkManager-wait-online
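For scripted checks, a minimal sketch along the following lines may be handy. It relies on `systemctl is-failed`, which exits 0 only when the unit is in the failed state. The function name `report_wait_online` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print a one-line verdict for the wait-online unit.
UNIT="NetworkManager-wait-online.service"

report_wait_online() {
    # 'systemctl is-failed' exits 0 only when the unit is in the failed state.
    if systemctl is-failed --quiet "$UNIT"; then
        echo "$UNIT: failed"
        return 1
    fi
    echo "$UNIT: not failed"
}
```

When the unit is reported as failed, `journalctl -b -u NetworkManager-wait-online.service` shows what it logged during the current boot.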

@phillxnet
Member Author

phillxnet commented Oct 6, 2020

I am undecided on the route to take here. Adding many tens of seconds to boot times for what looks to be a non-critical service may not be the way to go, especially given that no Rockstor native service seems to be affected. Also note that on the Ten64, for example, starting this service post boot takes around 46 seconds to succeed, whereas during boot, with the typical samba service enabled, the timeout required to avoid a failure is 185 seconds.

The above increase to 185 seconds (from the default of 30) affects the boot times thus.
From Grub screen to command line:

| | NM_ONLINE_TIMEOUT="30" | NM_ONLINE_TIMEOUT="185" |
|---|---|---|
| Rockstor Web-UI login available | 120 seconds | 120 seconds |
| command login | 60 seconds | 210 seconds |

Holding off on this change for the time being, as this may all be a red herring of sorts.
Also needed: timings for when this service is disabled, and the consequences of disabling it.
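One possible way to collect such timings, offered only as a sketch, is to pull the wait-online entry out of `systemd-analyze blame`, which lists units by how long they took to initialise during the current boot. The filter name `wait_online_blame_time` is hypothetical:

```shell
#!/bin/sh
# Hypothetical filter: read 'systemd-analyze blame' output on stdin and print
# only the time recorded for NetworkManager-wait-online.service.
wait_online_blame_time() {
    # Drop the leading padding, capture the time column, match the unit name.
    sed -n 's/^ *\(.*\) NetworkManager-wait-online\.service$/\1/p'
}

# Intended use on a booted system:
#   systemd-analyze blame | wait_online_blame_time
```

Comparing this figure across boots with different NM_ONLINE_TIMEOUT values, alongside the overall `systemd-analyze` summary, would give numbers comparable to the manual timings above. With the service disabled there is no entry to extract, so only the overall summary applies.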

@FroggyFlox
Member

@phillxnet, the same thing happened to me a little while back on some of my Rockstor KVMs, but I could never pinpoint the source and it was clearly due to my situation at the time... I remember looking around a bit and seeing some people report such timeouts at boot when having multiple NICs; this was my best guess, as I was erroneously binding a few interfaces to my KVM at the time. I haven't tried those VMs in a while (not sure I still have them), but could the number of interfaces be relevant here? It seems fitting given the high number of interfaces on the Ten64, for instance.

@phillxnet
Member Author

phillxnet commented Oct 6, 2020

@FroggyFlox

> reporting such timeout at boot when having multiple NICs;

That's interesting. The Ten64 does have 10 NICs, so it's possible, but I've also seen it on a Haswell NUC with a single NIC and an i5 Ivy Bridge desktop with a single NIC; in the latter two cases both machines were fairly heavily loaded starting multiple KVMs, though. This was with generic Leap 15.0/15.1.

It's really perplexing; it also doesn't look like anything is hanging, just waiting around. I'm inclined to disable it, actually, but I'm not sure of the consequences. In the Vault instance I think I removed the dependency on this service at one point, as Vault then worked fine anyway in my context here. I think testing on the Pi4 may help shed light, as it seems to affect slow / loaded / CPU-bound machines. But it may just be a hardware quirk, as on KVMs here it seems to work immediately.

Early timeout settings were 0, I think, meaning wait forever. This was changed to 1 at some point to stop infinite hangs on the service in some setups. I've moved to 40 - 60 on some setups to make stuff work, and have finally got to do some testing here in the Rockstor realm.

@FroggyFlox
Member

I still have my VM that shows that... and it only has one NIC, so the number of NICs seems irrelevant, actually... I'm currently leaning towards an IPv6 issue, as we still have a lot of log messages about IPv6-related operations failing (understandably so). Maybe we should make sure we're not missing something IPv6-related somewhere.
In this VM, everything else also runs fine: NetworkManager still comes up, just a little later than NetworkManager-wait-online would like.
