Race condition in Equinix Metal install process #1371

Closed
vielmetti opened this issue Feb 22, 2024 · 12 comments
Labels
kind/bug Something isn't working platform/equinixmetal

Comments

@vielmetti

Description

When provisioning instances on the Equinix Metal platform, instances are sometimes marked as "active" before
they are fully ready. There appears to be a race condition between the install script and the "phone-home"
operation.

Impact

When provisioning devices, the console sometimes reports that the server is provisioned and up and running, but the user does not yet have ssh access to it.

Environment and steps to reproduce

  1. Set-up: Flatcar Linux on Equinix Metal
  2. Task: Initial provisioning of the server
  3. Action(s):
    a. Provision a new Flatcar Linux server from the Equinix console
    b. After some time, observe that the console says the device is "ready"
    c. Attempt to log on with ssh; login sometimes fails
    d. Attempt to log on using out of band management ("SOS"), login succeeds
  4. Error: console reports "system ready" prior to it actually being ready.

Expected behavior

When the "phone home" script runs to signal to the Equinix console that the device is ready,
the device should actually be ready for ssh logins
and should have completed all of its provisioning tasks.

Additional information

CoreOS has a service, coreos-metadata.service, that phones home using the URL retrieved from the metadata.

Flatcar Linux does test on the Equinix Metal platform.

Is it easy to run the phone-home conditioned on flatcar.first_boot=detected and not on flatcar.first_boot=1?

cc @turegano-equinix

@vielmetti added the kind/bug label Feb 22, 2024
@vielmetti
Author

Please add the label platform/equinixmetal, thanks.

@vielmetti
Author

It would be worth reviewing #1143, which changed the phone-home setup.

@vielmetti
Author

Also for review: flatcar/scripts#1197 and flatcar/init#107.

@vielmetti
Author

Also cc @pothos and @tormath1, who made some of the recent edits mentioned above. It appears that the test environment for those changes was QEMU based (?), while our production environment is PXE based, which may explain why this was not caught in testing.

@tormath1
Contributor

Hello @vielmetti, thanks for the report. Please note that Flatcar is tested on Equinix Metal at each release (at least for AMD64) in a PXE-based environment.
When you have access to the out-of-band console, can you list the pending jobs (systemctl list-jobs) to see what's going on? Can you also confirm that this behavior first appeared with stable-3760.2.0 and not before?
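
As a rough illustration of the check suggested above, here is a minimal shell sketch for summarizing pending jobs from the out-of-band console. The helper name `blocked_units` is hypothetical, and the sketch assumes the usual `systemctl list-jobs` column order (JOB, UNIT, TYPE, STATE); it is not taken from the Flatcar tooling.

```shell
# Hypothetical helper: reads `systemctl list-jobs --no-legend` output on stdin
# and prints each unit whose job is still waiting or running.
# Assumes the columns are: JOB UNIT TYPE STATE.
blocked_units() {
  awk '$4 == "waiting" || $4 == "running" { print $2 " (" $4 ")" }'
}

# Usage on the affected machine (not executed here):
#   systemctl list-jobs --no-legend | blocked_units
```

If ssh-related units show up as waiting while the phone-home unit has already finished, that would be consistent with the race described in this issue.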

@vielmetti
Author

Thanks @tormath1. The systemctl list-jobs suggestion was a good one and we're digging into logs based on that.

@jepio
Member

jepio commented Feb 23, 2024

@vielmetti, just capturing here what I wrote in Slack:
I would think that if you use iPXE Flatcar to install Flatcar to disk, then you would want to mask the phone-home.service in the iPXE Flatcar environment, so that the phone-home.service only runs when the disk-installed Flatcar boots.

Our test suite tests iPXE Flatcar on EM, and in that case it needs to phone home from the iPXE environment.

@pothos
Member

pothos commented Feb 26, 2024

Ideally the internal provisioning would be done in a tinkerbell container that directly runs flatcar-install instead of using a Flatcar PXE boot to run it, see #125

@turegano-equinix

@jepio Thanks for the suggestion. I'll test systemd.mask=phone-home.service and will update the thread with the results.

@turegano-equinix

@jepio It worked! systemd.mask=packet-phone-home.service is in production now. Thanks a lot!
@vielmetti Thank you for managing it! 🥇 We can close the issue!
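
For reference, a minimal shell sketch of how one might confirm that the mask is actually present on the kernel command line of the iPXE-booted installer environment. The helper name `cmdline_masks_unit` is hypothetical; the sketch assumes the parameter appears as a separate space-delimited argument, and how the argument is added to the boot configuration is not shown because the thread does not describe it.

```shell
# Hypothetical helper: succeeds if the given kernel command line contains
# systemd.mask=<unit> as its own space-delimited argument.
#   $1 = kernel command line string, $2 = unit name
cmdline_masks_unit() {
  case " $1 " in
    *" systemd.mask=$2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage inside the installer environment (not executed here):
#   cmdline_masks_unit "$(cat /proc/cmdline)" packet-phone-home.service \
#     && echo "phone-home is masked in this boot"
```

The disk-installed system should not carry this argument, so that its phone-home unit still runs on first boot from disk.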

@vielmetti
Author

Thanks @turegano-equinix - is there something that we should add to the documentation or otherwise commit upstream? I'll hold this issue open to capture any of that.

@vielmetti
Author

Done to our satisfaction; if a doc issue comes up, it's an internal one here. Closing as completed.
