
Questions regarding NVMe emulation #1051

Closed · bensmrs opened this issue Jul 26, 2024 · 16 comments
Labels: Incomplete (waiting on more information from reporter)

Comments

@bensmrs (Contributor) commented Jul 26, 2024

Required information

  • Distribution: Arch Linux
  • Distribution version:
  • The output of "incus info" or if that fails:
    • Kernel version: 6.6.25-1-lts
    • LXC version: 6.0.1
    • Incus version: 6.2
    • Storage backend in use: ZFS

Issue description

I’m looking into NVMe emulation and have found that using io.bus: nvme creates a PCIe root port to which the PCI device representing the virtual NVMe controller is attached. Is there a way to connect the NVMe devices directly to the PCI root bus?
In the following image, nvme{0,1}n1 are devices created manually on the QEMU command line (something like -drive id=foo,if=none,file=/tmp/foo.img,format=qcow2 -device nvme,serial=deadbeef,drive=foo), and nvme{2,3}n1 are created with Incus options.
[Image: nvme-pci — sysfs/device-tree view of the four NVMe devices]
I’m trying to build macOS VMs, and NVMe disks declared through Incus are not recognized while the QEMU-defined ones are, so I’m guessing the NVMe devices need to be attached to the PCI root to be recognized (I may be wrong about that though…).
Any clue?
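
For concreteness, a hedged sketch of the manual invocation described above, with two NVMe controllers placed directly on the root bus (memory size, image paths, and serials are placeholders; on a q35 machine, bus=pcie.0 makes the root-bus placement explicit):

  qemu-system-x86_64 -machine q35 -m 2G \
    -drive id=disk0,if=none,file=/tmp/disk0.img,format=qcow2 \
    -device nvme,serial=deadbeef0,drive=disk0,bus=pcie.0 \
    -drive id=disk1,if=none,file=/tmp/disk1.img,format=qcow2 \
    -device nvme,serial=deadbeef1,drive=disk1,bus=pcie.0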

@stgraber (Member) commented

It looks like your QEMU NVMe drives are attached directly to the root PCI bridge rather than to a hot-pluggable port on the PCIe bridge.

Our layout lines up with what you'd get on a physical system and allows for hot-plug and hot-remove:

root@abydos:~# ls -lh /sys/class/block/nvme*
lrwxrwxrwx 1 root root 0 Jul 21 04:42 /sys/class/block/nvme0n1 -> ../../devices/pci0000:80/0000:80:02.0/0000:83:00.0/nvme/nvme0/nvme0n1
lrwxrwxrwx 1 root root 0 Jul 21 04:42 /sys/class/block/nvme1n1 -> ../../devices/pci0000:80/0000:80:03.0/0000:84:00.0/nvme/nvme1/nvme1n1

^ physical system with two NVMe drives.
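
For comparison, inside a guest the two virtual layouts would look roughly like this in lspci -t (a hypothetical sketch; bus and slot numbers are illustrative only):

  # Incus default: NVMe controller behind a hot-pluggable PCIe root port
  -[0000:00]-+-01.0-[01]----00.0  NVMe controller
  # Manual -device nvme with no bus= override: controller directly on pcie.0
  -[0000:00]-+-06.0  NVMe controller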

@stgraber added the Incomplete label on Jul 26, 2024
@bensmrs (Contributor, Author) commented Jul 26, 2024

Is there any way to attach the NVMe drives directly to the root PCI bridge?
I really think that’s what’s keeping me from creating proper macOS VMs. Devices attached directly to the root PCI bus work (they are recognized by macOS), but those behind the hot-pluggable ports you mention do not (Linux and OpenCore recognize them, but macOS doesn’t…).
I understand the opinionated design behind Incus, but I really feel the delta to being able to emulate macOS is very small, and it would be a shame not to allow NVMe drives to be attached to the root PCI bus…

@stgraber (Member) commented

Nope, you'd need to hack your way around it with raw.qemu.
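
For illustration, such a raw.qemu workaround might look like the following (a hedged sketch: the image path, serial, and PCI address are placeholders, and a disk attached this way bypasses Incus' own device management entirely):

  raw.qemu: |-
    -drive id=macdisk,if=none,file=/var/lib/incus/macos.img,format=raw
    -device nvme,serial=deadbeef,drive=macdisk,bus=pcie.0,addr=0x8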

@stgraber (Member) commented

What doesn't really make sense is that whatever is failing to find those virtualized NVMe drives is presumably quite capable of finding physical ones connected with an identical layout.

@bensmrs (Contributor, Author) commented Jul 27, 2024

I think the way devices are identified is closely related to the physical layout of Macs, and I wouldn’t be surprised if everything was attached to the root PCI bus on them.
I really wanted to avoid resorting to QEMU hackery to get block devices working, and NVMe emulation is the only option, as macOS doesn’t ship SCSI drivers.
Would allowing people to attach NVMe disks directly to the root PCI bus be against the philosophy of Incus?

@stgraber (Member) commented Aug 1, 2024

It's something I'd prefer not to do as it would cause a different PCIe structure just for this one case, which then has "fun" side-effects such as changing the otherwise mostly predictable naming pattern for network interfaces. It would also prevent hot-remove of those disks when we otherwise have universal support for storage add/remove, so that'd be more exceptions we'd need to keep in mind throughout the code.

Overall, while it's not impossible to do, making this an option is likely to make things quite brittle or tricky to test.

@bensmrs (Contributor, Author) commented Aug 1, 2024

Got it. I tried some nasty EFI/ACPI trickery over the last few days, but to no avail. Still, I really want to see in Incus the disks that my macOS VMs use. I tried to use the generated Incus drive names (and the virtual /dev/fdset/*) on the QEMU command line, but they only seem to exist later, as they are created through the monitor.

I guess I’ll attach a second monitor and send a few device_add commands to attach each generated Incus drive to devices that macOS can actually recognize. Does that feel reasonable enough? I see two problems:

  • I need to add the devices right after Incus has done the QMP configuration and before booting; how would you schedule that without dirty tricks such as inotifying the socket file and repeatedly querying QEMU until I see the Incus drives?
  • I need a secondary monitor for each of my VMs (a sketch follows this list), but I’d rather not hard-code the socket path for each of them. Is there a way to do basic name substitution in profiles, such as /run/incus/${name}/macos.monitor?
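
For the second point, a secondary QMP monitor could be wired in through raw.qemu along these lines (a hedged sketch: the chardev id and socket path are placeholders, and assuming no variable substitution is available, the path has to be hard-coded per VM):

  raw.qemu: |-
    -chardev socket,id=qmp2,path=/run/incus/macos/secondary.monitor,server=on,wait=off
    -mon chardev=qmp2,mode=control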

@stgraber (Member) commented Aug 1, 2024

One option may be to add some more raw.qemu config options to allow sending QMP commands.

Basically:

  • raw.qemu.qmp.early
  • raw.qemu.qmp.pre-start
  • raw.qemu.qmp.post-start

The value of which would be a JSON list of QMP commands (command+arguments).

In your case, you'd likely want pre-start, as that runs after all our own commands but prior to startup, letting you attach things to your pre-created devices.
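
For illustration, a value in that format might look like this (a sketch; query-status is just an innocuous command showing the command+arguments shape):

  raw.qemu.qmp.pre-start: |-
    [
      { "execute": "query-status", "arguments": {} }
    ]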

@bensmrs (Contributor, Author) commented Aug 1, 2024

I’ll look into it. I haven’t yet looked at that part of the codebase, so it might take me quite some time; I’ll try to come up with something in a couple of weeks. A few questions (tell me if you’d prefer me to open another issue to discuss this):

  • How should errors in QMP commands be handled? Fail completely? Ignore every failing command? Ignore every command as soon as one fails? Attempt to roll back the whole thing (not always possible)?
  • In my case, these commands would need arguments from other commands / from the YAML config. For example, I’d need to get devices/<name>/type=disk, then do something with <name>, or process the query-block command output. Would a scriptlet approach be advisable? Do you have something else in mind?

@stgraber (Member) commented Aug 1, 2024

> How should errors in QMP commands be handled? Fail completely? Ignore every failing command? Ignore every command as soon as one fails? Attempt to roll back the whole thing (not always possible)?

I'd probably just fail startup if any of the commands fails, returning the error back to the user.

> In my case, these commands would need arguments from other commands / from the YAML config. For example, I’d need to get devices/<name>/type=disk, then do something with <name>, or process the query-block command output. Would a scriptlet approach be advisable? Do you have something else in mind?

I was thinking of the easy way to do this: since our devices/configuration are consistent across startups, you should be able to predict what the name or index is going to be and just use that in your raw.qemu.qmp.XYZ entries.

Though having a raw.qmp.scriptlet which takes a standard scriptlet that's fed the stage we're at as an argument (early, pre-start, post-start), with support for running QMP commands and reading their responses, could be an interesting addition.

This is usually not something we'd be very keen on for security reasons, as such a script could be used to do some very nasty things through QMP and could also cause Incus' memory and CPU usage to get out of control. But as it would be a raw config key, it's already part of the dangerous config options which we don't expose to untrusted users, so that'd be fine.
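
For illustration, such a raw.qmp.scriptlet might take a shape like this (entirely hypothetical Starlark: the qmp_hook entry point and run_qmp helper are assumptions, not an existing Incus API; the qom-set call mirrors the workaround shown later in this thread):

  def qmp_hook(stage):
      # Run only once Incus has finished its own QMP configuration.
      if stage != "pre-start":
          return
      # Read back the block nodes Incus created through the monitor...
      nodes = run_qmp("query-named-block-nodes", {})
      # ...and point a pre-created NVMe device (e.g. from raw.qemu) at one.
      run_qmp("qom-set", {
          "path": "/machine/peripheral/blk0",
          "property": "drive",
          "value": nodes[0]["node-name"],
      })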

@bensmrs (Contributor, Author) commented Aug 2, 2024

> I was thinking of the easy way to do this: since our devices/configuration are consistent across startups, you should be able to predict what the name or index is going to be and just use that in your raw.qemu.qmp.XYZ entries.

Sure, but if I add a device later, I’d also need to edit the configuration key (I’d like to put the key in a profile and forget about it). I’ll see what I can do by first implementing non-scriptlet solutions, then look at scriptlets.

@bensmrs (Contributor, Author) commented Aug 9, 2024

I think I’m almost there. Is there any reason to prefer, as you suggested, (i) a JSON list of QMP commands over (ii) a list of JSON QMP commands?
Handling a JSON list adds an unmarshalling/remarshalling step which doesn’t look very good in the code. Also, if I were to implement (ii), would allowing the user to write:

raw.qemu.qmp.early: command

vs.

raw.qemu.qmp.early:
  - command

be a bad idea?

@stgraber (Member) commented

Well, the catch is that all our config keys are stored as strings (configs are always map[string]string in Incus), so your second option would lead to a type error when trying to set it.
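
To make the constraint concrete, a minimal Go sketch (not Incus source; the key name comes from the proposal above) showing the string-typed config and the unmarshalling step it forces:

  package main

  import (
      "encoding/json"
      "fmt"
  )

  func main() {
      // Instance config is map[string]string, so the QMP commands travel
      // as one JSON string rather than as a YAML list.
      config := map[string]string{
          "raw.qemu.qmp.early": `[{"execute": "query-status"}]`,
      }

      // The unmarshalling step mentioned above: decode the string into a
      // list of command objects before sending them over QMP.
      var cmds []map[string]any
      if err := json.Unmarshal([]byte(config["raw.qemu.qmp.early"]), &cmds); err != nil {
          panic(err)
      }
      fmt.Println(cmds[0]["execute"]) // query-status
  }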

@bensmrs (Contributor, Author) commented Aug 12, 2024

Okay, I’ll do a PR this week. Meanwhile, a little update on my painful and ugly journey.

I tried everything I could to add my drives with QMP, but every attempt was met with “XXX does not support hotplugging”. I discovered that nvme devices and their nvme-ns namespace devices cannot be hotplugged, and that even “stopping” the VM with stop (and resuming it with cont) doesn’t allow any hotplugging on pcie.0. I then tried pre-defining devices with dummy drives and changing the drives afterwards, which unfortunately isn’t supported by the blockdev-* and device_* commands. Out of despair, I even tried to mess with qom-set and… well, it works!

  raw.qemu.qmp.pre-start: |-
    [
      {
        "execute": "blockdev-add",
        "arguments": {
          "driver": "raw",
          "node-name": "fdset0",
          "file": {
            "driver": "file",
            "filename": "/dev/fdset/0"
          }
        }
      },
      {
        "execute": "qom-set",
        "arguments": {
          "path": "/machine/peripheral/blk0",
          "property": "drive",
          "value": "fdset0"
        }
      }
    ]

Not that I don’t like black magic, but isn’t there a cleaner way of dynamically patching a drive?

@stgraber (Member) commented

Haha, welcome to QMP hell :)

It's one of those APIs that has grown very organically with no clear design or consistency...

@bensmrs (Contributor, Author) commented Aug 13, 2024

I think now’s the time to close this issue, as #1115 fixes part of it.
I’ll soon open another one regarding scriptlets: they are already used for instance placement, there is an open issue about using them for authorization, and I also want them for dynamic QMP voodoo. I believe we need to define a clean way of integrating new scriptlets to make life easier for contributors.

@bensmrs closed this as completed on Aug 13, 2024