Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a unique_id identifier for NVIDIA and AMD devices #30

Closed
wants to merge 4 commits into from

Conversation

neithanmo
Copy link

This adds a way to identify a device without the risk of collisions which might happen if the bus_id or device name is used for this purpose. The native cl_device_id does not work either because the returned ID is different between processes. The implementation uses HW attributes that are specific to each device.
We tested this on a machine with 2 NVIDIA 1080 Ti, both sharing the same bus_id, name, but with different unique_id that is built from the CL_DEVICE_PCI_SLOT_ID_NV attribute.
In the AMD case, this ID is constructed from the amd_device_topology (u8, u8, u8), device, bus, and function.
Important to note that this ID is also persistent, so does not change even if the machine is restarted.

I did not touch the GPUSelector, and other functions/structs that are part of the public API in order to minimize breaking code for users. although, I think we can add this to it.

Copy link
Collaborator

@vmx vmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say the usage of the bus ID should be removed completely and the GPUSelector should change to using the unique ID. As I understand it, the purpose of the bus ID was to have a unique identifier, which obviously doesn't work.

@neithanmo before you make the change, I'd like to get confirmation from one more of my team member to confirm that we should remove the bus ID completely.

@neithanmo
Copy link
Author

I would say the usage of the bus ID should be removed completely and the GPUSelector should change to using the unique ID. As I understand it, the purpose of the bus ID was to have a unique identifier, which obviously doesn't work.

I do not think the bus_id is completely useless. It works on specific setups. if the GPUSelector needs to be modified, it would be adding the uniqueID as a new variant so that, We can use either the busID or uniqueID as a fallback case.

Another thing is that for the AMD case we tested it on a machine with only one AMD card. Does someone have a machine with more than one??

@vmx
Copy link
Collaborator

vmx commented Apr 14, 2021

I do not think the bus_id is completely useless.

It's not useless, but it seems to me that it's not needed. I'd prefer removing features that are not needed.

@dignifiedquire
Copy link
Contributor

we had some issues with gpus on macs before, please make sure it works there as well

@jennijuju
Copy link
Member

@neithanmo thoughts on the feedbacks?

@porcuquine
Copy link
Collaborator

The original author can no longer consulted, but I can confirm that the reason BusID was originally introduced was to act as a (hopefully) unique identifier per card. I don't have a strong opinion on whether it needs to be removed or preserved, but if the original goal can be accomplished without it, I don't object to removal.

@porcuquine
Copy link
Collaborator

To be clear, I'm talking about the original use of BusID in rust-gpu-tools when I mention the original author and intent. I'm not trying to speak for @neithanmo — who is currently working on the GPU scheduler code which needs to enhance the code.

@neithanmo
Copy link
Author

I am currently working on the scheduler integration with Neptune and besides the unique id feature this PR adds, we need to "update" the GpuSelector by removing the BusId and adding the UniqueId variant. This GputSelector is used in Neptune. so I am going to update this PR with that.

@vmx
Copy link
Collaborator

vmx commented Apr 27, 2021

@neithanmo I've spent some time looking into the unique ID situation and also into some of the history what the bus ID is used for.

Ideally we end up with those features:

  • It is a unique identifier: that's a must and is currently not given by the GPUSelector code on master
  • It works with Nvidia, AMD and cards on Mac: that's a must
  • The identifier is something you can get with other/stock utilities: this is a nice to have. the reason to have it is that we have a command line flag in Neptune to select a certain GPU. It's nice for users if they can get the identifier from some standard command line tool. Though of course we could have our own tool that lists all available GPUs together with the unique identifier we use.

Such an unique identifier would then replace our current use of bus ID and we'd call it "unique identifier" instead.

In regards to what to use as an unique identifier I currently lean towards doing a best effort approach, where we try to get the "best possible" identifier and then fallback to less "good" ones. I propose using the OpenCL 3.0 CL_DEVICE_UUID_KHR extension to get the "best possible" identifier. I know OpenCL 3.0 is pretty new and might not be widely supported. Though I suspect that most people using our libraries are using the latest drivers and AFAICT at least the Nvidia and AMD ones have support for that extension.

You can try it with the clinfo command line tool. If your card/driver supports it, you'd have a section saying Device UUID and also cl_khr_device_uuid in the list of Device Extensions. I'd like to use that identifier as at least for the Nvidia cards I've tested with, the identifier is the same for CUDA as well. That makes it a nice to use from a user interface perspective (as rust-gpu-tools will also have CUDA support soon).

As we already have the code in this PR to do unique identifiers by other methods, I propose using those as fallback. I think we'd always need a fallback for Mac anyway.

What do others think about this approach? What I wrote is meant as a discussion starting point and not a "this is how we must do things". Folks involved in this PR have way more experience with this topic than I have.

@neithanmo
Copy link
Author

* The identifier is something you can get with other/stock utilities: this is a nice to have. the reason to have it is that we have a command line flag in Neptune to select a certain GPU. It's nice for users if they can get the identifier from some standard command line tool. Though of course we could have our own tool that lists all available GPUs together with the unique identifier we use.

I am currently looking at Neptune in order to integrate the scheduler with it and that env variable is something that we need to modify, at least when neptune is used with the scheduler feature-flag enable. about the command-line tool I think it is also a must because this unique identifier is a dev-extension value that is not printed out by normal ocl-info command. and user needs to know this unique identifier for editing the scheduler config file.

  • In regards to what to use as a unique identifier I currently lean towards doing a best-effort approach, where we try to get the "best possible" identifier and then fallback to less "good" ones. I propose using the OpenCL 3.0 CL_DEVICE_UUID_KHR extension to get the "best possible" identifier. I know OpenCL 3.0 is pretty new and might not be widely supported. Though I suspect that most people using our libraries are using the latest drivers and AFAICT at least the Nvidia and AMD ones have support for that extension.

ok. I have not checked this yet but it is important to keep in mind that this unique identifier must be persistent between processes and in-between machine reboots and shutdowns because of the fact that we have a configuration file where devices can be reserved for specific tasks if this changes the configs have to be updated and can be sort of annoying.

  • As we already have the code in this PR to do unique identifiers by other methods, I propose using those as fallback. I think we'd always need a fallback for Mac anyway.

It sounds good to me. The current fallback case for mac devices is that the uid is set to None the same for the bus id. not sure if opencl 3 has better support for macOS cards.

as a summary about this PR:

  • Do i need to include opencl3 to this and use the uuid extension you mentioned? while keeping the current code as a fallback case?.

@porcuquine
Copy link
Collaborator

I can confirm that the original intended use of BusID included allowing users to explicitly choose devices and to discover appropriate identifiers using clinfo. I haven't followed this closely enough to understand the cases in which BusID fails to meet these requirements, but it sounds like both of you understand the situation — so I think whatever you can agree on will work for me. The discussion here seems reasonable and balanced.

@vmx
Copy link
Collaborator

vmx commented Apr 27, 2021

keep in mind that this unique identifier must be persistent between processes and in-between machine reboots and shutdowns

The description of CL_DEVICE_UUID_KHR from the OpenCL 3.0 spec:

Device UUIDs must be immutable for a given device across processes, driver APIs, driver versions, and system reboots.

Do i need to include opencl3 to this and use the uuid extension you mentioned? while keeping the current code as a fallback case?.

For now, just keep using the current master of rust-gpu-tools. There is no need to switch to opencl3 (it's still work in progress). In case you cannot get the UUID with ocl we should talk again. My idea would be that this PR introduces a unique identifier which is based on the UUID if it's available and else use the fallback this PR is currently using.

@vmx
Copy link
Collaborator

vmx commented Apr 29, 2021

The removal of bus_id looks good. There are some match cases which are no longer needed, but I guess those are placeholders when the UUID code gets implemented.

@neithanmo
Copy link
Author

The removal of bus_id looks good. There are some match cases which are no longer needed, but I guess those are placeholders when the UUID code gets implemented.

Yes, I left them as a place holder for the later enum variant with the UUID

@jennijuju
Copy link
Member

jennijuju commented May 10, 2021

More discussion is happening in Slack - implementation awaiting

@vmx
Copy link
Collaborator

vmx commented May 12, 2021

@neithanmo I had another look at the code as I'll need a similar thing for my work of integrating CUDA into rust-gpu-tools. Here are my thoughts. Currently there is a distinction between the UniqueId and the Index. I can see that you wanted to break the API as little as possible (which is great), but I think we are in the position to break the API and reduce it to the things we really need.

Currently rust-gpu-tools is only used by bellperson and by neptune. So functionality that is not used there can potentially be removed. E.g. get_device_bus_id_by_index() doesn't seem to be used at all.

Given all that I think it would make sense to also unify the Index into UniqueId as a fallback. So UniqueId will be an enum consisting of the Uuid (which will be preferred), the way you implemented here and the Index as yet another fallback. You can then get a device by UniqueId and need to pass in the correct value (I hope that works out well, I haven't fully though this through and also haven't checked all uses of the GPUSelector yet). How does that sound?

@neithanmo
Copy link
Author

Given all that I think it would make sense to also unify the Index into UniqueId as a fallback. So UniqueId will be an enum consisting of the Uuid (which will be preferred), the way you implemented here and the Index as yet another fallback. You can then get a device by UniqueId and need to pass in the correct value (I hope that works out well, I haven't fully though this through and also haven't checked all uses of the GPUSelector yet). How does that sound?

I think that you are describing the current GPUSelector enum ? or I did not understand what you suggested here., The GPUSelector contains the Uui and as a fallback the Index. Actually, it is used in neptune in case the passed Uuid is not valid..
if what you propose is:

enum UniqueId
     Uuid(..),
     Index(..),
}

I am not sure then. It would look similar to the GPUSelector. and removing the selector would require many changes on upstream libraries like neptune and bellperson and even rust-fil-proofs.

@vmx
Copy link
Collaborator

vmx commented May 14, 2021

@neithanmo I tried to combine to many things at once. You are right, let's keep things simple (and not breaking to much at once). So yes, I think what I'm after is that the GPUSelector looks something like that:

pub enum GPUSelector {
    Uuid(),
    BusId(u32),
    Index(usize),
}

Better names should be used. What I mean with those is that Uuid is using CL_DEVICE_UUID_KHR, BudId is using your current implementation, and Index has the current meaning.

Does that make sense? If not and you have already a clear picture of how you want to integrate the CL_DEVICE_UUID_KHR call, just go for it. You currently have a deeper understanding that I have.

@neithanmo
Copy link
Author

it is ok for me. of course, the BusId has to be renamed, not clear what would be a correct name for it.

@vmx
Copy link
Collaborator

vmx commented May 17, 2021

not clear what would be a correct name for it.

Would PciId work? As for both, AMD and Nvidia the value is based on the PCI bus/slot.

@neithanmo
Copy link
Author

neithanmo commented May 17, 2021

Would PciId work? As for both, AMD and Nvidia the value is based on the PCI bus/slot.

PciId makes sense. It works for me.

src/opencl/mod.rs Outdated Show resolved Hide resolved
src/opencl/mod.rs Outdated Show resolved Hide resolved
src/opencl/mod.rs Outdated Show resolved Hide resolved
src/opencl/mod.rs Outdated Show resolved Hide resolved
src/opencl/mod.rs Outdated Show resolved Hide resolved
src/opencl/mod.rs Show resolved Hide resolved
src/opencl/utils.rs Show resolved Hide resolved
src/opencl/mod.rs Outdated Show resolved Hide resolved
@vmx vmx mentioned this pull request May 20, 2021
Adds a unique_id identifier for NVIDIA and AMD devices
Remove bus_id

remove weird loop in test

add uuid error variant

apply suggestion during review

remove two functions that are not longer required

use GPUResult
Copy link
Collaborator

@vmx vmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Could someone else from the team (perhaps @dignifiedquire or @porcuquine) have another look. I still feel like I'm missing something obvious that would make things even cleaner/simpler.

@neithanmo
Copy link
Author

I updated this. TryFrom is more idiomatic and also handy to use.

@porcuquine
Copy link
Collaborator

Have the AMD-specific code paths been tested on AMD devices?

@neithanmo
Copy link
Author

Have the AMD-specific code paths been tested on AMD devices?

No, I have not tested it on AMD. I am going to get this tested as soon as we get an AMD device.

@neithanmo
Copy link
Author

I talked to @vmx about adding serde to serialize/deserialize GPUSelector. the reason is that users may want to share GPUSelector over the network or store it somewhere. in our case it would be nice if we can use GPUSelector directly in the scheduler for device selection.
The implementation adds a feature flag called : serde_support
let me know if it is an appropriate name or approach.

Copy link
Collaborator

@vmx vmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@vmx
Copy link
Collaborator

vmx commented Jul 6, 2021

I close this PR in favour of #39. Thanks @neithanmo for the work on this PR and also the tests, which I've largely re-used.

@vmx vmx closed this Jul 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants