Adds a unique_id identifier for NVIDIA and AMD devices #30

neithanmo · 2021-04-14T04:25:23Z

This adds a way to identify a device without the risk of collisions which might happen if the bus_id or device name is used for this purpose. The native cl_device_id does not work either because the returned ID is different between processes. The implementation uses HW attributes that are specific to each device.
We tested this on a machine with 2 NVIDIA 1080 Ti, both sharing the same bus_id, name, but with different unique_id that is built from the CL_DEVICE_PCI_SLOT_ID_NV attribute.
In the AMD case, this ID is constructed from the amd_device_topology (u8, u8, u8), device, bus, and function.
Important to note that this ID is also persistent, so does not change even if the machine is restarted.

I did not touch the GPUSelector, and other functions/structs that are part of the public API in order to minimize breaking code for users. although, I think we can add this to it.

vmx

I would say the usage of the bus ID should be removed completely and the GPUSelector should change to using the unique ID. As I understand it, the purpose of the bus ID was to have a unique identifier, which obviously doesn't work.

@neithanmo before you make the change, I'd like to get confirmation from one more of my team member to confirm that we should remove the bus ID completely.

neithanmo · 2021-04-14T13:05:49Z

I would say the usage of the bus ID should be removed completely and the GPUSelector should change to using the unique ID. As I understand it, the purpose of the bus ID was to have a unique identifier, which obviously doesn't work.

I do not think the bus_id is completely useless. It works on specific setups. if the GPUSelector needs to be modified, it would be adding the uniqueID as a new variant so that, We can use either the busID or uniqueID as a fallback case.

Another thing is that for the AMD case we tested it on a machine with only one AMD card. Does someone have a machine with more than one??

vmx · 2021-04-14T13:13:08Z

I do not think the bus_id is completely useless.

It's not useless, but it seems to me that it's not needed. I'd prefer removing features that are not needed.

dignifiedquire · 2021-04-19T10:27:13Z

we had some issues with gpus on macs before, please make sure it works there as well

jennijuju · 2021-04-26T16:49:26Z

@neithanmo thoughts on the feedbacks?

porcuquine · 2021-04-26T16:51:02Z

The original author can no longer consulted, but I can confirm that the reason BusID was originally introduced was to act as a (hopefully) unique identifier per card. I don't have a strong opinion on whether it needs to be removed or preserved, but if the original goal can be accomplished without it, I don't object to removal.

porcuquine · 2021-04-26T16:54:48Z

To be clear, I'm talking about the original use of BusID in rust-gpu-tools when I mention the original author and intent. I'm not trying to speak for @neithanmo — who is currently working on the GPU scheduler code which needs to enhance the code.

neithanmo · 2021-04-26T18:09:56Z

I am currently working on the scheduler integration with Neptune and besides the unique id feature this PR adds, we need to "update" the GpuSelector by removing the BusId and adding the UniqueId variant. This GputSelector is used in Neptune. so I am going to update this PR with that.

vmx · 2021-04-27T08:54:48Z

@neithanmo I've spent some time looking into the unique ID situation and also into some of the history what the bus ID is used for.

Ideally we end up with those features:

It is a unique identifier: that's a must and is currently not given by the GPUSelector code on master
It works with Nvidia, AMD and cards on Mac: that's a must
The identifier is something you can get with other/stock utilities: this is a nice to have. the reason to have it is that we have a command line flag in Neptune to select a certain GPU. It's nice for users if they can get the identifier from some standard command line tool. Though of course we could have our own tool that lists all available GPUs together with the unique identifier we use.

Such an unique identifier would then replace our current use of bus ID and we'd call it "unique identifier" instead.

In regards to what to use as an unique identifier I currently lean towards doing a best effort approach, where we try to get the "best possible" identifier and then fallback to less "good" ones. I propose using the OpenCL 3.0 CL_DEVICE_UUID_KHR extension to get the "best possible" identifier. I know OpenCL 3.0 is pretty new and might not be widely supported. Though I suspect that most people using our libraries are using the latest drivers and AFAICT at least the Nvidia and AMD ones have support for that extension.

You can try it with the clinfo command line tool. If your card/driver supports it, you'd have a section saying Device UUID and also cl_khr_device_uuid in the list of Device Extensions. I'd like to use that identifier as at least for the Nvidia cards I've tested with, the identifier is the same for CUDA as well. That makes it a nice to use from a user interface perspective (as rust-gpu-tools will also have CUDA support soon).

As we already have the code in this PR to do unique identifiers by other methods, I propose using those as fallback. I think we'd always need a fallback for Mac anyway.

What do others think about this approach? What I wrote is meant as a discussion starting point and not a "this is how we must do things". Folks involved in this PR have way more experience with this topic than I have.

neithanmo · 2021-04-27T14:40:17Z

* The identifier is something you can get with other/stock utilities: this is a nice to have. the reason to have it is that we have a command line flag in Neptune to select a certain GPU. It's nice for users if they can get the identifier from some standard command line tool. Though of course we could have our own tool that lists all available GPUs together with the unique identifier we use.

I am currently looking at Neptune in order to integrate the scheduler with it and that env variable is something that we need to modify, at least when neptune is used with the scheduler feature-flag enable. about the command-line tool I think it is also a must because this unique identifier is a dev-extension value that is not printed out by normal ocl-info command. and user needs to know this unique identifier for editing the scheduler config file.

In regards to what to use as a unique identifier I currently lean towards doing a best-effort approach, where we try to get the "best possible" identifier and then fallback to less "good" ones. I propose using the OpenCL 3.0 CL_DEVICE_UUID_KHR extension to get the "best possible" identifier. I know OpenCL 3.0 is pretty new and might not be widely supported. Though I suspect that most people using our libraries are using the latest drivers and AFAICT at least the Nvidia and AMD ones have support for that extension.

ok. I have not checked this yet but it is important to keep in mind that this unique identifier must be persistent between processes and in-between machine reboots and shutdowns because of the fact that we have a configuration file where devices can be reserved for specific tasks if this changes the configs have to be updated and can be sort of annoying.

As we already have the code in this PR to do unique identifiers by other methods, I propose using those as fallback. I think we'd always need a fallback for Mac anyway.

It sounds good to me. The current fallback case for mac devices is that the uid is set to None the same for the bus id. not sure if opencl 3 has better support for macOS cards.

as a summary about this PR:

Do i need to include opencl3 to this and use the uuid extension you mentioned? while keeping the current code as a fallback case?.

porcuquine · 2021-04-27T14:58:03Z

I can confirm that the original intended use of BusID included allowing users to explicitly choose devices and to discover appropriate identifiers using clinfo. I haven't followed this closely enough to understand the cases in which BusID fails to meet these requirements, but it sounds like both of you understand the situation — so I think whatever you can agree on will work for me. The discussion here seems reasonable and balanced.

vmx · 2021-04-27T15:06:55Z

keep in mind that this unique identifier must be persistent between processes and in-between machine reboots and shutdowns

The description of CL_DEVICE_UUID_KHR from the OpenCL 3.0 spec:

Device UUIDs must be immutable for a given device across processes, driver APIs, driver versions, and system reboots.

Do i need to include opencl3 to this and use the uuid extension you mentioned? while keeping the current code as a fallback case?.

For now, just keep using the current master of rust-gpu-tools. There is no need to switch to opencl3 (it's still work in progress). In case you cannot get the UUID with ocl we should talk again. My idea would be that this PR introduces a unique identifier which is based on the UUID if it's available and else use the fallback this PR is currently using.

vmx · 2021-04-29T15:37:40Z

The removal of bus_id looks good. There are some match cases which are no longer needed, but I guess those are placeholders when the UUID code gets implemented.

neithanmo · 2021-04-29T16:14:48Z

The removal of bus_id looks good. There are some match cases which are no longer needed, but I guess those are placeholders when the UUID code gets implemented.

Yes, I left them as a place holder for the later enum variant with the UUID

jennijuju · 2021-05-10T16:27:49Z

More discussion is happening in Slack - implementation awaiting

vmx · 2021-05-12T16:11:07Z

@neithanmo I had another look at the code as I'll need a similar thing for my work of integrating CUDA into rust-gpu-tools. Here are my thoughts. Currently there is a distinction between the UniqueId and the Index. I can see that you wanted to break the API as little as possible (which is great), but I think we are in the position to break the API and reduce it to the things we really need.

Currently rust-gpu-tools is only used by bellperson and by neptune. So functionality that is not used there can potentially be removed. E.g. get_device_bus_id_by_index() doesn't seem to be used at all.

Given all that I think it would make sense to also unify the Index into UniqueId as a fallback. So UniqueId will be an enum consisting of the Uuid (which will be preferred), the way you implemented here and the Index as yet another fallback. You can then get a device by UniqueId and need to pass in the correct value (I hope that works out well, I haven't fully though this through and also haven't checked all uses of the GPUSelector yet). How does that sound?

neithanmo · 2021-05-13T03:38:19Z

Given all that I think it would make sense to also unify the Index into UniqueId as a fallback. So UniqueId will be an enum consisting of the Uuid (which will be preferred), the way you implemented here and the Index as yet another fallback. You can then get a device by UniqueId and need to pass in the correct value (I hope that works out well, I haven't fully though this through and also haven't checked all uses of the GPUSelector yet). How does that sound?

I think that you are describing the current GPUSelector enum ? or I did not understand what you suggested here., The GPUSelector contains the Uui and as a fallback the Index. Actually, it is used in neptune in case the passed Uuid is not valid..
if what you propose is:

enum UniqueId
     Uuid(..),
     Index(..),
}

I am not sure then. It would look similar to the GPUSelector. and removing the selector would require many changes on upstream libraries like neptune and bellperson and even rust-fil-proofs.

vmx · 2021-05-14T10:26:39Z

@neithanmo I tried to combine to many things at once. You are right, let's keep things simple (and not breaking to much at once). So yes, I think what I'm after is that the GPUSelector looks something like that:

pub enum GPUSelector {
    Uuid(…),
    BusId(u32),
    Index(usize),
}

Better names should be used. What I mean with those is that Uuid is using CL_DEVICE_UUID_KHR, BudId is using your current implementation, and Index has the current meaning.

Does that make sense? If not and you have already a clear picture of how you want to integrate the CL_DEVICE_UUID_KHR call, just go for it. You currently have a deeper understanding that I have.

neithanmo · 2021-05-14T21:31:38Z

it is ok for me. of course, the BusId has to be renamed, not clear what would be a correct name for it.

vmx · 2021-05-17T08:50:01Z

not clear what would be a correct name for it.

Would PciId work? As for both, AMD and Nvidia the value is based on the PCI bus/slot.

neithanmo · 2021-05-17T14:40:35Z

Would PciId work? As for both, AMD and Nvidia the value is based on the PCI bus/slot.

PciId makes sense. It works for me.

src/opencl/mod.rs

src/opencl/utils.rs

src/opencl/mod.rs

Adds a unique_id identifier for NVIDIA and AMD devices Remove bus_id remove weird loop in test add uuid error variant apply suggestion during review remove two functions that are not longer required use GPUResult

vmx

Looks good to me. Could someone else from the team (perhaps @dignifiedquire or @porcuquine) have another look. I still feel like I'm missing something obvious that would make things even cleaner/simpler.

neithanmo · 2021-05-23T16:58:44Z

I updated this. TryFrom is more idiomatic and also handy to use.

porcuquine · 2021-05-25T20:36:37Z

Have the AMD-specific code paths been tested on AMD devices?

neithanmo · 2021-05-25T21:39:09Z

Have the AMD-specific code paths been tested on AMD devices?

No, I have not tested it on AMD. I am going to get this tested as soon as we get an AMD device.

neithanmo · 2021-05-26T22:02:20Z

I talked to @vmx about adding serde to serialize/deserialize GPUSelector. the reason is that users may want to share GPUSelector over the network or store it somewhere. in our case it would be nice if we can use GPUSelector directly in the scheduler for device selection.
The implementation adds a feature flag called : serde_support
let me know if it is an appropriate name or approach.

vmx

Looks good to me.

vmx · 2021-07-06T15:29:03Z

I close this PR in favour of #39. Thanks @neithanmo for the work on this PR and also the tests, which I've largely re-used.

vmx reviewed Apr 14, 2021

View reviewed changes

neithanmo force-pushed the unique_id branch from de2975e to a883f88 Compare May 19, 2021 21:51

vmx reviewed May 20, 2021

View reviewed changes

vmx mentioned this pull request May 20, 2021

Switch from ocl to opencl3 #34

Merged

fix update to opencl3 and use uuid instead of unique_id

3d4a2b7

Adds a unique_id identifier for NVIDIA and AMD devices Remove bus_id remove weird loop in test add uuid error variant apply suggestion during review remove two functions that are not longer required use GPUResult

neithanmo force-pushed the unique_id branch from e4b7ea6 to 3d4a2b7 Compare May 20, 2021 17:00

vmx approved these changes May 20, 2021

View reviewed changes

Use TryFrom which is more idiomatic

3e6e0f4

vmx approved these changes May 25, 2021

View reviewed changes

Add serialization/deserialization feature to GPUSelector

4d4b98e

vmx reviewed May 27, 2021

View reviewed changes

derive hash for GPUSelector

30a572d

vmx closed this Jul 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds a unique_id identifier for NVIDIA and AMD devices #30

Adds a unique_id identifier for NVIDIA and AMD devices #30

neithanmo commented Apr 14, 2021

vmx left a comment

neithanmo commented Apr 14, 2021

vmx commented Apr 14, 2021

dignifiedquire commented Apr 19, 2021

jennijuju commented Apr 26, 2021

porcuquine commented Apr 26, 2021

porcuquine commented Apr 26, 2021

neithanmo commented Apr 26, 2021

vmx commented Apr 27, 2021

neithanmo commented Apr 27, 2021

porcuquine commented Apr 27, 2021

vmx commented Apr 27, 2021

vmx commented Apr 29, 2021

neithanmo commented Apr 29, 2021

jennijuju commented May 10, 2021 •

edited

Loading

vmx commented May 12, 2021

neithanmo commented May 13, 2021

vmx commented May 14, 2021

neithanmo commented May 14, 2021

vmx commented May 17, 2021

neithanmo commented May 17, 2021 •

edited

Loading

vmx left a comment

neithanmo commented May 23, 2021

porcuquine commented May 25, 2021

neithanmo commented May 25, 2021

neithanmo commented May 26, 2021

vmx left a comment

vmx commented Jul 6, 2021

Adds a unique_id identifier for NVIDIA and AMD devices #30

Adds a unique_id identifier for NVIDIA and AMD devices #30

Conversation

neithanmo commented Apr 14, 2021

vmx left a comment

Choose a reason for hiding this comment

neithanmo commented Apr 14, 2021

vmx commented Apr 14, 2021

dignifiedquire commented Apr 19, 2021

jennijuju commented Apr 26, 2021

porcuquine commented Apr 26, 2021

porcuquine commented Apr 26, 2021

neithanmo commented Apr 26, 2021

vmx commented Apr 27, 2021

neithanmo commented Apr 27, 2021

porcuquine commented Apr 27, 2021

vmx commented Apr 27, 2021

vmx commented Apr 29, 2021

neithanmo commented Apr 29, 2021

jennijuju commented May 10, 2021 • edited Loading

vmx commented May 12, 2021

neithanmo commented May 13, 2021

vmx commented May 14, 2021

neithanmo commented May 14, 2021

vmx commented May 17, 2021

neithanmo commented May 17, 2021 • edited Loading

vmx left a comment

Choose a reason for hiding this comment

neithanmo commented May 23, 2021

porcuquine commented May 25, 2021

neithanmo commented May 25, 2021

neithanmo commented May 26, 2021

vmx left a comment

Choose a reason for hiding this comment

vmx commented Jul 6, 2021

jennijuju commented May 10, 2021 •

edited

Loading

neithanmo commented May 17, 2021 •

edited

Loading