-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a unique_id identifier for NVIDIA and AMD devices #30
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say the usage of the bus ID should be removed completely and the GPUSelector
should change to using the unique ID. As I understand it, the purpose of the bus ID was to have a unique identifier, which obviously doesn't work.
@neithanmo before you make the change, I'd like to get confirmation from one more of my team member to confirm that we should remove the bus ID completely.
I do not think the bus_id is completely useless. It works on specific setups. if the GPUSelector needs to be modified, it would be adding the uniqueID as a new variant so that, We can use either the busID or uniqueID as a fallback case. Another thing is that for the AMD case we tested it on a machine with only one AMD card. Does someone have a machine with more than one?? |
It's not useless, but it seems to me that it's not needed. I'd prefer removing features that are not needed. |
we had some issues with gpus on macs before, please make sure it works there as well |
@neithanmo thoughts on the feedbacks? |
The original author can no longer consulted, but I can confirm that the reason BusID was originally introduced was to act as a (hopefully) unique identifier per card. I don't have a strong opinion on whether it needs to be removed or preserved, but if the original goal can be accomplished without it, I don't object to removal. |
To be clear, I'm talking about the original use of BusID in rust-gpu-tools when I mention the original author and intent. I'm not trying to speak for @neithanmo — who is currently working on the GPU scheduler code which needs to enhance the code. |
I am currently working on the scheduler integration with Neptune and besides the unique id feature this PR adds, we need to "update" the GpuSelector by removing the BusId and adding the UniqueId variant. This GputSelector is used in Neptune. so I am going to update this PR with that. |
@neithanmo I've spent some time looking into the unique ID situation and also into some of the history what the bus ID is used for. Ideally we end up with those features:
Such an unique identifier would then replace our current use of bus ID and we'd call it "unique identifier" instead. In regards to what to use as an unique identifier I currently lean towards doing a best effort approach, where we try to get the "best possible" identifier and then fallback to less "good" ones. I propose using the OpenCL 3.0 CL_DEVICE_UUID_KHR extension to get the "best possible" identifier. I know OpenCL 3.0 is pretty new and might not be widely supported. Though I suspect that most people using our libraries are using the latest drivers and AFAICT at least the Nvidia and AMD ones have support for that extension. You can try it with the clinfo command line tool. If your card/driver supports it, you'd have a section saying As we already have the code in this PR to do unique identifiers by other methods, I propose using those as fallback. I think we'd always need a fallback for Mac anyway. What do others think about this approach? What I wrote is meant as a discussion starting point and not a "this is how we must do things". Folks involved in this PR have way more experience with this topic than I have. |
I am currently looking at Neptune in order to integrate the scheduler with it and that env variable is something that we need to modify, at least when neptune is used with the scheduler feature-flag enable. about the command-line tool I think it is also a must because this unique identifier is a dev-extension value that is not printed out by normal ocl-info command. and user needs to know this unique identifier for editing the scheduler config file.
ok. I have not checked this yet but it is important to keep in mind that this unique identifier must be persistent between processes and in-between machine reboots and shutdowns because of the fact that we have a configuration file where devices can be reserved for specific tasks if this changes the configs have to be updated and can be sort of annoying.
It sounds good to me. The current fallback case for mac devices is that the uid is set to None the same for the bus id. not sure if opencl 3 has better support for macOS cards. as a summary about this PR:
|
I can confirm that the original intended use of BusID included allowing users to explicitly choose devices and to discover appropriate identifiers using |
The description of
For now, just keep using the current master of rust-gpu-tools. There is no need to switch to |
The removal of |
Yes, I left them as a place holder for the later enum variant with the UUID |
More discussion is happening in Slack - implementation awaiting |
@neithanmo I had another look at the code as I'll need a similar thing for my work of integrating CUDA into Currently Given all that I think it would make sense to also unify the |
I think that you are describing the current GPUSelector enum ? or I did not understand what you suggested here., The GPUSelector contains the Uui and as a fallback the Index. Actually, it is used in neptune in case the passed Uuid is not valid.. enum UniqueId
Uuid(..),
Index(..),
} I am not sure then. It would look similar to the GPUSelector. and removing the selector would require many changes on upstream libraries like neptune and bellperson and even rust-fil-proofs. |
@neithanmo I tried to combine to many things at once. You are right, let's keep things simple (and not breaking to much at once). So yes, I think what I'm after is that the pub enum GPUSelector {
Uuid(…),
BusId(u32),
Index(usize),
} Better names should be used. What I mean with those is that Does that make sense? If not and you have already a clear picture of how you want to integrate the |
it is ok for me. of course, the BusId has to be renamed, not clear what would be a correct name for it. |
Would |
PciId makes sense. It works for me. |
Adds a unique_id identifier for NVIDIA and AMD devices Remove bus_id remove weird loop in test add uuid error variant apply suggestion during review remove two functions that are not longer required use GPUResult
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Could someone else from the team (perhaps @dignifiedquire or @porcuquine) have another look. I still feel like I'm missing something obvious that would make things even cleaner/simpler.
I updated this. |
Have the AMD-specific code paths been tested on AMD devices? |
No, I have not tested it on AMD. I am going to get this tested as soon as we get an AMD device. |
I talked to @vmx about adding serde to serialize/deserialize GPUSelector. the reason is that users may want to share GPUSelector over the network or store it somewhere. in our case it would be nice if we can use GPUSelector directly in the scheduler for device selection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
I close this PR in favour of #39. Thanks @neithanmo for the work on this PR and also the tests, which I've largely re-used. |
This adds a way to identify a device without the risk of collisions which might happen if the bus_id or device name is used for this purpose. The native cl_device_id does not work either because the returned ID is different between processes. The implementation uses HW attributes that are specific to each device.
We tested this on a machine with 2 NVIDIA 1080 Ti, both sharing the same bus_id, name, but with different unique_id that is built from the
CL_DEVICE_PCI_SLOT_ID_NV
attribute.In the AMD case, this ID is constructed from the amd_device_topology
(u8, u8, u8), device, bus, and function
.Important to note that this ID is also persistent, so does not change even if the machine is restarted.
I did not touch the GPUSelector, and other functions/structs that are part of the public API in order to minimize breaking code for users. although, I think we can add this to it.