
Nice approach on DL dev scenario #12

Open

pokerfaceSad opened this issue Nov 20, 2023 · 4 comments

Comments

@pokerfaceSad
Contributor

I think nvshare is a nice approach for the DL development scenario!

Has there been any testing of the overhead introduced by UVM swapping in training scenarios?

BTW, I have posted a solution that addresses long GPU idle times in dev scenarios by dynamically mounting the GPU:
https://github.com/pokerfaceSad/GPUMounter

@grgalex
Owner

grgalex commented Nov 20, 2023

@pokerfaceSad Hi, thanks for the feedback!

For the overhead of UVM in and of itself (i.e., when an app runs alone on the system), you can take a look at chapter 11.3 of my diploma thesis [1].

The overhead of UVM swapping when the GPU lock changes hands (which happens every TQ seconds, assuming more than one app wants to run GPU work) depends on the PCIe bandwidth and the working set size of the application.

Simple Example

Let's assume a GPU has 32 GB/s of PCIe bandwidth and the application that just acquired the GPU lock uses 32 GB of data. The UVM swapping overhead is then around (2 * 32) / 32 = 2 seconds. We multiply the 32 GB of data by a factor of two to account for the swap-out traffic (data of the previous app) in addition to the swap-in traffic (data of the current app).
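The back-of-the-envelope estimate above can be sketched as a tiny helper (hypothetical, not part of nvshare):

```python
def uvm_swap_overhead_seconds(working_set_gb: float, pcie_bw_gb_per_s: float) -> float:
    """Estimate the UVM swap latency when the GPU lock changes hands.

    The factor of 2 accounts for swapping out the previous app's data
    in addition to swapping in the current app's data.
    """
    return 2 * working_set_gb / pcie_bw_gb_per_s

# The example above: 32 GB working set over a 32 GB/s PCIe link.
print(uvm_swap_overhead_seconds(32, 32))  # -> 2.0
```

Note this is an upper-bound sketch: if the previous app's working set is smaller than the current one's, the swap-out term shrinks accordingly.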

You can measure the actual PCIe bandwidth of a GPU by using the bandwidthTest CUDA sample [2].

[1] https://dspace.lib.ntua.gr/xmlui/handle/123456789/54290
[2] https://github.com/NVIDIA/cuda-samples/tree/master/Samples/1_Utilities/bandwidthTest

@pokerfaceSad
Contributor Author

Thanks for your detailed reply!

Any ideas about GPU migration? I see it in your Future Improvements.

It seems possible to achieve it with UVM, according to https://dl.acm.org/doi/10.1145/3357223.3362714.
Do you have any thoughts on this?

@grgalex
Owner

grgalex commented Nov 21, 2023

I haven't looked at migration thoroughly yet.

(Though a prerequisite for that is nvshare support for multiple GPUs per node, which is relatively simple and not implemented yet.)

Are you perhaps interested in taking a look?

If you want to talk about something in private, you can send me an e-mail :)

@pokerfaceSad
Contributor Author

Sorry for the late reply.

I have sent you an email :)
