I have one doubt after reading your paper: how do you achieve concurrent copy and computation using only the streams wrapped by torch?
According to NVIDIA's description in https://developer.download.nvidia.cn/CUDA/training/StreamsAndConcurrencyWebinar.pdf, kernels on the default stream cannot run simultaneously with kernels on other streams. If developers want to overlap communication and computation with two or more streams, they have to explicitly create non-blocking streams for this, and the default stream cannot be one of them.
Besides, due to the Python GIL, it is also hard, if not impossible, to launch kernels into several non-blocking streams simultaneously from Python.
Or do you introduce other techniques to address this issue?
My doubts are as follows:
Figure 5 in your paper shows kernels running on both the default stream and non-blocking streams; is that correct?
Figure 7 shows the improvement you achieved, based on profiling the kernel timeline with the NVIDIA Nsight tool. But if we zoom into the timeline, do the kernels actually overlap, or do they still execute sequentially overall despite using more streams?
Is it possible that the improvement comes from the reduced idle gaps between kernels, rather than from overlapped communication and computation?
Thanks for your time and answer.
Indeed, copy kernels can overlap with execution kernels (on reasonably new GPUs). We used non-default streams only for copy kernels, since execution kernels do not need to run simultaneously.
We remark that the kernel launches on the CPU, which are subject to the GIL, need not happen simultaneously, as long as they are issued ahead of the actual execution on the device (the GPU).
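As a minimal sketch of this point (not torchgpipe's actual code; the tensors and sizes are made up for illustration), kernel launches from a single Python thread return as soon as the work is queued, so the GIL only serializes the launches, not the device-side execution:

```python
import torch

# A minimal sketch (not torchgpipe's code): both matmuls are launched one after
# another from a single Python thread, under the GIL, but each launch returns
# as soon as the kernel is queued on the default stream. The GPU can therefore
# still be executing the first kernel while the CPU enqueues the second one.
a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')

c = a @ b                  # enqueued, returns immediately
d = c @ b                  # enqueued right behind it
torch.cuda.synchronize()   # only here does the CPU actually wait for the GPU
```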
We have observed the overlaps between copy kernels and execution kernels. For example, the first GPU in GPipe copies the result of the first micro-batch while simultaneously starting the computation of the second micro-batch.
However, as you said, execution kernels cannot overlap each other, and copy kernels cannot overlap each other if they share the destination device.
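For reference, here is a minimal sketch of the pattern described above (compute on the default stream, copies on a dedicated non-default stream). It assumes two GPUs (`cuda:0` and `cuda:1`); the tensor names and shapes are illustrative and not torchgpipe's actual code:

```python
import torch

copy_stream = torch.cuda.Stream()  # non-default stream reserved for copies

x = torch.randn(1024, 1024, device='cuda:0')
w = torch.randn(1024, 1024, device='cuda:0')

# First micro-batch: computed on the default stream of cuda:0.
out0 = x @ w

# Copy the first micro-batch's result to the next GPU on the side stream.
# wait_stream makes the copy wait until out0 has actually been produced.
copy_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(copy_stream):
    out0_next = out0.to('cuda:1', non_blocking=True)
    out0.record_stream(copy_stream)  # keep out0's memory alive for the copy

# Second micro-batch: issued on the default stream and, on GPUs with a
# separate copy engine, it can execute while the copy above is in flight.
out1 = x @ w
```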
Thanks for sharing this project and paper.