I have one doubt after reading your paper: how do you achieve concurrent copy and computation using only the streams wrapped by torch?
According to NVIDIA's description in https://developer.download.nvidia.cn/CUDA/training/StreamsAndConcurrencyWebinar.pdf, kernels on the default stream cannot run simultaneously with kernels on other streams. If developers want to overlap communication and computation with two or more streams, they have to explicitly create non-blocking streams for this, and the default stream cannot be one of them.
Besides, due to the Python GIL, it is also hard, if not impossible, to launch kernels into several non-blocking streams simultaneously from Python.
Or do you introduce other techniques to address this issue?
My doubts are as follows:
Figure 5 in your paper shows kernels running on both the default stream and non-blocking streams; is that correct?
Figure 7 shows the improvement you achieved, based on profiling the kernel timeline with the NVIDIA Nsight tool. But if we zoom into the timeline, do the kernels actually overlap, or do they still execute sequentially overall despite using more streams?
Is it possible that the improvement comes from the reduced idle gaps between kernels, rather than from overlapped communication and computation?
Thanks for your time and answer.
Indeed, copy kernels can overlap with execution kernels (on reasonably new GPUs). We used non-default streams only for copy kernels, since execution kernels do not need to run simultaneously.
We remark that the kernel launches on the CPU, which are subject to the GIL, need not happen simultaneously, as long as they are issued ahead of the actual execution on the device (the GPU).
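As a minimal sketch of this point (not torchgpipe's actual code; the tensors and sizes are made up for illustration), kernel launches from a single Python thread return as soon as the work is queued, so the GIL only serializes the launches, not the device-side execution:

```python
import torch

# A minimal sketch (not torchgpipe's code): both matmuls are launched one after
# another from a single Python thread, under the GIL, but each launch returns
# as soon as the kernel is queued on the default stream. The GPU can therefore
# still be executing the first kernel while the CPU enqueues the second one.
a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')

c = a @ b                  # enqueued, returns immediately
d = c @ b                  # enqueued right behind it
torch.cuda.synchronize()   # only here does the CPU actually wait for the GPU
```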
We have observed the overlaps between copy kernels and execution kernels. For example, the first GPU in GPipe copies the result of the first micro-batch while simultaneously starting the computation of the second micro-batch.
However, as you said, execution kernels cannot overlap each other, and copy kernels cannot overlap each other if they share the destination device.
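For reference, here is a minimal sketch of the pattern described above (compute on the default stream, copies on a dedicated non-default stream). It assumes two GPUs (`cuda:0` and `cuda:1`); the tensor names and shapes are illustrative and not torchgpipe's actual code:

```python
import torch

copy_stream = torch.cuda.Stream()  # non-default stream reserved for copies

x = torch.randn(1024, 1024, device='cuda:0')
w = torch.randn(1024, 1024, device='cuda:0')

# First micro-batch: computed on the default stream of cuda:0.
out0 = x @ w

# Copy the first micro-batch's result to the next GPU on the side stream.
# wait_stream makes the copy wait until out0 has actually been produced.
copy_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(copy_stream):
    out0_next = out0.to('cuda:1', non_blocking=True)
    out0.record_stream(copy_stream)  # keep out0's memory alive for the copy

# Second micro-batch: issued on the default stream and, on GPUs with a
# separate copy engine, it can execute while the copy above is in flight.
out1 = x @ w
```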
Thanks for sharing this project and paper.