While reading the source code, I found that the data collector is supposed to work as this comment describes:
/**************************************
Each node will have one DataCollector.
Each iteration, one of the data collector will
send it's CSR buffers to remote node.
************************************/
However, I cannot find the code that actually does this. Could somebody explain? Thanks~
Thank you for your question. This is a bug in the documentation. In v2.1 the data reader no longer sends buffers to remote nodes. Each node reads the data independently, but copies only the corresponding data to the GPUs in that same node.
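To make the v2.1 behaviour concrete, here is a minimal sketch of "every node reads independently, but keeps only the data for its own GPUs". It is not HugeCTR code: `local_gpu_ids` and `slices_for_node` are hypothetical helpers, and the sketch assumes that every node hosts the same number of GPUs, that global GPU ids are contiguous per node, and that a batch is split evenly across all GPUs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a HugeCTR API): global ids of the GPUs hosted by
// `node_id`, assuming the same GPU count on every node and contiguous ids.
std::vector<int> local_gpu_ids(int node_id, int gpus_per_node) {
  std::vector<int> ids;
  for (int i = 0; i < gpus_per_node; ++i) {
    ids.push_back(node_id * gpus_per_node + i);
  }
  return ids;
}

// Hypothetical helper: each node calls this on the full batch it has read
// itself. It returns one slice per local GPU; slices that belong to GPUs on
// other nodes are simply never copied, so nothing is sent between nodes.
// Assumes the batch size is divisible by the total number of GPUs.
std::vector<std::vector<int>> slices_for_node(const std::vector<int>& global_batch,
                                              int node_id, int num_nodes,
                                              int gpus_per_node) {
  const std::size_t total_gpus =
      static_cast<std::size_t>(num_nodes) * static_cast<std::size_t>(gpus_per_node);
  const std::size_t per_gpu = global_batch.size() / total_gpus;

  std::vector<std::vector<int>> slices;
  for (int gpu : local_gpu_ids(node_id, gpus_per_node)) {
    const auto offset =
        static_cast<std::ptrdiff_t>(static_cast<std::size_t>(gpu) * per_gpu);
    const auto first = global_batch.begin() + offset;
    slices.emplace_back(first, first + static_cast<std::ptrdiff_t>(per_gpu));
  }
  return slices;
}
```

With two nodes of two GPUs each and a batch of eight samples, node 1 would keep only samples 4 to 7, two per local GPU. In the real reader the slices would then be copied to device memory; that step is left out here.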
Thank you for the quick response! This question comes from our work with multiple nodes. We have run some multi-node training experiments, but we cannot match the single-node speed, and training with NCCL is very slow. Could you give us some configuration guidance for multi-node training? Thanks a lot!
@xd-kevin Generally, inter-node communication is a major bottleneck in multi-node training. Could you provide some information on the configuration of your machines? For example, do you have NVLink within a single node, and what kind of network connects the nodes?
Hi, thank you for replying. We have NVLink within a single node, and InfiniBand connects all the nodes.