How does the data collector send its CSR buffers to a remote node? #15

Closed
xingt-tang opened this issue Jun 24, 2020 · 4 comments
@xingt-tang

While reading the source code, I found that the data collector is supposed to behave as this comment describes:

/**************************************
 * Each node will have one DataCollector.
 * Each iteration, one of the data collector will
 * send it's CSR buffers to remote node.
 ************************************/

However, I cannot find the specific code that does this. Could somebody explain? Thanks~
@zehuanw
Collaborator

zehuanw commented Jun 24, 2020

Thank you for your question. This is a bug in the documentation. As of v2.1, the data reader no longer sends buffers to remote nodes. Each node reads the data independently, but copies only the corresponding data to the GPUs in that same node.
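
The behavior described above can be sketched roughly as follows. This is only an illustration, not HugeCTR's actual implementation: the helper name `local_slices`, the even split of the batch across GPUs, and the node-by-node numbering of GPU ids are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def local_slices(batch: Sequence[int], num_nodes: int, gpus_per_node: int,
                 node_id: int) -> Dict[int, List[int]]:
    """Return {global_gpu_id: samples} for the GPUs hosted on `node_id`.

    Hypothetical helper: assumes the batch divides evenly across all GPUs
    and that global GPU ids are numbered node by node.
    """
    total_gpus = num_nodes * gpus_per_node
    if len(batch) % total_gpus != 0:
        raise ValueError("batch size must be divisible by the total GPU count")
    per_gpu = len(batch) // total_gpus

    slices = {}
    first_gpu = node_id * gpus_per_node
    for gpu in range(first_gpu, first_gpu + gpus_per_node):
        start = gpu * per_gpu
        # Only this node's share is kept; nothing is sent to other nodes.
        slices[gpu] = list(batch[start:start + per_gpu])
    return slices


# Every node reads the same 8-sample batch independently.
batch = list(range(8))
node0 = local_slices(batch, num_nodes=2, gpus_per_node=2, node_id=0)
node1 = local_slices(batch, num_nodes=2, gpus_per_node=2, node_id=1)
```

In this sketch the cost of sending buffers between nodes is traded for redundant reads: each node reads the whole batch, then discards the parts that belong to GPUs on other nodes.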

@zehuanw zehuanw added the fea::doc Improvements or additions to documentation label Jun 24, 2020
@zehuanw zehuanw self-assigned this Jun 24, 2020
@xingt-tang
Author

Thank you for the immediate response! This question comes from our experience with multi-node training. We have run some multi-node training experiments, but we could not match the single-node speed, and training with NCCL is very slow. Could you give us some configuration guidance for multi-node training? Thanks a lot!

@HWZealot

@xd-kevin Generally, inter-node communication is a major bottleneck in multi-node training. Could you provide some information about your machine configuration? For example, do you have NVLink within a single node, and what kind of network connects the nodes?

@xingt-tang
Author

> @xd-kevin Generally, inter-node communication will be a major bottleneck on multi-node training. Could you provide some information on configuration of your machine? i.e. Do you have NVLink in single-node and how is the network between nodes?

Hi, thank you for replying. We have NVLink within each node, and InfiniBand connects all the nodes.
