running hugectr with multi nodes #13
Comments
My dockerfile looks like:
Hi, thanks for trying HugeCTR with multiple nodes.
Update: As of the v2.2 release, HugeCTR uses NCCL as its default communication library, so UCX and HWLOC are not used by default. NCCL also detects the topology automatically and selects the best way to communicate between GPUs, both within a node and across nodes. So, again, you don't need to pass any locality or network options to "mpirun".
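To check which transport NCCL actually selects, one option is to forward NCCL's debug setting to every rank through mpirun. The following is a minimal sketch, not an official HugeCTR recipe: `build_nccl_launch` is a hypothetical helper, the host list and binary path are taken from the issue, and the function only prints the command so it can be inspected before running.

```shell
#!/bin/bash
# Hypothetical helper: prints (does not run) an mpirun command line that
# forwards NCCL's debug setting to every rank.
build_nccl_launch() {
    local np="$1" hosts="$2"
    shift 2
    # NCCL_DEBUG=INFO makes NCCL log the transports it selects at init time.
    echo mpirun --allow-run-as-root -np "$np" -H "$hosts" \
        -x NCCL_DEBUG=INFO \
        -x LD_LIBRARY_PATH -x PATH \
        "$@"
}

build_nccl_launch 2 "ip1:1,ip2:1" ./bin/huge_ctr --train ./data/dcn-dist.json
```

With `NCCL_DEBUG=INFO` set, the NCCL initialization lines in the log should indicate whether InfiniBand or plain sockets are being used between nodes. If a specific interface has to be forced, `NCCL_SOCKET_IFNAME` can be exported the same way, though whether that is needed depends on the cluster.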
Hi, how does NCCL's performance compare with UCX's or Gossip's in both intra-node and inter-node environments? Is there any relevant test data? :)
Is there a complete tutorial on running HugeCTR with multiple nodes?
I have tried this:
Following the example (https://github.com/NVIDIA/HugeCTR/tree/master/samples/dcn2nodes), what I have done is:
Build an image with multi-node support:
Run HugeCTR on two physical machines, each with 8 NVLink-connected V100 (32 GB) GPUs.
export SSH_PORT="xxx"
export NP="2"
export WORK_DIR="/data/dcn_data/"
export HOSTS="ip1:1,ip2:1"
export ARGS=" ./bin/huge_ctr --train ./data/dcn-dist.json "
cd $WORK_DIR
bash start_dist.sh
start_dist.sh:
set -x
mpirun --bind-to none --allow-run-as-root -np $NP -H ${HOSTS} -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} -x LIBRARY_PATH=${LIBRARY_PATH} -x PATH=${PATH} -wdir ${PWD} --mca plm_rsh_agent "$PWD/ssh_resolver.sh" --mca btl_tcp_if_include ib0 $ARGS > logs.txt 2>&1 &
ssh_resolver.sh:
#!/bin/bash
HOSTNAME=$1
shift
ARGS=$*
ssh -p "$SSH_PORT" "$HOSTNAME" "$ARGS"
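Before launching, it may help to confirm that every host in `HOSTS` is reachable over the custom SSH port, since a failed remote launch is otherwise hard to diagnose from `logs.txt`. A minimal sketch under the assumption that `HOSTS` uses the `host:slots` format shown above; `hosts_from_spec` and `check_hosts` are hypothetical helpers, not part of HugeCTR or Open MPI:

```shell
#!/bin/bash
# Hypothetical helper: turn "ip1:1,ip2:1" into one hostname per line by
# splitting on commas and dropping the ":slots" suffix.
# Note: this simple split does not handle IPv6 literals.
hosts_from_spec() {
    local spec="$1" entry
    local IFS=','
    for entry in $spec; do
        echo "${entry%%:*}"
    done
}

# Hypothetical helper: run a no-op command on every host over SSH_PORT
# and report which hosts answered.
check_hosts() {
    local host
    for host in $(hosts_from_spec "$1"); do
        if ssh -p "$SSH_PORT" -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok: $host"
        else
            echo "unreachable: $host"
        fi
    done
}
```

Calling `check_hosts "$HOSTS"` before `bash start_dist.sh` would then list each host as `ok` or `unreachable`. `BatchMode=yes` makes ssh fail instead of prompting for a password, which matches what mpirun needs for a non-interactive launch.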
My questions are:
Is my mpirun command correct? Should I specify UCX in mpirun? How does HugeCTR use UCX and HWLOC? And how can I use InfiniBand/RDMA to accelerate HugeCTR?
For example, a UCX command looks like:
mpirun -np 2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app
https://github.com/openucx/ucx