running hugectr with multi nodes #13

Closed
BookChan opened this issue Jun 15, 2020 · 4 comments

Comments

@BookChan

Is there a complete tutorial on running HugeCTR with multiple nodes?

Here is what I have tried:

Following the example (https://github.com/NVIDIA/HugeCTR/tree/master/samples/dcn2nodes), here is what I did:
Built a multi-node capable image:

  • based on the Dockerfile in HugeCTR
  • installed hwloc 2.2.0
  • installed UCX 1.8.0
  • installed OpenMPI 4.0.3 with UCX support
  • installed mpi4py 3.0.3
  • built HugeCTR: cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 -DENABLE_MULTINODES=ON ..

Ran HugeCTR on two NVLink-capable physical machines, each with 8×V100 (32 GB).

  • The start command is:
    export SSH_PORT="xxx"
    export NP="2"
    export WORK_DIR="/data/dcn_data/"
    export HOSTS="ip1:1,ip2:1"
    export ARGS=" ./bin/huge_ctr --train ./data/dcn-dist.json "
    cd $WORK_DIR
    bash start_dist.sh

start_dist.sh:
set -x
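# --bind-to none: leave CPU binding to HugeCTR/UCX instead of letting Open MPI pin the process
# --mca plm_rsh_agent: launch remote processes through the ssh_resolver.sh wrapper below
# --mca btl_tcp_if_include ib0: keep Open MPI's own TCP traffic on the IB interface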

mpirun --bind-to none --allow-run-as-root \
    -np $NP -H ${HOSTS} \
    -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} -x LIBRARY_PATH=${LIBRARY_PATH} -x PATH=${PATH} \
    -wdir ${PWD} \
    --mca plm_rsh_agent "$PWD/ssh_resolver.sh" \
    --mca btl_tcp_if_include ib0 \
    $ARGS > logs.txt 2>&1 &

ssh_resolver.sh:

#!/bin/bash
HOSTNAME=$1
shift
ARGS=$*

ssh -p "$SSH_PORT" "$HOSTNAME" "$ARGS"
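
For reference, mpirun calls the plm_rsh_agent the same way it would call ssh, i.e. as <agent> <hostname> <remote command ...>, which is why the script takes the hostname as $1 and forwards the rest over SSH on a non-default port. A quick way to sanity-check it outside of mpirun, reusing the SSH_PORT placeholder from above:

export SSH_PORT="xxx"
./ssh_resolver.sh ip2 hostname    # should print ip2's hostname if passwordless SSH on that port works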

My questions are:

Is my mpirun command correct? Should I specify UCX explicitly in mpirun? How does HugeCTR use UCX and hwloc? And how can I use InfiniBand/RDMA to accelerate HugeCTR?

For example, a UCX-enabled mpirun command looks like:
mpirun -np 2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app

https://github.com/openucx/ucx

@BookChan

BookChan commented Jun 15, 2020

My Dockerfile looks like:

FROM   hugectrbase:0.0.1
#  build by: https://github.com/NVIDIA/HugeCTR/blob/master/Dockerfile


 
# install ssh  hwloc
RUN apt-get update && apt-get install -y --no-install-recommends net-tools sysstat wget libnuma-dev build-essential curl ca-certificates openssh-client openssh-server

# install hwloc2.2.0
RUN cd /tmp &&  wget  https://download.open-mpi.org/release/hwloc/v2.2/hwloc-2.2.0.tar.gz && \
tar -zxvf  hwloc-2.2.0.tar.gz && cd hwloc* && ./configure && make && make install && rm  -rf /tmp/*



ENV  UCX_DIR=/usr/local/ucx
ENV  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$UCX_DIR/lib
ENV PATH=$PATH:$UCX_DIR/bin
# install ucx
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v1.8.0/ucx-1.8.0.tar.gz && tar xzf ucx-1.8.0.tar.gz &&\
 cd ucx-1.8.0 && ./contrib/configure-release --prefix=${UCX_DIR} &&\
  make -j8 install  && rm -rf /tmp/*



# install openmpi  
# https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX
RUN   mkdir -p  /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.3.tar.gz --no-check-certificate && \
    tar zxf openmpi-4.0.3.tar.gz && \
    cd openmpi-4.0.3 && \
    ./configure --enable-orterun-prefix-by-default  --enable-mca-no-build=btl-uct   --with-cuda=/usr/local/cuda  --with-ucx=${UCX_DIR} && \
    make -j $(nproc) all &&\
    cd /tmp/openmpi/openmpi-4.0.3 && \
    make install && \
    ldconfig && \
    rm -rf /tmp/*

# install mpi4py
RUN apt-get install -y libopenmpi-dev  &&  python3 -m pip install  -i  https://pypi.tuna.tsinghua.edu.cn/simple  --no-cache-dir    mpi4py

# set ssh
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config 
    
# get hugectr code and build hugectr
RUN cd / && git clone https://github.com/NVIDIA/HugeCTR.git && cd HugeCTR && git submodule update --init --recursive \
 && mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 -DNCCL_A2A=ON -DENABLE_MULTINODES=ON .. \
 && make -j \
 && rm -rf /HugeCTR/.git

WORKDIR /HugeCTR
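
The image can then be built and tagged in the usual way; the tag name below is just a hypothetical example:

docker build -t hugectr-multinode:0.0.1 .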

@HWZealot

Hi, thanks for trying HugeCTR with multiple nodes.
HugeCTR mainly uses UCX to communicate between nodes: it directly calls the UCX (UCP) API to exchange data between GPUs on different nodes. OpenMPI is only used to exchange the information needed to set up the UCX runtime environment (i.e. UCP addresses, etc.), so it is not performance critical (as long as the nodes can communicate via OpenMPI, it's okay). HugeCTR uses a dedicated UCX context for the data communication, so it will not interfere with the built-in UCX context inside the OpenMPI runtime environment. In short, OpenMPI is used to create the processes on the different nodes (currently, HugeCTR requires 1 process per node) and to exchange the metadata needed to set up the UCX connections; UCX takes care of the data transfer.
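
As a quick sanity check of that process-creation step (just a sketch reusing the host list from your start command; it doesn't exercise UCX at all), you can confirm that mpirun can launch one process per node before running HugeCTR:

mpirun --allow-run-as-root -np 2 -H ip1:1,ip2:1 hostname
# should print both hostnames; if it does, the OpenMPI/metadata side of your setup is fine
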
HugeCTR uses HWLOC to detect the number of CPU packages (i.e. CPU sockets) in your node. It then creates one UCX context per CPU socket (HWLOC is used to bind each UCX context to its target CPU socket). When communicating, each GPU sends and receives data via its affine UCX context (i.e. the UCX context on the same CPU socket as that GPU). If your UCX is built with CUDA support, it should automatically select the fastest network for communication from the GPU (i.e. it will automatically prefer an InfiniBand/RDMA network over TCP if there are RDMA-capable network adapters on the same CPU socket as your GPUs). In short, you don't have to specify anything about locality or networking for "mpirun"; HugeCTR already binds affinity for you. As long as your RDMA network is properly installed and UCX is built with CUDA support, UCX will use GPUDirect RDMA automatically (the RDMA adapter should be on the same CPU socket as the GPU to enable GPUDirect RDMA). Of course, you can still specify "-x UCX_NET_DEVICES=mlx5_0:1" to manually restrict the network UCX uses if you think UCX is not selecting the correct/optimal one.
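
If you want to confirm that your UCX build has CUDA support and see which transports/devices it detected, the ucx_info tool that ships with UCX is an easy check (just a sketch; the grep pattern is only an example):

ucx_info -v                              # prints the UCX version and the configure flags (look for --with-cuda)
ucx_info -d | grep -i -E "cuda|mlx|rc"   # lists the detected transports/devices (CUDA, InfiniBand, etc.)
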
Because multi-node environments can differ a lot between users (e.g. some clusters use Slurm), HugeCTR doesn't provide a standard Docker image for multi-node use. Likewise, the mpirun command line may differ across cluster environments. In general, you only need to give mpirun the information required to reach the specified hosts and create processes on them (i.e. -H, -np, etc.); locality- and network-related options are not needed. I think your command line is okay.

@HWZealot

Update: In the v2.2 release, the default communication library used by HugeCTR is NCCL, so UCX and HWLOC are not used by default. NCCL also automatically detects the topology and selects the optimal approach for communication between GPUs in both intra-node and inter-node environments. So, again, you don't need to specify anything about locality or networking for "mpirun".
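
If you want to verify which path NCCL actually selects between your nodes (e.g. whether it is using InfiniBand/GPUDirect), a common approach is to pass NCCL's debug variables through mpirun; for example, a sketch based on your existing command (NCCL_IB_HCA is optional and only restricts which HCA NCCL may use, similar in spirit to UCX_NET_DEVICES):

mpirun -np $NP -H ${HOSTS} -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_0 $ARGS
# NCCL_DEBUG=INFO logs the selected transports (NET/IB, NET/Socket, NVLink, ...) at startup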

@a550461053

> Update: In the v2.2 release, the default communication library used by HugeCTR is NCCL, so UCX and HWLOC are not used by default. NCCL also automatically detects the topology and selects the optimal approach for communication between GPUs in both intra-node and inter-node environments. So, again, you don't need to specify anything about locality or networking for "mpirun".

Hi, I want to know how NCCL's performance compares to UCX's or Gossip's, in both intra-node and inter-node environments. Is there any relevant test data? :)
