running hugectr with multi nodes #13

Closed
BookChan opened this issue Jun 15, 2020 · 4 comments

Comments

@BookChan

Is there a complete tutorial on running HugeCTR with multiple nodes?

Here is what I have tried:

Following the example (https://github.com/NVIDIA/HugeCTR/tree/master/samples/dcn2nodes), here is what I did:
Built a multi-node capable image:

  • based on the Dockerfile in HugeCTR
  • installed hwloc 2.2.0
  • installed UCX 1.8.0
  • installed OpenMPI 4.0.3 with UCX support
  • installed mpi4py 3.0.3
  • built HugeCTR: cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 -DENABLE_MULTINODES=ON ..

Ran HugeCTR on two NVLink-capable physical machines, each with 8×V100 (32 GB).

  • The start command is:
    export SSH_PORT="xxx"
    export NP="2"
    export WORK_DIR="/data/dcn_data/"
    export HOSTS="ip1:1,ip2:1"
    export ARGS=" ./bin/huge_ctr --train ./data/dcn-dist.json "
    cd $WORK_DIR
    bash start_dist.sh

start_dist.sh:
set -x
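# --bind-to none: leave CPU binding to HugeCTR/UCX instead of letting Open MPI pin the process
# --mca plm_rsh_agent: launch remote processes through the ssh_resolver.sh wrapper below
# --mca btl_tcp_if_include ib0: keep Open MPI's own TCP traffic on the IB interface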

mpirun --bind-to none --allow-run-as-root \
    -np $NP -H ${HOSTS} \
    -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} -x LIBRARY_PATH=${LIBRARY_PATH} -x PATH=${PATH} \
    -wdir ${PWD} \
    --mca plm_rsh_agent "$PWD/ssh_resolver.sh" \
    --mca btl_tcp_if_include ib0 \
    $ARGS > logs.txt 2>&1 &

ssh_resolver.sh:

#!/bin/bash
HOSTNAME=$1
shift
ARGS=$*

ssh -p "$SSH_PORT" "$HOSTNAME" "$ARGS"
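
For reference, mpirun calls the plm_rsh_agent the same way it would call ssh, i.e. as <agent> <hostname> <remote command ...>, which is why the script takes the hostname as $1 and forwards the rest over SSH on a non-default port. A quick way to sanity-check it outside of mpirun, reusing the SSH_PORT placeholder from above:

export SSH_PORT="xxx"
./ssh_resolver.sh ip2 hostname    # should print ip2's hostname if passwordless SSH on that port works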

My questions are:

Is my mpirun command correct? Should I specify UCX explicitly in mpirun? How does HugeCTR use UCX and hwloc? And how can I use InfiniBand/RDMA to accelerate HugeCTR?

For example, a UCX-enabled mpirun command looks like:
mpirun -np 2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app

https://github.com/openucx/ucx

@BookChan

BookChan commented Jun 15, 2020

My Dockerfile looks like:

FROM   hugectrbase:0.0.1
#  build by: https://github.com/NVIDIA/HugeCTR/blob/master/Dockerfile


 
# install ssh  hwloc
RUN apt-get update && apt-get install -y --no-install-recommends net-tools sysstat wget libnuma-dev build-essential curl ca-certificates openssh-client openssh-server

# install hwloc2.2.0
RUN cd /tmp &&  wget  https://download.open-mpi.org/release/hwloc/v2.2/hwloc-2.2.0.tar.gz && \
tar -zxvf  hwloc-2.2.0.tar.gz && cd hwloc* && ./configure && make && make install && rm  -rf /tmp/*



ENV  UCX_DIR=/usr/local/ucx
ENV  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$UCX_DIR/lib
ENV PATH=$PATH:$UCX_DIR/bin
# install ucx
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/v1.8.0/ucx-1.8.0.tar.gz && tar xzf ucx-1.8.0.tar.gz &&\
 cd ucx-1.8.0 && ./contrib/configure-release --prefix=${UCX_DIR} &&\
  make -j8 install  && rm -rf /tmp/*



# install openmpi  
# https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX
RUN   mkdir -p  /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.3.tar.gz --no-check-certificate && \
    tar zxf openmpi-4.0.3.tar.gz && \
    cd openmpi-4.0.3 && \
    ./configure --enable-orterun-prefix-by-default  --enable-mca-no-build=btl-uct   --with-cuda=/usr/local/cuda  --with-ucx=${UCX_DIR} && \
    make -j $(nproc) all &&\
    cd /tmp/openmpi/openmpi-4.0.3 && \
    make install && \
    ldconfig && \
    rm -rf /tmp/*

# install mpi4py
RUN apt-get install -y libopenmpi-dev  &&  python3 -m pip install  -i  https://pypi.tuna.tsinghua.edu.cn/simple  --no-cache-dir    mpi4py

# set ssh
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config 
    
# get hugectr code and build hugectr
RUN cd / && git clone https://github.com/NVIDIA/HugeCTR.git && cd HugeCTR && git submodule update --init --recursive \
 && mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 -DNCCL_A2A=ON -DENABLE_MULTINODES=ON .. \
 && make -j \
 && rm -rf /HugeCTR/.git

WORKDIR /HugeCTR
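
The image can then be built and tagged in the usual way; the tag name below is just a hypothetical example:

docker build -t hugectr-multinode:0.0.1 .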

@HWZealot

Hi, thanks for trying HugeCTR with multiple nodes.
HugeCTR mainly uses UCX to communicate between nodes: it directly calls the UCX (UCP) API to exchange data between GPUs on different nodes. OpenMPI is only used to exchange the information needed to set up the UCX runtime environment (i.e. UCP addresses, etc.), so it is not performance critical (as long as the nodes can communicate via OpenMPI, it's okay). HugeCTR uses a dedicated UCX context for the data communication, so it will not interfere with the built-in UCX context inside the OpenMPI runtime environment. In short, OpenMPI is used to create the processes on the different nodes (currently, HugeCTR requires 1 process per node) and to exchange the metadata needed to set up the UCX connections; UCX takes care of the data transfer.
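
As a quick sanity check of that process-creation step (just a sketch reusing the host list from your start command; it doesn't exercise UCX at all), you can confirm that mpirun can launch one process per node before running HugeCTR:

mpirun --allow-run-as-root -np 2 -H ip1:1,ip2:1 hostname
# should print both hostnames; if it does, the OpenMPI/metadata side of your setup is fine
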
HugeCTR uses HWLOC to detect the number of CPU packages (i.e. CPU sockets) in your node. It then creates one UCX context per CPU socket (HWLOC is used to bind each UCX context to its target CPU socket). When communicating, each GPU sends and receives data via its affine UCX context (i.e. the UCX context on the same CPU socket as that GPU). If your UCX is built with CUDA support, it should automatically select the fastest network for communication from the GPU (i.e. it will automatically prefer an InfiniBand/RDMA network over TCP if there are RDMA-capable network adapters on the same CPU socket as your GPUs). In short, you don't have to specify anything about locality or networking for "mpirun"; HugeCTR already binds affinity for you. As long as your RDMA network is properly installed and UCX is built with CUDA support, UCX will use GPUDirect RDMA automatically (the RDMA adapter should be on the same CPU socket as the GPU to enable GPUDirect RDMA). Of course, you can still specify "-x UCX_NET_DEVICES=mlx5_0:1" to manually restrict the network UCX uses if you think UCX is not selecting the correct/optimal one.
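
If you want to confirm that your UCX build has CUDA support and see which transports/devices it detected, the ucx_info tool that ships with UCX is an easy check (just a sketch; the grep pattern is only an example):

ucx_info -v                              # prints the UCX version and the configure flags (look for --with-cuda)
ucx_info -d | grep -i -E "cuda|mlx|rc"   # lists the detected transports/devices (CUDA, InfiniBand, etc.)
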
Because multi-node environments can differ a lot between users (e.g. some clusters use Slurm), HugeCTR doesn't provide a standard Docker image for multi-node use. Likewise, the mpirun command line may differ across cluster environments. In general, you only need to give mpirun the information required to reach the specified hosts and create processes on them (i.e. -H, -np, etc.); locality- and network-related options are not needed. I think your command line is okay.

@HWZealot

Update: In the v2.2 release, the default communication library used by HugeCTR is NCCL, so UCX and HWLOC are not used by default. NCCL also automatically detects the topology and selects the optimal approach for communication between GPUs in both intra-node and inter-node environments. So, again, you don't need to specify anything about locality or networking for "mpirun".
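
If you want to verify which path NCCL actually selects between your nodes (e.g. whether it is using InfiniBand/GPUDirect), a common approach is to pass NCCL's debug variables through mpirun; for example, a sketch based on your existing command (NCCL_IB_HCA is optional and only restricts which HCA NCCL may use, similar in spirit to UCX_NET_DEVICES):

mpirun -np $NP -H ${HOSTS} -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_0 $ARGS
# NCCL_DEBUG=INFO logs the selected transports (NET/IB, NET/Socket, NVLink, ...) at startup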

@a550461053

> Update: In the v2.2 release, the default communication library used by HugeCTR is NCCL, so UCX and HWLOC are not used by default. NCCL also automatically detects the topology and selects the optimal approach for communication between GPUs in both intra-node and inter-node environments. So, again, you don't need to specify anything about locality or networking for "mpirun".

Hi, I want to know how NCCL's performance compares to UCX's or Gossip's, in both intra-node and inter-node environments. Is there any relevant test data? :)
