[Issue]: Compatibility issue with rocm 6.0.2 #10

Open
pelahi opened this issue Mar 1, 2024 · 0 comments
pelahi commented Mar 1, 2024

Problem Description

There are two issues encountered when using ROCm 6.0.2.

  1. The first one might be related to building a ROCm container on a machine that lacks an AMD GPU. ROCm was installed with amdgpu-install -y --usecase=hiplibsdk,rocm,hip,opencl. In earlier versions this resulted in __HIP_PLATFORM_AMD__ being defined, but it is no longer defined. As a result, configure fails with:
checking for hip/hip_runtime.h... no
configure: error: unable to find required headers

This message is uninformative; a closer look at config.log shows:

configure:4638: checking for hip/hip_runtime.h
configure:4638: gcc-12 -c -I/opt/rocm/include -I/opt/rocm/include -I/usr/include  -I/usr/include  conftest.c >&5
In file included from conftest.c:60:
/opt/rocm/include/hip/hip_runtime.h:66:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
   66 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
      |  ^~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70:
/opt/rocm/include/hip/hip_runtime_api.h:8575:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
 8575 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
      |  ^~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:71:
/opt/rocm/include/hip/library_types.h:75:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
   75 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
      |  ^~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:73:
/opt/rocm/include/hip/hip_vector_types.h:38:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
   38 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
      |  ^~~~~

Fixing this is just a matter of defining the macro on the compiler command line (for example, -D__HIP_PLATFORM_AMD__), but previous versions did not require doing so explicitly.
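One way to pass the define is sketched below. It assumes the project's configure script forwards user-supplied CPPFLAGS to its header checks, which is the usual autoconf behavior but has not been verified for aws-ofi-rccl. The helper name `hip_cppflags` is hypothetical and only used for illustration.

```shell
# Hypothetical helper: print the given preprocessor flags (may be empty)
# with the HIP platform define appended.
hip_cppflags() {
    # ${1:+$1 } expands to "<flags> " only when flags were supplied,
    # so no stray leading space is produced for an empty argument.
    printf '%s\n' "${1:+$1 }-D__HIP_PLATFORM_AMD__"
}

# Intended use inside the Docker RUN step (not executed here):
#   ./configure ${RCCL_CONFIGURE_OPTIONS} CPPFLAGS="$(hip_cppflags "${CPPFLAGS}")"
```

If configure ignores CPPFLAGS for the HIP header check, the same define could be tried through CFLAGS instead; that is also an assumption to check against config.log.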

  2. The second issue is a compilation error. Because of changes made to hipPointerAttribute_t, the code no longer compiles and gives the following message:

make[2]: Entering directory '/tmp/aws-ofi-rccl/src'
  CC       nccl_ofi_net.lo
nccl_ofi_net.c: In function 'get_cuda_device':
nccl_ofi_net.c:497:17: error: 'struct hipPointerAttribute_t' has no member named 'memoryType'
  497 |         if (attr.memoryType == hipMemoryTypeDevice) {
      |                 ^
make[2]: *** [Makefile:435: nccl_ofi_net.lo] Error 1

The fix is to update this line to use attr.type instead of attr.memoryType.
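Until the source is updated, the rename could be applied as a patch step in the container build. The following is a minimal sketch; the helper name `patch_hip_pointer_attr` is hypothetical, and it assumes the only affected accesses are spelled `attr.memoryType`, as in the error above.

```shell
# Hypothetical helper: rewrite attr.memoryType to attr.type in one source file.
# Uses a temporary file rather than "sed -i" so it does not depend on GNU sed.
patch_hip_pointer_attr() {
    src_file="$1"
    sed 's/attr\.memoryType/attr.type/g' "${src_file}" > "${src_file}.patched" \
        && mv "${src_file}.patched" "${src_file}"
}

# Intended use after cloning, before ./configure (not executed here):
#   patch_hip_pointer_attr src/nccl_ofi_net.c
```

Note that the patched source would then fail to compile against older ROCm releases that still use memoryType, so this only suits a recipe pinned to ROCm 6.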

Operating System

Ubuntu 22.04 LTS

CPU

AMD EPYC-Rome with no GPU

GPU

AMD Instinct MI250X

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

Here is the section of the Docker recipe showing the instructions that I am running.

ARG ROCM_VERSION=6.0.2
RUN echo "Building rocm ${ROCM_VERSION}" \
    && rocm_major=$(echo ${ROCM_VERSION} | sed "s/\./ /g" | awk '{print $1}') \
    && rocm_minor=$(echo ${ROCM_VERSION} | sed "s/\./ /g" | awk '{print $2}') \
    && ROCM_INSTALLER_VERSION=$(echo ${ROCM_VERSION} | sed "s/\./0/g") \
    # if rocm version does not list minor patch version number add 00 to end of installer version
    && if [ $(echo ${ROCM_VERSION} | sed "s/\./\n/g" | wc -l) -eq "2" ]; then ROCM_INSTALLER_VERSION=${ROCM_INSTALLER_VERSION}"00"; fi \
    && ROCM_INSTALLER_VERSION=${ROCM_INSTALLER_VERSION}"-1" \
    && ROCM_INSTALLER_VERSION=${rocm_major}.${rocm_minor}.${ROCM_INSTALLER_VERSION} \
	&& cd /tmp/build \
    # && wget https://bootstrap.pypa.io/get-pip.py \
    # && python3 get-pip.py \
    && roc_url="https://repo.radeon.com/amdgpu-install/"${ROCM_VERSION}"/ubuntu/jammy/amdgpu-install_"${ROCM_INSTALLER_VERSION}"_all.deb" \
    && echo ${roc_url} \
	&& wget ${roc_url} \
	&& apt -y install ./amdgpu-install_${ROCM_INSTALLER_VERSION}_all.deb \
	&& amdgpu-install -y --usecase=hiplibsdk,rocm,hip,opencl \
    && cd /tmp/build && rm -rf amdgpu-install_${ROCM_INSTALLER_VERSION}_all.deb \
    && echo "Done"

# Install aws-ofi-rccl
ARG RCCL_CONFIGURE_OPTIONS="--prefix=/usr --with-mpi=/usr --with-libfabric=/usr --with-hip=/opt/rocm --with-rccl=/opt/rocm CC=gcc-12 CXX=g++-12"
RUN echo "Build rccl" \
    && git clone https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl.git \
	&& cd aws-ofi-rccl \
	&& ./autogen.sh \
	&& ./configure ${RCCL_CONFIGURE_OPTIONS} \
	&& make -j 16 \
	&& make install \
        && cd /tmp \
	&& rm -rf /tmp/build \
	&& echo "Done"
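The installer version arithmetic in the first RUN step is easy to misread, so here is a standalone sketch of the same derivation. The function names `rocm_installer_version` and `rocm_installer_url` are hypothetical; the logic mirrors the recipe (dots replaced by zeros, "00" appended when there is no patch number, "-1" suffix), and the URL pattern is the one used in the recipe.

```shell
# Hypothetical helper: derive the amdgpu-install package version
# from a ROCm version string such as 6.0.2 or 6.0.
rocm_installer_version() {
    rocm_version="$1"
    rocm_major=$(echo "${rocm_version}" | awk -F. '{print $1}')
    rocm_minor=$(echo "${rocm_version}" | awk -F. '{print $2}')
    # Replace every dot with a zero: 6.0.2 -> 60002, 6.0 -> 600
    packed=$(echo "${rocm_version}" | sed 's/\./0/g')
    # A version without a patch number has two fields; pad it with "00"
    fields=$(echo "${rocm_version}" | awk -F. '{print NF}')
    if [ "${fields}" -eq 2 ]; then
        packed="${packed}00"
    fi
    echo "${rocm_major}.${rocm_minor}.${packed}-1"
}

# Hypothetical helper: build the download URL the recipe passes to wget.
rocm_installer_url() {
    echo "https://repo.radeon.com/amdgpu-install/$1/ubuntu/jammy/amdgpu-install_$(rocm_installer_version "$1")_all.deb"
}
```

For ROCM_VERSION=6.0.2 this yields the installer version 6.0.60002-1, which is the value the recipe above ends up using.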

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
