Implement HIP/ROCm build rules and test packages #40627

Closed · fwyzard opened this issue Jan 26, 2023 · 31 comments

@fwyzard (Contributor) commented Jan 26, 2023

Here is a first attempt at defining the build rules for compiling HIP files for the ROCm backend.
I am confident that these are not final, but having something to play with will likely help me make progress towards more robust rules.

affected files

These rules apply to

  • *.hip.cc files under src, plugins, test, and bin directories
  • alpaka/*.dev.cc under the same directories, when building an Alpaka-based package for the ROCm backend

compilation

file.hip.cc should be compiled by hipcc using the variables defined in the rocm scram tool:

${HIPCC} ${ROCM_FLAGS} -c ${CPPFLAGS} ${CXXFLAGS} -MMD -MF tmp/${SCRAM_ARCH}/src/TARGET/file.hip.cc.d file.hip.cc -o tmp/${SCRAM_ARCH}/src/TARGET/file.hip.cc.o

where

  • ${ROCM_FLAGS} are the ROCm specific flags for the target; mostly they are defined by the rocm tool, but each target could add or remove flags as usual;
  • ${CPPFLAGS} are the preprocessor flags (-D..., -U..., -I..., etc.) for the target;
  • ${CXXFLAGS} are the C++ compiler flags for the target, with the flags listed in the ${ROCM_HOST_REM_CXXFLAGS} variable filtered out.

Note: since hipcc supports the same options as the corresponding clang compiler, some of the gcc-specific options should be removed; ${ROCM_HOST_REM_CXXFLAGS} should match the ${REM_CXXFLAGS} used by llvm-cxxcompiler. It is possible that the two have diverged since we defined the tool, and more (or fewer) options may need to be removed, for example those related to LTO builds.

Note 2: currently the rocm tool sets -fno-gpu-rdc, but we should change it to -fgpu-rdc.
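Since ROCM_FLAGS is one of the overridable flags, a package should be able to adjust it in its own BuildFile.xml through the usual flags mechanism; a minimal sketch, where the specific flag value is only an illustration, not a recommended setting:

<flags ROCM_FLAGS="--offload-arch=gfx90a"/>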

linking a device library

When building a library (not a plugin or binary), all the *.hip.cc.o files should be bundled into a static library, for use by other libraries and plugins that need to link to ROCm device code.
For example, one library could define a __device__ function, and a plugin could define a __global__ kernel that calls that function.

ar crs tmp/${SCRAM_ARCH}/src/TARGET/libTARGET_rocm.a tmp/${SCRAM_ARCH}/src/TARGET/*.hip.cc.o

This is similar to the libTARGET_nv.a static library we build for CUDA packages, but (for the moment) it will also contain the compiled host code. If it works, I'll look into keeping only the device code at a later time.

If a library, plugin or binary with any *.hip.cc files "uses" a library with *.hip.cc files, that library's libTARGET_rocm.a should be linked in (see next part).
I think this is what we are doing for the CUDA libTARGET_nv.a library as well?
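For context, the "uses" relationship is the one expressed in the package's BuildFile.xml as usual; a hedged sketch with placeholder package and plugin names (not real CMSSW packages):

<use name="ExampleSubsystem/ExampleDeviceLib"/>
<library file="*.cc *.hip.cc" name="ExampleROCmPlugin">
  <flags EDM_PLUGIN="1"/>
</library>

With something like this, the build rules would be expected to pick up libExampleSubsystemExampleDeviceLib_rocm.a automatically when linking the plugin, following the libTARGET_rocm.a naming above.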

linking a host library or binary

When building a library, plugin or binary that contains *.hip.cc files, we should do the linking with hipcc instead of g++:

${HIPCC} ${CXXFLAGS} ${ROCM_FLAGS} --hip-link ...

it should link any dependency's _rocm.a library:

... -L${CMSSW_BASE}/static/${SCRAM_ARCH} -lDEPENDENCY_rocm ...

and it should use the target's usual flags for building a shared library or binary.

Note: with respect to the CUDA rules, using hipcc --hip-link avoids the need for the extra linking step.
If we settle on this approach, we may be able to simplify the CUDA rules in a similar way.
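Putting these pieces together, a full link line for a shared library could look roughly like the sketch below; the object list, library paths and output path are illustrative rather than the exact scram layout:

${HIPCC} ${CXXFLAGS} ${ROCM_FLAGS} ${ROCM_LDFLAGS} --hip-link -shared tmp/${SCRAM_ARCH}/src/TARGET/*.o -L${CMSSW_BASE}/static/${SCRAM_ARCH} -lDEPENDENCY_rocm ${LDFLAGS} -o lib/${SCRAM_ARCH}/libTARGET.so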

@fwyzard (Contributor, Author) commented Jan 26, 2023

assign core

@fwyzard (Contributor, Author) commented Jan 26, 2023

assign heterogeneous

@cmsbuild (Contributor):

New categories assigned: heterogeneous,core

@fwyzard, @Dr15Jones, @smuzaffar, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks.

@cmsbuild (Contributor):

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard (Contributor, Author) commented Jan 26, 2023

@smuzaffar, could you try to set up these rules?
(I'm away tomorrow and fully busy on Monday and Tuesday morning, so it's not terribly urgent...)

@fwyzard (Contributor, Author) commented Jan 29, 2023

@smuzaffar, thank you for cms-sw/cmsdist#8268 and cms-sw/cmsdist#8271. I've experimented a bit with them and I have some follow-up:

  • when a library has any *.hip.cc files, we should make a static library libPackage_rocm.a from all the *.hip.cc.o object files; later we may find a more compact way, but for the moment this should be good enough;
  • when a package has any *.hip.cc files, we currently link it using g++ -shared; instead, we should use hipcc $(ROCM_FLAGS) $(ROCM_LDFLAGS) -shared ($(ROCM_LDFLAGS) is empty so far, but we may need to add some extra options there; I'm still experimenting);
  • the same holds for a test binary: if it has any *.hip.cc files, we should link it with hipcc;
  • when a package or binary "uses" a library Package that has a libPackage_rocm.a static library, we try to link it with -L... -lPackage_rocm.
    For some reason this does not work :-/
    Instead it works if we list the library as a component to link explicitly, i.e. as .../libPackage_rocm.a (see the sketch at the end of this comment).

If you could make these changes, I can experiment with them in a couple of days.
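For illustration, the two forms of that link step in a hedged sketch, keeping the elided parts as in the bullets above:

# with -L/-l, which for some reason does not work here:
${HIPCC} ... --hip-link -shared ... -L${CMSSW_BASE}/static/${SCRAM_ARCH} -lPackage_rocm ...
# with the static archive listed explicitly, which does work:
${HIPCC} ... --hip-link -shared ... ${CMSSW_BASE}/static/${SCRAM_ARCH}/libPackage_rocm.a ...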

@fwyzard (Contributor, Author) commented Jan 30, 2023

* when a package or binary "uses" a library Package that has a libPackage_rocm.a static library, we try to link it with -L... -lPackage_rocm.
For some reason this does not work :-/
Instead it works if we list the library as a component to link explicitly, i.e. as .../libPackage_rocm.a.

This is fixed in ROCm 5.2.x and later.

@fwyzard (Contributor, Author) commented Jan 30, 2023

To implement the build rules and test them, the following PRs should be merged in order:

Unrelated to these changes, there is a sister PR for CUDA:

Once these packages are working, we need at least a couple of follow ups:

A bug fix in the ROCm build rules:

We should also update CMSSW code for the latest version of Alpaka, and improve support in the common utilities:

fwyzard changed the title from "first attempt at HIP/ROCm build rules" to "Implement HIP/ROCm build rules and test packages" on Jan 31, 2023
@smuzaffar (Contributor):

@fwyzard, I have a few questions:

  • What will happen to .dev.cc files for the alpaka/rocm backend? If I understand correctly, for alpaka/rocm these files should be compiled with hipcc (just like we use nvcc for alpaka/cuda)... right?
  • The dev.cc.o object files should also be considered for extracting the rocm device code... right? I mean, the objcopy -j '__CLANG_OFFLOAD_BUNDLE*' step should also be run for dev.cc.o?
  • What about linking for alpaka/rocm? I guess we still use hipcc for linking... right?

@fwyzard (Contributor, Author) commented Feb 1, 2023

  • What will happen to .dev.cc files for the alpaka/rocm backend? If I understand correctly, for alpaka/rocm these files should be compiled with hipcc (just like we use nvcc for alpaka/cuda)... right?

yes: for the alpaka/rocm backend, *.dev.cc files should be treated the same as *.hip.cc files

  • The dev.cc.o object files should also be considered for extracting the rocm device code... right? I mean, the objcopy -j '__CLANG_OFFLOAD_BUNDLE*' step should also be run for dev.cc.o?

yes (see above)
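For concreteness, a hedged sketch of that extraction step as described in the question above; the output file name is only a placeholder:

objcopy -j '__CLANG_OFFLOAD_BUNDLE*' tmp/${SCRAM_ARCH}/src/TARGET/file.dev.cc.o tmp/${SCRAM_ARCH}/src/TARGET/file.dev.cc.device.o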

  • What about linking for alpaka/rocm? I guess we still use hipcc for linking... right?

Yes, for alpaka/rocm we should also link with hipcc.

On a related note, I think we should never have both *.cu and *.hip.cc files in the same package; while it could be made to work, we discussed this with @makortel and agreed that we prefer to keep the different backends separate, in different libraries and plugins.

@smuzaffar (Contributor):

@fwyzard, as we do not distribute rocm along with cmssw itself and only have symlinks pointing to /cvmfs/patatrack.cern.ch, I do not know what will happen on grid sites. Since we only have symlinks, rpm does not know about the libraries provided by the rocm package (that is why we have these explicit Provides statements). The last two IBs failed due to a missing library dependency; for now we avoid it by adding the extra Provides, but this is not the correct solution. The ROCm distribution is very big, i.e. 21 GB (/cvmfs/patatrack.cern.ch/externals/x86_64/rhel8/amd/rocm-5.4.2), so shipping it with CMSSW has its own side effects. So what would you suggest: how should we deal with it? Should we strip the distribution and package only what we need/use? Something of the order of 1-2 GB should be fine, but not more than that.

@fwyzard (Contributor, Author) commented Feb 2, 2023

/cvmfs/patatrack.cern.ch is available outside CERN, so any site with access to CVMFS should be fine.

I know the ROCm distribution is big - that's why we are using symlinks in the first place - but I don't really know how we can trim it.

We definitely do not use all the components and libraries that it provides, but I'm not sure what we should include and what we can drop.

@smuzaffar (Contributor):

/cvmfs/patatrack.cern.ch is available outside CERN, so any site with access to CVMFS should be fine.

correct, but I am not sure whether grid sites do an auto mount or an explicit mount of cvmfs repos. Sites with an explicit mount have to explicitly include patatrack.cern.ch in their mount list.

Maybe add a rocm-runtime package which just copies all the runtime libs and ships them along with cmssw?
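For illustration only, a rough sketch of what such a rocm-runtime packaging step could copy, assuming ${ROCM_BASE} points at the ROCm installation; the library names are just examples of runtime components, not a vetted list:

mkdir -p rocm-runtime/lib64
cp -a ${ROCM_BASE}/lib/libamdhip64.so* ${ROCM_BASE}/lib/libhsa-runtime64.so* rocm-runtime/lib64/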

@fwyzard (Contributor, Author) commented Feb 2, 2023

Mhm, I don't think this aspect is terribly urgent, because so far I'm not aware of any site with AMD GPUs.
I would prefer to start having some code that uses ROCm in CMSSW (directly or via Alpaka) in order to get more experience on what components are needed, before thinking about the packaging.

@smuzaffar (Contributor):

Also, what about a system rocm installation? I assume a system with an AMD GPU will have some local libs too. Do we also need to do some dynamic environment setup to choose between the system libs and the libs from the cms externals?

@fwyzard (Contributor, Author) commented Feb 2, 2023

I have absolutely no idea how to deal with the system driver and runtime... this too goes into the "let's get more experience" part.

Of course if anybody with experience with ROCm and AMD GPUs wants to help, that would be great :-)

@fwyzard (Contributor, Author) commented Feb 6, 2023

The CMSSW_13_0_X_2023-02-05-2300 IB includes support for the latest version of ROCm, unit tests for ROCm-based modules, and support for ROCm as an Alpaka backend.

@fwyzard (Contributor, Author) commented Feb 6, 2023

+heterogeneous

@makortel (Contributor) commented Feb 7, 2023

@smuzaffar Do we have anything to add to this issue from the core side?

@smuzaffar (Contributor) commented Feb 7, 2023

Hmm, I thought the build rules were not complete for the alpaka/rocm backend. @fwyzard, did you try enabling the rocm backend, and did it work? If yes, then we need to update the build rules to enable rocm by default.

@fwyzard (Contributor, Author) commented Feb 7, 2023

Mhm, my bad, I thought we had already enabled them by default... instead, I still have the change in my local area:

diff --git a/Projects/CMSSW/Self.xml b/Projects/CMSSW/Self.xml
index 24bc3a9..154617f 100644
--- a/Projects/CMSSW/Self.xml
+++ b/Projects/CMSSW/Self.xml
@@ -25,7 +25,7 @@
   <flags CHECK_PRIVATE_HEADERS="1"/>
   <flags SCRAM_TARGETS="haswell sandybridge nehalem"/>
   <flags OVERRIDABLE_FLAGS="CPPDEFINES CXXFLAGS FFLAGS CFLAGS CPPFLAGS LDFLAGS CUDA_FLAGS CUDA_LDFLAGS LTO_FLAGS ROCM_FLAGS ROCM_LDFLAGS"/>
-  <flags ALPAKA_BACKENDS="cuda serial"/>
+  <flags ALPAKA_BACKENDS="cuda rocm serial"/>
   <flags CODE_CHECK_ALPAKA_BACKEND="serial"/>
   <flags ENABLE_LTO="@ENABLE_LTO@"/>
  </client>

but I never made a PR with it.

@smuzaffar (Contributor):

@fwyzard, I just tested locally and was surprised that the alpaka/rocm backend works (at least it builds). It generates a lot of warnings though:

>> Compiling alpaka/rocm dev/src/HeterogeneousCore/AlpakaInterface/test/alpaka/testKernel.dev.cc
In file included from external//alpaka/develop-20230201-d2aabc77f57299d28056d96ad036358a/include/alpaka/acc/AccGpuUniformCudaHipRt.hpp:26,
                 from external//alpaka/develop-20230201-d2aabc77f57299d28056d96ad036358a/include/alpaka/acc/AccGpuHipRt.hpp:14,
                 from external//alpaka/develop-20230201-d2aabc77f57299d28056d96ad036358a/include/alpaka/alpaka.hpp:29,
                 from dev/src/HeterogeneousCore/AlpakaInterface/test/alpaka/testVec.cc:1:
external//alpaka/develop-20230201-d2aabc77f57299d28056d96ad036358a/include/alpaka/rand/RandUniformCudaHipRand.hpp:28: warning: ignoring '#pragma clang diagnostic' [-Wunknown-pragmas]
   28 | #        pragma clang diagnostic push
      |
             from dev/src/HeterogeneousCore/AlpakaInterface/test/alpaka/testVec.cc:1:

@fwyzard (Contributor, Author) commented Feb 7, 2023

Yes, that's something I'm following up inside Alpaka.

@fwyzard (Contributor, Author) commented Feb 9, 2023

@smuzaffar, some of the warnings can be fixed inside Alpaka (done in alpaka-group/alpaka#1914), while some are caused by the HIP headers.
Is there a way to include those headers with -isystem instead of -I?

That is, changing

... -I/cvmfs/cms-ci.cern.ch/week1/PR_19c02028/el8_amd64_gcc11/external/rocm/5.4.2-4bcebc22a189c738df04e9c24dd2bf21/include ...

to

... -isystem/cvmfs/cms-ci.cern.ch/week1/PR_19c02028/el8_amd64_gcc11/external/rocm/5.4.2-4bcebc22a189c738df04e9c24dd2bf21/include ...

?

@fwyzard (Contributor, Author) commented Feb 9, 2023

I think I found it:

<flags SYSTEM_INCLUDE="1"/>
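For reference, a hedged sketch of where this flag would sit, assuming it goes into the rocm scram tool definition; the surrounding structure is abbreviated and only illustrative:

<tool name="rocm" version="5.4.2">
  <client>
    <environment name="ROCM_BASE" default="..."/>
    <environment name="INCLUDE" default="$ROCM_BASE/include"/>
  </client>
  <flags SYSTEM_INCLUDE="1"/>
</tool>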

@smuzaffar (Contributor):

yes, SYSTEM_INCLUDE should change it to -isystem

@fwyzard (Contributor, Author) commented Mar 6, 2023

I think that with the set of PRs that entered CMSSW_13_0_0 and 13_1_0_pre1, everything is in place and working.

@fwyzard (Contributor, Author) commented Mar 6, 2023

+heterogeneous

@makortel (Contributor) commented Mar 6, 2023

+core

@cmsbuild (Contributor) commented Mar 6, 2023

This issue is fully signed and ready to be closed.

@makortel (Contributor) commented Mar 6, 2023

@cmsbuild, please close

cmsbuild closed this as completed on Mar 6, 2023