
Merging CUDA and non CUDA toolchains into one #12484

Closed
Micket opened this issue Mar 29, 2021 · 21 comments

@Micket
Contributor

Micket commented Mar 29, 2021

As has been discussed multiple times on Zoom and in chat, the idea is to merge fosscuda into foss, and just use version suffixes for the CUDA variants of the (relatively few) easyconfigs that do have CUDA bindings.

One thing holding us back has been the CUDA support in MPI, which prevents us from moving it into foss since it's part of the toolchain definition itself. But now with UCX, it might be possible to get the best of both worlds (I'm going to dismiss the legacy CUDA stuff in Open MPI and just focus on UCX).

We want something that

  1. works regardless of whether RPATH is used or not
  2. can be opted into after foss (without CUDA) is already in place
  3. supports all the RDMA goodies we have today in fosscuda.

Can it be done? Maybe; UCX has an environment variable for all the plugins it uses (ucx_info -f lists all variables).
We could introduce a UCX package (UCX-CUDA maybe?) which shadows the non-CUDA UCX, and have it set the environment variable

UCX_MODULE_DIR='%(installdir)s/lib/ucx'

and, well, that should be it?

Example UCX-CUDA easyconfig, as I envision it:

easyblock = 'ConfigureMake'

name = 'UCX-CUDA'
version = '1.9.0'
local_cudaversion = '11.1.1'
versionsuffix = '-CUDA-%s' % local_cudaversion

homepage = 'http://www.openucx.org/'
description = """Unified Communication X
An open-source production grade communication framework for data centric
and high-performance applications
"""

toolchain = {'name': 'GCCcore', 'version': '10.2.0'}
toolchainopts = {'pic': True}

source_urls = ['https://github.com/openucx/ucx/releases/download/v%(version)s']
sources = ['%(namelower)s-%(version)s.tar.gz']
checksums = ['a7a2c8841dc0d5444088a4373dc9b9cc68dbffcd917c1eba92ca8ed8e5e635fb']

builddependencies = [
    ('binutils', '2.35'),
    ('Autotools', '20200321'),
    ('pkg-config', '0.29.2'),
]

osdependencies = [OS_PKG_IBVERBS_DEV]

dependencies = [
    ('UCX', version),
    ('numactl', '2.0.13'),
    ('CUDAcore', local_cudaversion, '', True),
    ('GDRCopy', '2.1', versionsuffix),
]

configure_cmd = "contrib/configure-release"
configopts = '--enable-optimizations --enable-cma --enable-mt --with-verbs '
configopts += '--without-java --disable-doxygen-doc '
configopts += '--with-cuda=$EBROOTCUDACORE --with-gdrcopy=$EBROOTGDRCOPY '

prebuildopts = 'unset CUDA_CFLAGS && unset LIBS && '
buildopts = 'V=1'

# Not a PATH since we want to replace it, not append to it
modextravars = {
    'UCX_MODULE_DIR': '%(installdir)s/lib/ucx',
}

sanity_check_paths = {
    'files': ['bin/ucx_info', 'bin/ucx_perftest', 'bin/ucx_read_profile'],
    'dirs': ['include', 'lib', 'share']
}

sanity_check_commands = ["ucx_info -d"]

moduleclass = 'lib'

We'd then just let a TensorFlow-2.4.1-foss-2020b-CUDA-11.1.1.eb have a dependency on UCX-CUDA (at least indirectly), which is probably the ugliest thing about this approach.

This would remove the need for gcccuda, gompic, and fosscuda; everything would just use version suffixes instead (and optionally depend on UCX-CUDA if it has some MPI parts).

@bartoldeman Is this at all close to the approach you envisioned?

@Micket Micket added the 2021a label Mar 29, 2021
@boegel boegel added this to the 4.x milestone Mar 31, 2021
@bartoldeman
Contributor

@Micket more-or-less. But I tested it out and ran into a couple of issues:

  1. I found that UCX_MODULE_DIR does not override the default search path for modules, it merely adds a second alternative to $EBROOTUCX/lib/ucx (the first is derived at runtime from the full path of $EBROOTUCX/lib/libucs.so.0). That could of course be patched in the UCX source code.
  2. With RPATH, the single Open MPI is still linked to the main non-CUDA UCX libraries, so libucs.so.0 and co in UCX-CUDA would be there for nothing.
  3. There is however an alternative which is Open MPI centric instead of UCX centric: via OMPI_MCA_mca_component_path you could add a directory with an mca_pml_ucx.so that links to a full UCX-CUDA. So you'd have this (rough sketch below):
  • gompi is a subtoolchain of gompic
  • gompic depends on OpenMPI-CUDA and gompi components
  • OpenMPI-CUDA only has the modified plugins and extends OMPI_MCA_mca_component_path, no libmpi etc.
  • OpenMPI-CUDA depends on OpenMPI (which depends on plain UCX) and UCX-CUDA
  A bit complex though...
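
For illustration, a minimal sketch of what such an OpenMPI-CUDA easyconfig could roughly contain (the name, toolchain and versions here are assumptions; only the environment handling follows the idea above):

# Hypothetical 'OpenMPI-CUDA' fragment (names/versions assumed): it would ship only the
# rebuilt mca_pml_ucx.so linked against UCX-CUDA and prepend its plugin directory, while
# libmpi and everything else keeps coming from the plain OpenMPI module.
name = 'OpenMPI-CUDA'
version = '4.0.5'
versionsuffix = '-CUDA-11.1.1'

toolchain = {'name': 'GCC', 'version': '10.2.0'}

dependencies = [
    ('OpenMPI', version),                  # plain Open MPI, linked to the non-CUDA UCX
    ('UCX-CUDA', '1.9.0', versionsuffix),
]

# prepend our plugin directory so the UCX-CUDA-linked PML component is found first
modextrapaths = {
    'OMPI_MCA_mca_base_component_path': 'lib/openmpi',
}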

@bartoldeman
Contributor

Perhaps switching the two lines here:
https://github.com/openucx/ucx/blob/0477cce66118f6c9a65b8954878c0ee3a33b5035/src/ucs/sys/module.c#L121
makes a difference, will check...

@Micket
Contributor Author

Micket commented Mar 31, 2021

Aren't 1. and 2. basically the same issue? Were we to patch the search order, I suspect that would solve the RPATH issues?
I was a bit lazy here and just reused the entire UCX build, but I'm really only after the new directory with the UCX modules; so one could do the same as you suggest for OpenMPI-CUDA.

Regarding 3., redefining gompi as a subtoolchain sounds a bit scary; can we even do that without wreaking havoc on all previous toolchain versions?

@bartoldeman
Contributor

I'm actually not even sure if non-CUDA UCX can be convinced to see the CUDA plugins... will need to test some more.
The Open MPI approach via OMPI_MCA_mca_component_path works for sure (tested).

@akesandgren
Contributor

Unfortunately that won't be enough in the long run, when we get other things that use UCX directly and would then need the UCX-CUDA version...

@Micket
Contributor Author

Micket commented Mar 31, 2021

We would definitely have a "UCX-CUDA" even with the OMPI_MCA_mca_component_path approach, and I don't think there would be an issue with depending on that directly, if necessary, with either approach.

@bartoldeman
Contributor

bartoldeman commented Mar 31, 2021

Upon further investigation the UCX plugin architecture is not flexible enough for our purpose. The reason is that the list of plugins to load is set at configure time: for us this is:

#define uct_MODULES ":ib:rdmacm:cma:knem"

in config.h for non-CUDA and

#define uct_MODULES ":cuda:ib:rdmacm:cma:knem"

for CUDA. UCX parses this list to figure out which plugins to load.

Open MPI's plugin architecture is more flexible, though messing about with OMPI_MCA_mca_component_path would be a novelty in modules as far as I know, so tread carefully...

@bartoldeman
Contributor

Note about Intel MPI: this one goes via libfabric, which is flexible enough (via FI_PROVIDER_PATH).
Note that we (with RPATH) already patch it a little in a hook that modifies postinstallcmds, adding this:

patchelf --set-rpath $EBROOTUCX/lib --force-rpath %(installdir)s/intel64/libfabric/lib/prov/libmlx-fi.so

We'd need to have two libmlx-fi.so copies, one linking to the non-CUDA UCX and one to the CUDA UCX (if the latter works properly at all).
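
As a rough sketch, a hypothetical impi-CUDA module could carry that second copy like this (the names and paths are assumptions; only the patchelf trick mirrors the existing hook):

# Hypothetical 'impi-CUDA' fragment (names and paths assumed): install a second copy of the
# libfabric mlx provider, RPATHed against UCX-CUDA instead of the plain UCX, and point
# libfabric at it via FI_PROVIDER_PATH.
postinstallcmds = [
    'mkdir -p %(installdir)s/libfabric/prov',
    'cp $EBROOTIMPI/intel64/libfabric/lib/prov/libmlx-fi.so %(installdir)s/libfabric/prov/',
    'patchelf --set-rpath $EBROOTUCXMINUSCUDA/lib --force-rpath '
    '%(installdir)s/libfabric/prov/libmlx-fi.so',
]

modextrapaths = {
    'FI_PROVIDER_PATH': 'libfabric/prov',
}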

@Micket
Contributor Author

Micket commented Mar 31, 2021

So

  1. expand OpenMPI to set OMPI_MCA_mca_component_path
  2. UCX-CUDA (under GCCcore) like above (minus the pointless MODULE_DIR)
  3. OpenMPI-CUDA (under gompi) that just contains one MCA library + OMPI_MCA_mca_component_path.
  4. impi-CUDA (under iimpi?) that just contains libmlx-fi.so + FI_PROVIDER_PATH

@boegel
Member

boegel commented Apr 1, 2021

Why does OpenMPI need to set OMPI_MCA_mca_component_path? (step 0)
Maybe I'm missing something...

@bartoldeman
Contributor

@boegel no, step 0 isn't necessary; it would only be slightly cleaner if you want OpenMPI-CUDA to prepend.
Step 2 could set

OMPI_MCA_mca_base_component_path="$EBROOTOPENMPIMINUSCUDA/lib/openmpi:$EBROOTOPENMPI/lib/openmpi:$HOME/.openmpi/components"

or reading the source code at https://github.com/open-mpi/ompi/blob/92389c364df669822bb6d72de616c8ccf95b891c/opal/mca/base/mca_base_component_repository.c#L215 this is also possible:

OMPI_MCA_mca_base_component_path="$EBROOTOPENMPIMINUSCUDA/lib/openmpi:SYSTEM_DEFAULT:USER_DEFAULT"

@Micket
Contributor Author

Micket commented Jun 28, 2021

So we now have a working UCX + CUDA split. We should just decide on a workflow for how to put the easyconfigs on top of it. Some loose suggestions have been floating around, but nothing concrete, and I can only recall hearing objections to all approaches, so I doubt we can please everyone.
The suggestions I have myself, or have seen others make, are:

  1. The simplest approach we could take is to just depend on CUDAcore, or on UCX-CUDA if the software has MPI support. The CUDA variants of software get a versionsuffix = "-CUDA-%(cudaver)s", e.g. GROMACS-2021.3-foss-2021a.eb and GROMACS-2021.3-foss-2021a-CUDA-11.3.1.eb (see the sketch after this list). For Intel MPI I don't think UCX-CUDA is supported regardless, so there they could just depend on CUDAcore. The test suite would ensure we don't mix multiple different versions. Upside: simplicity; it also mirrors what we do at GCCcore with UCX-CUDA and NCCL. Downside: UCX-CUDA might be a bit obscure and people might not know to depend on it? (Perhaps a test suite check would suffice to remedy that.)

  2. Add a trivial bundle "CUDA-11.3.1-GCCcore-10.3.0.eb" that depends on UCX-CUDA and maybe binutils. Then just depend on that and use a versionsuffix just like before. Upside: perhaps a simpler dependency name to remember for those who don't realize what the purpose of UCX-CUDA is? Downside: forces UCX-CUDA on everything, even software that doesn't need UCX/MPI, especially on the Intel side which can't even use UCX-CUDA (yet?).

  3. Add a "CUDA" package that expands the module path somewhere, so it looks more like an HMNS. Of course, the whole plan was to be able to reuse the non-CUDA foss dependencies here, so it would somehow need to sit below foss, and I don't know what sort of mess we'd have to create to expand the module path depending on what other modules you have loaded. We'd have to mess about with toolchain definitions and such. Upside: adds a hierarchy for those who like that. Downside: I think it would create a fair bit of complexity.

For options 2 and 3, we should probably also watch out for how the name CUDA would interact with the naming schemes and install locations.
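
To make option 1 a bit more concrete, a sketch of the dependency block such a CUDA variant could carry (the versions are examples, not a tested easyconfig):

# Hypothetical GROMACS-2021.3-foss-2021a-CUDA-11.3.1.eb fragment illustrating option 1
# (versions are examples only).
versionsuffix = '-CUDA-%(cudaver)s'

dependencies = [
    ('CUDA', '11.3.1', '', True),           # the GPU support itself
    ('UCX-CUDA', '1.10.0', versionsuffix),  # CUDA plugins for UCX, for GPU-aware MPI
    # ... plus the regular dependencies shared with the non-CUDA easyconfig
]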

@akesandgren
Contributor

As of Intel MPI 2019.6 (5 actually but there it is just a tech preview) it requires UCX (https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html), so from that point of view we should use UCX-CUDA just to make them identical.

I think option 1 is the cleanest one, and it also makes it clearer for users which module they want when they look for CUDA enabled stuff.

I don't really like the -CUDA-x.y versionsuffix myself but I'd still go for it.

For option 2, shouldn't that be one CUDA-x.y-GCCcore-x.y that depends on CUDAcore and one CUDA-x.y-gompi-z that depends on UCX-CUDA? I.e., depending on the level of the toolchain, it pulls in the required CUDAcore or UCX-CUDA?

@akesandgren
Contributor

Hmmm, for option 1 we probably need to make accommodations in tools/module_naming_scheme/hierarchical_mns.py so that CUDA does not change the module path in this case...

@Micket
Contributor Author

Micket commented Jun 29, 2021

As of Intel MPI 2019.6 (5 actually but there it is just a tech preview) it requires UCX (https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html), so from that point of view we should use UCX-CUDA just to make them identical.

Yes, we will/do depend on UCX via impi, but it's just a question of whether or not to add the UCX CUDA plugins and GPUDirect if you make a TensorFlow-3.4.5-intel-2021a-CUDA-11.3.1.eb; if impi can't use those features, why add it?
Not that there is much harm in adding it, it's just 2 extra modules to load.

For option 2, shouldn't that be one CUDA-x.y-GCCcore-x.y which depend on CUDAcore and one CUDA-x.y-gompi-z that depends on UCX-CUDA? I.e. depending on at what level the toolchain is it pulls in the required CUDA(core|UCX)?

I don't think that does anything useful unless you want to expand modulepaths like in option 3 and define a bunch of toolchain stuff on top of gcc/gompi/foss (so that we can still use dependencies from these levels). Otherwise, they would all just depend on the same UCX-CUDA and nothing else.

Hmmm, for option 1 we probably need to make accomodations in tools/module_naming_scheme/hierarchical_mns.py that CUDA should not change modulepath in this case...

When CUDA/CUDAcore is used as an ordinary dependency (not part of a toolchain), this doesn't happen. I managed to build #13282 without modifications and it ends up in modules/all/MPI/GCC/10.3.0/OpenMPI/4.1.1/OSU-Micro-Benchmarks/5.7.1-CUDA-11.3.1.lua as expected (matching option 1 presented here).

@casparvl
Contributor

casparvl commented Jul 7, 2021

Looking at #13282, solution 1 actually leads to pretty clean easyconfigs: one dep added for GPU support, and another dep for GPU communication support. It's also flexible: if, for some reason, a different CUDA version is required for a specific easyconfig, that can easily be done, because CUDA is not a dependency of some very low-level toolchain. One would only have to install a new CUDAcore and (if relevant) UCX-CUDA.

But let me see if I get this right: there will then essentially be both a non-CUDA UCX (from gompi) in the path, as well as a UCX with CUDA support (from UCX-CUDA). So we essentially rely on the path order to have it pick up the relevant one? Or does UCX-CUDA now only install the plugins, but still use the UCX module for the base UCX?

I guess in the first case the RPATH issue that @bartoldeman mentioned will still be a problem, but in the second case it will essentially be resolved, right?

All in all, I'm in favour of solution 1. Solution 2 is 'convenient' but indeed a bit dirty, in that UCX gets pulled in even for non-MPI software. I don't think it's hard to check PRs for sanity under scenario 1, so if all maintainers know about this approach, it should be OK. Essentially, the only check that needs to be done is: if it's an MPI-capable toolchain and it includes CUDA as a dep, it has to also include UCX-CUDA. Maybe a check like that could even be automated (rough sketch below)...
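
As an illustration of what such an automated check could look like (this is not an existing test-suite hook; the toolchain list and the easyconfig layout are assumptions):

# Hypothetical check: flag easyconfigs that use an MPI-capable toolchain and depend on CUDA
# but do not also depend on UCX-CUDA. The 'ec' dict mimics a parsed easyconfig.
MPI_TOOLCHAINS = {'gompi', 'foss', 'iimpi', 'intel'}  # assumed list, not exhaustive

def missing_ucx_cuda(ec):
    """Return True if this easyconfig probably needs a UCX-CUDA dependency."""
    dep_names = {dep[0] for dep in ec.get('dependencies', [])}
    mpi_capable = ec['toolchain']['name'] in MPI_TOOLCHAINS
    return mpi_capable and 'CUDA' in dep_names and 'UCX-CUDA' not in dep_names

# usage sketch:
ec = {
    'toolchain': {'name': 'foss', 'version': '2021a'},
    'dependencies': [('CUDA', '11.3.1', '', True)],
}
print(missing_ucx_cuda(ec))  # True: CUDA dep present, UCX-CUDA missing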

@Micket
Contributor Author

Micket commented Jul 14, 2021

@casparvl

Or does UCX-CUDA now only install the plugins, but still uses the UCX module for the base UCX?

It does this. UCX-CUDA depends on UCX, and this is all merged already actually. No RPATH problems as it relies on UCX_MODULE_DIR.

And just because a software depends on MPI and CUDA doesn't mean it necessarily needs UCX-CUDA; I think it still needs special directives to use the GPUDirect/RDMA stuff. But there is no harm in always enabling the support in UCX.

@casparvl
Contributor

casparvl commented Jul 14, 2021

I guess you're right.

As far as I know, the use of GPUDirect requires the MPI_* routines to be called with a GPU pointer as the send/receive buffer, instead of calling cudaMemcpy to copy from a GPU buffer to a CPU buffer and then calling your MPI_* routine on the CPU buffer. But it's probably harder to find out whether software has such GPU-direct support than to just make sure the support is enabled in UCX :)
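
For illustration, a minimal sketch of that difference using mpi4py and CuPy (this assumes an MPI built with CUDA-aware UCX plus mpi4py and CuPy being available; it's purely illustrative):

# Sketch: staging through a host buffer vs. handing the GPU buffer to MPI directly
# (assumes mpi4py on top of a CUDA-aware MPI, plus CuPy; purely illustrative).
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
gpu_buf = cp.arange(1024, dtype=cp.float64)

# Without GPU-aware MPI: explicit device-to-host copy, then communicate the host buffer.
host_buf = cp.asnumpy(gpu_buf)
comm.Allreduce(MPI.IN_PLACE, host_buf)

# With GPU-aware MPI (UCX CUDA transports / GPUDirect): pass the GPU buffer directly;
# mpi4py hands the device pointer to MPI via the CUDA array interface.
comm.Allreduce(MPI.IN_PLACE, gpu_buf)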

It does this. UCX-CUDA depends on UCX, and this is all merged already actually. No RPATH problems as it relies on UCX_MODULE_DIR.

This is cool btw. I'd seen the PR, but wasn't sure I understood correctly what was going on. I suspected this was the case though, nice solution.

@branfosj
Member

branfosj commented Aug 4, 2021

I am happy with approach 1. I've done some testing of this but with CUDA 11.2.x (due to the NVIDIA driver version I have available).

@boegel
Member

boegel commented Aug 18, 2021

@Micket I think we can close this, since from foss/2021a onwards we now have a better approach to support software that requires a GPU-aware OpenMPI via UCX-CUDA (cfr. #13260), so we effectively don't have a fosscuda anymore?

@Micket
Contributor Author

Micket commented Aug 18, 2021

Sure. There will likely be some surprises going forward when we actually start adding stuff, but I think we can sort them out.

@Micket Micket closed this as completed Aug 18, 2021