Patch UCX to add CUDA support at runtime #12866

Closed

Contributor

@Micket Micket commented May 15, 2021

(created using eb --new-pr)

@Micket
Contributor Author

Micket commented May 15, 2021

So, I think this works. ucx_info manages to find and report on CUDA stuff.

But there are some details to sort out when it comes to building the additional UCX plugins: I would like them to link to the existing libuct.so etc. libraries. Right now, I hackishly install the plugins. Alternatively, we can just install copies of those libraries; they should hopefully be identical anyway.

@Micket Micket marked this pull request as draft May 15, 2021 16:30
@boegel
Member

boegel commented May 15, 2021

@Micket Do you think it's worth implementing a custom easyblock for this, where we can check the output produced by ucx_info a bit more thoroughly?

This easyconfig looks complex enough to warrant that, I think...

@Micket
Contributor Author

Micket commented May 15, 2021

Sure, though it's mostly just the same as UCX. We could probably automatically populate EB_xxx_MODULES here.

The biggest issues are:

  1. Gotta test to see if it actually works for transferring data. I need 2 nodes to test and my cluster is a bit booked up at the moment.
  2. The hacky way I install just the modules right now is not good. I would really need to rework this. It may even require a patch that builds just the plugins without involving any other linking shenanigans. I'm not too familiar with how to hack this into the libtool/autohell stuff going on here. I really hate autotools...

UCS_INIT_ONCE(init_once) {
ucs_module_debug("loading modules for %s", framework);
- modules_str = ucs_strdup(modules, "modules_list");
+ sprintf(env, "EB_%s_MODULES", framework);
Contributor

Use strncpy followed by two strncat calls here instead of sprintf; then you can drop stdio.h and stdlib.h and include string.h instead.
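A minimal sketch of what that suggestion could look like, assuming the goal is only to build the "EB_<framework>_MODULES" name into a fixed-size buffer. The helper name `eb_modules_env_name` and the explicit length check are hypothetical and not taken from the patch or from UCX; only string.h is needed.

```c
#include <string.h>

/*
 * Hypothetical helper (not part of UCX or the patch under review):
 * writes "EB_<framework>_MODULES" into buf, which holds `size` bytes.
 * Returns 0 on success, -1 if an argument is NULL or the name does not fit.
 */
static int eb_modules_env_name(char *buf, size_t size, const char *framework)
{
    static const char prefix[] = "EB_";
    static const char suffix[] = "_MODULES";
    size_t needed;

    if (buf == NULL || framework == NULL) {
        return -1;
    }

    /* Prefix + framework + suffix + terminating NUL. */
    needed = (sizeof(prefix) - 1) + strlen(framework) + (sizeof(suffix) - 1) + 1;
    if (needed > size) {
        return -1; /* refuse to truncate rather than build a wrong name */
    }

    /* strncpy does not guarantee termination, so terminate explicitly. */
    strncpy(buf, prefix, size - 1);
    buf[size - 1] = '\0';

    /* strncat's third argument is the space left, excluding the NUL. */
    strncat(buf, framework, size - strlen(buf) - 1);
    strncat(buf, suffix, size - strlen(buf) - 1);
    return 0;
}
```

Called with a 32-byte buffer and a framework string such as "uct" (used here purely as an example value), this fills the buffer with "EB_uct_MODULES". Unlike an unchecked sprintf, a framework name that is too long for the buffer yields an error instead of an overflow.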

@akesandgren
Contributor

This one no longer needs the original UCX or the patch for dynamic modules.

@boegelbot
Collaborator

@Micket: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/885348992
Output from first failing test suite run:

Found 12227 easyconfigs...
Failed to determine merge base (ec: 1, output: ''), falling back to specifying target branch develop
Failed to determine merge base (ec: 1, output: ''), falling back to specifying target branch develop
Failed to determine merge base (ec: 1, output: ''), falling back to specifying target branch develop
Specific checks only done for the (easyconfig) files that were changed in a pull request.
List of changed easyconfig files in this PR:
	UCX-1.10.0-GCCcore-10.3.0.eb

List of added easyconfig files in this PR:
	UCX-CUDA-1.10.0-GCCcore-10.3.0-CUDA-11.3.0.eb

FAIL: test_changed_files_pull_request (test.easyconfigs.easyconfigs.EasyConfigTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 990, in test_changed_files_pull_request
    self.check_sha256_checksums(changed_ecs)
  File "test/easyconfigs/easyconfigs.py", line 692, in check_sha256_checksums
    self.assertTrue(len(checksum_issues) == 0, "No checksum issues:\n%s" % '\n'.join(checksum_issues))
AssertionError: No checksum issues:
Checksums missing for one or more sources/patches in UCX-CUDA-1.10.0-GCCcore-10.3.0-CUDA-11.3.0.eb: found 1 sources + 1 patches vs 1 checksums

----------------------------------------------------------------------
Ran 12237 tests in 372.346s

FAILED (failures=1)
ERROR: Not all tests were successful.

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@boegel boegel modified the milestones: 4.4.0, release after 4.4.0 May 28, 2021
@Micket
Contributor Author

Micket commented Jun 26, 2021

Replaced with #13260

@Micket Micket closed this Jun 26, 2021
@Micket Micket deleted the 20210515134431_new_pr_UCX-CUDA1100 branch April 21, 2023 13:11