
Update to UCX v1.14.0 and configure/build for release #111

Merged (15 commits into conda-forge:main on Mar 18, 2023)

Conversation

@pentschev (Contributor) commented on Oct 17, 2022

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Changes:

  • Upgrade to UCX 1.14.0
  • Add libnuma dependency from conda-forge
  • Switch to release builds
  • Move to a single build compatible with both CPU-only and CUDA
  • Remove verbs support

Closes #26
Closes #66
Closes #102
Closes #110
Closes #112
Closes #115

@conda-forge-linter commented

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@pentschev (Contributor, Author) commented

@conda-forge-admin, please rerender

@pentschev (Contributor, Author) commented

cc @leofang @jakirkham

@github-actions (Contributor) commented

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you, but it looks like there was nothing to do.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/ucx-split-feedstock/actions/runs/3267176097.

@leofang (Member) commented on Oct 17, 2022

Thanks, Peter! LGTM, but I'd like to have @jakirkham or someone else take another look 🙂

@jakirkham (Member) commented

Does this result in ABI changes? In particular, were there debug symbols that would have been linked against before that are now removed?

@pentschev mentioned this pull request on Jan 27, 2023
@conda-forge-webservices (Contributor) commented

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@pentschev changed the title from "Configure/build UCX for release" to "Update to UCX v1.14.0 and configure/build for release" on Mar 14, 2023
@pentschev (Contributor, Author) commented

As per our previous discussion, we wanted to wait to switch to release builds until the new UCX release was out. Now that v1.14.0 has been released, I've updated this PR to use it.

@jakirkham @leofang would you mind taking a look?

@pentschev (Contributor, Author) commented

It seems that the CentOS 6 builds fail with UCX 1.14 because IBV_ACCESS_ON_DEMAND is missing: it was previously defined in https://github.com/openucx/ucx/blob/c0aa76d0c121445b8665dbb16e1577ef5ebc6200/src/uct/ib/base/ib_verbs.h#L160-L169 but is now expected to be provided by rdma-core.

Similar to the existing note in the recipe ("# --with-rdmacm requires rdma-core v23+, while CentOS 7 only offers v22."), I believe we should be building against rdma-core v28 for MOFED 5.x compatibility, which we can't do because CentOS 7 only goes up to v22. My suggestion would be to remove --with-verbs entirely until we can guarantee rdma-core v28 or higher is available as part of the conda-forge packaging process. A sketch of what that would mean for the recipe follows below.
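
In recipe terms, a rough sketch of that direction, assuming the verbs-related CDTs in recipe/meta.yaml also become unnecessary once --with-verbs is dropped from the configure invocation (illustrative only, not the exact change):

requirements:
  build:
    # With --with-verbs removed, these rdma/verbs CDTs should no
    # longer be needed:
    # - {{ cdt("libibcm-devel") }}      # [cdt_name == "cos6"]
    # - {{ cdt("libibverbs-devel") }}   # [cdt_name == "cos6"]
    # - {{ cdt("librdmacm-devel") }}    # [cdt_name == "cos6"]
    - {{ cdt("numactl-devel") }}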

@pentschev (Contributor, Author) commented

To be clear, MOFED 5.x is now the only MOFED version tested with UCX. I know MOFED 4.x was still "supported" by UCX until last year or so, but it was not actively tested, and thus I was constantly finding bugs. As a result, we decided to stop supporting MOFED 4.x entirely for RAPIDS, both internally at NVIDIA and for customers. I would recommend the same be done here, as MOFED 4.x support (rdma-core < v28) is going to be hit or miss.

@jakirkham (Member) commented

Thanks Peter! 🙏

Do we also want to add libnuma ( #112 )?

Wonder if we can also combine these into a single build for both CPU and GPU (just moving cudatoolkit to run_constrained); a rough sketch follows below. This would resolve issues ( #26 ), ( #66 ), ( #102 ) & ( #115 ).

cc @kkraus14 (for awareness)
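
For illustration, moving cudatoolkit to run_constrained could look roughly like this in recipe/meta.yaml (a sketch; the version pin is a placeholder, not the recipe's actual pinning):

requirements:
  run_constrained:
    # Constrain, rather than require, the CUDA runtime so that a
    # single build serves both CPU-only and CUDA environments.
    - cudatoolkit >=11.2  # placeholder pin, for illustration only

With cudatoolkit under run_constrained, CPU-only users never pull in CUDA packages, while users who do install cudatoolkit get a solver-enforced compatible version.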

Review thread on recipe/meta.yaml (outdated):

build:
  skip: true  # [not linux]
-  number: 0
+  number: 1

@jakirkham (Member) commented:

If we're bumping the version, we don't need to bump the build number.

@pentschev (Contributor, Author) replied:

This was an oversight: this PR was originally only changing to release builds, with no bump of the UCX version, so this was left in by accident. We also have the number variable above; is there a reason we can't use it here, @jakirkham? It is currently only used by the ucx output's build number; wondering whether we can use it for both.

@pentschev (Contributor, Author) replied:

Fixed in 50a5dc2.

@jakirkham (Member) replied:

Yeah, we should use the top-level number and template it everywhere else.
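
For example, a minimal sketch of that templating, using the top-level number variable referenced above (surrounding recipe content abbreviated):

{% set number = 0 %}

build:
  number: {{ number }}

outputs:
  - name: ucx
    build:
      number: {{ number }}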

@jakirkham (Member) commented

A separate question would be whether we want to backport the CUDA 12 fixes ourselves ( #116 (comment) ) to ensure they are in 1.14.0.

@kkraus14 commented

> To be clear, MOFED 5.x is now the only MOFED version tested with UCX. I know MOFED 4.x was still "supported" by UCX until last year or so, but it was not actively tested, and thus I was constantly finding bugs. As a result, we decided to stop supporting MOFED 4.x entirely for RAPIDS, both internally at NVIDIA and for customers. I would recommend the same be done here, as MOFED 4.x support (rdma-core < v28) is going to be hit or miss.

Now that there's a virtual package plugin system for conda, we could in theory have a virtual package for the MOFED version and constrain UCX packages based on it if desired (see the sketch below).

Is there a path forward for an updated rdma-core CDT, or are we limited to CentOS 7 for CDTs?
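
For illustration only: if such a plugin exposed a hypothetical __mofed virtual package (no such virtual package exists today), a UCX package could constrain against it in meta.yaml roughly like this:

requirements:
  run_constrained:
    # __mofed is a hypothetical virtual package that a conda plugin
    # could provide; this constraint would prevent installation on
    # MOFED 4.x systems without requiring MOFED at all.
    - __mofed >=5.0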

@jakirkham (Member) left a review:

Thanks Peter! 🙏

Looks good. One minor suggestion below.

@jakirkham (Member) commented

Thanks Peter! 🙏

LGTM

Will let this sit for a bit to see if anyone else has feedback. Planning on merging EOD tomorrow if there are no comments.

@jakirkham merged commit 09499c8 into conda-forge:main on Mar 18, 2023.
@jakirkham (Member) commented

Thanks Peter for the PR and Keith for help reviewing! 🙏

rapids-bot pushed a commit to rapidsai/ucx-py that referenced this pull request on Mar 20, 2023:
conda-forge/ucx-split-feedstock#111 was merged, and UCX v1.14.0 is now available on conda-forge, which we should support.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)

URL: #935
rapids-bot pushed a commit to rapidsai/raft that referenced this pull request on Mar 23, 2023:
UCX 1.14.0 was recently released, and the conda-forge package was updated in conda-forge/ucx-split-feedstock#111 with several packaging improvements. Relax the pin to allow installing UCX v1.14.x as well.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #1366
@pentschev deleted the use-configure-release branch on April 12, 2023.
Review thread on recipe/meta.yaml:

requirements:
  build:
    - {{ compiler("c") }}
    - {{ compiler("cxx") }}
    - {{ compiler("cuda") }}          # [cuda_compiler_version != "None"]
    - {{ cdt("libnl") }}              # [cdt_name == "cos6"]
    - {{ cdt("libnl3") }}             # [cdt_name == "cos7"]
    - {{ cdt("libibcm-devel") }}      # [cdt_name == "cos6"]
    - {{ cdt("libibverbs-devel") }}   # [cdt_name == "cos6"]
    - {{ cdt("librdmacm-devel") }}    # [cdt_name == "cos6"]
    - {{ cdt("numactl-devel") }}
A Member commented:

Just noticed we still have this CDT, even though we added libnuma below. Should we drop the CDT then? (See the sketch after this thread.)

A Member replied:

Addressing in PR ( #118 )
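
A minimal sketch of what dropping the CDT might look like, assuming libnuma from conda-forge supplies what numactl-devel provided (the actual change is in PR #118 and may differ):

requirements:
  build:
    - {{ compiler("c") }}
    # {{ cdt("numactl-devel") }} removed
  host:
    # libnuma from conda-forge replaces the numactl-devel CDT
    - libnuma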

@jakirkham mentioned this pull request on Apr 12, 2023.
@jakirkham mentioned this pull request on May 22, 2023.
rapids-bot pushed a commit to rapidsai/integration that referenced this pull request on Jan 28, 2025:
Contributes to rapidsai/build-planning#142

Proposes removing the exclusion for `ucx-proc` in the CI job that checks that all projects have produced recent-enough nightlies. See that linked issue for much more detail, but in short:

* there haven't been new `ucx-proc` packages for 2+ years, since conda-forge/ucx-split-feedstock#111
* RAPIDS can safely drop its `ucx-proc` dependency, and other PRs linked to rapidsai/build-planning#142 have done that for the 25.02 release
* `ucx-proc` should not show up in RAPIDS 25.02 environment solves, and so we don't need to carry an exclusion for it in the testing config

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham
  - Bradley Dice (https://github.com/bdice)

URL: #742