Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

separate out 3rd-party headers #634

Merged
18 commits merged into from
Oct 9, 2024
Merged

separate out 3rd-party headers #634

18 commits merged into from
Oct 9, 2024

Conversation

ghost
Copy link

@ghost ghost commented Sep 30, 2024

88569151 feat(build): handle 3p headers better

Don't put 3p dependencies under the same directory as our headers, put
them under a separate top-level 3rd-party directory. Substitute in the
accelerator (cuda or neuron), so that our code doesn't need to actually
care if it's neuron or nvidia at every inclusion, including <nccl/err.h>
will do the right thing either way. Put the tuner under its own
Makefile. Move uthash, too. Prefer <> to "" so that it's clear that
these are system headers, not code we control. Also helps silence
warnings stemming from these files.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

43 files changed, 144 insertions(+), 131 deletions(-)
.gitignore                                         |  1 +
3rd-party/Makefile.am                              | 28 ++++++++++++++++++
.../nccl/cuda/include/nccl}/err.h                  |  0 
.../nccl/cuda/include/nccl}/net.h                  |  0 
.../nccl/cuda/include/nccl}/net_device.h           |  0 
.../nccl/cuda/include/nccl}/net_v2.h               |  0 
.../nccl/cuda/include/nccl}/net_v3.h               |  0 
.../nccl/cuda/include/nccl}/net_v4.h               |  0 
.../nccl/cuda/include/nccl}/net_v5.h               |  0 
.../nccl/cuda/include/nccl}/net_v6.h               |  0 
.../nccl/cuda/include/nccl}/net_v7.h               |  0 
.../nccl/cuda/include/nccl}/net_v8.h               |  0 
.../nccl/cuda/include/nccl}/tuner.h                |  4 +--
.../nccl/cuda/include/nccl}/tuner_v1.h             |  0 
.../nccl/cuda/include/nccl}/tuner_v2.h             |  0 
.../nccl/cuda/include/nccl}/types.h                |  0 
3rd-party/nccl/neuron/include/nccl/err.h           |  1 +
.../nccl/neuron/include/nccl}/error.h              |  0 
.../nccl/neuron/include/nccl}/net.h                |  0 
.../nccl/neuron/include/nccl}/net_v4.h             |  0 
.../nccl/neuron/include/nccl}/net_v5.h             |  0 
.../uthash/include/uthash}/uthash.h                |  0 
.../uthash/include/uthash}/utlist.h                |  0 
Makefile.am                                        |  2 +-
configure.ac                                       |  4 +++
include/Makefile.am                                | 22 --------------
include/nccl-headers/error.h                       | 16 ----------
include/nccl-headers/net.h                         | 16 ----------
include/nccl_ofi.h                                 | 14 +++++----
include/nccl_ofi_api.h                             |  4 +--
include/nccl_ofi_log.h                             |  2 +-
include/nccl_ofi_rdma.h                            | 12 ++++----
include/nccl_ofi_sendrecv.h                        |  4 +--
include/nccl_ofi_tuner.h                           |  2 +-
src/Makefile.am                                    | 34 +++++++---------------
src/nccl_ofi_ep_addr_list.c                        |  9 +++---
src/nccl_ofi_sendrecv.c                            |  1 +
src/tuner/Makefile.am                              | 32 ++++++++++++++++++++
src/tuner/nccl_ofi_tuner.c                         |  8 ++---
tests/functional/Makefile.am                       |  5 +++-
tests/functional/test-common.h                     | 19 +++++++-----
tests/unit/Makefile.am                             | 26 +++++++++++------
tests/unit/scheduler.c                             |  9 +++---
6df3fb59 feat(build): use -isystem instead of -I

When constructing include directories, use -isystem instead of -I. This
accomplishes largely the same thing, but informs the compiler that we
don't control these headers, which is useful for filtering warnings.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ghost ghost requested review from bwbarrett, rajachan, rauteric and a team as code owners September 30, 2024 11:34
@ghost ghost force-pushed the better-headers branch from 8856915 to 73b4979 Compare September 30, 2024 11:40
@ghost ghost marked this pull request as draft September 30, 2024 11:41
@ghost
Copy link
Author

ghost commented Sep 30, 2024

Moved to draft, was hoping this could avoid being based atop other work but it fails to build.

Worked here: https://github.com/aws-nslick/nccl-net-ofi/actions/runs/11104508976

@ghost ghost force-pushed the better-headers branch 6 times, most recently from 1dbb13a to d01683e Compare October 4, 2024 06:48
@ghost ghost marked this pull request as ready for review October 4, 2024 06:57
@ghost ghost force-pushed the better-headers branch from d01683e to ddea7d3 Compare October 4, 2024 07:07
bwbarrett
bwbarrett previously approved these changes Oct 4, 2024
@ghost ghost force-pushed the better-headers branch from ddea7d3 to 7cdb369 Compare October 5, 2024 22:59
Nicholas Sielicki added 10 commits October 8, 2024 12:51

Verified

This commit was signed with the committer’s verified signature.
CI has not been running unit tests for neuron, even though it's
possible. This doesn't make functional tests work on neuron yet, and it
remains automatically disabled. Separate out the enablement of the
functional tests from the unit tests. Make --disable-tests not build the
unit tests or the functional tests.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
858f448 stopped building libfabric from git, but didn't change the name,
so all of these builds have been named "libfabric@git" when they're
using the prebuilt efa installer. Remove the token.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
Remove the existing unit-tests action -- pay the install cost time only
once, and just run the tests after we run the build.

Prefer to not pass -j to make -- this can create jumbled output that is
harder to debug. These are running in parallel, it's okay if each
individual TU takes longer.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
The only reason you are ever looking at these logs is because the job
failed, in which case, it's a lot easier to debug if you can just see
precisely what the compiler called. Enable that here.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
This was a mess previously, and a lot of jobs were mislabeled as a
result.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
Fixing the compiler being misapplied in CI has exposed another gcc13
-fanalyzer report. This is likely a false positive; it seems to gets
confused about how the error status is being stored within the allocated
req before being freed and is unable to reason about the relationship
between ret and s_comm. Just add another check.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
Similar to this commits parent, fixing the compiler being misapplied in
CI has exposed another gcc13 -fanalyzer warning. The compiler wants to
be able to trust that if the rank is not 0, it must be 1 and vice-versa.

Unnecessarily checking for:

  if (rank == 0) {} else if (rank == 2) {}

instead of just

  if (rank == 0) {} else {}

means that the compiler considers cases where neither branch is taken.
Assert at init that both world size is two and further that the rank
assigned to each job is within {0,1}. Then, also make the else-ifs into
elses to silence the warning.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
Add a macro that uses AX_CHECK_COMPILE_FLAG to test for the existence of
a potential flag, and appends into the picky flags variable if and only
if the compiler has the flag.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
GCC manual says: -Wjump-misses-init is included in -Wc++-compat. It can
be disabled with the -Wno-jump-misses-init option.

This is seemingly only true in newer GCC, or is just bugged outright,
because we had -Wc++-compat enabled and did not see these warnings.

Add a macro for appending "picky" compiler flags and use it for existing
options, then ensure that -Wjump-misses-init is passed.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
a global `provider_filter` was made redundant with 159bfed and this is
now passed to every user directly, but the provider_filter string that
now exists in nccl_net_ofi_create_plugin was shadowing the global one
that was exported everywhere in nccl_ofi.h. Delete the global one.

rdma had two cases of declaring a new variable, shadowing the old
variable, pointing to the same thing. Delete the second useless
declaration. sendrecv had a case where it was shadowing a "req" variable
in an inner scope, but pointing to a different req. Just rename the
variable in the inner scope to be more clear.

mr.c needed to remove a useless include, which removes the shadow on
system_page_size. This unfortunately exposed an issue in the header for
mr.h; config.h was included late, it was transiently relying on an
inclusion for the definition of offsetof and, because we do not build
with C11, stdbool. The conditional inclusion of fi_domain.h was not
correct in all cases. So add <stddef.h> there, always include
fi_domain.h, and fix the config.h misordering.

Rename some things in tests to avoid shadowing.

With these fixed, enable -Wshadow whenever -Werror is enabled.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Nicholas Sielicki added 7 commits October 8, 2024 12:51

Verified

This commit was signed with the committer’s verified signature.
Enable -Wformat=2, which is stricter than Wall's -Wformat:
> -Wformat=2
>> Enable -Wformat plus additional format checks. Currently equivalent to
>> -Wformat -Wformat-nonliteral -Wformat-security -Wformat-y2k.

Fix a warning that this exposed: this literal in topo.c can be made
completely const and should be.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
Add missing case statements and enable -Wimplicit-fallthrough.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
Without this, gcc complains:
> cast from 'char *' to 'struct nccl_ofi_freelist_elem_t *' increases
> required alignment from 1 to 8

Fix problematic areas in container_of and freelist, then enable
-Wcast-align whenever -Werror is enabled.

Related-commit: 4e64913
Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
Many TU-local functions were missing a `static` keyword. Add it.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
We get this for free, likely fixed in #627 or with some of the
-fanalyzer fixes. Enable -Wnull-dereference whenever -Werror is enabled.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
When constructing include directories, use -isystem instead of -I. This
accomplishes largely the same thing, but informs the compiler that we
don't control these headers, which is useful for filtering warnings.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>

Verified

This commit was signed with the committer’s verified signature.
C99 does not define bool as a keyword (unlike C11 and beyond), which
means that files are free to use `bool` as something other than a type
name. When <stdbool.h> is not included through some other mechanism,
-Wc++-compat treats it as such and complains that it's a reserved
keyword in C++.

freelist.h uses bool, but does not bring in stdbool.h; add <stdbool.h>

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
@ghost ghost force-pushed the better-headers branch from 7cdb369 to 6cec9dd Compare October 8, 2024 19:52

Verified

This commit was signed with the committer’s verified signature.
Don't put 3p header-only dependencies (meaning, nccl's ext-net, nccl's
ext-tuner, uthash) under the same directory as our headers, put them
under a separate top-level 3rd-party/ directory. Ensure that a license
file is present in each directory.

On the build side, use AC_SUBST to make $device_interface accessible
outside of the top-level configure.ac, then use this to generate any
accelerator-specific include paths. It is now not possible to include
the wrong copy of the nccl interface files, because when compiling under
that accelerator it is simply not in the include path.

Move the tuner to its own Makefile.am for the src/tuner subdirectory;
don't try to accomplish this in the single src/Makefile.am.

When moving over the includes to the new expected format, prefer
<header.h> to "header.h" so that it's clear that these are system
headers, which also has the benefit of telling the compiler that
warnings in those files are different than any warnings in our actual
source files.

Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
@ghost ghost force-pushed the better-headers branch from 6cec9dd to 4b390d6 Compare October 8, 2024 19:53
@ghost ghost merged commit 6e3f942 into aws:master Oct 9, 2024
27 checks passed
@ghost ghost deleted the better-headers branch October 9, 2024 18:05
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant