-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
separate out 3rd-party headers #634
Merged
Merged
+575
−492
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
8856915
to
73b4979
Compare
Moved to draft, was hoping this could avoid being based atop other work but it fails to build. Worked here: https://github.com/aws-nslick/nccl-net-ofi/actions/runs/11104508976 |
1dbb13a
to
d01683e
Compare
ghost
commented
Oct 4, 2024
bwbarrett
reviewed
Oct 4, 2024
bwbarrett
previously approved these changes
Oct 4, 2024
CI has not been running unit tests for neuron, even though it's possible. This doesn't make functional tests work on neuron yet, and it remains automatically disabled. Separate out the enablement of the functional tests from the unit tests. Make --disable-tests not build the unit tests or the functional tests. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
858f448 stopped building libfabric from git, but didn't change the name, so all of these builds have been named "libfabric@git" when they're using the prebuilt efa installer. Remove the token. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Remove the existing unit-tests action -- pay the install cost time only once, and just run the tests after we run the build. Prefer to not pass -j to make -- this can create jumbled output that is harder to debug. These are running in parallel, it's okay if each individual TU takes longer. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
The only reason you are ever looking at these logs is because the job failed, in which case, it's a lot easier to debug if you can just see precisely what the compiler called. Enable that here. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
This was a mess previously, and a lot of jobs were mislabeled as a result. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Fixing the compiler being misapplied in CI has exposed another gcc13 -fanalyzer report. This is likely a false positive; it seems to gets confused about how the error status is being stored within the allocated req before being freed and is unable to reason about the relationship between ret and s_comm. Just add another check. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Similar to this commits parent, fixing the compiler being misapplied in CI has exposed another gcc13 -fanalyzer warning. The compiler wants to be able to trust that if the rank is not 0, it must be 1 and vice-versa. Unnecessarily checking for: if (rank == 0) {} else if (rank == 2) {} instead of just if (rank == 0) {} else {} means that the compiler considers cases where neither branch is taken. Assert at init that both world size is two and further that the rank assigned to each job is within {0,1}. Then, also make the else-ifs into elses to silence the warning. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Add a macro that uses AX_CHECK_COMPILE_FLAG to test for the existence of a potential flag, and appends into the picky flags variable if and only if the compiler has the flag. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
GCC manual says: -Wjump-misses-init is included in -Wc++-compat. It can be disabled with the -Wno-jump-misses-init option. This is seemingly only true in newer GCC, or is just bugged outright, because we had -Wc++-compat enabled and did not see these warnings. Add a macro for appending "picky" compiler flags and use it for existing options, then ensure that -Wjump-misses-init is passed. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
a global `provider_filter` was made redundant with 159bfed and this is now passed to every user directly, but the provider_filter string that now exists in nccl_net_ofi_create_plugin was shadowing the global one that was exported everywhere in nccl_ofi.h. Delete the global one. rdma had two cases of declaring a new variable, shadowing the old variable, pointing to the same thing. Delete the second useless declaration. sendrecv had a case where it was shadowing a "req" variable in an inner scope, but pointing to a different req. Just rename the variable in the inner scope to be more clear. mr.c needed to remove a useless include, which removes the shadow on system_page_size. This unfortunately exposed an issue in the header for mr.h; config.h was included late, it was transiently relying on an inclusion for the definition of offsetof and, because we do not build with C11, stdbool. The conditional inclusion of fi_domain.h was not correct in all cases. So add <stddef.h> there, always include fi_domain.h, and fix the config.h misordering. Rename some things in tests to avoid shadowing. With these fixed, enable -Wshadow whenever -Werror is enabled. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Enable -Wformat=2, which is stricter than Wall's -Wformat: > -Wformat=2 >> Enable -Wformat plus additional format checks. Currently equivalent to >> -Wformat -Wformat-nonliteral -Wformat-security -Wformat-y2k. Fix a warning that this exposed: this literal in topo.c can be made completely const and should be. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Add missing case statements and enable -Wimplicit-fallthrough. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Without this, gcc complains: > cast from 'char *' to 'struct nccl_ofi_freelist_elem_t *' increases > required alignment from 1 to 8 Fix problematic areas in container_of and freelist, then enable -Wcast-align whenever -Werror is enabled. Related-commit: 4e64913 Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Many TU-local functions were missing a `static` keyword. Add it. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
We get this for free, likely fixed in #627 or with some of the -fanalyzer fixes. Enable -Wnull-dereference whenever -Werror is enabled. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
When constructing include directories, use -isystem instead of -I. This accomplishes largely the same thing, but informs the compiler that we don't control these headers, which is useful for filtering warnings. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
C99 does not define bool as a keyword (unlike C11 and beyond), which means that files are free to use `bool` as something other than a type name. When <stdbool.h> is not included through some other mechanism, -Wc++-compat treats it as such and complains that it's a reserved keyword in C++. freelist.h uses bool, but does not bring in stdbool.h; add <stdbool.h> Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
Don't put 3p header-only dependencies (meaning, nccl's ext-net, nccl's ext-tuner, uthash) under the same directory as our headers, put them under a separate top-level 3rd-party/ directory. Ensure that a license file is present in each directory. On the build side, use AC_SUBST to make $device_interface accessible outside of the top-level configure.ac, then use this to generate any accelerator-specific include paths. It is now not possible to include the wrong copy of the nccl interface files, because when compiling under that accelerator it is simply not in the include path. Move the tuner to its own Makefile.am for the src/tuner subdirectory; don't try to accomplish this in the single src/Makefile.am. When moving over the includes to the new expected format, prefer <header.h> to "header.h" so that it's clear that these are system headers, which also has the benefit of telling the compiler that warnings in those files are different than any warnings in our actual source files. Signed-off-by: Nicholas Sielicki <nslick@amazon.com>
bwbarrett
approved these changes
Oct 9, 2024
This pull request was closed.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.