-
Notifications
You must be signed in to change notification settings - Fork 868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[v5.0.x] opal/common/ofi: refactor NIC selection logic #12135
Conversation
Just FWIW: I had time at SC to meet with some fabric folks as well as Brice to talk about this "distance" thing. Bottom line is that it is actually meaningless - and has been for years. The only thing you really care about is that the NIC be attached to your local PCIe bus (as opposed to the bus on another package). Beyond that, you really aren't measuring anything meaningful. So the "nearest" NIC will always be the first NIC on the PCIe bus in your local package. Doesn't matter where the proc is bound - answer is always the same. For other NICs hanging off of that bus, you can split hairs and claim there is a "distance" measured by position down the bus - but the vendors all say that is pointless as it is so small to be meaningless. If you only have one bus being shared by multiple packages (a very old architecture), then the answers don't change. The "nearest" NIC is still the first one on that PCIe bus, and distances to the other NICs on that bus can be considered identical. All of which led me to question why we are wasting so much time (both people and compute cycles) beating on this question when the fabric vendors say "it makes no difference - just take a NIC on the local PCIe bus of your package"? I'm wondering if we aren't overthinking things? |
@rhc54 |
I understand that - my point is that the entire code may be moot. According to the vendors I met with, all we need to do is (a) verify that the PCIe is local to our package (which it certainly will be minus a very old chip), and (b) round-robin the assignments. Everything else that we are doing is really just unnecessary complexity that gains us nothing. So I suggest that the correct PR is to simply remove all this code and round-robin the assignments. |
I see your point but I respectfully disagree on disregarding the distance entirely - I look at the problem from a generic software engineering pov. There are 2 separate aspects to the NIC selection problem: 1) calculating a meaningful distance metric and 2) using that distance metric to determine the nearest NIC. I believe this change is in the right direction by separating out the |
🤷♂️ It's your guys code, not mine, so I have no strong opinion on it. Again, FWIW: the results when run on various topologies yields the exact same distance for every proc to each NIC. So the round-robin code is what is going to be operating in all cases. Likelihood of someone partitioning the package to put multiple PCIe's in it, each with NICs on it so that the distance from proc to NIC depends on binding, is....awfully low, at least in my remaining lifetime 😄 |
Right I attempted to calculate the distance metric by counting hwloc objects between the processor and NIC but that wasn't easy either. An alternative would be for the application to ask for a specific NIC, assuming that the application has a perfect knowledge of the PCI topology. Naturally that means this NIC selection logic would be no-op. |
I promised @jsquyres to summarize the nic selection logic before merging this PR |
Need to pick #12144 once it's merged. There is a bug. |
This patch refactors the OFI NIC selection logic. It foremost improves the NIC search algorithm. Instead of searching for the closest NICs on the system, this patch directly compares the distances of the given providers and selects the nearest NIC. This change also makes it explicit that if the process is unbound, or the distance cannot be reliably calculated, a provider will be selected in round-robin fashion. Signed-off-by: Wenduo Wang <wenduwan@amazon.com> (cherry picked from commit f5f3b93)
This change fixes current round-robin selection logic: - Only providers of the same type should be considered, i.e. providers that match the head of the list. This deviates from the documented behavior. - For unbound process the selection should be based on its local rank, i.e. rank among processes on the same node. Currently only the first NIC will be selected. Signed-off-by: Wenduo Wang <wenduwan@amazon.com> (cherry picked from commit b061f96)
The documentation needs an update to reflect latest implementation. The original cpuset matching logic has been replaced with a new distance calculation algorithm. This change also clarifies the round-robin selection process when we need to break a tie. Signed-off-by: Wenduo Wang <wenduwan@amazon.com> (cherry picked from commit 3aba0bb)
1365116
to
1b1dd85
Compare
Changes are included in latest push - all of them are backport. I suggest merging after 5.0.2 release. |
@wenduwan so we can remove the draft tag from this PR? |
@janjust Thanks it's ready for re-review. |
This patch refactors the OFI NIC selection logic. It foremost improves the NIC search algorithm. Instead of searching for the closest NICs on the system, this patch directly compares the distances of the given providers and selects the nearest NIC.
This change also makes it explicit that if the process is unbound, or the distance cannot be reliably calculated, a provider will be selected in round-robin fashion.
Signed-off-by: Wenduo Wang wenduwan@amazon.com
(cherry picked from commit f5f3b93)