TCP BTL fails to collect all interface addresses (when interfaces are on different subnets) #12232
I'm a little confused about what you are trying to do. Your description of the problem lists 6 private IP addresses and 1 public IP address:
A few questions:
@jsquyres thanks for getting back to me
I am not familiar with the logic that OpenMPI's TCP BTL uses to bind to network interfaces. Since each node has one NIC/GPU and NUMA node, the ideal case would be for OpenMPI to use the "nearest" NIC with respect to the rank's NUMA domain. That implies an optimal choice of interface for each rank. It doesn't matter what address it uses.
I don't see why we would want 4 subnets. We have 4 NICs per node (and therefore 4 interfaces). More broadly speaking: you can think of Perlmutter's high-speed network as a single private network, and we've given each node a "line to the outside world" by piggy-backing off the `hsn0` interface. Based on the error message:
I think what's happening is that OpenMPI is unaware that `hsn0` has two addresses on two different subnets. Perhaps the best solution is to record all IPs (even if they belong to different subnets)?
There is no overlay network here -- this is not Kubernetes. We're running Podman-HPC (https://github.com/NERSC/podman-hpc) using Slurm. The container sees the host network.
Oh, one more thing: of the interfaces listed in my example, OpenMPI definitely should not use the `nmn` (node management network) interface.
I don't know how to parse this sentence; it seems to contradict itself. The first part of the sentence says that there's one NIC/GPU and NUMA node, but then the second part of the sentence implies that the Open MPI process should use the nearest NIC according to its NUMA domain. Later in the text, you specifically mention that there are 4 NICs and 4 GPUs; I infer that this means that there are 4 NUMA domains, too.
I think you should check into how the Linux kernel handles IP traffic from multiple interfaces that are all on the same subnet. It's been a little while since I've poked around in that area, but it used to be true that all outgoing traffic to that subnet would go out the "first" interface on that subnet. I.e., all your outgoing traffic -- for that subnet -- would go out a single NIC. Perhaps the Linux kernel IP stack has gotten smarter about this over time, but I think this kind of use case is not (or at least has not historically been) well represented in the Linux kernel IP space. Multiple interfaces on the same subnet were more intended to be used for bonding and the like, not NUMA-friendly transfers.

Also, this raises the larger question: why are you using TCP? Assuming you're using NVIDIA NICs and GPUs, shouldn't you be using UCX? UCX will handle all the interface selection and pinning, etc. It will also handle all the RDMA and GPUDirect stuff (which TCP won't).
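Coming back to the same-subnet question: a quick, Open MPI-independent way to check what the kernel would actually do on one of these nodes (the peer address below is just one taken from this thread, so treat it as a placeholder):

```bash
# Which outgoing interface and source address would the kernel pick to reach
# a peer on the high-speed network?  (peer address is a placeholder)
ip route get 10.249.13.210

# Routes for the locally attached subnets; with several NICs on one subnet,
# usually only one of them "wins" as the output route.
ip -4 route show
```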
Except for the OOB subsystem, of course.
Oops, my response got mangled in edits. Apologies. Each node on Perlmutter has 1 CPU, 4 GPUs, and 4 NICs, and each NIC/GPU pair sits in its own NUMA domain (https://docs.nersc.gov/systems/perlmutter/architecture/). So yes, you have 4 NUMA domains, each with its own GPU and NIC.
Clearly, you can bind to multiple interfaces, and send traffic via those ... I've done that in plenty of applications. So I suspect that the kernel's IP stack has gotten smarter. Anyway, you might be right. But HPE won't change the network architecture of their flagship high-speed network over this. So I think speculating over how to arrange interfaces and subnets is moot -- seeing that we won't be able to change that.
It's Nvidia GPUs and HPE NICs -- again here's the link: https://docs.nersc.gov/systems/perlmutter/architecture/. Also, since HPE bought Cray, I don't know how well supported UCX is right now (with respect to Cassini NICs. FTR: I like UCX, and would prefer to use it too). HPE provides libfabric, built against their CXI (https://docs.open-mpi.org/en/main/tuning-apps/networking/ofi.html#what-are-the-libfabric-ofi-components-in-open-mpi) and their own MPI implementation. That's all the official support from the vendor that I am aware of...
Ah! Fair point. We are straying very far from the original point of this issue though. But for the sake of completeness, I will round out this discussion:

My goal is to help users be productive on our systems. Almost always that means minimizing the time it takes to have meaningful scientific data. Depending on the user's problem, that can involve everything from performance tuning at scale to just getting a pre-built executable to work at all. So, if raw performance were my goal, right now I would be using the vendor's recommended transport library (which in this case would be Cray MPICH) -- even if I'm partial to OpenMPI 😉

Often though, some users do not have the wherewithal to compile a large application using a system-specific toolchain. In an ideal world, ABIs and standards would be mature enough that things like CXI could be resolved dynamically, and MPI "just works" -- in that case, the vendor could just provide something like an OFI provider, we would provide sensible configurations, and an application built against any MPI implementation would pick those up. That reality isn't here (yet). In its absence, TCP has established itself as the least common denominator that usually "just works". For many users, that is enough (or at least the performance gains are not worth the effort of rebuilding their applications).

We are working with the CP2K developers, as well as Nvidia, to build a CP2K container that uses HPE CXI. I am happy to discuss this -- and other HPE-related topics -- further. Eventually though, I would like to return to the original point of this issue: that OpenMPI's TCP-based BTL is not detecting all interface addresses.
Yeah, Slingshot NICs mean you cannot use UCX -- you must use libfabric. However, I believe OMPI does support Slingshot operations -- in fact, last I checked, OMPI v5 is running on Perlmutter. I'm curious, though -- is
Correct! But that requires OpenMPI to be built with libfabric support (and AFAIK, it needs to be the right version of libfabric -- I might be wrong though). These have to be configured at compile time, and are missing from the Nvidia container. We are working with Nvidia (who built https://catalog.ngc.nvidia.com/orgs/hpc/containers/cp2k ) to see if they can update their image (and upgrade to OMPI v5 while they're at it).
We're running the application with Slurm and PMI2:
Correct. However, you have to use PMIx with OMPI v5 (no pmi2 support any more).
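Concretely, with an OMPI v5 container the launch line would change roughly like this (assuming your Slurm has the PMIx plugin built -- `srun --mpi=list` will tell you):

```bash
# See which PMI flavors this Slurm installation supports
srun --mpi=list

# Current OMPI v4 launch (from the batch script below)
srun -n 2 --cpu-bind cores --mpi pmi2 shifter --module gpu --entrypoint cp2k -i initial_2.inp -o initial_13.out

# With OMPI v5, switch to PMIx
srun -n 2 --cpu-bind cores --mpi pmix shifter --module gpu --entrypoint cp2k -i initial_2.inp -o initial_13.out
```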
Those folks at NVIDIA are rather fixated on UCX, so convincing them could be a problem. Truly wish you luck on it! 😄
Good point, another reason to bump the container's OMPI version up to v5
Thanks!!! Vendors who each prefer their favorite solutions?! Who would have thought? 😉 Seriously though, I've always found some folks that will listen, so this is not a lost cause 🤞
Occurs to me: @hppritcha Would it make sense to post a Docker container with OMPI v5 built against libfabric to the Docker registry? Not sure what else is in the NVIDIA offering, but might help people get around the UCX-only issue.
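For reference, a rough sketch of the configure line such a container build would need (paths and the internal-PMIx choice are assumptions, not a tested recipe):

```bash
# Open MPI v5 built against libfabric (OFI); the libfabric install must ship
# the CXI provider for Slingshot.  All paths below are placeholders.
./configure --prefix=/opt/openmpi \
            --with-ofi=/opt/libfabric \
            --with-pmix=internal
make -j"$(nproc)" install
```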
That would be much appreciated.
@JBlaschke the IPs you posted do not match the error message, so this is not ideal for troubleshooting. Can you confirm you start an Open MPI 4 application with `srun --mpi pmi2`?
@rhc54 thanks for the Docker container suggestion. It isn't a substitute for understanding and hopefully addressing this TCP issue, though. @JBlaschke do you know if this is reproducible outside of the container? I will try to reproduce it myself on PM, but outside of a container.
I guess this will wait, as PM is undergoing maintenance today.
@hppritcha Our test system is up -- it has the same hardware configuration as Perlmutter. I'll run @ggouaillardet's test now and post it here.
@hppritcha I don't know -- I haven't built CP2K. Last time I looked at the build script, it was rather involved. If you have a working build (or even a way to get one quickly), then it might be best for you to test that. Were you thinking of using the
@JBlaschke I pinged you on NERSC Slack for some info.
@ggouaillardet Confirming that this is an OMPI v4 application, launched with `srun --mpi pmi2`.
@ggouaillardet Good point, here is the output of `ip -4 -f inet addr show` on the compute nodes:
It looks like this interface is the problem:
@ggouaillardet I do not set any OpenMPI-related environment variables. Here is the batch script:

```bash
#!/bin/bash
#SBATCH --image docker:nvcr.io/hpc/cp2k:v2023.1
#SBATCH --nodes 2
#SBATCH --cpus-per-task 128
#SBATCH --gpus-per-task 4
#SBATCH --ntasks-per-node 1
#SBATCH --constraint gpu
#SBATCH --qos debug
#SBATCH -t 00:20:00
#SBATCH -A nstaff
#SBATCH -J cp2k
export OMP_NUM_THREADS=1
srun -n 2 ip -4 -f inet addr show
srun -n 2 --cpu-bind cores --mpi pmi2 shifter --module gpu --entrypoint cp2k -i initial_2.inp -o initial_13.out
```
I suspect the problem is that one proc uses the 10.250.0.39 interface, and the other proc uses the 128.55 address -- there is no reason for either of them to prefer one address over the other. What happens if you put it in the environment of the MPI tasks? EDIT: fixed the environment variable name
If this does not work, you might also want to try to put this instead in your environment (and double check this is passed to the MPI tasks by `srun`).
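For reference, the knobs in question are the TCP BTL's interface include/exclude parameters; a sketch of how they could be exported through the batch script (interface names are taken from the `ip` output above, so double-check them):

```bash
# Restrict the TCP BTL to the high-speed interfaces; srun exports the
# environment to the tasks by default.
export OMPI_MCA_btl_tcp_if_include=hsn0,hsn1,hsn2,hsn3

# Alternatively (include and exclude are mutually exclusive), drop loopback
# and the management network instead -- "nmn0" is an assumed interface name.
# export OMPI_MCA_btl_tcp_if_exclude=lo,nmn0
```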
Huh! This is funny:
But the job runs (producing an output file that has reasonable-looking contents). I cannot tell if it runs to completion, or if it deadlocks before then.
and the program does not produce any output. In both cases the program runs until the wall-clock limit, so something is deadlocked. It's possible that the
I saw a few suspicious things. As pointed out by Ralph, Open MPI is supposed to use only one IP per physical interface, so that kind of error can occur if one node picks one of `hsn0`'s addresses and its peer picks the other. I suggest you apply the inline patch below, rebuild, and then raise the BTL verbosity so we can see which interfaces and addresses are actually used:

```diff
diff --git a/opal/mca/btl/tcp/btl_tcp.h b/opal/mca/btl/tcp/btl_tcp.h
index 846ee3b..acb4af6 100644
--- a/opal/mca/btl/tcp/btl_tcp.h
+++ b/opal/mca/btl/tcp/btl_tcp.h
@@ -172,6 +172,7 @@ struct mca_btl_tcp_module_t {
struct sockaddr_storage tcp_ifaddr_6; /**< First IPv6 address discovered for this interface, bound as sending address for this BTL */
#endif
uint32_t tcp_ifmask; /**< BTL interface netmask */
+ char tcp_ifname[32];
opal_mutex_t tcp_endpoints_mutex;
opal_list_t tcp_endpoints;
diff --git a/opal/mca/btl/tcp/btl_tcp_component.c b/opal/mca/btl/tcp/btl_tcp_component.c
index 78dee89..012c025 100644
--- a/opal/mca/btl/tcp/btl_tcp_component.c
+++ b/opal/mca/btl/tcp/btl_tcp_component.c
@@ -505,6 +505,7 @@ static int mca_btl_tcp_create(int if_kindex, const char* if_name)
/* initialize the btl */
btl->tcp_ifkindex = (uint16_t) if_kindex;
+ strcpy(btl->tcp_ifname, if_name);
#if MCA_BTL_TCP_STATISTICS
btl->tcp_bytes_recv = 0;
btl->tcp_bytes_sent = 0;
@@ -512,7 +513,7 @@ static int mca_btl_tcp_create(int if_kindex, const char* if_name)
#endif
struct sockaddr_storage addr;
- opal_ifkindextoaddr(if_kindex, (struct sockaddr*) &addr,
+ opal_ifnametoaddr(if_name, (struct sockaddr*) &addr,
sizeof (struct sockaddr_storage));
#if OPAL_ENABLE_IPV6
if (addr.ss_family == AF_INET6) {
@@ -816,6 +817,10 @@ static int mca_btl_tcp_component_create_instances(void)
}
/* if this interface was not found in the excluded list, create a BTL */
if(argv == 0 || *argv == 0) {
+
+ opal_output_verbose(30, opal_btl_base_framework.framework_output,
+ "btl:tcp: Creating instance with interface %d %s",
+ if_index, if_name);
mca_btl_tcp_create(if_index, if_name);
}
}
@@ -1175,6 +1180,9 @@ static int mca_btl_tcp_component_exchange(void)
}
opal_ifindextoname(index, ifn, sizeof(ifn));
+ if (0 != strcmp(ifn, mca_btl_tcp_component.tcp_btls[i]->tcp_ifname)) {
+ continue;
+ }
opal_output_verbose(30, opal_btl_base_framework.framework_output,
"btl:tcp: examining interface %s", ifn);
if (OPAL_SUCCESS !=
@@ -1218,7 +1226,7 @@ static int mca_btl_tcp_component_exchange(void)
opal_ifindextokindex (index);
current_addr++;
opal_output_verbose(30, opal_btl_base_framework.framework_output,
- "btl:tcp: using ipv6 interface %s", ifn);
+ "btl:tcp: using ipv6 interface %s with address %s and ifkindex %d", ifn, opal_net_get_hostname((struct sockaddr*)&my_ss), addrs[current_addr].addr_ifkindex);
}
} /* end of for opal_ifbegin() */
} /* end of for tcp_num_btls */
diff --git a/opal/mca/btl/tcp/btl_tcp_endpoint.c b/opal/mca/btl/tcp/btl_tcp_endpoint.c
index e69cd86..b1d52dd 100644
--- a/opal/mca/btl/tcp/btl_tcp_endpoint.c
+++ b/opal/mca/btl/tcp/btl_tcp_endpoint.c
@@ -752,8 +752,9 @@ static int mca_btl_tcp_endpoint_start_connect(mca_btl_base_endpoint_t* btl_endpo
}
#endif
opal_output_verbose(10, opal_btl_base_framework.framework_output,
- "btl: tcp: attempting to connect() to %s address %s on port %d",
+ "btl: tcp: attempting to connect() to %s from %s address %s on port %d",
OPAL_NAME_PRINT(btl_endpoint->endpoint_proc->proc_opal->proc_name),
+ opal_net_get_hostname((struct sockaddr*) &btl_endpoint->endpoint_btl->tcp_ifaddr),
opal_net_get_hostname((struct sockaddr*) &endpoint_addr),
ntohs(btl_endpoint->endpoint_addr->addr_port));
diff --git a/opal/mca/btl/tcp/btl_tcp_proc.c b/opal/mca/btl/tcp/btl_tcp_proc.c
index c7ee66b..952a327 100644
--- a/opal/mca/btl/tcp/btl_tcp_proc.c
+++ b/opal/mca/btl/tcp/btl_tcp_proc.c
@@ -335,6 +335,9 @@ static mca_btl_tcp_interface_t** mca_btl_tcp_retrieve_local_interfaces(mca_btl_t
}
if (true == skip) {
/* This interface is not part of the requested set, so skip it */
+ opal_output_verbose(20, opal_btl_base_framework.framework_output,
+ "btl:tcp: skipping local interface %s",
+ local_if_name);
continue;
}
@@ -344,6 +347,9 @@ static mca_btl_tcp_interface_t** mca_btl_tcp_retrieve_local_interfaces(mca_btl_t
/* create entry for this kernel index previously not seen */
if (OPAL_SUCCESS != rc) {
index = proc_data->num_local_interfaces++;
+ opal_output_verbose(20, opal_btl_base_framework.framework_output,
+ "btl:tcp: adding local interface %d/%d %s with kindex %d",
+ index, proc_data->num_local_interfaces, local_if_name, kindex);
opal_hash_table_set_value_uint32(&proc_data->local_kindex_to_index, kindex, (void*)(uintptr_t) index);
if( proc_data->num_local_interfaces == proc_data->max_local_interfaces ) {
@@ -356,6 +362,10 @@ static mca_btl_tcp_interface_t** mca_btl_tcp_retrieve_local_interfaces(mca_btl_t
proc_data->local_interfaces[index] = (mca_btl_tcp_interface_t *) malloc(sizeof(mca_btl_tcp_interface_t));
assert(NULL != proc_data->local_interfaces[index]);
mca_btl_tcp_initialise_interface(proc_data->local_interfaces[index], kindex, index);
+ } else {
+ opal_output_verbose(20, opal_btl_base_framework.framework_output,
+ "btl:tcp: already added local interface %s with kindex %d",
+ local_if_name, kindex);
}
local_interface = proc_data->local_interfaces[index];
@@ -551,6 +561,8 @@ int mca_btl_tcp_proc_insert( mca_btl_tcp_proc_t* btl_proc,
for( i = 0; i < proc_data->num_local_interfaces; ++i ) {
mca_btl_tcp_interface_t* local_interface = proc_data->local_interfaces[i];
for( j = 0; j < proc_data->num_peer_interfaces; ++j ) {
+ opal_output_verbose(20, opal_btl_base_framework.framework_output,
+ "btl:tcp: evaluating path from %d/%d to %d/%d", i, proc_data->num_local_interfaces, j, proc_data->num_peer_interfaces);
/* initially, assume no connection is possible */
proc_data->weights[i][j] = CQ_NO_CONNECTION;
```
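Note the extra `opal_output_verbose` messages only show up if the BTL verbosity is raised at run time, for example:

```bash
# The added messages use verbosity levels 20-30, so 30 shows all of them.
export OMPI_MCA_btl_base_verbose=30
# (equivalent to passing --mca btl_base_verbose 30 on the mpirun command line)
```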
I think the issue is totally different, and potentially not in OMPI. We need to split the discussion in two: modex exchange and first handshake.
So, either we screwed up the connection code and bind the socket to the wrong IP, or the kernel does some tricks and uses the first IP on the interface when sending connection requests. Let's check that out.

```diff
diff --git a/opal/mca/btl/tcp/btl_tcp_endpoint.c b/opal/mca/btl/tcp/btl_tcp_endpoint.c
index 28138a6b43..298517307f 100644
--- a/opal/mca/btl/tcp/btl_tcp_endpoint.c
+++ b/opal/mca/btl/tcp/btl_tcp_endpoint.c
@@ -791,6 +791,14 @@ static int mca_btl_tcp_endpoint_start_connect(mca_btl_base_endpoint_t *btl_endpo
CLOSE_THE_SOCKET(btl_endpoint->endpoint_sd);
return OPAL_ERROR;
}
+ char tmp[2][16];
+    inet_ntop(AF_INET, &((struct sockaddr_in *)&btl_endpoint->endpoint_btl->tcp_ifaddr)->sin_addr, tmp[0], 16);
+ inet_ntop(AF_INET, &((struct sockaddr_in *)&endpoint_addr)->sin_addr, tmp[1], 16);
+ opal_output(0, "proc %s bind socket to %s:%d before connecting to peer %s at %s:%d\n",
+ OPAL_NAME_PRINT(OPAL_PROC_MY_NAME),
+ tmp[0], htons(((struct sockaddr_in *) &btl_endpoint->endpoint_btl->tcp_ifaddr)->sin_port),
+ OPAL_NAME_PRINT(btl_endpoint->endpoint_proc->proc_opal->proc_name),
+ tmp[1], ntohs(((struct sockaddr_in *) &endpoint_addr)->sin_port));
}
#if OPAL_ENABLE_IPV6
    if (endpoint_addr.ss_family == AF_INET6) {
```
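Independently of the patch, the source address the kernel actually used can be checked from outside Open MPI while the job appears hung, e.g.:

```bash
# Show the application's TCP sockets with local (source) and peer addresses;
# this tells us whether a connection went out via 10.250.x.x or 128.55.x.x.
# "cp2k" is the process name in this particular run.
ss -tnp | grep cp2k
```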
Sorry for replying late, it's been a busy week. @JBlaschke ACK on all your points. Thanks for all the detail! On the original report, I'm still a little confused -- and I think @bosilca is asking the right questions here:
Is 10.249.13.210 a known IP address on the peer? @ggouaillardet has a good suggestion: build with the inline patch above and raise the verbosity to see what is actually going on.
Because we only keep/report one IP per iface. Multiple IPs on the same iface will only confuse the communication balancer and lead to non-optimal communication scheduling.
@bosilca Ah, ok. So should @JBlaschke run with
He should run as in the last reported test, with
Hi @bosilca, do you mean to run those with the patches you sent? If so, I'll have to build a smaller reproducer. That's no problem, but I won't be able to get around to that until later next week.
First, a task will locally select one IP per physical interface based on the include/exclude settings. If I correctly understand the modex exchange, we will send information for all the interfaces with a previously selected kernel index. Now this is just speculation: I suspect the receiver (of the modex) will keep a single physical interface per peer, but we do not control which one (e.g. if we excluded some interfaces). Bottom line, I was able to reproduce a similar error.
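A possible workaround along these lines: `btl_tcp_if_include` also accepts CIDR notation, so the selection can be pinned to the private subnet(s) instead of to interface names. A sketch (the netmask is a guess and must cover all hsn addresses on all nodes):

```bash
# Keep only addresses in the private high-speed range, so the public
# 128.55.x.x address on hsn0 is never published in the modex.
# 10.248.0.0/14 is a placeholder covering 10.249.x.x and 10.250.x.x;
# verify it against the actual subnets before relying on it.
export OMPI_MCA_btl_tcp_if_include=10.248.0.0/14
```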
@ggouaillardet according to this, a process publishes one IP address per BTL module (and that address was acquired following the include/exclude requirements).
@bosilca You are right about the
You're right -- in 4.1, all addresses of all selected interfaces are stored in the modex. The peer receives them and stores them in the proc's data structures. I wonder if we don't have the same issue in 5.0, but unfortunately I do not have access to the platform to test.
This issue is related to: #5818
I am encountering
when running the CP2K container (https://catalog.ngc.nvidia.com/orgs/hpc/containers/cp2k) on NERSC's Perlmutter system (https://docs.nersc.gov/systems/perlmutter/architecture/). We've tried OpenMPI v4.1.2rc2 and v4.1.5.
Background
Perlmutter's GPU nodes have 4 NICs, each with a private IP address, and one NIC (the one corresponding to the `hsn0` interface) has an additional public IP address -- therefore each node has one NIC with two addresses, and these addresses are in different subnets. E.g.:

(The example above also shows the node management network `nmn` interface -- but MPI shouldn't be talking to that anyway.)

I think the error must be caused by `hsn0`'s two IP addresses on two subnets.