
Fix OSC RDMA to work in cases where one BTL cannot reach all communicator group members #7830

Closed
jsquyres opened this issue Jun 16, 2020 · 36 comments


@jsquyres
Member

Cisco MTT shows that MPI_WIN_CREATE is failing since the OSC pt2pt component was removed (i.e., MPI_WIN_CREATE fails because there is no OSC component available). Here's an example of a simple test failing because MPI_WIN_CREATE failed.

After discussion on the 16 June 2020 weekly Open MPI webex, @hjelmn said that he could add a few functions to make OSC RDMA work in all cases (not just in cases where a single BTL can reach all members in the communicator group).

Marking this as a blocker for v5.0 because it's preventing MPI_WIN_CREATE from working in at least some environments, and because -- per the discussion between @hjelmn and @bwbarrett -- the fix to OSC RDMA doesn't seem too difficult.

@hjelmn
Member

hjelmn commented Jun 23, 2020

I have the implementation mostly complete. It requires reworking how BTL receive completions are handled, but that is a good thing; I have been looking at reworking those for years. I have a branch with the changes, but I cannot test usnic or portals4. I have verified vader, tcp, and self, and will validate uGNI and UCX today.

@jsquyres
Member Author

Send me the branch and what tests to run, and I'll check usnic.

@hjelmn
Member

hjelmn commented Jun 23, 2020

https://github.com/hjelmn/ompi/tree/btl_base_atomics_are_awesome

This only has the changes to the BTL receive logic. Please run a standard set of two-sided tests; IMB would work. Feel free to push back on any changes btl/usnic needs.

The old receive callback took the btl, tag, descriptor, and cbdata as separate arguments. The new callback takes the btl and a new receive descriptor that contains the tag, endpoint, received message, and cbdata.

@hjelmn
Member

hjelmn commented Jun 23, 2020

Really curious if there are any performance regressions. There shouldn't be but that is going to be where any pushback will be.

@hjelmn
Member

hjelmn commented Jul 6, 2020

@jsquyres Ping. Please take a look at the branch mentioned above. I need to know whether this change works for you, as I cannot verify the usnic BTL changes. I want to get this change approved and merged this week so I can finish the BTL base atomics support.

@jsquyres
Member Author

jsquyres commented Jul 7, 2020

@hjelmn Sorry for the delay. I ran the https://github.com/hjelmn/ompi/tree/btl_base_atomics_are_awesome branch last night and things generally look good. I'm getting some random failures, but I'm apparently getting some random failures with master, too. Specifically: with the btl_base_atomics_are_awesome branch, I'm seeing most (all?) of the one-sided tests that were failing are now passing.

We'll have to address the other random failures in another issue.

@jsquyres
Member Author

This is still causing terabytes of corefiles with Cisco's MTT that I regularly have to manually clear out to avoid filling my filesystem.

I'm going to have to disable Cisco MTT's one-sided testing until this is fixed.

@jsquyres
Member Author

I don't know if this is related to #7892 or not, but I figured I'd put in the cross-reference.

@gpaulsen
Member

Austen said he has some time next week to take a look. Thanks!

@awlauria
Contributor

osc/rdma internally calls MPI_Comm_split_type() with a node-level split type to create a local group, and it is getting some interesting/weird results. For example, I'm seeing the returned communicator's size be 3 when running 4 processes across 2 nodes, with 2 procs per node.

I have to think that #9010 is related.

@gpaulsen
Member

@jsquyres @awlauria #9010 has been closed (some PMIx/PRRTE change, not root caused), so perhaps this issue also just needs to be retested?

@awlauria
Contributor

It helps but I still run into issues.

Unfortunately I haven't had time to press on it much since the last update, and am out next week. Earliest I can give cycles on it would be the 28th.

@gpaulsen gpaulsen reopened this Aug 31, 2021
@gpaulsen
Member

oops, I accidentally clicked close, then re-opened it immediately.

@hjelmn
Member

hjelmn commented Sep 18, 2021

Same issue as described here? Where the comm split is wrong?

@bosilca
Member

bosilca commented Sep 18, 2021

The communicator seems to be correct. I attached to one of the processes before it segfaulted; the issue is in osc_rdma_lock.h, in the function ompi_osc_rdma_btl_fop, where the state_index is -1. The stack and peer variables do not look corrupted or uninitialized, but both btl_index fields (state and data) of the peer object are set to 255 (the -1 stored in an unsigned byte), and the selected_btl then points to some memory location outside the allocated array of BTLs.

@wzamazon
Contributor

@bosilca Can you elaborate on the error you are seeing? What type of BTL? What test? This may be related to #8983.

@bosilca
Member

bosilca commented Sep 24, 2021

Any test from the onesided directory in ompi-tests. BTLs: tcp and sm. As I said above, I see either segfaults or deadlocks.

@rhc54
Contributor

rhc54 commented Sep 24, 2021

FWIW: the Cisco MTT now only shows segfaults from onesided tests and from the datatype "pack" function. It is otherwise running pretty well. Strictly tcp and sm there.

@bosilca
Member

bosilca commented Sep 24, 2021

I applied the patch from #9400 and reran the tests. Similar story as before: segfaults and deadlocks. To give a precise example, here is how I run it:

$ salloc -N 2 -wa,b
$ mpirun -np 4 --map-by node --mca osc rdma --mca btl tcp,self --mca btl_tcp_if_include ib0 ./test_dan1

The test segfaults in the lock_unlock stage. However, the same test also segfaults when using UCX:

$ mpirun -np 4 --map-by node --mca pml ucx ./test_dan1

@wzamazon
Contributor

@bosilca I can reproduce the segfault when running test_dan1, I will look into it.

@wzamazon
Contributor

@bosilca The segfault in the lock_unlock stage of test_dan1 is caused by the locking type being two-level. With two-level locking, a process first locks a global leader and then locks the peer. The segfault happens when the global leader is the process itself: because btl/tcp cannot do self communication, btl_index is 255 and selected_btl is NULL. Adding --mca osc_rdma_locking_mode on_demand fixes this segfault. If I understand correctly, two-level locking is an optimization, not a mandate, so we should just disable it when btl/tcp is used.

However, even with on_demand locking, I encountered numerous issues (hangs, segfaults, and data corruption) when running other tests under ompi-tests/onesided.

I will make a list and keep looking into it.

@hjelmn
Member

hjelmn commented Sep 24, 2021

Honestly let's just change the default to on demand locking in this case. The performance is not bad with on demand.

@wzamazon
Contributor

I just updated #9400 to address several new issues I found.

The biggest issue continues to be that btl/tcp does not support self communication. To address it, I added b0197d9, which uses btl/self for self communication (when the selected BTL does not support it).

To make it work, I had to do some refactoring, make some changes to btl/self, and change osc/rdma.

Then there is another bug in osc/rdma: the shm file name is not unique. I added 3c0ae03 to address it.

With the current PR, most of the tests under the ompi-tests/onesided directory pass, except:

One negative test in t_winerror fails. The test tries to free a non-existent window and expects MPI_Win_free() to return an error; however, the error is treated as fatal, so MPI aborts. I am not sure how (or whether) we can fix it. test_start1 has a similar issue.

lock4 and get2 randomly hang on two nodes; I will continue investigating.

put1 and put4 show data validation errors on two nodes; I will continue investigating.

@hjelmn
Member

hjelmn commented Sep 28, 2021

Catching up. Can btl/sm not be used? It allows self communication. When using tcp it should always also use btl/sm.

@hjelmn
Member

hjelmn commented Sep 28, 2021

But I could be remembering incorrectly :).

@wzamazon
Contributor

Can btl/sm not be used?

Problem is people are running tests with the sm btl excluded (by specifying --mca btl self,tcp).

I do not have the full background. I wonder whether there is an agreement that btl/tcp must be used together with btl/sm (for osc/rdma)?

@gpaulsen
Member

gpaulsen commented Dec 2, 2021

@bosilca You're assigned to this, but are you working on it?

@wzamazon
Contributor

wzamazon commented Dec 2, 2021

#9696 will fix this issue too.

@awlauria
Contributor

@jsquyres can we close this?

@awlauria
Contributor

@jsquyres ping. can this be closed?

@awlauria
Contributor

I think this is done. @jsquyres, if this is still an issue, please re-open.
