btl: add an rdma only btl for using uct #4919
Conversation
@yosefe Now we can argue about the relative merits of osc/rdma + btl/uct vs osc/ucx. Here are the results so far:
btl/ugni (fastest):
btl/uct (still very good, though the aries tl is not optimal):
osc/ucx (really bad for small messages):
This is on a Cray XC system with Haswell processors. You do not want to see the osc/ucx results @ 32 threads or on KNL. It gets really, really bad.
I do have to say, uct is a very reasonable fit for a btl. The only semantic that doesn't completely match is completion (local vs. remote completion). This is fine for osc/rdma, though.
@hjelmn Can you post the data for thread single?
@hjelmn Is this an OSU benchmark?
@hjelmn Nathan, is this the benchmark you are using? http://www.cs.sandia.gov/smb/rma-mt.html
@xinzhao3 It's a modified version that also pins the worker threads. I can't release it until I get an OK from LANL. I will get you access once I have the OK, though it could be a while (lawyers). Single-threaded with OSU, though something is wrong since the high-end bandwidth is way low. btl/uct:
osc/ucx:
Latencies are almost identical. There is a difference in small-message bandwidth, which is probably due to different protocol selection and tuning. With rmamt and 1 thread, osc/ucx is slower.
rmamt, 1 thread: btl/uct:
osc/ucx:
Hmm, osc/ucx is slower for 1 byte and faster for some other sizes. I strongly suspect it is because there is an extra progress call in the btl put function; this helps the multi-threaded case because it spreads the load of reaping completions across all the threads. Then again, maybe not. Will investigate, as it should be a wash with a single thread.
Looks like most of the small-message difference is having to unpack the rkey on every call. This is a small mismatch between the uct and btl interfaces that will hopefully be corrected in a future version of ucx.
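For context, a rough sketch of where that per-call cost comes from. The UCT calls (uct_rkey_unpack, uct_ep_put_short, uct_rkey_release) are real API of that era (the two-argument uct_rkey_unpack; newer UCX adds a component argument), but the surrounding put function is hypothetical:

```c
#include <uct/api/uct.h>

/* Hypothetical put path: because the BTL interface hands the remote key
 * around as an opaque packed buffer, the rkey has to be unpacked on every
 * operation instead of once when the peer's key is first received. */
static int btl_uct_put_sketch (uct_ep_h ep, const void *local, uint64_t remote,
                               const void *packed_rkey, unsigned size)
{
    uct_rkey_bundle_t rkey;

    /* this unpack is the per-call overhead discussed above */
    if (UCS_OK != uct_rkey_unpack (packed_rkey, &rkey)) {
        return -1;
    }

    ucs_status_t status = uct_ep_put_short (ep, local, size, remote, rkey.rkey);

    uct_rkey_release (&rkey);

    return (UCS_OK == status) ? 0 : -1;
}
```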
@hjelmn Thanks Nathan. Could you release the MT code once it is allowed? I want to try it on a Mellanox machine.
Sure. It would be interesting to see how btl/uct holds up there; ugni is not the optimal network for ucx. I pushed some osc/rdma updates that should be used for any performance comparisons.
@hjelmn This is interesting. I would like to give it a try on Cori, but I'm not sure how to properly use this. I built with UCX, but when I run ompi_info, uct doesn't show up as a BTL. Am I doing something wrong? What is the proper command to use this? Here is what I tried (and failed?):
@thananon It needs openucx installed, which you can build yourself. I tested with master and it works on Trinity. I also recommend totally disabling pml/ucx, because there seem to be some issues with it on Crays right now. Also, do not disable btl/ugni; that is needed for two-sided communication, as btl/uct is only for OSC.
Open MPI configure line:
./configure --with-ucx=/path/to/ucx --enable-mca-no-build pml-ucx
Run:
mpirun --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls uct
btl/uct will not be able to match btl/ugni, but the performance is not bad. BTW, if you keep it to yourself, I can get you a copy of the updated btl/ugni that I am working on. It shows almost perfect thread scaling with the RMA-MT benchmarks. I plan to push it for Open MPI v4.0.0.
opal/mca/btl/uct/btl_uct_rdma.c (outdated)
```c
            continue;
        }

        mca_btl_uct_context_lock (context);
```
@hjelmn I think you'll get better performance if you use trylock and allow multiple threads to independently call progress on multiple workers. One thread is unlikely to fully utilize the NIC.
I meant that, as it is now, one thread can be held up by another thread that is holding the lock of context 0, for example. If you use trylock, you can skip context 0 and go to context 1.
If I'm not mistaken, one concerning thing is that if the number of contexts is less than the number of threads, this flush in one thread can block other threads from posting new requests while it waits for the current work to finish.
The problem is the requirements of MPI_Win_flush. All operations targeting a remote process (or all remote processes) need to be flushed regardless of which thread started them, so we can't skip a device context. This is something the endpoints proposal would be able to help with. In theory flushing everything will have an impact, but without many apps it is hard to tell what the real impact is. I can modify the RMA-MT benchmarks to include this scenario and see how we might work around this limitation without endpoints. It might require tweaking the BTL interface (which is fine).
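To illustrate the constraint, here is a minimal sketch with hypothetical structure and function names (only uct_iface_flush and uct_worker_progress are real UCT calls): a flush has to visit every device context with a blocking lock, because any thread may have started operations on any context.

```c
#include <pthread.h>
#include <uct/api/uct.h>

typedef struct {                      /* hypothetical per-device-context state */
    pthread_mutex_t lock;
    uct_worker_h    worker;
    uct_iface_h     iface;
} btl_uct_device_context_sketch_t;

/* Flush every context. Unlike a progress loop, this cannot trylock and
 * skip: operations started by *any* thread on *any* context must be
 * complete before the flush returns. */
static void btl_uct_flush_all_sketch (btl_uct_device_context_sketch_t *contexts,
                                      int num_contexts)
{
    for (int i = 0 ; i < num_contexts ; ++i) {
        pthread_mutex_lock (&contexts[i].lock);   /* must block, not trylock */

        /* drive the worker until all outstanding operations are complete */
        while (UCS_INPROGRESS == uct_iface_flush (contexts[i].iface, 0, NULL)) {
            uct_worker_progress (contexts[i].worker);
        }

        pthread_mutex_unlock (&contexts[i].lock);
    }
}
```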
But you can still let multiple threads do the progress. You can skip a context, but do multiple iterations on top of the for (i = 0; i < num_contexts; i++) loop to make sure everything is done. I'd guess this would have better performance, since multiple threads will progress.
Good point. I will make that change.
@hjelmn I think the main advantage of this implementation is multiple contexts. I like the idea of static thread-local variables to hold the context ID assigned to the thread: very simple and yet efficient. Having multiple ucp contexts/workers will improve osc/ucx as well; we are looking into that now. We are using a benchmark that I wrote some time ago. We will put it into public access this week, I think. I will update.
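A minimal sketch of what that thread-local assignment could look like (hypothetical names; the real btl/uct code may differ): the first time a thread enters the BTL it grabs the next context index atomically and caches it in a thread-local, so later operations from that thread need no coordination.

```c
#include <stdatomic.h>

static atomic_int next_context_sketch = 0;       /* global round-robin counter */
static _Thread_local int my_context_sketch = -1; /* this thread's cached index */

/* Return this thread's device-context index, assigning one on first use. */
static inline int btl_uct_context_index_sketch (int num_contexts)
{
    if (-1 == my_context_sketch) {
        my_context_sketch =
            atomic_fetch_add (&next_context_sketch, 1) % num_contexts;
    }

    return my_context_sketch;
}
```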
Also, can you provide results for methods other than "flush"?
Yes, multiple contexts is one of the advantages. The other (for me) is maintainability of the code. As far as Open MPI RMA is concerned, there is very little benefit to treating UCX as anything other than a byte-transport layer. The bandwidth difference between osc/ucx and osc/rdma is going to come down to how well each has been optimized. And I would be very careful about how you add multiple contexts to the OSC: an app can use any number of windows, and with the btl strategy they all use the same shared resources. Resource sharing can be done in the osc component, but the multi-threaded performance of osc/ucx is the least of my concerns; its scalability is a far bigger concern. This is part of the reason I recommended last year not targeting ucx with an osc component.
Thank you for the clarification.
There is little performance difference when changing the synchronization method. At this small a scale, lock, lock-all, PSCW, and fence are all about the same. That changes at larger scale (especially fence), but the benchmark doesn't yet look at anything other than single-pair communication.
@artpol84 A lot of work has been put into reducing the memory footprint, handling locking efficiently, and the time scaling of osc/rdma. With a new component you are either starting from scratch or just copying osc/rdma, and at that point, why bother? A btl is under 4k lines of code including comments, while a well-optimized osc component is much larger. Right now the lock-all scaling of osc/ucx is bad. I mean really bad. O(n) bad.
Also, I am going to work on a PR sometime later this month or next that would (at compile time) inline a btl with osc/rdma. It will be some work but will reduce the overhead. We have a similar (but less complicated) concept for the pml.
But that is a question of maintenance. I was interested in the scalability concern. Is it something in particular, or about the strategic decision of adding osc/ucx?
It's nothing that can't be fixed with work. Just saying the work really isn't worth it, since it has already been done once.
I see, thank you very much!
@artpol84 No problem :)
I do plan to commit this btl in the near future (in its current disabled-by-default form). I need to work through some bugs when using the target hardware, though. Should have that done in the next several weeks.
@hjelmn Is this in a shape where I could test it on an InfiniBand EDR system? Can I apply this PR to master/3.1.0rc4, or do you plan to make more changes?
@angainor Yeah, I would wait on using this with InfiniBand. I plan to have it working for RMA on InfiniBand in the next couple of weeks.
@xinzhao3 That took forever. You can find the latest RMA-MT benchmarks at http://github.com/hpc/rma-mt
By request, I am currently adding AM features to this BTL. So far it is working with RC and UD. The performance is good (at least on ARM), and pml/ob1 + btl/uct is fairly close to pml/ucx for two-sided. I plan to finish the code changes this week and update this PR. The component will still be disabled by default.
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/446018bf818e671422ba235105d12cdc
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/ec89369be352ab82143690f09977c0cb
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/1395ad33819ee94538c4b4d297a65950
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/16670bcec69b62c813fae1662fadf079
Should be good to go now. Still disabled by default. Tested on multiple ib/mlx5 systems and a ugni/aries system. This BTL should be a suitable replacement for btl/openib when using Mellanox IB.
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/d1348cfc899fd700bfd3a6d26d758a8b
This looks great!
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/a58c9e5d5e2008bcca70a9d36255387c
^&$^#. Still can't figure out why the -luct is missing. Back to the configury.
Isn't that ucp?
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/d94b66fd2c4260dfdbcc5c3f17d5acfd
@thananon Yeah, I think something I did screwed up the configury for pml/ucx. All ucx components in my tree are just getting -lucp, but sometimes it looks like they just get -luct (maybe?). I am now just adding -luct to the btl to see if that resolves it. I can figure out a better fix later.
@thananon Note the option to enable btl/uct is now
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/4bd054b89a3a8ace61cce301e3b378a8
This commit adds a new btl for one-sided and two-sided communication. This btl uses the uct layer in OpenUCX. It makes use of multiple uct contexts and per-thread device pinning to provide good performance when using threads and osc/rdma. It has been tested extensively with osc/rdma and passes all MTT tests on aries and IB hardware.
For now this new component disables itself but can be enabled by setting the btl_uct_memory_domains MCA variable with a comma-delimited list of supported memory domains. For example: --mca btl_uct_memory_domains ib/mlx5_0. The specific transports used can be selected using --mca btl_uct_transports. The default is to use any available transport.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Was a missing line in the Makefile.am :D
This commit adds a new btl for one-sided communication only. This btl uses the uct layer in OpenUCX. It makes use of multiple uct contexts and device pinning to provide good performance when using threads and osc/rdma. This btl cannot be used with pml/ob1 at this time. It has been tested extensively with osc/rdma and passes all MTT tests on aries hardware.
For now this new component disables itself but can be enabled by setting the btl_uct_transports MCA variable with a comma-delimited list of supported memory domains/transport layers. For example: --mca btl_uct_transports ugni-ugni_rdma.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>