Disable atomics in OBJ_RETAIN/RELEASE during THREAD_SINGLE #1902

Closed
jsquyres opened this issue Jul 26, 2016 · 13 comments

Comments

@jsquyres
Member

@nysal brought up on the 2016-07-26 webex that the code still uses atomics for OBJ_RETAIN/RELEASE even in the THREAD_SINGLE case.

@hjelmn noted on the call that this is mainly for historical / code-structure reasons, but it should probably be fixed (i.e., there's no need to use atomics in the THREAD_SINGLE case). On the call, he proposed splitting some header files to eliminate the circular dependency that forces the code to be structured the way it currently is.

@nysal said he would take a whack at this. I'm tentatively marking this as v2.0.1 -- this could easily be considered a performance bug. Depending on how the fix shakes out, it may slip to a later milestone.

My question, however, is: is there a difference in thread-safety requirements for RETAIN/RELEASE between single vs. multiple application threads and single vs. multiple Open MPI implementation threads? I.e., is there a case where we still need the atomic-enabled RETAIN/RELEASE macros even in a THREAD_SINGLE scenario?

@jsquyres jsquyres added this to the v2.0.1 milestone Jul 26, 2016
@hjelmn
Member

hjelmn commented Jul 26, 2016

There is a case for atomics in THREAD_SINGLE if async progress is enabled. The change should be to use atomics if opal_using_threads() is true.
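
For illustration, here is a minimal sketch of that guard (this is not the actual Open MPI macro; the opal_atomic_add_32 helper and the obj_reference_count field are assumptions based on the existing OPAL object code):

/* Hypothetical sketch only -- not the real OBJ_RETAIN definition.
 * Use an atomic increment only when opal_using_threads() reports that
 * threads (application or async progress) are active; otherwise use a
 * plain increment. */
#define OBJ_RETAIN_SKETCH(object)                                   \
    do {                                                            \
        if (opal_using_threads()) {                                 \
            opal_atomic_add_32(&(object)->obj_reference_count, 1);  \
        } else {                                                    \
            (object)->obj_reference_count++;                        \
        }                                                           \
    } while (0)

OBJ_RELEASE would get the same treatment, decrementing and running the destructor when the count reaches zero.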

@hjelmn
Member

hjelmn commented Jul 26, 2016

@nysal There is actually a bit more cleanup that could be done. I have a little time today to take a crack at it. It will probably need a little work, but I can at least give an idea of how I think things should be structured.

@nysal
Member

nysal commented Jul 27, 2016

@hjelmn Sure, let me know how you want it structured. I was initially planning to move opal_using_threads() and its closely tied APIs (e.g., opal_set_using_threads()) to, say, opal/threads/threads_base.h, and then include that new header in opal/class/opal_object.h. That breaks the circular dependency. The second optimization was to skip reference counting for MPI_COMM_WORLD and the predefined datatypes; the atomics hurt us more there than a branch would. That optimization should also help multithreaded message-rate benchmarks.
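
As a rough illustration of that second optimization (the COMM_IS_PREDEFINED check and the helper name below are hypothetical, not part of the actual patch), the idea is to take a predictable branch instead of an atomic for predefined handles:

/* Hypothetical illustration only. COMM_IS_PREDEFINED() stands in for
 * however the real code identifies MPI_COMM_WORLD and other predefined
 * handles; those are never freed, so skipping the refcount is safe. */
static inline void comm_retain_if_needed(ompi_communicator_t *comm)
{
    if (!COMM_IS_PREDEFINED(comm)) {
        OBJ_RETAIN(comm);
    }
}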

@nysal
Member

nysal commented Jul 27, 2016

I ran the phloem SQMR benchmark with the vader BTL (master branch) to measure the message rate, using 2 tasks (--bind-to core --map-by socket) on a Power8 system. With the patch to disable atomic refcounts, I see roughly a 30% improvement in small-message message rate when the application is single-threaded.

Current master branch:
# SQMR v1.0.0 - MPI maximal message rate benchmark
# Run at 07/27/16 00:43:41, with rank 0 on c712f5n10.pok.stglabs.ibm.com
#
# MPI tasks per node                 : 1
# Neighbor tasks                     : 1
# Iterations per message size        : 100000
# Send/Recv operations per iteration : 64
#
#                            average               max                  min
# msgsize iters time     msgs/sec MiB/sec     msgs/sec MiB/sec     msgs/sec MiB/sec
       0 100000  1.55   8233609.02    0.00   8233609.02    0.00   8233609.02    0.00
       1 80000  1.41   7255598.16    6.92   7255598.16    6.92   7255598.16    6.92
       2 64000  1.12   7296221.49   13.92   7296221.49   13.92   7296221.49   13.92
       4 51200  0.96   6812262.04   25.99   6812262.04   25.99   6812262.04   25.99
       8 40960  0.73   7193249.92   54.88   7193249.92   54.88   7193249.92   54.88
      16 32768  0.64   6533519.04   99.69   6533519.04   99.69   6533519.04   99.69
      32 26215  0.51   6525622.45  199.15   6525622.45  199.15   6525622.45  199.15
      64 20972  0.71   3783185.88  230.91   3783185.88  230.91   3783185.88  230.91


With the patch for disabling atomic refcounts:

# SQMR v1.0.0 - MPI maximal message rate benchmark
# Run at 07/27/16 01:21:44, with rank 0 on c712f5n10.pok.stglabs.ibm.com
#
# MPI tasks per node                 : 1
# Neighbor tasks                     : 1
# Iterations per message size        : 100000
# Send/Recv operations per iteration : 64
#
#                            average               max                  min
# msgsize iters time     msgs/sec MiB/sec     msgs/sec MiB/sec     msgs/sec MiB/sec
       0 100000  1.18  10886352.25    0.00  10886352.25    0.00  10886352.25    0.00
       1 80000  1.12   9109542.92    8.69   9109542.92    8.69   9109542.92    8.69
       2 64000  0.89   9167126.68   17.48   9167126.68   17.48   9167126.68   17.48
       4 51200  0.77   8500162.25   32.43   8500162.25   32.43   8500162.25   32.43
       8 40960  0.58   9079650.51   69.27   9079650.51   69.27   9079650.51   69.27
      16 32768  0.51   8287911.44  126.46   8287911.44  126.46   8287911.44  126.46
      32 26215  0.40   8417103.82  256.87   8417103.82  256.87   8417103.82  256.87
      64 20972  0.61   4365054.60  266.42   4365054.60  266.42   4365054.60  266.42

@nysal
Member

nysal commented Jul 27, 2016

This is the patch I used for the previously reported benchmark: nysal@7a1a9d9

The MPI_COMM_WORLD and predefined-datatype refcount optimization is not included in this commit; I will add it as a separate commit since it seems logically separate. @hjelmn This is not yet ready to commit -- I just put it out there to get comments.

@hjelmn
Member

hjelmn commented Jul 27, 2016

@nysal That is also what I have in mind. I also want to clean up the OPAL_THREAD macros a little bit as part of this. Should have that done soon.

@jladd-mlnx
Member

👍 Nice, @nysal!

@hjelmn
Member

hjelmn commented Jul 28, 2016

Looks like we actually did fix this with a one-off for v1.8, but I think we decided at the time to hold off on fixing master. It has been some time, so I do not remember exactly why; I was kinda busy around the time we fixed this in v1.8. Anyway, we completely lost track of this issue until @nysal re-discovered it. Here is the v1.8 fix: open-mpi/ompi-release@ce91307. I will open a PR to cherry-pick that onto master and will leave the cleanup for a later commit.

@nysal
Member

nysal commented Jul 28, 2016

@hjelmn Yes, that indeed looks exactly like my patch. Please do bring it over to master and v2.x. I'll open a PR for the MPI_COMM_WORLD and predefined-datatype refcounts today.

@hjelmn
Member

hjelmn commented Jul 28, 2016

@nysal Will do that now. GitHub went down right after I posted that comment.

hjelmn added a commit to hjelmn/ompi-release that referenced this issue Jul 28, 2016
This commit adds opal_using_threads() protection around the atomic
operation in OBJ_RETAIN/OBJ_RELEASE. This resolves the performance
issues seen when running psm with MPI_THREAD_SINGLE.

To avoid issues with header dependencies opal_using_threads() has been
moved to a new header (thread_usage.h). The OPAL_THREAD_ADD* and
OPAL_THREAD_CMPSET* macros have also been relocated to this header.

This was cherry-picked off a fix applied to v1.8 and not master. See
open-mpi/ompi#1902.

(cherry picked from commit ce91307)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
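
For context, the OPAL_THREAD_ADD*-style macros mentioned in this commit message use the same guarded pattern; a simplified sketch follows (not the exact definition that lands in thread_usage.h, which may add branch-prediction hints and other width variants):

/* Simplified sketch of the guarded-add pattern: atomic add when threads
 * are in use, plain add otherwise. */
#define THREAD_ADD32_SKETCH(addr, delta)            \
    (opal_using_threads() ?                         \
        opal_atomic_add_32((addr), (delta)) :       \
        (*(addr) += (delta)))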
hjelmn added a commit to hjelmn/ompi that referenced this issue Jul 28, 2016
This commit adds opal_using_threads() protection around the atomic
operation in OBJ_RETAIN/OBJ_RELEASE. This resolves the performance
issues seen when running psm with MPI_THREAD_SINGLE.

To avoid issues with header dependencies opal_using_threads() has been
moved to a new header (thread_usage.h). The OPAL_THREAD_ADD* and
OPAL_THREAD_CMPSET* macros have also been relocated to this header.

This commit is cherry-picked off a fix that was submitted for the v1.8
release series but never applied to master. This fixes part of the
problem reported by @nysal in open-mpi#1902.

(cherry picked from commit open-mpi/ompi-release@ce91307)

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
@hjelmn
Member

hjelmn commented Jul 28, 2016

Ok, merged. I have a PR open for 2.0.1 as well.

@jsquyres
Member Author

jsquyres commented Aug 2, 2016

PR for v2.0.1: open-mpi/ompi-release#1287

@rhc54
Contributor

rhc54 commented Aug 9, 2016

Fixed by open-mpi/ompi-release#1287

@rhc54 rhc54 closed this as completed Aug 9, 2016
bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 3, 2016