Disable atomics in OBJ_RETAIN/RELEASE during THREAD_SINGLE #1902

Closed
jsquyres opened this issue Jul 26, 2016 · 13 comments

Comments

@jsquyres
Member

@nysal brought up on the 2016-07-26 webex that the code still uses atomics for OBJ_RETAIN/RELEASE even in the THREAD_SINGLE case.

@hjelmn noted on the call that this is mainly for historical / code-structure reasons, but it should probably be fixed (i.e., there's no need to use atomics in the THREAD_SINGLE case). On the call, he proposed splitting some header files to eliminate the circular dependency that forces the code to be structured the way it currently is.

@nysal said he would take a whack at this. I'm tentatively marking this as v2.0.1 -- this could easily be considered a performance bug. Depending on how the fix shakes out, it may slip to a later milestone.

My question, however, is: is there a difference in thread-safety requirements for RETAIN/RELEASE between single vs. multiple application threads and single vs. multiple Open MPI implementation threads? I.e., is there a case where we still need the atomic-enabled RETAIN/RELEASE macros even in a THREAD_SINGLE scenario?

@jsquyres jsquyres added this to the v2.0.1 milestone Jul 26, 2016
@hjelmn
Member

hjelmn commented Jul 26, 2016

There is a case for atomics in THREAD_SINGLE if async progress is enabled. The change should be to use atomics if opal_using_threads() is true.
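
For illustration, here is a minimal sketch of that guard (this is not the actual Open MPI macro; the opal_atomic_add_32 helper and the obj_reference_count field are assumptions based on the existing OPAL object code):

/* Hypothetical sketch only -- not the real OBJ_RETAIN definition.
 * Use an atomic increment only when opal_using_threads() reports that
 * threads (application or async progress) are active; otherwise use a
 * plain increment. */
#define OBJ_RETAIN_SKETCH(object)                                   \
    do {                                                            \
        if (opal_using_threads()) {                                 \
            opal_atomic_add_32(&(object)->obj_reference_count, 1);  \
        } else {                                                    \
            (object)->obj_reference_count++;                        \
        }                                                           \
    } while (0)

OBJ_RELEASE would get the same treatment, decrementing and running the destructor when the count reaches zero.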

@hjelmn
Member

hjelmn commented Jul 26, 2016

@nysal There is actually a bit more cleanup that could be done. I have a little time today to take a crack at it. It will probably need a little work, but I can at least give an idea of how I think things should be structured.

@nysal
Member

nysal commented Jul 27, 2016

@hjelmn Sure, let me know how you want it structured. I was initially planning to move opal_using_threads() and its closely tied APIs (e.g., opal_set_using_threads()) to, say, opal/threads/threads_base.h, and then include that new header in opal/class/opal_object.h. That breaks the circular dependency. The second optimization was to skip reference counting for MPI_COMM_WORLD and the predefined datatypes; the atomics hurt us more there than a branch would. That optimization should also help multithreaded message-rate benchmarks.
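
As a rough illustration of that second optimization (the COMM_IS_PREDEFINED check and the helper name below are hypothetical, not part of the actual patch), the idea is to take a predictable branch instead of an atomic for predefined handles:

/* Hypothetical illustration only. COMM_IS_PREDEFINED() stands in for
 * however the real code identifies MPI_COMM_WORLD and other predefined
 * handles; those are never freed, so skipping the refcount is safe. */
static inline void comm_retain_if_needed(ompi_communicator_t *comm)
{
    if (!COMM_IS_PREDEFINED(comm)) {
        OBJ_RETAIN(comm);
    }
}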

@nysal
Member

nysal commented Jul 27, 2016

I ran the phloem SQMR benchmark with the vader BTL (master branch) to measure the message rate, using 2 tasks (--bind-to core --map-by socket) on a Power8 system. With the patch to disable atomic refcounts, I see roughly a 30% improvement in small-message message rate when the application is single-threaded.

Current master branch:
# SQMR v1.0.0 - MPI maximal message rate benchmark
# Run at 07/27/16 00:43:41, with rank 0 on c712f5n10.pok.stglabs.ibm.com
#
# MPI tasks per node                 : 1
# Neighbor tasks                     : 1
# Iterations per message size        : 100000
# Send/Recv operations per iteration : 64
#
#                            average               max                  min
# msgsize iters time     msgs/sec MiB/sec     msgs/sec MiB/sec     msgs/sec MiB/sec
       0 100000  1.55   8233609.02    0.00   8233609.02    0.00   8233609.02    0.00
       1 80000  1.41   7255598.16    6.92   7255598.16    6.92   7255598.16    6.92
       2 64000  1.12   7296221.49   13.92   7296221.49   13.92   7296221.49   13.92
       4 51200  0.96   6812262.04   25.99   6812262.04   25.99   6812262.04   25.99
       8 40960  0.73   7193249.92   54.88   7193249.92   54.88   7193249.92   54.88
      16 32768  0.64   6533519.04   99.69   6533519.04   99.69   6533519.04   99.69
      32 26215  0.51   6525622.45  199.15   6525622.45  199.15   6525622.45  199.15
      64 20972  0.71   3783185.88  230.91   3783185.88  230.91   3783185.88  230.91


With the patch for disabling atomic refcounts:

# SQMR v1.0.0 - MPI maximal message rate benchmark
# Run at 07/27/16 01:21:44, with rank 0 on c712f5n10.pok.stglabs.ibm.com
#
# MPI tasks per node                 : 1
# Neighbor tasks                     : 1
# Iterations per message size        : 100000
# Send/Recv operations per iteration : 64
#
#                            average               max                  min
# msgsize iters time     msgs/sec MiB/sec     msgs/sec MiB/sec     msgs/sec MiB/sec
       0 100000  1.18  10886352.25    0.00  10886352.25    0.00  10886352.25    0.00
       1 80000  1.12   9109542.92    8.69   9109542.92    8.69   9109542.92    8.69
       2 64000  0.89   9167126.68   17.48   9167126.68   17.48   9167126.68   17.48
       4 51200  0.77   8500162.25   32.43   8500162.25   32.43   8500162.25   32.43
       8 40960  0.58   9079650.51   69.27   9079650.51   69.27   9079650.51   69.27
      16 32768  0.51   8287911.44  126.46   8287911.44  126.46   8287911.44  126.46
      32 26215  0.40   8417103.82  256.87   8417103.82  256.87   8417103.82  256.87
      64 20972  0.61   4365054.60  266.42   4365054.60  266.42   4365054.60  266.42

@nysal
Member

nysal commented Jul 27, 2016

This is the patch I used for the previously reported benchmark: nysal@7a1a9d9

The MPI_COMM_WORLD and predefined-datatype refcount optimization is not included in this commit; I will add it as a separate commit since it seems logically separate. @hjelmn This is not yet ready to commit -- I just put it out there to get comments.

@hjelmn
Member

hjelmn commented Jul 27, 2016

@nysal That is also what I have in mind. I also want to clean up the OPAL_THREAD macros a little bit as part of this. Should have that done soon.

@jladd-mlnx
Member

👍 Nice, @nysal!

@hjelmn
Member

hjelmn commented Jul 28, 2016

Looks like we actually did fix this with a one-off for v1.8, but I think we decided at the time to hold off on fixing master. It has been some time, so I do not remember exactly why; I was kinda busy around the time we fixed this in v1.8. Anyway, we completely lost track of this issue until @nysal re-discovered it. Here is the v1.8 fix: open-mpi/ompi-release@ce91307. I will open a PR to cherry-pick that onto master and will leave the cleanup for a later commit.

@nysal
Member

nysal commented Jul 28, 2016

@hjelmn Yes, that indeed looks exactly like my patch. Please do bring it over to master and v2.x. I'll open a PR for the MPI_COMM_WORLD and predefined-datatype refcounts today.

@hjelmn
Member

hjelmn commented Jul 28, 2016

@nysal Will do that now. GitHub went down right after I posted that comment.

hjelmn added a commit to hjelmn/ompi-release that referenced this issue Jul 28, 2016
This commit adds opal_using_threads() protection around the atomic
operation in OBJ_RETAIN/OBJ_RELEASE. This resolves the performance
issues seen when running psm with MPI_THREAD_SINGLE.

To avoid issues with header dependencies opal_using_threads() has been
moved to a new header (thread_usage.h). The OPAL_THREAD_ADD* and
OPAL_THREAD_CMPSET* macros have also been relocated to this header.

This was cherry-picked off a fix applied to v1.8 and not master. See
open-mpi/ompi#1902.

(cherry picked from commit ce91307)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
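
For context, the OPAL_THREAD_ADD*-style macros mentioned in this commit message use the same guarded pattern; a simplified sketch follows (not the exact definition that lands in thread_usage.h, which may add branch-prediction hints and other width variants):

/* Simplified sketch of the guarded-add pattern: atomic add when threads
 * are in use, plain add otherwise. */
#define THREAD_ADD32_SKETCH(addr, delta)            \
    (opal_using_threads() ?                         \
        opal_atomic_add_32((addr), (delta)) :       \
        (*(addr) += (delta)))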
hjelmn added a commit to hjelmn/ompi that referenced this issue Jul 28, 2016
This commit adds opal_using_threads() protection around the atomic
operation in OBJ_RETAIN/OBJ_RELEASE. This resolves the performance
issues seen when running psm with MPI_THREAD_SINGLE.

To avoid issues with header dependencies opal_using_threads() has been
moved to a new header (thread_usage.h). The OPAL_THREAD_ADD* and
OPAL_THREAD_CMPSET* macros have also been relocated to this header.

This commit is cherry-picked off a fix that was submitted for the v1.8
release series but never applied to master. This fixes part of the
problem reported by @nysal in open-mpi#1902.

(cherry picked from commit open-mpi/ompi-release@ce91307)

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
@hjelmn
Member

hjelmn commented Jul 28, 2016

Ok, merged. I have a PR open for 2.0.1 as well.

@jsquyres
Member Author

jsquyres commented Aug 2, 2016

PR for v2.0.1: open-mpi/ompi-release#1287

@rhc54
Contributor

rhc54 commented Aug 9, 2016

Fixed by open-mpi/ompi-release#1287

@rhc54 rhc54 closed this as completed Aug 9, 2016
bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 3, 2016