
Runtime overhead on H2D and D2H transfers #160

Closed · ye-luo opened this issue Oct 8, 2020 · 17 comments

Labels: enhancement (New feature or request) · performance (This is an enhancement for performance) · question (Further information is requested)

@ye-luo commented Oct 8, 2020

This figure comes from a single offload region: https://github.com/ye-luo/miniqmc/blob/46073436a432fc0472bf793784f9f87b5f8fdfcb/src/QMCWaveFunctions/einspline_spo_omp.cpp#L407

The arrays in the map to/from clauses are already mapped with "target enter data". Why are there still memory_pool_allocate, agents_allow_access, and memory_pool_free calls for these arrays?

[Screenshot from 2020-10-08 17-40-09: profiler trace of the offload region]
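For context, the pattern in question looks roughly like the following (a minimal sketch, not the actual miniqmc code; names and sizes are illustrative):

```cpp
#include <vector>

// Minimal sketch of the pattern in question (illustrative, not the actual
// miniqmc code). The buffers are mapped once up front, so per-step
// transfers should not need fresh runtime allocations.
void run(std::vector<double>& in, std::vector<double>& out, int n) {
  double* pin = in.data();
  double* pout = out.data();

  // One-time mapping: device buffers are allocated here.
  #pragma omp target enter data map(to: pin[0:n]) map(alloc: pout[0:n])

  for (int step = 0; step < 100; ++step) {
    #pragma omp target update to(pin[0:n])    // H2D transfer each step

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
      pout[i] = 2.0 * pin[i];

    #pragma omp target update from(pout[0:n]) // D2H transfer each step
  }

  #pragma omp target exit data map(delete: pin[0:n], pout[0:n])
}
```

The question is why each per-step transfer still shows memory_pool_allocate / agents_allow_access / memory_pool_free in the trace.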

@JonChesterfield (Contributor) commented Oct 9, 2020

That's curious. I thought the per-launch allocations had all been removed, though there is probably still an allocation on the first target launch. I'll look into this next week.

Edit: this is probably the allocation in atmi_memcpy. Data is (sometimes unnecessarily) staged to a fine-grain region before copying, and IIRC that fine-grain region is presently allocated per memcpy. If memory_pool_allocate is expensive, which it easily could be, we should do some buffer reuse instead.
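Per transfer, the staged path being described does roughly the following (a paraphrase for illustration, not the actual atmi_memcpy source; the pool and agent handles are assumed to come from plugin initialization):

```cpp
#include <cstdint>
#include <cstring>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Rough paraphrase of the staged H2D path (not the actual atmi_memcpy code).
hsa_status_t staged_h2d(void* dst, const void* src, size_t size,
                        hsa_amd_memory_pool_t fine_grain_pool,
                        hsa_agent_t gpu_agent, hsa_agent_t cpu_agent) {
  // 1. Allocate a fresh fine-grain staging buffer -- the per-transfer
  //    memory_pool_allocate visible in the profile.
  void* staging = nullptr;
  hsa_status_t err =
      hsa_amd_memory_pool_allocate(fine_grain_pool, size, 0, &staging);
  if (err != HSA_STATUS_SUCCESS) return err;

  // 2. Make the buffer visible to the GPU -- the agents_allow_access call.
  err = hsa_amd_agents_allow_access(1, &gpu_agent, nullptr, staging);
  if (err != HSA_STATUS_SUCCESS) return err;

  // 3. Stage the user data, then copy it to the device.
  std::memcpy(staging, src, size);

  hsa_signal_t done;
  hsa_signal_create(1, 0, nullptr, &done);
  err = hsa_amd_memory_async_copy(dst, gpu_agent, staging, cpu_agent, size,
                                  0, nullptr, done);
  if (err == HSA_STATUS_SUCCESS)
    hsa_signal_wait_scacquire(done, HSA_SIGNAL_CONDITION_EQ, 0, UINT64_MAX,
                              HSA_WAIT_STATE_BLOCKED);
  hsa_signal_destroy(done);

  // 4. Free the staging buffer -- the per-transfer memory_pool_free.
  hsa_amd_memory_pool_free(staging);
  return err;
}
```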

@JonChesterfield (Contributor)
A partial patch landed at ROCm/amd-llvm-project#174. The HSA API exposes 'lock', which sometimes succeeds, and a synchronous memcpy, which sometimes fails. I'm trying to work out whether that's a constraint on how the functions should be called or a bug in HSA.

Things should now be faster on the happy path. Continuing to look into it.

@gregrodgers added the enhancement and performance labels on Oct 26, 2020
@gregrodgers (Contributor)
This will be resolved in the Nov 2 release of AOMP 11.11-0.

@JonChesterfield (Contributor)
I'm afraid I forgot about this after the partial fix above. The CUDA plugin overlaps tasks that we run sequentially, which I think makes this the right test case for asynchronous offloading.

@gregrodgers (Contributor)
@dhruvachak Can you check this issue on AOMP 13.0-2?

@gregrodgers added the question label on Apr 20, 2021
@ye-luo (Author) commented Nov 15, 2021

> That's curious. I thought the per-launch allocations had all been removed, though there is probably still an allocation on the first target launch. I'll look into this next week.
>
> Edit: this is probably the allocation in atmi_memcpy. Data is (sometimes unnecessarily) staged to a fine-grain region before copying, and IIRC that fine-grain region is presently allocated per memcpy. If memory_pool_allocate is expensive, which it easily could be, we should do some buffer reuse instead.

I still see the heavy staging at https://github.com/llvm/llvm-project/blob/0d1d058544446b5fc74e70827a021a017facc56a/openmp/libomptarget/plugins/amdgpu/impl/impl.cpp#L65.
I'm wondering if we can use https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/inc/hsa_ext_amd.h#L1820 to query whether the host pointer has already been pinned (https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/inc/hsa_ext_amd.h#L1729) and skip the staging.
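Something like the following is the query being proposed (an untested sketch using hsa_amd_pointer_info, the API linked above):

```cpp
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Sketch: ask ROCr what it knows about a host pointer and skip staging
// when it is already locked/pinned.
bool host_ptr_already_locked(const void* ptr) {
  hsa_amd_pointer_info_t info;
  info.size = sizeof(info);  // must be initialized before the call
  hsa_status_t err = hsa_amd_pointer_info(const_cast<void*>(ptr), &info,
                                          /*alloc=*/nullptr,
                                          /*num_agents_accessible=*/nullptr,
                                          /*accessible=*/nullptr);
  // HSA_EXT_POINTER_TYPE_LOCKED means the range was pinned via
  // hsa_amd_memory_lock (HIP registration presumably ends up here too).
  return err == HSA_STATUS_SUCCESS && info.type == HSA_EXT_POINTER_TYPE_LOCKED;
}
```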

@JonChesterfield (Contributor)
What we have now unconditionally tries to lock the memory and, if that succeeds, makes the memcpy call on the passed pointer. I think that is more reliable than the previous strategy. If the lock fails we still do the allocate + memcpy fallback; this seems to be necessary for pointers into rodata and is probably required for pointers into the host stack.

That is, on trunk (and I think amd-stg-open, though I am not sure about aomp) you should no longer see the allocation in the profiler, assuming you're passing heap-allocated data into the target region.
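In outline, the lock-first strategy looks like this (a sketch, not the plugin source; `gpu_agent` and the completion signal are assumed to exist already):

```cpp
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Sketch of the lock-first H2D strategy described above.
hsa_status_t submit_h2d(void* device_dst, void* host_src, size_t size,
                        hsa_agent_t gpu_agent, hsa_signal_t done) {
  void* agent_ptr = nullptr;
  hsa_status_t err =
      hsa_amd_memory_lock(host_src, size, &gpu_agent, 1, &agent_ptr);
  if (err == HSA_STATUS_SUCCESS) {
    // Happy path: copy straight from the (now pinned) user buffer;
    // agent_ptr is the GPU-visible alias of host_src. The matching
    // unlock happens once `done` is signaled.
    return hsa_amd_memory_async_copy(device_dst, gpu_agent, agent_ptr,
                                     gpu_agent, size, 0, nullptr, done);
  }
  // Fallback for pointers that cannot be locked (e.g. rodata, probably the
  // host stack): the staged allocate + memcpy path sketched earlier.
  return err;
}
```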

@ye-luo (Author) commented Dec 15, 2021

On NVIDIA GPUs, registering pinned memory is quite an expensive operation, while checking whether a pointer has already been pinned is cheap. I still have a feeling that always trying to lock is sub-optimal.

@JonChesterfield (Contributor)
That's probably worth checking. @carlobertolli?

The question is whether calling lock on already-locked memory is more expensive than some other query we could make of it. My expectation is that roct has to do roughly the same check we'd do, so there would be no saving.

Regardless, the really expensive thing is to allocate N bytes of host memory and copy through it, and that should now be gone for all the cases where we know it can be elided.

@ye-luo (Author) commented Dec 15, 2021

The HSA API is not well documented, and I'm having difficulty making it work as expected:
ROCm/ROCR-Runtime#128

I doubt HSA does reference counting, and locking and unlocking already-locked memory may cause undefined behavior.

@JonChesterfield (Contributor)
Looks refcounted to me. See fmm.c in roct, e.g. fmm_deregister_memory:

        if (object->registration_count > 1) {
                /* Still registered elsewhere: just drop one reference. */
                --object->registration_count;
                pthread_mutex_unlock(&aperture->fmm_mutex);
                return HSAKMT_STATUS_SUCCESS;
        }

where fmm_register_memory increments the same counter iff vm_find_object found an existing object.

I don't know offhand what the relationship between hsa_amd_pointer_info and hsa_amd_memory_lock is intended to be. OpenMP isn't using hsa_amd_pointer_info at present.

@ye-luo (Author) commented Dec 16, 2021

It is good that HSA does reference counting. In my measurements the tracer shows lock takes 35 us and unlock takes 7 us; they are not cheap. If I pin via HIP ahead of the computation, lock takes 0.9 us and unlock roughly 0.7 us.

From rocprof the picture is a bit messier and the times don't quite match the tracer.

Without user pinning:

$ cat results.hsa_stats.csv |grep lock
hsa_amd_memory_lock,803,68153169,84873,9.288249639958051
hsa_amd_memory_unlock,803,19387176,24143,2.6421798596306414

With user-pinned memory:

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hsa_amd_memory_lock_to_pool,36,41889377,1163593,12.713881187629408
hsa_amd_memory_lock,803,41304882,51438,12.536480602637097
hsa_amd_memory_unlock,838,23449281,27982,7.1171116383357935

The drop in hsa_amd_memory_lock time is very visible. If you are not happy with the printout layout, blame rocprof: ROCm/ROCm#1640
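For reference, the user-level pinning here was done with HIP, along these lines (a sketch; the buffer size is illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Register the host buffer with HIP once, up front, so later plugin-side
// hsa_amd_memory_lock calls hit the already-pinned fast path
// (~0.9 us instead of ~35 us in the measurements above).
int main() {
  std::vector<double> host(1 << 24);

  // One-time pinning, paid outside the hot loop.
  if (hipHostRegister(host.data(), host.size() * sizeof(double),
                      hipHostRegisterDefault) != hipSuccess)
    return 1;

  // ... offload regions / target updates using host.data() go here ...

  hipHostUnregister(host.data());
  return 0;
}
```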

@JonChesterfield (Contributor) commented Dec 17, 2021

More expensive than I expected, thanks for measuring.

I think we're stuck locking it though: we need the memory to remain locked for the duration of the call, and if it happened to be locked by some other code, we don't know that that code will keep it locked long enough.

E.g. another thread might have locked (before our call) and then unlocked (during our call) the same memory (locking is probably page-granularity, so not even necessarily the same variable).

Edit: a potentially interesting move is to hand out locked memory from omp allocate and keep track of the pointers thus handed out, but it's hard to tell whether that would be a win overall. Locked memory is also a scarcer resource than address space. Maybe the way to go is to check what rocr/roct are doing and see whether that can be made cheaper. One possible shape for the omp allocate idea is sketched below.
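OpenMP 5.x already exposes a pinned-memory knob through allocator traits (a sketch; whether the AMD runtime honored omp_atk_pinned at the time is not established here):

```cpp
#include <omp.h>

int main() {
  // Request pinned (locked) host memory via allocator traits.
  omp_alloctrait_t traits[] = {{omp_atk_pinned, omp_atv_true}};
  omp_allocator_handle_t pinned_alloc =
      omp_init_allocator(omp_default_mem_space, 1, traits);

  // Buffers allocated here would already be locked, so per-transfer
  // locking inside the plugin could be skipped or made trivial.
  double* buf = static_cast<double*>(
      omp_alloc(1024 * sizeof(double), pinned_alloc));

  // ... map/transfer buf as usual ...

  omp_free(buf, pinned_alloc);
  omp_destroy_allocator(pinned_alloc);
  return 0;
}
```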

@ronlieb (Contributor) commented Jan 21, 2022

+Carlo

@carlobertolli (Contributor)
@ye-luo I've run a small test using ROCr directly from C++ for the three cases below. I measure the time elapsed between the curly brackets {}.

1. Lock included in the elapsed time:

       {
         lock(a, locked_a); // not locked yet
         async_memcpy(locked_a);
       }

2. Memcpy only, no lock in the elapsed time:

       lock(a, locked_a);
       {
         async_memcpy(locked_a);
       }

3. Re-lock included in the elapsed time:

       lock(a, locked_a1);
       {
         lock(locked_a1, locked_a2); // already locked
         async_memcpy(locked_a2);
       }

The pointer being locked refers to an array of 10^8 doubles. As I believe you pointed out:

1. Case 1 is slower than cases 2 and 3 by several orders of magnitude.
2. Cases 2 and 3 perform similarly (I could not detect a difference).

Conclusion: if you have already locked a pointer by the time the amdgpu omptarget plugin tries to lock it again, then you do not pay the full cost of locking it again.

Does this sound reasonable to you? (A concrete version of case 3 is sketched below.)
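Case 3 with real ROCr entry points would look something like this (an illustrative sketch; HSA initialization, agent discovery, error handling, and the async copy itself are omitted):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Time a second lock of an already-locked range (case 3 above).
void time_relock(hsa_agent_t gpu_agent) {
  std::vector<double> a(100000000);  // the 10^8 double array
  void* locked_a1 = nullptr;
  void* locked_a2 = nullptr;
  const size_t bytes = a.size() * sizeof(double);

  // First lock: pays the full pinning cost.
  hsa_amd_memory_lock(a.data(), bytes, &gpu_agent, 1, &locked_a1);

  auto t0 = std::chrono::steady_clock::now();
  // Second lock of the same range: should only bump a refcount in roct.
  hsa_amd_memory_lock(a.data(), bytes, &gpu_agent, 1, &locked_a2);
  auto t1 = std::chrono::steady_clock::now();

  std::printf("re-lock: %lld ns\n",
              (long long)std::chrono::duration_cast<
                  std::chrono::nanoseconds>(t1 - t0).count());

  // Assuming roct refcounts the lock (see above), unlock twice to balance.
  hsa_amd_memory_unlock(a.data());
  hsa_amd_memory_unlock(a.data());
}
```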

@ye-luo (Author) commented Jan 28, 2022

If my measurements are mapped onto your code example: in case 1 the lock takes 7 us; in case 3 the inner lock takes 0.9 us.

> Conclusion: if you have already locked a pointer by the time the amdgpu omptarget plugin tries to lock it again, then you do not pay the full cost of locking it again.

I would say the cost of re-locking is relatively low, but it is not negligible. I also read the timing of the unlock from the trace: it took 0.7 us. So in total it is 1.6 us, which is still not small. If the lock status could be detected ahead of time, that 1.6 us could be saved as well. My experiment runs with one thread, but QMCPACK does concurrent offload; in that scenario the cost of talking to the runtime definitely goes beyond 1.6 us.

@ye-luo (Author) commented Apr 28, 2023

The upstream plugin-nextgen keeps track of locked memory (https://reviews.llvm.org/D142514) and AOMP inherits it.
Right now the HSA async copy remains very slow; its developers are aware of it and are working on a fix.

@ye-luo closed this as completed on Apr 28, 2023