
Runtime overhead on H2D and D2H transfers #160

Closed · ye-luo opened this issue Oct 8, 2020 · 17 comments

Labels: enhancement (New feature or request) · performance (This is an enhancement for performance) · question (Further information is requested)

@ye-luo commented Oct 8, 2020

This figure comes from a single offload region: https://github.com/ye-luo/miniqmc/blob/46073436a432fc0472bf793784f9f87b5f8fdfcb/src/QMCWaveFunctions/einspline_spo_omp.cpp#L407

The arrays in the map to/from clauses are already mapped with "target enter data". Why are there still memory_pool_allocate, agents_allow_access, and memory_pool_free calls for these arrays?

[Screenshot from 2020-10-08 17-40-09: profiler trace of the offload region]
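For context, the pattern in question looks roughly like the following (a minimal sketch, not the actual miniqmc code; names and sizes are illustrative):

```cpp
#include <vector>

// Minimal sketch of the pattern in question (illustrative, not the actual
// miniqmc code). The buffers are mapped once up front, so per-step
// transfers should not need fresh runtime allocations.
void run(std::vector<double>& in, std::vector<double>& out, int n) {
  double* pin = in.data();
  double* pout = out.data();

  // One-time mapping: device buffers are allocated here.
  #pragma omp target enter data map(to: pin[0:n]) map(alloc: pout[0:n])

  for (int step = 0; step < 100; ++step) {
    #pragma omp target update to(pin[0:n])    // H2D transfer each step

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
      pout[i] = 2.0 * pin[i];

    #pragma omp target update from(pout[0:n]) // D2H transfer each step
  }

  #pragma omp target exit data map(delete: pin[0:n], pout[0:n])
}
```

The question is why each per-step transfer still shows memory_pool_allocate / agents_allow_access / memory_pool_free in the trace.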

@JonChesterfield (Contributor) commented Oct 9, 2020

That's curious. I thought the per-launch allocations had all been removed, though there is probably still an allocation on the first target launch. I'll look into this next week.

Edit: this is probably the allocation in atmi_memcpy. Data is (sometimes unnecessarily) staged to a fine-grain region before copying, and IIRC that fine-grain region is presently allocated per memcpy. If memory_pool_allocate is expensive, which it easily could be, we should do some buffer reuse instead.
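Per transfer, the staged path being described does roughly the following (a paraphrase for illustration, not the actual atmi_memcpy source; the pool and agent handles are assumed to come from plugin initialization):

```cpp
#include <cstdint>
#include <cstring>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Rough paraphrase of the staged H2D path (not the actual atmi_memcpy code).
hsa_status_t staged_h2d(void* dst, const void* src, size_t size,
                        hsa_amd_memory_pool_t fine_grain_pool,
                        hsa_agent_t gpu_agent, hsa_agent_t cpu_agent) {
  // 1. Allocate a fresh fine-grain staging buffer -- the per-transfer
  //    memory_pool_allocate visible in the profile.
  void* staging = nullptr;
  hsa_status_t err =
      hsa_amd_memory_pool_allocate(fine_grain_pool, size, 0, &staging);
  if (err != HSA_STATUS_SUCCESS) return err;

  // 2. Make the buffer visible to the GPU -- the agents_allow_access call.
  err = hsa_amd_agents_allow_access(1, &gpu_agent, nullptr, staging);
  if (err != HSA_STATUS_SUCCESS) return err;

  // 3. Stage the user data, then copy it to the device.
  std::memcpy(staging, src, size);

  hsa_signal_t done;
  hsa_signal_create(1, 0, nullptr, &done);
  err = hsa_amd_memory_async_copy(dst, gpu_agent, staging, cpu_agent, size,
                                  0, nullptr, done);
  if (err == HSA_STATUS_SUCCESS)
    hsa_signal_wait_scacquire(done, HSA_SIGNAL_CONDITION_EQ, 0, UINT64_MAX,
                              HSA_WAIT_STATE_BLOCKED);
  hsa_signal_destroy(done);

  // 4. Free the staging buffer -- the per-transfer memory_pool_free.
  hsa_amd_memory_pool_free(staging);
  return err;
}
```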

@JonChesterfield (Contributor)
A partial patch landed at ROCm/amd-llvm-project#174. The HSA API exposes 'lock', which sometimes succeeds, and a synchronous memcpy, which sometimes fails. I'm trying to work out whether that's a constraint on how the functions should be called or a bug in HSA.

Things should now be faster on the happy path. Continuing to look into it.

@gregrodgers added the enhancement and performance labels on Oct 26, 2020
@gregrodgers (Contributor)
This will be resolved in the Nov 2 release of AOMP 11.11-0.

@JonChesterfield (Contributor)
I'm afraid I forgot about this after the partial fix above. The CUDA plugin overlaps tasks that we run sequentially, which I think makes this the right test case for asynchronous offloading.

@gregrodgers (Contributor)
@dhruvachak Can you check this issue on AOMP 13.0-2?

@gregrodgers added the question label on Apr 20, 2021
@ye-luo (Author) commented Nov 15, 2021

> That's curious. I thought the per-launch allocations had all been removed, though there is probably still an allocation on the first target launch. I'll look into this next week.
>
> Edit: this is probably the allocation in atmi_memcpy. Data is (sometimes unnecessarily) staged to a fine-grain region before copying, and IIRC that fine-grain region is presently allocated per memcpy. If memory_pool_allocate is expensive, which it easily could be, we should do some buffer reuse instead.

I still see the heavy staging at https://github.com/llvm/llvm-project/blob/0d1d058544446b5fc74e70827a021a017facc56a/openmp/libomptarget/plugins/amdgpu/impl/impl.cpp#L65.
I'm wondering if we can use https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/inc/hsa_ext_amd.h#L1820 to query whether the host pointer has already been pinned (https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/inc/hsa_ext_amd.h#L1729) and skip the staging.
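Something like the following is the query being proposed (an untested sketch using hsa_amd_pointer_info, the API linked above):

```cpp
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Sketch: ask ROCr what it knows about a host pointer and skip staging
// when it is already locked/pinned.
bool host_ptr_already_locked(const void* ptr) {
  hsa_amd_pointer_info_t info;
  info.size = sizeof(info);  // must be initialized before the call
  hsa_status_t err = hsa_amd_pointer_info(const_cast<void*>(ptr), &info,
                                          /*alloc=*/nullptr,
                                          /*num_agents_accessible=*/nullptr,
                                          /*accessible=*/nullptr);
  // HSA_EXT_POINTER_TYPE_LOCKED means the range was pinned via
  // hsa_amd_memory_lock (HIP registration presumably ends up here too).
  return err == HSA_STATUS_SUCCESS && info.type == HSA_EXT_POINTER_TYPE_LOCKED;
}
```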

@JonChesterfield (Contributor)
What we have now unconditionally tries to lock the memory and, if that succeeds, makes the memcpy call on the passed pointer. I think that is more reliable than the previous strategy. If the lock fails we still do the allocate + memcpy fallback; this seems to be necessary for pointers into rodata and is probably required for pointers into the host stack.

That is, on trunk (and I think amd-stg-open, though I am not sure about aomp) you should no longer see the allocation in the profiler, assuming you're passing heap-allocated data into the target region.
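In outline, the lock-first strategy looks like this (a sketch, not the plugin source; `gpu_agent` and the completion signal are assumed to exist already):

```cpp
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Sketch of the lock-first H2D strategy described above.
hsa_status_t submit_h2d(void* device_dst, void* host_src, size_t size,
                        hsa_agent_t gpu_agent, hsa_signal_t done) {
  void* agent_ptr = nullptr;
  hsa_status_t err =
      hsa_amd_memory_lock(host_src, size, &gpu_agent, 1, &agent_ptr);
  if (err == HSA_STATUS_SUCCESS) {
    // Happy path: copy straight from the (now pinned) user buffer;
    // agent_ptr is the GPU-visible alias of host_src. The matching
    // unlock happens once `done` is signaled.
    return hsa_amd_memory_async_copy(device_dst, gpu_agent, agent_ptr,
                                     gpu_agent, size, 0, nullptr, done);
  }
  // Fallback for pointers that cannot be locked (e.g. rodata, probably the
  // host stack): the staged allocate + memcpy path sketched earlier.
  return err;
}
```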

@ye-luo (Author) commented Dec 15, 2021

On NVIDIA GPUs, registering pinned memory is quite an expensive operation, while checking whether a pointer has already been pinned is cheap. I still have a feeling that always trying to lock is sub-optimal.

@JonChesterfield (Contributor)
That's probably worth checking. @carlobertolli?

The question is whether calling lock on already-locked memory is more expensive than some other query we could make of it. My expectation is that roct has to do roughly the same check we'd do, so there would be no saving.

Regardless, the really expensive thing is to allocate N bytes of host memory and copy through it, and that should now be gone for all the cases where we know it can be elided.

@ye-luo (Author) commented Dec 15, 2021

The HSA API is not well documented, and I'm having difficulty making it work as expected:
ROCm/ROCR-Runtime#128

I doubt HSA does reference counting, and locking and unlocking already-locked memory may cause undefined behavior.

@JonChesterfield (Contributor)
Looks refcounted to me. See fmm.c in roct, e.g. fmm_deregister_memory:

        if (object->registration_count > 1) {
                /* Still registered elsewhere: just drop one reference. */
                --object->registration_count;
                pthread_mutex_unlock(&aperture->fmm_mutex);
                return HSAKMT_STATUS_SUCCESS;
        }

where fmm_register_memory increments the same counter iff vm_find_object found an existing object.

I don't know offhand what the relationship between hsa_amd_pointer_info and hsa_amd_memory_lock is intended to be. OpenMP isn't using hsa_amd_pointer_info at present.

@ye-luo (Author) commented Dec 16, 2021

It is good that HSA does reference counting. In my measurements the tracer shows lock takes 35 us and unlock takes 7 us; they are not cheap. If I pin via HIP ahead of the computation, lock takes 0.9 us and unlock roughly 0.7 us.

From rocprof the picture is a bit messier and the times don't quite match the tracer.

Without user pinning:

$ cat results.hsa_stats.csv |grep lock
hsa_amd_memory_lock,803,68153169,84873,9.288249639958051
hsa_amd_memory_unlock,803,19387176,24143,2.6421798596306414

With user-pinned memory:

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hsa_amd_memory_lock_to_pool,36,41889377,1163593,12.713881187629408
hsa_amd_memory_lock,803,41304882,51438,12.536480602637097
hsa_amd_memory_unlock,838,23449281,27982,7.1171116383357935

The drop in hsa_amd_memory_lock time is very visible. If you are not happy with the printout layout, blame rocprof: ROCm/ROCm#1640
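For reference, the user-level pinning here was done with HIP, along these lines (a sketch; the buffer size is illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Register the host buffer with HIP once, up front, so later plugin-side
// hsa_amd_memory_lock calls hit the already-pinned fast path
// (~0.9 us instead of ~35 us in the measurements above).
int main() {
  std::vector<double> host(1 << 24);

  // One-time pinning, paid outside the hot loop.
  if (hipHostRegister(host.data(), host.size() * sizeof(double),
                      hipHostRegisterDefault) != hipSuccess)
    return 1;

  // ... offload regions / target updates using host.data() go here ...

  hipHostUnregister(host.data());
  return 0;
}
```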

@JonChesterfield (Contributor) commented Dec 17, 2021

More expensive than I expected, thanks for measuring.

I think we're stuck locking it though: we need the memory to remain locked for the duration of the call, and if it happened to be locked by some other code, we don't know that that code will keep it locked long enough.

E.g. another thread might have locked (before our call) and then unlocked (during our call) the same memory (locking is probably page-granularity, so not even necessarily the same variable).

Edit: a potentially interesting move is to hand out locked memory from omp allocate and keep track of the pointers thus handed out, but it's hard to tell whether that would be a win overall. Locked memory is also a scarcer resource than address space. Maybe the way to go is to check what rocr/roct are doing and see whether that can be made cheaper. One possible shape for the omp allocate idea is sketched below.
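OpenMP 5.x already exposes a pinned-memory knob through allocator traits (a sketch; whether the AMD runtime honored omp_atk_pinned at the time is not established here):

```cpp
#include <omp.h>

int main() {
  // Request pinned (locked) host memory via allocator traits.
  omp_alloctrait_t traits[] = {{omp_atk_pinned, omp_atv_true}};
  omp_allocator_handle_t pinned_alloc =
      omp_init_allocator(omp_default_mem_space, 1, traits);

  // Buffers allocated here would already be locked, so per-transfer
  // locking inside the plugin could be skipped or made trivial.
  double* buf = static_cast<double*>(
      omp_alloc(1024 * sizeof(double), pinned_alloc));

  // ... map/transfer buf as usual ...

  omp_free(buf, pinned_alloc);
  omp_destroy_allocator(pinned_alloc);
  return 0;
}
```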

@ronlieb (Contributor) commented Jan 21, 2022

+Carlo

@carlobertolli (Contributor)
@ye-luo I've run a small test using ROCr directly from C++ for the three cases below. I measure the time elapsed between the curly brackets {}.

1. Lock included in the elapsed time:

       {
         lock(a, locked_a); // not locked yet
         async_memcpy(locked_a);
       }

2. Memcpy only, no lock in the elapsed time:

       lock(a, locked_a);
       {
         async_memcpy(locked_a);
       }

3. Re-lock included in the elapsed time:

       lock(a, locked_a1);
       {
         lock(locked_a1, locked_a2); // already locked
         async_memcpy(locked_a2);
       }

The pointer being locked refers to an array of 10^8 doubles. As I believe you pointed out:

1. Case 1 is slower than cases 2 and 3 by several orders of magnitude.
2. Cases 2 and 3 perform similarly (I could not detect a difference).

Conclusion: if you have already locked a pointer by the time the amdgpu omptarget plugin tries to lock it again, then you do not pay the full cost of locking it again.

Does this sound reasonable to you? (A concrete version of case 3 is sketched below.)
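Case 3 with real ROCr entry points would look something like this (an illustrative sketch; HSA initialization, agent discovery, error handling, and the async copy itself are omitted):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Time a second lock of an already-locked range (case 3 above).
void time_relock(hsa_agent_t gpu_agent) {
  std::vector<double> a(100000000);  // the 10^8 double array
  void* locked_a1 = nullptr;
  void* locked_a2 = nullptr;
  const size_t bytes = a.size() * sizeof(double);

  // First lock: pays the full pinning cost.
  hsa_amd_memory_lock(a.data(), bytes, &gpu_agent, 1, &locked_a1);

  auto t0 = std::chrono::steady_clock::now();
  // Second lock of the same range: should only bump a refcount in roct.
  hsa_amd_memory_lock(a.data(), bytes, &gpu_agent, 1, &locked_a2);
  auto t1 = std::chrono::steady_clock::now();

  std::printf("re-lock: %lld ns\n",
              (long long)std::chrono::duration_cast<
                  std::chrono::nanoseconds>(t1 - t0).count());

  // Assuming roct refcounts the lock (see above), unlock twice to balance.
  hsa_amd_memory_unlock(a.data());
  hsa_amd_memory_unlock(a.data());
}
```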

@ye-luo (Author) commented Jan 28, 2022

If my measurements are mapped onto your code example: in case 1 the lock takes 7 us; in case 3 the inner lock takes 0.9 us.

> Conclusion: if you have already locked a pointer by the time the amdgpu omptarget plugin tries to lock it again, then you do not pay the full cost of locking it again.

I would say the cost of re-locking is relatively low, but it is not negligible. I also read the timing of the unlock from the trace: it took 0.7 us. So in total it is 1.6 us, which is still not small. If the lock status could be detected ahead of time, that 1.6 us could be saved as well. My experiment runs with one thread, but QMCPACK does concurrent offload; in that scenario the cost of talking to the runtime definitely goes beyond 1.6 us.

@ye-luo (Author) commented Apr 28, 2023

The upstream plugin-nextgen keeps track of locked memory (https://reviews.llvm.org/D142514) and AOMP inherits it.
Right now the HSA async copy remains very slow; its developers are aware of it and are working on a fix.

@ye-luo closed this as completed on Apr 28, 2023