Runtime overhead on H2D and D2H transfers #160
That's curious. I thought the per-launch allocations had all been removed, though there is probably still allocation on the first target launch. I'll look into this next week. Edit: this is probably the allocation in atmi_memcpy. Data is (sometimes unnecessarily) staged to a fine grain region before copying, and iirc that fine grain region is presently allocated per memcpy. If memory_pool_allocate is expensive, which it easily could be, we should do some buffer reuse instead.
A partial patch landed at ROCm/amd-llvm-project#174. The HSA API exposes 'lock', which sometimes succeeds, and a synchronous memcpy, which sometimes fails. I'm trying to work out if that's a constraint on how the functions should be called or a bug in HSA. Things should now be faster on the happy path. Continuing to look into it.
This will be resolved in the Nov 2 release of AOMP 11.11-0.
I'm afraid I forgot about this after the partial fix above. The CUDA plugin overlaps tasks that we run sequentially, which I think makes this the right test case for asynchronous offloading.
@dhruvachak Can you check this issue on aomp 13.0-2?
I still saw the heavy staging https://github.com/llvm/llvm-project/blob/0d1d058544446b5fc74e70827a021a017facc56a/openmp/libomptarget/plugins/amdgpu/impl/impl.cpp#L65.
What we should have now is: unconditionally try to lock the memory, and if that succeeds, make the memcpy call on the passed argument. I think that is more reliable than the previous strategy. If the lock fails, we still do the allocate + memcpy fallback; this seems to be necessary for pointers into rodata and is probably required for pointers into the host stack. That is, on trunk (and I think amd-stg-open, but I am not sure about aomp) you should no longer see the allocate in the profiler, assuming you're passing heap-allocated data into the target region.
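The lock-first-with-staging-fallback strategy described above can be sketched as follows. This is a toy model, not the plugin's code: `FakeRuntime`, `try_lock`, and `h2d_copy` are hypothetical names, and the real implementation calls `hsa_amd_memory_lock` and the HSA copy APIs where this sketch uses plain `memcpy`.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Models whether the (hypothetical) lock attempt would succeed, and counts
// which path each copy took, so the fallback behavior is observable.
struct FakeRuntime {
  int fast_path = 0;    // copies done via the lock path
  int staged = 0;       // copies that fell back to staging
  bool lockable = false; // models whether hsa_amd_memory_lock would succeed
  bool try_lock() { return lockable; }
};

void h2d_copy(FakeRuntime &rt, char *dst, const char *src, std::size_t n) {
  if (rt.try_lock()) {
    // Happy path: memory is locked, copy straight from the caller's buffer.
    std::memcpy(dst, src, n);
    ++rt.fast_path;
  } else {
    // Fallback (e.g. rodata or stack pointers): stage through a freshly
    // allocated buffer -- one extra allocation and one extra copy, which is
    // what shows up as memory_pool_allocate in the profiler.
    std::vector<char> staging(src, src + n);
    std::memcpy(dst, staging.data(), n);
    ++rt.staged;
  }
}
```

Heap-allocated arguments would take the fast path every time; only unlockable pointers pay for the staging buffer.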
On NVIDIA GPUs, registering pinned memory is a quite expensive operation, while checking whether a pointer has already been pinned is cheap. I still have a feeling that always trying to lock is sub-optimal.
That's probably worth checking. @carlobertolli? The question is whether calling lock on already-locked memory is more expensive than some other query we could make of it. My expectation is that roct has to do roughly the same check we'd do, so there would be no saving. Regardless, the really expensive thing is to allocate N bytes of host memory and copy over it, and that should now be gone for all the cases where we know it can be elided.
HSA is not well documented and I have difficulty making it work as expected. I doubt HSA does reference counting, and locking/unlocking already-locked memory may cause undefined behavior.
Looks refcounted to me. fmm.c in roct, e.g. fmm_deregister_memory
where fmm_register_memory increments the same counter iff vm_find_object found an existing one. I don't know offhand what the relationship between hsa_amd_pointer_info and hsa_amd_memory_lock is intended to be. OpenMP isn't using hsa_amd_pointer_info at present.
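The refcounting semantics described here (register on an already-known address just bumps a counter; deregister only releases the mapping when the count reaches zero) can be modeled like this. This is a sketch of the behavior, not the actual roct code; `RegistrationTable` and its method names are hypothetical.

```cpp
#include <cstdint>
#include <map>

// Toy model of fmm_register_memory / fmm_deregister_memory refcounting.
class RegistrationTable {
  std::map<std::uintptr_t, int> refs_; // address -> registration count
public:
  // Returns the refcount after registering (1 == first registration,
  // analogous to vm_find_object finding no existing object).
  int register_memory(std::uintptr_t addr) { return ++refs_[addr]; }

  // Returns the refcount after deregistering (0 == actually released),
  // or -1 if the address was never registered.
  int deregister_memory(std::uintptr_t addr) {
    auto it = refs_.find(addr);
    if (it == refs_.end()) return -1;
    if (--it->second == 0) { refs_.erase(it); return 0; }
    return it->second;
  }

  bool is_registered(std::uintptr_t addr) const {
    return refs_.count(addr) != 0;
  }
};
```

Under this model, a second lock on the same address is just a counter increment, and an unlock from another thread does not invalidate a registration that is still held elsewhere.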
It is good that HSA does reference counting. In my measurements, the tracer shows lock takes 35us and unlock takes 7us. They are not cheap. If I pin via HIP ahead of computing, lock takes 0.9us and unlock roughly 0.7us, from rocprof. That output is a bit messier and the times don't really match the tracer.
With user-pinned memory, the time drop on hsa_amd_memory_lock is very visible. If you are not happy with the printout layout, blame rocprof ROCm/ROCm#1640
More expensive than I expected, thanks for measuring. I think we're stuck locking it though: we need the memory to remain locked during the call, and if it happened to be locked by some other code, we don't know that that other code will keep it locked for the duration. E.g. another thread might have locked (before our call) and then unlocked (during our call) the same memory (this is probably page granularity, so not even necessarily the same variable). Edit: a potentially interesting move is to hand out locked memory in omp allocate and keep track of the pointers thus handed out, but it's hard to tell whether that would be a win overall. Locked memory is a scarcer resource than address space too. Maybe the way to go is to check what rocr/roct are doing and see if that can be made cheaper.
+Carlo |
@ye-luo I've run a small test using ROCr directly from C++ for the three cases below.
The pointer being locked refers to a 10^8 double array.
Conclusion: if you have already locked a pointer by the time the amdgpu omptarget plugin tries to lock it again, then you do not pay the cost of relocking. Does this sound reasonable to you?
If my measurement is mapped to your code example, in case 1 the lock takes 7us; in case 3 the inner lock takes 0.9us.
I would say the cost of re-locking is relatively low, but it is not negligible. I also read the timing of unlock from the trace: it took 0.7us. So in total it is 1.6us, which is still not small. If the lock status could be detected ahead of time, that 1.6us could be saved as well. My experiment runs with 1 thread, but QMCPACK does concurrent offload; in that scenario the cost of talking to the runtime definitely goes beyond 1.6us.
The upstream plugin-nextgen keeps track of locked memory https://reviews.llvm.org/D142514 and aomp inherits it. |
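The idea behind tracking locked memory is to consult a table of already-locked host ranges before paying for `hsa_amd_memory_lock` again. A hedged sketch of such a lookup follows; `LockedRangeTable` and its methods are illustrative names, and the real plugin tracks more state (device pointers, refcounts) than this.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Remembers which host ranges are currently locked so repeat transfers can
// skip the lock/unlock round trip entirely.
class LockedRangeTable {
  // start address -> size of the locked range, ordered by start address
  std::map<std::uintptr_t, std::size_t> ranges_;
public:
  void note_locked(std::uintptr_t begin, std::size_t size) {
    ranges_[begin] = size;
  }
  void note_unlocked(std::uintptr_t begin) { ranges_.erase(begin); }

  // True if [ptr, ptr + size) lies entirely inside a recorded locked range.
  bool already_locked(std::uintptr_t ptr, std::size_t size) const {
    auto it = ranges_.upper_bound(ptr);
    if (it == ranges_.begin()) return false;
    --it; // candidate range starting at or before ptr
    return ptr + size <= it->first + it->second;
  }
};
```

A transfer would call `already_locked` first and only fall through to the lock call on a miss, turning the ~1.6us lock/unlock cost into a map lookup for repeat offloads of the same buffers.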
This figure comes from a single offload region. https://github.com/ye-luo/miniqmc/blob/46073436a432fc0472bf793784f9f87b5f8fdfcb/src/QMCWaveFunctions/einspline_spo_omp.cpp#L407
The to/from-mapped arrays are already mapped with "target enter data".
Why are there still memory_pool_allocate, agents_allow_access, and memory_pool_free calls for these arrays?