Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hipamd: SIGSEGV when compiled with -march=znver4 #18

Open
AngryLoki opened this issue Oct 18, 2023 · 10 comments
Open

hipamd: SIGSEGV when compiled with -march=znver4 #18

AngryLoki opened this issue Oct 18, 2023 · 10 comments

Comments

@AngryLoki
Copy link

Due to unaligned allocations, library crashes in nontemporalMemcpy in _mm512_stream_si512 (which requires 64-aligned allocations, but used to copy default-aligned objects) in https://github.com/ROCm-Developer-Tools/clr/blob/5914ac3c6e9b3848023a7fa25e19e560b1c38541/rocclr/device/rocm/rocvirtual.cpp#L2793

Originally reported to https://bugs.gentoo.org/915969 as a part of rocBLAS and miopen update (failure in hipamd module loader causes crash in dependent libraries).

AngryLoki pushed a commit to AngryLoki/clr that referenced this issue Oct 18, 2023
in _mm512_stream_si512 (which requires 64-aligned allocations,
but used to copy default-aligned objects).

As it is seemingly difficult to change allocations for copied
objects (common objects with ref-counts), the fix just replaces
nontemporalMemcpy with normal memcpy, which is already optimized
in most versions of C runtime.

Closes ROCm#18
@iassiour
Copy link
Contributor

I will create an internal PR to fix this. The PARAMETERS_MIN_ALIGNMENT https://github.com/ROCm-Developer-Tools/clr/blob/develop/rocclr/utils/flags.hpp#L55 should be set to the native alignment.

@AngryLoki
Copy link
Author

AngryLoki commented Oct 21, 2023

Hi, unfortunately, setting PARAMETERS_MIN_ALIGNMENT seems to be not enough.
Code still crashes in nontemporalMemcpy in in rocThrust tests (and supposedly in other libraries that pass unaligned pointers to hipMemcpyAsync).

The traceback is:

libamdhip64.so.5!roc::nontemporalMemcpy(uint16_t size, const void * __restrict__ src, void * __restrict__ dst) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocvirtual.cpp:2799)
libamdhip64.so.5!roc::VirtualGPU::submitKernelInternal(roc::VirtualGPU * const this, const amd::NDRangeContainer & sizes,  kernel, const_address parameters, void * eventHandle, uint32_t sharedMemBytes, amd::NDRangeKernelCommand * vcmd, hsa_kernel_dispatch_packet_t * aql_packet) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocvirtual.cpp:3099)
libamdhip64.so.5!roc::KernelBlitManager::copyBuffer(const roc::KernelBlitManager * const this,  srcMemory,  dstMemory,  srcOrigin,  dstOrigin,  sizeIn, bool entire, amd::CopyMetadata copyMetadata) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocblit.hpp:217)
libamdhip64.so.5!roc::VirtualGPU::copyMemory(roc::VirtualGPU * const this, cl_command_type type, amd::Memory & srcMem, amd::Memory & dstMem, bool entire, const amd::Coord3D & srcOrigin, const amd::Coord3D & dstOrigin, const amd::Coord3D & size, const amd::BufferRect & srcRect, const amd::BufferRect & dstRect, amd::CopyMetadata copyMetadata) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocvirtual.cpp:1832)
libamdhip64.so.5!roc::VirtualGPU::submitCopyMemory(roc::VirtualGPU * const this, amd::CopyMemoryCommand & cmd) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\device\rocm\rocvirtual.cpp:1881)
libamdhip64.so.5!amd::Command::enqueue(amd::Command * const this) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\rocclr\platform\command.cpp:393)
libamdhip64.so.5!ihipMemcpy(void * dst, const void * src, size_t sizeBytes, hipMemcpyKind kind, hip::Stream & stream, bool isHostAsync, bool isGPUAsync) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\hipamd\src\hip_memory.cpp:502)
libamdhip64.so.5!hipMemcpyAsync_common(void * dst, const void * src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\hipamd\src\hip_memory.cpp:1422)
libamdhip64.so.5!hipMemcpyAsync(void * dst, const void * src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream) (\usr\src\debug\dev-util\hip-5.7.1\clr-rocm-5.7.1\hipamd\src\hip_memory.cpp:1428)
thrust::system::hip::detail::async_copy_n<thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag>, thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag>, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, long>(thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag> & from_exec, thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag> & to_exec, thrust::detail::normal_iterator<thrust::device_ptr<short> > first, long n, thrust::detail::normal_iterator<thrust::device_ptr<short> > output) (\src\rocThrust\thrust\system\hip\detail\async\copy.h:132)
thrust::hip_rocprim::async_copy<thrust::hip_rocprim::tag, thrust::hip_rocprim::tag, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >(thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag> & from_exec, thrust::hip_rocprim::execution_policy<thrust::hip_rocprim::tag> & to_exec, thrust::detail::normal_iterator<thrust::device_ptr<short> > first, thrust::detail::normal_iterator<thrust::device_ptr<short> > last, thrust::detail::normal_iterator<thrust::device_ptr<short> > output) (\src\rocThrust\thrust\system\hip\detail\async\copy.h:533)
thrust::async::copy_detail::copy_fn::call<thrust::hip_rocprim::tag, thrust::hip_rocprim::tag, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >(const thrust::detail::execution_policy_base<thrust::hip_rocprim::tag> & from_exec, const thrust::detail::execution_policy_base<thrust::hip_rocprim::tag> & to_exec) (\src\rocThrust\thrust\async\copy.h:86)
thrust::async::copy_detail::copy_fn::call<thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >() (\src\rocThrust\thrust\async\copy.h:121)
thrust::async::copy_detail::copy_fn::operator()<thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >() (\src\rocThrust\thrust\async\copy.h:137)
invoke_async_copy_fn::operator()<thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> >, thrust::detail::normal_iterator<thrust::device_ptr<short> > >() (\src\rocThrust\test\test_async_copy.cpp:30)
AsyncCopyDeviceToHost<short, invoke_async_copy_fn>() (\src\rocThrust\test\test_async_copy.cpp:116)
AsyncCopyTests_TestAsyncCopyTriviallyRelocatableDeviceToHost_Test<Params<short, thrust::hip_rocprim::par_t> >::TestBody(AsyncCopyTests_TestAsyncCopyTriviallyRelocatableDeviceToHost_Test<Params<short, thrust::hip_rocprim::par_t> > * this) (\src\rocThrust\test\test_async_copy.cpp:132)
libgtest.so.1.13.0!void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (Unknown Source:0)
libgtest.so.1.13.0!testing::Test::Run() (Unknown Source:0)
libgtest.so.1.13.0!testing::TestInfo::Run() (Unknown Source:0)

It looks like thrust::host_vector and thrust::device_vector are using std::allocator, then it eventually appears in nontemporalMemcpy.

Should I report this to rocThrust repo?

Upd: according to https://docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html#gab388b755a0ee2c86ca5e0c29391a584c , there are no alignment requirements for hipMemcpyAsync pointers, so I think it should be fixed in clr.

@iassiour
Copy link
Contributor

Thank you @AngryLoki let me raise the issue internally and I will get back shortly.

@iassiour
Copy link
Contributor

Hi @AngryLoki the nontemporal copy that is shown here is irrelevant to the passed pointers in hipMemcpyAsync.
This is a copy of the kernel argument stack instead and this is managed internally.

I see some offsets are still calculated based on the 16 alignment.

https://github.com/ROCm-Developer-Tools/clr/blob/develop/rocclr/platform/kernel.hpp#L160
https://github.com/ROCm-Developer-Tools/clr/blob/develop/rocclr/platform/kernel.hpp#L162
https://github.com/ROCm-Developer-Tools/clr/blob/develop/rocclr/platform/kernel.hpp#L223

Could you please try to change these lines to the native alignment and let me know if it fixes the issue. I will followup with a permanent fix in that case.

@AngryLoki
Copy link
Author

AngryLoki commented Oct 30, 2023

Hi @iassiour ,

sorry for delay, I think I don't fully understand what exactly I need to change in the first link (wrong line number?). I applied this:

--- a/rocclr/platform/kernel.hpp
+++ b/rocclr/platform/kernel.hpp
@@ -159,7 +159,7 @@ class KernelParameters : protected HeapObject {
         deviceKernelArgs_(false) {
     totalSize_ = signature.paramsSize() + (signature.numMemories() +
         signature.numSamplers() + signature.numQueues()) * sizeof(void*);
-    values_ = reinterpret_cast<address>(this) + alignUp(sizeof(KernelParameters), 16);
+    values_ = reinterpret_cast<address>(this) + alignUp(sizeof(KernelParameters), NATIVE_ALIGNEMENT_SIZE);
     memoryObjOffset_ = signature_.paramsSize();
     memoryObjects_ = reinterpret_cast<amd::Memory**>(values_ + memoryObjOffset_);
     samplerObjOffset_ = memoryObjOffset_ + signature_.numMemories() * sizeof(amd::Memory*);
@@ -183,7 +183,7 @@ class KernelParameters : protected HeapObject {
         execNewVcop_(rhs.execNewVcop_),
         execPfpaVcop_(rhs.execPfpaVcop_),
         deviceKernelArgs_(false) {
-    values_ = reinterpret_cast<address>(this) + alignUp(sizeof(KernelParameters), 16);
+    values_ = reinterpret_cast<address>(this) + alignUp(sizeof(KernelParameters), NATIVE_ALIGNEMENT_SIZE);
     memoryObjOffset_ = signature_.paramsSize();
     memoryObjects_ = reinterpret_cast<amd::Memory**>(values_ + memoryObjOffset_);
     samplerObjOffset_ = memoryObjOffset_ + signature_.numMemories() * sizeof(amd::Memory*);
@@ -220,7 +220,7 @@ class KernelParameters : protected HeapObject {
   //! Allocate memory for this instance as well as the required storage for
   //  the values_, defined_, and rawPointer_ arrays.
   void* operator new(size_t size, const KernelSignature& signature) {
-    size_t requiredSize = alignUp(size, 16) + signature.paramsSize() +
+    size_t requiredSize = alignUp(size, NATIVE_ALIGNEMENT_SIZE) + signature.paramsSize() +
       (signature.numMemories() + signature.numSamplers() + signature.numQueues()) *
        sizeof(void*);
     return AlignedMemory::allocate(requiredSize, PARAMETERS_MIN_ALIGNMENT);

it does not crash anymore with this change in nontemporalMemcpy, but it makes one subtest (TestBinarySearchDevice) in one rocThrust test fail.

Summary:

  1. after recompiling clr without -march test succeeds
  2. after recompiling clr with -march=znver4 and memcpy (initial idea) test succeeds
  3. after recompiling clr with NATIVE_ALIGNEMENT_SIZE one subtest fails in https://github.com/ROCmSoftwarePlatform/rocThrust/blob/rocm-5.7.1/test/test_binary_search.cpp#L731 with
[ RUN      ] BinarySearchTests.TestBinarySearchDevice
/src/rocThrust/test/test_binary_search.cpp:731: Failure
Expected equality of these values:
  h_result
    Which is: { 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }
  d_result
    Which is: { 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... }
Google Test trace:
/src/rocThrust/test/test_binary_search.cpp:708: with seed= 1
/src/rocThrust/test/test_binary_search.cpp:704: with size= 1048453
/src/rocThrust/test/test_binary_search.cpp:701: with device_id= 0
[  FAILED  ] BinarySearchTests.TestBinarySearchDevice (15571 ms)

Note the large size; test succeeds with smaller vectors and with size=1048453 ((1 << 20) - 123) it always fails, so I guess my NATIVE_ALIGNEMENT_SIZE patch is not complete.

@AngryLoki
Copy link
Author

By the way, here is another idea to fix this issue once and forever, which I learned recently.

I described it in ROCm/rocRAND#403: if you require C++17 in CMakeLists.txt and add __attribute__((ext_vector_type(4)))-family hints, you can get rid of manual alignUp's and new/delete operators. Maybe you'll find it useful.

@iassiour
Copy link
Contributor

iassiour commented Nov 12, 2023

Thank you for the pointers @AngryLoki I have not managed to reproduce the issue in rocThrust TestBinarySearchDevice yet but the changes you made above in operator new and the constructors in kernel.hpp are correct and in any case are required in order to make the current implementation that uses nonTemmporalCopy copy work with avx/avx512.
I will create a PR internally to add these, it should appear in the next release.

@LtdJorge
Copy link

Same happens to me on Gentoo (awell) running the tests from the Orochi project (from gpuopen), using -march=znver2. It segfaults in the memset. I tried building everything with Clang at first, thinking it was an issue with GCC LTO, but when I discovered this issue, I tried disabling the AVX/2 options with:
COMMON_FLAGS="-O3 -march=znver2 -mno-avx -mno-avx2 -g"
Now it runs correctly, as it was doing before I updated my system and newer HIP libraries got installed.

I'll subscribe here and to the Gentoo PR to get notified when it's fixed and I can enable AVX. Thank you @AngryLoki for the deep dive and finding the root cause.

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Nov 19, 2023
- add fix-unaligned-memcpy.patch
- add exec-stack.patch
- add disable-stack-protector patch
- drop asan doc from the build system.

Closes: https://bugs.gentoo.org/915969
Bug: ROCm/clr#18
Bug: #33400
Bug: ROCm/clr#22
Bug: ROCm/clr#21
Bug: ROCm/ROCm-CompilerSupport#61
Signed-off-by: Sv. Lockal <lockalsash@gmail.com>
Signed-off-by: Benda Xu <heroxbd@gentoo.org>
@gotzl
Copy link

gotzl commented Nov 19, 2023

I just want to add that this issue also applies to amdocl, as described in #31. The bug is inrocclr, which is common to hip and opencl build targets.

@ppanchad-amd
Copy link

Hi @AngryLoki, can you please check if your issue is fixed in the latest ROCm? If so, please close the ticket. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants