Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't use hipMemcpyDefault in 2D memcpys due to bugs in HIP #1106

Merged
merged 6 commits into from
Mar 25, 2024

Conversation

msimberg
Copy link
Collaborator

2D memcpys involving host buffers with non-zero offsets are broken on HIP 5.6 onwards. Some cases have been fixed in ROCm/clr@56daa6c (included in 5.7.0) and further fixes in are in ROCm/clr@d3bfb55 (not released as of this PR).

Where it's straightforward to do so I've simply replaced whip::memcpy_default with the explicit kind. In lacpy we don't know the direction anymore so I've added an additional parameter for the kind that has to be passed by the caller. Finally, I have not changed all instances of whip::memcpy_default (e.g. in 1D memcpys) though I wouldn't oppose that at all as I find being explicit is rarely a bad thing.

@msimberg msimberg self-assigned this Feb 28, 2024
@msimberg
Copy link
Collaborator Author

cscs-ci run

2D memcpys involving host buffers with non-zero offsets are broken on
HIP 5.6 onwards. Some cases have been fixed in
ROCm/clr@56daa6c
(included in 5.7.0) and further fixes in are in
ROCm/clr@d3bfb55
(not released as of this commit).
@msimberg msimberg force-pushed the hip-memcpy-no-default branch from ccf2b9a to 5b8f856 Compare February 28, 2024 11:44
@msimberg
Copy link
Collaborator Author

cscs-ci run

@msimberg msimberg marked this pull request as ready for review February 29, 2024 07:37
@rasolca
Copy link
Collaborator

rasolca commented Feb 29, 2024

I would prefer not to add the extra parameter to lacpy.

Did you considered to use hipPointerGetAttributes to identify the MemcpyKind directly in lacpy only for buggy versions of rocm and continue using MemcpyDefault when it works?

@msimberg
Copy link
Collaborator Author

I would prefer not to add the extra parameter to lacpy.

Did you considered to use hipPointerGetAttributes to identify the MemcpyKind directly in lacpy only for buggy versions of rocm and continue using MemcpyDefault when it works?

I did not, but that sounds like a good idea! I'll check if we can use that and update to use that instead if that's the case.

@msimberg
Copy link
Collaborator Author

msimberg commented Mar 1, 2024

cscs-ci run

@msimberg msimberg force-pushed the hip-memcpy-no-default branch from c27f1aa to 6a908e1 Compare March 1, 2024 09:26
@msimberg
Copy link
Collaborator Author

msimberg commented Mar 1, 2024

cscs-ci run

@msimberg msimberg requested review from albestro and rasolca March 1, 2024 12:45
@msimberg
Copy link
Collaborator Author

msimberg commented Mar 1, 2024

@rasolca I think using hipPointerGetAttributes is working just fine. I was able to run a small set of benchmarks with the latest change that otherwise would fail with master. I added a couple of assertions:

@rasolca
Copy link
Collaborator

rasolca commented Mar 1, 2024

The enum contains different values depending on the rocm version.
https://docs.amd.com/projects/HIP/en/docs-5.6.0/.doxygen/docBin/html/hip__runtime__api_8h.html#aea86e91d3cd65992d787b39b218435a3

For sure hipMemoryTypeUnregistered has to be considered as it represent system allocated memory. (For 5.6 I suppose that it works as for CUDA which returns an error).

@msimberg
Copy link
Collaborator Author

msimberg commented Mar 1, 2024

https://docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html#ga7c3e8663feebb7be9fd3a1e5139bcefc seems to indicate that maybe even the newer versions return an error for unregistered memory, but it's not 100% clear from the description. I'll try it out and see what it returns.

@msimberg
Copy link
Collaborator Author

msimberg commented Mar 4, 2024

@rasolca indeed 5.6.1 returns hipErrorInvalidValue with a mallocd pointer, 6.0.2 gives hipMemoryTypeUnregistered.

We could:

  • Catch whip::exception and fall back to hipMemcpyDefault on hipErrorInvalidValue. hipMemcpy2D will only fail with non-zero offsets on host pointers, and if you get one of those pointers with a buggy version of HIP, hipMemcpy2D will anyway fail on the copy.
  • Treat hipMemoryTypeUnregistered as hipMemoryTypeHost, if we get it, in the versions of HIP that support it.
  • Fall back to hipMemcpyDefault on the other memory types.
  • Other options?

Perhaps worth discussing in today's meeting.

@msimberg msimberg marked this pull request as draft March 7, 2024 10:20
@msimberg
Copy link
Collaborator Author

msimberg commented Mar 7, 2024

cscs-ci run

Comment on lines +121 to +176
if (status == hipErrorInvalidValue) {
// If HIP returns hipErrorInvalidValue we assume it's due
// to HIP not recognizing a non-HIP allocated pointer as
// host memory, and we assume the type is host.
return hipMemoryTypeHost;
}
else if (status != hipSuccess) {
throw whip::exception(status);
}

switch (attributes.type) {
#if HIP_VERSION >= 60000000
case hipMemoryTypeUnregistered:
[[fallthrough]];
#endif
case hipMemoryTypeHost:
return hipMemoryTypeHost;
case hipMemoryTypeArray:
[[fallthrough]];
case hipMemoryTypeDevice:
return hipMemoryTypeDevice;
case hipMemoryTypeUnified:
return hipMemoryTypeUnified;
case hipMemoryTypeManaged:
return hipMemoryTypeManaged;
default:
DLAF_UNREACHABLE_PLAIN;
}
};

hipMemoryType src_type = get_memory_type(src);
hipMemoryType dst_type = get_memory_type(dst);

if (src_type == hipMemoryTypeDevice && dst_type == hipMemoryTypeHost) {
return whip::memcpy_device_to_host;
}
else if (src_type == hipMemoryTypeHost && dst_type == hipMemoryTypeDevice) {
return whip::memcpy_host_to_device;
}
else if (src_type == hipMemoryTypeDevice && dst_type == hipMemoryTypeDevice) {
return whip::memcpy_device_to_device;
}
else if (src_type == hipMemoryTypeManaged || src_type == hipMemoryTypeUnified ||
dst_type == hipMemoryTypeManaged || dst_type == hipMemoryTypeUnified) {
return whip::memcpy_default;
}
else if (src_type == hipMemoryTypeHost && dst_type == hipMemoryTypeHost) {
DLAF_ASSERT(
false,
"Attempting to do a HIP lacpy with host source and destination, use the CPU lacpy instead");
}
else {
DLAF_ASSERT(false,
"Attempting to do a HIP lacpy with unsupported source and destination memory type",
src_type, dst_type);
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've slightly refactored this. It's still a bit of a mess but perhaps passable. I now:

  • Map a hipErrorInvalidValue from hipPointerGetAttributes to hipMemoryTypeHost. From a small test it seems like a malloc'd pointer does not go through the same path as a hipHostMalloc'd pointer, but we still need to cover the case of hipMemoryTypeHost, so we can treat both the same.
  • Map hipMemoryTypeUnregistered (if available) to hipMemoryTypeHost.
  • Map hipMemoryTypeArray to hipMemoryTypeDevice because the HIP documentation states that it's always on a device.
  • If any pointer is hipMemoryTypeManaged or hipMemoryTypeUnified the memcpy kind is hipMemcpyDefault because I don't know what else to do with them. This may be too conservative, but at least I'm not making any assumptions that I'm not sure about. This case was previously also using hipMemcpyDefault so it should not be worse than before.

All of the above is only done if HIP is 5.6 or newer. I'm hoping that we can put an upper bound on the workaround maybe with 6.1 or whichever version properly fixes the behaviour.

Comment on lines +167 to +171
else if (src_type == hipMemoryTypeHost && dst_type == hipMemoryTypeHost) {
DLAF_ASSERT(
false,
"Attempting to do a HIP lacpy with host source and destination, use the CPU lacpy instead");
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this assertion is only triggered with HIP 5.6 and newer, but theoretically it's useful also with other versions, including with CUDA. I don't know if it's worth it to always do this assertion though. I don't think it's a big deal to not check it (as we do right now on master).

@msimberg
Copy link
Collaborator Author

msimberg commented Mar 7, 2024

cscs-ci run

@msimberg msimberg force-pushed the hip-memcpy-no-default branch from ba8bff3 to 47d6f36 Compare March 7, 2024 15:48
@msimberg
Copy link
Collaborator Author

msimberg commented Mar 7, 2024

cscs-ci run

@msimberg msimberg marked this pull request as ready for review March 8, 2024 10:33
@msimberg msimberg requested review from RMeli and aurianer March 18, 2024 13:46
Copy link
Collaborator

@albestro albestro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work!

Just a minor comment: wondering if this workaround shouldn't be moved to whip. Feel free to ignore this, we can always move it if we decide it's a good idea.

@msimberg
Copy link
Collaborator Author

Just a minor comment: wondering if this workaround shouldn't be moved to whip. Feel free to ignore this, we can always move it if we decide it's a good idea.

It's a very good point, but at least at this point I'm not confident enough in the workaround to put it in whip. I know it works for our use cases and I think it works for others, but I really haven't tested it extensively enough to put it into whip right now.

@rasolca rasolca merged commit de20c0c into master Mar 25, 2024
4 checks passed
@rasolca rasolca deleted the hip-memcpy-no-default branch March 25, 2024 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants