pytorch train FC hang on ROCm-4.5.0 with gfx803 #1218
This appears to be caused by out-of-bounds writes in the gemm batched routines for gfx803. By deleting the Tensile yaml files for that architecture, you fall back to source kernels rather than using the assembly kernels. In my tests, the source kernels appear to have fewer failures, but I'm not sure that this entirely solves the problem. I still see failures when running some of the tests. That's about all the investigation I have time for, but I'll organize and post my logs here later. |
You can see which kernels are run using rocBLAS's logging output. Using rocBLAS built with the r9nano yaml files:
Using rocBLAS built without the r9nano yaml files:
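(The log excerpts that followed the two lines above were not preserved in this copy of the thread. As a hedged sketch, one way to observe which kernels actually get launched, which may or may not be the method referred to above, is to combine rocBLAS trace logging with verbose HIP runtime logging while running a single gemm test.)

```sh
# Sketch only: ROCBLAS_LAYER=1 enables rocBLAS trace logging, and a high
# AMD_LOG_LEVEL makes the HIP/ROCclr runtime print verbose activity, including
# kernel dispatches. The gtest filter pattern is illustrative.
AMD_LOG_LEVEL=4 ROCBLAS_LAYER=1 \
  ./rocblas-test --gtest_filter='*pre_checkin*gemm*f32*' 2>&1 | tee kernel-launches.log
```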
|
These are my logs from running rocblas-test on the RX 570, with and without the r9nano yaml files. Unfortunately, in both cases you see failures like these:
That failed check in |
Please note that gfx803 is not officially supported on ROCm. |
Thanks @cgmb for your help, I will try testing in this direction. Thanks @bragadeesh for pointing that out, I really appreciate your help. |
@cgmb Sorry for the delayed response. I think deleting the r9nano-related yamls may resolve the rocBLAS gfx803 issues. I will find time to dig into the asm problems. |
I'm surprised. That's different from the results I got. I'd honestly given up after trying ROCm 5.0 and discovering my gfx803 card didn't seem to work at all anymore. It was just today that I learned there is a setting you need for it to work at all. In any case, my conclusion from this issue was that the complexity of Tensile limits the ability of the community to understand and debug rocBLAS. I've been asking about potentially having fallback implementations for rocBLAS functions when building without Tensile. That's something that's been desired by a number of different people for different reasons (e.g., spack/spack#28846), and I think it would be helpful for cases like this. |
@cgmb My environment:
Although rocblas-test reported the related rocBLAS version, I think this is why our test results are different: maybe you didn't re-install rocBLAS to /opt/rocm, so you were using the asm codes to run rocBLAS on gfx803 the whole time. The next point is that the error is not stable when I run the tests together, but if I run one test at a time, it runs properly. It looks like the context is not cleared and one thread may affect others with the asm codes. Do you have any clue about this? Thank you very much. The good news is that I can reproduce this issue with plain C code, without having to install tensorflow.
|
Cool! But is it possible to have a working tensorflow or pytorch installation with the current ROCm 5.1.3 toolchain? I actually have a gfx803 in my gaming laptop, which further complicates the process of getting ROCm to run. If I remember correctly, the problem was the different memory management models of the iGPU and the dGPU, causing a segmentation fault if ROCm tries to access one GPU with the other's memory management model (e.g. accessing the dGPU in iGPU memory management mode). I would be grateful if @xuhuisheng could provide me with some clues to get this to work. Currently I have a Manjaro installation, so I have to build all the ROCm components from source if I want to install them. |
Can you provide me with a guide for building tensorflow from source for gfx803? |
@LamEnder |
@cgmb This is the way to reproduce this issue; it is the simplest way I have found.
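(The actual command was lost from this comment when the thread was captured. A guessed reconstruction, based on the test name mentioned below, is the following; the filter string is an assumption.)

```sh
# Assumed reconstruction: run only the pre_checkin medium f32 NN gemm tests of
# this size, from wherever the rocblas-test client was built or installed.
./rocblas-test --gtest_filter='*pre_checkin_gemm_medium_f32_r_NN_3_3_3*'
```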
It will match 4 tests, and the third test always fails.
It looks stable: gtest always fails at the third test, pre_checkin_gemm_medium_f32_r_NN_3_3_3_1_3_3_3_3. I tried to calculate it by hand, and I don't know where the 100 comes from.
I also tried copying the gtest code into a single cpp source file, since gtest needs to parse the test data first and that takes time, and the problem disappeared. I just don't know what is different between gtest and my cpp. Since the log is displayed one entry at a time, I guess we don't need to use threads. Here is my cpp test code: |
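(The cpp source that followed was not preserved here. Below is a minimal standalone sketch of such a reproducer, written against the public rocBLAS C API; it is not the author's original code. It runs a 3x3x3 float32 NN gemm, roughly the shape named in the failing test, and checks the result against a host reference. The matrix contents and alpha=1, beta=0 are assumptions.)

```cpp
// Build sketch: hipcc gemm_repro.cpp -lrocblas -o gemm_repro
#include <cstdio>
#include <vector>
#include <hip/hip_runtime.h>
#include <rocblas.h>

int main() {
    const rocblas_int m = 3, n = 3, k = 3, lda = 3, ldb = 3, ldc = 3;
    const float alpha = 1.0f, beta = 0.0f;

    // Column-major inputs with small integer values (exactly representable).
    std::vector<float> hA(lda * k), hB(ldb * n), hC(ldc * n, 0.0f);
    for (size_t i = 0; i < hA.size(); ++i) hA[i] = float(i + 1);
    for (size_t i = 0; i < hB.size(); ++i) hB[i] = float(i + 1);

    float *dA, *dB, *dC;
    hipMalloc(&dA, hA.size() * sizeof(float));
    hipMalloc(&dB, hB.size() * sizeof(float));
    hipMalloc(&dC, hC.size() * sizeof(float));
    hipMemcpy(dA, hA.data(), hA.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), hB.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dC, hC.data(), hC.size() * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_status status = rocblas_sgemm(handle,
                                          rocblas_operation_none, rocblas_operation_none,
                                          m, n, k, &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    hipMemcpy(hC.data(), dC, hC.size() * sizeof(float), hipMemcpyDeviceToHost);

    // Compare against a reference gemm computed on the host.
    int mismatches = 0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float ref = 0.0f;
            for (int l = 0; l < k; ++l) ref += hA[i + l * lda] * hB[l + j * ldb];
            if (hC[i + j * ldc] != ref) {
                ++mismatches;
                std::printf("C[%d,%d] = %f, expected %f\n", i, j, hC[i + j * ldc], ref);
            }
        }
    std::printf("status=%d, mismatches=%d\n", int(status), mismatches);

    rocblas_destroy_handle(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return mismatches == 0 ? 0 : 1;
}
```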
Thanks for the information. It will be a while before I can get to it, but I will take another look. |
@cgmb
With device_vector:
Without device_vector:
Keep digging. |
@xuhuisheng @cgmb Since rocBLAS uses Tensile as its backend, I suggest we generate new kernels by running the Tensile benchmarks on gfx803. Because the newly generated kernels would be built against the new Tensile codebase, I think that might fix some of the rocBLAS test failures. What do you think? This wiki link would be useful: https://github.com/ROCmSoftwarePlatform/Tensile/wiki/ |
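(A rough sketch of that suggestion, following the wiki linked above; the config file name is illustrative and the exact steps depend on the Tensile version in use.)

```sh
# Assumed workflow per the Tensile wiki: run a benchmark config on the gfx803
# machine to generate and tune new kernels and library-logic files.
git clone https://github.com/ROCmSoftwarePlatform/Tensile
cd Tensile
pip3 install -r requirements.txt
Tensile/bin/Tensile Tensile/Configs/sgemm_example.yaml ./benchmark_gfx803   # config name is illustrative
```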
That might help or it might not. I'm not sure. To address this issue (and for a wide variety of other reasons), I've asked rocBLAS to provide fallback implementations that can work without Tensile. If nothing else, that would mean that rocBLAS will at least be functional on all platforms that have HIP support in the driver, compiler and runtime. As a rocSOLVER developer, I will also appreciate the ability to build and test rocSOLVER without having to build the (incredibly slow to compile) Tensile kernels. And to build and test rocSOLVER on platforms that Tensile has not yet added support for. And the ability to debug by comparing results with and without Tensile. So, it will be a useful feature for a number of reasons. I'm not sure when it fits into the schedule, but I put in the feature request at the start of June. It's in the backlog. |
Yes, and by having a generic, architecture-agnostic implementation, we could also eliminate the pain of dealing with the large binary size of rocBLAS. The Tensile kernels could be distributed separately and loaded dynamically at runtime when available, falling back to the generic implementation when they're not. I'm looking forward to it, btw. Anyway, I don't see your feature request anywhere in the backlog; is it just me, or is it hidden elsewhere? |
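(A conceptual sketch, not actual rocBLAS code, of the "load the tuned kernels if present, otherwise use the fallback" idea described above; the library and symbol names are hypothetical.)

```cpp
// Conceptual sketch only: probe for an optional kernel library at runtime and
// fall back to a generic implementation when it is absent. Build with -ldl.
#include <dlfcn.h>
#include <cstdio>

using gemm_fn = void (*)();  // placeholder signature, for illustration only

static void generic_gemm_fallback() { std::puts("using generic fallback gemm"); }

int main() {
    gemm_fn gemm = nullptr;
    if (void* lib = dlopen("libTensileKernels.so", RTLD_NOW)) {        // hypothetical library
        gemm = reinterpret_cast<gemm_fn>(dlsym(lib, "tensile_gemm"));  // hypothetical symbol
    }
    if (!gemm) gemm = &generic_gemm_fallback;  // architecture-agnostic path
    gemm();
    return 0;
}
```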
That would be nice. I had been considering a similar strategy for rocSOLVER's size-specialized kernels, but it hadn't occurred to me that it would be useful for rocBLAS too.
The development backlog is not publicly accessible. Unfortunately, it contains a mix of information that could reasonably be made public and information that must remain private. |
@cgmb
Please help to take a look at this bug. Thank you very much.
If you don't have enough time, please give me some clues. I am really not familiar with GCN assembly. (T_T)
What is the expected behavior
What actually happens
How to reproduce
Environment
Workaround
If I delete
library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml
and rebuild rocBLAS, this problem is resolved. Here is my patched rocblas: https://github.com/xuhuisheng/rocm-gfx803/releases/download/rocm450/rocblas_2.41.0-337552f0.dirty_amd64.deb
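A hedged, shell-form sketch of that workaround (the install.sh flags vary between rocBLAS releases, so check ./install.sh --help for your version):

```sh
# Sketch of the workaround: drop the gfx803 (r9nano) assembly logic files so
# rocBLAS falls back to source kernels, then rebuild and reinstall.
git clone -b rocm-4.5.0 https://github.com/ROCmSoftwarePlatform/rocBLAS
cd rocBLAS
rm library/src/blas3/Tensile/Logic/asm_full/r9nano_*.yaml
./install.sh -d -i   # build dependencies, build, and install; flags may differ by version
```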
etc
Here is my personal issue for tracking this problem on ROCm-4.0.1:
xuhuisheng/rocm-build#4
Here is my build documentation for gfx803. Glad to see there is only one issue on ROCm-4.5.0 with gfx803.
https://github.com/xuhuisheng/rocm-build/tree/master/gfx803
I tried some other scripts:
- The MNIST example for tensorflow with gfx803 shows NaN loss.
- The text-classification example for tensorflow shows `Invalid argument: indices[5,284] = 997212422 is not in [0, 5001)` (#1172).
I guess this problem is caused by the new assembly code in Tensile; we can see that after I delete the r9nano-related Tensile yamls, the source C gemm implementations run properly.
But I don't know how to reproduce it with plain C code, even though it affects pytorch/tensorflow.