segmentation fault with too big numSearchThreads (OpenCL) #182

Closed
shammyx opened this issue Mar 31, 2020 · 13 comments

@shammyx
Contributor

shammyx commented Mar 31, 2020

I have been playing around a bit with KataGo and repeatedly run into trouble when trying to increase "numSearchThreads" (or when trying the benchmark). I am running an up-to-date Arch Linux and compiled KataGo from source (v1.3.5) as described in the readme. I am using an AMD Radeon RX 580 GPU and an AMD Ryzen 7 1700 CPU. The issue seems to be independent of the net (tried with the current g170 extended nets b10, b15, and b20) and board size.

I attached the full gdb output with a backtrace.
gdboutput.txt

If you need more information or I can help resolve this issue, please let me know. Thanks for your effort!

@lightvector
Owner

lightvector commented Mar 31, 2020

I'm unable to reproduce this on my machine. I'm guessing something goes wrong when trying to use batch size 5 or greater, since your attached output works fine up to 4 search threads and fails at 6, and the gdb backtrace shows batchSize=5 in the call that actually failed. I also don't have your exact gtp config gtp_9x9t.cfg, so I used the default cpp/configs/gtp_example.cfg. Do you similarly experience this problem with the default config?

If you could trace this to some issue in KataGo by exploring in GDB, that would be cool. The only things that come to mind right now for how the specific call you gave could have failed would be if:

err = clEnqueueCopyBuffer(handle->commandQueue, input, inputScratch, 0, 0, sizeof(float)*batchSize*cSize*nnYLen*nnXLen, 0, NULL, NULL);

...on that line, either input or inputScratch ended up being invalid buffers for OpenCL, or they were not sized large enough. For batch size 5 and 9x9, and given 22 channels of input, they need to be sized to more than sizeof(float) * 5 * 22 * 9 * 9. Those buffers should be created on lines 2187 and 2188 in cpp/neuralnet/openclbackend.cpp, so it would be nice to verify that the size was indeed large enough in your failure case, and that the addresses of those created buffers match the ones on the failing line, which in your particular run were input=0x7fff48056f10 and inputScratch=0x7fff48057250 according to gdb.
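
For example, an untested sketch along these lines, dropped in right before that clEnqueueCopyBuffer call (reusing the variable names from the snippet above), would print both the handle values and the allocated sizes via the standard clGetMemObjectInfo query:

// Untested debug sketch: CL_MEM_SIZE reports the bytes actually allocated for each
// buffer, which must be at least sizeof(float)*batchSize*cSize*nnYLen*nnXLen.
size_t inputBytes = 0, scratchBytes = 0;
clGetMemObjectInfo(input, CL_MEM_SIZE, sizeof(inputBytes), &inputBytes, NULL);
clGetMemObjectInfo(inputScratch, CL_MEM_SIZE, sizeof(scratchBytes), &scratchBytes, NULL);
size_t neededBytes = sizeof(float) * batchSize * cSize * nnYLen * nnXLen;
printf("input=%p (%zu bytes) inputScratch=%p (%zu bytes) needed=%zu bytes\n",
       (void*)input, inputBytes, (void*)inputScratch, scratchBytes, neededBytes);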

Another question: what is your stack size? When you run ulimit -s, or when you directly ask pthreads for the stack size as in https://unix.stackexchange.com/questions/127602/default-stack-size-for-pthreads, what do you get?
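
For reference, a minimal standalone check along the lines of that answer (just a sketch) would be:

#include <pthread.h>
#include <cstdio>

int main() {
  // Query the stack size that newly created pthreads get by default.
  pthread_attr_t attr;
  size_t stackSize = 0;
  pthread_attr_init(&attr);
  pthread_attr_getstacksize(&attr, &stackSize);
  std::printf("default pthread stack size: %zu bytes\n", stackSize);
  pthread_attr_destroy(&attr);
  return 0;
}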

Lastly, given that I can't reproduce this, that no recent update has materially touched the OpenCL code as far as I know, and that other users have not yet reported it, it could be an issue specific to your GPU and/or OpenCL drivers. Do either of these help? (Found purely by googling your GPU along with "segfault"; I don't claim any specific understanding of these threads.)
https://forums.freebsd.org/threads/opencl-with-amd-radeon-rx580-segfaults.66789/
https://bugs.freedesktop.org/show_bug.cgi?id=108272

@shammyx
Contributor Author

shammyx commented Mar 31, 2020

The issue persists with the default cpp/configs/gtp_example.cfg. Using a board size of 19x19, a segfault sometimes already occurs randomly during the tests with numSearchThreads = 2, at n / 10 positions with n < 10. I can't reproduce it reliably, however; the segfault with 2 threads has occurred twice in maybe 20 tries so far. One related gdb output is in gdbOutput.txt:
gdbOutput.txt
Still trying to figure out what causes it. It has never successfully completed a run with 4 threads, although it seems pretty stable with a single thread.

The stack size is the same as in the linked question:
$ ulimit -s reports 8192 and pthreads reports the corresponding number 8388608.

I am not very familiar with gdb, so I don't know how to trace the issue further. Printing input or inputScratch in gdb just shows their type (cl_mem). So I added some verbose printing around the creation of those buffers near line 2187 in cpp/neuralnet/openclbackend.cpp (see gdbOutput.txt, lines 120-132). The addresses of input and inputScratch after creation differ from what gdb reports (again, see gdbOutput.txt).

Would a core file be helpful? I could try to create one, in case someone who knows what they are doing wants to dig into it.

edit: Regarding those hardware-related links, the trail seems pretty cold. The problems described there do not match mine. I can run OpenCL apps fine, as well as KataGo itself; KataGo just segfaults with multiple search threads.

@lightvector
Owner

I think your verbose printing code is incorrect. You want to cast the cl_mem (which itself is basically a pointer, I think) directly to a void* and print the numeric value of that pointer, rather than take the address of the cl_mem. I'm pretty sure you're printing the memory address where the cl_mem pointer is stored, rather than the address contained in the cl_mem pointer itself.
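
Something like this (an untested sketch, placed right where the buffers are created in openclbackend.cpp) should print values directly comparable to what gdb shows:

// Sketch: cl_mem is an opaque pointer type, so casting the handle itself to void*
// prints the same value gdb reports for the failing call, whereas &input would
// print where the local variable holding the handle is stored.
printf("created input=%p inputScratch=%p\n", (void*)input, (void*)inputScratch);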

No, a core file would probably not be helpful.

Are you sure you can run other OpenCL apps fine, such as various compute- and memory-intensive benchmarks? Given that you report the issue at a smaller number of threads on 19x19, I'm now suspecting that what's breaking for you has something to do with the size of the neural net input data passing some threshold, or is at least stochastically triggered by that.

Once you have multiple threads, what happens is that the evaluations from those threads will get batched if they send queries to the backend code at about the same time, and therefore be using more space within the cl_mem buffers. For 2 threads though, batching can sometimes be rare, with the two threads settling into a rhythm of alternately ping-ponging on the GPU with a batch size of 1, instead of landing at the same time to make a batch size of 2.

For obvious reasons, a 19x19 board uses more space than a 9x9 board, so if that were the case, one would expect a 19x19 board to trigger problems at a smaller batch size than 9x9.

@lightvector
Owner

And in particular, I'm guessing that you're seeing this issue as soon as you hit batch size 2 (but not 1) on 19x19, or batch size 5 (but not 1-4) on 9x9.

Still not sure how to fix it. Are there other applications that you think are using the GPU? Is there a command line tool you can use on your system to see current GPU usage and the amount of GPU memory allocated by each other process using it, if any?

@shammyx
Contributor Author

shammyx commented Mar 31, 2020

I'll double-check that OpenCL apps work fine; I think I got confused with OpenGL, which definitely works flawlessly. My verbose code did indeed print just the memory address where the cl_mem pointer is stored; I will change that and see whether it makes a difference.

@shammyx
Contributor Author

shammyx commented Mar 31, 2020

I tested the Windows binary on a separate Windows partition and everything works fine there, so the hardware is not defective. I used radeontop to check GPU usage and didn't see anything suspicious. It doesn't show the memory allocated by each process, however. I also tried running the test without any instance of X11, so really nothing else should have been using my GPU.

Currently I am compiling luxmark to check whether it works fine as well. That should rule out (or point towards) OpenCL/mesa driver issues.

After that (or, well, maybe during the compilation, since it already takes a while) I'll try to adjust the verbose output and post an update.

@shammyx
Contributor Author

shammyx commented Mar 31, 2020

The cast to void* did it. Now the addresses match.

I had some problems with luxmark, which is why I switched to geekbench instead. The first few tests of that benchmark work, but then it also fails with an "internal error". So probably an OpenCL problem?

@shammyx
Contributor Author

shammyx commented Mar 31, 2020

I scanned the pages about OpenCL errors that you found by googling (especially https://forums.freebsd.org/threads/opencl-with-amd-radeon-rx580-segfaults.66789/), and the simple tools that segfault for others there work fine for me.

But an idea arose from there: watch dmesg. Et voilà:

[ 5143.722198] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 16800 thread katago:cs0 pid 16801
[ 5143.722203] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000600
[ 5143.722205] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04800C
[ 5143.722210] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
[ 5144.595948] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 16800 thread katago:cs0 pid 16801
[ 5144.595952] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000600
[ 5144.595954] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04800C
[ 5144.595959] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 7, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
[ 5152.414302] katago[16855]: segfault at 8 ip 00007f6af5865c1d sp 00007f6a537fd050 error 6 in pipe_radeonsi.so[7f6af576a000+2ad000]
[ 5152.414316] Code: 00 00 8b 83 40 06 00 00 c6 83 2d 6e 00 00 01 48 89 df 48 8d 74 24 60 83 e0 fe 83 c8 02 89 83 40 06 00 00 48 8b 83 c8 04 00 00 <c6> 40 08 01 ff 93 68 03 00 00 48 8b 93 c8 04 00 00 8b 83 40 06 00
[ 5152.414365] audit: type=1701 audit(1585684467.023:119): auid=1000 uid=1000 gid=1000 ses=4 pid=16800 comm="katago" exe="/home/xy/repos/KataGo/cpp/katago" sig=11 res=1
[ 5152.433480] audit: type=1130 audit(1585684467.043:120): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-16870-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5155.633706] audit: type=1131 audit(1585684470.243:121): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-16870-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

Not sure how to continue the investigation now, though.

@shammyx
Contributor Author

shammyx commented Mar 31, 2020

Letting KataGo play made my PC freeze. Looking into it with journalctl -xa gives

...
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 17185 thread katago:cs0 pid 17189
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000600
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04800C
Mar 31 23:24:09 arx kernel: amdgpu 0000:09:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
...
Mar 31 23:27:07 arx rtkit-daemon[1409]: The canary thread is apparently starving. Taking action.
Mar 31 23:27:07 arx rtkit-daemon[1409]: Demoting known real-time threads.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1838 of process 1722.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1433 of process 1404.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1427 of process 1404.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Successfully demoted thread 1404 of process 1404.
Mar 31 23:27:08 arx rtkit-daemon[1409]: Demoted 4 threads.
...
Mar 31 23:29:50 arx kernel: amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000480c for process katago pid 17185 thread katago:cs0 pid 17189
Mar 31 23:29:50 arx kernel: amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000600
Mar 31 23:29:52 arx kernel: amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04800C
Mar 31 23:29:55 arx kernel: amdgpu 0000:09:00.0: VM fault (0x0c, vmid 6, pasid 32772) at page 1536, read from 'TC4' (0x54433400) (72)
Mar 31 23:35:47 arx systemd-udevd[732]: card0: Worker [17487] processing SEQNUM=4011 is taking a long time
-- Reboot --

where ... denotes many, many instances of the first four fully printed lines.

@lightvector
Owner

Thanks for checking on the debug code and gdb. At this point it sounds like there's nothing overtly wrong in the way KataGo's side of the code is executing on your system - it's passing the right buffers, allocated with the right sizes, to a normal OpenCL routine, clEnqueueCopyBuffer, to copy data between them. Although there's some chance the crash is actually caused by a different call, since somewhere along the way things become asynchronous w.r.t. the GPU. It would be possible to try forcing everything to be synchronous by peppering the code with https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clWaitForEvents.html if you were dedicated, but I'm skeptical it's going to be worth it to dig in that far.
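
If you did want to try that, an untested sketch around the copy quoted earlier might look like the following, with the variable names taken from that snippet and the asserts standing in for whatever error handling is appropriate:

// Sketch: attach an event to the copy and block on it, so that any GPU-side fault
// surfaces at this call rather than at some later, unrelated point. (Needs <cassert>.)
cl_event copyDone;
err = clEnqueueCopyBuffer(handle->commandQueue, input, inputScratch,
                          0, 0, sizeof(float)*batchSize*cSize*nnYLen*nnXLen,
                          0, NULL, &copyDone);
assert(err == CL_SUCCESS);
err = clWaitForEvents(1, &copyDone);
assert(err == CL_SUCCESS);
clReleaseEvent(copyDone);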

Do you get similar GPU fault messages when you run geekbench or other intensive OpenCL applications? If so, then the implication would be that something is wrong with the drivers for your GPU, or your GPU hardware, or the way the GPU driver interacts with your OS (given that you found it worked on Windows), or something.

Googling for "VM_CONTEXT1_PROTECTION_FAULT_ADDR" and other parts of your log gives some suggestive links, but you're probably in a better position to investigate than I am.

@shammyx
Contributor Author

shammyx commented Mar 31, 2020

Thanks for your support and explanations, I appreciate it! I don't get these messages from geekbench or clpeak, but geekbench may be catching the error itself, as it fails not with a segfault but with an "internal error". clpeak runs fine; the system sometimes reacts very slowly, but there are no errors, faults, or warnings. clinfo also runs without issues, although I wouldn't call that an intensive application.

As I said before, I tested the Windows binary of KataGo on the same system and everything works as intended, with a recommended fastest numSearchThreads of 16. So I would conclude it's not a hardware defect.

I think it could be OpenCL related, because everything else, including OpenGL, runs without problems. But of course it might be driver related or OS-driver-interaction related.

Indeed, I am in a better position and will try my best to investigate further. I agree the issue is probably not on KataGo's side. Should I close this issue then?

@lightvector
Owner

You can leave this issue open if you think there's more to find and/or it would be helpful for other users to see in the meantime while it's still unsolved; I don't mind. Let me know what you find, and if it does turn out that there's something I can fix in the code to improve robustness, let me know as well.

@shammyx
Contributor Author

shammyx commented Apr 2, 2020

I couldn't find anything else, so I opened a thread on the Arch forums and got this answer: https://bbs.archlinux.org/viewtopic.php?pid=1895516#p1895516
The source of this trouble is the mesa-opencl package. After replacing it with its proprietary AMDGPU-PRO counterpart (the AUR package opencl-amd), everything works perfectly (and even faster).
