clblast_tuner_xgemm broken in latest development branch #97

Closed
wernsaar opened this issue Sep 5, 2016 · 3 comments


wernsaar commented Sep 5, 2016

I am trying to tune for NVIDIA devices.

Command line:
./clblast_tuner_xgemm -m 1024 -n 1024 -k 1024

After some iterations, the program aborts:

[   FAILED ] Kernel Xgemm failed
[   FAILED ]   catched exception: Internal OpenCL error: -36
terminate called after throwing an instance of 'std::runtime_error'
  what():  Internal OpenCL error: -5
Aborted (core dumped)

This does not happen when I tune for the CPU.


wernsaar commented Sep 5, 2016

This is a backtrace from gdb:

[   FAILED ] Kernel Xgemm failed
[   FAILED ]   catched exception: Internal OpenCL error: -36
terminate called after throwing an instance of 'std::runtime_error'
  what():  Internal OpenCL error: -5

Program received signal SIGABRT, Aborted.
0x00007ffff69af5f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56    return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install libX11-1.6.3-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 libxcb-1.11-4.el7.x86_64
(gdb) bt
#0  0x00007ffff69af5f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff69b0ce8 in __GI_abort () at abort.c:90
#2  0x00007ffff72b49d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3  0x00007ffff72b2946 in ?? () from /lib64/libstdc++.so.6
#4  0x00007ffff72b2973 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007ffff72b2b93 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x00007ffff777b2bb in cltune::CheckError(int) () from /home/saar/openblas/lib/libcltune.so
#7  0x00007ffff778adad in bool cltune::TunerImpl::DownloadAndCompare<float>(cltune::TunerImpl::MemArgument&, unsigned long) () from /home/saar/openblas/lib/libcltune.so
#8  0x00007ffff7785cb5 in cltune::TunerImpl::VerifyOutput() () from /home/saar/openblas/lib/libcltune.so
#9  0x00007ffff778617f in cltune::TunerImpl::Tune() () from /home/saar/openblas/lib/libcltune.so
#10 0x0000000000416699 in clblast::Tuner<clblast::TuneXgemm<float>, float> (argc=argc@entry=11, argv=argv@entry=0x7fffffffde68) at /home/saar/CLBlast-development/src/tuning/tuning.hpp:129
#11 0x0000000000408751 in main (argc=11, argv=0x7fffffffde68) at /home/saar/CLBlast-development/src/tuning/kernels/xgemm.cpp:152
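
Frames #2 to #5 show the exception thrown by cltune::CheckError reaching std::terminate, which is what happens when no enclosing handler catches it. The following is a minimal sketch of a guard that reports such an error and returns a failure code instead of aborting. RunGuarded is a hypothetical helper and not part of CLBlast or CLTune; the callable passed to it stands in for the tuner call.

```cpp
#include <cstdio>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical helper: runs `body` and converts any escaping std::exception
// into a non-zero return code, so the process exits with an error message
// instead of ending in std::terminate and a core dump.
int RunGuarded(const std::function<void()>& body) {
  try {
    body();
    return 0;
  } catch (const std::exception& e) {
    std::fprintf(stderr, "error: %s\n", e.what());
    return 1;
  }
}
```

A caller could write `return RunGuarded([&] { /* run the tuner here */ });` in main. This only changes how the failure is reported; it does not address whatever caused the OpenCL error.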

@CNugteren
Owner

Error -36 is CL_INVALID_COMMAND_QUEUE. Probably something got corrupted in the run just before it, which also failed, as the log indicates.

I just re-ran the tuner from the development branch on an NVIDIA device (Tesla K40m with CUDA 7.5) and successfully completed ~400 tuning experiments with no errors. Note that the configurations it tries are random, so the failure might be difficult to track down. Is your issue consistent across runs? What GPU and CUDA version are you using? Does it work (consistently) on the master branch?
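
For reference, the two status codes in the log can be decoded with a small lookup. This is a sketch only: OpenCLErrorName is a hypothetical helper, not a CLBlast or CLTune function, and it covers just a handful of the values defined in the Khronos CL/cl.h header.

```cpp
#include <string>

// Hypothetical helper: symbolic names for a few OpenCL status codes, using
// the numeric values from the Khronos CL/cl.h header. Only a subset is
// listed; any other value is reported as unknown.
std::string OpenCLErrorName(const int status) {
  switch (status) {
    case 0:   return "CL_SUCCESS";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    default:  return "unknown OpenCL status " + std::to_string(status);
  }
}
```

With this, the -36 from the failed kernel run maps to CL_INVALID_COMMAND_QUEUE and the -5 from the uncaught exception maps to CL_OUT_OF_RESOURCES.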


wernsaar commented Sep 7, 2016

Sorry for the trouble,

The cause was a buggy NVIDIA driver (version 367.35).
With a previous driver or the newest driver (version 367.44),
the problem is gone and the tuner runs fine.

Best regards
Werner

@wernsaar wernsaar closed this as completed Sep 7, 2016