Feature: Integrate with unified SYCL backend for Intel GPUs #2690
Conversation
This looks interesting, but I need some more context and numbers. What hardware is this useful for?
Discrete Intel GPUs (Intel Arc and the professional variants). Very interested to see performance numbers with this vs CLBlast.
@ggerganov yes, this is for Intel dGPUs (Max and Flex), including Arc GPUs, which rely on the SYCL backend. OpenCL is already supported, but this PR is raised for performance and better optimization (from Intel LLVM). I am currently testing its performance and making it stable, and will flag it for review once it is properly stable.
I'm also interested in this feature. @abhilash1910 are you actively working on it, or is it available for grabs?
@unbrice yes, it is under development, but if you are able to compile it, great. There are some configs and tasks still pending to be added.
I have put together a repo that shows an example of building llama.cpp with OpenCL and running it on an Intel A770 via Docker. The Dockerfile and all associated scripts showing how the container is built, run, and tested are included. There is an example log file that shows more of the console logs from the Docker container, including the response to the curl command in the test.sh file. https://github.com/itlackey/llama.cpp-opencl I have the A770 and a 4060 Ti 16GB running in the same machine. Below are examples of output when running the same model on either card. The 4060 is 10x faster than the Arc. This is not the case when running things like Intel Extension for PyTorch. These cards should perform very similarly when running optimally. This leads me to believe that the OpenCL support in llama.cpp is not using the card to its fullest potential. Hopefully adding SYCL (or Vulkan) support would bring the Arc up to speed. Hopefully this is helpful. A770 logs: 4060 Ti logs: llama_print_timings: load time = 808.00 ms
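For reference, a minimal sketch of how an OpenCL/CLBlast build of llama.cpp was typically configured at the time. This is not taken from the linked repo; the LLAMA_CLBLAST option, binary name, and model path below are assumptions for illustration only.

```bash
# Generic sketch of an OpenCL/CLBlast build of llama.cpp.
# Flag names, binary name, and model path are illustrative assumptions.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir -p build && cd build
cmake .. -DLLAMA_CLBLAST=ON
cmake --build . --config Release

# Offload layers to the GPU at run time (-ngl = number of GPU layers).
./bin/main -m /models/llama-2-7b.Q4_0.gguf -ngl 33 -p "Hello"
```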
I'm starting my SYCL research and development here and it's looking like a decent-sized effort. Macs might not play well with this.
Btw, the existing OpenCL implementation offloads only the matrix multiplications to the GPU; the rest of the ops still run on the CPU, and there is overhead from constantly moving the activations back and forth between host and device memory. Ideally, the entire graph computation should be offloaded, similar to the CUDA and Metal backends.
@abhilash1910 Do you need any support with adding any remaining configurations, or is it complete?
I have no idea, but it's been working perfectly for me with Llama and Mistral models. While I don't think there are shaders for all the ops yet, Vulkan uses 100% of my GPU (unlike OpenCL) and it runs 2x faster.
@ggerganov could you help trigger CI? Thanks
🤞
I've been trying to get this branch to build to play around with my A770, but so far have had no luck. What environment/dependencies does one need to build this? I've tried various oneAPI containers but none seem to be able to find SYCL during cmake configuration. Edit: I realize now you were talking about the Vulkan fork, not this one, sorry.
Some small comments regarding the README and examples
Some comments from trying to compile this application with the open-source DPCPP release
This still doesn't compile.
Yes. I'm fixing this issue.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@ggerganov
@ggerganov
Likely will merge later today or tomorrow.
Thanks for all your hard work, guys! I've been able to easily compile and run it using the intel/hpckit Docker image without any problem. When running inside a container, you can pass the GPU through to the container with this argument: I'm using an iGPU (Intel(R) Iris(R) Xe Graphics) and am able to utilize 100% of its power, though unfortunately the performance is not better than just using the CPU. I definitely should get myself an external GPU. I'll update the guide for compiling & running with Docker in the future.
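The exact argument from the comment above is not preserved in this thread; as a hedged sketch, a common way to expose an Intel GPU to a container is to pass through the /dev/dri render node. The image tag and mount paths below are assumptions.

```bash
# Sketch: expose the Intel GPU's render node (/dev/dri) to the container.
# The image follows the intel/hpckit image mentioned above (published as
# intel/oneapi-hpckit on Docker Hub); tag and mount paths are illustrative.
docker run -it --rm \
    --device /dev/dri \
    -v "$PWD:/workspace" \
    intel/oneapi-hpckit:latest \
    /bin/bash
```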
Thanks for your Docker update! Intel iGPUs come with varying numbers of EUs; in general, an iGPU includes 32 EUs, so it's slow. If you try it on the iGPU of Meteor Lake (the new Intel Core iGPU), or an Intel Arc/Flex/Max dGPU, the performance is good.
@abhilash1910
Do you have any suggestions?
Yes @sorasoras, Windows build support is next in our development plan. We are working to provide the build option.
The Windows build is in its final stage. We will create a PR soon.
Cool, can't wait to test this against the Vulkan build.
The Windows build PR is created: #5208. Please join the review.
Is this supposed to work with laptop/low-end iGPUs? I was getting some acceleration with OpenBLAS but wanted to give this a shot locally, and it fails with:
Does that mean that the op is not supported by the onboard GPU? If so, I'd be happy to add it to the known issues in the docs.
@mudler it seems there is an issue with your oneAPI installation: no GPU device is detected. Can you run the device listing and check? For example, if the iGPU is detected, there will be separate iGPU (Intel(R) Graphics [0x7d55]) and CPU entries, and the SYCL backend can work on the iGPU.
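The exact command referenced above is not preserved in this thread; as a hedged sketch, oneAPI ships a sycl-ls utility that lists the devices the SYCL runtime can see. The output below is illustrative, not captured from this conversation.

```bash
# List the devices visible to the SYCL runtime after sourcing the oneAPI environment.
source /opt/intel/oneapi/setvars.sh
sycl-ls
# Illustrative output when an iGPU is detected alongside the CPU:
#   [opencl:cpu:0] Intel(R) OpenCL, <CPU model> ...
#   [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x7d55] ...
```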
Mmm, alright, I see; here I just have:
so maybe something is wrong with my setup (even if I see all the drivers loaded 🙄). Anyway, thanks for double-checking! Maybe we can add a mention in the docs that a GPU device should be listed (with the
* first update for migration * update init_cublas * add debug functio, commit all help code * step 1 * step 2 * step3 add fp16, slower 31->28 * add GGML_LIST_DEVICE function * step 5 format device and print * step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue * support main device is non-zero * step7 add debug for code path, rm log * step 8, rename all macro & func from cuda by sycl * fix error of select non-zero device, format device list * ren ggml-sycl.hpp -> ggml-sycl.h * clear CMAKE to rm unused lib and options * correct queue: rm dtct:get_queue * add print tensor function to debug * fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481 * summary dpct definition in one header file to replace folder:dpct * refactor device log * mv dpct definition from folder dpct to ggml-sycl.h * update readme, refactor build script * fix build with sycl * set nthread=1 when sycl, increase performance * add run script, comment debug code * add ls-sycl-device tool * add ls-sycl-device, rm unused files * rm rear space * dos2unix * Update README_sycl.md * fix return type * remove sycl version from include path * restore rm code to fix hang issue * add syc and link for sycl readme * rm original sycl code before refactor * fix code err * add know issue for pvc hang issue * enable SYCL_F16 support * align pr4766 * check for sycl blas, better performance * cleanup 1 * remove extra endif * add build&run script, clean CMakefile, update guide by review comments * rename macro to intel hardware * editor config format * format fixes * format fixes * editor format fix * Remove unused headers * skip build sycl tool for other code path * replace tab by space * fix blas matmul function * fix mac build * restore hip dependency * fix conflict * ren as review comments * mv internal function to .cpp file * export funciton print_sycl_devices(), mv class dpct definition to source file * update CI/action for sycl code, fix CI error of repeat/dup * fix action ID format issue * rm unused strategy * enable llama_f16 in ci * fix conflict * fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml * fix ci cases for unsupported data type * revert unrelated changed in cuda cmake remove useless nommq fix typo of GGML_USE_CLBLAS_SYCL * revert hip cmake changes * fix indent * add prefix in func name * revert no mmq * rm cpu blas duplicate * fix no_new_line * fix src1->type==F16 bug. * pass batch offset for F16 src1 * fix batch error * fix wrong code * revert sycl checking in test-sampling * pass void as arguments of ggml_backend_sycl_print_sycl_devices * remove extra blank line in test-sampling * revert setting n_threads in sycl * implement std::isinf for icpx with fast math. 
* Update ci/run.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update CMakeLists.txt Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add copyright and MIT license declare * update the cmd example --------- Co-authored-by: jianyuzh <jianyu.zhang@intel.com> Co-authored-by: luoyu-intel <yu.luo@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
For better Arc support, is there a way we can have layers offloaded to the GPU in chunks? You see, Intel has made it so that it doesn't allow moving chunks greater than 4 GB in size at any one time.
Motivation:
Thanks for creating llama.cpp. There has been quite an effort to integrate the OpenCL runtime for AVX instruction sets.
However, for running on Intel graphics cards, an additional SYCL runtime needs to be ported over the OpenCL runtime.
This is a feature-enabling PR, now in its final stages, with the expectation of community feedback on performance and improvements.
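For context, a minimal sketch of configuring a SYCL build with the oneAPI toolchain, assuming the LLAMA_SYCL option and the icx/icpx DPC++ compilers described in the README added by this PR; treat the exact names and paths as assumptions.

```bash
# Sketch: configure and build llama.cpp with the SYCL backend using oneAPI's
# DPC++ compilers (icx/icpx). Option and path names are assumptions and may
# differ from the final README in this PR.
source /opt/intel/oneapi/setvars.sh
mkdir -p build && cd build
cmake .. -DLLAMA_SYCL=ON \
         -DCMAKE_C_COMPILER=icx \
         -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release
```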
Co-authored by @NeoZhangJianyu, @airMeng, and @luoyu-intel, with thanks to @AidanBeltonS (Codeplay) for suggestions and recommendations. Thanks to everyone who helped improve and shape this PR through feedback and performance work.
Thanks to @jacob1218 for running initial benchmarks:
Since the development is based on the SYCLomatic runtime, which is evolving with the latest upgrades, feedback, suggestions, and comments are welcome.
Tagging @ggerganov.