Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BFloat16 assembler errors with Windows Sapphire Rapids kernels #4450

Closed
imciner2 opened this issue Jan 23, 2024 · 5 comments
Closed

BFloat16 assembler errors with Windows Sapphire Rapids kernels #4450

imciner2 opened this issue Jan 23, 2024 · 5 comments
Labels
Bug in other software Compiler, Virtual Machine, etc. bug affecting OpenBLAS

Comments

@imciner2
Copy link
Contributor

When building the BFloat16 API for Julia, we are seeing an assembler failure in the 64-bit Windows builds for the Sapphire Rapids kernels: https://buildkite.com/julialang/yggdrasil/builds/7890#018d31fe-fdfa-4892-870d-1f9e89ccf362/6-2212.

The errors are:

[18:55:08] /tmp/ccNhJljp.s: Assembler messages:
[18:55:08] /tmp/ccNhJljp.s:108: Error: `(%rax,%r10d,1)' is not a valid base/index expression
[18:55:08] /tmp/ccNhJljp.s:111: Error: `(%rdx,%r11d,1)' is not a valid base/index expression
[18:55:08] /tmp/ccNhJljp.s:125: Error: `(%r8,%r10d,1)' is not a valid base/index expression
[18:55:08] /tmp/ccNhJljp.s:131: Error: `(%rcx,%eax,1)' is not a valid base/index expression
[18:55:08] /tmp/ccNhJljp.s:175: Error: `(%r8,%ebx,1)' is not a valid base/index expression
[18:55:08] /tmp/ccNhJljp.s:178: Error: `(%rax,%r11d,1)' is not a valid base/index expression
[18:55:08] /tmp/ccNhJljp.s:195: Error: `(%rdx,%eax,1)' is not a valid base/index expression
[18:55:08] /tmp/ccNhJljp.s:203: Error: `(%rcx,%eax,1)' is not a valid base/index expression
[18:55:08] make[1]: *** [Makefile.L3:637: sbgemm_oncopy_SAPPHIRERAPIDS.obj] Error 1

and

...
[18:55:12] /tmp/ccJmjjeg.s:2368: Error: `(%rdx,%ebx,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2381: Error: `(%rsi,%ebx,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2387: Error: `(%r8,%r10d,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2444: Error: `(%rax,%ebx,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2454: Error: `(%rax,%r10d,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2461: Error: `(%rsi,%ebx,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2467: Error: `(%rax,%r10d,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2561: Error: `(%r14,%r11d,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2567: Error: `(%rdi,%eax,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2576: Error: `(%rax,%edx,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2582: Error: `(%r14,%r11d,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2634: Error: `(%rdi,%ecx,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2644: Error: `(%rdx,%r11d,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2647: Error: `(%r8,%r13d,1)' is not a valid base/index expression
[18:55:12] /tmp/ccJmjjeg.s:2653: Error: `(%rdx,%r11d,1)' is not a valid base/index expression
[18:55:12] make[1]: *** [Makefile.L3:820: sbgemm_kernel_SAPPHIRERAPIDS.obj] Error 1

The command line used to compile the kernel is

[18:54:57] cc -O2 -DSMALL_MATRIX_OPT -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -DDYNAMIC_ARCH -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=32 -DMAX_PARALLEL_NUMBER=1 -DBUILD_BFLOAT16 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.26\" -msse3 -mssse3 -msse4.1 -mavx -mavx2 -march=sapphirerapids -fno-asynchronous-unwind-tables -mavx2 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME=sbgemm_kernel_SAPPHIRERAPIDS -DASMFNAME=sbgemm_kernel_SAPPHIRERAPIDS_ -DNAME=sbgemm_kernel_SAPPHIRERAPIDS_ -DCNAME=sbgemm_kernel_SAPPHIRERAPIDS -DCHAR_NAME=\"sbgemm_kernel_SAPPHIRERAPIDS_\" -DCHAR_CNAME=\"sbgemm_kernel_SAPPHIRERAPIDS\" -DNO_AFFINITY -DTS=_SAPPHIRERAPIDS -I.. -DBUILD_KERNEL -DTABLE_NAME=gotoblas_SAPPHIRERAPIDS -march=sapphirerapids -fno-asynchronous-unwind-tables -DBFLOAT16 -UDOUBLE  -UCOMPLEX -c -DBFLOAT16 -UDOUBLE -UCOMPLEX ../kernel/x86_64/sbgemm_kernel_16x16_spr.c -o sbgemm_kernel_SAPPHIRERAPIDS.obj

We are using GCC 11.1, so the Sapphire Rapids architecture should be available.

Note that we are building with the patch from #4423 applied to the build tree.

@imciner2
Copy link
Contributor Author

Note, this is the compiler used:

sandbox:${WORKSPACE}/srcdir/OpenBLAS-0.3.26 # $CC --version
x86_64-w64-mingw32-gcc (GCC) 11.1.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@martin-frbg
Copy link
Collaborator

Could be a case of the assembler (binutils package) being too old for the compiler output ?

@imciner2
Copy link
Contributor Author

imciner2 commented Jan 23, 2024

Hmm, we're using the assembler that is included

 /opt/x86_64-w64-mingw32/bin/../lib/gcc/x86_64-w64-mingw32/11.1.0/../../../../x86_64-w64-mingw32/bin/as -v -I .. --64 -o sbgemm_oncopy_SAPPHIRERAPIDS.obj /tmp/ccpDOOBO.s
GNU assembler version 2.36 (x86_64-w64-mingw32) using BFD version (GNU Binutils) 2.36

Looking at the assembly produced by the GCC version, it is choking on the AMX instructions:

106 /APP
107  # 86 "../kernel/x86_64/sbgemm_oncopy_16_spr.c" 1
108     tileloadd   (%rax,%r10d,1), %tmm0
109  # 0 "" 2
110  # 87 "../kernel/x86_64/sbgemm_oncopy_16_spr.c" 1
111     tilestored  %tmm0, (%rdx,%r11d,1)
112  # 0 "" 2
113 /NO_APP

which correspond to the lines here

_tile_loadd(T_16x32, aoffset0, lda * 2);

According to the release notes, AMX instructions were added to gas/binutils in 2.36 (https://github.com/bminor/binutils-gdb/blob/48a121f83cae0a625f63d3ad5f8a9149f7fa964a/gas/NEWS#L256).

However, after quite a bit of digging, I think it is actually a GCC bug, and it is specifically because they decided to implement the _tile_loadd parts as macros instead of intrinics. There have been two changes since that was introduced, https://inbox.sourceware.org/gcc-patches/CAMZc-bw6HX04NbZY7UGiOvEXmXPY=eKo4UYnnECKrayRm21uag@mail.gmail.com/T/ and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106714. Since this is failing on Windows, I am guessing the latter is the problem (also, the fact the assembly is trying to only get 32-bits from the register instead of the full register would seem to suggest this). The downside is that is fixed in GCC 12.3 and GCC 13, so that implies the code in this kernel won't work with earlier versions.

@martin-frbg
Copy link
Collaborator

Looks like this is Windows-specific then, which might explain why it has not come up before. I guess I need to add some form of the tileloadd instruction to the code snippet used in the NO_AVX512BF16 check.

@imciner2
Copy link
Contributor Author

Thinking about this more, I don't know if it is really worth fixing in the OpenBLAS build system. It seems like it would be a bit of a complicated fix, and from what I can see it shouldn't be a problem in GCC 12.3 or 13+, so recent compilers should work, and I guess not a lot of people actually compile with older GCC and BFloat16 on Windows.

I think we'll work on patching this locally downstream as a fix for us (basically just patching the GCC headers we use to fix it, not an OpenBLAS patch).

@martin-frbg martin-frbg added the Bug in other software Compiler, Virtual Machine, etc. bug affecting OpenBLAS label Jan 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug in other software Compiler, Virtual Machine, etc. bug affecting OpenBLAS
Projects
None yet
Development

No branches or pull requests

2 participants