bobni@Training MINGW64 ~/OneDrive/Desktop/Projects/ao (main) $ python setup.py install running install C:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools. warnings.warn( C:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\setuptools\command\easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools. warnings.warn( running bdist_egg running egg_info writing torchao.egg-info\PKG-INFO writing dependency_links to torchao.egg-info\dependency_links.txt writing requirements to torchao.egg-info\requires.txt writing top-level names to torchao.egg-info\top_level.txt C:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\cpp_extension.py:499: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend. warnings.warn(msg.format('we could not find ninja.')) reading manifest file 'torchao.egg-info\SOURCES.txt' adding license file 'LICENSE' writing manifest file 'torchao.egg-info\SOURCES.txt' installing library code to build\bdist.win-amd64\egg running install_lib running build_py C:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\setuptools\command\build_py.py:202: SetuptoolsDeprecationWarning: Installing 'torchao.csrc' as data is deprecated, please list it in `packages`. !! ############################ # Package would be ignored # ############################ Python recognizes 'torchao.csrc' as an importable package, but it is not listed in the `packages` configuration of setuptools. 'torchao.csrc' has been automatically added to the distribution only because it may contain data files, but this behavior is likely to change in future versions of setuptools (and therefore is considered deprecated). Please make sure that 'torchao.csrc' is included as a package by using the `packages` configuration field or the proper discovery methods (for example by using `find_namespace_packages(...)`/`find_namespace:` instead of `find_packages(...)`/`find:`). You can read more about "package discovery" and "data files" on setuptools documentation page. !! check.warn(importable) C:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\setuptools\command\build_py.py:202: SetuptoolsDeprecationWarning: Installing 'torchao.csrc.cuda.fp6_llm' as data is deprecated, please list it in `packages`. !! ############################ # Package would be ignored # ############################ Python recognizes 'torchao.csrc.cuda.fp6_llm' as an importable package, but it is not listed in the `packages` configuration of setuptools. 'torchao.csrc.cuda.fp6_llm' has been automatically added to the distribution only because it may contain data files, but this behavior is likely to change in future versions of setuptools (and therefore is considered deprecated). Please make sure that 'torchao.csrc.cuda.fp6_llm' is included as a package by using the `packages` configuration field or the proper discovery methods (for example by using `find_namespace_packages(...)`/`find_namespace:` instead of `find_packages(...)`/`find:`). You can read more about "package discovery" and "data files" on setuptools documentation page. !! check.warn(importable) C:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\setuptools\command\build_py.py:202: SetuptoolsDeprecationWarning: Installing 'torchao.csrc.fp6_llm' as data is deprecated, please list it in `packages`. !! ############################ # Package would be ignored # ############################ Python recognizes 'torchao.csrc.fp6_llm' as an importable package, but it is not listed in the `packages` configuration of setuptools. 'torchao.csrc.fp6_llm' has been automatically added to the distribution only because it may contain data files, but this behavior is likely to change in future versions of setuptools (and therefore is considered deprecated). Please make sure that 'torchao.csrc.fp6_llm' is included as a package by using the `packages` configuration field or the proper discovery methods (for example by using `find_namespace_packages(...)`/`find_namespace:` instead of `find_packages(...)`/`find:`). You can read more about "package discovery" and "data files" on setuptools documentation page. !! check.warn(importable) running build_ext C:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\cpp_extension.py:384: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified warnings.warn(f'Error checking compiler version for {compiler}: {error}') building 'torchao._C' extension C:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc" -c torchao\csrc\cuda\fp6_llm\fp6_linear.cu -o build\temp.win-amd64-cpython-311\Release\torchao\csrc\cuda\fp6_llm\fp6_linear.obj -IC:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\include -IC:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\include\TH -IC:\Users\bobni\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" -IC:\Users\bobni\AppData\Local\Programs\Python\Python311\include -IC:\Users\bobni\AppData\Local\Programs\Python\Python311\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -O3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17 --use-local-env fp6_linear.cu C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_gmem.cuh(66): error: "restrict" is not allowed __declspec(__device__) __forceinline void CopyFromGlobalToShared(half __restrict (*SharedPTR)[(4 * 16)+8], ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\ptx_mma.cuh(41): error: "restrict" is not allowed __declspec(__device__) __forceinline void B_FromSharedToReg(uint32_t __restrict Reg[][4], ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\ptx_mma.cuh(42): error: "restrict" is not allowed half __restrict (*read_SPTR)[(4 * 16)+8], ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\ptx_mma.cuh(116): error: "restrict" is not allowed MMA_FP16_M16N8K16(uint32_t __restrict c[], uint32_t __restrict *a, uint32_t __restrict *b) ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\ptx_mma.cuh(116): error: "restrict" is not allowed MMA_FP16_M16N8K16(uint32_t __restrict c[], uint32_t __restrict *a, uint32_t __restrict *b) ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\ptx_mma.cuh(116): error: "restrict" is not allowed MMA_FP16_M16N8K16(uint32_t __restrict c[], uint32_t __restrict *a, uint32_t __restrict *b) ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(29): error: the modifier "__forceinline" is not allowed on this declaration __declspec(__device__) __forceinline void FP6_FP16_Cast_4Way(u_int32_t *R1, u_int32_t *R2) { ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(29): error: An inline __device__/__constant__/__managed__ variable must have internal linkage when the program is compiled in whole program mode (-rdc=false) __declspec(__device__) __forceinline void FP6_FP16_Cast_4Way(u_int32_t *R1, u_int32_t *R2) { ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(29): error: incomplete type is not allowed __declspec(__device__) __forceinline void FP6_FP16_Cast_4Way(u_int32_t *R1, u_int32_t *R2) { ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(29): error: identifier "u_int32_t" is undefined __declspec(__device__) __forceinline void FP6_FP16_Cast_4Way(u_int32_t *R1, u_int32_t *R2) { ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(29): error: identifier "R1" is undefined __declspec(__device__) __forceinline void FP6_FP16_Cast_4Way(u_int32_t *R1, u_int32_t *R2) { ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(29): error: identifier "R2" is undefined __declspec(__device__) __forceinline void FP6_FP16_Cast_4Way(u_int32_t *R1, u_int32_t *R2) { ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(29): error: expected a ";" __declspec(__device__) __forceinline void FP6_FP16_Cast_4Way(u_int32_t *R1, u_int32_t *R2) { ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(87): warning #607-D: this pragma must immediately precede a statement #pragma unroll(8) ^ Remark: The warnings can be suppressed with "-diag-suppress " C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(101): warning #12-D: parsing restarts here after previous syntax error FP6_FP16_Cast_4Way(&Packed_FP6, &tmp); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(103): error: this declaration has no storage class or type specifier *OutputRegs = MultScale(Packed_FP6, Scale_RPTR[0] ); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(103): error: identifier "Packed_FP6" is undefined *OutputRegs = MultScale(Packed_FP6, Scale_RPTR[0] ); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(103): error: identifier "Scale_RPTR" is undefined *OutputRegs = MultScale(Packed_FP6, Scale_RPTR[0] ); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(103): error: identifier "MultScale" is undefined *OutputRegs = MultScale(Packed_FP6, Scale_RPTR[0] ); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(104): error: this declaration has no storage class or type specifier OutputRegs += 1; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(104): error: variable "OutputRegs" has already been defined OutputRegs += 1; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(104): error: expected a ";" OutputRegs += 1; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(105): error: this declaration has no storage class or type specifier *OutputRegs = MultScale(tmp, Scale_RPTR[1]); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(105): error: variable "OutputRegs" has already been defined *OutputRegs = MultScale(tmp, Scale_RPTR[1]); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(105): error: identifier "tmp" is undefined *OutputRegs = MultScale(tmp, Scale_RPTR[1]); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(106): error: this declaration has no storage class or type specifier OutputRegs += 1; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(106): error: variable "OutputRegs" has already been defined OutputRegs += 1; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(106): error: expected a ";" OutputRegs += 1; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(108): error: expected a declaration if(i%2==1) Scale_RPTR += 2; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_parallel_dequant.cuh(109): error: expected a declaration } ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(98): warning #12-D: parsing restarts here after previous syntax error uint32_t a_1[2]; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(100): error: CopyFromSharedToRegister_AFrag is not a template CopyFromSharedToRegister_AFrag<2> (a_1, A1_SPTR_read, slice_id); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(100): error: expected a ")" CopyFromSharedToRegister_AFrag<2> (a_1, A1_SPTR_read, slice_id); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(100): error: expected a ";" CopyFromSharedToRegister_AFrag<2> (a_1, A1_SPTR_read, slice_id); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(101): error: CopyFromSharedToRegister_AFrag is not a template CopyFromSharedToRegister_AFrag<4> (a_2, A2_SPTR_read, slice_id); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(101): error: expected a ")" CopyFromSharedToRegister_AFrag<4> (a_2, A2_SPTR_read, slice_id); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(101): error: variable "a_2" has already been defined CopyFromSharedToRegister_AFrag<4> (a_2, A2_SPTR_read, slice_id); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(101): error: variable "slice_id" has already been defined CopyFromSharedToRegister_AFrag<4> (a_2, A2_SPTR_read, slice_id); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(101): error: expected a ";" CopyFromSharedToRegister_AFrag<4> (a_2, A2_SPTR_read, slice_id); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(102): error: this declaration has no storage class or type specifier Dequant_32FP6_4Way(a_write, a_1, a_2, RPTR_Scales); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(102): error: identifier "a_write" is undefined Dequant_32FP6_4Way(a_write, a_1, a_2, RPTR_Scales); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(102): error: identifier "RPTR_Scales" is undefined Dequant_32FP6_4Way(a_write, a_1, a_2, RPTR_Scales); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(102): error: too many initializer values Dequant_32FP6_4Way(a_write, a_1, a_2, RPTR_Scales); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(103): error: expected a declaration B_FromSharedToReg (b_write, B_SPTR_read, slice_id); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_core.cuh(104): error: expected a declaration } ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(54): warning #12-D: parsing restarts here after previous syntax error const size_t AverageNumBlock_K = NumBlock_K/Split_K; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(55): error: identifier "NumBlock_K" is undefined const size_t ExtraNumBlock_K = NumBlock_K - AverageNumBlock_K * Split_K; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(55): error: identifier "AverageNumBlock_K" is undefined const size_t ExtraNumBlock_K = NumBlock_K - AverageNumBlock_K * Split_K; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(55): error: identifier "Split_K" is undefined const size_t ExtraNumBlock_K = NumBlock_K - AverageNumBlock_K * Split_K; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(57): error: expected a declaration if(BatchID(smem); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(79): error: argument list for class template "TilingConfig" is missing uint32_t* AFrag_4BIT_SPTR = AFrag_2BIT_SPTR+(((4 * 16)*(4 * 16))*2/8)/4*TilingConfig::BLOCK_WARPS*2; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(81): error: this declaration has no storage class or type specifier AFrag_2BIT_SPTR += warpId * (((4 * 16)*(4 * 16))*2/8)/4; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(81): error: variable "AFrag_2BIT_SPTR" has already been defined AFrag_2BIT_SPTR += warpId * (((4 * 16)*(4 * 16))*2/8)/4; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(81): error: expected a ";" AFrag_2BIT_SPTR += warpId * (((4 * 16)*(4 * 16))*2/8)/4; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(82): error: this declaration has no storage class or type specifier AFrag_4BIT_SPTR += warpId * (((4 * 16)*(4 * 16))*4/8)/4; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(82): error: variable "AFrag_4BIT_SPTR" has already been defined AFrag_4BIT_SPTR += warpId * (((4 * 16)*(4 * 16))*4/8)/4; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(82): error: expected a ";" AFrag_4BIT_SPTR += warpId * (((4 * 16)*(4 * 16))*4/8)/4; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(84): error: expected a declaration for(int i=0; i<2-1; i++) { ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(91): warning #12-D: parsing restarts here after previous syntax error const half* TB_StartGPTR_A_Scale = Scales + (y*TilingConfig::BLOCK_ROW_WARPS) * 64; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(92): error: identifier "TB_StartGPTR_A_Scale" is undefined const half* WARP_StartGPTR_A_Scales = TB_StartGPTR_A_Scale + WARP_i * 64; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(93): error: this declaration has no storage class or type specifier CopyFromGlobalToShared_Scales(QuantScales+WARP_i*64, WARP_StartGPTR_A_Scales); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(93): error: declaration is incompatible with "void CopyFromGlobalToShared_Scales(half *, const half *)" (declared at line 52 of C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\utils_gmem.cuh) CopyFromGlobalToShared_Scales(QuantScales+WARP_i*64, WARP_StartGPTR_A_Scales); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(93): error: identifier "QuantScales" is undefined CopyFromGlobalToShared_Scales(QuantScales+WARP_i*64, WARP_StartGPTR_A_Scales); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(93): error: too many initializer values CopyFromGlobalToShared_Scales(QuantScales+WARP_i*64, WARP_StartGPTR_A_Scales); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(95): error: identifier "B" is undefined const half *BTile_GPTR = B + Tile_Start_N * K_Global + StartBlockID_K * TilingConfig::TILE_K; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(95): error: identifier "Tile_Start_N" is undefined const half *BTile_GPTR = B + Tile_Start_N * K_Global + StartBlockID_K * TilingConfig::TILE_K; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(95): error: identifier "K_Global" is undefined const half *BTile_GPTR = B + Tile_Start_N * K_Global + StartBlockID_K * TilingConfig::TILE_K; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(95): error: argument list for class template "TilingConfig" is missing const half *BTile_GPTR = B + Tile_Start_N * K_Global + StartBlockID_K * TilingConfig::TILE_K; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(96): error: expected a declaration for(int i=0; i<2-1; i++) { ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(101): warning #12-D: parsing restarts here after previous syntax error constexpr int NumRegSets_a = 4; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(102): error: argument list for class template "TilingConfig" is missing constexpr int NumRegSets_b = (TilingConfig::WARP_COL_MMA_TENSORS==1) ? 1 : TilingConfig::WARP_COL_MMA_TENSORS/2; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(102): error: argument list for class template "TilingConfig" is missing constexpr int NumRegSets_b = (TilingConfig::WARP_COL_MMA_TENSORS==1) ? 1 : TilingConfig::WARP_COL_MMA_TENSORS/2; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(104): error: identifier "NumRegSets_a" is undefined uint32_t a [NumRegSets_a * 2][4]; ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(108): error: expected a declaration for(int i=0; i(a, b, AFrag_2BIT_SPTR, AFrag_4BIT_SPTR, smem_array, Scales_RPTR); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(120): error: expected a ")" initialize_mma_slice(a, b, AFrag_2BIT_SPTR, AFrag_4BIT_SPTR, smem_array, Scales_RPTR); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(120): error: variable "a" has already been defined initialize_mma_slice(a, b, AFrag_2BIT_SPTR, AFrag_4BIT_SPTR, smem_array, Scales_RPTR); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(120): error: variable "b" has already been defined initialize_mma_slice(a, b, AFrag_2BIT_SPTR, AFrag_4BIT_SPTR, smem_array, Scales_RPTR); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(120): error: variable "AFrag_2BIT_SPTR" has already been defined initialize_mma_slice(a, b, AFrag_2BIT_SPTR, AFrag_4BIT_SPTR, smem_array, Scales_RPTR); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(120): error: variable "AFrag_4BIT_SPTR" has already been defined initialize_mma_slice(a, b, AFrag_2BIT_SPTR, AFrag_4BIT_SPTR, smem_array, Scales_RPTR); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(120): error: variable "Scales_RPTR" has already been defined initialize_mma_slice(a, b, AFrag_2BIT_SPTR, AFrag_4BIT_SPTR, smem_array, Scales_RPTR); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(120): error: expected a ";" initialize_mma_slice(a, b, AFrag_2BIT_SPTR, AFrag_4BIT_SPTR, smem_array, Scales_RPTR); ^ C:\Users\bobni\OneDrive\Desktop\Projects\ao\torchao\csrc\cuda\fp6_llm\kernel_matmul.cuh(124): error: expected a declaration for (size_t tile_id_k = 0; tile_id_k < NumIter; tile_id_k++) ^ Error limit reached. 100 errors detected in the compilation of "torchao/csrc/cuda/fp6_llm/fp6_linear.cu". Compilation terminated. error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.1\\bin\\nvcc.exe' failed with exit code 4 bobni@Training MINGW64 ~/OneDrive/Desktop/Projects/ao (main) $