Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ptx assembly aborted due to errors #58491

Open
carlosgalvezp opened this issue Oct 20, 2022 · 26 comments
Open

Ptx assembly aborted due to errors #58491

carlosgalvezp opened this issue Oct 20, 2022 · 26 comments

Comments

@carlosgalvezp
Copy link
Contributor

Hi!

We are bumping Clang to commit 1ae33bf, and we find that it crashes building CUDA code with this error trace:

ptxas /tmp/patch-4eaef1/patch-sm_61.s, line 3885; fatal   : Parsing error near '.': syntax error
ptxas fatal   : Ptx assembly aborted due to errors
clang: �[0;1;31merror: ptxas command failed with exit code 255 (use -v to see invocation)
clang version 16.0.0 (https://github.com/llvm/llvm-project.git 1ae33bf42680b156fe0f5cd6163bf24ef45d8cd3)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: external/llvm/usr/bin

Is this a known problem?

@carlosgalvezp
Copy link
Contributor Author

carlosgalvezp commented Oct 20, 2022

It has to do with the __func__ function, and only when compiling in debug mode -g. Minimal repro with CUDA 11.7 Clang trunk:

#include <cstdio>

__global__ void foo_kernel()
{
    printf("%s", __func__);
}

void foo()
{
    foo_kernel<<<10, 1>>>();
}
clang --cuda-path=/path/to/cuda-11.7 -c  -g --cuda-gpu-arch=sm_75 -o foo.cu.o foo.cu 

@carlosgalvezp
Copy link
Contributor Author

carlosgalvezp commented Oct 20, 2022

Bisecting brings me here: 7aa1fa0
FYI @hctim @dwblaikie @rnk @adrian-prantl

@Artem-B
Copy link
Member

Artem-B commented Oct 20, 2022

https://godbolt.org/z/8bMYcf1z7

The debug info directive that ptxas does not like is on line 655:

.b64 __func__._Z10foo_kernelv

It should've been __func___$__Z10foo_kernelv. Apparently NVPTX's name normalizer didn't get applied to the symbol name in debug info.

A work-around would be to disable GPU-side debug info with -Xarch_device -g0

@Artem-B Artem-B self-assigned this Oct 21, 2022
@carlosgalvezp
Copy link
Contributor Author

Thanks for the quick help! Will try the workaround :)

@vangohao
Copy link

vangohao commented Jan 3, 2024

https://godbolt.org/z/8bMYcf1z7

The debug info directive that ptxas does not like is on line 655:

.b64 __func__._Z10foo_kernelv

It should've been __func___$__Z10foo_kernelv. Apparently NVPTX's name normalizer didn't get applied to the symbol name in debug info.

A work-around would be to disable GPU-side debug info with -Xarch_device -g0

Is this issue solved? I am encountering this issue with clang and llvm 17.0.6

@tuero
Copy link

tuero commented Mar 6, 2024

I'm encountering a similar issue with clang version 18.0.0git (https://github.com/llvm/llvm-project.git b7376c319630a6b8395f3df5a46ba73e8fe29ea9), where debug builds fail when using __PRETTY_FUNCTION__

@tambry
Copy link
Contributor

tambry commented Mar 15, 2024

Minimal repro:

echo '__attribute__((device)) void foo(){__PRETTY_FUNCTION__;}' | clang -cc1 -triple nvptx64-nvidia-cuda -S -fcuda-is-device -debug-info-kind=constructor -fno-dwarf-directory-asm -Wno-everything -x cuda | ptxas -

Removing -debug-info-kind=constructor works around this.

@Artem-B
Copy link
Member

Artem-B commented Mar 15, 2024

It looks like another case of LLVM generating symbol names with a dot in it and sneaking through our attempts to normalize such names:

 .global .align 1 .b8 __PRETTY_FUNCTION___$__Z3foov[11] = {118, 111, 105, 100, 32, 102, 111, 111, 40, 41};
...

.b64 __PRETTY_FUNCTION__._Z3foov

The variable itself does have . mangled, but the reference from debug info does not.

Switching to line-only debug info would work around the issue, too.

@tambry
Copy link
Contributor

tambry commented Mar 19, 2024

Looked into this quite a bit. It seems the name gets embedded in a debug DIE during the annotation-remarks pass in getOrCreateGlobalVariableDIE()addLocationAttribute()addOpAddress(). Somehow there end up being 2 MCSymbols related to __PRETTY_FUNCTION__._Z3foov and nvptx-assign-valid-global-names renames the general one, but not the one that was embedded into the DIE...

After spending already too much time looking into this and not understanding enough about the guts of the LLVM debug information infrastructure I took the easy way out:

Generate pre-defined lvalue names without dots

`.` should be converted to `_$_` by the nvptx-assign-valid-global-names pass as `ptxas` doesn't support dots.
But during the ASMPrinter initialization the global variable name gets embedded in a debug DIE.
There somehow end up being two different `MCSymbol`s for the global variable with only the main one being renamed.

Bug: https://github.com/llvm/llvm-project/issues/58491
--- a/clang/lib/CodeGen/CGExpr.cpp
+++ b/clang/lib/CodeGen/CGExpr.cpp
@@ -3277,7 +3277,12 @@ LValue CodeGenFunction::EmitPredefinedLV
     FnName = FnName.substr(1);
   StringRef NameItems[] = {
       PredefinedExpr::getIdentKindName(E->getIdentKind()), FnName};
-  std::string GVName = llvm::join(NameItems, NameItems + 2, ".");
+  std::string GVName;
+  if (CGM.getLangOpts().CUDA && CGM.getLangOpts().CUDAIsDevice) {
+    GVName = llvm::join(NameItems, NameItems + 2, "_$_");
+  } else {
+    GVName = llvm::join(NameItems, NameItems + 2, ".");
+  }
   if (auto *BD = dyn_cast_or_null<BlockDecl>(CurCodeDecl)) {
     std::string Name = std::string(SL->getString());
     if (!Name.empty()) {

@Artem-B
Copy link
Member

Artem-B commented Mar 19, 2024

I think we've dealt with a similar issue in the dwarf debug info before. Let me see if I can find it.

@Artem-B
Copy link
Member

Artem-B commented Mar 20, 2024

I think I had 2e7e097 in mind, but it may not be helpful here as it was dealing with the concept of private prefixes. Here the symbol which causes the problem is a . used as a separator.

I believe we did discuss invalid symbol issues in the past, but I do not think it ever went anywhere.
E.g. the discussion on https://reviews.llvm.org/D40573 still seems to be somewhat relevant.
Especially this bit:

This is silly. This bug has been open for so long that nvidia could've just fixed their toolchain by now to accept dots in symbol names.

Back to figuring out how to fix this instance.

But during the ASMPrinter initialization the global variable name gets embedded in a debug DIE.
There somehow end up being two different MCSymbols for the global variable with only the main one being renamed.

Oh, well. Looks like we may need to do it the hard way and teach nvptx-assign-valid-global-names how to deal with the symbols in debug info. It would still be dealing with the consequences, but at least the mess would be contained in one place.

@alexey-bataev Would you happen to have any idea on what would be the best way to get DWARF's symbol references mangled the same way we mangle other symbols in NVPTX?

@alexey-bataev
Copy link
Member

I always thought that we need to handle it in the frontend. But it is only my thought, feel free to discard it.

@Artem-B
Copy link
Member

Artem-B commented Mar 20, 2024

Avoiding such symbols in the front-end is would avoid some of the issues (granted, including this one), but a symbol with a dot may materialize within LLVM itself. Granted, it may not happen often in practice. It's also possible that such symbol cloning would not be affected by this issue (e.g. if, unlike this case, debug info would point to the same MCSymbol for the cleaned up name).

Here are the options I see:

  • Get NVIDIA to change ptxas and allow a more sensible set of characters in identifiers. The problem is that it's not going to help us for a long time, as we need to deal with ptxas versions that are out there already.
  • Change LLVM to use something other than . when it needs to create identifiers. This has consequences for ABI. E.g. host/device symbols will get mangled differently. That would be a problem.
  • Because of the above, this name cleanup may need to be applied selectively on multiple targets (NVPTX + supported host architectures, currently x86 and ARM). E.g. we'll want to apply it to all symbols on the GPU, and to all symbols that need to have the same name across host/GPU boundary. E.g. kernels and other GPU-side symbols we may need to refer to from the host.
  • cleanup the names in the front-end. This is a very narrow workaround for a subset of these 'illegal character' issues. Should be enough to deal with this case, but I do not like it because it's not front-end's job to know about the quirks of something many abstraction levels below it. Front-end should be contrained by the contract between it and LLVM. If the symbol is valid for LLVM, how it gets lowered into target assembly is LLVM's responsibility.
  • Teach nvptx-assign-valid-global-names how to fix symbol names in associated debug info. I think that may be the least bad trade-off we may have at the moment. The caveat is that I have no idea how much effort it would take.

@dwblaikie If we rename a global symbol how hard is that to find and update references to the symbol from debug info. I suspect we already do that somewhere in LLVM. Can you point me in the right direction?

@dwblaikie
Copy link
Collaborator

Not sure if existing instances of this (as you say, abi would mostly make it impossible to change symbol names effectively)

But if you want to try it - the disubprogram attached to the function, if it has the mangled name (maybe it doesn't, maybe it just depends on the actual symbol name of the llvm::function in which case you wouldn't have to do anything for debuginfo) - that should be updated.

@tambry
Copy link
Contributor

tambry commented Mar 21, 2024

The DISubprogram name referred to the correct MCSymbol* and was correct AFAIK. The problematic name instead seemed to be attached to the !17 debug annotation on the ret.

source_filename = "-"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@"__PRETTY_FUNCTION___$__Z3foov" = private unnamed_addr constant [11 x i8] c"void foo()\00", align 1, !dbg !0

; Function Attrs: convergent mustprogress noinline nounwind optnone
define dso_local void @_Z3foov() #0 !dbg !14 {
entry:
  ret void, !dbg !17
}

attributes #0 = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-features"="+ptx32" }

!llvm.dbg.cu = !{!8}
!llvm.linker.options = !{}
!llvm.module.flags = !{!10, !11, !12}
!llvm.ident = !{!13}

!0 = !DIGlobalVariableExpression(var: !1, expr: !DIExpression())
!1 = distinct !DIGlobalVariable(scope: null, file: !2, line: 1, type: !3, isLocal: true, isDefinition: true)
!2 = !DIFile(filename: "<stdin>", directory: "/home/raul.tambre")
!3 = !DICompositeType(tag: DW_TAG_array_type, baseType: !4, size: 88, elements: !6)
!4 = !DIDerivedType(tag: DW_TAG_const_type, baseType: !5)
!5 = !DIBasicType(name: "char", size: 8, encoding: DW_ATE_signed_char)
!6 = !{!7}
!7 = !DISubrange(count: 11)
!8 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !2, producer: "Clebian clang version 19.0.0", isOptimized: false, runtimeVersion: 0, emissionKind: FullDebug, globals: !9, splitDebugInlining: false, nameTableKind: None)
!9 = !{!0}
!10 = !{i32 2, !"Debug Info Version", i32 3}
!11 = !{i32 1, !"wchar_size", i32 4}
!12 = !{i32 4, !"nvvm-reflect-ftz", i32 0}
!13 = !{!"Clebian clang version 19.0.0"}
!14 = distinct !DISubprogram(name: "foo", linkageName: "_Z3foov", scope: !2, file: !2, line: 1, type: !15, scopeLine: 1, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !8)
!15 = !DISubroutineType(types: !16)
!16 = !{null}
!17 = !DILocation(line: 1, column: 56, scope: !14)

I managed to write something to reach that instruction, but not how to reach the DIE that had the wrong name embedded already. Seems to be an abstraction layer away and inaccessible in such a pass. It seemed having nvptx-assign-valid-global-names run as one the first passes before the DIE is created might work.

@dwblaikie
Copy link
Collaborator

Sorry, I'm not following that last comment - the DISubprogram is the same one from the Function and from the DILocation.

I take it this renaming isn't done at the IR level, OK - so it's not about updating the DISubprogram itself to match a change to the Function, but later than that.

Sure enough then - DwarfUnit::applySubprogramDefinitionAttributes calls addLinkageName - I guess it'd need some awkward mapping in DwarfDebug of DISubprogram back to llvm::Function... I don't feel good about that, maybe there's some other way to handle it, but you could at least prototype that.

@tambry
Copy link
Contributor

tambry commented Mar 21, 2024

Sorry, I'm not following that last comment - the DISubprogram is the same one from the Function and from the DILocation.

I guess was aiming at that you can't get the DILocation from the DISubprogram, but rather have to iterate the instructions to find the return instruction with the appropriate debug annotation. At least it seemed to me so, but chewing through the API and abstractions was difficult when I did try. 🙂

@dwblaikie
Copy link
Collaborator

Sorry, I'm not following that last comment - the DISubprogram is the same one from the Function and from the DILocation.

I guess was aiming at that you can't get the DILocation from the DISubprogram, but rather have to iterate the instructions to find the return instruction with the appropriate debug annotation. At least it seemed to me so, but chewing through the API and abstractions was difficult when I did try. 🙂

Ah, yes, DILocations aren't accesible top-down from the DISubprogram, only bottom-up from the DISubprogram's Function's instructions.

@aminiussi
Copy link

aminiussi commented Jul 10, 2024

Hello, I have a similar issue with llvm 18.1.8 and CUDA 12.5. Is that expected ?

The failing line is

.b64 __PRETTY_FUNCTION__._ZN7fargOCA6nsnextILm1EEEmmm

and the message:

alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:14:54_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$ptxas -v  ./disk-cuda-nvptx64-nvidia-cuda-sm_89.s
ptxas ./disk-cuda-nvptx64-nvidia-cuda-sm_89.s, line 148598; fatal   : Parsing error near '.': syntax error
ptxas fatal   : Ptx assembly aborted due to errors
alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$

It is also failing with main branch:

alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$/opt/llvm-main/bin/clang++ --version
clang version 19.0.0git (https://github.com/llvm/llvm-project.git 08ce14732d528ab70309f334446d39782f2f07c0)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/llvm-main/bin
alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$$ptxas disk-cuda-nvptx64-nvidia-cuda-sm_89.s
ptxas disk-cuda-nvptx64-nvidia-cuda-sm_89.s, line 145450; fatal   : Parsing error near '.': syntax error
ptxas fatal   : Ptx assembly aborted due to errors
alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$

@Artem-B
Copy link
Member

Artem-B commented Jul 10, 2024

Unfortunately, the issue is still there, and we still do not have a good fix. Disabling GPU-side dwarf debug info with -Xarch_device -g0 is the best workaround we have at the moment.

@tambry
Copy link
Contributor

tambry commented Jul 11, 2024

@Artem-B Would you object to applying this workaround to master until someone actually dives into the guts of the DI subsystem to find the bug?

@Artem-B
Copy link
Member

Artem-B commented Jul 11, 2024

Doing it on clang side would depend on the name mangling implementation details in NVPTX back-end. I think a better approach would be to try intercepting printouts of .bN constants and mangle the ones that look like a symbol. While it's not the best time/place to do that mangling (we do need to figure out how to fix it wherever the unmangled name lives, and fix it there), but it would be doing the right thing in principle, using the same mangling we're applying to the symbols produced by NVPTX.

@aminiussi
Copy link

@Artem-B Would you object to applying this workaround to master until someone actually dives into the guts of the DI subsystem to find the bug?

Fine by me. Thanks

@sidarth-narayanan-csi
Copy link

Hello,

We are facing the same issue, was the workaround fix already merged into the master ? If yes, what is the version number for reference ?

Thank you

@Artem-B
Copy link
Member

Artem-B commented Aug 6, 2024

This problem is not fixed yet. You may work around by disabling GPU-side debug info with -Xarch_device -g0

@sidarth-narayanan-csi
Copy link

@Artem-B Thank you for the response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants