Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid dbg.value between phi nodes when converting from DbgRecords #83620

Closed
wants to merge 1 commit into from

Conversation

dsandersllvm
Copy link
Collaborator

We have a downstream pass that at some point clones a phi instruction that is carrying some DPValues. When the verifier is called on IR containing phi nodes like:

%0 = phi ... ; has DPValues
%1 = phi ... ; cloned from previous phi

it currently converts to the debug intrinsics like so:

%0 = phi ...
call @llvm.dbg.value(...)
call @llvm.dbg.value(...)
...
%1 = phi ...

and then fails to verify because the phi's aren't first.

This patch moves the insertion point to a position after the phi nodes and any dbg.value calls it has already added.

It's surprising that the verifier modifies the IR but this is presumably temporary until RemoveDI's finishes. More surprising is that F.dump() performs this conversion but BB.dump() does not.

Unfortunately, I don't have a test case I can share as this problem occurred in a downstream pass and I'm not aware of any upstream that can be made to similarly clone a phi node.

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 1, 2024

@llvm/pr-subscribers-llvm-ir

Author: Daniel Sanders (dsandersllvm)

Changes

We have a downstream pass that at some point clones a phi instruction that is carrying some DPValues. When the verifier is called on IR containing phi nodes like:

%0 = phi ... ; has DPValues
%1 = phi ... ; cloned from previous phi

it currently converts to the debug intrinsics like so:

%0 = phi ...
call @<!-- -->llvm.dbg.value(...)
call @<!-- -->llvm.dbg.value(...)
...
%1 = phi ...

and then fails to verify because the phi's aren't first.

This patch moves the insertion point to a position after the phi nodes and any dbg.value calls it has already added.

It's surprising that the verifier modifies the IR but this is presumably temporary until RemoveDI's finishes. More surprising is that F.dump() performs this conversion but BB.dump() does not.

Unfortunately, I don't have a test case I can share as this problem occurred in a downstream pass and I'm not aware of any upstream that can be made to similarly clone a phi node.


Full diff: https://github.com/llvm/llvm-project/pull/83620.diff

1 Files Affected:

  • (modified) llvm/lib/IR/BasicBlock.cpp (+7-2)
diff --git a/llvm/lib/IR/BasicBlock.cpp b/llvm/lib/IR/BasicBlock.cpp
index 25aa3261164513..2001804e709fc8 100644
--- a/llvm/lib/IR/BasicBlock.cpp
+++ b/llvm/lib/IR/BasicBlock.cpp
@@ -109,9 +109,14 @@ void BasicBlock::convertFromNewDbgValues() {
       continue;
 
     DPMarker &Marker = *Inst.DbgMarker;
-    for (DbgRecord &DR : Marker.getDbgValueRange())
-      InstList.insert(Inst.getIterator(),
+    for (DbgRecord &DR : Marker.getDbgValueRange()) {
+      auto I = Inst.getIterator();
+      // Avoid inserting debug info between phi nodes.
+      if (isa<PHINode>(Inst))
+        I = getFirstNonPHIOrDbg()->getIterator();
+      InstList.insert(I,
                       DR.createDebugIntrinsic(getModule(), nullptr));
+    }
 
     Marker.eraseFromParent();
   }

Copy link

github-actions bot commented Mar 1, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

We have a downstream pass that at some point clones a phi instruction that is
carrying some DPValues. When the verifier is called on IR containing phi nodes
like:
```
%0 = phi ... ; has DPValues
%1 = phi ... ; cloned from previous phi
```
it currently converts to the debug intrinsics like so:
```
%0 = phi ...
call @llvm.dbg.value(...)
call @llvm.dbg.value(...)
...
%1 = phi ...
```
and then fails to verify because the phi's aren't first.

This patch moves the insertion point to a position after the phi nodes and
any dbg.value calls it has already added.

It's surprising that the verifier modifies the IR but this is presumably
temporary until RemoveDI's finishes. More surprising is that F.dump()
performs this conversion but BB.dump() does not.

Unfortunately, I don't have a test case I can share as this problem occurred
in a downstream pass and I'm not aware of any upstream that can be made to
similarly clone a phi node.
@jmorse
Copy link
Member

jmorse commented Mar 4, 2024

Thanks for the patch -- I have the feeling that this is treating the symptom rather than the cause, alas. This situation seems consistent with something mentioned on discourse [0], PHIs getting inserted into the wrong place leads to mixing of debug-records and PHIs. We were aware that if you inserted instructions into block using instruction pointers rather than iterators, you might end up with debug records attached one instruction adjacent to where it should be. Unfortunately, with blocks of PHIs, having a debug-record in the wrong place ends up being a hard verifier error, and we've accidentally glossed over that flaw (mea culpa).

I think we should avoid tolerating debug-records attached to PHIs (as this patch leans towards) as there's code out there which scans through all post-PHI instructions in blocks for debug-info and we'll be messing with that. Plus we would end up with multiple ways of representing the same inputs. However if this flaw is currently causing serious problems then we can land this as a workaround.

If you're OK with waiting a couple of days, there's a more general solution, which is our bulk-updates of all of LLVM to use iterators instead of Instruction * for insertion -- see 6b62a91 for example. We've got a patch that does 100% of this which we're cutting up and pushing into LLVM (it's NFC and mechanical). There's a bunch of PHI-creation sites that get updated, and hopefully the use of iterators there will solve whatever's causing your mixing of debug-records and PHIs, assuming that it's coming from an upstream pass. For downstream creation of PHI nodes: use getFirstNonPHIIt to identify the location and insert using the iterator-taking insert/Create/constructor methods.

It's surprising that the verifier modifies the IR but this is presumably temporary until RemoveDI's finishes. More surprising is that F.dump() performs this conversion but BB.dump() does not.

It's an awkward choice; that's another sticking plaster because the textual IR printing situation wasn't finalised -- although I believe that's now landed today. (CC @OCHyams and @SLTozer for visibility)

[0] https://discourse.llvm.org/t/rfc-instruction-api-changes-needed-to-eliminate-debug-intrinsics-from-ir/68939/19

@dsandersllvm
Copy link
Collaborator Author

Thanks for the review. This is occurring in a downstream pass that I've just confirmed is already using iterators as the insertion point for the cloned instructions (specifically, end() or getFirstInsertionPt()) so I suspect that upcoming patch won't fix our crash unless it somehow ensures that phis never have debug info so that adding a new phi doesn't leave the second-to-last phi with debug info.

I think we should avoid tolerating debug-records attached to PHIs (as this patch leans towards) as there's code out there which scans through all post-PHI instructions in blocks for debug-info and we'll be messing with that.

Am I right in understanding this as "phi's should never have debug info"? If so, two questions arise from that. The first is: What happens when code like this is converted?:

%0 = phi ...
call @llvm.dbg.value(%0, ...)

If the new debug info attaches to previous instructions as I currently think it does then any debug info attaching to a phi like this won't be found by code that scans the post-PHI instructions because those dbg.values end up attaching to the last phi and therefore aren't in the post-PHI instructions anymore. It may be that the debug info transfers to the phi's inputs but if that's the case then I'm not sure how a variable would still be tracked after the phi, nor how it's transferred back to it's original position and vreg in time for codegen which is still based on dbg.value's from what I've read so far.

The second is how we had a phi with debug info attached in the input to our pass to begin with. Our crash arises from adding a phi node after one that's already carrying debug info. The debug info in question originates from debugify so if there's new restrictions on the placement of debug info then that pass might not be aware of them

However if this flaw is currently causing serious problems then we can land this as a workaround.

We already landed this patch downstream to clear up our CI problems and we'll revert+replace or adapt it based on how it's resolved upstream.

@jmorse
Copy link
Member

jmorse commented Mar 5, 2024

unless it somehow ensures that phis never have debug info so that adding a new phi doesn't leave the second-to-last phi with debug info.

That's the intention / hope. With debug-intrinsics, not mixing PHIs and intrinsics is a guaranteed property because inserting at getFirstNonPHI or getFirstInsertionPt always points after the leading PHIs and before any debug intrinsics. The RemoveDIs work has created an ambiguity there -- an instruction position might be before or after any attached debug-info records which will be resolved by the extra bit of information that's being placed in BasicBlock::iterator. Essentially it tells any instruction inserter that because you called getFirstNonPHI / getFirstInsertionPt, anything inserted there (such as a new PHI) should be placed in front of any debug-info records.

I hate to use a video, but I've got an illustrated example of the problem here: https://www.youtube.com/watch?v=fZgFTOuhEzc&t=475s from 4 minutes to 9:30ish.

Am I right in understanding this as "phi's should never have debug info"?

Correct -- it was a property of LLVM using debug-intrinsics (because you can't put calls in between PHIs), and we're trying to preserve that property to have confidence that we're only changing how debug-info is represented, not what it means / how it behaves. Possibly we'll have to weaken that if this problem is too difficult to solve,

If the new debug info attaches to previous instructions as I currently think it does

It's actually the other way around -- debug-info records get attached to the following instruction. It's a convenient property because all well-formed blocks should have a terminator, thus almost all the time the debug-info records can be attached to an instruction. There's some temporary storage and hacks for scenarios where terminators get erased then recreated / re-inserted to a block, which is happily rare.

The second is how we had a phi with debug info attached in the input to our pass to begin with.

See the above paragraph on insertion and the video link -- something somewhere in LLVM has a scenario like this:

%foo = PHI ...
#dbg_value %foo, !123, !DIExpression....
%bar = add i32 ...

And code that calls getFirstNonPHI or getFirstInsertionPt on the block, gets an instruction pointer to the add instruction, and inserts a PHI there making this code:

%foo = PHI ...
#dbg_value %foo, !123, !DIExpression....
%baz = PHI ...
%bar = add i32 ...

Wheras if it inserted with a BasicBlock::iterator, the intention of that code to insert at the start of the block would be communicated to the inserter, and it would produce:

%foo = PHI ...
%baz = PHI ...
#dbg_value %foo, !123, !DIExpression....
%bar = add i32 ...

Thinking about it, we can probably put an assertion in the behind-the-scenes debug-info maintenance code to detect scenarios where this occurs early, rather than during the verifier. I'll also try to fast-forward everything to do with PHIs in our update-patch to happen immediately. Will report back!

@dsandersllvm
Copy link
Collaborator Author

That's the intention / hope. With debug-intrinsics, not mixing PHIs and intrinsics is a guaranteed property because inserting at getFirstNonPHI or getFirstInsertionPt always points after the leading PHIs and before any debug intrinsics....
...

Am I right in understanding this as "phi's should never have debug info"?
Correct -- it was a property of LLVM using debug-intrinsics (because you can't put calls in between PHIs), and we're trying to preserve that property to have confidence that we're only changing how debug-info is represented, not what it means / how it behaves. Possibly we'll have to weaken that if this problem is too difficult to solve,

That all makes sense to me

If the new debug info attaches to previous instructions as I currently think it does
It's actually the other way around -- debug-info records get attached to the following instruction. It's a convenient property because all well-formed blocks should have a terminator, thus almost all the time the debug-info records can be attached to an instruction. There's some temporary storage and hacks for scenarios where terminators get erased then recreated / re-inserted to a block, which is happily rare.

Ah yep, I can see it in the new code to print the debug records.

And code that calls getFirstNonPHI or getFirstInsertionPt on the block, gets an instruction pointer to the add instruction, and inserts a PHI there making this code:

I've tracked down the origin of the debug info now and it was indeed something like this but it was also quite hard to spot. Rather than being the cloned phi, it was an instruction built just after it whose insertion point was selected with:

Builder.SetInsertPoint(FI->getFirstNonPHI());

At first glance that doesn't look wrong but getFirstNonPHI() and getFirstNonPHIIt() return different positions despite the names implying they only differ by return type. Additionally, SetInsertPoint has overloads for both Instruction* and BasicBlock::iterator so it's not an error to use the wrong one. I'm inclined to say that getFirstNonPHI() should be marked deprecated (and things like getFirstNonPHIIt() which use it internally updated to use a name that isn't marked deprecated) with the message suggesting getFirstNonPHIIt() or getFirstNonPHIOrDbg() (this one skips PseudoProbeInst's too so it's not 100% the same)

Thinking about it, we can probably put an assertion in the behind-the-scenes debug-info maintenance code to detect scenarios where this occurs early, rather than during the verifier.

That would be great. Also, it's a lot easier to see what's going now that we have the record printing but one hack I still had to do to help me track it down was to stop the pass manager from converting back to the intrinsics so I could see the new records in -print-before-all/-print-after-all.

I'll also try to fast-forward everything to do with PHIs in our update-patch to happen immediately. Will report back!

Thanks

@dsandersllvm
Copy link
Collaborator Author

Closing this one since we have the assert instead (#84054). I've also created
https://github.com/llvm/llvm-project/pull/84595/files for another behaviour difference between getFirstNonPHI() and getFirstNonPHIIt() where the latter cannot currently be safely used on blocks that only contain phi nodes and have no terminator

@dsandersllvm dsandersllvm deleted the verifier-breaks-phis branch March 9, 2024 01:55
@jmorse
Copy link
Member

jmorse commented Mar 10, 2024

Thanks for chasing all that down,

At first glance that doesn't look wrong but getFirstNonPHI() and getFirstNonPHIIt() return different positions despite the names implying they only differ by return type. Additionally, SetInsertPoint has overloads for both Instruction* and BasicBlock::iterator so it's not an error to use the wrong one.

Curses, yes, that's another set of API calls we'll need to work on. We got caught by all this late in the day and were hoping to make the transition completely transparent. Apologies for the disruption,

Also, it's a lot easier to see what's going now that we have the record printing but one hack I still had to do to help me track it down was to stop the pass manager from converting back to the intrinsics so I could see the new records in -print-before-all/-print-after-all.

CC @SLTozer , IIRC print-before/after is due to print debug-records around about now (or last Friday or something)?

@SLTozer
Copy link
Contributor

SLTozer commented Mar 11, 2024

It should have been working for a while now, as-of the printing patch (2e39b57) having landed, and testing locally on up-to-date main opt --print-after/before-all shows debug records in the before/after dumps, regardless of whether --write-experimental-debuginfo is passed. If the print-before/after isn't working, I'd be interested to know what revision and what flags are being passed.

@dsandersllvm
Copy link
Collaborator Author

It's working now without cherry-picking 2e39b57 or the additional hack I needed last week. Upstream commits sometimes take a while to reach our fork as there's multiple forks between llvm.org and our project that they have to flow through first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants