Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

heisenbug: debug builds Integer::repr_discr has transient ICE on compile-fail/enum-discrim-too-small2.rs #47381

Closed
pnkfelix opened this issue Jan 12, 2018 · 27 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-ICE Issue: The compiler panicked, giving an Internal Compilation Error (ICE) ❄️ T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@pnkfelix
Copy link
Member

pnkfelix commented Jan 12, 2018

My config.toml:

[rust]
codegen-units = 8
debug = true
optimize = true
debug-assertions = true
debuginfo = true
debuginfo-lines = true

And my (odd?) build invocation:

% time python /Users/fklock/Dev/Mozilla/rust-mirborrowck/.git/../x.py test src/tools/tidy && \
  time python /Users/fklock/Dev/Mozilla/rust-mirborrowck/.git/../x.py build --stage 1 --incremental --keep-stage 0 src/libstd && \
  time python /Users/fklock/Dev/Mozilla/rust-mirborrowck/.git/../x.py test --stage 1 src/test/{mir-opt,compile-fail,run-pass}

Results in stage 1 binary where we get weirdness on this compile-fail test:

% ./objdir-dbgopt/build/x86_64-unknown-linux-gnu/stage1/bin/rustc src/test/compile-fail/enum-discrim-too-small2.rs
error: literal out of range for i8
  --> src/test/compile-fail/enum-discrim-too-small2.rs:18:11
   |
18 |     Ci8 = 223, //~ ERROR literal out of range for i8
   |           ^^^
   |
note: lint level defined here
  --> src/test/compile-fail/enum-discrim-too-small2.rs:11:9
   |
11 | #![deny(overflowing_literals)]
   |         ^^^^^^^^^^^^^^^^^^^^

error: literal out of range for i16
  --> src/test/compile-fail/enum-discrim-too-small2.rs:25:12
   |
25 |     Ci16 = 55555, //~ ERROR literal out of range for i16
   |            ^^^^^

error: literal out of range for i32
  --> src/test/compile-fail/enum-discrim-too-small2.rs:32:12
   |
32 |     Ci32 = 3_000_000_000, //~ ERROR literal out of range for i32
   |            ^^^^^^^^^^^^^

error: internal compiler error: librustc/ty/layout.rs:553: Integer::repr_discr: `#[repr]` hint too small for discriminant range of enum `Ei64

note: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report: https://github.com/rust-lang/rust/blob/master/CONTRIBUTING.md#bug-reports

note: rustc 1.25.0-dev running on x86_64-unknown-linux-gnu

thread 'rustc' panicked at 'Box<Any>', librustc_errors/lib.rs:508:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.

I am pretty sure this is specific to debug builds. At least, that's my current best guess as to why this hasn't been caught by bors.

The code layout.rs looks like this:

       if let Some(ity) = repr.int {
            let discr = Integer::from_attr(tcx, ity);
            let fit = if ity.is_signed() { signed_fit } else { unsigned_fit };
            if discr < fit {
                bug!("Integer::repr_discr: `#[repr]` hint too small for \
                  discriminant range of enum `{}", ty)
            }
            return (discr, ity.is_signed());
        }

it should probably be signalling a proper error, or trusting that the lint will eventually fire and catch this, rather than just triggering a call to bug! (since this does not represent a bug in the compiler itself, right?)

@pnkfelix
Copy link
Member Author

(It seems easy enough to add another LayoutError variant and returning that from Integer::repr_discr. Trying a build with that change now.)

@pnkfelix
Copy link
Member Author

Hmm. Apparently we have run-pass test that is relying on Integer::repr_discr call not erroring! Investigating...

@pnkfelix
Copy link
Member Author

(I am now suspecting that there is a Layout bug lurking here...)

@pnkfelix
Copy link
Member Author

FYI the test in question is https://github.com/rust-lang/rust/blob/master/src/test/run-pass/discrim-explicit-23030.rs

I'm double checking what's going on in the layout code now, to see if its something obviously wrong with the these explicitly assigned values...

@pnkfelix
Copy link
Member Author

Update: Something might have been terribly wrong with my builds. Things went south in other very strange ways while I was working on this.

@pnkfelix
Copy link
Member Author

Yes, as far as I can tell, at least some of the problems I was observing seem to have gone away after x.py clean ... :(

Closing this ticket.

@pnkfelix
Copy link
Member Author

pnkfelix commented Jan 15, 2018

(while I am not yet willing to reopen this ticket, I am seeing this same scenario arise on two different computers, which makes me think that while it may be some transient issue, its something we still may need to look more carefully into. But whatever's going on, the fact that it does not consistently reproduce is sign of a deeper problem than what this bug description indicates...)

Update: to be clear: It consistently reproduces for some compiler builds. That is, I can re-run the test over and over and see it. The problem is that small unrelated changes to the rustc source (e.g. sometimes just adding debug! calls, or expanding an existing one) can cause the problem to go away, and (AFAICT) it is not trivial to bring the problem back.

@pnkfelix
Copy link
Member Author

pnkfelix commented Jan 15, 2018

... just saw it arise again (on a build that I am pretty sure was working a little while ago...). For the record, on my Linux box, this is the command line I am invoking for each build:

../x.py test src/tools/tidy && ../x.py build --incremental --keep-stage 0 --stage 1 src/libstd && ../x.py test --stage 1 src/test/{mir-opt,compile-fail,run-pass}

I am becoming tempted to start digging into why this keeps cropping up for me...

@pnkfelix
Copy link
Member Author

pnkfelix commented Jan 16, 2018

For the record, I am seeing cases where (presumably inlined) calls to Integer::fit_signed are returning the wrong value for edge cases.

That is, after instrumenting the panic message with more data, I get a message that min: -9223372036854775808 ends up being mapped to I128 by the aforementioned function.

And that should not be happening. (That is, it is presumably some sort of code generation bug.)

@pnkfelix pnkfelix changed the title debug builds Integer::repr_discr ICE on compile-fail/enum-discrim-too-small2.rs heisenbug: debug builds Integer::repr_discr has transient ICE on compile-fail/enum-discrim-too-small2.rs Jan 23, 2018
@pnkfelix pnkfelix reopened this Jan 23, 2018
@pnkfelix
Copy link
Member Author

Reopening ticket. I'm seeing this on a build on my mac (it can crop up on either linux or mac) so I'm going to spend some time digging into the generated machine code and try to get to the bottom of what's going on.

@estebank estebank added the I-ICE Issue: The compiler panicked, giving an Internal Compilation Error (ICE) ❄️ label Jan 24, 2018
@pnkfelix
Copy link
Member Author

Okay so here's what I've found after much investigation (and learning how to get Visual Studio Code to do various breakpoint machinations):

Stepping through the instructions, it seems like the essential problem is arising from a miscompiled instruction sequence corresponding to an inlined call to rustc::tylayout::Integer::fit_signed. here's the definition for that method, for reference:

    /// Find the smallest Integer type which can represent the signed value.
    pub fn fit_signed(x: i128) -> Integer {
        match x {
            -0x0000_0000_0000_0080...0x0000_0000_0000_007f => I8,
            -0x0000_0000_0000_8000...0x0000_0000_0000_7fff => I16,
            -0x0000_0000_8000_0000...0x0000_0000_7fff_ffff => I32,
            -0x8000_0000_0000_0000...0x7fff_ffff_ffff_ffff => I64,
            _ => I128
        }
    }

The (lldb disassemble'd) instruction sequence I am seeing executed in the erroneous case looks like this (where the input to the function is being passed via %r14 and %r15 in the context of where the call is inlined):

(executed instructions annotated with values in relevant registers as each is run)
instruction address instruction %rax %rcx %rdx %rdi %r12 %r14 %r15 rflags cs
0xffffffff80000000 0xffffffffffffffff 0xffffffffffffffff 0x0000000000000000 0x0000000000000017 0xffffffffffffffff 0x8000000000000000 0x000000000000002b
0x1066ccb36 <+12534> movq %r15, %rax 0x8000000000000000
0x1066ccb39 <+12537> addq $0x80, %rax 0x8000000000000080 0x0000000000000282
0x1066ccb3f <+12543> movq %r14, %rcx 0xffffffffffffffff
0x1066ccb42 <+12546> adcq $0x0, %rcx 0xffffffffffffffff 0x0000000000000286
0x1066ccb46 <+12550> cmpq $0x100, %rax 0x0000000000000a02
0x1066ccb4c <+12556> sbbq $0x0, %rcx 0xffffffffffffffff 0x0000000000000286
0x1066ccb50 <+12560> jae 0x1066ccbbf 0x0000000000000286
... ... ... ... ... ... ... ... ... ... ...
0x1066ccbbf <+12671> movq %r15, %rcx 0x8000000000000000
0x1066ccbc2 <+12674> addq $0x8000, %rcx 0x8000000000008000
0x1066ccbc9 <+12681> movq %r14, %rdx 0xffffffffffffffff
0x1066ccbcc <+12684> adcq $0x0, %rdx 0xffffffffffffffff
0x1066ccbd0 <+12688> movb $0x1, %al 0x8000000000000001
0x1066ccbd2 <+12690> cmpq $0x10000, %rcx 0x0000000000000a06
0x1066ccbd9 <+12697> sbbq $0x0, %rdx 0xffffffffffffffff 0x0000000000000286
0x1066ccbdd <+12701> jb 0x1066ccbfd
0x1066ccbdf <+12703> movl $0x80000000, %eax 0x0000000080000000
0x1066ccbe4 <+12708> addq %rax, %r15 0x8000000080000000
0x1066ccbe7 <+12711> adcq $0x0, %r14 0xffffffffffffffff
0x1066ccbeb <+12715> shrdq $0x20, %r14, %r15 0xffffffff80000000 0x0000000000000287
0x1066ccbf0 <+12720> shrq $0x20, %r14 0x00000000ffffffff 0x0000000000000a07
0x1066ccbf4 <+12724> orq %r15, %r14 0xffffffffffffffff 0x0000000000000286
0x1066ccbf7 <+12727> movb $0x2, %al 0x0000000080000002
0x1066ccbf9 <+12729> je 0x1066ccbfd 0x0000000080000002
0x1066ccbfb <+12731> movb $0x4, %al 0x0000000080000004

The problem is that this is ending up with the value $0x4 (aka Integer::I128) in the result register %al, while this input should be yielding $0x3 (Integer::I64).

The assembly code that I see via passing --emit=asm,... to the relevant stage0/rustc invocation (I'm debugging a stage1 build) is:

Portion of `--emit=asm` output
LBB99_376:
	.loc	13 467 0
	movq	%r13, %rax
	addq	$128, %rax
	movq	%r14, %rcx
	adcq	$0, %rcx
	cmpq	$256, %rax
	sbbq	$0, %rcx
	jae	LBB99_383
Ltmp3275:
	.loc	13 0 0 is_stmt 0
	xorl	%eax, %eax
	jmp	LBB99_386
LBB99_378:
Ltmp3276:
	.loc	13 468 0 is_stmt 1
	movq	%r13, %rcx
	addq	$32768, %rcx
	movq	%r14, %rdx
	adcq	$0, %rdx
	movb	$1, %al
	cmpq	$65536, %rcx
	sbbq	$0, %rdx
	jb	LBB99_381
	.loc	13 469 0
	movl	$2147483648, %ecx
	addq	%r13, %rcx
	movq	%r14, %rdx
	adcq	$0, %rdx
	shrdq	$32, %rdx, %rcx
	shrq	$32, %rdx
	movb	$2, %al
	orq	%rcx, %rdx
	je	LBB99_381
	.loc	13 470 0
	movabsq	$-9223372036854775808, %rax
	addq	%rax, %r13
	movq	%r14, %rax
	adcq	$0, %rax
	setne	%al
	addb	$3, %al
Ltmp3277:
LBB99_381:
	.loc	13 467 0
	movq	%r15, %rcx
	addq	$128, %rcx
	movq	%rdi, %rdx
	adcq	$0, %rdx
	cmpq	$256, %rcx
	movl	%esi, %r13d
	sbbq	$0, %rdx
	jae	LBB99_388
Ltmp3278:
	.loc	13 0 0 is_stmt 0
	xorl	%ecx, %ecx
	jmp	LBB99_391
LBB99_383:
Ltmp3279:
	.loc	13 468 0 is_stmt 1
	movq	%r13, %rcx
	addq	$32768, %rcx
	movq	%r14, %rdx
	adcq	$0, %rdx
	movb	$1, %al
	cmpq	$65536, %rcx
	sbbq	$0, %rdx
	jb	LBB99_386
	.loc	13 469 0
	movl	$2147483648, %ecx
	addq	%r13, %rcx
	movq	%r14, %rdx
	adcq	$0, %rdx
	shrdq	$32, %rdx, %rcx
	shrq	$32, %rdx
	movb	$2, %al
	orq	%rcx, %rdx
	je	LBB99_386
	.loc	13 470 0
	movabsq	$-9223372036854775808, %rax
	addq	%rax, %r13
	adcq	$0, %r14
	setne	%al
	addb	$3, %al

Note in particular that the dumped assembly includes (I think) comparisons against the large constant $-9223372036854775808 (-8000000000000000 in hex). The disassembled instruction stream has no occurrences of that constant that I can see, nor have I been able to identify any instruction sequences to construct the constant or otherwise implement the relevant comparison here.

In short, my theory is that, for some reason, LLVM is optimizing away the code generated for this match arm on the i128 input:

            -0x8000_0000_0000_0000...0x7fff_ffff_ffff_ffff => I64,

but I believe this mal-optimization is occurring during LTO.

Its not clear to me whether its happening during ThinLTO or FatLTO. The stage0 binary that is used for this build of the stage1 rustc is:

% ./build/x86_64-apple-darwin/stage0/bin/rustc  --version
rustc 1.24.0-beta.1 (5b496b726 2018-01-02)

which I believe would be a build that predates #47548 but comes after #46382 ; so it is at least conceivable that this bug (which is unfortunately quite hard to reproduce/minimize) is a problem in ThinLTO alone...

@pnkfelix
Copy link
Member Author

This bug is especially scary to me given that we are currently making policy decisions about whether to turn ThinLTO on or off based solely on the compile-time/codesize impact, without any discussion of risk related to its correctness?

Of course I understand that compiler bugs are a fact of life (including in LLVM), but it might be better to have a uniform policy everywhere, so that when a bug does arise, it actually has pressure to get fixed? That is, if we're going to turn ThinLTO off in one context based on codesize/compile-time issues, then I might argue we should have it off everywhere just to keep bug investigation under control...

@kennytm kennytm added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-bug Category: This is a bug. A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. labels Jan 31, 2018
@nikomatsakis
Copy link
Contributor

Nominating for discussion in the @rust-lang/compiler meeting. @pnkfelix can share what the heck is going on here.

@pnkfelix
Copy link
Member Author

Unfortunately I'll be on PTO during the meeting tomorrow.

I'm not sure what I have to share beyond the information I attempted to pack into my earlier comment.

(I'm also a little shocked that I seem to be the only one who is running into this on any sort of consistent basis, on both Linux and Mac. But that could well be because I was working for a couple of weeks against a baseline point on master that was still using ThinLTO by default, which may or may not be a key catalyst for reproducing this bug...)

@pnkfelix
Copy link
Member Author

pnkfelix commented Feb 8, 2018

(Indeed, it seems like updating to a newer master has made my problem go away again...)

@nikomatsakis
Copy link
Contributor

I am seeing this problem too =) and I believe @spastorino was too. I'm going to try rebasing, but it's definitely worrisome.

@eddyb -- were you seeing this, or something else?

@eddyb
Copy link
Member

eddyb commented Feb 10, 2018

I don't think I've seen this one, no.

@spastorino
Copy link
Member

spastorino commented Feb 10, 2018

@nikomatsakis unsure if it's exactly the same thing or just related in some way. In my case 4 tests are consistently failing on my local machine.
I see failing run-pass/discrim-explicit-23030.rs (seems like the same thing), run-pass/enum-discrim-autosizing.rs (this one seems a bit different), run-pass/enum-discrim-range-overflow.rs (this seems also a bit different) & run-pass/issue-15523-big.rs (seems like the same thing).

You can check my log here https://gist.github.com/anonymous/f452914ff6e474c465ca02d28e628a07

@pnkfelix
Copy link
Member Author

@spastorino I wouldn't be surprised, given the nature of the machine code I was seeing, if all of those have the same root cause. (Which, the current theory goes, is due to ThinLTO mis-optimizing handling of 128-bit values...)

@nikomatsakis
Copy link
Contributor

I encounter this from time to time. Rebasing has seemed to help though.

@RalfJung
Copy link
Member

I'm seeing something similar with enum-discrim-too-small2.rs failing in stage 1 but succeeding in stage 2, see #48326 (comment)

@nikomatsakis
Copy link
Contributor

should we re-open this?

@spastorino
Copy link
Member

spastorino commented Feb 23, 2018

@nikomatsakis 👍 or open new issues 🙂

@nikomatsakis nikomatsakis reopened this Feb 26, 2018
@pnkfelix
Copy link
Member Author

pnkfelix commented Mar 7, 2018

I just saw this once again with the enum-discrim-too-small2.rs test (*) atop a branch that (once again) should not be affecting this, with stage0 like so:

% ./build/x86_64-unknown-linux-gnu/stage0/bin/rustc --version
rustc 1.25.0-beta.2 (1e8fbb143 2018-02-19)

(*) An earlier version of this comment erroneously claimed the issue was arising during bootstrap of the compiler itself. That was wrong.

And this time its on a system with a working copy of rr, so I'm going to investigate...

@Zoxc
Copy link
Contributor

Zoxc commented Mar 24, 2018

I've been seeing this on https://github.com/Zoxc/rust/tree/run-fast-ice with x.py test src/test/run-pass (on combined run-pass tests, not enum-discrim-too-small2.rs)

@Enselic
Copy link
Member

Enselic commented Jul 26, 2023

The bug seems to have manifested itself fairly frequently when it was present. Now it is 5 years ago someone reported they encountered it. A generous (and IMHO reasonable) interpretation of that is that the heisenbug is gone and that this I-ICE issue can be closed.

@Enselic
Copy link
Member

Enselic commented Oct 10, 2023

Triage: No objections to the close proposal for two months. Closing!

@Enselic Enselic closed this as completed Oct 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-ICE Issue: The compiler panicked, giving an Internal Compilation Error (ICE) ❄️ T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

9 participants