
Spurious x86_64 Windows CI failures due to OOM #66342

Closed
ecstatic-morse opened this issue Nov 12, 2019 · 7 comments · Fixed by #66394
Labels
A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. P-high High priority regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@ecstatic-morse
Contributor

ecstatic-morse commented Nov 12, 2019

A MSVC x86_64 bors job has failed on two unrelated PRs (#60026, #66170) with the same error:

2019-11-12T18:01:21.3370489Z memory allocation of 4294967304 bytes failed[RUSTC-TIMING] hex test:false 3.438
2019-11-12T18:01:21.3472541Z error: could not compile `hex`.
2019-11-12T18:01:22.8915944Z [RUSTC-TIMING] glob test:false 1.984
2019-11-12T18:01:22.9032050Z error: build failed
2019-11-12T18:01:22.9081592Z command did not execute successfully: "D:\\a\\1\\s\\build\\x86_64-pc-windows-msvc\\stage0\\bin\\cargo.exe" "build" "-Zconfig-profile" "--target" "x86_64-pc-windows-msvc" "-Zbinary-dep-depinfo" "-j" "2" "--release" "--locked" "--color" "always" "--manifest-path" "D:\\a\\1\\s\\src/tools/cargo\\Cargo.toml" "--features" "rustc-workspace-hack/all-static" "--message-format" "json-render-diagnostics"
2019-11-12T18:01:22.9082199Z expected success, got: exit code: 101
2019-11-12T18:01:23.0062715Z failed to run: D:\a\1\s\build\bootstrap\debug\bootstrap dist

In both cases, the compiler tried to allocate several GB of memory in a single allocation while compiling the hex crate. One of these failures was in the dist-x86_64-msvc job, and the other was in x86_64-msvc-cargo.

This failure appears to be spurious: a previous version of #60026 passed the same job.

ecstatic-morse added the A-spurious, T-infra, and O-windows-msvc labels Nov 12, 2019
@Mark-Simulacrum
Member

I wonder if we could get a backtrace or something from this allocation -- it seems... suspicious that we're only sometimes OOMing.

It's also interesting to note that this is almost exactly 4 GB in a single allocation, which is surprising in general (hex is not a big crate -- 300 lines or so, with comments etc.).

@ecstatic-morse
Contributor Author

ecstatic-morse commented Nov 12, 2019

On my PR it was a different (almost) power of two:

memory allocation of 2147483656 bytes failed

Edit: In both cases, the size of the allocation is 8 bytes more than a power of two (2^32 + 8 and 2^31 + 8).
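Both sizes from the CI logs can be checked directly: subtracting 8 bytes leaves an exact power of two in each case. The helper below is a hypothetical illustration of that arithmetic, not anything taken from the compiler.

```rust
/// Returns `Some(k)` if `n == 2^k + 8`, i.e. the request is exactly
/// 8 bytes larger than a power of two; otherwise `None`.
fn power_of_two_plus_eight(n: u64) -> Option<u32> {
    // Values below 8 cannot have the form 2^k + 8.
    let base = n.checked_sub(8)?;
    if base.is_power_of_two() {
        // For a power of two, the number of trailing zero bits is the exponent.
        Some(base.trailing_zeros())
    } else {
        None
    }
}

fn main() {
    // Allocation sizes copied from the two failing CI jobs.
    for n in [4294967304u64, 2147483656] {
        match power_of_two_plus_eight(n) {
            Some(k) => println!("{n} = 2^{k} + 8"),
            None => println!("{n} is not of the form 2^k + 8"),
        }
    }
}
```

This prints `4294967304 = 2^32 + 8` and `2147483656 = 2^31 + 8`.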

@ollie27
Member

ollie27 commented Nov 13, 2019

So I minimized this to:

pub fn from_hex() -> [u8; 42949672960] {
    loop {}
}

On a recent nightly, this gives:

memory allocation of 42949672960 bytes failed

It looks like the compiler is trying to allocate enough space to store a value of the return type. As long as the array is large enough, it reproduces reliably, and it isn't specific to Windows: playground.
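If, as suggested above, the compiler sizes the request to hold one value of the return type, then the number in the error message should equal `size_of` of that type. A small check, assuming a 64-bit target (the array length does not fit in a 32-bit `usize`); the constant name is made up for illustration:

```rust
use std::mem::size_of;

/// Size in bytes of the minimized reproducer's return type, `[u8; 42949672960]`.
/// `size_of` is evaluated at compile time from the type alone; nothing is allocated.
const RETURN_TYPE_SIZE: usize = size_of::<[u8; 42949672960]>();

fn main() {
    // 42949672960 = 10 * 2^32 bytes, i.e. 40 GiB: the same figure as in
    // "memory allocation of 42949672960 bytes failed".
    println!("{RETURN_TYPE_SIZE} bytes ({} GiB)", RETURN_TYPE_SIZE >> 30);
}
```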

I bisected this to 57a5f92, and I'm guessing #66216 (cc @wesleywiser) is the root cause.

ollie27 added the I-compilemem, I-nominated, regression-from-stable-to-nightly, and T-compiler labels and removed the O-windows-msvc label Nov 13, 2019
@wesleywiser
Member

Yes, that seems likely. I will take a look at this tonight.

wesleywiser self-assigned this Nov 13, 2019
@ecstatic-morse
Contributor Author

ecstatic-morse commented Nov 13, 2019

@wesleywiser should we put #66074 on hold while this is sorted out?

@wesleywiser
Member

@ecstatic-morse Yeah, let's do that just to be safe.

ecstatic-morse changed the title from "Spurious x86_64 MSVC CI failures due to OOM" to "Spurious x86_64 Windows CI failures due to OOM" Nov 13, 2019
Aaron1011 added a commit to Aaron1011/rust that referenced this issue Nov 13, 2019
In issue rust-lang#66342, we're seeing extremely large allocations by rustc
(4GB for the `hex` crate, which is only a few hundred lines). This is
exhausting the memory on CI, causing jobs to fail intermittently.

This PR installs a custom allocation error hook for nightly compilers,
which attempts to trigger a panic after printing the initial error
message. Hopefully, this will allow us to retrieve a backtrace when one
of these large spurious allocations occurs.

The hook is installed in `librustc_driver`, so that other compiler
frontends (e.g. clippy) will get this logic as well.

I'm unsure if this needs to be behind any kind of additional feature
gate, beyond being nightly-only. This only affects compiler frontends,
not generic users of `libstd`. While this will make OOM errors on
nightly much more verbose, I don't think this is necessarily a bad
thing. I would expect that out of memory errors when running the
compiler are usually infrequent, so most users will probably never
notice this change. If any users are experiencing rust-lang#66342 (or something
like it) on their own crates, the extra output might even be useful to
them.

I don't know of any reasonable way of writing a test for this. I
manually verified the implementation by inserting:
`let _a: Vec<usize> = Vec::with_capacity(9999999999)` into
`librustc_driver` after the hook installation, and confirmed that
a backtrace was printed.

If we're very unlucky, it may turn out the large allocation on CI
happens to occur after several large successful allocations, leaving
extremely little memory available when we try to panic. If this is the case,
then we may fail to panic or print the backtrace, since panicking
currently allocates memory. However, we will still print an error
message, so the output will be no less useful (though a little more
spammy) than before.
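The hook idea in this commit can be sketched with stable Rust. On nightly, a hook of type `fn(Layout)` is registered with `std::alloc::set_alloc_error_hook` (unstable, behind `#![feature(alloc_error_hook)]`); the sketch below instead calls a hypothetical hook function by hand, feeding it the layout that the commit's manual test `Vec::with_capacity(9999999999)` for `Vec<usize>` would request (about 80 GB on a 64-bit target). It illustrates the approach only and is not the code from the commit.

```rust
use std::alloc::Layout;
use std::panic;

/// Hypothetical hook body with the `fn(Layout)` shape that nightly's
/// `std::alloc::set_alloc_error_hook` accepts. Here it is called directly.
fn alloc_error_hook(layout: Layout) {
    // Print the same one-line message the default handler prints...
    eprintln!("memory allocation of {} bytes failed", layout.size());
    // ...then panic, so running with RUST_BACKTRACE=1 also prints a backtrace.
    panic!("memory allocation of {} bytes failed", layout.size());
}

fn main() {
    // Layout for 9999999999 `usize` elements (64-bit target assumed).
    // `Layout::array` only computes size and alignment; nothing is allocated.
    let layout = Layout::array::<usize>(9_999_999_999).expect("size overflows isize");
    println!("requested {} bytes (~{} GiB)", layout.size(), layout.size() >> 30);

    // Invoke the hook as the allocation-failure path would, and catch the panic.
    let payload = panic::catch_unwind(move || alloc_error_hook(layout)).unwrap_err();
    let msg = payload.downcast_ref::<String>().expect("formatted panic payload");
    println!("hook panicked with: {msg}");
}
```

Note that the panic path itself allocates (for example, the boxed payload carrying the formatted message), which is why the commit message warns that the backtrace may be lost if memory is nearly exhausted.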
wesleywiser added a commit to wesleywiser/rust that referenced this issue Nov 14, 2019
@pnkfelix
Member

triage: P-high. Removing nomination label.

pnkfelix added the P-high label and removed the I-nominated label Nov 14, 2019
tmandry added a commit to tmandry/rust that referenced this issue Nov 14, 2019
JohnTitor added a commit to JohnTitor/rust that referenced this issue Nov 15, 2019
bors added a commit that referenced this issue Nov 16, 2019
Fix two OOM issues related to `ConstProp`

Fixes #66342
Fixes #66397

r? @oli-obk
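The fixing PR is titled as an OOM fix in `ConstProp`. One plausible shape for such a fix, shown purely as a hedged sketch, is to refuse to create backing storage for locals above a size cap, so that a 40 GiB return value is skipped rather than allocated. The names `MAX_LOCAL_SIZE` and `storage_for_local` and the 1024-byte cap are hypothetical; this is not claimed to be the code merged in #66394.

```rust
use std::alloc::Layout;

/// Hypothetical cap (in bytes) above which a const-propagation-style pass
/// declines to create backing storage for a local. Illustrative value only.
const MAX_LOCAL_SIZE: usize = 1024;

/// Returns zeroed backing storage for a local with the given layout, or
/// `None` if the local is too large to be worth tracking.
fn storage_for_local(layout: Layout) -> Option<Vec<u8>> {
    if layout.size() > MAX_LOCAL_SIZE {
        // Skip the local instead of allocating `layout.size()` bytes up front.
        return None;
    }
    Some(vec![0u8; layout.size()])
}

fn main() {
    // Return type of the minimized reproducer (64-bit target assumed);
    // `Layout::new` computes the layout without allocating.
    let huge = Layout::new::<[u8; 42949672960]>();
    let small = Layout::new::<[u8; 16]>();
    println!("huge:  {:?}", storage_for_local(huge).map(|v| v.len()));
    println!("small: {:?}", storage_for_local(small).map(|v| v.len()));
}
```

The design trade-off is that locals above the cap are simply not propagated, trading some optimization opportunities for a hard bound on the pass's memory use.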
bors closed this as completed in d389f64 Nov 17, 2019