Frequent Win32 segfaults on AppVeyor #10045
Comments
Possibly related, on 64 bit the tests have been intermittently taking much longer than usual. Linalg, fft, dsp, and parallel are occasionally several times slower than usual. |
Just hit one in: At line 614 we have
cc: @stevengj |
That function seems to always show up in the backtraces, not sure why. I suspect the problem is something unrelated to utf8proc. Maybe gensym related (based on merge timing of when this started getting frequent), maybe related to the 32-bit complex issues that have been happening on Linux (#10027). |
I can reproduce this locally, FWIW. I'm trying with various combinations of julia-debug, removing sys.dll, etc to see if I can get any more information here. |
@vtjnash any idea what's up with this? https://gist.github.com/tkelman/7e6aa9bf14d382ba5b65 |
Looks like julia-debug.exe has been failing the spawn test ever since the changes to julia_init, and that's true on both win32 and win64 so I'll open a separate issue. Makes this issue harder to debug at the moment though. |
Unfortunately #10145 didn't appear to help here. Still getting segfaults. |
Is there a reliable way to reproduce this locally? |
Reliable? Not really. Run the tests a bunch of times and it will happen locally though. |
Or |
@tkelman maybe we could run UserDump on appveyor: http://support.microsoft.com/kb/241215 |
I'm going to tentatively call this fixed by #10275. Will reopen if it comes back. |
👏 |
Sad news: https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2766/job/oxcc6ghee6br08i9 (Believe me, I'm as disappointed as you.) |
What's odd is I ran the tests 10 times in a row on your branch from #10275 and had no problems locally. Either some corruption can randomly happen at compile/bootstrap time, or something else changed on master that was not included in the history of that branch, or we're just unlucky. |
Well, that's good news! So maybe there are multiple issues, and #10275 fixed one of them. Feel free to close again. |
Maybe. We are still getting a lot of failures, so it would appear closing this was premature. |
Based on when this initially started getting frequent, my best guesses at the original cause are either a0266d4 or 5a4e1f8. The first 2 linked jobs up top were PR builds that may have been merging into a much newer state of master than their build number would indicate, depending on how long they waited in the queue. The 3rd link was a doc commit to master (4b03533) and is probably the most useful point in the history to start looking from. |
@vtjnash I'm increasingly convinced this is gensym-related. I ran the tests repeatedly, several dozen times, on 51d5412, without any problems. 77d1394 appeared to be okay, but df0e099 and following commits segfault (sometimes during |
@timholy you didn't happen to clone Jameson too while you were at it, did ya? On the gensym merge commit 2d23b6f I get an odd broken module when running win32 julia-debug.exe on the arrayops test: https://gist.github.com/tkelman/8d0ada9bceacc36f3e9d |
that was fixed in e41a507 |
the gensym commit is almost entirely a front-end compiler optimization, so it should have little effect on the final code generated. it should also have no win32-specific dependency, so we should see the same failures on linux32. you are still seeing a broken module after e41a507? |
Sorry, no, I should've been more specific. At e41a507 I do not see the broken module, but I do see intermittent segfaults that look very similar to what's been happening on AppVeyor. It may be the case that this would also happen on 32 bit Linux, but we run the test suite less often on that architecture. Here's a possibly-related segfault on the 32 bit Linux buildbot: http://buildbot.e.ip.saba.us:8010/builders/build_ubuntu14.04-x86/builds/789/steps/shell_2/logs/stdio My PR to turn on a 32 bit Linux build on Travis (#9153) works now and could be merge-able. |
Certainly looks like it. The backtrace isn't always so ridiculously long, but I've seen a few different cases lately where it was. This does happen locally, and with enough persistence I have even managed to catch it in gdb once or twice, but I wasn't able to get any backtrace at all. |
Here's a tiny piece of backtrace from a local segfault at the parallel test:
|
at 7e8b10c I was getting this repeatedly during bootstrap, same basic first-line of backtrace
|
and, rebuilt with
Any clues in there? Anything more specific I should try to look at here? |
Interesting, the innermost
which is really similar to something that was happening in #10235 |
The
|
@tkelman I'm really grateful for the work you're doing here. Time to update your Amazon Wishlist or such! |
Ah. That does explain it! The win32 malloc returns 4-byte aligned values, whereas other malloc implementations will return 16-byte aligned values. |
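To make the alignment point concrete, here is a minimal C sketch of how one might check a pointer's alignment and work around an allocator with a weaker guarantee. It assumes the crash comes from code that expects 16-byte aligned memory, which this thread does not confirm. The helper names `is_aligned`, `malloc_aligned16`, and `free_aligned16` are hypothetical and are not taken from the Julia sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p is aligned to `align` bytes; `align` must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: over-allocate with plain malloc, round the address up
 * to the next 16-byte boundary, and stash the raw pointer just before the
 * aligned block so it can be freed later. */
static void *malloc_aligned16(size_t size)
{
    void *raw = malloc(size + 16 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    /* Leave room for the stashed pointer, then round up to a multiple of 16. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + 15) & ~(uintptr_t)15;
    assert(is_aligned((void *)aligned, 16));

    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Frees a block obtained from malloc_aligned16; NULL is accepted. */
static void free_aligned16(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

A block returned by `malloc_aligned16` must be released with `free_aligned16`, never with `free` directly, because the aligned address is generally not the one `malloc` handed out. Calling `is_aligned(ptr, 16)` on pointers from the plain allocator is a cheap way to test whether the platform's default guarantee is weaker than the code assumes.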
These |
that'll be fixed in llvm3.5 when we switch to getting this info from llvm. currently, we use the windows dbghelp library for this. it doesn't seem like microsoft has made any real improvements to that library in at least 10 years. it doesn't support the dwarf debugging format (although, why would it?), but that's the only format gcc knows how to create. edit: note that it identifies most functions as utf8proc_NFKC, so you can't just make a simple 1-to-1 shift |
Also on the topic of this issue, I haven't been watching every appveyor failure lately but I think there's been an occasional 32-bit failure somewhat reminiscent of this one that has crept back in. edit: at least it's way less frequent https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.3465/job/poidsluygaha8di6 https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.3560/job/slad1jgx5m7yo8i0 https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.3605/job/ac7kadcn4vlsbdy2 |
This has been happening a bunch lately. Sometimes during bootstrap, sometimes in the tests. Not sure what's up.
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2049/job/5d6plrsb90yeqn92
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2050/job/wgciqc00rb0bv3h7
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2071/job/puaikgsjvfqmyc63
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2083/job/g7bet5eb1qong90a
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2089/job/y3eu478673aat464
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2113/job/2ihsooq03nct3x2s
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2119/job/dc15k9j0en62hqfy
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2151/job/31h7l8at66atva6a
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2163/job/n68op35scqu329wt
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2164/job/gwsfpj6tn6a9fw6o
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2168/job/upbdide2tpcl1apn
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2170/job/r9wl36cora62b9ja
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2171/job/j5iuctx7vj0baspr
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2182/job/mwvve52vgqoes017
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2189/job/jknp2i4rr58p4fjm
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2193/job/e4vi0f4hx0f9qip5
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2200/job/pljbofe6ba70l48u
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2205/job/o3wblaxjq6ss9e2y
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2210/job/e9sx2n9k0hgiscow
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2213/job/rvnwuy5ubp3ar29r
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2219/job/xjxebynask99d4r0
https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.2247/job/rufdccmy70le2oti