-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect NUMCPU command processing leading to CRASH #481
Comments
Clang 13 didn't work for me last time I tried it. But right now I can't remember the specifics. |
This is quite interesting! Test case Test case BUT YET... Looking at the actual Hercules log output, I'm seeing the following sequence: (Note: some lines removed for clarity; only the important lines are shown) 22:19:40 HHC01603I * Start of test file ../tests/CDSG.tst date in /Users/enrico/hercules.SDL/work/hyperion/tests 22:19:40 HHC01603I numcpu 2 22:19:40 HHC00100I Thread id 000000016e587000, prio 2, name 'Processor CP01' started 22:19:40 HHC00811I Processor CP01: architecture mode z/Arch 22:19:40 HHC02204I NUMCPU set to 2 22:19:40 HHC00811I Processor CP00: architecture mode ESA/390 22:19:40 HHC00811I Processor CP01: architecture mode ESA/390 22:19:40 HHC00811I Processor CP00: architecture mode z/Arch 22:19:40 HHC00811I Processor CP01: architecture mode z/Arch 22:19:40 HHC00809I Processor CP01: disabled wait state 0002000000000000 0000000000000000 22:19:40 HHC00809I Processor CP00: disabled wait state 0002000000000000 0000000000000000 22:19:40 HHC02334I Script 1: test: test ended 22:19:40 HHC02338I Script 1: test: actual duration: 0.028094 seconds 22:19:40 HHC01603I numcpu 1 # (reset back to default) 22:19:40 HHC02204I NUMCPU set to 2 22:19:40 HHC00101I Thread id 000000016e47b000, prio 2, name 'Processor CP00' ended 22:19:40 HHC01603I * Start of test file ../tests/CLCL-et-al.tst date in /Users/enrico/hercules.SDL/work/hyperion/tests 22:19:40 HHC00100I Thread id 000000016e693000, prio 2, name 'Processor CP00' started 22:19:40 HHC00811I Processor CP00: architecture mode ESA/390 22:19:40 HHC00811I Processor CP01: architecture mode ESA/390 22:19:40 HHC00101I Thread id 000000016e587000, prio 2, name 'Processor CP01' ended How the fuck did |
Enrico, Please try the following patch (also attached): --- hyperion-1/config.c 2022-03-27 07:43:20.649669000 -0700
+++ hyperion-0/config.c 2022-03-28 06:06:49.151999600 -0700
@@ -1200,7 +1200,8 @@
}
/* Make sure we did that right */
- ASSERT( sysblk.cpus == numcpu && numcpu <= sysblk.maxcpu );
+ while (sysblk.cpus != numcpu)
+ usleep(10000);
return 0;
} It's a workaround, not a fix (I was too lazy to pinpoint the exact location where the obvious race was), but it should hopefully prevent the problem from occurring. (I was unable to reproduce it, but I will keep trying.) Thanks. |
I ran it through the same macOS Big Sur 11.6.5, with xcode tools 13.2, but on x86. Similar-ish failure. See the end of the log. (Fish, note, this is without your patch) |
Fish, with the patch, it hangs in that new while loop.
|
I would prefer to see a gdb crash dump, not a mac crash file. Analysis of a gdb crash dumo (.core file) should tell us exactly where the crash occurred. Note that you need to build with "--enable-optimization=yes" (or, if you are specifying your own custom optimization flags, you will need to be sure to also include the Once we have a good gdb .core file, then we should be able to analyze it to determine exactly where the segfault occurred. See GitHub Issue #199 "How to analyze a Linux "core" file (i.e. crash dump)" for details. Thanks. |
Same thing to you too, Bill: since you're able to reproduce the crash too, having a gdb .core file that we can analyze will help tremendously with determining exactly where (and thus hopefully why as a result!) the crash occurred. But you need to first enable crash dumps on your system. Looking at your log, I'm seeing: 18:47:22 HHC00017W Crash dumps NOT enabled and: +++ OOPS! +++ Hercules has crashed! (Segmentation fault: 11) Crash dump NOT created due to ulimit setting... Refer to GitHub Issue #199 "How to analyze a Linux "core" file (i.e. crash dump)" for details regarding how to enable crash dumps (i.e. ".core" files). Thanks. |
FWIW those instructions need to be updated for macOS, and lldb, which would be the default for most Mac users.
And then lldb's commands are quite a bit different than gdb's. In any case, here's some output from my crash. When I turned off optimization to try to get better results in the traceback, then of course the problem went away.
You'll note:
Also, Fish, that Thread Sanitizer report I emailed you earlier today mentions data races on the Bill |
Fish, Not suggesting this is a fix, but it is a tidbit of additional info: the following patch makes it ( diff --git a/ipl.c b/ipl.c
index 0d837cfe..ec60ae7a 100644
--- a/ipl.c
+++ b/ipl.c
@@ -219,7 +219,9 @@ int ARCH_DEP( system_reset )( const int target_mode, const bool clear,
regs = sysblk.regs[ i ];
- if (regs->cpustate != CPUSTATE_STOPPED)
+ // WRL
+ // if (regs->cpustate != CPUSTATE_STOPPED)
+ if (regs == NULL || regs->cpustate != CPUSTATE_STOPPED)
{
wait = true; |
Okaaay... Could you do me a favor and edit a few of the existing comments and/or add some new comments to that GitHub Issue to document how one does it on a Mac please? Thanks! The bottom line is, in such situations (where Hercules is segfault-ing), we really, REALLY need to see a gdb (or lldb or whatever!) dump that shows us a backtrace that includes source files and line numbers of the thread that crashed, so we can see exactly where Hercules is crashing. And knowing where it's crashing, is usually enough to tell us why it's crashing too. |
Yeah, that should do it. As well as, I believe (hope!), the following tweaked version of my previous patch will too: --- hyperion-1/config.c 2022-03-27 07:43:20.649669000 -0700
+++ hyperion-0/config.c 2022-03-28 18:21:24.439913500 -0700
@@ -1199,9 +1199,6 @@
configure_cpu( cpu );
}
- /* Make sure we did that right */
- ASSERT( sysblk.cpus == numcpu && numcpu <= sysblk.maxcpu );
-
return 0;
}
@@ -1222,6 +1219,10 @@
}
RELEASE_INTLOCK( NULL );
+ /* Make sure we did that right */
+ while (sysblk.cpus != numcpu)
+ usleep(10000);
+
return rc;
}
@@ -1258,6 +1259,11 @@
resetAllCpuIds();
}
RELEASE_INTLOCK( NULL );
+
+ /* Make sure we did that right */
+ while (sysblk.maxcpu != maxcpu)
+ usleep(10000);
+
return rc;
}
My bug was checking (looping on) the value with intlock held. The check (loop) should be outside of holding intlock. Please let me know whether the above patch also fixes the problem or not. Thanks. |
Yes, I'll do that. |
That hangs after logging this:
|
hi Fish, Bill what I did as far the /cores permissions things are a bit murkier sudo chmod 777 /cores and on the mac-book x86_65 with Big Sur 11.6.5 and now the big why oh why on mac-mini arm64 with Big Sur 11.6.5 and ... no core dump was produced the two big sur installation are ONLY APPARENTLY similar mi guess is that some preferences were not overwritten btw ... an additional remark ( holds for apple ) enrico |
I have thus far been unwilling to update the Xcode-tools to 13 on my Mac M1 hardware. I had a hell of a time just getting a Python package installed. But yes, I've run Clang 14 on the x86 and it acts the same as 13 re: this issue. Bill |
The same results as ................... ? What? The same results as Bill's? Or the same results as the "hercules_2022-03-27-221940_enrico-macmini.crash.txt" file in your original crash_info.zip file attached to your original report? I need your clarification on this, because Bill's dump properly shows the filenames and line numbers in its lldb If the lldb of your /core files are not showing Hercules filenames and line numbers in its bt backtrace command, then it's probably because your Hercules was not built with the proper debugging information. For non-Apple builds, this is the -g -g3 and -ggdb3 optimization options that are normally automatically specified by the ./configure For Apple builds, I don't know what the proper options should be, since it uses lldb and not gdb. Does lldb need the same -g -g3 and -ggdb3 options? Or does it need slightly different build options to include source file names and line numbers in its debugging information? I don't know! But someone needs to find out and then document it (and add support for it in
And THAT is the system that is experiencing the problem? (the crash?) Correct? Damn! Well, we need to figure out why and get it fixed, because we NEED that dump so we can see where Hercules is crashing! But for now I am going to presume it is crashing in the same place as it is on Bill's system: on line 222 of ipl.c, in the
We're obviously going to need to dig deeper to determine precisely where/why our obvious race condition is occurring so we can develop a more proper fix instead of a workaround. (I hate workarounds! I'd rather get to the root of the problem and fix it than use a workaround. To me, workarounds are like bandaids: they only treat the symptom but not the cause.) |
FWIW, Hercules-Helper (which Enrico mentioned he was using) puts in the But I know absolutely nothing about the Apple crash logs, and/or if those ever have source line information. |
Fish, we could also re-title this issue to mention something along the lines of: "Crash with null pointer reference during NUMCPU command, on macOS with Clang 13 and 14". |
short summary/clarification: I did everything over from scratch on the mac-mini silicon my point was to show the different behavior on systems on the intel I got the dump and ( lldb ) bt showed the same line numbers as Bill' s On the silicon I got the crash with no dump, enrico PS for Bill |
Done! |
Thank you for the clarification, Enrico. Much appreciated. Please do let us know what Apple has to say about this. It is, after all, rather difficult to debug a crash when you don't know where the crash is occurring! |
Hi Bill, Fish unfortunately the situation is worse ... on apple I applied both patches and the make check hangs then I reverted Fish's patch 2 and at the same time I got the idea to test also on Fedora35 using clang on Fedora35 clang 13 ( the original ipl.c ) the test suite runs happily without crashing
the line number is differenty for Fedora 76260 ( might be to different environments ) just to let You know enrico PS |
Hi Fish a quick and dirty FreeBSD 13 with Clang 13, attempt gave back
is the issue related to the current one enrico |
Mistake! My config.c patch is BAD. It should NOT be used.
Because my config.c patch is bad. Please don't use it.
Good!
Interesting! Methinks we might now be getting a bit closer to the actual root of the problem! Enrico, if you still have it (or can you please recreate it if you don't), would you please attach the ".out" output file that the runtest (make check) should have created? I would very much like to see it! I'm beginning to suspect maybe(?) our problem might(?) stem from the following, which I'm seeing in Enrico's original problem report (file "allTests.out" in his "crash_info.zip" attachment): 22:19:19 HHC00112W Thread CPU Time is NOT available. Precisely how it might impact/trigger our existing problem I'm not quite sure yet, but it's something timing related, and the above |
I get this, too, on macOS x86 with Clang 14, and Fedora 35 with Clang 13.
|
Fish,
There are plenty of warning messages because Just keep in mind, though, the tests works fine with Clang 12 (and older Clangs on older macOS versions). |
here is the allTests.out |
pleas remember I got the same |
THANK YOU! This confirms our problem is the same one we've been chasing all along: There's a flaw (obvious race condition) in our Take a look at this: (Note: some lines not shown; only "interesting" lines are shown.) 21:19:00 HHC01603I * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 21:19:00 HHC01603I * 21:19:00 HHC01603I * Start of test file ../tests/runtest0.tst date in /Users/enrico/hercules.SDL/work/hyperion/tests 21:19:00 HHC01603I * [...] 21:19:00 HHC01603I *Testcase runtest0: all cpus 0 seconds 21:19:00 HHC01603I * ---------------------------------------------------------------------------- 21:19:00 HHC01603I * 21:19:00 HHC01603I numcpu 4 # Total CPUs needed for this test... 21:19:00 HHC00100I Thread id 000000016e753000, prio 2, name 'Processor CP01' started 21:19:00 HHC00811I Processor CP01: architecture mode z/Arch 21:19:00 HHC00100I Thread id 000000016e85f000, prio 2, name 'Processor CP02' started 21:19:00 HHC00811I Processor CP02: architecture mode z/Arch 21:19:00 HHC00100I Thread id 000000016e96b000, prio 2, name 'Processor CP03' started 21:19:00 HHC00811I Processor CP03: architecture mode z/Arch 21:19:00 HHC02204I NUMCPU set to 4 [...] 21:19:00 HHC01603I * Start the test and wait for completion... 21:19:00 HHC01603I * 21:19:00 HHC02336I Script 1: test: test starting 21:19:00 HHC02339I Script 1: test: duration limit: 0.800000 seconds 21:19:00 HHC02228I restart key pressed 21:19:00 HHC02333I Script 1: test: running... 21:19:00 HHC00814I Processor CP00: SIGP Set architecture mode (12) CP00, PARM 0000000000000001: CC 1 status 00000100 21:19:00 HHC00809I Processor CP00: disabled wait state 0002000180000000 0000000000000000 21:19:00 HHC00809I Processor CP01: disabled wait state 0002000180000000 0000000000000000 21:19:00 HHC00809I Processor CP02: disabled wait state 0002000180000000 0000000000000000 21:19:00 HHC00809I Processor CP03: disabled wait state 0002000180000000 0000000000000000 21:19:00 HHC02334I Script 1: test: test ended 21:19:00 HHC02338I Script 1: test: actual duration: 0.000369 seconds [...] 21:19:00 HHC01603I numcpu 1 # Clean up own mess 21:19:00 HHC02204I NUMCPU set to 4 21:19:00 HHC01603I * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 21:19:00 HHC01603I * 21:19:00 HHC01603I * Start of test file ../tests/runtest4.tst date in /Users/enrico/hercules.SDL/work/hyperion/tests 21:19:00 HHC01603I * [...] 21:19:00 HHC01603I *Testcase runtest4: all cpus 1-4 seconds 21:19:00 HHC00101I Thread id 000000016e647000, prio 2, name 'Processor CP00' ended 21:19:00 HHC01603I * ---------------------------------------------------------------------------- 21:19:00 HHC01603I * 21:19:00 HHC01603I numcpu 4 # Total CPUs needed for this test... 21:19:00 HHC02389E CPUs must be offline or stopped 21:19:00 HHC00007I Previous message from function 'numcpu_cmd' at hsccmd.c(3952) 21:19:00 HHC01603I * 21:19:00 HHC01603I * ---------------------------------------------------------------------------- 21:19:00 HHC01603I * 21:19:00 HHC01603I defsym secs0 1 # Loop duration in seconds for each CPU... 21:19:00 HHC01603I defsym secs1 2 21:19:00 HHC00101I Thread id 000000016e753000, prio 2, name 'Processor CP01' ended 21:19:00 HHC01603I defsym secs2 3 21:19:00 HHC00101I Thread id 000000016e96b000, prio 2, name 'Processor CP03' ended 21:19:00 HHC01603I defsym secs3 4 21:19:00 HHC00101I Thread id 000000016e85f000, prio 2, name 'Processor CP02' ended 21:19:00 HHC01603I * [...] 21:19:00 HHC01603I sysclear # Clear the world 21:19:00 HHC00100I Thread id 000000016ea77000, prio 2, name 'Processor CP00' started 21:19:00 HHC00811I Processor CP00: architecture mode z/Arch 21:19:00 HHC00811I Processor CP00: architecture mode ESA/390 21:19:00 HHC02311I sysclear completed 21:19:00 HHC01603I archmode z/Arch # Set z/Arch mode 21:19:00 HHC00811I Processor CP00: architecture mode z/Arch 21:19:00 HHC02204I ARCHLVL set to z/Arch [...] 21:19:00 HHC01603I * Start the test and wait for completion... 21:19:00 HHC01603I * 21:19:00 HHC02336I Script 1: test: test starting 21:19:00 HHC02339I Script 1: test: duration limit: 21.000000 seconds 21:19:00 HHC02228I restart key pressed 21:19:00 HHC02333I Script 1: test: running... 21:19:00 HHC00814I Processor CP00: SIGP Set architecture mode (12) CP00, PARM 0000000000000001: CC 1 status 00000100 21:19:00 HHC00814I Processor CP00: SIGP Restart (06) CP01, PARM 0000000000000000: CC 3 21:19:01 HHC00809I Processor CP00: disabled wait state 0002000180000000 0000000000000000 21:19:01 HHC02334I Script 1: test: test ended 21:19:01 HHC02338I Script 1: test: actual duration: 0.562810 seconds [...] 21:19:01 HHC01603I *Compare 21:19:01 HHC01603I r a00-a03 21:19:01 HHC02290I A:0000000000000000 K:06 21:19:01 HHC02290I R:0000000000000A00 FFF1F2F3 .123 21:19:01 HHC01603I *Want FFFFFFFF The setting of Then As a result, Our bug is in our configuring/deconfiguring of our CPU threads. They're not properly waiting for the actual requested cpu thread to begin or end (start or exit) before returning, resulting in a race condition when the next Let me take a closer look at our configure/deconfigure cpu logic to see if I can find where our bug is and how best to fix it. |
No, we're actually not. It's been the same bug all along. Even your patch to ipl.c's |
for VM management I use Vagrant and VirtualBox as you know for FreeBSD clang is the standard compiler suite usually on vagrant boxes the first thing I do is to update the system without your patch i got consistently the crash the first build failed with gazillions of errors about RTLD_NOW duplicated defines how did You get the '--without-included-ltdl' flag to configure ? e |
Enrico,
I installed this one:
Yes
I updated and upgraded I installed
Note, that
Please zip and send me the Bill |
(missing attachment from previous comment):
sorry Bill... I strongly disagree find attached some odd examples of runs |
* * * BREAKING NEWS! * * * "Heisenbug strike actually helped a programmer's bug hunt!" I added some debugging code ( BUT... when I went to test it (just to make sure I didn't have any programming errors before offering it to Enrico/Bill (my first test crashed due to a stupid oversight)), guess what? IT TRIPPED! (i.e. triggered, i.e. the bug struck!) Without my debugging code (i.e. with no changes to the current 'develop' git branch), I'm unable to reproduce the problem. WITH my debugging code however, I am able to reproduce the problem 100% of the time! But oddly enough, only for Release builds. With a Debug build I can never reproduce it. It always works fine. When I try the same thing with an optimized Release build however (which are much faster than unoptimized Debug builds), the problem occurs every single time. (Hint: you only need to run the "runtest0.tst" test. That should be good enough to trigger the bug. If the NOTE: I have not had time to analyze my That's all for now. |
BUG FOUND! Here's the fix:
Thanks! --- hyperion-1/config.c 2022-03-30 20:18:02.673086100 -0700
+++ hyperion-0/config.c 2022-03-30 20:21:18.203829500 -0700
@@ -1120,9 +1120,12 @@
int ourcpu;
bool arecpu = are_cpu_thread( &ourcpu );
- /* If we're NOT trying to deconfigure ourselves */
- if (target_cpu != ourcpu)
+ if (!arecpu || target_cpu != ourcpu) // NORMAL CASE
{
+ /* We're either not a CPU thread, or if we are,
+ we're not attempting to deconfigure ourself.
+ */
+
/* Deconfigure CPU */
sysblk.regs[ target_cpu ]->configured = 0;
sysblk.regs[ target_cpu ]->cpustate = CPUSTATE_STOPPING;
@@ -1143,6 +1146,7 @@
if (arecpu)
sysblk.regs[ ourcpu ]->intwait = false;
+ /* Wait for cpu_thread to completely exit */
join_thread( sysblk.cputid[ target_cpu ], NULL );
detach_thread( sysblk.cputid[ target_cpu ]);
@@ -1152,9 +1156,14 @@
/* post-processing that is done by various callers. */
/*-----------------------------------------------------------*/
}
- else
+ else // (arecpu && target_cpu == ourcpu) HIGHLY UNUSUAL!
{
- /* Else we ARE trying to deconfigure ourselves */
+ /* We ARE a cpu thread *AND* we're trying to deconfigure
+ ourself! This can only happen if B220 SERVC instruction
+ is executed to deconfigure its own CPU, or else the CPU
+ issues a Hercules command via the diagnose-8 interface
+ to deconfigure its own CPU (i.e. itself).
+ */
sysblk.regs[ target_cpu ]->configured = 0;
sysblk.regs[ target_cpu ]->cpustate = CPUSTATE_STOPPING;
ON_IC_INTERRUPT( sysblk.regs[ target_cpu ]);
@@ -1203,9 +1212,6 @@
configure_cpu( cpu );
}
- /* Make sure we did that right */
- ASSERT( sysblk.cpus == numcpu && numcpu <= sysblk.maxcpu );
-
return 0;
}
|
Hi again Fish, We are not completely out of the tunnel yet, but what I see is the light, not the train. I run the tests on Freebsd13 running in virtualbox vm. The name of the I ran the
much more often. My workaround was to hard code in the runtest script a (As a related issue, the runtest script should be fixed to properly process the passed args; John probably was aware of that because he hard coded the timeout value instead of doing a proper check for the argument presence.) With the workaround, I now get no Here is the file anyway: enrico |
Fish, For me, on macOS with Clang 13, and Fedora 35 with Clang 13, the commit before your fix, Bill P.S. My mood continues to deteriorate! |
Actually we ARE completely out of the tunnel! The original reported problem has been resolved. The patch worked. The What we are now experiencing are completely different unrelated problems, which we should be creating brand new GitHub Issues for:
I would suggest creating separate (new) GitHub Issues for each of them. I leave it up to you whether you want to create separate GitHub Issues for items #1 and #2 above, or whether you want to create only one GitHub Issue to resolve both problems since they're very likely very closely related. Item #3 above deserves a completely separate GitHub Issue to be created for it, which I will be doing as soon as I post this message and close THIS issue.
|
Fixed by commit 85bd55c. Closing. |
Actually I'm not surprised by that at all since that particular commit accidentally included part of the The following change: Lines 1018 to 1019 in da32975
should actually not have been included in commit bcd9131. Commit bcd9131 should have changed ONLY It's actually the combination of both: Lines 1018 to 1019 in da32975
and: Line 1123 in da32975
that together is what actually properly fixed our In your particular case (and in many other user's cases too more than likely), yes, it alone will likely fix the problem. But that's not a certainty. The test for
I'm very sorry to hear that. Is there anything I can do to help? |
Enrico, I followed a recipe I found on the FreeBSD website, and rebuilt from sources to upgrade/switch from Release to Stable. Took many hours. Everything appeared to work as they claimed. Now running Clang 13 on FreeBSD 13 STABLE. Fish, Tested the latest change/fix, and it does work for me on that system as well. Bill |
Thanks. That's good to hear. |
Also tested now on macOS Monterey 12.3.1 x86, with Apple's xcode-tools / Clang 13.1.6. Tests pass. |
Hello Fish, Bill,
This time I was able to collect a bit more information about the crash.
Environment:
Hercules developer branch at commit level 849e1eb.
Find attached a zip file with:
The script can be improved by parsing the arguments to run only the requested tests, but I was in a hurry to get some result and I just needed to run all the tests.
The two files produced by make check:
The two reports at the time of the crash
note that before the crash there was a peak of wakeups
Hope that the data is enough to provide some hints.
enrico
The text was updated successfully, but these errors were encountered: