Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in 370 mode IPL #538

Closed
jmaynard opened this issue Jan 19, 2023 · 27 comments
Closed

Segfault in 370 mode IPL #538

jmaynard opened this issue Jan 19, 2023 · 27 comments
Labels
BUG The issue describes likely incorrect product functionality that likely needs corrected. L Linux only issue, such as with tuntap networking that doesn't occur on Windows.

Comments

@jmaynard
Copy link
Contributor

This was discovered while trying to build and run the develop branch after #537 (comment) added the ADDFRR instruction, though subsequent testing determined it actually was introduced earlier.

Symptom: Attempting to IPL any system with ARCHMODE S/370 immediately segfaults. This only happens on Linux; Windows appears unaffected.

The failure appears in the run CPU loop. More information can be found in the pull request comments.

@Fish-Git
Copy link
Member

Fish-Git commented Jan 19, 2023

I'm getting nowhere.   :(

I can't seem to cause a core file to get created even though my ulimit is set to unlimited. And when I run my test and Hercules segfaults, the message is just "Segmentation fault", not the expected "Segmentation fault (core dumped)":

ipl 550
/home/fish/bin/herc: line 94:  2089 Segmentation fault      ${MYSUDO[@]} herclin -f fish-dosvs-550-hyperion-4.0.txt -o ${HOME}/hercules.log

So the core file is never getting created and I have no idea why.   :(

I tried running Herc from gdb instead, but I can't even get that to work either! It can't seem to even start it. It keeps saying:

(gdb) run -f ${HOME}/hercules/HERC/CFG/fish-dosvs-550-hyperion-4.0.txt -o ${HOME}/hercules.log
Starting program: /home/fish/hercules/hercules-0/.libs/hercules -f ${HOME}/hercules/HERC/CFG/fish-dosvs-550-hyperion-4.0.txt -o ${HOME}/hercules.log
/home/fish/hercules/hercules-0/.libs/hercules: error while loading shared libraries: libherc.so: cannot open shared object file: No such file or directory
[Inferior 1 (process 2342) exited with code 0177]
(gdb) 

This happens no matter what directory I start gdb from.   :(

I give up. My Linux skills are simply too lacking.   :(

@Fish-Git Fish-Git added BUG The issue describes likely incorrect product functionality that likely needs corrected. HELP! Help is needed from someone more experienced or I'm simply overloaded with too much work right now! L Linux only issue, such as with tuntap networking that doesn't occur on Windows. labels Jan 19, 2023
@wrljet
Copy link
Member

wrljet commented Jan 19, 2023

I do more tomorrow.

@wrljet
Copy link
Member

wrljet commented Jan 19, 2023

Jay, to enable core dumps on Pop!_OS (which is based on Ubuntu):

ulimit -S -c unlimited
sudo sysctl -w kernel.core_pattern="%s.core"

Fish, those commands also work to enable core files on Kubuntu.

Here's the output, on Kubuntu, with git checkout 52b5df6f, and -g -g3 -gdb3 -O0.

$ gdb .libs/hercules 11.core.96459

...

Core was generated by `/home/bill/herctest/hyperion/build/.libs/hercules -p .libs -f ../tests/tests.co'.
Program terminated with signal SIGSEGV, Segmentation fault.
--Type <RET> for more, q to quit, c to continue without paging--
#0  0x00007ff019f246ac in fetch_hw_noswap (ptr=0xf3ca52f0) at ../machdep.h:712
712         inline U16 fetch_hw_noswap(const void *ptr) {
[Current thread is 1 (Thread 0x7ff019078640 (LWP 96475))]
(gdb) backtrace
#0  0x00007ff019f246ac in fetch_hw_noswap (ptr=0xf3ca52f0) at ../machdep.h:712
#1  0x00007ff019cdc0ba in s370_run_cpu (cpu=0, oldregs=0x7ff0100c0000) at ../cpu.c:2032
#2  0x00007ff019cebd9e in cpu_thread (ptr=0x7ffe567d6efc) at ../cpu.c:2315
#3  0x00007ff019b0cb4f in hthread_func (arg2=0x555ff3696200) at ../hthreads.c:1059
#4  0x00007ff01992fb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5  0x00007ff0199c1a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) 

cpu.c:2032 is the UNROLLED_EXECUTE( current_opcode_table, regs ); as seen from Jay's original backtrace.

    for (i=0; i < MAX_CPU_LOOPS/2; i++)
    {
        UNROLLED_EXECUTE( current_opcode_table, regs );
        UNROLLED_EXECUTE( current_opcode_table, regs );
    }

@wrljet
Copy link
Member

wrljet commented Jan 19, 2023

OK, I've found the change that makes it fail. Why, I don't know yet.

52b5df6#diff-8c254e4d1ce0ab4fa655b87eadddfb19246ac185d0336b659c22702d9dd95a8f

The changes in feature.h to the various updated STORAGE_KEY_BYTEMASK definitions:

- #define STORAGE_KEY_BYTEMASK      0x00000FFF
+ #define STORAGE_KEY_BYTEMASK      STORAGE_KEY_4K_BYTEMASK

STORAGE_KEY_4K_BYTEMASK has a more complex definition than just a constant:

#define STORAGE_KEY_4K_BYTEMASK (~(((unsigned)(~0)) << SHIFT_4K))

@jmaynard
Copy link
Contributor Author

That evaluates to the same value, though:
((unsigned)(~0)) evaluates to 0xFFFFFFFF
... << SHIFT-4K shifts in 12 bits of zero, so 0xFFFFF000
~(...) turns it back into 0x00000FFF

@wrljet
Copy link
Member

wrljet commented Jan 19, 2023

Yeah, you'd think so. And it looks right in Compiler Explorer

I suspect the problem is when it's applied later with an & somewhere, and the width/type goofs something up.

Specifically it is feature.h line 703 that makes it crash.

- #define PAGEFRAME_BYTEMASK        0x000007FF
+ #define PAGEFRAME_BYTEMASK        STORAGE_KEY_2K_BYTEMASK

@wrljet
Copy link
Member

wrljet commented Jan 19, 2023

Works correctly when compiled with Clang 13!

(yesterday I was thinking, I bet Fish will get to exercise his dislike of gcc before we're done with this one)

@Fish-Git
Copy link
Member

Fish-Git commented Jan 19, 2023

Works correctly when compiled with Clang 13!

(yesterday I was thinking, I bet Fish will get to exercise his dislike of gcc before we're done with this one)

Do you really blame me? Fucks up with one version but works fine with a later one? What would you conclude? Obvious compiler bug.

Now I'm not saying other compilers don't also occasionally introduce bugs of their own, Microsoft's included. No. They obviously do. Humans are fallible creatures after all, and programming can be quite challenging at times. Especially for compiler writers.

But it just seems to me that we seem to stumble across these types of bugs WAY more frequently with gcc than we do with Microsoft's product, thereby forcing me to ask the obvious question "Why is that?".

And the only logical conclusion I can come to in my mind is that the gcc programmers are either: a) incompetent, or b) working with a poorly designed/written product, thereby causing otherwise competent and skilled programmers to fuck up much more frequently than otherwise (to our unfortunate detriment).

Linux is generally a fine product. Superior to Microsoft's in many ways.  (but not so much in others and vice-versa)

But their compiler? It truly sucks.

Oh sure! It's fine when it works! No question!

But it seems to break way more frequently than it should IMO. WAY more. And that should be cause for concern.

(And unlike Microsoft's, it's not exactly user friendly either IMHO)

But enough of that...

Do you want to do the honors and commit the fix, Bill, since you were the one that found/identified it after all (which I presume would be to simply change that one #define back to its original 0x000007FF value?)  Or would you like me to commit it since it was after all caused by my commit? Or do you, Jay, want to commit the fix, since it was you that discovered the bug?

@wrljet
Copy link
Member

wrljet commented Jan 19, 2023

Fish,

Do you really blame me? Fucks up with one version but works fine with a later one? What would you conclude? Obvious compiler bug.

No, can't blame you.

Do you want to do the honors and commit the fix, Bill, since you were the one that found/identified it after all (which I presume would be to simply change that one #define back to its original 0x000007FF value?) Or would you like me to commit it since it was after all caused by my commit? Or do you, Jay, want to commit the fix, since it was you that discovered the bug?

I'll do it. But I want to experiment a bit more first.

Bill

@jmaynard
Copy link
Contributor Author

jmaynard commented Jan 19, 2023 via email

@wrljet
Copy link
Member

wrljet commented Jan 19, 2023

Yeah, I wondered about that too.

@jmaynard
Copy link
Contributor Author

jmaynard commented Jan 19, 2023 via email

@Fish-Git
Copy link
Member

Interesting!  So basically:

-  #define STORAGE_KEY_2K_BYTEMASK       (~(((unsigned)(~0)) << SHIFT_2K))
+  #define STORAGE_KEY_2K_BYTEMASK       (~((~(unsigned) 0 ) << SHIFT_2K))

-  #define STORAGE_KEY_4K_BYTEMASK       (~(((unsigned)(~0)) << SHIFT_4K))
+  #define STORAGE_KEY_4K_BYTEMASK       (~((~(unsigned) 0 ) << SHIFT_4K))

Yes?

@wrljet
Copy link
Member

wrljet commented Jan 19, 2023

Doesn't fix it.

@jmaynard
Copy link
Contributor Author

jmaynard commented Jan 19, 2023 via email

@Fish-Git
Copy link
Member

Doesn't fix it.

You beat me to it! I was just going to try it. I guess I won't now.

So (gcc(?) clang(?) which one?) screws up either way? Yes? So we're forced to use 0x000007ff? Yes?

@jmaynard
Copy link
Contributor Author

jmaynard commented Jan 19, 2023 via email

@Fish-Git
Copy link
Member

I'd suggest ~0U, and if that doesn't fix it, then shrug and use 0x7ff.

That doesn't work either.  (SIGH!!)

So I guess we're stuck with 7ff.

@wrljet
Copy link
Member

wrljet commented Jan 20, 2023

Fish,

So (gcc(?) clang(?) which one?) screws up either way? Yes? So we're forced to use 0x000007ff? Yes?

Tested gcc 11.3.0 and 12.1.0. Both fick up, with everything other than the simply constant.

Clang 13.0.1 works with the original (~(((unsigned)(~0)) << SHIFT_4K)). (haven't tested other versions yet)

Bill

@wrljet
Copy link
Member

wrljet commented Jan 20, 2023

There is a way with git to update the commit that introduces the problem. But it's tricky and I don't want to risk wrecking things. So I will just check in a new commit at the head, and we'll accept that a small range of commits, in between, don't work.

OK?

@jmaynard
Copy link
Contributor Author

jmaynard commented Jan 20, 2023 via email

@Fish-Git
Copy link
Member

Fish-Git commented Jan 20, 2023

Bill said:

There is a way with git to update the commit that introduces the problem. But it's tricky and I don't want to risk wrecking things.

Please don't!

Bill said:

So I will just check in a new commit at the head, and we'll accept that a small range of commits, in between, don't work.

OK?

Yes please! Just do a normal commit to HEAD, just like always, just like you'd do any other commit.

Undoing an accidental commit to HEAD like what occurred when Jay's ADDFRR got merged into the wrong branch is one thing (it was actually fairly simple), but trying to undo/fix a commit that was made further down in the tree is something completely different and entirely too risky!

As Jay said:

That's the way development on stuff like this usually goes. Don't sweat it.

Quite right. This is the development branch after all. It's expected that things like this will occasionally happen. When it does, we just commit the fix for it and move on. It's wholly unimportant that a small range of commits "breaks" Hercules. The only thing that's important is the end result. Only the tip of the branch is important (i.e. the current state of the branch), not any previous state. The current (most recent) commit should (ideally) build cleanly and function correctly. If it doesn't and we somehow miss that (like we did with this current GitHub Issue), we simply commit the fix for it and move on. It has happened countless times before in the history of Hercules development and will likely happen again eventually. That's just the way software development works.

So PLEASE DON'T try to do whatever it was you were thinking of maybe doing, and just do a normal commit. It's safer and will fix the problem, which is all we're really interested in.

Thanks.

@wrljet
Copy link
Member

wrljet commented Jan 20, 2023

So PLEASE DON'T try to do whatever it was you were thinking of maybe doing, and just do a normal commit. It's safer and will fix the problem, which is all we're really interested in.

OK

I'm testing now on various other systems.

Just an update... The failing code also fails on Clang 13.1.6 on Mac M1 CPU. The correction works.
So I'm going to claim it's not an outright compiler bug. Or, if it is, it's a popular one.

I suspect we've just fallen afoul of C undefined behavior. And god knows where.

@Fish-Git
Copy link
Member

If you're curious however and want to experiment on whether what you were thinking of doing would work or not, feel free to give it a try, BUT NOT ON OUR LIVE RESPOSITORY!

We actually have two test (dummy) repositories for performing git experiments with:

(see https://github.com/SDL-Hercules-390).

These test repositories were expressly created for such experimenting. If you screw up and the repository gets fucked up, it's no big deal. We can always recreate them.

But NEVER do any experimenting on (anything risky to) our current production repository!

Cool?

@wrljet
Copy link
Member

wrljet commented Jan 20, 2023

If you're curious however and want to experiment on whether what you were thinking of doing would work or not, feel free to give it a try, BUT NOT ON OUR LIVE RESPOSITORY!

We actually have two test (dummy) repositories for performing git experiments with:

* https://github.com/SDL-Hercules-390/dummytest

* https://github.com/SDL-Hercules-390/dummytest-old

(see https://github.com/SDL-Hercules-390).

These test repositories were expressly created for such experimenting. If you screw up and the repository gets fucked up, it's no big deal. We can always recreate them.

But NEVER do any experimenting on (anything risky to) our current production repository!

Cool?

Of course.

wrljet added a commit that referenced this issue Jan 20, 2023
Changes in commit 52b5df6 exacerbated a compiler bug and/or C undefined behavior.
@wrljet
Copy link
Member

wrljet commented Jan 20, 2023

Hopefully addressed with commit 1159891

Tested on:

  • Debian 11 (forget which gcc), and Kubuntu 22.04 with gcc 11.3.0 and 12.1.0. As well as Clang 13.0.1.
  • macOS Ventura, x86_64, with Clang 14
  • macOS Monterey, Apple M1 silicon, with Clang 13 and 14.
  • Raspberry Pi 4B (PiOS Debian 11) with gcc 10.2.1.

@Fish-Git
Copy link
Member

Hopefully addressed with commit 1159891

Confirmed on Kubuntu 21.10 with:

  • gcc version: 11.2.0
  • clang version: 13.0.0-2

Thank you Bill!

Closing GitHub Issue.

@Fish-Git Fish-Git removed the HELP! Help is needed from someone more experienced or I'm simply overloaded with too much work right now! label Jan 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BUG The issue describes likely incorrect product functionality that likely needs corrected. L Linux only issue, such as with tuntap networking that doesn't occur on Windows.
Projects
None yet
Development

No branches or pull requests

3 participants