-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hercules Segmentation fault during z/VM start up when QETH layer3 is opened #239
Comments
Hi Florian! I'm very sorry your Hercules is crashing. It probably has something to do with my recent commit that attempted to fix a hercifc crash that Ian Shorter reported which he said did indeed fix his problem. My fix for Ian might have only moved the problem to a different place? If you can (PLEASE!), please review github issue #199 "How to analyze a Linux "core" file (i.e. crash dump)" and then do a gdb crash dump analysis of your crash dump and then post the results here. Hopefully it's something simple and obvious that I can get fixed for you right away (me: fingers crossed). Thanks. (and sorry about this!) |
fbi-ranger, Fish I am using Hercules version 4.3.0.0-SDL-g1f8fa07-modified (4.3.0.0) on Ubuntu 16.04 with z/VM 6.4 and Layer 3 and TCPIP has been working just fine for me (if a bit slow, but that's been like this for a long time). In fact, with this new SDL version it's the first time my z/VM and layer 3 stays up more than a few minutes. So, I am confused as to what is different here? thanks M |
The interesting thing is that the core file was named core.dev EC02 thrd.1486, which is the last device of the QETH set (EC00-EC02). Here is the output from gdb:
Hope this helps you. |
Thank you. I'll take a look at it as soon as I can and will get back to you. |
This is a strong clue I think. It looks like the startup messages that Let me perform some experiments to see if maybe I can recreate the crash on my own system. If I can, then I should be able to devise a fix fairly quickly. Otherwise I'm just going to have to take an educated guess. Hang loose, Florian! I'm working on it! P.S. In the mean time, what you can do that would be a BIG help to me would be (if you can easily/reliably reproduce (cause) the crash on demand, i.e. each and every time whenever you want) is the same thing that Ian did in Issue #236:
and then, when the crash occurs:
That resulting gdb display might hopefully provide better information regarding exactly where the crash is occurring (and thus hopefully why). Thanks. (*) Unfortunately the crash dump analysis you posted wasn't as much help as I was hoping it would be. Looking at it, I'm unable to see where the crash is occurring! Maybe someone more experienced with gdb's output might be able to make more sense of it? |
Here the gdb where:
(gdb) where
(gdb) where full:
|
Okay, this crash doesn't make any sense to me! According to #0 0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6 #1 0x00007ffff58b028e in strdup () from /lib64/libc.so.6 #2 0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150 0 0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6 No symbol table info available. #1 0x00007ffff58b028e in strdup () from /lib64/libc.so.6 No symbol table info available. #2 0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150 ttmtu = uMTU = i = rc = buf = "}\000\000\000\377\177\000\000Y\364\332\366\377\177\000\000xPK\001\000\000\000\000pPK\001\000\000\000\000xPK\001\000\000\000\000\000 \220\001\000\000\000\000\020PK\001\000\000\000\000\367\005z\337\377\177\000" i = rc = buf = ttmtu = uMTU = buf = Lines 1147 to 1159 in 4998c65
Florian? I'm going to need some detailed information regarding how you have your system setup/configured: your Hercules configuration file, your z/VM System Configuration file, and your z/OS TCPIP profile (and VTAM TRLE for your OSAs). I have z/VM 6.3 and am able to run z/OS 2.3 under z/VM in Layer 3 mode just fine. My Hercules has two OSAs (QETH devices) defined in Hercules, my z/VM System Config has two OSAs defined, and one VSWITCH defined for the second OSA. My z/VM runs in Layer 2 mode (using the first OSA) and the VSWITCH uses the second OSA. My z/OS has two OSAs defined too, which both of which are using the the second real Hercules OSA due to z/VM (but z/OS thinks it has 2 OSAs thanks to z/VM). Everything works fine for me! So I'm going to need to see how you have your system configured (Hercules, z/VM and z/OS) to see if maybe you can accomplish the same thing differently. Maybe the manner in which you are doing things is not supported by Hercules? I don't know! I'm lost on this one and need some help from someone who knows Linux better than me. HELP! |
Fish, The configuration of Hercules, z/VM and z/OS didn't change at all. It is still as described it in #230. Here the z/VM cannot handle the QETH devices for the vswitch. So there is no z/OS etc. involved at all. z/VM starts two VMs (DTCVSW1 and DTCVSW2) for controlling the vswitch(es). In my configuration UA EC00-EC02 is for vswitch VMINTRA1. When DTCVSW2 tries to engage EC00-EC02 the segfault occurs. This is very early in the z/VM startup process and has nothing to do with a guest OS. There is no difference in the configuration. The Hercules version is 4.3 where the crash happens while in 4.2 it works correctly. NO change of config files on any of the involved components. When I start a z/VM 6.3 it does exactly the same crash: (gdb) where #0 0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6 #1 0x00007ffff58b028e in strdup () from /lib64/libc.so.6 #2 0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150 #3 osa_adapter_cmd (dev=dev@entry=0x1917000, req_th=req_th@entry=0x7fffdf684000, datalen=datalen@entry=103) at qeth.c:1358 #4 0x00007fffdf7a8722 in qeth_execute_ccw (dev=0x1917000, code=, flags=, chained=, count=103, prevcode=, ccwseq=0, iobuf=0x7fffdf684000 "", more=0x7fffdf681563 "", unitstat=0x7fffdf681561 "", residual=0x7fffdf681564) at qeth.c:4617 #5 0x00007ffff75730e1 in z900_execute_ccw_chain (arg=arg@entry=0x1917000) at channel.c:5102 #6 0x00007ffff7575c78 in call_execute_ccw_chain (arch_mode=, pDevBlk=pDevBlk@entry=0x1917000) at channel.c:6188 #7 0x00007ffff757ff15 in device_thread (arg=) at channel.c:2440 #8 0x00007ffff6daec32 in hthread_func (arg2=0x7fffe8011650) at hthreads.c:798 #9 0x00007ffff5be9569 in start_thread () from /lib64/libpthread.so.0 #10 0x00007ffff5920a2f in clone () from /lib64/libc.so.6 (gdb) where full #0 0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6 No symbol table info available. #1 0x00007ffff58b028e in strdup () from /lib64/libc.so.6 No symbol table info available. #2 0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150 ttmtu = uMTU = i = rc = buf = "}\000\000\000\377\177\000\000Y\364\332\366\377\177\000\000xPK\001\000\000\000\000pPK\001\000\000\000\000xPK\001\000\000\000\000\000 \220\001\000\000\000\000\020PK\001\000\000\000\000\367\005z\337\377\177\000" i = rc = buf = ttmtu = uMTU = buf = #3 osa_adapter_cmd (dev=dev@entry=0x1917000, req_th=req_th@entry=0x7fffdf684000, datalen=datalen@entry=103) at qeth.c:1358 req_puk = 0x7fffdf684038 grp = 0x14b5010 rsp_bhr = req_rrh = offrrh = offph = __FUNCTION__ = "osa_adapter_cmd" #4 0x00007fffdf7a8722 in qeth_execute_ccw (dev=0x1917000, code=, flags=, chained=, count=103, prevcode=, ccwseq=0, iobuf=0x7fffdf684000 "", more=0x7fffdf681563 "", unitstat=0x7fffdf681561 "", residual=0x7fffdf681564) at qeth.c:4617 datalen = 103 length = 103 first4 = contentstring = "RRH_TYPE_ULP: PUK_TYPE_ENABLE (ULP_ENABLE)", '\000' grp = 0x14b5010 num = __FUNCTION__ = "qeth_execute_ccw" #5 0x00007ffff75730e1 in z900_execute_ccw_chain (arg=arg@entry=0x1917000) at channel.c:5102 dev = 0x1917000 iobuf = 0x7fffdf683000 ccwaddr = 2147136224 ticaddr = 0 ticback = 0 idapmask = idawfmt = ccwfmt = ccwkey = opcode = flags = addr = mbaddr = mbk = mbcount = count = 103 ccw = unitstat = 0 '\000' chanstat = 0 '\000' residual = 0 more = 0 '\000' chain = 0 '\000' tracethis = firstccw = 0 '\000' area = '\000' bufpos = 0 skip_ccws = cmdretry = prevccwaddr = prefetch_remaining = ps = 1 more=0x7fffdf681563 "", unitstat=0x7fffdf681561 "", residual=0x7fffdf681564) at qeth.c:4617 datalen = 103 length = 103 first4 = contentstring = "RRH_TYPE_ULP: PUK_TYPE_ENABLE (ULP_ENABLE)", '\000' grp = 0x14b5010 num = __FUNCTION__ = "qeth_execute_ccw" #5 0x00007ffff75730e1 in z900_execute_ccw_chain (arg=arg@entry=0x1917000) at channel.c:5102 dev = 0x1917000 iobuf = 0x7fffdf683000 ccwaddr = 2147136224 ticaddr = 0 ticback = 0 idapmask = idawfmt = ccwfmt = ccwkey = opcode = flags = addr = mbaddr = mbk = mbcount = count = 103 ccw = unitstat = 0 '\000' chanstat = 0 '\000' residual = 0 more = 0 '\000' chain = 0 '\000' tracethis = firstccw = 0 '\000' area = '\000' bufpos = 0 skip_ccws = cmdretry = prevccwaddr = prefetch_remaining = ps = 1 ts = prefetch = {seq = 2, pos = 103, reqcount = 103, prevcode = 0 '\000', opcode = 1 '\001', reserved = 0, chanstat = '\000' , ccwaddr = {2147136224, 2147136224, 0 }, datalen = {0, 103, 0 }, dataaddr = {2146975832, 2112405584, 0 }, ccwflags = "$", '\000' , ccwcount = { 103, 0 }, idawaddr = {0, 2146975832, 0 }, idawtype = "\000\002", '\000' , idawflag = '\000' } iobuf_initial = __FUNCTION__ = "z900_execute_ccw_chain" #6 0x00007ffff7575c78 in call_execute_ccw_chain (arch_mode=, pDevBlk=pDevBlk@entry=0x1917000) at channel.c:6188 No locals. #7 0x00007ffff757ff15 in device_thread (arg=) at channel.c:2440 dev = 0x1917000 current_priority = 1 rc = waitcount = 0 #8 0x00007ffff6daec32 in hthread_func (arg2=0x7fffe8011650) at hthreads.c:798 pfn = 0x7ffff757fc90 arg = 0x0 name = 0x7ffff7842890 "idle device thread" tid = 140736941614848 rc = #9 0x00007ffff5be9569 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #10 0x00007ffff5920a2f in clone () from /lib64/libc.so.6 No symbol table info available. |
This isn't a QETH specific problem, it also affects CTCI and LCS. The On my host system the QETH and CTCI overwrites cause a segmentation fault. The LCS overwrite doesn't. All of the variables are setup by function I also see a message: That's as far as I've got for the moment, and I have to stop now. Hopefully will be able to continue later. |
Wow. That's great info, Ian! And looking at the I 'm seeing the following code in Lines 270 to 275 in fad967f
Lines 287 to 291 in fad967f
The The
and:
Florian? Do me a favor and try rebuilding Hercules after applying the following patch: --- hyperion-1/tuntap.c 2019-08-03 15:41:15.683714400 -0700
+++ hyperion-0/tuntap.c 2019-08-20 07:38:34.711798400 -0700
@@ -270,7 +270,9 @@
return -1;
}
- strcpy( pszNetDevName, hifr.hifr_name );
+ // N.B. strlcpy used to prevent buffer overflow
+ // and to ensure null termination.
+ strlcpy( pszNetDevName, hifr.hifr_name, IFNAMSIZ );
}
#if !defined( OPTION_W32_CTCI )
else
@@ -284,10 +286,14 @@
// This should work on OS X with Christoph Pfisterer's TUN driver,
// since it does set the device name to the basename of the
// file. -- JRM
- char *p = strrchr( pszTUNDevice, '/' );
+ char* p = strrchr( pszTUNDevice, '/' );
- if( p )
- strncpy( pszNetDevName, ++p, IFNAMSIZ );
+ if (p)
+ {
+ // N.B. strlcpy used to prevent buffer overflow
+ // and to ensure null termination.
+ strlcpy( pszNetDevName, p+1, IFNAMSIZ );
+ }
else
{
// "Invalid TUN/TAP device name %s" If the above patch fixes your problem then I'll go ahead and commit the fix to the repository. Otherwise if it doesn't resolve your problem, then we'll need to continue scratching our heads. Thanks. |
Fish, Unfortunately this fix causes more troubles:
System dumps and re-IPLs indefinitely.
|
Notice the strings highlighted string in the above messages: "cules Network I". This string is supposed to be the name of the QETH device's tuntap device! (i.e. the value that you specified for the HOW THE F**K is it getting set to "cules Network I"?! What's weird is the "cules Network I" just so happens to be characters 3-17 of the string "Hercules Network Interface Configuration Program", which, if you include a null terminator into the count, just so happens to be 15 bytes long, which is the exact size of the constant So like Ian said, some type of buffer overflow is occurring somewhere, but where it is occurring we don't know yet! |
Florian, (and Ian), Here's another patch for you to try: Don't forget to remove the previous patch I gave you before applying this new one. Thanks. |
Fish, The definition of the QETH device in hercules.cnf is EC00.3 QETH CHPID EC IPADDR 10.10.0.190 NETMASK 255.255.255.0 So no iface parameter there. Seems to be an default value and the variable may be badly aligned. I will apply try2. |
Note : The "CHPID EC" is probably redundant (but it shouldn't break things) |
Here are some compiler warnings from patched msgenu.h: shared.c: In function 'serverConnect': qeth.c:6490:19: note: directive argument in the range [0, 2147483647] |
Fish, Sorry but this 2try does not help. However here is the log: HHC00901I 0:EC02 QETH: Interface cules Network I, type TUN opened + It looks indeed for an interface cules Network I which cannot be found of course. I have no idea where this comes from. |
In 4.2 version no iface needed to be specified in the Hercules config file and it worked correctly with the default. Seems that the defaults are not set correctly any more. In 4.3. there is a wrong storage address taken for the interface name, cules Network I, which is an documentation string in hercifc. hercifc seems to called with a broken parameter list??? |
Florian, (and others), I still don't know what's going on, BUT... Since your original report was a Segmentation fault and your issue has something to with networking (i.e. ... Let's try this: --- hyperion-1/hercifc.c 2019-08-15 16:42:45.794199200 -0700
+++ hyperion-0/hercifc.c 2019-08-21 09:21:02.871505700 -0700
@@ -26,9 +26,6 @@
#include "hercules.h"
#include "hercifc.h"
-#define UTILITY_NAME "hercifc"
-#define UTILITY_DESC "Hercules Network Interface Configuration Program"
-
// --------------------------------------------------------------------
// HERCIFC program entry point
// --------------------------------------------------------------------
@@ -51,7 +48,7 @@
UNREFERENCED( argc );
- INITIALIZE_UTILITY( UTILITY_NAME, UTILITY_DESC, &pszProgName );
+ hthreads_internal_init();
DROP_PRIVILEGES(CAP_NET_ADMIN);
Maybe we'll get lucky and that will fix your problem while also fixing Ian's problem too but in a much more simplistic way? <shrug> |
Here's what's happening.
The fix fo #236 was to make The
When Hercules needs to create a network interface it calls function The I'm slightly(!) confused by
Plus, since The solution is to modify the function |
Gentlemen, It works now. I could start z/VM correctly as it was before. The vswitch is correctly defined and the corresponding tun1 is opened.
Great! Thanks to your analysis. BTW: I tried also the modification proposed by Fish. This works as well. |
Ah-HA! So that's what's happening! THANK YOU, Ian! I completely missed the
You're not alone!
hercifc confuses the hell out of me too! It's whole design is screwy. Sending binary requests and responses on text streams?! (stdin and stdout?!) That's crazy! There are far better IPC mechanisms available than using stdin/stdout for crying out loud! (Sheesh!) And hercifc is a security hole too as I recall. I believe John Hartmann pointed that out a long time ago, which is why he came up with pre-configured interfaces, which makes much more sense to me. (And I completely missed the
OR...
I personally recommend #2 (get rid of hercifc altogether), but since this is a purely Linux issue my recommendations don't count for much. I'll let you guys decide how best to straighten out this whole hercifc mess. Thank you, Ian, for figuring this mess out. MUCH appreciated buddy. AND ... as I said, since it looks like you know what's going on and since this is a Linux issue (for which I'm not really qualified to deal with), I'm going to let you (or Ivan, etc) figure out how best to fix things and commit the changes. I'm going to go ahead and re-assign this GitHub issue over to you if that's okay. Thanks again, Ian. This puppy is all yours now! |
Wow. THAT was easy! But I'm still going to let you guys (Ian and Ivan, etc) design and implement (and_commit!) the actual fix based on whatever you decide is the best way to do it. I'm out of here! It looks like you guys have things well in hand. |
Ian? Ivan? Is commit 113c3ec the final/official fix for this issue? Or is there more work you still want/need to do? If there are no more changes either of you are planning to make then we should go ahead and remove "In Progress" status and officially Close this GitHub Issue. Thanks. |
For the moment that is the final fix. However, as it stands there can be a lot of version, copyright, etc messages coming from hercifc (one set for each call to TUNTAP_CreateInterface). My intention, once things have settled, is to revisit function initialize_utility and remove the call to function display_version for hercifc. |
During startup of z/VM, after IPL completed, while AUTOLOG2 starts the VM machines, a Segmentation fault happens and Hercules crashes when opening QETH: /dev/tun
The text was updated successfully, but these errors were encountered: