Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hercules Segmentation fault during z/VM start up when QETH layer3 is opened #239

Closed
fbi-ranger opened this issue Aug 17, 2019 · 25 comments
Closed
Assignees
Labels
BUG The issue describes likely incorrect product functionality that likely needs corrected. L Linux only issue, such as with tuntap networking that doesn't occur on Windows.

Comments

@fbi-ranger
Copy link

fbi-ranger commented Aug 17, 2019

Hercules version 4.3.0.0-SDL-g1f8fa07f-modified (4.3.0.0)
Host OS openSUSE 15.0
Guest OS z/VM 6.4

During startup of z/VM, after IPL completed, while AUTOLOG2 starts the VM machines, a Segmentation fault happens and Hercules crashes when opening QETH: /dev/tun

09:05:23 HHC03800I 0:EC02 QETH: Adapter mode set to Layer 3
09:05:23 HHC00901I 0:EC02 QETH: Interface cules Network Interface Configuration Program - version 4.3.0.0-SDL-g1f8^PS^AÌf^?, type TUN opened
09:05:23

    +++ OOPS! +++

Hercules has crashed! (Segmentation fault)
Creating crash dump... This may take a while...
/usr/local/bin/iml-hercules: line 3: 25201 Segmentation fault      (core dumped) /local/sys1/$VERSION/hercules/bin/hercules -f /local/sys1/$VERSION/etc/hercules.cnf
@Fish-Git Fish-Git self-assigned this Aug 17, 2019
@Fish-Git Fish-Git added the BUG The issue describes likely incorrect product functionality that likely needs corrected. label Aug 17, 2019
@Fish-Git
Copy link
Member

Hi Florian!

I'm very sorry your Hercules is crashing. It probably has something to do with my recent commit that attempted to fix a hercifc crash that Ian Shorter reported which he said did indeed fix his problem. My fix for Ian might have only moved the problem to a different place?

If you can (PLEASE!), please review github issue #199 "How to analyze a Linux "core" file (i.e. crash dump)" and then do a gdb crash dump analysis of your crash dump and then post the results here. Hopefully it's something simple and obvious that I can get fixed for you right away (me: fingers crossed).

Thanks.

(and sorry about this!)

@Fish-Git Fish-Git added IN PROGRESS... I'm working on it! (Or someone else is!) QUESTION... A question was asked but has not been answered yet, -OR- additional feedback is requested. labels Aug 17, 2019
@moshix
Copy link

moshix commented Aug 17, 2019

fbi-ranger, Fish

I am using Hercules version 4.3.0.0-SDL-g1f8fa07-modified (4.3.0.0) on Ubuntu 16.04 with z/VM 6.4 and Layer 3 and TCPIP has been working just fine for me (if a bit slow, but that's been like this for a long time).

In fact, with this new SDL version it's the first time my z/VM and layer 3 stays up more than a few minutes.

So, I am confused as to what is different here?

thanks

M

@Fish-Git Fish-Git changed the title Hercules segmentation fault during z/VM start up when QETH layer3 is opened Hercules Segmentation fault during z/VM start up when QETH layer3 is opened Aug 17, 2019
@fbi-ranger
Copy link
Author

fbi-ranger commented Aug 18, 2019

The interesting thing is that the core file was named core.dev EC02 thrd.1486, which is the last device of the QETH set (EC00-EC02).

Here is the output from gdb:

Thread 14 (Thread 0x7f6be559c700 (LWP 1493)):
#0  0x00007f6be62d1053 in select () from /lib64/libc.so.6
#1  0x00007f6be813bcdc in http_server (arg=<optimized out>) at httpserv.c:944
#2  0x00007f6be7767c32 in hthread_func (arg2=0x13ea980) at hthreads.c:798
#3  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f6bcfefe700 (LWP 1506)):
#0  0x00007f6be65a8cc6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6be77692be in hthread_timed_wait_condition (plc=plc@entry=0x7f6be7749a10 <sysblk+2192>, 
    plk=plk@entry=0x7f6be7749a08 <sysblk+2184>, tm=tm@entry=0x7f6bcfefde20, 
    location=0x7f6be81fc328 "channel.c:2474") at hthreads.c:703
#2  0x00007f6be7765889 in timed_wait_condition_relative_usecs_impl (
    pCOND=pCOND@entry=0x7f6be7749a10 <sysblk+2192>, pLOCK=pLOCK@entry=0x7f6be7749a08 <sysblk+2184>, 
    usecs=usecs@entry=100000, pTV=<optimized out>, pTV@entry=0x0, 
    loc=loc@entry=0x7f6be81fc328 "channel.c:2474") at hscutl.c:699
#3  0x00007f6be7f38db4 in device_thread (arg=<optimized out>) at channel.c:2471
#4  0x00007f6be7767c32 in hthread_func (arg2=0x7f6bd8011a60) at hthreads.c:798
#5  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f6be579e700 (LWP 1489)):
#0  0x00007f6be62a70f0 in nanosleep () from /lib64/libc.so.6
#1  0x00007f6be62d19c4 in usleep () from /lib64/libc.so.6
#2  0x00007f6be81c764b in timer_thread (argp=<optimized out>) at timer.c:277
#3  0x00007f6be7767c32 in hthread_func (arg2=0x13e7ec0) at hthreads.c:798
#4  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f6be569d700 (LWP 1492)):
#0  0x00007f6be65a890d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6be77691e4 in hthread_wait_condition (plc=plc@entry=0x7f69c4001998, 
    plk=plk@entry=0x7f6be7749770 <sysblk+1520>, location=location@entry=0x7f6be820ca13 "cpu.c:1513")
    at hthreads.c:684
#2  0x00007f6be7f91c5e in CPU_Wait (regs=regs@entry=0x7f69c4001000) at cpu.c:1513
#3  0x00007f6be7f9a066 in z900_process_interrupt (regs=regs@entry=0x7f69c4001000) at cpu.c:1680
#4  0x00007f6be7f9c25a in z900_run_cpu (cpu=<optimized out>, oldregs=<optimized out>) at cpu.c:1830
#5  0x00007f6be7f91468 in cpu_thread (ptr=<optimized out>) at cpu.c:1307
#6  0x00007f6be7767c32 in hthread_func (arg2=0x13e8360) at hthreads.c:798
#7  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f6be474b700 (LWP 1496)):
#0  0x00007f6be65a8cc6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6be77692be in hthread_timed_wait_condition (plc=plc@entry=0x142c980, 
    plk=plk@entry=0x142c978, tm=tm@entry=0x7f6be474a560, location=0x7f6be47630e7 "ctc_lcs.c:2015")
    at hthreads.c:703
#2  0x00007f6be7765889 in timed_wait_condition_relative_usecs_impl (pCOND=pCOND@entry=0x142c980, 
    pLOCK=pLOCK@entry=0x142c978, usecs=usecs@entry=250000, pTV=<optimized out>, pTV@entry=0x0, 
   try=0x7f6be47630e7 "ctc_lcs.c:2015") at hscutl.c:699
#3  0x00007f6be474fc5e in LCS_PortThread (arg=0x142c850) at ctc_lcs.c:2009
#4  0x00007f6be7767c32 in hthread_func (arg2=0x14fee10) at hthreads.c:798
#5  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f6be420e700 (LWP 1503)):
#0  0x00007f6be65a8cc6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6be77692be in hthread_timed_wait_condition (plc=plc@entry=0x7f6be7749a10 <sysblk+2192>, 
    plk=plk@entry=0x7f6be7749a08 <sysblk+2184>, tm=tm@entry=0x7f6be420de20, 
    location=0x7f6be81fc328 "channel.c:2474") at hthreads.c:703
#2  0x00007f6be7765889 in timed_wait_condition_relative_usecs_impl (
    pCOND=pCOND@entry=0x7f6be7749a10 <sysblk+2192>, pLOCK=pLOCK@entry=0x7f6be7749a08 <sysblk+2184>, 
    usecs=usecs@entry=100000, pTV=<optimized out>, pTV@entry=0x0, 
    loc=loc@entry=0x7f6be81fc328 "channel.c:2474") at hscutl.c:699
#3  0x00007f6be7f38db4 in device_thread (arg=<optimized out>) at channel.c:2471
#4  0x00007f6be7767c32 in hthread_func (arg2=0x7f6bd8001f20) at hthreads.c:798
#5  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f6be5288700 (LWP 1494)):
#0  0x00007f6be62d1173 in pselect () from /lib64/libc.so.6
#1  0x00007f6be528f556 in console_connection_handler (arg=<optimized out>) at console.c:3484
#2  0x00007f6be7767c32 in hthread_func (arg2=0x13f3100) at hthreads.c:798
#3  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f6be5aa2700 (LWP 1490)):
#0  0x00007f6be65a890d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6be77691e4 in hthread_wait_condition (plc=plc@entry=0x7f6bd0001998, 
    plk=plk@entry=0x7f6be7749770 <sysblk+1520>, location=location@entry=0x7f6be820ca13 "cpu.c:1513")
    at hthreads.c:684
#2  0x00007f6be7f91c5e in CPU_Wait (regs=regs@entry=0x7f6bd0001000) at cpu.c:1513
#3  0x00007f6be7f9a066 in z900_process_interrupt (regs=regs@entry=0x7f6bd0001000) at cpu.c:1680
#4  0x00007f6be7f9c25a in z900_run_cpu (cpu=<optimized out>, oldregs=<optimized out>) at cpu.c:1830
#5  0x00007f6be7f91468 in cpu_thread (ptr=<optimized out>) at cpu.c:1307
#6  0x00007f6be7767c32 in hthread_func (arg2=0x13e81e0) at hthreads.c:798
#7  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f6be59a1700 (LWP 1491)):
#0  0x00007f6be65a890d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6be77691e4 in hthread_wait_condition (plc=plc@entry=0x7f6bd4001998, 
    plk=plk@entry=0x7f6be7749770 <sysblk+1520>, location=location@entry=0x7f6be820ca13 "cpu.c:1513")
    at hthreads.c:684
#2  0x00007f6be7f91c5e in CPU_Wait (regs=regs@entry=0x7f6bd4001000) at cpu.c:1513
#3  0x00007f6be7f9a066 in z900_process_interrupt (regs=regs@entry=0x7f6bd4001000) at cpu.c:1680
#4  0x00007f6be7f9c25a in z900_run_cpu (cpu=<optimized out>, oldregs=<optimized out>) at cpu.c:1830
#5  0x00007f6be7f91468 in cpu_thread (ptr=<optimized out>) at cpu.c:1307
#6  0x00007f6be7767c32 in hthread_func (arg2=0x13d6be0) at hthreads.c:798
#8  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f6be410d700 (LWP 1504)):
#0  0x00007f6be65a8cc6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6be77692be in hthread_timed_wait_condition (plc=plc@entry=0x7f6be7749a10 <sysblk+2192>, 
    plk=plk@entry=0x7f6be7749a08 <sysblk+2184>, tm=tm@entry=0x7f6be410ce20, 
    location=0x7f6be81fc328 "channel.c:2474") at hthreads.c:703
#2  0x00007f6be7765889 in timed_wait_condition_relative_usecs_impl (
    pCOND=pCOND@entry=0x7f6be7749a10 <sysblk+2192>, pLOCK=pLOCK@entry=0x7f6be7749a08 <sysblk+2184>, 
    usecs=usecs@entry=100000, pTV=<optimized out>, pTV@entry=0x0, 
    loc=loc@entry=0x7f6be81fc328 "channel.c:2474") at hscutl.c:699
#3  0x00007f6be7f38db4 in device_thread (arg=<optimized out>) at channel.c:2471
#4  0x00007f6be7767c32 in hthread_func (arg2=0x7f6bd80117c0) at hthreads.c:798
#5  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f6be589f700 (LWP 1488)):
#0  0x00007f6be65a890d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6be77691e4 in hthread_wait_condition (plc=plc@entry=0x7f6bd8002998, 
    plk=plk@entry=0x7f6be7749770 <sysblk+1520>, location=location@entry=0x7f6be820ca13 "cpu.c:1513")
    at hthreads.c:684
#2  0x00007f6be7f91c5e in CPU_Wait (regs=regs@entry=0x7f6bd8002000) at cpu.c:1513
#3  0x00007f6be7f9a066 in z900_process_interrupt (regs=regs@entry=0x7f6bd8002000) at cpu.c:1680
#4  0x00007f6be7f9c25a in z900_run_cpu (cpu=<optimized out>, oldregs=<optimized out>) at cpu.c:1830
#5  0x00007f6be7f91468 in cpu_thread (ptr=<optimized out>) at cpu.c:1307
#6  0x00007f6be7767c32 in hthread_func (arg2=0x13e7ec0) at hthreads.c:798
#7  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f6be61e0700 (LWP 1487)):
#0  0x00007f6be62ca4d5 in __xstat64 () from /lib64/libc.so.6
#1  0x00007f6be629843d in __tzfile_read () from /lib64/libc.so.6
#2  0x00007f6be6297d48 in tzset_internal.part () from /lib64/libc.so.6
#3  0x00007f6be62980b5 in __tz_convert () from /lib64/libc.so.6
#4  0x00007f6be62965b9 in ctime () from /lib64/libc.so.6
#5  0x00007f6be776a4f8 in logger_logfile_timestamp () at logger.c:323
#6  logger_thread (arg=<optimized out>) at logger.c:453
#7  0x00007f6be7767c32 in hthread_func (arg2=0x13d4b50) at hthreads.c:798
#8  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f6be8987740 (LWP 1486)):
#0  0x00007f6be62cad7b in write () from /lib64/libc.so.6
#1  0x00007f6be625c21d in _IO_file_write@@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x00007f6be625b4ff in new_do_write () from /lib64/libc.so.6
#3  0x00007f6be625c9de in __GI__IO_file_xsputn () from /lib64/libc.so.6
#4  0x00007f6be623295b in buffered_vfprintf () from /lib64/libc.so.6
#5  0x00007f6be622fbce in vfprintf () from /lib64/libc.so.6
#6  0x00007f6be6238524 in fprintf () from /lib64/libc.so.6

Thread 1 (Thread 0x7f6bcffff700 (LWP 1505)):
#0  0x00007f6be6279526 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007f6be626928e in strdup () from /lib64/libc.so.6
#2  0x00007f6be421a39a in qeth_create_interface (grp=0x2694050, dev=0x26e5000) at qeth.c:1150
#3  osa_adapter_cmd (dev=dev@entry=0x26e5000, req_th=req_th@entry=0x7f6bcffed000, 
    datalen=datalen@entry=103) at qeth.c:1358
#4  0x00007f6be421f722 in qeth_execute_ccw (dev=0x26e5000, code=<optimized out>, 
    flags=<optimized out>, chained=<optimized out>, count=103, prevcode=<optimized out>, ccwseq=0, 
    iobuf=0x7f6bcffed000 "", more=0x7f6bcffea563 "", unitstat=0x7f6bcffea561 "", 
    residual=0x7f6bcffea564) at qeth.c:4617
#5  0x00007f6be7f2c0e1 in z900_execute_ccw_chain (arg=arg@entry=0x26e5000) at channel.c:5102
#6  0x00007f6be7f2ec78 in call_execute_ccw_chain (arch_mode=<optimized out>, 
    pDevBlk=pDevBlk@entry=0x26e5000) at channel.c:6188
#7  0x00007f6be7f38f15 in device_thread (arg=<optimized out>) at channel.c:2440
#8  0x00007f6be7767c32 in hthread_func (arg2=0x7f6bd8011910) at hthreads.c:798
#9  0x00007f6be65a2569 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f6be62d9a2f in clone () from /lib64/libc.so.6

Hope this helps you.

@Fish-Git
Copy link
Member

Here is the output from gdb:

Thank you. I'll take a look at it as soon as I can and will get back to you.

@Fish-Git Fish-Git removed the QUESTION... A question was asked but has not been answered yet, -OR- additional feedback is requested. label Aug 19, 2019
@Fish-Git
Copy link
Member

09:05:23 HHC00901I 0:EC02 QETH: Interface cules Network Interface Configuration Program - version 4.3.0.0-SDL-g1f8^PS^AÌf^?, type TUN opened

This is a strong clue I think.

It looks like the startup messages that hercifc now issues when it first starts (due to my recent commit to fix a completely different problem than Ian reported) are somehow getting intermixed (interleaved) with Hercules's own messages (in this case, messages that the qeth driver issues), which I'm guessing (*) is somehow causing the crash.

Let me perform some experiments to see if maybe I can recreate the crash on my own system. If I can, then I should be able to devise a fix fairly quickly. Otherwise I'm just going to have to take an educated guess.

Hang loose, Florian! I'm working on it!


P.S. In the mean time, what you can do that would be a BIG help to me would be (if you can easily/reliably reproduce (cause) the crash on demand, i.e. each and every time whenever you want) is the same thing that Ian did in Issue #236:

  1. Start up Hercules under the control of gdb (instructions for doing so are documented in Issue How to analyze a Linux "core" file (i.e. crash dump) #199 in the section "Obtaining a backtrace by running Hercules directly under gdb itself")

and then, when the crash occurs:

  1. Issue the same two gdb commands that Ian did: where and then where full.

That resulting gdb display might hopefully provide better information regarding exactly where the crash is occurring (and thus hopefully why).

Thanks.


(*) Unfortunately the crash dump analysis you posted wasn't as much help as I was hoping it would be. Looking at it, I'm unable to see where the crash is occurring! Maybe someone more experienced with gdb's output might be able to make more sense of it?

@fbi-ranger
Copy link
Author

fbi-ranger commented Aug 19, 2019

Here the gdb where:

Thread 25 "dev EC02 thrd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdea8a700 (LWP 3728)]
0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6

(gdb) where

#0  0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007ffff58b028e in strdup () from /lib64/libc.so.6
#2  0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150
#3  osa_adapter_cmd (dev=dev@entry=0x1917000, req_th=req_th@entry=0x7fffdea78000, 
    datalen=datalen@entry=103) at qeth.c:1358
#4  0x00007fffdf7a8722 in qeth_execute_ccw (dev=0x1917000, code=<optimized out>, flags=<optimized out>, 
    chained=<optimized out>, count=103, prevcode=<optimized out>, ccwseq=0, iobuf=0x7fffdea78000 "", 
    more=0x7fffdea75563 "", unitstat=0x7fffdea75561 "", residual=0x7fffdea75564) at qeth.c:4617
#5  0x00007ffff75730e1 in z900_execute_ccw_chain (arg=arg@entry=0x1917000) at channel.c:5102
#6  0x00007ffff7575c78 in call_execute_ccw_chain (arch_mode=<optimized out>, 
    pDevBlk=pDevBlk@entry=0x1917000) at channel.c:6188
#7  0x00007ffff757ff15 in device_thread (arg=<optimized out>) at channel.c:2440
#8  0x00007ffff6daec32 in hthread_func (arg2=0x7fffe8012780) at hthreads.c:798
#9  0x00007ffff5be9569 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff5920a2f in clone () from /lib64/libc.so.6

(gdb) where full:

0  0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ffff58b028e in strdup () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150
        ttmtu = <optimized out>
        uMTU = <optimized out>
        i = <optimized out>
        rc = <optimized out>
        buf = "}\000\000\000\377\177\000\000Y\364\332\366\377\177\000\000xPK\001\000\000\000\000pPK\001\000\000\000\000xPK\001\000\000\000\000\000 \220\001\000\000\000\000\020PK\001\000\000\000\000\367\005z\337\377\177\000"
        i = <optimized out>
        rc = <optimized out>
        buf = <optimized out>
        ttmtu = <optimized out>
        uMTU = <optimized out>
        buf = <optimized out>
#3  osa_adapter_cmd (dev=dev@entry=0x1917000, req_th=req_th@entry=0x7fffdea78000, 
    datalen=datalen@entry=103) at qeth.c:1358
        req_puk = 0x7fffdea78038
        grp = 0x14b5010
        rsp_bhr = <optimized out>
        req_rrh = <optimized out>
        offrrh = <optimized out>
        offph = <optimized out>
        __FUNCTION__ = "osa_adapter_cmd"
#4  0x00007fffdf7a8722 in qeth_execute_ccw (dev=0x1917000, code=<optimized out>, flags=<optimized out>, 
    chained=<optimized out>, count=103, prevcode=<optimized out>, ccwseq=0, iobuf=0x7fffdea78000 "", 
    more=0x7fffdea75563 "", unitstat=0x7fffdea75561 "", residual=0x7fffdea75564) at qeth.c:4617
        datalen = 103
        length = 103
        first4 = <optimized out>
        contentstring = "RRH_TYPE_ULP: PUK_TYPE_ENABLE (ULP_ENABLE)", '\000' <repeats 213 times>
        grp = 0x14b5010
        num = <optimized out>
        __FUNCTION__ = "qeth_execute_ccw"
#5  0x00007ffff75730e1 in z900_execute_ccw_chain (arg=arg@entry=0x1917000) at channel.c:5102
        dev = 0x1917000
        iobuf = 0x7fffdea77000
        ccwaddr = 1610593696
        ticaddr = 0
        ticback = 0
        idapmask = <optimized out>
        idawfmt = <optimized out>
        ccwfmt = <optimized out>
        ccwkey = <optimized out>
        opcode = <optimized out>
        flags = <optimized out>
        addr = <optimized out>
        mbaddr = <optimized out>
        mbk = <optimized out>
        mbcount = <optimized out>
        count = 103
        ccw = <optimized out>
        unitstat = 0 '\000'
        chanstat = 0 '\000'
        residual = 0
        more = 0 '\000'
        chain = 0 '\000'
        tracethis = <optimized out>
        firstccw = 0 '\000'
        area = '\000' <repeats 63 times>
        bufpos = 0
        skip_ccws = <optimized out>
        cmdretry = <optimized out>
        prevccwaddr = <optimized out>
        prefetch_remaining = <optimized out>
        ps = 1
       ts = <optimized out>
        prefetch = {seq = 2, pos = 103, reqcount = 103, prevcode = 0 '\000', opcode = 1 '\001', 
          reserved = 0, chanstat = '\000' <repeats 255 times>, ccwaddr = {1610593696, 1610593696, 
            0 <repeats 254 times>}, datalen = {0, 103, 0 <repeats 254 times>}, dataaddr = {2147356680, 
            528392272, 0 <repeats 254 times>}, ccwflags = "$", '\000' <repeats 254 times>, ccwcount = {
            103, 0 <repeats 255 times>}, idawaddr = {0, 2147356680, 0 <repeats 254 times>}, 
          idawtype = "\000\002", '\000' <repeats 253 times>, idawflag = '\000' <repeats 255 times>}
        iobuf_initial = <error reading variable iobuf_initial (value of type `IOBUF' requires 69632 bytes, which is more than max-value-size)>
        __FUNCTION__ = "z900_execute_ccw_chain"
#6  0x00007ffff7575c78 in call_execute_ccw_chain (arch_mode=<optimized out>, 
    pDevBlk=pDevBlk@entry=0x1917000) at channel.c:6188
No locals.
#7  0x00007ffff757ff15 in device_thread (arg=<optimized out>) at channel.c:2440
        dev = 0x1917000
        current_priority = 1
        rc = <optimized out>
        waitcount = 0
#8  0x00007ffff6daec32 in hthread_func (arg2=0x7fffe8012780) at hthreads.c:798
        pfn = 0x7ffff757fc90 <device_thread>
        arg = 0x0
        name = 0x7ffff7842890 "idle device thread"
        tid = 140736928982784
        rc = <optimized out>
#9  0x00007ffff5be9569 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007ffff5920a2f in clone () from /lib64/libc.so.6
No symbol table info available.

@Fish-Git
Copy link
Member

Fish-Git commented Aug 19, 2019

Okay, this crash doesn't make any sense to me!

According to gdb, it's crashing in 'strdup' called from qeth.c's 'qeth_create_interface' function at line 1150:

 
(gdb where):

#0  0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007ffff58b028e in strdup () from /lib64/libc.so.6
#2  0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150

 
(gdb where full):

0  0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ffff58b028e in strdup () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150
        ttmtu = 
        uMTU = 
        i = 
        rc = 
        buf = "}\000\000\000\377\177\000\000Y\364\332\366\377\177\000\000xPK\001\000\000\000\000pPK\001\000\000\000\000xPK\001\000\000\000\000\000 \220\001\000\000\000\000\020PK\001\000\000\000\000\367\005z\337\377\177\000"
        i = 
        rc = 
        buf = 
        ttmtu = 
        uMTU = 
        buf = 

 
(corresponding source code):

hyperion/qeth.c

Lines 1147 to 1159 in 4998c65

/* Set the interface's MTU size, if possible */
{
/* Save original requested value, if any */
char* ttmtu = grp->ttmtu ? strdup( grp->ttmtu ) : NULL;
U16 uMTU = ttmtu ? (U16) atoi( ttmtu ) : 0;
/* Retrieve the interface's actual MTU */
free( grp->ttmtu );
grp->ttmtu = NULL;
grp->uMTU = 0;
InitMTU( dev, grp ); /* get interface's mtu value */
ASSERT( grp->ttmtu );
ASSERT( grp->uMTU );

 
As you can see on line 1150, grp->ttmtu is being checked for NULL before calling strdup! So why is it crashing?! I don't know! It doesn't make any sense to me! :(

@fbi-ranger

Florian?

I'm going to need some detailed information regarding how you have your system setup/configured: your Hercules configuration file, your z/VM System Configuration file, and your z/OS TCPIP profile (and VTAM TRLE for your OSAs).

I have z/VM 6.3 and am able to run z/OS 2.3 under z/VM in Layer 3 mode just fine. My Hercules has two OSAs (QETH devices) defined in Hercules, my z/VM System Config has two OSAs defined, and one VSWITCH defined for the second OSA. My z/VM runs in Layer 2 mode (using the first OSA) and the VSWITCH uses the second OSA. My z/OS has two OSAs defined too, which both of which are using the the second real Hercules OSA due to z/VM (but z/OS thinks it has 2 OSAs thanks to z/VM).

Everything works fine for me!

So I'm going to need to see how you have your system configured (Hercules, z/VM and z/OS) to see if maybe you can accomplish the same thing differently. Maybe the manner in which you are doing things is not supported by Hercules? I don't know!

I'm lost on this one and need some help from someone who knows Linux better than me. :(

HELP! :(

@Fish-Git Fish-Git added HELP! Help is needed from someone more experienced or I'm simply overloaded with too much work right now! L Linux only issue, such as with tuntap networking that doesn't occur on Windows. QUESTION... A question was asked but has not been answered yet, -OR- additional feedback is requested. Researching... The issue is being looked into or additional information is being gathered/located. and removed IN PROGRESS... I'm working on it! (Or someone else is!) labels Aug 19, 2019
@fbi-ranger
Copy link
Author

fbi-ranger commented Aug 20, 2019

Fish,

The configuration of Hercules, z/VM and z/OS didn't change at all. It is still as described it in #230.

Here the z/VM cannot handle the QETH devices for the vswitch. So there is no z/OS etc. involved at all. z/VM starts two VMs (DTCVSW1 and DTCVSW2) for controlling the vswitch(es).

In my configuration UA EC00-EC02 is for vswitch VMINTRA1. When DTCVSW2 tries to engage EC00-EC02 the segfault occurs. This is very early in the z/VM startup process and has nothing to do with a guest OS.

There is no difference in the configuration. The Hercules version is 4.3 where the crash happens while in 4.2 it works correctly. NO change of config files on any of the involved components.

When I start a z/VM 6.3 it does exactly the same crash:

(gdb) where 

#0  0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007ffff58b028e in strdup () from /lib64/libc.so.6
#2  0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150
#3  osa_adapter_cmd (dev=dev@entry=0x1917000, req_th=req_th@entry=0x7fffdf684000, 
    datalen=datalen@entry=103) at qeth.c:1358
#4  0x00007fffdf7a8722 in qeth_execute_ccw (dev=0x1917000, code=, flags=, 
    chained=, count=103, prevcode=, ccwseq=0, iobuf=0x7fffdf684000 "", 
    more=0x7fffdf681563 "", unitstat=0x7fffdf681561 "", residual=0x7fffdf681564) at qeth.c:4617
#5  0x00007ffff75730e1 in z900_execute_ccw_chain (arg=arg@entry=0x1917000) at channel.c:5102
#6  0x00007ffff7575c78 in call_execute_ccw_chain (arch_mode=, 
    pDevBlk=pDevBlk@entry=0x1917000) at channel.c:6188
#7  0x00007ffff757ff15 in device_thread (arg=) at channel.c:2440
#8  0x00007ffff6daec32 in hthread_func (arg2=0x7fffe8011650) at hthreads.c:798
#9  0x00007ffff5be9569 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff5920a2f in clone () from /lib64/libc.so.6


(gdb) where full

#0  0x00007ffff58c0526 in __strlen_sse2 () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ffff58b028e in strdup () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007fffdf7a339a in qeth_create_interface (grp=0x14b5010, dev=0x1917000) at qeth.c:1150
        ttmtu = 
        uMTU = 
        i = 
        rc = 
        buf = "}\000\000\000\377\177\000\000Y\364\332\366\377\177\000\000xPK\001\000\000\000\000pPK\001\000\000\000\000xPK\001\000\000\000\000\000 \220\001\000\000\000\000\020PK\001\000\000\000\000\367\005z\337\377\177\000"
        i = 
        rc = 
        buf = 
        ttmtu = 
        uMTU = 
        buf = 
#3  osa_adapter_cmd (dev=dev@entry=0x1917000, req_th=req_th@entry=0x7fffdf684000, 
    datalen=datalen@entry=103) at qeth.c:1358
        req_puk = 0x7fffdf684038
        grp = 0x14b5010
        rsp_bhr = 
        req_rrh = 
        offrrh = 
        offph = 
        __FUNCTION__ = "osa_adapter_cmd"
#4  0x00007fffdf7a8722 in qeth_execute_ccw (dev=0x1917000, code=, flags=, 
    chained=, count=103, prevcode=, ccwseq=0, iobuf=0x7fffdf684000 "", 
    more=0x7fffdf681563 "", unitstat=0x7fffdf681561 "", residual=0x7fffdf681564) at qeth.c:4617
        datalen = 103
        length = 103
        first4 = 
        contentstring = "RRH_TYPE_ULP: PUK_TYPE_ENABLE (ULP_ENABLE)", '\000' 
        grp = 0x14b5010
        num = 
        __FUNCTION__ = "qeth_execute_ccw"
#5  0x00007ffff75730e1 in z900_execute_ccw_chain (arg=arg@entry=0x1917000) at channel.c:5102
        dev = 0x1917000
        iobuf = 0x7fffdf683000
        ccwaddr = 2147136224
        ticaddr = 0
        ticback = 0
        idapmask = 
        idawfmt = 
        ccwfmt = 
        ccwkey = 
        opcode = 
        flags = 
        addr = 
        mbaddr = 
        mbk = 
        mbcount = 
        count = 103
        ccw = 
        unitstat = 0 '\000'
        chanstat = 0 '\000'
        residual = 0
        more = 0 '\000'
        chain = 0 '\000'
        tracethis = 
        firstccw = 0 '\000'
        area = '\000' 
        bufpos = 0
        skip_ccws = 
        cmdretry = 
        prevccwaddr = 
        prefetch_remaining = 
        ps = 1
    more=0x7fffdf681563 "", unitstat=0x7fffdf681561 "", residual=0x7fffdf681564) at qeth.c:4617
        datalen = 103
        length = 103
        first4 = 
        contentstring = "RRH_TYPE_ULP: PUK_TYPE_ENABLE (ULP_ENABLE)", '\000' 
        grp = 0x14b5010
        num = 
        __FUNCTION__ = "qeth_execute_ccw"
#5  0x00007ffff75730e1 in z900_execute_ccw_chain (arg=arg@entry=0x1917000) at channel.c:5102
        dev = 0x1917000
        iobuf = 0x7fffdf683000
        ccwaddr = 2147136224
        ticaddr = 0
        ticback = 0
        idapmask = 
        idawfmt = 
        ccwfmt = 
        ccwkey = 
        opcode = 
        flags = 
        addr = 
        mbaddr = 
        mbk = 
        mbcount = 
        count = 103
        ccw = 
        unitstat = 0 '\000'
        chanstat = 0 '\000'
        residual = 0
        more = 0 '\000'
        chain = 0 '\000'
        tracethis = 
        firstccw = 0 '\000'
        area = '\000' 
        bufpos = 0
        skip_ccws = 
        cmdretry = 
        prevccwaddr = 
        prefetch_remaining = 
        ps = 1
        ts = 
        prefetch = {seq = 2, pos = 103, reqcount = 103, prevcode = 0 '\000', opcode = 1 '\001', 
          reserved = 0, chanstat = '\000' , ccwaddr = {2147136224, 2147136224, 
            0 }, datalen = {0, 103, 0 }, dataaddr = {2146975832, 
            2112405584, 0 }, ccwflags = "$", '\000' , ccwcount = {
            103, 0 }, idawaddr = {0, 2146975832, 0 }, 
          idawtype = "\000\002", '\000' , idawflag = '\000' }
        iobuf_initial = 
        __FUNCTION__ = "z900_execute_ccw_chain"
#6  0x00007ffff7575c78 in call_execute_ccw_chain (arch_mode=, 
    pDevBlk=pDevBlk@entry=0x1917000) at channel.c:6188
No locals.
#7  0x00007ffff757ff15 in device_thread (arg=) at channel.c:2440
        dev = 0x1917000
        current_priority = 1
        rc = 
        waitcount = 0
#8  0x00007ffff6daec32 in hthread_func (arg2=0x7fffe8011650) at hthreads.c:798
        pfn = 0x7ffff757fc90 
        arg = 0x0
        name = 0x7ffff7842890 "idle device thread"
        tid = 140736941614848
        rc = 
#9  0x00007ffff5be9569 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007ffff5920a2f in clone () from /lib64/libc.so.6
No symbol table info available.

@mcisho
Copy link
Contributor

mcisho commented Aug 20, 2019

This isn't a QETH specific problem, it also affects CTCI and LCS.

The _OSA_GRP variable ttifname, the _CTCBLK variable szTUNIfName and the _LCSPORT variable szNetIfName, which are all IFNAMSIZ (i.e. 16) characters long, are overwritten with 72 characters that appear to originate from hercifc.

On my host system the QETH and CTCI overwrites cause a segmentation fault. The LCS overwrite doesn't. All of the variables are setup by function TUNTAP_CreateInterface.

I also see a message: malloc(): corrupt top size.

That's as far as I've got for the moment, and I have to stop now. Hopefully will be able to continue later.

@Fish-Git Fish-Git removed the QUESTION... A question was asked but has not been answered yet, -OR- additional feedback is requested. label Aug 20, 2019
@Fish-Git
Copy link
Member

That's as far as I've got for the moment, and I have to stop now. Hopefully will be able to continue later.

Wow. That's great info, Ian! And looking at the TUNTAP_CreateInterface function, I'm now thinking our bug may just be a simple case of failing to null terminate the pszNetDevName string!

I 'm seeing the following code in TUNTAP_CreateInterface which appears quite dangerous to me:

hyperion/tuntap.c

Lines 270 to 275 in fad967f

return -1;
}
strcpy( pszNetDevName, hifr.hifr_name );
}
#if !defined( OPTION_W32_CTCI )

hyperion/tuntap.c

Lines 287 to 291 in fad967f

char *p = strrchr( pszTUNDevice, '/' );
if( p )
strncpy( pszNetDevName, ++p, IFNAMSIZ );
else

The strcpy on line line 273 and the strncpy on line 290 are both just asking for trouble!

The strcpy for obvious reasons, and the strncpy due to far too many programmers forgetting that it doesn't null terminate the destination buffer when the length of the source string is equal to or greater than the maximum characters to be copied:

strncpy:

APPLICATION USAGE
...
If there is no NUL character byte in the first n bytes of the array pointed to by s2, the result is not null-terminated.

and:

strncpy:

Remarks
... If count is less than or equal to the length of strSource, a null character is not appended automatically to the copied string.

 
@fbi-ranger

Florian?

Do me a favor and try rebuilding Hercules after applying the following patch:

--- hyperion-1/tuntap.c	2019-08-03 15:41:15.683714400 -0700
+++ hyperion-0/tuntap.c	2019-08-20 07:38:34.711798400 -0700
@@ -270,7 +270,9 @@
             return -1;
         }
 
-        strcpy( pszNetDevName, hifr.hifr_name );
+        // N.B. strlcpy used to prevent buffer overflow
+        //      and to ensure null termination.
+        strlcpy( pszNetDevName, hifr.hifr_name, IFNAMSIZ );
     }
 #if !defined( OPTION_W32_CTCI )
     else
@@ -284,10 +286,14 @@
         // This should work on OS X with Christoph Pfisterer's TUN driver,
         //        since it does set the device name to the basename of the
         //        file. -- JRM
-        char *p = strrchr( pszTUNDevice, '/' );
+        char* p = strrchr( pszTUNDevice, '/' );
 
-        if( p )
-            strncpy( pszNetDevName, ++p, IFNAMSIZ );
+        if (p)
+        {
+            // N.B. strlcpy used to prevent buffer overflow
+            //      and to ensure null termination.
+            strlcpy( pszNetDevName, p+1, IFNAMSIZ );
+        }
         else
         {
             // "Invalid TUN/TAP device name %s"

If the above patch fixes your problem then I'll go ahead and commit the fix to the repository. Otherwise if it doesn't resolve your problem, then we'll need to continue scratching our heads.

Thanks.

@fbi-ranger
Copy link
Author

fbi-ranger commented Aug 20, 2019

Fish,

Unfortunately this fix causes more troubles:

20:31:05 HHC01415I Build date: Aug 20 2019 at 20:25:04
20:31:05 HHC00166E hercifc: ioctl error doing SIOCSIFADDR on cules Network I: 19 No such device
20:31:05 HHC00007I Previous message from function 'main' at hercifc.c(257)
20:31:05 HHC00166E hercifc: ioctl error doing SIOCSIFNETMASK on cules Network I: 19 No such device
20:31:05 HHC00007I Previous message from function 'main' at hercifc.c(257)
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using MAC address 02:00:5e:a4:40:d1
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using IP address 10.10.0.190
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using subnet mask 255.255.255.0
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using MTU 1500
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using drive MAC address ee:17:6c:47:7a:64
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using drive IP address fe80::ee17:6cff:fe47:7a64
20:31:05 HHC00912E 0:EC00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
20:31:05 HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
20:31:05 HHC00912E 0:EC00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
20:31:05 HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
20:31:05 HHC00912E 0:EC00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state

System dumps and re-IPLs indefinitely.

0:31:05 HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
20:31:05 HHC00912E 0:EC00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
20:31:05 HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
20:31:05 HHC00912E 0:EC00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
20:31:05 HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
20:31:05 HHC01415I Build date: Aug 20 2019 at 20:25:04
20:31:05 HHC00166E hercifc: ioctl error doing SIOCSIFADDR on cules Network I: 19 No such device 
20:31:05 HHC00007I Previous message from function 'main' at hercifc.c(257)
20:31:05 HHC00166E hercifc: ioctl error doing SIOCSIFNETMASK on cules Network I: 19 No such device
20:31:05 HHC00007I Previous message from function 'main' at hercifc.c(257)
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using MAC address 02:00:5e:a4:40:d1
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using IP address 10.10.0.190
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using subnet mask 255.255.255.0 
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using MTU 1500
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using drive MAC address ee:17:6c:47:7a:64
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using drive IP address fe80::ee17:6cff:fe47:7a64
20:31:05 HHC00912E 0:EC00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
20:31:05 HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
20:31:05 HHC00912E 0:EC00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state 
20:31:05 HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
20:31:05 HHC00912E 0:EC00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
20:31:05 HHC00912E 0:EC00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
20:31:05 HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
20:31:09 HHC00814I Processor CP00: SIGP CPU reset                        (0C) CP01, PARM 00000000004E9BA8: CC 0
20:31:09 HHC00814I Processor CP00: SIGP Store status                     (0E) CP01, PARM 0000000000B31200: CC 0
20:31:09 20:31:09 HCPWRP963I SHUTDOWN STEP SGPST - STOP OTHER PROCESSORS
20:31:10 20:31:10 HCPWRP959I PHOENIX  SYSTEM TERMINATION IN PROGRESS ON 2019-08-20
20:31:10 20:31:10 HCPWRP963I SHUTDOWN STEP TXTDS - TERMINATE DATA TRACES
20:31:10 20:31:10 HCPWRP963I SHUTDOWN STEP SVACV - ACTIVATE TERMINATION SAVE AREAS
20:31:11 20:31:11 HCPWRP963I SHUTDOWN STEP CHMOF - DISABLE CHANNEL MEASUREMENT
20:31:11 20:31:11 HCPWRP963I SHUTDOWN STEP ISHDA - DISABLE ALL DEVICES
20:31:12 20:31:12 HCPWRP963I SHUTDOWN STEP DMPDK - TAKE DUMP
20:31:12 20:31:05 HCPDMP908I SYSTEM FAILURE ON CPU 0000, CODE - VAI008
20:31:12 20:31:05 HCPDMP908I 2019-08-20 SYSTEM PHOENIX
20:31:12 20:31:12 HCPDMP9252I DUMPING TO DASD
20:31:17 20:31:17 HCPDMP9260I 25% OF THE DUMP HAS BEEN COMPLETED
20:31:19 20:31:19 HCPDMP9260I 50% OF THE DUMP HAS BEEN COMPLETED
20:31:21 20:31:21 HCPDMP9260I 75% OF THE DUMP HAS BEEN COMPLETED
20:31:23 20:31:23 HCPDMP9261I DUMP COMPLETE ON 2019-08-20, CODE VAI008
20:31:23 20:31:23 HCPWRP963I SHUTDOWN STEP CKPSH - TAKE A CHECKPOINT
20:31:24 20:31:24 HCPWRP963I SHUTDOWN STEP OPRCK - SAVE OPERATOR CONSOLE LIST
20:31:24 20:31:24 HCPWRP963I SHUTDOWN STEP MCWMD - DETERMINE MACHINE CHECK STATUS
20:31:25 20:31:25 HCPWRP963I SHUTDOWN STEP SDVRS - RESET IBM DASD CU CHARACTERISTICS
20:31:25 20:31:25 HCPWRP963I SHUTDOWN STEP SGQXX - RESET OTHER PROCESSORS
20:31:26 HHC00814I Processor CP00: SIGP CPU reset                        (0C) CP02, PARM 00000000004E9BA8: CC 0
20:31:26 HHC00814I Processor CP00: SIGP CPU reset                        (0C) CP03, PARM 00000000004E9BA8: CC 0
20:31:26 20:31:26 HCPWRP9277I SYSTEM TERMINATION COMPLETE, ATTEMPTING RESTART
20:31:26 20:31:26 HCPWRP9277I 2019-08-20 SYSTEM PHOENIX
20:31:26 20:31:26 HCPWRP963I SHUTDOWN STEP SVADV - DEACTIVATE TERMINATION SAVE AREAS
20:31:26 20:31:26 HCPWRP963I SHUTDOWN STEP MTORS - RESET THE CONFIGURATION
20:31:27 HHC00814I Processor CP00: SIGP Set architecture mode            (12) CP00, PARM 0000000000000000: CC 0
20:31:27 HHC00811I Processor CP00: architecture mode ESA/390
20:31:27 HHC00814I Processor CP00: SIGP Set architecture mode            (12) CP00, PARM 00000001: CC 0
20:31:27 HHC00811I Processor CP00: architecture mode z/Arch
20:31:32 20:31:32 z/VM SYSTEM RESTART AFTER SYSTEM FAILURE; CODE VAI008

VAI008

Explanation:
Issued by the HCPVAI module. This module is distributed as object code only; therefore, nosource program materials are available.

User response:
See z/VM: Diagnosis Guide for information on gathering the documentation you need to assist IBM in diagnosing the problem; then contact your IBM Support Center personnel

@Fish-Git
Copy link
Member

20:31:05 HHC00166E hercifc: ioctl error doing SIOCSIFADDR on cules Network I: 19 No such device
20:31:05 HHC00007I Previous message from function 'main' at hercifc.c(257)
20:31:05 HHC00166E hercifc: ioctl error doing SIOCSIFNETMASK on cules Network I: 19 No such device
20:31:05 HHC00007I Previous message from function 'main' at hercifc.c(257)
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using MAC address 02:00:5e:a4:40:d1
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using IP address 10.10.0.190
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using subnet mask 255.255.255.0
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using MTU 1500
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using drive MAC address ee:17:6c:47:7a:64
20:31:05 HHC03997I 0:EC02 QETH: cules Network I: using drive IP address fe80::ee17:6cff:fe47:7a64

Notice the strings highlighted string in the above messages: "cules Network I".

This string is supposed to be the name of the QETH device's tuntap device! (i.e. the value that you specified for the iface parameter of your Hercules QETH device statement, or what you specified on your NETDEV Hercules configuration file statement, or the default if neither is specified, which for Linux is /dev/net/tun).

HOW THE F**K is it getting set to "cules Network I"?!

What's weird is the "cules Network I" just so happens to be characters 3-17 of the string "Hercules Network Interface Configuration Program", which, if you include a null terminator into the count, just so happens to be 15 bytes long, which is the exact size of the constant IFNAMSIZ, being the size of the _CTLREQ struct's szIFName name field.

So like Ian said, some type of buffer overflow is occurring somewhere, but where it is occurring we don't know yet! :(

@Fish-Git
Copy link
Member

@fbi-ranger
@mcisho

Florian, (and Ian),

Here's another patch for you to try:

Don't forget to remove the previous patch I gave you before applying this new one.

Thanks.

@fbi-ranger
Copy link
Author

Fish,
I do not have any NETDEV statement specified in the configuration.

The definition of the QETH device in hercules.cnf is

EC00.3 QETH CHPID EC IPADDR 10.10.0.190 NETMASK 255.255.255.0

So no iface parameter there. Seems to be an default value and the variable may be badly aligned.

I will apply try2.

@ivan-w
Copy link
Member

ivan-w commented Aug 21, 2019

Note : The "CHPID EC" is probably redundant (but it shouldn't break things)

@fbi-ranger
Copy link
Author

fbi-ranger commented Aug 21, 2019

Here are some compiler warnings from patched msgenu.h:

shared.c: In function 'serverConnect':
shared.c:2414:25: warning: '%04X' directive output may be truncated writing 4 bytes into a region of size between 1 and 5 [-Wformat-truncation=]
MSGBUF( threadname, "shrd dev %1d:%04X", LCSS_DEVNUM );
^
msgenu.h:207:70: note: in definition of macro 'MSGBUF'
#define MSGBUF( _buf, ... ) snprintf(_buf, sizeof(_buf), ## VA_ARGS )
^~~~~~~~~~~
shared.c:2414:25: note: directive argument in the range [0, 65535]
MSGBUF( threadname, "shrd dev %1d:%04X", LCSS_DEVNUM );
^
msgenu.h:207:70: note: in definition of macro 'MSGBUF'
#define MSGBUF( _buf, ... ) snprintf(_buf, sizeof(_buf), ## VA_ARGS )
^~~~~~~~~~~
msgenu.h:207:33: note: 'snprintf' output between 16 and 20 bytes into a destination of size 16
#define MSGBUF( _buf, ... ) snprintf(_buf, sizeof(_buf), ## VA_ARGS )

qeth.c:6490:19: note: directive argument in the range [0, 2147483647]
MSGBUF( cbuf, "%u", pfxlen );
^
msgenu.h:207:70: note: in definition of macro 'MSGBUF'
#define MSGBUF( _buf, ... ) snprintf(_buf, sizeof(_buf), ## VA_ARGS )
^~~~~~~~~~~
msgenu.h:207:33: note: 'snprintf' output between 2 and 11 bytes into a destination of size 8
#define MSGBUF( _buf, ... ) snprintf(_buf, sizeof(_buf), ## VA_ARGS )
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
qeth.c:6490:5: note: in expansion of macro 'MSGBUF'
MSGBUF( cbuf, "%u", pfxlen );
^~~~~~
CCLD hdtqeth.la

@fbi-ranger
Copy link
Author

fbi-ranger commented Aug 21, 2019

Fish,

Sorry but this 2try does not help.

However here is the log:

HHC00901I 0:EC02 QETH: Interface cules Network I, type TUN opened +
HHC00136E Error in function TUNTAP_GetMTU: No such device
HHC00007I Previous message from function 'TUNTAP_GetMTU' at tuntap.c(585)
HHC02499I Hercules utility hercifc - Hercules Network Interface Configuration Program - version 4.3.0.0-S
HHC01414I (C) Copyright 1999-2019 by Roger Bowler, Jan Jaeger, and others
HHC01417I YBI-15017-9802
HHC01415I Build date: Aug 21 2019 at 12:05:28
HHC00166E hercifc: ioctl error doing SIOCSIFADDR on cules Network I: 19 No such device
HHC00007I Previous message from function 'main' at hercifc.c(253)
HHC00166E hercifc: ioctl error doing SIOCSIFNETMASK on cules Network I: 19 No such device
HHC00007I Previous message from function 'main' at hercifc.c(253)
HHC03997I 0:EC02 QETH: cules Network I: using MAC address 02:00:5e:ae:2b:a2
HHC03997I 0:EC02 QETH: cules Network I: using IP address 10.10.0.190
HHC03997I 0:EC02 QETH: cules Network I: using subnet mask 255.255.255.0
HHC03997I 0:EC02 QETH: cules Network I: using MTU 1500
HHC03997I 0:EC02 QETH: cules Network I: using drive MAC address 06:ee:9c:43:ef:0f
HHC03997I 0:EC02 QETH: cules Network I: using drive IP address fe80::6ee:9cff:fe43:ef0f
HHC00166E hercifc: ioctl error doing SIOCSIFFLAGS on cules Network I: 19 No such device
HHC00007I Previous message from function 'main' at hercifc.c(253)
14:53:22 HCPSWU2830I VSWITCH SYSTEM VMINTRA1 status is ready.
14:53:22 HCPSWU2830I DTCVSW1 is VSWITCH controller for device EC00.P00.
14:53:32 AUTO LOGON *** TCPIP USERS = 14 BY AUTOLOG2
HHC03800I 0:0C02 QETH: Adapter mode set to Layer 3
HHC00901I 0:0C02 QETH: Interface cules Network I, type TUN opened
HHC00136E Error in function TUNTAP_GetMTU: No such device
HHC00007I Previous message from function 'TUNTAP_GetMTU' at tuntap.c(585)
HHC00166E hercifc: ioctl error doing SIOCSIFADDR on cules Network I: 19 No such device
HHC00007I Previous message from function 'main' at hercifc.c(253)
HHC03997I 0:0C02 QETH: cules Network I: using MAC address 02:00:5e:f9:ad:5e
HHC03997I 0:0C02 QETH: cules Network I: using IP address 10.1.0.1
HHC03997I 0:0C02 QETH: cules Network I: using MTU 1500
HHC03997I 0:0C02 QETH: cules Network I: using drive MAC address 9e:53:a9:a6:eb:af
HHC03997I 0:0C02 QETH: cules Network I: using drive IP address fe80::9e53:a9ff:fea6:ebaf
HHC00166E hercifc: ioctl error doing SIOCSIFFLAGS on cules Network I: 19 No such device
HHC00007I Previous message from function 'main' at hercifc.c(253)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC00912E 0:0C00 QETH: Error reading from device cules Network I: 77 File descriptor in bad state
HHC00007I Previous message from function 'read_packet' at qeth.c(2738)
HHC01603I stopall

It looks indeed for an interface cules Network I which cannot be found of course. I have no idea where this comes from.
Also when I specify an iface /dev/net/tun in the config it looks for "cules Network I". Same result when added an additional NETDEV.

@fbi-ranger
Copy link
Author

fbi-ranger commented Aug 21, 2019

In 4.2 version no iface needed to be specified in the Hercules config file and it worked correctly with the default. Seems that the defaults are not set correctly any more.

In 4.3. there is a wrong storage address taken for the interface name, cules Network I, which is an documentation string in hercifc. hercifc seems to called with a broken parameter list???

@Fish-Git
Copy link
Member

@fbi-ranger

Florian, (and others),

I still don't know what's going on, BUT...

Since your original report was a Segmentation fault and your issue has something to with networking (i.e. hercifc) and since Ian reported a Segmentation fault in hercifc too (but earlier in issue #236) which was due to hthreads not being initialized which was resolved (for him!) by adding a INITIALIZE_UTILITY statement (which calls hthreads_internal_init(), which is required to fix Ian's issue), one of the parameters which happens to be the UTILITY_DESC constant ("Hercules Network Interface Configuration Program"), and that string seems to somehow play a major role in the problem you're having ...

... Let's try this:

--- hyperion-1/hercifc.c	2019-08-15 16:42:45.794199200 -0700
+++ hyperion-0/hercifc.c	2019-08-21 09:21:02.871505700 -0700
@@ -26,9 +26,6 @@
 #include "hercules.h"
 #include "hercifc.h"
 
-#define UTILITY_NAME    "hercifc"
-#define UTILITY_DESC    "Hercules Network Interface Configuration Program"
-
 // --------------------------------------------------------------------
 // HERCIFC program entry point
 // --------------------------------------------------------------------
@@ -51,7 +48,7 @@
 
     UNREFERENCED( argc );
 
-    INITIALIZE_UTILITY( UTILITY_NAME, UTILITY_DESC, &pszProgName );
+    hthreads_internal_init();
 
     DROP_PRIVILEGES(CAP_NET_ADMIN);
 

Maybe we'll get lucky and that will fix your problem while also fixing Ian's problem too but in a much more simplistic way?

<shrug>

@mcisho
Copy link
Contributor

mcisho commented Aug 21, 2019

Here's what's happening.

hercifc runs as a child process, receiving requests from Hercules on stdin, sending responses to Hercules on stdout, and sending console messages for Hercules to display to stderr.

The fix fo #236 was to make hercifc issue INITIALIZE_UTILITY( UTILITY_NAME, UTILITY_DESC, &pszProgName );, a macro that simply calls the function initialize_utility.

The initialize_utility function does various things but finally calls function display_version to display the version, copyright, etc messages on stdout. Four messages are displayed, for example:

HHC02499I Hercules utility hercifc - Hercules Network Interface Configuration Program - version 4.3.0.0-SDL-gfad967f9-modified
HHC01414I (C) Copyright 1999-2019 by Roger Bowler, Jan Jaeger, and others
HHC01417I ** The SoftDevLabs version of Hercules **
HHC01415I Build date: Aug 21 2019 at 10:50:23

When Hercules needs to create a network interface it calls function TUNTAP_CreateInterface which in turn calls function TUNTAP_SetMode. The TUNTAP_SetMode function creates a hercifc child process, sends a TUNSETIFF request to the process, then waits for a response from the process, i.e. whatever the process sends on stdout (which, ever since the fix for #236 was implemented, is now hercifc's version, copyright, etc. messages).

The TUNTAP_SetMode function then simply extracts what it expects to be the interface name from the response it receives ("cules Network I") and returns, job done.

I'm slightly(!) confused by hercifc.

hercifc appears to have been written with the intention of there being a single process that handles innumerable requests from Hercules, until Hercules sends a request telling it to terminate (CTLREQ_OP_DONE). However, I can't find where Hercules ever sends such a request.

Plus, since TUNTAP_SetMode creates a hercifc child process for each interface, ever since the #236 fix we now see a lot of hercifc version, copyright, etc. messages.

The solution is to modify the function initialize_utility so that, for hercifc, the function display_version is called to display the version, copyright, etc. messages on stderr instead of stdout (or alternatively, simply do not call the display_version function at all).

@mcisho mcisho closed this as completed in 113c3ec Aug 21, 2019
@mcisho mcisho reopened this Aug 21, 2019
@fbi-ranger
Copy link
Author

fbi-ranger commented Aug 21, 2019

Gentlemen,

It works now.

I could start z/VM correctly as it was before. The vswitch is correctly defined and the corresponding tun1 is opened.

20:59:45 Q VSWITCH                                                                                       
20:59:45 VSWITCH SYSTEM VMINTRA1   Type: QDIO    Connected: 0    Maxconn: INFINITE                       
20:59:45   PERSISTENT  RESTRICTED    NONROUTER                 Accounting: OFF                           
20:59:45   USERBASED LOCAL                                                                               
20:59:45   VLAN Unaware                                                                                  
20:59:45   MAC address: 02-00-00-00-00-01    MAC Protection: OFF                                         
20:59:45   IPTimeout: 5         QueueStorage: 8                                                          
20:59:45   Isolation Status: OFF        VEPA Status: OFF                                                 
20:59:45  Uplink Port:                                                                                   
20:59:45   State: Ready                                                                                  
20:59:45   PMTUD setting: EXTERNAL   PMTUD value: 8992     Trace Pages: 8                                
20:59:45   RDEV: EC00.P00 VDEV: 0600 Controller: DTCVSW2  ACTIVE 

Great! Thanks to your analysis.

BTW: I tried also the modification proposed by Fish. This works as well.

@Fish-Git
Copy link
Member

Fish-Git commented Aug 21, 2019

Here's what's happening.

TUNTAP_SetMode simply extracts what it expects to be the interface name from the response it receives, i.e. ules Network I and returns, job done.

Ah-HA! So that's what's happening! THANK YOU, Ian! I completely missed the TUNTAP_SetMode aspect. Good catch!

I'm slightly(!) confused by hercifc.

You're not alone!

hercifc appears to have been written with the intention of there being a single hercifc process that handles innumerable requests from Hercules, until Hercules sends a request telling hercifc to terminate. However, I can't find where Hercules sends such a request.

hercifc confuses the hell out of me too! It's whole design is screwy. Sending binary requests and responses on text streams?! (stdin and stdout?!) That's crazy! There are far better IPC mechanisms available than using stdin/stdout for crying out loud! (Sheesh!)

And hercifc is a security hole too as I recall. I believe John Hartmann pointed that out a long time ago, which is why he came up with pre-configured interfaces, which makes much more sense to me.

(And I completely missed the CTLREQ_OP_DONE request that asks hercifc to exit too! Another good catch, Ian! Thanks! This whole time I was operating under the mistaken belief that hercifc, once started, never exited (unless there was an error), due to its while (1) loop and UNREACHABLE_CODE statement.)

The solution is to modify the function initialize_utility so that, for hercifc, the function display_version is called to display the version, copyright, etc messages on stderr (or alternatively perhaps function display_version is not called at all).

OR...

  1. Completely redesign our whole hercifc <--> hercules IPC mechanism to use some other method than stdin and stdout!

  2. Get rid of hercifc completely and only support use of pre-configured interfaces instead! Interfaces that the user must preconfigure to be the way they want so Hercules doesn't need to do any network configuring at all, and instead simply uses whatever tuntap interface the user says to use as is.

I personally recommend #2 (get rid of hercifc altogether), but since this is a purely Linux issue my recommendations don't count for much. I'll let you guys decide how best to straighten out this whole hercifc mess.

Thank you, Ian, for figuring this mess out. MUCH appreciated buddy.

AND ... as I said, since it looks like you know what's going on and since this is a Linux issue (for which I'm not really qualified to deal with), I'm going to let you (or Ivan, etc) figure out how best to fix things and commit the changes. I'm going to go ahead and re-assign this GitHub issue over to you if that's okay.

Thanks again, Ian. This puppy is all yours now! :)

@Fish-Git Fish-Git assigned mcisho and ivan-w and unassigned Fish-Git Aug 21, 2019
@Fish-Git Fish-Git added IN PROGRESS... I'm working on it! (Or someone else is!) and removed HELP! Help is needed from someone more experienced or I'm simply overloaded with too much work right now! Researching... The issue is being looked into or additional information is being gathered/located. labels Aug 21, 2019
@Fish-Git
Copy link
Member

BTW: I tried also the modification proposed by Fish. This works as well.

Wow. THAT was easy! :)

But I'm still going to let you guys (Ian and Ivan, etc) design and implement (and_commit!) the actual fix based on whatever you decide is the best way to do it.

I'm out of here!

It looks like you guys have things well in hand. :)

@Fish-Git
Copy link
Member

@mcisho
@ivan-w

Ian? Ivan?

Is commit 113c3ec the final/official fix for this issue? Or is there more work you still want/need to do?

If there are no more changes either of you are planning to make then we should go ahead and remove "In Progress" status and officially Close this GitHub Issue.

Thanks.

@mcisho mcisho removed the IN PROGRESS... I'm working on it! (Or someone else is!) label Aug 22, 2019
@mcisho
Copy link
Contributor

mcisho commented Aug 22, 2019

For the moment that is the final fix. However, as it stands there can be a lot of version, copyright, etc messages coming from hercifc (one set for each call to TUNTAP_CreateInterface). My intention, once things have settled, is to revisit function initialize_utility and remove the call to function display_version for hercifc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BUG The issue describes likely incorrect product functionality that likely needs corrected. L Linux only issue, such as with tuntap networking that doesn't occur on Windows.
Projects
None yet
Development

No branches or pull requests

5 participants