Hercules 4.4.1 crashes after OSA failure #489
Comments
A couple of things right off the bat:
Most of these things are mentioned in our "SUBMITTING PROBLEM REPORTS" document. Please review it and then submit all needed additional information. Thanks. In the meantime I'll take a peek at your dump to see whether anything jumps out at me. |
Dump was worthless. I need to see the complete Hercules log file, configuration file, and TTTest64 output. Explaining in more detail what you were doing (or trying to do) when the problem occurred might help too. p.s. Did you disable Unconstrained Transactions (i.e. run TXOFF) before doing whatever it was you were attempting to do, like you did previously? |
Hi, thank you for your answer. The netmask was fixed and everything redone. It failed again. I've attached the entire log, second dump, TTTest64 report and entire config file: I also ran TXOFF before doing whatever I do. The only thing not clear to me is when TXON should be executed. I'm testing with OSA because it failed with CTCI as well. In general Hercules works pretty fine. The issue only occurs when I want to access PC applications to exploit mainframe facilities. Activities:
I'm using all products included in Hercules, but for example I've seen that WinPCap is not supported and Npcap is recommended. Don't know if that may be related. Let me know anything else you need. Regards |
I'm going to need some help reproducing this. I have almost zero z/OS skills. I do have z/OS 2.4B, and it seems to IPL and run just fine, but for all of my z/OS Hercules IPL tests, I always use loadparm When I just now tried IPLing my system (which I believe is an ADCD system) using loadparm
which I don't know how to respond to. I also don't know what My system is configured to use OSA and is configured almost identically as yours. Can you explain to me (in simple terms please! I'm not a z/OS person!) what I need to do to reproduce your problem? Thanks. p.s. I suspect something might be wrong with your local network configuration. Your TTTest64 report does not look right. Your first ping of www.linux.org using a tun, reported a ping response time of "time=16ms", which is quite reasonable. But then you closed your tun interface and tried the same thing again using a tap interface instead, and this time all of your ping responses were all "time=<1ms"!! I am seriously doubting you can ping www.linux.org from your Windows system in less than 1ms!! Can you provide some more details regarding your local networking? Thanks. |
And it might be a good idea to try your TTTest64 Ping Test again, but this time without using a tun interface beforehand. Do the test right away using only a tap defined interface. It might not hurt to clear your ARP cache beforehand too, or even IPL your Windows system. You said you had to reinstall Windows 10, and as I recall, Windows 10 did have some type of bug in its networking handling at some point in the distant past that was fixed with an update. Did you (re-)install all of your Windows/Security updates after you reinstalled Windows? (and before doing your Hercules test?) |
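For anyone following along, the ping sanity check described above can be sketched in a few lines of Python. This is only an illustration, not part of TTTest64 or Hercules: the function names, the 1 ms floor, and the sample reply line (which uses the documentation-only address 192.0.2.1) are all assumptions made for the example. It scans ping output for the "time=16ms" and "time=<1ms" forms quoted in this thread and flags a run in which every reply is sub-millisecond, which would be suspicious for a host that is supposed to be out on the internet.

```python
import re

# Matches the "time=16ms", "time<1ms" and "time=<1ms" forms seen in ping output.
RTT_RE = re.compile(r"time\s*(=<|<|=)\s*(\d+(?:\.\d+)?)\s*ms", re.IGNORECASE)


def parse_rtts(ping_output):
    """Return a list of (is_upper_bound, milliseconds) tuples found in the text."""
    results = []
    for op, value in RTT_RE.findall(ping_output):
        # A "<" in the operator means the value is only an upper bound.
        results.append(("<" in op, float(value)))
    return results


def suspicious_for_remote_host(ping_output, floor_ms=1.0):
    """True when every reply is below floor_ms (assumed threshold).

    Sub-millisecond replies are normal on a local segment but implausible
    for a remote internet host, so this hints that something local answered.
    """
    rtts = parse_rtts(ping_output)
    if not rtts:
        return False
    return all(upper or ms < floor_ms for upper, ms in rtts)


if __name__ == "__main__":
    # Hypothetical sample line; only the "time=<1ms" part comes from the thread.
    sample = "Reply from 192.0.2.1: bytes=32 time=<1ms TTL=128"
    print(suspicious_for_remote_host(sample))
```

A check like this cannot say why the replies are so fast, only that the numbers do not fit a remote destination, which is the point being made above.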
Hi, Regarding the message, it is just because the CICS 54 DFHLPA maybe in the lpalist, but the lib doesn't exist. In such cases you can reply zosexplorer is an IBM eclipse application working as a framework for other applications included as plugins, like cics explorer, git interfaces, zos connect enterprise edition, zosexplorer, dbb for zDevOps and others. I start my ADCD z/OS 2.4 with CI as well. To reproduce my problem you could use zosexplorer. You need two started tasks: jmon and rsed, which I believe are started by default and define a connection to the default eclipse zos explorer port, port 4035. With my previous Windows I used to have a fixed IP address 192.168.1.110 with a CTCI definition. Unfortunately I didn't keep record of that configuration. So I tried to do the same and now I have this problem. This worked in my other Windows. So I switched to an OSA definition, hoping it would help, but it didn't. My local networking has DHCP, no other configuration. I erased the ARP cache and re-ran TTTest64 Ping Test: |
Thanks. That seems to have worked. I had to reply to it twice though. After the first reply, the same(?) message appeared again a minute or so later. After I replied to the second message, the system finally finished IPLing and is now running normally(?).
Which I know nothing about.
Which I don't know how to do.
Good!!
Sorry for the stupid questions, but as I explained, I know almost nothing about z/OS! Thanks. |
FYI: When I press PF10 on my master console (to issue the |
one question, do you have a TSO session or just the console? |
TSO session too. I should tell you I DO know how to do a little teensy tiny bit. I know how to logon. I know how to use ISPF 6 to issue 'ping' and other commands. I know how to browse dataset/edit dataset members, submit jobs (although I do NOT know JCL!), look at printouts, etc. And I know how to cleanly shut the system down. But nothing beyond that, |
ok, I can guide you to download zos explorer from IBM site, how to configure a connection to Mainframe and how it's invoked if you are willing to do it. I'll have to take a series of screenshots and send them to you. Please confirm if you want. additionally, do you have TXOFF/ON in your zos? |
Yes, certainly!
That's fine. My email address is fish at either softdevlabs.com or infidels.org.
Not yet, no. But I do have a copy of Jürgen's TXONOFF job stream, so as far as I know all I have to do is run it, and then it'll be on my system. Yes? And then all I need to do is enter the command 'txoff' or 'txon' from ISPF 6 to disable or enable unconstrained transactions, yes? |
yes to both... i'm writing the instructions. I'll send them as soon as i can. Thank you very much. |
CORRECTION: I just found my notes regarding TXONOFF:
Is that correct? |
Thanks! Standing by... |
FYI: TXONOFF is now on my system, and entering
So I think we're good to go. Standing by for further instructions via email... |
mail sent, let me know if you got it. |
Nope. Not yet. What email address did you send it to? |
fish@softdevlabs.com |
Hi, I'll attach the file here. Regards. |
Hi I did two things:
Entire log attached. I hope this helps. |
Weird. I never receive it.
Thanks. I downloaded it and tried it yesterday after making the following changes:
As you can see below, I was able to connect and things worked just fine (although I didn't know what the heck I was doing! I'm not familiar with IBM Explorer for z/OS!): I was however able to view files and printouts! Pretty cool! One thing I did notice was that sometimes the connection would fail on the first attempt. But if I tried again it would work the second time around. I'm not sure what that means, if anything. I also do not use DHCP. All of my systems are hard coded with their own uniquely assigned IP addresses, so it wasn't exactly a fair test. My initial goal wasn't to try and exactly reproduce your problem, but rather just to see if I could get it to work, and I succeeded in that endeavor. I also have Checksum Offloading overridden in CTCI-WIN too:
(Refer to "Disable CTCI-WIN's default Checksum Offload behavior" in the "Common Problems" section of the CTCI-WIN Help file) And finally, I do not have IPv6 enabled on my adapter either (whereas you do). Despite all the hype, I've personally never found much use for IPv6. If you have a lot of internet devices maybe you have a need, I don't know. But for me, living without IPv6 is not a problem. I would also note that at no time during my initial attempts (when my connection attempts would fail and RSED would crash due to a TXF restricted instruction failure), at no time did my system crash. Hercules remained up and running just fine. If I get time I will MAYBE try to configure my system to use DHCP and also try running IBM Explorer for z/OS on the same system that Hercules is running on, just to see whether that makes any difference or not. I'm doubting it will, but it might be worth a shot. Personally I think your local Windows network is borked. Your second TTTest64 report is still showing "time=<1ms" for your pings to www.linux.org, which is virtually impossible. Some things to try/check:
That's all for now. I'll continue trying to reproduce your crash but so far I haven't had any luck. |
Carlos (@cfdonatucci), Would it be possible for you to do a favor for me please? I am having trouble determining where in Hercules it is hanging and crashing, because I am unable to reproduce your problem on my own system. I would therefore very much appreciate if you would run a test for me please. You will need to temporarily break things on your system by temporarily re-enabling "Large Send Offload" and "Jumbo Frame" by setting them to "Enable". Once you have set "Large Send Offload" and "Jumbo Frame" to "Enabled", then please start Hercules and perform the same test you did before, but with the following small change:
Then connect from zosexplorer like before. Hercules will hopefully crash just like it did before, but this time some additional information should be displayed in the Hercules log. It is very important to save the Hercules log file! That is the only thing I need! I do not need the crash dump. When Hercules crashes and it asks you if you want to create a crash dump, you can answer "No". I do not need the crash dump. I only need the Hercules log file. Afterwards, you can fix things by re-disabling "Large Send Offload" and "Jumbo Frame" by setting them back to "Disable" again. Thanks. |
Hi, sorry I didn't answer. I was at the office today. Just a question: to enable jumbo frame I can choose 4088 or 9014 bytes. I don't remember the default value... should I set 9014? |
I don't know. It probably doesn't matter. But based solely on the following information, I would probably set it to 9014: |
I set jumbo to 9014 and disabled Large Send Offload, ran txoff and started RSED. To my surprise it didn't fail. I think it should be greater than 1500 |
You were supposed to ENABLE Large Send Offload for my test. Disabling it fixes the problem (prevents the crash). Enabling Large Send Offload (which is the normal default setting) is what causes the crash. I need you to ENABLE it, so the problem/crash occurs. Once you run my test, then you can disable it again so it doesn't crash. |
That's what I did, I enabled it... when I ran the test, I had Large Send Offload enabled and Jumbo set to 9014. The connection succeeded and everything worked. After a while the connection crashed. I see errors in the Hercules console related to QETH adapter, but Hercules didn't crash like before. I'll send the log to you. |
Log attached: |
What does that mean? "The connection crashed"?
Then the test was no good. We will have to try again. I am trying to determine the cause of the original Hercules crashing, which you claimed earlier "could be consistently reproduced":
If Hercules does not crash, then the test is no good. I want to recreate the original conditions that were causing your Hercules to crash, but with my debugging commands active (ptt io qeth 50000, t+400, etc...). Thanks. |
Not helpful. I need you to recreate the original Hercules crash problem you were experiencing. That is what I am trying to resolve. The "ptt io qeth 50000" and other commands will provide additional information that will help me determine where (and thus hopefully why) Hercules was hanging/crashing. Thanks. |
Okay, now, I set all parameters for this to work, meaning the parameters we believed resolved the problem. The connection is established. So I did nothing for a while. At some point the adapter started giving "22 Invalid argument" error, and the connection failed. I think we should go back to the network trace.
22:03:22.827 00000AF0 HHC00901I 0:0401 QETH: Interface tun0, type TUN opened
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using MAC address 02:00:5e:a3:be:84
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using IP address 192.168.1.112
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using subnet mask 255.255.255.0
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using MTU 1500
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using drive MAC address 96:7a:59:e5:d2:bf
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using drive IP address fe80::967a:59ff:fee5:d2bf
22:03:22.924 00000AF0 HHC03805I 0:0401 QETH: tun0: Register guest IP address 192.168.1.112
22:03:22.925 00000AF0 HHC03805I 0:0401 QETH: tun0: Register guest IP address 10.1.10.1
22:03:34.693 00000AF0 HHC00801I Processor CP00: Operation exception code 0001 ilc 4
22:03:34.694 00000AF0 HHC02324I CP00: PSW=0704100080000000 000000001FB179D6 INST=B2AF0000 ????? , .....
23:38:53.653 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:53.653 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:53.663 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:53.663 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:54.127 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:54.127 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:54.923 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:54.923 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:55.922 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument |
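When a log gets long, a quick tally of message IDs can show when the "22 Invalid argument" errors (errno 22 is EINVAL) started and how often they repeat. The following is a rough Python sketch, not a Hercules tool: the field layout (timestamp, thread ID, message ID, text) is inferred only from the log lines quoted above, and the names parse_line and summarize are made up for the illustration.

```python
import re
from collections import Counter

# Layout assumed from the quoted log lines:
#   HH:MM:SS.mmm  THREADID  HHCnnnnnX  message text
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+([0-9A-Fa-f]{8})\s+(HHC\d{5}[A-Z])\s+(.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    timestamp, thread_id, msg_id, text = m.groups()
    return {"timestamp": timestamp, "thread": thread_id, "msg_id": msg_id, "text": text}


def summarize(log_text):
    """Count each message ID and remember the timestamp of its first occurrence."""
    counts = Counter()
    first_seen = {}
    for line in log_text.splitlines():
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that are not in the assumed layout
        counts[entry["msg_id"]] += 1
        first_seen.setdefault(entry["msg_id"], entry["timestamp"])
    return counts, first_seen


# Three lines copied from the log excerpt in this thread.
SAMPLE = """\
23:38:53.653 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:53.653 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:53.663 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
"""

if __name__ == "__main__":
    counts, first_seen = summarize(SAMPLE)
    for msg_id, n in counts.most_common():
        print(msg_id, n, "first seen at", first_seen[msg_id])
```

Run against a full log file, the first-seen timestamp for HHC00911E would make it easy to line the errors up against whatever else happened on the PC at that moment.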
I would prefer that we recreate the Hercules crash that you originally reported. That is what I am most interested in: preventing Hercules from crashing. Since we are having trouble accomplishing that however, we might be able to determine a likely cause using your current testing procedure. So let's do this: Do the same thing you did above (that caused the
That should display all of the trace entries in the ptt trace table. THAT is what I need to see. Save the Hercules logfile and post it here. Thanks. |
Hi. Test done. Hercules didn't crash and didn't fail. This is the summary:
At this point everything worked perfectly. I couldn't reproduce the failure. I waited for a while, and then went to lunch. When I returned the PC was sleeping. When it was awoken, the connection with zosexplorer failed and the invalid argument error message was issued. I then executed So now it seems I cannot reproduce what was previously a consistently reproducible error. Regards. |
<snip>
Thanks. Unfortunately however, the test is no good. You didn't do it correctly. You entered the I need you to wait until the I've also been thinking: if you previously added a Windows Firewall rule, maybe you need to disable it to reproduce the original problem? Please try again, being careful to issue the Thanks. |
I'm going to do the test now. When I enabled the firewall rules the problem was not resolved. The problem was resolved (hold the press, remember?) when I disabled jumbo and Large Send Offload. I disabled the firewall rules. Let's see what happens. |
What I remember is, originally, you did not have any special Firewall rules defined. Your Windows Firewall was enabled (which is the default for Windows). I then recommended that you define a new Firewall rule that would allow any/all packets through (effectively disabling your firewall). I don't recall whether you actually added (defined) such a rule to your Windows Firewall or not, but I do remember that you disabled the Windows Firewall completely, since that was the easiest way to eliminate it as a culprit. It was sometime after that test (which still failed) that Ian Shorter (@mcisho) noticed your Wireshark trace showed packets trying to be sent that were too large for CTCI-WIN (WinPCap) / Hercules to handle. It was then I recommended disabling Large Send Offload, which, yes, did indeed resolve your problem. So what I am trying to do is to go back to the way you originally had your system configured that was causing Hercules to crash. If we can recreate your original crash again, then we can see where things might be going wrong. That is what the If we cannot recreate the original crash, then we should try to recreate the "Error writing to device tun0: 22 Invalid argument" error instead (with Thanks. |
I really don't think the firewall is the issue. I ran tests with the firewall completely disabled and they failed. So far I couldn't make it fail. I'm going to run a last test. I even changed RSED mode from 64 bits to 31 and go back, and nothing. |
I was unable to recreate the original error. To recreate "22 invalid argument", I IPLed Hercules as you said. Nothing happened. So I sent the PC to sleep. After a few minutes I recalled it and the error occurred immediately. After few "22" messages, I executed |
I do. When you reported your original problem, the Firewall was enabled, and because it was, it was not allowing any incoming connections. That was part of the original problem. Your zosexplorer client was trying to connect to RSED, but Windows wasn't allowing it because it wasn't allowing any outside packets through the firewall. It was only when we disabled the firewall that we then made some progress. Think. How was your Windows system configured when you first reported your original problem? That is the same configuration I would like to try to recreate. But forget about that right now. Let's move on.
Yes, that helped. Thank you. I would now like you to perform the same test, but this time with the following Hercules commands active:
This should provide additional (hopefully more detailed) debugging information when the "22" errors occur. What I suspect is happening is, your network adapter (NIC) is being turned off when your PC goes to sleep, but is not being fully powered on after it wakes up. This is a known problem in Windows:
(emphasis mine) And:
(emphasis mine) I also found the following web page as well:
(and, at the very bottom of the following page):
(and):
(emphasis mine) Bottom line: everywhere I look, disabling (unchecking) the "Allow the computer to turn off this device to save power" option fixes the problem. But I would like to see another test like you just did, but with And if that is true, there is nothing Hercules nor CTCI-WIN can do to fix or workaround the problem. It is just another "User Error" (i.e. not having your system configured properly). Thanks. MORE INFO: |
Hi, I ran the test and got your trace: Afterwards, I unchecked that box and restarted Hercules and z/OS. After a while, I put the PC to sleep. When it restarted, I didn't get any 22 errors, but the adapter stopped working. Apparently the suspend introduces some problem. Anyway, I set my PC to never sleep, so I think we are okay unless you want another test. |
Unfortunately, your .zip file is completely empty. It contains .... nothing. No files at all. |
Try this one: |
Wow. I wasn't expecting that. Not very helpful at all, but not your fault of course. It looks like Windows's re-initialization of the network adapter upon resume after suspension ends up killing Hercules's guest's virtual network adapter. I'm guessing that CTCI-WIN's WinPCap sniffing/injection hook on Windows's real physical adapter ends up being invalidated/removed/rendered inaccessible resulting in a permanent loss of connectivity between Hercules (CTCI-WIN/WinPCap) and the adapter. Most unfortunate. So it looks like setting one's PC to never sleep (which you have already done) is the only solution available to us. There doesn't appear to be any way for us (Hercules/CTCI-WIN/WinPCap) to recover from this. Nevertheless, since my ultimate goal for this issue was not to try to overcome that problem, but rather to try to prevent Hercules from crashing as a result of the situation, I have added some additional code to Hercules (commit 96b606a) to try and better detect when a resume from suspend (awake from sleep or re-power-on after hibernation) has occurred in order to prevent our watchdog thread from erroneously detecting what it believes is a malfunctioning CPU. Hopefully, this new code, along with our previous attempt to address the same issue (commit 4b439ae), will, together, prevent Hercules from crashing upon resume from suspend. <me: fingers crossed> As I believe we have done all that we can with this issue, I am going to close it at this time. I would like to thank you, Carlos, for your incredible patience. Take care. p.s. to Mark G. (@dasdman): I apologize for not having implemented your earlier suggestion sooner. At the time, I didn't think it was necessary! |
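To illustrate the general idea of telling a resume from suspend apart from a genuinely stuck CPU, here is a minimal Python sketch. It is not the code from commit 96b606a or 4b439ae, and it makes no claim about how Hercules's watchdog thread is actually written. Every name and threshold (classify_tick, resume_factor, the 20 second stall limit in the example) is a hypothetical chosen for the illustration. The assumption is simply that a watchdog polls at a fixed interval, so a wall-clock gap far larger than that interval most likely means the host was asleep rather than that the CPU thread stopped making progress.

```python
def classify_tick(elapsed_wall, interval, progressed, stalled_for,
                  stall_limit, resume_factor=3.0):
    """Classify one watchdog poll.

    elapsed_wall  -- wall-clock seconds since the previous poll
    interval      -- the nominal polling interval in seconds
    progressed    -- True if the monitored CPU did any work since the last poll
    stalled_for   -- accumulated seconds without progress so far
    stall_limit   -- seconds without progress before declaring a malfunction
    resume_factor -- assumed multiplier: a gap this many times the interval
                     is treated as a resume from suspend

    Returns (verdict, new_stalled_for) where verdict is "ok", "resumed"
    or "stalled".
    """
    if elapsed_wall > interval * resume_factor:
        # The host was probably suspended; the gap is not the guest's fault,
        # so discard the accumulated stall time instead of counting it.
        return "resumed", 0.0
    if progressed:
        return "ok", 0.0
    stalled_for += elapsed_wall
    if stalled_for >= stall_limit:
        return "stalled", stalled_for
    return "ok", stalled_for


if __name__ == "__main__":
    # An hour-long gap between one-second polls looks like a resume, not a hang.
    print(classify_tick(3600.0, 1.0, False, 10.0, 20.0))
```

The design choice worth noting is that the large gap is checked before the progress counter, so a long sleep can never be added to the stall total and trip the malfunction path by itself.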
I'm glad this was useful. It was great for me to contribute with you guys, I really admire what you do. Let me know any way I can do anything in the future, tests or whatever. I'm not at your level but I'll try to do my best. |
NOTE: GitHub Issue #458 (Hercules crash after resume from suspend) is also closely related to this issue.
Hi guys,
I had to reinstall my Windows 10 after a problem. After reinstalling Hercules, I'm now having an odd problem not happening before, regarding my external connection to use other Windows applications on the same PC. I reinstalled all Hercules software.
I also ran TTTest64.exe successfully. My PC uses the default DHCP configuration. I have always used a CTCI connection, but it now fails with no error message. So I tried with OSA, and now I get a dump.
z/OS 2.4 starts ok and the OSA is installed and activated ok.
I can start TSO sessions from any PC in my net, I can use CICSPLEX very well using cicsexpl. When I attempt to use a zosexpl session with RSED, the connection crashed and Hercules as well. Dump available:
I'd appreciate any help!
Regards,
Carlos
OSA Adapter
Hercules log