-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Same instruction sometimes executed TWICE when stepping through code (Herc s+ command) #321
Comments
Since, whenever a z/OS processor stops executing instructions (such as when stepping through code instruction by instruction such as what occurs during a Hercules debugging session), z/OS might mistakenly believe there is a hardware problem (malfunction) because its other CPUs (that are still running!) may end up "spinning" trying to acquire a lock that the code being stepped through may be holding (thereby causing z/OS to mistakenly believe the "spin" is taking much too long) , the following PARMLIB member (and associated command to activate it) can be used to reduce the likelihood of z/OS complaining in such situations: (contents):
The attached PARMLIB member effectively lets a CPU spin for 999 seconds. Use it before starting a Hercules debugging session to step through z/OS code. To use it, first, place the member in the appropriate PARMLIB PDS as member name |
Thank you for this report, Joe! One question (just so we're on the same page): what version of Hercules is this? The BUGTRCE.TXT log you attached is not a complete log so it doesn't contain the Hercules Also, while I don't doubt you (I believe you!), if possible, I would very much like to see evidence of the double load issue you described (i.e. What I am trying to determine here is whether what you are seeing is cosmetic in nature (where, as you speculated in your opening remarks, Hercules is simply not processing the start request correctly and is simply re-displaying the same instruction again), or whether this is indeed a true bug where the same instruction is indeed being executed again twice in a row. Cosmetic bugs are confusing annoyances and should certainly be fixed whether they cause any problems or not, but an improper instruction execution bug such as the one you describe is a very serious, nay, catastrophic bug that needs to be found and fixed ASAP! So if you can manage to catch evidence of the "double In the mean time I am going to try and throw together a standalone test to see if I can reproduce this bug. I rather suspect it will prove to be very difficult to reproduce! But I will try. Thank you for reporting it. |
When I first saw this I also thought the nature was cosmetic. However, when I got translation errors because of the double load of R14 in the expansion of a z/OS macro before a PC instruction (PC 0(14)), I realized it wasn’t cosmetic but actually executing the instruction twice.
Also please note I added additional comments and uploaded a parmlib member EXSPAT00. It is necessary from a z/OS point of view to run SET EXS=00 from the z/OS console. The contents of this member tells z/OS to allow a CPU to spin on not responsive for up to 999 seconds.
The version of Hyperion I have is the one that I downloaded for the JES2 PIC12 issue. The date I have is May 3rd.
I’ll do some more traces (direct the output to a log) after work and I’m sure there is a good chance I’ll get a abend very likely from the PC 0(14). I’ll also try to be more exact with the Hyperion version I’m running. I’ll set a Visual Studio "tracepoint" and direct it to the output window in the two relevant areas.
I’ll also make a PayPal donation at the end of this too because quite honestly without Hyperion I would have little reason to live!
Thank you.
Joe Reichman
Lead Developer Sam Golob Systems Programming
71-28 163 st apt 1A
Fresh meadows NY 11365
|
s+
stepping through code
s+
stepping through code
This time I have a trace where by executing/displaying the same instruction a register is changed. It happened at the expansion of a IBM macro that expands to PC instruction. If you look to the bottom of the log, I also output a Visual Studio trace and an image of the source code where I inserted the trace. The Hercules version is "4.3.9999.0-SDL" (4.3.9999.0). |
If you look at the bottom of the log you will see |
Wow. I see it. This is bad. Very bad. Very, VERY bad. I'll try to get right right on this for you, Joe! One thing though: the Hercules version you reported is a very "generic" version number which is not helpful at all at identifying what specific version of Hercules you are using. This is important since I have made some important changes recently (just the other day as a matter of fact) in the Hercules logic that deals with instruction tracing and I'm sure the version you're using doesn't have those changes. I very much need to be using the SAME version that YOU are using so I can be sure I am chasing the right bug. Thanks. OR..., as an alternative, you might want to try downloading the most recent version of Hercules we have (updated just yesterday!) and build and test using that version instead of the one you're current using. I might have already fixed your bug! But I don't know that! Which Is why I would prefer to use the same version as what you are using so that once I identify the bug, I can know whether or not it's already fixed or not. So please post the output of the Hercules Thanks.
Rather, it is much preferred that you instead do a In the future, if you need/want to build Hercules from source, please try to remember to build it from a CLONE of our git source code repository and not from the generic "Download ZIP" source code. Thanks! |
Fish
Here is it thanks I’ll make it worth your effort.
Ill follow your instruction and down load the latest copy and it test it out again most likely Saturday night as I am going away for the weekend.
thanks
HHC01603I version HHC01413I Hercules version 4.3.9999.0-SDL (4.3.9999.0) HHC01414I (C) Copyright 1999-2020 by Roger Bowler, Jan Jaeger, and others HHC01417I ** The SoftDevLabs version of Hercules ** HHC01415I Build date: Aug 27 2020 at 17:59:52 HHC01417I Built with: Microsoft Visual Studio 2015 (MSVC 190024234 1) HHC01417I Build type: Windows MSVC AMD64 host architecture build HHC01417I Modes: S/370 ESA/390 z/Arch HHC01417I Max CPU Engines: 64 HHC01417I Using shared libraries Vherc =====>
From: Fish-Git <notifications@github.com>
Sent: Friday, August 28, 2020 12:48 AM
To: SDL-Hercules-390/hyperion <hyperion@noreply.github.com>
Cc: YossiReichman <reichmanjoe@gmail.com>; Author <author@noreply.github.com>
Subject: Re: [SDL-Hercules-390/hyperion] Same instruction sometimes executed TWICE when stepping through code (Herc s+ command) (#321)
If you look at the bottom of the log you will see L 14,232(14,0) repeated twice with register 14 changing.
Wow. I see it.
This is bad. Very bad.
Very, VERY bad. :(
I'll try to get right right on this for you, Joe!
One thing though: the Hercules version you reported is a very "generic" version number which is not helpful at all at identifying what specific version of Hercules you are using.(*) Perhaps posting the output of the Hercules version command might be better since it contains the build date and that might allow me to better guess what version you are using.
This is important since I have made some important changes recently (just the other day as a matter of fact) in the Hercules logic that deals with instruction tracing and I'm sure the version you're using doesn't have those changes. I very much need to be using the SAME version that YOU are using so I can be sure I am chasing the right bug. Thanks.
OR..., as an alternative, you might want to try downloading the most recent version of Hercules we have (updated just yesterday!) and build and test using that version instead of the one you're current using. I might have already fixed your bug! But I don't know that! Which Is why I would prefer to use the same version as what you are using so that once I identify the bug, I can know whether or not it's already fixed or not.
So please post the output of the Hercules version command. Hopefully the "build date" that it reports will get me a little bit closer to the version you are actually using.
Thanks.
…_____
(*) As you may recall in a much earlier private email to you back in early May, I explained that building Hercules from the source provided by GitHub's big green <https://github.com/SDL-Hercules-390/hyperion> "Code" button ("Download ZIP") is NOT the preferred way to build Hercules from source.
Rather, it is much preferred that you instead do a git clone of the actual source code repository instead, and build that source code instead. Building from a clone of our source code repository will provide an exact version number, whereas build from the "Download ZIP" source code like you did only constructs a very generic (and very unhelpful) version string.
In the future, if you need/want to build Hercules from source, please try to remember to build it from a CLONE of our git source code repository and not from the generic "Download ZIP" source code.
Thanks!
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub <#321 (comment)> , or unsubscribe <https://github.com/notifications/unsubscribe-auth/AJNZZP4X6IA4I2AKQNSJMPDSC4ZG5ANCNFSM4QMO4RQQ> . <https://github.com/notifications/beacon/AJNZZPZPOFTKZZ6C6GPZFGTSC4ZG5A5CNFSM4QMO4RQ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOFCVWUOQ.gif>
|
PLEASE do not respond to GitHub via email. (Pretty please?) Instead, please respond (reply) via GitHub's web interface. Thanks. |
Fish, I think I downloaded or cloned the latest version. I have the github desktop installed on my desktop and used it to "clone" what I think is the latest version. Afterwards I tried to build it and I got two errors (missing files): |
Trying to upload a picture I took doesn't seem to work, so I'll just type it: I also got a wait state for the two new config statements |
Missing |
I got it built. What I am not sure of is, I thought I had the latest version. Since I am getting the wait state stated above, I am beginning to wonder if this is true. The Version I have is |
Joseph, Are you using With the information from git, you should see a version string like this: (from an SVN checkout):
(from git clone):
Without the information from git (or svn), the build process won't know how to fill in that last part of the version number, so it will be a zero:
Bill |
I tried to do a clone it asked me for a SSH key. Since I didn't have one I clicked "use download using HTTPS". Nothing seemed to happen. It gave me the option of downloading using GITHUB desktop. That worked. However, based on getting a transaction facility wait state code, GitHub desktop must of downloaded a really old version, as the transaction facility wait code seemed to have been resolved for a few months. |
This "dialog" from GitHub: The little icon to the right of the https:// URL simply copies that URL to your clipboard (just to make it easier, so you can paste it into your terminal to do a That said, I just used GitHub Desktop that you mentioned (I had really never noticed that there or heard of it before), and then built that, and it worked. Version number line shows the current commit ID hash. Bill |
One other point, Personally, I've been using Git For Windows: https://gitforwindows.org/. |
Got it. Thanks. Going to build shortly. One more thing: when update this current issue I click write tab and doesn't update the comment. I end up having to hit the green comment button which just start a space for another comment. So my question is, when I finish writing in order to post, do I have to click anything? |
Correct. But it is the abbreviated git hash value (i.e. the "019e7511" digits following the "g" in the version string) that tells us exactly what version you are using, being the first 8 digits of the got commit hash. For example, in your above example version string, the "g019e7511" part tells us you are using the version of SDL Hyperion that was current as of commit 019e751 made on 8/29/2020 (which is indeed the most current version). Without knowing that, with just a "generic" version 4.3.9999.0 string, we cannot know precisely what version of Hercules you are using and thus cannot know whether the version you are using has certain important fixes in it or not. That is why it is highly recommended that you always clone our git repository and then build Hercules from that git clone. |
Correct! That is probably what Joe is missing. I suspect Joe doesn't have git installed.
I myself personally prefer TortoiseGit since it integrates itself directly into the Windows User Interface itself (i.e. the "Shell"), making using git almost trivially easy. But TortoiseGit is just the user-interface side of things of course. You still need to have git installed on your system too of course (and yes, I too use Git For Windows as well, since that's what TortoiseGit recommends). Being a long time Window user I personally prefer GUI programs over command line programs. (I also have TortoiseSVN on my system as well for example!) I'm guessing you're more of a command line person? Joe? You need to install git. I thought later versions of Visual Studio came with git, but maybe not (or maybe it's not a standard git client?). I recommend installing Git For Windows and, optionally, TortoiseGit as well. If you need help configuring it, I (or Bill I'm sure!) can help you with that. |
Now that you mentioned it, I do see that there is a
So I imagine I just have Git For Windows closer to the front end of my search path. It appeared from posts above that Joe was using a Hercules compiled by VS2015. I don't see a git anywhere in its directories. Bill |
Just click inside the "Leave a comment" text box and start typing your response. As soon as you type anything into the "Leave a comment" text box, the green "Comment" button should turn from a very dull green to a bright green. When you are finished typing your reply, just click the green Comment button! Sheesh. |
Bug is still there: Right in the beginning of the log there is a |
The fix is ready: Please note the following: When you run your test you MAY still see what at first looks like duplicate instructions being executed, but if you look closely at the registers, you should see that the instruction is actually not being executed twice. Rather, it is simply being fetched twice and subsequently traced twice too as a result. For those interested, it turns out the first When a CPU is being stepped there can be a considerable delay between the time an instruction is fetched and the time it is executed (due to the system waiting for you to press enter). And when the other CPUs that are not being stepped are allowed to continue running normally (which as I said I personally feel is wrong), this greatly increases the likelihood of the TLB being invalidated before the instruction being stepped can be eventually executed. This is why Joe had such a relatively easy time causing it to occur whereas it went unnoticed by everyone else for over 14 years. This bug could easily also be the cause of other TLB related issues (i.e. unusual/unexpected Hercules behavior) such as Ivan's SIE Joe? I'll give you a couple of days (two? three? more? less?) to try the patch and confirm that the problem is indeed now fixed before eventually committing my changes. I would like to thank you profusely for your cooperation and tremendous patience with me while working together on this bug. I really appreciate it buddy. This was a tough one! But we got through it. Thanks. |
Ill test it out after work 4pm est thanks |
Fish, this only affects execution when tracing/stepping, correct? |
As best as I can tell, Bill, yes, the problem can only(?) occur during instruction stepping. But I have admittedly not performed a complete audit of all Hercules code to be 100% sure. The way Hercules is currently coded (i.e. due to the nature of this particular bug), I believe it may be possible for other potentially bad? things to occur in certain specific situations as well, but I cannot say that with any high degree of certainty. As I said, I have not had time to examine all of Hercules to see where, and whether or not, the problem (or similar a TLB related problem) might also possibly occur. The problem occurs whenever the IPTE (Invalidate Page Table Entry) instruction is executed by some other CPU thereby impacting all CPUs in the complex, including the one being stepped. Now normally, the way our Whenever instruction stepping is enabled for a given CPU however, the CPU "pauses" (stops) after an instruction is fetched and before it is executed. It is during this "pause" that the interrupt lock is obtained and the CPU is put to sleep while waiting for the user to press the enter key to indicate to the system it should go ahead and execute the instruction it just fetched. It is during this time period (which could be seconds, minutes, hours, days, years, etc; we cannot know when the user will decide to press the enter key!) that the window of opportunity opens for the bug to occur. It is during this time period where it is possible for some other CPU to execute the IPTE instruction thereby invalidating the TLB for the CPU being stepped, and thus, as a result, invalidate the instruction just fetched as well. Due to the way Hercules was coded, this invalidation wasn't noticed and the just fetched instruction was executed anyway. After it was executed, we returned to our instruction stepping logic and upon attempting to fetch what should have been the next instruction, noticed the invalidation, which caused us to fetch the next instruction not from where our internal instruction pointer said to fetch it from, but rather from where the instruction address in the PSW said to fetch it from (because our internal instruction pointer was rendered "invalid" due to the But in Hercules, the instruction address field in the PSW is not updated after every single instruction. Doing so would slow Hercules down to a crawl. It is only updated whenever "needed". As a result, the "next" instruction that was fetched came from where the PSW instruction address field said to get it from, which was (you guessed it!) the very same instruction that it had just executed! End result: the same instruction ends up being executed twice in a row! Oops! But as I said, this is unlikely to occur during normal instruction execution since during normal instruction execution the interrupt lock is not obtained between when the instruction is fetched and when is is executed. It is only during instruction stepping that this window of opportunity ever opens. Now it is interesting to note that this problem, as explained, occurs because:
But it's important to note that even if the "Local-TLB-Clearing Facility" was supported, the problem could still occur if the local clearing option wasn't specified in the IPTE instruction. The "Local-TLB-Clearing Facility" simply gives the program (i.e. z/OS in this case) the option of whether or not to only clear its own TLB and not the TLBs of all other CPUs, but just because the option would be available to it, that doesn't mean it must therefore always specify that option! It is free to continue executing the IPTE instruction without specifying the local TLB clearing option, in which case the problem would still occur just like it does today without that option. |
first of if this problem would occur in normal settings there would be abends all over place, Just think of how much code is executed by ipling z/os. Second the user cant wait an indefinite amount of time to press enter because by stopping a CPU to step z/os thinks there is a malfunction. it is only be executing the exspatxx parmlib that gives a user sometime to debug your code. But think about it there is absolutely no way to debug a SRB or PC rtn. The only product that does something like this is XDC and trust me it has it issues with this type code. The code basically establishes estae FRR this is all task related with this feature you can debug multitasking xmem code. One thing I will say from time to time I have abend messages from hyperion truth is if there is a problem with z/os component z/os has recovery in place and issues an SDUMP. So for very display of an exceptional message there most likely should be an SDUMP. I am going to have delay my testing of the patch till 6 pm tonite as something came up thank you again |
Fish just put in diff7 seems like there was a problem. The reason this bug became so apparent now because in this code I have large number of PC instructions typically used by IBM macros. To get the PC number the expansion of the IBM macro involves using Register 14 in the first operand (as a register) and in the second referencing storage. This first ins was L R14,772(,R14) and the second was L 14,208(14,0) this instruction is repeated however RE also changes here is the log |
Then try again until there ISN'T a problem! Sheesh! You need to use the new "joe7.diff" patch I gave you! You hercules log is WORTHLESS because you weren't using my new code! Try again! |
I did the following git apply -R joe6.diff and then git apply joe7.diff this time I'll delete the directory and clone it then apply the patch few minutes |
wait a minute hit enter again continuing but PSW seems to freeze as if none of the other CPU's are running |
There were a few dumps. Didnt notice any dupes, but I have to be honest in that my code never got this far before. Here's the log: I'll continue tomorrow. Thanks. |
I doubt that very much. What you are likely seeing is a simple re-fetching of the same instruction due to TLB invalidation. That is to be expected and is the fix. |
I doubt that very much. What you are likely seeing is a simple re-fetching of the same instruction due to TLB invalidation. That is to be expected and is the fix. |
The "dumps" are to just provide confirmatory evidence that my fix is working as expected (and it appears it is). I'll clean up my code and commit the official fix as soon as I can, and will let you know when it's ready. Thanks for your help. |
Agreed! Which is why I suspect it very likely does NOT occur during normal instruction execution. The problem only seems to occur during instruction stepping when the "window of opportunity" opens due to the "pause" between the fetch of the instruction and the actual execution of that instruction. The fix is to simply check the |
Fixed by commit 2b136d0. Please discard your current build of Hercules (and all patches as well), (re-)clone the current version, rebuild and re-test. Everything should work just fine now. I will await your confirmation that your problem is now fixed before closing this issue. Thanks for everything, Joe. p.s. to Bill: I have confirmed that this problem is impossible to occur under normal circumstances. It only occurs when instruction stepping is active (and now with the fix in place, should not cause any problem even if it does occur). |
Thanks. Will test at 4pm EST. I will delete and re-clone. |
Looks good. The ultimate test is that I had this instruction where operand 1 and 2 referenced the same register and the repeat would make it blow up. Here, as in other instructions, you refetch the registers which fixes the problem. Attaching the log:
|
Excellent! Closing issue! |
I don't know if any one uses the
B+
andS+
commands to trace code, and I think this is old bug, but I have observed that at certain points instruction are repeated (i.e. executed twice). Maybe the CPU that is stopped is not re-awoken?The relevant code is in
process_trace
(cpu.c) where, after displaying the instruction, the CPU is put in Wait state. It is then awoken by pressing the <enter key> by the code in hscpufun.c functionstart_cmd_cpu
.The attached log is of a step/trace I did. I saw no less than 4 instructions that were repeated: The first is a
CLM
instruction at address 1F5941EA. The next is aSTM
at 1F594276, The third is aMVC
at 1F5943D8. And finally there is aSTG
instruction at 1F59447E.Many times, as in the case of a
LR
instruction for example, there is no harm done, but this has happened to me in the expansion of macros to aPC
instruction where the sequence of code isL 14,772(14,0)
followed byL 14,212(14,0)
where, if the first load is repeated, I get a translation error.Note: Issue #322 ("Likely incorrect Hercules breaking/stepping handling/support?") is closely related to this one.
The text was updated successfully, but these errors were encountered: