-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Panel issue under Linux #289
Comments
|
Nah.. I checked that - it ain't it... Tried counting the number of EINTR occurrences after the select (in a global external variable to prevent it from being optimized out) and it's always 0. |
Good thing is I am able to reproduce the problem on my own test systems, so I can play, break and fiddle with everything now! |
Ok - symptom:
Also the panel thread is completely hogging the CPU before you hit the <ESC> key. And also if the configuration has devices that use TCP sockets, the problem doesn't show. |
Interesting... (and weird!) Let us know what you find out. This is definitely a weird problem. I wonder why pressing the <ESC> key fixes things? Weird. Very weird... |
Following my original reporting of this problem to Ivan, I have done some more tests. There are other weird symptoms: If you use SDL current to try to IPL MVS TK4 with 2 CP defined, the MIPS counter will indicate 180-200MIPS activity with 5-6 SSIO per second, and not not manage to IPL the system at all. Not even the first nucleus message OS/VS2... etc shows up. Nothing. |
Actually (for info) the problem occurs even BEFORE IPLing anything.... |
Hi Moshe, As I'm currently (once again!) trying to come closer to releasing TK4- Update 09, I was wondering about your specific test scenario. Using vanilla TK4- Update 08 with the SDL-binaries update applied, I can't reproduce what you are describing. Of course I don't want to go into building the binaries for all the many platforms TK4- is supporting if there is a specific show stopper in the current code. I'd appreciate any enlightening. Cheers |
Juergen, Sure. I will try to explain as clearly as I possibly can. 1st problem with SDL is when you run Hyperion on a 1 CPU (one thread only). You might say all modern CPUs have at least dual cores etc. But the cheapest instances at AWS, Google Cloud and Azure etc. are all single thread CPUs. When you have such a system and you configure TK4 with two CPs, then the problem I described above occurs reliably on all common Linux distributions. Didn't try Windows. N.B.: It also sometimes occurs (only on Centos 7 so far) when TK4 has only one CP configured and the underlying Linux has one single CPU. The symptoms are very busy CP(s) with a few SIOs, and when you press ESC between the console log view and the register view, the responsiveness improves somewhat. IPL fails in all instances. 2nd problem is that if you release TK4 Update9 with your own Hercules inside it, inevitably a lot of people will start to use that same Hercules for their other operating systems (because they want to avoid the hassle to build their own SDL). It is well known that current SDL will trigger
Hope this helps. Thanks, M |
Thanks a lot, Moshe for me that means I can continue as planned:
By the way: The other mainframe emulator, Emily, poses clear and high prereqs on the underlying host system, and I think that's for good reason. We shouldn't go that far with Hercules but it is in the nature of multi threaded emulators that they become unusable when the underlying host provides too little parallelism.
Otherwise the same is true as I said above: One simply shouldn't try to run operating systems that require a certain amount of parallelism and memory/disk resources on a host platform that cannot provide this. Of course, in this case too, Hercules should be nice enough to communicate this instead of letting the guest OS run into abends. Cheers |
Jürgen Are you sure it's a resources problem and not a bug? If I could do a shared "screen" session with you on Linux, you would see it's not a simple case of asking Linux or Hercules too much. It's a bonafide bug, and Ivan agrees with me. I should be able to IPL TK4 with one CP defined, on a one CPU Linux machine with 3.5GB of RAM, right? I don't think this bug should be dismissed so easily because, for example myself, I run one CPU instances on Google cloud where Spinhawk IPLs TK4 just fine, and SDL doesn't. But of course you are the TK4 maintainer and you decide. Many thanks M |
Hi Moshe no, I'm of course not sure whether it's a resource problem or a bug. If you and Ivan both think it's a bug, then it probably is. However, my statement wasn't about this question at all. It was simply about the fact that I don't see this problem as a showstopper for going on with TK4-, and that a good enough "solution" for me would be Hercules writing something like "low on host ressources, continue on your own risk", regardless, whether it is a bug or not. Cheers |
No, you can't state it like this. It totally depends on the CPU defined to or installed in your Linux machine. You can give it as much RAM as you like, but if the CPU doesn't provide enough parallelism and speed you'll have a problem. |
ok, but if Spinhawk 3.13 on that very same machine IPLs TK4 without any issues and I get excellent performance on all my PL/1 and Cobol compile jobs, while SDL does not even IPL TK4 at all, then I think we are just closing our eyes to an obvious issue other than resources. |
I agree with you in that all the Hyperions (not only the SDL one) are performance wise quite a bit behind Spinhawk, particularly in I/O intensive scenarios, which IPLs typically are. This already has been discussed in lengths and is mostly a consequence of longer instructions paths in the channel subsystem as compared to the original I/O implementation that still exists in Spinhawk. Well, but on the other hand a compliant channel subsystem is obviously a key requirement -- and the one we have in the Hyperions meets this requirement much closer than the old implementation. (just my 2¢) |
Yes, I am aware that Hyperions are slower (due to new thread model). But I think slower means still being able to IPL TK4. The problem with SDL right now is that it does not IPL, not even the nucleus of OS/VS2. Nothing. Not in one minute, not in on hour, not in one day. If it just worked slow, then it should IPL but just take more time. However, failure to get MVS to run, in my opinion is a serious bug which quite a few people will encounter with TK4 Update 9. After all the goodness you and the community have put into Update 9, it's a shame to have a distribution which will not IPL on the low-end cloud systems (which are exactly where you want to run a system like TK4 because the low-end systems are so cheap and come with more than enough resources to run a 24bit OS) I respectfully suggest we should investigate this bug and fix it before inclusion in TK4. I know Ivan is working on it. |
Well, don't get me wrong: I never intended to prevent you and Ivan from investigating and resolving the issue. On the contrary! I only said, I'll ignore the issue in the TK4- context. Given that TK4- Update 09 will run on stock SDL Hyperion, there will be absolutely no problem to replace the binaries once the issue is resolved. |
Out of curiosity, is the behavior any different (better or worse) when you run Hercules not as root? (i.e. as a regular user?) What about when you set |
on Ubuntu 18.04, no different behavior root or sudo thanks |
I just had a quick look into it, just for fun: It probably is a simple race condition. The panel, running second highest priority, wins a race against someone else (the logger?), leading to a clobbered result from select(). A temporary fix can be found here: Note: This only gives the "race partner" a better chance to win. I tested under Linux only, but the other platforms shouldn't be affected (at least not negatively). A definitive solution surely should identify the race partner and solve the conflict in a more friendly way. |
Hi Fish regarding your question on running as root and various TIMERINT settings: Yes, the problem occurs only when running as root. And I tried various settings of TIMERINT and PANRATE, all with no changes in the behaviour. Btw. TK4- runs with TIMERINT 10000 for many years now -- so this is good enough for MVS 3.8j and greatly reduces resource consumption when the OS is idle… Cheers |
… well, my patch from yesterday was a nice try, quick and dirty. But: It doesn't help in daemon mode, and from this can be concluded that the panel is only a casualty, i.e. not part of the problem. So, basically, I made an intervention that works, but it takes place at the wrong location in the code. The working hypothesis still remains trying to find a race condition and fix this then directly. When I'm at it now anyway I can follow on, unless Ivan or Fish volunteered desperately… |
It sure sounds like it. But then why doesn't it also occur on Windows? Why does it seem to only occur on Linux? Any ideas why that might be?
I would like to, but unfortunately my Linux debugging skills are extremely weak to non-existent. |
Maybe it has something to do with the Or maybe we just need to change it to a completely separate BYTE flag instead? Right now it's a bit flag and we've had trouble in the past with compilers setting/testing bit flags? <shrug> I'm just guessing here folks! I have no idea what the frick is going on! |
Thanks for the idea! Given that the panel as such can be ruled out (as even in daemon mode things are getting completely out of control), I think that we are seeing only secondary effects around the panel and its select() logic. When I make the panel work using my dirty trick from Wednesday, the system behaves in panel mode exactly the same as it does in daemon mode: At IPL the processor thread goes into a tight loop, but no instructions get executed by the guest… I'm having a few ideas, but let me just do a few more experiments to get on the right track... |
How so? the kernel only sees and activates one CPU. Plus, with SDL I have the problem, with the patch applied, I don't... |
Oh! So someone found a solution to the problem! Great! |
Well, no, I didn't find a solution. I'd be glad if I found one, but the above patch is a circumvention (workaround) in the best case -- or else it's just voodoo. There is no way out: Someone must find the root cause; otherwise this thing will arbitrarily bite again. |
Thank you, Jürgen! I agree that your "fix" is really just a workaround (circumvention) and that the actual root cause of the problem (no pun intended!) really needs to be (must be) identified by someone, somehow, or else this is just going to come back to bite us on the arse again at some later point. Or... as I suggested, we should consider a complete redesign/rewrite of our inter-thread message handling. OR... (option 2): maybe we just need to closely review our current thread priority handling? Windows's priority handling is quite different (obviously) from Linux's and/or POSIX's, and given that Jürgen's circumvention seems to work by simply tweaking the priorities of the threads involved, we might simply have a flaw in our POSIX thread priorities implementation? <shrug> Just yet another thing to think about/consider. So... given that Jürgen's patch does seem to resolve the problem for now (for the time being), should we maybe go ahead and commit it and close this issue? Or does his patch still need further testing/confirmation by more people? |
Please can we have a few days to investigate this further? I think I know why the select pops - it’s from the logger thread which doesn’t seem to get read by the panel loop. I have a test harness that I’m working on at the moment. |
Andrew Gadsby wrote:
NO!! Absolutely not! Are you crazy?! We must install Jürgen's patch IMMEDIATELY!! The entire world will self destruct if we don't!!! (heh) Of course we can wait a bit longer, Andrew. That's why I asked. Especially given that it sounds like you're on to something. We would all be very happy if you could manage to identify the actual cause of the problem, so PLEASE, take as long as you need! |
Worked for me on Ubuntu 18.04 64-bit single CPU VM. Bill |
Thanks Fish, I'm making headway - I suspect the full fix will be relatively simple. I'm currently unpicking how the logger thread gets data into the panel thread. There is a partial solution by removing the logger_thread and stdio FDs from the panel select - that fixes the panel CPU burn. However, it breaks the daemon mode by dropping the first few messages - this has pointed me at the root cause now and I'm working on that. I suspect the original CPU burn is caused by the logger writing to the logger_syslogfd FD but the panel doesn't clear that until you do a mode switch, hence that crazy CPU loop, with the high thread priority when root I suspect more data gets stuffed in leading to the select looping. Another couple of days, excluding this weekend, should get this nailed. |
OK, I have a solution to the CPU burn problem. The immediate solution is to remove the use of the Lines 1759 to 1770 in 017e437
The CPU burn is caused by the So changing the select on line 1770 to: rc = select (keybfd + 1, &readset, NULL, NULL, &tv); does the trick. The root cause of the issue is that data gets written to different places by logmsg.c depending on various fields in sysblk AND where stdin, stderr, stdout point to, so as threads start and exit, sysblk fields change, altering the route that messages go - if unlucky messages get sent down the logger pipe, triggering that select(), but are not then read within the select loop. This fits with Jürgen's suspicions that it was caused by thread timing. There seems to be a lot of history in the logging code so I'd like to have a go at simplifying this - as a demonstration of how confusing the log flow is try this:
You can see that the startup messages go to STDOUT and the shutdown ones to STDERR! This is a side affect of the code in logger.c starting around 382: Lines 382 to 384 in 017e437
Maybe that's a clue. If the team agrees, I'd like to offer an immediate change to panel.c for the |
Andrew Gadsby wrote:
Excellent work, Andrew! THANK YOU!
I can't speak for the rest of the team of course, but for myself, you certainly have my vote! And as far as your proposed temporary fix is concerned, you certainly have my vote on that as well. If no one else on the team objects, I propose that we implement Andrew's temporary workaround to close this issue while he continues to work on a more permanent solution. Does everyone agree? |
Thanks, Andrew! For me your proposal is fine, this pipe has hit us way too often... |
I've queued up change Stop panel select watching logger pipe #1 for the initial fix. I couldn't link it to this issue, maybe I need to read the git manual. |
(below comment copied from Issue #280 thread) Andrew Gadsby (@agadsby) wrote:
The fix is in your personal fork of SDL Hyperion, yes. I can see that from the helpful GitHub mention in issue #289. But your fix is still not in the official SDL Hyperion repository yet. It's only in your repository (fork), but not in our repository. I've never done it myself so I'm not one to speak, but I believe there's a way to submit/create a "pull request" requesting that we (the SDL Hyperion repository) please do a "pull" from your repository (fork) of the commit that you made that fixed the problem. Then we could Merge your pull request to get your fix into our repository and you'd get credit for it. OR.... The much faster and easier way (easier for both you and me) would be, since the fix is so simple, for me to simply directly commit your fix to our repository myself (but with you as the "Author" of the fix of course, so that you get proper credit for it). So I think that's what I'm going to do. Give me a few minutes and I'll commit your fix and close this issue (i.e. Issue #289!). |
Remove unnecessary reference to the logger pipe file descriptor in the main panel loop. This caused the select() to continually pop if data was present in the pipe because this thread did not drain the data. This manifested itself on uniprocessor systems when running as root as a CPU burn in the panel thread. Closes #289 Panel issue under Linux. [skip travis]
I believe some of the wider issues are caused by the write to the syslog pipe not checking the return length and delaying when the logger thread is behind in logmsg.c I have a fix for this that both checks the return length, looping if necessary, and also sets the logger pipe to none blocking. As commadpt.c has time sensitive things taking place it may explain the wider issues that Jurgen reported. I’ll finalise my proposal and get it submitted shortly. |
I have pulled together my changes to address several issues with the logger process and the message handling. See my internal commit diffs: See Comment added 1st June These changes do two things:
Messages still seem to come out of other routes but I believe these changes form a reasonable start. Please can someone else try these changes to see if I've broken anything before I ask Fish to commit them? |
FWIW, your changes seem quite reasonable to me! Thanks, Andrew! Per your request however, I will wait until others have first confirmed your changes before I commit them. (But thanks again! Much appreciated, Andrew!) |
Revised changes to my suggestions from 3 days ago. I'd introduced a deadlock problem that has been removed by further simplifying the shutdown code. As the logger now shuts down when the write pipe is closed it does not need the external shutdown to call logger_term. Please can the community give these a spin before we commit them. |
Thank you, Andrew! I will once again wait a few days(?) (longer?) until others have had a chance to test your latest patch. |
Andrew (@agadsby), Is 25 days long enough? I would like to get your changes into Hercules. |
Hi Fish, I agree. At least once it's in it'll get a good thrashing. I cannot update remotely as I'm not authorized so how would you like me to proceed? Or are you happy to take the changes proposed above and commit them? Andy |
Change to keep the logger thread active until all messages have been read from the logger pipe. Logger pipe is now non-blocking and logmsg tries to avoid stalling for too long.
Your changes look fine to me! Fixed by commit 34f4353. I gave you full credit of course (you were the author of the commit, I was just the committer). I did make some minor cosmetic changes to your code however, which I hope was okay! (refer to above commit for details) If it's not okay, let me know and I'll do what I can to make it right. Thanks again for your great work, Andy! It's contributions over the years from people just like you that has helped to make Hercules the fine product it is today. Your efforts are greatly appreciated! |
Thank you Fish, I'm doing a full reinstall now to give it a spin on tk4-. |
There seems to be a problem with the panel thread at startup under very specific conditions:
Tracing seems to show it is looping around the select() around line 1770 in panel.c:
hyperion/panel.c
Lines 1756 to 1777 in 2b6b899
Input is very sluggish. every key takes a couple a seconds to show, and there is no real response to the command.
However, once the register panel is displayed (via ESC key) and turned back, everything works normal again.
With gdb, it seems the 'readset' is now different.
Investigating!
The text was updated successfully, but these errors were encountered: