-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auxiliary threads eating up CPU time #589
Comments
Ivan (@ivan-w) Just out of curiosity, please apply the following research patch, --- hyperion-git/timer.c 2023-06-11 11:04:10.508749500 -0700
+++ hyperion-1/timer.c 2023-08-02 10:21:04.960290000 -0700
@@ -322,16 +322,36 @@
} /* end if (intv_secs >= one_sec) */
/* Sleep for another timer update interval... */
-
+ {
+ int usecs, rc;
+ bool txf;
#if defined( _FEATURE_073_TRANSACT_EXEC_FACILITY )
/* Do we need to temporarily reduce the frequency of timer
interrupts? (By waiting slightly longer than normal?)
*/
if (txf_PPA)
- usleep( sysblk.txf_timerint );
+ {
+ txf = true;
+ usecs = sysblk.txf_timerint;
+ }
else
#endif
- usleep ( sysblk.timerint );
+ {
+ txf = false;
+ usecs = sysblk.timerint;
+ }
+
+ rc = usleep( usecs );
+
+ if (rc != 0)
+ {
+ LOGMSG("+++ TIMER_THREAD: usleep( %d ) ==> %d!! (usecs %d from %s)\n", usecs, rc,
+ usecs, txf ? "sysblk.txf_timerint" : "sysblk.timerint" );
+ SLEEP(1);
+ CRASH();
+ }
+
+ }
} /* end while */
@@ -429,7 +449,16 @@
*/
release_lock( &sysblk.rublock );
{
- usleep( sysblk.txf_timerint );
+ int rc;
+ rc = usleep( sysblk.txf_timerint );
+
+ if (rc != 0)
+ {
+ LOGMSG("+++ RUBATO_THREAD: usleep( %d ) ==> %d!!\n",
+ sysblk.txf_timerint, rc );
+ SLEEP(1);
+ CRASH();
+ }
}
obtain_lock( &sysblk.rublock );
} If it crashes, please report the LOGMSG that got logged just before it did. Thanks. |
We shouldn't be allowing a We should probably define |
@ivan-w: Please note the updated research patch: I added a test in the |
The root cause of the bug was in my incorrect #define of MAX_TOD_UPDATE_USECS in featall.h, which was #defined to 1000000 (one million) instead of 999999, thereby causing the TXF rubato_thread to do a usleep for 1000000 (one million) microseconds, which, being invalid, was causing it to not sleep at all and instead return immediately with an error, causing it to loop continuously without sleeping, consuming all available host CPU due to the timer threads being the highest priority thread in Hercules. comm3705.c also had a similar bug wherein it was using "min(1000000,delay)" for one of its usleep(). In addition to fixing the previously mentioned bug, I have also introduced a new USLEEP() macro which calls a new "herc_usleep()" function that will report any usleep() error that might occur. (I purposely did not code it to fix (hide) the problem but rather to just report it so that the caller can thus be fixed to not make invalid usleep calls.) This should close GitHub Issue #589 "Auxiliary threads eating up CPU time" as well as the long running "CPU utilization on Raspberry Pi" thread that took place in the main Hercules group recently.
Fixed by commit 87d04ed. Closing. |
I presume this is an important enough bug to start using this latest commit for it. But unfortunately the Hercules log now shows numerous Cheers, |
Should hopefully be fixed now by commit c962c5d Re-closing. |
And now the threads are using DOUBLE of what they used to (on an intel xeon system).. and even more on my aarch64 system! |
Are you sure you're using commit c962c5d? Are there any If you ARE using commit c962c5d, AND you are not seeing any In fact, I'm strongly considering resigning completely from Hercules development altogether. I'm serious. This Linux shit is getting old REAL fucking fast. You're a fucking Hercules developer. YOU figure it out and fucking fix it. |
After reading up on "usleep" for Ubuntu and some other Linux variants, I think that automatically retrying following a non-zero RC should not be done. Even the error reporting could perhaps be useful only in the case of errno!=EINTR. If I had known this prior to my "HHC00075E" message, some frustrations could have been avoided. I'm sorry about that, with apologies to both Fish and Ivan. Cheers, Peter |
Then what should be done?! How should EINTR be handled? There is no difference between us automatically retrying it, and the caller not sleeping and calling us again. Either way, the usleep will be immediately retried. Over and over and over again. Continuously. Causing host CPU to be unnecessarily consumed. Same with EINVAL too. For EINVAL, we could report it, and then attempt a valid usleep, but even if we do a valid usleep (i.e. even if we fix the caller's error so that EINVAL does not occur), we are still left with the possibility that EINTR could occur! What do we do if/when EINTR occurs?! If retrying it keeps continuing to fail, then we're screwed! We're fucked! We can't call nanosleep instead, because the exact same thing will occur! Should we change it to a |
What do I need to do to test/try this? With the latest commit, on a Debian x86_64 VM, I don't see any error msgs in the Hercules log, and just sitting there not IPLed, it's using very little CPU. Bill |
I have no freaking idea. @ivan-w never attached a Hercules log. Peter attached a log to his comment that reports his version as:
Why it works fine on some systems but not on others I have no freaking idea. All I know is shit like this makes me scream and want to pull out what little gray hair I have left, and after dealing with it for the past 20 years I'm getting damn sick and tired of it. |
Ok ok. Let's be calm about all this ! It is a problem that affects less than 1% of people.. I have a PI 4B and - well.. It sill works just fine ! The entire thing is about forensic (or external observation) of threads that seem to be consuming more than the usual CPU cycles.. It's an oddity.. it's a strange behavior.. I'd like to get to the bottom of this (because of personal pride), but it doesn't need the immediate and urgent attention of anyone as far as I understand (except me maybe)... I'll keep the issue open and assign it to mysef.. Again there is no immediate or urgent action needed (classify this as "odd behavior, no important or immediate action needed").. |
Maybe the problem is how CPU utilization is being measured or similar type bug? |
I propose that the current implementation behavior not be changed, i.e. do nothing. The The man page also mentions that To me it is clear that neither In short: I propose to not alter the current I have therefore tested the following fix, which should get rid of the doubled thread times: --- SDL-hyperion/hscutl.c 2023-08-08 10:39:00.713698533 +0200
+++ hscutl_pjj_fix.c 2023-08-08 10:36:26.469081326 +0200
@@ -609,13 +609,9 @@
{
int rc, save_errno;
- while (1
+ if (1
&& (rc = usleep( usecs )) != 0
- && (save_errno = errno) == EINTR
- )
- continue;
-
- if (rc != 0)
+ && (save_errno = errno) != EINTR)
{
char fnc[128], msg[128]; The above has been working fine on my Ubuntu 20.04 system: I will try it next on my Raspberri PI 4 - 8GB Raspbian OS 64-bit system (i.e. Debian based). Us trying to support so many different OS's is bound to cause difficulties. Even the original bug of an usleep time exceeding 999999 micro seconds apparently will not be problem on all Linux or Unix systems. In retrospect, we could've just fixed that bug, without adding the extra macro to check for EINVAL invalid sleep times, as the vast majority of the usleep() invocations are with hardcoded numbers or constants less than 1000000. Cheers, Peter |
<snip patch>
So basically ignore EINTR altogether and only report any error that is not EINTR? Sounds good to me! Unless someone else objects or comes up with a better solution, I say go ahead and make the change, Peter! Thanks!
No shit!
Yes, I saw that too. But we of course cannot rely on that. In my own quick testing (on Windows) before I fixed Thus the reason for reporting all EINVAL (i.e. non-EINTR) errors, so such bugs could be identified and eventually fixed. But I figured EINTR was only a temporary error that likely would not occur on retry. If however, it continually occurs (i.e. if the cause cannot be "cleared" somehow), then yes, we should probably ignore it I guess (which is what your proposed patch does and is what I agree we should probably do instead).
I would rather have the thread sleep for slightly longer than requested than for it to not sleep at all! I believe it is the "not sleeping at all" that is causing the unusually high CPU time to be consumed. With that in mind, I believe we should strongly consider changing our So... which should we do? Quick fix Me? I vote for the nanosleep approach (with the same check for, and reporting of, EINVAL). How do others feel? (Yes, I've calmed down a bit now. I've just been having a lot of bad luck with Linux lately, which was greatly affecting me mood-wise. I think the recent threads on the main Hercules group these past few weeks got to me. I apologize to you and Ivan for blowing up like that, but I find myself becoming upset much more easily as I get older, so if it happens again, just remind me to take a few deep breaths.) |
We all have our bad hair day - no ,biggy ! |
usleep() should not be retried when rc!=0, and only issue HHC00075E when errno!=EINTR. This approach may be improved upon alltogether by replacing the usleep() call in herc_usleep with nanosleep().
Couldn't agree with you more Ivan! And I also agree with Fish in that the When Cheers, Peter |
Good point. I remember seeing that too.
Another excellent idea. The EINTR could occur more than once for a single call, and we don't want to issue a (warning? informational?) message each time. A message should only be issued if the EINTR occurs more than 'n' number of times in a row (there 'n' is some value such as 3 as you suggested), and even then, it should be issued only once. (If it occurs, say, 10 times in a row on a given call, the (warning? informational?) message should only be issued once.) Since you suggested it, may I presume you will implement that suggested change, Peter, after Ivan has confirmed your temporary stop-gap solution resolves his double CPU issue? I should also mention that, depending on how often EINTR occurs during normal Hercules processing, that such a "restartable" (retry) nanosleep approach will of course more than likely cause a noticeable increase in CPU time! How much of an increase it will cause is unknown, but there will of course be an increase. Hopefully it will only be a slight (insignificant) increase, but we won't know that until we actually give it a try. Maybe it can be coded to skip the restart (retry) if the remaining number of nanoseconds is less than a certain threshold value? What should that threshold value be? 1000? (i.e. one microsecond?) Or some slightly larger value? |
Even whist awaiting feedback on my last stop-gap commit bf1a5d7 I tried a The --- SDL-hyperion/hscutl.c 2023-08-11 15:55:40.673462497 +0200
+++ hscutl.c 2023-08-11 12:54:50.518908396 +0200
@@ -607,24 +607,54 @@
/* Hercules microsecond sleep */
DLL_EXPORT int herc_usleep( useconds_t usecs, const char* file, int line )
{
- int rc, save_errno;
- if (1
- && (rc = usleep( usecs )) != 0
- && (save_errno = errno) != EINTR)
+#define USLEEP_MIN 1
+#define NANOSLEEP_EINTR_RETRY_WARNING_TRESHOLD 3
+
+ int rc, save_errno, eintr_retrys = 0;
+
+ struct timespec usecs_left;
+ usecs_left.tv_nsec += usecs * 1000L;
+ usecs_left.tv_sec += usecs_left.tv_nsec / 1000000000;
+ usecs_left.tv_nsec %= 1000000000;
+
+ while ( 1
+ && ( rc = nanosleep( &usecs_left, &usecs_left ) != 0 )
+ && ( 0
+ || usecs_left.tv_sec != 0
+ || usecs_left.tv_nsec > ( USLEEP_MIN * 1000 ) )
+ && ( eintr_retrys++ )
+ && ( save_errno = errno ) != EINTR );
+
+ if ( 0
+ || rc != 0
+ || eintr_retrys > NANOSLEEP_EINTR_RETRY_WARNING_TRESHOLD )
{
char fnc[128], msg[128];
MSGBUF( fnc, "USLEEP() at %s(%d)",
TRIMLOC( file ), line);
+
+ if ( save_errno != EINTR )
+ {
+ MSGBUF( msg, "rc=%d, errno=%d: %s",
+ rc, save_errno, strerror( save_errno ));
+
+ // "Error in function %s: %s"
+ WRMSG( HHC00075, "E", fnc, msg );
+ errno = save_errno;
+ }
- MSGBUF( msg, "rc=%d, errno=%d: %s",
- rc, save_errno, strerror( save_errno ));
+ if ( eintr_retrys > NANOSLEEP_EINTR_RETRY_WARNING_TRESHOLD )
+ {
+ MSGBUF( msg, "rc=%d, EINTR retrys count=%d",
+ rc, eintr_retrys );
- // "Error in function %s: %s"
- WRMSG( HHC00075, "E", fnc, msg );
- errno = save_errno;
- }
+#define HHC00092 "Warning in function %s: %s"
+ // "Warning in function %s: %s"
+ WRMSG( HHC00092, "W", fnc, msg );
+ }
+ }
return rc;
} But as stated, not good. (The Cheers, Peter P.S. I will be off-line in a few days time for about 5 weeks (only reading s-mails now and then), but before that I'll still open an Issue which is now perfectly reproducible, but one I have not able to been able to fix. So in case you're bored, please check out GitHub Issue #590: "Guest IPL under z/VM causes FRE016 Hard Abend". |
Surprisingly, when I compile the above with clang, I get no warnings whatsoever! When I compile it with gcc however, I correctly get the following warnings:
Trying to compile it on Windows also correctly fails with:
(it still misses the "may be used uninitialized" for Perhaps the above bug(s) are what was causing your problem Peter? |
FYI: The following seems to work well for me on Kubuntu 21.10 under VMware: --- hyperion-1/hscutl.c 2023-08-08 09:25:22.238314000 -0700
+++ hyperion-0/hscutl.c 2023-08-11 15:41:02.086933900 -0700
@@ -607,22 +607,54 @@
/* Hercules microsecond sleep */
DLL_EXPORT int herc_usleep( useconds_t usecs, const char* file, int line )
{
- int rc, save_errno;
+#define USLEEP_MIN 1
+#define NANOSLEEP_EINTR_RETRY_WARNING_TRESHOLD 3
- if (1
- && (rc = usleep( usecs )) != 0
- && (save_errno = errno) != EINTR)
+ int rc, save_errno = 0, eintr_retrys = 0;
+ struct timespec usecs_left;
+
+ usecs_left.tv_nsec = usecs * 1000L;
+ usecs_left.tv_sec = usecs_left.tv_nsec / ONE_BILLION;
+ usecs_left.tv_nsec %= ONE_BILLION;
+
+ while (1
+ && (rc = nanosleep( &usecs_left, &usecs_left )) != 0
+ && (save_errno = errno) == EINTR
+ && (0
+ || usecs_left.tv_sec != 0
+ || usecs_left.tv_nsec > (USLEEP_MIN * 1000)
+ )
+ )
+ ++eintr_retrys; // (keep retrying until done or min reached)
+
+ if (0
+ || rc != 0
+ || eintr_retrys > NANOSLEEP_EINTR_RETRY_WARNING_TRESHOLD
+ )
{
char fnc[128], msg[128];
MSGBUF( fnc, "USLEEP() at %s(%d)",
TRIMLOC( file ), line);
- MSGBUF( msg, "rc=%d, errno=%d: %s",
- rc, save_errno, strerror( save_errno ));
+ if (save_errno != EINTR)
+ {
+ MSGBUF( msg, "rc=%d, errno=%d: %s",
+ rc, save_errno, strerror( save_errno ));
+
+ // "Error in function %s: %s"
+ WRMSG( HHC00075, "E", fnc, msg );
+ }
+
+ if (eintr_retrys > NANOSLEEP_EINTR_RETRY_WARNING_TRESHOLD)
+ {
+ MSGBUF( msg, "rc=%d, EINTR retrys count=%d",
+ rc, eintr_retrys );
+
+ // "Warning in function %s: %s"
+ WRMSG( HHC00092, "W", fnc, msg );
+ }
- // "Error in function %s: %s"
- WRMSG( HHC00075, "E", fnc, msg );
errno = save_errno;
}
return rc;
@@ -2317,12 +2349,12 @@
DLL_EXPORT const char* perc2name( BYTE perc, char* buf, size_t bufsiz )
{ /*
Hex Bit PER Event
- 80 0 Successful-branching
+ 80 0 Successful-branching
40 1 Instruction-fetching
20 2 Storage-alteration
10 3 Storage-key-alteration
- 08 4 Store-using-real-address
- 04 5 Zero-address-detection
+ 08 4 Store-using-real-address
+ 04 5 Zero-address-detection
02 6 Transaction-end
01 7 Instruction-fetching nullification (PER-3) */
@@ -3325,7 +3357,7 @@
}
//---------------------------------------------------------------------
-// I/O Interrupt
+// I/O Interrupt
// (handles both HHC00805 and HHC00806)
//---------------------------------------------------------------------
DLL_EXPORT bool tf_0806( REGS* regs, U32 ioid, U32 ioparm, U32 iointid )
@@ -4114,7 +4146,7 @@
if (max_recsize < sizeof( TF01321 ))
max_recsize = sizeof( TF01321 );
-
+
if (max_recsize < sizeof( TF01329 ))
max_recsize = sizeof( TF01329 );
--- hyperion-1/msgenu.h 2023-07-09 17:33:35.000000000 -0700
+++ hyperion-0/msgenu.h 2023-08-11 14:56:41.453140900 -0700
@@ -287,7 +287,8 @@
#define HHC00089 "The are no HAO rules defined"
#define HHC00090 "HAO thread waiting for logger facility to become active"
#define HHC00091 "Logger facility now active; HAO thread proceeding"
-//efine HHC00092 - HHC00099 (available)
+#define HHC00092 "Warning in function %s: %s"
+//efine HHC00093 - HHC00099 (available)
// reserve 100-129 thread related
#define HHC00100 "Thread id "TIDPAT", prio %d, name '%s' started" (The non-related changes are simply removal of trailing blanks) |
Yes, thanks for catching my errors. This works for me fine as well (under 1st level (no VMware) Lubunto 20.04). I got a few Cheers, Peter |
Sounds good to me! (And I would suggest moving the My goal today is to try and reproduce your issue #590 problem. |
FYI:I have removed the "Researching..." label and have added the "BUG" label as I believe we may have more than likely identified the root cause of the excessive host CPU consumption by the timer thread. Since I am not 100% certain of this however (we will have to monitor our "fix" to see whether it does indeed fix the problem or not), I am leaving the "Unknown" label for now. Once we have enough evidence (feedback) to reliably conclude the problem has indeed been fixed, I will then remove the "Uknown" label at that time. |
On my ARM system (4 cores) I have 1 instance of hercules running (z/arch)
I see 2 threads running 100% of a core :
I did issue a timerint 100000 - but still the same
I understand the rubato_thread is transaction facility related.. but is it normal they (timer and.. rubato ?) both are is taking a full core ?
--Ivan
The text was updated successfully, but these errors were encountered: