-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hercules crash after resume from suspend #458
Comments
Analysis of the crash dump reveals: Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64 Copyright (c) Microsoft Corporation. All rights reserved. Loading Dump File [C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_DUMPS\Hercules.dmp] User Mini Dump File: Only registers, stack and portions of memory are available WARNING: Minidump contains unknown stream type 0x15 WARNING: Minidump contains unknown stream type 0x16 Comment: 'HHC01413I Hercules version 4.3.0.10296-SDL-ga4db8213 (4.3.0.10296)' Comment: 'HHC01414I (C) Copyright 1999-2020 by Roger Bowler, Jan Jaeger, and others' Comment: 'HHC01417I ** The SoftDevLabs version of Hercules **' Comment: 'HHC01415I Build date: Nov 27 2020 at 10:51:04' Comment: 'HHC01417I Built with: Microsoft Visual Studio 2008 (MSVC 150030729 1)' Comment: 'HHC01417I Build type: Windows MSVC AMD64 host architecture build' Comment: 'HHC01417I Modes: S/370 ESA/390 z/Arch' Comment: 'HHC01417I Max CPU Engines: 64' Comment: 'HHC01417I Using shared libraries' Comment: 'HHC01417I Using Fish threads Threading Model' Comment: 'HHC01417I Using Error-Checking Mutex Locking Model' Comment: 'HHC01417I With Shared Devices support' Comment: 'HHC01417I With Dynamic loading support' Comment: 'HHC01417I With External GUI support' Comment: 'HHC01417I With Partial TCP keepalive support' Comment: 'HHC01417I With IPV6 support' Comment: 'HHC01417I With HTTP Server support' Comment: 'HHC01417I With sqrtl support' Comment: 'HHC01417I With Signal handling' Comment: 'HHC01417I With Watchdog monitoring' Comment: 'HHC01417I With CCKD BZIP2 support' Comment: 'HHC01417I With HET BZIP2 support' Comment: 'HHC01417I With ZLIB support' Comment: 'HHC01417I With Regular Expressions support' Comment: 'HHC01417I With Object REXX support' Comment: 'HHC01417I Without Regina REXX support' Comment: 'HHC01417I With Automatic Operator support' Comment: 'HHC01417I Without National Language Support' Comment: 'HHC01417I With CCKD64 Support' Comment: 'HHC01417I With Transactional-Execution Facility support' Comment: 'HHC01417I With "Optimized" instructions' Comment: 'HHC01417I Machine dependent assists: cmpxchg1 cmpxchg4 cmpxchg8 cmpxchg16 hatomics=msvcIntrinsics' Comment: 'HHC01417I Running on: DESKTOP-OSERMAE (Windows-6.2.9200 Intel(R) x64) LP=4, Cores=2, CPUs=1' Comment: 'HHC01417I Built with crypto external package version 1.0.0.31-gbf46d03' Comment: 'HHC01417I Built with decNumber external package version 3.68.0.84-g94144c9' Comment: 'HHC01417I Built with SoftFloat external package version 3.5.0.87-gb706677' Comment: 'HHC01417I Built with telnet external package version 1.0.0.48-gb90a708' Comment: 'HHC00018W Hercules is NOT running in elevated mode' Comment: '10:12:32.299449 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2390 cache[27]' Comment: '10:12:32.299596 0:0141 cckd_dasd_end(512): end i/o bufcur 2390 cache[27] waiters 0' Comment: '10:12:32.299841 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6020 cache[89]' Comment: '10:12:32.300034 0:034F cckd_dasd_end(512): end i/o bufcur 6020 cache[89] waiters 0' Comment: '10:12:32.300264 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2390 cache[27]' Comment: '10:12:32.300423 0:0141 cckd_dasd_end(512): end i/o bufcur 2390 cache[27] waiters 0' Comment: '10:12:32.300578 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6020 cache[89]' Comment: '10:12:32.300693 0:034F cckd_read_track(866): read trk 6021 (asynchronous)' Comment: '10:12:32.300702 0:034F cckd_read_trk(1194): 0 rdtrk 6021' Comment: '10:12:32.301348 0:034F cckd_read_trk(1242): 0 rdtrk[150] 6021 cache hit buf 0000000004CA67A0:0000c80015' Comment: '10:12:32.301391 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0' Comment: '10:12:32.301674 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2390 cache[27]' Comment: '10:12:32.301786 0:0141 cckd_read_track(866): read trk 2391 (asynchronous)' Comment: '10:12:32.301792 0:0141 cckd_read_trk(1194): 0 rdtrk 2391' Comment: '10:12:32.302280 0:0141 cckd_read_trk(1242): 0 rdtrk[28] 2391 cache hit buf 0000000004CE6480:00004f0015' Comment: '10:12:32.302685 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0' Comment: '10:12:32.313596 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]' Comment: '10:12:32.313759 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0' Comment: '10:12:32.313940 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2391 cache[28]' Comment: '10:12:32.314142 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0' Comment: '10:12:32.314339 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]' Comment: '10:12:32.314486 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0' Comment: '10:12:32.314649 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2391 cache[28]' Comment: '10:12:32.315347 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0' Comment: '10:12:32.315475 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]' Comment: '10:12:32.315744 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0' Comment: '10:12:32.315972 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2391 cache[28]' Comment: '10:12:32.316161 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0' Comment: '10:12:32.316323 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]' Comment: '10:12:32.316434 0:034F cckd_read_track(866): read trk 6022 (asynchronous)' Comment: '10:12:32.316439 0:034F cckd_read_trk(1194): 0 rdtrk 6022' Comment: '10:12:32.316961 0:034F cckd_read_trk(1242): 0 rdtrk[120] 6022 cache hit buf 0000000004A9B8C0:0000c80016' Comment: '10:12:32.317002 0:034F cckd_dasd_end(512): end i/o bufcur 6022 cache[120] waiters 0' Comment: '10:12:32.322752 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11490 cache[180]' Comment: '10:12:32.323013 0:07F0 cckd_dasd_end(512): end i/o bufcur 11490 cache[180] waiters 0' Comment: '10:12:32.328969 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11490 cache[180]' Comment: '10:12:32.329318 0:07F0 cckd_dasd_end(512): end i/o bufcur 11490 cache[180] waiters 0' Comment: '10:12:32.351690 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11490 cache[180]' Comment: '10:12:32.351845 0:07F0 cckd_read_track(866): read trk 11491 (asynchronous)' Comment: '10:12:32.351850 0:07F0 cckd_read_trk(1194): 0 rdtrk 11491' Comment: '10:12:32.351859 0:07F0 cckd_read_trk(1242): 0 rdtrk[181] 11491 cache hit buf 0000000004B25690:00017f0001' Comment: '10:12:32.351972 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0' Comment: '10:12:32.352374 0:07F0 cckd_read_trk(1194): 1 rdtrk 11492' Comment: '10:12:32.352385 0:07F0 cckd_read_trk(1273): 1 rdtrk[182] 11492 cache miss' Comment: '10:12:32.352391 0:07F0 cckd_read_trk(1302): 1 rdtrk[182] 11492 dropping 0241:30 from cache' Comment: '10:12:32.352396 0:07F0 cckd_read_trk(1325): 1 rdtrk[182] 11492 buf 0000000004A60C20 len 19456' Comment: '10:12:32.352401 0:07F0 cckd_read_trkimg(3333): trk[11492] read_trkimg' Comment: '10:12:32.352817 0:07F0 cckd_read_l2ent(3250): file[1] l2[44,228] trk[11492] read_l2ent 0x13d08' Comment: '10:12:32.353219 0:07F0 cckd_read_l2(3009): file[1] read_l2 44 active 1 44 144' Comment: '10:12:32.353640 0:07F0 cckd_read_l2ent(3268): file[1] l2[44,228] trk[11492] read_l2ent 0x185c57 189 189' Comment: '10:12:32.353647 0:07F0 cckd_read(624): file[1] fd[20] read, off 0x0000000000185c57 len 189' Comment: '10:12:32.354346 0:07F0 cckd_read_trk(1366): 1 rdtrk[182] 11492 complete buf 0000000004A60C20:01017f0002' Comment: '10:12:36.017050 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]' Comment: '10:12:36.017270 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0' Comment: '10:12:36.023011 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]' Comment: '10:12:36.023265 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0' Comment: '10:12:36.024658 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]' Comment: '10:12:36.024907 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0' Comment: '10:12:36.027040 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]' Comment: '10:12:36.027313 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0' Comment: '10:12:43.774786 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]' Comment: '10:12:43.774950 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0' Comment: '10:12:43.775210 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]' Comment: '10:12:43.775390 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0' Comment: '' Symbol search path is: C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_DUMPS;srv*C:\Windows\Symbols*http://msdl.microsoft.com/download/symbols Executable search path is: C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_DUMPS Windows 7 Version 19043 MP (4 procs) Free x64 Product: WinNt, suite: SingleUserTS Machine Name: Debug session time: Sat Dec 25 02:13:03.000 2021 (UTC - 8:00) System Uptime: not available Process Uptime: 2 days 0:49:02.000 ......................................................... This dump file has an exception of interest stored in it. The stored exception information can be accessed via .ecxr. (8440.7bc8): Access violation - code c0000005 (first/second chance not available) ntdll!ZwGetContextThread+0x14: 00007ffa`3c24eba4 c3 ret 0:003> !analyze -v ******************************************************************************* * * * Exception Analysis * * * ******************************************************************************* Failed calling InternetOpenUrl, GLE=12029 FAULTING_IP: hengine!watchdog_thread+486 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\impl.c @ 586] 00000001`005db506 c6450000 mov byte ptr [rbp],0 EXCEPTION_RECORD: ffffffffffffffff -- (.exr 0xffffffffffffffff) ExceptionAddress: 00000001005db506 (hengine!watchdog_thread+0x0000000000000486) ExceptionCode: c0000005 (Access violation) ExceptionFlags: 00000000 NumberParameters: 2 Parameter[0]: 0000000000000001 Parameter[1]: 0000000000000000 Attempt to write to address 0000000000000000 DEFAULT_BUCKET_ID: NULL_POINTER_WRITE PROCESS_NAME: hercules.exe ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s. EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s. EXCEPTION_PARAMETER1: 0000000000000001 EXCEPTION_PARAMETER2: 0000000000000000 WRITE_ADDRESS: 0000000000000000 FOLLOWUP_IP: hengine!watchdog_thread+486 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\impl.c @ 586] 00000001`005db506 c6450000 mov byte ptr [rbp],0 MOD_LIST: NTGLOBALFLAG: 0 APPLICATION_VERIFIER_FLAGS: 0 FAULTING_THREAD: 0000000000007bc8 PRIMARY_PROBLEM_CLASS: NULL_POINTER_WRITE BUGCHECK_STR: APPLICATION_FAULT_NULL_POINTER_WRITE LAST_CONTROL_TRANSFER: from 0000000000c60450 to 00000001005db506 STACK_TEXT: 00000000`02dff500 00000000`00c60450 : 00000000`00dbf130 00000000`00000000 00000000`00000003 00000000`7055334a : hengine!watchdog_thread+0x486 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\impl.c @ 586] 00000000`02dff790 00000000`00c53c04 : 00000000`00dc03f0 00000000`00000000 00000000`00000000 00000000`00000000 : hutil!hthread_func+0x60 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\hthreads.c @ 1051] 00000000`02dff7c0 00000000`70552fdf : 00000000`00dc1950 00000000`00000000 00000000`00dc1950 00000000`00000000 : hutil!FTWin32ThreadFunc+0x44 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\fthreads.c @ 809] 00000000`02dff810 00000000`70553080 : 00000000`705dcad0 00000000`00dc1950 00000000`00000000 00000000`00000000 : msvcr90!_callthreadstartex+0x17 [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c @ 348] 00000000`02dff840 00007ffa`3bbc7034 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr90!_threadstartex+0x84 [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c @ 326] 00000000`02dff870 00007ffa`3c202651 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x14 00000000`02dff8a0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21 STACK_COMMAND: ~3s; .ecxr ; kb FAULTING_SOURCE_CODE: 582: /* Give logger thread time to log messages */ 583: SLEEP(1); 584: 585: /* Create the crash dump for offline analysis */ > 586: CRASH(); 587: } 588: } 589: while (!sysblk.shutdown); 590: 591: return NULL; SYMBOL_STACK_INDEX: 0 SYMBOL_NAME: hengine!watchdog_thread+486 FOLLOWUP_NAME: MachineOwner MODULE_NAME: hengine IMAGE_NAME: hengine.dll DEBUG_FLR_IMAGE_TIMESTAMP: 5fc14ab7 FAILURE_BUCKET_ID: NULL_POINTER_WRITE_c0000005_hengine.dll!watchdog_thread BUCKET_ID: X64_APPLICATION_FAULT_NULL_POINTER_WRITE_hengine!watchdog_thread+486 WATSON_STAGEONE_URL: http://watson.microsoft.com/StageOne/hercules_exe/4_3_0_10296/5fc14ad2/hengine_dll/4_3_0_10296/5fc14ab7/c0000005/001db506.htm?Retriage=1 Followup: MachineOwner --------- Which makes perfect sense! The WatchDog thread checks for malfunctioning CPUs by checking to see if they've executed any instructions in the past 20 seconds, and when the host is later resumed after having been suspended, much longer than 20 seconds has passed! (thereby misleading the watchdog thread into believing the CPU is malfunctioning when in fact is likely wasn't!) New code needs to be added to Hercules (at least for Windows anyway; not sure how to do it for non-Windows) to detect when a resume after a suspend has occurred, and compensate for that fact to prevent erroneous malfunctioning CPUs from being detected. |
Fish:
See https://stackoverflow.com/questions/7253714/get-notified-about-windows-is-going-to-sleep-waking-up-in-c for Windows; on Linux, DBUS service notification will most likely be needed (https://hg.mozilla.org/integration/autoland/rev/aa2f75bd6480 is Mozilla’s implementation).
Regards,
Mark L. Gaubatz
|
Thanks anyway Mark, but I'm already on it. Having been a Windows programmer for 25+ years now, I DO know about these things after all. (Now the Linux side of things I could use some help with!) |
QUESTION FOR MY FELLOW HERCULES DEVELOPERS: Before I begin testing, which fix technique do you prefer? Technique #1 or #2? Technique #1 is host operating system dependent and would obviously require adding additional code to deal with the non-Windows case (only the Windows case has been coded), whereas technique #2 has the advantage of not being specific to any host os (i.e. should work equally well on both Windows and non-Windows systems) and is much simpler and more straightforward as well. I realize it's obvious the second technique is clearly superior to the first as far as resolving the specific issue at hand goes, but I guess what I'm asking is, how important is it for us to be able to detect host system suspend/resume events in general? The first technique has the advantage(?) of directly detecting such events, thereby allowing us the ability react differently to such events elsewhere in Hercules code too (and not just in the watchdog thread), whereas the second technique only fixes the one specific suspend/resume problem that we've discovered thus far (in the watchdog thread). There might well be other places in Hercules code where we need to compensate for suspend/resume that we just haven't discovered yet, and by using technique #1, such code would simply need to check the sysblk flags. The groundwork of actually directly detecting suspend/resume events would already have been laid. Personally, I vote for technique #2. |
HOPEFULLY closes #458. Not having a laptop, I was unable to test. Attempts to reproduce problem in a VMware virtual machine (so I could test whether my fix corrected the problem or not) failed. But I'm nonetheless confident this likely corrects the problem. <me: fingers crossed>
Fixed (hopefully!) by commit 4b439ae. Closing. |
NOTE: GitHub Issue #489 (Hercules 4.4.1 crashes after OSA failure) is also closely related to this issue.
NOTE: GitHub Issue #449 (Periodic "HHC00801I Processor CP00: Fixed-point-divide exception running VSE/ESA) is also closely related to this issue.
Problem reported by Dave Wade:
He provided a crash dump:
The text was updated successfully, but these errors were encountered: