CI: test with multiple Nim version #429

Merged: 15 commits, Dec 11, 2021

stefantalpalaru
Contributor

and clean up the testing tree a little

Contributor

@kdeme left a comment

LGTM, but isn't it strange that the Windows jobs with Nim 1.2 / 1.4 segfault?

You're probably aware of this but the linux-amd64 failure is a sporadic failure we see everywhere: #148

@stefantalpalaru changed the title from "CI: test with multiple Nim version" to "[WIP] CI: test with multiple Nim version" on Nov 19, 2021
@stefantalpalaru
Contributor Author

Plenty of failures, including this test case crashing at run time, on Windows only, after unrelated changes:

test "New protocol with enr":

or this check failing with Nim-1.6:

All this on top of the usual 1.6 partial GC-safety analysis that stops halfway, asking me to add {.gcsafe.} pragmas so it can continue.

I'm trying to fix all I can, before giving up on some Nim versions and moving them to a daily workflow.

@kdeme
Contributor

kdeme commented Nov 19, 2021

> I'm trying to fix all I can, before giving up on some Nim versions and moving them to a daily workflow.

Yeah, I was referring to the 1.2 crash on Windows. That is strange as I've never seen it before (including Windows). I can have a look at that if you aren't already.

When adding multiple Nim versions to CI, I don't necessarily expect all of the new versions to work, or to be fixed, in the same PR. But if they are, even better.

@stefantalpalaru
Contributor Author

> Yeah, I was referring to the 1.2 crash on Windows. That is strange as I've never seen it before (including Windows). I can have a look at that if you aren't already.

Please do. I don't know what triggers it. Feel free to push to this branch, while testing. I'll refrain from rebasing it.

@stefantalpalaru
Contributor Author

It happens in here:

Got a GDB backtrace:

[Suite] test api usage

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ffe1c6863c2 in ntdll!RtlUnwindEx () from C:\Windows\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ffe1c6863c2 in ntdll!RtlUnwindEx ()
   from C:\Windows\SYSTEM32\ntdll.dll
#1  0x00007ffe1b98337d in msvcrt!_setjmpex ()
   from C:\Windows\System32\msvcrt.dll
#2  0x00007ff7530c6c62 in raiseExceptionAux__na8C8pUZ9cLQWVwk35l5vfw (
    e=0x27e6a532508)
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim:447
#3  0x00007ff7530b1858 in sysFatal__METp0EHKQZlD51D9bYP6PAAassertions (
    message=<optimized out>)
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim:50
#4  raiseAssert__gpGJG5CoQzE64skFd9bPG7A (msg=<optimized out>)
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/assertions.nim:22
#5  0x00007ff7530b1931 in failedAssertImpl__W9cjVocn1tjhW7p7xohJj6A (
    msg=msg@entry=0x7ff753196280 <TM__WaXqGRAfVeSDXlS9cwIpbgg_2>)
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/assertions.nim:29
#6  0x00007ff75310b81c in currentElemEnd__EqfIXRABRD085klc9c3SUgg (self=...)
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/eth/rlp.nim:223
#7  0x00007ff75310ba2b in skipElem__8ZqaueEfG6l7Lij1Qr1R9cA (
    rlp=rlp@entry=0xf4c5dfd690)
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/eth/rlp.nim:243
#8  0x00007ff753117aab in suite__9c0phw9aEBCc9csEkhwDen7HA ()
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:24
#9  0x00007ff753127a53 in eth_test_api_usageInit000 ()
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:217
#10 0x00007ff75317fb4b in PreMainInner ()
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:191
#11 0x00007ff75318163d in PreMain ()
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:239
#12 0x00007ff753181651 in NimMain ()
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:248
#13 0x00007ff75318b6c3 in main (argc=1, args=0x27e6a5b4900, env=0x27e6a5b2400)
    at C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:258

@stefantalpalaru
Contributor Author

A memory debugger called "Dr. Memory" has this to say about it:

$ drmemory tests/rlp/all_tests.exe
~~Dr.M~~ Dr. Memory version 2.3.18352
~~Dr.M~~ Running "tests/rlp/all_tests.exe"
~~Dr.M~~ Using system call file C:\Users\user\AppData\Roaming\Dr. Memory\symcache\syscalls_x64.txt
~~Dr.M~~
~~Dr.M~~ Error #1: UNADDRESSABLE ACCESS beyond top of stack: reading 0x00000088ad7ff3d0-0x00000088ad7ff3d8 8 byte(s)
~~Dr.M~~ # 0 .text
~~Dr.M~~ # 1 _pei386_runtime_relocator
~~Dr.M~~ # 2 __tmainCRTStartup
~~Dr.M~~ # 3 .l_start
~~Dr.M~~ # 4 KERNEL32.dll!BaseThreadInitThunk
~~Dr.M~~ Note: @0:00:02.371 in thread 8172
~~Dr.M~~ Note: 0x00000088ad7ff3d0 refers to 904 byte(s) beyond the top of the stack 0x00000088ad7ff758
~~Dr.M~~ Note: instruction: or     $0x0000000000000000 (%rcx) -> (%rcx)
~~Dr.M~~
~~Dr.M~~ Error #2: UNADDRESSABLE ACCESS beyond top of stack: reading 0x00000088ad7fe648-0x00000088ad7fe650 8 byte(s)
~~Dr.M~~ # 0 .text
~~Dr.M~~ # 1 suite__9c0phw9aEBCc9csEkhwDen7HA               [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:12]
~~Dr.M~~ # 2 eth_test_api_usageInit000                      [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:217]
~~Dr.M~~ # 3 PreMainInner                                   [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:191]
~~Dr.M~~ # 4 PreMain                                        [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:239]
~~Dr.M~~ # 5 NimMain                                        [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:248]
~~Dr.M~~ # 6 main                                           [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:258]
~~Dr.M~~ Note: @0:00:05.419 in thread 8172
~~Dr.M~~ Note: 0x00000088ad7fe648 refers to 4072 byte(s) beyond the top of the stack 0x00000088ad7ff630
~~Dr.M~~ Note: instruction: or     $0x0000000000000000 (%rcx) -> (%rcx)
~~Dr.M~~
~~Dr.M~~ Error #3: UNADDRESSABLE ACCESS beyond top of stack: reading 0x00000088ad7fced0-0x00000088ad7fced8 8 byte(s)
~~Dr.M~~ # 0 .text
~~Dr.M~~ # 1 suite__9c0phw9aEBCc9csEkhwDen7HA               [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:12]
~~Dr.M~~ # 2 eth_test_api_usageInit000                      [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:217]
~~Dr.M~~ # 3 PreMainInner                                   [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:191]
~~Dr.M~~ # 4 PreMain                                        [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:239]
~~Dr.M~~ # 5 NimMain                                        [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:248]
~~Dr.M~~ # 6 main                                           [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-metrics/metrics.nim:258]
~~Dr.M~~ Note: @0:00:05.429 in thread 8172
~~Dr.M~~ Note: 0x00000088ad7fced0 refers to 10080 byte(s) beyond the top of the stack 0x00000088ad7ff630
~~Dr.M~~ Note: instruction: or     $0x0000000000000000 (%rcx) -> (%rcx)

[Suite] test api usage
~~Dr.M~~
~~Dr.M~~ Error #4: UNADDRESSABLE ACCESS beyond top of stack: reading 0x00000088ad7f91f0-0x00000088ad7f91f4 4 byte(s)
~~Dr.M~~ # 0 libwinpthread-1.dll!?                  +0x0      (0x00007ffe0edb425c <libwinpthread-1.dll+0x425c>)
~~Dr.M~~ # 1 ntdll.dll!RtlpCallVectoredHandlers
~~Dr.M~~ # 2 ntdll.dll!RtlDispatchException
~~Dr.M~~ # 3 ntdll.dll!KiUserExceptionDispatcher
~~Dr.M~~ # 4 ntdll.dll!KiUserExceptionDispatcher
~~Dr.M~~ # 5 ntdll.dll!LdrpInitializeDllPath
~~Dr.M~~ # 6 ntdll.dll!RtlDispatchException
~~Dr.M~~ # 7 ntdll.dll!KiUserExceptionDispatcher
~~Dr.M~~ # 8 ntdll.dll!LdrpInitializeDllPath
~~Dr.M~~ # 9 ntdll.dll!RtlDispatchException
~~Dr.M~~ #10 ntdll.dll!KiUserExceptionDispatcher
~~Dr.M~~ #11 ntdll.dll!LdrpInitializeDllPath
~~Dr.M~~ #12 ntdll.dll!RtlUnwindEx
~~Dr.M~~ Note: @0:00:06.007 in thread 8172
~~Dr.M~~ Note: instruction: cmp    (%rdx) $0x406d1388
~~Dr.M~~
~~Dr.M~~ Error #5: UNADDRESSABLE ACCESS beyond top of stack: reading 0x00000088ad7f83b0-0x00000088ad7f83b4 4 byte(s)
~~Dr.M~~ # 0 libwinpthread-1.dll!?                  +0x0      (0x00007ffe0edb425c <libwinpthread-1.dll+0x425c>)
~~Dr.M~~ # 1 ntdll.dll!RtlpCallVectoredHandlers
~~Dr.M~~ # 2 ntdll.dll!RtlDispatchException
~~Dr.M~~ # 3 ntdll.dll!KiUserExceptionDispatcher
~~Dr.M~~ # 4 ntdll.dll!KiUserExceptionDispatcher
~~Dr.M~~ # 5 ntdll.dll!KiUserExceptionDispatcher
~~Dr.M~~ # 6 ntdll.dll!LdrpInitializeDllPath
~~Dr.M~~ # 7 ntdll.dll!RtlDispatchException
~~Dr.M~~ # 8 ntdll.dll!KiUserExceptionDispatcher
~~Dr.M~~ # 9 ntdll.dll!LdrpInitializeDllPath
~~Dr.M~~ #10 ntdll.dll!RtlDispatchException
~~Dr.M~~ #11 ntdll.dll!KiUserExceptionDispatcher
~~Dr.M~~ #12 ntdll.dll!LdrpInitializeDllPath
~~Dr.M~~ #13 ntdll.dll!RtlUnwindEx
~~Dr.M~~ Note: @0:00:06.026 in thread 8172
~~Dr.M~~ Note: instruction: cmp    (%rdx) $0x406d1388
<Application C:\Users\user\Desktop\status\nimbus-eth2\vendor\nim-eth\tests\rlp\all_tests.exe (6168).  Dr. Memory internal crash at PC 0x00000000710cafbb.  Please report this at http://drmemory.org/issues along with the results of running '-debug -dr_debug'.  Program aborted.
0xc0000005 0x00000000 0x00000000710cafbb 0x00000000710cafbb 0x0000000000000001 0x000000007118dcbc
Base: 0x0000000071000000
Registers: eax=0x0000000000000001 ebx=0x000002029727cdc0 ecx=0xffffffffffffffff edx=0x0000000000000001
        esi=0x00000202970aed70 edi=0x0000000000000000 esp=0x00000202970aec40 ebp=0x0000000000000000
        r8 =0x0000000000000008 r9 =0x0000000000000000 r10=0x0000000000000000 r11=0x0000000000000246
        r12=0x0000000000000001 r13=0x0000020296e94060 r14=0x0000000000000000 r15=0x0000000000000000
        eflags=0x0000000000010286
2.3.18352-0-(Apr  4 2020 00:02:59) WinVer=105;Rel=1809;Build=17763;Edition=Professional
-no_dynamic_options -disasm_mask 8 -logdir 'C:\Users\user\AppData\Roaming\Dr. Memory\dynamorio' -client_lib 'C:\Program Files (x86)\Dr. Memory\bin64\release\drmemorylib.dll;0;-logdir `C:\Users\user\AppData\Roaming\Dr. Memory` -symcache_dir `C:\Users\user\AppData\Roaming\Dr. Memory\symcache` -lib_blacklist `C:\Windows*.d?>
~~Dr.M~~ Fetching 1 symbol files...
~~Dr.M~~ [1/1] Fetching symbols for C:\Windows\System32\msvcrt.dll
~~Dr.M~~ Fetched 1 symbol files successfully
~~Dr.M~~ WARNING: application exited with abnormal code 0xffffffff

@stefantalpalaru
Contributor Author

stefantalpalaru commented Nov 21, 2021

A newer Dr. Memory version doesn't crash, but is more terse:

Dr. Memory version 2.5.0 build 0 built on Oct 18 2021 03:01:22
Windows version: WinVer=105;Rel=1809;Build=17763;Edition=Professional
Dr. Memory results for pid 10420: "test_api_usage.exe"
Application cmdline: "tests/rlp/test_api_usage.exe"
Recorded 124 suppression(s) from default C:\Program Files (x86)\Dr. Memory\bin64\suppress-default.txt

Error #1: UNADDRESSABLE ACCESS beyond top of stack: reading 0x0000004b559ff960-0x0000004b559ff968 8 byte(s)
# 0 ___chkstk_ms      
# 1 _pei386_runtime_relocator               [C:\Users\user\Desktop\status\nimbus-eth2\vendor\nim-eth/generated_not_to_break_here:1000018]
# 2 __tmainCRTStartup 
# 3 .l_start          
# 4 KERNEL32.dll!BaseThreadInitThunk
Note: @0:00:00.250 in thread 11804
Note: 0x0000004b559ff960 refers to 904 byte(s) beyond the top of the stack 0x0000004b559ffce8
Note: instruction: or     $0x0000000000000000 (%rcx) -> (%rcx)

Error #2: UNADDRESSABLE ACCESS beyond top of stack: reading 0x0000004b559fec48-0x0000004b559fec50 8 byte(s)
# 0 ___chkstk_ms                                   [C:\Users\user\Desktop\status\nimbus-eth2\vendor\nim-eth/generated_not_to_break_here:1000018]
# 1 suite__9c0phw9aEBCc9csEkhwDen7HA               [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:12]
# 2 NimMainInner                                   [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:217]
# 3 NimMain                                        [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim:229]
# 4 main                                           [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim:236]
Note: @0:00:02.390 in thread 11804
Note: 0x0000004b559fec48 refers to 4072 byte(s) beyond the top of the stack 0x0000004b559ffc30
Note: instruction: or     $0x0000000000000000 (%rcx) -> (%rcx)

Error #3: UNADDRESSABLE ACCESS beyond top of stack: reading 0x0000004b559fd4d0-0x0000004b559fd4d8 8 byte(s)
# 0 ___chkstk_ms                                   [C:\Users\user\Desktop\status\nimbus-eth2\vendor\nim-eth/generated_not_to_break_here:1000018]
# 1 suite__9c0phw9aEBCc9csEkhwDen7HA               [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:12]
# 2 NimMainInner                                   [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth/tests/rlp/test_api_usage.nim:218]
# 3 NimMain                                        [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim:229]
# 4 main                                           [C:/Users/user/Desktop/status/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim:236]
Note: @0:00:02.422 in thread 11804
Note: 0x0000004b559fd4d0 refers to 10080 byte(s) beyond the top of the stack 0x0000004b559ffc30
Note: instruction: or     $0x0000000000000000 (%rcx) -> (%rcx)

@kdeme
Contributor

kdeme commented Nov 21, 2021

> It happens in here:
>
> Got a GDB backtrace:

You got this locally? Looks like it crashes when it should catch the AssertionError with the unittest expect: https://github.com/status-im/nim-eth/blob/master/tests/rlp/test_api_usage.nim#L24

And the CI crash also happens in a test that does an expect on a defect: https://github.com/status-im/nim-eth/blob/master/tests/p2p/test_discoveryv5.nim#L435
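
Both crashes sit in tests of the following shape. This is a minimal sketch, not the nim-eth test itself: `failingCheck` is a hypothetical stand-in for the call under test, and it assumes the default `--panics:off`, where a failed assertion is raised as a catchable `Defect`.

```nim
import unittest

# Hypothetical stand-in for the call under test: it trips an assertion,
# which Nim raises as a Defect when --panics:off (the default).
proc failingCheck() =
  doAssert false, "element ends past the end of the input"

suite "expect on a defect":
  test "assertion failure is catchable":
    # `expect` passes only if the body raises one of the listed types.
    # `Defect` is the common base type, so this should compile whether the
    # assertion type is named AssertionError (1.2) or AssertionDefect (1.4+).
    expect Defect:
      failingCheck()
```

Raising the defect is what sends the runtime through the unwinding path seen in the backtrace (`raiseExceptionAux`), so a test like this may be enough to exercise it on an affected Windows toolchain.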

@stefantalpalaru
Contributor Author

Yes, locally. The segfault doesn't appear with ARC or ORC and it seems to be triggered by attempting to unwind a corrupt call stack while raising that exception.

That first "unaddressable access" reported by Dr. Memory is a false positive. That's something _pei386_runtime_relocator() is supposed to do.

@kdeme
Contributor

kdeme commented Nov 21, 2021

> Yes, locally. The segfault doesn't appear with ARC or ORC and it seems to be triggered by attempting to unwind a corrupt call stack while raising that exception.

OK, I'll see if I can also reproduce it in a VM. I quickly tried building without the QUICK_AND_DIRTY flags, since they are one clear difference in the CI setup compared to before, but it still failed, so I've removed that commit again.

@stefantalpalaru
Contributor Author

> the QUICK_AND_DIRTY flags

Those are just to avoid building tools we don't need and to skip a final compiler bootstrap step that is known to produce a binary identical to the previous one.

@stefantalpalaru
Contributor Author

Commenting out these two lines towards the end of the test suite makes it pass:

chk f
chk -f

Adding a simple echo "1" in their place makes it fail again. Increasing the stack size from 1 to 8 MB has no effect.

@markspanbroek
Member

markspanbroek commented Nov 22, 2021

It appears that you can work around the crash in the discoveryv5 tests on Windows by replacing test with asyncTest here. This probably works because the asyncTest template places the test code inside a proc.

@stefantalpalaru
Contributor Author

You know what's weird? I cannot replicate this new crash locally, in a Windows 10 VM.

@markspanbroek
Member

> This probably works because the asyncTest template places the test code inside a proc.

I did a bit more testing to see why this works. It turns out that just putting the test code in a proc is not enough: I needed to add {.closure.} to the proc to make the test pass.

Adding {.stackTrace:off.} also works, by the way.
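
A minimal sketch of the two wrapper variants mentioned above, outside of any test framework. The proc names and `riskyCall` are hypothetical placeholders, not code from nim-eth, and whether either variant avoids the crash on a given toolchain is not something this sketch demonstrates.

```nim
# Hypothetical placeholder for the code that raises inside the real test.
proc riskyCall() =
  doAssert false, "simulated failure"

# Variant 1: wrapper proc forced to the closure calling convention,
# which adds a hidden environment parameter to the call.
proc viaClosure(): bool {.closure.} =
  try:
    riskyCall()
    result = false
  except Defect:
    result = true

# Variant 2: wrapper proc compiled without stack-trace bookkeeping.
proc viaNoStackTrace(): bool {.stackTrace: off.} =
  try:
    riskyCall()
    result = false
  except Defect:
    result = true

doAssert viaClosure()
doAssert viaNoStackTrace()
```

Both variants change only how the wrapper's frame is set up, not what the body does, which fits the suspicion that the fix comes from a shifted stack layout rather than from the pragma itself.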

@markspanbroek
Member

> You know what's weird? I cannot replicate this new crash locally, in a Windows 10 VM.

Just for reference: in my Windows 10 VM I only see the crash in test_discoveryv5, not in test_api_usage.

@stefantalpalaru
Contributor Author

> I needed to add {.closure.} to the proc

Which puts an env struct pointer on the stack, right? Still feels like a random stack change.

@markspanbroek
Member

> Which puts an env struct pointer on the stack, right? Still feels like a random stack change.

I don't know what {.closure.} does exactly. I was just trying to figure out what it is that the {.async.} macro adds that makes the crash go away. You might very well be right that it's just a random stack change.

@stefantalpalaru
Contributor Author

I managed to replicate the crash locally by adding --skipParentCfg:on, so there's something in nimbus-eth2's "config.nims" that mitigates it:

user@DESKTOP-JJ7DJA5 MINGW64 /c/Users/user/Desktop/status/nimbus-eth2/vendor/nim-eth
$ PATH="../../build:${PATH}" ../../env.sh nim c -r -d:release -d:chronosStrictException -d:chronicles_log_level=ERROR  --verbosity:0 --hints:off --skipUserCfg:on --skipParentCfg:on --warning[ObservableStores]:off tests/p2p/all_tests

@stefantalpalaru
Contributor Author

Believe it or not, --stacktrace:on makes it work. What this does is insert proc calls at the beginning and end of each of our procs, and create a bunch of stack frames.

As to how it works, I don't know yet. Maybe it changes the stack alignment, which MinGW is notoriously bad with, creating problems when DLLs with one stack alignment are used from a binary with another one.

@stefantalpalaru changed the title from "[WIP] CI: test with multiple Nim version" to "CI: test with multiple Nim version" on Dec 11, 2021
@stefantalpalaru
Contributor Author

-d:nimRawSetjmp can now be used on Windows, on all Nim branches, so we're good to go: nim-lang/Nim#19197
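
One way the define could be applied is from a NimScript config file instead of the command line. This is a sketch under assumptions: the file placement is hypothetical, and it only makes sense with a Nim build that includes the change referenced above.

```nim
# config.nims (hypothetical placement; adapt to the project's own build setup)
when defined(windows):
  # Equivalent to passing -d:nimRawSetjmp on the command line.
  switch("define", "nimRawSetjmp")
```

Guarding on `defined(windows)` keeps the default setjmp variant on other platforms, where the crash was not reported in this thread.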

@stefantalpalaru merged commit 2088d75 into master on Dec 11, 2021
@stefantalpalaru deleted the ci_multi_nim branch on December 11, 2021 18:12
@markspanbroek mentioned this pull request on Dec 13, 2021