DataStorm/reliability hang (macos, debug) #3056

Closed
bernardnormier opened this issue Nov 4, 2024 · 6 comments · Fixed by #3294

@bernardnormier
Member

From https://github.com/zeroc-ice/ice/actions/runs/11670664314

The client log shows the writer repeatedly waiting for its peer to reconnect:

-- 11/04/24 19:24:55952 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Data: e1:[k3:barrier]@1:int: created key reader
-- 11/04/24 19:24:55952 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: p/1: announcing elements `[k3]@1' on topic `1:int'
-- 11/04/24 19:25:59964 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: s/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:25:59964 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: p/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:27:04091 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: s/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:27:04092 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: p/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:28:08233 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: s/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:28:08234 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: p/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:29:12237 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: s/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:29:12237 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: p/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:30:16364 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: s/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:30:16364 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: p/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
-- 11/04/24 19:31:20494 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/writer: Session: s/2: can't retry connecting to `0EB25647-5A1A-4982-B853-AF5E62D3C8BC -t -e 1.1`, waiting 64000 (ms) for peer to reconnect
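
The spacing of the retry messages above can be checked against the logged 64000 ms wait. The following is a minimal sketch, assuming the time field is laid out as HH:MM:SS followed directly by three millisecond digits with no separator. That layout is an inference from the log itself, not a documented format, and `toMillis` is a hypothetical helper, not part of Ice.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a time-of-day field such as "19:24:55952"
// into milliseconds since midnight. Assumes the layout HH:MM:SSmmm, where the
// last three digits are milliseconds (an assumption inferred from the log).
long long toMillis(const std::string& field)
{
    if (field.size() != 11 || field[2] != ':' || field[5] != ':')
    {
        throw std::invalid_argument("expected HH:MM:SSmmm, got: " + field);
    }
    const long long hours = std::stoll(field.substr(0, 2));
    const long long minutes = std::stoll(field.substr(3, 2));
    const long long seconds = std::stoll(field.substr(6, 2));
    const long long millis = std::stoll(field.substr(8, 3));
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
}
```

Under that assumption, `toMillis("19:25:59964") - toMillis("19:24:55952")` is 64012, so each retry message follows the previous one by slightly more than the 64000 ms wait reported in the log.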

@pepone
Member

pepone commented Nov 9, 2024

There is a crash report attached to the job from the original report:

0   libsystem_kernel.dylib        	       0x19d0ae600 __pthread_kill + 8
1   libsystem_pthread.dylib       	       0x19d0e6f70 pthread_kill + 288
2   libsystem_c.dylib             	       0x19cff3908 abort + 128
3   reader                        	       0x1047148b8 Test::testFailed(char const*, char const*, unsigned int) + 136 (TestHelper.h:142)
4   reader                        	       0x104713d34 main + 1420 (Reader.cpp:45)
5   dyld                          	       0x19cd64274 start + 2840

This crash corresponds to the reader receiving an unexpected sample in:

cerr << "unexpected sample: " << sample.getValue() << " expected:" << i << endl;
test(false);
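
For illustration only, the kind of check that produces this message can be sketched as below. This is not the code in Reader.cpp: `firstUnexpectedSample` is a hypothetical helper, and the sketch assumes the reader expects consecutive integer values, which is suggested by the `expected:` counter in the message but not confirmed here.

```cpp
#include <cassert>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the Ice test suite): walks a batch of
// received sample values and reports the first one that breaks the expected
// consecutive sequence starting at `firstExpected`.
std::optional<std::string>
firstUnexpectedSample(const std::vector<int>& values, int firstExpected)
{
    int expected = firstExpected;
    for (int value : values)
    {
        if (value != expected)
        {
            // Same wording as the message printed by the test.
            std::ostringstream os;
            os << "unexpected sample: " << value << " expected:" << expected;
            return os.str();
        }
        ++expected;
    }
    return std::nullopt; // every value matched the expected sequence
}
```

With this sketch, a reader that expects 538 next but receives a sample with value 0 would get the string `unexpected sample: 0 expected:538`.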

The crash occurs at 2024-11-04 19:24:55.8588 +0000, which aligns with the last logged server activity:

-- 11/04/24 19:24:55847 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/reader: Session: s/1: attaching elements ack `[e1:pe1:sz170:k2:pvk2]@1' on topic `1:int'
-- 11/04/24 19:24:55847 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/reader: Session: s/1: initialized `e1:[k2:element]@1:int' from `e1@1'
-- 11/04/24 19:24:55847 /Users/runner/work/ice/ice/cpp/test/DataStorm/reliability/build/macosx/shared/reader: Data: e1:[k2:element]@1:int: initialized 170 samples from `1@1'

The "unexpected sample" message doesn't appear anywhere in the output because the test runs in the background, and console output from background tests is displayed only after the test finishes. This test never finishes: it hangs because the writer keeps waiting for the reader to reconnect, but the reader has already crashed.

@pepone pepone self-assigned this Nov 9, 2024
@pepone
Member

pepone commented Nov 12, 2024

I let this run overnight and got the same crash:

unexpected sample: 0 expected:538
failed!
test/DataStorm/reliability/Reader.cpp:52: assertion `false' failed
saved /Users/jose/Documents/3.8/ice/cpp/test/DataStorm/reliability/client-111124-2327.log
saved /Users/jose/Documents/3.8/ice/cpp/test/DataStorm/reliability/server-111124-2327.log
saved /Users/jose/Documents/3.8/ice/cpp/test/DataStorm/reliability/node2-111124-2327.log
saved /Users/jose/Documents/3.8/ice/cpp/test/DataStorm/reliability/node1-111124-2327.log
Traceback (most recent call last):

test-logs.zip

@pepone
Member

pepone commented Dec 17, 2024

There is a crash dump attached to the latest Windows failure, but the crash report is of little use without the symbols from the build.
