-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][CI] Sporadic test failures in AsofJoinBasicTest #36482
Comments
@rtpsw @icexelloss This seems to point to flaky/incorrect logic in the asof-join implementation. |
No idea, sorry. |
Do we have any systematic way to locate past CI job failures? |
Not easily. You could browse through the Github Actions results, but that's not handy. I would recommend finding a way to reproduce locally on your machine, then |
Yeah, I'd like to repro locally, if I can, but past experience with similar cases suggests this may be hard. I'll see what the best way forward is. |
Also, it seems it's always either |
I'm able to reproduce after about 400 iterations with: $ /build/build-test/debug/arrow-acero-asof-join-node-test --gtest_filter="*AsofJoinBasicTest.TestBasic7Forward/2*" --gtest_repeat=-1 --gtest_break_on_failure
[...]
Repeating all tests (iteration 428) . . .
Note: Google Test filter = *AsofJoinBasicTest.TestBasic7Forward/2*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AsofJoinNodeTest/AsofJoinBasicTest
[ RUN ] AsofJoinNodeTest/AsofJoinBasicTest.TestBasic7Forward/2
/home/antoine/arrow/dev/cpp/src/arrow/testing/gtest_util.cc:505: Failure
Failed
Unequal at absolute position 3
@@ -1, +1 @@
-1970-01-01 00:00:00.000000011
+null
Expected:
[
1970-01-01 00:00:00.000000010,
null,
1970-01-01 00:00:00.000000012
]
Actual:
[
1970-01-01 00:00:00.000000010,
1970-01-01 00:00:00.000000011,
1970-01-01 00:00:00.000000012
]
Google Test trace:
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:630: Right-1 type: uint16
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:628: Right-0 type: timestamp[ns, tz=UTC]
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:626: Left type: large_binary
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:624: Key type: uint32
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:622: Time type: date64[ms]
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:620: Iteration: 30
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:611: Types seed: 1688567126938002996
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:998: AsofJoinBasicTest_TestBasic7_MutateByKey
/home/antoine/arrow/dev/cpp/src/arrow/acero/asof_join_node_test.cc:997: AsofJoinBasicTest_TestBasic7Forward_MutateByKey
Trappe pour point d'arrêt et de trace |
I could work with that. Just to clarify, which iterations do you mean? There are internal test iterations (up to 100) and there are repetitions of gtest. Do you mean the latter? |
The latter (see excerpt above). |
Unfortunately, on my platform (virtualized Ubuntu 22.04 on Intel) I didn't get a single failure after 7000 iterations, when I gave up. @pitrou, what platform did you use when you got the failure? |
Ubuntu 22.04 on AMD64, debug mode. |
Mine was debug mode too. I tried again in parallel with |
I was able to reproduce on a different platform and then fix using the code in the PR I just created. |
@pitrou, could you change the title of this issue to match that of the PR? |
The title of the issue and the PR doesn't have to match. The commit message used will be the title of the PR. I think the title of the issue being |
Ah, in this case, never mind. |
…36499) ### What changes are included in this PR? The key hasher is invalidated before the first invocation of `GetKey` (via `GetLatestKey`) after a new batch arrives. In the pre-PR code, this invalidation happens within `Advance`, which is called from `AdvanceAndMemoize` only after `GetLatestKey` is called. The change adds synchronization between the input-receiving- and processing- threads, because avoiding that would require a more complicated and brittle change, e.g., one that involves detecting in the processing thread when a new batch was added to the queue in order to invalidate the key hasher at that time. ### Are these changes tested? Yes, by existing tests. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** * Closes: #36482 Authored-by: Yaron Gvili <rtpsw@hotmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
…36499) ### What changes are included in this PR? The key hasher is invalidated before the first invocation of `GetKey` (via `GetLatestKey`) after a new batch arrives. In the pre-PR code, this invalidation happens within `Advance`, which is called from `AdvanceAndMemoize` only after `GetLatestKey` is called. The change adds synchronization between the input-receiving- and processing- threads, because avoiding that would require a more complicated and brittle change, e.g., one that involves detecting in the processing thread when a new batch was added to the queue in order to invalidate the key hasher at that time. ### Are these changes tested? Yes, by existing tests. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** * Closes: #36482 Authored-by: Yaron Gvili <rtpsw@hotmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
…Test (apache#36499) ### What changes are included in this PR? The key hasher is invalidated before the first invocation of `GetKey` (via `GetLatestKey`) after a new batch arrives. In the pre-PR code, this invalidation happens within `Advance`, which is called from `AdvanceAndMemoize` only after `GetLatestKey` is called. The change adds synchronization between the input-receiving- and processing- threads, because avoiding that would require a more complicated and brittle change, e.g., one that involves detecting in the processing thread when a new batch was added to the queue in order to invalidate the key hasher at that time. ### Are these changes tested? Yes, by existing tests. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** * Closes: apache#36482 Authored-by: Yaron Gvili <rtpsw@hotmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
…Test (apache#36499) ### What changes are included in this PR? The key hasher is invalidated before the first invocation of `GetKey` (via `GetLatestKey`) after a new batch arrives. In the pre-PR code, this invalidation happens within `Advance`, which is called from `AdvanceAndMemoize` only after `GetLatestKey` is called. The change adds synchronization between the input-receiving- and processing- threads, because avoiding that would require a more complicated and brittle change, e.g., one that involves detecting in the processing thread when a new batch was added to the queue in order to invalidate the key hasher at that time. ### Are these changes tested? Yes, by existing tests. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** * Closes: apache#36482 Authored-by: Yaron Gvili <rtpsw@hotmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
Describe the bug, including details regarding any error messages, version, and platform.
We routinely see this kind of failures on CI:
https://github.com/apache/arrow/actions/runs/5464016379/jobs/9945469106#step:7:3365
Component(s)
C++, Continuous Integration
The text was updated successfully, but these errors were encountered: