Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit List Row Encoding (#5807) #5811

Merged
merged 3 commits into from
May 28, 2024
Merged

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #5807

Rationale for this change

Historically the variable length encoding used a fixed block size of 32 bytes, this made it very expensive for encoding small byte arrays. This meant we couldn't use it directly for encoding list elements, and instead came up with an alternative encoding. Unfortunately this encoding is incorrect #5807

Fortunately following #4818 the space overheads for small values using the variable length encoding is significantly reduced, and so we can just use it directly. The result is not only correct, but also typically smaller.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 28, 2024
@tustvold
Copy link
Contributor Author

I have also tested this against #5792 which integrates lists and structs into the row format and the tests pass 🎉

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the change looks good to me but I am not sure I would catch any subtle bugs as I am not an expert in this code

I suggest you also extend the fuzz test to include ListArrays to make sure we cover all corner cases.

https://github.com/apache/arrow-rs/blob/37d1d3d4e65f8630ced38975cf077316be90a03e/arrow-row/src/lib.rs#L2111-L2110

@@ -2271,4 +2263,16 @@ mod tests {

dictionary_eq(&back[0], &array);
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this test covers the issue by running this test on main and it failed like this:

thread 'tests::test_list_prefix' panicked at arrow-row/src/lib.rs:2284:9:
assertion `left == right` failed
  left: Greater
 right: Less
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:298:5
   4: arrow_row::tests::test_list_prefix
             at ./src/lib.rs:2284:9
   5: arrow_row::tests::test_list_prefix::{{closure}}
             at ./src/lib.rs:2276:26
   6: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

error: test failed, to rerun pass `--lib`
error: 1 target failed:
    `--lib`

@tustvold
Copy link
Contributor Author

I suggest you also extend the fuzz test to include ListArrays to make sure we cover all corner cases.

I have this ready as part of #5792

@tustvold tustvold merged commit 54efb65 into apache:master May 28, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

List Row Encoding Sorts Incorrectly
2 participants