Support List and LargeList in Row format (#3159) #3251

Merged on Dec 2, 2022 (6 commits)

@tustvold tustvold commented Dec 1, 2022

Which issue does this PR close?

Part of #3159

Rationale for this change

The longer-term goal is to make the Row Format support enough types to be usable in DataFusion, so that we can use it for a unified GroupBy operation.

What changes are included in this PR?

Add support for encoding/decoding lists from the row format

Are there any user-facing changes?

@github-actions bot added the arrow label (Changes to the arrow crate) on Dec 1, 2022
///
/// ```text
/// ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
/// [1_u8, 2_u8, 3_u8] │01│01│01│02│01│03│00│00│00│02│00│00│00│02│00│00│00│02│00│00│00│03│
Contributor

I get the │01│01│01│02│01│03│ prefix and the │00│00│00│03│ suffix. But where do the other bytes come from?

Contributor Author

The length of each encoded row, followed by the number of elements.

Contributor

So basically 00│00│00│02│00│00│00│02│00│00│00│02│00│00│00│03 represents [element0_len, element1_len, element2_len, element count] --> [2, 2, 2, 3]

Contributor

maybe we can add this explanation to the docstring?
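
The trailer layout discussed in this thread can be sketched in a few lines of Rust. This is only an illustration of the layout as described in the comments above (encoded elements, then one big-endian u32 length per element, then a big-endian u32 element count); `split_list_row` is a hypothetical helper, not a function from the arrow crate, and the reading of each 2-byte element as a validity byte followed by the u8 value is an assumption.

```rust
/// Hypothetical helper (not part of arrow-rs): splits the intermediate list
/// encoding from the docstring example into its encoded elements and lengths.
/// Assumed layout, taken from the review discussion:
/// [encoded elements][u32 BE length per element][u32 BE element count]
fn split_list_row(row: &[u8]) -> Option<(Vec<&[u8]>, Vec<u32>)> {
    // Read a big-endian u32 from exactly four bytes.
    fn be_u32(b: &[u8]) -> u32 {
        u32::from_be_bytes([b[0], b[1], b[2], b[3]])
    }

    // The last four bytes hold the element count.
    let count_start = row.len().checked_sub(4)?;
    let count = be_u32(&row[count_start..]) as usize;

    // Immediately before the count sit `count` lengths of four bytes each.
    let lengths_start = count_start.checked_sub(count.checked_mul(4)?)?;
    let lengths: Vec<u32> = row[lengths_start..count_start]
        .chunks_exact(4)
        .map(be_u32)
        .collect();

    // Everything before the lengths is the concatenated encoded elements.
    let mut elements = Vec::with_capacity(count);
    let mut offset = 0usize;
    for &len in &lengths {
        let end = offset.checked_add(len as usize)?;
        if end > lengths_start {
            return None; // lengths claim more bytes than are available
        }
        elements.push(&row[offset..end]);
        offset = end;
    }
    if offset != lengths_start {
        return None; // leftover bytes that no length accounts for
    }
    Some((elements, lengths))
}

fn main() {
    // Bytes from the docstring example for [1_u8, 2_u8, 3_u8].
    let row: [u8; 22] = [
        0x01, 0x01, 0x01, 0x02, 0x01, 0x03, // three encoded elements
        0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, // element lengths: 2, 2, 2
        0, 0, 0, 3, // element count: 3
    ];
    let (elements, lengths) = split_list_row(&row).expect("well-formed row");
    assert_eq!(lengths, vec![2u32, 2, 2]);
    assert_eq!(elements[0], &[0x01u8, 0x01][..]);
    assert_eq!(elements[1], &[0x01u8, 0x02][..]);
    assert_eq!(elements[2], &[0x01u8, 0x03][..]);
}
```

Reading from the end of the buffer is what makes the layout self-describing: the count tells you how many lengths to read, and the lengths tell you where each element starts.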

Co-authored-by: Marco Neumann <marco@crepererum.net>
@alamb alamb left a comment

I went through the code and tests carefully. It is a little mind bending but I think it is very nicely done 🏆

!= sort_field.options.descending,
};

let field = SortField::new_with_options(f.data_type().clone(), options);
Contributor

I don't understand how descending sort is achieved if the list is always encoded with descending: false 🤔

However, I see it is tested below, so 👍

       let options = SortOptions {
            descending: true,
            nulls_first: false,
        };

Contributor Author

Only the elements are encoded with descending: false; they are then encoded using the variable-length encoding, which may reorder them. Yes, it is mind-bending 😅
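
To make the idea in this reply concrete, here is a minimal sketch of how an outer variable-length encoding can flip the order of values whose elements were encoded ascending. It assumes a deliberately simplified, hypothetical encoding (a 0x01 marker before every payload byte and a 0x00 terminator) rather than the variable-length encoding arrow-rs actually uses; `encode_var` is an illustrative name, not a crate function.

```rust
/// Hypothetical simplified variable-length encoding (NOT the scheme used by
/// arrow-rs): every payload byte is preceded by a 0x01 continuation marker
/// and the value ends with a 0x00 terminator. For descending order, every
/// output byte is inverted so that byte-wise comparison is reversed.
fn encode_var(payload: &[u8], descending: bool) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() * 2 + 1);
    for &b in payload {
        out.push(0x01); // "another byte follows"
        out.push(b); // payload byte, already in ascending-comparable form
    }
    out.push(0x00); // terminator sorts before any continuation marker
    if descending {
        for b in out.iter_mut() {
            *b = !*b; // inverting markers and payload reverses the ordering
        }
    }
    out
}

fn main() {
    // Payloads stand in for lists whose elements were encoded ascending.
    let a = [1u8, 2];
    let b = [1u8, 2, 3];
    let c = [1u8, 3];

    // Ascending: a < b < c, including the prefix case a < b.
    assert!(encode_var(&a, false) < encode_var(&b, false));
    assert!(encode_var(&b, false) < encode_var(&c, false));

    // Descending: the same payloads now compare in the opposite order.
    assert!(encode_var(&a, true) > encode_var(&b, true));
    assert!(encode_var(&b, true) > encode_var(&c, true));
}
```

The terminator is what makes the prefix case work: inverting the payload bytes alone would still sort [1, 2] before [1, 2, 3], whereas inverting the terminator as well places the shorter value after the longer one.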

builder.values().append_value(32);
builder.values().append_value(52);
builder.append(true);
builder.values().append_value(32); // MASKED
Contributor

Masked means this row is NULL, so these values should be ignored.


assert!(rows.row(0) < rows.row(1)); // [32, 52, 32] < [32, 52, 12]
assert!(rows.row(2) > rows.row(1)); // [32, 42] > [32, 52, 12]
assert!(rows.row(3) < rows.row(2)); // null < [32, 42]
Contributor

👍

Thank you for the comments -- they make the tests easy to follow

// ]
let options = SortOptions {
descending: false,
nulls_first: true,
Contributor

I recommend adding a nested-list test for nulls_first: false that verifies:

        assert!(rows.row(0) < rows.row(1));

tustvold and others added 3 commits December 1, 2022 16:21
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
ursabot commented Dec 2, 2022

Benchmark runs are scheduled for baseline = de3828c and contender = 9833288. 9833288 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

alamb commented Dec 2, 2022

I can feel DataFusion grouping getting faster.....

high-five
