
Make dictionary preservation optional in row encoding #3831

Merged

Conversation

@tustvold tustvold commented Mar 9, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

As discussed on apache/datafusion#4973 (comment), the dictionary-preserving row encoding is particularly expensive to compute and may not be appropriate for all use-cases. This PR therefore adds an option to encode the dictionary values directly, rather than constructing an order-preserving dictionary.

convert_columns 4096 string_dictionary(100, 0)
                        time:   [637.45 µs 637.75 µs 638.08 µs]

convert_columns_prepared 4096 string_dictionary(100, 0)
                        time:   [141.49 µs 141.54 µs 141.60 µs]

convert_columns 4096 string_dictionary_hydrate(100, 0)
                        time:   [89.498 µs 89.546 µs 89.594 µs]

convert_columns_prepared 4096 string_dictionary_hydrate(100, 0)
                        time:   [87.264 µs 87.292 µs 87.323 µs]

This can be significantly faster, especially when seeing values for the first time (the unprepared case above). However, it comes with a potential downside: a significantly less efficient encoding, which in turn uses more memory and may make comparisons slower.
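To make the tradeoff concrete, here is a minimal sketch in plain Rust. This is not the arrow-rs API; all function names are hypothetical. It contrasts the two strategies: building an order-preserving rank mapping once (expensive up front, one compact byte per row) versus "hydrating" each row to its full value bytes (cheap up front, larger per row). Both produce the same sort order.

```rust
/// Order-preserving: rank the distinct dictionary values once (the expensive
/// step), then each row encodes as a single compact rank byte.
fn encode_preserved(values: &[&str], keys: &[usize]) -> Vec<Vec<u8>> {
    let mut ranked: Vec<usize> = (0..values.len()).collect();
    ranked.sort_by_key(|&i| values[i]);
    let mut rank = vec![0u8; values.len()];
    for (r, &i) in ranked.iter().enumerate() {
        rank[i] = r as u8;
    }
    keys.iter().map(|&k| vec![rank[k]]).collect()
}

/// Hydrated: encode each row's value bytes directly; cheap to compute, but
/// every row pays for the full value.
fn encode_hydrated(values: &[&str], keys: &[usize]) -> Vec<Vec<u8>> {
    keys.iter().map(|&k| values[k].as_bytes().to_vec()).collect()
}

/// Indices of rows in ascending order of their encoded bytes.
fn sort_order(enc: &[Vec<u8>]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..enc.len()).collect();
    idx.sort_by(|&a, &b| enc[a].cmp(&enc[b]));
    idx
}

fn main() {
    // A dictionary-encoded column: distinct values plus per-row keys.
    let values = ["banana", "apple", "cherry"];
    let keys = [1usize, 0, 2, 1];
    let preserved = encode_preserved(&values, &keys);
    let hydrated = encode_hydrated(&values, &keys);
    // Both encodings sort the rows identically...
    assert_eq!(sort_order(&preserved), sort_order(&hydrated));
    // ...but the hydrated encoding occupies more space.
    let size = |enc: &[Vec<u8>]| enc.iter().map(|e| e.len()).sum::<usize>();
    assert!(size(&hydrated) > size(&preserved));
    println!("preserved={}B hydrated={}B", size(&preserved), size(&hydrated));
}
```

The real implementation also has to keep the rank mapping consistent as new dictionary values arrive across batches, which is the incremental-update cost the benchmarks above measure.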

What changes are included in this PR?

Are there any user-facing changes?

The github-actions bot added the arrow (Changes to the arrow crate) label on Mar 9, 2023.
/// By default dictionaries are preserved as described on [`RowConverter`]
///
/// However, this process requires maintaining and incrementally updating
/// an order-preserving mapping of dictionary values which is relatively expensive

Suggested change:
- /// an order-preserving mapping of dictionary values which is relatively expensive
+ /// an order-preserving mapping of dictionary values which is relatively expensive
+ /// computationally but minimizes memory used.

test_string_dictionary_preserve(true);
}

fn test_string_dictionary_preserve(preserve: bool) {
the repeated preserve was a little confusing at first -- as I would expect test_string_dictionary_preserve to mean testing preserve = true.

totally minor nitpick but maybe it could be renamed to test_string_dictionary

actual.data().validate_full().unwrap();
- assert_eq!(actual, expected)
+ dictionary_eq(preserve, actual, expected)
}
}
}

All these tests seem to verify that the values are the same. I wonder if it would be valuable to cover space efficiency too in a test (like encode using preserved and non preserved dictionaries and show them taking up different sizes 🤔 )
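A sketch of what such a space-efficiency test could look like, in plain Rust with hypothetical helpers (not the actual arrow-rs test suite): for a low-cardinality column, the preserved encoding pays roughly one rank byte per row, while the non-preserved encoding pays the full value bytes per row.

```rust
/// Preserved: after the order-preserving mapping is built, each row
/// contributes one rank byte (assuming < 256 distinct values).
fn encoded_size_preserved(keys: &[usize]) -> usize {
    keys.len()
}

/// Non-preserved (hydrated): each row contributes its full value bytes.
fn encoded_size_hydrated(values: &[&str], keys: &[usize]) -> usize {
    keys.iter().map(|&k| values[k].len()).sum()
}

fn main() {
    let values = ["some_long_dictionary_value_a", "some_long_dictionary_value_b"];
    // 4096 rows drawn from just two distinct values.
    let keys: Vec<usize> = (0..4096usize).map(|i| i % 2).collect();
    let preserved = encoded_size_preserved(&keys);
    let hydrated = encoded_size_hydrated(&values, &keys);
    // The non-preserved encoding should be markedly larger here.
    assert!(hydrated > preserved);
    println!("preserved={}B hydrated={}B", preserved, hydrated);
}
```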


let cols =
vec![Arc::new(create_string_dict_array::<Int32Type>(4096, 0.5, 100)) as ArrayRef];
- do_bench(c, "4096 string_dictionary(100, 0.5)", cols);
+ do_bench(c, "4096 string_dictionary(100, 0.5)", cols.clone(), true);
+ do_bench(c, "4096 string_dictionary_hydrate(100, 0.5)", cols, false);

rather than using hydrate which I think is a somewhat non standard term, maybe this could be non_preserve to match the parameter name

@tustvold tustvold commented Mar 10, 2023

Heh, I have a very bad reason why I did this... Formatting 😆 I'll work around it differently

alamb commented Mar 10, 2023

Looks good to me -- nice work

@tustvold tustvold merged commit 69c04db into apache:master Mar 10, 2023
ursabot commented Mar 10, 2023

Benchmark runs are scheduled for baseline = 61c4f12 and contender = 69c04db. 69c04db is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
