
Make dictionary preservation optional in row encoding #3831

Merged

Conversation

@tustvold tustvold commented Mar 9, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

As discussed on apache/datafusion#4973 (comment), the dictionary-preserving row encoding is particularly expensive to compute and may not be appropriate for all use-cases. This PR therefore adds an option to encode the dictionary values directly, rather than constructing an order-preserving dictionary.

convert_columns 4096 string_dictionary(100, 0)
                        time:   [637.45 µs 637.75 µs 638.08 µs]

convert_columns_prepared 4096 string_dictionary(100, 0)
                        time:   [141.49 µs 141.54 µs 141.60 µs]

convert_columns 4096 string_dictionary_hydrate(100, 0)
                        time:   [89.498 µs 89.546 µs 89.594 µs]

convert_columns_prepared 4096 string_dictionary_hydrate(100, 0)
                        time:   [87.264 µs 87.292 µs 87.323 µs]

This can be significantly faster, especially when seeing values for the first time (the unprepared case above). However, it comes with a potential downside: a significantly less efficient encoding, which in turn uses more memory and may make comparisons slower.
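To make the tradeoff concrete, here is a minimal sketch in plain Rust. This is not the arrow-rs API; all function names are hypothetical. It contrasts the two strategies: building an order-preserving rank mapping once (expensive up front, one compact byte per row) versus "hydrating" each row to its full value bytes (cheap up front, larger per row). Both produce the same sort order.

```rust
/// Order-preserving: rank the distinct dictionary values once (the expensive
/// step), then each row encodes as a single compact rank byte.
fn encode_preserved(values: &[&str], keys: &[usize]) -> Vec<Vec<u8>> {
    let mut ranked: Vec<usize> = (0..values.len()).collect();
    ranked.sort_by_key(|&i| values[i]);
    let mut rank = vec![0u8; values.len()];
    for (r, &i) in ranked.iter().enumerate() {
        rank[i] = r as u8;
    }
    keys.iter().map(|&k| vec![rank[k]]).collect()
}

/// Hydrated: encode each row's value bytes directly; cheap to compute, but
/// every row pays for the full value.
fn encode_hydrated(values: &[&str], keys: &[usize]) -> Vec<Vec<u8>> {
    keys.iter().map(|&k| values[k].as_bytes().to_vec()).collect()
}

/// Indices of rows in ascending order of their encoded bytes.
fn sort_order(enc: &[Vec<u8>]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..enc.len()).collect();
    idx.sort_by(|&a, &b| enc[a].cmp(&enc[b]));
    idx
}

fn main() {
    // A dictionary-encoded column: distinct values plus per-row keys.
    let values = ["banana", "apple", "cherry"];
    let keys = [1usize, 0, 2, 1];
    let preserved = encode_preserved(&values, &keys);
    let hydrated = encode_hydrated(&values, &keys);
    // Both encodings sort the rows identically...
    assert_eq!(sort_order(&preserved), sort_order(&hydrated));
    // ...but the hydrated encoding occupies more space.
    let size = |enc: &[Vec<u8>]| enc.iter().map(|e| e.len()).sum::<usize>();
    assert!(size(&hydrated) > size(&preserved));
    println!("preserved={}B hydrated={}B", size(&preserved), size(&hydrated));
}
```

The real implementation also has to keep the rank mapping consistent as new dictionary values arrive across batches, which is the incremental-update cost the benchmarks above measure.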

What changes are included in this PR?

Are there any user-facing changes?

The github-actions bot added the arrow (Changes to the arrow crate) label on Mar 9, 2023.
/// By default dictionaries are preserved as described on [`RowConverter`]
///
/// However, this process requires maintaining and incrementally updating
/// an order-preserving mapping of dictionary values which is relatively expensive

Suggested change:
- /// an order-preserving mapping of dictionary values which is relatively expensive
+ /// an order-preserving mapping of dictionary values which is relatively expensive
+ /// computationally but minimizes memory used.

test_string_dictionary_preserve(true);
}

fn test_string_dictionary_preserve(preserve: bool) {
the repeated preserve was a little confusing at first -- as I would expect test_string_dictionary_preserve to mean testing preserve = true.

totally minor nitpick but maybe it could be renamed to test_string_dictionary

actual.data().validate_full().unwrap();
- assert_eq!(actual, expected)
+ dictionary_eq(preserve, actual, expected)
}
}
}

All these tests seem to verify that the values are the same. I wonder if it would be valuable to cover space efficiency too in a test (like encode using preserved and non preserved dictionaries and show them taking up different sizes 🤔 )
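A sketch of what such a space-efficiency test could look like, in plain Rust with hypothetical helpers (not the actual arrow-rs test suite): for a low-cardinality column, the preserved encoding pays roughly one rank byte per row, while the non-preserved encoding pays the full value bytes per row.

```rust
/// Preserved: after the order-preserving mapping is built, each row
/// contributes one rank byte (assuming < 256 distinct values).
fn encoded_size_preserved(keys: &[usize]) -> usize {
    keys.len()
}

/// Non-preserved (hydrated): each row contributes its full value bytes.
fn encoded_size_hydrated(values: &[&str], keys: &[usize]) -> usize {
    keys.iter().map(|&k| values[k].len()).sum()
}

fn main() {
    let values = ["some_long_dictionary_value_a", "some_long_dictionary_value_b"];
    // 4096 rows drawn from just two distinct values.
    let keys: Vec<usize> = (0..4096usize).map(|i| i % 2).collect();
    let preserved = encoded_size_preserved(&keys);
    let hydrated = encoded_size_hydrated(&values, &keys);
    // The non-preserved encoding should be markedly larger here.
    assert!(hydrated > preserved);
    println!("preserved={}B hydrated={}B", preserved, hydrated);
}
```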


let cols =
vec![Arc::new(create_string_dict_array::<Int32Type>(4096, 0.5, 100)) as ArrayRef];
- do_bench(c, "4096 string_dictionary(100, 0.5)", cols);
+ do_bench(c, "4096 string_dictionary(100, 0.5)", cols.clone(), true);
+ do_bench(c, "4096 string_dictionary_hydrate(100, 0.5)", cols, false);

rather than using hydrate which I think is a somewhat non standard term, maybe this could be non_preserve to match the parameter name

@tustvold tustvold commented Mar 10, 2023

Heh, I have a very bad reason why I did this... Formatting 😆 I'll work around it differently

alamb commented Mar 10, 2023

Looks good to me -- nice work

@tustvold tustvold merged commit 69c04db into apache:master Mar 10, 2023
ursabot commented Mar 10, 2023

Benchmark runs are scheduled for baseline = 61c4f12 and contender = 69c04db. 69c04db is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
