
ARROW-12426: [Rust] Fix concatenation of arrow dictionaries #10073

Closed · wants to merge 3 commits

Conversation

@tustvold (Contributor) commented Apr 16, 2021

Currently calling concat on two dictionary encoded arrays blindly concatenates the keys arrays, and keeps only the values of the first. This can lead to invalid data, including keys referring to value indexes beyond the bounds of the values array.

This PR alters MutableArrayData to concatenate the child data of any dictionary arrays passed to it, and offset the source keys to correctly map to the relevant "slice" of the resulting child data. This does mean that the resulting dictionary array may contain duplicates, and will not be sorted, but I'm not sure if this is a problem?
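
To illustrate the key offsetting described above, here is a simplified sketch over plain key slices (not the actual MutableArrayData code; the function name and the plain i32 representation are illustrative only):

    // Sketch: when dictionary `b` is appended after dictionary `a`, `b`'s values are
    // appended after `a`'s, so every key taken from `b` must be shifted by the number
    // of values that `a` contributed.
    fn offset_concatenated_keys(a_keys: &[i32], a_num_values: i32, b_keys: &[i32]) -> Vec<i32> {
        // keys from `a` already point into the first slice of the combined values;
        // keys from `b` must be shifted past the values contributed by `a`
        a_keys
            .iter()
            .copied()
            .chain(b_keys.iter().map(|k| k + a_num_values))
            .collect()
    }

(Null slots are ignored in this sketch; the real change also has to carry the key validity bitmap through unchanged.)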

Signed-off-by: Raphael Taylor-Davies r.taylordavies@googlemail.com

})
.collect();

extend_values.expect("MutableArrayData::new is infallible")
@tustvold (Contributor, Author):

I'm not sure how best to handle this; the other option would be to handle dictionary concatenation in the concat kernel, which is fallible. I'm not sure which is better, as I'm not all that familiar with the codebase yet.

@alamb (Contributor) left a comment

This looks good to me. Thank you @tustvold

fyi @nevi-me / @jorgecarleitao in case you have comments on the Arrow implementation where I am not as familiar with the code.

One thing I did notice in this formulation was that it results in duplicated entries in the Dictionary (which is probably unavoidable, but I figured I would point it out). You can see the duplicates by adding this code to the unit test:

        let concat = concat(&[&input_1 as _, &input_2 as _]).unwrap();

        let expected: DictionaryArray<Int32Type> = vec![
            "hello", "A", "B", "hello", "hello", "C",
            "hello", "E", "E", "hello", "F", "E",
        ]
        .into_iter()
        .collect();

        let expected = Arc::new(expected) as ArrayRef;
        assert_eq!(&concat, &expected);

Which actually fails (showing the contents of the dictionary are not the same):

---- compute::kernels::concat::tests::test_string_dictionary_array stdout ----
thread 'compute::kernels::concat::tests::test_string_dictionary_array' panicked at 'assertion failed: `(left == right)`
  left: `DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
  2,
  0,
  0,
  3,
  4,
  5,
  5,
  4,
  6,
  5,
] values: StringArray
[
  "hello",
  "A",
  "B",
  "C",
  "hello",
  "E",
  "F",
]}
`,
 right: `DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
  2,
  0,
  0,
  3,
  0,
  4,
  4,
  0,
  5,
  4,
] values: StringArray
[
  "hello",
  "A",
  "B",
  "C",
  "E",
  "F",
]}

Comment on lines 394 to 402
    (0..dictionary.len())
        .map(move |i| match dictionary.keys().is_valid(i) {
            true => {
                let key = dictionary.keys().value(i);
                Some(values.value(key as _).to_string())
            }
            false => None,
        })
        .collect()
@alamb (Contributor):

I think you can probably iterate over the keys directly if you wanted:

Suggested change

Replace:

    (0..dictionary.len())
        .map(move |i| match dictionary.keys().is_valid(i) {
            true => {
                let key = dictionary.keys().value(i);
                Some(values.value(key as _).to_string())
            }
            false => None,
        })
        .collect()

with:

    dictionary
        .keys()
        .iter()
        .map(|key| key.map(|key| values.value(key as _).to_string()))
        .collect()

.as_any()
.downcast_ref::<DictionaryArray<Int32Type>>()
.unwrap();

@alamb (Contributor):

I recommend testing the actual values from the dictionary array (aka test the content is as expected) as well here explicitly -- something like


        let expected_str = vec![
            "hello", "A", "B", "hello", "hello", "C",
            "hello", "E", "E", "hello", "F", "E",
        ];
        let values = concat.values();
        let values = values
            .as_any()
            .downcast_ref::<StringArray>()
            .unwrap();

        let concat_str = concat
            .keys()
            .iter()
            .map(|key| {
                let key = key.unwrap();
                values.value(key as usize)
            })
            .collect::<Vec<&str>>();
        assert_eq!(expected_str, concat_str);

(tested and it passes for me locally)

@tustvold (Contributor, Author) commented Apr 19, 2021

I'm confused; isn't this what I do a couple of lines down?

Edit: I've taken a guess at what you meant and hardcoded the expected result instead of computing it

@alamb (Contributor) commented Apr 19, 2021

> Edit: I've taken a guess at what you meant and hardcoded the expected result instead of computing it

Yes, I meant a hard-coded expected result.

The rationale is that a hard-coded expected result leaves less chance that a bug in the arrow implementation changes the expected results. By comparing the output of concat (which calls concat_dictionary) against concat_dictionary itself, I was thinking that a bug in concat_dictionary could be masked in some cases.


assert_eq!(concat_collected, expected);

Ok(())
@alamb (Contributor):

I suggest adding at least one Null value to the dictionary to test the nulls case
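
For example, something like the following minimal sketch could be added (assuming the FromIterator<Option<&str>> impl for DictionaryArray and the same imports as the existing test; the variable name is illustrative):

    // An input containing a null entry: collecting from Option<&str> leaves the
    // corresponding key slot null in the resulting dictionary array.
    let input_with_null: DictionaryArray<Int32Type> =
        vec![Some("hello"), None, Some("A"), Some("hello")]
            .into_iter()
            .collect();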

@alamb (Contributor) commented Apr 18, 2021

(and there is a clippy failure on this PR :()

@jorgecarleitao (Member) left a comment

I went through this and it looks good. Well spotted, @tustvold, and great solution.

I agree that we have an issue with the dictionary key sizes that can result in a panic, but this is a step forward that should not be held up by that.
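
For illustration of the key-size concern (not code from this PR; assuming the FromIterator<&str> impl for DictionaryArray, with all names here hypothetical):

    // With Int8 keys a dictionary can only address values 0..=127. Each input below
    // uses keys 0..=99, which fit, but after concatenation the second dictionary's
    // keys would need to be offset to 100..=199, which overflows i8, hence the
    // potential panic mentioned above.
    let values_a: Vec<String> = (0..100).map(|i| format!("a-{}", i)).collect();
    let values_b: Vec<String> = (0..100).map(|i| format!("b-{}", i)).collect();
    let dict_a: DictionaryArray<Int8Type> = values_a.iter().map(String::as_str).collect();
    let dict_b: DictionaryArray<Int8Type> = values_b.iter().map(String::as_str).collect();
    // concat(&[&dict_a as _, &dict_b as _]) would then need key values up to 199.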

@alamb (Contributor) commented Apr 19, 2021

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated GitHub repositories, arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository.

Please see the mailing-list thread for more details.

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

@tustvold (Contributor, Author):

Will retarget to the new repository
