Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DictionaryArray::keys_iter, and take_iter for other array types #1296

Merged
merged 6 commits into from
Feb 16, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Feb 10, 2022

Which issue does this PR close?

Closes #1295.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@viirya viirya marked this pull request as draft February 10, 2022 08:32
@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 10, 2022
Comment on lines 409 to 414
pub struct DictionaryIter<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> {
array: &'a DictionaryArray<K>,
values: &'a PrimitiveArray<T>,
current: usize,
current_end: usize,
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may add iterators for others, e.g. Utf8, binary..., if this direction is correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 as you note, the way this is implemented, it will only work for DictionaryArrays whose values are PrimitiveArray<> such as Int64Array

If we take this approach, we'll probably end up with different iterator types for different value types like

pub struct DictionaryIterPrimitive<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> {
...
}

impl<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> std::iter::Iterator
    for DictionaryIter<'a, K, T>
{
    type Item = Option<T::Native>;
...
}
pub struct DictionaryIterBool<'a, K: ArrowPrimitiveType> {
...
}

impl <'a, K: ArrowPrimitiveType> std::iter::Iterator for DictionaryIterBool<'a, K>
{
    type Item = Option<bool>;
...
}
...

And then functions like the following

impl DictionaryArray {
  pub fn iter_primitive<T>::(&self) -> DictionaryIterPrimitive
  ...
  }

  pub fn iter_bool<T>::(&self) -> DictionaryIterBool
  ...
  }
}

Which would effectively require the user to know what type of values their dictionary held

🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking some more we could potentially avoid parameterizing on the key type by always upcasting the indexes to usize (which has to happen anyways) before creating the iterator

@viirya
Copy link
Member Author

viirya commented Feb 10, 2022

cc @alamb

@codecov-commenter
Copy link

codecov-commenter commented Feb 10, 2022

Codecov Report

Merging #1296 (d376976) into master (6ee1fda) will increase coverage by 0.00%.
The diff coverage is 94.66%.

❗ Current head d376976 differs from pull request most recent head 99f77f5. Consider uploading reports for the commit 99f77f5 to get more accurate results

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1296   +/-   ##
=======================================
  Coverage   83.05%   83.06%           
=======================================
  Files         180      180           
  Lines       52824    52870   +46     
=======================================
+ Hits        43874    43916   +42     
- Misses       8950     8954    +4     
Impacted Files Coverage Δ
arrow/src/array/array_binary.rs 93.26% <0.00%> (-0.27%) ⬇️
arrow/src/array/array_boolean.rs 93.07% <0.00%> (-1.46%) ⬇️
arrow/src/array/array_dictionary.rs 90.90% <100.00%> (+2.15%) ⬆️
arrow/src/array/array_primitive.rs 94.69% <100.00%> (+0.02%) ⬆️
arrow/src/array/array_string.rs 97.63% <100.00%> (+0.01%) ⬆️
arrow/src/array/array_union.rs 90.71% <100.00%> (ø)
arrow/src/compute/kernels/aggregate.rs 73.33% <100.00%> (ø)
arrow/src/compute/kernels/arithmetic.rs 90.86% <100.00%> (ø)
arrow/src/compute/kernels/cast.rs 95.21% <100.00%> (ø)
arrow/src/csv/reader.rs 88.12% <100.00%> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ee1fda...99f77f5. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @viirya -- I think this could get messy quickly due to the number of types, but I don't have any other great ideas to be honest

maybe @tustvold or @jorgecarleitao have ideas 🤔

Comment on lines 409 to 414
pub struct DictionaryIter<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> {
array: &'a DictionaryArray<K>,
values: &'a PrimitiveArray<T>,
current: usize,
current_end: usize,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 as you note, the way this is implemented, it will only work for DictionaryArrays whose values are PrimitiveArray<> such as Int64Array

If we take this approach, we'll probably end up with different iterator types for different value types like

pub struct DictionaryIterPrimitive<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> {
...
}

impl<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> std::iter::Iterator
    for DictionaryIter<'a, K, T>
{
    type Item = Option<T::Native>;
...
}
pub struct DictionaryIterBool<'a, K: ArrowPrimitiveType> {
...
}

impl <'a, K: ArrowPrimitiveType> std::iter::Iterator for DictionaryIterBool<'a, K>
{
    type Item = Option<bool>;
...
}
...

And then functions like the following

impl DictionaryArray {
  pub fn iter_primitive<T>::(&self) -> DictionaryIterPrimitive
  ...
  }

  pub fn iter_bool<T>::(&self) -> DictionaryIterBool
  ...
  }
}

Which would effectively require the user to know what type of values their dictionary held

🤔

Comment on lines 409 to 414
pub struct DictionaryIter<'a, K: ArrowPrimitiveType, T: ArrowPrimitiveType> {
array: &'a DictionaryArray<K>,
values: &'a PrimitiveArray<T>,
current: usize,
current_end: usize,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking some more we could potentially avoid parameterizing on the key type by always upcasting the indexes to usize (which has to happen anyways) before creating the iterator

@viirya
Copy link
Member Author

viirya commented Feb 10, 2022

Thanks @alamb. Yea, that is also I have concern with it. But I don't have better idea for DictionaryIter which supports all types of arrays (primitive, utf8, binary..).

@alamb
Copy link
Contributor

alamb commented Feb 12, 2022

Thanks @alamb. Yea, that is also I have concern with it. But I don't have better idea for DictionaryIter which supports all types of arrays (primitive, utf8, binary..).

I thought a bit more about this.

What about adding something slightly more general such as:

impl DictionaryArray {
  /// return an iterator over the keys (indexes into the dictionary)
  fn keys_iter() -> impl Iterator<Item = usize>
}

Then add an iterator to the various other types of array:

impl PrimitiveArray<T> {
...
   /// return an iterator that returns the values of `array.value(i)` for an iterator with each element `i`
   fn take_iter(&self, indexes: impl Iterator<item = usize>) -> impl Iterator<Item = T::NativeType> {
      ..
   }
}

So then to iterate over a dictionary by value the user would do something like:

let keys = dict_array.keys_iter();
let values = dict_array.values().downcast_ref::<UInt64Array>().unwrap().take_iter(keys);
// values is an interator over u64!

@viirya
Copy link
Member Author

viirya commented Feb 13, 2022

@alamb Looks like a good idea. Thanks. Let me revise this.

@viirya
Copy link
Member Author

viirya commented Feb 14, 2022

@alamb Revised this based on your suggestion.

indexes: impl Iterator<Item = Option<usize>> + 'a,
) -> impl Iterator<Item = Option<T::Native>> + 'a {
indexes.map(|opt_index| {
opt_index.map(|index| unsafe { self.value_unchecked(index) })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

Reading this, however, if someone passed in an interator that was out of bounds it will lead to undefined behavior; So I guess we would probably need to either validate every index or else mark the take_iter method as unsafe itself 🤔 which seems less than ideal.

We can only avoid the bounds check if we knew the iterator came from a dictionary that had been previously validated.

@viirya viirya changed the title wip. Implement an iterator for DictionaryArray Implement an iterator for DictionaryArray Feb 15, 2022
@viirya viirya marked this pull request as ready for review February 15, 2022 06:09
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks good. Any thoughts @Dandandan or @liukun4515 or @tustvold ?

@alamb
Copy link
Contributor

alamb commented Feb 15, 2022

@viirya I resolved a conflict in this PR and merged from master in f683cb2

@viirya
Copy link
Member Author

viirya commented Feb 15, 2022

Thank you @alamb !

@alamb alamb merged commit a76b507 into apache:master Feb 16, 2022
@alamb
Copy link
Contributor

alamb commented Feb 16, 2022

Thanks again @viirya

@viirya
Copy link
Member Author

viirya commented Feb 16, 2022

Thank you @alamb !

@alamb alamb changed the title Implement an iterator for DictionaryArray Add DictionaryArray::keys_iter, and take_iter for other array types Feb 16, 2022
@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Feb 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make it easier to iterate over DictionaryArray
3 participants