Support FixedSizeBinary + FixedSizeList #30

Merged
merged 2 commits into from
Mar 7, 2022
11 changes: 8 additions & 3 deletions README.md
@@ -12,8 +12,13 @@ The following features are supported:
- Optional fields.
- Deep nesting via structs which derive the `ArrowField` macro or by Vec<T>.
- Large types:
- ['LargeBinary'], ['LargeString'], ['LargeList`]
- [`LargeBinary`], [`LargeString`], [`LargeList`]
- These can be used via the "override" attribute. Please see the [complex_example.rs](./arrow2_convert/tests/complex_example.rs) for usage.
- Fixed size types:
- [`FixedSizeBinary`]
- [`FixedSizeList`]
- This is supported for a fixed size `Vec<T>` via the `FixedSizeVec` type override.
- Note: nesting of [`FixedSizeList`] is not supported.
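
To illustrate what a fixed size list means for the data layout, here is a minimal self-contained sketch. It does not use `arrow2`; `FixedSizeListSketch` and its methods are hypothetical names made up for this illustration. The only fact it relies on is that an Arrow fixed size list stores every row with the same number of items, back to back in one child buffer, so no offsets are required.

```rust
/// Hypothetical stand-in for a mutable fixed size list array (not the arrow2 type):
/// every row holds exactly `size` items, stored back to back in one flat buffer.
pub struct FixedSizeListSketch<T> {
    size: usize,
    values: Vec<T>,
}

impl<T: Clone> FixedSizeListSketch<T> {
    pub fn new(size: usize) -> Self {
        Self { size, values: Vec::new() }
    }

    /// Append one row, rejecting rows whose length differs from `size`.
    pub fn try_push(&mut self, row: &[T]) -> Result<(), String> {
        if row.len() != self.size {
            return Err(format!("expected {} items, got {}", self.size, row.len()));
        }
        self.values.extend_from_slice(row);
        Ok(())
    }

    /// Number of complete rows stored so far.
    pub fn len(&self) -> usize {
        if self.size == 0 { 0 } else { self.values.len() / self.size }
    }

    /// Row `i` is the slice `values[i * size .. (i + 1) * size]`.
    pub fn row(&self, i: usize) -> Option<&[T]> {
        if i >= self.len() {
            return None;
        }
        let start = i * self.size;
        Some(&self.values[start..start + self.size])
    }
}
```

Because the row width is fixed, a `Vec<T>` of the wrong length cannot be represented; in this sketch it is rejected at push time.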

The following are not yet supported.

@@ -91,9 +96,9 @@ To achieve this, the following approach is used:

### Implementing Large Types

Ideally for code reusability, we wouldn’t have to reimplement `ArrowSerialize` and `ArrowDeserialize` for large and fixed offset types since the primitive types are the same. However, this requires the trait functions to take a generic bounded mutable array as an argument instead of a single array type. This requires the `ArrowSerialize` and `ArrowDeserialize` implementations to be able to specify the bounds as part of the associated type , which not possible without generic associated types.
Ideally for code reusability, we wouldn’t have to reimplement `ArrowSerialize` and `ArrowDeserialize` for large and fixed size types since the primitive types are the same. However, this requires the trait functions to take a generic bounded mutable array as an argument instead of a single array type. This requires the `ArrowSerialize` and `ArrowDeserialize` implementations to be able to specify the bounds as part of the associated type, which is not possible without generic associated types.

As a result, we’re forced to sacrifice code reusability and introduce a little bit of complexity by providing separate `ArrowSerialize` and `ArrowDeserialize` implementations for large types via placeholder structures. This also requires introducing the `Type` attribute to `ArrowField` so that the large types can be used via a field attribute without affecting the structure field types.
As a result, we’re forced to sacrifice code reusability and introduce a little bit of complexity by providing separate `ArrowSerialize` and `ArrowDeserialize` implementations for large and fixed size types via placeholder structures. This also requires introducing the `Type` associated type to `ArrowField` so that the arrow type can be overridden via a macro field attribute without affecting the actual type.
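
The placeholder approach described above can be sketched in a self-contained form. This is only an illustration under stated assumptions: the `DataType` enum and the `ArrowField` trait below are simplified stand-ins, not the `arrow2` or `arrow2_convert` definitions, and `describe` is a hypothetical helper added to show how generic code sees the two types.

```rust
// Simplified stand-in for arrow2's DataType (assumption: only three variants).
#[derive(Debug, PartialEq)]
pub enum DataType {
    Binary,
    LargeBinary,
    FixedSizeBinary(usize),
}

// Simplified stand-in for the crate's ArrowField trait.
pub trait ArrowField {
    /// The Rust type that values of this field actually have.
    type Type;
    /// The Arrow data type this field maps to.
    fn data_type() -> DataType;
}

// Default mapping: Vec<u8> is its own Rust type and maps to Binary.
impl ArrowField for Vec<u8> {
    type Type = Vec<u8>;
    fn data_type() -> DataType {
        DataType::Binary
    }
}

// Placeholder: never instantiated, it only selects a different Arrow type
// while the Rust value type stays Vec<u8>.
pub struct LargeBinary {}

impl ArrowField for LargeBinary {
    type Type = Vec<u8>;
    fn data_type() -> DataType {
        DataType::LargeBinary
    }
}

// Placeholder carrying the width as a const generic parameter.
pub struct FixedSizeBinary<const SIZE: usize> {}

impl<const SIZE: usize> ArrowField for FixedSizeBinary<SIZE> {
    type Type = Vec<u8>;
    fn data_type() -> DataType {
        DataType::FixedSizeBinary(SIZE)
    }
}

// Hypothetical helper: the Arrow type comes from the placeholder `F`,
// the value is always a plain Vec<u8>.
pub fn describe<F: ArrowField<Type = Vec<u8>>>(value: &F::Type) -> (DataType, usize) {
    (F::data_type(), value.len())
}
```

The struct field keeps its ordinary Rust type; only the type parameter chosen through the override decides which Arrow representation is used.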

### Why not serde?

3 changes: 2 additions & 1 deletion arrow2_convert/Cargo.toml
@@ -12,7 +12,8 @@ repository = "https://github.com/DataEngineeringLabs/arrow2-convert/arrow2_conve
description = "Convert between nested rust types and Arrow with arrow2"

[dependencies]
arrow2 = { version = "0.9.1", default-features = false }
# Temporary until next arrow2 release
arrow2 = { git = "https://github.com/jorgecarleitao/arrow2", rev = "81bfad", default_features = false }
arrow2_convert_derive = { version = "0.1.0", path = "../arrow2_convert_derive", optional = true }
chrono = { version = "0.4", default_features = false, features = ["std"] }
err-derive = "0.3"
54 changes: 40 additions & 14 deletions arrow2_convert/src/deserialize.rs
@@ -169,6 +169,29 @@ impl ArrowDeserialize for LargeBinary {
}
}

impl<const SIZE: usize> ArrowDeserialize for FixedSizeBinary<SIZE> {
type ArrayType = FixedSizeBinaryArray;

#[inline]
fn arrow_deserialize(v: Option<&[u8]>) -> Option<Vec<u8>> {
v.map(|t| t.to_vec())
}
}

fn arrow_deserialize_vec_helper<T>(v: Option<Box<dyn Array>>) -> Option<<Vec<T> as ArrowField>::Type>
where
T: ArrowDeserialize + ArrowEnableVecForType + 'static,
for<'a> &'a T::ArrayType: IntoIterator,
{
use std::ops::Deref;
match v {
Some(t) => arrow_array_deserialize_iterator_internal::<<T as ArrowField>::Type, T>(t.deref())
.ok()
.map(|i| i.collect::<Vec<<T as ArrowField>::Type>>()),
None => None,
}
}

// Blanket implementation for Vec
impl<T> ArrowDeserialize for Vec<T>
where
@@ -179,13 +202,7 @@ where
type ArrayType = ListArray<i32>;

fn arrow_deserialize(v: Option<Box<dyn Array>>) -> Option<<Self as ArrowField>::Type> {
use std::ops::Deref;
match v {
Some(t) => arrow_array_deserialize_iterator_internal::<<T as ArrowField>::Type, T>(t.deref())
.ok()
.map(|i| i.collect::<Vec<<T as ArrowField>::Type>>()),
None => None,
}
arrow_deserialize_vec_helper::<T>(v)
}
}

@@ -198,13 +215,20 @@ where
type ArrayType = ListArray<i64>;

fn arrow_deserialize(v: Option<Box<dyn Array>>) -> Option<<Self as ArrowField>::Type> {
use std::ops::Deref;
match v {
Some(t) => arrow_array_deserialize_iterator_internal::<<T as ArrowField>::Type, T>(t.deref())
.ok()
.map(|i| i.collect::<Vec<<T as ArrowField>::Type>>()),
None => None,
}
arrow_deserialize_vec_helper::<T>(v)
}
}

impl<T, const SIZE: usize> ArrowDeserialize for FixedSizeVec<T, SIZE>
where
T: ArrowDeserialize + ArrowEnableVecForType + 'static,
<T as ArrowDeserialize>::ArrayType: 'static,
for<'b> &'b <T as ArrowDeserialize>::ArrayType: IntoIterator,
{
type ArrayType = FixedSizeListArray;

fn arrow_deserialize(v: Option<Box<dyn Array>>) -> Option<<Self as ArrowField>::Type> {
arrow_deserialize_vec_helper::<T>(v)
}
}

@@ -213,8 +237,10 @@ impl_arrow_array!(Utf8Array<i32>);
impl_arrow_array!(Utf8Array<i64>);
impl_arrow_array!(BinaryArray<i32>);
impl_arrow_array!(BinaryArray<i64>);
impl_arrow_array!(FixedSizeBinaryArray);
impl_arrow_array!(ListArray<i32>);
impl_arrow_array!(ListArray<i64>);
impl_arrow_array!(FixedSizeListArray);

/// Top-level API to deserialize from Arrow
pub trait TryIntoCollection<Collection, Element>
29 changes: 29 additions & 0 deletions arrow2_convert/src/field.rs
@@ -175,6 +175,17 @@ impl ArrowField for LargeBinary {
}
}

pub struct FixedSizeBinary<const SIZE: usize> {}

impl<const SIZE: usize> ArrowField for FixedSizeBinary<SIZE> {
type Type = Vec<u8>;

#[inline]
fn data_type() -> arrow2::datatypes::DataType {
arrow2::datatypes::DataType::FixedSizeBinary(SIZE)
}
}

// Blanket implementation for Vec.
impl<T> ArrowField for Vec<T>
where
@@ -204,17 +215,35 @@ where
}
}

pub struct FixedSizeVec<T, const SIZE: usize> {
d: std::marker::PhantomData<T>
}

impl<T, const SIZE: usize> ArrowField for FixedSizeVec<T, SIZE>
where
T: ArrowField + ArrowEnableVecForType
{
type Type = Vec<<T as ArrowField>::Type>;

#[inline]
fn data_type() -> arrow2::datatypes::DataType {
arrow2::datatypes::DataType::FixedSizeList(Box::new(<T as ArrowField>::field("item")), SIZE)
}
}

arrow_enable_vec_for_type!(String);
arrow_enable_vec_for_type!(LargeString);
arrow_enable_vec_for_type!(bool);
arrow_enable_vec_for_type!(NaiveDateTime);
arrow_enable_vec_for_type!(NaiveDate);
arrow_enable_vec_for_type!(Vec<u8>);
arrow_enable_vec_for_type!(LargeBinary);
impl<const SIZE: usize> ArrowEnableVecForType for FixedSizeBinary<SIZE> {}

// Blanket implementation for Vec<Option<T>> if vectors are enabled for T
impl<T> ArrowEnableVecForType for Option<T> where T: ArrowField + ArrowEnableVecForType {}

// Blanket implementation for Vec<Vec<T>> if vectors are enabled for T
impl<T> ArrowEnableVecForType for Vec<T> where T: ArrowField + ArrowEnableVecForType {}
impl<T> ArrowEnableVecForType for LargeVec<T> where T: ArrowField + ArrowEnableVecForType {}
impl<T, const SIZE: usize> ArrowEnableVecForType for FixedSizeVec<T, SIZE> where T: ArrowField + ArrowEnableVecForType {}
102 changes: 95 additions & 7 deletions arrow2_convert/src/serialize.rs
@@ -14,11 +14,8 @@ pub trait ArrowSerialize: ArrowField {
/// The [`arrow2::array::MutableArray`] that holds this value
type MutableArrayType: ArrowMutableArray;

#[inline]
/// Create a new mutable array
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::default()
}
fn new_array() -> Self::MutableArrayType;

/// Serialize this field to arrow
fn arrow_serialize(v: &<Self as ArrowField>::Type, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()>;
@@ -29,7 +26,7 @@ pub trait ArrowSerialize: ArrowField {
///
/// Implementations of this trait are provided for all mutable arrays provided by [`arrow2`].
#[doc(hidden)]
pub trait ArrowMutableArray: arrow2::array::MutableArray + Default {
pub trait ArrowMutableArray: arrow2::array::MutableArray {
fn reserve(&mut self, additional: usize, additional_values: usize);
}

@@ -39,6 +36,11 @@ macro_rules! impl_numeric_type {
impl ArrowSerialize for $physical_type {
type MutableArrayType = MutablePrimitiveArray<$physical_type>;

#[inline]
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::default()
}

#[inline]
fn arrow_serialize(
v: &Self,
@@ -71,6 +73,11 @@ where
{
type MutableArrayType = <T as ArrowSerialize>::MutableArrayType;

#[inline]
fn new_array() -> Self::MutableArrayType {
<T as ArrowSerialize>::new_array()
}

#[inline]
fn arrow_serialize(v: &<Self as ArrowField>::Type, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
match v.as_ref() {
@@ -97,6 +104,11 @@ impl_numeric_type!(f64, Float64);
impl ArrowSerialize for String {
type MutableArrayType = MutableUtf8Array<i32>;

#[inline]
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::default()
}

#[inline]
fn arrow_serialize(v: &Self, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
array.try_push(Some(v))
@@ -107,6 +119,11 @@ impl ArrowSerialize for LargeString
{
type MutableArrayType = MutableUtf8Array<i64>;

#[inline]
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::default()
}

#[inline]
fn arrow_serialize(v: &String, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
array.try_push(Some(v))
@@ -116,6 +133,11 @@ impl ArrowSerialize for LargeString
impl ArrowSerialize for bool {
type MutableArrayType = MutableBooleanArray;

#[inline]
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::default()
}

#[inline]
fn arrow_serialize(v: &Self, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
array.try_push(Some(*v))
@@ -156,6 +178,11 @@ impl ArrowSerialize for NaiveDate {
impl ArrowSerialize for Vec<u8> {
type MutableArrayType = MutableBinaryArray<i32>;

#[inline]
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::default()
}

#[inline]
fn arrow_serialize(v: &Self, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
array.try_push(Some(v))
@@ -166,16 +193,37 @@ impl ArrowSerialize for LargeBinary
{
type MutableArrayType = MutableBinaryArray<i64>;

#[inline]
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::default()
}

#[inline]
fn arrow_serialize(v: &Vec<u8>, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
array.try_push(Some(v))
}
}

impl<const SIZE: usize> ArrowSerialize for FixedSizeBinary<SIZE> {
type MutableArrayType = MutableFixedSizeBinaryArray;

#[inline]
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::new(SIZE)
}

#[inline]
fn arrow_serialize(v: &Vec<u8>, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
array.try_push(Some(v))
}
}


// Blanket implementation for Vec
impl<T> ArrowSerialize for Vec<T>
where
T: ArrowSerialize + ArrowEnableVecForType + 'static,
<T as ArrowSerialize>::MutableArrayType: Default
{
type MutableArrayType = MutableListArray<i32, <T as ArrowSerialize>::MutableArrayType>;

@@ -200,6 +248,7 @@ where
impl<T> ArrowSerialize for LargeVec<T>
where
T: ArrowSerialize + ArrowEnableVecForType + 'static,
<T as ArrowSerialize>::MutableArrayType: Default
{
type MutableArrayType = MutableListArray<i64, <T as ArrowSerialize>::MutableArrayType>;

@@ -221,6 +270,32 @@ where
}
}

impl<T, const SIZE: usize> ArrowSerialize for FixedSizeVec<T, SIZE>
where
T: ArrowSerialize + ArrowEnableVecForType + 'static,
<T as ArrowSerialize>::MutableArrayType: Default
{
type MutableArrayType = MutableFixedSizeListArray<<T as ArrowSerialize>::MutableArrayType>;

#[inline]
fn new_array() -> Self::MutableArrayType {
Self::MutableArrayType::new_with_field(
<T as ArrowSerialize>::new_array(),
"item",
<T as ArrowField>::is_nullable(),
SIZE
)
}

fn arrow_serialize(v: &<Self as ArrowField>::Type, array: &mut Self::MutableArrayType) -> arrow2::error::Result<()> {
let values = array.mut_values();
for i in v.iter() {
<T as ArrowSerialize>::arrow_serialize(i, values)?;
}
array.try_push_valid()
}
}

impl ArrowMutableArray for MutableBooleanArray {
impl_mutable_array_body!();
}
@@ -247,17 +322,30 @@ impl ArrowMutableArray for MutableBinaryArray<i64> {
impl_mutable_array_body!();
}

impl ArrowMutableArray for MutableFixedSizeBinaryArray {
#[inline]
fn reserve(&mut self, _additional: usize, _additional_values: usize) {}
}

impl<M> ArrowMutableArray for MutableListArray<i32, M>
where
M: ArrowMutableArray + 'static,
M: ArrowMutableArray + Default + 'static,
{
#[inline]
fn reserve(&mut self, _additional: usize, _additional_values: usize) {}
}

impl<M> ArrowMutableArray for MutableListArray<i64, M>
where
M: ArrowMutableArray + 'static,
M: ArrowMutableArray + Default + 'static,
{
#[inline]
fn reserve(&mut self, _additional: usize, _additional_values: usize) {}
}

impl<M> ArrowMutableArray for MutableFixedSizeListArray<M>
where
M: ArrowMutableArray + Default + 'static,
{
#[inline]
fn reserve(&mut self, _additional: usize, _additional_values: usize) {}
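
The `serialize.rs` change above turns `new_array` into a required method and drops the `Default` bound from `ArrowMutableArray`. A plausible reading is that a fixed size array cannot be built by `Default` because it needs its width at construction time. The sketch below models that idea without `arrow2`: `FixedBinaryBuilder`, `SerializeSketch`, and `FixedBytes` are hypothetical names for this illustration, not the crate's API.

```rust
/// Hypothetical stand-in for a mutable fixed size binary array.
/// It has no sensible `Default`, because the width must be known up front.
pub struct FixedBinaryBuilder {
    size: usize,
    values: Vec<u8>,
}

impl FixedBinaryBuilder {
    pub fn new(size: usize) -> Self {
        Self { size, values: Vec::new() }
    }

    /// Append one value, rejecting slices whose length differs from `size`.
    pub fn try_push(&mut self, v: &[u8]) -> Result<(), String> {
        if v.len() != self.size {
            return Err(format!("expected {} bytes, got {}", self.size, v.len()));
        }
        self.values.extend_from_slice(v);
        Ok(())
    }

    /// Number of values stored so far.
    pub fn len(&self) -> usize {
        if self.size == 0 { 0 } else { self.values.len() / self.size }
    }
}

/// Simplified stand-in for a serialize trait: each implementation decides
/// how its builder is constructed instead of relying on `Default`.
pub trait SerializeSketch {
    type Builder;
    fn new_array() -> Self::Builder;
}

/// Placeholder type whose const parameter supplies the width.
pub struct FixedBytes<const SIZE: usize> {}

impl<const SIZE: usize> SerializeSketch for FixedBytes<SIZE> {
    type Builder = FixedBinaryBuilder;

    fn new_array() -> Self::Builder {
        // The const generic parameter is forwarded to the constructor.
        FixedBinaryBuilder::new(SIZE)
    }
}
```

Types whose builders do implement `Default` can still delegate to it inside `new_array`, which matches the per-type `new_array` bodies added in the diff.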