Generify parquet write path (#1764) #2045
Conversation
```diff
@@ -212,7 +212,7 @@ pub enum Repetition {
 /// Encodings supported by Parquet.
 /// Not all encodings are valid for all types. These enums are also used to specify the
 /// encoding of definition and repetition levels.
-#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Ord, PartialOrd)]
```
This is needed to allow for stable encoding ordering
```rust
{
    const PHYSICAL_TYPE: Type;
```
This is moved onto ParquetValueType, as it allows implementing methods in terms of ParquetValueType, which has the advantage that type inference works.
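To illustrate the type-inference advantage, here is a simplified sketch (the names mirror the PR but this is not the actual parquet API): with the associated constant on the value-type trait, a generic function can recover the physical type from an ordinary slice argument without any turbofish annotations.

```rust
// Cut-down stand-in for parquet's physical Type enum.
#[derive(Debug, PartialEq)]
enum Type {
    Int32,
    Double,
}

// The associated constant lives on the value type itself.
trait ParquetValueType {
    const PHYSICAL_TYPE: Type;
}

impl ParquetValueType for i32 {
    const PHYSICAL_TYPE: Type = Type::Int32;
}

impl ParquetValueType for f64 {
    const PHYSICAL_TYPE: Type = Type::Double;
}

// Inference picks the right impl from the element type of the slice.
fn physical_type_of<T: ParquetValueType>(_values: &[T]) -> Type {
    T::PHYSICAL_TYPE
}

fn main() {
    assert_eq!(physical_type_of(&[1i32, 2, 3]), Type::Int32);
    assert_eq!(physical_type_of(&[1.0f64]), Type::Double);
    println!("inference works");
}
```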
```diff
 // Find out number of batches to process.
 let write_batch_size = self.props.write_batch_size();
-let num_batches = min_len / write_batch_size;
+let num_batches = num_levels / write_batch_size;
```
I think this was a bug before: previously the presence of nulls (which would lead to fewer values) would result in skewed batching.
```rust
pub(crate) mod private {
    use super::*;

    pub trait MakeStatistics {
```
I'm not a huge fan of this, but it was the only way I could work out to do it, and it is crate-local, so if we come up with a better way we can change it in the future.
Maybe another reason to not switch from TypedStatistics to ValueStatistics 🤔
Codecov Report

```
@@           Coverage Diff           @@
##           master    #2045   +/-  ##
======================================
  Coverage   83.73%   83.73%
======================================
  Files         224      225     +1
  Lines       59391    59373    -18
======================================
- Hits        49733    49718    -15
+ Misses       9658     9655     -3
======================================
```

Continue to review full report at Codecov.
I can't say I followed all the details of this PR (especially all the generics), but the code seems easier to read. Other than a few questions on the tests, things look good to me.
Epic 🏅
```rust
num_column_nulls: u64,
column_distinct_count: Option<u64>,
encodings: BTreeSet<Encoding>,
```
Why is this a BTreeSet (which requires Ord on Encoding) rather than a HashSet? Is there any reason the ordering is important?
It makes the output deterministic, which makes testing easier, but in terms of correctness, there should be no requirement for a particular order
Makes sense -- Perhaps it is worth a comment explaining the rationale
```diff
-pub struct TypedStatistics<T: DataType> {
-    min: Option<T::T>,
-    max: Option<T::T>,
+pub type TypedStatistics<T> = ValueStatistics<<T as DataType>::T>;
```
I am sorry, my rust-fu isn't good enough -- what is the purpose of renaming TypedStatistics to ValueStatistics (and doing the DataType::T nonsense in the typedef)? Is that needed so that #[derive] works?
It is needed so you can create Statistics from a T: ParquetValueType, whereas previously it was being created (in a very hacky way) using DataType and casting via byte slices
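A simplified sketch of that change (illustrative only, not the actual parquet types): making the statistics struct generic over the value type itself means it can be built directly from values, with no casting through byte slices.

```rust
// Statistics generic over the value type, not a DataType marker.
#[derive(Debug, PartialEq)]
struct ValueStatistics<T> {
    min: Option<T>,
    max: Option<T>,
}

impl<T: PartialOrd + Copy> ValueStatistics<T> {
    // Build statistics straight from a slice of values.
    fn new(values: &[T]) -> Self {
        let mut min = None;
        let mut max = None;
        for &v in values {
            min = Some(match min {
                Some(m) if m < v => m,
                _ => v,
            });
            max = Some(match max {
                Some(m) if m > v => m,
                _ => v,
            });
        }
        ValueStatistics { min, max }
    }
}

fn main() {
    let stats = ValueStatistics::new(&[3i32, 1, 2]);
    assert_eq!(stats.min, Some(1));
    assert_eq!(stats.max, Some(3));
    println!("{:?}", stats);
}
```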
```diff
 );
 check_encoding_write_support::<Int32Type>(
     WriterVersion::PARQUET_2_0,
     false,
     &[1, 2],
     None,
-    &[Encoding::DELTA_BINARY_PACKED, Encoding::RLE],
+    &[Encoding::RLE, Encoding::DELTA_BINARY_PACKED],
```
I don't understand these changes. Why did the order of supported encodings change? Is it due to the use of a Set?
The way we track the encodings has changed slightly (using a set) which results in these changes
```rust
// Column writer properties
descr: ColumnDescPtr,
props: WriterPropertiesPtr,
statistics_enabled: EnabledStatistics,

page_writer: Box<dyn PageWriter + 'a>,
has_dictionary: bool,
```
this appears to be a significant improvement / change -- not tracking DictEncoder specially
@nevi-me are you still reviewing this?
I'm going to get this in. @nevi-me, if you have any feedback I'll be more than happy to address it in a follow-up.
Benchmark runs are scheduled for baseline = c585544 and contender = 02371d2. 02371d2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Part of #1764
Rationale for this change
This will allow encoding arrow arrays directly instead of the current approach which must first convert them to a row format.
What changes are included in this PR?
Adds generics to ColumnWriterImpl to allow custom encoders
Are there any user-facing changes?
No
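The custom-encoder generics can be sketched as follows. All names here are illustrative (the actual trait in the PR differs); the point is that the column writer depends only on an encoder trait, so an alternative encoder, such as one consuming arrow arrays directly, can be plugged in without touching the writer.

```rust
// Hypothetical encoder abstraction, illustrating the shape of the change.
trait ColumnValueEncoder {
    type Value;
    fn write(&mut self, values: &[Self::Value]);
    fn flush(&mut self) -> Vec<u8>;
}

// A trivial encoder: plain little-endian encoding of i32 values.
struct PlainInt32Encoder {
    buffer: Vec<u8>,
}

impl ColumnValueEncoder for PlainInt32Encoder {
    type Value = i32;
    fn write(&mut self, values: &[i32]) {
        for v in values {
            self.buffer.extend_from_slice(&v.to_le_bytes());
        }
    }
    fn flush(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.buffer)
    }
}

// The writer is generic over the encoder, so encoders can be swapped.
struct GenericColumnWriter<E: ColumnValueEncoder> {
    encoder: E,
}

impl<E: ColumnValueEncoder> GenericColumnWriter<E> {
    fn write_batch(&mut self, values: &[E::Value]) {
        self.encoder.write(values);
    }
    fn close(mut self) -> Vec<u8> {
        self.encoder.flush()
    }
}

fn main() {
    let mut w = GenericColumnWriter {
        encoder: PlainInt32Encoder { buffer: Vec::new() },
    };
    w.write_batch(&[1, 2]);
    let bytes = w.close();
    assert_eq!(bytes.len(), 8); // two i32 values, 4 bytes each
    println!("{bytes:?}");
}
```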