-
Notifications
You must be signed in to change notification settings - Fork 750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Calculate the correct list repetition values to read #692
Conversation
When reading a list value, the reader would sometimes return an incomplete list slot due to not correctly accounting for the right repetition levels to read. This fixes that by counting the number of rep levels at the beginning of the list slot, and using that to read the right number of values to correctly construct list arrays
Codecov Report
@@ Coverage Diff @@
## master #692 +/- ##
==========================================
- Coverage 82.44% 82.36% -0.08%
==========================================
Files 168 168
Lines 47347 47310 -37
==========================================
- Hits 39033 38969 -64
- Misses 8314 8341 +27
Continue to review full report at Codecov.
|
} | ||
// Check if we have read more values than num_values | ||
let total_read = self.total_consumed_levels; | ||
// self.total_consumed_levels = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this line intended to be commented? Do we need to reset total_consumed_levels to 0 here?
value_decoder.read_value_bytes(values_to_read, read_bytes)?; | ||
if values_read != child_values_read { | ||
child_values_to_read = child_values_read; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't child_values_read
need to be accumulated / summed up across loop iterations (instead of only taking the value from the last iteration) ?
@@ -754,20 +786,40 @@ impl<I: Iterator<Item = Box<dyn ValueDecoder>>> ValueDecoder | |||
} | |||
} | |||
|
|||
Ok(num_values - values_to_read) | |||
Ok((num_values - values_to_read, child_values_to_read)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it appears the child_values_to_read
return value is not actually used anywhere; is that on purpose?
} else { | ||
break; | ||
|
||
// When reading repetition levels, num_values will likely be less than the array values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with these changes, how does this ArrowArrayReader work with the ListArrayReader
? Shouldn't it be the responsibility of the ListArrayReader
to understand the meaning of repetition levels and use them to assemble lists of values from lower level / primitive arrays?
// as we do not need to do this for them. | ||
// In the example above, we could have `num_values` being 1 because we want to only read | ||
// one value, but we will return that we need to read 2 values to fill that 1 list slot. | ||
match self.level_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this match operation can be avoided if the LevelValueDecoder
is split into separate decoders for repetition and definition levels
// See https://github.com/apache/arrow-rs/issues/518, as this test fixes that. | ||
fn list_single_column_string() { | ||
let a_values = StringArray::from(vec![ | ||
"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something to keep in mind is that the new ArrowArrayReader
is currently only used for strings, which limits the effect of this fix; ArrowArrayReader
could be used / "turned on" for primitive types as well, but it's currently slower for primitive types compared to the old PrimitiveArrayReader
Ping @nevi-me / @yordan-pavlov / @mcassels -- I plan to make a arrow 5.3 release candidate later today (6-8 hours from now) -- if we want to include the changes in this PR in 5.3.0 it would need to happen soon-ish. Is there anything we need to do to push this along? |
level_value_buffer: vec![0i16; 2048], | ||
level_value_buffer: vec![0i16; BUFFER_LEN], | ||
level_type, | ||
total_consumed_levels: 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another reason to split LevelValueDecoder
into separate decoders for repetition and definition levels is because they use different state - these 3 new fields are only used for repetition levels
LevelType::Repetition => { | ||
let mut carried_over_levels = 0; | ||
let mut record_values_read = 0; | ||
while total_values_read < num_values { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be while record_values_read < num_values {
? It's not very clear what the intention is here? - is it to read enough repetition level values for num_values
full list of values? It's also not very clear which variables represent lists of values and which variables represent values within lists; maybe some more comments will help?
Hey @alamb @mcassels @yordan-pavlov I currently don't have capacity to work on this. No chance of making 5.3, but I will try look at it over the weekend. The PR still needs a lot of work from what I've seen Yordan raising. I worked on it exploratively until tests passed, so my work needs cleaning up and documenting before this is in an acceptable state. I can move it to a draft actually |
Marking PRs that are over a month old as stale -- please let us know if there is additional work planned or if we should close them. |
Closing this PR -- note that @helgikrs has done some non trivial work in this area so reading nullable lists / structs may be working now (scheduled for release in arrow-8.0.0, which is due out shortly) |
Which issue does this PR close?
Closes #518 .
Rationale for this change
See the related issue
What changes are included in this PR?
When reading a list value, the reader would sometimes return an incomplete list slot
due to not correctly accounting for the right repetition levels to read.
This fixes that by counting the number of rep levels at the beginning of the list slot,
and using that to read the right number of values to correctly construct list arrays
Are there any user-facing changes?
No changes