Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculate the correct list repetition values to read #692

Closed
wants to merge 1 commit into from

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Aug 13, 2021

Which issue does this PR close?

Closes #518 .

Rationale for this change

See the related issue

What changes are included in this PR?

When reading a list value, the reader would sometimes return an incomplete list slot
due to not correctly accounting for the right repetition levels to read.

This fixes that by counting the number of rep levels at the beginning of the list slot,
and using that to read the right number of values to correctly construct list arrays

Are there any user-facing changes?

No changes

When reading a list value, the reader would sometimes return an incomplete list slot
due to not correctly accounting for the right repetition levels to read.

This fixes that by counting the number of rep levels at the beginning of the list slot,
and using that to read the right number of values to correctly construct list arrays
@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 13, 2021
@codecov-commenter
Copy link

Codecov Report

Merging #692 (dceddd7) into master (1a5c26b) will decrease coverage by 0.07%.
The diff coverage is 77.19%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #692      +/-   ##
==========================================
- Coverage   82.44%   82.36%   -0.08%     
==========================================
  Files         168      168              
  Lines       47347    47310      -37     
==========================================
- Hits        39033    38969      -64     
- Misses       8314     8341      +27     
Impacted Files Coverage Δ
parquet/src/arrow/arrow_array_reader.rs 72.00% <74.50%> (-6.43%) ⬇️
parquet/src/arrow/arrow_writer.rs 98.09% <100.00%> (+0.03%) ⬆️
parquet/src/arrow/array_reader.rs 77.72% <0.00%> (+0.08%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1a5c26b...dceddd7. Read the comment docs.

}
// Check if we have read more values than num_values
let total_read = self.total_consumed_levels;
// self.total_consumed_levels = 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this line intended to be commented? Do we need to reset total_consumed_levels to 0 here?

value_decoder.read_value_bytes(values_to_read, read_bytes)?;
if values_read != child_values_read {
child_values_to_read = child_values_read;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't child_values_read need to be accumulated / summed up across loop iterations (instead of only taking the value from the last iteration) ?

@@ -754,20 +786,40 @@ impl<I: Iterator<Item = Box<dyn ValueDecoder>>> ValueDecoder
}
}

Ok(num_values - values_to_read)
Ok((num_values - values_to_read, child_values_to_read))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it appears the child_values_to_read return value is not actually used anywhere; is that on purpose?

} else {
break;

// When reading repetition levels, num_values will likely be less than the array values
Copy link
Contributor

@yordan-pavlov yordan-pavlov Aug 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with these changes, how does this ArrowArrayReader work with the ListArrayReader? Shouldn't it be the responsibility of the ListArrayReader to understand the meaning of repetition levels and use them to assemble lists of values from lower level / primitive arrays?

// as we do not need to do this for them.
// In the example above, we could have `num_values` being 1 because we want to only read
// one value, but we will return that we need to read 2 values to fill that 1 list slot.
match self.level_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this match operation can be avoided if the LevelValueDecoder is split into separate decoders for repetition and definition levels

// See https://github.com/apache/arrow-rs/issues/518, as this test fixes that.
fn list_single_column_string() {
let a_values = StringArray::from(vec![
"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something to keep in mind is that the new ArrowArrayReader is currently only used for strings, which limits the effect of this fix; ArrowArrayReader could be used / "turned on" for primitive types as well, but it's currently slower for primitive types compared to the old PrimitiveArrayReader

@alamb
Copy link
Contributor

alamb commented Aug 26, 2021

Ping @nevi-me / @yordan-pavlov / @mcassels -- I plan to make a arrow 5.3 release candidate later today (6-8 hours from now) -- if we want to include the changes in this PR in 5.3.0 it would need to happen soon-ish. Is there anything we need to do to push this along?

level_value_buffer: vec![0i16; 2048],
level_value_buffer: vec![0i16; BUFFER_LEN],
level_type,
total_consumed_levels: 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another reason to split LevelValueDecoder into separate decoders for repetition and definition levels is because they use different state - these 3 new fields are only used for repetition levels

LevelType::Repetition => {
let mut carried_over_levels = 0;
let mut record_values_read = 0;
while total_values_read < num_values {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be while record_values_read < num_values { ? It's not very clear what the intention is here? - is it to read enough repetition level values for num_values full list of values? It's also not very clear which variables represent lists of values and which variables represent values within lists; maybe some more comments will help?

@nevi-me
Copy link
Contributor Author

nevi-me commented Aug 27, 2021

Ping @nevi-me / @yordan-pavlov / @mcassels -- I plan to make a arrow 5.3 release candidate later today (6-8 hours from now) -- if we want to include the changes in this PR in 5.3.0 it would need to happen soon-ish. Is there anything we need to do to push this along?

Hey @alamb @mcassels @yordan-pavlov I currently don't have capacity to work on this. No chance of making 5.3, but I will try look at it over the weekend.

The PR still needs a lot of work from what I've seen Yordan raising. I worked on it exploratively until tests passed, so my work needs cleaning up and documenting before this is in an acceptable state.

I can move it to a draft actually

@nevi-me nevi-me marked this pull request as draft August 27, 2021 09:12
@alamb
Copy link
Contributor

alamb commented Dec 20, 2021

Marking PRs that are over a month old as stale -- please let us know if there is additional work planned or if we should close them.

@alamb alamb added the stale-pr label Dec 20, 2021
@alamb
Copy link
Contributor

alamb commented Jan 24, 2022

Closing this PR -- note that @helgikrs has done some non trivial work in this area so reading nullable lists / structs may be working now (scheduled for release in arrow-8.0.0, which is due out shortly)

@alamb alamb closed this Jan 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate stale-pr
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet list reader not reading correct element lengths
5 participants