Add Push-Based CSV Decoder #3604

Merged 5 commits on Jan 27, 2023

tustvold
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Inspired by the RawDecoder interface added in https://github.com/apache/arrow-rs/pull/3479/files, I wanted to add a similar push-based interface to the CSV reader. This PR adds it.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 25, 2023
}
}

/// Clears and then fills the buffers on this [`RecordReader`]
/// returning the number of records read
fn fill_buf(&mut self, to_read: usize) -> Result<usize, ArrowError> {
Contributor Author

Effectively, all this PR does is lift the state from fill_buf's stack frame onto the struct, so that decoding can stop and resume between calls.
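
To illustrate the idea of lifting loop state onto a struct, here is a minimal sketch. `LineDecoder` and its fields are hypothetical names invented for this example and are not the arrow-csv implementation; it only splits newline-terminated records and ignores quoting entirely.

```rust
/// Hypothetical toy decoder, not the arrow-csv implementation.
/// It splits newline-terminated records out of pushed byte chunks.
#[derive(Default)]
struct LineDecoder {
    /// Bytes of the record currently being assembled. In a pull-based
    /// reader this would be a local variable inside the read loop.
    partial: Vec<u8>,
    /// Records completed since the last call to `flush`.
    completed: Vec<String>,
}

impl LineDecoder {
    /// Consumes all of `buf`, which may end part way through a record.
    fn decode(&mut self, buf: &[u8]) {
        for &byte in buf {
            if byte == b'\n' {
                // Record complete: move it out, leaving `partial` empty.
                let record = std::mem::take(&mut self.partial);
                self.completed
                    .push(String::from_utf8_lossy(&record).into_owned());
            } else {
                self.partial.push(byte);
            }
        }
    }

    /// Returns the completed records and empties the completed list.
    /// Any partially decoded record stays buffered for the next `decode`.
    fn flush(&mut self) -> Vec<String> {
        std::mem::take(&mut self.completed)
    }
}
```

Because `partial` lives on the struct rather than in a stack frame, a call such as `decode(b"a,b\nc")` followed later by `decode(b",d\n")` yields the records `a,b` and `c,d`, regardless of where the caller's buffers happen to end.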

let mut skipped = 0;
while to_skip > skipped {
let read = self.fill_buf(to_skip.min(1024))?;
if read == 0 {
Contributor Author

Returning an error here was a quick workaround for an infinite loop, added in #3470.

This PR handles the case properly and simply returns no [RecordBatch] if the offset exceeds the length of the file. I think this makes for a better UX.
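
A rough sketch of that skip behaviour, under stated assumptions: `skip_records` is a hypothetical free function, and the `fill` closure stands in for a method like `fill_buf` that returns how many records it read. Requesting only the remaining count on each iteration is a choice made for this sketch, not something taken from the PR.

```rust
/// Hypothetical helper: skips up to `to_skip` records using `fill`, a
/// stand-in for a method like `fill_buf` that reports how many records
/// it read. Returns the number of records actually skipped.
fn skip_records<E>(
    to_skip: usize,
    mut fill: impl FnMut(usize) -> Result<usize, E>,
) -> Result<usize, E> {
    let mut skipped = 0;
    while skipped < to_skip {
        // Ask for at most 1024 records, and never more than remain.
        let read = fill((to_skip - skipped).min(1024))?;
        if read == 0 {
            // End of input: stop quietly instead of looping forever
            // or returning an error.
            break;
        }
        skipped += read;
    }
    Ok(skipped)
}
```

With this shape, an offset past the end of the input simply results in fewer records skipped than requested, and the caller goes on to produce no batches.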

Contributor

@alamb alamb left a comment

I went through the code carefully -- thank you @tustvold

My only question about this PR is whether there is sufficient test coverage that feeds data in small, quasi-random buffer sizes to cover all the decoding corner cases (i.e. resuming decoding from wherever the previous call left off).
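
One possible shape for such a test, as a sketch only: the `RecordCounter` decoder, the `count_chunked` driver, and the `chunking_is_transparent` check below are all hypothetical and use only the standard library. The chunk sizes come from a small linear congruential generator so that each seed is reproducible; a real test would exercise the actual decoder instead of this toy.

```rust
/// Hypothetical toy decoder: counts newline-terminated records, treating
/// newlines inside double quotes as data. Not the arrow-csv implementation.
#[derive(Default)]
struct RecordCounter {
    /// Whether the previous bytes left us inside a quoted field.
    in_quotes: bool,
    /// Number of complete records seen so far.
    records: usize,
}

impl RecordCounter {
    fn decode(&mut self, buf: &[u8]) {
        for &b in buf {
            match b {
                b'"' => self.in_quotes = !self.in_quotes,
                b'\n' if !self.in_quotes => self.records += 1,
                _ => {}
            }
        }
    }
}

/// Feeds `input` to a fresh decoder in chunks of 1..=max_chunk bytes.
/// `max_chunk` must be non-zero.
fn count_chunked(input: &[u8], mut seed: u64, max_chunk: usize) -> usize {
    let mut decoder = RecordCounter::default();
    let mut rest = input;
    while !rest.is_empty() {
        // Linear congruential step; the high bits pick the chunk size.
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let size = 1 + (seed >> 33) as usize % max_chunk;
        let (chunk, tail) = rest.split_at(size.min(rest.len()));
        decoder.decode(chunk);
        rest = tail;
    }
    decoder.records
}

/// Checks that every seed in `0..seeds` gives the same record count as
/// decoding the whole input in a single call.
fn chunking_is_transparent(input: &[u8], seeds: u64, max_chunk: usize) -> bool {
    let mut whole = RecordCounter::default();
    whole.decode(input);
    (0..seeds).all(|seed| count_chunked(input, seed, max_chunk) == whole.records)
}
```

Inputs with quoted newlines and doubled quotes are the interesting cases here, since a chunk boundary can fall between any two bytes of them.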

///
/// See [`Reader`] for a higher-level interface for interfacing with [`Read`]
///
/// The push-based interface facilitates integration with sources that yield arbitrarily
Contributor

👍

/// last call to [`Self::flush`], or `buf` is exhausted. Any remaining bytes
/// should be included in the next call to [`Self::decode`]
///
/// There is no requirement that `buf` contains a whole number of records, facilitating
Contributor

👍
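
The calling convention described in that doc comment (decode returns once the batch is full or `buf` is exhausted, and unconsumed bytes are passed again) might be driven roughly as follows. This is a sketch under stated assumptions: `BatchDecoder` and `decode_all` are hypothetical, standard-library-only stand-ins rather than the arrow-csv `Decoder`, and bytes are treated as single characters for simplicity.

```rust
/// Hypothetical toy decoder with a batch size; not the arrow-csv `Decoder`.
struct BatchDecoder {
    batch_size: usize,
    /// Record currently being assembled, possibly across calls.
    partial: String,
    /// Records completed since the last `flush`.
    batch: Vec<String>,
}

impl BatchDecoder {
    fn new(batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be non-zero");
        Self { batch_size, partial: String::new(), batch: Vec::new() }
    }

    /// Decodes from `buf`, returning the number of bytes consumed. Stops
    /// early once `batch_size` records are buffered; the caller should
    /// `flush` and then pass the unconsumed bytes again.
    fn decode(&mut self, buf: &[u8]) -> usize {
        let mut consumed = 0;
        for &byte in buf {
            if self.batch.len() == self.batch_size {
                break;
            }
            consumed += 1;
            if byte == b'\n' {
                self.batch.push(std::mem::take(&mut self.partial));
            } else {
                // Simplification: each byte is treated as one character.
                self.partial.push(byte as char);
            }
        }
        consumed
    }

    /// Returns the buffered batch, or `None` if no record is complete.
    fn flush(&mut self) -> Option<Vec<String>> {
        if self.batch.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.batch))
        }
    }
}

/// Drives the decoder over a sequence of chunks, collecting every batch.
fn decode_all(chunks: &[&[u8]], batch_size: usize) -> Vec<Vec<String>> {
    let mut decoder = BatchDecoder::new(batch_size);
    let mut batches = Vec::new();
    for chunk in chunks {
        let mut buf: &[u8] = chunk;
        while !buf.is_empty() {
            let consumed = decoder.decode(buf);
            buf = &buf[consumed..];
            if !buf.is_empty() {
                // The decoder stopped early because its batch is full.
                batches.extend(decoder.flush());
            }
        }
    }
    // Emit whatever complete records remain after the last chunk.
    batches.extend(decoder.flush());
    batches
}
```

Note that the chunks passed to `decode_all` need not align with record boundaries: a record split across two chunks is completed on the second call because the partial state lives on the decoder.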

arrow-csv/src/reader/records.rs
/// Clears the current contents of the decoder
pub fn clear(&mut self) {
// This does not reset current_field to allow clearing part way through a record
self.offsets_len = 1;
Contributor

I don't understand what the use case for clear is here. How would it clear part way through a record and then pick back up?

@tustvold tustvold merged commit d9c2681 into apache:master Jan 27, 2023
@ursabot

ursabot commented Jan 27, 2023

Benchmark runs are scheduled for baseline = 9728c67 and contender = d9c2681. d9c2681 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
arrow Changes to the arrow crate

3 participants