Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(rust): Add new streaming CSV source #19694

Merged
merged 2 commits into from
Nov 16, 2024

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Nov 8, 2024

Adds a simple CSV source - we have one thread scanning for line batches which get sent to other threads for parsing

The performance is mostly equivalent with the in-memory-engine, except for one very specific type of query -

q = pl.scan_csv(path_base + "yellow_tripdata_2015-01.csv").slice(10_000_000)
print(q.collect())
# shape: (2_748_986, 19)

# 9.60s user 1.47s system 1093% cpu 1.012 total # in-memory engine
# 2.23s user 0.41s system  414% cpu 0.638 total # new-streaming engine

@github-actions github-actions bot added internal An internal refactor or improvement rust Related to Rust Polars labels Nov 8, 2024
@nameexhaustion nameexhaustion added blocked Cannot be worked on due to external dependencies, or significant new internal features needed first and removed blocked Cannot be worked on due to external dependencies, or significant new internal features needed first labels Nov 8, 2024
@nameexhaustion nameexhaustion force-pushed the csv-source branch 2 times, most recently from 3fb0430 to 868d5a8 Compare November 13, 2024 11:06
@@ -14,7 +14,7 @@ use super::row_group_decode::RowGroupDecoder;
use super::{AsyncTaskData, ParquetSourceNode};
use crate::async_executor;
use crate::async_primitives::connector::connector;
use crate::async_primitives::wait_group::{WaitGroup, WaitToken};
use crate::async_primitives::wait_group::IndexedWaitGroup;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor drive-by refactors of parquet source

line_batch_receivers,
// TODO: Refactor so that we don't unwrap, it's currently hard because
// `ComputeNode::{initialize, spawn}` doesn't return a `PolarsResult`
Arc::new(self.try_init_chunk_reader().unwrap()),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unwrap() is not ideal, but I don't currently have any good ideas on how to avoid this

@nameexhaustion nameexhaustion marked this pull request as ready for review November 13, 2024 11:36
@nameexhaustion nameexhaustion marked this pull request as draft November 13, 2024 11:36
Copy link

codecov bot commented Nov 13, 2024

Codecov Report

Attention: Patch coverage is 1.52381% with 517 lines in your changes missing coverage. Please review.

Project coverage is 79.37%. Comparing base (8cb7839) to head (644dbd9).
Report is 41 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-stream/src/nodes/csv_source.rs 0.00% 486 Missing ⚠️
...s/polars-stream/src/async_primitives/wait_group.rs 0.00% 16 Missing ⚠️
py-polars/polars/io/csv/functions.py 37.50% 6 Missing and 4 partials ⚠️
...tes/polars-stream/src/nodes/parquet_source/init.rs 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19694      +/-   ##
==========================================
- Coverage   79.55%   79.37%   -0.19%     
==========================================
  Files        1544     1545       +1     
  Lines      213240   213735     +495     
  Branches     2441     2445       +4     
==========================================
+ Hits       169643   169647       +4     
- Misses      43048    43536     +488     
- Partials      549      552       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nameexhaustion nameexhaustion marked this pull request as ready for review November 14, 2024 08:08
@ritchie46
Copy link
Member

The performance is mostly equivalent with the in-memory-engine, except for one very specific type of query -

What do you think causes the difference?

@ritchie46 ritchie46 merged commit 4f3e828 into pola-rs:main Nov 16, 2024
26 checks passed
@nameexhaustion nameexhaustion deleted the csv-source branch November 18, 2024 08:21
@c-peters c-peters added the accepted Ready for implementation label Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation internal An internal refactor or improvement rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants