perf: Cache path resolving of scan functions #17616

Merged
merged 8 commits into from
Jul 15, 2024

@stinodego stinodego commented Jul 13, 2024

Closes #17584

Changes

  • Update the DslPlan after resolving paths. This avoids re-resolving the paths upon repeated collect calls on the same LazyFrame.
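
A minimal sketch of the idea in the bullet above, not the actual Polars implementation: the scan node keeps its paths in shared, mutex-guarded state together with an "already expanded" flag, so the first resolution writes its result back and later collects reuse it. `ScanPaths`, `resolve_paths`, and the stand-in `expand` function are hypothetical names, and the `EXPANSIONS` counter exists only to make the caching observable.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a scan node's path state: the paths plus a flag
/// recording whether they have already been expanded.
#[derive(Clone)]
pub struct ScanPaths {
    inner: Arc<Mutex<(Vec<PathBuf>, bool)>>,
}

/// Counts how often the (pretend) expensive expansion runs.
pub static EXPANSIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical expansion step; a real one would resolve globs or list directories.
fn expand(paths: &[PathBuf]) -> Vec<PathBuf> {
    EXPANSIONS.fetch_add(1, Ordering::SeqCst);
    paths.iter().map(|p| p.join("part-0.parquet")).collect()
}

impl ScanPaths {
    pub fn new(paths: Vec<PathBuf>) -> Self {
        Self { inner: Arc::new(Mutex::new((paths, false))) }
    }

    /// Return the expanded paths, expanding at most once per shared state.
    pub fn resolve_paths(&self) -> Vec<PathBuf> {
        let mut lock = self.inner.lock().unwrap();
        if !lock.1 {
            // First call: expand and store the result back into the shared state.
            let expanded = expand(&lock.0);
            *lock = (expanded, true);
        }
        lock.0.clone()
    }
}
```

Because clones of `ScanPaths` share the same `Arc`, a second resolution through a cloned plan sees the stored result and skips the expansion.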

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Jul 13, 2024
@ritchie46
Member

Yeah, got the same issue with hive partitioning. First one who knows what it is shares. ;) (afk atm)

I think the mutex is better here as the read time is super short.

@stinodego stinodego force-pushed the scan-cache-expansion branch from a85ece7 to ec90d1d Compare July 15, 2024 09:09

codecov bot commented Jul 15, 2024

Codecov Report

Attention: Patch coverage is 90.47619% with 4 lines in your changes missing coverage. Please review.

Project coverage is 80.68%. Comparing base (f90753b) to head (063a7c7).
Report is 1 commit behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| crates/polars-lazy/src/scan/ndjson.rs | 25.00% | 3 Missing ⚠️ |
| ...ates/polars-plan/src/plans/conversion/dsl_to_ir.rs | 96.42% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17616      +/-   ##
==========================================
- Coverage   80.69%   80.68%   -0.02%     
==========================================
  Files        1484     1484              
  Lines      195421   195453      +32     
  Branches     2782     2782              
==========================================
+ Hits       157695   157700       +5     
- Misses      37214    37241      +27     
  Partials      512      512              


@stinodego stinodego marked this pull request as ready for review July 15, 2024 12:26
} => expand_paths(&lock.0, file_options.glob, cloud_options.as_ref())?,
#[cfg(feature = "json")]
FileScan::NDJson { .. } => expand_paths(&lock.0, file_options.glob, None)?,
FileScan::Anonymous { .. } => lock.0.clone(), // Anonymous scans are already expanded.
Member

Suggested change
- FileScan::Anonymous { .. } => lock.0.clone(), // Anonymous scans are already expanded.
+ FileScan::Anonymous { .. } => unreachable!(), // Invariant: Anonymous scans are already expanded.
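
To illustrate the reasoning behind the suggested change above, here is a rough sketch under stated assumptions, not the code from this PR. It assumes anonymous scans are constructed with their paths already marked as expanded, so the expansion branch can never run for them and `unreachable!` documents that invariant rather than silently cloning. `ScanKind`, `Paths`, `for_scan`, and `resolve` are hypothetical names.

```rust
/// Hypothetical, reduced version of a scan-kind enum.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ScanKind {
    Csv,
    Anonymous,
}

/// Hypothetical path state: `expanded` is true once no further expansion is needed.
pub struct Paths {
    pub items: Vec<String>,
    pub expanded: bool,
}

impl Paths {
    /// Assumed invariant: anonymous scans are constructed already expanded.
    pub fn for_scan(kind: ScanKind, items: Vec<String>) -> Self {
        let expanded = matches!(kind, ScanKind::Anonymous);
        Self { items, expanded }
    }
}

/// Expand the paths on first use; later calls return the stored result.
pub fn resolve(kind: ScanKind, paths: &mut Paths) -> Vec<String> {
    if !paths.expanded {
        paths.items = match kind {
            ScanKind::Csv => paths.items.iter().map(|p| format!("{p}/0.csv")).collect(),
            // Never reached while the constructor invariant holds; panicking here
            // surfaces a broken invariant instead of hiding it.
            ScanKind::Anonymous => unreachable!("anonymous scans are already expanded"),
        };
        paths.expanded = true;
    }
    paths.items.clone()
}
```

With this shape, an anonymous scan skips the `match` entirely, so the panic can only fire if some other code path constructs an unexpanded anonymous scan.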

@ritchie46 ritchie46 merged commit c6e1d9e into main Jul 15, 2024
25 checks passed
@ritchie46 ritchie46 deleted the scan-cache-expansion branch July 15, 2024 18:57
Development

Successfully merging this pull request may close these issues.

New deferred path expansion logic greatly reduced io performance