-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support directory paths in scans for Parquet, IPC and CSV #17017
Conversation
CodSpeed Performance ReportMerging #17017 will not alter performanceComparing Summary
|
7487f00
to
0d4526b
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17017 +/- ##
==========================================
- Coverage 80.93% 80.92% -0.01%
==========================================
Files 1448 1448
Lines 190704 190710 +6
Branches 2723 2723
==========================================
+ Hits 154338 154340 +2
- Misses 35862 35866 +4
Partials 504 504 ☔ View full report in Codecov by Sentry. |
@@ -104,19 +205,9 @@ pub trait LazyFileListReader: Clone { | |||
true | |||
} | |||
|
|||
/// Path of the scanned file. | |||
/// It can be potentially a glob pattern. | |||
fn path(&self) -> &Path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the path
function from the trait - we just use a length-1 slice in paths
for single paths
@@ -191,9 +192,6 @@ def data_file( | |||
def test_scan( | |||
capfd: Any, monkeypatch: pytest.MonkeyPatch, data_file: _DataFile, force_async: bool | |||
) -> None: | |||
if data_file.path.suffix == ".csv" and force_async: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive-by - enable these parametric tests for CSV now it supports async
} | ||
|
||
// Todo: | ||
// This maintains existing behavior - will remove very soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change this and add tests in a follow-up PR
@@ -80,6 +81,7 @@ pub enum DslPlan { | |||
paths: Arc<[PathBuf]>, | |||
// Option as this is mostly materialized on the IR phase. | |||
file_info: Option<FileInfo>, | |||
hive_parts: Option<Vec<Arc<HivePartitions>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this out from file_info
- we store hive parts for every path in a Vec
that we can resolve up-front - this replaces the current approach of using update_hive_partitions
.
Can you do a rebase? 🙈 The physical engine is moved to its own crate. |
rebased 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool stuff. Left some questions.
This PR enables passing directory paths to
scan_(parquet|ipc|csv)
, which will be recursively traversed to load all files.There is also some refactoring around hive partition logic to prepare for future behavior changes.