Read partition columns of Hive dataset #404
Comments
Thanks, but I think this is a problem that should be solved on the Rust side, and am against adding workarounds to this package that are only possible on the R side.
Ha, I've just seen that this functionality has now been added on the Rust side! pola-rs/polars#11284 This will solve a major headache that I, personally, have had so far with using Polars in real-life projects.
We have had a rough time trying to translate some deltalake, arrow, and other fancy connections that were implemented in Python via third-party Python packages. We are very happy when rust-polars implements such connections from the ground up; it makes them much easier for us to support. Just a thought.
Yeah, totally makes sense. However, since the Rust implementation has now been added, what's the best way to go about incorporating these updates on the r-polars side? There are a number of .rs files that have been changed. Do you want me to take a stab at pulling these in? I must admit that I'm not entirely sure about the mapping of upstream Rust changes to the current r-polars file structure, so I'll probably be quite inefficient. Any guidance etc. would be much appreciated!
I think these changes will be incorporated when the rust-polars dependency in @grantmcdermott in the meantime, I'm interested in having your custom function in
#334 was more labour-intensive than average :)
Thanks @etiennebacher. Will do when I get a sec.
The breaking changes in (Rust) polars are not sufficiently documented in the changelog, so we have to read a lot of commits and source code to understand the changes.
I can try to take a stab at -> 0.33 also. I don't use the changelog too much, as there is too much to read ^^. I mostly look at the git blame view of the corresponding py-polars implementation of something where we now have a compiler error or unit test error. All we have to automatically monitor changes is compiler errors and unit tests.
@grantmcdermott it is now possible to concat a list of LazyFrames (#407) in case you want to update your function.
Super, thanks for the HU @etiennebacher. I haven't forgotten about that potential PR. A few too many plates spinning at the moment...
Great to see this update on the Rust side @etiennebacher! FWIW the vignette will need to be updated to reflect this change: Lines 470 to 474 in fbfc8b4
To import multiple files within the same directory, we can use the pattern globbing capabilities of `scan_parquet` and co. However, as we have documented in the "Data import" section of the intro vignette, this globbing strategy unfortunately doesn't recognize the partition columns (directories) of a Hive-style dataset, i.e. those of the form `parentdir/subdir1=value1/subsubdir2=value2/data.parquet`. This is particularly limiting for larger datasets, which are almost certainly going to be Hive-partitioned for efficient storage.
Example:
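To make "partition columns" concrete for a path of that form, here is a minimal, language-neutral sketch in plain Python. It is not the r-polars or Polars API, and `hive_partitions` is a hypothetical helper name used only for illustration. It simply pulls the `key=value` directory components out of a file path, which is the information a plain glob over the files leaves behind.

```python
from pathlib import PurePosixPath


def hive_partitions(path):
    """Return the key=value directory components of a Hive-style path.

    Hypothetical helper for illustration only; values are kept as strings.
    """
    parts = {}
    # Only directories can be partitions, so skip the final file name.
    for component in PurePosixPath(path).parts[:-1]:
        key, sep, value = component.partition("=")
        if sep and key:  # keep only components shaped like key=value
            parts[key] = value
    return parts


example = "parentdir/subdir1=value1/subsubdir2=value2/data.parquet"
print(hive_partitions(example))
# {'subdir1': 'value1', 'subsubdir2': 'value2'}
```

A reader that understands Hive partitioning would attach these pairs to every row of `data.parquet` as extra columns named `subdir1` and `subsubdir2`.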
Note that this is an upstream issue affecting Polars main, although py-polars does have `pl.scan_pyarrow_dataset` as a workaround. See pola-rs/polars#4347 and pola-rs/polars#10276.
As a workaround in r-polars, I've written a little `read_hive` function that seems to do a pretty good job of recognizing and appending the partition column types for Hive-partitioned datasets.
One nice feature of this function is that it correctly parses integer versus float types where possible, avoiding unnecessary memory overhead. It also does so upfront to avoid type mismatches when the sub-datasets are finally concatenated. Quick examples:
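The idea of settling on one type per partition column upfront can be sketched in plain Python as follows. This is an assumption-laden illustration, not the actual `read_hive` implementation: `infer_type` and `typed_partitions` are hypothetical names, and the rule shown (try integer, then float, else keep strings) is just one plausible way to do it.

```python
def infer_type(values):
    """Pick the narrowest of int, float, str that fits every value.

    Hypothetical rule for illustration. Note that Python's float() also
    accepts strings such as "nan" and "inf".
    """
    def fits(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return int
    if fits(float):
        return float
    return str


def typed_partitions(rows):
    """Cast raw partition strings using one type per column.

    rows: list of dicts (one per file) mapping partition key -> raw string.
    The type is decided across all files first, so every file ends up with
    the same column types before any concatenation happens.
    """
    keys = list(rows[0].keys()) if rows else []
    casts = {k: infer_type([r[k] for r in rows]) for k in keys}
    return [{k: casts[k](r[k]) for k in keys} for r in rows]


rows = [{"year": "2022", "rate": "1"}, {"year": "2023", "rate": "2.5"}]
print(typed_partitions(rows))
# [{'year': 2022, 'rate': 1.0}, {'year': 2023, 'rate': 2.5}]
```

Here `year` stays an integer, while `rate` becomes a float in both files because one of its values is `2.5`. Casting each file on its own would instead give an integer `rate` in the first file and a float in the second, which is the kind of mismatch that surfaces at concatenation time.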
Is there interest in adding this function (or some variant thereof) to the r-polars codebase? If so, let me know and I can put in a PR.
P.S. As noted in the function comments, ideally we can get rid of some scan versus read control-flow once a `concat` method for LazyFrames is enabled (#386).