Enable dataset writer to write hive partitioned parquet datasets #11500
and streaming from lazy...ya know, while you're at it.
This should be the other side of pl.scan_pyarrow_dataset (maybe pl.write_pyarrow_dataset).
Can't you just do df.to_arrow() and then wrap it inside pyarrow.dataset.dataset?
Well, that's what write_parquet does if you set those parameters. It can't stream, though, so if you've got a LazyFrame you can't stream it into a dataset.
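For reference, a minimal sketch of what that looks like today, assuming a recent Polars where write_parquet forwards pyarrow_options to pyarrow's partitioned writer:

```python
import polars as pl

df = pl.DataFrame(
    {
        "year": [2022, 2022, 2023],
        "month": [1, 2, 1],
        "value": [1.0, 2.0, 3.0],
    }
)

# Delegate the write to pyarrow; partition_cols produces a hive-style layout
# such as dataset/year=2022/month=1/<file>.parquet. Note this materializes
# the whole DataFrame first, so nothing is streamed.
df.write_parquet(
    "dataset/",
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["year", "month"]},
)
```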
Does someone have an example showing how you would take a Polars DataFrame and write out parquets by partition? I am running into this use case frequently when converting big datasets into smaller pieces for deep learning ingestion.
@uditrana you can use
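One way to do this, sketched under the assumption that DataFrame.partition_by (which splits a frame into one sub-frame per key) is what's being suggested:

```python
from pathlib import Path

import polars as pl

df = pl.DataFrame(
    {
        "label": ["cat", "dog", "cat", "dog"],
        "feature": [0.1, 0.2, 0.3, 0.4],
    }
)

out = Path("dataset")
# Split into one sub-frame per partition value and write each one into a
# hive-style directory: dataset/label=cat/, dataset/label=dog/, ...
for part in df.partition_by("label"):
    label = part["label"][0]
    part_dir = out / f"label={label}"
    part_dir.mkdir(parents=True, exist_ok=True)
    part.write_parquet(part_dir / "data.parquet")
```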
If I pass in additional kwargs (along with
Yup, it just uses
May I suggest using delta-rs instead, combined with delta torch?
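For context, a sketch of that route via Polars' write_delta. This assumes the deltalake package is installed and that delta_write_options is forwarded to deltalake.write_deltalake, which accepts partition_by; check the docs for your versions:

```python
import polars as pl

df = pl.DataFrame(
    {
        "label": ["cat", "dog", "cat"],
        "feature": [0.1, 0.2, 0.3],
    }
)

# Write a partitioned Delta table; the options dict is passed through to
# deltalake.write_deltalake (partition_by here is an assumption about that
# function's signature, not something Polars itself defines).
df.write_delta(
    "delta_dataset",
    mode="overwrite",
    delta_write_options={"partition_by": ["label"]},
)
```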
Is streaming supported for
Hi, quick question: is the aim of this issue to not rely on pyarrow? I've found that using pyarrow to write partitioned datasets can be memory intensive, and I thought doing it on the Rust side could be beneficial on low-memory systems. Thanks for the hard work!
Are you referring to ds.write_dataset()? If yes, then it has some drawbacks, too: apache/arrow#39768
Yes
Probably not; pyarrow is C and C++, so I wouldn't think it'd be materially different in memory usage from Rust. It may be implemented differently, but it's not like pyarrow is implemented in pure Python.
How to
Is there any option to make
@Smotrov I don't believe so. Other formats are on the list in:
Perhaps JSON can be added. |
But sink_parquet still doesn't support hive partitioning, which leads to two bottlenecks:
So could we reopen this, or should I create a new issue? Because I didn't find such
I guess you should create a new one, and reference this one there.
The maintainer said that it will be in Polars 2.0.
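In the meantime, one possible workaround (a sketch, assuming pyarrow is acceptable as a dependency and using placeholder column names) is to keep the query lazy, collect with the streaming engine, and hand the result to pyarrow for the partitioned write:

```python
import polars as pl
import pyarrow.dataset as ds

lf = (
    pl.scan_parquet("big_input/*.parquet")
    .filter(pl.col("value") > 0)
)

# Collect with the streaming engine (newer Polars versions spell this
# engine="streaming"), then convert to Arrow and let pyarrow write the
# hive-partitioned dataset. The collect step still holds the full result
# in memory, which is exactly the bottleneck described above.
table = lf.collect(streaming=True).to_arrow()
ds.write_dataset(
    table,
    "dataset/",
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",
)
```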
Description
Enable dataset writer to write hive partitioned datasets (writing all / many destination files simultaneously).
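For illustration only, the kind of API this issue is asking for might look roughly like the sketch below; the method name and parameters are hypothetical, not something Polars provides:

```python
import polars as pl

lf = pl.scan_csv("big_input/*.csv")

# Hypothetical: stream the query result directly into a hive-partitioned
# parquet dataset, keeping many destination files open and writing them as
# data flows through, instead of materializing the full result first.
lf.sink_parquet_dataset(  # hypothetical method name
    "dataset/",
    partition_by=["year", "month"],
)
```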