Enable dataset writer to write hive partitioned parquet datasets #11500

Closed
lmocsi opened this issue Oct 4, 2023 · 21 comments

@lmocsi commented Oct 4, 2023

Description

Enable the dataset writer to write Hive-partitioned datasets, i.e. write all (or many) destination files simultaneously.

@lmocsi added the enhancement label Oct 4, 2023
@deanm0000 (Collaborator)

and streaming from lazy...ya know, while you're at it.

@lmocsi (Author) commented Oct 4, 2023

This should be the other side of pl.scan_pyarrow_dataset (maybe pl.write_pyarrow_dataset).

@deanm0000 (Collaborator)

write_parquet already takes arguments to use the pyarrow dataset writer; see the example at https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html.

@ion-elgreco (Contributor)

> This should be the other side of pl.scan_pyarrow_dataset (maybe pl.write_pyarrow_dataset).

Can't you just do df.to_arrow() and then wrap it inside pyarrow.dataset.dataset?
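For reference, a minimal sketch of that route (the out_dataset path and year column are just examples; pyarrow's ds.write_dataset performs the actual partitioned write):

```python
import polars as pl
import pyarrow.dataset as ds

df = pl.DataFrame({
    "year": [2023, 2023, 2024],
    "value": [1.0, 2.0, 3.0],
})

# Convert to a pyarrow Table and let pyarrow write the hive layout:
# out_dataset/year=2023/..., out_dataset/year=2024/...
ds.write_dataset(
    df.to_arrow(),
    "out_dataset",
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",
)
```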

@deanm0000 (Collaborator)

> This should be the other side of pl.scan_pyarrow_dataset (maybe pl.write_pyarrow_dataset).
>
> Can't you just do df.to_arrow() and then wrap it inside pyarrow.dataset.dataset?

Well, that's what write_parquet does if you set those parameters. It can't stream, though, so if you've got a LazyFrame you can't stream it into a dataset.

@uditrana

Does someone have an example showing how to take a Polars DataFrame and write out Parquet files by partition?

I run into this use case frequently when converting big datasets into smaller pieces for deep-learning ingestion.

@deanm0000 (Collaborator)

@uditrana you can use df.write_parquet with use_pyarrow=True and pass pyarrow_options a dict with a partition_cols key: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html#polars.DataFrame.write_parquet
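A minimal sketch of that (out_dataset and the column names are just examples):

```python
import polars as pl

df = pl.DataFrame({
    "year": [2023, 2023, 2024],
    "month": [1, 2, 1],
    "value": [1.0, 2.0, 3.0],
})

# Delegates to pyarrow and writes out_dataset/year=.../month=.../*.parquet
df.write_parquet(
    "out_dataset",
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["year", "month"]},
)
```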

@uditrana

If I pass additional kwargs (along with partition_cols) to pyarrow_options, they should match the signature of pyarrow.parquet.write_to_dataset, not pyarrow.parquet.write_table, correct?

@deanm0000 (Collaborator)

Yup, it just uses pyarrow.parquet.write_to_dataset under the hood.
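So, continuing the df from the sketch above, extra keys in pyarrow_options should be valid pyarrow.parquet.write_to_dataset arguments; existing_data_behavior is one such argument in recent pyarrow versions (check your version's docs):

```python
# Extra pyarrow_options keys are forwarded to pyarrow.parquet.write_to_dataset.
df.write_parquet(
    "out_dataset",
    use_pyarrow=True,
    pyarrow_options={
        "partition_cols": ["year", "month"],
        # a write_to_dataset argument, not a write_table one:
        "existing_data_behavior": "delete_matching",
    },
)
```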

@ion-elgreco (Contributor) commented Oct 24, 2023

May I suggest using delta-rs instead, combined with deltatorch?
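For completeness, a sketch of that route, assuming the deltalake package is installed (delta_table is a hypothetical path; partition_by is forwarded to deltalake.write_deltalake):

```python
import polars as pl

df = pl.DataFrame({
    "year": [2023, 2023, 2024],
    "value": [1.0, 2.0, 3.0],
})

# Writes a partitioned Delta table that deltatorch can then read.
df.write_delta(
    "delta_table",
    delta_write_options={"partition_by": ["year"]},
)
```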

@jmakov commented Dec 15, 2023

> May I suggest using delta-rs instead, combined with deltatorch?

Is streaming supported for .scan_delta?

@stinodego added the A-io-partitioning and accepted labels Mar 29, 2024
@github-project-automation bot moved this to Ready in Backlog Mar 29, 2024
@29antonioac (Contributor)

Hi, quick question: is the aim of this issue to not rely on pyarrow? I've found that using pyarrow to write partitioned datasets can be memory-intensive, and I thought doing it on the Rust side could be beneficial on low-memory systems.

Thanks for the hard work!

@lmocsi (Author) commented Apr 7, 2024

> Yup, it just uses pyarrow.parquet.write_to_dataset under the hood.

Are you referring to ds.write_dataset()? If yes, it has some drawbacks, too: apache/arrow#39768

@deanm0000 (Collaborator)

> Hi, quick question: is the aim of this issue to not rely on pyarrow?

Yes

> I've found that using pyarrow to write partitioned datasets can be memory-intensive, and I thought doing it on the Rust side could be beneficial on low-memory systems.

Probably not; pyarrow is C and C++, so I wouldn't expect it to be materially different from Rust in memory usage. It may be implemented differently, but it's not as if pyarrow were written in pure Python.

@eromoe commented Apr 12, 2024

How can I sink_parquet an aggregation with partitions? My dataset is very large; I want to split it into year/month partitions, but I found that the write_parquet method requires collecting first...
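Until partitioned sinks exist, one workaround is a streaming sink per partition key. A sketch, assuming a hypothetical big_input.parquet with year/month columns (note it rescans the source once per key):

```python
from pathlib import Path

import polars as pl

lf = pl.scan_parquet("big_input.parquet")  # hypothetical source

# One streaming sink per partition key, so each pass only materializes
# the rows for a single partition.
keys = lf.select(["year", "month"]).unique().collect()
for year, month in keys.iter_rows():
    out_dir = Path(f"out/year={year}/month={month}")
    out_dir.mkdir(parents=True, exist_ok=True)
    lf.filter(
        (pl.col("year") == year) & (pl.col("month") == month)
    ).sink_parquet(out_dir / "data.parquet")
```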

@Smotrov commented May 8, 2024

Is there any option to make sink_json partitioned? I usually have huge datasets.
If it could sink to cloud storage (AWS S3), that would be amazing.

@cmdlineluser (Contributor)

@Smotrov I don't believe so.

Other formats are on the list in:

Support Hive partitioning logic in other readers besides Parquet

Perhaps JSON can be added.

@nameexhaustion self-assigned this Jul 5, 2024
@nameexhaustion changed the title from "Enable dataset writer to write hive partitioned datasets" to "Enable dataset writer to write hive partitioned parquet datasets" Jul 8, 2024
@nameexhaustion (Collaborator) commented Jul 8, 2024

Support for writing hive-partitioned Parquet was added in #17324.
IPC support is tracked in #17481.
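For readers landing here later, a minimal sketch of the native API from that PR, assuming a Polars version that includes it (the partition_by parameter on write_parquet):

```python
import polars as pl

df = pl.DataFrame({
    "year": [2023, 2023, 2024],
    "value": [1.0, 2.0, 3.0],
})

# Native hive-partitioned write; no pyarrow required.
# Produces out_dataset/year=2023/... and out_dataset/year=2024/...
df.write_parquet("out_dataset", partition_by=["year"])
```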

@github-project-automation bot moved this from Ready to Done in Backlog Jul 8, 2024
@EpicUsaMan

> Support for writing hive-partitioned Parquet was added in #17324. IPC support is tracked in #17481.

But sink_parquet still doesn't support hive partitioning, which leads to two bottlenecks:

  • compression runs in a single thread
  • the SSD is not fully utilized while writing with sink_parquet

So could we reopen this, or should I create a new issue? I didn't find an existing one.

@lmocsi (Author) commented Aug 26, 2024

I guess you should create a new one and reference this one there.

@EpicUsaMan

> I guess you should create a new one and reference this one there.

A maintainer said it will be in Polars 2.0.
