Enable dataset writer to write hive partitioned parquet datasets #11500

Closed
lmocsi opened this issue Oct 4, 2023 · 21 comments

@lmocsi commented Oct 4, 2023

Description

Enable the dataset writer to write Hive-partitioned datasets, i.e. write all (or many) destination files simultaneously.

@lmocsi added the enhancement label Oct 4, 2023
@deanm0000 (Collaborator)

and streaming from lazy...ya know, while you're at it.

@lmocsi (Author) commented Oct 4, 2023

This should be the other side of pl.scan_pyarrow_dataset (maybe pl.write_pyarrow_dataset).

@deanm0000 (Collaborator)

write_parquet already takes arguments to use the pyarrow dataset writer; see the example at https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html.

@ion-elgreco (Contributor)

> This should be the other side of pl.scan_pyarrow_dataset (maybe pl.write_pyarrow_dataset).

Can't you just do df.to_arrow() and then wrap it inside pyarrow.dataset.dataset?
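For reference, a minimal sketch of that route (the out_dataset path and year column are just examples; pyarrow's ds.write_dataset performs the actual partitioned write):

```python
import polars as pl
import pyarrow.dataset as ds

df = pl.DataFrame({
    "year": [2023, 2023, 2024],
    "value": [1.0, 2.0, 3.0],
})

# Convert to a pyarrow Table and let pyarrow write the hive layout:
# out_dataset/year=2023/..., out_dataset/year=2024/...
ds.write_dataset(
    df.to_arrow(),
    "out_dataset",
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",
)
```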

@deanm0000 (Collaborator)

> This should be the other side of pl.scan_pyarrow_dataset (maybe pl.write_pyarrow_dataset).
>
> Can't you just do df.to_arrow() and then wrap it inside pyarrow.dataset.dataset?

Well, that's what write_parquet does if you set those parameters. It can't stream, though, so if you've got a LazyFrame you can't stream it into a dataset.

@uditrana

Does someone have an example showing how to take a Polars DataFrame and write out Parquet files by partition?

I run into this use case frequently when converting big datasets into smaller pieces for deep-learning ingestion.

@deanm0000 (Collaborator)

@uditrana you can use df.write_parquet with use_pyarrow=True and pass pyarrow_options a dict with a partition_cols key: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html#polars.DataFrame.write_parquet
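A minimal sketch of that (out_dataset and the column names are just examples):

```python
import polars as pl

df = pl.DataFrame({
    "year": [2023, 2023, 2024],
    "month": [1, 2, 1],
    "value": [1.0, 2.0, 3.0],
})

# Delegates to pyarrow and writes out_dataset/year=.../month=.../*.parquet
df.write_parquet(
    "out_dataset",
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["year", "month"]},
)
```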

@uditrana

If I pass additional kwargs (along with partition_cols) to pyarrow_options, they should match the signature of pyarrow.parquet.write_to_dataset, not pyarrow.parquet.write_table, correct?

@deanm0000 (Collaborator)

Yup, it just uses pyarrow.parquet.write_to_dataset under the hood.
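So, continuing the df from the sketch above, extra keys in pyarrow_options should be valid pyarrow.parquet.write_to_dataset arguments; existing_data_behavior is one such argument in recent pyarrow versions (check your version's docs):

```python
# Extra pyarrow_options keys are forwarded to pyarrow.parquet.write_to_dataset.
df.write_parquet(
    "out_dataset",
    use_pyarrow=True,
    pyarrow_options={
        "partition_cols": ["year", "month"],
        # a write_to_dataset argument, not a write_table one:
        "existing_data_behavior": "delete_matching",
    },
)
```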

@ion-elgreco (Contributor) commented Oct 24, 2023

May I suggest using delta-rs instead, combined with deltatorch?
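For completeness, a sketch of that route, assuming the deltalake package is installed (delta_table is a hypothetical path; partition_by is forwarded to deltalake.write_deltalake):

```python
import polars as pl

df = pl.DataFrame({
    "year": [2023, 2023, 2024],
    "value": [1.0, 2.0, 3.0],
})

# Writes a partitioned Delta table that deltatorch can then read.
df.write_delta(
    "delta_table",
    delta_write_options={"partition_by": ["year"]},
)
```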

@jmakov commented Dec 15, 2023

> May I suggest using delta-rs instead, combined with deltatorch?

Is streaming supported for .scan_delta?

@stinodego added the A-io-partitioning and accepted labels Mar 29, 2024
@github-project-automation bot moved this to Ready in Backlog Mar 29, 2024
@29antonioac (Contributor)

Hi, quick question: is the aim of this issue to not rely on pyarrow? I've found that using pyarrow to write partitioned datasets can be memory-intensive, and I thought doing it on the Rust side could be beneficial on low-memory systems.

Thanks for the hard work!

@lmocsi (Author) commented Apr 7, 2024

> Yup, it just uses pyarrow.parquet.write_to_dataset under the hood.

Are you referring to ds.write_dataset()? If yes, it has some drawbacks, too: apache/arrow#39768

@deanm0000 (Collaborator)

> Hi, quick question: is the aim of this issue to not rely on pyarrow?

Yes

> I've found that using pyarrow to write partitioned datasets can be memory-intensive, and I thought doing it on the Rust side could be beneficial on low-memory systems.

Probably not; pyarrow is C and C++, so I wouldn't expect it to be materially different from Rust in memory usage. It may be implemented differently, but it's not as if pyarrow were written in pure Python.

@eromoe commented Apr 12, 2024

How can I sink_parquet an aggregation with partitions? My dataset is very large; I want to split it into year/month partitions, but I found that the write_parquet method requires collecting first...
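Until partitioned sinks exist, one workaround is a streaming sink per partition key. A sketch, assuming a hypothetical big_input.parquet with year/month columns (note it rescans the source once per key):

```python
from pathlib import Path

import polars as pl

lf = pl.scan_parquet("big_input.parquet")  # hypothetical source

# One streaming sink per partition key, so each pass only materializes
# the rows for a single partition.
keys = lf.select(["year", "month"]).unique().collect()
for year, month in keys.iter_rows():
    out_dir = Path(f"out/year={year}/month={month}")
    out_dir.mkdir(parents=True, exist_ok=True)
    lf.filter(
        (pl.col("year") == year) & (pl.col("month") == month)
    ).sink_parquet(out_dir / "data.parquet")
```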

@Smotrov commented May 8, 2024

Is there any option to make sink_json partitioned? I usually have huge datasets.
If it could sink to cloud storage (AWS S3), that would be amazing.

@cmdlineluser (Contributor)

@Smotrov I don't believe so.

Other formats are on the list in:

Support Hive partitioning logic in other readers besides Parquet

Perhaps JSON can be added.

@nameexhaustion self-assigned this Jul 5, 2024
@nameexhaustion changed the title from "Enable dataset writer to write hive partitioned datasets" to "Enable dataset writer to write hive partitioned parquet datasets" Jul 8, 2024
@nameexhaustion (Collaborator) commented Jul 8, 2024

Support for writing hive-partitioned Parquet was added in #17324.
IPC support is tracked in #17481.
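For readers landing here later, a minimal sketch of the native API from that PR, assuming a Polars version that includes it (the partition_by parameter on write_parquet):

```python
import polars as pl

df = pl.DataFrame({
    "year": [2023, 2023, 2024],
    "value": [1.0, 2.0, 3.0],
})

# Native hive-partitioned write; no pyarrow required.
# Produces out_dataset/year=2023/... and out_dataset/year=2024/...
df.write_parquet("out_dataset", partition_by=["year"])
```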

@github-project-automation bot moved this from Ready to Done in Backlog Jul 8, 2024
@EpicUsaMan

> Support for writing hive-partitioned Parquet was added in #17324. IPC support is tracked in #17481.

But sink_parquet still doesn't support hive partitioning, which leads to two bottlenecks:

  • compression runs in a single thread
  • the SSD is not fully utilized while writing with sink_parquet

So could we reopen this, or should I create a new issue? I didn't find an existing one.

@lmocsi (Author) commented Aug 26, 2024

I guess you should create a new one and reference this one there.

@EpicUsaMan

> I guess you should create a new one and reference this one there.

A maintainer said it will be in Polars 2.0.
