feat: Support writing hive partitioned parquet #17324
Conversation
nameexhaustion commented Jul 1, 2024 (edited)
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #17324      +/-   ##
==========================================
+ Coverage   80.58%   80.60%   +0.02%
==========================================
  Files        1480     1480
  Lines      193682   193822     +140
  Branches     2765     2769       +4
==========================================
+ Hits       156071   156224     +153
+ Misses      37103    37089      -14
- Partials      508      509       +1

☔ View full report in Codecov by Sentry.
force-pushed from 86340fc to bd7a49a
Looking forward to this one! ✌️
force-pushed from bd7a49a to 05a3790
@@ -438,6 +438,51 @@ impl PyDataFrame {
        Ok(())
    }

    #[cfg(feature = "parquet")]
    #[pyo3(signature = (py_f, partition_by, compression, compression_level, statistics, row_group_size, data_page_size))]
    pub fn write_parquet_partitioned(
Can we create the full `write_partitioned_dataset` in `polars-io`? Python should not have more functionality than Rust, so we should lower this and just dispatch here.
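A minimal sketch of the layering being asked for, with illustrative names only (the `core_io` module stands in for `polars-io`, and `python_bindings` for the PyO3 wrapper; neither is the PR's actual code):

```rust
// Sketch only: the real implementation lives in the core crate, and the
// Python-facing wrapper has no logic of its own beyond argument passing.
mod core_io {
    pub fn write_partitioned_dataset(path: &str, partition_by: &[String]) -> Result<(), String> {
        // partitioning + parquet writing would live here
        println!("writing partitioned dataset to {path}, partitioned by {partition_by:?}");
        Ok(())
    }
}

mod python_bindings {
    // Stand-in for the #[pyo3] method: no logic, just dispatch into the core crate.
    pub fn write_parquet_partitioned(path: &str, partition_by: &[String]) -> Result<(), String> {
        super::core_io::write_partitioned_dataset(path, partition_by)
    }
}

fn main() {
    python_bindings::write_parquet_partitioned("out/dataset", &["year".to_string()]).unwrap();
}
```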
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created it - made a comment below
crates/polars-io/src/partition.rs
Outdated
where
    S: AsRef<str>,
{
    for x in partition_by.iter() {
If the partition length is large, we should first collect the schema, otherwise we have quadratic performance here.
Yes, I didn't realize before that indexing the DataFrame by column name was linear access time.
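A rough sketch of the idea, not the PR's actual code (plain slices stand in for the DataFrame schema): build the name-to-index lookup once, then resolve each partition column against it, instead of doing a linear column search per name.

```rust
use std::collections::HashMap;

// Build the lookup once (O(n_columns)), then resolve each partition key
// in O(1), so the whole resolution is linear instead of quadratic.
fn resolve_partition_columns(
    column_names: &[&str], // e.g. taken from the DataFrame's schema
    partition_by: &[&str],
) -> Result<Vec<usize>, String> {
    let lookup: HashMap<&str, usize> = column_names
        .iter()
        .enumerate()
        .map(|(i, name)| (*name, i))
        .collect();

    partition_by
        .iter()
        .map(|name| {
            lookup
                .get(name)
                .copied()
                .ok_or_else(|| format!("partition column not found: {name}"))
        })
        .collect()
}
```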
        return

    assert pl.thread_pool_size() == 1
This stopped working and I'm not sure why, but the test is valid as long as the prefetch size is set to 1.
let out: Box<dyn Iterator<Item = (String, DataFrame)>> = match groups {
    GroupsProxy::Idx(idx) => Box::new(idx.into_iter().map(move |(_, group)| {
        let part_df =
I think we should iterate over `DataFrame`s of a certain size, so that we don't write a single file per folder but, for large partitions, many smaller parquet files. I am not entirely sure how other tools determine the size of the parquet files. We could split by `n_rows`, using `estimated_size` as a hint?
I set it to 1 million rows per file for now
If you know the schema at this point (or rather, the number of cols) it's better to target a given number of elements (rows x cols), as "rows" by itself is not a useful metric.
1 million rows with 1 col is a full three orders of magnitude removed from 1 million rows with 1000 cols 😆
Somewhere between 10-25 million elements is probably going to be a more consistent target 🤔 (and using estimated size is even more helpful to avoid edge-cases like large binary blobs).
I see, I've changed it to slice the df into chunks of a target size.
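One possible way to turn a size budget into a per-file row count, as a hedged sketch (the helper name and constants are illustrative, not what the PR settled on): use the partition's estimated in-memory size as a proxy for bytes per row.

```rust
// Derive how many rows go into each file from a target size budget,
// using estimated_size / height as the bytes-per-row proxy.
fn rows_per_file(height: usize, estimated_size: usize, target_bytes: usize) -> usize {
    if height == 0 || estimated_size == 0 {
        return height.max(1);
    }
    let bytes_per_row = (estimated_size / height).max(1);
    (target_bytes / bytes_per_row).clamp(1, height)
}

fn main() {
    // e.g. ~1M rows estimated at ~8 GiB total, targeting ~512 MiB per file
    let n = rows_per_file(1_000_000, 8 * 1024 * 1024 * 1024, 512 * 1024 * 1024);
    println!("rows per file: {n}");
}
```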
force-pushed from 29b61aa to 1517147
force-pushed from 1517147 to 8bd264b
let Some(path) = paths.first() else {
    return Ok(None);
};

let sep = separator(path);
let path_string = path.to_str().unwrap();

fn parse_hive_string_and_decode(part: &'_ str) -> Option<(&'_ str, std::borrow::Cow<'_, str>)> {
    let (k, v) = parse_hive_string(part)?;
    let v = percent_encoding::percent_decode(v.as_bytes())
Drive-by: decode after splitting by `=`, otherwise we break when the value contains `/` or `=`.
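A small illustration of the ordering, as a sketch (assuming the `percent-encoding` crate; `parse_hive_part` is an illustrative stand-in, not the PR's `parse_hive_string_and_decode`): split the `key=value` segment first, then percent-decode the value, so an encoded `=` or `/` inside the value cannot be mistaken for a separator.

```rust
use std::borrow::Cow;

// Split first, decode second: "key=a%3Db" must yield ("key", "a=b").
// Decoding before the split would produce "key=a=b" and break parsing.
fn parse_hive_part(part: &str) -> Option<(&str, Cow<'_, str>)> {
    let (key, value) = part.split_once('=')?;
    let value = percent_encoding::percent_decode(value.as_bytes()).decode_utf8_lossy();
    Some((key, value))
}

fn main() {
    let (k, v) = parse_hive_part("key=a%3Db").unwrap();
    assert_eq!((k, v.as_ref()), ("key", "a=b"));
}
```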
@@ -127,3 +128,107 @@ where
    }
    path
}

pub fn write_partitioned_dataset<S>(
Created write_partitioned_dataset
here in polars-io
.
I was considering putting a fn write_parquet_partitioned
into impl DataFrame
, but I notice that on the rust side we don't have e.g. DataFrame::write_parquet
and others, so I just made it a function like this
    format!("{:013x}.parquet", i)
}

for (i, slice_start) in (0..part_df.height()).step_by(rows_per_file).enumerate() {
For a future PR we can see if we can speed this up by enabling parallelism/async here.
Nice, great addition. I think we can experiment with making this fast, but let's first get the core functionality in. 👍
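Purely speculative sketch of the kind of follow-up being discussed (the PR keeps the writes sequential): fanning the per-file writes out over a thread pool, here with `rayon`.

```rust
use rayon::prelude::*;

// Each (path, bytes) pair is one already-encoded parquet file; writing
// them out is embarrassingly parallel, so a parallel try_for_each over
// the pairs is enough to use all cores.
fn write_files(files: Vec<(std::path::PathBuf, Vec<u8>)>) -> std::io::Result<()> {
    files
        .into_par_iter()
        .try_for_each(|(path, bytes)| std::fs::write(path, bytes))
}
```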
force-pushed from 7f62043 to df69033