
Lacking support for DataFrameWriter's "maxRecordsPerFile" option? #781

Closed
noslowerdna opened this issue Sep 17, 2021 · 2 comments
Labels: acknowledged (This issue has been read and acknowledged by Delta admins), bug (Something isn't working)

Comments

@noslowerdna (Contributor)

It doesn't appear that the "maxRecordsPerFile" DataFrameWriter option, e.g. df.write.option("maxRecordsPerFile", 10000), is supported when using the "delta" format. However, the behavior can still be achieved by setting the spark.sql.files.maxRecordsPerFile configuration property in the SparkConf.
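For reference, a minimal sketch of that workaround in Spark Scala (the 10000 threshold, `df`, and the output path are illustrative placeholders, not from the issue):

```
// Workaround: set the SQL conf at the session level; it applies to all
// subsequent writes, including Delta, unlike the per-write option.
spark.conf.set("spark.sql.files.maxRecordsPerFile", "10000")

df.write
  .format("delta")
  .save("/tmp/delta/events")
```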

If verified and not considered a bug, could this be a simple enhancement to implement?

@scottsand-db added the acknowledged and bug labels on Oct 7, 2021
@scottsand-db (Collaborator)

Thanks for bringing this to our attention. We will look into this.

@zsxwing (Member) commented Oct 8, 2021

This is related to #652

jbguerraz pushed a commit to jbguerraz/delta that referenced this issue Jul 6, 2022
Today, parquet supports the [maxRecordsPerFile](apache/spark#16204) option to limit the max number of records written per file so that users can control the parquet file size to avoid humongous files. For example,

```
spark.range(100)
  .write
  .format("parquet")
  .option("maxRecordsPerFile", 5)
  .save(path)
```

The above code will generate 20 parquet files and each one contains 5 rows.

This is missing in Delta. This PR adds support for it in Delta by passing the `maxRecordsPerFile` option through to ParquetFileFormat.

Note: today both Delta and parquet support the SQL conf `spark.sql.files.maxRecordsPerFile` to control the file size. This PR is just adding the `DataFrameWriter` option support to mimic the parquet format behavior.
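With this change, the same option should also work for the Delta format; a minimal sketch mirroring the parquet example above (`path` is a placeholder):

```
// After this PR, the per-write option is forwarded to ParquetFileFormat,
// so each Delta data file written here contains at most 5 rows.
spark.range(100)
  .write
  .format("delta")
  .option("maxRecordsPerFile", 5)
  .save(path)
```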

Fixes delta-io#781

Closes delta-io#1017

Co-authored-by: Andrew Olson <aolson1@cerner.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 02af2c40457fe0acc76a31687e4fd6c47f3f2944