Make sure that writeOptions is used when passed to the writeFiles function #652

Conversation

mahmoudmahdi24
Contributor

While digging into the code, I realized that the writeFiles function inside TransactionalWrite has an unused writeOptions parameter.
This pull request changes the code so that DeltaOptions can be passed to the writeFiles function and its values are used during the write.
Before this change, whatever was passed as write options was never actually used.
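
For illustration, a minimal sketch of the pattern this change enables, assuming approximate Delta Lake internals as named in this thread (OptimisticTransaction, DeltaOptions, writeFiles) and assuming the code sits somewhere those internals are visible; the real signatures may differ:

```scala
// Sketch only: the exact Delta signatures may differ from what is shown here.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.delta.{DeltaOptions, OptimisticTransaction}
import org.apache.spark.sql.delta.actions.FileAction

// Hypothetical helper: forward caller-supplied write options to writeFiles
// instead of dropping them.
def writeFilesWithOptions(
    txn: OptimisticTransaction,
    data: DataFrame,
    userOptions: Map[String, String]): Seq[FileAction] = {
  val writeOptions =
    new DeltaOptions(userOptions, data.sparkSession.sessionState.conf)
  // Before this change the writeOptions argument was accepted but never read;
  // with the change it reaches the underlying file writer.
  txn.writeFiles(data, Some(writeOptions))
}
```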

@liwensun
Contributor

@mahmoudmahdi24 Thanks for making the PR. Is there any particular option that you wanted to pass in?

@mahmoudmahdi24
Contributor Author

Hello @liwensun
Thanks for your feedback. Not really; I just noticed that we had a function with an unused parameter (writeOptions), so I tried to put it to use.

@zsxwing
Member

zsxwing commented Oct 7, 2021

@mahmoudmahdi24 If you don't have a specific option, I'm inclined to close this. This change will enable users to pass parquet options to Delta, and may cause surprising behaviors.

zsxwing added the need author feedback label (Issue is waiting for the author to respond) on Oct 7, 2021
@noslowerdna
Contributor

@zsxwing I saw your comment on the issue I opened (#781). What kinds of surprising behaviors are you concerned about? It seems that someone setting Spark writer options would reasonably expect them to be honored by all DataFrameWriter format implementations.

@noslowerdna
Contributor

Hi @zsxwing, did you see my question?

@zsxwing
Member

zsxwing commented Mar 22, 2022

@noslowerdna Sorry, I missed your previous comment. My concern is about blindly passing options through to Parquet. If there were options that could affect the data being written, we would break compatibility.

In addition, the maxRecordsPerFile option does not seem to be documented anywhere on https://spark.apache.org/docs/latest/configuration.html. What I can find is the SQL conf spark.sql.files.maxRecordsPerFile.
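
For reference, a short sketch (standard Spark APIs, illustrative values and path) contrasting the documented session-wide conf with the per-write option being discussed here:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("maxRecordsPerFile-demo")
  .master("local[*]")
  .getOrCreate()

// Session-wide SQL conf: applies to every file-based write in the session.
spark.conf.set("spark.sql.files.maxRecordsPerFile", "10000")

// Per-write DataFrameWriter option: overrides the session conf for this
// write only (honored by Spark's built-in file sources since 2.2.0).
spark.range(100000).toDF("id")
  .write
  .option("maxRecordsPerFile", "2500")
  .parquet("/tmp/maxRecordsPerFile-demo")
```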

@noslowerdna
Contributor

noslowerdna commented Mar 22, 2022

@zsxwing Thanks for the response. I agree, it should not be blind. Let's handle just the maxRecordsPerFile and timeZone options specifically. Being able to supply a maxRecordsPerFile value has been a core feature of Spark since 2.2.0 and is not limited to Parquet; see SPARK-18775 and Spark PR #16204 for the technical details:

This patch introduces a new write config option maxRecordsPerFile (default to a session-wide setting spark.sql.files.maxRecordsPerFile) that limits the max number of records written to a single file. A non-positive value indicates there is no limit (same behavior as not having this flag).

The implementation of that configuration resolution can be found in the write method of the FileFormatWriter class, which Delta Lake invokes when writing files (the value is also obtained in the createWriteJobDescription method of the FileWrite class).
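
Paraphrased (not the exact Spark source), that fallback logic looks roughly like this: the case-insensitive per-write option wins, otherwise the session conf value is used.

```scala
// Rough paraphrase of the resolution pattern, not the exact Spark source:
// the per-write option (matched case-insensitively) takes precedence over
// the spark.sql.files.maxRecordsPerFile session conf.
def resolveMaxRecordsPerFile(
    writeOptions: Map[String, String],
    sessionConfValue: Long): Long = {
  val lowerCased = writeOptions.map { case (k, v) => k.toLowerCase -> v }
  lowerCased.get("maxrecordsperfile").map(_.toLong).getOrElse(sessionConfValue)
}
```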

Similarly, the timeZone option was also added in Spark 2.2.0 (SPARK-18939, Spark PR #17053). SPARK-19817 (PR #17281) clarifies:

Make it clear that timeZone option is a general option in DataFrameReader/Writer, DataStreamReader/Writer. As the time zone setting can also affect partition values, it works for all formats.
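
Put together, the expectation argued in this thread is that the same writer options should be honored when the format is delta. An illustrative call, reusing the SparkSession from the earlier sketch (example path, values, and derived partition column):

```scala
// Illustrative only: example path, values, and derived partition column.
spark.range(1000000)
  .selectExpr("id", "date_format(current_timestamp(), 'yyyy-MM-dd') AS dt")
  .write
  .format("delta")
  .option("maxRecordsPerFile", "50000") // cap records per output file
  .option("timeZone", "UTC")            // time zone applied to timestamp handling
  .partitionBy("dt")
  .save("/tmp/delta-write-options-demo")
```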

@noslowerdna
Contributor

@zsxwing I opened #1017 to address this in a cleaner way, as suggested. Let me know what you think.

@noslowerdna
Contributor

@mahmoudmahdi24 I would appreciate your feedback on #1017. Thanks!

@noslowerdna
Contributor

This PR should be closed; #1017 addresses this and has been merged.

@zsxwing
Member

zsxwing commented Apr 8, 2022

Thanks! Closing this one as #1017 addressed the issue.

zsxwing closed this on Apr 8, 2022