
[SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system #29976

Closed
wants to merge 1 commit into apache:master from MaxGekk:orc-option-propagation

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Oct 8, 2020

What changes were proposed in this pull request?

Propagate ORC options to Hadoop configs in Hive OrcFileFormat and in the regular ORC datasource.
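In essence, the fix builds the Hadoop `Configuration` used on the ORC read paths from the datasource options instead of from the session state alone. Below is a minimal sketch of the pattern, assuming Spark's `SessionState.newHadoopConfWithOptions` (which merges DS options into a fresh Hadoop conf); it is not the literal diff:

```scala
// Before (sketch): DS options never reach the Hadoop conf handed to the FileSystem.
// val hadoopConf = sparkSession.sessionState.newHadoopConf()

// After (sketch): DS options are merged in, so settings such as
// "fs.adl.oauth2.client.id" are visible when the FileSystem is initialized.
val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(options)
```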

Why are the changes needed?

There is a bug: when running

```scala
spark.read.format("orc").options(conf).load(path)
```

the underlying file system does not receive the `conf` options.

Does this PR introduce any user-facing change?

Yes. Hadoop configs passed as ORC datasource options now reach the underlying file system.

How was this patch tested?

Added a unit test to `OrcSourceSuite`.
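A test along these lines can exercise the propagation end to end. Below is a minimal sketch, assuming a suite with Spark's `SQLTestUtils`/`QueryTest` helpers; the fake FileSystem class and the `ds_option` key are illustrative, not necessarily the exact code added:

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.LocalFileSystem
import org.apache.spark.sql.Row

// Fails fast at initialization unless the DS option reached the Hadoop conf.
class FakeFileSystemRequiringDSOption extends LocalFileSystem {
  override def initialize(name: URI, conf: Configuration): Unit = {
    super.initialize(name, conf)
    require(conf.get("ds_option", "") == "value", "DS option was not propagated")
  }
}

test("SPARK-33094: propagate Hadoop confs from DS options to underlying file system") {
  withSQLConf(
      "fs.file.impl" -> classOf[FakeFileSystemRequiringDSOption].getName,
      "fs.file.impl.disable.cache" -> "true") {
    withTempPath { dir =>
      val path = "file:" + dir.getCanonicalPath
      val opts = Map("ds_option" -> "value")
      spark.range(1).write.options(opts).orc(path)
      checkAnswer(spark.read.options(opts).orc(path), Row(0L))
    }
  }
}
```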

@MaxGekk
Member Author

MaxGekk commented Oct 8, 2020

These changes are similar to #29971. @HyukjinKwon @yuningzh-db @dongjoon-hyun Please review this PR.

@MaxGekk
Member Author

MaxGekk commented Oct 8, 2020

I ran the test from this PR (changing only the format) on Parquet v1/v2, ORC v2, CSV v1/v2, JSON v1/v2, and Text v1/v2. It passed everywhere.

@SparkQA

SparkQA commented Oct 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34160/

@SparkQA

SparkQA commented Oct 8, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34160/

@MaxGekk
Member Author

MaxGekk commented Oct 8, 2020

I have looked at the build failures; they seem unrelated to the changes (some failures occurred while downloading artefacts).

@SparkQA

SparkQA commented Oct 8, 2020

Test build #129554 has finished for PR 29976 at commit ef6d7f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @MaxGekk and @HyukjinKwon .

The K8s IT test failure is unrelated to this PR.

- Test basic decommissioning *** FAILED ***

Merged to master for Apache Spark 3.1.0

@MaxGekk
Member Author

MaxGekk commented Oct 8, 2020

@dongjoon-hyun Can this be merged to branch-3.0 (maybe 2.4 too), since it can be considered a bug fix?
Here is a real use case: a customer tries to read files in Azure Data Lake:

```scala
def hadoopConf1() = Map[String, String](
  "fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  "fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  "fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  "fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("...").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```

but gets the following exception because the settings above are not propagated to the filesystem:

```
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
	at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
	at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
	at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
	at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```
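For context, before this fix such credentials only took effect when set session-wide on the shared Hadoop configuration rather than per read. A hedged illustration of that workaround (not taken from the PR):

```scala
// Assumed workaround: setting the conf globally affects every read in the
// session, unlike the per-read DS options enabled by this fix.
spark.sparkContext.hadoopConfiguration
  .set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
```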

@MaxGekk
Member Author

MaxGekk commented Oct 9, 2020

@HyukjinKwon You merged a similar fix for Avro to branch-3.0 in #29971. WDYT, should I open a PR with the changes for branch-3.0?

@HyukjinKwon
Member

So it fixes a bug, right? Sure, let's open a PR to port it back.

@dongjoon-hyun
Member

+1 for backporting, @MaxGekk and @HyukjinKwon .

MaxGekk added a commit to MaxGekk/spark that referenced this pull request Oct 9, 2020
[SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system

Propagate ORC options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC datasource.

There is a bug: when running
```scala
spark.read.format("orc").options(conf).load(path)
```
the underlying file system does not receive the conf options.

Yes

Added UT to `OrcSourceSuite`.

Closes apache#29976 from MaxGekk/orc-option-propagation.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c5f6af9)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Member Author

MaxGekk commented Oct 9, 2020

Here is the backport to branch-3.0: #29985

@MaxGekk
Member Author

MaxGekk commented Oct 15, 2020

Regarding #29976 (comment), I could move the test into a common trait and test all built-in datasources, including Avro, ORC, LibSVM, CSV and so on (sketched below). Let me know if you think it makes sense for improving test coverage. cc @gatorsmile @cloud-fan
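Such a shared trait might look roughly like the sketch below; the trait and member names are assumptions for illustration, not the actual code of the follow-up PR:

```scala
// Illustrative sketch of a shared trait for per-format option-propagation tests.
trait CommonFileDataSourceSuite { self: org.apache.spark.sql.QueryTest =>
  // Each concrete suite (Avro, ORC, CSV, ...) supplies its format name.
  protected def dataSourceFormat: String

  test(s"propagate Hadoop configs from $dataSourceFormat options to file system") {
    // Write/read via `dataSourceFormat` with options(...) and assert the
    // options appear in the FileSystem's Hadoop conf, as in the ORC test above.
  }
}
```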

@MaxGekk
Member Author

MaxGekk commented Oct 16, 2020

Here is the PR #30067 with the common test.

@MaxGekk MaxGekk deleted the orc-option-propagation branch December 11, 2020 20:28