
[SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system #29976

Closed
wants to merge 1 commit into apache:master from MaxGekk:orc-option-propagation

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Oct 8, 2020

What changes were proposed in this pull request?

Propagate ORC options to Hadoop configs in Hive OrcFileFormat and in the regular ORC datasource.
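In essence, the fix builds the Hadoop `Configuration` used on the ORC read paths from the datasource options instead of from the session state alone. Below is a minimal sketch of the pattern, assuming Spark's `SessionState.newHadoopConfWithOptions` (which merges DS options into a fresh Hadoop conf); it is not the literal diff:

```scala
// Before (sketch): DS options never reach the Hadoop conf handed to the FileSystem.
// val hadoopConf = sparkSession.sessionState.newHadoopConf()

// After (sketch): DS options are merged in, so settings such as
// "fs.adl.oauth2.client.id" are visible when the FileSystem is initialized.
val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(options)
```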

Why are the changes needed?

There is a bug: when running

```scala
spark.read.format("orc").options(conf).load(path)
```

the underlying file system does not receive the `conf` options.

Does this PR introduce any user-facing change?

Yes. Hadoop configs passed as ORC datasource options now reach the underlying file system.

How was this patch tested?

Added a unit test to `OrcSourceSuite`.
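A test along these lines can exercise the propagation end to end. Below is a minimal sketch, assuming a suite with Spark's `SQLTestUtils`/`QueryTest` helpers; the fake FileSystem class and the `ds_option` key are illustrative, not necessarily the exact code added:

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.LocalFileSystem
import org.apache.spark.sql.Row

// Fails fast at initialization unless the DS option reached the Hadoop conf.
class FakeFileSystemRequiringDSOption extends LocalFileSystem {
  override def initialize(name: URI, conf: Configuration): Unit = {
    super.initialize(name, conf)
    require(conf.get("ds_option", "") == "value", "DS option was not propagated")
  }
}

test("SPARK-33094: propagate Hadoop confs from DS options to underlying file system") {
  withSQLConf(
      "fs.file.impl" -> classOf[FakeFileSystemRequiringDSOption].getName,
      "fs.file.impl.disable.cache" -> "true") {
    withTempPath { dir =>
      val path = "file:" + dir.getCanonicalPath
      val opts = Map("ds_option" -> "value")
      spark.range(1).write.options(opts).orc(path)
      checkAnswer(spark.read.options(opts).orc(path), Row(0L))
    }
  }
}
```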

@MaxGekk
Member Author

MaxGekk commented Oct 8, 2020

These changes are similar to #29971. @HyukjinKwon @yuningzh-db @dongjoon-hyun Please review this PR.

@MaxGekk
Member Author

MaxGekk commented Oct 8, 2020

I ran the test from this PR (changing only the format) on Parquet v1/v2, ORC v2, CSV v1/v2, JSON v1/v2, and Text v1/v2. It passed everywhere.

@SparkQA

SparkQA commented Oct 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34160/

@SparkQA

SparkQA commented Oct 8, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34160/

@MaxGekk
Member Author

MaxGekk commented Oct 8, 2020

I have looked at the build failures; they seem unrelated to the changes (some failures occurred while downloading artefacts).

@SparkQA

SparkQA commented Oct 8, 2020

Test build #129554 has finished for PR 29976 at commit ef6d7f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @MaxGekk and @HyukjinKwon .

The K8s IT test failure is unrelated to this PR.

- Test basic decommissioning *** FAILED ***

Merged to master for Apache Spark 3.1.0

@MaxGekk
Member Author

MaxGekk commented Oct 8, 2020

@dongjoon-hyun Can this be merged to branch-3.0 (maybe 2.4 too), since it can be considered a bug fix?
Here is a real use case: a customer tries to read files in Azure Data Lake:

```scala
def hadoopConf1() = Map[String, String](
  "fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  "fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  "fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  "fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("...").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```

but gets the following exception because the settings above are not propagated to the filesystem:

```
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
	at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
	at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
	at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
	at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```
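For context, before this fix such credentials only took effect when set session-wide on the shared Hadoop configuration rather than per read. A hedged illustration of that workaround (not taken from the PR):

```scala
// Assumed workaround: setting the conf globally affects every read in the
// session, unlike the per-read DS options enabled by this fix.
spark.sparkContext.hadoopConfiguration
  .set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
```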

@MaxGekk
Member Author

MaxGekk commented Oct 9, 2020

@HyukjinKwon You merged a similar fix for Avro to branch-3.0 in #29971. WDYT, should I open a PR with the changes for branch-3.0?

@HyukjinKwon
Member

So it fixes a bug, right? Sure, let's open a PR to port it back.

@dongjoon-hyun
Member

+1 for backporting, @MaxGekk and @HyukjinKwon .

MaxGekk added a commit to MaxGekk/spark that referenced this pull request Oct 9, 2020
[SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system

Propagate ORC options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC datasource.

There is a bug: when running
```scala
spark.read.format("orc").options(conf).load(path)
```
the underlying file system does not receive the conf options.

Yes

Added UT to `OrcSourceSuite`.

Closes apache#29976 from MaxGekk/orc-option-propagation.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c5f6af9)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Member Author

MaxGekk commented Oct 9, 2020

Here is the backport to branch-3.0: #29985

@MaxGekk
Member Author

MaxGekk commented Oct 15, 2020

Regarding #29976 (comment), I could move the test into a common trait and test all built-in datasources, including Avro, ORC, LibSVM, CSV and so on (sketched below). Let me know if you think it makes sense for improving test coverage. cc @gatorsmile @cloud-fan
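Such a shared trait might look roughly like the sketch below; the trait and member names are assumptions for illustration, not the actual code of the follow-up PR:

```scala
// Illustrative sketch of a shared trait for per-format option-propagation tests.
trait CommonFileDataSourceSuite { self: org.apache.spark.sql.QueryTest =>
  // Each concrete suite (Avro, ORC, CSV, ...) supplies its format name.
  protected def dataSourceFormat: String

  test(s"propagate Hadoop configs from $dataSourceFormat options to file system") {
    // Write/read via `dataSourceFormat` with options(...) and assert the
    // options appear in the FileSystem's Hadoop conf, as in the ORC test above.
  }
}
```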

@MaxGekk
Member Author

MaxGekk commented Oct 16, 2020

Here is the PR #30067 with the common test.

@MaxGekk MaxGekk deleted the orc-option-propagation branch December 11, 2020 20:28