[SPARK-33094][SQL][3.0] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system #29985

Closed

Member

MaxGekk commented Oct 9, 2020

What changes were proposed in this pull request?

Propagate ORC options to Hadoop configs in Hive OrcFileFormat and in the regular ORC datasource.

Why are the changes needed?

There is a bug: when reading ORC with data source options, as in

spark.read.format("orc").options(conf).load(path)

the options in conf are not propagated to the underlying file system.
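
To illustrate the intended behavior, here is a minimal, self-contained sketch in plain Scala. It is a simplified model and not the code in this patch: immutable Maps stand in for Hadoop's Configuration, and the names `OptionPropagationSketch` and `confWithOptions`, as well as the `ds_option` key, are hypothetical. It shows per-read options being overlaid on the session-level settings, so the file system sees them while the session settings stay unchanged.

```scala
// Simplified model of option propagation; plain Maps stand in for
// Hadoop's Configuration. All names here are hypothetical.
object OptionPropagationSketch {
  type Conf = Map[String, String]

  // Overlay per-read data source options on top of the session-level
  // settings. Options win on key conflicts; sessionConf is not mutated.
  def confWithOptions(sessionConf: Conf, options: Conf): Conf =
    sessionConf ++ options

  def main(args: Array[String]): Unit = {
    val sessionConf: Conf = Map("fs.file.impl.disable.cache" -> "false")
    val options: Conf = Map(
      "fs.file.impl.disable.cache" -> "true",
      "ds_option" -> "value" // hypothetical key a custom file system might require
    )

    // Behavior reported as a bug: the file system only sees sessionConf.
    val seenBeforeFix = sessionConf
    // Intended behavior: the file system sees the options as well.
    val seenAfterFix = confWithOptions(sessionConf, options)

    println(seenBeforeFix.get("ds_option"))             // None
    println(seenAfterFix.get("ds_option"))              // Some(value)
    println(seenAfterFix("fs.file.impl.disable.cache")) // true
  }
}
```

Building a fresh merged configuration per read, rather than mutating shared state, keeps options passed to one `load` call from leaking into other reads in the same session.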

Does this PR introduce any user-facing change?

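Yes. Options passed to the ORC reader are now propagated to the Hadoop configuration used by the underlying file system.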

How was this patch tested?

Added a unit test to OrcSourceSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c5f6af9)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

Closes apache#29976 from MaxGekk/orc-option-propagation.

Member Author

MaxGekk commented Oct 9, 2020

The changes conflict with branch-2.4 and cannot be merged cleanly. @dongjoon-hyun @HyukjinKwon If you think this should be in 2.4 too, a separate PR will be needed.

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34187/

SparkQA commented Oct 9, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34187/

HyukjinKwon pushed a commit that referenced this pull request Oct 9, 2020
…DS options to underlying HDFS file system

Closes #29985 from MaxGekk/orc-option-propagation-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon
Member

Merged to branch-3.0.

Sure, let's open a PR for branch-2.4 as well.

HyukjinKwon closed this Oct 9, 2020
MaxGekk added a commit to MaxGekk/spark that referenced this pull request Oct 9, 2020
…DS options to underlying HDFS file system

Closes apache#29985 from MaxGekk/orc-option-propagation-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 9892b3e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Member Author

MaxGekk commented Oct 9, 2020

Here is the backport to 2.4: #29987

SparkQA commented Oct 9, 2020

Test build #129582 has finished for PR 29985 at commit 9608ace.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

+1, late LGTM.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…DS options to underlying HDFS file system

Closes apache#29985 from MaxGekk/orc-option-propagation-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
MaxGekk deleted the orc-option-propagation-3.0 branch December 11, 2020 20:28