[SPARK-34075][SQL][CORE] Hidden directories are being listed for partition inference #31169

Closed
wants to merge 2 commits into from

@gengliangwang gengliangwang commented Jan 13, 2021

What changes were proposed in this pull request?

Fix a regression from #29959.

In Spark, a file path is treated as hidden, and is ignored on file reads, if its name matches either of the following rules:

  1. starts with "_" and doesn't contain "="
  2. starts with "."
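
The two rules above can be sketched as a small predicate. This is only an illustration in Python, not Spark's Scala implementation: the names `is_hidden_path_name` and `visible_names` are hypothetical, and the real `HadoopFSUtils.shouldFilterOutPathName` may apply further rules that are not described here.

```python
def is_hidden_path_name(name: str) -> bool:
    """Return True if a single path component counts as hidden.

    Hypothetical helper that mirrors only the two rules listed above:
      1. starts with "_" and does not contain "="
      2. starts with "."
    """
    underscore_rule = name.startswith("_") and "=" not in name
    dot_rule = name.startswith(".")
    return underscore_rule or dot_rule


def visible_names(names):
    """Keep only the path names that are not hidden (hypothetical helper)."""
    return [n for n in names if not is_hidden_path_name(n)]


if __name__ == "__main__":
    listing = ["_SUCCESS", ".hidden", "_col=1", "part=2", "data.parquet"]
    print(visible_names(listing))
```

Note that a name such as `_col=1` is kept: the "=" exception lets a partition directory whose column name begins with an underscore remain visible to partition inference.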

However, after the refactoring in PR #29959, hidden paths were no longer filtered out during partition inference: https://github.com/apache/spark/pull/29959/files#r556432426

This PR fixes the bug. To achieve this, the method InMemoryFileIndex.shouldFilterOut is refactored into HadoopFSUtils.shouldFilterOutPathName.

Why are the changes needed?

Bugfix

Does this PR introduce any user-facing change?

Yes, it fixes a bug for reading file paths with partitions.

How was this patch tested?

Unit test

SparkQA commented Jan 13, 2021

Test build #134009 has finished for PR 31169 at commit 896cf34.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38598/

SparkQA commented Jan 13, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38598/

@cloud-fan cloud-fan left a comment

good catch!

SparkQA commented Jan 13, 2021

Test build #134011 has finished for PR 31169 at commit 29b9fd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya left a comment

nice catch!

@sunchao sunchao left a comment

LGTM - good catch and sorry for introducing this regression!

@HyukjinKwon

It's alrighty, @sunchao.

Merged to master and branch-3.1.

HyukjinKwon pushed a commit that referenced this pull request Jan 14, 2021
…ition inference

Closes #31169 from gengliangwang/fileListingBug.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 467d758)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>