[SPARK-5068][SQL] Fix bug querying data when path doesn't exist #3891
Conversation
Can one of the admins verify this patch?
Hi @marmbrus, can you please take a look and give some suggestions? Thanks.
@@ -141,7 +141,11 @@ class HadoopTableReader(
      partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]],
      filterOpt: Option[PathFilter]): RDD[Row] = {
-    val hivePartitionRDDs = partitionToDeserializer.map { case (partition, partDeserializer) =>
+    val hivePartitionRDDs = partitionToDeserializer.filter{ case (partition, partDeserializer) =>
Space before {
What is the rationale behind this change? It seems like the table is corrupted and you should know about it. Does Hive work in this case?
Yes, Hive works in this situation. I found this issue in our production environment when I tried to use Spark SQL to run some SQL that originally ran in Hive. I am not familiar with the business logic, but I think we should strengthen Spark's compatibility. Thanks for your review.
Okay, that is reasonable and we should probably support this. So the question is: can we do this check on the executors in parallel (or just catch the exception if it is thrown) instead of doing it serially when constructing the RDD?
Thanks for the suggestion! I will optimize this and commit later.
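The change under discussion can be sketched as follows. This is a minimal, illustrative version of the filtering step: before building a per-partition RDD, partitions whose directory no longer exists are dropped. It uses `java.nio.file` in place of Hadoop's `FileSystem`/`PathFilter` API, and the function and parameter names are hypothetical, not the actual `HadoopTableReader` code.

```scala
import java.nio.file.{Files, Paths}

// Sketch: drop partitions whose path no longer exists before creating RDDs.
// The deserializer type is left generic; paths are plain strings here, while
// the real code works with Hadoop Path objects and a distributed filesystem.
def filterExistingPartitions[D](
    partitionToDeserializer: Map[String, D]): Map[String, D] =
  partitionToDeserializer.filter { case (partitionPath, _) =>
    Files.exists(Paths.get(partitionPath))
  }
```

Note that an existence check like this, done while constructing the RDD, runs serially on the driver and costs one filesystem round trip per partition, which is exactly the concern raised above about moving the check to the executors or catching the exception instead.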
ok to test
Test build #25090 has started for PR 3891 at commit
Test build #25090 has finished for PR 3891 at commit
Test FAILed.
Test build #26511 has started for PR 3891 at commit
Test build #26513 has started for PR 3891 at commit
Test build #26511 has finished for PR 3891 at commit
Test FAILed.
Test build #26513 has finished for PR 3891 at commit
Test PASSed.
OK, I'll close this one.
…ontext

This PR follows up on PRs #3907, #3891, and #4356. Following marmbrus' and liancheng's comments, I use fs.globStatus to retrieve all FileStatus objects under the path(s), and then do the filtering locally.

1. Get the pathPattern from the path and put it into pathPatternSet (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*).
2. Retrieve all FileStatus objects and cache them by updating existPathSet.
3. Do the filtering locally.
4. If we encounter a new pathPattern, do steps 1 and 2 again (an external table may have more than one partition pathPattern).

cc chenghao-intel jeanlyn

Author: lazymam500 <lazyman500@gmail.com>
Author: lazyman <lazyman500@gmail.com>

Closes #5059 from lazyman500/SPARK-5068 and squashes the following commits:

5bfcbfd [lazyman] move spark.sql.hive.verifyPartitionPath to SQLConf, fix scala style
e1d6386 [lazymam500] fix scala style
f23133f [lazymam500] bug fix
47e0023 [lazymam500] fix scala style, add config flag, break the chaining
04c443c [lazyman] SPARK-5068: fix bug when partition path doesn't exist #2
41f60ce [lazymam500] Merge pull request #1 from apache/master
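The glob-then-filter-locally idea described above can be sketched as follows: derive a wildcard pattern from one concrete partition path, list everything matching the pattern in a single call, cache the result, and answer later existence checks from the cache. The `listByPattern` callback stands in for Hadoop's `fs.globStatus`; all names here are illustrative, not the actual Spark code.

```scala
// Sketch of the follow-up approach: one listing per path pattern, then
// local set-membership checks instead of one filesystem call per partition.
object PartitionPathCheck {
  // /user/demo/2016/08/12 with 3 partition levels -> /user/demo/*/*/*
  def toPathPattern(path: String, partitionDepth: Int): String = {
    val parts = path.split("/").filter(_.nonEmpty)
    val (base, partCols) = parts.splitAt(parts.length - partitionDepth)
    ("" +: (base ++ partCols.map(_ => "*"))).mkString("/")
  }

  // Cache of existing paths, keyed by the pattern that produced them
  // (the existPathSet of the commit message).
  private var existPathSet = Map.empty[String, Set[String]]

  // listByPattern stands in for fs.globStatus: it expands a pattern into
  // the concrete paths that actually exist under it.
  def partitionExists(path: String, depth: Int,
                      listByPattern: String => Set[String]): Boolean = {
    val pattern = toPathPattern(path, depth)
    val existing = existPathSet.getOrElse(pattern, {
      val listed = listByPattern(pattern) // one round trip per new pattern
      existPathSet += pattern -> listed
      listed
    })
    existing.contains(path) // filtering is local after that
  }
}
```

An external table can contribute several distinct patterns, which is why the cache is keyed by pattern and refilled only when a new one appears (step 4 above).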
The issue is described in SPARK-5068.
The purpose of this pull request is to avoid building an RDD for a path that doesn't exist.