[SPARK-5068][SQL] Fix bug querying data when path doesn't exist #3891
Conversation
Can one of the admins verify this patch?
Hi @marmbrus, can you please take a look and give some suggestions? Thanks.
@@ -141,7 +141,11 @@ class HadoopTableReader(
      partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]],
      filterOpt: Option[PathFilter]): RDD[Row] = {
-    val hivePartitionRDDs = partitionToDeserializer.map { case (partition, partDeserializer) =>
+    val hivePartitionRDDs = partitionToDeserializer.filter{ case (partition, partDeserializer) =>
Space before {
What is the rationale behind this change? It seems like the table is corrupted and you should know about it. Does Hive work in this case?
Yes, Hive works in this situation. I found this issue in our production environment when I tried to use Spark SQL to run some SQL that originally ran in Hive. I am not familiar with the business logic, but I think we should strengthen Spark's compatibility. Thanks for your review.
Okay, that is reasonable and we should probably support this. So the question is: can we do this check on the executors in parallel (or just catch the exception if it is thrown) instead of doing it serially when constructing the RDD?
Thanks for the suggestion! I will optimize this and commit later.
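The change under discussion can be sketched as follows. This is a minimal, illustrative version of the filtering step: before building a per-partition RDD, partitions whose directory no longer exists are dropped. It uses `java.nio.file` in place of Hadoop's `FileSystem`/`PathFilter` API, and the function and parameter names are hypothetical, not the actual `HadoopTableReader` code.

```scala
import java.nio.file.{Files, Paths}

// Sketch: drop partitions whose path no longer exists before creating RDDs.
// The deserializer type is left generic; paths are plain strings here, while
// the real code works with Hadoop Path objects and a distributed filesystem.
def filterExistingPartitions[D](
    partitionToDeserializer: Map[String, D]): Map[String, D] =
  partitionToDeserializer.filter { case (partitionPath, _) =>
    Files.exists(Paths.get(partitionPath))
  }
```

Note that an existence check like this, done while constructing the RDD, runs serially on the driver and costs one filesystem round trip per partition, which is exactly the concern raised above about moving the check to the executors or catching the exception instead.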
ok to test
Test build #25090 has started for PR 3891 at commit
Test build #25090 has finished for PR 3891 at commit
Test FAILed.
Test build #26511 has started for PR 3891 at commit
Test build #26513 has started for PR 3891 at commit
Test build #26511 has finished for PR 3891 at commit
Test FAILed.
Test build #26513 has finished for PR 3891 at commit
Test PASSed.
OK, I'll close this one.
…ontext

This PR follows up on PRs #3907, #3891, and #4356. Following marmbrus' and liancheng's comments, I use fs.globStatus to retrieve all FileStatus objects under the path(s), and then do the filtering locally.

1. Get the pathPattern from the path and put it into pathPatternSet (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*).
2. Retrieve all FileStatus objects and cache them by updating existPathSet.
3. Do the filtering locally.
4. If we encounter a new pathPattern, do steps 1 and 2 again (an external table may have more than one partition pathPattern).

cc chenghao-intel jeanlyn

Author: lazymam500 <lazyman500@gmail.com>
Author: lazyman <lazyman500@gmail.com>

Closes #5059 from lazyman500/SPARK-5068 and squashes the following commits:

5bfcbfd [lazyman] move spark.sql.hive.verifyPartitionPath to SQLConf, fix scala style
e1d6386 [lazymam500] fix scala style
f23133f [lazymam500] bug fix
47e0023 [lazymam500] fix scala style, add config flag, break the chaining
04c443c [lazyman] SPARK-5068: fix bug when partition path doesn't exist #2
41f60ce [lazymam500] Merge pull request #1 from apache/master
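The glob-then-filter-locally idea described above can be sketched as follows: derive a wildcard pattern from one concrete partition path, list everything matching the pattern in a single call, cache the result, and answer later existence checks from the cache. The `listByPattern` callback stands in for Hadoop's `fs.globStatus`; all names here are illustrative, not the actual Spark code.

```scala
// Sketch of the follow-up approach: one listing per path pattern, then
// local set-membership checks instead of one filesystem call per partition.
object PartitionPathCheck {
  // /user/demo/2016/08/12 with 3 partition levels -> /user/demo/*/*/*
  def toPathPattern(path: String, partitionDepth: Int): String = {
    val parts = path.split("/").filter(_.nonEmpty)
    val (base, partCols) = parts.splitAt(parts.length - partitionDepth)
    ("" +: (base ++ partCols.map(_ => "*"))).mkString("/")
  }

  // Cache of existing paths, keyed by the pattern that produced them
  // (the existPathSet of the commit message).
  private var existPathSet = Map.empty[String, Set[String]]

  // listByPattern stands in for fs.globStatus: it expands a pattern into
  // the concrete paths that actually exist under it.
  def partitionExists(path: String, depth: Int,
                      listByPattern: String => Set[String]): Boolean = {
    val pattern = toPathPattern(path, depth)
    val existing = existPathSet.getOrElse(pattern, {
      val listed = listByPattern(pattern) // one round trip per new pattern
      existPathSet += pattern -> listed
      listed
    })
    existing.contains(path) // filtering is local after that
  }
}
```

An external table can contribute several distinct patterns, which is why the cache is keyed by pattern and refilled only when a new one appears (step 4 above).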
The issue is described in SPARK-5068.
The purpose of this pull request is to avoid building an RDD for a path that doesn't exist.