[SPARK-25132][SQL][DOC] Add migration doc for case-insensitive field resolution when reading from Parquet #22184
Conversation
@gatorsmile Could you kindly help trigger Jenkins and review?
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
- Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.

## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above

- In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returned null for any column whose name differs in letter case between the Hive metastore schema and the Parquet schema, no matter whether `spark.sql.caseSensitive` was set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so column values are returned even if the column names differ in letter case. An exception is thrown if there is ambiguity, i.e. more than one Parquet column matches.
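To make the documented behavior concrete, here is a minimal pure-Python sketch of the resolution rule the note describes. This is illustrative only, not Spark's actual implementation; the function name and field names are hypothetical.

```python
def resolve_field(metastore_name, parquet_fields, case_sensitive):
    """Return the Parquet field matching `metastore_name`, or None.

    In case-insensitive mode, raise if more than one Parquet field
    matches -- the ambiguity case the migration note mentions.
    """
    if case_sensitive:
        # Exact-case match only; a case mismatch yields no match,
        # which Spark surfaces as a null column.
        return metastore_name if metastore_name in parquet_fields else None
    matches = [f for f in parquet_fields if f.lower() == metastore_name.lower()]
    if len(matches) > 1:
        raise RuntimeError(
            f"Ambiguous resolution for '{metastore_name}': {matches}")
    return matches[0] if matches else None

# Parquet schema with a case-only duplicate, mirroring the tests below.
parquet_fields = ["a", "B", "c", "C"]
print(resolve_field("b", parquet_fields, case_sensitive=True))   # None -> Spark returns null
print(resolve_field("b", parquet_fields, case_sensitive=False))  # "B"
# resolve_field("c", parquet_fields, case_sensitive=False) raises,
# because "c" matches both "c" and "C".
```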
This is a behavior change. I am not sure whether we should backport it to 2.3.2. How about sending a note to the dev mailing list?
BTW, this only affects data source tables. How about hive serde tables? Are they consistent?
Could you add a test case? Create a table with syntax like `CREATE TABLE ... STORED AS PARQUET`. You also need to turn off `spark.sql.hive.convertMetastoreParquet` in the test case.
Following your advice, I did a thorough comparison between data source tables and hive serde tables.
Parquet data and tables are created via the following code:
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data")
CREATE TABLE parquet_data_source_lower (a LONG, b LONG, c LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_data_source_upper (A LONG, B LONG, C LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
`spark.sql.hive.convertMetastoreParquet` is set to false:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
Below are the comparison results both without #22148 and with #22148.
The comparison result without #22148:
| no. | caseSensitive | table columns | select column | parquet column (select via data source table) | parquet column (select via hive serde table) | consistent? | resolved by #22148 |
|---|---|---|---|---|---|---|---|
| 1 | true | a, b, c | a | a | a | Y | |
| 2 | true | a, b, c | b | null | B | NG | |
| 3 | true | a, b, c | c | c | c | Y | |
| 4 | true | a, b, c | A | AnalysisException | AnalysisException | Y | |
| 5 | true | a, b, c | B | AnalysisException | AnalysisException | Y | |
| 6 | true | a, b, c | C | AnalysisException | AnalysisException | Y | |
| 7 | true | A, B, C | a | AnalysisException | AnalysisException | Y | |
| 8 | true | A, B, C | b | AnalysisException | AnalysisException | Y | |
| 9 | true | A, B, C | c | AnalysisException | AnalysisException | Y | |
| 10 | true | A, B, C | A | null | a | NG | |
| 11 | true | A, B, C | B | B | B | Y | |
| 12 | true | A, B, C | C | C | c | NG | |
| 13 | false | a, b, c | a | a | a | Y | |
| 14 | false | a, b, c | b | null | B | NG | Y |
| 15 | false | a, b, c | c | c | c | Y | |
| 16 | false | a, b, c | A | a | a | Y | |
| 17 | false | a, b, c | B | null | B | NG | Y |
| 18 | false | a, b, c | C | c | c | Y | |
| 19 | false | A, B, C | a | null | a | NG | Y |
| 20 | false | A, B, C | b | B | B | Y | |
| 21 | false | A, B, C | c | C | c | NG | |
| 22 | false | A, B, C | A | null | a | NG | Y |
| 23 | false | A, B, C | B | B | B | Y | |
| 24 | false | A, B, C | C | C | c | NG | |
The comparison result with #22148 applied:
| no. | caseSensitive | table columns | select column | parquet column (select via data source table) | parquet column (select via hive serde table) | consistent? | introduced by #22148 |
|---|---|---|---|---|---|---|---|
| 1 | true | a, b, c | a | a | a | Y | |
| 2 | true | a, b, c | b | null | B | NG | |
| 3 | true | a, b, c | c | c | c | Y | |
| 4 | true | a, b, c | A | AnalysisException | AnalysisException | Y | |
| 5 | true | a, b, c | B | AnalysisException | AnalysisException | Y | |
| 6 | true | a, b, c | C | AnalysisException | AnalysisException | Y | |
| 7 | true | A, B, C | a | AnalysisException | AnalysisException | Y | |
| 8 | true | A, B, C | b | AnalysisException | AnalysisException | Y | |
| 9 | true | A, B, C | c | AnalysisException | AnalysisException | Y | |
| 10 | true | A, B, C | A | null | a | NG | |
| 11 | true | A, B, C | B | B | B | Y | |
| 12 | true | A, B, C | C | C | c | NG | |
| 13 | false | a, b, c | a | a | a | Y | |
| 14 | false | a, b, c | b | B | B | Y | |
| 15 | false | a, b, c | c | RuntimeException | c | NG | Y |
| 16 | false | a, b, c | A | a | a | Y | |
| 17 | false | a, b, c | B | B | B | Y | |
| 18 | false | a, b, c | C | RuntimeException | c | NG | Y |
| 19 | false | A, B, C | a | a | a | Y | |
| 20 | false | A, B, C | b | B | B | Y | |
| 21 | false | A, B, C | c | RuntimeException | c | NG | |
| 22 | false | A, B, C | A | a | a | Y | |
| 23 | false | A, B, C | B | B | B | Y | |
| 24 | false | A, B, C | C | RuntimeException | c | NG | |
We can see that data source tables and hive serde tables have two major differences regarding Parquet field resolution:
- Whether they respect `spark.sql.caseSensitive`. Without #22148 ([SPARK-25132][SQL] Case-insensitive field resolution when reading from Parquet), neither data source tables nor hive serde tables respect `spark.sql.caseSensitive`: data source tables always do case-sensitive Parquet field resolution, while hive serde tables always do case-insensitive resolution, no matter whether `spark.sql.caseSensitive` is set to true or false. #22148 makes data source tables respect `spark.sql.caseSensitive`, while hive serde table behavior is unchanged.
- How ambiguity is resolved in case-insensitive mode. Without #22148, data source tables do case-sensitive resolution and return columns with the exactly matching letter case, while hive serde tables always return the first matched column, ignoring case. With #22148, data source tables throw an exception when there is ambiguity, while hive serde table behavior is unchanged.
With respect to Parquet field resolution, shall we make hive serde table behavior consistent with data source table behavior? What do you think?
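The two ambiguity strategies contrasted above can be sketched side by side in pure Python. This is a hypothetical illustration, not Spark or Hive source code: strict matching (data source tables with #22148) raises on a case-only duplicate, while first-match (hive serde tables) silently picks the first candidate.

```python
def resolve_strict(name, fields):
    """Case-insensitive match that rejects ambiguity (data source style)."""
    matches = [f for f in fields if f.lower() == name.lower()]
    if len(matches) > 1:
        raise RuntimeError(f"Ambiguous resolution for '{name}': {matches}")
    return matches[0] if matches else None

def resolve_first_match(name, fields):
    """Case-insensitive match that takes the first hit (hive serde style)."""
    for f in fields:
        if f.lower() == name.lower():
            return f  # first match wins; ambiguity is silently ignored
    return None

# Parquet schema containing both "c" and "C", as in the comparison tables.
fields = ["a", "B", "c", "C"]
print(resolve_first_match("C", fields))  # "c" -- first match, not the exact-case one
# resolve_strict("C", fields) raises, because both "c" and "C" match.
```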
We should respect `spark.sql.caseSensitive` in both modes, but also add a legacy SQLConf to enable users to revert back to the previous behavior.
Could you add a test case for the one you did?
First, we should not change the behavior of hive tables. They inherit many behaviors from Hive, so let's keep them as they were.
Second, why do we treat this as a behavior change? I think it's a bug that we don't respect `spark.sql.caseSensitive` in field resolution. In general we should not add a config to restore a bug.
I don't think this document is helpful. It explains a subtle and unreasonable behavior to users, which IMO just makes them confused.
Making 1, 2 consistent is enough. : )
BTW, the parquet files could be generated by our DataFrameWriter. Thus, the physical schema and logical schema could still have different cases.
@gatorsmile I think 1 and 2 are always consistent. They both use the Spark reader. Am I wrong?
- parquet table created by Spark (using parquet), read by the Spark reader
- parquet table created by Spark (using hive), read by the Spark reader
#22184 (comment) already shows they are not consistent, right?
The testing is based on `spark.sql.hive.convertMetastoreParquet` being set to false, so it should use the Hive serde reader instead of the Spark reader; sorry if that was confusing.
I guess you mean 1 and 3 :). I understand now.
If we are not going to backport the PR to 2.3, I can close SPARK-25206 also?
ok to test
Test build #95262 has finished for PR 22184 at commit
Since we are not going to backport the PR to 2.3, do we still need this migration guide? Strictly speaking, we do have a behavior change here: hive tables are always case insensitive, and we should not read a hive parquet table with the native parquet reader if Spark is in case-sensitive mode. @seancxmao can you send a followup PR to do it? thanks!
@cloud-fan OK, I will do it. Just to confirm, when reading from hive parquet table, if Another thing to confirm is that when there is ambiguity in case-insensitive mode, native parquet reader throws exception while hive serde reader returns the first matched field, which is not consistent. Is it OK? Or shall we handle this case in the same way as above? I mean when reading from hive parquet table, if
I'd like to just skip the conversion and log a warning message to say why.
I think it's ok. In the end they are different data sources and can define their own behaviors. But you do have a point about
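The "skip the conversion and log a warning" idea discussed above can be sketched in pure Python. The helper name and its parameters are hypothetical, not the actual Spark code path; it only illustrates the decision rule.

```python
import logging

logger = logging.getLogger("metastore_conversion")

def should_convert_to_native(convert_metastore_parquet, case_sensitive):
    """Decide whether a Hive Parquet table may use the native reader.

    Hive tables are always case insensitive, so when Spark is in
    case-sensitive mode the (case-sensitive) native Parquet reader
    would change semantics -- skip the conversion and say why.
    """
    if not convert_metastore_parquet:
        # User explicitly disabled conversion via the config.
        return False
    if case_sensitive:
        logger.warning(
            "Not converting Hive Parquet table to the native reader: "
            "spark.sql.caseSensitive is true, but Hive tables are "
            "case insensitive.")
        return False
    return True

print(should_convert_to_native(True, True))   # False: conversion skipped with a warning
print(should_convert_to_native(True, False))  # True: safe to use the native reader
```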
You mentioned this option is not a SQL conf. Could you give me some advice about where this option should be defined? I just thought to define this option in SQLConf as something like
see
@cloud-fan I've just sent a PR (#22343) for this.
@cloud-fan @gatorsmile I think the old
@seancxmao, so this behaviour change description is only valid when we upgrade Spark from 2.3 to 2.4? Then we can add it in
@HyukjinKwon Thank you for your comments. Yes, this is only valid when upgrading from Spark 2.3 to 2.4. I will do it.
@seancxmao is this PR still live?
@srowen Sorry for the late reply! I'd like to close this PR and file a new one since our SQL doc has changed a lot. Thank you all for your comments and time!
…ive field resolution when reading from Parquet ## What changes were proposed in this pull request? #22148 introduces a behavior change. According to discussion at #22184, this PR updates migration guide when upgrade from Spark 2.3 to 2.4. ## How was this patch tested? N/A Closes #23238 from seancxmao/SPARK-25132-doc-2.4. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 55276d3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
#22148 introduces a behavior change. We need to document it in the migration guide.
How was this patch tested?
N/A