
[SPARK-25132][SQL][DOC] Add migration doc for case-insensitive field resolution when reading from Parquet #22184

Closed · wants to merge 1 commit into from

Conversation

**@seancxmao** (Contributor):

## What changes were proposed in this pull request?

#22148 introduces a behavior change. We need to document it in the migration guide.

## How was this patch tested?

N/A

**@seancxmao** (Contributor, Author):

@gatorsmile Could you kindly help trigger Jenkins and review?

@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
- Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.

## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above

- In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose names in the Hive metastore schema and the Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even if column names are in different letter cases, Spark returns the corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column matches.
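The resolution rule described in the migration note can be sketched in plain Python (an illustrative model only, not Spark's actual Scala implementation; the function name and signature are hypothetical):

```python
def resolve_field(requested, parquet_fields, case_sensitive):
    """Illustrative model of Spark 2.3.2+ Parquet column resolution.

    Returns the matched physical Parquet field name, or None when nothing
    matches (Spark then returns null for that column). In case-insensitive
    mode, more than one match is ambiguous and raises an exception.
    """
    if case_sensitive:
        # Case-sensitive mode: exact match only.
        return requested if requested in parquet_fields else None
    matches = [f for f in parquet_fields if f.lower() == requested.lower()]
    if len(matches) > 1:
        raise RuntimeError(
            "Found duplicate field(s) %s matching '%s' in case-insensitive mode"
            % (matches, requested))
    return matches[0] if matches else None
```

For example, selecting `b` from a Parquet file whose physical schema contains `B` returns null in case-sensitive mode but resolves to `B` in case-insensitive mode, and a file containing both `c` and `C` raises in case-insensitive mode.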
**Member:**
This is a behavior change. I am not sure whether we should backport it to 2.3.2. How about sending a note to the dev mailing list?

BTW, this only affects data source table. How about hive serde table? Are they consistent?

Could you add a test case? Create a table by syntax like `CREATE TABLE ... STORED AS PARQUET`. You also need to turn off `spark.sql.hive.convertMetastoreParquet` in the test case.

**@seancxmao** (Contributor, Author), Aug 23, 2018:

Following your advice, I did a thorough comparison between data source table and hive serde table.

Parquet data and tables are created via the following code:

```scala
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data")
```

```sql
CREATE TABLE parquet_data_source_lower (a LONG, b LONG, c LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_data_source_upper (A LONG, B LONG, C LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
```

`spark.sql.hive.convertMetastoreParquet` is set to false:

```scala
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
```

Below are the comparison results both without #22148 and with #22148.

The comparison result without #22148:

| no. | caseSensitive | table columns | select column | parquet column (select via data source table) | parquet column (select via hive serde table) | consistent? | resolved by #22148 |
|----|----|----|----|----|----|----|----|
| 1 | true | a, b, c | a | a | | | |
| 2 | | | b | null | B | NG | |
| 3 | | | c | | | | |
| 4 | | | A | AnalysisException | AnalysisException | | |
| 5 | | | B | AnalysisException | AnalysisException | | |
| 6 | | | C | AnalysisException | AnalysisException | | |
| 7 | | A, B, C | a | AnalysisException | AnalysisException | | |
| 8 | | | b | AnalysisException | AnalysisException | | |
| 9 | | | c | AnalysisException | AnalysisException | | |
| 10 | | | A | null | | NG | |
| 11 | | | B | B | | | |
| 12 | | | C | | | NG | |
| 13 | false | a, b, c | a | | | | |
| 14 | | | b | null | | NG | Y |
| 15 | | | c | | | | |
| 16 | | | A | | | | |
| 17 | | | B | null | | NG | Y |
| 18 | | | C | | | | |
| 19 | | A, B, C | a | null | | NG | Y |
| 20 | | | b | | | | |
| 21 | | | c | | | NG | |
| 22 | | | A | null | | NG | Y |
| 23 | | | B | | | | |
| 24 | | | C | | | NG | |

The comparison result with #22148 applied:

| no. | caseSensitive | table columns | select column | parquet column (select via data source table) | parquet column (select via hive serde table) | consistent? | introduced by #22148 |
|----|----|----|----|----|----|----|----|
| 1 | true | a, b, c | a | | | | |
| 2 | | | b | null | | NG | |
| 3 | | | c | | | | |
| 4 | | | A | AnalysisException | AnalysisException | | |
| 5 | | | B | AnalysisException | AnalysisException | | |
| 6 | | | C | AnalysisException | AnalysisException | | |
| 7 | | A, B, C | a | AnalysisException | AnalysisException | | |
| 8 | | | b | AnalysisException | AnalysisException | | |
| 9 | | | c | AnalysisException | AnalysisException | | |
| 10 | | | A | null | | NG | |
| 11 | | | B | | | | |
| 12 | | | C | | | NG | |
| 13 | false | a, b, c | a | | | | |
| 14 | | | b | | | | |
| 15 | | | c | RuntimeException | | NG | Y |
| 16 | | | A | | | | |
| 17 | | | B | | | | |
| 18 | | | C | RuntimeException | | NG | Y |
| 19 | | A, B, C | a | | | | |
| 20 | | | b | | | | |
| 21 | | | c | RuntimeException | | NG | |
| 22 | | | A | | | | |
| 23 | | | B | | | | |
| 24 | | | C | RuntimeException | | NG | |

We can see that data source tables and hive serde tables have two major differences with respect to Parquet field resolution.

With regard to Parquet field resolution, shall we make the hive serde table behavior consistent with the data source table behavior? What do you think?

**Member:**

We should respect spark.sql.caseSensitive in both modes, but also add a legacy SQLConf to enable users to revert back to the previous behavior.

**Member:**

Could you add a test case for the one you did?

**Contributor:**

First, we should not change the behavior of hive tables. They inherit many behaviors from Hive, and let's keep it as it was.

Second, why do we treat this as a behavior change? I think it's a bug that we don't respect `spark.sql.caseSensitive` in field resolution. In general we should not add a config to restore a bug.

I don't think this document is helpful. It explains a subtle and unreasonable behavior to users, which IMO just makes them confused.

**Member:**

Making 1, 2 consistent is enough. : )

**@gatorsmile** (Member), Aug 28, 2018:

BTW, the parquet files could be generated by our `DataFrameWriter`. Thus, the physical schema and the logical schema could still have different cases.

**@yucai** (Contributor), Aug 29, 2018:

@gatorsmile I think 1 and 2 are always consistent. They both use Spark reader. Am I wrong?

  1. parquet table created by Spark (using parquet) read by Spark reader
  2. parquet table created by Spark (using hive) read by Spark reader

**Member:**

#22184 (comment) already shows they are not consistent, right?

**Contributor:**

The testing is based on `spark.sql.hive.convertMetastoreParquet` being set to false, so it should use the Hive serde reader instead of the Spark reader; sorry if it is too confusing here.
I guess you mean 1 and 3 :). I understand now.

If we are not going to backport the PR to 2.3, I can close SPARK-25206 also?

**@HyukjinKwon** (Member):

ok to test

**@SparkQA** commented Aug 26, 2018:

Test build #95262 has finished for PR 22184 at commit eae8a3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

**@cloud-fan** (Contributor):

Since we are not going to backport the PR to 2.3, do we still need this migration guide?

Strictly speaking, we do have a behavior change here: hive tables are always case-insensitive, and we should not read a hive parquet table with the native parquet reader if Spark is in case-sensitive mode. @seancxmao can you send a follow-up PR to do it? Thanks!

**@seancxmao** (Contributor, Author) commented Aug 29, 2018:

@cloud-fan OK, I will do it.

Just to confirm: when reading from a hive parquet table, if `spark.sql.hive.convertMetastoreParquet` and `spark.sql.caseSensitive` are both set to true, we throw an exception to tell users they should not do this because it could lead to inconsistent results. Our objective is to make the native parquet reader consistent with the hive serde reader wherever possible, and to stop reading when we are not able to. Is my understanding correct?

Another thing to confirm: when there is ambiguity in case-insensitive mode, the native parquet reader throws an exception while the hive serde reader returns the first matched field, which is not consistent. Is that OK? Or shall we handle this case in the same way as above? I mean: when reading from a hive parquet table, if `spark.sql.hive.convertMetastoreParquet = true`, `spark.sql.caseSensitive = false`, and we find out there is ambiguity, we throw an exception to tell users they should not do this because it could lead to inconsistent results.

**@cloud-fan** (Contributor):

> if `spark.sql.hive.convertMetastoreParquet` and `spark.sql.caseSensitive` are both set to true, we throw an exception

I'd like to just skip the conversion and log a warning message to say why.

> ... which is not consistent

I think it's ok. At the end they are different data sources and can define their own behaviors.

But you do have a point about `spark.sql.hive.convertMetastoreParquet`: the behavior must be consistent to do the conversion. My proposal is that the parquet data source should provide an option (not a SQL conf) to switch the behavior when hitting duplicated field names in case-insensitive mode. And when converting a hive parquet table to the parquet data source, set the option and ask the parquet data source to pick the first matched field.
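The proposed switch can be sketched in plain Python, again as an illustrative model with a hypothetical parameter name (`on_duplicates`), since the real option name was left to the follow-up PR:

```python
def resolve_case_insensitive(requested, parquet_fields, on_duplicates="fail"):
    """Case-insensitive resolution with a switchable duplicate policy.

    on_duplicates="fail":        raise on ambiguity (the data source default).
    on_duplicates="first_match": pick the first matched field, mimicking
                                 the hive serde reader's behavior.
    """
    matches = [f for f in parquet_fields if f.lower() == requested.lower()]
    if not matches:
        return None
    if len(matches) > 1 and on_duplicates == "fail":
        raise RuntimeError(
            "Duplicate fields %s for '%s'" % (matches, requested))
    return matches[0]
```

Under this sketch, converting a hive parquet table would set the "first_match" policy so the converted read agrees with the hive serde reader, while plain data source reads keep the stricter "fail" policy.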

**@seancxmao** (Contributor, Author) commented Aug 29, 2018:

> My proposal is, parquet data source should provide an option(not SQL conf) to ...

You mentioned this option is not a SQL conf. Could you give me some advice about where this option should be defined? I had thought to define it in SQLConf as something like `spark.sql.parquet.onDuplicatedFields = FAIL | FIRST_MATCH`, since I see a bunch of options starting with `spark.sql.parquet` in SQLConf. Do you mean this option is supposed to be used only internally and not exposed to end users?

**@cloud-fan** (Contributor):

See `ParquetOptions`. An option can be specified per query, while a SQL conf is per session.

**@seancxmao** (Contributor, Author):

@cloud-fan I've just sent a PR (#22343) for this.

**@seancxmao** (Contributor, Author):

@cloud-fan @gatorsmile I think the old *Upgrading From Spark SQL 2.3.1 to 2.3.2 and above* section is not needed since we do not backport SPARK-25132 to branch-2.3. I'm wondering if we need *Upgrading From Spark SQL 2.3 to 2.4 and above*. What do you think?

**@HyukjinKwon** (Member):

@seancxmao, so this behaviour-change description is only valid when upgrading Spark 2.3 to 2.4? Then we can add it in *Upgrading From Spark SQL 2.3 to 2.4*.

**@seancxmao** (Contributor, Author):

@HyukjinKwon Thank you for your comments. Yes, this is only valid when upgrading Spark 2.3 to 2.4. I will do it.

**@srowen** (Member) commented Nov 26, 2018:

@seancxmao is this PR still live?

**@seancxmao** (Contributor, Author):

@srowen Sorry for the late reply! I'd like to close this PR and file a new one since our SQL doc has changed a lot. Thank you all for your comments and time!

@seancxmao seancxmao closed this Dec 5, 2018
asfgit pushed a commit that referenced this pull request Dec 9, 2018
…ive field resolution when reading from Parquet

## What changes were proposed in this pull request?
#22148 introduces a behavior change. According to discussion at #22184, this PR updates migration guide when upgrade from Spark 2.3 to 2.4.

## How was this patch tested?
N/A

Closes #23238 from seancxmao/SPARK-25132-doc-2.4.

Authored-by: seancxmao <seancxmao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 55276d3)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asfgit pushed a commit that referenced this pull request Dec 9, 2018 (same commit message as above)
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019 (same commit message)
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019 (same commit message)
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019 (same commit message)
7 participants