
[SPARK-25132][SQL][DOC] Add migration doc for case-insensitive field resolution when reading from Parquet #22184

Closed · wants to merge 1 commit into from

Conversation

**@seancxmao** (Contributor):

## What changes were proposed in this pull request?

#22148 introduces a behavior change. We need to document it in the migration guide.

## How was this patch tested?

N/A

**@seancxmao** (Contributor, Author):

@gatorsmile Could you kindly help trigger Jenkins and review?

@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
- Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.

## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above

- In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose names in the Hive metastore schema and the Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even if column names are in different letter cases, Spark returns the corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column matches.
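The resolution rule described in the migration note can be sketched in plain Python (an illustrative model only, not Spark's actual Scala implementation; the function name and signature are hypothetical):

```python
def resolve_field(requested, parquet_fields, case_sensitive):
    """Illustrative model of Spark 2.3.2+ Parquet column resolution.

    Returns the matched physical Parquet field name, or None when nothing
    matches (Spark then returns null for that column). In case-insensitive
    mode, more than one match is ambiguous and raises an exception.
    """
    if case_sensitive:
        # Case-sensitive mode: exact match only.
        return requested if requested in parquet_fields else None
    matches = [f for f in parquet_fields if f.lower() == requested.lower()]
    if len(matches) > 1:
        raise RuntimeError(
            "Found duplicate field(s) %s matching '%s' in case-insensitive mode"
            % (matches, requested))
    return matches[0] if matches else None
```

For example, selecting `b` from a Parquet file whose physical schema contains `B` returns null in case-sensitive mode but resolves to `B` in case-insensitive mode, and a file containing both `c` and `C` raises in case-insensitive mode.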
**Member:**
This is a behavior change. I am not sure whether we should backport it to 2.3.2. How about sending a note to the dev mailing list?

BTW, this only affects data source table. How about hive serde table? Are they consistent?

Could you add a test case? Create a table by syntax like `CREATE TABLE ... STORED AS PARQUET`. You also need to turn off `spark.sql.hive.convertMetastoreParquet` in the test case.

**@seancxmao** (Contributor, Author), Aug 23, 2018:

Following your advice, I did a thorough comparison between data source table and hive serde table.

Parquet data and tables are created via the following code:

```scala
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data")
```

```sql
CREATE TABLE parquet_data_source_lower (a LONG, b LONG, c LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_data_source_upper (A LONG, B LONG, C LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data'
```

`spark.sql.hive.convertMetastoreParquet` is set to false:

```scala
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
```

Below are the comparison results both without #22148 and with #22148.

The comparison result without #22148:

| no. | caseSensitive | table columns | select column | parquet column (select via data source table) | parquet column (select via hive serde table) | consistent? | resolved by #22148 |
|----|----|----|----|----|----|----|----|
| 1 | true | a, b, c | a | a | | | |
| 2 | | | b | null | B | NG | |
| 3 | | | c | | | | |
| 4 | | | A | AnalysisException | AnalysisException | | |
| 5 | | | B | AnalysisException | AnalysisException | | |
| 6 | | | C | AnalysisException | AnalysisException | | |
| 7 | | A, B, C | a | AnalysisException | AnalysisException | | |
| 8 | | | b | AnalysisException | AnalysisException | | |
| 9 | | | c | AnalysisException | AnalysisException | | |
| 10 | | | A | null | | NG | |
| 11 | | | B | B | | | |
| 12 | | | C | | | NG | |
| 13 | false | a, b, c | a | | | | |
| 14 | | | b | null | | NG | Y |
| 15 | | | c | | | | |
| 16 | | | A | | | | |
| 17 | | | B | null | | NG | Y |
| 18 | | | C | | | | |
| 19 | | A, B, C | a | null | | NG | Y |
| 20 | | | b | | | | |
| 21 | | | c | | | NG | |
| 22 | | | A | null | | NG | Y |
| 23 | | | B | | | | |
| 24 | | | C | | | NG | |

The comparison result with #22148 applied:

| no. | caseSensitive | table columns | select column | parquet column (select via data source table) | parquet column (select via hive serde table) | consistent? | introduced by #22148 |
|----|----|----|----|----|----|----|----|
| 1 | true | a, b, c | a | | | | |
| 2 | | | b | null | | NG | |
| 3 | | | c | | | | |
| 4 | | | A | AnalysisException | AnalysisException | | |
| 5 | | | B | AnalysisException | AnalysisException | | |
| 6 | | | C | AnalysisException | AnalysisException | | |
| 7 | | A, B, C | a | AnalysisException | AnalysisException | | |
| 8 | | | b | AnalysisException | AnalysisException | | |
| 9 | | | c | AnalysisException | AnalysisException | | |
| 10 | | | A | null | | NG | |
| 11 | | | B | | | | |
| 12 | | | C | | | NG | |
| 13 | false | a, b, c | a | | | | |
| 14 | | | b | | | | |
| 15 | | | c | RuntimeException | | NG | Y |
| 16 | | | A | | | | |
| 17 | | | B | | | | |
| 18 | | | C | RuntimeException | | NG | Y |
| 19 | | A, B, C | a | | | | |
| 20 | | | b | | | | |
| 21 | | | c | RuntimeException | | NG | |
| 22 | | | A | | | | |
| 23 | | | B | | | | |
| 24 | | | C | RuntimeException | | NG | |

We can see that data source tables and hive serde tables have two major differences with respect to Parquet field resolution.

With regard to Parquet field resolution, shall we make the hive serde table behavior consistent with the data source table behavior? What do you think?

**Member:**

We should respect spark.sql.caseSensitive in both modes, but also add a legacy SQLConf to enable users to revert back to the previous behavior.

**Member:**

Could you add a test case for the one you did?

**Contributor:**

First, we should not change the behavior of hive tables. They inherit many behaviors from Hive, and let's keep it as it was.

Second, why do we treat this as a behavior change? I think it's a bug that we don't respect `spark.sql.caseSensitive` in field resolution. In general we should not add a config to restore a bug.

I don't think this document is helpful. It explains a subtle and unreasonable behavior to users, which IMO just makes them confused.

**Member:**

Making 1, 2 consistent is enough. : )

**@gatorsmile** (Member), Aug 28, 2018:

BTW, the parquet files could be generated by our `DataFrameWriter`. Thus, the physical schema and the logical schema could still have different cases.

**@yucai** (Contributor), Aug 29, 2018:

@gatorsmile I think 1 and 2 are always consistent. They both use Spark reader. Am I wrong?

  1. parquet table created by Spark (using parquet) read by Spark reader
  2. parquet table created by Spark (using hive) read by Spark reader

**Member:**

#22184 (comment) already shows they are not consistent, right?

**Contributor:**

The testing is based on `spark.sql.hive.convertMetastoreParquet` being set to false, so it should use the Hive serde reader instead of the Spark reader; sorry if it is too confusing here.
I guess you mean 1 and 3 :). I understand now.

If we are not going to backport the PR to 2.3, I can close SPARK-25206 also?

**@HyukjinKwon** (Member):

ok to test

**@SparkQA** commented Aug 26, 2018:

Test build #95262 has finished for PR 22184 at commit eae8a3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

**@cloud-fan** (Contributor):

Since we are not going to backport the PR to 2.3, do we still need this migration guide?

Strictly speaking, we do have a behavior change here: hive tables are always case-insensitive, and we should not read a hive parquet table with the native parquet reader if Spark is in case-sensitive mode. @seancxmao can you send a follow-up PR to do it? Thanks!

**@seancxmao** (Contributor, Author) commented Aug 29, 2018:

@cloud-fan OK, I will do it.

Just to confirm: when reading from a hive parquet table, if `spark.sql.hive.convertMetastoreParquet` and `spark.sql.caseSensitive` are both set to true, we throw an exception to tell users they should not do this because it could lead to inconsistent results. Our objective is to make the native parquet reader consistent with the hive serde reader wherever possible, and to stop reading when we are not able to. Is my understanding correct?

Another thing to confirm: when there is ambiguity in case-insensitive mode, the native parquet reader throws an exception while the hive serde reader returns the first matched field, which is not consistent. Is that OK? Or shall we handle this case in the same way as above? I mean: when reading from a hive parquet table, if `spark.sql.hive.convertMetastoreParquet = true`, `spark.sql.caseSensitive = false`, and we find out there is ambiguity, we throw an exception to tell users they should not do this because it could lead to inconsistent results.

**@cloud-fan** (Contributor):

> if `spark.sql.hive.convertMetastoreParquet` and `spark.sql.caseSensitive` are both set to true, we throw an exception

I'd like to just skip the conversion and log a warning message to say why.

> ... which is not consistent

I think it's ok. At the end they are different data sources and can define their own behaviors.

But you do have a point about `spark.sql.hive.convertMetastoreParquet`: the behavior must be consistent to do the conversion. My proposal is that the parquet data source should provide an option (not a SQL conf) to switch the behavior when hitting duplicated field names in case-insensitive mode. And when converting a hive parquet table to the parquet data source, set the option and ask the parquet data source to pick the first matched field.
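The proposed switch can be sketched in plain Python, again as an illustrative model with a hypothetical parameter name (`on_duplicates`), since the real option name was left to the follow-up PR:

```python
def resolve_case_insensitive(requested, parquet_fields, on_duplicates="fail"):
    """Case-insensitive resolution with a switchable duplicate policy.

    on_duplicates="fail":        raise on ambiguity (the data source default).
    on_duplicates="first_match": pick the first matched field, mimicking
                                 the hive serde reader's behavior.
    """
    matches = [f for f in parquet_fields if f.lower() == requested.lower()]
    if not matches:
        return None
    if len(matches) > 1 and on_duplicates == "fail":
        raise RuntimeError(
            "Duplicate fields %s for '%s'" % (matches, requested))
    return matches[0]
```

Under this sketch, converting a hive parquet table would set the "first_match" policy so the converted read agrees with the hive serde reader, while plain data source reads keep the stricter "fail" policy.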

**@seancxmao** (Contributor, Author) commented Aug 29, 2018:

> My proposal is, parquet data source should provide an option(not SQL conf) to ...

You mentioned this option is not a SQL conf. Could you give me some advice about where this option should be defined? I had thought to define it in SQLConf as something like `spark.sql.parquet.onDuplicatedFields = FAIL | FIRST_MATCH`, since I see a bunch of options starting with `spark.sql.parquet` in SQLConf. Do you mean this option is supposed to be used only internally and not exposed to end users?

**@cloud-fan** (Contributor):

See `ParquetOptions`. An option can be specified per query, while a SQL conf is per session.

**@seancxmao** (Contributor, Author):

@cloud-fan I've just sent a PR (#22343) for this.

**@seancxmao** (Contributor, Author):

@cloud-fan @gatorsmile I think the old *Upgrading From Spark SQL 2.3.1 to 2.3.2 and above* section is not needed since we do not backport SPARK-25132 to branch-2.3. I'm wondering if we need *Upgrading From Spark SQL 2.3 to 2.4 and above*. What do you think?

**@HyukjinKwon** (Member):

@seancxmao, so this behaviour-change description is only valid when upgrading Spark 2.3 to 2.4? Then we can add it in *Upgrading From Spark SQL 2.3 to 2.4*.

**@seancxmao** (Contributor, Author):

@HyukjinKwon Thank you for your comments. Yes, this is only valid when upgrading Spark 2.3 to 2.4. I will do it.

**@srowen** (Member) commented Nov 26, 2018:

@seancxmao is this PR still live?

**@seancxmao** (Contributor, Author):

@srowen Sorry for the late reply! I'd like to close this PR and file a new one since our SQL doc has changed a lot. Thank you all for your comments and time!

@seancxmao seancxmao closed this Dec 5, 2018
asfgit pushed a commit that referenced this pull request Dec 9, 2018
…ive field resolution when reading from Parquet

## What changes were proposed in this pull request?
#22148 introduces a behavior change. According to discussion at #22184, this PR updates migration guide when upgrade from Spark 2.3 to 2.4.

## How was this patch tested?
N/A

Closes #23238 from seancxmao/SPARK-25132-doc-2.4.

Authored-by: seancxmao <seancxmao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 55276d3)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asfgit pushed a commit that referenced this pull request Dec 9, 2018 (same commit message as above)
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019 (same commit message)
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019 (same commit message)
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019 (same commit message)
7 participants