[SPARK-6010] [SQL] Merging compatible Parquet schemas before computing splits #4768

liancheng · 2015-02-25T17:27:06Z

`ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which doesn't know how to merge conflicting user-defined key-value metadata and throws an exception. In our case, when dealing with different but compatible schemas, different Parquet part-files carry different Spark SQL schema JSON strings, which triggers this problem. Reading similar Parquet files generated by Hive doesn't suffer from this issue.

In this PR, we manually merge the schemas before passing them to `ReadContext` to avoid the exception.
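For readers unfamiliar with the failure mode, here is a rough, self-contained sketch of the idea in plain Python. It is not the code in this patch and does not use Spark or parquet-mr APIs: the `SCHEMA_KEY` name, the dict-based "footers", and all three functions are hypothetical stand-ins, and the merge rule (union of fields, same-named fields must have equal types) is a simplifying assumption about what "compatible" means.

```python
import json

# Hypothetical placeholder key; not the key Spark SQL actually writes.
SCHEMA_KEY = "schema_json"


def strict_merge_metadata(footers):
    """Mimic a key-value merge that rejects conflicting values for one key."""
    merged = {}
    for footer in footers:
        for key, value in footer.items():
            if key in merged and merged[key] != value:
                raise ValueError(f"conflicting values for key {key!r}")
            merged[key] = value
    return merged


def merge_schemas(left, right):
    """Union two schemas (field name -> type name); shared fields must agree."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(f"incompatible types for field {name!r}")
        merged[name] = dtype
    return merged


def merged_read_schema(footers):
    """Merge the per-file schemas up front so a single schema is passed on."""
    result = {}
    for footer in footers:
        result = merge_schemas(result, json.loads(footer[SCHEMA_KEY]))
    return result


# Two part-files with different but compatible schemas.
part1 = {SCHEMA_KEY: json.dumps({"a": "int"})}
part2 = {SCHEMA_KEY: json.dumps({"a": "int", "b": "string"})}

try:
    strict_merge_metadata([part1, part2])
except ValueError as err:
    print("strict merge failed:", err)

print("merged schema:", merged_read_schema([part1, part2]))
```

The strict merge fails because the two schema strings differ, even though the schemas themselves are compatible; merging the parsed schemas first sidesteps the conflict.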

SparkQA · 2015-02-25T17:32:37Z

Test build #27951 has started for PR 4768 at commit 9002f0a.

This patch merges cleanly.

SparkQA · 2015-02-25T18:47:25Z

Test build #27951 has finished for PR 4768 at commit 9002f0a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-25T18:47:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27951/

…g splits

`ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which doesn't know how to merge conflicting user defined key-value metadata and throws exception. In our case, when dealing with different but compatible schemas, we have different Spark SQL schema JSON strings in different Parquet part-files, thus causes this problem. Reading similar Parquet files generated by Hive doesn't suffer from this issue.

In this PR, we manually merge the schemas before passing it to `ReadContext` to avoid the exception.

Author: Cheng Lian <lian@databricks.com>

Closes #4768 from liancheng/spark-6010 and squashes the following commits:

9002f0a [Cheng Lian] Fixes SPARK-6010

(cherry picked from commit e0fdd46)
Signed-off-by: Michael Armbrust <michael@databricks.com>

Fixes SPARK-6010

9002f0a

asfgit closed this in e0fdd46 Feb 25, 2015

liancheng deleted the spark-6010 branch February 26, 2015 02:30

liancheng mentioned this pull request Feb 26, 2015

[SPARK-6037][SQL] Avoiding duplicate Parquet schema merging #4786

Closed
