[SPARK-6010] [SQL] Merging compatible Parquet schemas before computing splits #4768

Closed
wants to merge 1 commit into from

liancheng
Contributor

`ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which cannot merge conflicting user-defined key-value metadata and throws an exception when it encounters any. In our case, Parquet part-files with different but compatible schemas carry different Spark SQL schema JSON strings in their metadata, which triggers this exception. Reading similar Parquet files generated by Hive doesn't suffer from this issue.
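
To make the failure mode concrete, here is a minimal sketch in plain Python. It is a simplified model of a strict key-value metadata merge, not the parquet-mr implementation; the function name, the `SCHEMA_KEY` value, and the sample schema strings are all hypothetical.

```python
def merge_key_value_metadata(per_file_metadata):
    """Collapse per-file metadata dicts into one dict.

    Hypothetical model: a key that maps to more than one distinct
    value across files is treated as an unresolvable conflict.
    """
    values_by_key = {}
    for metadata in per_file_metadata:
        for key, value in metadata.items():
            values_by_key.setdefault(key, set()).add(value)

    merged = {}
    for key, values in values_by_key.items():
        if len(values) > 1:
            raise ValueError(
                f"could not merge metadata: key {key} has conflicting values"
            )
        merged[key] = next(iter(values))
    return merged


# Hypothetical metadata key and schema strings, for illustration only.
SCHEMA_KEY = "example.schema.json"
part_1 = {SCHEMA_KEY: '{"a": "int"}'}
part_2 = {SCHEMA_KEY: '{"a": "int", "b": "string"}'}

# Identical strings merge without trouble.
same = merge_key_value_metadata([part_1, part_1])

# Compatible but textually different schema strings are rejected.
try:
    merge_key_value_metadata([part_1, part_2])
    conflict_detected = False
except ValueError:
    conflict_detected = True
```

The two schemas here are compatible (the second only adds a column), but a merge that compares the serialized strings has no way to know that.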

In this PR, we manually merge the schemas before passing the merged schema to `ReadContext`, which avoids the exception.
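
The idea can be sketched as follows, again in plain Python rather than the Spark code from this PR. The sketch assumes a schema is a flat JSON object mapping column names to type names, and that "compatible" means any shared column has the same type; both are simplifying assumptions, and all names are hypothetical.

```python
import json


def merge_compatible_schemas(left, right):
    """Union two column->type mappings, rejecting type conflicts."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(
                f"incompatible types for column {name}: "
                f"{merged[name]} vs {dtype}"
            )
        merged[name] = dtype
    return merged


def merge_schema_json_strings(schema_jsons):
    """Merge per-file schema JSON strings into one JSON string."""
    merged = {}
    for schema_json in schema_jsons:
        merged = merge_compatible_schemas(merged, json.loads(schema_json))
    # sort_keys keeps the serialized form stable across runs.
    return json.dumps(merged, sort_keys=True)


# Hypothetical per-file schema strings.
per_file = ['{"a": "int"}', '{"a": "int", "b": "string"}']
merged_json = merge_schema_json_strings(per_file)

# Every part-file is now described by the same string, so a strict
# string-level metadata merge sees no conflicting values.
distinct_values = {merged_json for _ in per_file}
```

Because the merge happens on parsed schemas rather than on serialized strings, compatible differences are reconciled up front and only a single value reaches the strict merge step.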

SparkQA commented Feb 25, 2015

Test build #27951 has started for PR 4768 at commit 9002f0a.

  • This patch merges cleanly.

SparkQA commented Feb 25, 2015

Test build #27951 has finished for PR 4768 at commit 9002f0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27951/

asfgit pushed a commit that referenced this pull request Feb 25, 2015
…g splits

`ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which cannot merge conflicting user-defined key-value metadata and throws an exception when it encounters any. In our case, Parquet part-files with different but compatible schemas carry different Spark SQL schema JSON strings in their metadata, which triggers this exception. Reading similar Parquet files generated by Hive doesn't suffer from this issue.

In this PR, we manually merge the schemas before passing the merged schema to `ReadContext`, which avoids the exception.

Author: Cheng Lian <lian@databricks.com>

Closes #4768 from liancheng/spark-6010 and squashes the following commits:

9002f0a [Cheng Lian] Fixes SPARK-6010

(cherry picked from commit e0fdd46)
Signed-off-by: Michael Armbrust <michael@databricks.com>
asfgit closed this in e0fdd46 Feb 25, 2015
liancheng deleted the spark-6010 branch February 26, 2015 02:30