Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avro.schema.literal support and some related patches #5101

Closed
wants to merge 5 commits into from

Conversation

lxynov
Copy link
Member

@lxynov lxynov commented Sep 8, 2020

Closes #5001
A related issue: prestodb/presto#9116

@lxynov lxynov marked this pull request as draft September 8, 2020 15:29
@jiegzhan
Copy link

@lxynov, you have anything to add to this PR? Thanks.

@cla-bot cla-bot bot added the cla-signed label Sep 22, 2020
@lxynov lxynov force-pushed the avro branch 2 times, most recently from eabdc5e to 75400d3 Compare September 23, 2020 05:51
@lxynov lxynov changed the title [WIP] avro.schema.literal support and some related patches avro.schema.literal support and some related patches Sep 23, 2020
@lxynov lxynov marked this pull request as ready for review September 23, 2020 05:52
@lxynov lxynov requested review from jiegzhan and findepi September 23, 2020 21:28
@lxynov
Copy link
Member Author

lxynov commented Sep 23, 2020

@findepi Could you help review this? Some commits seems hacky (due to Hive's lack of spec and doc). Feel free to slack me if you'd like to discuss in DMs.

@jiegzhan
Copy link

@lxynov @findepi Hey guys, any progress on this PR? Thanks.

if (isAvroTableWithSchemaSet(table)) {
// If a partition's Serde doesn't use metastore for schema (like Avro), we use table-level schema. Data publishers
// are responsible for making sure that table-level schema is workable for all partitions.
if (!isPartitionSerdesUsingMetastoreForSchema(partition)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a test (product test?) for partitioned tables where table and partition schema mismatches?

for example

  • table is ORC, partition is Parquet
  • table is ORC, partition is Avro with schema url
  • table is ORC, partition is Avro with schema literal
  • table is ORC, partition is CSV

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to add such a product test (using HQLs) which fails before the commit and succeeds after the commit, but failed to do so. But I think this commit logically makes sense and it has been deployed at our company for a long time. Please let me know if you have concerns.

new Schema.Parser().parse(avroSchemaLiteral);
return avroSchemaLiteral;
}
catch (SchemaParseException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be any other exception flying here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By looking at the implementation of Schema.Parser::parse(), I think SchemaParseException is sufficient here

if (getAvroSchemaUrl(tableMetadata.getProperties()) != null) {
throw new PrestoException(NOT_SUPPORTED, "CREATE TABLE AS not supported when Avro schema url is set");
if (getAvroSchemaUrl(tableMetadata.getProperties()) != null || getAvroSchemaLiteral(tableMetadata.getProperties()) != null) {
throw new PrestoException(NOT_SUPPORTED, "CREATE TABLE AS not supported when Avro schema url or literal is set");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason not to support CTAS with schema url is because we need the table to be in metastore in order to get the information what the schema actually (at least this is the current implementation).

For schema literal we could obtain this information locally. Even if we do not do this (which I am fine with), we should keep those cases separate here -- as separate ifs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separated ifs

@findepi
Copy link
Member

findepi commented Oct 16, 2020

avro.schema.literal support and some related patches

the table/partition schema handliung changes require some improvement
would it be possible to separate avro.schema.literal into its own PR, or are there too many dependencies?

@ebyhr ebyhr self-requested a review December 3, 2020 06:27
@lxynov lxynov force-pushed the avro branch 2 times, most recently from c68ea1c to 96ce3ef Compare December 17, 2020 04:20
@lxynov
Copy link
Member Author

lxynov commented Dec 17, 2020

Hi @ebyhr @findepi , this PR is ready for another review. Please let me know if you have questions around any of the commits.

Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add test cases about schema evolution (maybe in TestAvroSchemaEvolution)? This PR is going to close #5001, but the test looks not covered.

@Test(groups = AVRO)
public void testHiveCreatedTable()
{
onHive().executeQuery("DROP TABLE IF EXISTS test_avro_schema_literal_hive");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to limit onHive() usage for Hive specific operation because it's slow. Other places are also the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is meant to test Hive created tables, so onHive() is used here.

@lxynov lxynov force-pushed the avro branch 2 times, most recently from c175748 to 7c82037 Compare January 7, 2021 04:42
@lxynov
Copy link
Member Author

lxynov commented Jan 7, 2021

@ebyhr Thanks for the review! Updated.

Could you add test cases about schema evolution (maybe in TestAvroSchemaEvolution)? This PR is going to close #5001, but the test looks not covered.

Added in TestAvroSchemaEvolution

@lxynov
Copy link
Member Author

lxynov commented Jan 7, 2021

@ebyhr All tests have passed. This is ready for review

@lxynov
Copy link
Member Author

lxynov commented Jan 19, 2021

@ebyhr @findepi Could you please take a look? Thanks in advance!

would it be possible to separate avro.schema.literal into its own PR, or are there too many dependencies?

@findepi I thought about this before but found that commit Allow creating tables with Avro schema literal sort of depends on commit Call getFields for serdes not using use metastore for schema, as the later commit changed method isAvroTableWithSchemaSet to isTableSerdesUsingMetastoreForSchema

@colebow
Copy link
Member

colebow commented Oct 19, 2022

👋 @lxynov - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@colebow colebow closed this Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Presto did not pick up new column from avro.schema.literal
5 participants