#438 Optimize hive table existence query #439

Merged
yruslan merged 5 commits into main from bugfix/438-optimize-hive-table-existence-query
Jul 18, 2024

Collaborator

@yruslan yruslan commented Jul 18, 2024

Closes #438

Added option:

hive {
  #...
  optimize.exist.query = true
}

to any Hive context. If

  • false (default = previous behavior), uses this query for checking Hive table existence:
    SELECT 1 FROM my_db.my_table WHERE 0 = 1
  • true, uses this query for checking Hive table existence:
    DESCRIBE my_db.my_table
    (this is faster since it never touches data, but may depend on the Hive dialect)
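
As a rough illustration of the switch described above, the query selection could look something like the sketch below. The object and method names (`HiveExistQuery`, `existenceQuery`) are hypothetical and are not taken from Pramen's code; only the two query shapes and the meaning of the flag come from this description.

```scala
// Hypothetical helper, NOT Pramen's actual API: it only illustrates how the
// `optimize.exist.query` flag could map to the two queries listed above.
object HiveExistQuery {
  def existenceQuery(fullTableName: String, optimizeExistQuery: Boolean): String =
    if (optimizeExistQuery) {
      // Metadata-only check; support may vary between Hive dialects.
      s"DESCRIBE $fullTableName"
    } else {
      // Previous behavior: a query that returns no rows but still goes
      // through the regular query path.
      s"SELECT 1 FROM $fullTableName WHERE 0 = 1"
    }
}
```

In either case the table would be treated as existing when the chosen query runs without an error.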

In addition, added more logging around the Hive table existence check to make Hive issues easier to debug.


github-actions bot commented Jul 18, 2024

Unit Test Coverage

File Coverage [85.33%] 🍏
QueryExecutorJdbc.scala 96.74% 🍏
JdbcConfig.scala 91.53% 🍏
TaskRunnerBase.scala 83.31% 🍏
QueryExecutorSpark.scala 82.61% 🍏
Total Project Coverage 82.17% 🍏

@yruslan yruslan marked this pull request as ready for review July 18, 2024 05:53
Collaborator

@VladimirRybalko VladimirRybalko left a comment

I'm just thinking out loud.
Maybe it would be better to add fallback logic rather than introducing many narrow options like this. In my humble opinion, package users shouldn't have to care about such small nuances.
Even for technical users, the difference between the two flows might not be obvious without the proper background.
You say "DESCRIBE ..." is better, but it might be incompatible with some Hive dialects. So let's use it by default and, in case of failure, just call the second one (SELECT ...).

I know it's a completely different approach with its own pros and cons. I just wanted to share my opinion. In general, my suggestion is a good way to avoid creating an enormous number of options that are sometimes not trivial to understand. If you prefer to keep a dedicated option, it's up to you. Who am I to judge :-)
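
For reference, the fallback idea from this comment could be sketched roughly as follows. This is not what the PR implements (the PR keeps a dedicated option), and `HiveExistFallback`, `tableExists`, and `runQuery` are hypothetical names; `runQuery` stands in for whatever executes SQL and is assumed to throw when a query fails.

```scala
import scala.util.Try

// Hypothetical sketch of the reviewer's suggestion, NOT code from this PR.
object HiveExistFallback {
  // `runQuery` is an assumed stand-in for a JDBC or Spark query executor
  // that throws an exception when the query fails.
  def tableExists(fullTableName: String)(runQuery: String => Unit): Boolean = {
    val describe = s"DESCRIBE $fullTableName"
    val select = s"SELECT 1 FROM $fullTableName WHERE 0 = 1"

    // Try the cheap metadata query first; only if it fails, fall back
    // to the original SELECT-based check.
    Try(runQuery(describe)).orElse(Try(runQuery(select))).isSuccess
  }
}
```

One trade-off of this sketch: when the table really does not exist, both queries are issued before the check returns false, so the negative case costs two round trips instead of one.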

@yruslan
Collaborator Author

yruslan commented Jul 18, 2024

Yes, it is a good suggestion, especially since Pramen already supports templates for creating and updating tables.

However, checking the existence of Hive tables is a tiny bit of functionality I don't want users to have to care about. So it is hard-coded for now.

The reason I kept the original (slow) behavior as the default is to make sure I don't accidentally break existing pipelines.

But if we encounter more issues regarding this, I'm indeed going to generalize the solution to make it more flexible, possibly with templates.

@yruslan yruslan merged commit 9fff698 into main Jul 18, 2024
8 checks passed
@yruslan yruslan deleted the bugfix/438-optimize-hive-table-existence-query branch July 18, 2024 11:38
@yruslan yruslan mentioned this pull request Jul 19, 2024
Development

Successfully merging this pull request may close these issues.

Checking Hive table existence is slow
2 participants