#438 Optimize hive table existence query #439
I'm just thinking out loud.
Maybe it would be better to add fallback logic rather than introducing many specific options like this. In my humble opinion, package users shouldn't have to care about such small nuances.
Even for technical users, the difference between the two flows might not be obvious without the proper background.
You say "DESCRIBE ..." is better but might be incompatible with some Hive dialects. So let's use it by default and, if it fails, just run the second one (SELECT ...).
I know it's a completely different approach with its own pros and cons; I just wanted to share my opinion. In general, this is a good way to avoid creating an enormous number of options that are sometimes not trivial to understand. If you prefer to keep a dedicated option, it's up to you. Who am I to judge :-)
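The fallback suggested above could look roughly like the sketch below. It is only an illustration in Python, not Pramen code: `run_query` is a hypothetical callable that executes a SQL string against Hive and raises an exception on failure, and `table_exists_with_fallback` is a made-up name.

```python
from typing import Callable


def table_exists_with_fallback(run_query: Callable[[str], object],
                               full_table_name: str) -> bool:
    """Try the cheap DESCRIBE first, then fall back to the SELECT probe.

    `run_query` is a hypothetical stand-in for whatever executes SQL
    against Hive; it is assumed to raise an exception when a query fails.
    """
    queries = (
        f"DESCRIBE {full_table_name}",                    # fast, dialect-dependent
        f"SELECT 1 FROM {full_table_name} WHERE 0 = 1",   # original behavior
    )
    for query in queries:
        try:
            run_query(query)
            return True  # the query succeeded, so the table exists
        except Exception:
            continue  # try the next query, if any
    return False  # both queries failed
```

One trade-off of this sketch: a failed DESCRIBE cannot be told apart from a missing table, so a table that really does not exist costs two queries instead of one.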
Yes, it is a good suggestion, especially since Pramen already supports templates for creating and updating tables. However, checking the existence of Hive tables is a tiny bit of functionality I don't want users to have to care about, so it is hard-coded for now. The reason I left the original (slow) behavior as the default is to make sure I don't accidentally break existing pipelines. But if we encounter more issues with this, I will indeed generalize the solution to make it more flexible, possibly with templates.
Closes #438
Added an option to any Hive context.
If false (default = previous behavior), Hive table existence is checked with this query: SELECT 1 FROM my_db.my_table WHERE 0 = 1
If true, Hive table existence is checked with this query: DESCRIBE my_db.my_table
(this is faster since it never touches data, but may depend on the Hive dialect)
In addition, added more logging around the Hive table existence check to make Hive issues easier to debug.
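As a rough illustration of the behavior described above, the sketch below picks the existence query from a boolean flag and logs each step. It is written in Python for brevity and is not the actual Pramen implementation: the `optimized` parameter, the function names, and the `run_query` callable (assumed to raise on failure) are all hypothetical.

```python
import logging
from typing import Callable

log = logging.getLogger("hive_exists_sketch")  # hypothetical logger name


def existence_query(full_table_name: str, optimized: bool) -> str:
    """Return the query used to probe for the table."""
    if optimized:
        # Never touches data, but may depend on the Hive dialect.
        return f"DESCRIBE {full_table_name}"
    # Previous (default) behavior.
    return f"SELECT 1 FROM {full_table_name} WHERE 0 = 1"


def table_exists(run_query: Callable[[str], object],
                 full_table_name: str,
                 optimized: bool = False) -> bool:
    """Run the selected query; treat any failure as 'table does not exist'."""
    query = existence_query(full_table_name, optimized)
    log.info("Checking Hive table existence with: %s", query)
    try:
        run_query(query)
    except Exception as ex:
        log.info("Hive table %s does not seem to exist: %s", full_table_name, ex)
        return False
    log.info("Hive table %s exists", full_table_name)
    return True
```

Keeping `False` as the default mirrors the choice made in this PR: existing pipelines keep the original query unless the flag is turned on explicitly.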