#438 Optimize hive table existence query #439

yruslan · 2024-07-18T05:43:47Z

Closes #438

Added option:

hive {
  #...
  optimize.exist.query = true
}

to any Hive context. If the option is

false (the default, which keeps the previous behavior), this query is used to check Hive table existence:
SELECT 1 FROM my_db.my_table WHERE 0 = 1
true, this query is used instead:
DESCRIBE my_db.my_table
(this is faster since it never touches data, but may depend on the Hive dialect)
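
As a rough, language-neutral illustration of how the flag selects between the two queries, here is a minimal Python sketch. The function name `hive_exists_query` and its parameters are hypothetical and are not taken from the Pramen code base; only the two SQL statements come from the description above.

```python
def hive_exists_query(database: str, table: str, optimize_exist_query: bool = False) -> str:
    """Return the SQL used to probe for a Hive table (hypothetical helper)."""
    full_name = f"{database}.{table}"
    if optimize_exist_query:
        # Metadata-only probe: never touches data, but may depend on the Hive dialect.
        return f"DESCRIBE {full_name}"
    # Default (previous behavior): a query whose filter matches no rows.
    return f"SELECT 1 FROM {full_name} WHERE 0 = 1"


if __name__ == "__main__":
    print(hive_exists_query("my_db", "my_table"))
    print(hive_exists_query("my_db", "my_table", optimize_exist_query=True))
```

In both cases the table is presumably treated as existing when the statement runs without an error; that interpretation is an assumption of this sketch.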

In addition, added more logging around the Hive table existence check to make Hive issues easier to debug.

github-actions · 2024-07-18T05:47:07Z

Unit Test Coverage

File	Coverage [85.33%]	🍏
QueryExecutorJdbc.scala	96.74%	🍏
JdbcConfig.scala	91.53%	🍏
TaskRunnerBase.scala	83.31%	🍏
QueryExecutorSpark.scala	82.61%	🍏

Total Project Coverage	82.17%	🍏

VladimirRybalko

I'm just thinking out loud.
Maybe it would be better to add fallback logic rather than introducing many specific options like this. In my humble opinion, package users shouldn't have to care about such small nuances.
Even for technical users without the proper background, the difference between the two flows might not be obvious.
You say "DESCRIBE ..." is better, but it might be incompatible with some Hive dialects. So let's use it by default and, in case of failure, just call the second one (SELECT ...).

I know it's a completely different approach with its own pros and cons. I just wanted to share my opinion. Generally, a fallback like this is a good way to avoid creating an enormous number of options that are sometimes not trivial to understand. If you prefer to keep a dedicated option, it's up to you. Who am I to judge :-)
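
The fallback idea above could look roughly like the following Python sketch. It is only an illustration of the suggestion, not what this PR implements. The `execute` callable is a hypothetical stand-in for whatever runs a single SQL statement and raises an exception on failure.

```python
from typing import Callable


def hive_table_exists(execute: Callable[[str], object], full_table_name: str) -> bool:
    """Try the fast DESCRIBE probe first, then fall back to the SELECT probe.

    `execute` is a hypothetical callable that runs one SQL statement and
    raises an exception when the statement fails.
    """
    probes = (
        f"DESCRIBE {full_table_name}",
        f"SELECT 1 FROM {full_table_name} WHERE 0 = 1",
    )
    for query in probes:
        try:
            execute(query)
            return True  # The statement ran, so the table is assumed to exist.
        except Exception:
            continue  # Probe failed: try the next one, if any.
    return False
```

One trade-off of this sketch: a DESCRIBE that fails because the dialect does not support it cannot be told apart from one that fails because the table is missing, so a missing table costs two queries.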

yruslan · 2024-07-18T11:38:13Z

Yes, it is a good suggestion, especially since Pramen already supports templates for creating and updating tables.

However, checking the existence of Hive tables is a tiny bit of functionality I don't want users to have to care about, so it is hard-coded for now.

The reason I left the original (slow) behavior as the default is to make sure I don't accidentally break existing pipelines.

But if we encounter more issues regarding this, I'm indeed going to generalize the solution to make it more flexible, possibly with templates.

yruslan added 4 commits July 16, 2024 14:21

#438 Allow optimized Hive table exist query for metastores that take too long to execute.

0d80e80

#438 Use 'DESCRIBE <table>' for a quicker Hive table existence check.

ee5f8e5

#438 Add task elapsed time to logs.

25a6e49

#438 Log results of checking Hive table existence.

a0065a7

#438 Update README.

527a9a1

yruslan requested review from VladimirRybalko and kevinwallimann July 18, 2024 05:53

yruslan marked this pull request as ready for review July 18, 2024 05:53

VladimirRybalko approved these changes Jul 18, 2024

yruslan merged commit 9fff698 into main Jul 18, 2024
8 checks passed

yruslan deleted the bugfix/438-optimize-hive-table-existence-query branch July 18, 2024 11:38

yruslan mentioned this pull request Jul 19, 2024

Release Pramen v1.9.1 #441

Merged
