Added documentation to describe interaction with external Hive Metastores #473
Changes from all commits
18bf58e
6bd457f
5ed453f
2368fc8
ee91dbe
7b332c4
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# External HMS Integration | ||
### TL;DR | ||
The UCX toolkit by default relies on the internal workspace HMS as a source for tables and views. | ||
<br/>Many DB users utilize an external HMS instead of the Workspace HMS provided by DB. | ||
<br/>A popular external HMS is AWS Glue. | |
<br/>This document describes the considerations for UCX integration with external HMS. | ||
|
||
### Current Considerations | ||
- Integration with external HMS is set up on individual clusters. | ||
- Theoretically we can integrate separate clusters in a workspace with different HMS repositories. | ||
- In reality most customers use a single (internal or external) HMS within a workspace. | ||
- When migrating from an external HMS we have to consider that it is used by more than one workspace. | ||
- Integration with an external HMS has to be configured for all DB Warehouses together. | |
Review comment: can Ext-HMS WH access workspace-local HMS? |
||
- HMS connectivity is usually set in a cluster policy, as well as in the global SQL Warehouse configuration. | |
- Typically external HMS setup relies on: | ||
- Spark Config | ||
- Instance Profiles | ||
- Init scripts | ||
|
||
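For illustration, the Spark configuration part of such a setup can usually be recognized by a handful of keys. The sketch below is a minimal example, not UCX code: the helper name `uses_external_hms` and the list of indicator keys are assumptions, based on the `spark.sql.hive.metastore.*` and `spark.hadoop.javax.jdo.option.*` settings commonly used for external metastores and on the Glue catalog flag. Real deployments may rely on other keys, or on init scripts that this check cannot see.

```python
# Hypothetical helper, not part of UCX: guess whether a cluster's Spark
# configuration points at an external Hive Metastore.

# Assumed indicator keys; a real workspace may use others.
EXTERNAL_HMS_PREFIXES = (
    "spark.sql.hive.metastore.",
    "spark.hadoop.javax.jdo.option.",
)
GLUE_FLAG = "spark.databricks.hive.metastore.glueCatalog.enabled"


def uses_external_hms(spark_conf: dict[str, str]) -> bool:
    """Return True if any assumed external-HMS setting is present."""
    # The Glue flag only counts when it is explicitly switched on.
    if spark_conf.get(GLUE_FLAG, "").strip().lower() == "true":
        return True
    # str.startswith accepts a tuple of prefixes.
    return any(key.startswith(EXTERNAL_HMS_PREFIXES) for key in spark_conf)
```

The same check could be applied to the Spark configuration of a cluster, a cluster policy, or the global SQL Warehouse configuration, once each is flattened into a key/value dictionary.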
### Design Decisions | ||
- Should we set up a single HMS for UCX? | ||
- Should we suggest copying the setup from an existing Cluster/Cluster policy? | ||
- We shouldn't override the setup for the DB Warehouses (that may break functionality). | |
Review comment: This is a requirement, not a design decision. |
||
- We should allow overriding cluster settings and instance profile settings to accommodate novel setups. | |
Review comment: This is a requirement, not a design decision. |
||
|
||
### Challenges | ||
- We cannot support multiple HMS | ||
Review comment: But we can store raw Delta tables on DBFS.
Reply: Could we not set up one job task for each unique HMS? Workspaces usually don't use several different HMS, but they are likely to have one Ext HMS and one local HMS. |
||
- Using an external HMS to persist UCX tables will break functionality for a second workspace using UCX | ||
Review comment: Please suggest a couple of options for addressing this.
Reply: For the inventory tables, if we use the Workspace ID either in the schema name ("ucx-################") or as a column for uniqueness, then we don't have to worry about conflicts. It is likely easier to introduce it in the schema name, since our teardown scripts drop tables. |
||
- We should consider using a pattern similar to our integration testing to rename the target database to allow persisting from multiple workspaces. For example WS1 --> UCX_ABC, WS2 --> UCX_DEF. | ||
Review comment: If multiple workspaces share the same Ext HMS, we should do the table crawl just once, but a separate grants crawl for each workspace.
Reply: Changing the code to only crawl the external HMS once is a premature optimization. This operation only takes a few minutes right now. |
||
- With external HMS it is likely that some of the tables will not be accessible by some of the workspaces. We may need to migrate certain databases from certain workspaces. | ||
|
||
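The per-workspace naming idea from the challenges above (WS1 --> UCX_ABC, WS2 --> UCX_DEF, or the workspace ID in the schema name) could look roughly like the sketch below. It is only an illustration: the function name `inventory_schema_name` and the `ucx` prefix are assumptions, and an underscore is used instead of the hyphen from the review thread so that the name needs no quoting in SQL.

```python
import re


def inventory_schema_name(workspace_id: int, prefix: str = "ucx") -> str:
    """Build a per-workspace inventory schema name (hypothetical helper)."""
    if workspace_id <= 0:
        raise ValueError("workspace_id must be a positive integer")
    name = f"{prefix}_{workspace_id}"
    # Restrict the name to letters, digits and underscores.
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"invalid schema name: {name!r}")
    return name
```

With this scheme, two workspaces that share one external HMS write their inventory tables to different schemas, so a teardown in one workspace does not drop the tables of the other.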
### Suggested flow | ||
1. Start the installer. | ||
2. The installer looks for use of an external HMS by the workspace. We review cluster policies or DBSQL warehouse settings. | |
3. We alert the user that an external HMS is set and ask for a YES/NO answer on whether to use the external HMS. | |
4. We alert the user if they opted for the external HMS and the DB Warehouses are not set for the external HMS. | |
Review comment: Should be step 3, not 4 :)
Reply: Is the concern that a workspace may be connected to an external metastore on standard clusters, but the SQL Warehouse is not? If so, is this common? |
||
5. We update the configuration file with the HMS settings. | ||
6. We set the job clusters with the required External HMS settings. |
Review comment: what do we need for it practically? please describe
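As one possible reading of steps 5 and 6, the sketch below copies the fixed external-HMS settings from a cluster policy definition into a job cluster specification. It is an assumption-laden illustration rather than the installer's actual behavior: the helper names `external_hms_settings` and `apply_to_job_cluster` are hypothetical, the set of indicator keys is assumed, and only `fixed` policy rules and an AWS instance profile are handled. Init scripts are not covered.

```python
import json

# Assumed indicator keys for an external HMS; real setups may differ.
HMS_PREFIXES = ("spark.sql.hive.metastore.", "spark.hadoop.javax.jdo.option.")
GLUE_FLAG = "spark.databricks.hive.metastore.glueCatalog.enabled"


def external_hms_settings(policy_definition: str) -> dict:
    """Extract fixed external-HMS settings from a policy definition (JSON)."""
    spark_conf = {}
    instance_profile = None
    for path, rule in json.loads(policy_definition).items():
        # Only rules that pin a single value can be copied safely.
        if not isinstance(rule, dict) or rule.get("type") != "fixed":
            continue
        if path.startswith("spark_conf."):
            key = path[len("spark_conf."):]
            if key.startswith(HMS_PREFIXES) or key == GLUE_FLAG:
                spark_conf[key] = str(rule["value"])
        elif path == "aws_attributes.instance_profile_arn":
            instance_profile = rule["value"]
    return {"spark_conf": spark_conf, "instance_profile_arn": instance_profile}


def apply_to_job_cluster(cluster_spec: dict, settings: dict) -> dict:
    """Return a copy of a job cluster spec with the HMS settings merged in."""
    merged = dict(cluster_spec)
    merged["spark_conf"] = {
        **cluster_spec.get("spark_conf", {}),
        **settings["spark_conf"],
    }
    if settings["instance_profile_arn"]:
        merged["aws_attributes"] = {
            **cluster_spec.get("aws_attributes", {}),
            "instance_profile_arn": settings["instance_profile_arn"],
        }
    return merged
```

The extracted dictionary is also what would be written to the configuration file in step 5, so that the user can override it for novel setups before the job clusters are created.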