Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added documentation to describe interaction with external Hive Metastores #473

Merged
merged 6 commits into from
Oct 28, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 38 additions & 0 deletions docs/external_hms.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# External HMS Integration
### TL;DR
The UCX toolkit by default relies on the internal workspace HMS as a source for tables and views.
<br/>Many DB users utilize an external HMS instead of the Workspace HMS provided by DB.
<br/>A popular external HMS is Amazon Glue.
<br/>This document describes the considerations for UCX integration with external HMS.

### Current Considerations
- Integration with external HMS is set up on individual clusters.
- Theoretically we can integrate separate clusters in a workspace with different HMS repositories.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we need for it practically? please describe

- In reality most customers use a single (internal or external) HMS within a workspace.
- When migrating from an external HMS we have to consider that it is used by more than one workspace.
- Integration with external HMS has to be set on all DB Warehouses together.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can Ext-HMS WH access workspace-local HMS?

- HMS connectivity is set, usually, on cluster policy. As well as global SQL Warehouse config
- Typically external HMS setup relies on:
- Spark Config
- Instance Profiles
- Init scripts

### Design Decisions
- Should we set up a single HMS for UCX?
- Should we suggest copying the setup from an existing Cluster/Cluster policy?
- We shouldn't override the set up for the DB Warehouses (that may break functionality)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requirement, not a design decision

- We should allow overriding cluster settings and instance profile setting to accommodate novel settings.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is requirement, not a design decision


### Challenges
- We cannot support multiple HMS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we can store raw delta tables on DBFS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not set up one job task for each unique hms. Usually workspace dont use different HMS, but they likely to have one Ext HMS and one local HMS.

- Using an external HMS to persist UCX tables will break functionality for a second workspace using UCX
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please suggest couple of options on addressing this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the inventory tables, if we use the Workspace ID in either the schema name "ucx-################" or as a column for uniqueness, then we don't have to worry about conflicts.

Likely easier to introduce in the schema name since our teardown scripts drop tables.

- We should consider using a pattern similar to our integration testing to rename the target database to allow persisting from multiple workspaces. For example WS1 --> UCX_ABC, WS2 --> UCX_DEF.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If multiple workspaces are sharing the same Ext HMS, we should do table crawl just once but should do multiple grants crawl per each workspace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing the code to only crawl the external HMS once is a premature optimization. This operation only takes a few minutes right now.

- With external HMS it is likely that some of the tables will not be accessible by some of the workspaces. We may need to migrate certain databases from certain workspaces.

### Suggested flow
1. Start the installer.
2. The installer looks for use of external HMS by the workspace. We review cluster policies or DBSQL warehouses settings.
3. We alert the user that an external HMS is set and request ask a YES/NO to set external HMS.
4. We alert the user if they opted for external HMS and the DB Warehouses are not set for external HMS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be step3, not 4 :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the concern that a workspace may be connected to an external metatore on standard clusters, but the SQL Warehouse is not? If so, is this common?

5. We update the configuration file with the HMS settings.
6. We set the job clusters with the required External HMS settings.