Crawler for RunSubmit API usages from External Orchestrators (ADF/Airflow) #366
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #366      +/-   ##
==========================================
- Coverage   83.73%   82.61%    -1.13%
==========================================
  Files          30       30
  Lines        2337     2490     +153
  Branches      410      445      +35
==========================================
+ Hits         1957     2057     +100
- Misses        293      326      +33
- Partials       87      107      +20
…it on tasks and can vary inside a job. Added a hashing algorithm to assess uniqueness across task submissions, as external systems do not consistently provide identifiers. Added additional tests.
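For context, a minimal sketch of the kind of hashing described here - deriving a stable identifier from a task's stable attributes, since external orchestrators don't supply one consistently. The exact field selection and helper name are illustrative, not the PR's code:

```python
import hashlib

def _create_hash_from_job_run_task(task) -> str:
    """Illustrative only: hash the stable parts of a run-submit task."""
    hash_values: list[str] = []
    if task.notebook_task is not None:  # assumed field, mirrors the PR's pattern
        hash_values.append(task.notebook_task.notebook_path)
    if task.spark_python_task is not None:
        hash_values.append(task.spark_python_task.python_file)
    if task.dbt_task is not None:
        hash_values.append(",".join(sorted(task.dbt_task.commands)))
    joined = "|".join(value for value in hash_values if value is not None)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```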
lgtm
Missing logic for hashing/deduping the similar run-submit jobs, and it has some logic errors.
Added a table migration doc. Let's discuss the migration process.
Check that database name is a word, loop until correct.
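A rough sketch of that check, assuming a simple prompt loop; the `prompt` callable and the messages are hypothetical, not the installer's actual code:

```python
import re

def _configure_inventory_database(prompt) -> str:
    """Illustrative only: keep asking until the name is a single word."""
    while True:
        name = prompt("Inventory database name")
        if re.fullmatch(r"\w+", name):
            return name
        print(f"{name} is not a valid name: use only letters, digits and underscores")
```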
* Added `inventory_database` name check during installation ([#275](#275)).
* Added a column to `$inventory.tables` to specify if a table might have been synchronised to Unity Catalog already or not ([#306](#306)).
* Added a migration state to skip already migrated tables ([#325](#325)).
* Fixed appending to tables by adding filtering of `None` rows ([#356](#356)).
* Fixed handling of missing but linked cluster policies ([#361](#361)).
* Ignore errors for Redash widgets and queries redeployment during installation ([#367](#367)).
* Remove exception and added proper logging for groups in the list that… ([#357](#357)).
* Skip group migration when no groups are available after preparation step ([#363](#363)).
* Update databricks-sdk requirement from ~=0.9.0 to ~=0.10.0 ([#362](#362)).
add test for `_configure_inventory_database()`
add test for `run_for_config()`
Changed days to 90
Co-authored-by: Serge Smertin <259697+nfx@users.noreply.github.com>
This starts to take shape. Minor comments remaining - please also add the integration test for this, so that we have some happy path calling the real APIs.
class ExternallyOrchestratedJobTaskCrawler(CrawlerBase):
    def __init__(self, ws: WorkspaceClient, sbe: SqlBackend, schema):
        super().__init__(sbe, "hive_metastore", schema, "job_runs", ExternallyOrchestratedJobTask)
`job_runs` may be too generic here for a table name.
will rename it
if task.spark_python_task is not None:
    hash_values.append(task.spark_python_task.python_file)
if task.spark_submit_task is not None:
    hash_values.append(task.spark_submit_task.parameters)
wouldn't these be different for every run in airflow?
🤦 parameters shouldn't be in there
But the Python file should largely not change if it's the same DAG, unless they're doing some clever metadata-driven selection.
Will remove it, it shouldn't be there
hash_values.append(task.sql_task.dashboard.dashboard_id)
hash_values.append(task.sql_task.query.query_id)
if task.dbt_task is not None:
    task.dbt_task.commands.sort()
is it really needed here?
Defensive posture :)
hash_values.append(task.dbt_task.catalog)
hash_values.append(task.dbt_task.warehouse_id)
hash_values.append(task.dbt_task.project_directory)
hash_values.append(",".join(task.dbt_task.commands))
hash_values.append(",".join(task.dbt_task.commands)) | |
hash_values.append(",".join(sorted(task.dbt_task.commands))) |
this might be a bit more deterministic
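For illustration, sorting removes any dependence on the order the commands happen to arrive in, so two otherwise identical submissions produce the same hash input:

```python
a = ["dbt test", "dbt run"]
b = ["dbt run", "dbt test"]
assert ",".join(a) != ",".join(b)                  # order-sensitive, hashes would differ
assert ",".join(sorted(a)) == ",".join(sorted(b))  # stable input regardless of order
```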
okay, will opt for this instead
hash_values.append(task.git_source.git_tag)
hash_values.append(task.git_source.git_branch)
hash_values.append(task.git_source.git_commit)
Suggested change:
- hash_values.append(task.git_source.git_tag)
- hash_values.append(task.git_source.git_branch)
- hash_values.append(task.git_source.git_commit)

these might vary a lot, no need to have them in hash
Agreed
hash_values.append(task.git_source.git_branch)
hash_values.append(task.git_source.git_commit)
hash_values.append(task.git_source.git_provider)
hash_values.append(task.git_source.git_snapshot.used_commit)
Suggested change:
- hash_values.append(task.git_source.git_snapshot.used_commit)
for task in job_run.tasks:
    spark_version = self._get_spark_version_from_task(task, job_run, all_clusters)
    data_security_mode = self._get_data_security_mode_from_task(task, job_run, all_clusters)
    hashed_id = self._create_hash_from_job_run_task(task)
One more thing I forgot: you're not checking if `hashed_id` was seen already or not. Add it.
Also: why do we hash on the job task and NOT on the job_run? Airflow creates job runs, not tasks.
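A minimal sketch of the check being asked for - skip a hash that was already seen before emitting another row. The loop mirrors the snippet above; the `seen` set and the trailing comment are assumptions, not the PR's code:

```python
seen: set[str] = set()
for task in job_run.tasks:
    hashed_id = self._create_hash_from_job_run_task(task)
    if hashed_id in seen:
        continue  # duplicate submission, don't write another record
    seen.add(hashed_id)
    spark_version = self._get_spark_version_from_task(task, job_run, all_clusters)
    data_security_mode = self._get_data_security_mode_from_task(task, job_run, all_clusters)
    # ... build and collect the record for this hashed_id
```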
I figured we can do the dedupe in the actual report, based on the hash.
Airflow creates job runs, yes, but each task is triggered in a workflow, so a job run might exist where only one task out of 5 ran successfully. Also, each task can have its own cluster configuration, and that's what we're checking.
Two things:
- this table has to get as little noise as possible, so we have to dedupe before we write to the table.
- we have to dedupe based on the job_run, not the task, because that's how multi-task jobs are designed.
Yes, but cluster definitions live on the task, as mentioned. Trying to imagine a user running this...
We potentially run the risk, in particularly large multi-task workflows, of directing the user to the wrong task to address.
If there are multiple tasks that would fail, then a user potentially has to change multiple cluster configurations. If we roll this up to one result, we run the risk of telling the user to fix "1" thing.
For a job with 50 tasks we have to have only one record in this table.
The challenge is around usability: if this shows up in the failure JSON, returning a useful report from it is hard.
Especially if there are 50 tasks ☝️
I'll change the grain for now so we can move this along.
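A sketch of what the coarser grain could look like - one hash, and therefore one record, per job run, built from the per-task hashes. `_create_hash_from_job_run` is a hypothetical helper, not code from this PR:

```python
import hashlib

def _create_hash_from_job_run(self, job_run) -> str:
    """Illustrative only: one identifier per externally orchestrated job run."""
    task_hashes = sorted(self._create_hash_from_job_run_task(task) for task in (job_run.tasks or []))
    return hashlib.sha256("|".join(task_hashes).encode("utf-8")).hexdigest()
```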
But we'll have to make a decision about which cluster version info we surface.
A multi-task job run can use multiple cluster definitions.
For now I'll just choose the lowest DBR, but this can be misleading.
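For example, picking the lowest runtime across a run's task clusters might look like the sketch below. It assumes `spark_version` strings such as "13.3.x-scala2.12" and would need extra handling for custom images, so treat it as an approximation rather than the PR's logic:

```python
def _lowest_spark_version(versions: list[str]) -> str:
    """Illustrative only: pick the smallest major.minor runtime string."""
    def key(version: str) -> tuple[int, int]:
        major, minor, *_ = version.split(".")
        return int(major), int(minor)
    return min(versions, key=key)
```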
Co-authored-by: Serge Smertin <259697+nfx@users.noreply.github.com>
this also needs an integration test
Haven't even bothered trying to set one up yet; the amount of data required to run this is high. Will work on it...
@zpappa just create a few job runs with fixtures - e.g. calling the …
Fixed the external location test. Modified crawler to reference dataclass instead of type.
Added asserts to e2e tests to verify that the backup groups get deleted
Closing in favor of #395 after a rebase gone wrong.
Resolves #266
Added JobRunsCrawler
Added a crawler to look at JobRuns from the SDK and determine which of the job runs are from the RunsSubmit API
Added Unit Tests
Added tests to cover basic logic, further tests pending.
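As a rough sketch of the detection described above (not the crawler's actual implementation): the Jobs API marks one-time submissions with a SUBMIT_RUN run type, so filtering listed runs on that field is one way to find them; the exact field and enum shape may vary between SDK versions:

```python
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()

# Keep only runs created through the one-time RunSubmit API
submit_runs = [
    run
    for run in ws.jobs.list_runs(expand_tasks=True)
    if run.run_type is not None and run.run_type.value == "SUBMIT_RUN"
]
```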