Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking change: Mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together #512

Merged
merged 13 commits into from
Oct 28, 2023

Conversation

mwojtyczka
Copy link
Contributor

@mwojtyczka mwojtyczka commented Oct 26, 2023

Applying grants that belong to the same principle, object id, and object type concurrently is not supported in legacy Table ACLs currently:

TableAcl grant/revoke operations are not atomic. When granting the permissions, the service would first get all existing permissions, append with the new permissions, and set the full list in the database. If there are concurrent grant requests, both requests might succeed and emit the audit logs, but what actually happens could be that the new permission list from one request overrides the other one, causing permission loss.

This PR will:

  • Resolve issue 499.
  • Improve performance of migrating Table ACLs permissions
  • Added tests covering objects ownership
  • Increased test coverage around Table ACLs

Example

Instead of executing grants in separate transactions:
GRANT SELECT ON TABLE hive_metastore.db_a.table_a TO group_a
GRANT MODIFY ON TABLE hive_metastore.db_a.table_a TO group_a

Folding and executing in one statement:
GRANT SELECT, MODIFY ON TABLE hive_metastore.db_a.table_a TO group_a

@mwojtyczka mwojtyczka requested a review from a team October 26, 2023 18:44
@mwojtyczka mwojtyczka changed the title Fix and optimize Table ACLs by folding grants belonging to the same principal, object id and object type Fix Table ACLs by folding grants belonging to the same principal, object id and object type Oct 26, 2023
@codecov
Copy link

codecov bot commented Oct 26, 2023

Codecov Report

Merging #512 (4a3fc6f) into main (18e3b54) will increase coverage by 0.08%.
The diff coverage is 93.33%.

@@            Coverage Diff             @@
##             main     #512      +/-   ##
==========================================
+ Coverage   80.64%   80.73%   +0.08%     
==========================================
  Files          31       31              
  Lines        3240     3265      +25     
  Branches      626      633       +7     
==========================================
+ Hits         2613     2636      +23     
  Misses        478      478              
- Partials      149      151       +2     
Files Coverage Δ
...rc/databricks/labs/ucx/workspace_access/manager.py 82.35% <100.00%> (ø)
src/databricks/labs/ucx/workspace_access/tacl.py 96.36% <93.10%> (-0.31%) ⬇️

... and 1 file with indirect coverage changes

@mwojtyczka mwojtyczka changed the title Fix Table ACLs by folding grants belonging to the same principal, object id and object type Fix Table ACLs permissions loss by folding grants belonging to the same principal, object id and object type Oct 26, 2023
@mwojtyczka mwojtyczka changed the title Fix Table ACLs permissions loss by folding grants belonging to the same principal, object id and object type Fix Table ACLs permissions loss by folding grants belonging to the same principal, object id and object type together Oct 26, 2023
@mwojtyczka mwojtyczka temporarily deployed to account-admin October 26, 2023 19:09 — with GitHub Actions Inactive
@mwojtyczka mwojtyczka changed the title Fix Table ACLs permissions loss by folding grants belonging to the same principal, object id and object type together Optimize and mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together Oct 26, 2023
@mwojtyczka mwojtyczka changed the title Optimize and mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together Mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together Oct 26, 2023
# * GRANT MODIFY ON TABLE hive_metastore.db_a.table_a TO group_a
# will be folded and executed in one statement/transaction:
# * GRANT SELECT, MODIFY ON TABLE hive_metastore.db_a.table_a TO group_a
folded = collections.defaultdict(lambda: {"action_type": set()})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
folded = collections.defaultdict(lambda: {"action_type": set()})
folded = collections.defaultdict(set)

Why do we need more than a set?

Copy link
Contributor Author

@mwojtyczka mwojtyczka Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

folded[key]["action_type"].add(grant.action_type)
grant_dict = dataclasses.asdict(grant)
grant_dict["action_type"] = ", ".join(sorted(folded[key]["action_type"]))
folded[key]["grant_folded"] = grant_dict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Contributor Author

@mwojtyczka mwojtyczka Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified this a bit. We need to fold (put together) action types for all grants that have the same principal, object id and type. We then put the actions into the grant object. We only take one grant object per principal, object id and type, and just update its actions. The rest of the grants can be skipped.

)

assert (
f'{{"principal": "{ws_group.display_name}", "action_type": "MODIFY, SELECT", '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't assert on JSON string, assert on fields. Keys will be serialized in different order each time and flake the test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

assert "TABLE" == permissions.object_type
assert "catalog_a.database_b.table_c" == permissions.object_id
assert (
'{"principal": "foo@example.com", "action_type": "MODIFY, SELECT", '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't assert on JSON string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

@mwojtyczka mwojtyczka temporarily deployed to account-admin October 27, 2023 16:01 — with GitHub Actions Inactive
# use one of the grants with actions folded per principal, object type and id
grant_dict = dataclasses.asdict(grant)
grant_dict["action_type"] = ", ".join(sorted(folded_actions[key]))
grant_folded_actions[key] = grant_dict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a risk of corrupting data, can't accept this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

@nfx
Copy link
Collaborator

nfx commented Oct 27, 2023

do reconstruction of grants, not the overwrite

Index: src/databricks/labs/ucx/workspace_access/tacl.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/databricks/labs/ucx/workspace_access/tacl.py b/src/databricks/labs/ucx/workspace_access/tacl.py
--- a/src/databricks/labs/ucx/workspace_access/tacl.py	(revision 45bae4a880095ce35719cc01c27c05f1786ac711)
+++ b/src/databricks/labs/ucx/workspace_access/tacl.py	(date 1698304643087)
@@ -1,5 +1,5 @@
+import collections
 import dataclasses
-import functools
 import json
 from collections.abc import Callable, Iterator
 from functools import partial
@@ -21,18 +21,37 @@
         self._sql_backend = sql_backend
 
     def get_crawler_tasks(self) -> Iterator[Callable[..., Permissions | None]]:
-        def inner(grant: Grant) -> Permissions:
-            object_type, object_key = grant.this_type_and_key()
-            return Permissions(object_type=object_type, object_id=object_key, raw=json.dumps(dataclasses.asdict(grant)))
-
+        folded = collections.defaultdict(set)
         for grant in self._grants_crawler.snapshot():
-            yield functools.partial(inner, grant)
+            folded[(grant.principal, grant.this_type_and_key())].add(grant.action_type)
+        for (principal, (object_type, key)), actions in folded.items():
+            yield lambda: Permissions(
+                object_type=object_type,
+                object_id=f'{principal}/{key}',
+                raw=json.dumps(sorted(actions)))
 
     def object_types(self) -> set[str]:
         return {"TABLE", "DATABASE", "VIEW", "CATALOG", "ANONYMOUS FUNCTION", "ANY FILE"}
 
+    def _from_reduced(self, item: Permissions):
+        actions = ', '.join(json.loads(item.raw))
+        principal, object_id = item.object_id
+        raw = {'principal': principal, 'action_type': actions}
+        match item.object_type:
+            case 'TABLE':
+                return raw | {'table': object_id}
+            case 'VIEW':
+                return raw | {'view': object_id}
+            case 'DATABASE':
+                return raw | {'database': object_id}
+            case 'ANONYMOUS FUNCTION':
+                return raw | {'anonymous_function': object_id}
+            case 'CATALOG':
+                return raw | {'catalog': object_id}
+
     def get_apply_task(self, item: Permissions, migration_state: GroupMigrationState, destination: Destination):
-        grant = Grant(**json.loads(item.raw))
+        raw = self._from_reduced(item)
+        grant = Grant(**raw)
         target_principal = migration_state.get_target_principal(grant.principal, destination)
         if target_principal is None:
             # this is a grant for user, service principal, or irrelevant group

@mwojtyczka
Copy link
Contributor Author

mwojtyczka commented Oct 28, 2023

do reconstruction of grants, not the overwrite

Index: src/databricks/labs/ucx/workspace_access/tacl.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/databricks/labs/ucx/workspace_access/tacl.py b/src/databricks/labs/ucx/workspace_access/tacl.py
--- a/src/databricks/labs/ucx/workspace_access/tacl.py	(revision 45bae4a880095ce35719cc01c27c05f1786ac711)
+++ b/src/databricks/labs/ucx/workspace_access/tacl.py	(date 1698304643087)
@@ -1,5 +1,5 @@
+import collections
 import dataclasses
-import functools
 import json
 from collections.abc import Callable, Iterator
 from functools import partial
@@ -21,18 +21,37 @@
         self._sql_backend = sql_backend
 
     def get_crawler_tasks(self) -> Iterator[Callable[..., Permissions | None]]:
-        def inner(grant: Grant) -> Permissions:
-            object_type, object_key = grant.this_type_and_key()
-            return Permissions(object_type=object_type, object_id=object_key, raw=json.dumps(dataclasses.asdict(grant)))
-
+        folded = collections.defaultdict(set)
         for grant in self._grants_crawler.snapshot():
-            yield functools.partial(inner, grant)
+            folded[(grant.principal, grant.this_type_and_key())].add(grant.action_type)
+        for (principal, (object_type, key)), actions in folded.items():
+            yield lambda: Permissions(
+                object_type=object_type,
+                object_id=f'{principal}/{key}',
+                raw=json.dumps(sorted(actions)))
 
     def object_types(self) -> set[str]:
         return {"TABLE", "DATABASE", "VIEW", "CATALOG", "ANONYMOUS FUNCTION", "ANY FILE"}
 
+    def _from_reduced(self, item: Permissions):
+        actions = ', '.join(json.loads(item.raw))
+        principal, object_id = item.object_id
+        raw = {'principal': principal, 'action_type': actions}
+        match item.object_type:
+            case 'TABLE':
+                return raw | {'table': object_id}
+            case 'VIEW':
+                return raw | {'view': object_id}
+            case 'DATABASE':
+                return raw | {'database': object_id}
+            case 'ANONYMOUS FUNCTION':
+                return raw | {'anonymous_function': object_id}
+            case 'CATALOG':
+                return raw | {'catalog': object_id}
+
     def get_apply_task(self, item: Permissions, migration_state: GroupMigrationState, destination: Destination):
-        grant = Grant(**json.loads(item.raw))
+        raw = self._from_reduced(item)
+        grant = Grant(**raw)
         target_principal = migration_state.get_target_principal(grant.principal, destination)
         if target_principal is None:
             # this is a grant for user, service principal, or irrelevant group

Maybe I am wrong here but IMO this implementation proposal does not make sense for several reasons:

  • The raw permissions is incomplete. It only contains actions. This may introduce a lot of bugs because the consumers may not be aware that the permissions object is incomplete. It will also break a lot of existing tests.
  • object id will contain both the object and principal. Why an object id should ever contain information about a principal? it does not fit logically. It will also break a lot of existing tests.
  • The permissions is reconstructed wrongly. A lot of fields will be missing in the raw field. Again this will break a lot of existing tests.

IMO the crawler should return complete permissions to avoid issues with consumers of the method that may not be aware of the limitation that the raw object is incomplete. A better approach would be to create a complete permissions by creating a Grant based on the information at hand. I pushed an updated version for doing exactly that. It reuses the proposed code partially but do this on the crawler side, not applier.

@mwojtyczka mwojtyczka temporarily deployed to account-admin October 28, 2023 10:59 — with GitHub Actions Inactive
@nfx nfx changed the title Mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together Breaking change: Mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together Oct 28, 2023
@nfx nfx merged commit 3c1b5a2 into main Oct 28, 2023
@nfx nfx deleted the fix/fold_grants branch October 28, 2023 15:08
nfx added a commit that referenced this pull request Nov 17, 2023
**Breaking changes** (existing installations need to reinstall UCX and re-run assessment jobs)

 * Switched local group migration component to rename groups instead of creating backup groups ([#450](#450)).
 * Mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together ([#512](#512)).

**New features**

 * Added support for the experimental Databricks CLI launcher ([#517](#517)).
 * Added support for external Hive Metastores including AWS Glue ([#400](#400)).
 * Added more views to assessment dashboard ([#474](#474)).
 * Added rate limit for creating backup group to increase stability ([#500](#500)).
 * Added deduplication for mount point list ([#569](#569)).
 * Added documentation to describe interaction with external Hive Metastores ([#473](#473)).
 * Added failure injection for job failure message propagation  ([#591](#591)).
 * Added uniqueness in the new warehouse name to avoid conflicts on installation ([#542](#542)).
 * Added a global init script to collect Hive Metastore lineage ([#513](#513)).
 * Added retry set/update permissions when possible and assess the changes in the workspace ([#519](#519)).
 * Use `~/.ucx/state.json` to store the state of both dashboards and jobs ([#561](#561)).

**Bug fixes**

 * Fixed handling for `OWN` table permissions ([#571](#571)).
 * Fixed handling of keys with and without values. ([#514](#514)).
 * Fixed integration test failures related to concurrent group delete ([#584](#584)).
 * Fixed issue with workspace listing process on None type `object_type` ([#481](#481)).
 * Fixed missing group entitlement migration bug ([#583](#583)).
 * Fixed entitlement application for account-level groups ([#529](#529)).
 * Fixed assessment throwing an error when the owner of an object is empty ([#485](#485)).
 * Fixed installer to migrate between different configuration file versions ([#596](#596)).
 * Fixed cluster policy crawler to be aware of deleted policies ([#486](#486)).
 * Improved error message for not null constraints violated ([#532](#532)).
 * Improved integration test resiliency ([#597](#597), [#594](#594), [#586](#586)).
 * Introduced Safer access to workspace objects' properties. ([#530](#530)).
 * Mitigated permissions loss in Table ACLs by running appliers with single thread ([#518](#518)).
 * Running apply permission task before assessment should display message ([#487](#487)).
 * Split integration tests from blocking the merge queue ([#496](#496)).
 * Support more than one dashboard per step ([#472](#472)).
 * Update databricks-sdk requirement from ~=0.11.0 to ~=0.12.0 ([#505](#505)).
 * Update databricks-sdk requirement from ~=0.12.0 to ~=0.13.0 ([#575](#575)).
@nfx nfx mentioned this pull request Nov 17, 2023
nfx added a commit that referenced this pull request Nov 17, 2023
**Breaking changes** (existing installations need to reinstall UCX and
re-run assessment jobs)

* Switched local group migration component to rename groups instead of
creating backup groups
([#450](#450)).
* Mitigate permissions loss in Table ACLs by folding grants belonging to
the same principal, object id and object type together
([#512](#512)).

**New features**

* Added support for the experimental Databricks CLI launcher
([#517](#517)).
* Added support for external Hive Metastores including AWS Glue
([#400](#400)).
* Added more views to assessment dashboard
([#474](#474)).
* Added rate limit for creating backup group to increase stability
([#500](#500)).
* Added deduplication for mount point list
([#569](#569)).
* Added documentation to describe interaction with external Hive
Metastores ([#473](#473)).
* Added failure injection for job failure message propagation
([#591](#591)).
* Added uniqueness in the new warehouse name to avoid conflicts on
installation ([#542](#542)).
* Added a global init script to collect Hive Metastore lineage
([#513](#513)).
* Added retry set/update permissions when possible and assess the
changes in the workspace
([#519](#519)).
* Use `~/.ucx/state.json` to store the state of both dashboards and jobs
([#561](#561)).

**Bug fixes**

* Fixed handling for `OWN` table permissions
([#571](#571)).
* Fixed handling of keys with and without values.
([#514](#514)).
* Fixed integration test failures related to concurrent group delete
([#584](#584)).
* Fixed issue with workspace listing process on None type `object_type`
([#481](#481)).
* Fixed missing group entitlement migration bug
([#583](#583)).
* Fixed entitlement application for account-level groups
([#529](#529)).
* Fixed assessment throwing an error when the owner of an object is
empty ([#485](#485)).
* Fixed installer to migrate between different configuration file
versions ([#596](#596)).
* Fixed cluster policy crawler to be aware of deleted policies
([#486](#486)).
* Improved error message for not null constraints violated
([#532](#532)).
* Improved integration test resiliency
([#597](#597),
[#594](#594),
[#586](#586)).
* Introduced Safer access to workspace objects' properties.
([#530](#530)).
* Mitigated permissions loss in Table ACLs by running appliers with
single thread ([#518](#518)).
* Running apply permission task before assessment should display message
([#487](#487)).
* Split integration tests from blocking the merge queue
([#496](#496)).
* Support more than one dashboard per step
([#472](#472)).
* Update databricks-sdk requirement from ~=0.11.0 to ~=0.12.0
([#505](#505)).
* Update databricks-sdk requirement from ~=0.12.0 to ~=0.13.0
([#575](#575)).
pritishpai pushed a commit that referenced this pull request Nov 21, 2023
**Breaking changes** (existing installations need to reinstall UCX and
re-run assessment jobs)

* Switched local group migration component to rename groups instead of
creating backup groups
([#450](#450)).
* Mitigate permissions loss in Table ACLs by folding grants belonging to
the same principal, object id and object type together
([#512](#512)).

**New features**

* Added support for the experimental Databricks CLI launcher
([#517](#517)).
* Added support for external Hive Metastores including AWS Glue
([#400](#400)).
* Added more views to assessment dashboard
([#474](#474)).
* Added rate limit for creating backup group to increase stability
([#500](#500)).
* Added deduplication for mount point list
([#569](#569)).
* Added documentation to describe interaction with external Hive
Metastores ([#473](#473)).
* Added failure injection for job failure message propagation
([#591](#591)).
* Added uniqueness in the new warehouse name to avoid conflicts on
installation ([#542](#542)).
* Added a global init script to collect Hive Metastore lineage
([#513](#513)).
* Added retry set/update permissions when possible and assess the
changes in the workspace
([#519](#519)).
* Use `~/.ucx/state.json` to store the state of both dashboards and jobs
([#561](#561)).

**Bug fixes**

* Fixed handling for `OWN` table permissions
([#571](#571)).
* Fixed handling of keys with and without values.
([#514](#514)).
* Fixed integration test failures related to concurrent group delete
([#584](#584)).
* Fixed issue with workspace listing process on None type `object_type`
([#481](#481)).
* Fixed missing group entitlement migration bug
([#583](#583)).
* Fixed entitlement application for account-level groups
([#529](#529)).
* Fixed assessment throwing an error when the owner of an object is
empty ([#485](#485)).
* Fixed installer to migrate between different configuration file
versions ([#596](#596)).
* Fixed cluster policy crawler to be aware of deleted policies
([#486](#486)).
* Improved error message for not null constraints violated
([#532](#532)).
* Improved integration test resiliency
([#597](#597),
[#594](#594),
[#586](#586)).
* Introduced Safer access to workspace objects' properties.
([#530](#530)).
* Mitigated permissions loss in Table ACLs by running appliers with
single thread ([#518](#518)).
* Running apply permission task before assessment should display message
([#487](#487)).
* Split integration tests from blocking the merge queue
([#496](#496)).
* Support more than one dashboard per step
([#472](#472)).
* Update databricks-sdk requirement from ~=0.11.0 to ~=0.12.0
([#505](#505)).
* Update databricks-sdk requirement from ~=0.12.0 to ~=0.13.0
([#575](#575)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants