-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create a single projection for multiple column masks #14420
Conversation
The same issue may happen when you define masks together with filters. It can still happen that one thing influence the other. In order to avoid the problem you need to define the user for masks and filters and then these expression will be executed with privileges of such user so mask and filler won't affect each other. |
The mask expression should take as input table source columns. The masks shouldn't "stack". |
If I understand correctly what @kokosing says in #12262 (comment), it is expected behavior that masks "stack". |
Thanks for the pointer. It's not expected to me, so i am challenging that :) |
My opinion is consistent to what @findepi described above. We should only use the unmasked/original columns as conditionals for other column masks. |
@kokosing To get to the bottom of such issues, maybe we can approximately categorize the use cases as below: |
This is definitely a bug. The masks shouldn't stack. Any references they make to columns should be based on the original un-masked value. |
BTW, the only situations where they might "stack" with respect to other masks is when:
|
@@ -209,7 +209,8 @@ public void testMultipleMasksOnSameColumn() | |||
USER, | |||
new ViewExpression(USER, Optional.empty(), Optional.empty(), "custkey * 2")); | |||
|
|||
assertThat(assertions.query("SELECT custkey FROM orders WHERE orderkey = 1")).matches("VALUES BIGINT '-740'"); | |||
// When there are multiple masks on the same column, the latter one overrides the previous ones | |||
assertThat(assertions.query("SELECT custkey FROM orders WHERE orderkey = 1")).matches("VALUES BIGINT '740'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change in semantics. Looking at this, I realize we never settled the question of why we'd ever have multiple masks on a single column: #11654 (comment). That still doesn't make a lot of sense to me. cc @kokosing @ksobolew
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have multiple system access controls and each of them can return mask for a given column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are the masks expected to be applied in such a case? There's no guarantee about the order in which the system access controls are invoked.
Also, you're talking about multiple system access controls, but the SystemAccessControl interface allows a single access control interface to return multiple masks for a given column. That does not make much sense -- if only one mask is to be applied, the interface should return a single mask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple situations in which more than one mask may apply to a single column. The simplest example is when there are two roles enabled and each of them applies a different column mask on a particular column. This immediately raises the question of ordering, of course. While it looks like Trino does not make such guarantees explicitly, the ordering is in practice deterministic as long as the access controls return them in a deterministic order. The masks and filters are returned and processed as List
s, and this implies ordering. The engine has no trouble dealing with this and applies the masks iteratively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That still doesn't make a lot of sense to me. Masks are defined in terms of the values in the column. For example, if column is a string that contains only numbers, one might implement a mask function by parsing the number, doing some math (a hash, etc) on it and converting it back to a string. If the the engine applies the function on anything other than the original values (e.g., because another mask got injected in between), the mask will fail. This makes it hard for someone writing a mask function to reason about, as it requires non-local understanding of how every role, user and mask relate to each other.
If multiple masks are possible, it should be up to the connector to decide which one is the most appropriate and go with it. In the case of multiple system access controls returning masks, there are two options:
- Pick the first one. This can produce non-deterministic results depending on which order the access controls get called. It could change across restarts of the server
- Fail with an error indicating there's ambiguity.
@weiatwork from the earlier discussion #12262 (comment) I understood that this is not the case. The behavior depends on the order of masks applied, but it has nothing to do with the order of column definitions. If I'm wrong, then this is yet another issue which should be addressed independently from the one here. |
@kasiafi As demonstrated by the two examples above, with the first AC policy the query retuned wrong result because column
For the second AC policy in original examples, again column IMHO, column masking shouldn't have dependency on masking definition in the policy, nor on the schema definition. Each column mask should be independently applied, and if there is a conditional relying on other columns, the unmasked/original values of other columns must be used. That's the main motivation of this PR. As to the discussion on multiple masks on the same column, it's a separate issue that we can follow up. |
Just to follow up, do we have a conclusion what to do with multiple masks (on the same column or from a dependent column)? |
@weiatwork haven't seen this pr and started implementing the same thing as well in #15443, but there I retained compatibility with multiple masks per field. |
Thanks @atanasenko for the heads-up. I'm not sure if our changes are the same by looking at changes in the unit test. Will your change solve the issue I mentioned in this ticket in the beginning? |
@weiatwork I started with exactly the same changes as you, but then modified the code to support multiple masks per field. In this case additional masks will get added to the second projection. There is a mask for n_regionkey that shows it only for nations starting with A, and there are two masks for nation, first only shows it if it belongs to region=0, second lowercases the name.
|
@martint what would be the right approach with multiple column masks per field? I think in order to not break some users that depend on it, we should retain the compatibility. |
In what cases would someone be relying on multiple masks per field? I don't think any connectors in Trino support or do that currently. It should be the responsibility of the connector to compose masks in that manner if it knows how to deal with that situation and present a single mask expression to the engine (in general this doesn't make sense, though, as a mask may transform the values in manners that create invalid input for downstream mask expressions) |
Masks can be provided by otherwise unrelated system and connector access controls and then are combined into a single list (per |
That use case is unsupported. There's no guarantee about the order in which masks will be applied and whether the functions are even compatible with each other. For instance, one function might want to parse a string that contains numbers and another function may just replace numbers with Xs. Depending on the order in which the functions are applied, the results may vary or queries may fail altogether. |
So all-in-all we need to decide what to do about this. If we want to explicitly say that we don't support multiple masks per field, we need to throw an error. Otherwise, we can make best-effort to try and stack them (and let query fail if there are any type incompatibilities). We definitely should not silently ignore them. |
Yeah, multiple masks should result in an error. Returning arbitrary results that depend on the order of application, or potentially failing if the given order makes the masks incompatible would lead to surprising behaviors, hard to debug results and confused users. |
@weiatwork is the failing approach good with you? Can you get that in? |
Sorry for the delay. Sure, that sounds good to me. Just want to clarify, do we want to only disallow multiple masks on the same column when the masks are from both system AC and connector AC? Or do we also want to change AccessControl's |
@weiatwork I think it should be enough to error out when we detect there are multiple masks in the combined list. Changing SPI is problematic, and maybe we'll have a better approach at mutliple masks per column in the future, like validating expressions that they yield the same type, for instance. |
8cf5b29
to
66e5d4b
Compare
We should (separately), change the SPI to expect a single mask. We may be able to do this in a backward compatible way by adding the new desired method and falling back, temporarily, to an implementation that composes the masks from the existing methods. We'd mark the existing method as deprecated and remove it in a future version. Plugins would be responsible for composing the masks if they need to and are able to. |
Thanks Martin. That makes sense to me. Do you mind to get this PR merged then? I don't think the only failing check is functionally related (TestHiveFaultTolerantExecutionTest). It looks like a test resource issue relevant to JVM. Or could you trigger a re-run of the tests? |
@martint can we get that merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last minor comment.
core/trino-main/src/test/java/io/trino/security/TestAccessControlManager.java
Outdated
Show resolved
Hide resolved
Current approach of stacking multiple projections has correctness issue: It introduces chained dependency on derived/masked columns instead of the original columns. Collapsing multiple projections into one can ensure the consistent dependency on the original columns, as well as make the plan more efficient.
66e5d4b
to
ec2c794
Compare
@@ -212,7 +212,8 @@ public void testMultipleMasksOnSameColumn() | |||
USER, | |||
new ViewExpression(USER, Optional.empty(), Optional.empty(), "custkey * 2")); | |||
|
|||
assertThat(assertions.query("SELECT custkey FROM orders WHERE orderkey = 1")).matches("VALUES BIGINT '-740'"); | |||
// When there are multiple masks on the same column, the latter one overrides the previous ones | |||
assertThat(assertions.query("SELECT custkey FROM orders WHERE orderkey = 1")).matches("VALUES BIGINT '740'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this now fail with .hasMessageMatching("Multiple masks on a single column is not supported: column");
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That only happens after the second commit. Perhaps they should be reordered, so that the prohibition on multiple masks goes first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTOH, it does pass in the CI apparently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see - the test is using TestingAccessControlManager
, which replaces the actual AccessControlManager
and does not check for multiple masks.
@@ -209,6 +209,7 @@ public void testDenyTableFunctionCatalogAccessControl() | |||
} | |||
} | |||
|
|||
// TODO: need to properly handle the ordering of multiple masks as they are not allowed currently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the TODO and test name is incorrect. I would simply remove TODO and name test to testMultipleColumnMasks
Subsumed by #15680 thus closing |
The current column masking behavior is problematic, if there are multiple column masks and some column is being masked itself as well as used as a condition for masking other columns. It's due to the masks being stacked with multiple projections. This creates a "chained masking" behavior, which is hardly anything users expect. The behavior is not predictable because it depends on the order of columns defined in the table.
The problem can be illustrated with examples below. Here's the DDL for table
t1
.Here's the original table content for
t1
Now let's use below file-based access control policy to do column masking for both column
a
andb
.The above result is wrong because the expected result is:
However, with a symmetric policy as below, we get expected result:
The only reason for the latter example to (coincidentally) return the correct result is because column
a
is defined before columnb
.Description
Non-technical explanation
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text:
Fixed possible incorrect result issue when multiple column masks exist.