Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a single projection for multiple column masks #14420

Closed
wants to merge 2 commits into from

Conversation

weiatwork
Copy link
Contributor

The current column masking behavior is problematic, if there are multiple column masks and some column is being masked itself as well as used as a condition for masking other columns. It's due to the masks being stacked with multiple projections. This creates a "chained masking" behavior, which is hardly anything users expect. The behavior is not predictable because it depends on the order of columns defined in the table.

The problem can be illustrated with examples below. Here's the DDL for table t1.

CREATE TABLE t1 (a int, b int);

Here's the original table content for t1

  a   |  b
------+------
    1 |    2
    3 |    4

Now let's use below file-based access control policy to do column masking for both column a and b.

{
    "tables": [
        {
            "catalog": "hive",
            "schema": "default",
            "table": "t1",
            "privileges": ["SELECT"],
            "columns": [
                {
                    "name": "a",
                    "mask": "IF (a <> 1, a)"
                },
                {
                    "name": "b",
                    "mask": "IF (a <> 1 OR a = 1, b)"
                }
            ]
        }
    ]
}

trino:default> select * from t1;
  a   |  b
------+------
 NULL | NULL
    3 |    4

The above result is wrong because the expected result is:

  a   |  b
------+------
 NULL |    2
    3 |    4

However, with a symmetric policy as below, we get expected result:

{
    "tables": [
        {
            "catalog": "hive",
            "schema": "default",
            "table": "t1",
            "privileges": ["SELECT"],
            "columns": [
                {
                    "name": "a",
                    "mask": "IF (b <> 2 OR b = 2, a)"
                },
                {
                    "name": "b",
                    "mask": "IF (b <> 2, b)"
                }
            ]
        }
    ]
}

trino:default> select * from t1;
  a   |  b
------+------
    1 | NULL
    3 |    4

The only reason for the latter example to (coincidentally) return the correct result is because column a is defined before column b.

Description

Non-technical explanation

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text:
Fixed possible incorrect result issue when multiple column masks exist.

# Section
* Fix some things. ({issue}`issuenumber`)

@kokosing
Copy link
Member

kokosing commented Oct 2, 2022

The same issue may happen when you define masks together with filters. It can still happen that one thing influence the other.

In order to avoid the problem you need to define the user for masks and filters and then these expression will be executed with privileges of such user so mask and filler won't affect each other.

@findepi
Copy link
Member

findepi commented Oct 3, 2022

The mask expression should take as input table source columns. The masks shouldn't "stack".
Same for filter - it should take unmasked columns as the input.

@kasiafi
Copy link
Member

kasiafi commented Oct 3, 2022

If I understand correctly what @kokosing says in #12262 (comment), it is expected behavior that masks "stack".

@findepi
Copy link
Member

findepi commented Oct 3, 2022

If I understand correctly what @kokosing says in #12262 (comment), it is expected behavior that masks "stack".

Thanks for the pointer.

It's not expected to me, so i am challenging that :)

@weiatwork
Copy link
Contributor Author

My opinion is consistent to what @findepi described above. We should only use the unmasked/original columns as conditionals for other column masks.

@weiatwork
Copy link
Contributor Author

The same issue may happen when you define masks together with filters. It can still happen that one thing influence the other.

In order to avoid the problem you need to define the user for masks and filters and then these expression will be executed with privileges of such user so mask and filler won't affect each other.

@kokosing To get to the bottom of such issues, maybe we can approximately categorize the use cases as below:
(1) multiple row filters
(2) multiple column masks
(3) combined use of filters and masks
For (1) I think we are fine, since they will be conjuncted and there shouldn't be any side effect. For (2) it is what this PR is about. For (3) do you mind to provide a simple example?

@martint
Copy link
Member

martint commented Oct 3, 2022

This is definitely a bug. The masks shouldn't stack. Any references they make to columns should be based on the original un-masked value.

@martint
Copy link
Member

martint commented Oct 3, 2022

BTW, the only situations where they might "stack" with respect to other masks is when:

  1. The table is actually a view and it references other tables/columns with their own masks.
  2. The mask expression contains a reference to other tables (via a subquery) and those tables contain their own masks.

@@ -209,7 +209,8 @@ public void testMultipleMasksOnSameColumn()
USER,
new ViewExpression(USER, Optional.empty(), Optional.empty(), "custkey * 2"));

assertThat(assertions.query("SELECT custkey FROM orders WHERE orderkey = 1")).matches("VALUES BIGINT '-740'");
// When there are multiple masks on the same column, the latter one overrides the previous ones
assertThat(assertions.query("SELECT custkey FROM orders WHERE orderkey = 1")).matches("VALUES BIGINT '740'");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change in semantics. Looking at this, I realize we never settled the question of why we'd ever have multiple masks on a single column: #11654 (comment). That still doesn't make a lot of sense to me. cc @kokosing @ksobolew

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have multiple system access controls and each of them can return mask for a given column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are the masks expected to be applied in such a case? There's no guarantee about the order in which the system access controls are invoked.

Also, you're talking about multiple system access controls, but the SystemAccessControl interface allows a single access control interface to return multiple masks for a given column. That does not make much sense -- if only one mask is to be applied, the interface should return a single mask.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple situations in which more than one mask may apply to a single column. The simplest example is when there are two roles enabled and each of them applies a different column mask on a particular column. This immediately raises the question of ordering, of course. While it looks like Trino does not make such guarantees explicitly, the ordering is in practice deterministic as long as the access controls return them in a deterministic order. The masks and filters are returned and processed as Lists, and this implies ordering. The engine has no trouble dealing with this and applies the masks iteratively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That still doesn't make a lot of sense to me. Masks are defined in terms of the values in the column. For example, if column is a string that contains only numbers, one might implement a mask function by parsing the number, doing some math (a hash, etc) on it and converting it back to a string. If the the engine applies the function on anything other than the original values (e.g., because another mask got injected in between), the mask will fail. This makes it hard for someone writing a mask function to reason about, as it requires non-local understanding of how every role, user and mask relate to each other.

If multiple masks are possible, it should be up to the connector to decide which one is the most appropriate and go with it. In the case of multiple system access controls returning masks, there are two options:

  • Pick the first one. This can produce non-deterministic results depending on which order the access controls get called. It could change across restarts of the server
  • Fail with an error indicating there's ambiguity.

@kasiafi
Copy link
Member

kasiafi commented Oct 5, 2022

The behavior is not predictable because it depends on the order of columns defined in the table.

@weiatwork from the earlier discussion #12262 (comment) I understood that this is not the case. The behavior depends on the order of masks applied, but it has nothing to do with the order of column definitions.

If I'm wrong, then this is yet another issue which should be addressed independently from the one here.

@weiatwork
Copy link
Contributor Author

@kasiafi As demonstrated by the two examples above, with the first AC policy the query retuned wrong result because column a was masked first, and then column b was masked using a's masked value. We can see in this case the masking of b (arguably incorrectly) depends on a. Even if we switch the order of the two masks in the AC policy, it still returns the same wrong result, which means the dependence was not based on mask definitions, but on schema definition (since b comes after a in DDL):

                {
                    "name": "b",
                    "mask": "IF (a <> 1 OR a = 1, b)"
                },
                {
                    "name": "a",
                    "mask": "IF (a <> 1, a)"
                }

For the second AC policy in original examples, again column a was masked first, by testing b's values (but this time, b's original unmasked values). And then column b itself was masked, using the value of itself. We got correct result (luckily), mainly because b's masking evaluation happened after a's, due to schema definition.

IMHO, column masking shouldn't have dependency on masking definition in the policy, nor on the schema definition. Each column mask should be independently applied, and if there is a conditional relying on other columns, the unmasked/original values of other columns must be used. That's the main motivation of this PR.

As to the discussion on multiple masks on the same column, it's a separate issue that we can follow up.

@weiatwork
Copy link
Contributor Author

Just to follow up, do we have a conclusion what to do with multiple masks (on the same column or from a dependent column)?

@atanasenko
Copy link
Member

@weiatwork haven't seen this pr and started implementing the same thing as well in #15443, but there I retained compatibility with multiple masks per field.

@weiatwork
Copy link
Contributor Author

Thanks @atanasenko for the heads-up. I'm not sure if our changes are the same by looking at changes in the unit test. Will your change solve the issue I mentioned in this ticket in the beginning?

@atanasenko
Copy link
Member

atanasenko commented Dec 20, 2022

Thanks @atanasenko for the heads-up. I'm not sure if our changes are the same by looking at changes in the unit test. Will your change solve the issue I mentioned in this ticket in the beginning?

@weiatwork I started with exactly the same changes as you, but then modified the code to support multiple masks per field. In this case additional masks will get added to the second projection.
Basically, in my pr masks are spread over projections as follows:

There is a mask for n_regionkey that shows it only for nations starting with A, and there are two masks for nation, first only shows it if it belongs to region=0, second lowercases the name.

Output      | Proj2         | Proj1                             | Scan
------------+---------------+-----------------------------------+---------------
n_nationkey |               |                                   | n_nationkey
n_name      | lower(n_name) | IF(n_regionkey = 0, n_name)       | n_name
n_regionkey |               | IF(n_name like 'A%', n_regionkey) | n_regionkey
n_comment   |               |                                   | n_comment

@atanasenko
Copy link
Member

@martint what would be the right approach with multiple column masks per field? I think in order to not break some users that depend on it, we should retain the compatibility.

@martint
Copy link
Member

martint commented Dec 20, 2022

In what cases would someone be relying on multiple masks per field? I don't think any connectors in Trino support or do that currently. It should be the responsibility of the connector to compose masks in that manner if it knows how to deal with that situation and present a single mask expression to the engine (in general this doesn't make sense, though, as a mask may transform the values in manners that create invalid input for downstream mask expressions)

@atanasenko
Copy link
Member

Masks can be provided by otherwise unrelated system and connector access controls and then are combined into a single list (per AccessControlManager#getColumnMasks()).

@martint
Copy link
Member

martint commented Dec 20, 2022

Masks can be provided by otherwise unrelated system and connector access controls and then are combined into a single list (per AccessControlManager#getColumnMasks()).

That use case is unsupported. There's no guarantee about the order in which masks will be applied and whether the functions are even compatible with each other. For instance, one function might want to parse a string that contains numbers and another function may just replace numbers with Xs. Depending on the order in which the functions are applied, the results may vary or queries may fail altogether.

@atanasenko
Copy link
Member

So all-in-all we need to decide what to do about this. If we want to explicitly say that we don't support multiple masks per field, we need to throw an error. Otherwise, we can make best-effort to try and stack them (and let query fail if there are any type incompatibilities). We definitely should not silently ignore them.

@martint
Copy link
Member

martint commented Dec 21, 2022

Yeah, multiple masks should result in an error. Returning arbitrary results that depend on the order of application, or potentially failing if the given order makes the masks incompatible would lead to surprising behaviors, hard to debug results and confused users.

@atanasenko
Copy link
Member

@weiatwork is the failing approach good with you? Can you get that in?
I'd close my pr in favour of this one as it's way better documented.

@weiatwork
Copy link
Contributor Author

Sorry for the delay. Sure, that sounds good to me. Just want to clarify, do we want to only disallow multiple masks on the same column when the masks are from both system AC and connector AC? Or do we also want to change AccessControl's getColumnMasks interface and make it singular getColumnMask everywhere, i.e. even multiple masks from a single system or connector AC is not supported?

@atanasenko
Copy link
Member

@weiatwork I think it should be enough to error out when we detect there are multiple masks in the combined list. Changing SPI is problematic, and maybe we'll have a better approach at mutliple masks per column in the future, like validating expressions that they yield the same type, for instance.

@weiatwork weiatwork force-pushed the oneprojection branch 2 times, most recently from 8cf5b29 to 66e5d4b Compare January 5, 2023 19:27
@martint
Copy link
Member

martint commented Jan 9, 2023

We should (separately), change the SPI to expect a single mask. We may be able to do this in a backward compatible way by adding the new desired method and falling back, temporarily, to an implementation that composes the masks from the existing methods. We'd mark the existing method as deprecated and remove it in a future version. Plugins would be responsible for composing the masks if they need to and are able to.

@weiatwork
Copy link
Contributor Author

Thanks Martin. That makes sense to me. Do you mind to get this PR merged then? I don't think the only failing check is functionally related (TestHiveFaultTolerantExecutionTest). It looks like a test resource issue relevant to JVM. Or could you trigger a re-run of the tests?

@atanasenko
Copy link
Member

@martint can we get that merged?

Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last minor comment.

Current approach of stacking multiple projections has correctness issue:
It introduces chained dependency on derived/masked columns instead of the original columns.

Collapsing multiple projections into one can ensure the consistent dependency on the original
columns, as well as make the plan more efficient.
@@ -212,7 +212,8 @@ public void testMultipleMasksOnSameColumn()
USER,
new ViewExpression(USER, Optional.empty(), Optional.empty(), "custkey * 2"));

assertThat(assertions.query("SELECT custkey FROM orders WHERE orderkey = 1")).matches("VALUES BIGINT '-740'");
// When there are multiple masks on the same column, the latter one overrides the previous ones
assertThat(assertions.query("SELECT custkey FROM orders WHERE orderkey = 1")).matches("VALUES BIGINT '740'");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this now fail with .hasMessageMatching("Multiple masks on a single column is not supported: column");?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That only happens after the second commit. Perhaps they should be reordered, so that the prohibition on multiple masks goes first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTOH, it does pass in the CI apparently

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see - the test is using TestingAccessControlManager, which replaces the actual AccessControlManager and does not check for multiple masks.

@@ -209,6 +209,7 @@ public void testDenyTableFunctionCatalogAccessControl()
}
}

// TODO: need to properly handle the ordering of multiple masks as they are not allowed currently
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the TODO and test name is incorrect. I would simply remove TODO and name test to testMultipleColumnMasks

@weiatwork
Copy link
Contributor Author

Subsumed by #15680 thus closing

@weiatwork weiatwork closed this Jan 20, 2023
@weiatwork weiatwork deleted the oneprojection branch January 20, 2023 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

7 participants