fix lookup tables #291

eschultink · 2023-03-16T05:19:38Z

Apologies for the massive refactor 😬, but we were generally using Sanitizer in bulk cases only to get its "pseudonymize()" methods. So I split out a Pseudonymizer interface/implementation, and made it much clearer that the Sanitizer is really for REST APIs.

NOTE: take your time; I don't think we need to get this into v0.4.14, which I want out by EoW.

Fixes

fixes the lookup table problem by supporting additional transforms, each outputting to its own bucket

Other ideas:

StorageHandler is really StorageObjectSanitizer; that name would be clearer, as it implements redaction, pseudonymization, etc. But the arch differs from RESTApiSanitizer: rules are a method parameter to StorageHandler atm, but are a property of RESTApiSanitizer instances. Which is the better approach?
~~refactor aws-psoxy-bulk to use aws-psoxy-output-bucket module, as currently repeats a lot of that (at risk of having hierarchical TF modules)~~ changed my mind on this, bc it will necessitate a lot of moved stuff in the modules for little value
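
To make the StorageHandler vs RESTApiSanitizer question concrete, here is a minimal sketch of the two styles side by side. All class names are hypothetical, and a rule set is reduced to a plain String-to-String function; the real rule types in the codebase are richer than this.

```java
import java.util.function.UnaryOperator;

// Style A (how StorageHandler is described): rules arrive with every call,
// so one instance can serve many rule sets.
class RulesAsParameterSketch {
    String sanitize(String input, UnaryOperator<String> rules) {
        return rules.apply(input);
    }
}

// Style B (how RESTApiSanitizer is described): rules are fixed at construction,
// so each instance is bound to exactly one rule set.
class RulesAsPropertySketch {
    private final UnaryOperator<String> rules;

    RulesAsPropertySketch(UnaryOperator<String> rules) {
        this.rules = rules;
    }

    String sanitize(String input) {
        return rules.apply(input);
    }
}

class RulesPlacementDemo {
    public static void main(String[] args) {
        // Two illustrative "rule sets" (assumptions, not real psoxy rules)
        UnaryOperator<String> redactDigits = s -> s.replaceAll("[0-9]", "#");
        UnaryOperator<String> upperCase = s -> s.toUpperCase();

        RulesAsParameterSketch shared = new RulesAsParameterSketch();
        System.out.println(shared.sanitize("id-42", redactDigits));
        System.out.println(shared.sanitize("id-42", upperCase));

        RulesAsPropertySketch bound = new RulesAsPropertySketch(redactDigits);
        System.out.println(bound.sanitize("id-42"));
    }
}
```

Style A seems the more natural fit when one invocation has to apply several rule sets to the same object (as with additional transforms); Style B keeps call sites simpler when an instance only ever serves one source.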

Change implications

dependencies added/changed? no

eschultink · 2023-03-16T05:21:02Z

infra/modular-examples/aws-google-workspace/main.tf

@@ -325,40 +323,38 @@ module "psoxy-bulk-to-worklytics" {
  }, try(each.value.settings_to_provide, {}))
 }

-module "psoxy_lookup_tables_builders" {
+# BEGIN lookup builders
+module "lookup_output" {


instead of a lambda for each lookup table, just an output bucket (with associated IAM policies, settings, etc).

eschultink · 2023-03-16T05:22:56Z

infra/modular-examples/aws-google-workspace/main.tf

+  inputs_to_build_lookups_for = toset(distinct([ for k, v in var.lookup_table_builders : v.input_connector_id ]))
+}
+
+resource "aws_ssm_parameter" "additional_transforms" {


setting this SSM parameter for each "bulk" lambda for which we want to build an extra lookup table. E.g., by default a lambda doesn't have any ADDITIONAL_TRANSFORMS; but if you provide this SSM param, then you change the behavior of the existing lambda
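
A rough sketch of the behavior change described above: the bulk function always applies its default transform and then applies each additional transform, writing each result to that transform's own destination bucket. The class names here are hypothetical, rules are reduced to a String-to-String function, and "writing" is simulated with an in-memory map rather than real S3 calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for one transform: where to write, and which rules to apply
class BulkTransformSketch {
    final String destinationBucketName;
    final UnaryOperator<String> rules;

    BulkTransformSketch(String destinationBucketName, UnaryOperator<String> rules) {
        this.destinationBucketName = destinationBucketName;
        this.rules = rules;
    }
}

class BulkFanOutSketch {
    // Returns a map of destination bucket name -> content "written" there
    static Map<String, String> processAll(String sourceContent,
                                          BulkTransformSketch defaultTransform,
                                          List<BulkTransformSketch> additionalTransforms) {
        List<BulkTransformSketch> all = new ArrayList<>();
        all.add(defaultTransform);
        all.addAll(additionalTransforms); // empty unless ADDITIONAL_TRANSFORMS is configured

        Map<String, String> written = new LinkedHashMap<>();
        for (BulkTransformSketch transform : all) {
            written.put(transform.destinationBucketName, transform.rules.apply(sourceContent));
        }
        return written;
    }
}
```

With an empty list of additional transforms the function behaves exactly as before; supplying one adds a second output without deploying a second lambda.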

eschultink · 2023-03-16T05:28:50Z

java/core/src/main/java/co/worklytics/psoxy/rules/Rules2.java

@@ -21,7 +21,7 @@
 @EqualsAndHashCode
 @JsonPropertyOrder({"allowAllEndpoints", "endpoints", "defaultScopeIdForSource"})
 @JsonInclude(JsonInclude.Include.NON_NULL) //NOTE: despite name, also affects YAML encoding
-public class Rules2 implements RuleSet, Serializable {
+public class Rules2 implements RESTRules {


this is clearer; lets people bind to RESTRules interface instead of Rules2 implementation

Rules2 --> PsoxyRules

jlorper

Few suggestions, no blockers. Liked the split of rules + pseudonymization 👍

jlorper · 2023-03-17T17:11:10Z

infra/modular-examples/aws-msft-365/main.tf

  for_each = var.lookup_table_builders

-  source = "../../modules/aws-psoxy-bulk-existing"
-  # source = "git::https://github.com/worklytics/psoxy//infra/modules/aws-psoxy-bulk-existing?ref=v0.4.13"
+  source = "../../modules/aws-psoxy-output-bucket"


shouldn't this point to source when deployed?

jlorper · 2023-03-17T17:14:52Z

infra/modules/aws-psoxy-output-bucket/main.tf

+
+resource "aws_s3_bucket" "output" {
+  # note: this ends up with a long UTC time-stamp + random number appended to it to form the bucket name
+  bucket_prefix = "psoxy-${var.instance_id}-"


could we use the customer id too (if available)? from our point of view all these buckets look similar.
In any case, with the new export connectors to come, this should not be an issue anymore

So this is the S3 bucket that we import the customer's sanitized data from. We shouldn't be touching it directly; they should copy-paste the value into the "Add Connection" flow.

Security on this is that within our platform, only the specific customer's tenant SA can access it. So even if we did for some reason configure the connection on behalf of the customer and mixed up buckets across customers, it should be a 403 from AWS IAM when the tenant SA tries to assume a role which it can't assume (based on the role's config in the customer's AWS)

jlorper · 2023-03-17T17:17:57Z

java/core/src/main/java/co/worklytics/psoxy/PseudonymizerImpl.java

+    @Override
+    public PseudonymizedIdentity pseudonymize(@NonNull String value) {
+        return pseudonymize((Object)  value);
+    }
+
+    @Override
+    public PseudonymizedIdentity pseudonymize(@NonNull Number value) {
+        return pseudonymize((Object) value);
+    }


really needed in the interface? they just call the Object one. That could deal internally with specialized versions per object class.

yeah, maybe can be eliminated. The Object one was previously private
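
One way the suggestion could look: the interface exposes only the Object-accepting method, and the implementation handles the supported types internally. This is a sketch under assumptions, not the project's code: the names are hypothetical, the return type is a plain String rather than PseudonymizedIdentity, and salted SHA-256 is just a placeholder for whatever hashing the real implementation uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical minimized interface: a single entry point
interface PseudonymizerSketch {
    String pseudonymize(Object value);
}

class Sha256PseudonymizerSketch implements PseudonymizerSketch {
    private final String salt;

    Sha256PseudonymizerSketch(String salt) {
        this.salt = salt;
    }

    @Override
    public String pseudonymize(Object value) {
        if (value == null) {
            return null;
        }
        // Type handling lives in the implementation, not in interface overloads
        if (!(value instanceof String) && !(value instanceof Number)) {
            throw new IllegalArgumentException("Unsupported type: " + value.getClass().getName());
        }
        String canonical = value.toString();
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest((canonical + salt).getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform
            throw new IllegalStateException(e);
        }
    }
}
```

Callers lose compile-time checking of the argument type, which is the trade-off for the smaller interface; the runtime check above is one way to compensate.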

jlorper · 2023-03-17T17:26:52Z

java/core/src/main/java/co/worklytics/psoxy/rules/Rules2.java

@@ -21,7 +21,7 @@
 @EqualsAndHashCode
 @JsonPropertyOrder({"allowAllEndpoints", "endpoints", "defaultScopeIdForSource"})
 @JsonInclude(JsonInclude.Include.NON_NULL) //NOTE: despite name, also affects YAML encoding
-public class Rules2 implements RuleSet, Serializable {
+public class Rules2 implements RESTRules {


Rules2 --> PsoxyRules

jlorper · 2023-03-17T17:27:29Z

java/core/src/main/java/co/worklytics/psoxy/rules/RulesUtils.java

+                throw new RuntimeException("Failed to parse ADDITIONAL_TRANSFORMS from config", e);
+            }
+        } else {
+            return new ArrayList<>();


Collections.emptyList() - immutable, guess you don't need it mutable
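
A small sketch of what the suggestion implies for callers. The parsing here is purely illustrative: it assumes a comma-separated string, which is not necessarily how ADDITIONAL_TRANSFORMS is actually encoded, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class AdditionalTransformsSketch {
    // Hypothetical parser: returns an immutable list in every branch
    static List<String> parse(String configValue) {
        if (configValue == null || configValue.trim().isEmpty()) {
            // Shared immutable instance; no allocation, and add() will throw
            return Collections.emptyList();
        }
        List<String> result = new ArrayList<>();
        for (String part : configValue.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        // Keep the non-empty branch immutable too, so both branches behave alike
        return Collections.unmodifiableList(result);
    }
}
```

The catch with returning `Collections.emptyList()` from only one branch is that callers see a mutable list in one case and an immutable one in the other, so wrapping the populated list keeps the contract consistent.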

jlorper · 2023-03-17T17:30:31Z

java/core/src/main/java/co/worklytics/psoxy/storage/BulkDataSanitizer.java

+                    BulkDataRules bulkDataRules,
+                    Pseudonymizer pseudonymizer) throws IOException;


I kind of liked the Sanitizer interface being (rules + pseudonymizer), but ok

jlorper · 2023-03-17T17:40:11Z

java/impl/gcp/src/main/java/co/worklytics/psoxy/GCSFileEvent.java

+                System.out.println("Writing to: " + storageEventResponse.getDestinationBucketName() + "/" + storageEventResponse.getDestinationObjectPath());
+
+                storage.createFrom(BlobInfo.newBuilder(BlobId.of(storageEventResponse.getDestinationBucketName(), storageEventResponse.getDestinationObjectPath()))
+                    .setContentType(blobInfo.getContentType())
+                    .build(), processedStream);
+
+                System.out.println("Successfully pseudonymized " + importBucket + "/"
+                    + sourceName + " and uploaded to " + storageEventResponse.getDestinationBucketName() + "/" + storageEventResponse.getDestinationObjectPath());


@Log calls instead of sys out?
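
For reference, a sketch of what the logger-based version might look like. Lombok's `@Log` generates a static `java.util.logging.Logger` field named `log`; the sketch declares that field by hand so it compiles without Lombok. The class and method names are hypothetical, and the upload itself is elided.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class StorageEventLoggingSketch {
    // Equivalent to the field Lombok's @Log would generate
    private static final Logger log =
        Logger.getLogger(StorageEventLoggingSketch.class.getName());

    static String objectUri(String bucket, String path) {
        return bucket + "/" + path;
    }

    static void reportUpload(String importBucket, String sourceName,
                             String destinationBucket, String destinationPath) {
        String destination = objectUri(destinationBucket, destinationPath);
        log.info("Writing to: " + destination);

        // ... the actual upload would happen here ...

        // Parameterized form: the message is only formatted if INFO is enabled
        log.log(Level.INFO, "Successfully pseudonymized {0} and uploaded to {1}",
            new Object[] { objectUri(importBucket, sourceName), destination });
    }

    public static void main(String[] args) {
        reportUpload("import-bucket", "data.csv", "output-bucket", "data.csv");
    }
}
```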

jlorper · 2023-03-17T17:41:18Z

java/impl/aws/src/main/java/co/worklytics/psoxy/S3Handler.java

+            process(importBucket, sourceKey, transform.getDestinationBucketName(), transform.getRules());
+        }
+
+        return "Processed!";


"200-OK", not sure where this is read back (probably nowhere) but if so, easier to process

eschultink added 7 commits March 15, 2023 14:12

big refactor to split pseudonymization responsibilities out of Saniti…

d2fef59

…zer, better decouple REST/bulk cases in Java, extend AWS to deal with multiple files

add test

f041b55

refactor config properties; consolidate and unify across GCS + AWS cases

11a675b

move config parsing for bulk case up to RulesUtils

ce9eb93

support ADDITIONAL_TRANSFORMS in GCS case

d5e2c70

infra to support corrected lookup tables

486b860

add move to CHANGELOG

c6966e4

eschultink commented Mar 16, 2023

instanceof check unneeded

ad719c1

eschultink commented Mar 16, 2023

eschultink requested review from aperez-worklytics and jlorper March 16, 2023 05:33

more refactoring to improve naming

9785e3d

eschultink mentioned this pull request Mar 16, 2023

shuffle bulk data #292

Merged

eschultink changed the title ~~S144 fix lookup tables alt~~ fix lookup tables Mar 16, 2023

eschultink and others added 3 commits March 16, 2023 14:37

fix compile

a5bc790

fix compile problems

2741032

Merge branch 'rc-v0.4.14' into s144-fix-lookup-tables-alt

7ba5c93

jlorper approved these changes Mar 17, 2023

Merge branch 'rc-v0.4.14' into s144-fix-lookup-tables-alt

f6800c3

eschultink changed the base branch from rc-v0.4.14 to rc-v0.4.15 March 17, 2023 18:58

eschultink and others added 5 commits March 17, 2023 12:00

Merge branch 'rc-v0.4.15' into s144-fix-lookup-tables-alt

8ece4a8

fix compile

b6c0b3c

log instead of system.out

b34a094

minimize Psueonymizer interface

c3c6a20

add SHA1 of lookup table rules, to force restart if lookup rules change

befc7e7

eschultink merged commit e9982fc into rc-v0.4.15 Mar 17, 2023

eschultink deleted the s144-fix-lookup-tables-alt branch March 17, 2023 23:20
