Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix lookup tables #291

Merged
merged 18 commits into from
Mar 17, 2023
Merged

fix lookup tables #291

merged 18 commits into from
Mar 17, 2023

Conversation

eschultink
Copy link
Member

@eschultink eschultink commented Mar 16, 2023

apologies for the massive refactor 😬 ; but generally we were using Sanitizer in bulk cases, only to get its "pseudonymize()" methods. So split out a Pseudonymizer interface/implementation, and made it much clear that the Sanitizer is really for REST APIs.

NOTE: take your time, don't think need to get this into v0.4.14, which I want out by EoW.

Fixes

  • fixes the lookup table problem, but supporting additional transforms each outputting to their own bucket

Other ideas:

  • StorageHandler is really StorageObjectSanitizer, to be clearer - as it it implements redaction, pseudonymization, etc. but arch differs from RESTApiSanitizer; rules are a method parameter to StorageHandler atm, but are a property of RESTApiSanitizer instances. Which is better approach?
  • refactor aws-psoxy-bulk to use aws-psoxy-output-bucket module, as currently repeats a lot of that (at risk of having hierarchical TF modules) changed my mind this, bc will necessitate a lot of moved stuff in the modules for little value

Change implications

  • dependencies added/changed? no

@@ -325,40 +323,38 @@ module "psoxy-bulk-to-worklytics" {
}, try(each.value.settings_to_provide, {}))
}

module "psoxy_lookup_tables_builders" {
# BEGIN lookup builders
module "lookup_output" {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of a lambda for each lookup table, just an output bucket (with associated IAM policies, settings, etc).

inputs_to_build_lookups_for = toset(distinct([ for k, v in var.lookup_table_builders : v.input_connector_id ]))
}

resource "aws_ssm_parameter" "additional_transforms" {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setting this SSM parameter for each "bulk" lambda to which we want to build an extra lookup table for - eg, but default lambda doesn't have any ADDITIONAL_TRANSFORMS; but if you provide this SSM param, then you change the behavior of existing lambda

@@ -21,7 +21,7 @@
@EqualsAndHashCode
@JsonPropertyOrder({"allowAllEndpoints", "endpoints", "defaultScopeIdForSource"})
@JsonInclude(JsonInclude.Include.NON_NULL) //NOTE: despite name, also affects YAML encoding
public class Rules2 implements RuleSet, Serializable {
public class Rules2 implements RESTRules {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is clearer; lets people bind to RESTRules interface instead of Rules2 implementation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rules2 --> PsoxyRules

@eschultink eschultink mentioned this pull request Mar 16, 2023
@eschultink eschultink changed the title S144 fix lookup tables alt fix lookup tables Mar 16, 2023
Copy link
Member

@jlorper jlorper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few suggestions, no blockers. Liked the split of rules + pseudonymization 👍

for_each = var.lookup_table_builders

source = "../../modules/aws-psoxy-bulk-existing"
# source = "git::https://github.com/worklytics/psoxy//infra/modules/aws-psoxy-bulk-existing?ref=v0.4.13"
source = "../../modules/aws-psoxy-output-bucket"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this point to source when deployed?


resource "aws_s3_bucket" "output" {
# note: this ends up with a long UTC time-stamp + random number appended to it to form the bucket name
bucket_prefix = "psoxy-${var.instance_id}-"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use the customer id too (if available)? from out point of view all these buckets look similar.
In any case, with the new export connectors to come should not be an issue anymore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the S3 bucket that we import customer's sanitized data from. we shouldn't be touching it directly; they should copy-paste the value into the "Add Connection" flow.

Security on this is that within our platform, only the specific customer's tenant SA can access it. So even if we did for some reason configure the connection on behalf of the customer, and mixed up buckets across customer, should be 403 from AWS IAM when the tenant SA tries to assume a role which it can't assume (based on the role's config in customer AWS)

Comment on lines 46 to 54
@Override
public PseudonymizedIdentity pseudonymize(@NonNull String value) {
return pseudonymize((Object) value);
}

@Override
public PseudonymizedIdentity pseudonymize(@NonNull Number value) {
return pseudonymize((Object) value);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really needed in the interface? they just call the Object one. That could deal internally with specialized versions per object class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, maybe can be eliminated. The Object one was previously private

@@ -21,7 +21,7 @@
@EqualsAndHashCode
@JsonPropertyOrder({"allowAllEndpoints", "endpoints", "defaultScopeIdForSource"})
@JsonInclude(JsonInclude.Include.NON_NULL) //NOTE: despite name, also affects YAML encoding
public class Rules2 implements RuleSet, Serializable {
public class Rules2 implements RESTRules {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rules2 --> PsoxyRules

throw new RuntimeException("Failed to parse ADDITIONAL_TRANSFORMS from config", e);
}
} else {
return new ArrayList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collections.emptyList() - immutable, guess you don't need it mutable

Comment on lines +27 to +28
BulkDataRules bulkDataRules,
Pseudonymizer pseudonymizer) throws IOException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of liked the Sanitizer interface being (rules + pseudonymizer), but ok

Comment on lines 90 to 97
System.out.println("Writing to: " + storageEventResponse.getDestinationBucketName() + "/" + storageEventResponse.getDestinationObjectPath());

storage.createFrom(BlobInfo.newBuilder(BlobId.of(storageEventResponse.getDestinationBucketName(), storageEventResponse.getDestinationObjectPath()))
.setContentType(blobInfo.getContentType())
.build(), processedStream);

System.out.println("Successfully pseudonymized " + importBucket + "/"
+ sourceName + " and uploaded to " + storageEventResponse.getDestinationBucketName() + "/" + storageEventResponse.getDestinationObjectPath());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Log calls instead of sys out?

process(importBucket, sourceKey, transform.getDestinationBucketName(), transform.getRules());
}

return "Processed!";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"200-OK", not sure where this is read back (probably nowhere) but if so, easier to process

@eschultink eschultink changed the base branch from rc-v0.4.14 to rc-v0.4.15 March 17, 2023 18:58
@eschultink eschultink merged commit e9982fc into rc-v0.4.15 Mar 17, 2023
@eschultink eschultink deleted the s144-fix-lookup-tables-alt branch March 17, 2023 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants