Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Sensitive data masking utility #1858

Closed
2 tasks done
seshubaws opened this issue Jan 25, 2023 · 32 comments · Fixed by #2197
Closed
2 tasks done

RFC: Sensitive data masking utility #1858

seshubaws opened this issue Jan 25, 2023 · 32 comments · Fixed by #2197
Assignees
Labels
data-masking Sensitive Data Masking feature RFC

Comments

@seshubaws
Copy link
Contributor

seshubaws commented Jan 25, 2023

Is this related to an existing feature request or issue?

#1173

Which AWS Lambda Powertools utility does this relate to?

Other

Summary

Customers would like to obfuscate incoming data for known fields that contain PII, so that they're not passed downstream or accidentally logged. With the increase of batch processing utilities and GDPR, this is one of the hardest tasks for customers today, specially when considering multi-account users.

AWS Encryption SDK is a good starting point but it can be too complex for the average developer, data engineer, or DevOps persona to use. As such, it is highly requested that the Powertools library have a utility to easily mask and/or encrypt sensitive data.

Use case

The use case for this utility would be for developers who want to mask or encrypt sensitive data such as names, addresses, SSNs, etc. in order for them to not be logged in CloudWatch so such data is not compromised, and so that downstream systems like S3, DynamoDB, RDBMS, etc. will not need any additional work on handling PII data.

Additionally, developers should be able to recover encrypted sensitive data to its original form so that they can handle sensitive requests around that data on a as-needed basis.

Proposal

The data masking utility should allow users to mask data, or encrypt and decrypt it. If they would like to encrypt their data, customers should be able to decide for themselves which encryption provider they want to use, though we will provide an out-of-the-box integration with the AWS Encryption SDK. The below code snippet is a rudimentary look at how this utility can be used and how it will function.

Usage

from aws_lambda_powertools.utilities.data_masking.constants import KMS_KEY_ARN
from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider.kms.aws_encryption_sdk import AwsEncryptionSdkProvider


def lambda_handler():
    
   data_masker = DataMasking()
    data = {
              "id": 1,
               "name": "John Doe",
               "age": 30,
               "email": "johndoe@example.com",
               "address": {
                    "street": "123 Main St", 
                    "city": "Anytown", 
                    "state": "CA", 
                    "zip": "12345"
               },
      }

    masked = data_masker.mask(data=data, fields=["email", "address.street"])
    """
    masked = {
              "id": 1,
               "name": "John Doe",
               "age": 30,
               "email": "*****",
               "address": {
                    "street": "*****", 
                    "city": "Anytown", 
                    "state": "CA", 
                    "zip": "12345"
               },
      }
   """

    encryption_provider = AwsEncryptionSdkProvider(keys=[KMS_KEY_ARN])
    data_masker = DataMasking(provider=encryption_provider)

    encrypted = data_masker.encrypt(data=data, fields=["email", "address.street"])
    """
    encrypted = {
              "id": 1,
               "name": "John Doe",
               "age": 30,
               "email": "InRoaXMgaXMgYSBzdHJpbmciHsLZGx2na-XzP_TB5Bf2LNU1bLc",
               "address": {
                    "street": "XMgYSB_KDddaDJYMb-JpbmGnagTklwQ-msdaDLP", 
                    "city": "Anytown", 
                    "state": "CA", 
                    "zip": "12345"
               },
      }
   """
 
    decrypted = data_masker.decrypt(data=encrypted, fields=["email", "address.street"])
    """
    data = {
              "id": 1,
               "name": "John Doe",
               "age": 30,
               "email": "johndoe@example.com",
               "address": {
                    "street": "123 Main St", 
                    "city": "Anytown", 
                    "state": "CA", 
                    "zip": "12345"
               },
      }
   """

AWS Encryption SDK

The AWS Encryption SDK is a client-side encryption library that makes it easier to encrypt and decrypt data of any type in your application. The Encryption SDK is available in all the languages that Powertools supports. You can use it with customer master keys in AWS Key Management Service (AWS KMS), though the library does not require any AWS service. When you encrypt data, the SDK returns a single, portable encrypted message that includes the encrypted data and encrypted data keys. This object is designed to work in many different types of applications. You can specify many of the encryption options, including selecting an encryption and signing algorithm.

Latencies
Latencies for using this utility with the AWS Encryption SDK in Lambda functions configured with 128MB, 1024MB, and 1769MB, respectively.

plugins.metrics-by-endpoint.response_time./Prod/function128:
  min: ......................................................................... 0
  max: ......................................................................... 5373
  median: ...................................................................... 561.2
  p95: ......................................................................... 713.5
  p99: ......................................................................... 1620
plugins.metrics-by-endpoint.response_time./Prod/function1024:
  min: ......................................................................... 0
  max: ......................................................................... 5039
  median: ...................................................................... 89.1
  p95: ......................................................................... 144
  p99: ......................................................................... 232.8
plugins.metrics-by-endpoint.response_time./Prod/function1769:
  min: ......................................................................... 0
  max: ......................................................................... 1726
  median: ...................................................................... 82.3
  p95: ......................................................................... 133
  p99: ......................................................................... 183.1

Custom encryption
If customer would like to use another encryption provider, or define their own encrypt and decrypt functions, we will define an interface that the customer can implement and pass in to the DataMaskingUtility class.

from aws_lambda_powertools.utilities.data_masking.provider import BaseProvider
from itsdangerous.url_safe import URLSafeSerializer

class MyCustomEncryption(BaseProvider):
    def __init__(self, secret):
        self.secret = URLSafeSerializer(secret)

    def encrypt(self, value: str) -> str:
        if value is None:
            return value
        return self.secret.dumps(value)

    def decrypt(self, value: str) -> str:
        if value is None:
            return value
        return self.secret.loads(value)


def lambda_handler():
    data = {
        "id": 1,
        "name": "John Doe",
        "age": 30,
        "email": "johndoe@example.com",
        "address": {
            "street": "123 Main St", 
            "city": "Anytown", 
            "state": "CA", 
            "zip": "12345"
        },
    }

    masking_provider = MyCustomEncryption(secret="secret-key")
    data_masker = DataMasking(provider=masking_provider)

    encrypted = data_masker.encrypt(data, fields=["email", "address.street"])
    """
    encrypted = {
        "id": 1,
        "name": "John Doe",
        "age": 30,
        "email": "InRoaXMgaXMgYSBzdHJpbmciHsLZGx2na-XzP_TB5Bf2LNU1bLc",
        "address": {
            "street": "XMgYSB_KDddaDJYMb-JpbmGnagTklwQ-msdaDLP", 
            "city": "Anytown", 
            "state": "CA", 
            "zip": "12345"
        },
    }
   """

   decrypted = data_masker.decrypt(data=encrypted, fields=["email", "address.street"])
    """
    decrypted = {
              "id": 1,
               "name": "John Doe",
               "age": 30,
               "email": "johndoe@example.com",
               "address": {
                    "street": "123 Main St", 
                    "city": "Anytown", 
                    "state": "CA", 
                    "zip": "12345"
               },
      }
   """

Out of scope

Traversing an arbitrary dictionary will be out of scope for the initial launch of this tool. This feature will be to receive instructions as to where in the given dictionary it should mask/unmask the data.

We still need to determine the most efficient method of taking input JSON path masking or encrypting the value at that path. JMESPath or JSON path can be considered for simple use cases but we need to find the fastest method.

Potential challenges

  • We will need to discuss the design and implementation with a security expert to ensure safety.
  • Also need to determine if we should support envelope encryption for customers using the AWS Encryption SDK
    • Envelope encryption is the practice of encrypting plaintext data with a data key, and then encrypting the data key under another key. But, eventually, one key must remain in plaintext so you can decrypt the keys and your data. This top-level plaintext key encryption key is known as the root key. You can store root keys in AWS KMS, known as AWS KMS keys.
  • Need to decide what the best name for the methods are. Should they still be called encrypt and decrypt even in the case where the customer only masks and won't be able to decrypt later? Should we have a mask method in the interface that is always accessible so that users can irretrievably mask some data and also encrypt some data?

Dependencies and Integrations

Integration with the AWS Encryption SDK.

Alternative solutions

No response

Acknowledgment

@seshubaws seshubaws added RFC triage Pending triage from maintainers labels Jan 25, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Jan 25, 2023

Thanks for opening your first issue here! We'll come back to you as soon as we can.
In the meantime, check out the #python channel on our AWS Lambda Powertools Discord: Invite link

@leandrodamascena leandrodamascena pinned this issue Jan 27, 2023
@heitorlessa
Copy link
Contributor

Prioritising for review tomorrow - missed our typical response time due to team offsite.

We'll revert with comments by tomorrow COB

@rubenfonseca
Copy link
Contributor

Hi @seshubaws thank you for your great work on the RFC. I especially liked the way you already pre-tested some solutions to find some bottlenecks.

I have two concerns:

  • it seems to me the "JSON path" method to identify fields that can contain PII is very limiting, compared to the offer we have on CloudWatch right now. With CloudWatch you instead declare categories of PII you are interested in masking, and AWS will mask automatically no matter where they appear. I believe Heitor's proposal where we use pydantic's parser could be a little better, but still do you think we can inverse the responsibilities here, and help the user declare "what to mask" vs "where to mask"?

  • regarding ItsDangerous, I understand the benefit of having something simpler. However, handling private keys and secrets on our own seems scary to me, and I would much prefer to build on top of the existing SDK and mechanisms, even if it means it will be slower. Alternatively, ItsDangerous could probably be used as long as we leave all the secret management to KMS.

Last, one thing I would love to see, is to augment objects that have PII with the corresponding encrypted version. Example:

{ "email": "foo@bar.com", name: "John" }

If we are interested in masking emails, this could turn into

{ "email": "*** 1 ***", name: "John", "*** 1 ***": "encrypted/signed version of the PII" }

This way, anyone with access to the decryption keys would be able to just augment the object with the original data, instead of having to find it on external systems.

@heitorlessa heitorlessa removed the triage Pending triage from maintainers label Jan 31, 2023
@heitorlessa
Copy link
Contributor

This is great, I'm so excited we're moving along with this now. A few immediate comments:

  • We should strive for cross-language compatibility. It's not a hard requirement, but it'd be great if ~50-70% is achievable to aid feature parity in the future. Other Powertools languages will follow the docs and features, it doesn't have to be a 1:1 mapping, otherwise we risk becoming half-good everywhere - we delight customers in their language of choice.
    • Q: For AST, are you using stdlib ast or a 3rd party library? If the latter, what's their license and package size?
  • Batteries included but swappable. I agree with Ruben that AWS Encryption SDK however heavy it addresses common security concerns we are not typically aware. We should be mindful to support 80% of customers looking for correctness and security while also allowing 20% of customers to bring their own implementation (e.g. ItsDangerous is everywhere) without bloating their dependencies with our recommended choice. For example, ItsDangerous is roughly 36K zipped while AWS Encryption SDK brings 18M zipped, largely due to AWS SDK and other heavy hitters. Encryption SDK alone might make it difficult for some Data Engineers/Scientists to use it, hence having a Provider helps us meet them where they are.
    • Q: Could you include what the experience would look like to bring-your-own-masking provider?
    • Q: Based on your research with both Encryption SDK and ItsDangerous, what common nomenclature could we use for interfaces? e.g., sign, dumps, etc.
  • What vs Where to mask. Unless Ruben has suggestions on how to address it for customers not using Parser (SRE, DevOps, Data Scientists), we'd want to support - however, it doesn't have to be at launch. [What] We could start with the Pydantic above and focus on getting a Provider interface, and correctness approach first. [Where] Once that's out of the picture, we can hear from customers on how they'd plan to use the JSON Path-like to scrub data recursively.
    • NOTE: They should be easily composable, since one traverses and one mask. If one doesn't have a dict data structure, you simply use a Provider to mask data.
  • Value of CloudWatch Logs PII. We recently announced native support for masking sensitive data in CloudWatch Logs, quite similarly to SNS. Part of Powertools exists because of integration and feature gaps between services, however, we should not assume that everyone using Lambda will want their logs dumped in CloudWatch - see Observability providers top demanded feature. Instead, we should recognize the plurality of customers we have. Depending on volume and need, it might be more economically viable to use this feature vs a managed service.
    • NOTE: Alternatively, if a customer wants to mask PII before they send it to downstream services via REST/GraphQL APIs, or non-SNS messaging services (EventBridge, Kinesis, SQS, Kafka, MQ), this is still a value add for them to rely on this feature. Over time, we will continue to voice our customers feedback to make it into a native feature to reduce operational overhead - until then, I'd be happy to hear otherwise why not.

On UX:

  • A draft proof of concept would be great to improve this discussion.
  • We'd want to avoid high-level top imports to prevent cold starts, like what happened to Tracer and thus requiring additional logic for lazy imports. Instead, use aws_lambda_powertools.utilities.data_masking.
  • Let's not mutate implicitly and always return the data masked, despite Python using pass-by-reference. This makes it easier for cross-languages too.
  • Prefer data parameter over event parameter. This enables a near future where we officially support this feature beyond Lambda runtimes - this feature could easily be used anywhere.
  • Customers have more than a Dict, so let's account for those too. This makes it easier to interop with existing systems they might have too to ease migration (on/off), if necessary - e.g., Signer.sign("my_email@domain.com")

Super looking forward to this.

Thank you for all the hard work you already put into this.

@lpsuerj
Copy link

lpsuerj commented Feb 2, 2023

I'm excited about all the possibilities we can uncover with this feature on the security front! Few things that come to my mind about it:

  1. When logs are not configured to be sent to CloudWatch logs, we need to ensure there's a way to show evidence of security control implementation effectiveness. Encryption of sensitive data is a high priority for security audit teams and to enabled an easy path to comply with regulator requirements (internal and external) would be a big win.

  2. AWS is broadly adopting a new security industry-wide logging format called OCSF. I't be great if we could embrace the same format for a future potential integration with services like AWS Security Lake for encryption events logs. Ideally we should still log the encryption activities following defined patterns in the Encryption SDK. This is NOT a hard requirement but a great nice to have

I'm looking forward to deep dive in this topic and join the team bringing this to life!

@ran-isenberg
Copy link
Contributor

Hey, really liking this idea!
I'd like to share my two cents as this is an issue I also deal with in my day to day.
I'd usually use a KMS CMK and boto to encrypt PII and in many cases either the customer brings its own key or i have a key per tenant but the encryption itself is required to be FIPS certified and other goodies.
For safe logging, if you parse your input with the Parser utility you can define any sensitive information as SecretStr (https://docs.pydantic.dev/usage/types/#secret-types) which causes it to printed as '****' when serialized to a string in the logger.

@seshubaws
Copy link
Contributor Author

seshubaws commented Feb 6, 2023

@rubenfonseca Thank you for your comments, let me know if I haven't answered all your questions!

It seems to me the “JSON path” method to identify fields that can contain PII is very limiting, compared to the offer we have on CloudWatch right now. With CloudWatch you instead declare categories of PII you are interested in masking, and AWS will mask automatically no matter where they appear. I believe Heitor’s proposal where we use pydantic’s parser could be a little better, but still do you think we can inverse the responsibilities here, and help the user declare “what to mask” vs “where to mask”?

Yes, that makes sense to me! I played around a little with CW’s new data masking tool and it seems they might be using regex for pattern matching for the desired categories users want to mask, but not exactly sure (let me know if you know what how they implemented this). I’m not very familiar with Pydantic but if it will help users identify more generic PII fields, we should definitely use that.

regarding ItsDangerous, I understand the benefit of having something simpler. However, handling private keys and secrets on our own seems scary to me, and I would much prefer to build on top of the existing SDK and mechanisms, even if it means it will be slower. Alternatively, ItsDangerous could probably be used as long as we leave all the secret management to KMS.

Yes, I think this goes along with what Heitor mentioned about designing a bring-your-own-masking-provider type interface. As both the AWS Encryption SDK and ItsDangerous have their tradeoffs, we should leave it up to the customer to decide which would fit best with their specific use case. I’ll work on coming up with a UX draft for this.

Last, one thing I would love to see, is to augment objects that have PII with the corresponding encrypted version. Example:

We could definitely do this! I am a little confused on what use case would require a field to be masked and also encrypted though, as I am under the impression that if it’s encrypted, then it’s just as good as being masked with the added benefit of only users with access can decrypt it.

@seshubaws
Copy link
Contributor Author

seshubaws commented Feb 9, 2023

Had an offline discussion with @heitorlessa (thanks for the thorough review and comments Heitor!) and thought I'd add it here for the public:

What vs Where to mask. Unless Ruben has suggestions on how to address it for customers not using Parser (SRE, DevOps, Data Scientists), we’d want to support - however, it doesn’t have to be at launch. [What] We could start with the Pydantic above and focus on getting a Provider interface, and correctness approach first. [Where] Once that’s out of the picture, we can hear from customers on how they’d plan to use the JSON Path-like to scrub data recursively.

Seshu: So do we want to launch first with customers choosing categories of things to mask like what CW is offering right now? ie categories like DOB, credit cards, SSN? Or would they be more general categories like Financial?

Heitor: Categories will prolly have false positives, so I’d wait until customers actually request it so we can brainstorm a bit more. Right now, masking any input that can be unmasked with appropriate permissions later is the way forward — for irreversibly masking (****), I’m still torn on whether we do this now with AST, or simply recommend Pydantic SecretStr for now until other languages have bandwidth to port this utility.

Basic UX of this:

from pydantic import BaseModel, validator

def obfuscate_forever(value: str) -> str:
    return value.replace(value, "****")


data = {
    "username": "johndoe",
    "password": "secretpassword"
}

class Policy(BaseModel):
    username: str
    password: str
    
    _irreversibly_obfuscate_password = validator('password', allow_reuse=True)(obfuscate_forever)  # Pydantic will call `obfuscate_forever` as soon as it traverses the tree and finds the password attribute

user = Policy(**data)
print(user)
# Output: {'username': 'johndoe', 'password': '******'} 

This means we can use the same mechanic to provide an actual Obfuscator who can encrypt data at any node in the tree, and customers can deobfuscate by using the same method in reverse (Obfuscator.unmark, etc.).

Using the sample example but with SecretStr:

from pydantic import BaseModel, SecretStr

data = {
    "username": "johndoe",
    "password": "secretpassword"
}

class Policy(BaseModel):
    username: str
    password: SecretStr  # Policy.password.get_secret_value() can get the real value


user = Policy(**data)
print(user)
# Output: {'username': 'johndoe', 'password': '******'} 

@seshubaws
Copy link
Contributor Author

seshubaws commented Feb 14, 2023

@heitorlessa Responding to the rest of your questions here!

Q: For AST, are you using stdlib ast or a 3rd party library? If the latter, what’s their license and package size?

I am using only the stdlib ast, but just to call out that python3.9 or later is needed for what we’re using from this library, though I don’t think that should be a problem.

Q: Could you include what the experience would look like to bring-your-own-masking provider?

I updated the original proposal above to include this, but had a few questions about the UX I wanted to get clarification on, since I was basing this draft on the one Heitor had proposed earlier in this comment:

  • Are customers going to have to write the Policy class themselves? For a better experience, they should just be able to call a method and have it done for them right?
  • Are we for sure going to use Pydantic and not ASTs?

Q: Based on your research with both Encryption SDK and ItsDangerous, what common nomenclature could we use for interfaces? e.g., sign, dumps, etc.

I suppose we could use encrypt and decrypt which translates directly for Encryption SDK, and would be dumps and loads for ItsDangerous.

@rubenfonseca
Copy link
Contributor

@heitorlessa Responding to the rest of your questions here!

Q: For AST, are you using stdlib ast or a 3rd party library? If the latter, what’s their license and package size?

I am using only the stdlib ast, but just to call out that python3.9 or later is needed for what we’re using from this library, though I don’t think that should be a problem.

Great catch. We will have to support all the current supported Python Lambda runtimes, which at this moment means we have to support python 3.7. Do you know if there's a 3rd party library that we could use so we can cover all Python versions?

Q: Could you include what the experience would look like to bring-your-own-masking provider?

  • Are customers going to have to write the Policy class themselves? For a better experience, they should just be able to call a method and have it done for them right?

One idea was to have common use cases implemented by us in Powertools, so we make it as easier as possible for customers to adapt.

  • Are we for sure going to use Pydantic and not ASTs?

If we can't find a good 3rd party library to replace stdlib ast, Pydantic could be a good option since we already use it elsewhere.

Q: Based on your research with both Encryption SDK and ItsDangerous, what common nomenclature could we use for interfaces? e.g., sign, dumps, etc.

I suppose we could use encrypt and decrypt which translates directly for Encryption SDK, and would be dumps and loads for ItsDangerous.

I don't have any strong opinion on this.

@seshubaws
Copy link
Contributor Author

@rubenfonseca

We will have to support all the current supported Python Lambda runtimes, which at this moment means we have to support python 3.7. Do you know if there's a 3rd party library that we could use so we can cover all Python versions?

astunparse is a library that also offers the same capabilities as the stdlib ast for all Python versions, and it has a BSD license.

@heitorlessa
Copy link
Contributor

@seshubaws and I had a 1:1 sync as there were some confusing areas on scope (now vs later). I'd like to take a moment to recap on what we're trying to do and when.

There are two macro use cases:

  1. Mask/Unmask primitive data types. This feature receives an input and return data masked - this will often be a string, a number, binary, or a sequence (list, tuple, set, range). In its simplest form, customers can use for use cases where they are passing the exact value they want it to be masked/unmasked irreversibly or not.
  2. Traverse an arbitrary dictionary. This feature receives instructions as to where in the given dictionary it should mask/unmask the data - this means we have to work with Abstract Syntax Tree (AST) to efficiently recurse through arbitrary data.

For now, we should focus on the first macro use case - Mask/Unmask primitive data types.

Why?

Our primary focus is making sure this feature can use AWS Encryption SDK and it is easily swappable for another implementation (e.g., ItsDangerous, Bring-your-own). The second use case is also dependent on the first one.

For example, if a customer uses Pydantic (the most popular data validation library), they can start address the second use case by incorporating this feature with a validator. Pydantic will do the heavy lifting of traversing (Node visitor pattern) the model and call this feature to mask/unmask data upon the data validation phase.

This gives us time to focus on the UX to solve the actual problem (Mask/Unmask data), give us a window for customer feedback to get extensibility right (batteries included but swappable), and only then take the second use case for those not using Pydantic. This could mean a launch with the first use case completed, followed by the second use case in a separate release - we don't have do it all at once; it's a complex domain and UX is at utmost importance.

Please let me know if I can make any of these parts any clearer.

@heitorlessa
Copy link
Contributor

Please let us know if you have any updates @seshubaws in the coming weeks.

@heitorlessa heitorlessa unpinned this issue Mar 1, 2023
@heitorlessa
Copy link
Contributor

Yes to 1, and absolutely I missed that entirely on 2. Now that you brought it up, this makes me thinking that we might let customers down if we don't launch with the AST traversal option, since mask()/redact() method would work out-of-the-box.

@lpsuerj
Copy link

lpsuerj commented Mar 8, 2023

I don’t understand why the default is the AwsEncryptionSdkProvider() because what if users just want to mask and never encrypt? They shouldn’t need to install the whole encryption SDK just to mask if they're not ever going to use the SDK right? Should we have another provider that is just for masking, while keeping masking as an option for all the other providers?

I agree with @seshubaws regarding the use case for masking and not encryption specially for log outputs. For passing info over to downstream systems we would in most cases opt for encryption, especially if changes trust domains and security requirements - i.e less trusted source to more trusted downstream system.

Additionally, there is also a salt component, which is combined with the secret key to derive a unique key for distinguishing different contexts. Unlike the secret key, the salt doesn’t have to be random, and can be saved in code. It only has to be unique between contexts, not private.

@seshubaws can we not have the salt saved in code? any chance we can randomise it following code security best practices? How much effort would it be required in this case? I understand that salt only has to be unique between contexts, not private but I'd rather have it generated and hardcoded.

I also checked the security of the package and in its version 2.1.2 there are no known vulnerabilities (direct or indirect).

Need to decide what the best name for the methods are. Should they still be called encrypt and decrypt even in the case where the customer only masks and won't be able to decrypt later? Should we have a mask method in the interface that is always accessible so that users can irretrievably mask some data and also encrypt some data?

If we decide to follow the industry standards, we need to differentiate encrypt, decrypt and mask due the nature of security controls being describe. I believe we should have all 3 separate methods for clarity.

@heitorlessa
Copy link
Contributor

heitorlessa commented Mar 8, 2023

@lpsuerj is redact more appropriate than mask here?

The salt is just an example for the provider ItsDangerous. That said, I think you raise a secondary and more important point which is -- given we know some of the security best practices (with proper review etc), how much do we want to bake in to ease Providers buildout? e.g., AWS Encryption SDK handles a lot of the responsibilities like caching, encryption X data with Y key only, etc.

@valerena
Copy link

valerena commented Mar 8, 2023

Can you expand a little more @heitorlessa on "we might let customers down if we don't launch with the AST traversal option, since mask()/redact() method would work out-of-the-box."? Are you saying to change the approach we talked before about starting with the simple case and expand to dictionaries (ASTs) later? Doesn't it make sense to still start with the "basic" case only?

And what do you mean with "mask()/redact() method would work out-of-the-box"? Will it work out-of-the-box if we implement ASTs? How is the "support AST vs basic data types" discussion related to the "mask/redact vs encrypt"?

@lpsuerj
Copy link

lpsuerj commented Mar 9, 2023

@lpsuerj is redact more appropriate than mask here?

I think mask would be more appropriate.

@heitorlessa
Copy link
Contributor

Are you saying to change the approach we talked before about starting with the simple case and expand to dictionaries (ASTs) later? Doesn't it make sense to still start with the "basic" case only?

Please allow me to expand. We should still work on the initial use case first (encrypt/decrypt). What I realized after seeing the default experience if no providers are used at the initialization was that mask() method doesn't have much value without AST - in other words, we can only provide a delightful experience for non-Pydantic customers once we have ASTs in place.

This means we can either (1) work on a fast follow-up AST feature after launch (same plan as before), or (2) consider delaying the launch to get AST done like originally suggested by @seshubaws (IIRC, Seshu mentioned AST would be fairly trivial to get done).

And what do you mean with "mask()/redact() method would work out-of-the-box"? Will it work out-of-the-box if we implement ASTs?

My understanding from re-reading the updated RFC was that mask() method would work out of the box with any string value - did I perhaps misunderstand it?

dataStr = "this is a string"

# This would effectively be DataMasking(provider=None)
data_masker = DataMasking() # no providers no keys needed
masked = data_masker.mask(data=dataStr)

How is the "support AST vs basic data types" discussion related to the "mask/redact vs encrypt"?

AST continues to be about Dictionary/JSON, where data to be masked (or encrypted) could be in any primitive data types.

I think the confusion stems from the fact that a method mask/encrypt would be too limited to expect a str only. Perhaps, we can look into answering these questions in the RFC to avoid confusion:

  • As a customer, how should I mask/encrypt a single str value? (answered in the RFC already)
  • As a customer, how should I mask/encrypt a single Number value?
  • As a customer, how should I mask/encrypt all data available in a list/tuple?
  • (Optional) As a customer, how can I handle unsupported data types? (think datetime, Decimal)

After that is clear, AST instruction becomes simply an additional parameter to mask/encrypt/decrypt methods.

Stream of consciousness

After seeing the interface and the initial UX, and talking to yet another customer today on a related security topic, it made me realize that:

  • Encryption providers can start with the safest approach. AwsEncryptionSdkProvider bakes in industry security practices by default whilst ItsDangerous is much faster but needs significant care, documentation, and customer effort to reach a similar posture (e.g., key rotation, key management, fetching, caching, envelope encryption). Having the primitives (encrypt/decrypt) interface nailed is the most costly part of this implementation. After that, we can have more time to think through what security practices we can take care as part of the provider abstract implementation.
  • Mask method doesn't seem to add value without AST. In the current UX, data_masker.mask(value) is the equivalent of "value".replace("value", "***"). Having additional primitives like Number will add some value, Sequences will save some keystrokes, but the actual value comes with the AST as masking only parts of a tree is where the value is at. If the default UX is to not specify a provider (which I agree), then mask doesn't add much to what the average person can find on StackOverflow.

Does that help? Happy to go over any areas that aren't clear as that also helps me improve my understanding of this problem domain.

@lpsuerj
Copy link

lpsuerj commented Mar 16, 2023

Encryption providers can start with the safest approach. AwsEncryptionSdkProvider bakes in industry security practices by default whilst ItsDangerous is much faster but needs significant care, documentation, and customer effort to reach a similar posture (e.g., key rotation, key management, fetching, caching, envelope encryption). Having the primitives (encrypt/decrypt) interface nailed is the most costly part of this implementation. After that, we can have more time to think through what security practices we can take care as part of the provider abstract implementation.

I can safely say that this is the expected approach a security engineer would want from the function. We have lots of controls and use cases baked in to KMS and the SDK takes care of a big part of key management procedures that are needed for a proper security feature and I believe it should be our default approach.

@github-actions
Copy link
Contributor

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

@github-actions github-actions bot added the pending-release Fix or implementation already in dev waiting to be released label Sep 27, 2023
@seshubaws seshubaws added the data-masking Sensitive Data Masking feature label Sep 27, 2023
@leandrodamascena
Copy link
Contributor

Reopening this issue to update the future changes and the developer experience.

@heitorlessa
Copy link
Contributor

Quick summary of the experience that's been merged.

Take the following sample data in mind:

{
     "id": 1,
     "name": "John Doe",
     "age": 30,
     "email": "johndoe@example.com",
     "address": {
          "street": "123 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
     }
}

We can now use .encrypt method along with email and address.street fields that should be encrypted:

from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider.kms.aws_encryption_sdk import AwsEncryptionSdkProvider
import os

encryption_keys = os.getenv("ENCRYPTION_KEYS")

encryption_provider = AwsEncryptionSdkProvider(keys=[encryption_keys])
data_masker = DataMasking(provider=encryption_provider)


def lambda_handler(event: dict, context):
     encrypted: dict = data_masker.encrypt(data=event, fields=["email", "address.street"])
     # {
     #      "id": 1,
     #      "name": "John Doe",
     #      "age": 30,
     #      "email": "InRoaXMgaXMgYSBzdHJpbmciHsLZGx2na-XzP_TB5Bf2LNU1bLc",
     #      "address": {
     #           "street": "XMgYSB_KDddaDJYMb-JpbmGnagTklwQ-msdaDLP",
     #           "city": "Anytown",
     #           "state": "CA",
     #           "zip": "12345"
     #      },
     # }

Conversely, the decrypt operation could work in a separate function:

from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider.kms.aws_encryption_sdk import AwsEncryptionSdkProvider
import os

encryption_keys = os.getenv("ENCRYPTION_KEYS")

encryption_provider = AwsEncryptionSdkProvider(keys=[encryption_keys])
data_masker = DataMasking(provider=encryption_provider)


def lambda_handler(event, context):
     decrypted = data_masker.decrypt(data=event, fields=["email", "address.street"])
     # {
     #          "id": 1,
     #           "name": "John Doe",
     #           "age": 30,
     #           "email": "johndoe@example.com",
     #           "address": {
     #                "street": "123 Main St",
     #                "city": "Anytown",
     #                "state": "CA",
     #                "zip": "12345"
     #           },
     # }

@github-actions
Copy link
Contributor

This is now released under 2.26.0 version!

@github-actions github-actions bot removed the pending-release Fix or implementation already in dev waiting to be released label Oct 13, 2023
@leandrodamascena
Copy link
Contributor

Reopening

@heitorlessa
Copy link
Contributor

Closing as code is merged; docs is in progress. that should better reflect what's pending on the board.

Copy link
Contributor

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-masking Sensitive Data Masking feature RFC
Projects
Status: Shipped
Development

Successfully merging a pull request may close this issue.

7 participants