Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support AES256 encryption of sensitive params #45195

Conversation

andrii-korotkov-verkada
Copy link
Contributor

Implements support for _ENCRYPTED versions of env variables and _encrypted versions of config params. Those would be encrypted using a parameter params_encryption_aes256_key (which can be retrieved from a secret for example). The params encryption key itself doesn't support encrypted suffix, or it'd create an infinite recursion.

closes: #45194


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@andrii-korotkov-verkada
Copy link
Contributor Author

Starting with a draft with a POC. Please, let me know if this is a good direction.

@andrii-korotkov-verkada andrii-korotkov-verkada force-pushed the 45194-support-aes256-encryption-of-sensitive-params branch from 692732e to 56d5d47 Compare December 24, 2024 15:01
Implements support for `_ENCRYPTED` versions of env variables and `_encrypted` versions of config params.
Those would be encrypted using a parameter `params_encryption_aes256_key` (which can be retrieved from a secret for example).
The params encryption key itself doesn't support encrypted suffix, or it'd create an infinite recursion.

closes: apache#45194

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
@andrii-korotkov-verkada andrii-korotkov-verkada force-pushed the 45194-support-aes256-encryption-of-sensitive-params branch from 56d5d47 to c2ca77e Compare December 24, 2024 15:06
@potiuk
Copy link
Member

potiuk commented Dec 24, 2024

Before we go any deeper. I would like to dicuss the reasoning of doing it

I simply think it does not achieve any "more" security than we currently have. Both parameters and decryption key are necesarlly in the same environment and you neeed to have access to both - if you don't, then you cannot decrypt them.

We already have encryption using fernet key (symmetric) but there we use it to encrypt the database values, becuase then for example if you have only access to database backups/dumps or database itself, you are not able to decrypt it without access to the configuraiton of Airlfow - where the fernet key is stored. So here values and keys are kept in two separate "spaces" / "security scopes/perimeters".

We also already support _CMD pattern - where the actual sensitive values are retrieved by running a script that can (for example) use workload identity to retrieve sensitive values from somewhere else (say secrets manager) - so that it is not stored in the configuration at all.

In this case and your proposal, seemlngly both encrypted values and the key are supposed to be stored in the same "space" - so if you have access to one, you have access to other, and you can use the key to decrypt the values, so at best it's security-by-obscurity.

Do you see any scenario where someone could get hold of the values but not the key - where it could be helpful? What kind of attacks and which actor behaviours it would prevent ?

It might be you have some concrete use cases in mind that I don't know.

@andrii-korotkov-verkada
Copy link
Contributor Author

andrii-korotkov-verkada commented Dec 24, 2024

Sure. I think of this as more of a convenience feature rather than a feature that heightens a security.
ArgoCD is used to manage manifests, as well as there are multiple shards (i.e. AWS accounts and clusters), where each needs to have some configuration, ideally with as little overhead as possible. Some secret values can be based on large things, e.g. pgbouncer.ini config, so automated way for managing them would be very helpful.
Here are several options I've considered:

  • Storing generated manifests in the repo, even for secrets. That's a hard no from a security perspective.
  • Generate and apply secrets to the cluster directly without committing them to the repo. This makes secrets not very durable, e.g. if fernet key is removed for some reason, the encrypted data is effectively lost. This would require duplicating the key in some password manager, which would make management a bit clunkier.
  • Use secrets manager backend to store secrets. This helps with durability and overall seems like a good option, however adding secrets to the manager and rotating them would be clunky, since it's not integrated with helm templating.
  • Using _CMD env variables. This can work, but has a few downsides in my case, like aws cli not being pre-installed in the default image (at least I didn't see it in constraints) and cmd code needs to be repeated for each parameter.
  • Use AES256 encryption to store encrypted versions of secrets. This solves durability, auditing, security and convenience. Since helm templating supports aes256 encryption natively, I can just generate all manifests with it with all secrets needed for all shards, which is especially convenient given some secrets like connection are based on multiple helm inputs (e.g. host, username, password). This way, I reduce the number of secrets to manage with external means (like cluster secret or secrets manager) to just one.

I'd have to think separately on how to work with secret things mounted as files, e.g. pgbouncer.ini. Though probably a similar approach with having a key like pgbouncer.ini.encrypted and using the same key to decrypt would work.

Please, let me know if these arguments make sense. Happy to discuss further.

@andrii-korotkov-verkada
Copy link
Contributor Author

Here's an example when there's a secret value in a config where it doesn't seem to be possible to hide it https://airflow.apache.org/docs/apache-airflow/1.10.8/security.html#github-enterprise-ghe-authentication.

@potiuk
Copy link
Member

potiuk commented Dec 26, 2024

Just to note - not that I am against it totally, but I want you to think and come up with proposal how to tell the user to maintain "secret" status of such values?

If you are keeping both - secret values and private key in the same configuration place (secure perimeter), then you can use the key to decrypt the values. What your advise should be for the users who are going to use that feature, how they should keep the key and what exactly will the encryption will be protecting against?

If we are going to accept some form of it, we need have detailed documentation describing the users what kind of security they can expect and how they should protect (and against what) - the thing is that often when people see some "Encrypted" values, they think they are generally protected against the values leaking (but they don't realise for example that storing the key to decrypt the values in the same security perimeter/storage gives them false sense of security).

What your "user documentation" for that feature and user recommendation would look ilike for it?

BTW:

Using _CMD env variables. This can work, but has a few downsides in my case, like aws cli not being pre-installed in the default image (at least I didn't see it in constraints) and cmd code needs to be repeated for each parameter.

Not really. You can have a single command that is used there and pass the parameter to that command indicating which secret you want to retrieve - ...SQL_CON_CMD=/opt/airflow/bin/get_secret_value.sh SQL_CONN etc. That's what it was designed for.

@potiuk
Copy link
Member

potiuk commented Dec 26, 2024

Also. I think it needs an explanation on why you cannot use "k8s secrets" (if you are talking about k8s) - and why is it problematic to deploy them separately and mount as config files.

@andrii-korotkov-verkada
Copy link
Contributor Author

I think the problem I'm trying to solve is that the secrets are in different places in different formats. While some of them can be handled with _CMD params (thanks for pointing that a single script can be used!), in some cases this option is not offered, for example:

  • Pgbouncer configs such as pgbouncer.ini (not sure why this one is secret) and users.txt (contains some username and passwords). Those are read from a secret and mounted as files.
  • Auth-related secrets (e.g. for GHE) which seem to be just put in the airflow.cfg which is mounted as a config map, not a secret.
  • Secrets that might need to be specified as a part of server init Python file (e.g. Okta client secret), though these probably can be passed as env variables reading values from k8s secrets.

My proposal offers an option to replace these secrets with one, still probably a k8s secret, to simplify management while offering the same amount of security (e.g. reading one k8s secret in a namespace requires pretty much the same amount of permissions and effort as reading multiple). I'll set users' expectations that they are getting the same amount of security and still need to keep their k8s secret with encrypted key safe, but that this way would simplify managing a configuration.

@andrii-korotkov-verkada
Copy link
Contributor Author

One alternative to the _encrypted suffix is to have encrypted values start with some prefix, e.g.

client_secret: aes256encrypted:<base64-encoded-ciphertext>

@potiuk
Copy link
Member

potiuk commented Dec 26, 2024

Why not having it all configured through env variables? All airflow configurations can be configured via ENV variables in one place - same source for every secret value https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/#configure-all-key-value-pairs-in-a-secret-as-container-environment-variables

The thing is that we tend to avoid having to add "features" to Airflow where they can be provided via "external" means in this case deployment that might be responsible to keep all secrets (i.e. env variables) that are providing all secret values from a single "secret" that can even be backed by a secret manager.

Generally we are now more in the mode (for Airlfow 3) to search what non core issues we can remove from Airflow rather than what we can add to it. Similalry as your other issue (now discussion) #45221 - we are very, very, very cautious about adding things to Airflow that could be done "externally" - we are good at scheduling workfloads, we are not good in keeping secret configurations (K8S is way better for that) and authentication (KeyCloak is way better in this). You have to remember, that whatever we approve as contributions to Airlfow, we also have to maintain it - and if there are other ways to achieve it - for example via deployment features - we tend to be very hesitant about accepting those.

You seem to be very focused on K8S / ArgoCD environment, so my question is - why "standard" feature where all the configurationsecrets can be configured in one place and exposed as env variables are not "good" for you ?

@andrii-korotkov-verkada
Copy link
Contributor Author

Thanks for detailed replies. After more thinking I concede - marginal wins in maintenance of configs across shards are not worth the complexity of adding a new feature. I understand the complexity of maintaining open source. I can probably achieve my goals with some customized manifests overlays + a custom script to render the helm templates (which we have anyways). That said, I'm still looking to contribute something, maybe that KeyCloak auth manager.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support AES256-encrypted values for sensitive parameters in configs, env variables etc.
2 participants