
[FEATURE REQUEST #32] On-Premises S3 / S3 Compatible... #389

Open · wants to merge 22 commits into base: main

Conversation


@lefebsy lefebsy commented Oct 21, 2024

Description (edited):

This is a proposed S3-compatible storage implementation for Polaris core: a copy of the AWS implementation plus new parameters (endpoint, path-style access, ...).

It is tested OK (see the MinIO-based regression test below) and should work with many S3-compatible solutions such as Dell ECS, NetApp StorageGRID, etc.

  • By default it tries to respect the same credential behavior as AWS (IAM/STS). The same dynamic policy is applied, limiting the scope to the data being queried.

  • Otherwise, if STS is not available, setting 'skipCredentialSubscopingIndirection' = true disables the Polaris "SubScoping" of the credentials.

Let me know your opinion about this design proposal.
Thank you

curl -X POST -H "Authorization: Bearer ${SPARK_BEARER_TOKEN}" \
     -H 'Accept: application/json' -H 'Content-Type: application/json' \
     http://${POLARIS_HOST}:8181/api/management/v1/catalogs -d \
      '{
          "name": "my-s3compatible-catalog",
          "id": 100,
          "type": "INTERNAL",
          "readOnly": false,
          "properties": {
            "default-base-location": "${S3_LOCATION}"
          },
          "storageConfigInfo": {
            "storageType": "S3_COMPATIBLE",
            "allowedLocations": ["${S3_LOCATION}/"],
            "s3.endpoint": "https://localhost:9000"
          }
        }'
# Note: with single quotes around the JSON payload the shell will not expand ${S3_LOCATION};
# use double quotes or a heredoc if you want the variable substituted.

# Optional "storageConfigInfo" properties:
#   "s3.pathStyleAccess": true / false
#   "s3.region": "rack-1 or us-east-1"
#   "s3.roleArn": "arn:xxx:xxx:xxx:xxxx"
#   "s3.credentials.catalog.accessKeyId": "CATALOG_1_ACCESS_KEY_ENV_VARIABLE_NAME"
#   "s3.credentials.catalog.secretAccessKey": "CATALOG_1_SECRET_KEY_ENV_VARIABLE_NAME"

# Optional in case STS/IAM is not available:
#   "skipCredentialSubscopingIndirection": true
#   # Optional with skipCredentialSubscopingIndirection - served by the catalog to clients such as Spark or Trino:
#   "s3.credentials.client.accessKeyId": "CLIENT_OF_CATALOG_1_ACCESS_KEY_ENV_VARIABLE_NAME"
#   "s3.credentials.client.secretAccessKey": "CLIENT_OF_CATALOG_1_SECRET_KEY_ENV_VARIABLE_NAME"

Included Changes:

  • New type of storage "S3_COMPATIBLE".
  • Tested against MinIO with self-signed certificate
  • regtests/run_spark_sql_s3compatible.sh

Type of change:

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation update
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue

@lefebsy lefebsy changed the title add s3 compatible storage - first commit [FEATURE REQUEST] On-Premise S3... #32 Oct 21, 2024
@lefebsy lefebsy changed the title [FEATURE REQUEST] On-Premise S3... #32 [FEATURE REQUEST #32] On-Premise S3... Oct 21, 2024
@lefebsy lefebsy changed the title [FEATURE REQUEST #32] On-Premise S3... [FEATURE REQUEST #32] On-Premises S3... Oct 21, 2024

@@ -23,6 +23,8 @@ public enum PolarisCredentialProperty {
AWS_KEY_ID(String.class, "s3.access-key-id", "the aws access key id"),
AWS_SECRET_KEY(String.class, "s3.secret-access-key", "the aws access key secret"),
AWS_TOKEN(String.class, "s3.session-token", "the aws scoped access token"),
AWS_ENDPOINT(String.class, "s3.endpoint", "the aws s3 endpoint"),
AWS_PATH_STYLE_ACCESS(Boolean.class, "s3.path-style-access", "the aws s3 path style access"),
Contributor:
whether or not to use path-style access

Author (lefebsy):

Many S3-compatible solutions are deployed without the network devices or configuration in front of them that would allow dynamic host names including the bucket. TLS certificates from a private CA can also be a challenge for dynamic host names, and wildcard "*.domain" certificates can be forbidden by some enterprise security policies.

Path style is useful in many cases. In an ideal world I agree it should stay deprecated...

Contributor:

I think @eric-maynard was asking if we can change the description to something like this:

Suggested change
AWS_PATH_STYLE_ACCESS(Boolean.class, "s3.path-style-access", "the aws s3 path style access"),
AWS_PATH_STYLE_ACCESS(Boolean.class, "s3.path-style-access", "whether or not to use path-style access"),

I also agree that we should make sure it is false by default.

Commenter:

For IBM's watsonx.data product it is set to true by default for MinIO and Ceph bucket types, the reason being that it's more likely to work. Path style will work regardless of whether the customer has set up wildcard DNS, a TLS certificate with a subject-alternative-name (the wildcard), and the hostname in the zonegroup (for Ceph). Virtual-host style will only work if all of those things are done.

It's not a hill I would die on, but it's worthy of consideration.

Contributor:

To clarify, I was referring to the Polaris scope default. Users always have the option to set it to true when creating a new catalog.
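For reference, a minimal sketch of how an endpoint override and path-style access typically map onto an AWS SDK v2 S3 client. This is illustrative only, not this PR's actual wiring; it assumes a recent SDK version that exposes forcePathStyle on the builder, and the class and parameter names below are hypothetical:

import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class S3CompatibleClientSketch {
  // endpoint, pathStyle and the key pair stand in for values resolved from the
  // catalog's storage configuration ("s3.endpoint", "s3.path-style-access", ...).
  public static S3Client build(String endpoint, boolean pathStyle,
                               String accessKeyId, String secretAccessKey) {
    return S3Client.builder()
        .endpointOverride(URI.create(endpoint))   // custom S3-compatible endpoint
        .forcePathStyle(pathStyle)                // path-style access, false by default
        .region(Region.US_EAST_1)                 // placeholder; many on-prem stores ignore the region
        .credentialsProvider(StaticCredentialsProvider.create(
            AwsBasicCredentials.create(accessKeyId, secretAccessKey)))
        .build();
  }
}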

lefebsy commented Mar 7, 2025

> I refer to the field (s3PathStyleAccess) in this class, which can be set to null. In that case, storageConfig.getS3PathStyleAccess().toString() will throw an NPE.

There is a default "false" in the REST spec. But ok 👍


flyrain commented Mar 7, 2025

> I refer to the field (s3PathStyleAccess) in this class, which can be set to null. In that case, storageConfig.getS3PathStyleAccess().toString() will throw an NPE.
>
> There is a default "false" in the REST spec. But ok 👍

I agree that it's fine if the object is always created from the REST layer. But the class itself could be constructed internally in the future, and I don't want an NPE surprise in that case.


lefebsy commented Mar 8, 2025

> Thanks a lot for working on it. Getting close. Left some comments. Other than the server-side env variable settings, can we also have unit tests for the two new classes added?

Is it OK for you if I do UT only for the config class and move to dockerized tests for the credentials class? I am really not good in the mock code area.

I can easily add MinIO to the regtests docker-compose and enrich "regtests/t_spark_sql/".

My first try taught me not to name the script with "S3" inside, otherwise it's trapped by the AWS stop pattern and never runs.


flyrain commented Mar 9, 2025

> Is it OK for you if I do UT only for the config class and move to dockerized tests for the credentials class? I am really not good in the mock code area.
>
> I can easily add MinIO to the regtests docker-compose and enrich "regtests/t_spark_sql/".
>
> My first try taught me not to name the script with "S3" inside, otherwise it's trapped by the AWS stop pattern and never runs.

I'm OK with a follow-up PR to resolve the mock-related unit tests. We can get some help from the community. Can you file an issue to track that?

@omarsmak

@lefebsy are we merging this PR soon, or is there still some outstanding work?


flyrain commented Mar 10, 2025

Hi @lefebsy, let me know once the PR is ready for another look. As I said here, #389 (comment), this PR isn't blocked by other PRs.


lefebsy commented Mar 10, 2025

> @lefebsy are we merging this PR soon, or is there still some outstanding work?

Hello @omarsmak,
It still needs one review to check the latest requested modifications (regression tests), but unless the merge has to wait for #1068, I think the biggest part of the task is done.


lefebsy commented Mar 10, 2025

> Hi @lefebsy, let me know once the PR is ready for another look. As I said here, #389 (comment), this PR isn't blocked by other PRs.

@flyrain
So... it is ready for review :-)
(The Helm chart CI test is failing, but I do not understand the error message; otherwise I would have fixed it.)

@lefebsy lefebsy requested a review from flyrain March 10, 2025 18:59
@@ -62,6 +63,7 @@
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME)
@JsonSubTypes({
@JsonSubTypes.Type(value = AwsStorageConfigurationInfo.class),
@JsonSubTypes.Type(value = S3CompatibleStorageConfigurationInfo.class),
Contributor:

What is the reason to model non-AWS S3 storage as a separate config info class as opposed to adding a few optional settings to AwsStorageConfigurationInfo?

My understanding is that the same AWS SDK is used for the actual access anyway, so the storage config info basically goes into the same AWS client, right?

Author (lefebsy):

The question of modifying the current implementation arose at the beginning of the PR. The direction given was to propose a variant in order not to disrupt the existing implementation, and possibly unify in a second step.

So this PR follows those instructions: it avoids modifying the existing code and proposes a variant adapted to the S3 alternatives.

Contributor:

Fair enough... I'll review closer soon :)

S3CompatibleStorageConfigInfo s3ConfigModel =
(S3CompatibleStorageConfigInfo) storageConfigModel;
config =
new S3CompatibleStorageConfigurationInfo(
Member:

Don't we also need to pass a custom STS endpoint in order to initialize the STS client with a custom endpoint that points to MinIO, for example?

Author (lefebsy), Mar 11, 2025:

This implementation explicitly passes the S3 endpoint to the STS client.

Apart from AWS, which exposes AssumeRole, STS and IAM generally on a separate endpoint from the S3 APIs, other S3 providers seem to use the same endpoint.

In a mixed AWS + other approach, it would be necessary to specify the STS endpoint and the S3 endpoint explicitly.

In the non-AWS approach it is not mandatory; it is the same endpoint (MinIO, Ceph, Backblaze, Dell ECS, ...).
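To illustrate the point above, a hedged sketch (not this PR's code) of pointing an AWS SDK v2 STS client at the same S3-compatible endpoint and assuming a role with a session policy. The role ARN, session policy JSON, and key pair are placeholders:

import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;
import software.amazon.awssdk.services.sts.model.Credentials;

public class SubscopedCredentialsSketch {
  public static Credentials assumeScopedRole(String s3Endpoint, String roleArn,
                                             String accessKeyId, String secretAccessKey,
                                             String sessionPolicyJson) {
    // MinIO, Ceph, Dell ECS, ... typically serve STS on the same endpoint as S3,
    // so the S3 endpoint is reused here as the STS endpoint.
    StsClient sts = StsClient.builder()
        .endpointOverride(URI.create(s3Endpoint))
        .region(Region.US_EAST_1) // placeholder; most on-prem stores ignore the region
        .credentialsProvider(StaticCredentialsProvider.create(
            AwsBasicCredentials.create(accessKeyId, secretAccessKey)))
        .build();

    return sts.assumeRole(AssumeRoleRequest.builder()
            .roleArn(roleArn)
            .roleSessionName("polaris-subscoped")
            .policy(sessionPolicyJson) // dynamic policy limiting access to the queried data
            .build())
        .credentials();
  }
}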

Commenter:

Also, Backblaze (only one 'l'! 🙂) B2, and I'm guessing some other S3-compatible services, doesn't have an STS, so there's no point in specifying an endpoint.

Member:

got it. Thanks for the info :)

@@ -250,6 +268,21 @@ public Builder setStorageConfigurationInfo(
awsConfig.validateArn(awsConfigModel.getRoleArn());
config = awsConfig;
break;

case S3_COMPATIBLE:
Member:

I am still trying to figure out why we opted to introduce a new storage type S3_COMPATIBLE. IMO this should be agnostic; is the issue here enforcing the custom S3 endpoint and path style? (They are mandatory with S3-compatible storage.)

Author (lefebsy), Mar 11, 2025:

Correct.

To complete the response given to @dimas-b a few lines higher:

Before the security decision favoring vending only STS credentials, there were more options suited to on-prem ecosystem habits (for example, keys that do not expire quickly).

Since then, the difference between the AWS class and the S3-compatible class is smaller.

Member:

Thanks for the info. I am fine with it as it is, as long as we have a plan to unify it at some point.

# test is not recommended for production.
authenticator:
# -- Configures whether to enable the bootstrap metastore manager job
bootstrapMetastoreManager: false
Contributor (@adutra), Mar 11, 2025:

This change doesn't look right. You are removing extremely useful stuff from this values.yaml file, and the added content seems to come from an old state of the Helm chart.

Member:

@lefebsy maybe it would be easier if you take the Helm stuff out of this PR and have it as a follow-up. Wdyt? (So we don't prolong this PR further :))

Author (lefebsy):

OK, it's a total mistake on my part; I do not know when this occurred, probably during a rebase. I will revert this file to the main branch.
Sorry

@apache apache deleted a comment from sfc-gh-ygu Mar 11, 2025

flyrain commented Mar 11, 2025

Thanks a lot @lefebsy for working on it. I agree that we don't have to put the Helm change in, so that the GitHub pipeline can pass. We can always add it back as a follow-up PR. It should be ready to merge as long as we fix this: https://github.com/apache/polaris/pull/389/files#r1980741611.


lefebsy commented Mar 11, 2025

> Thanks a lot @lefebsy for working on it. I agree that we don't have to put the Helm change in, so that the GitHub pipeline can pass. We can always add it back as a follow-up PR. It should be ready to merge as long as we fix this: https://github.com/apache/polaris/pull/389/files#r1980741611.

Yes. I have restored the Helm chart from the main branch. I probably made a small test modification and forgot it. I'm really confused.

@omarsmak

> Thanks a lot @lefebsy for working on it. I agree that we don't have to put the Helm change in, so that the GitHub pipeline can pass. We can always add it back as a follow-up PR. It should be ready to merge as long as we fix this: https://github.com/apache/polaris/pull/389/files#r1980741611.
>
> Yes. I have restored the Helm chart from the main branch. I probably made a small test modification and forgot it. I'm really confused.

Thanks @lefebsy. The PR now looks a lot better from the Helm standpoint. We just need to address the last comment from @flyrain to get this PR going 🙂

@omarsmak

@lefebsy you have a conflict by the way

lefebsy and others added 3 commits March 12, 2025 20:52
# Conflicts:
#	polaris-core/src/main/java/org/apache/polaris/core/storage/PolarisCredentialProperty.java

lefebsy commented Mar 12, 2025

@omarsmak

=> Hello, the conflict should be resolved now that the branch has been rebased :-)

@omarsmak

> @omarsmak
>
> => Hello, the conflict should be resolved now that the branch has been rebased :-)

Thanks @lefebsy. Can you please address this last comment so we can get this PR wrapped up? https://github.com/apache/polaris/pull/389/files#r1980741611
