Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

object_store: builder configuration api #3436

Merged
merged 10 commits into from
Jan 4, 2023

Conversation

roeap
Copy link
Contributor

@roeap roeap commented Jan 3, 2023

Which issue does this PR close?

Closes #3419.

Rationale for this change

See discussion in #3429

What changes are included in this PR?

This is just a quick draft based on some previous work in other crates, to get some feedback on the API. As an example I implemented a proposal for the azure builder, since one issue is most prominent with azure. Specifically, for azure there are many different commonly used keys vor configuration options and environment variables. Thus the implementation includes some handling for aliases. While I do see the need for this to be handled, I am unsure if this shuld be the responsibility of this crate.

Any feedback is great appreciated - @tustvold @crepererum @winding-lines

Are there any user-facing changes?

yes, a new way to provide configuration to object sore builders.

@github-actions github-actions bot added the object-store Object Store Interface label Jan 3, 2023
Copy link
Contributor

@winding-lines winding-lines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the direction of this. Can you outline a bit how this is exposed to python?

enum AzureConfigKey {
/// The name of the azure storage account
///
/// Supported keys:
Copy link
Contributor

@winding-lines winding-lines Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see now why a second layer is needed: you want to support multiple variants of the environment variables. This is ok with me, but supporting more than 2-3 variants for the same name will be confusing in the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually one of my main concern. On one hand there simply is quite a few variants out there people migt be used to, on the other hand, having many options for the same thing is confusing.

@winding-lines
Copy link
Contributor

Ah I see that you have this separate project here: https://github.com/roeap/object-store-python. I can start to play with this :)

I still need the Rust object_store configuration to be inert data that I can pass around.

UseEmulator,
}

static ALIAS_MAP: Lazy<HashMap<&'static str, AzureConfigKey>> = Lazy::new(|| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a HashMap noticeably faster than as simple match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not - main reason for a map is so we can iterate through it when reading from the env.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, we got rid of the maps now :).

@winding-lines
Copy link
Contributor

@roeap I like where this is going. I am mostly testing with AWS, so for me a good next step is to integrate the AWS side in this pola-rs/polars#5972

I plan to use this PR and your python companion once the work has progressed enough. I assume that once this is stable it will also be suitable for deltalake?

How do you want to collaborate on this?

  1. If you plan to push on this in the following days I can put on hold my PR.
  2. If you are focussing on other things I can refector the AWS support from deltalake in here

Let me know.

@roeap
Copy link
Contributor Author

roeap commented Jan 3, 2023

How do you want to collaborate on this?

@winding-lines, once we know the pattern, most of this is just leg-work, so best to just get it done 😆. Best case we can get this in before #3422.

@roeap roeap marked this pull request as ready for review January 3, 2023 20:49
/// Set an option on the builder via a key - value pair.
pub fn with_option(
mut self,
key: impl Into<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roeap thanks for the super quick turn around time on this one. This is pretty close to what I am thinking, I would like to suggest one more change?

As an user I would like to have my configuration fully checked by the compiler. So I would like to either be able to pass in one of the AmazonS3ConfigKeys as the key or have a helper function that returns one of the string representations. This way in my code I never have to type a string constant for the key.

Let me know what you think....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ported my PR to the code so far. I had to toggle back and forth between the 2 PRs to make sure I get these constants correctly.
https://github.com/pola-rs/polars/pull/5972/files#diff-28ad3830b06d49511094951431beea1edcf6a567f646250482e9f00f769ace27R16

This is error prone specially since you have a type safe Enum sitting right there...

Copy link
Contributor Author

@roeap roeap Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have some conversions implemented for the config key enums, so you should be able to pass in the variants directly, or constuct a map with the variants as keys. Is that what you are looking for here?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like where this is headed, I left some comments.

The main part I'm not certain about is how to propagate errors, ideally an unrecognised config key, or a config key containing an invalid value would return an error. To avoid breaking backwards compatibility, from_env could just log this error.

object_store/src/aws/mod.rs Outdated Show resolved Hide resolved
object_store/src/aws/mod.rs Outdated Show resolved Hide resolved
object_store/src/util.rs Outdated Show resolved Hide resolved
object_store/src/azure/mod.rs Outdated Show resolved Hide resolved
object_store/src/azure/mod.rs Outdated Show resolved Hide resolved
object_store/src/aws/mod.rs Outdated Show resolved Hide resolved
object_store/src/aws/mod.rs Outdated Show resolved Hide resolved
object_store/src/aws/mod.rs Outdated Show resolved Hide resolved
let aws_endpoint = "object_store:fake_endpoint".to_string();
let aws_session_token = "object_store:fake_session_token".to_string();
let options = HashMap::from([
(AmazonS3ConfigKey::AccessKeyId, aws_access_key_id.clone()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very ergonomic, many thanks 🎉

@@ -379,6 +383,170 @@ pub struct AmazonS3Builder {
client_options: ClientOptions,
}

/// Configuration keys for [`AmazonS3Builder`]
#[derive(PartialEq, Eq, Hash, Clone, Debug, Copy)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roeap this is very close, I think the last issue from my side is to derive serde::Serialize and Deserialize for the *ConfigKey enums.

Pretty please 🤗

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, serialize away 😆.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really nice, just some minor nits.

@winding-lines are you happy with this?

@@ -673,13 +872,33 @@ fn url_from_env(env_name: &str, default_url: &str) -> Result<Url> {
Ok(url)
}

fn split_sas(sas: &str) -> Result<Vec<(String, String)>, Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any tests of this logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now we do .. tested the refrerence key before that is was useful - albeit now exired :)


#[test]
fn azure_test_config_from_map() {
let azure_client_id = "object_store:fake_access_key_id".to_string();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think you could drop the to_string() and clone() from this test and it should work

@winding-lines
Copy link
Contributor

Looking really nice, just some minor nits.

@winding-lines are you happy with this?

Yes, very happy and grateful to both of you 🙇

@tustvold tustvold merged commit 28a04db into apache:master Jan 4, 2023
@ursabot
Copy link

ursabot commented Jan 4, 2023

Benchmark runs are scheduled for baseline = 3213ef5 and contender = 28a04db. 28a04db is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Derive Clone for the builders in object-store.
4 participants