Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

object_store: support encoded path as input #3651

Closed
jychen7 opened this issue Feb 2, 2023 · 7 comments · Fixed by #3663
Closed

object_store: support encoded path as input #3651

jychen7 opened this issue Feb 2, 2023 · 7 comments · Fixed by #3663
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface

Comments

@jychen7
Copy link
Contributor

jychen7 commented Feb 2, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
while passing an percentage encoded Path to object store, it re-encode before get object.
e.g.

S3

utf8_percent_encode(path.as_ref(), &STRICT_PATH_ENCODE_SET)

GCS

percent_encoding::utf8_percent_encode(path.as_ref(), NON_ALPHANUMERIC);

a workaround is to pass decoded path from upstream application, even though upstream already receive encoded path.

e.g. roapi/roapi#98 (comment)

Describe the solution you'd like
ideally, path should just need encode once. One idea is to improve Path to include new attribute called 'percent_encoded'

pub struct Path {
/// The raw path with no leading or trailing delimiters
raw: String,
}

Describe alternatives you've considered
auto detect whether already percentage encode?

Additional context

@jychen7 jychen7 added the enhancement Any new improvement worthy of a entry in the changelog label Feb 2, 2023
@tustvold
Copy link
Contributor

tustvold commented Feb 2, 2023

Have you tried using Path::parse, this shouldn't re-encode

@jychen7
Copy link
Contributor Author

jychen7 commented Feb 2, 2023

@tustvold yes, I have tried.

I understand that Path::parse("blogs%20space.parquet") will NOT re-encode, the result is Path { raw: "blogs%20space.parquet" }.

However, when client GET object, it will re-encode at L246 -> L211 -> L459 below

pub async fn get_request(
&self,
path: &Path,
range: Option<Range<usize>>,
head: bool,
) -> Result<Response> {
use reqwest::header::RANGE;
let credential = self.get_credential().await?;
let url = self.config.path_url(path);

impl S3Config {
fn path_url(&self, path: &Path) -> String {
format!("{}/{}", self.bucket_endpoint, encode_path(path))
}
}

fn encode_path(path: &Path) -> PercentEncode<'_> {
utf8_percent_encode(path.as_ref(), &STRICT_PATH_ENCODE_SET)
}

I imagine L211 need to know whether input Path is already encoded, either via new attribute or auto detect

@tustvold
Copy link
Contributor

tustvold commented Feb 2, 2023

However, when client GET object, it will re-encode at L246 -> L211 -> L459 below

Yes, this is so that the path of the created object matches exactly the Path provided. So if you provide blogs%20space.parquet that is what will be created in object storage. This is the same as if you provided blogs%invalidescape.parquet.

Path::parse("blogs%20space.parquet")

What is percent encoding this input, if you call Path::parse("blogs space.parquet") everything should work? Is it possible something is receiving a URL and not decoding it properly, and just handing off the raw path?

@jychen7 jychen7 changed the title object_store: support encoded path object_store: support encoded path as input Feb 2, 2023
@jychen7
Copy link
Contributor Author

jychen7 commented Feb 2, 2023

if you call Path::parse("blogs space.parquet") everything should work?

yes, it work

What is percent encoding this input
Is it possible something is receiving a URL and not decoding it properly, and just handing off the raw path?

this is from ROAPI config, the user input can be already percent encoded. ROAPI currently decode it before passing to object_store. I am wondering whether object_store can support receiving encoded path, so from end to end, we don't need additional decode -> encode
More detail here: roapi/roapi#98 (comment)

@tustvold
Copy link
Contributor

tustvold commented Feb 2, 2023

can support receiving encoded path

Adding a Path::from_url_path that accepts a Url argument makes sense to me, I would be happy to review a PR that adds this 👍

This would likely still need to decode the path, in order to ensure path safety, but would save users that complexity.

FWIW overheads of encoding and signing the requests will dominate any additional latency resulting from this encoding dance, and both will be completely dominated by the latency of the store itself (typically in the 10s or 100s of milliseconds).

@jychen7
Copy link
Contributor Author

jychen7 commented Feb 5, 2023

I would be happy to review a PR that adds this

thank you, I have draft the PR here: #3663, waiting CI to start

@tustvold tustvold added the object-store Object Store Interface label Feb 10, 2023
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'object-store'} from #3663

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants