
Issues reading/writing to delta tables in S3 buckets with name containing dots (deltalake>=0.7.0) #1239

Closed
gray-sat opened this issue Mar 23, 2023 · 7 comments
Assignees: rtyler
Labels: binding/rust, bug, on-hold, storage/aws

Comments


gray-sat commented Mar 23, 2023

Environment

Delta-rs version: deltalake==0.8.1

Binding: Python

Environment: Local / Cloud

  • Cloud provider: AWS
  • OS: MacOS 12.2.1 / Ubuntu 20.04
  • Other:

Bug

What happened:

Please let me know if I have missed something glaringly obvious. It seems the read tests related to S3 passed for the 0.8.1 build on at least Python 3.10.10, so I'm not sure why I'm getting this locally.

I first attempted to read a delta table with deltalake==0.6.4 before finding that AWS_PROFILE was not supported in that version. When passing credentials in storage options, I have no issues reading delta tables in S3 with 0.6.4.

Our development flows often use AWS_PROFILE, and I noticed that 0.8.0+ supports that env var, so I upgraded to 0.8.1. Now, when trying to read a Delta table using either AWS_PROFILE or explicit AWS credentials, I am getting errors.

If I pass an S3 path like s3://<bucket-name>/<prefix-to-table>, I get

deltalake.PyDeltaTableError: Failed to read delta log object: Generic S3 error: URL did not match any known pattern for scheme: s3://<bucket-name>/<prefix-to-table>

I am able to list / download the contents of those paths with the aws cli / boto3 using the same credentials.
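
For reference, this is roughly the boto3 check that succeeds with the same credentials (a minimal sketch; the profile, bucket, and prefix names are placeholders):

import boto3

# Placeholders -- substitute the real profile, bucket, and prefix.
session = boto3.session.Session(profile_name="<some-profile>", region_name="us-east-1")
s3 = session.client("s3")

# Listing the table's _delta_log prefix works fine outside of deltalake.
resp = s3.list_objects_v2(Bucket="<bucket-name>", Prefix="<prefix-to-table>/_delta_log/")
for obj in resp.get("Contents", []):
    print(obj["Key"])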

When using the AWS_PROFILE env var, which points to the same credentials as tested above:

>>> DeltaTable("s3://<bucket-name>/<prefix-to-table>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../lib/python3.10/site-packages/deltalake/table.py", line 122, in __init__
    self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Failed to read delta log object: Generic S3 error: URL did not match any known pattern for scheme: s3://<bucket-name>/<prefix-to-table>

What you expected to happen:
The Delta table can be read from S3 without issue when proper credentials are exposed to deltalake.

How to reproduce it:
On an M1 Pro Mac, create a Python 3.10.10 or Python 3.8.13 virtualenv, then:

pip install deltalake==0.8.1
python
>>> from deltalake import DeltaTable
>>> DeltaTable("s3://<bucket>/<prefix>", storage_options={"AWS_ACCESS_KEY_ID": ACCESS_KEY, "AWS_SECRET_ACCESS_KEY":SECRET_KEY, "AWS_REGION": 'us-east-1'})
>>> exit()
export AWS_PROFILE=<some-profile>
python
>>> from deltalake import DeltaTable
>>> DeltaTable("s3://<bucket>/<prefix>", storage_options={"AWS_REGION": 'us-east-1'})
gray-sat added the bug label Mar 23, 2023

gray-sat commented Mar 23, 2023

After kicking myself and running a quick test... this is related to dots being present in the bucket name.

The default request style being used is path style, so in theory bucket names with dots should be supported. The error indicates that the issue lies somewhere in the RawDeltaTable constructor. I'm digging a bit more into the Rust code to see what has changed there -- admittedly I'm not familiar with Rust, so I'm not sure how the root-cause investigation will go. The change occurred between 0.6.4 and 0.7.0.
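
For background (general S3 behavior, not anything specific to delta-rs), the two addressing styles place the bucket name differently; a quick illustration with placeholder names:

bucket = "my.dotted.bucket"
key = "prefix-to-table/_delta_log/00000000000000000000.json"

# Path-style: the bucket is a path segment, so dots in the name are harmless.
path_style = f"https://s3.us-east-1.amazonaws.com/{bucket}/{key}"

# Virtual-hosted-style: the bucket becomes part of the hostname; dots add extra
# DNS labels, which breaks matching against the *.s3.us-east-1.amazonaws.com
# wildcard TLS certificate.
virtual_hosted_style = f"https://{bucket}.s3.us-east-1.amazonaws.com/{key}"

print(path_style)
print(virtual_hosted_style)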

gray-sat changed the title from "Issues reading/writing to delta tables in S3 using python bindings in deltalake==0.8.1" to "Issues reading/writing to delta tables in dotted S3 buckets in deltalake==0.8.1" Mar 23, 2023
gray-sat changed the title from "Issues reading/writing to delta tables in dotted S3 buckets in deltalake==0.8.1" to "Issues reading/writing to delta tables in S3 buckets with name containing dots (deltalake>=0.7.0)" Mar 23, 2023

gray-sat commented Mar 24, 2023

After a little digging, I think this can be traced to changes that went out as part of 8e0b8cb -- specifically the move from dedicated backends like S3StorageBackend to the generic DeltaObjectStore with ObjectStoreKind implementations.

Again, I had not used Rust before looking at this issue, so you'll have to excuse some ugly code. I downloaded the project to run the following code against the deltalake 0.8.0 crate:

use std::collections::HashMap;
use url::Url;
use deltalake::{DeltaResult, DeltaTableError};
use deltalake::storage::DeltaObjectStore;
use deltalake::storage::config::StorageOptions;

pub(crate) fn ensure_table_uri(table_uri: impl AsRef<str>) -> DeltaResult<Url> {
    let table_uri = table_uri.as_ref();
    if let Ok(path) = std::fs::canonicalize(table_uri) {
        return Url::from_directory_path(path)
            .map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()));
    }
    if let Ok(url) = Url::parse(table_uri) {
        return Ok(match url.scheme() {
            "file" => url,
            _ => {
                let mut new_url = url.clone();
                new_url.set_path(url.path().trim_end_matches('/'));
                new_url
            }
        });
    }
    // The table uri still might be a relative path that does not exist.
    std::fs::create_dir_all(table_uri)
        .map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()))?;
    let path = std::fs::canonicalize(table_uri)
        .map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()))?;
    Url::from_directory_path(path)
        .map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()))
}

fn main() {
    let mut mymap = HashMap::new();
    mymap.insert("AWS_REGION".to_string(), "us-east-1".to_string());
    // Bucket name containing only dashes: resolves and prints a root_uri just fine.
    let location = ensure_table_uri("s3://test-bucket/test-prefix").unwrap();
    let something = DeltaObjectStore::try_new(
        location,
        StorageOptions(mymap),
    );
    println!("{}", something.unwrap().root_uri());

    let mut mymap2 = HashMap::new();
    mymap2.insert("AWS_REGION".to_string(), "us-east-1".to_string());
    // Bucket name containing a dot: try_new returns an error and the unwrap panics.
    let location2 = ensure_table_uri("s3://test.bucket/test-prefix").unwrap();
    let something2 = DeltaObjectStore::try_new(
        location2,
        StorageOptions(mymap2),
    );
    println!("{}", something2.unwrap().root_uri());
}

Cargo.toml

[package]
name = "test"
version = "0.1.0"
rust-version = "1.64"
edition = "2021"

[dependencies]
deltalake = {version="0.8.0", features=["s3"]}
url = "2.3"

When running the above code, cargo run outputs:

rust-test % cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `target/debug/test`
s3://test-bucket/test-prefix
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ObjectStore { source: Generic { store: "S3", source: UrlNotRecognised { url: "s3://test.bucket/test-prefix" } } }', src/main.rs:51:31
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

In other words, it prints a root_uri just fine when the bucket name contains only dashes, but it errors out when the bucket name includes dots. This may be an issue in the object_store crate (https://github.com/apache/arrow-rs/tree/master/object_store) specifically, as I do think this should be supported.

rtyler self-assigned this Mar 25, 2023

rtyler commented Mar 25, 2023

This does look like an upstream bug in object_store; there is an incorrect assumption that bucket names cannot contain dots.

I have just validated that this is legal with AWS S3 in `us-east-2`.

rtyler added a commit to rtyler/arrow-rs that referenced this issue Mar 26, 2023
S3 bucket names can have dots in them, see [this
documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html)

> Bucket names can consist only of lowercase letters, numbers, dots (.), and hyphens (-).

This was originally reported in delta-io/delta-rs#1239 by @gray-sat
rtyler added the binding/rust, storage/aws, and on-hold labels Mar 26, 2023

rtyler commented Mar 26, 2023

I will close this issue once the upstream changes have been incorporated

tustvold pushed a commit to apache/arrow-rs that referenced this issue Mar 26, 2023
gray-sat commented

Wow -- super quick. Thank you.


rtyler commented Mar 27, 2023

The change has been merged upstream, but has not yet been released. There are some other API incompatibilities with the latest arrow that we'll be working through in #1249

LuaKT pushed a commit to LuaKT/object_store that referenced this issue May 7, 2023
rtyler closed this as completed Sep 15, 2023

rtyler commented Sep 15, 2023

This should have been closed a while ago; the changes have been released.
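
For anyone landing here later, a quick way to verify against a current release (placeholders as in the original report; assumes credentials are available via AWS_PROFILE or the environment):

from deltalake import DeltaTable

# Bucket name containing dots; <bucket.with.dots> and <prefix-to-table> are placeholders.
dt = DeltaTable(
    "s3://<bucket.with.dots>/<prefix-to-table>",
    storage_options={"AWS_REGION": "us-east-1"},
)
print(dt.version())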
