
Error reading partitioned table with slash in partition key #1228

Closed
kvedes opened this issue Mar 17, 2023 · 14 comments · Fixed by #1613
Labels
bug Something isn't working

Comments

@kvedes

kvedes commented Mar 17, 2023

Environment

Delta-rs version: 0.8.1

Binding: python

Environment:

  • Azure and locally

Bug

What happened:
I'm saving a partitioned table from Databricks (DBR 12.1) into Azure Data Lake Storage Gen2. The table is partitioned on a string column whose values contain slashes. When I try to read the table using deltalake from my local PC, I get an error saying there is 'No Body' in the response for the parquet file.

Traceback (most recent call last):
  File "[..]/delta_connect.py", line 32, in <module>
    table = dt.to_pandas(
  File "[..]/python3.10/site-packages/deltalake/table.py", line 418, in to_pandas
    return self.to_pyarrow_table(
  File "[..]/python3.10/site-packages/deltalake/table.py", line 400, in to_pyarrow_table
    return self.to_pyarrow_dataset(
  File "pyarrow/_dataset.pyx", line 369, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2818, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_open_input_file
  File "[..]/python3.10/site-packages/deltalake/fs.py", line 22, in open_input_file
    return pa.PythonFile(DeltaFileSystemHandler.open_input_file(self, path))
deltalake.PyDeltaTableError: Object at location partition_table_test/value=you%25252Fthere/part-00007-adfd81b1-1bff-4431-83c2-948e2274a892.c000.snappy.parquet not found: response error "No Body", after 0 retries: HTTP status client error (404 Not Found) for url [..]

If I query for a key without a slash it returns the data without issue.

What you expected to happen:
Successfully read the data.

How to reproduce it:
In Databricks, saving to ADLS Gen2:

df = spark.createDataFrame([
    (0, "hej"),
    (1, "you/there"),
    (2, "the two of us"),
], schema="id int, value string")
df.write.partitionBy("value").saveAsTable("mytable")

Running locally in Python (3.10) gives the above error:

from deltalake import DeltaTable

container_name = "my_container"
storage_account_name = "my_sa"
table_name = "my_table"
sas_token = "my_sas_token"

abfs_path = (
    f"abfs://{container_name}@{storage_account_name}.dfs.core.windows.net/{table_name}/"
)

dt = DeltaTable(
    abfs_path,
    storage_options={
        "sas_key": sas_token,
    },
)

table = dt.to_pandas(
    partitions=[("value", "=", "you/there")],
)

Using the other value "hej" works fine:

table = dt.to_pandas(
    partitions=[("value", "=", "hej")],
)

returns:

    id value
0   0   hej


kvedes added the bug label on Mar 17, 2023
@volker48

I'm hitting the same error with partitions that are time-based, like "2023-03-23 00:00:00". The paths in S3 are URL encoded, so they look like 2023-03-23 00%3A00%3A00 when viewed via aws s3 ls. So I don't think this is unique to having slashes in the partition key.
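
For illustration only (plain Python with urllib, not delta-rs internals), that S3 listing matches what you get when the colon is escaped but the space is left alone; the safe set here is just an assumption to mimic Hive-style partition paths:

from urllib.parse import quote

value = "2023-03-23 00:00:00"
# escape ':' but keep the space, as in the listing above
print(quote(value, safe=" "))  # 2023-03-23 00%3A00%3A00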

@dudleydhsa

dudleydhsa commented Mar 27, 2023

I have the same issue with timestamps as partitions. For example, the timestamp is encoded like this in the list of files in the DeltaTable object:
edl_extract_dt=2022-11-11 13%253A00%253A01/part-00000-2aa47137-6e20-4cd7-a66e-fc5b6762158d.c000.snappy.parquet
This matches the symlink manifest paths.

When the data is actually being read, it seems the path is double encoded: the % characters become encoded again as %25.
The path returned by the 'not found' error:
edl_extract_dt=2022-11-11 13%25253A00%25253A01/part-00000-2aa47137-6e20-4cd7-a66e-fc5b6762158d.c000.snappy.parquet

So it appears to be a double encoding issue.
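
A minimal sketch of that stacking encoding in plain Python, using urllib rather than delta-rs internals; the first, singly encoded form is assumed to be what Spark writes on disk:

from urllib.parse import quote

on_disk = quote("edl_extract_dt=2022-11-11 13:00:01", safe=" =")
# 'edl_extract_dt=2022-11-11 13%3A00%3A01'  -- assumed directory name on disk
listed = quote(on_disk, safe=" =")
# 'edl_extract_dt=2022-11-11 13%253A00%253A01'  -- as shown in the DeltaTable file list / manifest
errored = quote(listed, safe=" =")
# 'edl_extract_dt=2022-11-11 13%25253A00%25253A01'  -- the path in the 'not found' error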

@mrjoe7
Contributor

mrjoe7 commented May 1, 2023

This is caused by the path property being double encoded in ObjectMeta::try_from.

This is how the table data looks on a local drive (note: the file structure was created by pyspark):

> ls
_delta_log  gender=m%2Ff

Contents of JSON transaction log:

{"add":{"path":"gender=m%252Ff/part-00023-c9d275a0-a620-408b-889f-e4d7280dfad3.c000.snappy.parquet","partitionValues":{"gender":"m/f"},"size":511,"modificationTime":1682978128394,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"firstName\":\"Kristine\"},\"maxValues\":{\"firstName\":\"Kristine\"},\"nullCount\":{\"firstName\":0}}"}}

The path gets correctly URL-decoded and stored in the Add action (screenshot: Screenshot_20230502_000705), but it is then URL-encoded again when converted to a Path in ObjectMeta (screenshot: Screenshot_20230502_001313).
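
A rough sketch of that round trip in plain Python (the real conversion happens in Rust's ObjectMeta::try_from; this only mimics the effect with urllib):

from urllib.parse import quote, unquote

# Path exactly as written in the JSON transaction log above
log_path = "gender=m%252Ff/part-00023-c9d275a0-a620-408b-889f-e4d7280dfad3.c000.snappy.parquet"

# Decoding once yields the directory name that actually exists on disk
add_path = unquote(log_path)            # gender=m%2Ff/part-...

# Re-encoding it when building the object-store path breaks the lookup again
meta_path = quote(add_path, safe="/=")  # gender=m%252Ff/part-... (no longer matches)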

wjones127 pushed a commit that referenced this issue May 3, 2023
# Description
This PR fixes #1228.

# Related Issue(s)

-  #1228
@mattfysh

The release notes say a fix has landed in v0.9.0, but I've tested both v0.9.0 and v0.10 and I'm still seeing this issue with S3.

The smallest reproduction I could come up with:

from deltalake.fs import DeltaStorageHandler

# setup storage_options using boto3

store = DeltaStorageHandler(
    "s3://my_bucket/path/to/delta",
    storage_options,
)

# Works
store.open_input_file(
    "p=x/part-00000-3209f4d9-01c8-4e49-a01b-c5f9a6f61262.c000.snappy.parquet"
)

# None of these work
store.open_input_file("p=x%253Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x%3Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x:y/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")

@mattfysh

cc @mrjoe7 @wjones127

@mrjoe7
Contributor

mrjoe7 commented Jun 11, 2023

@mattfysh can you provide a listing of your bucket so we can see how exactly the file you are trying to read is named in S3?

I was able to successfully read the file using Rust in the following example:

Create and upload file to S3:

# create new empty bucket
awslocal s3api create-bucket --bucket delta
# create dummy file
echo 'test' > part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
# upload it to local S3
awslocal s3 cp part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet s3://delta/path/to/delta/p=x:y/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet

Read the file using Rust and verify the file contents:

use std::collections::HashMap;
use bytes::Bytes;
use object_store::ObjectStore;
use object_store::path::Path;
use url::Url;
use deltalake::storage::DeltaObjectStore;

#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), deltalake::errors::DeltaTableError> {
    let store_url = Url::parse("s3://delta/path/to/delta").unwrap();
    dbg!(store_url.path());
    let store = DeltaObjectStore::try_new(
        store_url,
        HashMap::from([
            (
                "AWS_ACCESS_KEY_ID".to_string(),
                "TESTACCESSKEY12345".to_string(),
            ),
            (
                "AWS_SECRET_ACCESS_KEY".to_string(),
                "ABCSECRETKEY".to_string(),
            ),
            ("AWS_REGION".to_string(), "us-east-1".to_string()),
            (
                "AWS_ENDPOINT_URL".to_string(),
                "http://localhost:4566".to_string(),
            ),
            ("AWS_STORAGE_ALLOW_HTTP".to_string(), "TRUE".to_string()),
        ]),
    )
        .unwrap();

    let store_get_res = store.get(&Path::from("p=x:y/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")).await.unwrap();
    let data_res = store_get_res.bytes().await.unwrap();
    assert_eq!(Bytes::from("test\n".to_owned()), data_res);

    Ok(())
}

@mattfysh

mattfysh commented Jun 11, 2023 via email

@mattfysh

mattfysh commented Jun 11, 2023

Ran some tests, and it looks like the partition column value being percent-encoded is what leads to the issue, but even in that case I'm at a loss to understand why test 5 is not working. Note that each file's "part number" changes based on the partition folder name.

In the meantime, I'll try to see why my Databricks job writes partition column values percent-encoded, but I'm inclined to believe it's doing so intentionally.

Setup:

BUCKET_NAME="delta-rs-read-test"
aws s3api create-bucket --bucket $BUCKET_NAME --region us-east-1
echo 'test' > stub.pq
aws s3 cp stub.pq "s3://$BUCKET_NAME/path/to/delta/p=x/part-00000-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet"
aws s3 cp stub.pq "s3://$BUCKET_NAME/path/to/delta/p=x:y/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet"
aws s3 cp stub.pq "s3://$BUCKET_NAME/path/to/delta/p=x%3Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet"

Tests:

store = DeltaStorageHandler("s3://delta-rs-read-test/path/to/delta", storage_options)
# Tests 1-2 work
store.open_input_file("p=x/part-00000-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x:y/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
# Tests 3-6 not working
store.open_input_file("p=x%3Ay/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x%253Ay/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
# Test 5 uses the same literal path that appears on S3, and should work given that 1 & 2 are both working
store.open_input_file("p=x%3Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")
store.open_input_file("p=x%253Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet")

@mattfysh

mattfysh commented Jul 9, 2023

Hey @mrjoe7 - any thoughts on why test 5 doesn't behave the same as tests 1 & 2? I am using the same literal path value present in S3.

@mrjoe7
Contributor

mrjoe7 commented Jul 9, 2023

Hi @mattfysh. It's because of how some characters are encoded by object_store, which delta-rs uses internally.

This is what each of your test paths will get encoded into (not including the /path/to/delta for better readability):

  1. p=x/part-00000-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  2. p=x:y/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  3. p=x%253Ay/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  4. p=x%25253Ay/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  5. p=x%253Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet
  6. p=x%25253Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet

As you can see, : is not encoded into %3A, but if you include a % character in the path it will be encoded into %25.
The consequence is that it's impossible to reach a file whose path contains %3A.
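
You can mimic that sanitization with plain Python to see why test 5 can never match (illustrative only; object_store does this in Rust, and the exact safe set here is an assumption):

from urllib.parse import quote

# ':' is an allowed path character, '%' is not, so it gets escaped to '%25'
print(quote("p=x:y/part-00001-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet", safe="/=:"))
# unchanged, which is why test 2 reaches the file under p=x:y/

print(quote("p=x%3Ay/part-00002-d384c434-9775-45ad-925e-434a0528b594.c000.snappy.parquet", safe="/=:"))
# p=x%253Ay/..., so test 5 can never address the p=x%3Ay/ key that exists on S3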

@mattfysh

Thanks @mrjoe7 - is there any way to change this behaviour? The default behaviour of Databricks (Spark/Hadoop) is to percent-encode these partition column values; I understand they do this to support Windows machines. Here is the Spark job I ran to produce the structure on S3:

# Define the data with specific rows
data = [
    (1, "part1", 100),
    (2, "part2:with_colon", 200),
    (3, "part3", 300),
]

# Define the column names
columns = ["id", "partition_column", "value"]

# Create a DataFrame with specific data
df = spark.createDataFrame(data, columns)

# Write the DataFrame to an S3 bucket in Delta format
s3_bucket_path = "s3a://mysandbox/deltabug"
df.write.format("delta").partitionBy("partition_column").save(s3_bucket_path)

Then when I try to query this table with delta-rs, it fails due to this encoding issue.

@mattfysh

It looks like there may be an option when using object_store to prevent it from re-encoding the percent sign: apache/arrow-rs#3651

I don't know Rust very well, but could this be surfaced as an option in delta-rs and the Python package?

@caseyrathbone

I am able to reproduce this with a simple local example using the Python library. It appears the space is double encoded, translating to %2520 instead of just %20.

import datetime
from deltalake import write_deltalake
import pyarrow as pa

data = pa.table({"data": pa.array(["mydata"]),
                 "inserted_at": pa.array([datetime.datetime.now()]),
                 "partition_column": pa.array(["hello world"])})

write_deltalake(
    table_or_uri="./unqueryable_table",
    mode="append",
    data=data,
    partition_by=["partition_column"],
)

Output

> tree unqueryable_table
unqueryable_table
├── _delta_log
│   └── 00000000000000000000.json
└── partition_column=hello%2520world
    └── 0-8a13288c-f252-43f8-9c2c-c38416e7296c-0.parquet

2 directories, 2 files
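
A read-back along these lines (the same DeltaTable API used earlier in this thread) is presumably what then fails against the doubly encoded directory; the filter value is just the one written above:

from deltalake import DeltaTable

# Read the table written above; the partition lookup resolves to the
# doubly encoded directory name and the parquet file is reported as not found.
dt = DeltaTable("./unqueryable_table")
df = dt.to_pandas(partitions=[("partition_column", "=", "hello world")])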

@ikstewa

ikstewa commented Sep 20, 2023

We're still encountering issues when trying to interface with Spark on Databricks. We see different partition formats from different clients/libraries.

Re-opened in: #1651
